
Part II

Multiple Linear Regression


Chapter 7
Multiple Regression
A multiple linear regression model is a linear model that describes
how a y-variable relates to two or more x-variables (or transformations of
x-variables).
For example, suppose that a researcher is studying factors that might affect systolic blood pressures for women aged 45 to 65 years old. The response
variable is systolic blood pressure (Y ). Suppose that two predictor variables
of interest are age (X1 ) and body mass index (X2 ). The general structure of
a multiple linear regression model for this situation would be
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.
The equation \beta_0 + \beta_1 X_1 + \beta_2 X_2 describes the mean value of blood pressure for specific values of age and BMI.
The error term (\epsilon) describes how individual values of blood pressure differ from their expected values of blood pressure.
One note concerning terminology. A linear model is one that is linear in
the beta coefficients, meaning that each beta coefficient simply multiplies an
x-variable or a transformation of an x-variable. For instance, y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon is called a multiple linear regression model even though it describes a quadratic, curved relationship between y and a single x-variable.
7.1 About the Model

Notation for the Population Model
A population model for a multiple regression model that relates a y-variable to p - 1 predictor variables is written as

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i.    (7.1)

We assume that the \epsilon_i have a normal distribution with mean 0 and constant variance \sigma^2. These are the same assumptions that we used in simple regression with one x-variable.
The subscript i refers to the ith individual or unit in the population. In
the notation for the x-variables, the subscript following i simply denotes
which x-variable it is.
Estimates of the Model Parameters
The estimates of the coefficients are the values that minimize the
sum of squared errors for the sample. The exact formula for this will
be given in the next chapter when we introduce matrix notation.
The letter b is used to represent a sample estimate of a \beta coefficient. Thus b_0 is the sample estimate of \beta_0, b_1 is the sample estimate of \beta_1, and so on.
MSE = SSE/(n - p) estimates \sigma^2, the variance of the errors. In the formula, n = sample size, p = number of \beta coefficients in the model, and SSE = sum of squared errors. Notice that for simple linear regression p = 2. Thus, we get the formula for MSE that we introduced in the context of one predictor.
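A minimal sketch of these quantities in R, assuming a fitted model object fit returned by lm() (not shown in the text):

sse <- sum(residuals(fit)^2)    # SSE, also available as deviance(fit)
mse <- sse / df.residual(fit)   # MSE = SSE/(n - p), estimates the error variance
sqrt(mse)                       # matches the 'Residual standard error' reported by summary(fit)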

In the case of two predictors, the estimated regression equation yields


a plane (as opposed to a line in the simple linear regression setting).
For more than two predictors, the estimated regression equation yields
a hyperplane.
Predicted Values and Residuals


A predicted value is calculated as \hat{y}_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \cdots + b_{p-1} x_{i,p-1}, where the b values come from statistical software and the x-values are specified by us.
A residual (error) term is calculated as e_i = y_i - \hat{y}_i, the difference between an actual and a predicted value of y.
A plot of residuals versus predicted values ideally should resemble a horizontal random band. Departures from this form indicate difficulties with the model and/or data.
Other residual analyses can be done exactly as we did in simple regression. For instance, we might wish to examine a normal probability
plot (NPP) of the residuals. Additional plots to consider are plots of
residuals versus each x-variable separately. This might help us identify
sources of curvature or nonconstant variance.
Interaction Terms
An interaction term represents a coupling or combined effect of two or more independent variables.
Suppose we have a response variable (Y ) and two predictors (X1 and
X2 ). Then, the regression model with an interaction term is written as
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.
Suppose you also have a third predictor (X3 ). Then, the regression
model with all interaction terms is written as
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_2 + \beta_5 X_1 X_3 + \beta_6 X_2 X_3 + \beta_7 X_1 X_2 X_3 + \epsilon.
In a model with more predictors, you can imagine how much the model
grows by adding interactions. Just make sure that you have enough
observations to cover the degrees of freedom used in estimating the
corresponding regression coefficients!
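In R's formula notation, a model with all such interactions can be requested compactly; a minimal sketch (the data frame mydata and variable names are placeholders):

fit_int <- lm(y ~ x1 * x2 * x3, data = mydata)   # expands to all main effects plus all two- and three-way interactions
summary(fit_int)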
For each observation, the value of the interaction term is found by multiplying the recorded values of the predictor variables in the interaction.
In models with interaction terms, the significance of the interaction
term should always be assessed first before proceeding with significance
testing of the main variables.
If one of the main variables is removed from the model, then the model
should not include any interaction terms involving that variable.

7.2 Significance Testing of Each Variable
Within a multiple regression model, we may want to know whether a particular x-variable is making a useful contribution to the model. That is,
given the presence of the other x-variables in the model, does a particular
x-variable help us predict or explain the y-variable? For instance, suppose
that we have three x-variables in the model. The general structure of the
model could be
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon.    (7.2)
As an example, to determine whether variable X1 is a useful predictor variable
in this model, we could test
H_0: \beta_1 = 0
H_A: \beta_1 \neq 0.
If the null hypothesis above were the case, then a change in the value of
X1 would not change Y , so Y and X1 are not related. Also, we would still be
left with variables X2 and X3 being present in the model. When we cannot
reject the null hypothesis above, we should say that we do not need variable
X1 in the model given that variables X2 and X3 will remain in the model.
In general, the interpretation of a slope in multiple regression can be tricky.
Correlations among the predictors can change the slope values dramatically
from what they would be in separate simple regressions.
To carry out the test, statistical software will report p-values for all coefficients in the model. Each p-value will be based on a t-statistic calculated
as
t = (sample coefficient - hypothesized value) / standard error of coefficient.
For our example above, the t-statistic is:

t = \frac{b_1 - 0}{s.e.(b_1)} = \frac{b_1}{s.e.(b_1)}.

Note that the hypothesized value is usually just 0, so this portion of the
formula is often omitted.
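A sketch of how these pieces can be pulled from a fitted model in R (again assuming a fitted object fit):

ctab <- coef(summary(fit))                   # columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab[, "Estimate"] / ctab[, "Std. Error"]    # reproduces the reported t-statistics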

7.3 Examples

Example 1: Heat Flux Data Set
The data are from n = 29 homes used to test solar thermal energy. The
variables of interest for our model are y = total heat flux, and x1 , x2 , and
x3 , which are the focal points for the east, north, and south directions, respectively. There are two other measurements in this data set: another
measurement of the focal points and the time of day. We will not utilize
these predictors at this time. Table 7.1 gives the data used for this analysis.
The regression model of interest is
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \epsilon_i.
Figure 7.1(a) gives a histogram of the residuals. While the shape is not completely bell-shaped, it again is not suggestive of any severe departures from
normality. Figure 7.1(b) gives a plot of the residuals versus the fitted values. Again, the values appear to be randomly scattered about 0, suggesting
constant variance.
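A sketch of how such a fit might be specified in R (the data frame heatflux and the column names flux, east, north, and south are assumptions based on the output below):

fit <- lm(flux ~ east + north + south, data = heatflux)
summary(fit)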
The following provides the t-tests for the individual regression coefficients:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 389.1659    66.0937   5.888 3.83e-06 ***
east          2.1247     1.2145   1.750   0.0925 .
north       -24.1324     1.8685 -12.915 1.46e-12 ***
south         5.3185     0.9629   5.523 9.69e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.598 on 25 degrees of freedom


Figure 7.1: (a) Histogram of the residuals for the heat flux data set. (b) Plot of the residuals versus the fitted values.

Multiple R-Squared: 0.8741,    Adjusted R-squared: 0.859
F-statistic: 57.87 on 3 and 25 DF, p-value: 2.167e-11
##########
At the \alpha = 0.05 significance level, both north and south appear to be statistically significant predictors of heat flux. However, east is not (with a p-value
of 0.0925). While we could claim this is a marginally significant predictor,
we will rerun the analysis by dropping the east predictor.
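One way to refit without east is to update the earlier (assumed) fitted object; a sketch:

fit2 <- update(fit, . ~ . - east)   # drop east, keep north and south
summary(fit2)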
The following provides the t-tests for the individual regression coefficients
for the newly suggested model:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 483.6703    39.5671  12.224 2.78e-12 ***
north       -24.2150     1.9405 -12.479 1.75e-12 ***
south         4.7963     0.9511   5.043 3.00e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.932 on 26 degrees of freedom


Multiple R-Squared: 0.8587,    Adjusted R-squared: 0.8478
F-statistic: 79.01 on 2 and 26 DF, p-value: 8.938e-12
##########
The residual plots still appear okay (they are not included here) and we
obtain new estimates for our model (in the above). Some things to note from
this final analysis are:
The final sample multiple regression equation is
\hat{y}_i = 483.67 - 24.22 x_{i,2} + 4.80 x_{i,3}.
To use this equation for prediction, we substitute specified values for
the two directions (i.e., north and south).
We can interpret the slopes in the same way that we do for a straightline model, but we have to add the constraint that values of other
variables remain constant.
When the south position is held constant, the average flux temperature for a home decreases by 24.22 degrees for each 1 unit
increase in the north position.
When the north position is held constant, the average flux temperature for a home increases by 4.80 degrees for each 1 unit increase
in the south position.
The value of R^2 = 0.8587 means that the model (the two x-variables) explains 85.87% of the observed variation in a home's flux temperature.

The value \sqrt{MSE} = 8.9321 is the estimated standard deviation of the residuals. Roughly, it is the average absolute size of a residual.
Example 2: Kola Project Data Set
The Kola Project ran from 1993-1998 and involved extensive geological surveys of Finland, Norway, and Russia. The entire published data set consists
of over 600 observations measured on 111 variables. Table 7.2 provides merely
a subset of this data for three variables. The data is subsetted on the LITO
variable for counts of 1. The sample size of this subset is n = 131.
The investigators are interested in modeling the geological composition
variable Cr INAA as a function of Cr and Co. A scatterplot of this data with
the least squares plane is provided in Figure 7.2. In this 3D plot, observations
above the plane (i.e., observations with positive residuals) are given by green
points and observations below the plane (i.e., observations with negative
residuals) are given by red points. The output for fitting a multiple linear
regression model to this data is below:
Residuals:
    Min      1Q  Median      3Q     Max
-149.95  -34.42  -14.74   11.58  560.38

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  53.3483    11.6908   4.563 1.17e-05 ***
Cr            1.8577     0.2324   7.994 6.66e-13 ***
Co            2.1808     1.7530   1.244    0.216
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 74.76 on 128 degrees of freedom
Multiple R-squared: 0.544,     Adjusted R-squared: 0.5369
F-statistic: 76.36 on 2 and 128 DF, p-value: < 2.2e-16
Note that Co is found to be not statistically significant. However, the scatterplot in Figure 7.2 clearly shows that the data is skewed to the right for
each of the variables (i.e., the bulk of the data is clustered near the lower-end
of values for each variable while there are fewer values as you increase along
a given axis). In fact, a plot of the standardized residuals against the fitted
values (Figure 7.3) indicates that a transformation is needed.
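A sketch of how the raw fit and a standardized-residual plot like Figure 7.3 might be produced in R (the data frame kola and the column names Cr_INAA, Cr, and Co are assumptions):

fit_raw <- lm(Cr_INAA ~ Cr + Co, data = kola)
plot(fitted(fit_raw), rstandard(fit_raw),
     xlab = "Fitted Values", ylab = "Standardized Residuals")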
Since the data appears skewed to the right for each of the variables, a
log transformation on Cr INAA, Cr, and Co will be taken. The scatterplot
in Figure 7.4 shows the results from this transformation along with the
new least squares plane. Clearly, the transformation has done a better job
linearizing the relationship. The output for fitting a multiple linear regression
model to this transformed data is below:
Residuals:
    Min      1Q  Median      3Q     Max
-0.8181 -0.2443 -0.0667  0.1748  1.3401
Figure 7.2: 3D scatterplot of the Kola data set with the least squares plane.

Figure 7.3: The standardized residuals versus the fitted values for the raw Kola data set.

Figure 7.4: 3D scatterplot of the Kola data set where the logarithm of each
variable has been taken.

Standardized Residuals

4.0

4.5

5.0

5.5

6.0

Fitted Values

Figure 7.5: The standardized residuals versus the fitted values for the logtransformed Kola data set.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.65109    0.17630  15.037  < 2e-16 ***
ln_Cr        0.57873    0.08415   6.877 2.42e-10 ***
ln_Co        0.08587    0.09639   0.891    0.375
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3784 on 128 degrees of freedom
Multiple R-squared: 0.5732,    Adjusted R-squared: 0.5665
F-statistic: 85.94 on 2 and 128 DF, p-value: < 2.2e-16
There is also a noted improvement in the plot of the standardized residuals
versus the fitted values (Figure 7.5). Notice that the log transformation of
Co is not statistically significant as it has a high p-value (0.375).
After omitting the log transformation of Co from our analysis, a simple
linear regression model is fit to the data. Figure 7.6 provides a scatterplot
of the data and a plot of the standardized residuals against the fitted values.
These plots, combined with the following simple linear regression output,
indicate a highly statistically significant relationship between the log transformation of Cr INAA and the log transformation of Cr.
Residuals:
     Min       1Q   Median       3Q      Max
-0.85999 -0.24113 -0.05484  0.17339  1.38702

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.60459    0.16826   15.48   <2e-16 ***
ln_Cr        0.63974    0.04887   13.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3781 on 129 degrees of freedom
Multiple R-squared: 0.5705,    Adjusted R-squared: 0.5672
F-statistic: 171.4 on 1 and 129 DF, p-value: < 2.2e-16

Figure 7.6: (a) Scatterplot of the Kola data set where the logarithm of Cr INAA has been regressed on the logarithm of Cr. (b) Plot of the standardized residuals for this simple linear regression fit.

 i   Flux   Insolation   East    South   North   Time
 1   271.8    783.35     33.53   40.55   16.66   13.20
 2   264.0    748.45     36.50   36.19   16.46   14.11
 3   238.8    684.45     34.66   37.31   17.66   15.68
 4   230.7    827.80     33.13   32.52   17.50   10.53
 5   251.6    860.45     35.75   33.71   16.40   11.00
 6   257.9    875.15     34.46   34.14   16.28   11.31
 7   263.9    909.45     34.60   34.85   16.06   11.96
 8   266.5    905.55     35.38   35.89   15.93   12.58
 9   229.1    756.00     35.85   33.53   16.60   10.66
10   239.3    769.35     35.68   33.79   16.41   10.85
11   258.0    793.50     35.35   34.72   16.17   11.41
12   257.6    801.65     35.04   35.22   15.92   11.91
13   267.3    819.65     34.07   36.50   16.04   12.85
14   267.0    808.55     32.20   37.60   16.19   13.58
15   259.6    774.95     34.32   37.89   16.62   14.21
16   240.4    711.85     31.08   37.71   17.37   15.56
17   227.2    694.85     35.73   37.00   18.12   15.83
18   196.0    638.10     34.11   36.76   18.53   16.41
19   278.7    774.55     34.79   34.62   15.54   13.10
20   272.3    757.90     35.77   35.40   15.70   13.63
21   267.4    753.35     36.44   35.96   16.45   14.51
22   254.5    704.70     37.82   36.26   17.62   15.38
23   224.7    666.80     35.07   36.34   18.12   16.10
24   181.5    568.55     35.26   35.90   19.05   16.73
25   227.5    653.10     35.56   31.84   16.51   10.58
26   253.6    704.05     35.73   33.16   16.02   11.28
27   263.0    709.60     36.46   33.83   15.89   11.91
28   265.8    726.90     36.26   34.89   15.83   12.65
29   263.8    697.15     37.20   36.27   16.71   14.06

Table 7.1: The heat flux for homes data.

X1     X2    Y   |  X1     X2    Y   |  X1     X2    Y   |  X1     X2    Y
40.9   6.2   300 |  71.4   11.8  200 |  52.1   7.7   140 |  21.3   4.2   110
60.7   10.5  270 |  66.3   9     230 |  18     6.1   75  |  73.5   8.5   210
29.6   10.1  140 |  99.1   16.1  220 |  23.7   9.3   54  |  80.1   18.8  170
40.9   8.7   120 |  18.6   3.6   93  |  37.7   6.1   110 |  75.4   16.6  790
27.8   5.2   240 |  30.5   8.6   140 |  16.1   2.7   68  |  32.3   8.7   100
23.5   3.8   110 |  28.9   5.2   130 |  40     5.5   100 |  19.4   3.9   62
16.6   15.2  64  |  23     5.3   120 |  38.4   14.4  100 |  20     5.5   300
29.2   5.1   92  |  44.2   9.1   120 |  23.7   4.9   90  |  48.3   8.2   110
6.9    2     37  |  27.7   8     100 |  16.4   5.4   82  |  40     6.7   120
57.1   7.8   250 |  18.1   4.7   100 |  13.4   2.5   100 |  22.5   2.4   95
50     9.7   190 |  10     3.3   87  |  24     4.1   93  |  31.8   14.7  180
129    30.7  210 |  16.3   3.5   83  |  28.8   15.2  110 |  17.1   6.2   180
106    13.7  220 |  31.6   8.7   130 |  18.4   3.8   63  |  10.9   2     50
36.5   7.3   81  |  19.5   6.8   90  |  9.4    3.3   47  |  30.6   7.3   110
66.6   10.5  170 |  25.2   5.5   110 |  5.9    1.6   86  |  52.6   11.4  130
37.2   9.6   120 |  13.6   2.9   170 |  83.8   16.2  160 |  53.9   11.5  210
42     9.2   120 |  23.2   6.4   88  |  280    25.2  640 |  88.6   15.5  320
17.5   3.9   64  |  41.6   10.1  150 |  21.9   5     62  |  25.6   6.8   69
67.1   8.6   120 |  18     4     120 |  18.5   3.5   92  |  18.9   3.1   110
10.6   2.5   69  |  37.4   8.7   97  |  26.7   5.1   170 |  16.7   4.5   84
11.3   2     27  |  32.3   7.2   97  |  50.2   7     340 |  19.6   5.6   86
44     14.1  130 |  217    10    400 |  30.9   10    120 |  15.1   4.2   110
34.1   12.1  240 |  16.7   5     49  |  25.5   5.5   61  |  25.1   7     150
29.3   5.2   87  |  29.8   7.8   160 |  21.4   3.9   140 |  8.4    2     34
49.2   14    180 |  15.5   2.5   70  |  32.3   6.3   220 |  25.4   6.9   140
118    18.3  330 |  14     3     68  |  31.9   3.7   110 |  18.7   4.4   72
10.8   3.4   78  |  30     6.9   120 |  28.7   6.7   120 |  21.6   7.6   110
59.3   9.6   300 |  21.9   9     390 |  36.2   5.3   94  |  24     5.2   110
20     3.8   110 |  33     5.6   71  |  45.2   3.8   99  |  19.3   4.3   130
24.4   5.4   120 |  30.8   6.5   110 |  16.3   5.4   59  |  243    24.1  590
28.6   6.6   76  |  55.5   11.5  130 |  50     7.5   130 |  9.6    2.4   47
37.9   30.3  130 |  25.9   6.9   96  |  15.8   4.8   79  |  36.4   6     110
10.2   4.7   46  |  16.5   4.7   63  |  19.3   3.5   54  |

Table 7.2: The subset of the Kola data. Here X1, X2, and Y are the variables Cr, Co, and Cr INAA, respectively.

Chapter 8
Matrix Notation in Regression
There are two main reasons for using matrices in regression. First, the notation simplifies the writing of the model. Secondly, and most importantly,
matrix formulas provide the means by which statistical software calculates
the estimated coefficients and their standard errors, as well as the set of predicted values for the observed sample. If necessary, a review of matrices and
some of their basic properties can be found in Appendix B.

8.1 Matrices and Regression

In matrix notation, the theoretical regression model for the population is written as

Y = X\beta + \epsilon.
The four different items in the equation are:
1. Y is an n-dimensional column vector that vertically lists the y values:

   Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.
2. The X matrix is a matrix in which each row gives the x-variable data
for a different observation. The first column equals 1 for all observations
(unless doing a regression through the origin), and each column after
the first gives the data for a different variable. There is a column
for each variable, including any added interactions, transformations,
indicators, and so on. The abstract formulation is:

   X = \begin{pmatrix} 1 & X_{1,1} & \cdots & X_{1,p-1} \\ 1 & X_{2,1} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \cdots & X_{n,p-1} \end{pmatrix}.
In the subscripting, the first value is the observation number and the
second number is the variable number. The first column is always a
column of 1s. The X matrix has n rows and p columns.

3. \beta is a p-dimensional column vector listing the beta coefficients:

   \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.

   Notice the subscripting for the numbering of the \beta's. As an example, for simple linear regression, \beta = (\beta_0\ \beta_1)^T. The \beta vector will contain symbols, not numbers, as it gives the population parameters.
4. \epsilon is an n-dimensional column vector listing the errors:

   \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.

   Again, we will not have numerical values for the \epsilon vector.
As an example, suppose that data for a y-variable and two x-variables is
as given in Table 8.1. For the model
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1} x_{i,2} + \epsilon_i,

the matrices Y, X, \beta, and \epsilon are as follows:

y_i      6   5  10  12  14  18
x_{i,1}  1   1   3   5   3   5
x_{i,2}  1   2   1   1   2   2

Table 8.1: A sample data set.

Y = \begin{pmatrix} 6 \\ 5 \\ 10 \\ 12 \\ 14 \\ 18 \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 2 & 2 \\ 1 & 3 & 1 & 3 \\ 1 & 5 & 1 & 5 \\ 1 & 3 & 2 & 6 \\ 1 & 5 & 2 & 10 \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}, \quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}.
1. Notice that the first column of the X matrix equals 1 for all rows
(observations), the second column gives the values of x_{i,1}, the third column lists the values of x_{i,2}, and the fourth column gives the values of the interaction terms x_{i,1} x_{i,2}.
2. For the theoretical model, we do not know the values of the beta coefficients or the errors. In those two matrices (column vectors) we can
only list the symbols for these items.
3. There is a slight abuse of notation that occurs here which often happens
when writing regression models in matrix notation. I stated earlier how
capital letters are reserved for random variables and lower case letters
are reserved for realizations. In this example, capital letters have been
used for the realizations. There should be no misunderstanding as
it will usually be clear if we are in the context of discussing random
variables or their realizations.
Finally, using Calculus rules for matrices, it can be derived that the ordinary least squares estimates of the coefficients are calculated using the
matrix formula
b = (X^T X)^{-1} X^T y,
which minimizes the sum of squared errors

\|e\|^2 = e^T e = (Y - \hat{Y})^T (Y - \hat{Y}) = (Y - Xb)^T (Y - Xb),

where b = (b_0\ b_1\ \ldots\ b_{p-1})^T. As in the simple linear regression case, these regression coefficient estimators are unbiased (i.e., E(b) = \beta). The formula above is used by statistical software to calculate values of the sample coefficients.
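A minimal sketch of this calculation in R, using the sample data of Table 8.1 and the interaction model above:

y  <- c(6, 5, 10, 12, 14, 18)
x1 <- c(1, 1, 3, 5, 3, 5)
x2 <- c(1, 2, 1, 1, 2, 2)
X  <- cbind(1, x1, x2, x1 * x2)           # design matrix with a column of 1s
b  <- solve(t(X) %*% X) %*% t(X) %*% y    # b = (X'X)^{-1} X'y
b                                         # agrees with coef(lm(y ~ x1 * x2))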
An important theorem in regression analysis (and Statistics in general)
is the Gauss-Markov Theorem, which we alluded to earlier. Since we
have the proper matrix notation in place, we will now formalize this very
important result.
Theorem 1 (Gauss-Markov Theorem) Suppose that we have the linear regression model

Y = X\beta + \epsilon,

where E(\epsilon_i | X) = 0 and Var(\epsilon_i | X) = \sigma^2 for all i = 1, \ldots, n. Then

\hat{\beta} = b = (X^T X)^{-1} X^T Y

is an unbiased estimator of \beta and has the smallest variance of all other unbiased estimators of \beta.
Any estimator which is unbiased and has smaller variance than any other unbiased estimator is called a best linear unbiased estimator, or BLUE.
An important note regarding the matrix expressions introduced above is
that
\hat{Y} = Xb = X(X^T X)^{-1} X^T Y = HY

and

e = Y - \hat{Y} = Y - HY = (I_{n \times n} - H)Y,
where H = X(X^T X)^{-1} X^T is the n \times n hat matrix. H is important for several reasons as it appears often in regression formulas. One important implication of H is that it is a projection matrix, meaning that it projects the response vector, Y, as a linear combination of the columns of the X matrix in order to obtain the vector of fitted values, \hat{Y}. Also, the diagonal of this matrix contains the h_{j,j} values we introduced earlier in the context of Studentized residuals, which is important when discussing leverage.
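Continuing the small example above, a sketch of the hat matrix and the quantities it produces (for a fitted lm object, the leverages are also returned by hatvalues()):

H    <- X %*% solve(t(X) %*% X) %*% t(X)   # n x n hat matrix
yhat <- H %*% y                            # fitted values, same as X %*% b
e    <- y - yhat                           # residuals, (I - H) %*% y
diag(H)                                    # leverages h_{j,j}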

8.2 Variance-Covariance Matrix of b

Two important characteristics of the sample multiple regression coefficients are their standard errors and their correlations with each other. The variance-covariance matrix of the sample coefficients b is a symmetric p \times p square matrix. Remember that p is the number of beta coefficients in the model (including the intercept).
The rows and the columns of the variance-covariance matrix are in coefficient order (first row is information about b0 , second is about b1 , and so
on).
The diagonal values (from top left to bottom right) are the variances
of the sample coefficients (written as Var(bi )). The standard error of a
coefficient is the square root of its variance.
An off-diagonal value is the covariance between two coefficient estimates (written as Cov(bi , bj )).
The correlation between two coefficient estimates can be determined
using the following relationship: correlation = covariance divided by
product of standard deviations (written as Corr(bi , bj )).
In regression, the theoretical variance-covariance matrix of the sample coefficients is

V(b) = \sigma^2 (X^T X)^{-1}.

Recall that MSE estimates \sigma^2, so the estimated variance-covariance matrix of the sample beta coefficients is calculated as

\hat{V}(b) = MSE\,(X^T X)^{-1}.
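In R this estimated matrix is what vcov() returns for a fitted lm object; a sketch of the equivalence (fit is an assumed fitted object):

Vb  <- vcov(fit)                      # estimated variance-covariance matrix of b
mse <- summary(fit)$sigma^2           # MSE
all.equal(Vb, mse * solve(crossprod(model.matrix(fit))))   # MSE (X'X)^{-1}
sqrt(diag(Vb))                        # standard errors of the coefficients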
100(1 - \alpha)% confidence intervals are also readily available for \beta:

b_j \pm t_{n-p; 1-\alpha/2} \sqrt{\hat{V}(b)_{j,j}},

where \hat{V}(b)_{j,j} is the j-th diagonal element of the estimated variance-covariance matrix of the sample beta coefficients (so that its square root is the estimated standard error). Furthermore, the Bonferroni joint 100(1 - \alpha)% confidence intervals are:

b_j \pm t_{n-p; 1-\alpha/(2p)} \sqrt{\hat{V}(b)_{j,j}},

for j = 0, 1, 2, \ldots, (p - 1).
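These individual intervals are what confint() reports for a fitted lm object; a sketch:

confint(fit, level = 0.95)   # b_j +/- t_{n-p; 0.975} * s.e.(b_j) for each coefficient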

8.3 Statistical Intervals

The statistical intervals for estimating the mean or predicting new observations in the simple linear regression case can easily extend to the multiple regression case. Here, it is only necessary to present the formulas.
First, let us define the vector of given predictors as

X_h = \begin{pmatrix} 1 \\ X_{h,1} \\ X_{h,2} \\ \vdots \\ X_{h,p-1} \end{pmatrix}.
We are interested in either intervals for E(Y |X = Xh ) or intervals for the
value of a new response y given that the observation has the particular value
Xh . First we define the standard error of the fit at Xh given by:
s.e.(\hat{Y}_h) = \sqrt{MSE\,(X_h^T (X^T X)^{-1} X_h)}.
Now, we can give the formulas for the various intervals:
100(1 - \alpha)% Confidence Interval:

\hat{y}_h \pm t_{n-p; 1-\alpha/2}\, s.e.(\hat{y}_h).
Bonferroni Joint 100(1 - \alpha)% Confidence Intervals:

\hat{y}_{h_i} \pm t_{n-p; 1-\alpha/(2q)}\, s.e.(\hat{y}_{h_i}),

for i = 1, 2, \ldots, q.
100(1 - \alpha)% Working-Hotelling Confidence Band:

\hat{y}_h \pm \sqrt{p F_{p, n-p; 1-\alpha}}\, s.e.(\hat{y}_h).
100(1 - \alpha)% Prediction Interval:

\hat{y}_h \pm t_{n-p; 1-\alpha/2} \sqrt{MSE/m + [s.e.(\hat{y}_h)]^2},

where m = 1 corresponds to a prediction interval for a new observation at a given X_h and m > 1 corresponds to the mean of m new observations calculated at the same X_h.
Bonferroni Joint 100(1 - \alpha)% Prediction Intervals:

\hat{y}_{h_i} \pm t_{n-p; 1-\alpha/(2q)} \sqrt{MSE + [s.e.(\hat{y}_{h_i})]^2},

for i = 1, 2, \ldots, q.
Scheffé Joint 100(1 - \alpha)% Prediction Intervals:

\hat{y}_{h_i} \pm \sqrt{q F_{q, n-p; 1-\alpha} (MSE + [s.e.(\hat{y}_{h_i})]^2)},

for i = 1, 2, \ldots, q.
[100(1 - \alpha)%]/[100 P%] Tolerance Intervals:

One-Sided Intervals:

(-\infty,\ \hat{y}_h + K_{\alpha,P} \sqrt{MSE}) \quad and \quad (\hat{y}_h - K_{\alpha,P} \sqrt{MSE},\ \infty)

are the upper and lower one-sided tolerance intervals, respectively, where K_{\alpha,P} is found similarly as in the simple linear regression setting, but with n = (X_h^T (X^T X)^{-1} X_h)^{-1}.
Two-Sided Interval:

\hat{y}_h \pm K_{\alpha/2, P/2} \sqrt{MSE},

where K_{\alpha/2, P/2} is found similarly as in the simple linear regression setting, but with n as given above and f = n - p, where p is the dimension of X_h.
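The confidence and prediction intervals above correspond to what predict() reports for a fitted lm object; a sketch (fit and the one-row data frame new_x of predictor values are placeholders):

predict(fit, newdata = new_x, interval = "confidence", level = 0.95)   # interval for E(Y | X = X_h)
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)   # interval for a new response at X_h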

8.4 Example

Example: Heat Flux Data Set (continued)

Refer back to the heat flux data set where only north and south were used as predictors of total heat flux. The MSE for this model is equal to 79.7819. However, if we are interested in the full variance-covariance matrix and correlation matrix, then this must be calculated by hand by finding (X^T X)^{-1}. Then,

\hat{V}(b) = 79.7819 \begin{pmatrix} 19.6229 & -0.5521 & -0.2918 \\ -0.5521 & 0.0472 & -0.0066 \\ -0.2918 & -0.0066 & 0.0113 \end{pmatrix} = \begin{pmatrix} 1565.5532 & -44.0479 & -23.2797 \\ -44.0479 & 3.7657 & -0.5305 \\ -23.2797 & -0.5305 & 0.9046 \end{pmatrix}.
Taking the square roots of the diagonal terms of this matrix gives you the
values of s.e.(b0 ), s.e.(b1 ), and s.e.(b2 ).
We can also calculate the correlation matrix of b (denoted by rb ) for
this data set:

r_b = \begin{pmatrix} \frac{Var(b_0)}{\sqrt{Var(b_0)Var(b_0)}} & \frac{Cov(b_0,b_1)}{\sqrt{Var(b_0)Var(b_1)}} & \frac{Cov(b_0,b_2)}{\sqrt{Var(b_0)Var(b_2)}} \\ \frac{Cov(b_1,b_0)}{\sqrt{Var(b_1)Var(b_0)}} & \frac{Var(b_1)}{\sqrt{Var(b_1)Var(b_1)}} & \frac{Cov(b_1,b_2)}{\sqrt{Var(b_1)Var(b_2)}} \\ \frac{Cov(b_2,b_0)}{\sqrt{Var(b_2)Var(b_0)}} & \frac{Cov(b_2,b_1)}{\sqrt{Var(b_2)Var(b_1)}} & \frac{Var(b_2)}{\sqrt{Var(b_2)Var(b_2)}} \end{pmatrix}

= \begin{pmatrix} \frac{1565.5532}{\sqrt{(1565.5532)(1565.5532)}} & \frac{-44.0479}{\sqrt{(1565.5532)(3.7657)}} & \frac{-23.2797}{\sqrt{(1565.5532)(0.9046)}} \\ \frac{-44.0479}{\sqrt{(3.7657)(1565.5532)}} & \frac{3.7657}{\sqrt{(3.7657)(3.7657)}} & \frac{-0.5305}{\sqrt{(3.7657)(0.9046)}} \\ \frac{-23.2797}{\sqrt{(0.9046)(1565.5532)}} & \frac{-0.5305}{\sqrt{(0.9046)(3.7657)}} & \frac{0.9046}{\sqrt{(0.9046)(0.9046)}} \end{pmatrix}

= \begin{pmatrix} 1 & -0.5737 & -0.6186 \\ -0.5737 & 1 & -0.2874 \\ -0.6186 & -0.2874 & 1 \end{pmatrix}.
r_b is an estimate of the population correlation matrix \rho_b. For example, Corr(b_1, b_2) = -0.2874, which implies there is a fairly low, negative correlation between the average change in flux for each unit increase in the south position and each unit increase in the north position. Therefore, the presence of the north position only slightly affects the estimate of the south's beta coefficient. The consequence is that it is fairly easy to separate the individual effects of these two variables. Note that we usually do not care about correlations concerning the intercept, b_0, since we usually wish to provide an interpretation concerning the x-variables.
If all x-variables are uncorrelated with each other, then all covariances
between pairs of sample coefficients that multiply x-variables will equal 0.
This means that the estimate of one beta is not affected by the presence of the
other x-variables. Many experiments are designed to achieve this property,
but achieving it with real data is often a different story.
The correlation matrix presented above should NOT be confused with
the correlation matrix, r, constructed for each pairwise combination of the
variables Y, X_1, X_2, \ldots, X_{p-1}; namely:

r = \begin{pmatrix} 1 & Corr(Y, X_1) & \cdots & Corr(Y, X_{p-1}) \\ Corr(X_1, Y) & 1 & \cdots & Corr(X_1, X_{p-1}) \\ \vdots & \vdots & \ddots & \vdots \\ Corr(X_{p-1}, Y) & Corr(X_{p-1}, X_1) & \cdots & 1 \end{pmatrix}.
Note that all of the diagonal entries are 1 because the correlation between
a variable and itself is a perfect (positive) association. This correlation matrix is what most statistical software reports and it does not always report
rb . The interpretation of each entry in r is identical to the Pearson correlation coefficient interpretation presented earlier. Specifically, it provides the
strength and direction of the association between the variables corresponding to the row and column of the respective entry. For this example, the
correlation matrix is:

r = \begin{pmatrix} 1 & -0.8488 & 0.1121 \\ -0.8488 & 1 & 0.2874 \\ 0.1121 & 0.2874 & 1 \end{pmatrix}.
We can also calculate the 95% confidence intervals for the regression coefficients. First note that t_{26; 0.975} = 2.0555. The 95% confidence interval for \beta_1 is calculated as -24.2150 \pm 2.0555\sqrt{3.7657} and for \beta_2 it is calculated as 4.7963 \pm 2.0555\sqrt{0.9046}. Thus, we are 95% confident that the true population regression coefficients for the north and south focal points are between (-28.2039, -20.2262) and (2.8413, 6.7513), respectively.

Chapter 9
Indicator Variables
We next discuss how to include categorical predictor variables in a regression
model. A categorical variable is a variable for which the possible outcomes are nameable characteristics, groups or treatments. Some examples
are gender (male or female), highest educational degree attained (secondary
school, college undergraduate, college graduate), blood pressure medication
used (drug 1, drug 2, drug 3), etc.
We use indicator variables to incorporate a categorical x-variable into a
regression model. An indicator variable equals 1 when an observation is in
a particular group and equals 0 when an observation is not in that group. An
interaction between an indicator variable and a quantitative variable exists
if the slope between the response and the quantitative variable depends upon
the specific value present for the indicator variable.

9.1 The Leave One Out Method

When a categorical predictor variable has k categories, it is possible to define k indicator variables. However, as explained later, we should only use k - 1 of them as predictor variables in the regression model.
Let us consider an example where we are analyzing data for a clinical trial
done to compare the effectiveness of three different medications used to treat
high blood pressure. n = 90 participants are randomly divided into three
groups of 30 patients and each group is assigned a different medication. The
response variable is the reduction in diastolic blood pressure in a 3 month
period. In addition to the treatment variables, two other predictor variables
will be X1 =age and X2 =body mass index.


We are examining three different treatments so we can define the following
three indicator variables for the treatment:
X3 = 1 if patient used treatment 1, 0 otherwise
X4 = 1 if patient used treatment 2, 0 otherwise
X5 = 1 if patient used treatment 3, 0 otherwise.
On the surface, it seems that our model should be the following overparameterized model, a model that requires us to make a modification in
order to estimate coefficients:
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \beta_5 x_{i,5} + \epsilon_i.    (9.1)

The difficulty with this model is that the X matrix has a linear dependency,
so we can't estimate the individual coefficients (technically, this is because
there will be an infinite number of solutions for the betas). The dependency
stems from the fact that Xi,3 + Xi,4 + Xi,5 = 1 for all observations because
each patient uses one (and only one) of the treatments. In the X matrix, the
linear dependency is that the sum of the last three columns will equal the
first column (all 1s). This scenario leads to what is called collinearity and
we investigate this in the next chapter.
One solution (there are others) for avoiding this difficulty is the leave
one out method. The leave one out method has the general rule that
whenever a categorical predictor variable has k categories, it is possible to
define k indicator variables, but we should only use k 1 of them to describe
the differences among the k categories. For the overall fit of the model, it
does not matter which set of k 1 indicators we use. The choice of which
k 1 indicator variables we use, however, does affect the interpretation of
the coefficients that multiply the specific indicators in the model.
In our example with three treatments (and three possible indicator variables), we might leave out the third indicator giving us this model:
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i.    (9.2)

For the overall fit of the model, it would work equally well to leave out
the first indicator and include the other two or to leave out the second and
include the first and third.
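In R this bookkeeping is automatic when the treatment is stored as a factor: lm() creates k - 1 indicator columns, leaving out the factor's first level by default (relevel() changes which group is left out). A sketch with hypothetical variable names:

bp$treatment <- factor(bp$treatment)                    # levels "1", "2", "3"
fit <- lm(reduction ~ age + bmi + treatment, data = bp)
head(model.matrix(fit))   # intercept, age, bmi, and indicators for treatments 2 and 3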
9.2 Coefficient Interpretations

The interpretation of the coefficients that multiply indicator variables is


tricky. The interpretation for the individual betas with the leave one out
method is that a coefficient multiplying an indicator in the model measures
the difference between the group defined by the indicator in the model and
the group defined by the indicator that was left out. Usually, a control or placebo
group is the one that is left out.
Let us consider our example again. We are predicting decreases in blood
pressure in response to X1 =age, X2 =body mass, and which of three different
treatments a person used. The variables X3 and X4 are indicators of the
treatment, as defined above. The model we will examine is
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i.
To see what is going on, look at each treatment separately by substituting
the appropriately defined values of the two indicators into the equation.
For treatment 1, by definition X_3 = 1 and X_4 = 0, leading to

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3(1) + \beta_4(0) + \epsilon_i
    = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 + \epsilon_i.

For treatment 2, by definition X_3 = 0 and X_4 = 1, leading to

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3(0) + \beta_4(1) + \epsilon_i
    = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_4 + \epsilon_i.

For treatment 3, by definition X_3 = 0 and X_4 = 0, leading to

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3(0) + \beta_4(0) + \epsilon_i
    = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i.
Now compare the three equations to each other. The only difference
between the equations for treatments 1 and 3 is the coefficient \beta_3. The only difference between the equations for treatments 2 and 3 is the coefficient \beta_4. This leads to the following meanings for the coefficients:

\beta_3 = difference in mean response for treatment 1 versus treatment 3, assuming the same age and body mass.
\beta_4 = difference in mean response for treatment 2 versus treatment 3, assuming the same age and body mass.
Here the coefficients are measuring differences from the third treatment.
With the leave one out method, a coefficient multiplying an indicator in
the model measures the difference between the group defined by the indicator
in the model and the group defined by the indicator that was left out.
IMPORTANT CAUTIONS: Notice that the coefficient that multiplies an indicator variable in the model does not retain the meaning implied
by the definition of the indicator. It is common for students to wrongly state
that a coefficient measures the difference between that group and the other
groups. That is WRONG! It is also incorrect to say only that a coefficient
multiplying an indicator measures the effect of being in that group. An
effect has to involve a comparison - with the leave one out method it is a
comparison to the group associated with the indicator left out.
One application where many indicator variables (or binary predictors)
are used is in conjoint analysis, which is a marketing tool that attempts
to capture a respondent's preference given the presence or absence of various
attribute levels. The X matrix is called a dummy matrix as it consists
of only 1s and 0s. The response is then regressed on the indicators using
ordinary least squares and researchers attempt to quantify items like identification of different market segments, predict profitability, or predict the
impact of a new competitor.
One additional note is that, in theory, with a linear dependence there are
an infinite number of suitable solutions for the betas (as will be seen with
multicollinearity). With the leave one out method, we are picking one with
a particular meaning and then the resulting coefficients measure differences
from the specified group. A method, often used in courses focused strictly
on ANOVA or Design of Experiments, offers a different meaning for what we
estimate. There it will be more common to parameterize in a way so that a
coefficient measures how a group differs from an overall average.

9.3 Testing Overall Group Differences

To test the overall significance of a categorical predictor variable, we use a general linear F-test procedure (which is developed in detail later). We form the reduced model by dropping the indicator variables from the model.
More technically, the null hypothesis is that the coefficients multiplying the indicators all equal 0.
For our example with three treatments of high blood pressure and additional x-variables age and body mass, the details for doing an overall test of treatment differences are (see the sketch below):

Full model is: y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i.
Null hypothesis is: H_0: \beta_3 = \beta_4 = 0.
Reduced model is: y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i.
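A sketch of this general linear F-test in R, comparing the reduced and full fits (variable names as in the earlier sketch are assumptions):

fit_reduced <- lm(reduction ~ age + bmi, data = bp)
fit_full    <- lm(reduction ~ age + bmi + treatment, data = bp)
anova(fit_reduced, fit_full)   # F-test of H0: all treatment coefficients equal 0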

9.4 Interactions
To examine a possible interaction between a categorical predictor and a quantitative predictor, include product variables between each indicator and the
quantitative variable.
As an example, suppose we thought there could be an interaction between the body mass variable (X2 ) and the treatment variable. This would
mean that we thought that treatment differences in blood pressure reduction
depend on the specific value of body mass. The model we would use is:
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \beta_5 x_{i,2} x_{i,3} + \beta_6 x_{i,2} x_{i,4} + \epsilon_i.
To test whether there is an interaction, the null hypothesis is H_0: \beta_5 = \beta_6 = 0. We would use the general linear F-test procedure to carry out the test. The full model is the interaction model given above. The reduced model is now:

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \epsilon_i.
A visual way to assess if there is an interaction is by using an interaction
plot. An interaction plot is created by plotting the response versus the
quantitative predictor and connecting the successive values according to the
grouping of the observations. Recall that an interaction between factors
occurs when the change in response from lower levels to higher levels of one
factor is not quite the same as going from lower levels to higher levels of
another factor. Interaction plots allow us to compare the relative strength of
the effects across factors. What results is one of three possible trends:
The lines could be (nearly) parallel, which indicates no interaction.


This means that the change in the response from lower levels to higher
levels for each factor is roughly the same.
The lines intersect within the scope of the study, which indicates an
interaction. This means that the change in the response from lower
levels to higher levels of one factor is noticeably different than the
change in another factor. This type of interaction is called a disordinal
interaction.
The lines do not intersect within the scope of the study, but the trends
indicate that if we were to extend the levels of our factors, then we
may see an interaction. This type of interaction is called an ordinal
interaction.
Figure 9.1 illustrates each type of interaction plot using a mock data set
pertaining to the mean tensile strength measured at three different speeds
of 3 different processes. The upper left plot illustrates the case where no
interaction is present because the change in mean tensile strength is similar
for each process as you increase the speed (i.e., the lines are parallel). The
upper right plot illustrates an interaction because as the speeds increase, the
change in mean tensile strength is noticeably different depending on which
process is being used (i.e., the lines cross). The bottom right plot illustrates
an ordinal interaction where no interaction is present within the scope of the
range of speeds studied, but if these trends continued for higher speeds, then
we may see an interaction (i.e., the lines may cross).
It should also be noted that just because lines cross, it does not necessarily
imply the interaction is statistically significant. Lines which appear nearly
parallel, yet cross at some point, may not yield a statistically significant
interaction term. If two lines cross, the more different the slopes appear and
the more data that is available, then the more likely the interaction term will
be significant.
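Base R's interaction.plot() produces this kind of display; a minimal sketch with hypothetical variables (speed and process as factors, strength as the response):

interaction.plot(x.factor = speed, trace.factor = process, response = strength,
                 xlab = "Speed", ylab = "Mean tensile strength", trace.label = "Process")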

9.5 Relationship to ANCOVA
When dealing with categorical predictors in regression analysis, we often say


that we are performing a regression with indicator variables or a regression
Figure 9.1: (a) A plot of no interactions amongst the groups (notice how the lines are nearly parallel). (b) A plot of a disordinal interaction amongst the groups (notice how the lines intersect). (c) A plot of an ordinal interaction amongst the groups (notice how the lines don't intersect, but if we were to extrapolate beyond the predictor limits, then the lines would likely cross).

with interactions (if we are interested in testing for interactions with indicator variables and other variables). However, in the design and analysis
of experiments literature, this model is also used, but with a slightly different motivation. Various experimental layouts using ANOVA tables are
commonly used in the design and analysis of experiments. These ANOVA
tables are constructed to compare the means of several levels of one or more
treatments. For example, a one-way ANOVA can be used to compare six
different dosages of blood pressure pills and the mean blood pressure of individuals who are taking one of those six dosages. In this case, there is one
factor with six different levels. Suppose further that there are four different
races represented in this study. Then a two-way ANOVA can be used since
we have two factors - the dosage of the pill and the race of the individual
taking the pill. Furthermore, an interaction term can be included if we suspect that the dosage a person is taking and the race of the individual have
a combined effect on the response. As you can see, you can extend to the
more general n-way ANOVA (with or without interactions) for the setting
with n treatments. However, dealing with n > 2 can often lead to difficulty
in interpreting the results.
One other important thing to point out with ANOVA models is that,
while they use least squares for estimation, they differ from how categorical
variables are handled in a regression model. In an ANOVA model, there is
a parameter estimated for the factor level means and these are used for the
linear model of the ANOVA. This differs slightly from a regression model
which estimates a regression coefficient for, say, n - 1 indicator variables
(assuming there are n levels of the categorical variable and we are using
the leave-one-out method). Also, ANOVA models utilize ANOVA tables,
which are broken down by each factor (i.e., you would look at the sums of
squares for each factor present). ANOVA tables for regression models simply
test if the regression model has at least one variable which is a significant
predictor of the response. More details on these differences are better left to
a course on design of experiments.
When there is also a continuous variable measured with each response,
then the n-way ANOVA model needs to reflect the continuous variable. This
model is then referred to as an Analysis of Covariance (or ANCOVA)
model. The continuous variable in an ANCOVA model is usually called the
covariate or sometimes the concomitant variable. One difference in how
an ANCOVA model is approached is that an interaction between the covariate and each factor is always tested first. The reason is that an
ANCOVA is conducted to investigate the overall relationship between the


response and the covariate while assuming this relationship is true for all
groups (i.e., for all treatment levels). If, however, this relationship does differ across the groups, then the overall regression model is inaccurate. This
assumption is called the assumption of homogeneity of slopes. This is
assessed by testing for parallel slopes, which involves testing the interaction
term between the covariate and each factor in the ANCOVA table. If the
interaction is not statistically significant, then you can claim parallel slopes
and proceed to build the ANCOVA model. If the interaction is statistically
significant, then the regression model used is not appropriate and an ANCOVA model should not be used.
As an example of how to write ANCOVA models, first consider the one-way ANCOVA setting. Suppose we have i = 1, \ldots, I treatments and each treatment has j = 1, \ldots, J_i pairs of continuous variables measured (i.e., (x_{i,1}, y_{i,1}), \ldots, (x_{i,J_i}, y_{i,J_i})). Then the one-way ANCOVA model is written as

y_{i,j} = \mu_i + \beta x_{i,j} + \epsilon_{i,j},

where \mu_i is the mean of the i-th treatment level, \beta is the common regression slope, and the \epsilon_{i,j} are iid normal with mean 0 and variance \sigma^2. So note that the test of parallel slopes concerns testing if the slope is the same for all treatment levels versus if it is not. A high p-value indicates that we have parallel slopes (or homogeneity of slopes) and can therefore use an ANCOVA model.

9.6 Coded Variables

In the early days when computing power was limited, coding of the variables simplified the linear algebra and thus allowed least squares solutions to be computed manually. Many methods exist for coding data, such as:
Converting variables to two values (e.g., {-1, 1} or {0, 1}).
Converting variables to three values (e.g., {-1, 0, 1}).
Coding continuous variables to reflect only important digits (e.g., if
the costs of various nuclear programs range from $100,000 to $150,000,
coding can be done by dividing through by $100,000, resulting in the
range being from 1 to 1.5).

The purpose of coding is to simplify the calculation of (X^T X)^{-1} in the various
regression equations, which was especially important when this had to be
done by hand. It is important to note that the above methods are just a few
possibilities and that there are no specific guidelines or rules of thumb for
when to code data.
Today when (X^T X)^{-1} is calculated with computers, there may be a significant rounding error in the linear algebra manipulations if the difference
in the magnitude of the predictors is large. Good statistical programs assess
the probability of such errors, which would warrant using coded variables.
When coding variables, one should be aware of different magnitudes of the
parameter estimates compared to those for the original data. The intercept
term can change dramatically, but we are concerned with any drastic changes
in the slope estimates. In order to protect against additional errors due to
the varying magnitudes of the regression parameters, you can compare plots
of the actual data and the coded data and see if they appear similar.

9.7 Examples

Example 1: Software Development Data Set
Suppose that data from n = 20 institutions is collected on similar software
development projects. The data set includes Y = number of man-years
required for each project, X1 = number of application subprograms developed
for the project, and X2 = 1 if an academic institution developed the program
or 0 if a private firm developed the program. The data is given in Table 9.1.
Suppose we wish to estimate the number of man-years necessary for developing this type of software for the purpose of contract bidding. We also
suspect a possible interaction between the number of application subprograms developed and the type of institution. Thus, we consider the multiple
regression model
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1} x_{i,2} + \epsilon_i.
So first, we fit the above model and assess the significance of the interaction
term.
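A sketch of such a fit in R (the data frame software and the column names follow the output below and are assumptions; the interaction column sub.inst is simply the product of the other two predictors):

software$sub.inst <- software$subprograms * software$institution
fit <- lm(manyears ~ subprograms + institution + sub.inst, data = software)
summary(fit)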
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -23.8112    20.4315  -1.165    0.261
subprograms   0.8541     0.1066   8.012 5.44e-07 ***
institution  35.3686    26.7086   1.324    0.204
sub.inst     -0.2019     0.1556  -1.297    0.213
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 38.42 on 16 degrees of freedom
Multiple R-Squared: 0.8616,    Adjusted R-squared: 0.8356
F-statistic: 33.2 on 3 and 16 DF, p-value: 4.210e-07
##########

The above gives the t-tests for these predictors. Notice that only the predictor
of application subprograms (i.e., X1 ) is statistically significant, so we should
consider dropping the interaction term for starters.

Figure 9.2: An interaction plot where the grouping is by institution.


An interaction plot can also be used to justify use of an interaction term.
Figure 9.2 provides the interaction plot for this data set. This plot seems to
indicate a possible (disordinal) interaction. A test of this interaction term


yields p = 0.213 (see the earlier output). Even though the interaction plot
indicates a possible interaction, the actual interaction term is deemed not
statistically significant and thus we can drop it from the model.
 i   Subprograms   Institution   Man-Years
 1       135            0            52
 2       128            1            58
 3       221            0           207
 4        82            1            95
 5       401            0           346
 6       360            1           244
 7       241            0           215
 8       130            0           112
 9       252            1           195
10       220            0            54
11       112            0            48
12        29            1            39
13        57            0            31
14        28            1            57
15        41            1            20
16        27            1            33
17        33            1            19
18         7            0             6
19        17            0             7
20        94            1            56

Table 9.1: The software development data set.


We next provide the analysis without the interaction term. Though the
results are not shown here, a test of each predictor shows that the subprograms predictor is statistically significant (p = 0.000), while the institution
predictor is not statistically significant (p = 0.612). This then tells us that
there is no statistically significant difference in man-years for this type of software development between academic institutions and private firms. However,
the number of subprograms is still a statistically significant predictor. So the
final model should be a simple linear regression model with subprograms as
the predictor and man-years as the response. The final estimated regression
equation is
\hat{y}_i = -3.47742 + 0.75088 x_{i,1},
which can be found from the following output:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.47742   13.12068  -0.265    0.794
subprograms  0.75088    0.07591   9.892 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 38.38 on 18 degrees of freedom
Multiple R-Squared: 0.8446,    Adjusted R-squared: 0.836
F-statistic: 97.85 on 1 and 18 DF, p-value: 1.055e-08
##########
Example 2: Steam Output Data (continued )
Consider coding the steam output data by rounding the temperature to the
nearest integer value ending in either 0 or 5. For example, a temperature of
57.5 degrees would be rounded up to 60 degrees while a temperature of 76.8
degrees would be rounded down to 75 degrees. While you would probably
not utilize coding on such an easy data set where magnitude is not an issue,
it is utilized here just for illustrative purposes.
Figure 9.3 compares the scatterplots of this data set with the original
temperature value and the coded temperature value. The plots look comparable, suggesting that coding could be used here. Recall that the estimated
regression equation for the original data was \hat{y}_i = 13.6230 - 0.0798 x_i. The estimated regression equation for the coded data is \hat{y}_i = 13.7765 - 0.0824 x_i,
which is also comparable.

Figure 9.3: Comparing scatterplots of the steam output data with the original temperature (a) and with the temperature coded (b). A line of best fit for each is also shown.

Chapter 10
Multicollinearity
Recall that the columns of a matrix are linearly dependent if one column
can be expressed as a linear combination of the other columns. A matrix
theorem is that if there is a linear dependence among the columns of X, then
(XT X)1 does not exist. This means that we cant determine estimates of
the beta coefficients since the formula for determining the estimates involves
(XT X)1 .
In multiple regression, the term multicollinearity refers to the linear
relationships among the x-variables. Often, the use of this term implies that
the x-variables are correlated with each other, so when the x-variables are not
correlated with each other, we might say that there is no multicollinearity.

10.1 Sources and Effects of Multicollinearity
There are various sources for multicollinearity. For example, in the data
collection phase an investigator may have drawn the data from such a narrow
subspace of the independent variables that collinearity appears. Physical
constraints, such as design limits, may also impact the range of some of these
independent variables. Model specification (such as defining more variables
than observations or specifying too many higher-ordered terms/interactions)
and outliers can both lead to collinearity.
When there is no multicollinearity among x-variables, the effects of the
individual x-variables can be estimated independently of each other (although
we will still want to do a multiple regression). When multicollinearity is
present, the estimated coefficients are correlated (confounded) with each
other. This creates difficulty when we attempt to interpret how individual x-variables affect y.
Along with this correlation, multicollinearity has a multitude of other
ramifications on our analysis, including:
inaccurate regression coefficient estimates,
inflated standard errors of the regression coefficient estimates,
deflated t-tests for significance testing of the regression coefficients,
false nonsignificance determined by the p-values, and
degradation of model predictability.
In designed experiments with multiple x-variables, researchers usually
choose the values of the x-variables so that there is no multicollinearity. In
observational studies (sample surveys), it is nearly always the case that the
x-variables will be correlated.

10.2 Detecting and Correcting Multicollinearity

We introduce three primary ways of detecting multicollinearity: two are fairly straightforward to implement, while the third is actually a variety of measures based on the eigenvalues and eigenvectors of the standardized design matrix.
Method 1: Pairwise Scatterplots
For the first method, we can visually inspect the data by doing pairwise scatterplots of the independent variables. So if you have p − 1 independent variables, then you should inspect all $\binom{p-1}{2}$ pairwise scatterplots. You will be looking for any plots that seem to indicate a linear relationship between pairs of independent variables.
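As a quick sketch of this in R (the data frame preds and its columns are hypothetical, constructed only for illustration), the pairwise scatterplots and the corresponding correlations can be obtained as follows:
##########
# Hypothetical predictors; x3 is built to be nearly collinear with x1
set.seed(101)
preds <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
preds$x3 <- 0.9 * preds$x1 + rnorm(30, sd = 0.2)

pairs(preds)   # all choose(p - 1, 2) pairwise scatterplots
cor(preds)     # the corresponding pairwise correlations
##########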
Method 2: VIF
Second, we can use a measure of multicollinearity called the variance inflation factor (VIF). This is defined as
$$ VIF_j = \frac{1}{1 - R_j^2}, $$
where $R_j^2$ is the coefficient of determination obtained by regressing $X_j$ on the remaining independent variables. A common rule of thumb is that if $VIF_j = 1$, then there is no collinearity; if $1 < VIF_j < 5$, then there is possibly some moderate collinearity; and if $VIF_j \geq 5$, then there is a strong indication of a collinearity problem. Most of the time we will shoot for values as close to 1 as possible, and that usually will be sufficient. The bottom line is that the higher the VIF, the more likely multicollinearity is an issue.
Sometimes the tolerance is also reported. The tolerance is simply the inverse of the VIF (i.e., $Tol_j = VIF_j^{-1}$). In this case, the lower the tolerance, the more likely multicollinearity is an issue.
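As a sketch of the computation (continuing with the hypothetical preds data frame from the previous sketch and a hypothetical response y), the VIFs can be obtained directly from their definition or with the vif() function in the car package:
##########
library(car)   # provides vif()

y   <- with(preds, 2 + 3 * x1 - x2 + rnorm(30))   # hypothetical response
fit <- lm(y ~ x1 + x2 + x3, data = preds)

vif(fit)       # variance inflation factors
1 / vif(fit)   # the corresponding tolerances

# Equivalently, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing X_j on the remaining predictors:
r2.x1 <- summary(lm(x1 ~ x2 + x3, data = preds))$r.squared
1 / (1 - r2.x1)   # matches vif(fit)["x1"]
##########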
If multicollinearity is suspected after doing the above, then a couple of things can be done. First, reassess the choice of model, determine whether there are any unnecessary terms, and remove them. You may wish to start by removing the term you suspect most, because this will then drive down the VIFs of the remaining variables.
Next, check for outliers and see what effect some of the observations with higher residuals have on the analysis. Remove some (or all) of the suspected outliers and see how that affects the pairwise scatterplots and VIF values.
You can also standardize the variables, which involves subtracting from each variable its mean and dividing by its standard deviation. Thus, the standardized X matrix is given as:
$$
\mathbf{X}^* = \frac{1}{\sqrt{n-1}}
\begin{pmatrix}
\frac{X_{1,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{1,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{1,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\frac{X_{2,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{2,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{2,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{X_{n,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{n,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{n,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}}
\end{pmatrix},
$$

which is an $n \times (p-1)$ matrix, and the standardized Y vector is given as:

$$
\mathbf{Y}^* = \frac{1}{\sqrt{n-1}}
\begin{pmatrix}
\frac{Y_1-\bar{Y}}{s_Y} \\
\frac{Y_2-\bar{Y}}{s_Y} \\
\vdots \\
\frac{Y_n-\bar{Y}}{s_Y}
\end{pmatrix},
$$

which is still an $n$-dimensional vector. Here,

$$ s_{X_j} = \sqrt{\frac{\sum_{i=1}^{n}(X_{i,j}-\bar{X}_j)^2}{n-1}} $$
for $j = 1, 2, \ldots, (p-1)$ and
$$ s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}}. $$

Notice that we have removed the column of 1's in forming $\mathbf{X}^*$, effectively reducing the column dimension of the original X matrix by 1. Because of this, we can no longer estimate an intercept term ($b_0$), which may be an important part of the analysis. Thus, proceed with this method only if you believe the intercept term adds little value to explaining the science behind your regression model!
When using the standardized variables, the regression model of interest becomes:
$$ \mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta}^* + \boldsymbol{\epsilon}^*, $$
where $\boldsymbol{\beta}^*$ is now a $(p-1)$-dimensional vector of standardized regression coefficients and $\boldsymbol{\epsilon}^*$ is an $n$-dimensional vector of errors pertaining to this standardized model. Thus, the ordinary least squares estimates are
$$ \mathbf{b}^* = (\mathbf{X}^{*\mathrm{T}}\mathbf{X}^*)^{-1}\mathbf{X}^{*\mathrm{T}}\mathbf{Y}^* = \mathbf{r}_{XX}^{-1}\mathbf{r}_{XY}, $$
where $\mathbf{r}_{XX}$ is the $(p-1)\times(p-1)$ correlation matrix of the predictors and $\mathbf{r}_{XY}$ is the $(p-1)$-dimensional vector of correlation coefficients between the predictors and the response. Because $\mathbf{b}^*$ is a function of correlations, this method is called a correlation transformation. Sometimes it may be enough to simply center the variables by their respective means in order to decrease the VIFs. Note the relationship between the quantities introduced above and the correlation matrix $\mathbf{r}$ from earlier:
$$ \mathbf{r} = \begin{pmatrix} 1 & \mathbf{r}_{XY}^{\mathrm{T}} \\ \mathbf{r}_{XY} & \mathbf{r}_{XX} \end{pmatrix}. $$
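A brief illustration of this identity in R (reusing the hypothetical preds and y objects from the earlier sketches): the standardized coefficients from the correlation transformation are simply $\mathbf{r}_{XX}^{-1}\mathbf{r}_{XY}$.
##########
# Correlation transformation: b* = rXX^{-1} rXY
rXX <- cor(preds)      # (p-1) x (p-1) correlation matrix of the predictors
rXY <- cor(preds, y)   # correlations between each predictor and the response
solve(rXX, rXY)        # standardized regression coefficients

# The same values come from regressing the scaled response on the scaled
# predictors with no intercept (each variable centered, scaled, and then
# divided by sqrt(n - 1)):
Xs <- scale(preds) / sqrt(nrow(preds) - 1)
Ys <- as.vector(scale(y)) / sqrt(length(y) - 1)
coef(lm(Ys ~ Xs - 1))
##########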
Method 3: Eigenvalue Methods
Finally, the third method for identifying potential multicollinearity concerns a variety of measures utilizing eigenvalues and eigenvectors. First, note that the eigenvalue $\lambda_j$ and the corresponding $(p-1)$-dimensional orthonormal eigenvector $\mathbf{v}_j$ are solutions to the system of equations:
$$ \mathbf{X}^{*\mathrm{T}}\mathbf{X}^*\mathbf{v}_j = \lambda_j\mathbf{v}_j, $$
for $j = 1, \ldots, (p-1)$. Since the $\mathbf{v}_j$'s are normalized, it follows that
$$ \mathbf{v}_j^{\mathrm{T}}\mathbf{X}^{*\mathrm{T}}\mathbf{X}^*\mathbf{v}_j = \lambda_j. $$
Therefore, if $\lambda_j \approx 0$, then $\mathbf{X}^*\mathbf{v}_j \approx \mathbf{0}$; i.e., the columns of $\mathbf{X}^*$ are approximately linearly dependent. Thus, since the sum of the eigenvalues must equal the number of predictors (i.e., $p-1$), very small $\lambda_j$'s (say, near 0.05) are indicative of collinearity. Another commonly used criterion is to declare that multicollinearity is present when $\sum_{j=1}^{p-1}\lambda_j^{-1} > 5(p-1)$. Moreover, the entries of the corresponding $\mathbf{v}_j$'s indicate the nature of the linear dependencies; i.e., large elements of the eigenvectors identify the predictor variables that comprise the collinearity.
A measure of the overall multicollinearity of the variables can be obtained by computing what is called the condition number of the correlation matrix (i.e., $\mathbf{r}$), which is defined as $\sqrt{\lambda_{(p-1)}/\lambda_{(1)}}$, such that $\lambda_{(1)}$ and $\lambda_{(p-1)}$ are the minimum and maximum eigenvalues, respectively. Obviously this quantity is always at least 1, so a large number is indicative of collinearity. Empirical evidence suggests that a value less than 15 typically means weak collinearity, values between 15 and 30 are evidence of moderate collinearity, while anything over 30 is evidence of strong collinearity.
Condition numbers for the individual predictors can also be calculated. This is accomplished by taking $c_j = \sqrt{\lambda_{(p-1)}/\lambda_j}$ for each $j = 1, \ldots, (p-1)$. When the data are centered and scaled, $c_j \leq 100$ indicates no collinearity, $100 < c_j < 1000$ indicates moderate collinearity, while $c_j \geq 1000$ indicates strong collinearity for predictor $X_j$. When the data are only scaled (i.e., for regression-through-the-origin models), collinearity will always be worse. Thus, more relaxed limits are usually used. For example, a common rule of thumb is to use 5 times the limits mentioned above; namely, $c_j \leq 500$ indicates no collinearity, $500 < c_j < 5000$ indicates moderate collinearity, while $c_j \geq 5000$ indicates strong collinearity for predictor $X_j$.
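A minimal sketch of these eigenvalue-based diagnostics in R, again using the hypothetical preds data frame from the earlier sketches:
##########
# Eigenvalues/eigenvectors of the predictor correlation matrix (i.e., rXX)
eig    <- eigen(cor(preds))
lambda <- eig$values

sum(1 / lambda) > 5 * length(lambda)   # rule of thumb: sum of 1/lambda_j > 5(p - 1)

sqrt(max(lambda) / min(lambda))        # overall condition number of r
sqrt(max(lambda) / lambda)             # individual condition numbers c_j

eig$vectors   # large entries point to the predictors involved in a collinearity
##########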
It should be noted that there are many heuristic ways other than those described above to assess multicollinearity with eigenvalues and eigenvectors.1
Moreover, it should be noted that some observations can have an undue influence on these various measures of collinearity. These observations are called
collinearity-influential observations and care should be taken with how
1 For example, one such technique involves taking the square eigenvector relative to the square eigenvalue and then seeing what percentage each quantity in this (p − 1)-dimensional vector explains of the total variation for the corresponding regression coefficient.

these observations are handled. You can typically use some of the residual diagnostic measures (e.g., DFFITS, Cook's $D_i$, DFBETAS, etc.) for identifying potential collinearity-influential observations, since there is no established or agreed-upon method for classifying such observations.
Finally, there are also some more advanced regression procedures that
can be performed in the presence of multicollinearity. Such methods include
principal components regression and ridge regression. These methods are
discussed later.

10.3 Examples

Example 1: Muscle Mass Data Set


Suppose that data from n = 6 individuals is collected on their muscle mass.
The data set includes Y = muscle mass, X1 = age, and two possible (yet
redundant) indicator variables for gender:
1. X2 = 1 if the person is male (M), and 0 if female (F).
2. X3 = 1 if the person is female (F), and 0 if male (M).
The data are given in Table 10.1.
mass   60   50   70   42   50   45
age    40   45   43   60   60   65
sex     M    F    M    F    M    F

Table 10.1: The muscle mass data set.


Suppose that we (mistakenly) attempt to use the model
$$ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \epsilon_i. $$
Notice that we used both indicator variables in the model. For these data and this model,

$$
\mathbf{X} = \begin{pmatrix}
1 & 40 & 1 & 0 \\
1 & 45 & 0 & 1 \\
1 & 43 & 1 & 0 \\
1 & 60 & 0 & 1 \\
1 & 60 & 1 & 0 \\
1 & 65 & 0 & 1
\end{pmatrix}.
$$
The sum of the last two columns equals the first column for every row in the X matrix. This is a linear dependence, so parameter estimates cannot be calculated because $(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}$ does not exist. In practice, the usual solution is to drop one of the indicator variables from the model. Another solution is to drop the intercept (thus dropping the first column of X above), but that is not usually done.
For this example, we can't proceed with a multiple regression analysis because there is perfect collinearity between $X_2$ and $X_3$. Sometimes a generalized inverse can be used (which requires more of a discussion beyond the scope of this course), or, if you attempt to do an analysis on such a data set, the software you are using may zero out one of the variables that is contributing to the collinearity and then proceed with the analysis. However, this can lead to errors in the final analysis.
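As a quick illustration of this behavior (a sketch in R using the data from Table 10.1), the cross-product matrix is rank deficient, and lm() responds by zeroing out (reporting NA for) one of the redundant indicators:
##########
mass   <- c(60, 50, 70, 42, 50, 45)
age    <- c(40, 45, 43, 60, 60, 65)
male   <- c(1, 0, 1, 0, 1, 0)   # X2
female <- c(0, 1, 0, 1, 0, 1)   # X3 = 1 - X2, redundant given the intercept

X <- cbind(1, age, male, female)
qr(t(X) %*% X)$rank   # rank 3 < 4 columns, so (X^T X)^{-1} does not exist

# lm() detects the aliasing and drops the female indicator (its estimate is NA):
summary(lm(mass ~ age + male + female))
##########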
Example 2: Heat Flux Data Set (continued )
Let us return to the heat flux data set. Let our model include the east,
south, and north focal points, but also incorporate time and insolation as
predictors. First, let us run a multiple regression analysis which includes
these predictors.
##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 325.43612   96.12721   3.385  0.00255 **
east          2.55198    1.24824   2.044  0.05252 .
north       -22.94947    2.70360  -8.488 1.53e-08 ***
south         3.80019    1.46114   2.601  0.01598 *
time          2.41748    1.80829   1.337  0.19433
insolation    0.06753    0.02899   2.329  0.02900 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.039 on 23 degrees of freedom
Multiple R-Squared: 0.8988,     Adjusted R-squared: 0.8768
F-statistic: 40.84 on 5 and 23 DF,  p-value: 1.077e-10
##########
We see that time is not a statistically significant predictor of heat flux and, in fact, east has become only marginally significant.
However, let us now look at the VIF values:
##########
      east      north      south       time insolation
  1.355448   2.612066   3.175970   5.370059   2.319035
##########

Notice that the VIF for time is fairly high (about 5.37), which should be investigated further. So next, the pairwise scatterplots are given in Figure 10.1 (we will only look at the plots involving the time variable since that is the variable we are investigating). Notice how there appears to be a noticeable linear trend between time and the south focal point. There also appears to be some sort of curvilinear trend between time and the north focal point as well as between time and insolation. These plots, combined with the VIF for time, suggest looking at a model without the time variable.
After removing the time variable, we obtain the new VIF values:
##########
      east      north      south insolation
  1.277792   1.942421   1.206057   1.925791
##########

Notice how removal of the time variable has sharply decreased the VIF values for the other variables. The regression coefficient estimates are:
##########
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 270.21013   88.21060   3.063  0.00534 **
east          2.95141    1.23167   2.396  0.02471 *
north       -21.11940    2.36936  -8.914 4.42e-09 ***
south         5.33861    0.91506   5.834 5.13e-06 ***
insolation    0.05156    0.02685   1.920  0.06676 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.17 on 24 degrees of freedom
Multiple R-Squared: 0.8909,     Adjusted R-squared: 0.8727
F-statistic: 48.99 on 4 and 24 DF,  p-value: 3.327e-11
##########

Figure 10.1: Pairwise scatterplots of the variable time versus (a) east, (b)
north, (c) south, and (d) insolation. Notice how there appears to be some sort
of relationship between time and the predictors north, south, and insolation.

Notice how east is now statistically significant while insolation is only marginally significant. If we proceeded to drop insolation from the model, then we would be back to the analysis we did earlier in the chapter. This illustrates how dropping or adding a predictor to a model can change the significance of other predictors. We will return to this when we discuss stepwise regression.

Chapter 11
ANOVA II
As in simple regression, the ANOVA table for a multiple regression model displays quantities that measure how much of the variability in the y-variable is
explained and how much is not explained by the x-variables. The calculation
of the quantities involved is nearly identical to what we did in simple regression. The main difference has to do with the degrees of freedom quantities.
The basic structure is given in Table 11.1 and the explanations follow.
Source        df      SS                      MS     F
Regression    p − 1   Σ (ŷ_i − ȳ)²            MSR    MSR/MSE
Error         n − p   Σ (y_i − ŷ_i)²          MSE
Total         n − 1   Σ (y_i − ȳ)²

(all sums run over i = 1, …, n)

Table 11.1: ANOVA table for any regression.

The sum of squares for total is $SSTO = \sum_{i=1}^{n}(y_i - \bar{y})^2$, which is the sum of squared deviations from the overall mean of y, and $df_T = n - 1$. SSTO is a measure of the overall variation in the y-values. In matrix notation, $SSTO = ||\mathbf{Y} - \bar{Y}\mathbf{1}||^2$.

The sum of squared errors is $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, which is the sum of squared observed errors (residuals) for the observed data. SSE is a measure of the variation in y that is not explained by the regression. For multiple linear regression, $df_E = n - p$, where p = number of beta coefficients in the model (including the intercept $\beta_0$). As an example, if there are four x-variables then p = 5 (the betas multiplying the x-variables plus the intercept). In matrix notation, $SSE = ||\mathbf{Y} - \hat{\mathbf{Y}}||^2$.

The mean squared error is $MSE = \frac{SSE}{df_E} = \frac{SSE}{n-p}$, which estimates $\sigma^2$, the variance of the errors.

The sum of squares due to the regression is SSR = SSTO − SSE, and it is a measure of the total variation in y that can be explained by the regression with the x-variables. Also, $df_R = df_T - df_E$. For multiple regression, $df_R = (n-1) - (n-p) = p - 1$. In matrix notation, $SSR = ||\hat{\mathbf{Y}} - \bar{Y}\mathbf{1}||^2$.
The mean square for the regression is $MSR = \frac{SSR}{df_R} = \frac{SSR}{p-1}$.

11.1 Uses of the ANOVA Table

1. The F-statistic in the ANOVA given in Table 11.1 can be used to test whether the y-variable is related to one or more of the x-variables in the model. Specifically, F = MSR/MSE is a test statistic for
$$ H_0: \beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0 $$
$$ H_A: \text{at least one } \beta_i \neq 0 \text{ for } i = 1, \ldots, p-1. $$
The null hypothesis means that the y-variable is not related to any of the x-variables in the model. The alternative hypothesis means that the y-variable is related to one or more of the x-variables in the model. Statistical software will report a p-value for this test statistic. The p-value is calculated as the probability to the right of the calculated value of F in an F distribution with p − 1 and n − p degrees of freedom (often written as $F_{p-1,n-p;1-\alpha}$). The usual decision rule also applies here in that if p < 0.05, reject the null hypothesis. If that is our decision, we conclude that y is related to at least one of the x-variables in the model.

2. MSE is the estimate of the error variance $\sigma^2$. Thus $s = \sqrt{MSE}$ estimates the standard deviation of the errors.
3. As in simple regression, $R^2 = \frac{SSTO - SSE}{SSTO}$, but here it is called the coefficient of multiple determination. $R^2$ is interpreted as the proportion of variation in the observed y values that is explained by the model (i.e., by the entire set of x-variables in the model).
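As a quick sketch of where each of these quantities shows up in R output (the data frame dat and its variables here are hypothetical):
##########
set.seed(1)
dat   <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
dat$y <- 5 + 2 * dat$x1 - dat$x2 + rnorm(30)

out <- summary(lm(y ~ x1 + x2 + x3, data = dat))
out$fstatistic   # overall F = MSR/MSE with its (p - 1, n - p) degrees of freedom
out$sigma^2      # MSE, the estimate of the error variance
out$sigma        # s, the estimated standard deviation of the errors
out$r.squared    # R^2, the coefficient of multiple determination
##########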

11.2 The General Linear F-Test

The general linear F-test procedure is used to test any null hypothesis that, if true, still leaves us with a linear model (linear in the β's). The most common application is to test whether a particular set of coefficients are all equal to 0. As an example, suppose we have a response variable (Y) and 5 predictor variables ($X_1, X_2, \ldots, X_5$). Then, we might wish to test
$$ H_0: \beta_1 = \beta_3 = \beta_4 = 0 $$
$$ H_A: \text{at least one of } \{\beta_1, \beta_3, \beta_4\} \neq 0. $$
The purpose of testing a hypothesis like this is to determine whether we could eliminate variables $X_1$, $X_3$, and $X_4$ from a multiple regression model (an action implied by the statistical truth of the null hypothesis).
The full model is the multiple regression model that includes all variables under consideration. The reduced model is the regression model that would result if the null hypothesis were true. The general linear F-statistic is
$$ F = \frac{\frac{SSE(\mathrm{reduced}) - SSE(\mathrm{full})}{df_E(\mathrm{reduced}) - df_E(\mathrm{full})}}{MSE(\mathrm{full})}. $$
Here, this F-statistic has degrees of freedom $df_1 = df_E(\mathrm{reduced}) - df_E(\mathrm{full})$ and $df_2 = df_E(\mathrm{full})$. With the rejection region approach and a 0.05 significance level, we reject $H_0$ if the calculated F is greater than the tabled value $F_{df_1,df_2;1-\alpha}$, which is the 95th percentile of the appropriate $F_{df_1,df_2}$-distribution. A p-value is found as the probability that the F-statistic would be as large as or larger than the calculated F.
To summarize, the general linear F -test is used in settings where there
are many predictors and it is desirable to see if only one or a few of the
predictors can adequately perform the task of estimating the mean response
and prediction of new observations. Sometimes the full and reduced sums of
squares (that we introduced above) are referred to as extra sum of squares.
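A sketch of this procedure in R (with a hypothetical data frame dat containing a response y and predictors x1 through x5) simply compares the reduced and full fits with anova():
##########
set.seed(2)
dat        <- data.frame(matrix(rnorm(50 * 5), ncol = 5))
names(dat) <- paste0("x", 1:5)
dat$y      <- 3 + 2 * dat$x2 - dat$x5 + rnorm(50)   # hypothetical data

full    <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
reduced <- lm(y ~ x2 + x5, data = dat)   # model implied by H0: b1 = b3 = b4 = 0

# anova() computes F = [(SSE(red) - SSE(full)) / (dfE(red) - dfE(full))] / MSE(full)
anova(reduced, full)
##########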
11.3 Extra Sums of Squares

The extra sums of squares measure the marginal reduction in the SSE
when one or more predictor variables are added to the regression model given
that other predictors are already in the model. In probability theory, we write
A|B which means that event A happens GIVEN that event B happens (the
vertical bar means given). We also utilize this notation when writing extra
sums of squares. For example, suppose we are considering two predictors,
X1 and X2 . The SSE when both variables are in the model is smaller than
when only one of the predictors is in the model. This is because when both
variables are in the model, they both explain additional variability in Y
which drives down the SSE compared to when only one of the variables is
in the model. This difference is what we call the extra sums of squares. For
example,
$$ SSR(X_1|X_2) = SSE(X_2) - SSE(X_1, X_2), $$
which measures the marginal effect of adding X1 to the model, given that
X2 is already in the model. An equivalent expression is to write
$$ SSR(X_1|X_2) = SSR(X_1, X_2) - SSR(X_2), $$
which can be viewed as the marginal increase in the regression sum of squares. Notice (for the second formulation) that the corresponding degrees of freedom is (3 − 1) − (2 − 1) = 1 (because the df for $SSR(X_1, X_2)$ is 3 − 1 = 2 and the df for $SSR(X_2)$ is 2 − 1 = 1). Thus,
$$ MSR(X_1|X_2) = \frac{SSR(X_1|X_2)}{1}. $$
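In R, these extra sums of squares are just the sequential sums of squares that anova() reports; a brief sketch, reusing the hypothetical dat data frame from the previous sketch:
##########
# The x1 row of this table is SSR(X1 | X2), because x2 enters the model first
anova(lm(y ~ x2 + x1, data = dat))

# Equivalently, SSR(X1 | X2) = SSE(X2) - SSE(X1, X2):
deviance(lm(y ~ x2, data = dat)) - deviance(lm(y ~ x2 + x1, data = dat))
##########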

When more predictors are available, there is a vast array of possible decompositions of the SSR into extra sums of squares. One generic formulation is that if you have p predictors, then
$$ \begin{aligned} SSR(X_1, \ldots, X_j, \ldots, X_p) = {} & SSR(X_j) + SSR(X_1|X_j) + \ldots \\ & + SSR(X_{j-1}|X_1, \ldots, X_{j-2}, X_j) \\ & + SSR(X_{j+1}|X_1, \ldots, X_j) + \ldots \\ & + SSR(X_p|X_1, \ldots, X_{p-1}). \end{aligned} $$
In the above, j is just being used to indicate any one of the p predictors.
You can also calculate the marginal increase in the regression sum of squares
when adding more than one predictor. One generic formulation is


$$ \begin{aligned} SSR(X_1, \ldots, X_j|X_{j+1}, \ldots, X_p) = {} & SSR(X_1|X_{j+1}, \ldots, X_p) \\ & + SSR(X_2|X_1, X_{j+1}, \ldots, X_p) + \ldots \\ & + SSR(X_{j-1}|X_1, \ldots, X_{j-2}, X_{j+1}, \ldots, X_p) \\ & + SSR(X_j|X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p). \end{aligned} $$
Again, you could imagine many such possibilities.
The primary use of extra sums of squares is in the testing of whether or
not certain predictors can be dropped from the model (i.e., the general linear
F -test). Furthermore, they can also be used to calculate a version of R2 for
such models, called the partial R2 .

11.4 Lack of Fit Testing in the Multiple Regression Setting

Formal lack of fit testing can also be performed in the multiple regression
setting; however, achieving replicates can be more difficult as more
predictors are added to the model. Note that the corresponding ANOVA
table (Table 11.2) is similar to that introduced for the simple linear regression
setting. However, now we have the notion of p regression parameters and the
number of replicates (m) refers to the number of unique X vectors. In other
words, each predictor must have the same value for two observations for it
to be considered a replicate. For example, suppose we have 3 predictors for
our model. The observations (40, 10, 12) and (40, 10, 7) are unique levels
for our X vectors, whereas the observations (10, 5, 13) and (10, 5, 13) would
constitute a replicate.
Formal lack of fit testing in multiple regression can be difficult due to
sparse data, unless the experiment was designed properly to achieve replicates. However, other methods can be employed for lack of fit testing when
you do not have replicates. Such methods involve data subsetting. The basic approach is to establish criteria by introducing indicator variables, which
in turn creates coded variables (as discussed earlier). By coding the variables,
you can artificially create replicates and then you can proceed with lack of fit
testing. Another approach with data subsetting is to look at central regions
of the data (i.e., observations where the leverage is less than (1.1) p/n) and
treat this as a reduced data set. Then compare this reduced fit to the full fit

Source          df      SS      MS      F
Regression      p − 1   SSR     MSR     MSR/MSE
Error           n − p   SSE     MSE
  Lack of Fit   m − p   SSLOF   MSLOF   MSLOF/MSPE
  Pure Error    n − m   SSPE    MSPE
Total           n − 1   SSTO

Table 11.2: ANOVA table for multiple linear regression which includes a lack of fit test.

(i.e., the fit with all of the data), for which the formulas for a lack of fit test
can be employed. Be forewarned that these methods should only be used as
exploratory methods and they are heavily dependent on what sort of data
subsetting method is used.

11.5 Partial R²

Suppose we have set up a general linear F-test. Then, we may be interested
in seeing what percent of the variation in the response cannot be explained
by the predictors in the reduced model (i.e., the model specified by H0 ), but
can be explained by the rest of the predictors in the full model. If we obtain
a large percentage, then it is likely we would want to specify some or all of
the remaining predictors to be in the final model since they explain so much
variation.
The way we formally define this percentage is by what is called the partial R2 (or it is also called the coefficient of partial determination).
Specifically, suppose we have three predictors: X1 , X2 , and X3 . For the corresponding multiple regression model (with response Y ), we wish to know
what percent of the variation is explained by X2 and X3 which is not explained by X1 . In other words, given X1 , what additional percent of the
variation can be explained by X2 and X3 ? Note that here the full model will
include all three predictors, while the reduced model will only include X1 .
After obtaining the relevant ANOVA tables for these two models, the
partial R² is as follows:
$$ \begin{aligned} R^2_{Y,2,3|1} &= \frac{SSR(X_2, X_3|X_1)}{SSE(X_1)} \\ &= \frac{SSE(X_1) - SSE(X_1, X_2, X_3)}{SSE(X_1)} \\ &= \frac{SSE(\mathrm{reduced}) - SSE(\mathrm{full})}{SSE(\mathrm{reduced})}. \end{aligned} $$
Then, this gives us the proportion of variation explained by $X_2$ and $X_3$ that cannot be explained by $X_1$. Note that the last line of the above equation just demonstrates that the partial $R^2$ has a similar form to the $R^2$ that we calculated in the simple linear regression case.
More generally, consider partitioning the predictors $X_1, X_2, \ldots, X_k$ into two groups, A and B, containing u and k − u predictors, respectively. The proportion of variation explained by the predictors in group B that cannot be explained by the predictors in group A is given by
$$ \begin{aligned} R^2_{Y,B|A} &= \frac{SSR(B|A)}{SSE(A)} \\ &= \frac{SSE(A) - SSE(A,B)}{SSE(A)}. \end{aligned} $$

These partial R² values can also be used to calculate the power for the corresponding general linear F-test. The power of this test is calculated by first finding the tabled $100(1-\alpha)$th percentile of the $F_{u,n-k-1}$-distribution, namely $F_{u,n-k-1;1-\alpha}$. The non-centrality parameter $\lambda$ of the relevant non-central $F_{u,n-k-1}$-distribution is calculated as:
$$ \lambda = n\,\frac{R^2_{Y,A,B} - R^2_{Y,B}}{1 - R^2_{Y,A,B}}. $$
Finally, the power is simply the probability that an $F_{u,n-k-1}(\lambda)$ random variable exceeds the tabled value $F_{u,n-k-1;1-\alpha}$; i.e., the probability is computed under the non-central $F_{u,n-k-1}(\lambda)$-distribution.
11.6 Partial Leverage and Partial Residual Plots

Next we establish a way to visually assess the relationship of a given predictor to the response when accounting for all of the other predictors in the multiple linear regression model. Suppose we have p − 1 predictors $X_1, \ldots, X_{p-1}$ and that we are trying to assess each predictor's relationship with a response Y given that the other predictors are already in the model. Let $r_{Y_{[j]}}$ denote the residuals that result from regressing Y on all of the predictors except $X_j$. Moreover, let $r_{X_{[j]}}$ denote the residuals that result from regressing $X_j$ on all of the remaining p − 2 predictors. A partial leverage regression plot (also referred to as an added variable plot, adjusted variable plot, or individual coefficients plot) is constructed by plotting $r_{Y_{[j]}}$ on the y-axis and $r_{X_{[j]}}$ on the x-axis.1 Then, the relationship between these two sets of residuals is examined to provide insight into $X_j$'s contribution to the response given that the other p − 2 predictors are in the model. This is a helpful exploratory measure if you are not quite sure what type of relationship (e.g., linear, quadratic, logarithmic, etc.) the response may have with a particular predictor.
More formally, partial leverage is used to measure the contribution of the individual predictor variables to the leverage of each observation. That is, if $h_{i,i}$ is the ith diagonal entry of the hat matrix, the partial leverage is a measure of how $h_{i,i}$ changes as a variable is added to the regression model. The partial leverage is computed as:
$$ (h_j)_i = \frac{r^2_{X_{[j]},i}}{\sum_{k=1}^{n} r^2_{X_{[j]},k}}. $$
If you do a simple linear regression of $r_{Y_{[j]}}$ on $r_{X_{[j]}}$, you will see that the OLS line goes through the origin. This is because the OLS line goes through the point $(\bar{r}_{X_{[j]}}, \bar{r}_{Y_{[j]}})$ and each of these means is 0. Thus, a regression through the origin can be fit to these data. Moreover, the slope from this regression-through-the-origin fit is equal to the slope for $X_j$ if it were included in the full model where Y is regressed on all p − 1 predictors.
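A sketch of constructing one of these added variable plots by hand in R (for x1, reusing the hypothetical dat data frame from the sketches above; the avPlots() function in the car package produces the same plots directly):
##########
rY <- resid(lm(y ~ x2 + x3, data = dat))    # residuals of y on all predictors but x1
rX <- resid(lm(x1 ~ x2 + x3, data = dat))   # residuals of x1 on the remaining predictors

plot(rX, rY)                                 # the added variable plot for x1
coef(lm(rY ~ rX - 1))                        # slope of the through-the-origin fit ...
coef(lm(y ~ x1 + x2 + x3, data = dat))["x1"] # ... equals b1 from the full model

rX^2 / sum(rX^2)                             # partial leverages for x1
##########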
We have given a great deal of attention to various residual measures as well as plots to assess the adequacy of our fitted model.

1 Note that you can produce p − 1 partial leverage regression plots (i.e., one for each predictor).

We can also use another type of residual to check the assumption of linearity for each predictor. Partial residuals are residuals that have not been adjusted for a particular predictor variable (say, $X_j$). Suppose we partition the X matrix such that $\mathbf{X} = (\mathbf{X}_{(-j)}, \mathbf{X}_j)$, where $\mathbf{X}_{(-j)}$ is the same as the X matrix but with the vector of observations for the predictor $X_j$ (i.e., the column $\mathbf{X}_j$) omitted. Similarly, let us partition the vector of estimated regression coefficients as $\mathbf{b} = (\mathbf{b}_{(-j)}^{\mathrm{T}}, b_j)^{\mathrm{T}}$. Then, the set of partial residuals for the predictor $X_j$ would be
$$ \begin{aligned} \mathbf{e}_j &= \mathbf{Y} - \mathbf{X}_{(-j)}\mathbf{b}_{(-j)} \\ &= \mathbf{Y} - \hat{\mathbf{Y}} + \hat{\mathbf{Y}} - \mathbf{X}_{(-j)}\mathbf{b}_{(-j)} \\ &= \mathbf{e} - (\mathbf{X}_{(-j)}\mathbf{b}_{(-j)} - \mathbf{X}\mathbf{b}) \\ &= \mathbf{e} + b_j\mathbf{X}_j. \end{aligned} $$
Note in the above that $b_j$ is just a univariate quantity, so $b_j\mathbf{X}_j$ is still an n-dimensional vector. Finally, a plot of $\mathbf{e}_j$ versus $\mathbf{X}_j$ has slope $b_j$. The more
the data deviates from a straight-line fit for this plot (which is sometimes
called a component plus residual plot), the greater the evidence that a
higher-ordered term or transformation on this predictor variable is necessary.
Note also that the vector e would provide the residuals if a straight-line fit
were made to these data.
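A sketch of a component plus residual plot for x1 in R (again with the hypothetical dat data frame; crPlots() in the car package automates this for every predictor):
##########
fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Partial residuals for x1: ordinary residuals plus b1 * x1
pr1 <- resid(fit) + coef(fit)["x1"] * dat$x1

plot(dat$x1, pr1)            # component plus residual plot
abline(0, coef(fit)["x1"])   # straight-line reference with slope b1

library(car)
crPlots(fit)                 # the same plots, one per predictor
##########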

11.7 Examples

Example 1: Heat Flux Data Set (continued )


Refer back to the raw data given in Table 7.1. We will again look at the
full model which includes all of the predictor variables (even though we have
shown which predictors should likely be removed from the model). The
response is still y = total heat flux, while the predictors are x1 = insolation
recording, x2 = east focal point, x3 = south focal point, x4 = north focal
point, and x5 = time of the recordings. So the full model will be
$$ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, $$
where $\boldsymbol{\beta} = (\beta_0\ \beta_1\ \beta_2\ \beta_3\ \beta_4\ \beta_5)^{\mathrm{T}}$, Y is a 29-dimensional response vector, X is a $29 \times 6$-dimensional design matrix, and $\boldsymbol{\epsilon}$ is a 29-dimensional error vector.
First, here is the ANOVA for the model with all of the predictors:
##########
Analysis of Variance Table

Response: flux
           Df  Sum Sq Mean Sq F value    Pr(>F)
Regression  5 13195.5  2639.1  40.837 1.077e-10 ***
Residuals  23  1486.4    64.6
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
Now, using the results from earlier (which seem to indicate a model including
only the north and south focal points as predictors) let us test the following
hypothesis:
$$ H_0: \beta_1 = \beta_2 = \beta_5 = 0 $$
$$ H_A: \text{at least one of } \{\beta_1, \beta_2, \beta_5\} \neq 0. $$
In other words, we only want our model to include the south (x3 ) and north
(x4 ) focal points. We see that MSE(full) = 64.63, SSE(full) = 1486.40, and
dfE (full) = 23.
Next we calculate the ANOVA table for the above null hypothesis:
##########
Analysis of Variance Table

Response: flux
           Df  Sum Sq Mean Sq F value    Pr(>F)
Regression  2 12607.6  6303.8  79.013 8.938e-12 ***
Residuals  26  2074.3    79.8
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
The ANOVA for this analysis shows that SSE(reduced) = 2074.33 and
dfE (reduced) = 26. Thus, the F -statistic is:
$$ F = \frac{\frac{2074.33 - 1486.40}{26 - 23}}{64.63} = 3.027998, $$
which follows an $F_{3,23}$ distribution. The p-value (i.e., the probability of getting an F-statistic as extreme as or more extreme than 3.03 under an $F_{3,23}$ distribution) is 0.0499. Thus we just barely claim statistical significance and conclude that at least one of the other predictors (insolation, east focal point, and time) is a statistically significant predictor of heat flux.
We can also calculate the power of this F-test by using the partial R² values. Specifically,
$$ R^2_{Y,1,2,3,4,5} = 0.8987602 $$
$$ R^2_{Y,3,4} = 0.8587154 $$
$$ \lambda = (29)\,\frac{0.8987602 - 0.8587154}{1 - 0.8987602} = 11.47078. $$

Therefore, $P(F > 3.027998)$ under an $F_{3,23}(11.47078)$-distribution gives the power, which is as follows:
##########
         Power
F-Test 0.7455648
##########
Notice that this is not very powerful (i.e., we usually hope to attain a power
of at least 0.80). Thus the probability of committing a Type II error is
somewhat high.
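A sketch of this power calculation in R, using the degrees of freedom and non-centrality parameter reported above:
##########
df1 <- 3           # dfE(reduced) - dfE(full)
df2 <- 23          # dfE(full)
ncp <- 11.47078    # non-centrality parameter computed above

f.crit <- qf(0.95, df1, df2)          # tabled 95th percentile of the central F
1 - pf(f.crit, df1, df2, ncp = ncp)   # power; compare with the value reported above
##########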
We can also calculate the partial R² for this testing situation:
$$ \begin{aligned} R^2_{Y,1,2,5|3,4} &= \frac{SSE(X_3, X_4) - SSE(X_1, X_2, X_3, X_4, X_5)}{SSE(X_3, X_4)} \\ &= \frac{2074.33 - 1486.4}{2074.33} \\ &= 0.2834. \end{aligned} $$

This means that insolation, the east focal point, and time explain about
28.34% of the variation in heat flux that could not be explained by the north
and south focal points.
Example 2: Simulated Data for Partial Leverage Plots
Suppose we have a response variable, Y , and two predictors, X1 and X2 . Let
us consider three settings:
1. Y is a function of X1 ;
2. Y is a function of X1 and X2 ; and
3. Y is a function of $X_1$, $X_2$, and $X_2^2$.
Setting (3) is a quadratic regression model which falls under the polynomial
regression framework, which we discuss in greater detail later. Figure 11.1
shows the partial leverage regression plots for X1 and X2 for each of these
three settings. In Figure 11.1(a), we see how the plot indicates that there
is a strong linear relationship between Y and X1 when X2 is in the model,
but this is not the case between Y and X2 when X1 is in the model (Figure
11.1(b)). Figures 11.1(c) and 11.1(d) show that there is a strong linear relationship between Y and X1 when X2 is in the model as well as between Y and X2 when X1 is in the model. Finally, Figure 11.1(e) shows that there is a linear relationship between Y and X1 when X2 is in the model, while Figure 11.1(f) indicates a quadratic (i.e., curvilinear) relationship between Y and X2 when X1 is in the model.
For setting (2), the fitted model when Y is regressed on X1 and X2 is $\hat{Y} = 9.92 + 6.95X_1 - 4.20X_2$, as shown in the following output:


Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.9174     2.4202   4.098 0.000164 ***
X1            6.9483     0.3427  20.274  < 2e-16 ***
X2           -4.2044     0.3265 -12.879  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.058 on 47 degrees of freedom
Multiple R-squared: 0.915,     Adjusted R-squared: 0.9114
F-statistic:   253 on 2 and 47 DF,  p-value: < 2.2e-16
The slope from the regression through the origin fit of rY[1] on rX[1] is 6.95
while the slope from the regression through the origin fit of rY[2] on rX[2] is
-4.20. The output from both of these fits is given below:
Coefficients:
      Estimate Std. Error t value Pr(>|t|)
r_X.1   6.9483     0.3357    20.7   <2e-16 ***

Figure 11.1: Scatterplots of (a) rY[1] versus rX[1] and (b) rY[2] versus rX[2] for
setting (1), (c) rY[1] versus rX[1] and (d) rY[2] versus rX[2] for setting (2), and
(e) rY[1] versus rX[1] and (f) rY[2] versus rX[2] for setting (3).
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.933 on 49 degrees of freedom
Multiple R-squared: 0.8974,     Adjusted R-squared: 0.8953
F-statistic: 428.5 on 1 and 49 DF,  p-value: < 2.2e-16
--------------------------------------------------------------
Coefficients:
      Estimate Std. Error t value Pr(>|t|)
r_X.2  -4.2044     0.3197  -13.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.933 on 49 degrees of freedom
Multiple R-squared: 0.7792,     Adjusted R-squared: 0.7747
F-statistic: 172.9 on 1 and 49 DF,  p-value: < 2.2e-16
Note that while the slopes from these partial leverage regression routines
are the same as their respective slopes in the full multiple linear regression
routine, the other statistics regarding hypothesis testing are not the same
since, fundamentally, different assumptions are made for the respective tests.
