Chap. 11, page 1
Math 445 Chapter 11 Model Checking and Refinement

Rainfall data

In the rainfall data, we ended up leaving out case 28 (Death Valley) because it had a large residual and
its altitude was the lowest in the data set. The resulting model is therefore not applicable to such
low-altitude locations. If case 28 had not been unusual in this way, we would not have been justified in
omitting it.

Without #28 (Death Valley)


Coefficients (a)

                         Unstandardized Coefficients   Standardized
                                                       Coefficients
Model                    B            Std. Error       Beta           t        Sig.
1  (Constant)            -2.074       .525                            -3.951   .001
   Altitude (ft)         .000725      .000241          4.647          3.012    .006
   Latitude (degrees)    .093924      .014285          .773           6.575    .000
   Rainshadow            -.431176     .059929          -.662          -7.195   .000
   Altitude*Latitude     -.000019     .000006          -4.620         -2.959   .007

a. Dependent Variable: Log10(Precipitation)

R² = .80

Since there is an interaction between Altitude and Latitude, interpretation of the coefficients for these
variables becomes a little complicated. However, we can interpret the effect of the Rainshadow variable
in this model.
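
Because the response is Log10(Precipitation), the Rainshadow coefficient can be back-transformed into a
multiplicative effect. The following is a minimal sketch of that arithmetic. It assumes that Rainshadow is
a 0/1 indicator, which the SPSS output does not state explicitly, and it uses the usual reading of a
log-scale coefficient as a ratio of medians.

```python
# Back-transform the Rainshadow coefficient from the log10 scale.
# Assumption (not stated in the output): Rainshadow is coded 0/1.
b_rainshadow = -0.431176  # coefficient from the table above (case 28 omitted)

# A one-unit change in Rainshadow multiplies the median precipitation by 10**b,
# holding Altitude and Latitude fixed.
ratio = 10 ** b_rainshadow
percent_lower = 100 * (1 - ratio)

print(f"multiplicative effect: {ratio:.3f}")
print(f"about {percent_lower:.0f}% lower when Rainshadow = 1")
```

Under these assumptions the ratio is roughly 0.37, so median precipitation at a rain-shadow location is
estimated to be a little over a third of that at a comparable location outside the rain shadow.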
Case 28 is an example of an outlier, a case for which the model does not fit well. Outliers have large
residuals. We are also interested in influential cases, cases whose omission changes the fitted model
substantially. Influential cases may not be outliers. Least squares is sensitive to unusual cases and an
influential case may “pull” the regression plane toward it so much that it does not have a large residual.
In simple linear regression, we could often identify influential cases simply from a scatterplot. In
multiple regression, it may not be possible to see influential cases in pairwise scatterplots and we need
additional tools.

Case-Influence statistics

Leverage:
The leverage of a case is based only on the values of the explanatory variables. It measures the distance
of the case from the mean for the explanatory variables (in multidimensional space). For one
explanatory variable, the leverage is

\[
h_i = \frac{1}{n-1}\left[\frac{X_i - \bar{X}}{s_X}\right]^2 + \frac{1}{n}
    = \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} + \frac{1}{n}
\]

With more than one explanatory variable, the leverage is a measure of distance in higher-dimensional
space. The distance takes into account the joint variability of the variables – see Display 11.10 on p.
316.

High-leverage cases are easy to identify visually with only one explanatory variable, but become
increasingly difficult to identify visually with more explanatory variables.

Leverages are always between 1/n and 1. The average of all the leverages in a data set is always p/n,
where p is the number of regression coefficients (including the intercept). SPSS computes centered
leverages (under the Linear Regression…Save button), even though it calls them simply “leverages.” The
centered leverage is $h_i - 1/n$. Therefore, the centered leverage is between 0 and 1 − 1/n.

Leverage measures the potential influence of a case. High leverage cases have the potential to change
the least squares fit substantially.
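
With several explanatory variables, the leverages are the diagonal elements of the hat matrix
H = X(X'X)^{-1}X', where X is the design matrix including a column of ones. The sketch below uses made-up
data (the variable names and the random seed are illustrative assumptions, not part of the rainfall
example) to compute leverages and centered leverages and to check the bounds and the average quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)              # made-up data, for illustration only
n = 20
X_vars = rng.normal(size=(n, 2))            # two hypothetical explanatory variables
X = np.column_stack([np.ones(n), X_vars])   # design matrix with an intercept column
p = X.shape[1]                              # number of regression coefficients (3)

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_i.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Centered leverage h_i - 1/n (the quantity the notes say SPSS saves).
centered = h - 1.0 / n

print("min and max leverage:", h.min(), h.max())
print("mean leverage:", h.mean(), " p/n:", p / n)
```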
Studentized residuals

While the true residuals (what we called the $\varepsilon_i$) all have the same standard deviation
$\sigma$ in the regression model, the observed residuals $e_i$ don’t. Why not?

Consider simple linear regression:

• True residual: $\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$
• Observed residual: $e_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$

First, we already know that the observed residuals tend to be smaller in size than the true residuals.
That’s why we divide by n − 2 rather than n when we compute the standard deviation of the observed
residuals to get an estimate of the standard deviation of the true residuals. The observed residuals
tend to be smaller because the least squares line is the line that best fits the data, so the
deviations from this line will tend to be smaller than the deviations from the true line.

What do we mean when we say that the residuals do not all have the same standard deviation? How can
a single value have a standard deviation?

What we mean is this: if we simulated many sets of data from the linear regression model with a fixed
set of $X_i$’s, what would be the standard deviation of the residuals at each $X_i$?

To carry out this simulation, we would follow the steps below. The $X_i$’s remain the same for every
simulation.
1. Generate a set of $Y_i$’s, where each $Y_i$ is from a normal distribution with mean
$\beta_0 + \beta_1 X_i$ and standard deviation $\sigma$. That gives a set of n pairs of values
$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.
2. Fit the least squares line.
3. Compute the residuals.
4. Repeat steps 1-3 many times with a new set of $Y_i$’s each time.

Now look at the distribution of observed residuals for each $X_i$ and, in particular, compute the
standard deviation of the observed residuals at each $X_i$. You will find that the standard deviations
are different and that the standard deviation of the residuals for $X_i$’s far from $\bar{X}$ (high
leverage values) is smaller than for $X_i$’s near $\bar{X}$ (low leverage values). In fact, it can be
shown that the standard deviation of the residual at $X_i$ is:

\[
\mathrm{SD}(\text{Residual}_i) = \sigma \sqrt{1 - h_i}
\]

where $h_i$ is the leverage. This formula applies to any multiple regression model, not just the simple
linear regression model.
Example (simple linear regression):

Suppose $Y_i$ is normal with mean $\mu(Y_i) = 1 + 2X_i$, $i = 1, \ldots, 5$, and standard deviation
$\sigma = 1$, and that the $X_i$’s are 1, 4, 5, 6 and 14.

Here are the $Y_i$’s from one simulation: 3.42, 9.86, 10.05, 12.90, 27.38. The least squares line is
$\hat{Y}_i = 1.73 + 1.83X_i$ and the residuals are: -0.145, 0.803, -0.844, 0.182, 0.004.

Repeating the simulation 10,000 times, here are the mean and standard deviation of the residuals at each
$X_i$:

X_i                          1        4        5        6       14
mean of residuals        0.008    0.002   -0.013    0.000    0.004
std. dev. of residuals   0.737    0.869    0.884    0.900    0.350
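
A simulation of this kind can be sketched as follows. This is not the code used to produce the table; the
random seed is an arbitrary assumption, so the output should agree with the table only to within
simulation error.

```python
import numpy as np

rng = np.random.default_rng(445)        # arbitrary seed (an assumption)
x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
beta0, beta1, sigma = 1.0, 2.0, 1.0
n_sims = 10_000

# Step 1: each row is one simulated set of Y values at the fixed X's.
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=(n_sims, x.size))

# Step 2: least squares slope and intercept for every row at once.
xc = x - x.mean()
slope = (y @ xc) / (xc @ xc)
intercept = y.mean(axis=1) - slope * x.mean()

# Step 3: observed residuals, one row per simulated data set.
resid = y - (intercept[:, None] + slope[:, None] * x)

# Step 4: summarize the residuals at each X over all simulations.
print("mean:", resid.mean(axis=0).round(3))
print("sd:  ", resid.std(axis=0, ddof=1).round(3))
```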

Calculate the leverages for these 5 $X_i$’s:

Use the formula $\mathrm{SD}(\text{Residual}_i) = \sigma\sqrt{1 - h_i}$ to calculate the standard
deviation of the residuals at each $X_i$. How do they match the values estimated from the simulation?
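
One way to check the hand calculation is the short sketch below, which applies the one-variable leverage
formula and then SD = σ√(1 − h_i) to the five X values of this example.

```python
import math

x = [1, 4, 5, 6, 14]
sigma = 1.0
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations from the mean

# Leverage for simple linear regression: h_i = (x_i - x_bar)^2 / Sxx + 1/n
h = [(xi - x_bar) ** 2 / sxx + 1 / n for xi in x]
# Theoretical standard deviation of each observed residual: sigma * sqrt(1 - h_i)
sd = [sigma * math.sqrt(1 - hi) for hi in h]

for xi, hi, sdi in zip(x, h, sd):
    print(f"X = {xi:2d}   leverage = {hi:.3f}   SD(residual) = {sdi:.3f}")
```

The leverages sum to 2 (the number of regression coefficients), and the resulting standard deviations,
approximately 0.73, 0.87, 0.89, 0.89 and 0.35, are close to the simulated values in the table.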

Why are we so concerned about the standard deviation of the residuals at different $X_i$’s?
• Because a big residual is more unusual at a high-leverage point than at a low-leverage point.
Therefore, standardizing the residuals by an estimate of their standard deviation is a better way to
compare residuals. Since residuals always have mean 0, this means dividing each residual by an
estimate of its standard deviation $\sigma\sqrt{1 - h_i}$. Since we don’t know $\sigma$, we replace it
by $\hat{\sigma}$ (the square root of mean square residual in the ANOVA table).
• The studentized residual is

\[
\text{studres}_i = \frac{\text{res}_i}{\hat{\sigma}\sqrt{1 - h_i}}
\]
• Studentized residuals are also sometimes called internally studentized residuals. In SPSS, they are
called “studentized residuals” (under the Save button on the Linear Regression window).
• A potential problem with the studentized residuals is that $\hat{\sigma}$ may be inflated if a
residual is an outlier. Therefore, a modified version of the studentized residual is the externally
studentized residual, called the studentized deleted residual in SPSS. Here $\hat{\sigma}$ is replaced
by $\hat{\sigma}_{(i)}$, the estimated standard deviation of the residuals from the model fit with the
ith observation omitted:

\[
\text{studres}_i^{*} = \frac{\text{res}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_i}}
\]

Internally and externally studentized residuals can be used in just the same way as the raw residuals: in
residual plots, normal probability plots, etc. In fact, they are preferred to the raw residuals because the
nonconstant variance of the raw residuals has been corrected for.

When examining studentized residuals, one should look for outliers. In addition, one can use the
standard normal distribution as a rough guide for identifying unusual values: e.g., we expect about 5% of
values to be less than -2 or greater than 2 and less than 1% to be outside the range -3 to 3.
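
Both kinds of studentized residual can be computed directly from their definitions. The sketch below
reuses the five-point simulated data set from the earlier example purely for illustration (it is far too
small for real diagnostics) and obtains the deleted estimate of σ by refitting the model without each
case in turn.

```python
import numpy as np

# One simulated data set from the simple linear regression example.
x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
res = y - X @ beta_hat
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverages
sigma_hat = np.sqrt(res @ res / (n - p))         # root mean square residual

# Internally studentized residuals.
stud = res / (sigma_hat * np.sqrt(1 - h))

# Externally studentized: refit without case i to estimate sigma_hat_(i).
stud_ext = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    r_i = y[keep] - X[keep] @ b_i
    sigma_i = np.sqrt(r_i @ r_i / (n - 1 - p))
    stud_ext[i] = res[i] / (sigma_i * np.sqrt(1 - h[i]))

print(np.round(stud, 3))
print(np.round(stud_ext, 3))
```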

Cook’s Distance
A more direct measure of the influence of an observation is Cook’s Distance, which measures how
much the fitted values change when each observation is omitted. For case i,

\[
D_i = \sum_{j=1}^{n} \frac{\left(\hat{Y}_{j(i)} - \hat{Y}_j\right)^2}{p\,\hat{\sigma}^2}
\]

where p is the number of regression coefficients. The numerator of the above expression is what’s
important; the denominator just standardizes the statistic.

$\hat{Y}_j$ is the fitted value for case j when the whole data set is used to fit the model.
$\hat{Y}_{j(i)}$ is the fitted value for case j when case i is omitted in fitting the model. So, for
example, to calculate $D_1$ we omit case 1, fit the model, and calculate the fitted values for all
observations including case 1. We then calculate the sum of the squared differences between these
predicted values and the predicted values from the model fit to all the data, and divide by
$p\hat{\sigma}^2$. A value of Cook’s D close to or greater than 1 is often considered to be indicative
of an observation with large influence.
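
The calculation just described can be written out literally as a leave-one-out loop. The function name
`cooks_distance` is a hypothetical helper, and the five-point data set from the earlier simulation
example is reused only to have something to run. Statistical software normally uses an equivalent
closed-form expression rather than refitting n times.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D computed by refitting the model without each case in turn."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    sigma2_hat = np.sum((y - fitted) ** 2) / (n - p)   # mean square residual
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        fitted_i = X @ beta_i            # fitted values for ALL cases, case i included
        D[i] = np.sum((fitted_i - fitted) ** 2) / (p * sigma2_hat)
    return D

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
X = np.column_stack([np.ones_like(x), x])
D = cooks_distance(X, y)
print(np.round(D, 3))
```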

While Cook’s D is a useful measure if the goal of the model is prediction, it is not as useful for seeing
how a particular coefficient changes when an observation is omitted. However, it can be used to
identify cases to check – omit a case with large Cook’s D and see how the coefficients of interest
change.
Other measures of influence

A number of other measures of influence have been proposed. However, some of these measures are
redundant and it is not necessary to look at all of them. Two others that SPSS computes are DfFits,
which measures how much the predicted value for case i changes when case i is omitted, and DfBetas,
which measures how much the omission of case i changes each of the coefficients in the model (hence,
for each case, there is a separate DfBetas value for each variable).
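
The idea behind both measures can be sketched with the same leave-one-out loop. The code below computes
the raw (unstandardized) changes only; the standardized versions rescale these changes by a standard
error, and that scaling is omitted here. The array names `dffit` and `dfbeta` are illustrative, and the
five-point data set from the earlier example is reused only for convenience.

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

dffit = np.empty(n)          # change in the fitted value for case i itself
dfbeta = np.empty((n, p))    # change in each coefficient, one row per case
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    dffit[i] = X[i] @ beta_hat - X[i] @ beta_i
    dfbeta[i] = beta_hat - beta_i

print(np.round(dffit, 3))
print(np.round(dfbeta, 3))
```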

Rainfall data: model Logprec = Altitude + Latitude + Rainshadow + Altitude*Latitude

Results with all cases included:


Without case #28:

In Sec. 11.4.4, p. 320, Ramsey and Schafer suggest that if “the residual plot from a good inferential
model fails to suggest any problems, there is generally no need to examine case influence statistics at
all.” I would agree except that I would suggest that the residual plot should use the externally
studentized residuals (=studentized deleted residuals).
We next examine two types of plots useful in refining models:
• Partial regression leverage plots (also called added-variable plots) are useful for visually
identifying influential and high leverage points for each regression coefficient separately. These
are not discussed in the text, but are easily available in SPSS.

• Partial residual plots (also called component-plus-residual plots) are useful for identifying
nonlinear relationships in a multiple regression model. These are discussed in the text, but are
not readily available in SPSS. They can be constructed in SPSS, but it’s rather tedious.

It might seem that simply plotting the response variable Y versus each explanatory variable would be
adequate for assessing the relationship between Y and each X variable for a multiple regression model.
However, these plots can be misleading because they do not control for the values of the other X
variables. For example, an apparently strong relationship between Y and $X_1$ may disappear when other
variables are included in the model. If the scatterplot of Y versus $X_1$ looks curved, it does not
necessarily mean that a squared term will be necessary with the other X variables in the model.
Similarly, a case that appears influential in the Y versus $X_1$ scatterplot may not be influential with
the other X variables in the model, and a case that doesn’t appear influential may turn out to be so with
the other X variables in the model.

Plots of the residuals versus each X variable are also inadequate. They are better than Y versus X plots
because they show only the unexplained variation in Y on the y-axis. However, the X variables are not
adjusted for relationships with each other.

Partial regression leverage plots (not in text)


• A partial regression leverage plot (or added-variable plot) attempts to separate out the relationship
between the response and any explanatory variable after adjusting for the other explanatory
variables in the model.
• The steps involved in creating the partial regression leverage plot for variable $X_1$ are:
1. Compute the residuals from the regression of Y on all the X variables in the model except $X_1$.
2. Compute the residuals from the regression of $X_1$ on all the other X variables in the model.
3. Plot the first set of residuals on the y-axis against the second set on the x-axis.

Steps 1-3 are repeated for all the X variables in the model.

The partial regression leverage plot for $X_1$ looks at the relationship between Y and $X_1$ after
adjusting for the other X variables. It turns out that the slope of the least squares line for this plot
is exactly equal to $\hat{\beta}_1$, the coefficient on $X_1$ in the regression model with all the X’s in
it. In addition, high leverage and influential cases for $\hat{\beta}_1$ can be identified from this
scatterplot. This is the primary use of the partial regression leverage plots.
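
The slope property can be checked numerically. The sketch below uses made-up data with two explanatory
variables (the data, seed, and the helper name `ols_residuals` are illustrative assumptions). It carries
out steps 1 and 2 and compares the slope of the resulting plot with the coefficient from the full model.

```python
import numpy as np

def ols_residuals(A, b):
    """Residuals from the least squares regression of b on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ coef

rng = np.random.default_rng(1)                 # made-up data
n = 40
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)             # correlated with x1 on purpose
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

ones = np.ones(n)
others = np.column_stack([ones, x2])           # intercept and every X except x1
full = np.column_stack([ones, x1, x2])

e_y = ols_residuals(others, y)                 # step 1: Y adjusted for the other X's
e_x1 = ols_residuals(others, x1)               # step 2: X1 adjusted for the other X's
# Step 3 would plot e_y (y-axis) against e_x1 (x-axis).

# Both residual sets have mean zero, so the fitted line passes through the origin.
slope_av = (e_x1 @ e_y) / (e_x1 @ e_x1)
beta_full, *_ = np.linalg.lstsq(full, y, rcond=None)
print(slope_av, beta_full[1])
```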

SPSS: Partial regression leverage plots for all X variables can be generated automatically in SPSS by
selecting “Produce all partial plots” under the Plots button of the Regression…Linear window.
Partial residual plots
• A partial residual plot (or component-plus-residual plot) is constructed differently from a partial
regression leverage plot, but also has the property that the slope of the least squares line through
the plot is the coefficient for that variable in the multiple regression model with all the X variables
included.
• Partial residual plots are better than partial regression leverage plots for identifying nonlinear
relationships between Y and an X variable after adjusting for the other X variables in the model.
• If a clear nonlinear relationship is identified, possible solutions include adding the square of the X
variable to the model, transforming the X variable, or transforming the Y variable.
• To construct the partial residual plot for $X_1$, follow the steps below. For the sake of this
example, assume there are three other X variables in the model: $X_2$, $X_3$, $X_4$.

1. Regress Y on all the X variables to obtain
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4$.

2. Compute the partial residuals for $X_1$ as
$\text{pres} = Y - \hat{\beta}_0 - \hat{\beta}_2 X_2 - \hat{\beta}_3 X_3 - \hat{\beta}_4 X_4$.

3. Plot the partial residuals for $X_1$ (on the y-axis) against $X_1$ (on the x-axis).

Steps 1-3 should be repeated for $X_2$, $X_3$, $X_4$.

• Partial residual plots are also useful for identifying high leverage and influential cases.

• SPSS does not automatically produce partial residual plots (recall that “partial plots” in the SPSS
regression menu means “partial regression leverage plots”). It is somewhat of a hassle to produce
these plots in SPSS manually, but it can be done by following steps 1-3. It is easier to replace step
2 by the equivalent calculation:

2. $\text{pres} = \text{res} + \hat{\beta}_1 X_1$, where res is the residual from the full model fit in
step 1.

Thus, the steps are: fit the full model (step 1) and save the residuals as RES_1. Use Transform…
Compute to compute the partial residuals as RES_1 + $\hat{\beta}_1 X_1$, where you will type in the
value for $\hat{\beta}_1$ from the model fit in step 1. Plot the partial residuals versus $X_1$. Repeat
for the other variables.
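
The equivalence of the two versions of step 2, and the slope property of the plot, can be verified with
a short sketch. The data below are made up (four hypothetical explanatory variables standing in for
$X_1$ to $X_4$) and are not the rainfall data.

```python
import numpy as np

rng = np.random.default_rng(2)                  # made-up data
n = 40
X_vars = rng.normal(size=(n, 4))                # columns play the roles of X1..X4
y = 0.5 + X_vars @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_vars])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # step 1: beta[0] is the intercept
res = y - X @ beta                              # residuals from the full model

x1 = X_vars[:, 0]
# Step 2 as defined: remove the intercept and the X2..X4 terms from Y.
pres_direct = y - beta[0] - X_vars[:, 1:] @ beta[2:]
# Equivalent shortcut: full-model residual plus the X1 component.
pres_shortcut = res + beta[1] * x1

# Step 3 would plot pres against x1; this is the least squares slope of that plot.
x1c = x1 - x1.mean()
slope = (x1c @ pres_shortcut) / (x1c @ x1c)
print(slope, beta[1])
```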

A loess smooth can be added to the partial residual plot to help identify non-linear relationships.

The plots that follow include both partial regression leverage plots and partial residual plots for the
rainfall data, where log10(Precip) is regressed on Altitude, Latitude, and Rainshadow with no
interaction. It might be best to look for nonlinear relationships before considering interactions, but
certainly these plots can also be used for models with interactions. Case #28 has been omitted.

Coefficients (a)

                         Unstandardized Coefficients   Standardized
                                                       Coefficients
Model                    B            Std. Error       Beta           t        Sig.
1  (Constant)            -1.137       .479                            -2.372   .026
   Altitude (ft)         .0000139     .0000167         .089           .832     .413
   Latitude (degrees)    .06835       .01302           .562           5.250    .000
   Rainshadow            -.40686      .06795           -.625          -5.988   .000

a. Dependent Variable: Log10(Precipitation)

[Figure panels: Partial regression leverage plots; Partial residual plots]