
Chapter 4: Model Adequacy Checking

In this chapter, we discuss some introductory aspects of model adequacy checking, including: residual analysis, residual plots, detection and treatment of outliers, the PRESS statistic, and testing for lack of fit.

The major assumptions that we have made in regression analysis are:

1. The relationship between the response $Y$ and the regressors is linear, at least approximately.

2. The error term $\varepsilon$ has zero mean.

3. The error term $\varepsilon$ has constant variance $\sigma^2$.

4. The errors are uncorrelated.

5. The errors are normally distributed.

Assumptions 4 and 5 together imply that the errors are independent. Recall that assumption 5 is required for hypothesis testing and interval estimation.

Residual Analysis: The residuals $e_1, e_2, \ldots, e_n$ have the following important properties:

(a) The mean of the $e_i$ is 0.

(b) The estimate of the population variance computed from the $n$ residuals is
$$\frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n-p} = \frac{\sum_{i=1}^{n} e_i^2}{n-p} = \frac{SS_{Res}}{n-p} = MS_{Res}.$$

(c) Since the sum of the $e_i$ is zero, they are not independent. However, if the number of residuals ($n$) is large relative to the number of parameters ($p$), the dependency effect can be ignored in an analysis of residuals.

Standardized Residual: The quantity
$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, 2, \ldots, n,$$
is called the standardized residual. The standardized residuals have mean zero and approximately unit variance. A large standardized residual ($|d_i| > 3$) potentially indicates an outlier.

Recall that
$$e = (I-H)Y = (I-H)(X\beta + \varepsilon) = (I-H)\varepsilon,$$
since $(I-H)X = 0$. Therefore,
$$\operatorname{Var}(e) = \operatorname{Var}[(I-H)\varepsilon] = (I-H)\operatorname{Var}(\varepsilon)(I-H)' = \sigma^2(I-H),$$
because $I-H$ is symmetric and idempotent. In particular, $\operatorname{Var}(e_i) = \sigma^2(1-h_{ii})$.

Studentized Residual: The quantity
$$t_i = \frac{e_i}{\sqrt{MS_{Res}(1-h_{ii})}} = \frac{e_i}{\sqrt{S^2(1-h_{ii})}}, \quad i = 1, 2, \ldots, n,$$
is called the studentized residual. The studentized residuals have approximately a Student's $t$ distribution with $n-p$ degrees of freedom.

If we delete the $i$th observation, fit the regression model to the remaining $n-1$ observations, and calculate the predicted value of $y_i$ corresponding to the deleted observation, the corresponding prediction error is
$$e_{(i)} = y_i - \hat{y}_{(i)}.$$
Generally, a large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well, but a model built without that point predicts poorly.

These prediction errors are usually called PRESS residuals or deleted residuals. It can be shown that
$$e_{(i)} = \frac{e_i}{1-h_{ii}}.$$
Therefore,
$$\operatorname{Var}\!\left(e_{(i)}\right) = \operatorname{Var}\!\left(\frac{e_i}{1-h_{ii}}\right) = \frac{1}{(1-h_{ii})^2}\operatorname{Var}(e_i) = \frac{1}{(1-h_{ii})^2}\left[\sigma^2(1-h_{ii})\right] = \frac{\sigma^2}{1-h_{ii}}.$$
Note that a standardized PRESS residual is
$$\frac{e_{(i)}}{\sqrt{\operatorname{Var}\!\left(e_{(i)}\right)}} = \frac{e_i/(1-h_{ii})}{\sqrt{\sigma^2/(1-h_{ii})}} = \frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}},$$
which, if we use $MS_{Res}$ to estimate $\sigma^2$, is just the studentized residual.

R-student Residual: The quantity
$$r_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}}, \quad i = 1, 2, \ldots, n,$$
is called the R-student residual or jackknife residual, where $S_{(i)}^2$ is the residual variance computed with the $i$th observation removed. It can be shown that
$$S_{(i)}^2 = \frac{(n-p)\,MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}.$$
If the usual assumptions in regression analysis are met, the jackknife residual follows exactly a $t$ distribution with $n-p-1$ degrees of freedom.
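All four residual types can be computed directly from $X$ and $y$. The notes use SAS for this; the following is a minimal Python/NumPy sketch of the same calculations (the function name and structure are my own illustration, not from the original):

```python
import numpy as np

def residual_diagnostics(X, y):
    """Ordinary, standardized, studentized, and R-student residuals
    for the least-squares fit of y on the columns of X."""
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X' and its diagonal (leverages h_ii)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    # Ordinary residuals e = (I - H) y
    e = y - H @ y
    # Residual mean square MS_Res = SS_Res / (n - p)
    ms_res = (e @ e) / (n - p)
    # Standardized residuals d_i = e_i / sqrt(MS_Res)
    d = e / np.sqrt(ms_res)
    # Studentized residuals t_i = e_i / sqrt(MS_Res (1 - h_ii))
    t = e / np.sqrt(ms_res * (1.0 - h))
    # Deleted variances S^2_(i) = [(n-p) MS_Res - e_i^2/(1-h_ii)] / (n-p-1)
    s2_del = ((n - p) * ms_res - e**2 / (1.0 - h)) / (n - p - 1)
    # R-student (jackknife) residuals r_i = e_i / sqrt(S^2_(i) (1 - h_ii))
    r = e / np.sqrt(s2_del * (1.0 - h))
    return e, h, d, t, r
```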

Example 1: Consider the following data:


$$y = \begin{bmatrix} 16 \\ 11 \\ 12 \\ 14 \\ 10 \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 7 & 5 \\ 1 & 3 & 4 \\ 1 & 3 & 6 \\ 1 & 4 & 1 \\ 1 & 5 & 2 \end{bmatrix}$$

$$X'X = \begin{bmatrix} 5 & 22 & 18 \\ 22 & 108 & 79 \\ 18 & 79 & 82 \end{bmatrix}, \qquad
(X'X)^{-1} = \begin{bmatrix} 2.7155 & -0.3967 & -0.2139 \\ -0.3967 & 0.0893 & 0.0010 \\ -0.2139 & 0.0010 & 0.0582 \end{bmatrix}$$

$$H = X(X'X)^{-1}X' = \begin{bmatrix}
0.9252 & -0.0935 & 0.0748 & -0.1121 & 0.2056 \\
-0.0935 & 0.3832 & 0.4268 & 0.1931 & 0.0903 \\
0.0748 & 0.4268 & 0.7030 & -0.1101 & -0.0945 \\
-0.1121 & 0.1931 & -0.1101 & 0.6096 & 0.4195 \\
0.2056 & 0.0903 & -0.0945 & 0.4195 & 0.3790
\end{bmatrix}$$

so that $h_{11} = 0.9252$, $h_{22} = 0.3832$, $h_{33} = 0.7030$, $h_{44} = 0.6096$, $h_{55} = 0.3790$.

The residuals are
$$e = (I-H)y = \begin{bmatrix} 0.84 \\ -0.45 \\ 0.16 \\ 2.26 \\ -2.81 \end{bmatrix}$$
and
$$MS_{Res} = \frac{e'e}{n-p} = \frac{13.9374}{2} = 6.97.$$

The standardized residuals are
$$\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \end{bmatrix} = \frac{e}{\sqrt{MS_{Res}}} = \frac{1}{\sqrt{6.97}}\begin{bmatrix} 0.84 \\ -0.45 \\ 0.16 \\ 2.26 \\ -2.81 \end{bmatrix} = \begin{bmatrix} 0.32 \\ -0.17 \\ 0.06 \\ 0.86 \\ -1.06 \end{bmatrix}$$

The studentized residuals are
$$\begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \\ t_5 \end{bmatrix} = \begin{bmatrix}
0.84/\sqrt{6.97(1-0.9252)} \\
-0.45/\sqrt{6.97(1-0.3832)} \\
0.16/\sqrt{6.97(1-0.7030)} \\
2.26/\sqrt{6.97(1-0.6096)} \\
-2.81/\sqrt{6.97(1-0.3790)}
\end{bmatrix} = \begin{bmatrix} 1.16 \\ -0.22 \\ 0.11 \\ 1.37 \\ -1.35 \end{bmatrix}$$

The deleted residual variances are
$$S_{(1)}^2 = \frac{(5-3)(6.97) - (0.84)^2/(1-0.9252)}{5-3-1} = 4.5$$
$$S_{(2)}^2 = \frac{(5-3)(6.97) - (-0.45)^2/(1-0.3832)}{5-3-1} = 13.6$$
$$S_{(3)}^2 = \frac{(5-3)(6.97) - (0.16)^2/(1-0.7030)}{5-3-1} = 13.9$$
$$S_{(4)}^2 = \frac{(5-3)(6.97) - (2.26)^2/(1-0.6096)}{5-3-1} = 0.86$$
$$S_{(5)}^2 = \frac{(5-3)(6.97) - (-2.81)^2/(1-0.3790)}{5-3-1} = 1.22$$

Finally, the R-student residuals are
$$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \end{bmatrix} = \begin{bmatrix}
0.84/\sqrt{4.5(1-0.9252)} \\
-0.45/\sqrt{13.6(1-0.3832)} \\
0.16/\sqrt{13.9(1-0.7030)} \\
2.26/\sqrt{0.86(1-0.6096)} \\
-2.81/\sqrt{1.22(1-0.3790)}
\end{bmatrix} = \begin{bmatrix} 1.45 \\ -0.15 \\ 0.08 \\ 3.90 \\ -3.23 \end{bmatrix}$$

SAS Output: Residuals, Studentized Residuals and R-student Residuals

Obs    Residuals    Student      Rstudent
  1      0.84112     1.16423      1.45010
  2     -0.44860    -0.21618     -0.15468
  3      0.15888     0.11034      0.07826
  4      2.26168     1.36988      3.89917
  5     -2.81308    -1.35107     -3.23320
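As a check on the hand calculations above, the residual_diagnostics sketch given earlier can be applied to the Example 1 data; it reproduces the SAS values up to rounding:

```python
import numpy as np

# Example 1 data: intercept column, x1, x2
X = np.array([[1, 7, 5],
              [1, 3, 4],
              [1, 3, 6],
              [1, 4, 1],
              [1, 5, 2]], dtype=float)
y = np.array([16, 11, 12, 14, 10], dtype=float)

e, h, d, t, r = residual_diagnostics(X, y)
print(np.round(e, 5))  # approximately [ 0.84112 -0.4486  0.15888  2.26168 -2.81308]
print(np.round(t, 5))  # approximately [ 1.16423 -0.21618  0.11034  1.36988 -1.35107]
print(np.round(r, 5))  # approximately [ 1.45010 -0.15468  0.07826  3.89917 -3.23320]
```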

[Figure: scatter plot of x2 versus x1 for the Example 1 data]

Graphical Analysis of Residuals: (a) Normal probability plot: If the normality assumption is not badly violated, the conclusions reached by a regression analysis in which normality is assumed will generally be reliable and accurate. A very simple method of checking the normality assumption is to construct a normal probability plot of residuals.

Let $e_{(1)}, e_{(2)}, \ldots, e_{(n)}$ be the residuals ranked in increasing order. Note that
$$E\!\left[e_{(i)}\right] \approx \sigma\,\Phi^{-1}\!\left(\frac{i - \tfrac{1}{2}}{n}\right),$$
where $\Phi$ denotes the standard normal cumulative distribution function. Normal probability plots are constructed by plotting the ranked residuals $e_{(i)}$ against the expected normal value $\Phi^{-1}\!\left[(i - \tfrac{1}{2})/n\right]$. The resulting points should lie approximately on a straight line. Substantial departures from a straight line indicate that the distribution is not normal.
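As a rough illustration (my own sketch, not part of the original notes), the plotting positions described above can be computed with NumPy and SciPy:

```python
import numpy as np
from scipy.stats import norm

def normal_prob_plot_points(e):
    """Return (expected normal quantiles, ranked residuals) for a
    normal probability plot of the residuals e."""
    n = len(e)
    e_ranked = np.sort(e)                     # e_(1) <= ... <= e_(n)
    probs = (np.arange(1, n + 1) - 0.5) / n   # (i - 1/2) / n
    q = norm.ppf(probs)                       # Phi^{-1}[(i - 1/2)/n]
    return q, e_ranked

# Points lying close to a straight line support the normality assumption.
```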

If normality is deemed unsatisfactory, the $Y$ values may be transformed (using a log, square root, etc.) to see whether the new set of observations is approximately normal.

(b) Plot of Residuals versus the Fitted Values: A plot of the residuals $e_i$ (or the scaled residuals $d_i$, $t_i$, or $r_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies.

If the plot of residuals versus the fitted values can be contained in a horizontal band, then there are no obvious model defects.

The outward-opening funnel pattern implies that the variance of $\varepsilon$ is an increasing function of $Y$. An inward-opening funnel indicates that the variance of $\varepsilon$ decreases as $Y$ increases. The double-bow pattern often occurs when $Y$ is a proportion between zero and one. The usual approach for dealing with inequality of variance is to apply a suitable transformation to either the regressor or the response variable.

A curved plot indicates nonlinearity. This could mean that other regressor variables are needed in the model; for example, a squared term may be necessary. Transformations of the regressor and/or the response variable may be helpful in these cases.

A plot of residuals versus the predicted values may also reveal one or more unusually large residuals. These points are potential outliers. An extreme predicted value with a large residual could also indicate either that the variance is not constant or that the true relationship between $Y$ and $X$ is not linear. These possibilities should be investigated before the points are considered outliers.

(c) Plot of Residuals versus the Regressors: Plotting the residuals versus the corresponding values of each regressor variable can also be helpful. Once again, a horizontal band containing the residuals is desirable. The funnel and double-bow patterns indicate nonconstant variance. A curved band, or a nonlinear pattern in general, indicates that the assumed relationship between $Y$ and the regressor $X_j$ is not correct. Thus, either higher-order terms in $X_j$ (such as $X_j^2$) or a transformation should be considered.

Note that in simple linear regression it is not necessary to plot residuals versus both the predicted values and the regressor variable, since the predicted values are linear combinations of the regressor values.

(d) Plot of Residuals in Time Sequence: It is a good idea to plot the residuals against time order if the time sequence in which the data were collected is known. If a horizontal band encloses all of the residuals, and the residuals fluctuate in a more or less random fashion within this band, then there is no indication of autocorrelation.

(e) Partial Regression Plots: A limitation of plots of residuals versus regressor variables is that they may not completely show the correct or complete marginal effect of a regressor, given the other regressors in the model. The partial regression plot considers the marginal role of the regressor $X_j$ given the other regressors that are already in the model. In this plot, the response variable $Y$ and the regressor $X_j$ are each regressed against the other regressors in the model, and the residuals are obtained for each regression. The plot of these residuals against each other provides information about the nature of the marginal relationship for the regressor $X_j$ under consideration.

If the regressor $X_j$ enters the model linearly, the partial regression plot should show a linear relationship with slope equal to $\beta_j$, the coefficient of $X_j$ in the multiple linear regression model.

Note that:

Partial regression plots only suggest a possible relationship between the regressor and the response. These plots may not give information about the proper form of the relationship if several variables already in the model are incorrectly specified. It will usually be necessary to investigate several alternative forms for the relationship between the regressor and $Y$, or several transformations. Residual plots for these subsequent models should be examined to identify the best relationship or transformation.

Partial regression plots will not, in general, detect interaction effects among the regressors.

The presence of strong collinearity can cause partial regression plots to give incorrect information about the relationship between the response and the regressor variables.
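The two-regression construction described above is easy to express in code. Below is a minimal Python/NumPy sketch (my own illustration, not from the notes): it regresses $Y$ and the chosen regressor on the remaining columns and returns the two residual vectors whose scatter plot is the partial regression (added-variable) plot.

```python
import numpy as np

def partial_regression_points(X, y, j):
    """Residuals for the partial regression (added-variable) plot of
    regressor column j.  X is assumed to contain an intercept column."""
    X_others = np.delete(X, j, axis=1)
    # Residuals of y regressed on the other regressors
    ey = y - X_others @ np.linalg.lstsq(X_others, y, rcond=None)[0]
    # Residuals of X_j regressed on the other regressors
    ex = X[:, j] - X_others @ np.linalg.lstsq(X_others, X[:, j], rcond=None)[0]
    return ex, ey  # plot ey versus ex
```

A useful property of this construction: the least-squares slope of ey on ex equals the coefficient $\hat{\beta}_j$ from the full model, which is why a linear pattern with slope $\beta_j$ is expected when $X_j$ enters linearly.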

(f) Partial Residual Plots: Suppose that the model contains the regressors $X_1, X_2, \ldots, X_k$. The partial residuals for regressor $X_j$ are defined as
$$e_i^*(Y \mid X_j) = e_i + \hat{\beta}_j x_{ij}, \quad i = 1, 2, \ldots, n,$$
where the $e_i$ are the residuals from the model with all $k$ regressors included. The partial residuals are plotted versus $x_{ij}$, and the interpretation of the partial residual plot is very similar to that of the partial regression plot.
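By contrast with the partial regression plot, the partial residuals need only the full-model fit. A short sketch under the same assumptions as before:

```python
import numpy as np

def partial_residuals(X, y, j):
    """Partial residuals e_i + beta_hat_j * x_ij for regressor column j."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # full-model fit
    e = y - X @ beta                             # full-model residuals
    return e + beta[j] * X[:, j]                 # plot versus X[:, j]
```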

Example 2 (Delivery Time Data): A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting delivery time ($Y$) are the number of cases of product stocked ($X_1$) and the distance walked by the route driver ($X_2$). The engineer
has collected 20 observations on delivery time.

SAS Output:

Fitted regression model of $Y$ on $X_1$ and $X_2$ ($N = 20$):
$$\hat{y} = 2.4123 + 1.6392\,x_1 + 0.0136\,x_2,$$
with $R^2 = 0.9525$, adjusted $R^2 = 0.9469$, and root MSE $= 3.7303$.

[Figure: CDF of the R-student residuals]

[Figure: Q-Q plot of the R-student residuals against normal quantiles]

[Figure: R-student residuals versus predicted values]

[Figure: R-student residuals versus x1]

[Figure: R-student residuals versus x2]

[Figure: partial residual plots for x1 and x2]

PRESS Statistic: PRESS residuals are defined by $e_{(i)} = y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the predicted value of the $i$th observed response based on a fit to the remaining $n-1$ sample points. Large PRESS residuals are potentially useful in identifying observations where the model does not fit the data well, or observations for which the model is likely to provide poor future predictions. The PRESS statistic is defined by
$$PRESS = \sum_{i=1}^{n}\left[y_i - \hat{y}_{(i)}\right]^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^2.$$

PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data. One very important use of the PRESS statistic is in comparing regression models; generally, a model with a small value of PRESS is desired. The PRESS statistic can also be used to compute an $R^2$-like statistic for prediction, say
$$R^2_{Prediction} = 1 - \frac{PRESS}{SS_T}.$$
This statistic gives some indication of the predictive capability of the regression model.
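Since each PRESS residual is just $e_i/(1-h_{ii})$, the statistic does not require refitting the model $n$ times. A minimal NumPy sketch (my own illustration) computing PRESS and the corresponding $R^2$ for prediction:

```python
import numpy as np

def press_statistic(X, y):
    """PRESS = sum of squared deleted residuals, computed from the
    ordinary residuals and the hat-matrix diagonal."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    press = np.sum((e / (1.0 - h)) ** 2)
    # R^2 for prediction: 1 - PRESS / SS_T
    ss_t = np.sum((y - y.mean()) ** 2)
    return press, 1.0 - press / ss_t
```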

Example 2 (Cont.):
$$R^2 = 1 - \frac{SS_{Res}}{SS_T} = 1 - \frac{236.56224}{4977.99610} = 0.9525$$
$$R^2_{Prediction} = 1 - \frac{PRESS}{SS_T} = 1 - \frac{546.03153}{4977.99610} = 0.8903$$
Therefore, we could expect this model to explain about 89.03% of the variation in predicting new observations, as compared to approximately 95.25% of the variability in the original data explained by the least-squares fit.

Lack of Fit of the Regression Model:
