

Math 445 Chapter 11 Model Checking and Refinement

Rainfall data

In the rainfall data, we ended up leaving out case 28 (Death Valley) because it had a large residual and

its altitude was the lowest in the data set. The resulting model is therefore not applicable to such low

altitude locations. If case 28 had not been unusual, then we would not have been justified in omitting it.

Coefficients(a)

Model                  Unstandardized Coefficients   Standardized Coefficients
                       B           Std. Error        Beta         t         Sig.
1  (Constant)          -2.074      .525                           -3.951    .001
   Altitude (ft)        .000725    .000241            4.647        3.012    .006
   Latitude (degrees)   .093924    .014285             .773        6.575    .000
   Rainshadow          -.431176    .059929            -.662       -7.195    .000
   Altitude*Latitude   -.000019    .000006           -4.620       -2.959    .007

a. Dependent Variable: Log10(Precipitation)

R² = .80

Since there is an interaction between Altitude and Latitude, interpretation of the coefficients for these

variables becomes a little complicated. However, we can interpret the effect of the Rainshadow variable

in this model.
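For example (a quick Python check, not part of the notes or the SPSS output): since the response is log10(Precipitation), the Rainshadow coefficient back-transforms to a multiplicative effect on the precipitation scale.

```python
# The Rainshadow coefficient acts on log10(Precipitation), so its effect on
# the original scale is the multiplicative factor 10^(-0.431176).
b_rainshadow = -0.431176
ratio = 10 ** b_rainshadow
print(round(ratio, 2))   # 0.37
```

Holding altitude and latitude fixed, a rainshadow location's median precipitation is estimated to be about 0.37 times that of a comparable non-rainshadow location, i.e., roughly a 63% reduction.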


Case 28 is an example of an outlier, a case for which the model does not fit well. Outliers have large

residuals. We are also interested in influential cases, cases whose omission changes the fitted model

substantially. Influential cases may not be outliers. Least squares is sensitive to unusual cases and an

influential case may “pull” the regression plane toward it so much that it does not have a large residual.

In simple linear regression, we could often identify influential cases simply from a scatterplot. In

multiple regression, it may not be possible to see influential cases in pairwise scatterplots and we need

additional tools.

Case-Influence statistics

Leverage:

The leverage of a case is based only on the values of the explanatory variables. It measures the distance

of the case from the mean for the explanatory variables (in multidimensional space). For one

explanatory variable, the leverage is

h_i = (1/(n − 1)) · [(X_i − X̄)/s_X]² + 1/n = (X_i − X̄)² / Σ(X_i − X̄)² + 1/n

With more than one explanatory variable, the leverage is a measure of distance in higher-dimensional

space. The distance takes into account the joint variability of the variables – see Display 11.10 on p.

316.

High-leverage cases are easy to identify visually with only one explanatory variable, but become

increasingly difficult to identify visually with more explanatory variables.

Leverages are always between 1/n and 1. The average of all the leverages in a data set is always p/n, where p is the number of regression coefficients including the intercept (one more than the number of explanatory variables). SPSS computes centered leverages (under the Linear Regression…Save button), even though it calls them simply "leverages." The centered leverage is h_i − 1/n, so it is between 0 and 1 − 1/n.

Leverage measures the potential influence of a case. High leverage cases have the potential to change

the least squares fit substantially.
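As a sketch in Python/NumPy (not SPSS), leverages can be computed as the diagonal of the hat matrix H = X(X′X)⁻¹X′. The X values below are borrowed from the simulation example later in these notes; the block also checks the bounds and average stated above.

```python
import numpy as np

# One explanatory variable (the X's from the example later in these notes)
x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
n = len(x)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column

# Leverages are the diagonal of the hat matrix H = X (X'X)^{-1} X'
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# The one-variable formula gives the same values:
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h_formula = 1/n + (x - x.mean())**2 / ((x - x.mean())**2).sum()

print(np.round(h, 3))        # leverages: 0.466, 0.243, 0.211, 0.2, 0.881
print(round(h.mean(), 3))    # 0.4 = (number of coefficients)/n = 2/5
```

Note how the case at x = 14, far from the mean of 6, has by far the largest leverage.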


Studentized residuals

While the true residuals (what we called the ε_i) all have the same standard deviation σ in the regression model, the observed residuals e_i don't. Why not?

• True residual: ε_i = Y_i − (β_0 + β_1 X_i)

• Observed residual: e_i = Y_i − (β̂_0 + β̂_1 X_i)

First, we already know that the sizes of the observed residuals tend to be smaller than the sizes of the true residuals. That's why we divide by n − 2 when we compute the standard deviation of the observed residuals to get an estimate of the standard deviation of the true residuals. The reason the observed residuals tend to be smaller is that the least squares line is the line that best fits the data, so the deviations from this line will tend to be smaller than the deviations from the true line.

What do we mean when we say that the residuals do not all have the same standard deviation? How can

a single value have a standard deviation?

What we mean is: what is the standard deviation of the residuals at each X_i across many simulated sets of data from the linear regression model with a fixed set of X_i's?

To carry out this simulation we follow these steps. The X_i's remain the same for every simulation.

1. Generate a set of Y_i's where each Y_i is from a normal distribution with mean β_0 + β_1 X_i and standard deviation σ. That gives a set of n pairs of values (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n).

2. Fit the least squares line.

3. Compute the residuals.

4. Repeat steps 1-3 many times with a new set of Y_i's each time.

Now look at the distribution of observed residuals for each X_i and, in particular, compute the standard deviation of the observed residuals at each X_i. You will find that the standard deviations are different, and that the standard deviation of the residuals for X_i's far from X̄ (high-leverage values) is smaller than for X_i's near X̄ (low-leverage values). In fact, it can be shown that the standard deviation of the residual at X_i is:

SD(Residual_i) = σ √(1 − h_i)

where h_i is the leverage. This formula applies to any multiple regression model, not just the simple linear regression model.


Example (simple linear regression):

Suppose Y_i is normal with mean μ(Y_i) = 1 + 2X_i, i = 1, …, 5, and standard deviation σ = 1, and that the X_i's are 1, 4, 5, 6 and 14.

Here are the Y_i's from one simulation: 3.42, 9.86, 10.05, 12.90, 27.38. The least squares line is Ŷ_i = 1.73 + 1.83 X_i and the residuals are: −0.145, 0.803, −0.844, 0.182, 0.004.

Repeating the simulation 10,000 times, here are the mean and standard deviation of the residuals at each X_i:

X_i                       1       4       5       6      14
mean of residuals       0.008   0.002  -0.013   0.000   0.004
std. dev. of residuals  0.737   0.869   0.884   0.900   0.350

Use the formula SD(Residual_i) = σ √(1 − h_i) to calculate the standard deviation of the residuals at each X_i. How do they match the values estimated from the simulation?
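The simulation can be sketched in Python/NumPy (not SPSS), using the setup assumed in the example: μ(Y_i) = 1 + 2X_i, σ = 1, and the X's fixed at 1, 4, 5, 6, 14.

```python
import numpy as np

rng = np.random.default_rng(445)
x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
n = len(x)
X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages

resids = np.empty((10_000, n))
for k in range(10_000):
    y = 1 + 2*x + rng.normal(0, 1, n)              # step 1: generate the Y's
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # step 2: least squares fit
    resids[k] = y - X @ beta                       # step 3: residuals

print(np.round(resids.std(axis=0), 3))   # simulated SDs, one per x_i
print(np.round(np.sqrt(1 - h), 3))       # formula sigma*sqrt(1 - h_i), with sigma = 1
```

The formula gives roughly 0.731, 0.870, 0.888, 0.894, 0.345, closely matching the simulated values in the table; the high-leverage case at x = 14 has the smallest residual SD.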

Why are we so concerned about the standard deviation of the residuals at different X_i's?

• Because a big residual is more unusual at a high-leverage point than at a low-leverage point. Therefore, standardizing the residuals by an estimate of their standard deviation is a better way to compare residuals. Since residuals always have mean 0, this means dividing each residual by an estimate of its standard deviation σ √(1 − h_i). Since we don't know σ, we replace it by σ̂ (the square root of the mean square residual in the ANOVA table).

• The studentized residual is

  studres_i = res_i / (σ̂ √(1 − h_i)).


• Studentized residuals are also sometimes called internally studentized residuals. In SPSS, they are

called “studentized residuals” (under the Save button on the Linear Regression window).

• A potential problem with the studentized residuals is that σ̂ may be inflated if a residual is an outlier. Therefore, a modified version of the studentized residual is the externally studentized residual, called the studentized deleted residual in SPSS. σ̂ is replaced by σ̂_(i), the estimated standard deviation of the residuals from the model fit with the ith observation omitted.

  studres*_i = res_i / (σ̂_(i) √(1 − h_i))

Internally and externally studentized residuals can be used in just the same way as the raw residuals: in

residual plots, normal probability plots, etc. In fact, they are preferred to the raw residuals because the

nonconstant variance of the raw residuals has been corrected for.

When examining studentized residuals, one should look for outliers. In addition, one can use the standard normal distribution as a rough guide for identifying unusual values: e.g., we expect about 5% of values to be less than −2 or greater than 2, and less than 1% to be outside the range −3 to 3.
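Both versions can be sketched in Python/NumPy (not SPSS), using the simulated data from the earlier example (X's 1, 4, 5, 6, 14 and the Y's listed there):

```python
import numpy as np

# The simulated data from the earlier example
x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
n, p = len(x), 2                          # p = number of regression coefficients
X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
res = y - X @ beta
sigma_hat = np.sqrt((res**2).sum() / (n - p))   # sqrt of the mean square residual

# Internally studentized residuals: res_i / (sigma_hat * sqrt(1 - h_i))
internal = res / (sigma_hat * np.sqrt(1 - h))

# Externally studentized (studentized deleted) residuals:
# re-estimate the residual SD with case i omitted
external = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    res_i = y[keep] - X[keep] @ b_i
    sigma_i = np.sqrt((res_i**2).sum() / (n - 1 - p))
    external[i] = res[i] / (sigma_i * np.sqrt(1 - h[i]))

print(np.round(internal, 2))
print(np.round(external, 2))
```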

Cook’s Distance

A more direct measure of the influence of an observation is Cook’s Distance, which measures how

much the fitted values change when each observation is omitted. For case i,

D_i = Σ_{j=1}^{n} (Ŷ_j(i) − Ŷ_j)² / (p σ̂²)

where p is the number of regression coefficients. The numerator of the above expression is what’s

important; the denominator just standardizes the statistic.

Ŷ_j is the fitted value for case j when the whole data set is used to fit the model. Ŷ_j(i) is the fitted value for case j when case i is omitted in fitting the model. So, for example, to calculate D_1 we omit case 1, fit the model, and calculate the fitted values for all observations including case 1. We then calculate the sum of the squared differences between these predicted values and the predicted values from the model fit to all the data. A value of Cook's D close to or greater than 1 is often considered to be indicative of an observation with large influence.

While Cook’s D is a useful measure if the goal of the model is prediction, it is not as useful for seeing

how a particular coefficient changes when an observation is omitted. However, it can be used to

identify cases to check – omit a case with large Cook’s D and see how the coefficients of interest

change.
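A brute-force Python/NumPy sketch of the definition (again with the small simulated data set from earlier) refits the model with each case deleted; real software uses an equivalent closed form, but the loop mirrors the formula directly:

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
n, p = len(x), 2                          # p = number of regression coefficients
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
res = y - fitted
sigma2_hat = (res**2).sum() / (n - p)     # mean square residual

D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    fitted_i = X @ b_i                    # fitted values for ALL cases, case i included
    D[i] = ((fitted_i - fitted)**2).sum() / (p * sigma2_hat)

print(np.round(D, 3))                     # Cook's distances
```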


Other measures of influence

A number of other measures of influence have been proposed. However, some of these measures are

redundant, and it is not necessary to look at all of them. Two others that SPSS computes are DfFits, which measures how much the predicted value for case i changes when case i is omitted, and DfBetas, which measures how much the omission of case i changes each of the coefficients in the model (hence, for each case, there is a separate DfBetas value for each variable).
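These can be sketched by direct case deletion as well. Note that the quantities below are the raw changes in the fitted value and in the coefficients; SPSS's DfFits and DfBetas also come in standardized forms. This is a Python/NumPy illustration (same small data set as the earlier examples), not SPSS output:

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
n = len(x)
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta

dffit_raw = np.empty(n)          # change in case i's fitted value when case i is omitted
dfbeta_raw = np.empty((n, 2))    # change in each coefficient when case i is omitted
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    dffit_raw[i] = fitted[i] - X[i] @ b_i
    dfbeta_raw[i] = beta - b_i

print(np.round(dffit_raw, 3))    # one value per case
print(np.round(dfbeta_raw, 3))   # one row per case, one column per coefficient
```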


Without case #28:

In Sec. 11.4.4, p. 320, Ramsey and Schafer suggest that if “the residual plot from a good inferential

model fails to suggest any problems, there is generally no need to examine case influence statistics at

all.” I would agree except that I would suggest that the residual plot should use the externally

studentized residuals (=studentized deleted residuals).


We next examine two types of plots useful in refining models:

• Partial regression leverage plots (also called added-variable plots) are useful for visually

identifying influential and high leverage points for each regression coefficient separately. These

are not discussed in the text, but are easily available in SPSS.

• Partial residual plots (also called component-plus-residual plots) are useful for identifying

nonlinear relationships in a multiple regression model. These are discussed in the text, but are

not readily available in SPSS. They can be constructed in SPSS, but it’s rather tedious.

It might seem that simply plotting the response variable Y versus each explanatory variable would be

adequate for assessing the relationship between Y and each X variable for a multiple regression model.

However, these plots can be misleading because they do not control for the values of the other X

variables. For example, an apparently strong relationship between Y and X_1 may disappear when other variables are included in the model. If the scatterplot of Y versus X_1 looks curved, it does not necessarily mean that a squared term will be necessary with the other X variables in the model. Similarly, a case that appears influential in the Y versus X_1 scatterplot may not be influential with the other X variables in the model, and a case that doesn't appear influential may turn out to be so with the other X variables in the model.

Plots of the residuals versus each X variable are also inadequate. They are better than Y versus X plots

because they show only the unexplained variation in Y on the y-axis. However, the X variables are not

adjusted for relationships with each other.

• A partial regression leverage plot (or added-variable plot) attempts to separate out the relationship

between the response and any explanatory variable after adjusting for the other explanatory

variables in the model.

• The steps involved in creating the partial regression leverage plot for variable X_1 are:

1. Compute the residuals from the regression of Y on all the other X variables in the model except X_1.

2. Compute the residuals from the regression of X_1 on all the other X variables in the model.

3. Plot the first set of residuals on the y-axis against the second set on the x-axis.

Steps 1-3 are repeated for all the X variables in the model.

The partial regression leverage plot for X_1 looks at the relationship between Y and X_1 after adjusting for the other X variables. It turns out that the slope of the least squares line for this plot is exactly equal to β̂_1, the coefficient on X_1 in the regression model with all the X's in it. In addition, high-leverage and influential cases for β̂_1 can be identified from this scatterplot. This is the primary use of the partial regression leverage plots.

SPSS: Partial regression leverage plots for all X variables can be generated automatically in SPSS by

selecting “Produce all partial plots” on the Plot menu for the Regression…Linear menu.
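Steps 1-3 can be sketched in Python/NumPy (not SPSS) with made-up data for two explanatory variables, which also verifies the slope property stated above:

```python
import numpy as np

# Made-up data: two explanatory variables
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2*x1 - 0.5*x2 + rng.normal(size=n)

def resid(v, others):
    """Residuals from the least squares regression of v on others (with intercept)."""
    Z = np.column_stack([np.ones(len(v)), others])
    return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]

e_y = resid(y, x2)       # step 1: residuals of Y on the other X's
e_x1 = resid(x1, x2)     # step 2: residuals of X1 on the other X's
# step 3 would plot e_y against e_x1; the least squares slope of that plot is
slope = (e_x1 * e_y).sum() / (e_x1**2).sum()

# Coefficient of X1 in the full model of Y on X1 and X2
Z = np.column_stack([np.ones(n), x1, x2])
beta1_full = np.linalg.lstsq(Z, y, rcond=None)[0][1]

print(abs(slope - beta1_full) < 1e-10)   # True: the slope equals beta1_hat
```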


Partial residual plots

• A partial residual plot (or component-plus-residual plot) is constructed differently from a partial

regression leverage plot, but also has the property that the slope of the least squares line through

the plot is the coefficient for that variable in the multiple regression model with all the X variables

included.

• Partial residual plots are better than partial regression leverage plots for identifying nonlinear

relationships between Y and an X variable after adjusting for the other X variables in the model.

• If a clear nonlinear relationship is identified, possible solutions include adding the square of the X

variable to the model, transforming the X variable, or transforming the Y variable.

• To construct the partial residual plot for X_1, follow these steps. For the sake of this example, assume there are three other X variables in the model: X_2, X_3, X_4.

1. Fit the full model regressing Y on X_1, X_2, X_3, and X_4.

2. Compute the partial residuals for X_1 by subtracting the fitted contributions of the other variables from Y: pres = Y − (β̂_0 + β̂_2 X_2 + β̂_3 X_3 + β̂_4 X_4).

3. Plot the partial residuals for X_1 (on the y-axis) against X_1 (on the x-axis).

• Partial residual plots are also useful for identifying high leverage and influential cases.

• SPSS does not automatically produce partial residual plots (recall that "partial plots" in the SPSS regression menu means "partial regression leverage plots"). It is somewhat of a hassle to produce these plots in SPSS manually, but it can be done by following steps 1-3. It is easier to replace step 2 by the equivalent calculation:

2. pres = res + β̂_1 X_1, where res is the residual from the full model fit in step 1.

Thus, the steps are: fit the full model (step 1) and save the residuals as RES_1. Use Transform…Compute to compute the partial residuals as RES_1 + β̂_1 X_1, where you will type in the value for β̂_1 from the model fit in step 1. Plot the partial residuals versus X_1. Repeat for the other variables.

A loess smooth can be added to the partial residual plot to help identify non-linear relationships.
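A Python/NumPy sketch with made-up data (X_1 plus three other X variables, as in the example above) checks that the shortcut in step 2, pres = res + β̂_1 X_1, matches subtracting the other fitted pieces from Y directly:

```python
import numpy as np

# Made-up data: X1 and three other explanatory variables X2, X3, X4
rng = np.random.default_rng(1)
n = 50
x1, x2, x3, x4 = rng.normal(size=(4, n))
y = 1 + 2*x1 - x2 + 0.5*x3 + 0.25*x4 + rng.normal(size=n)

# Step 1: fit the full model and save the residuals
Z = np.column_stack([np.ones(n), x1, x2, x3, x4])
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
res = y - Z @ beta

# Step 2 (shortcut): pres = res + beta1_hat * X1
pres_shortcut = res + beta[1] * x1
# Step 2 (direct): subtract the fitted contributions of the other variables from Y
pres_direct = y - (beta[0] + beta[2]*x2 + beta[3]*x3 + beta[4]*x4)

print(np.allclose(pres_shortcut, pres_direct))   # True: the two forms agree
# Step 3 would plot pres against x1 (with a loess smooth to help spot curvature).
```

The least squares slope of pres versus X_1 again equals β̂_1 from the full model, which is what makes the plot interpretable.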

The following page contains both partial regression leverage plots and partial residual plots for the

rainfall data where the log10(Precip) is regressed on Altitude, Latitude, and Rainshadow with no

interaction. It might be best to look for nonlinear relationships before considering interactions, but

certainly these plots can also be used for models with interactions. Case #28 has been omitted.

Coefficients(a)

Model                  Unstandardized Coefficients   Standardized Coefficients
                       B           Std. Error        Beta         t         Sig.
1  (Constant)          -1.137      .479                           -2.372    .026
   Altitude (ft)        .0000139   .0000167           .089         .832     .413
   Latitude (degrees)   .06835     .01302             .562        5.250     .000
   Rainshadow          -.40686     .06795            -.625       -5.988     .000

a. Dependent Variable: Log10(Precipitation)

