Chapter 8, page 1

Math 445 Chapter 8: A Closer Look at Assumptions for Simple Linear Regression

Assumptions of linear regression model:


1. Linearity
2. Constant variance
3. Normality
4. Independence

Assumptions 1, 2 and 4 are the most important. Violation of 1 can bias estimates of means and
predictions. Violations of 2 and 4 can lead to under- or over-estimates of standard errors and misleading
inferences and confidence intervals. Violation of 3 is only a problem when sample sizes are small. An
exception is prediction intervals for an individual response which depend critically on the normality
assumption (confidence intervals for the mean response at a particular X are robust to normality because
of the Central Limit Theorem).

Assessing assumptions
Linearity and constant variance assumptions: assess through scatterplots, smoothing (loess, for example),
and residual plots

Example: Ozone level and maximum temperature on 111 days at a location in New Jersey, summer 1973
[Figure: left panel, scatterplot of Ozone (ppb) versus Maximum temperature (F); right panel, Unstandardized Residual versus Unstandardized Predicted Value.]

• The relationship is not linear and the variance appears to increase as temperature increases. These
violations suggest transforming the response variable (transforming the explanatory variable will
not solve the nonconstant variance problem).
• When deciding whether to transform the response variable or the explanatory variable (or both),
sometimes it is helpful to look at histograms of each variable individually. If the distribution of
either variable is skewed, this suggests transforming that variable. In this example, the
distribution of ozone is skewed to the right while the distribution of temperature is roughly
symmetric.
• See Display 8.6 on p. 213 for suggested courses of action for other patterns.
• Recall the ladder of powers: the family of power transformations (Chapter 10 of DeVeaux,
Velleman and Bock). Examples:
  2 represents squaring ($y^2$)
  1 represents no transformation ($y$)
  ½ represents square root ($\sqrt{y}$)
  0, by convention, represents log(y) (to any base)
  -1/2 represents reciprocal square root ($-1/\sqrt{y}$) (the negative preserves the original order)
  -1 represents reciprocal ($-1/y$)

For univariate data, powers less than 1 are often used for variables whose distribution is skewed to the
right; the stronger the skew, often the smaller the power needed (0 is smaller than ½, -1/2 is smaller than
0, etc.).
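
As a rough illustration of the ladder of powers, the sketch below applies several rungs to a small right-skewed data set and reports the sample skewness after each one. The data values and the helper names `ladder_transform` and `skewness` are made up for this example; they are not the ozone measurements and do not come from any particular package.

```python
import math
import statistics

def ladder_transform(y, power):
    """Apply one rung of the ladder of powers to strictly positive data.

    power == 0 means log10 by convention. Negative powers are negated so
    that the transformed values keep the original ordering.
    """
    if any(v <= 0 for v in y):
        raise ValueError("this sketch assumes strictly positive data")
    if power == 0:
        return [math.log10(v) for v in y]
    sign = 1.0 if power > 0 else -1.0
    return [sign * v ** power for v in y]

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed (population) SD."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([(v - m) ** 3 for v in values]) / s ** 3

# Hypothetical right-skewed data (NOT the ozone data).
y = [1, 2, 2, 3, 4, 6, 9, 15, 30, 80]

# Walk down the ladder and watch how the skewness changes.
for p in (1, 0.5, 0, -0.5, -1):
    print(p, round(skewness(ladder_transform(y, p)), 2))
```

For these values the raw data are strongly right-skewed and the log transformation leaves a much smaller skewness; going too far down the ladder can overshoot and produce left skew.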

The log transformation is generally the most interpretable, though other transformations are sometimes
interpretable in special situations (see bottom of p. 216; in particular, the inverse transformation is
interpretable for ratios: miles per gallon, for example, becomes gallons per mile).

Different transformations are easy to try (in the SPSS Chart Editor, power transformations with
non-negative exponents can be applied to X, Y, or both).

A log transformation works well for the Ozone data, making the relationship more linear and the variance
more constant. There is one moderate outlier which we’ll address later.
[Figure: left panel, scatterplot of Log10(Ozone) versus Maximum temperature (F); right panel, Unstandardized Residual versus Unstandardized Predicted Value.]

Coefficients (a)

                              Unstandardized        Standardized
                              Coefficients          Coefficients                    95% Confidence Interval for B
Model                         B        Std. Error   Beta           t        Sig.    Lower Bound   Upper Bound
1  (Constant)                 -.8028   .1976                       -4.062   .000    -1.1945       -.4111
   Maximum temperature (F)     .0294   .0025        .745           11.654   .000      .0244        .0344
a. Dependent Variable: Log10(Ozone)
Before proceeding to the interpretation of this model, we first address the other assumptions: normality
and independence. Normality is not crucial with larger sample sizes, but we should make sure that there
is not strong skewness or outliers. The assumption of a normal distribution at each value of X means that
the residuals $\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$ are assumed to be $N(0, \sigma)$. Thus we can look at the distribution of
the observed residuals $\mathrm{res}_i = e_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$ with a histogram and/or normal probability plot.
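
The observed residuals and the coordinates of a normal probability plot can be computed roughly as in the sketch below. The data are hypothetical, the function names are illustrative, and the plotting positions (i - 0.5)/n are an assumption; SPSS and other packages may use a slightly different formula, so the expected values will not match their output exactly.

```python
from statistics import NormalDist, fmean

def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def residuals(x, y):
    """Observed residuals res_i = Y_i - (b0 + b1 * X_i)."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def qq_pairs(res):
    """(observed, expected normal) pairs for a normal probability plot.

    Assumption: plotting positions (i - 0.5)/n and a N(0, sigma-hat)
    reference distribution with sigma-hat based on n - 2 degrees of freedom.
    """
    n = len(res)
    sigma_hat = (sum(r ** 2 for r in res) / (n - 2)) ** 0.5
    ref = NormalDist(0.0, sigma_hat)
    expected = [ref.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(sorted(res), expected))

# Hypothetical (temperature, log10 ozone) values, NOT the real data.
x = [60, 65, 70, 75, 80, 85, 90]
y = [1.0, 1.2, 1.2, 1.5, 1.6, 1.7, 1.9]

res = residuals(x, y)
for observed, expected in qq_pairs(res):
    print(round(observed, 3), round(expected, 3))
```

If the normality assumption is reasonable, the printed pairs fall close to the line observed = expected.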

[Figure: left panel, histogram of Unstandardized Residual (vertical axis: Frequency); right panel, Normal Q-Q Plot of Unstandardized Residual (Expected Normal Value versus Observed Value).]

The residuals for the log(Ozone) model appear quite symmetrically distributed with only one mild outlier
on the negative end.

The assumption of independence of the residuals can only be judged from the sampling plan and/or from
plotting the residuals versus time order or other covariates that may have been measured. For example, if
these observations had come from two different locations, then the independence assumption would be
violated. We would want to examine a scatterplot with the points from the two locations identified to see
if the relationship were different at the two locations. We would also want to plot the residuals versus day
number to see if there were patterns in the residuals.

[Figure: Unstandardized Residual versus Sequence number (1 to 111).]

Interpretation of transformed model

The fitted model is


$\hat\mu[\log(\text{Ozone}) \mid \text{Temp}] = -.8028 + .0294\,\text{Temp}$

If we transform back by raising 10 to the power of each side, the left-hand side does not become the mean of Y because
the mean of the logged data is not the log of the mean of the raw data. However, if the transformation has
succeeded in making the distribution of the log(Y) values symmetric about their mean, then
$\text{Median}[\log(Y) \mid X] = \mu[\log(Y) \mid X]$

Medians can be transformed back: the median of the logged data is the log of the median of the original
data. Therefore, we can say:

$\text{Estimated median}(\text{Ozone} \mid \text{Temp}) = 10^{-.8028 + .0294\,\text{Temp}} = 10^{-.8028} \cdot 10^{.0294\,\text{Temp}} = (.1575)\,10^{.0294\,\text{Temp}}$

Note that
$\frac{\text{Estimated median}(\text{Ozone} \mid \text{Temp} + 1)}{\text{Estimated median}(\text{Ozone} \mid \text{Temp})} = \frac{(.1575)\,10^{.0294(\text{Temp}+1)}}{(.1575)\,10^{.0294\,\text{Temp}}} = 10^{.0294} = 1.070$

This means that median ozone level is estimated to increase by a factor of 1.070, or 7.0%, for every one
degree increase in maximum temperature (95% confidence interval 5.8% to 8.2%, since $10^{.0244} = 1.058$
and $10^{.0344} = 1.082$).
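
The back-transformation arithmetic can be checked with a few lines of Python. This is only a sketch: the coefficients and confidence limits are copied from the SPSS Coefficients table, while the function and variable names are illustrative.

```python
# Estimates and 95% limits copied from the SPSS Coefficients table.
b0, b1 = -0.8028, 0.0294
b1_lower, b1_upper = 0.0244, 0.0344

def estimated_median_ozone(temp):
    """Back-transformed fit: estimated median ozone (ppb) at a given max temperature (F)."""
    return 10 ** (b0 + b1 * temp)

def percent_increase(slope):
    """Percent change in the median for a one-degree increase, given a log10-scale slope."""
    return 100 * (10 ** slope - 1)

multiplier = 10 ** b0          # the constant in front of 10^(.0294 Temp)
factor = 10 ** b1              # multiplicative change per degree

print("multiplier:", round(multiplier, 4))
print("factor per degree:", round(factor, 3))
print("percent increase:", round(percent_increase(b1), 1))
print("95% CI (percent):", round(percent_increase(b1_lower), 1),
      "to", round(percent_increase(b1_upper), 1))
```

Because the slope acts multiplicatively after back-transformation, the ratio `estimated_median_ozone(t + 1) / estimated_median_ozone(t)` is the same for every temperature `t`.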

Other output from the Regression procedure in SPSS

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .745 (a)   .555       .551                .25207
a. Predictors: (Constant), Maximum temperature (F)

ANOVA (b)

Model            Sum of Squares   df    Mean Square   F         Sig.
1  Regression     8.629             1   8.629         135.813   .000 (a)
   Residual       6.926           109    .064
   Total         15.555           110
a. Predictors: (Constant), Maximum temperature (F)
b. Dependent Variable: Log10(Ozone)

Coefficients (a)

                              Unstandardized        Standardized
                              Coefficients          Coefficients                    95% Confidence Interval for B
Model                         B        Std. Error   Beta           t        Sig.    Lower Bound   Upper Bound
1  (Constant)                 -.8028   .1976                       -4.062   .000    -1.1945       -.4111
   Maximum temperature (F)     .0294   .0025        .745           11.654   .000      .0244        .0344
a. Dependent Variable: Log10(Ozone)

The t-statistics and P-values (“Sig.”) reported in the Coefficients table are for testing the hypothesis
$H_0: \beta_0 = 0$ and the hypothesis $H_0: \beta_1 = 0$. The former is usually not of interest, but the latter is a test
of the equal-means model.

The ANOVA table is precisely analogous to the ANOVA table for comparing several groups. It
compares the linear regression model with 2 parameters for the means ($\beta_0$ and $\beta_1$), which is the full
model, to the equal-means model $\mu(Y \mid X) = \beta_0$, which is the reduced model.

• Total sum of squares = residual sum of squares for equal-means (reduced) model = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.
• Residual sum of squares = residual sum of squares for full model = $\sum_{i=1}^{n} [Y_i - (\hat\beta_0 + \hat\beta_1 X_i)]^2$.
• Mean square residual = $\frac{1}{n-2} \sum_{i=1}^{n} \mathrm{res}_i^2 = \hat\sigma^2$
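
The three quantities above can be computed directly from data. The sketch below uses a tiny hypothetical data set, and the function name `anova_sums` is illustrative; it is meant only to show how the reduced-model and full-model residual sums of squares relate.

```python
from statistics import fmean

def anova_sums(x, y):
    """Sums of squares for simple linear regression (full vs. equal-means model)."""
    n = len(y)
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # least-squares slope
    b0 = ybar - b1 * xbar          # least-squares intercept

    # Residual SS of the reduced (equal-means) model.
    total_ss = sum((yi - ybar) ** 2 for yi in y)
    # Residual SS of the full (regression) model.
    residual_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    return {
        "total_ss": total_ss,
        "residual_ss": residual_ss,
        "regression_ss": total_ss - residual_ss,
        "mean_square_residual": residual_ss / (n - 2),  # estimate of sigma^2
    }

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
for name, value in anova_sums(x, y).items():
    print(name, round(value, 4))
```

The regression sum of squares is obtained here by subtraction, which mirrors how the ANOVA table splits the total variation into an explained part and a residual part.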

The F-test is a test of the simple linear regression model versus the equal-means model. Since the only
difference between the two models is the parameter $\beta_1$, this is a two-sided test of the hypothesis
$H_0: \beta_1 = 0$. This is mathematically equivalent to the t-test of this hypothesis that is reported in the
regression coefficients table.
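
The equivalence can be checked numerically against the SPSS output: the F-statistic should equal the square of the slope's t-statistic, up to rounding in the printed tables. A minimal check, with all numbers copied from the ANOVA and Coefficients tables and variable names chosen for illustration:

```python
# Values copied from the SPSS ANOVA and Coefficients tables.
ss_regression, df_regression = 8.629, 1
ss_residual, df_residual = 6.926, 109
t_slope = 11.654
f_reported = 135.813

# F is the ratio of the regression mean square to the residual mean square.
f_from_mean_squares = (ss_regression / df_regression) / (ss_residual / df_residual)

# With one numerator degree of freedom, F equals the squared t-statistic.
f_from_t = t_slope ** 2

print(round(f_from_mean_squares, 2), round(f_from_t, 2), f_reported)
```

The small discrepancies among the three values come from rounding in the printed sums of squares and t-statistic.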

$R^2$: the proportion of variation explained by the model

The R-squared statistic, or coefficient of determination, gives the proportion of the total variation in the
response, y, that is explained by the explanatory variable, x. For our example:

$R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} = \frac{15.555 - 6.926}{15.555} = 0.555$

The residual sum of squares measures the variation of y about the fitted regression line; hence the difference
between the total sum of squares and the residual sum of squares represents the reduction in variation achieved by
modeling y in terms of x.

For linear regression, $R^2$ is identical to the square of the sample correlation coefficient for the response
and the explanatory variable. Hence, this quantity is a valid measure only if the assumptions are met (i.e.,
the data are a random sample), and it should never be used to evaluate the adequacy of the linear model.
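
The entries of the SPSS Model Summary table can be reproduced from the ANOVA table alone. The sketch below uses the sums of squares and degrees of freedom from that table; the adjusted $R^2$ formula (one minus the ratio of the residual mean square to the total mean square) is the standard definition and is not stated in these notes, so treat it as an added assumption.

```python
import math

# Values copied from the SPSS ANOVA table for the log10(Ozone) model.
total_ss, total_df = 15.555, 110
residual_ss, residual_df = 6.926, 109

# Proportion of total variation explained by the regression.
r_squared = (total_ss - residual_ss) / total_ss

# For simple linear regression, |r| is the square root of R^2.
abs_correlation = math.sqrt(r_squared)

# Adjusted R^2 (standard definition, assumed here): compares mean squares
# rather than sums of squares.
adjusted_r_squared = 1 - (residual_ss / residual_df) / (total_ss / total_df)

# "Std. Error of the Estimate" is the square root of the residual mean square.
std_error_estimate = math.sqrt(residual_ss / residual_df)

print(round(r_squared, 3), round(abs_correlation, 3),
      round(adjusted_r_squared, 3), round(std_error_estimate, 5))
```

The square root of $R^2$ agrees with both the R column of the Model Summary and the standardized coefficient (Beta) for temperature, as expected with a single explanatory variable.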
Case Study 8.2: Breakdown times for Insulating Fluid

Separate means model


ANOVA

Log(Time)
                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   196.477           6   32.746        13.004   .000
Within Groups    173.749          69    2.518
Total            370.226          75

Linear regression model


ANOVA (b)

Model            Sum of Squares   df   Mean Square   F        Sig.
1  Regression    190.151           1   190.151       78.141   .000 (a)
   Residual      180.075          74     2.433
   Total         370.226          75
a. Predictors: (Constant), Voltage (kV)
b. Dependent Variable: Log(Time)

Questions:

1) How much is the residual sum of squares lowered by going from the 2 parameter regression model
to the 7 parameter ‘separate means model’?

2) Calculate $R^2$ for the regression model.

3) Fill in the Composite ANOVA Table shown below.

Source            Sum of Squares   d.f.   Mean Square   F-statistic   p-value

Between
  Regression
  Lack of fit
Within
Total
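
One possible way to organize the arithmetic for the lack-of-fit row is sketched below. The function name `lack_of_fit` is illustrative, the inputs are the residual lines of the two ANOVA tables above, and the p-value is deliberately left out because it requires an F-distribution routine that the Python standard library does not provide.

```python
def lack_of_fit(ss_resid_regression, df_resid_regression, ss_within, df_within):
    """Lack-of-fit quantities: regression model vs. separate-means model.

    The lack-of-fit sum of squares is the drop in residual sum of squares
    when moving from the regression model to the separate-means model.
    """
    ss_lof = ss_resid_regression - ss_within
    df_lof = df_resid_regression - df_within
    ms_lof = ss_lof / df_lof
    ms_within = ss_within / df_within
    return {"ss": ss_lof, "df": df_lof, "ms": ms_lof, "f": ms_lof / ms_within}

# Residual lines copied from the two SPSS ANOVA tables above.
result = lack_of_fit(ss_resid_regression=180.075, df_resid_regression=74,
                     ss_within=173.749, df_within=69)
for name, value in result.items():
    print(name, round(value, 4))
```

As a consistency check, the regression sum of squares plus the lack-of-fit sum of squares should reproduce the between-groups sum of squares from the separate-means table.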