
Chap. 7, page 1
Chapter 7 The Simple Linear Regression Model

A common model for the relationship between two quantitative variables is the linear
regression model. Don’t be fooled by the “linear” part: as we’ll see, linear regression models can
often be used to model relationships that aren’t linear.

Although we looked at the linear regression model last semester, we only looked at one part of it
– the part that models the mean response Y as a linear function of X. We’ll extend the model to
model the scatter of the individual data points around the line. The way we extend it makes the
linear regression model exactly like the ANOVA model, except that the explanatory variable is
quantitative instead of categorical.

We assume that at each X, the distribution of Y values is normal with mean $\beta_0 + \beta_1 X$ and
standard deviation $\sigma$:

$$\mu(Y \mid X) = \beta_0 + \beta_1 X$$

$$\sigma(Y \mid X) = \sigma$$

Data: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. The $Y_i$’s are assumed to be independent.

Least squares estimates of $\beta_0$ and $\beta_1$ are denoted by $\hat\beta_0$ and $\hat\beta_1$. The predicted or fitted value
of Y for a particular X is:

$$\hat\mu(Y \mid X) = \hat\beta_0 + \hat\beta_1 X.$$

This is also denoted Ŷ in many books.

The fitted values for the data points are:

$$\hat{Y}_i = \mathrm{fit}_i = \hat\beta_0 + \hat\beta_1 X_i$$

and the residuals are:

$$\mathrm{res}_i = Y_i - \mathrm{fit}_i = Y_i - \hat{Y}_i.$$

The residuals are sometimes denoted $e_i$ in other texts.

By modeling the distribution of data points around the line, we can make inferences from the
sample data about the regression parameters.
Case Study 7.2: Meat Processing and pH

ANOVA (b)

Model            Sum of Squares   df   Mean Square   F         Sig.
1  Regression    3.00647          1    3.00647       444.306   .000 (a)
   Residual       .05413          8     .00677
   Total         3.06060          9
a. Predictors: (Constant), Log(hours)
b. Dependent Variable: pH

Coefficients (a)

                 Unstandardized        Standardized
                 Coefficients          Coefficients
Model            B        Std. Error   Beta           t          Sig.
1  (Constant)    6.9836   .0485                       143.897    .000
   Log(hours)    -.7257   .0344        -.991          -21.079    .000
a. Dependent Variable: pH

Hours   pH     Log(hours)   fit       res

1       7.02   0            6.9836     0.0364
1       6.93   0            6.9836    -0.0536
2       6.42   0.69         6.4806    -0.0606
2       6.51   0.69         6.4806     0.0294
4       6.07   1.39         5.9777     0.0923
4       5.99   1.39         5.9777     0.0123
6       5.59   1.79         5.6834    -0.0934
6       5.80   1.79         5.6834     0.1166
8       5.51   2.08         5.4747     0.0353
8       5.36   2.08         5.4747    -0.1147

Another (equivalent) way to write the linear regression model is

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where the $\varepsilon_i$’s are independent $N(0, \sigma)$ random variables.

Formulas for least squares estimators:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$$
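As a check on the least squares formulas, here is a minimal Python sketch that applies them to the Hours and pH columns of the Case Study 7.2 data table. The variable names (`hours`, `ph`, `sxy`, `sxx`, `b0`, `b1`) are illustrative choices, not part of the notes.

```python
import math

# Hours since slaughter and pH, copied from the Case Study 7.2 data table.
hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]

# The explanatory variable X is the natural log of hours.
x = [math.log(h) for h in hours]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(ph) / n

# Numerator and denominator of the slope formula.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph))
sxx = sum((xi - x_bar) ** 2 for xi in x)

b1 = sxy / sxx            # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
```

The two estimates should agree with the B column of the SPSS Coefficients table up to rounding.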

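The mean of the residuals is 0 (always true for a least squares fit that includes an intercept).

The estimate of $\sigma$ is

$$\hat\sigma = \sqrt{\frac{\text{sum of squared residuals}}{\text{degrees of freedom}}} = \sqrt{\frac{\sum_{i=1}^{n} \mathrm{res}_i^2}{n-2}}.$$

Degrees of freedom = n - (number of parameters in the model for the means) = n - 2 for simple linear
regression.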

The ANOVA table gives the sum of squared residuals and the mean square residual, which is
$\hat\sigma^2 = 0.00677$, so $\hat\sigma = 0.0823$.
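The same numbers can be recovered from the res column of the data table. The sketch below is a rough check under that assumption: it uses the residuals as printed (rounded to four decimals), and the names `res`, `sse`, and `sigma_hat` are illustrative.

```python
import math

# Residuals copied from the "res" column of the Case Study 7.2 data table.
res = [0.0364, -0.0536, -0.0606, 0.0294, 0.0923,
       0.0123, -0.0934, 0.1166, 0.0353, -0.1147]

n = len(res)
mean_res = sum(res) / n           # should be essentially 0
sse = sum(r ** 2 for r in res)    # sum of squared residuals
df = n - 2                        # two parameters in the model for the mean
mse = sse / df                    # mean square residual, estimate of sigma^2
sigma_hat = math.sqrt(mse)        # estimate of sigma

print(f"mean residual = {mean_res:.6f}")
print(f"SSE = {sse:.5f}, MSE = {mse:.5f}, sigma_hat = {sigma_hat:.4f}")
```

Up to rounding, SSE and MSE should match the Residual row of the ANOVA table.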

The standard errors of $\hat\beta_0$ and $\hat\beta_1$ are the estimated standard deviations of the sampling
distributions of $\hat\beta_0$ and $\hat\beta_1$. The sampling distributions describe how the least squares estimates
would vary from sample to sample. We view the $X_i$’s as fixed: they remain the
same from sample to sample, while the $Y_i$’s are random.

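$$SE(\hat\beta_1) = \hat\sigma \sqrt{\frac{1}{(n-1)s_X^2}}, \qquad SE(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{(n-1)s_X^2}}$$

Confidence intervals for slope and intercept are $\text{Estimate} \pm t_{df}(1 - \alpha/2)\, SE(\text{Estimate})$.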


Example: Steer carcass data

Predicted pH = 6.9836 - .7257 Log(Hours)

where Log is natural logarithm.

Inferences for slope:


Mean pH is estimated to decrease by .7257 for every one-unit increase in Log(Hours). A one-unit
increase in Log(Hours) is an increase in Hours by a factor of e ≈ 2.72. If we had used
Log10(Hours) instead, the interpretation would be easier: the slope would represent the change in
predicted pH for every 10-fold increase in time since slaughter.
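The base-10 slope does not require refitting the model. Since ln(h) = ln(10) × log10(h), the slope on the log10 scale is the natural-log slope multiplied by ln(10). A small sketch, where `b1_ln` and `b1_log10` are illustrative names:

```python
import math

b1_ln = -0.7257                      # slope per one-unit increase in ln(Hours)
b1_log10 = b1_ln * math.log(10)      # slope per one-unit increase in log10(Hours)

print(f"change in predicted pH per 10-fold increase in hours: {b1_log10:.3f}")
```

This gives a drop of roughly 1.67 pH units for every 10-fold increase in time since slaughter, which describes the same fitted line on a different scale.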

A 95% confidence interval for $\beta_1$ is -.7257 ± $t_8(.975)$(.0344) = -.7257 ± 2.306(.0344) =
-.7257 ± .0793 = -.805 to -.646. So we are 95% confident that the decrease in mean pH is
between .646 and .805 for every 2.72-fold increase in time since slaughter.
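The standard errors and confidence limits can be reproduced from the raw data. The sketch below is one way to do it with the Python standard library. It takes the critical value 2.306 from the notes rather than computing it, and all variable names are illustrative.

```python
import math
import statistics

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)

x_bar = statistics.mean(x)
y_bar = statistics.mean(ph)
sxx = (n - 1) * statistics.variance(x)   # (n - 1) * s_X^2

# Least squares estimates.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard deviation on n - 2 degrees of freedom.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, ph)]
sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Standard errors of the slope and intercept.
se_b1 = sigma_hat * math.sqrt(1 / sxx)
se_b0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)

T_CRIT = 2.306   # t_8(.975), the value quoted in the notes

ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)

print(f"slope:     {b1:.4f}  SE {se_b1:.4f}  95% CI ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
print(f"intercept: {b0:.4f}  SE {se_b0:.4f}  95% CI ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
```

The results should match the SPSS Coefficients output up to rounding.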

The confidence interval can also be obtained from SPSS by choosing Statistics in the
Analyze…Regression…Linear window.
Coefficients (a)

                 Unstandardized        Standardized                         95% Confidence
                 Coefficients          Coefficients                         Interval for B
Model            B        Std. Error   Beta           t          Sig.      Lower Bound   Upper Bound
1  (Constant)    6.984    .049                        143.897    .000      6.872         7.096
   Log(hours)    -.726    .034         -.991          -21.079    .000      -.805         -.646
a. Dependent Variable: pH

Inferences for intercept:


The intercept $\beta_0$ represents the mean value of Y when X = 0. Usually, this is not particularly
meaningful. It is usually more meaningful to estimate the mean value of Y at particular values of
X which are meaningful and interesting, which is covered next.

Inferences for the mean response at a particular value of X:


Inferences about the slope of the regression line tell us how big the change in the mean
response (Y) is for a 1-unit increase in X. Sometimes, we are interested in a confidence interval for
the mean response at a particular X, say $X_0$. According to the model, the true mean of Y at $X_0$
is $\mu(Y \mid X_0) = \beta_0 + \beta_1 X_0$. The estimate of this is $\hat\mu(Y \mid X_0) = \hat\beta_0 + \hat\beta_1 X_0$. The standard error of
$\hat\mu(Y \mid X_0)$ is

$$SE[\hat\mu(Y \mid X_0)] = \hat\sigma \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$$

Note that the standard error is bigger for values of $X_0$ further from $\bar{X}$ and is smallest at $X_0 = \bar{X}$.
Steer data: What is the estimated mean pH for carcasses 3 hours old? Give a 95% confidence
interval for the mean pH after 3 hours.

First, remember that the X variable in the regression model is log(Hours), so $X_0 = \log(3) = 1.0986$
(natural logarithm). Therefore, $\hat\mu(Y \mid X_0 = 1.0986) = 6.9836 - .7257(1.0986) = 6.186$.

To calculate the standard error, we need to compute $\bar{X}$, the mean of the log(Hours) for the 10
data points, and $s_X^2$, the sample variance of log(Hours). From SPSS,

Descriptive Statistics

                     N    Mean      Std. Deviation   Variance
LogTime              10   1.19013   .796480          .63438
Valid N (listwise)   10

Hence, $\bar{X} = 1.1901$ and $(n-1)s_X^2 = 9(.63438) = 5.709$.

Therefore,

$$SE[\hat\mu(Y \mid X_0 = 1.0986)] = 0.0823 \sqrt{\frac{1}{10} + \frac{(1.0986 - 1.1901)^2}{5.709}} = 0.0262$$

and a 95% confidence interval for the mean pH among all steers after 3 hours is

6.186 ± $t_8(.975)$(.0262) = 6.186 ± 2.306(.0262) = 6.186 ± .0604 ≈ 6.13 to 6.25
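The hand calculation above can be wrapped in a small function so that the interval is easy to recompute at other values of X. This is a sketch under the assumption that the summary values quoted in the notes (estimates, σ̂, mean and variance of log(Hours), and the t critical value) are plugged in directly. The function name `mean_response_ci` and the constant names are hypothetical.

```python
import math

# Summary values quoted in the notes (SPSS output for the steer data).
B0, B1 = 6.9836, -0.7257      # least squares estimates
SIGMA_HAT = 0.0823            # estimate of sigma
N = 10                        # number of data points
X_BAR = 1.19013               # mean of log(Hours)
SXX = (N - 1) * 0.63438       # (n - 1) * s_X^2
T_CRIT = 2.306                # t_8(.975), as quoted in the notes


def mean_response_ci(x0):
    """Return (fit, se, lower, upper) for the mean response at x0."""
    fit = B0 + B1 * x0
    se = SIGMA_HAT * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / SXX)
    return fit, se, fit - T_CRIT * se, fit + T_CRIT * se


fit, se, lower, upper = mean_response_ci(math.log(3))
print(f"fit = {fit:.3f}, SE = {se:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Calling `mean_response_ci` at X̄ and at a value far from X̄ shows the standard error growing with distance from the mean of X.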

Simultaneous confidence intervals for the mean response at several values of X


If we want simultaneous confidence intervals at several different values of X, we can use
Bonferroni if the number of values is small. We can compute simultaneous confidence intervals
at every possible value of X using a Scheffé procedure. The result is a set of confidence bands
for the regression line. We are 95% confident (or whatever the chosen confidence level is) that the
regression line lies entirely within the bands. Thus, we are 95% confident that the true means at
all possible values of X are all within the confidence band limits. The formula for the
simultaneous confidence bands is

$$\hat\beta_0 + \hat\beta_1 X \pm \sqrt{2 F_{2,n-2}(1 - \alpha)}\; SE[\hat\mu(Y \mid X)]$$

This is referred to as the Working-Hotelling procedure. In practice, you compute these limits at
a large number of X values, then join the limits to make a smooth curve on the scatterplot. Some
programs will do this automatically, but SPSS will not. It will, however, plot the individual
confidence intervals for all X’s using the t coefficient rather than the Scheffé coefficient.

Steer data: for simultaneous 95% confidence intervals, $F_{2,n-2}(1 - \alpha) = F_{2,8}(.95) = 4.46$. The
confidence interval for the mean pH after 3 hours is therefore (see above):

6.186 ± $\sqrt{2(4.46)}$(.0262) = 6.186 ± 2.987(.0262) = 6.186 ± .0782 = 6.11 to 6.26
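To draw the bands, the same calculation is repeated over a grid of X values. The sketch below assumes the summary values and the F critical value quoted in the notes. The grid spacing and the names `se_mean`, `wh_band`, and `WH_MULT` are illustrative choices.

```python
import math

# Summary values quoted in the notes (SPSS output for the steer data).
B0, B1 = 6.9836, -0.7257
SIGMA_HAT = 0.0823
N = 10
X_BAR = 1.19013
SXX = (N - 1) * 0.63438       # (n - 1) * s_X^2
F_CRIT = 4.46                 # F_{2,8}(.95), as quoted in the notes
T_CRIT = 2.306                # t_8(.975), for comparison

WH_MULT = math.sqrt(2 * F_CRIT)   # Working-Hotelling multiplier


def se_mean(x0):
    """Standard error of the estimated mean response at x0."""
    return SIGMA_HAT * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / SXX)


def wh_band(x0):
    """Simultaneous (Working-Hotelling) limits for the mean response at x0."""
    fit = B0 + B1 * x0
    half_width = WH_MULT * se_mean(x0)
    return fit - half_width, fit + half_width


# Limits on a grid of log(Hours) values; joining them traces out the bands.
grid = [i * 0.25 for i in range(9)]   # 0.0, 0.25, ..., 2.0
for x0 in grid:
    lo, hi = wh_band(x0)
    print(f"x = {x0:.2f}: {lo:.3f} to {hi:.3f}")

lo3, hi3 = wh_band(math.log(3))   # the 3-hour example from the notes
```

Because the multiplier √(2 × 4.46) exceeds 2.306, the simultaneous band is wider than the individual t interval at every X.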

We could compute confidence intervals for any number of values of X.



Prediction interval for a future response


The confidence interval above is for the mean pH of all steers 3 hours after slaughter. A 95%
prediction interval for the pH of an individual steer 3 hours after slaughter is an interval in which
you are 95% confident that the pH of a particular steer will lie 3 hours after slaughter. A
confidence interval is for a mean; a prediction interval is for an individual.

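The predicted value for a future response at $X = X_0$ is

$$\mathrm{Pred}(Y \mid X_0) = \hat\mu(Y \mid X_0) = \hat\beta_0 + \hat\beta_1 X_0$$

The standard error of prediction is

$$SE[\mathrm{Pred}(Y \mid X_0)] = \sqrt{\hat\sigma^2 + SE[\hat\mu(Y \mid X_0)]^2} = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$$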

The standard error of prediction has two parts: the uncertainty due to estimating the mean
response at $X_0$ and the uncertainty due to the fact that individual observations vary around that
mean with standard deviation σ. Note that while the standard error of the mean response at $X_0$
goes to 0 as n increases, the standard error of prediction never goes to 0. An individual
100(1-α)% prediction interval for the response of an individual at $X_0$ is

$$\hat\beta_0 + \hat\beta_1 X_0 \pm t_{n-2}(1 - \alpha/2)\; SE[\mathrm{Pred}(Y \mid X_0)]$$

For the steer data, a 95% prediction interval for the pH of a particular steer 3 hours after
slaughter is:

$$6.186 \pm 2.306\,(.0823)\sqrt{1 + \frac{1}{10} + \frac{(1.0986 - 1.1901)^2}{5.709}} = 6.186 \pm 2.306(.08637) = 6.186 \pm .1992 = 5.99 \text{ to } 6.39.$$
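A short sketch of the prediction interval calculation, again assuming the summary values quoted in the notes. It computes the standard error of prediction both ways (as the square root of σ̂² plus the squared standard error of the mean, and with the single square-root formula) to show that the two forms agree. Function and constant names are illustrative.

```python
import math

# Summary values quoted in the notes (SPSS output for the steer data).
B0, B1 = 6.9836, -0.7257
SIGMA_HAT = 0.0823
N = 10
X_BAR = 1.19013
SXX = (N - 1) * 0.63438       # (n - 1) * s_X^2
T_CRIT = 2.306                # t_8(.975), as quoted in the notes


def se_mean(x0):
    """Standard error of the estimated mean response at x0."""
    return SIGMA_HAT * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / SXX)


def se_pred(x0):
    """Standard error of prediction: scatter about the mean plus error in the mean."""
    return math.sqrt(SIGMA_HAT ** 2 + se_mean(x0) ** 2)


def se_pred_direct(x0):
    """The same quantity written as a single square root."""
    return SIGMA_HAT * math.sqrt(1 + 1 / N + (x0 - X_BAR) ** 2 / SXX)


x0 = math.log(3)
fit = B0 + B1 * x0
half_width = T_CRIT * se_pred(x0)
pi_lower, pi_upper = fit - half_width, fit + half_width

print(f"SE(pred) = {se_pred(x0):.5f} (direct form: {se_pred_direct(x0):.5f})")
print(f"95% prediction interval: {pi_lower:.2f} to {pi_upper:.2f}")
```

Since `se_pred` can never be smaller than `SIGMA_HAT`, the prediction interval stays wide no matter how large n becomes, unlike the interval for the mean.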

Simultaneous prediction intervals can be computed for several different X values using
Bonferroni, but there is no analog to the Working-Hotelling Scheffe-based procedure for
simultaneous prediction intervals at all possible values of X.
SPSS commands

Analyze…Regression…Linear

Under the Statistics button, you can choose to get confidence intervals for $\beta_0$ and $\beta_1$.

Under the Save button:


• Unstandardized Predicted Values
• Unstandardized Residuals
• Prediction Intervals: Mean: this isn’t a prediction interval; it’s an individual confidence
interval for the mean response at each X. SPSS does not compute the Working-Hotelling
simultaneous confidence intervals.
• Prediction Intervals: Individual: this is a prediction interval for an individual response at
each X

To obtain predicted values, confidence intervals and prediction intervals for a value of X not in
the data set, add a case to the data with the desired X value, but leave the value of Y blank (it
should display a period which indicates a missing value).

SPSS can plot the individual confidence intervals for mean response and the prediction
intervals for an individual response. Create a scatterplot and double-click the plot to get into
Chart Editor. Select one of the data points and click on the “Add fit line” icon. Under the “Fit
line” tab you can select “Mean” or “Individual” confidence intervals. The first gives individual
(not simultaneous) confidence intervals for the mean response at each X and the second gives
prediction intervals.

95% individual confidence intervals for the mean, 95% Working-Hotelling simultaneous
confidence bands for the mean, and 95% individual prediction intervals for a single response
(this graph is from S-Plus; SPSS will only do the first and last of the three).

[Figure: scatterplot of y against x with 0.95 bands; x-axis from 0.0 to 2.0, y-axis from 5.5 to 7.0.]