
Example of a Sequence of Activities Associated with Multiple Regression Analysis

Sample Problem: Predicting Home Heating Oil Usage
(Statistics for Managers - Levine et al.)

Let's start by generating a basic Multiple Linear Regression Analysis for data supplied by
Levine et al. from an investigation of Heating Oil Usage (for the Month of January),
based upon two suggested Independent Variables:

- Average daily temperature in Fahrenheit where the randomly selected home was
located; and
- Amount (in Inches) of Insulation in the randomly selected home during the
month the data were gathered.

We can begin by generating the sample model using the ENTER command (versus the
Forward, Backward, or Stepwise commands, which would usually be preferable, since
they report an R-squared change at each step), where:

Yi' = b0 + b1X1i + b2X2i
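Outside of SPSS, the same ENTER-style fit can be reproduced in a few lines of Python
with statsmodels. This is only a sketch: the file name heating_oil.csv and the column
names oil, temp, and insulation are assumptions for illustration, not part of the Levine
data set as distributed.

    # Sketch of the ENTER (all-variables-at-once) fit using statsmodels.
    # File and column names are assumed for illustration.
    import pandas as pd
    import statsmodels.api as sm

    heating = pd.read_csv("heating_oil.csv")                  # 15 sampled homes (assumed file)
    X = sm.add_constant(heating[["temp", "insulation"]])      # adds b0; b1 = temp, b2 = insulation
    model = sm.OLS(heating["oil"], X).fit()

    print(model.summary())    # R, R Square, ANOVA F, coefficients, and 95% CIs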

Variables Entered/Removed (b)

  Model   Variables Entered                            Variables Removed   Method
  1       Inches of Attic Insulation,                  .                   Enter
          Average Daily Temperature (F) - January (a)

  a. All requested variables entered.
  b. Dependent Variable: Oil Consumption - January

Model Summary

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .983 (a)  .966       .960                26.0138

  a. Predictors: (Constant), Inches of Attic Insulation,
     Average Daily Temperature (F) - January
ANOVA (b)

  Model 1             Sum of Squares   df   Mean Square   F         Sig.
  Regression (SSR)        228014.6      2     114007.3    168.471   .000 (a)
  Residual   (SSE)          8120.603   12        676.717
  Total      (SST)        236135.2     14

  a. Predictors: (Constant), Inches of Attic Insulation, Average Daily Temperature (F) - January
  b. Dependent Variable: Oil Consumption - January

Coefficients (a)

                                        Unstandardized          Standardized                      95% Confidence
                                        Coefficients            Coefficients                      Interval for B
  Model 1                               B          Std. Error   Beta          t         Sig.      Lower Bound   Upper Bound
  (Constant)                     (b0)   562.151    21.093                     26.651    .000      516.193       608.109
  Average Daily Temperature
    (F) - January                (b1)    -5.437      .336       -.866        -16.170    .000       -6.169        -4.704
  Inches of Attic Insulation     (b2)   -20.012     2.343       -.457         -8.543    .000      -25.116       -14.908

  a. Dependent Variable: Oil Consumption - January
So, we can interpret the results as:

* Y'i = 562.151 - 5.43658 Tempi - 20.0123 Inchesi

     so, for example, for a house with 6 inches of insulation that experienced an
     average temperature of 30 degrees for the month of January, we would
     infer that the expected heating oil used would be 278.9798 gallons
     (see the computational sketch following this list)

* Oil Consumption is expected to decrease by 5.44 Gallons per Month for every
     increase of 1 degree in average Temperature, for any given amount of attic
     insulation (that is, with amount of insulation accounted for, or held constant)

* Oil Consumption is expected to decrease by 20.01 Gallons per Month for every
     increase of 1 inch of insulation, for any given house experiencing an average
     temperature at a given value (that is, with temperature accounted for, or
     held constant)

* R2, the Coefficient of Multiple Determination, shows that 96.56% of the
     variability in Heating Oil Usage is explained by Temperature and Insulation.
     Adjusting for the number of predictors in the model and the sample size
     employed, our estimate of adjusted R2 = 0.96
* Using TableCurve 3D, we can portray the Multiple Regression equation as a
     response surface (surface plot not reproduced here)
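As a quick arithmetic check of the worked example above (278.98 gallons), here is a
short calculation using the fitted coefficients copied from the Coefficients table:

    # Predicted January oil use for a 30-degree month and 6 inches of insulation.
    b0, b1, b2 = 562.151, -5.43658, -20.0123

    temp_f, insulation = 30, 6
    predicted_gallons = b0 + b1 * temp_f + b2 * insulation
    print(round(predicted_gallons, 4))    # 278.9798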
Forward Inclusion

The Forward Inclusion method of Multiple Regression enters each significant variable
into the equation, in the order of greatest to least magnitude of effect on the dependent
variable. It has three advantages over the previous method displayed:

- we can judge the relative importance of each variable, even if all 'end up' in the
     equation;

- we obtain the R2 change at each step, allowing us to judge if the variability
     explained by the inclusion of each variable makes it worth considering /
     monitoring; and

- insignificant variables are not included in the equation.

For this example, we would obtain:

Variables Entered/Removed (a)

  Model   Variables Entered                           Variables Removed   Method
  1       Average Daily Temperature (F) - January     .                   Forward (Criterion:
                                                                          Probability-of-F-to-enter <= .050)
  2       Inches of Attic Insulation                  .                   Forward (Criterion:
                                                                          Probability-of-F-to-enter <= .050)

  a. Dependent Variable: Oil Consumption - January

Model Summary

                                               Std. Error of   R Square
  Model   R         R Square   Adjusted R Sq.  the Estimate    Change     F Change   df1   df2   Sig. F Change
  1       .870 (a)  .756       .738            66.5125         .756       40.377     1     13    .000
  2       .983 (b)  .966       .960            26.0138         .209       72.985     1     12    .000

  a. Predictors: (Constant), Average Daily Temperature (F) - January
  b. Predictors: (Constant), Average Daily Temperature (F) - January, Inches of Attic Insulation

ANOVA (c)

  Model                Sum of Squares   df   Mean Square   F         Sig.
  1   Regression          178624.4       1    178624.424   40.377    .000 (a)
      Residual             57510.805    13      4423.908
      Total               236135.2      14
  2   Regression          228014.6       2    114007.313   168.471   .000 (b)
      Residual              8120.603    12       676.717
      Total               236135.2      14

  a. Predictors: (Constant), Average Daily Temperature (F) - January
  b. Predictors: (Constant), Average Daily Temperature (F) - January, Inches of Attic Insulation
  c. Dependent Variable: Oil Consumption - January

Coefficients (a)

                                        Unstandardized          Standardized                      95% Confidence
                                        Coefficients            Coefficients                      Interval for B
  Model                                 B          Std. Error   Beta          t         Sig.      Lower Bound   Upper Bound
  1   (Constant)                        436.438    38.640                     11.295    .000      352.962       519.914
      Average Daily Temperature
        (F) - January                    -5.462      .860       -.870         -6.354    .000       -7.319        -3.605
  2   (Constant)                        562.151    21.093                     26.651    .000      516.193       608.109
      Average Daily Temperature
        (F) - January                    -5.437      .336       -.866        -16.170    .000       -6.169        -4.704
      Inches of Attic Insulation        -20.012     2.343       -.457         -8.543    .000      -25.116       -14.908

  a. Dependent Variable: Oil Consumption - January
Excluded Variables (b)

                                                               Partial       Collinearity Statistics
  Model 1                         Beta In     t        Sig.    Correlation   Tolerance
  Inches of Attic Insulation      -.457 (a)   -8.543   .000    -.927         1.000

  a. Predictors in the Model: (Constant), Average Daily Temperature (F) - January
  b. Dependent Variable: Oil Consumption - January

As shown by this output, Temperature is far more influential in its effect on Heating Oil
Consumption than Attic Insulation.
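For readers working outside of SPSS, the Forward criterion (probability-of-F-to-enter
<= .050) can be approximated with a short selection loop. This is a rough sketch, not
SPSS's exact implementation, and it reuses the assumed heating DataFrame and column
names from the earlier sketch.

    # Rough forward-inclusion loop: at each step, add the remaining variable with the
    # smallest p-value, provided it meets the entry criterion (p <= .05). For a single
    # added variable, the coefficient's t-test p-value equals the F-to-enter p-value.
    import statsmodels.api as sm

    def forward_select(df, response, candidates, p_enter=0.05):
        selected, remaining = [], list(candidates)
        while remaining:
            pvals = {}
            for var in remaining:
                X = sm.add_constant(df[selected + [var]])
                pvals[var] = sm.OLS(df[response], X).fit().pvalues[var]
            best = min(pvals, key=pvals.get)
            if pvals[best] > p_enter:
                break
            selected.append(best)
            remaining.remove(best)
        return selected

    print(forward_select(heating, "oil", ["temp", "insulation"]))   # expect temp first, then insulation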

Residual Analysis

Generally, the next step in Multiple Regression Analysis is to generate a series of
Residual Plots:

- Residuals / Standardized Residuals versus Y': a pattern here shows that the data
     may not be linear, and that a transformation of at least one X variable, or the
     Y variable, may be in order

- Residuals / Standardized Residuals versus each X (Independent Variable): the
     presence of a pattern would show the need to transform the variable, given
     evidence of a non-linear effect

- If the data were collected in a time order, Residuals would be plotted by Time,
     and the Durbin-Watson statistic would be calculated (a sketch of these checks
     in Python appears below).
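A sketch of those checks, reusing the fitted model and assumed column names from the
earlier sketch (matplotlib for the plots; the Durbin-Watson statistic from statsmodels):

    # Residuals versus predicted values and versus each X, plus Durbin-Watson.
    import matplotlib.pyplot as plt
    from statsmodels.stats.stattools import durbin_watson

    resid = model.resid
    panels = [(model.fittedvalues, "Unstandardized Predicted Value"),
              (heating["temp"], "Average Daily Temperature (F) - January"),
              (heating["insulation"], "Inches of Attic Insulation")]

    for x, label in panels:
        plt.figure()
        plt.scatter(x, resid)
        plt.axhline(0, linestyle="--")
        plt.xlabel(label)
        plt.ylabel("Unstandardized Residual")

    print("Durbin-Watson:", durbin_watson(resid))   # meaningful only for time-ordered data
    plt.show()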

We first note the absence of any pattern in the plot of the Predicted Values versus the
Residuals:
[Scatterplot: Unstandardized Residual versus Unstandardized Predicted Value]

Next, as previously described:

[Scatterplot: Unstandardized Residual versus Average Daily Temperature (F) - January]


[Scatterplot: Unstandardized Residual versus Inches of Attic Insulation]

Testing for our other underlying assumptions as detailed in the Simple Regression
material, we can also show the 'standard' plots employed at this point. Three plots have
been generated to show that the observed Y values plotted against the Unstandardized,
Standardized, and Studentized Residuals produce the same general distribution and lead
to the same conclusions:

[Normal P-P Plot of Regression Standardized Residual - Dependent Variable: Oil Consumption - January]
[Histogram of Regression Standardized Residual - Dependent Variable: Oil Consumption - January (Std. Dev = .93, Mean = 0.00, N = 15)]
[Scatterplots: Standardized, Studentized, and Unstandardized Residuals versus Oil Consumption - January]


Testing the Significance of the Entire (Final) Multiple Regression Model
and Inferences About the Population Regression Coefficients

Testing the hypothesis that:

     H0: β1 = β2 = 0    (there is no linear relationship between the dependent
                         variable and the explanatory variables)
     H1: at least one βj ≠ 0

Seeing that F = 168.47 and p = 0.000, we reject the Null Hypothesis.
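The overall F statistic is simply the ratio of the two mean squares in the ANOVA table;
a quick check (scipy is used only for the p-value):

    # Overall F test: F = MS(Regression) / MS(Residual), with (2, 12) degrees of freedom.
    from scipy.stats import f

    msr, mse = 114007.3, 676.717
    F = msr / mse                       # about 168.47
    p = f.sf(F, 2, 12)                  # upper-tail p-value, effectively 0.000
    print(round(F, 2), p)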

Next, we can use the Table below (also presented earlier) to make the following
observations:

Coefficients (a)

                                        Unstandardized          Standardized                      95% Confidence
                                        Coefficients            Coefficients                      Interval for B
  Model                                 B          Std. Error   Beta          t         Sig.      Lower Bound   Upper Bound
  1   (Constant)                        436.438    38.640                     11.295    .000      352.962       519.914
      Average Daily Temperature
        (F) - January                    -5.462      .860       -.870         -6.354    .000       -7.319        -3.605
  2   (Constant)                        562.151    21.093                     26.651    .000      516.193       608.109
      Average Daily Temperature
        (F) - January                    -5.437      .336       -.866        -16.170    .000       -6.169        -4.704
      Inches of Attic Insulation        -20.012     2.343       -.457         -8.543    .000      -25.116       -14.908

  a. Dependent Variable: Oil Consumption - January

* we reject the hypothesis that β1 = 0 (t = -16.17; p = 0.000). Our point estimate
     for this value is -5.437, and our 95% CI for the Slope (β1) is -6.169 to
     -4.704, taking into account the effect of Insulation

* we reject the hypothesis that β2 = 0 (t = -8.543; p = 0.000). Our point estimate
     for this value is -20.012, and our 95% CI for the Slope (β2) is -25.116 to
     -14.908, taking into account the effect of Temperature
Testing Portions or Sub-Components of the Multiple Regression Model

If we have not conducted a Forward Inclusion analysis, we would at this point wish to
test for the contribution of each individual variable to the Regression Model. This
approach provides slightly more information than the Forward method, in that a separate
Model Summary and ANOVA table is generated for each variable, and an F statistic can
be computed for each. Many statisticians find, however, that a Forward Inclusion
approach provides all of the data necessary for an analysis of Independent Variable
contribution.

The ‘ENTER / Component’ approach would yield:

Variables Entered/Removed (b)

  Model   Variables Entered                           Variables Removed   Method
  1       Average Daily Temperature (F) - January     .                   Enter

  a. All requested variables entered.
  b. Dependent Variable: Oil Consumption - January

Model Summary

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .870 (a)  .756       .738                66.5125

  a. Predictors: (Constant), Average Daily Temperature (F) - January

ANOVA (b)

  Model 1          Sum of Squares   df   Mean Square   F        Sig.
  Regression          178624.4       1    178624.424   40.377   .000 (a)
  Residual             57510.805    13      4423.908
  Total               236135.2      14

  a. Predictors: (Constant), Average Daily Temperature (F) - January
  b. Dependent Variable: Oil Consumption - January

Coefficients (a)

                                    Unstandardized          Standardized                      95% Confidence
                                    Coefficients            Coefficients                      Interval for B
  Model 1                           B          Std. Error   Beta          t         Sig.      Lower Bound   Upper Bound
  (Constant)                        436.438    38.640                     11.295    .000      352.962       519.914
  Average Daily Temperature
    (F) - January                    -5.462      .860       -.870         -6.354    .000       -7.319        -3.605

  a. Dependent Variable: Oil Consumption - January
Variables Entered/Removed (b)

  Model   Variables Entered            Variables Removed   Method
  1       Inches of Attic Insulation   .                   Enter

  a. All requested variables entered.
  b. Dependent Variable: Oil Consumption - January

Model Summary

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .465 (a)  .216       .156                119.3117

  a. Predictors: (Constant), Inches of Attic Insulation

ANOVA (b)

  Model 1          Sum of Squares   df   Mean Square   F       Sig.
  Regression           51076.465     1     51076.465   3.588   .081 (a)
  Residual            185058.8      13     14235.290
  Total               236135.2      14

  a. Predictors: (Constant), Inches of Attic Insulation
  b. Dependent Variable: Oil Consumption - January

Coefficients (a)

                                    Unstandardized          Standardized                      95% Confidence
                                    Coefficients            Coefficients                      Interval for B
  Model 1                           B          Std. Error   Beta          t         Sig.      Lower Bound   Upper Bound
  (Constant)                        345.378    74.691                      4.624    .000      184.019       506.738
  Inches of Attic Insulation        -20.350    10.743       -.465         -1.894    .081      -43.560         2.859

  a. Dependent Variable: Oil Consumption - January

Then, at this point, we would use the Sums of Squares from the three ANOVA tables to
test for the contribution of each variable, after its companion variable has been added to
the Multiple Regression equation.

Given:

     SSR              = 228,014.63 @ 2 df
     MSE              =     676.72 @ 12 df

     SSR(Temp)        = 178,624.42 @ 1 df
     SSR(Insulation)  =  51,076.46 @ 1 df

Then:

     SSR(Xk | all other variables) = SSR(all variables) - SSR(all variables except Xk)

so

Contribution of variable X1 given X2 has been included:

     SSR(X1 | X2) = SSR(X1 and X2) - SSR(X2)

and

Contribution of variable X2 given X1 has been included:

     SSR(X2 | X1) = SSR(X1 and X2) - SSR(X1)

So:

Contribution of Temperature (X1) After Insulation Has Been Added:

     SSR(X1 | X2) = 228,015 - 51,076 = 176,939

     then FPartial = 176,939 / 676.717 = 261.47; p = 0.000

Contribution of Insulation (X2) After Temperature Has Been Added:

     SSR(X2 | X1) = 228,015 - 178,624 = 49,391

     then FPartial = 49,391 / 676.717 = 72.99; p = 0.000
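The same arithmetic in a short sketch (sums of squares copied from the three ANOVA
tables; scipy supplies the p-values):

    # Partial F tests: extra regression sum of squares divided by the full-model MSE.
    from scipy.stats import f

    ssr_full   = 228014.63      # SSR(X1 and X2)
    ssr_temp   = 178624.42      # SSR(X1): Temperature alone
    ssr_insul  =  51076.46      # SSR(X2): Insulation alone
    mse_full   = 676.717        # full-model MSE, 12 df

    for label, extra in [("Temperature | Insulation", ssr_full - ssr_insul),
                         ("Insulation | Temperature", ssr_full - ssr_temp)]:
        F_partial = extra / mse_full
        print(label, round(F_partial, 2), f.sf(F_partial, 1, 12))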

Similar to the results shown when conducting a Forward Inclusion analysis, both
variables are significant. Having found this, we can also calculate Coefficients of Partial
Determination, which break down the Coefficient of Multiple Determination into the
component coefficients associated with each variable. SPSS will automatically calculate
these values for us when that option is toggled.

The Coefficients may be obtained as:

Coefficients (a)

                                    Unstandardized         Standardized                  95% Confidence Interval     Correlations
                                    Coefficients           Coefficients                  for B
  Model 1                           B         Std. Error   Beta      t        Sig.       Lower Bound  Upper Bound    Zero-order  Partial  Part
  (Constant)                        562.151   21.093                 26.651   .000        516.193      608.109
  Average Daily Temperature
    (F) - January                    -5.437     .336       -.866    -16.170   .000         -6.169       -4.704       -.870       -.978    -.866
  Inches of Attic Insulation        -20.012    2.343       -.457     -8.543   .000        -25.116      -14.908       -.465       -.927    -.457

  a. Dependent Variable: Oil Consumption - January

     r2Y1.2 = (-.978)2 = .9564 = 95.64%

and

     r2Y2.1 = (-.927)2 = .8593 = 85.93%
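The same values (up to rounding of the reported partial correlations) follow from the
sums of squares, since each coefficient of partial determination equals SSR(Xj | other)
divided by [SSE + SSR(Xj | other)]; a small sketch using the figures above:

    # Coefficients of partial determination from the ANOVA sums of squares.
    sse_full = 8120.603
    ssr_temp_given_insul  = 228014.63 - 51076.46     # SSR(X1 | X2)
    ssr_insul_given_temp  = 228014.63 - 178624.42    # SSR(X2 | X1)

    r2_y1_2 = ssr_temp_given_insul / (sse_full + ssr_temp_given_insul)   # about .956
    r2_y2_1 = ssr_insul_given_temp / (sse_full + ssr_insul_given_temp)   # about .859
    print(round(r2_y1_2, 4), round(r2_y2_1, 4))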

Other Important Measures and Analyses in Multiple Regression Analysis

One of the major threats to Multiple Regression analysis concerns the possibility of
collinearity: the situation where independent variables are highly correlated with one
another, which can lead to spurious predictions. For years, statisticians (refer to the
handout from the old SPSS manual) recommended that, as a first step, a correlation
matrix be generated among the Independent Variables so that (by observation) one could
determine whether a “very high” relationship existed among the variables. Of course, it
was always difficult to objectively determine how high was “too high”.

To solve this problem, we use the Variance Inflationary Factor (VIF):

     VIFj = 1 / (1 - R2j)

where R2j is the coefficient of multiple determination of explanatory variable Xj with all
other explanatory variables. In the case of just two variables, R21 is simply the Coefficient
of Determination between the two. In the case of the Heating Oil data, rT x I = 0.00892, so

     VIF1 = VIF2 = 1 / { 1 - (0.00892)2 } ≅ 1.00
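statsmodels can compute the same quantity directly; a sketch, again reusing the assumed
heating DataFrame and column names from the earlier sketches:

    # VIF for each predictor: regress it on the other predictor(s) and apply 1 / (1 - R2).
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X = sm.add_constant(heating[["temp", "insulation"]])
    for i, name in enumerate(X.columns):
        if name != "const":
            print(name, round(variance_inflation_factor(X.values, i), 3))   # both about 1.00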

SPSS, when ‘Collinearity Diagnostics’ is toggled, will calculate this value for us:

Coefficients (a)

                                    Unstandardized          Standardized                      Collinearity Statistics
                                    Coefficients            Coefficients
  Model 1                           B          Std. Error   Beta          t         Sig.      Tolerance   VIF
  (Constant)                        562.151    21.093                     26.651    .000
  Average Daily Temperature
    (F) - January                    -5.437      .336       -.866        -16.170    .000      1.000       1.000
  Inches of Attic Insulation        -20.012     2.343       -.457         -8.543    .000      1.000       1.000

  a. Dependent Variable: Oil Consumption - January

Here’s an example from another Multiple Regression Analysis, where there were more
than two independent variables involved:
Coefficients (a)

                       Unstandardized           Standardized                      Collinearity Statistics
                       Coefficients             Coefficients
  Model 1              B          Std. Error    Beta          t         Sig.      Tolerance   VIF
  (Constant)           -330.832   110.895                     -2.983    .007
  Total Staff             1.246      .412        .529          3.023    .006      .586        1.707
  REMOTE                  -.118      .054       -.324         -2.180    .041      .811        1.233
  DUBNER                  -.297      .118       -.408         -2.519    .020      .685        1.459
  Total Labor              .131      .059        .417          2.200    .039      .500        1.999

  a. Dependent Variable: STANDBY

If a set of variables is uncorrelated, then the VIFs will all be approximately equal to 1.00.
If a set of variables is highly correlated, a VIF might exceed 10. Some statisticians suggest
that if the VIF exceeds 10, then alternatives to the generated model should be explored.
More conservative statisticians suggest that 5 is a more appropriate maximum threshold
value.

The Cp Statistic and Model Building

In building a model through Forward, Backward, or Stepwise inclusion, the statistician
may have a number of potential models which could be used to describe a predictive
value for the Dependent Variable (actually, its Criterion Measure). This is particularly
true when a large number of Independent Variables are significant. To optimize the
model employed, particularly when more than 2 variables are involved, the Cp Statistic
may be employed:
     Cp = [ (1 - R2p)(n - T) / (1 - R2T) ] - [ n - 2(p + 1) ]

where

     p   = number of independent variables included in a model
     T   = total number of parameters (Intercept included) available to be estimated in
           the full regression model
     R2p = coefficient of multiple determination for a regression model with p
           independent variables included
     R2T = coefficient of multiple determination for the full regression model containing
           the Intercept and all T estimated parameters

and where the goal is to find models whose Cp is close to or below ( p + 1 ).
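A direct translation of the formula; the example call plugs in the Temperature-only model
from this handout (p = 1, R2p = .756) against the full model (T = 3 parameters, R2T = .966,
n = 15):

    # Mallows' Cp for a candidate model with p independent variables.
    def mallows_cp(r2_p, r2_T, n, T, p):
        return (1 - r2_p) * (n - T) / (1 - r2_T) - (n - 2 * (p + 1))

    # Temperature-only model: Cp is far above p + 1 = 2, so that model is inadequate
    # and the two-variable model is preferred.
    print(round(mallows_cp(r2_p=0.756, r2_T=0.966, n=15, T=3, p=1), 1))

Candidate models with Cp near or below p + 1 would be retained for further study.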
