
SW388R7
Data Analysis & Computers II

Slide 1

Stepwise Multiple Regression

Differences between stepwise and other methods of multiple regression
Sample problem
Steps in stepwise multiple regression
Homework Problems

Slide 2

Types of multiple regression

Different types of multiple regression are distinguished by the method for entering the independent variables into the analysis.

In standard (or simultaneous) multiple regression, all of the independent variables are entered into the analysis at the same time.

In hierarchical (or sequential) multiple regression, the independent variables are entered in an order prescribed by the analyst.

In stepwise (or statistical) multiple regression, the independent variables are entered according to their statistical contribution in explaining the variance in the dependent variable.

No matter what method of entry is chosen, a multiple regression that includes the same independent variables and the same dependent variable will produce the same multiple regression equation.

Slide 3

Stepwise multiple regression

Stepwise regression is designed to find the most parsimonious set of predictors that are most effective in predicting the dependent variable.

Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables.

The process of adding more variables stops when all of the available variables have been included or when it is not possible to make a statistically significant improvement in R² using any of the variables not yet included.

Since variables will not be added to the regression equation unless they make a statistically significant addition to the analysis, all of the independent variables selected for inclusion will have a statistically significant relationship to the dependent variable.
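Because the slides describe the selection rule only in prose, a small sketch may help make it concrete. The following is a minimal Python illustration, not SPSS's exact procedure: df, dv, and candidates are assumed names for a pandas data set, the entry criterion is the .05 significance level used in these problems, and the removal step that SPSS also performs is omitted for brevity.

import statsmodels.api as sm

def forward_stepwise(df, dv, candidates, p_enter=0.05):
    included = []
    while True:
        remaining = [v for v in candidates if v not in included]
        if not remaining:
            break
        # p-value of each candidate's coefficient when added to the current model
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[included + [var]])
            pvals[var] = sm.OLS(df[dv], X, missing='drop').fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] > p_enter:
            break  # no remaining variable makes a significant improvement
        included.append(best)
    return included  # entry order doubles as a measure of relative importance

For the sample problem that follows, forward_stepwise(df, 'rincom98', ['hrs1', 'wrkslf', 'prestg80']) would be expected to select prestg80 first and hrs1 second, mirroring the two SPSS models shown later.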

Slide 4

Differences in statistical outputs

Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new step or model, i.e. there will be one model and result for each variable included in the analysis.

SPSS provides a table of variables included in the analysis and a table of variables excluded from the analysis. It is possible that none of the variables will be included. It is possible that all of the variables will be included.

The order of entry of the variables can be used as a measure of relative importance.

Once a variable is included, its interpretation in stepwise regression is the same as it would be using other methods for including regression variables.

Slide 5

Differences in solving stepwise regression problems

The level of significance for the analysis is included in the specifications for the statistical analysis. While we will use 0.05 as the level of significance for our problems, a different level of significance can be entered in the SPSS Options dialog box.

The preferred sample size requirement is larger for stepwise regression, i.e. 50 x the number of independent variables.

Stepwise procedures are notorious for over-fitting the sample to the detriment of generalizability. Validation analysis is absolutely necessary. If generalizability is compromised, it is permissible to interpret the variables included in the 75% training analysis (though we will not do this in our problems).

While multicollinearity for all variables can be examined, it is really only a problem for the variables not included in the analysis. If a variable is included in the stepwise analysis, it will not have a collinear relationship.

Slide 6

A stepwise regression problem

When the problem asks us to identify the best set of predictors, we will do stepwise multiple regression.

Multiple regression is feasible if the dependent variable is metric and the independent variables (both predictors and controls) are metric or dichotomous, and the available data is sufficient to satisfy the sample size requirements.

Slide 7

Level of measurement - answer


Stepwise multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous.

True with caution is the correct answer.

Slide 8

Sample size - question

The second question asks about the sample size requirements for multiple regression.

To answer this question, we will run the initial or baseline multiple regression to obtain some basic data about the problem and solution.

Slide 9

The baseline regression - 1

After we check for violations of assumptions and outliers, we will make a decision whether we should interpret the model that includes the transformed variables and omits outliers (the revised model), or whether we will interpret the model that uses the untransformed variables and includes all cases including the outliers (the baseline model).

In order to make this decision, we run the baseline regression before we examine assumptions and outliers, and record the R² for the baseline model. If using transformations and omitting outliers substantially improves the analysis (a 2% increase in R²), we interpret the revised model. If the increase is smaller, we interpret the baseline model.

To run the baseline model, select Regression | Linear from the Analyze menu.

Slide 10

The baseline regression - 2


First, move the dependent variable rincom98 to the Dependent text box.

Second, move the independent variables hrs1, wrkslf, and prestg80 to the Independent(s) list box.

Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we select Stepwise to request the best subset of variables.

Slide 11

The baseline regression - 3

Click on the Statistics button to specify the statistics options that we want.

Slide 12

The baseline regression - 4


First, mark the checkboxes for Estimates on the Regression Coefficients panel.

Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us the contribution of each additional variable that the stepwise procedure adds to the analysis.

Third, mark the Durbin-Watson statistic on the Residuals panel.

Fourth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity.

Fifth, click on the Continue button to close the dialog box.

Slide 13

The baseline regression - 5

Next, we need to specify the statistical criteria to use for including variables in the analysis.

Click on the Options button.

Slide 14

The baseline regression - 6

First, the default level of significance for entering variables into the regression equation is .05. Since that is the alpha level for our problem, we do not need to make any change.

The criterion for removing a variable from the analysis is usually set at twice the level for including variables.

Second, click on the Continue button to close the dialog box.

Slide 15

The baseline regression - 7

Click on the OK button to request the regression output.

Slide 16

R² for the baseline model


The R² of 0.257 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.

Prior to any transformations of variables to satisfy the assumptions of multiple regression or the removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 25.7%. In stepwise regression, the relationship will always be significant if any variables are included, because variables can only be included if they contribute to a statistically significant relationship.

In stepwise regression, the model number corresponds to the number of variables included in the stepwise analysis. Two variables are included in this problem.

Slide 17

Sample size evidence and answer

Descriptive Statistics

Variable     Mean     Std. Deviation    N
RINCOM98     13.94    5.287             145
HRS1         41.22    12.776            145
WRKSLF       1.88     .331              145
PRESTG80     45.96    14.174            145

Stepwise multiple regression requires that the minimum ratio of valid cases to independent variables be at least 5 to 1. The ratio of valid cases (145) to number of independent variables (3) was 48.3 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.

However, the ratio of 48.3 to 1 did not satisfy the preferred ratio of 50 cases per independent variable. A caution should be added to the interpretation of the analysis and validation analysis should be conducted.

True with caution is the correct answer.
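The ratio arithmetic behind this answer is easy to verify directly; a minimal sketch in Python:

# Cases-to-variables ratios quoted above (145 valid cases, 3 independent variables)
n_cases, n_ivs = 145, 3
ratio = n_cases / n_ivs    # 48.3 to 1
print(ratio >= 5)          # True: minimum 5-to-1 requirement satisfied
print(ratio >= 50)         # False: preferred 50-to-1 ratio not met, so add a caution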

Slide 18

Assumption of normality for the dependent variable - question

Having satisfied the level of measurement and sample size requirements, we turn our attention to conformity with three of the assumptions of multiple regression: normality, linearity, and homoscedasticity.

First, we will evaluate the assumption of normality for the dependent variable.

Slide 19

Run the script to test normality


First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement.

Second, click on the Assumption of Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality.

Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption.

Fourth, click on the OK button to produce the output.

Slide 20

Normality of the dependent variable: respondents income
Descriptives: RESPONDENTS INCOME

                                          Statistic    Std. Error
Mean                                      13.35        .419
95% Confidence Interval for Mean
  Lower Bound                             12.52
  Upper Bound                             14.18
5% Trimmed Mean                           13.54
Median                                    15.00
Variance                                  29.535
Std. Deviation                            5.435
Minimum                                   1
Maximum                                   23
Range                                     22
Interquartile Range                       8.00
Skewness                                  -.686        .187
Kurtosis                                  -.253        .373

The dependent variable "income" [rincom98] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.686) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.253) was between -1.0 and +1.0.

True is the correct answer.
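The same skewness/kurtosis screen can be reproduced outside SPSS. A minimal sketch, assuming the scores are held in an array with missing cases already dropped; scipy's default estimators differ slightly from SPSS's bias-corrected statistics, so treat the results as approximate.

from scipy.stats import skew, kurtosis

def normal_enough(values, bound=1.0):
    # kurtosis() returns excess (Fisher) kurtosis, the same convention SPSS reports
    return abs(skew(values)) <= bound and abs(kurtosis(values)) <= bound

# For rincom98 the slide reports skewness -0.686 and kurtosis -0.253,
# so normal_enough() would be expected to return True.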

Slide 21

Normality of the independent variable: hrs1

Next, we will evaluate the assumption of normality for the independent variable, number of hours worked in the past week.

Slide 22

Normality of the independent variable: number of hours worked in the past week
Descriptives: NUMBER OF HOURS WORKED LAST WEEK

                                          Statistic    Std. Error
Mean                                      40.99        .958
95% Confidence Interval for Mean
  Lower Bound                             39.10
  Upper Bound                             42.88
5% Trimmed Mean                           41.21
Median                                    40.00
Variance                                  161.491
Std. Deviation                            12.708
Minimum                                   4
Maximum                                   80
Range                                     76
Interquartile Range                       10.00
Skewness                                  -.324        .183
Kurtosis                                  .935         .364

The independent variable "number of hours worked in the past week" [hrs1] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.324) was between -1.0 and +1.0 and the kurtosis of the distribution (0.935) was between -1.0 and +1.0.

True is the correct answer.

Slide 23

Normality of the independent variable: prestg80

Finally, we will evaluate the assumption of normality for the independent variable, "occupational prestige score" [prestg80].

Slide 24

Normality of the second independent variable: occupational prestige score
Descriptives: RS OCCUPATIONAL PRESTIGE SCORE (1980)

                                          Statistic    Std. Error
Mean                                      44.17        .873
95% Confidence Interval for Mean
  Lower Bound                             42.45
  Upper Bound                             45.89
5% Trimmed Mean                           43.82
Median                                    43.00
Variance                                  194.196
Std. Deviation                            13.935
Minimum                                   17
Maximum                                   86
Range                                     69
Interquartile Range                       18.00
Skewness                                  .401         .153
Kurtosis                                  -.630        .304

The independent variable "occupational prestige score" [prestg80] satisfied the criteria for a normal distribution. The skewness of the distribution (0.401) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.630) was between -1.0 and +1.0.

True is the correct answer.

Slide 25

Assumption of linearity for respondents income and number of hours worked last week - question

All of the metric variables included in the analysis satisfied the assumption of normality. Next we will test the relationships for linearity.

Slide 26

Run the script to test linearity

First, click on the Assumption of Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity.

When the linearity option is selected, a default set of transformations to test is marked.

Second, click on the OK button to produce the output.

Slide 27

Linearity test: respondents income and number of hours worked last week
Correlations with RESPONDENTS INCOME

Variable                                                    Pearson r    Sig. (2-tailed)    N
NUMBER OF HOURS WORKED LAST WEEK                            .337**       .000               149
Logarithm of Reflected Values of HRS1 [LG10(81-HRS1)]       -.231**      .005               149
Square Root of Reflected Values of HRS1 [SQRT(81-HRS1)]     -.303**      .000               149
Inverse of Reflected Values of HRS1 [-1/(81-HRS1)]          -.059        .475               149

**. Correlation is significant at the 0.01 level (2-tailed).

The correlation between "number of hours worked in the past week" and "income" was statistically significant (r=.337, p<0.001). A linear relationship exists between these variables.

True is the correct answer.
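A minimal sketch of this linearity screen in Python, assuming numpy arrays y (income) and x (hrs1) with missing cases already dropped; 81 is the reflection constant from the script output above (the largest hrs1 value, 80, plus 1):

import numpy as np
from scipy.stats import pearsonr

def linearity_screen(y, x, reflect=81):
    # Correlate the DV with the IV and with the standard transformations;
    # hrs1 is reflected (81 - hrs1) before transforming because it is
    # negatively skewed.
    forms = {
        'raw': x,
        'log10(81-x)': np.log10(reflect - x),
        'sqrt(81-x)': np.sqrt(reflect - x),
        '-1/(81-x)': -1.0 / (reflect - x),
    }
    return {name: pearsonr(y, v) for name, v in forms.items()}  # (r, p) pairs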

Slide 28

Assumption of linearity for respondents income and occupational prestige score - question

All of the metric variables included in the analysis satisfied the assumption of normality. Next we will test the relationships for linearity.

Slide 29

Linearity test: respondents income and occupational prestige score

Correlations with RESPONDENTS INCOME

Variable                                      Pearson r    Sig. (2-tailed)    N
RS OCCUPATIONAL PRESTIGE SCORE (1980)         .440**       .000               168
Logarithm of PRESTG80 [LG10(PRESTG80)]        .436**       .000               168
Square Root of PRESTG80 [SQRT(PRESTG80)]      .440**       .000               168
Inverse of PRESTG80 [-1/(PRESTG80)]           .414**       .000               168

**. Correlation is significant at the 0.01 level (2-tailed).

The correlation between "occupational prestige score" and "income" was statistically significant (r=.440, p<0.001). A linear relationship exists between these variables.

True is the correct answer.

Slide 30

Assumption of homogeneity of variance - question

Self-employment is the only dichotomous independent variable in the analysis. We will test it for homogeneity of variance using income as the dependent variable.

Slide 31

Run the script to test homogeneity of variance

First, click on the Assumption of Homogeneity option button to request that SPSS produce the output needed to evaluate the assumption of homogeneity of variance.

When the homogeneity of variance option is selected, a default set of transformations to test is marked.

Second, click on the OK button to produce the output.

Slide 32

Assumption of homogeneity of variance

Based on the Levene Test, the variance in "income" [rincom98] is homogeneous for the categories of "self-employment" [wrkslf]. The probability associated with the Levene Statistic (p=0.076) is greater than the level of significance (0.01), so we fail to reject the null hypothesis that the variance is equal across groups, and conclude that the homoscedasticity assumption is satisfied.

The homogeneity of variance assumption was satisfied. True is the correct answer.
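A minimal sketch of the same check with scipy's Levene test, assuming arrays y (income scores) and group (self-employment codes):

import numpy as np
from scipy.stats import levene

def homogeneity_ok(y, group, alpha=0.01):
    y, group = np.asarray(y, float), np.asarray(group)
    samples = [y[group == g] for g in np.unique(group)]
    stat, p = levene(*samples, center='mean')  # mean-centered form, as in SPSS
    return p > alpha  # True -> fail to reject equal variances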

Slide 33

Detection of outliers - question

In multiple regression, an outlier in the solution can be defined as a case that has a large residual because the equation did a poor job of predicting its value.

We will run the baseline regression again and have SPSS compute the standardized residual for each case. Cases with a standardized residual larger than +/-3.0 will be treated as outliers.
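A minimal sketch of this outlier screen, assuming numpy arrays y and X (the predictors, without a constant column):

import numpy as np
import statsmodels.api as sm

def outlier_cases(y, X, cutoff=3.0):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    # standardized residual: residual divided by the standard error of the
    # estimate, which is how SPSS computes the values it saves
    z = fit.resid / np.sqrt(fit.mse_resid)
    return np.flatnonzero(np.abs(z) > cutoff)  # empty array -> no outliers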

Slide 34

Re-running the baseline regression - 1

Having decided to use the baseline model for the interpretation of this analysis, the SPSS regression output was re-created.

To run the baseline model, select Regression | Linear from the Analyze menu.

Slide 35

Re-running the baseline regression - 2


First, move the dependent variable rincom98 to the Dependent text box.

Second, move the independent variables hrs1, wrkslf, and prestg80 to the Independent(s) list box.

Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we select Stepwise to request the best subset of variables.

Slide 36

Re-running the baseline regression - 3

Click on the Statistics button to specify the statistics options that we want.

Slide 37

Re-running the baseline regression - 4


First, mark the checkboxes for Estimates on the Regression Coefficients panel.

Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us the contribution of each additional variable that the stepwise procedure adds to the analysis.

Third, mark the Durbin-Watson statistic on the Residuals panel.

Fourth, mark the checkbox for the Casewise diagnostics, which will be used to identify outliers.

Fifth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity.

Sixth, click on the Continue button to close the dialog box.

Slide 38

Re-running the baseline regression - 5

Click on the Save button to save the standardized residuals to the data editor.

Slide 39

Re-running the baseline regression - 6

Mark the checkbox for Standardized Residuals so that SPSS saves a new variable in the data editor. We will use this variable to omit outliers in the revised regression model.

Click on the Continue button to close the dialog box.

Slide 40

Re-running the baseline regression - 7

Click on the OK button to request the regression output.

Slide 41

Outliers in the analysis


If cases have a standardized residual larger than +/-3.0, SPSS creates a table titled Casewise Diagnostics, in which it lists the cases and values that result in their being an outlier. If there are no outliers, SPSS does not print the Casewise Diagnostics table. There was no table for this problem. The answer to the question is true.

We can verify that all standardized residuals were less than +/-3.0 by looking at the minimum and maximum standardized residuals in the table of Residual Statistics. Both the minimum and maximum fell in the acceptable range.

Since there were no outliers, the correct answer is true.

Slide 42

Selecting the model to interpret - question

Since there were no transformations used and there were no outliers, we can use the baseline regression for our interpretation.

The correct answer is false.

Slide 43

Assumption of independence of errors - question

We can now check the assumption of independence of errors for the analysis we will interpret.

Slide 44

Assumption of independence of errors: evidence and answer
Model Summary (Dependent Variable: RINCOM98)

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .424a    .180        .174                 4.804
2        .507b    .257        .247                 4.588

Change Statistics
Model    R Square Change    F Change    df1    df2    Sig. F Change
1        .180               31.350      1      143    .000
2        .077               14.788      1      142    .000

Durbin-Watson = 1.866

a. Predictors: (Constant), PRESTG80
b. Predictors: (Constant), PRESTG80, HRS1

Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals or differences between the actual score for a case and the score estimated by the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 to 2.50.

The Durbin-Watson statistic for this problem is 1.866, which falls within the acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of independence of errors. True is the correct answer.

If the Durbin-Watson statistic was not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.
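A minimal sketch of the same check, assuming arrays y and X as in the outlier sketch earlier; statsmodels computes the Durbin-Watson statistic directly from the residuals:

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def errors_independent(y, X):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    dw = durbin_watson(fit.resid)  # ranges 0 to 4; near 2 means no serial correlation
    return 1.5 <= dw <= 2.5        # the rule-of-thumb acceptable range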

Slide 45

Multicollinearity - question

The final condition that can have an impact on our interpretation is multicollinearity.

Slide 46

Multicollinearity evidence and answer

Multicollinearity occurs when one independent variable is so strongly correlated with the other independent variables that it makes no independent contribution to predicting the dependent variable. Since multicollinearity will result in a variable not being included in the analysis, our examination of tolerances focuses on the table of excluded variables.

The tolerance values for all of the independent variables are larger than 0.10: "number of hours worked in the past week" [hrs1] (.954), "self-employment" [wrkslf] (.979) and "occupational prestige score" [prestg80] (.954).

Multicollinearity is not a problem in this regression analysis. True is the correct answer.
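A minimal sketch of how the tolerance values themselves are defined, assuming a 2-D numpy array X with one column per independent variable (two or more columns):

import numpy as np
import statsmodels.api as sm

def tolerances(X):
    # Tolerance for one IV is 1 - R^2 from regressing that IV on all the
    # other IVs; values above 0.10 indicate multicollinearity is not a problem.
    X = np.asarray(X, float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        out.append(1.0 - r2)  # VIF is 1 / tolerance
    return out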

Slide 47

Overall relationship between dependent variable and independent variables - question

The first finding we want to confirm concerns the overall relationship between the dependent variable and one or more of the independent variables.

Slide 48

Overall relationship between dependent variable and independent variables evidence and answer 1
Stepwise multiple regression was performed to identify the best predictors of the dependent variable "income" [rincom98] among the independent variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "occupational prestige score" [prestg80].
ANOVA (Dependent Variable: RINCOM98)

Model                 Sum of Squares    df     Mean Square    F         Sig.
1   Regression        723.647           1      723.647        31.350    .000a
    Residual          3300.795          143    23.082
    Total             4024.441          144
2   Regression        1034.982          2      517.491        24.581    .000b
    Residual          2989.460          142    21.053
    Total             4024.441          144

a. Predictors: (Constant), PRESTG80
b. Predictors: (Constant), PRESTG80, HRS1

Based on the results in the ANOVA table (F(2, 142) = 24.581, p<0.001), there was an overall relationship between the dependent variable "income" [rincom98] and one or more of the independent variables. Since the probability of the F statistic (p<0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the Multiple R for all independent variables was equal to 0 was rejected. The purpose of the analysis, to identify a relationship between some of the independent variables and the dependent variable, was supported.

Slide 49

Overall relationship between dependent variable and independent variables evidence and answer 2

The Multiple R for the relationship between the independent variables included in the analysis and the dependent variable was 0.507, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

The relationship between the independent variables and the dependent variable was correctly characterized as moderate.

True with caution is the correct answer. Caution in interpreting the relationship should be exercised because of the inclusion of ordinal variables and a cases to variables ratio less than 50:1.
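The rule of thumb quoted above translates directly into a small helper:

def strength(r):
    r = abs(r)
    if r <= 0.20:
        return 'very weak'
    if r <= 0.40:
        return 'weak'
    if r <= 0.60:
        return 'moderate'
    if r <= 0.80:
        return 'strong'
    return 'very strong'

print(strength(0.507))  # 'moderate', matching the characterization above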

Slide 50

Best subset of predictors - question

The next finding concerns the list of independent variables that are statistically significant.

Slide 51

Best subset of predictors evidence and answer


Coefficients (Dependent Variable: RINCOM98)

Model             B        Std. Error    Beta    t        Sig.    Tolerance    VIF
1  (Constant)     6.669    1.358                 4.911    .000
   PRESTG80       .158     .028          .424    5.599    .000    1.000        1.000
2  (Constant)     2.862    1.632                 1.754    .082
   PRESTG80       .135     .028          .363    4.898    .000    .954         1.049
   HRS1           .118     .031          .285    3.846    .000    .954         1.049

The best predictors of scores for the dependent variable "income" [rincom98] were "occupational prestige score" [prestg80] and "number of hours worked in the past week" [hrs1].

The variable "number of hours worked in the past week" [hrs1] was not included in the list of predictors in the question, so false is the correct answer.

Slide 52

Relationship of the first independent variable and the dependent variable - question

In the stepwise regression problems, we will focus on the entry order of the independent variables and the interpretation of the individual relationships of the independent variables to the dependent variable.

Slide 53

Relationship of the first independent variable and the dependent variable evidence and answer 1

In the table of variables entered and removed, "number of hours worked in the past week" [hrs1] was added to the regression equation in model 2. The increase in R Square as a result of including this variable was .077, which was statistically significant, F(1, 142) = 14.788, p<0.001.
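The significance test for that R Square change can be reproduced from the quantities on the slide: the F-change statistic is the R Square increment divided by its degrees of freedom, over the unexplained variance of the full model. Plugging in the rounded values approximately recovers F(1, 142) = 14.788 (the small gap is rounding in the R Square values).

# F change = (R2_change / df1) / ((1 - R2_full) / df2)
r2_full, r2_change, df1, df2 = 0.257, 0.077, 1, 142
f_change = (r2_change / df1) / ((1 - r2_full) / df2)
print(round(f_change, 2))  # ~14.72, versus 14.788 from SPSS's unrounded values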

Slide 54

Relationship of the first independent variable and the dependent variable evidence and answer 2
Coefficients (Dependent Variable: RINCOM98)

Model             B        Std. Error    Beta    t        Sig.    Tolerance    VIF
1  (Constant)     6.669    1.358                 4.911    .000
   PRESTG80       .158     .028          .424    5.599    .000    1.000        1.000
2  (Constant)     2.862    1.632                 1.754    .082
   PRESTG80       .135     .028          .363    4.898    .000    .954         1.049
   HRS1           .118     .031          .285    3.846    .000    .954         1.049

The b coefficient for the relationship between the dependent variable "income" [rincom98] and the independent variable "number of hours worked in the past week" [hrs1] was .118, which implies a direct relationship because the sign of the coefficient is positive. Higher numeric values for the independent variable "number of hours worked in the past week" [hrs1] are associated with higher numeric values for the dependent variable "income" [rincom98]. The statement in the problem that "survey respondents who worked longer hours in the past week had higher incomes" is correct.

True with caution is the correct answer. Caution in interpreting the relationship should be exercised because of a cases to variables ratio less than 50:1 and an ordinal variable treated as metric.

Slide 55

Relationship of the second independent variable and the dependent variable - question

Slide 56

Relationship of the second independent variable and the dependent variable evidence and answer

The independent variable "self-employment" [wrkslf] was not included in the regression equation. It did not increase the percentage of variance explained in the dependent variable by an amount large enough to be statistically significant.

False is the correct answer.

Slide 57

Relationship of the third independent variable and the dependent variable - question

Slide 58

Relationship of the third independent variable and the dependent variable evidence and answer 1

In the table of variables entered and removed, "occupational prestige score" [prestg80] was added to the regression equation in model 1. The increase in R Square as a result of including this variable was .180, which was statistically significant, F(1, 143) = 31.350, p<0.001.

Slide 59

Relationship of the third independent variable and the dependent variable evidence and answer 2

Coefficients (Dependent Variable: RINCOM98)

Model             B        Std. Error    Beta    t        Sig.    Tolerance    VIF
1  (Constant)     6.669    1.358                 4.911    .000
   PRESTG80       .158     .028          .424    5.599    .000    1.000        1.000
2  (Constant)     2.862    1.632                 1.754    .082
   PRESTG80       .135     .028          .363    4.898    .000    .954         1.049
   HRS1           .118     .031          .285    3.846    .000    .954         1.049

The b coefficient for the relationship between the dependent variable "income" [rincom98] and the independent variable "occupational prestige score" [prestg80] was .135, which implies a direct relationship because the sign of the coefficient is positive. Higher numeric values for the independent variable "occupational prestige score" [prestg80] are associated with higher numeric values for the dependent variable "income" [rincom98].

The statement in the problem that "survey respondents who had more prestigious occupations had lower incomes" is incorrect. The direction of the relationship is stated incorrectly.

False is the correct answer.

Slide 60

Validation analysis - question

The problem states the random number seed to use in the validation analysis.

Slide 61

Validation analysis: set the random number seed

Validate the results of your regression analysis by conducting a 75/25% cross-validation, using 200070 as the random number seed.

To set the random number seed, select the Random Number Seed command from the Transform menu.

Slide 62

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box.

Note that SPSS does not provide you with any feedback about the change.

Slide 63

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, click on the Compute command.

Slide 64

The formula for the split variable


First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75.

Third, click on the OK button to complete the dialog box.

If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false.
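A rough Python analogue of this split computation (numpy's random generator differs from SPSS's, so the actual case assignments will not match; only the logic carries over, and the 168 cases used here are for illustration):

import numpy as np

rng = np.random.default_rng(200070)                  # the problem's random number seed
split = (rng.uniform(size=168) <= 0.75).astype(int)  # 1 = training, 0 = validation
train_rows = np.flatnonzero(split == 1)              # ~75% training sample
valid_rows = np.flatnonzero(split == 0)              # ~25% validation sample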

Slide 65

The split variable in the data editor

In the data editor, the split variable shows a random pattern of zeros and ones. To select the cases for the training sample, we select the cases where split = 1.

Slide 66

Repeat the regression for the validation

To repeat the multiple regression analysis for the validation sample, select Regression | Linear from the Analyze menu.

Slide 67

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.

Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Slide 68

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.

Click on the Rule button to enter a value for split.

Slide 69

Completing the value selection

First, type the value for the training sample, 1, into the Value text box.

Second, click on the Continue button to complete the value entry.

Slide 70

Requesting output for the validation analysis

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

Click on the OK button to request the output.

Slide 71

Validation Overall Relationship


The validation analysis requires that the regression model for the 75% training sample replicate the pattern of statistical significance found for the full data set.

In the analysis of the 75% training sample, the relationship between the set of independent variables and the dependent variable was statistically significant, F(2, 105) = 20.195, p<0.001, as was the overall relationship in the analysis of the full data set, F(2, 142) = 24.581, p<0.001.

Slide 72

Validation - Relationship of Individual Independent Variables to Dependent Variable

In stepwise multiple regression, the pattern of individual relationships between the dependent variable and the independent variables will be the same if the same variables are selected as predictors for the analysis using the full data set and the analysis using the 75% training sample. In this analysis, the same two variables entered into the regression model: "occupational prestige score" [prestg80] and "number of hours worked in the past week" [hrs1].

Slide 73

Validation - Comparison of Training Sample and Validation Sample

The total proportion of variance explained in the model using the training sample was 27.8% (R = .527), compared to 70.7% (R = .841) for the validation sample. The value of R² for the validation sample was actually larger than the value of R² for the training sample, implying a better fit than obtained for the training sample. This supports a conclusion that the regression model would be effective in predicting scores for cases other than those included in the sample.

The validation analysis supported the generalizability of the findings of the analysis to the population represented by the sample in the data set. The answer to the question is true.
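The shrinkage comparison can be checked from the R values reported on this slide:

# Generalizability is supported when R2 for the validation sample is not
# more than 2% below R2 for the training sample.
r2_train = 0.527 ** 2   # ~0.278
r2_valid = 0.841 ** 2   # ~0.707
print(r2_train - r2_valid <= 0.02)  # True: validation R2 is actually larger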

Slide 74

Steps in complete stepwise regression analysis

The following flow charts depict the process for solving the complete regression problem and determining the answer to each of the questions encountered in the complete analysis.

Text in italics (e.g. True, False, True with caution, Incorrect application of a statistic) represents the answers to each specific question.

Many of the steps in stepwise regression analysis are identical to the steps in standard regression analysis. Steps that are different are identified with a magenta background, with the specifics of the difference underlined.

Slide 75

Complete stepwise multiple regression analysis: level of measurement

Question: do variables included in the analysis satisfy the level of measurement requirements?

Is the dependent variable metric and the independent variables metric or dichotomous? (Examine all independent variables, controls as well as predictors.)
  No → Incorrect application of a statistic
  Yes → Ordinal variables included in the relationship?
    No → True
    Yes → True with caution

Slide 76

Complete stepwise multiple regression analysis: sample size

Question: Number of variables and cases satisfy sample size requirements?

Compute the baseline regression in SPSS.

Ratio of cases to independent variables at least 5 to 1? (Include both controls and predictors in the count of independent variables.)
  No → Inappropriate application of a statistic
  Yes → Ratio of cases to independent variables at preferred sample size of at least 50 to 1?
    Yes → True
    No → True with caution

Slide 77

Complete stepwise multiple regression analysis: assumption of normality

Question: each metric variable satisfies the assumption of normality?

Test the dependent variable and independent variables.

The variable satisfies criteria for a normal distribution?
  Yes → True
  No → False. Log, square root, or inverse transformation satisfies normality?
    Yes → Use transformation in revised model, no caution needed. (If more than one transformation satisfies normality, use the one with the smallest skew.)
    No → Use untransformed variable in analysis, add caution to interpretation for violation of normality.

Slide 78

Complete stepwise multiple regression analysis: assumption of linearity

Question: relationship between dependent variable and metric independent variable satisfies assumption of linearity?

If the dependent variable was transformed for normality, use the transformed dependent variable in the test for linearity. If the independent variable was transformed to satisfy normality, skip the check for linearity.

Probability of Pearson correlation (r) <= level of significance?
  Yes → True
  No → Probability of correlation (r) for relationship with any transformation of IV <= level of significance? (If more than one transformation satisfies linearity, use the one with the largest r.)
    Yes → Use transformation in revised model
    No → Weak relationship. No caution needed

Slide 79

Complete stepwise multiple regression analysis: assumption of homogeneity of variance

Question: variance in dependent variable is uniform across the categories of a dichotomous independent variable?

If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of homogeneity of variance.

Probability of Levene statistic <= level of significance?
  No → True
  Yes → False. Do not test transformations of the dependent variable; add caution to interpretation for violation of homoscedasticity.

Slide 80

Complete stepwise multiple regression analysis: detecting outliers

Question: After incorporating any transformations, no outliers were detected in the regression analysis.

If any variables were transformed for normality or linearity, substitute the transformed variables in the regression for the detection of outliers.

Is the standardized residual for any case greater than +/-3.00?
  Yes → False. Remove outliers and run revised regression again.
  No → True

Slide 81

Complete stepwise multiple regression analysis: picking regression model for interpretation

Question: interpretation based on model that includes transformation of variables and removes outliers?

R² for revised regression greater than R² for baseline regression by 2% or more?
  Yes → Pick revised regression with transformations and omitting outliers for interpretation: True
  No → Pick baseline regression with untransformed variables and all cases for interpretation: False

Slide 82

Complete stepwise multiple regression analysis: assumption of independence of errors

Question: serial correlation of errors is not a problem in this regression analysis?

Residuals are independent, Durbin-Watson between 1.5 and 2.5?
  Yes → True
  No → False (NOTE: add a caution for violation of the assumption of independence of errors)

Slide 83

Complete stepwise multiple regression analysis: multicollinearity

Question: Multicollinearity is not a problem in this regression analysis?

Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
  Yes → True
  No → False (NOTE: halt the analysis if it is not okay to simply exclude the variable from the analysis)

Slide 84

Complete stepwise multiple regression analysis: overall relationship

Question: Finding about overall relationship between dependent variable and independent variables.

Probability of F test of regression for last model <= level of significance?
  No → False
  Yes → Strength of relationship for included variables interpreted correctly?
    No → False
    Yes → Small sample, ordinal variables, or violation of assumption in the relationship?
      No → True
      Yes → True with caution

Slide 85

Complete stepwise multiple regression analysis: subset of best predictors

Question: Finding about list of best subset of predictors?

Listed variables match variables in table of variables entered/removed?
  No → False
  Yes → Small sample, ordinal variables, or violation of assumption in the relationship?
    No → True
    Yes → True with caution

Slide 86

Complete stepwise multiple regression analysis: individual relationships - 1

Question: Finding about individual relationship between independent variable and dependent variable.

Order of entry into regression equation stated correctly?
  No → False
  Yes → Significance of R² change for variable <= level of significance?
    No → False
    Yes → continue with the checks on the next slide

Slide 87

Complete stepwise multiple regression analysis: individual relationships - 2

Direction of relationship between included variables and DV interpreted correctly?
  No → False
  Yes → Small sample, ordinal variables, or violation of assumption in the relationship?
    No → True
    Yes → True with caution

Slide 88

Complete stepwise multiple regression analysis: validation analysis - 1

Question: The validation analysis supports the generalizability of the findings?

Set the random seed and randomly split the sample into a 75% training sample and a 25% validation sample.

Probability of ANOVA test for training sample <= level of significance?
  No → False
  Yes → continue with the checks on the next slide

Slide 89

Complete stepwise multiple regression analysis: validation analysis - 2

Same variables entered into regression equation in training sample?
  No → False
  Yes → Shrinkage in R² (R² for training sample - R² for validation sample) < 2%?
    Yes → True
    No → False

Slide 90

Homework Problems
Multiple Regression Stepwise Problems - 1

The stepwise regression homework problems parallel the complete standard regression problems and the complete hierarchical problems. The only assumption made in the problems is that there is no problem with missing data.

The complete stepwise multiple regression will include:
Testing assumptions of normality and linearity,
Testing for outliers,
Determining whether to use transformations or exclude outliers,
Testing for independence of errors,
Checking for multicollinearity, and
Validating the generalizability of the analysis.

Slide 91

Homework Problems
Multiple Regression Stepwise Problems - 2

The statement of the stepwise regression problem identifies the dependent variable and the independent variables from which we will extract a parsimonious subset.

Slide 92

Homework Problems
Multiple Regression Stepwise Problems - 3

The findings, which must all be correct for a problem to be true, include:
an ordered listing of the included independent variables,
an interpretive statement about each of the independent variables, and
a statement about the strength of the overall relationship.

Slide 93

Homework Problems
Multiple Regression Stepwise Problems - 4

The first prerequisite for a problem is the satisfaction of the level of measurement and minimum sample size requirements. Failing to satisfy either of these requirements results in an inappropriate application of a statistic.

Slide 94

Homework Problems
Multiple Regression Stepwise Problems - 5

The assumption of normality requires that each metric variable be tested. If the variable is not normal, transformations should be examined to see if we can improve the distribution of the variable. If transformations are unsuccessful, a caution is added to any true findings.

Slide 95

Homework Problems
Multiple Regression Stepwise Problems - 6

The assumption of linearity is examined for any metric independent variables that were not transformed for the assumption of normality.

Slide 96

Homework Problems
Multiple Regression Stepwise Problems - 7

After incorporating any transformations, we look for outliers using standardized residuals as the criterion.

Slide 97

Homework Problems
Multiple Regression Stepwise Problems - 8

We compare the results of the regression without transformations or exclusion of outliers to the model with transformations and outliers excluded to determine whether we will base our interpretation on the baseline or the revised analysis.

Slide 98

Homework Problems
Multiple Regression Stepwise Problems - 9

We test for the assumption of independence of errors and the presence of multicollinearity. If we violate the assumption of independence, we attach a caution to our findings. If there is a multicollinearity problem, we halt the analysis, since we may be reporting erroneous findings.

Slide 99

Homework Problems
Multiple Regression Stepwise Problems - 10

In stepwise regression, we interpret the R² for the overall relationship at the step or model when the last statistically significant variable was entered.

Slide 100

Homework Problems
Multiple Regression Stepwise Problems - 11

The primary purpose of stepwise regression is to identify the best subset of predictors and the order in which variables were included in the regression equation. The order tells us the relative importance of the predictors, i.e. best predictor, second best, and so on.

Slide 101

Homework Problems
Multiple Regression Stepwise Problems - 12

The relationships between predictor independent variables and the dependent variable stated in the problem must be statistically significant and worded correctly for the direction of the relationship. The interpretation of individual predictors is the same for standard, hierarchical, and stepwise regression.

Slide 102

Homework Problems
Multiple Regression Stepwise Problems - 13

We use a 75-25% validation strategy to support the generalizability of our findings. The validation must support:
the significance of the overall relationship,
the inclusion of the same variables in the validation model that were included in the full model, though not necessarily in the same order, and
a shrinkage in R² for the validation sample of not more than 2% less than the training sample.

Slide 103

Homework Problems
Multiple Regression Stepwise Problems - 14

Cautions are added as limitations to the analysis, if needed.
