
SW388R7
Data Analysis & Computers II

Slide 1

Stepwise Multiple Regression

Differences between stepwise and other methods of multiple regression
Sample problem
Steps in stepwise multiple regression
Homework Problems

Slide 2

Types of multiple regression

Different types of multiple regression are distinguished by the method for entering the independent variables into the analysis.

In standard (or simultaneous) multiple regression, all of the independent variables are entered into the analysis at the same time.

In hierarchical (or sequential) multiple regression, the independent variables are entered in an order prescribed by the analyst.

In stepwise (or statistical) multiple regression, the independent variables are entered according to their statistical contribution in explaining the variance in the dependent variable.

No matter what method of entry is chosen, a multiple regression that includes the same independent variables and the same dependent variable will produce the same multiple regression equation.

Slide 3

Stepwise multiple regression

Stepwise regression is designed to find the most parsimonious set of predictors that are most effective in predicting the dependent variable.

Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables.

The process of adding more variables stops when all of the available variables have been included or when it is not possible to make a statistically significant improvement in R² using any of the variables not yet included.

Since variables will not be added to the regression equation unless they make a statistically significant addition to the analysis, all of the independent variables selected for inclusion will have a statistically significant relationship to the dependent variable.
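Because the slides describe the selection rule only in prose, a small sketch may help make it concrete. The following is a minimal Python illustration, not SPSS's exact procedure: df, dv, and candidates are assumed names for a pandas data set, the entry criterion is the .05 significance level used in these problems, and the removal step that SPSS also performs is omitted for brevity.

import statsmodels.api as sm

def forward_stepwise(df, dv, candidates, p_enter=0.05):
    included = []
    while True:
        remaining = [v for v in candidates if v not in included]
        if not remaining:
            break
        # p-value of each candidate's coefficient when added to the current model
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[included + [var]])
            pvals[var] = sm.OLS(df[dv], X, missing='drop').fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] > p_enter:
            break  # no remaining variable makes a significant improvement
        included.append(best)
    return included  # entry order doubles as a measure of relative importance

For the sample problem that follows, forward_stepwise(df, 'rincom98', ['hrs1', 'wrkslf', 'prestg80']) would be expected to select prestg80 first and hrs1 second, mirroring the two SPSS models shown later.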

Slide 4

Differences in statistical outputs

Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new step or model, i.e. there will be one model and result for each variable included in the analysis.

SPSS provides a table of variables included in the analysis and a table of variables excluded from the analysis. It is possible that none of the variables will be included. It is possible that all of the variables will be included.

The order of entry of the variables can be used as a measure of relative importance.

Once a variable is included, its interpretation in stepwise regression is the same as it would be using other methods for including regression variables.

Slide 5

Differences in solving stepwise regression problems

The level of significance for the analysis is included in the specifications for the statistical analysis. While we will use 0.05 as the level of significance for our problems, a different level of significance can be entered in the SPSS Options dialog box.

The preferred sample size requirement is larger for stepwise regression, i.e. 50 x the number of independent variables.

Stepwise procedures are notorious for over-fitting the sample to the detriment of generalizability. Validation analysis is absolutely necessary. If generalizability is compromised, it is permissible to interpret the variables included in the 75% training analysis (though we will not do this in our problems).

While multicollinearity for all variables can be examined, it is really only a problem for the variables not included in the analysis. If a variable is included in the stepwise analysis, it will not have a collinear relationship.

Slide 6

A stepwise regression problem

When the problem asks us to identify the best set of predictors, we will do stepwise multiple regression.

Multiple regression is feasible if the dependent variable is metric and the independent variables (both predictors and controls) are metric or dichotomous, and the available data is sufficient to satisfy the sample size requirements.

Slide 7

Level of measurement - answer


Stepwise multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous.

True with caution is the correct answer.

Slide 8

Sample size - question

The second question asks about the sample size requirements for multiple regression.

To answer this question, we will run the initial or baseline multiple regression to obtain some basic data about the problem and solution.

Slide 9

The baseline regression - 1

After we check for violations of assumptions and outliers, we will make a decision whether we should interpret the model that includes the transformed variables and omits outliers (the revised model), or whether we will interpret the model that uses the untransformed variables and includes all cases including the outliers (the baseline model).

In order to make this decision, we run the baseline regression before we examine assumptions and outliers, and record the R² for the baseline model. If using transformations and omitting outliers substantially improves the analysis (a 2% increase in R²), we interpret the revised model. If the increase is smaller, we interpret the baseline model.

To run the baseline model, select Regression | Linear from the Analyze menu.

Slide 10

The baseline regression - 2


First, move the dependent variable rincom98 to the Dependent text box.

Second, move the independent variables hrs1, wrkslf, and prestg80 to the Independent(s) list box.

Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we select Stepwise to request the best subset of variables.

Slide 11

The baseline regression - 3

Click on the Statistics button to specify the statistics options that we want.

Slide 12

The baseline regression - 4


First, mark the checkboxes for Estimates on the Regression Coefficients panel.

Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us the contribution of each additional variable that the stepwise procedure adds to the analysis.

Third, mark the Durbin-Watson statistic on the Residuals panel.

Fourth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity.

Fifth, click on the Continue button to close the dialog box.

Slide 13

The baseline regression - 5

Next, we need to specify the statistical criteria to use for including variables in the analysis.

Click on the Options button.

Slide 14

The baseline regression - 6

First, the default level of significance for entering variables into the regression equation is .05. Since that is the alpha level for our problem, we do not need to make any change.

The criterion for removing a variable from the analysis is usually set at twice the level for including variables.

Second, click on the Continue button to close the dialog box.

Slide 15

The baseline regression - 7

Click on the OK button to request the regression output.

Slide 16

R² for the baseline model


The R² of 0.257 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.

Prior to any transformations of variables to satisfy the assumptions of multiple regression or the removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 25.7%. In stepwise regression, the relationship will always be significant if any variables are included, because variables can only be included if they contribute to a statistically significant relationship.

In stepwise regression, the model number corresponds to the number of variables included in the stepwise analysis. Two variables are included in this problem.

Slide 17

Sample size evidence and answer

Descriptive Statistics

Variable     Mean     Std. Deviation    N
RINCOM98     13.94    5.287             145
HRS1         41.22    12.776            145
WRKSLF       1.88     .331              145
PRESTG80     45.96    14.174            145

Stepwise multiple regression requires that the minimum ratio of valid cases to independent variables be at least 5 to 1. The ratio of valid cases (145) to number of independent variables (3) was 48.3 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.

However, the ratio of 48.3 to 1 did not satisfy the preferred ratio of 50 cases per independent variable. A caution should be added to the interpretation of the analysis and validation analysis should be conducted.

True with caution is the correct answer.
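The ratio arithmetic behind this answer is easy to verify directly; a minimal sketch in Python:

# Cases-to-variables ratios quoted above (145 valid cases, 3 independent variables)
n_cases, n_ivs = 145, 3
ratio = n_cases / n_ivs    # 48.3 to 1
print(ratio >= 5)          # True: minimum 5-to-1 requirement satisfied
print(ratio >= 50)         # False: preferred 50-to-1 ratio not met, so add a caution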

Slide 18

Assumption of normality for the dependent variable - question

Having satisfied the level of measurement and sample size requirements, we turn our attention to conformity with three of the assumptions of multiple regression: normality, linearity, and homoscedasticity.

First, we will evaluate the assumption of normality for the dependent variable.

Slide 19

Run the script to test normality


First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement.

Second, click on the Assumption of Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality.

Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption.

Fourth, click on the OK button to produce the output.

Slide 20

Normality of the dependent variable: respondents income
Descriptives: RESPONDENTS INCOME

                                          Statistic    Std. Error
Mean                                      13.35        .419
95% Confidence Interval for Mean
  Lower Bound                             12.52
  Upper Bound                             14.18
5% Trimmed Mean                           13.54
Median                                    15.00
Variance                                  29.535
Std. Deviation                            5.435
Minimum                                   1
Maximum                                   23
Range                                     22
Interquartile Range                       8.00
Skewness                                  -.686        .187
Kurtosis                                  -.253        .373

The dependent variable "income" [rincom98] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.686) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.253) was between -1.0 and +1.0.

True is the correct answer.
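The same skewness/kurtosis screen can be reproduced outside SPSS. A minimal sketch, assuming the scores are held in an array with missing cases already dropped; scipy's default estimators differ slightly from SPSS's bias-corrected statistics, so treat the results as approximate.

from scipy.stats import skew, kurtosis

def normal_enough(values, bound=1.0):
    # kurtosis() returns excess (Fisher) kurtosis, the same convention SPSS reports
    return abs(skew(values)) <= bound and abs(kurtosis(values)) <= bound

# For rincom98 the slide reports skewness -0.686 and kurtosis -0.253,
# so normal_enough() would be expected to return True.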

Slide 21

Normality of the independent variable: hrs1

Next, we will evaluate the assumption of normality for the independent variable, number of hours worked in the past week.

Slide 22

Normality of the independent variable: number of hours worked in the past week
Descriptives: NUMBER OF HOURS WORKED LAST WEEK

                                          Statistic    Std. Error
Mean                                      40.99        .958
95% Confidence Interval for Mean
  Lower Bound                             39.10
  Upper Bound                             42.88
5% Trimmed Mean                           41.21
Median                                    40.00
Variance                                  161.491
Std. Deviation                            12.708
Minimum                                   4
Maximum                                   80
Range                                     76
Interquartile Range                       10.00
Skewness                                  -.324        .183
Kurtosis                                  .935         .364

The independent variable "number of hours worked in the past week" [hrs1] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.324) was between -1.0 and +1.0 and the kurtosis of the distribution (0.935) was between -1.0 and +1.0.

True is the correct answer.

Slide 23

Normality of the independent variable: prestg80

Finally, we will evaluate the assumption of normality for the independent variable, "occupational prestige score" [prestg80].

Slide 24

Normality of the second independent variable: occupational prestige score
Descriptives: RS OCCUPATIONAL PRESTIGE SCORE (1980)

                                          Statistic    Std. Error
Mean                                      44.17        .873
95% Confidence Interval for Mean
  Lower Bound                             42.45
  Upper Bound                             45.89
5% Trimmed Mean                           43.82
Median                                    43.00
Variance                                  194.196
Std. Deviation                            13.935
Minimum                                   17
Maximum                                   86
Range                                     69
Interquartile Range                       18.00
Skewness                                  .401         .153
Kurtosis                                  -.630        .304

The independent variable "occupational prestige score" [prestg80] satisfied the criteria for a normal distribution. The skewness of the distribution (0.401) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.630) was between -1.0 and +1.0.

True is the correct answer.

Slide 25

Assumption of linearity for respondents income and number of hours worked last week - question

All of the metric variables included in the analysis satisfied the assumption of normality. Next we will test the relationships for linearity.

Slide 26

Run the script to test linearity

First, click on the Assumption of Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity.

When the linearity option is selected, a default set of transformations to test is marked.

Second, click on the OK button to produce the output.

Slide 27

Linearity test: respondents income and number of hours worked last week
Correlations with RESPONDENTS INCOME

Variable                                                    Pearson r    Sig. (2-tailed)    N
NUMBER OF HOURS WORKED LAST WEEK                            .337**       .000               149
Logarithm of Reflected Values of HRS1 [LG10(81-HRS1)]       -.231**      .005               149
Square Root of Reflected Values of HRS1 [SQRT(81-HRS1)]     -.303**      .000               149
Inverse of Reflected Values of HRS1 [-1/(81-HRS1)]          -.059        .475               149

**. Correlation is significant at the 0.01 level (2-tailed).

The correlation between "number of hours worked in the past week" and "income" was statistically significant (r=.337, p<0.001). A linear relationship exists between these variables.

True is the correct answer.
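A minimal sketch of this linearity screen in Python, assuming numpy arrays y (income) and x (hrs1) with missing cases already dropped; 81 is the reflection constant from the script output above (the largest hrs1 value, 80, plus 1):

import numpy as np
from scipy.stats import pearsonr

def linearity_screen(y, x, reflect=81):
    # Correlate the DV with the IV and with the standard transformations;
    # hrs1 is reflected (81 - hrs1) before transforming because it is
    # negatively skewed.
    forms = {
        'raw': x,
        'log10(81-x)': np.log10(reflect - x),
        'sqrt(81-x)': np.sqrt(reflect - x),
        '-1/(81-x)': -1.0 / (reflect - x),
    }
    return {name: pearsonr(y, v) for name, v in forms.items()}  # (r, p) pairs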

Slide 28

Assumption of linearity for respondents income and occupational prestige score - question

All of the metric variables included in the analysis satisfied the assumption of normality. Next we will test the relationships for linearity.

Slide 29

Linearity test: respondents income and occupational prestige score

Correlations with RESPONDENTS INCOME

Variable                                      Pearson r    Sig. (2-tailed)    N
RS OCCUPATIONAL PRESTIGE SCORE (1980)         .440**       .000               168
Logarithm of PRESTG80 [LG10(PRESTG80)]        .436**       .000               168
Square Root of PRESTG80 [SQRT(PRESTG80)]      .440**       .000               168
Inverse of PRESTG80 [-1/(PRESTG80)]           .414**       .000               168

**. Correlation is significant at the 0.01 level (2-tailed).

The correlation between "occupational prestige score" and "income" was statistically significant (r=.440, p<0.001). A linear relationship exists between these variables.

True is the correct answer.

Slide 30

Assumption of homogeneity of variance - question

Self-employment is the only dichotomous independent variable in the analysis. We will test it for homogeneity of variance using income as the dependent variable.

Slide 31

Run the script to test homogeneity of variance

First, click on the Assumption of Homogeneity option button to request that SPSS produce the output needed to evaluate the assumption of homogeneity of variance.

When the homogeneity of variance option is selected, a default set of transformations to test is marked.

Second, click on the OK button to produce the output.

Slide 32

Assumption of homogeneity of variance

Based on the Levene Test, the variance in "income" [rincom98] is homogeneous for the categories of "self-employment" [wrkslf]. The probability associated with the Levene Statistic (p=0.076) is greater than the level of significance (0.01), so we fail to reject the null hypothesis that the variance is equal across groups, and conclude that the homoscedasticity assumption is satisfied.

The homogeneity of variance assumption was satisfied. True is the correct answer.
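A minimal sketch of the same check with scipy's Levene test, assuming arrays y (income scores) and group (self-employment codes):

import numpy as np
from scipy.stats import levene

def homogeneity_ok(y, group, alpha=0.01):
    y, group = np.asarray(y, float), np.asarray(group)
    samples = [y[group == g] for g in np.unique(group)]
    stat, p = levene(*samples, center='mean')  # mean-centered form, as in SPSS
    return p > alpha  # True -> fail to reject equal variances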

Slide 33

Detection of outliers - question

In multiple regression, an outlier in the solution can be defined as a case that has a large residual because the equation did a poor job of predicting its value.

We will run the baseline regression again and have SPSS compute the standardized residual for each case. Cases with a standardized residual larger than +/-3.0 will be treated as outliers.
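A minimal sketch of this outlier screen, assuming numpy arrays y and X (the predictors, without a constant column):

import numpy as np
import statsmodels.api as sm

def outlier_cases(y, X, cutoff=3.0):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    # standardized residual: residual divided by the standard error of the
    # estimate, which is how SPSS computes the values it saves
    z = fit.resid / np.sqrt(fit.mse_resid)
    return np.flatnonzero(np.abs(z) > cutoff)  # empty array -> no outliers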

Slide 34

Re-running the baseline regression - 1

Having decided to use the baseline model for the interpretation of this analysis, the SPSS regression output was re-created.

To run the baseline model, select Regression | Linear from the Analyze menu.

Slide 35

Re-running the baseline regression - 2


First, move the dependent variable rincom98 to the Dependent text box.

Second, move the independent variables hrs1, wrkslf, and prestg80 to the Independent(s) list box.

Third, select the method for entering the variables into the analysis from the drop down Method menu. In this example, we select Stepwise to request the best subset of variables.

Slide 36

Re-running the baseline regression - 3

Click on the Statistics button to specify the statistics options that we want.

Slide 37

Re-running the baseline regression - 4


First, mark the checkboxes for Estimates on the Regression Coefficients panel.

Second, mark the checkboxes for Model Fit, Descriptives, and R squared change. The R squared change statistic will tell us the contribution of each additional variable that the stepwise procedure adds to the analysis.

Third, mark the Durbin-Watson statistic on the Residuals panel.

Fourth, mark the checkbox for the Casewise diagnostics, which will be used to identify outliers.

Fifth, mark the Collinearity diagnostics to get tolerance values for testing multicollinearity.

Sixth, click on the Continue button to close the dialog box.

Slide 38

Re-running the baseline regression - 5

Click on the Save button to save the standardized residuals to the data editor.

Slide 39

Re-running the baseline regression - 6

Mark the checkbox for Standardized Residuals so that SPSS saves a new variable in the data editor. We will use this variable to omit outliers in the revised regression model.

Click on the Continue button to close the dialog box.

Slide 40

Re-running the baseline regression - 7

Click on the OK button to request the regression output.

Slide 41

Outliers in the analysis


If cases have a standardized residual larger than +/-3.0, SPSS creates a table titled Casewise Diagnostics, in which it lists the cases and values that result in their being an outlier. If there are no outliers, SPSS does not print the Casewise Diagnostics table. There was no table for this problem. The answer to the question is true.

We can verify that all standardized residuals were less than +/-3.0 by looking at the minimum and maximum standardized residuals in the table of Residual Statistics. Both the minimum and maximum fell in the acceptable range.

Since there were no outliers, the correct answer is true.

Slide 42

Selecting the model to interpret - question

Since there were no transformations used and there were no outliers, we can use the baseline regression for our interpretation.

The correct answer is false.

Slide 43

Assumption of independence of errors - question

We can now check the assumption of independence of errors for the analysis we will interpret.

Slide 44

Assumption of independence of errors: evidence and answer
Model Summary (Dependent Variable: RINCOM98)

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .424a    .180        .174                 4.804
2        .507b    .257        .247                 4.588

Change Statistics
Model    R Square Change    F Change    df1    df2    Sig. F Change
1        .180               31.350      1      143    .000
2        .077               14.788      1      142    .000

Durbin-Watson = 1.866

a. Predictors: (Constant), PRESTG80
b. Predictors: (Constant), PRESTG80, HRS1

Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals or differences between the actual score for a case and the score estimated by the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 to 2.50.

The Durbin-Watson statistic for this problem is 1.866, which falls within the acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of independence of errors. True is the correct answer.

If the Durbin-Watson statistic was not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.
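A minimal sketch of the same check, assuming arrays y and X as in the outlier sketch earlier; statsmodels computes the Durbin-Watson statistic directly from the residuals:

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

def errors_independent(y, X):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    dw = durbin_watson(fit.resid)  # ranges 0 to 4; near 2 means no serial correlation
    return 1.5 <= dw <= 2.5        # the rule-of-thumb acceptable range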

Slide 45

Multicollinearity - question

The final condition that can have an impact on our interpretation is multicollinearity.

Slide 46

Multicollinearity evidence and answer

Multicollinearity occurs when one independent variable is so strongly correlated with the other independent variables that it makes no independent contribution to predicting the dependent variable. Since multicollinearity will result in a variable not being included in the analysis, our examination of tolerances focuses on the table of excluded variables.

The tolerance values for all of the independent variables are larger than 0.10: "number of hours worked in the past week" [hrs1] (.954), "self-employment" [wrkslf] (.979) and "occupational prestige score" [prestg80] (.954).

Multicollinearity is not a problem in this regression analysis. True is the correct answer.
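A minimal sketch of how the tolerance values themselves are defined, assuming a 2-D numpy array X with one column per independent variable (two or more columns):

import numpy as np
import statsmodels.api as sm

def tolerances(X):
    # Tolerance for one IV is 1 - R^2 from regressing that IV on all the
    # other IVs; values above 0.10 indicate multicollinearity is not a problem.
    X = np.asarray(X, float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        out.append(1.0 - r2)  # VIF is 1 / tolerance
    return out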

Slide 47

Overall relationship between dependent variable and independent variables - question

The first finding we want to confirm concerns the overall relationship between the dependent variable and one or more of the independent variables.

Slide 48

Overall relationship between dependent variable and independent variables evidence and answer 1
Stepwise multiple regression was performed to identify the best predictors of the dependent variable "income" [rincom98] among the independent variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "occupational prestige score" [prestg80].
ANOVA (Dependent Variable: RINCOM98)

Model                 Sum of Squares    df     Mean Square    F         Sig.
1   Regression        723.647           1      723.647        31.350    .000a
    Residual          3300.795          143    23.082
    Total             4024.441          144
2   Regression        1034.982          2      517.491        24.581    .000b
    Residual          2989.460          142    21.053
    Total             4024.441          144

a. Predictors: (Constant), PRESTG80
b. Predictors: (Constant), PRESTG80, HRS1

Based on the results in the ANOVA table (F(2, 142) = 24.581, p<0.001), there was an overall relationship between the dependent variable "income" [rincom98] and one or more of the independent variables. Since the probability of the F statistic (p<0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the Multiple R for all independent variables was equal to 0 was rejected. The purpose of the analysis, to identify a relationship between some of the independent variables and the dependent variable, was supported.

Slide 49

Overall relationship between dependent variable and independent variables evidence and answer 2

The Multiple R for the relationship between the independent variables included in the analysis and the dependent variable was 0.507, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

The relationship between the independent variables and the dependent variable was correctly characterized as moderate.

True with caution is the correct answer. Caution in interpreting the relationship should be exercised because of the inclusion of ordinal variables and a cases to variables ratio less than 50:1.
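The rule of thumb quoted above translates directly into a small helper:

def strength(r):
    r = abs(r)
    if r <= 0.20:
        return 'very weak'
    if r <= 0.40:
        return 'weak'
    if r <= 0.60:
        return 'moderate'
    if r <= 0.80:
        return 'strong'
    return 'very strong'

print(strength(0.507))  # 'moderate', matching the characterization above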

Slide 50

Best subset of predictors - question

The next finding concerns the list of independent variables that are statistically significant.

Slide 51

Best subset of predictors evidence and answer


Coefficients (Dependent Variable: RINCOM98)

Model             B        Std. Error    Beta    t        Sig.    Tolerance    VIF
1  (Constant)     6.669    1.358                 4.911    .000
   PRESTG80       .158     .028          .424    5.599    .000    1.000        1.000
2  (Constant)     2.862    1.632                 1.754    .082
   PRESTG80       .135     .028          .363    4.898    .000    .954         1.049
   HRS1           .118     .031          .285    3.846    .000    .954         1.049

The best predictors of scores for the dependent variable "income" [rincom98] were "occupational prestige score" [prestg80] and "number of hours worked in the past week" [hrs1].

The variable "number of hours worked in the past week" [hrs1] was not included in the list of predictors in the question, so false is the correct answer.

Slide 52

Relationship of the first independent variable and the dependent variable - question

In the stepwise regression problems, we will focus on the entry order of the independent variables and the interpretation of the individual relationships of the independent variables to the dependent variable.

Slide 53

Relationship of the first independent variable and the dependent variable evidence and answer 1

In the table of variables entered and removed, "number of hours worked in the past week" [hrs1] was added to the regression equation in model 2. The increase in R Square as a result of including this variable was .077, which was statistically significant, F(1, 142) = 14.788, p<0.001.
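The significance test for that R Square change can be reproduced from the quantities on the slide: the F-change statistic is the R Square increment divided by its degrees of freedom, over the unexplained variance of the full model. Plugging in the rounded values approximately recovers F(1, 142) = 14.788 (the small gap is rounding in the R Square values).

# F change = (R2_change / df1) / ((1 - R2_full) / df2)
r2_full, r2_change, df1, df2 = 0.257, 0.077, 1, 142
f_change = (r2_change / df1) / ((1 - r2_full) / df2)
print(round(f_change, 2))  # ~14.72, versus 14.788 from SPSS's unrounded values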

Slide 54

Relationship of the first independent variable and the dependent variable evidence and answer 2
Coefficients (Dependent Variable: RINCOM98)

Model             B        Std. Error    Beta    t        Sig.    Tolerance    VIF
1  (Constant)     6.669    1.358                 4.911    .000
   PRESTG80       .158     .028          .424    5.599    .000    1.000        1.000
2  (Constant)     2.862    1.632                 1.754    .082
   PRESTG80       .135     .028          .363    4.898    .000    .954         1.049
   HRS1           .118     .031          .285    3.846    .000    .954         1.049

The b coefficient for the relationship between the dependent variable "income" [rincom98] and the independent variable "number of hours worked in the past week" [hrs1] was .118, which implies a direct relationship because the sign of the coefficient is positive. Higher numeric values for the independent variable "number of hours worked in the past week" [hrs1] are associated with higher numeric values for the dependent variable "income" [rincom98]. The statement in the problem that "survey respondents who worked longer hours in the past week had higher incomes" is correct.

True with caution is the correct answer. Caution in interpreting the relationship should be exercised because of a cases to variables ratio less than 50:1 and an ordinal variable treated as metric.

Slide 55

Relationship of the second independent variable and the dependent variable - question

Slide 56

Relationship of the second independent variable and the dependent variable evidence and answer

The independent variable "self-employment" [wrkslf] was not included in the regression equation. It did not increase the percentage of variance explained in the dependent variable by an amount large enough to be statistically significant.

False is the correct answer.

Slide 57

Relationship of the third independent variable and the dependent variable - question

Slide 58

Relationship of the third independent variable and the dependent variable evidence and answer 1

In the table of variables entered and removed, "occupational prestige score" [prestg80] was added to the regression equation in model 1. The increase in R Square as a result of including this variable was .180, which was statistically significant, F(1, 143) = 31.350, p<0.001.

Slide 59

Relationship of the third independent variable and the dependent variable evidence and answer 2

Coefficients (Dependent Variable: RINCOM98)

Model             B        Std. Error    Beta    t        Sig.    Tolerance    VIF
1  (Constant)     6.669    1.358                 4.911    .000
   PRESTG80       .158     .028          .424    5.599    .000    1.000        1.000
2  (Constant)     2.862    1.632                 1.754    .082
   PRESTG80       .135     .028          .363    4.898    .000    .954         1.049
   HRS1           .118     .031          .285    3.846    .000    .954         1.049

The b coefficient for the relationship between the dependent variable "income" [rincom98] and the independent variable "occupational prestige score" [prestg80] was .135, which implies a direct relationship because the sign of the coefficient is positive. Higher numeric values for the independent variable "occupational prestige score" [prestg80] are associated with higher numeric values for the dependent variable "income" [rincom98].

The statement in the problem that "survey respondents who had more prestigious occupations had lower incomes" is incorrect. The direction of the relationship is stated incorrectly.

False is the correct answer.

Slide 60

Validation analysis - question

The problem states the random number seed to use in the validation analysis.

Slide 61

Validation analysis: set the random number seed

Validate the results of your regression analysis by conducting a 75/25% cross-validation, using 200070 as the random number seed.

To set the random number seed, select the Random Number Seed command from the Transform menu.

Slide 62

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box.

Note that SPSS does not provide you with any feedback about the change.

Slide 63

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, click on the Compute command.

Slide 64

The formula for the split variable


First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75.

Third, click on the OK button to complete the dialog box.

If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false.
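A rough Python analogue of this split computation (numpy's random generator differs from SPSS's, so the actual case assignments will not match; only the logic carries over, and the 168 cases used here are for illustration):

import numpy as np

rng = np.random.default_rng(200070)                  # the problem's random number seed
split = (rng.uniform(size=168) <= 0.75).astype(int)  # 1 = training, 0 = validation
train_rows = np.flatnonzero(split == 1)              # ~75% training sample
valid_rows = np.flatnonzero(split == 0)              # ~25% validation sample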

Slide 65

The split variable in the data editor

In the data editor, the split variable shows a random pattern of zeros and ones. To select the cases for the training sample, we select the cases where split = 1.

Slide 66

Repeat the regression for the validation

To repeat the multiple regression analysis for the validation sample, select Regression | Linear from the Analyze menu.

Slide 67

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.

Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Slide 68

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.

Click on the Rule button to enter a value for split.

Slide 69

Completing the value selection

First, type the value for the training sample, 1, into the Value text box.

Second, click on the Continue button to complete the value entry.

Slide 70

Requesting output for the validation analysis

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

Click on the OK button to request the output.

Slide 71

Validation Overall Relationship


The validation analysis requires that the regression model for the 75% training sample replicate the pattern of statistical significance found for the full data set.

In the analysis of the 75% training sample, the relationship between the set of independent variables and the dependent variable was statistically significant, F(2, 105) = 20.195, p<0.001, as was the overall relationship in the analysis of the full data set, F(2, 142) = 24.581, p<0.001.

Slide 72

Validation - Relationship of Individual Independent Variables to Dependent Variable

In stepwise multiple regression, the pattern of individual relationships between the dependent variable and the independent variables will be the same if the same variables are selected as predictors for the analysis using the full data set and the analysis using the 75% training sample. In this analysis, the same two variables entered into the regression model: "occupational prestige score" [prestg80] and "number of hours worked in the past week" [hrs1].

Slide 73

Validation - Comparison of Training Sample and Validation Sample

The total proportion of variance explained in the model using the training sample was 27.8% (R = .527), compared to 70.7% (R = .841) for the validation sample. The value of R² for the validation sample was actually larger than the value of R² for the training sample, implying a better fit than obtained for the training sample. This supports a conclusion that the regression model would be effective in predicting scores for cases other than those included in the sample.

The validation analysis supported the generalizability of the findings of the analysis to the population represented by the sample in the data set. The answer to the question is true.
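The shrinkage comparison can be checked from the R values reported on this slide:

# Generalizability is supported when R2 for the validation sample is not
# more than 2% below R2 for the training sample.
r2_train = 0.527 ** 2   # ~0.278
r2_valid = 0.841 ** 2   # ~0.707
print(r2_train - r2_valid <= 0.02)  # True: validation R2 is actually larger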

Slide 74

Steps in complete stepwise regression analysis

The following flow charts depict the process for solving the complete regression problem and determining the answer to each of the questions encountered in the complete analysis.

Text in italics (e.g. True, False, True with caution, Incorrect application of a statistic) represents the answers to each specific question.

Many of the steps in stepwise regression analysis are identical to the steps in standard regression analysis. Steps that are different are identified with a magenta background, with the specifics of the difference underlined.

Slide 75

Complete stepwise multiple regression analysis: level of measurement

Question: do variables included in the analysis satisfy the level of measurement requirements?

Is the dependent variable metric and the independent variables metric or dichotomous? (Examine all independent variables, controls as well as predictors.)
  No → Incorrect application of a statistic
  Yes → Ordinal variables included in the relationship?
    No → True
    Yes → True with caution

Slide 76

Complete stepwise multiple regression analysis: sample size

Question: Number of variables and cases satisfy sample size requirements?

Compute the baseline regression in SPSS.

Ratio of cases to independent variables at least 5 to 1? (Include both controls and predictors in the count of independent variables.)
  No → Inappropriate application of a statistic
  Yes → Ratio of cases to independent variables at preferred sample size of at least 50 to 1?
    Yes → True
    No → True with caution

Slide 77

Complete stepwise multiple regression analysis: assumption of normality

Question: each metric variable satisfies the assumption of normality?

Test the dependent variable and independent variables.

The variable satisfies criteria for a normal distribution?
  Yes → True
  No → False. Log, square root, or inverse transformation satisfies normality?
    Yes → Use transformation in revised model, no caution needed. (If more than one transformation satisfies normality, use the one with the smallest skew.)
    No → Use untransformed variable in analysis, add caution to interpretation for violation of normality.

Slide 78

Complete stepwise multiple regression analysis: assumption of linearity

Question: relationship between dependent variable and metric independent variable satisfies assumption of linearity?

If the dependent variable was transformed for normality, use the transformed dependent variable in the test for linearity. If the independent variable was transformed to satisfy normality, skip the check for linearity.

Probability of Pearson correlation (r) <= level of significance?
  Yes → True
  No → Probability of correlation (r) for relationship with any transformation of IV <= level of significance? (If more than one transformation satisfies linearity, use the one with the largest r.)
    Yes → Use transformation in revised model
    No → Weak relationship. No caution needed

Slide 79

Complete stepwise multiple regression analysis: assumption of homogeneity of variance

Question: variance in dependent variable is uniform across the categories of a dichotomous independent variable?

If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of homogeneity of variance.

Probability of Levene statistic <= level of significance?
  No → True
  Yes → False. Do not test transformations of the dependent variable; add caution to interpretation for violation of homoscedasticity.

Slide 80

Complete stepwise multiple regression analysis: detecting outliers

Question: After incorporating any transformations, no outliers were detected in the regression analysis.

If any variables were transformed for normality or linearity, substitute the transformed variables in the regression for the detection of outliers.

Is the standardized residual for any case greater than +/-3.00?
  Yes → False. Remove outliers and run revised regression again.
  No → True

Slide 81

Complete stepwise multiple regression analysis: picking regression model for interpretation

Question: interpretation based on model that includes transformation of variables and removes outliers?

R² for revised regression greater than R² for baseline regression by 2% or more?
  Yes → Pick revised regression with transformations and omitting outliers for interpretation: True
  No → Pick baseline regression with untransformed variables and all cases for interpretation: False

Slide 82

Complete stepwise multiple regression analysis: assumption of independence of errors

Question: serial correlation of errors is not a problem in this regression analysis?

Residuals are independent, Durbin-Watson between 1.5 and 2.5?
  Yes → True
  No → False (NOTE: add a caution for violation of the assumption of independence of errors)

Slide 83

Complete stepwise multiple regression analysis: multicollinearity

Question: Multicollinearity is not a problem in this regression analysis?

Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
  Yes → True
  No → False (NOTE: halt the analysis if it is not okay to simply exclude the variable from the analysis)

Slide 84

Complete stepwise multiple regression analysis: overall relationship

Question: Finding about overall relationship between dependent variable and independent variables.

Probability of F test of regression for last model <= level of significance?
  No → False
  Yes → Strength of relationship for included variables interpreted correctly?
    No → False
    Yes → Small sample, ordinal variables, or violation of assumption in the relationship?
      No → True
      Yes → True with caution

Slide 85

Complete stepwise multiple regression analysis: subset of best predictors

Question: Finding about list of best subset of predictors?

Listed variables match variables in table of variables entered/removed?
  No → False
  Yes → Small sample, ordinal variables, or violation of assumption in the relationship?
    No → True
    Yes → True with caution

Slide 86

Complete stepwise multiple regression analysis: individual relationships - 1

Question: Finding about individual relationship between independent variable and dependent variable.

Order of entry into regression equation stated correctly?
  No → False
  Yes → Significance of R² change for variable <= level of significance?
    No → False
    Yes → continue with the checks on the next slide

Slide 87

Complete stepwise multiple regression analysis: individual relationships - 2

Direction of relationship between included variables and DV interpreted correctly?
  No → False
  Yes → Small sample, ordinal variables, or violation of assumption in the relationship?
    No → True
    Yes → True with caution

Slide 88

Complete stepwise multiple regression analysis: validation analysis - 1

Question: The validation analysis supports the generalizability of the findings?

Set the random seed and randomly split the sample into a 75% training sample and a 25% validation sample.

Probability of ANOVA test for training sample <= level of significance?
  No → False
  Yes → continue with the checks on the next slide

Slide 89

Complete stepwise multiple regression analysis: validation analysis - 2

Same variables entered into regression equation in training sample?
  No → False
  Yes → Shrinkage in R² (R² for training sample - R² for validation sample) < 2%?
    Yes → True
    No → False

Slide 90

Homework Problems
Multiple Regression Stepwise Problems - 1

The stepwise regression homework problems parallel the complete standard regression problems and the complete hierarchical problems. The only assumption made in the problems is that there is no problem with missing data.

The complete stepwise multiple regression will include:
Testing assumptions of normality and linearity,
Testing for outliers,
Determining whether to use transformations or exclude outliers,
Testing for independence of errors,
Checking for multicollinearity, and
Validating the generalizability of the analysis.

Slide 91

Homework Problems
Multiple Regression Stepwise Problems - 2

The statement of the stepwise regression problem identifies the dependent variable and the independent variables from which we will extract a parsimonious subset.

Slide 92

Homework Problems
Multiple Regression Stepwise Problems - 3

The findings, which must all be correct for a problem to be true, include:
an ordered listing of the included independent variables,
an interpretive statement about each of the independent variables, and
a statement about the strength of the overall relationship.

Slide 93

Homework Problems
Multiple Regression Stepwise Problems - 4

The first prerequisite for a problem is the satisfaction of the level of measurement and minimum sample size requirements. Failing to satisfy either of these requirements results in an inappropriate application of a statistic.

Slide 94

Homework Problems
Multiple Regression Stepwise Problems - 5

The assumption of normality requires that each metric variable be tested. If the variable is not normal, transformations should be examined to see if we can improve the distribution of the variable. If transformations are unsuccessful, a caution is added to any true findings.

Slide 95

Homework Problems
Multiple Regression Stepwise Problems - 6

The assumption of linearity is examined for any metric independent variables that were not transformed for the assumption of normality.

Slide 96

Homework Problems
Multiple Regression Stepwise Problems - 7

After incorporating any transformations, we look for outliers using standardized residuals as the criterion.

Slide 97

Homework Problems
Multiple Regression Stepwise Problems - 8

We compare the results of the regression without transformations or exclusion of outliers to the model with transformations and outliers excluded to determine whether we will base our interpretation on the baseline or the revised analysis.

Slide 98

Homework Problems
Multiple Regression Stepwise Problems - 9

We test for the assumption of independence of errors and the presence of multicollinearity. If we violate the assumption of independence, we attach a caution to our findings. If there is a multicollinearity problem, we halt the analysis, since we may be reporting erroneous findings.

Slide 99

Homework Problems
Multiple Regression Stepwise Problems - 10

In stepwise regression, we interpret the R² for the overall relationship at the step or model when the last statistically significant variable was entered.

Slide 100

Homework Problems
Multiple Regression Stepwise Problems - 11

The primary purpose of stepwise regression is to identify the best subset of predictors and the order in which variables were included in the regression equation. The order tells us the relative importance of the predictors, i.e. best predictor, second best, and so on.

Slide 101

Homework Problems
Multiple Regression Stepwise Problems - 12

The relationships between predictor independent variables and the dependent variable stated in the problem must be statistically significant and worded correctly for the direction of the relationship. The interpretation of individual predictors is the same for standard, hierarchical, and stepwise regression.

Slide 102

Homework Problems
Multiple Regression Stepwise Problems - 13

We use a 75-25% validation strategy to support the generalizability of our findings. The validation must support:
the significance of the overall relationship,
the inclusion of the same variables in the validation model that were included in the full model, though not necessarily in the same order, and
a shrinkage in R² for the validation sample of not more than 2% less than the training sample.

Slide 103

Homework Problems
Multiple Regression Stepwise Problems - 14

Cautions are added as limitations to the analysis, if needed.
