
Stepwise Binary Logistic Regression

Stepwise Binary Logistic Regression - 1

Stepwise binary logistic regression is very similar to stepwise multiple regression in terms
of its advantages and disadvantages.

Stepwise logistic regression is designed to find the most parsimonious set of predictors
that are most effective in predicting the dependent variable.

Variables are added to the logistic regression equation one at a time, using the statistical
criterion of reducing the -2 Log Likelihood error for the included variables.

After each variable is entered, each of the included variables is tested to see if the model
would be better off if that variable were excluded. This does not happen often.

The process of adding more variables stops when all of the available variables have been
included or when it is not possible to make a statistically significant reduction in -2 Log
Likelihood using any of the variables not yet included.
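
A minimal sketch of this entry criterion is shown below, assuming a pandas data frame and statsmodels; the data frame, column names, and the omission of the backward removal test are simplifications, and SPSS's Forward:LR implementation is not reproduced exactly.

import numpy as np
import statsmodels.api as sm
from scipy import stats

def forward_stepwise_logit(df, dv, candidates, alpha_enter=0.05):
    """Add one predictor at a time, choosing the variable whose entry produces the
    largest statistically significant reduction in -2 Log Likelihood."""
    included = []
    while True:
        remaining = [v for v in candidates if v not in included]
        if not remaining:
            break
        base_exog = sm.add_constant(df[included]) if included else np.ones((len(df), 1))
        base = sm.Logit(df[dv], base_exog).fit(disp=0)
        best = None
        for var in remaining:
            trial = sm.Logit(df[dv], sm.add_constant(df[included + [var]])).fit(disp=0)
            lr = 2 * (trial.llf - base.llf)      # reduction in -2 Log Likelihood
            p = stats.chi2.sf(lr, df=1)          # 1 degree of freedom for one added variable
            if p <= alpha_enter and (best is None or lr > best[1]):
                best = (var, lr)
        if best is None:                         # no significant reduction is possible
            break
        included.append(best[0])
    return included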

Nonmetric variables are added to the logistic regression as a group. It is possible, and often
likely, that not all of the individual dummy-coded variables will have a statistically
significant individual relationship with the dependent variable. We limit our interpretation
to the dummy-coded variables that do have a statistically significant individual
relationship.

Stepwise Binary Logistic Regression - 2

SPSS provides a table of variables included in the analysis and a table of variables
excluded from the analysis. It is possible that none of the variables will be included. It is
possible that all of the variables will be included.

The order of entry of the variables can be used as a measure of relative importance.

Once a variable is included, its interpretation in stepwise logistic regression is the same as
it would be using other methods for including variables.

The number of cases required for stepwise logistic regression is greater than the number
for the other forms. We will use the norm of 20 cases for each independent variable,
double the recommendation of Hosmer and Lemeshow.

Pros and Cons of Stepwise Logistic Regression

Stepwise logistic regression can be used when the goal is to produce a predictive model
that is parsimonious and accurate because it excludes variables that do not contribute to
explaining differences in the dependent variable.

Stepwise logistic regression is less useful for testing hypotheses about statistical
relationships. It is widely regarded as atheoretical and its usage is not recommended.

Stepwise logistic regression can be useful in finding relationships that have not been
tested before. Its findings invite one to speculate on why an unusual relationship makes
sense.

It is not legitimate to do a stepwise logistic regression and present the results as though
one were testing a hypothesis that included the variables found to be significant in the
stepwise logistic regression.

Using statistical criteria to determine relationships is vulnerable to over-fitting the data set
used to develop the model at the expense of generalizability.

When stepwise logistic regression is used, some form of validation analysis is a necessity.
We will use 75/25% cross-validation.

75/25% Cross-validation

To do cross validation, we randomly split the data set into a 75% training sample and a
25% validation sample. We will use the training sample to develop the model, and we test
its effectiveness on the validation sample to test the applicability of the model to cases not
used to develop it.

For the validation to be successful, the following two questions must be answered affirmatively:


Did the stepwise logistic regression of the training sample produce the same subset of
predictors produced by the regression model of the full data set?
If yes, compare the classification accuracy rate for the 25% validation sample to the
classification accuracy rate for the 75% training sample. If the shrinkage (accuracy
for the 75% training sample - accuracy for the 25% validation sample) is 2% (0.02) or
less, we conclude that validation was successful.

Note: shrinkage may be a negative value, indicating that the accuracy rate for the
validation sample is larger than the accuracy rate for the training sample. Negative
shrinkage (increase in accuracy) is evidence of a successful validation analysis.
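
A minimal sketch of the shrinkage check, assuming the two accuracy rates have already been read from the SPSS classification tables (the numbers below are placeholders, not results from this problem):

def validation_successful(training_accuracy, validation_accuracy, max_shrinkage=0.02):
    # Shrinkage is the training accuracy minus the validation accuracy;
    # negative shrinkage (validation more accurate than training) also passes.
    shrinkage = training_accuracy - validation_accuracy
    return shrinkage <= max_shrinkage

print(validation_successful(0.70, 0.69))   # shrinkage = 0.01 -> True, validation successful
print(validation_successful(0.70, 0.66))   # shrinkage = 0.04 -> False, validation fails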

If the validation is successful, we base our interpretation on the model that included all
cases.

The Problem in Blackboard

The problem statement tells us:


the variables included in the analysis
the assumption that it is not necessary to omit outliers
whether each variable should be treated as metric or non-metric
the type of dummy coding and the reference category for non-metric variables
the alpha for both the statistical relationships and the diagnostic tests
the random number seed for the validation analysis

The Statement about Level of Measurement

The first statement in the problem asks about level of measurement. Stepwise binary
logistic regression requires that the dependent variable be dichotomous, the metric
independent variables be interval level, and the non-metric independent variables be
dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls
non-metric variables categorical.

SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided
it is useful to use either the first or last category as the reference category.

Marking the Statement about Level of Measurement


The dependent variable "attitude toward abortion when a woman
wants one for any reason" [abany] is dichotomous level, satisfying
the requirement for the dependent variable. variable.
The independent variable "age" [age] is interval level, satisfying the
requirement for independent variables.
The independent variable "highest year of school completed" [educ]
is interval level, satisfying the requirement for independent
variables.
The independent variable "income" [rincom98] is ordinal level, but
the problem calls for treating it as metric by applying the common
convention of treating ordinal variables as interval level.

The independent variable "socioeconomic index" [sei] is


interval level, satisfying the requirement for independent
variables
The independent variable "sex" [sex] is dichotomous level,
satisfying the requirement for independent variables.
The independent variable "respondent's degree of religious
fundamentalism" [fund] is ordinal level, which the problem
instructs us to dummy-code as a non-metric variable.
Mark the check box as a correct statement.

The statement about multicollinearity and other numerical problems

To check for multicollinearity, we run the binary logistic regression in SPSS and examine
the standard errors for the b coefficients.

Multicollinearity in the logistic regression solution is detected by examining the standard
errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems,
such as multicollinearity among the independent variables, cells with a zero count for a
dummy-coded independent variable because all of the subjects have the same value for the
variable, and 'complete separation' whereby the two groups in the dependent event variable
can be perfectly separated by scores on one of the independent variables. Analyses that
indicate numerical problems should not be interpreted.
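
A minimal sketch of this screen, assuming a fitted statsmodels Logit result object (the name result and the fitting code are hypothetical):

def has_numerical_problems(result, threshold=2.0):
    # result.bse holds the standard errors of the b coefficients;
    # any value above 2.0 suggests multicollinearity, a zero cell,
    # or complete separation, and the analysis should not be interpreted.
    return bool((result.bse > threshold).any())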

Running the Stepwise binary logistic regression

Select the Regression | Binary Logistic command from the Analyze menu.

Selecting the dependent variable

First, highlight the dependent variable abany in the list of variables. Second, click on the
right arrow button to move the dependent variable to the Dependent text box.

Selecting the independent variables

Move the independent variables stated in the problem ("age" [age], "highest year of school
completed" [educ], "income" [rincom98], "socioeconomic index" [sei], "sex" [sex], and
"respondent's degree of religious fundamentalism" [fund]) to the Covariates list box.

Declare the categorical variables - 1

To indicate that "sex" [sex]


and "respondent's degree
of religious
fundamentalism" [fund] are
categorical variables, we
click on the Categorical
button.

Declare the categorical variables - 2

Move the variables sex and fund to the Categorical Covariates list box.

SPSS assigns its default method for dummy-coding, Indicator coding, to each variable,
placing the name of the coding scheme in parentheses after each variable name.
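
A minimal sketch of indicator coding with the last category as the reference, using pandas (the category labels are hypothetical stand-ins for the values of fund):

import pandas as pd

fund = pd.Series(["fundamentalist", "moderate", "liberal", "moderate"], name="fund")
categories = ["fundamentalist", "moderate", "liberal"]         # last category = reference
dummies = pd.get_dummies(pd.Categorical(fund, categories=categories), prefix="fund")
dummies = dummies.drop(columns="fund_liberal")                  # the reference category gets no dummy
print(dummies)    # one indicator column each for fundamentalist and moderate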

Declare the categorical variables - 3

We accept the default of using the Indicator method for dummy-coding each variable. We
will also accept the default of using the last category as the reference category for each
variable.

Click on the Continue button to close the dialog box.

Specifying the method for including variables

Since the problem calls for a stepwise binary logistic regression, we select the Forward:LR
method for including variables. Forward:LR uses likelihood ratio tests to determine which
variables are entered in what order.

Requesting the output

Click on the OK
button to request
the output.

While optional
statistical output is
available, we do not
need to request any
optional statistics.

Checking for multicollinearity

The standard errors for the variables included in the stepwise procedure were: the standard
error for "highest year of school completed" [educ] was .09, the standard error for survey
respondents who said they were religiously fundamentalist was .56, and the standard error
for survey respondents who said they were religiously moderate was .48.

Marking the statement about multicollinearity and other numerical problems

Since none of the independent variables in this analysis had a standard error larger than
2.0, we mark the check box to indicate there was no evidence of multicollinearity.

The statement about sample size

Hosmer and Lemeshow, who wrote the widely used text on logistic regression, suggest that
the sample size should be 10 cases for every independent variable. Because stepwise
procedures tend to overfit the data at the expense of generalizability, we will double the
requirement to 20 cases for every independent variable.

The output for sample size


We find the number of cases
included in the analysis in
the Case Processing
Summary.

The 106 cases available for the analysis did not satisfy the recommended sample size of
140 (7 independent variables times 20 cases per variable), which is double the 10 cases per
independent variable recommended by Hosmer and Lemeshow, because of the issue of
over-fitting the data when using stepwise methods. The failure to meet the sample size
requirement should be mentioned as a limitation of the analysis. The number of
independent variables includes 4 metric variables and 3 dummy-coded variables.
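
A minimal sketch of the sample size check, restating the counts from this problem:

independent_variables = 4 + 3            # 4 metric variables plus 3 dummy-coded variables
required_n = independent_variables * 20  # double Hosmer and Lemeshow's 10 cases per variable
available_n = 106
print(required_n)                        # 140
print(available_n >= required_n)         # False -> requirement not met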

Marking the statement for sample size

Since we do not satisfy the sample size requirement, we leave the check box unmarked. We
should consider including this as a limitation to the analysis.

The stepwise relationship between the dependent and independent variables

Three statements in the problem list different combinations of the variables included in the
stepwise logistic regression. To determine which is correct, we look at the table of
Variables in the Equation for Block 1 in the SPSS output.

The output for the stepwise relationship

Two independent variables satisfied the statistical criteria for entry into the model. The
variable "highest year of school completed" [educ] had the largest individual impact
(entered on step 1) on the dependent variable "attitude toward abortion when a woman
wants one for any reason" [abany]. The second variable included in the model at step 2
was "respondent's degree of religious fundamentalism" [fund].

Marking the statement for stepwise relationship

Two independent variables satisfied the statistical criteria for entry into the model. The
variable "highest year of school completed" [educ] had the largest individual impact on the
dependent variable "attitude toward abortion when a woman wants one for any reason"
[abany]. The second variable included in the model was "respondent's degree of religious
fundamentalism" [fund].

We mark the first check box in the set of three.

Note that in stepwise logistic regression, if any variables are entered, the overall
relationship must be significant, since that is the criterion for including variables.

The statement about the relationship between education and abortion for any reason
Having satisfied the criteria for the
stepwise relationship, we examine the
findings for individual relationships with
the dependent variable. If the overall
relationship were not significant, we would
not interpret the individual relationships.

The first two statements offer alternative interpretations of the relationship between
education and abortion for any reason.

Output for the relationship between education and abortion for any reason

The probability of the Wald statistic for the independent variable "highest year of school
completed" [educ] (χ²(1, N = 106) = 5.48, p = .019) was less than or equal to the level of
significance of .05. The null hypothesis that the b coefficient for "highest year of school
completed" [educ] was equal to zero was rejected. The value of Exp(B) for the variable
"highest year of school completed" [educ] was 1.235, which implies an increase in the
odds of 23.5% (1.235 - 1.000 = .235). The statement that 'For each unit increase in
"highest year of school completed", survey respondents were 23.5% more likely to have
thought it should be possible for a woman to obtain a legal abortion if she wants it for any
reason' is correct.
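
A minimal sketch of converting Exp(B) into a percentage change in the odds, using the value reported for educ:

exp_b = 1.235                      # Exp(B) for educ from the Variables in the Equation table
pct_change = (exp_b - 1.0) * 100   # percentage change in odds per one-unit increase
print(round(pct_change, 1))        # 23.5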

Marking the statement for relationship between education and abortion for any reason

Since survey respondents were 23.5% more likely to have thought it should be possible for
a woman to obtain a legal abortion if she wants it for any reason, we mark the check box
for the second statement.

Statement for relationship between fundamentalism and abortion for any reason

The next two statements concern the relationship between the dummy-coded variable for
religiously fundamentalist respondents and abortion for any reason.

Output for relationship between fundamentalism and abortion for any reason

The probability of the Wald statistic for the independent variable survey respondents who
said they were religiously fundamentalist (χ²(1, N = 106) = 6.80, p = .009) was less than or
equal to the level of significance of .05. The null hypothesis that the b coefficient for
survey respondents who said they were religiously fundamentalist was equal to zero was
rejected. The value of Exp(B) for the variable survey respondents who said they were
religiously fundamentalist was .231, which implies a decrease in the odds of 76.9% (.231 -
1.000 = -.769). The statement that 'Survey respondents who said they were religiously
fundamentalist were 76.9% less likely to have thought it should be possible for a woman to
obtain a legal abortion if she wants it for any reason compared to those who said they were
religiously liberal' is correct.

Marking the relationship between fundamentalism and abortion for any reason

The statement that 'Survey respondents who said they were religiously fundamentalist
were 76.9% less likely to have thought it should be possible for a woman to obtain a legal
abortion if she wants it for any reason compared to those who said they were religiously
liberal' is correct. The first statement is marked.

Statement for relationship between fundamentalism and abortion for any reason

The next statement concerns the relationship between the dummy-coded variable for
religious moderation and abortion for any reason.

Output for relationship between fundamentalism and abortion for any reason

The probability of the Wald statistic for the independent variable survey respondents who
said they were religiously moderate (χ²(1, N = 106) = 2.87, p = .090) was greater than the
level of significance of .05. The null hypothesis that the b coefficient for survey
respondents who said they were religiously moderate was equal to zero was not rejected.
Being religiously moderate does not have an impact on the odds that survey respondents
thought it should be possible for a woman to obtain a legal abortion if she wants it for any
reason. The analysis does not support the statement that 'Survey respondents who said they
were religiously moderate were 56.0% less likely to have thought it should be possible for
a woman to obtain a legal abortion if she wants it for any reason compared to those who
said they were religiously liberal.'

Marking the relationship between fundamentalism and abortion for any reason

Since the relationship was not statistically significant, we do not mark the check box for
the statement.

Statement for relationship between socioeconomic index and abortion for any reason

The next statement concerns the relationship between the metric variable socioeconomic
index and abortion for any reason.

Output for relationship between socioeconomic index and abortion for any reason

The independent variable "socioeconomic index" [sei]


was not included among the statistically significant
predictors and should not be intepreted. The statement
that "For each unit increase in "socioeconomic index",
survey respondents were 10.5% more likely to have
thought it should be possible for a woman to obtain a
legal abortion if she wants it for any reason" is not
correct.

Marking the relationship between socioeconomic index and abortion for any reason

Since the relationship was not statistically significant, the check box for the statement is
not marked.

Statement about the usefulness of the model based on classification accuracy

The final statement concerns the usefulness of the logistic regression model. The
independent variables could be characterized as useful predictors distinguishing survey
respondents who thought it should be possible for a woman to obtain a legal abortion for
any reason from survey respondents who did not, if the classification accuracy rate was
substantially higher than the accuracy attainable by chance alone. Operationally, the
classification accuracy rate should be 25% or more higher than the proportional by chance
accuracy rate.

Computing proportional by-chance accuracy rate

At Block 0 with no
independent variables
in the model, all of the
cases are predicted to
be members of the
modal group, 0=NO in
this example.

The proportion in the largest group is 50.9% or .509. The proportion in the other group is
1.0 - 0.509 = .491.

The proportional by chance accuracy rate was computed by calculating the proportion of
cases for each group based on the number of cases in each group in the classification table
at Step 0, and then squaring and summing the proportion of cases in each group
(.509² + .491² = .500).

Output for the usefulness of the model based on classification accuracy


To be characterized as a useful model, the accuracy rate should be 25% higher than the by
chance accuracy rate. The by chance accuracy criterion is computed by multiplying the by
chance accuracy rate of .500 by 1.25, or 1.25 x .500 = .625 (62.5%).

The classification accuracy rate computed by SPSS was 67.9%, which was greater than or
equal to the proportional by chance accuracy criterion of 62.5% (1.25 x 50.0% = 62.5%).
The criterion for classification accuracy is satisfied.
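
A minimal sketch of the proportional by-chance accuracy criterion, using the group proportions from the Step 0 classification table:

p_largest = 0.509
by_chance = p_largest**2 + (1 - p_largest)**2   # .509^2 + .491^2 = .500
criterion = 1.25 * by_chance                    # .625, the 25%-higher threshold
model_accuracy = 0.679                          # classification accuracy reported by SPSS
print(model_accuracy >= criterion)              # True -> the model is characterized as useful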

Marking the statement for usefulness of the model

Since the criterion for classification accuracy was satisfied, the check box is marked.

Statement about Cross-validation

The findings from our analysis are generalizable to the extent that they are applicable to
cases not included in the analysis. Since we cannot collect new cases, we will divide our
sample into two subsets, using one subset to create the model and testing the findings on
the second subset of cases, which were not included in the analysis that created the model.

The final statement concerns the generalizability of our findings to the larger population.
To answer this question, we will do a 75/25% cross-validation.

Creating the Training Sample and the Validation Sample - 1


The 75/25% cross-validation requires that we randomly divide the cases for this analysis
into two parts: 75% of the cases will be used to run the stepwise logistic regression (the
training sample), which will be tested for accuracy on the remaining 25% of the cases (the
validation sample).

To set the seed for the random number generator, select Random Number Generator from
the Transform menu.

NOTE: you must use the random number seed that is stated in the problem in order to
produce the same results that I found. Any other seed will generate a different random
sequence that can produce results that are very different from mine.

Creating the Training Sample and the Validation Sample - 2

First, mark the check box for Set Starting Point. Second, select the option button for a
Fixed Value. Third, type the seed number provided in the problem directions: 981982.
Fourth, click on the OK button to complete the action.

NOTE: SPSS does not provide any feedback that the seed has been set or changed. If you
are in doubt, you can reopen the dialog box and see what it indicates.

Creating the Training Sample and the Validation Sample - 3


We will create a variable that will contain the information about whether a case is in the
training sample or the validation sample. We will name this variable split and use a value
of 1 to indicate the training sample and a value of 0 to indicate the validation sample.

To create the new variable, select Compute from the Transform menu.

Creating the Training Sample and the Validation Sample - 4

Type the name of the new variable, split, in the Target Variable text box. Type the formula
as shown in the Numeric Expression text box.

The formula uses the SPSS UNIFORM function to create a uniform distribution of decimal
numbers between 0 and 1. If the generated number for a case is less than or equal to 0.75,
the statement in the text box is true and the split variable will be assigned a 1 for that case.
If the generated number is larger than 0.75, the statement is false and the case will be
assigned a 0 for split.
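
A minimal sketch of the same logic in Python with numpy; numpy's random number generator differs from SPSS's, so even with the seed 981982 the resulting split will not match the SPSS split (the number of cases below is a placeholder):

import numpy as np

rng = np.random.default_rng(981982)                               # seed from the problem directions
n_cases = 270                                                     # placeholder number of rows
split = (rng.uniform(0.0, 1.0, size=n_cases) <= 0.75).astype(int) # 1 = training, 0 = validation
print(split.mean())                                               # roughly, not exactly, 0.75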

Click on the OK
button to create
the variable.

Creating the Training Sample and the Validation Sample - 5

If we scroll the data editor window to the right, we see the split variable in a new column.

Creating the Training Sample and the Validation Sample - 6

If we create a frequency distribution for the split variable, we see that the breakdown is
approximately, not exactly, correct. This is a consequence of generating random numbers:
we have no control over the sequence that is generated beyond setting an initial seed.

Though I have done it to create specific results for homework problems, it is not acceptable
to run repeated series of random numbers until one gets a sequence that has desirable
properties.

An Additional Task before Running the Stepwise Logistic Regression on Training Sample

Before we run the regression on the training sample, we need an additional step that will
enable us to compare the accuracy of the model for the training sample to the accuracy of
the model for the validation sample, using the classification accuracy rate for each as our
measure of accuracy.

We need to exclude from the analysis cases that are missing data for any of the variables
that we have designated as candidates for inclusion. If we don't specifically do this, SPSS
may include different cases in predicting values for the dependent variable than it does in
determining which variables to include in the model.

In model building, SPSS does listwise exclusion of missing data and omits any cases that
have missing data for any variable. In predicting scores on the dependent variable, it
excludes cases that are missing data for only the variables included in the stepwise model.
Thus, when selecting variables, SPSS assumes that only respondents who answer all
questions are valid cases; in predicting scores, it assumes that failing to answer a question
on a variable that is not included has no importance in the analysis.
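
A minimal sketch of the same listwise exclusion in pandas, with a small invented data frame standing in for the GSS file:

import numpy as np
import pandas as pd

data = pd.DataFrame({"abany": [1, 0, np.nan], "age": [25, 40, 33],
                     "educ": [12, np.nan, 16], "sex": [1, 2, 2],
                     "rincom98": [10, 12, 15], "fund": [1, 2, 3],
                     "sei": [45.0, 60.2, 33.1]})
analysis_vars = ["abany", "age", "educ", "sex", "rincom98", "fund", "sei"]
complete_cases = data.dropna(subset=analysis_vars)   # equivalent to NMISS(...) = 0 in Select Cases
print(len(complete_cases))                           # 1 of the 3 toy cases has no missing data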

Selecting Cases with Valid Data for All Variables in the Analysis - 1

To include only those cases that have valid data for all variables in the analysis, choose the
Select Cases command from the Data menu.

Selecting Cases with Valid Data for All Variables in the Analysis - 2
First, mark the
option button for If
condition is
satisfied.

Second, click on
the If button to
add the condition.

Selecting Cases with Valid Data for All Variables in the Analysis - 3

Type
NMISS(abany,age,educ,sex,rincom98,fund,sei) = 0
in the condition textbox. In the parentheses, we
type the names of the dependent variable and all of
the independent variables.

The SPSS NMISS function counts the number of variables in the list that have missing
data. Telling SPSS to include cases for which this calculation results in 0 indicates that the
case was not missing data for any of the variables.

Selecting Cases with Valid Data for All Variables in the Analysis - 4

Click on the
Continue button to
close the dialog box.

Selecting Cases with Valid Data for All Variables in the Analysis - 5

Click on the OK
button to
execute the
command.

Selecting Cases with Valid Data for All Variables in the Analysis - 6

The excluded cases have a slash through the case number.

Run the Stepwise Logistic Regression on the Training Sample - 1

To run the logistic regression, select Regression > Binary Logistic from the Analyze menu.

Run the Stepwise Logistic Regression on the Training Sample - 2


Move the dependent variable "attitude toward abortion when a woman wants one for any
reason" [abany] to the Dependent text box.

Move the independent variables stated in the problem ("age" [age], "highest year of school
completed" [educ], "income" [rincom98], "socioeconomic index" [sei], "sex" [sex], and
"respondent's degree of religious fundamentalism" [fund]) to the Covariates list box.

Run the Stepwise Logistic Regression on the Training Sample - 3

To indicate that "sex" [sex]


and "respondent's degree
of religious
fundamentalism" [fund] are
categorical variables, we
click on the Categorical
button.

Run the Stepwise Logistic Regression on the Training Sample - 4

Move the variables sex and fund to the Categorical Covariates list box.

Click on the Continue button to close the dialog box.

Run the Stepwise Logistic Regression on the Training Sample - 5

Since the problem calls for a stepwise binary logistic regression, we select the Forward:LR
method for including variables. Forward:LR uses likelihood ratio tests to determine which
variables are entered in what order.

Run the Stepwise Logistic Regression on the Training Sample - 6


To select the training sample, we move the split variable to the Selection Variable text box.
First, highlight the split variable. Second, click on the right arrow button to the left of the
Selection Variable text box.

Run the Stepwise Logistic Regression on the Training Sample - 7

Click on the Rule button to specify the value that we want split to use to select cases.

Run the Stepwise Logistic Regression on the Training Sample - 7

First, type 1 in
the Value text
box. Recall that
this is the value
of split indicating
training cases.

Second, click on the


Continue button to
close the dialog box.

Run the Stepwise Logistic Regression on the Training Sample - 8

Click on the OK
button to produce
the output.

Validating the Model - 1


The stepwise binary logistic regression of
the training sample resulted in the same
number of steps as the full sample model
(2).

If the number of
steps were different,
the validation would
fail.

Validating the Model - 2

The same variables were selected in the stepwise logistic regression of the training sample
that were selected in the stepwise logistic regression of the full sample: "highest year of
school completed" [educ] and "respondent's degree of religious fundamentalism" [fund].

If the variables
included were different,
the validation would
fail.

Validating the Model - 3


Third, we compare the accuracy of the model
for the validation sample to the accuracy of
the model for the training sample.

The classification accuracy rate for the model using the training sample was 67.9%,
compared to 72.7% for the validation sample. The classification accuracy for the validation
sample was actually larger than the classification accuracy for the training sample,
implying a better fit than obtained for the training sample. This supports a conclusion that
the logistic regression model based on this analysis would be effective in predicting scores
for cases other than those included in the sample.
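
A minimal sketch of the shrinkage computation for the accuracy rates reported above:

training_accuracy = 0.679
validation_accuracy = 0.727
shrinkage = training_accuracy - validation_accuracy   # -0.048: negative shrinkage
print(shrinkage <= 0.02)                              # True -> validation successful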

Marking the Check Box for the Cross-validation Statement

The validation analysis supported the generalizability of the findings of the analysis to the
population represented by the sample in the data set. We mark the check box for the
validation.

Stepwise Binary Logistic Regression: Level of Measurement

Level of measurement ok?
No: Do not mark the check box for level of measurement. Mark: Inappropriate application
of the statistic. Stop.
Yes: Mark the check box for level of measurement.

Ordinal level variable treated as metric?
No: Continue.
Yes: Consider the limitation in the discussion of findings.

Stepwise Binary Logistic Regression: Multicollinearity and Sample Size


Run the stepwise binary logistic regression, assuming that it is not necessary to remove any
outliers.

Multicollinearity/numerical problems (S.E. > 2.0)?
Yes: Do not mark the check box for no multicollinearity. Stop.
No: Mark the check box for no multicollinearity.

Adequate sample size (number of IVs x 20)?
Yes: Mark the check box for sample size.
No: Do not mark the check box for sample size. Consider the limitation in the discussion of
findings.

Logic Diagram for Solving Homework Problems: Stepwise Relationship

1+ variables entered in model? (Note: the model will be statistically significant if any
variables are entered.)
No: Stop (no significant predictors).
Yes: Continue to the next question.

Parsimonious subset of variables correctly identified?
No: Do not mark the check box for the correct subset.
Yes: Mark the check box for the correct subset.

Stepwise Binary Logistic Regression: Individual Relationships


For each of the variables included by the stepwise procedure:

Individual relationship significant (Wald Sig <= alpha)?
No: Do not mark the check box for the individual relationship.
Yes: Continue to the next question.

Correct interpretation of direction and strength of relationship?
No: Do not mark the check box for the individual relationship.
Yes: Mark the check box for the individual relationship.

Additional individual relationships to interpret?
Yes: Repeat for the next variable.
No: Continue.

Stepwise Binary Logistic Regression: Classification Accuracy

Classification accuracy equal to or greater than 1.25 x by chance accuracy rate?
No: Do not mark the check box for classification accuracy. Stop (the model does not meet
the criteria for usefulness).
Yes: Mark the check box for classification accuracy.

Stepwise Binary Logistic Regression: Cross-validation

Create the split variable using the specified seed.

Select cases with no missing values for all variables.

Run the stepwise logistic regression on the training sample.

Same variables entered as in the full model?
No: Do not mark the check box for supporting validation.
Yes: Continue to the next question.

Shrinkage for accuracy rate less than or equal to 2%?
No: Do not mark the check box for supporting validation.
Yes: Mark the check box for supporting validation.
