You are on page 1of 16

page 1

Chapter 20 Logistic Regression for Binary Response Variables

Response variable is binary, denoted by 0,1.

Example 1 Case Study 20.1: Survival in the Donner Party. What is the relationship between age
and sex of individuals in the Donner Party and whether or not they survived?

Example 2 Birthweight data: birthweights of 189 newborns were recorded along with a number
of other variables concerning the mother: weight, smoker or not, race, etc. Suppose we were
interested only in whether the baby was underweight (defined to be under 2500 gms) or not and
what characteristics of the mother, if any, are associated with having underweight babies.

Example 3 Pronghorn data: In a study of winter habitat selection by pronghorn in the Red Rim
area in south-central Wyoming (Source: Manly, McDonald, and Thomas 1993, Resource
Selection by Animals, Chapman and Hall, pp. 16-24; data from Ryder 1983), presence/absence of
pronghorn during winters of 1980-81 and 1981-82 were recorded for a systematic sample of 256
plots of 4 ha. each Other variables recorded for each plot were: density (in thousands/ha) of
sagebrush, black greasewood, Nuttal’s saltbrush, and Douglas rabbitbrush, and the slope,
distance to water, and aspect. What variables are most strongly associated with
presence/absence of pronghorn and can you formulate a model to predict the probability that
pronghorn will be present on a plot?

Could use linear regression to regress 0/1 response on quantitative and categorical explanatory
variables. What are the problems with this approach?

Better approach: logistic regression model

Y = 0,1 response variable


X 1 ,..., X p : explanatory variables

For a particular set of values of the explanatory variables,

µ (Y X 1 ,…, X p ) = π = proportion of 1’s


page 2
• Rather than model µ (Y ) = π as a linear function of the explanatory variables, model
⎛ π ⎞
logit(π) = ln⎜ ⎟ as a linear function of the explanatory variables; that is:
⎝1− π ⎠

logit(π) = β 0 + β1 X 1 + … + β p X p

• Why use logit(π)? Could you use another function?

Given the value of η = logit(π), can calculate


π=
1 + eη

Example 3: In the pronghorn example, suppose, hypothetically, the true relationship between
probability of use and distance to water (in meters) followed the logit model:

logit(π) = 3 - .0015Water

e 3 − .0015 Water
Then π = .
1 + e 3 − .0015Water

Calculate π for the following distances (note that it might be easier to compute 1-π =
1
, then subtract from 1),
3 − .0015Water
1+ e

Water = 100 m. 1000 m. 3000m.


page 3

1.0 0.8
Probability of use
0.4 0.6
0.2
0.0

0 1000 2000 3000 4000 5000


Distance to water (m)

• What is the interpretation of the coefficient -.0015 for the variable distance to water?

For every 1 m. increase in the distance to water, the log-odds of use decrease by .0015;
for every 1 km. increase in distance to water, log-odds of use decrease by 1.5.

• More meaningful: for every 1 m. increase in the distance to water, the odds of use change
by a multiplicative factor of e −.0015 = .999; for every 1 km. increase in distance to water,
the odds of use change by a multiplicative factor of e −1.5 = .223 (we could also reverse
these statements; for example, the odds of use increase by a factor of e1.5 = 4.48 for
every km. closer to water a plot is.)

Variance

The 0/1 response variable Y is a Bernoulli random variable:

µ (Y X 1 ,…, X p ) = π
SD( Y X 1 , …, X p ) = π (1 − π )

Variance of Y is not constant.

The logistic regression model is an example of a generalized linear model (in contrast to a
general linear model which is the usual regression model with normal errors and constant
variance). A generalized linear model is specified by:

1. a link function which specifies what function of µ (Y ) is a linear function of X 1 , … , X p .


In logistic regression, the link function is the logit function.
page 4
2. the distribution of Y for a fixed set of values of X 1 , … , X p . In the logit model, this is the
Bernoulli distribution.

The usual linear regression model is also a generalized linear model. The link function is the
identity: that is, f ( µ ) = µ , and the distribution is normal with constant variance.

There are other generalized linear models which are useful (e.g., Poisson response distribution
with log link function). A general methodology has been developed to fit and analyze these
models.

Estimation of Logistic Regression Coefficients


Estimation of parameters in linear regression model was by least squares. If we assume normal
errors with constant variance, the least squares estimators are the same as the maximum
likelihood estimators (MLE’s).

Maximum likelihood estimation is based on a simple principle: the estimates of the parameters in
a model are the values which maximize the probability (likelihood) of observing the sample data
we have.

Example: Suppose we select a random sample of 10 UM students in order to estimate what


proportion of students own a car. We find that 6 out of the 10 own a car. What is the MLE of
the proportion p of all students who won a car? What do we think it should turn out to be?

We model the responses from the students as 10 independent Bernoulli trials with probability of
success p on each trial. Then the total number of successes, say Y, in 10 trials follows a binomial
model:

⎛10 ⎞
Pr(Y = y ) = ⎜⎜ ⎟⎟ p y (1 − p)10− y , y = 0,1,…,10
⎝ y⎠

The maximum likelihood principle says to find the value of p which maximizes the probability
of observing the number of successes we actually observed in the sample; that is, find p to
maximize:

⎛10 ⎞
Pr(Y = 4) = ⎜⎜ ⎟⎟ p 4 (1 − p) 6
⎝4⎠

This is called the likelihood function. The following is a graph of Pr(Y=4) versus p. At what
value of p does it appear to be maximized?
page 5

0.25
0.20
0.15
Pr(Y = 4)
0.10
0.05
0.0

0.0 0.2 0.4 0.6 0.8 1.0


p

We can also find the exact solution using calculus. Note that finding the value of p to maximize
Pr(Y = 4) is equivalent to finding the value of p which maximizes
⎡⎛10 ⎞⎤
[ ⎡⎛10 ⎞⎤
]
ln[Pr(Y = 4)] = ln ⎢⎜⎜ ⎟⎟⎥ + ln p 4 (1 − p ) 6 = ln ⎢⎜⎜ ⎟⎟⎥ + 4 ln p + 6 ln(1 − p )
⎣⎝ 4 ⎠⎦ ⎣⎝ 4 ⎠⎦

Take the derivative with respect to p and setting it equal to 0:

Now, solve for p:

By repeating this process but replacing 10 by n (the number of trials) and 4 by y (the number of
successes), we find a general expression for the MLE of p in n independent Bernoulli trials:

y
pˆ =
n

• Back to logistic regression: we use the maximum likelihood principle to find estimators of
the β’s in the logistic regression model. The likelihood function is the probability of
observing the particular set of failures and successes that we observed in the sample. But
there’s a difference from the binomial model above: the model says that the probability of
success is possibly different for each subject because it depends on the explanatory
variables X 1 , … , X p . Dropping the subscript i to avoid confusion, the probability of
success for an individual is:
page 6

β 0 + β1 X 1 +…+ β p X p
e
Pr(Y = 1) = π where π = β 0 + β1 X 1 +…+ β p X p
.
1+ e

• Note that we can also write the model for a single individual as

Pr(Y = y ) = π y (1 − π )1− y , y = 0, 1.

• If the responses are independent, then the probability of observing the set of 0/1 responses
we observed is

n
Pr(Y1 = y1 , Y2 = y2 , …, Yn = yn ) = π 1y1 (1 − π 1 ) y1 π 2y2 (1 − π 2 ) y2 …π nyn (1 − π n ) yn = ∏ π iyi (1 − π i ) yi
i =1

where each of the π i ’s is, in turn, a function of the β i ’s. Therefore, we maximize this
likelihood function as a function of the β i ’s. This cannot be done analytically and must be
done numerically using a computer program.

• SPSS can fit logit models: use Analyze…Regression…BinaryLogistic. Identify the


binary dependent variable and all the covariates.

• If any of the covariates are categorical then click Categorical to tell SPSS how to create
indicator variables (or other way of coding) for each categorical variable. If a binary
covariate is already coded as 0-1, it’s not necessary to identify it as categorical.

• Interaction terms can be added when specifying the covariates in the logistic regression
command in SPSS. Select two or more variables (keep the Ctrl key pressed while
selecting the second and subsequent variables), then press the “a*b>” button.

Example: Birthweight data. The response variable was the 0/1 variable “low” which was an
indicator of low birthweight, An additive model with all the covariates was fit. The variables
were:

Name Meaning
low Birthweight < 2500 gm.
age Age (years)
lwt Weight at last menstrual period (lbs)
race Race (1=White, 2=Black, 3=Other)
smoke Smoke (0/1)
ptl Number of premature labors
ht Hypertension (0.1)
ui Uterine irritability (0/1)
ftv Number of physician visits in first trimester
page 7

There is a lot of SPSS out put. For now, we’re interested in only two tables: one that indicates
how the categorical variable Race was coded, and the final table of model coefficients (under
Block 1).

Categorical Variables Codings

Parameter coding
Frequency (1) (2)
Race (1=White, White 96 .000 .000
2=Black, 3=Other) Black 26 1.000 .000
Other 67 .000 1.000

I chose Indicator coding for Race with the first category (White) as the reference category. The
two columns of this table are the two indicator variables for Race.

Variables in the Equation

95.0% C.I.for EXP(B)


B S.E. Wald df Sig. Exp(B) Lower Upper
Step
a
age -.030 .037 .637 1 .425 .971 .903 1.044
1 lwt -.015 .007 4.969 1 .026 .985 .971 .998
race 7.116 2 .028
race(1) 1.272 .527 5.820 1 .016 3.569 1.270 10.033
race(2) .880 .441 3.990 1 .046 2.412 1.017 5.723
smoke .939 .402 5.450 1 .020 2.557 1.163 5.624
ptl .543 .345 2.474 1 .116 1.722 .875 3.388
ht 1.863 .698 7.136 1 .008 6.445 1.642 25.291
ui .768 .459 2.793 1 .095 2.155 .876 5.301
ftv .065 .172 .143 1 .705 1.067 .761 1.497
Constant .481 1.197 .161 1 .688 1.617
a. Variable(s) entered on step 1: age, lwt, race, smoke, ptl, ht, ui, ftv.

The columns in this table are:

1. B: the estimated coefficients


2. S.E. The standard errors (SE’s) for the estimated coefficients (the β i ’s) are estimated
using asymptotic results for maximum likelihood estimators. An approximate confidence
interval for each coefficient can be calculated using β i ± z SE where z is the appropriate
percentile of the standard normal distribution. We don’t use the t-distribution; that only
applies to the linear regression model with normal errors.
2
⎛ βˆi ⎞
3. Wald: the Wald statistic is ⎜ ⎟ . It is used as a two-sided test of the hypothesis
⎜ SE ( βˆ ) ⎟
⎝ i ⎠
H 0 : β i = 0 given that all the other variables are in the model. Under this null
hypothesis βˆ / SE ( βˆ ) has approximately a N(0,1) distribution. The square of a N(0,1)
i i
page 8

is a chi-square distribution with 1 d.f. so the Wald statisic can be compared to a χ 2 (1) to
get a test of H 0 : β i = 0 .

4. df: The d.f. for the chi-square distribution which the Wald statistic is compared to. It is 1
except for categorical explanatory variables with more than 2 levels (however, the test for
each indicator variable involved in the categorical variable still has 1 d.f.).

5. Sig.: The P-value for the test of H 0 : β i = 0 using the Wald statistic. This is the area to
the right of the Wald statistic in a χ 2 (1) distribution.

6. Exp(B): exp(β i ) has the interpretation that it represents how many times the odds of a
positive response increase for every one-unit increase in xi , given that the values of the
other variables in the model remain the same. For example, the coefficient for Smoke is
2.557. Since Smoke is 0/1, this means that, according to this model, the odds that a
smoker will have a low birthweight baby are estimated to be 2.557 times higher (95% CI:
1.163 to 5.624; see next column) than the odds a non-smoker will, even after adjusting
for Last weight, Race and the other variables in the model. Similarly, every one-year
increase in age reduces the odds of a low birthweight baby to .971 of what they were
(95% CI: .903 to 1.044), after adjusting for the other variables.

7. 95.0% C.I. for Exp(B): A confidence interval for exp(β i ) (the true value) can be

[ ( ) ( )]
calculated as exp β i − z SE ,exp β i + z SE . SPSS will calculate a confidence interval
for each exp(β i ) in the model if, under Options, you select “CI for exp(B).”

Comparing two models


In linear regression, we used the extra sum-of-squares F test to compare two nested models (two
models are nested if one is a special case of the other). The analogous test in logistic regression
compares the values of –2ln(Maximized likelihood function). What is this quantity? Recall that
page 9
the MLE’s of the β i ’s are the values that maximize the likelihood function of the data (defined a
couple of pages ago). So, we find that maximum value of the likelihood function, take the
natural log and multiply by –2.

• The quantity –2 ln(Maximized likelihood) is also called the deviance of a model since
larger values indicate greater deviation from the assumed model. Comparing two nested
models by the difference in deviances is a drop-in-deviance test.

SPSS gives the value of –2 ln(Maximized likelihood) in the output. For the full model for low
birthweight examined previously:

Model Summary

-2 Log Cox & Snell Nagelkerke


Step likelihood R Square R Square
1 201.285a .162 .228
a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.

• The difference between the values of –2ln(Maximized likelihood function) for a full and
reduced model has approximately a chi-square distribution if the null hypothesis that the
extra parameters are all 0 is true. The d.f. is the difference in the number of parameters
for the two models.

Example:
Compare the full birthweight model to one without Smoke. The value of –2ln(Maximized
likelihood function) for the model without Smoke is:
Model Summary

-2 Log Cox & Snell Nagelkerke


Step likelihood R Square R Square
1 206.906a .137 .192
a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.

The test statistic is

Drop-in-deviance = 206.906 – 201.285 = 5.621


The P-value is the area to the right of 5.621 in χ 2 (1) distribution (since there is one less
parameter in the reduced model). Thus, P =.018, which indicates moderately strong evidence of
an effect of smoking on the probability of having a low-birthweight baby after adjusting for the
other variables in the model.

• The drop-in-deviance test is a likelihood ratio test (LR test) because it is based on the
natural log of the ratio of the maximized likelihoods (the difference of logs is the log of
the ratio). The extra sum-of squares F-test in linear regression also turns out to be a
likelihood ratio test.
page 10
• If the full and reduced models differ by only one parameter, as in this example, then the
likelihood ratio test is testing the same thing as the Wald test above. In this example, the
test statistic is slightly different than the Wald test value of 5.450 and P = .020. The
likelihood ratio test is preferred. The two tests will generally give similar results, but not
always.

• The relationship between the Wald test and the likelihood ratio test is analogous to
relationship between the t-test for a single coefficient and the F-test in linear regression.
However, in linear regression, the two tests are exactly equivalent. Not so in logistic
regression.

One can obtain a LR test for each coefficient in SPSS by selecting “Backward-LR” under
Method. Also be sure that Display: At each step is checked under Options. This performs a
backward stepwise logistic regression. All we are using it for, though, is the output at the first
step where it reports a likelihood ratio test for each variable with all the other variables in the
model. The table we are looking for looks like this. The LR test for each term is listed under
Step 1.
Model if Term Removed

Change in
Model Log -2 Log Sig. of the
Variable Likelihood Likelihood df Change
Step age -100.965 .645 1 .422
1 lwt -103.402 5.520 1 .019
race -104.376 7.468 2 .024
smoke -103.453 5.621 1 .018
ptl -101.916 2.546 1 .111
ht -104.403 7.521 1 .006
ui -102.014 2.742 1 .098
ftv -100.713 .142 1 .706
Step age -100.993 .559 1 .455
2 lwt -103.404 5.381 1 .020
race -104.386 7.344 2 .025
smoke -103.458 5.489 1 .019
ptl -101.974 2.521 1 .112
ht -104.406 7.384 1 .007
ui -102.053 2.680 1 .102
Step lwt -104.056 6.126 1 .013
3 race -105.155 8.325 2 .016
smoke -103.864 5.743 1 .017
ptl -102.108 2.231 1 .135
ht -104.732 7.479 1 .006
ui -102.449 2.912 1 .088
Step lwt -105.583 6.949 1 .008
4 race -106.413 8.609 2 .014
smoke -105.755 7.293 1 .007
ht -105.957 7.698 1 .006
ui -104.124 4.031 1 .045

Model Selection
Both AIC and BIC can be used as model selection criteria. As with linear regression models,
they are only relative measures of fit, not absolute measures of fit.

AIC = Deviance + 2p

BIC = Deviance + plog(n)


page 11

where p is the number of parameters in the model. Stepwise model selection methods are
available in SPSS using likelihood ratio tests or Wald’s test. The LR methods are preferred.
Other software programs, like S-Plus, have stepwise procedures using AIC or BIC.

Interactions
Interaction terms can be added when specifying the covariates in the logistic regression
command in SPSS. Select two or more variables (keep the Ctrl key pressed while selecting the
second and subsequent variables), then press the “a*b>” button.

Residuals
There are two standard ways to define a residual in logistic regression.

yi − πˆ i
1. Pearson residual = .
πˆ i (1 − πˆ i )

This is called a “Standardized Residual” in the Save option under Regression…Binary


Logistic in SPSS, but it is labeled “Normalized Residual” in the data sheet. The Pearson
residual is the raw (unstandardized) residual y i − πˆ i standardized by its estimated standard
deviation.

⎧ − 2 ln πˆ i if yi = 1

2. Deviance residual = ⎨
⎪− − 2 ln(1 − πˆ ) if y = 0
⎩ i i

This is called the “Deviance Residual” in SPSS. The deviance residuals have the property
that the sum of the deviance residuals squared is the deviance D (= -2 ln(maximized
likelihood)) for the model.

The Pearson residual is more easily understood, but the deviance residual directly gives the
contribution of each point to the lack of fit of the model.

There is also a “Studentized residual” in SPSS (but labeled “Standard residual” in the data sheet).
It’s usually almost identical to the deviance residual so we won’t consider it here.

The residuals in a logistic regression (either Pearson or deviance) are not as useful as in a linear
regression unless the data are grouped (see Chapter 21). The residuals are not assumed to have a
normal distribution, as in linear regression. Since the residuals don’t have a normal distribution,
the usual ±2 or ±3 cutoffs don’t necessarily apply. Instead, we can simply look for outliers
among the distribution of Pearson or deviance residuals. We can plot the values against the
predicted probabilities.

Measures of influence
Measures of influence attempt to measure how much individual observations influence the
model. Observations with high influence merit special examination. Many measures of
influence have been suggested and are used for linear regression. Some of these have analogies
page 12
for logistic regression. However, the guidelines for deciding what is a big enough value of an
influence measure to merit special attention are less developed for logistic regression than for
linear regression and the guidelines developed there are sometimes not appropriate for logistic
regression.

Cook’s Distance: This is a measure of how much the residuals change when the case is deleted.
Large values indicate a large change when the observation is left out. Plot Di against case
number or against πˆ i and look for outliers.

The leverage is a measure of the potential influence of an observation. In linear regression, the
leverage is a function of how far its covariate vector x i is from the average. It is a function only
of the covariate vector. In logistic regression, the leverage is a function of x i and of π i (which
must be estimated) – it isn’t necessarily the observations which have the most extreme x i ’s
which have the most leverage.

Assessing the overall fit of the model


Assessing the fit of a logistic regression model is difficult. Since the response is 0/1, plots of the
response versus explanatory variables are not as useful as when the response is continuous.
There is no good measure analogous to R2; SPSS will report a couple of R2-like measures that
have been proposed for logistic regression, but these don’t have the same interpretation as in
linear regression and are not particularly good measures of fit.

In theory, a chi-square goodness-of-fit test could be carried out by comparing the observed count
in each category (meaning each combination of the values of the explanatory variables) against
the expected count. However, unless there are multiple subjects in each combination of the
values of the covariates (the explanatory variables), the observed counts are 0’s and 1’s; the
expected counts are simply the πˆ i ’s. But the chi-square approximation for the goodness-of-fit
statistic does not hold when the expected counts are all less than 1 (most of them should be
above 5 to use the chi-square approximation). A way around this difficulty is to group the data.
The Hosmer-Lemeshow test groups individuals by their value of πˆ i into ten groups of about
equal size. That is, the tenth of the observations with the lowest values of πˆ i are put in the first
group, the next tenth lowest values in the second, and so on. It then totals the πˆ i ’s for everyone
in each group (that is, it totals the expected number of successes) and does the same for the
1 − πˆ i ’s (the expected number of failures). It then compares the observed and expected counts in
the 20 cells using Pearson’s X2 and compares the result to a chi-square with 8 d.f. The expected
counts still need to be large enough for the chi-square approximation to hold, but this is much
more likely to happen since we group the data. Since the cases are grouped, this test is “crude”
in a sense, but it is a useful tool.

Another goodness-of-fit tool is the classification table. Each observation is classified according
to its estimated π: if πˆ i > .5, then its classified as a “success” (a 1), otherwise as a “failure” (a 0).
The table of actual group versus predicted group is used to assess how well the classifier does.
This is a very crude way of assessing how well the model fits since it does not measure how far
πˆ i is from yi ; only whether it is on the same side of .5. as yi . In addition, if yi =1 is rare, then
page 13
the estimated probabilities may all be less than 1 and the classification table isn’t useful in
assessing fit (the same is true if yi =0 is rare).
page 14

Example
Birthweight data: using only LastWeight, Race, Smoke

First fit a rich model


Model Summary

-2 Log Cox & Snell Nagelkerke


Step likelihood R Square R Square
1 210.491a .120 .169
a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.

Variables in the Equation

B S.E. Wald df Sig. Exp(B)


Step
a
lwt -.012 .045 .073 1 .786 .988
1 race 1.733 2 .420
race(1) -3.481 2.717 1.642 1 .200 .031
race(2) -2.969 2.907 1.043 1 .307 .051
smoke -.679 1.933 .123 1 .726 .507
lwt * race 1.271 2 .530
lwt by race(1) .017 .021 .634 1 .426 1.017
lwt by race(2) .025 .022 1.269 1 .260 1.026
lwt by smoke .008 .016 .257 1 .612 1.008
race * smoke 1.911 2 .385
race(1) by smoke 1.308 .953 1.882 1 .170 3.698
race(2) by smoke .667 1.180 .320 1 .572 1.949
lwt2 .000 .000 .167 1 .683 1.000
Constant 1.849 3.131 .349 1 .555 6.353
a. Variable(s) entered on step 1: lwt, race, smoke, lwt * race , lwt * smoke , race * smoke , lwt2.

Compare to a main effects model


Model Summary

-2 Log Cox & Snell Nagelkerke


Step likelihood R Square R Square
1 215.015a .099 .139
a. Estimation terminated at iteration number 4 because
parameter estimates changed by less than .001.

Variables in the Equation

B S.E. Wald df Sig. Exp(B)


Step
a
lwt -.013 .006 4.415 1 .036 .987
1 race 8.730 2 .013
race(1) -.971 .412 5.543 1 .019 .379
race(2) .320 .526 .370 1 .543 1.377
smoke 1.060 .378 7.850 1 .005 2.886
Constant .861 .797 1.168 1 .280 2.366
a. Variable(s) entered on step 1: lwt, race, smoke.

• Carry out drop in deviance test:


page 15

Categorical Variables Codings

Parameter coding
Frequency (1) (2)
Race (1=White, White 96 1.000 .000
2=Black, 3=Other) Black 26 .000 1.000
Other 67 .000 .000

The largest difference is between White and Other. The difference between Black and Other is
smaller and not statistically significant. We might consider coding race as simply
White/Nonwhite.

Model Summary

-2 Log Cox & Snell Nagelkerke


Step likelihood R Square R Square
1 215.383a .097 .136
a. Estimation terminated at iteration number 4 because
parameter estimates changed by less than .001.

Variables in the Equation

95.0% C.I.for EXP(B)


B S.E. Wald df Sig. Exp(B) Lower Upper
Step
a
lwt -.0122 .0060 4.118 1 .042 .988 .976 1.000
1 Nonwhite 1.077 .373 8.342 1 .004 2.937 1.414 6.102
smoke 1.103 .372 8.775 1 .003 3.013 1.452 6.252
Constant -.275 .834 .109 1 .742 .760
a. Variable(s) entered on step 1: lwt, Nonwhite, smoke.

Hosmer and Lemeshow Test

Step Chi-square df Sig.


1 7.071 8 .529

Some residual analysis should also be done to see that there are not any outliers or particularly
influential points.

• Write the logit equation from the last model for each of the four groups represented by the
two categorical explanatory variables:
page 16

Compare the odds of a low birthweight baby for a nonwhite smoking mother who weighs 150
pounds versus a white, nonsmoking mother who weighs 120 pounds.