
Logistic Regression in Minitab

Logistic Regression Overview



Both logistic regression and least squares regression investigate the relationship between a response variable and
one or more predictors. A practical difference between them is that logistic regression techniques are used with
categorical response variables, and linear regression techniques are used with continuous response variables.

Minitab provides three logistic regression procedures that you can use to assess the relationship between one or
more predictor variables and a categorical response variable of the following types:

Variable  Number of
type      categories  Characteristics                    Examples
Binary    2           two levels                         success, failure
                                                         yes, no

Ordinal   3 or more   natural ordering of the levels     none, mild, severe
                                                         fine, medium, coarse

Nominal   3 or more   no natural ordering of the levels  blue, black, red, yellow
                                                         sunny, rainy, cloudy


Both logistic and least squares regression methods estimate the parameters in the model so that the fit of the model is
optimized. Least squares minimizes the sum of squared errors to obtain parameter estimates, whereas logistic
regression obtains maximum likelihood estimates of the parameters using an iteratively reweighted least squares
algorithm [25].
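
To make the idea of iteratively reweighted least squares concrete, here is a minimal Python sketch for a binary response with a single covariate and the logit link. It is an illustration under stated assumptions, not Minitab's implementation: the function name `fit_logistic_irls` and the eight data points are made up for the example, and the 2 x 2 system is solved by hand to avoid any external library.

```python
import math

def fit_logistic_irls(x, y, max_iter=25, tol=1e-10):
    """Fit log(p / (1 - p)) = b0 + b1 * x by Newton / IRLS steps (illustrative)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities and the IRLS weights p * (1 - p)
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector X'(y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X'WX, a symmetric 2 x 2 matrix [[a, b], [b, c]]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Solve (X'WX) * step = X'(y - p) and update the estimates
        d0 = (c * g0 - b * g1) / det
        d1 = (a * g1 - b * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: y = 1 marks the event, x is a covariate
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic_irls(x_demo, y_demo)
print(b0_hat, b1_hat)
```

At convergence the score vector is essentially zero, which is the defining property of the maximum likelihood estimates. Each pass is a weighted least squares solve with weights p(1 - p), which is where the name comes from.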



Interpreting Estimated Coefficients in Binary Logistic Regression

The interpretation of the estimated coefficients depends on the link function, the reference event, and the reference
factor levels (see Setting reference and event levels). The estimated coefficient associated with a predictor (factor or
covariate) represents the change in the link function for each unit change in the predictor, while all other predictors
are held constant. A unit change in a factor refers to a comparison of a certain level to the reference level.

The logit link provides the most natural interpretation of the estimated coefficients and is therefore the default link in
Minitab. A summary of the interpretation follows:

The odds of a reference event is the ratio of P(event) to P(not event). The estimated coefficient of a predictor
(factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the
predictor, assuming the other predictors remain constant.
The estimated coefficients can also be used to calculate the odds ratio, or the ratio between two odds.
Exponentiating the estimated coefficient of a factor yields the ratio of P(event)/P(not event) for a certain factor
level compared to the reference level. The odds ratios at different values of the covariate can be constructed
relative to zero. In the covariate case, it may be more meaningful to interpret the odds and not the odds ratio.
Note that an estimated coefficient of zero or an odds ratio of one both imply the same thing: the factor or
covariate has no effect.
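
As a small illustration of these definitions, the following Python sketch computes the odds and the log odds of an event from its probability, and converts a coefficient on the logit scale into an odds ratio. The function names are illustrative only; the probability uses the response counts from the example in this document (70 low, 22 high).

```python
import math

def odds(p_event):
    """Odds of the event: P(event) / P(not event)."""
    return p_event / (1.0 - p_event)

def log_odds(p_event):
    """Logit link: natural log of the odds."""
    return math.log(odds(p_event))

def odds_ratio(coefficient):
    """Odds ratio implied by a coefficient on the logit scale."""
    return math.exp(coefficient)

# Sample proportion of low resting pulse: 70 of 92 subjects
p_low = 70 / 92
print(odds(p_low))       # equals 70 / 22
print(log_odds(p_low))
print(odds_ratio(0.0))   # a zero coefficient corresponds to an odds ratio of 1
```

The last line shows why a coefficient of zero and an odds ratio of one say the same thing: exp(0) = 1, so the odds do not change.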

To change how you view the estimated coefficients, you can change the event or reference levels in the Options
subdialog box. See Setting reference and event levels.

Example of Binary Logistic Regression


You are a researcher who is interested in understanding the effect of smoking and weight upon resting pulse rate.
Because you have categorized the response, pulse rate, into low and high, a binary logistic regression analysis is
appropriate to investigate the effects of smoking and weight upon pulse rate.

1. Open the worksheet EXH_REGR.MTW.
2. Choose Stat > Regression > Binary Logistic Regression.
3. In Response, enter RestingPulse. In Model, enter Smokes Weight. In Factors (optional), enter Smokes.
4. Click Graphs. Check Delta chi-square vs probability and Delta chi-square vs leverage. Click OK.
5. Click Results. Choose In addition, list of factor level values, tests for terms with more than 1 degree
of freedom, and 2 additional goodness-of-fit tests. Click OK in each dialog box.

Session window output
Binary Logistic Regression: RestingPulse versus Smokes, Weight

Link Function: Logit

Response Information

Variable      Value  Count
RestingPulse  Low       70  (Event)
              High      22
              Total     92


Factor Information

Factor  Levels  Values
Smokes       2  No, Yes


Logistic Regression Table
                                                Odds    95% CI
Predictor       Coef    SE Coef      Z      P  Ratio  Lower  Upper
Constant    -1.98717    1.67930  -1.18  0.237
Smokes
 Yes        -1.19297   0.552980  -2.16  0.031   0.30   0.10   0.90
Weight     0.0250226  0.0122551   2.04  0.041   1.03   1.00   1.05


Log-Likelihood = -46.820
Test that all slopes are zero: G = 7.574, DF = 2, P-Value = 0.023


Goodness-of-Fit Tests
Method                 Chi-Square  DF      P
Pearson                   40.8477  47  0.724
Deviance                  51.2008  47  0.312
Hosmer-Lemeshow            4.7451   8  0.784
Brown:
 General Alternative       0.9051   2  0.636
 Symmetric Alternative     0.4627   1  0.496


Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                                 Group
Value      1     2     3     4     5     6     7     8     9    10  Total
Low
 Obs       4     6     6     8     8     6     8    12    10     2     70
 Exp     4.4   6.4   6.3   6.6   6.9   7.2   8.3  12.9   9.1   1.9
High
 Obs       5     4     3     1     1     3     2     3     0     0     22
 Exp     4.6   3.6   2.7   2.4   2.1   1.8   1.7   2.1   0.9   0.1
Total      9    10     9     9     9     9    10    15    10     2     92

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent  Summary Measures
Concordant    1045     67.9  Somers' D              0.38
Discordant     461     29.9  Goodman-Kruskal Gamma  0.39
Ties            34      2.2  Kendall's Tau-a        0.14
Total         1540    100.0


Graph window output


Interpreting the results
The Session window output contains the following seven parts:
Response Information displays the number of missing observations and the number of observations that fall into
each of the two response categories. The response value that has been designated as the reference event is the first
entry under Value and labeled as the event. In this case, the reference event is low pulse rate (see Factor variables
and reference levels).

Factor Information displays all the factors in the model, the number of levels for each factor, and the factor level
values. The factor level that has been designated as the reference level is the first entry under Values; here, it is
that the subject does not smoke (see Factor variables and reference levels).

Logistic Regression Table shows the estimated coefficients, standard error of the coefficients, z-values, and p-
values. When you use the logit link function, you also see the odds ratio and a 95% confidence interval for the odds
ratio.

From the output, you can see that the estimated coefficients for both Smokes (z = -2.16, p = 0.031) and Weight
(z = 2.04, p = 0.041) have p-values less than 0.05, indicating that there is sufficient evidence that the coefficients
are not zero using an α-level of 0.05.
The estimated coefficient of -1.193 for Smokes represents the change in the log of P(low pulse)/P(high pulse)
when the subject smokes compared to when he/she does not smoke, with the covariate Weight held constant.
The estimated coefficient of 0.0250 for Weight is the change in the log of P(low pulse)/P(high pulse) with a 1 unit
(1 pound) increase in Weight, with the factor Smokes held constant.
Although there is evidence that the estimated coefficient for Weight is not zero, the odds ratio is very close to one
(1.03), indicating that a one pound increase in weight minimally affects a person's resting pulse rate. A more
meaningful difference would be found if you compared subjects with a larger weight difference (for example, if
the weight unit is 10 pounds, the odds ratio becomes 1.28, indicating that the odds of a subject having a low
pulse are multiplied by 1.28 with each 10 pound increase in weight).
For Smokes, the negative coefficient of -1.193 and the odds ratio of 0.30 indicate that subjects who smoke tend
to have a higher resting pulse rate than subjects who do not smoke. Given that subjects have the same weight,
the odds ratio can be interpreted as the odds of smokers in the sample having a low pulse being 30% of the odds
of non-smokers having a low pulse.
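
The quantities in the Logistic Regression Table can be reproduced from the coefficients and standard errors alone. The Python sketch below assumes the usual Wald approach (z = coefficient / standard error, a two-sided normal p-value, and a confidence interval of exp(coefficient ± 1.96 × SE)); the function name `wald_summary` is illustrative and is not a Minitab command.

```python
import math

def wald_summary(coef, se, z_crit=1.96):
    """Return Wald z, two-sided p-value, odds ratio, and approximate 95% CI limits."""
    z = coef / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return z, p_value, ratio, lower, upper

# Coefficients and standard errors taken from the Session window output
smokes = wald_summary(-1.19297, 0.552980)
weight = wald_summary(0.0250226, 0.0122551)

# Odds ratio for a 10 pound difference in Weight: exponentiate 10 times the coefficient
ratio_10_lb = math.exp(10 * 0.0250226)

print([round(v, 3) for v in smokes])
print([round(v, 3) for v in weight])
print(round(ratio_10_lb, 2))
```

After rounding, these values agree with the table (for example, an odds ratio of 0.30 with limits 0.10 and 0.90 for Smokes), and the 10 pound odds ratio matches the 1.28 quoted above.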

Next, the last Log-Likelihood from the maximum likelihood iterations is displayed along with the statistic G. This
statistic tests the null hypothesis that all the coefficients associated with predictors equal zero versus these
coefficients not all being equal to zero. In this example, G = 7.574, with a p-value of 0.023, indicating that there is
sufficient evidence that at least one of the coefficients is different from zero, given that your accepted α-level is
greater than 0.023.
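
The G statistic and its p-value can be checked by hand. The sketch below assumes that G is the usual likelihood-ratio statistic, twice the difference between the log-likelihood of the fitted model and that of an intercept-only model; the source does not spell out the formula, so treat this as an assumption. With 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

def null_log_likelihood(n_event, n_nonevent):
    """Log-likelihood of an intercept-only model for a binary response."""
    n = n_event + n_nonevent
    return n_event * math.log(n_event / n) + n_nonevent * math.log(n_nonevent / n)

def chi_square_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variable with exactly 2 DF."""
    return math.exp(-x / 2.0)

ll_full = -46.820                       # Log-Likelihood from the output
ll_null = null_log_likelihood(70, 22)   # 70 low, 22 high
g = 2.0 * (ll_full - ll_null)
p_value = chi_square_upper_tail_2df(7.574)

print(round(g, 2), round(p_value, 3))
```

The computed G differs from the reported 7.574 only through rounding of the log-likelihood, and the p-value rounds to 0.023.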

Note that for factors with more than 1 degree of freedom, Minitab performs a multiple degrees of freedom test
with a null hypothesis that all the coefficients associated with the factor are equal to 0 versus them not all being
equal to 0. This example does not have a factor with more than 1 degree of freedom.

Goodness-of-Fit Tests displays Pearson, deviance, and Hosmer-Lemeshow goodness-of-fit tests. In addition, two
Brown tests (general alternative and symmetric alternative) are displayed because you have chosen the logit link
function and the selected option in the Results subdialog box. The goodness-of-fit tests, with p-values ranging from
0.312 to 0.784, indicate that there is insufficient evidence to claim that the model does not fit the data adequately. If
a p-value were less than your accepted α-level, that test would reject the null hypothesis of an adequate fit.

Table of Observed and Expected Frequencies allows you to see how well the model fits the data by comparing
the observed and expected frequencies. There is insufficient evidence that the model does not fit the data well, as the
observed and expected frequencies are similar. This supports the conclusions made by the Goodness of Fit Tests.
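
As a rough check on how the table relates to the Hosmer-Lemeshow test, the following Python sketch sums (observed - expected)² / expected over all cells of the table. This assumes the statistic is the Pearson chi-square of the table, as the note under the table heading suggests, with degrees of freedom equal to the number of groups minus 2. The variable names are illustrative.

```python
# Observed and expected frequencies copied from the table (10 groups)
obs_low = [4, 6, 6, 8, 8, 6, 8, 12, 10, 2]
exp_low = [4.4, 6.4, 6.3, 6.6, 6.9, 7.2, 8.3, 12.9, 9.1, 1.9]
obs_high = [5, 4, 3, 1, 1, 3, 2, 3, 0, 0]
exp_high = [4.6, 3.6, 2.7, 2.4, 2.1, 1.8, 1.7, 2.1, 0.9, 0.1]

def pearson_sum(observed, expected):
    """Sum of (O - E)^2 / E over the cells of one row."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

hl_statistic = pearson_sum(obs_low, exp_low) + pearson_sum(obs_high, exp_high)
hl_df = len(obs_low) - 2

print(round(hl_statistic, 2), hl_df)
```

Because the expected frequencies in the table are rounded to one decimal place, the result is only close to the reported value of 4.7451, not identical to it.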

Measures of Association displays a table of the number and percentage of concordant, discordant, and tied pairs,
as well as common rank correlation statistics. These values measure the association between the observed
responses and the predicted probabilities.

The table of concordant, discordant, and tied pairs is calculated by pairing the observations with different
response values. Here, you have 70 individuals with a low pulse and 22 with a high pulse, resulting in 70 * 22 =
1540 pairs with different response values. Based on the model, a pair is concordant if the individual with a low
pulse rate has a higher probability of having a low pulse, discordant if the opposite is true, and tied if the
probabilities are equal. In this example, 67.9% of pairs are concordant and 29.9% are discordant. You can use
these values as a comparative measure of prediction, for example in comparing fits with different sets of
predictors or with different link functions.
Somers' D, Goodman-Kruskal Gamma, and Kendall's Tau-a are summaries of the table of concordant and
discordant pairs. These measures most likely lie between 0 and 1, where larger values indicate that the model
has a better predictive ability. In this example, the measures range from 0.14 to 0.39, which implies less than
desirable predictive ability.
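
The pair counts and summary measures can be sketched in Python as follows. The classification rule follows the description above; the three formulas are the common textbook definitions (difference of concordant and discordant pairs divided by, respectively, the number of pairs with different responses, the number of untied pairs, and the number of all possible pairs), which is an assumption about how Minitab computes them. The probabilities passed to `classify_pairs` in the small demonstration are hypothetical.

```python
def classify_pairs(p_event_group, p_nonevent_group):
    """Count concordant, discordant, and tied pairs from predicted event probabilities."""
    concordant = discordant = tied = 0
    for p_event in p_event_group:          # subjects who had the event (low pulse)
        for p_other in p_nonevent_group:   # subjects who did not (high pulse)
            if p_event > p_other:
                concordant += 1
            elif p_event < p_other:
                discordant += 1
            else:
                tied += 1
    return concordant, discordant, tied

# Hypothetical predicted probabilities, only to show the counting rule
demo_counts = classify_pairs([0.9, 0.6], [0.6, 0.3])

# Counts from the Measures of Association table
c, d, t = 1045, 461, 34
n_low, n_high = 70, 22
n = n_low + n_high
pairs = n_low * n_high                       # 1540 pairs with different responses

somers_d = (c - d) / pairs
gamma = (c - d) / (c + d)
tau_a = (c - d) / (n * (n - 1) / 2)

print(demo_counts)
print(round(somers_d, 2), round(gamma, 2), round(tau_a, 2))
```

With the counts from the output, the three measures round to 0.38, 0.39, and 0.14, matching the table.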

Plots In the example, you chose two diagnostic plots: delta Pearson χ² versus the estimated event probability and
delta Pearson χ² versus the leverage. Delta Pearson χ² for the jth factor/covariate pattern is the change in the
Pearson χ² when all observations with that factor/covariate pattern are omitted. These two graphs indicate that two
observations are not well fit by the model (high delta χ²). A high delta χ² can be caused by a high leverage and/or a
high Pearson residual. In this case, a high Pearson residual caused the large delta χ², because the leverages are
less than 0.1. Hosmer and Lemeshow indicate that delta χ² or delta deviance greater than 3.84 is large.
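
To see how leverage and the Pearson residual combine, the sketch below uses the form of the delta Pearson chi-square given by Hosmer and Lemeshow, the squared Pearson residual divided by one minus the leverage. Whether Minitab computes exactly this expression is an assumption, and the residual and leverage values are hypothetical. The cutoff of 3.84 is approximately 1.96 squared, the upper 5% point of a chi-square distribution with 1 degree of freedom.

```python
def delta_chi_square(pearson_residual, leverage):
    """Change in Pearson chi-square when one factor/covariate pattern is omitted."""
    return pearson_residual ** 2 / (1.0 - leverage)

CUTOFF = 3.84  # threshold quoted from Hosmer and Lemeshow

# Hypothetical pattern: large residual, small leverage (below 0.1)
value = delta_chi_square(pearson_residual=2.5, leverage=0.05)
print(value, value > CUTOFF)

# The same residual with higher leverage gives an even larger value
print(delta_chi_square(pearson_residual=2.5, leverage=0.5))
```

With leverage below 0.1, the denominator stays above 0.9, so a large delta chi-square must come almost entirely from the residual, which is the reasoning used in the paragraph above.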

If you choose Editor > Brush, brush these points, and then click on them, they will be identified as data values 31 and
66. These are individuals with a high resting pulse, who do not smoke, and who have smaller than average weights
(Weight = 116, 136 pounds). You might further investigate these cases to see why the model did not fit them well.
