
Notes on Logistic Regression

STAT 4330/8330
1

Introduction
Previously, you learned about odds
ratios (ORs).
We now transition and begin discussion
of binary logistic regression. We will see
that ORs play an important role in the
results of binary logistic models.
2

Binary Logistic Regression


Binary Logistic Regression is appropriate when:
1. The response variable is categorical with 2 categories (binary,
dichotomous, etc.). The response categories are often generically
labeled success or failure.
2. One or more explanatory variables are involved. These can be
either quantitative or categorical, or a mixture of both.
3. One is interested in assessing the relationship between the
binary response and the explanatory variables and/or predicting
the response category based on the value(s) of the explanatory
variable(s).
3

The Model Equation
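The equation itself appeared as a graphic on the original slide; in standard notation (a reconstruction, not copied from the slide), the binary logistic model for predictors x1, ..., xk is

$$E(y) = P(y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}$$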


A few points:
1. E(y) can never fall below 0 or above 1
(Remember: it is a probability!).
2. The model is not a linear function of the
parameters. This is a type of nonlinear
regression model.

The Model Function

The Model Equation


Alternatively, the equation can be
transformed to show that it models the
natural logarithm of the odds of y = 1.
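In the same reconstructed notation, the transformed (log-odds) form of the model is

$$\ln\!\left(\frac{P(y = 1)}{1 - P(y = 1)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$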

The Model Equation

The left side is called the logit

The Model Equation


In general, bi estimates the change in the log-odds
when xi is increased by 1 unit, holding all other x's in
the model fixed.
Therefore, exp(bi) estimates the OR of a success for
each additional 1-unit increase in xi.
Furthermore, (exp(bi)-1)*100 gives the percent
increase in the odds of a success for each 1-unit
increase in xi.
9
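To see why exp(bi) is an odds ratio, note (from the log-odds form above) that increasing xi by 1 unit while holding the other predictors fixed changes the estimated log-odds by exactly bi, so the ratio of the estimated odds is

$$\frac{\widehat{\text{odds}}(x_i + 1)}{\widehat{\text{odds}}(x_i)} = \exp(b_i)$$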

Example: The Outbreak Data


The Outbreak data contain a sample of N = 196
persons in 2 neighborhoods (sectors) of a large
city during a disease outbreak.
Can we predict whether or not a person
contracts the disease?
We will begin with a simple binary logit model
(with 1 predictor = age).
10
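A minimal sketch of this one-predictor fit in SAS follows; the data set and variable names (outbreak, disease, age) are assumptions, not taken from the slides.

proc logistic data=outbreak;
   /* model the probability that disease = 1 (contracted the disease) */
   model disease(event='1') = age;
run;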

Example: The Outbreak Data.


Through SAS PROC LOGISTIC, we
find that b1 = .0285.
Therefore, OR = exp(.0285) = 1.029,
indicating that a person's odds of
contracting the disease increase by a factor
of 1.029 for each additional year of age.
11

Example: The Outbreak Data.


Furthermore, we can state that the
odds of contracting the disease
increase by 2.89% with each additional
year of age.
(exp(.0285)-1)*100% = 2.89%.

12

Example: The Outbreak Data.


We can transform these results to
describe the increase in odds in 5- and 10-
year increments using the following:
exp(c*bi) = the OR when there is a
difference of c units in xi.

13

Example: The Outbreak Data.


Therefore:
(exp(5*.0285)-1)*100% = 15.32%
(exp(10*.0285)-1)*100% = 32.98%
As a result, a person's odds of
getting the disease increase by 15.32%
for every additional 5 years of age.

14

Model Fit
We ended last session fitting a simple (1-predictor) binary logit model to the
Outbreak data using SAS.
We will now continue covering the SAS
PROC LOGISTIC output.

15

Model Fit Statistics

All of these statistics assess model fit in
terms of the explanatory capacity of the model.

16

Model Fit Statistics

-2 Log L: The -2 log-likelihood is a
transformation of the likelihood
function (L). L quantifies how well
the model fits the sample data.
17

Model Fit Statistics

Both AIC & SC are variants of the -2 Log L
that penalize for model complexity
(the number of predictor variables).

18

Model Fit Statistics

AIC: Akaike Information Criterion. Used
to compare non-nested models. Smaller
is better. AIC is only meaningful in
relation to another model's AIC value.
19

Model Fit Statistics

SC: Schwarz Criterion. Very much like
AIC; however, the penalty term is
different. SC tends to favor simpler
models than AIC does.
20
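For reference, the standard definitions of these two criteria (with p = the number of estimated parameters, including the intercept, and n = the sample size) are

$$AIC = -2\log L + 2p, \qquad SC = -2\log L + p\,\ln(n)$$

Both add a penalty to -2 Log L; SC's penalty grows with n, which is why it tends to favor simpler models.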

Model Fit Statistics

Choose either AIC or SC (not both) and
use the values under the heading
"Intercept and Covariates" to compare
competing models.
21

The model equation.

22
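The fitted equation was displayed as SAS output on the original slide. Using the slope reported later in these notes (b1 = .0285), it has the form

$$\hat{\text{logit}}(\hat{\pi}) = b_0 + 0.0285\,(\text{age})$$

where b0 is the estimated intercept from the output (not reproduced here).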

Inference: The Coefficients.

Instead of a t-test for the significance of a
coefficient (like in linear regression), we
have a Wald chi-squared test.

23

Inference: The Coefficients.

Remember, typically we do not evaluate
the intercept, but rather focus on the test
for each predictor.

24

Inference: The Coefficients.

In this case, age is a statistically
significant predictor of disease status at
the α = .05 level, χ²(1) = 11.53,
p = .0007.
25

Inference: The Coefficients.


One can also obtain CIs for the
parameter estimates using the CL option in
the MODEL statement of PROC
LOGISTIC.

26
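A sketch of one way to request these intervals, using the CLPARM= and CLODDS= MODEL-statement options (data set and variable names are the same assumptions as before):

proc logistic data=outbreak;
   /* Wald confidence limits for the coefficients and the odds ratios */
   model disease(event='1') = age / clparm=wald clodds=wald;
run;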

Inference: The Coefficients.

As we found in linear regression, we can
conclude that a given predictor is
statistically significant at the α = .05 level if the
95% CI does not include the null value of
0.
27

Inference: The Coefficients.

Therefore, our best estimate of the
change in the log-odds for age is 0.0285;
however, we are 95% confident that that
change lies between 0.0120 and 0.0449 for
the population.
28

Inference: The Coefficients.


Furthermore: exp(.0285) = 1.029
exp(.0120) = 1.012
exp(.0449) = 1.046
Therefore, we estimate that a person's odds of
contracting the disease increase by a factor of
1.029 for each additional year of age, and we are
95% confident that this multiplier lies
between (1.012, 1.046) for the population.
29

Inference: The Coefficients.


Of course, we no longer have to compute
these odds ratio estimates by hand,
because SAS provides them for us.

30

Inference: The Coefficients.


Furthermore: (exp(.0285)-1)*100% = 2.89%
(exp(.0120)-1)*100% = 1.21%
(exp(.0449)-1)*100% = 4.59%

We can state that the odds of contracting
the disease increase by 2.89% with each
additional year of age, and we are 95%
confident that this increase ranges
between (1.21%, 4.59%) for the population.
31

Final Note: Model Fitting


Realize that in order to estimate the
model parameters, the data must contain
a substantial number of observations in each
response category. For example, one will not be
able to estimate the risk of contracting a
disease if the data set does not contain
any individuals who have been
diagnosed with the disease.
32

Final Note: Model Fitting


Essentially, then, in order to estimate the
probability of either a success or failure,
the data set must contain a substantial
number (> 30 is best) of observations that
experienced a success and a substantial
number that experienced a failure.

33

More about output.


PROC LOGISTIC provides more
information concerning how the model
fits the sample data.

34

More about Model Fit


Percent Concordant
A pair of observations with different
observed responses is considered
concordant if the observation with the
lower ordered response value has a lower
predicted value than the observation
with a higher ordered response value.
35

More about Model Fit


Percent Discordant
A pair is considered discordant if the
observation with the lower ordered
response value has a higher predicted
value than the observation with the higher
ordered response value.

36

More about Model Fit


Percent Tied
A pair with different responses is
considered tied if it is neither concordant
nor discordant.

37

More about Model Fit


Somers' D, Gamma, & Tau-a
These statistics measure the
strength and direction of the association
between the observed responses and the
predicted probabilities, based on these pairs.

38

More about Model Fit


Somers' D & Tau-a
Like r, these vary between -1.0 (all
pairs discordant) & +1.0 (all pairs
concordant).
Somers' D = the difference between
the % concordant and the % discordant,
divided by 100.
39

More about Model Fit


Gamma
Gamma is a similar statistic: its
values also range between -1.0 & +1.0;
however, it is computed from only the
untied pairs, so 0 indicates no association
and ±1.0 indicates perfect association.

40
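For reference, the standard pair-based definitions (a sketch, with nc = number of concordant pairs, nd = number of discordant pairs, t = total number of pairs with different responses, and N = number of observations) are

$$D = \frac{n_c - n_d}{t}, \qquad \gamma = \frac{n_c - n_d}{n_c + n_d}, \qquad \tau_a = \frac{n_c - n_d}{N(N-1)/2}$$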

Predicted Values
The output of a logit model is the
predicted probability of a success for
each observation.

41

Predicted Values
These are obtained and stored in a
separate SAS data set using the OUTPUT
statement (see the following code).

42
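The code shown on the original slide is not reproduced here; the following is a sketch of what such a call typically looks like, again assuming the outbreak data set and the disease and age variable names. The P=, LOWER=, and UPPER= options store the predicted probability and its 95% confidence limits, and PREDPROBS=I adds the _FROM_, _INTO_, and IP_ variables discussed below.

proc logistic data=outbreak;
   model disease(event='1') = age;
   /* write predictions (plus the original variables) to work.preds */
   output out=preds p=phat lower=lcl upper=ucl predprobs=I;
run;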

Predicted Values
PROC LOGISTIC outputs the predicted
values and 95% CI limits to an output
data set that also contains the original
raw data.

43

Predicted Values
Use the PREDPROBS = I option in order
to obtain the predicted category (which
is saved in the _INTO_ variable).

44

Predicted Values
_FROM_ = The observed response
category = The same value as the
response variable.

45

Predicted Values
_INTO_ = The predicted response
category.

46

Predicted Values
IP_1 = The Individual Probability of a
response of 1.

47
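A quick way to inspect these variables in the output data set (names as assumed in the sketch above):

proc print data=preds(obs=10);
   var age disease _FROM_ _INTO_ IP_1;
run;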

Scoring Observations in SAS


Obtaining predicted probabilities and/or
predicted outcomes (categories) for new
observations (i.e., scoring new
observations) is done in logit modeling
using the same procedure we used in
scoring new observations in linear
regression.
48

Scoring Observations in SAS


1. Create a new data set with the desired
values of the x variables and the y
variable set to missing.
2. Merge the new data set with the
original data set.
3. Refit the final model with PROC
LOGISTIC, using the OUTPUT
statement (see the sketch below).
49
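A sketch of these three steps in SAS (data set and variable names are assumptions, and the ages to score are made up for illustration):

/* 1. New observations with the response set to missing */
data toscore;
   input age;
   disease = .;
   datalines;
25
40
65
;
run;

/* 2. Combine the new observations with the original data */
data combined;
   set outbreak toscore;
run;

/* 3. Refit the final model; observations with a missing response
      are excluded from estimation but still receive predictions */
proc logistic data=combined;
   model disease(event='1') = age;
   output out=scored predprobs=I;
run;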

Classification Table & Rates


A Classification Table is used to
summarize the results of the predictions
and ultimately to evaluate how well the
model predicts.
Obtain a classification table using PROC
FREQ.

50
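A sketch of the cross-tabulation, using the PREDPROBS=I output data set from the earlier sketch (_FROM_ = observed category, _INTO_ = predicted category):

proc freq data=preds;
   /* rows = observed response, columns = predicted response */
   tables _FROM_ * _INTO_ / nocol nopercent;
run;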

Classification Table & Rates


The observed (or actual) response is in
rows and the predicted response is in
columns.

51

Classification Table & Rates


Correct classifications are summarized
on the main diagonal.

52

Classification Table & Rates


The total number of correct
classifications (i.e., hits) is the sum of
the main diagonal frequencies.
O = 130+9 = 139

53

Classification Table & Rates


The total-group hit rate is the ratio of O
to N.
HR = 139/196 = .709

54

Classification Table & Rates


Individual group hit rates can also be
calculated. These are essentially the row
percents on the main diagonal.

55
