
Notes on Logistic Regression

STAT 4330/8330
1

Introduction
Previously, you learned about odds
ratios (ORs).
We now transition and begin discussion
of binary logistic regression. We will see
that ORs play an important role in the
results of binary logistic models.
2

Binary Logistic Regression


Binary Logistic Regression is appropriate when:
1. The response variable is categorical with 2 categories (binary,
dichotomous, etc.). The response categories are often generically
labeled success or failure.
2. One or more explanatory variables are involved. These can be
either quantitative or categorical, or a mixture of both.
3. One is interested in assessing the relationship between the
binary response and the explanatory variables and/or predicting
the response category based on the value(s) of the explanatory
variable(s).
3

The Model Equation
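The equation itself appeared as a graphic on the original slide; in standard notation (a reconstruction, not copied from the slide), the binary logistic model for predictors x1, ..., xk is

$$E(y) = P(y = 1) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}$$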


A few points:
1. E(y) can never fall below 0 or above 1
(Remember: it is a probability!).
2. The model is not a linear function of the
parameters. This is a type of nonlinear
regression model.

The Model Function

The Model Equation


Alternatively, the equation can be
transformed to show that it models the
natural logarithm of the odds of y = 1.
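In the same reconstructed notation, the transformed (log-odds) form of the model is

$$\ln\!\left(\frac{P(y = 1)}{1 - P(y = 1)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$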

The Model Equation

The left side is called the logit

The Model Equation


In general, bi estimates the change in the log-odds
when xi is increased by 1 unit, holding all other x's in
the model fixed.
Therefore, exp(bi) estimates the OR of a success for
each additional 1-unit increase in xi.
Furthermore, (exp(bi)-1)*100 gives the percent
increase in the odds of a success for each 1-unit
increase in xi.
9
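To see why exp(bi) is an odds ratio, note (from the log-odds form above) that increasing xi by 1 unit while holding the other predictors fixed changes the estimated log-odds by exactly bi, so the ratio of the estimated odds is

$$\frac{\widehat{\text{odds}}(x_i + 1)}{\widehat{\text{odds}}(x_i)} = \exp(b_i)$$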

Example: The Outbreak Data


The Outbreak data contain a sample of N = 196
persons in 2 neighborhoods (sectors) of a large
city during a disease outbreak.
Can we predict whether or not a person
contracts the disease?
We will begin with a simple binary logit model
(with 1 predictor = age).
10
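A minimal sketch of this one-predictor fit in SAS follows; the data set and variable names (outbreak, disease, age) are assumptions, not taken from the slides.

proc logistic data=outbreak;
   /* model the probability that disease = 1 (contracted the disease) */
   model disease(event='1') = age;
run;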

Example: The Outbreak Data.


Through SAS PROC LOGISTIC, we
find that b1 = .0285.
Therefore, OR = exp(.0285) = 1.029,
indicating that a person's odds of
contracting the disease increase by a factor
of 1.029 for each additional year of age.
11

Example: The Outbreak Data.


Furthermore, we can state that the
odds of contracting the disease
increase by 2.89% with each additional
year of age.
(exp(.0285)-1)*100% = 2.89%.

12

Example: The Outbreak Data.


We can transform these results to
describe the increase in odds in 5- and 10-
year increments using the following:
exp(c*bi) = the OR when there is a
difference of c units in xi.

13

Example: The Outbreak Data.


Therefore:
(exp(5*.0285)-1)*100% = 15.32%
(exp(10*.0285)-1)*100% = 32.98%
As a result, a person's odds of
getting the disease increase by 15.32%
for every additional 5 years of age.

14

Model Fit
We ended last session fitting a simple (1-predictor) binary logit model to the
Outbreak data using SAS.
We will now continue covering the SAS
PROC LOGISTIC output.

15

Model Fit Statistics

All of these statistics assess model fit in
terms of the explanatory capacity of the model.

16

Model Fit Statistics

-2 Log L: The -2 log-likelihood is a
transformation of the likelihood
function (L). L quantifies how well
the model fits the sample data.
17

Model Fit Statistics

Both AIC & SC are variants of the -2 Log L
that penalize for model complexity
(the number of predictor variables).

18

Model Fit Statistics

AIC: Akaike Information Criterion. Used
to compare non-nested models. Smaller
is better. AIC is only meaningful in
relation to another model's AIC value.
19

Model Fit Statistics

SC: Schwarz Criterion. Very much like
AIC; however, the penalty term is
different. SC tends to favor simpler
models than AIC does.
20
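For reference, the standard definitions of these two criteria (with p = the number of estimated parameters, including the intercept, and n = the sample size) are

$$AIC = -2\log L + 2p, \qquad SC = -2\log L + p\,\ln(n)$$

Both add a penalty to -2 Log L; SC's penalty grows with n, which is why it tends to favor simpler models.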

Model Fit Statistics

Choose either AIC or SC (not both) and
use the values under the heading
"Intercept and Covariates" to compare
competing models.
21

The model equation.

22
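The fitted equation was displayed as SAS output on the original slide. Using the slope reported later in these notes (b1 = .0285), it has the form

$$\hat{\text{logit}}(\hat{\pi}) = b_0 + 0.0285\,(\text{age})$$

where b0 is the estimated intercept from the output (not reproduced here).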

Inference: The Coefficients.

Instead of a t-test for the significance of a
coefficient (like in linear regression), we
have a Wald chi-squared test.

23

Inference: The Coefficients.

Remember, typically we do not evaluate
the intercept, but rather focus on the test
for each predictor.

24

Inference: The Coefficients.

In this case, age is a statistically
significant predictor of disease status at
the α = .05 level, χ²(1) = 11.53,
p = .0007.
25

Inference: The Coefficients.


One can also obtain CIs for the
parameter estimates using the CL option in
the MODEL statement of PROC
LOGISTIC.

26
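A sketch of one way to request these intervals, using the CLPARM= and CLODDS= MODEL-statement options (data set and variable names are the same assumptions as before):

proc logistic data=outbreak;
   /* Wald confidence limits for the coefficients and the odds ratios */
   model disease(event='1') = age / clparm=wald clodds=wald;
run;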

Inference: The Coefficients.

As we found in linear regression, we can
conclude that a given predictor is
statistically significant at the α = .05 level if the
95% CI does not include the null value of
0.
27

Inference: The Coefficients.

Therefore, our best estimate of the
change in the log-odds for age is 0.0285;
however, we are 95% confident that that
change lies between 0.0120 and 0.0449 for
the population.
28

Inference: The Coefficients.


Furthermore: exp(.0285) = 1.029
exp(.0120) = 1.012
exp(.0449) = 1.046
Therefore, we estimate that a person's odds of
contracting the disease increase by a factor of
1.029 for each additional year of age, and we are
95% confident that this multiplier lies
between (1.012, 1.046) for the population.
29

Inference: The Coefficients.


Of course, we no longer have to compute
these odds ratio estimates by hand,
because SAS provides them for us.

30

Inference: The Coefficients.


Furthermore: (exp(.0285)-1)*100% = 2.89%
(exp(.0120)-1)*100% = 1.21%
(exp(.0449)-1)*100% = 4.59%

We can state that the odds of contracting
the disease increase by 2.89% with each
additional year of age, and we are 95%
confident that this increase ranges
between (1.21%, 4.59%) for the population.
31

Final Note: Model Fitting


Realize that in order to estimate the
model parameters, the data must contain
a substantial number of observations in each
response category. For example, one will not be
able to estimate the risk of contracting a
disease if the data set does not contain
any individuals who have been
diagnosed with the disease.
32

Final Note: Model Fitting


Essentially, then, in order to estimate the
probability of either a success or failure,
the data set must contain a substantial
number (> 30 is best) of observations that
experienced a success and a substantial
number that experienced a failure.

33

More about output.


PROC LOGISTIC provides more
information concerning how the model
fits the sample data.

34

More about Model Fit


Percent Concordant
A pair of observations with different
observed responses is considered
concordant if the observation with the
lower ordered response value has a lower
predicted value than the observation
with a higher ordered response value.
35

More about Model Fit


Percent Discordant
A pair is considered discordant if the
observation with the lower ordered
response value has a higher predicted
value than the observation with the higher
ordered response value.

36

More about Model Fit


Percent Tied
A pair with different responses is
considered tied if it is neither concordant
nor discordant.

37

More about Model Fit


Somers' D, Gamma, & Tau-a
These statistics measure the
strength and direction of the association
between the observed responses and the
predicted probabilities, based on these pairs.

38

More about Model Fit


Somers' D & Tau-a
Like r, these vary between -1.0 (all
pairs discordant) & +1.0 (all pairs
concordant).
Somers' D = the difference between
the % concordant and the % discordant,
divided by 100.
39

More about Model Fit


Gamma
Gamma is a similar statistic: its
values also range between -1.0 & +1.0;
however, it is computed from only the
untied pairs, so 0 indicates no association
and ±1.0 indicates perfect association.

40
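For reference, the standard pair-based definitions (a sketch, with nc = number of concordant pairs, nd = number of discordant pairs, t = total number of pairs with different responses, and N = number of observations) are

$$D = \frac{n_c - n_d}{t}, \qquad \gamma = \frac{n_c - n_d}{n_c + n_d}, \qquad \tau_a = \frac{n_c - n_d}{N(N-1)/2}$$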

Predicted Values
The output of a logit model is the
predicted probability of a success for
each observation.

41

Predicted Values
These are obtained and stored in a
separate SAS data set using the OUTPUT
statement (see the following code).

42
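The code shown on the original slide is not reproduced here; the following is a sketch of what such a call typically looks like, again assuming the outbreak data set and the disease and age variable names. The P=, LOWER=, and UPPER= options store the predicted probability and its 95% confidence limits, and PREDPROBS=I adds the _FROM_, _INTO_, and IP_ variables discussed below.

proc logistic data=outbreak;
   model disease(event='1') = age;
   /* write predictions (plus the original variables) to work.preds */
   output out=preds p=phat lower=lcl upper=ucl predprobs=I;
run;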

Predicted Values
PROC LOGISTIC outputs the predicted
values and 95% CI limits to an output
data set that also contains the original
raw data.

43

Predicted Values
Use the PREDPROBS = I option in order
to obtain the predicted category (which
is saved in the _INTO_ variable).

44

Predicted Values
_FROM_ = The observed response
category = The same value as the
response variable.

45

Predicted Values
_INTO_ = The predicted response
category.

46

Predicted Values
IP_1 = The Individual Probability of a
response of 1.

47
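A quick way to inspect these variables in the output data set (names as assumed in the sketch above):

proc print data=preds(obs=10);
   var age disease _FROM_ _INTO_ IP_1;
run;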

Scoring Observations in SAS


Obtaining predicted probabilities and/or
predicted outcomes (categories) for new
observations (i.e., scoring new
observations) is done in logit modeling
using the same procedure we used in
scoring new observations in linear
regression.
48

Scoring Observations in SAS


1. Create a new data set with the desired
values of the x variables and the y
variable set to missing.
2. Merge the new data set with the
original data set.
3. Refit the final model with PROC
LOGISTIC, using the OUTPUT
statement (see the sketch below).
49
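A sketch of these three steps in SAS (data set and variable names are assumptions, and the ages to score are made up for illustration):

/* 1. New observations with the response set to missing */
data toscore;
   input age;
   disease = .;
   datalines;
25
40
65
;
run;

/* 2. Combine the new observations with the original data */
data combined;
   set outbreak toscore;
run;

/* 3. Refit the final model; observations with a missing response
      are excluded from estimation but still receive predictions */
proc logistic data=combined;
   model disease(event='1') = age;
   output out=scored predprobs=I;
run;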

Classification Table & Rates


A Classification Table is used to
summarize the results of the predictions
and ultimately to evaluate how well the
model predicts.
Obtain a classification table using PROC
FREQ.

50
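A sketch of the cross-tabulation, using the PREDPROBS=I output data set from the earlier sketch (_FROM_ = observed category, _INTO_ = predicted category):

proc freq data=preds;
   /* rows = observed response, columns = predicted response */
   tables _FROM_ * _INTO_ / nocol nopercent;
run;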

Classification Table & Rates


The observed (or actual) response is in
rows and the predicted response is in
columns.

51

Classification Table & Rates


Correct classifications are summarized
on the main diagonal.

52

Classification Table & Rates


The total number of correct
classifications (i.e., hits) is the sum of
the main diagonal frequencies.
O = 130+9 = 139

53

Classification Table & Rates


The total-group hit rate is the ratio of O
to N.
HR = 139/196 = .709

54

Classification Table & Rates


Individual group hit rates can also be
calculated. These are essentially the row
percents on the main diagonal.

55
