
Logistic Regression in SPSS

This example is adapted from information in the Statistical Analysis Quick Reference Guidebook (Elliott & Woodward, 2007). A sales director for a chain of appliance stores wants to find out what circumstances encourage customers to purchase extended warranties after a major appliance purchase. The response variable is an indicator of whether or not a warranty is purchased. The predictor variables to consider are:

- Customer gender
- Age of the customer
- Whether a gift is offered with the warranty
- Price of the appliance
- Race of the customer

There are several strategies you can take to develop the best model for the data. It is recommended that you examine several models before deciding which one is best for your analysis. (In this example we allow the computer to help identify important variables, but it is inadvisable to accept a computer-designated model without examining alternatives.) Begin by examining the significance of each variable in a fully populated model.

1. Open the data set named WARRANTY.SAV (downloadable from the data section) and choose Analyze/Regression/Binary Logistic.
2. Select Bought as the dependent variable and Gender, Gift, Age, Price, and Race as the covariates (i.e., the independent or predictor variables).
3. Click on the Categorical checkbox (it is a button in SPSS version 16) and specify Race as a categorical variable. Click Continue and then OK.

This produces the following SPSS output table.
Variables in the Equation

                        B       S.E.     Wald    df    Sig.     Exp(B)
Step 1   Gender      -3.772     2.568    2.158    1    .142       .023
         Price         .001      .000    3.363    1    .067      1.001
         Age           .091      .056    2.638    1    .104      1.096
         Gift         2.715     1.567    3.003    1    .083     15.112
         Race                            2.827    3    .419
         Race(1)      3.773    13.863     .074    1    .785     43.518
         Race(2)      1.163    13.739     .007    1    .933      3.199
         Race(3)      6.347    14.070     .203    1    .652    570.898
         Constant   -12.018    14.921     .649    1    .421       .000
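SPSS produces this table through the menus, but if you want to double-check the fit outside SPSS, the same full model can be estimated in a few lines of Python with statsmodels. This is only an illustrative sketch: it assumes the WARRANTY.SAV data have been exported to a CSV file with columns named Bought (coded 0/1), Gender, Gift, Age, Price, and Race, and the dummy coding of Race (and hence its individual coefficients) may use a different reference category than SPSS does.

# Illustrative sketch only -- not the SPSS procedure.
# Assumes WARRANTY.SAV has been exported to warranty.csv with columns
# Bought (0/1), Gender, Gift, Age, Price, Race; these names are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("warranty.csv")

# C(Race) asks statsmodels to build indicator (dummy) variables for Race,
# analogous to declaring Race as categorical in the SPSS dialog.
full_model = smf.logit("Bought ~ Gender + Gift + Age + Price + C(Race)", data=df).fit()

print(full_model.summary())        # coefficients (B), standard errors, p-values
print(np.exp(full_model.params))   # Exp(B): the odds ratios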

The Variables in the Equation table shows the output resulting from including all of the candidate predictor variables in the equation. Notice that the Race variable, which was originally coded as 1=White, 2=African American, 3=Hispanic, and 4=Other, has been changed (by the SPSS logistic procedure) into three (4 - 1) indicator variables called Race(1), Race(2), and Race(3). These three variables each enter the equation with their own coefficient and p-value, and there is an overall p-value given for Race. The significance of each variable is measured using a Wald statistic. Using p = 0.10 as the cutoff criterion for retaining variables in the equation, it can be seen that Gender (p = 0.142) and Race (p = 0.419) do not appear to be important predictor variables. Age is marginal (p = 0.104), but we'll leave it in for the time being. Rerun the analysis after taking out Gender and Race as predictor variables. Rerunning the analysis without these unimportant variables yields the following output:
Variables in the Equation

                        B      S.E.     Wald    df    Sig.    Exp(B)
Step 1   Price        .000     .000    6.165     1    .013     1.000
         Age          .064     .032    4.132     1    .042     1.066
         Gift        2.339    1.131    4.273     1    .039    10.368
         Constant   -6.096    2.142    8.096     1    .004      .002

This reduced model indicates that there is significant predictive power for the variables Gift (p = 0.039), Age (p = 0.042), and Price (p = 0.013). Although the p-value for Price is small, notice that the OR = 1 and the coefficient for Price is zero to three decimal places. These seemingly contradictory bits of information (i.e., a small p-value but OR = 1.0) suggest that the scale of the Price values is hiding the actual odds ratio (OR) relationship. If the same model is run with the variable Price100, which is Price divided by 100, the odds ratio for Price100 is 1.041 and the estimated coefficient for Price100 is 0.040, as shown below.
Variables in the Equation

                        B      S.E.     Wald    df    Sig.    Exp(B)
Step 1   Age          .064     .032    4.132     1    .042     1.066
         Gift        2.339    1.131    4.273     1    .039    10.368
         Price100     .040     .016    6.165     1    .013     1.041
         Constant   -6.096    2.142    8.096     1    .004      .002
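The effect of the rescaling is easy to verify by hand: dividing Price by 100 multiplies its coefficient by 100, and the odds ratio is simply e raised to the coefficient. The short Python sketch below uses the rounded coefficients reported above purely as an arithmetic check; it is not part of the SPSS output.

import numpy as np

b_price = 0.040 / 100           # coefficient per $1 (reported as .000 at three decimals)
b_price100 = 0.040              # coefficient per $100 after rescaling

print(np.exp(b_price))          # about 1.0004 -> displayed by SPSS as 1.000
print(np.exp(b_price100))       # about 1.041  -> the odds ratio per $100
print(np.exp(b_price) ** 100)   # same thing: one hundred $1 steps gives about 1.041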

All of the other values in the table remain the same as before. All we have done is to recode Price into a more usable number. Another tactic often used is to standardize values such as Price by subtracting the mean and dividing by the standard deviation. Using standardized scores eliminates the problem observed with the Price variable and also simplifies the comparison of odds ratios across variables. The result is that we can now see that the odds that a customer who is offered a gift will purchase a warranty are about 10 times (see Exp(B) for Gift) the corresponding odds for a customer not offered a gift. We also observe that for each additional $100 in Price, the odds that a customer will purchase a warranty increase by about 4%. This tells us that people tend to be more likely to purchase warranties for more expensive appliances. Finally, the OR for Age, 1.066, tells us that older buyers are more likely to purchase a warranty.

One way to assess the model is by the Hosmer-Lemeshow criterion. To produce this information:

4. Rerun the analysis, click on Options, and select the Hosmer-Lemeshow goodness-of-fit option. Click Continue and OK.
Hosmer and Lemeshow Test

          Chi-square    df    Sig.
Step 1       1.792       8    .987

This test divides the data into several groups based on predicted probabilities, then computes a chi-square statistic from the observed and expected frequencies of subjects falling into the two categories of the binary response variable within these groups. Large chi-square values (and correspondingly small p-values) indicate a lack of fit for the model. In the table above we see that the Hosmer-Lemeshow chi-square test for the final warranty model yields a p-value of 0.987, suggesting a model with good predictive value. Note that the Hosmer and Lemeshow chi-square test is not a test of the importance of specific model parameters (which may also appear in your computer printout). It is a separate post hoc test performed to evaluate a specific model.
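SPSS computes this test for you, but the statistic itself is simple to reproduce. The following is a minimal Python sketch of the usual ten-group (deciles-of-risk) version, assuming you already have each subject's observed outcome (0/1) and the model's predicted probability; it approximates the textbook formula rather than the exact SPSS grouping, so results may differ slightly.

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    """Approximate Hosmer-Lemeshow statistic from 0/1 outcomes y and
    predicted probabilities p, grouping subjects into deciles of risk."""
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    chi_sq = 0.0
    for chunk_y, chunk_p in zip(np.array_split(y, groups), np.array_split(p, groups)):
        observed = chunk_y.sum()       # observed "Yes" count in this group
        expected = chunk_p.sum()       # expected "Yes" count = sum of probabilities
        n, pbar = len(chunk_y), chunk_p.mean()
        chi_sq += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
    df = groups - 2
    return chi_sq, df, chi2.sf(chi_sq, df)   # large p-value -> no evidence of lack of fit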

Interpretation of the multiple logistic regression model

Once we are satisfied with the model, it can be used for prediction just as in the simple logistic example above. For this model, the prediction would be

p = e^(-6.096 + 2.339*Gift + .064*Age + .040*Price100) / (1 + e^(-6.096 + 2.339*Gift + .064*Age + .040*Price100))

(For more details on prediction see Statistical Analysis Quick Reference Guidebook, Elliott & Woodward, 2007.) Using this equation, it would be reasonable to predict that a person with the characteristics Age = 54, Price = $3,850, and Gift = 1 would purchase a warranty, because p = .775, while the same person with no gift offered would not be predicted to purchase a warranty, because p = .25.

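These two probabilities can be reproduced with a little arithmetic using the rounded coefficients from the table above; the short Python sketch below is only a numerical check, not part of the SPSS output.

import numpy as np

def predicted_prob(gift, age, price100):
    """Predicted probability of purchasing a warranty,
    using the rounded coefficients from the reduced model."""
    logit = -6.096 + 2.339 * gift + 0.064 * age + 0.040 * price100
    return np.exp(logit) / (1 + np.exp(logit))

# Price = $3,850, so Price100 = 38.5
print(predicted_prob(gift=1, age=54, price100=38.5))   # about 0.775 (gift offered)
print(predicted_prob(gift=0, age=54, price100=38.5))   # about 0.25 (no gift)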
The typical cutoff for the decision would be 0.5 (or 50%). Thus, using this cutoff, anyone whose score was higher than 0.5 would be predicted to buy the warranty and anyone with a lower score would be predicted not to buy the warranty. However, there may be times when you want to adjust this cutoff value. Neter et al. (1996) suggest three ways to select a cutoff value for predicting:

- Use the standard 0.5 cutoff value.
- Determine a cutoff value that will give you the best predictive fit for your sample data. This is usually determined through trial and error.
- Select a cutoff value that will separate your sample data into a specific proportion of your two states, based on a prior known proportion split in your population.

For example, to use the second option for deciding on a cutoff value, examine the model classification table that is part of the SPSS logistic output:
Classification Table(a)

                                       Predicted
                                       Bought            Percentage
          Observed                     No      Yes       Correct
Step 1    Bought       No              12        2         85.7
                       Yes              1       35         97.2
          Overall Percentage                               94.0

a. The cut value is .500

This table indicates that the final model correctly classifies 94% of the cases. The model used the default 0.5 cutoff value to classify each subject's outcome. (Notice the footnote on the table: "The cut value is .500.") You can rerun the analysis with a series of cutoff values such as 0.4, 0.45, 0.55, and 0.65 to see if the cutoff value could be adjusted for a better fit. For this particular model, these alternate cutoff values do not lead to better predictions, so the default 0.5 cutoff value is deemed sufficient. (For more information about classification see Statistical Analysis Quick Reference Guidebook, 2007.)
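If you prefer not to rerun the SPSS analysis for every candidate cutoff, the overall classification rate at each cutoff can be computed directly from the predicted probabilities. A minimal sketch, assuming you have arrays of observed outcomes (0/1) and predicted probabilities:

import numpy as np

def classification_rate(y, p, cutoff):
    """Fraction of subjects classified correctly at the given cutoff."""
    predicted = (np.asarray(p) >= cutoff).astype(int)
    return (predicted == np.asarray(y)).mean()

# Compare the cutoffs suggested in the text, e.g.:
# for c in (0.40, 0.45, 0.50, 0.55, 0.65):
#     print(c, classification_rate(y, p, c))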

References

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Third Edition. Lawrence Erlbaum Associates, Publishers.

Elliott, A., and Woodward, W. (2007). Statistical Analysis Quick Reference Guidebook. Thousand Oaks: Sage.

Hosmer, D. W., and Lemeshow, S. (2000). Applied Logistic Regression, 2nd edition. New York: John Wiley and Sons, Inc.

Neter, J., Wasserman, W., Nachtsheim, C. J., and Kutner, M. H. (1996). Applied Linear Regression Models (3rd ed.). Chicago: Irwin.
