Probit Analysis

This procedure measures the relationship between the strength of a stimulus and the proportion of cases exhibiting a certain response to the stimulus. It is useful for situations where you have a dichotomous output that is thought to be influenced or caused by levels of some independent variable(s) and is particularly well suited to experimental data. This procedure will allow you to estimate the strength of a stimulus required to induce a certain proportion of responses, such as the median effective dose.

Example. How effective is a new pesticide at killing ants, and what is an appropriate concentration to use? You might perform an experiment in which you expose samples of ants to different concentrations of the pesticide and then record the number of ants killed and the number of ants exposed. Applying probit analysis to these data, you can determine the strength of the relationship between concentration and killing, and you can determine what the appropriate concentration of pesticide would be if you wanted to be sure to kill, say, 95% of exposed ants.

Statistics. Regression coefficients and standard errors, intercept and standard error, Pearson goodness-of-fit chi-square, observed and expected frequencies, and confidence intervals for effective levels of independent variable(s). Plots: transformed response plots.
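The dose-response calculation described in the example can be sketched outside SPSS. The following Python snippet fits a probit model by maximum likelihood with scipy and solves for the concentration expected to kill 95% of exposed ants; the concentrations and kill counts are hypothetical values invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical ant-pesticide data: concentrations (modeled on a log10
# scale, as probit analysis typically does), ants killed, ants exposed.
log_dose = np.log10(np.array([1.0, 2.0, 4.0, 8.0, 16.0]))
killed   = np.array([ 6, 13, 25, 38, 47])
exposed  = np.array([50, 50, 50, 50, 50])

def neg_log_lik(params):
    b0, b1 = params
    p = norm.cdf(b0 + b1 * log_dose)       # probit link: p = Phi(z)
    p = np.clip(p, 1e-10, 1 - 1e-10)       # guard against log(0)
    return -np.sum(killed * np.log(p) + (exposed - killed) * np.log(1 - p))

res = minimize(neg_log_lik, x0=[0.0, 1.0], method="Nelder-Mead")
b0, b1 = res.x

# Effective dose for a 95% kill rate: solve Phi(b0 + b1*log10(d)) = 0.95.
ld95 = 10 ** ((norm.ppf(0.95) - b0) / b1)
print(round(ld95, 1))
```

With these invented counts the estimated 95% effective concentration lands in the high teens; real data would of course give different coefficients and intervals.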
You can specify options for your probit analysis:

Statistics. Allows you to request the following optional statistics: Frequencies, Relative median potency, Parallelism test, and Fiducial confidence intervals.

Relative Median Potency. Displays the ratio of median potencies for each pair of factor levels. Also shows 95% confidence limits for each relative median potency. Relative median potencies are not available if you do not have a factor variable or if you have more than one covariate.

Parallelism Test. A test of the hypothesis that all factor levels have a common slope.

Fiducial Confidence Intervals. Confidence intervals for the dosage of agent required to produce a certain probability of response.

Fiducial confidence intervals and Relative median potency are unavailable if you have selected more than one covariate. Relative median potency and Parallelism test are available only if you have selected a factor variable.

Natural Response Rate. Allows you to indicate a natural response rate even in the absence of the stimulus. Available alternatives are None, Calculate from data, or Value.

Calculate from Data. Estimate the natural response rate from the sample data. Your data should contain a case representing the control level, for which the value of the covariate(s) is 0. Probit estimates the natural response rate using the proportion of responses for the control level as an initial value.

Value. Sets the natural response rate in the model (select this item when you know the natural response rate in advance). Enter the natural response proportion (the proportion must be less than 1). For example, if the response occurs 10% of the time when the stimulus is 0, enter 0.10.

Criteria. Allows you to control parameters of the iterative parameter-estimation algorithm. You can override the defaults for Maximum iterations, Step limit, and Optimality tolerance.
The command syntax language also allows you to:
Request an analysis on both the probit and logit models.
Control the treatment of missing values.
Transform the covariates by bases other than base 10 or natural log.
Probit analysis is most appropriate when you want to estimate the effects of one or more independent variables on a binomial dependent variable, particularly in the setting of a dose-response experiment. For example, a retail company wants to establish the relationship between the size of a promotion (measured as a percentage off the retail price) and the probability that a customer will buy. Moreover, they want to establish this relationship for their store, catalog, and internet sales. In the context of a dose-response experiment, the promotion size can be considered a dose to which the customers respond by buying. The three sites at which a customer can shop correspond to different agents to which the customer is introduced. Using probit analysis, the company can determine whether promotions have approximately the same effects on sales in the different markets.

Probit Analysis is designed to model the probability of response to a stimulus. Since the probability of an event must lie between 0 and 1, it is impractical to model probabilities with linear regression techniques, because the linear regression model allows the dependent variable to take values greater than 1 or less than 0. The probit analysis model is a type of generalized linear model that extends the linear regression model by linking the range of real numbers to the 0-1 range.

Start by considering the existence of an unobserved continuous variable, Z, which can be thought of as the "propensity towards" the event of interest. In the case of the retail company, Z represents a customer's propensity to buy, with larger values of Z corresponding to greater probabilities of buying. Mathematically, the relationship between Z and the probability of response is:

pi = c + (1 − c)F(zi)

where
pi is the probability the ith case experiences the event of interest
zi is the value of the unobserved continuous variable for the ith case
F is a link function. See Link Functions for the Probit Analysis Procedure for more information.
c is the natural response rate. See Natural Response Rate for more information.

The model also assumes that Z is linearly related to the predictors:

zi = b0 + b1xi1 + b2xi2 + ... + bpxip

where
xij is the jth predictor for the ith case when there is no grouping variable. When there is a grouping variable, indicator variables are constructed to represent the levels of the grouping variable and added to the list of predictors.
bj is the jth coefficient
p is the number of predictors

If Z were observable, you would simply fit a linear regression to Z and be done. However, since Z is unobserved, you must relate the predictors to the probability of interest by substituting for Z:

pi = c + (1 − c)F(b0 + b1xi1 + b2xi2 + ... + bpxip)

The model coefficients are estimated through an iterative maximum likelihood method.

Link functions transform Z to a 0-1 scale, thus providing the "link" between the coefficients and the probability of interest. The link functions available are the probit and logit. The logit link will produce a logistic regression model:

F(Z) = 1 / (1 + e^(−Z))

The probit link assumes that Z is approximately normally distributed:

F(Z) = Φ(Z)

where Φ is the cumulative distribution function of the standard normal distribution (so F−1 is the inverse standard normal distribution). The logit and probit links often give similar results, though the probit link discriminates better near the median potency (.5 probability response) and the logit link performs better elsewhere.

The natural response rate is the probability of getting a response with no dose. In the retail example, the natural response rate is the proportion of people who would buy without a promotional offer. A natural response rate of 0 means the response is due only to the stimulus. You can specify the value of the natural response rate (if known), or allow it to be estimated from the data.
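As a minimal numerical check of the claims above, the two link functions can be compared directly: both map the real line into (0, 1), agree at z = 0, and the logistic curve tracks the normal CDF closely once z is rescaled (the widely cited approximation uses a factor of about 1.7).

```python
import numpy as np
from scipy.stats import norm

def logit_F(z):
    """Logistic CDF: F(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def probit_F(z):
    """Standard normal CDF: F(z) = Phi(z)."""
    return norm.cdf(z)

z = np.linspace(-4.0, 4.0, 801)
# Maximum gap between the probit curve and a rescaled logistic curve.
gap = np.max(np.abs(probit_F(z) - logit_F(1.7 * z)))
print(round(gap, 4))
```

The gap is under 0.01, which is why the two links rarely lead to different practical conclusions away from the extreme tails.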
A retail company conducts an experiment to test the effects of different promotions (where the promotion is a percentage off the retail price) on online, catalog, and in-store sales. The promotional offers are made to randomly selected patrons, and the numbers of resulting sales are recorded. This information is collected in offer.sav. Use the Probit Analysis procedure to construct a dose-response model for the promotional effects on sales.
The Probit Analysis procedure is a useful tool for modeling a binomial dependent variable. The Binary Logistic Regression procedure uses a logistic link model for predicting the event probability for a categorical response variable with two outcomes, and is generally more useful than Probit Analysis when your study is not a dose-response experiment. The Ordinal Regression procedure can be used to predict a categorical variable with two outcomes, and offers several different link functions. It is a better choice than Probit Analysis for general probit modeling when your study is not a dose-response experiment.

Finney, D. J. 1971. Probit Analysis. Cambridge: Cambridge University Press.
Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion. Upper Saddle River, N.J.: Prentice Hall, Inc.
Example. A study in Florida included 219 alligators. How does the alligators' food type vary with their size and the four lakes in which they live? The study found that the odds that a smaller alligator prefers reptiles to fish are 0.70 times the odds for larger alligators; also, the odds of selecting primarily reptiles instead of fish were highest in lake 3.

Statistics. Observed and expected frequencies; raw, adjusted, and deviance residuals; design matrix; parameter estimates; generalized log-odds ratio; Wald statistic; and confidence intervals. Plots: adjusted residuals, deviance residuals, and normal probability plots.
Deviance residuals. The signed square root of an individual contribution to the likelihood-ratio chi-square statistic (G squared), where the sign is the sign of the residual (observed count minus expected count). Deviance residuals have an asymptotic standard normal distribution.

This feature requires the Advanced Models option. From the menus choose:
Analyze > Loglinear > Logit...
In the Logit Loglinear Analysis dialog box, click Save.
Logit Loglinear Analysis

The Logit Loglinear Analysis procedure is used to model the values of one or more categorical variables given one or more categorical predictors. This is accomplished through analysis of the cell counts of the crosstabulation table formed by the cross-classification of the response and predictor variables.

The Model. Logit loglinear models are "ANOVA-like" models for the logit-expected cell counts of crosstabulation tables. Logits are formed by the log-ratios of cell counts, where the cells in a given logit correspond to pairs of values of the dependent variable, for a given cross-classification of factors.

Dependent Variables. Dependent variables are categorical responses whose values you want to model.

Factors. Factors are categorical predictor variables that help define the crosstabulation table.

Covariates. Scale predictors can be added as covariates in the model. Within cells of the crosstabulation table, the mean covariate values of cases in the cell are used to model the cell counts.
Cell Structure. The cell structure variable allows you to exclude cells from the analysis. This can be helpful if you want to impose a particular structure on the crosstabulation table. See the General Loglinear Analysis case studies for further uses of the cell structure variable.

Contrasts. You can specify a set of contrast variables to test the differences between model effects.

Using Logit Loglinear Analysis to Model Consumer Preference of Packaged Goods
As part of an effort to improve the marketing of its breakfast options, a consumer packaged goods company polls 880 people, noting their age, gender, marital status, and whether or not they have an active lifestyle (based upon whether they exercise at least twice a week). Each participant then tasted 3 breakfast foods and was asked which one they liked best. This information is collected in cereal.sav. Use Logit Loglinear Analysis to determine marketing profiles for each breakfast option.
Using the Logit Loglinear Analysis procedure, you have constructed a model for predicting consumer choice of breakfast products. Note that this model is identical to that obtained by the Multinomial Logistic Regression procedure.
The Logit Loglinear Analysis procedure is a useful tool for modeling the values of one or more categorical response variables.
The General Loglinear Analysis procedure uses loglinear models to study relationships between categorical variables without specifying response or predictor variables. If there is one dependent variable, you can alternately use the Multinomial Logistic Regression procedure. If there is one dependent variable and it has just two categories, you can alternately use the Logistic Regression procedure. If there is one dependent variable and its categories are ordered, you can alternately use the Ordinal Regression procedure.

See the following texts for more information on loglinear models:
Agresti, A. 2002. Categorical Data Analysis, 2nd ed. New York: John Wiley and Sons.
Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: John Wiley and Sons.
Bishop, Y. M., S. E. Fienberg, and P. W. Holland. 1977. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Fienberg, S. E. 1994. The Analysis of Cross-Classified Categorical Data, 2nd ed. Cambridge, MA: MIT Press.
Knoke, D., and P. J. Burke. 1980. Log-Linear Models. Thousand Oaks, Calif.: Sage Publications, Inc.
Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion. Upper Saddle River, N.J.: Prentice Hall, Inc.
dating status has upon the type of film they prefer. The studio can then slant the advertising campaign of a particular movie toward a group of people likely to go see it.

Statistics. Iteration history, parameter coefficients, asymptotic covariance and correlation matrices, likelihood-ratio tests for model and partial effects, −2 log-likelihood, Pearson and deviance chi-square goodness of fit, and Cox and Snell, Nagelkerke, and McFadden R2 statistics. Classification: observed versus predicted frequencies by response category. Crosstabulation: observed and predicted frequencies (with residuals) and proportions by covariate pattern and response category.

Methods. A multinomial logit model is fit for the full factorial model or a user-specified model. Parameter estimation is performed through an iterative maximum-likelihood algorithm.
Forced Entry Terms. Terms added to the forced entry list are always included in the model.

Stepwise Terms. Terms added to the stepwise list are included in the model according to one of the following user-selected stepwise methods:
Forward entry. This method begins with no stepwise terms in the model. At each step, the most significant term is added to the model until none of the stepwise terms left out of the model would have a statistically significant contribution if added to the model.
Backward elimination. This method begins by entering all terms specified on the stepwise list into the model. At each step, the least significant stepwise term is removed from the model until all of the remaining stepwise terms have a statistically significant contribution to the model.
Forward stepwise. This method begins with the model that would be selected by the forward entry method. From there, the algorithm alternates between backward elimination on the stepwise terms in the model and forward entry on the terms left out of the model. This continues until no terms meet the entry or removal criteria.
Backward stepwise. This method begins with the model that would be selected by the backward elimination method. From there, the algorithm alternates between forward entry on the terms left out of the model and backward elimination on the stepwise terms in the model. This continues until no terms meet the entry or removal criteria.

Include intercept in model. Allows you to include or exclude an intercept term for the model.
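The forward entry method can be sketched in Python. This is not SPSS's implementation, just an illustration on synthetic data: a small binary logistic model is refit with scipy for each candidate term, and the term with the most significant likelihood-ratio test is added until none reaches p < 0.05.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))            # three candidate predictors
z = 2.0 * X[:, 0]                      # only the first truly matters
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-z))).astype(float)

def fit_loglik(cols):
    """Maximized log-likelihood of a logistic model with an intercept
    plus the given predictor columns."""
    D = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    def nll(b):
        eta = D @ b
        # Stable form of sum(log(1 + e^eta) - y*eta)
        return np.sum(np.logaddexp(0.0, eta) - y * eta)
    res = minimize(nll, np.zeros(D.shape[1]), method="BFGS")
    return -res.fun

# Forward entry: repeatedly add the candidate whose likelihood-ratio
# test is most significant, until no candidate reaches p < 0.05.
selected, candidates = [], [0, 1, 2]
while candidates:
    base = fit_loglik(selected)
    lr = {c: 2.0 * (fit_loglik(selected + [c]) - base) for c in candidates}
    best = max(lr, key=lr.get)
    if chi2.sf(lr[best], df=1) >= 0.05:
        break
    selected.append(best)
    candidates.remove(best)

print(selected)
```

With this seed the genuinely informative predictor enters first; the two noise predictors are usually (though not with certainty, at the 5% level) left out.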
Category Order. In ascending order, the lowest value defines the first category and the highest value defines the last. In descending order, the highest value defines the first category and the lowest value defines the last.
You can specify the following statistics for your Multinomial Logistic Regression:

Case processing summary. This table contains information about the specified categorical variables.

Model. Statistics for the overall model.
Pseudo R-square. Prints the Cox and Snell, Nagelkerke, and McFadden R2 statistics.
Step summary. This table summarizes the effects entered or removed at each step in a stepwise method. It is not produced unless a stepwise model is specified in the Model dialog box.
Model fitting information. This table compares the fitted and intercept-only or null models.
Information criteria. This table prints Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).
Cell probabilities. Prints a table of the observed and expected frequencies (with residual) and proportions by covariate pattern and response category.
Classification table. Prints a table of the observed versus predicted responses.
Goodness-of-fit chi-square statistics. Prints Pearson and likelihood-ratio chi-square statistics. Statistics are computed for the covariate patterns determined by all factors and covariates or by a user-defined subset of the factors and covariates.
Monotonicity measures. Displays a table with information on the number of concordant pairs, discordant pairs, and tied pairs. Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and the Concordance Index C are also displayed in this table.

Parameters. Statistics related to the model parameters.
Estimates. Prints estimates of the model parameters, with a user-specified level of confidence.
Likelihood ratio test. Prints likelihood-ratio tests for the model partial effects. The test for the overall model is printed automatically.
Asymptotic correlations. Prints the matrix of parameter estimate correlations.
Asymptotic covariances. Prints matrix of parameter estimate covariances. Define Subpopulations. Allows you to select a subset of the factors and covariates in order to define the covariate patterns used by cell probabilities and the goodness-of-fit tests.
Log-likelihood convergence. Convergence is assumed if the absolute change in the log-likelihood function is less than the specified value. The criterion is not used if the value is 0. Specify a non-negative value.
Parameter convergence. Convergence is assumed if the absolute change in the parameter estimates is less than this value. The criterion is not used if the value is 0.
Delta. Allows you to specify a non-negative value less than 1. This value is added to each empty cell of the crosstabulation of response category by covariate pattern. This helps to stabilize the algorithm and prevent bias in the estimates.
Singularity tolerance. Allows you to specify the tolerance used in checking for singularities.
Entry test. This is the method for entering terms in stepwise methods. Choose between the likelihood-ratio test and score test. This criterion is ignored unless the forward entry, forward stepwise, or backward stepwise method is selected.
Removal Probability. This is the probability of the likelihood-ratio statistic for variable removal. The larger the specified probability, the easier it is for a variable to remain in the model. This criterion is ignored unless the backward elimination, forward stepwise, or backward stepwise method is selected.
Removal Test. This is the method for removing terms in stepwise methods. Choose between the likelihood-ratio test and Wald test. This criterion is ignored unless the backward elimination, forward stepwise, or backward stepwise method is selected.
Minimum Stepped Effects in Model. When using the backward elimination or backward stepwise methods, this specifies the minimum number of terms to include in the model. The intercept is not counted as a model term.
Maximum Stepped Effects in Model. When using the forward entry or forward stepwise methods, this specifies the maximum number of terms to include in the model. The intercept is not counted as a model term.
Hierarchically constrain entry and removal of terms. This option allows you to choose whether to place restrictions on the inclusion of model terms. Hierarchy requires that for any term to be included, all lower-order terms that are a part of the term to be included must be in the model first. For example, if the hierarchy requirement is in effect, the factors Marital status and Gender must both be in the model before the Marital status*Gender interaction can be added. The three radio button options determine the role of covariates in determining hierarchy.
Estimated response probabilities. These are the estimated probabilities of classifying a factor/covariate pattern into the response categories. There are as many estimated probabilities as there are categories of the response variable; up to 25 will be saved.
Predicted category. This is the response category with the largest expected probability for a factor/covariate pattern.
Predicted category probabilities. This is the maximum of the estimated response probabilities.
Actual category probability. This is the estimated probability of classifying a factor/covariate pattern into the observed category.
Export model information to XML file. Parameter estimates and (optionally) their covariances are exported to the specified file in XML (PMML) format. SmartScore and SPSS Server (a separate product) can use this model file to apply the model information to other data files for scoring purposes.

Multinomial Logistic Regression

Multinomial Logistic Regression is useful for situations in which you want to be able to classify subjects based on values of a set of predictor variables. This type of regression is similar to binary logistic regression, but is more general because the dependent variable is not restricted to two categories. For example, you can conduct a survey in which participants are asked to select one of several competing products as their favorite. Using multinomial logistic regression, you can create profiles of people who are most likely to be interested in your product, and plan your advertising strategy accordingly.

The Multinomial Logistic Regression Model

Linear regression is not appropriate for situations in which there is no natural ordering to the values of the dependent variable. In such cases, multinomial logistic regression may be the best alternative. For a dependent variable with K categories, consider the existence of K unobserved continuous variables, Z1, ..., ZK, each of which can be thought of as the "propensity toward" a category.
In the case of a packaged goods company, Zk represents a customer's propensity toward selecting the kth product, with larger values of Zk corresponding to greater probabilities of choosing that product (assuming all other Z's remain the same). Mathematically, the relationship between the Z's and the probability of a particular outcome is:

pik = exp(Zik) / (exp(Zi1) + exp(Zi2) + ... + exp(ZiK))

where pik is the probability that the ith case falls in the kth category.
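Assuming the standard multinomial logit (softmax) form of this relationship, the category probabilities can be computed directly from the propensities; the Z values below are hypothetical.

```python
import numpy as np

def category_probs(Z):
    """Multinomial logit: P(category k) = exp(Z_k) / sum_j exp(Z_j).
    One reference category's Z is conventionally fixed at 0."""
    Z = np.asarray(Z, dtype=float)
    e = np.exp(Z - Z.max())        # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical propensities toward three breakfast products.
p = category_probs([1.2, 0.4, 0.0])
print(p)
```

Whatever the propensities, the resulting probabilities are positive and sum to 1, which is exactly the constraint that plain linear regression cannot enforce.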
As part of an effort to improve the marketing of its breakfast options, a consumer packaged goods company polls 880 people, noting their age, gender, marital status, and whether or not they have an active lifestyle (based upon whether they exercise at least twice a week). Each participant then tasted 3 breakfast foods and was asked which one they liked best. This information is collected in cereal.sav. Use Multinomial Logistic Regression to determine marketing profiles for each breakfast option.
Choosing the Right Model

There are usually several models that pass the diagnostic checks, so you need tools to choose between them.

Variable Selection. When constructing a model, you generally want to include only predictors that contribute significantly to the model. The likelihood ratio statistics table tests each variable's contribution to the model. Additionally, the Multinomial Logistic Regression procedure offers several methods for stepwise selection of the "best" predictors to include in the model. See Using Multinomial Logistic Regression to Classify Telecommunications Customers for more information.
Pseudo R-Squared Statistics. The r-squared statistic, which measures the variability in the dependent variable that is explained by a linear regression model, cannot be computed for multinomial logistic regression models. The pseudo r-squared statistics are designed to have similar properties to the true r-squared statistic. Classification and Validation. Crosstabulating observed response categories with predicted categories helps you to determine how well the model identifies consumer preferences.
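For concreteness, the three pseudo R-squared statistics can be computed directly from the log-likelihoods of the intercept-only model (LL0) and the fitted model (LL1), using their standard definitions; the log-likelihood values and sample size below are hypothetical.

```python
import math

# Hypothetical log-likelihoods: LL0 for the intercept-only model,
# LL1 for the fitted model, with n cases.
LL0, LL1, n = -500.0, -420.0, 880

cox_snell  = 1 - math.exp(2 * (LL0 - LL1) / n)   # 1 - (L0/L1)^(2/n)
nagelkerke = cox_snell / (1 - math.exp(2 * LL0 / n))  # rescaled to max of 1
mcfadden   = 1 - LL1 / LL0                        # 1 - LL1/LL0

print(round(cox_snell, 3), round(nagelkerke, 3), round(mcfadden, 3))
```

Note the ordering: Nagelkerke's statistic is Cox and Snell's divided by its maximum attainable value, so it is always at least as large.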
Ordinal Regression
Ordinal Regression allows you to model the dependence of a polytomous ordinal response on a set of predictors, which can be factors or covariates. The design of Ordinal Regression is based on the methodology of McCullagh (1980, 1998), and the procedure is referred to as PLUM in the syntax. Standard linear regression analysis involves minimizing the sum-of-squared differences between a response (dependent) variable and a weighted combination of predictor (independent) variables. The estimated coefficients reflect how changes in the predictors affect the response. The response is assumed to be numerical, in the sense that changes in the level of the response are equivalent throughout the range of the response. For example, the difference in height between a person who is 150 cm tall and a person who is 140 cm tall is 10 cm, which has the same meaning as the difference in height between a person who is 210 cm tall and a person who is 200 cm tall. These relationships do not necessarily hold for ordinal variables, in which the choice and number of response categories can be quite arbitrary. Example. Ordinal Regression could be used to study patient reaction to drug dosage. The possible reactions may be classified as none, mild, moderate, or severe. The difference between a mild and moderate reaction is difficult or impossible to quantify and is based on perception. Moreover, the difference between a mild and moderate response may be greater or less than the difference between a moderate and severe response. Statistics and plots. 
Observed and expected frequencies and cumulative frequencies, Pearson residuals for frequencies and cumulative frequencies, observed and expected probabilities, observed and expected cumulative probabilities of each response category by covariate pattern, asymptotic correlation and covariance matrices of parameter estimates, Pearson's chi-square and likelihood-ratio chi-square, goodness-of-fit statistics, iteration history, test of parallel lines assumption, parameter estimates, standard errors, confidence intervals, and Cox and Snell's, Nagelkerke's, and McFadden's R2 statistics.
Maximum iterations. Specify a non-negative integer. If 0 is specified, the procedure returns the initial estimates.
Maximum step-halving. Specify a positive integer.
Log-likelihood convergence. The algorithm stops if the absolute or relative change in the log-likelihood is less than this value. The criterion is not used if 0 is specified.
Parameter convergence. The algorithm stops if the absolute or relative change in each of the parameter estimates is less than this value. The criterion is not used if 0 is specified.
Confidence interval. Specify a value greater than or equal to 0 and less than 100.
Delta. The value added to zero cell frequencies. Specify a non-negative value less than 1.
Singularity tolerance. Used for checking for highly dependent predictors. Select a value from the list of options.
Link function. The link function is a transformation of the cumulative probabilities that allows estimation of the model. Five link functions are available, summarized in the following table.

Function | Form | Typical application
Logit | log( x / (1−x) ) | Evenly distributed categories
Complementary log-log | log(−log(1−x)) | Higher categories more probable
Negative log-log | −log(−log(x)) | Lower categories more probable
Probit | Φ−1(x) | Latent variable is normally distributed
Cauchit (inverse Cauchy) | tan(π(x−0.5)) | Latent variable has many extreme values
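The five link functions in the table can be written out directly; the short Python sketch below evaluates each one and checks that it is increasing in the cumulative probability x, as a transform of cumulative probabilities must be.

```python
import numpy as np
from scipy.stats import norm

# The link functions from the table, each applied to a cumulative
# probability x in (0, 1).
links = {
    "logit":   lambda x: np.log(x / (1 - x)),
    "cloglog": lambda x: np.log(-np.log(1 - x)),     # complementary log-log
    "nloglog": lambda x: -np.log(-np.log(x)),        # negative log-log
    "probit":  norm.ppf,                             # inverse standard normal
    "cauchit": lambda x: np.tan(np.pi * (x - 0.5)),  # inverse Cauchy CDF
}

# Every link is strictly increasing on (0, 1).
for name, g in links.items():
    assert g(0.25) < g(0.5) < g(0.75), name
print(sorted(links))
```

The logit, probit, and cauchit links are symmetric about x = 0.5; the two log-log links are mirror images of each other, which is why one suits data with heavier lower categories and the other heavier upper categories.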
The Output dialog box allows you to produce tables for display in the Viewer and save variables to the working file.

Display. Produces tables for:
Print iteration history. The log-likelihood and parameter estimates are printed for the print iteration frequency specified. The first and last iterations are always printed.
Goodness-of-fit statistics. The Pearson and likelihood-ratio chi-square statistics. They are computed based on the classification specified in the variable list.
Summary statistics. Cox and Snell's, Nagelkerke's, and McFadden's R2 statistics.
Parameter estimates. Parameter estimates, standard errors, and confidence intervals.
Asymptotic correlation of parameter estimates. Matrix of parameter estimate correlations.
Asymptotic covariance of parameter estimates. Matrix of parameter estimate covariances.
Cell information. Observed and expected frequencies and cumulative frequencies, Pearson residuals for frequencies and cumulative frequencies, observed and expected probabilities, and observed and expected cumulative probabilities of each response category by covariate pattern. Note that for models with many covariate patterns (for example, models with continuous covariates), this option can generate a very large, unwieldy table.
Test of parallel lines. Test of the hypothesis that the location parameters are equivalent across the levels of the dependent variable. This is available only for the location-only model.

Saved variables. Saves the following variables to the working file:
Estimated response probabilities. Model-estimated probabilities of classifying a factor/covariate pattern into the response categories. There are as many probabilities as the number of response categories.
Predicted category. The response category that has the maximum estimated probability for a factor/covariate pattern.
Predicted category probability. Estimated probability of classifying a factor/covariate pattern into the predicted category. This probability is also the maximum of the estimated probabilities of the factor/covariate pattern.
Actual category probability. Estimated probability of classifying a factor/covariate pattern into the actual category.

Print log-likelihood. Controls the display of the log-likelihood. Including the multinomial constant gives you the full value of the likelihood. To compare your results across products that do not include the constant, you can choose to exclude it.
dependent variable. Each equation gives a predicted probability of being in the corresponding category or any lower category.

Hypothetical distribution of an ordinal dependent variable:

Category            Probability of Membership   Cumulative Probability
Current             0.80                        0.80
30 days past due    0.07                        0.87
60 days past due    0.07                        0.94
90 days past due    0.05                        0.99
Uncollectable       0.01                        1.00

For example, look at the distribution shown in the table. With no predictors in the model, predictions are based only on the overall probabilities of being in each category. The predicted cumulative probability for the first category is 0.80. The prediction for the second category is 0.80 + 0.07 = 0.87. The prediction for the third is 0.80 + 0.07 + 0.07 = 0.94, and so on. The prediction for the last category is always 1.0, since all cases must be in either the last category or a lower category. Because of this, the prediction equation for the last category is not needed.

The Ordinal Regression Model

Generalized linear models are a very powerful class of models that can be used to answer a wide range of statistical questions. The basic form of a generalized linear model is shown in the following equation.
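The displayed equation did not survive extraction. In the standard cumulative link (McCullagh) formulation that the Ordinal Regression procedure uses, the basic model is (a reconstruction, not the original display):

```latex
\operatorname{link}(\gamma_j) \;=\; \theta_j \;-\; \bigl(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\bigr)
```

where γ_j is the cumulative probability of being in category j or any lower category, θ_j is the threshold for category j, and β_1, …, β_p are the coefficients of the predictors x_1, …, x_p.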
The Ordinal Regression Model There are several important things to notice here. The model is based on the notion that there is some latent continuous outcome variable, and that the ordinal outcome variable arises from discretizing the underlying continuum into ordered groups. The cutoff values that define the categories are estimated by the thresholds. In some cases, there is good theoretical justification for assuming such an underlying distribution. However, even in cases in which there is no theoretical concept that links to the latent variable, the model can still perform quite well and give valid results. The thresholds or constants in the model (corresponding to the intercept in linear regression models) depend only on which category's probability is being predicted. Values of the predictor (independent) variables do not affect this part of the model. The prediction part of the model depends only on the predictors and is independent of the outcome category. These first two properties imply that the results will be a set of parallel lines or planes, one for each category of the outcome variable. Rather than predicting the actual cumulative probabilities, the model predicts a function of those values. This function is called the link function, and you choose the
form of the link function when you build the model. This allows you to choose a link function based on the problem under consideration to optimize your results. Several link functions are available in the Ordinal Regression procedure. As you can see, these are very powerful and general models. Of course, there is also a bit more to keep track of here than in a typical linear regression model. There are three major components in an ordinal regression model:

Location component. The portion of the equation shown above that includes the coefficients and predictor variables is called the location component of the model. The location is the "meat" of the model. It uses the predictor variables to calculate predicted probabilities of membership in the categories for each case.

Scale component. The scale component is an optional modification to the basic model to account for differences in variability for different values of the predictor variables. For example, if men have more variability than women in their account status values, using a scale component to account for this may improve your model. The model with a scale component follows the form shown in this equation.
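The referenced equation was lost in extraction. The standard form of the cumulative link model with a scale component (again a reconstruction, not the original display) divides the location part by a scale term:

```latex
\operatorname{link}(\gamma_j) \;=\; \frac{\theta_j - \bigl(\beta_1 x_1 + \cdots + \beta_p x_p\bigr)}{e^{\,\tau_1 z_1 + \cdots + \tau_m z_m}}
```

where z_1, …, z_m are the predictors of the scale component (which may overlap with the location predictors) and τ_1, …, τ_m are the scale coefficients.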
Link function. The link function is a transformation of the cumulative probabilities that allows estimation of the model. Five link functions are available, summarized in the following table.

Function                   Form                      Typical application
Logit                      log( x / (1 − x) )        Evenly distributed categories
Complementary log-log      log( −log(1 − x) )        Higher categories more probable
Negative log-log           −log( −log(x) )           Lower categories more probable
Probit                     Φ⁻¹(x)                    Latent variable is normally distributed
Cauchit (inverse Cauchy)   tan( π(x − 0.5) )         Latent variable has many extreme values
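As a concrete sketch, the five link transformations in the table can be written out directly. The dictionary keys here are illustrative names, not identifiers used by the procedure:

```python
import math
from statistics import NormalDist

# Each link maps a cumulative probability x in (0, 1) onto the whole real line.
link_functions = {
    "logit":   lambda x: math.log(x / (1 - x)),
    "cloglog": lambda x: math.log(-math.log(1 - x)),   # complementary log-log
    "nloglog": lambda x: -math.log(-math.log(x)),      # negative log-log
    "probit":  lambda x: NormalDist().inv_cdf(x),      # inverse standard normal CDF
    "cauchit": lambda x: math.tan(math.pi * (x - 0.5)),
}

# The logit, probit, and cauchit links are symmetric and map x = 0.5 to 0;
# the two log-log links are asymmetric, which is why one tail's categories
# being more probable suggests one or the other.
```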
Using Ordinal Regression to Build a Credit Scoring Model

A creditor wants to be able to determine whether an applicant is a good credit risk, given various financial and personal characteristics, using information from their customer database. The outcome (dependent) variable is account status, with five ordinal levels: no debt history, no current debt, debt payments current, debt payments past due, and critical account. Potential predictors consist of various financial and personal characteristics of applicants, including age, number of credits at the bank, housing type, checking account status, and so on. This information is collected in german_credit.sav. Use Ordinal Regression to build a model for scoring applicants.

Constructing a Model

Constructing your initial ordinal regression model entails several decisions. First, of course, you need to identify the ordinal outcome variable. Then, you need to decide which predictors to use for the location component of the model. Next, you need to decide whether to use a scale component and, if you do, what predictors to use for it. Finally, you need to decide which link function best fits your research question and the structure of the data.

Identifying the Outcome Variable

In most cases, you will already have a specific target variable in mind by the time you begin building an ordinal regression model. After all, the reason you use an ordinal regression model is that you know you want to predict an ordinal outcome. In this example, the ordinal outcome is Account status, with five categories: No debt history, No current debt, Payments current, Payments delayed, and Critical account. Note that this particular ordering may not, in fact, be the best possible ordering of the outcomes. You can easily argue that a known customer with no current debt, or with
payments current, is a better credit risk than a customer with no known credit history. See the discussion of the Test of Parallel Lines for more on this issue.

Choosing Predictors for the Location Model

The process of choosing predictors for the location component of the model is similar to the process of selecting predictors in a linear regression model. You should take both theoretical and empirical considerations into account in selecting predictors. Ideally, your model would include all of the important predictors and none of the others. In practice, you often don't know exactly which predictors will prove to be important until you build the model. In that case, it's usually better to start off by including all of the predictors that you think might be important. If you discover that some of those predictors seem not to be helpful in the model, you can remove them and reestimate the model. In this case, previous experience and some preliminary exploratory analysis have identified five likely predictors: age, duration of loan, number of credits at the bank, other installment debts, and housing type. You will include these predictors in the initial analysis and then evaluate the importance of each predictor. Number of credits, other installment debts, and housing type are categorical predictors, entered as factors in the model. Age and duration of loan are continuous predictors, entered as covariates in the model.

Scale Component

The next decision has two stages. The first decision is whether to include a scale component in the model at all. In many cases, the scale component will not be necessary, and the location-only model will provide a good summary of the data. In the interests of keeping things simple, it's usually best to start with a location-only model, and add a scale component only if there is evidence that the location-only model is inadequate for your data.
Following this philosophy, you will begin with a location-only model, and after estimating the model, decide whether a scale component is warranted.
Summary: Using the Model to Make Predictions

Because the model attempts to predict cumulative probabilities rather than category membership, two steps are required to get predicted categories. First, for each case, the probabilities must be estimated for each category. Second, those probabilities must be used to select the most likely outcome category for each case. The probabilities themselves are estimated by using the predictor values for a case in the model equations and taking the inverse of the link function. The result is the cumulative probability for each group, conditional on the pattern of predictor values for the case. The probabilities for individual categories can then be derived by taking the differences of the cumulative probabilities for the groups in order. In other words, the probability for the first category is the first cumulative probability; the probability for the second category is
the second cumulative probability minus the first; the probability for the third category is the third cumulative probability minus the second; and so on. For each case, the predicted outcome category is simply the category with the highest probability, given the pattern of predictor values for that case. For example, suppose you have an applicant who wants a 48-month loan (duration), is 22 years old (age), has one credit with the bank (numcred), has no other installment debt (othnstal), and owns her home (housng). Inserting these values into the prediction equations, this applicant has predicted values of -2.78, -1.95, 0.63, and 0.97. (Remember that there is one equation for each category except the last.) Taking the inverse of the complementary log-log link function gives the cumulative probabilities of 0.06, 0.13, 0.85, and 0.93 (and, of course, 1.0 for the last category). Taking differences gives the following individual category probabilities: category 1: 0.06; category 2: 0.13 − 0.06 = 0.07; category 3: 0.85 − 0.13 = 0.72; category 4: 0.93 − 0.85 = 0.08; and category 5: 1.0 − 0.93 = 0.07. Clearly, category 3 (debt payments current) is the most likely category for this case according to the model, with a predicted probability of 0.72. Thus, you would predict that this applicant would keep her payments current and the account would not become critical.
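The arithmetic of this walk-through can be sketched in a few lines. The linear-predictor values come from the example above; the helper name is illustrative:

```python
import math

def inv_cloglog(eta):
    """Inverse of the complementary log-log link: gamma = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))

# Predicted values from the worked example, one per category except the last
etas = [-2.78, -1.95, 0.63, 0.97]

# Cumulative probabilities for the first four categories, plus 1.0 for the last
cum = [inv_cloglog(eta) for eta in etas] + [1.0]

# Individual category probabilities are first differences of the cumulatives
probs = [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]

# The predicted category is the one with the highest probability
predicted = probs.index(max(probs)) + 1  # category 3: debt payments current
```

Note that the text's 0.85 − 0.13 = 0.72 uses rounded cumulative probabilities; the unrounded difference is about 0.71, which is why software output may differ slightly from hand calculation.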