18749-001
SPSS v11.5; Clementine v7.0; AnswerTree 3.1; DecisionTime 1.1 Revised 9/26/2002 ss/mr
For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412. Tel: (312) 651-3000. Fax: (312) 651-3668.

SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

TableLook is a trademark of SPSS Inc. Windows is a registered trademark of Microsoft Corporation. DataDirect, DataDirect Connect, INTERSOLV, and SequeLink are registered trademarks of MERANT Solutions Inc. Portions of this product were created using LEADTOOLS 1991-2000, LEAD Technologies, Inc. ALL RIGHTS RESERVED. LEAD, LEADTOOLS, and LEADVIEW are registered trademarks of LEAD Technologies, Inc. Portions of this product were based on the work of the FreeType Team (http://www.freetype.org).

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks or registered trademarks of their respective companies in the United States and other countries.

Data Mining: Modeling
Copyright 2002 by SPSS Inc. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.
Chapter 1 Introduction
Topics:
INTRODUCTION
MODEL OVERVIEW
VALIDATION
Introduction 1 - 1
INTRODUCTION
This course focuses on the modeling stage of the data mining process. It will compare and review the analytic methods commonly used for data mining and illustrate these methods using SPSS software (SPSS, AnswerTree, DecisionTime, and Clementine). The course assumes that a business question has been formulated and that relevant data have been collected, organized, checked, and prepared. In short, it assumes that all the time-consuming, preparatory work has been completed and you are at the modeling stage of your project. For more details concerning what should be done during the earlier stages in a data mining project, see the SPSS Data Mining: Overview and Data Mining: Data Understanding and Data Preparation courses.

This chapter serves as a road map for the rest of the course. We try to place the various methods discussed within a framework and give you a sense of when to use which methods. The unifying theme is data mining, and we discuss in detail the analytic techniques most often used to support these efforts. The course emphasizes the practical issues of setting up, running, and interpreting the results of statistical and machine learning analyses. It assumes you have, or will have, some business questions that require analysis, and that you know what to do with the results once you have them.

There are choices regarding specific methods with several of these techniques, and the recommendations we make are based on what is known from properties of the methods, Monte Carlo simulations, or empirical work. You should be aware from the start that in most cases there is not a single method that will definitely yield the best results. However, in the chapters that follow detailing the specific methods, we have sections that list research projects for which the method is appropriate, features and limitations of the method, and comments concerning model deployment. These should prove of some use when you must decide on the method to apply to your problem.
Finally, the approach is practical, not mathematical. Relatively few equations are presented and references are given for those who would like a more rigorous review of the techniques. Also, our goal is to provide you with a good sense of the properties of each method and how it is used and interpreted. The course does not strive for exhaustive detail. Entire books have been written on a topic we cover in a single chapter, and we are trying to present the main issues a practitioner will face. Analyses are run using different SPSS products. However, the emphasis in this course is on understanding the characteristics of the methods and being able to interpret the results. Thus we will not discuss data definition and general program operation issues. We do present instructions to perform the analyses, but more information is needed than is presented here to master the software programs used. To provide this depth, SPSS offers operational courses for the products used in this course.
MODEL OVERVIEW
In this section we provide brief descriptions and comparisons of the data mining analysis and modeling methods that will be discussed in this course. Recall from your statistics courses that inferential statistics have two key features. They require that you specify a hypothesis to test (such as that more satisfied customers will be more likely to make additional purchases), and they allow you to make inferences back to the population from the particular sample data you are studying. Because of these features, it isn't formally necessary to create training and validation (test) data sets when using inferential statistics. The validation portion of the analysis is done with standard test statistics, such as F, t, or chi-square, providing a probability of the hypothesis under test being correct. However, given the accepted data-mining methodology, you may decide to create a validation data set even when using inferential techniques. There is generally no harm in doing so, especially with a sufficient amount of data, where the training and validation sets can both be reasonably large. Here is a listing of some inferential statistical methods commonly used in data mining projects. We will not define them here but leave that for a later section. The type of variables each requires is also listed.
GENERAL TECHNIQUE (Inferential Statistics)    PREDICTOR VARIABLES
Discriminant Analysis                         Continuous or dummies*
Linear Regression (and ANOVA)                 Continuous or dummies
Logistic and Multinomial Regression           Continuous or dummies
Time Series Analysis                          Continuous or dummies
(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region [north, south, east and west], when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code were north and 0 otherwise.)

As is common for inferential statistics, all of these techniques are used to make predictions of a dependent variable. Some have been used for many years, such as linear regression or discriminant analysis. Inferential statistical techniques often make stringent assumptions about the data, such as normality, uncorrelated errors, or homogeneity of variance. They are more restrictive
than non-inferential techniques, which can be a disadvantage. However, they provide rigorous tests of hypotheses unavailable with more automated methods of analysis. Although these methods are not always mentioned in many data mining books and articles, you need to be aware of them because they are often exactly what is necessary to answer a particular question. For instance, to predict the amount of revenue, in dollars, that a new customer is likely to provide in the next two years, linear regression could be a natural choice, depending on the available predictor variables and the nature of the relationships.
GENERAL TECHNIQUE (Data Mining)
Decision Trees (Rule Induction)
Neural Networks
The key difference for most users between inferential and non-inferential techniques is whether hypotheses need to be specified beforehand. In the latter methods, this is not normally required, as each is semi- or completely automated as it searches for a model. Nonetheless, in all non-inferential techniques, you clearly need to specify a list of variables as inputs to the procedure, and you may have to specify other details, depending on the exact method. As we discussed in the previous courses in the SPSS Data Mining sequence, data mining is not a mindless activity; even here, you need a plan of approach (a research design) to use these techniques wisely. Notice that the inferential statistical methods are not distinguished from the data mining methods in terms of the types of variables they allow. Instead, data mining methods, such as decision trees and neural networks, are distinguished by making fewer assumptions about the data (for example, normality of errors). In many instances both classes of methods can be applied to a given prediction problem. Some data mining methods do not involve prediction, but instead search for groupings or associations in the data. Several of these methods are listed below along with the types of analysis you can do with them.
GENERAL TECHNIQUE        ANALYSIS
Cluster Analysis         Uses continuous or categorical variables to create cluster memberships; no predefined outcome variable.
Market Basket Analysis   Uses categorical variables to create associations between categories; no outcome variable required.
Sequence Detection       Uses categorical variables in data sorted in time order to discover sequences in data; no outcome variable required, but there may be interest in specific outcomes.
Finally, discussions of data mining mention the tasks of classification, affinity analysis, prediction, or segmentation. Below we group the data mining techniques within these categories.

Affinity/Association: These methods attempt to find items that are closely associated in a data file, with the archetypal case being shopping patterns of consumers. Market basket analysis and sequence detection fall into this category.

Classification/Segmentation: These methods attempt to classify customers into discrete categories that have already been defined (e.g., customers who stay and those who leave), based on a set of predictors. Several methods are available, including decision trees, neural networks, and sequence detection (when data are time structured). Note that logistic regression and discriminant analysis are inferential techniques that accomplish this same task.

Clustering/Segmentation: Notice that we have repeated the word segmentation. This is because segmentation is used in two senses in data mining. Its second meaning is to create natural clusters of objects (without using an outcome variable) that are similar on various characteristics. Cluster analysis and Kohonen networks accomplish this task.

Prediction/Estimation: These methods predict a continuous outcome variable, as opposed to classification methods, which work with discrete outcomes. Neural networks fall into this group. Decision tree methods can work with continuous predictors, but they split them into discrete ranges as the tree is built. Memory-based reasoning techniques (not covered in this course) can also predict continuous outcomes. Regression is the inferential method likely to be used for this purpose.

The descriptions above are quite simple and hide a wealth of detail that we will consider as we review the techniques. More than one specific method is usually available for a general technique.
So, to cluster data, K-means clustering, Two-step clustering, and Kohonen networks (a form of neural network) could be used, with the choice of which to use depending on the type of data, the availability of software, the ease of understanding desired, the speed of processing, and so forth.
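To give a flavor of one of these choices, the core of the K-means procedure is simple enough to sketch. The Python fragment below is illustrative only: the data, the random seed, and the choice of k = 2 are made up for the example. Each record is assigned to its nearest cluster center, and each center is then moved to the mean of its assigned records.

```python
import numpy as np

# Illustrative K-means sketch: two simulated, well-separated groups of records,
# clustered with k = 2. The data and seed are hypothetical.
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0.0, 0.5, (50, 2)),    # group A near (0, 0)
                 rng.normal(5.0, 0.5, (50, 2))])   # group B near (5, 5)

centers = pts[[0, 50]]                 # one starting center from each region
for _ in range(20):                    # a handful of update iterations
    # assign each record to its nearest center...
    dists = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # ...then move each center to the mean of its assigned records
    centers = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
```

Production tools (Clementine, SPSS) wrap this loop with distance options, convergence criteria, and missing-value handling, but the assign-then-update cycle is the heart of the method.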
VALIDATION
Since most data-mining methods do not depend on specific data distribution assumptions (for example, normality of errors) to draw inferences from the sample to the population, validation is strongly recommended. It is usually done by fitting the model to a portion of the data (called the Training data) and then applying the predictions to, and evaluating the results with, the other portion of the data (called the Validation data; note that some authors refer to this as Test data, but as we will see, Test data has a specific meaning in neural network estimation). In this way, the validity of the model is established by demonstrating that it applies to (fits) data independent of that used to derive the model. Statisticians often recommend such validation for statistical models, but it is crucial for more general (less distribution-bound) data mining techniques. There are several methods of performing validation.
Holdout Sample
This method was described above. The data set is split into two parts: training and validation files. For large files it might be a 50/50 split, while for smaller files more records are typically placed in the training set. Modeling is performed on the training data, but fit evaluation is done on the separate validation data.
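The mechanics of a holdout split are straightforward. As a rough sketch in Python (with simulated data; the split is something the software normally does for you, not something you must code by hand), a 50/50 split might look like:

```python
import numpy as np

# Holdout-sample sketch: shuffle the record indices, then split 50/50.
# The data here are simulated purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                 # 1,000 records, 3 predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=1000)

idx = rng.permutation(len(X))                  # random order of record numbers
cut = len(X) // 2                              # 50/50 split point
train_idx, valid_idx = idx[:cut], idx[cut:]

X_train, y_train = X[train_idx], y[train_idx]  # used to fit the model
X_valid, y_valid = X[valid_idx], y[valid_idx]  # used only to evaluate fit
```

For a smaller file you would simply move the split point so that more records fall in the training set.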
N-Fold Validation
If the data file is small, reserving a holdout sample may not be feasible (the training sample may be too small to obtain stable results). In this case n-fold validation may be done. Here the data set is divided into a number of groups of equal sample size. Let's use 10 groups for the example. The first group is held out from the analysis, which is based on the other 9 groups (or 9/10ths of the data), and is used as the validation sample. Next the second group is held out from the analysis, again based on the other 9 groups, and is used as the validation sample. This continues until each of the 10 groups has served as a validation sample. The validation results from each of these samples are then pooled. This has the advantage of providing a form of validation in the presence of small samples, but since any given data record is used in 9 of the 10 models, there is less than complete independence. A second problem is that since 10 models are run there is no single model result (there are 10). For this reason, n-fold validation is generally used to estimate the fit or accuracy of a model with small data files and not to produce the model coefficients or rules. Some procedures extend this principle to base the model on all but one observation (using fast algorithms), keeping a single record as the hold-out. Generally speaking, in terms of computing resources, only closed-form models that involve no iteration (like regression or discriminant analysis) can afford this.
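The looping logic just described can be sketched in a few lines of Python. In the fragment below, ordinary least squares stands in for whatever model is being validated, and the fold count of 10 follows the example in the text; the data and helper names are illustrative.

```python
import numpy as np

# Minimal n-fold validation sketch (n = 10). Least squares is a stand-in for
# any modeling procedure; mse is the fit measure pooled across folds.
def fit(X, y):
    design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def mse(coefs, X, y):
    pred = np.column_stack([np.ones(len(X)), X]) @ coefs
    return np.mean((y - pred) ** 2)

def n_fold_validation(X, y, n_folds=10):
    folds = np.array_split(np.arange(len(X)), n_folds)
    scores = []
    for k in range(n_folds):
        valid = folds[k]                                    # group held out
        train = np.concatenate(folds[:k] + folds[k + 1:])   # other 9/10ths
        scores.append(mse(fit(X[train], y[train]), X[valid], y[valid]))
    return np.mean(scores)  # pooled fit estimate; note there is no single model
```

The return value is an accuracy estimate only: as the text notes, the 10 fitted models themselves are discarded rather than deployed.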
Domain Validation
Do the model results make sense within the business area being studied? Here a domain expert (someone who understands the business and data) examines the model results to determine if they make sense, and to decide if they are interesting and useful, as opposed to obvious and trivial.
Chapter 2 Statistical Data Mining Techniques
INTRODUCTION
In this chapter we consider the various inferential statistical techniques that are commonly used in data mining. We include a detailed example of each, as well as discussions about typical sample sizes, whether the method can be automated, and how easily the model can be understood and deployed.

As you work on the tasks we've cited in the last chapter, you should also be thinking about what data mining techniques to use to answer those questions. Research isn't done step-by-step, in some predefined order, as we are taught in textbooks. Instead, all phases of a data mining project should be under review early in the process. This is especially critical for the data mining techniques you plan to employ, for at least three reasons. First, each data mining technique is suitable for only some types of analysis, but not all. Thus the research question you have defined can't necessarily be answered by just any technique. So if you want to answer a question that requires, say, market basket analysis (discussed in Chapter 3), and you have little expertise in this procedure, you'll need to prepare ahead of time, conceivably even acquire additional software, so you are ready to begin analysis when the data are ready. Second, some techniques require more data than others do, or data of a particular kind, so you will need to have these conditions in mind when you collect the data. And third, some techniques are more easily understandable than others, and the models more readily retrained if the environment changes rapidly, both of which might affect your choice of which technique to use.

In this chapter we provide several different frameworks or classification schemes by which to understand and conceptualize the various inferential data mining techniques available in SPSS and other software. Examples of each technique will be given, including research questions or projects suitable for that type of analysis.
Although details for running various analyses are given in the chapter, the emphasis is on setting up the basic analysis and interpreting the results. For this reason, not all available options and variations will be covered in this class. Also, such steps as data definition and data exploration are assumed to be completed prior to the modeling stage. In short, the goal of the chapter is not to exhaustively cover each data mining procedure in SPSS, but to present and discuss the core features needed for most analyses. (For more details on specific procedures, you may attend separate SPSS, AnswerTree, DecisionTime, and Clementine application courses.) Instead, we provide an overview of these methods with enough detail for you to begin to make an informed choice about which method will be appropriate for your own data mining projects, to set up a typical analysis, and to interpret the results.
STATISTICAL TECHNIQUES
Recall that inferential statistics have two key features. They require that you specify a hypothesis to test (such as that more satisfied customers will be more likely to make additional purchases), and they allow you to make inferences back to the population from the particular sample data you are studying. Below is the listing, from Chapter 1, of the inferential methods commonly used in data mining projects. We will define them in later sections of this chapter. The type of variables each requires is also listed.

GENERAL TECHNIQUE                     PREDICTOR VARIABLES     OUTCOME VARIABLE
Discriminant Analysis                 Continuous or dummies*  Categorical
Linear Regression (and ANOVA)         Continuous or dummies   Continuous
Logistic and Multinomial Regression   Continuous or dummies   Categorical
Time Series Analysis                  Continuous or dummies   Continuous
(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region (north, south, east and west), when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code was north and 0 otherwise.)

As we discuss the techniques, we also provide information on whether they can be automated or not, their ease of understanding, and the typical size of data files, plus other important traits. After this brief glimpse at the various techniques, we turn next to a short discussion of each, including examples of research questions it can answer and where each can be found, if available, in SPSS software. You are probably already familiar with several of the inferential statistics methods we consider here. Our emphasis is on practical use of the techniques, not on the theory underlying each one.
LINEAR REGRESSION
Linear regression is a method familiar to just about everyone these days. It is the classic linear model technique, and is used to predict an outcome variable that is interval or ratio with a set of predictors that are also interval or ratio. In addition, categorical predictor variables can be included by creating dummy variables. Linear regression is available in SPSS under the Analyze..Regression menu and is also available in Clementine.

Linear regression, of course, assumes that the data can be modeled with a linear relationship. As illustration, Figure 2.1 exhibits a scatterplot depicting the relationship between the number of previous late payments for bills and the credit risk of defaulting on a new loan. Superimposed on the plot is the best-fit regression line. The plot may look a bit unusual because of the use of sunflowers, which are used to represent the number of cases at a point. Since credit risk and late payments are measured as whole integers, the number of discrete points here is relatively limited given the large file size (over 2,000 cases).

Figure 2.1 Scatterplot of Late Payments and Credit Risk
Although there is a lot of spread around the regression line, it is clear that there is a trend in the data such that more late payments are associated with a greater credit risk. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form. Thus most users of linear regression use the numeric output.

Statistical Data Mining Techniques 2 - 4
Multiple regression represents a direct extension of simple regression. Instead of a single predictor variable (Y = B*X + A), multiple regression allows for more than one independent variable in the prediction equation:

Y = B1*X1 + B2*X2 + B3*X3 + . . . + A

While we are limited in the number of dimensions we can view in a single plot (SPSS can build a 3-dimensional scatterplot), the regression equation allows for many independent variables. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent variables in predicting the dependent measure.
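To make the equation concrete, here is a small Python sketch that estimates B1, B2, B3, and A by ordinary least squares. The data are simulated and the "true" coefficient values are invented for the example; SPSS performs this estimation for you, so the fragment is purely illustrative.

```python
import numpy as np

# Estimate Y = B1*X1 + B2*X2 + B3*X3 + A by least squares on simulated data.
# The true coefficients (2.0, -1.0, 0.5) and intercept (4.0) are made up.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 + rng.normal(size=500)

design = np.column_stack([X, np.ones(len(X))])   # the final column estimates A
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
b1, b2, b3, a = coefs
```

With 500 cases and modest noise, the estimates land close to the values used to generate the data, which is the behavior the fit measures discussed below are meant to quantify.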
Assumptions
Regression is usually performed on data for which the dependent and independent variables are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). SPSS can provide summaries and plots useful in evaluating these latter issues. One special case of the assumptions involves the interval scale nature of the independent variable(s). A variable coded as a dichotomy (say
0 and 1) can technically be considered as an interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a variable's only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus dichotomous variables (e.g., gender) can be used as predictor variables in regression. This also permits the use of categorical predictor variables if they are converted into a series of dichotomous variables; this technique is called dummy coding and is considered in most regression texts (Draper and Smith (1998); Cohen, Cohen, West and Aiken (2002)).
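Outside of SPSS, the same dummy coding can be produced programmatically. As a sketch in Python with pandas (the field name and values echo the region example from earlier; nothing here is specific to the course data), one category is dropped to serve as the reference so the dummies are not redundant:

```python
import pandas as pd

# Dummy coding a four-category region field for use as regression predictors.
# drop_first=True omits one category (the reference) to avoid redundancy.
df = pd.DataFrame({"region": ["north", "south", "east", "west", "north"]})
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
df = pd.concat([df, dummies], axis=1)
```

With four categories, three dummy fields result; a record's membership in the dropped (reference) category is indicated by zeros on all three.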
Click Edit..Options
Click the General tab
Click the Display names option button in the Variable Lists section
Click the Alphabetical option button in the Variable Lists section
Click OK

Also, files are assumed to be located in the c:\Train\DM_Model directory. They can be copied from the floppy accompanying this guide (or from the CD-ROM containing this guide). If you are running SPSS Server (you can check by clicking File..Switch Server from within SPSS), then files used with SPSS should be copied to a directory that can be accessed from (mapped by) the server.

To develop a regression equation predicting claims amount based on hospital length of stay, severity of illness group, and age using SPSS:

Click File..Open..Data (switch to the c:\Train\DM_Model directory if necessary)
Double click on InsClaims
Click Analyze..Regression

This chapter will discuss two choices: Linear regression, which performs simple and multiple linear regression, and Logistic regression (Binary). Curve Estimation will invoke the Curvefit procedure, which can apply up to 16 different functions relating two variables. Binary logistic regression is used when the dependent variable is a dichotomy (for example, when predicting whether a prospective customer makes a purchase or not). Multinomial logistic regression is appropriate when you have a categorical dependent variable with more than two possible values. Ordinal regression is appropriate if the outcome variable is ordinal (rank ordered). Probit analysis, nonlinear regression, weight estimation (used for weighted least squares analysis), 2-Stage least squares, and optimal scaling are not generally used for data mining and so will not be discussed further here.
We will select Linear to perform multiple linear regression, then specify claim as the dependent variable and age, asg (severity level), and length of stay (los) as the independent variables.

Click Linear from the Regression menu
Move claim to the Dependent: list box
Move age, asg and los to the Independent(s): list box
Since our goal is to identify exceptions to the regression model, we will ask for residual plots and information about cases with large residuals. Also, the Regression dialog box allows many specifications; here we will discuss the most important features.

Note on Stepwise Regression
With such a small number of predictor variables, we will simply add them all into the model. However, in the more common situation of many predictor variables (most insurance claims forms would contain far more information), a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of the business expert (here perhaps a medical expert). In addition, an option may be chosen to select, from a larger set of independent variables, those that in some statistical sense are the best predictors (Stepwise method).

The Selection Variable option permits cross-validation of regression results. Only cases whose values meet the rule specified for a selection variable will be used in the regression analysis, yet the resulting prediction equation will be applied to the other cases. Thus you can evaluate the regression on cases not used in the analysis, or apply the equation derived from one subgroup of your data to other groups. The importance of such validation in data mining is a repeated theme in this course. While SPSS will present standard regression output by default, many additional (and some of them quite technical) statistics can be requested via the Statistics dialog box. The
Plots dialog box is used to generate various diagnostic plots used in regression, including a residual plot in which we have interest. The Save dialog box permits you to add new variables to the data file containing such statistics as the predicted values from the regression equation, various residuals, and influence measures. We will create these in order to calculate our own percentage deviation field. Finally, the Options dialog box controls the criteria when running stepwise regression and choices in handling missing data (the SPSS Missing Values option provides more sophisticated methods of handling missing values). Note that by default, SPSS excludes a case from regression if it has one or more values missing for the variables used in the analysis.
Residual Plots
While we can run the multiple regression at this point, we will request some diagnostic plots involving residuals and information about outliers. A residual is the difference (signed) between the actual value of the dependent variable and the value predicted by the model. Residuals can be used to identify large errors in prediction or cases poorly fit by the model. By default no residual plots will appear. These options are explained below.

Click the Plots pushbutton
Within the Plots dialog box:
Check Histogram in the Standardized Residual Plots area

Figure 2.4 Regression Plots Dialog Box
The options in the Standardized Residual Plots area of the dialog box all involve plots of standardized residuals. Ordinary residuals are useful if the scale of the dependent variable is meaningful, as it is here (claim amount in dollars). Standardized residuals are helpful if the scale of the dependent variable is not familiar (say a 1 to 10 customer satisfaction scale). By this we mean that it may not be clear to the analyst just what constitutes a large residual: is an over-prediction of 1.5 units a large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals expressed in standard deviation units) are very useful because large prediction errors can be easily identified. If the errors follow a normal distribution, then standardized residuals greater than 2 (in absolute value) should occur in about 5% of the cases, and those greater than 3 (in absolute value) should happen in less than 1% of the cases. Thus standardized residuals provide a norm against which one can judge what constitutes a large residual. Recall that the F and t tests in regression assume that the residuals follow a normal distribution.

Click Continue

Next we will look at the Statistics dialog box, which contains options concerning Casewise Diagnostics. When this option is checked, Regression will list information about all cases whose standardized residuals are more than 3 standard deviations from the line. This outlier criterion is under your control.

Click the Statistics pushbutton
Click the Casewise diagnostics check box in the Residuals area

Figure 2.5 Regression Statistics Dialog Box
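The standardized-residual rule of thumb above is easy to verify numerically. In the Python sketch below, the residuals are simulated normal errors on a hypothetical dollar scale; dividing by their standard deviation puts them in standard units, after which the 2- and 3-standard-deviation criteria apply directly:

```python
import numpy as np

# Simulated normal residuals on a dollar scale (the values are hypothetical).
rng = np.random.default_rng(7)
residuals = rng.normal(loc=0.0, scale=250.0, size=2000)

std_resid = residuals / residuals.std()        # standardized residuals
share_over_2 = np.mean(np.abs(std_resid) > 2)  # expect roughly 5% of cases
outliers = np.abs(std_resid) > 3               # the casewise default criterion
```

With 2,000 simulated cases, around 5% of standardized residuals exceed 2 in absolute value and only a handful exceed 3, matching the norms quoted above.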
By requesting this option we will obtain a listing of those records that the model predicts poorly. When dealing with a very large data file, which may have many outliers, such a
list is cumbersome. It would be more efficient to save the residual value (standardized or not) as a new field, then select the large residuals and write these cases to a new file or add a flag field to the main database. We create these new fields below.

Click Continue
Click the Save pushbutton
Click the check boxes for Unstandardized Predicted Values, and Unstandardized and Standardized Residuals

Figure 2.6 Saving Predicted Values and Errors
Figure 2.7 Model Summary and Overall Significance Tests
After listing the dependent and independent variables (not shown), Regression provides several measures of how well the model fits the data. First is the multiple R, which is a generalization of the correlation coefficient. If there are several independent variables (our situation) then the multiple R represents the unsigned (positive) correlation between the dependent measure and the optimal linear combination of the independent variables. Thus the closer the multiple R is to 1, the better the fit. As mentioned earlier, the r-square measure can be interpreted as the proportion of variance of the dependent measure that can be predicted from the independent variable(s). Here it is about 32%, which is far from perfect prediction, but still substantial. The adjusted r-square represents a technical improvement over the r-square in that it explicitly adjusts for the number of predictor variables, and as such is preferred by many analysts. However, it is a more recently developed statistic and so is not as well known as the r-square. Generally, they are very close in value; in fact, if they differ dramatically in multiple regression, it is a sign that you have used too many predictor variables relative to your sample size, and the adjusted r-square value should be more trusted. In our results, they are very close. While the fit measures indicate how well we can expect to predict the dependent variable or how well the line fits the data, they do not tell whether there is a statistically significant relationship between the dependent and independent variables. The analysis of variance table presents technical summaries (sums of squares and mean square statistics), but here we refer to variation accounted for by the prediction equation. We are interested in determining whether there is a statistically significant (non-zero) linear relation between the dependent variable and the independent variable(s) in the population. 
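The adjustment mentioned above has a simple closed form. The standard formula, which we assume SPSS uses here, is adjusted r-square = 1 - (1 - r-square) * (n - 1) / (n - p - 1), where n is the number of cases and p the number of predictors. A small Python sketch (the 32% fit and 2,000-case file size echo the example, but the exact figures are illustrative):

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Standard adjusted r-square: penalizes fit for the number of predictors."""
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# With a large file the two measures barely differ...
large = adjusted_r_square(r_square=0.32, n_cases=2000, n_predictors=3)

# ...but with too many predictors for the sample size they diverge sharply,
# and the adjusted value is the one to trust.
small = adjusted_r_square(r_square=0.32, n_cases=10, n_predictors=3)
```

The second call illustrates the warning in the text: when the two values differ dramatically, too many predictors have been used relative to the sample size.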
Since our analysis contains three predictor variables, we test whether any linear relation differs
from zero. The significance value accompanying the F test gives us the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as what we obtained if there were no linear relations in the population. The result is highly significant (significance probability less than .0005, or 5 chances in 10,000; the table value is rounded to .000). Now that we have established there is a significant relationship between the claims amount and one or more predictor variables, and obtained fit measures, we turn to interpreting the regression coefficients. Here we are interested in verifying that several expected relationships hold: (1) claims will increase with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will increase with age. Strictly speaking, this step is not necessary in order to identify cases that are exceptional. However, in order to be confident in the model, it should make sense to a domain expert. Since interpretation of regression models can be made directly from the estimated regression coefficients, we turn to those next. Figure 2.8 Estimated Regression Coefficients
The first column contains a list of the independent variables plus the intercept (constant). Although the estimated B coefficients are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each line to determine which independent variables are significantly related to the outcome measure. Since three variables are in the equation, we are testing if there is a linear relationship between each independent variable and the dependent measure after adjusting for the effects of the two other independent variables. Looking at the significance values we see that all three predictors are highly significant (significance values are .004 or less). If any of the variables were not found to be significant, you would typically rerun the regression after removing them. The column labeled B contains the estimated regression coefficients we would use to deploy the model via a prediction equation. The coefficient for length of stay indicates that, on average, each additional day spent in the hospital was associated with a claims increase of about $1,106. The coefficient for admission severity group tells us that each one-unit increase in the severity code is associated with a claims increase of $417. Finally, the age coefficient of −33 suggests that claims decrease, on average, by $33 as age increases one year. This is counterintuitive and should be examined by a domain
expert (here a physician). Perhaps the youngest patients are at greater risk. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results). Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice. The constant or intercept of $3,027 indicates that someone with 0 days in the hospital, in the least severe illness category (0), and at age 0 would be expected to file a claim of $3,027. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but still may be needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed; the assumption is that the same pattern continues. Here it clearly cannot! The Standard Error (of B) column contains standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as a Statistics option). In our example, the regression coefficient for length of stay is $1,106 and the standard error is about $104. Thus we would not be surprised if in the population the true regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000. Betas are standardized regression coefficients and are used to judge the relative importance of each of several independent variables.
They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the independent variables, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claims amount, followed by severity group and age. Betas typically range from −1 to 1, and the further from 0, the more influential the predictor variable. Thus if we wish to predict claims based on length of stay, severity code and age, the formula would use the B coefficients: Predicted Claims = $1,106*(length of stay) + $417*(severity code) − $33*(age) + $3,027.
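The prediction equation above is simple enough to deploy as a scoring function. A minimal sketch follows; the coefficients are the rounded values reported in the text, so scores are approximate, and the function name is illustrative.

```python
# Deploying the regression model from Figure 2.8 as a scoring function.
# Coefficients are the rounded values from the text.
def predict_claim(length_of_stay, severity_code, age):
    return (1106 * length_of_stay
            + 417 * severity_code
            - 33 * age
            + 3027)

# Example: a 5-day stay with severity code 2 for a 50-year-old patient.
claim = predict_claim(5, 2, 50)
```

Because the model is a single linear equation, the same calculation can be pushed into a spreadsheet or a SQL expression against a data warehouse.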
There are two cases for which the claims value is more than three standard deviations from the regression prediction. Both are about $6,000 more than expected from the model. Note that they are 5.5 and 6.1 standard deviations away from the model predictions. These would be the claims to examine more carefully. The case sequence number for these records appears, or an identification field could be substituted (through the Case Labels box within the Linear Regression dialog). Figure 2.10 Histogram of Residuals
This histogram of the standardized residuals presents the overall distribution of the errors. It is clear that all large residuals are positive (meaning the model under-predicted the claims value). Case (record) identification is not available in the histogram, but since the standardized residuals were added to the data file, they can be easily selected and examined.
Each case's deviation from the model (claim − pre_1) is divided by the model prediction (/ pre_1) and converted to a percent (100*).
Click OK
Scroll to the right in the Data Editor window
Extreme values on this percent deviation field can also be used to identify exceptional claims. While we won't pursue it here, a histogram would display the distribution of the deviations, and cases with extreme values could be selected for closer examination. Unusual values could appear at both the high and low ends, with low values indicating the claim was much less than predicted by the model. These might be examined as well, since they might reflect errors or suggest less expensive variations on the treatment. In this section, we offered the search for deviations from a model as a method to identify data errors or possible fraud. It would not detect, of course, fraudulent claims consistent with the model prediction. In actual practice, such models are usually based on a much greater number of predictor variables, but the principles, whether using regression or more complex models such as neural networks, are largely the same.
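The two screening rules used in this section (standardized residuals beyond ±3, and the percent-deviation field) can be sketched generically. The function and variable names below are illustrative, not SPSS output; `pre_1` mirrors the saved prediction variable in the text.

```python
# Flagging exceptional claims two ways: standardized residuals beyond +/-3
# standard deviations, and the percent-deviation field described above.
def percent_deviation(claim, pre_1):
    # 100 * (claim - prediction) / prediction
    return 100 * (claim - pre_1) / pre_1

def flag_outliers(claims, predictions, residual_sd):
    flagged = []
    for i, (c, p) in enumerate(zip(claims, predictions)):
        z = (c - p) / residual_sd          # standardized residual
        if abs(z) > 3:                     # the +/-3 SD screening rule
            flagged.append((i, z, percent_deviation(c, p)))
    return flagged
```

A claim of $10,000 against a prediction of $4,000 with a residual SD of $1,000 is 6 standardized deviations out and 150% above its prediction, so it would be flagged for review.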
Other Features
There is no limit to the size of data files used with linear regression, but just as with discriminant, most uses of regression limit the number of predictors to a manageable number, say under 50 or so. As before, there is then no reason for extremely large file sizes. The use of stepwise regression is quite common. Since this involves selection of a few predictors from a larger set, it is recommended that you validate the results with a validation data set when you use a stepwise method. Although this technique is called linear regression, with the use of suitable transformations of the predictors, it is possible to model non-linear relationships. However, more in-depth knowledge is needed to do this correctly, so if you expect nonlinear relationships to occur in your data, you might consider using neural networks or classification and regression trees, which handle these more readily, if differently.
Model Understanding
Linear regression produces very easily understood models, as we can see from the table in Figure 2.8. As noted, graphical results are less helpful with more than a few predictors, although graphing the error in prediction with other variables can lead to insights about where the model fails.
Model Deployment
Predictions for new cases are made from one equation using the unstandardized regression coefficient estimates. Any convenient software for doing this calculation can be employed, and regression equations can therefore be applied directly to data warehouses, not only to extracted datasets. This makes the model easily deployable.
DISCRIMINANT ANALYSIS
Discriminant analysis, a technique used in market research and credit analysis for many years, is a general linear model method, like linear regression. It is used in situations where you want to build a predictive model of group or category membership, based on linear combinations of predictor variables that are either continuous (age) or categorical variables represented by dummy variables (type of customer). Most of the predictors should be truly interval scale, or else the multivariate normality assumption will be violated. Discriminant is available in SPSS under the Analyze…Classify menu. Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 2.13, which shows one discriminant function derived from two input variables, X and Y, that can be used to predict membership in a dependent variable: Group. The score on the discriminant function separates cases in group 1 from group 2, using the midpoint of the discriminant function (the short line segment). Figure 2.13 Discriminant Function Derived From Two Predictors
Click the Classify pushbutton
Click the Summary table checkbox
Click the Leave-one-out classification checkbox
Figure 2.15 Classification Dialog Box
The Classification dialog box controls the results displayed when the discriminant model is applied to the data. The most useful table does not print out by default (because misclassification summaries require a second data pass), but you can easily request a summary classification table, which reports how well the model predicts the outcome
measure. Without this table you cannot effectively evaluate the discriminant analysis, so you should make a point of asking for it. The "leave-one-out" variation classifies each case based on discriminant coefficients calculated while the case is excluded from the analysis. This method is a form of n-fold validation and provides a classification table that should generalize at least slightly better to other samples. Since we have a relatively small data file, rather than splitting it into training and validation samples, we will use the leave-one-out classification for validation purposes. You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the outcome in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each outcome group. If you know that the sample proportions reflect the distribution of the outcome in the population, then you can instruct Discriminant to make use of this information. For example, if an outcome category is very rare, Discriminant can make use of this fact in its prediction equation. Using the dialog box, priors can be set to the sample sizes, and with syntax you can directly specify the population proportions. In our instance, we don't know what the proportions would be, so we retain the default.
Click Continue to process Classification choices
Click Statistics pushbutton
Click Fisher's checkbox in the Function Coefficients area
Click Unstandardized checkbox in the Function Coefficients area
Figure 2.16 Statistics Dialog Box
Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to deploy the model for future observations (customers). Both sets of coefficients produce the same predictions. If there are only two outcome categories (as is our situation), either is easy to use. If you want to try "what if" scenarios using a spreadsheet, the unstandardized coefficients (since they involve a single equation in the two-outcome
case) would be more convenient to work with. If you run discriminant with more than two outcome categories, then Fisher's coefficients are easier to apply as prediction rules. If you suspect some of the predictors are highly related, you might view the within-groups correlations among the predictor variables to identify highly correlated predictors.
Click Continue to process Statistics requests
Now we are ready to run the stepwise discriminant analysis. The Select pushbutton can be used to have SPSS select part of the data to estimate the discriminant function, and then apply the predictions to the other part (cross-validation). We would use this method of validation in place of the leave-one-out method if our data set were larger. The Save pushbutton will create new variables that contain the group membership predicted from the discriminant function and the associated probabilities. To retain predictions for the training data set, you would use the Save dialog to create these variables.
Click OK to run the analysis
Scroll to the Classification Results table at the bottom of the Viewer window
Figure 2.17 Classification Results Table
Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the outcome. There are two subtables, with Original referring to the training data and Cross-Validated supplying the leave-one-out results. The actual (known) groups constitute the rows and the predicted groups make up the columns of the table. Looking at the Original section, of the 227 people surveyed who said they would not accept the
offering, the discriminant model correctly predicted 157 of them, and so its accuracy is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Thus overall, the discriminant model was accurate in 67.8% of the cases. The Cross-Validated summary is very close (67.3% accurate overall). Is this performance good? If we simply guess the larger group 100% of the time, we would be correct 227 times out of 441 (227 + 214), or about 51.5% of the time. The 67.8% and 67.3% correct figures, while certainly far from perfect accuracy, do far better than guessing. Whether you would accept this figure and review the remaining output, or go back to the drawing board, is largely a function of the level of predictive accuracy required. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.
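The percentages in the classification table can be reconstructed from the counts. The 157 correct refusals are stated in the text; the 142 correct acceptances are implied by the reported 66.4% of 214, so treat that count as derived rather than quoted.

```python
# Reconstructing the accuracy figures for the Figure 2.17 classification
# table. no_correct is stated in the text; yes_correct (142) is implied by
# the reported 66.4% of 214 acceptors.
no_total, no_correct = 227, 157      # would not accept the offering
yes_total, yes_correct = 214, 142    # would accept the offering

no_accuracy = 100 * no_correct / no_total                           # ~69.2%
overall = 100 * (no_correct + yes_correct) / (no_total + yes_total) # ~67.8%
# Baseline: always guess the larger group (the 227 refusals).
baseline = 100 * no_total / (no_total + yes_total)                  # ~51.5%
```

Comparing the overall rate against the always-guess-the-majority baseline is the same "is this performance good?" check made in the text.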
Stepwise Results
Age is entered first, followed by gender and education. A significance test (Wilks' lambda) of between-group differences is performed for the variables at each step. None of the other variables made a significant difference after adjusting for the first three. As an exercise you might rerun the analysis with the additional variables entered and compare the classification results. Figure 2.18 Stepwise Results
This summary is followed by one entitled "Variables in the Analysis" (not shown), which lists the variables included in the discriminant analysis at each step. For the variables selected, tolerance is shown. It measures the proportion of variance in each predictor variable that is independent of the other predictors in the equation at this step. As
tolerance values approach 0 (say below .1 or so) the data approach multicollinearity, meaning the predictor variables are highly interrelated, and interpretation of individual coefficients can be compromised. Note that discriminant coefficients are only calculated after the stepwise phase is complete. Figure 2.19 Standardized Coefficients and Structure Matrix
The standardized discriminant coefficients can be used as you would regression Beta coefficients, in that they attempt to quantify the relative importance of each predictor in the discriminant function. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function (see Figure 2.20). An older individual will have a higher discriminant score, since the age coefficient is positive. The outcome group accepting the offering has a positive mean (see Figure 2.20), and so older people are more likely to accept the offering. Notice the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) results in a one-unit change, which when multiplied by the negative coefficient will lower the discriminant score and move the individual toward the group with a negative mean (those that don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.
Figure 2.20 Unstandardized Coefficients and Group Means (Centroids)
Back in Figure 2.13 we saw a scatterplot of two separate groups and the axis along which they could be best separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficient estimates for prediction purposes, you simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut point (by default the midpoint) between the two group means (centroids) along the discriminant function (the means appear in Figure 2.20). If the prospective customer's value is greater than the cut point, you predict the customer will accept; if the score is below the cut point, you predict the customer will not accept. This prediction rule is easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios; for example, if we have a male with 16 years of education, at what age would such an individual be a good prospect? To answer this we determine the age value that moves the discriminant score above the cut point.
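The score-and-cut-point rule just described can be sketched as follows. The coefficient and centroid values below are placeholders for illustration only; the actual unstandardized coefficients and group centroids appear in Figure 2.20, which is not reproduced in this text.

```python
# Sketch of prediction from unstandardized discriminant coefficients.
# Coefficient and centroid values are PLACEHOLDERS, not the Figure 2.20
# estimates. The rule follows the text: score the case, then compare the
# score to the midpoint of the two group centroids.
def discriminant_score(educ, gender, age, coefs, constant):
    b_educ, b_gender, b_age = coefs
    return b_educ * educ + b_gender * gender + b_age * age + constant

def classify(score, centroid_no, centroid_yes):
    cut = (centroid_no + centroid_yes) / 2   # default cut point: the midpoint
    return "yes" if score > cut else "no"

# Example with placeholder values: a 60-year-old male with 16 years of
# education, placeholder coefficients (0.1, -0.3, 0.05) and constant -3.0.
score = discriminant_score(16, 0, 60, (0.1, -0.3, 0.05), -3.0)
```

A "what if" scan simply re-evaluates `discriminant_score` over a range of ages until `classify` flips from "no" to "yes".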
The Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female=1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 − 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the outcome group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes. We did not test for the assumptions of discriminant analysis (normality, equality of within-group covariance matrices) in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the sample group sizes are very different. Here the sample sizes were about the same. For a more detailed discussion of problems with assumption violation in discriminant analysis see Lachenbruch (1975) or Huberty (1994). As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic variables and the choice outcome.
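The Fisher-coefficient rule is easy to express in code. The No-group coefficients below are the ones quoted in the text; the Yes-group values are placeholders (they appear in the SPSS output but not in this passage), so the example predictions depend on those assumed numbers.

```python
# Classification with Fisher function coefficients. The "no" group values
# come from the text; the "yes" group values are PLACEHOLDERS for
# illustration only.
FISHER = {
    "no":  {"educ": 2.07, "gender": 1.98, "age": 0.32, "const": -20.85},
    "yes": {"educ": 2.10, "gender": 1.50, "age": 0.40, "const": -24.00},  # placeholder
}

def fisher_classify(educ, gender, age):
    scores = {}
    for group, c in FISHER.items():
        scores[group] = (c["educ"] * educ + c["gender"] * gender
                         + c["age"] * age + c["const"])
    # The case is assigned to the group with the higher score.
    return max(scores, key=scores.get)
```

For the text's example customer (education 16, female, age 30), the No-group score works out to 23.85, exactly as in the hand calculation above; whichever group scores higher wins.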
Predict which customers will respond to a new product or offer. Predict outcomes of various medical procedures.
Other Features
In theory, there is no limit to the size of data files for discriminant analysis, either in terms of records or variables. However, practically speaking, most applications of discriminant limit the number of predictors to a few dozen at most. With that number of predictors, there is usually no reason to use more than a few thousand records. It is possible to use stepwise methods with discriminant, so that the software can select the best set of predictors from a larger potential group. In this sense, stepwise discriminant can be considered an automated procedure like decision trees. As a result, if you use a stepwise method, you should use a validation dataset on which to check the model derived by discriminant.
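The validation-dataset check recommended above can be sketched generically: hold out part of the sample, fit on the rest, and score the held-out cases. Here `fit` and `predict` are stand-in callables for whatever model (stepwise discriminant, regression) is being validated; none of the names come from SPSS.

```python
import random

# A minimal holdout-validation sketch. fit(cases, labels) returns a model;
# predict(model, case) returns a predicted label. Accuracy is measured only
# on the cases the model never saw.
def split_validate(cases, labels, fit, predict, train_frac=0.7, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    idx = list(range(len(cases)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model = fit([cases[i] for i in train], [labels[i] for i in train])
    correct = sum(predict(model, cases[i]) == labels[i] for i in test)
    return correct / len(test)
```

With perfectly separable toy data and a fixed threshold "model", every split yields 100% holdout accuracy; a stepwise model that only fits noise would instead drop sharply on the held-out portion.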
Model Understanding
Discriminant analysis produces easily understood results. We have already seen the classification table in Figure 2.17. In addition, the procedure calculates the relative importance of each variable as a predictor (standardized coefficients; see Figure 2.19). Graphical output is produced by discriminant, but with more than a few predictors it becomes less useful.
Model Deployment
Predictions for new cases are made from simple equations using the classification function coefficients (especially the Fisher coefficients). This means that any statistical program, or even a spreadsheet program, could be used to generate new predictions, and that the model can be applied directly to data warehouses, not only extracted data sets. This makes the model easily deployable.
After the procedure calculates the outcome probability, it simply assigns a case to a predicted category based on whether its probability is above .50 or not. The same basic approach is used when the dependent variable has three or more categories. In Figure 2.22, we see that the logistic model is a nonlinear model relating predictor variables to the probability of a choice or event (for example, a purchase). If there are two predictor variables (X1, X2), then the logistic prediction equation can be expressed as:
prob(event) = exp(B0 + B1*X1 + B2*X2) / (1 + exp(B0 + B1*X1 + B2*X2))
where exp() represents the exponential function. The conceptual problem is that the probability of the event is not linearly related to the predictors. However, if a little math is done you can establish that the odds of the event occurring are equal to:
odds(event) = exp(B0) * exp(B1*X1) * exp(B2*X2)
Although not obviously simpler to the eye, the second formulation (and SPSS displays the logistic coefficients in both the original form and raised to the exponential power) allows you to state how much the odds of the event change with a one unit change in the predictor. For example, if I stated that the odds of making a sale double if a resource is
given to me, everyone would know what I meant. With this in mind we will look at the coefficients in the logistic regression equation and try to interpret them. Recall that logistic regression assumes that the predictor variables are interval scale; as in regression, dummy coding of categorical predictors can be performed. As such, its assumptions are less restrictive than those of discriminant analysis.
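The probability and odds formulations described above can be checked numerically. The coefficient names below are generic (B0, B1, B2), not values from the example; the key property is that a one-unit change in a predictor multiplies the odds by exp(B) for that predictor.

```python
import math

# The two-predictor logistic model: probability is a nonlinear function of
# the linear predictor, while the odds are a product of exponentiated terms.
def logistic_prob(b0, b1, x1, b2, x2):
    z = b0 + b1 * x1 + b2 * x2
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    # odds of the event: p / (1 - p); for a logistic model this equals exp(z)
    return p / (1 - p)
```

Increasing x1 by one unit multiplies the odds by exactly exp(b1), regardless of where x1 started; that is the "odds shift" interpretation used in the rest of this section.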
This is all we need in order to run a standard logistic regression analysis. Notice the Interaction button. You can create interaction terms by clicking on two or more predictor variables in the original list, then clicking on the Interaction button. Also, you can use the Categorical pushbutton to have Logistic Regression create dummy-coded (or contrast) variables to substitute for your categorical predictor variables (note that Clementine performs such operations automatically in its modeling nodes). The Save pushbutton allows you to create new variables containing the predicted probability of the event, and various residual and influence measures. As in Discriminant, the Select
pushbutton will estimate the model from part of your sample (you provide a selection rule) and apply the prediction equation to the other part of the data (cross-validation). The Options pushbutton provides control over the criteria used in stepwise analysis.
Click Save pushbutton
Click Probabilities check box
Click Group membership check box
Figure 2.24 Logistic Regression: Save Dialog Box
The Logistic Regression procedure can create new variables to store various types of information. Influence statistics, which measure the influence of each point on the logistic analysis, can be saved. A variety of residual measures, which identify poorly fit data, can also be retained. When scoring, the predicted probability provides a score for each observation that is used to classify it into an outcome category. We save them here in order to demonstrate how they can be used in a gains table to evaluate the effectiveness of the model.
Click Continue
Click OK to run the analysis
Scroll down to the Classification Table in the Block 1 section of the Viewer window
The classification results table indicates that those refusing the offer were predicted with 70.5% accuracy and those accepting with 61.7% accuracy, for an overall correct classification of 66.2%. The logistic model predicted slightly better for the refusals, and about 4 percentage points worse for the acceptances, so overall it does slightly worse (about 2 percentage points) than discriminant on the training sample. The default classification rule for a case is that if the predicted probability of belonging in the outcome group with the higher value (here 1) is greater than or equal to .5, then predict membership in that group. Otherwise, predict membership in the group with the lower outcome value (here 0). We will examine these predicted probabilities in more detail later. Figure 2.26 Significance Tests and Model Summary
The Model Chi-square test provides a significance test for the entire model (three variables) similar to the overall F test in regression. We would say there is a significant relation between the three predictors and the outcome. The Step Chi-square records the change in chi-square from one step to the next and is useful when running stepwise methods.
The pseudo r-square is a statistic modeled after the r-square in regression (discussed earlier in this chapter). It measures how much of the initial lack-of-fit chi-square is accounted for by the variables in the model. Both variants indicate the model only accounts for a modest amount of the initial unexplained chi-square. Now let's move to the variables in the equation. Figure 2.28 Variables in the Equation
The B coefficients are the actual logistic regression coefficients, but recall they bear a nonlinear relationship to the probability of accepting the offer. Although they do linearly relate to the log odds of accepting, most people do not find this metric helpful for interpretation. The second column (S.E.) contains the standard errors for the B coefficients. The Wald statistic is used to test whether the predictor is significantly related to the outcome measure adjusting for the other variables in the equation (all three are highly significant). The last column presents the B coefficient exponentiated using the e (exponential) function, and we can interpret these coefficients in terms of an odds shift in the outcome. For example, the coefficients of age and education are above 1, meaning that the odds of accepting the offer increase with increasing age and education. The coefficient for age indicates that the odds increase by a factor of 1.06 per year, which seems rather small. However, recall that age can range from 18 to almost 90 years old, and a 20-year age difference would have a substantial impact on the odds of accepting the offering (the odds more than triple). The coefficient for gender is about .5, indicating that if the other factors are held constant, moving from a male to female reduces the odds of accepting the offering by about 1/2. In
this way you can express the effect of a predictor variable in terms of an odds shift. You can use the B coefficients for your prediction equation:
prob of accepting = exp(.107*educate − .738*gender + .060*age − 3.5) / (1 + exp(.107*educate − .738*gender + .060*age − 3.5))
The results of the logistic regression confirm that age, gender and education are related to the probability of a potential customer accepting the offering. Although not shown, we ran logistic using a stepwise method and obtained the same model.
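The prediction equation above translates directly into a scoring function. The coefficients are the values reported in the text; gender is coded 0 = male, 1 = female as in the example, and the function name is illustrative.

```python
import math

# The deployment equation from the text, using the reported B coefficients.
# gender: 0 = male, 1 = female.
def prob_accepting(educate, gender, age):
    z = 0.107 * educate - 0.738 * gender + 0.060 * age - 3.5
    return math.exp(z) / (1 + math.exp(z))

# The age odds shift discussed earlier: over a 20-year age difference the
# odds are multiplied by exp(0.060 * 20), which is more than triple.
age_odds_factor_20yr = math.exp(0.060 * 20)
```

The signs behave as interpreted in the text: probability rises with age and education, and falls moving from male to female, other things held constant.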
Other Features
As with the other techniques we have discussed, there is no limit to the size of data files. As previously, usually a limited number of predictors are used for any problem, so file sizes can be reasonably small. Stepwise logistic regression is available in the Binary Logistic procedure, but not in the Multinomial Logistic procedure, and the standard caveats apply about using a validation data set when stepwise variable selection is done.
Model Understanding
Although the logistic model is inherently more complex, it is not unduly so compared to linear regression. When the results are translated into odds on the dependent variable, they become quite helpful to decision makers. As before, graphical representations of the solution are less helpful with more than a few predictors.
Model Deployment
For predictions with binary logistic regression, only one equation is involved, so the model is easily deployable. For multinomial regression, more than one equation is
involved, which requires more calculations, but this doesn't make prediction for new cases that much more difficult.
Click Transform…Rank Cases from the Data Editor window
Move pre_1 into the Variable(s) list box
Click Largest value option button
The predicted probability variable from the logistic regression (pre_1) will be used to create a new rank variable. In addition, we indicate that the highest value of pre_1 should be assigned rank 1. This makes sense: the case with the highest predicted probability of belonging to the Yes outcome group (coded 1, the highest value) should have the top rank. If we stopped here, each case would be assigned a unique rank (assuming no ties on predicted probability). Since we want to create decile groups, we must make use of the Rank Cases: Types dialog box.
Click Rank Types pushbutton
Click Ntiles check box to check it
Erase 4 and type 10 in the Ntiles text box
Click the Rank check box so it is not checked
Click Continue, then click OK
The Rank procedure will now create a new variable coded 1 through 10, representing decile groups based on the predicted probability of responding Yes. Although a number of procedures can display the critical summary (response percentage for each decile group), we will use the OLAP Cubes procedure since it can present the base rate (assuming a random prediction model) as well.
Click Analyze…Reports…OLAP Cubes
Move newschan into the Summary Variable(s) list box
Move npre_1 into the Grouping Variable(s) list box
The OLAP Cubes procedure is designed to produce a multidimensional summary table that supports drill-down operations. We choose it for some specific summary statistics.
Click Statistics pushbutton
Click and drag to select all statistics in the Cell Statistics list box
Click the left arrow to remove these statistics from the Cell Statistics list box
Move Number of Cases into the Cell Statistics list box
Move Percent of Sum in (npre_1) into the Cell Statistics list box
Move Percent of N in (npre_1) into the Cell Statistics list box
We request several summaries for each decile score group.
The number of cases in each group will appear, along with the percent of the total sum of the newschan variable in each decile group (based on npre_1). Since newschan is coded 0 or 1, the percentage of the overall sum of newschan in each decile group is the percentage of cases in that group that gave a positive response to the newschan question. Finally, the percentage of cases in each decile group will be displayed. This provides the base rate against which the model-score-based deciles will be compared. The logic is that if the model predictions are unrelated to the actual outcome, then we expect the top decile of cases (based on the model) to contain about 10% of the positive responses: the rate we expect by chance alone.
Click Continue
Click the Title pushbutton
Erase OLAP Cubes and type Gains Table in the Title text box
Click Continue, then click OK

The OLAP Cubes table is designed to be manipulated within the Pivot Table editor to permit different views of the summaries. In order to see the entire table we must move the decile-grouping variable into the row dimension of the pivot table (such manipulation is covered in more detail in The Basics: SPSS for Windows and Intermediate Topics: SPSS for Windows courses).

Double-click on the Gains Table pivot table in the Viewer window
If the Pivoting Trays window is not visible, then click Pivot…Pivoting Trays
Click the pivot icon in the Layer tray of the Pivoting Tray window
The Pivot Tray window allows us to move table elements (in this case the decile categories) from one dimension to another. We need to move the npre_1 (NTILES of PRE_1) icon from the layer to the row dimension.
Drag the NTILES of PRE_1 pivot icon to the right side of the Row tray (right of the icon already in the row tray) Click outside the crosshatched area in the Viewer window to close the Pivot Table Editor
The column headings of the pivot table can be edited so they are easier to read (double-click on the pivot table, then double-click on any column heading to edit it). The rows of the table represent the decile groupings based on the predicted probabilities (scores) from the logistic regression model. The column labeled N contains the number of cases in each decile group, and % of N in NTILES of PRE_1 displays the percentages. This latter column contains the expected percentage of overall positive responses appearing in each group under a model with no predictability (i.e., the base rate). The column of greatest interest is labeled % of Sum in NTILES of PRE_1. It contains the percentage of the overall positive responses captured by each decile group under the model. Examining the first decile (1), we see that the most promising 10% of the sample contains 16.4% of the positive respondents. Similarly, the second and third deciles each contain 15% of the positive respondents. Thus, if we were to offer the interactive news cable package to the top 30% of the prospects, we would expect to obtain 46.4% of the positive responders. In this way, analysts in direct mail and related areas can evaluate the expected return of mailing to the top x% of the population (based on the model). In the fourth decile and beyond, the return is near or below that expected from a random model, so the decile groups holding most promise are the first three.
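The gains-table logic just described can be reproduced outside SPSS. The following Python sketch (our own illustration, not an SPSS procedure; the function name and synthetic data are hypothetical) ranks cases by predicted probability, splits them into decile groups, and reports the share of all positive responses each group captures:

```python
def gains_table(scores, outcomes, n_groups=10):
    """Rank cases by score (highest first), split into groups, and
    report what share of all positive outcomes each group captures."""
    pairs = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    total_pos = sum(outcome for _, outcome in pairs)
    table = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        pos = sum(outcome for _, outcome in group)
        table.append({
            "decile": g + 1,
            "n": len(group),                              # cases in group
            "pct_of_cases": 100.0 * len(group) / n,       # base rate column
            "pct_of_positives": 100.0 * pos / total_pos,  # gains column
        })
    return table

# Synthetic example: 100 cases; the 20 highest-scoring cases responded Yes
scores = list(range(100, 0, -1))
outcomes = [1 if s > 80 else 0 for s in scores]
table = gains_table(scores, outcomes)
```

With these synthetic data the top decile captures 50% of the positive responders (10 of 20), well above the 10% expected by chance alone.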
INTRODUCTION
As its name implies, market basket analysis techniques were developed, in part, to analyze consumer shopping patterns. These methods are descriptive and find groupings of items. Market basket or association analysis clusters fields (items), which are typically products purchased, but could also be medical procedures prescribed for patients, or banking or telecom services used by customers. The techniques look for patterns or clusters among a small number of items. These techniques can be found in Clementine under the Apriori and GRI procedures. Market basket or association analysis produces output that is easy to understand. For example, a rule may state that, if corn chips are purchased, then 65% of the time cola is purchased, unless there is a promotion, in which case 85% of the time cola is purchased. In other words, the technique correlates the presence of one set of items with another. A large set of rules is typically generated for any data set with a reasonably diverse type and number of transactions. As an illustration, Figure 3.1 shows a portion of the output from Clementine's Apriori procedure showing relations between various products bought by customers over one week in a supermarket. The first line tells us that 10.8% of the sample, or 85 customers, bought both frozen foods and milk, and that 77.6% of such customers also bought some type of alcoholic product. Figure 3.1 A Set of Market Basket Association Rules
Such association rules describe relations among the items purchased; the goal is to discover interesting and actionable associations.
TECHNICAL CONSIDERATIONS
Number of Items?
If the fields to be analyzed represent items sold, the number of distinct items influences the resources required for analysis. For example, the number of SPSS products a customer can currently purchase is about 15, which is a relatively small number of items to analyze. Now, let's consider a major retail chain store, auto parts supplier, mail catalog vendor or web vendor. Each might have anywhere from hundreds or thousands to tens of thousands of unique products. Generally, when such large numbers of products are involved, they are binned (grouped) together into higher-level product categories. As items are added, the number of possible combinations increases exponentially. Just how much categorization is necessary depends upon the original number of items, the detail level of the business question asked, and the level of grouping at which meaningful categories can be created. When a large number of items is present, careful consideration must be given to this issue, which can be time consuming. But time spent on this matter will increase your chance of finding useful associations and will reduce the number of largely redundant rules (for example, rules describing the purchase of a hammer with each of many different sizes and types of nails).
Actionable?
More so than with other data-mining methods, data-mining authors and consultants raise questions about whether the associations discovered using market basket analysis are useful and actionable. The challenge is that if you do discover an association between two products, say beer and diapers (to use an example with a storied past), just what action would this lead a retailer to take? Of course, other data-mining methods are open to a similar challenge, but it is worthwhile to consider in advance how your organization would make use of any strong associations that are discovered in the market basket analysis.
RULE GENERATION
Within Clementine, the association rule methods begin by generating simple rules (those involving two items) and testing them against the data. The most interesting of these rules (that is, those that meet the specified minimum criteria; by default, coverage and accuracy) are stored. Next, all rules are expanded by adding an additional condition from a third item (this process is called specialization) and are, as before, tested against the data. The most interesting of these rules are stored and specialization continues. When the analysis ends, the best of the stored rules can be examined. Two procedures in Clementine perform association rule analysis. GRI is more general in that it permits both numeric (continuous) and categorical input (condition) variables. Apriori, because it only permits categorical input (condition) variables, is quicker. It also supports a wider choice of criteria used to select rules. Results from an Association analysis are presented in a table with these column headings:

Consequent   Antecedent 1   Antecedent 2   ...   Antecedent N

For example:

Consequent: AnswerTree   Antecedent 1: Regression Models   Antecedent 2: Advanced Models
This rule tells us that customers who purchase the Regression Models and Advanced Models SPSS options also purchase AnswerTree. Association rules are commonly evaluated by two criteria: support and confidence. Support is the percentage of records in the data set for which the conditions (antecedents) hold. It indicates how general the rule is; that is, to what percentage of the data it applies.
Confidence (accuracy) is the proportion of records meeting the conditions (antecedents) that also meet the consequent (conclusion). It indicates how likely the consequent is, given that the conditions are met. It is worth noting explicitly that market basket analysis does not take time into account. Thus, while purchases may well be time ordered, Apriori and GRI do not include time as a component of the rule generation process. Practically speaking, this means that all data for a case (e.g., shopping trip) must be stored in one physical record. Sequence detection algorithms take time sequencing into account, and such analyses can be done in Clementine with the Sequence node (or the CaprI algorithm add-on); see Chapter 8.
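To make the support and confidence definitions concrete, here is a small Python sketch (our own illustration, not Clementine's implementation; the function name and transaction data are hypothetical) that scores one candidate rule against a set of transactions:

```python
def rule_stats(transactions, antecedents, consequent):
    """Support: share of transactions containing all antecedents.
    Confidence: share of those transactions that also contain the consequent."""
    covered = [t for t in transactions if antecedents <= t]
    support = len(covered) / len(transactions)
    confidence = (sum(consequent in t for t in covered) / len(covered)
                  if covered else 0.0)
    return support, confidence

# Four shopping trips (hypothetical data)
trips = [{"frozen foods", "milk", "alcohol"},
         {"frozen foods", "milk"},
         {"milk"},
         {"frozen foods", "alcohol"}]
support, confidence = rule_stats(trips, {"frozen foods", "milk"}, "alcohol")
```

Here the rule "frozen foods & milk → alcohol" has support 0.5 (the antecedents appear in two of the four trips) and confidence 0.5 (one of those two trips also includes alcohol).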
Figure 3.2 Clementine Stream for Shopping Data
The stream canvas currently contains: a source node used to access the text file containing the shopping data (shopping.txt); a Type node used to declare field types and indicate how fields are to be used in any modeling; a Table node to display the original data values; a Distribution node for a selected field; and an Apriori modeling node. We will examine some of these nodes and then proceed with the market basket analysis.

Right-click on the Table node, then click Execute from the Context menu
In addition to several demographic fields, each purchase item field is coded 0 (no purchase) or 1 (purchase). Note that the number of purchase variables is quite limited because purchases have been grouped into broad categories.

Close the Table window
Double-click the Type node to open it

Figure 3.4 Type Node for Shopping Data
In Clementine, data fields can be one of the following types: range, set (categorical with more than two categories), flag (binary), or typeless (such as for an ID field). The modeling procedures in Clementine automatically take a field's type into account. Since the various purchased item fields can take only two values (0 (no purchase) or 1 (purchase)), they are defined as flag fields. Also, the Direction column instructs Clementine how a field should be used in modeling. For most modeling procedures, fields are set to In (Input, independent or predictor fields) or Out (Output, dependent or outcome fields). However, we mentioned earlier that association rules can examine all relations and there generally isn't a specific dependent variable. For this reason, all the purchased item fields are set to the Both direction, meaning that they can be inputs and outputs, and can thus appear on either side of an association rule.

Close the Type node dialog
Double-click the Apriori node

Figure 3.5 Apriori Dialog Box
The node is labeled 10 fields because it has a default model name that indicates how many fields will be used in creating the association rules. The basic dialog for setting up the market basket analysis is fairly simple, largely because much of the definitional work has been done in the Type node. Notice that you can set criteria values for the minimum rule support (default value is 10%) and minimum rule confidence (default value is 80%). Thus at least 10% of the records must contain a particular group of antecedents before it
will be considered as a rule. For purchased items you might need to raise or lower this value depending on the purchase rates of the items and the number of items. For example, if the most popular item were purchased on only 8% of all store visits, then no association rules would be generated under the default (10%) minimum coverage rule. As you raise this value, fewer association rules will be generated. The minimum confidence may need to be modified as well. In practice, it might be lowered if the initial 80% setting results in too few or even no rules. We lowered it because an earlier run (not shown), under the default value of 80%, produced only four association rules. The Maximum number of antecedents setting controls how many fields may be combined to form conditions. The search space, and thus processing time, increases rapidly with an increase in the maximum number of rule preconditions. By default, rules will only contain true values for fields that are flag type. We discussed this earlier and, since we are not interested in rules relating to the non-purchase of products, we retain the check in the Only true values for flags check box.

Click the Execute button

Figure 3.6 Generated Model Node Added to Models Tab in Manager Window
After a Clementine model runs successfully, a node containing information about the model (the generated node) is added to the Models tab in the Manager window. This model is an unrefined model, which can't be placed in a stream. But if we ran a model in which there was a declared outcome field, we could move the generated model node into the data stream and examine the predicted values (as we will do in later chapters).

Right-click on the generated model node, then click Browse

Figure 3.7 Rules Produced by Apriori
By default, only the association rules appear. Since the support and confidence values are important when evaluating the rules, we display them as well.

Click the Show/Hide criteria button on the toolbar (you may also need to widen the window to see all antecedents)

In addition to support and confidence, also listed are instances: the number of cases used to calculate the support proportion.
Figure 3.8 Rules with Instances, Support and Confidence Added
Twenty-six rules were found (the number in the upper right-hand corner), and they are sorted by confidence. The first rule associates frozen foods, milk, and bakery goods. There are 85 instances (shopping trips) in which both frozen foods and milk were purchased. This is 10.8% of all shopping trips. On these 85 trips, bakery goods were also purchased 83.5% of the time. From here you would browse the rules, guided by the support and confidence figures, looking for interesting associations. Those with greater support will apply to a larger percent of the data. Those with greater confidence provide more certainty that a case that meets the conditions will also have the conclusion value. When viewing the confidence values, it is important to keep the base rates of purchasing items in mind (these can be obtained from the distribution node in Clementine). For example, if bakery goods are bought by 78% of all customers, the 83.5% confidence of the first association rule looks much less impressive than if the base rate were 30% (the actual base rate for bakery goods is about 43%, not shown). The table of rules can be sorted by support, confidence, the product of both, the consequent, or the length of the sequence, providing different views of the results.
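As the paragraph above stresses, a confidence figure is only impressive relative to the base rate. The ratio of the two (commonly called lift; this is not a column in the Clementine table, and the function below is our own illustration) makes the comparison explicit:

```python
def lift(confidence, base_rate):
    """How many times more likely the consequent is among covered
    cases than in the data as a whole."""
    return confidence / base_rate

# 83.5% rule confidence against a 43% base rate for bakery goods
ratio = lift(0.835, 0.43)
```

With the 43% base rate quoted above, the first rule makes a bakery purchase roughly 1.9 times more likely than chance; against a 78% base rate, the same confidence would yield a lift barely above 1.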
This dialog will create a new Association Rule node containing all rules that have the target field as the consequent. Data can be passed through this generated node and a new field (variable) will be created, storing the predicted values for the target field based on the conditions in the data records. By default, the node will be added to the stream canvas. Minimum values can be specified here for the coverage and confidence of the generated rules.

Click OK
Close the Association Rules browser window
Right-click the new Apriori generated node on the canvas, then click Edit from the Context menu
Click the Show all levels button
Click the Show or hide instance and confidence figures button to view those values

Figure 3.10 Ruleset for Alcohol Generated from Apriori Model
We see the rules that lead to prediction of the purchase of alcohol (value = 1). The first rule indicates that if frozen foods and milk are purchased (antecedents), then predict that alcohol is also purchased. There are 85 records (shoppers) with these antecedents (Instances) and 77.6% (.776) of these shoppers also purchased alcohol (confidence). We will now connect the Apriori generated node to the stream to view the fields created by the rule set.

Close the Ruleset browser window
Click and drag the Apriori generated node to the right of the Type node (you might want to click and drag the Apriori modeling node down a bit first to make some room)
Click with the middle mouse button (or click both left and right buttons on a two-button mouse) on the Type node and drag to the Apriori generated node (connecting the two)
Click on a Table node from the Output Palette, then click in the stream canvas to the right of the Apriori generated node to place the Table node
Click with the middle mouse button (or click both left and right buttons on a two-button mouse) on the Apriori generated node and drag to the Table node (connecting the two)

Figure 3.11 Adding the Ruleset Node to the Data Stream
The Ruleset node has been added to the Data Stream so the new fields it creates (prediction for Alcohol purchase and the confidence value) will appear in the table. Right-click the Table node Click Execute on the Context menu Scroll to the right to view the last columns in the Table window
Figure 3.12 Fields Created by the Generated Rule Set
Two new fields have been added to the data. The first, $A-Alcohol, is $null$ (the default) unless any of the three rules in the rule set apply to the record, in which case it has a value of 1. The second field, $AC-Alcohol, represents the confidence figure for the rule consequent (here alcohol) for a particular rule. In this way, prediction rules can be derived from an association rule analysis if rules related to a particular outcome are interesting.
Figure 3.13 SPSS Training Purchase Data Stream
As before, the Clementine stream consists of a data source node (reading a text data file) connected to a Type node and then to the Apriori association rule node (labeled 13 fields).

Right-click on the Table node, then click Execute on the Context menu
Scroll to the right in the Table window to the training course fields

Figure 3.14 Training Purchase Data
As in the previous example, each training course product is coded 1 or 0 depending on whether or not a customer (each record represents a customer) purchased the product.

Close the Table window
Double-click the Apriori node to edit it

Figure 3.15 Apriori Dialog Box
Notice that the minimum values for coverage (5%) and accuracy (20%) are much lower than in our earlier analysis (see Figure 3.5). This is because, to the regret of the Training Department, the percentage of customers taking any single training course is low. When we optimistically ran the analysis using the default coverage and accuracy rules, no rules were generated.

Click Execute
Right-click on the generated model node for the new model, then click Browse
Click the Show/Hide criteria button to see the support and confidence values Widen the columns to see the full names of the courses, and then widen the dialog box
Although we allowed up to five antecedents, the ten rules generated have only one antecedent. The Introduction to SPSS course is the consequent for three rules. However, one of these rules has no antecedent! This is because Apriori is designed to display the base rates for all potential consequents, if they meet the specified minimum rule confidence condition (more on base rates in a moment). Figure 3.16 Association Rules for Training Courses
The rules that conclude with the Introduction to SPSS and the Introduction to SPSS and Statistics courses are not very interesting since they are entry-level courses through which customers usually pass in order to take more advanced training. A more useful rule is the one relating Building Predictive Models to Introduction to CHAID. It has a support value of 6.5% and a confidence value of .273. Given that the highest confidence value is 31.1%, it is one of the better rules. But what is critical, as mentioned in the previous example with grocery items, is to compare the confidence value to the base rate for the Introduction to CHAID class. We get this information from the frequency table and bar chart produced by a distribution node. Close the Association Rules browser window Right-click the Distribution node Click Execute on the Context menu
Figure 3.17 Distribution of Introduction to CHAID Course Purchase
A small percentage (5.71%) of customers who purchase some training product purchase the CHAID course. However, the rule confidence indicates that of those who purchase the Building Predictive Models course, 27.3% also purchase the CHAID course: almost five times the base rate. This could present a cross-selling opportunity. If a new training customer is interested in the Building Predictive Models course, the CHAID software and course products might be mentioned as well. A domain expert (in this case, someone familiar with the content and sales of training products) would be the one to examine such rules and decide which are worthwhile and not just artifacts of how training is structured. Since there is no specific target variable in a market basket or association rules analysis, many associations are summarized. The analyst can then examine the rules for interesting associations and generate rule sets, or note them for further study. This very general, nondirected analysis is the strength of market basket analysis. However, in practice the domain expert must examine and discard rules that are obvious or unhelpful.
Other Features
Market basket analysis is typically done on very large data sets, with tens of thousands of transactions and hundreds of items. But as file sizes grow, especially in the number of items, the number of potential connections grows quickly, as does the required computation time. For example, for just 50 items, there are 1,225 distinct combinations of two items, 19,600 of three items, and so forth. So in practice, the number of cases can be large, but the number of distinct items tends to be much smaller, as is the number of items to be associated together in a cluster (limited by the value for Maximum number of antecedents in the Apriori dialog). It can be difficult, therefore, to determine the optimum number and type of items a priori, so typically, as with all the automated methods, several different solutions may be tried with different sets. This is the equivalent of validating the associations you discover on a completely separate data set, although standard validation can also be done.
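The combination counts quoted above follow directly from the binomial coefficient; a quick check in Python:

```python
from math import comb

# Distinct combinations among 50 items
pairs = comb(50, 2)    # two-item combinations
triples = comb(50, 3)  # three-item combinations
```

This confirms the figures in the text: 1,225 pairs and 19,600 triples for just 50 items, which is why the number of items, not the number of cases, drives computation time.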
Model Understanding
One of the great strengths of market basket analysis is that its results are easily understood by anyone. The rules it finds can be expressed in natural language and don't involve any statistical testing. Usually, the graphical output, except for small sets of items, is not as helpful as the numeric data showing the actual strength of association.
Model Deployment
The association rules produced can be coded as statements in SQL. This means that they can be applied directly to databases, if necessary. However, model deployment for these techniques does not usually involve applying the rules to a new data file. Instead, the results are directly actionable, as decisions can be made about cross-marketing and new promotions by simply examining the associations and determining what level of association is high enough to warrant action.
Neural Networks
INTRODUCTION
Neural networks are models based on the way the brain and nervous system operate. Such models are used to predict an outcome variable that is either categorical or interval in scale, using predictors that are also categorical or interval. Neural networks are popular when modeling complex environments (for example, financial applications). Forms of neural networks (Kohonen networks) are used to cluster data (see Chapter 6). Neural network models are available from SPSS through its Clementine and Neural Connection products. The basic unit is the neuron or node, and these are typically organized into layers, as shown in Figure 4.1. Figure 4.1 Simple Neural Network
Input data is presented to the first layer, and values propagate from each neuron to every neuron in the next layer. The values are modified during transmission by weights. Eventually, a result is delivered from the output layer. Initially, all weights are randomly set to small values and the answers that come out of the net are probably nonsensical. The network learns through training; examples for which the output is known are repeatedly presented to the network; the answers it gives are compared to the known outcomes, and information from this comparison is passed back through the network, gradually changing the weights. As training progresses, the network will usually become more and more accurate in replicating the known outcomes; once trained, it can be applied to future cases where the outcome is unknown. Neural networks complement such statistical models as regression and discriminant analysis by allowing for nonlinear relations and complex interactions among predictor variables. When such relations are present, then neural networks will outperform the other models. When such relations are not present, then neural network models do about
as well as the parametric statistical models. In this latter circumstance, the traditional models would probably be chosen on the basis of parsimony and their greater transparency (see the Model Understanding section). A neural network solution with all predictors cannot be displayed in convenient graphical form, since relations between predictors and the outcome are mediated by the hidden layer(s), and many predictors may be involved. However, numeric summaries provide insight into the relative importance of each predictor, and plots relating individual predictors to the outcome can be produced. Most of the output summaries from neural network models are numeric.
In the diagram, each of the predictor variables is related to the outcome by a single connection weight or coefficient (the regression or B coefficients). Relations between the predictors and outcome variable are assumed to be linear. The coefficients or weights are estimated from the training data in a single data pass. The coefficients are those that minimize the sum of squared errors (called least squares). There are as many coefficients (weights) as there are input variables (for categorical variables, dummy variables would be substituted), plus one coefficient for the intercept. Now let's consider the form a neural network would take. Here we are using a feed-forward backpropagation neural network (named as such because prediction errors propagate backward to modify the weight coefficients).
Two features immediately attract attention. The first is the middle layer of neurons that intervenes between the input layer (which accepts data values; there is one input layer neuron for each input variable) and the output layer neuron. This is called the hidden layer since its neurons do not correspond to observable fields. Secondly, due to the hidden layer, there are many more connections between the input and output than we saw with regression. This allows a neural network to fit more complex models than regression would. Notice that each input neuron is connected to every hidden layer neuron and, in turn, each hidden layer neuron is connected to the output neuron(s) (here there is just one output neuron). Thus a single input predictor can influence the output through a variety of paths, which allows for greater model complexity. Backpropagation networks typically have a single hidden layer of neurons, but there are extensions that allow two (a second hidden layer is useful in capturing symmetric functions) or even three (although rarely used) hidden layers. The number of neurons within a hidden layer can be varied; more neurons permit more complex relations to be fit. In fact, any continuous function can be fit using a backpropagation network by continuing to add neurons in the hidden layer. However, a neural network with many neurons in its hidden layer(s) may over-fit the training data and not generalize well to other instances. For this reason, in practice, the number of neurons in the hidden layer(s) is usually adjusted based on the number of predictor and outcome fields.
Activation Function
A feature not visible in the diagram in Figure 4.3 concerns what occurs at a hidden neuron when the inputs from a data record are combined using the current set of weights relating them to the hidden neuron. Instead of the value of the weights applied to the input fields after normalization (if Xi is input variable i and neuron j is the jth hidden neuron, the value would be X1*wj1 + X2*wj2 + … + Xk*wjk) simply passing through the neuron, a nonlinear function is applied. Although there are several choices, very often the logistic function (see Chapter 2, Logistic Regression) is used. Figure 4.4 Logistic Function
Thus there is a nonlinear mapping of the value produced by the weights applied to a data record before the next set of weights (relating the hidden layer to the output layer) are applied. This allows the neural net to capture nonlinear relations between the inputs and outcome. In addition (not shown), there is a threshold below which a hidden layer neuron will not pass the value to the next neuron(s). In this way, a neural network can natively capture complex interactions among inputs and nonlinear relations between the inputs and output in the data, which are beyond what standard regression can do. Herein lies the strength of a neural network. If we expand our diagram (Figure 4.2) for multiple regression by adding a single neuron in a hidden layer and use the logistic function as its activation function, then we have a neural network that performs logistic regression.
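The combine-then-squash behavior described above can be sketched in a few lines of Python (an illustration only; the weights are arbitrary, and the bias and threshold terms mentioned in the text are omitted for brevity):

```python
from math import exp

def logistic(z):
    """The logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """One feed-forward pass: weighted sum into each hidden neuron,
    logistic activation, then weighted sum into the output neuron."""
    hidden = [logistic(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden)))

# One hidden neuron, all weights zero: every neuron outputs logistic(0) = 0.5
prediction = forward([1.0, 2.0], [[0.0, 0.0]], [0.0])
```

This also illustrates why an untrained network (weights near zero) starts out giving essentially uninformative answers: every output sits near 0.5 regardless of the inputs.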
Figure 4.5 Neural Network for Logistic Regression
Adding additional neurons to the hidden layer, which is done in most neural network analyses, allows the model to capture more complex relationships in the data.
Complexity
If we examine Figure 4.3 again, we see that a neural network with 5 input neurons, 3 hidden layer neurons and 1 output neuron (5 predictors and one outcome) contains 18 weights. Also, every predictor relates to the output through three paths, each of which involves the other predictors and the nonlinear activation function. Because of this, there is no simple expression relating an input to the output. For this reason, neural network models are difficult to interpret. There are indirect methods used to better understand the relationship between an input variable and the predicted output from a neural network, but the relatively simple interpretations we had with regression, discriminant analysis and logistic regression are not to be found. Also, there are measures of the relative importance of the predictors (sensitivity analysis) in neural networks, but they don't provide direction or functional-form information.
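The weight count cited above is simply the product of adjacent layer sizes in a fully connected network (bias terms are not counted here, following the text):

```python
# Fully connected 5-3-1 network: weights between adjacent layers
inputs, hidden, outputs = 5, 3, 1
n_weights = inputs * hidden + hidden * outputs  # 15 input-to-hidden + 3 hidden-to-output
```

Adding hidden neurons therefore grows the weight count quickly: a 5-10-1 network already has 60 weights, one reason larger hidden layers can over-fit.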
Approach
What follows is a simplified description of how the feed-forward backpropagation neural network model is fit (how the weights are estimated). The basic idea is that the weights are adjusted based on the error of prediction from each record (case). To illustrate this, suppose we are attempting to predict the insurance claim amount from five input variables.
First, the weights are set to small, random values. Then the data values from the first record (case) are passed through the net (the weights and activation functions are applied) and a prediction is obtained. The error of this prediction is used to adjust the weights connecting the hidden layer to the output layer.

Figure 4.6 Adjusting Hidden Layer Weights Based on Prediction Error
The current weights, relating the hidden layer neurons to the outcome, are modified based on the error of the prediction. Next the weights relating the input variables to the hidden layer are adjusted in a similar way. First, the error in the output is propagated back to each of the hidden layer neurons via the weights connecting the hidden layer to the output. This must be done since the hidden layer neurons have no observed outcome values and error cannot be measured directly. In turn, the weights relating each input neuron to a hidden layer neuron are similarly modified based on the error assigned to that hidden layer neuron (see Figure 4.7).
Figure 4.7 Adjusting Input Layer Weights Based on Error
By propagating the output prediction error to the hidden layer through the weights, all weight coefficients can be adjusted from the prediction error. The new weights are applied to the next case, and are similarly adjusted. Weights are adjusted for each record (case) and, typically, many data passes are required for the model to become stable. For those interested in the equation, the weight change can be expressed as follows:

Wji(t + 1) = Wji(t) + η*dj*Oi + α*[Wji(t) - Wji(t-1)]

where Wji is the weight connecting neuron i to neuron j, t is the trial number, η is the learning rate (a value set between 0 and 1), dj is the error gradient at node j (for discussion of this see Fu, 1994), Oi is the activation level of a node (the nonlinear function applied to the result of the combination of the weights and inputs), and α is a momentum term (a value set between 0 and 1). A new weight is derived by taking the old weight and applying an adjustment based on a function of the prediction error (represented here by dj). In addition, the momentum term (α) serves to encourage the weight change to maintain the same direction as the last weight change. It and the learning rate (η) are control parameters that can be modified by experienced neural network practitioners to fine-tune the performance of feed-forward backpropagation neural networks. This is a bare description of the estimation method and it skips over the more difficult details of how the error gradient is calculated. For those interested in these details, see Fu (1994) or the more accessible, but less detailed, Bigus (1996). Berry and Linoff (1997) include a chapter discussing neural networks in the context of data mining.
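A rough sketch of a single application of this weight-update rule (plain Python; the learning-rate and momentum values shown are arbitrary illustrations, not Clementine defaults):

```python
def update_weight(w_now, w_prev, error_gradient, activation,
                  learning_rate=0.3, momentum=0.9):
    """One backpropagation weight update, following the formula in the
    text: new weight = old weight + (learning rate * error gradient *
    activation) + momentum * (last weight change)."""
    delta = learning_rate * error_gradient * activation
    return w_now + delta + momentum * (w_now - w_prev)

# With a zero error gradient and no prior weight change, the weight
# stays put; otherwise it moves in the direction of the gradient plus
# a fraction of its previous movement.
w_new = update_weight(0.5, 0.45, error_gradient=0.2, activation=1.0)
```

The momentum term is visible here as the `momentum * (w_now - w_prev)` component, which carries forward part of the previous change.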
Other Issues
Neural networks assume the inputs are in a range from 0 to 1 (typically). Most neural network software automatically rescales input variables, but you should be aware of it. Also, categorical variables are replaced by dummy variables (discussed in Chapter 2) coded 1 or 0, indicating the presence or absence of the category. If you have categorical inputs with many categories (postal codes, telephone exchange codes, industry codes, or medical diagnostic groups), the neural network will use the large number of dummy fields as input, which can substantially increase the time and memory requirements. Given that many iterations are typically required for a neural network model to be fit, you should examine categorical predictors to determine the number of categories. If the number of categories for a variable is excessive, you should consider binning or collapsing the original categories into a smaller number of super-categories.
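The two preprocessing steps just described, rescaling numeric inputs into the 0 to 1 range and replacing categorical fields with 0/1 dummy fields, might be sketched as follows (illustrative helper functions; as noted above, neural network software normally performs this automatically):

```python
def min_max_scale(values):
    """Rescale a numeric field into the 0-1 range expected by the net."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def dummy_code(values):
    """Replace a categorical field with one 0/1 dummy field per category,
    indicating the presence or absence of that category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

incomes = [12000, 30000, 48000]
scaled = min_max_scale(incomes)                    # spans 0.0 to 1.0
dummies = dummy_code(["single", "married", "single"])
```

Note how a field with many categories (postal codes, say) would generate one dummy list per category, which is exactly the growth in inputs the text warns about.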
This stream has been prepared in advance, although we will run the analysis and examine the results. Notice there are two streams; the upper stream applies the neural network model to the training data set (train.txt), while the lower stream will send the separate validation data (validate.txt) through the generated model nodes for evaluation. First let's examine the data.

Right-click on the Table node in the upper left corner of the stream palette
Click Execute on the Context menu

Figure 4.9 Data File
There are 2,455 records in the training data file and we see values for the first few cases. The outcome field (Risk) has string values for its three categories. We saw in Chapter 3 that the Type node is very important when setting up a model in Clementine, and we examine it now.

Close the Table window
Right-click on the Type node
Click Edit on the Context menu
Figure 4.10 Type Node for Neural Network Model on Risk Data
Recall that the direction of a field determines how it will be used within Clementine models. To review:

IN: The field acts as an input or predictor in modeling.
OUT: The field is the output or target field in modeling.
BOTH: The field can act as both an input and an output in modeling (limited to Association Rule and Sequence Detection modeling; see Chapters 3, 8).
NONE: The field will not be used in modeling.
Notice that the ID field is declared as Typeless and its direction is None. Thus it will not be included in the neural network model. This is for good reason. With a large enough network topology, ID could perfectly (and trivially) predict risk in the training data. In addition, a dummy field would be generated for each unique ID in the training data. Since we have thousands of IDs, this would dramatically increase the scope of the analysis and would substantially slow down the estimation. Thus we would slowly obtain a trivial and unhelpful model. And there is no reason ID should be a good predictor, in any case. Risk is the outcome variable and all others (except ID) are inputs.

Close the Type window
Click the Neural Net node in the Modeling palette and place it in the stream canvas to the right of the Type node
Connect the Type node to the Neural Net node (click and drag with the middle mouse button, or simultaneously click the left and right buttons of a two-button mouse)
Double-click the Neural Net (named Risk) node

Figure 4.11 Neural Net Node
The name of the Neural Net node can be specified within the Custom Model name text box. By default, it takes the name of the output field. A feedback graph (selected by default in the Options tab) appears while the network is training and provides information on the current accuracy of the network. We will examine the feedback graph shortly. There are several different neural network algorithms available within the node (Method dropdown list). Some (Quick, Dynamic, Prune, and RBFN (radial basis function network)) were described earlier. We will demonstrate the default Quick method, but will summarize the performance of the others. For more details on the different methods, please see the Clementine User's Guide or the Clementine: Advanced Models course. Over-training is one of the problems that can occur within neural networks. As the data are passed repeatedly through the network, it is possible for the network to learn the
patterns that exist in the training sample only and hence over-train. That is, it will become too specific to the training sample data, and will lose its ability to generalize. By selecting the Prevent overtraining check box (the default), only a randomly selected portion of the training data is used to train the network. Once this sample of data (training data) has made a complete pass through the network, the rest is used as a test set to evaluate the performance of the current network. It is this test information that is used to determine when to stop training and provide feedback information. Thus training data are used to estimate the neural network weights, while test data are used to evaluate the model (accuracy) and determine when training ends. In addition, we have set aside a separate validation data set (second stream) to evaluate the model(s). We advise you to leave the Prevent overtraining option turned on. You can control how Clementine decides to stop training a network. The default option stops training when the network appears to have reached its optimally trained state. That is, the model performance in the test data sample seems to no longer improve with additional training cycles. Alternatively, you can set a required Accuracy, a limit to the number of Cycles through the data, or a Time limit in minutes. In this chapter we will use the Default option. Since the neural network initiates itself with random weights, the behavior of the network can be reproduced by using the Set random seed option with the same seed. Setting the random seed is not a normal practice, and it is advisable to run a neural network model several times to ensure that you obtain similar results using different random starting points. This influences both the starting weights and the splitting of the data file into training and test data sets.
Under the Options tab, Sensitivity analysis gives information on the relative importance of each of the fields used as inputs to the network. This is useful information, although it increases processing time, and is the default setting. Under the Expert tab, and then the Expert option button, are additional choices that allow you to further refine the properties of the training method (for example, the learning rate and momentum parameters briefly cited earlier). Expert options are detailed in the Clementine User's Guide. In this chapter we shall stay with the default settings on all but one of the above options, although if multiple models are built on the same stream, using different inputs or training methods, it may be advisable to change the network names. In order to reproduce the neural network results appearing in this example (recall that starting weights are randomly set), we set the random seed to 233 for this analysis.

Click the Set random seed checkbox
Type 233 in the Seed box
Click the Execute button
Figure 4.12 Feedback Graph During Neural Network Training
The graph shows two lines. The red, more irregular line, labeled Current Predicted Accuracy, represents the accuracy of the current network (current set of weights) in predicting the test data (as defined above). The blue, smoother line, labeled Best Predicted Accuracy, represents the best accuracy (and network) so far (again applied to the test data). The accuracy percentages for both the current and best performing networks are detailed in the legend. When the outcome field is categorical, this accuracy is simply the percentage of correct model predictions (in the test data sample). For a numeric outcome field (Clementine type Range), the accuracy for a prediction (as a percent) is 100 * (1 - |(target value - network prediction) / target value range|), which is averaged across the test data. Once trained, the network performs, if requested, the sensitivity analysis, and a golden nugget node appears in the Models tab of the Manager. This represents the trained network and is labeled with the network name.
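The accuracy formula for a numeric (Range) outcome can be written directly as a small function (an illustrative transcription of the formula quoted above, not Clementine code):

```python
def range_accuracy(target, prediction, target_range):
    """Per-record accuracy percentage for a numeric (Range) output
    field: 100 * (1 - |(target - prediction) / range of target|)."""
    return 100.0 * (1.0 - abs((target - prediction) / target_range))

# A prediction off by 10 units on a field spanning 100 units
# scores 90% for that record; these per-record values are then
# averaged across the test data.
score = range_accuracy(target=50.0, prediction=40.0, target_range=100.0)
```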
Right-click on the Neural Net Generated Model node in the Models tab
Click Browse on the Context menu
The information on the model in this window is collapsed, except for the Analysis section. Here we see that the predicted accuracy for this neural network is 70.949%, indicating the proportion of the test set correctly predicted. Below this is the final topology of the network: the number of neurons within each layer of the network. The input layer is made up of one neuron per numeric or flag (binary) type field. Those fields defined as sets (categorical with more than two categories) will have one neuron per value within the set. Therefore, in this example there are 9 numeric or flag fields and 1 set field with 3 values, giving a total of 12 input neurons. In this network there is one hidden layer, containing 3 neurons, and then the output layer, also containing three neurons for the three values of the output field, Risk. The Quick method selects the number of neurons in the hidden layer based on the number and types of the input and output fields. More input fields mean more neurons in the hidden layer. If the output field had been defined as numeric, then the output layer would only contain one neuron. To view the relative importance of the input fields from the sensitivity analysis, we need to expand that section of the model information.

Click on the Relative Importance of Inputs icon to expand the folder
The relative importance of each input field is listed in descending order. The figures range between 0.0 and 1.0, where 0.0 indicates unimportant and 1.0 indicates extremely important. In practice this value rarely goes above 0.35. Here we see that age, income and marital status are relatively important fields in this network. The generated model icon can be placed in the data stream, and data can be passed through it to create prediction and confidence fields and to understand how the model works. This has already been done.

Close the RISK model browse window
Right-click on the Table node connected to the Neural Net Generated Node in the upper stream
Click Execute on the Context menu
Scroll to the right so the last few columns are visible
Figure 4.15 Table Showing Prediction and Confidence Fields
The model node calculates two new fields, $N-RISK and $NC-RISK, for every record in the data file. The first represents the predicted value and the second a confidence value for the prediction. The latter is only appropriate for categorical outputs and will be in the range of 0.0 to 1.0, with the more confident predictions having values closer to 1.0. The confidence values are almost identical for the first few customers because they all have the same value of marital status and nearly identical annual incomes. Close the Table window
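Conceptually, such a prediction/confidence pair resembles taking the winning output neuron and a score derived from its activation. The sketch below is an assumption for illustration only; it is not Clementine's documented computation of $NC-RISK:

```python
def predict_with_confidence(activations):
    """Given one output-layer activation per category, return the
    predicted category (highest activation) and a confidence score.
    NOTE: using the winning activation itself as the confidence is an
    illustrative assumption, not Clementine's actual formula."""
    category = max(activations, key=activations.get)
    return category, activations[category]

pred, conf = predict_with_confidence(
    {"good risk": 0.72, "bad profit": 0.18, "bad loss": 0.10})
```

Records with similar input values (as noted above for marital status and income) would produce similar activations and therefore nearly identical confidences.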
Figure 4.16 Misclassification Table of Neural Network Predictions
The model is predicting about 90% of the bad but profitable risk individuals correctly, only 32% of those who belong to the bad loss category, and the good risk group at a rate between these two (62%). Thus if we want to correctly predict bad but profitable loans at the expense of the other categories, this would appear to be a reasonable model. On the other hand, if we want to predict those credit risks that are going to cause the company a loss, this model would only predict about a third of those correctly. We have established where the model is making incorrect predictions. But how is the model making its predictions? In the next section we shall examine a couple of methods that will help us to begin to understand the reasoning behind the predictions.

Close the Matrix window
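Row percentages like those read off a misclassification table can be computed as follows (a generic sketch with toy data, not the actual Risk results):

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """Tabulate counts of (actual category, predicted category) pairs."""
    table = defaultdict(int)
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table

def per_class_accuracy(actual, predicted, category):
    """Share of records truly in `category` that the model predicted
    correctly -- the row percentage of a misclassification table."""
    in_class = [p for a, p in zip(actual, predicted) if a == category]
    return sum(p == category for p in in_class) / len(in_class)

# Toy data: two categories, four records
actual    = ["good", "good", "bad loss", "bad loss"]
predicted = ["good", "bad loss", "bad loss", "good"]
```

Comparing per-class accuracies, rather than a single overall rate, is exactly what reveals the weakness on the bad loss category described above.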
The chart is normalized (by editing the Distribution node; not shown) so each marital status bar is the same length. The chart illustrates that the model is predicting almost all the divorced, separated or widowed individuals as bad but profitable types. The single individuals are associated with both good risk and bad but profitable types. The married individuals are split among all three types, with most of the bad loss types falling in this group. One of the important numeric inputs for this network is income. Since the output field is categorical, we shall use a histogram of income, with the predicted value as an overlay ($N-RISK), to try to understand how the network is associating income with risk.

Close the Distribution window
Right-click the Histogram node
Click Execute on the Context menu
Figure 4.18 Income and Predicted Credit Risk
Here we can see that the neural network is associating high income with good credit risk. There seems to be a break point in income at about 30,000 pounds (UK data), above which the model predicts good risk. The lower income ranges are split, roughly speaking, proportionally between the two bad credit types. By comparing this with a histogram using the actual output field as an overlay, we can assess where, in terms of income, the model is getting it wrong.

Close the Histogram window
Double-click on the Histogram node
Select Risk as the Overlay Color field (not shown)
Click Execute
Figure 4.19 Income and Actual Credit Risk
With this overlay histogram of the actual data, we see that there are some good credit risk individuals present at the lower end of the income scale (which is one reason the model is not predicting that group very accurately). And although the bad loss group has incomes generally below 30,000 pounds, we know from the misclassification table (Figure 4.16) that this group is being predicted poorly. Thus, there must be other factors in the model that we haven't yet discovered that lead to these mispredictions.
Model Summary
We have built a neural network that is very good at predicting bad but profitable types. The more important factors in making predictions are marital status, income, and age. The network appears to associate low income and those divorced, separated or widowed with individuals likely to be a bad but profitable credit risk. The neural network is associating high incomes with good credit types, but does not appear to be very successful in correctly identifying those who are classified as bad credit risks likely to create a loss for the company. It could be argued that this latter group of individuals is the most important to identify successfully, and, for this reason alone, the network is not achieving great success.
The overall performance of the neural network models on the original data was very similar, with the more complex networks doing slightly better. Oddly, performance improved for all networks when applied to the validation sample. Typically, there is a slight drop-off. The simplest of the neural network methods (Quick) did quite well, relatively speaking, on this data.
Other Features
There is no fixed limit to the size of data files used with neural networks, but since training a network involves iterative passes through the training data, most users of neural networks limit the number of predictors to a manageable number, say under 50 or so. The data are split into training, test and validation samples. Data values must be normalized (range constrained to be roughly 0 to 1) for neural networks to run effectively. Neural network software programs typically perform this automatically. Although stepwise methods are not available in most neural networks (because all the predictors are used to set up the model; pruned networks are an exception), measures suggesting the relative importance of the predictors and plots can be used to evaluate predictor variables. Neural networks can natively model non-linear and complex relationships.
Model Understanding
Neural networks produce models that are not easily understood or described. This is because the relationships are reflected in the (possibly) many weights contained in the hidden layer(s). Thus neural networks largely present a black box and not a model that can be readily described. This can be a disadvantage if you need to explain the model to others or demonstrate that certain factors were not considered. For example, race cannot be considered in the U.S. when issuing a loan or credit card and a demonstration of this may be required by Federal agencies.
Model Deployment
Applying the weights and threshold values (after normalization) to new data produces predictions. Although neural network prediction is more complex than scoring with regression models or decision trees, it can be incorporated into other programs or languages. Specifically, Clementine neural net models can be deployed via exported code to apply the coefficients within other programs. In sum, while not as easy to deploy as some of the other models considered, it can be done.
INTRODUCTION
Decision trees and rule induction methods are capable of culling through a set of predictor variables and successively splitting a data set into subgroups in order to improve the prediction or classification of a target (dependent) variable. As such they are valuable to data miners faced with constructing predictive models when there may be a large number of predictor variables and not much theory or previous work to guide them. In this chapter we use the terms rule induction and decision tree interchangeably. The different names are a result of strongly related techniques having been developed in two different research areas (artificial intelligence and machine learning versus applied statistics). The decision trees we view can be expressed as rules, and these decision rules (although not rulesets as discussed later in the chapter) can be represented as decision trees. We will discuss decision tree methods within the context of data mining and use the CHAID procedure in AnswerTree and the C5.0 rule induction method in Clementine to predict credit risk. Traditional statistical prediction methods (for example, regression, logistic regression or discriminant analysis) involve fitting a model to data, evaluating fit and estimating parameters that are later used in a prediction equation. Decision tree or rule induction models take a different approach. They successively partition a data set based on the relationships between predictor variables and a target (outcome) variable. When successful, the resulting tree or rules indicate which predictor variables are most strongly related to the target variable. They also find subgroups that have concentrations of cases with desired characteristics (e.g., good credit risks, those who buy a product, those who do not have a second heart attack within 30 days of the first). Decision trees can use many predictors, but they are not true multivariate models in the classic sense of parametric statistics. 
That is, they do not simultaneously assess the effect of one variable while controlling for the others. Instead, their general approach is to find the best single predictor of the dependent variable at the root of the tree. Finding this predictor usually involves recoding or grouping together several of the original values of the predictor to create at least two nodes. Each node now defines a new branch of the tree being created. Within each branch, the process repeats itself: the algorithm looks for the best predictor among the set of variables and again creates at least two nodes with that best predictor. When no predictor can be found that improves the accuracy of prediction, the tree can be grown no further. Some decision tree methods also prune the tree after initial growth. Decision tree models may have the following advantages over traditional statistical models. First, they are designed to handle a large number of predictor variables, in some cases far more than a corresponding parametric statistical model. Second, some tree-based models (C5.0 and C&RT in particular) are entirely nonparametric and can capture relationships that standard linear models do not easily handle (nonlinear relationships, complex interactions). They (especially C5.0 and C&RT) share this characteristic with neural networks and kernel density methods (robust regression). For researchers needing to explore their data, decision tree methods are designed to automatically perform the successive splitting operations until stopping criteria are reached. They thus provide a means of ransacking a data set (to borrow Leo Goodman's expression) to identify key relationships. The downside of this is that care must be taken to avoid false positives (cross-validation is recommended) and trivial results. These characteristics are common to decision tree models. Below we see a decision tree derived from a marketing study of the factors that influence a person's decision to subscribe to an interactive news service.

Figure 5.1 Decision Tree with Age As First Predictor
At the root (top) node, the yes and no responses are about evenly divided. However, after splitting the sample based on age (<= 44, > 44), we have two very different subgroups. Of those over 44, over 67% respond yes, while almost 60% of those 44 or under say no to the offering. For those 44 or under, a second split is performed based on income level; those with higher income are more likely to accept the offering. Decision tree methods perform successive splitting of the data using predictor variables guided by a criterion (statistical or information-based). Next we examine some rules produced by Clementine's C5.0 rule induction procedure.
Figure 5.2 Rules in Decision Tree Format for C5.0
The rules are expressed as branching If…Then statements. They describe the conditions that lead to a predicted outcome. The first line indicates that if the number of (other) loans is 0, then predict good risk. In addition, it indicates that there were 335 cases in this subgroup and the rule predicted 59.1% (.591) of those cases correctly. If the number of loans is greater than 0, then additional conditions apply (involving the number of store credit cards, income and, again, the number of loans). These rules, in principle, could be represented in a graphical decision tree.
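Transcribed as code, such a ruleset is simply nested if/then logic. Only the first rule (loans = 0, predict good risk) comes from the text; the remaining conditions and thresholds below are hypothetical placeholders standing in for the rules the text mentions but does not enumerate:

```python
def predict_risk(loans, store_cards, income):
    """A C5.0-style ruleset written as nested if/then statements.
    The first branch is quoted in the text (335 cases, 59.1% correct);
    everything after it is an illustrative placeholder."""
    if loans == 0:
        return "good risk"
    # Hypothetical continuation: further conditions on store credit
    # cards, income, and loans, as the text describes in outline only.
    if store_cards > 2 and income < 20000:
        return "bad loss"
    return "bad profit"
```

Because every path through the if/then statements corresponds to one root-to-leaf path, the same logic could indeed be drawn as a decision tree.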
categories to be selected, lengthy processing time). C5.0 is the most recent version of a machine-learning program with a long pedigree (ID3, C4.0, C4.5). SPSS saw value in each method and decided to offer them all. Often several of the methods can be applied to answer a question, and we encourage you to run your analyses using the different methods and compare the results, especially in terms of how well they classify on a validation sample.
CHAID | Multiple | Yes(1) | No(2) | No | Statistical | Moderate | Missing becomes a category | No | AnswerTree
C&RT | Binary | Yes | Yes | Yes | Impurity measure | Slower | Surrogates(4) | Yes | AnswerTree, Clementine
(1) SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target variables.
(2) Continuous predictors are converted into ordinal variables containing (by default) 10 approximately equally sized categories.
(3) If misclassification costs are symmetric, they can be incorporated using priors.
(4) Fractionalization and surrogate (substitute predictor) methods advance a case with a missing value through the tree.
One important point is that if you wish to use decision tree analysis to predict a continuous variable, then you are limited to the C&RT procedure in Clementine and AnswerTree, or a variant of the CHAID method within AnswerTree. Note that C&RT and QUEST produce binary splits when growing the tree, while C5.0 and CHAID can produce multiple groups when splitting occurs. Also, missing values are handled in two different ways. CHAID allows missing values to be considered as a single category and used in the analysis (refusing to answer an income question may be predictive of the target variable). C&RT and QUEST use the substitute (surrogate) variable whose split is most strongly associated with that of the original predictor to direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in proportion to the distribution of the predictor variable and passes a weighted portion of the case down each tree branch. So these latter methods do not treat missing as a separate category that might contain predictive information. The methods also differ in whether statistical tests or other criteria are used when selecting predictor variables.
CHAID ANALYSIS
Chi-square automatic interaction detection (CHAID) is a heuristic tree-based statistical method that examines the relations between many categorical, ordinal or continuous (which are grouped or binned into ordered categories) predictor variables and a categorical outcome (target) measure. It provides a summary diagram depicting the predictor categories that make the greatest difference in the desired outcome. For example, in a segmentation context, CHAID can give you information about the combinations of demographics that yield the highest probability of a sale.
variables and the outcome, and test for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID will select the predictor that is most significant (smallest p value). If a predictor has more than two categories (as region does), CHAID compares them and collapses together those categories that show no differences on the outcome. For example, if the Midwest and West do not significantly differ in the percentage of customers who upgrade, then these two regions will be collapsed together (the least significant are merged first). Thus categories based on those regions (or merged regions) that significantly differ from the others are used in computing the significance tests for predictor selection. Let's say the best predictor is region, so the split occurs on that variable. CHAID then turns to the first new region category (say Midwest/West), and for these observations, it examines the predictor variables to see which makes the most significant difference (let's suppose it is gender) for this subgroup. It then splits the Midwest/West subgroup by gender. It would then examine each of the other region subgroups and, if the criteria are met, split each region subgroup by the predictor most significantly related to the target within that subgroup (which might not be gender). CHAID would then drop down a level in the tree and take the first gender group (say females) within the Midwest/West subgroup and see if any of the predictor variables (including region) make a significant difference in outcome (since gender is a constant for this subgroup, all female, it would not be considered). If no remaining predictor variable makes a significant difference for the Midwest/West-female subgroup, or a stopping rule is reached, CHAID declares this a terminal node and examines the Midwest/West-male subgroup in the same way.
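The predictor-selection step rests on a chi-square independence test between each predictor and the outcome. Below is a bare-bones sketch of the Pearson chi-square statistic only; real CHAID works with p-values, category merging, and Bonferroni adjustments, none of which are shown here, and the counts are toy numbers:

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a
    list of rows (one row per predictor category, one column per
    outcome category)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy illustration: counts of upgrade yes/no within two predictor groups
region_table = [[30, 70], [60, 40]]   # outcome clearly differs by group
gender_table = [[48, 52], [52, 48]]   # nearly independent of outcome
```

With equal degrees of freedom, the predictor with the larger statistic (region here) has the smaller p value and would be selected for the split.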
Thus, level by level, CHAID systematically splits the data file into subgroups (called nodes) that show significant differences as they relate to the outcome measure. The results of this process are displayed in a tree diagram that branches out as additional splits are made. As noted above, CHAID can split more than once on the same variable in a branch if that variable becomes the best predictor at a subsequent node.
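The level-by-level splitting just described can be sketched as a recursion (a generic skeleton, not AnswerTree's implementation; the caller-supplied `best_split` function stands in for CHAID's significance-based predictor selection and stopping rules):

```python
def grow_tree(records, predictors, target, best_split,
              depth=0, max_depth=3):
    """Recursive partitioning skeleton: pick the best predictor at this
    node, split into subgroups, and repeat within each branch until no
    split is found or a depth limit (a stand-in stopping rule) is hit.
    `best_split` returns (predictor_name, {value: subset}) or None."""
    split = best_split(records, predictors, target)
    if split is None or depth >= max_depth:
        return {"leaf": True, "records": records}
    predictor, groups = split
    return {"leaf": False, "predictor": predictor,
            "branches": {value: grow_tree(subset, predictors, target,
                                          best_split, depth + 1,
                                          max_depth)
                         for value, subset in groups.items()}}

def split_on(field):
    """Trivial best_split for demonstration: split on one named field
    whenever it has more than one distinct value."""
    def best_split(records, predictors, target):
        groups = {}
        for r in records:
            groups.setdefault(r[field], []).append(r)
        return (field, groups) if len(groups) > 1 else None
    return best_split

tree = grow_tree([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}],
                 ["x"], "y", split_on("x"))
```

Each recursive call corresponds to dropping down one level of the tree diagram, exactly as the text's Midwest/West and gender example walks through.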
composed of the data file and any analysis trees based on the file. We will start our new project from the Startup menu, but could also do so directly from the AnswerTree parent window (click File…New Project).

Click Start a new project option button in AnswerTree dialog (not shown)
Click OK
Click SPSS data file (.sav) option button in New Project dialog
Click OK
Move to c:\Train\DM_Model directory
Double-click Train

Figure 5.3 AnswerTree Parent Window
The AnswerTree Parent window contains a Project window that has an entry (Project 1), indicating that a default project data file has been opened. The Project window provides a way of referencing different tree analyses done in the course of the project (analyses linked to a single data file). However, the details and specifications for a CHAID (or other tree) analysis are provided elsewhere (Tree Wizard dialogs, Analysis and Tree menus, and the Tree window). The project is assigned the temporary name Project 1 (see the title bar of the AnswerTree parent window or the entry in the Project window). Since no analyses have been run yet, there are no tree icons in the Project window. If you wish a spreadsheet view of the data, simply click View…Data after a root node is present in the Tree window (meaning the data file has been read). Let's assume the data have been checked in SPSS and proceed to set up the model for CHAID analysis.
A Tree Wizard window also opened with our new project. The Tree Wizard, composed of four dialog boxes, steps you through the analysis setup. First the Tree Wizard asks which decision tree method we want to apply to our data. CHAID is the default method.

Figure 5.4 Tree Wizard: Growing Method (Step 1 of 4)
We will be running a CHAID analysis; the other AnswerTree models are also available as Growing Method option buttons, each accompanied by a brief description. For more details on QUEST and C&RT, see the AnswerTree User's Guide.

Click the CHAID option button (if not already selected)

Now we will set up the analysis.

Click Next
Click and drag Credit Risk into the Target list box
Click and drag all variables except ID into the Predictors list box
Figure 5.5 Tree Wizard: Model Definition Dialog (Step 2 of 4) Completed
The Tree Wizard Model Definition dialog displays the SPSS variable labels, ordered by original position in the file, for all variables in the SPSS data file (Train.sav). In order to perform a CHAID analysis we must indicate the target (dependent) and predictor (independent) variables. The icon beside each variable identifies the variable's measurement level. A variable can be nominal, ordinal, or continuous. You can change display characteristics of the variable list and modify a variable's measurement level from this dialog (measurement level can also be changed later using the Measurement Level dialog available from the Analysis menu). In addition to list boxes for Target and Predictors, there are list boxes for Frequency and Case Weights. A Frequency variable is needed when aggregate data are used, that is, a data file in which each record represents not an individual, but a subgroup summary. In such files the Frequency variable contains the number of observations that each record represents. In addition, a Case Weight variable can be used if each observation is not to be equally weighted (the default) in the analysis. Such weights can be used when the sample proportions do not reflect the population proportions.

Click Next
An important step in decision tree analysis is to validate the results on data other than those used to build the model. AnswerTree offers two methods for this: (1) partitioning the data into training and validation samples, and (2) n-fold validation. These options were briefly discussed in Chapter 1. We will partition the data so that 75% goes to training and 25% (about 600 cases) to the testing sample. We actually have separate training and validation data files for this study (named Train and Validate); we could have combined them beforehand, which would have allowed more data for the analysis.

Click the Partition my data into subsamples option button
Move the slider control so that the Training sample is at 75% and the Testing sample is at 25%
Replace the Random Seed value with 233

Figure 5.6 Tree Wizard: Validation (Step 3 of 4) Dialog Box Completed
Click Next

At this point, you can examine and change some advanced analysis options or proceed to create a Tree window using default settings. We will examine the default CHAID advanced option settings now, changing some of them.

Click the Advanced Options pushbutton
The first tab within the Advanced Options dialog controls when tree growth will stop. Since this tab applies to methods other than CHAID, one choice is inactive. Maximum Tree Depth sets a maximum for the number of levels deep (below the root node) a tree can grow. The default (3) is useful when there are many variables and you want to identify the most important few. Deeper trees allow more detailed analysis and possibly a better solution, but correspondingly require more computing resources to complete. For our analysis, we leave the depth at 3. The Parent node minimum number of cases is the minimum size a node (subgroup) must be before any splitting can occur. The default (100) is reasonable for large data files (many thousands), but for some market research and segmentation studies (other than direct mail where samples are typically very large) it must be scaled down accordingly. Since our training data set will contain about 1,850 observations, we will reduce the minimum parent node value to 40. The Child node minimum number of cases is the minimum size of nodes (subgroups) resulting from a split. This means that a node will not be split further if any of the resulting nodes would contain fewer cases than this value. By default it is 50 and, for the reason outlined above, we will reduce it to 15. We are setting the numbers a bit on the low side (especially the minimum number of cases for the Child node) to better demonstrate how the CHAID model runs (by increasing the chances of a larger tree).
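The parent and child minimums just described amount to a simple check before any split is attempted. A minimal sketch (a hypothetical helper for illustration, not AnswerTree code):

```python
# Sketch of the parent/child minimum-size stopping rules described above,
# using the values we will enter for this analysis.
PARENT_MIN = 40   # a node must have at least this many cases to be split
CHILD_MIN = 15    # every child produced by a split must have at least this many

def split_allowed(node_size, child_sizes):
    """Return True if splitting a node into the given children is permitted."""
    return node_size >= PARENT_MIN and all(c >= CHILD_MIN for c in child_sizes)

print(split_allowed(120, [60, 45, 15]))  # True
print(split_allowed(120, [108, 12]))     # False: one child below the minimum
print(split_allowed(35, [20, 15]))       # False: parent below the minimum
```

Maximum tree depth is enforced separately, by simply refusing to split nodes at the bottom level.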
Enter 40 into the Parent node text box
Enter 15 into the Child node text box

The CHAID tab sheet controls the statistical criteria used by CHAID. The Costs tab sheet allows you to supply misclassification costs, which will influence case classification (prediction) in CHAID, though not tree growth. By default, all misclassification costs are assumed equal. The Intervals tab sheet allows you to examine or modify the intervals that CHAID will use to convert continuous predictors into ordinal form.

Click OK to process the Advanced Options
Click Finish

Figure 5.8 Tree Window with Root Node
The Tree window now opens and displays the root node, which represents the entire training sample. The target variable label (Credit risk) appears above the node along with a notation that the training data (not the testing data) are displayed. Within the root node, we can see the overall counts and percentages of responses to the credit risk target variable. Overall, 22.62% of the sample were bad risks with a loss of money, 59.94% were bad risks but profitable, and 17.45% were good risks. These are the base rates for the training sample, and the CHAID analysis will create splits using the predictor variables most strongly related to credit risk. Notice also that a tree icon labeled Tree 01 RISK has been added to the Project window (not shown). It represents the tree corresponding to this analysis. Multiple trees (analyses) can be included in a single project.
The Tree window is composed of five tabbed sheets: Tree, Gains, Risk, Rules, and Summary. Several of these are of interest after the analysis is run (after growing occurs), and we will examine them. The most important menus within the Tree window are Analysis, which controls features and criteria for the analysis, and Tree, which runs the analysis (Tree…Grow Tree) and allows you to customize the resulting tree by selectively growing or pruning branches. The View menu will open up additional windows: a Tree Map window that aids navigation within a large tree (open by default) and the Data Viewer window. This menu also contains options to present different views or summaries of the selected node (a graph or table). Many of these features are also available via tool buttons. To perform the full analysis we will choose Tree…Grow Tree. Another Tree menu choice allows you to grow the tree one level at a time. You can also grow a branch from a node you select (since only the root node is present, the tree and branch choices are equivalent). Although we will not pursue it here, you can direct CHAID in growing the tree (Select Predictor), and modify the tree CHAID creates (Remove Branch, Remove One Level, Select Predictor).

Click Tree…Grow Tree
Maximize AnswerTree to see a larger Tree window

From this window, we see the tree has grown at least two levels deeper. The entire tree is not visible within this screen and we can use the scroll bars to adjust the view. For larger trees a tree-navigation window called the Tree Map window is very helpful and we will use it later. Also note that you can use the zoom tools to increase or decrease the size of the tree (or click View…Zoom).
Figure 5.9 CHAID Tree With Root Node Visible and One Node at First Split
As mentioned earlier, the root node represents the entire sample (there are no missing data for the credit risk field) and the base rate for credit risk. The first split (below the root node) is due to age; in other words, of all the predictor variables, age had the strongest (most significant, as measured by chi-square) relationship with credit risk. Although only one of the four age groups is visible above, we see that those over 39 but 45 or younger had proportionately more bad losses (42.85%) than the two younger groups (7.52% and 22.28%). Thus by focusing on the younger subgroups, we could reduce the proportion of bad loss risks and increase the proportion of bad but profitable risks relative to the other, older age categories (scroll right and left to examine the four age groups). The significance value and chi-square statistic summaries appear above the split. By default, CHAID originally formed ten age categories, and these were merged into four distinct age categories.

Scroll left and down to see the split below the Age (25,39] node.
Figure 5.10 CHAID Tree with Second Split
The node (subgroup) composed of those over 25 and 39 or younger is in turn split into four nodes based on income. Those in the higher income categories were more likely to be good risks (over 50%) and less likely to be bad but profitable risks than those in lower income categories. Thus of the remaining predictor variables, income was the most significant predictor for this age subgroup (25,39]. If the tree is small, you can scroll through it and examine the terminal nodes for high or low concentrations of the response of interest. It is at the terminal nodes where predictions are made by the tree. You would typically examine the tree carefully to make sure the splits and results make sense. We will move on to examine the other summaries. What of our credit risk segments? Where are the highest concentrations of good risks or of bad but profitable risks? Instead of searching the tree, a more convenient summary would be a table listing the nodes in decreasing order by the percentage of cases falling in the target category of interest. Thus you can easily choose the top-so-many subgroups to focus on. To request this gains table:

Click the Gains tab (at bottom of the Tree window)
Click Format…Gains
Select good risk in the Gain Column Contents Category list box (not shown)
Click OK
Figure 5.11 Gains Table for Good Risks (Training Data)
Before examining the details of this table, notice that the third title line lists the target variable label (Credit risk) and target category (good risk). By default, the target category with the highest code appears, and we changed it to examine the good risk category. The table is now organized by, and percentages displayed for, the good risk category. The left half of the table (Node-by-Node) separately summarizes each terminal node (segment), while the right half (Cumulative Statistics) displays cumulative summaries. A row represents a segment (or a terminal node on the tree) and the sixth column (Resp %) contains the percentage of that group who are good credit risks. In the first row (node 9) we see 63.2% of the group represent good credit risks. Notice the table is sorted in descending order by this column. Thus you can easily pick the top segments. The first column contains an identification number for the node (it corresponds to the node number in the Tree Map window (not shown)). The Node: n column reports the number of observations in the segment (node) and the Node: % column displays the segment (node) size as a percentage of the entire sample. The Gain: n column displays a count of the target category responses occurring within the node, and the Gain: % column indicates the percentage of all target category responses that are contained in this node. Node 9 (which from the tree are those over 39 but 45 or younger with at most 1 credit card) contains 13.3% of all good credit risk individuals, although it represents only 3.7% of the training sample. Finally, the index is a ratio (multiplied by 100) of the node's percentage of target category responses to the overall sample's percentage. The segment with the highest percentage was those over 39 but 45 or younger with at most 1 credit card, at 63.2%, while the entire sample had a good credit risk percentage of 17.45%; the ratio of these (multiplied by 100 and rounded) is 362.4.
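The node-by-node columns just described can be reproduced with simple arithmetic. The counts below are approximations inferred from the percentages quoted above (the exact node counts are not reproduced in the text), so the index comes out near, not exactly at, the 362.4 in the table:

```python
# Sketch of the gains-table columns for one segment, using approximate
# node 9 figures (counts are hypothetical reconstructions, for illustration).
total_n = 1850          # training sample size (approximate)
total_resp = 324        # good credit risks in the training sample
node_n = 68             # cases in node 9 (about 3.7% of the sample)
node_resp = 43          # good credit risks in node 9 (about 63.2%)

node_pct = 100 * node_n / total_n          # "Node: %"  column
resp_pct = 100 * node_resp / node_n        # "Resp %"   column
gain_pct = 100 * node_resp / total_resp    # "Gain: %"  column
overall_pct = 100 * total_resp / total_n   # base rate for good risks
index = 100 * resp_pct / overall_pct       # "Index (%)" column

print(round(resp_pct, 1), round(gain_pct, 1), round(index))  # 63.2 13.3 361
```

The index thus directly expresses how much richer a segment is in the target category than the sample as a whole (100 = no better than average).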
The summaries on the right side of the chart present the same statistics accumulated across the segments (terminal nodes). Using them you can see, for example, what the overall percentage of good credit risks in the top three segments (nodes) is. These results are for the training data; to see a similar table for the validation (testing) data simply click View…Sample…Testing. We next examine the training sample's assignment rules.
If we focus on the section pertaining to node 9, we see that those over 39 but 45 or younger (AGE GT 39 AND AGE LE 45) with at most 1 credit card (NUMCARDS LE 1) are assigned a node (nod_001 variable) value of 9. In addition, their predicted target category (based on the most common response category in the node) is stored in the variable pre_001 (here 3, corresponding to the numeric code for good risk). The predicted probability that a node member falls in this category (here 63.2% or .632) is assigned to the prb_001 variable. The case assignment format provides more detailed information than the selecting cases format.
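The node 9 rule just described can be written as a small function. The field names follow the rule text, and the returned dictionary mimics the nod_001, pre_001, and prb_001 case-assignment variables; this is an illustrative translation, not code generated by AnswerTree:

```python
# The node 9 assignment rule quoted above, written as a Python function.
def assign_node9(age, numcards):
    """Return the case-assignment values for node 9, or None if the rule
    does not apply (AGE GT 39 AND AGE LE 45 AND NUMCARDS LE 1)."""
    if 39 < age <= 45 and numcards <= 1:
        return {"nod_001": 9,       # node number
                "pre_001": 3,       # predicted category (3 = good risk)
                "prb_001": 0.632}   # predicted probability of that category
    return None

print(assign_node9(42, 1))   # falls in node 9
print(assign_node9(42, 3))   # None: too many credit cards
```

A full scoring program would chain one such test per terminal node, in order, until a rule fires.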
Figure 5.13 Risk Summary for Training Data
In the misclassification table, the columns correspond to the actual target categories in the data and the rows represent the target category predictions using the classification rules. For each actual target category, we can see how often the model predicts the correct outcome and how the errors are distributed among the other categories. Of the 324 actual good risks, the model correctly predicts 198, or about 61% correct. However, notice that only 166 of the 420 bad risks with loss individuals were predicted correctly (about 40%). Depending on which target category is most important to predict correctly, you might evaluate this result differently. The risk estimate, which is the overall misclassification or overall error rate when equal misclassification costs and no target-category prior probabilities are specified, is about .28. Thus, overall, about 72% of the training cases are predicted correctly. The misclassification table is important because it indicates the frequencies of the different types of errors made using the classification rules. In turn, the risk estimate provides a valuable estimate of the overall misclassification risk, here an error rate, when using the classification rules. Let's examine the same table for the validation data.

Click View…Sample…Testing
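Under equal misclassification costs, the risk estimate is simply one minus the overall proportion of correct predictions. The counts below are hypothetical (the full table is not reproduced here) but chosen to be consistent with the figures quoted above:

```python
# Sketch of the risk estimate (overall error rate) from a misclassification
# table. Counts are hypothetical but consistent with the quoted figures.
correct = {"bad loss": 166, "bad profit": 968, "good risk": 198}   # diagonal
totals  = {"bad loss": 420, "bad profit": 1109, "good risk": 324}  # actual counts

n = sum(totals.values())                   # about 1,850 training cases
risk = 1 - sum(correct.values()) / n       # overall error rate
print(round(risk, 2), round(1 - risk, 2))  # 0.28 0.72
```

With unequal misclassification costs, each error would instead be weighted by its cost before summing.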
Figure 5.14 Risk Summary for Test Data
The misclassification table for the 598 test cases is very similar to the training data results (the risk estimate is .276 for the training data versus .284 for the validation (testing) data). This validation step gives us greater confidence in the model. We could also examine the gains table for the testing data. The Summary tab (not shown) presents a summary of the information needed to reproduce the analysis. This includes the data file name, target and predictor variable information, tree growth criteria, and the size of the tree. This serves as a useful document recording how the solution was obtained. If our goal were to predict good credit risks, the model does this fairly well. However, since some of the predicted good risks are actually bad risks with losses, further evaluation based on the expected profit or loss from this model should be performed. Also, a domain expert should review the rules or the tree to ensure that they are reasonable.

Click File…Exit AnswerTree
Click No when asked to save the project
Figure 5.15 Clementine C5.0 Streams for Training and Validation Data
The upper stream reads the training data (train.txt in text format), passes it through a Type node, and then to the C5.0 model node. The analysis was run previously, so we already have a generated model node in the stream (based on the C5.0 model), which leads to a Matrix node (misclassification table) and an Analysis node, which provides useful summaries of model fit. The lower stream passes the separate validation data (validate.txt) through the model so we can validate the results. Although the generated node from the C5.0 model already appears in the stream, we will rerun the model.

Right-click on the Table node in the upper stream
Click Execute on the Context menu
We are fitting the same data as we did earlier in this chapter using CHAID.

Close the Table window
Double-click on the Type node in the upper stream

Figure 5.17 Type Node
Risk is declared as the output field and all fields except ID are inputs to the model. This Type node is identical to the one used when we applied neural networks to the same data (Chapter 4).

Close the Type node dialog
Double-click the C5.0 node

Figure 5.18 C5.0 Model Node Dialog
For a field (variable) that has been defined as a set (categorical with more than two categories), C5.0 will normally form one branch per value in the set. However, by checking the Group symbolics option, as has been done here, the algorithm will search for sensible groupings of the values within the input field (ones that show no difference in predicting the outcome risk), thus reducing the potential number of rules. For example, instead of having one rule per region of the country, Group symbolics may produce a rule such as:

Region [North, East]
Region [South]
Region [West]

Once trained, C5.0 builds a decision tree or a ruleset that can be used for predictions. However, it can also be instructed to build a number of alternative rule models for the same data by selecting the Use boosting option. In each boosted model, cases mispredicted by the previous model are given greater weight. Then, when it makes a
prediction, it consults each of the rules before making a decision. This can often provide more accurate predictions, but takes longer to train, and a single decision tree no longer represents the model. The algorithm can be set to favor either Accuracy or Generality. Since we want a model that applies to other cases, we chose Generality.

Click Execute

A new generated model node (not shown) appears in the Models tab of the Manager window.
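The boosting idea just described can be sketched as follows. This shows the general scheme (refit with extra weight on mispredicted cases, then take a majority vote over the rounds), not C5.0's exact boosting algorithm; the decision-stump learner and the data are invented for illustration:

```python
# General sketch of boosting (illustrative; not C5.0's exact algorithm).
def boost(cases, labels, fit, rounds=3):
    """fit(cases, labels, weights) must return a predict(case) function."""
    weights = [1.0] * len(cases)
    models = []
    for _ in range(rounds):
        predict = fit(cases, labels, weights)
        models.append(predict)
        # cases mispredicted by this round get greater weight in the next
        weights = [w * (2.0 if predict(c) != y else 1.0)
                   for w, c, y in zip(weights, cases, labels)]
    def vote(case):
        preds = [m(case) for m in models]
        return max(set(preds), key=preds.count)  # majority vote over rounds
    return vote

def stump_fit(xs, ys, ws):
    """One-split 'decision stump': the threshold with least weighted error."""
    best_t, best_err = None, None
    for t in xs:
        err = sum(w for x, y, w in zip(xs, ys, ws) if (x > t) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: x > best_t

xs = [1, 2, 3, 4, 5, 6]
ys = [False, False, False, True, True, True]   # True above 3
classifier = boost(xs, ys, stump_fit)
print(classifier(2), classifier(6))  # False True
```

Because each round focuses on the previous round's errors, the combined vote can correct mistakes that any single tree makes, at the cost of losing the single-tree representation noted above.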
Right-click the C5.0 Generated Model node in the Models tab
Click Browse from the Context menu
Click the Show all levels button to see all branches of the tree
Click the Show or hide instance and confidence figures button to see those values

Figure 5.19 C5.0 Rule Browsing Window (Beginning)
Information on the number of records (Instances) used to generate each branch, and the level of Confidence attached to a rule's conclusions, now appears since we requested them with the Show or hide instance and confidence figures button. The first branch informs us that 335 individuals had no other loans (0), that the prediction for individuals in this branch is good credit risk (which is also the modal response), and that the confidence of this prediction (here the proportion of cases in the subgroup that are good credit risks) is .591 (59.1%). Additional conditions are added to the group with at least one other loan, and we see that income (which was an influential predictor in the neural network model and appeared in the CHAID tree as well) appears in the model, as do number of children and number of credit cards, among other fields. Notice that some of the branches have small coverage values (9, 7, or 11 cases). By default, the minimum number of records per branch is set to 2 (meaning at least two branches from the node must have at least two cases each), although our choice to favor generality increased it a bit. To try to avoid some of these very small branches, which may not generalize well to other data, you could click the Expert option button in the C5.0 Model Node Editing dialog and change this setting. You can examine the rules in much the same way as we examined the nodes within the decision tree that CHAID produces. The information in the browsing window is a decision tree, but presented using If-Then logic instead of a graphic. To examine the accuracy of prediction we can use a Matrix node or an Analysis node.

Close the Rule Browser window
Double-click the Matrix node in the upper stream
We place Credit risk (Risk) in the Rows and predicted credit risk ($C-RISK) in the Columns. Counts and row percents based on Risk will appear in the body of the table, as set in the Appearance tab (not shown).

Click Execute
Figure 5.21 Misclassification Table for Training Data
The overall pattern is similar to what we found with the CHAID model. Of the 421 actual good risks, the model correctly predicted 281 (66.7% correct). The comparable CHAID figure for the training data was about 61% (although it was based on fewer cases due to the validation (testing) sample). Also, this model correctly predicts 50.6% (283 of 559) of the bad losses.

Close the Matrix output window
Right-click on the Analysis node in the upper stream
We requested that the analysis of the prediction results be broken down by risk category (not shown). In this way a summary for each risk category will appear in addition to the overall summary.

Click Execute on the Context menu

The window that opens is quite large and you may need to resize it.
The overall results appear first (accuracy of 76.74%), which did not appear in either the C5.0 model browsing window or the matrix. A report on the confidence values is supplied, including the mean confidence when correct and incorrect predictions are made (the first mean should be higher than the second). Then the results are broken down by each risk category.
The results look very similar to those for the training data (a good sign for the model). For the actual good risks, 234 of 383 were correctly predicted (61.1%). This is lower than in the training sample (66.7%). Again, some drop-off is to be expected when shifting to validation data.

Close the Matrix window
Right-click the Analysis node in the lower stream
Click Execute from the Context menu

The overall accuracy for the validation data is 76.2%, which is very comparable to the 76.7% for the training data.
Figure 5.24 Analysis Summary for Validation Data
Although we did not evaluate the CHAID model on the separate validation file used to validate the neural net and C5.0 results, the accuracy values of the CHAID model and C5.0 appear very similar. Unless they differ in accuracy for a target category of critical interest to you, either model should suffice.
Other Features
There are several methods available to grow a tree, and decisions to be made about the minimum number of cases in a new node, about statistical criteria, and so forth. In addition, analysts have often found it best to grow a very large tree and then prune its
branches, to make it simpler but still retain a high level of accuracy (C5.0, C&RT, and QUEST do this). Decision trees have been used on files of all sizes, although in general as the number of potential predictors increases, the number of cases should increase as well. Otherwise, only a few predictors have any chance of being chosen. Decision trees are often used on extremely large files (millions of records) and perform quite well. When files are large, the time to create a tree can be long, but is generally shorter than for a neural network. It is essential that the results be validated on a test data set, since no global statistical test is being done to assess the overall fitness of a solution. Also, as with many classic data-mining methods, it is common to try several different algorithms, and variations within those methods, to find the best solution, defined primarily by predictive accuracy on validation data.
Model Understanding
Decision trees produce easily understood rules that describe the predictions they make in If-Then statements. No overall equation is available to describe the model, but that is not the intent of these methods. The tree itself is also helpful in illustrating the solution, though a tree with many predictors becomes quite cumbersome to display.
Model Deployment
Predictions for new cases are made by applying simple rules, and SQL statements can be created to represent the rules. In this way, a decision tree can be applied directly to the cases in a data warehouse. This makes deployment extremely efficient on new data, and it is often possible to make predictions in real time.
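As a sketch of this kind of deployment, a tree rule can be mechanically turned into a SQL statement. The rule below is the hypothetical node 9 rule discussed earlier in the chapter, and the customers table name is an assumption for illustration:

```python
# Illustrative sketch: turning a tree rule into SQL for in-database scoring.
# Rule conditions are (column, operator, value) triples; "customers" is an
# assumed table name.
rule = [("AGE", ">", 39), ("AGE", "<=", 45), ("NUMCARDS", "<=", 1)]

where = " AND ".join(f"{col} {op} {val}" for col, op, val in rule)
sql = f"SELECT *, 'good risk' AS predicted FROM customers WHERE {where}"
print(sql)
```

A full deployment would emit one such predicate per terminal node, typically combined into a single CASE expression so every record receives a prediction.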
Cluster Analysis
INTRODUCTION
Cluster analysis is an exploratory data analysis technique designed to reveal natural groupings within a collection of data. The basic criterion used for this is distance, in the sense that cases close together should fall into the same cluster, while observations far apart should be in different clusters. Ideally the cases within a cluster would be relatively homogeneous, but different from those contained in other clusters. As cluster analysis is based on distances derived from the fields in the data, these fields are typically interval, ordinal (for example, Likert scales that are considered a close enough approximation to an interval scale), or binary in scale (for example, checklist items). When clustering is successful, the results suggest separate segments within the overall set of data. Clustering is available in SPSS under the Analyze…Classify menu, and in Clementine through the Kohonen network (which is a form of neural network analysis), the K-means, and the TwoStep nodes (the latter two are also available in SPSS). Given these characteristics, it is not surprising that cluster analysis is often employed in market segmentation studies, since the aim is to find distinct types of customers toward whom more targeted and effective marketing and sales action may be taken. In addition, for modeling applications, clustering is sometimes performed first to identify subgroups in the data that might behave differently. These subgroups can then be modeled separately, or the cluster membership variable can be included as a predictor. Also, note that one of the neural network methods, the radial basis function network (RBFN), performs clustering within the hidden layer and thus partially incorporates this principle. Keep in mind that cluster analysis is considered an exploratory data method. Expecting a unique and definitive solution is a sure recipe for disappointment. Rather, cluster analysis can suggest useful ways of grouping the data.
In practice, you typically consider several solutions before deciding on the most useful one. Domain knowledge of customers and products/services will play a role in deciding among alternative solutions. Cluster analysis is not an end in itself, but one step in a data mining project. The clusters you discover are often used next as predictors, as separate analysis groups to each of which a separate model is fit, or as groupings in OLAP reports. There are many different clustering methods, but in the area of data mining two are in wide usage. This is because the large class of hierarchical clustering methods requires that distances between every pair of data records be stored (there are n*(n-1)/2 such pairs) and updated, which places a substantial demand on memory and resources for the large files common in data mining. Instead, clustering is typically performed using the K-means algorithm or an unsupervised neural network method (Kohonen). Of the two, K-means clustering is considerably faster. Clementine also contains a TwoStep cluster method that quickly creates a number of preliminary clusters and then performs a hierarchical cluster analysis on these preliminary clusters. In addition, the TwoStep algorithm uses statistical criteria to indicate the optimal number of clusters.
In this chapter we will apply two of these clustering methods, each to a different data set (K-means to software usage data and Kohonen to shopping purchases).
Briefly, the multiple line chart shows usage information of analytical software for four cluster groups. Without going into detail here, we can see that group 1 (solid line) tends to report usage of all the analytical techniques. Group 3 (heavy dotted line) tends to use basics, presentation statistics and mapping and relatively little of the others. Groups 2 and
5 (dashed lines) exhibit patterns that seem to be the converse of Group 3. We will interpret these data in greater detail later. Here we hope to demonstrate the importance of such plots in interpreting and naming the clusters. Another way to profile clusters is to apply rule induction methods (Chapter 5) to predict the cluster groups from the fields used in the initial clusters or from demographics.

Validation

Also, a validation step is important. Do the clusters make conceptual sense? Do the clusters replicate with a different sample (if data are available)? Do clusters replicate using a different clustering method? Are the clusters useful for business purposes? As you pass over each of these hurdles, you have greater confidence in the solution.
the observation is assigned to the now nearest cluster. Additional data passes are usually made until the clusters become stable. Based on simulation studies, K-means clustering is effective when the starting cluster centers are well spaced. Again, several trials, varying the number of clusters, are typically run before settling on a solution.
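The passes just described can be sketched as a minimal K-means implementation. This is illustrative only; SPSS's K-means procedure adds refinements such as choosing well-spaced starting centers:

```python
# Minimal K-means sketch: assign each case to its nearest center, recompute
# centers as the mean of their cases, and repeat until assignments stabilize.
def kmeans(points, centers, max_passes=100):
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    for _ in range(max_passes):
        # assignment pass: nearest center for each point
        labels = [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
                  for p in points]
        # update pass: each center moves to the mean of its assigned points
        new = []
        for k in range(len(centers)):
            mine = [p for p, l in zip(points, labels) if l == k]
            new.append(tuple(sum(d) / len(mine) for d in zip(*mine)) if mine
                       else centers[k])
        if new == centers:       # stable: no center moved
            return labels, centers
        centers = new
    return labels, centers

# Two well-separated groups of points in two dimensions (invented data).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With poorly spaced starting centers the same procedure can converge to a worse local solution, which is why multiple trials are recommended above.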
Notice that two clusters will be created by default. The Centers pushbutton would be used if you wished to provide starting points for the clusters (these might be based on previous research, or to try hypothetical scenarios). The Method area permits you to iterate and classify or to classify only; we want to iterate and classify, which actually applies the K-means cluster method to the data. The classify only choice is usually used to assign additional cases to clusters already established (since the clusters are not updated); this method is sometimes called nearest-neighbor discriminant analysis. The Iterate pushbutton controls the criteria used to determine when the solution is stable and the algorithm will stop. The Save pushbutton is used to create a cluster membership variable (only one, since K-means creates a specified number of clusters). The Options pushbutton permits you to display ANOVA tests on each clustering variable; these are not taken seriously as actual statistical tests (since clusters are formed that maximally separate groups), but as indicators of which cluster variables are most important in the formation of clusters. This is useful when many variables are used in the analysis and you wish to focus attention on the most important ones. Stepwise discriminant analysis (see Chapter 2) or decision tree methods (see Chapter 5) are also used for this purpose.

Move all variables except Jobarea into the Variables list box
Replace the 2 with a 3 in the Number of Clusters text box
Click the Save pushbutton
Click the Cluster membership check box to save the cluster membership variable
We wish to create a cluster membership variable based on the K-means method. We can also save a variable containing the distance from each observation to the center of its cluster.

Click Continue to process the Save requests
Click the Options pushbutton
Click the ANOVA table checkbox in the Statistics area

Figure 6.4 Options Subdialog Box
We really don't need the ANOVA table here because there are not many usage variables. However, if we were clustering on 30 variables, then the ANOVA table might suggest a subset of important variables on which to focus attention. Alternatively, stepwise discriminant analysis, logistic regression, or a decision tree method could be used for this. Note that the K-means procedure contains an option to include in the analysis cases that have missing values on one or more of the clustering variables.
Click Continue to process the Options
Figure 6.5 Completed K-Means Cluster Dialog Box
We are now ready to run a cluster analysis.
Click OK
Note
The Final Cluster Centers pivot table displays means rounded to the nearest whole number (since the original variables are formatted as integers). We need to see more decimal places, so we must edit the pivot table.
Double-click on the Final Cluster Centers pivot table
Click in the first data cell (1: Advanced Stats - Cluster 1)
Shift-click in the last data cell (1: Time Series - Cluster 3)
Click Format..Cell Properties
Set the Decimals value to 2 using the spin control
Click OK
Click outside the crosshatched border to close the pivot table editor
Figure 6.6 Final Cluster Centers (Mean Profiles)
Above we see the mean profiles for the three segments constructed by the K-means cluster method. We will examine the mean patterns to interpret the clusters and display them in a multiple line chart. Group 1 is composed of customers who use the more advanced statistical modules and make little use of the presentation procedures; we might call them "Technicals." Group 3 shows the reverse pattern, using Tables and Mapping heavily but not the more statistical modules; we could label them "Presenters." Group 2 contains those who use most, if not all, of the modules, so they are "Jacks-of-all-trades."
Figure 6.7 Sizes of Clusters
The size of each cluster appears as part of the K-means output. Each cluster contains a substantial number of cases (relative to the sample size); there are no clusters based on only a few data points (outliers).
Cluster Analysis 6 - 9
As mentioned earlier, we should not take the significance values seriously here because clustering is designed to create optimally well-separated groups. However, when clustering with many variables (not the case here), analysts can use the F values to guide them to those variables most important in the clustering (those with larger F values). Here the Fs range from 40 (professional statistics) to 261 (advanced statistics), and this provides an indicator as to where our attention should be concentrated. Stepwise discriminant analysis or decision tree methods are applied to the clusters for the same purpose: to identify the important clustering variables. Next we view a multiple line chart profiling the three segments.
Click Graphs..Line
Click the Multiple Line icon (we are displaying three segments)
Click the Summaries of separate variables option button (since all usage variables are to be plotted)
Click the Define pushbutton
Move all variables except Jobarea and qcl_1 into the Lines Represent list box
Move qcl_1 into the Category Axis list box
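The F-value screening described here is easy to reproduce by hand. The sketch below computes a one-way ANOVA F statistic for a single clustering variable across three hypothetical clusters; as the text stresses, the value serves only as a relative importance indicator, not a formal test.

```python
def anova_f(groups):
    """One-way ANOVA F: between-cluster mean square divided by
    within-cluster mean square, for one clustering variable."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical usage scores for one variable in each of three clusters:
# widely separated cluster means yield a large F, flagging the variable.
f = anova_f([[1, 2, 1, 2], [5, 6, 5, 6], [9, 10, 9, 10]])
```

Variables would then be ranked by F, just as the text ranks professional statistics (F of 40) below advanced statistics (F of 261).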
Cluster Analysis 6 - 10
Figure 6.9 Completed Multiple Line Chart Dialog Box
Since the SPSS procedure that performs the K-means method is called Quick Cluster, by default the cluster membership variable it creates is named qcl_1 the first time it is run during a session, qcl_2 the second time, and so on.
Click OK to create the chart
By default, each variable used in the clustering appears as a separate line, but we want each line to represent a different cluster group.
Double-click on the Chart, then click Series..Transpose Data
Figure 6.10 Mean Profiles for the K-Means, Three Cluster Solution
Note
We could also have created the line chart by activating (double-clicking) the Final Cluster Centers pivot table, selecting the cluster means, then right-clicking and selecting Create Graph..Line from the context menu. It is easier to interpret the cluster groups in this way when the clustering is based on a small number of variables that share the same scale. The contrast in profiles for the three cluster groups (named Technicals (1), Jacks-of-All-Trades (2), and Presenters (3)) is clear.
Click File..Close (to close the Chart Editor)
For an additional comparison, we reran the K-means method requesting four clusters. Figure 6.11 below displays the multiple line chart of the four segments.
Figure 6.11 Mean Profiles for the K-Means, Four Cluster Solution
The additional (fourth) cluster (cluster #3 in the chart) is high on advanced statistics and time series usage, but lower on the presentation procedures. It looks as if we've split off, from the Technicals group, a subgroup that focuses on time series analysis.
(Note that not all input neuron connections to the map are shown.) When a record is presented to the grid, its pattern of inputs is compared with those of the artificial neurons within the grid. The artificial neuron with the pattern most like that of the input wins the input. This causes the weights of that artificial neuron to be adjusted so that it appears even more like the input pattern. The Kohonen network also slightly adjusts the weights of the neurons surrounding the winning neuron. This has the effect of moving the most similar neuron, and the surrounding nodes to a lesser degree, toward the position of the record in the input data space. The result, after the data have passed through the network a number of times, will be a map containing clusters of records corresponding to different types of patterns within the data. Because of the many iterations in which weight adjustments are made, Kohonen networks, especially those with a large number of neurons, take considerably longer to train than K-means clustering. Yet they provide a different and potentially valuable view of groupings in the data.
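The winner-takes-most training loop described above can be sketched as follows. This is a minimal illustration of a Kohonen map, not Clementine's implementation; the grid size, learning-rate schedule, and sample records are invented for the example.

```python
import numpy as np

def train_som(data, rows=3, cols=3, epochs=30, lr=0.3, seed=0):
    """Tiny Kohonen map: for each record, find the winning neuron and pull
    it (and, at half strength, its immediate grid neighbours) toward the
    record."""
    rng = np.random.default_rng(seed)
    w = rng.random((rows, cols, data.shape[1]))        # weight vectors on the grid
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)               # decaying learning rate
        for x in data:
            d = ((w - x) ** 2).sum(axis=2)
            r, c = np.unravel_index(d.argmin(), d.shape)   # winning neuron
            for i in range(rows):
                for j in range(cols):
                    gd = abs(i - r) + abs(j - c)       # distance on the grid
                    if gd <= 1:                        # winner + neighbours
                        w[i, j] += rate * (0.5 ** gd) * (x - w[i, j])
    return w

# Two tight groups of records; after training, some neuron on the map
# ends up close to each group.
data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
weights = train_som(data)
```

Records that land on the same (row, column) neuron after training form one cell of the map, which is what the $KX/$KY coordinate fields discussed later record.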
Figure 6.12 Clementine Stream for Kohonen Analysis
Right-click the Table node attached to the Type node, then click Execute
View the data, then close the Table window (not shown)
Double-click on the first Type node
Figure 6.13 Type Node Setup for a Kohonen Network
We want to segment the data based on purchasing behavior, and therefore have set all other fields to direction None so that they are not used in the Kohonen network. All product fields are set to In, since they will be inputs to the Kohonen network.
Close the Type node dialog
Double-click on the Train Kohonen node
By default, a feedback graph appears while the network is training and gives information on the progress of the network. It represents the grid of output neurons. As records are won by neurons, the corresponding cells become deeper shades of red. The deeper the shade, the stronger the neuron is in winning records. The size of the grid can be thought of as the maximum number of possible clusters required by the user. This can vary and depends on the number and types of input variables. The size of the grid can be set using the Expert tab. The Expert tab, when chosen, increases the available options and allows the user to further refine the settings of the Kohonen network algorithm. For detail on the Expert tab, see the Clementine User's Guide. The Kohonen network can be set to stop training either using the default, a built-in rule of thumb (see the Clementine User's Guide), or after a specified time has elapsed. As with the Neural Network node, the Set random seed option can be used to build models that can be reproduced. This is not normal practice, and it is advisable to perform several runs to ensure you obtain similar results from random starting points. In this chapter we will retain the defaults for the majority of the above settings, although if multiple models are built you may want to change the network's name for clarity.
To speed up the analysis, we will limit the size of the grid of output nodes to 3 x 3.
Click the Set random seed checkbox
Enter 1000 in the Seed text box
Click the Expert tab
Click the Expert mode option button
Change the Length and Width values to 3 and 3
(We set a specific seed value so these results can be reproduced.)
Figure 6.15 Completed Kohonen Dialog with Expert Options Visible
Click Execute
A feedback graph similar to Figure 6.16 will appear while the network is training.
Figure 6.16 Feedback Graph When Training a 3 by 3 Kohonen Net
Darker colors indicate a greater density of cases within a neuron. As the network becomes stable, there is less color shifting. Once the Train Kohonen node has finished training, the feedback graph disappears and a Kohonen model node appears in the Models tab of the Manager window.
Browsing this node gives information only on the number of input neurons (the number of numeric or flag type fields, plus the number of values within each set type field entering the analysis), the number of output neurons (equal to the number of cells (neurons) within the grid), and the settings used for training. Now that we have produced a Kohonen network, we will attempt to understand how the network has clustered the records. The first step is to see how many distinct clusters the network has found and to assess whether all of them are useful, and whether any can be discarded because they contain small numbers of extreme records. When data are passed through the Kohonen generated model node, two new fields are created representing the X and Y (Kohonen map row and column) coordinates. We previously ran this Kohonen analysis, placed the generated node into the data stream, and connected it to a Plot node. We are able to view the clusters using the Plot node. Within the Plot node we request that the X ($KX-Kohonen) and Y ($KY-Kohonen) coordinates from the Kohonen generated model node be plotted. Also, since all records won by a given neuron in the Kohonen map have the same coordinate values, we add some agitation (jittering) to the data values (not shown).
Right-click on the Plot node
Click Execute on the Context menu
Figure 6.17 Plot of Kohonen Axis Coordinates (with Agitation) to Show Clusters
Here we see that although the network has produced nine clusters, there appear to be only four main clusters. It is always important to calculate how many records fall into each segment. This can be achieved using a Distribution node. First we need to create a unique value for each of the clusters by combining the coordinates. To do this we concatenate the coordinates to form a two-digit reference number. In Clementine, this involves using a Derive node.
Close the Plot display window
Double-click the Derive node
Figure 6.18 Creating a Unique Cluster Number for Each Cluster
The formula contains the concatenate expression >< and the two new field names. CLEM (Clementine Expression Language) is case sensitive, and field names beginning with $ need to be enclosed in single quotes.
Click OK to return to the stream canvas
Right-click on the Table node (to which the Derive node connects)
Click Execute from the Context menu
Figure 6.19 Cluster Field and Kohonen Axis Coordinate Values
The resulting table contains a field called Cluster that consists of a combination of the two coordinates.
Close the Table window
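Outside Clementine, the same two-digit cluster reference could be built with ordinary string concatenation. The field names mirror those created by the Kohonen generated model node; the records themselves are hypothetical.

```python
# Hypothetical records carrying the $KX-Kohonen / $KY-Kohonen fields the
# generated model node creates; CLEM's >< operator becomes plain string
# concatenation here.
records = [{"$KX-Kohonen": 0, "$KY-Kohonen": 2},
           {"$KX-Kohonen": 2, "$KY-Kohonen": 0}]
for r in records:
    r["Cluster"] = f'{r["$KX-Kohonen"]}{r["$KY-Kohonen"]}'
```

A record won by the neuron in row 0, column 2 thus receives the cluster label "02".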
Here we see that the four main groups are those labeled 00, 02, 20 and 22, and these include about 77% of all records.
Close the Distribution plot window
It may be that the smaller groups are of interest (e.g., when targeting a specialist product group). However, for this example we shall concentrate on profiling the four largest groups. To this end we need a Clementine Select node to filter out those records falling in the other five groups. An easy way to achieve this is to produce a table and, from it, generate a Select node. This was done beforehand within the Table window; we clicked the cluster value in the Cluster column for four records with values of 00, 02, 20 and 22. On selection, the cells changed color to blue. We then chose the Select node (or) option under the Generate drop-down menu, which produced the Select node we see in the stream.
To build profiles of our segments, we need to understand what patterns exist between the fields used as inputs to the Kohonen network and the groups of records in the segments. Since the input fields in this example are flags, we shall use a directed web plot. If the input fields had been numeric, this process would have involved looking at histograms with an overlay of cluster number.
Double-click on the Web Plot node
Figure 6.21 Web Plot Node
A web plot shows the relationships among the categories in different fields. We are interested in how cluster group (for the four major clusters) relates to the fields used to produce the clusters. For this reason, we have checked the Directed web option button and selected the cluster field as the To Field: variable. All the products used for
clustering are selected in the From Fields box. This will create connections between only the cluster groups and the products. The Show true flags only (for binary fields) checkbox is checked. This is because we are interested in seeing relations with purchased items (true value) rather than items not purchased. Instead of counts (the default), the thresholds for weak and strong connections are set to percentages of the To field value, i.e., the cluster variable. In other words, the percentages are based on the number of people in each cluster who purchased a product, which is exactly what we need to profile the clusters. We choose the values of 20, 30 and 45 percent.
Click Execute
Figure 6.22 Directed Web Plot Connecting Cluster Groups to Purchases
The weight of the line indicates whether the connection is weak (light), moderate (normal), or strong (heavy). Here we can clearly see that only a few individual products are associated with the four different clusters.
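The web plot's percentage thresholds can be reproduced directly. The sketch below computes, within each cluster, the percentage of shoppers who purchased a product and maps it to the weak/moderate/strong bands (20, 30 and 45 percent) used above; the shopper records and product names are invented for the example.

```python
# Hypothetical purchase flags for a handful of shoppers in two clusters.
shoppers = [
    ("00", {"readymade": True,  "alcohol": False}),
    ("00", {"readymade": True,  "alcohol": False}),
    ("22", {"readymade": False, "alcohol": True}),
    ("22", {"readymade": True,  "alcohol": True}),
]

def link_strength(cluster, product, bands=(20, 30, 45)):
    """Percentage of a cluster's members who bought the product, mapped to
    the web plot's weak/moderate/strong line weights."""
    members = [flags for c, flags in shoppers if c == cluster]
    pct = 100 * sum(f[product] for f in members) / len(members)
    weak, moderate, strong = bands
    if pct >= strong:
        return "strong"
    if pct >= moderate:
        return "moderate"
    if pct >= weak:
        return "weak"
    return None  # below the weak threshold: no line drawn
```

Because the percentages are computed within each cluster, a small cluster with concentrated purchasing still produces strong connections, which is exactly what the profiling requires.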
Summarizing the plot and clusters:
Cluster 00: These shoppers are likely to purchase ready-made foods, plus bakery goods and snacks to a lesser extent.
Cluster 02: These shoppers are largely associated with tinned goods, plus snacks and bakery goods.
Cluster 20: These shoppers buy all products heavily, with the exception of toiletries and fresh vegetables.
Cluster 22: These shoppers are more likely to buy alcohol and snacks, plus frozen foods.
We have begun to build a picture of the four groups using the fields on which they were clustered. We shall now use some of the other fields in the data set to build a stronger profile of each of the four groups. Distribution plots (bar charts) can be used to investigate whether there are any relationships between the cluster groups and the demographics within the data set. We demonstrate by using the Children field (do you have at least one child?) as an overlay on a distribution plot of Cluster.
Close the Web plot window
Right-click on the Distribution node (Cluster) that is connected to the Select node (generated)
Click Execute on the Context menu
Figure 6.23 Normalized Distribution of Cluster with Children Overlay
Individuals in cluster 00 are the least likely to have children (which fits with the products they buy more frequently), whereas individuals in cluster 02 are more likely to have children (which is a little harder to square with the products they buy, although that's the intrigue of cluster analysis). Such plots can be used to provide more detail to the cluster descriptions. Alternatively, decision trees can be built that predict cluster group membership using as inputs either the fields used in the original clustering (the results will suggest which inputs determined which clusters) or demographics (which demographics are most strongly related to the clusters). To accomplish this we ran the C5.0 rule induction method twice (not shown). Parts of the decision trees are shown below. Figure 6.24 shows the rules for predicting cluster membership from the various product fields. The rules can be examined to better understand how the products relate to the cluster groups.
Figure 6.24 C5.0 Rules Predicting Cluster Membership from Cluster Inputs
Next we view the decision tree predicting cluster membership from the demographic variables.
Figure 6.25 C5.0 Rules Predicting Cluster Membership from Demographics
In turn, these rules can be examined to see how the demographic fields relate to the cluster groups. Either analysis can provide insight into the characteristics of the cluster groups. As one example, there is a group of 62 people who have no children and are age 18 to 30, single, and male. The C5.0 model predicts they are likely to be in cluster 22 (64.5% are actually in this cluster). Note how this demographic profile makes perfect sense given the products cluster 22 members are likely to buy. If the clusters can be meaningfully interpreted, then they should be examined from the perspective of how they might be relevant in the context of the business.
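A C5.0 rule of the kind just described is simply a conjunction of conditions. Re-expressed in code (with hypothetical field names, since the actual tree is only partially shown), the 62-person rule might look like this:

```python
# Hypothetical field names; this re-expresses one rule from the text:
# no children, age 18 to 30, single, male -> predict cluster 22.
def predict_cluster(person):
    if (not person["children"] and 18 <= person["age"] <= 30
            and person["marital"] == "single" and person["sex"] == "M"):
        return "22"
    return None  # the full tree would contain rules for the other clusters
```

Note that the rule is probabilistic: per the text, 64.5% of the people matching it actually fall in cluster 22, so the prediction is a likelihood, not a certainty.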
Other Features
There is no limit to the number of cases or input variables for cluster analysis. However, hierarchical clustering methods are memory-intensive and often cannot be used with more than a few thousand cases or a few dozen variables. For that reason K-means clustering or Kohonen networks are more commonly used in data mining, with the first being essentially unlimited in file size. K-means clustering generally takes considerably less processing time than clustering using Kohonen networks. Since the TwoStep procedure performs hierarchical clustering only in the second step (on the clusters formed in the first step), it also runs relatively quickly and can be used with large files. For practical reasons, usually no more than ten to twenty variables might be used in clustering. This is because trying to understand the cluster solutions becomes quite complex with more variables than that. And with more than a few variables, graphical output is typically not that helpful. We emphasize again that the number of clusters is a user decision with K-means clustering and Kohonen networks. The TwoStep algorithm does provide guidance, using statistical criteria, on the number of clusters.
Model Deployment
The cluster centers can be input into the K-means clustering routine to cluster new data sets, so model deployment is routine if done in SPSS or Clementine; it cannot, however, be reduced to a simple equation applied in a spreadsheet program. This means that clustering cannot be easily and directly applied to data warehouses or data marts. However, this can be done for SPSS K-means or TwoStep clustering with modest programming effort. And Clementine offers other options, including Solutions Publisher, to deploy complete streams outside the software.
Normalization
If you cluster using fields that are measured on different scales (for example, age in years and income in dollars or euros) or have very different variances, then it is generally recommended that the fields be normalized or standardized before clustering. Otherwise, those fields with relatively large variances will dominate the clustering. There are exceptions to this, as when clustering on the financial account balances of customers. If the greatest volatility (variance) is in the mutual fund account, it can be argued that it should be the most important field in the clustering. Under this scenario, normalization would not be performed. However, in most analyses, normalization is performed prior to clustering. This is not done automatically by the K-means procedure in SPSS and must be completed by the analyst (the Descriptives procedure with the Save option in SPSS). Along similar lines, categorical fields (the set field type in Clementine) must be converted to dummy fields before clustering. This is done automatically by the K-Means, Kohonen, and TwoStep nodes within Clementine, but must be performed by the analyst (using Recode or If statements) within SPSS.
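Both preparation steps are straightforward to perform outside the clustering procedures. The sketch below shows z-score standardization (equivalent to the z-scores the SPSS Descriptives procedure can save) and dummy coding of a categorical field; the field values are invented for the example.

```python
import statistics

def zscore(values):
    """Standardize a field to mean 0 and standard deviation 1, so that
    scale differences do not dominate the clustering."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def dummies(values):
    """Expand a categorical (set type) field into 0/1 indicator fields."""
    cats = sorted(set(values))
    return [{c: int(v == c) for c in cats} for v in values]

# Hypothetical fields on very different scales: without standardization,
# income would dominate any distance-based clustering of these records.
z_age = zscore([25, 35, 45])
z_income = zscore([30000, 50000, 70000])
```

After standardization the two fields contribute equally to the distance calculations, which is the point of the recommendation above.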
Introduction
It is essential for business organizations to plan ahead and forecast events in order to ensure a smooth transition into the future. To minimize errors when planning for the future, it is necessary to collect, on a regular basis over time, information on factors that might influence plans. Patterns can then be identified, and these patterns help make forecasts for the future. Today, although many organizations store information relevant to the planning process (whether in business reports, database tables, or a data warehouse), forecasts are often made on an ad-hoc basis. This can lead to large forecasting errors and costly mistakes in the planning process. Analytic methods provide a more structured approach that will reduce the chance of making costly errors. Statisticians have developed statistical techniques, known as time series analysis, which are devoted to the area of forecasting. Specific methodologies such as exponential smoothing, forms of regression, and ARIMA modeling have been used for years. Traditionally these involved an analyst examining various diagnostic plots of the data series (sequence plots, autocorrelation and partial autocorrelation plots), fitting a tentative model, evaluating the model fit, rerunning the diagnostics, and further modifying the model. This activity requires training and is labor intensive. In recent years, researchers have developed efficient ways of fitting a set of time series models to a data set and reporting on the most promising results. The analyst can evaluate and compare the fit of the alternative models and choose among or combine them. This is the approach taken by SPSS DecisionTime. Such an approach is very much in the tradition of data mining in that an algorithm applies decision rules that permit a number of models to be evaluated against test data.
In this chapter, we will use DecisionTime to obtain a time series model for data that records the daily volume of packages delivered by a parcel delivery service. More detail about DecisionTime and a brief introduction to time series modeling are contained in the Introduction to DecisionTime training course. Note that the SPSS Trends option, although not containing an automated time series model-fitting routine, can apply, under the direction of the analyst, a variety of time series models (exponential smoothing, ARIMA, ARIMA with predictor variables (interventions, fixed regression inputs), and spectral analysis). This option and a detailed introduction to time series analysis are covered in the Time Series Analysis and Forecasting with SPSS Trends training course. It is also worth mentioning that a data mining method called sequence detection can be applied to time-structured data as well. Here the focus is on analyzing patterns in data over time (for example, the sequence in which SPSS software products are purchased, the web pages that are examined prior to the online purchase of a computer, and the common types of transaction patterns that occur for mutual fund customers). Sequence detection is discussed in the following chapter.
Each record represents a single day. In addition to Parcel, there are date fields that can be used in plots and summary tables within SPSS. It is worth noting that DecisionTime permits data to be modeled at different levels of aggregation (for example, months and quarters).
Linear trend
A linear trend is present when the series that you wish to predict either increases or decreases over time at a locally constant rate. This means that the increase or decrease is steady over some number of time periods, but the slope need not be the same throughout the entire series.
Exponential trend
A time series has an exponential trend when the series level tends to increase or decrease in value, but at an increasing rate the further the time series progresses. (Note: this is implemented in the SPSS Trends option but not in DecisionTime.)
Damped trend
If there are signs that the series increases or decreases, but at a decreasing rate as the time series proceeds, then the damped option should be specified.
No seasonality
As expected, no seasonality implies the series does not have a seasonal pattern.
Additive seasonality
Additive seasonality describes a series with a seasonal pattern that maintains the same magnitude when the series level increases or decreases.
Multiplicative seasonality
If the seasonal patterns become more (less) pronounced when the series values increase (decrease), then the seasonal pattern is multiplicative.
DecisionTime applies a series of decision rules to diagnostic summaries and evaluates the fit of different models to decide which of the exponential smoothing (or ARIMA) models is best for the data series.
Figure 7.2 DecisionTime Project Window with Sequence Plot for Parcel Deliveries
Note that the year 1997 is used to illustrate how dates appear in the graphs, but the data were not collected in 1997. The left (Project) pane lists the elements in the current DecisionTime project, which currently consists of a single variable. It can contain multiple variables (Series) and model results (Model). The right (Content) pane displays a sequence plot of the selected series. If several series were selected in the Project pane, they would appear stacked one above another in the Content pane if the Graph Panels tab were selected. All selected series would appear in a single plot if the Single Graph tab were selected, and in a table presenting the data if the Table tab were chosen. Parcel deliveries seem to follow a fairly well defined weekly pattern in which certain days of the week consistently have higher demand. Volume reaches its peak around midweek, and the low is on Sunday. Also, there is an upward trend over time. A time series model would try to capture both of these features: trend and seasonality. From week 4, the demand for parcel deliveries is clearly increasing in a linear fashion. So there looks to be a trend for the series as a whole, and it is best described as linear. The amplitude of the seasonal pattern seems to increase toward the end of the series, which suggests the seasonality is multiplicative. Although we could try this model and evaluate the fit, we will have the DecisionTime Forecast Wizard perform the model selection.
Validation
To validate the model we will train it on the first 119 days of data and reserve the last 7 days as validation data. Note that the test data are not a random sample from all time periods, but since the goal of the model is to predict future demand, the last week of data is the most representative sample for this purpose.
To apply the Forecast Wizard:
Click Forecast..Forecast Wizard (Alternatively, click the button)
Click the Next pushbutton
Drag and drop parcel into the Series to be Forecasted list box
Change the value in the How many periods would you like to forecast text box to 14
Figure 7.3 Forecast Wizard - Series to be Forecast Dialog
We have identified parcel as the single series to be forecast and indicated that we wish to produce forecasts for 14 periods (two weeks). As we will see shortly, 7 of these periods will correspond to the last week of actual data (validation data) and 7 will be forecasts into the future.
Click the Next pushbutton
Although not relevant to this analysis, it is worth noting that the Wizard allows you to specify other data series as predictors. For example, advertising spending may help predict sales, or regional housing starts may predict later demand for materials (plumbing fixtures). The Wizard would search for relations between these predictors and the series to be forecast at different time lags, and thus would investigate whether advertising spending three months back relates to sales this month (based on the cross-correlation function). Since we have no predictor variables, we proceed.
Click the Next pushbutton
In addition to predictor variables, DecisionTime allows inclusion of intervention effects. Unlike predictor variables, interventions are not assumed to be time series themselves, but rather represent single-point occurrences of events (holidays, natural disasters, political events) or structural shifts called steps (a change in law, the opening of a new plant, the advent of deregulation) lasting for some period of time, which are believed to influence the series to be forecast. If either type of intervention occurred, the Interventions button leads to a dialog in which you could specify the point intervention dates and step date ranges.
Click the Next pushbutton
The next dialog (Forecast Wizard Events, which is not shown) permits you to specify future dates at which you expect the interventions to repeat. In this way, the intervention effects would be taken into account during forecasting.
Click the Next pushbutton
Click the Prior to end of the historical data option button
Enter 7 in the How many periods prior? box
By default, forecasting will begin at the end of the historical data. This is the best choice when producing forecasts. However, since we want the last 7 days forecast from the earlier part of the series, we must begin forecasting prior to the end of the historical data. Thus the last week of data will not be used when producing the forecasts for those days.
Click the Next pushbutton
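The holdout scheme described here is a simple split by position rather than a random sample; a minimal sketch, with a hypothetical series standing in for the parcel data:

```python
# Hypothetical 126-day series standing in for the parcel counts.
daily_volume = list(range(126))

train = daily_volume[:119]     # first 119 days: used to fit the model
holdout = daily_volume[-7:]    # last 7 days: reserved for validation
```

The model is fit to the training portion only; forecasts for the holdout week are then compared with the actual values, as in Figure 7.8.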
The last choice involves whether you wish to exclude data early in the series from the modeling. This option would be used if you had a very long time series or believed that patterns had changed over time and that the early part of the series wasn't relevant to future predictions. Examples of this might be sales data from utilities and telecommunication companies before and after deregulation. Although a telecommunications company might have many years of data from some customers, it might discard data collected prior to deregulation as irrelevant for forecasting revenue in a deregulated environment.
Click the Finish pushbutton
Figure 7.8 Model Selection and Forecasts from Expert Modeler
There are two noticeable changes to the DecisionTime window. First, the Model column of the Project (left) pane and the graph label indicate that the Winters Multiplicative model has been chosen. This model, named for its developer, is an exponential smoothing model that includes trend and multiplicative seasonality. Second, the fourteen forecast values for parcels appear in the graph with a reference line marking the beginning of the forecast period. Visually, the forecasts track the actual series closely during the one-week holdout (validation) period.
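Winters' multiplicative method maintains three smoothed components: a level, a trend, and a set of seasonal ratios. The sketch below is a minimal, illustrative version, not DecisionTime's implementation; the smoothing constants and the simulated daily series (midweek peak, Sunday low, growing amplitude) are invented for the example, whereas DecisionTime estimates such parameters itself.

```python
def holt_winters_mult(y, season, alpha=0.3, beta=0.1, gamma=0.2, horizon=7):
    """Minimal Winters multiplicative smoothing: level + trend + seasonal
    ratios, updated once per observation."""
    level = sum(y[:season]) / season                       # week-1 mean
    trend = (sum(y[season:2 * season]) - sum(y[:season])) / season ** 2
    seas = [y[i] / level for i in range(season)]           # seasonal ratios
    for t in range(season, len(y)):
        prev = level
        level = alpha * y[t] / seas[t % season] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        seas[t % season] = gamma * y[t] / level + (1 - gamma) * seas[t % season]
    return [(level + (h + 1) * trend) * seas[(len(y) + h) % season]
            for h in range(horizon)]

# Hypothetical daily volumes: a weekly pattern (midweek peak, Sunday low)
# whose amplitude grows with the upward trend, as in the parcel series.
pattern = [10, 14, 18, 16, 12, 8, 6]
y = [pattern[t % 7] * (1 + 0.02 * t) for t in range(70)]   # 10 weeks of data
forecast = holt_winters_mult(y, season=7)
```

Each forecast combines the extrapolated level-plus-trend with the stored seasonal ratio for that weekday, so the midweek peak and Sunday low reappear in the forecast week.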
Figure 7.9 Forecasts with 95% Confidence Limits
The 95% confidence limits for the forecasts are displayed. These limits can also be shown for the historical series data.
Click the Table tab
Scroll to the 5/6/97 date
Figure 7.10 Table Containing Forecasts
Four rows appear: historical data, forecasts, and upper and lower limits. These correspond to the values displayed in the graph. You control what information displays in the graph and table by your choices on the Show buttons located to the right of the Table sheet (alternatively, click View..Forecast Output). Now we will take a closer look at the forecasts.
Click the Graph Panels tab
Click the Show Historical Data button (so it is no longer depressed)
Since the Show Historical Data button is not depressed, only the forecasts and their confidence limits appear for the two-week forecast period at the end of the series (the last week of the series and a one-week forecast beyond the data). We see the midweek peaks appearing in the two-week forecast.
Model Evaluation
There are many ways of evaluating a time series model. In this example we limit ourselves to a visual inspection of how the model predictions track the original series (shown earlier, in Figure 7.8, for the 7-day holdout period) and some commonly used goodness-of-fit measures.
Click View..Model Viewer (alternatively, click the Model Viewer button)
Figure 7.12 Model Viewer Window: Goodness of Fit Statistics
We will focus on two of the fit measures. The mean absolute error (also referred to as mean absolute deviation, or MAD) takes the average of the absolute values of the errors and is often used as the primary measure of fit. In our data, the average magnitude of error on either side of zero is 1,474. The mean absolute error must be interpreted with reference to the original units of measurement. For those familiar with their data this is adequate, since they would know whether a mean absolute error of 1,474 is reason for celebration or sorrow.

However, the mean absolute percentage error (MAPE) is sometimes preferred because it is a percentage and thus a relative measure. Again, positive and negative errors do not cancel. MAPE is obtained by taking the absolute error for each time period, dividing it by the actual series value, averaging these ratios across all time points, and multiplying by 100. The MAPE is often used because it measures the percentage prediction error; in this case it is 7.26%. The maximum absolute error and maximum absolute percentage error are also reported, as are some other, more technical, measures.

Whether this model is acceptable largely depends upon what degree of accuracy is required in the context of the business and how alternative models fare. Since the Expert Modeler in the Forecast Wizard selected this model, we know that other forms of exponential smoothing were considered, along with ARIMA models. However, if there were a need for greater accuracy in predicting parcel deliveries, perhaps other predictors (for example, measures of business activity or the local economy) could be examined for inclusion in the model.
Scroll down to the Model Parameters section in the Summary tab sheet
Figure 7.13 Model Parameters
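The two fit measures described above are easy to compute from a series of actual values and one-step-ahead predictions. A minimal sketch in Python (the four parcel counts here are invented for illustration; they are not the workshop data):

```python
def mae(actual, predicted):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error: a relative (percentage) measure."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

actual    = [20000, 21500, 19800, 22400]   # hypothetical daily parcel counts
predicted = [19200, 22100, 20500, 21700]

print(mae(actual, predicted))             # 700.0 (in the original units)
print(round(mape(actual, predicted), 2))  # 3.36 (a percentage)
```

As in the text, the MAE is in the original units of measurement, while the MAPE can be compared across series measured on different scales.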
Estimates for the parameters in the Winters' Multiplicative (Exponential Smoothing) model appear. For those interested in their interpretation, we include a brief description below. Most books on time series cover exponential smoothing models, and the Gardner (1985) reference is quite complete.
If the overall series mean is representative of the series values at the end of the series, then the alpha value should be 0. This places equal importance on all values of the series. If, however, the series level shifts, then the later observations should have more weight in predicting future series values. As the alpha parameter value moves closer to 1, more and more weight is given to the most recent observations. The alpha parameter is used for all exponential smoothing models.

The gamma weight applies to the trend component of the series. Again, this parameter gives more weight to recent values the nearer to 1 the value of gamma. If the trend component is relatively stable throughout the whole series, then 0 (each observation given equal weight) is probably the most appropriate value. It is important to note that there may well be trend in the series even if the gamma coefficient estimate is 0; a gamma of 0 simply implies that the trend is constant over time. Gamma is used only for exponential smoothing models where a trend component has been specified. If no trend component is needed (as evaluated by the Forecast Wizard), then this parameter will not be included in the model.

The delta parameter controls the relative weight given to recent observations in estimating the seasonality present in the series. It ranges from 0 to 1, with values near 1 giving higher weight to recent values. Delta is used for all exponential smoothing models with a seasonal component. Be aware that a delta parameter of 0 does not imply there is no seasonality, but rather that the seasonality is constant over time.

The final parameter is the phi parameter (not used here), which deals with trend modification when the damped trend option is specified. It controls the rate at which a trend is damped, or reduced in magnitude, over time. Here the Forecast Wizard chose the model.
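To make the roles of alpha, gamma, and delta concrete, the Winters multiplicative update equations can be sketched directly. This is a simplified illustration (state initialization and parameter estimation are omitted), not the DecisionTime implementation:

```python
def winters_step(y, level, trend, season, alpha, gamma, delta):
    """One update of the Winters multiplicative recursions.
    season is a list of seasonal factors; season[0] applies to the
    current period. Returns the updated state and the one-step-ahead
    forecast that was made before observing y."""
    forecast = (level + trend) * season[0]
    new_level = alpha * (y / season[0]) + (1 - alpha) * (level + trend)
    new_trend = gamma * (new_level - level) + (1 - gamma) * trend
    # Rotate the seasonal factors; re-estimate the one just used
    new_season = season[1:] + [delta * (y / new_level) + (1 - delta) * season[0]]
    return new_level, new_trend, new_season, forecast

# Hypothetical state: level 100, upward trend 2, flat weekly seasonality
level, trend, season, forecast = winters_step(
    110, 100.0, 2.0, [1.0] * 7, alpha=0.2, gamma=0.1, delta=0.1)
print(forecast)         # 102.0 — with alpha = 0 the level would stay at level + trend
print(round(level, 2))  # 103.6 — alpha = 0.2 pulls the level 20% toward the new data
```

Setting alpha, gamma, or delta to 0 freezes the corresponding component at its current value, while values near 1 let it chase the most recent observations, exactly as described above.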
A user experienced in time series analysis can use the Advanced Forecast Wizard (click Forecast..Advanced Forecast Wizard) to fit a specific exponential smoothing or ARIMA (with intervention effects and predictors) model to the series.
Click the Residuals tab in the Model Viewer window
Position the cursor over the largest positive residual point
Individual residuals (model prediction errors) can be examined. Any extremely large residuals or patterns in the residuals should be checked for data errors or possible model misspecification. When the cursor is placed over a point in this (or any other DecisionTime graph), information about the point appears. Time series practitioners will appreciate that autocorrelation and partial autocorrelation plots of the residuals are also available in the Model Viewer window. If multiple models had been run, the view can be switched across models using the model drop-down list or the direction arrows beside it.

The model predictions seem to track the day-to-day variation fairly well (based on the plot in Figure 7.8 and the Model Summary window). However, viewing the residual plot, there seems to be a tendency toward larger residuals later in the series. This could be because the daily variation has increased during the last four weeks without being fully captured by the model. This tendency is a bit troublesome and should be monitored as additional data are collected. Once you are satisfied with a model, you can create additional forecasts from these data by reapplying the model and forecasting from the end of the series for the desired forecast period. Also, within DecisionTime the forecast period would be extended if additional data were added to the original series (in the input file).
Click File..Exit DecisionTime from the DecisionTime project (main) window
Click No when asked to save changes to the DecisionTime project
Other Features
Data files for time series analysis are much smaller than we are accustomed to using. Normally, you should have at least 40 to 50 observations to make monthly time series analysis practical, but you would rarely have more than a few hundred observations. Generally, time series analysis is done with a predetermined set of variables, and if predictors are used, they tend to be few in number. Stepwise methods that select among many predictors are not available for standard time series methods, although the algorithms in DecisionTime will examine multiple predictors, retaining those judged important to the model. Methods are also available to select predictor variables and suggest the best model within several classes of time series models (see DecisionTime).
Model Understanding
Exponential smoothing models are easy to deploy; only a few estimated values need be stored and the calculations are straightforward. Although ARIMA (a more advanced time series model, not discussed here) is a bit more complicated theoretically, the chief output of a model is still very clear-cut: a predicted value. And that predicted value can be compared to the actual value to get a very good idea of how well the model fits the data. It is therefore not hard at all to present the results of a time series analysis to nonstatisticians or to see graphically whether the model seems to fit the data.
Model Deployment
For some time series techniques, it is possible to use a simple equation to make future predictions. For others, it may be more convenient to add new data to the original file and calculate the predicted values by reapplying the model. But in either case, calculating predicted values will not be a difficult task.
Topics:
INTRODUCTION TO SEQUENCE DETECTION
TECHNICAL CONSIDERATIONS
DATA ORGANIZATION FOR SEQUENCE DETECTION
SEQUENCE DETECTION ALGORITHMS IN CLEMENTINE
A SEQUENCE DETECTION EXAMPLE: REPAIR DIAGNOSTICS
Sequence Detection 8 - 1
This tells us that individuals who buy SPSS Base and Regression Models, and later buy Advanced Models, are likely to later buy Clementine. The "and" in Antecedent 1 indicates that the two items are members of an item set. Thus, "Base and Regression Models" indicates that both items were purchased at the same time, while Advanced Models was purchased at a later time.
When the Sequence node produces a set of sequences, it provides evaluation measures similar to those we reviewed when we discussed association rules. The measures are called support and confidence. Support refers to the number or percentage of cases (where a case is linked to a unique ID number) to which the rule applies, that is, the number of cases for which the antecedents and consequent appear in the proper order. Note that this differs from coverage, defined when we discussed association rules in Chapter 3, which is the percentage of cases in which the antecedents (conditions) hold. Confidence refers to the proportion of the cases to which the antecedents apply in which the consequent also follows. Stated another way, confidence is the proportion of cases with the antecedents that also include that specific consequent. Confidence is used the same way for association rules. These measures are presented in the same table with three additional columns: Instances, Support, and Confidence. For example:

Instances  Support  Confidence  Consequent  Antecedent 1                Antecedent 2
100        .15      .60         Clementine  Base and Regression Models  Advanced Models
This means that 15% (100 individuals) of the customers purchased SPSS Base and Regression Models at the same time, then purchased the Advanced Models, and later purchased Clementine. Of the customers who purchased Base and Regression Models, then Advanced Models, 60% later purchased Clementine.
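The arithmetic behind the two measures is straightforward. A sketch in Python (the counts below are hypothetical, chosen only so the ratios match the example's .15 and .60):

```python
def support(rule_instances, total_ids):
    """Fraction of all IDs whose histories contain the antecedents
    followed, in order, by the consequent."""
    return rule_instances / total_ids

def confidence(rule_instances, antecedent_instances):
    """Of the IDs matching the antecedent sequence, the fraction
    that go on to include the consequent."""
    return rule_instances / antecedent_instances

# 120 of 800 customers show the full sequence; 200 show just the antecedents
print(support(120, 800))      # 0.15
print(confidence(120, 200))   # 0.6
```

Note that support is computed against all IDs, while confidence is computed only against the IDs that match the antecedents, which is why confidence is always at least as large as support.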
TECHNICAL CONSIDERATIONS
Number of items?
If the fields to be analyzed represent items sold or web pages clicked, the number of distinct items influences the resources required for analysis. For example, the number of SPSS products a customer can currently purchase is about 15, which is a relatively small number of items to analyze. Now, let's consider a major retail chain store, auto parts supplier, mail catalogue, or web vendor. Each might have anywhere from hundreds or thousands to tens of thousands of individual product or web page codes. Generally, when such large numbers of products are involved, they are binned (grouped) together into higher-level product categories. As items are added, the number of possible combinations increases exponentially. Just how much categorization is necessary depends upon the original number of items, the detail level of the business question asked, and at what level of grouping meaningful categories can be created. When a large number of items are present, careful consideration must be given to this issue, and it may be time consuming. Time spent on this issue will increase your chance of finding useful sequences and will reduce largely redundant rules (for example, rules describing the purchase of a shirt followed by purchase of many different types and colors of ties).
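The exponential growth mentioned above is easy to quantify: with n distinct items, the number of non-empty item sets a single transaction could contain is 2^n - 1, which is one reason large catalogues are binned into categories before analysis. A quick illustration:

```python
def possible_itemsets(n_items):
    """Number of non-empty subsets of n distinct items: 2**n - 1."""
    return 2 ** n_items - 1

print(possible_itemsets(15))   # 32767 — a small catalogue, like the ~15 SPSS products
print(possible_itemsets(30))   # 1073741823 — over a billion; binning becomes essential
```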
Sequence Length
Searching for longer sequences requires greater resources. So one consideration concerns whether you are interested in any sequences that are found, whatever their length. An expert option in the Sequence node permits you to set an upper limit on the length of sequences.
Consider the software transaction example used earlier, in which a customer first purchased SPSS Base and Regression Models, then later purchased Advanced Models, and then purchased Clementine. It could appear in tabular data format as follows:

Customer  Date          Base  Regression  Adv Models  Clementine  Decision Time
101       Feb 2, 2001   T     T           F           F           F
101       May 1, 2001   F     F           T           F           F
101       Dec 31, 2002  F     F           F           T           F
The same sequence in transactional data format would be:

Customer  Date          Purchase
101       Feb 2, 2001   Base
101       Feb 2, 2001   Regression
101       May 1, 2001   Adv Models
101       Dec 31, 2002  Clementine
In tabular format, it is clear that SPSS Base and Regression Models were purchased together (they are treated as an item set). In transactional format, the same items would be treated as an item set if the date field were specified as a Time field within the Sequence node. In addition, under the Expert tab, you can specify a timestamp tolerance value that is applied to the Time field to determine, for an ID, which records should be grouped into item sets.
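The grouping of same-timestamp transactional rows into item sets can be sketched in a few lines. This is an illustration of the idea, not the Sequence node's actual implementation; the dates are normalized to a sortable ISO form:

```python
from itertools import groupby

# Transactional rows: (customer, date, item) — the example data from the text
records = [
    (101, "2001-02-02", "Base"),
    (101, "2001-02-02", "Regression"),
    (101, "2001-05-01", "Adv Models"),
    (101, "2002-12-31", "Clementine"),
]

def to_sequences(rows):
    """Group transactional rows into an ordered list of item sets per ID.
    Rows sharing an ID and a timestamp form a single item set, mirroring
    what happens when a date field is specified as the Time field."""
    sequences = {}
    for (cid, _date), group in groupby(sorted(rows), key=lambda r: (r[0], r[1])):
        sequences.setdefault(cid, []).append({item for _, _, item in group})
    return sequences

seqs = to_sequences(records)
print(sorted(seqs[101][0]))   # ['Base', 'Regression'] — one item set
print(len(seqs[101]))         # 3 — three time points in the sequence
```

A timestamp tolerance, as offered under the Expert tab, would relax the equality test on the date so that near-simultaneous records also fall into one item set.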
Pruning options allow you to ignore full (contiguous) sequences or partial (items need not be contiguous) sequences that are contained in other reported sequences, which simplifies the results. Also, you can search only for sequences that start or end with certain items. This might be useful if you are primarily interested in sequences that lead to a certain web page or result. In short, both algorithms have advantages, which is why both are made available in Clementine (CaprI as an add-on algorithm).
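The distinction between full and partial containment comes down to whether the contained sequence's items must be adjacent. A sketch of the partial (non-contiguous) test, the harder of the two cases:

```python
def is_subsequence(short, long):
    """True if every item of `short` appears in `long` in the same
    order, though not necessarily contiguously. Testing membership
    against the iterator consumes it, which enforces the ordering."""
    remaining = iter(long)
    return all(item in remaining for item in short)

print(is_subsequence([90, 125, 210], [90, 110, 125, 195, 210]))  # True
print(is_subsequence([125, 90], [90, 110, 125]))                 # False — wrong order
```

A pruning pass would drop any reported sequence for which this test returns True against some longer reported sequence.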
Right-click the Table node above the Type node, then click Execute
Each service problem is identified by a unique ID value. The field Index1 records the sequence in which the diagnostic/repair steps were performed, and the Stage field contains the actual diagnostic/repair codes. All repair sequences should begin with code 90, and a successful repair has 210 as the final code (299 is used if the problem was not successfully resolved). The data file was presorted by Index1 within ID. The Sequence node has an option to sort the data prior to analysis (or the Sort node, located in the Record Ops palette, could be used).
Close the Table window
Double-click the Type node
Figure 8.3 Type Node for Sequence Detection
Even though numeric codes are used for the diagnostic/repair values in Stage, it is declared as type set. Sequence analysis could be done if the field was defined as numeric type, but its values would still be treated as type set (categorical). That is, 90 and 95 would be treated as two categories, not as similar numeric values. In this analysis, the content to be analyzed is contained in the single field: Stage. The field(s) containing the content can be of any type and any direction (if numeric they must be of type range). If there are multiple content fields, they all must be of the same type.
The field (here ID) that identifies the unit of analysis can also be of any type and can have any direction; here it is set to None. A time field, if used, must be numeric and can have any direction.
Close the Type node
Double-click the Sequence node
Figure 8.4 Sequence Node Dialog
The ID field defines the basic unit of analysis for the Sequence node. In our example, the unit of analysis is the service problem, and each problem has a unique value for the field named ID. A time field is not required; if no time field is specified, the data are assumed to be time ordered for a given ID. We specify Index1 as the time field for this analysis. Under the Expert tab you have additional controls based on the time field (for example, an event occurring more than a user-specified interval after the ID's previous event can be considered to begin a new sequence). The content fields contain the variables that constitute the sequences. In our example, the content is stored in a single field, but multiple fields can be analyzed.
If the data records are already sorted so that all records for an ID are contiguous, you can check the IDs are contiguous check box, in which case the Sequence node will not resort the data, saving resources (and your time). The Model and Expert tabs provide greater control over various aspects of the analysis. We illustrate these options by examining the Model settings, which include support and confidence minimum values.
Click the Model tab
Figure 8.5 Model Tab Settings
The minimum values for support and confidence are set to 20%, but as we learned from working with association rules, these values often need to be changed.
Click Execute to create the model
Click Browse on the context menu
When the generated Sequence rules node first appears, it lists the consequent and antecedent(s), but does not display support or confidence. To see these values:
Click the Show/Hide criteria button on the toolbar
You may need to maximize the Sequence rules window to see all columns
The Sequence node found 86 rules, which are presented in descending order by rule confidence. The second rule is a sequence that begins with code 90 (all actual repair/diagnostic sequences should start with 90), followed by code 125, then code 195, and ends in code 210 (successful resolution). This sequence was found for 163 IDs, which constitute 21.7% of the IDs (there are 750 IDs in the data). This is the support value. Thus, over one fifth of all service problems in the data showed this pattern. Of the cases containing this sequence of antecedents (90, then 125, then 195), code 210 followed 98.2% of the time. This is the confidence value.
Figure 8.6 Sequence Rules
Notice that codes 90 and 210 appear frequently in the rules. This is because almost all service problem sequences begin with 90 and end with 210. Someone with domain knowledge of this area could now examine the sequences to determine if there is anything
interesting or unexpected; for example, a sequence that should not occur given the nature of the diagnostic tests/repairs, or a repeating sequence. The sequence rule sets are ordered by confidence value (descending order). To view the most common sequences, we simply sort by support value.
Click the Sort by: dropdown list and select Support
The sequence rules can be sorted in a number of ways. For those interested in sequences beginning or ending with a particular event (for example, clicking on a specific web page), the sorts by First Antecedent, Last Antecedent, or Consequent would be of interest.
Figure 8.7 Sequence Rule Sets Sorted by Support
Code 110 appears in two of the three most frequent rules. The sequence 90 followed by 210 occurs in about 92% of the service problems, which we would expect in a very high proportion of the sequences. Code 299, which indicates the problem was not resolved, has not appeared. This is because it is relatively infrequent (fortunately so, for the business and customers). If we were interested in sequences containing 299, we would have to lower the minimum support to below 5%, the base rate for code 299.
A domain expert would be interested in the most frequent sequences, which describe the typical path a service problem follows. If some stages were more expensive or time consuming, they would attract particular attention. We will view the results in one other order.
Click the Sort by: dropdown list and select Number of items
The sequences are now sorted by the number of distinct items that appear in a sequence, in descending order. Thus the first sequence, with 212 instances, contains four items (90, 110, 125, and 210).
Figure 8.8 Sequence Rule Sets Sorted by Number of Items
To see an odd sequence:
Click the Sort by: button to change the sort order from descending to ascending
Scroll down a number of rows until you see the beginning of sequences with two antecedents (see Figure 8.9)
One of the sequences with only two items has an antecedent of 125 and an identical consequent of 125. This pattern occurs in 22% of the sequences. The sequence would be of interest because, ideally, a diagnostic/repair stage should not be repeated. Someone
familiar with the diagnostic/repair process would look into why this stage is repeating so often (erroneous test results at that stage, records not being forwarded properly, etc.) and modify the process to reduce it. Other repeating sequences may be present but do not meet the minimum support and confidence criteria.
Figure 8.9 Sequence with Identical Antecedent and Consequent
Next we view the model predictions.
Right-click the Table node attached to the Sequence generated model node
Click Execute
By default, the Sequence generated model node contains three prediction fields (prefixed with $S-), containing the three most confident predictions of codes that will appear later in the sequence, predicted from the sequence observed to that point. The confidence values for each prediction are stored in the fields prefixed with $SC-. The sequence value in the first record is stage 90 (for ID=1, Index1=1), which is the problem report. The most likely stage to occur later, given that stage 90 has occurred, is stage 210 with confidence .949. (Note: this rule can be seen in Figure 8.7.) Since most sequences end with stage 210, the second and third most confident predictions are, in some sense, more interesting for this analysis. Thus, the next most likely stage to occur later, given that stage 90 has occurred, is stage 110 with confidence .677. And the third most likely is stage 125. In this way, the three most confident future predictions, based on the observed sequence, are generated.

Examining the predictions for ID 1, notice that the most likely item to occur later can change as the observed sequence changes. This makes sense, since as more information becomes available about a sequence, additional rules can apply.
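The $S- and $SC- fields can be thought of as the result of applying every rule whose antecedents match the sequence observed so far and keeping the three most confident consequents. A sketch of that logic (the first two confidence values are those quoted above; the remaining rules and values are invented to show re-ranking as the sequence grows):

```python
# Hypothetical rule list: (antecedent sequence, consequent, confidence)
rules = [
    ((90,),     210, 0.949),   # confidence quoted in the text
    ((90,),     110, 0.677),   # confidence quoted in the text
    ((90,),     125, 0.610),   # invented for illustration
    ((90, 110), 125, 0.720),   # invented for illustration
]

def top_predictions(observed, rules, k=3):
    """Return up to k consequents from rules whose antecedents occur,
    in order (not necessarily contiguously), in `observed`; highest
    confidence first — a sketch of how $S-/$SC- fields are filled."""
    def matches(antecedent, observed):
        it = iter(observed)
        return all(a in it for a in antecedent)
    hits = [(conf, cons) for ant, cons, conf in rules if matches(ant, observed)]
    return [(cons, conf) for conf, cons in sorted(hits, reverse=True)[:k]]

print(top_predictions((90,), rules))
# [(210, 0.949), (110, 0.677), (125, 0.61)]
```

After stage 110 is also observed, the fourth rule fires and 125 moves up the ranking, which mirrors how the predictions for ID 1 change as more of the sequence is seen.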
Other Features
Sequence detection is often done on very large data sets, with tens of thousands of records. As file sizes grow, especially in the number of items, the number of potential sequences grows quickly, as does the necessary computation time. So in practice the number of cases (customers, web visits, problems reported) can be large, but the number of distinct items tends to be much smaller, as is the number of items to be associated together in a sequence. It can be difficult, therefore, to determine the right number and type of items a priori (for example, for a large retailer), so typically, as with all the automated methods, several different solutions may be tried with different item sets.
Model Understanding
One of the great strengths of sequence detection, as we found with association rules in general, is that its results are easily understood by anyone. The rules it finds can be expressed in natural language and don't involve any statistical testing.
Model Deployment
Model deployment is complicated by the fact that the items in a sequence need not be contiguous, so the logic needed to identify key sequences and generate predictions is more involved. Generally speaking, transaction databases do not easily identify sequences, especially if the items are not contiguous (uninterrupted). In Clementine, the generated Sequence node can be used to produce predictions, and this node, in turn, can generate a Clementine supernode that will create additional fields that support prediction. These can be deployed within the Clementine Solution Publisher. In addition, the results of sequence detection analysis may be directly actionable (e.g., make a web page that is found to lead to later registration more prominent to the visitor; investigate why going through a certain repair stage often leads to a return to that stage). These results are useful but are not deployed as prediction models.
References
Berry, Michael J. A. and Linoff, Gordon S. 1997. Data Mining Techniques for Marketing, Sales, and Customer Support. New York: Wiley.
Berry, Michael J. A. and Linoff, Gordon S. 2001. Mastering Data Mining: The Art and Science of Customer Relationship Management. New York: Wiley.
Bigus, Joseph P. 1996. Data Mining with Neural Networks. New York: McGraw-Hill.
Breiman, Leo. 2001. Statistical Modeling: The Two Cultures (with rejoinders). Statistical Science, 16:3, 199-231.
Cohen, Jacob, Cohen, Patricia, West, Stephen and Aiken, Leona. 2002. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Draper, Norman and Smith, Harry. 1998. Applied Regression Analysis (3rd ed.). New York: Wiley.
Fu, LiMin. 1994. Neural Networks in Computer Intelligence. New York: McGraw-Hill.
Gardner, E. S. 1985. Exponential Smoothing: The State of the Art. Journal of Forecasting, 4: 1-28.
Han, Jiawei and Kamber, Micheline. 2000. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.
Hastie, Trevor, Tibshirani, Robert and Friedman, Jerome H. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag.
Huberty, Carl J. 1994. Applied Discriminant Analysis. New York: Wiley.
Kim, Hyunjoong and Loh, Wei-Yin. 2001. Classification Trees with Unbiased Multiway Splits. Journal of the American Statistical Association, 96, 589-604.
Lachenbruch, P. A. 1975. Discriminant Analysis. New York: Hafner Press.
References R - 1
Lim, T., Loh, W. and Shih, Y. 1997. An Empirical Comparison of Decision Trees and Other Classification Methods. Technical Report 979, Department of Statistics, University of Wisconsin, Madison. Madison, WI.
Linoff, Gordon S. and Berry, Michael J. A. 2002. Mining the Web: Transforming Customer Data. New York: Wiley.
Loh, W. and Shih, Y. 1997. Split Selection Methods for Classification Trees. Statistica Sinica, 7:4.
Loh, W. and Vanichsetakul, N. 1988. Tree-Structured Classification via Generalized Discriminant Analysis. Journal of the American Statistical Association, 83: 715-724 (comments and rejoinder 725-728).
Mannila, Heikki, Smyth, Padhraic and Hand, David J. 2001. Principles of Data Mining. Cambridge, MA: MIT Press.