
Data Mining: Modeling

18749-001

SPSS v11.5; Clementine v7.0; AnswerTree 3.1; DecisionTime 1.1 Revised 9/26/2002 ss/mr

For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact SPSS Inc. 233 South Wacker Drive, 11th Floor Chicago, IL 60606-6412 Tel: (312) 651-3000 Fax: (312) 651-3668 SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials. The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412. TableLook is a trademark of SPSS Inc. Windows is a registered trademark of Microsoft Corporation. DataDirect, DataDirect Connect, INTERSOLV, and SequeLink are registered trademarks of MERANT Solutions Inc. Portions of this product were created using LEADTOOLS 1991-2000, LEAD Technologies, Inc. ALL RIGHTS RESERVED. LEAD, LEADTOOLS, and LEADVIEW are registered trademarks of LEAD Technologies, Inc. Portions of this product were based on the work of the FreeType Team (http://www.freetype.org). General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks or registered trademarks of their respective companies in the United States and other countries. Data Mining: Modeling Copyright 2002 by SPSS Inc. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.



Table of Contents
CHAPTER 1 INTRODUCTION
INTRODUCTION ........ 1-2
MODEL OVERVIEW ........ 1-3
VALIDATION ........ 1-6

CHAPTER 2 STATISTICAL DATA MINING TECHNIQUES


INTRODUCTION ........ 2-2
STATISTICAL TECHNIQUES ........ 2-3
LINEAR REGRESSION ........ 2-4
DISCRIMINANT ANALYSIS ........ 2-21
LOGISTIC AND MULTINOMIAL REGRESSION ........ 2-31
APPENDIX: GAINS TABLES ........ 2-38

CHAPTER 3 MARKET BASKET OR ASSOCIATION ANALYSIS


INTRODUCTION ........ 3-2
TECHNICAL CONSIDERATIONS ........ 3-3
RULE GENERATION ........ 3-4
APRIORI EXAMPLE: GROCERY PURCHASES ........ 3-5
USING THE ASSOCIATIONS ........ 3-12
APRIORI EXAMPLE: TRAINING COURSE PURCHASES ........ 3-15

CHAPTER 4 NEURAL NETWORKS


INTRODUCTION ........ 4-2
BASIC PRINCIPLES OF SUPERVISED NEURAL NETWORKS ........ 4-3
A NEURAL NETWORK EXAMPLE: PREDICTING CREDIT RISK ........ 4-11


CHAPTER 5 RULE INDUCTION AND DECISION TREE METHODS


INTRODUCTION ........ 5-2
WHY SO MANY METHODS? ........ 5-4
CHAID ANALYSIS ........ 5-6
A CHAID EXAMPLE: CREDIT RISK ........ 5-7
RULE INDUCTION (C5.0) ........ 5-22
A C5.0 EXAMPLE: CREDIT RISK ........ 5-22

CHAPTER 6 CLUSTER ANALYSIS


INTRODUCTION ........ 6-2
WHAT TO LOOK AT WHEN CLUSTERING ........ 6-3
A K-MEANS EXAMPLE: CLUSTERING SOFTWARE USAGE DATA ........ 6-5
CLUSTERING WITH KOHONEN NETWORKS ........ 6-14
A KOHONEN EXAMPLE: CLUSTERING PURCHASE DATA ........ 6-15

CHAPTER 7 TIME SERIES ANALYSIS


INTRODUCTION ........ 7-2
DATA ORGANIZATION FOR TIME SERIES ANALYSIS ........ 7-3
INTRODUCTION TO EXPONENTIAL SMOOTHING ........ 7-4
A DECISIONTIME FORECASTING EXAMPLE: DAILY PARCEL DELIVERIES ........ 7-5

CHAPTER 8 SEQUENCE DETECTION


INTRODUCTION TO SEQUENCE DETECTION ........ 8-2
TECHNICAL CONSIDERATIONS ........ 8-3
DATA ORGANIZATION FOR SEQUENCE DETECTION ........ 8-4
SEQUENCE DETECTION ALGORITHMS IN CLEMENTINE ........ 8-5
A SEQUENCE DETECTION EXAMPLE: REPAIR DIAGNOSTICS ........ 8-6

REFERENCES..........................................................................................R-1


Chapter 1 Introduction
Topics:
INTRODUCTION
MODEL OVERVIEW
VALIDATION


INTRODUCTION
This course focuses on the modeling stage of the data mining process. It will compare and review the analytic methods commonly used for data mining and illustrate these methods using SPSS software (SPSS, AnswerTree, DecisionTime, and Clementine). The course assumes that a business question has been formulated and that relevant data have been collected, organized, checked, and prepared; in short, it assumes that all the time-consuming preparatory work has been completed and that you are at the modeling stage of your project. For more details concerning what should be done during the earlier stages of a data mining project, see the SPSS Data Mining: Overview and Data Mining: Data Understanding and Data Preparation courses.

This chapter serves as a road map for the rest of the course. We try to place the various methods discussed within a framework and give you a sense of when to use which methods. The unifying theme is data mining, and we discuss in detail the analytic techniques most often used to support these efforts. The course emphasizes the practical issues of setting up, running, and interpreting the results of statistical and machine learning analyses. It assumes you have, or will have, business questions that require analysis, and that you know what to do with the results once you have them. Several of these techniques offer choices among specific methods, and the recommendations we make are based on what is known from the properties of the methods, Monte Carlo simulations, or empirical work. You should be aware from the start that in most cases there is no single method that will definitely yield the best results. However, the chapters that follow detail the specific methods and include sections that list research projects for which each method is appropriate, features and limitations of the method, and comments concerning model deployment. These should prove useful when you must decide on the method to apply to your problem.

Finally, the approach is practical, not mathematical. Relatively few equations are presented, and references are given for those who would like a more rigorous review of the techniques. Our goal is to provide you with a good sense of the properties of each method and how it is used and interpreted. The course does not strive for exhaustive detail: entire books have been written on topics we cover in a single chapter, and we are trying to present the main issues a practitioner will face. Analyses are run using different SPSS products, but the emphasis in this course is on understanding the characteristics of the methods and being able to interpret the results. Thus we will not discuss data definition and general program operation issues. We do present instructions to perform the analyses, but more information than is presented here is needed to master the software programs used. To provide this depth, SPSS offers operational courses for the products used in this course.


MODEL OVERVIEW
In this section we provide brief descriptions and comparisons of the data mining analysis and modeling methods that will be discussed in this course. Recall from your statistics courses that inferential statistics have two key features. They require that you specify a hypothesis to test (for example, that more satisfied customers will be more likely to make additional purchases), and they allow you to make inferences back to the population from the particular sample data you are studying. Because of these features, it isn't formally necessary to create training and validation (test) data sets when using inferential statistics. The validation portion of the analysis is done with standard test statistics, such as F, t, or chi-square, providing a probability of the hypothesis under test being correct. However, given the accepted data-mining methodology, you may decide to create a validation data set even when using inferential techniques. There is generally no harm in doing so, especially with a sufficient amount of data where the training and validation sets can both be reasonably large. Here is a listing of some inferential statistical methods commonly used in data mining projects. We will not define them here but leave that for a later section. The type of variables each requires is also listed.

GENERAL TECHNIQUE (Inferential Statistics)     PREDICTOR VARIABLES        OUTCOME VARIABLE
Discriminant Analysis                          Continuous or dummies*     Categorical
Linear Regression (and ANOVA)                  Continuous or dummies      Continuous
Logistic and Multinomial Regression            Continuous or dummies      Categorical
Time Series Analysis                           Continuous or dummies      Continuous

(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region [north, south, east and west], when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code were north and 0 otherwise.)

As is common for inferential statistics, all of these techniques are used to make predictions of a dependent variable. Some have been used for many years, such as linear regression or discriminant analysis. Inferential statistical techniques often make stringent assumptions about the data, such as normality, uncorrelated errors, or homogeneity of variance. They are more restrictive than non-inferential techniques, which can be a disadvantage. However, they provide rigorous tests of hypotheses unavailable with more automated methods of analysis. Although these methods are not always mentioned in data mining books and articles, you need to be aware of them because they are often exactly what is necessary to answer a particular question. For instance, to predict the amount of revenue, in dollars, that a new customer is likely to provide in the next two years, linear regression could be a natural choice, depending on the available predictor variables and the nature of the relationships.

GENERAL TECHNIQUE (Data Mining)     PREDICTOR VARIABLES        OUTCOME VARIABLE
Decision Trees (Rule Induction)     Continuous or dummies*     Categorical (some allow Continuous)
Neural Networks                     Continuous or dummies      Categorical or Continuous

The key difference for most users between inferential and non-inferential techniques is whether hypotheses need to be specified beforehand. In the latter methods, this is not normally required, as each is semi- or completely automated as it searches for a model. Nonetheless, in all non-inferential techniques, you clearly need to specify a list of variables as inputs to the procedure, and you may have to specify other details, depending on the exact method. As we discussed in the previous courses in the SPSS Data Mining sequence, data mining is not a mindless activity; even here, you need a plan of approach, a research design, to use these techniques wisely. Notice that the inferential statistical methods are not distinguished from the data mining methods in terms of the types of variables they allow. Instead, data mining methods, such as decision trees and neural networks, are distinguished by making fewer assumptions about the data (for example, normality of errors). In many instances both classes of methods can be applied to a given prediction problem. Some data mining methods do not involve prediction, but instead search for groupings or associations in the data. Several of these methods are listed below along with the types of analysis you can do with them.

GENERAL TECHNIQUE                      ANALYSIS
Cluster Analysis                       Uses continuous or categorical variables to create cluster memberships; no predefined outcome variable.
Market Basket/Association Analysis     Uses categorical variables to create associations between categories; no outcome variable required.
Sequence Detection                     Uses categorical variables in data sorted in time order to discover sequences in data; no outcome variable required, but there may be interest in specific outcomes.

Finally, discussions of data mining mention the tasks of classification, affinity analysis, prediction or segmentation. Below we group the data mining techniques within these categories.

Affinity/Association: These methods attempt to find items that are closely associated in a data file, with the archetypal case being shopping patterns of consumers. Market basket analysis and sequence detection fall into this category.

Classification/Segmentation: These methods attempt to classify customers into discrete categories that have already been defined (for example, customers who stay and those who leave), based on a set of predictors. Several methods are available, including decision trees, neural networks, and sequence detection (when data are time structured). Note that logistic regression and discriminant analysis are inferential techniques that accomplish this same task.

Clustering/Segmentation: Notice that we have repeated the word segmentation. This is because segmentation is used in two senses in data mining. Its second meaning is to create natural clusters of objects, without using an outcome variable, that are similar on various characteristics. Cluster analysis and Kohonen networks accomplish this task.

Prediction/Estimation: These methods predict a continuous outcome variable, as opposed to classification methods, which work with discrete outcomes. Neural networks fall into this group. Decision tree methods can work with continuous predictors, but they split them into discrete ranges as the tree is built. Memory-based reasoning techniques (not covered in this course) can also predict continuous outcomes. Regression is the inferential method likely to be used for this purpose.

The descriptions above are quite simple and hide a wealth of detail that we will consider as we review the techniques. More than one specific method is usually available for a general technique. So, to cluster data, K-means clustering, Two-step clustering, and Kohonen networks (a form of neural network) could be used, with the choice of which to use depending on the type of data, the availability of software, the ease of understanding desired, the speed of processing, and so forth.


VALIDATION
Since most data-mining methods do not depend on specific data distribution assumptions (for example, normality of errors) to draw inferences from the sample to the population, validation is strongly recommended. It is usually done by fitting the model to a portion of the data (called the Training data) and then applying the predictions to, and evaluating the results with, the other portion of the data (called the Validation data; note that some authors refer to this as Test data, but, as we will see, Test data has a specific meaning in neural network estimation). In this way, the validity of the model is established by demonstrating that it applies to (fits) data independent of that used to derive the model. Statisticians often recommend such validation for statistical models, but it is crucial for more general (less distribution bound) data mining techniques. There are several methods of performing validation.

Holdout Sample
This method was described above. The data set is split into two parts: training and validation files. For large files it might be a 50/50 split, while for smaller files more records are typically placed in the training set. Modeling is performed on the training data, but fit evaluation is done on the separate validation data.
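To make the holdout idea concrete outside of SPSS, here is a minimal Python sketch using scikit-learn. The predictors X and outcome y are randomly generated stand-ins, not fields from the course data, and the 50/50 split mirrors the large-file case described above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # illustrative predictors
y = X @ [1.0, -2.0, 0.5] + rng.normal(size=200)    # illustrative outcome

# 50/50 split into training and validation files
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=42)

model = LinearRegression().fit(X_train, y_train)        # model built on training data
print("validation r-square:", model.score(X_valid, y_valid))  # fit evaluated on holdout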

N-Fold Validation
If the data file is small, reserving a holdout sample may not be feasible (the training sample may be too small to obtain stable results). In this case n-fold validation may be done. Here the data set is divided into a number of groups of equal sample size. Let's use 10 groups for the example. The first group is held out from the analysis, which is based on the other 9 groups (or 9/10ths of the data), and is used as the validation sample. Next the second group is held out from the analysis, again based on the other 9 groups, and is used as the validation sample. This continues until each of the 10 groups has served as a validation sample. The validation results from each of these samples are then pooled. This has the advantage of providing a form of validation in the presence of small samples, but since any given data record is used in 9 of the 10 models, there is less than complete independence. A second problem is that since 10 models are run there is no single model result (there are 10). For this reason, n-fold validation is generally used to estimate the fit or accuracy of a model with small data files and not to produce the model coefficients or rules. Some procedures extend this principle to base the model on all but one observation (using fast algorithms), keeping a single record as the hold-out. Generally speaking, in terms of computing resources, only closed-form models that involve no iteration (like regression or discriminant) can afford this.
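A 10-fold version of the same idea can be sketched in Python; again the X and y arrays are illustrative placeholders. Note that the procedure returns one fit statistic per fold, which is then pooled (averaged), matching the point above that n-fold validation estimates accuracy rather than producing a single set of model coefficients.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # illustrative predictors
y = X @ [1.0, -2.0, 0.5] + rng.normal(size=200)    # illustrative outcome

# 10-fold validation: each record is held out exactly once; the fold scores are pooled
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print("per-fold r-square:", np.round(scores, 3), "mean:", round(scores.mean(), 3))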


Validate with Other Models


Since different data mining models often can be applied to the same data, you would have greater confidence in your results if different methods led to the same conclusions. This is not to say that the results should be identical, since the models do differ in their assumptions and approach. But you would expect that important predictors repeat across methods and have the same general relationship to the outcome.

Validate with Different Starting Values


Neural networks usually begin with randomly assigned weights and then, hopefully, converge to the optimum solution. If analyses run with different starting values for the weights produce the same solution, then you would have greater confidence in it.
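As a rough illustration of this check (neural networks themselves are covered in Chapter 4), the sketch below retrains a small network with different random seeds and compares the validation fit; scikit-learn's MLPRegressor and the generated data are assumptions for illustration only. Stable results across seeds increase confidence in the solution.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                      # illustrative predictors
y = X @ [1.0, -2.0, 0.5] + rng.normal(size=300)    # illustrative outcome
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

# different random_state values give different starting weights
for seed in (1, 2, 3):
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=seed)
    net.fit(X_tr, y_tr)
    print("seed", seed, "validation r-square:", round(net.score(X_va, y_va), 3))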

Domain Validation
Do the model results make sense within the business area being studied? Here a domain expert, someone who understands the business and data, examines the model results to determine if they make sense, and to decide if they are interesting and useful, as opposed to obvious and trivial.


Chapter 2 Statistical Data Mining Techniques


Topics:
STATISTICAL TECHNIQUES
LINEAR REGRESSION
DISCRIMINANT ANALYSIS
LOGISTIC AND MULTINOMIAL REGRESSION
APPENDIX: GAINS TABLES


INTRODUCTION
In this chapter we consider the various inferential statistical techniques that are commonly used in data mining. We include a detailed example of each, as well as discussions about typical sample sizes, whether the method can be automated, and how easily the model can be understood and deployed.

As you work on the tasks we've cited in the last chapter, you should also be thinking about which data mining techniques to use to answer those questions. Research isn't done step by step, in some predefined order, as we are taught in textbooks. Instead, all phases of a data mining project should be under review early in the process. This is especially critical for the data mining techniques you plan to employ, for at least three reasons. First, each data mining technique is suitable for only some types of analysis, not all, so the research question you have defined can't necessarily be answered by just any technique. If you want to answer a question that requires, say, market basket analysis (discussed in Chapter 3), and you have little expertise in this procedure, you'll need to prepare ahead of time, conceivably even acquire additional software, so you are ready to begin analysis when the data are ready. Second, some techniques require more data than others, or data of a particular kind, so you will need to have these conditions in mind when you collect the data. And third, some techniques are more easily understandable than others, and some models are more readily retrained if the environment changes rapidly, both of which might affect your choice of technique.

In this chapter we provide several different frameworks or classification schemes by which to understand and conceptualize the various inferential data mining techniques available in SPSS and other software. Examples of each technique will be given, including research questions or projects suitable for that type of analysis. Although details for running various analyses are given in the chapter, the emphasis is on setting up the basic analysis and interpreting the results. For this reason, all available options and variations will not be covered in this class. Also, such steps as data definition and data exploration are assumed to be completed prior to the modeling stage. In short, the goal of the chapter is not to exhaustively cover each data mining procedure in SPSS, but to present and discuss the core features needed for most analyses. (For more details on specific procedures, you may attend separate SPSS, AnswerTree, DecisionTime, and Clementine application courses.) Instead, we provide an overview of these methods with enough detail for you to begin to make an informed choice about which method will be appropriate for your own data mining projects, to set up a typical analysis, and to interpret the results.


STATISTICAL TECHNIQUES
Recall that inferential statistics have two key features. They require that you specify a hypothesis to test (for example, that more satisfied customers will be more likely to make additional purchases), and they allow you to make inferences back to the population from the particular sample data you are studying. Below is the listing, from Chapter 1, of the inferential methods commonly used in data mining projects. We will define them in later sections of this chapter. The type of variables each requires is also listed.

GENERAL TECHNIQUE                       PREDICTOR VARIABLES        OUTCOME VARIABLE
Discriminant Analysis                   Continuous or dummies*     Categorical
Linear Regression (and ANOVA)           Continuous or dummies      Continuous
Logistic and Multinomial Regression     Continuous or dummies      Categorical
Time Series Analysis                    Continuous or dummies      Continuous

(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region (north, south, east and west), when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code was north and 0 otherwise.) As we discuss the techniques, we also provide information on whether they can be automated or not, their ease of understanding and typical size of data files, plus other important traits. After this brief glimpse at the various techniques, we turn next to a short discussion of each, including examples of research questions it can answer and where each can be found, if available, in SPSS software. You are probably already familiar with several of the inferential statistics methods we consider here. Our emphasis is on practical use of the techniques, not on the theory underlying each one.


LINEAR REGRESSION
Linear regression is a method familiar to just about everyone these days. It is the classic linear model technique, and is used to predict an outcome variable that is interval or ratio with a set of predictors that are also interval or ratio. In addition, categorical predictor variables can be included by creating dummy variables. Linear regression is available in SPSS under the Analyze..Regression menu and is also available in Clementine. Linear regression, of course, assumes that the data can be modeled with a linear relationship. As an illustration, Figure 2.1 exhibits a scatterplot depicting the relationship between the number of previous late payments for bills and the credit risk of defaulting on a new loan. Superimposed on the plot is the best-fit regression line. The plot may look a bit unusual because of the use of sunflowers, which are used to represent the number of cases at a point. Since credit risk and late payments are measured as whole integers, the number of discrete points here is relatively limited given the large file size (over 2,000 cases).

Figure 2.1 Scatterplot of Late Payments and Credit Risk

Although there is a lot of spread around the regression line, it is clear that there is a trend in the data such that more late payments are associated with a greater credit risk. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form. Thus most users of linear regression use the numeric output.


Basic Concepts of Regression


Earlier we pointed out that to the eye there seems to be a positive relation between credit risk and the number of late payments. However, it would be more useful in practice to have some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation, and could be used to predict values of one variable given knowledge of the other. A straight line is a very simple function, and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest another. Also, since the point of much research involves prediction, a prediction equation is valuable. However, the value of the equation would be linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.

The Regression Equation and Fit Measure


In the plot above, credit risk is placed on the Y (vertical) axis and the number of late payments appears along the X (horizontal) axis. If we are interested in credit risk as a function of the number of late payments, we consider credit risk to be the dependent variable and number of late payments the independent or predictor variable. A straight line is superimposed on the scatterplot along with the general form of the equation:

Y = B*X + A

Here, B is the slope (the change in Y per one unit change in X) and A is the intercept (the value of Y when X is zero). Given this, how would one go about finding a best-fitting straight line? In principle, there are various criteria that might be used: minimizing the mean deviation, mean absolute deviation, or median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is the one that minimizes the sum of the squared deviations of each point about the line. Returning to the plot of credit risk and number of late payments, we might wish to quantify the extent to which the straight line fits the data. The fit measure most often used, the r-square measure, has the dual advantages of falling on a standardized scale and having a practical interpretation. The r-square measure (which is the correlation squared, or r², when there is a single predictor variable, and thus its name) is on a scale from 0 (no linear association) to 1 (perfect prediction). Also, the r-square value can be interpreted as the proportion of variation in one variable that can be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the variation in one variable if we know values of the other. You can think of this value as a measure of the improvement in your ability to predict one variable from the other (or others if there is more than one independent variable).
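For readers who want to see the least-squares arithmetic outside of SPSS, the short Python sketch below fits a simple regression line and computes r-square with numpy. The late_payments and credit_risk arrays are illustrative stand-ins, not the credit data shown in Figure 2.1.

import numpy as np

# illustrative data: number of late payments (X) and a credit risk score (Y)
late_payments = np.array([0, 1, 1, 2, 3, 4, 5, 6])
credit_risk   = np.array([12, 15, 14, 20, 22, 27, 30, 33])

# least-squares slope (B) and intercept (A): minimizes the sum of squared deviations
B, A = np.polyfit(late_payments, credit_risk, deg=1)

# r-square is the squared correlation when there is a single predictor
r = np.corrcoef(late_payments, credit_risk)[0, 1]
print("Y = %.2f*X + %.2f, r-square = %.3f" % (B, A, r ** 2))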

Multiple regression represents a direct extension of simple regression. Instead of a single predictor variable (Y = B*X + A), multiple regression allows for more than one independent variable in the prediction equation:

Y = B1*X1 + B2*X2 + B3*X3 + . . . + A

While we are limited in the number of dimensions we can view in a single plot (SPSS can build a 3-dimensional scatterplot), the regression equation allows for many independent variables. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent variables in predicting the dependent measure.

Residuals and Outliers


Viewing the plot, we see that many points fall near the line, but some are more distant from it. For each point, the difference between the value of the dependent variable and the value predicted by the equation (value on the line) is called the residual. Points above the line have positive residuals (they were under-predicted), those below the line have negative residuals (they were over-predicted), and a point falling on the line has a residual of zero (perfect prediction). Points having relatively large residuals are of interest because they represent instances where the prediction line did poorly. As we will see shortly in our detailed example, large residuals (gross deviations from the model) have been used to identify data errors or possible instances of fraud (in application areas such as insurance claims, invoice submission, telephone and credit card usage). In SPSS, the Regression procedure can provide information about large residuals, and also present them in standardized form. Outliers, or points far from the mass of the others, are of interest in regression because they can exert considerable influence on the equation (especially if the sample size is small, which is rarely the case in data mining). Also, outliers can have large residuals and would be of interest for this reason as well. While not covered in this class, SPSS can provide influence statistics to aid in judging whether the equation was strongly affected by an observation and, if so, to identify the observation.
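Continuing the illustrative arrays from the previous sketch, the lines below compute residuals, standardize them, and flag the gross deviations; the data and the 2-standard-deviation cutoff are assumptions for illustration, not values from the course file.

import numpy as np

late_payments = np.array([0, 1, 1, 2, 3, 4, 5, 6])
credit_risk   = np.array([12, 15, 14, 20, 22, 27, 30, 33])
B, A = np.polyfit(late_payments, credit_risk, deg=1)

predicted = B * late_payments + A
residuals = credit_risk - predicted                          # positive = under-predicted
z = (residuals - residuals.mean()) / residuals.std(ddof=1)   # standardized residuals
print("standardized residuals:", np.round(z, 2))
print("cases with |standardized residual| > 2:", np.where(np.abs(z) > 2)[0])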

Assumptions
Regression is usually performed on data for which the dependent and independent variables are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (the values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). SPSS can provide summaries and plots useful in evaluating these latter issues. One special case of the assumptions involves the interval scale nature of the independent variable(s). A variable coded as a dichotomy (say 0 and 1) can technically be considered as an interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a variable's only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same change throughout the scale. Thus dichotomous variables (e.g., gender) can be used as predictor variables in regression. This also permits the use of categorical predictor variables if they are converted into a series of dichotomous variables; this technique is called dummy coding and is considered in most regression texts (Draper and Smith (1998), Cohen, Cohen, West and Aiken (2002)).
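As a minimal sketch of dummy coding outside of SPSS, the Python lines below convert an illustrative region field (the values are placeholders, not course data) into 0/1 columns with pandas.

import pandas as pd

# illustrative data: a categorical region field to be dummy coded
df = pd.DataFrame({"region": ["north", "south", "east", "west", "north"]})

# one 0/1 column per category; drop_first=True drops one category as the
# reference level so the dummies are not perfectly redundant in a regression
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(pd.concat([df, dummies], axis=1))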

An Example: Error or Fraud Detection in Claims


To illustrate linear regression we turn to a data set containing insurance claims for a single medical treatment performed in a hospital (in the US for a single DRG or diagnostic related group). In addition to the claim amount, the data file also contains patient age (Age), length of hospital stay (Los) and a severity of illness category (Asg). This last field is based on several health measures, and higher category scores indicate greater severity of the illness. The plan is to build a regression model that predicts the total claims amount for a patient on the basis of length of stay, severity of illness and patient age. Assuming the model fits, we are then interested in those patients that the model predicts poorly. Such cases can simply be instances of poor model fit, or the result of predictors not included in the model, but they also might be due to errors on the claims form or fraudulent entries. Thus we are approaching the problem of error or fraud detection by identifying exceptions to the prediction model. Such exceptions are not necessarily instances of fraud, but since they are inconsistent with the model, they may be more likely to be fraudulent or contain errors. Some organizations perform random audits on claims applications and then classify them as fraudulent or not. Under these circumstances, predictive models can be constructed that attempt to correctly classify new claims applications (logistic, discriminant, rule induction and neural networks have been used for this purpose). However, when such an outcome field is not available, fraud detection involves searching for and identifying exceptional instances. Here, an exceptional instance is one that the model predicts poorly. We use regression to build the model; if there were reason to believe the model were more complex (for example, contained nonlinear relations) then neural networks could be applied.

A Note Concerning Data Files, Variable Names and Labels in SPSS

In this course guide, variable names (not labels) in alphabetic order are displayed in SPSS dialog boxes. To set your machine to match this display, click as follows within SPSS:

Click Edit..Options
Click the General tab
Click the Display names option button in the Variable Lists section
Click the Alphabetical option button in the Variable Lists section
Click OK

Also, files are assumed to be located in the c:\Train\DM_Model directory. They can be copied from the floppy accompanying this guide (or from the CD-ROM containing this guide). If you are running SPSS Server (you can check by clicking File..Switch Server from within SPSS), then files used with SPSS should be copied to a directory that can be accessed from (is mapped to) the server.

To develop a regression equation predicting claims amount based on hospital length of stay, severity of illness group and age using SPSS:

Click File..Open..Data (switch to the c:\Train\DM_Model directory if necessary)
Double click on InsClaims
Click Analyze..Regression

This chapter will discuss two choices: linear regression, which performs simple and multiple linear regression, and logistic regression (Binary). Curve Estimation will invoke the Curvefit procedure, which can apply up to 16 different functions relating two variables. Binary logistic regression is used when the dependent variable is a dichotomy (for example, when predicting whether a prospective customer makes a purchase or not). Multinomial logistic regression is appropriate when you have a categorical dependent variable with more than two possible values. Ordinal regression is appropriate if the outcome variable is ordinal (rank ordered). Probit analysis, nonlinear regression, weight estimation (used for weighted least squares analysis), 2-Stage least squares, and optimal scaling are not generally used for data mining and so will not be discussed further here.

Figure 2.2 Regression Menu

We will select Linear to perform multiple linear regression, then specify claim as the dependent variable and age, asg (severity level) and length of stay (los) as the independent variables.

Click Linear from the Regression menu
Move claim to the Dependent: list box
Move age, asg and los to the Independent(s): list box

Figure 2.3 Linear Regression Dialog Box

Since our goal is to identify exceptions to the regression model, we will ask for residual plots and information about cases with large residuals. Also, the Regression dialog box allows many specifications; here we will discuss the most important features.

Note on Stepwise Regression

With such a small number of predictor variables, we will simply add them all into the model. However, in the more common situation of many predictor variables (most insurance claims forms would contain far more information) a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of the business expert (here perhaps a medical expert). In addition, an option may be chosen to select, from a larger set of independent variables, those that in some statistical sense are the best predictors (Stepwise method).

The Selection Variable option permits cross-validation of regression results. Only cases whose values meet the rule specified for a selection variable will be used in the regression analysis, yet the resulting prediction equation will be applied to the other cases. Thus you can evaluate the regression on cases not used in the analysis, or apply the equation derived from one subgroup of your data to other groups. The importance of such validation in data mining is a repeated theme in this course.

While SPSS will present standard regression output by default, many additional (and some of them quite technical) statistics can be requested via the Statistics dialog box. The Plots dialog box is used to generate various diagnostic plots used in regression, including a residual plot in which we have interest. The Save dialog box permits you to add new variables to the data file containing such statistics as the predicted values from the regression equation, various residuals and influence measures. We will create these in order to calculate our own percentage deviation field. Finally, the Options dialog box controls the criteria when running stepwise regression and choices in handling missing data (the SPSS Missing Values option provides more sophisticated methods of handling missing values). Note that by default, SPSS excludes a case from regression if it has one or more values missing for the variables used in the analysis.

Residual Plots
While we can run the multiple regression at this point, we will request some diagnostic plots involving residuals and information about outliers. A residual is the difference (signed) between the actual value of the dependent variable and the value predicted by the model. Residuals can be used to identify large errors in prediction or cases poorly fit by the model. By default no residual plots will appear. These options are explained below.

Click the Plots pushbutton
Within the Plots dialog box:
Check Histogram in the Standardized Residual Plots area

Figure 2.4 Regression Plots Dialog Box

The options in the Standardized Residual Plots area of the dialog box all involve plots of standardized residuals. Ordinary residuals are useful if the scale of the dependent variable is meaningful, as it is here (claim amount in dollars). Standardized residuals are helpful if the scale of the dependent is not familiar (say a 1 to 10 customer satisfaction scale). That is, it may not be clear to the analyst just what constitutes a large residual: is an over-prediction of 1.5 units a large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals expressed in standard deviation units) are very useful because large prediction errors can be easily identified. If the errors follow a normal distribution, then standardized residuals greater than 2 (in absolute value) should occur in about 5% of the cases, and those greater than 3 (in absolute value) should happen in less than 1% of the cases. Thus standardized residuals provide a norm against which one can judge what constitutes a large residual. Recall that the F and t tests in regression assume that the residuals follow a normal distribution.

Click Continue

Next we will look at the Statistics dialog box, which contains options concerning Casewise Diagnostics. When this option is checked, Regression will list information about all cases whose standardized residuals are more than 3 standard deviations from the line. This outlier criterion is under your control.

Click the Statistics pushbutton
Click the Casewise diagnostics check box in the Residuals area

Figure 2.5 Regression Statistics Dialog Box

By requesting this option we will obtain a listing of those records that the model predicts poorly. When dealing with a very large data file, which may have many outliers, such a list is cumbersome. It would be more efficient to save the residual value (standardized or not) as a new field, then select the large residuals and write these cases to a new file or add a flag field to the main database. We create these new fields below.

Click Continue
Click the Save pushbutton
Click the check boxes for Unstandardized Predicted Values, Unstandardized and Standardized Residuals

Figure 2.6 Saving Predicted Values and Errors

Click Continue, then click OK

Now we examine the results.

Figure 2.7 Model Summary and Overall Significance Tests

After listing the dependent and independent variables (not shown), Regression provides several measures of how well the model fits the data. First is the multiple R, which is a generalization of the correlation coefficient. If there are several independent variables (our situation) then the multiple R represents the unsigned (positive) correlation between the dependent measure and the optimal linear combination of the independent variables. Thus the closer the multiple R is to 1, the better the fit. As mentioned earlier, the r-square measure can be interpreted as the proportion of variance of the dependent measure that can be predicted from the independent variable(s). Here it is about 32%, which is far from perfect prediction, but still substantial. The adjusted r-square represents a technical improvement over the r-square in that it explicitly adjusts for the number of predictor variables, and as such is preferred by many analysts. However, it is a more recently developed statistic and so is not as well known as the r-square. Generally, they are very close in value; in fact, if they differ dramatically in multiple regression, it is a sign that you have used too many predictor variables relative to your sample size, and the adjusted r-square value should be more trusted. In our results, they are very close.

While the fit measures indicate how well we can expect to predict the dependent variable or how well the line fits the data, they do not tell whether there is a statistically significant relationship between the dependent and independent variables. The analysis of variance table presents technical summaries (sums of squares and mean square statistics), but here we refer to variation accounted for by the prediction equation. We are interested in determining whether there is a statistically significant (non-zero) linear relation between the dependent variable and the independent variable(s) in the population. Since our analysis contains three predictor variables, we test whether any linear relation differs from zero. The significance value accompanying the F test gives us the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as what we obtained if there were no linear relations in the population. The result is highly significant (significance probability less than .0005, or 5 chances in 10,000; the table value is rounded to .000). Now that we have established there is a significant relationship between the claims amount and one or more predictor variables, and obtained fit measures, we turn to interpreting the regression coefficients. Here we are interested in verifying that several expected relationships hold: (1) claims will increase with length of stay, (2) claims will increase with increasing severity of illness, and (3) claims will increase with age. Strictly speaking, this step is not necessary in order to identify cases that are exceptional. However, in order to be confident in the model, it should make sense to a domain expert. Since interpretation of regression models can be made directly from the estimated regression coefficients, we turn to those next.

Figure 2.8 Estimated Regression Coefficients

The first column contains a list of the independent variables plus the intercept (constant). Although the estimated B coefficients are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each line to determine which independent variables are significantly related to the outcome measure. Since three variables are in the equation, we are testing if there is a linear relationship between each independent variable and the dependent measure after adjusting for the effects of the two other independent variables. Looking at the significance values we see that all three predictors are highly significant (significance values are .004 or less). If any of the variables were not found to be significant, you would typically rerun the regression after removing those variables.

The column labeled B contains the estimated regression coefficients we would use to deploy the model via a prediction equation. The coefficient for length of stay indicates that, on average, each additional day spent in the hospital was associated with a claims increase of about $1,106. The coefficient for admission severity group tells us that each one-unit increase in the severity code is associated with a claims increase of $417. Finally, the age coefficient of -33 suggests that claims decrease, on average, by $33 as age increases one year. This is counterintuitive and should be examined by a domain expert (here a physician). Perhaps the youngest patients are at greater risk. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results). Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice. The constant or intercept of $3,027 indicates that someone with 0 days in the hospital, in the least severe illness category (0) and at age 0 would be expected to file a claim of $3,027. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but still may be needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed; the assumption is that the same pattern continues. Here it clearly cannot!

The Standard Error (of B) column contains standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as a Statistics option). In our example, the regression coefficient for length of stay is $1,106 and the standard error is about $104. Thus we would not be surprised if in the population the true regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000. Betas are standardized regression coefficients and are used to judge the relative importance of each of several independent variables. They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the independent variables, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claims amount, followed by severity group and age. Betas typically range from -1 to 1, and the further from 0, the more influential the predictor variable. Thus if we wish to predict claims based on length of stay, severity code and age, the formula would use the B coefficients:

Predicted Claims = $1,106*length of stay + $417*severity code - $33*age + $3,027
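Because the deployed model is just this equation, it can be scored anywhere. The Python sketch below applies the rounded coefficients reported above; treat the function and the example patient values as illustrative, since the exact (unrounded) coefficients would come from the SPSS output.

# scoring sketch using the rounded B coefficients reported above (illustrative)
def predicted_claim(length_of_stay, severity_code, age):
    return 1106.0 * length_of_stay + 417.0 * severity_code - 33.0 * age + 3027.0

# example: a 4-day stay, severity category 2, 60-year-old patient
print(predicted_claim(4, 2, 60))   # about $6,305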

Points Poorly Fit by Model


The motivation for this analysis is to detect errors or possible fraud by identifying cases that deviate substantially from the model. As mentioned earlier, these need not be the result of errors or fraud, but they are inconsistent with the majority of cases and thus merit scrutiny. We first turn to a list of cases whose residuals are more than three standard deviations from 0 (a residual of 0 indicates the model perfectly predicts the outcome).

Figure 2.9 Outliers

There are two cases for which the claims value is more than three standard deviations from the regression prediction. Both are about $6,000 more than expected from the model. Note that they are 5.5 and 6.1 standard deviations away from the model predictions. These would be the claims to examine more carefully. The case sequence numbers for these records appear in the listing; an identification field could be substituted (through the Case Labels box within the Linear Regression dialog).

Figure 2.10 Histogram of Residuals

This histogram of the standardized residuals presents the overall distribution of the errors. It is clear that all large residuals are positive (meaning the model under-predicted the claims value). Case (record) identification is not available in the histogram, but since the standardized residuals were added to the data file, they can be easily selected and examined.


Calculating a Percent Error


Instead of standardizing the residuals, analysts may prefer to express the residual as a percent deviation from the prediction. Such a measure may be easier to communicate to a wider audience.

Move to the Data Editor window
Click Transform..Compute
Type presid in the Target text box
Enter the following into the Expression text box: 100 * (claim - pre_1) / pre_1

Figure 2.11 Compute Dialog Box with Percentage Deviation from Model

Each case's deviation from the model (claim - pre_1) is divided by the model prediction (/ pre_1) and converted to a percent (100 *).

Click OK
Scroll to the right in the Data Editor window
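The same calculation can be sketched in Python with pandas; the small DataFrame and the 25 percent cutoff below are illustrative assumptions, with pre_1 standing in for the predicted value that SPSS saves to the data file.

import pandas as pd

# illustrative frame standing in for the claims data with the saved prediction (pre_1)
df = pd.DataFrame({"claim": [5200.0, 7100.0, 3900.0],
                   "pre_1": [4800.0, 5200.0, 4100.0]})

# same computation as the SPSS Compute dialog: percent deviation from the model
df["presid"] = 100 * (df["claim"] - df["pre_1"]) / df["pre_1"]

# flag claims that exceed the model prediction by more than, say, 25 percent
print(df[df["presid"] > 25])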

Figure 2.12 Percent Deviation Field

Extreme values on this percent deviation field can also be used to identify exceptional claims. While we won't pursue it here, a histogram would display the distribution of the deviations, and cases with extreme values could be selected for closer examination. Unusual values could appear at both the high and low ends, with low values indicating the claim was much less than predicted by the model. These might be examined as well, since they might reflect errors or suggest less expensive variations on the treatment. In this section, we offered the search for deviations from a model as a method to identify data errors or possible fraud. It would not detect, of course, fraudulent claims consistent with the model prediction. In actual practice, such models are usually based on a much greater number of predictor variables, but the principles, whether using regression or more complex models such as neural networks, are largely the same.

Appropriate Research Projects


Other examples of questions for which linear regression is appropriate are:

Predict expected revenue in dollars from a new customer based on customer characteristics.
Predict sales revenue for a store (with a sufficiently large number of stores in the database).
Predict waiting time on hold for callers to an 800 number.


Other Features
There is no limit to the size of data files used with linear regression, but just as with discriminant, most uses of regression limit the number of predictors to a manageable number, say under 50 or so. As before, there is then no reason for extremely large file sizes. The use of stepwise regression is quite common. Since this involves selection of a few predictors from a larger set, it is recommended that you validate the results with a validation data set when you use a stepwise method. Although this technique is called linear regression, with the use of suitable transformations of the predictors, it is possible to model non-linear relationships. However, more in-depth knowledge is needed to do this correctly, so if you expect nonlinear relationships to occur in your data, you might consider using neural networks or classification and regression trees, which handle these more readily, if differently.

Model Understanding
Linear regression produces very easily understood models, as we can see from the table in Figure 2.8. As noted, graphical results are less helpful with more than a few predictors, although graphing the error in prediction with other variables can lead to insights about where the model fails.

Model Deployment
Predictions for new cases are made from one equation using the unstandardized regression coefficient estimates. Any convenient software for doing this calculation can be employed, and regression equations can therefore be applied directly to data warehouses, not only to extracted datasets. This makes the model easily deployable.


DISCRIMINANT ANALYSIS
Discriminant analysis, a technique used in market research and credit analysis for many years, is a general linear model method, like linear regression. It is used in situations where you want to build a predictive model of group or category membership, based on linear combinations of predictor variables that are either continuous (age) or categorical variables represented by dummy variables (type of customer). Most of the predictors should be truly interval scale, else the multivariate normality assumption will be violated. Discriminant is available in SPSS under the Analyze..Classify menu. Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 2.13, which shows one discriminant function derived from two input variables, X and Y, that can be used to predict membership in a dependent variable: Group. The score on the discriminant function separates cases in group 1 from group 2, using the midpoint of the discriminant function (the short line segment). Figure 2.13 Discriminant Function Derived From Two Predictors


A Discriminant Example: Predicting Purchases


To demonstrate discriminant analysis we take data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). There was interest in identifying those segments most likely to adopt the service. Several demographic variables were available: education, gender, age, income category, number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The outcome measure was whether they would accept the offering. Most of the predictor variables are interval scale, the exceptions being gender (a dichotomy) and income (an ordered categorical variable). We would expect few if any of these variables to follow a normal distribution, but will proceed with discriminant. As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first. Click File..Open..Data Move to the c:\Train\DM_Model directory (if necessary) Double-click on Newschan (respond No if asked to save Data Editor contents) Click Analyze..Classify..Discriminant Click newschan, then click the upper arrow to move it into the Grouping Variable list box Notice that two question marks appear beside newschan in the Grouping Variable list box. This is because Discriminant can be applied to more than two outcome groups and expects a minimum and maximum group code. The news channel acceptance variable is coded 0 (no) and 1 (yes), and we use the Define Range pushbutton to supply this information. Click the Define Range pushbutton (not shown) Type 0 in the Minimum text box Click in the Maximum text box, and type 1 Click Continue to process the range The default Method within Discriminant is to run the analysis using all the predictor variables. For the typical data mining application, you would probably invoke a stepwise option that will enter predictor variables into the equation based on statistical criteria instead of forcing all predictors into the model. Click and drag from age to tvday to select them Click the lower arrow to place the selected variables in the Independents: list box Click the Use stepwise method option button

Figure 2.14 Discriminant Analysis Dialog Box

Click the Classify pushbutton Click the Summary table checkbox Click the Leave-one-out classification checkbox Figure 2.15 Classification Dialog Box

The Classification dialog box controls the results displayed when the discriminant model is applied to the data. The most useful table does not print out by default (because misclassification summaries require a second data pass), but you can easily request a summary classification table, which reports how well the model predicts the outcome measure. Without this table you cannot effectively evaluate the discriminant analysis, so you should make a point of asking for it. The "leave-one-out" variation classifies each case based on discriminant coefficients calculated while the case is excluded from the analysis. This method is a form of n-fold validation and provides a classification table that should generalize at least slightly better to other samples. Since we have a relatively small data file, rather than splitting it into training and validation samples, we will use the leave-one-out classification for validation purposes. You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the outcome in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each outcome group. If you know that the sample proportions reflect the distribution of the outcome in the population, then you can instruct Discriminant to make use of this information. For example, if an outcome category is very rare, Discriminant can make use of this fact in its prediction equation. Using the dialog box, priors can be set to the sample sizes, and with syntax you can directly specify the population proportions. In our instance, we don't know what the proportions would be, so we retain the default. Click Continue to process Classification choices Click the Statistics pushbutton Click the Fisher's checkbox in the Function Coefficients area Click the Unstandardized checkbox in the Function Coefficients area Figure 2.16 Statistics Dialog Box

Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to deploy the model for future observations (customers). Both sets of coefficients produce the same predictions. If there are only two outcome categories (as is our situation), either is easy to use. If you want to try "what if" scenarios using a spreadsheet, the unstandardized coefficients (since they involve a single equation in the two-outcome case) would be more convenient to work with. If you run discriminant with more than two outcome categories, then Fisher's coefficients are easier to apply as prediction rules. If you suspect some of the predictors are highly related, you might view the within-groups correlations among the predictor variables to identify highly correlated predictors. Click Continue to process Statistics requests Now we are ready to run the stepwise discriminant analysis. The Select pushbutton can be used to have SPSS select part of the data to estimate the discriminant function, and then apply the predictions to the other part (cross-validation). We would use this method of validation in place of the leave-one-out method if our data set were larger. The Save pushbutton will create new variables that contain the group membership predicted from the discriminant function and the associated probabilities. To retain predictions for the training data set, you would use the Save dialog to create these variables. Click OK to run the analysis Scroll to the Classification Results table at the bottom of the Viewer window Figure 2.17 Classification Results Table

Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the outcome. There are two subtables, with Original referring to the training data and Cross-Validated supplying the leave-one-out results. The actual (known) groups constitute the rows and the predicted groups make up the columns of the table. Looking at the Original section, of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them, and so its accuracy is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Thus overall, the discriminant model was accurate in 67.8% of the cases. The Cross-Validated summary is very close (67.3% accurate overall). Is this performance good? If we simply guessed the larger group 100% of the time, we would be correct 227 times out of 441 (227 + 214), or about 51.5% of the time. The 67.8% or 67.3% correct figures, while certainly far from perfect accuracy, do far better than guessing. Whether you would accept this figure and review the remaining output, or go back to the drawing board, is largely a function of the level of predictive accuracy required. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.

Stepwise Results
Age is entered first, followed by gender and education. A significance test (Wilks' lambda) of between-group differences is performed for the variables at each step. None of the other variables made a significant difference after adjusting for the first three. As an exercise you might rerun the analysis with the additional variables entered and compare the classification results. Figure 2.18 Stepwise Results

This summary is followed by one entitled "Variables in the Analysis" (not shown), which lists the variables included in the discriminant analysis at each step. For the variables selected, tolerance is shown. It measures the proportion of variance in each predictor variable that is independent of the other predictors in the equation at this step. As tolerance values approach 0 (say, below .1 or so) the data approach multicollinearity, meaning the predictor variables are highly interrelated, and interpretation of individual coefficients can be compromised. Note that discriminant coefficients are only calculated after the stepwise phase is complete. Figure 2.19 Standardized Coefficients and Structure Matrix

The standardized discriminant coefficients can be used as you would regression Beta coefficients in that they attempt to quantify the relative importance of each predictor in the discriminant function. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function (see Figure 2.20). An older individual will have a higher discriminant score, since the age coefficient is positive. The outcome group accepting the offering has a positive mean (see Figure 2.20), and so older people are more likely to accept the offering. Notice the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) results in a one-unit change, which when multiplied by the negative coefficient will lower the discriminant score and move the individual toward the group with a negative mean (those that don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.

Figure 2.20 Unstandardized Coefficients and Group Means (Centroids)

Back in Figure 2.13 we saw a scatterplot of two separate groups and the axis along which they could be best separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficient estimates for prediction purposes, you simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut point (by default the midpoint) between the two group means (centroids) along the discriminant function (the means appear in Figure 2.20). If the prospective customer's value is greater than the cut point, you predict the customer will accept; if the score is below the cut point, you predict the customer will not accept. This prediction rule is also easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios; for example, if we have a male with 16 years of education, at what age would such an individual become a good prospect? To answer this we determine the age value that moves the discriminant score above the cut point.
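A rough Python sketch of this scoring rule follows. All of the coefficient, constant and centroid values are hypothetical placeholders (the actual estimates appear in Figure 2.20, which is not reproduced in the text), so treat this only as an illustration of the calculation, not the fitted model.

# Illustrative only: classify a prospect using unstandardized discriminant
# coefficients and the midpoint between the two group centroids.
# All numeric values are hypothetical placeholders.
coeffs = {"educate": 0.10, "gender": -0.50, "age": 0.08}
constant = -4.0
centroid_no, centroid_yes = -0.45, 0.48       # group means on the function
cut_point = (centroid_no + centroid_yes) / 2  # default midpoint rule

prospect = {"educate": 16, "gender": 0, "age": 42}
score = constant + sum(coeffs[f] * prospect[f] for f in coeffs)
prediction = "accept" if score > cut_point else "not accept"
print(f"Discriminant score {score:.2f} -> predicted to {prediction}")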

Figure 2.21 Fisher Classification Coefficients

The Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female=1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 - 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the outcome group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes. We did not test for the assumptions of discriminant analysis (normality, equality of within-group covariance matrices) in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the sample group sizes are very different. Here the sample sizes were about the same. For a more detailed discussion of problems with assumption violation in discriminant analysis see Lachenbruch (1975) or Huberty (1994). As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic variables and the choice outcome.
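The following Python sketch illustrates the Fisher classification arithmetic. The No-group coefficients are those quoted above; the Yes-group coefficients are hypothetical placeholders standing in for the values in Figure 2.21, so the printed prediction is illustrative only.

# Illustrative only: Fisher classification for a new customer.
# "No" coefficients come from the example in the text; "Yes" coefficients
# are hypothetical placeholders standing in for Figure 2.21.
fisher = {
    "No":  {"educate": 2.07, "gender": 1.98, "age": 0.32, "constant": -20.85},
    "Yes": {"educate": 2.15, "gender": 1.40, "age": 0.41, "constant": -25.10},
}

customer = {"educate": 16, "gender": 1, "age": 30}  # 16 years of education, female, age 30

scores = {}
for group, c in fisher.items():
    scores[group] = (c["educate"] * customer["educate"]
                     + c["gender"] * customer["gender"]
                     + c["age"] * customer["age"]
                     + c["constant"])

predicted = max(scores, key=scores.get)  # assign to the group with the higher score
print(scores, "->", predicted)

With more than two outcome groups, the same loop simply compares additional scores, which is why the Fisher form remains convenient for deployment.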

Appropriate Research Projects


Examples of questions for which discriminant analysis is appropriate are:
Predict instances of fraud in all types of situations, including credit card, insurance, and telephone usage.
Predict whether customers will remain or leave (churn or not).
Predict which customers will respond to a new product or offer.
Predict outcomes of various medical procedures.

Other Features
In theory, there is no limit to the size of data files for discriminant analysis, either in terms of records or variables. However, practically speaking, most applications of discriminant limit the number of predictors to a few dozen at most. With that number of predictors, there is usually no reason to use more than a few thousand records. It is possible to use stepwise methods with discriminant, so that the software can select the best set of predictors from a larger potential group. In this sense, stepwise discriminant can be considered an automated procedure like decision trees. As a result, if you use a stepwise method, you should use a validation dataset on which to check the model derived by discriminant.

Model Understanding
Discriminant analysis produces easily understood results. We have already seen the classification table in Figure 2.17. In addition, the procedure calculates the relative importance of each variable as a predictor (standardized coefficients; see Figure 2.19). Graphical output is produced by discriminant, but with more than a few predictors it becomes less useful.

Model Deployment
Predictions for new cases are made from simple equations using the classification function coefficients (especially the Fisher coefficients). This means that any statistical program, or even a spreadsheet program, could be used to generate new predictions, and that the model can be applied directly to data warehouses, not only extracted data sets. This makes the model easily deployable.


LOGISTIC AND MULTINOMIAL REGRESSION


Logistic regression is similar to discriminant analysis in that it attempts to predict a categorical dependent variable. And it is similar to linear regression in that it uses the general linear model as its theoretical underpinning, and so calculates regression coefficients and tries to fit the data to a line, although not a straight one. A common application would be predicting whether someone renews an insurance policy. The outcome variable should be categorical. SPSS has three procedures that can be used to build logistic regression models. The Binary Logistic procedure and the Multinomial Logistic procedure are both found in the Analyze..Regression menu. The former is used only for dichotomous dependent variables; the latter can handle dependent variables with two or more categories. See the manual SPSS Regression Models, Chapter 1, for a discussion of when to use each for a dichotomous outcome. In addition, the Ordinal Regression procedure models an ordinal outcome variable. Logistic regression follows from a view that the world is truly continuous, and so the procedure actually predicts a continuous function that represents the probability associated with being in a particular category of the dependent variable. This is represented in Figure 2.22, which displays the predicted relationship between household income and the probability of purchase of a home. The S-shaped curve is the logistic curve, hence the name for this technique. The idea is that at low income, the probability of purchasing a home is small and rises only slightly with increasing income. But at a certain point, the chance of buying a home begins to increase in almost a linear fashion, until eventually most people with substantial incomes have bought homes, at which point the function levels off again. Thus the predicted outcome varies from 0 to 1 because it is a probability.

Figure 2.22 The Logistic Function

After the procedure calculates the outcome probability, it simply assigns a case to a predicted category based on whether its probability is above .50 or not. The same basic approach is used when the dependent variable has three or more categories. In Figure 2.22, we see that the logistic model is a nonlinear model relating predictor variables to the probability of a choice or event (for example, a purchase). If there are two predictor variables (X1, X2), then the logistic prediction equation can be expressed as:

prob(event) = exp(B1*X1 + B2*X2 + A) / (1 + exp(B1*X1 + B2*X2 + A))

where exp() represents the exponential function. The conceptual problem is that the probability of the event is not linearly related to the predictors. However, if a little math is done you can establish that the odds of the event occurring are equal to:

odds(event) = exp(B1*X1 + B2*X2 + A), which equals exp(B1*X1) * exp(B2*X2) * exp(A)

Although not obviously simpler to the eye, the second formulation (and SPSS displays the logistic coefficients in both the original form and raised to the exponential power) allows you to state how much the odds of the event change with a one-unit change in the predictor. For example, if I stated that the odds of making a sale double if a resource is given to me, everyone would know what I meant. With this in mind we will look at the coefficients in the logistic regression equation and try to interpret them. Recall that logistic regression assumes that the predictor variables are interval scale, and, like regression, dummy coding of predictors can be performed. As such, its assumptions are less restrictive than those of discriminant.
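To make the link between coefficients, odds and probabilities concrete, here is a small illustrative Python sketch; the coefficient values B1, B2 and A are invented for the example and do not come from any model in this chapter.

import math

# Illustrative only: how logistic coefficients translate into probabilities and odds.
# B1, B2 and A are invented values for the sketch.
B1, B2, A = 0.8, -1.2, 0.3

def prob_event(x1, x2):
    z = B1 * x1 + B2 * x2 + A
    return math.exp(z) / (1 + math.exp(z))

def odds_event(x1, x2):
    p = prob_event(x1, x2)
    return p / (1 - p)

# A one-unit increase in X1 multiplies the odds by exp(B1), regardless of the
# starting values of the predictors.
print(odds_event(2, 1) / odds_event(1, 1))   # equals exp(B1)
print(math.exp(B1))

The last two lines print the same number, illustrating the point made above: a one-unit change in a predictor multiplies the odds by exp(B) no matter where on the curve you start.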

A Logistic Regression Example: Predicting Purchases


We will apply logistic regression to the same problem of discovering which demographics are related to acceptance of an interactive news service. However, instead of running a stepwise method we will apply the variables selected by our discriminant analysis and compare the results. Click AnalyzeRegressionBinary Logistic Move newschan into the Dependent list box Move age, educate and gender into Covariates: list box Figure 2.23 Logistic Regression Dialog Box

This is all we need in order to run a standard logistic regression analysis. Notice the Interaction button. You can create interaction terms by clicking on two or more predictor variables in the original list, then clicking on the Interaction button. Also, you can use the Categorical pushbutton to have Logistic Regression create dummy coded (or contrast) variables to substitute for your categorical predictor variables (note that Clementine performs such operations automatically in its modeling nodes). The Save pushbutton allows you to create new variables containing the predicted probability of the event, and various residual and influence measures. As in Discriminant, the Select pushbutton will estimate the model from part of your sample (you provide a selection rule) and apply the prediction equation to the other part of the data (cross-validation). The Options pushbutton provides control over the criteria used in stepwise analysis. Click the Save pushbutton Click the Probabilities check box Click the Group membership check box Figure 2.24 Logistic Regression: Save Dialog Box

The Logistic Regression procedure can create new variables to store various types of information. Influence statistics, which measure the influence of each point on the logistic analysis, can be saved. A variety of residual measures, which identify poorly fit data, can also be retained. When scoring, the predicted probability provides a score for each observation that is used to classify it into an outcome category. We save these values here in order to demonstrate how they can be used in a gains table to evaluate the effectiveness of the model. Click Continue Click OK to run the analysis Scroll down to the Classification Table in the Block 1 section of the Viewer window

Figure 2.25 Classification Table

The classification results table indicates that those refusing the offer were predicted with 70.5% accuracy and those accepting with 61.7% accuracy, for an overall correct classification of 66.2%. The logistic model predicted slightly better for the refusals, and about 4 percentage points worse for the acceptances, so overall it does slightly worse (about 2 percentage points) than discriminant on the training sample. The default classification rule is that if a case's predicted probability of belonging to the outcome group with the higher value (here 1) is greater than or equal to .5, then predict membership in that group. Otherwise, predict membership in the group with the lower outcome value (here 0). We will examine these predicted probabilities in more detail later. Figure 2.26 Significance Tests and Model Summary

The Model Chi-square test provides a significance test for the entire model (three variables) similar to the overall F test in regression. We would say there is a significant relation between the three predictors and the outcome. The Step Chi-square records the change in chi-square from one step to the next and is useful when running stepwise methods.

Figure 2.27 Model Summary

The pseudo r-square is a statistic modeled after the r-square in regression (discussed earlier in this chapter). It measures how much of the initial lack-of-fit chi-square is accounted for by the variables in the model. Both variants indicate the model only accounts for a modest amount of the initial unexplained chi-square. Now let's move to the variables in the equation. Figure 2.28 Variables in the Equation

The B coefficients are the actual logistic regression coefficients, but recall they bear a nonlinear relationship to the probability of accepting the offer. Although they do linearly relate to the log odds of accepting, most people do not find this metric helpful for interpretation. The second column (S.E.) contains the standard errors for the B coefficients. The Wald statistic is used to test whether the predictor is significantly related to the outcome measure adjusting for the other variables in the equation (all three are highly significant). The last column presents the B coefficient exponentiated using the e (exponential) function, and we can interpret these coefficients in terms of an odds shift in the outcome. For example, the coefficients for age and education are above 1, meaning that the odds of accepting the offer increase with increasing age and education. The coefficient for age indicates that the odds increase by a factor of 1.06 per year, which seems rather small. However, recall that age can range from 18 to almost 90 years old, and a 20-year age difference would have a substantial impact on the odds of accepting the offering: 1.06 raised to the 20th power is about 3.2, so the odds more than triple. The coefficient for gender is about .5, indicating that if the other factors are held constant, moving from male to female cuts the odds of accepting the offering roughly in half. In this way you can express the effect of a predictor variable in terms of an odds shift. You can use the B coefficients for your prediction equation:

prob of accepting = exp(.107*educate - .738*gender + .060*age - 3.5) / (1 + exp(.107*educate - .738*gender + .060*age - 3.5))
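As a sketch of how this equation might be deployed, the Python fragment below scores a hypothetical prospective customer with the fitted coefficients shown above and applies the default .5 classification cut; the customer's values (14 years of education, female, age 55) are invented for illustration.

import math

# Applying the fitted equation above to score one prospective customer.
# The customer's values are invented; the .5 cutoff is the default rule.
def prob_accept(educate, gender, age):
    z = 0.107 * educate - 0.738 * gender + 0.060 * age - 3.5
    return math.exp(z) / (1 + math.exp(z))

p = prob_accept(educate=14, gender=1, age=55)
prediction = 1 if p >= 0.5 else 0
print(f"Predicted probability of accepting: {p:.3f} -> predicted group {prediction}")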

The results of the logistic regression confirm that age, gender and education are related to the probability of a potential customer accepting the offering. Although not shown, we ran logistic using a stepwise method and obtained the same model.

Appropriate Research Projects


Examples of questions for which logistic regression is appropriate are:
Predicting whether a person will renew an insurance policy.
Predicting instances of fraud.
Predicting which product someone will buy or a response to a direct mail offer.
Predicting that a product is likely to fail.

Other Features
As with the other techniques we have discussed, there is no limit to the size of data files. As before, usually a limited number of predictors is used for any problem, so file sizes can be reasonably small. Stepwise logistic regression is available in the Binary Logistic procedure, but not in the Multinomial Logistic procedure, and the standard caveats apply about using a validation data set when stepwise variable selection is done.

Model Understanding
Although the logistic model is inherently more complex, it is not unduly so compared to linear regression. When the results are translated into odds on the dependent variable, they become quite helpful to decision makers. As before, graphical representations of the solution are less helpful with more than a few predictors.

Model Deployment
For predictions with binary logistic regression, only one equation is involved, so the model is easily deployable. For multinomial regression, more than one equation is involved, which requires more calculations, but this doesn't make prediction for new cases that much more difficult.

APPENDIX: GAINS TABLES


One way to evaluate the usefulness of a classification model is to group or order cases by a score predicted from the model, and then examine the desired response rate within these score-ordered groups. For the model to be useful at the practical level, cases with high model scores should show higher response rates than those with low scores. Gains tables and lift charts provide numeric and graphical summaries of this aspect of the analysis. Using them, direct marketing analysts can estimate the proportion of positive respondents they will find if they promote the offering to the top x% of the sample. Some programs, for example SPSS AnswerTree, automatically produce gains tables. However, a basic gains table can be produced easily within SPSS, and with some additional effort, a gains plot can be created as well. In this context, the score would be the predicted probability in a binary logistic analysis or the estimated discriminant score in a two-outcome discriminant analysis. The main concept behind the gains table and chart involves ordering or grouping the cases by the score produced from the model, and evaluating the response rates. If the summary is tabular (gains table), then the cases are placed into score groups (for example, decile groups by score). For graphs, the cases can be grouped (for example, a gains plot based on decile groups) or not (for example, a gains plot in which each point represents a unique score). To demonstrate, we will create a basic gains table using the SPSS menus. In addition, SPSS syntax that will produce a gains chart can be found in Gains.sps (located in c:\Train\DM_Model); it involves more data manipulation in SPSS. Gains and other evaluation charts are available in Clementine and AnswerTree. Since the predictions in binary logistic regression are based on the predicted probability, we will use these values as the model scores. First we will collapse the data into ten groups based on their predicted probabilities, and then we will display the proportion of Yes responses within each decile group. Our use of decile (ten) groups is arbitrary; you might create more or fewer groups.

Click Transform..Rank Cases from the Data Editor window Move pre_1 into the Variable(s) list box Click the Largest value option button

The predicted probability variable from the logistic regression (pre_1) will be used to create a new rank variable. In addition, we indicate that the highest value of pre_1 should be assigned rank 1. This makes sense: the case with the highest predicted probability of belonging to the Yes outcome group (coded 1, the highest value) should have the top rank. If we stopped here, each case would be assigned a unique rank (assuming no ties on predicted probability). Since we want to create decile groups, we must make use of the Rank Cases: Types dialog box. Click the Rank Types pushbutton Click the Ntiles check box to check it Erase 4 and type 10 in the Ntiles text box Click the Rank check box so it is not checked Click Continue, then click OK The Rank procedure will now create a new variable coded 1 through 10, representing decile groups based on the predicted probability of responding Yes. Although a number of procedures can display the critical summary (response percentage for each decile group), we will use the OLAP Cubes procedure since it can present the base rate (assuming a random prediction model) as well. Click Analyze..Reports..OLAP Cubes Move newschan into the Summary Variable(s) list box Move npre_1 into the Grouping Variable(s) list box The OLAP Cubes procedure is designed to produce a multidimensional summary table that supports drill-down operations. We choose it for some specific summary statistics. Click the Statistics pushbutton Click and drag to select all statistics in the Cell Statistics list box Click the left arrow to remove these statistics from the Cell Statistics list box Move Number of Cases into the Cell Statistics list box Move Percent of Sum in (npre_1) into the Cell Statistics list box Move Percent of N in (npre_1) into the Cell Statistics list box We request several summaries for each decile score group. The number of cases in each group will appear, along with the percent of the total sum of the newschan variable in each decile group (based on npre_1). Since newschan is coded 0 or 1, the percentage of the overall sum of newschan in each decile group is the percentage of cases in a group that gave a positive response to the newschan question. Finally, the percentage of cases in each decile group will display. This provides the base rate against which the model-score based deciles will be compared. The logic is that if the model predictions are unrelated to the actual outcome, then we expect that the top decile of cases (based on the model) will contain about 10% of the positive responses: the rate we expect by chance alone.
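For readers who want to see the mechanics outside SPSS, the following Python sketch mirrors the same gains logic: sort cases by model score, cut them into ten groups, and report each group's share of the positive responses. The scores and responses are simulated stand-ins, not the newschan data.

import random

# Illustrative only: build a simple gains summary from model scores.
# Scores and responses are simulated stand-ins for pre_1 and newschan.
random.seed(1)
cases = []
for _ in range(441):
    score = random.random()                       # stand-in for a predicted probability
    response = 1 if random.random() < score else 0
    cases.append((score, response))

cases.sort(key=lambda c: c[0], reverse=True)      # best prospects first
n = len(cases)
total_positives = sum(r for _, r in cases)

for decile in range(10):
    group = cases[decile * n // 10:(decile + 1) * n // 10]
    positives = sum(r for _, r in group)
    pct_of_positives = 100 * positives / total_positives
    print(f"Decile {decile + 1}: {len(group)} cases, "
          f"{pct_of_positives:.1f}% of all positive responses")

Under a model with no predictive power, each decile would hold roughly 10% of the positive responses, which is exactly the base rate the OLAP Cubes summary reports.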

Click Continue Click the Title pushbutton Erase OLAP Cubes and type Gains Table in the Title text box Click Continue, then click OK The OLAP Cubes table is designed to be manipulated within the Pivot Table editor to permit different views of the summaries. In order to see the entire table we must move the decile-grouping variable into the row dimension of the pivot table (such manipulation is covered in more detail in The Basics: SPSS for Windows and Intermediate Topics: SPSS for Windows courses). Double-click on the Gains Table pivot table in the Viewer window If the Pivoting Trays window is not visible, then click Pivot..Pivoting Trays Click the pivot icon in the Layer tray of the Pivoting Tray window

The Pivot Tray window allows us to move table elements (in this case the decile categories) from one dimension to another. We need to move the npre_1 (NTILES of PRE_1) icon from the layer to the row dimension.

Figure 2.29 Pivot Table Editor

Drag the NTILES of PRE_1 pivot icon to the right side of the Row tray (right of the icon already in the row tray) Click outside the crosshatched area in the Viewer window to close the Pivot Table Editor

Figure 2.30 Gains Table

The column headings of the pivot table can be edited so they are easier to read (double-click on the pivot table, then double-click on any column heading to edit it). The rows of the table represent the decile groupings based on the predicted probabilities (scores) from the logistic regression model. The column labeled N contains the number of cases in each decile group and the % of N in NTILES of PRE_1 displays the percentages. This latter column contains the expected percentage of overall positive responses appearing in each group under a model with no predictability (i.e., the base rate). The column of greatest interest is labeled % of Sum in NTILES of PRE_1. It contains the expected percentage of the overall positive responses contained in each decile group under the model. Examining the first decile (1), we see that the most promising 10% of the sample contains 16.4% of the positive respondents. Similarly, both the second and third deciles contain 15% of the positive respondents. Thus, if we were to offer the interactive news cable package to the top 30% of the prospects, we expect we would obtain 46.4% (16.4% + 15% + 15%) of the positive responders. In this way, analysts in direct mail and related areas can evaluate the expected return of mailing to the top x% of the population (based on the model). In the fourth decile and beyond, the return is near or below that expected from a random model, so the decile groups holding most promise are the first three.


Chapter 3 Market Basket or Association Analysis


Topics:
INTRODUCTION
TECHNICAL CONSIDERATIONS
RULE GENERATION
APRIORI EXAMPLE: GROCERY PURCHASES
USING THE ASSOCIATIONS
APRIORI EXAMPLE: TRAINING COURSE PURCHASES


INTRODUCTION
As its name implies, market basket analysis techniques were developed, in part, to analyze consumer shopping patterns. These methods are descriptive and find groupings of items. Market basket or association analysis clusters fields (items), which are typically products purchased, but could also be medical procedures prescribed for patients, or banking or telecom services used by customers. The techniques look for patterns or clusters among a small number of items. These techniques can be found in Clementine under the Apriori and GRI procedures. Market basket or association analysis produces output that is easy to understand. For example, a rule may state that if corn chips are purchased, then 65% of the time cola is purchased, unless there is a promotion, in which case 85% of the time cola is purchased. In other words, the technique correlates the presence of one set of items with another. A large set of rules is typically generated for any data set with a reasonably diverse type and number of transactions. As an illustration, Figure 3.1 shows a portion of the output from Clementine's Apriori procedure showing relations between various products bought by customers over one week in a supermarket. The first line tells us that 10.8% of the sample, or 85 customers, bought both frozen foods and milk, and that 77.6% of such customers also bought some type of alcoholic product. Figure 3.1 A Set of Market Basket Association Rules

Such association rules describe relations among the items purchased; the goal is to discover interesting and actionable associations.


TECHNICAL CONSIDERATIONS
Number of Items?
If the fields to be analyzed represent items sold, the number of distinct items influences the resources required for analysis. For example, the number of SPSS products a customer can currently purchase is about 15, which is a relatively small number of items to analyze. Now, let's consider a major retail chain store, auto parts supplier, mail catalog vendor or web vendor. Each might have anywhere from hundreds or thousands to tens of thousands of unique products. Generally, when such large numbers of products are involved, they are binned (grouped) together in higher-level product categories. As items are added, the number of possible combinations increases exponentially. Just how much categorization is necessary depends upon the original number of items, the detail level of the business question asked, and the level of grouping at which meaningful categories can be created. When a large number of items are present, careful consideration must be given to this issue, which can be time consuming. But time spent on this matter will increase your chance of finding useful associations and will reduce the number of largely redundant rules (for example, rules describing the purchase of a hammer with each of many different sizes and types of nails).

Only True Responses?


Since market basket fields are usually coded as dichotomies (0,1) or logical flags (F,T), one issue the analyst needs to consider is whether there is interest in both true (purchase or occurrence) and false (no purchase or no occurrence) responses. Typically, a customer purchases a relatively small proportion of the available items. If both true and false responses are included, many rules will be of the sort "those who don't purchase X tend not to purchase Y." On the other hand, excluding the false responses implies that a rule such as "those who don't purchase X tend to purchase Y" will not be discovered. Thus items that act as substitutes for each other are more difficult to discover. On balance, many analysts restrict the association rules to those of the true form to avoid wading through the lengthy list of things not purchased.

Actionable?
More so than with other data-mining methods, data-mining authors and consultants raise questions about whether the associations discovered using market basket analysis are useful and actionable. The challenge is that if you do discover an association between two products, say beer and diapers (to use an example with a storied past), just what action would this lead a retailer to take? Of course, other data-mining methods are open to a similar challenge, but it is worthwhile to consider in advance how your organization would make use of any strong associations that are discovered in the market basket analysis.


No Dependent Variable (Necessarily)


An advantage of market basket analysis is that it investigates all associations among the analysis variables and not only those related to a specific outcome or dependent field. In this sense it is a more broadly based method than data-mining techniques that are trying to predict a specific outcome field. That said, after the associations have been generated in Clementine, you could designate one variable as an outcome and display only those associations that relate to it.

RULE GENERATION
Within Clementine, the association rule methods begin by generating simple rules (those involving two items) and testing them against the data. The most interesting of these rules (that is, those that meet the specified minimum criteria; by default, coverage and accuracy) are stored. Next, all rules are expanded by adding an additional condition from a third item (this process is called specialization) and are, as before, tested against the data. The most interesting of these rules are stored and specialization continues. When the analysis ends, the best of the stored rules can be examined. Two procedures in Clementine perform association rule analysis. GRI is more general in that it permits both numeric (continuous) and categorical input (condition) variables. Apriori, because it only permits categorical input (condition) variables, is quicker. It also supports a wider choice of criteria used to select rules. Results from an association analysis are presented in a table with these column headings: Consequent, Antecedent 1, Antecedent 2, ..., Antecedent N. For example:

Consequent: AnswerTree   Antecedent 1: Regression Models   Antecedent 2: Advanced Models

This rule tells us that customers who purchase the Regression Models and Advanced Models SPSS options also purchase AnswerTree. Association rules are commonly evaluated by two criteria: support and confidence. Support is the percentage of records in the data set for which the conditions (antecedents) hold. It indicates how general the rule is; that is, to what percentage of the data it will apply.

Confidence (accuracy) is the proportion of records meeting the conditions (antecedents) that also meet the consequent (conclusion). It indicates how likely the consequent is, given that the conditions are met. It is worth noting explicitly that market basket analysis does not take time into account. Thus, while purchases may well be time ordered, Apriori and GRI do not include time as a component of the rule generation process. Practically speaking, this means that all data for a case (e.g., shopping trip) must be stored in one physical record. Sequence detection algorithms take time sequencing into account, and such analyses can be done in Clementine with the Sequence node (or the CaprI algorithm add-on); see Chapter 8.
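A minimal Python sketch of these two measures, computed for a single candidate rule over a handful of made-up transactions (the items and counts are invented; a real analysis runs over thousands of records):

# Illustrative only: compute support and confidence for one candidate rule
# over a tiny set of made-up shopping transactions.
transactions = [
    {"frozen foods", "milk", "bakery goods"},
    {"frozen foods", "milk", "alcohol"},
    {"milk", "snacks"},
    {"frozen foods", "milk", "bakery goods", "alcohol"},
    {"tinned goods", "bakery goods"},
]

antecedents = {"frozen foods", "milk"}
consequent = "bakery goods"

matches_antecedents = [t for t in transactions if antecedents <= t]
matches_rule = [t for t in matches_antecedents if consequent in t]

support = 100 * len(matches_antecedents) / len(transactions)
confidence = 100 * len(matches_rule) / len(matches_antecedents)
print(f"Support {support:.1f}%, confidence {confidence:.1f}%")

Apriori repeats this kind of counting for every candidate rule, keeps only the rules that clear the minimum support and confidence thresholds, and then specializes the survivors by adding further antecedents.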

APRIORI EXAMPLE: GROCERY PURCHASES


To demonstrate a market basket analysis we will use the Apriori procedure within Clementine to analyze the purchase patterns among a limited number of product categories (10) of grocery store items: ready made, frozen foods, alcohol, fresh vegetables, milk, bakery goods, fresh meat, toiletries, snacks and tinned goods. About two thousand shopping visits are analyzed. We will not construct the entire data stream within Clementine, but rather will review the main settings, then run and interpret the results. For more details on the Clementine environment, see the Introduction to Clementine training course or the Clementine User Guide. From within Clementine: Click File..Open Stream Move to the c:\Train\DM_Model directory Double-click on ShopApr.str

Figure 3.2 Clementine Stream for Shopping Data

The stream canvas currently contains: a source node used to access the text file containing the shopping data (shopping.txt); a Type node used to declare field types and indicate how fields are to be used in any modeling; a Table node to display the original data values; a Distribution node, which displays a bar chart for a selected field; and an Apriori modeling node. We will examine some of these nodes and then proceed with the market basket analysis. Right-click on the Table node, then click Execute from the Context menu

Figure 3.3 Shopping Data

In addition to several demographic fields, each purchase item field is coded 0 (no purchase) or 1 (purchase). Note that the number of purchase variables is quite limited because purchases have been grouped into broad categories. Close the Table window Double-click the Type node to open it Figure 3.4 Type Node for Shopping Data

In Clementine, data fields can be one of the following types: range, set (categorical with more than two categories), flag (binary), or typeless (such as for an ID field). The modeling procedures in Clementine automatically take a field's type into account. Since the various purchased item fields can take only two values (0 (no purchase) or 1 (purchase)), they are defined as flag fields. Also, the Direction column instructs Clementine how a field should be used in modeling. For most modeling procedures, fields are set to In (Input, independent or predictor fields) or Out (Output, dependent or outcome fields). However, we mentioned earlier that association rules can examine all relations and there generally isn't a specific dependent variable. For this reason, all the purchased item fields are set to the Both direction, meaning that they can be inputs and outputs, and can thus appear on either side of an association rule. Close the Type node dialog Double-click the Apriori node Figure 3.5 Apriori Dialog Box

The node is labeled 10 fields because it has a default model name that indicates how many fields will be used in creating the association rules. The basic dialog for setting up the market basket analysis is fairly simple, largely because much of the definitional work has been done in the Type node. Notice that you can set criteria values for the minimum rule support (default value is 10%) and minimum rule confidence (default value is 80%). Thus at least 10% of the records must contain a particular group of antecedents before it will be considered as a rule. For purchased items you might need to raise or lower this value depending on the purchase rates of the items and the number of items. For example, if the most popular item were purchased on only 8% of all store visits, then no association rules would be generated under the default (10%) minimum coverage rule. As you raise it, fewer association rules will be generated. The minimum confidence may need to be modified as well. In practice, it might be lowered if the initial 80% setting results in too few or even no rules. We lowered it because an earlier run (not shown), under the default value of 80%, produced only four association rules. The Maximum number of antecedents setting controls how many fields may be combined to form Conditions. The search space, and thus processing time, increases rapidly with an increase in the maximum number of rule preconditions. By default, rules will only contain true values for fields that are flag type. We discussed this earlier and since we are not interested in rules relating to the non-purchase of products, we retain the check in the Only true values for flags checkbox. Click the Execute button Figure 3.6 Generated Model Node Added to Models Tab in Manager Window

After a Clementine model runs successfully, a node containing information about the model (the generated node) is added to the Models tab in the Manager window. This model is an unrefined model, which can't be placed in a stream. But if we ran a model in which there was a declared outcome field, we could move the generated model node into the data stream and examine the predicted values (as we will do in later chapters). Right-click on the Generated Model Node, then click Browse Figure 3.7 Rules Produced by Apriori

By default, only the association rules appear. Since the support and confidence values are important when evaluating the rules, we display them as well. Click the Show/Hide criteria button on the toolbar You may also need to widen the window to see all antecedents In addition to support and confidence, also listed are instances, which are the number of cases used to calculate the support proportion.

Figure 3.8 Rules with Instances, Support and Confidence Added

Twenty-six rules were found (the number in the upper right-hand corner), and they are sorted by confidence. The first rule associates frozen foods, milk, and bakery goods. There are 85 instances (shopping trips) in which both frozen foods and milk were purchased. This is 10.8% of all shopping trips. On these 85 trips, bakery goods were also purchased 83.5% of the time. From here you would browse the rules, guided by the support and confidence figures, looking for interesting associations. Those with greater support will apply to a larger percent of the data. Those with greater confidence provide more certainty that a case that meets the conditions will also have the conclusion value. When viewing the confidence values, it is important to keep the base rates of purchasing items in mind (these can be obtained from the distribution node in Clementine). For example, if bakery goods are bought by 78% of all customers, the 83.5% confidence of the first association rule looks much less impressive than if the base rate were 30% (the actual base rate for bakery goods is about 43%, not shown). The table of rules can be sorted by support, confidence, the product of both, the consequent, or the length of the sequence, providing different views of the results.


USING THE ASSOCIATIONS


If, after examining the association rules, those related to a particular consequent attract your interest, you can isolate those rules and convert them to a set of prediction rules. This can be done because the results would now be limited to a single consequent. And these prediction rules can be applied to the data in a stream. To demonstrate, we will produce a rule set for the Alcohol item. Click Generate..Rule Set from the Association Rules browser window Click the Field Chooser icon and select Alcohol in the Target Field list box Change the Rule set name to Alcohol Figure 3.9 Translating Association Rules to a Ruleset

This dialog will create a new Association Rule node with all rules that contain the Target Field as the consequent. Data can be passed through this generated node and a new field (variable) would be created. It would store the predicted values for the target field based on the Conditions in the data records. By default, it will be added to the stream canvas. Minimum values can be specified here for the coverage and confidence of the generated rules. Click OK Close the Association Rules browser window Right-click the new Apriori Generated node on the Canvas, then click Edit from the Context menu Click the Show all levels button to see the complete set of rules


Click the Show or hide instance and confidence figures button to view those values Figure 3.10 Ruleset for Alcohol Generated from Apriori Model

We see the rules that lead to prediction of the purchase of alcohol (value=1). The first rule indicates that if frozen foods and milk are purchased (antecedents), then predict that alcohol is also purchased. There are 85 records (shoppers) with these antecedents (Instances) and 77.6% (.776) of these shoppers also purchased alcohol (confidence). We will now connect the Apriori Generated node to the stream to view the fields created by the rule set. Close the Ruleset browser window Click and drag the Apriori Generated node to the right of the Type node (You might want to click and drag the Apriori Modeling node down a bit first to make some room) Click with the middle mouse button (or click both left and right buttons on a two button mouse) on the Type node and drag to the Apriori Generated node (connecting the two)


Click on a Table node from the Output Palette, then click in the stream canvas to the right of the Apriori Generated node to place the Table node Click with the middle mouse button (or click both left and right buttons on a two button mouse) on the Apriori Generated node and drag to the Table node (connecting the two) Figure 3.11 Adding the Ruleset Node to the Data Stream

The Ruleset node has been added to the Data Stream so the new fields it creates (prediction for Alcohol purchase and the confidence value) will appear in the table. Right-click the Table node Click Execute on the Context menu Scroll to the right to view the last columns in the Table window

Figure 3.12 Fields Created by the Generated Rule Set

Two new fields have been added to the data. The first, $A-Alcohol, is $null$ (the default) unless any of the three rules in the rule set apply to the record, in which case it has a value of 1. The second field, $AC-Alcohol, represents the confidence figure for the rule consequent (here alcohol) for a particular rule. In this way, prediction rules can be derived from an association rule analysis if rules related to a particular outcome are interesting.
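As a simplified sketch of what the generated rule set does when it creates these fields, the Python fragment below applies the single rule quoted earlier (frozen foods and milk predicting alcohol with confidence .776); the other rules in the set and Clementine's exact handling of multiple matching rules are not reproduced here.

# Illustrative only: apply one generated rule of the form
# "if frozen foods and milk are purchased, predict alcohol (confidence .776)".
# Real rule sets contain several rules and handling not shown here.
rules = [
    {"antecedents": {"frozen foods", "milk"}, "consequent": "alcohol", "confidence": 0.776},
]

def score(basket):
    for rule in rules:
        if rule["antecedents"] <= basket:
            return 1, rule["confidence"]   # analogous to $A-Alcohol, $AC-Alcohol
    return None, None                       # analogous to $null$

print(score({"frozen foods", "milk", "snacks"}))  # (1, 0.776)
print(score({"bakery goods"}))                    # (None, None)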

APRIORI EXAMPLE: TRAINING COURSE PURCHASES


We will run a second market basket analysis that is based on SPSS training course registration in the UK. Most course products are purchased on an individual course basis, and there was interest in examining the associations among them. The data file contains records from about 2,000 training customers. All customers in this file have purchased at least one training product. Close the Table window Click File..Close Stream from the main menu (when prompted to save the stream, say No) Click File..Open Stream Double-click on TrainApr.str (in the c:\Train\DM_Model directory)

Figure 3.13 SPSS Training Purchase Data Stream

As before, the Clementine stream consists of a data source node (reading a text data file) connected to a Type node and then to the Apriori Association Rule node (labeled 13 fields). Right-Click on the Table node, and then click Execute on the Context Menu Scroll to the right in the Table window to the training course fields Figure 3.14 Training Purchase Data

As in the previous example, each training course product is coded 1 or 0 depending on whether or not a customer (each record represents a customer) purchased the product. Close the Table window Double-click the Apriori node to edit it Figure 3.15 Apriori Dialog Box

Notice that the minimum values for coverage (5%) and accuracy (20%) are much lower than in our earlier analysis (see Figure 3.5). This is because, to the regret of the Training Department, the percentage of customers taking any single training course is low. When we optimistically ran the analysis using the default coverage and accuracy rules, no rules were generated. Click Execute Right-click on the Generated Model Node for the new model, then click Browse Click the Show/Hide criteria button to see the support and confidence values Widen the columns to see the full names of the courses, and then widen the dialog box

Although we allowed up to five antecedents, the ten rules generated have only one antecedent. The Introduction to SPSS course is the consequent for 3 rules. However, one of these rules has no antecedent! This is because Apriori is designed to display the base rates for all potential consequents, if they meet the specified minimum rule confidence condition (more on base rates in a moment). Figure 3.16 Association Rules for Training Courses

The rules that conclude with the Introduction to SPSS and the Introduction to SPSS and Statistics courses are not very interesting since they are entry-level courses through which customers usually pass in order to take more advanced training. A more useful rule is the one relating Building Predictive Models to Introduction to CHAID. It has a support value of 6.5% and a confidence value of .273. Given that the highest confidence value is 31.1%, it is one of the better rules. But what is critical, as mentioned in the previous example with grocery items, is to compare the confidence value to the base rate for the Introduction to CHAID class. We get this information from the frequency table and bar chart produced by a distribution node. Close the Association Rules browser window Right-click the Distribution node Click Execute on the Context menu

Figure 3.17 Distribution of Introduction to CHAID Course Purchase

A small percentage (5.71%) of the customers who purchase some training product purchase the CHAID course. However, the rule confidence indicates that of those who purchase the Building Predictive Models course, 27.3% also purchase the CHAID course, nearly five times the base rate. This could present a cross-selling opportunity: if a new training customer is interested in the Building Predictive Models course, the CHAID software and course products might be mentioned as well. A domain expert (in this case, someone familiar with the content and sales of training products) would be the one to examine such rules and decide which are worthwhile and which are just artifacts of how training is structured. Since there is no specific target variable in a market basket or association rules analysis, many associations are summarized. The analyst can then examine the rules for interesting associations and generate rule sets, or note them for further study. This very general, nondirected analysis is the strength of market basket analysis. However, in practice the domain expert must examine and discard rules that are obvious or unhelpful.
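To make the comparison concrete, the sketch below computes support, confidence, and the confidence-to-base-rate ratio (often called lift) for a single rule from a 0/1 purchase table. It is a minimal illustration, not Clementine output; the field names (BPM, CHAID_course) and the tiny data set are invented for the example.

# Minimal sketch: support, confidence, and lift for one association rule,
# computed from a 0/1 purchase table. Field names and data are hypothetical.
purchases = [
    {"BPM": 1, "CHAID_course": 1},
    {"BPM": 1, "CHAID_course": 0},
    {"BPM": 0, "CHAID_course": 1},
    {"BPM": 0, "CHAID_course": 0},
    {"BPM": 1, "CHAID_course": 1},
]

n = len(purchases)
antecedent = sum(1 for r in purchases if r["BPM"] == 1)           # records with the antecedent course
both = sum(1 for r in purchases if r["BPM"] == 1 and r["CHAID_course"] == 1)
consequent = sum(1 for r in purchases if r["CHAID_course"] == 1)  # records with the consequent course

support = antecedent / n        # proportion of records containing the antecedent (the coverage idea used above)
confidence = both / antecedent  # proportion of those records that also contain the consequent
base_rate = consequent / n      # overall proportion purchasing the consequent
lift = confidence / base_rate   # how many times the base rate the rule achieves

print(f"support={support:.3f} confidence={confidence:.3f} "
      f"base rate={base_rate:.3f} lift={lift:.1f}")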

Appropriate Research Projects


Market basket analysis is appropriate whenever you have a set of discrete items that you want to group together, not in large clusters but in very small sets, as demonstrated in this chapter. This occurs most often in the retail industry, but there is no reason it need be limited to that area. Insurance claims, bank transactions, and medical procedures are other potential applications.

Other Features
Market basket analysis is typically done on very large data sets, with tens of thousands of transactions and hundreds of items. But as file sizes grow, especially in the number of items, the number of potential connections grows quickly, as does the required computation time. For example, for just 50 items there are 1,225 distinct combinations of two items, 19,600 combinations of three items, and so forth. So in practice, the number of cases can be large, but the number of distinct items tends to be much smaller, as is the number of items to be associated together in a cluster (limited by the value for Maximum number of antecedents in the Apriori dialog). It can be difficult, therefore, to determine the optimum number and type of items a priori, so typically, as with all the automated methods, several different solutions may be tried with different sets of items. This is the equivalent of validating the associations you discover on a completely separate data set, although standard validation can also be done.
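The growth in potential item combinations is easy to verify; the snippet below simply evaluates binomial coefficients and reproduces the counts quoted above (it assumes nothing beyond those numbers).

from math import comb

items = 50
print(comb(items, 2))   # 1225 distinct pairs of items
print(comb(items, 3))   # 19600 distinct triples of items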

Model Understanding
One of the great strengths of market basket analysis is that its results are easily understood by anyone. The rules it finds can be expressed in natural language and don't involve any statistical testing. Usually, the graphical output, except for small sets of items, is not as helpful as the numeric data showing the actual strength of association.

Model Deployment
The association rules produced can be coded as statements in SQL. This means that they can be applied directly to databases, if necessary. However, model deployment for these techniques does not usually involve applying the rules to a new data file. Instead, the results are directly actionable, as decisions can be made about cross-marketing and new promotions by simply examining the associations and determining what level of association is high enough to warrant action.


Chapter 4 Neural Networks


Topics:
INTRODUCTION
BASIC PRINCIPLES OF SUPERVISED NEURAL NETWORKS
Activation Function
Complexity
Approach
Other Types of Neural Networks
NEURAL NETWORK EXAMPLE: PREDICTING CREDIT RISK
Comparing Predicted to Actual Values
Understanding the Reasoning Behind the Predictions
Model Summary
Other Neural Network Models

INTRODUCTION
Neural networks are models based on the way the brain and nervous system operate. Such models are used to predict an outcome variable that is either categorical or interval in scale, using predictors that are also categorical or interval. Neural networks are popular when modeling complex environments (for example, financial applications). Forms of neural networks (Kohonen networks) are used to cluster data (see Chapter 6). Neural network models are available from SPSS through its Clementine and Neural Connection products. The basic unit is the neuron or node, and these are typically organized into layers, as shown in Figure 4.1.
Figure 4.1 Simple Neural Network

Input data are presented to the first layer, and values propagate from each neuron to every neuron in the next layer; the values are modified during transmission by weights. Eventually, a result is delivered from the output layer. Initially, all weights are set to small random values, and the answers that come out of the net are probably nonsensical. The network learns through training: examples for which the output is known are repeatedly presented to the network, the answers it gives are compared to the known outcomes, and information from this comparison is passed back through the network, gradually changing the weights. As training progresses, the network usually becomes more and more accurate in replicating the known outcomes; once trained, it can be applied to future cases where the outcome is unknown. Neural networks complement such statistical models as regression and discriminant analysis by allowing for nonlinear relations and complex interactions among predictor variables. When such relations are present, neural networks will outperform the other models; when they are not, neural network models do about as well as the parametric statistical models. In this latter circumstance, the traditional models would probably be chosen on the basis of parsimony and their greater transparency (see the Model Understanding section). A neural network solution with all predictors cannot be displayed in convenient graphical form, since relations between predictors and the outcome are mediated by the hidden layer(s), and many predictors may be involved. However, numeric summaries provide insight into the relative importance of each predictor, and plots relating individual predictors to the outcome can be produced. Most of the output summaries from neural network models are numeric.

BASIC PRINCIPLES OF SUPERVISED NEURAL NETWORKS


There are many types of neural networks (for a review, see Fu (1994)). In this section we will focus on a popular neural network used to solve prediction problems (feed-forward backpropagation). Supervised neural networks are those for which there is a measured target variable, and so estimation of the network weights is guided by the error in prediction. Unsupervised neural networks also have applications in the data mining area, and we will see a clustering application in Chapter 6 (using a Kohonen network).

In some sense a neural network learns (trains) in just the same way a human would. If a new employee, who knew little about his industry, were asked to predict an outcome measure (say the next year's revenue expected from a new customer) based on some background information, he would have to guess (no cheating by asking experienced employees). Over time, he would learn how close his predictions were. If a prediction were perfect, he would probably make the same prediction given the same conditions. However, if the prediction were incorrect, he would change the prediction next time (weighting the inputs differently) and would probably make a large change in the face of a large error. This process of adjustment based on error is the basis of a supervised neural network. It is formalized and optimized, but the notion of adjusting predictions based on error feedback is familiar.

It might be best to start with regression and use it as a basis for discussing neural networks. Recall from Chapter 2 that a regression model has the expression shown below:

Y = B1*X1 + B2*X2 + B3*X3 + . . . + A

We could express this in the diagram below (ignoring the intercept, which could be added).

Figure 4.2 Regression Model

In the diagram, each of the predictor variables is related to the outcome by a single connection weight or coefficient (the regression or B coefficients). Relations between the predictors and the outcome variable are assumed to be linear. The coefficients or weights are estimated from the training data in a single data pass. The coefficients are those that minimize the sum of squared errors (the least-squares criterion). There are as many coefficients (weights) as there are input variables (for categorical variables, dummy variables would be substituted), plus one coefficient for the intercept. Now let's consider the form a neural network would take. Here we are using a feed-forward backpropagation neural network (named as such because prediction errors propagate backward to modify the weight coefficients).

Figure 4.3 Neural Network

Two features immediately attract attention. The first is the middle layer of neurons that intervenes between the input layer (which accepts data values; there is one input layer neuron for each input variable) and the output layer neuron. This is called the hidden layer, since its neurons do not correspond to observable fields. The second is that, due to the hidden layer, there are many more connections between the input and output than we saw with regression. This allows a neural network to fit more complex models than regression can. Notice that each input neuron is connected to every hidden layer neuron and, in turn, each hidden layer neuron is connected to the output neuron(s) (here there is just one output neuron). Thus a single input predictor can influence the output through a variety of paths, which allows for greater model complexity. Backpropagation networks typically have a single hidden layer of neurons, but there are extensions that allow two (a second hidden layer is useful in capturing symmetric functions) or even three (although rarely used) hidden layers. The number of neurons within a hidden layer can be varied; more neurons permit more complex relations to be fit. In fact, any continuous function can be fit using a backpropagation network by continuing to add neurons in the hidden layer. However, a neural network with many neurons in its hidden layer(s) may over-fit the training data and not generalize well to other instances. For this reason, in practice, the number of neurons in the hidden layer(s) is usually adjusted based on the number of predictor and outcome fields.


Activation Function
A feature not visible in the diagram in Figure 4.3 concerns what occurs at a hidden neuron when the inputs from a data record are combined using the current set of weights relating them to that neuron. Instead of the weighted value of the input fields after normalization (if Xi is input variable i and neuron j is the jth hidden neuron, the value would be X1*wj1 + X2*wj2 + . . . + Xk*wjk) simply passing through the neuron, a nonlinear function is applied. Although there are several choices, very often the logistic function (see Chapter 2, Logistic Regression) is used.
Figure 4.4 Logistic Function

Thus there is a nonlinear mapping of the value produced by applying the weights to a data record before the next set of weights (relating the hidden layer to the output layer) is applied. This allows the neural net to capture nonlinear relations between the inputs and the outcome. In addition (not shown), there is a threshold below which a hidden layer neuron will not pass its value to the next neuron(s). In this way, a neural network can natively capture complex interactions among inputs and nonlinear relations between the inputs and output in the data, which are beyond what standard regression can do. Herein lies the strength of a neural network. If we expand our diagram (Figure 4.2) for multiple regression by adding a single neuron in a hidden layer and use the logistic function as its activation function, then we have a neural network that performs logistic regression.

Figure 4.5 Neural Network for Logistic Regression

Adding additional neurons to the hidden layer, which is done in most neural network analyses, allows the model to capture more complex relationships in the data.
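As a rough sketch of what happens at one hidden neuron during this process, the code below applies a set of weights to normalized inputs, adds a bias, and passes the result through the logistic function. The weights and inputs are made-up numbers, not values estimated by Clementine.

import math

def logistic(z):
    """Logistic (sigmoid) activation: maps any value into the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical normalized inputs and weights for one hidden neuron j.
inputs = [0.25, 0.80, 0.10]      # X1, X2, X3 after scaling to roughly 0-1
weights = [0.40, -0.15, 0.70]    # w_j1, w_j2, w_j3
bias = -0.05

weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
activation = logistic(weighted_sum)  # value the neuron passes on to the next layer
print(round(activation, 3))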

Complexity
If we examine Figure 4.3 again, we see that a neural network with 5 input neurons, 3 hidden layer neurons, and 1 output neuron (5 predictors and one outcome) contains 18 weights. Also, every predictor relates to the output through three paths, each of which involves the other predictors and the nonlinear activation function. Because of this, there is no simple expression relating an input to the output. For this reason, neural network models are difficult to interpret. There are indirect methods used to better understand the relationship between an input variable and the predicted output from a neural network, but the relatively simple interpretations we had with regression, discriminant analysis, and logistic regression are not to be found. Also, there are measures of the relative importance of the predictors (sensitivity analysis) in neural networks, but they don't provide direction or functional-form information.

Approach
What follows is a simplified description of how the feed-forward backpropagation neural network model is fit (how the weights are estimated). The basic idea is that the weights are adjusted based on the error of prediction from each record (case). To illustrate this, suppose we are attempting to predict the insurance claim amount from five input variables.

First, the weights are set to small, random values. Then the data values from the first record (case) are passed through the net (the weights and activation functions are applied) and a prediction is obtained. The error of this prediction is used to adjust the weights connecting the hidden layer to the output layer.
Figure 4.6 Adjusting Hidden Layer Weights Based on Prediction Error

The current weights, relating the hidden layer neurons to the outcome, are modified based on the error of the prediction. Next the weights relating the input variables to the hidden layer are adjusted in a similar way. First, the error in the output is propagated back to each of the hidden layer neurons via the weights connecting the hidden layer to the output. This must be done since the hidden layer neurons have no observed outcome values and error cannot be measured directly. In turn, the weights relating each input neuron to a hidden layer neuron are similarly modified based on the error assigned to that hidden layer neuron (see Figure 4.7).

Figure 4.7 Adjusting Input Layer Weights Based on Error

By propagating the output prediction error to the hidden layer through the weights, all weight coefficients can be adjusted from the prediction error. The new weights are applied to the next case, and are similarly adjusted. Weights are adjusted for each record (case) and, typically, many data passes are required for the model to become stable. For those interested in the equation, the weight change can be expressed as follows:

Wji(t + 1) = Wji(t) + η*dj*Oi + α*[Wji(t) - Wji(t-1)]

where Wji is the weight connecting neuron i to neuron j, t is the trial number, η is the learning rate (a value set between 0 and 1), dj is the error gradient at node j (for discussion of this see Fu, 1994), Oi is the activation level of a node (the nonlinear function applied to the result of the combination of the weights and inputs), and α is a momentum term (a value set between 0 and 1). A new weight is derived by taking the old weight and applying an adjustment based on a function of the prediction error (represented here by dj). In addition, the momentum term (α) serves to encourage the weight change to maintain the same direction as the last weight change. It and the learning rate (η) are control parameters that can be modified by experienced neural network practitioners to fine-tune the performance of feed-forward backpropagation neural networks. This is a bare description of the estimation method and it skips over the more difficult details of how the error gradient is calculated. For those interested in these details see Fu

(1994) or the more accessible, but less detailed Bigus (1996). Berry and Linoff (1997) include a chapter discussing neural networks in the context of data mining.
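A minimal sketch of the weight-update rule shown above appears below. The learning rate, momentum value, error gradient, and activation are invented numbers used only to show how the pieces combine; real training repeats this update for every weight on every record.

def update_weight(w_current, w_previous, learning_rate, momentum, error_gradient, activation):
    """One application of the backpropagation update:
    new weight = old weight + eta * d_j * O_i + alpha * (last weight change)."""
    delta = learning_rate * error_gradient * activation
    momentum_term = momentum * (w_current - w_previous)
    return w_current + delta + momentum_term

# Hypothetical values for a single weight W_ji.
w_t, w_t_minus_1 = 0.32, 0.30    # current and previous values of the weight
eta, alpha = 0.10, 0.90          # learning rate and momentum (both between 0 and 1)
d_j, O_i = -0.04, 0.65           # error gradient at neuron j, activation of neuron i

w_next = update_weight(w_t, w_t_minus_1, eta, alpha, d_j, O_i)
print(round(w_next, 4))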

Other Issues
Neural networks assume the inputs are in a range from 0 to 1 (typically). Most neural network software automatically rescales input variables, but you should be aware of it. Also, categorical variables are replaced by dummy variables (discussed in Chapter 2) coded 1 or 0, indicating the presence or absence of the category. If you have categorical inputs with many categories (postal codes, telephone exchange codes, industry codes, or medical diagnostic groups), the neural network will use the large number of dummy fields as input, which can substantially increase the time and memory requirements. Given that many iterations are typically required for a neural network model to be fit, you should examine categorical predictors to determine the number of categories. If the number of categories for a variable is excessive, you should consider binning or collapsing the original categories into a smaller number of super-categories.
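The two preparation steps described here, rescaling numeric inputs to a 0-1 range and replacing a categorical field with dummy fields, can be sketched as follows. The income values and marital-status categories are illustrative only; Clementine performs the equivalent work internally.

def min_max_scale(values):
    """Rescale a numeric field to the 0-1 range expected by the network."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def dummy_code(values, categories):
    """Replace one categorical field with one 0/1 dummy field per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

incomes = [12000, 27500, 41000, 18250]
print(min_max_scale(incomes))

marital = ["married", "single", "divsepwid", "single"]
print(dummy_code(marital, ["married", "single", "divsepwid"]))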

Other Types Of Neural Networks


We discussed the feed-forward backpropagation network and will focus on its use in this chapter. Other neural networks can be applied to data mining applications. One that is growing in popularity is the radial basis function network. Essentially, its hidden layer acts to cluster the data (each hidden node represents a cluster) into groups based on similar input values, and then a set of backpropagation weights is applied. This feature can reduce training time. The dynamic growing network starts with a small hidden layer (the neurons and their connections are called the network topology by neural network practitioners) and, after training, it adds additional hidden layer neurons, trains again, and evaluates the improvement. The network grows in this way until no additional improvement is obtained. Since a number of network topologies are trained, this can require substantial computing time. At the other end of the spectrum are pruned networks, in which a large neural network (possibly containing several hidden layers) is trained and the hidden neurons with the weakest links are dropped. In turn, this reduced network is trained and the input neurons with the weakest links are dropped. This process is repeated until no additional pruning is necessary. Of the networks so far described, it alone will automatically drop input variables that make little predictive contribution. However, since a number of networks, some large, are trained, this method involves a large computational effort.


A NEURAL NETWORK EXAMPLE: PREDICTING CREDIT RISK


We will use a neural network to predict the credit risk category into which individuals should be placed. The outcome field is credit risk, made up of three categories: good risk; bad risk, but profitable; and bad risk with loss. Ten demographic variables will be used as predictors in the model (marital status, income (in pounds; this is a UK data file), number of department store credit cards, age, number of credit cards, number of loans, number of children, gender, mortgage, and whether salary is paid weekly or monthly). Because the target variable is categorical, the analysis will substitute three dummy coded (0,1) fields for the single three-category field. Similarly, this will be done for the marital status field. Such adjustment is made automatically within Clementine, based on the field's type. From within Clementine:
Click File→Load Stream
Move to the c:\Train\DM_Model directory
Double-click on NeurCred.str
Figure 4.8 Clementine Stream for Credit Risk Analysis

This stream has been prepared in advance, although we will run the analysis and examine the results. Notice there are two streams; the upper stream applies the neural network model to the training data set (train.txt), while the lower stream will send the separate validation data (validate.txt) through the generated model nodes for evaluation. First let's examine the data.
Right-click on the Table node in the upper left corner of the stream palette
Click Execute on the Context menu
Figure 4.9 Data File

There are 2,455 records in the training data file, and we see values for the first few cases. The outcome field (Risk) has string values for its three categories. We saw in Chapter 3 that the Type node is very important when setting up a model in Clementine, and we examine it now.
Close the Table window
Right-click on the Type node
Click Edit on the Context menu

Figure 4.10 Type Node for Neural Network Model on Risk Data

Recall that the direction of a field determines how it will be used within Clementine models. To review:
IN      The field acts as an input or predictor in modeling.
OUT     The field is the output or target field in modeling.
BOTH    The field can act as both an input and an output in modeling (limited to Association Rule and Sequence Detection modeling; see Chapters 3, 8).
NONE    The field will not be used in modeling.

Notice that the ID field is declared as Typeless and its direction is None. Thus it will not be included in the neural network model. This is for good reason. With a large enough network topology, ID could perfectly (and trivially) predict risk in the training data. In addition, a dummy field would be generated for each unique ID in the training data. Since we have thousands of IDs, this would dramatically increase the scope of the analysis and substantially slow down the estimation. Thus we would slowly obtain a trivial and unhelpful model. And there is no reason ID should be a good predictor in any case. Risk is the outcome variable and all others (except ID) are inputs.
Close the Type window
Click the Neural Net node in the Modeling palette and place it in the stream canvas to the right of the Type node
Connect the Type node to the Neural Net node (click and drag with the middle mouse button, or simultaneously click the left and right buttons of a two-button mouse)
Double-click the Neural Net (named Risk) node
Figure 4.11 Neural Net Node

The name of the Neural Net node can be specified within the Custom Model name text box. By default, it takes the name of the output field. A feedback graph (selected by default in the Options tab) appears while the network is training and provides information on the current accuracy of the network. We will examine the feedback graph shortly. There are several different neural network algorithms available within the Train Net node (Method drop-down list). Some (Quick, Dynamic, Prune, and RBFN (radial basis function network)) were described earlier. We will demonstrate the default Quick method, but will summarize the performance of the others. For more details on the different methods, please see the Clementine User's Guide or the Clementine: Advanced Models course.

Over-training is one of the problems that can occur with neural networks. As the data are passed repeatedly through the network, it is possible for the network to learn the patterns that exist in the training sample only and hence over-train. That is, it will become too specific to the training sample data and will lose its ability to generalize. By selecting the Prevent overtraining check box (the default), only a randomly selected portion of the training data is used to train the network. Once this sample of data (training data) has made a complete pass through the network, the rest is used as a test set to evaluate the performance of the current network. It is this test information that is used to determine when to stop training and to provide feedback information. Thus training data are used to estimate the neural network weights, while test data are used to evaluate the model (accuracy) and determine when training ends. In addition, we have set aside a separate validation data set (second stream) to evaluate the model(s). We advise you to leave the Prevent overtraining option turned on.

You can control how Clementine decides to stop training a network. The default option stops training when the network appears to have reached its optimally trained state; that is, the model performance in the test data sample seems to no longer improve with additional training cycles. Alternatively, you can set a required Accuracy, a limit to the number of Cycles through the data, or a Time limit in minutes. In this chapter we will use the Default option.

Since the neural network initializes itself with random weights, the behavior of the network can be reproduced by using the Set random seed option with the same seed. Setting the random seed is not a normal practice, and it is advisable to run a neural network model several times to ensure that you obtain similar results using different random starting points. The seed influences both the starting weights and the splitting of the data file into training and test data sets. Under the Options tab, Sensitivity analysis gives information on the relative importance of each of the fields used as inputs to the network. This is useful information, although it increases processing time, and is the default setting. Under the Experts tab, and then the Experts option button, are additional choices that allow you to further refine the properties of the training method (for example, the learning rate and momentum parameters briefly cited earlier). Expert options are detailed in the Clementine User's Guide. In this chapter we shall stay with the default settings on all but one of the above options, although if multiple models are built in the same stream, using different inputs or training methods, it may be advisable to change the network names. In order to reproduce the neural network results appearing in this example (recall that starting weights are randomly set), we set the random seed to 233 for this analysis.
Click the Set random seed check box
Type 233 in the Seed box
Click the Execute button

Figure 4.12 Feedback Graph During Neural Network Training

The graph shows two lines. The red, more irregular line, labeled Current Predicted Accuracy, represents the accuracy of the current network (current set of weights) in predicting the test data (as defined above). The blue, smoother line, labeled Best Predicted Accuracy, represents the best accuracy (and network) so far (again applied to the test data). The accuracy percentages for both the current and best performing networks are detailed in the legend. When the outcome field is categorical, this accuracy is simply the percentage of correct model predictions (in the test data sample). For a numeric outcome field (Clementine type Range), the accuracy for a prediction (as a percent) is 100*(1 - |Target Value - Network Prediction| / Target Value Range), which is averaged across the test data. Once trained, the network performs the sensitivity analysis (if requested), and a golden nugget node appears in the Models tab of the Manager. This represents the trained network and is labeled with the network name.
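For a numeric target, the accuracy figure just described can be reproduced along the lines below; the target values, predictions, and range are invented, and the function simply restates the formula given above.

def numeric_accuracy(targets, predictions):
    """Average per-record accuracy (%) for a numeric output field:
    100 * (1 - |target - prediction| / range of target)."""
    value_range = max(targets) - min(targets)
    per_record = [100.0 * (1.0 - abs(t - p) / value_range)
                  for t, p in zip(targets, predictions)]
    return sum(per_record) / len(per_record)

# Hypothetical test-set values.
actual = [100.0, 250.0, 400.0, 175.0]
predicted = [120.0, 230.0, 390.0, 150.0]
print(round(numeric_accuracy(actual, predicted), 2))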

Right-click on the Neural Net Generated Model node in the Models tab
Click Browse on the Context menu

Figure 4.13 Basic Model Information

The information on the model in this window is collapsed, except for the Analysis section. Here we see that the predicted accuracy for this neural network is 70.949%, indicating the proportion of the test set correctly predicted. Below this is the final topology of the network: the number of neurons within each layer. The input layer is made up of one neuron per numeric or flag (binary) type field. Those fields defined as sets (categorical with more than two categories) will have one neuron per value within the set. Therefore, in this example there are 9 numeric or flag fields and 1 set field with 3 values, giving a total of 12 neurons. In this network there is one hidden layer, containing 3 neurons, and then the output layer, containing three neurons for the three values of the output field, Risk. The Quick method selects the number of neurons in the hidden layer based on the number and types of the input and output fields; more input fields mean more neurons in the hidden layer. If the output field had been defined as numeric, then the output layer would contain only one neuron. To view the relative importance of the input fields from the sensitivity analysis, we need to expand that section of the model information.
Click on the Relative importance of Inputs icon to expand the folder

Figure 4.14 Sensitivity Analysis

The relative importance of each input field is listed in descending order. The figures range between 0.0 and 1.0, where 0.0 indicates unimportant and 1.0 indicates extremely important. In practice this value rarely goes above 0.35. Here we see that age, income, and marital status are relatively important fields in this network. The generated model icon can be placed in the data stream, and data can be passed through it to create prediction and confidence fields and understand how the model works. This has already been done.
Close the RISK model browse window
Right-click on the Table node connected to the Neural Net Generated Node in the upper stream
Click Execute on the Context menu
Scroll to the right so the last few columns are visible

Figure 4.15 Table Showing Prediction and Confidence Fields

The model node calculates two new fields, $N-RISK and $NC-RISK, for every record in the data file. The first represents the predicted value and the second a confidence value for the prediction. The latter is only appropriate for categorical outputs and will be in the range of 0.0 to 1.0, with the more confident predictions having values closer to 1.0. The confidence values are almost identical for the first few customers because they all have the same value of marital status and nearly identical annual incomes.
Close the Table window
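Conceptually, the prediction and confidence fields can be thought of as the winning output neuron and (roughly) the strength of its activation. The sketch below shows that idea for one record; the activation values are invented, and the calculation Clementine actually performs may differ in detail.

# Hypothetical output-layer activations for one record, one value per Risk category.
activations = {"good risk": 0.21, "bad profit": 0.68, "bad loss": 0.11}

predicted = max(activations, key=activations.get)   # analogue of $N-RISK
confidence = activations[predicted]                 # analogue of $NC-RISK (0.0 to 1.0)

print(predicted, round(confidence, 2))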

Comparing Predicted To Actual Values


Since we are predicting a categorical output, it is necessary to study a matrix of the predicted value, $N-RISK, against the actual value, RISK, to see where the predictions were correct (and wrong!). This is called a misclassification table.
Right-click on the Matrix node in the upper stream
Click Execute on the Context menu
The Matrix node has been edited so that actual risk forms the rows and predicted risk forms the columns of the table. Values in the table are row percents (they sum to 100% across a row).

Figure 4.16 Misclassification Table of Neural Network Predictions

The model is predicting about 90% of the bad but profitable risk individuals correctly, only 32% of those who belong to the bad loss category, and the good risk group between these two (62%). Thus if we want to correctly predict bad but profitable loans at the expense of the other categories, this would appear to be a reasonable model. On the other hand, if we want to predict those credit risks that are going to cause the company a loss, this model would only predict about a third of those correctly. We have established where the model is making incorrect predictions. But how is the model making its predictions? In the next section we shall examine a couple of methods that will help us to begin to understand the reasoning behind the predictions.
Close the Matrix window
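A misclassification table of this kind is straightforward to build outside Clementine as well; the sketch below tabulates actual against predicted categories and converts each row to percentages. The short lists of labels are invented for illustration.

from collections import Counter

actual    = ["good", "bad profit", "bad loss", "good", "bad profit", "bad profit"]
predicted = ["good", "bad profit", "bad profit", "bad profit", "bad profit", "bad profit"]

categories = ["good", "bad profit", "bad loss"]
counts = Counter(zip(actual, predicted))

for a in categories:
    row_total = sum(counts[(a, p)] for p in categories)
    row = [100.0 * counts[(a, p)] / row_total if row_total else 0.0 for p in categories]
    print(a, [f"{pct:.0f}%" for pct in row])   # row percents sum to 100% across a row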

Understanding The Reasoning Behind The Predictions


One method of trying to understand how a neural network is making its predictions is to use an alternative technique, such as rule induction, to model the predictions made by the neural network. Decision trees and rule induction will be discussed in Chapter 5. From the sensitivity analysis, we know that marital status is an important categorical input. Since the outcome is also categorical, we will use a distribution plot (bar chart) to understand how marital status relates to credit risk.

Right-click on the Distribution node in the upper stream
Click Execute on the Context menu

Figure 4.17 Distribution of Marital Status and Predicted Credit Risk

The chart is normalized (by editing the Distribution node; not shown) so each marital status bar is the same length. The chart illustrates that the model is predicting almost all the divorced, separated, or widowed individuals as bad but profitable types. The single individuals are associated with both good risk and bad but profitable types. The married individuals are split among all three types, with most of the bad loss types falling in this group. One of the important numeric inputs for this network is income. Since the output field is categorical, we shall use a histogram of income, with the predicted value as an overlay ($N-RISK), to try to understand how the network is associating income with risk.
Close the Distribution window
Right-click the Histogram node
Click Execute on the Context menu

Figure 4.18 Income and Predicted Credit Risk

Here we can see that the neural network is associating high income with good credit risk. There seems to be a break point in income at about 30,000 pounds (this is UK data), over which the model predicts good risk. The lower income ranges are split, roughly proportionally, between the two bad credit types. By comparing this with a histogram using the actual output field as an overlay, we can assess where, in terms of income, the model is getting it wrong.
Close the Histogram window
Double-click on the Histogram node
Select Risk as the Overlay Color field (not shown)
Click Execute

Figure 4.19 Income and Actual Credit Risk

With this overlay histogram of the actual data, we see that there are some good credit risk individuals present at the lower end of the income scale (which is one reason the model is not predicting that group very accurately). And although the bad loss group has incomes generally below 30,000 pounds, we know from the misclassification table (Figure 4.16) that this group is being predicted poorly. Thus, there must be other factors in the model that we haven't yet discovered that lead to these mispredictions.

Model Summary
We have built a neural network that is very good at predicting bad but profitable types. The more important factors in making predictions are marital status, income, and age. The network appears to associate low income and divorced, separated, or widowed status with individuals likely to be a bad but profitable credit risk. The neural network is associating high incomes with good credit types, but does not appear to be very successful in correctly identifying those who are classified as bad credit risks likely to create a loss for the company. It could be argued that this latter group of individuals is the most important to identify successfully, and, for this reason alone, the network is not achieving great success.

Other Neural Network Models And Validation


We have not yet validated the neural network on the holdout data. Also, for comparison purposes, we ran models on the risk data with the other neural net model choices in Clementine, using the same random seed. The generated model nodes for each model, including the model we ran with the Quick method, have been placed in the lower stream, attached to the validate.txt data file. If time permits, you can run the Analysis node in the lower stream to compare the model results. You can also recreate these models by editing the Neural Net node in the upper stream (but note that all but RBFN are considerably slower than the Quick choice). The table below reports the overall accuracy of each model on both the original data used in the analysis and the second validation data set (lower stream).

Model      Accuracy (Original Data)   Accuracy (Validation Data)
Quick      71.98%                     78.52%
RBFN       71.41%                     75.93%
Dynamic    72.91%                     79.06%
Multiple   72.79%                     79.12%
Prune      73.16%                     78.16%

The overall performance of the neural network models on the original data was very similar, with the more complex networks doing slightly better. Oddly, performance improved for all networks when applied to the validation sample. Typically, there is a slight drop-off. The simplest of the neural network methods (Quick) did quite well, relatively speaking, on this data.

Appropriate Research Projects


Examples of questions for which neural networks are appropriate are:
Predict expected revenue in dollars for a new customer from customer characteristics
Predict fraudulent use of a credit card from purchase patterns
Predict acceptance of a direct mail offering
Predict future monthly sales in dollars from previous sales history
Predict real estate value for a house


Other Features
There is no fixed limit to the size of data files used with neural networks, but since training a network involves iterative passes through the training data, most users of neural networks limit the number of predictors to a manageable number, say under 50 or so. The data are split into training, test and validation samples. Data values must be normalized (range constrained to be roughly 0 to 1) for neural networks to run effectively. Neural network software programs typically perform this automatically. Although stepwise methods are not available in most neural networks (because all the predictors are used to set up the model; pruned networks are an exception), measures suggesting the relative importance of the predictors and plots can be used to evaluate predictor variables. Neural networks can natively model non-linear and complex relationships.

Model Understanding
Neural networks produce models that are not easily understood or described. This is because the relationships are reflected in the (possibly) many weights contained in the hidden layer(s). Thus neural networks largely present a black box and not a model that can be readily described. This can be a disadvantage if you need to explain the model to others or demonstrate that certain factors were not considered. For example, race cannot be considered in the U.S. when issuing a loan or credit card and a demonstration of this may be required by Federal agencies.

Model Deployment
Applying the weights and threshold values (after normalization) to new data produces predictions. Although neural network prediction is more complex than regression models or decision trees, it can be incorporated into other programs or languages. Specifically, Clementine neural net models can be deployed via exported code to apply the coefficients within other programs. In sum, while not as easy to deploy as some of the other models considered, it can be done.


Chapter 5 Rule Induction and Decision Tree Methods


Topics:
INTRODUCTION
WHY SO MANY METHODS?
CHAID ANALYSIS
A CHAID EXAMPLE: CREDIT RISK
RULE INDUCTION (C5.0)
A C5.0 EXAMPLE: CREDIT RISK

INTRODUCTION
Decision trees and rule induction methods are capable of culling through a set of predictor variables and successively splitting a data set into subgroups in order to improve the prediction or classification of a target (dependent) variable. As such, they are valuable to data miners faced with constructing predictive models when there may be a large number of predictor variables and not much theory or previous work to guide them. In this chapter we use the terms rule induction and decision tree interchangeably. The different names are a result of strongly related techniques having been developed in two different research areas (artificial intelligence and machine learning versus applied statistics). The decision trees we view can be expressed as rules, and these decision rules (although not rulesets, as discussed later in the chapter) can be represented as decision trees. We will discuss decision tree methods within the context of data mining and use the CHAID procedure in AnswerTree and the C5.0 rule induction method in Clementine to predict credit risk.

Traditional statistical prediction methods (for example, regression, logistic regression or discriminant analysis) involve fitting a model to data, evaluating fit, and estimating parameters that are later used in a prediction equation. Decision tree or rule induction models take a different approach. They successively partition a data set based on the relationships between predictor variables and a target (outcome) variable. When successful, the resulting tree or rules indicate which predictor variables are most strongly related to the target variable. They also find subgroups that have concentrations of cases with desired characteristics (e.g., good credit risks, those who buy a product, those who do not have a second heart attack within 30 days of the first).

Decision trees can use many predictors, but they are not true multivariate models in the classic sense of parametric statistics. That is, they do not simultaneously assess the effect of one variable while controlling for the others. Instead, their general approach is to find the best single predictor of the dependent variable at the root of the tree. Finding this predictor usually involves recoding or grouping together several of the original values of the predictor to create at least two nodes. Each node now defines a new branch of the tree being created. Within each branch, the process repeats itself: the algorithm looks for the best predictor among the set of variables and again creates at least two nodes with that best predictor. When no predictor can be found that improves the accuracy of prediction, the tree can be grown no further. Some decision tree methods also prune the tree after initial growth.

Decision tree models may have the following advantages over traditional statistical models. First, they are designed to be able to handle a large number of predictor variables, in some cases far more than a corresponding parametric statistical model. Second, some of the tree-based models (C5.0 and C&RT in particular) are entirely nonparametric and can capture relationships that standard linear models do not easily handle (nonlinear relationships, complex interactions). They (especially C5.0 and C&RT) share this characteristic with neural networks and kernel density methods (robust regression). For researchers needing to explore their data, decision tree methods are designed to automatically perform the successive splitting operations until stopping criteria are reached. They thus provide a means of ransacking a data set (to borrow Leo Goodman's expression) to identify key relationships. The downside of this is that care must be taken to avoid false positives (cross-validation is recommended) and trivial results. These characteristics are common to decision tree models. Below we see a decision tree derived from a marketing study of the factors that influence a person's decision to subscribe to an interactive news service.
Figure 5.1 Decision Tree with Age As First Predictor

At the root (top) node, the yes and no responses are about evenly divided. However, after splitting the sample based on age (<= 44, > 44), we have two very different subgroups. Of those over 44, over 67% respond yes, while almost 60% of those 44 or under say no to the offering. For those 44 or under, a second split is performed based on income level; those with higher income are more likely to accept the offering. Decision tree methods perform successive splitting of the data using predictor variables, guided by a criterion (statistical or information-based). Next we examine some rules produced by Clementine's C5.0 rule induction procedure.

Figure 5.2 Rules in Decision Tree Format for C5.0

The rules are expressed as branching If-Then statements. They describe the conditions that lead to a predicted outcome. The first line indicates that if the number of (other) loans is 0, then predict good risk. In addition, it indicates that there were 335 cases in this subgroup and the rule predicted 59.1% (.591) of those cases correctly. If the number of loans is greater than 0, then additional conditions apply (involving the number of store credit cards, income and, again, the number of loans). These rules, in principle, could be represented as a graphical decision tree.
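Because the rules are plain branching conditions, they translate directly into program logic. The fragment below mirrors the first rule quoted above and adds a hypothetical second branch simply to show the shape; it is not the full C5.0 rule set.

def predict_risk(loans, store_cards, income):
    """Toy translation of C5.0-style rules into if/then logic (illustrative only)."""
    if loans == 0:
        return "good risk"            # first rule from the text: 335 cases, 59.1% correct
    # Hypothetical further conditions standing in for the remaining branches.
    if store_cards <= 2 and income > 30000:
        return "good risk"
    return "bad risk"

print(predict_risk(loans=0, store_cards=1, income=22000))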

WHY SO MANY METHODS?


The procedures available in Clementine and AnswerTree differ in many ways. These include: the types of variables they can handle (only C&RT (Classification and Regression Trees) and a variant of CHAID accept continuous target variables); the criterion used for splitting; whether such elements as misclassification costs and prior probabilities are incorporated into the analysis; and the computational effort required for the analysis. Since the methods vary on so many characteristics (see Table 5.1 below), the question as to whether any one is best is difficult to answer, although the main developers have addressed this issue (see Kim and Loh (2001), Loh and Vanichsetakul (1988) and rejoinders, or Loh and Shih (1997)). CHAID and C&RT have been in use longer than QUEST (Quick, Unbiased, Efficient, Statistical Tree), but QUEST was designed to address what were viewed as deficiencies in C&RT (a tendency for predictors with many categories to be selected, and lengthy processing time). C5.0 is the most recent version of a machine-learning program with a long pedigree (ID3, C4.0, C4.5). SPSS saw value in each method and decided to offer them all. Often several of the methods can be applied to answer a question, and we encourage you to run your analyses using the different methods and compare the results, especially in terms of how well they classify on a validation sample.

Comparing The Methods


The table below summarizes some of the characteristics that differentiate among the four tree-based analysis methods available in Clementine and AnswerTree. The purpose is not to study every feature at this point, but to provide a reference table that can be revisited at different points in the course. Here we will mention a few important differences. Please note that the list is restricted to characteristics on which the four methods differ. For example, all methods can handle nominal (categorical) target and predictor variables.

Table 5.1 Features of Decision Tree and Rule Induction Methods

                                        CHAID                        C5.0                   C&RT                     QUEST
Split Type                              Multiple                     Multiple               Binary                   Binary
Continuous Target                       Yes (1)                      No                     Yes                      No
Continuous Predictors                   No (2)                       Yes                    Yes                      Yes
Misclassification Costs in Tree Growth  No                           Yes                    Yes                      No (3)
Criterion for Predictor Selection       Statistical                  Information measure    Impurity measure         Statistical
Processing Time (Approx.)               Moderate                     Moderate               Slower                   Moderate
Handling Missing Predictor Values       Missing becomes a category   Fractionalization (4)  Surrogates (4)           Surrogates (4)
Priors                                  No                           No                     Yes                      Yes
SPSS Software                           AnswerTree                   Clementine             AnswerTree, Clementine   AnswerTree

(1) SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target variables.
(2) Continuous predictors are converted into ordinal variables containing (by default) 10 approximately equally sized categories.
(3) If misclassification costs are symmetric, they can be incorporated using priors.
(4) Fractionalization and surrogate (substitute predictor) methods advance a case with a missing value through the tree.
One important point is that if you wish to use decision tree analysis to predict a continuous variable, then you are limited to the C&RT procedure in Clementine and AnswerTree or a variant of the CHAID method within AnswerTree. Note that C&RT and QUEST produce binary splits when growing the tree, while C5.0 and CHAID can produce multiple groups when splitting occurs. Also, missing values are handled in two different ways. CHAID allows missing values to be considered as a single category and used in the analysis (refusing to answer an income question may be predictive of the target variable). C&RT and QUEST use the substitute (surrogate) variable whose split is most strongly associated with that of the original predictor to direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in proportion to the distribution of the predictor variable and passes a weighted portion of the case down each tree branch. So these latter methods do not treat missing as a separate category that might contain predictive information. The methods also differ in whether statistical tests or other criteria are used when selecting predictor variables.

CHAID ANALYSIS
Chi-square automatic interaction detection (CHAID) is a heuristic tree-based statistical method that examines the relations between many categorical, ordinal or continuous (which are grouped or binned into ordered categories) predictor variables and a categorical outcome (target) measure. It provides a summary diagram depicting the predictor categories that make the greatest difference in the desired outcome. For example, in a segmentation context, CHAID can give you information about the combinations of demographics that yield the highest probability of a sale.

Principles And Considerations


As mentioned above, CHAID is a heuristic statistical method that examines relations between many categorical predictor variables and a single categorical outcome variable, and it provides a summary diagram relating the predictors to the outcome. This summary diagram can be searched to locate the subgroups with the highest percentages on the desired outcome, or a gains table can be examined.

It might be easiest to explain the CHAID algorithm using an example. Suppose you have a sample of customers and know whether or not they purchased an upgrade to your product. Four demographic variables (region, gender, age, and income category) are available as potential predictors. The first two are nominal (categorical), age is continuous and would be collapsed into ordered categories, and income category is ordinal. CHAID will first examine crosstabulation tables between each of the predictor variables and the outcome, and test for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID will select the predictor that is most significant (smallest p value). If a predictor has more than two categories (as region does), CHAID compares them and collapses together those categories that show no differences on the outcome. For example, if the Midwest and West do not significantly differ in the percentage of customers who upgrade, then these two regions will be collapsed together (the least significant are merged first). Thus categories based on those regions (or merged regions) that significantly differ from the others are used in computing the significance tests for predictor selection.

Let's say the best predictor is region, so the split occurs on that variable. CHAID then turns to the first new region category (say Midwest/West) and, for these observations, examines the predictor variables to see which makes the most significant difference (let's suppose it is gender) for this subgroup. It then splits the Midwest/West subgroup by gender. It would then examine each of the other region subgroups and, if the criteria are met, split each region subgroup by the predictor most significantly related to the target within that subgroup (which might not be gender). CHAID would then drop down a level in the tree and take the first gender group (say females) within the Midwest/West subgroup and see if any of the predictor variables (including region) make a significant difference in outcome (since gender is a constant for this subgroup, all female, it would not be considered). If no remaining predictor variable makes a significant difference for the Midwest/West-female subgroup, or a stopping rule is reached, CHAID declares this a terminal node and examines the Midwest/West-male subgroup in the same way. Thus, level by level, CHAID systematically splits the data file into subgroups (called nodes) that show significant differences as they relate to the outcome measure. The results of this process are displayed in a tree diagram that branches out as additional splits are made. As noted above, CHAID can split more than once on the same variable in a branch if that variable becomes the best predictor at a subsequent node.
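The core of the predictor-selection step, crosstabulating each candidate predictor against the outcome and keeping the most significant chi-square result, can be sketched as below. It is a rough illustration, not the CHAID implementation: it uses SciPy's chi2_contingency for the test, the crosstab counts are made up, and the category-merging logic CHAID applies before testing is omitted.

from scipy.stats import chi2_contingency

# Hypothetical crosstabs: rows are predictor categories, columns are outcome categories.
crosstabs = {
    "region": [[120, 80], [90, 110], [60, 140]],
    "gender": [[150, 160], [120, 170]],
}

best_predictor, best_p = None, 1.0
for name, table in crosstabs.items():
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"{name}: chi-square={chi2:.1f}, p={p_value:.4f}")
    if p_value < best_p:                 # CHAID splits on the most significant predictor
        best_predictor, best_p = name, p_value

print("Split on:", best_predictor)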

A CHAID EXAMPLE: CREDIT RISK


We will apply CHAID decision tree analysis to the credit risk data used in the last chapter. Recall the outcome is credit risk with three categories: good risk, bad risk but profitable, and bad risk with a loss.
Run SPSS AnswerTree
The Startup dialog asks whether you wish to run an introductory overview, start a new project, or open a previous project. A project is linked to a particular data file and is composed of the data file and any analysis trees based on the file. We will start our new project from the Startup menu, but could also do so directly from the AnswerTree parent window (click File→New Project).
Click the Start a new project option button in the AnswerTree dialog (not shown)
Click OK
Click the SPSS data file (.sav) option button in the New Project dialog
Click OK
Move to the c:\Train\DM_Model directory
Double-click Train
Figure 5.3 AnswerTree Parent Window

The AnswerTree parent window contains a Project window that has an entry (Project 1), indicating that a default project data file has been opened. The Project window provides a way of referencing different tree analyses done in the course of the project (analyses linked to a single data file). However, the details and specifications for a CHAID (or other tree) analysis are provided elsewhere (Tree Wizard dialogs, Analysis and Tree menus, and the Tree window). The project is assigned the temporary name Project 1 (see the title bar of the AnswerTree parent window or the entry in the Project window). Since no analyses have been run yet, there are no tree icons in the Project window. If you want a spreadsheet view of the data, simply click View→Data after a root node is present in the Tree window (meaning the data file has been read). Let's assume the data have been checked in SPSS and proceed to set up the model for CHAID analysis.

A Tree Wizard window also opened with our new project. The Tree Wizard, composed of four dialog boxes, steps you through the analysis setup. First the Tree Wizard asks which decision tree method we want to apply to our data. CHAID is the default method.
Figure 5.4 Tree Wizard: Growing Method (Step 1 of 4)

We will be running a CHAID analysis; other AnswerTree models are also available as Growing Method option buttons. Each is accompanied by a brief description. For more details on QUEST and C&RT, see the AnswerTree User's Guide.
Click the CHAID option button (if not already selected)
Now we will set up the analysis.
Click Next
Click and drag Credit Risk into the Target list box
Click and drag all variables except ID into the Predictors list box

Rule Induction and Decision Tree Methods 5 - 9

Figure 5.5 Tree Wizard: Model Definition Dialog (Step 2 of 4) Completed

The Tree Wizard Model Definition dialog displays the SPSS variable labels, ordered by original position in the file, for all variables in the SPSS data file (Train.sav). In order to perform a CHAID analysis we must indicate the target (dependent) and predictor (independent) variables. The icon beside each variable identifies the variable's measurement level. A variable can be nominal, ordinal, or continuous. You can change display characteristics of the variable list and modify a variable's measurement level from this dialog (measurement level can also be changed later using the Measurement Level dialog available from the Analysis menu). In addition to list boxes for Target and Predictors, there are list boxes for Frequency and Case Weights. A Frequency variable is needed when aggregate data are used, that is, a data file in which each record represents not an individual, but a subgroup summary. In such files the Frequency variable contains the number of observations that each record represents. In addition a Case Weight variable can be used if each observation is not to be equally weighted (the default) in the analysis. Such weights can be used when the sample proportions do not reflect the population proportions.
Click Next


An important step in decision tree analysis is to validate the results on data other than that used to build the model. AnswerTree offers two methods for this: (1) partitioning the data into training and validation samples, and (2) n-fold validation. These options were briefly discussed in Chapter 1. We will partition the data so that 75% goes to training and 25% (about 600 cases) to the testing sample. We actually have separate training and validation data files for this study (named Train and Validate) and could have combined them beforehand, which would allow more data for the analysis.
Click the Partition my data into subsamples option button
Move the slider control so that Training sample is at 75% and Testing sample is at 25%
Replace the Random Seed value with 233

Figure 5.6 Tree Wizard: Validation (Step 3 of 4) Dialog Box Completed

Click Next
At this point, you can examine and change some advanced analysis options or proceed to create a Tree window using default settings. We will examine the default CHAID advanced option settings now, changing some of them.
Click the Advanced Options pushbutton


Figure 5.7 Stopping Rules

The first tab within the Advanced Options dialog controls when tree growth will stop. Since this tab applies to methods other than CHAID, one choice is inactive. Maximum Tree Depth sets a maximum for the number of levels deep (below the root node) a tree can grow. The default (3) is useful when there are many variables and you want to identify the most important few. Deeper trees allow more detailed analysis and possibly a better solution, but correspondingly require more computing resources to complete. For our analysis, we leave the depth at 3. The Parent node minimum number of cases is the minimum size a node (subgroup) must be before any splitting can occur. The default (100) is reasonable for large data files (many thousands), but for some market research and segmentation studies (other than direct mail where samples are typically very large) it must be scaled down accordingly. Since our training data set will contain about 1,850 observations, we will reduce the minimum parent node value to 40. The Child node minimum number of cases is the minimum size of nodes (subgroups) resulting from a split. This means that a node will not be split further if any of the resulting nodes would contain fewer cases than this value. By default it is 50 and, for the reason outlined above, we will reduce it to 15. We are setting the numbers a bit on the low side (especially the minimum number of cases for the Child node) to better demonstrate how the CHAID model runs (by increasing the chances of a larger tree).


Enter 40 into the Parent node text box
Enter 15 into the Child node text box
The CHAID tab sheet controls the statistical criteria used by CHAID. The Costs tab sheet allows you to supply misclassification costs, which will influence case classification (prediction) in CHAID, though not tree growth. By default, all misclassification costs are assumed equal. The Intervals tab sheet allows you to examine or modify the intervals that CHAID will use to convert continuous predictors into ordinal form.
Click OK to process the Advanced Options
Click Finish

Figure 5.8 Tree Window with Root Node

The Tree window now opens and displays the root node, which represents the entire training sample. The target variable label (Credit risk) appears above the node along with a notation that the training data (not the testing data) are displayed. Within the root node, we can see the overall counts and percentages of responses to the credit risk target variable. Overall 22.62% of the sample were a bad risk with a loss of money, 59.94% were bad risks but profitable, and 17.45% were good risks. These are the base rates for the training sample and the CHAID analysis will create splits using the predictor variables most strongly related to credit risk. Notice also that a tree icon labeled Tree 01 RISK has been added to the Project window (not shown). It represents the tree corresponding to this analysis. Multiple trees (analyses) can be included in a single project.


The Tree window is composed of five tabbed sheets: Tree, Gains, Risk, Rules, and Summary. Several of these are of interest after the analysis is run (after growing occurs), and we will examine them. The most important menus within the Tree window are Analysis, which controls features and criteria for the analysis, and Tree, which runs the analysis (Tree > Grow Tree) and allows you to customize the resulting tree by selectively growing or pruning branches. The View menu will open up additional windows: a Tree Map window that aids navigation within a large tree (open by default) and the Data Viewer window. This menu also contains options to present different views or summaries of the selected node (a graph or table). Many of these features are also available via tool buttons. To perform the full analysis we will choose Tree > Grow Tree. Another Tree menu choice allows you to grow the tree one level at a time. You can also grow a branch from a node you select (since only the root node is present, the tree and branch choices are equivalent). Although we will not pursue it here, you can direct CHAID in growing the tree (Select Predictor), and modify the tree CHAID creates (Remove Branch, Remove One Level, Select Predictor).
Click Tree > Grow Tree
Maximize AnswerTree to see a larger Tree window
From this window, we see the tree has grown at least two levels deeper. The entire tree is not visible within this screen and we can use the scroll bars to adjust the view. For larger trees a tree-navigation window called the Tree Map window is very helpful and we will use it later. Also note that you can use the zoom tools to increase or decrease the size of the tree (or click View > Zoom).


Figure 5.9 CHAID Tree With Root Node Visible and One Node at First Split

As mentioned earlier, the root node represents the entire sample (there is no missing data for the credit risk field) and the base rate for credit risk. The first split (below the root node) is due to age; in other words, of all the predictor variables, age had the strongest (most significant, as measured by chi-square) relationship with credit risk. Although only one of the four age groups is visible above, we see that those over 39 and 45 or younger had proportionately more bad losses (42.85%) than the two younger groups (7.52% and 22.28%). Thus by focusing on the younger subgroups, we could reduce the proportion of bad loss risks and increase the proportion of bad but profitable risks relative to the other, older age categories (scroll right and left to examine the four age groups). The significance value and chi-square statistic summaries appear above the split. By default CHAID originally formed ten age categories and these were merged into three distinct age categories. Scroll left and down to see the split below the Age (25,39] node.


Figure 5.10 CHAID Tree with Second Split

The node (subgroup) composed of those over 25 and 39 or younger is in turn split into four nodes based on income. Those in the higher income categories were more likely to be good risks (over 50%) and less likely to be bad but profitable risks than those in lower income categories. Thus of the remaining predictor variables, income was the most significant predictor for this age subgroup (25,39). If the tree is small, you can scroll through it and examine the terminal nodes for high or low concentrations of the response of interest. It is at the terminal nodes where predictions are made by the tree. You would typically examine the tree carefully to make sure the splits and results make sense. We will move on to examine the other summaries.
What of our credit risk segments? Where are the highest concentrations of good risks or of bad but profitable risks? Instead of searching the tree, a more convenient summary would be a table listing the nodes in decreasing order by the percentage of cases falling in the target category of interest. Thus you can easily choose the top-so-many subgroups to focus on. To request this gains table:
Click Gains tab (at bottom of Tree window)
Click Format > Gains
Select good risk in the Gain Column Contents Category list box (not shown)
Click OK


Figure 5.11 Gains Table for Good Risks (Training Data)

Before examining the details of this table, notice the third title line lists the target variable label (Credit risk) and target category (good risk). By default, the target category with the highest code appears, and we changed it to examine the good risk category. The table is now organized by, and percentages displayed for, the good risk category. The left half of the table (Node-by-Node) separately summarizes each terminal node (segment), while the right half (Cumulative Statistics) displays cumulative summaries. A row represents a segment (or a terminal node on the tree) and the sixth column (Resp %) contains the percentage of that group who are good credit risks. In the first row (node 9) we see 63.2% of the group represent good credit risks. Notice the table is sorted in descending order by this column. Thus you can easily pick the top segments.
The first column contains an identification number for the node (it corresponds to the node number in the Tree Map window (not shown)). The Node: n column reports the number of observations in the segment (node) and the Node: % column displays the segment (node) size as a percentage of the entire sample. The Gain: n column displays a count of the target category responses occurring within the node, and the Gain: % column indicates the percentage of all target category responses that are contained in this node. Node 9 (which from the tree are those over 39 but 45 or younger with at most 1 credit card) contains 13.3% of all good credit risk individuals, although it represents only 3.7% of the training sample. Finally, the index is a ratio (multiplied by 100) of the node's percentage of target category responses to the overall sample's percentage. The segment with the highest percentage was those over 39 but 45 or younger with at most 1 credit card, at 63.2 percent, while the entire sample had a good credit risk percentage of 17.45%; the ratio of these (multiplied by 100 and rounded) is 362.4.
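As a quick check on how these columns relate, here is a small Python sketch (a hypothetical helper function, not AnswerTree output) that derives the Resp %, Node %, Gain % and Index values for one terminal node from four counts.

def gains_row(node_n, node_target_n, total_n, total_target_n):
    resp_pct = 100.0 * node_target_n / node_n           # Resp %: target rate within the node
    node_pct = 100.0 * node_n / total_n                 # Node %: node size as a % of the sample
    gain_pct = 100.0 * node_target_n / total_target_n   # Gain %: node's share of all target responses
    base_pct = 100.0 * total_target_n / total_n         # overall target rate in the sample
    index = 100.0 * resp_pct / base_pct                 # Index: node rate relative to the base rate
    return resp_pct, node_pct, gain_pct, index

# For node 9 the Resp % is about 63.2 and the base rate about 17.45,
# so its index is roughly 100 * 63.2 / 17.45, i.e. about 362.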


The summaries on the right side of the chart present the same statistics accumulated across the segments (terminal nodes). Using them you can see, for example, what the overall percentage of good credit risks in the top three segments (nodes) is. These results are for the training data; to see a similar table for the validation (testing) data simply click View > Sample > Testing. We next examine the training sample's assignment rules.

Rules For Classifying Cases


Suppose you are satisfied with the current solution and wish to identify new cases that fall into the most promising subgroups (those with the highest proportions of good credit risks, or perhaps you might choose bad but profitable risks). The Rules tab sheet displays selection or assignment rules for selected nodes. Moreover, it can format these rules as SQL (Structured Query Language), SPSS, or logical statements. You can view these rules and export them in order to apply them to databases, SPSS data files, or with other programs. Since the top four nodes account for about 61% of the target responses (see Gain: % column in the Cumulative section of the table), we will select them as the most promising segments and view their assignment rules.
Click on the first node in the Nodes column of the Gain Summary table
Ctrl-click on the next three nodes in the Nodes column
Click the Rules tab
By default, the Rules window displays a SQL statement to select all terminal nodes currently selected. For users of SPSS, we will format the rules as assignment rules in SPSS syntax.
Click Format > Rules (not shown)
Click SPSS Type option button
Click Assigning values option button
Click OK


Figure 5.12 Assignment Rules in SPSS Syntax

If we focus on the section pertaining to node 9, we see that those over 39 but 45 or younger (AGE GT 39 AND AGE LE 45) with at most 1 credit card (NUMCARDS LE 1) are assigned a node (nod_001 variable) value of 9. In addition, their predicted target category (based on the most common response category in the node) is stored in the variable pre_001 (here 3, corresponding to the numeric code for good risk). The predicted probability that a node member falls in this category (here 63.2% or .632) is assigned to the prb_001 variable. The case assignment format provides more detailed information than the selecting cases format.
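For readers who prefer to see the logic outside of SPSS syntax, the minimal Python sketch below expresses the node 9 rule described above; the dictionary keys mirror the AGE and NUMCARDS fields, and the returned values follow the nod_001/pre_001/prb_001 scheme (this is an illustration, not the exported syntax itself).

def assign_node_9(record):
    """Return (node, predicted category, probability) if the record satisfies the
    node 9 rule, otherwise None; code 3 corresponds to good risk."""
    if record["AGE"] > 39 and record["AGE"] <= 45 and record["NUMCARDS"] <= 1:
        return 9, 3, 0.632
    return None

print(assign_node_9({"AGE": 42, "NUMCARDS": 1}))   # (9, 3, 0.632)
print(assign_node_9({"AGE": 30, "NUMCARDS": 0}))   # None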

Other Tab Sheets In The Tree Window


The Risk tab presents a misclassification matrix, which is a table indicating the number and types of correct and incorrect classifications that would result from the classification rules. Since no misclassification costs or priors were specified, predictions for each rule are that all members of a terminal node are classified into that node's most frequent response category (the mode).
Click Risks tab in Tree window


Figure 5.13 Risk Summary for Training Data

In the misclassification table, the columns correspond to the actual target categories in the data and the rows represent the target category predictions using the classification rules. For each actual target category, we can see how often the model predicts the correct outcome and how the errors are distributed among the other categories. Of the 324 actual good risks, the model correctly predicts 198, or about 61% correct. However, notice that only 166 of the 420 bad risks with loss individuals were predicted correctly (40%). Depending on which target category is most important to predict correctly, you might evaluate this result differently. The risk estimate, which is the overall misclassification or overall error rate when equal misclassification costs and no target-category prior probabilities are specified, is about .28. Thus, overall, about 72% of the training cases are predicted correctly. The misclassification table is important because it indicates the frequencies of the different types of errors made using the classification rules. In turn, the risk estimate provides a valuable estimate of the overall misclassification risk, here an error rate, when using the classification rules. Let's examine the same table for the validation data.
Click View > Sample > Testing
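Before switching views, note that the arithmetic behind these statements is simple; the short Python sketch below (plain arithmetic on the counts quoted above, not AnswerTree output) reproduces the per-category accuracy figures and relates the risk estimate to the overall percentage correct.

good_correct, good_total = 198, 324     # actual good risks predicted correctly
loss_correct, loss_total = 166, 420     # actual bad risks with loss predicted correctly

print(round(good_correct / good_total, 2))   # about 0.61
print(round(loss_correct / loss_total, 2))   # about 0.40

risk_estimate = 0.28                    # overall error rate reported on the Risk tab
print(1 - risk_estimate)                # about 0.72 of training cases predicted correctly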


Figure 5.14 Risk Summary for Test Data

The misclassification table for the 598 test cases is very similar to the training data results (the risk estimate is .276 for training data versus .284 for validation (testing) data). This validation step gives us greater confidence in the model. We could also examine the gains table for the testing data. The Summary tab (not shown) presents a summary of the information needed to reproduce the analysis. This includes the data file name, target and predictor variable information, tree growth criteria, and the size of the tree. This serves as a useful document recording how the solution was obtained. If our goal were to predict good credit risks, the model does this fairly well. However, since some of the predicted good risks are actually bad risks with losses, further evaluation based on the expected profit or loss from this model should be performed. Also, a domain expert should review the rules or the tree in order to ensure that they are reasonable.
Click File > Exit AnswerTree
Click No when asked to save the project


RULE INDUCTION (C5.0)


The C5.0 rule induction method also constructs a decision tree by recursive splitting of the data. It begins with all data in a single group and calculates the information measure of the target variable (if pi is the proportion of cases in target category i, the information (or entropy) measure is -Σ pi*log2(pi), summed over the target categories, where log2 is the log to the base 2). It then examines each predictor, determining which one will produce a split that maximizes the information gain ratio. Although this is a technical measure, if a split tends to concentrate cases in one or more target categories within each of its child nodes, it will have a higher information gain ratio. For example, if there were three target categories initially distributed as (1/3, 1/3, 1/3), then a split that resulted in two subgroups with target categories distributed as (1, 0, 0) and (0, 1/2, 1/2) would result in an information gain. Thus the C5.0 criterion tends to create subgroups that concentrate cases into a single target response category.
One point to be aware of is that, by default, the aim of C5.0 is to produce a rule that is totally accurate on the training data. If the data have any extreme, unique or special cases, no matter how few, the tree will attempt to predict these also. This can result in a model that has high accuracy on the training data but low performance on unseen data. This is known as building on accuracy and not generality. There will be times when accuracy is required, as when you are not going to use the tree to make predictions on future data and instead simply want to explain the interrelationship between fields in the given data set. However, it is more likely that a data miner wants to build a model to explain, in general terms, what is happening in the data and then use the model to try to predict outcomes in the future. If this is the case, C5.0 can be set to favor generality rather than accuracy. We will choose this option.
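The following Python sketch (an illustration of the measure, not the C5.0 implementation) computes the entropy for the example above, along with the information gain and the C4.5-style gain ratio for the split into (1, 0, 0) and (0, 1/2, 1/2) subgroups.

from math import log2

def entropy(proportions):
    # -sum(p * log2(p)) over the target categories; categories with p = 0 contribute nothing
    return -sum(p * log2(p) for p in proportions if p > 0)

parent = entropy([1/3, 1/3, 1/3])          # about 1.585 bits for the (1/3, 1/3, 1/3) root
# the first child holds the 1/3 of cases from the first category; the second holds the other 2/3
children = (1/3) * entropy([1, 0, 0]) + (2/3) * entropy([0, 0.5, 0.5])
gain = parent - children                   # about 0.92 bits of information gained by the split
split_info = entropy([1/3, 2/3])           # entropy of the split itself (relative branch sizes)
gain_ratio = gain / split_info             # gain ratio penalizes splits into many small branches
print(round(gain, 3), round(gain_ratio, 3))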

A C5.0 EXAMPLE: CREDIT RISK


We will apply the C5.0 algorithm to the credit risk data using Clementine. As before, we examine and work with a previously constructed Clementine stream. From within Clementine:
Click File > Open Stream
Move to the c:\Train\DM_Model directory
Double-click on C50Cred.str


Figure 5.15 Clementine C5.0 Streams for Training and Validation Data

The upper stream reads the training data (train.txt in text format), passes it through a type node and then to the C5.0 model node. The analysis was run previously, so we already have a generated model node in the stream (based on the C5.0 model), which leads to a Matrix node (misclassification table) and an Analysis node, which provides useful summaries of model fit. The lower stream passes the separate validation data (validate.txt) through the model so we can validate the results. Although the generated node from the C5.0 model already appears in the stream, we will rerun the model.
Right-click on the Table node in the upper stream
Click Execute on the Context menu


Figure 5.16 Credit Data

We are fitting the same data as we did earlier in this chapter using CHAID.
Close the Table window
Double-click on the Type node in the upper stream

Figure 5.17 Type Node


Risk is declared as the output field and all fields except ID are inputs to the model. This Type node is identical to the one used when we applied neural networks to the same data (Chapter 4).
Close the Type node dialog
Double-click the C5.0 node

Figure 5.18 C5.0 Model Node Dialog

For a field (variable) that has been defined as a set (categorical with more than two categories), C5.0 will normally form one branch per value in the set. However, by checking the Group symbolics option, as has been done here, the algorithm will search for sensible groupings of the values within the input field (ones that show no difference in predicting the outcome risk), thus reducing the potential number of rules. For example, instead of having one rule per region of the country, Group symbolics may produce a rule such as:
Region [North, East]
Region [South]
Region [West]
Once trained, C5.0 builds a decision tree or a ruleset that can be used for predictions. However, it can also be instructed to build a number of alternative rule models for the same data by selecting the Use boosting option. In each boosted model, cases mispredicted by the previous model are given greater weight. Then, when it makes a prediction, it consults each of the rules before making a decision. This can often provide more accurate predictions, but takes longer to train and a single decision tree no longer represents the model. The algorithm can be set to favor either Accuracy or Generality. Since we want a model that applies to other cases, we chose Generality.
Click Execute
A new generated model node (not shown) appears in the Models tab of the Manager window

Right-click the C5.0 Generated Model node in the Models tab
Click Browse from the Context menu
Click the Show all levels button to see all branches of the tree
Click the Show or hide instance and confidence figures button to see those values

Figure 5.19 C5.0 Rule Browsing Window (Beginning)


Information on the number of records (Instances) used to generate each branch, and the level of Confidence attached to a rule's conclusions, now appear since we requested them with the Show or hide instance and confidence figures button. The first branch informs us that 335 individuals had no other loans (0), that the prediction for individuals in this branch is good credit risk (which is also the modal response), and that the confidence of this prediction (here the proportion of cases in the subgroup that are good credit risks) is .591 (59.1%). Additional conditions are added to the group with at least one other loan, and we see that income (which was an influential predictor in the neural network model and appeared in the CHAID tree as well) appears in the model, as do number of children and number of credit cards, among other fields.
Notice that some of the branches have small coverage values (9, 7 or 11 cases). By default, the minimum number of records per branch is set to 2 (meaning at least two branches from the node must have at least two cases each, although our choice to favor generality increased it a bit). To try to avoid some of these very small branches, which may not generalize well to other data, you could click the Expert option button in the C5.0 Model Node Editing dialog and change this setting. You can examine the rules in much the same way as we examined the nodes within the decision tree that CHAID produces. The information in the browsing window is a decision tree, but presented using If-Then logic instead of a graphic. To examine the accuracy of prediction we can use a Matrix node or an Analysis node.
Close the Rule Browser window
Double-click the Matrix node in the upper stream


Figure 5.20 Matrix Node

We place Credit risk (Risk) in the Rows and predicted credit risk ($C-RISK) in the Columns. Counts and row percents based on Risk will appear in the body of the table, as set in the Appearance tab (not shown). Click Execute


Figure 5.21 Misclassification Table for Training Data

The overall pattern is similar to what we found with the CHAID model. Of the 421 actual good risks, the model correctly predicted 281 (66.7% correct). The comparable CHAID figure for the training data was 63% (although it was based on fewer cases due to the validation (testing) sample). Also, this model correctly predicts 50.6% (283 of 559) of the bad losses.
Close the Matrix output window
Right-click on the Analysis node in the upper stream

We requested that the analysis of the prediction results be broken down by risk category (not shown). In this way a summary for each risk category will appear in addition to the overall summary. Click Execute on the Context menu The window that opens is quite large and you may need to resize it.


Figure 5.22 Analysis Window

The overall results appear first (accuracy of 76.74%), which did not appear in either the C5.0 model browsing window or the matrix. A report on the confidence values is supplied, including the mean confidence when correct and incorrect predictions are made (the first mean should be higher than the second). Then the results are broken down by each risk category.


Model Applied To The Validation Data


The lower stream takes the Generated Model node from this analysis and applies the validation data to it. We will examine the Matrix for the validation data.
Close the Analysis window
Right-click on the Matrix node in the lower stream
Click Execute on the Context menu

Figure 5.23 Misclassification Table for Test Data

The results look very similar to those for the training data (a good sign for the model). For the actual good risks, 234 of 383 were correctly predicted (61.1%). This is lower than in the training sample (66.7%). Again, some drop off is to be expected when shifting to validation data.
Close the Matrix window
Right-click the Analysis node in the lower stream
Click Execute from the Context menu
The overall accuracy for the validation data is 76.2%, which is very comparable to the 76.7% for the training data.


Figure 5.24 Analysis Summary for Validation Data

Although we did not evaluate the CHAID model on the separate validation file used to validate the neural net and C5.0 results, the accuracy values of the CHAID model and C5.0 appear very similar. Unless they differ in accuracy for a target category of critical interest to you, either model should suffice.

Appropriate Research Projects


Examples of questions for which decision trees are appropriate are:
Determine who will respond to a direct mailing
Identify factors causing product defects in manufacturing
Predict who needs additional care after heart surgery
Discover which variables are predictors of increased customer profitability

Other Features
There are several methods available to grow a tree, and decisions to be made about the minimum number of cases in a new node, about statistical criteria, and so forth. In addition, analysts have often found it best to grow a very large tree and then prune its branches, to make it simpler but still retain a high level of accuracy (C5.0, C&RT and QUEST do this).
Decision trees have been used on files of all sizes, although in general as the number of potential predictors increases, the number of cases should increase as well. Otherwise, only a few predictors have any chance of being chosen. Decision trees are often used on extremely large files (millions of records) and perform quite well. When files are large, the time to create a tree can be long, but would be faster than a neural network.
It is essential that the results be validated on a test data set, since no global statistical test is being done to assess the overall fitness of a solution. Also, like many classic data mining methods, it is common to try several different algorithms, and variations within those methods, to find the best solution, defined primarily by predictive accuracy on validation data.

Model Understanding
Decision trees produce easily understood rules that describe the predictions they make in If-Then language statements. No overall equation is available to describe the model, but that is not the intent of these methods. The tree itself is also helpful in illustrating the solution, though a tree with many predictors becomes quite cumbersome to display.

Model Deployment
Predictions for new cases are made by applying simple rules, and SQL statements can be created to represent the rules. In this way, a decision tree can be applied directly to the cases in a data warehouse. This makes deployment extremely efficient on new data, and it is often possible to make predictions in real time.


Chapter 6 Cluster Analysis


Topics:
INTRODUCTION
WHAT TO LOOK AT WHEN CLUSTERING
CLUSTERING WITH THE K-MEANS METHOD
A K-MEANS EXAMPLE: CLUSTERING SOFTWARE USAGE DATA
CLUSTERING WITH KOHONEN NETWORKS
A KOHONEN EXAMPLE: CLUSTERING PURCHASE DATA


INTRODUCTION
Cluster analysis is an exploratory data analysis technique designed to reveal natural groupings within a collection of data. The basic criterion used for this is distance, in the sense that cases close together should fall into the same cluster, while observations far apart should be in different clusters. Ideally the cases within a cluster would be relatively homogenous, but different from those contained in other clusters. As cluster analysis is based on distances derived from the fields in the data, these fields are typically interval, ordinal (for example, Likert scales that are considered a close enough approximation to an interval scale), or binary in scale (for example, checklist items). When clustering is successful, the results suggest separate segments within the overall set of data. Clustering is available in SPSS under the Analyze > Classify menu, and in Clementine through the Kohonen network (which is a form of neural network analysis), the K-means and the TwoStep nodes (both of which are also available in SPSS).
Given these characteristics, it is not surprising that cluster analysis is often employed in market segmentation studies, since the aim is to find distinct types of customers toward whom more targeted and effective marketing and sales action may be taken. In addition, for modeling applications, clustering is sometimes performed first to identify subgroups in the data that might behave differently. These subgroups can then be modeled separately or the cluster membership variable can be included as a predictor. Also, note that one of the neural network methods, radial basis function method (RBFN), performs clustering within the hidden layer and thus partially incorporates this principle.
Keep in mind that cluster analysis is considered an exploratory data method. Expecting a unique and definitive solution is a sure recipe for disappointment. Rather, cluster analysis can suggest useful ways of grouping the data. In practice, you typically consider several solutions before deciding on the most useful one. Domain knowledge of customers and products/services will play a role in deciding among alternative solutions. Cluster analysis is not an end in itself, but one step in a data mining project. The clusters you discover are often used next as predictors, as separate analysis groups to each of which a separate model is fit, or as groupings in OLAP reports.
There are many different clustering methods, but in the area of data mining two are in wide usage. This is because the large class of hierarchical clustering methods requires that distances between every pair of data records be stored (there are n*(n-1)/2 such pairs) and updated, which places a substantial demand on memory and resources for the large files common in data mining. Instead clustering is typically performed using the K-means algorithm or an unsupervised neural network method (Kohonen). Of the two, K-means clustering is considerably faster. Clementine also contains a TwoStep cluster method that quickly creates a number of preliminary clusters and then performs a hierarchical cluster analysis on these preliminary clusters. In addition, the TwoStep algorithm uses statistical criteria to indicate the optimal number of clusters.
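A quick bit of arithmetic (with hypothetical record counts) shows why the n*(n-1)/2 pairwise distances required by hierarchical clustering become impractical on data mining files:

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} records -> {n * (n - 1) // 2:,} pairwise distances to store")
# a million records already require roughly 500 billion stored distances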


In this chapter we will apply two of these clustering methods, each to a different data set (K-means to software usage data and Kohonen to shopping purchases).

WHAT TO LOOK AT WHEN CLUSTERING


Number Of Observations Per Cluster
The number of observations in each group will be a critical factor, as outliers can form clusters containing very few observations. For marketing and direct mail applications, hopefully most customers will fall into a limited number of clusters and most will contain an appreciable number of cases (say 10% or more of the sample).

Cluster Profiles
For help in understanding and interpreting the clusters, you can turn to the cluster group means computed on the cluster variables. These can be viewed in table format, or as line charts to aid visualization. Below we show a multiple line chart composed of usage information for four cluster groups.

Figure 6.1 Profiling the Cluster Groups

Briefly, the multiple line chart shows usage information of analytical software for four cluster groups. Without going into detail here, we can see that group 1 (solid line) tends to report usage of all the analytical techniques. Group 3 (heavy dotted line) tends to use basics, presentation statistics and mapping and relatively little of the others. Groups 2 and 5 (dashed lines) exhibit patterns that seem to be the converse of Group 3. We will interpret this data in greater detail later. Here we hope to demonstrate the importance of such plots in interpreting and naming the clusters. Another way to profile clusters is to apply rule induction methods (Chapter 5) to predict the cluster groups from the fields used in the initial clustering or from demographics.

Validation
Also, a validation step is important. Do the clusters make conceptual sense? Do the clusters replicate with a different sample (if data are available)? Do clusters replicate using a different clustering method? Are the clusters useful for business purposes? As you pass over each of these hurdles, you have greater confidence in the solution.

Clustering With The K-Means Method


The K in K-Means is derived from the fact that for a specific run the analyst chooses (guesses) the number of clusters (K) to be fit. The means portion of the name refers to the fact that the mean (or the centroid) of observations in a cluster represents the cluster. Since the analyst must pick a specific number of clusters to run, typically, several attempts are made, each with a different number of clusters, and the results evaluated with the criteria mentioned earlier (group size, profile, and validation). Since the number of clusters is chosen in advance and is usually small relative to the number of observations, the K-means method runs quickly. This is because if seven clusters are requested, the program needs only track the seven clusters. In hierarchical clustering the distance between every pair of observations must be evaluated and intercluster distances recomputed at each cluster step (which is a fairly intensive computational task). For this reason, K-means is a popular method for data mining.
One nice feature of K-means clustering (in SPSS) is that you can try out your own ideas (or apply the results of other studies) as to the definition of clusters. An option is available in which you can provide starting values (think of them as group means of the clusters) for each of the clusters, and the K-means procedure will base the analysis on them.
A brief description of the K-means method follows. If no starting values (see preceding paragraph) for the K cluster means are provided, the data file is searched for K well-spaced (using distances based on the set of cluster variables) observations, and these are used as the original cluster centers. The data file is then reread with each observation assigned to the nearest cluster. At completion every record is assigned to some cluster, and the cluster means (centroids) are updated due to the additional observations (optionally the updating can be done as each observation is assigned to a cluster). At least one additional data pass (you can control the number of iterations) is made to check that each observation is still closest to the centroid of its own cluster (recall the cluster centers can move when they are updated, based on addition or deletion of members), and if not, the observation is assigned to the now nearest cluster. Additional data passes are usually made until the clusters become stable. Based on simulation studies, K-means clustering is effective when the starting cluster centers are well spaced. Again, several trials, varying the number of clusters, are typically run before settling on a solution.
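A minimal Python/NumPy sketch of these passes appears below. It is a simplified illustration, not the SPSS Quick Cluster procedure (for example, the initial centers are simply sampled from the data rather than chosen to be well spaced), and the usage matrix at the bottom is simulated.

import numpy as np

def k_means(data, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # pick k starting centers from the data (a simplification of "well-spaced" starting cases)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # data pass: assign every record to the nearest cluster center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each center to the mean (centroid) of the records assigned to it
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when the clusters are stable
            break
        centers = new_centers
    return labels, centers

# e.g. three clusters on a simulated matrix of 0/1 module-usage flags (310 cases, 9 modules)
usage = np.random.default_rng(1).integers(0, 2, size=(310, 9)).astype(float)
labels, centers = k_means(usage, k=3)
print(np.bincount(labels))                      # cluster sizes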

A K-MEANS EXAMPLE: CLUSTERING SOFTWARE USAGE DATA


We will demonstrate cluster analysis with software usage data from a customer segmentation study. Customers of a statistical software company were asked, as part of a survey, which of the optional modules they had used. A checklist was presented and all modules that were used were checked. Each module was defined as a separate variable coded 1 if checked and 0 otherwise. The modules include Basics, Professional Statistics (now named Regression Models), Advanced Statistics (now named Advanced Models and including Multivariate and Health Science Analysis), Time Series, Presentation Tables, Perceptual Maps, Automatic Interaction Detection (CHAID), Mapping, and Neural Nets. While many other questions were asked as part of the study, the file we will use includes only a single additional variable (job area). The sample is composed of 310 observations containing complete data. The data file is an SPSS data file named Usage.sav (located in the c:\Train\DM_Model directory).
It is important to note that the data used in this example have been distorted so as not to compromise the confidentiality of the study. Thus not all modules listed above appeared within the software package when the data were collected. Also, the segments and especially the sizes of the segments should not be taken to represent true values. The data masking does not reduce the richness of the analysis, and the same issues faced in the original study come into play in this analysis. From within SPSS:
Click File > Open > Data and move to the c:\Train\DM_Model directory
Double-click on Usage
Now we apply the K-means nonhierarchical method on the usage data. We will go through the steps of running a three-cluster solution, but also display the summary results from a four-group analysis. In the full analysis (not shown) other numbers of clusters were tried and the three- and four-cluster solutions seemed most interesting.
Click Analyze > Classify > K-Means Cluster


Figure 6.2 K-Means Clustering Dialog Box

Notice that two clusters will be created by default. The Centers pushbutton would be used if you wished to provide starting points for the clusters (these might be based on previous research, or to try hypothetical scenarios). The Method area permits you to iterate and classify or to classify only; we want to iterate and classify, which actually applies the K-means cluster method to the data. The classify only choice is usually used to assign additional cases to clusters already established (since the clusters are not updated); this method is sometimes called nearest-neighbor discriminant analysis. The Iterate pushbutton controls the criteria used to determine when the solution is stable and the algorithm will stop. The Save pushbutton is used to create a cluster membership variable (only one since K-means creates a specified number of clusters). The Options pushbutton permits you to display ANOVA tests on each clustering variable; these are not taken seriously as actual statistical tests since clusters are formed that maximally separate groups, but as indicators of which cluster variables are most important in the formation of clusters. This is useful when many variables are used in the analysis and you wish to focus attention on the most important ones. Stepwise discriminant analysis (see Chapter 2) or decision tree methods (see Chapter 5) are also used for this purpose.
Move all variables except Jobarea into the Variables list box
Replace the 2 with a 3 in the Number of Clusters text box
Click Save pushbutton
Click Cluster membership check box to save cluster membership variable


Figure 6.3 Save Subdialog Box

We wish to create a cluster membership variable based on the K-means method. We can also save a variable containing the distance from each observation to the center of its cluster.
Click Continue to process the Save requests
Click Options pushbutton
Click ANOVA table checkbox in the Statistics area

Figure 6.4 Options Subdialog Box

We really don't need the ANOVA table here because there are not many usage variables. However, if we were clustering on 30 variables, then the ANOVA table might suggest a subset of important variables on which to focus attention. Alternatively, stepwise discriminant or logistic regression, or a decision tree method, could be used for this. Note the K-means procedure contains an option to include in the analysis cases that are missing on one or more of the clustering variables.
Click Continue to process the Options


Figure 6.5 Completed K-Means Cluster Dialog Box

We are now ready to run a cluster analysis.
Click OK
Note: The Final Cluster Center pivot table displays means rounded to the nearest whole number (since the original variables are formatted as integers). We need to see more decimal places, so we must edit the pivot table.
Double-click on the Final Cluster Center pivot table
Click in the first data cell (1: Advanced Stats - Cluster 1)
Shift-click in the last data cell (1: Time Series - Cluster 3)
Click Format > Cell Properties
Set the Decimals value to 2 using the spin control
Click OK
Click outside the crosshatched border to close the pivot table editor


Figure 6.6 Final Cluster Centers (Mean Profiles)

Above we see the mean profiles for the three segments constructed by the K-means cluster method. We will examine the mean patterns to interpret the clusters and display them in a multiple line chart.
Group 1 is composed of customers who use the more advanced statistical modules and make little use of the presentation procedures: we might call them Technicals.
Group 3 shows the reverse pattern, using Tables and Mapping heavily, but not the more statistical modules: we could label them Presenters.
Group 2 is based on those who use most, if not all, the modules, so they are Jacks-of-all-trades.

Figure 6.7 Sizes of Clusters

The size of each cluster appears as part of the K-means output. Each cluster contains a substantial number of cases (relative to the sample size); there are no clusters based on only a few data points (outliers).


Figure 6.8 ANOVA Table

As mentioned earlier, we should not take the significance values seriously here because clustering is designed to optimally create well-separated groups. However, when clustering with many variables (not the case here), analysts can use the F values to guide them to those variables most important in the clustering (those with larger F values). Here the F values range from 40 (professional statistics) to 261 (advanced statistics) and this provides an indicator as to where our attention should be concentrated. Stepwise discriminant analysis or decision tree methods are applied to the clusters for the same purpose: to identify the important cluster variables. Next we view a multiple line chart profiling the three segments.
Click Graphs > Line
Click Multiple Line icon (we are displaying three segments)
Click Summaries of separate variables option button (since all usage variables are to be plotted)
Click Define pushbutton
Move all variables except Jobarea and qcl_1 into the Lines Represent list box
Move qcl_1 into the Category Axis list box


Figure 6.9 Completed Multiple Line Chart Dialog Box

Since the SPSS procedure that performs the K-means method is called Quick Cluster, by default, the cluster membership variable it creates is named qcl_1 the first time it is run during a session, qcl_2 for the second, etc.
Click OK to create the chart
By default, each variable used in the clustering appears on a separate line, but we want each line to represent a different cluster group.
Double-click on the Chart, then click Series > Transpose Data


Figure 6.10 Mean Profiles for the K-Means, Three Cluster Solution

Note: We could have also created the line chart by activating (double-clicking) the Final Cluster Centers pivot table, selecting the cluster means, then right-clicking and selecting Create Graph > Line from the context menu. It is easier to interpret the cluster groups in this way when the clustering is based on a small number of variables that share the same scale. The contrast in profiles for the three cluster groups (named Technicals (1), Jacks-of-All-Trades (2), and Presenters (3)) is clear.
Click File > Close (to close the Chart Editor)
For an additional comparison, we reran the K-means method requesting four clusters. Figure 6.11 displays the multiple line chart of the four segments below.


Figure 6.11 Mean Profiles for the K-Means, Four Cluster Solution

The additional (fourth) cluster (cluster #3 in the chart) is high on advanced statistics and time series usage, but lower on the presentation procedures. It looks as if we've split off, from the Technicals group, a subgroup that focuses on time series analysis.

How Many Segments Have We Found?


The results of the K-means analysis have not resolved the three- versus four-segment question. There is greater separation or differentiation among three segments than is found with four. The fourth segment, "Technicals who forecast," is the smallest of the four (counts not shown). Whether you would recommend a three- or four-segment solution would probably depend on your organization's ability to target and profitably make use of this cluster. In practice, if these clusters make business sense, you would proceed to relate the clusters to demographics and other information about the individuals. This would serve to further describe the clusters. To obtain some statistical guidance as to the number of clusters, the TwoStep method could be run.


CLUSTERING WITH KOHONEN NETWORKS


A Kohonen network is a type of neural network that performs unsupervised learning, labeled as such because it has no output or outcome to predict. Kohonen networks are used to cluster or segment data based on the patterns of input fields. Kohonen networks make the basic assumption that clusters are formed from patterns that share similar features and will therefore group similar patterns together. Kohonen networks are usually one- or two-dimensional grids or arrays of neurons. Each neuron is connected to all inputs, and weights are placed on each of these connections (as in a standard neural network). There is no actual output layer in Kohonen networks, although the Kohonen map containing the neurons can be thought of as an output. The diagram below shows a simple representation of an output grid or Kohonen map.

(Note that not all input neuron connections to the map are shown.) When a record is presented to the grid, its pattern of inputs is compared with those of the artificial neurons within the grid. The artificial neuron with the pattern most like that of the input wins the input. This causes the weights of the artificial neuron to become adjusted to make it appear even more like the input pattern. The Kohonen network also slightly adjusts the weights of those neurons surrounding the neuron with the most similar pattern. This has the effect of moving the most similar neuron, and the surrounding nodes to a lesser degree, to the position of the record in the input data space. The result, after the data have passed through the network a number of times, will be a map containing clusters of records corresponding to different types of patterns within the data. Because of the many iterations with weight adjustments being made, Kohonen networks, especially those with a large number of neurons, take considerably longer to train than K-means clustering. Yet, they provide a different and potentially valuable view of groupings in the data.
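The update logic can be sketched in a few lines of Python/NumPy. This is a simplified illustration of a single self-organizing map training step, not Clementine's Kohonen node; the grid size, learning rate, neighborhood radius and record values are all hypothetical.

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, n_inputs = 3, 3, 8
weights = rng.random((grid_h, grid_w, n_inputs))       # one weight vector per output neuron

def train_step(record, weights, eta=0.3, radius=1.0):
    # find the neuron whose weight vector is closest to the record (the "winner")
    dists = np.linalg.norm(weights - record, axis=2)
    wy, wx = np.unravel_index(np.argmin(dists), dists.shape)
    # move the winner, and its grid neighbours to a lesser degree, toward the record
    for y in range(grid_h):
        for x in range(grid_w):
            grid_dist = np.hypot(y - wy, x - wx)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights[y, x] += eta * influence * (record - weights[y, x])
    return (wx, wy)                                     # analogous to the $KX / $KY coordinates

record = rng.integers(0, 2, n_inputs).astype(float)     # e.g. a row of 0/1 purchase flags
print(train_step(record, weights))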

A KOHONEN EXAMPLE: CLUSTERING PURCHASE DATA


In this example, we attempt to segment a data set containing shopping information, stored in a text file, Shopping.txt, located in the c:\Train\DM_Model directory (we used this data for the association analysis in Chapter 3). The file contains fields that indicate whether or not each customer (during a store visit) purchased a particular type of product. The file also contains basic demographics, such as gender and age group, which can be used to help explain and investigate the resulting clusters. From within Clementine:
Click File > Open Stream
Move to the c:\Train\DM_Model directory
Double-click on ShopKoh.str
Most of this stream involves investigation of the clusters resulting from the Kohonen analysis. Examining the upper-left portion of the stream, the data file (Shopping.txt) is accessed and passed through a Type node. We will review this first.


Figure 6.12 Clementine Stream for Kohonen Analysis

Right-click the Table node attached to the Type node, then click Execute View the data, then close the Table window (not shown) Double-click on the first Type node


Figure 6.13 Type Node Setup for a Kohonen Network

We want to segment the data based on purchasing behavior, and therefore have set all other fields to direction None so that they are not used in the Kohonen network. All product fields are set to In, since they will be inputs to the Kohonen network.
Close the Type node dialog
Double-click on the Train Kohonen node


Figure 6.14 Train Kohonen Dialog Box

By default, a feedback graph appears while the network is training and gives information on the progress of the network. It represents a grid of the output neurons. As records are won by neurons, the corresponding cell becomes a deeper shade of red. The deeper the shade, the stronger the neuron is in winning records. The size of the grid can be thought of as the maximum number of possible clusters required by the user. This can vary and is dependent on the number and types of input variables. The size of the grid can be set using the Expert tab. The Expert tab, when chosen, increases the available options and allows the user to further refine the settings of the Kohonen network algorithm. For detail on the expert tab, see the Clementine User's Guide. The Kohonen network can be set to stop training either using the default, a built-in rule of thumb (see the Clementine User's Guide), or after a specified time has elapsed. Similar to the Neural Network node, the Set random seed option can be used to build models that can be reproduced. This is not normal practice and it is advisable to perform several runs to ensure you obtain similar results using random starting points. In this chapter we will retain the defaults for the majority of the above settings, although if multiple models are built you may want to change the network's name for clarity.

To speed up the analysis, we will limit the size of the grid of output nodes to 3 x 3.
Click the Set random seed checkbox
Enter 1000 in the Seed text box
Check the Expert tab
Click the Expert mode option button
Change the Length and Width values to 3 and 3
(We set a specific seed value so these results will be reproduced.)

Figure 6.15 Completed Kohonen Dialog with Expert Options Visible

Click Execute A feedback graph similar to Figure 6.16 will appear while the network is training.


Figure 6.16 Feedback Graph When Training a 3 by 3 Kohonen Net

Darker colors indicate a greater density of cases within a neuron. As the network becomes stable, there is less color shifting. Once the Train Kohonen node has finished training, the Feedback Graph disappears and a Kohonen model node appears in the Models tab of the Manager window.

Browsing this node only gives information on the number of input neurons (number of numeric or flag type fields plus the number of values within each set type field entering the analysis) and the number of output neurons (equal to the number of cells (neurons) within the grid), and the settings for training. Now that we have produced a Kohonen network, we will attempt to understand how the network has clustered the records. The first step we need to take is to see how many distinct clusters the network has found and to assess whether all of these are useful and if any of them can be discarded because of small numbers of extreme records. When data are passed through the Kohonen generated model node, two new fields are created representing X and Y (Kohonen map row and column) coordinates. We previously ran this Kohonen analysis, placed the generated node into the data stream, and connected it to a Plot node. We are able to view the clusters using the Plot node. Within the Plot node we request that the X ($KX-Kohonen) and Y ($KY-Kohonen) coordinates of the Kohonen generated model node be plotted. Also, since all records in a Kohonen maps neuron have the same coordinate values we add some agitation (jittering) to the data values (not shown).


Right-click on the Plot node
Click Execute on the Context menu

Figure 6.17 Plot of Kohonen Axis Coordinates (with Agitation) to Show Clusters

Here we see that although the network has produced nine clusters there appear to be only four main clusters. It is always important to calculate how many records fall into each segment. This can be achieved using a distribution node. First we need to create a unique value for each of the clusters by combining the co-ordinates. To do this we concatenate the co-ordinates to form a two-digit reference number. In Clementine, this involves using a Derive node. Close the Plot display window Double-click the Derive node


Figure 6.18 Creating a Unique Cluster Number for Each Cluster

The formula contains the concatenate expression >< and the two new field names. CLEM (Clementine Expression Language) is case sensitive and field names beginning with $ need to be enclosed in single quotes.
Click OK to return to the stream canvas
Right-click on the Table node (to which the Derive node connects)
Click Execute from the Context menu

Figure 6.19 Cluster Field and Kohonen Axis Coordinate Values

The resulting table contains a field called Cluster that consists of a combination of the two coordinates.

Close the Table window
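The same derivation can be expressed outside Clementine. The sketch below assumes the coordinate fields have been exported under the hypothetical names KX and KY; it builds the two-digit cluster code and then counts records per cluster, mirroring the Derive and Distribution nodes.

import pandas as pd

scores = pd.read_csv("kohonen_scores.csv")   # hypothetical export with columns KX and KY

# Concatenate the two coordinates into a single two-digit code, e.g. 2 and 0 -> "20"
scores["Cluster"] = scores["KX"].astype(int).astype(str) + scores["KY"].astype(int).astype(str)

# Equivalent of the Distribution node: proportion of records in each cluster
print(scores["Cluster"].value_counts(normalize=True).round(3))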

Focusing on the Main Segments


Using a Distribution node, we can now see how many records are in each segment. We previously added a Distribution node to display the newly created Cluster variable.

Right-click on the Distribution node that is connected with the Derive node
Click Execute from the Context menu

Figure 6.20 Distribution of Cases Among Clusters

Here we see that the four main groups are those labeled 00, 02, 20 and 22, and these include about 77% of all records.

Close the Distribution plot window

It may be the case that the smaller groups are of interest (e.g., targeting a specialist product group). However, for this example we shall concentrate on profiling the four largest groups. To this end we need a Clementine Select node to filter out those records falling in the other five groups. An easy way to achieve this is to produce a table and, from it, generate a Select node. This was done beforehand within the Table window: we clicked the cluster value in the Cluster column for four records with values of 00, 02, 20 and 22 (on selection the cells changed color to blue), and then chose the Select Node ("or") option under the Generate drop-down menu, which produced the Select node we see in the stream.

To build profiles of our segments, we need to understand what patterns exist between the fields used as inputs to the Kohonen network and the groups of records in the segments. Since the input fields in this example are flags, we shall use a directed web plot. If the input fields had been numeric, this process would have involved looking at histograms with an overlay of cluster number.

Double-click on the Web Plot node

Figure 6.21 Web Plot Node

A web plot shows the relationships among the categories in different fields. We are interested in how cluster group (for the four major clusters) relates to the fields used to produce the clusters. For this reason, we have checked the Directed web option button and selected the cluster field as the To Field: variable. All the products used for clustering are selected in the From Fields box. This will create connections only between the cluster groups and the products.

The Show true flags only (for binary fields) checkbox is checked because we are interested in seeing relations with purchased items (true value) rather than items not purchased. Instead of counts (the default), the thresholds for weak and strong connections are set to percentages of the To field value, i.e., the cluster variable. In other words, the percentages are based on the number of people in each cluster who purchased a product, which is exactly what we need to profile the clusters. We choose the values of 20, 30 and 45 percent.

Click Execute

Figure 6.22 Directed Web Plot Connecting Cluster Groups to Purchases

The weight of the line indicates whether the connection is weak (light), moderate (normal), or strong (heavy). Here we can clearly see that only a few individual products are associated with the four different clusters.
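The percentages behind the directed web plot can also be checked numerically. The sketch below is illustrative only; the file name, the product flag fields, and their T/F coding are assumptions about how the data might be exported. It computes, for each of the four main clusters, the percentage of members purchasing each product and classifies the links with the same 20, 30 and 45 percent thresholds.

import pandas as pd

data = pd.read_csv("clustered_shoppers.csv")          # hypothetical export
products = ["alcohol", "snacks", "frozen", "tinned"]  # assumed flag fields coded T/F

main = data[data["Cluster"].isin(["00", "02", "20", "22"])]

# Percentage of each cluster purchasing each product (the basis of the web plot links)
pct = main.groupby("Cluster")[products].agg(lambda s: (s == "T").mean() * 100).round(1)
print(pct)

# Classify connection strength using the thresholds chosen in the Web Plot node
def strength(p):
    if p >= 45:
        return "strong"
    if p >= 30:
        return "moderate"
    if p >= 20:
        return "weak"
    return "none"

print(pct.apply(lambda col: col.map(strength)))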

Summarizing the plot and clusters:

Cluster 00: These shoppers are likely to purchase ready-made foods, plus bakery goods and snacks to a lesser extent.
Cluster 02: These shoppers are largely associated with tinned goods, plus snacks and bakery goods.
Cluster 20: These shoppers buy all products heavily, with the exception of toiletries and fresh vegetables.
Cluster 22: These shoppers are more likely to buy alcohol and snacks, plus frozen foods.

We have begun to build a picture of the four groups using the fields on which they were clustered. We shall now use some of the other fields in the data set to build a stronger profile of each of the four groups. Distribution plots (bar charts) can be used to investigate whether there are any relationships between the cluster groups and the demographics within the data set. We demonstrate by using the Children field (do you have at least one child?) as an overlay on a distribution plot of Cluster.

Close the Web plot window
Right-click on the Distribution node (Cluster) that is connected to the Select node (generated)
Click Execute on the Context menu

Figure 6.23 Normalized Distribution of Cluster with Children Overlay

Individuals in cluster 00 are the least likely to have children (which fits with the products they buy more frequently), whereas individuals in cluster 02 are more likely to have children (which is a little harder to square with the products they buy, although that's the intrigue of cluster analysis). Such plots can be used to provide more detail to the cluster descriptions.

Alternatively, decision trees can be built that predict cluster group membership using as inputs either the fields used in the original clustering (results will suggest which inputs determined which clusters) or demographics (which demographics are most strongly related to the clusters). To accomplish this we ran the C5.0 rule induction method twice (not shown). Parts of the decision trees are shown below. Figure 6.24 shows the rules for predicting cluster membership from the various product fields. The rules can be examined to better understand how the products relate to the cluster groups.

Figure 6.24 C5.0 Rules Predicting Cluster Membership from Cluster Inputs

Next we view the decision tree predicting cluster membership from the demographic variables.

Figure 6.25 C5.0 Rules Predicting Cluster Membership from Demographics

In turn, these rules can be examined to see how the demographic fields relate to the cluster groups. Either analysis can provide insight into the characteristics of the cluster groups. As one example, there is a group of 62 people who have no children, are age 18 to 30, single, and male. The C5.0 model predicts they are likely to be in cluster 22 (64.5% are actually in this cluster). Note how this demographic profile makes perfect sense given the products cluster 22 members are likely to buy. If the clusters can be meaningfully interpreted, then they should be examined from the perspective of how they might be relevant in the context of the business.


Appropriate Research Projects


Examples of questions for which cluster analysis is appropriate are:

Creating segments of customers for use in marketing campaigns
Creating segments of members of an association to look for differences in membership renewal
Creating segments of customers for input to decision trees or neural networks

Other Features
In principle there is no limit to the number of cases or input variables for cluster analysis. However, hierarchical clustering methods are memory-intensive and often cannot be used with more than a few thousand cases or a few dozen variables. For that reason, K-means clustering and Kohonen networks are more commonly used in data mining, with the former being essentially unlimited in file size. K-means clustering generally takes considerably less processing time than clustering using Kohonen networks. Since the TwoStep procedure performs hierarchical clustering only in the second step (on the clusters formed in the first step), it also runs relatively quickly and can be used with large files.

For practical reasons, usually no more than ten to twenty variables are used in clustering, because trying to understand the cluster solutions becomes quite complex with more variables than that. And with more than a few variables, graphical output is typically not that helpful. We emphasize again that the number of clusters is a user decision with K-means clustering and Kohonen networks. The TwoStep algorithm does provide guidance, using statistical criteria, on the number of clusters.

Model Deployment
The cluster centers can be input into the K-means clustering routine to cluster new data sets, so model deployment is routine if done in SPSS or Clementine; it is not, however, a matter of applying a simple equation in a spreadsheet program. This means that clustering cannot be easily and directly applied to data warehouses or data marts, although this can be done for SPSS K-means or TwoStep clustering with modest programming effort. Clementine offers other options, including Solution Publisher, to deploy complete streams outside the software.

Normalization
If you cluster using fields that are measured on different scales (for example, age in years and income in dollars or euros) or have very different variances, then it is generally recommended that the fields be normalized or standardized before clustering. Otherwise, those fields with relatively large variances will dominate the clustering. There are exceptions to this, as when clustering on the financial account balances of customers: if the greatest volatility (variance) is in the mutual fund account, it can be argued that it should be the most important field in the clustering, and under this scenario normalization would not be performed. In most analyses, however, normalization is performed prior to clustering. This is not done automatically by the K-means procedure in SPSS and must be completed by the analyst (Descriptives procedure with the Save option in SPSS).

Along similar lines, categorical fields (the set field type in Clementine) must be converted to dummy fields before clustering. This is done automatically by the K-Means, Kohonen, and TwoStep nodes within Clementine, but must be performed by the analyst (using Recode or If statements) within SPSS.
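A minimal sketch of both preparation steps, assuming a pandas data frame with numeric age and income fields and a categorical region field (all hypothetical names): z-score standardization puts the numeric fields on a common scale, and dummy coding expands the set-type field into flags.

import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical input file

# Z-score standardization: subtract the mean and divide by the standard deviation
for field in ["age", "income"]:
    customers["z_" + field] = (customers[field] - customers[field].mean()) / customers[field].std()

# Dummy (flag) coding of a categorical field prior to clustering
customers = pd.get_dummies(customers, columns=["region"], prefix="region")

print(customers.head())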


Chapter 7 Time Series Analysis


Topics:
INTRODUCTION
DATA ORGANIZATION FOR TIME SERIES ANALYSIS
INTRODUCTION TO EXPONENTIAL SMOOTHING
A DECISIONTIME FORECASTING EXAMPLE: PARCEL DELIVERIES


Introduction
It is essential for business organizations to plan ahead and forecast events in order to ensure a smooth transition into the future. To minimize errors when planning for the future, it is necessary to collect, on a regular basis over time, information on the factors that might influence plans. From this information, patterns can be identified, and these patterns help make forecasts for the future. Today, although many organizations store information relevant to the planning process, whether in business reports, database tables or a data warehouse, forecasts are often made on an ad hoc basis. This can lead to large forecasting errors and costly mistakes in the planning process. Analytic methods provide a more structured approach that will reduce the chance of making costly errors.

Statisticians have developed statistical techniques, known as time series analysis, which are devoted to the area of forecasting. Specific methodologies such as exponential smoothing, forms of regression, and ARIMA modeling have been used for years. Traditionally these involved an analyst examining various diagnostic plots of the data series (sequence plots, autocorrelation and partial autocorrelation plots), fitting a tentative model, evaluating the model fit, rerunning the diagnostics, and further modifying the model. This activity requires training and is labor intensive. In recent years, researchers have developed efficient ways of fitting a set of time series models to a data set and reporting on the most promising results. The analyst can evaluate and compare the fit of the alternative models and choose among or combine them. This is the approach taken by SPSS DecisionTime. Such an approach is very much in the tradition of data mining, in that an algorithm applies decision rules that permit a number of models to be evaluated against test data.

In this chapter, we will use DecisionTime to obtain a time series model for data that record the daily volume of packages delivered by a parcel delivery service. More detail about DecisionTime and a brief introduction to time series modeling are contained in the Introduction to DecisionTime training course. Note that the SPSS Trends option, although not containing an automated time series model-fitting routine, can apply, under the direction of the analyst, a variety of time series models (exponential smoothing, ARIMA, ARIMA with predictor variables (interventions, fixed regression inputs), and spectral analysis). This option and a detailed introduction to time series analysis are covered in the Time Series Analysis and Forecasting with SPSS Trends training course.

It is also worth mentioning that a data mining method called sequence detection can also be applied to time-structured data. Here the focus is on analyzing patterns in data over time (for example, the sequence in which SPSS software products are purchased, the web pages that are examined prior to the online purchase of a computer, and the common types of transaction patterns that occur for mutual fund customers). Sequence detection is discussed in the following chapter.


Data Organization for Time Series Analysis


For standard time series analysis, each record of data represents a time period (which could be an hour, day, week, month, quarter, etc.) and the fields to be forecast (sales figures, products sold, number of visitors, web-site hits) form the columns. Thus rather than a record representing a transaction, each record represents the summary of many transactions within a time period. For this reason (aggregation), data sets used for time series analysis tend to be among the smallest used in data mining. However, this need not be true if there are many series to be forecast (for example, each of 2,000 products for a major manufacturer of plumbing supplies). Data are assumed to be sorted in time order, with each record representing a single, evenly spaced time period. Below we display the data series we will analyze in this chapter. The series records the volume of daily parcel deliveries by a private delivery service.

Figure 7.1 Daily Parcel Volume Arranged for Time Series Analysis

Each record represents a single day. In addition to Parcel, there are date fields that can be used in plots and summary tables within SPSS. It is worth noting that DecisionTime permits data to be modeled at different levels of aggregation (for example, months and quarters).
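As a sketch of the aggregation step, assume a transaction-level file with one row per parcel and a delivery date (the file and column names are hypothetical). The daily series could be built by counting transactions per calendar day and filling in any missing days so the periods are evenly spaced:

import pandas as pd

transactions = pd.read_csv("parcel_transactions.csv", parse_dates=["delivery_date"])  # hypothetical

# One record per time period: count parcels per calendar day
daily = (transactions
         .groupby(transactions["delivery_date"].dt.normalize())
         .size()
         .rename("Parcel"))

# Ensure every day appears exactly once and the series is in time order
full_index = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_index, fill_value=0).sort_index()
print(daily.head())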


Introduction to Exponential Smoothing


SPSS DecisionTime permits forecasting through an Expert Forecast Wizard that selects an optimal time-series model from the classes of exponential smoothing and ARIMA models. We will not review both types of time series models. Rather, in anticipation of DecisionTime selecting an exponential smoothing model for the parcel delivery data, we briefly introduce exponential smoothing and the types of patterns it captures.

Exponential smoothing is a time series technique that can be a relatively quick way of developing forecasts. It is a pure time series method, which means that it is suitable when data have been collected only for the series that you wish to forecast. Exponential smoothing can therefore be applied when there are not enough variables measured to achieve good causal time series models, or when the quality of the data is such that causal time series models give poor forecasts. In comparison, ARIMA models (multivariate ARIMA, which DecisionTime supports) can accommodate predictor variables and intervention effects.

Exponential smoothing takes the approach that recent observations should have relatively more weight in forecasting than distant observations. Smoothing implies predicting an observation by a weighted combination of previous values; exponential smoothing implies that the weights decrease exponentially as observations get older. Simple (as in simple exponential smoothing) implies that a slowly changing level is all that is being modeled. Exponential smoothing can be extended to model different combinations of trend and seasonality, and a number of different models are implemented in this fashion.

The exponential smoothing method involves examining the series to make some broad characterizations (is there trend, and if so what type? is there seasonality (a repeating pattern), and if so what type?) and then fitting a model. This fitted model is then extrapolated into the future to make forecasts. One of the main advantages of exponential smoothing is that models can be easily constructed. The type of exponential smoothing model developed will depend upon the seasonal and trend patterns inherent in the series that you wish to forecast. An analyst building a model might simply observe the patterns in a sequence chart to decide which type of exponential smoothing model is the most promising one to generate forecasts. In DecisionTime, when the Forecast Wizard examines the series, it considers all appropriate exponential smoothing models when searching for the most promising time series model.
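To make the weighting idea concrete, here is a minimal sketch of simple exponential smoothing written from the standard recurrence; it is an illustration, not the DecisionTime implementation. Each one-step-ahead forecast is alpha times the latest observation plus (1 - alpha) times the previous forecast, so the weight on an observation k periods old is alpha * (1 - alpha) ** k.

def simple_exponential_smoothing(series, alpha):
    """Return one-step-ahead forecasts: forecast[t+1] = alpha*series[t] + (1-alpha)*forecast[t]."""
    forecasts = [series[0]]                      # initialize with the first observation
    for y in series[1:]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

# Example: a slowly shifting level with no trend or seasonality (invented values)
demand = [100, 102, 101, 105, 107, 110, 108]
print(simple_exponential_smoothing(demand, alpha=0.3))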

Trend and Seasonality in Exponential Smoothing


Within exponential smoothing models there are commonly four types of trend and three types of seasonality (including none) that can be specified. These are:

No trend
If the sequence chart shows that there is no general increase or decrease in the series value, then the no trend option should be specified.

Linear trend
Linear trend is present when the series that you wish to predict either increases or decreases over time at a locally constant rate. This means that the increase or decrease is steady over some number of time periods, but the slope need not be the same throughout the entire series.

Exponential trend
A time series has an exponential trend when the series level tends to increase or decrease in value, but at an increasing rate the further the time series progresses. (Note: this is implemented in the SPSS Trends option but not in DecisionTime.)

Damped trend
If there are signs that the series increases or decreases, but at a decreasing rate as the time series proceeds, then the damped option should be specified.

No seasonality
As expected, no seasonality implies the series does not have a seasonal pattern.

Additive seasonality
Additive seasonality describes a series with a seasonal pattern that maintains the same magnitude when the series level increases or decreases.

Multiplicative seasonality
If the seasonal patterns become more (less) pronounced when the series values increase (decrease), then this type of seasonal pattern is multiplicative.

DecisionTime applies a series of decision rules to diagnostic summaries and evaluates the fit of different models to decide which of the exponential smoothing (or ARIMA) models is best for the data series.
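For readers who want to see the mechanics behind a linear-trend, multiplicative-seasonality model of the kind just described, the sketch below implements the textbook Winters recurrences. It is a simplified illustration with a deliberately crude initialization, not the DecisionTime algorithm; alpha, gamma, and delta are the level, trend, and seasonal smoothing weights discussed later in this chapter.

def winters_multiplicative(y, period, alpha, gamma, delta, horizon):
    """Sketch of Winters' multiplicative smoothing (linear trend, multiplicative seasonality)."""
    # Crude initialization from the first two seasons (adequate for an illustration)
    level = sum(y[:period]) / period
    trend = (sum(y[period:2 * period]) - sum(y[:period])) / period ** 2
    seasonal = [y[i] / level for i in range(period)]

    for t in range(period, len(y)):
        last_level = level
        level = alpha * (y[t] / seasonal[t - period]) + (1 - alpha) * (last_level + trend)
        trend = gamma * (level - last_level) + (1 - gamma) * trend
        seasonal.append(delta * (y[t] / level) + (1 - delta) * seasonal[t - period])

    # Forecast: extrapolate the trend and reapply the most recent seasonal factors
    forecasts = []
    for h in range(1, horizon + 1):
        factor = seasonal[len(y) - period + (h - 1) % period]
        forecasts.append((level + h * trend) * factor)
    return forecasts

# Example: an invented series with weekly seasonality (period 7) and a gentle upward trend
history = [v * (1 + 0.02 * i) for i, v in enumerate([80, 120, 150, 160, 140, 110, 60] * 4)]
print(winters_multiplicative(history, period=7, alpha=0.2, gamma=0.1, delta=0.1, horizon=14))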

A DecisionTime Forecasting Example: Daily Parcel Deliveries


To evaluate the trend and seasonality aspects of the parcel data, prior to applying the Forecast Wizard in DecisionTime, we view the sequence plot of parcel deliveries. From within DecisionTime:

Click File..Open Project and move to the c:\Train\DM_Model folder
Double-click on Parcel

Figure 7.2 DecisionTime Project Window with Sequence Plot for Parcel Deliveries

Note that the year 1997 is used to illustrate how dates appear in the graphs, but the data were not collected in 1997. The left (Project) pane lists the elements in the current DecisionTime project, which currently consists of a single variable. It can contain multiple variables (Series) and model results (Model). The right (Content) pane displays a sequence plot of the selected series. If several series were selected in the Project pane, they would appear stacked one above another in the Content pane if the Graph Panels tab were selected. All selected series would appear in a single plot if the Single Graph tab were selected, and in a table presenting the data if the Table tab were chosen.

Parcel deliveries seem to follow a fairly well defined weekly pattern in which certain days of the week consistently have higher demand. Volume reaches its peak around midweek and the low is on Sunday. Also, there is an upward trend over time. A time series model would try to capture both of these features: trend and seasonality. From week 4, the demand for parcel deliveries is clearly increasing in a linear fashion. So there looks to be trend for the series as a whole, and it is best described as linear. The amplitude of the seasonal pattern seems to increase towards the end of the series, which suggests the seasonality is multiplicative. Although we could try this model and evaluate the fit, we will have the DecisionTime Forecast Wizard perform the model selection.

Validation

To validate the model we will train the model on the first 119 days of data and reserve the last 7 days as validation data. Note that the test data are not a random sample from all time periods; but since the goal of the model is to predict future demand, the last week of data is the most representative sample for this purpose.

To apply the Forecast Wizard:

Click Forecast..Forecast Wizard (alternatively, click the corresponding toolbar button)
Click Next pushbutton
Drag and drop parcel into the Series to be Forecasted list box
Change the value in the How many periods would you like to forecast text box to 14

Figure 7.3 Forecast Wizard - Series to be Forecast Dialog

We have identified parcel as the single series to be forecast and indicated that we wish to produce forecasts for 14 periods (two weeks). As we will see shortly, 7 of these periods will correspond to the last week of actual data (validation data) and 7 will be forecasts into the future.

Click Next pushbutton

Figure 7.4 Forecast Wizard Predictors Dialog

Although not relevant to this analysis, it is worth noting that the Wizard allows you to specify other data series as predictors. For example, advertising spending may help predict sales, or regional housing starts may predict later demand for materials (plumbing fixtures). The Wizard would search for relations between these predictors and the series to be forecast at different time lags, and thus would investigate whether advertising spending three months back relates to sales this month (based on the cross-correlation function). Since we have no predictor variables, we proceed.

Click Next pushbutton

Figure 7.5 Forecast Wizard Interventions Dialog

In addition to predictor variables, DecisionTime allows inclusion of intervention effects. Unlike predictor variables, interventions are not assumed to be themselves time series, but rather represent single-point occurrences of events (holidays, natural disasters, political events) or structural shifts called steps (change in law, opening of a new plant, advent of deregulation) lasting for some period of time, which are believed to influence the series to be forecast. If either type of intervention occurred, the Interventions button leads to a dialog in which you could specify the point intervention dates and step date ranges.

Click Next pushbutton

The next dialog (Forecast Wizard Events, which is not shown) permits you to specify future dates at which you expect the interventions to repeat. In this way, the intervention effects would be taken into account during the forecasting.

Click Next pushbutton
Click Prior to end of the historical data option button
Enter 7 in the How many periods prior? box

Figure 7.6 Forecast Wizard Holdouts Dialog

By default, forecasting will begin at the end of the historical data. This is the best choice when producing forecasts. However, since we want the last 7 days forecast from the earlier part of the series, we must begin forecasting prior to the end of the historical data. Thus the last week of data will not be used when producing the forecasts for those days.

Click Next pushbutton

Figure 7.7 Forecast Wizard Finished Dialog

The last choice involves whether you wish to exclude data early in the series from the modeling. This option would be used if you had a very long time series or believed that patterns had changed over time and that the early part of the series wasn't relevant to future predictions. Examples of this might be sales data from utilities and telecommunication companies before and after deregulation. Although a telecommunications company would have many years of data from some customers, it might discard data collected prior to deregulation as irrelevant for forecasting revenue in a deregulated environment.

Click Finish pushbutton

Figure 7.8 Model Selection and Forecasts from Expert Modeler

There are two noticeable changes to the DecisionTime window. First, the Model column of the Project (left) pane and the graph label indicate that the Winters Multiplicative model has been chosen. This model, named after its originator, is an exponential smoothing model that includes trend and multiplicative seasonality. Second, the fourteen forecast values for parcels appear in the graph with a reference line marking the beginning of the forecast period. Visually, the forecasts track the actual series closely during the one-week holdout (validation) period.

Click the Show Forecast Limits button

Figure 7.9 Forecasts with 95% Confidence Limits

The 95% confidence limits for the forecasts are displayed. These limits can also be shown for the historical series data.

Click the Table tab
Scroll to the 5/6/97 date

Figure 7.10 Table Containing Forecasts

Four rows appear: historical data, forecasts, upper limits and lower limits. These correspond to the values displayed in the graph. You control what information displays in the graphs and table by your choices on the Show buttons located to the right of the Table sheet (alternatively, click View..Forecast Output). Now we will take a closer look at the forecasts.

Click the Graph Panels tab
Click the Show Historical Data button (so it is no longer depressed)

Figure 7.11 Graph of Forecasts and 95% Confidence Limits

Since the Show Historical Data button is not depressed, only the forecasts and their confidence limits appear for the two-week forecast period at the end of the series (the last week of the series and a one-week forecast beyond the data). We see the midweek peaks appearing in the two-week forecast.

Model Evaluation
There are many ways of evaluating a time series model. In this example we limit ourselves to a visual inspection of how the model predictions track the original series (shown earlier in Figure 7.8 for the 7-day holdout period) and some commonly used goodness-of-fit measures.

Click View..Model Viewer (Alternatively, click the Model Viewer button)

Figure 7.12 Model Viewer Window: Goodness of Fit Statistics

We will focus on two of the fit measures. The mean absolute error (also referred to as mean absolute deviation or MAD) takes the average of the absolute values of the errors and is often used as the primary measure of fit. In our data, the average magnitude of error on either side of zero is 1,474. The mean absolute error must be interpreted with reference to the original units of measurement. For those familiar with their data this is adequate, since they would know whether a mean absolute error of 1,474 is reason for celebration or sorrow. The mean absolute percentage error (MAPE), however, is sometimes preferred because it is a percentage and thus a relative measure. Again, positive and negative errors do not cancel. MAPE is obtained by taking the absolute error for each time period, dividing it by the actual series value, averaging these ratios across all time points, and multiplying by 100. The MAPE is often used because it measures the percentage prediction error; in this case it is 7.26%. The maximum absolute error and maximum absolute percentage error are also reported, as are some other, more technical, measures.

Whether this model is acceptable largely depends upon what degree of accuracy is required in the context of the business and how alternative models fare. Since the Expert Modeler in the Forecast Wizard selected this model, we know that other forms of exponential smoothing were considered, along with ARIMA models. However, if there were a need for greater accuracy in predicting parcel deliveries, perhaps other predictors (for example, measures of business activity or the local economy) could be examined for inclusion in the model.
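As an illustration of how these two measures are computed, the sketch below uses invented holdout-week values (they are not the parcel results) and assumes the actual values and model predictions are available as plain lists.

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_percentage_error(actual, predicted):
    # Average of |error| / |actual|, expressed as a percentage
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Invented values for illustration only
actual    = [21000, 24500, 26000, 25500, 23000, 15000, 9000]
predicted = [20200, 25600, 24800, 26900, 22100, 14100, 9600]

print(round(mean_absolute_error(actual, predicted), 1))
print(round(mean_absolute_percentage_error(actual, predicted), 2))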

Scroll down to the Model Parameters section in the Summary tab sheet

Figure 7.13 Model Parameters

Estimates for the parameters in the Winters Multiplicative (Exponential Smoothing) model appear. For those interested in their interpretation, we include a brief description below. Most books on time series cover exponential smoothing models and the Gardner (1985) reference is quite complete.

Exponential Smoothing Parameters


The four exponential smoothing parameters are alpha, gamma, delta and phi. Three of them are used in the Winters Multiplicative model: alpha (local shift), gamma (trend shift) and delta (seasonal shift). The alpha smoothing parameter taps the extent to which the recent observations in the series are predictive of the current value. It can also be viewed as the factor used to weight the recent prediction error in order to bring the prediction in line with the actual series. Alpha is designed to adjust for local shifts in the mean level of the series.


If the overall series mean is representative of the series values at the end of the series, then the alpha value should be 0. This places equal importance on all values of the series. If, however, the series level shifts, then the later observations should have more weight in predicting future series values. As the alpha parameter value moves closer to 1, more and more weight is given to the most recent observations. The alpha parameter is used for all exponential smoothing models.

The gamma weight applies to the trend component of the series. Again, this parameter gives more weight to recent values the nearer the specified value of gamma is to 1. If the trend component is relatively stable throughout the whole series, then 0 is probably the most appropriate value, since each observation is then given equal weight. It is important to note that there may well be trend in the series even if the gamma coefficient estimate is 0; a gamma of 0 simply implies that the trend is constant over time. Gamma is used only for exponential smoothing models where a trend component has been specified. If no trend component is needed (as evaluated by the Forecast Wizard), then this parameter will not be included in the model.

The delta parameter controls the relative weight given to recent observations in estimating the seasonality present in the series. It ranges from 0 to 1, with values near 1 giving higher weight to recent values. Delta is used for all exponential smoothing models with a seasonal component. Be aware that a delta parameter of 0 does not imply there is no seasonality, but rather that the seasonality is constant over time.

The final parameter is the phi parameter (not used here), which deals with trend modification when the damped trend option is specified. It controls the rate at which a trend is damped, or reduced in magnitude, over time.

Here the Forecast Wizard chose the model. A user experienced in time series analysis can use the Advanced Forecast Wizard (click Forecast..Advanced Forecast Wizard) to fit a specific exponential smoothing or ARIMA (with intervention effects and predictors) model to the series.

Click the Residuals tab in the Model Viewer window
Position the cursor over the largest positive residual point

Figure 7.14 Residual Plot

Individual residuals (model prediction errors) can be examined. Any extremely large residuals or patterns in the residuals should be checked for data errors or possible model misspecification. When the cursor is placed over a point in this (or any other DecisionTime graph), information about the point appears. Time series practitioners will appreciate that autocorrelation and partial autocorrelation plots of the residuals are also available in the Model Viewer window. If multiple models had been run, the view can be switched across models using the model drop-down list or the direction arrows beside it.

The model predictions seem to track the day-to-day variation fairly well (based on the plot in Figure 7.8 and the Model Summary window). However, viewing the residual plot, there seems to be a tendency toward larger residuals later in the series. This could be because the daily variation has increased during the last four weeks without being fully captured by the model. This tendency is a bit troublesome and should be monitored as additional data are collected.

Once you are satisfied with a model, you can have the exponential smoothing model create additional forecasts from these data by reapplying the model and forecasting from the end of the series for the forecast period desired. Also, within DecisionTime the forecast period would be extended if additional data were added to the original series (in the input file).

Click File..Exit DecisionTime from the DecisionTime project (main) window
Click No when asked to save changes to the DecisionTime project

Appropriate Research Projects


Time series analysis is appropriate whenever numeric data are collected at regular intervals of time and you wish to make predictions of future values.

Other Features
Data files for time series analysis are much smaller than we are accustomed to using. Normally, you should have at least 40 to 50 observations to make monthly time series analysis practical, but you would rarely have more than a few hundred observations.

Typically only a few predictors are used to model a series, and generally time series analysis is done with a predetermined set of variables. Stepwise methods that select among many predictors are not available for standard time series methods, although the algorithms in DecisionTime will examine multiple predictors, retaining those judged important to the model, and will suggest the best model within several classes of time series models.

Model Understanding
Exponential smoothing models are easy to deploy; only a few estimated values need be stored and the calculations are straightforward. Although ARIMA (a more advanced class of time series model, not discussed here) is a bit complicated theoretically, the chief output of a model is still very clear-cut: a predicted value. That predicted value can be compared to the actual value to get a very good idea of how well the model fits the data. It is therefore not hard at all to present the results of a time series analysis to nonstatisticians, or to see graphically whether the model seems to fit the data.

Model Deployment
For some time series techniques, it is possible to use a simple equation to make future predictions. For others, it may be more convenient to add new data to the original file and calculate the predicted values by reapplying the model. In either case, calculating predicted values will not be a difficult task.


Chapter 8 Sequence Detection

Topics:
INTRODUCTION TO SEQUENCE DETECTION
TECHNICAL CONSIDERATIONS
DATA ORGANIZATION FOR SEQUENCE DETECTION
SEQUENCE DETECTION ALGORITHMS IN CLEMENTINE
A SEQUENCE DETECTION EXAMPLE: REPAIR DIAGNOSTICS


INTRODUCTION TO SEQUENCE DETECTION


What are the most common sequences of web-clicks when visiting the web site of an online retailer? Does a pattern of retail purchases predict the future purchase of an item? If a manufactured part, insurance claim, or sales order must go through a number of steps to completion, what are the most common sequences and do any of these involve repeating steps? These questions entail looking for patterns in data that are time ordered.

In Chapter 7 we discussed time series analysis, in which a numeric outcome, recorded at regular time intervals, is predicted from the history of the series. Here we are interested in examining time-structured data in order to find commonly occurring sequences and to predict an event likely to occur given the sequence observed to date. Unlike general time series analysis, sequence detection does not assume the data are recorded or aggregated at regular time intervals, only that the data are time-ordered. In Chapter 3 we discussed general association rules. Sequence detection is related to these analyses, but is distinguished from them in that the time sequence is formally taken into account. Thus we are studying associations that occur over time.

The Sequence node in Clementine performs sequence detection analysis. In addition, a Clementine algorithm add-on, CaprI, performs sequence detection using a different algorithm and provides greater flexibility in specifying the types of sequences you are interested in investigating. In this chapter we will use the Sequence node to explore common sequences of diagnostic/repair steps taken when attempting to solve telecom service problems.

Results from a Sequence node analysis are presented in a table with the column headings Consequent, Antecedent 1, Antecedent 2, ..., Antecedent N. For example:

Consequent    Antecedent 1                  Antecedent 2
Clementine    Base and Regression Models    Advanced Models

This tells us that individuals who buy SPSS Base and Regression Models, and later buy Advanced Models, are likely to later buy Clementine. The "and" in Antecedent 1 indicates that the two items are members of an item set. Thus, Base and Regression Models indicates that both items were purchased at the same time, while Advanced Models was purchased at a later time.

When the Sequence node produces a set of sequences, it provides evaluation measures similar to those we reviewed when we discussed association rules. The measures are called support and confidence. Support refers to the number or percentage of cases (where a case is linked to a unique ID number) to which the rule applies, that is, the number of cases for which the antecedents and consequent appear in the proper order. Note that this differs from coverage, defined when we discussed association rules in Chapter 3, which is the percentage of cases in which the antecedents (conditions) hold. Confidence refers to the proportion of the cases to which the antecedents apply in which the consequent also follows. Stated another way, confidence is the proportion of cases with the antecedents that also include that specific consequent. Confidence is used the same way for association rules.

These measures are presented in the same table with three additional columns: Instances, Support, and Confidence. For example:

Instances  Support  Confidence  Consequent    Antecedent 1                  Antecedent 2
100        .15      .60         Clementine    Base and Regression Models    Advanced Models

This means that 15% (100 individuals) of the customers purchased SPSS Base and Regression Models at the same time, then purchased the Advanced Models, and later purchased Clementine. Of the customers who purchased Base and Regression Models, then Advanced Models, 60% later purchased Clementine.
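A sketch of how these two measures could be computed by brute force for a single rule, assuming each customer's history is held as a time-ordered list of item sets (the miniature data below are invented):

def contains_sequence(history, pattern):
    """True if the item sets in pattern occur in order (not necessarily contiguously)."""
    position = 0
    for basket in history:
        if pattern[position] <= basket:      # every item of this step is present in the basket
            position += 1
            if position == len(pattern):
                return True
    return False

# Invented customer histories: each inner list is one customer's ordered purchases
histories = [
    [{"Base", "Regression"}, {"Advanced Models"}, {"Clementine"}],
    [{"Base", "Regression"}, {"Advanced Models"}],
    [{"Base"}, {"Clementine"}],
]

antecedents = [{"Base", "Regression"}, {"Advanced Models"}]
full_rule = antecedents + [{"Clementine"}]

instances = sum(contains_sequence(h, full_rule) for h in histories)
with_antecedents = sum(contains_sequence(h, antecedents) for h in histories)

support = instances / len(histories)        # share of all IDs containing the whole sequence
confidence = instances / with_antecedents   # of IDs with the antecedents, share where the consequent follows
print(instances, round(support, 2), round(confidence, 2))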

TECHNICAL CONSIDERATIONS
Number of items?
If the fields to be analyzed represent items sold or web pages clicked, the number of distinct items influences the resources required for analysis. For example, the number of SPSS products a customer can currently purchase is about 15, which is a relatively small number of items to analyze. Now let's consider a major retail chain store, auto parts supplier, mail catalogue or web vendor. Each might have anywhere from hundreds or thousands to tens of thousands of individual product or web page codes. Generally, when such large numbers of products are involved, they are binned (grouped) together into higher-level product categories, because as items are added, the number of possible combinations increases exponentially.

Just how much categorization is necessary depends upon the original number of items, the detail level of the business question asked, and the level of grouping at which meaningful categories can be created. When a large number of items are present, careful consideration must be given to this issue, and it may be time consuming. Time spent on this issue will increase your chance of finding useful sequences and will reduce largely redundant rules (for example, rules describing the purchase of a shirt followed by the purchase of many different types and colors of ties).

Sequence Length
Searching for longer sequences requires greater resources. So one consideration concerns whether you are interested in any sequences that are found, whatever their length. An expert option in the Sequence node permits you to set an upper limit on the length of sequences.

When Does a New Sequence Begin?


For some types of sequential data, for example web-log data, decisions must be made as to when a sequence ends and a new one begins. When analyzing basic web logs, a simple rule of thumb is that if a half hour has passed since a machine with a given IP address last accessed a page, then treat the next access as the start of a new session. Or, for a catalog retailer, if a customer made four purchases in the first year and then makes another purchase five years later, should this last purchase be considered part of the earlier sequence? The answers to such questions depend on the goals of the project and the nature of the business studied (the time lag between automobile purchases is much greater than the time lag between food purchases). Still, it is important to note that decisions must be made concerning whether there are any time limits that apply to the sequences studied.
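The half-hour rule of thumb is straightforward to apply when preparing web-log data. A sketch, assuming a log file with ip_address and timestamp fields (hypothetical names):

from datetime import timedelta
import pandas as pd

log = pd.read_csv("weblog.csv", parse_dates=["timestamp"])   # hypothetical file
log = log.sort_values(["ip_address", "timestamp"])

# Time since the previous hit from the same IP address
gap = log.groupby("ip_address")["timestamp"].diff()

# A new session starts at the first hit, or after more than 30 minutes of inactivity
new_session = gap.isna() | (gap > timedelta(minutes=30))

# Cumulative count of session starts gives a session number within each IP address
log["session_id"] = new_session.groupby(log["ip_address"]).cumsum()
print(log.head())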

No Dependent Variable (Necessarily)


An advantage of sequence detection (and association rules) is that it investigates all time-ordered associations among the analysis variables, not only those related to a specific outcome category. In this sense it is a more broadly based method than data-mining techniques that try to predict a specific outcome field. However, after the sequences have been generated in Clementine by the Sequence node, you can produce predictions for the item(s) most likely to appear later, given the sequence observed to that point.

DATA ORGANIZATION FOR SEQUENCE DETECTION


The Sequence node (and CaprI) can analyze data in either of two formats. In the Tabular data format, each item is represented by a binary (flag) field coded to record the presence or absence of the item. In transactional data format, item values are stored in one or more content fields, usually defined as type set.

Consider the software transaction example used earlier, in which a customer first purchased SPSS Base and Regression Models, then later purchased Advanced Models, and then purchased Clementine. It could appear in tabular data format as follows:

Customer  Date          Base  Regression  Adv Models  Clementine  DecisionTime
101       Feb 2, 2001   T     T           F           F           F
101       May 1, 2001   F     F           T           F           F
101       Dec 31, 2002  F     F           F           T           F

The same sequence in transactional data format would be:

Customer  Date          Purchase
101       Feb 2, 2001   Base
101       Feb 2, 2001   Regression
101       May 1, 2001   Adv Models
101       Dec 31, 2002  Clementine

In tabular format, it is clear that SPSS Base and Regression Models were purchased together (they are treated as an item set). In transactional format, the same items would be treated as an item set if the date field were specified as a Time field within the Sequence node. In addition, under the Expert tab, you can specify a timestamp tolerance value that is applied to the Time field to determine, for an ID, which records should be grouped into item sets.
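Converting between the two layouts is essentially a pivot. A sketch, using the same example purchases and assuming they sit in a pandas data frame with Customer, Date, and Purchase columns:

import pandas as pd

transactions = pd.DataFrame({
    "Customer": [101, 101, 101, 101],
    "Date": ["Feb 2, 2001", "Feb 2, 2001", "May 1, 2001", "Dec 31, 2002"],
    "Purchase": ["Base", "Regression", "Adv Models", "Clementine"],
})

# Transactional -> tabular: one row per customer/date, one T/F flag per item.
# Items bought on the same date end up on the same row, forming an item set.
tabular = (pd.crosstab([transactions["Customer"], transactions["Date"]],
                       transactions["Purchase"]) > 0)
tabular = tabular.replace({True: "T", False: "F"}).reset_index()
print(tabular)

In practice the Date field would be parsed as a true date so that the item-set rows sort chronologically rather than alphabetically.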

SEQUENCE DETECTION ALGORITHMS IN CLEMENTINE


Both the Sequence node and CaprI perform sequence detection analysis. The Sequence node is one of the algorithms included with Clementine, while CaprI is an algorithm add-on. The trick to sequence analysis is to find a quick, memory-efficient, minimal data-pass way of determining the sequences. The Sequence node and CaprI use different algorithms to accomplish this. Both permit you to specify various criteria (for example, maximum sequence size, minimum support, minimum confidence) that control the sequence search.

There are some considerations that might help you choose between them for a specific application. The Sequence node permits you to create generated nodes that can be used to identify sequences and produce predictions in other data files. CaprI has a more extensive set of controls that determine the types of sequences you want to focus on. For example, pruning options allow you to ignore full (contiguous) sequences or partial (items need not be contiguous) sequences that are contained in other sequences reported, which simplifies the results. Also, you can search for only sequences that start or end with certain items, which might be useful if you are primarily interested in sequences that lead to a certain web page or result. In short, both algorithms have advantages, which is why both are made available in Clementine (CaprI as an add-on algorithm).

A SEQUENCE DETECTION EXAMPLE: REPAIR DIAGNOSTICS


In this example, we will look for common sequences of diagnostic/repair steps needed to resolve problems with telecom service. When a customer reports a problem with telecom service, different tests and operations are performed to resolve the problem. In some cases only a single test is needed, but sometimes many steps are required. There is interest in identifying common sequences needed to resolve service problems, discovering any repeating patterns (repetition of steps), and identifying sequences that were related to failure to resolve the problem. Data codes are masked and values are simulated based on a customer analysis.

The data are stored in the tab-separated file TelRepair.txt. The file contains three fields: an ID number corresponding to the problem report, a sequence code for each step taken during problem resolution, and a code representing the diagnostic/repair activity in the step. Code 90 represents the reporting of the original problem (all reports should begin with code 90, but not all do). Codes 210 and 299 are termination codes: code 210 indicates the problem was resolved, while code 299 indicates it was not successfully resolved. Codes between 100 and 195 represent different diagnostic/repair activities. The file contains information on 750 service problems.

From within Clementine:

Click File..Open Stream and navigate to the c:\Train\DM_Model directory
Double-click on TelRepair.str

Figure 8.1 TelRepair Stream

Right-click the Table node above the Type node, then click Execute

Figure 8.2 Telecom Repair Sequence Data

Each service problem is identified by a unique ID value. The field Index1 records the sequence in which the diagnostic/repair steps were performed and the Stage field contains the actual diagnostic/repair codes. All repair sequences should begin with code 90, and a successful repair has 210 as the final code (299 is used if the problem was not successfully resolved). The data file was presorted by Index1 within ID. The Sequence node has an option to sort the data prior to analysis (or the Sort node, located in the Record Ops palette, could be used).

Close the Table window
Double-click the Type node

Figure 8.3 Type Node for Sequence Detection

Even though numeric codes are used for the diagnostic/repair values in Stage, it is declared as type set. Sequence analysis could be done if the field was defined as numeric type, but its values would still be treated as type set (categorical). That is, 90 and 95 would be treated as two categories, not as similar numeric values. In this analysis, the content to be analyzed is contained in the single field: Stage. The field(s) containing the content can be of any type and any direction (if numeric they must be of type range). If there are multiple content fields, they all must be of the same type.

The field (here ID) that identifies the unit of analysis can also be of any type and can have any direction; here it is set to None. A time field, if used, must be numeric and can have any direction.

Close the Type node
Double-click the Sequence node

Figure 8.4 Sequence Node Dialog

The ID field defines the basic unit of analysis for the Sequence node. In our example, the unit of analysis is the service problem, and each problem has a unique value for the field named ID. A time field is not required; if no time field is specified, the data are assumed to be time ordered for a given ID. We specify Index1 as the time field for this analysis. Under the Expert tab you have additional controls based on the time field (for example, an event occurring more than a user-specified interval since the ID's previous event can be considered to begin a new sequence). The content fields contain the variables that constitute the sequences. In our example, the content is stored in a single field, but multiple fields can be analyzed.

If the data records are already sorted so that all records for an ID are contiguous, you can check the IDs are contiguous check box, in which case the Sequence node will not resort the data, saving on resources (and your time). The Model and Expert tabs provide greater control over various aspects of the analysis. We illustrate these options by examining the Model settings, which include minimum values for support and confidence.

Click the Model tab

Figure 8.5 Model Tab Settings

The minimum values for support and confidence are set to 20%, but as we learned from working with association rules, these values often need to be changed.

Click Execute to create the model

Exploring the Sequences


When the sequence detection analysis is complete, a generated Sequence ruleset node will appear in the Models tab of the Manager window.

Right-click on the generated Sequence ruleset node in the Models tab of the Manager window
Click Browse on the Context menu

When the generated Sequence rules node first appears, it lists the consequent and antecedent(s), but does not display support or confidence. To see these values:

Click the Show/Hide criteria button on the toolbar
You may need to maximize the Sequence rules window to see all columns

The Sequence node found 86 rules, which are presented in descending order by rule confidence. The second rule is a sequence that begins with code 90 (all actual repair/diagnostic sequences should start with 90), followed by code 125, then code 195, and ends in code 210 (successful resolution). This sequence was found for 163 IDs, which constitute 21.7% of the IDs (there are 750 IDs in the data). This is the support value. Thus over one fifth of all service problems in the data showed this pattern. Of the cases containing this sequence of antecedents (90, then 125, then 195), code 210 followed 98.2% of the time. This is the confidence value.

Figure 8.6 Sequence Rules

Notice that codes 90 and 210 appear frequently in the rules. This is because almost all service problem sequences begin with 90 and end with 210. Someone with domain knowledge of this area could now examine the sequences to determine if there is anything interesting or unexpected: for example, a sequence that should not occur given the nature of the diagnostic tests/repairs, or a repeating sequence.

The sequence rule sets are ordered by confidence value (descending order). To view the most common sequences, we simply sort by support value.

Click the Sort by: dropdown list and select Support

The sequence rules can be sorted in a number of ways. For those interested in sequences beginning or ending with a particular event (for example, clicking on a specific web page), the sorts by First Antecedent, Last Antecedent, or Consequent would be of interest.

Figure 8.7 Sequence Rule Sets Sorted by Support

Code 110 appears in two of the three most frequent rules. The sequence 90 followed by 210 occurs in about 92% of the service problems, which we would expect, since this should hold for a very high proportion of the sequences. Code 299, which indicates the problem was not resolved, has not appeared. This is because it is relatively infrequent (fortunately so, for the business and customers). If we were interested in sequences containing 299, we would have to lower the minimum support and confidence criteria to below 5%, which is the base rate for code 299.

A domain expert would be interested in the most frequent sequences, which describe the typical path a service problem follows. If some stages were more expensive or time consuming, they would attract particular attention. We will view the results in one other order.

Click the Sort by: dropdown list and select Number of items

The sequences are now sorted by the number of distinct items that appear in a sequence, in descending order. Thus the first sequence, with 212 instances, contains four items (90, 110, 125, and 210).

Figure 8.8 Sequence Rule Sets Sorted by Number of Items

To see an odd sequence:

Click the Sort by: button to change the sort order from descending to ascending
Scroll down a number of rows until you see the beginning of sequences with two antecedents (see Figure 8.9)

One of the sequences with only two items has an antecedent of 125 and an identical consequent of 125. This pattern occurs in 22% of the sequences. The sequence would be of interest because, ideally, a diagnostic/repair stage should not be repeated. Someone familiar with the diagnostic/repair process would look into why this stage is repeating so often (erroneous test results at that stage, records not being forwarded properly, etc.) and modify the process to reduce it. Other repeating sequences may be present, but do not meet the minimum support and confidence criteria values.

Figure 8.9 Sequence with Identical Antecedent and Consequent

Next we view the model predictions.

Right-click the Table node attached to the Sequence generated model node
Click Execute

Figure 8.10 Top Three Sequence Predictions

By default, the Sequence generated model node contains three prediction fields (prefixed with $S-), containing the three most confident predictions of codes that will appear later in the sequence, predicted from the sequence observed to that point. The confidence values for each prediction are stored in the fields prefixed with $SC-.

The sequence value in the first record is stage 90 (for ID=1, Index1=1), which is the problem report. The most likely stage to occur later, given that stage 90 has occurred, is stage 210 with confidence .949. (Note: this rule can be seen in Figure 8.7.) Since most sequences end with stage 210, the second and third most confident predictions are, in some sense, more interesting for this analysis. Thus, the next most likely stage to occur later, given that stage 90 has occurred, is stage 110 with confidence .677, and the third most likely is stage 125. In this way, the three most confident future predictions, based on the observed sequence, are generated.

Examining the predictions for ID 1, notice that the most likely item to occur later can change as the observed sequence changes. This makes sense, since as more information becomes available about a sequence, additional rules can apply.
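If the scored table is exported, the usefulness of these predictions can be checked directly. The sketch below is illustrative only and assumes hypothetical column names (Stage for the observed code and P1 to P3 for the three $S- prediction fields); it measures how often at least one of the top three predictions made at a step actually occurs later in the same problem's sequence.

import pandas as pd

scored = pd.read_csv("telrepair_scored.csv")   # hypothetical export: ID, Index1, Stage, P1, P2, P3

hits, total = 0, 0
for _, problem in scored.sort_values(["ID", "Index1"]).groupby("ID"):
    stages = list(problem["Stage"])
    for step, row in enumerate(problem.itertuples(index=False)):
        later = set(stages[step + 1:])
        if not later:
            continue                           # nothing follows the final step
        total += 1
        if {row.P1, row.P2, row.P3} & later:
            hits += 1                          # a top-3 prediction did occur later

print("Top-3 hit rate:", round(hits / total, 3))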

Appropriate Research Projects


Sequence detection is appropriate whenever you have a time-ordered set of discrete items that you are interested in exploring for commonly occurring or high-confidence sequences. This occurs most often in the retail industry, but there is no reason it need be limited to that area. Insurance claims, bank transactions, medical procedures, and internal corporate processes are other potential applications, and sequences that lead to a failure, an error, or fraud are particularly worth investigating.



Other Features
Sequence detection is often done on very large data sets, with tens of thousands of records. But as file sizes grow, and especially as the number of distinct items grows, the number of potential sequences grows quickly, as does the necessary computation time. So in practice the number of cases (customers, web visits, problems reported) can be large, but the number of distinct items tends to be much smaller, as is the number of items to be associated together in a sequence. It can be difficult, therefore, to determine the right number and type of items a priori (for example, for a large retailer), so typically, as with all the automated methods, several different solutions may be tried with different item sets.
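As a rough back-of-the-envelope illustration of this growth (simple arithmetic, not output from the software): with n distinct items there are n to the power k possible ordered sequences of length k before any support-based pruning, so the candidate space expands very quickly as items are added.

# Rough illustration of how the candidate space grows with the number of
# distinct items (Apriori-style support pruning keeps the search tractable).
for n in (10, 50, 200):        # number of distinct items
    for k in (2, 3, 4):        # sequence length
        print(f"{n} items, length {k}: {n**k:,} possible ordered sequences")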

Model Understanding
One of the great strengths of sequence detection, as we found with association rules in general, is that its results are easily understood by anyone. The rules it finds can be expressed in natural language and don't involve any statistical testing.

Model Deployment
Model deployment is complicated by the fact that the items in a sequence need not be contiguous, so the logic needed to identify key sequences and generate predictions is more involved. Generally speaking, transaction databases do not easily identify sequences, especially if the items are not contiguous (uninterrupted). In Clementine, the generated Sequence node can be used to produce predictions, and this node, in turn, can generate a Clementine supernode that will create additional fields that support prediction. These can be deployed within the Clementine Solution Publisher.

In addition, the results of sequence detection analysis may be directly actionable (e.g., make a web page that is found to lead to later registration more prominent to the visitor; investigate why going through a certain repair stage often leads to a return to that stage). These results are useful, but they are not deployed as prediction models.
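To make the contiguity issue concrete, here is a minimal sketch, outside Clementine, of the kind of logic such a deployment needs. The field names (ID, Index, code) and the data values are assumptions for illustration: transactional rows are first rolled up into one ordered sequence per ID, and a rule's antecedent is then matched in order while allowing other events in between.

# Sketch: roll up a transaction table into per-ID sequences, then test whether
# a rule antecedent occurs in order (not necessarily contiguously).
from collections import defaultdict

def build_sequences(rows):
    """rows: iterable of (id, index, code) tuples."""
    history = defaultdict(list)
    for cid, idx, code in rows:
        history[cid].append((idx, code))
    # sort each ID's events by index and keep only the codes
    return {cid: [code for _, code in sorted(events)]
            for cid, events in history.items()}

def antecedent_matches(antecedent, sequence):
    it = iter(sequence)
    return all(code in it for code in antecedent)

rows = [(1, 1, "90"), (1, 2, "110"), (1, 3, "125"), (1, 4, "210")]
sequences = build_sequences(rows)
print(antecedent_matches(["90", "125"], sequences[1]))   # True, even though 110 intervenes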


