Professional Documents
Culture Documents
the SAS software. For a thorough discussion of data mining concepts, methods, and applications, see Two Crows Corporation4 and Berry and Linoff.5,6
applications reach across industries and business functions. For example, telecommunications, stock exchange, credit card, and insurance companies use data mining to detect fraudulent use of their services; the medical industry uses data mining to predict the effectiveness of surgical procedures, diagnostic medical tests, and medications; and retailers use data mining to assess the effectiveness of discount coupons and sales promotions. Data mining has many varied fields of application, some of which are listed below: Retail/marketing. An example of pattern discovery in retail sales is to identify seemingly unrelated products that are often purchased together. Market basket analysis is an algorithm that examines a long list of transactions in order to determine which items are most frequently purchased together. The results can be useful to any company that sells products, whether in a store, by catalog, or directly to the customer. Banking. A credit card company can leverage its customer transaction database to identify customers most likely to be interested in a new credit product. Using a small test mailing, the characteristics of customers with an affinity for the product can be identified. Data mining tools can also be used to detect patterns of fraudulent credit card use, including detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors. It identifies loyal customers, predicts customers likely to change their credit card affiliation, determines credit card spending by customer groups, uncovers hidden correlations among various financial indicators, and identifies stock trading trends from historical market data. Healthcare insurance. Through claims analysis (i.e., identifying medical procedures that are claimed together), data mining can predict which customers will buy new policies, defines behavior patterns of risky customers, and identifies fraudulent behavior. Transportation. State and federal departments of transportation can develop performance and network optimization models to predict the life-cycle costs of road pavement. Product manufacturing companies. Manufacturers can apply data mining to improve their sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, a manufacturer can select promotional strategies that best reach their target customer segments. Data mining can determine distribution schedules among outlets and analyze loading patterns. Healthcare and pharmaceutical industries. A pharmaceutical company can analyze its recent sales records to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The ongoing, dynamic analysis of the data warehouse
allows the best practices from throughout the organization to be applied in specific sales situations. Internal Revenue Service (IRS) and Federal Bureau of Investigation (FBI). As examples, the IRS uses data mining to track federal income tax frauds, and the FBI uses data mining to detect any unusual patterns or trends in thousands of field reports to look for any leads in terrorist activities.
Preprocessing. This is the data cleansing stage, where certain information that is deemed unnecessary and likely to slow down queries is removed. Also, the data are checked to ensure use of a consistent format in dates, zip codes, currency, units of measurements, etc. Inconsistent formats in the database are always a possibility because the data are drawn from several sources. Data entry errors and extreme outliers should be removed from the dataset because influential outliers can affect the modeling results and subsequently limit the usability of the predicted models. Data integration. Combining variables from many different data sources is an essential step because some of the most important variables are stored in different data marts (customer demographics, purchase data, business transaction). The uniformity in variable coding and the scale of measurements should be verified before combining different variables and observations from different data marts. Variable transformation. Sometimes expressing continuous variables in standardized units (or in log or square-root scale) is necessary to improve the model fit that leads to improved precision in the fitted models. Missing value imputation is necessary if some important variables have large proportions of missing values in the dataset. Identifying the response (target) and the predictor (input) variables and defining their scale of measurement are important steps in data preparation because the type of modeling is determined by the characteristics of the response and the predictor variables. Splitting databases. Sampling is recommended in extremely large databases because it significantly reduces the model training time. Randomly splitting the data into training, validation, and testing categories is very important in calibrating the model fit and validating the model results. Trends and patterns observed in the training dataset can be expected to generalize the complete database if the training sample used sufficiently represents the database.
Simple descriptive statistics and exploratory graphics displaying the distribution pattern and the presence of outliers are useful in exploring continuous variables. Descriptive statistical measures such as the mean, median, range, and standard deviation of continuous variables provide information regarding their distributional properties and the presence of outliers. Frequency histograms display the distributional properties of the continuous variable. Box plots provide an excellent visual summary of many important aspects of a distribution. The box plot is based on a five-number summary plot, which is based on the median, quartiles, and extreme values. One-way and multi-way frequency tables of categorical data are useful in summarizing group distributions and relationships between groups, as well as checking for rare events. Bar charts show frequency information for categorical variables and display differences among the various groups in the categorical variable. Pie charts compare the levels or classes of a categorical variable to each other and to the whole. They use the size of pie slices to graphically represent the value of a statistic for a data range.
tinuous and binary variables as targets. In regression we want to approximate the regression function, while in classification problems we want to approximate the probability of class membership as a function of the input variables. Predictive modeling is a fundamental data mining task. It is an approach that reads training data composed of multiple input variables and a target variable. It then builds a model that attempts to predict the target on the basis of the inputs. After this model is developed, it can be applied to new data similar to the training data but not containing the target. Multiple linear regression (MLR). In MLR, the association between the two sets of variables is described by a linear equation that predicts the continuous response variable from a function of predictor variables. Logistic regressions. This type of regression uses a binary or an ordinal variable as the response variable and allows construction of more complex models than the straight linear models do. Neural net (NN) modeling. Neural net modeling can be used for both prediction and classification. NN models enable construction of trains and validate multiplayer feed-forward network models for modeling large data and complex interactions with many predictor variables. NN models usually contain more parameters than a typical statistical model, the results are not easily interpreted, and no explicit rationale is given for the prediction. All variables are considered to be numeric and all nominal variables are coded as binary. Relatively more training time is needed to fit the NN models. Classification and regression tree (CART). These models are useful in generating binary decision trees by splitting the subsets of the dataset using all predictor variables to create two child nodes repeatedly beginning with the entire dataset. The goal is to produce subsets of the data that are as homogeneous as possible with respect to the target variable. Continuous, binary, and categorical variables can be used as response variables in CART. Discriminant function analysis. This is a classification method used to determine which predictor variables discriminate between two or more naturally occurring groups. Only categorical variables are allowed to be the response variable and both continuous and ordinal variables can be used as predictors. Chi-square automatic interaction detector (CHAID) decision tree. This is a classification method used to study the relationships between a categorical response measure and a large series of possible predictor variables that may interact with each other. For qualitative predictor variables, a series of chisquare analyses are conducted between the response and predictor variables to see if splitting the sample based on these predictors leads to a statistically significant discrimination in the response.
Experience in the SAS output delivery system (ODS) is not required because options for producing SAS output and graphics in RTF, WEB, and PDF are included within the macros. Experience in writing SAS program codes or SAS macros is not required to use these macros. The SAS enhanced data mining software Enterprise Miner is not required to run these SAS macros. All SAS macros included in this book use the same simple user-friendly format, so minimal training time is needed to master usage of these macros. Experienced SAS programmers can customize these macros by modifying the SAS macro codes included. Regular updates to the SAS macros will be posted in the book website, so readers can always take advantage of the updated features in the SAS macros by downloading the latest versions. The fact that these SAS macros do not use Enterprise Miner is something of a limitation in that SAS macros could not be included for performing neural net, CART, and market basket analysis, as these data mining tools require the use of Enterprise Miner.
1.10 Summary
Data mining is a journey a continuous effort to combine business knowledge with information extracted from acquired data. This chapter briefly introduces the concept and applications of data mining, which is the secret and intelligent weapon that unleashes the power hidden in data. The SAS Institute, the industry leader in analytical and decision support solutions, provides the powerful software Enterprise Miner to perform complete data mining solutions; however, because of the high price tag for Enterprise Miner, application of this software is not feasible for all business analysts and academic institutions. As alternatives to the point-and-click menu interface modules and Enterprise Miner, user-friendly SAS macro applications for performing several data mining tasks are included in this book. Instructions are given in the book for downloading and applying these user-friendly SAS macros for producing quick and complete data mining solutions.
References
1. SAS Institute, Inc., Customer Success Stories (http://www.sas.com/news/success/solutions.html).
2. SAS Institute, Inc., Customer Relationship Management (http://www.sas.com/solutions/crm/index.html). 3. SAS Institute, Inc., SAS Enterprise Miner Product Review (http://www.sas. com/products/miner/miner_review.pdf). 4. Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, 3rd ed., Potomac, MD, 1999 (http://www.twocrows.com/intro-dm.pdf). 5. Berry, M.J.A. and Linoff, G.S., Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, New York, 1997. 6. Berry, M.J.A. and Linoff, G.S., Mastering Data Mining: The Art and Science of Customer Relationship Management, 2nd ed., John Wiley & Sons, New York, 1999. 7. SAS Institute, Inc., The Power To Know (http://www.sas.com). 8. SAS Institute, Inc., Data Mining Using Enterprise Miner Software: A Case Study Approach, 1st ed., SAS Institute, Inc., Cary, NC, 2000. 9. SAS Institute, Inc., The Enterprise Miner (http://www.sas.com/products/ miner/index.html). 10. SAS Institute, Inc., The Enterprise Miner Standalone Tutorial (http://www. sas.com/service/tutorials/v8/em/mainmenu.htm).
Way, R., Using SAS/INSIGHT Software as an Exploratory Data Mining Platform (http://www2.sas.com/proceedings/sugi24/Infovis/p160-24.pdf). Westphal, C. and Blaxton, T., Data Mining Solutions, John Wiley & Sons, New York, 1998.