
Fourth Workshop on Data Mining Case Studies and Practice Prize

Call For Papers


Fourth Workshop on Data Mining Case Studies and Success Stories and Fourth Data Mining Practice Prize

Motivation

From its inception, the field of data mining has been guided by the need to solve practical problems. Yet a cursory examination of the publications shows that few papers describe a completed implementation, or what we will term a case study. The small number of case studies is counter-balanced by their prominence: anecdotally, case studies are one of the most discussed topics at data mining conferences. Some of the benefits of good case studies include:

1. Inspiration: Case studies provide examples that can inspire data mining researchers to pursue important new technical directions.
2. Innovation: Data mining case studies demonstrate how whole problems were solved, not just part of the problem. Often building the prediction algorithm is only 10% of the problem; the other aspects that comprise a successful deployment are valuable for practitioners to understand.
3. Education: People are more likely to remember stories than facts.
4. Media coverage: The media are more likely to report on completed data mining applications than on isolated algorithms. We have an opportunity to present positive success stories to the wider community.
5. Public relations: Applications, particularly those that are socially beneficial, will improve how data mining is perceived both by the wider public and by other scientific fields.
6. Connections to other scientific fields: Completed systems knit together a range of scientific and engineering disciplines such as signal processing, chemistry, optimization theory, auction theory, and so on. Fostering meaningful connections to these fields will benefit data mining academically, and will assist data mining practitioners in learning how to harness these fields to develop successful applications.

The Workshop

The Data Mining Case Studies Workshop and Practice Prize was established in 2005 to showcase the very best in data mining deployments. Data Mining Case Studies continues its biennial tradition of showcasing top case studies at ICDM-2011. Data Mining Case Studies will highlight data mining implementations that have been responsible for a significant and measurable improvement in business operations, an equally important scientific discovery, or some other benefit to humanity. Examples of Data Mining Case Studies from previous years have included: (a) a medical application that saved hundreds of lives by mining hundreds of thousands of patient records to identify patients who showed all the signs of heart disease yet had not been prescribed heart medication, (b) a system that uncovered hundreds of millions of dollars in sheltered tax evasion rings, (c) a system that raised revenue through improved cross-selling of computer peripherals and equipment. Data Mining Case Studies allows papers greater latitude in: (a) range of topics - authors may touch upon areas such as optimization, operations research, inventory control, and so on; (b) page length - longer submissions are allowed; (c) scope - more complete context, problem and solution descriptions are encouraged; (d) prior publication - if the paper was published in part elsewhere, it may still be considered if the new article is substantially more detailed; (e) novelty - the use of established techniques to achieve successful implementations will be given partial allowance. Papers on unsuccessful data mining systems that describe lessons learned and war stories will also be considered.

The Data Mining Practice Prize

Introduction: The Data Mining Practice Prize will be awarded for the best Data Mining Case Study submission. The prize will be awarded for work that has had a significant and quantifiable impact in the application in which it was deployed, or that has significantly benefited humanity.

Eligibility: All papers submitted to Data Mining Case Studies will be eligible for the Data Mining Practice Prize, with the exception of papers authored by persons serving on the Data Mining Practice Prize Committee or by members of sponsoring companies. Eligible authors must consent to allowing the Practice Prize Committee to independently validate their claims by contacting third parties and their deployment client for independent verification and analysis.

Award: Winners will receive a variety of honors including: 1. Prize money of $200. 2. A plaque. 3. An awards dinner with the organizers and prize winners.

Topics

Most operational industrial and scientific systems that involve data mining to some extent are likely to be acceptable. Systems that are responsible for mission-critical operations, medical applications, or cash flow, or applications that significantly benefit humanity, will be particularly good candidates. If you are unsure as to the suitability of your paper, please contact the organizers with your topic at the email address at the bottom of the page. Topics include, but are not limited to:

- Genomics
- Inventory control
- Customer Relationship Management (CRM)
- ShopBots
- Recommendation systems
- Auction trading systems
- Clinical patient monitoring
- Seismic data interpretation
- Survival analysis for medical procedures
- Climate analysis
- Correlates of genes with disease
- Dangerous drug interactions
- Law enforcement applications
- Search engine marketing
- Food spoilage elimination
- Price optimization
- Data visualization in mission-critical user interfaces
- Text understanding

Dates
Submissions open: May 8
Notify organizers of intent to submit: May 8
Optional draft submission including client contact information*: Jun 15
Final submission including client contact information (if not already provided): Jul 23
Notification of acceptance: Sep 23
Camera-ready paper submission: Oct 14
Workshop held, Practice Prize winners announced: Dec 11

*We recommend authors submit a draft of their paper by June 15, 2011 so that we can begin the process of validating claims earlier. Only the chairs will see this draft - it will not be seen by the reviewers.

Submission instructions

To contact the organizers, submit a paper, or for any other correspondence, please use the following email address

(Note: the email address that follows has been encoded as a bitmap image.)

1. Please email the organizers as early as possible with your intention to submit.
2. If possible, provide an optional draft of the article by the draft submission date. This draft will only be viewed by the Chairs - it will not be given to the reviewers or affect the prize competition.
3. Please email the organizers the names of three persons who use the system in their day-to-day activities, or who are responsible for the system, and who may be contacted to validate the claims made in the paper. Ideally these individuals belong to a different company than the authors, and are not personal acquaintances or friends of the authors.
4. Provide your author names, addresses, affiliations, phone numbers and email addresses. Also note the nature of the relationship of each contact to the system and to the authors. Finally, provide any information of relevance to contacting deployment users.
5. Please submit your completed article, in IEEE Proceedings format, to the email address above. Due to editing requirements for the Workshop Proceedings, we strongly encourage documents to be submitted in Microsoft Word format. The official IEEE Microsoft Word template is available from the IEEE website.
6. In addition to the above steps, please also follow the instructions in the IEEE author kit to submit a PDF version of your paper to the official ICDM conference site.

Guidelines

1. Page limit: The maximum submission length is 8 pages.
2. Commercial product mentions: Data Mining Case Studies is not a sales venue. References to commercial products will be carefully scrutinized by our Program Committee for applicability. Where possible the underlying techniques should be described. The purpose of Data Mining Case Studies is to illustrate real applications with descriptions that are concise and complete. Commercial software, if introduced, should be named briefly and then described at a technical level (e.g., don't say that "SAS Neural Nets(TM) increased our forecast accuracy by 20%"; instead say that you used "SAS PROC Neural Net(TM)", which implemented a 3-layer sigmoidal backpropagation model with 10 input, 4 hidden and 1 output nodes, and that this net increased forecast accuracy by 20%). Any papers violating these guidelines will be deemed inadmissible. If in doubt, please contact the organizers prior to submission. We will allow a single product mention along the lines described above, and this should be sufficient for establishing commercial credibility.

3. Valid contact information for the company that deployed the data mining system must be supplied to the Program Committee. The Program Committee should be afforded the right to contact individuals who were the beneficiaries of the data mining system and ask them questions about the implementation. In particular, the claims made in the paper submission will need to be verified. Failure to provide factual or complete descriptions of results obtained with the system, if discovered through this fact-checking process, will result in forfeiture of the prize and dismissal from the conference. The Prize Committee will endeavor to be discreet in its contacts, so please inform us of anything we need to know before we contact the system users.
4. Copyright: Authors agree to allow the display of their articles on the web. Authors should also agree to allow their articles to be published in book form. If authors wish to opt out of website or book publication, please contact the Workshop organizers.
5. Confidentiality: The reviewing process will be confidential.

Data Mining - Case Study


Please check this page regularly for updates, corrections, and answers to frequently asked questions!

Introduction
Data mining is the process of discovering previously unknown, actionable and profitable information from large consolidated databases and using it to support tactical and strategic business decisions. The statistical techniques of data mining are familiar: they include linear and logistic regression, multivariate analysis, principal components analysis, decision trees and neural networks. Traditional approaches to statistical inference fail with large databases, however, because with thousands or millions of cases and hundreds or thousands of variables there will be a high level of redundancy among the variables, there will be spurious relationships, and even the weakest relationships will be highly significant by any statistical test. The objective is to build a model with significant predictive power; it is not enough just to find which relationships are statistically significant.

Consider a campaign offering a product or service for sale, directed at a given customer base. Typically, about 1% of the customer base will be "responders," customers who will purchase the product or service if it is offered to them. A mailing to 100,000 randomly-chosen customers will therefore generate about 1,000 sales. Data mining techniques enable customer relationship marketing by identifying which customers are most likely to respond to the campaign. If the response rate among the customers contacted can be raised from 1% to, say, 1.5% (a "lift" of 1.5 over random), then 1,000 sales could be achieved with only about 66,667 mailings, reducing the cost of mailing by one-third.
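As a quick check of this arithmetic, here is a minimal Python sketch that reproduces the figures above (the function name mailings_needed is ours, introduced for illustration, not from the source):

    import math

    def mailings_needed(target_sales, response_rate):
        """Mailings required to expect target_sales responders."""
        return math.ceil(target_sales / response_rate)

    baseline = mailings_needed(1000, 0.01)    # random mailing: 100,000 pieces
    targeted = mailings_needed(1000, 0.015)   # lift of 1.5: 66,667 pieces
    savings = 1 - targeted / baseline         # fraction of mailing cost saved

    print(baseline, targeted, f"{savings:.1%}")   # 100000 66667 33.3%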

Data
This example was provided by Gary Saarenvirta, formerly of The Loyalty Group, now with IBM Canada. Each case is one account; the account numbers have been removed. The objective variable is a response variable indicating whether or not a consumer responded to a direct mail campaign for a specific product: "True" or "response" is 1, "False" or "non-response" is 0. The data were extracted from a much larger set with a response rate of about 1%. All 1079 responders were used, together with 1079 randomly-chosen non-responders, for a total of 2158 cases.

There are 200 explanatory variables in the file: v137, v141 and v200 are indicators for gender "male," "female," or "unknown," respectively; v1-v24, v138-v140 and v142-v144 are recency, frequency and monetary type data for the specific accounts; v25-v136 are census variables; and v145-v199 are demographic "taxfiler" variables. Most of the variables have been normalized. A table with some variable descriptions is attached. Some of the product-specific variables have been blinded. "p##" means product, "rcy" means recency (number of months since the most recent transaction), "trans" means number of transactions, and "spend" means dollars spent. For example, p01rcy means product 1 recency. Note that zero recency means that the account was active for that product in the most recent month. "Never active" would be indicated by the largest possible value for recency, as determined by the first month in which the business collected data.

The census and taxfiler variables are summary statistics for the enumeration area in which the account holder's address is located. They generally give total or average numbers of individuals or families or dollars in the categories indicated. A table of taxfiler variable descriptions is attached. You may be able to guess the census variables from their names, but tables with longer descriptions of the census variables are attached: Group "a" and Group "b" are listed separately. You are welcome to contact us if you aren't sure of any.

You can get the data as an Excel 97/98 workbook gary.xls (5.9 MB), as an Excel 97/98 workbook compressed into a ZIP archive gary_xls.zip (2.4 MB), or as text files in a ZIP archive gary.zip (1.3 MB). There are two text files in gary.zip: the data are in a fixed-width ASCII file Sasvar.txt, and the data file description is in imtrans.txt. If you choose to work with the text files, you MUST use the column positions in imtrans.txt to import the data into SAS or Splus because some columns are contiguous. Be careful with line endings; for example, if you unzip the text files in UNIX there will be a line-feed character at the end of each line that will have to be included when computing the record length.
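If you work with the text files in Python rather than SAS or Splus, pandas can read fixed-width data directly. The sketch below is illustrative only: the (start, end) column positions and field names are placeholders, since the real byte offsets must be taken from imtrans.txt.

    import pandas as pd

    # Hypothetical (start, end) byte offsets; replace with the positions
    # given in imtrans.txt -- some columns are contiguous, so do not guess.
    colspecs = [(0, 1), (1, 9), (9, 17)]      # e.g. response flag, v1, v2
    names = ["response", "v1", "v2"]          # extend to all 200 variables

    df = pd.read_fwf("Sasvar.txt", colspecs=colspecs, names=names)
    print(df["response"].value_counts())      # expect 1079 ones, 1079 zeros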

Suggestions for Analysis


We expect that you will be using Splus or SAS for the analysis; however, not all of the methods suggested here are readily available in Splus. If you have a SAS licence, the Enterprise Miner module will conveniently automate many of the analyses, and you may be able to get an evaluation copy inexpensively from SAS. IBM's Intelligent Miner is also recommended, but it is less likely to be available to you.

For all the analyses below, you should create a training set and a validation set. As the data were stratified to 50/50, you should create an unstratified validation set with the original proportion of 1% "True" for the objective variable. You would, of course, get better validation sets if you had the complete sample of around 100,000 accounts, 99% of them non-responders, but the file is too large for us to distribute conveniently. Validation sets constructed from the 50/50 stratified sample should be adequate for the purposes of this exercise.

Your results should be plotted on a gains chart, either tabular or graphical (a minimal computation is sketched after the suggestions below). A gains chart is a plot of the % of the responders reached (ordinate) against the % of the customer base contacted (abscissa). If the campaign is directed at randomly-chosen individuals, the plot will be a straight line with unit slope through the origin. If the campaign preferentially targets responders, the gains curve will lie above the diagonal except, of course, at 0% and 100%, where it necessarily touches. The performance of a predictive model is measured by looking at the % of responders reached at 10%, 20% or 30% of customers mailed. A good model will get 1.5 to 3.5 times as many as random over this range; so, for example, mailing to 10% of the customer base will reach 15% to 35% of the responders. Less than this means the data are not very predictive; more than this likely means that you have overfitted or that there is a strong bias in the data.

Some things you could try with these data include:

1. Try some simple linear correlations, Spearman and Pearson, against the objective variable and reduce the number of variables. With the reduced set of variables, build logistic regression models. Don't forget to remove collinear variables.
2. Break the variables into blocks of 10-20 and build logistic models on each of the blocks. After all the models are built, pool the variables that were left in the models, create new blocks of 10-20, and repeat until there is only one block of variables left. Don't forget to remove collinear variables.
3. Create PCA factors from the set of variables (don't include the objective variable!). Select a reduced set of factors (using cumulative % of variation explained) and build a model from them. Compare this result with using all the factors, noting the effect of overfitting. Don't forget to remove collinear variables.
4. Perform a varclus with all variables. This procedure clusters the variables into hierarchical groups using the PCA factors. Select variables from the bottom level of the hierarchical groups and build a logistic model. Don't forget to remove collinear variables.

5. Create multiple training and test samples. Use bootstrapping to estimate the error bounds on the model coefficients and gains chart performance. Try sampling with and without replacement to see how sensitive logistic regression is to the data set configuration.
6. Use SAS to construct a Radial Basis Function regression. Use all the above methods to reduce the variable set and compare the RBF results to logistic regression.
7. It is possible to implement a decision tree in SAS using the CART algorithm. Run this algorithm against all the variables. Build multiple training sets using sampling with replacement; this should improve the tree performance by a few percent.
8. Other modeling techniques to try include neural networks and genetic algorithms.

All model results should be analyzed for gains chart performance with the following measures:

1. What is the response rate (as % of responders in the customer base reached), compared to random, for campaigns mailed to 10%, 20% or 30% of all customers? At these points, the random response would be 10%, 20% or 30% respectively. Most campaigns are mailed to 10% to 30% of the customer base. Good models can achieve 1.5 to 3.5 times the random response rate in this range.
2. Monotonicity and smoothness: do the response rates by quantile group form a smoothly decreasing profile? Any waviness is indicative of bias, overfitting or unmodelled effects.
3. Ease of model explanation: it is very important for prospective clients to understand why the model is working!
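Although the suggestions above assume SAS or Splus, the gains chart itself is easy to compute in any environment. Here is a minimal Python/scikit-learn sketch, substituted for SAS/Splus purely for illustration; it reuses the hypothetical df and "response" names from the loading sketch above, and it keeps the 50/50 stratified sample for validation, which the text says is adequate for this exercise.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Split the 2158 cases into training and validation halves.
    X, y = df.drop(columns="response"), df["response"]
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = model.predict_proba(X_va)[:, 1]

    # Gains chart: rank customers from most to least likely to respond,
    # then track the cumulative fraction of responders reached.
    order = np.argsort(-scores)
    reached = np.cumsum(y_va.to_numpy()[order]) / y_va.sum()
    for pct in (0.1, 0.2, 0.3):
        k = int(pct * len(order))
        print(f"mail {pct:.0%} of base -> reach {reached[k - 1]:.1%} "
              f"of responders ({pct:.0%} if random)")

A good model, by the criteria above, should report roughly 1.5 to 3.5 times the random percentage at each of the 10%, 20% and 30% contact depths.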
