

KDD Cup 2009: Customer relationship prediction | Sig KDD


KDD Cup 2009: Customer relationship prediction

11/04/11

KDD Cup 2009: Overview

This Year's Challenge

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch providers (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge on customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for all instances, of a target variable to explain (i.e. churn, appetency or up-selling). Tools which produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables which describe the instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship.

Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction, and indexation, based on an efficient model combined with variable selection regularization and model averaging. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point.
Therefore, part of the competition will be time-constrained, to test the ability of the participants to deliver solutions quickly.

Tasks

Task Description

The task is to estimate the churn, appetency, and up-selling probability of customers; hence there are three target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants with access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.

Churn (Wikipedia definition): Churn rate, also sometimes called attrition rate, is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.

Appetency: In our context, appetency is the propensity to buy a service or a product.

Up-selling (Wikipedia definition): Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. It usually involves marketing more profitable services or products, but can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.
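Each target is thus a separate binary scoring problem: one model per target, each producing a ranking score per customer. As a minimal illustration, here is a toy pure-Python Gaussian naive Bayes scorer (chosen only because Naive Bayes is the challenge's stated baseline; the function names are ours, and the organizers' in-house system is something else entirely):

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate a per-class prior and per-feature mean/variance.

    X is a list of numeric feature vectors; y contains +1/-1 labels,
    matching the challenge's target encoding.
    """
    stats = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # Floor the variance so constant features do not divide by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(cols, means)]
        stats[c] = (n / len(y), means, variances)
    return stats

def nb_score(stats, x):
    """Log-odds of the +1 class: usable directly as a ranking score."""
    def log_joint(c):
        prior, means, variances = stats[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances))
    return log_joint(1) - log_joint(-1)
```

One such model would be fitted independently on each of the three label sets (churn, appetency, up-selling), and its log-odds output used as the prediction score for that task.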

Evaluation

Performance is evaluated as the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling). This is what we call the "Score" on the Results page.

Sensitivity and specificity

The main objective of the challenge is to make good predictions of the target variables. The prediction of each target variable is treated as a separate classification problem. The results of classification, obtained by thresholding the prediction score, may be represented in a confusion matrix, where tp (true positive), fn (false negative), tn (true negative) and fp (false positive) represent the number of examples falling into each possible outcome:

                     Truth
Prediction      Class +1   Class -1
Class +1           tp         fp
Class -1           fn         tn

Any sort of numeric prediction score is allowed, larger numerical values indicating higher confidence in positive class membership. We define the sensitivity (also called true positive rate or hit rate) and the specificity (true negative rate) as:

Sensitivity = tp / pos
Specificity = tn / neg


where pos = tp + fn is the total number of positive examples and neg = tn + fp is the total number of negative examples.

AUC

The results will be evaluated with the so-called Area Under Curve (AUC). It corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying the threshold on the prediction values used to determine the classification result. The AUC is related to the area under the lift curve and to the Gini index used in marketing (Gini = 2 x AUC - 1). The AUC is calculated using the trapezoid method. When binary scores are supplied for the classification instead of discriminant values, the curve is given by {(0,1), (tn/(tn+fp), tp/(tp+fn)), (1,0)} and the AUC is simply the balanced accuracy (BAC).
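The metric can be sketched in a few lines of Python (an illustration, not the organizers' evaluation script; the curve is parametrized here in the usual ROC coordinates (1 - specificity, sensitivity), which gives the same area as the (specificity, sensitivity) form above):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Hit rate and true negative rate from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) pairs."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def auc_from_binary_predictions(tp, fn, tn, fp):
    """With binary scores the curve has a single operating point, and
    the trapezoid area reduces to the BAC: (sensitivity + specificity) / 2."""
    sens, spec = sensitivity_specificity(tp, fn, tn, fp)
    return trapezoid_auc([(0.0, 0.0), (1.0 - spec, sens), (1.0, 1.0)])

def challenge_score(auc_churn, auc_appetency, auc_upselling):
    """The challenge "Score": arithmetic mean of the three AUCs."""
    return (auc_churn + auc_appetency + auc_upselling) / 3.0
```

For example, tp = 40, fn = 10, tn = 30, fp = 20 gives sensitivity 0.8, specificity 0.6, and a binary-prediction AUC of 0.7, exactly the balanced accuracy.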


Rules

Competition Rules

Conditions of participation: Anybody who complies with the rules of the challenge (KDD Cup 2009) is welcome to participate. Only the organizers are excluded from participating. The KDD Cup 2009 is part of the competition program of the Knowledge Discovery in Databases conference (KDD 2009), Paris, June 28-July 1, 2009. Participants are not required to attend the KDD Cup 2009 workshop, which will be held at the conference, and the workshop is open to anyone who registers. The proceedings of the competition will be published by the Journal of Machine Learning Research Workshop and Conference Proceedings (JMLR W&CP).

Anonymity: All entrants must identify themselves by registering on the KDD Cup 2009 website. However, they may elect to remain anonymous by choosing a nickname and checking the box "Make my profile anonymous". If this box is checked, only the nickname will appear in the result tables instead of the real name. Participant emails will not appear anywhere on the website and will be used only by the organizers to communicate with the participants. To be eligible for prizes, participants will have to publicly reveal their identity and uncheck the box "Make my profile anonymous".

Data: The dataset is available for download from the Data page to registered participants. The data come in several archives to facilitate downloading, and two versions are made available ("small" with 230 variables and "large" with 15,000 variables). Participants may enter results on either or both versions, which correspond to the same data entries; the 230 variables of the small version are a subset of the 15,000 variables of the large version. Both training and test data are available without the true target labels. For practice purposes, "toy" training labels are available together with the training data from the start of the challenge in the fast track. The results on toy targets (T) will not count in the final evaluation. The real training labels of the tasks "churn" (C), "appetency" (A), and "up-selling" (U) will be made available for download separately halfway through the challenge.

Challenge duration and tracks: The challenge starts March 10, 2009 and ends May 11, 2009. There are two challenge tracks. FAST (large) challenge: results submitted on the large dataset within five days of the release of the real training labels count towards the fast challenge. SLOW challenge: results on the small dataset, and results on the large dataset not qualifying for the fast challenge, submitted before the KDD Cup 2009 deadline of May 11, 2009, count toward the slow challenge. If more than one submission is made in either track with either dataset, the last submission before the track deadline is used to rank participants and award the prizes. You may compete in both tracks, and there are prizes in both tracks.

On-line feedback: During the challenge, training set performances are shown on the Results page, together with partial information on test set performances: the test set performances on the toy task (T) and performances on a fixed 10% subset of the test examples for the real tasks (C, A, U). After the challenge is over, the performances on the whole test set will be calculated and substituted in the result tables.

Submission method: Submissions are made via the form on the Submission page. To be ranked, submissions must comply with the Instructions. A submission should include results on both the training and test sets for at least one of the tasks (T, C, A, U), but it may include results on several tasks. A submission is considered "complete" and eligible for prizes if it contains the 6 files of training and test predictions for the tasks C, A, and U, for either the small or the large dataset (or both). Results on the practice task T do not count as part of the competition. If you encounter problems with the submission process, please contact the Challenge Webmaster. Multiple submissions are allowed, but please limit yourself to a maximum of 5 submissions per day. For your final entry in the slow track, you may submit results on either or both of the small and large datasets in the same archive (hence you get 2 chances of winning).

Evaluation and ranking: For each entrant, only the last valid entry counts towards determining the winner in each track (fast and slow). Each participating person is limited to a single final entry in each track (see the FAQs page for the conditions under which you can work in teams). Valid entries must include results on all three real tasks. The method of scoring is posted on the Tasks page. Prizes will be awarded only to entries performing better than the baseline method (Naive Bayes), whose results are provided on the Results page. These are not the best results obtained by the organization team at Orange; they are easy to outperform, but difficult to attain by chance.

Reproducibility: Participation is not conditioned on delivering code or publishing methods. However, we will ask the top-ranking participants to voluntarily fill out a fact sheet about their methods, contribute papers to the proceedings, and help reproduce their results.

Data

Data Download

Training and test data matrices and practice target values

The large dataset archives have been available since the start of the challenge. The small dataset will be made available at the end of the fast challenge. Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical. Toy target values are available for practice purposes only; their prediction will not be part of the final evaluation.

Small version (230 var.):
orange_small_train.data.zip (8.2 Mbytes)
orange_small_test.data.zip (8.2 Mbytes)

Large version (15,000 var.):
orange_large_train.data.chunk1.zip (52.7 Mbytes)
orange_large_train.data.chunk2.zip (52.7 Mbytes)
orange_large_train.data.chunk3.zip (52.6 Mbytes)
orange_large_train.data.chunk4.zip (52.5 Mbytes)
orange_large_train.data.chunk5.zip (52.6 Mbytes)
orange_large_test.data.chunk1.zip (52.8 Mbytes)
orange_large_test.data.chunk2.zip (52.5 Mbytes)
orange_large_test.data.chunk3.zip (52.6 Mbytes)
orange_large_test.data.chunk4.zip (52.6 Mbytes)
orange_large_test.data.chunk5.zip (52.6 Mbytes)

Toy targets (large):
orange_large_train_toy.labels

True task labels

Real binary targets (small):
orange_small_train_appentency.labels
orange_small_train_churn.labels
orange_small_train_upselling.labels

Real binary targets (large):
orange_large_train_appetency.labels
orange_large_train_churn.labels
orange_large_train_upselling.labels

Data Format

The datasets use a format similar to the text export format of relational databases:

One header line with the variable names
One line per instance
Tab separators between values
Missing values encoded as consecutive tabs

The large matrix results from appending the downloaded chunks in the order of their chunk number. The header line is present only in the first chunk. The target values (.labels files) have one example per line, in the same order as the corresponding data files. Note that churn, appetency, and up-selling are three separate binary classification problems. The target values are +1 or -1. We refer to examples having +1 (resp. -1) target values as positive (resp. negative) examples. The Matlab matrices are numeric. When loaded, the data matrix is called X. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables, and mapped to 0 for categorical variables.

Results

Winners of KDD Cup 2009: Fast Track

First Place: IBM Research, "Ensemble Selection for the KDD Cup Orange Challenge"
First Runner Up: ID Analytics, Inc, "KDD Cup Fast Scoring on a Large Database"
Second Runner Up: Old dogs with new tricks (David Slate, Peter W. Frey)

Winners of KDD Cup 2009: Slow Track

First Place: University of Melbourne, "University of Melbourne entry"
First Runner Up: Financial Engineering Group, Inc. Japan, "Stochastic Gradient Boosting"

Second Runner Up: National Taiwan University, Computer Science and Information Engineering, "Fast Scoring on a Large Database using regularized maximum entropy model, categorical/numerical balanced AdaBoost and selective Naive Bayes"

Full Results: Fast Track

[The fast-track results table listed 59 ranked entries, each with team name, method name, per-task AUC (churn, appetency, up-selling), and the mean Score; the final entry, "Reference", was the organizers' random-prediction baseline. The table's columns did not survive conversion to plain text; see the original KDD Cup 2009 results page for the full rankings.]

Full Results: Slow Track

[The slow-track results table listed 89 ranked entries in the same format (team name, method name, per-task AUC, and mean Score), including the organizers' random-prediction "Reference" entry. The table's columns did not survive conversion to plain text; see the original KDD Cup 2009 results page for the full rankings.]

FAQs

Participation and Registration

What is the goal of the challenge?
The challenge consists of several classification problems. The goal is to make the best possible predictions of a binary target variable from a number of predictive variables.

Can I enter under multiple names?
No. We limit each participant to one final entry, which may contain results on the large dataset only in the fast track, and on either or both of the small and large datasets in the slow track. Registering under multiple names would be considered cheating and would disqualify you. Your real identity must be known to the organizers. You may hide your identity from the outside only, by checking "Make my profile anonymous" in the registration form.

Can I participate in multiple teams?
No. Each individual is allowed only a single final entry into the challenge to compete for the prizes. During the development period, each team must have a different registered team leader. To be ranked in the challenge and qualify for prizes, each registered participant (individual or team leader) will have to disclose the names of any team members before the final results of the challenge are released. Hence, at the end of the challenge, you will have to choose which team you want to belong to (only one!) before the results are publicly released. After the results are released, no change in team composition will be allowed.

I understand that one person can join only one team, but is it OK to have many teams in the same organization?
Yes. Each team leader must be a different person and must register, and the teams cannot intersect. Before the end of the challenge, the team leaders will have to declare the composition of their team. This will have to correspond to the list of co-authors in the proceedings, if they decide to publish their results. Hence a professor cannot have his/her name on all of his/her students' papers (but can be thanked in the acknowledgements).

How do I register a team?
Register only the team leader and choose a nickname for your team. We'll let you know later how to disclose the members of your team.

www.kdd.org/kdd-cup-2009-customer-relationship-prediction

6/8

Can the organizers enter the challenge?


No. The organizers may make entries under the common name "Reference" to stimulate the competition, but they do not compete towards the prizes.

Data: Download, format, etc.

I have problems with the ZIP files, which appear to be corrupted. Can I get a DVD?
Try to download one archive at a time. If the problem persists, contact the organizers so they can send you a DVD.

Are the data available in other formats: Matlab, SAS, etc.?
There are several Matlab versions posted on the Forum. There is also a numerical version of the categorical variables in text format for the large dataset. Please post your own version of the data to share it with others.

Is there sample code available?
Yes. We made available sample Matlab code to help you format your results. There are also examples calling CLOP models from that code. AT THIS STAGE THERE IS NOT YET MATLAB SUPPORT FOR HANDLING THE LARGE DATASET.

Are the true targets distributed similarly to the toy target?
No. The toy target is generated by an artificial stochastic process. The proportion of examples in either class is different for the real targets, which have less than 10% of examples in the positive class.

I have observed that the last columns (after variable 14,740) are not numerical. Are the data corrupted?
No. The last variables are categorical. The strings correspond to category codes; one could be, for instance, a city name. For reasons of privacy, the real names were replaced by meaningless strings.

I have observed that some columns are empty or constant. Are the data corrupted?
No. This is correct, and part of the challenge, which deals with automatic data preparation and modeling on real industrial data. Filtering constant data is the easy part of the challenge.

I have observed that the first chunk of the large dataset contains only 9,999 lines. Is this correct?
Yes. Chunk 1 contains 9,999 data lines plus the header. All other chunks have no header. The last chunk has 10,001 lines, so the total is 50,000 lines of data.

In the categorical variables, do the values need to be handled as meaningful sequences, or are they just codes?
The original categorical values were symbols, not indicating any category ordering. The category symbols have been replaced by random anonymized strings with no semantics, in one-to-one correspondence with the original values, so as to keep the structure of the data.

Do the targets correspond to single or multiple products?
The targets correspond to single products (but not necessarily the same one). For instance, churn concerns mobile phone customers switching providers, and up-selling concerns upgrading the plan to include television.

Is there a meaning in the variable ordering?
No. The variables are randomly ordered.

Are the variables in the small dataset a subset of those in the large dataset?
Yes. However, they are disguised to make them non-trivial to identify, and to discourage people from identifying them. The examples are also ordered differently, to make such a mapping even harder. We wish participants to work on each dataset separately, although they may work on both.

Are the training and test data drawn from the same distribution?
Yes.

Are the sets of categorical variable values the same in the training and test data?
Not necessarily. Some values might show up only in the training data or only in the test data.

Are there the same number of values in each line?
There can be missing values. The values are separated by tabulations; two consecutive tabs indicate a missing value.

Is it allowed to unscramble the small dataset?
Scrambling was done to encourage participants to work separately on the small dataset and the big dataset. If we had wanted participants to be able to use the features of the small dataset in addition to those they might select from the big one, we would not have scrambled the data. We realize, however, that if we forbade unscrambling and considered it cheating, we would have difficulty enforcing that rule. Hence, participants who unscramble the small dataset will not be disqualified from the competition. All participants will be requested to report at the end of the challenge whether they made use of unscrambling and whether they derived some advantage from it.

Evaluation: Tracks, submission format, etc.

Why do we need to submit results on training data?
In this way we can assess the robustness of the models. If you make great predictions on training data but perform poorly on test data, your method is likely overfitting.

What is the purpose of giving performances on 10% of the test data?
We want to give feedback to the participants to motivate them: they can see roughly how their performance compares to others. But by giving feedback on only 10% of the data, we prevent them from fine-tuning their system on the test data (i.e. de facto "learning" from the test data). There will be a slight bias in performance because of the 10% on which feedback is provided, but it is the same bias for all contestants.

Is it correct that even if I submit results on the large dataset in the fast track, I can submit results on the large dataset in the slow track together with results on the small dataset?
Yes. In fact, you may submit as many times as you want, but only the last complete entry (with churn, appetency, and up-selling results on both training and test data: 6 files) will count in each track, depending on the submission date. In the fast track, you may enter only large dataset results, so you get 1 chance. In the slow track you
may enter on both small and large datasets, so you get 2 chances (the better of your 2 results will be taken into account). In total, you get 3 chances of winning.

If I submit results on both the small and large datasets in the slow track, how will results be evaluated?
The better of your 2 results will be taken into account.

Both small and large entries compete for the slow prize, but they seem to correspond to two distinct problems. Shouldn't there be two slow-track prizes?
The small dataset is a downsized version of the large one: same examples, a subset of the features. To distinguish the two, the examples were ordered differently and the features were coded differently, in a way that should not affect performance but makes descrambling non-obvious. Because of the (unlikely) possibility that someone would spend time descrambling, we decided to give a single prize in the slow challenge, so as not to encourage people to cheat.

If I submit results before the fast track deadline, will those results also enter the slow track if I submit nothing afterwards?
Yes. For each deadline, your last valid complete entry enters the ranking. So if you submit only to the fast track, your results will automatically be entered in the slow track.

If I win in both tracks, will I cumulate prizes?
No. You will get the larger of the two prizes. The remaining money will be used to give travel grants to other deserving participants.

On the Results page, there is a "Score" column in the table. What does it mean?
As explained on the Tasks page, the score is the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling).

I see a bunch of xxxx instead of my score. Is there a problem?
No. Until the labels of the challenge tasks are released, people who submit on those tasks cannot see their results, to prevent them from gaining information by guessing. Only results on the toy problem are shown. You may still practice submitting some random values to test the system, but you will not see the results.

DISCLAIMER

Can a participant give an arbitrarily hard time to the organizers?
ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ORANGE AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ORANGE AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.

Contacts

KDD Cup 2009 Chairs:
David Vogel, Data Mining Solutions
Isabelle Guyon, Clopinet

Please visit the original KDD Cup 2009 website (www.kdd.org/kdd-cup-2009-customer-relationship-prediction) for more information.
