
TELSIKS 2011, Nis, Serbia, October 5 - 8, 2011

Predicting the Churn of Telecommunication Service Users using Open Source Data Mining Tools
Srdjan M. Sladojevic1, Dubravko R. Culibrk2, Vladimir S. Crnojevic3

1 Srdjan M. Sladojevic is with the Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovica 6, 21000 Novi Sad, Serbia, E-mail: ssladojevic@gmail.com
2 Dubravko R. Culibrk is with the Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovica 6, 21000 Novi Sad, Serbia, E-mail: alef.tau@gmail.com
3 Vladimir S. Crnojevic is with the Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovica 6, 21000 Novi Sad, Serbia, E-mail: vladimir.crnojevic@gmail.com
Abstract - The paper presents a study focused on the problem of predicting the churn of telecommunication-service users. The transition of customers (users) to competitors is a significant business problem in areas with poor market differentiation of products and services, which is particularly evident in the market of telecommunication services. The paper evaluates the applicability of commonly used data mining methods, available within a widely used data mining tool, to the problem. The experiments presented in the paper have been conducted on real-world data provided for research purposes by the Orange telecommunications company. The results indicate that boosting of simple classifiers achieves the best results and that open-source tools can achieve performance very close to the best proprietary solutions.

Keywords - Churn, Users, Data Mining.

I. INTRODUCTION
The transition of customers (users) to competitors is a significant business problem in areas with poor market differentiation of products and services, which is particularly evident in the market of telecommunication services. The phenomenon is usually referred to as churn. A telecommunications company that could predict a possible transition of its users in time would be able to take effective steps to stop this trend and, thus, gain a great competitive advantage. As there is no deterministic model that would allow such predictions to be made, significant research effort has been invested in solving this problem through the application of data mining methods [1]. The approach has great application potential, since modern telecommunication-service providers collect large amounts of data about their users and their behaviour, which can be used to discover the patterns of user dynamics.

The focus of the study presented here is an analysis of the applicability of the algorithms implemented in the open-source data mining tool "Weka" [2] when the goal is to identify the users who will transfer their business to the competition. As a basis for the comparison of different machine learning algorithms, data provided by the international telecommunications company "Orange" was used. This data had been made available to researchers in the international competition "KDD Cup 2009". Evaluating data sets this large can take a lot of time, so one of the goals of this research was to perform all evaluations on a personal laptop computer, using open-source tools, in reasonable time.

The rest of the paper is organized as follows: Section 2 contains an overview of the related work, Section 3 describes the experimental evaluation, Section 4 presents the achieved results and, finally, Section 5 draws the conclusions.

II. RELATED WORK

The annual ACM SIGKDD conference is the main international forum for data mining scientists, as well as for practitioners from academia, business and government institutions. The partner of the 2009 KDD Cup was the French telecommunications company Orange.

Orange wanted to compare the time needed by research laboratories to develop a solution, and to check how good the data mining solution developed in its own laboratories is. Participants had the opportunity to work with a large, heterogeneous data set (numerical and categorical attributes) containing a lot of noise and very unbalanced classes. The participants were given limited time, to simulate the need to minimize the time required to obtain appropriate results, since most data sets used for analysis change rapidly.

The winner of the 2009 KDD Cup competition was IBM's research laboratory. Their overall strategy was to address the challenge using Ensemble Selection, an overproduce-and-select ensemble building method designed to generate high-performing ensembles from large, heterogeneous libraries of classifiers [4].

"Weka" is open-source software that has become a standard for the evaluation and application of data mining methods. It allows its users to perform the necessary experiments efficiently, as well as enabling the subsequent realization of practical systems for prediction and classification [5]. Weka contains an advanced set of tools for data mining and data analysis, including implementations of a large number of machine learning algorithms [6]. Several of these algorithms are of interest for the research presented in this paper: ADTree, J48, IB1, KStar, NaiveBayes, DecisionStump and AdaBoost.
ADTree is an algorithm that builds an alternating decision tree using boosting and is optimized for two-class problems [6]. We considered another tree-building algorithm, J48, which builds a C4.5 decision tree [2].

IB1 and KStar are so-called lazy learners, since they store the training instances and do no real work until classification time. IB1 is a basic instance-based learner [11]. KStar is a nearest-neighbour method with a generalized distance function based on transformations, proposed by Cleary and Trigg [3].

NaiveBayes implements the probabilistic Naive Bayes classifier [6]. DecisionStump is an algorithm for building a simple one-level binary decision tree, with an extra branch for missing values [6].

AdaBoost is an implementation of a meta-learning algorithm based on the approach of Freund and Schapire [10]. Boosting is a well-known methodology for combining multiple models by explicitly seeking models that complement one another.
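As a minimal sketch of how such a comparison can be run through Weka's Java API (class names as in the Weka 3.x releases of that era; the ARFF file name, the cross-validation protocol and the class-value order are assumptions, not details given in the paper):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IB1;
import weka.classifiers.lazy.KStar;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ChurnBaselines {
    public static void main(String[] args) throws Exception {
        // "churn.arff" is a placeholder for the prepared Orange data set.
        Instances data = DataSource.read("churn.arff");
        data.setClassIndex(data.numAttributes() - 1); // class values: -1 / 1

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump()); // boosted decision stumps

        Classifier[] models = {
            new J48(), new IB1(), new KStar(), new NaiveBayes(), boost
        };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            // class index 1 is assumed to be the "leaver" class (class 1)
            System.out.printf("%s AUC = %.3f%n",
                model.getClass().getSimpleName(), eval.areaUnderROC(1));
        }
    }
}
```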
III. EXPERIMENTAL EVALUATION
A. Input Data Set

The data sets used for training and testing contain 50000 instances. The independent variables (attributes) are numerical and categorical. Two versions of the data sets were provided by the KDD Cup organizers: an extended and a reduced data set. The reduced data set contains 190 numerical and 40 categorical variables, while the extended data set contains 14740 numerical and 260 categorical variables.
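Weka reads ARFF files natively; a data set distributed as delimited text, as in the KDD Cup, can first be converted with Weka's CSVLoader. A sketch, assuming a tab-separated input file and a Weka release recent enough to support setFieldSeparator:

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class ConvertToArff {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("orange_small_train.data")); // assumed file name
        loader.setFieldSeparator("\t"); // assumed tab-delimited input
        Instances data = loader.getDataSet();

        // Persist as ARFF so subsequent experiments can load it directly.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("churn.arff"));
        saver.writeBatch();
    }
}
```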
Any evaluation of the original data set on a standard laptop computer could take days. To allow the evaluation to be performed on standard personal computers despite the large volume of data, clustering was performed first, to obtain a reduced data set: k-means clustering was used to reduce the number of instances. The different classification algorithms were then tested on the reduced set and subsequently verified on the original set.

The reduced set produced by clustering contained 5000 instances (cluster centroids). In the reduced data set, 4655 instances represented the users who will not leave for another provider; they are marked with class -1. The remaining 345 centroids represent the users who will go to another provider and are marked as class 1. The input data set is therefore highly unbalanced. This is a particular problem for classifiers, because they lack sufficient data for some of the classes to become properly trained. In this case, the number of instances of one class is over 13 times the number of instances of the other class.
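A minimal sketch of the instance-reduction step described above, using Weka's SimpleKMeans; the number of clusters follows the paper (5000), while the input file name is an assumption:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReduceByClustering {
    public static void main(String[] args) throws Exception {
        // Placeholder name for the original 50000-instance set. The class
        // attribute must not be set while clustering; for a faithful
        // reduction it would be removed first (e.g., with the Remove filter)
        // and each centroid re-labelled afterwards - omitted here for brevity.
        Instances data = DataSource.read("orange_train.arff");

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(5000); // 5000 centroids, as in the paper
        km.setSeed(1);
        km.buildClusterer(data);

        Instances centroids = km.getClusterCentroids();
        System.out.println("Reduced set size: " + centroids.numInstances());
    }
}
```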
Since the input data set is very unbalanced, all algorithms had to be applied with "cost-sensitive" [7] classification in order to produce a prediction of the users who will leave. By default, Weka treats all types of classification errors equally. In many practical cases, though, not all errors are equal. Cost-sensitive classification allows one to assign different costs to different types of misclassification [8]. The way to do this in Weka is to create a cost matrix [6].
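In Weka's Java API this corresponds to wrapping the base learner in a CostSensitiveClassifier. A minimal sketch follows; the cost values shown mirror the class ratio as a starting point and are illustrative, not the tuned values used in the experiments:

```java
import weka.classifiers.CostMatrix;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.DecisionStump;

public class CostSensitiveSetup {
    public static CostSensitiveClassifier build() throws Exception {
        // 2x2 cost matrix: rows = actual class, columns = predicted class.
        // Misclassifying a leaver (class 1) as a stayer (class -1) is made
        // roughly 13 times more expensive, mirroring the class ratio
        // (illustrative values only).
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0);   // stayer predicted as leaver
        costs.setCell(1, 0, 13.0);  // leaver predicted as stayer (expensive)

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump());

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(boost);
        csc.setCostMatrix(costs);
        // Predict the class with minimum expected cost, instead of
        // reweighting the training data.
        csc.setMinimizeExpectedCost(true);
        return csc;
    }
}
```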

B. Performance Metrics

The target performance metric in the KDD challenge was AUC (Area Under the Curve), so all the results in this work are evaluated using AUC as well, in particular to allow the achieved results to be compared. AUC corresponds to the area under the curve obtained by plotting sensitivity versus specificity for the different thresholds of the predicted values used to determine the result of the classification [9].
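A short sketch of how the AUC and the confusion matrix can be obtained programmatically for a held-out test set; the train/test file names and the class-value order are assumptions, and CostSensitiveSetup refers to the sketch above:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateAuc {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("churn_train.arff"); // assumed names
        Instances test  = DataSource.read("churn_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        CostSensitiveClassifier model = CostSensitiveSetup.build();
        model.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        // AUC for the minority ("leaver") class, assumed second class value
        System.out.printf("AUC = %.3f%n", eval.areaUnderROC(1));
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}
```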
C. Classification

In order to determine which algorithm is the best for this classification task, the following classification algorithms were compared:

1. DecisionTable
2. ADTree
3. J48
4. Lazy IB1
5. Lazy KStar
6. NaiveBayes
7. Boosting (DecisionStump)

IV. RESULTS

The first experiment was performed with the DecisionTable classifier, without a cost matrix applied. The results show that the default algorithm is not able to correctly classify all the users of class -1, and the achieved ROC value of 0.513 was also not very satisfactory. For these reasons, no further research was performed with this algorithm. Decision trees have proved to be more accurate than the decision table.

ADTree without a cost matrix classified all the instances as the majority class. The result obtained has the highest percentage of matches for class -1 (93.10%), but it carries no information about the users who will change their provider (class 1), which is the main goal of the classification. The introduction of a cost matrix was necessary in order to isolate the users who will leave the service. The results obtained are shown in Table I:

TABLE I
CONFUSION MATRIX FOR COST-SENSITIVE ADTREE

a      b      <-- classified as
28     4627   a = -1
0      345    b = 1

The initial values of the cost matrix were set to reflect the ratio of the numbers of instances of the two classes. Unfortunately, this did not yield satisfactory results, and the cost-matrix values had to be subsequently modified to optimize the AUC value.

We started by tweaking the cost matrix to identify all the users who will leave the service. This procedure reduced the percentage of recognition of those who will remain to only 7.46%. To achieve a more meaningful classification, we evaluated the impact of a broad range of cost-matrix values. Figure 1 shows the number of identified users who will leave the service plotted against the percentage of correctly classified instances, achieved by varying the cost matrix.
[Figure 1: plot of the number of identified minority-class instances (y-axis, 0 to about 400) against the classification accuracy in percent (x-axis, 7.46 to 93.1).]

Fig. 1. Leaving users as a function of the percentage of correctly classified instances
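A sweep of this kind can be scripted directly against Weka's API. The following is a minimal sketch, assuming the reduced data set is available as an ARFF file and using an illustrative penalty grid; the tuned values from the paper are not reproduced here:

```java
import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSweep {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("churn_reduced.arff"); // assumed name
        data.setClassIndex(data.numAttributes() - 1);

        // Sweep the penalty for missing a leaver; the grid is illustrative.
        for (double penalty = 1; penalty <= 100; penalty *= 2) {
            CostMatrix costs = new CostMatrix(2);
            costs.setCell(0, 1, 1.0);
            costs.setCell(1, 0, penalty);

            AdaBoostM1 boost = new AdaBoostM1();
            boost.setClassifier(new DecisionStump());
            CostSensitiveClassifier csc = new CostSensitiveClassifier();
            csc.setClassifier(boost);
            csc.setCostMatrix(costs);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(csc, data, 10, new Random(1));
            // accuracy vs. number of identified leavers, as plotted in Fig. 1
            System.out.printf("penalty=%.0f acc=%.2f%% leavers found=%.0f%n",
                penalty, eval.pctCorrect(), eval.numTruePositives(1));
        }
    }
}
```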
The results obtained by the J48 classifier are weaker than the results obtained by the previously described algorithms. With "cost-sensitive" classification included, a typical result is shown in Table II:

TABLE II
CONFUSION MATRIX FOR COST-SENSITIVE J48

a      b      <-- classified as
2789   1866   a = -1
209    136    b = 1
The maximum ROC values could not go over 0.5: J48 has poor results with unbalanced data sets, and it is also poorly sensitive to cost-matrix changes.

It is interesting to note that only IB1 identified a number of the users who will leave the provider without any cost-matrix settings. That algorithm is one of the slowest, together with KStar, which generally did not react to cost-matrix changes at all. NaiveBayes achieved an average score, but had the best classification speed.

Boosting of decision stumps, using AdaBoost, achieved the best results, slightly better than the ADTree algorithm. The result with the best cost-matrix balance achieved with the DecisionStump algorithm is shown in Table III:

TABLE III
CONFUSION MATRIX FOR COST-SENSITIVE DECISIONSTUMP BOOSTING

a      b      <-- classified as
4408   247    a = -1
292    53     b = 1

In total, 89% of the instances were classified correctly, and as many as 53 users who will leave the service have been identified. The ROC Area in this case is 0.697, which is the largest value measured in the evaluation.

All the evaluations presented above were performed on the reduced data set generated as the result of clustering, due to the need for higher processing speed, but also due to the inability of an average PC to evaluate the complex algorithms on the full set.

The final testing was performed on the whole data set, containing 50000 instances. For this purpose, the cost matrix had to be slightly changed, because the ratio of the instances of the two classes changed as well. It was fine-tuned again in order to optimize the AUC value, and the results of the classification of the complete data set using this cost matrix are shown in Table IV:

TABLE IV
CONFUSION MATRIX FOR COST-SENSITIVE DECISIONSTUMP BOOSTING BASED ON THE COMPLETE DATASET

a       b      <-- classified as
43773   1898   a = -1
2978    649    b = 1

The results, and the corresponding ROC value of 0.72, indicate that the algorithm performs even better than when using the reduced data set. The percentage of correctly classified instances was also slightly higher, which, in this case, also confirms the result obtained by the centroid-based classification. This algorithm is therefore identified as the best model for the classification of the unbalanced data sets that describe the churn of the users of telecommunications operators. The result achieved in this manner is comparable to the IBM Research Laboratory's result, where the ROC value was 0.7651.

V. CONCLUSION

We studied the applicability of different machine learning algorithms, available within the open-source data mining tool Weka, to the problem of predicting the churn of telecommunication users. A number of algorithms have been evaluated using the data set provided to researchers within the KDD Cup 2009 competition. We showed that meaningful classification can be achieved using a personal computer and open-source tools.

Working with unbalanced data sets, which are commonplace in the problem considered, is difficult, and this was a major challenge in the work presented. None of the applied algorithms is able to produce a meaningful classification of a very unbalanced input data set without employing "cost-sensitive" classification. The best results were achieved using AdaBoost to boost the DecisionStump classifier.

ACKNOWLEDGEMENT

This work has been supported in part by the Ministry of Science and Technological Development of Serbia, grant III43002.
REFERENCES

[1] Srdjan Sladojevic, "Predicting the Churn of Telecommunication Service Users Using Modern Data Mining Systems" (in Serbian), M.Sc. thesis, FTN, Novi Sad, 2010.
[2] Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham, "Weka: Practical Machine Learning Tools and Techniques with Java Implementations", Department of Computer Science, University of Waikato, New Zealand.
[3] John G. Cleary and Leonard E. Trigg, "K*: An Instance-based Learner Using an Entropic Distance Measure", in Proc. 12th International Conference on Machine Learning, pp. 108-114, 1995.
[4] Gideon Dror, Marc Boulle, Isabelle Guyon, Vincent Lemaire and David Vogel, "Winning the KDD Cup Orange Challenge with Ensemble Selection", JMLR: Workshop and Conference Proceedings, KDD Cup, 2009.
[5] Zdravko Markov and Ingrid Russel, "An Introduction to the Weka Data Mining System", Central Connecticut State University, University of Hartford.
[6] Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, 2005.
[7] Bianca Zadrozny, John Langford and Naoki Abe, "Cost-Sensitive Learning by Cost-Proportionate Example Weighting", IBM T. J. Watson Research Center, Yorktown Heights, NY 10598.
[8] Kai Ming Ting, "Cost-Sensitive Classification using Decision Trees, Boosting and MetaCost", School of Computing and Information Technology, Monash University, Churchill, Victoria 3842, Australia.
[9] Andrew P. Bradley, "The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms", Department of Electrical and Computer Engineering, The University of Queensland, Australia, 1996.
[10] Yoav Freund and Robert E. Schapire, "Experiments with a New Boosting Algorithm", in Proc. Thirteenth International Conference on Machine Learning, San Francisco, pp. 148-156, 1996.
[11] D. Aha and D. Kibler, "Instance-Based Learning Algorithms", Machine Learning, vol. 6, pp. 37-66, 1991.
