2012 45th Hawaii International Conference on System Sciences

A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication

Namhyoung Kim, Jaewook Lee
Department of Industrial and Management Engineering,
POSTECH, Pohang, South Korea
{skagud,jaewookl}@postech.ac.kr

Kyu-Hwan Jung
SK telecom, Seoul, South Korea
Onlyou7@postech.ac.kr

Yong Seog Kim
Department of Management Information Systems,
Utah State University, Logan, UT 84322, USA
yong.kim@usu.edu

978-0-7695-4525-7/12 $26.00 © 2012 IEEE
DOI 10.1109/HICSS.2012.74

Abstract
This paper explores the possible application of a single SVM classifier and its variants to the churner identification problem in the mobile telecommunication industry, in which the role of customer retention programs has become more important than ever due to its very competitive business environment. In particular, this study introduces a uniformly subsampled ensemble model of SVM classifiers combined with principal component analysis (PCA), not only to reduce the high dimensionality of the data but also to boost the reliability and accuracy of calibrated models on data sets with highly skewed class distributions. According to our experiments, the performance of the USE SVM model with PCA is superior to all compared models, and the number of principal components (PCs) affects the accuracy of ensemble models.

1. Introduction
The availability of cheap hard disk space and the expansion of data collection technologies empower many companies to easily monitor and visualize customers' daily purchase and usage patterns through online transaction processing (OLTP) databases [5]. Therefore, these days most companies have plenty of data. However, data itself is not information; data must be turned into information so that users can answer their own questions with the right information at the right time and in the right place.

In this paper, we consider an imaginary company in the mobile telecommunications industry that faces very steep competition and hence is compelled to capture, understand, and harness these customer-related data sets to seek new business opportunities with new customers and retain current customers through improved business operations. Note that many companies in the telecommunications industry have been suffering from extremely high churn rates, i.e., between 20% and 40% of customers leave their current service provider in a given year, mainly because their relatively homogeneous technologies and services drive them to compete on lower service charges. In such a case, the role of marketing becomes a key success factor. In particular, it is well known that it is much more profitable for a company to retain a current and loyal customer than to recruit a new customer, considering ever-increasing marketing costs. Micro or target marketing programs with tailored messages are much more cost effective than mass marketing programs through traditional marketing channels such as TV and newspapers. Therefore, it is strongly recommended that companies in a very competitive business environment operate their own customer relationship management (CRM) systems equipped with business intelligence and data mining tools to identify the group of customers who are most likely to terminate their relationships with their current service providers.

Note that churn identification and prevention is a critical issue because the mobile phone market has already reached a saturation point and each company strives to attract new subscribers while retaining current profitable customers [19]. To support such an effort, in this paper we introduce a micro marketing tool suited for churner identification on behalf of companies in the telecommunications industry. We first note that churn management should start with an accurate identification of churners, possibly coupled with detailed profiling of their demographic information and behavioral and transactional patterns. While

developing retention strategies and management practices targeted at identified likely churners would complete a churn management system, we limit our interest to developing a new SVM ensemble model that accurately identifies possible churners from their service usage patterns collected over a certain period.

The remainder of this paper is organized as follows. Section 2 describes the original data set and the preprocessing procedure. Then we introduce the Uniformly Subsampled Ensemble (USE) method in Section 3 and present experimental results in Section 4. Finally, Section 5 concludes this paper and provides suggestions for future research directions.

2. Data Description and Evaluation Metrics

2.1. Telecommunications Market Data

The data sets used in this paper are the customer records of a major wireless telecommunications company. They were provided by the Teradata Center for CRM at Duke University [9]. The data collection period is the second half of 2001. Active customers who had been with the company for at least 6 months were sampled. There are 171 original predictor variables and 100,000 samples. The predictors include four types of variables: demographics such as age, location, and number and ages of children; financial variables such as credit score and credit card ownership; product details such as handset price and handset capabilities; and phone usage information.

To predict churn, we first have to set the criterion for churn. We classified customers who left the company within the 60 days following sampling as churners. The actual ratio of churners in a given month is approximately 1.8%, but churners in the original training data set were oversampled to 50%. The test data set contains 51,036 observations with 924 churners, which represents the real churn rate of 1.8% per month.

Fig. 1 shows a plot of the training data set with two features selected by feature selection. We can see that the churners and non-churners overlap heavily.

Figure 1. Plot of training dataset

2.2. Data Preprocessing

Before applying the proposed method, we preprocessed the raw data as follows. First, we eliminated continuous variables with more than 20% missing values. Second, categorical variables with a high missing rate were also eliminated because each categorical variable has very little predictive power in general [17]. Also, if categorical variables are encoded into multiple binary variables, the dimensionality increases. Thus only 11 categorical variables were included; they are either indicator variables or count variables. Finally, we removed observations with missing values. After these preprocessing steps, we have 123 predictors: 11 categorical variables and 112 continuous variables. The training data set has 67,181 observations with 32,862 churners, a churn rate of approximately 49%. The test set has 34,986 observations with 619 churners, a churn rate of approximately 1.8%.

2.3. Evaluation
We used the hit rate as the evaluation metric for our research. The hit rate is a popular measure for numerically evaluating the predictive power of models in the marketing field [18]. The hit rate is calculated as

Hit rate = (1/n) Σ_{i=1}^{n} H_i    (1)

where H_i is 1 if the prediction is correct and 0 otherwise, and n represents the number of samples in the data set. In other words, the hit rate represents the percentage of correctly predicted churners among the churner candidates. The hit rate is associated with a target point. For example, the hit rate at a target point of x% is the hit rate when only the top x% of customers, ranked by their estimated churn probabilities, are considered for evaluation. Therefore, if we assume that we

have 10,000 observations, the hit rate at a target point of 10% is the percentage of correctly predicted churners among the 1,000 customers who are most likely to churn. Considering hit rates at target points is important because marketing managers have to focus only on the top percentage of customers due to limited budgets and time constraints. Thus our target point is 30%.

3. Proposed Ensemble Method

In this section, we present the structure of our new ensemble model, the USE, and describe its unique characteristics in terms of sampling and weighting schemes. Figure 2 graphically presents the structure of the USE model. The first step in building the USE is to partition the data set into subsets, each used to train a single corresponding classifier. Once a single classifier is calibrated to produce an estimated score (e.g., probability of churning) for each customer record in its partition, the USE ensemble model aggregates the scores of the classifiers and produces the final score of the ensemble model.

Figure 2. The structure of the proposed ensemble method

3.1. Weighting methods

To generate a collective decision, we consider several ways to aggregate the predictions of the trained classification models through various weighting schemes: uniform weights, weights based on classification performance, and weights based on hit rate. The simplest weighting scheme is the uniform weight method, which applies the same weight (1/M) to the predictions of all classifiers. Alternatively, the prediction of each individual classifier may be weighted depending on its binary classification performance or its hit rate on validation data sampled from the training data. In the weighting scheme based on classification performance, the classification accuracy of each classifier on the validation data is normalized to sum to 1, and the final prediction on the test data set is weighted according to this normalized weight. In the weighting scheme based on hit rate, the hit rates at 10%, 20%, and 30% are summed to measure performance and then normalized to sum to 1. The final prediction on the test data set is weighted according to this normalized weight as follows:

f(x) = Σ_{m=1}^{M} w_m f_m(x)    (4)

3.2. Bagging and Boosting vs. USE

To build an accurate ensemble model based on our proposed USE method, we divide the entire training data set into M equally sized non-overlapping subsamples using a random sampler. Consequently, any single classifier (e.g., an SVM classifier) can be calibrated on each subsampled data set to discover hidden patterns. Finally, the predictions of all classifiers are aggregated via a weighted summation to construct the final prediction of the ensemble model for each record in the test data. In this sense, the proposed USE method is very similar to two popular ensemble methods, namely bagging [2] and boosting [6], which are known to perform better than single classifiers [1], [3]. For example, ensemble models based on bagging train each classifier on a randomly drawn training set that consists of the same number of examples as the original training set, with each example equally likely to be drawn. Since samples are drawn with replacement, some examples may be selected multiple times whereas others may not be selected at all. Bagging combines the predictions of multiple classifiers by voting with equal weights. In short, the major differences between bagging and the proposed USE method are whether samples are drawn with replacement and whether the size of the sampled training set for each single classifier is equal to the size of the original training set.
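The sampling difference between bagging and USE described above can be sketched concretely. The following is a minimal NumPy sketch with illustrative sizes (n = 1,000 examples, M = 10 classifiers), not the authors' implementation: bagging draws n examples with replacement for each classifier, while USE splits one random permutation into M equally sized, non-overlapping subsets.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 1000, 10  # illustrative sizes, not the paper's actual data

# Bagging: each classifier sees n examples drawn WITH replacement,
# so subsets overlap and individual examples may repeat.
bag_subsets = [rng.integers(0, n, size=n) for _ in range(M)]

# USE: one random permutation split into M equally sized,
# non-overlapping subsets, each of size n/M.
perm = rng.permutation(n)
use_subsets = np.array_split(perm, M)

# USE subsets are disjoint and jointly cover all n examples.
assert sum(len(s) for s in use_subsets) == n
assert len(np.unique(np.concatenate(use_subsets))) == n

# A bagging subset typically contains duplicates, so it covers
# fewer than n distinct examples (roughly 63% of them on average).
print(len(np.unique(bag_subsets[0])))  # < n almost surely
```

Because each USE classifier trains on only n/M examples rather than n, it needs far less memory and CPU time than a bagged classifier, which is exactly the property exploited below for large-scale SVM training.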


Furthermore, our proposed USE method is different from the boosting [6] method, which produces a series of classifiers with each training set chosen based on the performance of the previous classifiers. Through adaptive resampling in boosting, examples that are incorrectly predicted by previous classifiers are sampled more frequently, whereas uniform subsampling without replacement is used in the USE. Overall, each classifier in the USE model is calibrated on a smaller training set than the classifiers in bagging and boosting, which requires less CPU power and main memory. The USE model can still reduce the expected prediction error of a single predictor. All three ensemble models (bagging, boosting, and USE) share a common characteristic: the effectiveness and improved accuracy of the ensemble model come primarily from the diversity caused by resampling training examples.

While it is perfectly reasonable to calibrate a single classifier on a sampled training set without further preprocessing, we also consider a data dimension reduction method, namely principal component analysis (PCA). Note that PCA is a mathematical procedure that transforms a set of correlated predictors into a set of new uncorrelated variables called principal components (PCs) that capture the maximum amount of variation in the data. Since the number of PCs is less than or equal to the number of original variables, and each PC is uncorrelated with the other PCs, PCA can be particularly useful for reducing the high dimensionality of data sets in which many input variables are correlated. Note that dimensionality reduction is accomplished by selecting fewer PCs than original input variables, and three methods have been widely used for determining the number of PCs. The first criterion, the "eigenvalue-one criterion" or Kaiser-Guttman criterion [8], selects all PCs with an eigenvalue greater than 1. The second approach is based on the "scree test" [4], which selects all PCs up to a definitive break between the sorted eigenvalues of the PCs. The last criterion retains components that exceed a specified proportion of variance in the data, where the proportion is calculated as follows:

Proportion = (eigenvalue of the component of interest) / (sum of all eigenvalues of the correlation matrix)

In the actual implementation of the USE model in this paper, we built an ensemble of SVM classifiers. The SVM classifiers are used to construct the ensemble model mainly because of their popularity among researchers, their superior performance compared with other classifiers [10], [11], and the authors' familiarity with them. On the other hand, our proposed USE method can be combined with any other classifier. In addition, SVM classifiers often require considerable computing power and show poor performance when they are applied to large-scale data [7], [12], [14], [15]. Therefore, the SVM classifier is a perfect candidate for testing the effectiveness of the USE method through data subsampling when the aim is to reduce the requirement for high computing power.

4. Experimental results

In this section, we present the process and results of applying the proposed Uniformly Subsampled Ensemble SVM to the telecommunications market data. Fig. 3 shows the correlation matrix of the variables. We can see that there are high correlations among features, which supports the need for extracting uncorrelated new features.

Figure 3. The correlation matrix with values higher than 0.5

We applied PCA for data dimension reduction. As described in Section 3, there are several methods for selecting the number of PCs. Among them we considered the three approaches that are most commonly used.

Figure 4. Plot of Eigenvalues

Fig. 4 is the plot of eigenvalues. The numbers of PCs from each approach are as follows:
The eigenvalue-one criterion [8]: 27 PCs
Scree test [4]: 4 PCs
Proportion of variance accounted for: 36 PCs (90%), 48 PCs (95%)

We applied the proposed method with the above numbers of PCs and compared their hit rates. The results for the different numbers of PCs are presented in Fig. 5. As shown in the graph, the hit rate at 30% is highest when 48 PCs are used, and it tends to increase as the number of PCs increases. Thus 48 PCs are selected in this study.

Figure 5. Effect of Number of PCs

After choosing the number of PCs, the optimal number of SVMs, M, should be considered. We explored the effect of the number of classifiers on the predictive accuracy while the number of PCs was fixed at 48. The training data set is divided into M groups by a random sampler. The hit rate at 10% is highest when M is 49, i.e., 49 SVMs, but 25 SVMs show a better hit rate at 30%. Thus our final optimal model is an ensemble of 25 SVMs with 48 PCs.

Figure 6. Effect of Number of Classifiers

We also analyzed the effect of the weighting methods. Fig. 7 presents cumulative hit rates for the different weighting methods; "PCA" in the graph denotes the uniform weight method. As shown in the figure, the weighting methods do not greatly affect the performance. However, the uniform weight method is easy to apply and performs slightly better than the other methods. Thus we decided to apply the proposed method with the uniform weight method.

Figure 7. Effect of weighting methods
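The evaluation used throughout these comparisons, the hit rate at a target point, together with the uniform 1/M aggregation of Eq. (4), can be sketched as follows. The churn labels and member scores below are synthetic stand-ins (the 1.8% churn rate mirrors the test set, but the score model is a toy assumption), not the paper's actual predictions.

```python
import numpy as np

def hit_rate_at(scores, y_true, target=0.30):
    """Hit rate at a target point: the fraction of actual churners
    among the top `target` fraction of customers ranked by score."""
    n_top = int(len(scores) * target)
    top = np.argsort(scores)[::-1][:n_top]
    return float(y_true[top].mean())

rng = np.random.default_rng(2)
y = (rng.random(10_000) < 0.018).astype(int)  # ~1.8% churners, as in the test set

# Hypothetical scores from M = 25 member classifiers; churners are
# shifted upward so the toy scores carry some signal.
M = 25
member_scores = rng.random((M, 10_000)) + 0.5 * y

# Uniform weighting: each member contributes weight 1/M (Eq. (4)).
ensemble_score = member_scores.mean(axis=0)

print(hit_rate_at(ensemble_score, y, 0.30))  # well above the 1.8% base rate
```

Ranking by the aggregated score and cutting at the target point reproduces the marketing setting described in Section 2.3: only the top 30% of customers are contacted, so only their predictions matter.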


Figure 8. Gain by PCA and Ensemble

To explore how much the proposed PCA and ensemble model contribute to the increase in performance compared to a single SVM, we compared the performance of five models: USE SVM + PCA, USE SVM, single SVM + PCA, single SVM, and a random model. In Fig. 8 the hit rates are noticeably improved in both cases of USE SVM and USE SVM + PCA. When a single SVM with PCA is used, the performance increases only slightly compared with the results of the previous two methods.

Figure 9. Comparison with other methods

The performance of the proposed method was also compared with the performance of other classifiers. The given data set is large-scale and highly imbalanced: only 1.8% of observations are churners. Thus simple conventional methods will not work properly. In previous studies, the Partial Least Squares (PLS) model and the logistic model, popular models in the marketing area, have been proposed to solve this problem [13]. We also applied an ensemble multi-SVDD (Support Vector Domain Description) model to our problem. Fig. 9 presents the hit rates of five models: USE SVM + PCA, Ensemble Multi-SVDD, SVDD, the logistic model, and a random model. The proposed USE SVM + PCA outperformed the other methods, and it shows a larger performance improvement at low proportions. As mentioned before, the hit rate at a low proportion is a more important measure than that at a high proportion. This shows that the proposed USE method outperforms the conventional methods not only theoretically but also practically.

5. Conclusions

In this paper, we proposed the Uniformly Subsampled Ensemble (USE) method for churn management. We showed that the USE SVM enhances churn prediction performance. New features were extracted using PCA. We also investigated the effect of the number of classifiers and principal components and gave a guideline for selecting them. Different aggregation methods were also considered, but they did not affect the results much. The performance of the USE SVM proposed in this research is superior to all compared models.

For further research, an ensemble of heterogeneous classifiers can be considered. In the proposed methodology, only a single type of classifier, the SVM model, is used for prediction, but other heterogeneous classifiers can be calibrated. The effect of the distribution of labels can also be analyzed, in addition to the effects of the number of classifiers and PCs.

6. References
[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[3] L. Breiman. Stacked regressions. Machine Learning, 24(1):49-64, 1996.
[4] R. B. Cattell. The scree test for the number of factors. Multivariate Behavioral Research, 1:245-276, 1966.
[5] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26:65-74, March 1997.
[6] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proc. of 13th Int'l Conf. on Machine Learning, pages 148-156, Bari, Italy, 1996.
[7] K.-H. Jung, D. Lee, and J. Lee. Fast support-based clustering method for large-scale problems. Pattern Recognition, 43:1975-1983, 2010.


[8] H. Kaiser. The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20:141-151, 1960.
[9] Y. Kim. Toward a successful CRM: Variable selection, sampling, and ensemble. Decision Support Systems, 41(2):542-553, 2006.
[10] D. Lee and J. Lee. Domain described support vector classifier for multi-class classification problems. Pattern Recognition, 40:41-51, 2007.
[11] D. Lee and J. Lee. Equilibrium-based support vector machine for semisupervised classification. IEEE Trans. on Neural Networks, 18(2):578-583, 2007.
[12] D. Lee and J. Lee. Dynamic dissimilarity measure for support-based clustering. IEEE Trans. on Knowledge and Data Engineering, 22(6):900-905, 2010.
[13] H. Lee, Y. Kim, Y. Lee, and H. Cho. Toward optimal churn management: A partial least square (PLS) model. In Proc. of 16th AMCIS, Paper 78, pages 1-10, 2010.
[14] J. Lee and D. Lee. An improved cluster labeling method for support vector clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3):461-464, March 2005.
[15] J. Lee and D. Lee. Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(11):1869-1874, November 2006.
[16] S. Rosset, E. Neumann, U. Eick, N. Vatnik, and I. Idan. Evaluation of prediction models for marketing campaigns. In Proc. of 7th Int'l Conf. on Knowledge Discovery & Data Mining (KDD-01), pages 456-461, 2001.
[17] P. E. Rossi, R. McCulloch, and G. Allenby. The value of household information in target marketing. Marketing Science, 15(3):321-340, 1996.
[18] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos. Conceptual modeling for ETL processes. In Proc. of the 5th ACM Int'l Workshop on Data Warehousing and OLAP (DOLAP '02), pages 14-21, New York, NY, USA, 2002. ACM.
[19] L. Wright. The CRM imperative: Practice vs theory in the telecommunications industry. The Journal of Database Marketing, 9:339-349, 2002.

