
Customer Lifetime Value Measurement using Machine Learning Techniques

Tarun Rathi
Mathematics and Computing
Department of Mathematics
Indian Institute of Technology (IIT), Kharagpur -721302
08MA2027@iitkgp.ac.in

Project guide: Dr. V Ravi


Associate Professor, IDRBT
Institute of Development and Research in Banking Technology (IDRBT)
Road No. 1, Castle Hills, Masab Tank, Hyderabad 500 057
http://www.idrbt.ac.in/

July 8, 2011

Certificate
Date: July 8, 2011

This is to certify that the project report entitled Customer Lifetime Value
Measurement using Machine Learning Techniques, submitted by Mr. TARUN
RATHI, 3rd year student in the Department of Mathematics, enrolled in its
5-year integrated MSc course in Mathematics and Computing, Indian Institute of
Technology, Kharagpur, is a record of bona fide work carried out by him under
my guidance during the period May 6, 2011 to July 8, 2011 at the Institute for
Development and Research in Banking Technology (IDRBT), Hyderabad.
The project work is a research study which has been successfully completed as
per the set objectives. I observed Mr. TARUN RATHI to be sincere, hardworking
and to have the capability and aptitude for independent research work.

I wish him every success in his life.

Dr. V Ravi
Associate Professor, IDRBT
Supervisor

Declaration by the candidate


I declare that the summer internship project report entitled Customer
Lifetime Value Measurement using Machine Learning Techniques is my own
work, conducted under the supervision of Dr. V Ravi at the Institute of
Development and Research in Banking Technology, Hyderabad. I have put in 64
days of attendance with my supervisor at IDRBT and was awarded a project
fellowship.
I further declare that, to the best of my knowledge, the report does not contain
any part of any work which has been submitted for the award of any degree
either by this institute or by any other university without proper citation.

Tarun Rathi
III yr. Undergraduate Student
Department of Mathematics
IIT Kharagpur
July 8, 2011

Acknowledgement

I would like to thank Mr B. Sambamurthy, Director of IDRBT, for giving me this
opportunity.
I gratefully acknowledge the guidance of Dr. V. Ravi, who helped me sort
out all my problems in concept clarification and without whose support the
project would not have reached its present state. I would also like to thank Mr.
Naveen Nekuri for his guidance and sincere help in understanding important
concepts and in the development of the WNN software.

Tarun Rathi
III yr. Undergraduate Student
Department of Mathematics
IIT Kharagpur
July 8, 2011

Abstract: Customer Lifetime Value (CLV) is an important metric in relationship marketing
approaches. There have always been traditional techniques like Recency, Frequency and
Monetary value (RFM), Past Customer Value (PCV) and Share-of-Wallet (SOW) for
segregating customers into good or bad, but these are not adequate, as they only
segment customers based on their past contribution. CLV, on the other hand, calculates the
future value of a customer over his or her entire lifetime, which means it takes into account
the prospect of a bad customer becoming good in the future and hence profitable for a company or
organisation. In this paper, we review the various models and different techniques used in
the measurement of CLV. Towards the end we compare various machine learning
techniques, namely Classification and Regression Trees (CART), Support Vector
Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer
Perceptron (MLP) and Wavelet Neural Network (WNN), for the calculation of CLV.

Keywords: Customer lifetime value (CLV), RFM, Share-of-Wallet (SOW), Past Customer
Value (PCV), machine learning techniques, data mining, Support Vector Machines,
Sequential Minimal Optimization (SMO), Additive Regression, K-Star method, Artificial
Neural Networks (ANN), Multilayer Perceptron (MLP), Wavelet Neural Network (WNN).

Contents

Certificate
Declaration by the candidate
Acknowledgement
Abstract
1. Introduction
2. Literature Review
   2.1 Aggregate Approach
   2.2 Individual Approach
   2.3 Models and Techniques to calculate CLV
       2.3.1 RFM Models
       2.3.2 Computer Science and Stochastic Models
       2.3.3 Growth/Diffusion Models
       2.3.4 Econometric Models
       2.3.5 Some other Modelling Approaches
3. Estimating Future Customer Value using Machine Learning Techniques
   3.1 Data Description
   3.2 Models and Software Used
       3.2.1 SVM
       3.2.2 Additive Regression and K-Star
       3.2.3 MLP
       3.2.4 WNN
       3.2.5 CART
4. Results and Comparison of Models
5. Conclusion and Directions of future research
References

1. Introduction: Customer Lifetime Value has become a very important metric in Customer
Relationship Management. Various firms are increasingly relying on CLV to manage and
measure their business. CLV is a disaggregate metric that can be used to identify customers who
can be profitable in the future and hence can be used to allocate resources accordingly (Kumar and
Reinartz, 2006). Besides, the CLV of current and future customers is also a good measure of the
overall value of a firm (Gupta, Lehmann and Stuart, 2004).
There have been other measures as well which are fairly good indicators of customer
loyalty, like Recency, Frequency and Monetary value (RFM), Past Customer Value (PCV) and
Share-of-Wallet (SOW). The customers who are more recent and have a high frequency and
total monetary contribution are said to be the best customers in this approach. However, it
is possible that a star customer of today may not be the same tomorrow. Malthouse and
Blattberg (2005) have given examples of customers who are good at a certain point but not
later, and of a bad customer turning good after a change of job. Past Customer
Value (PCV), on the other hand, calculates the total previous contribution of a customer
adjusted for the time value of money. Again, PCV does not take into account the possibility
of a customer being active in the future (V. Kumar, 2007). Share-of-Wallet is another metric of
customer loyalty which takes into account the brand preference of a customer: it
measures the amount that a customer spends on a particular brand against other brands.
However, it is not always possible to get the details of a customer's spending on other brands,
which makes the calculation of SOW a difficult task. A common disadvantage which these
models share is the inability to look forward; they do not consider the prospect of
a customer being active in the future. The calculation of the probability of a customer being
active in the future is a very important part of CLV calculation, and it differentiates CLV
from these traditional metrics of customer loyalty. It is very important for a firm
to know whether a customer will continue his relationship with it in the future or not. CLV
helps firms to understand the future behaviour of a customer and thus enables them to
allocate their resources accordingly.
Customer Lifetime Value is defined as the present value of all future profits obtained
from a customer over his or her entire lifetime of relationship with the firm (Berger and
Nasr, 1998). A very basic model to calculate the CLV of a customer is (V. Kumar, 2007):

CLV_i = \sum_{t=1}^{T} \frac{CM_{i,t}}{(1+d)^{t}}

where,
i is the customer index,
t is the time index,
T is the number of time periods considered for estimating CLV,
CM_{i,t} is the contribution margin of customer i in period t, and
d is the discount rate.
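A minimal sketch of this basic formula in Python is given below; the cash-flow values and the discount rate are illustrative assumptions, not figures from this report.

# Sketch of the basic CLV formula: discounted sum of per-period contribution margins.
def clv(contribution_margins, discount_rate):
    """Present value of a customer's future contribution margins.

    contribution_margins[t-1] is the margin expected in period t (t = 1..T).
    """
    return sum(
        cm / (1.0 + discount_rate) ** t
        for t, cm in enumerate(contribution_margins, start=1)
    )

# Example: four half-yearly margins discounted at 10% per period.
print(round(clv([1200.0, 900.0, 1500.0, 800.0], 0.10), 2))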

There are various models to calculate the CLV of a customer or a cohort of
customers, depending on the amount of data available and the type of company. V. Kumar
(2007) has described an individual-level approach and an aggregate-level approach to calculating CLV.
He has linked CLV to Customer Equity (CE), which is nothing but the average CLV of a cohort
of customers. Dwyer (1997) used a customer migration model to take into account the
repeat purchase behaviour of customers. Various behaviour-based models like logit models
and multivariate probit models have also been used (Donkers, Verhoef and Jong, 2007), and
models which take into account the relationship between various components of CLV, like
customer acquisition and retention, are also used (Thomas, 2001). We will present some of
the most used models to calculate CLV in the later part of the paper. Besides this, there are
various techniques that are used to calculate CLV or the parameters needed to
calculate it. Aeron, Kumar and Janakiraman (2010) have presented various parameters
that may be useful in the calculation of CLV, which include acquisition rate, retention rate,
add-on selling rate, purchase probability, purchase amount, discount rate, referral rate and
cost factor. However, all of these parameters may not be required in a single model. Various
researchers have used different techniques to calculate these parameters. Hansotia and Wang
(1997) used logistic regression, Malthouse and Blattberg (2005) used linear regression for
predicting future cash flows, Dries and Van den Poel (2009) used quantile regression, and
Haenlein et al. (2007) used CART and a Markov chain model to calculate CLV. An
overview of various data mining techniques used to calculate the parameters for CLV has
been compiled by Aeron, Kumar and Janakiraman (2010). Besides this, many researchers
also use models like Pareto/NBD, BG/NBD, MBG/NBD, CBG/NBD, probit, tobit, ARIMA,
support vector machines and Kohonen networks to calculate CLV. Malthouse (2009)
presents a list of the methods used by academicians and researchers who participated in
the Lifetime Value and Customer Equity Modelling Competition.
Most of the above mentioned models are used either to calculate the variables used
to predict CLV or to find a relationship between them. In our research, we have used several
non-linear techniques, namely Classification and Regression Trees (CART), Support Vector
Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer
Perceptron (MLP) and Wavelet Neural Network (WNN), to calculate CLV; these techniques take
care of the relationship between the variables which act as inputs in the prediction of CLV.
Further, we compare these techniques to find the best-fitting model for the
dataset we used. Later on we draw conclusions and discuss areas of future research.
2. Literature Review: Before going into the details of the various models of CLV, let us first have
a look at the various approaches designed for calculating CLV. CLV calculation can broadly be classified
in two ways: a) the aggregate approach and b) the individual approach.
2.1 Aggregate Approach: This approach revolves around calculating the Customer Equity (CE) of
a firm. Customer Equity is nothing but the average CLV of a cohort of customers. Various
researchers have devised different ways to calculate the CE of a firm. Gupta, Lehmann and Stuart
(2004) calculated CE by summing up the CLV of all customers and taking the
average. Berger and Nasr (1998) calculated CLV from the lifetime value of a customer
segment, taking into account the rate of retention and the average acquisition cost
per customer:

\text{Avg. CLV} = GC \sum_{t=1}^{T} \frac{r^{t}}{(1+d)^{t}} - A

Here, GC = expected gross contribution per customer per period,
r = rate of retention,
d = discount rate, and
A = avg. acquisition cost per customer.

Kumar and Reinartz (2006) gave a formula for calculating the retention rate of a customer
segment as follows:

\text{Retention rate (\%)} = \frac{\text{No. of customers in the segment buying in period } t}{\text{No. of customers in the segment buying in period } t-1} \times 100

Projecting the retention rate: for longer horizons, the retention rate itself has to be projected.
Kumar and Reinartz (2006) model the predicted retention rate as

R_{t} = R_{c}\left(1 - e^{-rt}\right)

Here, R_{t} = predicted retention rate for a given period t in the future,
R_{c} = maximum attainable retention rate, given by the firm, and
r = coefficient of retention, calculated as

r = \frac{1}{t}\left[\ln(R_{c}) - \ln(R_{c} - R_{t})\right]

This model is good enough for calculating the CLV of a segment of customers over a small
period of time; however, the fluctuation of the retention rate and of the gross contribution margin
needs to be taken care of while projecting CLV over longer periods. Taking this into account,
they proposed another model in which the profit function over time is calculated separately.
This model is given as:

CLV = \sum_{t=1}^{T} \pi(t)\,\frac{r^{t}}{(1+d)^{t}}

where \pi(t) is the profit function over time, r the retention rate and d the discount rate.
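As a numerical illustration of the projection formulas (the figures are assumed, not taken from Kumar and Reinartz): suppose the maximum attainable retention rate is R_c = 0.80 and the observed retention rate after t = 2 years is R_2 = 0.60. Then r = (1/2)[\ln(0.80) - \ln(0.80 - 0.60)] = \ln(4)/2 \approx 0.693, and the projected retention rate after four years is R_4 = 0.80(1 - e^{-0.693 \times 4}) = 0.80(1 - 1/16) = 0.75.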

Blattberg, Getz and Thomas (2001) calculated average CLV or CE as the sum of the return on
acquisition, the return on retention and the return on add-on selling across the entire customer
base. They summarized the formula as:

CE(t) = \sum_{i=1}^{I} \Big[ N_{i,t}\,\alpha_{i,t}\,(S_{i,t} - c_{i,t}) - N_{i,t}\,B_{i,a,t} \Big]
      + \sum_{i=1}^{I} \sum_{k=t+1}^{\infty} N_{i,t}\,\alpha_{i,t}\,\rho_{i,k}\,(S_{i,k} - c_{i,k} - B_{i,r,k} - B_{i,AO,k})\left(\frac{1}{1+d}\right)^{k-t}

where,
CE(t) is the customer equity value for customers acquired at time t,
N_{i,t} is the number of potential customers at time t for segment i,
\alpha_{i,t} is the acquisition probability at time t for segment i,
\rho_{i,t} is the retention probability at time t for a customer in segment i,
B_{i,a,t} is the marketing cost per prospect (N) for acquiring customers for segment i,
B_{i,r,t} is the marketing cost in time period t for retained customers in segment i,
B_{i,AO,t} is the marketing cost in time period t for add-on selling in segment i,
d is the discount rate,
S_{i,t} is the sales of the products/services offered by the firm at time t for segment i,
c_{i,t} is the cost of goods at time t for segment i,
I is the number of segments,
i is the segment designation and
t_{0} is the initial time period.

Rust, Lemon and Zeithaml (2004) used a CLV model in which they considered the case
where a customer switches between different brands. However, in using this model, one
needs a customer base which provides information about previous brands purchased, the
probability of purchasing different brands, etc. Here the CLV of customer i to brand j is given as:

CLV_{ij} = \sum_{t=1}^{T_{ij}} \frac{1}{(1+d_{j})^{t/f_{i}}}\; v_{ijt}\, \pi_{ijt}\, B_{ijt}

where,
T_{ij} is the number of purchases customer i makes during the specified time period,
d_{j} is firm j's discount rate,
f_{i} is the average number of purchases customer i makes in a unit time (e.g. per year),
v_{ijt} is customer i's expected purchase volume of brand j in purchase t,
\pi_{ijt} is the expected contribution margin per unit of brand j from customer i in purchase t, and
B_{ijt} is the probability that customer i buys brand j in purchase t.

The Customer Equity (CE) of firm j is then calculated as the mean CLV of all customers across
all firms multiplied by the total number of customers in the market across all brands.
2.2 Individual Approach: In this approach, CLV is calculated for an individual customer as
the sum of the cumulated cash flows of that customer over his or her entire lifetime, discounted
using the weighted average cost of capital (WACC) (Kumar and George, 2007). The CLV in this case
depends on the activity of the customer, i.e. his expected number of purchases during the
prediction time period, and also on his expected contribution margin. The basic formula for CLV
in this approach is:

CLV_{i} = \sum_{t=1}^{T} P(\text{Active})_{i,t} \times GC_{i,t}

where GC_{i,t} is the gross contribution margin for customer i in period t and
P(\text{Active})_{i,t} is the probability that customer i is active in period t.

This approach brings to light the need for calculating the probability of a customer being
active, or P(Active). There are various ways to calculate P(Active). V. Kumar (2007) calculates
P(Active) as:

P(\text{Active}) = \left(\frac{T}{N}\right)^{n}

where,
n is the number of purchases in the observation period,
T is the time elapsed between acquisition and the most recent purchase, and
N is the time elapsed between acquisition and the period for which P(Active) needs to be
calculated.
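For illustration (with assumed numbers): a customer who made n = 4 purchases in the observation period, whose most recent purchase came T = 12 months after acquisition, and who is evaluated N = 18 months after acquisition, has P(Active) = (12/18)^4 = (2/3)^4 \approx 0.20, i.e. roughly a 20% chance of still being active.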
This model, however, is quite trivial. Several researchers have used statistically advanced
methods to calculate P(Active) or the expected frequency of purchase. Most of them have
also taken into account other factors like channel communication, recency of purchase,
customer characteristics, switching costs, first contribution margin, etc. to make the
predictions more accurate.
Venkatesan and Kumar (2004), in their approach to calculating CLV, predicted each customer's
purchase frequency based on past purchases. The CLV function in this case is
represented as:

CLV_{i} = \sum_{y=1}^{T_{i}} \frac{CM_{i,y}}{(1+r)^{y/\text{frequency}_{i}}} - \sum_{l=1}^{n} \frac{\sum_{m} c_{i,m,l} \times x_{i,m,l}}{(1+r)^{l}}

where,
CLV_{i} is the lifetime value of customer i,
CM_{i,y} is the contribution margin from customer i in purchase occasion y,
r is the discount rate,
c_{i,m,l} is the unit marketing cost for customer i in channel m in year l,
x_{i,m,l} is the number of contacts to customer i in channel m in year l,
frequency_{i} is the predicted purchase frequency for customer i,
n is the number of years to forecast, and
T_{i} is the predicted number of purchases made by customer i until the end of the planning
period.
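A minimal Python sketch of this two-part calculation is given below; the function name, data structures and sample numbers are illustrative assumptions rather than the original implementation.

# Sketch of a Venkatesan and Kumar (2004) style CLV: discounted contribution
# margins over predicted purchases minus discounted marketing contact costs.
def clv_vk(margins, frequency, contact_costs, discount_rate):
    """margins[y-1]: contribution margin of purchase occasion y (y = 1..T_i).
    frequency: predicted number of purchases per year for the customer.
    contact_costs[l-1]: dict {channel: (unit_cost, num_contacts)} for year l.
    """
    gross = sum(
        cm / (1.0 + discount_rate) ** (y / frequency)
        for y, cm in enumerate(margins, start=1)
    )
    costs = sum(
        sum(c * x for c, x in per_channel.values()) / (1.0 + discount_rate) ** l
        for l, per_channel in enumerate(contact_costs, start=1)
    )
    return gross - costs

# Example: 3 predicted purchases, 2 purchases/year, 2 years of contact plans.
print(clv_vk([400.0, 350.0, 300.0], 2.0,
             [{"mail": (5.0, 4), "phone": (12.0, 1)},
              {"mail": (5.0, 2)}], 0.15))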
Besides this, there have been various other models and techniques which calculate
P(Active) or the expected frequency of purchase, including Pareto/NBD, BG/NBD, MBG/NBD,
CBG/NBD, probit, tobit, the generalized gamma distribution, the log-normal distribution, etc.
Various researchers and academicians who participated in the 2008 DMEF CLV Modelling
Competition used some of these models to calculate CLV. We will come to know
more about these in the next part of the paper, when we study the various models and
techniques used by researchers to calculate the parameters of CLV or CLV itself.
As we have seen, there are various aggregate and disaggregate approaches to calculate CLV.
The obvious question one comes across is which model to use. Kumar and George
(2007) have given a detailed discussion comparing these models. They observed
that an aggregate approach performs poorly in terms of time to implement and expected
benefits, while a disaggregate approach has higher data requirements and more metrics to
track. They have also concluded that the model selection should depend on the requirements
of the firm and on which criteria it gives more importance to in comparison with others. For
example, one firm may consider the cost involved as an important factor while another may
consider expected profits as a major factor. Kumar and George (2007) have
also proposed an integrated or hybrid approach to calculate CLV. In this approach, an
appropriate method is adopted depending on the data available. If the
firm's transaction data and firm-customer interaction data are available, then the individual
approach of Venkatesan and Kumar (2004) is adopted. If this data is not available but
segment-level data is available, then the Blattberg, Getz and Thomas (2001) approach is
adopted. If size-of-wallet information is not available from transactions but survey data is
available, then the Rust, Lemon and Zeithaml (2004) approach is adopted.


2.3 Models and Techniques to calculate CLV: There are various models to calculate CLV.
Most of them calculate the parameters needed to measure CLV using different models and
then combine these into a new method to calculate CLV. For example, Fader, Hardie and
Lee (2005) captured recency and frequency in one model to calculate the expected
number of purchases and built another model to calculate the monetary value. Reinartz,
Thomas and Kumar (2005) captured customer acquisition and retention simultaneously.
Gupta et al. (2006) have given a good review on modelling CLV. We will use some of
their modelling categories in this paper with more examples and discussion.
2.3.1 RFM Models: RFM models have been used in direct marketing for more than 30
years. These types of models are the most common in industry because of their ease of use.
They are based on three pieces of information about customers, i.e. their
recency, frequency and monetary contribution. Fader, Hardie and Lee (2005) have shown
that RFM variables can be used to build a CLV model and that RFM variables are sufficient
statistics for their CLV model. We now briefly present two RFM-based models used to
determine CLV.
Weighted RFM Model: Mahboubeh Khajvand and Mohammad Jafar Tarokh (2010)
presented this model for estimating customer future value based on data from an
Iranian bank. From the raw data they calculated the recency, frequency and monetary
value of each customer. Using clustering techniques such as K-means clustering, they segmented
the data into groups and calculated the CLV for each cluster using the following formula:

CLV_{c} = R_{c} \times w_{R} + F_{c} \times w_{F} + M_{c} \times w_{M}

where w_{R}, w_{F} and w_{M} are the weights of recency, frequency and monetary value,
obtained by the AHP method based on expert opinion, and R_{c}, F_{c}, M_{c} are the
(normalized) recency, frequency and monetary values of cluster c.
The key limit to this modelling approach is that it is a scoring model rather than a CLV model:
it divides customers into various segments and then calculates a score for each segment, but
it does not actually provide a dollar value for each customer. To overcome this,
Mahboubeh Khajvand and Mohammad Jafar Tarokh (2010) proposed a multiplicative
seasonal ARIMA (Auto-Regressive Integrated Moving Average) method to calculate CLV,
which is a time series prediction method. The multiplicative seasonal ARIMA(p,d,q)x(P,D,Q)s
model, where
p = order of the auto regressive process,
d = order of the differencing operator,
q = order of the moving average process,

P = order of the seasonal auto regressive process,
D = order of the seasonal differencing operator, and
Q = order of the seasonal moving average process,
can be represented as:

\Phi_{P}(B^{s})\,\phi_{p}(B)\,\nabla_{s}^{D}\,\nabla^{d} x_{t} = \Theta_{Q}(B^{s})\,\theta_{q}(B)\,w_{t}

where,
\phi_{p}(B) is the auto regressive operator,
\theta_{q}(B) is the moving average operator,
\nabla^{d} is the d-fold differencing operator, used to change a nonstationary time series into
a stationary one,
\Theta_{Q}(B^{s}) is the seasonal moving average operator and
\nabla_{s}^{D} is the D-fold seasonal differencing operator.

The main limitation of this model was that, due to lack of data, it predicted the future value of
customers in the next interval only.
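For readers who want to experiment with this kind of model, a minimal sketch using the statsmodels library is given below; the series values, the (p,d,q)x(P,D,Q)s orders and the seasonal period of 4 are illustrative assumptions, not the settings used by Khajvand and Tarokh.

# Sketch: fitting a multiplicative seasonal ARIMA model to a customer segment's
# per-period contribution series and forecasting the next interval only.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical per-period contribution margins for one customer segment.
series = pd.Series([1200.0, 950.0, 1400.0, 1100.0, 1500.0, 1250.0, 1650.0, 1300.0,
                    1700.0, 1400.0, 1800.0, 1450.0, 1900.0, 1500.0, 2000.0, 1600.0])

# Illustrative orders: non-seasonal (p,d,q) = (1,1,1), seasonal (P,D,Q)s = (1,1,1)x4.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 4))
fitted = model.fit(disp=False)

# Forecast only the next interval, mirroring the single-step prediction above.
print(fitted.forecast(steps=1))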
RFM and CLV using iso-value curves: Fader, Hardie and Lee (2005) proposed this model to
calculate CLV. They showed that no information other than the RFM characteristics is
required to formulate the model. Further, they used the "lost for good" approach, which
means that customers who leave the relationship with a firm never come back. It is also
assumed that M is independent of R and F. This suggests that the value per transaction can
be factored out, so we can forecast the flow of future transactions and then rescale this
number of discounted expected transactions (DET) by a monetary value (a multiplier) to
yield a dollar number for each customer. This model is formulated as:
CLV = margin x revenue/transaction x DET
The calculation of DET is the most important part of this model. Fader, Hardie and Lee
(2005) first calculated DET for a customer with observed behaviour (X = x, t_x, T) as the
discounted sum of the expected number of transactions in each future period:

DET = \sum_{t} \frac{E[\text{number of transactions in period } t \mid X = x, t_x, T]}{(1+d)^{t}}

Here, the numerator is the expected number of transactions in period t and d is the discount
rate. However, according to Blattberg, Getz and Thomas (2001) this calculation of CLV has
the following problems: a) we don't know the time horizon over which to project the sales,
b) what time periods to measure, and c) the expression ignores the specific timing of
transactions. Hence they used the Pareto/NBD model with a continuous-time formulation
instead of the discrete-time formulation to compute DET (and thus CLV) over an infinite
time horizon. The DET is thus calculated as:

DET(\delta \mid r, \alpha, s, \beta, X = x, t_x, T) =
    \frac{\alpha^{r}\,\beta^{s}\,\delta^{s-1}\,\Gamma(r+x+1)\,\Psi\!\left(s, s; \delta(\beta+T)\right)}
         {\Gamma(r)\,(\alpha+T)^{r+x+1}\,L(r, \alpha, s, \beta \mid X = x, t_x, T)}

where r, \alpha, s, \beta are the Pareto/NBD parameters, \delta is the (continuously compounded)
discount rate, \Psi(\cdot) is the confluent hypergeometric function of the second kind and
L(\cdot) is the Pareto/NBD likelihood function. Now they added a general model of monetary
value to arrive at a dollar value of CLV, assuming that a customer's individual transaction
values vary around his or her average transaction value. After checking various distributions,
they found that the gamma distribution best fitted their data and hence calculated the
expected average transaction value for a customer with a given average observed spend
across x transactions. This expected monetary value, multiplied by DET, gave the CLV of the
customer. Following this, various graphs, also called iso-value curves, were drawn to identify
customers with different purchase histories but similar CLVs, such as CLV vs. frequency, CLV vs.
recency and CLV vs. frequency and recency. The key limitations of this model are that it is
based on a non-contractual purchase setting and that it is not immediately clear which
distributions should be used for transaction incidence and transaction size.
2.3.2 Computer Science and Stochastic Models: These types of models are primarily based
on data mining, machine learning, non-parametric statistics and other approaches that
emphasize predictive ability. They include neural network models, projection-pursuit
models, decision tree models and spline-based models (Generalized Additive Models (GAM),
Classification and Regression Trees (CART), Support Vector Machines (SVM), etc.). Various
researchers have used these techniques to calculate CLV. Haenlein et al. (2007) used a model
based on CART and first-order Markov chains to calculate CLV, using data from a retail bank.
First of all, they determined the various profitability drivers as predictor variables and, together
with target variables, used them in a CART analysis to build a regression tree. This tree helped
them to cluster the customer base into a set of homogeneous sub-groups. They used these
sub-groups as discrete states and estimated a transition matrix which describes movements
between them, using Markov chains. To estimate the corresponding transition probabilities,
they determined the state each customer belonged to at the beginning and end of a predefined
time interval T by using the decision rules resulting from the CART analysis. In the final step,
the CLV of each customer group was determined as the discounted sum of state-dependent
contribution margins, weighted with their corresponding transition probabilities:

CLV_{i} = \sum_{t=0}^{T} \frac{p_{t} \times CM_{i}}{(1+d)^{t}}

where,
p_{t} is the probability of transition from one state to another,
CM_{i} is the contribution margin for customer i and
d is the discount rate.

Finally, the CLVs of the customer segments were studied in order to carry out marketing
strategies for each segment. This model, however, has some limitations too. It was assumed
that client behaviour follows a first-order Markov process, which does not take into
account the behaviour of earlier periods, rendering it insignificant. It was also assumed
that the transition matrix is stable and constant over time, which seems inappropriate for
long-term forecasts, and the possibility of brand switching in customer behaviour is not
taken into account.
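A small numerical sketch of this CART-plus-Markov idea is given below; the transition matrix, the state margins and the discount rate are made-up illustrations, not figures from Haenlein et al. (2007).

# Sketch: discounted sum of state-dependent contribution margins under a
# first-order Markov chain over customer states (e.g. CART leaf segments).
import numpy as np

P = np.array([[0.7, 0.2, 0.1],    # hypothetical transition matrix between
              [0.3, 0.5, 0.2],    # three customer states
              [0.1, 0.3, 0.6]])
cm = np.array([50.0, 120.0, 300.0])   # contribution margin per state
d = 0.10                               # discount rate
T = 10                                 # planning horizon in periods

def state_clv(P, cm, d, T):
    """Expected discounted margin for a customer starting in each state."""
    value = np.zeros(len(cm))
    state_dist = np.eye(len(cm))       # row i = distribution when starting in state i
    for t in range(1, T + 1):
        state_dist = state_dist @ P
        value += (state_dist @ cm) / (1.0 + d) ** t
    return value

print(np.round(state_clv(P, cm, d, T), 2))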
Malthouse and Blattberg (2005) have used linear regression to calculate CLV. The CLV in this
case is related to the predictor variables x_{i} through some regression function f as

g(CLV_{i}) = f(x_{i}) + \varepsilon_{i}

where the \varepsilon_{i} are independent random errors with mean 0 and variance
V(\varepsilon_{i}) = \sigma^{2}, and the invertible function g is a variance-stabilizing
transformation. Several regression models can be considered for this function: a) linear
regression with variance-stabilizing transformations estimated with ordinary least squares;
b) linear regression estimated with iteratively re-weighted least squares (IRLS); c) a
feedforward neural network (estimated using S-Plus version 6.0.2). Methods like k-fold
cross-validation are used to check the accuracy of the analysis.
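As a rough illustration of option (a), a log transformation used as the variance-stabilizing g, followed by ordinary least squares, might look as follows; the synthetic predictors and coefficients are assumptions for the sketch, not the variables used by Malthouse and Blattberg.

# Sketch: variance-stabilizing transformation (log1p) + OLS, then back-transform.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 50.0, size=(200, 3))                    # e.g. R, F, M style predictors
y = 5.0 * X[:, 1] + 2.0 * X[:, 2] + rng.gamma(2.0, 20.0, size=200)

g_y = np.log1p(y)                                          # variance-stabilizing transform g
model = LinearRegression().fit(X, g_y)

pred = np.expm1(model.predict(X))                          # invert g to return to CLV scale
print(pred[:5].round(2))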
Dries and Van den Poel (2009) have used quantile regression instead of linear regression to
calculate CLV. Quantile regression extends the mean regression model to conditional quantiles
of the response variable, such as the median. It provides insights into the effects of the
covariates on the conditional CLV distribution that may be missed by the least squares method.
For predicting the top x percent of customers, quantile regression is a better method than
linear regression: the smaller the top segment of interest, the better the estimate of predictive
performance we get.
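A minimal sketch of quantile regression with the statsmodels package is shown below; the formula, the column names and the choice of the 90th percentile are illustrative assumptions rather than the specification of Dries and Van den Poel.

# Sketch: conditional-quantile model of future customer value on RFM-style inputs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "recency": rng.integers(1, 19, 300),
    "frequency": rng.integers(1, 40, 300),
    "monetary": rng.gamma(2.0, 500.0, 300),
})
df["future_value"] = 0.6 * df["monetary"] + 30.0 * df["frequency"] \
    + rng.gamma(2.0, 200.0, 300)

# Model the 90th conditional percentile, i.e. focus on the top customers.
model = smf.quantreg("future_value ~ recency + frequency + monetary", df)
result = model.fit(q=0.9)
print(result.params)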
Besides these, other data mining techniques like Decision Trees (DT), Artificial Neural Networks
(ANN), Genetic Algorithms (GA), Fuzzy Logic and Support Vector Machines (SVM) are also in
use, but mostly to calculate CLV metrics like customer churn, acquisition rate, customer
targeting, etc. Among DTs, the most common are C4.5, CHAID, CART and SLIQ. ANNs have
also been used to catch non-linear patterns in data, and they can be used for both
classification and regression purposes depending on the activation function.
Malthouse and Blattberg (2005) used ANN to predict future cash flows. Aeron and Kumar
(2010) have described different approaches for using ANN. The first is the generalised
stacking approach used by Hu and Tsoukalas (2003), where an ensemble method is used:
the data is first divided into three groups, the first containing all situational variables, the
second all demographic variables, and the third both situational and demographic variables.
The other is the hybrid GA/ANN approach of Kim and Street (2004) for customer targeting,
where the GA searches the exponential space of features and passes one subset of features
to the ANN. The ANN extracts predictive information from each subset and learns the
patterns. Once it finds the data patterns, it is evaluated on a data set and returns metrics to
the GA. ANN, too, is not without limitations. It cannot handle too many variables, so various
other algorithms like GA, PCA (Principal Component Analysis) and logistic regression are
used for selecting the variables to input into the ANN. There is no set rule for finding the ANN
parameters; selection of these parameters is a research area in itself. Besides all this, the
initial weights in an ANN are decided randomly, which makes it take longer to reach the
desired solution.
Genetic Algorithms (GA) are more suitable for optimization problems, as they achieve a global
optimum with quick convergence, especially for high-dimensional problems. GAs have seen
varied applications among CLV parameters, like multi-objective optimization (using the
Genetic-Pareto algorithm), churn prediction, customer targeting, cross selling and feature
selection. GA is either used to predict these parameters or to optimize the parameter
selection of other techniques like ANN. Besides GA, Fuzzy Logic and Support Vector Machines
also find applications in predicting churn and loyalty indices. There are many other techniques
and models like GAM (Generalized Additive Models), MARS (Multivariate Adaptive Regression
Splines) and Support Vector Machines (SVM) which are used to predict or optimize the
various parameters for CLV like churn rate, logit and hazard functions, classification, etc. Churn
rate in itself is a very vast area of CRM which can be used as a parameter in the prediction
of CLV and in many other related models. There have been many worldwide competitions and
tournaments in which various academics and practitioners use various methods, combining
different models to get the best possible results. These approaches remain little known in the
marketing literature and have a lot of scope for further research. The 2008 DMEF CLV
Competition was one such competition, in which various researchers and academicians came
together to compete on the three tasks of that competition. Malthouse (2009) has made a
compilation of the various models which were presented in that competition.
2.3.3 Growth/Diffusion Models: These types of models focus on calculating the CLV of
current and future customers. Forecasting the acquisition of future customers can be done in
two ways. The first approach uses disaggregate customer data and builds models that predict
the probability of acquiring a particular customer (Thomas, Blattberg and Fox, 2004). The
other approach is to use aggregate data and use diffusion or growth models to predict the
number of customers a firm is likely to acquire in the future (Gupta, Lehmann and Stuart, 2004).
The expression for forecasting the number of new customers at time t is:

n(t) = \frac{\alpha\,\gamma\, e^{-\beta-\gamma t}}{\left(1 + e^{-\beta-\gamma t}\right)^{2}}

where \alpha, \beta, \gamma are parameters of the customer growth curve.
Using this, they estimated the CE of the firm as:

CE = \sum_{k=0}^{\infty} \frac{n_{k}}{(1+i)^{k}} \sum_{t=0}^{\infty} \frac{m\, r^{t}}{(1+i)^{t}} \;-\; \sum_{k=0}^{\infty} \frac{n_{k}\, c_{k}}{(1+i)^{k}}

where,
n_{k} is the number of newly acquired customers for cohort (segment) k,
m is the margin,
r is the retention rate,
i is the discount rate, and
c_{k} is the acquisition cost per customer.
Diffusion models can also be used to assess the value of a lost customer. For example, a bank
that has recently adopted a new technology will have some customers who are reluctant to
accept the change and will be lost. If the relative proportion of lost customers is known, the
value of the average lost customer can be obtained from the same growth model.
2.3.4 Econometric Models: Gupta et al. (2006) have given a good review of this type of
model; we present it briefly here with the example of a right-censored Tobit model by
Hansotia and Wang (1997). Econometric models study customer acquisition, retention and
expansion (cross selling or margin) and combine them to calculate CLV. Customer acquisition
and customer retention are the key inputs for such models. Various models relate customer
acquisition and retention and come up with new ways to calculate CLV, for example the
right-censored Tobit model for CLV (Hansotia and Wang, 1997). It has also been shown by
some researchers (Thomas, 2001) that ignoring the link between customer acquisition and
retention may cause a 6-50% variation in these models; for example, if we spend less money
on acquisition, the customers might walk away soon. Retention models are broadly classified
into two main categories: a) the first considers the "lost for good" approach and uses hazard
models to predict the probability of customer defection, and b) the second considers the
"always a share" approach and typically uses Markov models. Hazard models, used to predict
the probability of customer defection, are again of two types: a) Accelerated Failure Time (AFT)
models (Kalbfleisch and Prentice, 1980) and b) Proportional Hazard (PH) models (Levinthal and
Fichman, 1988). AFT models are of the form:
\ln(t_{j}) = \beta' X_{j} + \sigma\,\mu_{j}

where t_{j} is the purchase duration for customer j, X_{j} are covariates, \beta the coefficients
and \mu_{j} an error term.

Different specifications of \sigma and \mu lead to different models such as the Weibull or
generalized gamma model. PH models, in turn, specify the hazard rate \lambda in terms of a
baseline hazard \lambda_{0}(t) and covariates X as:

\lambda(t; X) = \lambda_{0}(t)\exp(X\beta)

We get different models like the exponential, Weibull and Gompertz for different specifications.
Hansotia and Wang (1997) used a right-censored Tobit model to calculate the lifetime value
of customers, or LTV as it was then called. It is a regression model with right-censored
observations and can be estimated by the method of maximum likelihood. The present
value of a customer's revenue (PVR) for the qth customer receiving package j was modelled
as a linear function of x_{q}, the (K+1)-dimensional column vector of profile variables for the
qth customer. The equation may also be estimated using the LIFEREG procedure in SAS. The
likelihood function, which gives the probability of observing the sample values, treats
censored and uncensored observations separately, with \delta_{i} = 1 if observation i is
uncensored and 0 otherwise.


Besides the four types of models presented in this paper, Gupta et al. (2006) have also
discussed probability models; in our review, however, these have been covered under the
computer science and stochastic models. Gupta et al. (2006) make a few assumptions in their
review of probability models, for example that the probability of a customer being "alive" can
be characterized by various probability distributions. They also take into account the
heterogeneity in dropout rates across customers. Various combinations of these assumptions
result in models like Pareto/NBD, beta-binomial/beta-geometric (BG/BB) and Markov models.
Gupta et al. (2006) also mention persistence models, which have been used in some CLV
contexts to study the impact of advertising, discounting and product quality on customer
equity (Yoo and Hanssens, 2005) and to examine differences in CLV resulting from different
customer acquisition methods (Villanueva, Yoo, and Hanssens, 2006).
2.3.5 Some other Modelling Approaches:
Donkers et al. (2007) have also made a review of various CLV modelling approaches with
respect to the insurance industry. These include a status quo model, a Tobit-II model,
univariate and multivariate choice models and duration models. They grouped these models
into two types. The first are relationship-level models, which focus on relationship length and
total profit and build directly on the definition of CLV as given by Berger and Nasr (1998):

CLV_{i} = \sum_{t=1}^{T} \frac{\text{Profit}_{i,t}}{(1+d)^{t}}

where d is a predefined discount rate and Profit_{i,t} for a multiservice industry is defined as:

\text{Profit}_{i,t} = \sum_{j=1}^{J} \text{Serv}_{i,j,t} \times \text{Usage}_{i,j,t} \times \text{Margin}_{j,t}

where,
J is the number of different services sold,
Serv_{i,j,t} is a dummy indicating whether customer i purchases service j at time t,
Usage_{i,j,t} is the amount of the service purchased, and
Margin_{j,t} is the average profit margin for service j.

The second are service-level models, which disaggregate a customer's profit into the
contribution per service. CLV predictions are then obtained by predicting purchase behaviour
at the service level and combining the results across services. An overview of the models as
presented by Donkers et al. (2007) is given below:


An overview of Relationship-Level Models:

Here the Status Quo Model assumes profit simply remains constant over time. The Profit
Regression Model aims at predicting a customer's annual profit contribution. Retention
Models are based on segmenting over RFM. The Probit Model is based on customer-specific
retention probabilities, as is the Bagging Model. The Duration Model focuses on the
customer's relationship duration. The Tobit-II Model separates the effect of customer
defection on profitability.

An Overview of Service-Level Models:

These types of models are explained as a choice-model approach and a duration-model
approach. The choice-model approach has as dependent variable the decision to purchase a
service or not. The duration-model approach focuses on the duration of an existing
relationship; it only models the ending of a relationship and not the starting of a new one.
The next part of the paper presents the machine learning approach we have used to
calculate the future value of customers. The Northwind Traders sample database that ships
with Microsoft Access 2000 is adopted to demonstrate our approach. We have used
Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO,
Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural
Network (WNN) to calculate the future value of customers. In the later part of the paper, we
compare these models and suggest the best model for calculating CLV. We end the paper
with results and a discussion of future developments in the area of CLV measurement.
3. Estimating Future Customer Value using Machine Learning Techniques:
There are various data mining techniques used in the fields of classification and regression.
The choice of technique depends on the type of data available. In our case, we have used
regression techniques to determine the future value of customers in the next prediction
period. In the past, several researchers have used these techniques to determine the metrics
of CLV, depending on the type of model and approach they used. Hansotia and Wang (1997)
used CART and CHAID for customer acquisition, Kim and Street (2004) used ANN for customer
targeting, and Au et al. (2003) used Genetic Algorithms (GA) for predicting customer churn.
However, using these techniques to directly predict a customer's future value, and hence CLV,
has not been done so far. Most of the previous approaches to measuring CLV have used two
or more models to calculate either CLV or the relationships between the various parameters
used to determine CLV. The approach we have adopted eliminates this step and lets the
learning technique itself capture the relationship between the input variables and their
weights in calculating CLV.
3.1 Data Description: The Northwind Traders sample database of Microsoft Access 2000 is
adopted to calculate the CLV of customers. The database contains 89 customers with a
purchase period of two years, from 1st July 1994 till 30th June 1996. We divided this time
frame into four equal half-years and calculated the frequency of purchase and the total
monetary contribution in July-December 1994, January-June 1995, July-December 1995 and
January-June 1996. We kept the observation period from July 1994 till December 1995 and
predicted the expected contribution in the next period, i.e. January-June 1996.
The total number of variables used is 7, of which 6 are input or predictor variables and the
remaining one, the contribution margin in January-June 1996, is the target variable. The
entire dataset is then divided into two parts: a) training and b) testing. We used 65 samples
for training and the remaining 24 for testing.
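A sketch of how such a dataset could be assembled and split is given below; the file name and column names are assumptions made for illustration, and the actual work used Knime, SPM, NeuroShell and the IDRBT software rather than Python.

# Sketch: building half-yearly contribution and frequency features per customer
# and splitting them into the 65-customer training set and 24-customer test set.
import pandas as pd

orders = pd.read_csv("northwind_orders.csv",            # hypothetical export with columns
                     parse_dates=["order_date"])        # customer_id, order_date, amount

halves = {"1994H2": ("1994-07-01", "1994-12-31"),
          "1995H1": ("1995-01-01", "1995-06-30"),
          "1995H2": ("1995-07-01", "1995-12-31"),
          "1996H1": ("1996-01-01", "1996-06-30")}       # last one is the target period

feats = pd.DataFrame(index=orders["customer_id"].unique())
for name, (start, end) in halves.items():
    mask = orders["order_date"].between(start, end)
    feats["CM_" + name] = orders[mask].groupby("customer_id")["amount"].sum()
feats = feats.fillna(0.0)

obs = orders[orders["order_date"] < "1996-01-01"]       # observation window only
feats["total_frequency"] = obs.groupby("customer_id").size()
feats["recency_score"] = ((obs.groupby("customer_id")["order_date"].max()
                           - pd.Timestamp("1994-07-01")).dt.days // 30 + 1)

train, test = feats.iloc[:65], feats.iloc[65:]          # 65 train / 24 test
print(train.shape, test.shape)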


Table 1: Description of variables

Type of variable | Variable Name    | Variable Description
Input Variable   | Recency-dec95    | Recency as a score, counting July 1994 as 1 and December 1995 as 18
Input Variable   | total frequency  | The total number of purchases between July 1994 and December 1995
Input Variable   | Total duration   | The total duration of observation, i.e. from July 1994 till December 1995
Input Variable   | CM_july-dec94    | The contribution margin in the period July-December 1994
Input Variable   | CM_jan-june95    | The contribution margin in the period January-June 1995
Input Variable   | CM_july-dec95    | The contribution margin in the period July-December 1995
Target Variable  | output           | The contribution margin in the period January-June 1996

3.2 Models and Software Used: Knime 2.0.0, Salford Predictive Miner (SPM), NeuroShell 2
(Release 4.0) and software developed at IDRBT, Hyderabad by Chauhan et al. (2009) for
classification problems using DEWNN are used for the analysis. In Knime, we used Support
Vector Machines (SVM), SVM using SMO, Additive Regression and the K-Star method for
learning on the training dataset and the Weka predictor for prediction on the testing dataset.
In Salford Predictive Miner (SPM), we used CART to train on the training dataset and applied
the rules obtained from it to the testing dataset for prediction. The software developed at
IDRBT, Hyderabad was used to train the data using a Wavelet Neural Network (WNN), and the
learned parameters were applied to the test data to get the results; NeuroShell was used for
MLP. A brief description of the techniques used for prediction of the target variable follows.
3.2.1 SVM : The SVM is a powerful learning algorithm based on recent advances in statistical
learning theory (Vapnik, 1998). SVMs are learning systems that use a hypothesis space of
linear functions in a high-dimensional space, trained with a learning algorithm from
optimization theory that implements a learning bias derived from statistical learning theory
(Cristianini & Shawe-Taylor, 2000). SVMs have recently become one of the popular tools for
machine learning and data mining and can perform both classification and regression. SVM
uses a linear model to implement non-linear class boundaries by mapping input vectors
non-linearly into a high dimensional feature space using kernels. The training examples that
are closest to the maximum margin hyper plane are called support vectors. All other training
examples are irrelevant for defining the binary class boundaries. The support vectors are
then used to construct an optimal linear separating hyper plane (in case of pattern
recognition) or a linear regression function (in case of regression) in this feature space. The
support vectors are conventionally determined by solving a quadratic programming (QP)
problem. SVMs have the following advantages: (i) they are able to generalize well even if
trained with a small number of examples and (ii) they do not assume prior knowledge of the
probability distribution of the underlying dataset. SVM is simple enough to be analyzed
mathematically. In fact, SVM may serve as a sound alternative combining the advantages of
conventional statistical methods that are more theory-driven and easy to analyze and
machine learning methods that are more data-driven, distribution-free and robust.
Recently, SVMs have been used in financial applications such as credit rating, time series
prediction and insurance claim fraud detection (Vinaykumar et al., 2008).
In our research, we used two SVM learner models for predictive purposes. First we used the
SVM Regression model as the learner function and then used the Weka predictor to get the
results. We found the correlation coefficient to be 0.8889 and the root relative squared error
to be 48.03%.
In the case of SMO (the sequential minimal optimization algorithm) for training a support
vector regression model, we replaced the learner function with the SMOreg function. This
implementation globally replaces all missing values and transforms nominal attributes into
binary ones. It also normalizes all attributes by default. Here we found the correlation
coefficient to be 0.8884 and the root relative squared error to be 47.98%.
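For readers who prefer a scripting environment over Knime, an equivalent support vector regression experiment could be sketched with scikit-learn as below; this is an analogue of the Weka SMOreg setup, and the kernel and hyperparameters are assumptions, not the exact configuration used in the report.

# Sketch: support vector regression on the 6 predictor variables, with the
# contribution margin of the next half-year as the target.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_test, y_test are assumed to come from the 65/24 split above.
def fit_svr(X_train, y_train, X_test, y_test):
    model = make_pipeline(StandardScaler(),
                          SVR(kernel="rbf", C=100.0, epsilon=10.0))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return pred, mean_squared_error(y_test, pred) ** 0.5   # RMSE on the test set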
3.2.2 Additive Regression and K-Star: Additive Regression is another classifier used in Weka
that enhances the performance of a regression base classifier. Each iteration fits a model to
the residuals left by the classifier in the previous iteration. Prediction is accomplished by
adding the predictions of each classifier. Reducing the shrinkage (learning rate) parameter
helps prevent overfitting and has a smoothing effect, but increases the learning time. K-Star,
on the other hand, is an instance-based classifier: the class of a test instance is based upon
the classes of the training instances similar to it, as determined by some similarity function.
It differs from other instance-based learners in that it uses an entropy-based distance
function. These learners were used in the same way as the SVM Regression and SMOreg
learners, within Knime using the Weka predictor.
With Additive Regression, we found the correlation coefficient to be 0.895, the root mean
squared error to be 3062.19 and the root relative squared error to be 44.36%. With K-Star,
we found the correlation coefficient to be 0.9102, the root mean squared error to be 3203.57
and the root relative squared error to be 46.41%.
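Outside Weka, close analogues of these two learners could be sketched as follows: gradient boosting stands in for Additive Regression (each stage fits the residuals of the previous one, with a shrinkage/learning-rate parameter), and a distance-weighted nearest-neighbour regressor stands in for the instance-based K-Star (the entropy-based distance of K-Star is not reproduced here). The hyperparameter values are assumptions.

# Sketch: stage-wise residual fitting (Additive Regression analogue) and an
# instance-based regressor (K-Star analogue) from scikit-learn.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

additive_like = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=2)    # small shrinkage smooths
kstar_like = KNeighborsRegressor(n_neighbors=5, weights="distance")

# Both follow the usual fit/predict interface:
#   additive_like.fit(X_train, y_train); additive_like.predict(X_test)
#   kstar_like.fit(X_train, y_train);    kstar_like.predict(X_test)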


3.2.3 MLP: The Multilayer Perceptron (MLP) is one of the most common neural network
structures, as it is simple and effective and has found a home in a wide assortment of
machine learning applications. MLPs start as a network of nodes arranged in three layers:
the input, hidden, and output layers. The input and output layers serve as nodes to buffer
input and output for the model, respectively, and the hidden layer provides a means for
input relations to be represented in the output. Before any data is passed to the network,
the weights of the nodes are random, which has the effect of making the network much like
a newborn's brain: developed but without knowledge. MLPs are feed-forward neural
networks trained with the standard back-propagation algorithm. They are supervised
networks, so they require a desired response to be trained. They learn how to transform
input data into a desired response, so they are widely used for pattern classification and
prediction. A multilayer perceptron is made up of several layers of neurons, each layer fully
connected to the next one. With one or two hidden layers, they can approximate virtually
any input-output map. They have been shown to yield accurate predictions in difficult
problems (Rumelhart, Hinton, & Williams, 1986, chap. 8).
In our research, we used NeuroShell 2 (Release 4.0) to determine the results. For learning
purposes we set the learning rate to 0.5, the momentum rate to 0.1 and the scale function to
linear [-1,1] to get the best results. We found the root relative squared error to be 43.8%,
which was the least among all the methods used, as we will see later.
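An MLP with a comparable setup could be sketched in scikit-learn as below; NeuroShell's exact architecture and scaling are not reproduced, and the hidden-layer size is an assumption, while the learning rate and momentum only loosely mirror the reported settings.

# Sketch: feed-forward MLP regressor trained with back-propagation (SGD),
# with inputs scaled to [-1, 1] as in the NeuroShell experiment.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

mlp = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),
    MLPRegressor(hidden_layer_sizes=(10,), solver="sgd",
                 learning_rate_init=0.5, momentum=0.1,
                 max_iter=2000, random_state=0),
)
# Usage: mlp.fit(X_train, y_train); predictions = mlp.predict(X_test)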

3.2.4 WNN: The word wavelet is due to Grossmann et al. (1984). Wavelets are a class of
functions used to localize a given function in both space and scale
(http://mathworld.wolfram.com/wavelet.html). They have advantages over traditional
Fourier methods in analyzing physical situations where the signal contains discontinuities
and sharp spikes. Wavelets were developed independently in the fields of mathematics,
quantum physics, electrical engineering and seismic geology. Interchanges between these
fields during the last few years have led to many new wavelet applications such as image
compression, radar and earthquake prediction.
A family of wavelets can be constructed from a function \psi(x), known as the mother wavelet,
which is confined to a finite interval. Daughter wavelets \psi_{a,b}(x) are then formed by
translation (b) and dilation (a). Wavelets are especially useful for compressing image data.
An individual wavelet is defined by

\psi_{a,b}(x) = |a|^{-1/2}\,\psi\!\left(\frac{x-b}{a}\right)
In the case of non-uniformly distributed training data, an efficient way of solving the learning
problem is to learn at multiple resolutions. Wavelets, in addition to forming an orthogonal
basis, are capable of explicitly representing the behaviour of a function at various resolutions
of the input variables. Consequently, a wavelet network is first trained to learn the mapping
at the coarsest resolution level. In subsequent stages, the network is trained to incorporate
elements of the mapping at higher and higher resolutions. Such hierarchical, multi-resolution
learning has many attractive features for solving engineering problems, resulting in a more
meaningful interpretation of the resulting mapping and more efficient training and adaptation
of the network compared to conventional methods. Wavelet theory provides useful guidelines
for the construction and initialization of networks, and consequently the training times are
significantly reduced (http://www.ncl.ac.uk/pat/neural-networks.html).
Wavelet networks employ activation functions that are dilated and translated versions of a
single function (Zhang, 1997). This function, called the mother wavelet, is localized both in the
space and frequency domains (Becerra, Galvao and Abou-Seads, 2005). The wavelet neural
network (WNN) was proposed as a universal tool for functional approximation, which shows
surprising effectiveness in solving the conventional problem of poor convergence or even
divergence encountered in other kinds of neural networks, and it can dramatically increase
convergence speed (Zhang et al., 2001).
The WNN consists of three layers, namely the input layer, hidden layer and output layer. Each
layer is fully connected to the nodes in the next layer. The numbers of input and output nodes
depend on the numbers of inputs and outputs present in the problem. The number of hidden
nodes is a user-defined parameter depending on the problem. WNN is implemented here with
the Gaussian wavelet function.
The original training algorithm for a WNN is as follows (Zhang et al., 2001):

1) Specify the number of hidden nodes required. Randomly initialize the dilation parameters
a_j, the translation parameters b_j, the weights w_{ij} for the connections between the input
and hidden layers and the weights W_j for the connections between the hidden and output
layers.

2) The output value for sample k, k = 1, 2, ..., np, is computed as:

V_k = \sum_{j=1}^{nhn} W_j \, f\!\left(\frac{\sum_{i=1}^{nin} w_{ij}\,x_{ki} - b_j}{a_j}\right)    (1)

where nin is the number of input nodes, nhn is the number of hidden nodes and np is the
number of samples.
In (1), when f(t) is taken as the Morlet mother wavelet it has the following form:

f(t) = \cos(1.75\,t)\,\exp(-t^{2}/2)    (2)

and when taken as the Gaussian wavelet it becomes

f(t) = \exp(-t^{2})    (3)

3) Reduce the prediction error by updating W_j, w_{ij}, a_j and b_j (see formulas (4)-(7)). In
training the WNN, the gradient descent algorithm with momentum is employed:

W_j(t+1) = W_j(t) - \eta\,\frac{\partial E}{\partial W_j(t)} + \alpha\,\Delta W_j(t)    (4)

w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial E}{\partial w_{ij}(t)} + \alpha\,\Delta w_{ij}(t)    (5)

a_j(t+1) = a_j(t) - \eta\,\frac{\partial E}{\partial a_j(t)} + \alpha\,\Delta a_j(t)    (6)

b_j(t+1) = b_j(t) - \eta\,\frac{\partial E}{\partial b_j(t)} + \alpha\,\Delta b_j(t)    (7)

where the error function is taken as

E = \frac{1}{2}\sum_{k=1}^{np}\left(V_k - \hat{V}_k\right)^{2}    (8)

with \hat{V}_k the desired output for sample k, and \eta and \alpha the learning and momentum
rates respectively.

4) Return to step 2); the process is continued until E satisfies the given error criterion, and
then the whole training of the WNN is completed.
Some problems exist in the original WNN, such as slow convergence, entrapment in local
minima and oscillation (Pan et al., 2008); variants such as the differential evolution trained
WNN (DEWNN) of Chauhan et al. (2009) have been proposed to resolve these problems.
In our research, we used software developed by Chauhan et al. (2009) for DEWNN
(Differential Evolution trained Wavelet Neural Network). The software was initially written
for classification purposes; we changed the code from classification to regression and used it
on our problem. We set the weight factor to 0.95, the convergence criterion to 0.00001, the
crossover factor to 0.95, the population size to 60, the number of hidden nodes to 20, the
maximum weight to 102 and the minimum weight to -102 to find the optimum solution. We
found the test-set normalized root mean square error to be 0.928441 and the root relative
squared error to be 111.2%, which was the highest amongst all the results.
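A tiny sketch of the WNN forward pass with the Gaussian wavelet activation, following equations (1) and (3), is given below; the network size and weights are random illustrations only, and training (which the DEWNN software handles) is not implemented here.

# Sketch: forward pass of a small wavelet neural network with a Gaussian
# wavelet activation. Weights are random purely for illustration; training
# would adjust W, w, a and b as in equations (4)-(7).
import numpy as np

rng = np.random.default_rng(0)
nin, nhn = 6, 5                        # 6 inputs as in the dataset, 5 hidden nodes

w = rng.normal(size=(nin, nhn))        # input-to-hidden weights w_ij
W = rng.normal(size=nhn)               # hidden-to-output weights W_j
a = np.ones(nhn)                       # dilation parameters a_j
b = np.zeros(nhn)                      # translation parameters b_j

def gaussian_wavelet(t):
    return np.exp(-t ** 2)             # equation (3)

def wnn_output(x):
    """Network output V_k for one input vector x (equation (1))."""
    t = (x @ w - b) / a                # weighted input, translated and dilated
    return W @ gaussian_wavelet(t)

print(wnn_output(rng.normal(size=nin)))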
3.2.5 CART: Decision trees form an integral part of machine learning, an important
sub-discipline of artificial intelligence. Almost all decision tree algorithms are used for solving
classification problems; however, algorithms like CART solve regression problems as well.
Decision tree algorithms induce a binary tree on given training data, resulting in a set of
if-then rules. These rules can be used to solve the classification or regression problem.
CART (http://www.salford-systems.com) is a robust, easy-to-use decision tree tool that
automatically sifts large, complex databases, searching for and isolating significant patterns
and relationships. CART uses recursive partitioning, a combination of exhaustive searches
and intensive testing techniques, to identify useful tree structures in the data. This
discovered knowledge is then used to generate a decision tree, resulting in reliable,
easy-to-grasp predictive models in the form of if-then rules. CART is powerful because it can
deal with incomplete data and with multiple types of features (floats, enumerated sets) both
in input features and predicted features, and the trees it produces contain rules which are
humanly readable. Decision trees contain a binary question (with a yes/no answer) about
some feature at each node in the tree. The leaves of the tree contain the best prediction
based on the training data. Decision lists are a reduced form of this where an answer to each
question leads directly to a leaf node. A tree's leaf node may be a single member of some
class, a probability density function (over some discrete class), a predicted mean value for a
continuous feature, or a Gaussian (mean and standard deviation for a continuous value). The
key elements of a CART analysis are a set of rules for: (i) splitting each node in a tree, (ii)
deciding when a tree is complete, and (iii) assigning each terminal node to a class outcome
(or predicted value for regression).
In our research, we used Salford Predictive Miner (SPM) to run CART for prediction purposes.
We trained the model using least absolute deviation on the training data. We found that the
root mean squared error was 3367.53 with a total of 5 terminal nodes; however, on growing
the tree from 5 to 6 nodes, we found better results: the root mean squared error changed to
3107.13 and the root relative squared error to 45.38%, which is very close to MLP. Figure 1
shows the plot of relative error vs. the number of nodes; we see that we got the optimum
results on growing the tree from node 5 to node 6.
Figure 1: CART: Plot of relative error vs. number of nodes

Figure 2: CART : Plot of percent error vs. Terminal nodes


It was also seen from the results that, when the optimum number of nodes was kept at 5,
19 out of 24 customers were put in node 1, 4 in node 3 and 1 in node 6. We also found that
the root mean squared error was 2892.6 for the 19 customers in node 1, which is better than
the overall error; the overall increase in error was caused by misclassification or a high error
rate in the splitting of customers into nodes 4 and 6. On growing the tree to 6 optimum
nodes, we found that 14 customers fell in node 1, 5 in node 2, 4 in node 4 and 1 in node 6.
The RMSE in node 1 was 1846.89, far less than the total RMSE of 3107.13. One obvious
conclusion one can draw from CART is that it is more useful than the other methods for
prediction, because its rules give companies the flexibility to decide which customer to put
in which node and also to choose the optimum number of nodes for their analysis.
Figure 3 : CART : Tree details showing the splitting rules at each node

A summary of the rules is given as:

1. if (CM_JULY_DEC95 <= 2278.66 && CM_JAN_JUNE95 <= 3534.06) then y = 1511.64
2. if (CM_JULY_DEC95 <= 2278.66 && CM_JAN_JUNE95 > 3534.06 && CM_JAN_JUNE95 <= 12252.1) then y = 5932.26
3. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2278.66 && CM_JULY_DEC95 <= 2464.75) then y = 24996
4. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2464.75 && TOTAL_FREQUENCY <= 14) then y = 6350.25
5. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2464.75 && TOTAL_FREQUENCY > 14) then y = 19044.4
6. if (CM_JAN_JUNE95 > 12252.1) then y = 38126.7

where y is the median contribution predicted for the node.

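For illustration, the six rules above can be transcribed directly into code. The sketch below is our own transcription (not SPM output), assuming the inputs CM_JAN_JUNE95, CM_JULY_DEC95 and TOTAL_FREQUENCY are available per customer; it returns the median future value of the matching terminal node.

def cart_predict(cm_jan_june95, cm_july_dec95, total_frequency):
    """Apply the six extracted CART splitting rules; each branch returns
    the median future value of the corresponding terminal node."""
    if cm_jan_june95 > 12252.1:              # rule 6
        return 38126.7
    if cm_july_dec95 <= 2278.66:
        if cm_jan_june95 <= 3534.06:         # rule 1
            return 1511.64
        return 5932.26                       # rule 2 (3534.06 < CM_JAN_JUNE95 <= 12252.1)
    if cm_july_dec95 <= 2464.75:             # rule 3
        return 24996.0
    if total_frequency <= 14:                # rule 4
        return 6350.25
    return 19044.4                           # rule 5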
4. Results and Comparison of Models: We have used various machine learning techniques
to calculate the future value of 24 customers from a sample of 89 customers: SVM, WNN,
Additive Regression and the K-star method in Knime using the Weka predictor, CART in SPM,
and MLP in NeuroShell. We found that MLP gave the least error amongst all these models, but
we find CART to be more useful, as it is more helpful in taking decisions by setting splitting
rules and also predicts more accurately for a greater section of the test sample by splitting it
into various nodes. Companies can make better decisions with the help of these rules and the
segmentation CART provides. A detailed summary of the final results of the competing models
is given in Table 2. One limitation of our study is that we have predicted the future value for
only the next time period. Besides this, the error percentage is relatively high because of the
small size of our dataset. We believe that these models will perform better on a larger dataset
with more input variables, including customer demographics, customer behaviour, etc.
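For completeness, the four evaluation measures reported in Table 2 can be computed from the actual and predicted future values as in the sketch below. This is our own illustrative code rather than the output of Knime, SPM or NeuroShell; actual and predicted are assumed to be equal-length lists for the 24 test customers.

# Illustrative sketch of the evaluation measures used in Table 2.
import math

def comparison_metrics(actual, predicted):
    """Correlation coefficient, RMSE, MAE and root relative squared error
    (error relative to simply predicting the mean of the actual values)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return {
        "correlation": cov / math.sqrt(var_a * var_p),
        "rmse": math.sqrt(sq_err / n),
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "rrse_percent": 100.0 * math.sqrt(sq_err / var_a),
    }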

Table 2: Comparison of Competing Models

Model            Correlation coefficient   Root mean squared error   Mean absolute error   Root relative squared error
SVMreg           0.8889                    3315.25                   2513.03               48.0%
SMOreg           0.8884                    3311.98                   2499.48               47.9%
Additive Reg.    0.8950                    3062.19                   2203.76               44.3%
K-star           0.9102                    3203.57                   2233.21               46.4%
MLP              NA                        2986.77                   2107.10               43.8%
CART             NA                        3107.13                   2343.82               45.3%
Figure 4: Graph of error vs. model (root relative squared error, %, for MLP, Additive Reg., CART, K-Star, SMOreg and SVMreg)

5. Conclusion and Directions of Future Research: In this paper we have presented a review
of various approaches and modelling techniques used to determine Customer Lifetime Value.
We have also covered the traditional techniques used to calculate customer loyalty and found
that CLV is a better metric compared to these measures. The most common approaches used
to measure CLV are the aggregate approach and the individual approach, and the choice of
approach depends on the type of data available and the kind of result a firm wants. Further,
we have reviewed various modelling techniques used to determine CLV, which include RFM
models, computer science and stochastic models, econometric models, diffusion models, and
also relationship-level and service-level models. The techniques most frequently applied to
determine CLV parameters, or the relationships between them, include Pareto/NBD models,
decision trees, artificial neural networks, genetic algorithms and support vector machines.
We have also presented a study of measuring CLV by means of various machine learning
techniques. Emphasis has been given to capturing the non-linear pattern in the data, which was
available for a set of 89 customers with a two-year transaction history. We have used
Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using
SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet
Neural Network (WNN) for the calculation of the future value of 24 customers. Although MLP
gives the best result amongst all these models, we would still recommend using CART to
calculate CLV, as it segments the customers into various nodes and predicts more precisely for
a larger segment of the test-case customers. Besides, the splitting rules would also help any
firm to better understand the classification of a customer into a particular segment and hence
derive more profit from him or her.
The main limitation of our study has been the projection of the future value of customers only
up to the next period, mainly due to the limitation of the dataset we had. This also resulted in
rather high error rates even amongst the best models. These limitations can be overcome by
using datasets which give more information about customer behaviour, demographics, etc.
Besides, a larger dataset will be useful in making better predictions, as it allows the training
parameters to be estimated better. We have also not covered techniques like k-fold
cross-validation, which can give better estimates on small datasets and which, again, can be
taken as an area of future research; a minimal sketch of such a procedure is given below. Nor
have we given much emphasis to feature selection and the relationships between the input
variables used to calculate CLV. Producing better results with an integrated approach on this
dataset is again an area of future research.
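As a pointer for that future work, the following is a minimal sketch of how k-fold cross-validation could be set up on a small dataset such as ours; it is illustrative only, and the train_model and evaluate functions are hypothetical placeholders for any of the models compared above.

# Illustrative k-fold cross-validation sketch; train_model and evaluate are hypothetical.
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Shuffle the sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train_model, evaluate, k=5):
    """Train on k-1 folds and test on the held-out fold, averaging the score,
    so that every customer is used for both training and testing."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in test_set]
        model = train_model([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(evaluate(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)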
