Understanding Customer Segments through Data Mining

Master in Advanced Analytics
2017/2018
Data Mining
Final Project
Group
Student ID Student Name
M20170606 Biazi Bayer
M20170590 Rodrigo Pupo Ribeiro
M20170366 Vasco Castela
Table of Contents
Project - Part 1
Executive Summary
Tools and Methodology
Conclusion
Project - Part 2
Executive Summary
Tools and Methodology
Variable Analysis
Variable definition
Data Preparation
Outliers
Data Preprocessing
Segmentation
Value/Engagement Segmentation
Cluster characterization
Consumption Segmentation
Cluster characterization
Cluster Concatenation
Conclusion
ANNEX A
Histograms from SAS Miner MultiPlot node
Segmentation Diagram
SAS Miner Segment profile nodes
ANNEX B
FREQ Procedure
MEANS Procedure
Monetary Variable Analysis
Income Variable Analysis
WebPurchase Variable Analysis
Histograms of WebPurchase variable
Histograms of Freq variable
Histograms of Recency variable
Histograms of Monetary variable
SAS Guide Code
Project - Part 1
Executive Summary
Wonderful Wines of the World (WWW) is a seven-year-old company that sells wines through a
website and ten small stores in major cities across the USA. Every 6 weeks customers receive a
new catalog from which they can choose among several hundred different products.
WWW is now trying to make the most of the customer database it started about four years ago
and wants to start differentiating customers in order to develop more focused programs. To that
end, it has provided a random sample of 10,000 customers from its active database who have
purchased something from WWW in the past 18 months.
As a first step, the impact of the digital approach on WWW's business was assessed after merging
two datasets from the provided database: Value Engage and Newsletter. The former can be
considered the main, general collection of information gathered on each customer, whereas the
latter is specific to customers' digital enrollment and behaviour.
Tools and Methodology
To understand whether the digital approach has an impact on the Wonderful Wines of the World
business, the class variables Newsletter and shareMedia, which identify customers who signed up
for a newsletter and customers who shared at least one post over social media, respectively,
were analysed, mainly against the interval variables Recency, Frequency and Monetary, as these
three variables describe customers' past purchase behaviour: how recently, how often and how
much they bought. The WebPurchase and Income variables were also considered for further insight
into how the digital approach is affecting, or could affect, WWW's business. SAS Guide was used
for this purpose, specifically the PROC FREQ, PROC MEANS and PROC UNIVARIATE procedures.
Conclusion
In the sample dataset provided by WWW, only 5.88% of customers (588) both signed up for a
newsletter and shared at least one post over social media (digitally driven), 37.64% did one of
the two and the remaining 56.48% did neither (Annex B - Pic. 1).
The 56.48% of customers who are not digitally engaged account for 56.45% of total sales
(Monetary), while the 5.88% who are fully engaged account for only 6.13%, although each of the
latter buys more on average: 648.8 against 622.2 (Annex B - Pic. 2). Within the in-between
group - customers who either shared a post or subscribed to the newsletter - those who only
shared a post on social media buy slightly more on average (Annex B - Pic. 2).
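The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal Python sketch (Python is used here only for illustration; the project itself was done in SAS) that relies on the identity "sales share = customer share x group mean / overall mean". All inputs are the rounded values from the text, so the result is only approximate, and the variable names are illustrative.

```python
# Figures quoted in the text (Annex B - Pic. 1 and 2), already rounded.
share_customers_none = 0.5648   # customers with no digital engagement
share_sales_none = 0.5645       # their share of total Monetary
mean_none = 622.2               # their average Monetary
share_customers_both = 0.0588   # signed up for newsletter AND shared a post
mean_both = 648.8               # their average Monetary

# sales share = customer share * group mean / overall mean,
# so the overall mean can be backed out from the non-engaged group.
overall_mean = share_customers_none * mean_none / share_sales_none

# Apply the same identity to the fully engaged group.
share_sales_both = share_customers_both * mean_both / overall_mean

print(f"implied overall mean Monetary: {overall_mean:.1f}")
print(f"implied sales share, fully engaged: {share_sales_both:.2%}")
```

Under these rounded inputs the implied sales share of the fully engaged group comes out at about 6.13%, in line with the figure reported from PROC MEANS.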
It is also worth noting that the customers who both signed up for a newsletter and shared at
least one post over social media have the highest household income (Annex B - Pic. 3). More
importantly, looking at online purchases alone, which represent 31.1% of all purchases over the
last 18 months (see SAS Guide code in Annex B for details), web purchases represent a higher
percentage precisely for those customers who neither subscribed to the newsletter nor shared
any post on social media (Annex B - Pic. 4 and 5), so other factors must be driving this group
of customers.
It is therefore reasonable to assume that, as of now, the digital approach does not have a
meaningful impact on the business. This is also supported by Pic. 6, 7 and 8 (Annex B), which
show no major differences between the groups. However, reinforcing this approach may pay off in
the future, as the wealthier customers seem to be more receptive to it (Annex B - Pic. 3) and
already buy more on average (Annex B - Pic. 2).
Project - Part 2
Executive Summary
Following Wonderful Wines of the World's effort to make the best use of its customer database
to grow its wine business, two customer segmentations were performed on the provided sample of
10,000 customers, in an attempt to differentiate customers by Value/Engagement and by
Consumption.
Once these segmentations and their respective clusters were obtained, more focused programs
were proposed to target specific customer groups, based on their profiles and potential value
for the future of the business, as described throughout this report.
Tools and Methodology
The tools used throughout the project were SAS Enterprise Guide and SAS Enterprise Miner. With
these tools, the descriptive analysis goals of this Data Mining project were fairly
straightforward to carry out:
1. Extract patterns that allow us to characterize subsets of the data;
2. Evaluate the potential of the data through data summaries;
3. Describe the data distribution, searching for deep relationships across datasets.
For the descriptive analysis a clustering model was used, which organizes objects into
homogeneous groups according to their similarity. The better the grouping, the lower the
intra-cluster distances and the higher the inter-cluster distances, which in turn means the
more clearly one cluster can be told apart from the next. To attain this objective, three
stages were followed:
1. Variable analysis, to determine the set of variables over which the similarity/dissimilarity
of the entities (customer behaviour) will be assessed;
2. Segmentation, i.e. customer clustering according to the desired similarity/dissimilarity
criterion (what they buy, how they buy);
3. Cluster concatenation, i.e. customer profiling through inter-cluster relationship assessment
(Value/Engagement vs Consumption) and validation.
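The grouping criterion described above (low intra-cluster distance, high inter-cluster distance) can be made concrete with a small Python sketch. The two clusters and their points below are hypothetical toy values, not project data, and the mean distance to the centroid is only one of several possible intra-cluster measures.

```python
from math import dist

# Hypothetical toy customers as (income in thousands, purchases) pairs.
clusters = {
    "A": [(30.0, 2.0), (32.0, 3.0), (34.0, 4.0)],
    "B": [(80.0, 10.0), (82.0, 11.0), (84.0, 12.0)],
}

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {name: centroid(pts) for name, pts in clusters.items()}

# Intra-cluster distance: mean Euclidean distance of members to their centroid.
intra = {
    name: sum(dist(p, centroids[name]) for p in pts) / len(pts)
    for name, pts in clusters.items()
}

# Inter-cluster distance: Euclidean distance between the two centroids.
inter = dist(centroids["A"], centroids["B"])

print(intra, inter)
```

In this toy case the distance between centroids is many times larger than either intra-cluster distance, which is the situation a good segmentation aims for.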
Variable Analysis
Variable definition
The original dataset provided by WWW consists of a sample of 10,000 customers from the last 18
months, whose attributes are described in Table 1.
Variable    Description
CUSTID      customer ID number
DAYSWUS     number of days as a customer
AGE         customer's age or imputed age
EDUC        years of education (may be imputed)
INCOME      household income (may be imputed)
KIDHOME     1=child under 13 lives at home
TEENHOME    1=child 13-19 years lives at home
FREQ        number of purchases in past 18 mo.
RECENCY     number of days since last purchase
MONETARY    total sales to this person in 18 mo.
LTV         Lifetime value of the customer
PERDEAL     % purchases bought on discount
DRYRED      % of wines that were dry red wines
SWEETRED    % sweet or semi-dry reds
DRYWH       % dry white wines
SWEETWH     % sweet or semi-dry white wines
DESSERT     % dessert wines (port, sherry, etc.)
EXOTIC      % very unusual wines
WEBPURCH    % of purchases made on website
WEBVISIT    average # visits to website per month
Note: DRYRED + SWEETRED + DRYWH + SWEETWH + DESSERT = 100%
Table 1 - Variables definition and description
Data Preparation
From a measurement-level perspective, all variables used are on an interval scale (continuous
variables that take values across a range), except for the binary variables Teenhome and
Kidhome (which have two possible values, 0 or 1) and the nominal variable Custid (which has
more than two levels, with no implied order between them). Custid was assigned the ID role
(indicating that it contains a unique identifier for each observation), whereas all the other
variables were treated as Input variables. The Rejected role specifies that a variable is not
included in the model building process (see Table 2).
Table 2 - Variables name, role and level
The summary statistics of all interval variables in Table 3 show that there are no missing
values. They also allow a first assessment of some basic characteristics of WWW's customers,
mainly from the mean values: the average customer is 48 years old, with an income of 69,904 and
17 years of education; of the usual wines, Dryred accounts for about 50% of sales. Changing the
role of the Freq variable from Input to Frequency, for instance, allows further readings: 31.1%
of purchases are made through WWW's website and the remaining 68.9% in the stores (see Table
4).
Table 3 - Interval variables summary statistics
Table 4 - Interval variables summary statistics with Freq’s variable role defined as Frequency
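Treating Freq as a frequency weight means that each customer's WebPurchase percentage counts once per purchase rather than once per customer. The Python sketch below illustrates the difference using the same formula as the SAS Guide code in Annex B (webp = WebPurchase/100 * Freq, then sum(webp)/sum(Freq)). The four customers are hypothetical toy rows, not project data.

```python
# Hypothetical toy rows: (WebPurchase in %, Freq) for four customers.
customers = [(50.0, 4), (0.0, 10), (100.0, 2), (25.0, 8)]

# Number of website purchases per customer, summed over all customers.
web_purchases = sum(web_pct / 100 * freq for web_pct, freq in customers)
total_purchases = sum(freq for _, freq in customers)

# Purchase-weighted share of website purchases, in percent.
weighted_share = web_purchases / total_purchases * 100

# Unweighted mean of the per-customer percentages, for comparison.
simple_mean = sum(web_pct for web_pct, _ in customers) / len(customers)

print(weighted_share, simple_mean)
```

Here the purchase-weighted share (25.0%) is well below the plain customer average (43.75%), because the customer with the most purchases never buys online.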
Outliers
Only the Dessert and Sweetred interval variables presented extreme values (markedly distant
from the bulk of observations) that were considered outliers (see Annex A, univariate
histograms from the SAS Miner MultiPlot node - Graphics 1 and 2). These outliers were filtered
out using the standard deviation filter with a fairly conservative cutoff of 10 standard
deviations, so that only extreme values beyond that cutoff are discarded (see Graphic 3a). As a
result, only 396 observations out of 10,000 were excluded, to avoid any negative influence on
the interpretation of later results (Graphic 3b).
Graphic 3a - Outliers filtering example (Dessert variable)
Variable    Role     Minimum     Maximum    Filter Method
Dessert     INPUT    -16.6912    30.5860    STDDEV
Sweetred    INPUT    -16.5451    30.6541    STDDEV

Number of Observations
Data Role    Filtered    Excluded    Data
TRAIN        9604        396         10000
Graphic 3b - Final result from outliers filtering of Dessert and Sweetred variables
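A standard deviation filter keeps the observations that lie within the mean plus or minus a chosen number of standard deviations. The Python sketch below shows the idea; it is not the SAS Miner Filter node itself, and the function name stddev_filter and the toy data are hypothetical. Because the limits are symmetric around the mean, the lower limit can be negative even for a percentage variable, as in the table above.

```python
from statistics import mean, stdev

def stddev_filter(values, n_std=10.0):
    """Keep values within mean +/- n_std sample standard deviations."""
    centre = mean(values)
    spread = stdev(values)
    lower, upper = centre - n_std * spread, centre + n_std * spread
    kept = [v for v in values if lower <= v <= upper]
    return kept, (lower, upper)

# Hypothetical toy data: 200 ordinary percentages plus one extreme value.
dessert = [5.0, 7.0] * 100 + [100.0]

kept, (lower, upper) = stddev_filter(dessert, n_std=10.0)
excluded = len(dessert) - len(kept)

print(f"limits: [{lower:.2f}, {upper:.2f}], excluded: {excluded}")
```

With this toy data only the single extreme value falls outside the limits, and the lower limit is negative although no observation is.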
Data Preprocessing
Despite the existence of several strongly correlated variable pairs, namely Monetary/Freq,
Monetary/LTV and WebPurchase/WebVisit, as can be seen in the correlation matrix below
(Graphic 4), it was decided to keep all the variables, so as to preserve the discriminating
power of each one for a better descriptive analysis.
Graphic 4 - Correlation matrix
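Strongly correlated pairs such as those named above can be flagged by computing the Pearson correlation for every pair of variables. The Python sketch below does this on hypothetical toy values for three of the variables. The 0.8 threshold is an assumption chosen for illustration, not a value taken from the project.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical toy values for five customers; not project data.
data = {
    "Freq": [2, 4, 6, 8, 10],
    "Monetary": [110, 190, 320, 390, 510],
    "Recency": [40, 10, 55, 20, 35],
}

THRESHOLD = 0.8  # assumed cutoff for "strongly correlated"
strong = [
    (a, b)
    for a, b in combinations(data, 2)
    if abs(pearson(data[a], data[b])) >= THRESHOLD
]

print(strong)
```

In this toy example only the Freq/Monetary pair exceeds the threshold, mirroring the kind of pair the correlation matrix highlights.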
Segmentation
In order to identify business opportunities and establish a well-defined marketing strategy,
prior knowledge of the existing customers' consumption profile and behaviour patterns is
needed. Two main segmentations were therefore implemented, one based on customer
value/engagement and the other on consumption. In this way customers are grouped into
clusters, each reflecting, as far as possible, the characteristics its customers have in
common under the given segmentation. The clustering method used was K-Means, chosen for its
generally better performance compared with the Hierarchical Clustering technique. The number
of clusters, which in a strictly descriptive analysis should balance complexity against
descriptive/discriminating ability, was determined using the Elbow graphic rule, whose goal is
to find the minimum number of clusters needed within each segmentation dataset. (The
segmentation diagram used in SAS Miner is shown in Annex A.)
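For readers unfamiliar with K-Means, the following Python sketch shows the textbook version of the algorithm (Lloyd's iteration): assign each point to its nearest centroid by Euclidean distance, move each centroid to the mean of its members, and repeat until nothing changes. It is an illustration only. The SAS Miner implementation may differ in initialization and stopping rules, and the points and starting centroids below are hypothetical.

```python
from math import dist

def k_means(points, centroids, max_iter=100):
    """Textbook Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])

# Within-cluster sum of squared distances, the quantity an elbow plot tracks.
sse = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

print(labels, centroids, sse)
```

Running K-Means for several values of K and recording this sum of squares for each gives the curve on which the Elbow rule is applied.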
Value/Engagement Segmentation
For this segmentation the selected variables were Age, Educ, Income, Dayswus, Freq, LTV,
Monetary, Perdeal, Recency, WebPurchase and WebVisit. The binary variables Teenhome and Kidhome
were not included, as the K-Means algorithm uses the Euclidean distance measure and it is
unlikely that binary data can be clustered satisfactorily. The first three variables provide
the customers' socio-economic context, whereas the remaining ones describe how (rather than
what) customers buy.
Next, to obtain the best number of clusters, the provided SAS Guide code was run and the Elbow
graphic (Graphic 5) was obtained. In the graphic, the first pronounced inflection (elbow)
occurs at 4 clusters. Thus, the K-Means algorithm was applied with K=4.
Graphic 5 - Elbow graphic for Value/Engagement segmentation clusters
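In the project the elbow was read visually from the graphic. One common way to approximate that reading numerically is to pick the K at which the improvement drops the most, i.e. where the gain from adding the K-th cluster most exceeds the gain from adding the next one. The Python sketch below applies that heuristic. The sum-of-squares values are hypothetical and are not the ones behind Graphic 5.

```python
# Hypothetical within-cluster sums of squares for K = 1..8 (not project values).
sse_by_k = {1: 1000.0, 2: 760.0, 3: 540.0, 4: 330.0,
            5: 300.0, 6: 280.0, 7: 265.0, 8: 255.0}

def elbow(sse_by_k):
    """Return the K with the largest drop-off in improvement (largest bend)."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = sse_by_k[prev_k] - sse_by_k[k]   # gain from reaching K
        gain_after = sse_by_k[k] - sse_by_k[next_k]    # gain from going past K
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

print(elbow(sse_by_k))
```

With these toy values the heuristic selects K=4: going from 3 to 4 clusters still reduces the sum of squares substantially, while going from 4 to 5 barely does.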
Cluster characterization
With the help of the SAS Miner Segment Profile node, relevant information about the four
clusters was obtained (Graphic 6 in Annex A).
Cluster 1
This cluster represents 18.96% of the observations. It consists mainly of the most recent
customers, with an average age of 64.5 years and above-average income. They spend an average
amount, buy with average frequency and do not care much for discounts or website purchases -
Mid-rank, newest.
Cluster 2
This is the biggest cluster, representing 45.89% of the observations. It is essentially the
youngest group of customers (34.4 years old on average), with average time as customers and
the lowest income of all groups. They buy the least, do not buy frequently and show the longest
elapsed time since the last purchase, although they have the highest percentage of purchases
through the website and on discount - Low-rank, new-tech.
Cluster 3
This is the smallest cluster, representing only 13.94% of the observations. It is also the
oldest group of customers (71.3 years old on average) and the wealthiest. They are
well-established customers with the highest purchase amount; they buy very frequently, do not
buy on discount and seldom use the website for purchasing - High-rank, traditional.
Cluster 4
This cluster represents 21.21% of the observations, with an average age of 54.2 years and
average income. These are very well-established customers, with the longest time as customers
of all groups. They spend an average amount, buy with average frequency and make an average
percentage of purchases through the website, and they do not seem especially attracted by
discounts - Mid-rank, longest-time.
Consumption Segmentation
For this segmentation the selected variables were Dessert, Dryred, Drywh, Exotic, Sweetred
and Sweetwh, the products available from Wonderful Wines of the World. The objective is to
understand what customers buy.
Again, to obtain the best number of clusters, the provided SAS Guide code was run and the Elbow
graphic (Graphic 7) was obtained. In the graphic, the first pronounced inflection (elbow)
occurs at 3 clusters. Thus, the K-Means algorithm was applied with K=3.
Graphic 7 - Elbow graphic for Consumption segmentation clusters
Cluster characterization
With the help of the SAS Miner Segment Profile node, relevant information about the three
clusters, resulting from the consumption segmentation, was obtained (Graphic 8 in Annex A).
Cluster 1
This cluster represents 15.47% of the observations, with Dryred at 22.0% and Drywh at 28.7%.
It is the group with the highest consumption of sweet, semi-dry or exotic (31.2%) wines - Sweet,
Semi-dry and Exotic.
Cluster 2
This cluster represents 50.09% of the observations. It is the group with the highest Dryred
consumption, 68.9% on average, which is also the highest consumption percentage of a single
type of wine in any group, in contrast with the lowest Exotic consumption. Drywh stands at
21.6% and there is no meaningful consumption of sweet or semi-dry wines - Dry red.
Cluster 3
This cluster represents 34.43% of the observations. It is the group with the highest Drywh
consumption, 41.8% on average, with Dryred at 35.1%. Its consumption of sweet or semi-dry wines
is close to the average, which is low, and its consumption of exotic wines is the second
lowest - Dry white.
Cluster Concatenation
In order to get a complete picture of the customer profiles, a cross-analysis of the clusters
formed in each of the previous segmentations is not only necessary but also a good way of
validating the resulting solution.
After running the provided SAS Guide code for joining the clusters, the results are presented
in Table 5 (Cluster_VE stands for clusters from the Value/Engagement segmentation and
Cluster_CONS for clusters from the Consumption segmentation):
Table 5 - Clusters concatenation
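Concatenating the clusters amounts to cross-tabulating the two cluster labels assigned to each customer. The Python sketch below shows the idea on six hypothetical customers. The labels are toy values, not the project's assignments, and the provided SAS Guide code may compute additional statistics.

```python
from collections import Counter

# Hypothetical cluster labels for six customers, in the same customer order.
cluster_ve = [2, 2, 1, 3, 4, 2]      # Value/Engagement segmentation
cluster_cons = [2, 1, 3, 3, 2, 2]    # Consumption segmentation

# Count customers for every (Cluster_VE, Cluster_CONS) combination.
table = Counter(zip(cluster_ve, cluster_cons))
row_totals = Counter(cluster_ve)

for (ve, cons), n in sorted(table.items()):
    share = n / row_totals[ve]
    print(f"Cluster_VE={ve} Cluster_CONS={cons} n={n} share of VE cluster={share:.0%}")
```

Reading each row share shows which consumption profile dominates within a given Value/Engagement cluster, which is how the profiles that follow are built.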
The concatenation table allows a deeper understanding of the customer profiles:
Low-rank, new-tech customers (Cluster_VE = 2)
The largest group of customers. They are young and digitally engaged, and they buy essentially
dry wines, with particular emphasis on Dry red, although in low quantities and value. They are
the group that buys the most on discount and also the group that buys the most Sweet, Semi-dry
and Exotic wines, both in percentage and in quantity.
Mid-rank, newest customers (Cluster_VE = 1)
They are the newest customers and are particularly fond of Dry white wines, although Dry red
is still the wine of choice by a good margin. They have above-average income.
Mid-rank, longest-time customers (Cluster_VE = 4)
Long-term customers with a clear preference for Dry red wines, more than any other group of
customers, so much so that all the other types of wine combined represent only 25% of their
purchases.
High-rank, traditional customers (Cluster_VE = 3)
Although the smallest group, these are WWW's best customers. They are also traditional
customers, in the sense that they are the group that makes the least use of the digital
approach, which the highest average age of all groups (71 years) may help explain. They buy
almost as much Dry white wine as Dry red, with Dry red purchases only marginally higher, and
they are the customers who buy the most Dry white wine in percentage. Their purchases of Sweet,
Semi-dry and Exotic wines are marginal (around 7%), the lowest of all groups.
Conclusion
It is clear that Dry red wine is WWW's best-selling product, followed by Dry white wine and,
at a considerable distance, Sweet and Exotic wines. Also, the so-called Low-rank customers
are the largest group, whereas the best customers are the smallest. Then there are the most
recent customers, with above-average income, who are in between.
So, based on the customer profiles obtained, any marketing strategy adopted should target at
least two main objectives: (1) lead the Low-rank customers to spend more; (2) lead the best
customers to buy more expensive wines:
1. Since the Low-rank group is the youngest and has the highest percentage of purchases
through the website and on discount, a marketing campaign should be run on social networks
announcing bigger discounts on wine packs sold online - the bigger the pack, the bigger the
discount. A smartphone app should also be in place;
2. Private wine tasting events for the most expensive Dry wines should be offered to the best
customers. Above a certain purchase amount of these expensive wines, an annual winery tour
should also be offered at a big discount. Additionally, a personalized handwritten birthday
card should be sent to these customers every year.
ANNEX A
Histograms from SAS Miner MultiPlot node
Graphic 1 - Univariate histogram for Dessert interval variable
Graphic 2 - Univariate histogram for Sweetred interval variable
Segmentation Diagram
Segmentation diagram
SAS Miner Segment profile nodes
Graphic 6 - Value/Engagement segment profile
Graphic 8 - Consumption segment profile
ANNEX B
FREQ Procedure
Picture 1 - shareMedia and Newsletter variables after PROC FREQ
MEANS Procedure
Monetary Variable Analysis
Picture 2 - Monetary variable analysis for classes shareMedia and Newsletter after PROC MEANS
Income Variable Analysis
Picture 3 - Income variable analysis for classes shareMedia and Newsletter after PROC MEANS
WebPurchase Variable Analysis
Picture 4 - WebPurchase variable analysis for classes shareMedia and Newsletter after PROC MEANS
Histograms of WebPurchase variable
Picture 5 - Histograms of WebPurchase variable analysis for classes shareMedia and Newsletter
Histograms of Freq variable
Picture 6 - Histograms of Freq variable for classes shareMedia and Newsletter
Histograms of Recency variable
Picture 7 - Histograms of Recency variable for classes shareMedia and Newsletter
Histograms of Monetary variable
Picture 8 - Histograms of Monetary variable for classes shareMedia and Newsletter
SAS Guide Code
/*Code to calculate the percentage of the total sales over the last 18 months*/
/*Calculates the number of website purchases per customer*/
data webpurchase;
set fulltable;
webp = WebPurchase/100 * Freq;
Run;
/*Calculates both website purchases and last 18 months purchases, for all customers*/
proc means data=webpurchase noprint;
var webp Freq;
output out=temp (drop=_type_ _freq_) sum(webp)= sum(Freq)= /autoname;
Run;
/*Calculates the percentage of website purchases from the last 18 months total sales*/
data temp1;
set temp;
webpurch_p=webp_Sum/Freq_Sum *100;
run;
/*Code for Picture 1*/
proc freq data=fulltable order=freq;

table Newsletter*shareMedia / list;
title 'shareMedia and Newsletter';
run;
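/*Code for Picture 2*/
/*Monetary variable analysis for classes shareMedia and Newsletter*/
proc means data=fulltable mean median sum;
class shareMedia Newsletter;
var Monetary;
title 'Monetary variable analysis';
run;
/*Code for Picture 3*/
/*Income variable analysis for classes shareMedia and Newsletter*/
/*PROC and CLASS statements assumed to mirror the Monetary step above*/
proc means data=fulltable mean median sum;
class shareMedia Newsletter;
var Income;
title 'Income variable analysis';
run;
/*Code for Picture 4*/
/*WebPurchase variable analysis for classes shareMedia and Newsletter*/
/*PROC and CLASS statements assumed to mirror the Monetary step above*/
proc means data=fulltable mean median sum;
class shareMedia Newsletter;
var WebPurchase;
title 'WebPurchase variable analysis';
run;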
proc format;
value newsformat 0="NEWSLETTER = 0"
1="NEWSLETTER = 1";
Run;
proc format;
value sharedformat 0="SHAREMEDIA = 0"
1="SHAREMEDIA = 1";
Run;
data grouping2;
set fulltable;
format Newsletter newsformat.;
format shareMedia sharedformat.;
run;
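/*Histograms of WebPurchase for classes shareMedia and Newsletter (Picture 5)*/
/*CLASS statement assumed, to split the histograms by Newsletter and shareMedia*/
proc univariate data=grouping2 noprint;
class Newsletter shareMedia;
histogram WebPurchase / normal (noprint) CTEXTSIDE=bib CTEXTTOP=blue;
title 'Histograms with Normal Distribution for WebPurchase';
run;
/*Code for Picture 6, 7 and 8*/
/*CLASS statement assumed, as above*/
proc univariate data=grouping2 noprint;
class Newsletter shareMedia;
histogram Recency Freq Monetary / normal (noprint) CTEXTSIDE=bib CTEXTTOP=blue;
title 'Histograms with Normal Distribution for Freq, Recency and Monetary';
run;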