
Master in Advanced Analytics

2017/2018

Data Mining

Final Project

Group

Student ID    Student Name
M20170606     Biazi Bayer
M20170590     Rodrigo Pupo Ribeiro
M20170366     Vasco Castela

Table of Contents
Project - Part 1
  Executive Summary
  Tools and Methodology
  Conclusion

Project - Part 2
  Executive Summary
  Tools and Methodology
  Variable Analysis
    Variable definition
    Data Preparation
    Outliers
    Data Preprocessing
  Segmentation
    Value/Engagement Segmentation
      Cluster characterization
    Consumption Segmentation
      Cluster characterization
  Cluster Concatenation
  Conclusion

ANNEX A
  Histograms from SAS Miner MultiPlot node
  Segmentation Diagram
  SAS Miner Segment profile nodes

ANNEX B
  FREQ Procedure
  MEANS Procedure
    Monetary Variable Analysis
    Income Variable Analysis
    WebPurchase Variable Analysis
  Histograms of WebPurchase variable
  Histograms of Freq variable
  Histograms of Recency variable
  Histograms of Monetary variable
  SAS Guide Code

Project - Part 1

Executive Summary
Wonderful Wines of the World (WWW) is a seven-year-old company that sells wines
through a website and ten small stores in major cities across the USA. Every six weeks
customers receive a new catalog, from which they can select among several hundred different
products.

WWW is now trying to make the most of the customer database it started about four years
ago and wants to start differentiating customers in order to develop more focused programs.
To that end, it has provided a random sample of 10,000 customers from its active database,
all of whom have purchased something from WWW in the past 18 months.

As a first step, the impact of digital engagement on WWW's business was assessed after merging
two datasets from the provided database: Value Engage and Newsletter. The former is the
main, general collection of information gathered on each customer,
whereas the latter covers only customers' digital enrollment and behaviour.
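A merge of this kind can be sketched in SAS as follows. The input dataset names valueengage and newsletter, and the use of Custid as the common key, are assumptions made for illustration. Only the output name fulltable appears in the Annex B code.

```sas
/* Sort copies of both inputs by the assumed common key */
proc sort data=valueengage out=ve_sorted;
    by Custid;
run;

proc sort data=newsletter out=nl_sorted;
    by Custid;
run;

/* One-to-one merge; keep only customers present in both datasets */
data fulltable;
    merge ve_sorted (in=in_ve) nl_sorted (in=in_nl);
    by Custid;
    if in_ve and in_nl;
run;
```

Sorting into new datasets leaves the original inputs untouched, and the IN= flags make the choice of an inner join explicit.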

Tools and Methodology

To understand whether the digital approach has an impact on the Wonderful Wines of
the World business, the class variables Newsletter and shareMedia, which identify who signed up
for a newsletter and who shared at least one post on social media, were analysed mainly against
the interval variables Recency, Frequency and Monetary, as these three variables indicate
customers' past purchase behaviour: how recently, how often and how much they bought.
The WebPurchase and Income variables were then also considered for further insight into how
the digital approach is impacting, or could impact, WWW's business. The analysis was done in SAS
Guide, using the PROC FREQ, PROC MEANS and PROC UNIVARIATE procedures.

Conclusion
In the sample dataset provided by WWW, only 5.88% of customers (588) both signed up for the
newsletter and shared at least one post on social media (digitally driven), 37.64% did one of
the two and the remaining 56.48% did neither (Annex B - Pic. 1).

The 56.48% of customers who are not digitally engaged account for 56.45% of total sales
(Monetary), while the 5.88% who are fully engaged account for only 6.13%, although each
of the latter buys more on average: 648.8 against 622.2 (Annex B - Pic. 2). Within the in-between
group, those who either shared a post or subscribed to the newsletter, the customers
who only shared a post on social media buy slightly more on average (Annex B - Pic.
2).
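The percentages above are shares of customers and of Monetary within each shareMedia/Newsletter combination. A minimal PROC SQL sketch that computes them in one pass, assuming the merged table is named fulltable as in the Annex B code (the output column names are illustrative):

```sas
proc sql;
    select shareMedia,
           Newsletter,
           count(*) as customers,
           /* share of all customers in this group */
           calculated customers
               / (select count(*) from fulltable) * 100
               as pct_customers format=6.2,
           /* share of total sales generated by this group */
           sum(Monetary)
               / (select sum(Monetary) from fulltable) * 100
               as pct_sales format=6.2,
           /* average sales per customer in this group */
           mean(Monetary) as avg_monetary format=8.1
    from fulltable
    group by shareMedia, Newsletter;
quit;
```

Comparing pct_sales with pct_customers row by row shows directly whether a group buys more or less than its size would suggest.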

It is also worth noting that the customers who both signed up for the newsletter and shared at
least one post on social media have the highest household income (Annex B -
Pic. 3). More importantly, looking at online sales alone, which represent 31.1% of total
purchases over the last 18 months (see SAS Guide code in Annex B for details), web
purchases represent a higher percentage of purchases for precisely those customers who neither
subscribed to the newsletter nor shared any post on social media (Annex B - Pic. 4 and 5), so
other factors must be driving this group of customers.

It is therefore reasonable to assume that, as of now, the digital approach does not have a meaningful
impact on the business. This is also supported by Pic. 6, 7 and 8 (Annex B), which show no major
differences between groups. However, reinforcing this approach may pay off in the future, as the
wealthier customers seem to be more receptive to it (Annex B - Pic. 3) and
are already buying more on average (Annex B - Pic. 2).

Project - Part 2

Executive Summary

Following Wonderful Wines of the World's effort to make the best use of its customer
database to grow its wine-selling business, two customer segmentations were performed
on the provided sample of 10,000 customers, in an attempt to differentiate customers
by Value/Engagement and by Consumption.

Once these customer segmentations and their respective clusters were obtained, more focused
programs were proposed to target specific customer groups, based on their resulting profiles
and potential value for the future of the business, as described throughout this project report.

Tools and Methodology

The tools used throughout the project were SAS Enterprise Guide and SAS
Enterprise Miner. With these tools, the descriptive analysis goals of this Data Mining
project were pursued in a fairly straightforward way:

1. Extract patterns that allow us to characterize subsets of the data;
2. Evaluate the potential of the data through data summaries;
3. Describe the data distribution, searching for deeper relationships across datasets.

For the descriptive analysis process a clustering model was used, consisting of organizing
objects in homogeneous groups according to their similarity. The better this grouping is, the
lower the intra-cluster and the higher the inter-cluster distances are, which, in turn, means the
better one can differentiate one cluster from the next. So, in order to attain this objective, three
stages were followed:

1. Variable analysis, to determine the set of variables over which the similarity/dissimilarity of
the entities (customer behaviour) will be assessed;
2. Segmentation, i.e. customer clustering according to the desired similarity/dissimilarity
criterion (what they buy, how they buy);
3. Cluster concatenation, i.e. customer profiling through inter-cluster relationship assessment
(Value/Engagement vs Consumption) and validation.

Variable Analysis

Variable definition

The original dataset provided by WWW is a sample of 10,000 customers who purchased in the last 18
months; its attributes are described in Table 1.

Variable    Description
CUSTID      customer ID number
DAYSWUS     number of days as a customer
AGE         customer's age or imputed age
EDUC        years of education (may be imputed)
INCOME      household income (may be imputed)
KIDHOME     1=child under 13 lives at home
TEENHOME    1=child 13-19 years lives at home
FREQ        number of purchases in past 18 mo.
RECENCY     number of days since last purchase
MONETARY    total sales to this person in 18 mo.
LTV         lifetime value of the customer
PERDEAL     % purchases bought on discount
DRYRED      % of wines that were dry red wines
SWEETRED    % sweet or semi-dry reds
DRYWH       % dry white wines
SWEETWH     % sweet or semi-dry white wines
DESSERT     % dessert wines (port, sherry, etc.)
EXOTIC      % very unusual wines
WEBPURCH    % of purchases made on website
WEBVISIT    average # visits to website per month
Note: DRYRED + SWEETRED + DRYWH + SWEETWH + DESSERT = 100%
Table 1 - Variables definition and description
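The note under Table 1 implies a simple consistency check on the five wine-type percentages. A minimal sketch, assuming the sample is in a dataset named fulltable (as in Annex B) and allowing a tolerance of one percentage point for rounding (the tolerance is an assumption):

```sas
/* Keep only customers whose wine-type percentages do not add up to ~100 */
data wine_check;
    set fulltable;
    wine_total = sum(Dryred, Sweetred, Drywh, Sweetwh, Dessert);
    if abs(wine_total - 100) > 1;   /* assumed rounding tolerance */
run;

/* An empty result means every customer satisfies the constraint */
proc print data=wine_check (obs=10);
    var Custid Dryred Sweetred Drywh Sweetwh Dessert wine_total;
run;
```

Exotic is left out of the sum because, according to the note, it is not one of the five categories that add up to 100%.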

Data Preparation

From a measurement-level perspective, all variables used are interval-scaled (continuous
variables that take values across a range), except the binary variables Teenhome and
Kidhome (two possible values, 0 or 1) and the nominal variable Custid (more than two
levels, with no implied order between them). Custid was assigned the
ID role (indicating that it contains a unique identifier for each observation),
whereas all the other variables were assigned the Input role. The Rejected role
specifies that a variable is not included in the model-building process (see Table 2).

Table 2 - Variables name, role and level

The summary statistics of all interval variables in Table 3 show that
there are no missing values. Table 3 also allows a first assessment of some basic
characteristics of WWW's customers, mainly from the mean values: the average
customer is 48 years old, with an income of 69,904 and 17 years of education;
of the usual wines, Dryred accounts for about 50% of sales. Changing the role of the Freq
variable from Input to Frequency allows further readings, for instance: 31.1% of purchases
are made through WWW's website and the remaining 68.9% in the stores (see Table
4).

Table 3 - Interval variables summary statistics

Table 4 - Interval variables summary statistics with the Freq variable role set to Frequency
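Setting Freq to the Frequency role weights each customer by the number of purchases made, so the mean of WebPurchase becomes the share of all purchases made online rather than the average per customer. Outside Enterprise Miner, the same reading can be sketched with the FREQ statement of PROC MEANS, assuming the sample is in a dataset named fulltable as in Annex B:

```sas
/* Unweighted: average web-purchase percentage per customer */
proc means data=fulltable mean maxdec=1;
    var WebPurchase;
    title 'WebPurchase - mean per customer';
run;

/* Weighted by number of purchases: share of all purchases made online */
proc means data=fulltable mean maxdec=1;
    var WebPurchase;
    freq Freq;   /* each row counts Freq times; rows with Freq < 1 are skipped */
    title 'WebPurchase - mean per purchase';
run;

title;   /* clear the title */
```

The weighted mean equals sum(WebPurchase * Freq) / sum(Freq), which is the same quantity that the Annex B code derives step by step.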

Outliers

Only the Dessert and Sweetred interval variables presented some extreme values (markedly distant
from the bulk of observations), which were considered outliers (see Annex A, univariate
histograms from the SAS Miner MultiPlot node - Graphics 1 and 2). These outliers were filtered out
using the standard deviation filter with a fairly conservative cutoff of 10 standard
deviations, so that only the extreme values above that cutoff would be discarded (see
Graphic 3a). In total, only 396 observations out of 10,000 were excluded, to avoid any possible
negative influence on the interpretation of later results (Graphic 3b).

Graphic 3a - Outliers filtering example (Dessert variable)

Variable    Role     Minimum     Maximum    Filter Method
Dessert     INPUT    -16.6912    30.5860    STDDEV
Sweetred    INPUT    -16.5451    30.6541    STDDEV

Number of Observations

Data Role    Filtered    Excluded    Data
TRAIN        9604        396         10000

Graphic 3b - Final result from outliers filtering of Dessert and Sweetred variables
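A standard deviation filter keeps an observation only if each filtered variable lies within mean ± cutoff × standard deviation. The sketch below reproduces that rule in a DATA step. It is not the Enterprise Miner Filter node itself, and the dataset names fulltable, stats, filtered and excluded are assumptions. The cutoff is left as a parameter on purpose: by Chebyshev's inequality at most 1% of observations per variable can lie 10 or more standard deviations from the mean, so the 396 exclusions reported above suggest that the effective cutoff may have been smaller than 10.

```sas
%let cutoff = 10;   /* number of standard deviations, as stated in the text */

/* Mean and standard deviation of the two filtered variables */
proc means data=fulltable noprint;
    var Dessert Sweetred;
    output out=stats (drop=_type_ _freq_) mean= std= / autoname;
run;

/* Split the sample into kept and excluded observations */
data filtered excluded;
    if _n_ = 1 then set stats;   /* statistics are retained on every row */
    set fulltable;
    if abs(Dessert - Dessert_Mean) > &cutoff * Dessert_StdDev
       or abs(Sweetred - Sweetred_Mean) > &cutoff * Sweetred_StdDev
        then output excluded;
    else output filtered;
    drop Dessert_Mean Dessert_StdDev Sweetred_Mean Sweetred_StdDev;
run;
```

Writing the excluded rows to their own dataset makes it easy to compare the resulting counts and limits with Graphic 3b.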

Data Preprocessing

Despite the existence of several strongly correlated variable pairs, namely Monetary/Freq,
Monetary/LTV and WebPurchase/WebVisit, as can be seen in the correlation matrix below
(Graphic 4), it was decided to keep all the variables, thus preserving the discriminating
power of each for a better descriptive analysis.

Graphic 4 - Correlation matrix
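The pairwise correlations mentioned above can be checked with PROC CORR. This is a minimal sketch that assumes the outlier-filtered sample is in a dataset named filtered (a hypothetical name) and looks only at the variables cited in the text:

```sas
/* Pearson correlations for the variable pairs discussed above */
proc corr data=filtered pearson nosimple;
    var Monetary Freq LTV WebPurchase WebVisit;
run;
```

Omitting the VAR statement would produce the full matrix over all numeric variables, which is closer to what Graphic 4 shows.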

Segmentation

To identify business opportunities and establish a well-defined marketing strategy
accordingly, prior knowledge of the existing customers' consumption profile and
behaviour patterns is needed. Two main segmentations were therefore implemented, one
based on customer value/engagement and the other on consumption. In this way,
customers are grouped into clusters, each reflecting, as far as possible,
characteristics that its customers share under the given segmentation. The clustering method used
was K-Means, chosen because it generally performs better than the
Hierarchical Clustering technique. The number of clusters,
which should balance complexity against descriptive/discriminating ability in a strictly
descriptive analysis, was determined using the Elbow graphic rule, whose goal is to find the
minimum number of clusters needed for each segmentation dataset. (The segmentation
diagram used in SAS Miner is shown in Annex A.)

Value/Engagement Segmentation

For this segmentation the selected variables were Age, Educ, Income, Dayswus, Freq, LTV,
Monetary, Perdeal, Recency, WebPurchase and WebVisit. The binary variables Teenhome
and Kidhome were not included, as the K-Means algorithm uses the Euclidean distance and it is
unlikely that binary data can be clustered satisfactorily with it. The first three variables provide the
customers' socio-economic context, whereas the remaining ones describe how (rather than
what) customers buy.

Next, to obtain the best number of clusters, a provided SAS Guide script was run to produce the Elbow
graphic (Graphic 5). In the graphic, the first pronounced inflection (elbow)
occurs at 4 clusters, so the K-Means algorithm was applied with K=4.

Graphic 5 - Elbow graphic for Value/Engagement segmentation clusters
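The provided elbow script is not reproduced in this report. The macro below is only a sketch of how such a curve could be obtained with PROC FASTCLUS, not the code actually used. It assumes an outlier-filtered dataset named filtered, standardizes the inputs to z-scores (an assumption, since the report does not state the scaling) and reads the overall R-squared from the OUTSTAT= dataset, where to the best of our knowledge it is stored in the OVER_ALL variable of the row with _TYPE_ = 'RSQ'.

```sas
%let ve_vars = Age Educ Income Dayswus Freq LTV Monetary Perdeal
               Recency WebPurchase WebVisit;

/* Put all inputs on a comparable scale (assumption: z-scores) */
proc stdize data=filtered out=ve_std method=std;
    var &ve_vars;
run;

%macro elbow(data=, vars=, kmax=10);
    /* Start from an empty results table */
    proc datasets library=work nolist;
        delete elbow;
    quit;

    %do k = 2 %to &kmax;
        proc fastclus data=&data maxclusters=&k maxiter=100
                      outstat=stat_k noprint;
            var &vars;
        run;

        /* Keep the overall R-squared for this number of clusters */
        data rsq_k;
            set stat_k (keep=_type_ over_all where=(_type_ = 'RSQ'));
            k = &k;
            drop _type_;
        run;

        proc append base=elbow data=rsq_k;
        run;
    %end;

    /* Elbow curve: look for the k after which the gain flattens */
    proc sgplot data=elbow;
        series x=k y=over_all / markers;
        xaxis integer label='Number of clusters';
        yaxis label='Overall R-squared';
    run;
%mend elbow;

%elbow(data=ve_std, vars=&ve_vars);
```

The same macro could be called for the Consumption segmentation by passing the wine-type variables instead. Because R-squared rises with the number of clusters, the elbow here is the point where the curve stops rising steeply.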

Cluster characterization

With the help of the SAS Miner Segment Profile node, relevant information about the four
clusters was obtained (Graphic 6 in Annex A).

Cluster 1

This cluster represents 18.96% of the observations. It contains mainly the most recent customers, with an
average age of 64.5 years and above-average income. Their purchase amount and purchase
frequency are average, and they do not care much for discounts or website purchases - Mid-rank,
newest.

Cluster 2

This is the biggest cluster, representing 45.89% of the observations. It is essentially the
youngest group of customers, 34.4 years old on average, with average time as customers and
the lowest income of all groups. They buy the least, they buy infrequently and they show the longest
elapsed time since the last purchase, although this group has the highest
percentage of purchases through the website and on discount - Low-rank, new-tech.

Cluster 3

This is the smallest cluster, representing only 13.94% of the observations. It is also the
oldest group of customers, 71.3 years old on average, and the wealthiest. They are
well-established customers with the highest purchase amount; they buy very frequently,
do not buy on discount and seldom use the website for purchasing - High-rank, traditional.

Cluster 4

This cluster represents 21.21% of the observations. Its customers are 54.2 years old on average, with
average income, and are very well established, showing the longest time as customers.
Their purchase amount, purchase frequency and share of website purchases are all average,
and they do not seem especially attracted by discounts - Mid-rank,
longest-time.
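Profiles such as the four above come from fitting K-Means with K=4 and then comparing variable means across clusters. The sketch below approximates the Segment Profile node with PROC FASTCLUS and PROC MEANS. The dataset names (filtered, ve_std, ve_out, ve_profile) and the z-score standardization are assumptions, and the cluster numbers it produces will not necessarily match the numbering used in this report.

```sas
%let ve_vars = Age Educ Income Dayswus Freq LTV Monetary Perdeal
               Recency WebPurchase WebVisit;

/* Standardize the inputs so that no variable dominates the distance */
proc stdize data=filtered out=ve_std method=std;
    var &ve_vars;
run;

/* K-Means with K=4; OUT= adds the CLUSTER variable to each observation */
proc fastclus data=ve_std maxclusters=4 maxiter=100
              out=ve_out (keep=Custid cluster) noprint;
    var &ve_vars;
run;

/* Attach the cluster label to the original (unstandardized) values */
proc sql;
    create table ve_profile as
    select c.cluster as Cluster_VE, f.*
    from filtered as f
         inner join ve_out as c
         on f.Custid = c.Custid;
quit;

/* Cluster sizes and per-cluster means on the original scale */
proc means data=ve_profile n mean maxdec=1;
    class Cluster_VE;
    var &ve_vars;
run;
```

Profiling on the original scale rather than on z-scores keeps the output directly comparable with figures such as the average ages quoted above.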

Consumption Segmentation

For this segmentation the selected variables were Dessert, Dryred, Drywh, Exotic, Sweetred
and Sweetwh, the product categories available from Wonderful Wines of the World. The objective is to
understand what customers buy.

Again, to obtain the best number of clusters, a provided SAS Guide script was run to produce the Elbow
graphic (Graphic 7). In the graphic, the first pronounced inflection (elbow)
occurs at 3 clusters, so the K-Means algorithm was applied with K=3.

Graphic 7 - Elbow graphic for Consumption segmentation clusters

Cluster characterization

With the help of the SAS Miner Segment Profile node, relevant information about the three
clusters, resulting from the consumption segmentation, was obtained (Graphic 8 in Annex A).

Cluster 1

This cluster represents 15.47% of the observations, with Dryred at 22.0% and Drywh at 28.7%. It is the
group with the highest consumption of sweet, semi-dry and exotic (31.2%) wines - Sweet, Semi-dry and Exotic.

Cluster 2

This cluster represents 50.09% of the observations. It is the highest Dryred-consuming group, at
68.9% on average, which is also the highest consumption percentage of a single type of wine across all
groups, and it has the lowest Exotic consumption. Drywh stands at 21.6% and there is no meaningful
consumption of sweet or semi-dry wines - Dry red.

Cluster 3

This cluster represents 34.43% of the observations. It is the highest Drywh-consuming group, at
41.8% on average, with Dryred at 35.1%. Its consumption of sweet or semi-dry wines is close to average, which is
low, and its consumption of exotic wines is the second lowest - Dry white.

Cluster Concatenation

To get a complete picture of the customer profiles, a cross-analysis of the clusters
obtained from the two previous segmentations is not only necessary but also a good way
of validating the resulting solution.

After running the provided SAS Guide code for joining the clusters, the results are presented in
Table 5 (Cluster_VE stands for the clusters from the Value/Engagement segmentation and
Cluster_CONS for the clusters from the Consumption segmentation):

Table 5 - Clusters concatenation
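The provided joining code is not listed in this report. A cross-tabulation like Table 5 could be sketched as below, assuming that each segmentation left a dataset holding Custid and the assigned cluster. The names ve_out and cons_out, and the variable name cluster, are hypothetical.

```sas
/* One row per customer with both cluster labels */
proc sql;
    create table concat as
    select v.Custid,
           v.cluster as Cluster_VE,
           c.cluster as Cluster_CONS
    from ve_out as v
         inner join cons_out as c
         on v.Custid = c.Custid;
quit;

/* Counts and row percentages: what each Value/Engagement cluster buys */
proc freq data=concat;
    tables Cluster_VE * Cluster_CONS / nocol nopercent;
run;
```

Row percentages answer the question "what does each Value/Engagement cluster buy?". Dropping NOCOL adds the reverse view, "who buys each wine profile?".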

From the concatenation table a deeper understanding of the customer profiles is now possible:

Low-rank, new-tech customers (Cluster_VE = 2)

The largest group of customers (45.89%). They are young and digitally engaged, and they buy
essentially dry wines, with particular emphasis on Dry red, although in low quantities and value.
They are the group that buys the most on discount, and also the group
that buys the most Sweet, Semi-dry and Exotic wines, both in percentage and quantity.

Mid-rank, newest customers (Cluster_VE = 1)

They are the newest customers. They are particularly fond of Dry white wines, although
Dry red is still the wine of choice by a good margin, and they have above-average income.

Mid-rank, longest-time customers (Cluster_VE = 4)

Long-term customers with a clear preference for Dry red wines, more than any other
group, so much so that all the other types of wine combined represent only 25% of
their purchases.

High-rank, traditional customers (Cluster_VE = 3)

Although the smallest group, these are WWW's best customers. They are also traditional
customers, in the sense that they make the least use of the digital approach,
which may be partly explained by their being the group with the highest average age (71 years old).
They buy almost as much Dry white wine as Dry red, with Dry red
purchases only marginally higher, and they are the customers who buy the most Dry white wine
in percentage terms. Their purchases of Sweet, Semi-dry and Exotic wines are marginal
(around 7%), the lowest of all groups.

Conclusion

Dry red wine is clearly WWW's best-selling product, followed by Dry white wine
and, at a considerable distance, Sweet and Exotic wines. The so-called Low-rank
customers form the largest group, whereas the best customers are the
smallest. In between are the most recent customers, who have above-average income.

Based on these customer profiles, any marketing strategy adopted should target at
least two main objectives: (1) lead the Low-rank customers to spend more; (2) lead the best
customers to buy more expensive wines:

1. Since the Low-rank group is the youngest and has the highest percentage of
purchases through the website and on discount, a marketing campaign on social
networks should announce bigger discounts on wine packs
bought online (the bigger the pack, the bigger the discount), supported by a smartphone app;
2. Private wine tasting events for the most expensive Dry wines should be offered to the
best customers. Above a certain purchase amount of these expensive wines, an
annual winery tour should also be offered at a large discount. In addition, a personalized
handwritten birthday card should be sent to these customers every year.

ANNEX A

Histograms from SAS Miner MultiPlot node

Graphic 1 - Univariate histogram for Dessert interval variable

Graphic 2 - Univariate histogram for Sweetred interval variable

Segmentation Diagram

Segmentation diagram

SAS Miner Segment profile nodes

Graphic 6 - Value/Engagement segment profile

Graphic 8 - Consumption segment profile

ANNEX B

FREQ Procedure

Picture 1 - shareMedia and Newsletter variables after PROC FREQ

MEANS Procedure

Monetary Variable Analysis

Picture 2 - Monetary variable analysis for classes shareMedia and Newsletter after PROC MEANS

Income Variable Analysis

Picture 3 - Income variable analysis for classes shareMedia and Newsletter after PROC MEANS

WebPurchase Variable Analysis

Picture 4 - WebPurchase variable analysis for classes shareMedia and Newsletter after PROC MEANS

Histograms of WebPurchase variable

Picture 5 - Histograms of WebPurchase variable for classes shareMedia and Newsletter

Histograms of Freq variable

Picture 6 - Histograms of Freq variable for classes shareMedia and Newsletter

Histograms of Recency variable

Picture 7 - Histograms of Recency variable for classes shareMedia and Newsletter

Histograms of Monetary variable

Picture 8 - Histograms of Monetary variable for classes shareMedia and Newsletter

SAS Guide Code

/*Code to calculate website purchases as a percentage of all purchases over the last 18 months*/
/*Calculates the number of website purchases per customer*/
data webpurchase;
    set fulltable;
    webp = WebPurchase/100 * Freq;
run;

/*Sums website purchases and total purchases (last 18 months) over all customers*/
proc means data=webpurchase noprint;
    var webp Freq;
    output out=temp (drop=_type_ _freq_) sum(webp)= sum(Freq)= / autoname;
run;

/*Calculates website purchases as a percentage of all purchases in the last 18 months*/
data temp1;
    set temp;
    webpurch_p = webp_Sum/Freq_Sum * 100;
run;

/*Code for Picture 1*/

proc freq data=fulltable order=freq;
    table Newsletter*shareMedia / list;
    title 'shareMedia and Newsletter';
run;

/*Code for Picture 2*/

/*Monetary variable analysis for classes shareMedia and Newsletter*/
proc means data=fulltable mean median sum;
    class shareMedia Newsletter;
    var Monetary;
    title 'Monetary variable analysis';
run;

/*Code for Picture 3*/

/*Income variable analysis for classes shareMedia and Newsletter*/
proc means data=fulltable mean median sum;
    class shareMedia Newsletter;
    var Income;
    title 'Income variable analysis';
run;

/*Code for Picture 4*/

/*WebPurchase variable analysis for classes shareMedia and Newsletter*/
proc means data=fulltable mean median sum;
    class shareMedia Newsletter;
    var WebPurchase;
    title 'WebPurchase variable analysis';
run;

/*Code for Picture 5*/

/*Display labels for the two class variables*/
proc format;
    value newsformat 0="NEWSLETTER = 0"
                     1="NEWSLETTER = 1";
run;

proc format;
    value sharedformat 0="SHAREMEDIA = 0"
                       1="SHAREMEDIA = 1";
run;

/*Attaches the labels to the class variables*/
data grouping2;
    set fulltable;
    format Newsletter newsformat.;
    format shareMedia sharedformat.;
run;

/*One histogram panel per shareMedia/Newsletter combination, with a fitted normal curve*/
proc univariate data=grouping2 noprint;
    class shareMedia Newsletter;
    histogram WebPurchase / normal (noprint) CTEXTSIDE=bib CTEXTTOP=blue;
    title 'Histograms with Normal Distribution for WebPurchase';
run;

/*Code for Picture 6, 7 and 8*/

proc univariate data=grouping2 noprint;
    class shareMedia Newsletter;
    histogram Recency Freq Monetary / normal (noprint) CTEXTSIDE=bib CTEXTTOP=blue;
    title 'Histograms with Normal Distribution for Freq, Recency and Monetary';
run;
