
BIAM 560 – Predictive Analytics

FINAL CLASS PROJECT

BANK MARKETING MODEL


Dritan Papazisi

Submitted to Professor Dr. Michael Mullas

Sunday, June 23, 2019

Table of Contents
Scenario
Executive Summary
Method and Analysis
Random Forest
Decision Tree Analysis
Artificial Neural Network
Literature Sources
Appendix 1: Data explanation
Appendix 2: Pictorial view of correlation between some of the factors considered
Appendix 3: The R Code for this Project

Scenario
You have been retained as a consultant for The Very Little Bank and Trust Company. It is about to embark
on a marketing campaign to roll out a new certificate of deposit product. It wants you to provide the
analytical guidance at each step of the way. This is a unique opportunity because most BI analysts are
called in after everything is said and done and then must make do with whatever is there. So the first
thing you do is a review of the literature to see if anything similar has been done. Lo and behold, your
efforts are rewarded. You discover a past BI effort that is similar. You decide to review it and practice on it
to gain experience and insight. Recognize you are not searching for right answers to some hypothetical
Final Exam questions here. Rather, you are polishing your BI management and analytical skills. You will
be judged accordingly. You should look at this as something more akin to a wrestling match. Your
opponent is the data set. Begin by reading and understanding the main themes in the paper. Good luck.

Executive Summary
When it comes to marketing campaigns, businesses nowadays take one of two main approaches to promote their products and services: mass campaigns that target the general public indiscriminately, or directed marketing aimed at a specific set of contacts (Laudon & Laudon, 2015). The growing number of marketing campaigns over time has reduced their effect on the public. Budget cuts, pressure to reduce overhead expenses, and competition have all raised the need to invest in campaigns with a strict and rigorous selection of contacts. The Very Little Bank and Trust Company faces the same alternatives, and it appears to be in a unique position. First, the bank possesses a considerable customer base spanning various demographics. Second, over the last few years the Bank has collected a vast amount of data from previous marketing campaigns. This study illustrates that Business Intelligence (BI) and Data Mining (DM) techniques can be a great help in fulfilling that goal. Directed marketing focuses on targets that are presumably keener on a specific product or service, which makes this kind of campaign extremely efficient. A drawback of directed marketing, however, is that it may trigger a negative attitude toward banks due to the intrusion on privacy.

This project describes an implementation of DM based on the CRISP-DM methodology. Real-world data were analyzed with the business goal of finding a model that can explain the conditions driving a successful sale of the Bank's product to existing clients, by identifying the main characteristics that affect success. This knowledge can help to better manage the available resources (e.g., human effort, phone calls, time) and to select a high-quality and affordable set of potential buying customers.

The data were analyzed using several data mining tools, and the results consistently indicate that there are specific conditions that best predict whether a customer will buy the deposit product offered. It is strongly recommended that the Bank's marketing management team target these customers first. More details are given further on; in brief, the most important conditions, ranked by their positive impact on the sale, are given in the adjacent figure.

Method and Analysis

The analysis focuses on data from previous marketing campaigns of The Very Little Bank and Trust Company. As explained elsewhere (Turban et al., 2010), BI is an umbrella term that includes architectures, tools, databases, applications, and methodologies with the goal of using data to support the decisions of business managers. DM is a BI technology that uses data-driven models to extract useful knowledge (e.g., patterns) from complex and vast data (Witten and Frank, 2005).

The CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a popular iterative methodology for increasing the success of DM projects (Chapman et al., 2000). The methodology defines a non-rigid sequence of six phases, which allow a DM model to be built and deployed in a real environment, helping to support business decisions. In general, this is an iterative process with the following phases:

1. Business Understanding: identify the goal to achieve; in our specific case, a successful bank product sale after contact with an existing customer.
2. Data Understanding: the data are given here as a .csv file accompanied by a note file fully explaining the fields to be analyzed (see Appendix 1).
3. Data Preparation: the data have concepts (what needs to be learned), instances (independent records related to an occurrence), and attributes (which characterize a specific aspect of a given instance). In this case, the methods used required the data to be in numeric or factor format.
4. Modeling: models are built using Random Forest, Decision Tree, and Neural Network techniques.
5. Evaluation: the model is evaluated against real data. Here the confusion matrix (GeeksforGeeks) and the Receiver Operating Characteristic (ROC) curve (Grace-Martin) have been used (see the sketch below).
6. Deployment: if the obtained model is not good enough to support the business, a new CRISP-DM iteration is defined; otherwise, the model is implemented in a real-time environment.
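As a minimal sketch of phases 4 and 5, the fragment below mirrors the train/test split and confusion-matrix evaluation used in Appendix 3 (the data frame prac and the outcome yy are the names used there):

library(randomForest)
library(caret)

set.seed(1234)
# Hold out roughly 30% of the records for evaluation (phase 4)
ind   <- sample(2, nrow(prac), replace = TRUE, prob = c(0.7, 0.3))
train <- prac[ind == 1, ]
test  <- prac[ind == 2, ]
model <- randomForest(yy ~ ., data = train)

# Score the held-out data and inspect the confusion matrix (phase 5)
pred <- predict(model, test)
confusionMatrix(pred, test$yy)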

Random Forest
The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a “forest”), this model uses two key concepts that give it the name random:

1. Random sampling of training data points when building trees. Each tree in a random forest learns from a random sample of the data points. At test time, predictions are made by averaging the predictions of each decision tree.
2. Random subsets of features for splitting nodes. Only a subset of all the features is considered when splitting each node in each decision tree.

The objective of a machine learning model is to generalize well to new data it has never seen before. With a single tree, as opposed to a forest, we run into issues such as overfitting: the model learns the training data well but does not distinguish the noise. The reason a decision tree is prone to overfitting when we don't limit its maximum depth is that it has unlimited flexibility: it can keep growing until it has exactly one leaf node for every single observation. This issue is overcome by the random forest, which combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of each individual tree.
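A minimal sketch of the fit, mirroring the call in Appendix 3 (ntree and the diagnostic plots are spelled out here for illustration):

library(randomForest)
set.seed(1234)
# ntree: number of trees grown; mtry (left at its default, roughly
# sqrt(p) for classification) is the size of the random feature subset
# tried at each split
rf <- randomForest(yy ~ ., data = train, ntree = 500)
plot(rf)         # OOB error versus number of trees
varImpPlot(rf)   # predictors ranked by mean decrease in Gini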

For the bank data in question, the random forest was built from 500 trees and achieves a predictive accuracy of about 90%. Based on the chart, the most impactful parameters are:

1. duration: last contact duration, in seconds (numeric)

2. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
3. balance: average yearly balance, in euros (numeric)
4. age
5. day: last contact day of the month (numeric)
6. job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")

In addition, the error does not decrease significantly once the forest contains more than 300 trees.

With the xgboost and ROCR packages, the Receiver Operating Characteristic (ROC) curve was compiled together with its thresholds. Based on the chart below, the model adequately describes the real process.
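A minimal sketch of how that curve is produced, mirroring the ROCR calls in Appendix 3 (there, pred holds the xgboost class probabilities and test.label the true classes):

library(ROCR)
# Pair the predicted probabilities with the true labels, then compute
# true-positive and false-positive rates across all thresholds
xgb.pred <- prediction(pred, test.label)
xgb.perf <- performance(xgb.pred, "tpr", "fpr")
plot(xgb.perf, avg = "threshold", colorize = TRUE,
     main = "ROC Curve w/ Thresholds")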

randomForest(formula = yy ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4

OOB estimate of error rate: 10.34%


Confusion matrix:
no yes class.error
no 2713 92 0.03279857
yes 237 139 0.63031915

MeanDecreaseGini
age 56.534807
job 51.979302
marital 16.001449
education 18.150368
default 2.057567
balance 61.866361
housing 9.229602
loan 4.929918
contact 11.809905
day 54.216374
month 80.501799
duration 180.182668
campaign 24.510517
pdays 27.386274
previous 18.463179
poutcome 39.068395

The same results can be confirmed with the xgboost package; about 300 trees appear sufficient to achieve the minimum training error.

[1] train-error:0.079220
[26] train-error:0.080478
[51] train-error:0.044326
[76] train-error:0.030179
[101] train-error:0.020748
[126] train-error:0.015718
[151] train-error:0.008174
[176] train-error:0.004715
[201] train-error:0.003144
[226] train-error:0.002201
[251] train-error:0.000943
[276] train-error:0.000629
[301] train-error:0.000000
[326] train-error:0.000000
[351] train-error:0.000000
[376] train-error:0.000000
[401] train-error:0.000000
[426] train-error:0.000000
[451] train-error:0.000000
[476] train-error:0.000000
[500] train-error:0.000000
user system elapsed
3.10 1.62 2.79

Decision Tree Analysis
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. In a nutshell, a DT learns a set of simple if-then-else decision rules from the data. The deeper the tree, the more complex the decision rules and the closer the model fits the training data.

A DT builds classification or regression models in the form of a tree structure, breaking a data set down into smaller and smaller subsets while an associated decision tree is incrementally developed. The end result is a tree with decision nodes and leaf nodes. A decision node has two or more branches; a leaf node represents a classification or decision. The topmost decision node, which corresponds to the best predictor, is called the root node. Conveniently, a DT can handle both categorical and numerical data. Pruning is the process of finding the smallest tree that fits the data, usually the tree that yields the lowest cross-validated error.
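A minimal sketch of this fit-then-prune workflow with the rpart package (train and yy follow Appendix 3; the explicit cp selection is an illustrative addition, not part of the original script):

library(rpart)
set.seed(1234)
tree <- rpart(yy ~ ., data = train)

# printcp() lists the cross-validated error (xerror) for each
# complexity value; pruning back to the cp with the lowest xerror
# yields the smallest adequate tree
printcp(tree)
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best.cp)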

Important terminology for decision trees:

Root Node: represents the entire population or sample; it gets divided into two or more homogeneous sets.
Splitting: the process of dividing a node into two or more sub-nodes.
Decision Node: a sub-node that splits into further sub-nodes.
Terminal Node (Leaf): a node that does not split.
Pruning: the process of removing the sub-nodes of a decision node; the opposite of pruning is splitting.
Branch: a sub-section of the entire tree.
Parent Node: a node that is divided into sub-nodes; the sub-nodes are called the children of the parent node.

For the bank data, the DT model was built and evaluated with the following indices.

Confusion Matrix and Statistics
Reference
Prediction no yes
no 1166 99
yes 29 46
Accuracy : 0.9045
95% CI : (0.8875, 0.9197)
No Information Rate : 0.8918
P-Value [Acc > NIR] : 0.07151
Kappa : 0.3718
Mcnemar's Test P-Value : 1.069e-09
Sensitivity : 0.9757
Specificity : 0.3172
Pos Pred Value : 0.9217
Neg Pred Value : 0.6133
Prevalence : 0.8918
Detection Rate : 0.8701
Detection Prevalence : 0.9440
Balanced Accuracy : 0.6465

'Positive' Class : no

1) root 3181 376 no (0.88179818 0.11820182)
2) duration< 645.5 2927 250 no (0.91458832 0.08541168)
4) poutcome=failure,other,unknown 2844 195 no (0.93143460 0.06856540)
8) duration< 222.5 1842 44 no (0.97611292 0.02388708) *
9) duration>=222.5 1002 151 no (0.84930140 0.15069860)
18) month=apr,aug,feb,jan,jul,jun,may,nov 954 124 no (0.87002096 0.12997904)
36) pdays< 392.5 945 116 no (0.87724868 0.12275132) *
37) pdays>=392.5 9 1 yes (0.11111111 0.88888889) *
19) month=dec,mar,oct,sep 48 21 yes (0.43750000 0.56250000)
38) day< 16.5 26 9 no (0.65384615 0.34615385) *
39) day>=16.5 22 4 yes (0.18181818 0.81818182) *
5) poutcome=success 83 28 yes (0.33734940 0.66265060)
10) duration< 163 17 3 no (0.82352941 0.17647059) *
11) duration>=163 66 14 yes (0.21212121 0.78787879) *
3) duration>=645.5 254 126 no (0.50393701 0.49606299)
6) duration< 766.5 93 33 no (0.64516129 0.35483871)
12) pdays< 132 84 25 no (0.70238095 0.29761905) *
13) pdays>=132 9 1 yes (0.11111111 0.88888889) *
7) duration>=766.5 161 68 yes (0.42236025 0.57763975)
14) month=apr,feb,jan,may,nov 88 43 no (0.51136364 0.48863636)
28) education=primary,unknown 16 3 no (0.81250000 0.18750000) *
29) education=secondary,tertiary 72 32 yes (0.44444444 0.55555556)
58) marital=married,single 63 31 yes (0.49206349 0.50793651)
116) job=blue-collar,management,self-employed,services,student 36 14 no (0.61111111 0.38888889) *
117) job=admin.,entrepreneur,technician,unemployed 27 9 yes (0.33333333 0.66666667) *
59) marital=divorced 9 1 yes (0.11111111 0.88888889) *
15) month=aug,dec,jul,jun,mar 73 23 yes (0.31506849 0.68493151) *

Artificial Neural Network
Artificial neural networks (ANNs), or connectionist systems, are systems that “learn” to perform tasks by considering examples, generally without being programmed with any task-specific rules. The architecture of an artificial neural network is a set of connected neurons organized in layers:

input layer: brings the initial data into the system for further processing by subsequent layers of artificial neurons.
hidden layer: a layer between the input and output layers, where artificial neurons take in a set of weighted inputs and produce an output through an activation function.
output layer: the last layer of neurons, which produces the program's outputs.

What is typical of this kind of computing system is that numeric data must be normalized before training.
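For example, min-max scaling maps each numeric variable onto [0, 1]; the sketch below condenses the per-column scaling and the network fit from Appendix 3 (the normalize helper is an illustrative rewrite of the repeated per-column lines there):

# Min-max normalization: x' = (x - min) / (max - min)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
numvars <- c("age", "balance", "day", "duration",
             "campaign", "pdays", "previous")
dataN[numvars] <- lapply(dataN[numvars], normalize)

library(neuralnet)
set.seed(1234)
# One hidden layer of three neurons; logistic output for the yes/no class
nn <- neuralnet(yy ~ age + balance + day + duration +
                  campaign + pdays + previous,
                data = trainN, hidden = 3, linear.output = FALSE)
plot(nn)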

[Neural network plot: input neurons age, balance, day, duration, campaign, pdays, previous; one hidden layer of three neurons; output classes no and yes. Error: 248.316845, Steps: 12893.]

Literature Sources
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000): CRISP-DM 1.0 – Step-by-step Data Mining Guide, CRISP-DM Consortium.
GeeksforGeeks (n.d.): “Confusion Matrix in Machine Learning”, https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
Grace-Martin, K. (n.d.): “What is an ROC Curve?”, https://www.theanalysisfactor.com/what-is-an-roc-curve/
Lander, J. P. (2017): R for Everyone: Advanced Analytics and Graphics. [VitalSource Bookshelf]. Retrieved from https://online.vitalsource.com/#/books/9781323582657/
Laudon, K. C. and Laudon, J. P. (2015): Management Information Systems: Managing the Digital Firm, 15th Edition. Retrieved from vbk://9781323187944
Ling, X. and Li, C. (1998): “Data Mining for Direct Marketing: Problems and Solutions”. In Proceedings of the 4th KDD Conference, AAAI Press, 73–79.
Moro, S., Laureano, R. and Cortez, P. (2011): “Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology”. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference – ESM'2011, pp. 117–121, Guimarães, Portugal, October. EUROSIS.
Turban, E., Sharda, R. and Delen, D. (2010): Decision Support and Business Intelligence Systems, 9th Edition, Prentice Hall Press, USA.
Witten, I. and Frank, E. (2005): Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Elsevier, USA.

Appendix 1: Data explanation
This dataset is publicly available for research. The details are described in Moro et al. (2011):
http://hdl.handle.net/1822/14838
http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

There are two datasets:

1) bank-full.csv, with all examples, ordered by date (from May 2008 to November 2010).
2) bank.csv, with 10% of the examples (4,521), randomly selected from bank-full.csv.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Number of instances: 45,211 for bank-full.csv (4,521 for bank.csv)
Number of attributes: 16 + output attribute.

Attribute information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical:
"admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced
or widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric,
includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
(numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical:
"unknown","other","failure","success")

Output variable (desired target):


17 - y - has the client subscribed a term deposit? (binary: "yes","no")
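To load the public file directly in R, rather than through the Excel workbook used in Appendix 3, a minimal sketch would be (assuming the semicolon separator used in the UCI distribution of this dataset):

# bank.csv is assumed to be semicolon-separated
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)
str(bank)   # expected: 4521 obs. of 17 variables, with factor y as output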

Appendix 2: Pictorial view of correlation between some of the factors considered

Appendix 3: The R Code for this Project
cat("\014")

getwd()
setwd("C:\Users\tpapazisi\OneDrive - Smithfield Foods Inc\Devry\BIAM 560")
setwd("\Wk8 Final Project")

getwd()

#-- Getting the Data & converting to proper numeric format ---------------------------------
library(readxl)
prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
                   sheet = "Data",
                   range = cell_cols("A:X"), col_names = TRUE,
                   col_types = c("numeric","text","text","text","text",
                                 "numeric","text","text","text","numeric",
                                 "text","numeric","numeric","numeric","numeric",
                                 "text","text","text","numeric","numeric","numeric","numeric","numeric","numeric"))

str(prac)

prac$job<-as.factor(prac$job)
prac$marital<-as.factor(prac$marital)
prac$education<-as.factor(prac$education)
prac$default<-as.factor(prac$default)
prac$housing<-as.factor(prac$housing)
prac$loan<-as.factor(prac$loan)
prac$contact<-as.factor(prac$contact)
prac$month<-as.factor(prac$month)
prac$poutcome<-as.factor(prac$poutcome)
prac$Gr_Age<-as.factor(prac$Gr_Age)
prac$Gr_Balance<-as.factor(prac$Gr_Balance)
prac$Gr_Duration<-as.factor(prac$Gr_Duration)
prac$Gr_pWeeks<-as.factor(prac$Gr_pWeeks)
prac$Gr_pMonths<-as.factor(prac$Gr_pMonths)
prac$Gr_Previous2<-as.factor(prac$Gr_Previous2)
prac$Gr_Previous3<-as.factor(prac$Gr_Previous3)
prac$yy<-as.factor(prac$yy)

str(prac)

#prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
#                   sheet = "Data", range = cell_cols("A:Q"), col_names = TRUE,
#                   col_types = c("numeric","text","text","text","text", "numeric","text","text","text","numeric",
#                                 "text","numeric","numeric","numeric","numeric", "text","text"))

#View(prac)
#warnings()
head(prac)
summary(prac)

str(prac)
tail(prac)
names(prac)

#colSums(is.na(prac))

library(ggplot2)

#--- Age & Balance


ggplot(prac,aes(x=prac$poutcome, y=prac$yy))+
geom_jitter(aes(shape=factor(prac$Gr_Age), color=factor(prac$Gr_Balance)),size=2.5)+

#-- Point Chart elements for gSize=20cm chart


labs ( title="Campaigns Performance")+
theme(plot.title = element_text('Calibri', 'bold', 'blue', size=14))+

labs ( subtitle="by Age and Account Balance")+


theme(plot.subtitle = element_text('Calibri', 'bold', 'darkgreen', size=10))+

labs(caption = "D40562330 - Final Project")+


theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+

labs ( tag = "Bank Marketing")+


theme(plot.tag = element_text(family="Calibri", face="bold", colour="red", size=20,hjust=1),
plot.tag.position = c(.975, .98),
)+

#-- Xaxis Label


xlab("Results of Previous Campaign") +
theme(axis.title.x = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.x = element_text('Calibri', 'plain', 'black', size=10))+
#xlim(NA,NA)+

#-- Yaxis Label


ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+

#scale_y_continuous(labels = comma_format(accuracy=0.1), limits=c(0,yMax))+

#-- Legend
theme(legend.title = element_blank(),
legend.text = element_text(color='blue', size=8, angle =0, face='plain'),

legend.box.background = element_rect(fill='white', size=0.25, linetype='solid', colour =NA),


legend.position = 'right',
legend.direction = 'vertical',
legend.justification = 'center',

legend.key = element_rect(colour = 'white', fill = 'white', size = 0.1, linetype='solid',inherit.blank=TRUE ),

legend.key.size = unit(.25, "cm"),


legend.key.height = unit(.25, "cm"),
legend.key.width = unit(.25, "cm"),

legend.spacing.x = unit(0.1, 'cm'),


legend.spacing.y = unit(.1, 'cm'),

legend.box = "vertical",
#legend.box.margin = margin(0.01, 0.01, 0.01, 0.01, "cm" ),

legend.box.spacing = unit(0.01, "cm")) +


guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

#--- Age & Employment


ggplot(prac,aes(x=prac$poutcome, y=prac$yy))+
geom_jitter(aes(shape=factor(prac$Gr_Age), color=factor(prac$job)),size=2.5)+

#-- Point Chart elements for gSize=20cm chart


labs ( title="Campaigns Performance")+
theme(plot.title = element_text('Calibri', 'bold', 'blue', size=14))+

labs ( subtitle="by Age and Employment")+


theme(plot.subtitle = element_text('Calibri', 'bold', 'darkgreen', size=10))+

labs(caption = "D40562330 - Final Project")+


theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+

labs ( tag = "Bank Marketing")+


theme(plot.tag = element_text(family="Calibri", face="bold", colour="red", size=20,hjust=1),
plot.tag.position = c(.975, .98),
)+

#-- Xaxis Label


xlab("Results of Previous Campaign") +
theme(axis.title.x = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.x = element_text('Calibri', 'plain', 'black', size=10))+
#xlim(NA,NA)+

#-- Yaxis Label
ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+

#scale_y_continuous(labels = comma_format(accuracy=0.1), limits=c(0,yMax))+

#-- Legend
theme(legend.title = element_blank(),
legend.text = element_text(color='blue', size=8, angle =0, face='plain'),

legend.box.background = element_rect(fill='white', size=0.25, linetype='solid', colour =NA),


legend.position = 'right',
legend.direction = 'vertical',
legend.justification = 'center',

legend.key = element_rect(colour = 'white', fill = 'white', size = 0.1, linetype='solid',inherit.blank=TRUE ),

legend.key.size = unit(.25, "cm"),


legend.key.height = unit(.25, "cm"),
legend.key.width = unit(.25, "cm"),

legend.spacing.x = unit(0.1, 'cm'),


legend.spacing.y = unit(.1, 'cm'),

legend.box = "vertical",
#legend.box.margin = margin(0.01, 0.01, 0.01, 0.01, "cm"),

legend.box.spacing = unit(0.01, "cm")) +


guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

#--- Age & Marital Atatus


ggplot(prac,aes(x=prac$poutcome, y=prac$yy))+
geom_jitter(aes(shape=factor(prac$Gr_Age), color=factor(prac$marital)),size=2.5)+

#-- Point Chart elements for gSize=20cm chart


labs ( title="Campaigns Performance")+
theme(plot.title = element_text('Calibri', 'bold', 'blue', size=14))+

labs ( subtitle="by Age and Marital Status")+


theme(plot.subtitle = element_text('Calibri', 'bold', 'darkgreen', size=10))+

labs(caption = "D40562330 - Final Project")+

theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+

labs ( tag = "Bank Marketing")+


theme(plot.tag = element_text(family="Calibri", face="bold", colour="red", size=20,hjust=1),
plot.tag.position = c(.975, .98),
)+

#-- Xaxis Label


xlab("Results of Previous Campaign") +
theme(axis.title.x = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.x = element_text('Calibri', 'plain', 'black', size=10))+
#xlim(NA,NA)+

#-- Yaxis Label


ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+

#scale_y_continuous(labels = comma_format(accuracy=0.1), limits=c(0,yMax))+

#-- Legend
theme(legend.title = element_blank(),
legend.text = element_text(color='blue', size=8, angle =0, face='plain'),

legend.box.background = element_rect(fill='white', size=0.25, linetype='solid', colour =NA),


legend.position = 'right',
legend.direction = 'vertical',
legend.justification = 'center',

legend.key = element_rect(colour = 'white', fill = 'white', size = 0.1, linetype='solid',inherit.blank=TRUE ),

legend.key.size = unit(.25, "cm"),


legend.key.height = unit(.25, "cm"),
legend.key.width = unit(.25, "cm"),

legend.spacing.x = unit(0.1, 'cm'),


legend.spacing.y = unit(.1, 'cm'),

legend.box = "vertical",
#legend.box.margin = margin(0.01, 0.01, 0.01, 0.01, "cm"),

legend.box.spacing = unit(0.01, "cm")) +


guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

#--- Education & Employment


ggplot(prac,aes(x=prac$poutcome, y=prac$yy))+

geom_jitter(aes(shape=factor(prac$education), color=factor(prac$job)),size=2.5)+

#-- Point Chart elements for gSize=20cm chart


labs ( title="Campaigns Performance")+
theme(plot.title = element_text('Calibri', 'bold', 'blue', size=14))+

labs ( subtitle="by Education and Employment")+


theme(plot.subtitle = element_text('Calibri', 'bold', 'darkgreen', size=10))+

labs(caption = "D40562330 - Final Project")+


theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+

labs ( tag = "Bank Marketing")+


theme(plot.tag = element_text(family="Calibri", face="bold", colour="red", size=20,hjust=1),
plot.tag.position = c(.975, .98),
)+

#-- Xaxis Label


xlab("Results of Previous Campaign") +
theme(axis.title.x = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.x = element_text('Calibri', 'plain', 'black', size=10))+
#xlim(NA,NA)+

#-- Yaxis Label


ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+

#scale_y_continuous(labels = comma_format(accuracy=0.1), limits=c(0,yMax))+

#-- Legend
theme(legend.title = element_blank(),
legend.text = element_text(color='blue', size=8, angle =0, face='plain'),

legend.box.background = element_rect(fill='white', size=0.25, linetype='solid', colour =NA),


legend.position = 'right',
legend.direction = 'vertical',
legend.justification = 'center',

legend.key = element_rect(colour = 'white', fill = 'white', size = 0.1, linetype='solid',inherit.blank=TRUE ),

legend.key.size = unit(.25, "cm"),


legend.key.height = unit(.25, "cm"),
legend.key.width = unit(.25, "cm"),

legend.spacing.x = unit(0.1, 'cm'),


legend.spacing.y = unit(.1, 'cm'),

legend.box = "vertical",

#legend.box.margin = margin(0.01, 0.01, 0.01, 0.01, "cm"),

legend.box.spacing = unit(0.01, "cm")) +


guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

#--- Education & marital


ggplot(prac,aes(x=prac$poutcome, y=prac$yy))+
geom_jitter(aes(shape=factor(prac$education), color=factor(prac$marital)),size=2.5)+

#-- Point Chart elements for gSize=20cm chart


labs ( title="Campaigns Performance")+
theme(plot.title = element_text('Calibri', 'bold', 'blue', size=14))+

labs ( subtitle="by Education and Marital Status")+


theme(plot.subtitle = element_text('Calibri', 'bold', 'darkgreen', size=10))+

labs(caption = "D40562330 - Final Project")+


theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+

labs ( tag = "Bank Marketing")+


theme(plot.tag = element_text(family="Calibri", face="bold", colour="red", size=20,hjust=1),
plot.tag.position = c(.975, .98),
)+

#-- Xaxis Label


xlab("Results of Previous Campaign") +
theme(axis.title.x = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.x = element_text('Calibri', 'plain', 'black', size=10))+
#xlim(NA,NA)+

#-- Yaxis Label


ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+

#scale_y_continuous(labels = comma_format(accuracy=0.1), limits=c(0,yMax))+

#-- Legend
theme(legend.title = element_blank(),
legend.text = element_text(color='blue', size=8, angle =0, face='plain'),

legend.box.background = element_rect(fill='white', size=0.25, linetype='solid', colour =NA),


legend.position = 'right',
legend.direction = 'vertical',
legend.justification = 'center',

legend.key = element_rect(colour = 'white', fill = 'white', size = 0.1, linetype='solid',inherit.blank=TRUE ),

legend.key.size = unit(.25, "cm"),
legend.key.height = unit(.25, "cm"),
legend.key.width = unit(.25, "cm"),

legend.spacing.x = unit(0.1, 'cm'),


legend.spacing.y = unit(.1, 'cm'),

legend.box = "vertical",
#legend.box.margin = margin(0.01, 0.01, 0.01, 0.01, "cm"),

legend.box.spacing = unit(0.01, "cm")) +


guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

#--- Duration & Contacts


ggplot(prac,aes(x=prac$poutcome, y=prac$yy))+
geom_jitter(aes(shape=factor(prac$Gr_Previous3), color=factor(prac$Gr_Duration)),size=2.5)+

#-- Point Chart elements for gSize=20cm chart


labs ( title="Campaigns Performance")+
theme(plot.title = element_text('Calibri', 'bold', 'blue', size=14))+

labs ( subtitle="by Duration and Previous Contacts")+


theme(plot.subtitle = element_text('Calibri', 'bold', 'darkgreen', size=10))+

labs(caption = "D40562330 - Final Project")+


theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+

labs ( tag = "Bank Marketing")+


theme(plot.tag = element_text(family="Calibri", face="bold", colour="red", size=20,hjust=1),
plot.tag.position = c(.975, .98),
)+

#-- Xaxis Label


xlab("Results of Previous Campaign") +
theme(axis.title.x = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.x = element_text('Calibri', 'plain', 'black', size=10))+
#xlim(NA,NA)+

#-- Yaxis Label


ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+

#scale_y_continuous(labels = comma_format(accuracy=0.1), limits=c(0,yMax))+

#-- Legend
theme(legend.title = element_blank(),
legend.text = element_text(color='blue', size=8, angle =0, face='plain'),

legend.box.background = element_rect(fill='white', size=0.25, linetype='solid', colour =NA),


legend.position = 'right',
legend.direction = 'vertical',
legend.justification = 'center',

legend.key = element_rect(colour = 'white', fill = 'white', size = 0.1, linetype='solid',inherit.blank=TRUE ),

legend.key.size = unit(.25, "cm"),


legend.key.height = unit(.25, "cm"),
legend.key.width = unit(.25, "cm"),

legend.spacing.x = unit(0.1, 'cm'),


legend.spacing.y = unit(.1, 'cm'),

legend.box = "vertical",
#legend.box.margin = margin(0.01, 0.01, 0.01, 0.01, "cm"),

legend.box.spacing = unit(0.01, "cm")) +


guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

#-- Getting the Data & converting to proper numeric format ---------------------------------
library(readxl)
prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
                   sheet = "Data",
                   range = cell_cols("A:Q"), col_names = TRUE,
                   col_types = c("numeric","text","text","text","text",
                                 "numeric","text","text","text","numeric",
                                 "text","numeric","numeric","numeric","numeric",
                                 "text","text"))

str(prac)

prac$job<-as.factor(prac$job)
prac$marital<-as.factor(prac$marital)

prac$education<-as.factor(prac$education)
prac$default<-as.factor(prac$default)
prac$housing<-as.factor(prac$housing)
prac$loan<-as.factor(prac$loan)
prac$contact<-as.factor(prac$contact)
prac$month<-as.factor(prac$month)
prac$poutcome<-as.factor(prac$poutcome)

prac$yy<-as.factor(prac$yy)

str(prac)

#View(prac)
#warnings()
head(prac)
summary(prac)

str(prac)
tail(prac)
names(prac)

#'----------------------------------
#' Split Data
#'----------------------------------
set.seed(1234)
ind<-sample(2,nrow(prac),replace=TRUE,prob=c(0.7,0.3))
train<-prac[ind==1,]
test<-prac[ind==2,]

str(train)
summary(train)
train<-train[rowSums(is.na(train))==0,]

summary(test)

#'----------------------------------
#' Random Forest
#'----------------------------------
library(randomForest)
set.seed(1234)

rf2<-randomForest(yy ~ ., data = train)


print(rf2)
varImpPlot(rf2)

#'----------------------------------
#' Decision Tree
#'----------------------------------

#---How to tell if R is 64Bit version


isR64 <- function() .Machine[['sizeof.pointer']] == 8L
isR64()

#---How to tell if Java Runtime is 64Bit version


system('java -version')

#---How to tell if JAVA_HOME is set


Sys.getenv('JAVA_HOME')
Sys.setenv(JAVA_HOME="C:\\Program Files\\Java\\jre1.8.0_192")

sessionInfo()

#install.packages("rJava")
library(rJava)

#install.packages("FSelector")
require(RWekajars)
require(FSelector)
library(FSelector)

library(rpart)

#install.packages("caret")
require(gower)
require(caret)
library(caret)

library(rpart.plot)
library(dplyr)
library(xlsx)

#install.packages("downloader")
require(downloader)
#install.packages("influenceR")
require(influenceR)
#install.packages("rgexf")
require(rgexf)
#install.packages("data.tree")
library(data.tree)

#install.packages("caTools")
require(caTools)
library(caTools)

library(ElemStatLearn)

set.seed(1234)

tree<-rpart(yy ~ ., data = train)


tree.yy.predicted <- predict(tree, test, type = 'class')
confusionMatrix(tree.yy.predicted, test$yy)
prp(tree)

rf1<-tree

tree
print(rf1)

plot(train$duration)
hist(train$duration)
attributes(rf2)
attributes(rf1)
rf2$confusion

library(caret)
p1<-predict(rf2,train)
head(p1)
confusionMatrix(p1,train$yy)

p2<-predict(rf2,test)
head(p2)
confusionMatrix(p2,test$yy)

plot(rf2)

importance(rf2)
varImpPlot(rf2, sort=T,n.var=10,main="Top 10 Variable Importance")

#'----------------------------------
#' Neural Network
#'----------------------------------
getwd()
dataN<-prac
str(dataN)

#-- Min-Max Normalization for all num variables


dataN$age<-(dataN$age-min(dataN$age))/(max(dataN$age)-min(dataN$age))
dataN$balance<-(dataN$balance-min(dataN$balance))/(max(dataN$balance)-min(dataN$balance))
dataN$day<-(dataN$day-min(dataN$day))/(max(dataN$day)-min(dataN$day))
dataN$duration<-(dataN$duration-min(dataN$duration))/(max(dataN$duration)-min(dataN$duration))
dataN$campaign<-(dataN$campaign-min(dataN$campaign))/(max(dataN$campaign)-min(dataN$campaign))
dataN$pdays<-(dataN$pdays-min(dataN$pdays))/(max(dataN$pdays)-min(dataN$pdays))
dataN$previous<-(dataN$previous-min(dataN$previous))/(max(dataN$previous)-min(dataN$previous))

require(e1071)
library(e1071)

#install.packages("ggvis")
require(ggvis)
library(ggvis)

require(class)
library(class)

require(gmodels)
library(gmodels)

require(neuralnet)
library(neuralnet)

require(nnet)
library(nnet)

#-- Data Partition


set.seed(1234)
ind<-sample(2,nrow(dataN),replace=TRUE,prob=c(0.7,0.3))
trainN<-dataN[ind==1,]
testN<-dataN[ind==2,]

#-- Neural Network


library(neuralnet)

set.seed(1234)
nn <- neuralnet(yy ~ age + balance + day + duration + campaign + pdays + previous,
                data = trainN, hidden = 3, linear.output = FALSE)
plot(nn)

#-- Prediction
output<-predict(nn,trainN[,-17])
head(output)
head(trainN[1,])

names(nn)

#-- Confusion matrix: convert the class probabilities returned by the
#-- network into predicted classes before comparing with the true labels
pred.class <- factor(levels(trainN$yy)[max.col(output)],
                     levels = levels(trainN$yy))
confusionMatrix(pred.class, trainN$yy)

#-- Getting the Data & converting to proper numeric format ---------------------------------
library(readxl)
prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
                   sheet = "Data",
                   range = cell_cols("A:Q"), col_names = TRUE,
                   col_types = c("numeric","text","text","text","text",
                                 "numeric","text","text","text","numeric",
                                 "text","numeric","numeric","numeric","numeric",
                                 "text","text"))

str(prac)

prac$job<-as.factor(prac$job)
prac$marital<-as.factor(prac$marital)
prac$education<-as.factor(prac$education)
prac$default<-as.factor(prac$default)
prac$housing<-as.factor(prac$housing)
prac$loan<-as.factor(prac$loan)
prac$contact<-as.factor(prac$contact)
prac$month<-as.factor(prac$month)
prac$poutcome<-as.factor(prac$poutcome)

prac$yy<-as.factor(prac$yy)

str(prac)

#'-- A sparse matrix is a memory-efficient way to represent a large dataset
#'that holds many zeros. We're going to use the Matrix package to convert our
#'data frame to a sparse matrix and turn all our factor (categorical) features
#'into dummy variables in one step.

train.label <- train$yy
test.label  <- test$yy

# Load the Matrix package
library(Matrix)
# Create sparse matrices and perform one-hot encoding to create dummy variables
dtrain <- sparse.model.matrix(yy ~ . - 1, data = train)
dtest  <- sparse.model.matrix(yy ~ . - 1, data = test)

# View the number of rows and features of each set


dim(dtrain)
dim(dtest)

#'---- Training The Model


#'For simplicity's sake, we'll use the hyperparameters below, but
#'you can improve the performance of your model by tuning them with the caret package.
#'We're using objective = "binary:logistic" because this is logistic regression for a binary classification problem.
#'We're also using eval_metric = "error", which is used for classification problems.
#'You can learn about all available options in the XGBoost documentation.
# Load the XGBoost package

library(xgboost)

# Set our hyperparameters


param <- list(objective = "binary:logistic",
eval_metric = "error",
max_depth = 7,
eta = 0.1,
gamma = 1,
colsample_bytree = 0.5,
min_child_weight = 1)

set.seed(1234)

# Pass in our hyperparameteres and train the model


system.time(xgb <- xgboost(params = param,
data = dtrain,
label = as.numeric(train.label)-1,
nrounds = 500,
print_every_n = 25,
verbose = 1))

# Create our prediction probabilities


pred <- predict(xgb, dtest)
str(pred)
# Set our cutoff threshold
pred.resp <- ifelse(pred >= 0.86, 1, 0)
str(pred.resp)
str(test.label)
test.resp<-as.numeric(test.label)-1
str(test.resp)
str(pred.resp)
summary(test.label)
# Create the confusion matrix
confusionMatrix(as.factor(pred.resp), as.factor(test.resp))

#'--- Determine Which Features Are Most Important


#'Now, we can view which features had the greatest impact on predictive performance.
# Get the trained model
model <- xgb.dump(xgb, with_stats=TRUE)

# Get the feature real names


names <- dimnames(dtrain)[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model=xgb)[1:20] # View top 20 most important features

# Plot
xgb.plot.importance(importance_matrix)

#'------------------------------------------------
#' Plotting The ROC To View Various Thresholds
#'------------------------------------------------
#'An ROC curve allows us to visualize our model's performance when selecting different thresholds.
#'The threshold value is indicated by the dots on the curved line.
#'Each dot lets us view the average true positive rate and average false positive rate
#'for each threshold. As the threshold value gets lower, the average true positive rate gets higher.
#'However, the average false positive rate gets higher as well.
#'It's important to select a threshold that provides an acceptable true positive rate
#'while also limiting the false positive rate.
#'https://en.wikipedia.org/wiki/Receiver_operating_characteristic.

library(ROCR)
# Use ROCR package to plot ROC Curve
xgb.pred <- prediction(pred, test.label)
xgb.perf <- performance(xgb.pred, "tpr", "fpr")

plot(xgb.perf,
avg="threshold",
colorize=TRUE,
lwd=1,
main="ROC Curve w/ Thresholds",
print.cutoffs.at=seq(0, 1, by=0.05),
text.adj=c(-0.5, 0.5),
text.cex=0.5)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")

library(readr)
library(stringr)
library(caret)
library(car)

# Let's start by finding out what the actual tree looks like
model <- xgb.dump(xgb, with_stats = T)
model[1:10] #This statement prints top 10 nodes of the model

# Get the feature real names

names <- dimnames(data.matrix(prac[,-1]))[[2]]
names
xgb
# Compute feature importance matrix
importance_matrix <- xgb.importance( model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])

#'--- Testing whether the results make sense

#' A chi-square test of association between the actual and the predicted
#' responses checks that the model's predictions are related to the true
#' outcomes beyond what chance would allow.

chi <- chisq.test(test.resp, pred.resp)   # avoid overwriting the 'test' data frame

print(chi)

