Table of Contents
Scenario
Executive Summary
Method and Analysis
Random Forest
Decision tree analysis
Artificial Neural Network
Literature Sources
Appendix 1: Data explanation
Appendix 2: Pictorial view of correlation between some of the factors considered
Appendix 3: The R Code for this Project
Scenario
You have been retained as a consultant for The Very Little Bank and Trust Company. It is about to embark
on a marketing campaign to roll out a new certificate of deposit product. It wants you to provide the
analytical guidance at each step of the way. This is a unique opportunity because most BI analysts are
called in after everything is said and done and then must make do with whatever is there. So the first
thing you do is a review of the literature to see if anything similar has been done. Lo and behold, your
efforts are rewarded. You discover a past BI effort that is similar. You decide to review it and practice on it
to gain experience and insight. Recognize you are not searching for right answers to some hypothetical
Final Exam questions here. Rather, you are polishing your BI management and analytical skills. You will
be judged accordingly. You should look at this as something more akin to a wrestling match. Your
opponent is the data set. Begin by reading and understanding the main themes in the paper. Good luck.
Executive Summary
When it comes to marketing campaigns, businesses nowadays take one of two main
approaches to promote their products and services: mass campaigns targeting the general,
indiscriminate public, or directed marketing aimed at a specific set of contacts (Laudon & Laudon, 2015). The
growing number of marketing campaigns over time has reduced their effect on the
public. At the same time, budget cuts, pressure to reduce overhead expenses, and competition
have raised the need to invest in campaigns with a strict and rigorous selection of
contacts. The Very Little Bank and Trust Company faces the same alternatives, and it appears to
be in a unique position. First, the bank possesses a considerable customer base of various demographics.
Second, over the last few years the Bank has collected a vast amount of data from its previous marketing
campaigns. This study illustrates that Business Intelligence (BI) and Data Mining (DM) techniques
can be a great help in pursuing that goal. Directed marketing focuses on targets that presumably will be
keener on the specific product or service, making this kind of campaign extremely efficient. However, a
drawback of directed marketing is that it may trigger a negative attitude towards banks due to the
invasion of privacy.
This project describes an implementation of DM based on the CRISP-DM methodology. Real-world
data were analyzed with the business goal of finding a model that explains the conditions
driving successful sales of the Bank's product to existing clients, by identifying the main characteristics that
affect success. This knowledge may help to better manage the available resources (e.g. human effort,
phone calls, time) and to select a high-quality, affordable set of potential buying customers.
Method and Analysis
Data are analyzed using several Data Mining tools, and the results consistently indicate that
specific conditions are more predictive of whether a customer will buy the bonds offered. It is
strongly recommended that the Bank's Marketing Management team target and work with these
customers first. More details are given further on; in short, the most important conditions, ranked
by their positive impact on the sale, are given in the adjacent figure.
Following the CRISP-DM methodology (Chapman et al., 2000), the work proceeded through six phases:
1. Business Understanding: identifying the goal to achieve, in our specific case a successful
Bank product sale after contact with an existing customer.
2. Data Understanding: the data are given as a .csv file, together with a note file fully
explaining the data to be analyzed (see Appendix 1).
3. Data Preparation: the data have concepts (what needs to be learned), instances (independent records
related to an occurrence) and attributes (which characterize a specific aspect of a given
instance). In this case the methods we used required the data to be in numeric or factor format.
4. Modeling: models are built using Random Forest, Decision Tree and Neural Network techniques.
5. Model Evaluation with real data. In this case the confusion matrix (GeeksforGeeks) and the
Receiver Operating Characteristic (ROC) curve (Grace-Martin) were used.
6. Deployment: if the obtained model is not good enough to support the business, a new
CRISP-DM iteration is defined; otherwise, the model is implemented in a real-time environment.
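The evaluation step can be sketched in base R: a confusion matrix cross-tabulates predictions against actual outcomes, and accuracy is the share of correct predictions (the labels below are toy values, not the Bank's data):

```r
# Illustrative predicted vs. actual labels (toy data, not the Bank's)
actual    <- factor(c("no", "no", "yes", "no", "yes", "no", "yes", "no"))
predicted <- factor(c("no", "no", "yes", "no", "no", "no", "yes", "yes"))

# Confusion matrix: rows = predictions, columns = reference values
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

# Accuracy = correct predictions (the diagonal) / total predictions
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 6 of 8 correct, i.e. 0.75
```

The caret package's confusionMatrix(), used later in the appendix, computes the same table plus sensitivity, specificity and related statistics.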
Random Forest
The random forest is a model made up of many decision trees. Rather than simply averaging the
predictions of the trees (which we could call a "forest"), the model uses two key concepts that give it the
name random:
1. Random sampling of training data points when building trees. When training, each tree in a
random forest learns from a random sample of the data points. At test time, predictions are
made by averaging the predictions of each decision tree.
2. Random Subsets of features for splitting nodes. The other main concept in the random forest is
that only a subset of all the features are considered for splitting each node in each decision tree.
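The two sources of randomness above can be sketched in base R on a toy data frame (illustrative only; the randomForest package performs both steps internally):

```r
set.seed(1234)
# Toy stand-in for the training set (8 features, 100 rows; not the Bank's data)
toy <- data.frame(matrix(rnorm(100 * 8), nrow = 100))
names(toy) <- paste0("x", 1:8)

# 1. Bootstrap sample: each tree trains on rows drawn with replacement
boot_rows <- sample(nrow(toy), replace = TRUE)
tree_data <- toy[boot_rows, ]

# 2. Feature subset: each split considers only mtry of the 8 features
mtry <- 4
split_features <- sample(names(toy), mtry)

nrow(tree_data)         # 100 rows, some duplicated
length(split_features)  # 4 candidate features at this split
```

With set.seed(1234) the draws are reproducible; randomForest repeats both steps for every one of its trees (500 in the fitted model below).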
The objective of a machine learning model is to generalize well to new data it has never seen before.
With a single tree, as opposed to a forest, we run into issues such as overfitting: the model learns the
training data well without distinguishing the noise. A decision tree is prone to overfitting when we
don't limit its maximum depth because it has unlimited flexibility, meaning that it can keep growing
until it has exactly one leaf node for every single observation. This issue is overcome by the random
forest, which by nature combines hundreds or thousands of decision trees, trains each one on a
slightly different set of the observations, and splits nodes in each tree considering only a limited
number of the features. The final predictions of the random forest are made by averaging the
predictions of each individual tree.
For the specific bank data, the random forest was built from 500 trees and achieves a predictive
accuracy of about 90%. Based on the importance chart, the most impactful parameters are:
1. duration: duration of the last contact (numeric)
2. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
3. balance: average yearly balance, in euros (numeric)
4. age: age in years (numeric)
5. day: last contact day of the month (numeric)
6. job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
In addition, the error does not decrease significantly once the forest has more than 300 trees.
Using the xgboost and ROCR packages, the Receiver Operating Characteristic curve with
thresholds was compiled. Based on the chart below, the model adequately describes the real process.
randomForest(formula = yy ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
MeanDecreaseGini
age 56.534807
job 51.979302
marital 16.001449
education 18.150368
default 2.057567
balance 61.866361
housing 9.229602
loan 4.929918
contact 11.809905
day 54.216374
month 80.501799
duration 180.182668
campaign 24.510517
pdays 27.386274
previous 18.463179
poutcome 39.068395
The same results can be confirmed with the xgboost package.
A forest of about 300 trees seems sufficient to reach the minimum training error:
[1] train-error:0.079220
[26] train-error:0.080478
[51] train-error:0.044326
[76] train-error:0.030179
[101] train-error:0.020748
[126] train-error:0.015718
[151] train-error:0.008174
[176] train-error:0.004715
[201] train-error:0.003144
[226] train-error:0.002201
[251] train-error:0.000943
[276] train-error:0.000629
[301] train-error:0.000000
[326] train-error:0.000000
[351] train-error:0.000000
[376] train-error:0.000000
[401] train-error:0.000000
[426] train-error:0.000000
[451] train-error:0.000000
[476] train-error:0.000000
[500] train-error:0.000000
user system elapsed
3.10 1.62 2.79
Decision tree analysis
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and
regression. In a nutshell, a DT learns a set of if-then-else decision rules from the data to approximate
the target. The deeper the tree, the more complex the decision rules and the fitter the model.
A DT builds classification or regression models in the form of a tree structure, breaking down a data set
into smaller and smaller subsets while at the same time an associated decision tree is incrementally
developed. The end result is a tree with decision nodes and leaf nodes. A decision node has two or more
branches; a leaf node represents a classification or decision. The topmost decision node in the tree, which
corresponds to the best predictor, is called the root node. Conveniently, DTs can handle both categorical
and numerical data. The goal is to find the smallest tree that fits the data; usually this is the tree
that yields the lowest cross-validated error.
For the Bank data, the DT model was built and evaluated with the following indexes.
Confusion Matrix and Statistics
Reference
Prediction no yes
no 1166 99
yes 29 46
Accuracy : 0.9045
95% CI : (0.8875, 0.9197)
No Information Rate : 0.8918
P-Value [Acc > NIR] : 0.07151
Kappa : 0.3718
Mcnemar's Test P-Value : 1.069e-09
Sensitivity : 0.9757
Specificity : 0.3172
Pos Pred Value : 0.9217
Neg Pred Value : 0.6133
Prevalence : 0.8918
Detection Rate : 0.8701
Detection Prevalence : 0.9440
Balanced Accuracy : 0.6465
'Positive' Class : no
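The reported statistics follow directly from the four cells of the matrix above, as a quick base-R check confirms ('no' is the positive class):

```r
# Cells copied from the confusion matrix above ("no" is the positive class)
TP <- 1166  # predicted no,  actually no
FP <- 99    # predicted no,  actually yes
FN <- 29    # predicted yes, actually no
TN <- 46    # predicted yes, actually yes

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)   # how well "no" is caught
specificity <- TN / (TN + FP)   # how well "yes" is caught

round(c(accuracy, sensitivity, specificity), 4)  # 0.9045 0.9757 0.3172
```

The low specificity (0.3172) means the model identifies non-buyers well but misses most actual buyers, a consequence of the heavy class imbalance (the prevalence of "no" is 0.8918).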
1) root 3181 376 no (0.88179818 0.11820182)
2) duration< 645.5 2927 250 no (0.91458832 0.08541168)
4) poutcome=failure,other,unknown 2844 195 no (0.93143460 0.06856540)
8) duration< 222.5 1842 44 no (0.97611292 0.02388708) *
9) duration>=222.5 1002 151 no (0.84930140 0.15069860)
18) month=apr,aug,feb,jan,jul,jun,may,nov 954 124 no (0.87002096 0.12997904)
36) pdays< 392.5 945 116 no (0.87724868 0.12275132) *
37) pdays>=392.5 9 1 yes (0.11111111 0.88888889) *
19) month=dec,mar,oct,sep 48 21 yes (0.43750000 0.56250000)
38) day< 16.5 26 9 no (0.65384615 0.34615385) *
39) day>=16.5 22 4 yes (0.18181818 0.81818182) *
5) poutcome=success 83 28 yes (0.33734940 0.66265060)
10) duration< 163 17 3 no (0.82352941 0.17647059) *
11) duration>=163 66 14 yes (0.21212121 0.78787879) *
3) duration>=645.5 254 126 no (0.50393701 0.49606299)
6) duration< 766.5 93 33 no (0.64516129 0.35483871)
12) pdays< 132 84 25 no (0.70238095 0.29761905) *
13) pdays>=132 9 1 yes (0.11111111 0.88888889) *
7) duration>=766.5 161 68 yes (0.42236025 0.57763975)
14) month=apr,feb,jan,may,nov 88 43 no (0.51136364 0.48863636)
28) education=primary,unknown 16 3 no (0.81250000 0.18750000) *
29) education=secondary,tertiary 72 32 yes (0.44444444 0.55555556)
58) marital=married,single 63 31 yes (0.49206349 0.50793651)
116) job=blue-collar,management,self-employed,services,student 36 14 no (0.61111111 0.38888889) *
117) job=admin.,entrepreneur,technician,unemployed 27 9 yes (0.33333333 0.66666667) *
59) marital=divorced 9 1 yes (0.11111111 0.88888889) *
15) month=aug,dec,jul,jun,mar 73 23 yes (0.31506849 0.68493151) *
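The printed tree above translates directly into if-then-else rules. As a sketch, the upper branches become the following function (split values are copied from the printout; the helper name is ours, and the deeper splits are omitted):

```r
# The upper part of the printed tree, written as explicit if-then-else rules
# (split values 645.5, 222.5, 163 and the poutcome test come from the rpart
#  printout above; deeper splits are omitted for brevity)
predict_top_rule <- function(duration, poutcome) {
  if (duration < 645.5) {
    if (poutcome %in% c("failure", "other", "unknown")) {
      return("no")                        # nodes 8/9: majority "no" on this side
    } else {                              # poutcome == "success"
      if (duration < 163) return("no")    # node 10
      return("yes")                       # node 11: 78.8% "yes"
    }
  }
  "no"  # node 3: slim majority "no"; nodes 6/7 refine this further
}

predict_top_rule(100, "failure")  # "no"
predict_top_rule(300, "success")  # "yes"
```

Reading a fitted rpart object this way shows why call duration and previous-campaign outcome dominate the model's decisions.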
Artificial Neural Network
Artificial neural networks (ANNs), or connectionist systems, are systems that "learn" to perform tasks by
considering examples, generally without being programmed with any task-specific rules. The
architecture of an Artificial Neural Network is a set of connected neurons organized in layers:
input layer: the neurons that receive the predictor values;
hidden layer(s): intermediate neurons that transform the inputs;
output layer: the last layer of neurons that produces the outputs of the network.
What is typical for this kind of model is that numerical data need to be normalized.
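A common normalization for neuralnet inputs is min-max rescaling to [0, 1]; a minimal sketch on illustrative balance values (assumed values, not the Bank's data):

```r
# Min-max rescaling to [0, 1]; toy values, not the Bank's data
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

balance <- c(-500, 0, 1500, 8000)
normalize(balance)  # 0.0000 0.0588 0.2353 1.0000
```

Rescaling puts all numeric inputs on a comparable scale, which helps the network's training converge.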
[Figure: plot of the fitted neuralnet model. The inputs age, balance, day, duration, campaign, pdays and previous feed a hidden layer of three neurons, which feeds the output classes "no" and "yes"; edge labels show the fitted weights.]
Literature Sources
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000): CRISP-DM
1.0 - Step-by-step data mining guide, CRISP-DM Consortium.
Grace-Martin, K. (No Year): “What is an ROC Curve?”, https://www.theanalysisfactor.com/what-is-an-
roc-curve/.
GeeksforGeeks: Confusion Matrix in Machine Learning https://www.geeksforgeeks.org/confusion-
matrix-machine-learning/
Lander, J. P. (2017): R for Everyone: Advanced Analytics and Graphics. [VitalSource Bookshelf]. Retrieved
from https://online.vitalsource.com/#/books/9781323582657/
Laudon, K. C. and Laudon, J. P. (2015): Management Information Systems: Managing the Digital Firm, 15th
Edition. Retrieved from vbk://9781323187944
Ling, X. and Li, C., (1998): “Data Mining for Direct Marketing: Problems and Solutions”. In Proceedings of
the 4th KDD conference, AAAI Press, 73–79.
Moro, S., Laureano, R. and Cortez, P. (2011): Using Data Mining for Bank Direct Marketing: An Application
of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and
Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October. EUROSIS.
Turban, E., Sharda, R. and Delen, D. (2010): Decision Support and Business Intelligence Systems – 9th
edition, Prentice Hall Press, USA.
Witten, I. and Frank, E. (2005): Data Mining – Practical Machine Learning Tools and Techniques – 2nd
edition, Elsevier, USA.
Appendix 1: Data explanation
This dataset is publicly available for research. The details are described in Moro et al. (2011).
http://hdl.handle.net/1822/14838
http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt
Attribute information:
Appendix 2: Pictorial view of correlation between some of the factors considered
Appendix 3: The R Code for this Project
cat("\014")
getwd()
# Forward slashes (or doubled backslashes) are required in R paths
setwd("C:/Users/tpapazisi/OneDrive - Smithfield Foods Inc/Devry/BIAM 560/Wk8 Final Project")
getwd()
#-- Getting the Data & converting to proper numeric format ---------------------------------
library(readxl)
prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
sheet = "Data",
range=cell_cols("A:X"), col_names = TRUE,
col_types = c("numeric","text","text","text","text",
"numeric","text","text","text","numeric",
"text","numeric","numeric","numeric","numeric",
"text","text","text","numeric","numeric","numeric","numeric","numeric","numeric"))
str(prac)
prac$job<-as.factor(prac$job)
prac$marital<-as.factor(prac$marital)
prac$education<-as.factor(prac$education)
prac$default<-as.factor(prac$default)
prac$housing<-as.factor(prac$housing)
prac$loan<-as.factor(prac$loan)
prac$contact<-as.factor(prac$contact)
prac$month<-as.factor(prac$month)
prac$poutcome<-as.factor(prac$poutcome)
prac$Gr_Age<-as.factor(prac$Gr_Age)
prac$Gr_Balance<-as.factor(prac$Gr_Balance)
prac$Gr_Duration<-as.factor(prac$Gr_Duration)
prac$Gr_pWeeks<-as.factor(prac$Gr_pWeeks)
prac$Gr_pMonths<-as.factor(prac$Gr_pMonths)
prac$Gr_Previous2<-as.factor(prac$Gr_Previous2)
prac$Gr_Previous3<-as.factor(prac$Gr_Previous3)
prac$yy<-as.factor(prac$yy)
str(prac)
#View(prac)
#warnings()
head(prac)
summary(prac)
str(prac)
tail(prac)
names(prac)
#colSums(is.na(prac))
library(ggplot2)
#-- Plot formatting fragments (the ggplot() calls that produced the charts are elided)
#-- Shared legend theme applied to each chart:
theme(legend.title = element_blank(),
      legend.text = element_text(color='blue', size=8, angle=0, face='plain'),
      legend.key.size = unit(.25, "cm"),
      legend.key.height = unit(.25, "cm"),
      legend.key.width = unit(.25, "cm"),
      legend.box = "vertical")
#-- Y-axis label for the campaign chart:
ylab("Recent Campaign")+
theme(axis.title.y = element_text('Calibri', 'bold', 'darkblue', size=10),
      axis.text.y = element_text('Calibri', 'plain', 'black', size=10))+
#-- Caption styling:
theme(plot.caption = element_text('Calibri', 'italic', 'darkred', size=8))+
#-- Jittered scatter layer:
geom_jitter(aes(shape=factor(prac$education), color=factor(prac$job)), size=2.5)
#-- Getting the Data & converting to proper numeric format ---------------------------------
library(readxl)
prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
sheet = "Data",
range=cell_cols("A:Q"), col_names = TRUE,
col_types = c("numeric","text","text","text","text",
"numeric","text","text","text","numeric",
"text","numeric","numeric","numeric","numeric",
"text","text"))
str(prac)
prac$job<-as.factor(prac$job)
prac$marital<-as.factor(prac$marital)
prac$education<-as.factor(prac$education)
prac$default<-as.factor(prac$default)
prac$housing<-as.factor(prac$housing)
prac$loan<-as.factor(prac$loan)
prac$contact<-as.factor(prac$contact)
prac$month<-as.factor(prac$month)
prac$poutcome<-as.factor(prac$poutcome)
prac$yy<-as.factor(prac$yy)
str(prac)
#View(prac)
#warnings()
head(prac)
summary(prac)
str(prac)
tail(prac)
names(prac)
#'----------------------------------
#' Split Data
#'----------------------------------
set.seed(1234)
ind<-sample(2,nrow(prac),replace=TRUE,prob=c(0.7,0.3))
train<-prac[ind==1,]
test<-prac[ind==2,]
str(train)
summary(train)
train<-train[rowSums(is.na(train))==0,]
summary(test)
#'----------------------------------
#' Random Forest
#'----------------------------------
library(randomForest)
set.seed(1234)
# Model fit (call reconstructed from the printed model summary in the report)
rf2 <- randomForest(yy ~ ., data = train)
rf2
#'----------------------------------
#' Decision Tree
#'----------------------------------
sessionInfo()
#install.packages("rJava")
library(rJava)
#install.packages("FSelector")
require(RWekajars)
require(FSelector)
library(FSelector)
library(rpart)
#install.packages("caret")
require(gower)
require(caret)
library(caret)
library(rpart.plot)
library(dplyr)
library(xlsx)
#install.packages("downloader")
require(downloader)
#install.packages("influenceR")
require(influenceR)
#install.packages("rgexf")
require(rgexf)
#install.packages("data.tree")
library(data.tree)
#install.packages("caTools")
require(caTools)
library(caTools)
library(ElemStatLearn)
set.seed(1234)
# rpart fit (call reconstructed; produces the tree printed in the report)
tree <- rpart(yy ~ ., data = train, method = "class")
rf1 <- tree
tree
print(rf1)
plot(train$duration)
hist(train$duration)
attributes(rf2)
attributes(rf1)
rf2$confusion
library(caret)
p1<-predict(rf2,train)
head(p1)
confusionMatrix(p1,train$yy)
p2<-predict(rf2,test)
head(p2)
confusionMatrix(p2,test$yy)
plot(rf2)
importance(rf2)
varImpPlot(rf2, sort=T,n.var=10,main="Top 10 Variable Importance")
#'----------------------------------
#' Neural Network
#'----------------------------------
getwd()
dataN<-prac
str(dataN)
require(e1071)
library(e1071)
#install.packages("ggvis")
require(ggvis)
library(ggvis)
require(class)
library(class)
require(gmodels)
library(gmodels)
require(neuralnet)
library(neuralnet)
require(nnet)
library(nnet)
set.seed(1234)
# trainN: training partition for the network (definition reconstructed; per the
# text above, the numeric inputs would need to be normalized before fitting)
trainN <- dataN[ind==1, ]
nn <- neuralnet(yy ~ age + balance + day + duration + campaign + pdays + previous,
                data = trainN, hidden = 3, linear.output = FALSE)
plot(nn)
#-- Prediction
output<-predict(nn,trainN[,-17])
head(output)
head(trainN[1,])
names(nn)
#-- ConfusionMatrix (corrected: confusionMatrix() compares predicted classes with
#-- the actual labels; assumes column 2 of net.result is the "yes" probability)
p1 <- factor(ifelse(nn$net.result[[1]][, 2] > 0.5, "yes", "no"),
             levels = levels(trainN$yy))
confusionMatrix(p1, trainN$yy)
#-- Getting the Data & converting to proper numeric format ---------------------------------
library(readxl)
prac <- read_excel("C:/Users/tpapazisi/OneDrive - Smithfield Foods, Inc/Devry/BIAM 560/Wk8 Final Project/BankData.xlsm",
sheet = "Data",
range=cell_cols("A:Q"), col_names = TRUE,
col_types = c("numeric","text","text","text","text",
"numeric","text","text","text","numeric",
"text","numeric","numeric","numeric","numeric",
"text","text"))
str(prac)
prac$job<-as.factor(prac$job)
prac$marital<-as.factor(prac$marital)
prac$education<-as.factor(prac$education)
prac$default<-as.factor(prac$default)
prac$housing<-as.factor(prac$housing)
prac$loan<-as.factor(prac$loan)
prac$contact<-as.factor(prac$contact)
prac$month<-as.factor(prac$month)
prac$poutcome<-as.factor(prac$poutcome)
prac$yy<-as.factor(prac$yy)
str(prac)
#'--sparse matrix,
#'which is a memory-efficient way to represent a large dataset that holds many zeros.
#'We're going to use the Matrix package to convert our data frame to a sparse matrix and
#'all our factored (categorical) features into dummy variables in one step.
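As a minimal illustration of the one-hot encoding step described in the comments above (base R's model.matrix on a toy factor, not the Bank's data):

```r
# One-hot encoding a small factor with model.matrix; the "- 1" drops the
# intercept so every level gets its own dummy column (toy data)
toy <- data.frame(marital = factor(c("married", "single", "divorced", "single")))
mm  <- model.matrix(~ marital - 1, data = toy)
colnames(mm)  # one dummy column per level
```

sparse.model.matrix() below does the same expansion but stores the result as a memory-efficient sparse matrix.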
# Load the Matrix package
library(Matrix)
# Create sparse matrixes and perform One-Hot Encoding to create dummy variables
dtrain <- sparse.model.matrix(yy ~ .-1, data=train)
dtest <- sparse.model.matrix(yy ~ .-1, data=test)
#'You can improve the performance of your model by tuning hyperparameters with the caret package.
#'We're using objective = "binary:logistic" because this is logistic regression for a binary classification problem.
#'We're also using eval_metric = "error", which is used for classification problems.
#'You can learn about all available options in the XGBoost documentation.
# Load the XGBoost package
library(xgboost)
set.seed(1234)
# Labels and model fit (reconstructed to match the training-error log reported
# above: nrounds = 500, errors printed every 25 rounds)
train.label <- as.numeric(train$yy) - 1
test.label  <- as.numeric(test$yy) - 1
xgb <- xgboost(data = dtrain, label = train.label, nrounds = 500,
               objective = "binary:logistic", eval_metric = "error",
               print_every_n = 25)
pred <- predict(xgb, dtest)
# Compute feature importance matrix; view the top 20 most important features
# ('names' is defined further below in the original script)
importance_matrix <- xgb.importance(feature_names = names, model = xgb)[1:20]
# Plot
xgb.plot.importance(importance_matrix)
#'------------------------------------------------
#' Plotting The ROC To View Various Thresholds
#'------------------------------------------------
#'An ROC curve allows us to visualize our model's performance when selecting different thresholds.
#'The threshold value is indicated by the dots on the curved line.
#'Each dot lets us view the average true positive rate and average false positive rate
#'for each threshold. As the threshold value gets lower, the average true positive rate gets higher.
#'However, the average false positive rate gets higher as well.
#'It's important to select a threshold that provides an acceptable true positive rate
#'while also limiting the false positive rate.
#'https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
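Before the ROCR calls below, here is a hedged base-R sketch of how one (FPR, TPR) point on the curve is computed from scores and labels at a single threshold (toy values, not the Bank's):

```r
# One ROC point: TPR and FPR at a fixed threshold (toy scores and labels)
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c(1,   1,   0,   1,   0,   0)
thr    <- 0.5

pred_pos <- scores >= thr
tpr <- sum(pred_pos & labels == 1) / sum(labels == 1)  # true positive rate
fpr <- sum(pred_pos & labels == 0) / sum(labels == 0)  # false positive rate
c(tpr, fpr)  # 2/3 and 1/3
```

ROCR's performance() sweeps this computation over every threshold to trace the full curve.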
library(ROCR)
# Use ROCR package to plot ROC Curve
xgb.pred <- prediction(pred, test.label)
xgb.perf <- performance(xgb.pred, "tpr", "fpr")
plot(xgb.perf,
avg="threshold",
colorize=TRUE,
lwd=1,
main="ROC Curve w/ Thresholds",
print.cutoffs.at=seq(0, 1, by=0.05),
text.adj=c(-0.5, 0.5),
text.cex=0.5)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")
library(readr)
library(stringr)
library(caret)
library(car)
# Lets start with finding what the actual tree looks like
model <- xgb.dump(xgb, with_stats = T)
model[1:10] #This statement prints top 10 nodes of the model
names <- dimnames(data.matrix(prac[,-1]))[[2]]
names
xgb
# Compute feature importance matrix
importance_matrix <- xgb.importance( model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])