The goal of this article is to quickly get you running XGBoost on any classification
problem and measuring its performance. It won't explain feature engineering, model
tuning, or the theory or math behind the algorithm. There's already a plethora of
free resources for learning those topics. Personally, I learn better when I first run my
data through an algorithm and then use various resources to learn how to improve
my prediction performance.
Survived Pclass Sex    Age SibSp Parch Fare   Embarked Cabin
0        3      male    22 1     0     7.2500 S
1        3      female  26 0     0     7.9250 S
0        3      male    35 0     0     8.0500 S
0        3      male    NA 0     0     8.4583 Q
Feature  Missing
Survived   0
Pclass     0
Sex        0
Age      177
SibSp      0
Parch      0
Fare       0
Embarked   0
Cabin      0
The Age feature is missing 177 observations so we'll simply remove those rows
altogether.
# Get all rows that are not NA in the Age feature
mydata <- subset(mydata, !is.na(Age))
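The train/test split itself doesn't appear in this excerpt; a minimal sketch, assuming a simple random split (the sampling method and proportion here are assumptions, chosen only to be consistent with the set sizes shown later):

```r
# Assumption: a random ~75/25 split of mydata into train and test sets
set.seed(1234)
train.index <- sample(nrow(mydata), size = floor(0.75 * nrow(mydata)))
train <- mydata[train.index, ]
test  <- mydata[-train.index, ]
```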
# Create separate vectors of our outcome variable for both our train and test sets
# We'll use these to train and test our model later
train.label <- train$Survived
test.label <- test$Survived
One-hot Encoding
XGBoost requires a numeric matrix rather than a data frame. A sparse matrix is a
memory-efficient way to represent a large dataset that holds many zeros. We're going
to use the Matrix package to convert our data frame to a sparse matrix and encode all
of our factored (categorical) features as dummy variables in one step.
# Create sparse matrices and perform one-hot encoding to create dummy variables
library(Matrix)
dtrain <- sparse.model.matrix(Survived ~ . - 1, data = train)
dtest <- sparse.model.matrix(Survived ~ . - 1, data = test)
# View the number of rows and features of each set
dim(dtrain)
[1] 536 158
dim(dtest)
[1] 178 158
set.seed(1234)
[1] train-error:0.149254
[101] train-error:0.052239
[201] train-error:0.029851
[301] train-error:0.016791
[401] train-error:0.013060
[500] train-error:0.009328
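The training call itself didn't survive in this excerpt; a minimal sketch of what it likely looked like, assuming a binary logistic objective, 500 rounds, and a model object named `bst` (the parameter values and the name `bst` are assumptions, not the author's exact call):

```r
library(xgboost)

# Assumption: parameters and the model name are illustrative
bst <- xgboost(data = dtrain,
               label = train.label,
               objective = "binary:logistic",  # binary classification
               eval_metric = "error",          # matches the train-error output above
               nrounds = 500,                  # matches the [500] train-error line
               verbose = 1)
```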
          Reference
Prediction  0  1
         0 97 18
         1  6 57
Accuracy : 0.8652
Kappa : 0.7173
Sensitivity : 0.7600
Specificity : 0.9417
Prevalence : 0.4213
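The code that produced this confusion matrix isn't shown here; a sketch using caret's confusionMatrix, assuming a trained model object named `bst` and a 0.5 probability threshold (both are assumptions):

```r
library(caret)

# Assumption: `bst` is the trained xgboost model; 0.5 is the classification threshold
xgb.pred <- predict(bst, dtest)           # predicted probabilities
pred.class <- as.numeric(xgb.pred > 0.5)  # convert probabilities to 0/1 labels
confusionMatrix(factor(pred.class), factor(test.label), positive = "1")
```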
Without creating new features or tuning our hyperparameters, we were able to get a
76% true positive rate and a 94% true negative rate with a Kappa score of 0.7173.
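The `importance_matrix` plotted below would come from `xgb.importance()`; a sketch, assuming the trained model object is named `bst` (an assumption):

```r
# Assumption: `bst` is the trained xgboost model
importance_matrix <- xgb.importance(feature_names = colnames(dtrain), model = bst)
head(importance_matrix)
```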
# Plot
xgb.plot.importance(importance_matrix)
Plotting The ROC To View Various Thresholds
An ROC curve allows us to visualize our model's performance when selecting
different thresholds. The threshold value is indicated by the dots on the curved line.
Each dot lets us view the average true positive rate and average false positive rate
for each threshold. As the threshold value gets lower, the average true positive rate
gets higher. However, the average false positive rate gets higher as well. It's
important to select a threshold that provides an acceptable true positive rate while
also limiting the false positive rate. You can read more at
https://en.wikipedia.org/wiki/Receiver_operating_characteristic .
library(ROCR)
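ROCR needs a prediction object and a performance object before the `xgb.perf` curve below can be plotted; a sketch, assuming `xgb.pred` holds the model's predicted probabilities for the test set (an assumption):

```r
# Assumption: `xgb.pred` contains predicted probabilities for the test set
xgb.pred.obj <- prediction(xgb.pred, test.label)
xgb.perf <- performance(xgb.pred.obj, measure = "tpr", x.measure = "fpr")
```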
plot(xgb.perf,
     avg = "threshold",
     colorize = TRUE,
     lwd = 1,
     main = "ROC Curve w/ Thresholds",
     print.cutoffs.at = seq(0, 1, by = 0.05),
     text.adj = c(-0.5, 0.5),
     text.cex = 0.5)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")
These are decent results, but you can get much better predictions by creating new
features. However, that topic is for another discussion.
Please leave a comment if you have any questions, spot any errors, or if you know of
any other useful packages or plots for evaluating classification models. You can grab
the notebook from my GitHub here: get_up_and_running_with_xgboost_in_r.ipynb .
Thanks for reading!