XGBoost
Overview
Introduction
Basic Walkthrough
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Introduction
Nowadays we have plenty of machine learning models. The most well-known ones are
Linear/Logistic Regression
k-Nearest Neighbours
Support Vector Machines
Tree-based Models
- Decision Tree
- Random Forest
- Gradient Boosting Machine
Neural Networks
XGBoost is short for eXtreme Gradient Boosting. It is
An open-source tool
- Computation in C++
- R/python/Julia interfaces provided
A variant of the gradient boosting machine
- Tree-based model
The winning model for several Kaggle competitions
XGBoost is currently hosted on GitHub.
The primary author of the model and the C++ implementation is Tianqi Chen.
The author of the R package is Tong He.
XGBoost is widely used in Kaggle competitions. Reasons to choose XGBoost include:
Easy to use
- Easy to install.
- Well-developed R/python interface for users.
Efficiency
- Automatic parallel computation on a single machine.
- Can be run on a cluster.
Accuracy
- Good results on most data sets.
Flexibility
- Customized objective and evaluation.
- Tunable parameters.
Basic Walkthrough
We introduce the R package for XGBoost. To install, please run
devtools::install_github('dmlc/xgboost', subdir = 'R-package')
This command downloads the package from GitHub and compiles it automatically on your machine. Therefore, on Windows you need Rtools installed.
XGBoost provides a data set to demonstrate its usage.
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train = agaricus.train
test = agaricus.test
This data set records information on some kinds of mushrooms. The features are binary, each indicating whether the mushroom has a given characteristic. The target variable is whether they are poisonous.
Let's investigate the data first.
str(train$data)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=fla"...
##   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()
We can see that the data is a dgCMatrix class object. This is a sparse matrix class from the package Matrix. A sparse matrix is more memory-efficient for certain kinds of data.
To use XGBoost to classify poisonous mushrooms, the minimum information we need to provide is:
1. Input features
- XGBoost allows both dense and sparse matrices as input.
2. Target variable
- A numeric vector. Use integers starting from 0 for classification, or real values for regression.
3. Objective
- For regression use 'reg:linear'; for binary classification use 'binary:logistic'.
4. Number of iterations
- The number of trees added to the model.
To run XGBoost, we can use the following command:
bst = xgboost(data = train$data, label = train$label,
              nround = 2, objective = "binary:logistic")
## [0] train-error:0.000614
## [1] train-error:0.001228
Sometimes we might want to measure the classification by 'Area Under the Curve':
bst = xgboost(data = train$data, label = train$label, nround = 2,
              objective = "binary:logistic", eval_metric = "auc")
## [0] train-auc:0.999238
## [1] train-auc:0.999238
To predict, you can simply write
pred = predict(bst, test$data)
head(pred)
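The values in pred are probabilities. A hedged follow-up (not from the original slides) that turns them into class labels and measures the error on the test set:
# Threshold the predicted probabilities at 0.5 to get class labels
pred.label = as.numeric(pred > 0.5)
# Compare against the true labels to get the test error rate
err = mean(pred.label != test$label)
print(err)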
Cross validation is an important method to measure the model's predictive power, as well as the degree of overfitting. XGBoost provides a convenient function to do cross validation in one line of code.
cv.res = xgb.cv(data = train$data, nfold = 5, label = train$label, nround = 2,
                objective = "binary:logistic", eval_metric = "auc")
## [0] train-auc:0.998668+0.000354 test-auc:0.998497+0.001380
## [1] train-auc:0.999187+0.000785 test-auc:0.998700+0.001536
Notice that the only difference in arguments between xgb.cv and xgboost is the additional nfold parameter. To perform cross validation on a certain set of parameters, we just need to copy them to the xgb.cv function and add the number of folds.
xgb.cv returns a data.table object containing the cross validation results. This is helpful for choosing the correct number of iterations.
cv.res
##    train.auc.mean train.auc.std test.auc.mean test.auc.std
## 1:       0.998668      0.000354      0.998497     0.001380
## 2:       0.999187      0.000785      0.998700     0.001536
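Since the table contains one row per iteration, a natural next step is to pick the round with the best mean test AUC and retrain with it. A hedged sketch (column names taken from the output above):
# Index of the iteration with the highest mean test AUC
bestRound = which.max(cv.res$test.auc.mean)
bst = xgboost(data = train$data, label = train$label,
              nround = bestRound, objective = "binary:logistic",
              eval_metric = "auc")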
Model Specification
Training Objective
To understand the other parameters, one needs a basic understanding of the model behind them.
Suppose we have K trees; the model is
$$\sum_{k=1}^{K} f_k$$
where each $f_k$ is the prediction from a decision tree. The model is a collection of decision trees.
Having all the decision trees, we make the prediction by
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
and the prediction at the t-th step can be written as
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i)$$
To train the model, we need to optimize a loss function.
Typically, we use
Root Mean Squared Error for regression
$$L = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
LogLoss for binary classification
$$L = -\frac{1}{N}\sum_{i=1}^{N} \left( y_i \log(p_i) + (1-y_i)\log(1-p_i) \right)$$
mlogloss for multiclass classification
$$L = -\frac{1}{N}\sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j}\log(p_{i,j})$$
Regularization is another important part of the model. A good regularization term controls the complexity of the model, which prevents overfitting.
Define
$$\Omega = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
where T is the number of leaves and $w_j$ is the score on the j-th leaf.
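As a quick worked example (the numbers are invented for illustration): for a tree with $T = 3$ leaves, scores $w = (2, 0.5, -1)$, $\gamma = 1$ and $\lambda = 1$,
$$\Omega = 1 \cdot 3 + \frac{1}{2} \cdot 1 \cdot \left(2^2 + 0.5^2 + (-1)^2\right) = 3 + 2.625 = 5.625$$
A larger $\gamma$ penalizes trees with more leaves; a larger $\lambda$ penalizes extreme leaf scores.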
Putting the loss function and the regularization together, we have the objective of the model:
$$Obj = L + \Omega$$
where the loss function controls the predictive power, and the regularization controls the simplicity.
In XGBoost, we use gradient descent to optimize the objective.
Given an objective $Obj(y, \hat{y})$ to optimize, gradient descent is an iterative technique which calculates
$$\partial_{\hat{y}}\, Obj(y, \hat{y})$$
at each iteration, and then improves $\hat{y}$ along the direction of the gradient to minimize the objective.
Recall the definition of the objective $Obj = L + \Omega$. Since the model is trained additively, we can write the objective function at the t-th training step as
$$Obj^{(t)} = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i) = \sum_{i=1}^{N} L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{i=1}^{t} \Omega(f_i)$$
To optimize it by gradient descent, we need to calculate the gradient. The performance can also be improved by considering both the first and the second order gradient.
Since we don't have a derivative for every objective function, we calculate the second order Taylor approximation of it:
$$Obj^{(t)} \simeq \sum_{i=1}^{N} \left[ L(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{i=1}^{t} \Omega(f_i)$$
where
$$g_i = \partial_{\hat{y}^{(t-1)}} L(y_i, \hat{y}^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} L(y_i, \hat{y}^{(t-1)})$$
Removing the constant terms, we get
$$Obj^{(t)} \simeq \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
This is the objective at the t-th step. Our goal is to find an $f_t$ that optimizes it.
Each data point flows to one of the leaves following the direction on each node.
A tree can be written as
$$f_t(x) = w_{q(x)}$$
where q(x) is a "directing" function which assigns every data point to the q(x)-th leaf.
This definition describes the prediction process on a tree as:
- Assign the data point x to a leaf by q.
- Assign the corresponding score $w_{q(x)}$ on the q(x)-th leaf to the data point.
Define the index set of the j-th leaf as
$$I_j = \{\, i \mid q(x_i) = j \,\}$$
This set contains the indices of the data points that are assigned to the j-th leaf.
$$Obj^{(t)} \simeq \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
Since all the data points on the same leaf share the same prediction, this form sums the predictions by leaves.
This is a quadratic function of each $w_j$, so for a fixed tree structure the optimal score on the j-th leaf is
$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
and plugging it back in gives the optimal objective value
$$Obj^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
The optimal leaf score
$$w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
relates to
- The first and second order gradients of the loss function, g and h.
- The regularization parameter $\lambda$.
To judge a split, we compare the objective before and after it. Let $I_L$ and $I_R$ be the sets of indices of data points assigned to the two new leaves, and $I = I_L \cup I_R$. Before the split, the objective on this leaf is
$$Obj^{(t)} = -\frac{1}{2} \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} + \gamma$$
so the gain from the split is
$$Gain = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$
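A small numeric illustration (numbers invented for this example): suppose a candidate split gives $\sum_{i \in I_L} g_i = 2$, $\sum_{i \in I_L} h_i = 3$ on the left and $\sum_{i \in I_R} g_i = -4$, $\sum_{i \in I_R} h_i = 5$ on the right, with $\lambda = 1$ and $\gamma = 1$. Then
$$Gain = \frac{1}{2}\left[\frac{2^2}{3+1} + \frac{(-4)^2}{5+1} - \frac{(2-4)^2}{3+5+1}\right] - 1 = \frac{1}{2}\left[1 + \frac{8}{3} - \frac{4}{9}\right] - 1 \approx 0.61$$
Since the gain is positive, this split improves the objective; a split with negative gain would not be kept.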
Parameters
Parameter Introduction
XGBoost has plenty of parameters. We can group them into
1. General parameters
Number of threads
2. Booster parameters
Step size
Regularization
3. Task parameters
Objective
Evaluation metric
With the model introduced, we can now understand the parameters provided in XGBoost. To check the parameter list, one can look into:
The documentation of xgb.train.
The documentation in the repository.
General parameters:
nthread
- Number of parallel threads.
booster
- gbtree: tree-based model.
- gblinear: linear function.
Parameters for Tree Booster
eta
- Step size shrinkage used in each update; helps prevent overfitting.
- Range: [0, 1], default 0.3
gamma
- Minimum loss reduction required to make a split.
- Range: [0, ∞], default 0
max_depth
- Maximum depth of a tree.
- Range: [1, ∞], default 6
min_child_weight
- Minimum sum of instance weight needed in a child.
- Range: [0, ∞], default 1
max_delta_step
- Maximum delta step we allow each tree's weight estimation to be.
- Range: [0, ∞], default 0
subsample
- Subsample ratio of the training instances.
- Range: (0, 1], default 1
colsample_bytree
- Subsample ratio of columns when constructing each tree.
- Range: (0, 1], default 1
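A hedged sketch of how these booster parameters are passed in practice (the values are illustrative, not recommendations):
# Collect the parameters in a list and pass it to the params argument
param = list(booster = "gbtree", nthread = 2,
             eta = 0.1, gamma = 0, max_depth = 6,
             min_child_weight = 1, subsample = 0.8,
             colsample_bytree = 0.8)
bst = xgboost(params = param, data = train$data, label = train$label,
              nround = 10, objective = "binary:logistic")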
Parameters for Linear Booster
lambda
- L2 regularization term on weights
- default 0
alpha
- L1 regularization term on weights
- default 0
lambda_bias
- L2 regularization term on bias
- default 0
Objectives
- "reg:linear": linear regression, the default option.
- "binary:logistic": logistic regression for binary classification; outputs probabilities.
- "multi:softmax": multiclass classification using the softmax objective; requires num_class to be set.
- User-specified objective.
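A hedged sketch of a multiclass setup (the iris data is used purely for illustration and is not part of the original slides):
data(iris)
X = as.matrix(iris[, 1:4])
# Labels must be integers starting from 0
y = as.numeric(iris$Species) - 1
bst = xgboost(data = X, label = y, nround = 5,
              objective = "multi:softmax", num_class = 3)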
Evaluation Metrics
- "rmse": root mean squared error
- "logloss": negative log-likelihood
- "error": binary classification error rate
- "auc": area under the ROC curve
- "merror": multiclass classification error rate
- "mlogloss": multiclass logloss
- User-specified evaluation metric
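By default the metric is computed on the training set. A hedged sketch of watching a held-out set as well, using xgb.train and a watchlist (argument names as in the xgboost R package):
dtrain = xgb.DMatrix(data = train$data, label = train$label)
dtest = xgb.DMatrix(data = test$data, label = test$label)
# Evaluate the metric on both sets after every iteration
bst = xgb.train(params = list(objective = "binary:logistic",
                              eval_metric = "auc"),
                data = dtrain, nrounds = 2,
                watchlist = list(train = dtrain, test = dtest))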
Advanced Features
There are plenty of highlights in XGBoost:
Customized objective and evaluation metric
Prediction from cross validation
Continue training on an existing model
Calculate and plot the variable importance
Customization
According to the algorithm, we can define our own loss function, as long as we can calculate its first and second order gradients. Define
$$grad = \partial_{\hat{y}^{(t-1)}}\, l, \qquad hess = \partial^2_{\hat{y}^{(t-1)}}\, l$$
We can optimize the loss function if we can calculate these two values.
We can rewrite the logloss for the i-th data point as
$$L = -\,y_i \log(p_i) - (1-y_i)\log(1-p_i)$$
where $p_i$ is calculated by applying the logistic transformation to our prediction $\hat{y}_i$:
$$p_i = \frac{1}{1+e^{-\hat{y}_i}}$$
Then the logloss is
$$L = -\,y_i \log\left(\frac{1}{1+e^{-\hat{y}_i}}\right) - (1-y_i)\log\left(\frac{e^{-\hat{y}_i}}{1+e^{-\hat{y}_i}}\right)$$
Differentiating with respect to $\hat{y}_i$, we can see that
$$grad = \frac{1}{1+e^{-\hat{y}_i}} - y_i = p_i - y_i$$
$$hess = \frac{e^{-\hat{y}_i}}{\left(1+e^{-\hat{y}_i}\right)^2} = p_i\,(1-p_i)$$
The complete code:
logregobj = function(preds, dtrain) {
  # Extract the true label from the second argument
  labels = getinfo(dtrain, "label")
  # Apply the logistic transformation to the output
  preds = 1/(1 + exp(-preds))
  # Calculate the first order gradient
  grad = preds - labels
  # Calculate the second order gradient
  hess = preds * (1 - preds)
  # Return the result
  return(list(grad = grad, hess = hess))
}
We can also customize the evaluation metric:
evalerror = function(preds, dtrain) {
  # Extract the true label from the second argument
  labels = getinfo(dtrain, "label")
  # Calculate the error rate
  err = as.numeric(sum(labels != (preds > 0)))/length(labels)
  # Return the name of this metric and the value
  return(list(metric = "error", value = err))
}
To use the customized objective and evaluation, we simply pass them to the arguments:
param = list(max.depth = 2, eta = 1, nthread = 2, silent = 1,
             objective = logregobj, eval_metric = evalerror)
bst = xgboost(params = param, data = train$data, label = train$label, nround = 2)
## [0] train-error:0.0465223399355136
## [1] train-error:0.0222631659757408
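The output below shows the customized objective and metric run through cross validation. The exact call was lost in extraction; a hedged reconstruction (the prediction argument, which makes xgb.cv also return the fold predictions, is assumed from the str(res) output that follows):
res = xgb.cv(params = param, data = train$data, label = train$label,
             nround = 2, nfold = 5, prediction = TRUE)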
## [0] train-error:0.046522+0.001347 test-error:0.046522+0.005387
## [1] train-error:0.022263+0.000637 test-error:0.022263+0.002545
str(res)
## List of 2
##  $ dt :Classes 'data.table' and 'data.frame': 2 obs. of 4 variables:
##   ..$ train.error.mean: num [1:2] 0.0465 0.0223
##   ..$ train.error.std : num ...
##   ..$ test.error.mean : num ...
##   ..$ test.error.std  : num ...
##   ..- attr(*, ...)
xgb.DMatrix
XGBoost has its own class of input data, xgb.DMatrix. One can convert a usual data set into it by
dtrain = xgb.DMatrix(data = train$data, label = train$label)
It is the data structure used by the XGBoost algorithm: XGBoost preprocesses the input data and label into an xgb.DMatrix object before feeding it to the training algorithm.
If one needs to repeat the training process on the same big data set, it is good to use the xgb.DMatrix object to save preprocessing time.
An xgb.DMatrix object contains
1. Preprocessed training data
2. Several optional pieces of information, e.g.
- Missing values
- Data weights
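A hedged sketch of these features (function names from the xgboost R package; the values are illustrative):
# Declare which value encodes missingness at construction time
dtrain = xgb.DMatrix(data = train$data, label = train$label, missing = NA)
# Attach a weight to each training instance
w = rep(1, nrow(train$data))
setinfo(dtrain, "weight", w)
# Save the preprocessed object so it can be reloaded without redoing the work
xgb.DMatrix.save(dtrain, 'dtrain.buffer')
dtrain2 = xgb.DMatrix('dtrain.buffer')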
Continue Training
Training the model for 5000 rounds in one go is sometimes useful, but we are also taking the risk of overfitting.
A better strategy is to train the model with fewer rounds and repeat that many times. This enables us to observe the outcome after each step.
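The output below repeats one step of this strategy. The predict and setinfo lines were lost in extraction; a hedged reconstruction of the step (outputmargin and base_margin are the xgboost R package's warm-start mechanism, and setinfo returns TRUE on success, which is where the '## [1] TRUE' lines come from):
# Train with only one round
bst = xgboost(params = param, data = dtrain, nround = 1)
# Extract the untransformed (margin) prediction of the current model
ptrain = predict(bst, dtrain, outputmargin = TRUE)
# Use it as the starting point of the next one-round training
setinfo(dtrain, "base_margin", ptrain)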
Repeating this step, each one-round training continues from the previous model (the TRUE lines are returned by setinfo):
# Train with only one round
bst = xgboost(params = param, data = dtrain, nround = 1)
## [0] train-error:0.0465223399355136
## [1] TRUE
## [0] train-error:0.0222631659757408
## [1] TRUE
## [0] train-error:0.00706279748195916
## [1] TRUE
## [0] train-error:0.0152003684937817
## [1] TRUE
## [0] train-error:0.00706279748195916
## [1] TRUE
## [0] train-error:0.00122831260555811
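Variable Importance
XGBoost can calculate and plot the variable importance of a trained model. The table below appears to be the output of such a call; a hedged reconstruction (xgb.importance is the xgboost R package's function for this):
importance = xgb.importance(model = bst)
print(importance)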
##    Feature       Gain     Cover Frequence
## 1:      28 0.67615484 0.4978746       0.4
## 2:      55 0.17135352 0.1920543       0.2
## 3:      59 0.12317241 0.1638750       0.2
## 4:     108 0.02931922 0.1461960       0.2
Tree plotting
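A hedged sketch of how to draw the trees of a fitted model (xgb.plot.tree is the xgboost R package's tree plotting function; the feature_names argument is assumed):
xgb.plot.tree(feature_names = colnames(train$data), model = bst)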
Early Stopping
When doing cross validation, it is common to encounter overfitting at an early stage of the iterations. Sometimes the prediction gets consistently worse from round 300 while the total number of iterations is 1000. To stop the cross validation process early, one can use the early.stop.round argument in xgb.cv.
bst = xgb.cv(params = param, data = train$data, label = train$label,
             nround = 20, nfold = 5,
             maximize = FALSE, early.stop.round = 3)
## [0]  train-error:0.050783+0.011279 test-error:0.055272+0.013879
## [1]  train-error:0.021227+0.001863 test-error:0.021496+0.004443
## [2]  train-error:0.008483+0.003373 test-error:0.007677+0.002820
## [3]  train-error:0.014202+0.002592 test-error:0.014126+0.003504
## [4]  train-error:0.004721+0.002682 test-error:0.005527+0.004490
## [5]  train-error:0.001382+0.000533 test-error:0.001382+0.000642
## [6]  train-error:0.001228+0.000219 test-error:0.001228+0.000875
## [7]  train-error:0.001228+0.000219 test-error:0.001228+0.000875
## [8]  train-error:0.001228+0.000219 test-error:0.001228+0.000875
## [9]  train-error:0.000691+0.000645 test-error:0.000921+0.001001
## [10] train-error:0.000422+0.000582 test-error:0.000768+0.001086
## [11] train-error:0.000192+0.000429 test-error:0.000460+0.001030
## [12] train-error:0.000192+0.000429 test-error:0.000460+0.001030
## [13] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [14] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [15] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [16] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Kaggle Winning Solution
The example below comes from the Influencers in Social Networks competition on Kaggle, where the task is to predict which of two social-network users, A or B, is more influential.
The data set contains the following features for each pair of users A and B:
##  [1] "A_follower_count"    "A_following_count"   "A_listed_count"
##  [4] "A_mentions_received" "A_retweets_received" "A_mentions_sent"
##  [7] "A_retweets_sent"     "A_posts"             "A_network_feature_1"
## [10] "A_network_feature_2" "A_network_feature_3" "B_follower_count"
## [13] "B_following_count"   "B_listed_count"      "B_mentions_received"
## [16] "B_retweets_received" "B_mentions_sent"     "B_retweets_sent"
## [19] "B_posts"             "B_network_feature_1" "B_network_feature_2"
## [22] "B_network_feature_3"
##    A_follower_count   A_following_count      A_listed_count
##        2.280000e+02        3.020000e+02        3.000000e+00
## A_mentions_received A_retweets_received     A_mentions_sent
##        5.839794e-01        1.005034e-01        1.005034e-01
##     A_retweets_sent             A_posts A_network_feature_1
##        1.005034e-01        3.621501e-01        2.000000e+00
## A_network_feature_2 A_network_feature_3    B_follower_count
##        1.665000e+02        1.135500e+04        3.446300e+04
##   B_following_count      B_listed_count B_mentions_received
##        2.980800e+04        1.689000e+03        1.543050e+01
## B_retweets_received     B_mentions_sent     B_retweets_sent
##        3.984029e+00        8.204331e+00        3.324230e-01
##             B_posts B_network_feature_1 B_network_feature_2
##        6.988815e+00        6.600000e+01        7.553030e+01
## B_network_feature_3
After cross validation, we pick the best round:
bestRound
## [1] 2442
cv.res[bestRound,]
##    train.auc.mean train.auc.std test.auc.mean test.auc.std
## 1:       0.934967       0.00125      0.876629     0.002073
FAQ