
XGBoost R Tutorial

Introduction
Xgboost is short for eXtreme Gradient Boosting package.
The purpose of this Vignette is to show you how to use Xgboost to build a model and make
predictions.

It is an efficient and scalable implementation of the gradient boosting framework by
@friedman2000additive and @friedman2001greedy. Two solvers are included:

linear model
tree learning algorithm.

It supports various objective functions, including regression, classification and ranking. The
package is made to be extendible, so that users are also allowed to define their own objective
functions easily.
It has been used to win several Kaggle competitions.

It has several features:
Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is
generally over 10 times faster than the classical gbm.
Input Type: it takes several types of input data:
    Dense Matrix: R's dense matrix, i.e. matrix
    Sparse Matrix: R's sparse matrix, i.e. Matrix::dgCMatrix
    Data File: local data files
    xgb.DMatrix: its own class (recommended).
Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized
for sparse input.
Customization: it supports customized objective functions and evaluation functions (a minimal
sketch follows this list).
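
To illustrate the customization point above, here is a minimal sketch of a user-defined objective,
assuming the usual interface where the objective receives the raw predictions and the xgb.DMatrix
and returns the gradient and hessian. It is only an example, not the package's built-in definition.

# sketch of a custom logistic objective, passed to xgboost()/xgb.train() via the obj argument
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))   # raw scores -> probabilities
  grad <- preds - labels           # first-order gradient
  hess <- preds * (1 - preds)      # second-order gradient
  list(grad = grad, hess = hess)
}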

Installation

Github version
For the weekly updated version (highly recommended), install from Github:

install.packages("drat",repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost",repos="http://dmlc.ml/drat/",type="source")

Windows users will need to install Rtools first.


CRAN version
The version 0.4-2 is on CRAN, and you can install it by:

install.packages("xgboost")

Formerly available versions can be obtained from the CRAN archive.

Learning
For the purpose of this tutorial we will load the XGBoost package.

require(xgboost)

Dataset presentation
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many
tutorials, example data are the same as you will use in your everyday life :-).
Mushroom data is cited from the UCI Machine Learning Repository. @Bache+Lichman:2013.

Dataset loading
We will load the agaricus datasets embedded with the package and will link them to variables.

The datasets are already split in:

train: will be used to build the model
test: will be used to assess the quality of our model.

Why split the dataset in two parts?
In the first part we will build our model. In the second part we will want to test it and assess its
quality. Without dividing the dataset we would test the model on the data which the algorithm has
already seen.

data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test

In the real world, it would be up to you to make this division
between train and test data. The way to do it is beyond the scope of this article,
however the caret package may help (a short sketch follows).
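
As a purely illustrative sketch (not from the original tutorial), a random split with caret could
look like the following, assuming a hypothetical data.frame df with an outcome column y:

library(caret)

set.seed(42)
# `df` and `y` are placeholders for your own data
in_train <- createDataPartition(df$y, p = 0.8, list = FALSE)  # 80/20 split
train_set <- df[in_train, ]
test_set <- df[-in_train, ]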


Each variable is a list containing two things, label and data:

str(train)

## List of 2
##  $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
##   .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
##   .. ..@ Dim     : int [1:2] 6513 126
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" ...
##   .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..@ factors : list()
##  $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...

label is the outcome of our dataset, meaning it is the binary classification we will try to predict.

Let's discover the dimensionality of our datasets.

dim(train$data)

## [1] 6513  126

dim(test$data)

## [1] 1611  126

This dataset is very small so as not to make the R package too heavy, however XGBoost is built to
manage huge datasets very efficiently.

As seen below, the data are stored in a dgCMatrix which is a sparse matrix, and the label vector is
a numeric vector ({0,1}):

class(train$data)[1]

##[1]"dgCMatrix"

class(train$label)

##[1]"numeric"

Basic Training using XGBoost
This step is the most critical part of the process for the quality of our model.

Basic training

We are using the train data. As explained above, both data and label are stored in a list.

In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly
made of 0, memory size is reduced. It is very common to have such a dataset (a short illustration follows).
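
As a rough illustration of that memory benefit (synthetic data, not the agaricus dataset), you can
compare a mostly-zero dense R matrix with its sparse counterpart:

library(Matrix)

m_dense <- matrix(0, nrow = 1000, ncol = 100)
m_dense[sample(length(m_dense), 500)] <- 1   # only 500 non-zero cells
m_sparse <- Matrix(m_dense, sparse = TRUE)   # stored as a dgCMatrix

object.size(m_dense)   # every cell is stored
object.size(m_sparse)  # only the non-zero entries are stored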

We will train a decision tree model using the following parameters:
objective = "binary:logistic": we will train a binary classification model
max.depth = 2: the trees won't be deep, because our case is very simple
nthread = 2: the number of CPU threads we are going to use
nround = 2: there will be two passes on the data, the second one will enhance the model by
further reducing the difference between ground truth and prediction.

bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0] train-error:0.046522
## [1] train-error:0.022263

The more complex the relationship between your features and your label is, the more
passes you need.

Parameter variations
Dense matrix

Alternatively, you can put your dataset in a dense matrix, i.e. a basic R matrix.

bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0] train-error:0.046522
## [1] train-error:0.022263

xgb.DMatrix

XGBoost offers a way to group them in an xgb.DMatrix. You can even add other metadata to it (an
example follows the training output below). It will be useful for the most advanced features we
will discover later.

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0] train-error:0.046522
## [1] train-error:0.022263
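
As a hedged example of such extra metadata, per-observation weights can be attached to an existing
xgb.DMatrix with setinfo; the weights below are arbitrary and only for illustration.

# arbitrary weights, purely illustrative
w <- rep(1, nrow(train$data))
setinfo(dtrain, "weight", w)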

Verbose option

XGBoost has several features to help you view how the learning progresses internally. The
purpose is to help you set the best parameters, which is the key to your model quality.


One of the simplest ways to see the training progress is to set the verbose option (see below for
more advanced techniques).

# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 0)

# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 1)

## [0] train-error:0.046522
## [1] train-error:0.022263

# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)

## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes
## [0] train-error:0.046522
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes
## [1] train-error:0.022263

Basic prediction using XGBoost

Perform the prediction
The purpose of the model we have built is to classify new data. As explained before, we will use
the test dataset for this step.

pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

## [1] 1611

# limit display of predictions to the first 10
print(head(pred))

## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391

These numbers don't look like binary classification {0,1}. We need to perform a simple
transformation before being able to use these results.


Transform the regression into a binary classification
The only thing that XGBoost does is a regression. XGBoost is using the label vector to build
its regression model.

How can we use a regression model to perform a binary classification?
If we think about the meaning of a regression applied to our data, the numbers we get are
probabilities that a datum will be classified as 1. Therefore, we will set the rule that if this
probability for a specific datum is > 0.5 then the observation is classified as 1 (or 0 otherwise).

prediction <- as.numeric(pred > 0.5)
print(head(prediction))

## [1] 0 1 0 0 0 1

Measuring model performance
To measure the model performance, we will compute a simple metric, the average error.

err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))

##[1]"testerror=0.0217256362507759"

Note that the algorithm has not seen the test data during the model construction.

Steps explanation:
1. as.numeric(pred > 0.5) applies our rule that when the probability (<=> regression <=>
prediction) is > 0.5 the observation is classified as 1 and 0 otherwise;
2. probabilityVectorPreviouslyComputed != test$label computes the vector of errors
between true data and computed probabilities;
3. mean(vectorOfErrors) computes the average error itself.

The most important thing to remember is that to do a classification, you just do a regression to
the label and then apply a threshold.

Multiclass classification works in a similar way (a short sketch follows below).

This metric is 0.02 and is pretty low: our yummy mushroom model works well!
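
As for the multiclass remark above, here is a hedged sketch (the data and the number of classes are
hypothetical placeholders, not the mushroom dataset): use objective = "multi:softmax" together with
num_class, and the prediction is directly the class index.

# hypothetical multiclass example: `multi_data` and `multi_label` (values 0..2) are placeholders
bst_multi <- xgboost(data = multi_data, label = multi_label, max.depth = 2, eta = 1,
                     nthread = 2, nround = 2, objective = "multi:softmax", num_class = 3)
pred_class <- predict(bst_multi, multi_data)  # returns the predicted class for each row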

Advanced features


Most of the features below have been implemented to help you to improve your model by offering a
better understanding of its content.

Dataset preparation
For the following advanced features, we need to put data in an xgb.DMatrix as explained above.

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)

Measure learning progress with xgb.train
Both xgboost (simple) and xgb.train (advanced) functions train models.

One of the special features of xgb.train is the capacity to follow the progress of the learning after
each round. Because of the way boosting works, there is a time when having too many rounds leads
to overfitting. You can see this feature as a cousin of the cross-validation method. The following
techniques will help you to avoid overfitting or to optimize the learning time by stopping it as soon as
possible.
One way to measure progress in the learning of a model is to provide to XGBoost a second dataset
already classified. Therefore it can learn on the first dataset and test its model on the second one.
Some metrics are measured after each round during the learning.

In some way it is similar to what we have done above with the average error. The
main difference is that above it was after building the model, and now it is during the
construction that we measure errors.

For the purpose of this example, we use the watchlist parameter. It is a list of xgb.DMatrix, each of
them tagged with a name.

watchlist <- list(train = dtrain, test = dtest)

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")

## [0] train-error:0.046522  test-error:0.042831
## [1] train-error:0.022263  test-error:0.021726

XGBoost has computed at each round the same average error metric as seen above (we
set nround to 2, that is why we have two lines). Obviously, the train-error number is related to
the training dataset (the one the algorithm learns from) and the test-error number to the test
dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what
we have learned from the training dataset matches the observations from the test dataset.


If with your own dataset you do not have such results, you should think about how you divided your
dataset into training and test. Maybe there is something to fix. Again, the caret package may help.

For a better understanding of the learning progression, you may want to have some specific metric
or even use multiple evaluation metrics.

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")

## [0] train-error:0.046522  train-logloss:0.233376  test-error:0.042831  test-logloss:0.22668
## [1] train-error:0.022263  train-logloss:0.136658  test-error:0.021726  test-logloss:0.13787

eval.metric allows us to monitor two new metrics for each
round, logloss and error.

Linear boosting
Until now, all the learnings we have performed were based on boosting
trees. XGBoost implements a second algorithm, based on linear boosting. The only difference with
the previous command is the booster = "gblinear" parameter (and removing the eta parameter).

bst <- xgb.train(data = dtrain, booster = "gblinear", max.depth = 2, nthread = 2, nround = 2, watchlist = watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")

## [0] train-error:0.024720  train-logloss:0.184616  test-error:0.022967  test-logloss:0.18423
## [1] train-error:0.004146  train-logloss:0.069885  test-error:0.003724  test-logloss:0.06808

In this specific case, linear boosting gets slightly better performance metrics than the decision tree
based algorithm.

In simple cases, this will happen because there is nothing better than a linear algorithm to catch a
linear link. However, decision trees are much better at catching a non-linear link between predictors
and outcome. Because there is no silver bullet, we advise you to check both algorithms with your
own datasets to have an idea of what to use.

Manipulating xgb.DMatrix

Save/Load

Like saving models, an xgb.DMatrix object (which groups both dataset and outcome) can also be
saved using the xgb.DMatrix.save function.

xgb.DMatrix.save(dtrain, "dtrain.buffer")

## [1] TRUE

# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")

## [11:41:01] 6513x126 matrix with 143286 entries loaded from dtrain.buffer

bst <- xgb.train(data = dtrain2, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")

## [0] train-error:0.046522  test-error:0.042831
## [1] train-error:0.022263  test-error:0.021726

Information extraction

Information can be extracted from an xgb.DMatrix using the getinfo function. Hereafter we will
extract the label data.

label = getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))

##[1]"testerror=0.0217256362507759"

View feature importance/influence from the learnt model
Feature importance is similar to the R gbm package's relative influence (rel.inf).

importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)

View the trees from a model

You can dump the tree you learned using xgb.dump into a text file.

xgb.dump(bst, with.stats = TRUE)

##[1]"booster[0]"
##[2]"0:[f28<1.00136e05]yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
##[3]"1:[f55<1.00136e05]yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
##[4]"3:leaf=1.71218,cover=812"
##[5]"4:leaf=1.70044,cover=112.5"
##[6]"2:[f108<1.00136e05]yes=5,no=6,missing=5,gain=198.174,cover=703.75"
##[7]"5:leaf=1.94071,cover=690.5"
##[8]"6:leaf=1.85965,cover=13.25"
##[9]"booster[1]"


##[10]"0:[f59<1.00136e05]yes=1,no=2,missing=1,gain=832.545,cover=788.852"
##[11]"1:[f28<1.00136e05]yes=3,no=4,missing=3,gain=569.725,cover=768.39"
##[12]"3:leaf=0.784718,cover=458.937"
##[13]"4:leaf=0.96853,cover=309.453"
##[14]"2:leaf=6.23624,cover=20.4624"

You can plot the trees from your model using xgb.plot.tree.

xgb.plot.tree(model = bst)

If you provide a path to the fname parameter you can save the trees to your hard drive.
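
A minimal sketch, assuming the fname mentioned here is the fname argument of xgb.dump shown above
(the file name is arbitrary):

# write the plain-text dump of the trees to disk instead of returning it
xgb.dump(bst, fname = "xgb.model.dump", with.stats = TRUE)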

Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of
losing time in redoing the same task again and again? In these very rare cases, you will want to
save your model and load it when required.

Hopefully for you, XGBoost implements such functions.

# save model to binary local file
xgb.save(bst, "xgboost.model")

## [1] TRUE

The xgb.save function should return TRUE if everything goes well, and crashes
otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare
the two predictions.

# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))

##[1]"sum(abs(pred2pred))=0"
