
XGBoost R Tutorial

Introduction
Xgboost is short for eXtreme Gradient Boosting package.
The purpose of this Vignette is to show you how to use Xgboost to build a model and make
predictions.

It is an efficient and scalable implementation of the gradient boosting framework by
@friedman2000additive and @friedman2001greedy. Two solvers are included:

linear model
tree learning algorithm.

It supports various objective functions, including regression, classification and ranking. The
package is made to be extendible, so that users are also allowed to define their own objective
functions easily.
It has been used to win several Kaggle competitions.

It has several features:
Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is
generally over 10 times faster than the classical gbm.
Input Type: it takes several types of input data:
    Dense Matrix: R's dense matrix, i.e. matrix
    Sparse Matrix: R's sparse matrix, i.e. Matrix::dgCMatrix
    Data File: local data files
    xgb.DMatrix: its own class (recommended).
Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized
for sparse input.
Customization: it supports customized objective functions and evaluation functions (a minimal
sketch follows this list).
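
To illustrate the customization point above, here is a minimal sketch of a user-defined objective,
assuming the usual interface where the objective receives the raw predictions and the xgb.DMatrix
and returns the gradient and hessian. It is only an example, not the package's built-in definition.

# sketch of a custom logistic objective, passed to xgboost()/xgb.train() via the obj argument
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))   # raw scores -> probabilities
  grad <- preds - labels           # first-order gradient
  hess <- preds * (1 - preds)      # second-order gradient
  list(grad = grad, hess = hess)
}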

Installation

Github version
For the weekly updated version (highly recommended), install from Github:

install.packages("drat",repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost",repos="http://dmlc.ml/drat/",type="source")

Windows users will need to install Rtools first.


CRAN version
The version 0.4-2 is on CRAN, and you can install it by:

install.packages("xgboost")

Formerly available versions can be obtained from the CRAN archive.

Learning
For the purpose of this tutorial we will load the XGBoost package.

require(xgboost)

Dataset presentation
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many
tutorials, example data are the same as you will use in your everyday life :-).
Mushroom data is cited from the UCI Machine Learning Repository. @Bache+Lichman:2013.

Dataset loading
We will load the agaricus datasets embedded with the package and will link them to variables.

The datasets are already split in:

train: will be used to build the model
test: will be used to assess the quality of our model.

Why split the dataset in two parts?
In the first part we will build our model. In the second part we will want to test it and assess its
quality. Without dividing the dataset we would test the model on the data which the algorithm has
already seen.

data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test

In the real world, it would be up to you to make this division
between train and test data. The way to do it is beyond the scope of this article,
however the caret package may help (a short sketch follows).
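
As a purely illustrative sketch (not from the original tutorial), a random split with caret could
look like the following, assuming a hypothetical data.frame df with an outcome column y:

library(caret)

set.seed(42)
# `df` and `y` are placeholders for your own data
in_train <- createDataPartition(df$y, p = 0.8, list = FALSE)  # 80/20 split
train_set <- df[in_train, ]
test_set <- df[-in_train, ]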


Each variable is a list containing two things, label and data:

str(train)

## List of 2
##  $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
##   .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
##   .. ..@ Dim     : int [1:2] 6513 126
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" ...
##   .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..@ factors : list()
##  $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...

label is the outcome of our dataset, meaning it is the binary classification we will try to predict.

Let's discover the dimensionality of our datasets.

dim(train$data)

## [1] 6513  126

dim(test$data)

## [1] 1611  126

This dataset is very small so as not to make the R package too heavy, however XGBoost is built to
manage huge datasets very efficiently.

As seen below, the data are stored in a dgCMatrix which is a sparse matrix, and the label vector is
a numeric vector ({0,1}):

class(train$data)[1]

##[1]"dgCMatrix"

class(train$label)

##[1]"numeric"

Basic Training using XGBoost
This step is the most critical part of the process for the quality of our model.

Basic training

We are using the train data. As explained above, both data and label are stored in a list.

In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly
made of 0, memory size is reduced. It is very common to have such a dataset (a short illustration follows).
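
As a rough illustration of that memory benefit (synthetic data, not the agaricus dataset), you can
compare a mostly-zero dense R matrix with its sparse counterpart:

library(Matrix)

m_dense <- matrix(0, nrow = 1000, ncol = 100)
m_dense[sample(length(m_dense), 500)] <- 1   # only 500 non-zero cells
m_sparse <- Matrix(m_dense, sparse = TRUE)   # stored as a dgCMatrix

object.size(m_dense)   # every cell is stored
object.size(m_sparse)  # only the non-zero entries are stored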

We will train a decision tree model using the following parameters:
objective = "binary:logistic": we will train a binary classification model
max.depth = 2: the trees won't be deep, because our case is very simple
nthread = 2: the number of CPU threads we are going to use
nround = 2: there will be two passes on the data, the second one will enhance the model by
further reducing the difference between ground truth and prediction.

bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0] train-error:0.046522
## [1] train-error:0.022263

The more complex the relationship between your features and your label is, the more
passes you need.

Parameter variations
Dense matrix

Alternatively, you can put your dataset in a dense matrix, i.e. a basic R matrix.

bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0] train-error:0.046522
## [1] train-error:0.022263

xgb.DMatrix

XGBoost offers a way to group them in an xgb.DMatrix. You can even add other metadata to it (an
example follows the training output below). It will be useful for the most advanced features we
will discover later.

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")

## [0] train-error:0.046522
## [1] train-error:0.022263
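
As a hedged example of such extra metadata, per-observation weights can be attached to an existing
xgb.DMatrix with setinfo; the weights below are arbitrary and only for illustration.

# arbitrary weights, purely illustrative
w <- rep(1, nrow(train$data))
setinfo(dtrain, "weight", w)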

Verbose option

XGBoost has several features to help you view how the learning progresses internally. The
purpose is to help you set the best parameters, which is the key to your model quality.


One of the simplest ways to see the training progress is to set the verbose option (see below for
more advanced techniques).

# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 0)

# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 1)

## [0] train-error:0.046522
## [1] train-error:0.022263

# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)

## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes
## [0] train-error:0.046522
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes
## [1] train-error:0.022263

Basic prediction using XGBoost

Perform the prediction
The purpose of the model we have built is to classify new data. As explained before, we will use
the test dataset for this step.

pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

## [1] 1611

# limit display of predictions to the first 10
print(head(pred))

## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391

These numbers don't look like binary classification {0,1}. We need to perform a simple
transformation before being able to use these results.


Transform the regression into a binary classification
The only thing that XGBoost does is a regression. XGBoost is using the label vector to build
its regression model.

How can we use a regression model to perform a binary classification?
If we think about the meaning of a regression applied to our data, the numbers we get are
probabilities that a datum will be classified as 1. Therefore, we will set the rule that if this
probability for a specific datum is > 0.5 then the observation is classified as 1 (or 0 otherwise).

prediction <- as.numeric(pred > 0.5)
print(head(prediction))

## [1] 0 1 0 0 0 1

Measuring model performance
To measure the model performance, we will compute a simple metric, the average error.

err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))

##[1]"testerror=0.0217256362507759"

Note that the algorithm has not seen the test data during the model construction.

Steps explanation:
1. as.numeric(pred > 0.5) applies our rule that when the probability (<=> regression <=>
prediction) is > 0.5 the observation is classified as 1 and 0 otherwise;
2. probabilityVectorPreviouslyComputed != test$label computes the vector of errors
between true data and computed probabilities;
3. mean(vectorOfErrors) computes the average error itself.

The most important thing to remember is that to do a classification, you just do a regression to
the label and then apply a threshold.

Multiclass classification works in a similar way (a short sketch follows below).

This metric is 0.02 and is pretty low: our yummy mushroom model works well!
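
As for the multiclass remark above, here is a hedged sketch (the data and the number of classes are
hypothetical placeholders, not the mushroom dataset): use objective = "multi:softmax" together with
num_class, and the prediction is directly the class index.

# hypothetical multiclass example: `multi_data` and `multi_label` (values 0..2) are placeholders
bst_multi <- xgboost(data = multi_data, label = multi_label, max.depth = 2, eta = 1,
                     nthread = 2, nround = 2, objective = "multi:softmax", num_class = 3)
pred_class <- predict(bst_multi, multi_data)  # returns the predicted class for each row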

Advanced features


Most of the features below have been implemented to help you to improve your model by offering a
better understanding of its content.

Dataset preparation
For the following advanced features, we need to put data in an xgb.DMatrix as explained above.

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)

Measure learning progress with xgb.train
Both xgboost (simple) and xgb.train (advanced) functions train models.

One of the special features of xgb.train is the capacity to follow the progress of the learning after
each round. Because of the way boosting works, there is a time when having too many rounds leads
to overfitting. You can see this feature as a cousin of the cross-validation method. The following
techniques will help you to avoid overfitting or to optimize the learning time by stopping it as soon as
possible.
One way to measure progress in the learning of a model is to provide to XGBoost a second dataset
already classified. Therefore it can learn on the first dataset and test its model on the second one.
Some metrics are measured after each round during the learning.

In some way it is similar to what we have done above with the average error. The
main difference is that above it was after building the model, and now it is during the
construction that we measure errors.

For the purpose of this example, we use the watchlist parameter. It is a list of xgb.DMatrix, each of
them tagged with a name.

watchlist <- list(train = dtrain, test = dtest)

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")

## [0] train-error:0.046522  test-error:0.042831
## [1] train-error:0.022263  test-error:0.021726

XGBoost has computed at each round the same average error metric as seen above (we
set nround to 2, that is why we have two lines). Obviously, the train-error number is related to
the training dataset (the one the algorithm learns from) and the test-error number to the test
dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what
we have learned from the training dataset matches the observations from the test dataset.


If with your own dataset you do not have such results, you should think about how you divided your
dataset into training and test. Maybe there is something to fix. Again, the caret package may help.

For a better understanding of the learning progression, you may want to have some specific metric
or even use multiple evaluation metrics.

bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")

## [0] train-error:0.046522  train-logloss:0.233376  test-error:0.042831  test-logloss:0.22668
## [1] train-error:0.022263  train-logloss:0.136658  test-error:0.021726  test-logloss:0.13787

eval.metric allows us to monitor two new metrics for each
round, logloss and error.

Linear boosting
Until now, all the learnings we have performed were based on boosting
trees. XGBoost implements a second algorithm, based on linear boosting. The only difference with
the previous command is the booster = "gblinear" parameter (and removing the eta parameter).

bst <- xgb.train(data = dtrain, booster = "gblinear", max.depth = 2, nthread = 2, nround = 2, watchlist = watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")

## [0] train-error:0.024720  train-logloss:0.184616  test-error:0.022967  test-logloss:0.18423
## [1] train-error:0.004146  train-logloss:0.069885  test-error:0.003724  test-logloss:0.06808

In this specific case, linear boosting gets slightly better performance metrics than the decision tree
based algorithm.

In simple cases, this will happen because there is nothing better than a linear algorithm to catch a
linear link. However, decision trees are much better at catching a non-linear link between predictors
and outcome. Because there is no silver bullet, we advise you to check both algorithms with your
own datasets to have an idea of what to use.

Manipulating xgb.DMatrix

Save/Load

Like saving models, an xgb.DMatrix object (which groups both dataset and outcome) can also be
saved using the xgb.DMatrix.save function.

xgb.DMatrix.save(dtrain, "dtrain.buffer")

## [1] TRUE

# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")

## [11:41:01] 6513x126 matrix with 143286 entries loaded from dtrain.buffer

bst <- xgb.train(data = dtrain2, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")

## [0] train-error:0.046522  test-error:0.042831
## [1] train-error:0.022263  test-error:0.021726

Information extraction

Information can be extracted from an xgb.DMatrix using the getinfo function. Hereafter we will
extract the label data.

label = getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))

##[1]"testerror=0.0217256362507759"

View feature importance/influence from the learnt model
Feature importance is similar to the R gbm package's relative influence (rel.inf).

importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)

View the trees from a model

You can dump the tree you learned using xgb.dump into a text file.

xgb.dump(bst, with.stats = TRUE)

##[1]"booster[0]"
##[2]"0:[f28<1.00136e05]yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
##[3]"1:[f55<1.00136e05]yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
##[4]"3:leaf=1.71218,cover=812"
##[5]"4:leaf=1.70044,cover=112.5"
##[6]"2:[f108<1.00136e05]yes=5,no=6,missing=5,gain=198.174,cover=703.75"
##[7]"5:leaf=1.94071,cover=690.5"
##[8]"6:leaf=1.85965,cover=13.25"
##[9]"booster[1]"


##[10]"0:[f59<1.00136e05]yes=1,no=2,missing=1,gain=832.545,cover=788.852"
##[11]"1:[f28<1.00136e05]yes=3,no=4,missing=3,gain=569.725,cover=768.39"
##[12]"3:leaf=0.784718,cover=458.937"
##[13]"4:leaf=0.96853,cover=309.453"
##[14]"2:leaf=6.23624,cover=20.4624"

You can plot the trees from your model using xgb.plot.tree.

xgb.plot.tree(model = bst)

If you provide a path to the fname parameter you can save the trees to your hard drive.
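
A minimal sketch, assuming the fname mentioned here is the fname argument of xgb.dump shown above
(the file name is arbitrary):

# write the plain-text dump of the trees to disk instead of returning it
xgb.dump(bst, fname = "xgb.model.dump", with.stats = TRUE)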

Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of
losing time in redoing the same task again and again? In these very rare cases, you will want to
save your model and load it when required.

Hopefully for you, XGBoost implements such functions.

# save model to binary local file
xgb.save(bst, "xgboost.model")

## [1] TRUE

The xgb.save function should return TRUE if everything goes well, and crashes
otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare
the two predictions.

# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))

##[1]"sum(abs(pred2pred))=0"
