DonorsChoose.org
Winning Entry Documentation
Name: Jeremy Achin
Location: Boston, MA, United States
Email: jeremy@datarobot.com

Name: Xavier Conort
Location: Singapore
Email: xavier@datarobot.com

Name: Lucas Eustáquio Gomes da Silva
Location: Belo Horizonte, MG, Brazil
Email: lucas@datarobot.com
Summary

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, DonorsChoose.org ships the materials to the school.

The 2014 KDD Cup asked participants to help DonorsChoose.org identify projects that are exceptionally exciting to donors at the time of posting.

In order to predict how exciting a project is, data was provided in a relational format and split by dates. Any project posted prior to January 1, 2014 was in the training set (along with its funding outcomes). Any project posted after January 1, 2014 was in the test set. The test set used known outcomes from January 2014 to mid-May 2014. Kaggle ignored live projects in the test set and did not disclose which projects were still live, to avoid leakage regarding the funding status.
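The date-based split above can be sketched as follows (a minimal illustration; "date_posted" is the posting-date column in projects.csv, and the sample rows are made up):

```python
# Illustrative sketch of the competition's date-based train/test split.
import pandas as pd

projects = pd.DataFrame({
    "projectid": ["a", "b", "c"],
    "date_posted": pd.to_datetime(["2013-06-01", "2013-12-31", "2014-02-15"]),
})

cutoff = pd.Timestamp("2014-01-01")
train = projects[projects["date_posted"] < cutoff]   # outcomes known
test = projects[projects["date_posted"] >= cutoff]   # outcomes to predict
```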
A data dictionary of the data provided is available here:
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data

Our approach tried to extract the best features from the data and use them in 2 Gradient Boosting Machine models:
one based on the sklearn GradientBoostingRegressor
(http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
and one based on the R gradient boosting machine (gbm) package
(http://cran.r-project.org/web/packages/gbm/)
Both used 2013 outcomes only as the response. This gave us a significant gain in computation time without much loss in predictive accuracy.
We made the assumption that the mid-May cutoff in the test set produces a censoring effect on the response. To reproduce the assumed time effect on the response in the training set, we censored the response before training our models and created 2 types of censored outcomes:
random cutoff outcomes: exciting outcomes censored at 3 cutoffs drawn randomly from the first 131 days after the project was posted
20 weeks outcomes: exciting outcomes censored every week during the first 20 weeks
Feature Extraction

Our feature extraction consists of:
1. Raw features from projects.csv. This contains information about each project and was provided for both the training and test set.
2. Lapses between projects posted by teachers
3. Proxies of text posted by teachers, such as nb of characters, nb of words, stats on length of words, nb of sentences, nb of words per sentence, stats on punctuation usage, misspellings, etc.
4. Stacked predictions of the exciting outcome and of the criteria required to be qualified as exciting, based on information contained in the project title and essay posted by the teacher
5. Deviations from an expected project cost that was estimated by a stacked prediction of the cost. The model used to predict was a Gradient Boosting Machine that used as predictors "primary_focus_subject", "grade_level" and "students_reached"
6. Vendor id of the most expensive item in the project
7. Stacked predictions of the final exciting outcome based on the name of the most expensive item
8. History features

To build history features, we sliced time into chunks of 4 months, computed statistics for each chunk, and used as features the stats of the last 3 chunks prior to the time chunk of the project.
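Item 3's text proxies can be sketched as below. The specific statistics and the helper name are illustrative, not the competition code:

```python
# A minimal sketch of text proxy features (nb of chars, words, sentences,
# word lengths, words per sentence, punctuation usage).
import re
import statistics

def text_proxies(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lengths = [len(w) for w in words] or [0]
    return {
        "nb_chars": len(text),
        "nb_words": len(words),
        "nb_sentences": len(sentences),
        "mean_word_length": statistics.mean(word_lengths),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "nb_punctuation": sum(ch in ".,;:!?" for ch in text),
    }
```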
The stats that we computed for each chunk include stats on:
1. Projects posted by teachers:
a. nb of projects
b. for each criterion, sum of projects that met the criterion
c. criteria met by last project
d. mean project cost
2. Donations received by teachers:
a. nb of donations received
b. sum and last amounts received
3. Donations made by teachers:
a. nb of donations made
b. sum of amounts donated
c. sum of exciting projects to which the teachers donated money
d. sum of distances between the teacher location and the locations of projects they sponsored
4. Donations made by the zip, city, state of the project:
a. sum and mean amount donated
b. sum and mean of exciting outcomes of the projects sponsored
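The chunked history features can be sketched as follows. This is a pandas illustration under assumed column names; the real code computes many more stats per chunk than the single count shown here:

```python
# Sketch of history features: slice time into 4-month chunks, compute
# per-teacher stats per chunk, and give each project the stats of the
# 3 chunks preceding its own chunk.
import pandas as pd

projects = pd.DataFrame({
    "teacher_id": ["t1", "t1", "t1", "t1"],
    "date_posted": pd.to_datetime(
        ["2012-02-01", "2012-07-01", "2012-11-01", "2013-03-01"]),
    "cost": [100.0, 200.0, 300.0, 400.0],
})

# 4-month chunk index (3 chunks per year).
projects["chunk"] = (projects["date_posted"].dt.year * 3
                     + projects["date_posted"].dt.month.sub(1) // 4)

per_chunk = (projects.groupby(["teacher_id", "chunk"])
             .agg(nb_projects=("cost", "size"), mean_cost=("cost", "mean"))
             .reset_index())

def history(row, lag):
    """Stat of the chunk `lag` chunks before the project's own chunk."""
    prev = per_chunk[(per_chunk["teacher_id"] == row["teacher_id"])
                     & (per_chunk["chunk"] == row["chunk"] - lag)]
    return prev["nb_projects"].iloc[0] if len(prev) else 0

for lag in (1, 2, 3):
    projects[f"nb_projects_chunk_minus_{lag}"] = projects.apply(
        history, axis=1, lag=lag)
```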
To build stacked predictions of the exciting outcome and criteria met by a project, we trained regularized regressions (from the R package glmnet) on word 2-grams document-term matrices generated from the project title and the essay posted by the teacher. Regressions were trained by primary focus area, one model for each area.
For each area, we built:
one logistic regression (L1 penalty) to predict is_exciting using the title document-term matrix
one logistic regression (L2 penalty) to predict is_exciting using the essay document-term matrix
regressions (L2 penalty) to predict each criterion to qualify (fully funded, at_least_1_teacher_referred_donor, ...) using the essay document-term matrix only
Stacked predictions of the final exciting outcome based on the name of the most expensive item used an elastic net logistic regression (from glmnet) trained on a word 2-grams document-term matrix.
All text stacked predictions were generated via 5-fold cross-validation.
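The out-of-fold construction of these stacked features can be sketched as below. glmnet is swapped for scikit-learn's LogisticRegression here, and the toy titles and labels are made up; the point is that each row's stacked prediction comes from a model that never saw that row's label:

```python
# Simplified sketch of 5-fold out-of-fold stacked text predictions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

titles = ["new books for class", "art supplies needed", "books for reading",
          "science kits", "reading corner books", "art easels"] * 5
y = np.array([1, 0, 1, 0, 1, 0] * 5)

# Word 1-2 gram document-term matrix.
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(titles)

stacked = np.zeros(len(y))
for tr, va in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X, y):
    model = LogisticRegression(penalty="l2", C=1.0).fit(X[tr], y[tr])
    stacked[va] = model.predict_proba(X[va])[:, 1]  # out-of-fold feature
```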
From this, we created 2 sets of features:

FG1: mostly describes the projects to predict, the teachers' project history and the teachers' donations history. The feature set includes:
Raw features from projects.csv
Lapses between projects posted by teachers
Stats on past projects posted by teachers
Stats on past donations received by teachers
Stats on past donations made by teachers
Text proxies of text posted by teachers

FG2: includes more features on the projects and uses only the history of donations made by the teachers or the project locations (zip, city and state). This feature set can be seen as designed for teachers with low/no project history, while the first set relies more on past performance of teachers' projects. It includes:
Raw features from projects.csv
Lapses between projects posted by teachers
Stats on past donations made by teachers
Stats on past donations made by the zip, city, state of the project
Stacked predictions of the exciting outcome and criteria required for a project to be qualified as exciting, based on the project titles and essays posted by teachers
Deviations from the project expected cost
Vendor id of the project's most expensive item
Stacked predictions of the exciting outcome based on the name of the most expensive item
Modeling Techniques and Training

We made the assumption that the mid-May cutoff in the test set produces a censoring effect on the response, and we expected this effect to be much stronger for the most recent months. To reproduce the assumed time effect on the response in the training set, we censored the response before training our models and created 2 types of censored outcomes:
random cutoff outcomes: exciting outcomes censored at 3 cutoffs drawn randomly from the first 131 days after the project was posted
20 weeks outcomes: exciting outcomes censored every week during the first 20 weeks
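The two censoring schemes can be sketched as below. This illustrates the idea only: the function and variable names are ours, and "days_to_exciting" is a hypothetical field for when a project met the exciting criteria:

```python
# Sketch of the two censored-outcome constructions.
import random

def censor_outcome(is_exciting, days_to_exciting, cutoff_days):
    """Censored label: exciting only if it happened before the cutoff."""
    return int(is_exciting and days_to_exciting <= cutoff_days)

# Random cutoff outcomes: 3 cutoffs drawn from the first 131 days.
rng = random.Random(0)
random_cutoffs = [rng.randint(1, 131) for _ in range(3)]

# 20 weeks outcomes: one cutoff per week for the first 20 weeks.
weekly_cutoffs = [7 * w for w in range(1, 21)]
```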
We trained:
One sklearn gradient boosted trees model to predict the random cutoff outcomes. The model used the FG1 feature set and the random cutoff as predictors. The training set size was multiplied by 3, as each record of the training set had 3 censored responses.
Twenty R gradient boosting machine (gbm) models: one model for each week of the 20 weeks outcomes. All models used the FG2 feature set as predictors.
All models were trained with 2013 outcomes only.
The sklearn gradient boosted trees model used as hyperparameters:
n_estimators: 2000
learning_rate: 0.01
max_features: 12
max_depth: 7
subsample: 1

The 20 R gradient boosting machine models used as hyperparameters:
distribution = "bernoulli"
n.trees = 2500 + week_n * 100
n.minobsinnode = 10
interaction.depth = 5
shrinkage = 0.01
bag.fraction = 0.75
with week_n the nb of weeks used to censor the response
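The sklearn configuration maps directly onto scikit-learn's API. Only the hyperparameter values come from the document; the training data (the FG1 matrix plus the random cutoff) is not shown, and the R gbm tree counts are reproduced as a simple lookup:

```python
# The sklearn gradient boosted trees model with the hyperparameters above.
from sklearn.ensemble import GradientBoostingRegressor

sklearn_gbm = GradientBoostingRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_features=12,
    max_depth=7,
    subsample=1.0,
)

# The 20 R gbm models grow more trees for later weeks: n.trees = 2500 + week_n*100.
n_trees_for_week = {week_n: 2500 + week_n * 100 for week_n in range(1, 21)}
```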
To predict outcomes in the test set:
We first computed the nb of days n between the project posted date and the test set cutoff (May 12, 2014)
When using the sklearn random cutoff gbm, the nb of days n was used as a predictor
When using the R 20 weeks gbms, we selected the gbms that were trained with a number of weeks close to n/7
We averaged the 2 solutions
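The test-time blend can be sketched as follows. The model objects here are stand-ins (a constant-returning function per model), not the competition code; only the cutoff date, the n/7 selection rule and the averaging come from the steps above:

```python
# Sketch of the test-time blend of the two solutions.
from datetime import date

def blend_prediction(date_posted, predict_rc, weekly_models):
    # n days between posting and the test set cutoff (May 12, 2014).
    n = (date(2014, 5, 12) - date_posted).days
    p_rc = predict_rc(n)                       # sklearn model, n as predictor
    week = min(weekly_models, key=lambda w: abs(w - n / 7))
    p_weekly = weekly_models[week]()           # R gbm trained closest to n/7 weeks
    return (p_rc + p_weekly) / 2               # average the 2 solutions

pred = blend_prediction(
    date(2014, 2, 1),
    predict_rc=lambda n: 0.4,
    weekly_models={w: (lambda: 0.6) for w in range(1, 21)},
)
```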
Code Description

Code to generate FG1 features:
FG1_functions.R (main folder): Support functions
RUN_FG1.R (main folder): Runs solution for FG1 features
FG1_read_files.R (main folder): Reads competition files and does simple feature transformations
FG1_cost.R (main folder): Builds history of cost of teachers' past projects
FG1_outcomes.R (main folder): Builds history of outcomes of teachers' past projects
FG1_received.R (main folder): Builds history of donations received by teachers
FG1_donated.R (main folder): Builds history of donations made by teachers
FG1_txt_proxies.R, FG1_vocab.R, FG1_proxies.R (main folder): Build text proxies of text posted by teachers
FG1_lapse.R (main folder): Computes lapse between projects of a same teacher
FG1_subset.R (main folder): List of FG1 features
FG1_Conso.R (main folder): Consolidates features and saves them to disk
Code to generate stacked text features:
FG2_essay_NLP.R (main folder): Saves to disk text posted by teachers
FG2_resources.R (main folder): Extracts item name and vendor id of the most expensive item of a project and saves them to disk
RUNNLP.R (NLP folder): Runs stacked predictions solution for text posted by teachers and item name
GLMNETsFITS.R (NLP folder): Trains stacked predictions solution for text posted by teachers and saves model and stacked predictions to disk
GLMNETFITSitem.R (NLP folder): Trains stacked predictions solution for item name and saves model and stacked predictions to disk
_DTM_WORDS.R (NLP folder): Converts text into word n-grams document-term matrix
_NUMBERS.R (NLP folder): Converts numbers into text
_KFolds.R (NLP folder): Partitions data into K folds
_METRICS.R (NLP folder): Function to compute evaluation metrics
CV_GLMNET.R (NLP folder): Function to train glmnet on K folds
GLMNETsPREDICT.R (NLP folder): Predictions based on text posted by teachers; saves predictions to disk
GLMNETPREDICTitem.R (NLP folder): Predictions based on item name; saves predictions to disk
Code to generate FG2 features:
FG2_functions.R (main folder): Support functions
RUN_FG2.R (main folder): Runs solution for FG2 features
FG2_essay_NLP.R (main folder): Saves to disk text posted by teachers
FG2_donations_distance.R (main folder): Builds features of relative location of donations (received and made by teachers)
FG2_cost_deviation.R (main folder): Estimates a normal cost for a project
FG2_donation_history_per_location.R (main folder): Builds history of donations made by the zip, city and state of the project
FG2_subset.R (main folder): List of FG2 features
FG2_Conso.R (main folder): Consolidates features and saves them to disk
Code to generate censored outcomes:
fn.base.R (kddcup2014r folder): Support functions
data.build.R (kddcup2014r folder): Builds the features and saves them to disk

Code to train and predict:
sci_learn_train.py (kddcup2014py folder): Python script to train gradient boosted trees
train.FG1.rc.R (kddcup2014r folder): Trains the random cutoff outcomes model
train.FG2.20W.R (kddcup2014r folder): Trains the 20 weeks outcomes models
train.ens.R (kddcup2014r folder): Averages the 2 solutions and saves the submission file in data/submission
How to Run the Code

1. Unzip KDD2014_DATAROBOT.zip
2. Put competition files into the "data/input" folder.
3. Open an R session with folder "KDD2014_DATAROBOT" set as working dir.
4. Run RUN_FG1.R
5. Open an R session with folder "KDD2014_DATAROBOT/NLP" set as working dir.
6. Run RUNNLP.R
7. Open an R session with folder "KDD2014_DATAROBOT" set as working dir.
8. Run RUN_FG2.R
9. Open an R session with folder "KDD2014_DATAROBOT/kddcup2014r" set as working dir.
10. Run data.build.R
11. Run train.FG1.rc.R
12. Run train.FG2.20W.R
13. Run train.ens.R

The predictions will be saved in KDD2014_DATAROBOT/data/submission/ens.csv
Dependencies

To build the solution, R and Python were used. The R version used was 3.0.2, and the Python version was 2.7.3.
As for the packages:
R: SOAR 0.99-11, doSNOW 1.0.9, foreach 1.4.1, cvTools 0.3.2, data.table 1.8.10, Matrix 1.1-4, tau 0.0-15, RTextTools 1.4.1, glmnet 1.9-5, gbm 2.1
Python: pandas 0.13.1, numpy 1.8.1, scikit-learn 0.15.0
All the listed versions are the ones used. The code will probably work with newer versions, but this wasn't tested.
Additional Comments

The time bias present in the test set made predictions very challenging.
We chose to trust our solutions based on censored outcomes rather than solutions using the raw response and a linear decay to adjust the submission.
Based on other competitors' feedback and our own (unselected) submissions, models with a more aggressive time decay performed better on the Private Leaderboard (our highest score among unselected submissions went up to 0.685).
This could be explained by either a seasonality that we didn't capture in our models or some additional censoring done by Kaggle in the test set.
References

J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, Vol. 29, No. 5, 2001.
J. Friedman, "Stochastic Gradient Boosting," 1999.
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2nd Ed., Springer, 2009.