
KDD Cup 2014: Predicting Excitement

at DonorsChoose.org

{100277E, 100131D, 100381R, 100393F}
Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka.

Introduction

DonorsChoose.org is an online charity organization that supports teachers in publishing their projects online and gives willing donors an opportunity to cater to the schools in need. DonorsChoose.org is interested in knowing which of the projects proposed by teachers are exciting. So the requirement is to train a model that will predict whether a given project is exciting or not. The guidelines to determine the level of excitement of a project are given in [1].

DonorsChoose.org determines how exciting a given project is by means of an evaluation criterion, described below. A project is exciting if it:

- was fully funded (fully_funded)
- had at least one teacher-acquired donor (at_least_1_teacher_referred_donor)
- has a higher than average percentage of donors leaving an original message (great_chat)
- has at least one "green" donation (at_least_1_green_donation)
- has one or more of:
  - donations from three or more non-teacher-acquired donors (three_or_more_non_teacher_referred_donors)
  - one non-teacher-acquired donor gave more than $100 (one_non_teacher_referred_donor_giving_100_plus)
  - the project received a donation from a "thoughtful donor" (donation_from_thoughtful_donor)

Data

The data provided by Kaggle is in a relational format and split by dates. It basically contains details on the projects, donations, resources needed by the projects, and essay statements. Kaggle treats any project posted after 2014-01-01 as test data, and the donation details and outcome details are not available for those projects.


The following data files are provided for training and testing the models.

- donations.csv: information about donations provided for the projects in the training set.
- projects.csv: information about the projects in both the training and testing sets.
- sampleSubmission.csv: project ids for the test set and the submission format, as a guide to the competitors.
- resources.csv: information about the requested resources for each project. This file contains a large set of attributes that describe the resources requested by teachers.
- essays.csv: essays written by teachers for the proposed projects. These are the statements written by teachers explaining the requirement and the importance of receiving the resources.
- outcomes.csv: information about the outcomes of the projects in the training set.

Data integration

For data integration, we used the Pandas Python library. It provides a lot of services for handling CSV files, including R-like data frames that can easily be used for SQL-like operations known from relational DBs (joins, grouping, etc.).

We combined the essays.csv, projects.csv, outcomes.csv and resources.csv files to get the necessary data. Extracting data from the resources.csv file was achieved as follows: in order to merge resources with projects, we first group the resources by project id; to get the total resource cost, we multiply the item unit price by the item quantity and take the sum within each group.
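The merging described above can be sketched roughly as follows with Pandas; the file and column names (projectid, item_unit_price, item_quantity) follow the competition data dictionary, but the code is illustrative rather than our submitted script.

import pandas as pd

# Assumed file paths: the competition CSV files in the working directory.
projects = pd.read_csv("projects.csv")
essays = pd.read_csv("essays.csv")
outcomes = pd.read_csv("outcomes.csv")
resources = pd.read_csv("resources.csv")

# Total resource cost per project: unit price * quantity, summed within each project.
resources["item_cost"] = resources["item_unit_price"] * resources["item_quantity"]
resource_cost = (resources.groupby("projectid")["item_cost"]
                 .sum()
                 .rename("total_resource_cost")
                 .reset_index())

# Left joins keep the test projects, which have no rows in outcomes.csv.
data = (projects
        .merge(essays, on="projectid", how="left")
        .merge(resource_cost, on="projectid", how="left")
        .merge(outcomes, on="projectid", how="left"))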

Data Preprocessing

Data preprocessing was mainly used for TF-IDF vector generation from the essay data. First we need to clean the essays to remove illegal characters such as \r, which cause errors in TF-IDF vectorization if not removed.

The following techniques were used to improve the results of essay vectorization (a short sketch follows the list).

- stop word removal
- lower casing
- stemming and lemmatization
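A minimal cleaning sketch along these lines, assuming the NLTK stop word and WordNet resources have been downloaded:

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_essay(text):
    """Lowercase, strip control characters, drop stop words, then lemmatize and stem."""
    text = re.sub(r"[\r\n\t]", " ", str(text)).lower()
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens)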


Approach 1. TF-IDF Based Classifier (Essays only)

In this solution, we used only the essays.csv file to determine whether a given project is_exciting or not. TF-IDF stands for Term Frequency-Inverse Document Frequency, a technique used to maximize the weight of significant terms in textual data.

The scikit-learn library has built-in functions that return TF-IDF vectors for a given input text. We tried using both the essays and the need statements. In all cases we removed stop words and used lower casing; a minimal sketch of the pipeline follows the results below.

- Essays with a maximum of 2000 features: 0.56531 area under the ROC curve (AUC)
- Essays with a maximum of 20000 features: 0.56724 AUC
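A minimal sketch of this pipeline, assuming train and valid are data frames holding the cleaned essay text and the is_exciting labels (the essay column name matches the competition files; max_features is one of the settings reported above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# ngram_range=(1, 2) was the setting we applied to the need statements.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True, max_features=20000)
X_train = vectorizer.fit_transform(train["essay"].fillna(""))
X_valid = vectorizer.transform(valid["essay"].fillna(""))

clf = LogisticRegression()
clf.fit(X_train, train["is_exciting"])
pred = clf.predict_proba(X_valid)[:, 1]           # probability of the exciting class
print(roc_auc_score(valid["is_exciting"], pred))  # area under the ROC curve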

We further tried to improve the model using stemming and lemmatization. In linguistic morphology, stemming is the process of stripping affixes from words in order to obtain a base form of a set of terms (e.g. stemming → stem). This process helps to reduce the number of different terms that share the same semantic base. Further, we utilized the lemmatization provided in the NLTK Python library. NLTK has a WordNet-based lemmatizer that removes affixes only if the resulting word is in its dictionary. A lemmatizer is more advanced than a stemmer in the sense that it detects non-trivial semantic bases (women → woman, children → child).
Results showed an improvement after applying lemmatization followed by a regular stemming step. Adopting an n-gram (e.g. bigram) based technique for the essays would likely have improved the accuracy considerably, but calculating n-gram sequences consumes more RAM than our systems had. However, we managed to apply n-grams to the need statements, which lifted the accuracy from 0.51 to 0.53.

The text-based classification model was trained using logistic regression.

Approach 2. Regular features with different algorithms

In a later attempt, we tried a different model by eliminating the essay details and focusing on learning with the other attributes, especially the ones from projects.csv. We trained several models using different sets of attributes. We used the Pandas Python library to manipulate the CSV files and feature vectors. A one-hot data frame (One_Hot_DataFrame) is a convenient way of converting categorical data into numerical attributes and producing feature vectors.

For example, consider the attribute poverty_level = {highest poverty, high poverty, moderate poverty, low poverty}.

A one-hot data frame will create a column to represent each level and put 1 in the column matching the value given for a project, while the other columns are zero (see the sketch below).

E.g.: highest, high, moderate, low → 0, 1, 0, 0
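An illustrative version of this step using pandas.get_dummies (the One_Hot_DataFrame mentioned above is assumed to behave similarly):

import pandas as pd

df = pd.DataFrame({"poverty_level": ["high poverty", "highest poverty", "low poverty"]})
one_hot = pd.get_dummies(df["poverty_level"], prefix="poverty_level")
print(one_hot)
# Each level becomes its own 0/1 column; a row has 1 only in the column
# that matches its poverty_level value.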

The essay length benchmark is 0.54531.

Approach 3. Hybrid approach (Regular Classifier + TF-IDF Classifier)

We built several models using a combination of the above two approaches. TF-IDF vectors obtained from the essay data were appended to the regular feature vector. However, simply concatenating the two vectors degrades the accuracy. The reason is that the higher accuracy obtained from the project features is diluted by the high-dimensional TF-IDF vectors with their lower accuracy, so the overall accuracy falls below expectations.

The problem was the way we merged the two vectors. So in our next attempt we devised a different solution to combine the effects of the TF-IDF vectors and the other feature vectors. We trained two separate models for the TF-IDF and project features. These two models separately output two probability values. We then trained a third model on these 2-valued tuples to output a real number that acts as the overall probability for a given project. The following figure illustrates the model, and a short sketch of the idea follows it. This method worked exceptionally well, and we recorded our highest place on the Kaggle leaderboard with this approach.

Figure 1. Two-tier hierarchical classification model
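A minimal sketch of the two-tier model in Figure 1; X_tfidf_*/X_proj_* stand for the TF-IDF and project feature matrices, y_train for the is_exciting labels, and the use of logistic regression as the combiner is illustrative rather than prescriptive. In practice, training the combiner on out-of-fold first-tier predictions avoids leaking the training labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# First tier: one model per feature type.
tfidf_clf = LogisticRegression().fit(X_tfidf_train, y_train)
proj_clf = GradientBoostingClassifier(n_estimators=100, max_depth=5).fit(X_proj_train, y_train)

# The two first-tier probabilities become the two features of the second-tier model.
stacked_train = np.column_stack([
    tfidf_clf.predict_proba(X_tfidf_train)[:, 1],
    proj_clf.predict_proba(X_proj_train)[:, 1],
])
combiner = LogisticRegression().fit(stacked_train, y_train)

stacked_test = np.column_stack([
    tfidf_clf.predict_proba(X_tfidf_test)[:, 1],
    proj_clf.predict_proba(X_proj_test)[:, 1],
])
final_prob = combiner.predict_proba(stacked_test)[:, 1]  # overall probability per project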

Failed efforts

In TF-IDF vectorization, we assumed nouns would be more important than other words, so we applied POS (part of speech) tagging using NLTK and created a dictionary using only nouns (1000 nouns form the feature vector for TF-IDF). However, this strategy did not improve the results.
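A sketch of this noun-only vocabulary experiment, assuming the NLTK tokenizer and POS-tagger models (punkt, averaged_perceptron_tagger) are downloaded:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def nouns_only(text):
    """Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith("NN")]

vectorizer = TfidfVectorizer(tokenizer=nouns_only, lowercase=True, max_features=1000)
X_nouns = vectorizer.fit_transform(train["essay"].fillna(""))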

Simple concatenation of the TF-IDF vectors with the regular feature vector did not produce better results. So we devised a two-level hierarchical classification model to combine the effect of the two types of vectors.

We thought the derived feature per_student_cost = total_cost / students_reached might increase the accuracy, because it is an indicator of the importance of a project from a student's perspective. But it did not improve the results.

Milestone Submissions

In this section we describe the milestone submissions that we made to Kaggle. Our team, Sapients, made a total of 36 submissions, but here are the critical and important submissions through which we achieved the final results.

Submission 1

In our initial submission, we used all the data posted before 2014-01-01 as our training set, with the following selected attributes (a minimal sketch of this pipeline follows the figure). The model was trained using logistic regression.

'poverty_level',
'primary_focus_area',
'fulfillment_labor_materials',
'total_price_excluding_optional_support',
'students_reached',
'school_year_round',
'secondary_focus_area',
'grade_level',
'eligible_double_your_impact_match',
'teacher_teach_for_america',
'teacher_ny_teaching_fellow',
'eligible_almost_home_match',
'school_magnet',
'resource_type',
'school_charter',
essay_length  # a derived feature with high importance

ROC score = 0.59447

Figure 2. ROC accuracy
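A hedged sketch of the Submission 1 pipeline; data is the merged frame from the data integration step, only a subset of the listed attributes is shown, and the t/f encoding of is_exciting is an assumption based on outcomes.csv.

import pandas as pd
from sklearn.linear_model import LogisticRegression

categorical = ["poverty_level", "primary_focus_area", "school_year_round",
               "grade_level", "resource_type", "school_charter"]
numeric = ["fulfillment_labor_materials", "total_price_excluding_optional_support",
           "students_reached"]

data["essay_length"] = data["essay"].fillna("").str.len()   # derived feature

X = pd.concat([pd.get_dummies(data[categorical].astype(str)),  # one-hot categorical attributes
               data[numeric].fillna(0),
               data[["essay_length"]]], axis=1)
y = data["is_exciting"] == "t"                               # assumed t/f flags in outcomes.csv

train_mask = pd.to_datetime(data["date_posted"]) < "2014-01-01"
clf = LogisticRegression(max_iter=1000).fit(X[train_mask], y[train_mask])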

Submission 2

After studying the time series analysis given in [2], we removed all the data prior to 2010-01-01, because most of the older data is obsolete and hinders mining most of the interesting patterns. Further, most of the older data consists of negative examples, so the algorithms fail to see strong patterns for determining positive examples. Also, we added the month as a separate feature. These changes improved the results by a considerable amount. Through Submission 2 we achieved a ROC score of 0.60128.
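These changes amount to a simple filtering and feature-derivation step on the merged data frame, sketched below (date_posted is a column of projects.csv).

import pandas as pd

# Drop projects posted before 2010-01-01 and add the posting month as a feature.
data["date_posted"] = pd.to_datetime(data["date_posted"])
data = data[data["date_posted"] >= "2010-01-01"]
data["date_posted_month"] = data["date_posted"].dt.month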

Submission 3

Changing the algorithm from logistic regression to gradient tree boosting improved the results to 0.61190. We used gradient tree boosting with n_estimators=100 and a maximum depth of 5.

The following graph shows the improvement in accuracy for each major submission. The vertical axis represents the percentage improvement with respect to the essay length benchmark.

Figure 3. ROC accuracy of the models

As an alternative experiment, we developed a classifier based on a bitmap feature representation. Bitmap feature vectors are used to create lightweight vectors for nominal attributes. These models were trained using the logistic regression algorithm.

The following features were used for bitmap encoding, and the outcome was converted into float values.

primary_focus_area
school_year_round
school_charter
fulfillment_labor_materials
teacher_teach_for_america
school_magnet
school_kipp
grade_level

primary_focus_subject
poverty_level
school_state
secondary_focus_area
school_charter_ready_promise
teacher_prefix
eligible_double_your_impact_match
teacher_ny_teaching_fellow
secondary_focus_subject
eligible_almost_home_match
school_nlns
school_metro
resource_type
date_posted_month

Additionally, the following features, which are floats by default, were used.

total_price_including_optional_support
total_price_excluding_optional_support
support_price = total_price_including_optional_support - total_price_excluding_optional_support
students_reached

The accuracy of the model is given in the following table with respect to the attributes selected in each training iteration. As can be seen, the results are not very impressive, but this effort was helpful in determining which attributes to choose. One reason for these low results may be the variance of the attribute values; normalization would likely have increased the accuracy, but since this was experimental, we decided to keep things simple.
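For completeness, a sketch of the normalization step we skipped, applied to the float-valued columns listed above:

from sklearn.preprocessing import StandardScaler

# Scale the float features (prices, students_reached) to zero mean and unit variance.
float_cols = ["total_price_including_optional_support",
              "total_price_excluding_optional_support", "students_reached"]
data[float_cols] = StandardScaler().fit_transform(data[float_cols].fillna(0))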

In the following table we list the attributes added to the feature vector and the ROC score obtained, using the projects posted before 2013-01-01 as training data and the projects posted after 2013-01-01 and before 2014-01-01 as test data.

| Iteration | Attributes | Accuracy (ROC) | Remarks |
|---|---|---|---|
| 1 | all the features in the bitmap-encoded list | 0.508950 | |
| 2 | primary_focus_area, month, teacher_prefix | 0.518087 | adding resource_type degrades accuracy |
| 3 | primary_focus_area, month, teacher_prefix, school_year_round | 0.519026 | removing month improved the value |
| 4 | primary_focus_area, teacher_prefix, school_year_round | 0.528135 | removing primary_focus_area improved the value |
| 5 | teacher_prefix, school_year_round | 0.535579 | removing school_year_round improved the value |
| 6 | teacher_prefix | 0.538610 | adding grade_level degrades the accuracy |
| 7 | teacher_prefix, fulfillment_labor_materials, teacher_teach_for_america | 0.546426 | fulfillment_labor_materials has no effect; adding poverty_level or school_state degrades the accuracy |
| 8 | teacher_prefix, teacher_teach_for_america, school_charter_ready_promise | 0.546784 | school_charter_ready_promise slightly increased the accuracy |
| 9 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow | 0.547151 | eligible_almost_home_match and school_metro degrade the accuracy |
| 10 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow, school_charter_ready_promise, total_price_including_optional_support | 0.552683 | total_price_excluding_optional_support and support_cost degrade the accuracy |

Table 1. Selected attributes and corresponding model accuracies

We then drew diagrams to check whether there is any correlation between these parameters that we had found by trial and error.

Figure 4. Attribute sets vs. accuracy achieved by each attribute set

Different Algorithms

In this project, we used a number of classification algorithms to build the is_exciting classifier. The selection of classification algorithms was based on trial and error. Since we only had 5 submissions per day, we subdivided the training data set again into a training set and a test set to compare the results of the different algorithms. We basically tried:

- Support Vector Machine Classifier
- Logistic Regression Classifier
- Gradient Tree Boosting Classifier

SVM

Support Vector Machines are popular as large-margin classifiers because of their ability to find an optimum decision boundary that separates two classes. However, from an efficiency point of view, they take a lot of time to train. After some initial trials, we decided not to use SVM in this project.
Logistic Regression


We used logistic regression as one of our main algorithms in this project. The logistic regression model was used to train the TF-IDF based classifier. A scikit-learn logistic regression model can be obtained as follows.

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
    fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)

A detailed explanation of the parameters can be found in [4].

Gradient Tree Boosting

Gradient Tree Boosting is an ensemble learning method we used in the project, and it produced a very good model. Gradient tree boosting is based on decision trees. The reason for using an ensemble method is that it builds several base estimators and merges them to produce more generalized results, so ensemble methods often produce better results.

Scikit-learn provides a very good implementation of the Gradient Tree Boosting algorithm [5], which can be invoked as follows.

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1,
                                 random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)  # mean accuracy on the held-out data

The tunable parameters in this algorithm are n_estimators, learning_rate, max_depth and random_state. In general, smaller learning rates produce more accurate models at the expense of training time. The number of base learners can be set by n_estimators, and the size of each tree can be set by the max_depth parameter. However, the values chosen for these parameters are always a tradeoff between accuracy and available resources.

Challenges

As in any data mining project, there were certain challenges that we needed to address. The data files contain so many attributes that we had to eliminate unnecessary data and use a minimal set of important attributes. The reason is that we did not have enough computational resources to process all of the given features. Further, using all the given features without examining their effect would reduce the quality of the model; the models sometimes tend to overfit in the presence of certain feature sets.

Finding proper resources for running the algorithms was another challenge.

Initially we used the entire data set provided by Kaggle to train our models, but the accuracy did not reach the expected levels. Later on, while looking for patterns in the data set, we observed that the percentage of is_exciting projects is very low, so the algorithms fail to extract sufficiently strong patterns to decide which projects are is_exciting. We removed all data prior to 2010-01-01 and used the reduced data set for training, which lifted the accuracy by a considerable amount, as expected.

Improvements

A better dimensionality reduction algorithm could be used to identify a reduced set of attributes. In this project, feature selection was based on simple visualization techniques and intuition.

Ensemble learning is a powerful technique for achieving improved accuracy. We have already exploited an ensemble method available in scikit-learn (gradient tree boosting), but there is enough room to improve the accuracy with more advanced ensemble techniques at the expense of resources and time.

Parameter configuration is also an important aspect of the optimal use of a learning algorithm. We can train with a lower learning rate for higher accuracy and perform other parameter tuning in the algorithms for better results; a minimal sketch of such tuning is given below.
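A minimal sketch of such tuning with scikit-learn's grid search; the grid values are illustrative and were not used in our submissions.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1, 1.0],
    "max_depth": [3, 5],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=3)  # select parameters by cross-validated AUC
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)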

References

[1] KDD Cup 2014 on Kaggle, [online]
http://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data

[2] Time series analysis on KDD Cup 2014 datasets, [online] http://rpubs.com/wacax/21669

[3] Introduction to ROC curves, [online] http://gim.unmc.edu/dxtests/ROC1.htm

[4] Logistic Regression, scikit-learn official documentation, [online]
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[5] Gradient Tree Boosting, scikit-learn official documentation, [online]
http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

[6] Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems), Jiawei Han and Micheline Kamber, September 2000.

