
KDD Cup 2014: Predicting Excitement

at DonorsChoose.org

{100277E, 100131D, 100381R, 100393F}
Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka.

Introduction

DonorsChoose.org is an online charity organization that supports teachers in publishing their projects online and gives willing donors an opportunity to cater to the schools in need. DonorsChoose.org is interested in knowing which of the projects proposed by teachers are exciting. So the requirement is to train a model that will predict whether a given project is exciting or not. The guidelines to determine the level of excitement of a project are given in [1].

DonorsChoose.org determines how exciting a given project is by means of an evaluation criterion, described below. A project is exciting if it:

- was fully funded (fully_funded)
- had at least one teacher-acquired donor (at_least_1_teacher_referred_donor)
- has a higher than average percentage of donors leaving an original message (great_chat)
- has at least one "green" donation (at_least_1_green_donation)
- has one or more of:
  - donations from three or more non-teacher-acquired donors (three_or_more_non_teacher_referred_donors)
  - one non-teacher-acquired donor gave more than $100 (one_non_teacher_referred_donor_giving_100_plus)
  - the project received a donation from a "thoughtful donor" (donation_from_thoughtful_donor)

Data

The data provided by Kaggle is in a relational format and split by dates. It basically contains details on the projects, donations, resources needed by the projects, and essay statements. Kaggle treats any project posted after 2014-01-01 as test data, and the donation details and outcome details are not available for those projects.


The following data files are provided for training and testing the models.

- donations.csv: information about donations provided for the projects in the training set.
- projects.csv: information about the projects in both the training and testing sets.
- sampleSubmission.csv: project ids for the test set and the submission format, as a guide to the competitors.
- resources.csv: information about the requested resources for each project. This file contains a large set of attributes that describe the resources requested by teachers.
- essays.csv: essays written by teachers for the proposed projects. These are the statements written by teachers explaining the requirement and the importance of receiving the resources.
- outcomes.csv: information about the outcomes of the projects in the training set.

Data integration

For data integration, we used the Pandas Python library. It provides a lot of services for handling CSV files, including R-like data frames that can easily be used for SQL-like operations known from relational DBs (joins, grouping, etc.).

We combined the essays.csv, projects.csv, outcomes.csv and resources.csv files to get the necessary data. Extracting data from the resources.csv file was achieved as follows: in order to merge resources with projects, we first group the resources by project id; to get the total resource cost, we multiply the item unit price by the item quantity and take the sum within each group.
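The merging described above can be sketched roughly as follows with Pandas; the file and column names (projectid, item_unit_price, item_quantity) follow the competition data dictionary, but the code is illustrative rather than our submitted script.

import pandas as pd

# Assumed file paths: the competition CSV files in the working directory.
projects = pd.read_csv("projects.csv")
essays = pd.read_csv("essays.csv")
outcomes = pd.read_csv("outcomes.csv")
resources = pd.read_csv("resources.csv")

# Total resource cost per project: unit price * quantity, summed within each project.
resources["item_cost"] = resources["item_unit_price"] * resources["item_quantity"]
resource_cost = (resources.groupby("projectid")["item_cost"]
                 .sum()
                 .rename("total_resource_cost")
                 .reset_index())

# Left joins keep the test projects, which have no rows in outcomes.csv.
data = (projects
        .merge(essays, on="projectid", how="left")
        .merge(resource_cost, on="projectid", how="left")
        .merge(outcomes, on="projectid", how="left"))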

Data Preprocessing

Data preprocessing was mainly used for TF-IDF vector generation from the essay data. First we need to clean the essays to remove illegal characters such as \r, which cause errors in TF-IDF vectorization if not removed.

The following techniques were used to improve the results of essay vectorization (a short sketch follows the list).

- stop word removal
- lower casing
- stemming and lemmatization
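A minimal cleaning sketch along these lines, assuming the NLTK stop word and WordNet resources have been downloaded:

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_essay(text):
    """Lowercase, strip control characters, drop stop words, then lemmatize and stem."""
    text = re.sub(r"[\r\n\t]", " ", str(text)).lower()
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens)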


Approach 1. TF-IDF Based Classifier (Essays only)

In this solution, we used only the essays.csv file to determine whether a given project is_exciting or not. TF-IDF stands for Term Frequency-Inverse Document Frequency, a technique used to maximize the weight of significant terms in textual data.

The scikit-learn library has built-in functions that return TF-IDF vectors for a given input text. We tried using both the essays and the need statements. In all cases we removed stop words and used lower casing; a minimal sketch of the pipeline follows the results below.

- Essays with a maximum of 2000 features: 0.56531 area under the ROC curve (AUC)
- Essays with a maximum of 20000 features: 0.56724 AUC
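A minimal sketch of this pipeline, assuming train and valid are data frames holding the cleaned essay text and the is_exciting labels (the essay column name matches the competition files; max_features is one of the settings reported above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# ngram_range=(1, 2) was the setting we applied to the need statements.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True, max_features=20000)
X_train = vectorizer.fit_transform(train["essay"].fillna(""))
X_valid = vectorizer.transform(valid["essay"].fillna(""))

clf = LogisticRegression()
clf.fit(X_train, train["is_exciting"])
pred = clf.predict_proba(X_valid)[:, 1]           # probability of the exciting class
print(roc_auc_score(valid["is_exciting"], pred))  # area under the ROC curve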

We further tried to improve the model using stemming and lemmatization. In linguistic morphology, stemming is the process of stripping affixes from words in order to obtain a base form of a set of terms (e.g. stemming → stem). This process helps to reduce the number of different terms that share the same semantic base. Further, we utilized the lemmatization provided in the NLTK Python library. NLTK has a WordNet-based lemmatizer that removes affixes only if the resulting word is in its dictionary. A lemmatizer is more advanced than a stemmer in the sense that it detects non-trivial semantic bases (women → woman, children → child).
Results showed an improvement after applying lemmatization followed by a regular stemming step. Adopting an n-gram (e.g. bigram) based technique for the essays would likely have improved the accuracy considerably, but calculating n-gram sequences consumes more RAM than our systems had. However, we managed to apply n-grams to the need statements, which lifted the accuracy from 0.51 to 0.53.

The text-based classification model was trained using logistic regression.

Approach 2. Regular features with different algorithms

In a later attempt, we tried a different model by eliminating the essay details and focusing on learning with the other attributes, especially the ones from projects.csv. We trained several models using different sets of attributes. We used the Pandas Python library to manipulate the CSV files and feature vectors. A one-hot data frame (One_Hot_DataFrame) is a convenient way of converting categorical data into numerical attributes and producing feature vectors.

For example, consider the attribute poverty_level = {highest poverty, high poverty, moderate poverty, low poverty}.

A one-hot data frame will create a column to represent each level and put 1 in the column matching the value given for a project, while the other columns are zero (see the sketch below).

E.g.: highest, high, moderate, low → 0, 1, 0, 0
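An illustrative version of this step using pandas.get_dummies (the One_Hot_DataFrame mentioned above is assumed to behave similarly):

import pandas as pd

df = pd.DataFrame({"poverty_level": ["high poverty", "highest poverty", "low poverty"]})
one_hot = pd.get_dummies(df["poverty_level"], prefix="poverty_level")
print(one_hot)
# Each level becomes its own 0/1 column; a row has 1 only in the column
# that matches its poverty_level value.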

The essay length benchmark is 0.54531.

Approach 3. Hybrid approach (Regular Classifier + TF-IDF Classifier)

We built several models using a combination of the above two approaches. TF-IDF vectors obtained from the essay data were appended to the regular feature vector. However, simply concatenating the two vectors degrades the accuracy. The reason is that the higher accuracy obtained from the project features is diluted by the high-dimensional TF-IDF vectors with their lower accuracy, so the overall accuracy falls below expectations.

The problem was the way we merged the two vectors. So in our next attempt we devised a different solution to combine the effects of the TF-IDF vectors and the other feature vectors. We trained two separate models for the TF-IDF and project features. These two models separately output two probability values. We then trained a third model on these 2-valued tuples to output a real number that acts as the overall probability for a given project. The following figure illustrates the model, and a short sketch of the idea follows it. This method worked exceptionally well, and we recorded our highest place on the Kaggle leaderboard with this approach.

Figure 1. Two-tier hierarchical classification model
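A minimal sketch of the two-tier model in Figure 1; X_tfidf_*/X_proj_* stand for the TF-IDF and project feature matrices, y_train for the is_exciting labels, and the use of logistic regression as the combiner is illustrative rather than prescriptive. In practice, training the combiner on out-of-fold first-tier predictions avoids leaking the training labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# First tier: one model per feature type.
tfidf_clf = LogisticRegression().fit(X_tfidf_train, y_train)
proj_clf = GradientBoostingClassifier(n_estimators=100, max_depth=5).fit(X_proj_train, y_train)

# The two first-tier probabilities become the two features of the second-tier model.
stacked_train = np.column_stack([
    tfidf_clf.predict_proba(X_tfidf_train)[:, 1],
    proj_clf.predict_proba(X_proj_train)[:, 1],
])
combiner = LogisticRegression().fit(stacked_train, y_train)

stacked_test = np.column_stack([
    tfidf_clf.predict_proba(X_tfidf_test)[:, 1],
    proj_clf.predict_proba(X_proj_test)[:, 1],
])
final_prob = combiner.predict_proba(stacked_test)[:, 1]  # overall probability per project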

Failed efforts

In TF-IDF vectorization, we assumed nouns would be more important than other words, so we applied POS (part of speech) tagging using NLTK and created a dictionary using only nouns (1000 nouns form the feature vector for TF-IDF). However, this strategy did not improve the results.
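A sketch of this noun-only vocabulary experiment, assuming the NLTK tokenizer and POS-tagger models (punkt, averaged_perceptron_tagger) are downloaded:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def nouns_only(text):
    """Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith("NN")]

vectorizer = TfidfVectorizer(tokenizer=nouns_only, lowercase=True, max_features=1000)
X_nouns = vectorizer.fit_transform(train["essay"].fillna(""))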

Simple concatenation of the TF-IDF vectors with the regular feature vector did not produce better results. So we devised a two-level hierarchical classification model to combine the effect of the two types of vectors.

We thought the derived feature per_student_cost = total_cost / students_reached might increase the accuracy, because it is an indicator of the importance of a project from a student's perspective. But it did not improve the results.

Milestone Submissions

In this section we describe the milestone submissions that we made to Kaggle. Our team, Sapients, made a total of 36 submissions, but here are the critical and important submissions through which we achieved the final results.

Submission 1

In our initial submission, we used all the data posted before 2014-01-01 as our training set, with the following selected attributes (a minimal sketch of this pipeline follows the figure). The model was trained using logistic regression.

'poverty_level',
'primary_focus_area',
'fulfillment_labor_materials',
'total_price_excluding_optional_support',
'students_reached',
'school_year_round',
'secondary_focus_area',
'grade_level',
'eligible_double_your_impact_match',
'teacher_teach_for_america',
'teacher_ny_teaching_fellow',
'eligible_almost_home_match',
'school_magnet',
'resource_type',
'school_charter',
essay_length  # a derived feature with high importance

ROC score = 0.59447

Figure 2. ROC accuracy
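A hedged sketch of the Submission 1 pipeline; data is the merged frame from the data integration step, only a subset of the listed attributes is shown, and the t/f encoding of is_exciting is an assumption based on outcomes.csv.

import pandas as pd
from sklearn.linear_model import LogisticRegression

categorical = ["poverty_level", "primary_focus_area", "school_year_round",
               "grade_level", "resource_type", "school_charter"]
numeric = ["fulfillment_labor_materials", "total_price_excluding_optional_support",
           "students_reached"]

data["essay_length"] = data["essay"].fillna("").str.len()   # derived feature

X = pd.concat([pd.get_dummies(data[categorical].astype(str)),  # one-hot categorical attributes
               data[numeric].fillna(0),
               data[["essay_length"]]], axis=1)
y = data["is_exciting"] == "t"                               # assumed t/f flags in outcomes.csv

train_mask = pd.to_datetime(data["date_posted"]) < "2014-01-01"
clf = LogisticRegression(max_iter=1000).fit(X[train_mask], y[train_mask])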

Submission 2

After studying the time series analysis given in [2], we removed all the data prior to 2010-01-01, because most of the older data is obsolete and hinders mining most of the interesting patterns. Further, most of the older data consists of negative examples, so the algorithms fail to see strong patterns for determining positive examples. Also, we added the month as a separate feature. These changes improved the results by a considerable amount. Through Submission 2 we achieved a ROC score of 0.60128.
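These changes amount to a simple filtering and feature-derivation step on the merged data frame, sketched below (date_posted is a column of projects.csv).

import pandas as pd

# Drop projects posted before 2010-01-01 and add the posting month as a feature.
data["date_posted"] = pd.to_datetime(data["date_posted"])
data = data[data["date_posted"] >= "2010-01-01"]
data["date_posted_month"] = data["date_posted"].dt.month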

Submission 3

Changing the algorithm from logistic regression to gradient tree boosting improved the results to 0.61190. We used gradient tree boosting with n_estimators=100 and a maximum depth of 5.

The following graph shows the improvement in accuracy for each major submission. The vertical axis represents the percentage improvement with respect to the essay length benchmark.

Figure 3. ROC accuracy of the models

As an alternative experiment, we developed a classifier based on a bitmap feature representation. Bitmap feature vectors are used to create lightweight vectors for nominal attributes. These models were trained using the logistic regression algorithm.

The following features were used for bitmap encoding, and the outcome was converted into float values.

primary_focus_area
school_year_round
school_charter
fulfillment_labor_materials
teacher_teach_for_america
school_magnet
school_kipp
grade_level

primary_focus_subject
poverty_level
school_state
secondary_focus_area
school_charter_ready_promise
teacher_prefix
eligible_double_your_impact_match
teacher_ny_teaching_fellow
secondary_focus_subject
eligible_almost_home_match
school_nlns
school_metro
resource_type
date_posted_month

Additionally, the following features, which are floats by default, were used.

total_price_including_optional_support
total_price_excluding_optional_support
support_price = total_price_including_optional_support - total_price_excluding_optional_support
students_reached

The accuracy of the model is given in the following table with respect to the attributes selected in each training iteration. As can be seen, the results are not very impressive, but this effort was helpful in determining which attributes to choose. One reason for these low results may be the variance of the attribute values; normalization would likely have increased the accuracy, but since this was experimental, we decided to keep things simple.
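For completeness, a sketch of the normalization step we skipped, applied to the float-valued columns listed above:

from sklearn.preprocessing import StandardScaler

# Scale the float features (prices, students_reached) to zero mean and unit variance.
float_cols = ["total_price_including_optional_support",
              "total_price_excluding_optional_support", "students_reached"]
data[float_cols] = StandardScaler().fit_transform(data[float_cols].fillna(0))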

In the following table we list the attributes added to the feature vector and the ROC score obtained, using the projects posted before 2013-01-01 as training data and the projects posted after 2013-01-01 and before 2014-01-01 as test data.

| Iteration | Attributes | Accuracy (ROC) | Remarks |
|---|---|---|---|
| 1 | all the features in the bitmap-encoded list | 0.508950 | |
| 2 | primary_focus_area, month, teacher_prefix | 0.518087 | adding resource_type degrades accuracy |
| 3 | primary_focus_area, month, teacher_prefix, school_year_round | 0.519026 | removing month improved the value |
| 4 | primary_focus_area, teacher_prefix, school_year_round | 0.528135 | removing primary_focus_area improved the value |
| 5 | teacher_prefix, school_year_round | 0.535579 | removing school_year_round improved the value |
| 6 | teacher_prefix | 0.538610 | adding grade_level degrades the accuracy |
| 7 | teacher_prefix, fulfillment_labor_materials, teacher_teach_for_america | 0.546426 | fulfillment_labor_materials has no effect; adding poverty_level or school_state degrades the accuracy |
| 8 | teacher_prefix, teacher_teach_for_america, school_charter_ready_promise | 0.546784 | school_charter_ready_promise slightly increased the accuracy |
| 9 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow | 0.547151 | eligible_almost_home_match and school_metro degrade the accuracy |
| 10 | teacher_prefix, teacher_teach_for_america, teacher_ny_teaching_fellow, school_charter_ready_promise, total_price_including_optional_support | 0.552683 | total_price_excluding_optional_support and support_cost degrade the accuracy |

Table 1. Selected attributes and corresponding model accuracies

We then drew diagrams to check whether there is any correlation between these parameters that we had found by trial and error.

Figure 4. Attribute sets vs. accuracy achieved by each attribute set

Different Algorithms

In this project, we used a number of classification algorithms to build the is_exciting classifier. The selection of classification algorithms was based on trial and error. Since we only had 5 submissions per day, we subdivided the training data set again into a training set and a test set to compare the results of the different algorithms. We basically tried:

- Support Vector Machine Classifier
- Logistic Regression Classifier
- Gradient Tree Boosting Classifier

SVM

Support Vector Machines are popular as large-margin classifiers because of their ability to find an optimum decision boundary that separates two classes. However, from an efficiency point of view, they take a lot of time to train. After some initial trials, we decided not to use SVM in this project.
Logistic Regression


We used logistic regression as one of our main algorithms in this project. The logistic regression model was used to train the TF-IDF based classifier. A scikit-learn logistic regression model can be obtained as follows.

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
    fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)

A detailed explanation of the parameters can be found in [4].

Gradient Tree Boosting

Gradient Tree Boosting is an ensemble learning method we used in the project, and it produced a very good model. Gradient tree boosting is based on decision trees. The reason for using an ensemble method is that it builds several base estimators and merges them to produce more generalized results, so ensemble methods often produce better results.

Scikit-learn provides a very good implementation of the Gradient Tree Boosting algorithm [5], which can be invoked as follows.

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1,
                                 random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)  # mean accuracy on the held-out data

The tunable parameters in this algorithm are n_estimators, learning_rate, max_depth and random_state. In general, smaller learning rates produce more accurate models at the expense of training time. The number of base learners can be set by n_estimators, and the size of each tree can be set by the max_depth parameter. However, the values chosen for these parameters are always a tradeoff between accuracy and available resources.

Challenges

As in any data mining project, there were certain challenges that we needed to address. The data files contain so many attributes that we had to eliminate unnecessary data and use a minimal set of important attributes. The reason is that we did not have enough computational resources to process all of the given features. Further, using all the given features without examining their effect would reduce the quality of the model; the models sometimes tend to overfit in the presence of certain feature sets.

Finding proper resources for running the algorithms was another challenge.

Initially we used the entire data set provided by Kaggle to train our models, but the accuracy did not reach the expected levels. Later on, while looking for patterns in the data set, we observed that the percentage of is_exciting projects is very low, so the algorithms fail to extract sufficiently strong patterns to decide which projects are is_exciting. We removed all data prior to 2010-01-01 and used the reduced data set for training, which lifted the accuracy by a considerable amount, as expected.

Improvements

A better dimensionality reduction algorithm could be used to identify a reduced set of attributes. In this project, feature selection was based on simple visualization techniques and intuition.

Ensemble learning is a powerful technique for achieving improved accuracy. We have already exploited an ensemble method available in scikit-learn (gradient tree boosting), but there is enough room to improve the accuracy with more advanced ensemble techniques at the expense of resources and time.

Parameter configuration is also an important aspect of the optimal use of a learning algorithm. We can train with a lower learning rate for higher accuracy and perform other parameter tuning in the algorithms for better results; a minimal sketch of such tuning is given below.
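A minimal sketch of such tuning with scikit-learn's grid search; the grid values are illustrative and were not used in our submissions.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1, 1.0],
    "max_depth": [3, 5],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=3)  # select parameters by cross-validated AUC
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)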

References

[1] KDD Cup 2014 on Kaggle, [online]
http://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data

[2] Time series analysis on KDD Cup 2014 datasets, [online] http://rpubs.com/wacax/21669

[3] Introduction to ROC curves, [online] http://gim.unmc.edu/dxtests/ROC1.htm

[4] Logistic Regression, scikit-learn official documentation, [online]
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[5] Gradient Tree Boosting, scikit-learn official documentation, [online]
http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

[6] Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems), Jiawei Han and Micheline Kamber, September 2000.

