
Machine learning to predict San Francisco crime

July 20, 2015
Damien RJ | Case studies

In today's post, we document our submission to the recent Kaggle competition aimed at predicting the category of San Francisco crimes, given only their time and location of occurrence. As a reminder, Kaggle is a site where one can compete with other data scientists on various data challenges. We took this competition as an opportunity to explore the Naive Bayes algorithm. With the few steps discussed below, we were able to quickly move from the middle of the pack to the top 33% on the competition leaderboard, all the while continuing with this simple model!

Introduction
As in all cities, crime is a reality in San Francisco: everyone who lives in San Francisco seems to know someone whose car window has been smashed in, or whose bicycle was stolen within the past year or two. Even Prius car batteries are apparently considered fair game by the city's diligent thieves. The challenge we tackle today involves attempting to guess the class of a crime committed within the city, given the time and location it took place. Such studies are representative of efforts by many police forces today: using machine learning approaches, one can get an improved understanding of which crimes occur where and when in a city; this then allows for better, dynamic allocation of police resources. To aid in the SF challenge, Kaggle has provided about 12 years of crime reports from all over the city, a dataset that is pretty interesting to comb through.
Here, we outline our approach to tackling this problem using the Naive Bayes classifier. This is one of the simplest classification algorithms, the essential ingredients of which include combining Bayes' theorem with an independence assumption on the features (this is the "naive" part). Although simple, it is still a popular method for text categorization. For example, using word frequencies as features, this approach can accurately classify emails as spam, or decide whether a particular piece of text was written by a specific author. In fact, with careful preprocessing, the algorithm is often competitive with more advanced methods, including support vector machines.
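As a reminder of the mechanics (standard textbook form; the original post does not write this out), Naive Bayes scores each class $y$ via

$$ P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y), $$

where the factorization of the likelihood into per-feature terms $P(x_i \mid y)$ is precisely the independence assumption. In the Bernoulli variant used below, each feature $x_i$ is binary, and $P(x_i \mid y)$ is estimated from the observed frequency of $x_i$ within class $y$.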

Loading packages and data
Below, we show the relevant commands needed to load all the packages and training/test data we will be using. As in previous posts, we will work with Pandas for quick and easy data loading and wrangling. We will be having a post dedicated to Pandas in the near future, so stay tuned! We start off by using the parse_dates option of read_csv to convert the Dates column of our provided data (which can be downloaded here) from string to datetime format.
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load data with pandas, and parse the 'Dates' column into datetime
train = pd.read_csv('train.csv', parse_dates=['Dates'])
test = pd.read_csv('test.csv', parse_dates=['Dates'])

The training data provided contains the following fields:
- Dates: date + timestamp
- Category: the type of crime (Larceny, etc.)
- Descript: a more detailed description of the crime
- DayOfWeek: day of crime (Monday, Tuesday, etc.)
- PdDistrict: police department district
- Resolution: the outcome (Arrest, Unfounded, None, etc.)
- Address: street address of the crime
- X and Y: GPS coordinates of the crime
As we mentioned earlier, the provided data spans almost 12 years, and the training and testing datasets each have about 900k records. At this point, we have all the data in memory. However, the majority of this data is categorical in nature, and so will require some more preprocessing.
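As a quick sanity check (our own line, not from the post), the sizes are easy to confirm once the data is loaded:

# Confirm dataset sizes (illustrative; each should print roughly 900k rows)
print(len(train), len(test))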

How to handle categorical data
Many machine learning algorithms, including that which we apply below, will not accept categorical, or text, features. What is the best way to convert such data into numerical values? A natural idea is to convert each unique string to a unique value. For example, in our dataset we might take the crime category value to correspond to one numerical feature, with Larceny set to 1, Homicide to 2, etc. However, this scheme can cause problems for many algorithms, because they will incorrectly assume that nearby numerical values imply some sort of similarity between the underlying categorical values.

To avoid the problem noted above, we will instead binarize our categorical data, using vectors of 1s and 0s. For example, we will write

larceny = 1, 0, 0, 0, ...
homicide = 0, 1, 0, 0, ...
prostitution = 0, 0, 1, 0, ...
...
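As a toy illustration of this encoding (our own example, not from the post):

# Toy example: one-hot encode a small categorical Series
import pandas as pd
s = pd.Series(['larceny', 'homicide', 'prostitution', 'larceny'])
print(pd.get_dummies(s))
# Each row is all 0s except for a single 1 in its category's column.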

There are a variety of methods to do this encoding, but Pandas has a particularly nice method called get_dummies() that can go straight from your column of text to a binarized array. Below, we also convert the crime category labels to integer values using the method LabelEncoder, and use Pandas to extract the hour from each time point. We then convert the districts, weekday, and hour into binarized arrays and combine them into a new dataframe. We then split up the train_data into a training and validation set so that we have a way of assessing the model performance while leaving the test data untouched.

# Convert crime labels to numbers
le_crime = preprocessing.LabelEncoder()
crime = le_crime.fit_transform(train.Category)

# Get binarized weekdays, districts, and hours
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour)

# Build new array
train_data = pd.concat([hour, days, district], axis=1)
train_data['crime'] = crime

# Repeat for test data
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = test.Dates.dt.hour
hour = pd.get_dummies(hour)
test_data = pd.concat([hour, days, district], axis=1)

# Hold out a validation set (a 60/40 split is assumed here)
training, validation = train_test_split(train_data, train_size=.60)

Model development
For this competition, the metric used to rate the performance of the model is the multiclass log loss; smaller values of this loss correspond to improved performance.
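Explicitly (the standard definition, not spelled out in the post), for $N$ samples and $M$ classes,

$$ \text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}, $$

where $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the submitted probability for that assignment; in practice the $p_{ij}$ are clipped away from 0 so the logarithm stays finite.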
First pass
For our first quick pass, we used just the day of the week and district as features in our classifier training. We also carried out a Logistic Regression (LR) on the data in order to get a feel for how the Naive Bayes (NB) model was performing. The NB model gave us a log loss of 2.62, while LR, after tuning, was also only able to give 2.62. However, LR took 60 seconds to run, while NB took only 1.5 seconds! As a reference, the current top score on the leaderboard is about 2.27, while the worst is around 35. Not bad performance!
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
            'Wednesday',  # all seven weekdays
            'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN', 'PARK',
            'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']  # the ten PdDistrict values

training, validation = train_test_split(train_data, train_size=.60)
model = BernoulliNB()
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)

# Logistic Regression for comparison
model = LogisticRegression(C=.01)
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)

Submission code
model = BernoulliNB()
model.fit(train_data[features], train_data['crime'])
predicted = model.predict_proba(test_data[features])

# Write results ('Id' is the index label expected by Kaggle's submission format)
result = pd.DataFrame(predicted, columns=le_crime.classes_)
result.to_csv('testResult.csv', index=True, index_label='Id')

With the above model performing well, we used our code to write out our predictions on the test set to csv format, and submitted this to Kaggle. It turns out we got a score of 2.61, which is slightly better than our validation set estimate. This was a good enough score to put us in the top 50%. Pretty good for a first try!
Second pass
To improve the model further, we next added the time of day to the feature list used in training. This clearly provides some relevant information, as some types of crime happen more during the day than the night; for example, we expect public drunkenness to go up in the late evening. Adding this feature, we were able to push our log loss score down to 2.58, quick and easy progress! As a side note, we also tried leaving the hours as a continuous variable, but this did not lead to any score improvements (a sketch of that variant follows the code block below). After training on the whole dataset again, we also got 2.58 on the test data. This moved us up another 32 spots, giving a final placement of 76/226!
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
            'Wednesday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN',
            'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']

features2 = [x for x in range(0, 24)]
features = features + features2
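For completeness, here is a minimal sketch of the continuous-hour variant mentioned above (our own reconstruction; the original post does not show this code, and the column name hour_cont is ours):

# Continuous-hour variant (illustrative sketch, not from the original post):
# keep the raw hour as a single numeric column instead of 24 dummies.
train_data['hour_cont'] = train.Dates.dt.hour
test_data['hour_cont'] = test.Dates.dt.hour
base = [f for f in features if not isinstance(f, int)]  # drop the 24 integer hour-dummy columns
features_cont = base + ['hour_cont']

One caveat worth noting: BernoulliNB binarizes its inputs (at threshold 0 by default), so a raw hour column is largely wasted on it; a fairer test of this variant is a model that handles continuous features natively, such as the LogisticRegression used for comparison above.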

Discussion
Although Naive Bayes is a fairly simple model, properly wielded it can give great results. In fact, in this competition our results were competitive with teams who were using much more complicated models, e.g., neural nets. We also learned a few other interesting things here: for example, Pandas' get_dummies() method looks like it will be a huge time saver when dealing with categorical data. Till next time, keep your Prius safe!
