
Machine learning to predict San Francisco crime

July 20, 2015
Damien RJ | Case studies

In today's post, we document our submission to the recent Kaggle competition aimed at predicting the category of San Francisco crimes, given only their time and location of occurrence. As a reminder, Kaggle is a site where one can compete with other data scientists on various data challenges. We took this competition as an opportunity to explore the Naive Bayes algorithm. With the few steps discussed below, we were able to quickly move from the middle of the pack to the top 33% on the competition leaderboard, all the while continuing with this simple model!

Introduction
As in all cities, crime is a reality in San Francisco: everyone who lives in San Francisco seems to know someone whose car window has been smashed in, or whose bicycle was stolen within the past year or two. Even Prius car batteries are apparently considered fair game by the city's diligent thieves. The challenge we tackle today involves attempting to guess the class of a crime committed within the city, given the time and location it took place. Such studies are representative of efforts by many police forces today: using machine learning approaches, one can get an improved understanding of which crimes occur where and when in a city; this then allows for better, dynamic allocation of police resources. To aid in the SF challenge, Kaggle has provided about 12 years of crime reports from all over the city, a dataset that is pretty interesting to comb through.
Here, we outline our approach to tackling this problem using the Naive Bayes classifier. This is one of the simplest classification algorithms, the essential ingredients of which include combining Bayes' theorem with an independence assumption on the features (this is the "naive" part). Although simple, it is still a popular method for text categorization. For example, using word frequencies as features, this approach can accurately classify emails as spam, or decide whether a particular piece of text was written by a specific author. In fact, with careful preprocessing, the algorithm is often competitive with more advanced methods, including support vector machines.
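As a reminder of the mechanics (standard textbook form; the original post does not write this out), Naive Bayes scores each class $y$ via

$$ P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y), $$

where the factorization of the likelihood into per-feature terms $P(x_i \mid y)$ is precisely the independence assumption. In the Bernoulli variant used below, each feature $x_i$ is binary, and $P(x_i \mid y)$ is estimated from the observed frequency of $x_i$ within class $y$.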

Loading packages and data
Below, we show the relevant commands needed to load all the packages and training/test data we will be using. As in previous posts, we will work with Pandas for quick and easy data loading and wrangling. We will be having a post dedicated to Pandas in the near future, so stay tuned! We start off by using the parse_dates option of read_csv to convert the Dates column of our provided data (which can be downloaded here) from string to datetime format.
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load data with pandas, and parse the 'Dates' column into datetime
train = pd.read_csv('train.csv', parse_dates=['Dates'])
test = pd.read_csv('test.csv', parse_dates=['Dates'])

The training data provided contains the following fields:
- Dates: date + timestamp
- Category: the type of crime (Larceny, etc.)
- Descript: a more detailed description of the crime
- DayOfWeek: day of crime (Monday, Tuesday, etc.)
- PdDistrict: police department district
- Resolution: the outcome (Arrest, Unfounded, None, etc.)
- Address: street address of the crime
- X and Y: GPS coordinates of the crime
As we mentioned earlier, the provided data spans almost 12 years, and the training and testing datasets each have about 900k records. At this point, we have all the data in memory. However, the majority of this data is categorical in nature, and so will require some more preprocessing.
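As a quick sanity check (our own line, not from the post), the sizes are easy to confirm once the data is loaded:

# Confirm dataset sizes (illustrative; each should print roughly 900k rows)
print(len(train), len(test))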

How to handle categorical data
Many machine learning algorithms, including that which we apply below, will not accept categorical, or text, features. What is the best way to convert such data into numerical values? A natural idea is to convert each unique string to a unique value. For example, in our dataset we might take the crime category value to correspond to one numerical feature, with Larceny set to 1, Homicide to 2, etc. However, this scheme can cause problems for many algorithms, because they will incorrectly assume that nearby numerical values imply some sort of similarity between the underlying categorical values.

To avoid the problem noted above, we will instead binarize our categorical data, using vectors of 1s and 0s. For example, we will write

larceny = 1, 0, 0, 0, ...
homicide = 0, 1, 0, 0, ...
prostitution = 0, 0, 1, 0, ...
...
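As a toy illustration of this encoding (our own example, not from the post):

# Toy example: one-hot encode a small categorical Series
import pandas as pd
s = pd.Series(['larceny', 'homicide', 'prostitution', 'larceny'])
print(pd.get_dummies(s))
# Each row is all 0s except for a single 1 in its category's column.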

There are a variety of methods to do this encoding, but Pandas has a particularly nice method called get_dummies() that can go straight from your column of text to a binarized array. Below, we also convert the crime category labels to integer values using the method LabelEncoder, and use Pandas to extract the hour from each time point. We then convert the districts, weekday, and hour into binarized arrays and combine them into a new dataframe. We then split up the train_data into a training and validation set so that we have a way of assessing the model performance while leaving the test data untouched.

# Convert crime labels to numbers
le_crime = preprocessing.LabelEncoder()
crime = le_crime.fit_transform(train.Category)

# Get binarized weekdays, districts, and hours
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour)

# Build new array
train_data = pd.concat([hour, days, district], axis=1)
train_data['crime'] = crime

# Repeat for test data
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = test.Dates.dt.hour
hour = pd.get_dummies(hour)
test_data = pd.concat([hour, days, district], axis=1)

# Hold out a validation set (a 60/40 split is assumed here)
training, validation = train_test_split(train_data, train_size=.60)

Model development
For this competition, the metric used to rate the performance of the model is the multiclass log loss; smaller values of this loss correspond to improved performance.
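Explicitly (the standard definition, not spelled out in the post), for $N$ samples and $M$ classes,

$$ \text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}, $$

where $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the submitted probability for that assignment; in practice the $p_{ij}$ are clipped away from 0 so the logarithm stays finite.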
First pass
For our first quick pass, we used just the day of the week and district as features in our classifier training. We also carried out a Logistic Regression (LR) on the data in order to get a feel for how the Naive Bayes (NB) model was performing. The NB model gave us a log loss of 2.62, while LR, after tuning, was also only able to give 2.62. However, LR took 60 seconds to run, while NB took only 1.5 seconds! As a reference, the current top score on the leaderboard is about 2.27, while the worst is around 35. Not bad performance!
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
            'Wednesday',  # all seven weekdays
            'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN', 'PARK',
            'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']  # the ten PdDistrict values

training, validation = train_test_split(train_data, train_size=.60)
model = BernoulliNB()
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)

# Logistic Regression for comparison
model = LogisticRegression(C=.01)
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)

Submission code
model = BernoulliNB()
model.fit(train_data[features], train_data['crime'])
predicted = model.predict_proba(test_data[features])

# Write results ('Id' is the index label expected by Kaggle's submission format)
result = pd.DataFrame(predicted, columns=le_crime.classes_)
result.to_csv('testResult.csv', index=True, index_label='Id')

With the above model performing well, we used our code to write out our predictions on the test set to csv format, and submitted this to Kaggle. It turns out we got a score of 2.61, which is slightly better than our validation set estimate. This was a good enough score to put us in the top 50%. Pretty good for a first try!
Second pass
To improve the model further, we next added the time of day to the feature list used in training. This clearly provides some relevant information, as some types of crime happen more during the day than the night; for example, we expect public drunkenness to go up in the late evening. Adding this feature, we were able to push our log loss score down to 2.58, quick and easy progress! As a side note, we also tried leaving the hours as a continuous variable, but this did not lead to any score improvements (a sketch of that variant follows the code block below). After training on the whole dataset again, we also got 2.58 on the test data. This moved us up another 32 spots, giving a final placement of 76/226!
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
            'Wednesday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN',
            'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']

features2 = [x for x in range(0, 24)]
features = features + features2
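For completeness, here is a minimal sketch of the continuous-hour variant mentioned above (our own reconstruction; the original post does not show this code, and the column name hour_cont is ours):

# Continuous-hour variant (illustrative sketch, not from the original post):
# keep the raw hour as a single numeric column instead of 24 dummies.
train_data['hour_cont'] = train.Dates.dt.hour
test_data['hour_cont'] = test.Dates.dt.hour
base = [f for f in features if not isinstance(f, int)]  # drop the 24 integer hour-dummy columns
features_cont = base + ['hour_cont']

One caveat worth noting: BernoulliNB binarizes its inputs (at threshold 0 by default), so a raw hour column is largely wasted on it; a fairer test of this variant is a model that handles continuous features natively, such as the LogisticRegression used for comparison above.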

Discussion
Although Naive Bayes is a fairly simple model, properly wielded it can give great results. In fact, in this competition our results were competitive with teams who were using much more complicated models, e.g., neural nets. We also learned a few other interesting things here: for example, Pandas' get_dummies() method looks like it will be a huge time saver when dealing with categorical data. Till next time, keep your Prius safe!
