
Intro to Hadoop and MapReduce

Lesson 3 Notes

Introduction

In this lesson we're going to take a look at the actual code we ran in the last
lesson. Then you'll write your own MapReduce code.

If you recall, the code found the total sales per store. Typically, the data would
have come from database tables and we'd use Sqoop to import it into HDFS.

Input Data

So first let's take a closer look at the input data format. Remember that each Mapper
processes a portion of the input data, and each one will be given a line at a time. The lines
look like this:


The Mapper needs to take that line and extract the information it needs. Often when we're
dealing with text it's pretty free-form, so we'd use something like a regular expression. But in
this case, it's nice and regular: it's tab-delimited. So we can just split the line based on tabs
and extract the values for all fields.
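
As a minimal sketch of that split, the snippet below breaks one tab-delimited line into its fields. The sample record and the six field names are assumptions made up for illustration; they are not taken from the lesson's data file.

```python
# Hypothetical sample record: the field names and values are assumptions
# made for illustration only.
sample_line = "2012-01-01\t09:00\tMiami\tBooks\t12.50\tVisa\n"

# strip() removes the trailing newline (and surrounding whitespace);
# split("\t") then breaks the line into a list of fields at each tab.
fields = sample_line.strip().split("\t")

# Unpack the list into named variables (assumed field order).
date, time, store, item, cost, payment = fields

print(len(fields))
print(store, cost)
```

Note that every field comes back as a string, so the cost would still need converting to a number before any arithmetic.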

Quiz: How To Find Total Sales?

Question: You've been asked to calculate the total sales for each store. What should you use
as the intermediate key and value?

Key               Value
[ ] time          store name
[ ] cost          store name
[ ] store name    cost
[ ] store name    item description

Copyright 2014 Udacity, Inc. All Rights Reserved.


Answer:

The correct answer is the store name as the key and the cost (the sale amount) as the value.

Defensive Mapper Code

Here's our Mapper code. Let's look at it line by line. We're going to loop around standard
input, which will give us a line at a time. Of course, the line will have a newline character
at the end of it, so let's strip out that, plus any other whitespace around the line, and since
our line is tab-delimited, we can split it at the same time. That gives us an array, which
we'll call data.

Programming Quiz: Finish The Mapper

As you use Hadoop more and more, you'll discover that the more data you have, the more
likely you are to encounter weirdness in that data. Lines will be malformed, there will be
strange log messages in the data; you're going to come across every strange edge case
you can imagine, and plenty that you can't. So here, you should make sure that no matter
what kind of malformed line the file has, the mapper can continue working. You wouldn't want
your 2 TB processing job to die partway through.

So, we'd like you to add some defensive programming to make sure things don't break if you
get a strange line in the middle of your data.

Answer:

In this case, we're just checking that the line actually has six fields. If it doesn't, we'll just
ignore that line. But another good thing to do would be to check that the cost is actually a valid
number.

Assuming that the line is OK, we'll simply write our intermediate data out in the form of the
key, then a tab, then the value. And then we loop back and read the next line from our input
file.
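
The defensive checks described above could be sketched roughly as follows. This is one possible version, not necessarily the lesson's exact code: the function name map_lines, the assumed field order, and the sample records are all made up for illustration. It skips any line that does not have six fields, and also applies the suggested extra check that the cost parses as a number.

```python
EXPECTED_FIELDS = 6  # the lesson says a well-formed line has six fields


def map_lines(lines):
    """Yield 'store<TAB>cost' for every well-formed input line."""
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != EXPECTED_FIELDS:
            continue  # malformed line: ignore it rather than crash the job
        # Assumed field order; only the store and the cost are used.
        date, time, store, item, cost, payment = data
        try:
            float(cost)  # extra check: the cost must be a valid number
        except ValueError:
            continue
        yield "{0}\t{1}".format(store, cost)


# Hypothetical input: one good line, one junk line, one line with a bad cost.
sample = [
    "2012-01-01\t09:00\tMiami\tBooks\t12.50\tVisa\n",
    "some stray log message\n",
    "2012-01-01\t09:01\tNYC\tToys\tabc\tCash\n",
]
for out in map_lines(sample):
    print(out)
```

In an actual Hadoop Streaming job the same function would be fed sys.stdin instead of the in-memory sample list.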


Quiz: What Happens Between Mapper And Reducer?

Once the Mapper is done, the Hadoop framework passes the intermediate data to the
Reducers. What's the process called that happens between Mappers and Reducers?

[ ] Bubble sort
[ ] Shuffle and sort
[ ] Find and Shuffle
[ ] Quicksort

Answer:

The process is called the Shuffle and Sort. It ensures that the values for any particular key are
collected together, and sends the keys and their lists of values to the Reducer.
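
To make the effect of the Shuffle and Sort concrete, here is a small in-memory imitation. It is only a sketch of the outcome (values collected per key), not of how Hadoop does it internally, and the records are hypothetical.

```python
from itertools import groupby

# Hypothetical intermediate (store, cost) records emitted by several Mappers.
mapper_output = [
    ("Miami", 12.34),
    ("NYC", 5.00),
    ("Miami", 99.07),
    ("NYC", 1.25),
]


def by_key(record):
    return record[0]


# Sorting by key makes equal keys adjacent; groupby then collects each run
# of records that share a key into one list of values.
grouped = {
    key: [value for _, value in records]
    for key, records in groupby(sorted(mapper_output, key=by_key), key=by_key)
}
print(grouped)
```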

Reducer

In our case, we only have a single Reducer, because that's the Hadoop default, so it will get
all the keys. If we had specified more than one Reducer, each would receive some of the keys,
along with all the values from all the Mappers for those keys.
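
One way to picture how keys could be divided among several Reducers is to hash each key and take the result modulo the number of Reducers, so that every record for a given key lands on the same Reducer. The sketch below illustrates that idea only; the helper reducer_for and the use of crc32 as the hash are assumptions for illustration and are not the hash Hadoop itself uses.

```python
import zlib


def reducer_for(key, num_reducers):
    """Pick a Reducer index for a key; equal keys always get the same index."""
    # crc32 is a stand-in hash, chosen here because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Hypothetical keys arriving from several Mappers.
stores = ["Miami", "NYC", "Miami", "Boston", "NYC"]

assignment = {}
for store in stores:
    assignment.setdefault(reducer_for(store, 2), set()).add(store)

# Each distinct store name appears under exactly one Reducer index.
print(assignment)
```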


We're using Hadoop Streaming here, because we're writing our code in Python. Hadoop
Streaming allows you to write your Mappers and Reducers in pretty much any language, rather
than forcing you to use Java.

But the way the Reducer gets the data is a little tricky to deal with. It's going to get
the data coming in something like this.


Quiz: What Variables Do We Need To Keep Track Of?

As you can see, the data comes in as a stream of lines, each containing a store name and
cost. The store names are sorted, which we're guaranteed because of the shuffle and sort, so
we know that all the lines for, say, Miami will appear one after the other. So, what variables do
we need to keep track of to calculate the sales per store, based on how the data is
appearing?

[ ] previous sale
[ ] current sale
[ ] total sales per store
[ ] previous store
[ ] current store name
[ ] all store names

Answer:

So when we code the Reducer, we're going to need to keep track of the keys. When the key
changes, we know we've received all the data from the previous key, so we can then write out
the final result for that previous key, and in our case, we'll have been adding up all the
values for that key.

Reducer Code

So here's the Reducer code. Let's step through it.

We'll start by setting a couple of variables up. salesTotal is what we'll use to keep the running
total. We initialize that to zero. And since we haven't read any data in yet, we haven't had any
keys, so oldKey is initialized to None.

Then we start reading from standard input. Each line will contain a key, a tab, and a
value. In our case, a store name, a tab, and one of the sales from that store.

So, we strip off the newline character at the end of the line and split the line based on
the tab. That should give us exactly two items, which we'll store in the data array.

If we don't have two items, something strange has happened, so we'll skip that line of input,
although that should in fact never be the case, since we know our Mappers are writing the
data out in this format.


Now we'll pull the two elements of the array out into named variables for clarity. thisKey will
hold the store name, thisValue the sale amount.

Now here's the tricky part. We want to know if the key has changed since the last one we
read. So we check to see if oldKey is even set (because if it's not, then this will be the first
line we've read) and, if it is, we see if it's different to the key we just read in.

If that's true, then the key has just changed. In the example we just looked at, we'd have
read all the Miami lines and now we've just received a New York key. So we need to write out
the data for the previous key. We do that by writing that key, a tab, and the running total we've
been keeping. That data will be written to a file in HDFS by the Hadoop framework.

Once we've done that, we set the salesTotal back to zero since we're now dealing with a new
store.

OK, now we can actually process the data we've just read. We set oldKey up with the
contents of the key, and then add the value to our running total.

And then we loop back and do the whole thing again.

Eventually, we'll run out of data to process, which takes us out of the loop.


Question: Are We Done?

Do you think the Reducer is finished at this point?

[ ] Yes, it's finished
[ ] No, another process needs to be run on the output
[ ] No, the last key has not yet been output

Answer:

Be careful! When we exit the loop, we haven't yet output the data for the last key we've been
tracking. That's what the last two lines are for: we write out the key and value for the last store
we've processed. If we didn't have those lines, we wouldn't write out data for that last store.
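
Putting the Reducer walkthrough together, the logic might look something like the sketch below. It reuses the variable names mentioned in the walkthrough (salesTotal, oldKey, thisKey, thisValue), but the function name reduce_lines and the sample input are assumptions for illustration, so treat it as a reconstruction rather than the lesson's exact code.

```python
def reduce_lines(lines):
    """Yield 'store<TAB>total' for each run of identical keys in sorted input."""
    salesTotal = 0.0
    oldKey = None  # no key has been seen yet

    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # something strange has happened: skip this line

        thisKey, thisValue = data

        # The key has changed: emit the finished total for the previous key.
        if oldKey is not None and oldKey != thisKey:
            yield "{0}\t{1}".format(oldKey, salesTotal)
            salesTotal = 0.0

        oldKey = thisKey
        salesTotal += float(thisValue)

    # After the loop, the total for the last key is still pending.
    if oldKey is not None:
        yield "{0}\t{1}".format(oldKey, salesTotal)


# Hypothetical, already-sorted Reducer input.
sample = ["Miami\t10.5\n", "Miami\t2.0\n", "NYC\t3.25\n"]
results = list(reduce_lines(sample))
print("\n".join(results))
```

Removing the final two lines of the function would silently drop the NYC total in this example, which is exactly the trap the quiz points at.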

Now let's have Ian talk a little about testing the code, and then actually run it on our cluster.

Putting It All Together

So that's the code. One of the nice things about using Hadoop Streaming is that it's easy to
test our code outside of Hadoop. Let's see how to do that. Our Mapper takes data in from
standard input, and writes its results to standard output, so we can just run it from the
command line and type data in to test it. Or, even better, we can build just a small sample
data file and pipe that to the Mapper. Let's do that. Here we have a very small file, just 10 or
so lines. So to test the Mapper, we can just do this: cat testfile | ./mapper.py

And there's our result: store names and sales. Excellent! If we had problems, we could go
back and edit the Mapper until it worked, and it's really nice and quick to do this without
needing to run it via Hadoop every time.


We can do a similar thing with the Reducer. It's expecting a set of lines which look like
store name, tab, value, so, again, we can create a sample file which looks like that and pass it
in. But even nicer, we can test the entire pipeline. Remember that the Mapper's output is
sorted by the Hadoop framework and then passed to the Reducer. So we can simulate the
entire thing on the command line like this, using the Unix sort utility in between the Mapper
and Reducer:

...and there's our output, exactly as we'd expect. So now that we've tested on the command
line, we can now test it on the cluster. Best practice when you're developing MapReduce jobs
is to first test with a small dataset before you run your code on your entire, huge set of data,
but we're pretty confident here, so let's just run the thing on our whole purchases.txt file.
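
The same map, sort, reduce sequence can also be imitated inside a single Python process, which is handy for quick unit tests. The sketch below is an illustration under assumptions: the function names, the field positions, and the test records are all hypothetical, and Python's sorted() stands in for both the Unix sort utility and Hadoop's own sorting.

```python
def map_lines(lines):
    """Emit 'store<TAB>cost' for each line that has six tab-separated fields."""
    for line in lines:
        data = line.strip().split("\t")
        if len(data) == 6:
            store, cost = data[2], data[4]  # assumed field positions
            yield "{0}\t{1}".format(store, cost)


def reduce_lines(lines):
    """Sum the values for each run of identical keys in sorted input."""
    total, old_key = 0.0, None
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        key, value = data
        if old_key is not None and old_key != key:
            yield "{0}\t{1}".format(old_key, total)
            total = 0.0
        old_key = key
        total += float(value)
    if old_key is not None:
        yield "{0}\t{1}".format(old_key, total)


# Hypothetical test file contents, including one malformed line.
test_lines = [
    "2012-01-01\t09:00\tNYC\tBooks\t3.25\tCash",
    "2012-01-01\t09:01\tMiami\tToys\t10.5\tVisa",
    "this line is malformed",
    "2012-01-01\t09:02\tMiami\tMusic\t2.0\tVisa",
]

# sorted() plays the part of the sort step between Mapper and Reducer.
output = list(reduce_lines(sorted(map_lines(test_lines))))
print("\n".join(output))
```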


We'll use our alias hs to cut down on our typing. Here's our Mapper... and our Reducer,
and the files arguments we also have to specify. Our input directory is myinput, and we'll tell
Hadoop to write the results to output2. Remember, the output directory must not already exist
or the job will fail.

Off it goes. On this pseudo-distributed cluster we can only run two Mappers simultaneously,
and it turns out that we need four to process the entire dataset because of its size. So two will
run, then when they've finished the next two will start. Once they're done the Reducers will
then begin. It turns out, you can watch this happening via a Web-based user interface that
Hadoop gives us. You point your Web browser at the JobTracker, which on our machine is
just localhost, on port 50030. Here you can see that there's one running job, and when we
click on it we can see the Mappers and Reducers running.

It gives us a ton of other interesting information. One key thing is that if a Mapper or Reducer
fails, you can actually drill down and view the logs from that particular piece of code.

OK, it's done. So let's take a look at the output, and there it is. Our sales, totalled by store,
just as we'd expected.

