
Hadoop I/O

Why Not Use Java Object Serialization? (p. 101). Serialization with Thrift:

There is limited support for these as MapReduce formats (p. 102):
http://wiki.apache.org/hadoop/Hbase/ThriftApi

SequenceFile can use 'any' serialization framework. In contrast, MapFile can only use Writables.

I heard HBase used to use MapFile. Does it mean we can't use it outside Java?

NO! Thrift or the Stargate REST connector (ALPHA).

A MapFile is an indexed and sorted SequenceFile.
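Roughly, that means you write keys in sorted order and can then look them up through the index. A minimal sketch (not from the book; the path and data are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "numbers.map"; // a MapFile is a directory: a data file plus an index file

    // Keys must be appended in sorted order, since MapFile builds an index on them.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);
    for (int i = 0; i < 100; i++) {
      writer.append(new IntWritable(i), new Text("value-" + i));
    }
    writer.close();

    // Random access by key, served through the index.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new IntWritable(42), value);
    System.out.println(value); // prints value-42
    reader.close();
  }
}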

Map&Reduce:otherlanguages

Unix standard streams as the interface between Hadoop and your program (p. 32). The Java API is geared toward processing your map function one record at a time.

Records are pushed, but it's still possible to consider multiple lines at a time by accumulating previous lines in an instance variable in the Mapper (see the sketch below). Or use the new pull style (p. 25).
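Something like this, presumably (a sketch of the instance-variable trick on the old mapred API; the class name and window size are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MultiLineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static final int WINDOW = 3;                     // lines to consider together
  private final List<String> buffer = new ArrayList<String>();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Each call sees one record, but the mapper instance remembers the previous ones.
    buffer.add(value.toString());
    if (buffer.size() == WINDOW) {
      output.collect(new Text("window"), new Text(buffer.toString()));
      buffer.remove(0);                                    // slide the window by one line
    }
  }
}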

Whereas with Streaming, the map program can decide how to process the input. What's the penalty for that? Data is copied over from the Java process space to the other process space [1]. Is remote debugging (p. 144) possible? I don't think so.


[1]: http://www.cloudera.com/hadoop-training-programming-with-hadoop (at 49'50)

Map&Reduce:C++

Hadoop Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. Implement:

HadoopPipes::Mapper
HadoopPipes::Reducer

Main:

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                                   MaxTemperatureReducer>());
}

% hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input sample.txt \
    -output output \
    -program bin/max_temperature


HDFS concepts

A file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. A file can be larger than any single disk in the network; the file will be spread across more than one node. A block is typically replicated across three physical machines.

Some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster.
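For example (a hypothetical one-liner sketch; the path and factor are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BoostReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask for 10 replicas of this popular file's blocks instead of the default 3.
    fs.setReplication(new Path("/data/popular-file"), (short) 10);
  }
}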

File permissions in HDFS (p. 47). Interfaces:

FUSE, Thrift, C. The FUSE interface allows HDFS to be mounted as a standard filesystem. It makes it possible to use any Unix utility like ls, cat... However, that doesn't mean you should use it as a general-purpose FS.


HDFS Concepts

HDFS stores small files inefficiently [1]

They eat up a lot of the Namenode's memory. However, a small file won't take up any more disk space than is required to store its raw contents.

HDFS allows only sequential writes to an open file, or appends to an already-written file.

[1]: http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
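The usual workaround from [1] is to pack many small files into one container such as a SequenceFile, with the filename as key and the raw bytes as value. A minimal sketch (not from the book; the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One SequenceFile (one Namenode entry) instead of thousands of small files.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/packed/small-files.seq"),
        Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(new Path("/incoming"))) {
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        writer.append(new Text(status.getPath().getName()),
            new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}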

HDFS: client read
The Namenode gives back, for each block, the closest (p. 64) datanodes holding a copy.
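For reference, the client side of a read looks something like this (a sketch; the URI and path are made up). The open() call triggers the namenode lookup above; the stream then reads block data directly from the closest datanodes:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
    FSDataInputStream in = fs.open(new Path("/user/sample.txt"));
    try {
      // Copy the file to stdout; data flows from the datanodes, not the namenode.
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}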

HDFS coherency model
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));

When you sync(), you are guaranteed to see the changes to the FS. With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure. Trade-off between data robustness and throughput.
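The contrasting case (along the lines of the book's coherency-model examples): without sync(), the new content is not guaranteed to be visible to other readers, so the reported length can still be 0:

Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
// No sync(): the write may only exist in buffers, invisible to getFileStatus().
assertThat(fs.getFileStatus(p).getLen(), is(0L));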


Questions

What happens if the Namenode dies? [1]
What if a Datanode fails during a write? (p. 67)
In a job, if one task keeps failing, Hadoop will give up and by default say that the job failed. However, you can specify a quality factor and say that if only 99% of my input is mapped, it's good enough for me. [3]
Since everything is stored as String, how much space are we losing when storing binary data as base64?
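On the base64 question: base64 encodes every 3 bytes as 4 ASCII characters, so the overhead is roughly 33%. A quick check (assuming Apache commons-codec, which ships with Hadoop):

import org.apache.commons.codec.binary.Base64;

public class Base64Overhead {
  public static void main(String[] args) {
    byte[] raw = new byte[3000];
    byte[] encoded = Base64.encodeBase64(raw);
    // Every 3 input bytes become 4 output characters: prints 4000 for 3000 bytes in.
    System.out.println(encoded.length);
  }
}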

config

<property> <name>sizeweight</name> <value>${size},${weight}</value> <description>Sizeandweight</description> </property> Toolinterface MiniDFSClusterandMiniMRCluster:aprogrammaticwayof creatinginprocessclusters.Unlikethelocaljobrunner,these allowtestingagainstthefullHDFSandMapReducemachinery. Bearinmindtoothattasktrackersinaminiclusterlaunch separateJVMstoruntasksin,whichcanmakedebuggingmore difficult.

debugging
// Write diagnostic info to the task's stderr log
System.err.println("Temperature over 100 degrees for input: " + value);
// Update the task's status message, visible in the web UI
reporter.setStatus("Detected possibly corrupt record: see logs.");
// Keep a tally with a custom counter
reporter.incrCounter(Temperature.OVER_100, 1);

debugging

Logs:

System daemons (p. 256)
HDFS audit (p. 280)
MapReduce job history (p. 135)
MapReduce task (p. 143)

Administration

distcp
Balancer
Metrics: JMX / Ganglia

Fault Tolerance: when a Task fails
When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a task fails more than four times, it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks, and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails more than four times (or whatever the maximum number of attempts is configured to), the whole job fails. (p. 160)

If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM that launched it will be killed), so you should take precautions to monitor for this scenario, and kill orphaned processes by some other means.

Currently, Hadoop has no mechanism for dealing with failure of the jobtracker: it is a single point of failure. (p. 161)
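These knobs look something like this on the old mapred JobConf (a sketch; the values are made up, and the last line is the "quality factor" from the Questions slide):

JobConf conf = new JobConf();
conf.setMaxMapAttempts(8);              // mapred.map.max.attempts (default: 4)
conf.setMaxReduceAttempts(8);           // mapred.reduce.max.attempts (default: 4)
conf.setMaxMapTaskFailuresPercent(1);   // job still succeeds if <= 1% of map tasks fail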

Fault tolerance: bad data

When you have a bad record, these are the options:

You can detect the bad record and ignore it. Additionally, you can use a custom counter (see the sketch after this list).
You can abort the job by throwing an exception.
Automatic mechanism for skipping bad records (for when you can't handle the problem, e.g. because there is a bug in a 3rd-party library that you can't work around in your mapper or reducer):
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. The task fails, but the failed record is stored by the tasktracker.
4. Skipping mode is still enabled. The task succeeds by skipping the bad record that failed in the previous attempt.
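A minimal sketch of option 1 (hypothetical parsing code, old mapred API to match the debugging snippet earlier): catch the failure, count it with a custom counter, and move on.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TolerantMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum BadRecords { MALFORMED }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    try {
      int temperature = Integer.parseInt(value.toString().trim());
      output.collect(new Text("temperature"), new IntWritable(temperature));
    } catch (NumberFormatException e) {
      // Ignore the bad record, but keep track of how many we skipped.
      reporter.incrCounter(BadRecords.MALFORMED, 1);
    }
  }
}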

Tuning

It is often a good idea to compress the map output as it is written to disk, since doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer. The amount of memory given to the JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property. You should try to make this as large as possible for the amount of memory on your task nodes; the discussion in "Memory" on page 254 goes through the constraints to consider.
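Both knobs in one place (a sketch using the old mapred JobConf; the codec choice and heap size are made up):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
conf.setCompressMapOutput(true);                    // compress map output on its way to disk
conf.setMapOutputCompressorClass(GzipCodec.class);  // with gzip (pick a codec for your data)
conf.set("mapred.child.java.opts", "-Xmx512m");     // heap for the task JVMs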

Incomplete presentation d:~(

Sorry, this was a presentation that I was making but did not have the time to finish. Nevertheless, I felt like sharing it... Based on the O'Reilly "Hadoop: The Definitive Guide" (06/2009) book.
