Why Not Use Java Object Serialization? (p. 101)
Serialization with Thrift:
There is limited support for these as MapReduce formats (p. 102):
http://wiki.apache.org/hadoop/Hbase/ThriftApi
SequenceFile can use 'any' serialization framework. In contrast, MapFile can only use Writables.
I heard HBase used to use MapFile. Does it mean we can't use it outside Java?
NO! Thrift or the Stargate REST Connector (ALPHA)
A MapFile is an indexed and sorted SequenceFile.
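To make "indexed and sorted" concrete, here is a minimal plain-Java sketch of the idea: entries are appended in key order, every Nth key goes into a small in-memory index, and a lookup jumps to the nearest indexed key and scans forward. This is not Hadoop's MapFile implementation; the class name, method names and the interval of 4 are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a sorted file with a sparse index.
// Not Hadoop's MapFile; all names here are hypothetical.
final class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();   // "data file": keys in sorted order
    private final List<String> values = new ArrayList<>(); // values, parallel to keys
    private final TreeMap<String, Integer> index = new TreeMap<>(); // every Nth key -> position
    private final int interval;

    SparseIndexSketch(int interval) {
        this.interval = interval;
    }

    // Entries must be appended in ascending key order, as in a sorted file.
    void append(String key, String value) {
        if (!keys.isEmpty() && keys.get(keys.size() - 1).compareTo(key) >= 0) {
            throw new IllegalArgumentException("keys must be strictly ascending: " + key);
        }
        if (keys.size() % interval == 0) {
            index.put(key, keys.size()); // remember where this key lives
        }
        keys.add(key);
        values.add(value);
    }

    // Jump to the nearest indexed key at or before the target, then scan forward.
    String get(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // target sorts before the first key
        }
        for (int i = start.getValue(); i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) return values.get(i);
            if (cmp > 0) return null; // passed the spot where the key would be
        }
        return null;
    }

    int indexSize() {
        return index.size();
    }

    public static void main(String[] args) {
        SparseIndexSketch file = new SparseIndexSketch(4);
        for (int i = 0; i < 10; i++) {
            file.append(String.format("key%02d", i), "value" + i);
        }
        System.out.println(file.get("key07"));
        System.out.println("index entries: " + file.indexSize());
    }
}
```

The index stays small because it holds only a fraction of the keys, at the cost of a short scan on each lookup.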
Map & Reduce: other languages
Map & Reduce: C++
HDFS concepts
Some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster.
File permissions in HDFS (p. 47)
Interfaces:
HDFS concepts
HDFS stores small files inefficiently [1]
HDFS allows only sequential writes to an open file, or appends to an already written file.
[1]: http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
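A back-of-the-envelope calculation shows why small files hurt. The sketch below assumes the rule of thumb from the blog post cited as [1], roughly 150 bytes of Namenode memory per file, directory or block object; that figure is approximate, and the class and method names are hypothetical.

```java
// Rough estimate of Namenode heap used by file and block objects.
// The 150-byte figure is a rule of thumb, not an exact measurement.
final class NamenodeMemoryEstimate {
    static final long BYTES_PER_OBJECT = 150L;

    // Each file costs one file object plus one object per block it occupies.
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Ten million small files, each fitting in a single block.
        long bytes = estimateBytes(10_000_000L, 1);
        System.out.println(bytes / 1_000_000_000.0 + " GB");
    }
}
```

Under these assumptions ten million single-block files need about 3 GB of Namenode memory, regardless of how little data each file actually holds.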
HDFS: client read
For each block, the Namenode returns the Datanodes that hold a copy, sorted by proximity to the client (p. 64).
HDFS coherency model
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
Questions
What happens if the Namenode dies? [1]
What if a Datanode fails during a write? (p. 67)
In a job, if one task keeps failing, Hadoop will give up and by default report that the job failed. However, you can specify a quality factor and say that if only 99% of the input is mapped, that is good enough. [3]
Since everything is stored as a String, how much space are we losing when storing binary data as base64?
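The base64 question can be answered with a few lines of arithmetic: base64 maps every 3 input bytes to 4 output characters, padding the last group. The sketch below uses the JDK's `java.util.Base64` to measure this; the class and method names are hypothetical, and it says nothing about how any particular Hadoop format stores its data.

```java
import java.util.Base64;

// Measures how much larger binary data becomes once base64-encoded.
final class Base64Overhead {
    // Length in characters of the base64 form of rawBytes bytes of input.
    static int encodedLength(int rawBytes) {
        return Base64.getEncoder().encodeToString(new byte[rawBytes]).length();
    }

    public static void main(String[] args) {
        for (int size : new int[] {1, 3, 300, 1000}) {
            int encoded = encodedLength(size);
            System.out.println(size + " bytes -> " + encoded + " chars, ratio "
                    + (double) encoded / size);
        }
    }
}
```

For inputs of any real size the ratio settles at 4/3, so the encoded form is about 33% larger than the raw bytes.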
config
<property>
  <name>sizeweight</name>
  <value>${size},${weight}</value>
  <description>Size and weight</description>
</property>
Tool interface
MiniDFSCluster and MiniMRCluster: a programmatic way of creating in-process clusters. Unlike the local job runner, these allow testing against the full HDFS and MapReduce machinery. Bear in mind too that tasktrackers in a mini cluster launch separate JVMs to run tasks in, which can make debugging more difficult.
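The `${size},${weight}` value in the property above relies on variable expansion: references to other properties are substituted when the value is read. The following is a simplified plain-Java sketch of that idea, not Hadoop's `Configuration` class; the class name is hypothetical, the sample values `10` and `heavy` are assumptions for illustration, and the sketch does a single pass without nested expansion or system properties.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified ${name} substitution over a map of properties.
// Illustration only; not Hadoop's Configuration implementation.
final class VariableExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String raw, Map<String, String> props) {
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.get(m.group(1));
            // Unknown variables are left exactly as written.
            String replacement = (value != null) ? value : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("size", "10");      // assumed sample value
        props.put("weight", "heavy"); // assumed sample value
        System.out.println(expand("${size},${weight}", props)); // prints "10,heavy"
    }
}
```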
debugging
System.err.println("Temperature over 100 degrees for input: " + value);
reporter.setStatus("Detected possibly corrupt record: see logs.");
reporter.incrCounter(Temperature.OVER_100, 1);
debugging
Logs:
Administration
Fault tolerance: when a task fails
When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a task fails four times, it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks, and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails four times (or whatever the maximum number of attempts is configured to), the whole job fails. (p. 160)
If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM that launched it will be killed), so you should take precautions to monitor for this scenario, and kill orphaned processes by some other means.
Currently, Hadoop has no mechanism for dealing with failure of the jobtracker; it is a single point of failure. (p. 161)
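The rescheduling rule described above can be modelled in a few lines. This is a toy simulation, not jobtracker code: the class, the method names and the tracker names `tt1` to `tt3` are hypothetical, and the fallback used when every tracker has already failed is an assumption made so that the sketch is complete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Toy model of task rescheduling; not Hadoop code, all names are hypothetical.
final class TaskRetrySketch {
    // Mirrors the default limit of four attempts.
    static final int DEFAULT_MAX_ATTEMPTS = 4;

    // Runs attempts until one succeeds or the limit is reached.
    // Returns the tracker that succeeded, or null if the task (and so the job) fails.
    static String runTask(List<String> trackers, Predicate<String> succeedsOn,
                          int maxAttempts, List<String> attemptLog) {
        Set<String> failedOn = new HashSet<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String tracker = pickTracker(trackers, failedOn, attempt);
            attemptLog.add(tracker);
            if (succeedsOn.test(tracker)) {
                return tracker;
            }
            failedOn.add(tracker);
        }
        return null;
    }

    // Prefer a tracker where the task has not failed yet.
    // Assumption: if all have failed, fall back to a simple rotation.
    static String pickTracker(List<String> trackers, Set<String> failedOn, int attempt) {
        for (String t : trackers) {
            if (!failedOn.contains(t)) return t;
        }
        return trackers.get(attempt % trackers.size());
    }

    public static void main(String[] args) {
        List<String> trackers = Arrays.asList("tt1", "tt2", "tt3");
        List<String> log = new ArrayList<>();
        String winner = runTask(trackers, t -> t.equals("tt3"), DEFAULT_MAX_ATTEMPTS, log);
        System.out.println("succeeded on " + winner + " after attempts on " + log);
    }
}
```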
Fault tolerance: bad data
When you have a bad record, these are the options:
You can detect the bad record and ignore it. Additionally, you can use a custom counter.
You can abort the job by throwing an exception.
Automatic mechanism for skipping bad records (for when you can't handle the problem because there is a bug in a third-party library that you can't work around in your mapper or reducer):
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but the failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed in the previous attempt.
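The four-step sequence above can be traced with a small simulation. This is a simplified model rather than Hadoop's skipping implementation: the class and method names are hypothetical, `parse` stands in for the buggy third-party code, and the sketch records one bad record per failed attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of skipping mode; not Hadoop code, all names are hypothetical.
final class SkippingModeSketch {
    // Skipping mode switches on after this many failed attempts.
    static final int FAILURES_BEFORE_SKIPPING = 2;

    // Stand-in for library code that throws on input it cannot handle.
    static int parse(String record) {
        return Integer.parseInt(record);
    }

    // Returns the attempt number that succeeded, or -1 if all attempts failed.
    // On success, output holds the results for every record that was not skipped.
    static int run(List<String> records, int maxAttempts, List<Integer> output) {
        Set<Integer> badRecords = new HashSet<>(); // remembered across attempts
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean skipping = attempt > FAILURES_BEFORE_SKIPPING;
            output.clear();
            int current = 0;
            try {
                for (current = 0; current < records.size(); current++) {
                    if (skipping && badRecords.contains(current)) {
                        continue; // skip a record that failed in an earlier attempt
                    }
                    output.add(parse(records.get(current)));
                }
                return attempt;
            } catch (RuntimeException e) {
                if (skipping) {
                    badRecords.add(current); // store the record that caused the failure
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> output = new ArrayList<>();
        int attempt = run(Arrays.asList("1", "2", "oops", "4"), 4, output);
        System.out.println("succeeded on attempt " + attempt + " with output " + output);
    }
}
```

With a single bad record and four attempts, the first two attempts fail outright, the third fails while recording the bad record, and the fourth succeeds without it. With two bad records the same budget is not enough, which is one reason the attempt limit may need raising when skipping is used.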
It is often a good idea to compress the map output as it is written to disk, since doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer.
The amount of memory given to the JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property. You should try to make this as large as possible for the amount of memory on your task nodes; the discussion in "Memory" on page 254 goes through the constraints to consider.
Tuning
Incomplete presentation d:~(