
Cloudera Developer Training
for Apache Hadoop

Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Introduction
Chapter 1

Course Chapters

- Introduction
- The Motivation for Hadoop
- Hadoop: Basic Concepts
- Writing a MapReduce Program
- Unit Testing MapReduce Programs
- Delving Deeper into the Hadoop API
- Practical Development Tips and Techniques
- Data Input and Output
- Common MapReduce Algorithms
- Joining Data Sets in MapReduce Jobs
- Integrating Hadoop into the Enterprise Workflow
- Machine Learning and Mahout
- An Introduction to Hive and Pig
- An Introduction to Oozie
- Conclusion
- Appendix: Cloudera Enterprise
- Appendix: Graph Manipulation in MapReduce

The chapters are grouped into units: Course Introduction; Introduction to Apache Hadoop and its Ecosystem; Basic Programming with the Hadoop Core API; Problem Solving with MapReduce; The Hadoop Ecosystem; and Course Conclusion and Appendices.

Chapter Topics

Introduction (Course Introduction)

- About this course
- About Cloudera
- Course logistics

Course Objectives

During this course, you will learn:
- The core technologies of Hadoop
- How HDFS and MapReduce work
- How to develop MapReduce applications
- How to unit test MapReduce applications
- How to use MapReduce combiners, partitioners, and the distributed cache
- Best practices for developing and debugging MapReduce applications
- How to implement data input and output in MapReduce applications

Course Objectives (cont'd)

- Algorithms for common MapReduce tasks
- How to join data sets in MapReduce
- How Hadoop integrates into the data center
- How to use Mahout's machine learning algorithms
- How Hive and Pig can be used for rapid application development
- How to create large workflows using Oozie

Chapter Topics

Introduction (Course Introduction)

- About this course
- About Cloudera
- Course logistics

About Cloudera

- Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo
- Provides consulting and training services for Hadoop users
- Staff includes committers to virtually all Hadoop projects
- Many staff members are authors of industry-standard books on Apache Hadoop projects
  - Lars George, Tom White, Eric Sammer, etc.

Cloudera Software

- Cloudera's Distribution, including Apache Hadoop (CDH)
  - A set of easy-to-install packages built from the Apache Hadoop core repository, integrated with several additional open-source Hadoop ecosystem projects
  - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  - 100% open source
- Cloudera Manager, Free Edition
  - The easiest way to deploy a Hadoop cluster
  - Automates installation of Hadoop software
  - Installation, monitoring, and configuration are performed from a central machine
  - Manages up to 50 nodes
  - Completely free

Cloudera Enterprise

- Cloudera Enterprise Core
  - Complete package of software and support
  - Built on top of CDH
  - Includes the full version of Cloudera Manager
    - Install, manage, and maintain a cluster of any size
    - LDAP integration
    - Resource consumption tracking
    - Proactive health checks
    - Alerting
    - Configuration change audit trails
    - And more
- Cloudera Enterprise RTD
  - Includes support for Apache HBase

Cloudera Services

- Provides consultancy and support services to many key users of Hadoop
  - Including eBay, JPMorgan Chase, Experian, Groupon, Morgan Stanley, Nokia, Orbitz, the National Cancer Institute, RIM, and The Walt Disney Company
- Solutions Architects are experts in Hadoop and related technologies
  - Many are committers to the Apache Hadoop and ecosystem projects
- Provides training in key areas of Hadoop administration and development
  - Courses include System Administrator training, Developer training, Hive and Pig training, HBase training, and Essentials for Managers
  - Custom course development available
  - Both public and on-site training available

Chapter Topics

Introduction (Course Introduction)

- About this course
- About Cloudera
- Course logistics

Logistics

- Course start and end times
- Lunch
- Breaks
- Restrooms
- Can I come in early/stay late?
- Certification

Introductions

- About your instructor
- About you
  - Experience with Hadoop?
  - Experience as a developer?
  - What programming languages do you use?
  - Expectations from the course?

The Motivation for Hadoop
Chapter 2


The Motivation for Hadoop

In this chapter you will learn:
- What problems exist with traditional large-scale computing systems
- What requirements an alternative approach should have
- How Hadoop addresses those requirements

Chapter Topics

The Motivation for Hadoop (Introduction to Apache Hadoop and its Ecosystem)

- Problems with traditional large-scale systems
- Requirements for a new approach
- Introducing Hadoop
- Conclusion

Traditional Large-Scale Computation

- Traditionally, computation has been processor-bound
  - Relatively small amounts of data
  - Significant amount of complex processing performed on that data
- For decades, the primary push was to increase the computing power of a single machine
  - Faster processor, more RAM
- Distributed systems evolved to allow developers to use multiple machines for a single job
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)
  - Condor

Distributed Systems: Problems

- Programming for traditional distributed systems is complex
  - Data exchange requires synchronization
  - Finite bandwidth is available
  - Temporal dependencies are complicated
  - It is difficult to deal with partial failures of the system
- Ken Arnold, CORBA (Common Object Request Broker Architecture) designer:
  - "Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure"
  - Developers spend more time designing for failure than they do actually working on the problem itself

Distributed Systems: Data Storage

- Typically, data for a distributed system is stored on a SAN (Storage Area Network)
- At compute time, data is copied to the compute nodes
- Fine for relatively limited amounts of data

The Data-Driven World

- Modern systems have to deal with far more data than was the case in the past
  - Organizations are generating huge amounts of data
  - That data has inherent value, and cannot be discarded
- Examples:
  - Facebook: over 70PB of data
  - eBay: over 5PB of data
- Many organizations are generating data at a rate of terabytes per day

Data Becomes the Bottleneck

- Moore's Law has held firm for over 40 years
  - Processing power doubles every two years
  - Processing speed is no longer the problem
- Getting the data to the processors becomes the bottleneck
- Quick calculation:
  - Typical disk data transfer rate: 75MB/sec
  - Time taken to transfer 100GB of data to the processor: approximately 22 minutes!
    - Assuming sustained reads
    - Actual time will be worse, since most servers have less than 100GB of RAM available
- A new approach is needed
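The quick calculation above is easy to reproduce. A minimal Python check using the slide's own figures (75MB/sec sustained transfer rate, 100GB of data, taking 1GB = 1024MB):

```python
# Time to stream 100GB off a single disk at a sustained 75MB/sec.
data_mb = 100 * 1024        # 100GB expressed in MB
rate_mb_per_sec = 75        # typical sustained disk transfer rate
seconds = data_mb / rate_mb_per_sec
minutes = seconds / 60
print(round(minutes, 1))    # 22.8, i.e. roughly 22-23 minutes
```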

Chapter Topics

The Motivation for Hadoop (Introduction to Apache Hadoop and its Ecosystem)

- Problems with traditional large-scale systems
- Requirements for a new approach
- Introducing Hadoop
- Conclusion

Partial Failure Support

- The system must support partial failure
  - Failure of a component should result in a graceful degradation of application performance
  - Not complete failure of the entire system

Data Recoverability

- If a component of the system fails, its workload should be assumed by still-functioning units in the system
  - Failure should not result in the loss of any data

Component Recovery

- If a component of the system fails and then recovers, it should be able to rejoin the system
  - Without requiring a full restart of the entire system

Consistency

- Component failures during execution of a job should not affect the outcome of the job

Scalability

- Adding load to the system should result in a graceful decline in performance of individual jobs
  - Not failure of the system
- Increasing resources should support a proportional increase in load capacity

Chapter Topics

The Motivation for Hadoop (Introduction to Apache Hadoop and its Ecosystem)

- Problems with traditional large-scale systems
- Requirements for a new approach
- Introducing Hadoop
- Conclusion

Hadoop's History

- Hadoop is based on work done by Google in the late 1990s/early 2000s
  - Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
- This work takes a radical new approach to the problem of distributed computing
  - Meets all the requirements we have for reliability and scalability
- Core concept: distribute the data as it is initially stored in the system
  - Individual nodes can work on data local to those nodes
  - No data transfer over the network is required for initial processing

Core Hadoop Concepts

- Applications are written in high-level code
  - Developers need not worry about network programming, temporal dependencies, or low-level infrastructure
- Nodes talk to each other as little as possible
  - Developers should not write code which communicates between nodes
  - "Shared nothing" architecture
- Data is spread among machines in advance
  - Computation happens where the data is stored, wherever possible
  - Data is replicated multiple times on the system for increased availability and reliability

Hadoop: Very High-Level Overview

- When data is loaded into the system, it is split into blocks
  - Typically 64MB or 128MB
- Map tasks (the first part of the MapReduce system) work on relatively small portions of data
  - Typically a single block
- A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
  - Many nodes work in parallel, each on their own part of the overall dataset

Fault Tolerance

- If a node fails, the master will detect that failure and re-assign the work to a different node on the system
- Restarting a task does not require communication with nodes working on other portions of the data
- If a failed node restarts, it is automatically added back to the system and assigned new tasks
- If a node appears to be running slowly, the master can redundantly execute another instance of the same task
  - Results from the first to finish will be used
  - Known as "speculative execution"

Chapter Topics

The Motivation for Hadoop (Introduction to Apache Hadoop and its Ecosystem)

- Problems with traditional large-scale systems
- Requirements for a new approach
- Introducing Hadoop
- Conclusion

Conclusion

In this chapter you have learned:
- What problems exist with traditional large-scale computing systems
- What requirements an alternative approach should have
- How Hadoop addresses those requirements

Hadoop: Basic Concepts
Chapter 3


Hadoop: Basic Concepts

In this chapter you will learn:
- What Hadoop is
- What features the Hadoop Distributed File System (HDFS) provides
- The concepts behind MapReduce
- How a Hadoop cluster operates
- What other Hadoop Ecosystem projects exist

Chapter Topics

Hadoop: Basic Concepts (Introduction to Apache Hadoop and its Ecosystem)

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion

The Hadoop Project

- Hadoop is an open-source project overseen by the Apache Software Foundation
- Originally based on papers published by Google in 2003 and 2004
- Hadoop committers work at several different organizations
  - Including Cloudera, Yahoo!, Facebook, and LinkedIn

Hadoop Components

- Hadoop consists of two core components
  - The Hadoop Distributed File System (HDFS)
  - MapReduce
- There are many other projects based around core Hadoop
  - Often referred to as the "Hadoop Ecosystem"
  - Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
  - Many are discussed later in the course
- A set of machines running HDFS and MapReduce is known as a Hadoop cluster
  - Individual machines are known as nodes
  - A cluster can have as few as one node, or as many as several thousand
  - More nodes = better performance!

Hadoop Components: HDFS

- HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
- Data is split into blocks and distributed across multiple nodes in the cluster
  - Each block is typically 64MB or 128MB in size
- Each block is replicated multiple times
  - Default is to replicate each block three times
  - Replicas are stored on different nodes
  - This ensures both reliability and availability

Hadoop Components: MapReduce

- MapReduce is the system used to process data in the Hadoop cluster
- Consists of two phases: Map, and then Reduce
  - Between the two is a stage known as the shuffle and sort
- Each Map task operates on a discrete portion of the overall dataset
  - Typically one HDFS block of data
- After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
  - Much more on this later!

Chapter Topics

Hadoop: Basic Concepts (Introduction to Apache Hadoop and its Ecosystem)

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion

HDFS Basic Concepts

- HDFS is a filesystem written in Java
  - Based on Google's GFS
- Sits on top of a native filesystem
  - Such as ext3, ext4, or xfs
- Provides redundant storage for massive amounts of data
  - Using readily-available, industry-standard computers

HDFS Basic Concepts (cont'd)

- HDFS performs best with a modest number of large files
  - Millions, rather than billions, of files
  - Each file typically 100MB or more
- Files in HDFS are "write once"
  - No random writes to files are allowed
- HDFS is optimized for large, streaming reads of files
  - Rather than random reads

How Files Are Stored

- Files are split into blocks
  - Each block is usually 64MB or 128MB
- Data is distributed across many machines at load time
  - Different blocks from the same file will be stored on different machines
  - This provides for efficient MapReduce processing (see later)
- Blocks are replicated across multiple machines, known as DataNodes
  - Default replication is three-fold
  - Meaning that each block exists on three different machines
- A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
  - Known as the "metadata"
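The bookkeeping described above is simple to sketch numerically. A minimal Python illustration, using the defaults quoted in this chapter (128MB blocks, three-fold replication); the 300MB file size is a made-up example, and the function name is ours:

```python
import math

def hdfs_storage(file_mb, block_mb=128, replication=3):
    """Return (block count, total raw storage in MB) for one file.

    Note: as described later in this chapter, the last block only
    occupies the space it actually needs, so raw storage is simply
    the file size times the replication factor.
    """
    blocks = math.ceil(file_mb / block_mb)
    return blocks, file_mb * replication

blocks, raw_mb = hdfs_storage(300)   # hypothetical 300MB file
print(blocks, raw_mb)                # 3 blocks, 900MB of raw storage
```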

How Files Are Stored: Example

- NameNode holds metadata for the two files (Foo.txt and Bar.txt)
- DataNodes hold the actual blocks
  - Each block will be 64MB or 128MB in size
  - Each block is replicated three times on the cluster

NameNode metadata:
  Foo.txt: blk_001, blk_002, blk_003
  Bar.txt: blk_004, blk_005

DataNodes: each of the five blocks (blk_001 through blk_005) is stored on three different DataNodes across the cluster.

More On The HDFS NameNode

- The NameNode daemon must be running at all times
  - If the NameNode stops, the cluster becomes inaccessible
  - Your system administrator will take care to ensure that the NameNode hardware is reliable!
- The NameNode holds all of its metadata in RAM for fast access
  - It keeps a record of changes on disk for crash recovery
- A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode
  - Be careful: the Secondary NameNode is not a backup NameNode!

NameNode High Availability in CDH4

- CDH4 introduced High Availability for the NameNode
- Instead of a single NameNode, there are now two
  - An Active NameNode
  - A Standby NameNode
- If the Active NameNode fails, the Standby NameNode can automatically take over
- The Standby NameNode does the work performed by the Secondary NameNode in "classic" HDFS
  - HA HDFS does not run a Secondary NameNode daemon
- Your system administrator will choose whether to set the cluster up with NameNode High Availability or not

HDFS: Points To Note

- Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used
- Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files
  - This will be set by the system administrator
- Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
- When a client application wants to read a file:
  - It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  - It then communicates directly with the DataNodes to read the data
  - The NameNode will not be a bottleneck

Accessing HDFS

- Applications can read and write HDFS files directly via the Java API
  - Covered later in the course
- Typically, files are created on a local filesystem and must be moved into HDFS
- Likewise, files stored in HDFS may need to be moved to a machine's local filesystem
- Access to HDFS from the command line is achieved with the hadoop fs command

hadoop fs Examples

- Copy file foo.txt from local disk to the user's directory in HDFS:

    hadoop fs -put foo.txt foo.txt

  - This will copy the file to /user/username/foo.txt
- Get a directory listing of the user's home directory in HDFS:

    hadoop fs -ls

- Get a directory listing of the HDFS root directory:

    hadoop fs -ls /

hadoop fs Examples (cont'd)

- Display the contents of the HDFS file /user/fred/bar.txt:

    hadoop fs -cat /user/fred/bar.txt

- Move that file to the local disk, named as baz.txt:

    hadoop fs -get /user/fred/bar.txt baz.txt

- Create a directory called input under the user's home directory:

    hadoop fs -mkdir input

Note: -copyFromLocal is a synonym for -put; -copyToLocal is a synonym for -get

hadoop fs Examples (cont'd)

- Delete the directory input_old and all its contents:

    hadoop fs -rm -r input_old

Chapter Topics

Hadoop: Basic Concepts (Introduction to Apache Hadoop and its Ecosystem)

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion

Aside: The Training Virtual Machine

- During this course, you will perform numerous Hands-On Exercises using the Cloudera Training Virtual Machine (VM)
- The VM has Hadoop installed in pseudo-distributed mode
  - This essentially means that it is a cluster comprised of a single node
  - Using a pseudo-distributed cluster is the typical way to test your code before you run it on your full cluster
  - It operates almost exactly like a real cluster
    - A key difference is that the data replication factor is set to 1, not 3

Hands-On Exercise: Using HDFS

- In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS
- Please refer to the Hands-On Exercise Manual

Chapter Topics

Hadoop: Basic Concepts (Introduction to Apache Hadoop and its Ecosystem)

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion

What Is MapReduce?

- MapReduce is a method for distributing a task across multiple nodes
- Each node processes data stored on that node
  - Where possible
- Consists of two phases:
  - Map
  - Reduce

Features of MapReduce

- Automatic parallelization and distribution
- Fault-tolerance
- Status and monitoring tools
- A clean abstraction for programmers
  - MapReduce programs are usually written in Java
    - Can be written in any language using Hadoop Streaming (see later)
  - All of Hadoop is written in Java
- MapReduce abstracts all the "housekeeping" away from the developer
  - The developer can concentrate simply on writing the Map and Reduce functions

MapReduce: The Big Picture

[Diagram: overall MapReduce data flow]

MapReduce: The JobTracker

- MapReduce jobs are controlled by a software daemon known as the JobTracker
- The JobTracker resides on a "master node"
  - Clients submit MapReduce jobs to the JobTracker
  - The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
  - These nodes each run a software daemon known as the TaskTracker
  - The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker

Aside: MapReduce Version 2

- CDH4 contains standard MapReduce (MR1)
- CDH4 also includes MapReduce version 2 (MR2)
  - Also known as YARN (Yet Another Resource Negotiator)
  - A complete rewrite of the Hadoop MapReduce framework
- MR2 is not yet considered production-ready
  - Included in CDH4 as a technology preview
- Existing code will work with no modification on MR2 clusters when the technology matures
  - Code will need to be re-compiled, but the API remains identical
- For production use, we strongly recommend using MR1

MapReduce: Terminology

- A job is a "full program"
  - A complete execution of Mappers and Reducers over a dataset
- A task is the execution of a single Mapper or Reducer over a slice of data
- A task attempt is a particular instance of an attempt to execute a task
  - There will be at least as many task attempts as there are tasks
  - If a task attempt fails, another will be started by the JobTracker
  - Speculative execution (see later) can also result in more task attempts than completed tasks

MapReduce: The Mapper

- Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
  - Multiple Mappers run in parallel, each processing a portion of the input data
- The Mapper reads data in the form of key/value pairs
- It outputs zero or more key/value pairs (pseudo-code):

    map(in_key, in_value) ->
      (inter_key, inter_value) list

MapReduce: The Mapper (cont'd)

- The Mapper may use or completely ignore the input key
  - For example, a standard pattern is to read a line of a file at a time
    - The key is the byte offset into the file at which the line starts
    - The value is the contents of the line itself
    - Typically the key is considered irrelevant
- If the Mapper writes anything out, the output must be in the form of key/value pairs

Example Mapper: Upper-Case Mapper

- Turn input into upper case (pseudo-code):

    let map(k, v) =
      emit(k.toUpper(), v.toUpper())

    ('foo', 'bar')       -> ('FOO', 'BAR')
    ('foo', 'other')     -> ('FOO', 'OTHER')
    ('baz', 'more data') -> ('BAZ', 'MORE DATA')
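The same logic can be expressed as a short Python generator. This is an illustration of the concept, not the Hadoop Java API; the function name is ours:

```python
def upper_case_map(key, value):
    """Emit zero or more key/value pairs; here, exactly one per input."""
    yield (key.upper(), value.upper())

print(list(upper_case_map('foo', 'bar')))   # [('FOO', 'BAR')]
```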

Example Mapper: "Explode" Mapper

- Output each input character separately (pseudo-code):

    let map(k, v) =
      foreach char c in v:
        emit(k, c)

    ('foo', 'bar')   -> ('foo', 'b'), ('foo', 'a'), ('foo', 'r')
    ('baz', 'other') -> ('baz', 'o'), ('baz', 't'), ('baz', 'h'),
                        ('baz', 'e'), ('baz', 'r')

Example Mapper: Filter Mapper

- Only output key/value pairs where the input value is a prime number (pseudo-code):

    let map(k, v) =
      if (isPrime(v)) then emit(k, v)

    ('foo', 7)  -> ('foo', 7)
    ('baz', 10) -> nothing

Example"Mapper:"Changing"Keyspaces"
!The%key%output%by%the%Mapper%does%not%need%to%be%iden/cal%to%the%input%
key%
!Output%the%word%length%as%the%key%(pseudo#code):%
let map(k, v) =
emit(v.length(), v)
('foo', 'bar') ->
(3, 'bar')
('baz', 'other') -> (5, 'other')
('foo', 'abracadabra') -> (11, 'abracadabra')

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#36%

MapReduce:"The"Reducer"
!APer%the%Map%phase%is%over,%all%the%intermediate%values%for%a%given%
intermediate%key%are%combined%together%into%a%list%
!This%list%is%given%to%a%Reducer%
There"may"be"a"single"Reducer,"or"mulDple"Reducers"
This"is"specied"as"part"of"the"job"conguraDon"(see"later)"
All"values"associated"with"a"parDcular"intermediate"key"are"guaranteed"
to"go"to"the"same"Reducer"
The"intermediate"keys,"and"their"value"lists,"are"passed"to"the"Reducer"
in"sorted"key"order"
This"step"is"known"as"the"shue"and"sort"
!The%Reducer%outputs%zero%or%more%nal%key/value%pairs%
These"are"wri>en"to"HDFS"
In"pracDce,"the"Reducer"usually"emits"a"single"key/value"pair"for"each"
input"key"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#37%

Example"Reducer:"Sum"Reducer"
!Add%up%all%the%values%associated%with%each%intermediate%key%(pseudo#
code):%

let reduce(k, vals) =


sum = 0
foreach int i in vals:
sum += i
emit(k, sum)
(bar', [9, 3, -17, 44]) ->
(bar', 39)
(foo', [123, 100, 77]) -> (foo', 300)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#38%

Example"Reducer:"IdenDty"Reducer"
!The%Iden/ty%Reducer%is%very%common%(pseudo#code):%

let reduce(k, vals) =


foreach v in vals:
emit(k, v)
('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100),
('bar', 77)
('foo', [9, 3, -17, 44]) ->
('foo', 9), ('foo', 3),
('foo', -17), ('foo', 44)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#39%

MapReduce"Example:"Word"Count"
!Count%the%number%of%occurrences%of%each%word%in%a%large%amount%of%input%
data%
This"is"the"hello"world"of"MapReduce"programming"
map(String input_key, String input_value)
foreach word w in input_value:
emit(w, 1)
reduce(String output_key,
Iterator<int> intermediate_vals)
set count = 0
foreach v in intermediate_vals:
count += v
emit(output_key, count)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#40%

MapReduce"Example:"Word"Count"(contd)"
!Input%to%the%Mapper:%
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

!Output%from%the%Mapper:%
('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#41%

MapReduce"Example:"Word"Count"(contd)"
!Intermediate%data%sent%to%the%Reducer:%
('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

! Final Reducer output:
('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
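The full data flow shown above — map, then shuffle and sort, then reduce — can be walked through in a single JVM. The sketch below is an in-memory simulation using the slide's input, not the Hadoop API; it exists only to make the three phases concrete.

```java
import java.util.*;

// In-memory walk-through of Word Count: map -> shuffle/sort -> reduce.
public class WordCountFlow {
    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every line
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                intermediate.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group values by key, keys in sorted order
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }

        // Reduce phase: sum the value list for each key
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) -> {
            int count = 0;
            for (int v : ones) count += v;
            result.put(word, count);
        });
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
            "the cat sat on the mat",
            "the aardvark sat on the sofa")));
        // {aardvark=1, cat=1, mat=1, on=2, sat=2, sofa=1, the=4}
    }
}
```

Running this reproduces the Reducer output listed above, including the sorted key order.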
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#42%

MapReduce:"Data"Locality"
!Whenever%possible,%Hadoop%will%aTempt%to%ensure%that%a%Map%task%on%a%
node%is%working%on%a%block%of%data%stored%locally%on%that%node%via%HDFS%
!If%this%is%not%possible,%the%Map%task%will%have%to%transfer%the%data%across%
the%network%as%it%processes%that%data%
!Once%the%Map%tasks%have%nished,%data%is%then%transferred%across%the%
network%to%the%Reducers%
Although"the"Reducers"may"run"on"the"same"physical"machines"as"the"
Map"tasks,"there"is"no"concept"of"data"locality"for"the"Reducers"
All"Mappers"will,"in"general,"have"to"communicate"with"all"Reducers"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#43%

MapReduce:"Is"Shue"and"Sort"a"Bo>leneck?"
!It%appears%that%the%shue%and%sort%phase%is%a%boTleneck%
The"reduce"method"in"the"Reducers"cannot"start"unDl"all"Mappers"
have"nished"
!In%prac/ce,%Hadoop%will%start%to%transfer%data%from%Mappers%to%Reducers%
as%the%Mappers%nish%work%
This"miDgates"against"a"huge"amount"of"data"transfer"starDng"as"soon"
as"the"last"Mapper"nishes"
Note"that"this"behavior"is"congurable"
The"developer"can"specify"the"percentage"of"Mappers"which"should"
nish"before"Reducers"start"retrieving"data"
The"developers"reduce"method"sDll"does"not"start"unDl"all"
intermediate"data"has"been"transferred"and"sorted"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#44%

MapReduce:"Is"a"Slow"Mapper"a"Bo>leneck?"
!It%is%possible%for%one%Map%task%to%run%more%slowly%than%the%others%
Perhaps"due"to"faulty"hardware,"or"just"a"very"slow"machine"
!It%would%appear%that%this%would%create%a%boTleneck%
The"reduce"method"in"the"Reducer"cannot"start"unDl"every"Mapper"
has"nished"
!Hadoop%uses%specula=ve&execu=on%to%mi/gate%against%this%
If"a"Mapper"appears"to"be"running"signicantly"more"slowly"than"the"
others,"a"new"instance"of"the"Mapper"will"be"started"on"another"
machine,"operaDng"on"the"same"data"
The"results"of"the"rst"Mapper"to"nish"will"be"used"
Hadoop"will"kill"o"the"Mapper"which"is"sDll"running"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#45%

CreaDng"and"Running"a"MapReduce"Job"
!Write%the%Mapper%and%Reducer%classes%
!Write%a%Driver%class%that%congures%the%job%and%submits%it%to%the%cluster%
Driver"classes"are"covered"in"the"next"chapter"
!Compile%the%Mapper,%Reducer,%and%Driver%classes%
Example:""
javac -classpath `hadoop classpath` *.java
!Create%a%jar%le%with%the%Mapper,%Reducer,%and%Driver%classes%
Example:"jar cvf foo.jar *.class
!Run%the%hadoop jar%command%to%submit%the%job%to%the%Hadoop%cluster%
Example:"hadoop jar foo.jar Foo in_dir out_dir

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#46%

Chapter"Topics"
Hadoop:%Basic%Concepts%

Introduc/on%to%Apache%Hadoop%
and%its%Ecosystem%

! The"Hadoop"project"and"Hadoop"components"
! The"Hadoop"Distributed"File"System"(HDFS)"
! Hands/On"Exercise:"Using"HDFS"
! How"MapReduce"works"
! Hands#On%Exercise:%Running%a%MapReduce%Job%
! How"a"Hadoop"cluster"operates"
! Other"Hadoop"ecosystem"components"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#47%

Hands/On"Exercise:"Running"A"MapReduce"Job"
!In%this%Hands#On%Exercise,%you%will%run%a%MapReduce%job%on%your%pseudo#
distributed%Hadoop%cluster%
!Please%refer%to%the%Hands#On%Exercise%Manual%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#48%

Chapter"Topics"
Hadoop:%Basic%Concepts%

Introduc/on%to%Apache%Hadoop%
and%its%Ecosystem%

! The"Hadoop"project"and"Hadoop"components"
! The"Hadoop"Distributed"File"System"(HDFS)"
! Hands/On"Exercise:"Using"HDFS"
! How"MapReduce"works"
! Hands/On"Exercise:"Running"a"MapReduce"Job"
! How%a%Hadoop%cluster%operates%
! Other"Hadoop"ecosystem"components"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#49%

Installing"A"Hadoop"Cluster"
!Cluster%installa/on%is%usually%performed%by%the%system%administrator,%and%
is%outside%the%scope%of%this%course%
Cloudera"oers"a"training"course"for"System"Administrators"specically"
aimed"at"those"responsible"for"commissioning"and"maintaining"Hadoop"
clusters"
!However,%its%very%useful%to%understand%how%the%component%parts%of%the%
Hadoop%cluster%work%together%
!Typically,%a%developer%will%congure%their%machine%to%run%in%pseudo5
distributed&mode%
This"eecDvely"creates"a"single/machine"cluster"
All"ve"Hadoop"daemons"are"running"on"the"same"machine"
Very"useful"for"tesDng"code"before"it"is"deployed"to"the"real"cluster"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#50%

Installing"A"Hadoop"Cluster"(contd)"
!Easiest%way%to%download%and%install%Hadoop,%either%for%a%full%cluster%or%in%
pseudo#distributed%mode,%is%by%using%Clouderas%Distribu/on,%including%
Apache%Hadoop%(CDH)%
Vanilla"Hadoop"plus"many"patches,"backports,"bugxes"
Supplied"as"a"Debian"package"(for"Linux"distribuDons"such"as"Ubuntu),"
an"RPM"(for"CentOS/RedHat"Enterprise"Linux),"and"as"a"tarball"
Full"documentaDon"available"at"http://cloudera.com/

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#51%

The"Five"Hadoop"Daemons"
!Hadoop%is%comprised%of%ve%separate%daemons%
!NameNode%
Holds"the"metadata"for"HDFS"
!Secondary%NameNode%
Performs"housekeeping"funcDons"for"the"NameNode"
Is"not"a"backup"or"hot"standby"for"the"NameNode!"
!DataNode%
Stores"actual"HDFS"data"blocks"
!JobTracker%
Manages"MapReduce"jobs,"distributes"individual"tasks"to"machines"
running"the"
!TaskTracker%
InstanDates"and"monitors"individual"Map"and"Reduce"tasks"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#52%

The"Five"Hadoop"Daemons"(contd)"
!Each%daemon%runs%in%its%own%Java%Virtual%Machine%(JVM)%
!No%node%on%a%real%cluster%will%run%all%ve%daemons%
Although"this"is"technically"possible"
!We%can%consider%nodes%to%be%in%two%dierent%categories:%
Master"Nodes"
Run"the"NameNode,"Secondary"NameNode,"JobTracker"daemons"
Only"one"of"each"of"these"daemons"runs"on"the"cluster"
Slave"Nodes"
Run"the"DataNode"and"TaskTracker"daemons"
A"slave"node"will"run"both"of"these"daemons"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#53%

Basic"Cluster"ConguraDon"

!On%very%small%clusters,%the%NameNode,%JobTracker%and%Secondary%NameNode%
daemons%can%all%reside%on%a%single%machine%
It"is"typical"to"put"them"on"separate"machines"as"the"cluster"grows"beyond"
20/30"nodes"
!Each%daemon%runs%in%a%separate%Java%Virtual%Machine%(JVM)%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#54%

Submirng"A"Job"
!When%a%client%submits%a%job,%its%congura/on%informa/on%is%packaged%into%
an%XML%le%
!This%le,%along%with%the%.jar%le%containing%the%actual%program%code,%is%
handed%to%the%JobTracker%
The"JobTracker"then"parcels"out"individual"tasks"to"TaskTracker"nodes"
When"a"TaskTracker"receives"a"request"to"run"a"task,"it"instanDates"a"
separate"JVM"for"that"task"
TaskTracker"nodes"can"be"congured"to"run"mulDple"tasks"at"the"same"
Dme"
If"the"node"has"enough"processing"power"and"memory"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#55%

Submirng"A"Job"(contd)"
!The%intermediate%data%is%held%on%the%TaskTrackers%local%disk%
!As%Reducers%start%up,%the%intermediate%data%is%distributed%across%the%
network%to%the%Reducers%
!Reducers%write%their%nal%output%to%HDFS%
!Once%the%job%has%completed,%the%TaskTracker%can%delete%the%intermediate%
data%from%its%local%disk%
Note"that"the"intermediate"data"is"not"deleted"unDl"the"enDre"job"
completes"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#56%

Chapter"Topics"
Hadoop:%Basic%Concepts%

Introduc/on%to%Apache%Hadoop%
and%its%Ecosystem%

! The"Hadoop"project"and"Hadoop"components"
! The"Hadoop"Distributed"File"System"(HDFS)"
! Hands/On"Exercise:"Using"HDFS"
! How"MapReduce"works"
! Hands/On"Exercise:"Running"a"MapReduce"Job"
! How"a"Hadoop"cluster"operates"
! Other%Hadoop%ecosystem%components%
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#57%

Other"Ecosystem"Projects:"IntroducDon"
!The%term%Hadoop%core%refers%to%HDFS%and%MapReduce%
!Many%other%projects%exist%which%use%Hadoop%core%
Either"both"HDFS"and"MapReduce,"or"just"HDFS"
!Most%are%Apache%projects%or%Apache%Incubator%projects%
Some"others"are"not"hosted"by"the"Apache"Soaware"FoundaDon"
These"are"oaen"hosted"on"GitHub"or"a"similar"repository"
!We%will%inves/gate%many%of%these%projects%later%in%the%course%
!Following%is%an%introduc/on%to%some%of%the%most%signicant%projects%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#58%

Hive"
!Hive%is%an%abstrac/on%on%top%of%MapReduce%
!Allows%users%to%query%data%in%the%Hadoop%cluster%without%knowing%Java%or%
MapReduce%
!Uses%the%HiveQL%language%
Very"similar"to"SQL"
!The%Hive%Interpreter%runs%on%a%client%machine%
Turns"HiveQL"queries"into"MapReduce"jobs"
Submits"those"jobs"to"the"cluster"
!Note:%this%does%not%turn%the%cluster%into%a%rela/onal%database%server!%
It"is"sDll"simply"running"MapReduce"jobs"
Those"jobs"are"created"by"the"Hive"Interpreter"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#59%

Hive"(contd)"
!Sample%Hive%query:%
SELECT stock.product, SUM(orders.purchases)
FROM stock JOIN orders
ON (stock.id = orders.stock_id)
WHERE orders.quarter = 'Q1'
GROUP BY stock.product;
%
!We%will%inves/gate%Hive%in%greater%detail%later%in%the%course%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#60%

Pig"
!Pig%is%an%alterna/ve%abstrac/on%on%top%of%MapReduce%
!Uses%a%dataow%scrip/ng%language%
Called"PigLaDn"
!The%Pig%interpreter%runs%on%the%client%machine%
Takes"the"PigLaDn"script"and"turns"it"into"a"series"of"MapReduce"jobs"
Submits"those"jobs"to"the"cluster"
!As%with%Hive,%nothing%magical%happens%on%the%cluster%
It"is"sDll"simply"running"MapReduce"jobs"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#61%

Pig"(contd)"
!Sample%Pig%script:%

stock = LOAD '/user/fred/stock' AS (id, item);


orders = LOAD '/user/fred/orders' AS (id, cost);
grpd = GROUP orders BY id;
totals = FOREACH grpd GENERATE group,
SUM(orders.cost) AS t;
result = JOIN stock BY id, totals BY group;
DUMP result;

!We%will%inves/gate%Pig%in%more%detail%later%in%the%course%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#62%

Impala"
!Impala%is%an%open#source%project%created%by%Cloudera%
!Facilitates%real#/me%queries%of%data%in%HDFS%
!Does%not%use%MapReduce%
Uses"its"own"daemon,"running"on"each"slave"node"
Queries"data"stored"in"HDFS"
!Uses%a%language%very%similar%to%HiveQL%
But"produces"results"much,"much"faster"
Typically"between"ve"and"40"Dmes"faster"than"Hive"
!Currently%in%beta%
Although"being"used"in"producDon"by"mulDple"organizaDons"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#63%

Flume"and"Sqoop"
!Flume%provides%a%method%to%import%data%into%HDFS%as%it%is%generated%
Instead"of"batch/processing"the"data"later"
For"example,"log"les"from"a"Web"server"
!Sqoop%provides%a%method%to%import%data%from%tables%in%a%rela/onal%
database%into%HDFS%
Does"this"very"eciently"via"a"Map/only"MapReduce"job"
Can"also"go"the"other"way"
Populate"database"tables"from"les"in"HDFS"
!We%will%inves/gate%Flume%and%Sqoop%later%in%the%course%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#64%

Oozie"
!Oozie%allows%developers%to%create%a%workow%of%MapReduce%jobs%
Including"dependencies"between"jobs"
!The%Oozie%server%submits%the%jobs%to%the%server%in%the%correct%sequence%
!We%will%inves/gate%Oozie%later%in%the%course%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#65%

HBase"
!HBase%is%the%Hadoop%database%
!A%NoSQL%datastore%
!Can%store%massive%amounts%of%data%
Gigabytes,"terabytes,"and"even"petabytes"of"data"in"a"table"
!Scales%to%provide%very%high%write%throughput%
Hundreds"of"thousands"of"inserts"per"second"
!Copes%well%with%sparse%data%
Tables"can"have"many"thousands"of"columns"
Even"if"most"columns"are"empty"for"any"given"row"
!Has%a%very%constrained%access%model%
Insert"a"row,"retrieve"a"row,"do"a"full"or"parDal"table"scan"
Only"one"column"(the"row"key)"is"indexed"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#66%

HBase"vs"TradiDonal"RDBMSs"
RDBMS%

HBase%

Data%layout%

Row/oriented"

Column/oriented"

Transac/ons%

Yes"

Single"row"only"

Query%language%

SQL"

get/put/scan"

Security%

AuthenDcaDon/AuthorizaDon" Kerberos"

Indexes%

On"arbitrary"columns"

Row/key"only"

Max%data%size%

TBs"

PB+"

Read/write%throughput%
limits%

1000s"queries/second"

Millions"of"queries/second"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#67%

Chapter"Topics"
Hadoop:%Basic%Concepts%

Introduc/on%to%Apache%Hadoop%
and%its%Ecosystem%

! The"Hadoop"project"and"Hadoop"components"
! The"Hadoop"Distributed"File"System"(HDFS)"
! Hands/On"Exercise:"Using"HDFS"
! How"MapReduce"works"
! Hands/On"Exercise:"Running"a"MapReduce"Job"
! How"a"Hadoop"cluster"operates"
! Other"Hadoop"ecosystem"components"
! Conclusion%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#68%

Conclusion"
In%this%chapter%you%have%learned%
!What%Hadoop%is%
!What%features%the%Hadoop%Distributed%File%System%(HDFS)%provides%
!The%concepts%behind%MapReduce%
!How%a%Hadoop%cluster%operates%
!What%other%Hadoop%Ecosystem%projects%exist%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

03#69%

WriAng"a"MapReduce"Program"
Chapter"4"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#1%

Course"Chapters"
! IntroducAon"

Course"IntroducAon"

! The"MoAvaAon"for"Hadoop"
! Hadoop:"Basic"Concepts"
! Wri*ng%a%MapReduce%Program%
! Unit"TesAng"MapReduce"Programs"
! Delving"Deeper"into"the"Hadoop"API"
! PracAcal"Development"Tips"and"Techniques"
! Data"Input"and"Output"
! Common"MapReduce"Algorithms"
! Joining"Data"Sets"in"MapReduce"Jobs"
! IntegraAng"Hadoop"into"the"Enterprise"Workow"
! Machine"Learning"and"Mahout"
! An"IntroducAon"to"Hive"and"Pig"
! An"IntroducAon"to"Oozie"
! Conclusion"
! Appendix:"Cloudera"Enterprise"
! Appendix:"Graph"ManipulaAon"in"MapReduce"

IntroducAon"to"Apache"Hadoop""
and"its"Ecosystem"

Basic%Programming%with%the%
Hadoop%Core%API%

Problem"Solving"with"MapReduce"

The"Hadoop"Ecosystem"

Course"Conclusion"and"Appendices"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#2%

WriAng"a"MapReduce"Program"
In%this%chapter%you%will%learn%
!The%MapReduce%ow%
!Basic%MapReduce%API%concepts%
!How%to%write%MapReduce%drivers,%Mappers,%and%Reducers%in%Java%
!How%to%write%Mappers%and%Reducers%in%other%languages%using%the%
Streaming%API%
!How%to%speed%up%your%Hadoop%development%by%using%Eclipse%
!The%dierences%between%the%old%and%new%MapReduce%APIs%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#3%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The%MapReduce%ow%
! Basic"MapReduce"API"concepts"
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#4%

A"Sample"MapReduce"Program:"IntroducAon"
!In%the%previous%chapter,%you%ran%a%sample%MapReduce%program%
WordCount,"which"counted"the"number"of"occurrences"of"each"unique"
word"in"a"set"of"les"
!In%this%chapter,%we%will%examine%the%code%for%WordCount%
This"will"demonstrate"the"Hadoop"API"
!We%will%also%inves*gate%Hadoop%Streaming%
Allows"you"to"write"MapReduce"programs"in"(virtually)"any"language"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#5%

The"MapReduce"Flow:"IntroducAon"
!On%the%following%slides%we%show%the%MapReduce%ow%
!Each%of%the%por*ons%(RecordReader,%Mapper,%Par**oner,%Reducer,%etc.)%
can%be%created%by%the%developer%
!We%will%cover%each%of%these%as%we%move%through%the%course%
!You%will%always%create%at%least%a%Mapper,%Reducer,%and%driver%code%
Those"are"the"porAons"we"will"invesAgate"in"this"chapter"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#6%

The"MapReduce"Flow:"The"Mapper"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#7%

The"MapReduce"Flow:"Shue"and"Sort"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#8%

The"MapReduce"Flow:"Reducers"to"Outputs"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#9%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic%MapReduce%API%concepts%
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#10%

Our"MapReduce"Program:"WordCount"
!To%inves*gate%the%API,%we%will%dissect%the%WordCount%program%you%ran%in%
the%previous%chapter%
!This%consists%of%three%por*ons%
The"driver"code"
Code"that"runs"on"the"client"to"congure"and"submit"the"job"
The"Mapper"
The"Reducer"
!Before%we%look%at%the%code,%we%need%to%cover%some%basic%Hadoop%API%
concepts%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#11%

Gecng"Data"to"the"Mapper"
!The%data%passed%to%the%Mapper%is%specied%by%an%InputFormat+
Specied"in"the"driver"code"
Denes"the"locaAon"of"the"input"data"
A"le"or"directory,"for"example"
Determines"how"to"split"the"input"data"into"input&splits"
Each"Mapper"deals"with"a"single"input"split""
InputFormat"is"a"factory"for"RecordReader"objects"to"extract""
(key,"value)"records"from"the"input"source"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#12%

Gecng"Data"to"the"Mapper"(contd)"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#13%

Some"Standard"InputFormats"
! FileInputFormat%
The"base"class"used"for"all"le/based"InputFormats"
! TextInputFormat
The"default"
Treats"each"\n/terminated"line"of"a"le"as"a"value"
Key"is"the"byte"oset"within"the"le"of"that"line"
! KeyValueTextInputFormat
Maps"\n/terminated"lines"as"key"SEP"value"
By"default,"separator"is"a"tab"
! SequenceFileInputFormat
Binary"le"of"(key,"value)"pairs"with"some"addiAonal"metadata"
! SequenceFileAsTextInputFormat
Similar,"but"maps"(key.toString(),"value.toString())"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#14%

Keys"and"Values"are"Objects"
!Keys%and%values%in%Hadoop%are%Objects%
!Values%are%objects%which%implement%Writable
!Keys%are%objects%which%implement%WritableComparable

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#15%

What"is"Writable?"
!Hadoop%denes%its%own%box%classes%for%strings,%integers%and%so%on%
IntWritable"for"ints"
LongWritable"for"longs"
FloatWritable"for"oats"
DoubleWritable"for"doubles"
Text"for"strings"
Etc.""
!The%Writable%interface%makes%serializa*on%quick%and%easy%for%Hadoop%%
!Any%values%type%must%implement%the%Writable%interface%
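The idea behind Writable — explicit binary serialization through write and readFields methods — can be sketched with a simplified stand-in class. `SimpleIntWritable` below mimics the shape of Hadoop's IntWritable but is not the real class; it exists only to show the serialization pattern.

```java
import java.io.*;

// Simplified stand-in for Hadoop's IntWritable, illustrating the
// write/readFields serialization pattern of the Writable interface.
public class SimpleIntWritable {
    private int value;

    public SimpleIntWritable() {}                     // needed for deserialization
    public SimpleIntWritable(int value) { this.value = value; }

    public int get() { return value; }

    // Serialize this object's fields to a binary stream
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Repopulate this object's fields from a binary stream
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        // Round-trip: serialize, then deserialize into a fresh object
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new SimpleIntWritable(42).write(new DataOutputStream(bytes));

        SimpleIntWritable copy = new SimpleIntWritable();
        copy.readFields(new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.get());  // 42
    }
}
```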

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#16%

What"is"WritableComparable?"
!A%WritableComparable%is%a%Writable%which%is%also%Comparable
Two"WritableComparables"can"be"compared"against"each"other"to"
determine"their"order"
Keys"must"be"WritableComparables"because"they"are"passed"to"
the"Reducer"in"sorted"order"
We"will"talk"more"about"WritableComparables"later"
!Note%that%despite%their%names,%all%Hadoop%box%classes%implement%both%
Writable%and%WritableComparable
For"example,"IntWritable"is"actually"a"WritableComparable

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#17%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! Wri*ng%MapReduce%applica*ons%in%Java%
The%driver%
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#18%

The"Driver"Code:"IntroducAon"
!The%driver%code%runs%on%the%client%machine%
!It%congures%the%job,%then%submits%it%to%the%cluster%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#19%

The"Driver:"Complete"Code"
import
import
import
import
import
import

org.apache.hadoop.fs.Path;
org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
org.apache.hadoop.mapreduce.Job;

public class WordCount {


public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#20%

The"Driver:"Complete"Code"(contd)"
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#21%

The"Driver:"Import"Statements"
import
import
import
import
import
import

org.apache.hadoop.fs.Path;
org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
org.apache.hadoop.mapreduce.Job;

public class WordCount {

You"will"typically"import"these"classes"into"every"
MapReduce"job"you"write."We"will"omit"the"import
if (args.length
!= 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
statements"in"future"slides"for"brevity.""
System.exit(-1);

public static void main(String[] args) throws Exception {

}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#22%

The"Driver:"Main"Code"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#23%

The"Driver"Class:"main"Method"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");

The"main"method"accepts"two"command/line"arguments:"the"input"
and"output"directories."
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);

}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#24%

Sanity"Checking"The"Jobs"InvocaAon"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");

The"rst"step"is"to"ensure"that"we"have"been"given"two"command/
FileInputFormat.setInputPaths(job, new Path(args[0]));
line"arguments."If"not,"print"a"help"message"and"exit."
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);

}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#25%

Configuring The Job With the Job Object

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

To configure the job, create a new Job object and specify the class
which will be called to run the job.
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#26%

CreaAng"a"New"Job"Object"
!The%Job%class%allows%you%to%set%congura*on%op*ons%for%your%MapReduce%
job%
The"classes"to"be"used"for"your"Mapper"and"Reducer"
The"input"and"output"directories"
Many"other"opAons"
!Any%op*ons%not%explicitly%set%in%your%driver%code%will%be%read%from%your%
Hadoop%congura*on%les%
Usually"located"in"/etc/hadoop/conf
!Any%op*ons%not%specied%in%your%congura*on%les%will%receive%Hadoops%
default%values%
!You%can%also%use%the%Job%object%to%submit%the%job,%control%its%execu*on,%
and%query%its%state%%
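As an illustration of where such defaults come from, a site-wide setting can be placed in one of the XML files under /etc/hadoop/conf; the driver picks it up unless it sets the property explicitly. This is a hypothetical excerpt (the property name mapred.reduce.tasks is the MRv1-era name; check the documentation for your Hadoop version):

```xml
<!-- Hypothetical excerpt from mapred-site.xml: sets a cluster-wide
     default number of Reducers for jobs that do not set it themselves.
     Property name assumed from the MRv1 API; verify for your version. -->
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
</configuration>
```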

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#27%

Naming"The"Job"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Give"the"job"a"meaningful"name."

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#28%

Specifying"Input"and"Output"Directories"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

Next,"specify"the"input"directory"from"which"data"will"be"read,"and"
job.setMapOutputKeyClass(Text.class);
the"output"directory"to"which"nal"output"will"be"wri>en.""
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#29%

Specifying"the"InputFormat"
!The%default%InputFormat%(TextInputFormat)%will%be%used%unless%you%
specify%otherwise%
!To%use%an%InputFormat%other%than%the%default,%use%e.g.%
job.setInputFormatClass(KeyValueTextInputFormat.class)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#30%

Determining"Which"Files"To"Read"
!By%default,%FileInputFormat.setInputPaths()%will%read%all%les%
from%a%specied%directory%and%send%them%to%Mappers%
ExcepAons:"items"whose"names"begin"with"a"period"(.)"or"underscore"
(_)"
Globs"can"be"specied"to"restrict"input"
For"example,"/2010/*/01/*
!Alterna*vely,%FileInputFormat.addInputPath()%can%be%called%
mul*ple%*mes,%specifying%a%single%le%or%directory%each%*me%
!More%advanced%ltering%can%be%performed%by%implemen*ng%a%
PathFilter%
Interface"with"a"method"named"accept
Takes"a"path"to"a"le,"returns"true"or"false"depending"on"
whether"or"not"the"le"should"be"processed"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#31%

Specifying"Final"Output"With"OutputFormat"
! FileOutputFormat.setOutputPath()%species%the%directory%to%
which%the%Reducers%will%write%their%nal%output%
!The%driver%can%also%specify%the%format%of%the%output%data%
Default"is"a"plain"text"le"
Could"be"explicitly"wri>en"as"
job.setOutputFormatClass(TextOutputFormat.class)
!We%will%discuss%OutputFormats%in%more%depth%in%a%later%chapter%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#32%

Specify"The"Classes"for"Mapper"and"Reducer"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

Give"the"Job"object"informaAon"about"which"classes"are"to"be"
job.setOutputKeyClass(Text.class);
instanAated"as"the"Mapper"and"Reducer."
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#33%

Specify"The"Intermediate"Data"Types"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Specify"the"types"for"the"intermediate"output"key"and"value"
boolean success = job.waitForCompletion(true);
produced"by"the"Mapper."
System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#34%

Specify"The"Final"Output"Data"Types"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

Specify"the"types"for"the"Reducers"output"key"and"value."

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#35%

Running"The"Job"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

Start"the"job,"wait"for"it"to"complete,"and"exit"with"a"return"code."

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

boolean success = job.waitForCompletion(true);


System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#36%

Running"The"Job"(contd)"
!There%are%two%ways%to%run%your%MapReduce%job:%
job.waitForCompletion()
Blocks"(waits"for"the"job"to"complete"before"conAnuing)"
job.submit()
Does"not"block"(driver"code"conAnues"as"the"job"is"running)"
! The%job%determines%the%proper%division%of%input%data%into%InputSplits,%and%
then%sends%the%job%informa*on%to%the%JobTracker%daemon%on%the%cluster%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#37%

Reprise:"Driver"Code"
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#38%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! Wri*ng%MapReduce%applica*ons%in%Java%
The"driver"
The%Mapper%
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#39%

The"Mapper:"Complete"Code"
import
import
import
import
import

java.io.IOException;
org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#40%

The"Mapper:"import"Statements"
import
import
import
import
import

java.io.IOException;
org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>


{
You"will"typically"import"java.io.IOException,"and"the"
@Override
org.apache.hadoop"classes"shown,"in"every"Mapper"you"
public
void map(LongWritable key, Text value, Context context)
throws
IOException, InterruptedException {
write."We"will"omit"the"import"statements"in"future"slides"for"
String
line = value.toString();
brevity.""
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#41%

The"Mapper:"Main"Code"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#42%

The"Mapper:"Main"Code"(contd)"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
Your"Mapper"class"should"extend"the"Mapper"class."The"
public void
map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

Mapper class"expects"four"generics,"which"dene"the"
Stringtypes"of"the"input"and"output"key/value"pairs."The"rst"two"
line = value.toString();
parameters"dene"the"input"key"and"value"types,"the"
for (String
word : line.split("\\W+")) {
if (word.length() > 0) {
second"two"dene"the"output"key"and"value"types."
context.write(new Text(word), new IntWritable(1));
}
}

}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#43%

The"map"Method"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();

The"map"methods"signature"looks"like"this."It"will"be"passed"
for (String word : line.split("\\W+")) {
a"key,"a"value,"and"a"Context"object."The"Context"is"
if (word.length() > 0) {
used"to"write"the"intermediate"data."It"also"contains"
context.write(new Text(word), new IntWritable(1));
informaAon"about"the"jobs"conguraAon"(see"later)."
}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#44%

The"map"Method:"Processing"The"Line"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
value"is"a"Text"object,"so"we"retrieve"the"string"it"contains."
if (word.length()
> 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#45%

The"map"Method:"Processing"The"Line"(contd)"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}

We"split"the"string"up"into"words"using"a"regular"expression"
with"non/alphanumeric"characters"as"the"delimiter,"and"
then"loop"through"the"words."

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#46%

Outpucng"Intermediate"Data"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
To"emit"a"(key,"value)"pair,"we"call"the"write"method"of"our"Context
throws
IOException, InterruptedException {

object."The"key"will"be"the"word"itself,"the"value"will"be"the"number"1."
Recall"that"the"output"key"must"be"a"WritableComparable,"and"the"
for (String word : line.split("\\W+")) {
value"must"be"a"Writable.
if
(word.length() > 0) {
String line = value.toString();

context.write(new Text(word), new IntWritable(1));


}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#47%

Reprise:"The"Map"Method"
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#48%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! Wri*ng%MapReduce%applica*ons%in%Java%
The"driver"
The"Mapper"
The%Reducer%
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#49%

The"Reducer:"Complete"Code"
import
import
import
import

java.io.IOException;
org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>


{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#50%

The"Reducer:"Import"Statements"
import
import
import
import

java.io.IOException;
org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>


{

As"with"the"Mapper,"you"will"typically"import"
@Override
java.io.IOException,"and"the"org.apache.hadoop
public void reduce(Text key, Iterable<IntWritable> values, Context
throws
IOException, InterruptedException {
classes"shown,"in"every"Reducer"you"write."We"will"omit"the"
intimport"statements"in"future"slides"for"brevity.""
wordCount = 0;

context)

for (IntWritable value : values) {


wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#51%

The"Reducer:"Main"Code"
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#52%

The"Reducer:"Main"Code"(contd)"
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
Your"Reducer"class"should"extend"Reducer."The"Reducer
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws
IOException, InterruptedException {
class"expects"four"generics,"which"dene"the"types"of"the"input"
int wordCount
= 0;
and"output"key/value"pairs."The"rst"two"parameters"dene"the"

intermediate"key"and"value"types,"the"second"two"dene"the"
for (IntWritable
value : values) {
wordCount += value.get();
nal"output"key"and"value"types."The"keys"are"
}
}

WritableComparables,"the"values"are"Writables."
context.write(key,
new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#53%

The"reduce"Method"
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;

The"reduce"method"receives"a"key"and"an"Iterable"

for (IntWritable value : values) {


wordCountcollecAon"of"objects"(which"are"the"values"emi>ed"from"the"
+= value.get();
}

Mappers"for"that"key);"it"also"receives"a"Context"object."

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#54%

Processing"The"Values"
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}

We"use"the"Java"for/each"syntax"to"step"through"all"the"elements"
in"the"collecAon."In"our"example,"we"are"merely"adding"all"the"
values"together."We"use"value.get()"to"retrieve"the"actual"
numeric"value."

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#55%

WriAng"The"Final"Output"
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;

Finally,"we"write"the"output"key/value"pair"to"HDFS"using"

for (IntWritable value : values) {


wordCount
+= value.get();
the"write"method"of"our"Context"object."
}

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#56%

Reprise:"The"Reduce"Method"
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}

context.write(key, new IntWritable(wordCount));

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#57%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! Wri*ng%Mappers%and%Reducers%in%other%languages%with%the%Streaming%API%

! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#58%

The"Streaming"API:"MoAvaAon"
!Many%organiza*ons%have%developers%skilled%in%languages%other%than%Java,%
such%as%%
Ruby"
Python"
Perl"
!The%Streaming%API%allows%developers%to%use%any%language%they%wish%to%
write%Mappers%and%Reducers%
As"long"as"the"language"can"read"from"standard"input"and"write"to"
standard"output"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#59%

The"Streaming"API:"Advantages"and"Disadvantages"
!Advantages%of%the%Streaming%API:%
No"need"for"non/Java"coders"to"learn"Java"
Fast"development"Ame"
Ability"to"use"exisAng"code"libraries"
!Disadvantages%of%the%Streaming%API:%
Performance"
Primarily"suited"for"handling"data"that"can"be"represented"as"text"
Streaming"jobs"can"use"excessive"amounts"of"RAM"or"fork"excessive"
numbers"of"processes"
Although"Mappers"and"Reducers"can"be"wri>en"using"the"Streaming"
API,"ParAAoners,"InputFormats"etc."must"sAll"be"wri>en"in"Java"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#60%

How"Streaming"Works"
!To%implement%streaming,%write%separate%Mapper%and%Reducer%programs%in%
the%language%of%your%choice%
They"will"receive"input"via"stdin"
They"should"write"their"output"to"stdout"
!If%TextInputFormat%(the%default)%is%used,%the%streaming%Mapper%just%
receives%each%line%from%the%le%on%stdin%
No"key"is"passed"
!Streaming%Mapper%and%streaming%Reducers%output%should%be%sent%to%
stdout%as%key%(tab)%value%(newline)%
!Separators%other%than%tab%can%be%specied%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#61%

Streaming:"Example"Mapper"
!Example%streaming%wordcount%Mapper:%
#!/usr/bin/env perl
while (<>) {
chomp;
(@words) = split /\s+/;
foreach $w (@words) {
print "$w\t1\n";
}
}

#
#
#
#
#

Read lines from stdin


Get rid of the trailing newline
Create an array of words
Loop through the array
Print out the key and value

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#62%

Streaming"Reducers:"CauAon"
!Recall%that%in%Java,%all%the%values%associated%with%a%key%are%passed%to%the%
Reducer%as%an%Iterable
!Using%Hadoop%Streaming,%the%Reducer%receives%its%input%as%(key,%value)%
pairs%
One"per"line"of"standard"input"
!Your%code%will%have%to%keep%track%of%the%key%so%that%it%can%detect%when%
values%from%a%new%key%start%appearing%
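The key-tracking logic can be sketched as follows. It is shown in Java for consistency with the rest of the chapter, although a streaming Reducer would normally be a script; the class name and helper method are illustrative, and the input is assumed to arrive sorted by key, as the shuffle and sort phase guarantees.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a streaming-style Reducer: input arrives as
// sorted "key<TAB>value" lines, and we must notice when the key changes.
public class StreamingSumReducer {

    public static List<String> reduce(List<String> lines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (currentKey != null && !currentKey.equals(kv[0])) {
                out.add(currentKey + "\t" + sum);  // key changed: emit the total
                sum = 0;
            }
            currentKey = kv[0];
            sum += Integer.parseInt(kv[1]);
        }
        if (currentKey != null) {
            out.add(currentKey + "\t" + sum);      // don't forget the last key
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("cat\t1", "cat\t1", "dog\t1", "dog\t1", "dog\t1");
        System.out.println(reduce(input)); // cat -> 2, dog -> 3
    }
}
```

The final emit after the loop is the classic pitfall: forgetting it silently drops the counts for the last key in the input.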

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#63%

Launching"a"Streaming"Job"
!To%launch%a%Streaming%job,%use%e.g.,:%
%

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/ \


streaming/hadoop-streaming*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapScript.pl \
-reducer myReduceScript.pl \
-file myMapScript.pl \
-file myReduceScript.pl

!Many%other%command#line%op*ons%are%available%
See"the"documentaAon"for"full"details"
!Note%that%system%commands%can%be%used%as%a%Streaming%Mapper%or%
Reducer%
For"example:"awk,"grep,"sed,"or"wc"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#64%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding%up%Hadoop%development%by%using%Eclipse%
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#65%

Integrated"Development"Environments"
!There%are%many%Integrated%Development%Environments%(IDEs)%available%
!Eclipse%is%one%such%IDE%
Open"source"
Very"popular"among"Java"developers"
Has"plug/ins"to"speed"development"in"several"dierent"languages"
!If%you%would%prefer%to%write%your%code%this%week%using%a%terminal#based%
editor%such%as%vi,%we%certainly%wont%stop%you!%
But"using"Eclipse"can"dramaAcally"speed"up"your"development"process"
!On%the%next%few%slides%we%will%demonstrate%how%to%use%Eclipse%to%write%a%
MapReduce%program%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#66%

StarAng"Eclipse"
!Double#click%the%Eclipse%
icon%on%the%Desktop%to%%
launch%Eclipse%
!Import%pre#built%projects%
for%all%Java%API%hands#on%%
exercises%in%this%course%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#67%

LocaAng"Source"Code"
!In%Package%Explorer,%expand%%
the%project%you%want%to%work%%
with%
!Locate%the%class%you%want%to%%
edit%
!Double#click%the%class%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#68%

Specifying"the"Java"Build"Path"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#69%

EdiAng"Source"Code"
!Edit%the%class%in%the%right%window%pane%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#70%

Accessing"the"Javadoc"
!If%you%have%network%access,%you%can%select%an%element%and%click%Shii%+%F2%
to%access%the%elements%full%Javadoc%in%a%browser%
!Or,%simply%hover%your%mouse%over%an%element%for%which%you%want%to%
access%the%top#level%Javadoc%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#71%

Accessing"the"Hadoop"Source"Code"
!Your%virtual%machine%has%been%provisioned%with%the%Hadoop%source%code%
!Select%a%Hadoop%element%and%click%F3%to%access%the%elements%source%code%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#72%

CreaAng"a"Jar"File"
!When%you%are%%
ready%to%test%your%
code,%right#click%%
the%default%package%
and%choose%Export %%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#73%

CreaAng"a"Jar"File"(contd)"
!Expand%Java,%select%
the%JAR%le%op*on,%%
and%then%click%Next%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#74%

CreaAng"a"Jar"File"(contd)"
!Enter%a%path%and%lename%%
inside%/home/training%%
(your%home%directory),%and%%
click%Finish%
!Your%JAR%le%will%be%saved;%%
you%can%now%run%it%from%the%%
command%line%with%the%%
standard%hadoop jar...%%
command%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#75%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands#On%Exercise:%Wri*ng%a%MapReduce%Program%
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#76%

Hands/On"Exercise:"WriAng"A"MapReduce"Program"
!In%this%Hands#On%Exercise,%you%will%write%a%MapReduce%program%using%
either%Java%or%Hadoops%Streaming%interface%
!Please%refer%to%the%Hands#On%Exercise%Manual%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#77%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences%between%the%Old%and%New%MapReduce%APIs%
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#78%

What"Is"The"Old"API?"
!When%Hadoop%0.20%was%released,%a%New%API%was%introduced%
Designed"to"make"the"API"easier"to"evolve"in"the"future"
Favors"abstract"classes"over"interfaces"
!Some%developers%s*ll%use%the%Old%API%
UnAl"CDH4,"the"New"API"was"not"absolutely"feature/complete"
!All%the%code%examples%in%this%course%use%the%New%API%
Old"API/based"soluAons"for"many"of"the"Hands/On"Exercises"for"this"
course"are"available"in"the"sample_solutions_oldapi"directory"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#79%

New"API"vs."Old"API:"Some"Key"Dierences"
New API

Old API

import org.apache.hadoop.mapreduce.*

import org.apache.hadoop.mapred.*

Driver code:

Driver code:

Configuration conf = new Configuration();


Job job = new Job(conf);
job.setJarByClass(Driver.class);
job.setSomeProperty(...);
...
job.waitForCompletion(true);

JobConf conf = new JobConf(conf,


Driver.class);
conf.setSomeProperty(...);
...
JobClient.runJob(conf);

Mapper:

Mapper:

public class MyMapper extends Mapper {

public class MyMapper extends MapReduceBase


implements Mapper {

public void map(Keytype k, Valuetype v,


Context c) {
...
c.write(key, val);
}

public void map(Keytype k, Valuetype v,


OutputCollector o, Reporter r) {
...
o.collect(key, val);
}

}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#80%

New"API"vs."Old"API:"Some"Key"Dierences"(contd)"
New API

Old API

Reducer:

Reducer:

public class MyReducer extends Reducer {

public class MyReducer extends MapReduceBase


implements Reducer {

public void reduce(Keytype k,


Iterable<Valuetype> v, Context c) {
for(Valuetype v : eachval) {
// process eachval
c.write(key, val);
}
}

public void reduce(Keytype k,


Iterator<Valuetype> v,
OutputCollector o, Reporter r) {
while(v.hasnext()) {
// process v.next()
o.collect(key, val);
}
}

}
}
setup(Context c) (See later)

configure(JobConf job)

cleanup(Context c) (See later)

close()

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#81%

MRv1"vs"MRv2,"Old"API"vs"New"API"
!There%is%a%lot%of%confusion%about%the%New%and%Old%APIs,%and%MapReduce%
version%1%and%MapReduce%version%2%
!The%chart%below%should%clarify%what%is%available%with%each%version%of%
MapReduce%
Old%API%

New%API%

MapReduce%v1%

MapReduce%v2%

!Summary:%Code%using%either%the%Old%API%or%the%New%API%will%run%under%
MRv1%and%MRv2%
You"will"have"to"recompile"the"code"to"move"from"MR1"to"MR2,"but"you"
will"not"have"to"change"the"code"itself"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#82%

Chapter"Topics"
Wri*ng%a%MapReduce%Program%

Basic%Programming%with%the%%
Hadoop%Core%API%

! The"MapReduce"ow"
! Basic"MapReduce"API"concepts"
! WriAng"MapReduce"applicaAons"in"Java"
The"driver"
The"Mapper"
The"Reducer"
! WriAng"Mappers"and"Reducers"in"other"languages"with"the"Streaming"API"
! Speeding"up"Hadoop"development"by"using"Eclipse"
! Hands/On"Exercise:"WriAng"a"MapReduce"Program"
! Dierences"between"the"Old"and"New"MapReduce"APIs"
! Conclusion%
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#83%

Conclusion
In this chapter you have learned
! The MapReduce flow
! Basic MapReduce API concepts
! How to write MapReduce drivers, Mappers, and Reducers in Java
! How to write Mappers and Reducers in other languages using the Streaming API
! How to speed up your Hadoop development by using Eclipse
! The differences between the old and new MapReduce APIs

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

04#84%

Unit"TesBng"MapReduce"Programs"
Chapter"5"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#1%

Course"Chapters"
Course"IntroducBon"

! "IntroducBon"
! "The"MoBvaBon"for"Hadoop"
! "Hadoop:"Basic"Concepts"
! "WriBng"a"MapReduce"Program"
! %Unit%Tes.ng%MapReduce%Programs%
! "Delving"Deeper"into"the"Hadoop"API"
! "PracBcal"Development"Tips"and"Techniques"
! "Data"Input"and"Output"
! "Common"MapReduce"Algorithms"
! "Joining"Data"Sets"in"MapReduce"Jobs"
! "IntegraBng"Hadoop"into"the"Enterprise"Workow"
! "Machine"Learning"and"Mahout"
! "An"IntroducBon"to"Hive"and"Pig"
! "An"IntroducBon"to"Oozie"
! "Conclusion"
! "Cloudera"Enterprise"
! "Graph"ManipulaBon"in"MapReduce"""

IntroducBon"to"Apache"Hadoop"and"
its"Ecosystem"

Basic%Programming%with%the%
Hadoop%Core%API%

Problem"Solving"with"MapReduce"

The"Hadoop"Ecosystem"

Course"Conclusion"and"Appendices"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#2%

Unit"TesBng"MapReduce"Programs"
In%this%chapter%you%will%learn%
!What%unit%tes.ng%is,%and%why%you%should%write%unit%tests%
!What%the%JUnit%tes.ng%framework%is,%and%how%MRUnit%builds%on%the%JUnit%
framework%
!How%to%write%unit%tests%with%MRUnit%
!How%to%run%unit%tests%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#3%

Chapter"Topics"
Unit%Tes.ng%MapReduce%Programs%%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Unit%tes.ng%
! The"JUnit"and"MRUnit"tesBng"frameworks"
! WriBng"unit"tests"with"MRUnit"
! Running"unit"tests"
! Hands/On"Exercise:"WriBng"Unit"Tests"with"the"MRUnit"Framework"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#4%

An"IntroducBon"to"Unit"TesBng"
!A%unit%is%a%small%piece%of%your%code%
A"small"piece"of"funcBonality"
!A%unit%test%veries%the%correctness%of%that%unit%of%code%
A"purist"might"say"that"in"a"well/wri>en"unit"test,"only"a"single"thing"
should"be"able"to"fail"
Generally"accepted"rule/of/thumb:"a"unit"test"should"take"less"than"a"
second"to"complete"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#5%

Why"Write"Unit"Tests?"
!Unit%tes.ng%provides%verica.on%that%your%code%is%func.oning%correctly%
!Much%faster%than%tes.ng%your%en.re%program%each%.me%you%modify%the%
code%
Fastest"MapReduce"job"on"a"cluster"will"take"many"seconds"
Even"in"pseudo/distributed"mode"
Even"running"in"LocalJobRunner"mode"will"take"several"seconds"
LocalJobRunner"mode"is"discussed"later"in"the"course"
Unit"tests"help"you"iterate"faster"in"your"code"development"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#6%

Chapter"Topics"
Unit%Tes.ng%MapReduce%Programs%%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Unit"tesBng"
! The%JUnit%and%MRUnit%tes.ng%frameworks%
! WriBng"unit"tests"with"MRUnit"
! Running"unit"tests"
! Hands/On"Exercise:"WriBng"Unit"Tests"with"the"MRUnit"Framework"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#7%

Why"MRUnit?"
!JUnit%is%a%very%popular%Java%unit%tes.ng%framework%
!Problem:%JUnit%cannot%be%used%directly%to%test%Mappers%or%Reducers%
Unit"tests"require"mocking"up"classes"in"the"MapReduce"framework"
A"lot"of"work"
!MRUnit%is%built%on%top%of%JUnit%
Works"with"the"mockito"framework"to"provide"required"mock"objects"
!Allows%you%to%test%your%code%from%within%an%IDE%
Much"easier"to"debug"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#8%

JUnit"Basics"
!We%are%using%JUnit%4%in%class%
Earlier"versions"would"also"work"
! @Test
Java"annotaBon"
Indicates"that"this"method"is"a"test"which"JUnit"should"execute"
! @Before
Java"annotaBon"
Tells"JUnit"to"call"this"method"before"every"@Test"method"
Two"@Test"methods"would"result"in"the"@Before"method"being"
called"twice"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#9%

JUnit"Basics"(contd)"
!JUnit%test%methods:%
assertEquals(),"assertNotNull()"etc"
Fail"if"the"condiBons"of"the"statement"are"not"met"
fail(msg)
Fails"the"test"with"the"given"error"message"
!With%a%JUnit%test%open%in%Eclipse,%run%all%tests%in%the%class%by%going%to%%
Run%"%Run%
!Eclipse%also%provides%func.onality%to%run%all%JUnit%tests%in%your%project%
!Other%IDEs%have%similar%func.onality%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#10%

JUnit:"Example"Code"
import static org.junit.Assert.assertEquals;
import org.junit.Before;
import org.junit.Test;
public class JUnitHelloWorld {
protected String s;
@Before
public void setup() {
s = "HELLO WORLD";
}
@Test
public void testHelloWorldSuccess() {
s = s.toLowerCase();
assertEquals("hello world", s);
}
// will fail even if testHelloWorldSuccess is called first
@Test
public void testHelloWorldFail() {
assertEquals("hello world", s);
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#11%

Chapter"Topics"
Unit%Tes.ng%MapReduce%Programs%%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Unit"tesBng"
! The"JUnit"and"MRUnit"tesBng"frameworks"
! Wri.ng%unit%tests%with%MRUnit%
! Running"unit"tests"
! Hands/On"Exercise:"WriBng"Unit"Tests"with"the"MRUnit"Framework"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#12%

Using"MRUnit"to"Test"MapReduce"Code"
!MRUnit%builds%on%top%of%JUnit%
!Provides%a%mock%InputSplit%and%other%classes%
!Can%test%just%the%Mapper,%just%the%Reducer,%or%the%full%MapReduce%ow%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#13%

MRUnit:"Example"Code""Mapper"Unit"Test"
import
import
import
import
import
import

org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mrunit.mapreduce.MapDriver;
org.junit.Before;
org.junit.Test;

public class TestWordCount {


MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#14%

MRUnit:"Example"Code""Mapper"Unit"Test"(contd)"
import
import
import
import
import
import

org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mrunit.mapreduce.MapDriver;
org.junit.Before;
org.junit.Test;

public class TestWordCount {


MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

Import"the"relevant"JUnit"classes"and"the"MRUnit"MapDriver"
@Before
public class"as"we"will"be"wriBng"a"unit"test"for"our"Mapper."
void setUp() {
}

WordMapper mapper = new WordMapper();


mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#15%

MRUnit:"Example"Code""Mapper"Unit"Test"(contd)"
import
import
import
import
import
import

org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mrunit.mapreduce.MapDriver;
org.junit.Before;
org.junit.Test;

public class TestWordCount {


MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
@Before
public MapDriver"is"an"MRUnit"class"(not"a"user/dened"driver)."
void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#16%

MRUnit:"Example"Code""Mapper"Unit"Test"(contd)"
import
import
import
import
import
import

org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mrunit.mapreduce.MapDriver;
org.junit.Before;
org.junit.Test;

public class TestWordCount {


MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
@Before
public void setUp() {
WordMapper mapper = new WordMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
}

@Test
public Set"up"the"test."This"method"will"be"called"before"every"test,"
void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
just"as"with"JUnit." Text("cat"), new IntWritable(1));
mapDriver.withOutput(new
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#17%

MRUnit:"Example"Code""Mapper"Unit"Test"(contd)"
import
import
import
import
import
import

org.apache.hadoop.io.IntWritable;
org.apache.hadoop.io.LongWritable;
org.apache.hadoop.io.Text;
org.apache.hadoop.mrunit.mapreduce.MapDriver;
org.junit.Before;
org.junit.Test;

public class TestWordCount {


MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
@Before
The"test"itself."Note"that"the"order"in"which"the"output"is"
public void
setUp() {
WordMapper mapper = new WordMapper();
specied"is"important""it"must"match"the"order"in"which"
mapDriver
= new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
the"output"will"be"created"by"the"Mapper."
}

@Test
public void testMapper() {
mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
mapDriver.withOutput(new Text("cat"), new IntWritable(1));
mapDriver.withOutput(new Text("dog"), new IntWritable(1));
mapDriver.runTest();
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#18%

MRUnit"Drivers"
!MRUnit%has%a%MapDriver,%a%ReduceDriver,%and%a%
MapReduceDriver
!Methods%to%specify%test%input%and%output:%
withInput
Species"input"to"the"Mapper/Reducer"
Builder"method"that"can"be"chained"
withOutput
Species"expected"output"from"the"Mapper/Reducer"
Builder"method"that"can"be"chained"
addInput
Similar"to"withInput"but"returns"void
addOutput
Similar"to"withOutput"but"returns"void

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#19%

MRUnit"Drivers"(contd)"
!Methods%to%run%tests:%
runTest
Runs"the"test"and"veries"the"output"
run
Runs"the"test"and"returns"the"result"set"
Ignores"previous"withOutput"and"addOutput"calls"
!Drivers%take%a%single%(key,%value)%pair%as%input%
!Can%take%mul.ple%(key,%value)%pairs%as%expected%output%
!If%you%are%calling%driver.runTest()%or%driver.run()%mul.ple%
.mes,%call%driver.resetOutput()%between%each%call%
MRUnit"will"fail"if"you"do"not"do"this"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#20%

MRUnit"Conclusions"
!You%should%write%unit%tests%for%your%code!%
!As%you%are%performing%the%Hands#On%Exercises%in%the%rest%of%the%course%we%
strongly%recommend%that%you%write%unit%tests%as%you%proceed%
This"will"help"greatly"in"debugging"your"code"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#21%

Chapter"Topics"
Unit%Tes.ng%MapReduce%Programs%%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Unit"tesBng"
! The"JUnit"and"MRUnit"tesBng"frameworks"
! WriBng"unit"tests"with"MRUnit"
! Running%unit%tests%
! Hands/On"Exercise:"WriBng"Unit"Tests"with"the"MRUnit"Framework"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#22%

Running"Unit"Tests"From"Eclipse"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#23%

Compiling"and"Running"Unit"Tests"From"the"Command"Line"
[training@localhost sample_solution]$ javac -classpath `hadoop classpath`:
/home/training/lib/mrunit-0.9.0-incubating-hadoop2.jar:. *.java
[training@localhost sample_solution]$ java -cp `hadoop classpath`:/home/
training/lib/mrunit-0.9.0-incubating-hadoop2.jar:. org.junit.runner.JUnitCore
TestWordCount
JUnit version 4.8.2
...
Time: 0.51
OK (3 tests)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#24%

Chapter"Topics"
Unit%Tes.ng%MapReduce%Programs%%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Unit"tesBng"
! The"JUnit"and"MRUnit"tesBng"frameworks"
! WriBng"unit"tests"with"MRUnit"
! Running"unit"tests"
! Hands#On%Exercise:%Wri.ng%Unit%Tests%with%the%MRUnit%Framework%
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#25%

Hands/On"Exercise:"WriBng"Unit"Tests"With"the"MRUnit"
Framework"
!In%this%Hands#On%Exercise,%you%will%gain%prac.ce%crea.ng%unit%tests%
!Please%refer%to%the%Hands#On%Exercise%Manual%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#26%

Chapter"Topics"
Unit%Tes.ng%MapReduce%Programs%%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Unit"tesBng"
! The"JUnit"and"MRUnit"tesBng"frameworks"
! WriBng"unit"tests"with"MRUnit"
! Running"unit"tests"
! Hands/On"Exercise:"WriBng"Unit"Tests"with"the"MRUnit"Framework"
! Conclusion%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#27%

Conclusion
In this chapter you have learned
! What unit testing is, and why you should write unit tests
! What the JUnit testing framework is, and how MRUnit builds on the JUnit framework
! How to write unit tests with MRUnit
! How to run unit tests

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

05#28%

Delving"Deeper"into"the"Hadoop"API"
Chapter"6"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#1%

Course"Chapters"
Course"IntroducEon"

! "IntroducEon"
! "The"MoEvaEon"for"Hadoop"
! "Hadoop:"Basic"Concepts"
! "WriEng"a"MapReduce"Program"
! "Unit"TesEng"MapReduce"Programs"
! %Delving%Deeper%into%the%Hadoop%API%
! "PracEcal"Development"Tips"and"Techniques"
! "Data"Input"and"Output"
! "Common"MapReduce"Algorithms"
! "Joining"Data"Sets"in"MapReduce"Jobs"
! "IntegraEng"Hadoop"into"the"Enterprise"Workow"
! "Machine"Learning"and"Mahout"
! "An"IntroducEon"to"Hive"and"Pig"
! "An"IntroducEon"to"Oozie"
! "Conclusion"
! "Cloudera"Enterprise"
! "Graph"ManipulaEon"in"MapReduce"""

IntroducEon"to"Apache"Hadoop"and"
its"Ecosystem"

Basic%Programming%with%the%
Hadoop%Core%API%

Problem"Solving"with"MapReduce"

The"Hadoop"Ecosystem"

Course"Conclusion"and"Appendices"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#2%

Delving"Deeper"Into"The"Hadoop"API"
In%this%chapter%you%will%learn%
!How%to%use%the%ToolRunner%class%
!How%to%decrease%the%amount%of%intermediate%data%with%Combiners%
!How%to%set%up%and%tear%down%Mappers%and%Reducers%by%using%the%setup%
and%cleanup%methods%
!How%to%write%custom%ParGGoners%for%beHer%load%balancing%
!How%to%access%HDFS%programmaGcally%
!How%to%use%the%distributed%cache%
!How%to%use%the%Hadoop%APIs%library%of%Mappers,%Reducers,%and%
ParGGoners%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#3%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using%the%ToolRunner%class%
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#4%

Why"Use"ToolRunner?"
!You%can%use%ToolRunner%in%MapReduce%driver%classes%
This"is"not"required,"but"is"a"best"pracEce"
! ToolRunner%uses%the%GenericOptionsParser%class%internally%
Allows"you"to"specify"conguraEon"opEons"on"the"command"line"
Also"allows"you"to"specify"items"for"the"Distributed"Cache"on"the"
command"line"(see"later)"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#5%

How"to"Implement"ToolRunner"
!Import%the%relevant%classes%in%your%driver%
%

import
import
import
import

org.apache.hadoop.conf.Configured;
org.apache.hadoop.conf.Configuration;
org.apache.hadoop.util.Tool;
org.apache.hadoop.util.ToolRunner;

!Change%your%driver%class%so%that%it%extends%Configured%and%implements%
Tool
public class WordCount extends Configured implements Tool
{

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#6%

How"to"Implement"ToolRunner"(contd)"
!The%main%method%should%call%ToolRunner.run
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(),
new WordCount(), args);
System.exit(exitCode);
}

!Create%a%run%method%
Congure"and"submit"the"job"in"this"method"
Note"how"the"Job"object"is"created"when"using"ToolRunner"
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
Job.setJarByClass(WordCount.class);
...

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#7%

How"to"Implement"ToolRunner:"Complete"Driver"
public class WordCount extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf(
"Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
return -1;
}
Job job = new Job(getConf());
job.setJarByClass(WordCount.class); job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;

}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(exitCode);
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#8%

ToolRunner"Command"Line"OpEons"
!ToolRunner%allows%the%user%to%specify%conguraGon%opGons%on%the%
command%line%
!Commonly%used%to%specify%Hadoop%properGes%using%the%-D%ag%
Will"override"any"default"or"site"properEes"in"the"conguraEon"
But"will"not"override"those"set"in"the"driver"code"
$ hadoop jar myjar.jar MyDriver \
-D mapreduce.job.reduces=10 myinputdir myoutputdir

!Note%that%-D%opGons%must%appear%before%any%addiGonal%program%
arguments%
!Can%specify%an%XML%conguraGon%le%with%-conf
!Can%specify%the%default%lesystem%with%-fs uri
Shortcut"for"D fs.defaultFS=uri

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#9%

Aside:"Deprecated"ConguraEon"ProperEes""
!In%CDH%4,%a%large%number%of%conguraGon%properGes%were%deprecated%
!The%new%property%names%work%in%CDH%4%but%do#not%work%in%CDH%3%
!All%conguraGon%property%names%shown%in%this%course%are%the%new%
property%names%
The"deprecated"property"names"are"also"provided"for"students"who"are"
sEll"working"with"CDH"3"
!CDH%3%equivalents%for%conguraGon%properGes%on%the%previous%slide%are:%
mapred.reduce.tasks"(for"mapreduce.job.reduces)"
fs.default.name"(for"fs.defaultFS)"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#10%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing%the%amount%of%intermediate%data%with%Combiners%
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#11%

The"Combiner"
!O^en,%Mappers%produce%large%amounts%of%intermediate%data%
That"data"must"be"passed"to"the"Reducers"
This"can"result"in"a"lot"of"network"trac"
!It%is%o^en%possible%to%specify%a%Combiner%
Like"a"mini/Reducer"
Runs"locally"on"a"single"Mappers"output"
Output"from"the"Combiner"is"sent"to"the"Reducers"
Input"and"output"data"types"for"the"Combiner/Reducer"must"be"
idenEcal"
!Combiner%and%Reducer%code%are%o^en%idenGcal%
Technically,"this"is"possible"if"the"operaEon"performed"is"commutaEve"
and"associaEve"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#12%

MapReduce"Example:"Word"Count"
!To%see%how%a%Combiner%works,%lets%revisit%the%WordCount%example%we%
covered%earlier%
map(String input_key, String input_value)
foreach word w in input_value:
emit(w, 1)

reduce(String output_key,
Iterator<int> intermediate_vals)
set count = 0
foreach v in intermediate_vals:
count += v
emit(output_key, count)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#13%

MapReduce"Example:"Word"Count"(contd)"
!Input%to%the%Mapper:%
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

!Output%from%the%Mapper:%
('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#14%

MapReduce"Example:"Word"Count"(contd)"
!Intermediate%data%sent%to%the%Reducer:%
('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

!Final%Reducer%output:%
('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#15%

Word"Count"With"Combiner"
!A%Combiner%would%decrease%the%amount%of%data%sent%to%the%Reducer%
Intermediate"data"sent"to"the"Reducer"ager"a"Combiner"using"the"same"
code"as"the"Reducer:"
('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [2])
('sat', [2])
('sofa', [1])
('the', [4])

!Combiners%decrease%the%amount%of%network%trac%required%during%the%
shue%and%sort%phase%
Ogen"also"decrease"the"amount"of"work"needed"to"be"done"by"the"
Reducer"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#16%

Specifying a Combiner
! To specify the Combiner class to be used in your MapReduce code, put the following line in your Driver:

job.setCombinerClass(YourCombinerClass.class);

! The Combiner uses the same interface as the Reducer
– Takes in a key and a list of values
– Outputs zero or more (key, value) pairs
– The actual method called is the reduce method in the class
! VERY IMPORTANT: The Combiner may run once, or more than once, on the output from any given Mapper
– Do not put code in the Combiner which could influence your results if it runs more than once

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#17%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands#On%Exercise:%WriGng%and%ImplemenGng%a%Combiner%
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#18%

Hands/On"Exercise:"WriEng"and""
ImplemenEng"a"Combiner""
!In%this%Hands#On%Exercise,%you%will%gain%pracGce%wriGng%Combiners%
!Please%refer%to%the%Hands#On%Exercise%Manual%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#19%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Sedng%up%and%tearing%down%Mappers%and%Reducers%using%the%setup%and%
cleanup%methods%
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#20%

The"setup"Method"
!It%is%common%to%want%your%Mapper%or%Reducer%to%execute%some%code%
before%the%map%or%reduce%method%is%called%
IniEalize"data"structures"
Read"data"from"an"external"le"
Set"parameters"
!The%setup%method%is%run%before%the%map%or%reduce%method%is%called%for%
the%rst%Gme%
public void setup(Context context)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#21%

The"cleanup"Method"
!Similarly,%you%may%wish%to%perform%some%acGon(s)%a^er%all%the%records%
have%been%processed%by%your%Mapper%or%Reducer%
!The%cleanup%method%is%called%before%the%Mapper%or%Reducer%terminates%
public void cleanup(Context context) throws
IOException, InterruptedException
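The call order — setup once before the first record, the per-record method for each record, cleanup once at the end — can be sketched without Hadoop. The class and method names below are stand-ins, not real framework code:

```java
import java.util.*;

// Sketch of the Mapper/Reducer task lifecycle: setup() runs once before
// the first record, map() runs once per record, cleanup() runs once at
// the end. Plain Java; in Hadoop the framework drives these calls.
public class LifecycleSketch {
    public static List<String> calls = new ArrayList<>();

    static void setup()            { calls.add("setup"); }
    static void map(String record) { calls.add("map:" + record); }
    static void cleanup()          { calls.add("cleanup"); }

    // How the framework drives a single task attempt.
    public static void runTask(List<String> records) {
        calls.clear();
        setup();
        for (String r : records) map(r);
        cleanup();
    }

    public static void main(String[] args) {
        runTask(Arrays.asList("cat", "dog"));
        System.out.println(calls);  // [setup, map:cat, map:dog, cleanup]
    }
}
```

This is why setup is the right place to read parameters or open external files, and cleanup the right place to flush or release them.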

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#22%

Passing"Parameters:"The"Wrong"Way!"
public class MyClass {
private static int param;
...
private static class MyMapper extends Mapper ... {
public void map... {
int v = param;
}
}
...
public static void main(String[] args) throws IOException {
Job job = new Job();
param = 5;
...
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#23%

Passing"Parameters:"The"Right"Way"
public class MyClass {
private static class MyMapper extends Mapper ... {
public void setup(Context context) {
Configuration conf = context.getConfiguration();
int v = conf.getInt("param", 0);
...
}
public void map...
}
public static void main(String[] args) throws IOException {
Configuration conf = new Configuration();
conf.setInt ("param",5);
Job job = new Job(conf);
...
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#24%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriGng%custom%ParGGoners%for%beHer%load%balancing%
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#25%

What"Does"The"ParEEoner"Do?"
!The%ParGGoner%divides%up%the%keyspace%
Controls"which"Reducer"each"intermediate"key"and"its"associated"values"
goes"to"
!O^en,%the%default%behavior%is%ne%
Default"is"the"HashPartitioner
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
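The HashPartitioner arithmetic can be checked with plain String keys, since nothing in it is Hadoop-specific: masking with Integer.MAX_VALUE clears the sign bit, and the modulo keeps the result in range.

```java
// The HashPartitioner calculation applied to plain Java objects:
// mask off the sign bit so the hash is non-negative, then take it
// modulo the number of Reducers.
public class HashPartitionDemo {
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition...
        System.out.println(getPartition("the", 4) == getPartition("the", 4));
        // ...and every result is in the range 0 .. numReduceTasks-1.
        for (String w : new String[] {"cat", "sat", "aardvark"}) {
            System.out.println(w + " -> partition " + getPartition(w, 4));
        }
    }
}
```

Determinism is the important property here: every occurrence of a given intermediate key, from every Mapper, lands on the same Reducer.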

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#26%

Custom"ParEEoners"
!SomeGmes%you%will%need%to%write%your%own%ParGGoner%
!Example:%your%key%is%a%custom%WritableComparable%which%contains%a%
pair%of%values%(a, b)
You"may"decide"that"all"keys"with"the"same"value"for"a"need"to"go"to"
the"same"Reducer"
The"default"ParEEoner"is"not"sucient"in"this"case"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#27%

Custom"ParEEoners"(contd)"
!Custom%ParGGoners%are%needed%when%performing%a%secondary%sort%(see%
later)%
!Custom%ParGGoners%are%also%useful%to%avoid%potenGal%performance%issues%
To"avoid"one"Reducer"having"to"deal"with"many"very"large"lists"of"values"
Example:"in"our"word"count"job,"we"wouldn't"want"a"single"Reducer"
dealing"with"all"the"three/"and"four/le>er"words,"while"another"only"had"
to"handle"10/"and"11/le>er"words"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#28%

Creating a Custom Partitioner
! To create a custom Partitioner:
  1. Create a class for the Partitioner
     - Should extend Partitioner
  2. Create a method in the class called getPartition
     - Receives the key, the value, and the number of Reducers
     - Should return an int between 0 and one less than the number of Reducers
       - e.g., if it is told there are 10 Reducers, it should return an int between 0 and 9
  3. Specify the custom Partitioner in your driver code:

job.setPartitionerClass(MyPartitioner.class);
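Putting the steps together, here is a minimal sketch of the partitioning logic for the (a, b) example from the previous slide. It is plain Java with the composite key simplified to a comma-separated String, so the Hadoop Partitioner base class and Writable types are omitted; the class name and key layout are hypothetical:

```java
// Hypothetical sketch: partition a composite "a,b" key on its 'a' half only,
// so every key sharing the same 'a' goes to the same Reducer.
public class PairPartitionDemo {
    public static int getPartition(String compositeKey, int numReduceTasks) {
        // Look only at the first field of the key when partitioning
        String a = compositeKey.split(",")[0];
        return (a.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Same 'a' value means the same partition, regardless of 'b'
        System.out.println(getPartition("user42,2013-01-01", 8));
        System.out.println(getPartition("user42,2013-06-30", 8));
    }
}
```

In a real job, the same arithmetic would live inside a class extending Partitioner, registered with job.setPartitionerClass() as shown above.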

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#29%

Aside:"Se[ng"up"Variables"for"your"ParEEoner"
!If%you%need%to%set%up%variables%for%use%in%your%parGGoner,%it%should%
implement%Configurable
!Example:%
class MyPartitioner extends Partitioner<K, V> implements Configurable {
private Configuration configuration;
// Define your own variables here
@Override
public void setConf(Configuration configuration) {
this.configuration = configuration;
// Set up your variables here
}
@Override
public Configuration getConf() {
return configuration;
}
...
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#30%

Aside:"Se[ng"up"Variables"for"your"ParEEoner"(contd)"
!If%a%Hadoop%object%implements%Configurable,%its%setConf()%method%
will%be%called%once,%when%it%is%instanGated%
!You%can%therefore%set%up%variables%in%the%setConf()%method%which%your%
getPartition()%method%will%then%be%able%to%access%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#31%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands#On%Exercise:%WriGng%a%ParGGoner%
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#32%

Hands-On Exercise: Writing a Partitioner
! In this Hands-On Exercise, you will write code which uses a Partitioner and multiple Reducers
! Please refer to the Hands-On Exercise Manual

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#33%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing%HDFS%programaGcally%
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#34%

Accessing HDFS Programmatically
! In addition to using the command-line shell, you can access HDFS programmatically
  - Useful if your code needs to read or write "side data" in addition to the standard MapReduce inputs and outputs
  - Or for programs outside of Hadoop which need to read the results of MapReduce jobs
! Beware: HDFS is not a general-purpose filesystem!
  - Files cannot be modified once they have been written, for example
! Hadoop provides the FileSystem abstract base class
  - Provides an API to generic filesystems
    - Could be HDFS
    - Could be your local filesystem
    - Could even be, for example, Amazon S3

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#35%

The"FileSystem"API"
!In%order%to%use%the%FileSystem%API,%retrieve%an%instance%of%it%
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

!The%conf%object%has%read%in%the%Hadoop%conguraGon%les,%and%therefore%
knows%the%address%of%the%NameNode%etc.%
!A%le%in%HDFS%is%represented%by%a%Path%object%
Path p = new Path("/path/to/my/file");

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#36%

The"FileSystem"API"(contd)"
!Some%useful%API%methods:%
FSDataOutputStream create(...)
Extends"java.io.DataOutputStream
Provides"methods"for"wriEng"primiEves,"raw"bytes"etc"
FSDataInputStream open(...)
Extends"java.io.DataInputStream
Provides"methods"for"reading"primiEves,"raw"bytes"etc
boolean delete(...)
boolean mkdirs(...)
void copyFromLocalFile(...)
void copyToLocalFile(...)
FileStatus[] listStatus(...)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#37%

The"FileSystem"API:"Directory"LisEng"
!Get%a%directory%lisGng:%
Path p = new Path("/my/path");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileStats = fs.listStatus(p);
for (int i = 0; i < fileStats.length; i++) {
Path f = fileStats[i].getPath();
// do something interesting
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#38%

The"FileSystem"API:"WriEng"Data"
!Write%data%to%a%le%
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("/my/path/foo");
FSDataOutputStream out = fs.create(path, false);
// write some raw bytes
out.write(getBytes());
// write an int
out.writeInt(getInt());
...
out.close();
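Because FSDataOutputStream and FSDataInputStream extend java.io.DataOutputStream and java.io.DataInputStream, the same primitive read/write pattern can be exercised without a cluster. This self-contained sketch (class name invented for illustration) round-trips a value through in-memory streams in place of HDFS:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// The calls on out/in below are the same ones you would make on
// FSDataOutputStream/FSDataInputStream; only the underlying storage differs.
public class DataStreamDemo {
    public static int roundTrip(int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(value);   // write a primitive, as on FSDataOutputStream
            out.close();

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()));
            int result = in.readInt();  // read it back, as on FSDataInputStream
            in.close();
            return result;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));  // prints 42
    }
}
```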

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#39%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using%the%Distributed%Cache%
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#40%

The"Distributed"Cache:"MoEvaEon"
!A%common%requirement%is%for%a%Mapper%or%Reducer%to%need%access%to%
some%side%data%
Lookup"tables"
DicEonaries"
Standard"conguraEon"values"
!One%opGon:%read%directly%from%HDFS%in%the%setup%method%
Works,"but"is"not"scalable"
!The%Distributed%Cache%provides%an%API%to%push%data%to%all%slave%nodes%
Transfer"happens"behind"the"scenes"before"any"task"is"executed"
Note:"Distributed"Cache"is"read/only"
Files"in"the"Distributed"Cache"are"automaEcally"deleted"from"slave"
nodes"when"the"job"nishes"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#41%

Using"the"Distributed"Cache:"The"Dicult"Way"
!Place%the%les%into%HDFS%
!Congure%the%Distributed%Cache%in%your%driver%code%
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"),conf);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"),conf);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz",conf));

.jar"les"added"with"addFileToClassPath"will"be"added"to"your"
Mapper"or"Reducers"classpath"
Files"added"with"addCacheArchive"will"automaEcally"be"
dearchived/decompressed"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#42%

Using"the"DistributedCache:"The"Easy"Way"
!If%you%are%using%ToolRunner,%you%can%add%les%to%the%Distributed%Cache%
directly%from%the%command%line%when%you%run%the%job%
No"need"to"copy"the"les"to"HDFS"rst"
!Use%the%-files%opGon%to%add%les%
hadoop jar myjar.jar MyDriver -files file1, file2, file3, ...

!The%-archives%ag%adds%archived%les,%and%automaGcally%unarchives%
them%on%the%desGnaGon%machines%
!The%-libjars%ag%adds%jar%les%to%the%classpath%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#43%

Accessing Files in the Distributed Cache
! Files added to the Distributed Cache are made available in your task's local working directory
  - Access them from your Mapper or Reducer the way you would read any ordinary local file:

File f = new File("file_name_here");

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#44%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using%the%Hadoop%APIs%library%of%Mappers,%Reducers%and%ParGGoners%
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#45%

Reusable Classes for the New API
! The org.apache.hadoop.mapreduce.lib.* packages contain a library of Mappers, Reducers, and Partitioners supporting the new API
! Example classes:
  - InverseMapper: swaps keys and values
  - RegexMapper: extracts text based on a regular expression
  - IntSumReducer, LongSumReducer: add up all values for a key
  - TotalOrderPartitioner: reads a previously-created partition file and partitions based on the data from that file
    - Sample the data first to create the partition file
    - Allows you to partition your data into n partitions without hard-coding the partitioning information
! Refer to the Javadoc for classes available in your version of CDH
  - Available classes vary greatly from version to version

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#46%

Chapter"Topics"
Delving%Deeper%into%the%Hadoop%API%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Using"the"ToolRunner"class"
! Decreasing"the"amount"of"intermediate"data"with"Combiners"
! Hands/On"Exercise:"WriEng"and"ImplemenEng"a"Combiner"
! Se[ng"up"and"tearing"down"Mappers"and"Reducers"using"the"setup"and"
cleanup"methods"
! WriEng"custom"ParEEoners"for"be>er"load"balancing"
! Hands/On"Exercise:"WriEng"a"ParEEoner"
! Accessing"HDFS"programaEcally"
! Using"the"Distributed"Cache"
! Using"the"Hadoop"APIs"library"of"Mappers,"Reducers"and"ParEEoners"
! Conclusion%
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#47%

Conclusion
In this chapter you have learned
! How to use the ToolRunner class
! How to decrease the amount of intermediate data with Combiners
! How to set up and tear down Mappers and Reducers by using the setup and cleanup methods
! How to write custom Partitioners for better load balancing
! How to access HDFS programmatically
! How to use the Distributed Cache
! How to use the Hadoop API's library of Mappers, Reducers, and Partitioners

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

06#48%

Practical Development Tips and Techniques
Chapter 7

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#1%

Course"Chapters"
Course"IntroducAon"

! "IntroducAon"
! "The"MoAvaAon"for"Hadoop"
! "Hadoop:"Basic"Concepts"
! "WriAng"a"MapReduce"Program"
! "Unit"TesAng"MapReduce"Programs"
! "Delving"Deeper"into"the"Hadoop"API"
! %Prac+cal%Development%Tips%and%Techniques%
! "Data"Input"and"Output"
! "Common"MapReduce"Algorithms"
! "Joining"Data"Sets"in"MapReduce"Jobs"
! "IntegraAng"Hadoop"into"the"Enterprise"Workow"
! "Machine"Learning"and"Mahout"
! "An"IntroducAon"to"Hive"and"Pig"
! "An"IntroducAon"to"Oozie"
! "Conclusion"
! "Cloudera"Enterprise"
! "Graph"ManipulaAon"in"MapReduce"""

IntroducAon"to"Apache"Hadoop"and"
its"Ecosystem"

Basic%Programming%with%the%
Hadoop%Core%API%

Problem"Solving"with"MapReduce"

The"Hadoop"Ecosystem"

Course"Conclusion"and"Appendices"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#2%

Practical Development Tips and Techniques
In this chapter you will learn
! Strategies for debugging MapReduce code
! How to test MapReduce code locally by using LocalJobRunner
! How to write and view log files
! How to retrieve job information with counters
! How to determine the optimal number of Reducers for a job
! Why reusing objects is a best practice
! How to create Map-only MapReduce jobs

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#3%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies%for%debugging%MapReduce%code%
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#4%

Introduction to Debugging
! Debugging MapReduce code is difficult!
  - Each instance of a Mapper runs as a separate task
    - Often on a different machine
  - Difficult to attach a debugger to the process
  - Difficult to catch edge cases
! Very large volumes of data mean that unexpected input is likely to appear
  - Code which expects all data to be well-formed is likely to fail

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#5%

Common-Sense Debugging Tips
! Code defensively
  - Ensure that input data is in the expected format
  - Expect things to go wrong
  - Catch exceptions
! Start small, build incrementally
! Make as much of your code as possible Hadoop-agnostic
  - Makes it easier to test
! Write unit tests
! Test locally whenever possible
  - With small amounts of data
! Then test in pseudo-distributed mode
! Finally, test on the cluster
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#6%

Testing Strategies
! When testing in pseudo-distributed mode, ensure that you are testing with a similar environment to that on the real cluster
  - Same amount of RAM allocated to the task JVMs
  - Same version of Hadoop
  - Same version of Java
  - Same versions of third-party libraries

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#7%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! Tes+ng%MapReduce%code%locally%using%LocalJobRunner%
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#8%

Testing Locally
! Hadoop can run MapReduce in a single, local process
  - Does not require any Hadoop daemons to be running
  - Uses the local filesystem instead of HDFS
  - Known as LocalJobRunner mode
! This is a very useful way of quickly testing incremental changes to code

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#9%

Testing Locally (cont'd)
! To run in LocalJobRunner mode, add the following lines to the driver code:

Configuration conf = new Configuration();
conf.set("mapreduce.jobtracker.address", "local");
conf.set("fs.defaultFS", "file:///");

  - CDH3: mapred.job.tracker, fs.default.name
  - Or set these options on the command line with the -D flag
    - If your code is using ToolRunner
! Some limitations of LocalJobRunner mode:
  - Distributed Cache does not work
  - The job can only specify a single Reducer
  - Some beginner mistakes may not be caught
    - For example, attempting to share data between Mappers will "work", because the code is running in a single JVM
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#10%

LocalJobRunner Mode in Eclipse
! The installation of Eclipse on your VMs is configured to run Hadoop code in LocalJobRunner mode
  - From within the IDE
! This allows rapid development iterations
  - "Agile" programming

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#11%

LocalJobRunner Mode in Eclipse (cont'd)
! Specify a Run Configuration

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#12%

LocalJobRunner Mode in Eclipse (cont'd)
! Select Java Application, then select the "New" button
! Verify that the Project and Main Class fields are pre-filled correctly

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#13%

LocalJobRunner Mode in Eclipse (cont'd)
! Specify values in the Arguments tab
  - Local input and output files
  - Any configuration options needed when your job runs
! Define breakpoints if desired
! Execute the application in run mode or debug mode

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#14%

LocalJobRunner Mode in Eclipse (cont'd)
! Review output in the Eclipse console window

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#15%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! Wri+ng%and%viewing%log%les%
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#16%

Before"Logging:"stdout"and"stderr
!Tried#and#true%debugging%technique:%write%to%stdout%or%stderr
!If%running%in%LocalJobRunner%mode,%you%will%see%the%results%of%
System.err.println()
!If%running%on%a%cluster,%that%output%will%not%appear%on%your%console%
Output"is"visible"via"Hadoops"Web"UI"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#17%

Aside:"The"Hadoop"Web"UI"
!All%Hadoop%daemons%contain%a%Web%server%
Exposes"informaAon"on"a"well/known"port"
!Most%important%for%developers%is%the%JobTracker%Web%UI%
http://<job_tracker_address>:50030/
http://localhost:50030/"if"running"in"pseudo/distributed"mode"
!Also%useful:%the%NameNode%Web%UI%
http://<name_node_address>:50070/

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#18%

Aside:"The"Hadoop"Web"UI"(contd)"
!Your%instructor%will%now%demonstrate%the%JobTracker%UI

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#19%

Logging: Better Than Printing
! println statements rapidly become awkward
  - Turning them on and off in your code is tedious, and leads to errors
! Logging provides much finer-grained control over:
  - What gets logged
  - When something gets logged
  - How something is logged

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#20%

Logging"With"log4j
!Hadoop%uses%log4j%to%generate%all%its%log%les%
!Your%Mappers%and%Reducers%can%also%use%log4j
All"the"iniAalizaAon"is"handled"for"you"by"Hadoop"
!Add%the%log4j.jar-<version>%le%from%your%CDH%distribu+on%to%
your%classpath%when%you%reference%the%log4j%classes%
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
class FooMapper implements Mapper {
private static final Logger LOGGER =
Logger.getLogger (FooMapper.class.getName());
...
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#21%

Logging"With"log4j"(contd)"
!Simply%send%strings%to%loggers%tagged%with%severity%levels:%

LOGGER.trace("message");
LOGGER.debug("message");
LOGGER.info("message");
LOGGER.warn("message");
LOGGER.error("message);

!Beware%expensive%opera+ons%like%concatena+on%
To"avoid"performance"penalty,"make"it"condiAonal"like"this:
if (LOGGER.isDebugEnabled()) {
LOGGER.debug("Account info:" + acct.getReport());
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#22%

log4j Configuration
! Configuration for log4j is stored in /etc/hadoop/conf/log4j.properties
! Can change global log settings with the hadoop.root.logger property
! Can override log level on a per-class basis:

log4j.logger.org.apache.hadoop.mapred.JobTracker=WARN
log4j.logger.com.mycompany.myproject.FooMapper=DEBUG

! Programmatically:

LOGGER.setLevel(Level.WARN);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#23%

Dynamically Setting Log Levels
! Although log levels can be set in log4j.properties, this would require modification of files on all slave nodes
  - In practice, this is unrealistic
! Instead, a good solution is to set the log level in your code based on a command-line parameter

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#24%

Dynamically Setting Log Levels (cont'd)
! In the code for your Mapper or Reducer:

public void setup(Context context) {
  Configuration conf = context.getConfiguration();
  if ("DEBUG".equals(conf.get("com.cloudera.job.logging"))) {
    LOGGER.setLevel(Level.DEBUG);
    LOGGER.debug("** Log Level set to DEBUG **");
  }
}

! Then on the command line, specify the log level:

$ hadoop jar wc.jar WordCountWTool \
    -D com.cloudera.job.logging=DEBUG indir outdir

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#25%

Where"Are"Log"Files"Stored?"
!Log%les%are%stored%by%default%at%
%/var/log/hadoop-0.20-mapreduce/
userlogs/${task.id}/syslog%
on%the%machine%where%the%task%a^empt%ran%
Congurable"
!Tedious%to%have%to%ssh%in%to%a%node%to%view%its%logs%
Much"easier"to"use"the"JobTracker"Web"UI"
AutomaAcally"retrieves"and"displays"the"log"les"for"you"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#26%

Restricting Log Output
! If you suspect the input data of being faulty, you may be tempted to log the (key, value) pairs your Mapper receives
  - Reasonable for small amounts of input data
  - Caution! If your job runs across 500GB of input data, you could be writing up to 500GB of log files!
  - Remember to think at scale
! Instead, wrap vulnerable sections of code in try {...} blocks
  - Write logs in the catch {...} block
  - This way only critical data is logged

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#27%

Aside:"Throwing"ExcepAons"
!You%can%throw%excep+ons%if%a%par+cular%condi+on%is%met%
For"example,"if"illegal"data"is"found"
"throw new RuntimeException("Your message here");
!Usually%not%a%good%idea%
ExcepAon"causes"the"task"to"fail"
If"a"task"fails"four"Ames,"the"enAre"job"will"fail"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#28%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving%job%informa+on%with%Counters%
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#29%

What"Are"Counters?"
!Counters%provide%a%way%for%Mappers%or%Reducers%to%pass%aggregate%values%
back%to%the%driver%ader%the%job%has%completed%
Their"values"are"also"visible"from"the"JobTrackers"Web"UI"
And"are"reported"on"the"console"when"the"job"ends"
!Very%basic:%just%have%a%name%and%a%value%
Value"can"be"incremented"within"the"code"
!Counters%are%collected%into%Groups%
Within"the"group,"each"Counter"has"a"name"
!Example:%A%group%of%Counters%called%RecordType
Names:"TypeA,"TypeB,"TypeC
Appropriate"Counter"will"be"incremented"as"each"record"is"read"in"the"
Mapper"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#30%

What"Are"Counters?"(contd)"
!Counters%provide%a%way%for%Mappers%or%Reducers%to%pass%aggregate%values%
back%to%the%driver%ader%the%job%has%completed%
Their"values"are"also"visible"from"the"JobTrackers"Web"UI"
!Counters%can%be%set%and%incremented%via%the%method%
context.getCounter(group, name).increment(amount);

!Example:%
context.getCounter("RecordType","A").increment(1);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#31%

Retrieving Counters in the Driver Code
! To retrieve Counters in the driver code after the job is complete, use code like this in the driver:

long typeARecords =
    job.getCounters().findCounter("RecordType", "A").getValue();
long typeBRecords =
    job.getCounters().findCounter("RecordType", "B").getValue();

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#32%

Counters: Caution
! Do not rely on a counter's value from the Web UI while a job is running
  - Due to possible speculative execution, a counter's value could appear larger than the actual final value
  - Modifications to counters from subsequently killed/failed tasks will be removed from the final count

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#33%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining%the%op+mal%number%of%Reducers%for%a%job%
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#34%

How"Many"Reducers"Do"You"Need?"
!An%important%considera+on%when%crea+ng%your%job%is%to%determine%the%
number%of%Reducers%specied%
!Default%is%a%single%Reducer%
!With%a%single%Reducer,%one%task%receives%all%keys%in%sorted%order%
This"is"someAmes"advantageous"if"the"output"must"be"in"completely"
sorted"order"
Can"cause"signicant"problems"if"there"is"a"large"amount"of"
intermediate"data"
Node"on"which"the"Reducer"is"running"may"not"have"enough"disk"
space"to"hold"all"intermediate"data"
The"Reducer"will"take"a"long"Ame"to"run"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#35%

Jobs"Which"Require"a"Single"Reducer"
!If%a%job%needs%to%output%a%le%where%all%keys%are%listed%in%sorted%order,%a%
single%Reducer%must%be%used%
!Alterna+vely,%the%TotalOrderPar++oner%can%be%used%
Uses"an"externally"generated"le"which"contains"informaAon"about"
intermediate"key"distribuAon"
ParAAons"data"such"that"all"keys"which"go"to"the"rst"Reducer"are"
smaller"than"any"which"go"to"the"second,"etc"
In"this"way,"mulAple"Reducers"can"be"used"
ConcatenaAng"the"Reducers"output"les"results"in"a"totally"ordered"list"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#36%

Jobs"Which"Require"a"Fixed"Number"of"Reducers"
!Some%jobs%will%require%a%specic%number%of%Reducers%
!Example:%a%job%must%output%one%le%per%day%of%the%week%
Key"will"be"the"weekday"
Seven"Reducers"will"be"specied"
A"ParAAoner"will"be"wri>en"which"sends"one"key"to"each"Reducer"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#37%

Jobs"With"a"Variable"Number"of"Reducers"
!Many%jobs%can%be%run%with%a%variable%number%of%Reducers%
!Developer%must%decide%how%many%to%specify%
Each"Reducer"should"get"a"reasonable"amount"of"intermediate"data,"but"
not"too"much"
Chicken/and/egg"problem"
!Typical%way%to%determine%how%many%Reducers%to%specify:%
Test"the"job"with"a"relaAvely"small"test"data"set"
Extrapolate"to"calculate"the"amount"of"intermediate"data"expected"
from"the"real"input"data"
Use"that"to"calculate"the"number"of"Reducers"which"should"be"specied"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#38%

Jobs"With"a"Variable"Number"of"Reducers"(contd)"
!Note:%you%should%take%into%account%the%number%of%Reduce%slots%likely%to%
be%available%on%the%cluster%
If"your"job"requires"one"more"Reduce"slot"than"there"are"available,"a"
second"wave"of"Reducers"will"run"
ConsisAng"just"of"that"single"Reducer"
PotenAally"doubling"the"amount"of"Ame"spent"on"the"Reduce"phase"
In"this"case,"increasing"the"number"of"Reducers"further"may"cut"down"
the"Ame"spent"in"the"Reduce"phase"
Two"or"more"waves"will"run,"but"the"Reducers"in"each"wave"will"
have"to"process"less"data"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#39%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing%objects%
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#40%

Reuse"of"Objects"is"Good"PracAce"
!It%is%generally%good%prac+ce%to%reuse%objects%
Instead"of"creaAng"many"new"objects""
!Example:%Our%original%WordCount%Mapper%code%
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();

Each"Ame"the"map()"method"is"called,"we"create"a"new"Text"
for (String word : line.split("\\W+")) {
object"and"a"new IntWritable"object."
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#41%

Reuse"of"Objects"is"Good"PracAce"(contd)"
!Instead,%this%is%be^er%prac+ce:%
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text wordObject = new Text();
@Override
public Create"objects"for"the"key"and"value"outside"of"your"map()"method"
void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
wordObject.set(word);
context.write(wordObject, one);
}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#42%

Reuse"of"Objects"is"Good"PracAce"(contd)"
!Instead,%this%is%be^er%prac+ce:%
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text wordObject = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
Within"the"map()"method,"populate"the"objects"and"write"them"
throws IOException, InterruptedException {

out."Hadoop"will"take"care"of"serializing"the"data"so"it"is"perfectly"
safe"to"re/use"the"objects."

String line = value.toString();

for (String word : line.split("\\W+")) {


if (word.length() > 0) {
wordObject.set(word);
context.write(wordObject, one);
}
}
}
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#43%

Object"Reuse:"CauAon!"
!Hadoop%re#uses%objects%all%the%+me%
!For%example,%each%+me%the%Reducer%is%passed%a%new%value%the%same%
object%is%reused%
!This%can%cause%subtle%bugs%in%your%code%
For"example,"if"you"build"a"list"of"value"objects"in"the"Reducer,"each"
element"of"the"list"will"point"to"the"same"underlying"object"
Unless"you"do"a"deep"copy"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#44%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! Crea+ng%Map#only%MapReduce%jobs%
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#45%

Map/Only"MapReduce"Jobs"
!There%are%many%types%of%job%where%only%a%Mapper%is%needed%
!Examples:%
Image"processing"
File"format"conversion"
Input"data"sampling"
ETL"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#46%

CreaAng"Map/Only"Jobs"
!To%create%a%Map#only%job,%set%the%number%of%Reducers%to%0%in%your%Driver%
code%
job.setNumReduceTasks(0);

!Call%the%Job.setOutputKeyClass%and%
Job.setOutputValueClass%methods%to%specify%the%output%classes%
Not"the"Job.setMapOutputKeyClass"and"
Job.setMapOutputValueClass"methods"
!Anything%wri^en%using%the%Context.write%method%will%be%wri^en%to%
HDFS%
Rather"than"wri>en"as"intermediate"data"
One"le"per"Mapper"will"be"wri>en"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#47%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands#On%Exercise:%Using%Counters%and%a%Map#Only%Job%
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#48%

Hands/On"Exercise:"Using"Counters"and"a""
Map/Only"Job""
!In%this%Hands#On%Exercise%you%will%write%a%Map#Only%MapReduce%job%using%
Counters%
!Please%refer%to%the%Hands#On%Exercise%Manual%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#49%

Chapter"Topics"
Prac+cal%Development%Tips%%
and%Techniques%

Basic%Programming%with%the%%
Hadoop%Core%API%

! Strategies"for"debugging"MapReduce"code"
! TesAng"MapReduce"code"locally"using"LocalJobRunner"
! WriAng"and"viewing"log"les"
! Retrieving"job"informaAon"with"Counters"
! Determining"the"opAmal"number"of"Reducers"for"a"job"
! Reusing"objects"
! CreaAng"Map/only"MapReduce"jobs"
! Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"
! Conclusion%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#50%

Conclusion"
In%this%chapter%you%have%learned%
!Strategies%for%debugging%MapReduce%code%
!How%to%test%MapReduce%code%locally%by%using%LocalJobRunner%
!How%to%write%and%view%log%les%
!How%to%retrieve%job%informa+on%with%counters%
!How%to%determine%the%op+mal%number%of%Reducers%for%a%job%
!Why%reusing%objects%is%a%best%prac+ce%
!How%to%create%Map#only%MapReduce%jobs%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

07#51%

Data"Input"and"Output"
Chapter"8"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#1%

Course"Chapters"
Course"IntroducDon"

! "IntroducDon"
! "The"MoDvaDon"for"Hadoop"
! "Hadoop:"Basic"Concepts"
! "WriDng"a"MapReduce"Program"
! "Unit"TesDng"MapReduce"Programs"
! "Delving"Deeper"into"the"Hadoop"API"
! "PracDcal"Development"Tips"and"Techniques"
! %Data%Input%and%Output%
! "Common"MapReduce"Algorithms"
! "Joining"Data"Sets"in"MapReduce"Jobs"
! "IntegraDng"Hadoop"into"the"Enterprise"Workow"
! "Machine"Learning"and"Mahout"
! "An"IntroducDon"to"Hive"and"Pig"
! "An"IntroducDon"to"Oozie"
! "Conclusion"
! "Cloudera"Enterprise"
! "Graph"ManipulaDon"in"MapReduce"""

IntroducDon"to"Apache"Hadoop"and"
its"Ecosystem"

Basic%Programming%with%the%
Hadoop%Core%API%

Problem"Solving"with"MapReduce"

The"Hadoop"Ecosystem"

Course"Conclusion"and"Appendices"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#2%

Data"Input"and"Output"
In%this%chapter%you%will%learn%
!How%to%create%custom%Writable%and%WritableComparable%
implementaDons%
!How%to%save%binary%data%using%SequenceFile%and%Avro%data%les%
!How%to%implement%custom%InputFormats%and%OutputFormats%
!What%issues%to%consider%when%using%le%compression%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#3%

Recap:"Inputs"to"Mappers"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#4%

Recap:"Sort"and"Shue"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#5%

Recap:"Reducers"to"Outputs"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#6%

Chapter"Topics"
Data%Input%and%Output%

Basic%Programming%with%the%%
Hadoop%Core%API%

! CreaDng%custom%Writable%and%WritableComparable%implementaDons%
! Saving"binary"data"using"SequenceFiles"and"Avro"data"les"
! ImplemenDng"custom"InputFormats"and"OutputFormats"
! Issues"to"consider"when"using"le"compression"
! Hands/On"Exercise:"Using"SequenceFiles"and"File"Compression"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#7%

Data"Types"in"Hadoop"
Writable

WritableComparable

IntWritable
LongWritable
Text

Denes"a"de/serializaDon"protocol."
Every"data"type"in"Hadoop"is"a"
Writable

Denes"a"sort"order."All"keys"must"
be WritableComparable

Concrete"classes"for"dierent"data"
types"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#8%

Box"Classes"in"Hadoop"
!Hadoops%built#in%data%types%are%box%classes%
They"contain"a"single"piece"of"data"
Text:"String
IntWritable:"int
LongWritable:"long
FloatWritable:"float
etc."
! Writable%denes%the%wire%transfer%format%
How"the"data"is"serialized"and"deserialized"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#9%

CreaDng"a"Complex"Writable
!Example:%say%we%want%a%tuple%(a,%b)%
We"could"arDcially"construct"it"by,"for"example,"saying"
Text t = new Text(a + "," + b);
...
String[] arr = t.toString().split(",");

!Inelegant%
!ProblemaDc%
If"a"or"b"contained"commas,"for"example"
!Not%always%pracDcal%
Doesnt"easily"work"for"binary"objects"
!SoluDon:%create%your%own%Writable%object%
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#10%

The"Writable"Interface"
public interface Writable {
void readFields(DataInput in);
void write(DataOutput out);
}

!The%readFields%and%write%methods%will%dene%how%your%custom%
object%will%be%serialized%and%deserialized%by%Hadoop%
!The%DataInput%and%DataOutput%classes%support%
boolean
byte,"char"(Unicode:"2"bytes)"
double,"float,"int,"long,""
String"(Unicode"or"UTF/8)
Line"unDl"line"terminator"
unsigned"byte,"short
byte"array"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#11%

A"Sample"Custom"Writable:"DateWritable"
class DateWritable implements Writable {
int month, day, year;
// Constructors omitted for brevity
public void readFields(DataInput in) throws IOException {
this.month = in.readInt();
this.day = in.readInt();
this.year = in.readInt();
}
public void write(DataOutput out) throws IOException {
out.writeInt(this.month);
out.writeInt(this.day);
out.writeInt(this.year);
}
}
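As a sanity check, the read/write logic above can be round-tripped with plain `java.io` streams. The `DateRoundTrip` class below is a hypothetical sketch carrying a local copy of the same fields and stream calls, minus the Hadoop `Writable` interface:

```java
import java.io.*;

public class DateRoundTrip {
    // Local copy of the DateWritable fields and serialization logic above,
    // minus the Hadoop interface, so the round trip runs with plain java.io
    static class Date {
        int month, day, year;
        Date(int m, int d, int y) { month = m; day = d; year = y; }

        void write(DataOutput out) throws IOException {
            out.writeInt(month);
            out.writeInt(day);
            out.writeInt(year);
        }

        static Date read(DataInput in) throws IOException {
            // Fields are read back in the same order they were written
            return new Date(in.readInt(), in.readInt(), in.readInt());
        }
    }

    // Serialize to a byte buffer, then deserialize from it
    static Date roundTrip(Date d) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            d.write(new DataOutputStream(buf));
            return Date.read(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // cannot happen on byte-array streams
        }
    }
}
```

The key point the round trip illustrates: readFields must consume fields in exactly the order write produced them.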

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#12%

What"About"Binary"Objects?"
!SoluDon:%use%byte%arrays%
!Write%idiom:%
Serialize"object"to"byte"array"
Write"byte"count"
Write"byte"array"
!Read%idiom:%
Read"byte"count"
Create"byte"array"of"proper"size"
Read"byte"array"
Deserialize"object"
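A minimal sketch of the write and read idioms above, using plain `java.io` (the `ByteArrayIdiom` class and its method names are hypothetical):

```java
import java.io.*;

public class ByteArrayIdiom {
    // Write idiom: write the byte count, then the byte array itself
    static byte[] writeRecord(byte[] payload) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(payload.length);  // byte count
            out.write(payload);            // byte array
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // cannot happen on byte-array streams
        }
    }

    // Read idiom: read the count, allocate an array of that size, fill it
    static byte[] readRecord(byte[] stream) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));
            int len = in.readInt();          // byte count
            byte[] payload = new byte[len];  // array of proper size
            in.readFully(payload);           // byte array
            return payload;                  // caller deserializes the object from it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The length prefix is what lets the reader know where one binary record ends and the next begins.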

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#13%

WritableComparable
! WritableComparable is a sub-interface of Writable
– Must implement compareTo, hashCode, equals methods
! All keys in MapReduce must be WritableComparable

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#14%

Making"our"Sample"Object"a"WritableComparable
class DateWritable implements WritableComparable<DateWritable> {
int month, day, year;
// Constructors omitted for brevity
public void readFields (DataInput in) . . .

// Refer to Writable
// example

public void write (DataOutput out) . . .

// Refer to Writable
// example

public boolean equals(Object o) {


if (o instanceof DateWritable) {
DateWritable other = (DateWritable) o;
return this.year == other.year && this.month == other.month
&& this.day == other.day;
}
return false;
}

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#15%

Making"our"Sample"Object"a"WritableComparable"(contd)"
public int compareTo(DateWritable other) {
// Return -1 if this date is earlier
// Return 0 if dates are equal
// Return 1 if this date is later
if (this.year != other.year) {
return (this.year < other.year ? -1 : 1);
} else if (this.month != other.month) {
return (this.month < other.month ? -1 : 1);
} else if (this.day != other.day) {
return (this.day < other.day ? -1 : 1);
}
return 0;
}
public int hashCode() {
int seed = 163;
// Arbitrary seed value
return this.year * seed + this.month * seed + this.day * seed;
}
}
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#16%

Using"Custom"Types"in"MapReduce"Jobs"
!Use%methods%in%Job%to%specify%your%custom%key/value%types%
!For%output%of%Mappers:%
job.setMapOutputKeyClass()
job.setMapOutputValueClass()

!For%output%of%Reducers:%
job.setOutputKeyClass()
job.setOutputValueClass()

!Input%types%are%dened%by%InputFormat
See"later"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#17%

Chapter"Topics"
Data%Input%and%Output%

Basic%Programming%with%the%%
Hadoop%Core%API%

! CreaDng"custom"Writable"and"WritableComparable"implementaDons"
! Saving%binary%data%using%SequenceFiles%and%Avro%data%les%
! ImplemenDng"custom"InputFormats"and"OutputFormats"
! Issues"to"consider"when"using"le"compression"
! Hands/On"Exercise:"Using"SequenceFiles"and"File"Compression"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#18%

What"Are"SequenceFiles?"
!SequenceFiles%are%les%containing%binary#encoded%key#value%pairs%
Work"naturally"with"Hadoop"data"types"
SequenceFiles"include"metadata"which"idenDes"the"data"type"of"the"
key"and"value"
!Actually,%three%le%types%in%one%
Uncompressed"
Record/compressed"
Block/compressed"
!Oaen%used%in%MapReduce%
Especially"when"the"output"of"one"job"will"be"used"as"the"input"for"
another"
SequenceFileInputFormat
SequenceFileOutputFormat
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#19%

Directly Accessing SequenceFiles
! It is possible to directly access SequenceFiles from your code:

Configuration config = new Configuration();
SequenceFile.Reader reader =
    new SequenceFile.Reader(FileSystem.get(config), path, config);
Text key = (Text) reader.getKeyClass().newInstance();
IntWritable value = (IntWritable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
  // do something here
}
reader.close();

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#20%

Problems With SequenceFiles
! SequenceFiles are very useful but have some potential problems
! They are typically only accessible via the Java API
– Some work has been done to allow access from other languages
! If the definition of the key or value object changes, the file becomes unreadable

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#21%

An"AlternaDve"to"SequenceFiles:"Avro"
!Apache%Avro%is%a%serializaDon%format%which%is%becoming%a%popular%
alternaDve%to%SequenceFiles%
Project"was"created"by"Doug"Cugng,"the"creator"of"Hadoop"
!Self#describing%le%format%
The"schema"for"the"data"is"included"in"the"le"itself"
!Compact%le%format%
!Portable%across%mulDple%languages%
Support"for"C,"C++,"Java,"Python,"Ruby"and"others"
!CompaDble%with%Hadoop%
Via"the"AvroMapper"and"AvroReducer"classes"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#22%

Chapter"Topics"
Data%Input%and%Output%

Basic%Programming%with%the%%
Hadoop%Core%API%

! CreaDng"custom"Writable"and"WritableComparable"implementaDons"
! Saving"binary"data"using"SequenceFiles"and"Avro"data"les"
! ImplemenDng%custom%InputFormats%and%OutputFormats%
! Issues"to"consider"when"using"le"compression"
! Hands/On"Exercise:"Using"SequenceFiles"and"File"Compression"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#23%

Reprise:"The"Role"of"the"InputFormat"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#24%

Most"Common"InputFormats"
!Most%common%InputFormats:%
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat
!Others%are%available%
NLineInputFormat
Every"n"lines"of"an"input"le"is"treated"as"a"separate"InputSplit"
Congure"in"the"driver"code"by"segng:"
mapreduce.input.lineinput.linespermap"(CDH"4)"
mapred.line.inputformat.linespermap"(CDH"3)"
MultiFileInputFormat
Abstract"class"that"manages"the"use"of"mulDple"les"in"a"single"task"
You"must"supply"a"getRecordReader()"implementaDon"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#25%

How"FileInputFormat"Works"
!All%le#based%InputFormats%inherit%from%FileInputFormat
! FileInputFormat%computes%InputSplits%based%on%the%size%of%each%le,%
in%bytes%
HDFS"block"size"is"used"as"upper"bound"for"InputSplit"size"
Lower"bound"can"be"specied"in"your"driver"code"
This"means"that"an"InputSplit"typically"correlates"to"an"HDFS"block"
So"the"number"of"Mappers"will"equal"the"number"of"HDFS"blocks"of"
input"data"to"be"processed"
!Important:%InputSplits%do%not%respect%record%boundaries!%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#26%

What"RecordReaders"Do"
!InputSplits%are%handed%to%the%RecordReaders%
Specied"by"the"path,"starDng"posiDon"oset,"length"
!RecordReaders%must:%
Ensure"each"(key,"value)"pair"is"processed"
Ensure"no"(key,"value)"pair"is"processed"more"than"once"
Handle"(key,"value)"pairs"which"are"split"across"InputSplits"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#27%

Sample"InputSplit"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#28%

From"InputSplits"to"RecordReaders"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#29%

WriDng"Custom"InputFormats"
!Use%FileInputFormat%as%a%starDng%point%
Extend"it"
!Write%your%own%custom%RecordReader%
!Override%the%getRecordReader%method%in%FileInputFormat
!Override%isSplittable%if%you%dont%want%input%les%to%be%split%
Method"is"passed"each"le"name"in"turn"
Return"false"for"non/spli>able"les"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#30%

Reprise:"Role"of"the"OutputFormat"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#31%

OutputFormat
! OutputFormats work much like InputFormat classes
! Custom OutputFormats must provide a RecordWriter implementation

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#32%

Chapter"Topics"
Data%Input%and%Output%

Basic%Programming%with%the%%
Hadoop%Core%API%

! CreaDng"custom"Writable"and"WritableComparable"implementaDons"
! Saving"binary"data"using"SequenceFiles"and"Avro"data"les"
! ImplemenDng"custom"InputFormats"and"OutputFormats"
! Issues%to%consider%when%using%le%compression%
! Hands/On"Exercise:"Using"SequenceFiles"and"File"Compression"
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#33%

Hadoop"and"Compressed"Files"
!Hadoop%understands%a%variety%of%le%compression%formats%
Including"GZip"
!If%a%compressed%le%is%included%as%one%of%the%les%to%be%processed,%Hadoop%
will%automaDcally%decompress%it%and%pass%the%decompressed%contents%to%
the%Mapper%
There"is"no"need"for"the"developer"to"worry"about"decompressing"the"
le"
!However,%GZip%is%not%a%splifable%le%format%
A"GZipped"le"can"only"be"decompressed"by"starDng"at"the"beginning"of"
the"le"and"conDnuing"on"to"the"end"
You"cannot"start"decompressing"the"le"part"of"the"way"through"it"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#34%

Non/Spli>able"File"Formats"and"Hadoop"
!If%the%MapReduce%framework%receives%a%non#splifable%le%(such%as%a%
GZipped%le)%it%passes%the%en#re%le%to%a%single%Mapper%
!This%can%result%in%one%Mapper%running%for%far%longer%than%the%others%
It"is"dealing"with"an"enDre"le,"while"the"others"are"dealing"with"smaller"
porDons"of"les"
SpeculaDve"execuDon"could"occur"
Although"this"will"provide"no"benet"
!Typically%it%is%not%a%good%idea%to%use%GZip%to%compress%les%which%will%be%
processed%by%MapReduce%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#35%

Spli>able"Compression"Formats:"LZO"
!One%splifable%compression%format%is%LZO%
!Because%of%licensing%restricDons,%LZO%cannot%be%shipped%with%Hadoop%
But"it"is"easy"to"add"
See https://github.com/cloudera/hadoop-lzo
!To%make%an%LZO%le%splifable,%you%must%rst%index%the%le%
!The%index%le%contains%informaDon%about%how%to%break%the%LZO%le%into%
splits%that%can%be%decompressed%
!Access%the%splifable%LZO%le%as%follows:%
In"Java"MapReduce"programs,"use"the"LzoTextInputFormat"class"
In"Streaming"jobs,"specify"-inputformat com.hadoop.
mapred.DeprecatedLzoTextInputFormat"on"the"command"
line""

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#36%

Spli>able"Compression"for"SequenceFiles"and"Avro"Files"Using"
the"Snappy"Codec"
!Snappy%is%a%relaDvely%new%compression%codec%
Developed"at"Google"
Very"fast"
!Snappy%does%not%compress%a%SequenceFile%and%produce,%e.g.,%a%le%with%
a%.snappy%extension%
Instead,"it"is"a"codec"that"can"be"used"to"compress"data"within"a"le"
That"data"can"be"decompressed"automaDcally"by"Hadoop"(or"other"
programs)"when"the"le"is"read"
Works"well"with"SequenceFiles,"Avro"les"
!Snappy%is%now%preferred%over%LZO%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#37%

Compressing Output SequenceFiles With Snappy
! Specify output compression in the Job object
! Specify block or record compression
– Block compression is recommended for the Snappy codec
! Set the compression codec to the Snappy codec in the Job object
! For example:

import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
. . .
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
    CompressionType.BLOCK);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#38%

Chapter"Topics"
Data%Input%and%Output%

Basic%Programming%with%the%%
Hadoop%Core%API%

! CreaDng"custom"Writable"and"WritableComparable"implementaDons"
! Saving"binary"data"using"SequenceFiles"and"Avro"data"les"
! ImplemenDng"custom"InputFormats"and"OutputFormats"
! Issues"to"consider"when"using"le"compression"
! Hands#On%Exercise:%Using%SequenceFiles%and%File%Compression%
! Conclusion"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#39%

Hands/On"Exercise:"Using"Sequence"Files"and"File"Compression"
!In%this%Hands#On%Exercise,%you%will%explore%reading%and%wriDng%
uncompressed%and%compressed%SequenceFiles%%
!Please%refer%to%the%Hands#On%Exercise%Manual%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#40%

Chapter"Topics"
Data%Input%and%Output%

Basic%Programming%with%the%%
Hadoop%Core%API%

! CreaDng"custom"Writable"and"WritableComparable"implementaDons"
! Saving"binary"data"using"SequenceFiles"and"Avro"data"les"
! ImplemenDng"custom"InputFormats"and"OutputFormats"
! Issues"to"consider"when"using"le"compression"
! Hands/On"Exercise:"Using"SequenceFiles"and"File"Compression"
! Conclusion%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#41%

Conclusion"
In%this%chapter%you%have%learned%
!How%to%create%custom%Writable%and%WritableComparable%
implementaDons%
!How%to%save%binary%data%using%SequenceFile%and%Avro%data%les%
!How%to%implement%custom%InputFormats%and%OutputFormats%
!What%issues%to%consider%when%using%le%compression%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

08#42%

Common"MapReduce"Algorithms"
Chapter"9"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#1%

Course"Chapters"
Course"IntroducEon"

! "IntroducEon"
! "The"MoEvaEon"for"Hadoop"
! "Hadoop:"Basic"Concepts"
! "WriEng"a"MapReduce"Program"
! "Unit"TesEng"MapReduce"Programs"
! "Delving"Deeper"into"the"Hadoop"API"
! "PracEcal"Development"Tips"and"Techniques"
! "Data"Input"and"Output"
! %Common%MapReduce%Algorithms%
! "Joining"Data"Sets"in"MapReduce"Jobs"
! "IntegraEng"Hadoop"into"the"Enterprise"Workow"
! "Machine"Learning"and"Mahout"
! "An"IntroducEon"to"Hive"and"Pig"
! "An"IntroducEon"to"Oozie"
! "Conclusion"
! "Cloudera"Enterprise"
! "Graph"ManipulaEon"in"MapReduce"""

IntroducEon"to"Apache"Hadoop"and"
its"Ecosystem"

Basic"Programming"with"the"
Hadoop"Core"API"

Problem%Solving%with%MapReduce%

The"Hadoop"Ecosystem"

Course"Conclusion"and"Appendices"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#2%

Common"MapReduce"Algorithms"
In%this%chapter%you%will%learn%
!How%to%sort%and%search%large%data%sets%
!How%to%perform%a%secondary%sort%
!How%to%index%data%
!How%to%compute%term%frequency%%inverse%document%frequency%(TF#IDF)%
!How%to%calculate%word%co#occurrence%

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#3%

IntroducEon"
!MapReduce%jobs%tend%to%be%relaOvely%short%in%terms%of%lines%of%code%
!It%is%typical%to%combine%mulOple%small%MapReduce%jobs%together%in%a%single%
workow%
OZen"using"Oozie"(see"later)"
!You%are%likely%to%nd%that%many%of%your%MapReduce%jobs%use%very%similar%
code%
!In%this%chapter%we%present%some%very%common%MapReduce%algorithms%
These"algorithms"are"frequently"the"basis"for"more"complex"
MapReduce"jobs"
"

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#4%

Chapter"Topics"
Common%MapReduce%Algorithms%

Problem%Solving%with%MapReduce%

! SorOng%and%searching%large%data%sets%
! Performing"a"secondary"sort"
! Indexing"data"
! Hands/On"Exercise:"CreaEng"an"Inverted"Index"
! CompuEng"term"frequency""inverse"document"frequency"(TF/IDF)"
! CalculaEng"word"co/occurrence"
! Hands/On"Exercise:"CalculaEng"Word"Co/Occurrence"
! OpEonal"Hands/On"Exercise:"ImplemenEng"Word"Co/Occurrence"with"a"
Custom"WritableComparable"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#5%

Sorting
! MapReduce is very well suited to sorting large data sets
! Recall: keys are passed to the Reducer in sorted order
! Assuming the file to be sorted contains lines with a single value:
– Mapper is merely the identity function for the value
  (k, v) -> (v, _)
– Reducer is the identity function
  (k, _) -> (k, '')

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#6%

Sorting (cont'd)
! Trivial with a single Reducer
! For multiple Reducers, you need to choose a partitioning function such that if
  k1 < k2, then partition(k1) <= partition(k2)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#7%

Sorting as a Speed Test of Hadoop
! Sorting is frequently used as a speed test for a Hadoop cluster
– Mapper and Reducer are trivial
– Therefore sorting is effectively testing the Hadoop framework's I/O
! Good way to measure the increase in performance if you enlarge your cluster
– Run and time a sort job before and after you add more nodes
– terasort is one of the sample jobs provided with Hadoop
– Creates and sorts very large files

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#8%

Searching
! Assume the input is a set of files containing lines of text
! Assume the Mapper has been passed the pattern for which to search as a special parameter
– We saw how to pass parameters to your Mapper in the previous chapter
! Algorithm:
– Mapper compares the line against the pattern
– If the pattern matches, Mapper outputs (line, _)
– Or (filename+line, _), or ...
– If the pattern does not match, Mapper outputs nothing
– Reducer is the Identity Reducer
– Just outputs each intermediate key

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#9%

Chapter"Topics"
Common%MapReduce%Algorithms%

Problem%Solving%with%MapReduce%

! SorEng"and"searching"large"data"sets"
! Performing%a%secondary%sort%
! Indexing"data"
! Hands/On"Exercise:"CreaEng"an"Inverted"Index"
! CompuEng"term"frequency""inverse"document"frequency"(TF/IDF)"
! CalculaEng"word"co/occurrence"
! Hands/On"Exercise:"CalculaEng"Word"Co/Occurrence"
! OpEonal"Hands/On"Exercise:"ImplemenEng"Word"Co/Occurrence"with"a"
Custom"WritableComparable"
! Conclusion"
"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#10%

Secondary Sort: Motivation
! Recall that keys are passed to the Reducer in sorted order
! The list of values for a particular key is not sorted
– The order may well change between different runs of the MapReduce job
! Sometimes a job needs to receive the values for a particular key in a sorted order
– This is known as a secondary sort

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#11%

Secondary Sort: Motivation (cont'd)
! Example: Your Reducer will emit the largest value produced by Mappers for each different key
! Naive solution
– Loop through all values, keeping track of the largest
– Finally, emit the largest value
! Better solution
– Arrange for the values for a given key to be presented to the Reducer in sorted, descending order
– The Reducer just needs to read and emit the first value it is given for a key

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#12%

Aside: Comparator Classes
! Comparator classes are classes that compare objects
! Custom comparators can be used in a secondary sort to compare composite keys
! Grouping comparators can be used in a secondary sort to ensure that only the natural key is used for partitioning and grouping

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#13%

Implementing the Secondary Sort
! To implement a secondary sort, the intermediate key should be a composite of the actual ("natural") key and the value
! Define a Partitioner which partitions just on the natural key
! Define a Comparator class which sorts on the entire composite key
– Ensures that the keys are passed to the Reducer in the desired order
– Orders by natural key and, for the same natural key, on the value portion of the key
– Specified in the driver code by
  job.setSortComparatorClass(MyOKCC.class);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#14%

Implementing the Secondary Sort (cont'd)
! Now we know that all values for the same natural key will go to the same Reducer
– And they will be in the order we desire
! We must now ensure that all the values for the same natural key are passed in one call to the Reducer
! Achieved by defining a Grouping Comparator class
– Determines which keys and values are passed in a single call to the Reducer
– Looks at just the natural key
– Specified in the driver code by
  job.setGroupingComparatorClass(MyOVGC.class);

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#15%

Secondary Sort: Example
! Assume we have input with (key, value) pairs like this

foo 98
foo 101
bar 12
baz 18
foo 22
bar 55
baz 123

! We want the Reducer to receive the intermediate data for each key in descending numerical order

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#16%

Secondary Sort: Example (cont'd)
! Write the Mapper such that the intermediate key is a composite of the natural key and value
– For example, intermediate output may look like this:

('foo#98', 98)
('foo#101', 101)
('bar#12', 12)
('baz#18', 18)
('foo#22', 22)
('bar#55', 55)
('baz#123', 123)

"Copyright"2010/2013"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

09#17%

Secondary Sort: Example (cont'd)
! Write a class that extends WritableComparator and sorts on the natural key, and for identical natural keys, sorts on the value portion in descending order
– Just override compare(WritableComparable, WritableComparable)
– Supply a reference to this class in your driver using the Job.setSortComparatorClass method
– Will result in keys being passed to the Reducer in this order:
('bar#55', 55)
('bar#12', 12)
('baz#123', 123)
('baz#18', 18)
('foo#101', 101)
('foo#98', 98)
('foo#22', 22)

09-18
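Outside Hadoop, the sort comparator's logic can be sketched in plain Java. The `#`-delimited composite key format and the class name below are illustrative assumptions, not taken from the course code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Plain-Java sketch of the sort comparator: ascending on the natural
// key, descending on the numeric value portion of a hypothetical
// "naturalKey#value" composite key.
public class CompositeKeySort {
    static final Comparator<String> SORT_ORDER = (a, b) -> {
        String[] pa = a.split("#");
        String[] pb = b.split("#");
        int byNaturalKey = pa[0].compareTo(pb[0]);
        if (byNaturalKey != 0) {
            return byNaturalKey;                        // ascending natural key
        }
        return Integer.compare(Integer.parseInt(pb[1]),
                               Integer.parseInt(pa[1])); // descending value
    };

    public static void main(String[] args) {
        String[] keys = {"foo#98", "foo#101", "bar#12", "baz#18",
                         "foo#22", "bar#55", "baz#123"};
        Arrays.sort(keys, SORT_ORDER);
        System.out.println(Arrays.toString(keys));
        // [bar#55, bar#12, baz#123, baz#18, foo#101, foo#98, foo#22]
    }
}
```

Sorting the sample keys reproduces exactly the ordering the slide lists.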

Secondary Sort: Example (cont'd)
! Finally, write another WritableComparator subclass which just examines the first (natural) portion of the key
– Again, just override compare(WritableComparable, WritableComparable)
– Supply a reference to this class in your driver using the Job.setGroupingComparatorClass method
– This will ensure that values associated with the same natural key will be sent to the same pass of the Reducer
– But they're sorted in descending order, as we required

09-19
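The grouping comparator's effect can also be sketched without Hadoop, again assuming a hypothetical "naturalKey#value" key format: keys that differ only in the value portion compare equal, so their values are grouped into a single reduce() call.

```java
// Plain-Java sketch of a grouping comparator: compares only the
// natural-key portion of a hypothetical "naturalKey#value" composite
// key, so composite keys with the same natural key are treated as equal.
public class NaturalKeyGrouping {
    static int compareForGrouping(String a, String b) {
        String naturalA = a.substring(0, a.indexOf('#'));
        String naturalB = b.substring(0, b.indexOf('#'));
        return naturalA.compareTo(naturalB);
    }

    public static void main(String[] args) {
        // Same natural key: grouped together despite different values
        System.out.println(compareForGrouping("foo#98", "foo#101")); // 0
        // Different natural keys: kept apart
        System.out.println(compareForGrouping("foo#98", "bar#12") != 0); // true
    }
}
```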

Chapter Topics
Common MapReduce Algorithms

Problem Solving with MapReduce

! Sorting and searching large data sets
! Performing a secondary sort
! Indexing data
! Hands-On Exercise: Creating an Inverted Index
! Computing term frequency – inverse document frequency (TF-IDF)
! Calculating word co-occurrence
! Hands-On Exercise: Calculating Word Co-Occurrence
! Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
! Conclusion

09-20

Indexing
! Assume the input is a set of files containing lines of text
! Key is the byte offset of the line, value is the line itself
! We can retrieve the name of the file using the Context object
– More details on how to do this later

09-21

Inverted Index Algorithm
! Mapper:
– For each word in the line, emit (word, filename)
! Reducer:
– Identity function
– Collect together all values for a given key (i.e., all filenames for a particular word)
– Emit (word, filename_list)

09-22
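As a rough illustration of the algorithm (not the course's exercise code), here is an in-memory sketch: the "map" step emits (word, filename) pairs, and the "reduce" step collects all filenames per word. The file names and contents are invented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // "Map" + "reduce" in one pass: for each word in each file, record
    // the file name in a per-word set (the reducer's filename_list).
    static Map<String, Set<String>> invert(Map<String, String> files) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>())
                     .add(file.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("a.txt", "hadoop map reduce");
        files.put("b.txt", "map side join");
        System.out.println(invert(files).get("map")); // [a.txt, b.txt]
    }
}
```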

Inverted Index: Dataflow
[Diagram: inverted index dataflow – not reproduced]

09-23

Aside: Word Count
! Recall the WordCount example we used earlier in the course
– For each word, Mapper emitted (word, 1)
– Very similar to the inverted index
! This is a common theme: reuse of existing Mappers, with minor modifications

09-24

Chapter Topics
Common MapReduce Algorithms

Problem Solving with MapReduce

! Sorting and searching large data sets
! Performing a secondary sort
! Indexing data
! Hands-On Exercise: Creating an Inverted Index
! Computing term frequency – inverse document frequency (TF-IDF)
! Calculating word co-occurrence
! Hands-On Exercise: Calculating Word Co-Occurrence
! Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
! Conclusion

09-25

Hands-On Exercise: Creating an Inverted Index
! In this Hands-On Exercise, you will write a MapReduce program to generate an inverted index of a set of documents
! Please refer to the Hands-On Exercise Manual

09-26

Chapter Topics
Common MapReduce Algorithms

Problem Solving with MapReduce

! Sorting and searching large data sets
! Performing a secondary sort
! Indexing data
! Hands-On Exercise: Creating an Inverted Index
! Computing term frequency – inverse document frequency (TF-IDF)
! Calculating word co-occurrence
! Hands-On Exercise: Calculating Word Co-Occurrence
! Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
! Conclusion

09-27

Term Frequency – Inverse Document Frequency
! Term Frequency – Inverse Document Frequency (TF-IDF)
– Answers the question "How important is this term in a document?"
! Known as a term weighting function
– Assigns a score (weight) to each term (word) in a document
! Very commonly used in text processing and search
! Has many applications in data mining

09-28

TF-IDF: Motivation
! Merely counting the number of occurrences of a word in a document is not a good enough measure of its relevance
– If the word appears in many other documents, it is probably less relevant
– Some words appear too frequently in all documents to be relevant
– Known as "stopwords"
! TF-IDF considers both the frequency of a word in a given document and the number of documents which contain the word

09-29

TF-IDF: Data Mining Example
! Consider a music recommendation system
– Given many users' music libraries, provide "you may also like" suggestions
! If user A and user B have similar libraries, user A may like an artist in user B's library
– But some artists will appear in almost everyone's library, and should therefore be ignored when making recommendations
– Almost everyone has The Beatles in their record collection!

09-30

TF-IDF Formally Defined
! Term Frequency (TF)
– Number of times a term appears in a document (i.e., the count)
! Inverse Document Frequency (IDF)

idf = log(N / n)

– N: total number of documents
– n: number of documents that contain a term
! TF-IDF
– TF × IDF

09-31
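The formula is small enough to check directly in plain Java. The slides do not fix a logarithm base, so the natural log is assumed here, and the numbers are purely illustrative:

```java
public class TfIdfDemo {
    // tfidf = tf * log(N / n), per the definition on the slide
    static double tfidf(int tf, int totalDocs, int docsWithTerm) {
        return tf * Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // A term appearing in every document scores 0, however frequent:
        System.out.println(tfidf(100, 10, 10)); // 0.0
        // A term confined to few documents outweighs a widespread one:
        System.out.println(tfidf(5, 10, 2) > tfidf(5, 10, 9)); // true
    }
}
```

Note how the log(N/n) factor drives the score of ubiquitous terms (stopwords, The Beatles) toward zero regardless of their raw count.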

Computing TF-IDF
! What we need:
– Number of times t appears in a document
  – Different value for each document
– Number of documents that contain t
  – One value for each term
– Total number of documents
  – One value

09-32

Computing TF-IDF With MapReduce
! Overview of algorithm: 3 MapReduce jobs
– Job 1: compute term frequencies
– Job 2: compute number of documents each word occurs in
– Job 3: compute TF-IDF
! Notation in following slides:
– docid = a unique ID for each document
– contents = the complete text of each document
– N = total number of documents
– term = a term (word) found in the document
– tf = term frequency
– n = number of documents a term appears in
! Note that real-world systems typically perform stemming on terms
– Removal of plurals, tense, possessives, etc.

09-33

Computing TF-IDF: Job 1 – Compute tf
! Mapper
– Input: (docid, contents)
– For each term in the document, generate a (term, docid) pair
  – i.e., we have seen this term in this document once
– Output: ((term, docid), 1)
! Reducer
– Sums counts for word in document
– Outputs ((term, docid), tf)
  – i.e., the term frequency of term in docid is tf
! We can add a Combiner, which will use the same code as the Reducer

09-34

Computing TF-IDF: Job 2 – Compute n
! Mapper
– Input: ((term, docid), tf)
– Output: (term, (docid, tf, 1))
! Reducer
– Sums 1s to compute n (number of documents containing term)
  – Note: need to buffer (docid, tf) pairs while we are doing this (more later)
– Outputs ((term, docid), (tf, n))

09-35

Computing TF-IDF: Job 3 – Compute TF-IDF
! Mapper
– Input: ((term, docid), (tf, n))
– Assume N is known (easy to find)
– Output: ((term, docid), TF × IDF)
! Reducer
– The identity function

09-36

Computing TF-IDF: Working At Scale
! Job 2: We need to buffer (docid, tf) pairs while summing 1s (to compute n)
– Possible problem: pairs may not fit in memory!
  – In how many documents does the word "the" occur?
! Possible solutions
– Ignore very-high-frequency words
– Write out intermediate data to a file
– Use another MapReduce pass

09-37

TF-IDF: Final Thoughts
! Several small jobs add up to full algorithm
– "Thinking in MapReduce" often means decomposing a complex algorithm into a sequence of smaller jobs
! Beware of memory usage for large amounts of data!
– Any time when you need to buffer data, there's a potential scalability bottleneck

09-38

Chapter Topics
Common MapReduce Algorithms

Problem Solving with MapReduce

! Sorting and searching large data sets
! Performing a secondary sort
! Indexing data
! Hands-On Exercise: Creating an Inverted Index
! Computing term frequency – inverse document frequency (TF-IDF)
! Calculating word co-occurrence
! Hands-On Exercise: Calculating Word Co-Occurrence
! Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
! Conclusion

09-39

Word Co-Occurrence: Motivation
! Word Co-Occurrence measures the frequency with which two words appear close to each other in a corpus of documents
– For some definition of "close"
! This is at the heart of many data-mining techniques
– Provides results for "people who did this, also do that"
– Examples:
  – Shopping recommendations
  – Credit risk analysis
  – Identifying people of interest

09-40

Word Co-Occurrence: Algorithm
! Mapper
map(docid a, doc d) {
  foreach w in d do
    foreach u near w do
      emit(pair(w, u), 1)
}

! Reducer
reduce(pair p, Iterator counts) {
  s = 0
  foreach c in counts do
    s += c
  emit(p, s)
}

09-41
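An in-memory sketch of the pseudocode above, taking "near" to mean "immediately adjacent" (one of many possible definitions); the sample documents are invented:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoOccurrenceSketch {
    // Mapper: emit 1 for each adjacent word pair; Reducer: sum per pair.
    static Map<String, Integer> countPairs(List<String[]> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] words : docs) {
            for (int i = 0; i < words.length - 1; i++) {
                String pair = words[i] + "," + words[i + 1];
                counts.merge(pair, 1, Integer::sum); // the reducer's sum
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[]{"big", "data", "tools"},
            new String[]{"big", "data", "sets"});
        System.out.println(countPairs(docs).get("big,data")); // 2
    }
}
```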

Chapter Topics
Common MapReduce Algorithms

Problem Solving with MapReduce

! Sorting and searching large data sets
! Performing a secondary sort
! Indexing data
! Hands-On Exercise: Creating an Inverted Index
! Computing term frequency – inverse document frequency (TF-IDF)
! Calculating word co-occurrence
! Hands-On Exercise: Calculating Word Co-Occurrence
! Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
! Conclusion

09-42

Hands-On Exercises: Calculating Word Co-Occurrence, Using a Custom WritableComparable
! In these Hands-On Exercises you will write an application that counts the number of times words appear next to each other
! If you complete the first exercise, please attempt the optional follow-up exercise, in which you will rewrite your code to use a custom WritableComparable
! Please refer to the Hands-On Exercise Manual

09-43

Chapter Topics
Common MapReduce Algorithms

Problem Solving with MapReduce

! Sorting and searching large data sets
! Performing a secondary sort
! Indexing data
! Hands-On Exercise: Creating an Inverted Index
! Computing term frequency – inverse document frequency (TF-IDF)
! Calculating word co-occurrence
! Hands-On Exercise: Calculating Word Co-Occurrence
! Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
! Conclusion

09-44

Conclusion
In this chapter you have learned
! How to sort and search large data sets
! How to perform a secondary sort
! How to index data
! How to compute term frequency – inverse document frequency (TF-IDF)
! How to calculate word co-occurrence

09-45

Joining Data Sets in MapReduce Jobs
Chapter 10

10-1

Course Chapters
! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Course Introduction
Introduction to Apache Hadoop and its Ecosystem
Basic Programming with the Hadoop Core API
Problem Solving with MapReduce
The Hadoop Ecosystem
Course Conclusion and Appendices

10-2

Joining Data Sets in MapReduce Jobs
In this chapter you will learn
! How to write a Map-side join
! How to write a Reduce-side join

10-3

Introduction
! We frequently need to join data together from two sources as part of a MapReduce job, such as
– Lookup tables
– Data from database tables
! There are two fundamental approaches: Map-side joins and Reduce-side joins
! Map-side joins are easier to write, but have potential scaling issues
! We will investigate both types of joins in this chapter

10-4

But First…
! But first…
! Avoid writing joins in Java MapReduce if you can!
! Abstractions such as Pig and Hive are much easier to use
– Save hours of programming
! If you are dealing with text-based data, there really is no reason not to use Pig or Hive

10-5

Chapter Topics
Joining Data Sets in MapReduce Jobs

Problem Solving with MapReduce

! Writing a Map-side join
! Writing a Reduce-side join
! Conclusion

10-6

Map-Side Joins: The Algorithm
! Basic idea for Map-side joins:
– Load one set of data into memory, stored in a hash table
  – Key of the hash table is the join key
– Map over the other set of data, and perform a lookup on the hash table using the join key
– If the join key is found, you have a successful join
– Otherwise, do nothing

10-7
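A minimal plain-Java sketch of this idea, with invented data (in a real Mapper, the hash table would typically be loaded once during setup):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {
    // The small data set lives in a hash table keyed on the join key;
    // we "map" over the large data set and look each record up.
    static List<String> join(Map<Integer, String> locations, int[][] employees) {
        List<String> results = new ArrayList<>();
        for (int[] emp : employees) {
            String loc = locations.get(emp[0]); // lookup on the join key
            if (loc != null) {                  // found: a successful join
                results.add("emp " + emp[1] + " works in " + loc);
            }                                   // otherwise, do nothing
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Integer, String> locations = new HashMap<>();
        locations.put(13, "New York City");
        int[][] employees = {{13, 42}, {99, 7}}; // {locId, empId}
        System.out.println(join(locations, employees));
        // [emp 42 works in New York City]
    }
}
```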

Map-Side Joins: Problems, Possible Solutions
! Map-side joins have scalability issues
– The associative array may become too large to fit in memory
! Possible solution: break one data set into smaller pieces
– Load each piece into memory individually, mapping over the second data set each time
– Then combine the result sets together

10-8

Chapter Topics
Joining Data Sets in MapReduce Jobs

Problem Solving with MapReduce

! Writing a Map-side join
! Writing a Reduce-side join
! Conclusion

10-9

Reduce-Side Joins: The Basic Concept
! For a Reduce-side join, the basic concept is:
– Map over both data sets
– Emit a (key, value) pair for each record
  – Key is the join key, value is the entire record
– In the Reducer, do the actual join
  – Because of the Shuffle and Sort, values with the same key are brought together

10-10

Reduce-Side Joins: Example
! Example input data:

EMP: 42, Aaron, loc(13)
LOC: 13, New York City

! Required output:
EMP: 42, Aaron, loc(13), New York City

10-11

Example Record Data Structure
! A data structure to hold a record could look like this:
class Record {
  enum Typ { emp, loc };
  Typ type;
  String empName;
  int empId;
  int locId;
  String locationName;
}

10-12

Reduce-Side Join: Mapper
void map(k, v) {
  Record r = parse(v);
  emit(r.locId, r);
}

10-13

Reduce-Side Join: Reducer
void reduce(k, values) {
  Record thisLocation;
  List<Record> employees;
  for (Record v in values) {
    if (v.type == Typ.loc) {
      thisLocation = v;
    } else {
      employees.add(v);
    }
  }
  for (Record e in employees) {
    e.locationName = thisLocation.locationName;
    emit(e);
  }
}

10-14

Scalability Problems With Our Reducer
! All employees for a given location must potentially be buffered in the Reducer
– Could result in out-of-memory errors for large data sets
! Solution: Ensure the location record is the first one to arrive at the Reducer
– Using a Secondary Sort

10-15

A Better Intermediate Key
class LocKey {
  boolean isPrimary;
  int locId;
  public int compareTo(LocKey k) {
    if (locId != k.locId) {
      return Integer.compare(locId, k.locId);
    } else {
      return Boolean.compare(k.isPrimary, isPrimary);
    }
  }
  public int hashCode() {
    return locId;
  }
}

10-16

A Better Intermediate Key (cont'd)
class LocKey {
  boolean isPrimary;
  int locId;
  public int compareTo(LocKey k) {
    if (locId != k.locId) {
      return Integer.compare(locId, k.locId);
    } else {
      return Boolean.compare(k.isPrimary, isPrimary);
    }
  }
  public int hashCode() {
    return locId;
  }
}

The compareTo method ensures that primary keys will sort earlier than non-primary keys for the same location.

10-17

A Better Intermediate Key (cont'd)
class LocKey {
  boolean isPrimary;
  int locId;
  public int compareTo(LocKey k) {
    if (locId != k.locId) {
      return Integer.compare(locId, k.locId);
    } else {
      return Boolean.compare(k.isPrimary, isPrimary);
    }
  }
  public int hashCode() {
    return locId;
  }
}

The hashCode method ensures that all records with the same key will go to the same Reducer. This is an alternative to providing a custom Partitioner.

10-18
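The ordering that LocKey's compareTo produces can be checked with a small plain-Java stand-in (the helper below mirrors the comparison logic; it is not part of the course code):

```java
public class LocKeyOrderCheck {
    // Mirrors LocKey.compareTo: order by locId first; for equal locIds,
    // Boolean.compare(bPrimary, aPrimary) puts the primary (location)
    // record before the non-primary (employee) records.
    static int compare(boolean aPrimary, int aLocId,
                       boolean bPrimary, int bLocId) {
        if (aLocId != bLocId) {
            return Integer.compare(aLocId, bLocId);
        }
        return Boolean.compare(bPrimary, aPrimary);
    }

    public static void main(String[] args) {
        // The location (primary) record sorts before its employees:
        System.out.println(compare(true, 13, false, 13) < 0);  // true
        // Lower location IDs sort first regardless of record type:
        System.out.println(compare(false, 12, true, 13) < 0);  // true
    }
}
```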

A Better Mapper
void map(k, v) {
  Record r = parse(v);
  if (r.type == Typ.emp) {
    emit(setisPrimaryFalse(r.locId), r);
  } else {
    emit(setisPrimaryTrue(r.locId), r);
  }
}

10-19

A Better Reducer
Record thisLoc;

void reduce(k, values) {
  for (Record v in values) {
    if (v.type == Typ.loc) {
      thisLoc = v;
    } else {
      v.locationName = thisLoc.locationName;
      emit(v);
    }
  }
}

10-20

Create a Grouping Comparator
! Create a Grouping Comparator to ensure that all records with the same location are passed to the Reducer in one call
class LocIDComparator extends WritableComparator {
  public int compare(Record r1, Record r2) {
    return Integer.compare(r1.locId, r2.locId);
  }
}

10-21

And Configure Hadoop To Use It In The Driver
job.setGroupingComparatorClass(LocIDComparator.class);

10-22

Chapter Topics
Joining Data Sets in MapReduce Jobs

Problem Solving with MapReduce

! Writing a Map-side join
! Writing a Reduce-side join
! Conclusion

10-23

Conclusion
In this chapter you have learned
! How to write a Map-side join
! How to write a Reduce-side join

10-24

Integrating Hadoop into the Enterprise Workflow
Chapter 11

11-1

Course Chapters
! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Course Introduction
Introduction to Apache Hadoop and its Ecosystem
Basic Programming with the Hadoop Core API
Problem Solving with MapReduce
The Hadoop Ecosystem
Course Conclusion and Appendices

11-2

Integrating Hadoop Into The Enterprise Workflow
In this chapter you will learn
! How Hadoop can be integrated into an existing enterprise
! How to load data from an existing RDBMS into HDFS by using Sqoop
! How to manage real-time data such as log files using Flume
! How to access HDFS from legacy systems with FuseDFS and HttpFS

11-3

Chapter Topics
Integrating Hadoop into the Enterprise Workflow

The Hadoop Ecosystem

! Integrating Hadoop into an existing enterprise
! Loading data into HDFS from an RDBMS using Sqoop
! Hands-On Exercise: Importing Data With Sqoop
! Managing real-time data using Flume
! Accessing HDFS from legacy systems with FuseDFS and HttpFS
! Conclusion

11-4

Introduction
! Your data center already has a lot of components
– Database servers
– Data warehouses
– File servers
– Backup systems
! How does Hadoop fit into this ecosystem?

11-5

RDBMS Strengths
! Relational Database Management Systems (RDBMSs) have many strengths
– Ability to handle complex transactions
– Ability to process hundreds or thousands of queries per second
– Real-time delivery of results
– Simple but powerful query language

11-6

RDBMS Weaknesses
! There are some areas where RDBMSs are less ideal
– Data schema is determined before data is ingested
  – Can make ad-hoc data collection difficult
– Upper bound on data storage of 100s of terabytes
– Practical upper bound on data in a single query of 10s of terabytes

11-7

Typical RDBMS Scenario
! Typical scenario: use an interactive RDBMS to serve queries from a Web site, etc.
! Data is later extracted and loaded into a data warehouse for future processing and archiving
– Usually denormalized into an OLAP cube

OLAP: OnLine Analytical Processing

11-8

Typical RDBMS Scenario (cont'd)
[Diagram: an enterprise web site writes to an interactive database; data is exported and loaded via OLAP into a data warehouse (Oracle, SAP...) that serves business intelligence apps]

11-9

OLAP Database Limitations
! All dimensions must be prematerialized
– Re-materialization can be very time consuming
! Daily data load-in times can increase
– Typically this leads to some data being discarded

11-10

Using Hadoop to Augment Existing Databases
[Diagram: the enterprise web site feeds an interactive database and new data into Hadoop; Hadoop serves dynamic OLAP queries, feeds the data warehouse (Oracle, SAP...) and business intelligence apps, and produces recommendations, etc.]

11-11

Benefits of Hadoop
! Processing power scales with data storage
– As you add more nodes for storage, you get more processing power for free
! Views do not need prematerialization
– Ad-hoc full or partial dataset queries are possible
! Total query size can be multiple petabytes

11-12

Hadoop Tradeoffs
! Cannot serve interactive queries
– The fastest Hadoop job will still take several seconds to run
! Less powerful updates
– No transactions
– No modification of existing records

11-13

Traditional High-Performance File Servers
! Enterprise data is often held on large fileservers, such as
– NetApp
– EMC
! Advantages:
– Fast random access
– Many concurrent clients
! Disadvantages:
– High cost per terabyte of storage

11-14

File Servers and Hadoop
! Choice of destination medium depends on the expected access patterns
– Sequentially read, append-only data: HDFS
– Random access: file server
! HDFS can crunch sequential data faster
! Offloading data to HDFS leaves more room on file servers for interactive data
! Use the right tool for the job!

11-15

Chapter Topics
Integrating Hadoop into the Enterprise Workflow

The Hadoop Ecosystem

! Integrating Hadoop into an existing enterprise
! Loading data into HDFS from an RDBMS using Sqoop
! Hands-On Exercise: Importing Data With Sqoop
! Managing real-time data using Flume
! Accessing HDFS from legacy systems with FuseDFS and HttpFS
! Conclusion

11-16

Importing Data From an RDBMS to HDFS
! Typical scenario: the need to use data stored in an RDBMS (such as Oracle Database, MySQL, or Teradata) in a MapReduce job
– Lookup tables
– Legacy data
! Possible to read directly from an RDBMS in your Mapper
– Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
– In practice – don't do it!
! Better scenario: import the data into HDFS beforehand

11-17

Sqoop: SQL to Hadoop
! Sqoop: open source tool originally written at Cloudera
– Now a top-level Apache Software Foundation project
! Imports tables from an RDBMS into HDFS
– Just one table
– All tables in a database
– Just portions of a table
  – Sqoop supports a WHERE clause
! Uses MapReduce to actually import the data
– Throttles the number of Mappers to avoid DDoS scenarios
  – Uses four Mappers by default
  – Value is configurable
! Uses a JDBC interface
– Should work with any JDBC-compatible database

11-18

Sqoop: SQL to Hadoop (cont'd)
! Imports data to HDFS as delimited text files or SequenceFiles
– Default is a comma-delimited text file
! Can be used for incremental data imports
– First import retrieves all rows in a table
– Subsequent imports retrieve just rows created since the last import
! Generates a class file which can encapsulate a row of the imported data
– Useful for serializing and deserializing data in subsequent MapReduce jobs

11-19

Custom Sqoop Connectors
! Cloudera has partnered with other organizations to create custom Sqoop connectors
– Use a system's native protocols to access data rather than JDBC
– Provides much faster performance
! Current systems supported by custom connectors include:
– Netezza
– Teradata
– Oracle Database (connector developed with Quest Software)
! Others are in development
! Custom connectors are not open source, but are free
– Available from the Cloudera Web site

11-20

Sqoop: Basic Syntax
! Standard syntax:
sqoop tool-name [tool-options]

! Tools include:
import
import-all-tables
list-tables

! Options include:
--connect
--username
--password

11-21

Sqoop: Example
! Example: import a table called employees from a database called personnel in a MySQL RDBMS
sqoop import --username fred --password derf \
  --connect jdbc:mysql://database.example.com/personnel \
  --table employees

! Example: as above, but only records with an ID greater than 1000
sqoop import --username fred --password derf \
  --connect jdbc:mysql://database.example.com/personnel \
  --table employees \
  --where "id > 1000"

11-22

Sqoop: Other Options
! Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command
sqoop export [options]

! For general Sqoop help:
sqoop help

! For help on a particular command:
sqoop help command

11-23

Chapter Topics
Integrating Hadoop into the Enterprise Workflow

The Hadoop Ecosystem

! Integrating Hadoop into an existing enterprise
! Loading data into HDFS from an RDBMS using Sqoop
! Hands-On Exercise: Importing Data With Sqoop
! Managing real-time data using Flume
! Accessing HDFS from legacy systems with FuseDFS and HttpFS
! Conclusion

11-24

Hands-On Exercise: Importing Data With Sqoop
! In this Hands-On Exercise, you will import data into HDFS from MySQL
! Please refer to the Hands-On Exercise Manual

11-25

Chapter Topics
Integrating Hadoop into the Enterprise Workflow

The Hadoop Ecosystem

! Integrating Hadoop into an existing enterprise
! Loading data into HDFS from an RDBMS using Sqoop
! Hands-On Exercise: Importing Data With Sqoop
! Managing real-time data using Flume
! Accessing HDFS from legacy systems with FuseDFS and HttpFS
! Conclusion

Flume: Basics
! Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
– Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
! Flume is Open Source
– Initially developed by Cloudera
! Flume's design goals:
– Reliability
– Scalability
– Manageability
– Extensibility

Flume: High-Level Overview
[Diagram: many Flume agents feed collector agents, with optional compression, batching, and encryption en route to HDFS]
! Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
! Parallelized writes across many collectors – as much write throughput as required
! Optionally process incoming data: perform transformations, suppressions, metadata enrichment
! Each agent can be configured with an in-memory or durable channel

Flume Agent Characteristics
! Each Flume agent has a source, a sink and a channel
! Source
– Tells the node where to receive data from
! Sink
– Tells the node where to send data to
! Channel
– A queue between the Source and Sink
– Can be in-memory only or durable
– Durable channels will not lose data if power is lost
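The source/sink/channel anatomy maps directly onto Flume's properties-file configuration. The fragment below is an illustrative sketch only — the agent, source, channel, and sink names (agent1, src1, ch1, sink1), the tailed log path, and the HDFS path are hypothetical, not from the course:

```properties
# One agent with one source, one durable channel, and one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: follow a log file via a shell command
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

# Durable file channel: events survive a power loss
agent1.channels.ch1.type = file

# Sink: deliver events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs
agent1.sinks.sink1.channel = ch1
```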

Flume's Design Goals: Reliability
! Channels provide Flume's reliability
! Memory Channel
– Data will be lost if power is lost
! File Channel
– Data stored on disk
– Guarantees durability of data in face of a power loss
! Data transfer between Agents and Channels is transactional
– A failed data transfer to a downstream agent rolls back and retries
! Can configure multiple Agents with the same task
– e.g., two Agents doing the job of one "collector" – if one agent fails then upstream agents would fail over

Flume's Design Goals: Scalability
! Scalability
– The ability to increase system performance linearly – or better – by adding more resources to the system
– Flume scales horizontally
– As load increases, more machines can be added to the configuration

Flume's Design Goals: Manageability
! Manageability
– The ability to control data flows, monitor nodes, modify the settings, and control outputs of a large system
! Configuration is loaded from a properties file
– Properties file can be reloaded on the fly
– File must be pushed out to each node (using scp, Puppet, Chef, etc.)

Flume's Design Goals: Extensibility
! Extensibility
– The ability to add new functionality to a system
! Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
– General Sources include data from files, syslog, and standard output from a process
– General Sinks include files on the local filesystem or HDFS
– Developers can write their own Sources or Sinks

Flume: Usage Patterns
! Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls and mailservers into HDFS
! Currently in use in many large organizations, ingesting millions of events per day
– At least one organization is using Flume to ingest over 200 million events per day
! Flume is typically installed and configured by a system administrator
– Check the Flume documentation if you intend to install it yourself

Chapter Topics
Integrating Hadoop into the Enterprise Workflow

The Hadoop Ecosystem

! Integrating Hadoop into an existing enterprise
! Loading data into HDFS from an RDBMS using Sqoop
! Hands-On Exercise: Importing Data With Sqoop
! Managing real-time data using Flume
! Accessing HDFS from legacy systems with FuseDFS and HttpFS
! Conclusion

FuseDFS and HttpFS: Motivation
! Many applications generate data which will ultimately reside in HDFS
! If Flume is not an appropriate solution for ingesting the data, some other method must be used
! Typically this is done as a batch process
! Problem: many legacy systems do not understand HDFS
– Difficult to write to HDFS if the application is not written in Java
– May not have Hadoop installed on the system generating the data
! We need some way for these systems to access HDFS

FuseDFS
! FuseDFS is based on FUSE (Filesystem in USEr space)
! Allows you to mount HDFS as a regular filesystem
! Note: HDFS limitations still exist!
– Not intended as a general-purpose filesystem
– Files are write-once
– Not optimized for low latency
! FuseDFS is included as part of the Hadoop distribution

HttpFS
! Provides an HTTP/HTTPS REST interface to HDFS
– Supports both reads and writes from/to HDFS
– Can be accessed from within a program
– Can be used via command-line tools such as curl or wget
! Client accesses the HttpFS server
– HttpFS server then accesses HDFS
! Example: curl http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt returns the contents of the HDFS /user/foo/README.txt file

REST: REpresentational State Transfer
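Because HttpFS speaks plain HTTP, any language can reach HDFS through it. As a small illustration (not from the course), the snippet below builds the REST URL for reading a file; the host, port, and user name are placeholders, while `op=OPEN` and `user.name` are standard WebHDFS query parameters:

```python
from urllib.parse import urlencode

def httpfs_open_url(host, path, user, port=14000):
    """Build the WebHDFS URL that asks HttpFS to return a file's contents."""
    query = urlencode({"op": "OPEN", "user.name": user})
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)

url = httpfs_open_url("httpfs-host", "/user/foo/README.txt", "foo")
```

The contents could then be fetched with `urllib.request.urlopen(url).read()` from a machine that can reach the HttpFS server.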

Chapter Topics
Integrating Hadoop into the Enterprise Workflow

The Hadoop Ecosystem

! Integrating Hadoop into an existing enterprise
! Loading data into HDFS from an RDBMS using Sqoop
! Hands-On Exercise: Importing Data With Sqoop
! Managing real-time data using Flume
! Accessing HDFS from legacy systems with FuseDFS and HttpFS
! Conclusion

Conclusion
In this chapter you have learned
! How Hadoop can be integrated into an existing enterprise
! How to load data from an existing RDBMS into HDFS by using Sqoop
! How to manage real-time data such as log files using Flume
! How to access HDFS from legacy systems with FuseDFS and HttpFS

Machine Learning and Mahout
Chapter 12

Course Chapters
! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Course Introduction
Introduction to Apache Hadoop and its Ecosystem
Basic Programming with the Hadoop Core API
Problem Solving with MapReduce
The Hadoop Ecosystem
Course Conclusion and Appendices

Machine Learning and Mahout
In this chapter you will learn
! Machine Learning basics
! Mahout basics

Chapter Topics
Machine Learning and Mahout

The Hadoop Ecosystem

! Introduction to Machine Learning
! Using Mahout
! Hands-On Exercise: Using a Mahout Recommender
! Conclusion

Machine Learning: Introduction
! Machine Learning is a complex discipline
! Much research is ongoing
! Here we merely give a very high-level overview of some aspects of ML

What Is Machine Learning Not?
! Most programs tell computers exactly what to do
– Database transactions and queries
– Controllers
– Phone systems, manufacturing processes, transport, weaponry, etc.
– Media delivery
– Simple search
– Social systems
– Chat, blogs, e-mail etc.

What Is Machine Learning?
! An alternative technique is to have computers learn what to do
! Machine Learning refers to a few classes of program that leverage collected data to drive future program behavior
! This represents another major opportunity to gain value from data

Why Use Hadoop for Machine Learning?
! Machine Learning systems are sensitive to the skill you bring to them
! However, practitioners often agree [Banko and Brill, 2001]:

"It's not who has the best algorithms that wins. It's who has the most data."
or
"There's no data like more data."

The Three C's
! Machine Learning is an active area of research and new applications
! There are three well-established categories of techniques for exploiting data:
– Collaborative filtering (recommendations)
– Clustering
– Classification

Collaborative Filtering
! Collaborative Filtering is a technique for recommendations
! Example application: given people who each like certain books, learn to suggest what someone may like based on what they already like
! Very useful in helping users navigate data by expanding to topics that have affinity with their established interests
! Collaborative Filtering algorithms are agnostic to the different types of data items involved
– So they are equally useful in many different domains
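The idea can be sketched in a few lines of plain Python (a toy illustration, not Mahout): items are compared with the Tanimoto coefficient — shared users divided by total users — over the sets of users who like each item. The data below is invented:

```python
def tanimoto(a, b):
    """Tanimoto coefficient: |intersection| / |union| of two sets."""
    return len(a & b) / float(len(a | b)) if (a or b) else 0.0

# Invented data: for each item, the set of users who like it
likes = {
    "book_a": {"alice", "bob", "carol"},
    "book_b": {"alice", "bob"},
    "book_c": {"dave"},
}

def most_similar(item):
    """Recommend the item whose audience overlaps most with this one's."""
    others = [i for i in likes if i != item]
    return max(others, key=lambda i: tanimoto(likes[item], likes[i]))
```

Here `most_similar("book_a")` picks `book_b`, since two of book_a's three fans also like book_b. Note that the code never inspects what the items are — the same logic works for books, movies, or toothpaste.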

Clustering
! Clustering algorithms discover structure in collections of data
– Where no formal structure previously existed
! They discover what clusters, or groupings, naturally occur in data
! Examples:
– Finding related news articles
– Computer vision (groups of pixels that cohere into objects)
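One of the simplest clustering algorithms is k-means, which appears later in the Mahout algorithm list. A minimal one-dimensional sketch (toy code, not Mahout; the data points are invented):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 1-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its members."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups of points collapse to two centroids
print(kmeans([1.0, 1.1, 0.9, 9.0, 9.1, 8.9], k=2))
```

On this toy input the centroids settle near 1.0 and 9.0 — the algorithm found the two groupings without being told they existed.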

Classification
! The previous two techniques are considered unsupervised learning
– The algorithm discovers groups or recommendations itself
! Classification is a form of supervised learning
! A classification system takes a set of data records with known labels
– Learns how to label new records based on that information
! Examples:
– Given a set of e-mails identified as spam/not spam, label new e-mails as spam/not spam
– Given tumors identified as benign or malignant, classify new tumors
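A minimal supervised-learning sketch — a toy 1-nearest-neighbour classifier, not anything from Mahout: each training record carries a known label, and a new record takes the label of its closest neighbour. The feature values below are invented:

```python
def classify(labeled, point):
    """1-nearest-neighbour: label a new record with the label of the
    closest known record."""
    nearest = min(labeled, key=lambda rec: abs(rec[0] - point))
    return nearest[1]

# Invented training data: (feature value, known label)
training = [(0.2, "spam"), (0.3, "spam"), (0.8, "not spam"), (0.9, "not spam")]
```

`classify(training, 0.25)` returns "spam": the closest labelled record is (0.2, "spam"). Real classifiers generalize far better than this, but the supervised shape — labelled records in, labels for new records out — is the same.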

Chapter Topics
Machine Learning and Mahout

The Hadoop Ecosystem

! Introduction to Machine Learning
! Using Mahout
! Hands-On Exercise: Using a Mahout Recommender
! Conclusion

Mahout: A Machine Learning Library
! Mahout is a Machine Learning library written in Java
– Included in CDH3 onwards
– Contains algorithms for each of the categories listed
! Algorithms included in Mahout:
– Recommendation: Pearson correlation, log likelihood, Spearman correlation, Tanimoto coefficient, singular value decomposition (SVD), linear interpolation, cluster-based recommenders
– Clustering: k-means clustering, canopy clustering, fuzzy k-means, latent Dirichlet analysis (LDA)
– Classification: stochastic gradient descent (SGD), support vector machine (SVM), naive Bayes, complementary naive Bayes, random forests

Mahout: A Machine Learning Library (cont'd)
! Some Mahout algorithms can be used by stand-alone programs
! Many are optimized to work with Hadoop
! Mahout also comes with some pre-built scripts to analyze data
– We will use one of these in the Hands-On Exercise
! The libraries are data agnostic
– Example: the Recommender engines don't care whether you are getting recommendations for books, music, movies, or brands of toothpaste

Chapter Topics
Machine Learning and Mahout

The Hadoop Ecosystem

! Introduction to Machine Learning
! Using Mahout
! Hands-On Exercise: Using a Mahout Recommender
! Conclusion

Hands-On Exercise: Using a Mahout Recommender
! In this Hands-On Exercise, you will use a Mahout recommender to generate a set of movie recommendations
! Please refer to the Hands-On Exercise Manual

Chapter Topics
Machine Learning and Mahout

The Hadoop Ecosystem

! Introduction to Machine Learning
! Using Mahout
! Hands-On Exercise: Using a Mahout Recommender
! Conclusion

Conclusion
In this chapter you have learned
! Machine Learning basics
! Mahout basics

An Introduction to Hive and Pig
Chapter 13

Course Chapters
! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Course Introduction
Introduction to Apache Hadoop and its Ecosystem
Basic Programming with the Hadoop Core API
Problem Solving with MapReduce
The Hadoop Ecosystem
Course Conclusion and Appendices

An Introduction to Hive and Pig
In this chapter you will learn
! What features Hive provides
! What features Pig provides
! How to choose between Pig and Hive

Chapter Topics
An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Hive and Pig: Motivation
! MapReduce code is typically written in Java
– Although it can be written in other languages using Hadoop Streaming
! Requires:
– A programmer
– Who is a good Java programmer
– Who understands how to think in terms of MapReduce
– Who understands the problem they're trying to solve
– Who has enough time to write and test the code
– Who will be available to maintain and update the code in the future as requirements change

Hive and Pig: Motivation (cont'd)
! Many organizations have only a few developers who can write good MapReduce code
! Meanwhile, many other people want to analyze data
– Business analysts
– Data scientists
– Statisticians
– Data analysts
! What's needed is a higher-level abstraction on top of MapReduce
– Providing the ability to query the data without needing to know MapReduce intimately
– Hive and Pig address these needs

Chapter Topics
An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Hive: Introduction
! Hive was originally developed at Facebook
– Provides a very SQL-like language
– Can be used by people who know SQL
– Under the covers, generates MapReduce jobs that run on the Hadoop cluster
– Enabling Hive requires almost no extra work by the system administrator
! Hive is now a top-level Apache Software Foundation project

The Hive Data Model
! Hive 'layers' table definitions on top of data in HDFS
! Tables
– Typed columns (int, float, string, boolean and so on)
– Also array, struct, map (for JSON-like data)
! Partitions
– e.g., to range-partition tables by date
! Buckets
– Hash partitions within ranges (useful for sampling, join optimization)

Hive Data Types
! Primitive types:
– TINYINT
– SMALLINT
– INT
– BIGINT
– FLOAT
– DOUBLE
– BOOLEAN
– STRING
– BINARY (available starting in CDH4)
– TIMESTAMP (available starting in CDH4)
! Type constructors:
– ARRAY < primitive-type >
– MAP < primitive-type, data-type >
– STRUCT < col-name : data-type, ... >

The Hive Metastore
! Hive's Metastore is a database containing table definitions and other metadata
– By default, stored locally on the client machine in a Derby database
– If multiple people will be using Hive, the system administrator should create a shared Metastore
– Usually in MySQL or some other relational database server

Hive Data: Physical Layout
! Hive tables are stored in Hive's 'warehouse' directory in HDFS
– By default, /user/hive/warehouse
! Tables are stored in subdirectories of the warehouse directory
– Partitions form subdirectories of tables
! Possible to create external tables if the data is already in HDFS and should not be moved from its current location
! Actual data is stored in flat files
– Control character-delimited text, or SequenceFiles
– Can be in arbitrary format with the use of a custom Serializer/Deserializer (SerDe)

Starting The Hive Shell
! To launch the Hive shell, start a terminal and run
$ hive

! Results in the Hive prompt:
hive>

Hive Basics: Creating Tables
hive> SHOW TABLES;

hive> CREATE TABLE shakespeare
(freq INT, word STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

hive> DESCRIBE shakespeare;

Loading Data Into Hive
! Data is loaded into Hive with the LOAD DATA INPATH statement
– Assumes that the data is already in HDFS
LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;

! If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
– Automatically loads it into HDFS in the correct directory

Using Sqoop to Import Data into Hive Tables
! The Sqoop option --hive-import will automatically create a Hive table from the imported data
– Imports the data
– Generates the Hive CREATE TABLE statement based on the table definition in the RDBMS
– Runs the statement
– Note: This will move the imported table into Hive's warehouse directory

Basic SELECT Queries
! Hive supports most familiar SELECT syntax
hive> SELECT * FROM shakespeare LIMIT 10;

hive> SELECT * FROM shakespeare
WHERE freq > 100 ORDER BY freq ASC
LIMIT 10;

Joining Tables
! Joining datasets is a complex operation in standard Java MapReduce
– We saw this earlier in the course
! In Hive, it's easy!
SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;

Storing Output Results
! The SELECT statement on the previous slide would write the data to the console
! To store the results in HDFS, create a new table then write, for example:
INSERT OVERWRITE TABLE newTable
SELECT s.word, s.freq, k.freq FROM
shakespeare s JOIN kjv k ON
(s.word = k.word)
WHERE s.freq >= 5;

– Results are stored in the table
– Results are just files within the newTable directory
– Data can be used in subsequent queries, or in MapReduce jobs

Using User-Defined Code
! Hive supports manipulation of data via User-Defined Functions (UDFs)
– Written in Java
! Also supports user-created scripts written in any language via the TRANSFORM operator
– Essentially leverages Hadoop Streaming
– Example:
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
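The script name weekday_mapper.py comes from the TRANSFORM example above; the following is a hedged reconstruction of what such a streaming transform script might look like. The column layout and the use of UTC for the conversion are assumptions:

```python
from datetime import datetime, timezone

def transform(line):
    """Replace the unix-timestamp column of a tab-separated row with the
    ISO day of the week (1 = Monday … 7 = Sunday)."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    day = datetime.fromtimestamp(float(unixtime), tz=timezone.utc).isoweekday()
    return "\t".join([userid, movieid, rating, str(day)])

# In the real streaming script, rows would arrive on standard input:
#   import sys
#   for line in sys.stdin:
#       print(transform(line))
```

Hive pipes each row of (userid, movieid, rating, unixtime) to the script's stdin and reads the transformed row back from stdout — the script itself knows nothing about Hive.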

Hive Limitations
! Not all 'standard' SQL is supported
– Subqueries are only supported in the FROM clause
– No correlated subqueries
! No support for UPDATE or DELETE
! No support for INSERTing single rows

Hive: Where To Learn More
! Main Web site is at http://hive.apache.org/
! Cloudera training course: Cloudera Training for Apache Hive And Pig

Chapter Topics
An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Hands-On Exercise: Manipulating Data With Hive
! In this Hands-On Exercise, you will manipulate a dataset using Hive
! Please refer to the Hands-On Exercise Manual

Chapter Topics
An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Pig: Introduction
! Pig was originally created at Yahoo! to answer a similar need to Hive
– Many developers did not have the Java and/or MapReduce knowledge required to write standard MapReduce programs
– But still needed to query data
! Pig is a high-level platform for creating MapReduce programs
– Language is called PigLatin
– Relatively simple syntax
– Under the covers, PigLatin scripts are turned into MapReduce jobs and executed on the cluster
! Pig is now a top-level Apache project

Pig Installation
! Installation of Pig requires no modification to the cluster
! The Pig interpreter runs on the client machine
– Turns PigLatin into standard Java MapReduce jobs, which are then submitted to the JobTracker
! There is (currently) no shared metadata, so no need for a shared metastore of any kind

Pig Concepts
! In Pig, a single element of data is an atom
! A collection of atoms – such as a row, or a partial row – is a tuple
! Tuples are collected together into bags
! Typically, a PigLatin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has

Pig Features
! Pig supports many features which allow developers to perform sophisticated data analysis without having to write Java MapReduce code
– Joining datasets
– Grouping data
– Referring to elements by position rather than name
– Useful for datasets with many elements
– Loading non-delimited data using a custom SerDe
– Creation of user-defined functions, written in Java
– And more

Using the Grunt Shell to Run PigLatin
! Starting Grunt
$ pig
grunt>

! Useful commands:
$ pig -help (or -h)
$ pig -version (-i)
$ pig -execute (-e)
$ pig script.pig

A Sample Pig Script
emps = LOAD 'people' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people';

! Here, we load a directory of data into a bag called emps
! Then we create a new bag called rich which contains just those records where the salary portion is greater than 100000
! Next we sort rich by salary, in descending order, into a new bag called srtd
! Finally, we write the contents of the srtd bag to a new directory in HDFS
– By default, the data will be written in tab-separated format
! Alternatively, to write the contents of a bag to the screen, say
DUMP srtd;

More PigLatin
! To view the structure of a bag:
DESCRIBE bagname;

! Joining two datasets:
data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd = JOIN data1 BY col3, data2 BY colA;
STORE jnd INTO 'outfile';

More PigLatin: Grouping
! Grouping:
grpd = GROUP bag1 BY elementX

! Creates a new bag
– Each tuple in grpd has an element called group, and an element called bag1
– The group element has a unique value for elementX from bag1
– The bag1 element is itself a bag, containing all the tuples from bag1 with that value for elementX
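As a mental model only (plain Python, not Pig), the GROUP semantics described above behave like building a dictionary from each group key to the list of tuples that share it; the sample tuples are invented:

```python
from collections import defaultdict

# Invented tuples: (elementX, value)
bag1 = [("alice", 3), ("bob", 5), ("alice", 7)]

# Collect every tuple sharing a value of elementX into a nested bag
grouped = defaultdict(list)
for t in bag1:
    grouped[t[0]].append(t)

# Shape of the GROUP result: one tuple per key, with 'group' and 'bag1'
grpd = [{"group": k, "bag1": v} for k, v in sorted(grouped.items())]
```

Each entry of `grpd` mirrors a Pig tuple: `group` holds a unique elementX value, and `bag1` holds all the original tuples with that value.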

More PigLatin: FOREACH
! The FOREACH...GENERATE statement iterates over members of a bag
! Example:
justnames = FOREACH emps GENERATE name;

! Can combine with COUNT:
summedUp = FOREACH grpd GENERATE group,
COUNT(bag1) AS elementCount;

Pig: Where To Learn More
! Main Web site is at http://pig.apache.org
! To locate the Pig documentation:
– For CDH3, select the Release 0.8.1 link under 'documentation' on the left side of the page
– For CDH4, select the Release 0.9.2 link under 'documentation' on the left side of the page
! Cloudera training course: Cloudera Training for Apache Hive And Pig

Chapter Topics
An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Hands-On Exercise: Using Pig to Retrieve Movie Names From Our Recommender
! In this Hands-On Exercise, you will use Pig to take the data you generated with Mahout earlier in the course and produce the actual movie names that have been recommended
! Please refer to the Hands-On Exercise Manual

Chapter Topics
An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Choosing Between Pig and Hive
! Typically, organizations wanting an abstraction on top of standard MapReduce will choose to use either Hive or Pig
! Which one is chosen depends on the skillset of the target users
– Those with an SQL background will naturally gravitate towards Hive
– Those who do not know SQL will often choose Pig
! Each has strengths and weaknesses; it is worth spending some time investigating each so you can make an informed decision
! Some organizations are now choosing to use both
– Pig deals better with less-structured data, so Pig is used to manipulate the data into a more structured form, then Hive is used to query that structured data

Chapter Topics

An Introduction to Hive and Pig

The Hadoop Ecosystem

! The motivation for Hive and Pig
! Hive basics
! Hands-On Exercise: Manipulating Data with Hive
! Pig basics
! Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
! Choosing between Hive and Pig
! Conclusion

Conclusion

In this chapter you have learned
!What features Hive provides
!What features Pig provides
!How to choose between Pig and Hive

An Introduction to Oozie
Chapter 14

Course Chapters

Course Introduction

! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Introduction to Apache Hadoop and its Ecosystem

Basic Programming with the Hadoop Core API

Problem Solving with MapReduce

The Hadoop Ecosystem

Course Conclusion and Appendices

An Introduction to Oozie

In this chapter you will learn
!What Oozie is
!How to create Oozie workflows

Chapter Topics

An Introduction to Oozie

The Hadoop Ecosystem

! Introduction to Oozie
! Creating Oozie workflows
! Hands-On Exercise: Running an Oozie Workflow
! Conclusion

The Motivation for Oozie

!Many problems cannot be solved with a single MapReduce job
!Instead, a workflow of jobs must be created
!Simple workflow:
  Run Job A
  Use output of Job A as input to Job B
  Use output of Job B as input to Job C
  Output of Job C is the final required output
!Easy if the workflow is linear like this
  Can be created as standard Driver code

(Diagram: Start Data → Job A → Job B → Job C → Final Result)
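Such a linear chain amounts to ordinary Driver-style code: run each job and feed its output to the next. The sketch below simulates that pattern with plain Java methods standing in for real Hadoop job submissions; the three job bodies are hypothetical placeholders, not course code.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LinearWorkflow {
    // Stand-ins for MapReduce jobs: each consumes the previous job's output.
    static List<String> jobA(List<String> input) {
        return input.stream().map(String::toLowerCase).collect(Collectors.toList());
    }

    static List<String> jobB(List<String> input) {
        return input.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList());
    }

    static List<String> jobC(List<String> input) {
        return input.stream().sorted().collect(Collectors.toList());
    }

    // The "Driver": run Job A, pipe its output into Job B, then into Job C.
    static List<String> run(List<String> startData) {
        return jobC(jobB(jobA(startData)));
    }
}
```

With real Hadoop jobs the same shape appears as sequential `Job.waitForCompletion()` calls, where each job's output directory becomes the next job's input path.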

The Motivation for Oozie (cont'd)

!If the workflow is more complex, Driver code becomes much more difficult to maintain
!Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
!Example: including Hive or Pig jobs as part of the workflow

What Is Oozie?

!Oozie is a workflow engine
!Runs on a server
  Typically outside the cluster
!Runs workflows of Hadoop jobs
  Including Pig, Hive, and Sqoop jobs
  Submits those jobs to the cluster based on a workflow definition
!Workflow definitions are submitted via HTTP
!Jobs can be run at specific times
  One-off or recurring jobs
!Jobs can be run when data is present in a directory

Chapter Topics

An Introduction to Oozie

The Hadoop Ecosystem

! Introduction to Oozie
! Creating Oozie workflows
! Hands-On Exercise: Running an Oozie Workflow
! Conclusion

Oozie Workflow Basics

!Oozie workflows are written in XML
!A workflow is a collection of actions
  MapReduce jobs, Pig jobs, Hive jobs, etc.
!A workflow consists of control-flow nodes and action nodes
!Control-flow nodes define the beginning and end of a workflow
  They provide methods to determine the workflow execution path
  Example: run multiple jobs simultaneously
!Action nodes trigger the execution of a processing task, such as
  A MapReduce job
  A Pig job
  A Sqoop data import job

Simple Oozie Example

!Simple example workflow for WordCount:

Simple Oozie Example (cont'd)

<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

Simple Oozie Example (cont'd)

A workflow is wrapped in the workflow-app element

Simple Oozie Example (cont'd)

The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.

Simple Oozie Example (cont'd)

The wordcount action node defines a map-reduce action: a standard Java MapReduce job.

Simple Oozie Example (cont'd)

Within the action, we define the job's properties.

Simple Oozie Example (cont'd)

We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.

Simple Oozie Example (cont'd)

Every workflow must have an end node. This indicates that the workflow has completed successfully.

Simple Oozie Example (cont'd)

If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Other Oozie Control Nodes

!A decision control node allows Oozie to determine the workflow execution path based on some criteria
  Similar to a switch/case statement
!fork and join control nodes split one execution path into multiple execution paths which run concurrently
  fork splits the execution path
  join waits for all concurrent execution paths to complete before proceeding
  fork and join are used in pairs
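As a sketch of how a fork/join pair looks in workflow XML (the node and action names here are hypothetical, not from the course example):

```xml
<fork name="forking">
    <path start="job-a"/>
    <path start="job-b"/>
</fork>

<!-- job-a and job-b are action nodes; each one's <ok> transition points to "joining" -->

<join name="joining" to="next-action"/>
```

Both paths started by the fork must eventually transition to the same join node, which then transitions to the next node in the workflow.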

Oozie Workflow Action Nodes

Node Name     Description
map-reduce    Runs either a Java MapReduce or Streaming job
fs            Create directories, move or delete files or directories
java          Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig           Runs a Pig job
hive          Runs a Hive job
sqoop         Runs a Sqoop job
email         Sends an e-mail message

Submitting an Oozie Workflow

!To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie -config config_file -run

!Oozie can also be called from within a Java program
  Via the Oozie client API

More on Oozie

Information                                              Resource
Oozie installation and configuration                     CDH Installation Guide: http://docs.cloudera.com
Oozie workflows and actions                              https://oozie.apache.org
The procedure of running a MapReduce job using Oozie     https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

Chapter Topics

An Introduction to Oozie

The Hadoop Ecosystem

! Introduction to Oozie
! Creating Oozie workflows
! Hands-On Exercise: Running an Oozie Workflow
! Conclusion

Hands-On Exercise: Running an Oozie Workflow

!In this Hands-On Exercise you will run Oozie jobs
!Please refer to the Hands-On Exercise Manual

Chapter Topics

An Introduction to Oozie

The Hadoop Ecosystem

! Introduction to Oozie
! Creating Oozie workflows
! Hands-On Exercise: Running an Oozie Workflow
! Conclusion

Conclusion

In this chapter you have learned
!What Oozie is
!How to create Oozie workflows

Conclusion
Chapter 15

Course Chapters

Course Introduction

! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Introduction to Apache Hadoop and its Ecosystem

Basic Programming with the Hadoop Core API

Problem Solving with MapReduce

The Hadoop Ecosystem

Course Conclusion and Appendices

Conclusion

During this course, you have learned:
!The core technologies of Hadoop
!How HDFS and MapReduce work
!How to develop MapReduce applications
!How to unit test MapReduce applications
!How to use MapReduce combiners, partitioners, and the distributed cache
!Best practices for developing and debugging MapReduce applications
!How to implement data input and output in MapReduce applications

Conclusion (cont'd)

!Algorithms for common MapReduce tasks
!How to join data sets in MapReduce
!How Hadoop integrates into the data center
!How to use Mahout's Machine Learning algorithms
!How Hive and Pig can be used for rapid application development
!How to create large workflows using Oozie

Certification

!This course helps to prepare you for the Cloudera Certified Developer for Apache Hadoop exam
!For more information about Cloudera certification, refer to http://university.cloudera.com/certification.html

!Thank you for attending the course!
!If you have any questions or comments, please contact us via http://www.cloudera.com

Cloudera Enterprise
Appendix A

Course Chapters

Course Introduction

! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Introduction to Apache Hadoop and its Ecosystem

Basic Programming with the Hadoop Core API

Problem Solving with MapReduce

The Hadoop Ecosystem

Course Conclusion and Appendices

Cloudera Enterprise Core

!Includes support and management for all core components of CDH

Cloudera Manager

!Cloudera Manager provides enterprise-grade Hadoop deployment and management
!Built-in intelligence and best practices
!Integrates with Cloudera's support infrastructure

Cloudera Manager (cont'd)

Activity Monitor

Cloudera Enterprise RTD

!Includes support and management for HBase

Conclusion

!Cloudera Enterprise makes it easy to run open source Hadoop in production
!Includes
  Cloudera's Distribution including Apache Hadoop (CDH)
  Cloudera Manager
  Production Support
!Cloudera Manager enables you to:
  Simplify and accelerate Hadoop deployment
  Reduce the costs and risks of adopting Hadoop in production
  Reliably operate Hadoop in production with repeatable success
  Apply SLAs to Hadoop
  Increase control over Hadoop cluster provisioning and management

Graph Manipulation in MapReduce
Appendix B

Course Chapters

Course Introduction

! Introduction
! The Motivation for Hadoop
! Hadoop: Basic Concepts
! Writing a MapReduce Program
! Unit Testing MapReduce Programs
! Delving Deeper into the Hadoop API
! Practical Development Tips and Techniques
! Data Input and Output
! Common MapReduce Algorithms
! Joining Data Sets in MapReduce Jobs
! Integrating Hadoop into the Enterprise Workflow
! Machine Learning and Mahout
! An Introduction to Hive and Pig
! An Introduction to Oozie
! Conclusion
! Cloudera Enterprise
! Graph Manipulation in MapReduce

Introduction to Apache Hadoop and its Ecosystem

Basic Programming with the Hadoop Core API

Problem Solving with MapReduce

The Hadoop Ecosystem

Course Conclusion and Appendices

Graph Manipulation in MapReduce

In this appendix you will learn
!What graphs are
!Best practices for representing graphs in Hadoop
!How to implement a single-source shortest-path algorithm in MapReduce

Chapter Topics

Graph Manipulation in MapReduce

Course Conclusion and Appendices

! Graphs
! Best practices for representing graphs in MapReduce
! Implementing a single-source shortest-path algorithm in MapReduce
! Conclusion

Introduction: What Is a Graph?

!Loosely speaking, a graph is a set of vertices, or nodes, connected by edges, or lines
!There are many different types of graphs
  Directed
  Undirected
  Cyclic
  Acyclic
  Weighted
  Unweighted
  A DAG (Directed Acyclic Graph) is a very common graph type

What Can Graphs Represent?

!Graphs are everywhere
  Hyperlink structure of the Web
  Physical structure of computers on a network
  Roadmaps
  Airline flights
  Social networks

Examples of Graph Problems

!Finding the shortest path through a graph
  Routing Internet traffic
  Giving driving directions
!Finding the minimum spanning tree
  Lowest-cost way of connecting all nodes in a graph
  Example: a telecoms company laying fiber
    Must cover all customers
    Need to minimize fiber used
!Finding maximum flow
  Move the most amount of traffic through a network
  Example: airline scheduling

Examples of Graph Problems (cont'd)

!Finding critical nodes without which a graph would break into disjoint components
  Controlling the spread of epidemics
  Breaking up terrorist cells

Graphs and MapReduce

!Graph algorithms typically involve:
  Performing computations at each vertex
  Traversing the graph in some manner
!Key questions:
  How do we represent graph data in MapReduce?
  How do we traverse a graph in MapReduce?

Chapter Topics

Graph Manipulation in MapReduce

Course Conclusion and Appendices

! Graphs
! Best practices for representing graphs in MapReduce
! Implementing a single-source shortest-path algorithm in MapReduce
! Conclusion

Representing Graphs

!Imagine we want to represent this simple graph:

(Diagram: a directed graph on four vertices, 1-4, whose edges are given by the adjacency lists on the following pages)

!Two approaches:
  Adjacency matrices
  Adjacency lists

Adjacency Matrices

!Represent the graph as an n x n square matrix

      v1  v2  v3  v4
  v1   0   1   0   1
  v2   1   0   1   1
  v3   1   0   0   0
  v4   1   0   1   0

Adjacency Matrices: Critique

!Advantages:
  Naturally encapsulates iteration over nodes
  Rows and columns correspond to inlinks and outlinks
!Disadvantages:
  Lots of zeros for sparse matrices
  Lots of wasted space

Adjacency Lists

!Take an adjacency matrix and throw away all the zeros

      v1  v2  v3  v4
  v1   0   1   0   1
  v2   1   0   1   1
  v3   1   0   0   0
  v4   1   0   1   0

v1: v2, v4
v2: v1, v3, v4
v3: v1
v4: v1, v3

Adjacency Lists: Critique

!Advantages:
  Much more compact representation
  Easy to compute outlinks
  Graph structure can be broken up and distributed
!Disadvantages:
  More difficult to compute inlinks

Encoding Adjacency Lists

!Adjacency lists are the preferred way of representing graphs in MapReduce
  Typically we represent each vertex (node) with an ID number
  A field of type long usually suffices
!Typical encoding format (Writable)
  long: vertex ID of the source
  int: number of outgoing edges
  Sequence of longs: destination vertices

v1: v2, v4        1: [2] 2, 4
v2: v1, v3, v4    2: [3] 1, 3, 4
v3: v1            3: [1] 1
v4: v1, v3        4: [2] 1, 3
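This encoding can be sketched with plain java.io streams; it mirrors what a custom Writable's write() and readFields() methods would do, though the class and method names below are illustrative rather than Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class AdjacencyListCodec {
    // Write the record in the order described above:
    // vertex ID (long), edge count (int), then one long per destination.
    public static byte[] encode(long vertexId, long[] destinations) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(vertexId);
            out.writeInt(destinations.length);
            for (long d : destinations) out.writeLong(d);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the fields back in exactly the order they were written.
    public static long[] decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            long vertexId = in.readLong();
            int n = in.readInt();
            long[] record = new long[n + 1];  // record[0] = vertex ID, rest = destinations
            record[0] = vertexId;
            for (int i = 0; i < n; i++) record[i + 1] = in.readLong();
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For vertex 2 with edges to 1, 3, and 4 this produces 8 + 4 + 3 × 8 = 36 bytes, far smaller than a full matrix row for a sparse graph.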

Chapter Topics

Graph Manipulation in MapReduce

Course Conclusion and Appendices

! Graphs
! Best practices for representing graphs in MapReduce
! Implementing a single-source shortest-path algorithm in MapReduce
! Conclusion

Single-Source Shortest Path

!Problem: find the shortest path from a source node to one or more target nodes
!Serial algorithm: Dijkstra's Algorithm
  Not suitable for parallelization
!MapReduce algorithm: parallel breadth-first search

Parallel Breadth-First Search

!The algorithm, intuitively:
  Distance to the source = 0
  For all nodes directly reachable from the source, distance = 1
  For all nodes reachable from some node n in the graph, distance from source = 1 + min(distance to n)

Parallel Breadth-First Search: Algorithm

!Mapper:
  Input key is some vertex ID
  Input value is D (distance from source), adjacency list
  Processing: for all nodes in the adjacency list, emit (node ID, D + 1)
    If the distance to this node is D, then the distance to any node reachable from this node is D + 1
!Reducer:
  Receives a vertex and a list of distance values
  Processing: selects the shortest distance value for that node

Iterations of Parallel BFS

!A MapReduce job corresponds to one iteration of parallel breadth-first search
  Each iteration advances the known frontier by one hop
  Iteration is accomplished by using the output from one job as the input to the next
!How many iterations are needed?
  Multiple iterations are needed to explore the entire graph
  As many as the diameter of the graph
  Graph diameters are surprisingly small, even for large graphs
    "Six degrees of separation"
!Controlling iterations in Hadoop
  Use counters; when you reach a node, count it
  At the end of each iteration, check the counters
  When you've reached all the nodes, you're finished

One More Trick: Preserving Graph Structure

!Characteristics of parallel BFS
  Mappers emit distances; Reducers select the shortest distance
  Output of the Reducers becomes the input of the Mappers for the next iteration
!Problem: where did the graph structure (adjacency lists) go?
!Solution: the Mapper must emit the adjacency lists as well
  The Mapper emits two types of key/value pairs
    Representing distances
    Representing adjacency lists
  The Reducer recovers the adjacency list and preserves it for the next iteration
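One iteration of the mapper/reducer pair described above can be simulated in plain Java, with a HashMap standing in for the shuffle. This is an illustrative sketch, not Hadoop API code; here the graph structure is simply passed back in as a parameter each round, playing the role of the re-emitted adjacency lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParallelBfsIteration {
    static final int UNREACHED = Integer.MAX_VALUE;

    // graph: vertex -> adjacency list; dist: vertex -> currently known distance.
    // Returns the updated distances after one map/reduce round.
    static Map<Integer, Integer> iterate(Map<Integer, List<Integer>> graph,
                                         Map<Integer, Integer> dist) {
        // "Map" phase: re-emit (vertex, D) so each node keeps its own distance,
        // then emit (neighbor, D + 1) for every node in the adjacency list.
        Map<Integer, List<Integer>> shuffled = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : graph.entrySet()) {
            int d = dist.getOrDefault(e.getKey(), UNREACHED);
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(d);
            if (d == UNREACHED) continue;  // this node has not been reached yet
            for (int neighbor : e.getValue())
                shuffled.computeIfAbsent(neighbor, k -> new ArrayList<>()).add(d + 1);
        }
        // "Reduce" phase: keep the minimum distance seen for each vertex.
        Map<Integer, Integer> next = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : shuffled.entrySet())
            next.put(e.getKey(), e.getValue().stream().min(Integer::compare).get());
        return next;
    }
}
```

On the four-vertex example graph from the earlier slides, starting from vertex 1, the first round discovers vertices 2 and 4 at distance 1, and the second round discovers vertex 3 at distance 2: one hop of the frontier per iteration.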

Parallel BFS: Pseudo-Code

From Lin & Dyer (2010), Data-Intensive Text Processing with MapReduce

Parallel BFS: Demonstration

!Your instructor will now demonstrate the parallel breadth-first search algorithm

Graph Algorithms: General Thoughts

!MapReduce is adept at manipulating graphs
  Store graphs as adjacency lists
!Typically, MapReduce graph algorithms are iterative
  Iterate until some termination condition is met
  Remember to pass the graph structure from one iteration to the next

Chapter Topics

Graph Manipulation in MapReduce

Course Conclusion and Appendices

! Graphs
! Best practices for representing graphs in MapReduce
! Implementing a single-source shortest-path algorithm in MapReduce
! Conclusion

Conclusion

In this appendix you have learned
!What graphs are
!Best practices for representing graphs in Hadoop
!How to implement a single-source shortest-path algorithm in MapReduce
