
A Seminar Report On

HADOOP

By Varun Narang MA 399 Seminar


IIT Guwahati, Roll Number: 09012332

Index of Topics:
1. Abstract
2. Introduction
3. What is MapReduce?
4. HDFS: Assumptions, Design, Concepts, The Communication Protocols, Robustness, Cluster Rebalancing, Data Integrity, Metadata Disk Failure, Snapshots
5. Data Organisation: Data Blocks, Staging, Replication Pipelining
6. Accessibility
7. Space Reclamation: File Deletes and Undeletes, Decrease Replication Factor, Hadoop Filesystems
8. Hadoop Archives


Abstract
Problem Statement:
The total amount of digital data in the world has exploded in recent years, primarily because of the information (or data) generated by enterprises all over the globe. The digital universe was estimated at 0.18 zettabytes in 2006, with a forecast of tenfold growth to 1.8 zettabytes by 2011 (1 zettabyte = 10^21 bytes).

The problem is that while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around 300 seconds. In 2010, 1 TB drives are the standard hard disk size, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

Solution Proposed: Parallelisation
An obvious solution to this problem is parallelisation. The input data is usually large, and the computation has to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. Reading 1 TB from a single hard drive takes a long time, but spreading the data over 100 different machines that read in parallel brings the read time down to under two minutes.

The key issues involved in this solution are:
1. Hardware failure
2. Combining the data after analysis (i.e. after reading)

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. It solves the problem of hardware failure through replication: the Hadoop Distributed File System (HDFS) keeps redundant copies of the data so that, in the event of failure, another copy is available. The second problem is solved by a simple programming model, MapReduce, which abstracts the problem from reading and writing data into a computation over a series of keys.

Even though HDFS and MapReduce are the most significant features of Hadoop, other subprojects provide complementary services. The subprojects of Hadoop include Core, Avro, Pig, HBase, ZooKeeper, Hive and Chukwa.
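The read times quoted in the problem statement can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `read_time_seconds` is hypothetical, 1 TB is taken as 1,000,000 MB, and the data is assumed to be spread evenly over the drives.

```python
def read_time_seconds(capacity_mb: float, rate_mb_per_s: float, drives: int = 1) -> float:
    """Seconds needed to read capacity_mb spread evenly over `drives` drives read in parallel."""
    return capacity_mb / drives / rate_mb_per_s


# 1990 drive: 1370 MB at 4.4 MB/s, roughly five minutes.
t_1990 = read_time_seconds(1370, 4.4)

# 2010 drive: 1 TB (assumed 1,000,000 MB) at 100 MB/s, about 2.8 hours.
t_2010 = read_time_seconds(1_000_000, 100)

# The same terabyte spread over 100 drives read in parallel.
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(round(t_1990), t_2010 / 3600, t_parallel)
```

Under these assumptions the parallel read takes 100 seconds, which is consistent with the "under two minutes" figure, ignoring any coordination overhead.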

Introduction
Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. A single machine with 1000 CPUs (i.e. a supercomputer with vast memory and storage) would be very expensive. Hadoop instead parallelizes the computation by tying smaller and more reasonably priced machines together into a single cost-effective compute cluster. The features of Hadoop that stand out are its simplified programming model and its efficient, automatic distribution of data and work across machines. We now take a closer look at these two main features of Hadoop and describe their important characteristics.

1. Data Distribution:
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. If a failure leaves data only partially stored, the affected chunks are re-replicated. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

Data is conceptually record-oriented in the Hadoop programming framework. Individual input files are broken into segments, and each segment is processed by a node. The Hadoop framework schedules the processes to run in proximity to the location of the data/records, using knowledge from the distributed file system. Each computation process running on a node operates on a subset of the data. Which data is operated on by which node is decided based on its proximity to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving the computation to the data, instead of moving the data to the computation, allows Hadoop to achieve high data locality, which in turn results in high performance.

2. MapReduce: Isolated Processes

Hadoop limits the amount of communication that processes can perform, as each individual record is processed by a task in isolation from the others. This makes the whole framework much more reliable. Programs must be written to conform to a particular programming model, named "MapReduce".

MapReduce is composed of two chief elements: Mappers and Reducers.
1. Data segments or records are processed in isolation by tasks called Mappers.
2. The output from the Mappers is then brought together by Reducers, where results from different Mappers are merged.

Separate nodes in a Hadoop cluster communicate implicitly. Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues. By restricting the communication between nodes, Hadoop makes the distributed system much more reliable. Individual node failures can be worked around by restarting tasks on other machines. The other workers continue to operate as though nothing went wrong, leaving the challenging aspects of partially restarting the program to the underlying Hadoop layer.

What is MapReduce?
MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. In other words, this machinery is hidden from the user, who can focus on the computational problem.

Note: This abstraction was inspired by the map and reduce primitives present in Lisp and many other functional languages.

The Programming Model:
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
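The Map and Reduce roles described above can be illustrated with a word count run in a single process. This is a minimal sketch, not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the grouping step stands in for the work that the MapReduce library does across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in the input value."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(_word: str, counts: Iterable[int]) -> Iterator[int]:
    """Reduce: merge all counts for one word into a single output value."""
    yield sum(counts)


def run_mapreduce(records: List[Tuple[str, str]]) -> Dict[str, List[int]]:
    # Group intermediate values by intermediate key (the library's job).
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Hand each key and an iterator over its values to the reduce function.
    return {k: list(reduce_fn(k, iter(groups[k]))) for k in sorted(groups)}


word_counts = run_mapreduce([("doc1", "the quick fox"), ("doc2", "The lazy dog")])
print(word_counts)
```

Here each Reduce invocation yields exactly one value, matching the "zero or one output value" case mentioned in the text.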

Map and Reduce (Associated Types):
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)

The input keys and values are drawn from a different domain than the output keys and values. Also, the intermediate keys and values are from the same domain as the output keys and values.

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Analyzing the Data with Hadoop MapReduce:
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. It splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.

Data Locality Optimisation:
Typically the compute nodes and the storage nodes are the same: the MapReduce framework and the distributed file system run on the same set of nodes. This configuration allows the framework to schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.

There are two types of nodes that control the job execution process:
1. jobtrackers
2. tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Input splits:
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.

Map tasks write their output to local disk, not to HDFS. Why? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. Storing it in HDFS, with replication, would therefore be a waste of time. It is also possible that the node running the map task fails before the map output has been consumed by the reduce task; in that case Hadoop reruns the map task on another node to recreate the map output.

Reduce tasks do not have the advantage of data locality: the input to a single reduce task is normally the output from all mappers.

In case we have a single reduce task that is fed by all of the map tasks:
The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with the other replicas stored on off-rack nodes.

[Figure: MapReduce data flow with a single reduce task]

When there are multiple reducers:
The map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition.
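The partitioning rule above (every record for a given key lands in exactly one partition) can be sketched with a hash of the key taken modulo the number of reduce tasks. The function names are hypothetical, and the use of CRC32 as the hash is an assumption chosen only because it is deterministic across runs; it is not a claim about the hash Hadoop itself applies.

```python
import zlib
from typing import List, Tuple


def partition_for(key: str, num_reducers: int) -> int:
    """Choose the reduce task for a key (assumed hash: CRC32 of the UTF-8 key)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_map_output(pairs: List[Tuple[str, int]], num_reducers: int) -> List[List[Tuple[str, int]]]:
    """Split one map task's output into one partition per reduce task."""
    partitions: List[List[Tuple[str, int]]] = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)]
parts = partition_map_output(pairs, 3)
print(parts)
```

Because the partition depends only on the key, two map tasks running on different nodes send records with the same key to the same reducer without coordinating with each other.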

[Figure: MapReduce data flow with multiple reduce tasks]

It is also possible to have zero reduce tasks, as illustrated in the figure below.

[Figure: MapReduce data flow with no reduce tasks]

Combiner Functions:
Many MapReduce jobs are limited by the bandwidth available on the cluster. Combiner functions are introduced to minimize the data transferred between the map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output then forms the input to the reduce function. Combiner functions can help cut down the amount of data shuffled between the maps and the reduces.
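As a small illustration of how a combiner reduces shuffled data, the sketch below sums word counts locally on the map side before anything is sent to a reducer. The function name `combine` and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def combine(map_output: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the counts per key on the map side, so fewer records are shuffled."""
    totals: Counter = Counter()
    for key, count in map_output:
        totals[key] += count
    return sorted(totals.items())


map_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
print(len(map_output), "records before,", len(combined), "after")
```

This works for sums because addition is associative and commutative, so the reducer reaches the same total from partial sums. A function such as the mean does not have this property and could not be used as a combiner in the same way.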

Hadoop Streaming:
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

Hadoop Pipes:
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.
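To make the Streaming interface described above concrete, here is a sketch of a word-count mapper that reads lines and writes tab-separated key/value lines. The function name `stream_map` is hypothetical; the demonstration writes to an in-memory buffer so that it can run without a cluster.

```python
import io
import sys


def stream_map(lines, out=sys.stdout):
    """Read text lines and write one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


# Demonstration on an in-memory buffer. Run under Streaming, the script would
# instead call stream_map(sys.stdin), so Hadoop feeds it records on standard input.
demo = io.StringIO()
stream_map(["hadoop streaming", "hadoop"], out=demo)
print(demo.getvalue(), end="")
```

A matching reducer would read the sorted "word<TAB>count" lines from standard input and sum the counts for each run of identical words.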

HADOOP DISTRIBUTED FILESYSTEM (HDFS)
Filesystems that manage storage across a network of machines are called distributed filesystems. They are network-based, and thus all the complications of network programming are also present in a distributed filesystem. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.

ASSUMPTIONS AND GOALS:

1. Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a large number of nodes, the probability of one of them failing becomes substantial.

2. Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

3. Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.

4. Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.

5. "Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. Running the computation near the data minimizes network congestion and increases the overall throughput of the system. HDFS provides interfaces for applications to move themselves closer to where the data is located.

6. Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

The Design of HDFS:

Very large files:
"Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis involves a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity hardware:
Hadoop does not require expensive, highly reliable hardware to run on. It is designed to run on clusters of commodity hardware, for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

These design choices restrict the effectiveness of HDFS for some applications, which have the following common characteristics:

Low-latency data access:
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS.

Lots of small files:
Since the master node (or namenode) holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes.

Multiple writers, arbitrary file modifications:
Files in HDFS may be written to by a single writer only. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
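The rule of thumb given under "Lots of small files" (about 150 bytes of namenode memory per file, directory, and block) can be turned into a rough estimate. The helper below is a hypothetical sketch that counts one metadata object per file, per block and per directory; it is not a sizing tool.

```python
BYTES_PER_OBJECT = 150  # rule of thumb quoted in the text


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1, num_dirs: int = 0) -> int:
    """Rough namenode memory estimate: one object per file, block and directory."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


# One million small files, each occupying a single block.
estimate = namenode_memory_bytes(1_000_000)
print(estimate / 1_000_000, "MB")
```

Under these assumptions, one million single-block files need about 300 MB of namenode memory, which shows why the file count, rather than the total data volume, becomes the limit for workloads made of many small files.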

A Few Important Concepts of the Hadoop Distributed File System:

1. Blocks:
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
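The way a file is broken into block-sized chunks can be sketched as follows. The function name `block_lengths` is hypothetical, and the 64 MB default is the figure quoted in the text.

```python
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated in the text


def block_lengths(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Lengths of the blocks a file is split into; the last block may be shorter."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


MB = 1024 * 1024
print([n // MB for n in block_lengths(150 * MB)])  # sizes in MB of each block
print([n // MB for n in block_lengths(1 * MB)])    # a small file uses one short block
```

The second call illustrates the point made above: a 1 MB file is stored as a single 1 MB block rather than consuming a full 64 MB of underlying storage.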

Having a block abstraction for a distributed filesystem brings several benefits:
1. A file can be larger than any single disk in the network. Nothing requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
2. Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
3. Blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and against disk and machine failure, each block is replicated to a small number of physically separate machines (typically three).

2. Namenodes and Datanodes:
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode has two chief functions:
1. To manage the filesystem namespace.
2. To maintain the filesystem tree and the metadata for all the files and directories in the tree.

This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located. It does not, however, store block locations persistently, since this information is reconstructed from the datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the namenode and the datanodes.
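To illustrate how a namespace image and an edit log together describe the current namespace, the toy model below replays logged operations on top of a saved image. Everything here is an assumption made for illustration: the namespace is modelled as a set of paths, the log as a list of tuples, and the operation names are invented. The real on-disk formats are different.

```python
from typing import Iterable, Set, Tuple


def apply_edits(image: Set[str], edit_log: Iterable[Tuple[str, ...]]) -> Set[str]:
    """Replay logged operations on a copy of the image to obtain the current namespace."""
    namespace = set(image)
    for op, path, *rest in edit_log:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
        elif op == "rename":
            namespace.discard(path)
            namespace.add(rest[0])
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


image = {"/a", "/b"}
edits = [("create", "/c"), ("rename", "/a", "/d"), ("delete", "/b")]
merged = apply_edits(image, edits)
print(sorted(merged))
```

Saving `merged` as the new image and starting an empty log keeps the log short, which is the idea behind periodically merging the namespace image with the edit log.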

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.

The first is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as to a remote NFS mount.

The second is to run a secondary namenode. Despite its name, it does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost guaranteed.

3. The File System Namespace:
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.

4. Data Replication:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

5. Replica Placement:
Optimizing replica placement distinguishes HDFS from most other distributed file systems. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. Large HDFS instances run on a cluster of computers that is commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance.

6. Replica Selection:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader.

7. Safemode:
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas, and replicates these blocks to other DataNodes.

8. The Persistence of File System Metadata:
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage. The FsImage is also stored as a file in the NameNode's local file system.

The NameNode keeps an image of the entire file system namespace and the file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories.

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode: this is the Blockreport.
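The three-replica placement policy described under Replica Placement above can be sketched as a small function. This is an illustration of the policy as the text states it, under several assumptions: the cluster is a dictionary from rack name to node names, the writer is itself a node in the cluster, and candidates are picked in listing order rather than at random. The names `place_replicas`, `rack1`, `n1` and so on are hypothetical.

```python
from typing import Dict, List


def place_replicas(writer: str, topology: Dict[str, List[str]]) -> List[str]:
    """Pick three nodes: the writer, another node in its rack, and a node in another rack."""
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Second replica: a different node in the same (local) rack.
    same_rack = next(node for node in topology[local_rack] if node != writer)
    # Third replica: any node in a different rack.
    off_rack = next(
        node
        for rack, nodes in topology.items()
        if rack != local_rack
        for node in nodes
    )
    return [writer, same_rack, off_rack]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)
print(replicas)
```

Only one of the three copies crosses a rack boundary, which is how the policy limits inter-rack write traffic while still surviving the loss of a whole rack. The sketch raises StopIteration if the cluster has no second node in the local rack or no second rack.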

The Communication Protocols:

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine and talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Robustness:
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.

Data Disk Failure, Heartbeats and Re-Replication:
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. Re-replication may become necessary for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
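The heartbeat-based failure detection and re-replication tracking described above can be sketched in two small functions. The names, the timeout value and the data layout (dictionaries keyed by DataNode and block name) are assumptions made for the example, not HDFS internals.

```python
from typing import Dict, List, Set


def find_dead_datanodes(last_heartbeat: Dict[str, float], now: float, timeout: float) -> Set[str]:
    """DataNodes whose most recent Heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations: Dict[str, List[str]], dead: Set[str], target: int) -> Dict[str, int]:
    """For each block with too few live replicas, the number of copies still missing."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [node for node in nodes if node not in dead]
        if len(live) < target:
            missing[block] = target - len(live)
    return missing


last_heartbeat = {"dn1": 100.0, "dn2": 40.0, "dn3": 95.0, "dn4": 101.0}
dead = find_dead_datanodes(last_heartbeat, now=105.0, timeout=30.0)
blocks = {"b1": ["dn1", "dn2", "dn3"], "b2": ["dn1", "dn3", "dn4"]}
to_replicate = under_replicated(blocks, dead, target=3)
print(dead, to_replicate)
```

In this example dn2 is marked dead, so block b1 drops to two live replicas and needs one more copy, while b2 is unaffected.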

Cluster Rebalancing:
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity:
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of that block.

Metadata Disk Failure:
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either the FsImage or the EditLog causes each of the FsImages and EditLogs to be updated synchronously. This synchronous updating of multiple copies may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because, even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

Snapshots:
Snapshots support storing a copy of data at a particular instant of time. One use of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.
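The client-side check described under Data Integrity above can be illustrated with a short sketch: compute a checksum when the block is written, then accept the first replica whose contents still match it. The function names are hypothetical, and the choice of CRC32 is an assumption for the example rather than a statement about the checksum HDFS uses.

```python
import zlib
from typing import Iterable


def block_checksum(block: bytes) -> int:
    """Checksum of one block (CRC32 is assumed here for illustration)."""
    return zlib.crc32(block)


def read_verified(replicas: Iterable[bytes], expected: int) -> bytes:
    """Return the first replica that matches the stored checksum, trying others on mismatch."""
    for data in replicas:
        if block_checksum(data) == expected:
            return data
    raise IOError("every available replica failed checksum verification")


original = b"hello hdfs"
stored = block_checksum(original)   # computed when the file was created
corrupted = b"hellp hdfs"           # what one DataNode returns after a fault
recovered = read_verified([corrupted, original], stored)
print(recovered)
```

The first replica fails the check and is skipped, so the client falls back to the second DataNode's copy, as described in the text.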

Data Organi-ation
*ata Bloc?s H !" is desi$ned to support 0er( 'ar$e fi'es. App'ications that are co#patib'e =ith H !" are those that dea' =ith 'ar$e data sets. These app'ications =rite their data on'( once but the( read it one or #ore ti#es and reJuire these reads to be satisfied at strea#in$ speeds. H !" supports =rite2once2read2#an( se#antics on fi'es. A t(pica' b'oc) siAe used b( H !" is -4 M,. Thus4 an H !" fi'e is chopped up into -4 M, chun)s4 and if possib'e4 each chun) =i'' reside on a different ataGode. Sta#!n# A c'ient reJuest to create a fi'e does not reach the Ga#eGode i##ediate'(. In fact4 initia''( the H !" c'ient caches the fi'e data into a te#porar( 'oca' fi'e. App'ication =rites are transparent'( redirected to this te#porar( 'oca' fi'e. When the 'oca' fi'e accu#u'ates data =orth o0er one H !" b'oc) siAe4 the c'ient contacts the Ga#eGode. The Ga#eGode inserts the fi'e na#e into the fi'e s(ste# hierarch( and a''ocates a data b'oc) for it. The Ga#eGode responds to the c'ient reJuest =ith the identit( of the ataGode and the destination data b'oc). Then the c'ient f'ushes the b'oc) of data fro# the 'oca' te#porar( fi'e to the specified ataGode. When a fi'e is c'osed4 the re#ainin$ un2f'ushed data in the te#porar( 'oca' fi'e is transferred to the ataGode. The c'ient then te''s the Ga#eGode that the fi'e is c'osed. At this point4 the Ga#eGode co##its the fi'e creation operation into a persistent store. If the Ga#eGode dies before the fi'e is c'osed4 the fi'e is 'ost. The abo0e approach has been adopted after carefu' consideration of tar$et app'ications that run on H !". These app'ications need strea#in$ =rites to fi'es. If a c'ient =rites to a re#ote fi'e direct'( =ithout an( c'ient side bufferin$4 the net=or) speed and the con$estion in the net=or) i#pacts throu$hput considerab'(. This approach is not =ithout precedent. >ar'ier distributed fi'e s(ste#s4 e.$. A!"4 ha0e used c'ient side cachin$ to i#pro0e perfor#ance. 
A POSIX requirement has been relaxed to achieve higher performance of data uploads.

Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. In this way, the data is pipelined from one DataNode to the next.
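
The write path described in the Staging and Replication Pipelining sections can be sketched as a small simulation. This is only an illustrative model, not HDFS code: the names DataNode, build_pipeline, StagingClient and allocate_pipeline are hypothetical, the temporary local file is modelled as an in-memory buffer, and forwarding happens synchronously, whereas real DataNodes receive and forward concurrently.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # typical HDFS block size quoted in the text
PORTION_SIZE = 4 * 1024         # portion size forwarded between DataNodes


class DataNode:
    """Hypothetical stand-in for a DataNode in a replication pipeline."""

    def __init__(self, name):
        self.name = name
        self.downstream = None          # next DataNode in the pipeline, if any
        self.repository = bytearray()   # models the local repository

    def receive(self, portion):
        self.repository += portion      # write the portion locally
        if self.downstream is not None:
            self.downstream.receive(portion)   # then forward it


def build_pipeline(names):
    """Chain DataNodes so that each one forwards to the next in the list."""
    nodes = [DataNode(name) for name in names]
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.downstream = downstream
    return nodes


class StagingClient:
    """Buffers writes locally and flushes one full block at a time."""

    def __init__(self, allocate_pipeline, block_size=BLOCK_SIZE,
                 portion_size=PORTION_SIZE):
        # allocate_pipeline stands in for asking the NameNode for DataNodes.
        self.allocate_pipeline = allocate_pipeline
        self.block_size = block_size
        self.portion_size = portion_size
        self.staged = bytearray()   # models the temporary local file
        self.blocks = []            # one pipeline (list of DataNodes) per block

    def write(self, data):
        self.staged += data
        while len(self.staged) >= self.block_size:
            self._flush(self.block_size)

    def close(self):
        # On close, the remaining un-flushed data is sent as a final block.
        if self.staged:
            self._flush(len(self.staged))

    def _flush(self, size):
        block = bytes(self.staged[:size])
        del self.staged[:size]
        pipeline = self.allocate_pipeline()
        for start in range(0, len(block), self.portion_size):
            pipeline[0].receive(block[start:start + self.portion_size])
        self.blocks.append(pipeline)
```

With a deliberately tiny block size, for example StagingClient(lambda: build_pipeline(["dn1", "dn2", "dn3"]), block_size=10, portion_size=4), writing 15 bytes flushes one 10-byte block to all three nodes and leaves 5 bytes staged until close() is called.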


Accessibility:
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

Space Reclamation:
1. File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory. To undelete a file, the user can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.

2. Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
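
The two reclamation paths above can be sketched as follows. This is a minimal model under stated assumptions, not the HDFS implementation: Namespace and excess_replicas are hypothetical names, time is passed in explicitly as seconds, reclaiming an older trashed copy when the same path is deleted again is an assumption, and so is the rule for choosing which replicas are surplus.

```python
TRASH_RETENTION_SECONDS = 6 * 60 * 60   # default policy quoted in the text


class Namespace:
    """Hypothetical in-memory model of the NameNode's files and /trash."""

    def __init__(self, retention=TRASH_RETENTION_SECONDS):
        self.retention = retention
        self.files = {}          # path -> list of block ids
        self.trash = {}          # path -> (deletion time, list of block ids)
        self.freed_blocks = []   # blocks whose space has been reclaimed

    def create(self, path, blocks):
        self.files[path] = list(blocks)

    def delete(self, path, now):
        # Deleting only moves the entry into /trash; no space is freed yet.
        older = self.trash.pop(path, None)
        if older is not None:
            # Assumption: /trash keeps only the latest deleted copy, so an
            # older trashed copy of the same path is reclaimed here.
            self.freed_blocks.extend(older[1])
        self.trash[path] = (now, self.files.pop(path))

    def undelete(self, path):
        # Restoring is a rename back out of /trash.
        _, blocks = self.trash.pop(path)
        self.files[path] = blocks

    def expire(self, now):
        # Free the blocks of every trashed file older than the retention time.
        expired = [p for p, (t, _) in self.trash.items()
                   if now - t > self.retention]
        for path in expired:
            _, blocks = self.trash.pop(path)
            self.freed_blocks.extend(blocks)


def excess_replicas(replica_nodes, new_factor):
    """Replicas to drop when the replication factor is lowered.

    Assumption: the surplus is taken from the end of the list; the text
    does not say how the NameNode chooses.
    """
    return list(replica_nodes[new_factor:])
```

The gap between delete() and the next expire() call that finds the file old enough mirrors the delay between a user deleting a file and free space actually appearing.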

3. Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, which are described in the following table.

Filesystem | URI scheme | Java implementation (under org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
HFTP | hftp | hdfs.HftpFileSystem | A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp.
HSFTP | hsftp | hdfs.HsftpFileSystem | A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)
HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage.
KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++.
FTP | ftp | fs.ftp.FTPFileSystem | A filesystem backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A filesystem backed by Amazon S3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
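
The idea that a URI scheme selects a concrete filesystem implementation can be illustrated with a small lookup built from the table above. This is a sketch only: implementation_for is a hypothetical helper, it is not how Hadoop itself resolves filesystems, and falling back to the local filesystem when a path has no scheme is an assumption made for the example.

```python
from urllib.parse import urlparse

HADOOP_PACKAGE = "org.apache.hadoop"

# URI scheme -> implementation class, relative to HADOOP_PACKAGE (from the table).
IMPLEMENTATIONS = {
    "file": "fs.LocalFileSystem",
    "hdfs": "hdfs.DistributedFileSystem",
    "hftp": "hdfs.HftpFileSystem",
    "hsftp": "hdfs.HsftpFileSystem",
    "har": "fs.HarFileSystem",
    "kfs": "fs.kfs.KosmosFileSystem",
    "ftp": "fs.ftp.FTPFileSystem",
    "s3n": "fs.s3native.NativeS3FileSystem",
    "s3": "fs.s3.S3FileSystem",
}


def implementation_for(uri, default_scheme="file"):
    """Return the fully qualified class name that would serve this URI."""
    # Assumption: a path without a scheme refers to the local filesystem.
    scheme = urlparse(uri).scheme or default_scheme
    if scheme not in IMPLEMENTATIONS:
        raise ValueError("no filesystem registered for scheme: " + scheme)
    return HADOOP_PACKAGE + "." + IMPLEMENTATIONS[scheme]
```

For instance, a URI beginning with hdfs:// maps to org.apache.hadoop.hdfs.DistributedFileSystem, while a bare path such as /tmp/data.txt falls back to the local filesystem class.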

Hadoop Archives:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.
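
The small-files arithmetic above can be made concrete with a rough calculation. This is a simplified sketch under stated assumptions: blocks_for and small_file_footprint are hypothetical helpers, every small file is taken to be the same size, and the packed figure simply divides the total bytes by the block size, ignoring any index or bookkeeping data that a real archive would add.

```python
MB = 1024 * 1024


def blocks_for(size, block_size):
    """Number of blocks needed for `size` bytes (ceiling division)."""
    return -(-size // block_size)


def small_file_footprint(file_count, file_size, block_size):
    """Compare block counts for files stored separately versus packed together.

    Returns (separate_blocks, packed_blocks, disk_bytes). The disk usage is
    the same either way; only the number of blocks the namenode must track
    changes.
    """
    separate_blocks = file_count * blocks_for(file_size, block_size)
    disk_bytes = file_count * file_size
    packed_blocks = blocks_for(disk_bytes, block_size)
    return separate_blocks, packed_blocks, disk_bytes
```

Under these assumptions, 10,000 files of 1 MB each with a 128 MB block size occupy 10,000 blocks when stored separately but would fit in 79 blocks when packed, while using 10,000 MB of disk in both cases.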

Using Hadoop Archives
A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.

Limitations
There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of the original files, so you need as much disk space as the files you are archiving to create the archive (although you can delete the originals once you have created the archive). There is currently no support for archive compression, although the files that go into the archive can be compressed (HAR files are like tar files in this respect).

Archives are immutable once they have been created. To add or remove files, you must recreate the archive. In practice, this is not a problem for files that don't change after being written, since they can be archived in batches on a regular basis, such as daily or weekly.

As noted earlier, HAR files can be used as input to MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, can still be inefficient.
