Hadoop Streaming
Contents

- Hadoop Streaming
- How Streaming Works
- Streaming Command Options
  - Specifying a Java Class as the Mapper/Reducer
  - Packaging Files With Job Submissions
  - Specifying Other Plugins for Jobs
  - Setting Environment Variables
- Generic Command Options
  - Specifying Configuration Variables with the -D Option
  - Specifying Directories
  - Specifying Map-Only Jobs
  - Specifying the Number of Reducers
  - Customizing How Lines are Split into Key/Value Pairs
  - Working with Large Files and Archives
    - Making Files Available to Tasks
    - Making Archives Available to Tasks
- More Usage Examples
  - Hadoop Partitioner Class
  - Hadoop Comparator Class
  - Hadoop Aggregate Package
  - Hadoop Field Selection Class
- Frequently Asked Questions
  - How do I use Hadoop Streaming to run an arbitrary set of (semi) independent tasks?
  - How do I process files, one per map?
  - How many reducers should I use?
  - If I set up an alias in my shell script, will that work after -mapper?
  - Can I use UNIX pipes?
  - What do I do if I get the "No space left on device" error?
  - How do I specify multiple input directories?
  - How do I generate output files with gzip format?
  - How do I provide my own input/output format with streaming?
  - How do I parse XML documents using streaming?
  - How do I update counters in streaming applications?
  - How do I update status in streaming applications?
  - How do I get the JobConf variables in a streaming job's mapper/reducer?
Hadoop Streaming
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /bin/wc
https://hadoop.apache.org/docs/r1.2.1/streaming.html (retrieved 3/22/2014)
How Streaming Works

When an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.

When an executable is specified for reducers, each reducer task launches the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.

This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.

You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer /bin/wc

User can specify stream.non.zero.exit.is.failure as true or false to make a streaming task that exits with a non-zero status be Failure or Success respectively. By default, streaming tasks exiting with non-zero status are considered to be failed tasks.
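The default line-splitting convention described above can be sketched in Python. This is a standalone illustration, not part of Hadoop; parse_line and run_mapper are hypothetical names:

```python
import sys

def parse_line(line):
    """Split one streaming line into (key, value).

    Mirrors the default protocol: the prefix up to the first tab is the
    key and the rest of the line is the value; if the line has no tab,
    the whole line is the key and the value is empty (streaming treats
    it as null).
    """
    line = line.rstrip("\n")
    if "\t" in line:
        key, _, value = line.partition("\t")
        return key, value
    return line, ""

def run_mapper(stream=sys.stdin):
    # A real streaming mapper reads stdin line by line and writes
    # tab-separated key/value pairs back to stdout.
    for raw in stream:
        key, value = parse_line(raw)
        print(key + "\t" + value)
```

Note that only the first tab matters: any further tabs remain part of the value.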
Packaging Files With Job Submissions

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myPythonScript.py \
        -reducer /bin/wc \
        -file myPythonScript.py

The above example specifies a user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of job submission.

In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc.) that may be used by the mapper and/or the reducer. For example:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myPythonScript.py \
        -reducer /bin/wc \
        -file myPythonScript.py \
        -file myDictionary.txt
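Because "-file" places each packaged file in the task's current working directory, a mapper can open it by its bare name. A minimal sketch of that pattern (the helper names are hypothetical, not from the Hadoop docs):

```python
def load_dictionary(lines):
    """Build a vocabulary set from the lines of a dictionary file."""
    return set(word.strip() for word in lines if word.strip())

def filter_words(lines, vocabulary):
    """Yield only the input words that appear in the dictionary."""
    for line in lines:
        for word in line.split():
            if word in vocabulary:
                yield word

# Inside a task, the shipped file is available by its bare name, so a
# real streaming mapper would do roughly:
#     with open("myDictionary.txt") as f:
#         vocab = load_dictionary(f)
#     for word in filter_words(sys.stdin, vocab):
#         print(word)
```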
Specifying Directories
To change the local temp directory use:

    -D dfs.data.dir=/tmp

To specify additional local temp directories use:

    -D mapred.local.dir=/tmp/local
    -D mapred.system.dir=/tmp/system
    -D mapred.temp.dir=/tmp/temp

Note: For more details on jobconf parameters see: mapred-default.html
Specifying Map-Only Jobs

To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".
Customizing How Lines are Split into Key/Value Pairs

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D stream.map.output.field.separator=. \
        -D stream.num.map.output.key.fields=4 \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer

In the above example, "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).

Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the NUMth field separator in a line of the reduce outputs as the separator between the key and the value.

Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separators for Map/Reduce inputs. By default the separator is the tab character.
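This splitting rule can be sketched on its own (split_key_value is a hypothetical helper, not a Hadoop API; it models the stream.map.output.field.separator / stream.num.map.output.key.fields behavior described above):

```python
def split_key_value(line, separator=".", num_fields=4):
    """Emulate how streaming splits a map output line into key/value.

    The key is the prefix up to the num_fields-th separator and the
    rest is the value; with fewer separators than that, the whole line
    becomes the key and the value is empty.
    """
    parts = line.split(separator)
    if len(parts) <= num_fields:
        return line, ""
    key = separator.join(parts[:num_fields])
    value = separator.join(parts[num_fields:])
    return key, value
```

For example, with the default settings above, "a.b.c.d.e" splits into key "a.b.c.d" and value "e".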
User can specify a different symlink name for -files using #.

    -files hdfs://host:fs_port/user/testfile.txt#testfile

Multiple entries can be specified like this:

    -files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt
    -archives hdfs://host:fs_port/user/testfile.jar

In this example, Hadoop automatically creates a symlink named testfile.jar in the current working directory of tasks. This symlink points to the directory that stores the unjarred contents of the uploaded jar file.

User can specify a different symlink name for -archives using #.

    -archives hdfs://host:fs_port/user/testfile.tgz#tgzdir

In this example, the input.txt file has two lines specifying the names of the two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. "cachedir.jar" is a symlink to the archived directory, which has the files "cache.txt" and "cache2.txt".

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \
        -D mapred.map.tasks=1 \
        -D mapred.reduce.tasks=1 \
        -D mapred.job.name="Experiment" \
        -input "/user/me/samples/cachefile/input.txt" \
        -output "/user/me/samples/cachefile/out" \
        -mapper "xargs cat" \
        -reducer "cat"

    $ ls test_jar/
    cache.txt  cache2.txt

    $ jar cvf cachedir.jar -C test_jar/ .
    added manifest
    adding: cache.txt (in = 30) (out = 29) (deflated 3%)
    adding: cache2.txt (in = 37) (out = 35) (deflated 5%)

    $ hadoop dfs -put cachedir.jar samples/cachefile

    $ hadoop dfs -cat /user/me/samples/cachefile/input.txt
    cachedir.jar/cache.txt
    cachedir.jar/cache2.txt

    $ cat test_jar/cache.txt
    This is just the cache string

    $ cat test_jar/cache2.txt
    This is just the second cache string

    $ hadoop dfs -ls /user/me/samples/cachefile/out
    Found 1 items
    /user/me/samples/cachefile/out/part-00000  <r 3>  69

    $ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
    This is just the cache string
    This is just the second cache string
Output of map (the keys):

    11.12.1.2
    11.14.2.3
    11.11.4.1
    11.12.1.1
    11.14.2.2

Partition into 3 reducers (the first 2 fields are used as keys for partition):

    11.11.4.1
    -----------
    11.12.1.2
    11.12.1.1
    -----------
    11.14.2.3
    11.14.2.2

Sorting within each partition for the reducer (all 4 fields used for sorting):

    11.11.4.1
    -----------
    11.12.1.1
    11.12.1.2
    -----------
    11.14.2.2
    11.14.2.3
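The grouping above can be modeled with a small sketch of what a key-field-based partitioner does: hash the leading key fields and take the result modulo the reducer count. Python's built-in hash() stands in for Hadoop's hash function here, so the partition numbers differ from Hadoop's, but the property demonstrated is the same:

```python
def partition(key, num_reducers=3, num_key_fields=2, separator="."):
    """Route a key to a reducer based on its first num_key_fields fields.

    Keys that share the same leading fields always hash to the same
    partition, which is what groups 11.12.1.1 with 11.12.1.2 above.
    """
    prefix = separator.join(key.split(separator)[:num_key_fields])
    return hash(prefix) % num_reducers
```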
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myAggregatorForKeyCount.py \
        -reducer aggregate \
        -file myAggregatorForKeyCount.py

The python program myAggregatorForKeyCount.py looks like:

    #!/usr/bin/python

    import sys

    def generateLongCountToken(id):
        return "LongValueSum:" + id + "\t" + "1"

    def main(argv):
        for line in sys.stdin:
            line = line[:-1]  # strip the trailing newline
            fields = line.split("\t")
            print(generateLongCountToken(fields[0]))

    if __name__ == "__main__":
        main(sys.argv)
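On the reduce side, the aggregate package sums these tokens per key. A rough Python model of the LongValueSum behavior, illustrative only and not the actual Java implementation:

```python
from collections import defaultdict

def aggregate_long_value_sum(tokens):
    """Sum mapper tokens of the form "LongValueSum:<key>" + TAB + <count>."""
    totals = defaultdict(int)
    for token in tokens:
        tagged_key, value = token.split("\t")
        # Strip the "LongValueSum:" function prefix to recover the key.
        key = tagged_key.split(":", 1)[1]
        totals[key] += int(value)
    return dict(totals)
```

For instance, three tokens "LongValueSum:a" with counts 1 and 2 and "LongValueSum:b" with count 5 reduce to a=3, b=5.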
                    OutputCollector output, Reporter reporter)
            throws IOException {
        output.collect((Text) value, null);
    }
Note that the output filename will not be the same as the original filename.