Professional Documents
Culture Documents
Problems
Where are documents stored?
Adding more machines for processing only helps up to a certain point: eventually the storage server can't keep up.
You'll also need to split the documents among the set of processing machines, so that each machine processes only the documents stored on it.
Map-Reduce
MapReduce programs are executed in two main phases, called mapping and reducing.
Map-Reduce Program
Based on two functions: Map and Reduce
Every Map/Reduce program must specify a Mapper and optionally a Reducer
Both operate on key/value pairs
(Shell analogy: in a pipeline like sort | uniq -c, the uniq -c stage plays the role of Reduce.)
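In type terms (the K1/V1 notation used in the exercises later in this session), each phase transforms key/value pairs as follows:

  map:    (K1, V1)       -> list(K2, V2)
  reduce: (K2, list(V2)) -> list(K3, V3)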
Map-Reduce on Hadoop
[Diagram: Input files (File 1 ... File N) stored in HDFS are divided into splits. On each machine (Machine 1 ... Machine M), a RecordReader turns each split into (key, value) pairs, which are consumed by a Map task (Split 1 -> Map 1, ..., Split M -> Map M). Map output passes through an optional Combiner, is divided by the Partitioner into partitions (Partition 1 ... Partition P), and each partition is fetched by one Reducer (Reducer 1 ... Reducer R), whose output files (File 1 ... File O) are written back to HDFS.]
Terminology Example
Running Word Count across 20 files is one job.
20 files to be mapped means 20 map tasks, plus some number of reduce tasks.
At least 20 map task attempts will be performed: more if a machine crashes, due to speculative execution, etc.
Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes.
If the same input causes crashes over and over, that input will eventually be abandoned.
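A hedged sketch of how these limits are controlled in the classic mapred API (exact property names and defaults varied across Hadoop versions; the class and method names below wrap real JobConf and SkipBadRecords calls):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class RetryConfig {
  public static void configure(JobConf conf) {
    conf.setMaxMapAttempts(4);       // re-attempt a failed map task up to 4 times
    conf.setMaxReduceAttempts(4);    // likewise for reduce tasks
    // Skip mode: after repeated failures on the same records, skip them.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
  }
}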
Node-to-Node Communication
[Diagram: A master JobTracker coordinates the job; each slave node runs a TaskTracker, and each TaskTracker launches task instances in separate child JVMs.]
Job Distribution
MapReduce programs are contained in a Java jar file plus an XML file containing serialized program configuration options.
Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code.
Where's the data distribution?
Data Distribution
JobClient:
Determines the proper division of input into InputSplits
Sends job data to the master JobTracker server
JobTracker:
Inserts the jar and JobConf (serialized to XML) in a shared location
Posts a JobInProgress to its run queue
TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads the XML configuration
Connects back to the necessary MapReduce components via RPC
Uses TaskRunner to launch the user process (a client-side sketch of this submission path follows)
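A minimal sketch of the client side of this path, using the classic org.apache.hadoop.mapred API (the class name WordCount is illustrative; the mapper and reducer classes it references are sketched later in this section):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);    // this configuration is what gets serialized to XML
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);             // K2/K3 type
    conf.setOutputValueClass(IntWritable.class);    // V2/V3 type
    conf.setMapperClass(WordCountMapper.class);     // sketched after the WordCount M/R slide below
    conf.setReducerClass(WordCountReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                         // computes splits, hands the job to the JobTracker
  }
}

Such a job is typically launched with something like: hadoop jar wordcount.jar WordCount <input> <output> (the jar name here is hypothetical).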
Mapper
What is Writable?
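A brief answer, since the slide leaves the question open: Writable is Hadoop's serialization interface. Values must implement Writable; keys must implement its subinterface WritableComparable so they can be sorted during the shuffle. A minimal sketch of a hypothetical custom value type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Point2D implements Writable {   // hypothetical custom value type
  private int x;
  private int y;

  public void write(DataOutput out) throws IOException {
    out.writeInt(x);                         // serialize the fields in a fixed order
    out.writeInt(y);
  }

  public void readFields(DataInput in) throws IOException {
    x = in.readInt();                        // read them back in the same order
    y = in.readInt();
  }
}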
Reading Data
[Diagram: An InputFormat divides the input files into InputSplits; one RecordReader per split turns its bytes into records and hands them to a Mapper, which produces intermediate (key, value) pairs.]
Record Readers
WritableComparator
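A hedged sketch of how these pieces fit: map output keys implement WritableComparable, whose compareTo defines the sort order applied during shuffling, and a WritableComparator can additionally compare keys in their serialized byte form to avoid deserialization. A hypothetical key type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {  // hypothetical key type
  private int year;

  public void write(DataOutput out) throws IOException { out.writeInt(year); }
  public void readFields(DataInput in) throws IOException { year = in.readInt(); }

  public int compareTo(YearKey other) {      // ordering used when sorting map output
    return Integer.compare(year, other.year);
  }
}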
Shuffling
[Diagram: Each Mapper's intermediate (key, value) pairs are routed by a Partitioner; the pairs destined for the same partition are merged across all mappers and delivered to a single Reducer.]
Partitioner
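A hedged sketch of a custom partitioner in the classic mapred API; the default HashPartitioner uses (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. All names below are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }     // required by JobConfigurable; nothing to configure here

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Route keys by their first character so each reducer receives a contiguous range.
    String s = key.toString();
    int first = s.isEmpty() ? 0 : s.charAt(0);
    return first % numReduceTasks;
  }
}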
Reduction
OutputFormat
[Diagram: Each Reducer writes its output through a RecordWriter, supplied by the job's OutputFormat, producing one output file per reducer.]
WordCount M/R
map(String filename, String document) {
  List<String> T = tokenize(document);
  for each token in T {
    emit((String) token, (Integer) 1);
  }
}
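A hedged Java rendering of this pseudocode, plus the matching reducer, using the classic org.apache.hadoop.mapred API (the class names WordCountMapper and WordCountReducer are my own, match the driver sketched earlier, and would each live in their own file):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(line.toString()); // tokenize(document)
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      out.collect(word, ONE);                                      // emit(token, 1)
    }
  }
}

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();              // add up the 1s emitted for this word
    }
    out.collect(word, new IntWritable(sum));   // (word, total count)
  }
}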
Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix and adding a suffix to identify the task within the job.
task_201206111011_0002_m_000003 is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID job_201206111011_0002.
The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed.
Exercise - description
Exercise - tasks
Task 2: (30 min)
Use the distributed Python/Java code and execute it following the instructions.
Where were the input and output data stored, and in what format?
What were the K1, V1, K2, V2 data types used?
Task 3: (45 min)
Some words are so common that their presence in an inverted index is "noise"; they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"?
Rewrite your pseudocode so that your algorithm determines and removes noisy words within the map-reduce framework.
Map:
  foreach word in text.split():
    output(word, filename)
Inverted Index
hamlet.txt: "to be or not to be"
12th.txt: "be not afraid of greatness"

Map output:
to, hamlet.txt
be, hamlet.txt
or, hamlet.txt
not, hamlet.txt
be, 12th.txt
not, 12th.txt
afraid, 12th.txt
of, 12th.txt
greatness, 12th.txt

Reduce output:
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)
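A hedged Java sketch of this inverted index in the classic mapred API. Reading the current file name via the map.input.file property is the old-API idiom, and all class names are my own:

import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class IndexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private final Text filename = new Text();
  private final Text word = new Text();

  public void configure(JobConf job) {
    filename.set(job.get("map.input.file"));    // path of the file this split came from
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      out.collect(word, filename);              // emit (word, document)
    }
  }
}

public class IndexReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text word, Iterator<Text> docs,
                     OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    TreeSet<String> unique = new TreeSet<String>();   // deduplicate and sort the document names
    while (docs.hasNext()) {
      unique.add(docs.next().toString());
    }
    out.collect(word, new Text(unique.toString()));   // e.g. (be, [12th.txt, hamlet.txt])
  }
}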
A better example
Billions of crawled pages and links.
Generate an index of words linking to the web URLs in which they occur.
The input is split into url -> page pairs (lines of pages).
Map looks for words in the lines of each page and emits word -> link pairs.
The framework groups the (k, v) pairs to generate word -> {list of links}.
Reduce writes the resulting pairs to the output.
End of session
Day 1: First MR job - Inverted Index construction