
First MR job - Inverted Index construction

Map Reduce - Introduction


Parallel job-processing framework
Written in Java
Close integration with HDFS
Provides:
    Automatic partitioning of a job into sub-tasks
    Automatic retry on failures
    Linear scalability
    Locality of task execution
    Plugin-based framework for extensibility

Let's think scalability


Let's go through an exercise of scaling a simple program to process a large
data set.
Problem: count the number of times each word occurs in a set of documents.
Example: only one document with only one sentence: "Do as I say, not as I do."
Pseudocode: a multiset is a set where each element also has a count
define wordCount as Multiset;    (assume a hash table)
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
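
For reference, a single-machine version of this pseudocode in Java might look like the sketch below. The whitespace tokenizer and the command-line file argument are illustrative assumptions, and it assumes Java 8+ for Files.readAllLines.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) throws IOException {
        // wordCount plays the role of the Multiset: word -> number of occurrences
        Map<String, Integer> wordCount = new HashMap<String, Integer>();
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String line : lines) {
            for (String token : line.split("\\s+")) {   // naive whitespace tokenizer
                if (token.isEmpty()) continue;
                Integer old = wordCount.get(token);
                wordCount.put(token, old == null ? 1 : old + 1);
            }
        }
        System.out.println(wordCount);                  // display(wordCount)
    }
}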

How about a billion documents?


Looping through all the documents using a single computer would be extremely
time-consuming.
You can speed it up by rewriting the program so that it distributes the work
over several machines.
Each machine will process a distinct fraction of the documents. When all the
machines have completed this, a second phase of processing will combine the
results from all the machines.
// Phase one: runs on each machine, over its own subset of the documents
define wordCount as Multiset;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
sendToSecondPhase(wordCount);

// Phase two: runs on a single machine, merging the phase-one results
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
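
A minimal sketch of what multisetAdd amounts to in Java, assuming each phase-one machine ships its counts as a Map<String, Integer> (the class and method names are illustrative):

import java.util.Map;

public class Merge {
    // Adds every (word, count) from one phase-one result into the running total.
    static void multisetAdd(Map<String, Integer> total, Map<String, Integer> partial) {
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            Integer old = total.get(e.getKey());
            total.put(e.getKey(), (old == null ? 0 : old) + e.getValue());
        }
    }
}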

Problems
Where are the documents stored?
    Having more machines for processing only helps up to the point where the
    storage server can't keep up.
    You'll also need to split up the documents among the set of processing
    machines, so that each machine processes only those documents that are
    stored on it.

wordCount (and totalWordCount) are stored in memory


When processing large document sets, the number of unique words can exceed
the RAM of a single machine.
Furthermore, phase two has only one machine, which must process the
wordCount sent from every machine in phase one. That single machine
becomes the bottleneck.

Solution: divide based on expected output!


Let's say we have 26 machines for phase two. We assign each machine to handle
only the wordCount for words beginning with a particular letter of the alphabet.
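
As an illustration of that assignment (a sketch; the 26-way split by lower-cased first letter is an assumption, as is lumping non-alphabetic words into machine 0):

// Route a word to one of 26 phase-two machines by its first letter.
static int phaseTwoMachine(String word) {
    char c = Character.toLowerCase(word.charAt(0));
    return (c >= 'a' && c <= 'z') ? (c - 'a') : 0;   // non-letters go to machine 0
}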

Map-Reduce
MapReduce programs are executed in two main phases, called mapping and
reducing.

In the mapping phase, MapReduce takes the input data and feeds each data
element to the mapper.
In the reducing phase, the reducer processes all the outputs from the mapper
and arrives at a final result.
The mapper is meant to filter and transform the input into something that the
reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.

Map-Reduce

Map-Reduce Program
Based on two functions: Map and Reduce
Every Map/Reduce program must specify a Mapper and optionally a Reducer
Operate on key and value pairs

Map-Reduce works like a Unix pipeline:

    cat input | grep | sort           | uniq -c | cat > output
    Input     | Map  | Shuffle & Sort | Reduce  | Output

Example:
    cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist

Map function: takes a key/value pair and generates a set of intermediate
key/value pairs
    map(k1, v1) -> list(k2, v2)
Reduce function: takes all intermediate values associated with the same
intermediate key and merges them
    reduce(k2, list(v2)) -> list(k3, v3)

Map-Reduce on Hadoop

Putting things in context


[Diagram: end-to-end MapReduce data flow across machines 1..M. Input files in
HDFS are divided into splits; a RecordReader turns each split into (key, value)
pairs that feed a Map task on the machine holding that data. Map output passes
through an optional Combiner and a Partitioner, which assigns each record to
one of the partitions; each partition is fetched by one Reducer, and each
Reducer writes its output files back to HDFS.]

Some MapReduce Terminology


Job - A full program: an execution of a Mapper and Reducer across a data set
Task - An execution of a Mapper or a Reducer on a slice of data
    a.k.a. Task-In-Progress (TIP)
Task Attempt - A particular instance of an attempt to execute a task on a machine

Terminology Example
Running Word Count across 20 files is one job
20 files to be mapped imply 20 map tasks + some number of reduce tasks
At least 20 map task attempts will be performed: more if a machine crashes,
due to speculative execution, etc.

Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes
    If the same input causes crashes over and over, that input will eventually be
    abandoned
Multiple attempts at one task may occur in parallel with speculative execution
turned on
    Task ID from TaskInProgress is not a unique identifier; don't use it that way

MapReduce: High Level


[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker
on the master node; each slave node runs a TaskTracker, which launches task
instances on that node.]

Node-to-Node Communication

Hadoop uses its own RPC protocol
All communication begins in slave nodes
    Prevents circular-wait deadlock
    Slaves periodically poll for status messages
Classes must provide explicit serialization

Nodes, Trackers, Tasks


Master node runs a JobTracker instance, which accepts job requests from clients
TaskTracker instances run on slave nodes
    TaskTracker forks a separate Java process for task instances

Job Distribution
MapReduce programs are contained in a Java jar file
+ an XML file containing serialized program
configuration options
Running a MapReduce job places these files into the
HDFS and notifies TaskTrackers where to retrieve the
relevant program code
Where's the data distribution?

Data Distribution

Implicit in design of MapReduce!
All mappers are equivalent, so map whatever data is local to a particular node
in HDFS
If lots of data does happen to pile up on the same node, nearby nodes will map
instead
Data transfer is handled implicitly by HDFS

Configuring With JobConf


MR programs have many configurable options
JobConf objects hold (key, value) configuration properties mapping String -> String
    e.g., "mapred.map.tasks" -> "20"
JobConf is serialized and distributed before running the job
Objects implementing JobConfigurable can retrieve elements from a JobConf
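
A hedged sketch of both sides of that contract, using the old mapred API these slides are based on; the property name my.threshold and the class name are invented for illustration.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Driver side: put (key, value) options into the JobConf before submitting.
//   JobConf conf = new JobConf(MyJob.class);
//   conf.set("mapred.map.tasks", "20");   // built-in option from the slide
//   conf.set("my.threshold", "5");        // hypothetical user-defined option

// Task side: MapReduceBase implements JobConfigurable, so a Mapper/Reducer
// extending it can read options back in configure(), which Hadoop calls once
// per task with the deserialized JobConf.
public class ThresholdAwareBase extends MapReduceBase {
    protected int threshold;

    @Override
    public void configure(JobConf conf) {
        threshold = conf.getInt("my.threshold", 1);   // default 1 if unset
    }
}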

Job Launch Process: Client

Client program creates a JobConf
Identify classes implementing Mapper and Reducer interfaces
    JobConf.setMapperClass(), setReducerClass()
Specify inputs, outputs
    JobConf.setInputPath(), setOutputPath()
Optionally, other options too:
    JobConf.setNumReduceTasks(), JobConf.setOutputFormat()

Job Launch Process: JobClient

Pass JobConf to JobClient.runJob() or submitJob()
    runJob() blocks, submitJob() does not
JobClient:
    Determines proper division of input into InputSplits
    Sends job data to the master JobTracker server

Job Launch Process: JobTracker

JobTracker:
Inserts jar and JobConf (serialized to XML) in
shared location
Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker

TaskTrackers running on slave nodes periodically query the JobTracker for work
Retrieve job-specific jar and config
Launch the task in a separate instance of Java
    main() is provided by Hadoop

Job Launch Process: Task

TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce
components via RPC
Uses TaskRunner to launch user process

Job Launch Process: TaskRunner

TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
    Task knows ahead of time which InputSplits it should be mapping
    Calls Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same

Creating the Mapper

You provide the instance of Mapper
    Should extend MapReduceBase
One instance of your Mapper is initialized by the MapTaskRunner for a
TaskInProgress
    Exists in a separate process from all other instances of Mapper: no data sharing!

Mapper

void map(WritableComparable key,
         Writable value,
         OutputCollector output,
         Reporter reporter)

What is Writable?

Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
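
Beyond the built-in types you can define your own; the sketch below is a hypothetical WritableComparable key (not from the slides) showing the three methods Hadoop relies on.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key (word, year). Hadoop serializes it with write(),
// rebuilds it with readFields(), and sorts map output using compareTo().
// If used with HashPartitioner, also override hashCode() and equals().
public class WordYear implements WritableComparable<WordYear> {
    private String word = "";
    private int year;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(year);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        year = in.readInt();
    }

    public int compareTo(WordYear other) {
        int cmp = word.compareTo(other.word);
        if (cmp != 0) return cmp;
        return year < other.year ? -1 : (year == other.year ? 0 : 1);
    }
}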

Getting Data To The Mapper

[Diagram: the InputFormat describes the input files and divides them into
InputSplits; each InputSplit is read by a RecordReader, which feeds (key, value)
records to one Mapper, producing intermediate outputs.]

Reading Data

Data sets are specified by InputFormats
    Defines input data (e.g., a directory)
    Identifies partitions of the data that form an InputSplit
    Factory for RecordReader objects to extract (k, v) records from the input source

FileInputFormat and Friends


TextInputFormat - Treats each \n-terminated line of a file as a value
KeyValueTextInputFormat - Maps \n-terminated text lines of "k SEP v"
SequenceFileInputFormat - Binary file of (k, v) pairs with some additional metadata
SequenceFileAsTextInputFormat - Same, but maps (k.toString(), v.toString())

Filtering File Inputs

FileInputFormat will read all files out of a specified directory and send them to
the mapper
Delegates filtering of this file list to a method subclasses may override
    e.g., create your own xyzFileInputFormat to read *.xyz from the directory list
    (see the sketch below)
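
One way to get that behavior with the old mapred API is to register a PathFilter on FileInputFormat; in the sketch below the class name XyzPathFilter is invented, and only *.xyz files are admitted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Only accept input files whose names end in ".xyz".
public class XyzPathFilter implements PathFilter {
    public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
    }
}

// In the driver, hook the filter into the input format:
//   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);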

Record Readers

Each InputFormat provides its own RecordReader implementation
    Provides capability multiplexing
LineRecordReader - Reads a line from a text file
KeyValueRecordReader - Used by KeyValueTextInputFormat

Input Split Size

FileInputFormat will divide large files into chunks
    Exact size controlled by mapred.min.split.size
RecordReaders receive the file, offset, and length of the chunk
Custom InputFormat implementations may override split size
    e.g., NeverChunkFile (see the sketch below)
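
A sketch of what a NeverChunkFile-style format could look like: subclass the old mapred TextInputFormat and override isSplitable(). The slide names the class; the body here is an assumption.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// A text input format whose files are never split: each file becomes
// exactly one InputSplit, and therefore one map task.
public class NeverChunkFile extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}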

Sending Data To Reducers

Map function receives an OutputCollector object
    OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) pair can be used

WritableComparator

Compares WritableComparable data
    Will call WritableComparable.compareTo() by default
    Can provide a fast path for serialized data
Registered with JobConf.setOutputValueGroupingComparator()
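
As an illustration only (not from the slides), a grouping comparator that treats Text keys starting with the same letter as equal, so their values reach a single reduce() call, could look roughly like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: two keys compare equal when they start with the
// same character, so all of their values are grouped into one reduce() call.
public class FirstLetterComparator extends WritableComparator {
    public FirstLetterComparator() {
        super(Text.class, true);   // true = deserialize keys before comparing
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String sa = a.toString();
        String sb = b.toString();
        char ca = sa.isEmpty() ? '\0' : sa.charAt(0);
        char cb = sb.isEmpty() ? '\0' : sb.charAt(0);
        return ca - cb;
    }
}

// Registered in the driver with:
//   conf.setOutputValueGroupingComparator(FirstLetterComparator.class);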

Sending Data To The Client

Reporter object sent to the Mapper allows simple asynchronous feedback
    incrCounter(Enum key, long amount)
    setStatus(String msg)
Allows self-identification of input
    InputSplit getInputSplit()
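
A small sketch of how a map() body might use the Reporter it is handed; the counter enum and the status text are invented for illustration.

import org.apache.hadoop.mapred.Reporter;

public class ReporterExample {
    // Hypothetical counters; Hadoop aggregates them across all task attempts.
    public enum RecordCounters { GOOD, MALFORMED }

    // Called from inside map() with the Reporter Hadoop passes in.
    static void report(boolean malformed, Reporter reporter) {
        reporter.incrCounter(malformed ? RecordCounters.MALFORMED : RecordCounters.GOOD, 1);
        reporter.setStatus("last record was " + (malformed ? "malformed" : "good"));
        // reporter.getInputSplit() identifies which InputSplit this task is reading
    }
}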

Shuffling

Partition And Shuffle

[Diagram: each Mapper's intermediate (k, v) pairs pass through a Partitioner,
which assigns every key to one reduce partition; the shuffle then collects all
intermediates for a partition at a single Reducer.]

Partitioner

int getPartition(key, val, numPartitions)
    Outputs the partition number for a given key
    One partition == values sent to one Reduce task
HashPartitioner used by default
    Uses key.hashCode() to return the partition number
JobConf sets the Partitioner implementation (see the sketch below)
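
Tying this back to the earlier 26-machines-by-first-letter idea, a custom Partitioner along those lines might look like the sketch below (not from the slides); it would be registered with conf.setPartitionerClass(...).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends every word to a reduce partition chosen by its first letter,
// falling back to a hash for empty or non-alphabetic keys.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf conf) { }   // nothing to configure

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String w = key.toString();
        char c = w.isEmpty() ? '\0' : Character.toLowerCase(w.charAt(0));
        if (c >= 'a' && c <= 'z') {
            return (c - 'a') % numPartitions;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
//   conf.setPartitionerClass(FirstLetterPartitioner.class);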

Reduction

reduce(WritableComparable key,
       Iterator values,
       OutputCollector output,
       Reporter reporter)
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key: earlier keys are reduced and output before later keys

OutputFormat

Finally: Writing The Output

[Diagram: each Reducer writes its results through a RecordWriter, supplied by
the job's OutputFormat, into its own output file.]

WordCount M/R
map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit((String) token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}

Word Count: Java Mapper


public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            Text word = new Text(itr.nextToken());
            output.collect(word, new IntWritable(1));
        }
    }
}


Word Count: Java Reduce


public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}


Word Count: Java Driver


public void run(String inPath, String outPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    conf.setOutputKeyClass(Text.class);            // output key type of map/reduce
    conf.setOutputValueClass(IntWritable.class);   // output value type of map/reduce
    FileInputFormat.addInputPath(conf, new Path(inPath));
    FileOutputFormat.setOutputPath(conf, new Path(outPath));
    JobClient.runJob(conf);
}


WordCount with many mappers and one reducer

Job, Task, and Task Attempt IDs


The format of a job ID is composed of the time that the jobtracker (not the job)
started and an incrementing counter maintained by the jobtracker to uniquely
identify the job to that instance of the jobtracker.
    job_201206111011_0002:
    is the second (0002; job IDs are 1-based) job run by the jobtracker,
    which started at 10:11 on June 11, 2012.

Tasks belong to a job, and their IDs are formed by replacing the job prefix of a
job ID with a task prefix, and adding a suffix to identify the task within the job.
    task_201206111011_0002_m_000003:
    is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID
    job_201206111011_0002.
    Task IDs are created for a job when it is initialized, so they do not necessarily
    dictate the order in which the tasks will be executed.

Tasks may be executed more than once, due to failure or speculative execution,
so to identify different instances of a task execution, task attempts are given
unique IDs on the jobtracker.
    attempt_201206111011_0002_m_000003_0:
    is the first (0; attempt IDs are 0-based) attempt at running task
    task_201206111011_0002_m_000003.

Exercise - description

The objectives for this exercise are:


Become familiar with decomposing a problem into Map and Reduce stages.
Get a sense for how MapReduce can be used in the real world.

An inverted index is a mapping of words to their location in a set of documents.


Most modern search engines utilize some form of an inverted index to process
user-submitted queries. In its most basic form, an inverted index is a simple hash
table which maps words in the documents to some sort of document identifier.

For example, given the following 2 documents:
    Doc1: Buffalo buffalo buffalo.
    Doc2: Buffalo are mammals.
we could construct the following inverted file index:
    Buffalo  -> Doc1, Doc2
    buffalo  -> Doc1
    buffalo. -> Doc1
    are      -> Doc2
    mammals. -> Doc2

Exercise - tasks

Task 1 (30 min):
    Write pseudo-code for map and reduce to solve the inverted index problem.
    What are your K1, V1, K2, V2, etc.?
    Execute your pseudo-code on the following example and explain what the
    Shuffle & Sort stage does with the keys and values.

Task 2 (30 min):
    Use the distributed code (Python/Java) and execute it following the instructions.
    Where were the input and output data stored, and in what format?
    What K1, V1, K2, V2 data types were used?

Task 3 (45 min):
    Some words are so common that their presence in an inverted index is "noise":
    they can obfuscate the more interesting properties of that document. For
    example, the words "the", "a", "and", "of", "in", and "for" occur in almost every
    English document. How can you determine whether a word is "noisy"?
    Re-write your pseudo-code to determine (your algorithm) and remove noisy
    words using the map-reduce framework.

Group / individual presentation (45 min)

Example: Inverted Index


Input: (filename, text) records
Output: list of files containing each word

Map:
    foreach word in text.split():
        output(word, filename)

Combine: unique filenames for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))


Inverted Index

Input documents:
    hamlet.txt: to be or not to be
    12th.txt:   be not afraid of greatness

Map output (word, filename) pairs:
    to, hamlet.txt
    be, hamlet.txt
    or, hamlet.txt
    not, hamlet.txt
    be, 12th.txt
    not, 12th.txt
    afraid, 12th.txt
    of, 12th.txt
    greatness, 12th.txt

Reduce output (word, list of files):
    afraid, (12th.txt)
    be, (12th.txt, hamlet.txt)
    greatness, (12th.txt)
    not, (12th.txt, hamlet.txt)
    of, (12th.txt)
    or, (hamlet.txt)
    to, (hamlet.txt)

A better example
Billions of crawled pages and links
Generate an index of words linking to the web URLs in which they occur
Input is split into url -> page (lines of pages)
Map looks for words in the lines of a page and puts out (word, url) pairs
Shuffle & Sort groups the (k, v) pairs to generate word -> {list of urls}
Reduce writes the pairs to the output

Search Reverse Index


public static class MapClass extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

    private Text word = new Text();

    public void map(Text url, Text pageText,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = pageText.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // TODO: ignore unwanted and redundant ("noisy") words here
            word.set(itr.nextToken());
            output.collect(word, url);   // emit (word, url)
        }
    }
}

Search Reverse Index


public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text word, Iterator<Text> urls,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        // Concatenate all URLs that contain this word into one output value
        StringBuilder list = new StringBuilder();
        while (urls.hasNext()) {
            if (list.length() > 0) {
                list.append(", ");
            }
            list.append(urls.next().toString());
        }
        output.collect(word, new Text(list.toString()));
    }
}

End of session
Day 1: First MR job - Inverted Index construction
