
First MR job - Inverted Index construction

Map Reduce - Introduction


Parallel job-processing framework
Written in Java
Close integration with HDFS
Provides:
    Automatic partitioning of a job into sub-tasks
    Automatic retry on failures
    Linear scalability
    Locality of task execution
    Plugin-based framework for extensibility

Let's think scalability


Let's go through an exercise of scaling a simple program to process a large
data set.
Problem: count the number of times each word occurs in a set of documents.
Example: only one document with only one sentence: "Do as I say, not as I do."
Pseudocode: a multiset is a set where each element also has a count
define wordCount as Multiset;    (assume a hash table)
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
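
For reference, a single-machine version of this pseudocode in Java might look like the sketch below. The whitespace tokenizer and the command-line file argument are illustrative assumptions, and it assumes Java 8+ for Files.readAllLines.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) throws IOException {
        // wordCount plays the role of the Multiset: word -> number of occurrences
        Map<String, Integer> wordCount = new HashMap<String, Integer>();
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String line : lines) {
            for (String token : line.split("\\s+")) {   // naive whitespace tokenizer
                if (token.isEmpty()) continue;
                Integer old = wordCount.get(token);
                wordCount.put(token, old == null ? 1 : old + 1);
            }
        }
        System.out.println(wordCount);                  // display(wordCount)
    }
}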

How about a billion documents?


Looping through all the documents using a single computer would be extremely
time-consuming.
You can speed it up by rewriting the program so that it distributes the work
over several machines.
Each machine will process a distinct fraction of the documents. When all the
machines have completed this, a second phase of processing will combine the
results from all the machines.
// Phase one: runs on each machine, over its own subset of the documents
define wordCount as Multiset;
for each document in documentSubset {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
sendToSecondPhase(wordCount);

// Phase two: runs on a single machine, merging the phase-one results
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
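
A minimal sketch of what multisetAdd amounts to in Java, assuming each phase-one machine ships its counts as a Map<String, Integer> (the class and method names are illustrative):

import java.util.Map;

public class Merge {
    // Adds every (word, count) from one phase-one result into the running total.
    static void multisetAdd(Map<String, Integer> total, Map<String, Integer> partial) {
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            Integer old = total.get(e.getKey());
            total.put(e.getKey(), (old == null ? 0 : old) + e.getValue());
        }
    }
}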

Problems
Where are the documents stored?
    Having more machines for processing only helps up to the point where the
    storage server can't keep up.
    You'll also need to split up the documents among the set of processing
    machines, so that each machine processes only those documents that are
    stored on it.

wordCount (and totalWordCount) are stored in memory


When processing large document sets, the number of unique words can exceed
the RAM of a single machine.
Furthermore, phase two has only one machine, which must process the
wordCount sent from every machine in phase one. That single machine
becomes the bottleneck.

Solution: divide based on expected output!


Let's say we have 26 machines for phase two. We assign each machine to handle
only the wordCount for words beginning with a particular letter of the alphabet.
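
As an illustration of that assignment (a sketch; the 26-way split by lower-cased first letter is an assumption, as is lumping non-alphabetic words into machine 0):

// Route a word to one of 26 phase-two machines by its first letter.
static int phaseTwoMachine(String word) {
    char c = Character.toLowerCase(word.charAt(0));
    return (c >= 'a' && c <= 'z') ? (c - 'a') : 0;   // non-letters go to machine 0
}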

Map-Reduce
MapReduce programs are executed in two main phases, called mapping and
reducing.

In the mapping phase, MapReduce takes the input data and feeds each data
element to the mapper.
In the reducing phase, the reducer processes all the outputs from the mapper
and arrives at a final result.
The mapper is meant to filter and transform the input into something that the
reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.

Map-Reduce

Map-Reduce Program
Based on two functions: Map and Reduce
Every Map/Reduce program must specify a Mapper and optionally a Reducer
Operate on key and value pairs

Map-Reduce works like a Unix pipeline:

    cat input | grep | sort           | uniq -c | cat > output
    Input     | Map  | Shuffle & Sort | Reduce  | Output

Example:
    cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist

Map function: takes a key/value pair and generates a set of intermediate
key/value pairs
    map(k1, v1) -> list(k2, v2)
Reduce function: takes all intermediate values associated with the same
intermediate key and merges them
    reduce(k2, list(v2)) -> list(k3, v3)

Map-Reduce on Hadoop

Putting things in context


[Diagram: end-to-end MapReduce data flow across machines 1..M. Input files in
HDFS are divided into splits; a RecordReader turns each split into (key, value)
pairs that feed a Map task on the machine holding that data. Map output passes
through an optional Combiner and a Partitioner, which assigns each record to
one of the partitions; each partition is fetched by one Reducer, and each
Reducer writes its output files back to HDFS.]

Some MapReduce Terminology


Job - A full program: an execution of a Mapper and Reducer across a data set
Task - An execution of a Mapper or a Reducer on a slice of data
    a.k.a. Task-In-Progress (TIP)
Task Attempt - A particular instance of an attempt to execute a task on a machine

Terminology Example
Running Word Count across 20 files is one job
20 files to be mapped imply 20 map tasks + some number of reduce tasks
At least 20 map task attempts will be performed: more if a machine crashes,
due to speculative execution, etc.

Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes
    If the same input causes crashes over and over, that input will eventually be
    abandoned
Multiple attempts at one task may occur in parallel with speculative execution
turned on
    Task ID from TaskInProgress is not a unique identifier; don't use it that way

MapReduce: High Level


[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker
on the master node; each slave node runs a TaskTracker, which launches task
instances on that node.]

Node-to-Node Communication

Hadoop uses its own RPC protocol
All communication begins in slave nodes
    Prevents circular-wait deadlock
    Slaves periodically poll for status messages
Classes must provide explicit serialization

Nodes, Trackers, Tasks


Master node runs a JobTracker instance, which accepts job requests from clients
TaskTracker instances run on slave nodes
    TaskTracker forks a separate Java process for task instances

Job Distribution
MapReduce programs are contained in a Java jar file
+ an XML file containing serialized program
configuration options
Running a MapReduce job places these files into the
HDFS and notifies TaskTrackers where to retrieve the
relevant program code
Where's the data distribution?

Data Distribution

Implicit in design of MapReduce!
All mappers are equivalent, so map whatever data is local to a particular node
in HDFS
If lots of data does happen to pile up on the same node, nearby nodes will map
instead
Data transfer is handled implicitly by HDFS

Configuring With JobConf


MR programs have many configurable options
JobConf objects hold (key, value) configuration properties mapping String -> String
    e.g., "mapred.map.tasks" -> "20"
JobConf is serialized and distributed before running the job
Objects implementing JobConfigurable can retrieve elements from a JobConf
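
A hedged sketch of both sides of that contract, using the old mapred API these slides are based on; the property name my.threshold and the class name are invented for illustration.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Driver side: put (key, value) options into the JobConf before submitting.
//   JobConf conf = new JobConf(MyJob.class);
//   conf.set("mapred.map.tasks", "20");   // built-in option from the slide
//   conf.set("my.threshold", "5");        // hypothetical user-defined option

// Task side: MapReduceBase implements JobConfigurable, so a Mapper/Reducer
// extending it can read options back in configure(), which Hadoop calls once
// per task with the deserialized JobConf.
public class ThresholdAwareBase extends MapReduceBase {
    protected int threshold;

    @Override
    public void configure(JobConf conf) {
        threshold = conf.getInt("my.threshold", 1);   // default 1 if unset
    }
}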

Job Launch Process: Client

Client program creates a JobConf
Identify classes implementing Mapper and Reducer interfaces
    JobConf.setMapperClass(), setReducerClass()
Specify inputs, outputs
    JobConf.setInputPath(), setOutputPath()
Optionally, other options too:
    JobConf.setNumReduceTasks(), JobConf.setOutputFormat()

Job Launch Process: JobClient

Pass JobConf to JobClient.runJob() or submitJob()
    runJob() blocks, submitJob() does not
JobClient:
    Determines proper division of input into InputSplits
    Sends job data to the master JobTracker server

Job Launch Process: JobTracker

JobTracker:
Inserts jar and JobConf (serialized to XML) in
shared location
Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker

TaskTrackers running on slave nodes periodically query the JobTracker for work
Retrieve job-specific jar and config
Launch the task in a separate instance of Java
    main() is provided by Hadoop

Job Launch Process: Task

TaskTracker.Child.main():
Sets up the child TaskInProgress attempt
Reads XML configuration
Connects back to necessary MapReduce
components via RPC
Uses TaskRunner to launch user process

Job Launch Process: TaskRunner

TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
    Task knows ahead of time which InputSplits it should be mapping
    Calls Mapper once for each record retrieved from the InputSplit
Running the Reducer is much the same

Creating the Mapper

You provide the instance of Mapper
    Should extend MapReduceBase
One instance of your Mapper is initialized by the MapTaskRunner for a
TaskInProgress
    Exists in a separate process from all other instances of Mapper: no data sharing!

Mapper

void map(WritableComparable key,
         Writable value,
         OutputCollector output,
         Reporter reporter)

What is Writable?

Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
All values are instances of Writable
All keys are instances of WritableComparable
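
Beyond the built-in types you can define your own; the sketch below is a hypothetical WritableComparable key (not from the slides) showing the three methods Hadoop relies on.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key (word, year). Hadoop serializes it with write(),
// rebuilds it with readFields(), and sorts map output using compareTo().
// If used with HashPartitioner, also override hashCode() and equals().
public class WordYear implements WritableComparable<WordYear> {
    private String word = "";
    private int year;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(year);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        year = in.readInt();
    }

    public int compareTo(WordYear other) {
        int cmp = word.compareTo(other.word);
        if (cmp != 0) return cmp;
        return year < other.year ? -1 : (year == other.year ? 0 : 1);
    }
}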

Getting Data To The Mapper

[Diagram: the InputFormat describes the input files and divides them into
InputSplits; each InputSplit is read by a RecordReader, which feeds (key, value)
records to one Mapper, producing intermediate outputs.]

Reading Data

Data sets are specified by InputFormats
    Defines input data (e.g., a directory)
    Identifies partitions of the data that form an InputSplit
    Factory for RecordReader objects to extract (k, v) records from the input source

FileInputFormat and Friends


TextInputFormat - Treats each \n-terminated line of a file as a value
KeyValueTextInputFormat - Maps \n-terminated text lines of "k SEP v"
SequenceFileInputFormat - Binary file of (k, v) pairs with some additional metadata
SequenceFileAsTextInputFormat - Same, but maps (k.toString(), v.toString())

Filtering File Inputs

FileInputFormat will read all files out of a specified directory and send them to
the mapper
Delegates filtering of this file list to a method subclasses may override
    e.g., create your own xyzFileInputFormat to read *.xyz from the directory list
    (see the sketch below)
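
One way to get that behavior with the old mapred API is to register a PathFilter on FileInputFormat; in the sketch below the class name XyzPathFilter is invented, and only *.xyz files are admitted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Only accept input files whose names end in ".xyz".
public class XyzPathFilter implements PathFilter {
    public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
    }
}

// In the driver, hook the filter into the input format:
//   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);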

Record Readers

Each InputFormat provides its own RecordReader implementation
    Provides capability multiplexing
LineRecordReader - Reads a line from a text file
KeyValueRecordReader - Used by KeyValueTextInputFormat

Input Split Size

FileInputFormat will divide large files into chunks
    Exact size controlled by mapred.min.split.size
RecordReaders receive the file, offset, and length of the chunk
Custom InputFormat implementations may override split size
    e.g., NeverChunkFile (see the sketch below)
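
A sketch of what a NeverChunkFile-style format could look like: subclass the old mapred TextInputFormat and override isSplitable(). The slide names the class; the body here is an assumption.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// A text input format whose files are never split: each file becomes
// exactly one InputSplit, and therefore one map task.
public class NeverChunkFile extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}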

Sending Data To Reducers

Map function receives an OutputCollector object
    OutputCollector.collect() takes (k, v) elements
Any (WritableComparable, Writable) pair can be used

WritableComparator

Compares WritableComparable data
    Will call WritableComparable.compareTo() by default
    Can provide a fast path for serialized data
Registered with JobConf.setOutputValueGroupingComparator()
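
As an illustration only (not from the slides), a grouping comparator that treats Text keys starting with the same letter as equal, so their values reach a single reduce() call, could look roughly like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: two keys compare equal when they start with the
// same character, so all of their values are grouped into one reduce() call.
public class FirstLetterComparator extends WritableComparator {
    public FirstLetterComparator() {
        super(Text.class, true);   // true = deserialize keys before comparing
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String sa = a.toString();
        String sb = b.toString();
        char ca = sa.isEmpty() ? '\0' : sa.charAt(0);
        char cb = sb.isEmpty() ? '\0' : sb.charAt(0);
        return ca - cb;
    }
}

// Registered in the driver with:
//   conf.setOutputValueGroupingComparator(FirstLetterComparator.class);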

Sending Data To The Client

Reporter object sent to the Mapper allows simple asynchronous feedback
    incrCounter(Enum key, long amount)
    setStatus(String msg)
Allows self-identification of input
    InputSplit getInputSplit()
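
A small sketch of how a map() body might use the Reporter it is handed; the counter enum and the status text are invented for illustration.

import org.apache.hadoop.mapred.Reporter;

public class ReporterExample {
    // Hypothetical counters; Hadoop aggregates them across all task attempts.
    public enum RecordCounters { GOOD, MALFORMED }

    // Called from inside map() with the Reporter Hadoop passes in.
    static void report(boolean malformed, Reporter reporter) {
        reporter.incrCounter(malformed ? RecordCounters.MALFORMED : RecordCounters.GOOD, 1);
        reporter.setStatus("last record was " + (malformed ? "malformed" : "good"));
        // reporter.getInputSplit() identifies which InputSplit this task is reading
    }
}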

Shuffling

Partition And Shuffle

[Diagram: each Mapper's intermediate (k, v) pairs pass through a Partitioner,
which assigns every key to one reduce partition; the shuffle then collects all
intermediates for a partition at a single Reducer.]

Partitioner

int getPartition(key, val, numPartitions)
    Outputs the partition number for a given key
    One partition == values sent to one Reduce task
HashPartitioner used by default
    Uses key.hashCode() to return the partition number
JobConf sets the Partitioner implementation (see the sketch below)
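
Tying this back to the earlier 26-machines-by-first-letter idea, a custom Partitioner along those lines might look like the sketch below (not from the slides); it would be registered with conf.setPartitionerClass(...).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends every word to a reduce partition chosen by its first letter,
// falling back to a hash for empty or non-alphabetic keys.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf conf) { }   // nothing to configure

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String w = key.toString();
        char c = w.isEmpty() ? '\0' : Character.toLowerCase(w.charAt(0));
        if (c >= 'a' && c <= 'z') {
            return (c - 'a') % numPartitions;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
//   conf.setPartitionerClass(FirstLetterPartitioner.class);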

Reduction

reduce(WritableComparable key,
       Iterator values,
       OutputCollector output,
       Reporter reporter)
Keys & values sent to one partition all go to the same reduce task
Calls are sorted by key: earlier keys are reduced and output before later keys

OutputFormat

Finally: Writing The Output

[Diagram: each Reducer writes its results through a RecordWriter, supplied by
the job's OutputFormat, into its own output file.]

WordCount M/R
map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit((String) token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}

Word Count: Java Mapper


public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            Text word = new Text(itr.nextToken());
            output.collect(word, new IntWritable(1));
        }
    }
}


Word Count: Java Reduce


public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}


Word Count: Java Driver


public void run(String inPath, String outPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    conf.setOutputKeyClass(Text.class);            // output key type of map/reduce
    conf.setOutputValueClass(IntWritable.class);   // output value type of map/reduce
    FileInputFormat.addInputPath(conf, new Path(inPath));
    FileOutputFormat.setOutputPath(conf, new Path(outPath));
    JobClient.runJob(conf);
}


WordCount with many mappers and one reducer

Job, Task, and Task Attempt IDs


The format of a job ID is composed of the time that the jobtracker (not the job)
started and an incrementing counter maintained by the jobtracker to uniquely
identify the job to that instance of the jobtracker.
    job_201206111011_0002:
    is the second (0002; job IDs are 1-based) job run by the jobtracker,
    which started at 10:11 on June 11, 2012.

Tasks belong to a job, and their IDs are formed by replacing the job prefix of a
job ID with a task prefix, and adding a suffix to identify the task within the job.
    task_201206111011_0002_m_000003:
    is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID
    job_201206111011_0002.
    Task IDs are created for a job when it is initialized, so they do not necessarily
    dictate the order in which the tasks will be executed.

Tasks may be executed more than once, due to failure or speculative execution,
so to identify different instances of a task execution, task attempts are given
unique IDs on the jobtracker.
    attempt_201206111011_0002_m_000003_0:
    is the first (0; attempt IDs are 0-based) attempt at running task
    task_201206111011_0002_m_000003.

Exercise - description

The objectives for this exercise are:


Become familiar with decomposing a problem into Map and Reduce stages.
Get a sense for how MapReduce can be used in the real world.

An inverted index is a mapping of words to their location in a set of documents.


Most modern search engines utilize some form of an inverted index to process
user-submitted queries. In its most basic form, an inverted index is a simple hash
table which maps words in the documents to some sort of document identifier.

For example, given the following 2 documents:
    Doc1: Buffalo buffalo buffalo.
    Doc2: Buffalo are mammals.
we could construct the following inverted file index:
    Buffalo  -> Doc1, Doc2
    buffalo  -> Doc1
    buffalo. -> Doc1
    are      -> Doc2
    mammals. -> Doc2

Exercise - tasks

Task 1 (30 min):
    Write pseudo-code for map and reduce to solve the inverted index problem.
    What are your K1, V1, K2, V2, etc.?
    Execute your pseudo-code on the following example and explain what the
    Shuffle & Sort stage does with the keys and values.

Task 2 (30 min):
    Use the distributed code (Python/Java) and execute it following the instructions.
    Where were the input and output data stored, and in what format?
    What K1, V1, K2, V2 data types were used?

Task 3 (45 min):
    Some words are so common that their presence in an inverted index is "noise":
    they can obfuscate the more interesting properties of that document. For
    example, the words "the", "a", "and", "of", "in", and "for" occur in almost every
    English document. How can you determine whether a word is "noisy"?
    Re-write your pseudo-code to determine (your algorithm) and remove noisy
    words using the map-reduce framework.

Group / individual presentation (45 min)

Example: Inverted Index


Input: (filename, text) records
Output: list of files containing each word

Map:
    foreach word in text.split():
        output(word, filename)

Combine: unique filenames for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))


Inverted Index

Input documents:
    hamlet.txt: to be or not to be
    12th.txt:   be not afraid of greatness

Map output (word, filename) pairs:
    to, hamlet.txt
    be, hamlet.txt
    or, hamlet.txt
    not, hamlet.txt
    be, 12th.txt
    not, 12th.txt
    afraid, 12th.txt
    of, 12th.txt
    greatness, 12th.txt

Reduce output (word, list of files):
    afraid, (12th.txt)
    be, (12th.txt, hamlet.txt)
    greatness, (12th.txt)
    not, (12th.txt, hamlet.txt)
    of, (12th.txt)
    or, (hamlet.txt)
    to, (hamlet.txt)

A better example
Billions of crawled pages and links
Generate an index of words linking to the web URLs in which they occur
Input is split into url -> page (lines of pages)
Map looks for words in the lines of a page and puts out (word, url) pairs
Shuffle & Sort groups the (k, v) pairs to generate word -> {list of urls}
Reduce writes the pairs to the output

Search Reverse Index


public static class MapClass extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

    private Text word = new Text();

    public void map(Text url, Text pageText,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = pageText.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // TODO: ignore unwanted and redundant ("noisy") words here
            word.set(itr.nextToken());
            output.collect(word, url);   // emit (word, url)
        }
    }
}

Search Reverse Index


public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text word, Iterator<Text> urls,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        // Concatenate all URLs that contain this word into one output value
        StringBuilder list = new StringBuilder();
        while (urls.hasNext()) {
            if (list.length() > 0) {
                list.append(", ");
            }
            list.append(urls.next().toString());
        }
        output.collect(word, new Text(list.toString()));
    }
}

End of session
Day 1: First MR job - Inverted Index construction
