
MapReduce Programming #1

MapReduce Overview
MapReduce is a programming model for processing large data sets with a parallel, distributed
algorithm on a cluster. In our case, the Hadoop Distributed File System (HDFS) is used, and it
provides a cluster of commodity datanodes across which the blocks of the files are distributed.
A MapReduce (MR) program comprises a Map procedure, or routine, that performs extraction,
filtering, and sorting, and a Reduce procedure that performs a summary operation. The
MapReduce infrastructure, or framework, coordinates and controls the distributed servers:
it runs the various tasks in parallel, manages all communications and data transfers between
the various parts of the system, provides for redundancy and fault tolerance, and manages
the overall process.
The model is inspired by the Map and Reduce functions commonly used in functional programming,
although their purpose in the MapReduce framework is not the same as in their original forms. All
the same, functional programming breaks a problem down into a set of functions that take inputs
and produce outputs, and that is the origin of the approach.
The key contribution of the Hadoop MapReduce framework is not the actual map and reduce
functions, since you supply your own Map and Reduce functions (generally in the form of Java
classes) depending on the work that needs to be done. Rather, it is the scalability and
fault tolerance achieved for a variety of applications by optimizing the execution engine once.
In addition to the framework, you supply for each specific problem the mapper algorithm (or piece
of code), the reducer algorithm, and any appropriate parameters, as in the sketch below.
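
To make this concrete, here is a minimal sketch of the kind of Mapper and Reducer classes you might supply, using the classic word-count example and the Hadoop Java API. The class names WordCountMapper and WordCountReducer are placeholders chosen for this illustration, not part of Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: extracts words from each input line and emits an intermediate (word, 1) pair.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all of the counts for one word, grouped by the framework, and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}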
The name MapReduce originally referred to a proprietary Google technology, but has since been
genericized.
After completing this topic, you will be able to:
Describe & explain the term map in regard to Hadoop
Describe & explain the term reduce in regard to Hadoop
Describe how the JobTracker and TaskTrackers work with MapReduce
Explain the fault tolerance capability of MapReduce
There is a single JobTracker for the cluster. Each datanode in the cluster has a TaskTracker, and
thus there are multiple TaskTrackers, possibly dozens, hundreds, or thousands, depending on
the number of nodes in the cluster.

What is MapReduce?
It is a way to process large data sets by distributing the work across a large number of nodes.
Prior to executing the Mapper function, the master node partitions the input into smaller sub-problems, which are then distributed to worker nodes.
The JobTracker runs on the master node, and a TaskTracker works on each data (or worker)
node. Worker nodes may themselves act as master nodes in that they in turn may partition the
sub-problem into even smaller sub-problems.
In the Reduce step, the master node takes the answers from all of the Mapper sub-problems and
combines them in such a way as to produce the output that solves the problem.
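
As a sketch of how such a job is handed to the framework, the driver below configures and submits a job that reuses the WordCountMapper and WordCountReducer classes sketched earlier; the input and output paths are taken from the command line, and the framework then performs the partitioning, distribution, and combination just described.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: describes the job to the framework and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The mapper and reducer classes supplied for this particular problem.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (key, value) output records.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; the map and reduce tasks run in parallel on the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}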
Let's look at an analogy to understand the concept. For the moment, forget about Hadoop.
Let's look at the MapReduce paradigm in something more familiar, a relational database. Since
this is an IBM course, DB2 will be used in this example, but other RDBMSs closely follow suit.
Assume that you have a massively parallel processing database environment. This is similar to a
Hadoop cluster in that you have multiple machines or nodes working together. Next assume that
you have an employee table that is partitioned across multiple nodes in your DB2 system. This is
just a somewhat technical way of saying that portions of the employee table reside on various
nodes in the database cluster, although to the user, the employee table appears as a single entity.
A client program connects to the coordinator node and sends a request to total the number of
employees in each job classification. The coordinator node in turn sends that request to each of
the nodes on which a portion of the employee table resides. Since there are no inter-data
dependencies for this request, each sub-agent is able to process the request against its portion
of the table in parallel with all of the other sub-agents.
Each sub-agent reads through the portion of the employee table that it stores and extracts the job
classification for each employee. Then each sub-agent sorts the results in job classification
sequence. Each sub-agent reads through the sorted results, counting the number of records for
each job classification. Finally each sub-agent sends, for each job classification, a single record
back to the coordinator node with the job classification value and its number of occurrences.
Once the coordinator node has all of the results from the sub-agents, it is then able to sort the
records on job classification and come up with a total for each job classification. The coordinator
then returns its results to the client.
In this example, the work done by the sub-agents is the mapping phase. The work handled by the
coordinator is the reduce phase.
The analogy breaks down somewhat in the case of Hadoop MR, in that Hadoop MR generally
provides multiple Reduce tasks to finalize the output. Note that a unified result (one output
file) is not a requirement for the output of Hadoop MR, but you could, if you wish, require
that only one Reduce task run. It all depends on what you need to get done.
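
As a rough sketch of how the analogy maps onto Hadoop code, the mapper below plays the role of a sub-agent, emitting a (job classification, 1) pair for each employee record it reads, and the reducer plays the role of the coordinator, totalling the pairs for each classification; in Hadoop, of course, that reduce work may itself be spread over several Reduce tasks. The comma-delimited record layout and the position of the job classification field are assumptions made purely for this illustration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper (the "sub-agent"): extracts the job classification from each employee record.
public class ClassificationCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assumption: comma-delimited records with the job classification in the third field.
        String[] fields = record.toString().split(",");
        if (fields.length > 2) {
            context.write(new Text(fields[2].trim()), ONE);
        }
    }
}

// Reducer (the "coordinator"): the framework has already grouped and sorted by classification,
// so the reducer simply totals the occurrences for each one.
class ClassificationCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text classification, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(classification, new IntWritable(total));
    }
}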

With large problems, you might prefer the output files themselves to be distributed across the
HDFS as the output of this MR job may be the input to a separate and later MR job or another
job. This chaining of jobs can be done with Oozie or other job coordination frameworks.
We continue in the next video.
