
THE HADOOP FRAMEWORK

Krishna Prasad. D
B.E 3rd Year, Department of Computer Science & Engineering
Canara Engineering College, Mangalore, Karnataka, India.
d.krishnaprasad2010@gmail.com
Abstract: Enterprise-level users face the twin problems of mass data storage and heavy data processing, and traditional systems cannot keep pace with the growth of their data services. To address this, an enterprise file cloud system based on Hadoop was proposed, which uses Linux cluster technology and distributed file systems. This paper provides a beginner's approach to distributed parallel programming with the Hadoop framework by explaining the necessary components and giving an introduction to MapReduce.
Keywords: MapReduce, Clusters, Hadoop, Data Processing, Distributed Programming.

1. Introduction
We now live in the age of Big Data, where the data volumes we need to work with on a day-to-day basis have outgrown the storage and processing capabilities of a single host. As of 2014, the Internet Archive stored around 2 petabytes of data and was growing at a rate of 20 terabytes per month [1], and Facebook hosted approximately 10 billion photos, taking up about 1 petabyte of storage [2].
Hadoop was created in 2005 by Doug Cutting and Mike Cafarella [3]; it was named after Cutting's son's toy elephant. Hadoop addresses these problems by distributing the work and the data over thousands of machines, spreading the work across the cluster.
The MapReduce programming technique is used to process these large amounts of data at a faster rate. It is a programming model for distributed computing in which two functions, Map and Reduce, receive incoming data, transform it, and send it to an output channel.

Comparison to RDBMS
Relational database management systems (RDBMS) are often compared to Hadoop, but the two are closer to opposites: a typical RDBMS deployment can store only up to about one terabyte of data, whereas Hadoop can store many terabytes and beyond.
Another big difference is that an RDBMS can store only structured data, whereas Hadoop can store all kinds of data. As an illustration, of all the data in the world, roughly 60% of what we collect is semi-structured or unstructured data, which cannot be stored in an RDBMS.
Example: the data hosted by users on Facebook includes audio, video, text, and so on; such data cannot be stored in an RDBMS but can be stored in a Hadoop system.

2. Hadoop Framework
Hadoop is an open source software
framework that allows large sets of data to
be processed using commodity hardware.

Hadoop is designed to run on top of a large cluster of nodes that are connected to form a large distributed system. Hadoop implements a computational paradigm known as MapReduce, which was inspired by an architecture developed by Google to implement its search technology. The MapReduce model runs over a distributed file system, and the combination allows Hadoop to process a huge amount of data while at the same time being fault tolerant [5].
Hadoop consists of the Hadoop Common package, which provides file system and operating system level abstractions, a MapReduce engine [4], and the Hadoop Distributed File System.
The Hadoop framework is written in Java but can run MapReduce programs expressed in various languages, for example Ruby, Python, and C++. The framework consists of several modules which provide different parts of the functionality necessary to distribute tasks across the cluster.

Figure 1: Hadoop framework.

A Hadoop cluster typically consists of 10 to 5,000 nodes, subdivided into racks by a two-level network topology [2], as shown in Figure 1. The framework configuration should include information about this topology so the underlying system can use node locations for its replication decisions. Each node runs a task tracker for the MapReduce tasks and a data node for the distributed storage system. One special node, the master node, runs the job tracker and the name node, which organize the distribution of the tasks.
The central part of the framework is the Common module, which contains generally available interfaces and tools to support the use of the other modules.
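As a sketch of how such topology information is typically supplied (this configuration is an illustration, not taken from the paper), Hadoop 2.x can be pointed at a site-specific script that maps host addresses to rack identifiers in core-site.xml; the script path below is hypothetical:

<!-- core-site.xml: a minimal rack-awareness sketch. The property name is
     the one used by Hadoop 2.x; /etc/hadoop/rack-topology.sh stands in
     for a site-specific script mapping host IPs to rack IDs. -->
<configuration>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/rack-topology.sh</value>
  </property>
</configuration>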

3. MapReduce
MapReduce is a computational paradigm
designed to process very large sets of data in a
distributed fashion. The MapReduce model was
developed by Google to implement their search
technology, specifically the indexing of web
pages. It executes in three stages: map, shuffle, and reduce. In some simple cases only the map and reduce stages are used. The model performs a sequence of transformations on key-value pairs.

Map
This phase is used for independent record transformations, ranging from dropping records to replicating them:
(k1, v1) → list(k2, v2)

The map function receives a key-value pair of some generic type (k1, v1) and produces key-value pairs (k2, v2) of some other type.
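As a minimal sketch of such a map function in Hadoop's Java API, using the word count example of Figure 3 (the class name is illustrative): here (k1, v1) is a (byte offset, line of text) pair and each emitted (k2, v2) pair is (word, 1).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}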

Reduce
This phase aggregates the results from the map phase: for every unique key of type k2, the reduce function receives the set of values of type v2 associated with that key.
(k2, list(v2)) → list(k2, v3)
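A matching reduce function for the word count example, again a sketch with an illustrative class name: for each unique word k2 it sums the list of counts v2 and emits a single (word, total) pair.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum the list of counts (v2) received for each unique word (k2).
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}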

Framework
This phase sets up and launches the job jar, and it also takes on responsibilities such as scheduling and rerunning failed tasks, splitting the input, and moving map output to reduce input; it thus delivers the result.
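A minimal driver sketch showing how such a job is set up and launched with Hadoop's Java API (the job name and the input/output paths taken from the command line are illustrative); splitting, scheduling, shuffling, and rerunning failed tasks are all handled by the framework:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The driver only describes the job; the framework runs it.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}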

MapReduce Flow
We start with an input, which may be one file or many files. The framework breaks the input into splits, each of which contains records, and each split is processed by a map task. The framework then shuffles and orders the map output before it reaches the reduce tasks, as shown in Figure 2.

Figure 2: MapReduce flow.

All of these phases can be illustrated using the word count algorithm as shown in Figure 3: the map tasks emit a (word, 1) pair for every word in their split, and the reduce tasks sum the counts for each unique word.

Figure 3: word count using the MapReduce concept.

4. Hadoop Distributed File System (HDFS)
HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce. HDFS is meant to keep working under a wide variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.
HDFS has a master-slave architecture:


Name node: the master maintains the namespace (metadata, file-to-block mappings, block locations) and the overall health of the file system. It keeps this metadata in RAM, where each block occupies roughly 150 bytes; as an illustration, 10 million blocks would need roughly 1.5 GB of name node memory. Without this node the file system cannot be used.
Data node: the slaves manage the data blocks; they store the data and talk to clients, and they periodically report to the name node the list of blocks they hold.

Figure 4: HDFS architecture.

If a client wants to read data, it first contacts the name node to find where the actual data is stored; the name node replies with the block identifiers and the locations (the data nodes) from which the client has to fetch the data. Similarly, if a client wants to write data, it contacts the name node to update the namespace and check the permissions; the name node then allocates a new block on a suitable data node, and the client streams directly to the selected data node to perform its operation. Since the data does not move through the name node, there is no bottleneck there. Figure 5 represents this entire process of the Hadoop system.

Figure 5: Hadoop process.
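As a minimal sketch of this client interaction using the HDFS Java API (the file path is hypothetical): the client only asks the name node for metadata, then streams the bytes to or from the data nodes directly.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() talks to the name node configured in
        // core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write: the name node allocates blocks and the client streams
        // the bytes straight to the chosen data nodes.
        Path file = new Path("/tmp/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the name node returns block locations and the client
        // fetches the data from the data nodes directly.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}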

5. Conclusion
This paper demonstrates the use of Hadoop systems, which can overcome the disadvantages of the RDBMS in that semi-structured and unstructured data can be stored without any difficulty. The system described uses Linux as its operating system, and the newly released Hadoop 2.2.0 can also be used on the Windows operating system.

Acknowledgement

I would like to thank Hortonworks for providing the Hadoop source code, and I would also like to thank Prof. Abhishek S. Rao for all his support and guidance.

References
[1] http://www.it.iitb.ac.in/frg/brainstorming/sites/default/files/hadoopHive.pdf
[2] Tom White, Hadoop: The Definitive Guide, Second Edition. O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472, 2011.
[3] https://hadoopthenextbigthing.wordpress.com/2014/11/26/history-of-hadoop/
[4] http://www.cloudera.com/content/cloudera/en/documentation/HadoopTutorial/CDH4/Hadoop-Tutorial.html
[5] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce. University of Maryland, College Park. Manuscript prepared April 11, 2010.


[6] http://architects.dzone.com/articles/big-data-beyond-mapreduce
