Krishna Prasad. D
B.E 3rd Year, Department of Computer Science & Engineering
Canara Engineering College, Mangalore, Karnataka, India.
d.krishnaprasad2010@gmail.com
Abstract: Enterprise-level users need mass
data storage and high-volume data
processing, and traditional systems cannot
keep up with the growth of their data
services. To solve this problem, an
enterprise file cloud system was proposed.
It is based on Hadoop, which uses Linux
cluster technology and distributed file
systems. This paper provides a beginner's
approach to distributed parallel
programming with the Hadoop framework by
explaining the necessary framework and
giving an introduction to MapReduce.
Keywords: MapReduce, Clusters, Hadoop,
Data Processing, Distributed Programming.
1. Introduction
We now live in the age of Big Data, where
the data volumes we need to work with on
a day-to-day basis have outgrown the
storage and processing capabilities of a
single host.
As of 2014, the Internet Archive stored
around 2 petabytes of data and was
growing at a rate of 20 terabytes per
month [1], and Facebook hosted
approximately 10 billion photos, taking up
around 1 petabyte of data [2].
Hadoop was created in 2005 by Doug
Cutting and Mike Cafarella [3]; it was
named after a toy elephant that belonged
to Cutting's son. Hadoop addresses these
problems by distributing the work and the
data over
work
is
Comparison to RDBMS
Relational database management systems
(RDBMS) are often compared to Hadoop,
but the two are closer to opposites. A
traditional RDBMS typically handles data
sets of up to about one terabyte, whereas
Hadoop can store many terabytes of data.
Another big difference is that an RDBMS
can store only structured data, whereas
Hadoop can store all kinds of data. For
illustration, suppose that around 60% of
all the data we receive is semi-structured
or unstructured; that data cannot be
stored in an RDBMS.
Example: the data that users host on
Facebook includes audio, video, text, etc.
Such data cannot be stored in an RDBMS
but can be stored in a Hadoop system.
2. Hadoop Framework
Hadoop is an open source software
framework that allows large sets of data to
be processed using commodity hardware.
3. MapReduce
MapReduce is a computational paradigm
designed to process very large sets of data in a
distributed fashion. The MapReduce model was
developed by Google to implement their search
technology, specifically the indexing of web
pages. It executes in three stages: map,
shuffle, and reduce. In some simple cases only
the map and reduce stages are used. It performs
a sequence of transformations on key-value
pairs.
Map
This phase transforms each record
independently; a record may be dropped,
transformed, or replicated.
(k1, v1) → list(k2, v2)
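The signature above says that one input pair may produce zero, one, or many output pairs. A minimal Python sketch of that idea follows. The function names and the sample record are hypothetical and chosen only for illustration; this is not Hadoop's own API.

```python
from typing import Iterator, Tuple


def drop_blank(offset: int, line: str) -> Iterator[Tuple[int, str]]:
    # Dropping: a blank line produces an empty list of output pairs.
    if line.strip():
        yield offset, line


def to_upper(offset: int, line: str) -> Iterator[Tuple[int, str]]:
    # Transformation: exactly one output pair per input pair.
    yield offset, line.upper()


def emit_words(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    # Replication: one input pair fans out into one pair per word.
    for word in line.split():
        yield word, 1


if __name__ == "__main__":
    record = (0, "big data big clusters")  # hypothetical (k1, v1) pair
    print(list(drop_blank(*record)))
    print(list(to_upper(*record)))
    print(list(emit_words(*record)))
```

Each function looks only at the record it is given, which is what lets map tasks run independently on different machines.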
Figure 1: Hadoop Framework.
Reduce
This phase aggregates the results from the
map phase for every unique key, of the type
Framework
The framework sets up and launches the job
JAR. It also takes on responsibilities such
as scheduling tasks, rerunning failed
tasks, splitting the input, moving the map
output to the reduce input, and receiving
the result.
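Rerunning a failed task can be pictured as a simple retry loop. The sketch below is only an illustration of that one responsibility: the helper names, the simulated failure, and the limit of three attempts are all assumptions, not values or calls taken from Hadoop.

```python
from typing import Callable, Dict, List, Tuple


def run_with_retries(task: Callable[[List[str]], int],
                     split: List[str],
                     max_attempts: int = 3) -> int:
    """Run one task on one split, rerunning it if it raises."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(split)
        except Exception as error:  # a real scheduler would be more selective
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def make_flaky_task(failures: int) -> Tuple[Callable[[List[str]], int], Dict[str, int]]:
    """Build a hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task(split: List[str]) -> int:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise IOError("simulated task failure")
        return len(split)  # stand-in for real work: count the records

    return task, state


if __name__ == "__main__":
    task, state = make_flaky_task(failures=2)
    print(run_with_retries(task, ["record 1", "record 2"]), state["calls"])
```

Because a map task depends only on its own split, it can be rerun without affecting the other tasks.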
MapReduce Flow
We start with an input, which may be one
file or many files. The framework breaks
the input up into splits. Each split
contains records, and each split is
processed by a map task. The shuffle then
sorts the map output and groups it by key,
as shown in Figure 2.
Figure 2: MapReduce flow.
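The two framework-side steps of this flow, splitting the input and shuffling the map output, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, and splits are cut by record count here purely for readability.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def make_splits(records: List[str], split_size: int) -> List[List[str]]:
    # Break the input into consecutive splits of at most split_size records.
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def shuffle(map_output: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    # Group all values that share a key, then order the groups by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)
    return sorted(groups.items())


if __name__ == "__main__":
    print(make_splits(["r1", "r2", "r3"], split_size=2))
    print(shuffle([("b", 1), ("a", 1), ("b", 2)]))
```

After the shuffle, every reduce call sees one key together with all of the values emitted for it.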
All of these phases are illustrated by the
example of a word count algorithm, as
shown in Figure 3.
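A word count can also be written as a small single-process Python sketch that runs the map, shuffle, and reduce stages in order. The function names and the sample lines are assumptions made for illustration, and the reduce function is assumed to receive a key together with the list of its values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    # Map: emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    # Reduce: add up all the ones collected for a single word.
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Shuffle: group the map output by word before reducing.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # hypothetical input
    print(word_count(sample))
```

On a cluster, the calls to map_words would run on different machines, one per split, and the grouping would be done by the framework rather than by a local dictionary.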
5. Conclusion
Acknowledgement
If a client wants to read data, it first
has to contact the name node to find out
where the actual data is stored. The name
node replies with the block identifier and
the location (i.e. the data node) from
which the client has to fetch the data.
Similarly, if the client wants to write
data, it has to contact the name node to
update the namespace and check the
permissions. The name node allocates a new
block on a suitable data node, and the
client then streams directly to the
selected data node to perform its
operation. Since the data itself does not
pass through the name node, there is no
bottleneck.
Figure 5 represents the entire process of
the Hadoop system.
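The read and write paths described above can be modelled with a small in-memory Python sketch. Every class and function name here is hypothetical, the block size of 4 bytes is arbitrary, and permission checks and replication are left out; the aim is only to show that the name node holds metadata while the data moves between the client and the data nodes.

```python
from typing import Dict, List, Tuple


class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}


class NameNode:
    """Stores only metadata: which blocks make up a file and where they live."""

    def __init__(self, data_nodes: List[DataNode]) -> None:
        self.data_nodes = {node.name: node for node in data_nodes}
        self.namespace: Dict[str, List[Tuple[int, str]]] = {}
        self.next_block_id = 0

    def allocate_block(self, path: str) -> Tuple[int, DataNode]:
        # Choose the data node holding the fewest blocks (an assumed policy).
        target = min(self.data_nodes.values(), key=lambda node: len(node.blocks))
        block_id = self.next_block_id
        self.next_block_id += 1
        self.namespace.setdefault(path, []).append((block_id, target.name))
        return block_id, target

    def locate(self, path: str) -> List[Tuple[int, DataNode]]:
        # Reply with block identifiers and the data nodes that hold them.
        return [(bid, self.data_nodes[name]) for bid, name in self.namespace[path]]


def write_file(name_node: NameNode, path: str, data: bytes, block_size: int = 4) -> None:
    for start in range(0, len(data), block_size):
        block_id, node = name_node.allocate_block(path)
        # The client sends the bytes straight to the chosen data node.
        node.blocks[block_id] = data[start:start + block_size]


def read_file(name_node: NameNode, path: str) -> bytes:
    # Ask the name node where the blocks are, then fetch them from the data nodes.
    return b"".join(node.blocks[bid] for bid, node in name_node.locate(path))


if __name__ == "__main__":
    nodes = [DataNode("dn1"), DataNode("dn2")]
    name_node = NameNode(nodes)
    write_file(name_node, "/demo.txt", b"hello world")
    print(read_file(name_node, "/demo.txt"))
```

Note that NameNode never stores file bytes in this sketch, which mirrors the point that the data does not pass through the name node.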
References
[1] http://www.it.iitb.ac.in/frg/brainstorming/sites/default/files/hadoopHive.pdf
[2] Tom White, Hadoop: The Definitive Guide,
Second Edition. O'Reilly Media, Inc., 1005
Gravenstein Highway North, Sebastopol, CA
95472. Copyright 2011.
[3] https://hadoopthenextbigthing.wordpress.com/2014/11/26/history-of-hadoop/
[4] http://www.cloudera.com/content/cloudera/en/documentation/HadoopTutorial/CDH4/Hadoop-Tutorial.html
[5] Jimmy Lin and Chris Dyer, Data-Intensive
Text Processing with MapReduce. University
of Maryland, College Park. Manuscript
prepared April 11, 2010.