
Presented by

Rahul Singh
Roll No.: 1503314918, MCA
Raj Kumar Goel Institute of Technology
Contents
Introduction to Hadoop
Hadoop Architecture
HDFS Architecture
MapReduce
What is Hadoop?
Hadoop is an Apache open-source framework, written in Java, that allows
distributed processing of large datasets across clusters of computers using
simple programming models. Applications built on the Hadoop framework run in
an environment that provides distributed storage and computation across
clusters of computers.
Hadoop Architecture
At its core, Hadoop has two major layers:
(a) the processing/computation layer (MapReduce), and
(b) the storage layer (the Hadoop Distributed File System, HDFS).
Hadoop core components
Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed
for storing and analyzing vast amounts of unstructured data in a
distributed computing environment. These clusters run on low-cost
commodity computers.
A Hadoop cluster has three components:
Client
Master
Slave
HDFS Architecture
Hadoop: Typical Workflow in HDFS
Let's try to answer these questions, taking Sample.txt as the example input file.

How does Sample.txt get loaded into the Hadoop cluster?
The client machine performs this step: it loads Sample.txt into the
cluster by breaking it into smaller chunks, known as "blocks" in the
Hadoop context, and placing these blocks on different machines
(DataNodes) throughout the cluster.
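As a minimal sketch of this step, the HDFS Java API can copy a local file into the cluster; the client library splits the file into blocks behind the scenes. The NameNode address and the file paths below are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadSample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        // The client splits Sample.txt into blocks and writes them to the
        // DataNodes chosen by the NameNode. Paths are illustrative.
        fs.copyFromLocalFile(new Path("/local/path/Sample.txt"),
                             new Path("/user/hadoop/Sample.txt"));
        fs.close();
    }
}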
Next, how does the client know which DataNodes to load the blocks onto?

This is where the NameNode comes into the picture. The NameNode
uses its rack-awareness intelligence to decide which DataNodes to
assign. For each data block (in this case Block-A, Block-B, and
Block-C), the client contacts the NameNode, and in response the
NameNode sends an ordered list of three DataNodes.
For example, in response to the Block-A request, the NameNode may
send DataNode-2, DataNode-3, and DataNode-4.
Who does the block replication?
The DataNodes themselves do: as the client writes a block to the
first DataNode in the list, that DataNode forwards the data to the
second, which forwards it to the third, forming a replication pipeline.
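The target replication factor comes from the dfs.replication property in hdfs-site.xml (3 by default). As a hedged sketch with an assumed file path, the same setting can also be applied through the Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many DataNodes each new block
        // is pipelined to; 3 is the Hadoop default.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // The replication factor of an existing file can also be changed
        // after the fact; the NameNode then schedules the extra copies.
        // The path is illustrative.
        fs.setReplication(new Path("/user/hadoop/Sample.txt"), (short) 3);
        fs.close();
    }
}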
MapReduce Overview
A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
The Mapper
Reads data as key/value pairs
The input key is often discarded
Outputs zero or more key/value pairs
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value pairs
Usually just one output per input key
MapReduce: Word Count
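The sketch below follows the standard Apache Hadoop tutorial version of word count: the mapper discards the input key (the line's byte offset) and emits a (word, 1) pair per token, and the reducer sums the counts for each unique word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input line.
  // The input key (the line's byte offset) is discarded.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: called once per unique word with all its counts;
  // outputs one final (word, total) pair per key.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}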
Who uses Hadoop?
