
What is Hadoop (Storage perspective)?

Hadoop is a Java framework (software platform) for storing vast amounts of data (and
also processing it). It can be set up on commonly available commodity computers.
Use Case
It can be used when the following requirements arise:
Store terabytes of data: HDFS uses commonly available computers and storage
devices and pools the storage space of all the systems into one large logical store.
Streaming access to data: HDFS is designed more for batch processing than for
interactive use by users. The emphasis is on high throughput of data access rather
than low latency of data access.
Large data sets: File sizes are typically in gigabytes. HDFS is tuned to support large
files. It should provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster.
WORM requirement: HDFS applications need a write-once-read-many access
model for files. A file once created, written, and closed need not be changed. This
assumption simplifies data coherency issues and enables high throughput data
access.
High availability: HDFS stores multiple replicas of data on various systems in
the cluster. This ensures availability of the data even if some systems go down.
Architecture
Hadoop is based on Master-Slave architecture. An HDFS cluster consists of a single
Namenode (master server) that manages the file system namespace and regulates access
to files by clients. In addition, there are a number of Datanodes (Slaves), usually one per
node in the cluster, which manage storage attached to the nodes that they run on. HDFS
exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of Datanodes. The
Namenode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to Datanodes.
The Datanodes are responsible for serving read and write requests from the file system's
clients. The Datanodes also perform block creation, deletion, and replication upon
instruction from the Namenode. The Namenode is the arbitrator and repository for all
HDFS metadata. The system is designed in such a way that user data never flows through
the Namenode.

[Diagram: a Client, the single NameNode, and two DataNodes. The client contacts the
NameNode for metadata operations and reads and writes file data directly from and to
the DataNodes.]

Key Features
File System Namespace: HDFS supports hierarchical file organization. It
supports operations such as create, remove, move and rename on files as well as
directories. It does not implement permissions or quotas.
Replication: HDFS stores files as a series of blocks. Blocks are replicated for fault
tolerance. The replication factor and block size are configurable. Files in HDFS
are write-once and have strictly one writer at any time. Replica placement is
critical for performance. In large clusters, nodes are spread across racks, and the
racks are connected via switches. It is observed that the network bandwidth
between nodes within a rack is much higher than that between nodes on different
racks. Spreading replicas across racks protects the data against rack failure, while
the placement policy tries to keep most replica traffic within a rack to conserve
inter-rack bandwidth. To minimize global bandwidth consumption and read
latency, HDFS tries to satisfy a read request from the replica that is closest to the
reader. If there exists a replica on the same rack as the reader node, that replica is
preferred to satisfy the read request.
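The replication factor of data that is already in HDFS can also be changed from the DFS
shell. A hedged sketch (the -setrep option may not exist in very old releases; the
cluster-wide default comes from the dfs.replication property in hadoop-site.xml):

# Set the replication factor of everything under /user/hadoop/Movies to 2;
# -R recurses into the directory and -w waits until the new factor is reached.
bin/hadoop dfs -setrep -R -w 2 /user/hadoop/Movies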
File System Metadata: The Namenode uses an EditLog to record every change to the
file system metadata. The entire file system namespace is stored in a file called
FsImage.
Robustness: network or disk failure and data integrity.
Datanodes send heartbeat messages to the Namenode; when the Namenode stops
receiving them, the Datanode is marked dead. This may cause the replication factor of
some blocks to fall below their target value. The Namenode constantly monitors the
replication count of each block and re-replicates blocks whose count has fallen.
Re-replication may be needed because a replica is corrupted, a Datanode has died, or
the replication factor of a file has been increased. The Namenode also rebalances data
when free space on a node falls below a threshold value. HDFS stores a checksum for
each block and verifies it when the data is retrieved.
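The overall health of the cluster (live and dead Datanodes, configured and remaining
capacity, under-replicated blocks) can be inspected from the command line. A minimal
sketch using the dfsadmin tool:

# Run as the hadoop user from the Hadoop installation directory.
bin/hadoop dfsadmin -report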

NameNode failure: The FsImage and the EditLog are central data structures of
HDFS. A corruption of these files can cause the HDFS instance to be nonfunctional. For this reason, the Namenode can be configured to support
maintaining multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get updated
synchronously. This synchronous updating of multiple copies of the FsImage and
EditLog may degrade the rate of namespace transactions per second that a
Namenode can support. However, this degradation is acceptable because even
though HDFS applications are very data intensive in nature, they are not
metadata intensive. When a Namenode restarts, it selects the latest consistent
FsImage and EditLog to use. The Namenode machine is a single point of failure
for an HDFS cluster. If the Namenode machine fails, manual intervention is
necessary. Currently, automatic restart and failover of the Namenode software to
another machine is not supported.
Data organization: The default HDFS block size is 64 MB, and HDFS supports
write-once-read-many semantics. The HDFS client does local caching of writes.
Suppose the HDFS file has a replication factor of three. When the local file
accumulates a full block of user data, the client retrieves a list of Datanodes from the
Namenode. This list contains the Datanodes that will host a replica of that block. The
client then flushes the data block to the first Datanode. The first Datanode starts
receiving the data in small portions (4 KB), writes each portion to its local repository
and transfers that portion to the second Datanode in the list. The second Datanode, in
turn, starts receiving each portion of the data block, writes that portion to its
repository and then flushes that portion to the third Datanode. Finally, the third
Datanode writes the data to its local repository. Thus, a Datanode can be receiving
data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline; the data is pipelined from one Datanode to the next. When
data is deleted it is not removed immediately; rather, it remains in the trash directory,
from where it can be either removed or restored. How long data stays in the trash is
configurable; the default value is 6 hours.
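To see how a particular file has actually been split into blocks and where the replicas of
each block live, the fsck tool can be used; the trash can also be emptied on demand from
the DFS shell. A sketch (exact option names may vary between Hadoop releases):

# List the blocks of a file and the Datanodes holding each replica.
bin/hadoop fsck /user/hadoop/Movies -files -blocks -locations

# Permanently delete whatever is currently sitting in the trash.
bin/hadoop dfs -expunge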

How to access HDFS?


DFSShell: from the shell, a user can create, remove and rename directories as well as
files. It is intended for applications that use scripting languages to interact with
HDFS.
Browser Interface: HDFS installation configures the web server to expose the
HDFS namespace through a configurable TCP port. This allows a user to navigate
the HDFS namespace and view the contents of its files using a web browser.
For administration purposes, a DFSAdmin command is also provided.
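A few illustrative commands, assumed to be run as the hadoop user from the Hadoop
installation directory (a sketch; bin/hadoop dfs -help lists everything your release
actually supports):

# DFSShell: basic namespace and file operations
bin/hadoop dfs -mkdir /user/hadoop/demo
bin/hadoop dfs -copyFromLocal /etc/hosts /user/hadoop/demo/hosts
bin/hadoop dfs -ls /user/hadoop/demo
bin/hadoop dfs -cat /user/hadoop/demo/hosts
bin/hadoop dfs -rmr /user/hadoop/demo

# DFSAdmin: administrative view of the cluster
bin/hadoop dfsadmin -report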

Setting up Hadoop

We used three Linux boxes running CentOS to set up a Hadoop cluster. Details are as follows:

Property        NameNode        DataNode-1      DataNode-2
IP              10.245.0.121    10.245.0.131    10.245.0.57
Storage Space   -               35GB            9GB
Hostname        NameNode        DataNode-1      DataNode-2
Role            name node       data node       data node

Step By Step Approach


Multi Node Cluster
Step-1: (steps 1 to 5 need to be done on all nodes)
Set the host names of the three systems as indicated above and added the following
entries to the /etc/hosts file:
127.0.0.1
localhost localhost.localdomain localhost
10.245.0.57 DataNode-1
10.245.0.131 DataNode-2
10.245.0.121 NameNode
10.245.0.192 DataNode-3
Then ran the command hostname XXX on each of the three systems and rebooted them
(XXX corresponds to the hostname of each system).
Step-2:
Added a dedicated system user named hadoop.
[root@NameNode]#groupadd hadoop
[root@NameNode]#useradd -g hadoop hadoop

[root@NameNode]#passwd hadoop

Step-3:
Installed JDK (jdk-1_5_0_14-linux-i586.rpm) and hadoop (hadoop-0.14.4.tar.gz) as user
hadoop in /home/hadoop.
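A sketch of the installation commands, assuming the JDK RPM and the Hadoop tarball have
been downloaded to /home/hadoop (adapt the paths and package names to your environment):

# As root: install the JDK system-wide.
rpm -ivh /home/hadoop/jdk-1_5_0_14-linux-i586.rpm

# As the hadoop user: unpack Hadoop into the home directory.
cd /home/hadoop
tar -xzf hadoop-0.14.4.tar.gz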
Step-4:
Set up the Linux systems in such a way that any system can ssh to any other system
without a password. Copy the public keys of every system in the cluster (including itself)
into the authorized_keys file.
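A minimal sketch of the key setup, done as the hadoop user on each node in turn (assuming
OpenSSH and the host names configured in Step-1):

# Generate a passwordless RSA key pair.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize the key locally and on the other nodes
# (adjust the target host names when running on the other nodes).
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh DataNode-1 "cat >> ~/.ssh/authorized_keys" < ~/.ssh/id_rsa.pub
ssh DataNode-2 "cat >> ~/.ssh/authorized_keys" < ~/.ssh/id_rsa.pub
chmod 600 ~/.ssh/authorized_keys

Afterwards, ssh NameNode, ssh DataNode-1 and ssh DataNode-2 should all log in without
prompting for a password.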
Step-5:
Set JAVA_HOME variable in <hadoop install dir>/conf/hadoop-env.sh to correct path.
In our case it was export JAVA_HOME=/usr/java/jdk1.5.0_14/
Step-6: (on NameNode)

Add following entry into <HADOOP_INSTALL>/conf/masters file


NameNode

Add following entry into <HADOOP_INSTALL>/conf/slaves file

DataNode-1
DataNode-2
The conf/slaves file on the master is used only by scripts like bin/start-dfs.sh or
bin/stop-dfs.sh for starting and stopping the data nodes.

Step-7: (on data nodes)


Create a directory named hadoop-datastore (any name of your choice) where Hadoop
stores all its data. The path of this directory needs to be specified in the hadoop-site.xml
file as the hadoop.tmp.dir property.
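A sketch, using the path that appears later in hadoop-site.xml:

# Create the data directory and restrict it to the hadoop user.
mkdir -p /home/hadoop/hadoop-datastore
chown hadoop:hadoop /home/hadoop/hadoop-datastore
chmod 750 /home/hadoop/hadoop-datastore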
Step- 8:
Change the conf/hadoop-site.xml file. The file on the NameNode looks as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://NameNode:54310</value>
<description>
The name of the default file system. A URI whose scheme and authority
determine the FileSystem implementation. The URI's scheme determines
the config property (fs.SCHEME.impl) naming the FileSystem
implementation class. The URI's authority is used to determine the
host, port, etc. for the file system.
</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>NameNode:54311</value>
<description>The host and port that the MapReduce job tracker runs at.
If "local", then jobs are run in-process as a single map and reduce
task.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>The number of replications to make when a file is created.
</description>
</property>
</configuration>
On the data nodes, one extra property is added:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-datastore</value>
<description>A base for Hadoop temporary directories.</description>
</property>
Step- 9:
Format Hadoop's distributed filesystem (HDFS) for the namenode. You need to do this
the first time you set up a Hadoop cluster. Do not format a running Hadoop namenode,
this will cause all your data in the HDFS filesystem to be erased. The command is
<HADOOP_INSTALL>/bin/hadoop namenode -format

The HDFS name table is stored on the namenode's (here: master) local filesystem in the
directory specified by dfs.name.dir. The name table is used by the namenode to store
tracking and coordination information for the datanodes.
Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the
namenode to run on. This will bring up HDFS with the namenode running on the

machine you ran the previous command on, and datanodes on the machines listed in the
conf/slaves file.

Run the command <HADOOP_INSTALL>/bin/stop-dfs.sh on the namenode machine to


stop the cluster.
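Whether the daemons actually came up can be checked with jps, which ships with the Sun
JDK and lists the running Java processes (a quick sanity check; the exact process names can
differ between releases):

# On the NameNode machine after start-dfs.sh:
jps        # should list a NameNode process (and a SecondaryNameNode, if configured)

# On each data node:
jps        # should list a DataNode process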
Step-10:
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

http://NameNode:50070/ - web UI for HDFS name node(s)

These web interfaces provide concise information about what's happening in your
Hadoop cluster. You may have to update the hosts file on your Windows system so that the
host names resolve to their IPs.
From the NameNode you can do management as well as file operations via DFSShell.
The command <HADOOP_INSTALL>/bin/hadoop dfs -help lists the operations permitted
by the DFS shell, and <HADOOP_INSTALL>/bin/hadoop dfsadmin -help lists the
administration operations supported.

Step-11:
To add a new data node on the fly, just follow the above steps on the new node and execute the
following command on it to join the cluster.
bin/hadoop-daemon.sh --config <config_path> start datanode

Step-12:
To set up a client machine, install Hadoop on it and set the JAVA_HOME variable in
hadoop-env.sh. To copy data to HDFS from the client, use the -fs switch of the dfs
command together with the URI of the namenode:
bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -mkdir remotecopy
bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -copyFromLocal /home/devendra/jdk-1_5_0_14-linux-i586-rpm.bin remotecopy
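The copy can then be verified from the client with a listing against the same namenode URI:

bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -ls remotecopy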

Observations
1. I/O handling.
See the appendix for some test scripts and the log analysis.
2. Fault Tolerance
Observations on a two data-node cluster with replication factor 2:
The data was accessible even when one of the data nodes was down.
Observations on a three data-node cluster with replication factor 2:
The data was accessible when one of the data nodes was down.
Some data was still accessible when two nodes were down.
Overflow condition: With a two data-node setup where the nodes had free space of
20 GB and 1 GB, we tried to copy 10 GB of data. The copy operation completed
without any errors, but warnings in the log messages indicated that only one copy
of some blocks was made. (Presumably, if another datanode were connected on the
fly, the data would get replicated onto the new system; this still has to be tried out to be sure.)
Accidental data loss: Even if data blocks are removed from one of the data nodes,
they get re-synchronized from the remaining replicas (this was observed).

Appendix
SCRIPT -1
The script copies 1 GB of data to HDFS and back to the
local system indefinitely. The md5 checksums matched after
stopping the script. The script is executed from the namenode.
i=0
echo "[`date +%X`] :: start script" >>log
echo "size of movies directory is 1 GB" >>log
echo "[`date +%X`] :: creating a directory Movies" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -mkdir Movies
if [ $? -eq 0 ]
then
echo "[`date +%X`] :: mkdir sucessful" >>log
fi
while [ 1 = 1 ]
do
echo "------------------LOOP $i ------------------------" >>log
echo "[`date +%X`] :: coping data into the directory" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/Movies
/user/hadoop/Movies
if [ $? -eq 0 ]
then
echo "[`date +%X`] :: copy sucessful" >>log
fi
echo "[`date +%X`] :: removing copy of file " >>log
rm -rf /home/hadoop/Movies
if [ $? -eq 0 ]
then
echo "[`date +%X`] :: remove sucessful" >>log
fi
echo "[`date +%X`] :: copying back to local system" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/Movies
/home/hadoop/Movies
if [ $? -eq 0 ]
then
echo "[`date +%X`] :: move back sucessful" >>log
fi
echo "[`date +%X`] :: removing the file from hadoop" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/Movies
if [ $? -eq 0 ]
then
echo "[`date +%X`] :: move back sucessful" >>log
fi
i=`expr $i + 1`
done
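The md5 comparison mentioned above can be done along these lines (a sketch; it assumes
md5sum is available and that the script was stopped after a completed copy-back):

# Before starting the script: record checksums of the source data.
find /home/hadoop/Movies -type f -exec md5sum {} \; | sort > /tmp/md5.before

# After stopping the script: recompute on the copied-back data and compare.
find /home/hadoop/Movies -type f -exec md5sum {} \; | sort > /tmp/md5.after
diff /tmp/md5.before /tmp/md5.after && echo "checksums match"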

LOG
[03:48:52 PM] :: start script
size of movies directory is 1GB
[03:48:52 PM] :: creating a directory Movies
[03:48:54 PM] :: mkdir sucessful
------------------LOOP 0 ------------------------

[03:48:54 PM] :: coping data into the directory


[03:51:15 PM] :: copy sucessful
[03:51:15 PM] :: removing copy of file
[03:51:16 PM] :: remove sucessful
[03:51:16 PM] :: copying back to local system
[03:52:58 PM] :: move back sucessful
[03:52:58 PM] :: removing the file from hadoop
[03:53:01 PM] :: move back sucessful
------------------LOOP 1 ------------------------
[03:53:01 PM] :: coping data into the directory
[03:55:23 PM] :: copy sucessful
[03:55:23 PM] :: removing copy of file
[03:55:24 PM] :: remove sucessful
[03:55:24 PM] :: copying back to local system
[03:57:03 PM] :: move back sucessful
[03:57:03 PM] :: removing the file from hadoop
[03:57:06 PM] :: move back sucessful
------------------LOOP 2 ------------------------
[03:57:06 PM] :: coping data into the directory
[03:59:26 PM] :: copy successful
Copying 1 GB of data from the local file system into Hadoop over a 100 Mbps LAN took
about 140 seconds on average (the intervals between "coping data" and "copy sucessful"
above), which works out to roughly 58 Mbps (about 8000 Mb / 140 s). Copying 1 GB back
from Hadoop to the local file system took about 100 seconds on average (the "copying
back" to "move back" intervals), i.e. roughly 80 Mbps. Writes are presumably slower than
reads because each block that is written is also pipelined to the replica Datanodes.

Multithreaded script
Two threads are spawned, each of which copies data from the local system into Hadoop and
back from Hadoop to the local system indefinitely. Logs are captured to analyze the I/O
performance. The script was run for 48 hours, during which 850 loops were executed.
(
i=0
echo "thread1:[`date +%X`] :: start thread1" >>log
echo "thread1:size of thread1 directory 640MB" >>log
while [ 1 = 1 ]
do
echo "thread1:------------------LOOP $i ------------------------" >>log
echo "thread1:[`date +%X`] :: coping data into the directory" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread1
/user/hadoop/
if [ $? -eq 0 ]
then
echo "thread1:[`date +%X`] :: copy sucessful" >>log
fi
echo "thread1:[`date +%X`] :: removing copy of file " >>log
rm -rf /home/hadoop/thread1
if [ $? -eq 0 ]
then
echo "thread1:[`date +%X`] :: remove sucessful" >>log
fi
echo "thread1:[`date +%X`] :: copying back to local system" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread1 /home/hadoop/
if [ $? -eq 0 ]
then
echo "thread1:[`date +%X`] :: move back sucessful" >>log

fi
echo "thread1:[`date +%X`] :: removing the file from hadoop" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread1
if [ $? -eq 0 ]
then
echo "thread1:[`date +%X`] :: deletion sucessful" >>log
fi
i=`expr $i + 1`
done
)&
(
j=0
echo "thread2:[`date +%X`] :: start thread2" >>log
echo "thread2:size of thread2 directory 640MB" >>log
while [ 1 = 1 ]
do
echo "thread2:------------------LOOP $j ------------------------" >>log
echo "thread2:[`date +%X`] :: coping data into the directory" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread2
/user/hadoop/
if [ $? -eq 0 ]
then
echo "thread2:[`date +%X`] :: copy sucessful" >>log
fi
echo "thread2:[`date +%X`] :: removing copy of file " >>log
rm -rf /home/hadoop/thread2
if [ $? -eq 0 ]
then
echo "thread2:[`date +%X`] :: remove sucessful" >>log
fi
echo "thread2:[`date +%X`] :: copying back to local system" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread2 /home/hadoop/
if [ $? -eq 0 ]
then
echo "thread2:[`date +%X`] :: move back sucessful" >>log
fi
echo "thread2:[`date +%X`] :: removing the file from hadoop" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread2
if [ $? -eq 0 ]
then
echo "thread2:[`date +%X`] :: deletion sucessful" >>log
fi
j=`expr $j + 1`
done
)
wait

LOG
Messages from the two threads are interleaved below and are distinguished by their
thread1: and thread2: prefixes (they were colour-coded in the original document).
thread1:[05:15:00 PM] :: start thread1
thread2:[05:15:00 PM] :: start thread2
thread1:size of thread1 directory 640 MB
thread2:size of thread2 directory 640MB
thread1:------------------LOOP 0 ------------------------
thread2:------------------LOOP 0 ------------------------
thread1:[05:15:00 PM] :: coping data into the directory
thread2:[05:15:00 PM] :: coping data into the directory
thread1:[05:17:19 PM] :: copy sucessful
thread1:[05:17:19 PM] :: removing copy of file
thread1:[05:17:20 PM] :: remove sucessful
thread1:[05:17:20 PM] :: copying back to local system
thread2:[05:17:32 PM] :: copy sucessful
thread2:[05:17:32 PM] :: removing copy of file
thread2:[05:17:33 PM] :: remove sucessful
thread2:[05:17:33 PM] :: copying back to local system

[Margin annotations alongside the log in the original: 139 seconds (to write), 152 seconds (to write), 110 seconds (to read).]
thread2:[05:19:23 PM] :: move back sucessful


thread2:[05:19:23 PM] :: removing the file from hadoop
thread1:[05:19:26 PM] :: move back sucessful
thread1:[05:19:26 PM] :: removing the file from hadoop
thread2:[05:19:28 PM] :: deletion sucessful
thread1:[05:19:29 PM] :: deletion sucessful
thread1:------------------LOOP 1 ------------------------
thread1:[05:19:29 PM] :: coping data into the directory
thread2:------------------LOOP 1 ------------------------
thread2:[05:19:29 PM] :: coping data into the directory
thread1:[05:21:43 PM] :: copy sucessful
thread1:[05:21:43 PM] :: removing copy of file
thread1:[05:21:44 PM] :: remove sucessful
thread1:[05:21:44 PM] :: copying back to local system
thread2:[05:21:48 PM] :: copy sucessful
thread2:[05:21:48 PM] :: removing copy of file
thread2:[05:21:49 PM] :: remove sucessful
thread2:[05:21:49 PM] :: copying back to local system
thread1:[05:23:44 PM] :: move back sucessful
thread1:[05:23:44 PM] :: removing the file from hadoop
thread1:[05:23:49 PM] :: deletion sucessful
thread1:------------------LOOP 2 ------------------------
thread1:[05:23:49 PM] :: coping data into the directory
thread2:[05:23:49 PM] :: move back sucessful

[Margin annotations alongside the log in the original: 120 seconds (to read), 125 seconds (to read).]

NOTE: Copying started at the same time for thread1 and thread2 and also finished at about the
same time. That means about 1.28 GB of data was transferred in 152 seconds, an average
throughput of roughly 70 Mbps (about 10500 Mb / 152 s).

So, does throughput increase as more systems are added to the cluster (though there would
surely be a saturation point beyond which the speed comes down)?

MAP-REDUCE
MapReduce is a programming model for processing and generating large data sets. A map
function processes a key/value pair to generate a set of intermediate key/value pairs, and a
reduce function merges all intermediate values associated with the same intermediate key.
Many real-world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and executed on a
large cluster of commodity machines. The run-time system (Hadoop) takes care of the
details of partitioning the input data, scheduling the program's execution across a set of
machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large distributed system.
For implementation details of map-reduce in hadoop follow the link below
http://wiki.apache.org/lucene-hadoopdata/attachments/HadoopPresentations/attachments/HadoopMapReduceArch.pdf

Follow the link below for a clear understanding of MapReduce

http://209.85.163.132/papers/mapreduce-osdi04.pdf

Sample MapReduce implementation


The program will mimic the wordcount example, i.e. it reads text files and counts how
often words occur. The input is text files and the output is text files, each line of which
contains a word and the count of how often it occurred, separated by a tab. The "trick"
behind the following Python code is that we will use Hadoop Streaming to help us
passing data between our Map and Reduce code via STDIN (standard input) and
STDOUT (standard output). We will simply use Python's sys.stdin to read input data
and print our own output to sys.stdout.
Save the code below as mapper.py and reducer.py respectively (requires Python 2.4 or
greater) in /home/hadoop and give them executable permissions. One needs to start the
MapReduce daemons before submitting jobs: bin/start-mapred.sh
mapper.py
It will read data from STDIN (standard input), split it into words and output a list of lines
mapping words to their (intermediate) counts to STDOUT (standard output).
#!/usr/bin/env python
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = filter(lambda word: word, line.split())
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

reducer.py

It will read the results of mapper.py from STDIN (standard input), sum the
occurrences of each word into a final count, and output the results to STDOUT
(standard output).
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split()
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)

Test the code as follows


[hadoop@NameNode ~]$echo "foo foo quux labs foo bar quux" |
/home/hadoop/mapper.py | /home/hadoop/reducer.py
bar	1
foo	3
labs	1
quux	2

Implementation on hadoop
Copy some large plain text files (typically in GBs) into some local directory, say test, and
copy the data into HDFS:

[hadoop@NameNode ~]$hadoop dfs -copyFromLocal /path/to/test test

Run the mapreduce job


[hadoop@NameNode~]$bin/hadoop jar contrib/hadoop-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input
test/* -output mapreduce-output

The results can be viewed at http://localhost:50030/, or the output can be copied to the local
system with hadoop dfs -copyToLocal mapreduce-output <local directory>.
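The word counts themselves can also be read straight out of HDFS (streaming output files
are typically named part-00000, part-00001, and so on):

# Print the first output partition of the job.
hadoop dfs -cat mapreduce-output/part-00000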
Inverted Index: Example of a MapReduce job
Suppose there are three documents with some text content and we have to compute the
inverted index using MapReduce.
Doc-1: Hello World welcome to India
Doc-2: Hello India
Doc-3: World is welcoming India

Map Phase
Doc-1: <Hello,Doc-1> <World,Doc-1> <welcome,Doc-1> <to,Doc-1> <India,Doc-1>
Doc-2: <Hello,Doc-2> <India,Doc-2>
Doc-3: <World,Doc-3> <is,Doc-3> <welcoming,Doc-3> <India,Doc-3>

Reduce Phase
<Hello, [ Doc-1, Doc-2 ] >
<World, [ Doc-1, Doc-3 ] >
<welcome, [ Doc-1 ] >
<welcoming, [ Doc-3 ] >
<India, [ Doc-1, Doc-2, Doc-3 ] >
Words such as to, is, etc. are considered noise and should be filtered out appropriately.
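For completeness, a minimal Hadoop Streaming sketch of this inverted-index job using shell
scripts. The file names are hypothetical, and it assumes the streaming framework exposes the
current input file name to the mapper in the map_input_file environment variable (verify this
for your Hadoop version):

#!/bin/bash
# inverted_mapper.sh - emit one "word <TAB> document" pair per word.
doc=$(basename "$map_input_file")        # assumed streaming environment variable
tr -s '[:space:]' '\n' | while read -r word; do
  [ -n "$word" ] && printf '%s\t%s\n' "$word" "$doc"
done

#!/bin/bash
# inverted_reducer.sh - input arrives sorted by word; collect the distinct
# documents seen for each word into a comma-separated list.
awk -F'\t' '
  $1 != prev {
    if (prev != "") print prev "\t" substr(docs, 2)
    prev = $1; docs = ""; split("", seen)
  }
  !($2 in seen) { seen[$2] = 1; docs = docs "," $2 }
  END { if (prev != "") print prev "\t" substr(docs, 2) }
'

These two scripts would be submitted with the same hadoop jar contrib/hadoop-streaming.jar
command used for the word count example, passing -mapper inverted_mapper.sh and
-reducer inverted_reducer.sh.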
