Hadoop is a Java framework (software platform) for storing and processing vast
amounts of data. It can be set up on commonly available computers.
Use Case
It can be used when the following requirements arise:
Store terabytes of data: HDFS uses commonly available computers and storage
devices and pools the storage space of all the systems into one large logical volume.
Streaming access to data: HDFS is designed more for batch processing than for
interactive use. The emphasis is on high throughput of data access rather than
low latency.
Large data sets: File sizes are typically in gigabytes. HDFS is tuned to support large
files. It should provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster.
WORM requirement: HDFS applications need a write-once-read-many access
model for files. A file once created, written, and closed need not be changed. This
assumption simplifies data coherency issues and enables high throughput data
access.
High availability: HDFS stores multiple instances of data on various systems in
the cluster. This ensures availability of data even if some systems go down.
Architecture
Hadoop is based on a Master-Slave architecture. An HDFS cluster consists of a single
Namenode (master server) that manages the file system namespace and regulates access
to files by clients. In addition, there are a number of Datanodes (Slaves), usually one per
node in the cluster, which manage storage attached to the nodes that they run on. HDFS
exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of Datanodes. The
Namenode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to Datanodes.
The Datanodes are responsible for serving read and write requests from the file system's
clients. The Datanodes also perform block creation, deletion, and replication upon
instruction from the Namenode. The Namenode is the arbitrator and repository for all
HDFS metadata. The system is designed in such a way that user data never flows through
the Namenode.
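The division of labor described above can be sketched with a toy model. This is illustrative only; all names here are hypothetical and the real Namenode data structures are far more involved:

```python
# Toy model of the Namenode's two core mappings (illustrative only):
# the file system namespace (path -> block ids) and the block map
# (block id -> datanodes holding a replica of that block).
namespace = {"/user/hadoop/movies.avi": ["blk_1", "blk_2"]}
block_map = {"blk_1": ["DataNode-1", "DataNode-2"],
             "blk_2": ["DataNode-2", "DataNode-3"]}

def locate(path):
    """Return, per block, the datanodes a client should contact.

    User data never flows through the Namenode: the client uses
    this metadata to read the blocks directly from the datanodes.
    """
    return [(blk, block_map[blk]) for blk in namespace[path]]
```

A client opening /user/hadoop/movies.avi would receive the block list and then stream each block straight from one of its datanodes.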
[Figure: HDFS architecture showing a Client, the NameNode, and DataNodes]
Key Features
File System Namespace: HDFS supports hierarchical file organization. It
supports operations like create, remove, move and rename on files as well as
directories. It does not implement user permissions or quotas.
Replication: HDFS Stores files as series of blocks. Blocks are replicated for fault
tolerance. The replication factor and block size are configurable. Files in HDFS
are write-once and have strictly one writer at any time. The replication placement
is very critical for performance. In large clusters, nodes are spread across
racks, and these racks are connected via switches. It is observed that traffic
between nodes within a rack is much higher than traffic across racks.
Replicating data across racks improves fault tolerance, while rack-aware
placement helps conserve network bandwidth. To minimize global bandwidth
consumption and read latency, HDFS tries to satisfy a read request from a
replica that is closest to the reader. If there exists a replica on the same
rack as the reader node, then that replica is preferred to satisfy the read request.
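The same-rack preference for reads can be sketched as a small selection function (a simplification; real HDFS uses a full network topology, not a flat rack label):

```python
def closest_replica(replicas, reader_rack):
    """Pick the replica a reader should fetch from.

    `replicas` is a list of (datanode, rack) pairs. A replica on the
    reader's own rack is preferred, to avoid cross-rack traffic;
    otherwise any replica (here simply the first) is used.
    """
    for node, rack in replicas:
        if rack == reader_rack:
            return node
    return replicas[0][0]
```

For example, a reader on rackB with replicas on rackA and rackB would be directed to the rackB copy.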
File System Metadata: The Namenode uses an EditLog to record every change to file
system metadata. The entire file system namespace is stored in a file called
FsImage.
Robustness: Network or disk failure and data integrity.
Datanodes send heartbeat messages to the Namenode, and when the Namenode stops
receiving them, the Datanode is marked dead. This may cause the replication
factor of some blocks to fall. The Namenode constantly monitors the replication
count for each block; if it falls, the Namenode re-replicates those blocks. This
may happen because a replica is corrupted, a Datanode is dead, or the
replication factor for a particular file has been increased. It also rebalances
data in case free space on one node falls below a threshold value. HDFS stores
a checksum for each block and verifies it on retrieval.
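The heartbeat-driven monitoring loop can be sketched as follows. This is a minimal model with made-up names, not the actual Namenode logic:

```python
def under_replicated(block_replicas, live_nodes, target_factor):
    """Find blocks whose live replica count fell below the target.

    `block_replicas` maps block id -> list of datanodes holding it;
    `live_nodes` is the set of datanodes still sending heartbeats.
    The Namenode would schedule re-replication for each block
    returned here.
    """
    needs_replication = []
    for blk, nodes in block_replicas.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < target_factor:
            needs_replication.append(blk)
    return needs_replication
```

With replication factor 2, the death of a single datanode immediately makes every block it held a re-replication candidate.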
NameNode failure: The FsImage and the EditLog are central data structures of
HDFS. A corruption of these files can cause the HDFS instance to be nonfunctional. For this reason, the Namenode can be configured to support
maintaining multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get updated
synchronously. This synchronous updating of multiple copies of the FsImage and
EditLog may degrade the rate of namespace transactions per second that a
Namenode can support. However, this degradation is acceptable because even
though HDFS applications are very data intensive in nature, they are not
metadata intensive. When a Namenode restarts, it selects the latest consistent
FsImage and EditLog to use. The Namenode machine is a single point of failure
for an HDFS cluster. If the Namenode machine fails, manual intervention is
necessary. Currently, automatic restart and failover of the Namenode software to
another machine is not supported.
Data organization: The default block size in HDFS is 64 MB, and it supports write-once-read-many semantics. The HDFS client does local caching. Suppose the HDFS file has
a replication factor of three. When the local file accumulates a full block of user
data, the client retrieves a list of Datanodes from the Namenode. This list contains
the Datanodes that will host a replica of that block. The client then flushes the
data block to the first Datanode. The first Datanode starts receiving the data in
small portions (4 KB), writes each portion to its local repository and transfers that
portion to the second Datanode in the list. The second Datanode, in turn starts
receiving each portion of the data block, writes that portion to its repository and
then flushes that portion to the third Datanode. Finally, the third Datanode writes
the data to its local repository. Thus, a Datanode can be receiving data from the
previous one in the pipeline and at the same time forwarding data to the next one
in the pipeline. Thus, the data is pipelined from one Datanode to the next. When
data is deleted, it is not removed immediately; rather, it remains in /trash,
from where it can be either removed or restored. How long data stays in trash is
configurable; the default value is 6 hours.
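The replication pipeline described above can be sketched as a toy function (names and the 4-byte portion size are illustrative; real HDFS streams 4 KB portions over the network):

```python
def pipeline_write(block, datanodes, portion=4):
    """Simulate the HDFS write pipeline.

    The block is split into small portions; each portion is stored
    by the first datanode and forwarded down the chain, so every
    datanode in the pipeline ends up with a full replica while
    receiving and forwarding concurrently.
    """
    stores = {dn: bytearray() for dn in datanodes}
    for off in range(0, len(block), portion):
        chunk = block[off:off + portion]
        for dn in datanodes:      # the portion flows down the pipeline
            stores[dn].extend(chunk)
    return {dn: bytes(data) for dn, data in stores.items()}
```

After the write completes, every datanode in the pipeline holds an identical copy of the block.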
Setting up Hadoop
We used three Linux boxes with CentOS to set up a hadoop cluster. Details are as follows:
Property        NameNode        DataNode-1      DataNode-2
Hostname        NameNode        DataNode-1      DataNode-2
IP              10.245.0.121    10.245.0.131    10.245.0.57
Storage Space   -               35GB            9GB
Role            name node       data node       data node
[root@NameNode]#passwd hadoop
Step-3:
Installed JDK (jdk-1_5_0_14-linux-i586.rpm) and hadoop (hadoop-0.14.4.tar.gz) as user
hadoop in /home/hadoop.
Step-4:
Setup the Linux systems in such a way that any system can ssh to any other system
without password. Copy public keys of every system in cluster (including itself) into
authorized_keys file.
Step-5:
Set JAVA_HOME variable in <hadoop install dir>/conf/hadoop-env.sh to correct path.
In our case it was export JAVA_HOME=/usr/java/jdk1.5.0_14/
Step-6: (on NameNode)
Add the data node hostnames to conf/slaves:
DataNode-1
DataNode-2
The conf/slaves file on the master is used by bin/start-dfs.sh and
bin/stop-dfs.sh for starting and stopping the data nodes.
The HDFS name table is stored on the namenode's (here: master) local filesystem in the
directory specified by dfs.name.dir. The name table is used by the namenode to store
tracking and coordination information for the datanodes.
Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the
namenode to run on. This will bring up HDFS with the namenode running on the
machine you ran the previous command on, and datanodes on the machines listed in the
conf/slaves file.
These web interfaces provide concise information about what's happening in your
Hadoop cluster. You may have to update the hosts file on your Windows system to
resolve the names to their IPs.
From the NameNode you can do management as well as file operations via DFSShell.
The command <hadoop installation>/bin/hadoop dfs -help lists the operations
permitted by DFS. The command <hadoop installation>/bin/hadoop dfsadmin -help
lists the administration operations supported.
Step-11:
To add a new data node on the fly, just follow the above steps on the new node and
execute the following command on the new node to join the cluster.
bin/hadoop-daemon.sh --config <config_path> start datanode
Step-12:
To set up a client machine, install hadoop on it and set the JAVA_HOME variable
in hadoop-env.sh. To copy data to HDFS from the client, use the -fs switch of
dfs with the URI of the namenode:
bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -mkdir remotecopy
bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -copyFromLocal /home/devendra/jdk-1_5_0_14-linux-i586-rpm.bin remotecopy
Observations
1. I/O handling.
See the appendix for some test scripts and the log analysis.
2. Fault Tolerance
Observations on a two data-node cluster with replication factor 2:
The data was accessible even if one of the data nodes was down.
Observations on a three data-node cluster with replication factor 2:
The data was accessible when one of the data nodes was down.
Some data was accessible when two nodes were down.
Overflow condition: With a two data-node setup where the nodes had 20 GB and
1 GB of free space, we tried to copy 10 GB of data. The copy operation completed
without any errors, but warnings in the log messages indicated that only one
copy was made. (If we connect one more datanode on the fly, presumably the data
will be replicated onto the new system; this needs to be verified.)
Accidental data loss: Even if we remove data blocks from one of the data nodes,
they will be re-synchronized (this was observed).
Appendix
SCRIPT -1
The script copies 1 GB of data to HDFS and back to the local system
indefinitely. The md5 checksum matches after stopping the script. The script is
executed from the namenode.
i=0
echo "[`date +%X`] :: start script" >>log
echo "size of movies directory is 1 GB" >>log
echo "[`date +%X`] :: creating a directory Movies" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -mkdir Movies
if [ $? -eq 0 ]
then
    echo "[`date +%X`] :: mkdir successful" >>log
fi
while [ 1 = 1 ]
do
    echo "------------------LOOP $i ------------------------" >>log
    echo "[`date +%X`] :: copying data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/Movies /user/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: copy successful" >>log
    fi
    echo "[`date +%X`] :: removing local copy of file" >>log
    rm -rf /home/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: remove successful" >>log
    fi
    echo "[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/Movies /home/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: copy back successful" >>log
    fi
    echo "[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: deletion successful" >>log
    fi
    i=`expr $i + 1`
done
LOG
[03:48:52 PM] :: start script
size of movies directory is 1 GB
[03:48:52 PM] :: creating a directory Movies
[03:48:54 PM] :: mkdir successful
------------------LOOP 0 ------------------------
Multithreaded script
Two threads are spawned, each of which copies data from the local system into
hadoop and back from hadoop to the local system indefinitely. Logs are captured
to analyze the I/O performance. The script was run for 48 hours and 850 loops
were executed.
(
i=0
echo "thread1:[`date +%X`] :: start thread1" >>log
echo "thread1:size of thread1 directory 640MB" >>log
while [ 1 = 1 ]
do
    echo "thread1:------------------LOOP $i ------------------------" >>log
    echo "thread1:[`date +%X`] :: copying data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread1 /user/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: copy successful" >>log
    fi
    echo "thread1:[`date +%X`] :: removing local copy of file" >>log
    rm -rf /home/hadoop/thread1
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: remove successful" >>log
    fi
    echo "thread1:[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread1 /home/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: copy back successful" >>log
    fi
    echo "thread1:[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread1
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: deletion successful" >>log
    fi
    i=`expr $i + 1`
done
)&
(
j=0
echo "thread2:[`date +%X`] :: start thread2" >>log
echo "thread2:size of thread2 directory 640MB" >>log
while [ 1 = 1 ]
do
    echo "thread2:------------------LOOP $j ------------------------" >>log
    echo "thread2:[`date +%X`] :: copying data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread2 /user/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: copy successful" >>log
    fi
    echo "thread2:[`date +%X`] :: removing local copy of file" >>log
    rm -rf /home/hadoop/thread2
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: remove successful" >>log
    fi
    echo "thread2:[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread2 /home/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: copy back successful" >>log
    fi
    echo "thread2:[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread2
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: deletion successful" >>log
    fi
    j=`expr $j + 1`
done
)&
wait
LOG
Messages from thread1 and thread2 are interleaved; each line is prefixed with its thread name.
thread1:[05:15:00 PM] :: start thread1
thread2:[05:15:00 PM] :: start thread2
thread1:size of thread1 directory 640MB
thread2:size of thread2 directory 640MB
thread1:------------------LOOP 0 ------------------------
thread2:------------------LOOP 0 ------------------------
thread1:[05:15:00 PM] :: copying data into the directory
thread2:[05:15:00 PM] :: copying data into the directory
thread1:[05:17:19 PM] :: copy successful
thread1:[05:17:19 PM] :: removing local copy of file
thread1:[05:17:20 PM] :: remove successful
thread1:[05:17:20 PM] :: copying back to local system
thread2:[05:17:32 PM] :: copy successful
thread2:[05:17:32 PM] :: removing local copy of file
thread2:[05:17:33 PM] :: remove successful
thread2:[05:17:33 PM] :: copying back to local system
NOTE: Copying started at the same time for thread1 and thread2 and also finished at about the same
time. That means 1.28 GB of data was transferred in 152 seconds, an average throughput of roughly 70 Mbps.
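The arithmetic behind the quoted throughput can be checked with a short calculation (decimal units assumed; reading GB as GiB gives about 69 Mbps instead, either way consistent with the roughly 70 Mbps quoted):

```python
# From the log: both threads together moved 2 x 640 MB = 1.28 GB
# between 05:15:00 and 05:17:32, i.e. in 152 seconds.
data_megabits = 1.28 * 1000 * 8     # 1.28 GB expressed in megabits
seconds = 152
throughput_mbps = data_megabits / seconds   # about 67 Mbps
```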
So, is it the case that the more systems in the cluster, the higher the speed
(though there would surely be a saturation point beyond which the speed comes down)?
MAP-REDUCE
MapReduce is a programming model for processing large data sets. A map function
processes a key/value pair to generate a set of intermediate key/value pairs, and
a reduce function merges all intermediate values associated with the same
intermediate key. Many real-world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and executed on a
large cluster of commodity machines. The run-time system (Hadoop) takes care of the
details of partitioning the input data, scheduling the program's execution across a set of
machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large distributed system.
For implementation details of map-reduce in hadoop, follow the links below:
http://wiki.apache.org/lucene-hadoopdata/attachments/HadoopPresentations/attachments/HadoopMapReduceArch.pdf
http://209.85.163.132/papers/mapreduce-osdi04.pdf
reducer.py
It reads the results of mapper.py from STDIN (standard input), sums the
occurrences of each word to a final count, and outputs the results to STDOUT
(standard output).
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split()
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
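The mapper.py that feeds this reducer is referenced but not reproduced in this document. A minimal sketch consistent with the reducer's expected input (one word, a tab, and a count per line) might look like the following; the function names here are illustrative, not from the original:

```python
#!/usr/bin/env python
import sys

def map_line(line):
    # emit a (word, 1) pair for every whitespace-separated token
    return [(word, 1) for word in line.strip().split()]

def run(stdin=sys.stdin, stdout=sys.stdout):
    # under Hadoop Streaming, input arrives on STDIN one line at a
    # time, and each pair is written tab-separated to STDOUT
    for line in stdin:
        for word, count in map_line(line):
            stdout.write("%s\t%d\n" % (word, count))
```

A real streaming mapper would invoke run() under an `if __name__ == "__main__":` guard so Hadoop can pipe file contents through it.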
Implementation on hadoop
Copy some large plain text files ( typically in GBs) into some local directory say
text.Copy the data into HDFS
/path/to/test test
The results can be viewed at http://localhost:50030/ or one can copy the output to local
system hadoop dfs copyToLocal mapreduce-output
Inverted Index: Example of a mapreduce job
Suppose there are three documents with some text content and we have to compute the
inverted index using map-reduce.
Doc-1
Hello
World
welcome
to
India
Doc-2
Hello
India
Doc-3
World
is
welcoming
India
Map Phase
<Hello,Doc-1>
<World,Doc-1>
<welcome,Doc-1>
<to,Doc-1>
<India,Doc-1>
<Hello,Doc-2>
<India,Doc-2>
<World,Doc-3>
<is,Doc-3>
<welcoming,Doc-3>
<India,Doc-3>
Reduce Phase
<Hello, [Doc-1, Doc-2]>
<World, [Doc-1, Doc-3]>
<welcome, [Doc-1]>
<welcoming, [Doc-3]>
<India, [Doc-1, Doc-2, Doc-3]>
Words such as "to", "is" etc. are considered noise and should be filtered appropriately.
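The map and reduce phases above can be sketched as plain Python functions. This is a local simulation of the idea, not an actual Hadoop job, and the names are illustrative:

```python
def map_phase(docs):
    """Emit (word, doc_id) pairs, as in the Map Phase listing above.

    `docs` maps a document id to its text content.
    """
    pairs = []
    for doc_id, text in docs.items():
        for word in text.split():
            pairs.append((word, doc_id))
    return pairs

def reduce_phase(pairs, noise=("to", "is")):
    """Group document ids by word, filtering out noise words."""
    index = {}
    for word, doc_id in pairs:
        if word in noise:
            continue
        index.setdefault(word, [])
        if doc_id not in index[word]:
            index[word].append(doc_id)
    return index
```

Running these over the three documents from the example reproduces the reduce-phase output, including an entry for "welcoming" from Doc-3.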