Hadoop 2 Handbook
by Sujee Maniyam
Dedication
To the open source community
Acknowledgements
From Sujee
To the kind souls who helped me along the way
Copyright 2014 Elephant Scale LLC. All Rights Reserved.
Table of Contents
1. About this Book ............................................................................................................. 1
2. About the Author ............................................................................................................ 2
3. Hadoop Versions ............................................................................................................ 3
3.1. Hadoop version 1.0 ............................................................................................... 3
3.2. Hadoop version 2.0 ............................................................................................... 4
3.3. (Older) History of Hadoop ..................................................................................... 5
4. Hadoop 2 New and Noteworthy ........................................................................................ 6
4.1. HDFS Features in Hadoop 2 .................................................................................. 6
4.2. YARN (Yet Another Resource Negotiator) ............................................................... 6
5. HDFS Snapshots ............................................................................................................. 7
5.1. Case for Snapshots ............................................................................................... 7
5.2. Snapshotting Internals ........................................................................................... 7
5.3. Snapshotting Example ........................................................................................... 8
5.4. References .......................................................................................................... 8
6. HDFS NFS Access ......................................................................................................... 9
6.1. Accessing HDFS Files From Outside ....................................................................... 9
6.2. HDFS NFS Functionality ..................................................................................... 10
6.3. Setting up HDFS NFS ......................................................................................... 10
6.4. References ......................................................................................................... 10
7. HDFS Federation .......................................................................................................... 11
7.1. Case for HDFS Federation ................................................................................... 11
7.2. Benefits of Federation ......................................................................................... 12
7.3. Internals of Federation ......................................................................................... 12
7.4. References ......................................................................................................... 13
8. Hadoop 2 vs. Hadoop 1 ................................................................................................. 14
8.1. Daemons ........................................................................................................... 14
8.2. Web Interface Port Numbers ................................................................................. 14
8.3. Directory Layout ................................................................................................ 15
8.4. Start / Stop Scripts .............................................................................................. 15
8.5. Hadoop Command Split ....................................................................................... 16
8.6. Configuration Files ............................................................................................. 17
9. Quick Start .................................................................................................................. 18
9.1. Getting Hadoop 2 Running ................................................................................... 18
10. Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File) ......................... 19
10.1. High Level Steps .............................................................................................. 19
10.2. Installing Java / JDK ......................................................................................... 19
10.3. Setting up password-less SSH ............................................................................. 19
10.4. Get Hadoop 2 tar ball ........................................................................................ 21
10.5. Hadoop Commands ........................................................................................... 21
10.6. Configuring Hadoop 2 HDFS .............................................................................. 22
10.7. Formatting HDFS Storage .................................................................................. 23
10.8. Starting HDFS .................................................................................................. 23
10.9. Verifying that HDFS is running ........................................................................... 24
10.10. Configuring YARN / MapReduce ....................................................................... 25
10.11. Starting YARN ............................................................................................... 26
10.12. History Server ................................................................................................. 26
10.13. Verifying YARN daemons ................................................................................ 26
10.14. Final Test -- Running a MapReduce job .............................................................. 27
10.15. Shutting down the cluster .................................................................................. 28
10.16. References ...................................................................................................... 29
11. HDFS Benchmarking in Hadoop 2 ................................................................................. 30
List of Figures
3.1. Hadoop Versions .......................................................................................................... 3
5.1. Snapshots Explained ..................................................................................................... 7
6.1. HDFS Is An Isolated File System .................................................................................... 9
6.2. HDFS NFS ................................................................................................................. 9
7.1. HDFS Architecture ..................................................................................................... 11
7.2. HDFS Federation (With Multiple NameNodes) ................................................................ 11
7.3. HDFS No Federation ................................................................................................... 12
7.4. HDFS Federation - Namespace Pools ............................................................................. 13
10.1. Namenode UI ........................................................................................................... 25
10.2. YARN UI ................................................................................................................ 27
10.3. YARN Running Jobs ................................................................................................. 28
List of Tables
3.1. Hadoop Versions .......................................................................................................... 3
8.1. Hadoop Daemons ....................................................................................................... 14
8.2. Hadoop Web Interface Port Numbers ............................................................................. 14
8.3. Hadoop Directory Layout ............................................................................................. 15
8.4. Start / Stop scripts for tar package version ....................................................................... 16
8.5. Start / Stop scripts for rpm package version ..................................................................... 16
8.6. Hadoop Command Split ............................................................................................... 16
8.7. Hadoop Configuration Files .......................................................................................... 17
11.1. Terasort ................................................................................................................... 31
11.2. TeraGen arguments ................................................................................................... 31
Table 3.1. Hadoop Versions

Version   Release Date
0.21      Aug 2010
0.23      Feb 2012
1.0       Dec 2011
2.2.0     Oct 2013
A file system can change in more ways than just files being deleted. For example, file contents, permissions, and so on can change as well. Snapshots capture this information and enable us to restore files at a later point in time.

We can snapshot the entire file system or specific directories. A directory has to be enabled for snapshotting by the HDFS administrator; this is called making a directory snapshottable. After that, normal users can snapshot their directories on their own, without administrator intervention.

A directory with snapshots cannot be deleted or renamed until all of its snapshots are deleted.
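The workflow described above can be sketched with the standard snapshot commands. The directory path and snapshot name here are hypothetical; these commands require a running HDFS cluster.

```shell
# Admin: mark the directory as snapshottable (run as the HDFS administrator)
hdfs dfsadmin -allowSnapshot /user/sujee/important-data

# User: create a named snapshot of the directory
hdfs dfs -createSnapshot /user/sujee/important-data snap-2014-01-01

# Snapshots appear under the read-only .snapshot directory
hdfs dfs -ls /user/sujee/important-data/.snapshot

# Restore a file by copying it back out of the snapshot
hdfs dfs -cp /user/sujee/important-data/.snapshot/snap-2014-01-01/file.txt \
    /user/sujee/important-data/file.txt
```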
HDFS Snapshots
drwxr-xr-x   - sujee supergroup
-rw-r--r--   1 sujee supergroup
5.4. References
Hortonworks guide [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-snapshots.html]
Apache Hadoop documentation [http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html]
We use the HDFS command line utility (hdfs) to interact with HDFS. However, files residing in HDFS are not accessible to traditional Linux programs. Let's say we have a Linux executable 'my_awesome_program' that analyzes videos. If our video files are in HDFS, they won't be accessible to this program, so the following command would not work.

$ my_awesome_program < input/file/in/HDFS > output/file/in/HDFS

It would be nice to have files in HDFS accessible to other programs, so we don't need to re-write these programs in Java just to access files in HDFS. The solution was to make HDFS available over the Network File System (NFS) [http://en.wikipedia.org/wiki/Network_File_System]. NFS has been around for a while, and a lot of applications and operating systems know how to talk to NFS file systems.
HDFS NFS Gateway allows HDFS to be mounted as a NFS file system.
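As a sketch of what mounting looks like (the mount point is hypothetical, and a configured NFS gateway must already be running on the host being mounted):

```shell
# Create a local mount point (hypothetical path)
sudo mkdir -p /hdfs_mount

# Mount HDFS through the NFS gateway running on localhost.
# NFSv3 over TCP is what the HDFS NFS gateway speaks.
sudo mount -t nfs -o vers=3,proto=tcp,nolock localhost:/ /hdfs_mount

# Now ordinary Linux programs can read HDFS files
ls /hdfs_mount
```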
6.4. References
Hortonworks guide to setting up HDFS NFS [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-nfs.html]
Blog post from Hortonworks explaining NFS feature [http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/]
HDFS-4750 [https://issues.apache.org/jira/browse/HDFS-4750] - Original JIRA to implement NFS support in HDFS
HDFS Federation
7.4. References
Apache Hadoop documentation [https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/Federation.html]
8.1. Daemons

HDFS daemons are pretty much the same in Hadoop 1 and Hadoop 2. The biggest difference is that Hadoop 2 has YARN instead of MapReduce v1 for processing.
Table 8.1. Hadoop Daemons

Storage (HDFS) -- same in both versions:
  NameNode (master) [one per cluster]
  Secondary NameNode
  DataNode (worker) [many per cluster, one per node]

Processing:
  Hadoop 1 -- MapReduce v1:
    JobTracker (master) [one per cluster]
    TaskTracker (worker) [many per cluster, one per node]
  Hadoop 2 -- YARN (MRv2):
    Resource Manager [one per cluster]
    Node Manager [many per cluster, one per node]
    Application Master [many per cluster]
Table 8.2. Hadoop Web Interface Port Numbers

                            Hadoop 1   Hadoop 2
NameNode UI                 50070      50070
JobTracker UI               50030      --
ResourceManager UI (YARN)   --         8088
MapReduce History Server    --         19888
Table 8.3. Hadoop Directory Layout

Configuration Files:
  Hadoop 1: HADOOP_INSTALL/conf directory
  Hadoop 2: HADOOP_INSTALL/etc/hadoop directory

Jar files:
  Hadoop 1: HADOOP_INSTALL/share/hadoop directory
  Hadoop 2: jar files live in component-specific sub-directories (common, hdfs, mapreduce, yarn)
Table 8.4. Start / Stop scripts for tar package version

to start HDFS:
  Hadoop 1:
    $ HADOOP_INSTALL/bin/start-dfs.sh
    $ HADOOP_INSTALL/bin/hadoop-daemon.sh start namenode
  Hadoop 2:
    $ HADOOP_INSTALL/sbin/start-dfs.sh
    $ HADOOP_INSTALL/sbin/hadoop-daemon.sh start namenode

to start MapReduce / YARN:
  Hadoop 1:
    $ HADOOP_INSTALL/bin/start-mapred.sh
  Hadoop 2:
    $ HADOOP_INSTALL/sbin/start-yarn.sh

to start everything:
  Hadoop 1:
    $ HADOOP_INSTALL/bin/start-all.sh
  Hadoop 2:
    $ HADOOP_INSTALL/sbin/start-all.sh
RPM Version

Table 8.5. Start / Stop scripts for rpm package version

to start HDFS:
  Hadoop 1:
    On NameNode:
      sudo service hadoop-0.20-namenode start
    On DataNode:
      sudo service hadoop-0.20-datanode start
  Hadoop 2:
    On NameNode:
      sudo service hadoop-hdfs-namenode start
    On DataNode:
      sudo service hadoop-hdfs-datanode start

to start MapReduce / YARN:
  Hadoop 1:
    on JobTracker:
      sudo service hadoop-0.20-jobtracker start
    on TaskTracker:
      sudo service hadoop-0.20-tasktracker start
  Hadoop 2:
    on ResourceManager node:
      sudo service hadoop-yarn-resourcemanager start
      sudo service hadoop-mapreduce-historyserver start
      sudo service hadoop-yarn-proxyserver start
    on "worker nodes":
      sudo service hadoop-yarn-nodemanager start
Table 8.6. Hadoop Command Split

HDFS commands:
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop dfs -ls
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs dfs -ls

NameNode operations:
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop namenode -format
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs namenode -format

Admin operations:
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop dfsadmin -refreshNodes
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs dfsadmin -refreshNodes

MapReduce commands:
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop job ...
  Hadoop 2: $ HADOOP_INSTALL/bin/mapred job ...
Table 8.7. Hadoop Configuration Files

            Hadoop 1                             Hadoop 2
Core        HADOOP_INSTALL/conf/core-site.xml    HADOOP_INSTALL/etc/hadoop/core-site.xml
HDFS        HADOOP_INSTALL/conf/hdfs-site.xml    HADOOP_INSTALL/etc/hadoop/hdfs-site.xml
MapReduce   HADOOP_INSTALL/conf/mapred-site.xml  HADOOP_INSTALL/etc/hadoop/mapred-site.xml
YARN        --                                   HADOOP_INSTALL/etc/hadoop/yarn-site.xml
$ ls -l $HOME/.ssh/
total 304
-rw-r--r--  1 sujee  staff  ...
-rw-------  1 sujee  staff  ...
-rw-r--r--  1 sujee  staff  ...
If you see the files id_rsa or id_dsa, you already have SSH keys and don't have to generate them again.
Let's generate SSH keys:
$ ssh-keygen
To accept default values, press 'Enter' a couple of times. You will see an output like this.
$ ssh-keygen
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
a9:63:a1:4e:9e:12:f7:08:a7:ab:bf:3f:a7:f8:6a:0c sujee@melbourne-2.local
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|        .        |
|      . S        |
|E o o. o         |
| o *oo+          |
| ==ooo.          |
|.+*BB+           |
+-----------------+
Now let's add the key to 'authorized_keys' file so we can login without entering password
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
Try logging into the machine:
$ ssh localhost
The very first time, there may be an 'Are you sure you want to continue connecting?' prompt; just answer yes. After that you can ssh into localhost without entering any password.
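Formatting the HDFS storage (section 10.7) is a one-time step run before starting HDFS for the first time. A minimal sketch, assuming the tar-based install described above with the bin/ directory on the PATH:

```shell
# Initialize the NameNode's storage directory.
# WARNING: this erases any existing HDFS metadata.
hdfs namenode -format
```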
What we are looking for is the message "Storage directory has been successfully formatted." If we see it, all looks good.
$ cat /hadoop_install_location/etc/hadoop/slaves
localhost
23
We are going to use the sbin/start-dfs.sh script. This command will do the following:
Start the NameNode on the current node
Start the Secondary NameNode (if necessary) on the machines specified by the masters file (etc/hadoop/masters)
Start DataNodes on the nodes specified by the slaves file (etc/hadoop/slaves)
For this to work, password-less SSH has to work (this is why we set up SSH before).
$ start-dfs.sh
or using full path
$ /hadoop_install_location/sbin/start-dfs.sh
$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-sujee-nam
localhost: starting datanode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-sujee-dat
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-su
Verifying on command-line

We will use the jps command to see running Java processes. The jps command is part of the JDK.
$ jps
30438 Jps
30375 SecondaryNameNode
30226 NameNode
30291 DataNode
24
yarn-site.xml (etc/hadoop)
yarn-site.xml (on GitHub) [https://gist.github.com/sujee/8551096]
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
mapred-site.xml (etc/hadoop)
mapred-site.xml (on GitHub) [https://gist.github.com/sujee/8551273]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
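With the configuration in place, starting YARN and the history server (sections 10.11 and 10.12) can be sketched as follows, assuming the same tar install with the sbin/ directory on the PATH:

```shell
# Start the ResourceManager and NodeManager
start-yarn.sh

# Start the MapReduce Job History Server (serves its UI on port 19888)
mr-jobhistory-daemon.sh start historyserver
```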
$ jps
NameNode
DataNode
JobHistoryServer
Jps
YARN Web UI
Access the YARN Web UI at http://localhost:8088 in a browser. Verify that Active Nodes is 1.
27
167 name
32 dfs
21 xml
Let's check the YARN UI. See the screenshot below. We see that two MapReduce jobs (grep-search and grep-sort) ran and finished successfully.
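A run like this can be kicked off with the bundled examples jar. A sketch, where the input/output directory names and the regular expression are assumptions to adjust for your environment:

```shell
# Copy some XML config files into HDFS as job input (hypothetical paths)
hdfs dfs -mkdir -p input
hdfs dfs -put /hadoop_install_location/etc/hadoop/*.xml input

# Run the 'grep' example: it launches a search job followed by a sort job
hadoop jar /hadoop_install_location/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
    grep input output 'dfs[a-z.]+'

# Inspect the results
hdfs dfs -cat output/*
```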
10.16. References

Cloudera 4.x YARN guide [http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_11_4.html]
11.1. TestDFSIO
This benchmark exercises HDFS and measures IO throughput. TestDFSIO uses MapReduce to read / write
files in parallel.
To run this test we need both DFS and YARN daemons running.
$ hadoop org.apache.hadoop.fs.TestDFSIO -write -nrFiles 5 -fileSize 1GB
The parameters are as follows:
-write : write files
-nrFiles : number of files
-fileSize : size of each file; you can use suffixes like MB or GB
This runs a MapReduce job. The output might look like this:
14/02/14 00:02:57 INFO fs.TestDFSIO: TestDFSIO.1.7
14/02/14 00:02:57 INFO fs.TestDFSIO: nrFiles = 5
14/02/14 00:02:57 INFO fs.TestDFSIO: nrBytes (MB) = 1024.0
14/02/14 00:02:57 INFO fs.TestDFSIO: bufferSize = 1000000
14/02/14 00:02:57 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
...
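After the write test, a read test and a cleanup can be run the same way. A sketch using the standard TestDFSIO flags, invoked in the same style as the write test above:

```shell
# Read back the files written by the -write run and measure read throughput
hadoop org.apache.hadoop.fs.TestDFSIO -read -nrFiles 5 -fileSize 1GB

# Remove the /benchmarks/TestDFSIO data when done
hadoop org.apache.hadoop.fs.TestDFSIO -clean
```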
11.2. Terasort

Terasort is the ultimate IO benchmark for Hadoop. The test is comprised of three components: TeraGen (generates the input data), TeraSort (sorts it), and TeraValidate (validates the sorted output).
Table 11.1. Terasort

Data Size   Number of Records
1 GB        10 million records (10,000,000)
10 GB       100 million records (100,000,000)
100 GB      1 billion records (1,000,000,000)
1 TB        10 billion records (10,000,000,000)
Table 11.2. TeraGen arguments

Argument                      Explanation
-D mapreduce.job.maps=20      20 mappers. Each mapper will create a file, so 10 GB divided by 20 maps will yield a 500 MB file each.
-D dfs.blocksize=536870912    HDFS block size of 512 MB (536870912 bytes)
100000000                     number of records. See the above table on calculation specifics.
input_dir                     directory where the generated data is written
Let's generate 10 GB of data. How we invoke the TeraGen command will depend on our Hadoop installation. See below; you may need to adjust the command to match your environment.
option 1) for tar based installs...

$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
    teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

option 2) RPM / Package based Hadoop Distribution

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
    teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir
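Generation is followed by the sort and validation phases, which the same examples jar provides via the terasort and teravalidate programs. A sketch for the tar-based install; the sorted_dir and report_dir names are assumptions:

```shell
# Sort the data generated into input_dir
hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
    terasort input_dir sorted_dir

# Validate that the output is globally sorted
hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
    teravalidate sorted_dir report_dir
```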
11.3. References

Michael Noll has an excellent tutorial on TestDFSIO and Terasort [http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/] (Hadoop v1 version)