
Hadoop 2 Handbook

Sujee Maniyam <sujee@ElephantScale.com>

Hadoop 2 Handbook
by Sujee Maniyam

Dedication
To the open source community

Acknowledgements
From Sujee
To the kind souls who helped me along the way
Copyright 2014 Elephant Scale LLC. All Rights Reserved.


Table of Contents
1. About this Book
2. About the Author
3. Hadoop Versions
   3.1. Hadoop version 1.0
   3.2. Hadoop version 2.0
   3.3. (Older) History of Hadoop
4. Hadoop 2 New and Noteworthy
   4.1. HDFS Features in Hadoop 2
   4.2. YARN (Yet Another Resource Negotiator)
5. HDFS Snapshots
   5.1. Case for Snapshots
   5.2. Snapshotting Internals
   5.3. Snapshotting Example
   5.4. References
6. HDFS NFS Access
   6.1. Accessing HDFS Files From Outside
   6.2. HDFS NFS Functionality
   6.3. Setting up HDFS NFS
   6.4. References
7. HDFS Federation
   7.1. Case for HDFS Federation
   7.2. Benefits of Federation
   7.3. Internals of Federation
   7.4. References
8. Hadoop 2 vs. Hadoop 1
   8.1. Daemons
   8.2. Web Interface Port Numbers
   8.3. Directory Layout
   8.4. Start / Stop Scripts
   8.5. Hadoop Command Split
   8.6. Configuration Files
9. Quick Start
   9.1. Getting Hadoop 2 Running
10. Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File)
   10.1. High Level Steps
   10.2. Installing Java / JDK
   10.3. Setting up password-less SSH
   10.4. Get Hadoop 2 tar ball
   10.5. Hadoop Commands
   10.6. Configuring Hadoop 2 HDFS
   10.7. Formatting HDFS Storage
   10.8. Starting HDFS
   10.9. Verifying that HDFS is running
   10.10. Configuring YARN / MapReduce
   10.11. Starting YARN
   10.12. History Server
   10.13. Verifying YARN daemons
   10.14. Final Test -- Running a MapReduce job
   10.15. Shutting down the cluster
   10.16. References
11. HDFS Benchmarking in Hadoop 2
   11.1. TestDFSIO
   11.2. Terasort
   11.3. References

List of Figures
3.1. Hadoop Versions
5.1. Snapshots Explained
6.1. HDFS Is An Isolated File System
6.2. HDFS NFS
7.1. HDFS Architecture
7.2. HDFS Federation (With Multiple NameNodes)
7.3. HDFS No Federation
7.4. HDFS Federation - Namespace Pools
10.1. Namenode UI
10.2. YARN UI
10.3. YARN Running Jobs

List of Tables
3.1. Hadoop Versions
8.1. Hadoop Daemons
8.2. Hadoop Web Interface Port Numbers
8.3. Hadoop Directory Layout
8.4. Start / Stop scripts for tar package version
8.5. Start / Stop scripts for rpm package version
8.6. Hadoop Command Split
8.7. Hadoop Configuration Files
11.1. Terasort
11.2. TeraGen arguments

Chapter 1. About this Book


This book covers Hadoop version 2. It is intended for developers.
The book is freely available and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License [http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US].
"Hadoop 2 Handbook" is a work in progress; it is a 'living book'. We will keep updating it to reflect the fast-moving world of Big Data and Hadoop, so keep checking back.
We appreciate your feedback. You can follow the book on Twitter, discuss it in Google Groups, or send your feedback via email.
Twitter : @ElephantScale [https://twitter.com/ElephantScale]
Email Authors Directly : principals@ElephantScale.com [mailto:principals@ElephantScale.com]
Book on GitHub : github.com/elephantscale/hadoop2-handbook [https://github.com/elephantscale/hadoop2-handbook]

Source Code on GitHub : github.com/elephantscale/HI-labs [https://github.com/elephantscale/HI-labs]

Chapter 2. About the Author

Sujee Maniyam (Author)


Sujee Maniyam is an experienced/hands-on Big Data architect. He has been developing software for the
past 12 years in a variety of technologies (enterprise, web and mobile). He currently focuses on Hadoop,
Big Data and NoSQL, and Amazon Cloud Services. His clients include early stage startups and enterprise
companies.
Sujee stays active in the Hadoop / open source community. He runs a developer-focused Meetup called 'Big Data Gurus' [http://www.meetup.com/BigDataGurus/] and has presented at a variety of Meetups.
Sujee contributes to Hadoop projects, and his open source projects can be found on GitHub. He writes about Hadoop and other technologies on his website.
Sujee does Hadoop training for individuals and corporations; his classes are hands-on and draw heavily on his industry experience.
Links:
LinkedIn [http://www.linkedin.com/in/sujeemaniyam] || GitHub [https://github.com/sujee] || Tech writings [http://sujee.net/tech/articles/] || Tech talks [http://sujee.net/tech/talks/] || BigDataGurus meetup [http://www.meetup.com/BigDataGurus/]

Chapter 3. Hadoop Versions


Let's take a moment to explore the different versions of Hadoop. The following are (or were) the most widely known versions.

Table 3.1. Hadoop Versions

Version   Release Date   Description
0.21      Aug 2010       A widely used release; this eventually became the Hadoop 1.0 release.
0.23      Feb 2012       A branch created to add new features; this branch eventually became Hadoop 2.0.
1.0       Dec 2011       Current production version of Hadoop. Most battle tested.
2.2.0     Oct 2013       Current public release of Hadoop.

Figure 3.1. Hadoop Versions

3.1. Hadoop version 1.0


This is currently the production version of Hadoop. It has been in wide use for a while and has been proven in the field. The following distributions are based on Hadoop 1.0:

Cloudera's CDH 4 (Cloudera's Distribution of Hadoop) series

HortonWorks's HDP 1 (HortonWorks Data Platform) series

3.2. Hadoop version 2.0


This is the current public release of Hadoop. Hadoop 2 has significant new enhancements and has been under development for a while. Version 2 was released in August 2013. This is the version this book covers.
Hadoop 2 has the following new features:
HDFS High Availability
Federated NameNode
Map Reduce version 2 (MRV2) also known as YARN
The following distributions currently bundle Hadoop 2:
Cloudera's CDH5 (Cloudera's Distribution of Hadoop) series.
HortonWorks HDP2 (HortonWorks Data Platform) series.


3.3. (Older) History of Hadoop

Chapter 4. Hadoop 2 New and Noteworthy

Hadoop 2 has a bunch of cool features, both in HDFS and MapReduce.

4.1. HDFS Features in Hadoop 2


NameNode High Availability
The 'single master' design in HDFS made the NameNode a single point of failure. Hadoop 2 addresses this issue by allowing a 'standby' NameNode.
Snapshots
The ability to save the state of the HDFS file system and restore it later.
See more : Chapter 5, HDFS Snapshots
NameNode Federation
Multiple NameNodes, each managing a separate namespace.
See more : Chapter 7, HDFS Federation
NFSv3 Access
Access HDFS using the NFS protocol.
See more : Chapter 6, HDFS NFS Access
Improved IO
HDFS has seen a lot of optimizations for reads / writes.

4.2. YARN (Yet Another Resource Negotiator)


YARN is the next generation processing framework for Hadoop.

Chapter 5. HDFS Snapshots


5.1. Case for Snapshots
Starting from version 2, HDFS allows taking snapshots of the file system. Think of snapshots as something like Apple's 'Time Machine' feature: we can save the state of the file system and restore it later. Snapshots can be useful for data backup, recovering from user errors, and recovering from disasters.
But wait, doesn't HDFS replicate files by default, preventing data loss? Yes, it does. However, HDFS doesn't protect against user errors, like the following command:
$ hdfs dfs -rm -r /important_data
Ouch! This will delete the files, no matter how many times they are replicated.
However, there is a Trash feature in HDFS. It works like the desktop trash can found on most operating systems. If the trash feature is enabled, deleted files can be salvaged from there.
Snapshots give us more than Trash. For starters, files in Trash are only kept for a certain period, controlled by the fs.trash.interval property in core-site.xml (specified in minutes).
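For reference, enabling trash is a one-property change in core-site.xml. The sketch below uses a 24-hour retention (1440 minutes), which is just an illustrative value:

<property>
  <name>fs.trash.interval</name>
  <!-- keep deleted files in the user's .Trash for 1440 minutes (24 hours); 0 disables trash -->
  <value>1440</value>
</property>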

Figure 5.1. Snapshots Explained

And the file system can be altered in more ways than just deleting files: contents, permissions, etc. can change. Snapshots capture this information and enable us to restore files at a later point in time.
We can snapshot the entire file system or specific directories. A directory has to be enabled for snapshotting by the HDFS administrator; this is called making a directory snapshottable. After that, normal users can snapshot their directories on their own without administrator intervention.
A directory with snapshots cannot be deleted or renamed until all of its snapshots are deleted.

5.2. Snapshotting Internals


HDFS snapshotting is pretty efficient. The actual physical data (i.e. the blocks on DataNodes) is NOT copied. Snapshotting only records the state of the metadata. Since dealing with metadata is fast, the snapshot operation is very fast; in computer science terms the cost is constant, O(1).
Snapshotting doesn't slow down regular HDFS operations.
HDFS uses a special directory named .snapshot for snapshot purposes. Hence '.snapshot' is a reserved name; users cannot create a file or directory named '.snapshot'.

HDFS Snapshots

5.3. Snapshotting Example


Let's walk through an example snapshotting operation to better understand the mechanics of the operation.
First let's create a directory we want to snapshot
$ hdfs dfs -mkdir /user/sujee/foo
Before we can snapshot a directory, the HDFS administrator has to enable snapshots on it. Assuming the HDFS administrator user is 'hdfs'...
$ sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/sujee/foo
Create an empty file 'bar' within this directory
$ hdfs dfs -touchz /user/sujee/foo/bar
Create a snapshot
$ hdfs dfs -createSnapshot /user/sujee/foo snap1
The first argument (/user/sujee/foo) to createSnapshot is the directory.
The second argument (snap1) is the (optional) name of the snapshot.
All snapshots are stored within the snapshot directory : /user/sujee/foo/.snapshot
Under the .snapshot directory, a directory named after the snapshot is created : /user/sujee/foo/.snapshot/snap1
And finally, all files are saved under this directory : /user/sujee/foo/.snapshot/snap1
We can use the following command to inspect the directory layout.
$ hdfs dfs -ls -R /user/sujee/foo/.snapshot

drwxr-xr-x   - sujee supergroup          0 2014-02-08 22:07 /user/sujee/foo/.snapshot/snap1
-rw-r--r--   1 sujee supergroup          0 2014-02-08 22:03 /user/sujee/foo/.snapshot/snap1/bar

Now let's delete the file 'bar'
$ hdfs dfs -rm /user/sujee/foo/bar
The file has now disappeared; verify this using the following command
$ hdfs dfs -ls /user/sujee/foo/
Time to restore the file from the snapshot
$ hdfs dfs -cp /user/sujee/foo/.snapshot/snap1/bar /user/sujee/foo/bar-restored
There we go, snapshot in action!
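To tidy up after this experiment, we can list, delete and disallow snapshots. A minimal sketch, assuming the same /user/sujee/foo directory and 'hdfs' admin user as above:

# list directories that are currently snapshottable
$ hdfs lsSnapshottableDir

# delete the snapshot we created earlier
$ hdfs dfs -deleteSnapshot /user/sujee/foo snap1

# disallow further snapshots on the directory (admin operation)
$ sudo -u hdfs hdfs dfsadmin -disallowSnapshot /user/sujee/foo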

5.4. References
Hortonworks guide [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-snapshots.html]
Apache Hadoop documentation [http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html]

Chapter 6. HDFS NFS Access


6.1. Accessing HDFS Files From Outside
HDFS is a user-land file system. It doesn't behave like other Linux file systems (e.g. ext3 or ext4), and it cannot simply be 'mounted'.

Figure 6.1. HDFS Is An Isolated File System

We use the HDFS command line utility (hdfs) to interact with HDFS. However, files residing in HDFS are not accessible to traditional Linux programs. Let's say we have a Linux executable 'my_awesome_program' that analyzes videos. If our video files are in HDFS, they won't be accessible to this program, so the following command would not work.
$ my_awesome_program < input/file/in/HDFS > output/file/in/HDFS
It would be nice to have files in HDFS accessible to other programs, so we don't need to rewrite those programs in Java just to access files in HDFS. The solution is to make HDFS available over the Network File System (NFS) protocol [http://en.wikipedia.org/wiki/Network_File_System]. NFS has been around for a long time, and lots of applications and operating systems know how to talk to NFS file systems.
The HDFS NFS Gateway allows HDFS to be mounted as an NFS file system.
HDFS NFS Gateway allows HDFS to be mounted as a NFS file system.

Figure 6.2. HDFS NFS

HDFS NFS Access

6.2. HDFS NFS Functionality


Browse HDFS
NFS clients will be able to browse HDFS file systems.
Download files from HDFS
Copy files from HDFS --> local
Upload files to HDFS
Copy files from local --> HDFS
Append to files
Clients can append to HDFS files
No random write to files
Since HDFS does NOT support random write to files, this feature is not available via NFS.

6.3. Setting up HDFS NFS


Follow this guide from Hortonworks [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-nfs.html]
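Once the gateway daemons are up, mounting HDFS from a Linux client might look like the sketch below; the gateway host (localhost) and the mount point /hdfs_mount are assumptions, so adjust them for your setup:

# create a local mount point (hypothetical path)
$ sudo mkdir -p /hdfs_mount

# mount the HDFS NFS gateway (the gateway supports NFSv3 over TCP, without locking)
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock localhost:/ /hdfs_mount

# HDFS now appears as a regular directory tree
$ ls /hdfs_mount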

6.4. References
Hortonworks guide to setting up HDFS NFS [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-nfs.html]
Blog post from Hortonworks explaining the NFS feature [http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/]
HDFS-4750 [https://issues.apache.org/jira/browse/HDFS-4750] - Original JIRA to implement NFS support in HDFS


Chapter 7. HDFS Federation


7.1. Case for HDFS Federation
HDFS has a single NameNode and many DataNodes. The storage (managed by DataNodes) scales horizontally, but the namespace (managed by a single NameNode) doesn't. This is particularly a problem for very large clusters (1000s of nodes).
Federation solves this by having multiple NameNodes manage the namespace.

Figure 7.1. HDFS Architecture

Figure 7.2. HDFS Federation (With Multiple NameNodes)


7.2. Benefits of Federation


Increasing File System Performance.
The NameNode is involved in pretty much all file system operations (creating files, reading / writing files, etc.). On a large cluster (1000s of nodes) a single NameNode can become a bottleneck for IO operations. Multiple NameNodes help scale the IO rate.
Isolation of File Systems.
Each NameNode can manage a separate namespace. For example:
the /user directory can be managed by NN1
the /hive directory can be managed by NN2
the /hbase directory can be managed by NN3
In this scenario different namespaces are managed by different NameNodes. This provides isolation of namespaces. For example, if NN2 is down, only the /hive directory is inaccessible; the other directories, /user and /hbase, remain available.
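One way to give clients a single view over such a split namespace is a ViewFs client-side mount table. The sketch below is illustrative only; the NameNode hostnames (nn1.example.com, etc.) are placeholders:

<!-- client-side core-site.xml : stitch federated namespaces into one view -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs:///</value>
</property>
<property>
  <!-- /user is served by the first NameNode -->
  <name>fs.viewfs.mounttable.default.link./user</name>
  <value>hdfs://nn1.example.com:8020/user</value>
</property>
<property>
  <!-- /hive is served by the second NameNode -->
  <name>fs.viewfs.mounttable.default.link./hive</name>
  <value>hdfs://nn2.example.com:8020/hive</value>
</property>
<property>
  <!-- /hbase is served by the third NameNode -->
  <name>fs.viewfs.mounttable.default.link./hbase</name>
  <value>hdfs://nn3.example.com:8020/hbase</value>
</property>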

7.3. Internals of Federation


Federation is implemented by grouping data blocks into pools. Each NameNode manages a namespace pool. Note that DataNodes are not segmented into pools: all NameNodes share all DataNodes, and a DataNode may host blocks belonging to multiple namespace pools.
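As a rough sketch (not a complete setup), declaring two federated NameNodes comes down to listing multiple nameservices in hdfs-site.xml; the hostnames below are placeholders:

<!-- hdfs-site.xml : two independent namespaces sharing the same DataNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <!-- NameNode serving the first namespace -->
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <!-- NameNode serving the second namespace -->
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>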

Figure 7.3. HDFS No Federation


Figure 7.4. HDFS Federation - Namespace Pools

7.4. References
Apache Hadoop documentation [https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/Federation.html]


Chapter 8. Hadoop 2 vs. Hadoop 1


This chapter details how Hadoop version 2 'looks and feels' different from version 1. To learn about the functional differences between version 1 and 2, see Chapter 3, Hadoop Versions.

8.1. Daemons
HDFS daemons are pretty much the same in Hadoop 1 and Hadoop 2. The biggest difference is that Hadoop 2 has YARN instead of MapReduce v1.

Table 8.1. Hadoop Daemons

HDFS (very little change)
  Hadoop 1: NameNode (master) [one per cluster]
            Secondary NameNode
            DataNode (worker) [many per cluster, one per node]
  Hadoop 2: NameNode (master) [one per cluster]
            Checkpoint Node (formerly Secondary NameNode)
            DataNode (worker) [many per cluster, one per node]

Processing
  Hadoop 1 (MapReduce v1): Job Tracker (master) [one per cluster]
                           Task Tracker (worker) [many per cluster, one per node]
  Hadoop 2 (YARN / MRv2):  Resource Manager [one per cluster]
                           Node Manager [many per cluster, one per node]
                           Application Master [many per cluster]

8.2. Web Interface Port Numbers


Table 8.2. Hadoop Web Interface Port Numbers

Daemon                                Hadoop 1   Hadoop 2
HDFS : NameNode (same)                50070      50070
MapReduce 1 : Job Tracker             50030      --
YARN : Resource Manager               --         8088
YARN : MapReduce Job History Server   --         19888


8.3. Directory Layout


Table 8.3. Hadoop Directory Layout

commands (hadoop, mapred, ...)
  Hadoop 1: 'HADOOP_INSTALL / bin' directory
  Hadoop 2: 'HADOOP_INSTALL / bin' directory

admin commands (start-* and stop-* scripts)
  Hadoop 1: 'HADOOP_INSTALL / bin' directory
  Hadoop 2: 'HADOOP_INSTALL / sbin' directory

configuration files
  Hadoop 1: 'HADOOP_INSTALL / conf' directory
  Hadoop 2: 'HADOOP_INSTALL / etc / hadoop' directory

jar files
  Hadoop 1: 'HADOOP_INSTALL / lib' directory
  Hadoop 2: 'HADOOP_INSTALL / share / hadoop' directory
            (jar files live in component-specific sub-directories: common, hdfs, mapreduce, yarn)

Note on jar files


In Hadoop v1 all dependency jars are in the lib dir. In v2 they are broken out into sub directories.
HADOOP_INSTALL/share/hadoop/common
HADOOP_INSTALL/share/hadoop/hdfs
HADOOP_INSTALL/share/hadoop/mapreduce
HADOOP_INSTALL/share/hadoop/tools
HADOOP_INSTALL/share/hadoop/yarn
Each of these directories also has a lib sub-folder for its own dependencies. This means there may be duplicate copies of some jars.
For example,
$ find hadoop-2 -name "log4j*.jar"
yields
hadoop-2/share/hadoop/common/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/hdfs/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/mapreduce/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/yarn/lib/log4j-1.2.17.jar
This is okay, as each component (hdfs, mapreduce, yarn) can now manage its dependencies cleanly.
The split of jar files means that we have to change our run scripts as well. Here is an example of running a Java program that connects to HDFS:
$ java -cp my.jar:$hadoop_home/share/hadoop/hdfs/*:$hadoop_home/share/hadoop/hdfs/lib/*:$hadoop_home/share/hadoop/common/*:$hadoop_home/share/hadoop/common/lib/* MyClass
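A less brittle alternative, assuming the hadoop command is on the PATH, is to let Hadoop compute its own classpath instead of listing the jar directories by hand (my.jar and MyClass are the same hypothetical names as above):

# 'hadoop classpath' prints the full client classpath for this installation
$ java -cp my.jar:$(hadoop classpath) MyClass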

8.4. Start / Stop Scripts


These are files used to start/stop Hadoop daemons:


Tar Package Version


Table 8.4. Start / Stop scripts for tar package version

to start HDFS
  Hadoop 1: $ HADOOP_INSTALL/bin/start-dfs.sh
            $ HADOOP_INSTALL/bin/hadoop-daemon.sh start namenode
  Hadoop 2: $ HADOOP_INSTALL/sbin/start-dfs.sh
            $ HADOOP_INSTALL/sbin/hadoop-daemon.sh start namenode

to start MapReduce
  Hadoop 1: $ HADOOP_INSTALL/bin/start-mapred.sh
  Hadoop 2: $ HADOOP_INSTALL/sbin/start-yarn.sh

to start everything
  Hadoop 1: $ HADOOP_INSTALL/bin/start-all.sh
  Hadoop 2: $ HADOOP_INSTALL/sbin/start-all.sh

RPM Version
Table 8.5. Start / Stop scripts for rpm package version

to start HDFS
  Hadoop 1:
    On the NameNode: sudo service hadoop-0.20-namenode start
    On DataNodes:    sudo service hadoop-0.20-datanode start
  Hadoop 2:
    On the NameNode: sudo service hadoop-hdfs-namenode start
    On DataNodes:    sudo service hadoop-hdfs-datanode start

to start MapReduce
  Hadoop 1:
    On the Job Tracker: sudo service hadoop-0.20-jobtracker start
    On Task Trackers:   sudo service hadoop-0.20-tasktracker start
  Hadoop 2:
    On the master (ResourceManager) node:
      sudo service hadoop-yarn-resourcemanager start
      sudo service hadoop-mapreduce-historyserver start
      sudo service hadoop-yarn-proxyserver start
    On worker nodes:
      sudo service hadoop-yarn-nodemanager start

8.5. Hadoop Command Split


In v1 a single bin/hadoop executable is used for file system operations, administration and MapReduce operations. In v2 these are handled by separate binaries.

Table 8.6. Hadoop Command Split

File system operations
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop dfs -ls
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs dfs -ls

NameNode operations
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop namenode -format
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs namenode -format

File system administration commands
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop dfsadmin -refreshNodes
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs dfsadmin -refreshNodes

MapReduce commands
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop job ...
  Hadoop 2: $ HADOOP_INSTALL/bin/mapred job ...

8.6. Configuration Files


Sample files: core-site.xml, hdfs-site.xml, mapred-site.xml
In Hadoop v1 the configuration files are in the HADOOP_HOME/conf directory. In Hadoop v2 they are
in the HADOOP_HOME/etc/hadoop directory. The files are all intact at their new location.

Table 8.7. Hadoop Configuration Files

File        Hadoop 1                              Hadoop 2
Core        HADOOP_INSTALL/conf/core-site.xml     HADOOP_INSTALL/etc/hadoop/core-site.xml
HDFS        HADOOP_INSTALL/conf/hdfs-site.xml     HADOOP_INSTALL/etc/hadoop/hdfs-site.xml
MapReduce   HADOOP_INSTALL/conf/mapred-site.xml   HADOOP_INSTALL/etc/hadoop/mapred-site.xml
YARN        --                                    HADOOP_INSTALL/etc/hadoop/yarn-site.xml


Chapter 9. Quick Start


9.1. Getting Hadoop 2 Running
The best way to experience Hadoop 2 is to install it and play with it. We will describe two ways of installing Hadoop 2, focusing on a single-node install.
Using the TAR file -- probably the quickest way to get started. See Chapter 10, Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File), for the complete steps.
Using a distribution virtual machine : using pre-built virtual machines from vendors.


Chapter 10. Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File)

10.1. High Level Steps
These steps apply to Linux or Mac OS X. They may work on Windows if you have Cygwin installed, but this has not been tested. If you are using Windows, the recommended approach is to get a virtual machine running Linux and install Hadoop on that.
These are the steps we will go through to get Hadoop 2 installed. You may skip a step if your environment already has it (e.g. Java is already installed):
Install Java / JDK
Set up password-less SSH
Get the Hadoop 2 tar distribution and unpack it
Make the Hadoop commands available in PATH
Configure HDFS
Format HDFS storage
Start HDFS and verify it
Configure YARN
Start YARN
Run a sample MapReduce job

10.2. Installing Java / JDK


Hadoop 2 can work with Java version 6 or 7. You can get Java from the Oracle Java site [http://www.oracle.com/technetwork/java/javase/downloads/index.html]. Download the JDK (Java Development Kit) rather than the JRE (Java Runtime Environment); the JDK includes development tools you may end up using.
Installation instructions for Java vary widely by operating system (Windows, Linux, Mac). Follow the instructions from the Java download site. For brevity we are going to leave installing Java for your platform up to you. The rest of the guide assumes you have Java installed and ready to go.
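A quick sanity check, assuming java is already on your PATH:

# any 1.6 or 1.7 JDK will do for Hadoop 2.2.0
$ java -version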

10.3. Setting up password-less SSH


This is a convenience step that lets us use the start / stop scripts bundled with Hadoop.
Generate an SSH key pair. Skip this step if you already have SSH keys. You can verify whether you have keys
using the following command

$ ls -l $HOME/.ssh/

$ ls -l $HOME/.ssh/
total 304
-rw-r--r--  1 sujee  staff   815B Jan 10 22:44 authorized_keys
-rw-------  1 sujee  staff   1.6K Apr 21  2010 id_rsa
-rw-r--r--  1 sujee  staff   403B Apr 21  2010 id_rsa.pub

If you see the files id_rsa or id_dsa, you already have SSH keys and don't have to generate them again.
If not, let's generate SSH keys
$ ssh-keygen
To accept the default values, press 'Enter' a couple of times. You will see output like this.

$ ssh-keygen
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
a9:63:a1:4e:9e:12:f7:08:a7:ab:bf:3f:a7:f8:6a:0c sujee@melbourne-2.local
The key's randomart image is:
+--[ RSA 2048]----+
| (randomart image omitted) |
+-----------------+

Now let's add the key to the 'authorized_keys' file so we can log in without entering a password
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
Try logging into the machine
$ ssh localhost
The very first time, there may be an 'are you sure' prompt. Just say yes. After that you can SSH into localhost without entering any password.


10.4. Get Hadoop 2 tar ball


Hadoop tar distributions are hosted on the Apache Hadoop site and its mirrors. The Hadoop releases page [http://hadoop.apache.org/releases.html] lists the available versions. We are looking for the 2.x series. At this time, the stable 2.x release is version 2.2.0, which we will use. If you download a different version from the 2.x series, adjust the setup instructions below.
Download the release from the nearest Apache mirror [http://www.apache.org/dyn/closer.cgi/hadoop/common/].
The following command downloads the Hadoop distribution using wget. Adjust the download location based on the Apache mirror closest to you.
$ wget http://download.nextag.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ tar xvf hadoop-2.2.0.tar.gz
This command creates a convenient 'hadoop2' symlink
$ ln -s hadoop-2.2.0 hadoop2
From now on we will refer to wherever Hadoop is installed as '/hadoop_install_location'. For example, on my Mac this is /Users/sujee/hadoop2

10.5. Hadoop Commands


In Hadoop version 1 all command line utilities were in the 'hadoop install dir'/bin directory. In Hadoop 2, commands live in two directories.
bin : the most used commands, like hadoop, hdfs and mapred
sbin : administrative commands, like start-all.sh (which starts all Hadoop daemons) and stop-all.sh
We can invoke these commands using full path to them:
$ /hadoop_install_location/bin/hadoop
$ /hadoop_install_location/sbin/start-all.sh
Typing the full path repeatedly can get pretty tiresome. A convenient alternative is to add the hadoop2/bin and hadoop2/sbin directories to our PATH environment variable.
For temporary usage in one terminal, do this:
$ export PATH=$PATH:/hadoop_install_location/bin:/hadoop_install_location/sbin
After that, the hadoop commands will be available for that terminal session.
To add these paths permanently to PATH, we can add them to our shell start-up file. If you are using the BASH shell, add the following line to the $HOME/.bashrc file.
export PATH=$PATH:/hadoop_install_location/bin:/hadoop_install_location/sbin
Save the file and the changes will be visible to new bash shells (e.g. when opening a new terminal).


10.6. Configuring Hadoop 2 HDFS


Configuration files are located in the etc/hadoop directory within the Hadoop install directory (/hadoop_install_location/etc/hadoop). This is different from Hadoop 1, where configuration files were located in the 'conf' directory.

hadoop-env.sh (within etc/hadoop/)


This file defines environment variables used by Hadoop. We only need to set the JAVA_HOME variable here.
JAVA_HOME will vary depending on your system setup. Here are some sample values.
On Linux : /usr/java/latest
On Mac : /Library/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home
export JAVA_HOME=/usr/java/latest

core-site.xml (within etc/hadoop/)


This file has 'core' generic properties for the cluster.
core-site.xml (on github) [https://gist.github.com/sujee/8547538].
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <description>
      this replaces the Hadoop 1 property : fs.default.name
    </description>
  </property>

  <property>
    <!-- TODO : modify this to match your environment -->
    <name>hadoop.tmp.dir</name>
    <value>/Users/sujee/hadoop2-data</value>
    <description>
      This is the 'root' directory for Hadoop (HDFS and MapReduce)
    </description>
  </property>

</configuration>
For a single-node cluster it is convenient to set hadoop.tmp.dir rather than setting multiple directories for various properties. Both HDFS and MapReduce will store their data under this directory.

hdfs-site.xml (within etc/hadoop/)


This is where we define HDFS-specific properties. In this case it is pretty simple; we are just setting the replication factor to 1.

hdfs-site.xml (on github) [https://gist.github.com/sujee/8549067].

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>single node, 1 copy</description>
  </property>

</configuration>

10.7. Formatting HDFS Storage


Before using HDFS, we need to format the NameNode directory.
$ hdfs namenode -format
or using full path
$ /hadoop_install_location/bin/hdfs namenode -format
The output might look like...

$ bin/hdfs namenode -format


14/01/21 13:36:11 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
....
....
14/01/21 13:36:12 INFO common.Storage: Storage directory /hadoop/data/hadoop-2/dfs/name has
14/01/21 13:36:12 INFO namenode.FSImage: Saving image file /hadoop/data/hadoop-2/dfs/name/c
14/01/21 13:36:12 INFO namenode.FSImage: Image file /hadoop/data/hadoop-2/dfs/name/current/
14/01/21 13:36:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with tx
14/01/21 13:36:12 INFO util.ExitUtil: Exiting with status 0
14/01/21 13:36:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at melbourne-2.local/10.102.36.146
************************************************************/

What we are looking for is the message that the storage directory 'has been successfully formatted'. All looks good.

10.8. Starting HDFS


Check the etc/hadoop/slaves file. Make sure it has only one line
localhost

$ cat /hadoop_install_location/etc/hadoop/slaves
localhost


We are going to use the sbin/start-dfs.sh script. This script does the following:
Starts the NameNode on the current node
Starts the Secondary NameNode (if necessary) on the machines listed in the masters file (etc/hadoop/masters)
Starts DataNodes on the nodes listed in the slaves file (etc/hadoop/slaves)
For this to work, password-less SSH has to work (this is why we set up SSH before)
$ start-dfs.sh
or using full path
$ /hadoop_install_location/sbin/start-dfs.sh

$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-sujee-nam
localhost: starting datanode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-sujee-dat
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-su

10.9. Verifying that HDFS is running


If start-dfs.sh didn't error out, most likely the HDFS daemons are up and running.

Verifying on command-line
We will use the jps command to list running Java processes (jps is part of the JDK).
$ jps
30438 Jps
30375 SecondaryNameNode
30226 NameNode
30291 DataNode
Looks good; we have all the HDFS daemons running on this node.
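An optional extra check is to ask the NameNode for a cluster report; on a healthy single-node setup it should list one live DataNode:

# prints capacity, used space and the list of live DataNodes
$ hdfs dfsadmin -report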

Verifying HDFS WebUI


Access the HDFS Web UI at http://localhost:50070 in a browser. Verify that the 'Live Nodes' count is 1.
See the screenshot below


Figure 10.1. Namenode UI

10.10. Configuring YARN / MapReduce


Now that we have HDFS working, let's configure and run YARN.

yarn-site.xml (etc/hadoop)
yarn-site.xml (on GitHub) [https://gist.github.com/sujee/8551096]
<?xml version="1.0"?>
<configuration>

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!--
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  -->

</configuration>

mapred-site.xml (etc/hadoop)
mapred-site.xml (on GitHub) [https://gist.github.com/sujee/8551273]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

</configuration>

10.11. Starting YARN


Use the sbin/start-yarn.sh script
$ start-yarn.sh
or using full path
$ /hadoop_install_location/sbin/start-yarn.sh

10.12. History Server


The history server lets us browse the stats of completed jobs.
$ /hadoop_install_location/sbin/mr-jobhistory-daemon.sh start historyserver
Access the Job History Server Web UI at http://localhost:19888 in a browser

10.13. Verifying YARN daemons


Command line check
$ jps
33780 ResourceManager
33858 NodeManager
30375 SecondaryNameNode
30226 NameNode
30291 DataNode
30297 JobHistoryServer
34928 Jps

We see the two YARN daemons, ResourceManager and NodeManager, along with the JobHistoryServer. All good!
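As an optional extra check, the ResourceManager can be asked directly which nodes have registered; on this setup it should list a single NodeManager:

# lists NodeManagers registered with the ResourceManager
$ yarn node -list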

YARN Web UI
Access the YARN Web UI at http://localhost:8088 in a browser. Verify that 'Active Nodes' is 1.

Figure 10.2. YARN UI

10.14. Final Test -- Running a MapReduce job


Now that we have HDFS and YARN running, let's run a MapReduce job to see if the whole cluster is functional. We will be using the Grep MapReduce example that ships with Hadoop for this purpose.

Copy some files into HDFS


Let's use the Hadoop configuration XML files as input for grep
Create an input directory in HDFS to hold the files
$ hdfs dfs -mkdir -p /grep/in
or using full path
$ /hadoop_install_location/bin/hdfs dfs -mkdir -p /grep/in
Copy a bunch of files into /grep/in directory
$ hdfs dfs -put /hadoop_install_location/etc/hadoop/* /grep/in/
or using full path
$ /hadoop_install_location/bin/hdfs dfs -put /hadoop_install_location/etc/hadoop/* /grep/in/

Run Grep MapReduce job


$ /hadoop_install_location/bin/hadoop jar /hadoop_install_location/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar grep /grep/in/ /grep/out '(xml|name|dfs)'
Here we are asking grep to look for three patterns: xml, name and dfs, using standard regular expression syntax. This command runs two MapReduce jobs in sequence. The output may look like this

...
14/01/21 17:05:25 INFO mapreduce.Job: Running job: job_1390348071644_0005
14/01/21 17:05:32 INFO mapreduce.Job:  map 0% reduce 0%
14/01/21 17:05:50 INFO mapreduce.Job:  map 12% reduce 0%
...
14/01/21 17:07:02 INFO mapreduce.Job: Job job_1390348071644_0005 completed successfully
...
14/01/21 17:07:03 INFO mapreduce.Job: Running job: job_1390348071644_0006
14/01/21 17:07:16 INFO mapreduce.Job:  map 0% reduce 0%
14/01/21 17:07:21 INFO mapreduce.Job:  map 100% reduce 0%
14/01/21 17:07:27 INFO mapreduce.Job:  map 100% reduce 100%
14/01/21 17:07:28 INFO mapreduce.Job: Job job_1390348071644_0006 completed successfully
...

Once the MR jobs have completed successfully, let's look at the output


$ hdfs dfs -cat /grep/out/*
or using full path
$ /hadoop_install_location/bin/hdfs dfs -cat /grep/out/*
The output might look like the following. As you can see, Grep counts the number of instances of each pattern and sorts them by frequency; the second MapReduce job does the sorting.

167 name
32 dfs
21 xml

Let's check the YARN UI (see the screenshot below). We see that the two MapReduce jobs (grep-search and grep-sort) ran and finished successfully.

Figure 10.3. YARN Running Jobs

All looking good

10.15. Shutting down the cluster


First we will shut down YARN (including the history server), then we will shut down HDFS. We will use the stop-yarn.sh and stop-dfs.sh scripts located in the /hadoop_install_location/sbin directory.

$ /hadoop_install_location/sbin/mr-jobhistory-daemon.sh stop historyserver
$ stop-yarn.sh
or using full path
$ /hadoop_install_location/sbin/stop-yarn.sh
$ stop-dfs.sh
or using full path
$ /hadoop_install_location/sbin/stop-dfs.sh

10.16. References
Cloudera 4.x YARN guide [http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_11_4.html]


Chapter 11. HDFS Benchmarking in Hadoop 2

We will look at two benchmarks for HDFS. Running these benchmarks is a little different in Hadoop 2.

11.1. TestDFSIO
This benchmark exercises HDFS and measures IO throughput. TestDFSIO uses MapReduce to read / write
files in parallel.
To run this test we need both the HDFS and YARN daemons running.
$ hadoop org.apache.hadoop.fs.TestDFSIO -write -nrFiles 5 -fileSize 1GB
The parameters are as follows:
-write : write files
-nrFiles : the number of files
-fileSize : the size of each file; you can use suffixes like MB or GB
This would run a MapReduce job. The output might look like this
14/02/14 00:02:57 INFO fs.TestDFSIO: TestDFSIO.1.7
14/02/14 00:02:57 INFO fs.TestDFSIO: nrFiles = 5
14/02/14 00:02:57 INFO fs.TestDFSIO: nrBytes (MB) = 1024.0
14/02/14 00:02:57 INFO fs.TestDFSIO: bufferSize = 1000000
14/02/14 00:02:57 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
...
14/02/14 00:03:00 INFO mapreduce.Job: Running job: job_1392364939455_0001
14/02/14 00:03:09 INFO mapreduce.Job:  map 0% reduce 0%
14/02/14 00:03:35 INFO mapreduce.Job:  map 40% reduce 0%
...
14/02/14 00:05:36 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
14/02/14 00:05:36 INFO fs.TestDFSIO:           Date and time: Fri Feb 14 00:05:36
14/02/14 00:05:36 INFO fs.TestDFSIO:         Number of files: 5
14/02/14 00:05:36 INFO fs.TestDFSIO:  Total MBytes processed: 5120.0
14/02/14 00:05:36 INFO fs.TestDFSIO:       Throughput mb/sec: 8.957344147460278
14/02/14 00:05:36 INFO fs.TestDFSIO:  Average IO rate mb/sec: 8.959185600280762
14/02/14 00:05:36 INFO fs.TestDFSIO:   IO rate std deviation: 0.12878510722494685
14/02/14 00:05:36 INFO fs.TestDFSIO:      Test exec time sec: 157.34
14/02/14 00:05:36 INFO fs.TestDFSIO:
...

There is also a read version of the command


$ hadoop org.apache.hadoop.fs.TestDFSIO -read -nrFiles 5 -fileSize 1GB
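When benchmarking is done, the test data left under /benchmarks/TestDFSIO can be removed with the -clean option; a small cleanup sketch:

# delete the TestDFSIO benchmark files from HDFS
$ hadoop org.apache.hadoop.fs.TestDFSIO -clean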

11.2. Terasort
Terasort is the ultimate IO benchmark for Hadoop. The test comprises three components.


TeraGen : generates a bunch of random data
Terasort : sorts the data
TeraValidate : verifies that the data is indeed sorted
1 TB of data (the amount Terasort is named after) looks like the following:
Each record is 100 bytes long
10 billion records
1 TB of data (10 billion records x 100 bytes)
Terasort allows us to generate various sizes of data. Here is a cheat table for generating the desired amount of data.

Table 11.1. Terasort

Data Size   Number of records (data size / 100 bytes per record)
1 GB        10 million records (10,000,000)
10 GB       100 million records (100,000,000)
100 GB      1 billion records (1,000,000,000)
1 TB        10 billion records (10,000,000,000)

Let's generate 10 GB of data. We will supply the following arguments to TeraGen:

teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

Table 11.2. TeraGen arguments

-D mapreduce.job.maps=20
  20 mappers. Each mapper will create one file, so 10 GB divided across 20 maps yields a 500 MB file each.

-D dfs.blocksize=536870912
  We are specifying the block size to be 512 MB. By specifying the number of mappers we are creating files of 500 MB each, and we don't want these files split up into blocks by the default block size of 128 MB; we want to keep each file as one block. Hence we bump the block size up to 512 MB (larger than an individual file). A map task will sort a block, and if the block size is too small the mappers finish very quickly, especially on modern hardware. That is why we set the block size higher.

100000000
  Number of records. See the table above for the calculation specifics.

input_dir
  This is where the generated data will end up.

Now let's actually generate the 10 GB of data. How we invoke the TeraGen command depends on our Hadoop installation; you may need to adjust the command to match your environment.

option 1) for tar based installs...

$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

option 2) RPM / package based Hadoop distribution

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

Now it is time to sort the data. Here terasort reads the TeraGen output from input_dir and writes the sorted data to terasort_out.

option 1) for tar based installs...

$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar terasort -D mapreduce.job.reduces=20 input_dir terasort_out

option 2) RPM / package based Hadoop distribution

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -D mapreduce.job.reduces=20 input_dir terasort_out
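Finally, TeraValidate checks that the output really is sorted. A sketch for the tar based install is below; validate_out is just a placeholder directory for the validation report, and the RPM invocation would use the examples jar path shown above.

# verify that terasort_out is globally sorted; a report is written to validate_out
$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teravalidate terasort_out validate_out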

11.3. References
Michael Noll has an excellent tutorial on TestDFSIO and Terasort [http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/] (Hadoop v1 version)
