
Hadoop 2 Handbook

Sujee Maniyam <sujee@ElephantScale.com>

Hadoop 2 Handbook
by Sujee Maniyam

Dedication
To the open source community

Acknowledgements
From Sujee
To the kind souls who helped me along the way
Copyright 2014 Elephant Scale LLC. All Rights Reserved.


Table of Contents
1. About this Book
2. About the Author
3. Hadoop Versions
   3.1. Hadoop version 1.0
   3.2. Hadoop version 2.0
   3.3. (Older) History of Hadoop
4. Hadoop 2 New and Noteworthy
   4.1. HDFS Features in Hadoop 2
   4.2. YARN (Yet Another Resource Negotiator)
5. HDFS Snapshots
   5.1. Case for Snapshots
   5.2. Snapshotting Internals
   5.3. Snapshotting Example
   5.4. References
6. HDFS NFS Access
   6.1. Accessing HDFS Files From Outside
   6.2. HDFS NFS Functionality
   6.3. Setting up HDFS NFS
   6.4. References
7. HDFS Federation
   7.1. Case for HDFS Federation
   7.2. Benefits of Federation
   7.3. Internals of Federation
   7.4. References
8. Hadoop 2 vs. Hadoop 1
   8.1. Daemons
   8.2. Web Interface Port Numbers
   8.3. Directory Layout
   8.4. Start / Stop Scripts
   8.5. Hadoop Command Split
   8.6. Configuration Files
9. Quick Start
   9.1. Getting Hadoop 2 Running
10. Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File)
   10.1. High Level Steps
   10.2. Installing Java / JDK
   10.3. Setting up password-less SSH
   10.4. Get Hadoop 2 tar ball
   10.5. Hadoop Commands
   10.6. Configuring Hadoop 2 HDFS
   10.7. Formatting HDFS Storage
   10.8. Starting HDFS
   10.9. Verifying that HDFS is running
   10.10. Configuring YARN / MapReduce
   10.11. Starting YARN
   10.12. History Server
   10.13. Verifying YARN daemons
   10.14. Final Test -- Running a MapReduce job
   10.15. Shutting down the cluster
   10.16. References
11. HDFS Benchmarking in Hadoop 2
   11.1. TestDFSIO
   11.2. Terasort
   11.3. References

List of Figures
3.1. Hadoop Versions
5.1. Snapshots Explained
6.1. HDFS Is An Isolated File System
6.2. HDFS NFS
7.1. HDFS Architecture
7.2. HDFS Federation (With Multiple NameNodes)
7.3. HDFS No Federation
7.4. HDFS Federation - Namespace Pools
10.1. Namenode UI
10.2. YARN UI
10.3. YARN Running Jobs

List of Tables
3.1. Hadoop Versions
8.1. Hadoop Daemons
8.2. Hadoop Web Interface Port Numbers
8.3. Hadoop Directory Layout
8.4. Start / Stop scripts for tar package version
8.5. Start / Stop scripts for rpm package version
8.6. Hadoop Command Split
8.7. Hadoop Configuration Files
11.1. Terasort
11.2. TeraGen arguments

Chapter 1. About this Book


This book covers Hadoop version 2. It is intended for developers.
The book is freely available and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License [http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US].
"Hadoop 2 Handbook" is a work in progress; it is a 'living book'. We will keep updating it to reflect the fast-moving world of Big Data and Hadoop, so keep checking back.
We appreciate your feedback. You can follow the book on Twitter, discuss it in Google Groups, or send your feedback via email.
Twitter : @ElephantScale [https://twitter.com/ElephantScale]
Email Authors Directly : principals@ElephantScale.com [mailto:principals@ElephantScale.com]
Book on GitHub : github.com/elephantscale/hadoop2-handbook [https://github.com/elephantscale/hadoop2-handbook]

Source Code on GitHub : github.com/elephantscale/HI-labs [https://github.com/elephantscale/HI-labs]

Chapter 2. About the Author

Sujee Maniyam (Author)


Sujee Maniyam is an experienced/hands-on Big Data architect. He has been developing software for the
past 12 years in a variety of technologies (enterprise, web and mobile). He currently focuses on Hadoop,
Big Data and NoSQL, and Amazon Cloud Services. His clients include early stage startups and enterprise
companies.
Sujee stays active in the Hadoop / open source community. He runs a developer-focused Meetup called 'Big Data Gurus' [http://www.meetup.com/BigDataGurus/] and has presented at a variety of Meetups.
Sujee contributes to Hadoop projects, and his open source projects can be found on GitHub. He writes about Hadoop and other technologies on his website.
Sujee does Hadoop training for individuals and corporations; his classes are hands-on and draw heavily on his industry experience.
Links:
LinkedIn [http://www.linkedin.com/in/sujeemaniyam] || GitHub [https://github.com/sujee] || Tech writings [http://sujee.net/tech/articles/] || Tech talks [http://sujee.net/tech/talks/] || BigDataGurus meetup [http://www.meetup.com/BigDataGurus/]

Chapter 3. Hadoop Versions


Let's take a moment to explore the different versions of Hadoop. The following are (or were) the most widely known versions.

Table 3.1. Hadoop Versions

Version   Release Date   Description
0.21      Aug 2010       A widely used release; this eventually became the Hadoop 1.0 release.
0.23      Feb 2012       A branch created to add new features; this branch eventually became Hadoop 2.0.
1.0       Dec 2011       Current production version of Hadoop. Most battle tested.
2.2.0     Oct 2013       Current public release of Hadoop.

Figure 3.1. Hadoop Versions

3.1. Hadoop version 1.0


This is currently the production version of Hadoop. It has been in wide use for a while and has been proven in the field. The following distributions are based on Hadoop 1.0:

Cloudera's CDH 4 (Cloudera's Distribution of Hadoop) series

HortonWorks's HDP 1 (HortonWorks Data Platform) series

3.2. Hadoop version 2.0


This is the current public release of Hadoop. Hadoop 2 has significant new enhancements and has been under development for a while. Version 2 was released in August 2013. This is the version this book covers.
Hadoop 2 has the following new features:
HDFS High Availability
Federated NameNode
Map Reduce version 2 (MRV2) also known as YARN
The following distributions currently bundle Hadoop 2:
Cloudera's CDH5 (Cloudera's Distribution of Hadoop) series.
HortonWorks HDP2 (HortonWorks Data Platform) series.


3.3. (Older) History of Hadoop

Chapter 4. Hadoop 2 New and Noteworthy

Hadoop 2 has a bunch of cool features, both in HDFS and MapReduce.

4.1. HDFS Features in Hadoop 2


NameNode High Availability
The 'single master' design in HDFS made the NameNode a single point of failure. Hadoop 2 addresses this issue by allowing a 'standby' NameNode.
Snapshots
The ability to save the state of the HDFS file system and restore it later.
See more : Chapter 5, HDFS Snapshots
NameNode Federation
Multiple NameNodes, each managing a separate namespace.
See more : Chapter 7, HDFS Federation
NFSv3 Access
Access HDFS using the NFS protocol.
See more : Chapter 6, HDFS NFS Access
Improved IO
HDFS has seen a lot of optimizations for reads / writes.

4.2. YARN (Yet Another Resource Negotiator)


YARN is the next generation processing framework for Hadoop.

Chapter 5. HDFS Snapshots


5.1. Case for Snapshots
Starting from version 2, HDFS allows taking snapshots of the file system. Think of snapshots as something like Apple's 'Time Machine' feature: we can save the state of the file system and restore it later. Snapshots can be useful for data backup, recovering from user errors, and recovering from disasters.
But wait, doesn't HDFS replicate files by default, preventing data loss? Yes, it does. However, HDFS doesn't protect against user errors, like the following command:
$ hdfs dfs -rm -r /important_data
Ouch! This will delete the files, no matter how many times they are replicated.
However, there is a Trash feature in HDFS. It works like the desktop trash can found on most operating systems. If the trash feature is enabled, deleted files can be salvaged from there.
Snapshots give us more than Trash. For starters, files in Trash are only kept for a certain period, controlled by the fs.trash.interval property in core-site.xml (specified in minutes).
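For reference, enabling trash is a one-property change in core-site.xml. The sketch below uses a 24-hour retention (1440 minutes), which is just an illustrative value:

<property>
  <name>fs.trash.interval</name>
  <!-- keep deleted files in the user's .Trash for 1440 minutes (24 hours); 0 disables trash -->
  <value>1440</value>
</property>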

Figure 5.1. Snapshots Explained

And the file system can be altered in more ways than just deleting files: contents, permissions, etc. can change. Snapshots capture this information and enable us to restore files at a later point in time.
We can snapshot the entire file system or specific directories. A directory has to be enabled for snapshotting by the HDFS administrator; this is called making a directory snapshottable. After that, normal users can snapshot their directories on their own without administrator intervention.
A directory with snapshots cannot be deleted or renamed until all of its snapshots are deleted.

5.2. Snapshotting Internals


HDFS snapshotting is pretty efficient. The actual physical data (i.e. the blocks on DataNodes) is NOT copied. Snapshotting only records the state of the metadata. Since dealing with metadata is fast, the snapshot operation is very fast; in computer science terms the cost is constant, O(1).
Snapshotting doesn't slow down regular HDFS operations.
HDFS uses a special directory named .snapshot for snapshot purposes. Hence '.snapshot' is a reserved name; users cannot create a file or directory named '.snapshot'.

HDFS Snapshots

5.3. Snapshotting Example


Let's walk through an example snapshotting operation to better understand the mechanics of the operation.
First let's create a directory we want to snapshot
$ hdfs dfs -mkdir /user/sujee/foo
Before we can snapshot a directory, the HDFS administrator has to enable snapshots on it. Assuming the HDFS administrator user is 'hdfs'...
$ sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/sujee/foo
Create an empty file 'bar' within this directory
$ hdfs dfs -touchz /user/sujee/foo/bar
Create a snapshot
$ hdfs dfs -createSnapshot /user/sujee/foo snap1
The first argument (/user/sujee/foo) to createSnapshot is the directory.
The second argument (snap1) is the (optional) name of the snapshot.
All snapshots are stored within the snapshot directory : /user/sujee/foo/.snapshot
Under the .snapshot directory, a directory named after the snapshot is created : /user/sujee/foo/.snapshot/snap1
And finally, all files are saved under this directory : /user/sujee/foo/.snapshot/snap1
We can use the following command to inspect the directory layout.
$ hdfs dfs -ls -R /user/sujee/foo/.snapshot

drwxr-xr-x   - sujee supergroup          0 2014-02-08 22:07 /user/sujee/foo/.snapshot/snap1
-rw-r--r--   1 sujee supergroup          0 2014-02-08 22:03 /user/sujee/foo/.snapshot/snap1/bar

Now let's delete the file 'bar'
$ hdfs dfs -rm /user/sujee/foo/bar
The file has now disappeared; verify this using the following command
$ hdfs dfs -ls /user/sujee/foo/
Time to restore the file from the snapshot
$ hdfs dfs -cp /user/sujee/foo/.snapshot/snap1/bar /user/sujee/foo/bar-restored
There we go, snapshot in action!
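To tidy up after this experiment, we can list, delete and disallow snapshots. A minimal sketch, assuming the same /user/sujee/foo directory and 'hdfs' admin user as above:

# list directories that are currently snapshottable
$ hdfs lsSnapshottableDir

# delete the snapshot we created earlier
$ hdfs dfs -deleteSnapshot /user/sujee/foo snap1

# disallow further snapshots on the directory (admin operation)
$ sudo -u hdfs hdfs dfsadmin -disallowSnapshot /user/sujee/foo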

5.4. References
Hortonworks guide [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-snapshots.html]
Apache Hadoop documentation [http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html]

Chapter 6. HDFS NFS Access


6.1. Accessing HDFS Files From Outside
HDFS is a user-land file system. It doesn't behave like other Linux file systems (e.g. ext3 or ext4), and it cannot simply be 'mounted'.

Figure 6.1. HDFS Is An Isolated File System

We use the HDFS command line utility (hdfs) to interact with HDFS. However, files residing in HDFS are not accessible to traditional Linux programs. Let's say we have a Linux executable 'my_awesome_program' that analyzes videos. If our video files are in HDFS, they won't be accessible to this program, so the following command would not work.
$ my_awesome_program < input/file/in/HDFS > output/file/in/HDFS
It would be nice to have files in HDFS accessible to other programs, so we don't need to rewrite those programs in Java just to access files in HDFS. The solution is to make HDFS available over the Network File System (NFS) protocol [http://en.wikipedia.org/wiki/Network_File_System]. NFS has been around for a long time, and lots of applications and operating systems know how to talk to NFS file systems.
The HDFS NFS Gateway allows HDFS to be mounted as an NFS file system.
HDFS NFS Gateway allows HDFS to be mounted as a NFS file system.

Figure 6.2. HDFS NFS

HDFS NFS Access

6.2. HDFS NFS Functionality


Browse HDFS
NFS clients will be able to browse HDFS file systems.
Download files from HDFS
Copy files from HDFS --> local
Upload files to HDFS
Copy files from local --> HDFS
Append to files
Clients can append to HDFS files
No random write to files
Since HDFS does NOT support random write to files, this feature is not available via NFS.

6.3. Setting up HDFS NFS


Follow this guide from Hortonworks [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-nfs.html]
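Once the gateway daemons are up, mounting HDFS from a Linux client might look like the sketch below; the gateway host (localhost) and the mount point /hdfs_mount are assumptions, so adjust them for your setup:

# create a local mount point (hypothetical path)
$ sudo mkdir -p /hdfs_mount

# mount the HDFS NFS gateway (the gateway supports NFSv3 over TCP, without locking)
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock localhost:/ /hdfs_mount

# HDFS now appears as a regular directory tree
$ ls /hdfs_mount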

6.4. References
Hortonworks guide to setting up HDFS NFS [http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_user-guide/content/user-guide-hdfs-nfs.html]
Blog post from Hortonworks explaining the NFS feature [http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/]
HDFS-4750 [https://issues.apache.org/jira/browse/HDFS-4750] - Original JIRA to implement NFS support in HDFS


Chapter 7. HDFS Federation


7.1. Case for HDFS Federation
HDFS has a single NameNode and many DataNodes. The storage (managed by DataNodes) scales horizontally, but the namespace (managed by a single NameNode) doesn't. This is particularly a problem for very large clusters (1000s of nodes).
Federation solves this by having multiple NameNodes manage the namespace.

Figure 7.1. HDFS Architecture

Figure 7.2. HDFS Federation (With Multiple NameNodes)


7.2. Benefits of Federation


Increasing File System Performance.
The NameNode is involved in pretty much all file system operations (creating files, reading / writing files, etc.). On a large cluster (1000s of nodes) a single NameNode can become a bottleneck for IO operations. Multiple NameNodes help scale the IO rate.
Isolation of File Systems.
Each NameNode can manage a separate namespace. For example:
the /user directory can be managed by NN1
the /hive directory can be managed by NN2
the /hbase directory can be managed by NN3
In this scenario different namespaces are managed by different NameNodes. This provides isolation of namespaces. For example, if NN2 is down, only the /hive directory is inaccessible; the other directories, /user and /hbase, remain available.
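One way to give clients a single view over such a split namespace is a ViewFs client-side mount table. The sketch below is illustrative only; the NameNode hostnames (nn1.example.com, etc.) are placeholders:

<!-- client-side core-site.xml : stitch federated namespaces into one view -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs:///</value>
</property>
<property>
  <!-- /user is served by the first NameNode -->
  <name>fs.viewfs.mounttable.default.link./user</name>
  <value>hdfs://nn1.example.com:8020/user</value>
</property>
<property>
  <!-- /hive is served by the second NameNode -->
  <name>fs.viewfs.mounttable.default.link./hive</name>
  <value>hdfs://nn2.example.com:8020/hive</value>
</property>
<property>
  <!-- /hbase is served by the third NameNode -->
  <name>fs.viewfs.mounttable.default.link./hbase</name>
  <value>hdfs://nn3.example.com:8020/hbase</value>
</property>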

7.3. Internals of Federation


Federation is implemented by grouping data blocks into pools. Each NameNode manages a namespace pool. Note that DataNodes are not segmented into pools: all NameNodes share all DataNodes, and a DataNode may host blocks belonging to multiple namespace pools.
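As a rough sketch (not a complete setup), declaring two federated NameNodes comes down to listing multiple nameservices in hdfs-site.xml; the hostnames below are placeholders:

<!-- hdfs-site.xml : two independent namespaces sharing the same DataNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <!-- NameNode serving the first namespace -->
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <!-- NameNode serving the second namespace -->
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>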

Figure 7.3. HDFS No Federation


Figure 7.4. HDFS Federation - Namespace Pools

7.4. References
Apache Hadoop documentation [https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/Federation.html]


Chapter 8. Hadoop 2 vs. Hadoop 1


This chapter details how Hadoop version 2 'looks and feels' different from version 1. To learn about the functional differences between version 1 and 2, see Chapter 3, Hadoop Versions.

8.1. Daemons
HDFS daemons are pretty much the same in Hadoop 1 and Hadoop 2. The biggest difference is that Hadoop 2 has YARN instead of MapReduce v1.

Table 8.1. Hadoop Daemons

HDFS (very little change)
  Hadoop 1: NameNode (master) [one per cluster]
            Secondary NameNode
            DataNode (worker) [many per cluster, one per node]
  Hadoop 2: NameNode (master) [one per cluster]
            Checkpoint Node (formerly Secondary NameNode)
            DataNode (worker) [many per cluster, one per node]

Processing
  Hadoop 1 (MapReduce v1): Job Tracker (master) [one per cluster]
                           Task Tracker (worker) [many per cluster, one per node]
  Hadoop 2 (YARN / MRv2):  Resource Manager [one per cluster]
                           Node Manager [many per cluster, one per node]
                           Application Master [many per cluster]

8.2. Web Interface Port Numbers


Table 8.2. Hadoop Web Interface Port Numbers

Daemon                                Hadoop 1   Hadoop 2
HDFS : NameNode (same)                50070      50070
MapReduce 1 : Job Tracker             50030      --
YARN : Resource Manager               --         8088
YARN : MapReduce Job History Server   --         19888


8.3. Directory Layout


Table 8.3. Hadoop Directory Layout

commands (hadoop, mapred, ...)
  Hadoop 1: 'HADOOP_INSTALL / bin' directory
  Hadoop 2: 'HADOOP_INSTALL / bin' directory

admin commands (start-* and stop-* scripts)
  Hadoop 1: 'HADOOP_INSTALL / bin' directory
  Hadoop 2: 'HADOOP_INSTALL / sbin' directory

configuration files
  Hadoop 1: 'HADOOP_INSTALL / conf' directory
  Hadoop 2: 'HADOOP_INSTALL / etc / hadoop' directory

jar files
  Hadoop 1: 'HADOOP_INSTALL / lib' directory
  Hadoop 2: 'HADOOP_INSTALL / share / hadoop' directory
            (jar files live in component-specific sub-directories: common, hdfs, mapreduce, yarn)

Note on jar files


In Hadoop v1 all dependency jars are in the lib dir. In v2 they are broken out into sub directories.
HADOOP_INSTALL/share/hadoop/common
HADOOP_INSTALL/share/hadoop/hdfs
HADOOP_INSTALL/share/hadoop/mapreduce
HADOOP_INSTALL/share/hadoop/tools
HADOOP_INSTALL/share/hadoop/yarn
Each of these directories also has a lib sub-folder for its own dependencies. This means there may be duplicate copies of some jars.
For example,
$ find hadoop-2 -name "log4j*.jar"
yields
hadoop-2/share/hadoop/common/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/hdfs/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/mapreduce/lib/log4j-1.2.17.jar
hadoop-2/share/hadoop/yarn/lib/log4j-1.2.17.jar
This is okay, as each component (hdfs, mapreduce, yarn) can now manage its dependencies cleanly.
The split of jar files means that we have to change our run scripts as well. Here is an example of running a Java program that connects to HDFS:
$ java -cp my.jar:$hadoop_home/share/hadoop/hdfs/*:$hadoop_home/share/hadoop/hdfs/lib/*:$hadoop_home/share/hadoop/common/*:$hadoop_home/share/hadoop/common/lib/* MyClass
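A less brittle alternative, assuming the hadoop command is on the PATH, is to let Hadoop compute its own classpath instead of listing the jar directories by hand (my.jar and MyClass are the same hypothetical names as above):

# 'hadoop classpath' prints the full client classpath for this installation
$ java -cp my.jar:$(hadoop classpath) MyClass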

8.4. Start / Stop Scripts


These are files used to start/stop Hadoop daemons:


Tar Package Version


Table 8.4. Start / Stop scripts for tar package version

to start HDFS
  Hadoop 1: $ HADOOP_INSTALL/bin/start-dfs.sh
            $ HADOOP_INSTALL/bin/hadoop-daemon.sh start namenode
  Hadoop 2: $ HADOOP_INSTALL/sbin/start-dfs.sh
            $ HADOOP_INSTALL/sbin/hadoop-daemon.sh start namenode

to start MapReduce
  Hadoop 1: $ HADOOP_INSTALL/bin/start-mapred.sh
  Hadoop 2: $ HADOOP_INSTALL/sbin/start-yarn.sh

to start everything
  Hadoop 1: $ HADOOP_INSTALL/bin/start-all.sh
  Hadoop 2: $ HADOOP_INSTALL/sbin/start-all.sh

RPM Version
Table 8.5. Start / Stop scripts for rpm package version

to start HDFS
  Hadoop 1:
    On the NameNode: sudo service hadoop-0.20-namenode start
    On DataNodes:    sudo service hadoop-0.20-datanode start
  Hadoop 2:
    On the NameNode: sudo service hadoop-hdfs-namenode start
    On DataNodes:    sudo service hadoop-hdfs-datanode start

to start MapReduce
  Hadoop 1:
    On the Job Tracker: sudo service hadoop-0.20-jobtracker start
    On Task Trackers:   sudo service hadoop-0.20-tasktracker start
  Hadoop 2:
    On the master (ResourceManager) node:
      sudo service hadoop-yarn-resourcemanager start
      sudo service hadoop-mapreduce-historyserver start
      sudo service hadoop-yarn-proxyserver start
    On worker nodes:
      sudo service hadoop-yarn-nodemanager start

8.5. Hadoop Command Split


In v1 a single bin/hadoop executable is used for file system operations, administration and MapReduce operations. In v2 these are handled by separate binaries.

Table 8.6. Hadoop Command Split

File system operations
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop dfs -ls
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs dfs -ls

NameNode operations
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop namenode -format
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs namenode -format

File system administration commands
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop dfsadmin -refreshNodes
  Hadoop 2: $ HADOOP_INSTALL/bin/hdfs dfsadmin -refreshNodes

MapReduce commands
  Hadoop 1: $ HADOOP_INSTALL/bin/hadoop job ...
  Hadoop 2: $ HADOOP_INSTALL/bin/mapred job ...

8.6. Configuration Files


Sample files: core-site.xml, hdfs-site.xml, mapred-site.xml
In Hadoop v1 the configuration files are in the HADOOP_HOME/conf directory. In Hadoop v2 they are
in the HADOOP_HOME/etc/hadoop directory. The files are all intact at their new location.

Table 8.7. Hadoop Configuration Files

File        Hadoop 1                              Hadoop 2
Core        HADOOP_INSTALL/conf/core-site.xml     HADOOP_INSTALL/etc/hadoop/core-site.xml
HDFS        HADOOP_INSTALL/conf/hdfs-site.xml     HADOOP_INSTALL/etc/hadoop/hdfs-site.xml
MapReduce   HADOOP_INSTALL/conf/mapred-site.xml   HADOOP_INSTALL/etc/hadoop/mapred-site.xml
YARN        --                                    HADOOP_INSTALL/etc/hadoop/yarn-site.xml


Chapter 9. Quick Start


9.1. Getting Hadoop 2 Running
The best way to experience Hadoop 2 is to install it and play with it. We will describe two ways of installing Hadoop 2, focusing on a single-node install.
Using the TAR file -- probably the quickest way to get started. See Chapter 10, Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File), for the complete steps.
Using a distribution virtual machine : using pre-built virtual machines from vendors.


Chapter 10. Running Hadoop 2 (YARN + MapReduce) on a Single Node (Using Tar File)

10.1. High Level Steps
These steps apply to Linux or Mac OS X. They may work on Windows if you have Cygwin installed, but this has not been tested. If you are using Windows, the recommended approach is to get a virtual machine running Linux and install Hadoop on that.
These are the steps we will go through to get Hadoop 2 installed. You may skip a step if your environment already has it (e.g. Java is already installed):
Install Java / JDK
Set up password-less SSH
Get the Hadoop 2 tar distribution and unpack it
Make the Hadoop commands available in PATH
Configure HDFS
Format HDFS storage
Start HDFS and verify it
Configure YARN
Start YARN
Run a sample MapReduce job

10.2. Installing Java / JDK


Hadoop 2 can work with Java version 6 or 7. You can get Java from the Oracle Java site [http://www.oracle.com/technetwork/java/javase/downloads/index.html]. Download the JDK (Java Development Kit) rather than the JRE (Java Runtime Environment); the JDK includes development tools you may end up using.
Installation instructions for Java vary widely by operating system (Windows, Linux, Mac). Follow the instructions from the Java download site. For brevity we are going to leave installing Java for your platform up to you. The rest of the guide assumes you have Java installed and ready to go.
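A quick sanity check, assuming java is already on your PATH:

# any 1.6 or 1.7 JDK will do for Hadoop 2.2.0
$ java -version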

10.3. Setting up password-less SSH


This is a convenience step that lets us use the start / stop scripts bundled with Hadoop.
Generate an SSH key pair. Skip this step if you already have SSH keys. You can verify whether you have keys
using the following command

$ ls -l $HOME/.ssh/

$ ls -l $HOME/.ssh/
total 304
-rw-r--r--  1 sujee  staff   815B Jan 10 22:44 authorized_keys
-rw-------  1 sujee  staff   1.6K Apr 21  2010 id_rsa
-rw-r--r--  1 sujee  staff   403B Apr 21  2010 id_rsa.pub

If you see the files id_rsa or id_dsa, you already have SSH keys and don't have to generate them again.
If not, let's generate SSH keys
$ ssh-keygen
To accept the default values, press 'Enter' a couple of times. You will see output like this.

$ ssh-keygen
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
a9:63:a1:4e:9e:12:f7:08:a7:ab:bf:3f:a7:f8:6a:0c sujee@melbourne-2.local
The key's randomart image is:
+--[ RSA 2048]----+
| (randomart image omitted) |
+-----------------+

Now let's add the key to the 'authorized_keys' file so we can log in without entering a password
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
Try logging into the machine
$ ssh localhost
The very first time, there may be an 'are you sure' prompt. Just say yes. After that you can SSH into localhost without entering any password.


10.4. Get Hadoop 2 tar ball


Hadoop tar distributions are hosted on the Apache Hadoop site and its mirrors. The Hadoop releases page [http://hadoop.apache.org/releases.html] lists the available versions. We are looking for the 2.x series. At this time, the stable 2.x release is version 2.2.0, which we will use. If you download a different version from the 2.x series, adjust the setup instructions below.
Download the release from the nearest Apache mirror [http://www.apache.org/dyn/closer.cgi/hadoop/common/].
The following command downloads the Hadoop distribution using wget. Adjust the download location based on the Apache mirror closest to you.
$ wget http://download.nextag.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ tar xvf hadoop-2.2.0.tar.gz
This command creates a convenient 'hadoop2' symlink
$ ln -s hadoop-2.2.0 hadoop2
From now on we will refer to wherever Hadoop is installed as '/hadoop_install_location'. For example, on my Mac this is /Users/sujee/hadoop2

10.5. Hadoop Commands


In Hadoop version 1 all command line utilities were in the 'hadoop install dir'/bin directory. In Hadoop 2, commands live in two directories.
bin : the most used commands, like hadoop, hdfs and mapred
sbin : administrative commands, like start-all.sh (which starts all Hadoop daemons) and stop-all.sh
We can invoke these commands using full path to them:
$ /hadoop_install_location/bin/hadoop
$ /hadoop_install_location/sbin/start-all.sh
Typing the full path repeatedly can get pretty tiresome. A convenient alternative is to add the hadoop2/bin and hadoop2/sbin directories to our PATH environment variable.
For temporary usage in one terminal, do this:
$ export PATH=$PATH:/hadoop_install_location/bin:/hadoop_install_location/sbin
After that, the hadoop commands will be available for that terminal session.
To add these paths permanently to PATH, we can add them to our shell start-up file. If you are using the BASH shell, add the following line to the $HOME/.bashrc file.
export PATH=$PATH:/hadoop_install_location/bin:/hadoop_install_location/sbin
Save the file and the changes will be visible to new bash shells (e.g. when opening a new terminal).


10.6. Configuring Hadoop 2 HDFS


Configuration files are located in the etc/hadoop directory within the Hadoop install directory (/hadoop_install_location/etc/hadoop). This is different from Hadoop 1, where configuration files were located in the 'conf' directory.

hadoop-env.sh (within etc/hadoop/)


This file defines environment variables used by Hadoop. We only need to set the JAVA_HOME variable here.
JAVA_HOME will vary depending on your system setup. Here are some sample values.
On Linux : /usr/java/latest
On Mac : /Library/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home
export JAVA_HOME=/usr/java/latest

core-site.xml (within etc/hadoop/)


This file has 'core' generic properties for the cluster.
core-site.xml (on github) [https://gist.github.com/sujee/8547538].
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <description>
      this replaces the Hadoop 1 property : fs.default.name
    </description>
  </property>

  <property>
    <!-- TODO : modify this to match your environment -->
    <name>hadoop.tmp.dir</name>
    <value>/Users/sujee/hadoop2-data</value>
    <description>
      This is the 'root' directory for Hadoop (HDFS and MapReduce)
    </description>
  </property>

</configuration>
For a single-node cluster it is convenient to set hadoop.tmp.dir rather than setting multiple directories for various properties. Both HDFS and MapReduce will store their data under this directory.

hdfs-site.xml (within etc/hadoop/)


This is where we define HDFS-specific properties. In this case it is pretty simple; we are just setting the replication factor to 1.

hdfs-site.xml (on github) [https://gist.github.com/sujee/8549067].

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>single node, 1 copy</description>
  </property>

</configuration>

10.7. Formatting HDFS Storage


Before using HDFS, we need to format the NameNode directory.
$ hdfs namenode -format
or using full path
$ /hadoop_install_location/bin/hdfs namenode -format
The output might look like...

$ bin/hdfs namenode -format


14/01/21 13:36:11 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
....
....
14/01/21 13:36:12 INFO common.Storage: Storage directory /hadoop/data/hadoop-2/dfs/name has
14/01/21 13:36:12 INFO namenode.FSImage: Saving image file /hadoop/data/hadoop-2/dfs/name/c
14/01/21 13:36:12 INFO namenode.FSImage: Image file /hadoop/data/hadoop-2/dfs/name/current/
14/01/21 13:36:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with tx
14/01/21 13:36:12 INFO util.ExitUtil: Exiting with status 0
14/01/21 13:36:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at melbourne-2.local/10.102.36.146
************************************************************/

What we are looking for is the message that the storage directory 'has been successfully formatted'. All looks good.

10.8. Starting HDFS


Check the etc/hadoop/slaves file. Make sure it has only one line
localhost

$ cat /hadoop_install_location/etc/hadoop/slaves
localhost


We are going to use the sbin/start-dfs.sh script. This script does the following:
Starts the NameNode on the current node
Starts the Secondary NameNode (if necessary) on the machines listed in the masters file (etc/hadoop/masters)
Starts DataNodes on the nodes listed in the slaves file (etc/hadoop/slaves)
For this to work, password-less SSH has to work (this is why we set up SSH before)
$ start-dfs.sh
or using full path
$ /hadoop_install_location/sbin/start-dfs.sh

$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-sujee-nam
localhost: starting datanode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-sujee-dat
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /hadoop/hadoop-2.2.0/logs/hadoop-su

10.9. Verifying that HDFS is running


If start-dfs.sh didn't error out, most likely the HDFS daemons are up and running.

Verifying on command-line
We will use the jps command to list running Java processes (jps is part of the JDK).
$ jps
30438 Jps
30375 SecondaryNameNode
30226 NameNode
30291 DataNode
Looks good; we have all the HDFS daemons running on this node.
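An optional extra check is to ask the NameNode for a cluster report; on a healthy single-node setup it should list one live DataNode:

# prints capacity, used space and the list of live DataNodes
$ hdfs dfsadmin -report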

Verifying HDFS WebUI


Access the HDFS Web UI at http://localhost:50070 in a browser. Verify that the 'Live Nodes' count is 1.
See the screenshot below


Figure 10.1. Namenode UI

10.10. Configuring YARN / MapReduce


Now that we have HDFS working, let's configure and run YARN.

yarn-site.xml (etc/hadoop)
yarn-site.xml (on GitHub) [https://gist.github.com/sujee/8551096]
<?xml version="1.0"?>
<configuration>

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!--
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  -->

</configuration>

mapred-site.xml (etc/hadoop)
mapred-site.xml (on GitHub) [https://gist.github.com/sujee/8551273]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

</configuration>

10.11. Starting YARN


Use the sbin/start-yarn.sh script
$ start-yarn.sh
or using full path
$ /hadoop_install_location/sbin/start-yarn.sh

10.12. History Server


The history server lets us browse the stats of completed jobs.
$ /hadoop_install_location/sbin/mr-jobhistory-daemon.sh start historyserver
Access the Job History Server Web UI at http://localhost:19888 in a browser

10.13. Verifying YARN daemons


Command line check
$ jps
33780 ResourceManager
33858 NodeManager
30375 SecondaryNameNode
30226 NameNode
30291 DataNode
30297 JobHistoryServer
34928 Jps

We see the two YARN daemons, ResourceManager and NodeManager, along with the JobHistoryServer. All good!
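As an optional extra check, the ResourceManager can be asked directly which nodes have registered; on this setup it should list a single NodeManager:

# lists NodeManagers registered with the ResourceManager
$ yarn node -list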

YARN Web UI
Access the YARN Web UI at http://localhost:8088 in a browser. Verify that 'Active Nodes' is 1.

Figure 10.2. YARN UI

10.14. Final Test -- Running a MapReduce job


Now that we have HDFS and YARN running, let's run a MapReduce job to see if the whole cluster is functional. We will be using the Grep MapReduce example that ships with Hadoop for this purpose.

Copy some files into HDFS


Let's use the Hadoop configuration XML files as input for grep
Create an input directory in HDFS to hold the files
$ hdfs dfs -mkdir -p /grep/in
or using full path
$ /hadoop_install_location/bin/hdfs dfs -mkdir -p /grep/in
Copy a bunch of files into /grep/in directory
$ hdfs dfs -put /hadoop_install_location/etc/hadoop/* /grep/in/
or using full path
$ /hadoop_install_location/bin/hdfs dfs -put /hadoop_install_location/etc/hadoop/* /grep/in/

Run Grep MapReduce job


$ /hadoop_install_location/bin/hadoop jar /hadoop_install_location/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar grep /grep/in/ /grep/out '(xml|name|dfs)'
Here we are asking grep to look for three patterns: xml, name and dfs, using standard regular expression syntax. This command runs two MapReduce jobs in sequence. The output may look like this

...
14/01/21 17:05:25 INFO mapreduce.Job: Running job: job_1390348071644_0005
14/01/21 17:05:32 INFO mapreduce.Job:  map 0% reduce 0%
14/01/21 17:05:50 INFO mapreduce.Job:  map 12% reduce 0%
...
14/01/21 17:07:02 INFO mapreduce.Job: Job job_1390348071644_0005 completed successfully
...
14/01/21 17:07:03 INFO mapreduce.Job: Running job: job_1390348071644_0006
14/01/21 17:07:16 INFO mapreduce.Job:  map 0% reduce 0%
14/01/21 17:07:21 INFO mapreduce.Job:  map 100% reduce 0%
14/01/21 17:07:27 INFO mapreduce.Job:  map 100% reduce 100%
14/01/21 17:07:28 INFO mapreduce.Job: Job job_1390348071644_0006 completed successfully
...

Once the MR jobs have completed successfully, let's look at the output


$ hdfs dfs -cat /grep/out/*
or using full path
$ /hadoop_install_location/bin/hdfs dfs -cat /grep/out/*
The output might look like the following. As you can see, Grep counts the number of instances of each pattern and sorts them by frequency; the second MapReduce job does the sorting.

167 name
32 dfs
21 xml

Let's check the YARN UI (see the screenshot below). We see that the two MapReduce jobs (grep-search and grep-sort) ran and finished successfully.

Figure 10.3. YARN Running Jobs

All looking good

10.15. Shutting down the cluster


First we will shut down YARN (including the history server), then we will shut down HDFS. We will use the stop-yarn.sh and stop-dfs.sh scripts located in the /hadoop_install_location/sbin directory.

$ /hadoop_install_location/sbin/mr-jobhistory-daemon.sh stop historyserver
$ stop-yarn.sh
or using full path
$ /hadoop_install_location/sbin/stop-yarn.sh
$ stop-dfs.sh
or using full path
$ /hadoop_install_location/sbin/stop-dfs.sh

10.16. References
Cloudera 4.x YARN guide [http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Installation-Guide/cdh4ig_topic_11_4.html]


Chapter 11. HDFS Benchmarking in Hadoop 2

We will look at two benchmarks for HDFS. Running these benchmarks is a little different in Hadoop 2.

11.1. TestDFSIO
This benchmark exercises HDFS and measures IO throughput. TestDFSIO uses MapReduce to read / write
files in parallel.
To run this test we need both the HDFS and YARN daemons running.
$ hadoop org.apache.hadoop.fs.TestDFSIO -write -nrFiles 5 -fileSize 1GB
The parameters are as follows:
-write : write files
-nrFiles : the number of files
-fileSize : the size of each file; you can use suffixes like MB or GB
This would run a MapReduce job. The output might look like this
14/02/14 00:02:57 INFO fs.TestDFSIO: TestDFSIO.1.7
14/02/14 00:02:57 INFO fs.TestDFSIO: nrFiles = 5
14/02/14 00:02:57 INFO fs.TestDFSIO: nrBytes (MB) = 1024.0
14/02/14 00:02:57 INFO fs.TestDFSIO: bufferSize = 1000000
14/02/14 00:02:57 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
...
14/02/14 00:03:00 INFO mapreduce.Job: Running job: job_1392364939455_0001
14/02/14 00:03:09 INFO mapreduce.Job:  map 0% reduce 0%
14/02/14 00:03:35 INFO mapreduce.Job:  map 40% reduce 0%
...
14/02/14 00:05:36 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
14/02/14 00:05:36 INFO fs.TestDFSIO:           Date and time: Fri Feb 14 00:05:36
14/02/14 00:05:36 INFO fs.TestDFSIO:         Number of files: 5
14/02/14 00:05:36 INFO fs.TestDFSIO:  Total MBytes processed: 5120.0
14/02/14 00:05:36 INFO fs.TestDFSIO:       Throughput mb/sec: 8.957344147460278
14/02/14 00:05:36 INFO fs.TestDFSIO:  Average IO rate mb/sec: 8.959185600280762
14/02/14 00:05:36 INFO fs.TestDFSIO:   IO rate std deviation: 0.12878510722494685
14/02/14 00:05:36 INFO fs.TestDFSIO:      Test exec time sec: 157.34
14/02/14 00:05:36 INFO fs.TestDFSIO:
...

There is also a read version of the command


$ hadoop org.apache.hadoop.fs.TestDFSIO -read -nrFiles 5 -fileSize 1GB
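When benchmarking is done, the test data left under /benchmarks/TestDFSIO can be removed with the -clean option; a small cleanup sketch:

# delete the TestDFSIO benchmark files from HDFS
$ hadoop org.apache.hadoop.fs.TestDFSIO -clean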

11.2. Terasort
Terasort is the ultimate IO benchmark for Hadoop. The test comprises three components.


TeraGen : generates a bunch of random data
Terasort : sorts the data
TeraValidate : verifies that the data is indeed sorted
1 TB of data (the amount Terasort is named after) looks like the following:
Each record is 100 bytes long
10 billion records
1 TB of data (10 billion records x 100 bytes)
Terasort allows us to generate various sizes of data. Here is a cheat table for generating the desired amount of data.

Table 11.1. Terasort

Data Size   Number of records (data size / 100 bytes per record)
1 GB        10 million records (10,000,000)
10 GB       100 million records (100,000,000)
100 GB      1 billion records (1,000,000,000)
1 TB        10 billion records (10,000,000,000)

Let's generate 10 GB of data. We will supply the following arguments to TeraGen:

teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

Table 11.2. TeraGen arguments

-D mapreduce.job.maps=20
  20 mappers. Each mapper will create one file, so 10 GB divided across 20 maps yields a 500 MB file each.

-D dfs.blocksize=536870912
  We are specifying the block size to be 512 MB. By specifying the number of mappers we are creating files of 500 MB each, and we don't want these files split up into blocks by the default block size of 128 MB; we want to keep each file as one block. Hence we bump the block size up to 512 MB (larger than an individual file). A map task will sort a block, and if the block size is too small the mappers finish very quickly, especially on modern hardware. That is why we set the block size higher.

100000000
  Number of records. See the table above for the calculation specifics.

input_dir
  This is where the generated data will end up.

Now let's actually generate the 10 GB of data. How we invoke the TeraGen command depends on our Hadoop installation; you may need to adjust the command to match your environment.

option 1) for tar based installs...

$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

option 2) RPM / package based Hadoop distribution

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -D mapreduce.job.maps=20 -D dfs.blocksize=536870912 100000000 input_dir

Now it is time to sort the data. Here terasort reads the TeraGen output from input_dir and writes the sorted data to terasort_out.

option 1) for tar based installs...

$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar terasort -D mapreduce.job.reduces=20 input_dir terasort_out

option 2) RPM / package based Hadoop distribution

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -D mapreduce.job.reduces=20 input_dir terasort_out
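Finally, TeraValidate checks that the output really is sorted. A sketch for the tar based install is below; validate_out is just a placeholder directory for the validation report, and the RPM invocation would use the examples jar path shown above.

# verify that terasort_out is globally sorted; a report is written to validate_out
$ hadoop jar HADOOP_INSTALL_PATH/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teravalidate terasort_out validate_out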

11.3. References
Michael Noll has an excellent tutorial on TestDFSIO and Terasort [http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/] (Hadoop v1 version)
