
Hadoop Distributed File System(HDFS)

Prepared By: Suresh Pathipati

Table of contents
1 Introduction to HDFS
2 Assumptions and Goals
2.1 Hardware Failure
2.2 Streaming Data Access
2.3 Large Data Sets
2.4 Simple Coherency Model
2.5 Moving Computation is Cheaper than Moving Data
2.6 Portability Across Heterogeneous Hardware and Software Platforms
3 Building Blocks of Hadoop
3.1 Hadoop Modes
3.2 Types of Nodes & Trackers
4 The File System Namespace
5 Data Replication
5.1 Replica Placement: The First Baby Steps
5.2 Replica Selection
5.3 Safe Mode
6 The Persistence of File System Metadata
7 The Communication Protocols
8 Robustness
8.1 Data Disk Failure, Heartbeats and Re-Replication
8.2 Cluster Rebalancing
8.3 Data Integrity
8.4 Metadata Disk Failure
8.5 Snapshots
9 Data Organization
9.1 Data Blocks
9.2 Staging
9.3 Replication Pipelining
9.4 Read and Write Files in HDFS
9.5 Rack Awareness
10 Accessibility
10.1 FS Shell
10.2 DFSAdmin
10.3 Browser Interface
11 Space Reclamation
11.1 File Deletes and Undeletes
11.2 Decrease Replication Factor
11.3 Hadoop Configuration Files
11.4 Hadoop Name Node & Data Node Web Interface
12 HDFS Commands

1. Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-
throughput access to application data and is suitable for applications that have large data
sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
HDFS is part of the Apache Hadoop Core project.

2. Assumptions and Goals

2.1. Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of
hundreds or thousands of server machines, each storing part of the file system’s data. The
fact that there are a huge number of components and that each component has a non-trivial
probability of failure means that some component of HDFS is always non-functional.

Therefore, detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.

2.2. Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not
general purpose applications that typically run on general purpose file systems. HDFS is
designed more for batch processing rather than interactive use by users. The emphasis is on
high throughput of data access rather than low latency of data access. POSIX imposes many
hard requirements that are not needed for applications that are targeted for HDFS.
POSIX semantics in a few key areas have been traded to increase data throughput rates.

2.3. Large Data Sets


Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate
data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.

2.4. Simple Coherency Model


HDFS applications need a write-once-read-many access model for files. A file once created,
written, and closed need not be changed. This assumption simplifies data coherency issues
and enables high throughput data access. A Map/Reduce application or a web crawler
application fits perfectly with this model. There is a plan to support appending-writes to
files in the future.

2.5. “Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the
data it operates on. This is especially true when the size of the data set is huge. This
minimizes network congestion and increases the overall throughput of the system. The
assumption is that it is often better to migrate the computation closer to where the data is
located rather than moving the data to where the application is running. HDFS provides
interfaces for applications to move themselves closer to where the data is located.

2.6. Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.

3. Building Blocks of Hadoop

In a fully configured cluster, “running Hadoop” means running a set of daemons, or
resident programs, on the different servers in your network. These daemons have
specific roles; some exist only on one server, some exist across multiple servers.

3.1 Hadoop Modes

3.2 Types of Nodes & Trackers

Name Node

The most vital of the Hadoop daemons is the Name Node. Hadoop employs a master/slave
architecture for both distributed storage and distributed computation. The distributed
storage system is called the Hadoop Distributed File System, or HDFS. The Name Node is the
master of HDFS and directs the slave Data Node daemons to perform the low-level I/O tasks.

The Name Node is the bookkeeper of HDFS; it keeps track of how your files are broken
down into file blocks, which nodes store those blocks, and the overall health of the
distributed file system.

The function of the Name Node is memory and I/O intensive. As such, the server hosting the
Name Node typically doesn’t store any user data or perform any computations for a Map
Reduce program, in order to lower the workload on the machine.

Name Node Metadata

Secondary Name Node

The Secondary Name Node (SNN) is an assistant daemon for monitoring the state of HDFS in
the cluster. Like the Name Node, each cluster has one SNN, and it typically resides on its
own machine as well. No other Data Node or Task Tracker daemons run on the same server.

The SNN differs from the Name Node in that this process doesn’t receive or record any real-
time changes to HDFS. Instead, it communicates with the Name Node to take snapshots of
the HDFS metadata at intervals defined by the cluster configuration.
As mentioned earlier, the Name Node is a single point of failure for a Hadoop cluster, and
the SNN snapshots help minimize the downtime and loss of data.

Secondary Name node – a misnomer

Check pointing process of Name Node

Name node Recovery

Multiple Name nodes/Namespaces

In order to scale the name service horizontally, federation uses multiple independent
Namenodes/namespaces. The Namenodes are federated, that is, the Namenodes are
independent and don’t require coordination with each other. The datanodes are used as
common storage for blocks by all the Namenodes. Each datanode registers with all the
Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and
handle commands from the Namenodes.

Block Pool

A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks for
all the block pools in the cluster. It is managed independently of other block pools. This
allows a namespace to generate Block IDs for new blocks without the need for coordination
with the other namespaces. The failure of a Namenode does not prevent the datanode from
serving other Namenodes in the cluster.

A Namespace and its block pool together are called Namespace Volume. It is a self-contained
unit of management. When a Namenode/namespace is deleted, the corresponding block
pool at the datanodes is deleted. Each namespace volume is upgraded as a unit, during
cluster upgrade.

ClusterID

A new identifier, ClusterID, is added to identify all the nodes in the cluster. When a
Namenode is formatted, this identifier is either provided or auto-generated. This ID should
be used for formatting the other Namenodes in the cluster.

Key Benefits

 Namespace Scalability - HDFS cluster storage scales horizontally but the namespace
does not. Large deployments, or deployments using a lot of small files, benefit from
scaling the namespace by adding more Namenodes to the cluster.
 Performance - File system operation throughput is limited by a single Namenode in
the prior architecture. Adding more Namenodes to the cluster scales the file system
read/write operations throughput.
 Isolation - A single Namenode offers no isolation in a multi-user environment. An
experimental application can overload the Namenode and slow down production
critical applications. With multiple Namenodes, different categories of applications
and users can be isolated to different namespaces.

Checkpoint Node

The Name Node persists its namespace using two files: fsimage, which is the latest checkpoint
of the namespace, and edits, a journal (log) of changes to the namespace since the checkpoint.
When a Name Node starts up, it merges the fsimage and edits journal to provide an up-to-
date view of the file system metadata. The Name Node then overwrites fsimage with the
new HDFS state and begins a new edits journal.

The Checkpoint node periodically creates checkpoints of the namespace. It downloads
fsimage and edits from the active Name Node, merges them locally, and uploads the new
image back to the active Name Node. The Checkpoint node usually runs on a different
machine than the Name Node since its memory requirements are on the same order as the
Name Node. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node
specified in the configuration file.

The location of the Checkpoint (or Backup) node and its accompanying web interface are
configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address
configuration variables.

The start of the checkpoint process on the Checkpoint node is controlled by two
configuration parameters.

 dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum
delay between two consecutive checkpoints.
 dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of
uncheckpointed transactions on the NameNode which will force an urgent
checkpoint, even if the checkpoint period has not been reached.

The Checkpoint node stores the latest checkpoint in a directory that is structured the same
as the NameNode's directory. This allows the checkpointed image to be always available for
reading by the NameNode if necessary. See Import checkpoint.

Multiple checkpoint nodes may be specified in the cluster configuration file.
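
As a rough illustration, the sketch below reads these two checkpoint settings through the Hadoop Configuration API. It assumes hdfs-site.xml is available on the client classpath and that the defaults match the values quoted above; it is not part of the checkpoint process itself.

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        // Load the cluster configuration (assumption: hdfs-site.xml is on the classpath).
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        // Trigger 1: maximum delay between two consecutive checkpoints, in seconds (1 hour default).
        long periodSecs = conf.getLong("dfs.namenode.checkpoint.period", 3600L);
        // Trigger 2: number of uncheckpointed transactions that forces an urgent checkpoint.
        long txnLimit = conf.getLong("dfs.namenode.checkpoint.txns", 1000000L);
        System.out.println("Checkpoint period (s): " + periodSecs);
        System.out.println("Uncheckpointed txn limit: " + txnLimit);
    }
}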

Backup Node

The Backup node provides the same checkpointing functionality as the Checkpoint node, as
well as maintaining an in-memory, up-to-date copy of the file system namespace that is
always synchronized with the active Name Node state. Along with accepting a journal
stream of file system edits from the Name Node and persisting this to disk, the Backup node
also applies those edits into its own copy of the namespace in memory, thus creating a
backup of the namespace.

The Backup node does not need to download fsimage and edits files from the active Name
Node in order to create a checkpoint, as would be required with a Checkpoint node or
Secondary Name Node, since it already has an up-to-date view of the namespace in
memory. The Backup node checkpoint process is more efficient as it only needs to save the
namespace into the local fsimage file and reset edits.

As the Backup node maintains a copy of the namespace in memory, its RAM requirements
are the same as the Name Node.

The Name Node supports one Backup node at a time. No Checkpoint nodes may be
registered if a Backup node is in use. Using multiple Backup nodes concurrently will be
supported in the future.

The Backup node is configured in the same manner as the Checkpoint node. It is started
with bin/hdfs namenode -backup.

The location of the Backup (or Checkpoint) node and its accompanying web interface are
configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address
configuration variables.

Use of a Backup node provides the option of running the Name Node with no persistent
storage, delegating all responsibility for persisting the state of the namespace to the Backup
node. To do this, start the Name Node with the -importCheckpoint option, along with
specifying no persistent storage directories of type edits (dfs.namenode.edits.dir) for the
NameNode configuration.

Data Node
Each slave machine in your cluster will host a Data Node daemon to perform the grunt work
of the distributed file system—reading and writing HDFS blocks to actual files on the local
file system. When you want to read or write an HDFS file, the file is broken into blocks and
the Name Node will tell your client which Data Node each block resides in.
Your client communicates directly with the Data Node daemons to process the local
files corresponding to the blocks. Furthermore, a Data Node may communicate with other
Data Nodes to replicate its data blocks for redundancy.

Trackers

Job Tracker
The Job Tracker daemon is the liaison between your application and Hadoop. Once you
submit your code to your cluster, the Job Tracker determines the execution plan by
determining which files to process, assigns nodes to different tasks, and monitors all tasks
as they’re running.
Should a task fail, the Job Tracker will automatically re-launch the task, possibly on a
different node, up to a predefined limit of retries. There is only one Job Tracker daemon per
Hadoop cluster. It’s typically run on a server as a master node of the cluster.

Task Tracker

As with the storage daemons, the computing daemons also follow a master/slave
architecture: the Job Tracker is the master overseeing the overall execution of a Map Reduce
job and the Task Trackers manage the execution of individual tasks on each slave node.
Each Task Tracker is responsible for executing the individual tasks that the Job Tracker
assigns. Although there is a single Task Tracker per slave node, each Task Tracker can
spawn multiple JVMs to handle many map or reduce tasks in parallel. One responsibility of
the Task Tracker is to constantly communicate with the Job Tracker. If the Job Tracker fails
to receive a heartbeat from a Task Tracker within a specified amount of time, it will assume
the Task Tracker has crashed and will resubmit the corresponding tasks to other nodes in
the cluster.

4. The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can
create directories and store files inside these directories. The file system namespace
hierarchy is similar to most other existing file systems; one can create and remove files,
move a file from one directory to another, or rename a file. HDFS does not yet implement
user quotas or access permissions. HDFS does not support hard links or soft links. However,
the HDFS architecture does not preclude implementing these features.

The Name Node maintains the file system namespace. Any change to the file system
namespace or its properties is recorded by the Name Node. An application can specify the
number of replicas of a file that should be maintained by HDFS. The number of copies of a
file is called the replication factor of that file. This information is stored by the NameNode.
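
These namespace operations are exposed to clients through the Hadoop FileSystem API. Below is a minimal sketch, assuming fs.defaultFS points at the cluster; the /user/demo paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/demo/staging");    // hypothetical paths
        Path renamed = new Path("/user/demo/archive");
        // Creating and renaming directories are pure metadata operations,
        // recorded by the Name Node in its namespace.
        fs.mkdirs(dir);
        fs.rename(dir, renamed);
        System.out.println("Exists after rename: " + fs.exists(renamed));
        fs.close();
    }
}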

5. Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor
are configurable per file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed later. Files in
HDFS are write-once and have strictly one writer at any time.

The Name Node makes all decisions regarding replication of blocks. It periodically receives
a Heartbeat and a Blockreport from each of the Data Nodes in the cluster. Receipt of a
Heartbeat implies that the Data Node is functioning properly. A Blockreport contains a list
of all blocks on a Data Node.
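
Because the replication factor is per-file metadata, a client can read and change it through the FileSystem API. The following is an illustrative sketch only; the /user/demo path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log"); // hypothetical path
        // Ask the Name Node to keep 2 replicas of this file's blocks; the change is
        // recorded in the namespace and applied asynchronously by the Name Node.
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);
        // The current replication factor is part of the file's metadata.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);
        fs.close();
    }
}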

5.1. Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica
placement distinguishes HDFS from most other distributed file systems. This is a feature
that needs lots of tuning and experience. The purpose of a rack-aware replica placement
policy is to improve data reliability, availability, and network bandwidth utilization. The
current implementation for the replica placement policy is a first effort in this direction. The
short-term goals of implementing this policy are to validate it on production systems, learn
more about its behavior, and build a foundation to test and research more sophisticated
policies.

Large HDFS instances run on a cluster of computers that commonly spread across many
racks. Communication between two nodes in different racks has to go through switches. In
most cases, network bandwidth between machines in the same rack is greater than network
bandwidth between machines in different racks.
The Name Node determines the rack id each Data Node belongs to via the process outlined
in Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks.
This prevents losing data when an entire rack fails and allows use of bandwidth from
multiple racks when reading data. This policy evenly distributes replicas in the cluster
which makes it easy to balance load on component failure. However, this policy increases
the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put
one replica on one node in the local rack, another on a different node in the local rack, and
the last on a different node in a different rack. This policy cuts the inter-rack write traffic
which generally improves write performance. The chance of rack failure is far less than that
of node failure; this policy does not impact data reliability and availability guarantees.
However, it does reduce the aggregate network bandwidth used when reading data since a
block is placed in only two unique racks rather than three. With this policy, the replicas of a
file do not evenly distribute across the racks. One third of replicas are on one node, two
thirds of replicas are on one rack, and the other third are evenly distributed across the
remaining racks. This policy improves write performance without compromising data
reliability or read performance.

The current, default replica placement policy described here is a work in progress.

5.2. Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read
request from a replica that is closest to the reader. If there exists a replica on the same rack
as the reader node, then that replica is preferred to satisfy the read request. If an HDFS
cluster spans multiple data centers, then a replica that is resident in the local data center is
preferred over any remote replica.

5.3. Safe mode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks
does not occur when the Name Node is in the Safemode state. The Name Node receives
Heartbeat and Blockreport messages from the Data Nodes. A Blockreport contains the list of
data blocks that a DataNode is hosting. Each block has a specified minimum number of
replicas. A block is considered safely replicated when the minimum number of replicas of
that data block has checked in with the NameNode. After a configurable percentage of safely
replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the
NameNode exits the Safemode state. It then determines the list of data blocks (if any) that
still have fewer than the specified number of replicas. The NameNode then replicates these
blocks to other DataNodes.

6. The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log
called the EditLog to persistently record every change that occurs to file system metadata.
For example, creating a new file in HDFS causes the NameNode to insert a record into the
EditLog indicating this. Similarly, changing the replication factor of a file causes a new
record to be inserted into the EditLog. The NameNode uses a file in its local host OS file
system to store the EditLog. The entire file system namespace, including the mapping of
blocks to files and file system properties, is stored in a file called the FsImage. The FsImage
is stored as a file in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in
memory. This key metadata item is designed to be compact, such that a NameNode with 4
GB of RAM is plenty to support a huge number of files and directories. When the
NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions
from the EditLog to the in-memory representation of the FsImage, and flushes out this new
version into a new FsImage on disk. It can then truncate the old EditLog because its
transactions have been applied to the persistent FsImage. This process is called a
checkpoint.
In the current implementation, a checkpoint only occurs when the Name Node starts up.
Work is in progress to support periodic checkpointing in the near future.

The Data Node stores HDFS data in files in its local file system. The Data Node has no
knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local
file system. The Data Node does not create all files in the same directory. Instead, it uses a
heuristic to determine the optimal number of files per directory and creates subdirectories
appropriately. It is not optimal to create all local files in the same directory because the local
file system might not be able to efficiently support a huge number of files in a single
directory. When a Data Node starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files and sends this report to
the NameNode: this is the Blockreport.

7. The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client
establishes a connection to a configurable TCP port on the NameNode machine. It talks the
Client Protocol with the Name Node. The Data Nodes talk to the Name Node using the
Data Node Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client
Protocol and the Data Node Protocol. By design, the Name Node never initiates any RPCs.
Instead, it only responds to RPC requests issued by Data Nodes or clients.

HDFS is Hadoop’s File System. It is a distributed file system in that it uses a multitude of
machines to implement its functionality. Contrast that with NTFS, FAT32, ext3 etc. which
are all single machine filesystems.
HDFS is architected such that the metadata, i.e. the information about file names,
directories, permissions, etc. is separated from the user data. HDFS consists of the
NameNode, which is HDFS’s metadata server, and DataNodes, where user data is stored.
There can be only one active instance of the NameNode. A number of DataNodes (a handful
to several thousand) can be part of this HDFS served by the single NameNode.
Here is how a client RPC request to the Hadoop HDFS NameNode flows through the Name
Node. This pertains to the Hadoop trunk code base on Dec 2, 2012, i.e. a few months after
Hadoop 2.0.2-alpha was released.
The Hadoop NameNode receives requests from HDFS clients in the form of Hadoop RPC
requests over a TCP connection. Typical client requests include mkdir, getBlockLocations,
create file, etc. Remember – HDFS separates metadata from actual file data, and that the
NameNode is the metadata server. Hence, these requests are pure metadata requests – no
data transfer is involved. The following diagram traces the path of a HDFS client request
through the Name Node. The various thread pools used by the system, locks taken and
released by these threads, queues used, etc. are described in detail below.

 As shown in the diagram, a Listener object listens to the TCP port serving RPC requests from
the client. It accepts new connections from clients, and adds them to
the Server object’s connection list.
 Next, a number of RPC Reader threads read requests from the connections in the connection list,
decode the RPC requests, and add them to the RPC call queue – Server.callQueue.
 Now, the actual worker threads kick in – these are the Handler threads. The threads pick up
RPC calls and process them. The processing involves the following:
 First grab the write lock for the namespace
 Change the in-memory namespace
 Write to the in-memory FSEdits log (journal)

 Now, release the write lock on the namespace. Note that the journal has not been sync’d yet
– this means we cannot return success to the RPC client yet
 Next, each handler thread calls logSync. Upon returning from this call, it is guaranteed that
the log file modifications have been sync’d to disk. Exactly how this is guaranteed is messy.
Here are the details:
 Every time an edit entry is written to the edits log, a unique txid is assigned for this
specific edit. The Handler retrieves this log txid and saves it. This is going to be used
to verify whether this specific edit log entry has been sync’d to disk

 When logSync is called by a Handler, it first checks to see if the last sync’d log edit
entry is greater than the txid of the edit log just finished by the Handler. If the
Handler’s edit log txid is less than the last sync’d txid, then the Handler can mark the
RPC call as complete. If the Handler’s edit log txid is greater than the last sync’d txid,
then the Handler has to do one of the following things:
o It has to grab the sync lock and sync all transactions, or
o if it cannot grab the sync lock, it waits 1000 ms and tries again in a loop.
 At this point, the log entry for the transaction made by this Handler has been
persisted. The Handler can now mark the RPC as complete.
 Now, the single Responder thread picks up completed RPCs and returns the result of the
RPC call to the RPC client. Note that the Responder thread uses NIO to asynchronously send
responses back to waiting clients. Hence one thread is sufficient.

8. Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The
three common types of failures are Name Node failures, Data Node failures and network
partitions.

8.1. Data Disk Failure, Heartbeats and Re-Replication

Each Data Node sends a Heartbeat message to the Name Node periodically. A network
partition can cause a subset of Data Nodes to lose connectivity with the Name Node. The
Name Node detects this condition by the absence of a Heartbeat message. The Name Node
marks Data Nodes without recent Heartbeats as dead and does not forward any new IO
requests to them. Any data that was registered to a dead Data Node is not available to HDFS
any more. Data Node death may cause the replication factor of some blocks to fall below
their specified value. The Name Node constantly tracks which blocks need to be replicated
and initiates replication whenever necessary. The necessity for re-replication may arise due
to many reasons: a Data Node may become unavailable, a replica may become corrupted, a
hard disk on a Data Node may fail, or the replication factor of a file may be increased.

8.2. Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might
automatically move data from one Data Node to another if the free space on a Data Node
falls below a certain threshold. In the event of a sudden high demand for a particular file, a
scheme might dynamically create additional replicas and rebalance other data in the cluster.

These types of data rebalancing schemes are not yet implemented.

8.3. Data Integrity

It is possible that a block of data fetched from a Data Node arrives corrupted. This
corruption can occur because of faults in a storage device, network faults, or buggy
software. The HDFS client software implements checksum checking on the contents of HDFS
files. When a client creates an HDFS file, it computes a checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS namespace. When a
client retrieves file contents it verifies that the data it received from each Data Node
matches the checksum stored in the associated checksum file. If not, then the client can opt
to retrieve that block from another Data Node that has a replica of that block.
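
A client can also ask HDFS for a file-level checksum explicitly. The short sketch below uses FileSystem.getFileChecksum() on a hypothetical path; the method may return null on file systems that do not expose checksums.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The file checksum is derived from the per-block checksums kept alongside the blocks.
        FileChecksum sum = fs.getFileChecksum(new Path("/user/demo/big.dat")); // hypothetical file
        System.out.println(sum == null ? "checksums not supported" : sum.toString());
        fs.close();
    }
}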

8.4. Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files
can cause the HDFS instance to be non-functional. For this reason, the Name Node can be
configured to support maintaining multiple copies of the FsImage and EditLog. Any update
to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated
synchronously. This synchronous updating of multiple copies of the FsImage and EditLog
may degrade the rate of namespace transactions per second that a Name Node can support.
However, this degradation is acceptable because even though HDFS applications are very
data intensive in nature, they are not metadata intensive. When a Name Node restarts, it
selects the latest consistent FsImage and EditLog to use.

The Name Node machine is a single point of failure for an HDFS cluster. If the Name Node
machine fails, manual intervention is necessary. Currently, automatic restart and failover of
the Name Node software to another machine is not supported.

8.5. Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the
snapshot feature may be to roll back a corrupted HDFS instance to a previously known good
point in time. HDFS does not currently support snapshots but will in a future release.

9. Data Organization

9.1. Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS
are those that deal with large data sets. These applications write their data only once but they
read it one or more times and require these reads to be satisfied at streaming speeds. HDFS
supports write-once-read-many semantics on files. A typical block size used by HDFS is 64
MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will
reside on a different Data Node.
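
To see how a file is split into blocks and where the replicas live, a client can ask for the block locations. This is a sketch using getFileBlockLocations() on a hypothetical file.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat")); // hypothetical file
        // One BlockLocation per block: its offset, length and the Data Nodes holding replicas.
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}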

9.2. Staging

A client request to create a file does not reach the Name Node immediately. In fact, initially
the HDFS client caches the file data into a temporary local file. Application writes are
transparently redirected to this temporary local file. When the local file accumulates data
worth over one HDFS block size, the client contacts the Name Node. The Name Node inserts
the file name into the file system hierarchy and allocates a data block for it. The Name Node
responds to the client request with the identity of the Data Node and the destination data
block. Then the client flushes the block of data from the local temporary file to the specified
Data Node. When a file is closed, the remaining un-flushed data in the temporary local file is
transferred to the Data Node. The client then tells the Name Node that the file is closed. At
this point, the Name Node commits the file creation operation into a persistent store. If the
Name Node dies before the file is closed, the file is lost.
The above approach has been adopted after careful consideration of target applications that
run on HDFS. These applications need streaming writes to files. If a client writes to a remote
file directly without any client side buffering, the network speed and the congestion in the
network impacts throughput considerably. This approach is not without precedent. Earlier
distributed file systems, e.g. AFS, have used client side caching to improve performance. A
POSIX requirement has been relaxed to achieve higher performance of data uploads.

9.3. Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as
explained in the previous section. Suppose the HDFS file has a replication factor of three.
When the local file accumulates a full block of user data, the client retrieves a list of
Data Nodes from the Name Node. This list contains the Data Nodes that will host a replica of
that block. The client then flushes the data block to the first Data Node. The first Data Node
starts receiving the data in small portions (4 KB), writes each portion to its local repository
and transfers that portion to the second Data Node in the list. The second Data Node, in turn
starts receiving each portion of the data block, writes that portion to its repository and then
flushes that portion to the third Data Node. Finally, the third Data Node writes the data to its
local repository. Thus, a Data Node can be receiving data from the previous one in the
pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data
is pipelined from one Data Node to the next.

9.4 Read and Write Files in HDFS

Read Operation in HDFS

HDFS has a master and slave kind of architecture. The Name node acts as the master and the
Data nodes as workers. All the metadata information is with the name node and the original
data is stored on the data nodes. Keeping all this in mind, the figure below will give you some
idea about how data flow happens between the Client interacting with HDFS, i.e. the Name
node and the Data nodes.

There are six steps involved in reading a file from HDFS.
Let's suppose a Client (an HDFS Client) wants to read a file from HDFS. The steps involved
in reading the file are:

 Step 1: First the Client will open the file by giving a call to the open() method
on the FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.

 Step 2: DistributedFileSystem calls the Namenode, using RPC, to determine the
locations of the blocks for the first few blocks of the file. For each block, the name
node returns the addresses of all the data nodes that have a copy of that block.

 The DistributedFileSystem returns an object of FSDataInputStream (an input
stream that supports file seeks) to the client for it to read data
from. FSDataInputStream in turn wraps a DFSInputStream, which manages the
datanode and namenode I/O.

 Step 3: The client then calls read() on the stream. DFSInputStream, which has
stored the datanode addresses for the first few blocks in the file, then connects to the
first (closest) datanode for the first block in the file.

 Step 4: Data is streamed from the datanode back to the client, which
calls read() repeatedly on the stream.

 Step 5: When the end of the block is reached, DFSInputStream will close the
connection to the datanode, then find the best datanode for the next block. This
happens transparently to the client, which from its point of view is just reading a
continuous stream.

 Step 6: Blocks are read in order, with the DFSInputStream opening new
connections to datanodes as the client reads through the stream. It will also call the
namenode to retrieve the datanode locations for the next batch of blocks as needed.
When the client has finished reading, it calls close() on the FSDataInputStream.

What happens if DFSInputStream encounters an error while communicating with a
datanode?

Well, if such an incident occurs, then DFSInputStream will try to fetch the data from the next
closest datanode for that block (since DFSInputStream has the locations of all the datanodes
where that block is residing). It will also remember the datanode that failed and will avoid
that datanode for the other blocks. The DFSInputStream also verifies
checksums for the data transferred to it from the datanode. If a corrupted block is found,
it is reported to the namenode before the DFSInputStream attempts to read a replica of
the block from another datanode.

One important aspect of this design is that the client contacts datanodes directly to retrieve
data and is guided by the namenode to the best datanode for each block. This design allows
HDFS to scale to a large number of concurrent clients because the data traffic is spread
across all the datanodes in the cluster. Meanwhile, the namenode merely has to service
block location requests (which it stores in memory, making them very efficient) and does
not, for example, serve data, which would quickly become a bottleneck as the number of
clients grew.
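
Put together, the read path above corresponds to only a few lines of client code. The following is a sketch, assuming the HDFS URI is taken from the configuration and the path to read is passed as the first command-line argument.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns a DistributedFileSystem when fs.defaultFS points at HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        // Steps 1-2: open() asks the Name Node (over RPC) for the first block locations
        // and returns an FSDataInputStream that wraps a DFSInputStream.
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            // Steps 3-6: read() streams the blocks directly from the closest Data Nodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}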

Write Operation in HDFS

Here we are considering the case that we are going to create a new file, write data to it and
will close the file.

Now, in writing data to HDFS, there are seven steps involved. These seven steps are:

 Step 1: The client creates the file by calling the create() method
on DistributedFileSystem.

 Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem’s namespace, with no blocks associated with it.

 The namenode performs various checks to make sure the file doesn't already exist
and that the client has the right permissions to create the file. If these checks pass,
the namenode makes a record of the new file; otherwise, file creation fails and the
client is thrown an IOException. The DistributedFileSystem returns
an FSDataOutputStream for the client to start writing data to. Just as in the read
case, FSDataOutputStream wraps a DFSOutputStream, which handles
communication with the datanodes and namenode.

 Step 3: As the client writes data, DFSOutputStream splits it into packets, which it
writes to an internal queue, called the data queue. The data queue is consumed by
the DataStreamer, which is responsible for asking the namenode to allocate
new blocks by picking a list of suitable datanodes to store the replicas. The list of
datanodes forms a pipeline, and here we’ll assume the replication level is three, so
there are three nodes in the pipeline. The DataStreamer streams the packets to the
first datanode in the pipeline, which stores the packet and forwards it to the second
datanode in the pipeline.

 Step 4: Similarly, the second datanode stores the packet and forwards it to the third
(and last) datanode in the pipeline.

 Step 5: DFSOutputStream also maintains an internal queue of packets that are
waiting to be acknowledged by datanodes, called the ack queue. A packet is removed
from the ack queue only when it has been acknowledged by all the datanodes in the
pipeline.

 Step 6: When the client has finished writing data, it calls close() on the stream.

 Step 7: This action flushes all the remaining packets to the datanode pipeline and
waits for acknowledgments before contacting the namenode to signal that the file is
complete. The namenode already knows which blocks the file is made up of (via
the DataStreamer asking for block allocations), so it only has to wait for blocks to be
minimally replicated before returning successfully.
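
The corresponding client-side write code is equally short. The sketch below writes a small file to a hypothetical destination path; create() and close() map onto the steps described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/output.txt"); // hypothetical destination
        // Steps 1-2: create() issues the RPC to the Name Node and returns an
        // FSDataOutputStream wrapping a DFSOutputStream.
        try (FSDataOutputStream out = fs.create(file, true)) {
            // Steps 3-5: writes are split into packets and pushed down the Data Node pipeline.
            out.write("hello, HDFS\n".getBytes("UTF-8"));
        }
        // Steps 6-7: close() flushes the remaining packets and commits the file at the Name Node.
        fs.close();
    }
}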

9.5 Rack Awareness

For small clusters in which all servers are connected by a single switch, there are only two
levels of locality: “on-machine” and “off-machine.” When loading data from a Data Node's
local drive into HDFS, the Name Node will schedule one copy to go into the local Data Node,
and will pick two other machines at random from the cluster.

For larger Hadoop installations which span multiple racks, it is important to ensure that
replicas of data exist on multiple racks. This way, the loss of a switch does not render
portions of the data unavailable due to all replicas being underneath it.

HDFS can be made rack-aware by the use of a script which allows the master node to map
the network topology of the cluster. While alternate configuration strategies can be used,
the default implementation allows you to provide an executable script which returns the
“rack address” of each of a list of IP addresses.
The network topology script receives as arguments one or more IP addresses of nodes in
the cluster. It returns on stdout a list of rack names, one for each input. The input and
output order must be consistent.
To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-
site.xml. This provides a command to run to return a rack id; it must be an executable script
or program. By default, Hadoop will attempt to send a set of IP addresses to the script as
several separate command line arguments. You can control the maximum acceptable
number of arguments with the topology.script.number.args key.

Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a
rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-
rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for
a large installation may be /top-switch-name/rack-name.

Hadoop rack ids are not currently expressive enough to handle an unusual routing topology
such as a 3-d torus; they assume that each node is connected to a single switch which in turn
has a single upstream switch. This is not usually a problem, however. Actual packet routing
will be directed using the topology discovered by or set in switches and routers. The
Hadoop rack ids will be used to find “near” and “far” nodes for replica placement (and in
0.17, Map Reduce task placement).

The following example script performs rack identification based on IP addresses given a
hierarchical IP addressing scheme enforced by the network administrator. This may work
directly for simple installations; more complex network configurations may require a file-
or table-based lookup process. Care should be taken in that case to keep the table up-to-date
as nodes are physically relocated, etc. This script requires that the maximum number of
arguments be set to 1.

#!/bin/bash
# Set rack id based on IP address.
# Assumes network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.)
#
# This is invoked with an IP address as its only argument

# get IP address from the input ($1 is the first argument; $0 would be the script name)
ipaddr=$1

# select "x.y" and convert it to "x/y"
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}

10. Accessibility
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a
Java API for applications to use. A C language wrapper for this Java API is also available. In
addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work
is in progress to expose HDFS through the WebDAV protocol.
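
As a small taste of the Java API, the sketch below lists a directory and prints some of the per-file metadata served by the Name Node; the /user/demo directory is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Each FileStatus carries the metadata kept by the Name Node: size, replication, etc.
        for (FileStatus st : fs.listStatus(new Path("/user/demo"))) { // hypothetical directory
            System.out.println(st.getPath().getName() + "\t" + st.getLen()
                    + " bytes\trepl=" + st.getReplication());
        }
        fs.close();
    }
}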

10.1. FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a
commandline interface called FS shell that lets a user interact with the data in HDFS. The
syntax of this command set is similar to other shells (e.g. bash, csh) that users are already
familiar with. Here are some sample action/command pairs:
Action                                                   Command
Create a directory named /foodir                         bin/hadoop dfs -mkdir /foodir
View the contents of a file named /foodir/myfile.txt     bin/hadoop dfs -cat /foodir/myfile.txt
FS shell is targeted for applications that need a scripting language to interact with the stored
data.

10.2. DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are
commands that are used only by an HDFS administrator. Here are some sample
action/command pairs:
Action                                      Command
Put the cluster in Safemode                 bin/hadoop dfsadmin -safemode enter
Generate a list of Data Nodes               bin/hadoop dfsadmin -report
Decommission Data Node datanodename         bin/hadoop dfsadmin -decommission datanodename

10.3. Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a
configurable TCP port. This allows a user to navigate the HDFS namespace and view the
contents of its files using a web browser.

11. Space Reclamation

11.1. File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS.
Instead, HDFS first renames it to a file in the /trash directory. The file can be restored
quickly as long as it remains in /trash. A file remains in /trash for a configurable
amount of time. After the expiry of its life in /trash, the Name Node deletes the file from
the HDFS namespace. The deletion of a file causes the blocks associated with the file to be
freed. Note that there could be an appreciable time delay between the time a file is deleted
by a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory. If a
user wants to undelete a file that he/she has deleted, he/she can navigate the /trash
directory and retrieve the file. The /trash directory contains only the latest copy of the file
that was deleted. The /trash directory is just like any other directory with one special
feature: HDFS applies specified policies to automatically delete files from this directory. The
current default policy is to delete files from /trash that are more than 6 hours old. In the
future, this policy will be configurable through a well-defined interface.
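
Programmatic deletes can opt into the same trash behaviour. The sketch below uses the org.apache.hadoop.fs.Trash helper from the client library (treat the exact constructor as an assumption for your Hadoop version) and falls back to a permanent delete if trash is disabled; the path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class DeleteWithTrash {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/user/demo/old-data"); // hypothetical path
        // Move the path under the trash directory instead of freeing its blocks immediately.
        boolean trashed = new Trash(fs, conf).moveToTrash(target);
        if (!trashed) {
            // Trash may be disabled (fs.trash.interval = 0); delete permanently and recursively.
            fs.delete(target, true);
        }
        fs.close();
    }
}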

11.2. Decrease Replication Factor

When the replication factor of a file is reduced, the Name Node selects excess replicas that
can be deleted. The next Heartbeat transfers this information to the Data Node. The Data Node
then removes the corresponding blocks and the corresponding free space appears in the
cluster. Once again, there might be a time delay between the completion of the
setReplication API call and the appearance of free space in the cluster.

11.3 Hadoop Configuration Files

1. http://hadoop.apache.org/docs/r1.1.2/core-default.html

2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html

3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html

11.4 Hadoop Name Node & Data Node Web Interface

Name Node

The NN web interface can be accessed via the URL http://<namenode_host>:50070/
assuming you have not changed the port that the service listens on in the configuration. The
first part of the page displays the name of the server running the Name Node and the port,
when it was started, version information etc.

The Cluster ID is a unique string that identifies the cluster; all nodes in the cluster will have
the same ID (you can specify your own value for this). With Hadoop 1.0/CDH3 it was only
possible to have a single Name Node to manage the entire file system namespace. Since this
limited scalability for very large clusters, and to cater for other requirements of multi-
tenancy, the concept of federating the namespace across multiple Name Nodes was
developed, so for example you could have nn1 responsible for the metadata and blocks
associated with /sales and nn2 for /research. In this scenario a NN manages what is termed
a Namespace Volume, which consists of the filesystem metadata and the blocks (block pool)
for the files under the 'volume'. The Data Nodes in a federated cluster store blocks for
multiple Name Nodes, and the Block Pool ID is used to identify which NN blocks belong to.
The Block Pool ID is derived by concatenating the prefix BP with some random characters,
the Name Node IP address and the current time in milliseconds.

Cluster Summary

The next part of the web interface displays a summary of the state of the cluster.

Secondary Name Node (SNN)
The key information provided on the Secondary Name Node web interface from an
administrative standpoint is the last checkpoint time.

Like the NN web UI, you can also view the SNN's log files. The metrics exposed via JMX can
be viewed from the URL http://<secondary_namenode>:50090/jmx

Data Node (DN)

Browsing the File System

When you click the 'browse the file system' link from the Name Node web interface (see the
Name Node section above), this redirects you to one of the Data Node's web UIs to browse
the directory structure and view files stored in HDFS.

It's also possible to view the contents of a file through the web interface (if it is a text file).

Clicking advanced options shows the blocks associated with the file and the Data Node on
which it resides and allows you to download the file or 'tail' the file which works much like
the tail command on Linux systems.

Block Scanner

Data Nodes periodically scan blocks to detect corruption and report corrupt blocks to the
Name Node so that they can be re-replicated to maintain the minimum replication factor for
the file that the block is associated with. The Data Node has a web page,
http://<datanode_host>:50075/blockScannerReport, which displays information on block
scanning statistics.

12. HDFS Commands

