Scale-out vs. Scale-up
To understand the popularity of distributed systems (scale-out)
vis-à-vis huge monolithic servers (scale-up), consider the
price/performance of current I/O technology.
A high-end machine with four I/O channels, each with a
throughput of 100 MB/s, needs roughly three hours to read a 4
TB data set (4,000,000 MB ÷ 400 MB/s = 10,000 s ≈ 2.8 hours)!
With Hadoop, this same data set will be divided into smaller
(typically 64 MB) blocks that are spread among many machines
in the cluster via the Hadoop Distributed File System (HDFS).
With a modest degree of replication, the cluster machines can
read the data set in parallel and provide a much higher
throughput.
And such a cluster of commodity machines turns out to be
cheaper than one high-end server!
HDFS
HDFS is the file system component of Hadoop.
The interface to HDFS is patterned after the UNIX file system.
Faithfulness to standards was sacrificed in favor of improved
performance for the applications at hand.
Streaming data
write-once, read-many-times pattern
the time to read the whole dataset is more important than
the latency in reading the first record
Commodity hardware
Hardware failure is the norm on large commodity clusters; HDFS
is designed to carry on working, without noticeable interruption
to the user, in the face of such failure.
Web Interface
NameNode and DataNode each run an internal web server in
order to display basic information about the current status of
the cluster.
HDFS architecture
[Figure: HDFS architecture. A client issues metadata ops (e.g., /home/foo/data with replication factor 6) to the Namenode and block ops (read/write) directly to Datanodes; blocks are replicated across Datanodes on different racks (Rack 1, Rack 2).]
Namenode
Keeps an image of the entire file system namespace and the
file-to-block map (BlockMap) in memory.
Datanode
A Datanode stores data in files in its local file system.
A Datanode has no knowledge of the HDFS namespace.
It stores each block of HDFS data in a separate file.
A Datanode does not create all files in the same directory.
Instead, it uses heuristics to determine the optimal number of
files per directory and creates subdirectories as needed.
When a Datanode starts up, it scans its local file system,
generates a list of all the HDFS blocks it holds, and sends this
report to the Namenode: the Blockreport.
HDFS
[Figure: HDFS client and server. An application on the HDFS Client host works through the local file system (small blocks, e.g., 2 KB); on the HDFS Server side, Name Nodes on the master node manage large (128 MB), replicated blocks.]
HDFS: Modules
Protocol: The protocol package is used for communication between the client and the
namenode and datanode servers. It describes the messages exchanged between these servers.
Security: The security package authenticates access to files. Security is based on
token-based authentication, where the namenode server controls the distribution of
access tokens.
server.protocol: server.protocol defines the communication between namenode and
datanode, and between namenode and balancer.
server.common: server.common contains utilities that are used by the namenode,
datanode and balancer. Examples are classes containing server-wide constants, utilities,
and other logic that is shared among the servers.
Client: The client contains the logic to access the file system from a user's computer. It
interfaces with the datanode and namenode servers using the protocol module. In the
diagram this module spans two layers, because the client module also contains
some logic that is shared system-wide.
Datanode: The datanode is responsible for storing the actual blocks of filesystem data. It
receives instructions on which blocks to store from the namenode. It also services the
client directly to stream file block contents.
Namenode: The namenode is responsible for authorizing the user, storing a mapping from
filenames to data blocks, and it knows which blocks of data are stored where.
Balancer: The balancer is a separate server that tells the namenode to move data blocks
between datanodes when the load is not evenly balanced among datanodes.
Tools: The tools package can be used to administer the filesystem, and also contains
debugging code.
File system
Hierarchical file system with directories and files
Create, remove, move, rename, etc.
The Namenode maintains the file system.
Any metadata changes to the file system are recorded by
the Namenode.
An application can specify the number of replicas needed for a
file: the file's replication factor.
This information is stored in the Namenode.
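For instance, a minimal sketch using the Java FileSystem API (the path and factor echo the architecture figure above; the rest is assumed boilerplate):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// Ask for 6 replicas of an existing file; the Namenode records the new factor
fs.setReplication(new Path("/home/foo/data"), (short) 6);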
Metadata
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to
record every change that occurs to the file system
metadata.
For example, creating a new file or changing the
replication factor of a file.
The EditLog is stored in the Namenode's local filesystem.
Client
Java Interface
One of the simplest ways to read a file from a Hadoop filesystem is by using
a java.net.URL object to open a stream to read the data from.
The general idiom is:
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.io.IOUtils;

InputStream in = null;
try {
    // Open a stream on an HDFS file via the standard java.net.URL machinery
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
There's a little more work required to make Java recognize Hadoop's hdfs
URL scheme.
This is achieved by calling the setURLStreamHandlerFactory method on URL
with an instance of FsUrlStreamHandlerFactory.
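For instance (a minimal sketch; the factory can be set at most once per JVM):

import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

static {
    // One-time registration of the hdfs:// URL scheme handler
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}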
FSDataInputStream
The open() method on FileSystem actually returns a FSDataInputStream
rather than a standard java.io class.
This class is a specialization of java.io.DataInputStream with support for
random access, so you can read from any part of the stream.
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
implements Seekable, PositionedReadable {
// implementation
}
public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}
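For instance, a minimal sketch (the URI argument and class name are assumptions) that prints a file twice by seeking back to the start between reads:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://host/path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}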
FSDataOutputStream
public FSDataOutputStream create(Path f) throws IOException
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
public FSDataOutputStream append(Path f) throws IOException
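A minimal sketch that ties these pieces together, copying a local file to HDFS and printing a dot each time progress is reported (class name and argument handling are assumptions):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0]; // local source file
        String dst = args[1];      // e.g. hdfs://host/user/foo/file
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // called as data is written to the pipeline
            }
        });
        IOUtils.copyBytes(in, out, 4096, true); // closes both streams when done
    }
}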
SequenceFile
SequenceFile is a flat file consisting of binary key/value pairs.
It is extensively used in MapReduce as input/output formats.
SequenceFile.Reader acts as a bridge and can read any of the
SequenceFile formats (uncompressed, record-compressed, and block-compressed).
Using SequenceFile
To use it as a logfile format, you would choose a key, such as a timestamp
represented by a LongWritable, and a value that is a Writable representing
the quantity being logged.
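A minimal sketch of writing such a log (the path, message, and value type are assumptions; the createWriter overload shown is the classic one):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/logs/app.seq"); // hypothetical output path
SequenceFile.Writer writer = null;
try {
    writer = SequenceFile.createWriter(fs, conf, path,
            LongWritable.class, Text.class);
    // Key: timestamp; value: the quantity being logged
    writer.append(new LongWritable(System.currentTimeMillis()),
            new Text("user logged in"));
} finally {
    IOUtils.closeStream(writer);
}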
Compression
Hadoop allows users to compress output data, intermediate
data, or both.
Hadoop checks whether input data is in a compressed format
and decompresses the data as needed.
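A minimal sketch of how output and intermediate compression might be enabled with the MapReduce Job API (property name per Hadoop 2.x+; the job name and gzip codec choice are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
// Compress intermediate (map output) data
conf.setBoolean("mapreduce.map.output.compress", true);

Job job = Job.getInstance(conf, "compressed-job"); // hypothetical job name
// Compress the final job output with gzip
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);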
Compression codec:
two lossless codecs.
The default codec is gzip, a combination of the Lempel-Ziv 1977 (LZ77)
algorithm and Huffman coding.
The other codec implements the Lempel-Ziv-Oberhumer (LZO) algorithm,
a variant of LZ77 optimized for decompression speed.
Compression unit:
Hadoop allows both per-record and per-block compression.
Thus, the record or block size affects the compressibility of the data.
Bzip2 files compress well and are even splittable, but the decompression
algorithm is slow and cannot keep up with the streaming disk reads that are
common in Hadoop jobs.
While Bzip2 compression has some upside because it conserves storage
space, jobs then spend their time waiting for the CPU to finish
decompressing data, which slows them down and offsets the other gains.
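As noted above, Hadoop infers the input codec, typically from the file extension. A minimal sketch using CompressionCodecFactory (the file name is hypothetical):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/logs/events.gz"); // hypothetical input file
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(path); // inferred from the .gz extension
InputStream in = (codec == null)
        ? fs.open(path)                           // not compressed: read directly
        : codec.createInputStream(fs.open(path)); // compressed: wrap in decompressor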
End of session
Day 1: Data Management