
Data Management

Scale-up
To understand the popularity of distributed systems (scale-out)
vis-à-vis huge monolithic servers (scale-up), consider the
price/performance of current I/O technology.
A high-end machine with four I/O channels, each with a
throughput of 100 MB/sec, needs roughly three hours to read a
4 TB data set (4 TB / 400 MB/sec ≈ 10,000 seconds)!
With Hadoop, the same data set is divided into smaller
(typically 64 MB) blocks that are spread among many machines
in the cluster via the Hadoop Distributed File System (HDFS).
With a modest degree of replication, the cluster machines can
read the data set in parallel and provide much higher
throughput.
And such a cluster of commodity machines turns out to be
cheaper than one high-end server!

Hadoop focuses on moving code to data


The clients send only the MapReduce programs to be executed,
and these programs are usually small (often in kilobytes).

More importantly, the move-code-to-data philosophy applies
within the Hadoop cluster itself.
Data is broken up and distributed across the cluster, and as
much as possible, computation on a piece of data takes place
on the same machine where that piece of data resides.
The programs to run (code) are orders of magnitude smaller
than the data and are easier to move around.
Also, it takes more time to move data across a network than to
apply the computation to it.

HDFS
HDFS is the file system component of Hadoop.
The interface to HDFS is patterned after the UNIX file system.
Faithfulness to standards was sacrificed in favor of improved
performance for the applications at hand.

HDFS stores file system metadata and application data
separately.
HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of
commodity hardware.¹
¹ "The Hadoop Distributed File System" by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST 2010,
May 2010, http://storageconference.org/2010/Papers/MSST/Shvachko.pdf)

Key properties of HDFS


Very Large
Very large in this context means files that are hundreds of
megabytes, gigabytes, or terabytes in size.
There are Hadoop clusters running today that store
petabytes of data.

Streaming data
write-once, read-many-times pattern
the time to read the whole dataset is more important than
the latency in reading the first record

Commodity hardware
HDFS is designed to carry on working without noticeable
interruption to the user in the face of the hardware failures that
are common with commodity machines

Not a good fit for


Low-latency data access
HDFS is optimized for delivering a high throughput of data, and this may be at
the expense of latency.
HBase is currently a better choice for low-latency access.

Lots of small files


Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode.
As a rule of thumb, each file, directory, and block takes about 150 bytes.
While storing millions of files is feasible, billions is beyond the capability of
current hardware.
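As a rough, illustrative calculation using this rule of thumb: ten million files, each assumed to occupy a single block, amount to roughly 20 million objects (one file entry plus one block each), i.e. about 20,000,000 × 150 bytes ≈ 3 GB of namenode memory, before directories are even counted.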

Multiple writers, arbitrary file modifications


Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file.
There is no support for multiple writers, or for modifications at arbitrary offsets
in the file.

Namenode and Datanode


Master/slave architecture
An HDFS cluster consists of a single Namenode, a master server
that manages the file system namespace and regulates access to
files by clients.
There are a number of DataNodes, usually one per node in the
cluster.
The DataNodes manage storage attached to the nodes that they
run on.
HDFS exposes a file system namespace and allows user data to
be stored in files.
A file is split into one or more blocks, and these blocks are stored
on DataNodes.
DataNodes serve read and write requests and perform block
creation, deletion, and replication upon instruction from the
Namenode.

Web Interface
NameNode and DataNode each run an internal web server in
order to display basic information about the current status of
the cluster.

With the default configuration, the NameNode front page is at
http://namenode-name:50070/.
It lists the DataNodes in the cluster and basic statistics of the
cluster.
The web interface can also be used to browse the file system
(using "Browse the file system" link on the NameNode front
page).

HDFS architecture

[Figure: HDFS architecture. A client issues metadata operations (file name, number of replicas, e.g. /home/foo/data with replication 6) to the Namenode, while block operations, reads, and writes go directly to the Datanodes, which store the blocks and replicate them across racks (Rack 1, Rack 2).]

Namenode
Keeps an image of the entire file system namespace and the file
Blockmap in memory.
4 GB of local RAM is sufficient to support these data structures,
which represent a huge number of files and directories.
When the Namenode starts up, it reads the FsImage and EditLog
from its local file system, applies the EditLog transactions to the
FsImage, and then stores an updated copy of the FsImage on the
file system as a checkpoint.
Periodic checkpointing is done so that the system can recover
to the last checkpointed state in case of a crash.

Datanode
A Datanode stores data in files in its local file system.
The Datanode has no knowledge of the HDFS file system;
it stores each block of HDFS data in a separate file.
The Datanode does not create all files in the same directory.
It uses heuristics to determine the optimal number of files per
directory and creates directories accordingly.
When it starts up, the Datanode generates a list of all the HDFS
blocks it holds and sends this report to the Namenode: the Blockreport.

HDFS: client and server view

[Figure: an Application uses the HDFS Client, backed by a local file system with a small block size (e.g. 2 KB), to talk to the HDFS Server side: the Name Nodes on the master node and storage with a large (128 MB), replicated block size.]

HDFS: Module view

[Figure: layered view of the HDFS modules described below.]

HDFS: Modules

Protocol: The protocol package is used in communication between the client and the
namenode and datanode. It describes the messages used between these servers.
Security: the security package authenticates access to the files. Security is based on
token-based authentication, where the namenode server controls the distribution of
access tokens.
server.protocol: server.protocol defines the communication between namenode and
datanode, and between namenode and balancer.
server.common: server.common contains utilities that are used by the namenode,
datanode and balancer. Examples are classes containing server-wide constants, utilities,
and other logic that is shared among the servers.
Client: The client contains the logic to access the file system from a user's computer. It
interfaces with the datanode and namenode servers using the protocol module. In the
diagram this module spans two layers, because the client module also contains some
logic that is shared system-wide.
Datanode: The datanode is responsible for storing the actual blocks of filesystem data. It
receives instructions on which blocks to store from the namenode. It also services the
client directly to stream file block contents.
Namenode: The namenode is responsible for authorizing the user, storing a mapping from
filenames to data blocks, and it knows which blocks of data are stored where.
Balancer: The balancer is a separate server that tells the namenode to move data blocks
between datanodes when the load is not evenly balanced among datanodes.
Tools: The tools package can be used to administer the filesystem, and also contains
debugging code.

File system
A hierarchical file system with directories and files:
create, remove, move, rename, etc.
The Namenode maintains the file system;
any metadata changes to the file system are recorded by
the Namenode.
An application can specify the number of replicas of a file
it needs: the replication factor of the file.
This information is stored in the Namenode.
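As an illustrative sketch of these operations through the Java FileSystem API introduced later in this session (the namenode URI and paths below are hypothetical placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://namenode-name/" is a placeholder URI for the cluster's namenode
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-name/"), conf);

    fs.mkdirs(new Path("/user/demo"));                           // create a directory
    fs.rename(new Path("/user/demo/a.txt"),
              new Path("/user/demo/b.txt"));                     // move/rename a file
    fs.setReplication(new Path("/user/demo/b.txt"), (short) 3);  // per-file replication factor
    fs.delete(new Path("/user/demo/old"), true);                 // recursive remove
  }
}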

Metadata
The HDFS namespace is stored by Namenode.
Namenode uses a transaction log called the EditLog to
record every change that occurs to the filesystem meta
data.
For example, creating a new file or
changing the replication factor of a file.
The EditLog is stored in the Namenode's local file system.

The entire file system namespace, including the mapping of
blocks to files and the file system properties, is stored in a
file called the FsImage.
The FsImage is also stored in the Namenode's local file system.

Application code <-> Client

HDFS provides a Java API for applications to use.
Fundamentally, the application uses the standard
java.io interface.
A C language wrapper for this Java API is also available.
The client and the application code are bound into the
same address space.


Java Interface
One of the simplest ways to read a file from a Hadoop filesystem is by using
a java.net.URL object to open a stream to read the data from.
The general idiom is:

InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}
There's a little bit more work required to make Java recognize Hadoop's hdfs
URL scheme.
This is achieved by calling the setURLStreamHandlerFactory method on URL
with an instance of FsUrlStreamHandlerFactory.

Example: Displaying files from a Hadoop filesystem on standard output


import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
  // Register Hadoop's URL stream handler so java.net.URL understands hdfs:// URLs
  // (setURLStreamHandlerFactory may only be called once per JVM).
  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Reading Data Using the FileSystem API


A file in a Hadoop filesystem is represented by a Hadoop Path object (and
not a java.io.File object).
There are several static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws
IOException

A Configuration object encapsulates a client's or server's configuration, which
is set using configuration files read from the classpath, such as conf/core-site.xml.
With a FileSystem instance in hand, we invoke an open() method to get the
input stream for a file:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws
IOException

Example: Displaying files with the FileSystem API

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

FSDataInputStream
The open() method on FileSystem actually returns an FSDataInputStream
rather than a standard java.io class.
This class is a specialization of java.io.DataInputStream with support for
random access, so you can read from any part of the stream.
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
implements Seekable, PositionedReadable {
// implementation
}
public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}

public interface PositionedReadable {


public int read(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
}
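To show Seekable in action, here is a minimal sketch, modeled on the FileSystemCat example above, that prints a file twice by seeking back to the start after the first read:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // jump back to the start of the file and read it again
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}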

FSDataOutputStream
The FileSystem create() method returns an FSDataOutputStream for writing:
public FSDataOutputStream create(Path f) throws IOException
An overloaded create() also takes a Progressable, so the application can be
notified of progress as data is written:
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
An existing file can be appended to with:
public FSDataOutputStream append(Path f) throws IOException

Example: Copying a local file to a Hadoop filesystem


import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print("."); // called periodically as data is written to the cluster
      }
    });
    IOUtils.copyBytes(in, out, 4096, true); // true: close the streams when copying is done
  }
}

File-Based Data Structures


For some applications, you need a specialized data structure to hold your
data.
For doing MapReduce-based processing, putting each blob of binary data
into its own file doesn't scale, so Hadoop developed a number of higher-level
containers for these situations.
Imagine a logfile, where each log record is a new line of text.
If you want to log binary types, plain text isn't a suitable format.
Hadoop's SequenceFile class fits the bill in this situation, providing a
persistent data structure for binary key-value pairs.

SequenceFile
SequenceFile is a flat file consisting of binary key/value pairs.
It is extensively used in MapReduce as input/output formats.

Internally, the temporary outputs of maps are stored using SequenceFile.


SequenceFile provides Writer, Reader and Sorter classes for writing,
reading and sorting, respectively.
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in
'blocks' separately and compressed. The size of the 'block' is configurable.

The SequenceFile.Reader acts as a bridge and can read any of the above
SequenceFile formats.

Using SequenceFile
To use it as a logfile format, you would choose a key, such as a timestamp
represented by a LongWritable, and the value would be a Writable that
represents the quantity being logged.

To create a SequenceFile, use one of its createWriter() static methods, which
returns a SequenceFile.Writer instance.
Once you have a SequenceFile.Writer, you write key-value pairs using
the append() method.
Then, when you've finished, you call the close() method.
Reading sequence files from beginning to end is a matter of creating an
instance of SequenceFile.Reader and iterating over records by repeatedly
invoking one of the next() methods, as in the sketch below.
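A minimal writer/reader sketch, assuming the classic SequenceFile.createWriter()/SequenceFile.Reader constructors (exact factory signatures vary across Hadoop versions) and a SequenceFile path passed as the first argument:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws IOException {
    String uri = args[0];                      // e.g. hdfs://host/path/log.seq (hypothetical)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    // Write: timestamp key, message value, appended in order
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          LongWritable.class, Text.class);
      writer.append(new LongWritable(System.currentTimeMillis()),
          new Text("first record"));
    } finally {
      IOUtils.closeStream(writer);             // close() the writer when finished
    }

    // Read: iterate with next() until it returns false
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      LongWritable key = new LongWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.printf("%d\t%s%n", key.get(), value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}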

Internals of a sequence file


A sequence file consists of a header followed by one or more records.
The header contains other fields including the names of the key and value
classes, compression details, user defined metadata, and the sync marker.
A MapFile is a sorted SequenceFile with an index to permit lookups by key.

Compression
Hadoop allows users to compress output data, intermediate
data, or both.
Hadoop checks whether input data is in a compressed format
and decompresses the data as needed.
Compression codec:
two lossless codecs.
The default codec is gzip, a combination of the Lempel-Ziv 1977 (LZ77)
algorithm and Huffman encoding.
The other codec implements the Lempel-Ziv-Oberhumer (LZO) algorithm,
a variant of LZ77 optimized for decompression speed.

Compression unit:
Hadoop allows both per-record and per-block compression.
Thus, the record or block size affects the compressibility of the data.
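As a sketch of how output and intermediate (map-output) compression are typically switched on for a MapReduce job; the mapreduce.* property names are the Hadoop 2 names (older releases use mapred.* equivalents) and the gzip codec here is only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
  public static Job configure(Configuration conf) throws Exception {
    // Compress intermediate map outputs
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        GzipCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-output");
    // Compress the final job output
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    return job;
  }
}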

When to use compression?


Compression adds a read-time penalty, so why would one enable any
compression?
There are a few reasons why the advantages of compression can outweigh
the disadvantages:
Compression reduces the number of bytes written to/read from HDFS
Compression effectively improves the efficiency of network bandwidth and disk
space
Compression reduces the size of data needed to be read when issuing a read

To keep friction as low as possible, a real-time compression library is
preferred.
To achieve maximal performance and benefit, you must enable LZO.
What about parallelism?

Compression and Hadoop


Storing compressed data in HDFS allows your hardware
allocation to go further, since compressed data is often 25% of
the size of the original data.
Furthermore, since MapReduce jobs are nearly always I/O-bound,
storing compressed data means there is less overall I/O
to do, meaning jobs run faster.
There are two caveats to this, however:
some compression formats cannot be split for parallel processing, and
others are slow enough at decompression that jobs become CPU-bound,
eliminating your gains on IO.

gzip compression on Hadoop


The gzip compression format illustrates the first caveat; to understand
why, we need to go back to how Hadoop's input splits work.
Imagine you have a 1.1 GB gzip file, and your cluster has a 128 MB block
size.
This file will be split into 9 chunks of size approximately 128 MB.
In order to process these in parallel in a MapReduce job, a different mapper
will be responsible for each chunk.
But this means that the second mapper will start at an arbitrary byte about
128 MB into the file.
The dictionary that gzip builds up as context for decompression will be empty
at this point, which means the gzip decompressor will not be able to
correctly interpret the bytes.
The upshot is that large gzip files in Hadoop need to be processed by a single
mapper, which defeats the purpose of parallelism.
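The check below mirrors the splittability test that input formats such as TextInputFormat apply, which is why a .gz file falls through to a single mapper; the helper class and method names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
  public static boolean isSplittable(Configuration conf, Path file) {
    // Infer the codec from the file name suffix (.gz, .bz2, ...)
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    if (codec == null) {
      return true;  // uncompressed: always splittable
    }
    // e.g. GzipCodec -> false (single mapper); BZip2Codec -> true
    return codec instanceof SplittableCompressionCodec;
  }
}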

Bzip2 compression on Hadoop

For an example of the second caveat, in which jobs become CPU-bound, we
can look to the bzip2 compression format.

Bzip2 files compress well and are even splittable, but the decompression
algorithm is slow and cannot keep up with the streaming disk reads that are
common in Hadoop jobs.
While bzip2 compression has some upside because it conserves storage
space, running jobs now spend their time waiting on the CPU to finish
decompressing data, which slows them down and offsets the other gains.

LZO and ElephantBird


How can we split large compressed data and run them in parallel on
Hadoop?
One of the biggest drawbacks of compression algorithms like gzip is that
you can't split them across multiple mappers.
This is where LZO comes in.
Using LZO compression in Hadoop allows for
reduced data size and
shorter disk read times

LZO's block-based structure allows it to be split into chunks for parallel
processing in Hadoop.
Taken together, these characteristics make LZO an excellent compression
format to use in your cluster.
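A hypothetical job-setup sketch using the hadoop-lzo library linked below; the class and package names are assumptions taken from that project and may differ between versions:

import com.hadoop.mapreduce.LzoTextInputFormat;   // from hadoop-lzo (assumed class name)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class LzoJobSetup {
  public static Job configure(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "lzo-input");
    // The LZO input format consults the index file produced by the (separate)
    // LZO indexer so that each compressed block can be handed to its own mapper.
    job.setInputFormatClass(LzoTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/logs.lzo")); // hypothetical path
    return job;
  }
}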
Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol
Buffer-related Hadoop InputFormats, OutputFormats,
Writables, Pig LoadFuncs, Hive SerDe, HBase miscellanea, etc.
More:
https://github.com/twitter/hadoop-lzo
https://github.com/kevinweil/elephant-bird
http://code.google.com/p/protobuf/ (IDL)

End of session
Day 1: Data Management
