Abstract
The Hadoop Distributed File System (HDFS) is designed to run
on commodity hardware and can be used as a stand-alone,
general-purpose distributed file system (HDFS User Guide, 2008). It
provides the ability to access bulk data with high I/O throughput.
As a result, the system is well suited to applications that have large
I/O data sets. However, the performance of HDFS decreases
dramatically when handling operations on interaction-intensive
files, i.e., files that are relatively small in size but are frequently
accessed. This paper analyzes the cause of the throughput
degradation observed when accessing interaction-intensive files and
presents an enhanced HDFS architecture, along with an
associated storage allocation algorithm, that overcomes the
performance degradation problem. Experiments have shown that
with the proposed architecture and the associated
storage allocation algorithm, the HDFS throughput for
interaction-intensive files increases by 300% on average, with only a
negligible performance decrease for large-data-set tasks.
I. HDFS
about its rack to the MNN after receiving the MNN's heartbeat
inquiry. In addition, the SNN sends its own heartbeat inquiry to
the datanodes in its rack in order to monitor their status.
Furthermore, in order to keep the HDFS message mechanism
intact, the SNN responds to all the messages generated by the
MNN and sends its own control messages to the datanodes in
its rack.
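The relay behavior just described can be sketched as follows. This is an illustrative model only; the class and method names are ours, not from the paper's implementation, and the actual message transport is not modeled.

```python
class SNN:
    """Illustrative sketch of the SNN relay: the SNN polls the datanodes in
    its own rack with its own heartbeat inquiry, and answers the MNN's
    heartbeat inquiry with the aggregated rack status."""

    def __init__(self, datanodes):
        self.datanodes = datanodes   # the datanodes in this SNN's rack
        self.status = {}             # datanode id -> last reported state

    def poll_datanodes(self):
        # The SNN's own heartbeat inquiry to the datanodes in its rack.
        for node in self.datanodes:
            self.status[node.node_id] = node.report()

    def on_mnn_heartbeat(self):
        # On receiving the MNN's heartbeat inquiry, respond with the
        # current status of the whole rack.
        self.poll_datanodes()
        return dict(self.status)
```

In this scheme the MNN only ever talks to the SNNs, so the message mechanism of the original HDFS remains intact from the MNN's point of view.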
4.1. Interaction-intensity
IDs of incoming file blocks, and the queue QF contains the IDs
of all the files whose blocks are in FM. As HDFS uses a
first-come-first-serve policy to schedule its tasks, the queue
QF is hence ordered by the blocks' arrival times. The subset
FMI ⊆ FM contains all the interaction-intensive blocks in FM.
If there are n racks at the SNN layer, then to the MNN there are 2n
datanodes: the first n datanodes represent the datanode pool in
each rack and the other n datanodes represent the caches in
each rack. For example, assume that the MNN decides a block
should be allocated to the k-th datanode: if k ≤ n, then this block is
placed into the datanode pool of rack k; otherwise, if k > n, the
block is actually allocated to the cache of the (k − n)-th rack.
We then define the solution vector at the
MNN layer as VM. Its size is |FM|. VM[i] represents
the location where block QF[i] is allocated. The value domain
of entry i in vector VM is defined as follows:
As given in Eq. (9), for blocks in FMI, the value domain of their
corresponding entries in VM is [1, 2n], which indicates that
the blocks in FMI may be stored in cache. For the rest of the
blocks, as their domain is [1, n], they can only be allocated
to datanodes in a rack.
As an example, assume that there are 5 racks, i.e., n = 5, and that
an incoming block QF[i] is in the set FMI; then the value domain
of entry VM[i] is the range [1, 10]. If VM[i] = 3,
the incoming block QF[i] is allocated to rack
3, while if VM[i] = 9, the block is allocated to the cache of
rack 4.
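Decoding an entry of VM back to a physical location is then a one-line check. The function below is an illustrative sketch; the name is ours, not the paper's.

```python
def decode_allocation(v, n):
    # v: value of an entry VM[i], assumed to lie in [1, 2n].
    # n: number of racks at the SNN layer.
    # Values 1..n address the datanode pool of rack v; values n+1..2n
    # address the cache of rack v - n.
    if not 1 <= v <= 2 * n:
        raise ValueError("allocation value out of range [1, 2n]")
    if v <= n:
        return ("pool", v)
    return ("cache", v - n)
```

With n = 5, decode_allocation(3, 5) yields ("pool", 3) and decode_allocation(9, 5) yields ("cache", 4), matching the example above.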
To compare the interaction-intensity of incoming files with
that of files already in the cache, an indicator is introduced for
each incoming file to represent the ratio of the file's current I
value to the average I value of all cached files. The
definition of the indicator for file f is given below:
where EW(i) and ER(i) are the predicted writing and reading
speeds on node i; they are given by the auto-regressive
integrated moving average prediction model presented in [16].
k3 and k4 are the weight coefficients for reading and writing,
respectively.
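The indicator described above (the ratio of a file's current I value to the average I value of the cached files) can be sketched as follows; the function and parameter names are illustrative, and the I values are assumed to be computed elsewhere.

```python
def interaction_indicator(file_i, cached_i_values):
    # Ratio of the incoming file's current I value to the average I value
    # of all files currently in the cache. A value above 1 means the file
    # is more interaction-intensive than the cached average.
    avg_cached = sum(cached_i_values) / len(cached_i_values)
    return file_i / avg_cached
```

An incoming file whose indicator exceeds 1 is therefore a candidate to displace a cached file.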
where N and |FM| are the swarm size and the number of
dimensions, respectively. Denote the di of the globally best particle
as dg. Compare all di's and determine the maximum and
minimum distances dmax and dmin. Then, the evolutionary
factor f is defined by

f = (dg − dmin) / (dmax − dmin),

which lies in [0, 1].
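The computation of f can be sketched in a few lines. The mean-distance formula used below, di = (1/(N−1)) Σj ‖xi − xj‖, is the standard companion of this evolutionary factor; we assume it here since the surrounding equations are not reproduced above.

```python
import math

def mean_distances(positions):
    # d_i: mean Euclidean distance of particle i to all other particles.
    N = len(positions)
    return [sum(math.dist(xi, xj) for j, xj in enumerate(positions) if i != j)
            / (N - 1)
            for i, xi in enumerate(positions)]

def evolutionary_factor(positions, best_index):
    # f = (d_g - d_min) / (d_max - d_min); by construction f lies in [0, 1].
    d = mean_distances(positions)
    d_min, d_max = min(d), max(d)
    return (d[best_index] - d_min) / (d_max - d_min)
```

f is near 0 when the globally best particle sits inside the densest region of the swarm (convergence) and near 1 when it is far from the others (exploration).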
1) Hardware Failure
2) Streaming Data Access
3) Large Data Sets
4) Simple Coherency Model
5) Moving Computation is Cheaper than Moving Data
6) Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one
platform to another. This facilitates widespread adoption of
HDFS as a platform of choice for a large set of applications.
9 Data Organization
9.1 Data Blocks
HDFS is designed to support very large files. Applications that
are compatible with HDFS are those that deal with large data
sets. These applications write their data only once but they read
it one or more times and require these reads to be satisfied at
streaming speeds. HDFS supports write-once-read-many
semantics on files. A typical block size used by HDFS is 64
MB. Thus, an HDFS file is chopped up into 64 MB chunks,
and if possible, each chunk will reside on a different DataNode.
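The chopping rule amounts to the following sketch, taking the 64 MB default mentioned above as the block size:

```python
DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical HDFS block size

def chop_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    # Return the sizes of the chunks a file of file_size bytes is split
    # into: all chunks are full blocks except possibly the last one.
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes
```

For example, a 150 MB file occupies three blocks of 64 MB, 64 MB, and 22 MB, each of which can then be placed on a different DataNode.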
9.2 Staging
A client request to create a file does not reach the NameNode
immediately. In fact, initially the HDFS client caches the file
data into a temporary local file. Application writes are
transparently redirected to this temporary local file. When the
local file accumulates data worth over one HDFS block size,
the client contacts the NameNode. The NameNode inserts the
file name into the file system hierarchy and allocates a data
block for it. The NameNode responds to the client request with
the identity of the DataNode and the destination data block.
Then the client flushes the block of data from the local
temporary file to the specified DataNode. When a file is closed,
the remaining un-flushed data in the temporary local file is
transferred to the DataNode. The client then tells the NameNode
that the file is closed. At this point, the NameNode commits the
file creation operation into a persistent store.
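The staging behavior can be modeled with a small sketch. The class below is illustrative only: the real client negotiates blocks with the NameNode and streams them to DataNodes over RPC, neither of which is modeled here; flushed_blocks merely stands in for data shipped to a DataNode.

```python
class StagingClient:
    # Illustrative model of client-side staging: application writes land
    # in a local temporary buffer, and whenever at least one full block has
    # accumulated, that block is flushed out.
    def __init__(self, block_size=64 * 1024 * 1024):
        self.block_size = block_size
        self.local_buffer = bytearray()   # the temporary local file
        self.flushed_blocks = []          # stands in for a DataNode

    def write(self, data):
        self.local_buffer.extend(data)
        # Once a full block has accumulated, flush it.
        while len(self.local_buffer) >= self.block_size:
            self.flushed_blocks.append(bytes(self.local_buffer[:self.block_size]))
            del self.local_buffer[:self.block_size]

    def close(self):
        # On close, the remaining un-flushed data becomes the final block.
        if self.local_buffer:
            self.flushed_blocks.append(bytes(self.local_buffer))
            self.local_buffer.clear()
```

This buffering is what lets applications issue many small writes without generating network traffic for each one.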
10 Accessibility
HDFS can be accessed from applications in many different
ways. Natively, HDFS provides a Java API for applications to
use. A C language wrapper for this Java API is also available.
In addition, an HTTP browser can also be used to browse the
files of an HDFS instance. Work is in progress to expose HDFS
through the WebDAV protocol.
10.2 DFSAdmin
The DFSAdmin command set is used for administering an HDFS
cluster. These are commands that are used only by an HDFS
administrator. Here are some sample action/command pairs:

Put the cluster in Safemode: bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes: bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s): bin/hadoop dfsadmin -refreshNodes
Space Reclamation
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not
immediately removed from HDFS. Instead, HDFS first
renames it to a file in the /trash directory. The file can be
restored quickly as long as it remains in /trash. A file remains in
/trash for a configurable amount of time. After the expiry of its
life in /trash, the NameNode deletes the file from the HDFS
namespace. The deletion of a file causes the blocks associated
with the file to be freed. Note that there could be an appreciable
time delay between the time a file is deleted by a user and the
time of the corresponding increase in free space in HDFS.
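A minimal sketch of the delete/restore/expiry cycle is given below, with explicit timestamps in place of wall-clock time. The class and method names are ours; the real NameNode tracks /trash as part of the namespace rather than as a separate dictionary.

```python
TRASH_INTERVAL = 6 * 3600  # seconds; matches the 6-hour default policy

class TrashNamespace:
    def __init__(self):
        self.files = {}   # path -> file payload
        self.trash = {}   # path -> (payload, time of deletion)

    def delete(self, path, now):
        # Deleting moves the file into /trash rather than freeing it.
        self.trash[path] = (self.files.pop(path), now)

    def restore(self, path):
        # A file can be restored quickly as long as it remains in /trash.
        payload, _ = self.trash.pop(path)
        self.files[path] = payload

    def purge_expired(self, now):
        # The NameNode deletes files whose time in /trash has expired;
        # only at this point would the file's blocks be freed.
        expired = [p for p, (_, t) in self.trash.items()
                   if now - t > TRASH_INTERVAL]
        for p in expired:
            del self.trash[p]
        return expired
```

The gap between delete() and purge_expired() is exactly the "appreciable time delay" between deletion and the corresponding increase in free space noted above.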
10.1 FS Shell
HDFS allows user data to be organized in the form of files and
directories. It provides a command-line interface called FS shell
that lets a user interact with the data in HDFS. The syntax of
this command set is similar to other shells (e.g. bash, csh) that
users are already familiar with. Here are some sample
action/command pairs:

Create a directory named /foodir: bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir: bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt: bin/hadoop dfs -cat /foodir/myfile.txt
Currently, the default policy is to delete files from /trash that are
more than 6 hours old. In the future, this policy will be
configurable through a well-defined interface.
Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode
selects excess replicas that can be deleted. The next Heartbeat
transfers this information to the DataNode. The DataNode then
removes the corresponding blocks and the corresponding free
space appears in the cluster. Once again, there might be a time
delay between the completion of the setReplication API call
and the appearance of free space in the cluster.
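The bookkeeping reduces to choosing which replicas are excess. The sketch below simply keeps the first target_factor replicas; the real NameNode applies its own placement-aware policy, so this is an illustration of the mechanism, not of that policy.

```python
def select_excess_replicas(replica_nodes, target_factor):
    # Given the nodes currently holding replicas of a block and the new
    # (lower) replication factor, return the replicas to delete. Keeping
    # the first target_factor replicas is an arbitrary choice made for
    # the sake of illustration.
    if target_factor >= len(replica_nodes):
        return []
    return replica_nodes[target_factor:]
```

The returned list corresponds to the information carried to the DataNodes by the next Heartbeat, after which the DataNodes remove the blocks and the free space appears in the cluster.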
C. Equations
The equations are an exception to the prescribed
specifications of this template. You will need to determine
whether or not your equation should be typed using either the
Times New Roman or the Symbol font (please no other font).
To create multileveled equations, it may be necessary to treat
the equation as a graphic and insert it into the text after your
paper is styled.
Number equations consecutively. Equation numbers, within
parentheses, are to position flush right, as in (1), using a right
tab stop. To make your equations more compact, you may use
the solidus ( / ), the exp function, or appropriate exponents.
Italicize Roman symbols for quantities and variables, but not
Greek symbols. Use a long dash rather than a hyphen for a
minus sign. Punctuate equations with commas or periods when
they are part of a sentence, as in (1).
III. Prepare Your Paper Before Styling
Before you begin to format your paper, first write and save
the content as a separate text file. Keep your text and graphic
files separate until after the text has been formatted and styled.
Do not use hard tabs, and limit use of hard returns to only one
return at the end of a paragraph. Do not add any kind of
pagination anywhere in the paper. Do not number text heads; the template will do that for you.
Finally, complete content and organizational editing before
formatting. Please take note of the following items when
proofreading spelling and grammar:
B. Units
After the text edit has been completed, the paper is ready
for the template. Duplicate the template file by using the Save
As command, and use the naming convention prescribed by
your conference for the name of your paper. In this newly
created file, highlight all of the contents and import your
prepared text file. You are now ready to style your paper; use
the scroll down window on the left of the MS Word Formatting
toolbar.
ACKNOWLEDGMENT (Heading 5)
The preferred spelling of the word "acknowledgment" in
America is without an "e" after the "g". Avoid the stilted
expression "one of us (R. B. G.) thanks ...". Instead, try "R. B.
G. thanks ...". Put sponsor acknowledgments in the unnumbered
footnote on the first page.
REFERENCES
The template will number citations consecutively within
brackets [1]. The sentence punctuation follows the bracket [2].
Refer simply to the reference number, as in [3]; do not use
"Ref. [3]" or "reference [3]" except at the beginning of a
sentence: "Reference [3] was the first ..."
Number footnotes separately in superscripts. Place the
actual footnote at the bottom of the column in which it was
cited. Do not put footnotes in the reference list. Use letters for
table footnotes.