There are a handful of files for controlling the configuration of a Hadoop installation; the most important ones are
listed in Table 9-1. This section covers MapReduce 1, which employs the jobtracker and tasktracker daemons.
Running MapReduce 2 is substantially different, and is covered in YARN Configuration on page 318.
Table 9-1. Hadoop configuration files
Filename                     Format
hadoop-env.sh                Bash script
core-site.xml                Hadoop configuration XML
hdfs-site.xml                Hadoop configuration XML
mapred-site.xml              Hadoop configuration XML
masters                      Plain text
slaves                       Plain text
hadoop-metrics.properties    Java Properties
log4j.properties             Java Properties
These files are all found in the conf directory of the Hadoop distribution. The configuration directory can be
relocated to another part of the filesystem (outside the Hadoop
YARN Configuration
YARN is the next-generation architecture for running MapReduce (and is described in YARN (MapReduce 2) on
page 194). It has a different set of daemons and configuration options to classic MapReduce (also called
MapReduce 1), and in this section we shall look at these differences and how to run MapReduce on YARN.
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single resource manager running on
the same machine as the HDFS namenode (for small clusters) or on a dedicated machine, and node managers
running on each worker node in the cluster.
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the cluster. This script will start a
resource manager (on the machine the script is run on), and a node manager on each machine listed in the slaves
file.
YARN also has a job history server daemon that provides users with details of past job runs, and a web app proxy
server for providing a secure way for users to access the UI provided by YARN applications. In the case of
MapReduce, the web UI served by the proxy provides information about the current job you are running, similar to
the one described in The MapReduce Web UI on page 164. By default the web app proxy server runs in the same
process as the resource manager, but it may be configured to run as a standalone daemon.
YARN has its own set of configuration files, listed in Table 9-8; these are used in addition to those in Table 9-1.
Table 9-8. YARN configuration files
Filename         Format                      Description
yarn-env.sh      Bash script                 Environment variables that are used in the scripts to run YARN.
yarn-site.xml    Hadoop configuration XML    Configuration settings for YARN daemons: the resource manager, the job history server, the webapp proxy server, and the node managers.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>
The YARN resource manager address is controlled via yarn.resourcemanager.address, which takes the form of a host-port pair. In a client configuration, this property is used to connect to the resource manager (using RPC); in
addition, the mapreduce.framework.name property must be set to yarn for the client to use YARN rather than the local
job runner.
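As a rough illustration of the client-side settings just described, the two fragments below show one way they might look. The hostname resourcemanager is a placeholder, the port is simply the one from the default address 0.0.0.0:8040, and placing mapreduce.framework.name in mapred-site.xml is an assumption about file layout rather than something stated above.

```xml
<?xml version="1.0"?>
<!-- yarn-site.xml on the client; "resourcemanager" is a placeholder hostname -->
<configuration>
  <property>
    <!-- host-port pair the client uses to reach the resource manager over RPC -->
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
</configuration>
```

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml on the client (assumed location for this property) -->
<configuration>
  <property>
    <!-- select YARN instead of the local job runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```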
Although YARN does not honor mapred.local.dir, it has an equivalent property called yarn.nodemanager.local-dirs,
which allows you to specify which local disks to store intermediate data on. It is specified by a comma-separated
list of local directory paths, which are used in a round-robin fashion.
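A minimal sketch of such a setting follows, assuming a worker node with two data disks. The paths /disk1/nm-local-dir and /disk2/nm-local-dir are hypothetical and should be replaced with directories that exist on your own nodes.

```xml
<?xml version="1.0"?>
<!-- yarn-site.xml fragment; both directory paths are hypothetical examples -->
<configuration>
  <property>
    <!-- comma-separated list of local directories for intermediate data -->
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
  </property>
</configuration>
```

Listing one directory per physical disk is the usual motivation for giving more than one entry, since the directories are used in turn.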
YARN doesn't have tasktrackers to serve map outputs to reduce tasks, so for this function it relies on shuffle
handlers, which are long-running auxiliary services running in node managers. Since YARN is a general-purpose
service, the shuffle handlers need to be explicitly enabled in yarn-site.xml by setting the yarn.nodemanager.aux-services
property to mapreduce.shuffle.
Table 9-9 summarizes the important configuration properties for YARN.
Table 9-9. Important YARN daemon properties
Property name                       Type                               Default value
yarn.resourcemanager.address                                           0.0.0.0:8040
yarn.nodemanager.local-dirs         comma-separated directory names    /tmp/nm-local-dir
yarn.nodemanager.aux-services       comma-separated service names
                                    Int                                8192
yarn.nodemanager.vmem-pmem-ratio    Float                              2.1