www.edureka.in/hadoop-admin
Course Topics
Week 1
Understanding Big Data
A typical Hadoop Cluster
Hadoop Cluster Administrator: Roles and Responsibilities
Week 2
Hadoop 2.0
Hadoop Configuration files
Popular Hadoop Distributions
Week 3
Different Hadoop Server Roles
Data processing flow
Cluster Network Configuration
Week 4
Job Scheduling
Fair Scheduler
Monitoring a Hadoop Cluster
Week 5
Securing your Hadoop Cluster
Kerberos and HDFS Federation
Backup and Recovery
Week 6
Oozie and Hive Administration
HBase Architecture
HBase Administration
Topics for Today
Revision
Hadoop 2.0
Hadoop Configuration Files
Plan your Hadoop Cluster: Hardware Considerations
Plan your Hadoop Cluster: Software Considerations
Popular Hadoop Distributions
Let's Revise
Hadoop Core Components
Different Cluster Modes
Hadoop 1.0
[Diagram: a Client talks to two layers. HDFS: Name Node, Secondary Name Node, and Data Nodes holding data blocks. MapReduce: Job Tracker alongside the Name Node, and a Task Tracker alongside each Data Node.]
Hadoop 1.0 vs. Hadoop 2.0
Property           | Hadoop 1.x              | Hadoop 2.x
NameNodes          | 1                       | Many
High Availability  | Not present             | Highly available
Processing Control | JobTracker, TaskTracker | ResourceManager, NodeManager, App Master
Hadoop 2.0 HDFS Federation
http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html
[Diagram, left: a single Namenode provides the Namespace layer (NS and Block Management); Datanodes provide the Block Storage layer.]
[Diagram, right: with federation, Namenodes NN-1 ... NN-k ... NN-n each manage their own namespace (NS1 ... NSk ... NSn) and their own block pool (Pool 1 ... Pool k ... Pool n). Datanode 1 ... Datanode m form the common storage used by all block pools.]
Hadoop 2.0 HDFS NameNode High Availability
[Diagram: an Active Name Node and a Standby Name Node connected through shared edit logs, with four Data Nodes holding the data blocks. A Secondary Name Node is also drawn in the figure.]
Active Name Node: all namespace edits are logged to shared NFS storage; there is a single writer (fencing).
Standby Name Node: reads the edit logs and applies them to its own namespace.
Data Nodes: configured with the location of both Name Nodes; they send block location information and heartbeats to both.
Hadoop 2.0 : YARN or MapReduce 2.0 (MRv2)
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN = Yet Another Resource Negotiator
[Diagram: two Clients, one Resource Manager, and three Node Managers. Each Node Manager hosts Containers, and some also host an App Master. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.]
Hadoop 2.0
[Diagram: a Client talks to two layers. HDFS: Active NameNode and Standby NameNode connected through shared edit logs, a Secondary Name Node, and Data Nodes. YARN: Resource Manager, with a Node Manager on each Data Node hosting a Container and an App Master.]
Active NameNode: all namespace edits are logged to shared NFS storage; there is a single writer (fencing).
Standby NameNode: reads the edit logs and applies them to its own namespace.
Poll Questions
Hadoop 2.0 Configuration Files
Configuration Filename     | Description
hadoop-env.sh, yarn-env.sh | Settings for the Hadoop daemons' process environment.
core-site.xml              | Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml              | Configuration settings for the HDFS daemons: the Name Node and the Data Nodes.
yarn-site.xml              | Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml            | Configuration settings for MapReduce applications.
slaves                     | A list of machines (one per line) that each run a DataNode and a NodeManager.
Deprecated Properties
Deprecated Property Name | New Property Name
dfs.data.dir             | dfs.datanode.data.dir
dfs.http.address         | dfs.namenode.http-address
fs.default.name          | fs.defaultFS
The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated.
For example:
fs.default.name has been deprecated and replaced with fs.defaultFS for YARN in core-site.xml
dfs.nameservices has been added to enable NameNode High Availability in hdfs-site.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
In Hadoop 2.x.x (CDH4) release, you can use either the old or the new properties.
The old property names are now deprecated, but still work!
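The renaming above can be sketched in a few lines of Python. This is only an illustration: the mapping holds just the three pairs listed on this slide, the `modernize` helper is a hypothetical name, and the `/data/1/dfs/dn` path is a made-up sample value.

```python
# Deprecated-to-new property names, copied from the table on this slide.
DEPRECATED_TO_NEW = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def modernize(properties):
    """Return (new_properties, renamed_pairs) with deprecated names replaced.

    If both the old and the new name are present, the value stored under
    the new name wins and the deprecated entry is dropped.
    """
    result = {}
    renamed = []
    for name, value in properties.items():
        new_name = DEPRECATED_TO_NEW.get(name, name)
        if new_name != name:
            renamed.append((name, new_name))
            if new_name in properties:
                continue  # keep the value already set under the new name
        result[new_name] = value
    return result, renamed


old_config = {
    "fs.default.name": "hdfs://test.abc.in:8020/",
    "dfs.data.dir": "/data/1/dfs/dn",  # hypothetical path, for illustration
    "dfs.replication": "1",
}
new_config, renamed = modernize(old_config)
for old, new in renamed:
    print(f"{old} -> {new}")
```

Since the old names still work in Hadoop 2.x, a script like this is a convenience for cleaning up configuration files, not a required migration step.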
Runtime Environment
hadoop-env.sh and yarn-env.sh offer a way to provide custom parameters for each of the servers.
They are sourced by the Hadoop daemons' start/stop scripts.
Examples of environment variables that you can specify:
HADOOP_DATANODE_HEAPSIZE
YARN_HEAPSIZE
Set the JAVA_HOME parameter.
[Diagram labels: JVM, hadoop-env.sh, yarn-env.sh, MapReduce]
Configuration Files for Core Components
Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml
YARN: yarn-site.xml
core-site.xml and hdfs-site.xml
hdfs-site.xml
<?xml version="1.0"?>
<!--hdfs-site.xml-->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml
<?xml version="1.0"?>
<!--core-site.xml-->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
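To check that a site file is well-formed and to see which properties it sets, a minimal Python sketch along these lines may help. It assumes the simple `<configuration>`/`<property>`/`<name>`/`<value>` layout shown on this slide; `parse_site_xml` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Sample content modelled on the core-site.xml shown on this slide.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>
"""


def parse_site_xml(text):
    """Return {name: value} for every <property> in a *-site.xml document.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed,
    for example when a <property> tag is never closed.
    """
    root = ET.fromstring(text)
    if root.tag != "configuration":
        raise ValueError(f"expected <configuration>, found <{root.tag}>")
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            raise ValueError("<property> without <name>")
        value = prop.findtext("value", default="")
        props[name.strip()] = value.strip()
    return props


core = parse_site_xml(CORE_SITE)
print(core["fs.defaultFS"])
```

Running the parser on a file before restarting daemons is a cheap way to catch an unclosed tag or a missing `<name>`.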
mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>test.abc.in:10020</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/stable/mapred_tutorial.html
Notice the difference in the URLs for the current and stable releases.
yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>test.abc.in:8021</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Slaves
Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.
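As a small illustration of the one-host-per-line format, the following Python sketch reads a slaves file into a list. The hostnames are made up, `read_slaves` is a hypothetical helper, and skipping blank lines and `#` comments is an assumption of this sketch.

```python
def read_slaves(text):
    """Return the worker hostnames listed in a slaves file, one per line.

    Blank lines and '#' comments are skipped, and duplicates are dropped;
    that tolerance is an assumption of this sketch.
    """
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


# Hypothetical hostnames, for illustration only.
SLAVES = """\
dn1.abc.in
dn2.abc.in

dn3.abc.in
"""
hosts = read_slaves(SLAVES)
print(len(hosts), "worker hosts:", ", ".join(hosts))
```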
http://wiki.apache.org/hadoop/PoweredBy
Hadoop Cluster: Facebook
Hadoop Cluster: A Typical Use Case (Hadoop 1.0)
Name Node
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 32-bit CentOS

Secondary Name Node
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 x 10 Gb/s
OS: 32-bit CentOS

Data Node (two shown, identical)
RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 x 10 Gb/s
OS: 32-bit CentOS
Hadoop Cluster: Thinking About The Problem
Single Machine
Great for testing and developing.
Not a practical implementation for large amounts of data.

Hadoop Cluster

Small Cluster
Initially four or six nodes.
As the volume of data grows, more nodes can easily be added.

Large Cluster
Ways of deciding when the cluster needs to grow:
Increasing amount of computation power needed.
Increasing amount of data which needs to be stored.
Increasing amount of memory needed to process tasks.
Plan your Hadoop Cluster: Hardware

Master Hardware
Namenode requirements: RAM to fit metadata; modest but dedicated disk.
Secondary Namenode: almost identical to Namenode.
Resource Manager: retains job data, memory hungry; memory requirements can grow independent of cluster size.

Slave Hardware
Storage
Computation

Cluster Sizing
Usage pattern and workloads: IO-bound or CPU-bound.
Consider requirements for additional components such as HBase.
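For the storage side of cluster sizing, a rough back-of-the-envelope estimate can be sketched in Python. It uses the 6 x 2 TB Data Node profile from the typical use case earlier in the deck and HDFS's default replication factor of 3. The 25% reserve for non-HDFS use (operating system, intermediate job output) is an assumption chosen for illustration, and both function names are hypothetical.

```python
import math


def usable_hdfs_capacity_tb(data_nodes, disks_per_node, disk_tb,
                            replication=3, non_dfs_reserve=0.25):
    """Estimate how much user data a cluster can hold, in TB.

    raw    = nodes * disks per node * disk size
    usable = raw * (1 - reserve) / replication
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= non_dfs_reserve < 1:
        raise ValueError("non_dfs_reserve must be in [0, 1)")
    raw_tb = data_nodes * disks_per_node * disk_tb
    return raw_tb * (1 - non_dfs_reserve) / replication


def data_nodes_needed(data_tb, disks_per_node, disk_tb,
                      replication=3, non_dfs_reserve=0.25):
    """Smallest number of Data Nodes whose usable capacity covers data_tb."""
    per_node = usable_hdfs_capacity_tb(1, disks_per_node, disk_tb,
                                       replication, non_dfs_reserve)
    return math.ceil(data_tb / per_node)


# Six Data Nodes with 6 x 2 TB disks each: 72 TB raw.
print(usable_hdfs_capacity_tb(6, 6, 2))   # 18.0
# Data Nodes needed to hold 100 TB of user data.
print(data_nodes_needed(100, 6, 2))       # 34
```

With these assumptions each Data Node contributes 3 TB of usable space out of 12 TB raw, which is why storage growth is one of the signals for adding nodes.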
Plan your Hadoop Cluster: Software

Operating System
Linux is the only production-quality option today.
A significant number of clusters run on RHEL.

Java
JDK: the most critical software.
List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions
Java 1.6.x

Operating System utilities
ssh
cron
rsync
ntp
Choose a Distribution and Version of Hadoop
Popular Hadoop Distributions
Apache Hadoop
Complex cluster setup
Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase, etc.
No commercial support
Good for a first try

Cloudera
Established distribution with many referenced deployments
Powerful tools for deployment, management and monitoring, such as Cloudera Manager

Hortonworks
Only distribution without any modification in Apache Hadoop
HCatalog for metadata
Stinger for Hive

MapR
Supports native Unix filesystem
HA features such as snapshots, mirroring or stateful failover

Amazon Elastic MapReduce (EMR)
Hosted solution
Only Pig and Hive are available as of now
Assignments Status
Attempt the following Assignments using the documents present in the LMS:
Install single-node Apache Hadoop 2.0 using a Virtual Machine in VMPlayer or VirtualBox.
Thank You
See You in Class Next Week
