Demystifying Hadoop 2.0 - Part 1

www.edureka.in/hadoop-admin
Course Topics
Week 1
  Understanding Big Data
  A typical Hadoop Cluster
  Hadoop Cluster Administrator: Roles and Responsibilities
Week 2
  Hadoop 2.0
  Hadoop Configuration files
  Popular Hadoop Distributions
Week 3
  Different Hadoop Server Roles
  Data processing flow
  Cluster Network Configuration
Week 4
  Job Scheduling
  Fair Scheduler
  Monitoring a Hadoop Cluster
Week 5
  Securing your Hadoop Cluster
  Kerberos and HDFS Federation
  Backup and Recovery
Week 6
  Oozie and Hive Administration
  HBase Architecture
  HBase Administration
Topics for Today
Revision
Hadoop 2.0
Hadoop Configuration Files
Plan your Hadoop Cluster: Hardware Considerations
Plan your Hadoop Cluster: Software Considerations
Hadoop Core Components
Different Cluster Modes
Let's Revise
Hadoop 1.0
[Figure: Hadoop 1.0 architecture. Labels: Client; HDFS (Name Node, Secondary Name Node, Data Nodes, data blocks); MapReduce (Job Tracker, Task Trackers).]
Hadoop 1.0 Vs. Hadoop 2.0
Property             Hadoop 1.x                  Hadoop 2.x
NameNodes            1                           Many
High Availability    Not present                 Highly Available
Processing Control   JobTracker, Task Tracker    Resource Manager, Node Manager, App Master
Hadoop 2.0 HDFS Federation
http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html
[Figure: HDFS Federation. Left: a single Namenode providing the namespace (NS) and block management, with block storage on the Datanodes. Right: Namenodes NN-1 ... NN-k ... NN-n, each with its own namespace (NS1 ... NSk ... NSn) and block pool (Pool 1 ... Pool k ... Pool n); the block pools use common storage on Datanode 1, Datanode 2, ... Datanode m.]
Hadoop 2.0 HDFS NameNode High Availability
[Figure: NameNode High Availability. Labels: Active Name Node, Standby Name Node, shared edit logs, Secondary Name Node, four Data Nodes, data blocks.]
Active Name Node: all namespace edits are logged to shared NFS storage, with a single writer (fencing).
Standby Name Node: reads the edit logs and applies them to its own namespace.
Data Nodes: configured with the location of both Name Nodes; they send block location information and heartbeats to both.
Hadoop 2.0 : YARN or MapReduce 2.0 (MRv2)
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN = Yet Another Resource Negotiator
[Figure: YARN architecture. Labels: Clients, Resource Manager, Node Managers, Containers, App Masters. Arrows: Job Submission, Resource Request, Node Status, MapReduce Status.]
Hadoop 2.0
[Figure: Hadoop 2.0 overview. Labels: Client; HDFS (Active NameNode, Standby NameNode, shared edit logs, Secondary Name Node, Data Nodes); YARN (Resource Manager, four Node Managers, each with a Container and an App Master).]
Active NameNode: all namespace edits are logged to shared NFS storage, with a single writer (fencing).
Standby NameNode: reads the edit logs and applies them to its own namespace.
Poll Questions
Hadoop 2.0 Configuration Files
Configuration Filename        Description
hadoop-env.sh, yarn-env.sh    Settings for the Hadoop daemons' process environment.
core-site.xml                 Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml                 Configuration settings for the HDFS daemons: the Name Node and the Data Nodes.
yarn-site.xml                 Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml               Configuration settings for MapReduce applications.
slaves                        A list of machines (one per line) that each run a DataNode and a NodeManager.
Hadoop 2.0 Configuration Files
Deprecated Properties
Deprecated Property Name    New Property Name
dfs.data.dir                dfs.datanode.data.dir
dfs.http.address            dfs.namenode.http-address
fs.default.name             fs.defaultFS
The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated.
For example:
  fs.default.name has been deprecated and replaced with fs.defaultFS for YARN in core-site.xml.
  dfs.nameservices has been added to enable NameNode High Availability in hdfs-site.xml.
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
In the Hadoop 2.x.x (CDH4) release, you can use either the old or the new property names.
The old property names are now deprecated, but they still work.
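The renaming in the deprecated-properties table can be sketched as a small lookup. This is only an illustration: the function name `modernize` and the rule that the new name wins when both are present are assumptions of this sketch, not Hadoop behavior described in the slides.

```python
# Mapping copied from the deprecated-properties table in this deck.
DEPRECATED_TO_NEW = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def modernize(props):
    """Return a copy of props with deprecated property names renamed.

    Assumption of this sketch: if both the old and the new name are
    present, the value given under the new name is kept.
    """
    result = {}
    for name, value in props.items():
        new_name = DEPRECATED_TO_NEW.get(name, name)
        if name != new_name and new_name in result:
            continue  # the new name was already set explicitly; keep it
        result[new_name] = value
    return result


if __name__ == "__main__":
    old_style = {"fs.default.name": "hdfs://test.abc.in:8020/"}
    print(modernize(old_style))
```

A helper like this could be run over an old configuration before moving it to a Hadoop 2.x cluster, so that the files no longer rely on deprecated names.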
Runtime Environment
Offers a way to provide custom parameters for each of the servers.
Sourced by the Hadoop Daemons start/stop scripts.
Examples of environment variables that you can specify:
HADOOP_DATANODE_HEAPSIZE
YARN_HEAPSIZE
Set the JAVA_HOME parameter (the JVM to use) in hadoop-env.sh and yarn-env.sh.
Configuration Files for Core Components
Component     Configuration file
Core          core-site.xml
HDFS          hdfs-site.xml
MapReduce     mapred-site.xml
YARN          yarn-site.xml
core-site.xml and hdfs-site.xml
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
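All of the *-site.xml files share the same layout: a configuration element containing property entries, each with a name and a value. A minimal sketch of reading such a file with Python's standard library follows. The helper name `read_properties` is hypothetical, and the embedded XML simply repeats the core-site.xml example from this slide.

```python
import xml.etree.ElementTree as ET

# Same content as the core-site.xml example on this slide.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing <value> is treated as an empty string in this sketch.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


if __name__ == "__main__":
    print(read_properties(CORE_SITE))
```

A check like this is a quick way to catch malformed files, such as a property element that is never closed, before the daemons are started.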
mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>test.abc.in:10020</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/stable/mapred_tutorial.html
Notice the difference in the URLs for the current and stable releases.
yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>test.abc.in:8021</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Slaves
Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.
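Because the slaves file is just one hostname per line, it is easy to read from a script. The sketch below is illustrative only: the function name `parse_slaves` and the hostnames are hypothetical, and skipping blank lines and `#` comments is an assumption about how the file is usually written.

```python
def parse_slaves(text):
    """Return the list of slave hostnames from the contents of a slaves file.

    Assumption of this sketch: blank lines are ignored and anything after
    a '#' on a line is treated as a comment.
    """
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    # Hypothetical hostnames, one per line.
    sample = "dn1.test.abc.in\ndn2.test.abc.in\n"
    print(parse_slaves(sample))
```

The resulting list could then be used, for example, to check that every listed host is reachable over ssh before starting the cluster.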
http://wiki.apache.org/hadoop/PoweredBy
Hadoop Cluster: Facebook
Hadoop Cluster: A Typical Use Case (Hadoop 1.0)
Name Node
  RAM: 64 GB
  Hard disk: 1 TB
  Processor: Xeon with 8 cores
  OS: 32-bit CentOS
Secondary Name Node
  RAM: 32 GB
  Hard disk: 1 TB
  Processor: Xeon with 4 cores
  OS: 32-bit CentOS
Data Node
  RAM: 16 GB
  Hard disk: 6 x 2 TB
  Processor: Xeon with 2 cores
  Ethernet: 3 x 10 GB/s
  OS: 32-bit CentOS
Data Node
  RAM: 16 GB
  Hard disk: 6 x 2 TB
  Processor: Xeon with 2 cores
  OS: 32-bit CentOS
Hadoop Cluster: Thinking About The Problem
Single Machine
  Great for testing and developing.
  Not a practical implementation for large amounts of data.
Hadoop Cluster (Small Cluster to Large Cluster)
  Initially four or six nodes.
  As the volume of data grows, more nodes can easily be added.
Ways of deciding when the cluster needs to grow
  Increasing amount of computation power needed.
  Increasing amount of data which needs to be stored.
  Increasing amount of memory needed to process tasks.
Plan your Hadoop Cluster: Hardware
Master Hardware
  Namenode requirements
    RAM to fit metadata
    Modest but dedicated disk
  Secondary Namenode
    Almost identical to Namenode
  Resource Manager
    Retains job data; memory hungry
    Memory requirements can grow independent of cluster size
Slave Hardware
  Storage
  Computation
Cluster Sizing
  Usage pattern and workloads
  IO-bound or CPU-bound
  Consider requirements for additional components such as HBase
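For the storage side of cluster sizing, a back-of-the-envelope estimate can help. The sketch below is a rough illustration under stated assumptions, not a sizing rule from this course: the function name `datanodes_needed` is hypothetical, the 12 TB per node comes from the 6 x 2 TB Data Node in the use-case slide, the default replication factor of 3 is the HDFS default for dfs.replication, and the 75% usable fraction (headroom for intermediate data and the OS) is an assumption.

```python
import math


def datanodes_needed(data_tb, replication=3, disk_per_node_tb=12.0,
                     usable_fraction=0.75):
    """Estimate how many Data Nodes are needed to store data_tb of data.

    replication:      HDFS replication factor (dfs.replication).
    disk_per_node_tb: raw disk per Data Node, e.g. 6 x 2 TB = 12 TB.
    usable_fraction:  assumed share of raw disk available to HDFS blocks.
    """
    raw_tb = data_tb * replication              # every block is stored `replication` times
    usable_per_node_tb = disk_per_node_tb * usable_fraction
    return math.ceil(raw_tb / usable_per_node_tb)


if __name__ == "__main__":
    print(datanodes_needed(100))
```

Storage is only one dimension; CPU-bound or memory-bound workloads, or extra components such as HBase, may call for more nodes than this estimate suggests.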
Plan your Hadoop Cluster: Software
Operating System
  Linux is the only production quality option today.
  A significant number run on RHEL.
Java
  JDK: the most critical software
  List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions
  Java 1.6.x
Operating System utilities
  ssh
  cron
  rsync
  ntp
Choose a Distribution and Version of Hadoop
Apache Hadoop
  Complex cluster setup
  Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase, etc.
  No commercial support
  Good for a first try
Cloudera
  Established distribution with many referenced deployments
  Powerful tools for deployment, management and monitoring, such as Cloudera Manager
HortonWorks
  Only distribution without any modification in Apache Hadoop
  HCatalog for metadata
  Stinger for Hive
MapR
  Supports native Unix filesystem
  HA features such as snapshots, mirroring or stateful failover
Amazon Elastic Map Reduce (EMR)
  Hosted solution
  Only Pig and Hive are available as of now
Assignments Status
Attempt the following Assignments using the documents present in the LMS:
Install single-node Apache Hadoop 2.0 using a Virtual Machine in VMPlayer or VirtualBox.
Thank You
See You in Class Next Week
