
Getting Started with Hadoop

Who We Are

Mission: To help organizations profit from their data

How We Do It
We deliver relevant products and services.
  A distribution of Apache Hadoop that is tested, certified and supported
  Comprehensive support and professional service offerings
  A suite of management software for Hadoop operations
  Training and certification programs for developers, administrators, managers and data scientists

Credentials
The Apache Hadoop experts.
  Number 1 distribution of Apache Hadoop in the world
  Largest contributor to the open source Hadoop ecosystem
  More committers on staff than any other company
  More than 100 customers across a wide variety of industries
  Strong growth in revenue and new accounts

Technical Team
Unmatched knowledge and experience.
  Founders, committers and contributors to Hadoop
  A wealth of experience in the design and delivery of production software

Leadership
Strong executive team with proven abilities.
  Mike Olson, CEO
  Kirk Dunn, COO
  Charles Zedlewski, VP, Product
  Mary Rorabaugh, CFO
  Jeff Hammerbacher, Chief Scientist
  Amr Awadallah, VP Engineering
  Doug Cutting, Chief Architect
  Omer Trajman, VP, Customer Solutions

2011 Cloudera, Inc. All Rights Reserved.
Users of Cloudera

Financial, Retail & Consumer, Web, Telecom, Media

What is Apache Hadoop?

Hadoop is a platform for data storage and processing that is
  Scalable
  Fault tolerant
  Open source

CORE HADOOP COMPONENTS
  Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
  MapReduce: Distributed Computing Across Physical Servers

Flexibility
  A single repository for storing, processing & analyzing any type of data
  Not bound by a single schema

Scalability
  Scale-out architecture divides workloads across multiple nodes
  Flexible file system eliminates ETL bottlenecks

Low Cost
  Can be deployed on commodity hardware
  Open source platform guards against vendor lock-in

What Makes Hadoop Different?

Ability to scale out to petabytes in size using commodity hardware
Processing (MapReduce) jobs are sent to the data, rather than shipping the data to be processed
Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
Manages fault tolerance and data replication automatically

Why the Need for Hadoop?

1.8 trillion gigabytes of data was created in 2011
More than 90% is unstructured data
Approx. 500 quadrillion files
Quantity doubles every 2 years

[Chart: gigabytes of data created (in billions), 2005 to 2015, structured data vs. unstructured data]

Source: IDC 2011

Hadoop Use Cases
ADVANCED ANALYTICS             INDUSTRY          DATA PROCESSING
Social Network Analysis        Web               Clickstream Sessionization
Content Optimization           Media             Clickstream Sessionization
Network Analytics              Telco             Mediation
Loyalty & Promotions Analysis  Retail            Data Factory
Fraud Analysis                 Financial         Trade Reconciliation
Entity Analysis                Federal           SIGINT
Sequencing Analysis            Bioinformatics    Genome Mapping

Hadoop in the Enterprise

OPERATORS: Management Tools
ENGINEERS: IDEs
ANALYSTS: BI / Analytics
BUSINESS USERS: Enterprise Reporting
CUSTOMERS: Web Application

Enterprise Data Warehouse

Data sources: Logs, Files, Web Data, Relational Databases

What is CDH?

Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is
  100% Apache open source
  Contains all components needed for deployment
  Fully documented and supported
  Released on a reliable schedule

Fastest Path to Success
  No need to write your own scripts or do integration testing on different components
  Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable
  Extensive Cloudera QA systems, software & processes
  Tested & run in production at scale
  Proven at scale in dozens of enterprise environments

Community Driven
  Incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings
  FREE

Cloudera's Commitment to the Open Source Community

Component    Cloudera Committers    Cloudera Founder    2011 Commits
Common       6                      Yes                 #1
HDFS         6                      Yes                 #2
MapReduce    5                      Yes                 #1
HBase        2                      No                  #2
ZooKeeper    1                      Yes                 #2
Oozie        1                      Yes                 #1
Pig          0                      No                  #3
Hive         1                      No                  #2
Sqoop        2                      Yes                 #1
Flume        3                      Yes                 #1
Hue          3                      Yes                 #1
Snappy       2                      No                  #1
Bigtop       8                      Yes                 #1
Avro         4                      Yes                 #1
Whirr        2                      Yes                 #1

Components of CDH

Cloudera Enterprise

User Interface: HUE
Workflow: APACHE OOZIE
Scheduling: APACHE OOZIE
File System Mount: FUSE-DFS
Languages / Compilers: APACHE PIG, APACHE HIVE
Data Integration: APACHE FLUME, APACHE SQOOP
Fast Read/Write Access: APACHE HBASE
Coordination: APACHE ZOOKEEPER

Hadoop Distributed File System

Block Size = 64MB
Replication Factor = 3
Cost is $400-$500/TB

[Diagram: a file split into blocks 1 to 5, with each block stored on three different HDFS nodes]
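
The block size and replication factor above determine how much a file costs to store. The following is a minimal sketch of that arithmetic, not an HDFS API: the function name and the 300 MB example file are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 64        # block size quoted on the slide
REPLICATION_FACTOR = 3    # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Hypothetical helper for illustration only.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # The last block only occupies the bytes it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


# A 300 MB file (assumed size) needs 5 blocks, matching the 5 blocks in the diagram.
print(hdfs_footprint(300))
```

With these settings every terabyte of user data consumes roughly three terabytes of raw disk, which is worth keeping in mind when reading the cost per TB figure.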

Components of Hadoop

NameNode: Holds all metadata for HDFS
  Needs to be a highly reliable machine
  RAID drives, typically RAID 10
  Dual power supplies
  Dual network cards, bonded
  The more memory the better, typically 36GB to 64GB
Secondary NameNode: Provides checkpointing for the NameNode. The same hardware as the NameNode should be used

Components of Hadoop

DataNodes: Hardware will depend on the specific needs of the cluster
No RAID needed; JBOD (just a bunch of disks) is used
Typical ratio is:
  1 hard drive
  2 cores
  4GB of RAM
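
As a rough illustration of the ratio above, the sketch below scales it to a whole node. The function name and the six-drive example are assumptions; real sizing depends on the workload, as the slide notes.

```python
def datanode_spec(hard_drives):
    """Scale the 1 drive : 2 cores : 4GB RAM ratio to a node (illustrative helper)."""
    return {
        "hard_drives": hard_drives,
        "cores": 2 * hard_drives,
        "ram_gb": 4 * hard_drives,
    }


# A hypothetical DataNode with six data drives.
print(datanode_spec(6))
```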

Networking

One of the most important things to consider when setting up a Hadoop cluster
Typically a top-of-rack switch is used with Hadoop, together with a core switch
Be careful about oversubscribing the backplane of the switch!
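
Oversubscription can be checked with simple arithmetic: compare the bandwidth the ports can demand with the capacity available to carry it. This is a minimal sketch; the function name and the port and backplane figures are hypothetical and should be replaced with the numbers from the switch's data sheet.

```python
def oversubscription_ratio(ports, port_gbps, capacity_gbps):
    """Ratio of total port bandwidth to available capacity (illustrative helper).

    A value above 1.0 means the ports can demand more than the switch can carry.
    """
    return (ports * port_gbps) / capacity_gbps


# Assumed example: 48 ports at 1 Gbps each against a 32 Gbps backplane.
ratio = oversubscription_ratio(ports=48, port_gbps=1, capacity_gbps=32)
print(ratio)
```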

Map

Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).

map() produces one or more intermediate values along with an output key from the input.

[Diagram: (key, values) inputs -> Map Task -> (key, intermediate values) -> Shuffle Phase -> Reduce Task -> Final (key, values)]

Reduce

After the map phase is over, all the intermediate values for a given output key are combined together into a list.

reduce() combines those intermediate values into one or more final values for that same output key.

[Diagram: (key, values) inputs -> Map Task -> (key, intermediate values) -> Shuffle Phase -> Reduce Task -> Final (key, values)]
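
The map, shuffle and reduce steps described on these two slides can be simulated in a few lines of plain Python. This is a single-process sketch of the programming model using word count as the example, not Hadoop's Java API; the function names and sample records are assumptions.

```python
from collections import defaultdict


def map_fn(filename, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: collect all intermediate values for each output key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: combine the list of intermediate values into one final value."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce over (filename, line) records in one process."""
    intermediate = [
        pair for filename, line in records for pair in map_fn(filename, line)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


# Sample input in the (filename, line) form mentioned on the Map slide.
records = [("a.txt", "to be or not to be"), ("b.txt", "To err is human")]
print(run_job(records))
```

In a real cluster the map calls run in parallel next to the data and the shuffle moves data across the network; the sketch only shows how keys and values flow between the phases.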

MapReduce Execution

Sqoop
SQL to Hadoop
Tool to import/export any JDBC-supported database into Hadoop
Transfer data between Hadoop and external databases or EDW
High performance connectors for some RDBMS
Developed at Cloudera
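
As an illustration of an import, the sketch below assembles a `sqoop import` command line in Python without running it. The flags shown (`--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 options; the helper name, JDBC URL, user, table and HDFS path are hypothetical placeholders.

```python
def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for a Sqoop table import (illustrative helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC connection string of the source database
        "--username", username,
        "--table", table,                   # table to copy into Hadoop
        "--target-dir", target_dir,         # HDFS directory that receives the data
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


# Hypothetical database and paths, for illustration only.
cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "etl", "orders", "/data/orders"
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where Sqoop is installed.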

Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
Design goals: Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera

Flume: high-level architecture

Master sends configuration to all Agents
Configurable levels of reliability: guarantee delivery in event of failure
Deployable, centrally administered
Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
Flexibly deploy decorators (compress, batch, encrypt) at any step to improve performance, reliability or security
Parallelized writes across many collectors: as much write throughput as ...

[Diagram: MASTER configures Agents; Agents feed Processors, which feed Collector(s)]

HBase

Column-family store, based on the design of Google Bigtable
Provides interactive access to information
Holds extremely large datasets (multi-TB)
Constrained access model
  (key, value) lookup
  Limited transactions (only one row)
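
To make the access model concrete, here is a toy in-memory model of a column-family store. It is not the HBase client API: the class, its methods and the sample data are all assumptions, and it only shows how a value is addressed by row key, column family and column qualifier.

```python
class TinyColumnFamilyStore:
    """Toy model: row key -> column family -> qualifier -> value (illustration only)."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers can be added freely.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        """Write one cell; each write touches a single row."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family=None):
        """Key lookup: return the whole row, or just one column family of it."""
        row = self.rows.get(row_key, {})
        return row if family is None else row.get(family, {})


store = TinyColumnFamilyStore(["info", "stats"])
store.put("user42", "info", "name", "Ann")
store.put("user42", "stats", "logins", 7)
print(store.get("user42"))
```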

Hive

SQL-based data warehousing application
Language is SQL-like
  Supports SELECT, JOIN, GROUP BY, etc.
Features for analyzing very large data sets
  Partition columns, Sampling, Buckets

Example:
  SELECT s.word, s.freq, k.freq
  FROM shakespeare s JOIN kjv k ON (s.word = k.word)
  WHERE s.freq >= 5;

Pig

Data-flow oriented language: Pig Latin
Datatypes include sets, associative arrays, tuples
High-level language for routing data; allows easy integration of Java for complex tasks

Example:
  emps = LOAD 'people.txt' AS (id, name, salary);
  rich = FILTER emps BY salary > 100000;
  srtd = ORDER rich BY salary DESC;
  STORE srtd INTO 'rich_people.txt';
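
For readers who do not know Pig Latin, the same load, filter and order data flow can be sketched in plain Python. This is a single-machine illustration, not how Pig executes the script; the function name and sample rows are assumptions, and the input is taken to be tab-separated, which is Pig's default field delimiter.

```python
def rich_people(lines, threshold=100000):
    """Mirror the Pig example: LOAD, FILTER by salary, ORDER BY salary DESC."""
    # LOAD: split each tab-separated line into (id, name, salary).
    emps = [
        (id_, name, int(salary))
        for id_, name, salary in (line.rstrip("\n").split("\t") for line in lines)
    ]
    # FILTER: keep only salaries above the threshold.
    rich = [emp for emp in emps if emp[2] > threshold]
    # ORDER: highest salary first.
    return sorted(rich, key=lambda emp: emp[2], reverse=True)


# Hypothetical contents of people.txt.
sample = ["1\tAnn\t120000\n", "2\tBob\t90000\n", "3\tCy\t250000\n"]
print(rich_people(sample))
```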

Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop

ZooKeeper

ZooKeeper is a distributed consensus engine
Provides well-defined concurrent access semantics:
  Leader election
  Service discovery
  Distributed locking / mutual exclusion
  Message board / mailboxes

Pipes and Streaming

Multi-language connector libraries for MapReduce


Write native-code MapReduce in C++
Write MapReduce passes in any scripting language,
including
Perl
Python
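
A Streaming job exchanges records with the script over standard input and output, one record per line, with key and value separated by a tab, and the reducer receives its input sorted by key. The sketch below shows a word-count mapper and reducer written as Python functions and run locally on sample lines; the function names and sample input are assumptions, and in a real job each function would live in its own script reading `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() stands in for the framework's sort between map and reduce.
mapped = sorted(mapper(["a b a", "b c"]))
for record in reducer(mapped):
    print(record)
```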

FUSE - DFS

Allows mounting of HDFS volumes via the Linux FUSE file system
Does allow easy integration with other systems for data import/export
Does not imply HDFS can be used as a general-purpose file system

Hadoop Security

Authentication is secured by Kerberos v5 and integrated with LDAP


Hadoop server can ensure that users and groups are who they say they are
Job Control includes Access Control Lists, which means Jobs can specify who
can view logs, counters, configurations and who can modify a job
Tasks now run as the user who launched the job

Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
  Simplify and Accelerate Hadoop Deployment
  Reduce Adoption Costs and Risks
  Lower the Cost of Administration
  Increase the Transparency and Control of Hadoop
  Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
  Cloudera Manager: End-to-End Management Application for Apache Hadoop
  Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production

Cloudera Manager

The industry's first

for Apache Hadoop

the Automates the


Apache Hadoop stack of Apache Hadoop

HDFS MAPREDUCE HBASE


DISCOVER DIAGNOSE ACT OPTIMIZE

ZOOKEEPER OOZIE HUE

Cloudera Enterprise

Including Cloudera Support

Feature: Benefit

Flexible Support Windows
  Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks
  Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes
  Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase
  Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors
  Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events
  Stay up to speed with what's going on in the Apache Hadoop community

Cloudera University

Public and Private Training to Enable Your Success

Class: Description

Developer Training & Certification (4 Days)
  Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 Days)
  Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days)
  Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days)
  Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day)
  Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as "when is Hadoop appropriate?", "what are people using Hadoop for?" and "what do I need to know about choosing Hadoop?"

Cloudera Consulting Services
Put Our Expertise To Work For You.

Cloudera's team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.

Service: Description

Use Case Discovery
  Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment
  Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept
  Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot
  Deploy your first production-level project using Hadoop
Process and Team Development
  Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification
  Perform periodic health checks to certify and tune up existing Hadoop clusters

Journey of the Cloudera Customer

Discover the Benefits of Apache Hadoop
  Flexibility to store and mine all types of data
Cloudera's Distribution
  The fastest, surest path to success with Apache Hadoop
Subscribe to Cloudera Enterprise
  Simplify and accelerate Apache Hadoop deployment

Cloudera in Production

Cloudera Services: Consulting Services, Cloudera University

OPERATORS: Management Tools
ENGINEERS: IDEs
ANALYSTS: BI / Analytics
BUSINESS USERS: Enterprise Reporting
CUSTOMERS: Web Application

Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
Cloudera's Distribution Including Apache Hadoop (CDH) & SCM Express

Enterprise Data Warehouse
Operational Rules Engines

Data sources: Logs, Files, Web Data, Relational Databases

Get Hadoop
Cloudera helps you profit from all your data.

+1 (888) 789-1488
sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera

Cloudera Manager

The Hadoop management application that:

Manages the

Manages and monitors the

Incorporates comprehensive

Has built-in

Cloudera Manager

Key and

[ONLY CLOUDERA] Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
[ONLY CLOUDERA] Set server roles, configure services and manage security across the cluster
Gracefully start, stop and restart services as needed
[ONLY CLOUDERA] Maintains a complete record of configuration changes for SOX compliance
[ONLY CLOUDERA] Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
[ONLY CLOUDERA] Gather, view and search Hadoop logs collected from across the cluster
Scans Hadoop logs for irregularities and warns you before they impact the cluster

Cloudera Manager

Key and
[ONLY CLOUDERA] Establishes the time context globally for almost all views
Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
[ONLY CLOUDERA] Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
[ONLY CLOUDERA] Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Generates email alerts when certain events occur
[ONLY CLOUDERA] Visualize current and historical disk usage by user, group and directory
Track MapReduce activity on the cluster by job or user
View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles

Two Editions: FREE EDITION ENTERPRISE EDITION**

Max Number of Nodes Supported 50 Unlimited

Automated Deployment

Host-Level Monitoring

Secure Communication Between Server & Agents

Configuration Management

Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper

Audit Trails

Start/Stop/Restart Services

Add/Restart/Decommission Role Instances

Configuration Versioning & History

Support for Kerberos

Service Monitoring

Proactive Health Checks

Status & Health Summary

Intelligent Log Management

Events Management & Alerts

Activity Monitoring

Operational Reporting

Global Time Control

Support Integration

** Part of the Cloudera Enterprise subscription

View Service Health and Performance

Get Host-Level Snapshots

Monitor and Diagnose Cluster Workloads

Gather, View and Search Hadoop Logs

Track Events From Across the Cluster

Run Reports on System Performance & Usage

New in Cloudera Manager 3.7
Proactive Health Checks [ONLY CLOUDERA]
  Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
Intelligent Log Management [ONLY CLOUDERA]
  Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
Global Time Control [ONLY CLOUDERA]
  Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
Support Integration [ONLY CLOUDERA]
  Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Event Management [ONLY CLOUDERA]
  Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Alerts
  Generates email alerts when certain events occur
Audit Trails [ONLY CLOUDERA]
  Maintains a complete record of configuration changes for SOX compliance
Operational Reporting [ONLY CLOUDERA]
  Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user

Cloudera Support

Our on call to help you meet your SLAs

Feature: Benefit

Flexible Support Windows
  Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks
  Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes
  Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase
  Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors
  Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
Proactive Notification of New Developments and Events
  Stay up to speed with what's going on in the Apache Hadoop community

Cloudera Enterprise

The Fastest Path to Success Running Apache Hadoop in Production.

Why Cloudera Enterprise?
  Apache Hadoop is a distributed system that presents unique operational challenges
  The fixed cost of managing an internal patch and release infrastructure is prohibitive
  Apache Hadoop skills and expertise are scarce
  It's challenging to track consistently to community development efforts

Only Cloudera Enterprise
  Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
  Has production support backed by the Apache committers
  Has the depth of experience supporting hundreds of production Apache Hadoop clusters
