Big Data Training

BIG DATA TRAINING
Concepts, Use Cases, Architecture, Hands-on

Training & Exercises
Course Overview and Expectations
This training provides a comprehensive overview of Big Data
and in-depth discussion of Big Data from the perspective of the Day 2 Duration
current market trends. The training will go over key Big Data Industry use cases end to end 10.00 - 11.30 AM
use cases in financial services, Big Data concepts and the
Hadoop technology architecture used to address Big Data issues. Break 11:30 11:45 AM
Upon completion of this training, participants should be able to: Hadoop Installation & HDFS Hands-on 11.45 - 1.00 PM
Understand what Big Data means and how Big Data issues
are manifesting themselves in financial services. Lunch Break 1.00 to 2.00 PM
Understand the Big Data reference architecture. Map Reduce Deep Dive 2.00 - 3.00 PM
Get an in-depth understanding of the Hadoop architecture
platform for Big Data. Map Reduce Hands on development & Deployment 3.00 - 4.30 PM
Understand how to build solutions for Big Data problems

Break 4:30 4:45 PM
Hands-on training on core Big Data concepts such as
MapReduce, Pig, and Hive, etc. Map Reduce Hands on development & Deployment 4.45 6.00 PM
Day 1 Duration Day 3 Duration
Breakfast 8:30- 9:00 AM Breakfast 8:30- 9.00 AM
Big Data Introduction and Concepts 9.00 - 11.00 AM PIG deep dive Theory 9.00 - 10.00 AM
PIG hands on 10.00 - 10.30 AM

Break 10:45 11:00 AM
Break 10:30 10:45 AM
Big Data Reference Architecture 11.00 -12.00 PM
Pig hands on, Hive deep dive Theory 10.45 - 12.00 PM
Lunch Break 12:00 to 1:00 PM
Lunch Break 12.00 to 1.00 PM
Introduction to HDFS and MapReduce 1.00 2.00 PM
Hive hands on 1.00 - 2.30 PM
Pig, HBase 2:00 4.00 PM
Break 2:30 2:45 PM
Break 3:30 3:45 PM
Hadoop administration & troubleshooting 2.45 - 4.00 PM
Hive, Sqoop 3:45 5.00 PM Training Conclusion 4.00 - 4.30 PM
Page 2
Big Data Introduction & Concepts 1
Day
Page 3
What is Big Data?
Big Data constitutes dataset that become so large as to render themselves unmanageable using traditional data
platforms e.g. RDBMS, flat file systems, OO databases etc.. This unmanageability stems from the complexity in
capture, storage, search, retrieval, sharing and analytics of these datasets due to their sheer size.
o Click Stream
o Active/Passive Sensor
o Log
o Unstructured
o
o
Event
Printed Corpus VOLUME VARIETY o Semi Structured
o Structured
o Speech
o Social Media
o Traditional Big Data
VELOCITY How Fast is Fast?

Why Does it Matter?
o Velocity of Generation
o Velocity of Analysis
Page 4
Who Uses BIG Data?
Page 5
Market Trends in Data... What is the Context?
Neil Armstrong lands on the moon with 32KB of data (1969)
Google processes 24PB of data everyday (2010) ~ 240K 100GB hard drives
So What is a PetaByte?
o 1 PB = 1 000 000 000 000 000 bytes = 1 million gigabytes = 1 thousand terabytes
o Large Hadron Collider produces ~ 15 PB per year
o The movie Avatar took ~ 1 PB of storage for rendering 3D CGI graphics
Twitter ~ 7TB
Facebook ~ 10B
Bank of America Who Knows? Why dont we know? Because they dont have the ability to store and process all the
unstructured / semi-structured data their eco-system generates But can they use it?
Spending 80% of your life with 20% of the data!!!

o 80% of new world data generated is unstructured and semi-structured
o Relational structures | strict schemas | canonical encodings will persist But will progressively constitute less and less of the
market data share
o Analysis requirements therefore are shifting towards unstructured and semi-structured data
o Traditional analysis platforms cannot scale to meet the new data paradigm
How Fast is Fast?

o Velocity of data generation is increasing exponentially Proliferation of automatic sensors in software and hardware
o Necessitates proportional increase in the velocity of data analysis
Streams Computing Analytics in Motion on-the-wire
Analytics at Rest Partial Today
Velocity of Data Velocity of Data Flow Restful

Generation Data
o WHY need Velocity of Analytics? Finding arbitrage opportunities in capital markets before asset prices balance
Page 6
What is Big Data Used For?
o Search
Yahoo, Google, Amazon, Zevents
o Log Processing
Yahoo, Facebook, Google, Twitter, LinkedIn
o Recommendation Systems
Yahoo, Facebook, Google, Twitter, LinkedIn
o Data Warehousing
Facebook, AOL
o Sentiment Analysis
Analyze TB volume of emails and transaction logs against an apriori established sentiment ML model to gauge which
customers are likely to leave Offer an incentive program to reduce customer attrition
o System Outage Prediction

Analyze application logs and join with call center data to predict system outages.
o Security Profiling
Collect event data from the enterprise event cloud and run it against an apriori established ML model to detect security breaches
and unforeseen correlations between enterprise events.
o Anti - Money Laundering | Fraud Analysis
o Correlation | Link Analysis

Google, Facebook, LinkedIn, eHarmony
o Multimedia Analysis
New York Times, Veoh
Page 7
Confluence of Influences
80% of new data generated is either completely unstructured or semi-structured Much of this data is being
generated at very high velocity Presents tremendous challenges for storage, search, retrieval and analysis
Traditional data platforms and analysis capabilities are unable to meet these evolving demands!
Consolidate ALL Enterprise

Analytics-in-Motion Data in a Single Place Email, Chat, Voice, Logs
High Velocity of Data Generation Open Data Format / Fixable Schema
On ALL Data
Deep Predictive Analysis Unstructured / Semi Structured Data
Extremely High Throughput Big Data Massively Parallel & Fault Tolerant
Store All Your Data Never Delete No Specialized Hardware Required
Low Cost Per Compute

Massive Datasets 100s of GBytes!! Integration with Structured Data
IVR, Image, Video Complete Support for

DW, ODS, Transaction Data
Page 8
Digression But Illustrative of Big Data Problems!!!
Building extremely large scale search engines or equivalently processing 1000s of terabytes of data in financial
services requires a large number of machines and a complex (aka costly) and scalable processing engine.
Yahoo! Use Case
Yahoo! Search architecture consists of four primary components

Crawler: Crawls the web and downloads pages from web servers across the entire net.
WebMap: Builds a graph of the known web. A graph is an interconnected mesh of web pages as they reference
each other.
Indexer: Builds a reverse index to the best pages for quick search retrieval across multiple facets.
Runtime: Responds to real time queries.
So what is the complexity?

A daily execution of a Yahoo! net crawl yields a graph of 1 trillion (1012) edges and 100 billion nodes (1011).
Each edge represents a web link.
Each node represents a distinct URL.
Analyzing this graph would require traversing each leg of this graph The graph has 1 trillion legs!!!
The analysis has to be done in finite time; typically less than a day.
So what kind of platform and framework would be needed to enable this kind of computation?
1) A large numbers of machines/nodes running in parallel.
2) Logically dividing the analysis in smaller chunks of work that can be processed in parallel.
3) Recombining the smaller pieces of work into a cohesive whole at the end.
4) A fault tolerant platform that can survive node failures.
5) A latency sensitive platform that can work on local data instead of requiring remote data.
Page 9
Who is Hadoop?
Hadoop is a software framework that supports very large scale data intensive processing using an open source license
(Hadoop, 2013)[1]. It is composed of two primary components 1. Hadoop File System (HDFS) & 2. MapReduce
Hadoop File System (HDFS) MapReduce

o Very large distributed file system o Programming model for large scale data
(10K+ nodes, 100+ million files, 10 100 PB) processing
o Runs on commodity hardware. o Parallel programming model.
(Does not require expensive, highly reliable hardware) o Breaks data processing problems into a Map
o Complete fault tolerance and failover. phase and a Reduce phase.
(Fault tolerance is built into the framework) o Each Map process works on a small subset of data
o Scale to hundreds of nodes in a cluster. ideally local to the process. (Data localization is
o Data localization. the responsibility of HDFS not MapReduce)
o Computation moves to data. o MapReduce processes data as key-value pairs.
o Provides very high aggregate bandwidth. o HDFS takes care of splitting and moving data to
o Provides support for streaming data access. MapReduce functions.
(Write once read many times) o Can be represented in any language of choice e.g.
Java, C++, Python, Shell etc.
HDFS does not work well for the following use cases: o High order abstraction languages such has Pig
o Lots of small files Latin have been created to obfuscate lower level
o Multiple writers details.
o Arbitrary file writes
Note: The last two limitations i.e. Multiple writers and arbitrary
file writes are being addressed in the new release of Hadoop
Page 10
Why Hadoop?
Which Data Problems are Hadoopable?
Analysis of structured, semi-structured and unstructured dataset from a variety of data sources.
When the entire population dataset requires analysis instead of merely sampling a subset and extrapolating results.
Ideal for iterative and exploratory analysis when business measures on data are not predetermined.
Coexisting Data Profiles DW IT Investment is Not Lost

In many cases the value-per-byte of data is unknown after money is spent storing it in DW. Hadoop removes this uncertainty
High Value Data

Hadoop DW
DM
DM
DM
o Analyze all data (Structured, unstructured, semi-structured) o Cleansed, enriched, matched data
o Inherent data discovery and data value analysis o Structured data analysis
o Analytics-at-rest & analytics on-the-wire o Analytics-at-rest
o Multiple disparate data sources o Produces insight with known and stable measurements
o Store all data (retain fidelity of transactions, logs, posts etc.) o Defined based on pre-determined corpus of questions
o Store data in native object format o Inflexibility in structure due to rigid data structure design
o Flexible or no data transport encoding o Rigorous data quality controls
o Low cost-per-compute o Performance envelope constrained due to functional limits
o Minimal performance concerns due to massive parallelism o High cost-per-compute
o High value-per-byte
o Data retained based on perceived business value
Page 11
A Little Bit of History
2002 2003
Apache Nutch
o Open source web search engine.
o Building one which can index 1-billion web pages is ambitious and costly at the very least.
o Doug Cutting wrote a Nutch crawler but the architecture would not scale to a billion pages.
Google
o Meanwhile Google site index is growing exponentially based on Sergey and Larrys Page Rank Algorithm.
o Query semantics and composition is becoming increasing complex.
o They are facing similar scale issues as Nutch. Their Oracle RAC is just not scaling.
o At this time Google is looking for a technological miracle else a possible demise.
2004
o Dec 2004 Google publish their seminal paper on distributed computing in the shape of Google File System (GFS).
o GFS would solve Googles storage needs, free up time spent on node management & enable huge indexing & crawling jobs.
o GFS runs all of Google search including all the utility functions NOW.
2005
o Doug Cutting picks up the GFS paper and implements an open sources version called NDFS.
o Nutch realizes that NDFS is applicable to wide array of computing issues beyond merely search.
2006
o NDFS is moved out as a top level Apache project and renamed to Hadoop (after the toy elephant of Dougs kid)
o Yahoo! Hires Doug Cutting and adopts Hadoop as their main computing platform.
o Yahoo! Implements 10GB/node sort benchmark on 188 nodes in 47.9 hours
2008
o Yahoo! wins the 1 terabyte sort benchmark in 209 seconds on 900 nodes
o Yahoo! loads 10 terabytes of data per day on a cluster of 1000 Hadoop nodes.
2009 - Yahoo! has 17 Hadoop clusters with 24,000 nodes. Wins the min. sort by sorting 500GB in 59 seconds on 1400 nodes.
Page 12
Where Can Our Clients Use Big Data?
In the new data paradigm; Big Data constitutes the fundamental enabler of value-add predictive analytics Big Data
enables analytics-at-rest and analytics-on-the-wire
Domain Metric Problem Question Solution

Customer Increased Customers are moving to What is a probability of a Customer sentiment analysis
Retention customer competitors customer leaving? using unstructured data
attrition (voice, email, logs, chat) on
Big Data platform using
Natural Language Processing
(NLP)
Fraud Analysis High rate of AML Correlating high volume of How do we successfully Link analysis using Big Data
alerts potential AML alerts identify a fraud network? and Machine Learning
Operational High rate of Successfully classifying How do we successfully Link analysis using Big Data
Continuity operational significant operational predict future system outages? and Machine Learning
events (Errors | errors with high precision Source ALL enterprise event
Alerts) and developing an data & run it against ML
operational outage model model
Security High rate of Successfully correlating How do we capture and Massive parallelism of
security breaches security breaches and analyze ALL enterprise events? enterprise event data using
enterprise issuing alerts Develop a threat profile? Big Data and Link Analysis to
security policy Issue alerts? isolate threat profiles
violations
Loss High rate of Automated detection of How do train an automated NLP processing of
Management insurance claims insurance fraud system to detect fraud on unstructured text corpus
insurance claims? against a self training fraud
model
Consolidation of Enterprise Business Intelligence Capabilities
Page 13
Big Data Reference Architecture 1
Day
Page 14
Big Data Reference Architecture
1 Connectors: Different methods to connect external
1 Connectors
source/target systems to the Hadoop platform.
ETL DBMS Middleware BI Analytics Visualization
2 Analytics: Data mining algorithms for performing
clustering, regression testing and statistical modeling
and to implement them using the Map Reduce model. 2 Analytics
3 Security: Kerberos Authentication, Role Based Text Analytics Machine Learning Object Correlation Path & Pattern Analysis
Authorization, Audit, Encryption.
4 Data Access | Pipelining | Serialization:
3
Security (Authentication, Authorization, Audit, Encryption)
HBase - A non-relational database that allows for low-
latency, quick lookups and adds transactional capabilities
to Hadoop. 4 Data Access | Pipelining | Serialization 5 Resource Mgmt. & Orchestration
Hive - allows users to write queries in a SQL-like language HBase Hive Pig Sqoop Avro Yarn ZooKeeper Oozie Flume
called HiveQL, which are then converted to Map Reduce.
PIG - A platform for analyzing large data sets that
consists of a high-level language for expressing data 6
Near Real Time Access
analysis programs.
In - Memory Database | Cache Object Immutability Graphing
Sqoop - A connectivity tool for moving data from non-
Hadoop data stores such as relational databases and
data warehouses into Hadoop. 7 Map Reduce -- Data Processing
Avro - A data serialization system that allows for
encoding the schema of Hadoop files.
8 HDFS | Cassandra File System (CFS) | GPFS -- Storage
5 Resource Mgmt. & Orchestration:

Java Virtual Machine
Yarn - Resource Manager consists of the Scheduler and
the Applications Manager; Resource mgmt. and job Operating System (Linux)
scheduling/monitoring.
Zookeeper - Centralized service to maintain 6 Near Real Time Access: IMDB is a database where data is stored in main memory to facilitate faster
configuration information, provides distributed response times with near real time access. Object immutability, with every state available for real time query
synchronization and group services. and retrieval.
Oozie - Workflow processing system that lets users 7 Map Reduce: Data Processing: A programming model for processing large data sets with a parallel,
define a series of jobs written in multiple languages distributed algorithm on a cluster. A Map Reduce program comprises a Map procedure that performs
such as Map Reduce, Pig and Hive -- then intelligently link filtering and sorting and a Reduce procedure that performs a summary operation. Orchestrates by
them to one another. marshaling the distributed servers, running the various tasks in parallel.
Flume - A framework for efficiently collecting, 8 Storage: Highly fault-tolerant distributed file system designed to run on commodity hardware. HDFS is the
aggregating and moving large amounts of log data from primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and
many different sources to a centralized data store. distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations
Page 15
Big Data Extended Eco System
Page 16
Lunch Break 1
Day
Page 17
Introduction to HDFS & MapReduce 1
Day
Page 18
HDFS Architecture Overview
Hadoop Distributed File System Architecture (Borthakur, 2013)[2]
HDFS Cluster
Block Map
NameNode
Metadata - master -
Operations
Client
Block Operations
Read
RACK RACK
DataNode DataNode DataNode DataNode
- worker - - worker - - worker - - worker -
1 4 2 4 5 6 Replication 6 9
2
3 7 8
3
Write Write
HDFS B lock
Client
Page 19
HDFS Architecture What is an HDFS Block?
A Block is the most basic level of persistent storage More basic than a file. Actually, a file is composed of one or more blocks
Alternatively, you can think of a Block as the minimum amount of data that can be read or written to a storage platform.
All filesystems (e.g. NFS, NTFS, FAT, Apple HFS) are designed using the block paradigm
Most filesystem blocks are usually a few KB
HDFS is also organized using blocks
Files in HDFS are broken down into equal sized data blocks
An HDFS blocks is >= 64MB. That is a very large data block
Why is the HDFS data block so large?

o To minimize the cost of seeks
o Map tasks operate on a single block usually.
o Making the block size too small would result in too many Map task Wasted memory
Block based storage architecture allows HDFS to enable the following (Borthakur, 2007)[3]:
o Blocks from same file can be stored on different servers allowing cluster based fault tolerance
o Blocks can be replicated to enable high availability
o Popular files can be set to a high replication factor (meaning replicate blocks to more servers) to enable load balancing and high
throughput
Client 1 Client 2 Client 3
DataNode DataNode DataNode

- worker - - worker - - worker -
1 2 Replication 1 2 3 4
Blocks 1 & 2 have high replication factor

Page 20
HDFS Architecture NameNode The Master
Manages the filesystem namespace This means all CRUD operations against DataNode(s) are managed by the NameNode
o Namespace is a name that uniquely identifies an HDFS instance e.g. hdfs://ernst&young/stas/ali/filename

o HDFS namespaces are defined using URIs
Maintains the filesystem tree

Maintains metadata for all the files and directories for an HDFS cluster
Knows all the DataNodes in the HDFS cluster
Holds the in-memory location of each data block for all file within an HDFS cluster
o Since block locations change during the life of an HDFS cluster; they are not persistently held by the NameNode
o In-memory caching of block locations imposes a limitation on the number of files that can be stored in an HDFS cluster
Clients access NameNode when they want to read or write a file. NameNode proxy client requests to appropriate DataNode(s)
DataNode B NameNode DataNode A
3 4 File1
- master - 1 2 File1
Name Type Server Block
File1 File A 1 -2 1 2 File2
B 3 -4
File2 File A 1-2
Without NameNode an HDFS cluster cannot function
X
NameNode
- master - X
HDFS
Even though Hadoop provides fault tolerance; how do you make Hadoop itself fault tolerant?
o Make NameNode resilient by writing persistent state of HDFS cluster to a remote filesystem e.g. NFS
o Run a secondary NameNode
Page 21
HDFS Architecture Anatomy of an HDFS Read
Client JVM/Node
2 - Remote Procedure Call
DistributedFileSystem 3-
NameNode
http://DataNodeA 1, 2 - master -
http://DataNodeB 3, 4
4 - Read
Client 7 - Close FSDataInputStream Sorted by proximity to client
5 - Read o Data is streamed from DataNode A to the

client consistently till end of the last block is
reached (Block 2)
DataNode A o Client connection to DataNode A is closed
- worker - 6 - Read o Client connects to DataNode B and
repeats the same process
1 2
DataNode B
Primary Proximity - worker -
3 4
How does this design scale?
o Clients retrieve data directly from DataNode(s) 2
o NameNode merely serves up location information
o HDFS can therefore scale to multiple concurrent clients since data Secondary Proximity
traffic is spread across multiple DataNode(s)

And where is the fault tolerance of Hadoop?
o If an error occurs while reading from a DataNode; HDFS will automatically proxy the request to the closest DataNode with the block.
o HDFS will remember the DataNode(s) that have failed and will automatically remove them from future client requests.
Page 22
HDFS Architecture The Concept of Node Proximity
HDFS represents the network as a Tree and the distance between two nodes is the sum of their distances to their common ancestor.
d=4 d=6
d=0 B Node 1 B Node 3 B Node 4
d=2
B Node 2
Rack 1 Rack 2 Rack 3 Rack 4
Data Center 1 Data Center 2
How does HDFS chooses which DataNode to store block replicas on? Bandwidth degradation between DataNode(s)
o The contention is between balancing a) Reliability b) Write bandwidth and c)
Read bandwidth. Same node
Bandwidth
o First replica goes on the same node as the client.
Different node same rack
o Second replica goes off-rack in the same data center chosen randomly.
o Third replica goes on the same rack as the second but on a different node. Different rack same data center
o Further replicas are placed on random nodes within the cluster.
Different data center
o This strategy provides:
Reliability Blocks stored on two separate racks
Write bandwidth Writes only traverse a single network switch
Read bandwidth Choice of reading from two racks or more
Block distribution Client writes a single block on local rack
Page 23
HDFS Architecture Anatomy of an HDFS Write
Replication Factor = 3
Client JVM/Node
2 - Remote Procedure Call NameNode
DistributedFileSystem
- master -
3 - Write
Client 9 - Close FSDataOutputStream
AC K QU E U E Create new file in file system
namespace e.g.
[Data Queue] Ack. Packet hdfs://eny/stas/ali/myfile.doc
DataStreamer
5 - Write Packet 8 - Acknowledgement Packet
DataNode A DataNode B DataNode C

DataNode - worker - - worker - - worker -
6 - Replicate Packet 6 - Replicate Packet
Pipelining 1 2 1 2 1 2
7 - Acknowledge 7 - Acknowledge
So what happens when all of this goes to hell? When a DataNode dies !!
o HDFS closes the pipeline
o Packets in the Ack. Queue are added to the front of Data Queue (Why?)
o Current blocks on the good DataNodes are assigned new identity and communicated to NameNode
o Partial blocks on failed DataNodes are deleted when the node comes back up
o Failed DataNode is removed from the pipeline
o NameNode notices the under-replication and assigns a new healthy DataNode for replication
Page 24
MapReduce 1
Day
Page 25
What is MapReduce?
MapReduce is a programming model pioneered by Google & Yahoo to process extremely large datasets (> 100s of Giga Bytes)
MapReduce parallelizes the data processing problem into smaller chunks of work (MapReduce, 2013)[7]
MapReduce Parallelism
(key, value)
split Map
Dataset
Result
(key, value) (key, value)
split Map Reduce
(key, value)
split Map
o User specifies the problem in terms of Map and Reduce tasks

o A Map task consumes raw input data and converts it into another dataset represented as tuples of key/value pairs
o A Reduce tasks combines data from multiple Map tasks into fewer set of tuples
Map: Input data <key, value>

Reduce: <key, value> <result>
The underlying HDFS runtime automatically provides following capabilities:
o Slicing of large datasets into smaller equally sized splits
o Merging of Map results into a fewer input in Reduce tasks
o Fault tolerance and automatic failover
o Automatic data movement between Map and Reduce taks
o Automatic data replication between DataNode(s)
o Localized processing by moving Map & Reduce tasks to where the data resides
o Load balancing of the base application
Page 26
MapReduce Architecture How Does the Data Flow?
NameNode Job Tracker
- master -
MapReduce Job Management
Task Tracker
DataNode 1
Partitions
sort
Split 1 Map DataNode 5 Task Tracker
Partitions Merge
Source Dataset
Task Tracker HDFS

DataNode 2 Reduce Result 1
Replication
Partitions
sort
Split 2 Map Task Tracker
DataNode 6
Partitions Merge
Task Tracker HDFS
DataNode 3 Reduce Result 2
Replication
Partitions
sort
Split 3 Map
MapReduce Task Distribution

o 1 Split = 1 Map = 1 Processor
o Fine grained splits improve performance and quality of load balancing
o Map tasks utilize data locality optimization
o Map tasks partition their output into multiple segments each segment targeted for a different Reducer
o Reduce output is stored on HDFS as per the principles of node proximity
Page 27
Making MapReduce Real Example from Capital Markets
New York Stock Exchange executes 1.1 billion trades per year. How do you find valuable information in such a large dataset?
Problem Statement I Find the highest traded stock price for each company registered on NYSE within a given year
Problem Statement II Find the spread between trade prices that are within 1, 2 and 3 for each listing on NYSE
MapReduce tasks are defined in terms of (key, value) pairs (Stock Ticker, Trade Price)
MapReduce Parallelism
AAPL 544.21
AAPL 543.90 Key Value
AAPL 521.36 AAPL (544.21, 543.90, 521.36)

Original Dataset
(1.1 Billion Tuples) split(s)
MSFT 32.20 A
H
Map
AAPL 544.21
XRX 8.25
MSFT 31.87 I
AAPL 543.90 P
Map Reduce
AAPL 521.36
Q
Z Map
Key Value
MSFT 32.20 XRX 8.25 Key Value AAPL 544.21
MSFT 31.87 MSFT (32.20, 31.87) MSFT 32.20

Page 28
Sorting in MapReduce Adding Velocity to Data Processing
The ability to sort data is at the heart of MapReduce. It helps to organize data and improves the data processing speed in MR.
How do we create balanced splits that feed into Mappers? - (Consider the NYSE example)
Even splits are important so that no one Mapper can dominate the overall job time.
The overall job is as fast as the slowest Mapper or Reducer
Velocity of Data
Understand the distribution of data in the source before split segmentation
Sort Algorithm in MapReduce
Count every occurrence Brute force!

of a ticker in the source High Latency! Requires
full source dataset scan
1.1 Billion Tuple Dataset count(MSFT) 100K
X
count(APPL) 230K
MSFT 32.20
Approach I
count(XRX) 400K
Cost Prohibitive
AAPL 544.21 Split Segmentation
XRX 8.25
MSFT 31.87
AAPL 543.90 Sample a subset of

AAPL 521.36 Approach II tuples to estimate Hadoop Even Split
the population key InputSampler Segmentation
distribution
Page 29
Elaborating MapReduce Sort Why is Sorting Important?
Why is Sorting Important?
o Improves data processing speed by producing more even splits to be distributed across multiple Mappers
o Facilitates joining multiple datasets together for improved analysis capabilities
o Produces a globally sorted output that is consumable by downstream Mappers and Reducers
Steps in MapReduce Sort

This approach preserves the total order of the population dataset Unsorted
Population
Dataset split
Sample the Estimate Determine
(P) Population Key Number of split
(n) Distribution Splits
split
Sort
A
Map Z
Sorted
A Population
Map Z Reduce
Dataset
A
Map Z
Page 30
Pig & HBase 1
Day
Page 31
Introducing Pig
What is Pig ? (Pig, 2013)[8]
o High-level data processing language
o Hadoop extension that simplifies Hadoop programming
o 40% of all Yahoo jobs are run using Pig. Twitter is another well known user of Pig
Why call it Pig ? There is a reason that we will get to later
Why use Pig ?

o Programming in MapReduce is Non-Trivial
o Pig simplifies Hadoop programming by providing a data processing language
o Provides Data Structures that are much richer and has powerful transformation capabilities
Pig has two major components:

o Pig Latin - A high-level data processing language
o Execution Environment- That translates Pig Latin scripts to Map Reduce jobs and then runs.
Execution Modes: Pig has two execution types or modes: local mode and Hadoop mode
Local Mode Hadoop Mode
Pig Latin Pig Latin
Single JVM Runtime Hadoop Cluster
o Single JVM o Runs on Hadoop cluster

o Accesses local file system o Suitable for large data sets
o Suitable for small data sets o Queries are translated to MapReduce jobs
Page 32
Where Does a Pig Fit ?
Data processing usually involves three higher level tasks (Data Collection, Data Preparation & Data Presentation)
Data
Data Preparation Data Presentation
Collection
ETL (or) Data Factory Data Warehouse
Data Factory Use Cases:
o Pipeline Bring in data feed and clean and transform it. Example is logs from Yahoo Web Servers. Logs undergo cleaning
step where bots, company internal views and clicks are removed.
o Iterative Processing Typical processing on large data set involves bringing in small new pieces of data that will change the
state of the large data set
o Research Quickly write a script to test a theory or gain deeper insight by combing through Petabyte of data
Page 33
Pig Latin The Language
Pig Latin provides a higher order abstraction for implementing MapReduce jobs. It constitutes a data flow language which is
made up of a series of operations and transformations that are applied to the input data
Key features of Pig Latin

o Ease of programming - Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as
data flow sequences, making them easy to write, understand, and maintain.
o Optimization opportunities - The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.
o Extensibility - Users can create their own functions to do special-purpose processing.
Pig Latin Statement

o Pig Latin statements are the basic constructs you use to process data using Pig.
o A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
o Pig Latin statements may include expressions and schemas.
How are Pig Latin Statements LOAD Load statements read data from the file system
Organized ?
Transformation statements process data read
Pig Latin statements are organized as a TRANSFORMATION from the file system
sequence of steps such that each step
represents a transformation applied to DUMP statement display results
some data
DUMP / STORE
STORE statement to save the results.
Page 34
Understanding Pig Data Model
Pig Latin is a relatively simple language that executes statements
Bag Pig Latin Statement Bag
A bag is a relation, similar to A statement is an operation The output is another bag

table, that you'll find in a that takes input (such as a with the result set of the
relational database (where bag, which represents a set of processing
tuples represent the rows, tuples) and emits another bag
and individual tuples are as its output.
made up of fields)
Pig Latin has 3 Complex Data types and 1 Simple Data Type (Atom simple atomic value such as string or number)
(5,Big Data,-2) A tuple is an ordered set of fields. Its represented by

TUPLE fields separated by commas, all enclosed by parentheses
A bag is represented by collection of tuples separated by

{(5,Big Data,-2), commas, all enclosed by curly brackets. Tuples in a bag
BAG (6.5,PIG,10)} arent required to have the same schema or even have
the same number of fields.
A map is a set of key/value pairs. Keys must be unique and
MAP [key # value]
be a string. The value can be any type.
Page 35
Pig Data Model Expressions
tup = (puppy, { (wagging,1) (chewing,2)}, [age 4])
Let the fields of the Tuple tub be called t1, t2, t3
Expression Type Example Value for tup

Field by position $0 puppy
Field by name t3 [age 4]
{(wagging)
Projection t2.$0
(chewing)}
Map Lookup T3#age 4
Functional Evaluation SUM(t2.$1) 1 + 2 = 3
T2. $0.$0 ==wagging ?
Conditional Expression Dog
Dog : cat
wagging,1
Flattening FLATTEN(t2)
chewing,2
The table above shows the expression types in Pig Lain and how they operate. The Pig data model is very flexible and
permits arbitrary nesting.
Page 36
Pig Runtime
Executing Pig Programs

(Both Local and Hadoop mode)
Script Grunt Embedded

o File or Command line based o Interactive shell to run commands o Run Pig programs from Java
o Use Pig command to run file o Use exec or run command o Same way as using JDBC to call SQL
o Use Pig e option for command line
The Grunt shell allows you to enter Pig commands manually.

Grunt shell can be used for ad hoc data analysis or during the interactive cycles of program development
Grunt remembers command history and can recall lines in the history buffer using Ctrl-P or Ctrl-N
Pig programs as similar to SQL queries , and Pig provides a PigServer class that allows any Java program to
execute Pig queries. Conceptually this is analogous to using JDBC to execute SQL queries
Pig Editors
o Pig Pen - script text editor
Page 37
Ride the Pig !!!
Example - How would you find the maximum temperature in a given year using Pig Latin
records = LOAD examples/sample.txt AS (year:chararray, temperature:int, quality:int); 1
filtered_records = FILTER records BY temperature != 9999 2

AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); 3
grouped_records = GROUP filtered_records BY year; 4
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records . temperature); 5
DUMP max_temp;
Decoding the Hieroglyphics !!!

1. Ease of programming - Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data
flow sequences, making them easy to write, understand, and maintain.
2. LOAD function produces a set of (year, temperature, quality) tuples that are present in the input file
3. FILTER removes records that have a missing temperature (indicated by a value of 9999) Data quality check
4. GROUP groups the records relation by year
5. FOREACH processes every row to generate a derived set of rows
6. MAX is a built-in function for calculating the maximum value of fields in a bag
7. Examine the contents using the DUMP operator
Page 38
Pig Latin Programming
Statements
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command.
For example DUMP operation is a type of a statement e.g. DUMP max_temp;
Any command to list the files in a Hadoop file system is another example of a statement e.g. ls /
Statement Execution
1. Each statement is parsed in turn Parsing
2. For syntax errors and other semantic problems the
interpreter will halt and display an error message
3. The interpreter builds a logical plan for every relational Yes Display
Errors ? Error Message
operation, which forms the core of a Pig Latin program
4. The logical plan for the statement is added to the logical No
plan for the program so far, then the interpreter moves
on to the next statement. Build Logical Plan
5. The trigger for Pig to start processing is the DUMP for the statement
statement (a STORE statement also triggers processing). Run Logical Plan

6. At that point, the logical plan is compiled into a physical for the Program
plan and executed. DUMP/STORE

(trigger)
Comments
Pig Latin has two forms of comments.
Double hyphens are single-line comments. Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter. Ex: -- DUMP max_temp
C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers.
They can span lines or be embedded in a single line. Ex: /* . */
Page 39
Pig Functions Beyond the Barn
Functions
1. Eval function
o Takes one or more expressions and returns another expression
o Some Eval functions are aggregate functions such as MAX, which returns the maximum value of the
entries in a bag
2. Filter function
o Removes unwanted rows
o Returns a logical Boolean result
o Example: IsEmpty, which tests whether a bag or a map contains any items
3. Comparison function
o Can impose an ordering on a pair of tuples.
4. Load function
o Specifies how to load data into a relation from external storage
5. Store function
o A function that specifies how to save the contents of a relation to external storage
User Defined Functions (UDF)

1. PigLatin is extensible through user-defined functions (UDFs ), and theres a well-defined set of APIs for writing UDFs.
2. PiggyBank - an online repository for users to share their functions
REGISTER piggybank/java/piggybank.jar;
b = FOREACH a GENERATE UDFs are written in
PIG UDF Jar File Java and packaged
org.apache.pig.piggybank.evaluation.str
in a jar file
ing.UPPER($0);
Page 40
Analyzing Logs using Pig Where Can we Use This?
Churning through voluminous log files and extracting meaningful insight Very common for click stream data, active and
passive sensor data, exception stack-trace information etc.
Much of what Google, Yahoo, Twitter & Facebook does
Processing Log Files with Pig Latin

REGISTER Register PiggyBank
DEFINE Define Log Loader Define Load Extractor
LOAD Load log files into Pig
Log File Ready for Analysis
Find Unique Hits per Day Find Unique Website Hits per Day
GROUP Group records based on days

FOREACH Iterate to get unique number of user identifiers
ORDER Sort the records by unique user identifiers
STORE Store the final result
?
Page 41
HBase 1
Day
Page 42
Yet another Storage Mechanism?
What is Database ?
o Organized collection of data to provide usage (for example, storing and finding a list of conference rooms available)
o Databases are intended to be used by multiple programs and several different users at the same time
o Has long history that is spread over 50 years with several technological advancements
The Evolution of Database
1960s 1970s 1980s 1990s 2000 +
Hierarchical Object Oriented
?
Traditional Files Relational
Network Object relational
Page 43
Why Enough is not Enough?
RDBMS still makes sense in most cases At present
o As a Persistence layer for front end applications
o Store relational data, strong consistency (ACID properties) and Referential Integrity
o Random access for structured data
o Limited number of records
So Why is an RDBMS not enough?

o Large dataset
o Scalability story
o Store both Structured and Semi- Structured data
o There is no cost effective way to store everything.
Scale up !!!
Enterprise Data Needs
Scale up !!
Offload reads to in-

Scale up ! memory systems Out of Options
Cease Stored
Consistent Updates Add more Procedures
Thousands of application servers Materialized Views
Referential Master-Slave Schema de-
Users
Integrity Architecture normalization
Normalize data
Stored Procedures Caching layer Partition data across
Use Foreign keys
ACID properties multiple databases
Use Indexes
Options
Big Data Revolution Store everything & There is no one size that fits all
Page 44
BigTable The Backdrop
A Distributed Storage System for Structured Data
Quick Facts Motivation

o Started as a Internal initiative at Google Labs by 2004 o Lots of (semi) structured data at Google
o Used by many Google production applications o Scale is large
Google Analytics o Need data for offline data processing and online
Google Maps serving
Google Earth
Wide applicability
(many Google products)
Big Table
High Availability
Thousands of Servers (Hundreds of TB to PB)
High Performance Scalability

(Very high reads/writes)
Page 45
Terminology and Complexity
From Googles BigTable Paper
A BigTable is a sparse, distributed, persistent multidimensional sorted map
From Googles BigTable Paper
The map is indexed by a row key, column key, and a timestamp; each value in the map is an un-
interpreted array of bytes.
From Hadoop wiki
HBase uses a data model very similar to that of BigTable. Users store data rows in labeled
tables. A data row has a sortable key and an arbitrary number of columns. The table is stored
sparsely, so that rows in the same table can have crazily-varying columns, if the user likes..
Page 46
Demystifying HBase and BigTable
HBase is a open source implementation of BigTable (HBase, 2013)[4]
Sorted Map Multi-dimensional

HBase maintains maps of Keys to Values (Key Value). Each of these A map of maps. The key itself has structure. Each key consists of the
mappings is called a Cell following parts: row-key, column-key, column and time-stamp
Ex:1 Row
{ Column Family
1 : Dave, {
2 : Joe, 1 : {
} Name : Dave,
Group : STAS
},
2 : {
Name : Joe,
Group : ITAS
The cells are sorted by the key. Allows for searching (ex: retrieve all }
values for which the key is between x and y). Rather than just }
retrieving a value for a known key
Ex:2
Sparse
Hbase does not follow a spreadsheet model. In Hbase a given row can
have any number of columns in each column family or none at all
(rowkey, column family, column, timestamp) value
Distributed
Hbase and BigTable are built over distributed files systems so that the
underlying storage can be spread out among array of independent
machines. Provides a layer of protection against a node within the
Values can be of any length, no predefined names or widths cluster failing
Page 47
Storage Layout Differences by Example
ROW-ORIENTED
Row-ID Name Birthdate Salary Dept
1 Joe 6-Apr 70000 SAL
2 Jane 12-Feb 55000 ACCT
3 Bob 13-Jul 120000 ENG
4 Mike 17-Sep 115000 MKT
SCHEMA ID Name Birthdate Salary Dept

(Integer) (Varchar) (String) (Integer) (String)
COLUMN-ORIENTED
Name Column Row-ID Value
1 Joe
2 Jane
3 Bob
4 Mike
Birthdate Column Row-ID Value

1 6-Apr
2 12-Feb
3 13-Jul
4 17-Sep
Page 48
Storage Layout Differences by Example (contd..)
Converting a Relational Model to HBase
Example Business Application: Shopping Cart

Problem Statement:
Build a Online Shopping Cart Application
Data Model:
A data model representation by conventional RDBMS: General RDBMS methodology:
Customer Orders Products o Normalize data

CustomerID UUID PK OrderID UUID PK ProductID UUID PK o Use foreign keys
Name Text CustomerID FK Name Text o Use indexes
Email Text date Timestamp Price Double
Total Double
Order_Products
OrderID FK
ProductID FK
Model Description: Design description:

Table Purpose 1. Customer table is indexed on the customerID
field for fast lookup
Customer Contains list of customers 2. Orders table is indexed on the OrderID for fast
lookup
Orders Orders placed by customers 3. Products table is indexed on the ProductID
4. Customer and Order tables are related through a
Products List of products available in the store foreign key relation on the CustomerID
5. Orders table and Products table are mapped
Mapping between orders placed and products in the between each other by mapping table
Order_Products
order
Page 49
Representing Shopping Cart application in HBase
Example Business Application: Shopping Cart

Problem Statement:
Build a Online Shopping Cart Application
Data Model:
A data model representation in HBase:
Table: Customers
Model Description:
Row Key Customers
Family
Table Purpose
data Columns: Content, Orders
Stats-3 mins Stores each customer, associated order usage statistics, various time
Customer
Stats-2 mins ranges in separate column-families with distinct TTL settings
Stats-1 mins
Products Stores content associated with products
Table: Products
Orders Stores all orders and products under each order
Row Key Products
Family data Columns: Content
Design description:
Stats-3 mins
Stats-2 mins 1. Similar Number of tables as RDBMS model

Stats-1 mins 2. Mapping table is absorbed as part of the Customers table
3. Statistics are stored with date as a key, so they can be accessed sequentially
Table: Orders
Row Key Orders Important Changes:
Family data Columns: Content, Products
1. Wide tables and Column oriented design eliminates JOINs
Stats-3 mins
2. Compound keys are essential
Stats-2 mins
Stats-1 mins
3. Data partitioning is based on keys, so a proper understanding is essential
Page 50
Where RDBMS makes sense
1. Joining
o In a single query get all products in an order with their product information
2. Secondary Indexing
o Get Customer Id by Email
3. Referential Integrity
o Deleting an order would delete links out of order_products
Where Hbase makes sense

1. Dataset Scale
o We have 1M customers and 100M orders
o Product information includes large text datasheets and PDF files
o Want to track every time a customer looks at a product page
2. Read/Write Scale
o Writes are extremely fast since there are no index updates
o Tables distributed across nodes means reads/writes are fully distributed
3. Replication
o Comes free with Hadoop
4. Batch Analysis
o Massive queries executed serially become efficient MapReduce jobs distributed and
executed in parallel
Conclusion
For small instances of simple straightforward systems, relational databases offer much more convenient way to
model and access data
If you need to scale to larger proportions, the properties and flexibility of Hbase can relieve you from the headaches
associated with scaling an RDBMS
RDMBS provides tremendous functionality out of the box but extremely difficult and costly to scale. Hbase provides
barebones functionality out of the box but scaling is built in and inexpensive
Page 51
HBase Building Blocks
Row Column
o Rows are composed of columns o The most basic unit of Hbase is a Column
o Can have millions of columns o Each column may have multiple versions, with each distinct
o Can be compressed or tagged to stay in memory value contained in a separate cell
o One of more columns for a row, that is addressed uniquely
by a row key
Table
o Column name is called qualifier.
o Collection of rows
o Reference = family: qualifier
o All rows are always sorted lexicographically by their row
key
o Keys are compared on binary level from left to right Cell
o Rows are always unique and can be thought of a primary o Every column value is called Cell. It is time stamped.
index on the row key o Can be used to save multiple versions of a value. Versions
are stored in decreasing timestamp, most recent ones first
Column Families
o Columns are grouped into column families Region
o Semantic boundaries between data o Basic unit of scalability and load balancing
o Defined when table is created and should not be changed o Contiguous ranges of rows stored together.
often o Dynamically split by the system when they become too
o Number of column families should be reasonable (?) large
Page 52
HBase Logical Architecture
o HB ase uses HDFS as its reliable

HBase storage layer, which provides
failover, replication
API o Native Java API, Gateway for
REST, Thrift, Avro (Avro,
2013)[12]
Region Servers
o Two types of HBase nodes
(Master and Region Server)
Master HFile memStore
o Master manages cluster and
Write Ahead Log responsible for monitoring the
region server and load balancing
of regions
o Region Server manages data

HDFS ZooKeeper o Zookeeper is used for
coordination
Page 53
HBase Distribution Architecture
Rows Region Server 1 Region Server 2 Region Server 3

A
. Keys: [T-z]
Table - Logical View
.
. Keys: [I-M]
H
Keys: [F-I]
.
. Keys: [A-C]
.
Q
. Keys: [M-T]
.
. Keys: [C-F]
Z
Distribution
o Unit of Scalability in Hbase is a Region
o Sorted and Contiguous rows
o Spread randomly across Region Server
o Moved around for Load balancing and failover
o Split automatically or manually to scale growing data
Page 54
Hive & Sqoop 1
Day
Page 55
Introducing Hive
Many thanks Facebook for making Hadoop data files look like SQL tables!!!
Hive is a Petabyte scale data warehousing infrastructure for managing and querying structured data built
on top of Hadoop (Hive, 2013)[5]
o Map-Reduce for execution
o Simple query language called Hive QL, which is based on SQL
o Plug in custom mappers and reducers for sophisticated analysis that may not be supported by the built-in capabilities of the
language
Data Summarization Adhoc Querying Data Analysis
Hive
(Built by Facebook)
Large Volumes of Data
How is Hive different from Oracle or other popular databases ?

1. Not designed for online transaction processing
2. Does not offer real time queries and row level updates. More suited for batch processing over large sets of immutable data
3. Provides Data Structures that are much richer and has powerful transformation capabilities
Page 56
Motivation For Using Hive
Quick Facts
Motivation
o Data Data and mode Data
o Users expect faster response time on fresher data
o On average; data is increasing at 8X yearly
o Fast, faster and real-time
o Platform scalability is the major limiting factor in
supporting this data growth
Realization that more insights are derived

from simple algorithms on more data
Existing data warehousing systems do not

meet all the requirements in a scalable, agile
and cost effective way
Can I use
Hadoop ?
Page 57
Rationale For Hive
Pros and Cons analysis
1. Hadoop provides superior

availability/scalability/mana
geability
2. Command-line interface for
end users 1. Map Reduce is hard to
3. Partial program
availability/resilience/scale 2. Need to publish data in well
more important than ACID known schemas
Hive> select employee_number, $cat > /tmp/mapreducer.sh

count(1) from employee_records where $cat > /tmp/map.sh
employee_number > 100 group by .multiple lines of map reduce code
employee_number
HIVE
Page 58
Where Does Hive Help?
HIVE Principles that Addresses Key Challenges
Data Growth Performance

How Hive addresses Data Growth How Hive addresses Performance
challenges? challenges?
1. Hive table can be defined directly on 1. Tools to load data into Hive table in near
existing HDFS files real time
2. Tables can be partitioned and bucketed 2. Various optimization techniques
and data can be loaded to each partition 3. Pull simple short tasks to the client side
3. Scale out instead of Scale up
HIVE
How Hive addresses How Hive addresses Extensibility
Interoperability challenges? challenges?
1. Schemas are stored in RDBMS 1. Plug in Custom Mappers / Reducer
2. Column types could be complex types 2. Data Source can come from web services
3. Tables and Partitions can be altered 3. JDBC/ ODBC drivers
4. Views to be available soon
Interoperability Extensibility
Page 59
Hive Architecture
Data Model
Tables have Typed columns (int , float,
HIVE string, date, boolean)
JDBC ODBC Partitions
Buckets (Hash partitions useful for
Command Line sampling, join optimization)
Metastore
Web Interface Thrift Server
Interface
Driver
(Compiler, Optimizer, Executor)
Metastore
Namespace containing set of tables
Holds partition definitions
Statistics
Runs on Derby, MySQL and many other
relational databases
Job Tracker Name Node

Physical Layout
Warehouse directory in HDFS
Data Node Table row data stored in subdirectories of
+
Task Tracker
warehouse
Partitions form subdirectories of table
Hadoop directories
Map Reduce + HDFS Actual data stored in flat files
Page 60
Hive Query Language
Features
Data Types Built-in Functions

MySQL like MapReduce
SQL Dialect (Simple) (Show Functions)
extensions extensions
(Complex) (Describe Functions)
Capabilities
o Can point to external tables or existing data directories in HDFS
o Sub Queries
o Equi Joins
o Multi-table Insert
o Multi-group by
Select Query Syntax

SELECT
SELECT [ALL
[ALL || DISTINCT]
DISTINCT] select_expr,
select_expr, select_expr,
select_expr, ...
...
FROM table_reference
FROM table_reference
[WHERE
[WHERE where_condition]
where_condition]
[GROUP
[GROUP BY
BY col_list]
col_list]
[[ CLUSTER
CLUSTER BY col_list
BY col_list
|| [DISTRIBUTE
[DISTRIBUTE BYBY col_list]
col_list] [SORT
[SORT BY
BY col_list]
col_list]
]]
[LIMIT
[LIMIT number]
number]
Page 61
Hive Usage @ Facebook and Beyond
Quick Statistics
12 TB of compressed new data added / day
135 TB of compressed data scanned / day
7500 + Hive jobs / day
Analysts (non-engineers) use Hadoop through Hive
95% of jobs at Facebook are Hive jobs
Applications & Jobs

Reporting
o Daily / Weekly aggregations of impression / click counts
o Measures of user engagement
o Micro strategy reports
Ad Hoc Analysis
Machine Learning
Etc Etc.
Other Real World Use Cases

CNET
o Data Mining Log Analysis Adhoc queries eLearning
Hi5
o Analytics Machine Learning Social graph analysis
Page 62
Sqoop 1
Day
Page 63
Sqoop SQL 4* Hadoop
Sqoop is an open source tool that allows users to extract data from relational database into Hadoop
o A great Strength of Hadoop platform is its ability to work with data in different forms and parse Adhoc data formats
extracting relevant information (Sqoop, 2013)[9]
o Considerable amount of valuable data in an organization is stored in relational database systems
Features
o Written in Java Custom MapReduce programs interpret data
o JDBC based interface
o Automatic datatype generation
o Uses MapReduce to read tables from Database SCOOP
database Table HDFS
o Supports most JDBC standard types
o Provides ability to import from SQL
databases straight to HIVE data warehouse
Auto generated datatype

definitions
Page 64
Sqooping Large Objects
Importing Large Objects
o Database queries usually reads all columns of each row from disk to identify rows that match a query criteria
o If large objects were stored inline in this fashion, it would adversely affect the performance of the scans
MapReduce typically materializes every record before passing it along to the mappers. To avoid the
performance degradation the large objects are often stored externally from their rows
Storage B
LOB A
Row 1: Col 1 ID (A) Col 3 Col 4
Row 2: Col 1 ID (B) Col 3 Col 4

LOB B
o Sqoop will store imported large objects in a separate file called a LobFile
o The LobFile format can store individual records of very large size (64 bit address).
o This format allows clients to hold a reference to a record without accessing the record contents
Page 65
Conclusion 1
Day
Questions?
Page 66
Industry Use Cases End to End 2
Day
Page 67
Use Case 1:How do you Gauge Customer Sentiment?
Use Case Reducing customer attrition via apriori semantic analysis of customer sentiments using textual data
Problem Statement Ability to ingest large volume of unstructured data from multiple sources
Apply an apriori established rules models to ingested data
Derive a structured Result-Set that can correlate sentiments to subjects
Challenges in the semantic analysis of human sentiment!! Limitations of traditional semantic text analysis
o Accuracy of the result-set is the major challenge in o Limited accuracy
unstructured textual analysis. Much of it is manifested in o Latency (limited speed)
form of Precision and Recall o Low expressiveness
o Precision Percentage of relevant items in the result-set i.e. o Low granularity
are the results valid? Correlation is a better measure. o Inability to perform bi-directional raw text
o Recall Percentage of relevant results retrieved from analysis
unstructured data i.e. are all sentiments relevant to a
subject retrieved successfully?
CHANNELS MODEL RUNTIME STRUCTURED OUTPUT

Text Corpus
Email
Speech Subject Satisfaction Index
Corpus John Irritating [-1]
OPTIMIZER Chris Unhappy [-3]
IVR Adam Tired [-5]
Text Corpus
Online Extractor
Help
Page 68
Using NLP to assess Customer Sentiment
Natural Language Processing (NLP) constitutes the process of extracting meaningful information via computation
from a natural language e.g. English. This requires collaboration between computation, linguistics and statistics.
Mahout
NLP constructs that we use to answer the Customer Sentiment question using Mahout using (Mahout, 2011)[6]
Coreference Resolution Given a sentence determine which words refers to the same object/entity.
Named Entity Recognition Given a stream of text determine which item in the text maps to a proper name & type.
Natural Language Generation Convert rich media to human readable format. Conversion of IVR data to text.
Part of Speech Tagging Given a sentence, determine part-of-speech for each word. @ Large Bank, customer interactions
captured in a non-inflectional language such as English introduces ambiguity because multiple words can be used both as
nouns and verbs e.g. book, set etc.
Sentence Boundary Disambiguation Given a stream of text determine sentence boundaries.

Word Sense Disambiguation Given a stream of text determine the context in which each word is used for words with
multiple meanings under different contexts.
Sentiment Analysis Given a sentence, classify its polarity given basic states such as positive, negative or neutral. Further
sub-classification into advance emotional states based on a scaling model e.g. -5 to 5 can also be derived.
@ We can use a Hadoop platform that constitutes:

Text Analytics Toolkit
Query Language (QL)
HDFS implementation
Page 69
Extractor Definition
Semantic Extractor Definition
Sample Customer Email Extract
My name is John Doe and I maintain two wealth management accounts with you. On 28th Feb. I
called customer regarding erroneous fees applied to my account, however after speaking with
multiple agents and wasting a lot of time. I came off very disappointed with the level of
service I received. My issue has not been resolved and I am considering moving my accounts
to another institution
Extractor Rules Expression

<Named Entity>
<Person Entity> <Person Entity> <Emotive Expression> <Emotive Expression>
Single Sentence Boundary
Data Stream Boundary
Boundary Person Emotion Satisfaction Index

Data Stream John Doe wasting, disappointed, issue
Sentence I wasting -3
Sentence I disappointed -4
Sentence My issue -2
Page 70
Introducing Hadoop Parallelism for Sentiment Analysis
We can use a Hadoop platform that includes
Includes Advance Text Analytics Engine
Includes Annotator Query Language (AQL) which is a fast, declarative & fine grained expressive language for text analysis
AQL is used for defining text analytics rules to build text extractors
Runtime complies an AQL extractor into an Analytics Object Graph (AOG) which represents the text extraction rules in a tree
structure in memory
A separate instance of AOG is deployed on each Mapper which runs a full instance of Text Analytics Engine
A single Mapper operates on a single data Split running it through the rules defined in the AOG
Output from all Mappers is reduced via a Reducer that coalesces negative emotive expressions exclusively
Hadoop Parallelism Extractor per Mapper

Mappers
Each Mapper runs
an Analytics
AOG instance
Subject Satisfaction Index
John Irritating [-1]
Optimizer Chris Unhappy [-3]
AQL AOG Reducer
& Runtime Adam Tired [-5]
Reduces negative
AOG emotive expressions
Page 71
Use Case 2: Anti Money Laundering Using Hadoop Fraud Use Case
Money Laundering is the practice of concealing the source of illegally obtained money
Process of Money Laundering

Placement Act of transforming illicit liquid assets (currency) into any other asset class by a subject.
Layering Attempt by a subject to obscure the trail of an illicit source of funds via the establishment of complex financial
transactions e.g. shell companies, offshore accounts, tax heavens etc.
Integration The aggregation of illicit funds from various legitimate commercial activities and financial systems. This stage
converts illicit money into clean money.
Placement Layering Integration

Offshore Bank A/C
Money Changer
Subject $ Remit. Agent $ Tax Heaven $
$ $ $
Offshore Bank A/C
$
$ Bank Agent
$ $ Agent
Remit. Agent
$
Region A Region B
Page 72
Notional Model for Event Based Detection of ML
Using Hadoop for AML Detection
SQUOOP Event Record
o Subject
Enterprise AVRO o Time
Event Cloud o Location ML Detection
Hadoop Algorithms
o Activity
Serialization
o . ML Object
o Historical Subject John Doe
Iterations to isolate Event
Associations & 14:35 April 5 2009
sub-associations Records Seattle, WA
Rankings
Company Setup
o OFAC Lists HDFS
o CFTC Lists CFO
MapReduce ML Object Graph
Put Ranking
& Association ML Object Parallelism A2 1
3
Subject Ranking
Get Ranking Mahout
1 1
& Association
Knowledge Base
& Association Cluster Association B
Ranking Assignment
AML Modeling Characteristics Heuristics for AML Modeling

A
o ML Event Is a significant occurrence that presents an ML o Corruption Network
activity at a single point in space-time.
Parallel Cluster MAP o Illicit Activity Period
o Cluster Defines all the associations for a given subject. B o Illicit Transaction
Evaluation
o ML Object Defines a subject with corresponding attributes o Suspicious Entity
including time, illicit activity and location of activity. o Suspicious Region
o Association Defines the relationship between two B o Subjects Career
subjects.
MAP o Subjects Age Profile
o Association Degree Defines the degree of relationship o Transaction Behavior
closeness.
o Subject Rank Defines the degree of suspicion of a subject.
Page 73
Use Case 3: Major System Outages in Recent Past
Friday Aug 05 2011

Lloyds Banking Group's wholesale banking division suffered several hours of downtime in its trading
system as a server cooling system failed and traders were forced to revert to pen and paper and
telephones during a day of turmoil in the stock market. The downtime came during a busy day on the
stock markets as billions of pounds were wiped off the value of shares. Lloyds Bank cooling system
failure to blame for trading downtime
Source: Computer Weekly
Monday Oct 10 2011

BlackBerry users have just started to report issues with BlackBerry Messenger and email, which
suggest that the network is down. This is not the first time that users complain of BlackBerry outage,
but this time seems to be a global issue as we've seen reports from all over the world and BlackBerry
owners in our team seem to be affected as well..
Source: CNN Money
Page 74
Monitoring & Decisioning Man vs. Machine
The variance of accuracy is inversely proportional to the increase in the breadth of the data
Current Situation Decisions changes based on each

Perception individual's past, interpretation,
perception and judgmental capabilities
Interpretation
Past Knowledge Action
Motivation Judgment
Decision
Consistency and Comprehension of the
breadth of data is humanely impossible
Trainable Model
Current Situation (More the data More the Accuracy) Consistent Results
(Perception, Interpretation,
Judgment remains the same)
Page 75
Notional Model for System Outage Prediction
Use Case Designing a automated framework that collects, correlates and classifies enterprise events; thereby
generating alert notifications identifying potential outage scenarios
Problem Statement Collection, classification and correlation of enterprise events
Autonomous self training intelligent event framework
Critical factors in automated predictive analysis Limitations of traditional reactive

o Real Time Faster high frequency data capture and analysis methodology
as soon as the data is captured. o Manned effort
o Data Volumes of historic data both structured and un- o Prone to errors
structured to train the system o Nearly impossible to have consistent results
o White Noise Percentage of relevant results retrieved from o Difficult to comprehend the breadth of data
the collected data o Impossible to meet and maintain the SLAs
o Autonomous Decisioning - More Intelligent automated
decision making.
Data Source MODEL Autonomous Actions
Time Box
Log Data Notifications
Pattern matching
Decide/Re-train
Filter White Noise
Collect
Aggregate
Events Corrective Actions
Enrich
Classify
Apply Rules
System Actions
Historic Data
Page 76
Intelligent Outage Learning
We can use a Big Data Platform to build a system outage framework that uses ML to predict outages
System Logs exception stack trace enterprise events customer call data is fed into a Mahout ML Model (Baseline)
ML Base model is continuously trained by regressing all exogenous & endogenous variables against test data set(s)
Refined base model is established as the yardstick for the prediction of future outages
Real time events are sent through the trained ML model at times t+n and variances from the fitted curve are observed
Event cloud input is multiplexed across multiple Mappers; each running an instance of the trained Mahout ML Model
Outlier events are identified by their violation of the normal variance threshold from the fitted curve of the ML Model
System Outage Predictive Framework
Page 77
Hadoop Installation & HDFS Hands on 2
Day
Page 78
Setting up the VM
For this demonstration, we will be using Clouderas Quick Start VM. This 64-bit VM contains a single-node Apache
Hadoop cluster. It runs on CentOS and includes:
o CDH4.6
o Cloudera Manager
o Cloudera Impala
o Cloudera Search
For the installation, we will be:

Installing VirtualBox (the virtualization application)
Downloading & Importing Cloudera Quickstart VM
Adjusting VM RAM and Copy/Paste
Optional: Shared Folder, Internet Settings
Cleaning Up and Starting the VM
Page 79
Setting up the VM
Installing VirtualBox
Step 1
Install Chocolatey from Command Prompt (as administrator) Chocolatey is a utility for easy command-line installation
C:\> @powershell -NoProfile -ExecutionPolicy unrestricted -Command "iex ((new-object

net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET
PATH=%PATH%;%systemdrive%\chocolatey\bin
Step 2
Install VirtualBox
C:\> cinst virtualbox
Step 3
Install 7Zip (will be used later to extract the VM)
C:\> cinst 7zip
Page 80
Setting up the VM
Downloading & Importing Cloudera Quickstart VM
Step 1
Download Cloudera Quickstart VM (approximately 8 minutes)
http://www.cloudera.com/content/support/en/downloads/download-components/download-
products.html?productID=F6mO278Rvo&version=2.1
Step 2
Unzip VM
Locate the downloaded VM zip file (cloudera-quickstart-vm-4.4.0-1-virtualbox.7z)
Move file to desired folder (optional)
Right-click on the file and from the 7-Zip sub-menu, select "Extract Here" (approximately 3 minutes)
Step 3
Import Appliance
Open VirtualBox (Oracle VM VirtualBox Manager)
Select from the File menu "Import Appliance
Located the extracted VM folder, expand it, and select the the *.ovf file
Hit Next then Import
Page 81
Setting up the VM
Adjusting RAM Settings
It is advisable to allocate at least 2GB of RAM for your host machine OS. CDH4.6 requires 4GB of RAM. Before loading the VM, it is
important to make sure the RAM settings leave at least 2GB of RAM space for your host OS to run. If you keep it at the default it is set to,
Windows and the VM will continuously fight for resources -- you don't really want Windows to become inoperable. This is done by:
Steps:
1. In Oracle VM VirtualBox Manager with
"cloudera-quicksta..."
highlighted click
2. Select the "System tab
3. Reduce the "Base Memory
of the VM to accommodate
enough memory for your
host OS
4. Hit OK
Page 82
Setting up the VM
Adjusting Copy/Paste Settings
It is advisable to allocate at least 2GB of RAM for your host machine OS. CDH4.6 requires 4GB of RAM. Before loading the VM, it is
important to make sure the RAM settings leave at least 2GB of RAM space for your host OS to run. If you keep it at the default it is set to,
Windows and the VM will continuously fight for resources -- you don't really want Windows to become inoperable. This is done by:
Steps:
highlighted click
2. Select the Advanced
sub-tab within General
3. Change the Shared
Clipboard drop-down to
Bidirectional.
4. Hit OK
Page 83
Setting up the VM
Cleaning Up and Starting the VM
Step 1
Clean up
You can remove the *.7z and the extracted folder (5GB of hard drive savings)
NOTE: The VM has already been loaded (i.e., installed) on your machine in:
c:\users\[user]\VirtualBox VMs\cloudera-quickstart-vm-4.4.0-1-virtualbox
This is where the functional machine exists (do not remove it).
Step 2
Start up the VM
With the "cloudera-quicksta..." appliance selected, click the Start button.
o Within Firefox, the bookmarks on top are for Cloudera/Hadoop managers

o Eclipse (shortcut on the desktop) already has a training folder setup with a blank
MapReduce template
o Go to System -> Shutdown
Page 84
Setting up the VM
Creating a Shared Folder
You can setup your Shared Folder to transfer files from your host machine to the VM. This will be useful when we wish to transfer our
tutorial files from our host to the VM My Documents folder.
Steps:
1. Be sure to have your virtual machine
shutdown
highlighted click
3. Select the Shared Folders tab
4. Change the icon to add a shared
folder.
5. Select Other folder path. Select the
folder on your host machine under
your My Documents folder which
you have created to share documents
between your machine and the VM.
6. Provide the Folder Name which is
what the name of the shared folder
will be within the VM.
7. Ensure that Auto-mount is checked.
Page 85
Setting up the VM
Transferring Files into VM
Step 0: Start the VM and Open a Terminal window
Step 1: Escalate privileges
$ sudo su
Step 2: Move the tutorial file from the shared folder to the home folder
o The shared folders are located within the /media/ path.
o The folder name provided in the VM Settings is prefixed with sf_
# mv /media/sf_VMShare/NYSE.tar.gz /home/cloudera/NYSE.tar.gz
Step 3: Change the owner of the file to cloudera
# chown cloudera /home/cloudera/NYSE.tar.gz
Step 4: Exit escalated privileges

# exit
Page 86
Setting up the VM
Internet Settings: Adjusting the Network Connection Settings
Note: Internet connectivity is not required for this tutorial. You may omit this section.
Steps:
1. Check internet connection
o Having the cloudera-quicksta.. VM selected,
start the VM by clicking
o Open up Firefox within the VM and attempt
to browse to a website (www.google.com).
o If the connection succeeds, your internet
connection is working and you do not need to
do anything else.
o If the connection fails, shutdown the VM
(System -> Shutdown) and following the
next steps.
2. Copy the VMs assigned MAC address.

o In the settings window of the VM, navigate
to the Network tab.
o Copy the MAC address
Page 87
Setting up the VM
Internet Settings: Adjusting the Network Connection Settings (Continued)
Note: Internet connectivity is not required for this tutorial. You may omit this section.
Steps:
3. Adjust VM Network Connections
o Start VM
o Under System -> Preferences, select Network Connections. If only System eth0 exists, click Add to eth01 as a new
Wired connection.
o If prompted, the Password for root is cloudera.
o The MAC addresses of both connections should be the same as copied from Step 1 and the first two checkboxes of each
connection should be checked.
o Under IPv4 Settings:
The connection method for eth01 should be Automatic (DHCP) addresses only.
The host IPv4 DNS Server address should be provided as the DNS Server
o Now that the internet connection works, you can use the web to download the training materials zip file to the My
Documents folder.
Page 88
Setting up the VM
Screen shots of Linux and Windows popup windows
For both the System eth0 and In Windows, record the IPv4 DNS Server Ensure that the connection Method for
eth01 connections, make sure that address (highlighted above) for the eth01 is Automatic (DHCP) addresses
you have checked the appropriate connected network. Select the current only. Provide the recorded IPv4 DNS
boxes and provided the Device MAC network from Network and Sharing Server from Windows (highlighted above).
address (highlighted above). Center in Control Panel, then click
Details.
HadoopPage
Hands-On
89 Training: Step-by-Step Examples
HDFS Hands on 2
Day
Page 90
HDFS
HDFS (Hadoop Distributed File System) distributed file system that stores the data on the commodity machines, providing
high aggregate bandwidth across the cluster. HDFS allows us to not worry about where the file is actually stored in the
Hadoop clusters and treat it as a singular file system.
For the intro on HDFS, we will be:

Setting up the user folder in Hadoop
Transferring data files into Hadoop
Optional: HDFS shell commands
The File System (FS) is invoked by:
% hadoop fs <args>
Page 91
HDFS
Setting up the user folder in Hadoop
Step 0: Start the VM and Open a Terminal window

o We will do this every time we run commands.
Step 1: Escalate privileges for HDFS
$ sudo su hdfs
Step 2: Create the cloudera user folder. (If this errors, simply proceed)
$ hadoop fs -mkdir /user/cloudera
Step 3: Change the owner of the folder to cloudera
$ hadoop fs -chown cloudera /user/cloudera
Step 4: Exit escalated privileges
$ exit
Page 92
HDFS
Shell Command Basics
All the FS shell commands take path URIs as arguments
The URI format is scheme://autority/path
For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If
not specified, the default scheme specified in the configuration is used
An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as
/parent/child (given that your configuration is set to point to hdfs://namenodehost)
Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of
the commands
Error information is sent to stderr and the output is sent to stdout
Page 93
HDFS
Shell Commands List
recall getmerge Concatenates files in src into the destination local

file. Specifying addnl is for adding a new line between files
% hadoop fs <args>
-getmerge <src> <localdst> [addnl]
dus Displays a summary of file lengths ls For a file returns stat on the file and for a directory returns
list of its direct children.
-dus <args>
-ls <args>
expunge Empty the trash. lsr - Recursive version of ls. Similar to Unix ls -R.
-expunge -lsr <args>
get - Copy files to the local file system. CRC options relate mkdir - Takes path uri's as argument and creates directories.
to cyclic redundancy check. The behavior is much like unix mkdir -p creating parent
directories along the path.
-get [-ignorecrc] [-crc] <src> <localdst>
-mkdir <paths>
mv - Moves files from source to destination within HDFS.

For multiple sources, the destination needs to be a
directory.
-mv URI [URI ] <dest>
Page 94
HDFS
Shell Commands List (continued)
put Copy single or multiple sources from local file system tail Returns last kilobyte of the file to stdout.
to HDFS. Specifying - as source is stdin.
-put <localsrc> ... <dst> -tail [-f] URI
rm Deletes non-empty directories and files. test Test to see if the file exists (E), is zero length (Z), or is a
directory (D)
-rm URI [URI ] -test -[ezd] URI
rmr Recursive version of delete. text - Takes a source file and outputs the file in text format. The
allowed formats are zip and TextRecordInputStream.
-rmr URI [URI ] -text <src>
setrep Change the replication factor of a file. Provide R touchz - Create a file of zero length.
for recursive.
-touchz URI [URI ]
-setrep <args> [-R] <path>
stat Returns stat information on the path.
-stat URI [URI ]
Page 95
Lunch 2
Day
Page 96
MapReduce Deep Dive 2
Day
Page 97
Features of MapReduce
Automatic parallelization and distribution
A clean abstraction for programmers
o MapReduce programs are usually written in Java
Can be written in any language using Hadoop Streaming (see later)
All of Hadoop is written in Java
o MapReduce abstracts all the housekeeping away from the developer
Developer can concentrate simply on writing the Map and Reduce functions
Automatic Fault tolerance
Status and monitoring tools
Page 98
MapReduce
JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker

The JobTracker resides on a master node
o Clients submit MapReduce jobs to the JobTracker
o The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
o These nodes each run a separate daemon known as the TaskTracker
o The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting
progress back to the JobTracker
Page 99
MapReduce
Terminology
MapReduce jobs job is a full program

A task is the execution of a single Mapper or Reducer over a slice of data
A task is attempt is a particular instance of an attempt to execute a task
o There will be at least as many task attempts as there are tasks
o If a task attempt fails, another will be started by the JobTracker
o Speculative execution can also result in more task attempts than completed tasks
Page 100
MapReduce
Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to
avoid network traffic
o Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs (pseudo code):
map(in_key, in_value) -> (inter_key, inter_value) list
The Mapper may use or completely ignore the input key
For example, a standard pattern is to read a line of a le at a time
o The key is the byte oset into the le at which the line starts
o The value is the contents of the line itself
o Typically the key is considered irrelevant

If the Mapper writes anything out, the output must be in the form of key/value pairs
Page 101
MapReduce
Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are
combined together into a list
This list is given to a Reducer
o There may be a single Reducer, or multiple Reducers
This is specied as part of the job conguration (see later)
o All values associated with a particular intermediate key are guaranteed to go to the same Reducer
o The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
o This step is known as the shue and sort

The Reducer outputs zero or more nal key/value pairs
o These are written to HDFS
o In practice, the Reducer usually displays a single key/value pair for each input key
Page 102
MapReduce
Data Locality
Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block
of data stored locally on that node via HDFS
If this is not possible, the Map task will have to transfer the data across the network as it processes
that data
Once the Map tasks have nished, data is then transferred across the network to the Reducers
o Although the Reducers may run on the same physical machines as the Map tasks, there is no
concept of data locality for the Reducers
All Mappers will, in general, have to communicate with all Reducers
Page 103 Copyright 2010/2013 EY. All rights reserved. Not to be reproduced without prior writ t en consent. 03#43
MapReduce: Bigger Picture?
Node 1 Node 2
Files loaded from Files loaded from

local HDFS stores local HDFS stores
Input Format Input Format
File File
Split Split Split Split Split Split
File File
RR RR RR RR RR RR
Record readers: Record readers:
Input (k, v) pairs Input (k, v) pairs
map map map map map map
Intermediate (k, v) pairs Intermediate (k, v) pairs
Partitioner Partitioner
Shuffling Process
Intermediate (k, v)
(Sort) pairs exchanged by all
(Sort)
nodes
Reduce Reduce
Final (k, v) pairs Final (k, v) pairs
Write-back to Write-back to
local HDFS store Output Format Output Format local HDFS store
Page 104
MapReduce
Is Shuffle and Sort a Bottleneck?
It appears that the shue and sort phase is a bottleneck
o The reduce method in the Reducers cannot start until all Mappers have nished
In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers nish work
o This mitigates against a huge amount of data transfer starting as soon as the last Mapper nishes
o Note that this behavior is congurable
The developer can specify the percentage of Mappers which should nish before Reducers start
retrieving data
o The developers reducemethod still does not start until all intermediate data has been
transferred and sorted
Page 105
MapReduce
Is a Slow Mapper a Bottleneck?
It is possible for one Map task to run more slowly than the others
o Perhaps due to faulty hardware, or just a very slow machine

It would appear that this would create a bottleneck
o The reducemethod in the Reducer cannot start until every Mapper has nished
Hadoop uses speculative execution to mitigate against this
o If a Mapper appears to be running signicantly more slowly than the others, a new instance
of the Mapper will be started on another machine, operating on the same data
o The results of the rst Mapper to nish will be used
o Hadoop will kill o the Mapper which is still running
Page 106
MapReduce
The Five Hadoop Daemons
Hadoop is comprised of ve separate daemons
NameNode
o Holds the metadata for HDFS

Secondary NameNode
o Performs housekeeping functions for the NameNode
o Is not a backup or hot standby for the NameNode!

DataNode
o Stores actual HDFS data blocks

JobTracker
o Manages MapReduce jobs, distributes individual tasks to machines running the

TaskTracker
o Instantiates and monitors individual Map and Reduce tasks
Page 107
MapReduce
The Five Hadoop Daemons (Continued)
Each daemon runs in its own Java Virtual Machine (JVM)

No node on a real cluster will run all ve daemons
o Although this is technically possible

We can consider nodes to be in two dierent categories:
o Master Nodes
Run the NameNode, Secondary NameNode, JobTracker daemons
Only one of each of these daemons runs on the cluster
o Slave Nodes
Run the DataNode and TaskTracker daemons
A slave node will run both of these daemons
Page 108
MapReduce
Submitting a Job
When a client submits a job, its configuration information is packaged into an XML le
This le, along with the .jarle containing the actual program code, is handed to the JobTracker
o The JobTracker then parcels out individual tasks to TaskTracker nodes
o When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
o TaskTracker nodes can be congured to run multiple tasks at the same time, if the node has enough
processing power and memory
The intermediate data is held on the TaskTrackers local disk
As Reducers start up, the intermediate data is distributed across the network to the Reducers
Reducers write their nal output to HDFS
Once the job has completed, the TaskTracker can delete the intermediate data from its local disk
o Note that the intermediate data is not deleted until the entire job completes
Page 109
Configuration Properties
Filename Format Description
hadoop-env.sh Bash script Environment variables that are used in the scripts to run Hadoop.
Hadoop
Configuration settings for Hadoop core, such as I/O settings that are
core-site.xml configuration
common to HDFS and MapReduce.
XML
Hadoop
Configuration settings for HDFS daemons: the namenode, the secondary
hdfs-site.xml configuration
namenode, and the datanodes.
XML
Hadoop
Configuration settings for MapReduce daemons: the jobtracker and the
mapred-site.xml configuration
tasktrackers.
XML
masters Plain text A list of machines (one per line) that each run a secondary namenode.
A list of machines (one per line) that each run a datanode and a
slaves Plain text
tasktracker.
hadoop-metrics.properties Java properties Properties controlling how metrics are published in Hadoop.
Properties for the system logfiles, the namenode audit log, and the task log
log4j.properties Java properties
for the tasktracker child process.
Source: Hadoop Definitive Guide, Tom White
Page 110
Property Name Type Default Value Description
The default filesystem. The URI defines the

fs.default.name hostname and port that namenodes RPC
URI file:///
(core-site.xml) server runs on. The default port is 8020.
This property is set in core-site.xml.
The list of directories where the namenode

dfs.name.dir comma-separated- stores its persistent metadata. The
${hadoop.tmp.dir}/dfs/name
(hdfs-site.xml) directory names namenode stores a copy of the metadata in
each directory in the list.
dfs.data.dir comma-separated- The list of directories where the datanode

(hdfs-site.xml) directory names ${hadoop.tmp.dir}/dfs/data stores blocks. Each block is stored in only
one of these directories.
The list of directories where the secondary

fs.checkpoint.dir
comma-separated- ${hadoop.tmp.dir}/dfs/name namenode stores checkpoints. It stores a
(hdfs-site.xml)
directory names secondary copy of the checkpoint in each directory in
the list.
Page 111
The hostname and port that jobtrackers RPC server

runs on. If set to the default value of local then the
mapred.job.tracker
Hostname and jobtracker is run in-process on demand when you run a
(mapred-site.xml) local
Port MapReduce job (you do not need to start the
jobtracker in this case, and in fact will get an error if
you try to start it in this mode).
comma- The list of directories where MapReduce stores

mapred.local.dir ${hadoop.tmp.dir}/
separated- intermediate data for jobs. The data is removed when
(mapred-site.xml) mapred/local
directory names the job completes.
Mapred-site.xml
mapred.system.dir ${hadoop.tmp.dir}/ The directory relative to fs.default.name where shared

URI
(mapred-site.xml) mapred/system files are stored, during a job run.
mapred.tasktracker.map
The number of map tasks that may be run on
.tasks.maximum Int 2
tasktracker at any one time.
(mapred-site.xml)
mapred.tasktracker.red
The number of reduce tasks that may be run on
uce.tasks.maximum Int 2
tasktracker at any one time.
(mapred-site.xml)
The JVM option used to launch tasktracker child

mapred.child.java.opts
process that runs map and reduce tasks. This property
(mapred-site.xml) String -Xmx200m
can be set on a per-job basis, which can be useful for
setting JVM properties for debugging, for example.
Page 112
mapred.map.java.opts The JVM option used for the child process that runs
String -Xmx200m
(mapred-site.xml) map tasks. From 0 .21.
mapred.reduce.java.opts The JVM option used for the child process that runs
String -Xmx200m
(mapred-site.xml) reduce tasks. From 0 .21.
Page 113
MapReduce
Submitting a Job
This consists of three portions
o The driver code
Code that runs on the client to congure and submit the job
o The Mapper
o The Reducer
Before we look at the code, we need to cover some basic Hadoop API concepts
Page 114
MapReduce
Getting Data to the Mapper
The data passed to the Mapper is specied by an InputFormat
o Specied in the driver code
o Denes the location of the input data
A le or directory, for example
o Determines how to split the input data into input splits
Each Mapper deals with a single input split
o InputFormat is a factory for RecordReaderobjects to extract (key, value) records from the input source
Page 115
Map Reduce Hands on Development
& Deployment 2
Day
Page 116
Executing a MapReduce Program
MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. It
includes:
o A Mapper class containing the map() function
o A Reducer class containing the reduce() function
o A driver class which configures the MapReduce job in the main() function
For this demonstration, we will be:

Compiling and exporting a MapReduce job using commands
Running the MapReduce program using Hadoop
Tracking a MapReduce job using the Hadoop JobTracker
Examining the output of a MapReduce execution
Page 117
Compiling and exporting an MR job using commands
Step 1
Change the directory to where we will be compiling and exporting the MapReduce program.
This folder already contains the *.class and *.jar files but we will overwrite them for the sake of this exercise.
$ cd /home/cloudera/NYSE/bin
Step 2
Compile the *.class files using javac (Java compiler)
$ javac -classpath /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*:/home/cloudera/NYSE/src

/home/cloudera/NYSE/src/AvgHigh.java /home/cloudera/NYSE/src/AvgHighMapper.java
/home/cloudera/NYSE/src/AvgHighReducer.java
Included library folders (colon delimited) Driver source file Mapper source file Reducer source file
Step 3
Export the class files into a jar file.
$ jar cvf averageHigh.jar AvgHigh.class AvgHighMapper.class AvgHighReducer.class
Page 118
Step 1
Change the directory to where we the jar file exists.
Step 2
Set HADOOP_CLASSPATH environment variable to the jar file
$ export HADOOP_CLASSPATH=averageHigh.jar
Step 3
Execute the MapReduce program using Hadoop
$ hadoop AvgHigh /user/cloudera/NYSE/data/EOD2013.txt output/AvgHigh
Application name within jar file Input data on HDFS Output folder on HDFS
Beginning of output from

map() function called execution: Jobid highlighted.
reduce() function called map() and reduce() calls
indicated.
Page 119
Tracking a MapReduce Job using the Hadoop JobTracker
Page 120
Tracking a MapReduce Job using the Hadoop JobTracker
Page 121
Step 1
Examine the files contained in the output folder specified when the MapReduce program was executed.
$ hadoop fs -ls output/AvgHigh
Step 2
Display the output (in HDFS) to the screen (stdout).
$ hadoop fs -cat output/AvgHigh/part-r-00000
Step 3
Make a local output folder.
$ mkdir /home/cloudera/NYSE/output
Step 4
Copy the output to your local (VM) file system.
$ hadoop fs -get output/AvgHigh /home/cloudera/NYSE/output/AvgHigh
Step 5
Display the output on your local system (VM) to the screen (stdout).
$ cat /home/cloudera/NYSE/output/AvgHigh/part-r-00000
Page 122
InputFormat - Hierarchy
CombineFile FileInputFormat
InputFormat <K, V> The base class used for all file
based InputFormats
TextInputFormat TextInputFormat
The default
Treats each \n-terminated line
<<Interface>> of a file as a value.
InputFormat <K, V> FileInputFormat <K, V> KeyValueTextInputFormat StreamInputFormat
org.apache.hadoop.mapred Key is the byte offset within
the file of that line.
NLineInputFormat KeyValueTextInputFormat
Maps \n-terminated lines as
KeySEP value.
SequenceFile SequenceFileAsBinary By default separator is a tab.
InputFormat <K, V> InputFormat
SequenceFileInputFormat
Binary file of (key, value) pairs
SequenceFileAsText
InputFormat
with some additional metadata
SequenceFileAsTextInputFormat
<<Interface>>
Composite SequenceFile Similar, but maps
CompositeInput
InputFormat <K, V> InputFilter <K, V> (key.toString(),
Format <K, V>
value.toString())
DBInputFile<T> DBInputFormat
EmptyInput
Format <K, V>
Page 123
OutputFormat - Hierarchy
TextOutputFormat <K, V>
<<Interface>> SequenceFileOutput SequenceFileAsBinary

OututFormat <K, V> FileOutputFormat <K, V> Format <K, V> OutputFormat
org.apache.hadoop.mapred
MapFileOutputFormat
MultipleOutputputFormat MultipleTextOutput
<K, V> Format <K, V>
MultipleSequenceFile
OutputFormat
NullOutput
Format <K, V>
DBOutputFormat<K, V>
FilterOutput LazyOutputFormat <K, V>

Format <K, V>
Page 124
Writable & WritableComparable?
Hadoop denes its own box classes for strings, integers and so on o Keys and values in Hadoop
IntWritable for ints are Objects
LongWritable for longs
o Values are objects which
FloatWritable for oats
implement Writable
DoubleWritable for doubles
Text for strings o Keys are objects which
Etc. implement
WritableComparable
The Writable interface makes serialization quick and easy for Hadoop
Any values type must implement the Writable interface
Any values type must implement the Writable interface

Two WritableComparables can be compared against each other to determine their order
Keys must be WritableComparables because they are passed to the Reducer in sorted order
Page 125 Copyright 2010/2013 EY. All rights reserved. Not to be reproduced without prior writ t en consent. 04#16
Driver & Jobs
The Driver Code Specifying the InputFormat:
o The driver code runs on the client machine The default InputFormat (TextInputFormat) will be used
o It congures the job, then submits it to the cluster unless you specify otherwise
To use an InputFormat other than the default, use e.g.:
job.setInputFormatClass(KeyValueTextInputFormat.class)
Creating a New Job Object
o The Job class allows you to set configuration options for your
MapReduce job Specifying Final Output with OutputFormat:
o The classes to be used for your Mapper and Reducer FileOutputFormat.setOutputPath() species the directory
o The input and output directories to which the Reducers will write their nal output
o Many other options
The driver can also specify the format of the output data
o Any options not explicitly set in your driver code will be read
from your Hadoop configuration les o Default is a plain text le
o Usually located in /etc/hadoop/conf o Could be explicitly written as
o Any options not specied in your configuration les will receive o job.setOutputFormatClass(TextOutputFormat.class)
Hadoops default values
o You can also use the Job object to submit the job, control its
execution, and query its state
Running The Job:

Determining While Files to Read There are two ways to run your MapReduce job:
o By default, FileInputFormat.setInputPaths() will read all les o job.waitForCompletion()
from a specied directory and send them to Mappers Blocks (waits for the job to complete before continuing)
o Exceptions: items whose names begin with a period (.) or
underscore (_) o job.submit()
o Globs can be specied to restrict input, e.g.: /2010/*/01/* Does not block (driver code continues as the job is
o Alternatively, FileInputFormat.addInputPath() can be called running)
multiple times, specifying a single le or directory each time The job determines the proper division of input data into
o More advanced ltering can be performed by implementing a
InputSplits, and then sends the job information to the
PathFilter
JobTracker daemon on the cluster
Page 126
MapReduce Architecture
2. get new job id

MapReduce 1. run job 5. initialize
Job Client 4. submit job JobTracker job
Program
Client JVM
Client Node 6. retrieve JobTracker Node

input splits
7. Heartbeat
3. copy job (return tasks)
resources
TaskTracker
8. retrieve job
resources 9. launch
Child JVM
Shared Filesystem
(e.g., HDFS)
Child
10. run
MapTask
Or
ReduceTask
TaskTracker Node
Page 127
Building a MapReduce Program
We will be utilizing Eclipse as the IDE to build our MapReduce program.
o IDE stands for Integrated Development Environment.

o Eclipse has become popular because it is free and can be used to program in various languages.
o Eclipse can be extendable for customizing the working environment for a specific programming language.
o Common elements of an IDE include a source code editor, build automation tools, and a debugger.

Creating a new Java Project
Eclipse workspace environment
Examine the data
Writable classes (data types)
Building the Mapper class
Building the Reducer class
Building the driver class
Exporting the JAR package
Executing the MapReduce job
Page 128
Creating a new Java Project
Steps:
1. Open Eclipse
2. Go to File > New, select Java Project
3. Provide NYSE as the project name
and modify the Location (uncheck
Use default location) to:
/home/cloudera/NYSE/src
4. After clicking Next on the Create a

Java Project window, browse to the
Libraries tab. Click on the Add
External JARs
The external JARs will be the
compilations of Hadoop code
that makes the program
understand the references to
Hadoop / MapReduce code.
Page 129
Creating a new Java Project (continued)
Steps (continued):
5. Select the *.jar files in the /usr/lib/hadoop folder
Click on File System
Open the usr folder, then the lib folder, then the hadoop folder
Page 130
Creating a new Java Project (continued)
Steps (continued):
6. After clicking OK, again click on Add External JARs and select the *.jar files in the /usr/lib/hadoop/client-0.20 folder.
Click OK, then Finish to return to the workspace.
Page 131
Eclipse Working Environment
Open Files
Imported libraries
Project
Package
Comments
Files
Class
Classes definition
Functions
Function
Libraries definition
included when
compiling
Page 132
Examining the Data
Preview the first 10 lines of the data file.

$ head -10 /home/cloudera/NYSE/data/EOD2013.txt
Youll notice the following output:
The fields of the EOD2013.txt data file are:

o Symbol, Date, Open, High, Low, Close, Volume
Page 133
Writable classes (data types)
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package:
o There are Writable wrappers for all the Java primitive types except char which can be stored in an IntWritable.
o All wrappers have a get() and set() method for retrieving and storing the value.
Java primitive Writable classes Bytes Other Writable classes:
boolean BooleanWritable NullWritable

1
Text
byte ByteWritable 1 BytesWritable
MD5Hash
short ShortWritable 2 ObjectWritable
int IntWritable
GenericWritable
4
ArrayWritable
VIntWritable 1-5 ArrayPrimitiveWritable
TwoDArrayWritable
float FloatWritable 4 AbstractMapWritable
o MapWritable
long LongWritable 8
o SortedMapWritable
VLongWritable 1-9 EnumSetWritable
CompressedWritable
double DoubleWritable 8 VersionedWritable
Page 134
Mapper Class: Import Commands
Import commands are used to include references to functionality used in code.

java.io.IOException used to throw an error when needed
org.apache.hadoop.io.DoubleWritable include the Hadoop defined DoubleWritable data type
org.apache.hadoop.io.LongWritable include the Hadoop defined LongWritable data type
org.apache.hadoop.io.Text include the Hadoop defined Text data type
org.apache.hadoop.mapreduce.Mapper include the Mapper template which we will extend to make our own
Mapper
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
Page 135
Mapper Class: Building the Mapper class
The AvgHighMapper is a subclass of the Hadoop defined Mapper class

Our AvgHighMapper subclass extends the Mapper class
The Mapper takes an input <key, value> and outputs a <key, value>
The first parameter (LongWritable) is the input key data type
The second parameter (Text) is the input value data type
The third parameter (Text) is the output key data type
The fourth parameter (DoubleWritable) is the output value data type
The output <key, value> will the be input <key, value> for the Reducer
public class AvgHighMapper extends Mapper<LongWritable, Text, Text,

DoubleWritable> {
Page 136
Mapper Class: Building the map function
The map function is the function for the Mapper that is executed by Hadoop
The map function takes three parameters.
The key parameter is the auto-assigned id of the line that Hadoop is processing. For most purposes, this is an arbitrary
value.
The value parameter is the line that Hadoop is processing.
The context is where the output is processed to.

DoubleWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
}
}
Page 137
Mapper Class: Mapper skeleton
The skeleton below represents the key elements that are found in each Mapper.
The purple highlighted elements can be changed to work with your needs.

DoubleWritable> {
@Override
// CODE GOES HERE
}
}
Page 138
Mapper Class: The code

DoubleWritable> {
@Override
String[] line = value.toString().split(",");
if (line.length > 3)
context.write(new Text(line[0]), new
DoubleWritable(Float.parseFloat(line[3])));
}
}
Page 139
Reducer Class: Import Commands

java.io.IOException used to throw an error when needed
org.apache.hadoop.io.DoubleWritable include the Hadoop defined DoubleWritable data type
org.apache.hadoop.io.Text include the Hadoop defined Text data type
org.apache.hadoop.mapreduce.Reducer include the Reducer template which we will extend to make our own
Reducer
import org.apache.hadoop.mapreduce.Reducer;
Page 140
Reducer Class: Building the Reducer class
The AvgHighReducer is a subclass of the Hadoop defined Reducer class

Our AvgHighReducer subclass extends the Reducer class
The Mapper takes an input <key, value> and outputs a <key, value>
The first parameter (Text) is the input key data type and matches the output key data type of the Mapper
The second parameter (DoubleWritable) is the input value data type and matches the output data type of the Mapper
The third parameter (Text) is the output key data type
The fourth parameter (DoubleWritable) is the output value data type
public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable>

{
Page 141
Reducer Class: Building the reduce function
The map function is the function for the Mapper that is executed by Hadoop
The map function takes three parameters.
The key parameter is the auto-assigned id of the line that Hadoop is processing. For most purposes, this is an arbitrary
value.
The value parameter is the line that Hadoop is processing.
The context is where the output is processed to.
public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text,

DoubleWritable> {
@Override
public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
}
}
Page 142
Reducer Class: Reducer skeleton
The skeleton below represents the key elements that are found in each Reducer.
The purple highlighted elements can be changed to work with your needs.
public class AvgHighReducer extends Reducer<LongWritable, Text, Text,

DoubleWritable> {
@Override
// CODE GOES HERE
}
}
Page 143
Reducer Class: The code
public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text,
DoubleWritable> {
@Override
double symbolCount = 0.0;
double dailyHighs = 0.0;
for (DoubleWritable value : values) {
dailyHighs += value.get();
symbolCount++;
}
context.write(key, new DoubleWritable(dailyHighs/symbolCount));
}
}
Page 144
Driver Class: Import Commands

org.apache.hadoop.fs.Path Hadoop defined Path data type
org.apache.hadoop.io.DoubleWritable Hadoop DoubleWritable data type
org.apache.hadoop.io.Text Hadoop defined Text data type
org.apache.hadoop.mapreduce.lib.input.FileInputFormat Input file
org.apache.hadoop.mapreduce.lib.input.FileOutputFormat Output file
org.apache.hadoop.mapreduce.Job Hadoop MapReduce Job specifications
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
Page 145
Driver Class: The code
public class AvgHigh {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: AvgHigh <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(AvgHigh.class);
job.setJobName("Average Ticker Highs");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(AvgHighMapper.class);
job.setReducerClass(AvgHighReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Page 146
Exporting the JAR package
Steps:
1. Right click the (default
package) and select the Export option
2. Select JAR File under the Java folder and
press Next
Page 147
Exporting the JAR package (continued)
Steps: (continued)
3. Provide the following as the destination of the
JAR file and hit Finish.
home/cloudera/NYSE/bin/averageHigh.jar
At this point, we are at the same point we

were after the Executing a MapReduce
Program: Compiling and exporting an MR job
using commands slide.
We have exported a *.jar file using Eclipse
instead of doing it from the Terminal as we did
before.
Page 148
Executing the MapReduce Job
Step 1
Change the directory to where we the jar file exists.
Step 2
Set HADOOP_CLASSPATH environment variable to the jar file
$ export HADOOP_CLASSPATH=averageHigh.jar
Step 3
Remove the output/AvgHigh folder in HDFS that was created before. The skipTrash option removes it immediately (no trash).
$ hadoop fs -rm -r -skipTrash output/AvgHigh
Step 4
Execute the MapReduce program using Hadoop
$ hadoop AvgHigh /user/cloudera/NYSE/data/EOD2013.txt output/AvgHigh
Step 5
Review the results
$ hadoop fs -cat output/AvgHigh/part-r-00000
Page 149
The Streaming API
Many organizations have developers skilled in languages other than Java, such as:
o Ruby
o Python
o Perl
The Streaming API allows developers to use any language they wish to write Mappers and Reducers
As long as the language can read from standard input and write to standard output
Advantages of the Streaming API Disadvantages of the Streaming API

No need for non/Java coders to learn Java Performance
Fast development time Primarily suited for handling data that can be
Ability to use existing code libraries represented as text
Streaming jobs can use excessive amounts of RAM
or fork excessive numbers of processes
Although Mappers and Reducers can be written
using the Streaming API, Partitioners, InputFormats
etc. must still be written in Java
Page 150
More on the Hadoop API
The ToolRunner class

How to improve the efficiency of intermediate data with Combiners
The setup and cleanupmethods
How to write custom Partitioners for better load balancing
How to access HDFS programmatically
How to use the distributed cache
Page 151
Why Use ToolRunner?
You can use ToolRunnerin MapReduce driver classes

o This is not required, but is a best practice
ToolRunner uses the GenericOptionsParser class internally

o Allows you to specify conguration options on the command line
o Also allows you to specify items for the Distributed Cache on the command line (see later)
Page 152
The Combiner
Often, Mappers produce large amounts of intermediate data

o That data must be passed to the Reducers
o This can result in a lot of network trac
It is often possible to specify a Combiner

o Like a mini/Reducer
o Runs locally on a single Mappers output
o Output from the Combiner is sent to the Reducers
o Input and output data types for the Combiner/Reducer must be identical
Combiner and Reducer code are often identical

A Combiner would decrease the amount of data sent to the Reducer
Combiners decrease the amount of network trac required during the shue and sort
phase
Page 153
Specifying a Combiner
To specify the Combiner class to be used in your MapReduce code, put the following line
in your Driver:
job.setCombinerClass(YourCombinerClass.class);
The Combiner uses the same interface as the Reducer

o Takes in a key and a list of values
o Outputs zero or more (key, value) pairs
o The actual method called is the reducemethod in the class
VERY IMPORTANT: The Combiner may run once, or more than once, on the output
from any given Mapper
o Do not put code in the Combiner which could inuence your results if it runs more than once
Page 154
The setup / Clean up Method
It is common to want your Mapper or Reducer to execute some code before the
mapor reducemethod is called
o Initialize data structures
o Read data from an external le
o Set parameters
The setupmethod is run before the mapor reducemethod is called for the rst time
public void setup(Context context)
Similarly, you may wish to perform some action(s) after all the records have been
processed by your Mapper or Reducer
The cleanupmethod is called before the Mapper or Reducer terminates
public void cleanup(Context context) throws

IOException, InterruptedException
Page 155
What Does The Partitioner Do?
The Partitioner divides up the keyspace
o Controls which Reducer each intermediate key and its associated values goes to
Often, the default behavior is ne

o Default is the HashPartitioner
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) { return (key.hashCode() &
Integer.MAX_VALUE) % numReduceTasks;
}
}
Page 156
Creating a Custom Partitioner
Partitioner
Step 1 : Create a class for the custom Partitioner

extend Partitioner
public class MyPartitioner<K, V> extends Partitioner<K, V> {

//variables and methods
}
Step 2: Create a method in the class called getPartition

Receives the key, the value, and the number of Reducers
Should return an int between 0 and one less than the number of Reducers
e.g., if it is told there are 10 Reducers, it should return an int between 0 and 9
public int getPartition(key, value, int) { //do something }
Step 3: Specify the custom Partitioner in your driver code
job.setPartitionerClass(MyPartitioner.class);
Page 157
The MapReduce Flow: Shuffle and Sort
Node 1 Node 2
Files loaded from Files loaded from

local HDFS stores local HDFS stores
Input Format Input Format
File File
Split Split Split Split Split Split
File File
RR RR RR RR RR RR
Record readers: Record readers:
Input (k, v) pairs Input (k, v) pairs
map map map map map map
Intermediate (k, v) pairs Intermediate (k, v) pairs
Partitioner Partitioner
Shuffling Process
Intermediate (k, v)
(Sort) pairs exchanged by
(Sort)
all nodes
Reduce Reduce
Final (k, v) pairs Final (k, v) pairs
Write-back to Write-back to
local HDFS Output Format Output Format local HDFS
store store
Page 158
The FileSystem API
Some useful API methods:
o FSDataOutputStream create(...)
Extends java.io.DataOutputStream
Provides methods for wriEng primiEves, raw bytes etc
o FSDataInputStream open(...)
Extends java.io.DataInputStream
Provides methods for reading primiEves, raw bytes etc

o boolean delete(...)
o boolean mkdirs(...)
o void copyFromLocalFile(...)
o void copyToLocalFile(...)
o FileStatus[] listStatus(...)
Copyright 2010/2013 EY. All rights reserved. Not to be

Page 159
reproduced without prior wri> en consent.
The FileSystem API: Directory Listing
Get a directory listing:
Path p = new Path("/my/path");

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileStats = fs.listStatus(p);

for (int i = 0; i < fileStats.length; i++) {
Path f = fileStats[i].getPath();
//do something
}

Page 160
The Distributed Cache
A common requirement is for a Mapper or Reducer to need access to some side data
o Lookup tables
o Dictionaries
o Standard conguration values
Option 1: read directly from HDFS in the setupmethod

o Works, but is not scalable
Option 2: The Distributed Cache provides an API to push data to all slave nodes
o Transfer happens behind the scenes before any task is executed
o Note: Distributed Cache is read/only
o Files in the Distributed Cache are automatically deleted from slave nodes when the job nishes

Page 161
Using the Distributed Cache: Via coding
Place the les into HDFS
Congure the Distributed Cache in your driver code
Configuration conf = new Configuration();

DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"),conf);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"),conf);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz",conf));
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz",conf));
jarles added with addFileToClassPath will be added to your Mapper or Reducers classpath
Files added with addCacheArchive will automatically be dearchived/decompressed
Page 162
Using the DistributedCache: Command line
If you are using ToolRunner, you can add les to the Distributed Cache directly from the
command line when you run the job
o No need to copy the les to HDFS rst
Use the -files option to add les
hadoop jar myjar.jar MyDriver -files file1, file2, file3, ...
The -archivesag adds archived les, and automatically unarchives them on the destination
machines
The -libjarsag adds jar les to the classpath

Page 163
Accessing Files in the Distributed Cache
Files added to the Distributed Cache are made available in your tasks local working
directory
o Access them from your Mapper or Reducer the way you would read any ordinary local le
File f = new File("file_name_here");
Page 164
Reusable Classes for the New API
The org.apache.hadoop.mapreduce.lib.*/* packages contain a library of Mappers,
Reducers, and Partitioners supporting the new API
Example classes:
o InverseMapper Swaps keys and values
o RegexMapper Extracts text based on a regular expression
o IntSumReducer, LongSumReducer Add up all values for a key
o TotalOrderPartitioner Reads a previously/created partition l e and partitions based on

the data from that le:
Sample the data rst to create the partition l e
Allows you to partition your data into n partitions without hard/ coding the
partitioning information
Page 165
Most Common InputFormats
Most common InputFormats:
o TextInputFormat
o KeyValueTextInputFormat
o SequenceFileInputFormat
Others are available

o NLineInputFormat
Every n lines of an input le is treated as a separate InputSplit
Congure in the driver code by setting:
mapreduce.input.lineinput.linespermap
o MultiFileInputFormat
Abstract class that manages the use of multiple les in a single task
You must supply a getRecordReader()implementation
Page 166
How FileInputFormat Works
All le based InputFormats inherit from FileInputFormat
FileInputFormat computes InputSplits based on the size of each le, in bytes
o HDFS block size is used as upper bound for InputSplit size
o Lower bound can be specied in your driver code
o This means that an InputSplit typically correlates to an HDFS block
So the number of Mappers will equal the number of HDFS blocks of input data to be processed
Important: InputSplits do not respect record boundaries!
Page 167
What RecordReaders Do
InputSplits are handed to the RecordReaders
o Specied by the path, starting position oset, length
RecordReaders must:
o Ensure each (key, value) pair is processed
o Ensure no (key, value) pair is processed more than once
o Handle (key, value) pairs which are split across Input Splits Not a good idea
Page 168
OutputFormat
OutputFormats work much like InputFormat classes
Custom OutputFormats must provide a RecordWriter implementation
Page 169
Compressions
Compression Format Tool Algorithm File Extension Split-able
DEFLATE (a) N/A DEFLATE .deflate No
gzip gzip DEFLATE .gz No
bzip2 bzip2 bzip2 .bz2 Yes
LZO lzop LZO .lzo No (b)
Snappy N/A Snappy .snappy No
(a) DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available
command line tool for producing files in DEFLATE format as gzip is normally used. (Note the gzip file format is
DEFLATE with extra headers and a footer.). The .deflate file extension is a Hadoop convention.
(b) LZO files are split-able if they have been indexed in a pre-processing step,
Compression Format Hadoop Compression Codec
DEFLATE org.apache.hadoop.io.compress.DefaultCodec
gzip org.apache.hadoop.io.compress.GzipCodec
bzip2 org.apache.hadoop.io.compress.BZip2Codec
LZO org.apache.hadoop.io.compress.LzopCodec
Snappy org.apache.hadoop.io.compress.SnappyCodec
Page 170
Hadoop and Compressed Files
Hadoop understands a variety of le compression formats
o Including GZip
If a compressed le is included as one of the les to be processed, Hadoop will automatically decompress
it and pass the decompressed contents to the Mapper
o There is no need for the developer to worry about decompressing the le

However, GZip is not a splitable le format
o A GZipped le can only be decompressed by starting at the beginning of the le and continuing on to
the end
o You cannot start decompressing the le part of the way through it
Page 171
Non-Splittable File Formats and Hadoop
If the MapReduce framework receives a non splittable le (such as a GZipped le) it passes the
entire file to a single Mapper
This can result in one Mapper running for far longer than the others
o It is dealing with an entire le, while the others are dealing with smaller portions of les
o Speculative execution could occur
Although this will provide no benet

Typically it is not a good idea to use GZip to compress les which will be processed by MapReduce
Page 172
Snappy Codec
Splittable Compression for SequenceFiles and Avro Files Using the Snappy Codec
Snappy is a relatively new compression codec
o Developed at Google
o Very fast
Snappy does not compress a SequenceFile and produce, e.g., a le with a .snappyextension
o Instead, it is a codec that can be used to compress data within a le
o That data can be decompressed automatically by Hadoop (or other programs) when the le
is read
o Works well with SequenceFiles, Avro les

Snappy is now preferred over LZO
Page 173
Map Reduce Patterns
Summarization Patterns Join Patterns
o Numerical Summarization o Reduce Side Join
o Inverted Index Summarization o Replicated Join
o Counting with Hadoop Counters o Composite Join
Filtering Patterns o Cartesian Product
o Normal Filtering Meta Patterns
o Bloom Filtering o Job Chaining
o Top Ten Filtering o Chain Folding
Data Organization patterns o Job Merging
o Structured to Hierarchical Input Output Patterns
o Partitioning o Customizing I/O & O/P
o Binning o Generating Data
o Total Order Sorting o External Source Output
o Shuffling o External Source Input
o Partition Pruning
Page 174
Map and Reduce Side- Joins Pattern Overview
We frequently need to join data together from two sources as part of a MapReduce job, such as
o Lookup tables
o Data from database tables

There are two fundamental approaches: Mapside joins and Reduce side joins
Map side joins are easier to write, but have potential scaling issues
But rst
Avoid writing joins in Java MapReduce if you can!
Abstractions such as Pig and Hive are much easier to use
o Save hours of programming

If you are dealing with text based data, there really is no reason not to use Pig or Hive
Page 175
Conclusion 2
Day
Questions?
Page 176
Pig deep dive - Theory 3
Day
Page 177
Hive and Pig: Why?
MapReduce code is typically written in Java
o Although it can be writ t en in other languages using Hadoop Streaming

Requires:
o A programmer
o Who is a good Java programmer
o Who understands how to think in terms of MapReduce
o Who understands the problem theyre trying to solve
o Who has enough time to write and test the code
o Who will be available to maintain and update the code in the future as requirements change

Page 178
Hive and Pig
Many organizations have only a few developers who can write good MapReduce code
Meanwhile, many other people want to analyze data
o Business analysts
o Data scientists
o Statisticians
o Data analysts
Whats needed is a higher level of abstractions on top of MapReduce
o Providing the ability to query the data without needing to know MapReduce
o Hive and Pig address these needs
Page 179
Hive and Pig
Introduction
Pig was originally created at Yahoo! to answer a similar need to Hive
o Many developers did not have the Java and/or MapReduce knowledge required to write standard
MapReduce programs
o But still needed to query data

Pig is a high level platform for creating MapReduce programs
o Language is called PigLatin
o Relatively simple syntax
o Under the covers, PigLatin scripts are turned into MapReduce jobs and executed on the cluster
Installation of Pig requires no modification to the cluster
The Pig interpreter runs on the client machine
o Turns PigLatin into standard Java MapReduce jobs, which are then submitted to the JobTracker
There is (currently) no shared metadata, so no need for a shared metastore of any kind
Page 180
Pig
Pig Philosophy
Pigs eat anything

o Pig can operate on data whether it has metadata or not, whether it is relational, nested,
unstructured
Pig lives anywhere
o Pig is intended to be a language for parallel data processing. It is not tied to any
framework
Pigs are domestic animals
o It is designed to be easily controlled and modified by its users
o Allows integration of users code wherever possible
o Supports user provided load and store functions
o Supports external executables via its stream command and MapReduce via JARs
Pigs Fly
o Processes data quickly.
Page 181
Pig Overview
Pig Latin provides a higher order abstraction for implementing MapReduce jobs. It constitutes a data flow language which is
made up of a series of operations and transformations that are applied to the input data
Key features of Pig Latin

o Ease of programming - Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as
data flow sequences, making them easy to write, understand, and maintain.
o Optimization opportunities - The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.
o Extensibility - Users can create their own functions to do special-purpose processing.
Pig Latin Statement

o Pig Latin statements are the basic constructs you use to process data using Pig.
o A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
o Pig Latin statements may include expressions and schemas.
How are Pig Latin Statements LOAD Load statements read data from the file system
Organized ?
Transformation statements process data read
Pig Latin statements are organized as a TRANSFORMATION from the file system
sequence of steps such that each step
represents a transformation applied to DUMP statement display results
some data DUMP / STORE STORE statement to save the results.
Page 182
Pig Vs. SQL
PigLatin is data flow language SQL is query language
o Allows users to describe how data from one or o Allows users to form queries
more inputs should be read, processed and then o Allows users to describe what questions they want
stored to one or more outputs in parallel answered, not how
o Pig latin script describes a directed acyclic graph o Focused around answering one question. Complex
(DAG), where edges are data flows and the requires temp tables, sub queries or multiple
nodes are operators that process the data procedures
o Allows users to describe how to process the o Schemas are static and constraints are enforced
input data
o Schemas are dynamic
Page 183
PigLatin Concepts
Concepts Features
o In Pig, a single element of data is an atom o Pig supports many features which allow developers to
o A collection of atoms such as a row, or a partial perform complex data analysis without having to write
row is a tuple Java MapReduce code
o Tuples are collected together into bags Joining datasets
o Typically, a PigLatin s c r ipt starts by loading one Grouping data
or more datasets into bags, and then creates
Referring to elements by position rather than name
new bags by modifying those it already has
o Useful for datasets with many
elements
Loading non/delimited data using a custom SerDe
Creation of user/dened functions written in Java
And more
Page 184
PigLatin
Sample Pig Script
tkrs = LOAD 'NYSE_Daily' AS(sym,open,high,low,close);

tops = FILTER tkrs BY high > 100;
srtd = ORDER tops BY close DESC;
STORE srtd INTO 'best_tickers';
Here, we load a directory of data into a bag called tkrs
Then we create a new bag called tops which contains just those records where the high
portion is greater than 100
Finally, we write the contents of the srtdbag to a new directory in HDFS

o By default, the data will be written in tab separated format
Alternatively, to write the contents of a bag to the screen, say

DUMP srtd;
Page 185
PigLatin
Data Types/Functions/Operations
Scalar Types Relational Operations Input/Output

INT foreach Load
LONG filter Store
FLOAT Group Dump
DOUBLE Order by
CHARARRAY Distinct
BYTEARRAY Join
Sample
Complex Types Parallel
MAP - Chararray to data UDF

element mapping
TUPLE Fixed length, ordered
collection
BAG unordered collection of
tuples
Page 186
PigLatin
Using the Grunt Shell to Run PigLatin
Starting Grunt
$ pig grunt>
Useful Commands
$ pig -help (or -h)

$ pig -version (-i)
$ pig -execute (-e)
$ pig script.pig
Page 187
PigLatin
Sample Pig Script
tkrs = LOAD 'NYSE_Daily' AS(sym,open,high,low,close);

tops = FILTER tkrs BY high > 100;
srtd = ORDER tops BY close DESC;
STORE srtd INTO 'best_tickers';
Here, we load a directory of data into a bag called tkrs
Then we create a new bag called tops which contains just those records where the high
portion is greater than 100
Finally, we write the contents of the srtdbag to a new directory in HDFS

o By default, the data will be written in tab separated format
Alternatively, to write the contents of a bag to the screen, say

DUMP srtd;
Page 188
PigLatin
More PigLatin
To view the structure of a bag:
DESCRIBE bagname;
Joining two datasets

data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd= JOIN data1 BY col3, data2 BY colA;
STORE jnd INTO 'outfile';
Expressions in for each

Beginning = foreach prices generate ..open;
Middle = foreach prices generate open..close;
End = foreach prices generate volume..;
gain = foreach prices generate close open;
Gain 2 = foreach prices generate $4-$1;
Page 189
PigLatin
FOR EACH
The FOREACH...GENERATEstatement iterates over members of a bag
justnames = FOREACH emps GENERATE name;
Can combine with COUNT:
summedUp = FOREACH grpd GENERATE group, COUNT(bag1) AS

elementCount;
Page 190
PigLatin
Grouping
Grouping
grpd = GROUP bag1 BY elementX
Create a new bag
o Each tuple in grpd has an element called group, and an element called bag1
o The group element has a unique value for elementX from bag1
o The bag1 element is itself a bag, containing all the tuples from bag1 with that value
for elementX
Page 191
Pig hands on 3
Day
Page 192
Hive
Overview
Pig Latin is a data flow programming language whereas SQL is a declarative programming
language. There are three ways of executing Pig programs, all of which work in both local and
MapReduce modes:
Pig can run a script file that contains Pig commands (pig script.pig).
Grunt is an interactive shell for running Pig commands.
Grunt is started when no file is specified for Pig to run and the -e (or -execute for direct command
execution) is not used.
It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.

Load and filter data using Grunt
Naming fields
Joining data
Grouping and aggregate functions
Splitting records
Page 193
Pig
Load and filter using Grunt
Step 1 : Start pig
$ Pig
Step 2: Load the symbols data into the symbols variable
grunt> symbols = LOAD 'NYSE/data/symbols.txt';
Step 3: Dump the data and the schema just loaded to the screen
grunt> DUMP symbols;

grunt> DESCRIBE symbols;
Step 4 Retrieve the records where the first field ($0) starts with M
grunt> symbols_m = FILTER symbols BY $0 matches 'M.*';
grunt> DUMP symbols_m;
grunt> DESCRIBE symbols_m;
Page 194
Pig
Naming Fields
Step 1 : Reload the data, this time specifying the field names and types;
The >> indicates a line break statements end with a semicolon (;)
grunt> symbols = LOAD 'NYSE/data/symbols.txt'

>> AS (ticker:chararray, name:chararray);
Step 2: Dump the data and the schema just loaded to the screen
grunt> DUMP symbols;
grunt> DESCRIBE symbols;
Step 3: Retrieve the records where the ticker field starts with M
The fields can be referred to by name or ordinality ($[i-1])
grunt> symbols_m = FILTER symbols BY ticker matches 'M.*';

grunt> DUMP symbols_m;
grunt> DESCRIBE symbols_m;
Page 195
Pig
Joining Data
Step 1 : Load the end of the data data.
grunt> eoddata = LOAD 'NYSE/data/EOD2013.txt' USING PigStorage(',') AS (ticker:chararray,

close_date:chararray, price_open:double, price_high:double, price_low:double,
price_close:double, volume:long);
Step 2: Examine the schema just loaded on to the screen

grunt> DESCRIBE eoddata;
Step 3: Join the record sets
grunt> sym_eod = JOIN symbols BY ticker, eoddata BY ticker;
Step 4: Examine the schema of join
grunt> DESCRIBE sym_eod;
Page 196
Pig
Grouping and Aggregate Functions
Step 1 : Group the symbols records by ticker symbol (dump to review)
grunt> symbols_grp = GROUP symbols BY ticker;

grunt> DUMP symbols_grp;
Step 2: Co-Group the symbols and EOD data (describe to review)

grunt> symeod_grp = COGROUP symbols BY ticker, eoddata BY ticker;
grunt> DESCRIBE symeod_grp;
Step 3: For each co-grouped record, find the maximum open price
grunt> eod_maxopen = FOREACH symeod_grp GENERATE group, MAX(eoddata.price_open), symbols.name;

grunt> DESCRIBE eod_maxopen;
Step 4: Sort the results (dump to view the final record set)
grunt> eod_maxopen_sorted = ORDER eod_maxopen BY $1 DESC;

grunt> DUMP eod_maxopen_sorted;
Page 197
Pig
Splitting Records
Step 1 : Split the data into good and bad relations
grunt> SPLIT eod_maxopen INTO eod_maxopen_good IF $1 >= 5.0, eod_maxopen_bad IF $1 < 5.0;
Step 2: Examine the Good schema and results

grunt> DESCRIBE eod_maxopen_good;
grunt> STORE eod_maxopen_good INTO 'output/pig/eod_maxopen_good';
Step 3: Examine the Bad schema and results
grunt> DESCRIBE eod_maxopen_bad;

grunt> STORE eod_maxopen_good INTO 'output/pig/eod_maxopen_good';
Page 198
Pig
Running Pig as Script
Step 1 : myscript.pig (stored in NYSE/pig)
symbols = LOAD 'NYSE/data/symbols.txt' AS (ticker:chararray, name:chararray);

eoddata = LOAD 'NYSE/data/EOD2013.txt' USING PigStorage(',') AS (ticker:chararray,
close_date:chararray, price_open:double, price_high:double, price_low:double,
price_close:double, volume:long);
symbols_ms = FILTER symbols BY ticker matches 'MS';
eoddata_ms = FILTER eoddata BY ticker matches 'MS';
symeod_ms = JOIN symbols_ms BY ticker, eoddata_ms BY ticker;
SPLIT symeod_ms INTO symeod_ms_good IF price_open <= price_close, symeod_ms_bad IF price_open >
price_close;
STORE symeod_ms_good INTO 'output/pig/symeod_ms_good';
STORE symeod_ms_bad INTO 'output/pig/symeod_ms_bad';
Step 2: Execute Script

% pig NYSE/pig/myscript.pig
Page 199
Pig
Running Pig as local Script
Step 1 : myscript.pig (stored in NYSE/pig)
symbols = LOAD '/home/cloudera/NYSE/data/symbols.txt' AS (ticker:chararray, name:chararray);

eoddata = LOAD '/home/cloudera/NYSE/data/EOD2013.txt' USING PigStorage(',') AS
(ticker:chararray, close_date:chararray, price_open:double, price_high:double,
price_low:double, price_close:double, volume:long);
symbols_ms = FILTER symbols BY ticker matches 'MS';
eoddata_ms = FILTER eoddata BY ticker matches 'MS';
symeod_ms = JOIN symbols_ms BY ticker, eoddata_ms BY ticker;
SPLIT symeod_ms INTO symeod_ms_good IF price_open <= price_close, symeod_ms_bad IF price_open >
price_close;
STORE symeod_ms_good INTO '/home/cloudera/NYSE/output/pig/symeod_ms_good';
STORE symeod_ms_bad INTO '/home/cloudera/NYSE/output/pig/symeod_ms_bad';
Step 2: Execute Script

% pig x local NYSE/pig/myscript-local.pig
Page 200
Pig Latin
Operators
Category Operator Category Operator

Loading and storing LOAD
Sorting ORDER
STORE
LIMIT
DUMP
Filtering FILTER Combining and UNION
splitting
DISTINCT SPLIT
FOREACHGENERATE
Diagnostic DESCRIBE
MAPREDUCE operators
STREAM EXPLAIN
SAMPLE ILLUSTRATE
Grouping and JOIN Macro and UDF REGISTER
joining statements
COGROUP
DEFINE
GROUP
CROSS IMPORT
Page 201
Pig Latin
Commands
Category Operator Category Operator

Hadoop Filesystem cat
cd Hadoop kill
MapReduce Utility
copyFromLocal
exec
copyToLocal
cp help
fs
quit
ls
mkdir run
mv
pwd set
rm
sh
rmf
Page 202
Pig Latin
Expressions
Category Expression Category Expression

Constant Literal Conditional x?y:z
Field (ordinality) $n Comparison x == y, x != y
Field (by name) f x > y, x < y
Field (disambiguate) r::f x >= y, x <= y
Projection c.$n, c.f x matches y
Map lookup m#k x is null
Cast (t) f x is not null
Arithmetic x + y, x y Boolean x or y, x and y
x * y, x / y not x
x%y Functional fn(f1,f2,...)
+x, -x Flatten FLATTEN(f)
Page 203
Pig Latin
Data Types
Category Type Description Literal example

Numeric int 32-bit signed integer 1
long 64-bit signed integer 1L
float 32-bit floating-point number 1.0F
double 64-bit floating-point number 1.0
Text chararray Character array in UTF-16 format 'a'
Binary bytearray Byte array Not supported
Complex tuple Sequence of fields of any type (1,'boy')
bag An unordered collection of tuples, possibly with {(1,'boy'),(2)}

duplicates
map A set of key-value pairs; keys must be character ['a'#'boy']
arrays, but values may be any type
Page 204
Pig Latin
Built-in Functions
Category Function Category Function

Eval AVG Eval TOMAP
CONCAT
TOP
COUNT
TOTUPLE
COUNT_STAR
DIFF Filter IsEmpty
MAX Load/Save PigStorage
MIN BinStorage
SIZE
TextLoader
SUM
JsonLoader, JsonStoarge
TOBAG
TOKENIZE HBaseStorage
Page 205
Hive deep dive - Theory 3
Day
Page 206
Hive Introduction
o Hive was originally developed at Facebook
Provides a very SQL Like language
Can be used by people who know SQL
Under the covers, generates MapReduce jobs that run on the Hadoop cluster
Enabling Hive requires almost no extra work by the system administrator
Page 207
Hive Architecture
Data Model
Tables have Typed columns (int , float,
HIVE string, date, boolean)
JDBC ODBC Partitions
Buckets (Hash partitions useful for
Command Line sampling, join optimization)
Metastore
Web Interface Thrift Server
Interface
Driver
(Compiler, Optimizer, Executor)
Metastore
Namespace containing set of tables
Holds partition definitions
Statistics
Runs on Derby, MySQL and many other
relational databases
Job Tracker Name Node

Physical Layout
Warehouse directory in HDFS
Data Node Table row data stored in subdirectories of
+
Task Tracker
warehouse
Partitions form subdirectories of table
Hadoop directories
Map Reduce + HDFS Actual data stored in flat files
Copyright 2010/2013 EY. All rights reserved. Not to be reproduced without prior writ t en consent.
03#60
Page 208
Hive Metastore
o Hives Metastore is a database containing table definitions and other metadata

By default, stored locally on the client machine in a Derby database
If multiple people will be using Hive, the system administrator should create a shared Metastore
Usually in MySQL or some other relational database server
Page 209
Hive Data Model and Data Types
Hive layers table definition on top of data in Type constructors:

HDFS ARRAY < primitive-type >
MAP < primitive-type, data-type >
o Tables
STRUCT < col-name : data-type, ... >
Typed columns (int, oat, string,
Primitive Type
boolean and so on)
TINYINT
Also array, struct, map (for JSON like
SMALLINT
data)
INT
o Partitions BIGINT
e.g., to range partition tables by date FLOAT
BOOLEAN
o Buckets DOUBLE
Hash partitions within ranges (useful for STRING
sampling, join optimization) BINARY(available starting in CDH4)
TIMESTAMP(available starting in CDH4)
Page 210
Hive Sample Commands
CREATE TABLE managed_table ( dummy STRING);

LOAD DATA INPATH user/dinesh/data.txt INTO table managed_table
DROP TABLE managed_table;
CREATE EXTERNAL TABLE external_table ( dummy STRING);

LOCATION user/dinesh/external_table;
LOAD DATA INPATH /user/dinesh/data.txt INTO TABLE external_table;
CREATE TABLE Logs (ts BIGINT, line STRING) PARTIONED BY (dt STRING, country
STRING)
LOAD DATA LOCAL INPATH input/live/partitions/file1 INTO TABLE logs
PARTITION (dt=2001-10-10, country=US);
SHOW PARTITIONS logs;

SELECT ts, dt, line FROM logs WHERE country=US
CREATE TABLE bucketed_users (id INT, name STRING)

CLUSTERED BY 9id) INTO 4 BUCKETS;
Page 211
Storage Format
Default
Default Delimited text with a row per line
OUTER JOIN / CROSS JOIN / VIEWS / GROUP BY / UDF All are possible with Hive
CREATE TABLE ...

ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
JOINS:
SELECT Sales.*, things.* FROM Sales JOIN

things ON (Sales.ID = things.ID)
EXPLAIN
Select . Entire query: Will provide details
about the execution plan for the query,
including MR job
Page 212
Hive Data
Physical Layout
o Hive tables are stored in Hives warehouse directory in HDFS

By default, /user/hive/warehouse
o Tables are stored in subdirectories of the warehouse directory

Partitions form subdirectories of tables
o Possible to create external tables if the data is already in HDFS and should not be moved from its current
location
o Actual data is stored in at les
Control character/delimited text, or SequenceFiles
Can be in arbitrary format with the use of a custom Serializer/ Deserializer (SerDe)
Limitations
o Not all standard SQL is supported

Subqueries are only supported in the FROM clause
No correlated subqueries
o No support for UPDATE or DELETE
o No support for INSERTing single rows
Page 213
Hive vs. Pig
Choosing between Pig and Hive
o If you want abstraction on top of MapReduce use either Hive or Pig
o Which one is chosen depends on the skillset of the target users

Those with an SQL background will naturally gravitate towards Hive
Those who do not know SQL will often choose Pig
o use both
Pig deals better with less structured data, so Pig is used to manipulate the data into a more
structured form, then Hive is used to query that structured data

Pig is better suited for data pipelines.
Where schema is unknown, incomplete, inconsistent , need to manage nested data, work on data
before cleaning etc (Researchers prefer this
Page 214
Lunch Break 3
Day
Page 215
Hive hands on 3
Day
Page 216
Hive
Overview
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. Internally, a compiler translates HiveQL statements into
a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
Originally developed by Facebook.
Supports analysis of large datasets stored in HDFS and compatible file systems such as Amazon S3 filesystem.
Provides an SQL-like language called HiveQL.
HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL (mulitable inserts, create table as select)
To accelerate queries, Hive provides indexes, including bitmap indexes.

Review Hive file types
Loading data from a file into Hive
Loading data from a Hive table into

another Hive table
Another data loading example
Page 217
Hive
Hive File Types
Four file types supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, AND RCFILE
Import text files compressed with Gzip or Bzip2 directly into a TEXTFILE table
Compression automatically detected and will be decompressed on-the-fly
CREATE TABLE raw (line STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
Default File Type

TEXTFILE is default storageHadoop cannot split this type of file into blocks
Recommended practice is to insert data into another SEQUENCEFILE table.
CREATE TABLE raw (line STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
Page 218
Hive
ORC File Type
Optimized Row Columnar (ORC) file format

provides a highly efficient way to store
Hive data.
o An ORC file contains groups of row data called
stripes, along with auxiliary information in a file
footer. At the end of the file a postscript holds
compression parameters and the size of the
compressed footer.
o The default stripe size is 250 MB. Large stripe
sizes enable large, efficient reads from HDFS.
o The file footer contains a list of stripes in the
file, the number of rows per stripe, and each
column's data type. It also contains column-
level aggregates count, min, max, and sum.
o This diagram illustrates the ORC file structure:
Page 219
Hive
Loading data from File to Hive
Step 1 : Copy the working data files for use with Hive
$ hadoop fs -mkdir NYSE/data/hive

$ hadoop fs cp NYSE/data/*.txt NYSE/data/hive
Step 2: Start Hive

$ hive
Step 3: Create the temporary table to load the Hive data into.
hive> CREATE TABLE symbols_tmp (symbol STRING, description STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' STORED AS TEXTFILE;
Step 4: Load the data from the source file into the temporary table.
hive> LOAD DATA INPATH 'NYSE/data/hive/symbols.txt' INTO TABLE symbols_tmp;
Data can also be loaded from the OS file system using the LOCAL option
hive> LOAD DATA LOCAL INPATH '/home/cloudera/NYSE/data/symbols.txt' INTO TABLE symbols_tmp;
Page 220
Hive
Loading Data from Hive table into another Hive Table
Step 1 : Create the destination table
hive> CREATE TABLE symbols (symbol STRING, description STRING, exchange STRING) STORED AS
SEQUENCEFILE;
Step 2: Set the Hive Compression parameters

hive> SET hive.exec.compress.output=true;
hive> SET io.seqfile.compression.type=BLOCK;
Step 3: Insert the data into destination table.
hive> INSERT OVERWRITE TABLE symbols SELECT symbol, description, 'NYSE' FROM symbols_tmp;
Step 4: Drop the temporary table we previously created.
hive> DROP TABLE symbols_tmp;
Page 221
Hive
Another data loading example
Step s
hive> CREATE TABLE eod_data_tmp (ticker STRING, close_date STRING, price_open DOUBLE, price_high
DOUBLE, price_low DOUBLE, price_close DOUBLE, volume BIGINT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA INPATH 'NYSE/data/hive/EOD2013.txt' INTO TABLE eod_data_tmp;
hive> CREATE TABLE eod_data (ticker STRING, close_date timestamp, price_open DOUBLE, price_high
DOUBLE, price_low DOUBLE, price_close DOUBLE, volume BIGINT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET io.seqfile.compression.type=BLOCK;
hive> INSERT OVERWRITE TABLE eod_data SELECT ticker,
from_unixtime(unix_timestamp(CONCAT(close_date,'163000000'), 'yyyyMMddHHmmssSSS')), price_open,
price_high, price_low, price_close, volume FROM eod_data_tmp;
hive> DROP TABLE eod_data_tmp;
Page 222
Hive
Running Select Queries
Step 1 : Get end-of-day data from 10/2013 for MS
hive> SELECT * FROM eod_data WHERE ticker = 'MS' AND year(close_date) = 2013 AND
month(close_date) = 10;
Step 2: Same as above using date ranges and unix timestamp

hive> SELECT * FROM eod_data WHERE ticker = 'MS' AND close_date >= unix_timestamp('2013-10-01
00:00:00') AND close_date < unix_timestamp('2013-11-01 00:00:00');
Step 3: Get the average monthly close prices to MS .
hive> SELECT MONTH(close_date), AVG(price_close) FROM eod_data WHERE ticker = 'MS' GROUP BY
month(close_date);
Page 223
Hive
Aggregate Function and Joins
Step 1 : Find the difference in the highest and smallest close prices for all stocks in the NYSE in 2013
hive> SELECT s.symbol, s.description, MAX(d.price_close) - MIN(d.price_close) AS spread FROM

eod_data d JOIN symbols s ON d.ticker = s.symbol WHERE s.exchange = 'NYSE' AND
year(d.close_date) = 2013 GROUP BY s.symbol, s.description ORDER BY spread DESC;
Output: Stage 1
Page 224
Hive
Output: Stage 2 and Stage 3
Page 225
Hive
Output: Stage 2 and Stage 3 ( Continued )
Page 226
Hive
Saving Query Results
Option 1 : Saving results into HDFS
hive> INSERT OVERWRITE DIRECTORY '/user/cloudera/NYSE/hive/udaf' SELECT s.symbol,

s.description, MAX(d.price_close) - MIN(d.price_close) AS spread FROM eod_data d JOIN symbols s
ON d.ticker = s.symbol WHERE s.exchange = 'NYSE' AND year(d.close_date) = 2013 GROUP BY
s.symbol, s.description ORDER BY spread DESC;
Option 2: Saving results to local file system
hive> INSERT OVERWRITE LOCAL DIRECTORY '~/NYSE/hive/udaf' SELECT s.symbol, s.description,

MAX(d.price_close) - MIN(d.price_close) AS spread FROM eod_data d JOIN symbols s ON d.ticker =
s.symbol WHERE s.exchange = 'NYSE' AND year(d.close_date) = 2013 GROUP BY s.symbol,
s.description ORDER BY spread DESC;
Page 227
Hive
Examining Database Results
Database : Show database
hive> show databases;
Tables: Show tables
hive> show tables;
Columns: Show columns
hive> show columns in eod_data;

hive> show columns in symbols;
Step 2: Exit Hive
hive> exit;
Page 228
Administration 3
Day
Page 229
Hadoop Administration & Troubleshooting
Managing Hadoop Processes
Starting and Stopping Processes with Init Scripts
Starting and Stopping Processes Manually
MapReduce Maintenance Tasks
Adding a Tasktracker
Decommissioning a Tasktracker
Killing a MapReduce Job
Dealing with a Blacklisted Tasktracker

HDFS Maintenance Tasks
Adding a Datanode
Decommissioning a Datanode
Checking Filesystem Integrity with fsck
Balancing HDFS Block Data
Dealing with a Failed Disk
Copyright 2010/2013 EY. All rights reserved. Not to be reproduced without prior writ t en consent.
Page 230 14#5
Conclusion 3
Day
Questions?
Page 231
Notes
Page 232
Notes
Page 233
Notes
Page 234
Notes
Page 235
Notes
Page 236
Appendix
o Oozie
Why Oozie
Oozie use cases
o Many problems cannot be solved with a single MapReduce job Sta

rt
o Instead, a workow of jobs must be created Dat
a
o Simple workow:
Run Job A Job A
Use output of Job A as input to Job B

Use output of Job B as input to Job C
Output of Job C is the nal required output
Job B
o Easy if the workow is linear like this
Can be created as standard Driver code
Job C
o If the workow is more complex, Driver code becomes much
more dicult to maintain
o Example: running multiple jobs in parallel, using the output

from all of those jobs as the input to the next job
o Example: including Hive or Pig jobs as part of the workow
Page 238
What is Oozie?
How it works?
Oozie is a workow engine
Runs on a server
Typically outside the cluster
Runs workows of Hadoop jobs

Including Pig, Hive, Sqoop jobs
Submits those jobs to the cluster based on a workow definition
Workow definitions are submitted via HTTP
Jobs can be run at specic times

One/o or recurring jobs
Jobs can be run when data is present in a directory
Page 239
Oozie Workflow Basics?
Workflow Overview
Oozie workows are written in XML
Map Reduce
Workow is a collection of actions PIG
Start End
Hive
MapReduce jobs, Pig jobs, Hive jobs etc. Job
Start Success
A workow consists of control ow nodes and action nodes
Failure
Control ow nodes dene the beginning and end of a
workow
They provide methods to determine the workow
execution path Error
Example: Run multiple jobs simultaneously
Action nodes trigger the execution of a processing task, such

as
A MapReduce job
A Pig job
A Sqoop data import job
Page 240
Oozie Workflow Sample?
Workflow XML Anatomy
1 o A workflow is wrapped into workflow entity
<workflow-app name='wordcount-wf'
1xmlns="uri:oozie:workflow:0.1">
2 <start to='wordcount'/> o The startnode is the control node which tells
<action name='wordcount'> 2 Oozie which workow node should be run rst.
3 <map-reduce> There must be one startnode in an Oozie
<job-tracker>${jobTracker}</job-tracker> workow. In our example, we are telling Oozie
<name-node>${nameNode}</name-node>
<configuration> to start by transitioning to the wordcount
<property> workow node.
<name>mapred.mapper.class</name>
<value>org.myorg.WordCount.Map</value>
</property>
<property> o Action node defines the type of job. It is
3
<name>mapred.reducer.class</name> mapReduce in this. Within the action we define
<value>org.myorg.WordCount.Reduce</value> the job properties
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value> 4 o We specify what to do if the action ends
</property> successfully, and what to do if it fails. In this
<property> example, if the job is successful we go to the end
<name>mapred.output.dir</name>
<value>${outputDir}</value>
node. If it fails we go to the killnode.
</property>
</configuration>
</map-reduce> o If the workow reaches a killnode, it will kill all
4 <ok to='end'/> 5 running actions and then terminate with an error.
<error to='kill'/>
</action> A workow can have zero or more killnodes
<kill name='kill'>
<message>Something went wrong:
5 ${wf:errorCode('wordcount')}</message>
6 o Every workow must have an endnode. This
</kill/>
indicates that the workow has completed
6 <end name='end'/>
</workflow-app> successfully.
Page 241
Oozie Other control nodes
Control noes overview
o A decisioncontrol node allows Oozie to determine the workow execution path based on some criteria
Similar to a switch/case statement
o forkand joincontrol nodes split one execution path into multiple execution paths which run concurrently
forksplits the execution path
joinwaits for all concurrent execution paths to complete before proceeding
forkand joinare used in pairs
o Oozie can also be called from within a Java program

Via the Oozie client API
o To submit an Oozie workow using the command line tool:
$ oozie job -oozie http://<oozie_server>/oozie

-config config_file -run
Page 242
Oozie Action Nodes
Action Nodes Overview
Node Name Description
map-reduce Runs either a Java MapReduce or Streaming job
fs Create directories, move or delete les or directories
java Runs the main()method in the specied Java class as a single/ Map,
Map/only job on the cluster
pig Runs a Pig job
hive Runs a Hive job
sqoop Runs a Sqoop job
email Sends an e/mail message
Page 243
Bibliography
Hadoop. (2013, 03 06)[1]. Welcome to Apache Hadoop. Retrieved from Apache Hadoop: http://hadoop.apache.org/
Borthakur, D. (2013, 02 13)[2]. HDFS Architecture Guide. Retrieved from Apache Hadoop:
http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
Borthakur, D. (2007)[3]. The Hadoop Distributed File System:Architecture and Design. Retrieved from Apache Hadoop:
http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf
HBase. (2013, 03 14)[4]. Aapache HBase. Retrieved from Apache HBase: http://hbase.apache.org/
Hive. (2013, 02 07)[5]. Welcome to Hive. Retrieved from Hadoop: http://hive.apache.org/
Mahout. (2011)[6]. What is Apache Mahout? Retrieved from mahout: http://mahout.apache.org/
MapReduce. (2013, 03 12)[7]. MapReduce. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/MapReduce
Pig. (2013, 02 21)[8]. Welcome to Apache Pig. Retrieved from Apache Hadoop: http://pig.apache.org/
Sqoop. (2013, 03 08)[9]. Sqoop. Retrieved from Apache Sqoop: http://sqoop.apache.org/
Zookeeper. (2010)[10]. Apache ZooKeeper. Retrieved from Apache ZooKeeper: http://zookeeper.apache.org/
oozie. (2013, 01 23)[11]. Apache Oozie Workflow Scheduler for Hadoop. Retrieved from oozie: http://oozie.apache.org/
Avro. (2013, 02 26)[12]. Welcome to Apache Avro. Retrieved from Avro: http://avro.apache.org/
Informatica Corporation.(2012)[13]. Lean Integration. Retrieved from Informatica: http://www.informatica.com/us/vision/best-

practices/lean-integration/
MarkLogic Corporation. (2013)[14]. Featured Clients. Retrieved from MarkLogic: http://www.marklogic.com/
MarkLogic Corporation. (2011)[15]. What's New in MarkLogic 5.

Page 244

Big Data Training

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Big Data Training

Uploaded by

Copyright:

Available Formats

BIG DATA TRAINING

Concepts, Use Cases, Architecture, Hands-on

Understand how to build solutions for Big Data problems

Day 1 Duration Day 3 Duration

Breakfast 8:30- 9:00 AM Breakfast 8:30- 9.00 AM

PIG hands on 10.00 - 10.30 AM

VELOCITY How Fast is Fast?

Spending 80% of your life with 20% of the data!!!

How Fast is Fast?

Velocity of Data Velocity of Data Flow Restful

o System Outage Prediction

o Anti - Money Laundering | Fraud Analysis

o Correlation | Link Analysis

Consolidate ALL Enterprise

High Velocity of Data Generation Open Data Format / Fixable Schema

Deep Predictive Analysis Unstructured / Semi Structured Data

Store All Your Data Never Delete No Specialized Hardware Required

Low Cost Per Compute

IVR, Image, Video Complete Support for

Yahoo! Use Case

Yahoo! Search architecture consists of four primary components

So what is the complexity?

Hadoop File System (HDFS) MapReduce

Coexisting Data Profiles DW IT Investment is Not Lost

High Value Data

Domain Metric Problem Question Solution

5 Resource Mgmt. & Orchestration:

Why is the HDFS data block so large?

DataNode DataNode DataNode

Blocks 1 & 2 have high replication factor

o Namespace is a name that uniquely identifies an HDFS instance e.g. hdfs://ernst&young/stas/ali/filename

Maintains the filesystem tree

5 - Read o Data is streamed from DataNode A to the

Rack 1 Rack 2 Rack 3 Rack 4

Data Center 1 Data Center 2

5 - Write Packet 8 - Acknowledgement Packet

DataNode A DataNode B DataNode C

o User specifies the problem in terms of Map and Reduce tasks

Map: Input data <key, value>

MapReduce Job Management

Task Tracker HDFS

MapReduce Task Distribution

Understand the distribution of data in the source before split segmentation

Sort Algorithm in MapReduce

Count every occurrence Brute force!

Steps in MapReduce Sort

Why call it Pig ? There is a reason that we will get to later

Why use Pig ?

Pig has two major components:

Local Mode Hadoop Mode

Pig Latin Pig Latin

Single JVM Runtime Hadoop Cluster

o Single JVM o Runs on Hadoop cluster

ETL (or) Data Factory Data Warehouse

Data Factory Use Cases:

Key features of Pig Latin

Pig Latin Statement

Bag Pig Latin Statement Bag

A bag is a relation, similar to A statement is an operation The output is another bag

(5,Big Data,-2) A tuple is an ordered set of fields. Its represented by

A bag is represented by collection of tuples separated by

tup = (puppy, { (wagging,1) (chewing,2)}, [age 4])

Let the fields of the Tuple tub be called t1, t2, t3

Expression Type Example Value for tup