This training covers current market trends, key Big Data use cases in financial services, Big Data concepts, and the Hadoop technology architecture used to address Big Data issues.
Upon completion of this training, participants should be able to:
o Understand what Big Data means and how Big Data issues are manifesting themselves in financial services.
o Understand the Big Data reference architecture.
o Get an in-depth understanding of the Hadoop architecture platform for Big Data.
Agenda (times as given in the original schedule):
o Big Data Introduction and Concepts: 9:00 - 11:00 AM
o Industry use cases end to end: 10:00 - 11:30 AM
o Break: 11:30 - 11:45 AM
o Hadoop Installation & HDFS Hands-on: 11:45 AM - 1:00 PM
o Lunch Break: 1:00 - 2:00 PM
o MapReduce Deep Dive: 2:00 - 3:00 PM
o MapReduce Hands-on Development & Deployment: 3:00 - 4:30 PM
o Pig Deep Dive (Theory): 9:00 - 10:00 AM
Page 2
Big Data Introduction & Concepts - Day 1
Page 3
What is Big Data?
Big Data constitutes datasets that become so large as to render themselves unmanageable using traditional data
platforms, e.g. RDBMS, flat file systems, OO databases, etc. This unmanageability stems from the complexity of the
capture, storage, search, retrieval, sharing and analytics of these datasets due to their sheer size.
The three V's of Big Data:
o VOLUME: click stream, active/passive sensor, log and event data, printed corpus, speech and social media, spanning traditional and Big Data sources
o VARIETY: structured, semi-structured and unstructured data
o VELOCITY: velocity of generation and velocity of analysis
Page 4
Who Uses BIG Data?
Page 5
Market Trends in Data... What is the Context?
Neil Armstrong lands on the moon with 32KB of data (1969)
Google processes 24PB of data everyday (2010) ~ 240K 100GB hard drives
So What is a PetaByte?
o 1 PB = 1 000 000 000 000 000 bytes = 1 million gigabytes = 1 thousand terabytes
o Large Hadron Collider produces ~ 15 PB per year
o The movie Avatar took ~ 1 PB of storage for rendering 3D CGI graphics
Twitter ~ 7TB; Facebook ~ 10TB
Bank of America? Who knows. Why don't we know? Because they don't have the ability to store and process all the
unstructured / semi-structured data their eco-system generates. But can they use it?
o WHY do we need Velocity of Analytics? To find arbitrage opportunities in capital markets before asset prices balance
Page 6
What is Big Data Used For?
o Search
Yahoo, Google, Amazon, Zevents
o Log Processing
Yahoo, Facebook, Google, Twitter, LinkedIn
o Recommendation Systems
Yahoo, Facebook, Google, Twitter, LinkedIn
o Data Warehousing
Facebook, AOL
o Sentiment Analysis
Analyze TB volumes of emails and transaction logs against an a priori established sentiment ML model to gauge which
customers are likely to leave. Offer an incentive program to reduce customer attrition.
o Security Profiling
Collect event data from the enterprise event cloud and run it against an a priori established ML model to detect security breaches
and unforeseen correlations between enterprise events.
o Multimedia Analysis
New York Times, Veoh
Page 7
Confluence of Influences
80% of new data generated is either completely unstructured or semi-structured, and much of this data is being
generated at very high velocity. This presents tremendous challenges for storage, search, retrieval and analysis;
traditional data platforms and analysis capabilities are unable to meet these evolving demands!
What Big Data platforms must deliver: extremely high throughput, massively parallel and fault-tolerant processing, on ALL data.
Page 8
Digression But Illustrative of Big Data Problems!!!
Building extremely large scale search engines or equivalently processing 1000s of terabytes of data in financial
services requires a large number of machines and a complex (aka costly) and scalable processing engine.
So what kind of platform and framework would be needed to enable this kind of computation?
1) A large number of machines/nodes running in parallel.
2) Logically dividing the analysis in smaller chunks of work that can be processed in parallel.
3) Recombining the smaller pieces of work into a cohesive whole at the end.
4) A fault tolerant platform that can survive node failures.
5) A latency sensitive platform that can work on local data instead of requiring remote data.
Page 9
Who is Hadoop?
Hadoop is a software framework that supports very large scale, data-intensive processing under an open source license
(Hadoop, 2013)[1]. It is composed of two primary components: (1) the Hadoop Distributed File System (HDFS) and (2) MapReduce.
Page 10
Why Hadoop?
Which Data Problems are Hadoopable?
Analysis of structured, semi-structured and unstructured datasets from a variety of data sources.
When the entire population dataset requires analysis, instead of merely sampling a subset and extrapolating results.
Ideal for iterative and exploratory analysis, when business measures on the data are not predetermined.
Big Data platform (Hadoop):
o Analyze all data (structured, unstructured, semi-structured)
o Inherent data discovery and data value analysis
o Analytics-at-rest & analytics on-the-wire
o Multiple disparate data sources
o Store all data (retain the fidelity of transactions, logs, posts etc.)
o Store data in native object format
o Flexible or no data transport encoding
o Low cost-per-compute
o Minimal performance concerns due to massive parallelism
Traditional platform:
o Cleansed, enriched, matched data
o Structured data analysis
o Analytics-at-rest
o Produces insight with known and stable measurements
o Defined based on a pre-determined corpus of questions
o Inflexibility in structure due to rigid data structure design
o Rigorous data quality controls
o Performance envelope constrained due to functional limits
o High cost-per-compute
o High value-per-byte
o Data retained based on perceived business value
Page 11
A Little Bit of History
2002 2003
Apache Nutch
o Open source web search engine.
o Building one which can index 1-billion web pages is ambitious and costly at the very least.
o Doug Cutting wrote a Nutch crawler but the architecture would not scale to a billion pages.
Google
o Meanwhile, Google's site index is growing exponentially on the back of Sergey Brin and Larry Page's PageRank algorithm.
o Query semantics and composition are becoming increasingly complex.
o They are facing similar scale issues as Nutch; their Oracle RAC is just not scaling.
o At this time Google is looking for a technological miracle, else a possible demise.
2004
o In December 2004 Google publishes its seminal paper on distributed computing in the shape of the Google File System (GFS).
o GFS would solve Google's storage needs, free up time spent on node management, and enable huge indexing and crawling jobs.
o GFS now runs all of Google search, including all the utility functions.
2005
o Doug Cutting picks up the GFS paper and implements an open-source version called NDFS.
o The Nutch team realizes that NDFS is applicable to a wide array of computing issues beyond merely search.
2006
o NDFS is moved out as a top-level Apache project and renamed Hadoop (after the toy elephant of Doug's kid).
o Yahoo! hires Doug Cutting and adopts Hadoop as its main computing platform.
o Yahoo! implements the 10GB/node sort benchmark on 188 nodes in 47.9 hours.
2008
o Yahoo! wins the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
o Yahoo! loads 10 terabytes of data per day onto a cluster of 1000 Hadoop nodes.
2009 - Yahoo! has 17 Hadoop clusters with 24,000 nodes, and wins the minute sort by sorting 500GB in 59 seconds on 1400 nodes.
Page 12
Where Can Our Clients Use Big Data?
In the new data paradigm, Big Data constitutes the fundamental enabler of value-add predictive analytics. Big Data
enables analytics-at-rest and analytics-on-the-wire.
Page 14
Big Data Reference Architecture
1. Connectors: different methods to connect external source/target systems to the Hadoop platform (ETL, DBMS, middleware, BI, analytics, visualization).
2. Analytics: data mining algorithms for performing clustering, regression testing and statistical modeling, implemented using the MapReduce model (text analytics, machine learning, object correlation, path & pattern analysis).
3. Security: Kerberos authentication, role-based authorization, audit, encryption.
4. Data Access | Pipelining | Serialization:
HBase: a non-relational database that allows for low-latency, quick lookups and adds transactional capabilities to Hadoop.
Hive: allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce.
Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Sqoop: a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.
Avro: a data serialization system that allows for encoding the schema of Hadoop files.
5. Resource Management & Orchestration: YARN, ZooKeeper, Oozie, Flume.
6. Near Real-Time Access: in-memory database | cache, object immutability, graphing.
7. MapReduce: data processing.
8. HDFS | Cassandra File System (CFS) | GPFS: storage.
Page 16
Lunch Break 1
Day
Page 17
Introduction to HDFS & MapReduce 1
Day
Page 18
HDFS Architecture Overview
Hadoop Distributed File System Architecture (Borthakur, 2013)[2]
Diagram: an HDFS cluster. The NameNode (master) holds the block map and file metadata and serves metadata and block operations; DataNodes (workers), arranged in racks, store the numbered blocks and replicate them between racks; clients read from and write to the DataNodes directly.
Page 19
HDFS Architecture What is an HDFS Block?
A Block is the most basic level of persistent storage, more basic than a file; in fact, a file is composed of one or more blocks.
Alternatively, you can think of a Block as the minimum amount of data that can be read or written to a storage platform.
All filesystems (e.g. NFS, NTFS, FAT, Apple HFS) are designed using the block paradigm
Most filesystem blocks are usually a few KB
HDFS is also organized using blocks
Files in HDFS are broken down into equal sized data blocks
An HDFS block is >= 64MB. That is a very large data block.
Block based storage architecture allows HDFS to enable the following (Borthakur, 2007)[3]:
o Blocks from same file can be stored on different servers allowing cluster based fault tolerance
o Blocks can be replicated to enable high availability
o Popular files can be set to a high replication factor (meaning replicate blocks to more servers) to enable load balancing and high
throughput
o Since block locations change during the life of an HDFS cluster; they are not persistently held by the NameNode
o In-memory caching of block locations imposes a limitation on the number of files that can be stored in an HDFS cluster
Clients access the NameNode when they want to read or write a file; the NameNode proxies client requests to the appropriate DataNode(s).
Diagram: the NameNode (master) maintains a metadata table mapping each file to the servers and blocks that hold it, e.g. File1 stored as blocks 1-2 on DataNode A and blocks 3-4 on DataNode B, and File2 as blocks 1-2 on DataNode A.
Without the NameNode an HDFS cluster cannot function: lose the NameNode and the whole filesystem is unavailable.
Even though Hadoop provides fault tolerance; how do you make Hadoop itself fault tolerant?
o Make NameNode resilient by writing persistent state of HDFS cluster to a remote filesystem e.g. NFS
o Run a secondary NameNode
Page 21
HDFS Architecture Anatomy of an HDFS Read
Diagram: anatomy of an HDFS read. From the client JVM/node, the client opens the file via DistributedFileSystem; (2) a remote procedure call to the NameNode (master) returns the block locations (e.g. blocks 1-2 on DataNode A, blocks 3-4 on DataNode B), sorted by proximity to the client; (4) the client reads each block through FSDataInputStream directly from the DataNode with primary proximity, falling back to secondary proximity; (7) the client closes the stream.
How does this design scale?
o Clients retrieve data directly from the DataNode(s)
o The NameNode merely serves up location information
o HDFS can therefore scale to multiple concurrent clients, since data traffic is spread across multiple DataNode(s)
And where is the fault tolerance of Hadoop?
o If an error occurs while reading from a DataNode, HDFS will automatically proxy the request to the closest DataNode with the block.
o HDFS will remember the DataNode(s) that have failed and will automatically remove them from future client requests.
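To make the read path concrete, here is a minimal sketch against Hadoop's Java FileSystem API; the file path is an illustrative assumption, and the configuration is read from the cluster's config files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // DistributedFileSystem for an hdfs:// default FS
        // open() performs the NameNode RPC for block locations; the returned
        // FSDataInputStream then streams the blocks directly from DataNodes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/parent/child"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}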
Page 22
HDFS Architecture The Concept of Node Proximity
HDFS represents the network as a Tree and the distance between two nodes is the sum of their distances to their common ancestor.
Diagram: example distances, d=0 for the same node, d=2 within the same rack (e.g. Node 1 to Node 2), d=4 across racks in the same data center (e.g. Node 1 to Node 3), d=6 across data centers (e.g. Node 1 to Node 4).
How does HDFS choose which DataNode to store block replicas on? The contention is between balancing (a) reliability, (b) write bandwidth and (c) read bandwidth. Bandwidth degrades between DataNode(s) in this order: same node, then different node in the same rack, then different rack in the same data center, then different data center.
o The first replica goes on the same node as the client.
o The second replica goes off-rack in the same data center, chosen randomly.
o The third replica goes on the same rack as the second, but on a different node.
o Further replicas are placed on random nodes within the cluster.
o This strategy provides:
Reliability: blocks stored on two separate racks
Write bandwidth: writes only traverse a single network switch
Read bandwidth: choice of reading from two racks or more
Block distribution: the client writes a single block on the local rack
Page 23
HDFS Architecture Anatomy of an HDFS Write
Replication Factor = 3
Diagram: anatomy of an HDFS write with replication factor = 3. From the client JVM/node, (2) a remote procedure call via DistributedFileSystem asks the NameNode (master) to create a new file in the filesystem namespace, e.g. hdfs://eny/stas/ali/myfile.doc; (3) the client writes through FSDataOutputStream, whose DataStreamer consumes the data queue and pipelines packets to the replica DataNodes while acknowledgement packets drain back through the ack queue; (9) the client closes the stream.
So what happens when all of this goes to hell, i.e. when a DataNode dies?
o HDFS closes the pipeline
o Packets in the Ack. Queue are added to the front of Data Queue (Why?)
o Current blocks on the good DataNodes are assigned new identity and communicated to NameNode
o Partial blocks on failed DataNodes are deleted when the node comes back up
o Failed DataNode is removed from the pipeline
o NameNode notices the under-replication and assigns a new healthy DataNode for replication
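A minimal sketch of the corresponding write path via the FileSystem API; the output path is an illustrative assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() issues the NameNode RPC that adds the file to the namespace;
        // bytes written to the stream are packetized and pipelined through the
        // replica DataNodes by the DataStreamer described above.
        try (FSDataOutputStream out = fs.create(new Path("/user/cloudera/myfile.txt"))) {
            out.writeUTF("hello HDFS");
        }
    }
}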
Page 24
MapReduce - Day 1
Page 25
What is MapReduce?
MapReduce is a programming model pioneered by Google and Yahoo! to process extremely large datasets (hundreds of gigabytes and beyond)
MapReduce parallelizes the data processing problem into smaller chunks of work (MapReduce, 2013)[7]
Diagram: MapReduce parallelism. The dataset is divided into splits; each split is fed to a Map task, which emits (key, value) pairs; the pairs are routed to Reduce tasks, which combine them into the final result.
Page 26
MapReduce Architecture How Does the Data Flow?
Diagram: MapReduce data flow. The JobTracker (master, alongside the NameNode) hands work to TaskTrackers running on the DataNodes; each Map task (e.g. on DataNode 1) reads a split of the source dataset and sorts its output into partitions; the partitions are then fetched and merged by Reduce tasks on other nodes (e.g. DataNode 5).
Page 27
Making MapReduce Real Example from Capital Markets
New York Stock Exchange executes 1.1 billion trades per year. How do you find valuable information in such a large dataset?
Problem Statement I Find the highest traded stock price for each company registered on NYSE within a given year
Problem Statement II Find the spread between trade prices that are within 1, 2 and 3 for each listing on NYSE
MapReduce tasks are defined in terms of (key, value) pairs (Stock Ticker, Trade Price)
Diagram: MapReduce parallelism on the NYSE data. The original dataset (1.1 billion tuples of ticker and price, e.g. AAPL 544.21, AAPL 543.90, AAPL 521.36, MSFT 32.20, MSFT 31.87, XRX 8.25) is divided into splits by ticker range (A-H, I-P, Q-Z), each processed by a Map task. The shuffle groups values by key, e.g. AAPL -> (544.21, 543.90, 521.36) and MSFT -> (32.20, 31.87), and the Reduce task emits the highest price per ticker: AAPL 544.21, MSFT 32.20.
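A minimal sketch of what a Problem Statement I job could look like in Java, assuming tab-separated "TICKER<TAB>PRICE" input lines; the class names and field positions are illustrative, not the deck's actual code:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTradePrice {
    public static class MaxMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            // Emit (ticker, price) for every trade record.
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    public static class MaxReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                max = Math.max(max, v.get()); // keep the highest price seen for this ticker
            }
            context.write(key, new DoubleWritable(max));
        }
    }
}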
Page 28
Sorting in MapReduce Adding Velocity to Data Processing
The ability to sort data is at the heart of MapReduce. It helps to organize data and improves the data processing speed in MR.
How do we create balanced splits that feed into Mappers? - (Consider the NYSE example)
Even splits are important so that no one Mapper can dominate the overall job time.
The overall job is only as fast as the slowest Mapper or Reducer
Diagram: two approaches to even split segmentation for the NYSE example. Approach I: count the tuples per key across the whole dataset (e.g. count(AAPL) = 230K, count(XRX) = 400K) and segment splits from the full counts; this is cost-prohibitive. Approach II: sample a subset of tuples with Hadoop's InputSampler to estimate the population key distribution, then derive even split segmentation from the estimate.
Page 29
Elaborating MapReduce Sort: Why is Sorting Important?
o Improves data processing speed by producing more even splits to be distributed across multiple Mappers
o Facilitates joining multiple datasets together for improved analysis capabilities
o Produces a globally sorted output that is consumable by downstream Mappers and Reducers
Diagram: each Map task emits its output sorted (A-Z); the Reduce phase merges the sorted map outputs into a globally sorted population dataset.
Page 30
Pig & HBase - Day 1
Page 31
Introducing Pig
What is Pig ? (Pig, 2013)[8]
o High-level data processing language
o Hadoop extension that simplifies Hadoop programming
o 40% of all Yahoo! jobs are run using Pig; Twitter is another well-known user of Pig
Execution Modes: Pig has two execution types or modes: local mode and Hadoop mode
Page 32
Where Does a Pig Fit ?
Data processing usually involves three higher level tasks (Data Collection, Data Preparation & Data Presentation)
Data processing flows from Data Collection to Data Preparation to Data Presentation.
o Pipeline: bring in a data feed, then clean and transform it. An example is logs from Yahoo! web servers, which undergo a cleaning step where bots, company-internal views and clicks are removed.
o Iterative Processing: typical processing on a large data set involves bringing in small new pieces of data that change the state of the large data set.
o Research: quickly write a script to test a theory or gain deeper insight by combing through petabytes of data.
Page 33
Pig Latin The Language
Pig Latin provides a higher order abstraction for implementing MapReduce jobs. It constitutes a data flow language which is
made up of a series of operations and transformations that are applied to the input data
How are Pig Latin statements organized? Pig Latin statements are organized as a sequence of steps, such that each step represents a transformation applied to some data:
o LOAD: load statements read data from the file system.
o TRANSFORMATION: transformation statements process the data read from the file system.
o DUMP / STORE: the DUMP statement displays results; the STORE statement saves the results.
Page 34
Understanding Pig Data Model
Pig Latin is a relatively simple language that executes statements
Pig Latin has 3 Complex Data types and 1 Simple Data Type (Atom simple atomic value such as string or number)
Page 35
Pig Data Model Expressions
The table above shows the expression types in Pig Latin and how they operate. The Pig data model is very flexible and
permits arbitrary nesting.
Page 36
Pig Runtime
Pig Editors
o Pig Pen - script text editor
Page 37
Ride the Pig !!!
Example: how would you find the maximum temperature in a given year using Pig Latin? A script would LOAD the weather records, GROUP them BY year, and GENERATE the MAX temperature for each group, ending with:
DUMP max_temp;
Page 38
Pig Latin Programming
Statements
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command.
For example, the DUMP operation is a type of statement, e.g. DUMP max_temp;
A command to list the files in a Hadoop file system is another example of a statement, e.g. ls /
Statement Execution
1. Each statement is parsed in turn.
2. For syntax errors and other semantic problems, the interpreter will halt and display an error message.
3. The interpreter builds a logical plan for every relational operation, which forms the core of a Pig Latin program.
4. The logical plan for the statement is added to the logical plan for the program so far; the interpreter then moves on to the next statement.
5. The trigger for Pig to start processing is the DUMP statement.
Comments
Pig Latin has two forms of comments.
Double hyphens introduce single-line comments: everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter. Ex: -- DUMP max_temp
C-style comments are more flexible, since they delimit the beginning and end of the comment block with /* and */ markers;
they can span lines or be embedded in a single line. Ex: /* ... */
Page 39
Pig Functions Beyond the Barn
Functions
1. Eval function
o Takes one or more expressions and returns another expression
o Some Eval functions are aggregate functions such as MAX, which returns the maximum value of the
entries in a bag
2. Filter function
o Removes unwanted rows
o Returns a logical Boolean result
o Example: IsEmpty, which tests whether a bag or a map contains any items
3. Comparison function
o Can impose an ordering on a pair of tuples.
4. Load function
o Specifies how to load data into a relation from external storage
5. Store function
o A function that specifies how to save the contents of a relation to external storage
UDFs are written in Java and packaged in a jar file:
REGISTER piggybank/java/piggybank.jar;
b = FOREACH a GENERATE org.apache.pig.piggybank.evaluation.string.UPPER($0);
Page 40
Analyzing Logs using Pig Where Can we Use This?
Churning through voluminous log files and extracting meaningful insight is very common for click-stream data, active and
passive sensor data, exception stack-trace information, etc.
Examples: find unique hits per day; find unique website hits per day.
Page 42
Yet another Storage Mechanism?
What is a Database?
o An organized collection of data that supports usage (for example, storing and finding a list of available conference rooms)
o Databases are intended to be used by multiple programs and several different users at the same time
o Databases have a long history, spread over 50 years, with several technological advancements
Evolution: traditional files, then relational databases, and now... ?
Page 43
Why Enough is not Enough?
RDBMS still makes sense in most cases, at present:
o As a Persistence layer for front end applications
o Store relational data, strong consistency (ACID properties) and Referential Integrity
o Random access for structured data
o Limited number of records
Enterprise data needs keep forcing the same answer: scale up, then scale up again, until the options run out. The Big Data revolution takes a different path: store everything, and accept that there is no one size that fits all.
Page 44
BigTable The Backdrop
BigTable, per its paper's title "A Distributed Storage System for Structured Data", combines wide applicability (many Google products) with high availability.
Page 45
Terminology and Complexity
The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
HBase uses a data model very similar to that of BigTable. Users store data rows in labeled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
Page 46
Demystifying HBase and BigTable
HBase is an open-source implementation of BigTable (HBase, 2013)[4]
Sparse
HBase does not follow a spreadsheet model: a given row can have any number of columns in each column family, or none at all. Values can be of any length, with no predefined names or widths.
Distributed
HBase and BigTable are built over distributed file systems, so the underlying storage can be spread out among an array of independent machines. This provides a layer of protection against a node within the cluster failing.
Page 47
Storage Layout Differences by Example
ROW-ORIENTED
Row-ID | Name | Birthdate | Salary | Dept
1 | Joe | 6-Apr | 70000 | SAL
2 | Jane | 12-Feb | 55000 | ACCT
3 | Bob | 13-Jul | 120000 | ENG
4 | Mike | 17-Sep | 115000 | MKT
COLUMN-ORIENTED (the Name column)
Row-ID | Value
1 | Joe
2 | Jane
3 | Bob
4 | Mike
Page 48
Storage Layout Differences by Example (contd..)
Converting a Relational Model to HBase
Data Model: a representation by a conventional RDBMS follows the general RDBMS methodology, e.g. an Order_Products join table carrying OrderID (FK) and ProductID (FK).
Page 49
Storage Layout Differences by Example (contd..)
Representing Shopping Cart application in HBase
Page 50
Storage Layout Differences by Example (contd..)
Where RDBMS makes sense
1. Joining
o In a single query get all products in an order with their product information
2. Secondary Indexing
o Get Customer Id by Email
3. Referential Integrity
o Deleting an order would delete links out of order_products
Conclusion
For small instances of simple, straightforward systems, relational databases offer a much more convenient way to
model and access data.
If you need to scale to larger proportions, the properties and flexibility of HBase can relieve you of the headaches
associated with scaling an RDBMS.
An RDBMS provides tremendous functionality out of the box but is extremely difficult and costly to scale. HBase provides
barebones functionality out of the box, but scaling is built in and inexpensive.
Page 51
HBase Building Blocks
Row
o Rows are composed of columns
o A row can have millions of columns
o Columns can be compressed or tagged to stay in memory
Table
o A collection of rows
o All rows are always sorted lexicographically by their row key
o Keys are compared on a binary level, from left to right
o Rows are always unique; the row key can be thought of as a primary index
Column
o The most basic unit of HBase is a column
o Each column may have multiple versions, with each distinct value contained in a separate cell
o One or more columns form a row, which is addressed uniquely by a row key
o A column name is called a qualifier; a column is referenced as family:qualifier
Cell
o Every column value is called a cell, and it is timestamped
o Cells can be used to save multiple versions of a value; versions are stored in decreasing timestamp order, most recent first
Column Families
o Columns are grouped into column families
o Column families draw semantic boundaries between data
o They are defined when the table is created and should not be changed often
o The number of column families should be kept reasonable
Region
o The basic unit of scalability and load balancing
o Contiguous ranges of rows stored together
o Dynamically split by the system when they become too large
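To make family:qualifier addressing and timestamped cells concrete, here is a minimal sketch against the HBase 0.94-era Java client (the vintage bundled with the CDH4 VM used later in this training); the table "customers" and family "info" are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customers");
        // Write one cell: row key + family:qualifier + an implicitly timestamped value.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Joe"));
        table.put(put);
        // Read the most recent version of that cell back.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(value)); // prints: Joe
        table.close();
    }
}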
Page 52
HBase Logical Architecture
Page 53
HBase Distribution Architecture
Diagram: the sorted row-key space (A-Z) is divided into contiguous key ranges, e.g. Keys: [A-C], [C-F], [F-I], [I-M] and [M-T], and these regions are distributed across the region servers.
Distribution
o The unit of scalability in HBase is the Region
o Regions are sorted, contiguous ranges of rows
o They are spread randomly across RegionServers
o They are moved around for load balancing and failover
o They are split automatically or manually to scale with growing data
Page 54
Hive & Sqoop - Day 1
Page 55
Introducing Hive
Many thanks Facebook for making Hadoop data files look like SQL tables!!!
Hive is a petabyte-scale data warehousing infrastructure for managing and querying structured data, built
on top of Hadoop (Hive, 2013)[5]:
o Map-Reduce for execution
o Simple query language called Hive QL, which is based on SQL
o Plug in custom mappers and reducers for sophisticated analysis that may not be supported by the built-in capabilities of the
language
Hive
(Built by Facebook)
Page 56
Motivation For Using Hive
Motivation
o Data, data and more data
o Users expect faster response times on fresher data
o Fast, faster and real-time
Quick Facts
o On average, data is increasing at 8X yearly
o Platform scalability is the major limiting factor in supporting this data growth
Can I use Hadoop?
Page 57
Rationale For Hive
Page 58
Where Does Hive Help?
HIVE principles that address key challenges:
How does Hive address interoperability challenges?
1. Schemas are stored in an RDBMS
2. Column types can be complex types
3. Tables and partitions can be altered
4. Views to be available soon
How does Hive address extensibility challenges?
1. Plug in custom Mappers / Reducers
2. Data sources can come from web services
3. JDBC / ODBC drivers
Page 59
Hive Architecture
Data Model
o Tables with typed columns (int, float, string, date, boolean)
o Partitions
o Buckets (hash partitions, useful for sampling and join optimization)
Components: JDBC and ODBC interfaces, a command line interface, a web interface, a Thrift server, the Driver (Compiler, Optimizer, Executor) and the Metastore
Metastore
o A namespace containing the set of tables
o Holds partition definitions
o Statistics
o Runs on Derby, MySQL and many other relational databases
Page 60
Hive Query Language
Features
Capabilities
o Can point to external tables or existing data directories in HDFS
o Sub Queries
o Equi Joins
o Multi-table Insert
o Multi-group by
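As a sketch of how a client can submit HiveQL programmatically, the following assumes a HiveServer2 endpoint on localhost:10000 and a hypothetical stocks table; both are assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // The HiveQL below is compiled into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT ticker, MAX(price) FROM stocks GROUP BY ticker")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}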
Page 61
Hive Usage @ Facebook and Beyond
Quick Statistics
12 TB of compressed new data added / day
135 TB of compressed data scanned / day
7500 + Hive jobs / day
Analysts (non-engineers) use Hadoop through Hive
95% of jobs at Facebook are Hive jobs
Page 62
Sqoop - Day 1
Page 63
Sqoop (SQL-to-Hadoop)
Sqoop is an open source tool that allows users to extract data from relational databases into Hadoop.
o A great strength of the Hadoop platform is its ability to work with data in different forms, parsing ad-hoc data formats to extract the relevant information (Sqoop, 2013)[9].
o A considerable amount of valuable data in an organization is stored in relational database systems.
Features
o Written in Java; custom MapReduce programs interpret the data
o JDBC-based interface
o Automatic datatype generation
o Uses MapReduce to read tables from the database (database table -> HDFS)
o Supports most JDBC standard types
o Provides the ability to import from SQL databases straight into the Hive data warehouse
Page 64
Sqooping Large Objects
Importing Large Objects
o Database queries usually read all columns of each row from disk to identify the rows that match a query's criteria.
o If large objects were stored inline in this fashion, they would adversely affect the performance of such scans.
MapReduce typically materializes every record before passing it along to the mappers. To avoid this performance
degradation, large objects are often stored externally from their rows.
Diagram: Row 1 stores Col 1, a reference ID (A), Col 3 and Col 4 inline, while the large object LOB A itself lives in external storage B.
o Sqoop will store imported large objects in a separate file called a LobFile.
o The LobFile format can store individual records of very large size (it uses 64-bit addresses).
o This format allows clients to hold a reference to a record without accessing the record contents.
Page 65
Conclusion - Day 1
Questions?
Page 66
Industry Use Cases End to End - Day 2
Page 67
Use Case 1: How do you Gauge Customer Sentiment?
Use Case: reducing customer attrition via a priori semantic analysis of customer sentiments using textual data
Problem Statement: the ability to ingest large volumes of unstructured data from multiple sources; apply a priori
established rule models to the ingested data; derive a structured result-set that can correlate sentiments to subjects
Challenges in the semantic analysis of human sentiment!!
o Accuracy of the result-set is the major challenge in unstructured textual analysis; much of it is manifested in the form of Precision and Recall.
o Precision: the percentage of relevant items in the result-set, i.e. are the results valid? Correlation is a better measure.
o Recall: the percentage of relevant results retrieved from the unstructured data, i.e. are all sentiments relevant to a subject retrieved successfully?
Limitations of traditional semantic text analysis
o Limited accuracy
o Latency (limited speed)
o Low expressiveness
o Low granularity
o Inability to perform bi-directional raw text analysis
Page 68
Using NLP to assess Customer Sentiment
Natural Language Processing (NLP) constitutes the process of extracting meaningful information, via computation,
from a natural language, e.g. English. This requires collaboration between computation, linguistics and statistics.
NLP constructs that we use to answer the customer sentiment question with Mahout (Mahout, 2011)[6]:
Coreference Resolution: given a sentence, determine which words refer to the same object/entity.
Named Entity Recognition: given a stream of text, determine which items in the text map to proper names and their types.
Natural Language Generation: convert rich media to a human-readable format, e.g. conversion of IVR data to text.
Part-of-Speech Tagging: given a sentence, determine the part of speech for each word. At a large bank, customer interactions
captured in a non-inflectional language such as English introduce ambiguity, because multiple words can be used both as
nouns and verbs, e.g. book, set, etc.
Page 69
Extractor Definition
Semantic Extractor Definition
Sample Customer Email Extract
My name is John Doe and I maintain two wealth management accounts with you. On 28th Feb. I
called customer service regarding erroneous fees applied to my account; however, after speaking with
multiple agents and wasting a lot of time, I came off very disappointed with the level of
service I received. My issue has not been resolved and I am considering moving my accounts
to another institution.
Page 70
Introducing Hadoop Parallelism for Sentiment Analysis
We can use a Hadoop platform that:
Includes an advanced Text Analytics Engine
Includes the Annotator Query Language (AQL), a fast, declarative and fine-grained expressive language for text analysis
AQL is used for defining the text analytics rules that build text extractors
The runtime compiles an AQL extractor into an Analytics Object Graph (AOG), which represents the text extraction rules as a tree
structure in memory
A separate instance of the AOG is deployed on each Mapper, which runs a full instance of the Text Analytics Engine
A single Mapper operates on a single data split, running it through the rules defined in the AOG
Output from all Mappers is reduced via a Reducer that coalesces negative emotive expressions exclusively
Page 71
Use Case 2: Anti-Money Laundering Using Hadoop - a Fraud Use Case
Money Laundering is the practice of concealing the source of illegally obtained money
Diagram: money flows from Region A through remittance agents, intermediary agents and bank agents into an offshore bank account in Region B, concealing its source along the way.
Page 72
Notional Model for Event Based Detection of ML
Using Hadoop for AML Detection
Diagram: a notional Hadoop pipeline for AML detection. Data from the enterprise event cloud is ingested with Sqoop, serialized with Avro and landed in HDFS. Each event record captures subject, time, location and activity (e.g. Subject: John Doe; 14:35, April 5 2009; Seattle, WA; Company Setup; CFO), alongside historical event records and watch lists (OFAC lists, CFTC lists). ML detection algorithms run as MapReduce jobs with Mahout, iterating over the event records to isolate associations and sub-associations and to build an ML object graph. Subject rankings and cluster-association assignments are put to, and retrieved from, a rankings & associations knowledge base.
Page 73
Use Case 3: Major System Outages in Recent Past
Page 74
Monitoring & Decisioning Man vs. Machine
The variance of accuracy is inversely proportional to the increase in the breadth of the data
Diagram: human decisioning vs. a trainable model. Consistency and comprehension across the full breadth of the data is humanly impossible for a person in the current situation; a trainable model becomes more accurate with more data and yields consistent results (perception, interpretation and judgment remain the same).
Page 75
Notional Model for System Outage Prediction
Use Case: designing an automated framework that collects, correlates and classifies enterprise events, thereby
generating alert notifications identifying potential outage scenarios
Problem Statement: collection, classification and correlation of enterprise events; an autonomous, self-training,
intelligent event framework
Diagram: within a time box, log data and events are collected, aggregated, enriched and classified; white noise is filtered out, and rules and pattern matching are applied against historic data; the framework then decides or re-trains, emitting notifications, corrective actions and system actions.
Page 76
Intelligent Outage Learning
We can use a Big Data Platform to build a system outage framework that uses ML to predict outages
System logs, exception stack traces, enterprise events and customer call data are fed into a Mahout ML model (the baseline)
The ML base model is continuously trained by regressing all exogenous & endogenous variables against test data set(s)
The refined base model is established as the yardstick for the prediction of future outages
Real-time events are sent through the trained ML model at times t+n, and variances from the fitted curve are observed
Event cloud input is multiplexed across multiple Mappers, each running an instance of the trained Mahout ML model
Outlier events are identified by their violation of the normal variance threshold from the fitted curve of the ML model
Page 77
Hadoop Installation & HDFS Hands-on - Day 2
Page 78
Setting up the VM
For this demonstration, we will be using Cloudera's Quick Start VM. This 64-bit VM contains a single-node Apache
Hadoop cluster. It runs on CentOS and includes:
o CDH4.6
o Cloudera Manager
o Cloudera Impala
o Cloudera Search
Page 79
Setting up the VM
Installing VirtualBox
Step 1
Install Chocolatey from a Command Prompt (run as administrator); Chocolatey is a utility for easy command-line installation
Step 2
Install VirtualBox
Step 3
Install 7Zip (will be used later to extract the VM)
Page 80
Setting up the VM
Downloading & Importing Cloudera Quickstart VM
Step 1
Download Cloudera Quickstart VM (approximately 8 minutes)
http://www.cloudera.com/content/support/en/downloads/download-components/download-
products.html?productID=F6mO278Rvo&version=2.1
Step 2
Unzip VM
Locate the downloaded VM zip file (cloudera-quickstart-vm-4.4.0-1-virtualbox.7z)
Move file to desired folder (optional)
Right-click on the file and from the 7-Zip sub-menu, select "Extract Here" (approximately 3 minutes)
Step 3
Import Appliance
Open VirtualBox (Oracle VM VirtualBox Manager)
Select "Import Appliance" from the File menu
Locate the extracted VM folder, expand it, and select the *.ovf file
Hit Next, then Import
Page 81
Setting up the VM
Adjusting RAM Settings
It is advisable to allocate at least 2GB of RAM for your host machine OS, and CDH4.6 requires 4GB of RAM. Before loading the VM, it is
important to make sure the RAM settings leave at least 2GB of RAM for your host OS to run. If you keep the default setting,
Windows and the VM will continuously fight for resources, and you don't really want Windows to become inoperable. This is done by:
Steps:
1. In Oracle VM VirtualBox Manager, with "cloudera-quicksta..." highlighted, click Settings
2. Select the "System" tab
3. Reduce the "Base Memory" of the VM to leave enough memory for your host OS
4. Hit OK
Page 82
Setting up the VM
Adjusting Copy/Paste Settings
To copy and paste between your host machine OS and the VM, enable the bidirectional shared clipboard. This is done by:
Steps:
1. In Oracle VM VirtualBox Manager, with "cloudera-quicksta..." highlighted, click Settings
2. Select the "Advanced" sub-tab within "General"
3. Change the "Shared Clipboard" drop-down to "Bidirectional"
4. Hit OK
Page 83
Setting up the VM
Cleaning Up and Starting the VM
Step 1
Clean up
You can remove the *.7z and the extracted folder (5GB of hard drive savings)
NOTE: The VM has already been loaded (i.e., installed) on your machine in:
c:\users\[user]\VirtualBox VMs\cloudera-quickstart-vm-4.4.0-1-virtualbox
This is where the functional machine exists (do not remove it).
Step 2
Start up the VM
With the "cloudera-quicksta..." appliance selected, click the Start button.
Page 84
Setting up the VM
Creating a Shared Folder
You can setup your Shared Folder to transfer files from your host machine to the VM. This will be useful when we wish to transfer our
tutorial files from our host to the VM My Documents folder.
Steps:
1. Be sure to have your virtual machine shut down
2. In Oracle VM VirtualBox Manager, with "cloudera-quicksta..." highlighted, click Settings
3. Select the "Shared Folders" tab
4. Click the add icon to add a shared folder
5. Select "Other" for the folder path, and select the folder on your host machine, under your My Documents folder, which you have created for sharing documents between your machine and the VM
6. Provide the Folder Name, which is what the name of the shared folder will be within the VM
7. Ensure that "Auto-mount" is checked
Page 85
Setting up the VM
Transferring Files into VM
Step 1: Switch to the root user
$ sudo su
Step 2: Move the tutorial file from the shared folder to the home folder
o The shared folders are located within the /media/ path.
o The folder name provided in the VM Settings is prefixed with sf_
# mv /media/sf_VMShare/NYSE.tar.gz /home/cloudera/NYSE.tar.gz
Page 86
Setting up the VM
Internet Settings: Adjusting the Network Connection Settings
Before relying on downloads from within the VM, check that the VM's internet connection works and adjust the network
connection settings if it does not.
Note: Internet connectivity is not required for this tutorial. You may omit this section.
Steps:
1. Check internet connection
o Having the cloudera-quicksta.. VM selected,
start the VM by clicking
o Open up Firefox within the VM and attempt
to browse to a website (www.google.com).
o If the connection succeeds, your internet
connection is working and you do not need to
do anything else.
o If the connection fails, shut down the VM (System -> Shutdown) and follow the next steps.
Page 87
Setting up the VM
Internet Settings: Adjusting the Network Connection Settings (Continued)
Note: Internet connectivity is not required for this tutorial. You may omit this section.
Steps:
3. Adjust VM Network Connections
o Start VM
o Under System -> Preferences, select Network Connections. If only "System eth0" exists, click Add to create "eth01" as a new
Wired connection.
o If prompted, the Password for root is cloudera.
o The MAC addresses of both connections should be the same as copied from Step 1 and the first two checkboxes of each
connection should be checked.
o Under IPv4 Settings:
The connection method for eth01 should be Automatic (DHCP) addresses only.
The host IPv4 DNS Server address should be provided as the DNS Server
o Now that the internet connection works, you can use the web to download the training materials zip file to the My
Documents folder.
Page 88
Setting up the VM
Screen shots of Linux and Windows popup windows
Screenshot captions: (1) for both the "System eth0" and "eth01" connections, make sure that you have checked the appropriate boxes and provided the Device MAC address; (2) in Windows, record the IPv4 DNS Server address for the connected network (select the current network from Network and Sharing Center in Control Panel, then click Details); (3) ensure that the connection method for eth01 is "Automatic (DHCP) addresses only" and provide the IPv4 DNS Server recorded from Windows.
Hadoop Hands-On Training: Step-by-Step Examples
Page 89
HDFS Hands-on - Day 2
Page 90
HDFS
HDFS (the Hadoop Distributed File System) is a distributed file system that stores data on commodity machines, providing
high aggregate bandwidth across the cluster. HDFS allows us to not worry about where a file is actually stored in the
Hadoop cluster, and to treat the whole cluster as a single file system.
% hadoop fs <args>
Page 91
HDFS
Setting up the user folder in Hadoop
Step 1: Switch to the hdfs superuser
$ sudo su hdfs
Step 2: Create the cloudera user folder. (If this errors, simply proceed.)
$ hadoop fs -mkdir /user/cloudera
Step 3: Return to the cloudera user
$ exit
Page 92
HDFS
Shell Command Basics
All the FS shell commands take path URIs as arguments
The URI format is scheme://authority/path
For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If
not specified, the default scheme specified in the configuration is used
An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as
/parent/child (given that your configuration is set to point to hdfs://namenodehost)
Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of
the commands
Error information is sent to stderr and the output is sent to stdout
Page 93
HDFS
Shell Commands List
dus: Displays a summary of file lengths. Usage: -dus <args>
expunge: Empty the trash.
get: Copy files to the local file system. The CRC options relate to cyclic redundancy checks. Usage: -get [-ignorecrc] [-crc] <src> <localdst>
ls: For a file, returns stat on the file; for a directory, returns the list of its direct children. Usage: -ls <args>
lsr: Recursive version of ls. Similar to Unix ls -R.
mkdir: Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path. Usage: -mkdir <paths>
Page 94
HDFS
Shell Commands List (continued)
put: Copy single or multiple sources from the local file system to HDFS. Specifying - as the source reads from stdin. Usage: -put <localsrc> ... <dst>
rm: Deletes non-empty directories and files. Usage: -rm URI [URI ...]
rmr: Recursive version of delete. Usage: -rmr URI [URI ...]
setrep: Change the replication factor of a file. Provide -R for recursive. Usage: -setrep [-R] <rep> <path>
tail: Returns the last kilobyte of the file to stdout. Usage: -tail [-f] URI
test: Tests whether the file exists (e), is zero length (z), or is a directory (d). Usage: -test -[ezd] URI
text: Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream. Usage: -text <src>
touchz: Create a file of zero length. Usage: -touchz URI [URI ...]
Page 95
Lunch - Day 2
Page 96
MapReduce Deep Dive - Day 2
Page 97
Features of MapReduce
Automatic parallelization and distribution
A clean abstraction for programmers
o MapReduce programs are usually written in Java
Can be written in any language using Hadoop Streaming (see later)
All of Hadoop is written in Java
o MapReduce abstracts all the housekeeping away from the developer
Developer can concentrate simply on writing the Map and Reduce functions
Automatic Fault tolerance
Status and monitoring tools
Page 98
MapReduce
JobTracker
Page 99
MapReduce
Terminology
Page 100
MapReduce
Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to
avoid network traffic
o Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs (pseudo code):
map(in_key, in_value) -> (inter_key, inter_value) list
o The key is the byte offset into the file at which the line starts
Page 101
MapReduce
Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are
combined together into a list
This list is given to a Reducer
o All values associated with a particular intermediate key are guaranteed to go to the same Reducer
o The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
o In practice, the Reducer usually emits a single key/value pair for each input key
Page 102
MapReduce
Data Locality
Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block
of data stored locally on that node via HDFS
If this is not possible, the Map task will have to transfer the data across the network as it processes
that data
Once the Map tasks have finished, data is then transferred across the network to the Reducers
o Although the Reducers may run on the same physical machines as the Map tasks, there is no
concept of data locality for the Reducers
Page 103
MapReduce: Bigger Picture?
Diagram: the bigger MapReduce picture across two nodes. On each node, record readers (RR) read the input files and feed Map tasks; a Partitioner assigns each intermediate (k, v) pair to a reducer and the output is sorted; the shuffling process exchanges intermediate (k, v) pairs between all nodes; Reduce tasks then run, and the Output Format writes the results back to the local HDFS store.
Page 104
MapReduce
Is Shuffle and Sort a Bottleneck?
It appears that the shuffle and sort phase is a bottleneck
o The reduce method in the Reducers cannot start until all Mappers have finished
In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
o This mitigates against a huge amount of data transfer starting as soon as the last Mapper finishes
The developer can specify the percentage of Mappers which should finish before Reducers start
retrieving data
o The developer's reduce method still does not start until all intermediate data has been
transferred and sorted
Page 105
MapReduce
Is a Slow Mapper a Bottleneck?
It is possible for one Map task to run more slowly than the others
o The reduce method in the Reducer cannot start until every Mapper has finished
Hadoop uses speculative execution to mitigate against this
o If a Mapper appears to be running significantly more slowly than the others, a new instance
of the Mapper will be started on another machine, operating on the same data
Page 106
MapReduce
The Five Hadoop Daemons
Hadoop is comprised of five separate daemons:
the NameNode, the Secondary NameNode, the DataNode, the JobTracker and the TaskTracker
Page 107
MapReduce
The Five Hadoop Daemons (Continued)
o Master Nodes: run the NameNode, Secondary NameNode and JobTracker daemons
o Slave Nodes: each runs a DataNode and a TaskTracker daemon
Page 108
MapReduce
Submitting a Job
When a client submits a job, its configuration information is packaged into an XML file
This file, along with the .jar file containing the actual program code, is handed to the JobTracker
o When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
o TaskTracker nodes can be configured to run multiple tasks at the same time, if the node has enough
processing power and memory
The intermediate data is held on the TaskTracker's local disk
As Reducers start up, the intermediate data is distributed across the network to the Reducers
Reducers write their final output to HDFS
Once the job has completed, the TaskTracker can delete the intermediate data from its local disk
o Note that the intermediate data is not deleted until the entire job completes
Page 109
Configuration Properties
Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java properties | Properties controlling how metrics are published in Hadoop.
log4j.properties | Java properties | Properties for the system logfiles, the namenode audit log, and the task log for the tasktracker child process.
Page 110
Configuration Properties
Property Name Type Default Value Description
Page 111
Configuration Properties
Property Name Type Default Value Description
mapred.tasktracker.map.tasks.maximum (mapred-site.xml) | int | 2 | The number of map tasks that may be run on a tasktracker at any one time.
mapred.tasktracker.reduce.tasks.maximum (mapred-site.xml) | int | 2 | The number of reduce tasks that may be run on a tasktracker at any one time.
Page 112
Configuration Properties
Property Name Type Default Value Description
mapred.map.java.opts (mapred-site.xml) | String | -Xmx200m | The JVM option used for the child process that runs map tasks. From 0.21.
mapred.reduce.java.opts (mapred-site.xml) | String | -Xmx200m | The JVM option used for the child process that runs reduce tasks. From 0.21.
Page 113
MapReduce
Submitting a Job
This consists of three portions:
o Code that runs on the client to configure and submit the job
o The Mapper
o The Reducer
Before we look at the code, we need to cover some basic Hadoop API concepts
Page 114
MapReduce
Getting Data to the Mapper
The data passed to the Mapper is specified by an InputFormat
o The InputFormat is a factory for RecordReader objects, which extract (key, value) records from the input source
Page 115
MapReduce Hands-on Development & Deployment - Day 2
Page 116
Executing a MapReduce Program
MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. It
includes:
o A Mapper class containing the map() function
o A Reducer class containing the reduce() function
o A driver class which configures the MapReduce job in the main() function
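A minimal sketch of a driver class for the AvgHigh job used in the following steps; the exact wiring is an assumption, following the configuration points listed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgHigh {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "average high"); // configure the job
        job.setJarByClass(AvgHigh.class);
        job.setMapperClass(AvgHighMapper.class);   // map() emits (ticker, daily high)
        job.setReducerClass(AvgHighReducer.class); // reduce() aggregates per ticker
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input data on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output folder on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}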
Page 117
Executing a MapReduce Program
Compiling and exporting an MR job using commands
Step 1
Change the directory to where we will be compiling and exporting the MapReduce program.
This folder already contains the *.class and *.jar files but we will overwrite them for the sake of this exercise.
$ cd /home/cloudera/NYSE/bin
Step 2
Compile the *.class files using javac (the Java compiler), passing the included library folders (colon delimited) followed by the driver, Mapper and Reducer source files.
Step 3
Export the class files into a jar file.
$ jar cvf averageHigh.jar AvgHigh.class AvgHighMapper.class AvgHighReducer.class
Page 118
Executing a MapReduce Program
Running the MapReduce program using Hadoop
Step 1
Change the directory to where the jar file exists.
$ cd /home/cloudera/NYSE/bin
Step 2
Set HADOOP_CLASSPATH environment variable to the jar file
$ export HADOOP_CLASSPATH=averageHigh.jar
Step 3
Execute the MapReduce program using Hadoop
$ hadoop AvgHigh /user/cloudera/NYSE/data/EOD2013.txt output/AvgHigh
(application class name within the jar file; input data on HDFS; output folder on HDFS)
Page 119
Executing a MapReduce Program
Tracking a MapReduce Job using the Hadoop JobTracker
Page 120
Executing a MapReduce Program
Tracking a MapReduce Job using the Hadoop JobTracker
Page 121
Executing a MapReduce Program
Running the MapReduce program using Hadoop
Step 1
Examine the files contained in the output folder specified when the MapReduce program was executed.
$ hadoop fs -ls output/AvgHigh
Step 2
Display the output (in HDFS) to the screen (stdout).
$ hadoop fs -cat output/AvgHigh/part-r-00000
Step 3
Make a local output folder.
$ mkdir /home/cloudera/NYSE/output
Step 4
Copy the output to your local (VM) file system.
$ hadoop fs -get output/AvgHigh /home/cloudera/NYSE/output/AvgHigh
Step 5
Display the output on your local system (VM) to the screen (stdout).
$ cat /home/cloudera/NYSE/output/AvgHigh/part-r-00000
Page 122
InputFormat - Hierarchy
Class hierarchy (org.apache.hadoop.mapred): the InputFormat<K, V> interface is implemented by FileInputFormat<K, V>, the base class used for all file-based InputFormats, whose subclasses include CombineFileInputFormat<K, V>, TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, StreamInputFormat and SequenceFileInputFormat<K, V> (itself extended by SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat and SequenceFileInputFilter<K, V>); non-file implementations include CompositeInputFormat<K, V>, DBInputFormat<T> and EmptyInputFormat<K, V>.
o TextInputFormat: the default. Treats each \n-terminated line of a file as a value; the key is the byte offset within the file of that line.
o KeyValueTextInputFormat: maps \n-terminated lines as key SEP value; by default the separator is a tab.
o SequenceFileInputFormat: a binary file of (key, value) pairs with some additional metadata.
o SequenceFileAsTextInputFormat: similar, but maps (key.toString(), value.toString()).
Source: Hadoop Definitive Guide, Tom White
Page 123
OutputFormat - Hierarchy
OutputFormat subclasses include MapFileOutputFormat, MultipleOutputFormat<K, V> (extended by MultipleTextOutputFormat<K, V> and MultipleSequenceFileOutputFormat), NullOutputFormat<K, V> and DBOutputFormat<K, V>.
Page 124
Writable & WritableComparable?
Hadoop defines its own box classes for strings, integers and so on:
o IntWritable for ints
o LongWritable for longs
o FloatWritable for floats
o DoubleWritable for doubles
o Text for strings
o etc.
Keys and values in Hadoop are objects: values are objects which implement Writable, and keys are objects which implement WritableComparable. The Writable interface makes serialization quick and easy for Hadoop.
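A small sketch of the box classes in use; this is an illustration of get()/set() and key comparison, not deck code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable();
        count.set(42);                    // every wrapper has set()...
        System.out.println(count.get());  // ...and get()
        Text word = new Text("hadoop");   // Text boxes a String
        // Text implements WritableComparable, so it can serve as a key.
        System.out.println(word.compareTo(new Text("hive")) < 0); // prints: true
    }
}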
Page 125
Driver & Jobs
The Driver Code
o The driver code runs on the client machine
o It configures the job, then submits it to the cluster
Creating a New Job Object
o The Job class allows you to set configuration options for your MapReduce job: the classes to be used for your Mapper and Reducer, the input and output directories, and many other options
o Any options not explicitly set in your driver code will be read from your Hadoop configuration files, usually located in /etc/hadoop/conf
o Any options not specified in your configuration files will receive Hadoop's default values
o You can also use the Job object to submit the job, control its execution, and query its state
Specifying the InputFormat
o The default InputFormat (TextInputFormat) will be used unless you specify otherwise
o To use an InputFormat other than the default, use e.g.: job.setInputFormatClass(KeyValueTextInputFormat.class)
Specifying Final Output with OutputFormat
o FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
o The driver can also specify the format of the output data; the default is a plain text file, which could be explicitly written as job.setOutputFormatClass(TextOutputFormat.class)
Page 126
MapReduce Architecture
Diagram: task execution on a TaskTracker node. (8) The TaskTracker retrieves the job resources (jar and configuration) from the shared filesystem (e.g. HDFS); (9) it launches a child JVM; (10) the child runs the MapTask or ReduceTask.
Page 127
Building a MapReduce Program
Page 128
Building a MapReduce Program
Creating a new Java Project
Steps:
1. Open Eclipse
2. Go to File > New, select Java Project
3. Provide NYSE as the project name
and modify the Location (uncheck
Use default location) to:
/home/cloudera/NYSE/src
Page 129
Building a MapReduce Program
Creating a new Java Project (continued)
Steps (continued):
5. Select the *.jar files in the /usr/lib/hadoop folder
Click on File System
Open the usr folder, then the lib folder, then the hadoop folder
Page 130
Building a MapReduce Program
Creating a new Java Project (continued)
Steps (continued):
6. After clicking OK, again click on Add External JARs and select the *.jar files in the /usr/lib/hadoop/client-0.20 folder.
Click OK, then Finish to return to the workspace.
Page 131
Building a MapReduce Program
Eclipse Working Environment
[Screenshot] The Eclipse workspace, with callouts identifying: the open files; the project, package, and files in the explorer; the imported libraries (included when compiling); the class definition; comments; and the functions within the class.
Page 132
Building a MapReduce Program
Examining the Data
Page 133
Building a MapReduce Program
Writable classes (data types)
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package:
o There are Writable wrappers for all the Java primitive types except char, which can be stored in an IntWritable.
o All wrappers have a get() and set() method for retrieving and storing the value.
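A minimal illustration of the get()/set() pattern (the values here are examples, not course data):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable year = new IntWritable();
        year.set(2013);                        // store a primitive in the wrapper
        int y = year.get();                    // retrieve it as a primitive
        Text ticker = new Text("MS");          // Text wraps a UTF-8 string
        System.out.println(ticker + " " + y);  // prints: MS 2013
    }
}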
Page 134
Building a MapReduce Program
Mapper Class: Import Commands
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
Page 135
Building a MapReduce Program
Mapper Class: Building the Mapper class
Page 136
Building a MapReduce Program
Mapper Class: Building the map function
The map function is the method of the Mapper that Hadoop executes for each input record.
The map function takes three parameters:
o The key parameter is the auto-assigned id (byte offset) of the line that Hadoop is processing; for most purposes, this is an arbitrary value.
o The value parameter is the line that Hadoop is processing.
o The context is where the output is written to.
Page 137
Building a MapReduce Program
Mapper Class: Mapper skeleton
The skeleton below represents the key elements that are found in each Mapper.
The highlighted elements (purple on the original slide) can be changed to suit your needs.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
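The skeleton itself is not reproduced in this text; a sketch of its shape, using the imports above (the class name and type parameters are the elements you would change):

public class MyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // parse the input value, then emit intermediate (key, value) pairs:
        // context.write(new Text(...), new DoubleWritable(...));
    }
}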
Page 138
Building a MapReduce Program
Mapper Class: The code
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
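The mapper code on this slide is likewise not reproduced in this text. A sketch consistent with the AvgHighReducer and AvgHigh driver that follow, assuming the comma-delimited end-of-day layout used in the Hive exercises (ticker, date, open, high, low, close, volume):

public class AvgHighMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: ticker,date,open,high,low,close,volume
        String[] fields = value.toString().split(",");
        if (fields.length >= 4) {
            // Emit (ticker, daily high); the reducer averages the highs per ticker
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[3])));
        }
    }
}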
Page 139
Building a MapReduce Program
Reducer Class: Import Commands
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
Page 140
Building a MapReduce Program
Reducer Class: Building the Reducer class
Page 141
Building a MapReduce Program
Reducer Class: Building the reduce function
The reduce function is the method of the Reducer that Hadoop executes.
The reduce function takes three parameters:
o The key parameter is an intermediate key emitted by the Mappers.
o The values parameter is an Iterable over all of the values associated with that key.
o The context is where the output is written to.
Page 142
Building a MapReduce Program
Reducer Class: Reducer skeleton
The skeleton below represents the key elements that are found in each Reducer.
The highlighted elements (purple on the original slide) can be changed to suit your needs.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
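The skeleton itself is not reproduced in this text; a sketch of its shape, using the imports above:

public class MyReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        // iterate over the values for this key, then emit the result:
        // context.write(key, new DoubleWritable(...));
    }
}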
Page 143
Building a MapReduce Program
Reducer Class: The code
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class AvgHighReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double symbolCount = 0.0;
        double dailyHighs = 0.0;
        for (DoubleWritable value : values) {
            dailyHighs += value.get();
            symbolCount++;
        }
        context.write(key, new DoubleWritable(dailyHighs / symbolCount));
    }
}
Page 144
Building a MapReduce Program
Driver Class: Import Commands
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
Page 145
Building a MapReduce Program
Driver Class: The code
public class AvgHigh {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: AvgHigh <input dir> <output dir>\n");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(AvgHigh.class);
        job.setJobName("Average Ticker Highs");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(AvgHighMapper.class);
        job.setReducerClass(AvgHighReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Page 146
Building a MapReduce Program
Exporting the JAR package
Steps:
1. Right click the (default
package) and select the Export option
2. Select JAR File under the Java folder and
press Next
Page 147
Building a MapReduce Program
Exporting the JAR package (continued)
Steps: (continued)
3. Provide the following as the destination of the JAR file and hit Finish:
/home/cloudera/NYSE/bin/averageHigh.jar
Page 148
Executing a MapReduce Program
Executing the MapReduce Job
Step 1
Change the directory to where the jar file exists.
$ cd /home/cloudera/NYSE/bin
Step 2
Set the HADOOP_CLASSPATH environment variable to point to the jar file.
$ export HADOOP_CLASSPATH=averageHigh.jar
Step 3
Remove the output/AvgHigh folder in HDFS that was created before. The -skipTrash option removes it immediately, bypassing the trash.
$ hadoop fs -rm -r -skipTrash output/AvgHigh
Step 4
Execute the MapReduce program using Hadoop
$ hadoop AvgHigh /user/cloudera/NYSE/data/EOD2013.txt output/AvgHigh
Step 5
Review the results
$ hadoop fs -cat output/AvgHigh/part-r-00000
Page 149
The Streaming API
Many organizations have developers skilled in languages other than Java, such as:
o Ruby
o Python
o Perl
The Streaming API allows developers to use any language they wish to write Mappers and Reducers
As long as the language can read from standard input and write to standard output
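A typical invocation looks like the sketch below; the streaming jar location and the Python script names are assumptions, not part of the course material:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input NYSE/data/EOD2013.txt \
    -output output/StreamDemo \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py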
Page 150
More on the Hadoop API
Page 151
Why Use ToolRunner?
o ToolRunner parses the standard Hadoop command-line options (e.g., -D property=value) and applies them to your job's Configuration
o Also allows you to specify items for the Distributed Cache on the command line (see later)
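A minimal sketch of a ToolRunner-based driver (the class name is illustrative and the job setup is elided):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvgHighTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());          // getConf() already reflects -D options
        job.setJarByClass(AvgHighTool.class);
        // ... same Mapper/Reducer/path setup as the AvgHigh driver ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new AvgHighTool(), args));
    }
}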
Page 152
The Combiner
A Combiner is like a "mini-Reducer" that runs locally on a single Mapper's output before it is sent over the network, reducing the volume of intermediate data transferred to the Reducers
o Input and output data types for the Combiner/Reducer must be identical
Page 153
Specifying a Combiner
To specify the Combiner class to be used in your MapReduce code, put the following line
in your Driver:
job.setCombinerClass(YourCombinerClass.class);
VERY IMPORTANT: The Combiner may run once, or more than once, on the output from any given Mapper
o Do not put code in the Combiner which could influence your results if it runs more than once
Page 154
The setup / cleanup Methods
It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called, e.g. to
o Initialize data structures
o Set parameters
The setup method is run before the map or reduce method is called for the first time
Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer
The cleanup method is called before the Mapper or Reducer terminates
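A sketch of where the two methods fit in a Mapper (the class and field names are illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private Map<String, String> lookup;         // per-task state

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        lookup = new HashMap<String, String>(); // runs once, before the first map() call
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... per-record processing using the lookup table ...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        lookup.clear();                         // runs once, before the task terminates
    }
}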
Page 155
What Does The Partitioner Do?
The Partitioner divides up the keyspace
o Controls which Reducer each intermediate key and its associated values goes to
The default (hash) Partitioner implements getPartition() as:
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
Page 156
Creating a Custom Partitioner
To use a custom Partitioner, write a class that extends Partitioner and overrides getPartition(), then specify it in the Driver:
job.setPartitionerClass(MyPartitioner.class);
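A sketch of such a class (hypothetical: route keys to reducers by their first character):

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, DoubleWritable> {
    @Override
    public int getPartition(Text key, DoubleWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) return 0;
        String s = key.toString();
        char first = s.isEmpty() ? 'A' : s.charAt(0);
        return (first & Integer.MAX_VALUE) % numReduceTasks;  // non-negative bucket
    }
}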
Page 157
The MapReduce Flow: Shuffle and Sort
[Diagram] On each node, input files are divided into splits; record readers (RR) turn each split into (key, value) pairs for the Mappers; the Partitioner assigns each intermediate key to a Reducer. During the shuffling process, the intermediate (k, v) pairs are sorted and exchanged by all nodes, so each Reduce receives every value for its keys; each Reduce then writes its results back to the HDFS store via the Output Format.
Page 158
The FileSystem API
Some useful API methods:
o FSDataOutputStream create(...) - extends java.io.DataOutputStream
o FSDataInputStream open(...) - extends java.io.DataInputStream
o boolean mkdirs(...)
o void copyFromLocalFile(...)
o void copyToLocalFile(...)
o FileStatus[] listStatus(...)
For example, iterating over the results of listStatus() (fs is a FileSystem, somePath a Path):
FileStatus[] fileStats = fs.listStatus(somePath);
for (int i = 0; i < fileStats.length; i++) {
    Path f = fileStats[i].getPath();
    // do something
}
A common requirement is for a Mapper or Reducer to need access to some side data
o Lookup tables
o Dictionaries
Option 2: The Distributed Cache provides an API to push data to all slave nodes
o Transfer happens behind the scenes before any task is executed
o Files in the Distributed Cache are automatically deleted from slave nodes when the job finishes
o Jar files added with addFileToClassPath() will be added to your Mapper's or Reducer's classpath
o Files added with addCacheArchive() will automatically be de-archived/decompressed
Page 162
Using the DistributedCache: Command line
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job
o No need to copy the files to HDFS first
The -archives flag adds archived files, and automatically unarchives them on the destination machines
The -libjars flag adds jar files to the classpath
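For example (the jar, class, and file names are illustrative):

$ hadoop jar avghigh.jar AvgHighTool \
    -files lookup.txt \
    -libjars mylib.jar \
    -archives dictionaries.zip \
    input_dir output_dir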
Page 164
Reusable Classes for the New API
The org.apache.hadoop.mapreduce.lib.* packages contain a library of Mappers, Reducers, and Partitioners supporting the new API
Example classes:
o InverseMapper - swaps keys and values
o Partitioner classes which allow you to partition your data into n partitions without hard-coding the partitioning information
Page 165
Most Common InputFormats
Most common InputFormats:
o TextInputFormat
o KeyValueTextInputFormat
o SequenceFileInputFormat
Page 166
How FileInputFormat Works
All file-based InputFormats inherit from FileInputFormat
FileInputFormat computes InputSplits based on the size of each file, in bytes
o The HDFS block size is used as the upper bound for InputSplit size
o So the number of Mappers will equal the number of HDFS blocks of input data to be processed
Page 167
What RecordReaders Do
InputSplits are handed to the RecordReaders
o Specified by the path, starting position offset, and length
RecordReaders must:
o Ensure each (key, value) pair is processed
o Ensure no (key, value) pair is processed more than once
o Handle (key, value) pairs which are split across InputSplits
Page 168
OutputFormat
OutputFormats work much like InputFormat classes
Custom OutputFormats must provide a RecordWriter implementation
Page 169
Compressions
Compression Format | Tool  | Algorithm | File Extension | Split-able | Hadoop Codec
DEFLATE (a)        | N/A   | DEFLATE   | .deflate       | No         | org.apache.hadoop.io.compress.DefaultCodec
gzip               | gzip  | DEFLATE   | .gz            | No         | org.apache.hadoop.io.compress.GzipCodec
bzip2              | bzip2 | bzip2     | .bz2           | Yes        | org.apache.hadoop.io.compress.BZip2Codec
LZO (b)            | lzop  | LZO       | .lzo           | No         | org.apache.hadoop.io.compress.LzopCodec
Snappy             | N/A   | Snappy    | .snappy        | No         | org.apache.hadoop.io.compress.SnappyCodec
(a) DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note the gzip file format is DEFLATE with extra headers and a footer.) The .deflate file extension is a Hadoop convention.
(b) LZO files are split-able if they have been indexed in a pre-processing step.
Page 170
Hadoop and Compressed Files
Hadoop understands a variety of file compression formats
o Including GZip
If a compressed file is included as one of the files to be processed, Hadoop will automatically decompress it and pass the decompressed contents to the Mapper
o A GZipped file can only be decompressed by starting at the beginning of the file and continuing on to the end
Page 171
Non-Splittable File Formats and Hadoop
If the MapReduce framework receives a non-splittable file (such as a GZipped file), it passes the entire file to a single Mapper
This can result in one Mapper running for far longer than the others
o It is dealing with an entire file, while the others are dealing with smaller portions of files
Page 172
Snappy Codec
Splittable compression for SequenceFiles and Avro files using the Snappy codec
o Developed at Google
o Very fast
Snappy compresses the data inside a SequenceFile; it does not produce, e.g., a file with a .snappy extension
o That data can be decompressed automatically by Hadoop (or other programs) when the file is read
Page 173
Map Reduce Patterns
Summarization Patterns
o Numerical Summarization
o Inverted Index Summarization
o Counting with Hadoop Counters
Filtering Patterns
o Normal Filtering
o Bloom Filtering
o Top Ten Filtering
Data Organization Patterns
o Structured to Hierarchical
o Partitioning
o Binning
o Total Order Sorting
o Shuffling
Join Patterns
o Reduce Side Join
o Replicated Join
o Composite Join
o Cartesian Product
Meta Patterns
o Job Chaining
o Chain Folding
o Job Merging
Input Output Patterns
o Customizing I/O & O/P
o Generating Data
o External Source Output
o External Source Input
o Partition Pruning
Page 174
Map and Reduce Side Joins - Pattern Overview
We frequently need to join data together from two sources as part of a MapReduce job, such as
o Lookup tables
Page 175
Conclusion (Day 2)
Questions?
Page 176
Pig Deep Dive - Theory (Day 3)
Page 177
Hive and Pig: Why?
MapReduce code is typically written in Java, which requires:
o A programmer
o Who will be available to maintain and update the code in the future as requirements change
Many of the people who need to analyze the data are not Java programmers:
o Business analysts
o Data scientists
o Statisticians
o Data analysts
What's needed is a higher level of abstraction on top of MapReduce
o Providing the ability to query the data without needing to know MapReduce
Page 179
Hive and Pig
Introduction
o Many developers did not have the Java and/or MapReduce knowledge required to write standard MapReduce programs
o Pig and its PigLatin language were created to provide a simpler way to express data flows
o Under the covers, PigLatin scripts are turned into MapReduce jobs and executed on the cluster
Installation of Pig requires no modification to the cluster
The Pig interpreter runs on the client machine
o Turns PigLatin into standard Java MapReduce jobs, which are then submitted to the JobTracker
There is (currently) no shared metadata, so no need for a shared metastore of any kind
Page 180
Pig
Pig Philosophy
Page 181
Pig Overview
Pig Latin provides a higher-order abstraction for implementing MapReduce jobs. It constitutes a data flow language which is made up of a series of operations and transformations that are applied to the input data.
How are Pig Latin statements organized? Pig Latin statements are organized as a sequence of steps, such that each step represents a transformation applied to some data:
o LOAD - load statements read data from the file system
o TRANSFORMATION - transformation statements process the data read from the file system
o DUMP / STORE - the DUMP statement displays results; the STORE statement saves the results
Page 182
Pig Vs. SQL
PigLatin is a data flow language:
o Allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel
o A Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data
o Allows users to describe how to process the input data
o Schemas are dynamic
SQL is a query language:
o Allows users to form queries
o Allows users to describe what questions they want answered, not how
o Focused around answering one question; complex questions require temp tables, subqueries, or multiple procedures
o Schemas are static and constraints are enforced
Page 183
PigLatin Concepts
Concepts
o In Pig, a single element of data is an atom
o A collection of atoms - such as a row, or a partial row - is a tuple
o Tuples are collected together into bags
o Typically, a PigLatin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has
Features
o Pig supports many features which allow developers to perform complex data analysis without having to write Java MapReduce code:
  Joining datasets
  Grouping data
  Referring to elements by position rather than name (useful for datasets with many elements)
  And more
Page 184
PigLatin
Sample Pig Script
Then we create a new bag called tops which contains just those records where the high
portion is greater than 100
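The script shown on the slide is not reproduced in this text; a minimal sketch consistent with the description, with assumed file path and field names:

highs = LOAD 'NYSE/data/EOD2013.txt' USING PigStorage(',')
        AS (ticker:chararray, date:chararray, open:double, high:double,
            low:double, close:double, volume:long);
tops = FILTER highs BY high > 100.0;  -- keep records whose high exceeds 100
DUMP tops;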
Page 185
PigLatin
Data Types / Functions / Operations
o Data types include: DOUBLE, CHARARRAY, BYTEARRAY
o Operations include: ORDER BY, DISTINCT, JOIN, SAMPLE
Page 186
PigLatin
Using the Grunt Shell to Run PigLatin
Starting Grunt
$ pig
grunt>
Useful Commands
Page 187
PigLatin
More PigLatin
DESCRIBE bagname;
Page 189
PigLatin
FOREACH
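The slide's example is not reproduced in this text; a sketch of FOREACH ... GENERATE, reusing the assumed highs bag from the earlier sample script:

grunt> spreads = FOREACH highs GENERATE ticker, high - low AS spread;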
Page 190
PigLatin
Grouping
Grouping
o Each tuple in grpd has an element called group, and an element called bag1
o The group element has a unique value for elementX from bag1
o The bag1 element is itself a bag, containing all the tuples from bag1 with that value
for elementX
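A sketch of the statement this slide describes, using the bag and element names from the bullets:

grunt> grpd = GROUP bag1 BY elementX;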
Page 191
Pig Hands-on (Day 3)
Page 192
Pig
Overview
Pig Latin is a data flow programming language, whereas SQL is a declarative programming language. There are three ways of executing Pig programs, all of which work in both local and MapReduce modes:
o Script - Pig can run a script file that contains Pig commands (pig script.pig).
o Grunt - an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e (or -execute, for direct command execution) option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.
o Embedded - you can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
Page 193
Pig
Load and filter using Grunt
$ pig
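(Steps 1-2, loading the data, are not reproduced in this text; the load was presumably of the form below, with a hypothetical path:)
grunt> symbols = LOAD 'NYSE/data/symbols.txt';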
Step 3: Dump the data and the schema just loaded to the screen
Step 4: Retrieve the records where the first field ($0) starts with M
grunt> symbols_m = FILTER symbols BY $0 matches 'M.*';
grunt> DUMP symbols_m;
grunt> DESCRIBE symbols_m;
Page 194
Pig
Naming Fields
Step 1: Reload the data, this time specifying the field names and types.
The >> prompt indicates a line break; statements end with a semicolon (;).
Step 2: Dump the data and the schema just loaded to the screen
grunt> DUMP symbols;
grunt> DESCRIBE symbols;
Step 3: Retrieve the records where the ticker field starts with M
The fields can be referred to by name or by position: field i is $(i-1), e.g. the first field is $0.
Page 195
Pig
Joining Data
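The join statements on the slide are not reproduced in this text; a sketch with assumed bag names:

grunt> joined = JOIN symbols BY $0, highs BY ticker;
grunt> DESCRIBE joined;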
Page 196
Pig
Grouping and Aggregate Functions
Step 3: For each co-grouped record, find the maximum open price
Step 4: Sort the results (dump to view the final record set)
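Steps 1-2 are not reproduced in this text; a sketch of the whole sequence, with assumed bag and field names (eod_maxopen matches the bag split on the next slide):

grunt> grouped = GROUP eod BY ticker;
grunt> eod_maxopen = FOREACH grouped GENERATE group, MAX(eod.open);
grunt> eod_sorted = ORDER eod_maxopen BY $1 DESC;
grunt> DUMP eod_sorted;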
Page 197
Pig
Splitting Records
grunt> SPLIT eod_maxopen INTO eod_maxopen_good IF $1 >= 5.0, eod_maxopen_bad IF $1 < 5.0;
Page 198
Pig
Running Pig as Script
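The slide's command is not reproduced in this text; running a script file is simply (script name illustrative):

$ pig myscript.pig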
Page 199
Pig
Running Pig as local Script
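Local mode runs against the local filesystem instead of the cluster; the -x local flag selects it (script name illustrative):

$ pig -x local myscript.pig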
Page 200
Pig Latin
Operators
o Grouping and joining statements: GROUP, COGROUP, JOIN, CROSS
o Sampling and diagnostics: SAMPLE, ILLUSTRATE
o Macro and UDF statements: REGISTER, DEFINE, IMPORT
Page 201
Pig Latin
Commands
o rm - remove a file or directory
o rmf - remove, without reporting an error if the file does not exist
o sh - run a shell command from within Grunt
Page 202
Pig Latin
Expressions
o Arithmetic: x * y, x / y
o Boolean: not x
Page 203
Pig Latin
Data Types
Page 204
Pig Latin
Built-in Functions
o Evaluation functions: MIN, SUM, SIZE, TOBAG, TOKENIZE
o Load/store functions: BinStorage, TextLoader, JsonLoader, JsonStorage, HBaseStorage
Page 205
Hive Deep Dive - Theory (Day 3)
Page 206
Hive Introduction
o Hive was originally developed at Facebook
o Provides a SQL-like language (HiveQL)
o Can be used by people who know SQL
o Under the covers, generates MapReduce jobs that run on the Hadoop cluster
Page 207
Hive Architecture
[Diagram] Hive components: JDBC and ODBC interfaces, a Command Line Interface, a Web Interface, and a Thrift Server, all feeding the Driver (Compiler, Optimizer, Executor), which is backed by the Metastore.
Data Model
o Tables with typed columns (int, float, string, date, boolean)
o Partitions
o Buckets (hash partitions - useful for sampling and join optimization)
Metastore
o Namespace containing a set of tables
o Holds partition definitions
o Statistics
o Runs on Derby, MySQL, and many other relational databases
Page 208
Hive Metastore
Page 209
Hive Data Model and Data Types
Page 210
Hive Sample Commands
hive> CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
hive> LOAD DATA LOCAL INPATH 'input/live/partitions/file1' INTO TABLE logs PARTITION (dt='2001-10-10', country='US');
Page 211
Storage Format
Default
o Delimited text, with one row per line
OUTER JOIN / CROSS JOIN / VIEWS / GROUP BY / UDFs - all are possible with Hive
EXPLAIN
o Prefixing an entire query with EXPLAIN (EXPLAIN SELECT ...) provides details about the execution plan for the query, including the MapReduce jobs
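For example, against the eod_data table built later in this section:

hive> EXPLAIN SELECT ticker, AVG(price_close) FROM eod_data GROUP BY ticker;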
Page 212
Hive Data
Physical Layout
Limitations
No correlated subqueries
Page 213
Hive vs. Pig
Choosing between Pig and Hive
o Many organizations use both
o Pig deals better with less-structured data, so Pig is often used to manipulate the data into a more structured form, which can then be queried with Hive
Page 214
Lunch Break (Day 3)
Page 215
Hive Hands-on (Day 3)
Page 216
Hive
Overview
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. Internally, a compiler translates HiveQL statements into
a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
Originally developed by Facebook.
Supports analysis of large datasets stored in HDFS and compatible file systems such as Amazon S3 filesystem.
Provides an SQL-like language called HiveQL.
HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL (multi-table inserts, CREATE TABLE AS SELECT).
To accelerate queries, Hive provides indexes, including bitmap indexes.
Page 217
Hive
Hive File Types
Four file types are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE
Import text files compressed with Gzip or Bzip2 directly into a TEXTFILE table
Compression automatically detected and will be decompressed on-the-fly
Page 218
Hive
ORC File Type
Page 219
Hive
Loading data from File to Hive
Step 1: Copy the working data files for use with Hive.
Step 3: Create the temporary table to load the Hive data into.
hive> CREATE TABLE symbols_tmp (symbol STRING, description STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' STORED AS TEXTFILE;
Step 4: Load the data from the source file into the temporary table.
Data can also be loaded from the OS file system using the LOCAL option
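The LOAD statement itself is not reproduced in this text; it was presumably of the form below, with a hypothetical path:

hive> LOAD DATA INPATH 'NYSE/data/hive/symbols.txt' INTO TABLE symbols_tmp;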
Page 220
Hive
Loading Data from Hive table into another Hive Table
hive> CREATE TABLE symbols (symbol STRING, description STRING, exchange STRING) STORED AS
SEQUENCEFILE;
hive> INSERT OVERWRITE TABLE symbols SELECT symbol, description, 'NYSE' FROM symbols_tmp;
Page 221
Hive
Another data loading example
Steps:
hive> CREATE TABLE eod_data_tmp (ticker STRING, close_date STRING, price_open DOUBLE, price_high
DOUBLE, price_low DOUBLE, price_close DOUBLE, volume BIGINT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA INPATH 'NYSE/data/hive/EOD2013.txt' INTO TABLE eod_data_tmp;
hive> CREATE TABLE eod_data (ticker STRING, close_date timestamp, price_open DOUBLE, price_high
DOUBLE, price_low DOUBLE, price_close DOUBLE, volume BIGINT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET io.seqfile.compression.type=BLOCK;
hive> INSERT OVERWRITE TABLE eod_data SELECT ticker,
from_unixtime(unix_timestamp(CONCAT(close_date,'163000000'), 'yyyyMMddHHmmssSSS')), price_open,
price_high, price_low, price_close, volume FROM eod_data_tmp;
hive> DROP TABLE eod_data_tmp;
Page 222
Hive
Running Select Queries
hive> SELECT * FROM eod_data WHERE ticker = 'MS' AND year(close_date) = 2013 AND
month(close_date) = 10;
hive> SELECT MONTH(close_date), AVG(price_close) FROM eod_data WHERE ticker = 'MS' GROUP BY
month(close_date);
Page 223
Hive
Aggregate Function and Joins
Step 1: Find the difference between the highest and smallest close prices for all stocks in the NYSE in 2013.
(The console output shown on the slide, beginning with Stage 1 of the MapReduce job, is not reproduced here.)
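The query on the slide is not reproduced in this text; a sketch consistent with the eod_data table defined earlier:

hive> SELECT ticker, MAX(price_close) - MIN(price_close) FROM eod_data
      WHERE year(close_date) = 2013 GROUP BY ticker;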
Page 224
Hive
Aggregate Function and Joins
Page 225
Hive
Aggregate Function and Joins
Page 226
Hive
Saving Query Results
Page 227
Hive
Examining Database Results
hive> exit;
Page 228
Administration (Day 3)
Page 229
Hadoop Administration & Troubleshooting
Adding a Tasktracker
Decommissioning a Tasktracker
Adding a Datanode
Decommissioning a Datanode
Page 230
Conclusion (Day 3)
Questions?
Page 231
Appendix
o Oozie
Why Oozie
Oozie use cases
Page 238
What is Oozie?
How it works
o Oozie is a workflow engine
o Runs on a server
o Typically outside the cluster
Page 239
Oozie Workflow Basics
Workflow Overview
o Oozie workflows are written in XML
o A workflow is a collection of actions: MapReduce jobs, Pig jobs, Hive jobs, etc.
o A workflow consists of control flow nodes and action nodes
o Control flow nodes define the beginning and end of a workflow; they provide methods to determine the workflow execution path
[Diagram] Start -> job (Map Reduce / PIG / Hive) -> End on success; on failure, the workflow transitions to an error/kill path
Page 240
Oozie Workflow Sample
Workflow XML Anatomy
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">   <!-- (1) -->
    <start to='wordcount'/>                                         <!-- (2) -->
    <action name='wordcount'>
        <map-reduce>                                                <!-- (3) -->
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>                                              <!-- (4) -->
        <error to='kill'/>
    </action>
    <kill name='kill'>                                              <!-- (5) -->
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>                                               <!-- (6) -->
</workflow-app>
(1) A workflow is wrapped in a workflow-app entity.
(2) The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In this example, we tell Oozie to start by transitioning to the wordcount workflow node.
(3) The action node defines the type of job - map-reduce in this case. Within the action we define the job properties.
(4) We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node; if it fails we go to the kill node.
(5) If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.
(6) Every workflow must have an end node. This indicates that the workflow has completed successfully.
Page 241
Oozie - Other Control Nodes
Control nodes overview
o A decision control node allows Oozie to determine the workflow execution path based on some criteria
  Similar to a switch/case statement
o fork and join control nodes split one execution path into multiple execution paths which run concurrently
  fork splits the execution path
  join waits for all concurrent execution paths to complete before proceeding
  fork and join are used in pairs
Page 242
Oozie - Action Nodes
Action Nodes Overview
o java - runs the main() method in the specified Java class as a single-Mapper, Map-only job on the cluster
Page 243