Architectural Considerations for Hadoop Applications
Strata + Hadoop World Barcelona, November 19th, 2014
tiny.cloudera.com/app-arch-slides
Jonathan Seidman
Partner Enablement at Cloudera
Previously, Technical Lead on the big data team at Orbitz Worldwide
Co-founder of the Chicago Hadoop User Group and Chicago Big Data

Ted Malaska
Principal Solutions Architect at Cloudera
Previously, lead architect at FINRA
Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
Mark Grover
Software Engineer at Cloudera
Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
Logistics
Break from 3:30 to 4:00 PM
Questions at the end of each section
Case Study
Clickstream Analysis
[Series of image slides: "Analytics"]
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
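A rough Scala sketch of parsing one such combined-log-format line (the regex and field names are ours, not from the deck; assumes a spark-shell-style environment):

    // Parse a combined-log-format line into its fields (illustrative regex)
    val logPattern =
      """^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)".*$""".r

    val line =
      """244.157.45.12 - - [17/Oct/2014:21:08:30] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh)" """

    line match {
      case logPattern(ip, timestamp, request, status, bytes, referrer, userAgent) =>
        // e.g., pull the requested path out of "GET /seatposts HTTP/1.0"
        println(s"$ip requested ${request.split(" ")(1)} at $timestamp")
      case _ =>
        // malformed record: skip it (filtering is covered later in this deck)
    }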
Similar use cases
Sensors: heart monitors, agriculture, etc.
Casinos: a single person's session at a table
Pre-Hadoop Architecture
Clickstream Analysis
Web logs (full fidelity, kept 2 weeks) → Transform/Aggregate → Data Warehouse → Business Intelligence
Older data → Tape Archive
Why Is Hadoop a Great Fit?
Clickstream Analysis
Web logs → Hadoop → Business Intelligence
Case Study Requirements
Overview of Requirements
Data Sources → Ingestion → Raw Data Storage (Formats, Schema) → Processing → Processed Data Storage (Formats, Schema) → Data Consumption
Orchestration (Scheduling, Managing, Monitoring) spans the entire pipeline.
Case Study Requirements
Data Ingestion
Web Servers (many) → Logs → Hadoop
CRM Data → Hadoop
ODS → Hadoop
Case Study Requirements
Data Storage
Compress the data
Case Study Requirements
Data Processing
Processing requirements
Be able to answer questions like:
What is my website's bounce rate? (i.e., what percentage of visitors don't go past the landing page?)
Which marketing channels are leading to the most sessions?
Do attribution analysis: which channels are responsible for the most conversions?
Sessionization
A gap of > 30 minutes between clicks in a website visit starts a new session (e.g., Visitor 1 → Session 1, Session 2; Visitor 2 → Session 1).
Case Study Requirements
Orchestration
Orchestration is simple
We just need to execute actions one after another.
Actually, we also need to handle errors and user notifications.
And:
Restart workflows after errors
Reuse of actions in multiple workflows
Complex workflows with decision points
Trigger actions based on events
Tracking metadata
Integration with enterprise software
Data lifecycle management
Data quality control
Reports
Architectural Considerations
Data Modeling
Architectural Considerations
Data Modeling: Storage Layer
[Image slides contrasting storage-layer options: fast scans vs. slow scans]
Architectural Considerations
Data Modeling: Raw Data Storage
Raw Data → Processed Data
Logs (plain text)
[Image slide: many "Logs (plain text)" boxes, one per server]
[Image slides comparing compression codecs, e.g., Snappy: splittability vs. speed trade-offs]
SequenceFile
Stores records as binary key/value pairs.
SequenceFile blocks can be compressed.
This enables splittability with non-splittable compression.
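As an illustration (ours, not the deck's), Spark can write such a file directly from an RDD of key/value pairs; the path is made up:

    // Minimal sketch: save an RDD of key/value pairs as a SequenceFile
    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    pairs.saveAsSequenceFile("hdfs:///tmp/seqfile-demo")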
Avro
Kind of like SequenceFile on steroids.
Self-documenting: stores the schema in the file header.
Provides very efficient storage.
Supports splittable compression.
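For a flavor of that self-documenting schema, a hedged sketch of what an Avro schema for a click record might look like (record and field names are ours):

    {
      "type": "record",
      "name": "ClickRecord",
      "namespace": "com.example.clickstream",
      "fields": [
        {"name": "ip", "type": "string"},
        {"name": "epoch_ts", "type": "long"},
        {"name": "request", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": null},
        {"name": "user_agent", "type": ["null", "string"], "default": null}
      ]
    }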
But Note
For simplicity, we'll use plain text for raw data in our example.
Architectural Considerations
Data Modeling: Processed Data Storage
Raw Data → Processed Data
Column A | Column B | Column C | Column D
Value    | Value    | Value    | Value
Value    | Value    | Value    | Value
Value    | Value    | Value    | Value
Value    | Value    | Value    | Value
Columnar Formats
Eliminate I/O for columns that are not part of a query.
Work well for queries that access a subset of columns.
Often provide better compression.
These add up to dramatically improved performance for many queries.
Row-oriented layout:    2014-10-13 abc | 2014-10-14 def | 2014-10-15 ghi
Column-oriented layout: 2014-10-13 2014-10-14 2014-10-15 | abc def ghi
Architectural Considerations
Data Modeling: Schema Design
Partitioning
Split the dataset into smaller consumable chunks.
Rudimentary form of indexing. Reduces I/O needed to process queries.
Partitioning
Un-partitioned HDFS directory structure:
  dataset/file1.txt
  dataset/file2.txt
  dataset/filen.txt
Partitioned HDFS directory structure:
  dataset/col=val1/file.txt
  dataset/col=val2/file.txt
  dataset/col=valn/file.txt
Partitioning considerations
What column to partition by?
Don't have too many partitions (< 10,000)
Don't have too many small files in the partitions
Good to have partition sizes of at least ~1 GB, generally a multiple of the block size.
We'll partition by timestamp. This applies to both our raw and processed data.
Raw dataset:
/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
Processed dataset:
/data/bikeshop/clickstream/year=2014/month=10/day=10
Architectural Considerations
Data Ingestion
Flume:
Many available data collection sources
Well integrated into Hadoop
Supports file transformations
Can implement complex topologies
Very low latency
No programming required
Kafka is:
Very high-throughput publish-subscribe messaging
Highly available
Stores data and can replay
Can support many consumers with no extra latency
Kafka is awesome.
We heard it cures cancer.
In Our Example
We want to ingest events from log files.
Flume's Spooling Directory source fits, with the HDFS Sink.
We would have used Kafka if we wanted the data in non-Hadoop systems too.
Flume Agent: Sources → Interceptors → Selectors (e.g., routing to DR or critical flows) → Channels → Sinks
Configuration
Declarative: no coding required.
Configuration specifies how components are wired together.
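For example, a minimal sketch of an agent configuration for our use case (agent and component names, the spool directory, and the exact set of properties shown are illustrative):

    # One agent: spooling-directory source -> file channel -> HDFS sink
    agent1.sources = spool-src
    agent1.channels = file-ch
    agent1.sinks = hdfs-sink

    # Watch a spool directory for completed web log files
    agent1.sources.spool-src.type = spooldir
    agent1.sources.spool-src.spoolDir = /var/log/weblogs/spool
    agent1.sources.spool-src.interceptors = ts
    agent1.sources.spool-src.interceptors.ts.type = timestamp
    agent1.sources.spool-src.channels = file-ch

    # Durable file channel
    agent1.channels.file-ch.type = file

    # Write into our time-partitioned raw data directory
    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.hdfs.path = /etl/BI/casualcyclist/clicks/rawlogs/year=%Y/month=%m/day=%d
    agent1.sinks.hdfs-sink.channel = file-ch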
Interceptors
Mask fields
Validate information
Flume client agents run on each of our servers.
These client agents send data to multiple agents to provide reliability.
Flume provides support for load balancing.
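A hedged sketch of what that load balancing looks like in the client agent's configuration (sink and group names are illustrative):

    # Load-balance events across two Avro sinks pointing at different collector agents
    agent1.sinks = avro-sink1 avro-sink2
    agent1.sinkgroups = g1
    agent1.sinkgroups.g1.sinks = avro-sink1 avro-sink2
    agent1.sinkgroups.g1.processor.type = load_balance
    agent1.sinkgroups.g1.processor.backoff = true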
Interceptors also let us do simple processing on ingest.
For example:
Client-tier Flume Agent: Web Logs → Spooling Dir Source → Timestamp Interceptor → File Channel → Avro Sinks (×3)
Collector-tier Flume Agent: Avro Source → File Channel → HDFS Sinks (×2) → HDFS
With Kafka, we could also feed stream processing in Spark Streaming, another consumer for alerting, and maybe a batch consumer too.
Producers (A, B, C) → Kafka → Flume Agent (Channels → Sinks)
Flume Agent (Sources → Interceptors → Channels) → Kafka → Consumers (A, B, C)
Flume Agent: Sources → Interceptors → Selectors (DR, critical) → Kafka as the channel → Sinks
Architectural Considerations
Data Processing: Engines
Processing Engines
MapReduce
Abstractions
Spark
Spark Streaming
Impala
MapReduce
Oldie but goody
Restrictive Framework / Innovated Workarounds
Extreme Batch
MapReduce data flow: Block of Data → Temp Spill Data → Partitioned, Sorted Data → Reducer Local Copy → Reducer → Output File → HDFS (Replicated)
MapReduce Innovation
Mapper Memory Joins
Reducer Memory Joins
Bucketed Sorted Joins
Cross-Task Communication
Windowing
And Much More
Abstractions
SQL: Hive
Script/Code: Pig (Pig Latin), Crunch (Java/Scala), Cascading (Java/Scala)
Spark
The New Kid that isn't that New Anymore
Easily 10x less code
Extremely Easy and Powerful API
Very good for machine learning
Scala, Java, and Python
RDDs
DAG Engine
Spark - DAG
Two input branches, each: TextFile → Filter → KeyBy; the branches feed a Join, followed by Take.
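A rough Scala sketch of a job with this DAG (paths and parsing are made up for illustration; assumes a SparkContext sc, e.g. in spark-shell):

    // Branch 1: click records keyed by IP
    val clicks = sc.textFile("hdfs:///data/clicks")
      .filter(line => line.nonEmpty)          // Filter
      .keyBy(line => line.split(" ")(0))      // KeyBy

    // Branch 2: user records keyed by IP
    val users = sc.textFile("hdfs:///data/users")
      .filter(line => line.nonEmpty)          // Filter
      .keyBy(line => line.split(",")(0))      // KeyBy

    // Join the two branches, then Take a sample of results
    clicks.join(users).take(10).foreach(println)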
Spark - DAG (failure recovery)
If a block is lost, Spark replays only the lineage needed to recompute it: unaffected partitions stay "Good", the lost block's ancestry is replayed ("Good-Replay"), and downstream stages (Join → Filter → Take) that depend on it wait as "Future".
Spark Streaming
Calling Spark in a Loop
Extends RDDs with DStream
Very Little Code Changes from ETL to Streaming
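A minimal Scala sketch of that similarity (the socket source, 10-second batch interval, and filter predicate are our assumptions):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Wrap the existing SparkContext; each batch covers 10 seconds of data
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same filter/count logic a batch ETL job would use, run once per batch
    lines.filter(_.contains("GET")).count().print()

    ssc.start()
    ssc.awaitTermination()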
Spark Streaming
Before the first batch, a Receiver attached to the Source starts collecting data into an RDD. Each completed batch (first, second, ...) is then processed in a single pass (e.g., Filter → Count) while the Receiver builds the RDD for the next batch.
Spark Streaming (stateful)
Same flow, but each batch's single pass (Filter → Count) also folds results into a stateful RDD: the first batch produces Stateful RDD 1, which combines with the second batch to produce Stateful RDD 2, and so on.
Impala
MPP Style SQL Engine on top of Hadoop
Very Fast
High Concurrency
Analytical windowing functions (CDH 5.2)
Impala broadcast join: every Impala Daemon holds a 100% cached copy of the smaller table and produces its share of the output locally.
Impala partitioned hash join: each Impala Daemon caches ~33% of the smaller table; hash partitioners route bigger-table data blocks to the daemon holding the matching slice, where a hash join function produces the output.
Impala vs Hive
Very different approaches, and we may see convergence at some point.
But for now:
Impala for speed
Hive for batch
Architectural Considerations
Data Processing: Patterns and Recommendations
Sessionization
A gap of > 30 minutes between clicks in a website visit starts a new session (e.g., Visitor 1 → Session 1, Session 2; Visitor 2 → Session 1).
Why sessionize?
Helps answer questions like:
What is my website's bounce rate? (i.e., what percentage of visitors don't go past the landing page?)
Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
Do attribution analysis: which channels are responsible for the most conversions?
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36
244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/
1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5;
en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0
Mobile Safari/533.1 244.157.45.12+1413583199
116
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user.
2. Within each user, group clicks separated by less than 30 minutes into a single session.
Options for identifying a user:
IP address (244.157.45.12)
Cookies (A9A3BECE0563982D)
IP address (244.157.45.12) and user agent string ("(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
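A rough Scala/Spark sketch of the sessionization step (the 30-minute cutoff matches the definition above; the parsing into (userId, epochSeconds) pairs and all names are ours):

    val sessionTimeout = 30 * 60  // seconds

    // clicks: RDD[(String, Long)] of (userId, epochSeconds), e.g. keyed by IP
    val sessionized = clicks
      .groupByKey()                      // all of one visitor's clicks together
      .flatMapValues { stamps =>
        var sessionStart = -1L
        var prev = -1L
        stamps.toSeq.sorted.map { ts =>
          // a gap over the timeout starts a new session
          if (prev < 0 || ts - prev > sessionTimeout) sessionStart = ts
          prev = ts
          (sessionStart, ts)             // session id = user id + session start time
        }
      }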
Filtering recommendation
Bot/spider filtering can be done easily in any of the engines.
Incomplete records are harder to filter in schema-based systems like Hive, Impala, Pig, etc.
Flume interceptors can also be used.
Pretty close choice between MR, Hive and Spark.
Can be done in Spark using rdd.filter() (see the sketch below).
We can simply embed this in our MR sessionization job.
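For instance (the predicate is a stand-in, not the deck's actual rules; rawLogs is assumed to be an RDD[String] of log lines):

    // Keep records that parse cleanly and don't look like bot/spider traffic
    val cleaned = rawLogs.filter { line =>
      line.split(" ").length >= 9 && !line.toLowerCase.contains("bot")
    }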
Deduplication recommendation
Can be done in all engines.
We already have a Hive table with all the columns, so a simple DISTINCT query works.
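The Spark equivalent of that DISTINCT query is a one-liner (sketch, assuming records are exact duplicates and cleaned is the filtered RDD from above):

    // Drop exact duplicate records before sessionizing
    val deduped = cleaned.distinct()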
End-to-end processing: Deduplication → Filtering → Sessionization → BI tools
Architectural Considerations
Orchestration
Orchestrating Clickstream
Data arrives through Flume.
Triggers a processing event:
Sessionize
Enrich: location, marketing channel
Store as Parquet
Each day we process events from the previous day.
Choosing the Right Tool
Workflow is fairly simple
Need to trigger workflow based on data
Be able to recover from errors
Perhaps notify on the status
And collect metrics for reporting
Oozie or Azkaban?
Oozie Architecture
Oozie features
Part of all major Hadoop distributions
Hue integration
Built-in actions: Hive, Sqoop, MapReduce, SSH
Complex workflows with decisions
Event- and time-based scheduling
Notifications
SLA monitoring
REST API
Oozie Drawbacks
Overhead in launching jobs
Steep learning curve
XML Workflows
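For a flavor of those XML workflows, a minimal hedged sketch of a workflow.xml with a single action (the action name and mapper class are illustrative):

    <workflow-app name="clickstream-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="sessionize"/>
      <action name="sessionize">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>com.example.SessionizeMapper</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Sessionization failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>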
Azkaban Architecture
[Architecture diagram: Client, Hadoop, MySQL]
Azkaban features
Simplicity
Great UI, including pluggable visualizers
Lots of plugins: Hive, Pig
Reporting plugin
Azkaban Limitations
Doesn't support workflow decisions
Can't represent data dependencies
Choosing
Workflow is fairly simple
Need to trigger workflow based on data
Be able to recover from errors
Perhaps notify on the status
And collect metrics for reporting
All of this is easier in Oozie.
Putting It All Together
Final Architecture
Final architecture, end to end:
Data Sources → Ingestion → Raw Data Storage (Formats, Schema) → Processing → Processed Data Storage (Formats, Schema) → Data Consumption
Orchestration (Scheduling, Managing, Monitoring) spans the entire pipeline.
Final architecture: Ingestion and Raw Data Storage
Flume Agents (×4) → HDFS
/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
Final architecture: Processing and Processed Data Storage
/etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
→ dedup → filtering → sessionization → parquetize →
/data/bikeshop/clickstream/year=2014/month=10/day=10
Final architecture: Data Consumption
Hive/Impala (SQL access for BI)
Sqoop (DB import tool) → DWH
Local Disk → R, etc.
Demo
Stay in touch!
@hadooparchbook
hadooparchitecturebook.com
slideshare.com/hadooparchbook
Visit us at Booth #408
Book signings, theater sessions, technical demos, giveaways
Highlights:
Hear what's new with 5.2, including Impala 2.0
Learn how Cloudera is setting the standard for Hadoop in the Cloud
Thank you