Architecture
with Apache Flink
Robert Metzger
@rmetzger_
rmetzger@apache.org
Berlin Buzzwords,
June 7, 2016
Talk overview
My take on the stream processing space, and how it changes the way we think about data
Transforming an existing data analysis pattern into the streaming world (Streaming ETL)
Demo
Apache Flink
Apache Flink is an open source stream processing framework
Low latency
High throughput
Stateful
Distributed
Diagram: Kafka topic, Stream processor
Flink stack (diagram):
APIs: DataStream (Java / Scala), DataSet (Java/Scala)
Libraries and integrations: CEP, Table / StreamSQL, Table / SQL, Apache Beam, Storm API, SAMOA, Hadoop M/R, Gelly, ML, Cascading, Zeppelin
Deployment: Local, Cluster, YARN
Windows & user-defined state
Exactly-once semantics for fault tolerance
True Streaming
Event Time
Stateful Streaming
APIs
Libraries
Globally consistent savepoints
Flexible windows (time, count, session, roll-your-own)
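As a rough illustration of the window types listed above, here is a plain-Python sketch that assigns timestamped events to time, count, and session windows. It does not use the Flink API; the function names `tumbling_windows`, `count_windows`, and `session_windows` and the sample events are hypothetical, and real Flink windows have richer semantics (triggers, evictors, per-key state).

```python
from collections import defaultdict

def tumbling_windows(events, size):
    """Group (timestamp, value) pairs into fixed, non-overlapping time windows."""
    windows = defaultdict(list)
    for ts, value in events:
        start = ts - (ts % size)  # start of the window this timestamp falls into
        windows[start].append(value)
    return dict(windows)

def count_windows(events, n):
    """Group events into consecutive windows of n elements.

    Assumption for this sketch: a trailing partial window is also returned.
    """
    values = [value for _, value in events]
    return [values[i:i + n] for i in range(0, len(values), n)]

def session_windows(events, gap):
    """Group events into sessions; a new session starts whenever the distance
    between consecutive timestamps is larger than `gap`."""
    sessions = []
    last_ts = None
    for ts, value in sorted(events):
        if last_ts is None or ts - last_ts > gap:
            sessions.append([])
        sessions[-1].append(value)
        last_ts = ts
    return sessions

# Hypothetical sample data: (timestamp, value)
events = [(1, "a"), (2, "b"), (7, "c"), (12, "d"), (13, "e")]
by_time = tumbling_windows(events, size=5)
by_count = count_windows(events, n=2)
by_session = session_windows(events, gap=3)
```

The same five events end up grouped differently depending on the window type, which is the point of "flexible windows": the grouping rule is a choice of the job, not of the engine.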
Diagram (built up over three slides): batch ETL architecture
Sources: Mobile, IoT, Server Logs
Data Lake (HDFS / S3)
Periodic jobs
Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), User
Periodic jobs
Tier 2: Aggregated data (Parquet / ORC in HDFS), User
Periodic jobs
Data Warehouse, User
Diagram (built up over several slides): streaming architecture
Sources: Mobile, IoT
Data Lake
Operators: Cleansing, Transformation, TimeWindow, Alerts
Sinks: ES Connector, Rolling file sink, JDBC sink, Cassandra sink
Parquet / ORC in HDFS, Batch Processing
User
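The operator chain in the diagram (cleansing, transformation, alerts) can be sketched as chained Python generators, which process one record at a time much like a streaming job. This is only an illustration under stated assumptions: the record fields `device` and `temp_c`, the threshold, and all function names are hypothetical, and no Flink API is used.

```python
def cleanse(records):
    """Drop malformed records (missing device id or non-numeric reading)."""
    for rec in records:
        if "device" in rec and isinstance(rec.get("temp_c"), (int, float)):
            yield rec

def transform(records):
    """Normalize each record: tidy the device id, add a Fahrenheit reading."""
    for rec in records:
        yield {
            "device": rec["device"].strip().lower(),
            "temp_c": rec["temp_c"],
            "temp_f": rec["temp_c"] * 9 / 5 + 32,
        }

def alerts(records, threshold_c):
    """Emit an alert message for every reading above the threshold."""
    for rec in records:
        if rec["temp_c"] > threshold_c:
            yield f"ALERT {rec['device']}: {rec['temp_c']} C"

# Hypothetical raw input, including two records that cleansing should drop.
raw = [
    {"device": " Sensor-1 ", "temp_c": 20.0},
    {"device": "sensor-2"},                    # missing reading
    {"device": "SENSOR-3", "temp_c": "n/a"},   # non-numeric reading
    {"device": "Sensor-4", "temp_c": 90},
]
cleaned = list(transform(cleanse(raw)))
fired = list(alerts(cleaned, threshold_c=80))
```

In a real deployment the cleaned stream would go to the sinks named in the diagram (for example a rolling file sink or an Elasticsearch connector) instead of being collected into a list.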
Latency scale (diagram): minutes, seconds, milliseconds
Batch processor with micro-batches
Stream processor
Annotations: Less than 500 ms*, Less than 250 ms*
* Your mileage may vary. These are rule-of-thumb estimates.
Diagram: events from the same real-world activity carry different timestamps (11:29, 11:29, 11:28, 11:29)
Causes shown: Out of sync clocks, Network delays, Machine failures
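To make the out-of-order problem concrete, the sketch below groups events by the time they happened (event time) rather than the time they arrived. It is a simplified plain-Python model, not Flink's implementation: the function `event_time_windows`, the `max_delay` parameter, and the sample stream are assumptions for illustration. Timestamps are minutes since midnight, so 688 stands for 11:28 and 689 for 11:29.

```python
from collections import defaultdict

def event_time_windows(stream, size, max_delay):
    """Yield (window_start, values) once the watermark passes the window end.

    The watermark trails the highest event timestamp seen so far by
    `max_delay`, which gives delayed events a chance to arrive.
    """
    open_windows = defaultdict(list)
    watermark = float("-inf")
    for ts, value in stream:
        start = ts - (ts % size)
        if start + size <= watermark:
            continue  # too late: this event's window was already emitted
        open_windows[start].append(value)
        watermark = max(watermark, ts - max_delay)
        # Emit every window whose end is at or before the watermark.
        for w in sorted(s for s in open_windows if s + size <= watermark):
            yield w, open_windows.pop(w)
    # End of stream: flush the windows that are still open.
    for w in sorted(open_windows):
        yield w, open_windows.pop(w)

# Hypothetical arrival order; "c" (11:28) arrives after two 11:29 events,
# and "f" (11:28) arrives after its window has already been emitted.
stream = [(689, "a"), (689, "b"), (688, "c"), (689, "d"), (691, "e"), (688, "f")]
result = list(event_time_windows(stream, size=1, max_delay=1))
```

Event "c" still lands in the 11:28 window despite arriving late, while "f" arrives after the watermark has moved past its window and is dropped. How long to wait (`max_delay` here) is a trade-off between latency and completeness.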
Demo
Job Overview
Streaming ETL Job
Job Overview
Filter operation
(Rolling) file sink
Aggregation to ElasticSearch
Streaming WordCount
TopN operator
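The WordCount and TopN steps named above can be approximated for a single window of text lines as follows. This is not the demo's code; the helper names `word_count` and `top_n`, the tie-breaking rule, and the sample lines are assumptions made for this sketch.

```python
from collections import Counter

def word_count(lines):
    """Count lower-cased, whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def top_n(counts, n):
    """Return the n most frequent words; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical contents of one window.
window = ["to be or not to be", "to stream or to batch"]
top = top_n(word_count(window), n=3)
```

In a streaming job the same two steps would run once per window, so the top-N list is refreshed continuously rather than computed once over a full data set.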
Closing
https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets25580481910
We are hiring!
data-artisans.com/careers
Questions?
Ask now!
Email: rmetzger@apache.org
Twitter: @rmetzger_
Follow: @ApacheFlink
Read: flink.apache.org/blog, data-artisans.com/blog/
Mailing lists: (news | user | dev)@flink.apache.org
Appendix
Sources
Large scale ETL with Hadoop: http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop