Flink

A Data Streaming Architecture with Apache Flink
Robert Metzger
@rmetzger_
rmetzger@apache.org
Berlin Buzzwords, June 7, 2016
Talk overview
My take on the stream processing space, and how it changes the way we think about data
Transforming an existing data analysis pattern into the streaming world (Streaming ETL)
Demo
Apache Flink
Apache Flink is an open source stream processing framework
Low latency
High throughput
Stateful
Distributed
Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production
Entering the streaming era
Streaming is the biggest change in data infrastructure since Hadoop
1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch
Real-world data is produced in a continuous fashion.
New systems like Flink and Kafka embrace the streaming nature of data.
[Diagram: web server, Kafka topic, stream processor]
Apache Flink stack
[Diagram: layers of the Flink stack, top to bottom]
On DataStream (Java / Scala): CEP, Table / StreamSQL, Apache Beam, Storm API, SAMOA
On DataSet (Java/Scala): Hadoop M/R, Table / SQL, Gelly, ML, Apache Beam, Cascading, Zeppelin
Streaming dataflow runtime
Deployment: Local, Cluster, YARN
What makes Flink flink?
True Streaming: Low latency, High throughput, Well-behaved flow control (back pressure)
Event Time: Make more sense of data, Works on real-time and historic data
Stateful Streaming: Windows & user-defined state, Exactly-once semantics for fault tolerance, Globally consistent savepoints
APIs, Libraries: Complex Event Processing, Flexible windows (time, count, session, roll your own)
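Of the window types named on this slide, session windows are probably the least obvious: a session collects events that follow each other within a chosen inactivity gap. The following is a minimal sketch in plain Python, not Flink API code. The function name `session_windows`, the sample timestamps, and the 30-second gap are assumptions made for illustration.

```python
from typing import Iterable, List


def session_windows(timestamps: Iterable[int], gap: int) -> List[List[int]]:
    """Group event timestamps into sessions.

    A new session starts whenever the distance to the previous event
    is at least `gap` (same unit as the timestamps).
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # gap reached or first event: new session
    return sessions


if __name__ == "__main__":
    # Hypothetical click timestamps of one user, in seconds.
    clicks = [0, 10, 25, 100, 110, 300]
    print(session_windows(clicks, gap=30))
```

The sketch sorts a finite list up front. A stream processor cannot do that, so it has to keep the open sessions as state and merge them as events arrive.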
Moving existing (batch) data analysis into streaming
Extract, Transform, Load (ETL)
ETL: Move data from A to B and transform it on the way
Old approach:
[Diagram: batch ETL pipeline, shown in four incremental builds]
Sources: Server logs, Mobile, IoT
Tier 0: Raw data (Data Lake in HDFS / S3)
Periodic jobs
Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), read by users
Periodic jobs
Tier 2: Aggregated data (Data Warehouse), read by users
Extract, Transform, Load (Streaming ETL)
[Diagram: streaming approach, shown in four incremental builds]
Sources: Server logs, Mobile, IoT
Tier 0: Raw streaming data (Data Lake)
Stream Processor: Kafka connector, Cleansing, Transformation, Alerts, TimeWindow
Tier 1: Normalized, cleansed data (ES connector, Rolling file sink to Parquet / ORC in HDFS), read by users and batch processing
Tier 2: Aggregated data (TimeWindow, JDBC sink, Cassandra sink), read by users
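The cleansing and alerting steps of the stream processor can be sketched in a few lines. This is plain Python rather than Flink code, under stated assumptions: the record fields `user` and `status`, the rule that a status of 500 or above raises an alert, and the function names are all hypothetical. The point is only that each record is handled as it arrives instead of waiting for a periodic job.

```python
from typing import Any, Callable, Dict, Iterable, Optional

Event = Dict[str, Any]


def cleanse(raw: Event) -> Optional[Event]:
    """Return a normalized event, or None if the record is unusable."""
    user = raw.get("user")
    status = raw.get("status")
    if not isinstance(user, str) or not isinstance(status, int):
        return None
    return {"user": user.strip().lower(), "status": status}


def streaming_etl(source: Iterable[Event],
                  tier1_sink: Callable[[Event], None],
                  alert_sink: Callable[[Event], None]) -> None:
    """Process each event as soon as it arrives; no periodic batch job."""
    for raw in source:
        event = cleanse(raw)
        if event is None:
            continue                  # cleansing: drop malformed records
        if event["status"] >= 500:
            alert_sink(event)         # alerting (hypothetical rule)
        tier1_sink(event)             # normalized, cleansed data (tier 1)


if __name__ == "__main__":
    tier1, alerts = [], []
    logs = [
        {"user": " Alice ", "status": 200},
        {"user": None, "status": 200},      # malformed: dropped
        {"user": "BOB", "status": 503},     # triggers an alert
    ]
    streaming_etl(logs, tier1.append, alerts.append)
    print(tier1)
    print(alerts)
```

Passing the sinks in as callables mirrors the diagram, where the same cleansed stream can feed more than one sink.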
Streaming ETL: Low Latency
Events are processed immediately
No need to wait for the next batch load job to run
[Table: latency by approach. Approaches: periodic batch job, batch processor with micro-batches, stream processor. Latency scale: hours, minutes, seconds, milliseconds. Annotations: "Less than 500 ms*" and "Less than 250 ms*"]
* Your mileage may vary. These are rule-of-thumb estimates.
Streaming ETL: Event-time aware
Events derived from the same real-world activity might arrive out of order in the system
Flink is event-time aware
[Diagram: events from the same real-world activity, timestamped 11:28 and 11:29, arriving out of order]
Causes: Out-of-sync clocks, Network delays, Machine failures
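A small sketch can show why event time matters when events arrive out of order. It is plain Python, not Flink code; the function names, the one-minute window size, and the sample events are assumptions for illustration. Each event is assigned to a window by the timestamp it carries, so the arrival order does not change the result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def seconds(hour: int, minute: int, second: int) -> int:
    """Seconds since midnight, used here as a simple event timestamp."""
    return hour * 3600 + minute * 60 + second


def count_per_event_time_window(events: Iterable[Tuple[int, str]],
                                window_size: int) -> Dict[int, int]:
    """Count events per tumbling window, keyed by window start.

    Each event carries its own timestamp (event time), so the result
    does not depend on the order in which events arrive.
    """
    counts: Dict[int, int] = defaultdict(int)
    for event_time, _payload in events:
        window_start = event_time - (event_time % window_size)
        counts[window_start] += 1
    return dict(counts)


if __name__ == "__main__":
    # Arrival order differs from event-time order: the third event is late.
    arrived = [
        (seconds(11, 28, 5), "a"),
        (seconds(11, 29, 10), "b"),
        (seconds(11, 28, 50), "c"),
        (seconds(11, 29, 30), "d"),
    ]
    print(count_per_event_time_window(arrived, window_size=60))
```

The sketch works on a finite list and so never has to decide when a window is complete. A real stream processor does have to decide that; Flink uses watermarks for this purpose.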
Demo
Job Overview
Data Ingestion Job: Flink Twitter Source
Streaming ETL Job
Job Overview
Filter operation
(Rolling) file sink
Aggregation to ElasticSearch
Streaming WordCount
TopN operator
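The word count and TopN steps can be illustrated with a short sketch. It is plain Python and not taken from the demo repository; the function name `top_n_words` and the sample messages are assumptions. It counts words over one finite batch of messages, which stands in for the contents of a single window.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_n_words(messages: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Count words across messages and return the n most frequent ones."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(message.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Hypothetical messages standing in for one window of tweets.
    tweets = ["Flink streams data", "Kafka streams data", "Flink rocks"]
    print(top_n_words(tweets, n=2))
```

Breaking ties alphabetically is a design choice made here to keep the output stable between runs.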
Demo code @ GitHub
https://github.com/rmetzger/flink-streaming-etl
Closing
https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets25580481910
Flink Forward 2016, Berlin

Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
Questions?
Ask now!
Email: rmetzger@apache.org
Twitter: @rmetzger_
Follow: @ApacheFlink
Read: flink.apache.org/blog, data-artisans.com/blog/
Mailing lists: (news | user | dev)@flink.apache.org
Appendix
Sources
Large scale ETL with Hadoop: http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop