A Data Streaming Architecture with Apache Flink

Robert Metzger
@rmetzger_
rmetzger@apache.org

Berlin Buzzwords, June 7, 2016

Talk overview
My take on the stream processing space, and how it changes the way we think about data
Transforming an existing data analysis pattern into the streaming world (Streaming ETL)
Demo

Apache Flink
Apache Flink is an open source stream processing framework

Low latency
High throughput
Stateful
Distributed

Developed at the Apache Software Foundation
1.0.0 released in March 2016, used in production

Entering the streaming era

Streaming is the biggest change in data infrastructure since Hadoop

1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch

Real-world data is produced in a continuous fashion.
New systems like Flink and Kafka embrace the streaming nature of data.
Web server → Kafka topic → Stream processor

Apache Flink stack
Libraries on DataStream: CEP, Table / StreamSQL, Apache Beam, Storm API, SAMOA
Libraries on DataSet: Hadoop M/R, Table / SQL, Gelly, ML, Apache Beam, Cascading, Zeppelin
APIs: DataStream (Java / Scala), DataSet (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Cluster, YARN

What makes Flink flink?
True Streaming: low latency, high throughput, well-behaved flow control (back pressure)
Stateful Streaming: windows & user-defined state, exactly-once semantics for fault tolerance, globally consistent savepoints
Event Time: make more sense of data, works on real-time and historic data
APIs and Libraries: Complex Event Processing, flexible windows (time, count, session, roll-your-own)
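The session window named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not Flink's API: the function name `session_windows` and the `gap` parameter are assumptions made for this example.

```python
from typing import Iterable, List


def session_windows(timestamps: Iterable[float], gap: float) -> List[List[float]]:
    """Group event timestamps into sessions.

    A new session starts whenever the pause since the previous event
    is longer than `gap` (same unit as the timestamps).
    """
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: open a new session
    return sessions


# Three bursts of activity separated by pauses longer than 5 time units.
print(session_windows([1, 2, 3, 10, 11, 30], gap=5))
```

Unlike a time or count window, the window boundaries here depend on the data itself, which is why session windows need state per key in a real stream processor.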

Moving existing (batch) data analysis into streaming

Extract, Transform, Load (ETL)
ETL: Move data from A to B and transform it on the way
Old approach:
Sources: server logs, mobile, IoT
Tier 0: Raw data (data lake in HDFS / S3)
Periodic jobs
Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), available to users
Periodic jobs
Tier 2: Aggregated data (data warehouse), available to users

Extract, Transform, Load (Streaming ETL)
ETL: Move data from A to B and transform it on the way
Streaming approach:
Sources: server logs, mobile, IoT
Tier 0: Raw data (data lake), read by the stream processor through a Kafka connector
Stream processor: cleansing, transformation, alerts, TimeWindow
Tier 1: Normalized, cleansed data, written through an ES connector or a rolling file sink (Parquet / ORC in HDFS), available to users and to batch processing
Tier 2: Aggregated data, computed with TimeWindow and written through a JDBC sink or Cassandra sink, available to users
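The three tiers can be sketched as a chain of small steps. This is plain Python, not Flink code, and everything in it is an assumption made for illustration: the `timestamp,user,status` input format, the function names, and the rule that a status of 500 or higher raises an alert.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[int, str, int]  # (timestamp in seconds, user, status code)


def cleanse(raw_lines: Iterable[str]) -> Iterator[Record]:
    """Tier 0 -> Tier 1: parse 'timestamp,user,status' lines, drop malformed ones."""
    for line in raw_lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            yield int(parts[0]), parts[1].lower(), int(parts[2])
        except ValueError:
            continue  # non-numeric timestamp or status


def alerts(records: Iterable[Record]) -> Iterator[Record]:
    """Side output: records whose status signals a server error (assumed rule)."""
    return (r for r in records if r[2] >= 500)


def aggregate(records: Iterable[Record], window_size: int) -> Dict[Tuple[int, int], int]:
    """Tier 1 -> Tier 2: count records per (window start, status)."""
    counts: Counter = Counter()
    for ts, _user, status in records:
        counts[(ts - ts % window_size, status)] += 1
    return dict(counts)


raw = ["0,Alice,200", "30,bob,500", "garbage", "65,ALICE,200"]
tier1 = list(cleanse(raw))
print(tier1)
print(list(alerts(tier1)))
print(aggregate(tier1, window_size=60))
```

The sketch collects all counts in memory before returning them. A stream processor would instead emit each window's result once that window closes and write it to a sink.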

Streaming ETL: Low Latency


Events are processed immediately
No need to wait until the next batch load job runs
Approach: Periodic batch job. Latency: hours
Approach: Batch processor with micro-batches. Latency: minutes to seconds
Approach: Stream processor. Latency: milliseconds
Slide annotations: Less than 500 ms*, Less than 250 ms*
* Your mileage may vary. These are rule of thumb estimates.
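The waiting time of a periodic batch load can be estimated with simple arithmetic. The function below is a sketch under stated assumptions: load jobs start at fixed multiples of `batch_interval`, and an event is picked up by the first job that starts strictly after it occurred. The one-hour interval is illustrative and not a figure from the talk.

```python
import math


def batch_latency(event_time: float, batch_interval: float, job_duration: float = 0.0) -> float:
    """Seconds between an event occurring and the end of the next periodic load job."""
    # First job start strictly after the event (assumed scheduling model).
    next_job_start = (math.floor(event_time / batch_interval) + 1) * batch_interval
    return next_job_start - event_time + job_duration


# An event that occurs 10 minutes into an hourly cycle waits 50 minutes
# before the next load job even starts.
print(batch_latency(event_time=600, batch_interval=3600))
```

Under this model the wait is at most one full interval plus the job's run time, which is why shortening the interval (micro-batches) or dropping it entirely (per-event processing) reduces latency.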


Streaming ETL: Event-time aware


Events derived from the same real-world activity might arrive out of order in the system
Flink is event-time aware
Diagram: events stamped 11:28 and 11:29 from the same real-world activity arrive interleaved
Causes: out of sync clocks, network delays, machine failures
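The difference between windowing by event time and by arrival time can be shown with a small plain-Python sketch. The `Event` type, its field names, and the sample values are assumptions for illustration. The sketch also leaves out how a system decides that a window is complete, which Flink handles with watermarks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    event_time: int    # seconds, when the activity happened at the source
    arrival_time: int  # seconds, when the system received the event
    payload: str


def tumbling_windows(events: Iterable[Event], window_size: int, key: str) -> Dict[int, List[str]]:
    """Assign payloads to fixed-size windows using the timestamp field named by `key`."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for event in events:
        ts = getattr(event, key)
        windows[ts - ts % window_size].append(event.payload)
    return {start: sorted(payloads) for start, payloads in sorted(windows.items())}


# "b" happened in the first minute but arrived late, after "c".
events = [Event(10, 12, "a"), Event(70, 71, "c"), Event(50, 75, "b")]
print(tumbling_windows(events, 60, key="event_time"))
print(tumbling_windows(events, 60, key="arrival_time"))
```

Grouping by `event_time` keeps "a" and "b" together, as they belong to the same minute of real-world activity. Grouping by `arrival_time` moves the delayed "b" into the next window.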

Demo


Job Overview
Data Ingestion Job: Flink Twitter Source
Streaming ETL Job: Filter operation, (Rolling) file sink, Streaming WordCount, TopN operator, Aggregation to ElasticSearch
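The filter, WordCount and TopN steps named in the job overview can be sketched in plain Python. This is not the demo's code (that lives in the GitHub repository linked in the talk). The function names, the "drop empty texts" filter rule and the sample texts are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def filter_texts(texts: Iterable[str]) -> Iterator[str]:
    """Filter operation: keep only non-empty texts (assumed rule)."""
    return (t for t in texts if t.strip())


def word_count(texts: Iterable[str]) -> Counter:
    """WordCount: count lower-cased words across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def top_n(counts: Counter, n: int) -> List[Tuple[str, int]]:
    """TopN: the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


tweets = ["Flink streams", "", "flink and kafka", "Kafka streams flink"]
print(top_n(word_count(filter_texts(tweets)), n=2))
```

In a streaming job the counts would be kept per time window and the TopN result emitted when each window closes, rather than computed once over a finished list.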

Demo code @ GitHub


https://github.com/rmetzger/flink-streaming-etl


Closing


https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets25580481910


Flink Forward 2016, Berlin


Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org

We are hiring!
data-artisans.com/careers

Questions?
Ask now!
Email: rmetzger@apache.org
Twitter: @rmetzger_

Follow: @ApacheFlink
Read: flink.apache.org/blog, data-artisans.com/blog/
Mailing lists: (news | user | dev)@flink.apache.org


Appendix


Sources
Large scale ETL with Hadoop: http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop
