UNDERSTANDING STREAM PROCESSING: FAST PROCESSING OF INFINITE AND BIG DATA

WRITTEN BY VLADIMIR SCHREINER, PRODUCT MANAGER AT HAZELCAST, AND MARKO TOPOLNIK, SENIOR SOFTWARE ENGINEER AT HAZELCAST

CONTENTS

• WHAT IS STREAM PROCESSING?
• WHEN TO USE STREAM PROCESSING
• THE BUILDING BLOCKS
• TRANSFORMATIONS
• WINDOWING
• RUNNING JOBS
• FAULT TOLERANCE
• SOURCES AND SINKS
• OVERVIEW OF STREAM PROCESSING PLATFORMS
This Refcard introduces you to the domain of stream processing. It covers these topics:

• Use cases that benefit from stream processing
• Building blocks of a stream processing solution
• Key concepts used when building a streaming pipeline: definition of the dataflow, keyed aggregation, and windowing
• Runtime aspects and tradeoffs between performance and correctness
• Overview of distributed stream processing engines
• Hands-on examples based on Hazelcast Jet

WHAT IS STREAM PROCESSING?
The goal of streaming systems is to process big data volumes and provide useful insights into the data prior to saving it to long-term storage.

The traditional approach to processing data at scale is batching. For example, a bank collects transactional data during the day into a storage system such as HDFS or a data warehouse. Then, after closing time, it processes it in an offline batch job. The premise of this approach is that all the data is available in the system of record before the processing starts. In the case of failures, the whole job can simply be restarted.

While quite simple and robust, this approach clearly introduces a large latency between gathering the data and being ready to act upon it.

The goal of stream processing is to overcome this latency. It processes the live, raw data immediately as it arrives and meets the challenges of incremental processing, scalability, and fault tolerance.

WHEN TO USE STREAM PROCESSING

FAST BIG DATA
Stream processing should be used in systems that handle big data volumes and where real-time results matter — in other words, when the value of the information contained in the data stream decreases rapidly as it gets older.

This mostly applies to:

• Real-time analytics (for fast business insights and decision-making)
• Anomaly, fraud, or pattern detection
• Complex event processing
• Real-time stats (monitoring, feeding real-time dashboards)
• Real-time ETL (extract, transform, load)
• Implementing event-driven architectures (the stream processor builds materialized views on top of the event stream)

CONTINUOUS DATA
Batch processing forces you to split your data into isolated blocks.



The information in the data that crosses the border of batches gets lost.

Example: You analyze the behavior of users browsing your website. You run an analytic task on a daily basis, processing one-day batches. What if somebody keeps browsing your site around midnight? His or her behavior is divided between two batches and the correlation is lost.

Streams, on the other hand, are unbounded by definition. As the data stream is potentially infinite, so is the computation, and your insight into the data isn't limited by the underlying technical solution.

CONSISTENT RESOURCE CONSUMPTION
Also, processing data as it arrives spreads workloads more evenly over time. Stream processing should be used to make the consumption of resources more consistent and predictable.

THE BUILDING BLOCKS
A stream is a sequence of records. Each record holds information about an event that happened, such as a user's access to a website, a temperature update from an IoT sensor, or a trade being completed. The records are immutable, as they reflect something that has already happened — the user accessed the website, the thermometer captured the temperature, the trade was processed. This cannot be undone. Also, the stream is potentially infinite, as the events just keep happening.

The stream processing application provides insight into the stream: the current state computed on top of the individual events flowing in the stream — such as the current count of users accessing a website or the maximum temperature measured in an engine in the last hour.

This is done by applying transformations on top of the data stream. Multiple transformations can be composed to perform more complex processing. The next section will introduce the most frequently used transformations.

The stream processing application ingests one or more streams from stream sources. Source connectors are used to connect the stream processing application to the system that feeds the data, such as Apache Kafka, a JMS broker, or a custom enterprise system.

Sinks are used to pass the transformed data downstream for storage or further processing. One streaming application can read data from multiple sources and output data to multiple sinks.

The stream processing application runs on a stream processing engine or platform infrastructure, allowing you to focus on business logic and transformations instead of low-level distributed computation concerns.

To build a stream processing application, you must:

• Define the transformations
• Connect the streaming application to stream sources and sinks. Stream processing platforms mostly provide a set of connectors; you have to configure them properly
• Define the dataflow of the application by wiring the sources to transformations and sinks
• Execute the stream processing application using the stream processing engine. The engine then takes care of passing the records between the system components (sources, processors, and sinks) and invoking them when a record arrives. The application consumes and processes the stream until it is stopped

This Refcard will guide you through these building blocks in detail.
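Before the full example, here is a minimal sketch of that wiring in Hazelcast Jet's Pipeline API, which is used throughout this Refcard. It only illustrates the shape of source, transformation, and sink; the list names and the trivial uppercase transformation are not part of the example that follows.

Pipeline p = Pipeline.create();
p.drawFrom(Sources.<String>list("raw-lines"))   // source connector (a Hazelcast IList here)
    .map(String::toUpperCase)                   // a stateless transformation
    .drainTo(Sinks.list("upper-lines"));        // sink connector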
EXAMPLE
Let's consider an application that processes a stream of system log events in order to power a dashboard giving immediate insight into what's going on with the system. These are the steps it will take:

1. Read the raw data from log files in a watched directory
2. Parse and filter it, transforming it into event objects
3. Enrich the events: Using a machine ID field in the event object, look up the data we have on that machine and attach it to the event
4. Save the enriched, denormalized data to persistent storage such as Cassandra or HDFS for deeper offline analysis (first data sink)
5. Perform windowed aggregation: using the attached data, calculate real-time statistics over the events that happened within the time window of the last minute
6. Push the results to an in-memory K-V store that backs the dashboard (second data sink)

We can represent each of the above steps with a box in a diagram and connect them using arrows representing the data flow:

[Figure: The example pipeline as a dataflow diagram — a log source connector reads event logs from a shared directory, an enrichment step joins them with cached machine data, the enriched events are drained to an HDFS sink, a compute-stats step aggregates them, and the results are drained to a key-value store sink that backs the real-time dashboard.]
The boxes and arrows form a graph, specifically a Directed Acyclic Graph (DAG). This model is at the core of modern stream processing engines.

Here's how you would describe the above DAG in Hazelcast Jet's Pipeline API. It uses the paradigm of a "pipeline" that consists of interconnected "stages."

Pipeline p = Pipeline.create();

// Read the logs, parse each line, keep only success responses
StreamStage<LogEvent> events =
    p.drawFrom(Sources.fileWatcher("log-directory"))
     .map(LogEvent::parse)
     .filter(line -> line.responseCode() >= 200 && line.responseCode() < 400);

// Enrich with machine data
StreamStage<Entry<Long, LogEvent>> enrichedEvents = events.hashJoin(
    p.drawFrom(Sources.<String, Machine>map("machines")),
    joinMapEntries(LogEvent::machineID),
    (e, m) -> entry(e.sequenceNumber(), e.withMachine(m)));

// Save enriched events to HDFS
JobConf jobConfig = new JobConf();
jobConfig.setOutputFormat(TextOutputFormat.class);
TextOutputFormat.setOutputPath(jobConfig, new Path("output"));
enrichedEvents.drainTo(HdfsSinks.hdfs(jobConfig));

// Calculate requests per minute for each machine,
// save stats to IMap
events
    .addTimestamps(LogEvent::timestamp, 1000)
    .window(sliding(60_000, 1_000))
    .groupingKey(LogEvent::machineID)
    .aggregate(counting())
    .drainTo(Sinks.map("machine-stats"));

TRANSFORMATIONS
Transformations are used to express the business logic of a streaming application. At the low level, a processing task receives some stream items, performs arbitrary processing, and emits some items. It may emit items even without receiving anything (acting as a stream source) or it may just receive and not emit anything (acting as a sink).

Due to the nature of distributed computation, you can't just provide arbitrary imperative code that processes the data — you must describe it declaratively. This is why streaming applications share some principles with functional and dataflow programming. This requires some time to get used to when coming from imperative programming.

Hazelcast Jet's Pipeline API is one such example. You compose a pipeline from individual stages, each performing one kind of transformation. A simple map stage transforms items with a stateless function; a more complex windowed group-and-aggregate stage groups events by key in an infinite stream and calculates an aggregated value over a sliding window. You provide just the business logic, such as the function to extract the grouping key, the definition of the aggregate function, the definition of the sliding window, etc.

BASIC TRANSFORMATIONS
We have already mentioned that a stream is a sequence of isolated records. Many basic transformations process each record independently. Such a transformation is stateless.

These are the main types of stateless transformation:

• Map transforms one record into one record, e.g. change the format of the record or enrich the record with some data
• Filter filters out the records that don't satisfy the predicate
• FlatMap is the most general type of stateless transformation, outputting zero or more records for each input record, e.g. tokenizing a record containing a sentence into individual words

However, many types of computation involve more than one record. In this case, the processor must maintain internal state across the records. When counting the records in the stream, for example, you have to maintain the current count.

Stateful transformations:

• Aggregation: Combines all the records to produce a single value, e.g. min, max, sum, count, avg
• Group-and-aggregate: Extracts a grouping key from the record and computes a separate aggregated value for each key
• Join: Joins same-keyed records from several streams
• Sort: Sorts the records observed in the stream

This code sample shows both stateless and stateful transformations:

logs.map(LogLine::parse)
    .filter((LogLine log) -> log.getResponseCode() >= 200 &&
        log.getResponseCode() < 400)
    .flatMap(AccessLogAnalyzer::explodeSubPaths)
    .groupingKey(wholeItem())
    .aggregate(counting());

In the most general case, the state of stateful transformations is affected by all the records observed in the stream, and all the ingested records are involved in the computation. However, we're mostly interested in something like "stats for the last 30 seconds" instead of "stats since the streaming app was started" (remember: the stream is generally infinite). This is where the concept of windowing enters the picture — it meaningfully bounds the scope of the aggregation. See the Windowing section.
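As a sketch of that idea, the aggregation above can be scoped to the last 30 seconds. This assumes LogLine also exposes a getTimestamp() accessor (not shown in the Refcard) and that tumbling is statically imported from WindowDefinition, like sliding in the earlier snippet.

logs.map(LogLine::parse)
    .addTimestamps(LogLine::getTimestamp, 1_000)   // declare where the event time lives
    .window(tumbling(30_000))                      // "stats for the last 30 seconds"
    .aggregate(counting());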

KEYED OR NON-KEYED AGGREGATIONS
You often need to classify records by a grouping key, thereby creating sub-streams containing just the records with the same grouping key. Each record group is then processed separately. This fact can be leveraged for easy parallelization by partitioning the data stream on the grouping key and letting independent threads/processes/machines handle records with different keys.

EXAMPLES OF KEYED AGGREGATIONS
This example processes a stream of text snippets (tweets or anything else) by first splitting them into individual words and then performing a windowed group-and-aggregate operation. The aggregate function is simply counting the items in each group. This results in a live word frequency histogram that updates as the window slides along the time axis. Although quite a simple operation, it gives a powerful insight into the contents of the stream.

tweets.flatMap(tweet ->
        traverseArray(tweet.toLowerCase().split("\\W+")))
    .window(sliding(10_000, 100))
    .groupingKey(wholeItem())
    .aggregate(counting(),
        (start, end, word, frequency) -> entry(word, frequency));

Another example of keyed aggregation could be gaining insight into the activities of all your users. You key the stream by user ID and write your aggregation logic focused on a single user. The processing engine will automatically distribute the load of processing user data across all the machines in the cluster and all their CPU cores.

EXAMPLE OF NON-KEYED (GLOBAL) AGGREGATION
This example processes a stream of reports from weather stations. Among the reports from the last hour, it looks for the one that indicated the strongest winds.

weatherStationReports
    .window(sliding(HOURS.toMillis(1), MINUTES.toMillis(1)))
    .aggregate(maxBy(comparing(WeatherStationReport::windSpeed)),
        (start, end, station) -> entry(end, station));

A general class of use cases where non-keyed aggregation is useful is complex event processing (CEP) applications. They search for complex patterns in the data stream. In such a case, there is no a priori partitioning you can apply to the data; the pattern-matching operator needs to see the whole dataset to be able to detect a pattern.

WINDOWING
Windows provide you with a finite, bounded view on top of an infinite stream. The window defines how records are selected from the stream and grouped together into a meaningful frame. Your transformation is executed just on the records contained in the window.

HOW TO DEFINE WINDOWS
A very simple example is the tumbling window. It divides the continuous stream into discrete parts that don't overlap. The window is usually defined by a time duration or a record count. A new window is opened as soon as the time passes (for time-based windows) or as soon as the count reaches the limit (count-based windows).

Examples:

• Time-based tumbling windows: Counting system usage stats, e.g. a count of accesses in the last minute
• Count-based tumbling windows: Maximum score in a gaming system over the last 1,000 results

The sliding window is also of a fixed size; however, consecutive windows can overlap. It is defined by the size and the sliding step.

Example:

• Time-based sliding windows: Counting system usage stats, e.g. the number of accesses in the last minute with updates every ten seconds

A session is a burst of user activity followed by a period of inactivity (timeout). A session window collects the activity belonging to the same session. As opposed to tumbling or sliding windows, session windows don't have a fixed start or duration — their scope is data-driven.

Example: When analyzing website traffic data, the activity of one user forms one session. The session is considered closed after some period of inactivity (let's say one hour). When the user starts browsing later, it's considered a new session.

[Figure: Types of windows — tumbling (defined by size), sliding (defined by size and step), and session (defined by inactivity timeout).]

Take into account that windows may be keyed or global. A keyed window contains just the records that belong to that window and have the same key. See the Keyed or Non-Keyed Aggregations section.
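A sketch of the website-traffic example above as a keyed session window, using the same Pipeline API style as the earlier snippets; the pageViews stream, the PageView accessors, and the 30-minute timeout are illustrative, and session is assumed to be statically imported from WindowDefinition.

pageViews
    .addTimestamps(PageView::timestamp, 1_000)
    .window(session(30 * 60_000))              // close the session after 30 minutes of inactivity
    .groupingKey(PageView::userId)             // one session window per user
    .aggregate(counting())                     // page views per visit
    .drainTo(Sinks.map("views-per-session"));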
DEALING WITH LATE EVENTS
Very often, a stream record represents an event that happened in the real world and has its own timestamp (called the event time), which can be quite different from the moment the record is processed (called the processing time). This can happen due to the distributed nature of the system — the record could be delayed on its way from the source to the stream processing system, the link speed from various sources may vary, the originating device may be offline when the event occurred (e.g. an IoT sensor or a mobile device in flight mode), etc. The difference between processing time and event time is called event time skew.

Processing time semantics should be used when:

• The business use case emphasizes low latency. Results are computed as soon as possible, without waiting for stragglers.
• The business use case builds on the time when the stream processing engine observed the event, ignoring when it originated. Sometimes, you simply don't trust the timestamp in the event coming from a third party.

Event time semantics should be used when the timestamp of the event origin matters for the correctness of the computation. When your use case is event time-sensitive, it requires more of your attention. If event time is ignored, records can be assigned to improper windows, resulting in incorrect computation results.

The system has to be instructed where in the record the information about event time is. Also, it has to be decided how long to wait for late events. If you tolerate long delays, it is more probable that you've captured all the events. Short waiting, on the other hand, gives you faster responses (lower latency) and less resource consumption — there is less data to be buffered.

In a distributed system, there is no "upper limit" on event time skew. Imagine an extreme scenario: a game score from a mobile game. The player is on a plane, so the device is in flight mode and the record cannot be sent to be processed. What if the mobile device never comes online again? The event just happened, so in theory, the respective window has to stay open forever for the computation to be correct.

[Figure: Unordered and late data — events with timestamps 11, 14, 12, 33, 22, 35, and 2 arrive out of order and are assigned to the windows 10-19, 20-29, and 30-39.]

Stream processing frameworks provide heuristic algorithms to help you assess window completeness, sometimes called watermarks.

When your system handles event time-sensitive data, make sure that the underlying stream processing platform supports event time-based processing.
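In the Pipeline API used earlier in this Refcard, both decisions — where the event time lives in the record and how long to wait for stragglers — are expressed in the addTimestamps call. A sketch revisiting the log-events pipeline; the two-second allowed lag is an illustrative trade-off, not a recommendation:

events
    .addTimestamps(LogEvent::timestamp, 2_000)   // event time comes from the record; wait up to 2 s for late events
    .window(sliding(60_000, 1_000))
    .groupingKey(LogEvent::machineID)
    .aggregate(counting());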
RUNNING JOBS
The definition of the data processing pipeline (composed of sources, sinks, and transformations wired together) is the equivalent of a program, but the environment in which you execute it is not simply your local machine. It describes a program that will run distributed in the stream processing cluster. Therefore, you must perform the additional step of getting a handle to the stream processing engine and submitting your program to it.

Once you have submitted the pipeline, the SPE allocates the cluster's resources (i.e. CPU, memory, disk) for it and runs it. Each vertex in the DAG model becomes one or more tasks executing in parallel, and each edge (i.e. arrow) becomes a data connection between tasks: either a concurrent queue (if both tasks are on the same machine) or a network connection. The tasks implementing the data source start pulling the data and sending it into the pipeline. The SPE coordinates the execution of the tasks, monitors the data connections for congestion, and applies backpressure as needed. It also monitors the whole cluster for topology changes (a member leaving or joining the cluster) and compensates for them by remapping the resources devoted to your processing job so that it can go on unaffected by these changes. If the job's data source is unbounded, it will keep on running until the user explicitly stops it (or a failure occurs within the processing logic).

Stream processing engines usually provide a client (a programming API or a command-line tool) that is used for submitting the job and its resources to a cluster. The library approach of Hazelcast Jet also allows the "embedded member" mode, where the JVM containing the application code participates in the Hazelcast Jet cluster directly and can be used to submit the job.

[Figure: Embedded vs. client-server deployment — in embedded mode, the application and the Jet Java API run inside each JVM that forms the cluster; in client-server mode, applications use a Java client to submit jobs to a separately running cluster.]

When to use client-server:

• The cluster is shared by multiple jobs
• Isolating the client application from the cluster

When to use embedded:

• When you want simplicity: It's simple, with no separate moving parts to manage
• OEM: The SPE is embedded in your application
• Microservices
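A sketch of both modes with Hazelcast Jet, assuming the Jet API in use throughout this Refcard (pick one of the two ways to obtain the JetInstance):

// Embedded: this JVM becomes a cluster member and can submit jobs directly.
JetInstance jet = Jet.newJetInstance();
// Client-server: connect to an already running cluster instead.
// JetInstance jet = Jet.newJetClient();

Job job = jet.newJob(p);   // p is the Pipeline defined earlier
job.join();                // blocks; an unbounded job runs until it is cancelled or fails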
FAULT TOLERANCE
A major challenge of infinite stream processing is maintaining state in the face of inevitable system failures. The data that was already ingested but not yet emitted as final results is at risk of disappearing forever if the processing system fails. To avoid loss, the original input items can be stored until the results of their processing have been emitted. Upon resuming after a failure, these items can then be replayed to the system. The internal state of the computation engine can also be persisted. However, in each case, there is an even greater challenge if there must be a strict guarantee of correctness. That means that each individual item must be accounted for and it must be known at each processing stage whether that particular item was processed. This is called the exactly-once processing guarantee. A more relaxed variant is called at-least-once; in this case, the system may end up replaying an item that was already processed.

The exactly-once guarantee is achievable only at a cost in terms of latency, throughput, and storage overheads. If you can come up with an idempotent function that processes your data, the cheaper at-least-once guarantee will be enough, because processing the same event twice with such a function has the same effect as processing it just once.
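For instance, a sink write keyed by a stable event ID is idempotent: if the event is replayed after a failure, it simply overwrites the same entry. A sketch in the style of the earlier snippets; payments, transactionId, and amount are illustrative names, not part of the Refcard's example.

payments
    .map(pmt -> entry(pmt.transactionId(), pmt.amount()))
    .drainTo(Sinks.map("amount-by-transaction"));   // a replayed payment rewrites the same key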

Current stream processing frameworks do support exactly-once processing. However, assess whether your use case really cannot tolerate some duplicates in the case of a fault, as exactly-once isn't free in terms of complexity, latency, throughput, and resources.

Example: In a system processing access logs and detecting fraud patterns on top of millions of events per second, minor duplicates could be tolerated; they just lead to possible false positives in exchange for performance. On the other hand, in a billing system, consistency and exactly-once processing are an absolute must-have.
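In Hazelcast Jet, the guarantee is chosen per job at submission time. A sketch, assuming the JobConfig API of the Jet version used here and the JetInstance from the earlier snippet; the ten-second snapshot interval is an illustrative value:

JobConfig config = new JobConfig();
config.setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE);   // or AT_LEAST_ONCE / NONE
config.setSnapshotIntervalMillis(10_000);                          // how often to snapshot distributed state
jet.newJob(p, config);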
SOURCES AND SINKS
A stream processing application accesses data sources and sinks via its connectors. They are a computation job's point of contact with the outside world.

Although the connectors do their best to unify the various kinds of resources under the same "data stream" paradigm, there are still many concerns that need your attention, as they limit what you can do within your stream processing application.

IS IT UNBOUNDED?
The first decision when building a computation job is whether it will deal with bounded (finite) or unbounded (infinite) data.

Bounded data is handled in batch jobs, and there are fewer concerns to deal with, as the data boundaries are within the dataset itself. You don't have to worry about windowing, late events, or event time skew. Examples of bounded, finite resources are plain files, HDFS, or iterating through database query results.

Unbounded data streams allow for continuous processing; however, you have to use windowing for operations that cannot effectively work on infinite input (such as sum, avg, or sort). In the unbounded category, the most popular choice is Kafka. Some databases can be turned into unbounded data sources by exposing the journal, the stream of all data changes, as an API for third parties.

Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Integer, Integer>mapJournal(MAP_NAME,
    START_FROM_OLDEST));

Hazelcast Jet reading the journal of an IMap, the distributed map provided by Hazelcast In-Memory Data Grid (an unbounded data source).

IS IT REPLAYABLE?
The data source is replayable if you can easily replay the data. This is generally necessary for fault tolerance: if anything goes wrong during the computation, the data can be replayed and the computation can be restarted from the beginning.

Bounded data sources are mostly replayable (e.g. a plain file or an HDFS file). The replayability of infinite data sources is limited by the disk space necessary to store the whole data stream — an Apache Kafka source is replayable with this limitation. On the other hand, some sources can be read only once (e.g. a TCP socket source or a JMS queue source).

It would be quite impractical if you could only replay a data stream from the very beginning. This is why you need checkpointing: the ability of the stream source to replay its data from the point (offset) you choose. Both Kafka and the Hazelcast Event Journal support this.
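With the IMap journal source shown above, the starting position is an explicit argument (both constants come from the same enum as START_FROM_OLDEST), and a fault-tolerant job later resumes from the offsets recorded in its snapshots rather than from the very beginning. A sketch:

// Replay all changes still retained in the journal:
p.drawFrom(Sources.<Integer, Integer>mapJournal(MAP_NAME, START_FROM_OLDEST));

// Or consume only changes made after the job is submitted:
p.drawFrom(Sources.<Integer, Integer>mapJournal(MAP_NAME, START_FROM_CURRENT));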
IS IT DISTRIBUTED?
A distributed computation engine prefers to work with distributed data resources to maximize performance. If the resource is not distributed, all SPE cluster members will have to contend for access to a single data source endpoint. Kafka, HDFS, and Hazelcast IMap are all distributed. On the other hand, a file is not; it is stored on a single machine.
DATA LOCALITY?
If you're looking to achieve record-breaking throughput for your application, you'll have to think carefully about how close you can deliver your data to the location where the stream processing application will consume and process it. For example, if your source is HDFS, you should align the topologies of the Hadoop and SPE clusters so that each machine that hosts an HDFS member also hosts a node of the SPE cluster. Hazelcast Jet will automatically figure this out and arrange for each member to consume only the slice of data stored locally.

Hazelcast Jet makes use of data locality when reading from co-located Hazelcast IMap or HDFS.

OVERVIEW OF STREAM PROCESSING PLATFORMS
This list provides an overview of the major distributed stream processing platforms:

• Apache Flink: Open-source, unified platform for batch and stream processing.
• Apache Storm: The first streaming platform (originated in 2010), so it has a big mind share and install base. It's limited to at-least-once and doesn't support stateful stream processing. Project Trident adds another layer for exactly-once using micro-batches. Another project based on Storm is Twitter Heron.
• Google Cloud Dataflow: A managed service in Google Cloud. It introduced Apache Beam, the unified programming model and SDK for batch and stream processing.
• Hazelcast Jet: Open-source, lightweight, and embeddable platform for batch and stream processing; contains distributed in-memory data structures to store operational data and publish results.
• Kafka Streams: Open-source SPE integrated into the Apache Kafka ecosystem. It's optimized for processing Kafka topics.
• Spark Streaming: Open-source; utilizes the Spark batch processing platform by dividing the continuous stream into a sequence of discrete micro-batches. There's a big ecosystem around Spark and thus a big mind share.

CONCLUSION
This Refcard guided you through the key aspects of stream processing. It covered the building blocks of a streaming application:

• Source and sink connectors to connect your application into the data pipeline
• Transformations to process and "query" the data stream by filtering, converting, grouping, aggregating, and joining it
• Windows to select finite sub-streams from generally infinite data streams

We also covered how to run a streaming application in a stream processing engine, listing the most popular distributed SPEs.

Written by Vladimir Schreiner
Vladimir Schreiner is a Product Manager of Hazelcast Jet. Coming from an engineering background, he has helped organizations build their internal software platforms and development infrastructure. He is passionate about new technology and finding ways to simplify data processing.

Written by Marko Topolnik
Marko Topolnik is a Senior Engineer at Hazelcast on the Jet team. He has been with Hazelcast since 2015, holds a PhD in computer science, and has a six-figure score on Stack Overflow. Before starting to work as a software engineer, he taught Java at the University of Zagreb, and his textbook covering the fundamentals of OOP in Java remains in use today.
