Streaming In Python
Posted on December 22, 2015
An actual example
Everything feels clearer if we discuss an actual use case, so let's consider a simple real-life example and see how we can use Spark Streaming to code it up. Let's say we are receiving a stream of 2D points and want to keep a count of how many points fall in each quadrant. We will be getting these points from a data server listening on a TCP socket. Let's see how to do it in Spark. The code below is well commented, so just read through it and you'll get an idea; we will discuss it in detail later in this blog post.
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def get_quadrant(line):
    # Parse "x y" into floats; malformed lines are counted as 'Invalid'
    try:
        (x, y) = [float(v) for v in line.split()]
    except ValueError:
        return ('Invalid', 1)
    # Classify the point into a quadrant, an axis, or the origin
    if x > 0 and y > 0:
        quadrant = 'First'
    elif x < 0 and y > 0:
        quadrant = 'Second'
    elif x < 0 and y < 0:
        quadrant = 'Third'
    elif x > 0 and y < 0:
        quadrant = 'Fourth'
    elif x == 0 and y != 0:
        quadrant = 'Y-Axis'
    elif x != 0 and y == 0:
        quadrant = 'X-Axis'
    else:
        quadrant = 'Origin'
    return (quadrant, 1)
def updateFunction(new_values, running_count):
    # running_count is None the first time a key is seen
    if running_count is None:
        running_count = 0
    return sum(new_values, running_count)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("Usage: quadrant_count.py <hostname> <port>")
    spc = SparkContext(appName="QuadrantCount")
    stc = StreamingContext(spc, 2)
    # Checkpointing feature, required by updateStateByKey
    stc.checkpoint("checkpoint")
    # Each item in this DStream is a line of text received on the socket
    lines = stc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    running_counts = lines.map(get_quadrant).updateStateByKey(updateFunction)
    # Print the current state
    running_counts.pprint()
    stc.start()
    stc.awaitTermination()
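Before wiring this into a streaming job, it's handy to sanity-check the classification logic on its own, without Spark. The sketch below re-implements get_quadrant as a standalone function (the quadrant labels are illustrative; only 'Origin' is dictated by the code above):

```python
def get_quadrant(line):
    # Parse a "x y" line into floats; malformed lines become 'Invalid'
    try:
        (x, y) = [float(v) for v in line.split()]
    except ValueError:
        return ('Invalid', 1)
    # Classify into a quadrant, an axis, or the origin
    if x > 0 and y > 0:
        quadrant = 'First'
    elif x < 0 and y > 0:
        quadrant = 'Second'
    elif x < 0 and y < 0:
        quadrant = 'Third'
    elif x > 0 and y < 0:
        quadrant = 'Fourth'
    elif x == 0 and y != 0:
        quadrant = 'Y-Axis'
    elif x != 0 and y == 0:
        quadrant = 'X-Axis'
    else:
        quadrant = 'Origin'
    return (quadrant, 1)

print(get_quadrant("1 2"))    # ('First', 1)
print(get_quadrant("0 0"))    # ('Origin', 1)
print(get_quadrant("oops"))   # ('Invalid', 1)
```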
First, start a simple data server in one terminal using Netcat:
$ nc -lk 9999
Then, in a different terminal, navigate to your spark-1.5.1 directory and run our program using:
$ ./bin/spark-submit quadrant_count.py localhost 9999
Make sure you provide the right path to "quadrant_count.py". You can enter the datapoints in the Netcat terminal like this, one "x y" point per line:
1 2
-3 4
0 0
In this DStream, each item is a line of text that we want to process. Inside get_quadrant, each line is split on spaces into individual strings, which are then converted to numbers, and the point is classified into a quadrant. The lines DStream is thus mapped to a DStream of (quadrant, 1) pairs. Next, we want to count the number of points belonging to each quadrant, so these pairs are combined using updateStateByKey(updateFunction) to maintain a running count for each quadrant. Finally, we print the current counts using running_counts.pprint() once every 2 seconds, the batch interval we passed to the StreamingContext.
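The easiest way to picture updateStateByKey is as a per-key fold across batches. The plain-Python sketch below simulates what Spark does for this job; the batching and key grouping are hand-rolled here, not Spark API calls:

```python
def update_function(new_values, running_count):
    # Same shape as Spark's update function: running_count is None
    # the first time a key is seen
    if running_count is None:
        running_count = 0
    return sum(new_values, running_count)

state = {}  # quadrant -> running count, analogous to Spark's keyed state

# Two simulated batches of (quadrant, 1) pairs produced by the map step
batches = [
    [('First', 1), ('Second', 1), ('First', 1)],
    [('First', 1), ('Third', 1)],
]
for batch in batches:
    # Group the new values by key, as Spark does before calling the update
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    for key, new_values in grouped.items():
        state[key] = update_function(new_values, state.get(key))

print(state)  # {'First': 3, 'Second': 1, 'Third': 1}
```

After the first batch 'First' has count 2; the second batch folds one more point in, giving 3, while previously unseen keys start from a running count of 0.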
Understanding “updateFunction”
We use “updateStateByKey” to update all the counts using the function “updateFunction”.
This is actually the core concept here, so we need to understand it completely if we want to write
meaningful code using Spark Streaming. Let’s look at the following lines:
def updateFunction(new_values, running_count):
    if running_count is None:
        running_count = 0
    return sum(new_values, running_count)
This function basically takes two inputs and computes the sum. Here, “new_values” is a list of the counts that arrived for a key in the current batch, and “running_count” is an int holding the count accumulated so far (or None the first time a key is seen). The function sums up all the numbers in the list and then adds the running count to compute the overall sum. In our case the list typically has a single element: new_values will be something like [1], indicating that the new count is 1, and running_count will be something like 4, indicating that there are already 4 points in this quadrant. So we just sum them up and return the updated count.
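Plugging the numbers from that example into the same function shows the arithmetic directly:

```python
def update_function(new_values, running_count):
    # running_count arrives as None on the very first batch for a key
    if running_count is None:
        running_count = 0
    # sum(list, start) adds the running count as the starting value
    return sum(new_values, running_count)

print(update_function([1], 4))     # 5: one new point on top of 4 existing
print(update_function([1], None))  # 1: first point ever seen for this key
```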