
Hello, in this video we'll be talking about caching. In the last video, we talked about actions, and we saw that once Spark prepares a Directed Acyclic Graph with our computation, we call an action to submit this job for execution. The problem is that if we do this twice, if we have two different data analysis pipelines, even if they share most of the processing, Spark is still going to execute them independently. So if they are, for example, reading data, doing some data cleaning, and then doing two different kinds of processing, unfortunately we cannot, by default, share the cleaning stage between the two different pipelines. The purpose of caching is exactly to address this issue. Once we have an RDD which is going to be re-used in the future, we can call the cache method on this object, and this will trigger caching of it in memory. So the next time we need to use it, even if it's in another Spark job that we execute later, it is going to be read from memory, and so it's gonna be a lot faster. This method, like transformations, is lazy. That means that we mark the RDD to be cached, but it's not cached straight away. The first time that this RDD is actually computed, it will be stored in memory, and it will be read from there from then on.
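As a rough sketch of what this looks like in PySpark (assuming an existing SparkContext called sc, a hypothetical cleaning function clean_line, and an illustrative input path), note that nothing is stored until the first action runs:

    # minimal sketch: cache() is lazy, data is stored on the first action
    raw_RDD = sc.textFile("hdfs:///data/input.txt")   # illustrative path
    clean_RDD = raw_RDD.map(clean_line)               # clean_line is a hypothetical function

    clean_RDD.cache()    # only marks the RDD for caching, nothing is stored yet
    clean_RDD.count()    # first action: computes clean_RDD and stores it in memory
    clean_RDD.take(5)    # later actions read clean_RDD back from memory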
So what is the best stage at which we should cache our data? Generally, you don't wanna do that straight away with your input data, because input data is usually very large. It's usually better to do some validation and some cleaning first, and then, once you have a dataset which is ready for your actual computations, cache that in memory, because you're gonna use it later on in your computations. And you can also cache different RDDs in the same pipeline. So let's say that, for example, you cached data after validation and cleaning. But then, once you have prepared your data for an iterative algorithm, for example a machine learning algorithm, it's a good idea to cache that prepared dataset as well, so that each iteration of your algorithm can read its data from memory and be very fast.
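One possible shape for such a pipeline, as a sketch with hypothetical validate, clean and prepare_features functions standing in for your own logic:

    # sketch with two cache points; all helper functions are illustrative assumptions
    raw_RDD = sc.textFile("hdfs:///data/input.txt")

    clean_RDD = raw_RDD.filter(validate).map(clean)
    clean_RDD.cache()        # cache point 1: cleaned data, re-usable by other jobs

    features_RDD = clean_RDD.map(prepare_features)
    features_RDD.cache()     # cache point 2: input to the iterative algorithm

    # each iteration of the iterative algorithm then reads features_RDD
    # from memory instead of recomputing the whole pipeline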
Caching has some more options. So far we've talked about storing our datasets in memory, which is generally the best option, because you want the best speed. There is another option: you could store only on disk. But that's pretty rare, because reading from disk is not very fast, so it's not very common that you wanna just store to disk. A more commonly used option is to store both in memory and on disk. In this case, whatever fits in memory is going to be cached in memory, and the rest of the dataset is gonna be written to disk.
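In PySpark these options are exposed through the persist method and the StorageLevel class (for an RDD, cache() is shorthand for the memory-only level). A rough sketch, reusing the clean_RDD name from the earlier example:

    from pyspark import StorageLevel

    clean_RDD.persist(StorageLevel.MEMORY_ONLY)          # memory only: what cache() does
    # clean_RDD.persist(StorageLevel.DISK_ONLY)          # disk only: rarely what you want
    # clean_RDD.persist(StorageLevel.MEMORY_AND_DISK)    # whatever fits in memory, rest on disk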
What kind of speedups can you achieve with caching? It's easy to get a 10 or even 100-fold speedup, depending on your application, because instead of re-reading everything from HDFS and doing your processing again, you're just reading out of memory. So for complicated pipelines, there are very big speedups. It's also important to understand that caching is gradual. Even if your dataset doesn't fit completely in memory, even if just half of that dataset fits in memory, you can still get roughly half of the expected speedup, which is a good feature. And the cache is also, of course, fault tolerant, like all the rest of Spark. So if one of the cached partitions is lost, it's recomputed and then cached again.
So let's see an example of caching in our usual word count example. We read text_RDD from HDFS, and then we do our first processing stage, where we go from our lines to our key-value pairs, pairs_RDD. At this point, we can trigger caching. Then we can go ahead with our execution and finish with wordcounts. We call our first action, collect on wordcounts, which is going to copy the results of our computation back to the driver. And then let's say that we have a second job where we just want to inspect one element of pairs_RDD. So these are two different jobs. Without caching, the second job would have triggered the execution again of the first stage of the pipeline. Instead, with caching, pairs_RDD is already in memory, so the second job is gonna be almost instantaneous, because Spark only has to read back from memory and return the result of the job.
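As a sketch of what this might look like in PySpark (only the RDD names come from the video; the file path and the splitting and counting details are illustrative assumptions):

    # word count sketch, assuming a SparkContext `sc`
    text_RDD = sc.textFile("hdfs:///data/input.txt")

    # first stage: from lines to (word, 1) key-value pairs
    pairs_RDD = text_RDD.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
    pairs_RDD.cache()                              # mark pairs_RDD for caching

    # first job: finish the word count and collect the results on the driver
    wordcounts = pairs_RDD.reduceByKey(lambda a, b: a + b)
    results = wordcounts.collect()                 # action: runs the job, caches pairs_RDD

    # second job: inspect one element of pairs_RDD; it is read back
    # from memory, so this is almost instantaneous
    print(pairs_RDD.first())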
