
A Lightweight Continuous Jobs Mechanism for MapReduce Frameworks

Trong-Tuan Vu, INRIA Lille Nord Europe
Fabrice Huet, INRIA-University of Nice

The Big Data processing landscape

Model       Data                  Systems
---------   -------------------   ------------------------
Batch       Static                Hadoop, HOP
Iterative   Static                HaLoop, Twister, PIC
Real-time   Stream                Amazon S4, Twitter Storm
Batch       Dynamic (fast data)   (the gap addressed here)

Batch Processing of Big Data

Canonical workflow:
1. Push data to the cluster
2. Start jobs
3. Pull results
4. Profit!

This works as long as the data set does not change.

Dealing with dynamic data

Bulk arrival: the job is submitted only once and then re-runs automatically.
This only slightly changes the workflow (a naive version is sketched below):

while (new data):
    push, execute, pull, profit!

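A minimal sketch of this naive loop on top of the plain Hadoop job-submission API. Everything here is illustrative, not part of the paper's mechanism: the /input path, the polling logic, and the word-count classes are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class NaiveLoopDriver {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                word.set(tok);
                context.write(word, ONE);   // emit (word, 1) for each token
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long lastRun = 0;
        while (true) {
            if (!hasNewData(conf, new Path("/input"), lastRun)) {
                Thread.sleep(10_000);        // poll until new data arrives
                continue;
            }
            lastRun = System.currentTimeMillis();
            Job job = Job.getInstance(conf, "word-count");   // a fresh job per run
            job.setJarByClass(NaiveLoopDriver.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/input"));
            // new output directory per run: Hadoop refuses to overwrite one
            FileOutputFormat.setOutputPath(job, new Path("/out-" + lastRun));
            job.waitForCompletion(true);     // execute, then pull results
        }
    }

    // true if any input file changed since the previous run
    static boolean hasNewData(Configuration conf, Path dir, long since) throws Exception {
        for (FileStatus s : FileSystem.get(conf).listStatus(dir)) {
            if (s.getModificationTime() > since) return true;
        }
        return false;
    }
}

Note the drawback this motivates: every run re-reads the whole /input directory, not just the new data.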

Continuous Analysis

Word-Count over data arriving in two steps:

Time t1, input "Foo Bar"   -> Foo 1, Bar 1
Time t2, input "What Bar"  -> What 1, Bar 1
Merged result              -> Foo 1, Bar 2, What 1

Properties

Efficiency: only process the new data, not the whole data set.

Correctness: merging the results of the incremental runs must give the same result as processing the whole data set at once (a small sketch follows).

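A tiny sketch of the correctness property for word count: per-batch counts merge by summation, and the merged map equals the counts over the whole data set. The class and method names are illustrative, not part of the framework.

import java.util.HashMap;
import java.util.Map;

public class MergeCheck {
    // merge incremental word counts by summing per-word totals
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((word, count) -> out.merge(word, count, Integer::sum));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> run1 = Map.of("Foo", 1, "Bar", 1);   // batch "Foo Bar"
        Map<String, Integer> run2 = Map.of("What", 1, "Bar", 1);  // batch "What Bar"
        // prints {Foo=1, Bar=2, What=1}: same as counting the whole data set
        System.out.println(merge(run1, run2));
    }
}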

Dependencies

Word-2: display the words which appear at least twice.

Time t1, input "Foo Bar"   -> nothing (each word seen once)
Time t2, input "What Bar"  -> nothing (each word seen once)
Whole data set             -> "Bar"

Run batch by batch, Word-2 never sees "Bar" twice: its output depends on data from a previous batch.

Not all data are equal

Processing only the new data leads to incorrect results, because some old data are still useful.

Three categories of data:
- New data
- Results
- Carried data

Carried data

Data which have already been processed, but could be useful in a subsequent run.

Which data to carry is typically application dependent, so the programmer decides.

Example (Word-2):
- Result: words which appear at least twice
- Carry: words which appear exactly once

Continuous Map-Reduce jobs

[Figure: two successive executions of a job, each a Map phase followed by a Reduce phase. Each Reduce emits a Carry in addition to its regular output, and that Carry is fed into the next execution.]

Contribution

- A continuous job model adapted to MapReduce
- An implementation on top of Hadoop
- An evaluation with two toy applications and a realistic one

CONTINUOUS HADOOP


Continuous MapReduce Framework

Based on the Hadoop MapReduce framework, it adds:
- Automatic re-execution of jobs
- Notification of new data
- Filtering of data by timestamp
- A new API with a carry function

Even Elephants are fast

No modification to the Hadoop source code. Instead:
- Proxies/interceptors
- Subclassing
- Reflection (accessing private fields; a generic example follows)
- Use of the public API

Hopefully we never have to play cat and mouse with the elephant.

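A generic illustration of the reflection technique. The TrackerLike class and its queueName field are hypothetical stand-ins, not Hadoop's actual internals.

import java.lang.reflect.Field;

public class ReflectionPeek {
    public static void main(String[] args) throws Exception {
        TrackerLike tracker = new TrackerLike();
        // reach a private field that no public getter exposes
        Field field = TrackerLike.class.getDeclaredField("queueName");
        field.setAccessible(true);              // lift the private restriction
        System.out.println("hidden value: " + field.get(tracker));
    }
}

// hypothetical stand-in for a framework class with useful private state
class TrackerLike {
    private final String queueName = "default";
}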

[Figure: architecture. A Continuous Job wraps a Job, and a Continuous JobTracker wraps the standard JobTracker; TaskTrackers execute the individual Tasks. A Continuous NameNode wraps the standard NameNode, above the DataNodes and the local file system.]

Time stamping data

Jobs should process only new data, i.e. the data added after their last execution.

HDFS has limitations: no in-place modification and no appending.

Therefore, time stamps are attached to blocks as metadata in the Continuous NameNode (simplified sketch below).
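A simplified sketch of timestamp-based filtering at file granularity (the real mechanism stamps blocks inside the Continuous NameNode; the class name and paths here are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NewDataFilter {
    // return only the input files created or modified after the last execution;
    // the result can then be fed to FileInputFormat.addInputPath(...)
    public static List<Path> newInputs(Configuration conf, Path inputDir, long lastRun)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        List<Path> fresh = new ArrayList<>();
        for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.getModificationTime() > lastRun) {
                fresh.add(status.getPath());   // new since the previous run
            }
        }
        return fresh;
    }
}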

API example (Word-2-count)

ContinuousJob job = new ContinuousJob();
...
job.setCarryFilesName("carry");

protected void continuousReduce(Text key, Iterable<IntWritable> values,
                                ContinuousContext context)
        throws IOException, InterruptedException {
    // total occurrences of this word, over new and carried data
    int sum = 0;
    for (IntWritable value : values) {
        sum += value.get();
    }
    IntWritable result = new IntWritable(sum);
    if (sum < 2) {
        context.carry(key, result);    // seen only once: save for the next run
    } else {
        context.write(key, result);    // seen at least twice: emit as a result
    }
}

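On the next execution, the pairs saved with context.carry() are re-injected alongside the newly mapped data, so a word like Bar, seen once in each of two successive batches, is still reported with a count of two.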

Application: SPARQL Query

SPARQL is a SQL-like query language for the RDF data format. Sample RDF triples:

<http://localhost/publications/journals/Journal1/1940> rdf:type bench:Journal
<http://localhost/publications/journals/Journal1/1940> dc:title "Journal 1 (1940)"^^xsd:string
<http://localhost/publications/journals/Journal1/1940> dcterms:issued "1940"^^xsd:integer

SELECT ?yr
WHERE {
  ?journal rdf:type bench:Journal .
  ?journal dc:title "Journal 1 (1940)"^^xsd:string .
  ?journal dcterms:issued ?yr
}
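On the sample triples above, all three patterns bind ?journal to the same resource, so the query returns ?yr = "1940"^^xsd:integer.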


Continuous SPARQL

[Figure: a SPARQL query compiled into a pipeline of continuous jobs. Selection Jobs (Map + Reduce) evaluate the triple patterns; Join Jobs (Map + Reduce with Carry) combine their outputs, carrying partial matches over to the next run.]

[Figure: runtime comparison of cHadoop and plain Hadoop. Y-axis: execution time in hundreds of seconds (0-14); X-axis: input size from 20 to 180 million RDF triples. Experiments on 40 nodes.]

Conclusion

- A model for processing dynamic (fast) data using MapReduce: Carry allows saving data for future use
- A non-intrusive implementation in Hadoop: automatic restarting of continuous jobs
- Limitation: the latency of restarting jobs is high
