
Nephele

Efficient Parallel Data Processing in the Cloud


Daniel Warneke and Odej Kao
daniel.warneke@tu-berlin.de
Complex and Distributed IT-Systems
Technische Universität Berlin

Compute Clouds and Data Processing (1/2)


Cloud Computing provides IT infrastructure on demand
  Infrastructure-as-a-Service (IaaS)
  Different types of virtual machines
  No long-term obligations, pay as you go

Data processing as major cloud application


  Flexible and easy deployment
  No need for own compute center
  Amazon integrated Hadoop as core service

23.11.2009

Nephele: Efficient Data Processing in the Cloud

Compute Clouds and Data Processing (2/2)


Current situation:
  Allocate set of virtual machines
  Deploy processing framework
  Submit processing job
  Destroy virtual machines
  Poor utilization!

Imitation of static clusters
  Cloud features remain unused
  Poor resource utilization leads to higher processing cost!


Outline
Data Processing and Compute Clouds
Opportunities
Nephele
  Architecture
  Job Description
  Scheduling
Evaluation
Conclusion


Opportunities
Embrace dynamics and heterogeneity offered by clouds!
Enables new ways of job scheduling
VMs can be allocated/deallocated according to job progress, or to respond to peaks in workload
Requirements:
  Processing framework must be aware of the cloud
  Job description/schedule must be able to express when to allocate/deallocate virtual machines

Nephele
Data processing framework for compute clouds
  Runs on clouds following IaaS abstraction
  Allocates/deallocates VMs on behalf of user according to job progress

Job description based on directed acyclic graphs (DAGs)


  Vertices represent the job's individual tasks
  Edges denote communication channels
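A DAG-based job description has to be free of cycles before it can be scheduled. The following is a minimal sketch of such a check, assuming a plain dictionary representation in which each task maps to the tasks it receives data from. The names `execution_order` and `job_graph` are illustrative and not part of Nephele.

```python
from graphlib import TopologicalSorter, CycleError


def execution_order(job_graph):
    """Return the tasks in an order that respects every channel.

    job_graph maps each task name to the set of tasks it receives data from.
    Raises ValueError if the graph contains a cycle and is therefore no DAG.
    """
    try:
        return list(TopologicalSorter(job_graph).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the detected cycle
        raise ValueError(f"job graph is not acyclic: {err.args[1]}") from err


# Hypothetical three-task job: Input 1 -> Task 1 -> Output 1
job_graph = {
    "Input 1": set(),
    "Task 1": {"Input 1"},
    "Output 1": {"Task 1"},
}
order = execution_order(job_graph)
```

For a simple chain like this one, the returned order lists producers before their consumers.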


Nephele Architecture
Classic master-worker pattern
Workers are allocated on demand

[Figure: architecture diagram. Labels: Client, Public Network (Internet), Compute Cloud, Cloud Controller, Master, Private / Virtualized Network, Worker (three shown), Persistent Storage, Workload over time]

Job Description
Job Graphs focus on simplicity
  No explicit modeling of parallelization
  No explicit assignment to VMs
  No explicit assignment of channel types

Users can provide annotations to influence construction of schedule

[Figure: Job Graph]
  Output 1   Task: LineWriterTask.program   Output: s3://user:key@storage/outp
  Task 1     Task: MyTask.program
  Input 1    Task: LineReaderTask.program   Input: s3://user:key@storage/input
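To illustrate how little a Job Graph needs to state, here is a minimal sketch of the three-vertex example from this slide. The classes `JobVertex` and `JobGraph` and the annotation key `subtasks` are hypothetical and chosen for illustration only; they do not reproduce Nephele's actual interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobVertex:
    name: str
    program: str
    path: Optional[str] = None                       # input or output location, if any
    annotations: dict = field(default_factory=dict)  # optional hints for the scheduler


@dataclass
class JobGraph:
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)        # (producer, consumer) name pairs

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex
        return vertex

    def connect(self, producer, consumer):
        # No channel type, VM, or degree of parallelism is stated here.
        self.edges.append((producer.name, consumer.name))


graph = JobGraph()
reader = graph.add_vertex(
    JobVertex("Input 1", "LineReaderTask.program", "s3://user:key@storage/input"))
# Hypothetical annotation: ask for two parallel subtasks of this task
task = graph.add_vertex(
    JobVertex("Task 1", "MyTask.program", annotations={"subtasks": 2}))
# Output path copied as it appears on the slide
writer = graph.add_vertex(
    JobVertex("Output 1", "LineWriterTask.program", "s3://user:key@storage/outp"))
graph.connect(reader, task)
graph.connect(task, writer)
```

Everything not given by the user (parallelization, VM assignment, channel types) is left open for the framework to decide when it constructs the schedule.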


Execution Graph
Primary scheduling data structure

Explicit parallelization
  Tasks can be declared parallelizable
  Tasks specify wiring of subtasks

Explicit assignment to virtual machines
  Specified by ID and type
  Type refers to hardware profile

[Figure: Execution Graph]
  Vertices: Output 1 (1), Task 1 (2), Input 1 (1)
  VMs: ID: 1 Type: m1.small, ID: 2 Type: m1.large
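The step from a job description to an Execution Graph can be sketched as an expansion that creates one entry per subtask and attaches a VM to it. This is a simplified sketch, not Nephele's implementation: the function `expand`, the `Instance` class, and in particular the placement of tasks on the two VMs are assumptions, and the wiring between subtasks is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    vm_id: int
    vm_type: str   # refers to a hardware profile, e.g. "m1.small"


def expand(task_names, parallelism, placement):
    """Create one (task, subtask index, instance) entry per subtask."""
    execution_vertices = []
    for name in task_names:
        # Tasks without an explicit degree of parallelism run as one subtask
        for index in range(parallelism.get(name, 1)):
            execution_vertices.append((name, index, placement[name]))
    return execution_vertices


small = Instance(1, "m1.small")
large = Instance(2, "m1.large")
parallelism = {"Task 1": 2}
# Assumed placement; the slide does not spell out which task runs on which VM
placement = {"Input 1": small, "Task 1": small, "Output 1": large}

execution_graph = expand(["Input 1", "Task 1", "Output 1"], parallelism, placement)
```

With these inputs the expansion yields four entries: one for Input 1, two for Task 1, and one for Output 1.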


Dealing with On-Demand VM Allocation


Issues with on-demand allocation:
  When to allocate virtual machines?
  When to deallocate virtual machines?
  No guarantee of resource availability!

Stages ensure three properties:
  VMs of upcoming stage are available
  All workers are set up and ready
  Data of previous stages is stored in persistent manner

[Figure: Execution Graph divided into stages]
  Stage 1: Output 1 (1)
  Stage 0: Task 1 (2), Input 1 (1)
  VMs: ID: 1 Type: m1.small, ID: 2 Type: m1.large
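The stage concept can be sketched as a loop that secures all VMs of a stage before it starts and gives back VMs once no later stage needs them. This is a minimal sketch under assumptions: the function `run_in_stages` and its callbacks are hypothetical, releasing VMs right after a stage is one possible policy, and storing stage output persistently is left to `run_stage`.

```python
def run_in_stages(stages, allocate, release, run_stage):
    """Execute stages in order.

    stages is a list of sets, each holding the VM identifiers one stage needs.
    """
    held = set()
    for number, needed in enumerate(stages):
        # Make sure every VM of the upcoming stage is available first
        for vm in sorted(needed - held):
            allocate(vm)
        held |= needed

        # run_stage is expected to persist its output before returning
        run_stage(number)

        # Give back every VM that no later stage uses
        still_needed = set().union(*stages[number + 1:])
        for vm in sorted(held - still_needed):
            release(vm)
        held &= still_needed


log = []
run_in_stages(
    [{"vm-1"}, {"vm-2"}],
    allocate=lambda vm: log.append(("allocate", vm)),
    release=lambda vm: log.append(("release", vm)),
    run_stage=lambda number: log.append(("run", number)),
)
```

In this two-stage example the first VM is released before the second one is allocated, so at no point are both VMs held at once.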


Channel Types
Network channels (pipeline)
  Vertices must be in same stage

In-memory channels (pipeline)
  Vertices must run on same VM
  Vertices must be in same stage

File channels
  Vertices must run on same VM
  Vertices must be in different stages

[Figure: Execution Graph with Stage 0 (Input 1 (1), Task 1 (2)) and Stage 1 (Output 1 (1)); VMs: ID: 1 Type: m1.small, ID: 2 Type: m1.large]
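The placement constraints of the three channel types can be written down as a small validity check. This is a sketch for illustration; the `Placement` class, the function `channel_is_valid`, and the string names of the channel types are assumptions rather than Nephele identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    vm_id: int
    stage: int


def channel_is_valid(channel_type, producer, consumer):
    """Check the constraints listed for each channel type."""
    same_vm = producer.vm_id == consumer.vm_id
    same_stage = producer.stage == consumer.stage
    if channel_type == "network":
        return same_stage
    if channel_type == "in-memory":
        return same_vm and same_stage
    if channel_type == "file":
        return same_vm and not same_stage
    raise ValueError(f"unknown channel type: {channel_type}")


# Two vertices on different VMs within the same stage
a = Placement(vm_id=1, stage=0)
b = Placement(vm_id=2, stage=0)
network_ok = channel_is_valid("network", a, b)      # same stage is sufficient
in_memory_ok = channel_is_valid("in-memory", a, b)  # fails: different VMs
```

A scheduler could run such a check on every edge before it commits to an Execution Graph.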

Evaluation Job
Given 100 GB of integer numbers, 100 bytes each
  Find the smallest 20% of these numbers
  Calculate the average of these 20%
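On a single machine and with a small input, the computation the job performs can be sketched as follows. This only illustrates the task itself, not the distributed sort, merge, and aggregation pipeline used in the evaluation; the function name `average_of_smallest` and the sample values are made up for the example.

```python
import heapq
from fractions import Fraction


def average_of_smallest(numbers, percent=20):
    """Return the exact average of the smallest `percent` percent of numbers."""
    count = len(numbers) * percent // 100
    if count == 0:
        raise ValueError("not enough numbers to form the requested share")
    smallest = heapq.nsmallest(count, numbers)
    # Fraction keeps the result exact, even for very large integers
    return Fraction(sum(smallest), count)


sample = [50, 10, 40, 20, 30, 90, 70, 60, 100, 80]
result = average_of_smallest(sample)   # smallest 20% are 10 and 20
```

For the ten sample values, the smallest 20% are the two values 10 and 20, so the average is 15.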

Implemented job for both Nephele and Hadoop


  Popular open-source data processing framework
  Highly advertised as major cloud application

Hardware platform:
  Private cloud based on Eucalyptus (EC2 WS API)
  Commodity servers with Intel Xeon 2.66 GHz (8 cores)

Results Hadoop
1. Task: Terasort
Map (a-c), Reduce (b-d)

2. Task: Aggregation 1
Map (d-f), Reduce (e-g)

3. Task: Aggregation 2
Map, Reduce (g-h)
Used cloud resources: 6 x c1.xlarge (8 CPU cores, 18 GB RAM)


Nephele Execution Graph


[Figure: Nephele Execution Graph for the evaluation job]
  Stages: Stage 0, Stage 1
  Vertices: BigIntegerReader (126), BigIntegerSorter (126), BigIntegerMerger (6), BigIntegerMerger (2), BigIntegerMerger (1), BigIntegerAggregater (1), BigIntegerWriter (1)
  VMs: ID: 1 to ID: 6 (c1.xlarge), ID: 7 (m1.small)
  Other labels: 21
  Channel types shown: in-memory channel, network channel, file channel


Results Nephele
(a) Experiment starts
(b) Sorting starts
(c) Merging starts
(d) Deallocation of six c1.xlarge VMs
(e) Experiment ends

Used cloud resources:
  6 x c1.xlarge (8 CPU cores, 18 GB RAM)
  1 x m1.small (1 CPU core, 1 GB RAM)

Transfer penalty for changing VMs


Conclusion
Nephele facilitates new ways of data processing in clouds
Variety of research issues to address in the future:
  Dynamic compression to mitigate transfer penalty
  Feedback-based construction of Execution Graphs
  More sophisticated ways to improve fault tolerance

Nephele will become open source soon!


Feel free to contact me: daniel.warneke@tu-berlin.de
