Nephele Framework Architecture

Nephele
Efficient Parallel Data Processing in the Cloud

Daniel Warneke and Odej Kao
daniel.warneke@tu-berlin.de
Complex and Distributed IT-Systems
Technische Universität Berlin
Compute Clouds and Data Processing (1/2)

Cloud Computing provides IT infrastructure on demand
Infrastructure-as-a-Service (IaaS)
Different types of virtual machines
No long-term obligations, pay as you go
Data processing as major cloud application

Flexible and easy deployment
No need for own compute center
Amazon integrated Hadoop as core service
23.11.2009
Nephele: Efficient Data Processing in the Cloud
Compute Clouds and Data Processing (2/2)

Current situation:
Allocate set of virtual machines
Deploy processing framework
Submit processing job
Destroy virtual machines
Poor utilization!
Imitation of static clusters

Cloud features remain unused
Poor resource utilization leads to higher processing cost!
Outline
Data Processing and Compute Clouds
Opportunities
Nephele (Architecture, Job Description, Scheduling)
Evaluation
Conclusion
Opportunities
Embrace dynamics and heterogeneity offered by clouds!
Enables new ways of job scheduling
VMs can be allocated/deallocated according to job progress, or to respond to peaks in workload
Requirements:
Processing framework must be aware of the cloud
Job description/schedule must be able to express when to allocate/deallocate virtual machines
Nephele
Data processing framework for compute clouds
Runs on clouds following IaaS abstraction
Allocates/deallocates VMs on behalf of user according to job progress
Job description based on directed acyclic graphs (DAGs)

Vertices represent the job's individual tasks
Edges denote communication channels
Nephele Architecture
Classic master-worker pattern
Workers are allocated on demand
[Figure: Nephele architecture. Labels: Client; Public Network (Internet); Master; Private / Virtualized Network; Worker (three shown); Persistent Storage; Compute Cloud; Cloud Controller; Workload over time.]
Job Description
Job Graphs focus on simplicity

No explicit modeling of parallelization
No explicit assignment to VMs
No explicit assignment of channel types
Users can provide annotations to influence construction of schedule
[Figure: job graph with three vertices. Input 1 (Task: LineReaderTask.program, Input: s3://user:key@storage/input); Task 1 (Task: MyTask.program); Output 1 (Task: LineWriterTask.program, Output: s3://user:key@storage/outp).]
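
The job graph on this slide can be written down as a small data structure. The following is a minimal Python sketch, not Nephele's actual API: the class names `JobVertex` and `JobGraph` and their methods are hypothetical, and only the vertex names, task programs, and storage paths are taken from the slide. It shows that the description carries no parallelization, VM assignment, or channel types, just vertices and edges of a DAG.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class JobVertex:
    name: str
    task: str                                   # program the vertex runs
    params: dict = field(default_factory=dict)  # e.g. input or output location


@dataclass
class JobGraph:
    vertices: dict = field(default_factory=dict)  # name -> JobVertex
    edges: list = field(default_factory=list)     # (source, target) pairs

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def connect(self, source, target):
        # An edge is a communication channel; its type is deliberately left open.
        self.edges.append((source, target))

    def execution_order(self):
        # Raises graphlib.CycleError if the edges do not form a DAG.
        sorter = TopologicalSorter({name: set() for name in self.vertices})
        for source, target in self.edges:
            sorter.add(target, source)
        return list(sorter.static_order())


graph = JobGraph()
graph.add_vertex(JobVertex("Input 1", "LineReaderTask.program",
                           {"input": "s3://user:key@storage/input"}))
graph.add_vertex(JobVertex("Task 1", "MyTask.program"))
# Output path copied as it appears on the slide.
graph.add_vertex(JobVertex("Output 1", "LineWriterTask.program",
                           {"output": "s3://user:key@storage/outp"}))
graph.connect("Input 1", "Task 1")
graph.connect("Task 1", "Output 1")
order = graph.execution_order()
```

Annotations that influence the schedule could be stored in `params`; how Nephele actually represents them is not stated on the slide.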
Execution Graph
Primary scheduling data structure
Explicit parallelization
Tasks can be declared parallelizable
Tasks specify wiring of subtasks
Explicit assignment to virtual machines

Specified by ID and type
Type refers to hardware profile
[Figure: execution graph with vertices Input 1 (1), Task 1 (2), Output 1 (1), assigned to VMs labeled ID: 1 Type: m1.small and ID: 2 Type: m1.large.]
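
To make the step from job graph to execution graph concrete, here is a hedged Python sketch. The names `Instance`, `ExecutionVertex`, `expand`, and `wire_all_to_all` are hypothetical, the round-robin placement and all-to-all wiring are only one possible choice, and the mapping of vertices to the two VMs is an assumption (the slide shows the VM labels but the extracted text does not say which vertex runs where).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    vm_id: int
    vm_type: str  # hardware profile, e.g. "m1.small"


@dataclass(frozen=True)
class ExecutionVertex:
    task: str
    subtask_index: int
    instance: Instance


def expand(task, parallelism, instances):
    """Create one execution vertex per subtask, placing them round-robin on VMs."""
    return [ExecutionVertex(task, i, instances[i % len(instances)])
            for i in range(parallelism)]


def wire_all_to_all(producers, consumers):
    """One possible wiring: every producer subtask feeds every consumer subtask."""
    return [(p, c) for p in producers for c in consumers]


small = Instance(1, "m1.small")
large = Instance(2, "m1.large")

# Degrees of parallelism as shown in the figure: 1, 2, 1.
# The placement below is assumed for illustration.
inputs = expand("Input 1", 1, [small])
tasks = expand("Task 1", 2, [small])
outputs = expand("Output 1", 1, [large])

edges = wire_all_to_all(inputs, tasks) + wire_all_to_all(tasks, outputs)
```

With these degrees of parallelism the sketch yields four execution vertices and four channels, two into the `Task 1` subtasks and two out of them.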
Dealing with On-Demand VM Allocation

Issues with on-demand allocation:

When to allocate virtual machines?
When to deallocate virtual machines?
No guarantee of resource availability!
Stages ensure three properties:

VMs of upcoming stage are available
All workers are set up and ready
Data of previous stages is stored in persistent manner
[Figure: execution graph divided into stages. Stage 0: Input 1 (1), Task 1 (2). Stage 1: Output 1 (1).]
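
The stage idea can be sketched as a simple loop: allocate all VMs of a stage before any of its tasks run, and release VMs that no later stage needs. This is an illustrative Python sketch under several assumptions, not Nephele's scheduler: `FakeCloud` is a hypothetical stand-in that only records calls, the placement of tasks on VM IDs 1 and 2 is invented for the example, and writing results to persistent storage between stages is not modeled.

```python
class FakeCloud:
    """Hypothetical stand-in for a cloud controller; records calls only."""

    def __init__(self):
        self.running = set()
        self.log = []

    def allocate(self, vm_id):
        self.running.add(vm_id)
        self.log.append(("allocate", vm_id))

    def deallocate(self, vm_id):
        self.running.discard(vm_id)
        self.log.append(("deallocate", vm_id))


def run_stages(stages, cloud, run_task):
    """stages: list of dicts mapping vm_id -> list of task names, in stage order."""
    for index, stage in enumerate(stages):
        # Make every VM of the upcoming stage available before its tasks start.
        for vm_id in sorted(stage):
            if vm_id not in cloud.running:
                cloud.allocate(vm_id)
        for vm_id in sorted(stage):
            for task in stage[vm_id]:
                run_task(task, vm_id)
        # Release VMs that no later stage uses.
        still_needed = set().union(*(s.keys() for s in stages[index + 1:]))
        for vm_id in sorted(cloud.running - still_needed):
            cloud.deallocate(vm_id)


cloud = FakeCloud()
executed = []
stages = [
    {1: ["Input 1", "Task 1"]},  # Stage 0 (VM assignment assumed)
    {2: ["Output 1"]},           # Stage 1 (VM assignment assumed)
]
run_stages(stages, cloud, lambda task, vm_id: executed.append((task, vm_id)))
```

A VM that appears in two consecutive stages stays allocated across the stage boundary, which is what a file channel between those stages would rely on.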
Channel Types
Network channels (pipeline)

Vertices must be in same stage
In-memory channels (pipeline)

Vertices must run on same VM
Vertices must be in same stage
File channels
Vertices must run on same VM
Vertices must be in different stages
[Figure: execution graph divided into stages. Stage 0: Input 1 (1), Task 1 (2). Stage 1: Output 1 (1).]
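
The placement rules for the three channel types can be expressed as a small check. The following Python sketch encodes exactly the constraints listed on this slide; the names `Placement` and `channel_allowed` and the string labels for the channel types are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    vm_id: int
    stage: int


def channel_allowed(channel_type, source, target):
    """Return True if the channel type is valid between the two placements."""
    same_vm = source.vm_id == target.vm_id
    same_stage = source.stage == target.stage
    if channel_type == "network":
        return same_stage                   # VMs may differ
    if channel_type == "in-memory":
        return same_vm and same_stage
    if channel_type == "file":
        return same_vm and not same_stage   # data crosses a stage boundary
    raise ValueError(f"unknown channel type: {channel_type}")


a = Placement(vm_id=1, stage=0)
b = Placement(vm_id=2, stage=0)
c = Placement(vm_id=1, stage=1)
```

For example, `channel_allowed("network", a, b)` holds because both vertices are in stage 0, while `channel_allowed("in-memory", a, b)` does not because they run on different VMs.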
Evaluation Job
Given 100 GB of integer numbers, 100 bytes each
Find the smallest 20% of these numbers
Calculate the average of these 20%
Implemented job for both Nephele and Hadoop

Hadoop: popular open-source data processing framework, highly advertised as major cloud application
Hardware platform:
Private cloud based on Eucalyptus (EC2 WS API)
Commodity servers with Intel Xeon 2.66 GHz (8 cores)
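
The computation the evaluation job performs can be illustrated on a single machine. This is a toy Python sketch of the same two steps (select the smallest 20%, then average them) on an in-memory list; it is not the distributed implementation used in the experiments, and the function name `average_of_smallest_fifth` is hypothetical. Python integers have arbitrary precision, so 100-byte numbers would not need special handling here.

```python
import heapq


def average_of_smallest_fifth(numbers):
    """Average of the smallest 20% of the given numbers."""
    count = len(numbers) // 5
    if count == 0:
        raise ValueError("need at least five numbers")
    smallest = heapq.nsmallest(count, numbers)
    return sum(smallest) / count


# The numbers 1..100 in descending order: the smallest 20% are 1..20.
numbers = list(range(100, 0, -1))
result = average_of_smallest_fifth(numbers)  # 10.5
```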
Results Hadoop
1. Task: Terasort, Map (a-c), Reduce (b-d)
2. Task: Aggregation 1, Map (d-f), Reduce (e-g)
3. Task: Aggregation 2, Map, Reduce (g-h)
Used cloud resources: 6 x c1.xlarge (8 CPU cores, 18 GB RAM)
Nephele Execution Graph

[Figure: Nephele Execution Graph for the evaluation job. Stage 0: BigIntegerReader (126), BigIntegerSorter (126), BigIntegerMerger (6), BigIntegerMerger (2), BigIntegerMerger (1). Stage 1: BigIntegerAggregater (1), BigIntegerWriter (1). VMs: ID: 1 ... ID: 6 (c1.xlarge), ID: 7 (m1.small); subtask groups labeled 21. Legend: in-memory channel, network channel, file channel.]
Results Nephele
(a) Experiment starts
(b) Sorting starts
(c) Merging starts
(d) Deallocation of six c1.xlarge VMs
(e) Experiment ends
Used cloud resources:
6 x c1.xlarge (8 CPU cores, 18 GB RAM)
1 x m1.small (1 CPU core, 1 GB RAM)
Transfer penalty for changing VMs
Conclusion
Nephele facilitates new ways of data processing in clouds
Variety of research issues to address in the future:
Dynamic compression to mitigate transfer penalty
Feedback-based construction of Execution Graphs
More sophisticated ways to improve fault tolerance
Nephele will become open source soon!

Feel free to contact me: daniel.warneke@tu-berlin.de
