Professional Documents
Culture Documents
A challenging problem
Dynamic/complex interplay between program and
system
2
Canonical Parallel Machines
Dedicated parallel server in your own space
Few programmers
3
Canonical Programming Models
Developed for canonical parallel machines
4
Ubiquitous Parallelism
Parallel hardware everywhere
Lots of programmers
7
Solution Mechanics
Mechanism: control the exposure of applications
parallelism
8
Approach: Varuna
9
How do we realize such a
strategy ?
Optimum Point
High overheads
Performance
Operating conditions
may change during
search
Can we do better than
0 1 2 3 4 5 6 7 8
Parallelism exhaustive search ?
11
Ideal Approach
12
Our Approach: Model Side Effects
Model relationship among dynamic serialization,
dynamic acceleration, performance, and DoP
13
Our Approach: Model Side Effects
As a simple scalability model based on Amdahls law
Use scalability model to rapidly estimate how side
effects vary with parallelism
Use formulae, derived from model, to
instantaneously determine optimum degree of
parallelism
MAX (throughput)
14
Modeling Side Effects
Amdahls law:
ReverseIndex 17
Barneshut
MAX (Throughput)
Naive approach:
19
Determining dqc(P)/dP
20
Measuring Instantaneous
Speedup
Assumptions:
No spin locks in user level code
22
Controlling Parallel Execution
Ability to pause/resume/migrate/introduce
computations
Task-based programming model possesses
necessary foundations to realize this ability
23
Controlling Parallel Execution
Multithreaded programs are not equipped with
this capability (e.g., Pthreads)
24
Virtual Tasks (Vtasks)
25
Virtual Tasks (Vtasks)
Programming model
independent
26
Ensuring Forward Progress
Vtasks are co-operative tasks
Control point which is in user mode and does not hold any user
level locks
29
Handling Blocked Mutexes
30
Handling Blocked Mutexes
Many multithreaded programs have fewer safe
points since they seldom synchronize
More safe points can be created by spawning more
vtasks
31
Detect Operating Condition
Changes
32
Varunas System Overview
Application
Varuna
Analytical Parallelism
Engine Manager
Operating System
Hardware
Analytical Engine
Establish relationship between parallelism,
performance and side effects
Continuously
Determine
monitoroptimum parallelism
operating for given
conditions
performance objective
Determine optimum parallelism to expose for a
given performance objective
Signal Parallelism Manager
34
Parallelism Manager
Parallel computations
35
Evaluation
36
Evaluation
37
Vtask Count on Xeon
Benchmark Pthreads Varuna
Barneshut 24 100,000
Blackscholes 24 10,000,000
Bzip2 26 24
Canneal 24 96
Dedup 74 24
Fluidanimate 24 384
Hash Join 24 1024
Histogram 24 1024
RE 24 100,000
ReverseIndex 24 24
Stream 24 1024
Swaptions 24 384
Word Count 24 256
X264 26 24
38
Isolated: Pthreads Execution
Time on Xeon
Lazy vtask creation [Mohr, et., al]
39
Isolated:MAX (throughput)
12% average reduction in exec. time
40
Isolated:MAX (throughput)
22% average reduction in energy consumption
41
Isolated:MIN(consumption)
Picked optimum parallelism of one
42
Isolated:MIN(consumption)
80% reduction in resource consumption cost
43
Multiprogrammed: TBB
Execution Time on Xeon
Varuna Parcae
45
Conclusion
Future multicores will exhibit dynamically
operating conditions
46
Varuna vs Others
No source
Continuous Principled Resource Diverse
Proposal code Speed
adaptation approach agnostic metrics
changes
Parcae/Do
Yes No No Yes Slow No/Yes
PE
49
Determining dqc(P)/dP
Naive approach:
50
Determining dqc(P)/dP
51
Limitations of the Model
Model assumes monotonic distribution of qc(P)
Currently, user-driven
52