
15-150 Fall 2014

Lecture 18
Stephen Brookes

today

parallel programming

parallelism and functional style

cost semantics

Brent's Theorem and speed-ups

sequences: an abstract type with
efficient parallel operations

parallelism
exploiting multiple processors:
evaluating independent code simultaneously

low-level implementation:
scheduling work onto processors

high-level planning:
designing code abstractly,
without baking in a schedule

our approach
design abstractly

specify behavioral correctness

specify asymptotic runtime (work, span)

reason abstractly,
independently of schedule

cost semantics and evaluation

functional benefits
No side effects, so evaluation order
doesn't affect behavioral correctness

Can build abstract types that support
efficient parallel-friendly operations

Can use work and span to predict
how parallelizable our code is

Work and span are independent of
scheduling details

caveat
In practice, it's hard to achieve speed-up

Current language implementations
don't make it easy

Problems include:

scheduling overhead

locality of data (cache problems)

runtime sensitive to scheduling choices

why bother?
It's good to think abstractly first
and figure out details later

Focus on data dependencies

when you design your code


Our thesis: this approach to parallelism


will prevail...

(and 15-210 builds on these ideas...)

cost semantics
We've already introduced work and span

Work estimates the sequential running time
on a single processor

Span takes account of data dependency,
and estimates the parallel running time
with unlimited processors

cost semantics
We showed how to calculate work and span
for recursive functions with recurrence relations

Now we introduce cost graphs,
another way to deal with work and span

Cost graphs also allow us to talk about schedules...

... and the potential for speed-up

cost graphs
A cost graph is a series-parallel graph:
a directed graph, with source and sink

nodes represent units of work
(constant time)

edges represent data dependencies

branching indicates potential parallelism

cost graphs

[figure: the three basic cost graphs:
a single node;
sequential composition (G1 above G2, joined by an edge);
parallel composition (G1 and G2 side by side,
between a fork node and a join node)]

work and span
of a cost graph

The work is the number of nodes

The span is the length of the longest path
from source to sink

span(G) ≤ work(G)

[figure: work of composed cost graphs]

work (sequential composition of G1, G2)
  = work G1 + work G2 + c
sequential code: add the work

work (parallel composition of G1, G2)
  = work G1 + work G2 + c
independent code: add the work

[figure: span of composed cost graphs]

span (sequential composition of G1, G2)
  = span G1 + span G2 + c
sequential code: add the span

span (parallel composition of G1, G2)
  = max(span G1, span G2) + c
parallel code: max the span

example
[figure: a cost graph in which two groups of nodes
must be done before a final node]

work = 11 (number of nodes)

span = 4 (longest path length)

using cost graphs

Every expression can be given a cost graph

Can calculate work and span using the graph

These are asymptotically the same as
the work and span derived from
recurrence relations

work and span provide

asymptotic estimates of actual running time,
under certain assumptions:

basic operations take constant time

work: running time on a single processor
(number of nodes)

span: running time with many processors
(length of critical path)

[figure: the example graph, w = 11, s = 4,
evaluated using 5 processors]

scheduling

assign units of work to processors,
respecting data dependency

[figure: an optimal parallel schedule for the example graph,
rounds (i)-(v): 5 rounds, or 4 steps]

example

What if there are only 2 processors?

w = 11
s = 4

[figure: a best schedule for 2 processors,
rounds (i)-(vi): 6 rounds, 5 steps]

2 processors cannot do the work as fast as 5 (!)

Brent's Theorem

An expression with work w and span s
can be evaluated on a p-processor machine
in time O(max(w/p, s)).

Optimal schedule using p processors:

Do (up to) p units of work each round

Total work to do is w

Needs at least s steps

Richard Brent is an illustrious Australian mathematician and computer scientist.
He is known for Brent's Theorem, which shows that a parallel algorithm can
always be adapted to run on fewer processors with only the obvious time penalty:
a beautiful example of an "obvious" but non-trivial theorem.
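To make the bound concrete, here is a small helper of our own (illustrative
only; the names ceilDiv and brentBound are ours, not from the lecture):

(* Hypothetical helper illustrating Brent's Theorem: with p processors,
   a greedy schedule needs about max(ceil(w/p), s) rounds.
   The theorem is asymptotic, so this is an estimate, not an exact count. *)
fun ceilDiv (a, b) = (a + b - 1) div b
fun brentBound (w, s, p) = Int.max (ceilDiv (w, p), s)

(* For the example graph: brentBound (11, 4, 2) = 6 and
   brentBound (11, 4, 3) = 4; compare with the 2- and 3-processor
   schedules on the example slides. *)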

Brent's Theorem

An expression with work w and span s
can be evaluated on a p-processor machine
in time O(max(w/p, s)).

Find me the smallest p such that w/p ≤ s

Using more than this many processors
won't yield any speed-up

example

w = 11
s = 4

min {p | w/p ≤ s} is 3

[figure: a best schedule for 3 processors,
rounds (i)-(v): 5 rounds, 4 steps]

3 processors can do the work as fast as 5 (!)

next
Exploiting parallelism in ML

A signature for parallel collections

Cost analysis of implementations

Cost benefits of parallel algorithm design

sequences
signature SEQ =
sig
  type 'a seq
  exception Range
  val tabulate  : (int -> 'a) -> int -> 'a seq
  val length    : 'a seq -> int
  val nth       : int -> 'a seq -> 'a
  val map       : ('a -> 'b) -> 'a seq -> 'b seq
  val reduce    : ('a * 'a -> 'a) -> 'a -> 'a seq -> 'a
  val mapreduce : ('a -> 'b) -> 'b -> ('b * 'b -> 'b) -> 'a seq -> 'b
end

implementations
Many ways to implement the signature:
lists, balanced trees, arrays, ...

For each one, we can give a cost analysis

There may be implementation trade-offs:

arrays: item access is O(1)

trees: item access is O(log n)

Seq : SEQ
An abstract parameterized type of sequences

Think of a sequence as a parallel collection
with parallel-friendly operations:

constant-time access to items

efficient map and reduce

We'll work today with an implementation
Seq : SEQ based on vectors
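As a rough sketch, here is one way such a structure could look, using the
SML Basis Vector library (our guess at the shape, not the course's actual
implementation, and sequential rather than parallel):

structure Seq : SEQ =
struct
  type 'a seq = 'a vector
  exception Range

  (* Vector.tabulate raises Size for negative n; translate to Range *)
  fun tabulate f n = Vector.tabulate (n, f) handle Size => raise Range

  fun length s = Vector.length s

  (* constant-time item access on vectors *)
  fun nth i s = Vector.sub (s, i) handle Subscript => raise Range

  fun map f s = Vector.map f s

  (* a simple sequential fold; extensionally equal to the
     divide-and-conquer reduce when g is associative with identity z *)
  fun reduce g z s = Vector.foldr g z s

  fun mapreduce f z g s = reduce g z (map f s)
end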

sequence values
A value of type t seq
is a sequence of values of type t

We use math notation like

⟨v1, ..., vn⟩ or ⟨v0, ..., vn-1⟩

for sequence values

⟨1, 2, 4, 8⟩ is a value of type int seq

equality
Two sequence values are (extensionally) equal
iff they have the same length
and their items are equal:

⟨v1, ..., vn⟩ = ⟨u1, ..., um⟩
if and only if
n = m and for all i, vi = ui

operations
For each operation in the signature SEQ,
we specify the (extensional) behavior of the
operation implemented in Seq,
and discuss its cost semantics

Other structures with the same signature
may implement the operations with
different work and span profiles

Learn to choose wisely!

tabulate

tabulate f n = ⟨f 0, ..., f (n-1)⟩

If Gi is the cost graph for f i,
the cost graph for tabulate f n is

[figure: G0, ..., Gn-1 in parallel,
between a fork node and a join node]

work?  span?

If f is O(1), the work for tabulate f n is O(n)

If f is O(1), the span for tabulate f n is O(1)

examples
tabulate (fn x:int => x) 6 = ⟨0, 1, 2, 3, 4, 5⟩

tabulate (fn x:int => x*x) 6 = ⟨0, 1, 4, 9, 16, 25⟩

tabulate (fn _ => raise Range) 0 = ⟨⟩   (f is never applied)

length
length ⟨v1, ..., vn⟩ = n

Work is O(1)

Span is O(1)

Cost graph is a single node

Contrast: List.length [v1,...,vn] = n
has work, span O(n)

nth
nth i ⟨v0, ..., vn-1⟩ = vi            if 0 ≤ i < n
                      = raise Range   otherwise

Work is O(1)

Span is O(1)

Cost graph is a single node

Seq provides constant-time access to items

e.g. nth 2 ⟨1, 2, 4, 8⟩ = 4

map
map f ⟨v1, ..., vn⟩ = ⟨f v1, ..., f vn⟩

map f ⟨v1, ..., vn⟩ has cost graph

[figure: G1, ..., Gn in parallel]

where each Gi is the graph for f vi

If f is constant time, map f ⟨v1, ..., vn⟩ has
work O(n), span O(1)

(contrast with List.map, where work and span are both O(n))

reduce
reduce should be used to combine a sequence using
an associative function g with identity element z

g : t * t -> t is associative iff for all x1, x2, x3 : t,
g(x1, g(x2, x3)) = g(g(x1, x2), x3)

z is an identity for g iff for all x : t, g(x, z) = x

We write
v1 g v2 g ... g vn g z
for the result of combining v1, ..., vn, z

reduce g z ⟨v1, ..., vn⟩ = v1 g v2 g ... g vn g z

reduce
When g is associative and z is an identity,

reduce g z ⟨v1, ..., vn⟩ = v1 g v2 g ... g vn g z

If g is constant time,
reduce g z ⟨v1, ..., vn⟩
has work O(n)
and span O(log n)

needs to use g n times;
divide-and-conquer

(Contrast with foldr, foldl on lists)
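One way to write the divide-and-conquer shape is sketched below (our code,
over VectorSlice, assuming the vector representation sketched earlier;
plain SML runs it sequentially, but its recursion tree is exactly the
O(log n)-span cost graph on the next slide):

(* Sketch: divide-and-conquer reduce over a vector slice.
   Assumes g is associative and z is an identity for g. *)
fun reduceSlice g z sl =
  case VectorSlice.length sl of
      0 => z
    | 1 => g (VectorSlice.sub (sl, 0), z)
    | n =>
        let
          val half  = n div 2
          val left  = VectorSlice.subslice (sl, 0, SOME half)
          val right = VectorSlice.subslice (sl, half, NONE)
        in
          (* the recursive calls are independent:
             a parallel composition in the cost graph *)
          g (reduceSlice g z left, reduceSlice g z right)
        end

fun reduce g z v = reduceSlice g z (VectorSlice.full v)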

reduce (op +) 0 ⟨1, 2, 3, 4, 5, 6, 7, 8⟩

[cost graph: a balanced binary tree of + nodes:
1+2, 3+4, 5+6, 7+8 in parallel, then the two
pairs of partial sums, then a final +]

reduce cost
reduce g z ⟨v1, ..., v2n⟩ =
g (reduce g z ⟨v1, ..., vn⟩, reduce g z ⟨vn+1, ..., v2n⟩)

[figure: G⟨1, ..., 2n⟩ is G⟨1, ..., n⟩ and G⟨n+1, ..., 2n⟩
in parallel, followed by a node for g]

W(2n) = 2*W(n) + c
S(2n) = S(n) + c

W(n) is O(n)
S(n) is O(log2 n)
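Unrolling these recurrences makes the bounds explicit (assuming n is a
power of 2 and W(1), S(1) are constants):

W(n) = 2*W(n/2) + c = 4*W(n/4) + 3c = ... = n*W(1) + (n-1)*c, which is O(n)

S(n) = S(n/2) + c = S(n/4) + 2c = ... = S(1) + c*log2 n, which is O(log2 n)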

mapreduce
When g is associative and z is an identity,

mapreduce f z g ⟨v1, ..., vn⟩ = (f v1) g ... g (f vn) g z

When f, g are constant time,
mapreduce f z g ⟨v1, ..., vn⟩
has work O(n)
and span O(log n)
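For instance (our own example, assuming the Seq operations are in scope),
a one-pass sum of squares:

(* sum of squares: O(n) work, O(log n) span,
   since (op +) is associative with identity 0 *)
fun sumSquares (s : int seq) : int =
  mapreduce (fn x => x * x) 0 (op +) s

(* sumSquares ⟨1, 2, 3⟩ = 14 *)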

examples
fun sum (s : int seq) : int =
  reduce (op +) 0 s

fun count (s : int seq seq) : int =
  sum (map sum s)
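For example (hypothetical usage, with tabulate from Seq in scope),
counting an n-by-n grid of ones:

(* build a 4-by-4 sequence of 1s and count them *)
val n = 4
val grid = tabulate (fn _ => tabulate (fn _ => 1) n) n
val total = count grid   (* evaluates to 16 *)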

analysis
fun sum (s : int seq) : int = reduce (op +) 0 s
fun count (s : int seq seq) : int = sum (map sum s)

Let s be a value of type int seq seq
consisting of n rows, each of length n

What are the work and span for count s ?

analysis
Let s = ⟨s1, ..., sn⟩, si = ⟨xi1, ..., xin⟩, ti = sum si

For each i, sum si = reduce (op +) 0 ⟨xi1, ..., xin⟩

[cost graph of sum si: a balanced tree, depth log2 n]

work is O(n)
span is O(log n)

map sum s = ⟨sum s1, ..., sum sn⟩

[cost graph of map sum s: the graphs for
sum s1, ..., sum sn in parallel]

work is O(n²)
span is O(log n)

analysis
Let ti = sum si

count s = sum ⟨t1, ..., tn⟩

[cost graph of sum (map sum s): the graphs for
sum s1, ..., sum sn in parallel (depth log2 n),
followed by the graph for sum ⟨t1, ..., tn⟩ (depth log2 n)]

work is O(n²)

span is O(log n)
