Lecture 18
Stephen Brookes
today
parallel programming
parallelism
exploiting multiple processors
evaluating independent code simultaneously
low-level implementation
scheduling work onto processors
high-level planning
designing code abstractly
without baking in a schedule
our approach
design abstractly
independently of schedule
cost semantics and evaluation
functional benefits
No side effects, so evaluation order doesn't affect the result
caveat
In practice, it's hard to achieve speed-up
Current language implementations
don't make it easy
Problems include:
scheduling overhead
locality of data (cache problems)
runtime sensitive to scheduling choices
why bother?
It's good to think abstractly first
and figure out details later
cost semantics
We've already introduced work and span
cost semantics
We showed how to calculate work and span
cost graphs
A cost graph is a series-parallel graph, built from:
a single node (constant time)
sequential composition of two graphs G1 and G2
parallel composition of two graphs G1 and G2
[diagrams: a single node; sequential composition, with G1 placed above G2; parallel composition, with G1 beside G2]
work
[diagrams: sequential and parallel compositions of G1 and G2]
sequential composition: work(G) = work(G1) + work(G2) + c
(sequential code: add the work)
parallel composition: work(G) = work(G1) + work(G2) + c
(independent code: add the work)
span
[diagrams: sequential and parallel compositions of G1 and G2]
sequential composition: span(G) = span(G1) + span(G2) + c
(sequential code: add the span)
parallel composition: span(G) = max(span(G1), span(G2)) + c
(independent code: take the max)
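These rules can be written down directly in ML. Here is a minimal sketch (the datatype cgraph and its constructor names are mine, not from the lecture), representing series-parallel cost graphs and computing work and span, taking the constant c to be 1:

datatype cgraph = Single                  (* a single constant-time node *)
                | Ser of cgraph * cgraph  (* sequential composition *)
                | Par of cgraph * cgraph  (* parallel composition *)

fun work Single         = 1
  | work (Ser (g1,g2)) = work g1 + work g2 + 1  (* add the work *)
  | work (Par (g1,g2)) = work g1 + work g2 + 1  (* both must still be done *)

fun span Single         = 1
  | span (Ser (g1,g2)) = span g1 + span g2 + 1           (* add the span *)
  | span (Par (g1,g2)) = Int.max (span g1, span g2) + 1  (* take the max *)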
example
[diagram: a cost graph with 11 nodes; an edge from one node to another means the first must be done before the second]
basic operations take constant time
work: time on a single processor
span: time with as many processors as we like
w = 11, s = 4
an optimal schedule uses 5 processors
scheduling
[diagram: an optimal parallel schedule for the example, rounds (i) to (v)]
an optimal parallel schedule (5 rounds, or 4 steps)
example
What if there are only 2 processors?
w = 11, s = 4
[diagram: a best schedule for 2 processors, rounds (i) to (vi)]
a best schedule for 2 processors (6 rounds, 5 steps)
Brent's Theorem
An expression with work w and span s
can be evaluated on a p-processor machine
in time O(max(w/p, s)).
Optimal schedule using p processors:
do (up to) p units of work each round
total work to do is w
needs at least s steps
Brent's Theorem
An expression with work w and span s
can be evaluated on a p-processor machine
in time O(max(w/p, s)).
Find the smallest p such that w/p ≤ s:
using more than this many processors
won't yield any further speed-up
example
w = 11, s = 4
min {p | w/p ≤ s} is 3
[diagram: a best schedule for 3 processors, rounds (i) to (v)]
a best schedule for 3 processors (5 rounds, 4 steps)
3 processors can do the work as fast as 5 (!)
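As a sanity check on this calculation: w/p ≤ s exactly when p ≥ w/s, so the smallest such p is the ceiling of w/s. A quick sketch (minProcs is a hypothetical helper, not from the lecture):

(* smallest p such that w/p <= s, i.e. ceil(w/s) *)
fun minProcs (w : int, s : int) : int = (w + s - 1) div s

val p = minProcs (11, 4)   (* = 14 div 4 = 3, matching the example *)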
next
Exploiting parallelism in ML
A signature for parallel collections
Cost analysis of implementations
Cost benefits of parallel algorithm design
sequences
signature SEQ =
sig
  type 'a seq
  exception Range
  val tabulate  : (int -> 'a) -> int -> 'a seq
  val length    : 'a seq -> int
  val nth       : int -> 'a seq -> 'a
  val map       : ('a -> 'b) -> 'a seq -> 'b seq
  val reduce    : ('a * 'a -> 'a) -> 'a -> 'a seq -> 'a
  val mapreduce : ('a -> 'b) -> 'b -> ('b * 'b -> 'b) -> 'a seq -> 'b
end
implementations
Many ways to implement the signature
lists, balanced trees, arrays, ...
For each one, we can give a cost analysis
There may be implementation trade-offs
arrays: item access is O(1)
trees: item access is O(log n)
Seq : SEQ
An abstract parameterized type of sequences
Think of a sequence as a parallel collection
With parallel-friendly operations
constant-time access to items
efficient map and reduce
We'll work today with an implementation
Seq : SEQ
based on vectors
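The lecture doesn't show the implementation itself, but a minimal sequential sketch of the vector-based pieces might look like this, using SML Basis vectors (written as top-level bindings for readability; map, reduce, and mapreduce are sketched later, where their costs are analyzed):

type 'a seq = 'a vector
exception Range

(* Vector.tabulate applies f at 0, ..., n-1; a parallel implementation
   would evaluate those n independent calls simultaneously *)
fun tabulate (f : int -> 'a) (n : int) : 'a seq =
  Vector.tabulate (n, f) handle Size => raise Range

fun length (s : 'a seq) : int = Vector.length s

(* constant-time access; out-of-range indices raise Range *)
fun nth (i : int) (s : 'a seq) : 'a =
  Vector.sub (s, i) handle Subscript => raise Range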
sequence values
A value of type t seq
is a sequence of values of type t
equality
Two sequence values are (extensionally) equal
iff they have the same length
and their items are equal
operations
For each operation in the signature SEQ
tabulate
tabulate f n = ⟨f 0, ..., f (n-1)⟩
examples
tabulate (fn x:int => x) 6 = ⟨0, 1, 2, 3, 4, 5⟩
tabulate (fn x:int => x*x) 6 = ⟨0, 1, 4, 9, 16, 25⟩
tabulate (fn _ => raise Range) 0 = ⟨ ⟩ (the function is never applied)
length
length ⟨v1, ..., vn⟩ = n
Work is O(1)
Span is O(1)
Cost graph is a single node
Contrast: List.length [v1,...,vn] = n
has work and span O(n)
nth
nth i ⟨v0, ..., v(n-1)⟩ = vi, if 0 ≤ i < n
nth i ⟨v0, ..., v(n-1)⟩ raises Range, otherwise
Work is O(1)
Span is O(1)
Cost graph is a single node
Seq provides constant-time access to items
map
map f ⟨v1, ..., vn⟩ = ⟨f v1, ..., f vn⟩
map f ⟨v1, ..., vn⟩ has cost graph
[diagram: parallel composition of G1, ..., Gn]
where each Gi is the cost graph for f vi
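One way to see where this graph comes from: map can be defined from tabulate, so the n applications of f are independent of one another. A sketch, assuming the vector-based operations above:

(* each f (nth i s) is independent: the parallel composition of G1, ..., Gn *)
fun map (f : 'a -> 'b) (s : 'a seq) : 'b seq =
  tabulate (fn i => f (nth i s)) (length s)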
reduce
reduce should be used to combine a sequence using
an associative function g with identity element z
We write v1 g v2 g ... g vn g z
for the result of combining v1, ..., vn, and z
reduce g z ⟨v1, ..., vn⟩ = v1 g v2 g ... g vn g z
reduce
When g is associative and z is an identity
reduce g z ⟨v1, ..., vn⟩ = v1 g v2 g ... g vn g z
If g is constant time, a divide-and-conquer
implementation gives work O(n) and span O(log n)
(Contrast with foldr, foldl on lists, whose span is O(n))
reduce (op +) 0 ⟨1, 2, 3, 4, 5, 6, 7, 8⟩
[cost graph: a balanced binary tree of + nodes,
adding 1+2, 3+4, 5+6, 7+8 in parallel,
then combining the partial sums pairwise:
((1+2) + (3+4)) + ((5+6) + (7+8))]
reduce cost
reduce g z ⟨v1, ..., v2n⟩ =
g (reduce g z ⟨v1, ..., vn⟩, reduce g z ⟨vn+1, ..., v2n⟩)
cost graph: G1,...,2n is the parallel composition of
G1,...,n and Gn+1,...,2n, followed by a node for g
W(2n) = 2 W(n) + c
S(2n) = S(n) + c
so W(n) is O(n)
and S(n) is O(log₂ n)
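A sketch of the divide-and-conquer reduce behind this recurrence, using VectorSlice from the SML Basis (written sequentially here; the cost analysis assumes the two recursive calls, which are independent, are evaluated in parallel):

fun reduce (g : 'a * 'a -> 'a) (z : 'a) (s : 'a seq) : 'a =
  let
    fun red sl =
      case VectorSlice.length sl of
        0 => z
      | 1 => VectorSlice.sub (sl, 0)   (* z is an identity, so no g needed *)
      | n => let
               val half  = n div 2
               val left  = VectorSlice.subslice (sl, 0, SOME half)
               val right = VectorSlice.subslice (sl, half, NONE)
             in
               (* the two recursive calls are independent *)
               g (red left, red right)
             end
  in
    red (VectorSlice.full s)
  end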
mapreduce
When g is associative and z is an identity,
mapreduce f z g ⟨v1, ..., vn⟩ = (f v1) g ... g (f vn) g z
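A sketch: with the operations above, mapreduce can be written as the composition of map and reduce, though a fused implementation would apply f at the leaves of the reduce tree and avoid building the intermediate sequence:

fun mapreduce (f : 'a -> 'b) (z : 'b) (g : 'b * 'b -> 'b) (s : 'a seq) : 'b =
  reduce g z (map f s)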
examples
fun sum (s : int seq) : int =
  reduce (op +) 0 s

fun count (s : int seq seq) : int =
  sum (map sum s)
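For instance (a hypothetical 3x3 example, assuming the vector-based sketches above):

val rows = tabulate (fn i => tabulate (fn j => i + j) 3) 3
(* ⟨⟨0,1,2⟩, ⟨1,2,3⟩, ⟨2,3,4⟩⟩ *)
val n = count rows   (* sum of the row sums: 3 + 6 + 9 = 18 *)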
analysis
fun sum (s : int seq) : int = reduce (op +) 0 s
fun count (s : int seq seq) : int = sum (map sum s)
analysis
Let s = ⟨s1, ..., sn⟩, with si = ⟨xi1, ..., xin⟩, and let ti = sum si
For each i, sum si = reduce (op +) 0 ⟨xi1, ..., xin⟩
[cost graph of sum si: a balanced tree of + nodes, height log₂ n]
work is O(n)
span is O(log n)
analysis
Let ti = sum si
count s = sum ⟨t1, ..., tn⟩
[cost graph of sum (map sum s): the graphs for sum s1, ..., sum sn
in parallel, each of height log₂ n, followed by a reduce tree
of height log₂ n]
work is n · O(n) + O(n) = O(n²)
span is O(log n) + O(log n) = O(log n)