
J Supercomput
DOI 10.1007/s11227-013-1022-8

A new parallel algorithm for vertex priorities of data flow acyclic digraphs

Zeyao Mo · Aiqing Zhang · Zhang Yang

Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, P.O. Box 8009, Beijing, 100088, China
e-mail: zeyao_mo@iapcm.ac.cn

© Springer Science+Business Media New York 2013

Abstract Data flow acyclic directed graphs (digraphs) are widely used to describe the data dependency of mesh-based scientific computing. The parallel execution of such digraphs can approximately depict the flowchart of parallel computing. During parallel execution, vertex priorities are key performance factors. This paper first takes the distributed digraph and its resource-constrained parallel scheduling as the vertex priorities model, and then presents a new parallel algorithm for the solution of vertex priorities using the well-known technique of forward-backward iterations. In particular, a more efficient vertex ranking strategy is proposed in each iteration. In the case of simple digraphs, both theoretical analysis and benchmarks show that the vertex priorities produced by such an algorithm make the digraph scheduling time converge non-increasingly with the number of iterations. In the case of non-simple digraphs, benchmarks also show that the new algorithm is superior to many traditional approaches. Embedding the new algorithm into the heuristic framework for the parallel sweeping solution of neutron transport applications, the new vertex priorities improve the performance by 20 % or so while the number of processors scales up from 32 to 2048.

Keywords Acyclic digraph · Parallel algorithm · Neutron transport

1 Introduction
The data flow acyclic directed graphs (digraphs) [9] are usually used to describe the data dependency for a wide range of mesh-based scientific computing. Each of these digraphs consists of weighted vertices and arcs: each vertex often refers to a mesh cell, and its weight often represents the workload; each arc often depicts the data dependency between two neighboring cells, and its weight often represents the dependency overhead.
The parallel sweeping solvers are the numerical kernel for the seven-dimensional radiation or neutron transport equations [20] when the discrete ordinates methods (Sn) are used. They are typical mesh-based scientific computing applications whose data dependency is suitable for digraph description. Baker et al. [2] and Koch et al. [19] addressed these solvers on rectangular meshes on earlier massively parallel computers, Plimpton et al. [29] and Pautz et al. [27] extended this research to unstructured meshes, Mo et al. [33] supplemented these works for the cylindrical coordinate system, and recently Pautz et al. [28] presented another heuristic method to improve the inherent parallelism for long-characteristics Sn discretizations. Besides these parallel sweeping solvers, many other mesh-based applications exist and are suitable for digraph description, for example, the parallel downstream relaxation for the direct solution of upper or lower sparse triangular linear systems arising from the discretization of convection-dominated problems [4, 11, 13], the well-known ILU factorization [12], the dense matrix LU factorization and their multi-threaded versions [3], the patch-based structured mesh AMR simulations [22, 25] and their multi-threaded versions [23], and so on.
The flowchart of parallel computing for the above mesh-based scientific computing applications can be approximately depicted by the parallel execution of the associated digraphs. Nevertheless, the solution for the minimal execution time of the digraphs is still NP-hard [18]. Mo et al. [32] present a heuristic framework. It consists of three components. The first is the partitioning method assigning digraph vertices across processors, the second is the parallel sweeping solver for the execution of the distributed digraph, and the third is the vertex priorities strategy deciding which vertex should be executed when many vertices are executable in each processor. For a given distributed digraph, the vertex priorities approach is the most crucial for parallel efficiency.
There are two types of approaches for the calculation of vertex priorities: local and global. The local approaches only use the data flow locally in each processor, while the global approaches use the data flow of the digraph across processors. The First-In-First-Out (FIFO) strategy, the Geometrical Coordinates KBA strategy [29], the Shortest processor-Boundary Path strategy (SBP) [32], and the Sweeping Direction Upwind strategy [33] are typical local approaches. The Largest End Time strategy [7], the Latest Start Time strategy (LST) [17], the Least Relaxation Time strategy [7], the Maximal Number of Successors strategy [1], the hybrid strategies [5, 30], the sampling strategies [6, 8], and the Depth First sweeping strategy (DFHDS) [27] are typical global approaches. Usually, the local approaches are cheaper but less efficient; the global approaches are favorable when the vertex priorities are reusable.
Generally, most of the above vertex priorities approaches can be depicted by the well-known resource-constrained scheduling models widely used for many digraph-based projects or networks [9, 15, 16], except that the constrained resources refer to the number of available processors. Each of these models produces a parallel scheduling for which each vertex has a start and an end time for execution, and the parallel execution time of the scheduling is equal to the difference between the maximal vertex end time and the minimal vertex start time. Taking the start or the


end time as the vertex priorities, most of the above local or global approaches can be reproduced.
It is heuristic that the vertex priorities approach yielding the minimal parallel execution time is perhaps the most suitable for the parallel sweeping solvers of mesh-based scientific computing. So, taking the vertex priorities output from the above local or global approaches as the input, the well-known Forward-Backward technique (FB) presented by Li et al. [21] can be used to iteratively reduce the parallel execution time. Finally, a better scheduling is reproduced; each vertex can take the new start or end time as its priority. It is obviously heuristic that the reproduced priorities are more effective because larger inherent parallelism is achieved.
Based on the above heuristic observation, this paper presents a new parallel algorithm for vertex priorities by parallelizing the forward-backward iterations. In particular, a new efficient vertex ranking strategy is designed for each forward or backward iteration. In the case of simple digraphs, where each vertex has equal weight and each arc has zero weight, both theoretical analysis and benchmarks show that the vertex priorities produced by the new algorithm make the parallel execution time converge non-increasingly with the number of forward and backward iterations. In the case of non-simple digraphs, benchmarks also show that the new algorithm is superior to many traditional approaches. In particular, the new algorithm improves the speedup of the approach SBP from 201 to 301 when 500 processors are used for a digraph in the scale of 140 K vertices and 280 K arcs. Furthermore, embedding the new algorithm into the heuristic framework for the parallel sweeping solution of neutron transport applications, the new vertex priorities improve the performance by 20 % or so while the number of processors scales up from 32 to 2048.
This paper is organized as follows. The second section defines the resource-constrained scheduling model for vertex priorities, the third section gives the parallel forward-backward iterations, the fourth section gives the vertex ranking strategy, and the next two sections give the convergence and the complexity analysis. The last two sections list the performance results for theoretical benchmarks and real neutron transport applications. Finally, this paper is concluded.

2 The resource-constrained scheduling model


Denote by $G = (V, Q, E, W)$ an acyclic digraph. Here, $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices, $Q = \{q_1, q_2, \ldots, q_n\}$ is the set of vertex weights, $E = \{(v_{i_1}, v_{j_1}), \ldots, (v_{i_m}, v_{j_m})\}$ is the set of arcs, and $W = \{w_{i_k, j_k}\}_{k=1}^{m}$ is the matrix of arc weights.
A path is a sequence of vertices $(u_1, u_2, \ldots, u_s)$ with $(u_i, u_{i+1}) \in E$. A cycle is a special path with $u_1 = u_s$. A digraph is acyclic and computable if and only if each path is acyclic. The length of a path is the weight sum of all its vertices and arcs. The diameter of a digraph is the length of its longest path.
Vertex $u_i$ is a predecessor of vertex $u_j$, or vertex $u_j$ is a successor of vertex $u_i$, if and only if there is a path in which $u_i$ appears before $u_j$. For each arc $(u_i, u_j)$, vertex $u_i$ is the head and vertex $u_j$ is the tail. A vertex is a source if it has no heads, and a vertex is a sink if it has no tails.


A simple digraph is a digraph in which all vertices have equal weights and all arcs have zero weights. Simple digraphs often arise in the field of mesh-based scientific computing. For example, the vertex weights are usually equal to each other because the computational formulae are similar across mesh cells, and the data transfer overheads between neighboring vertices are negligible provided that each vertex has enough computational workload.
A distributed digraph consists of many non-overlapping sub-digraphs, each assigned to a processor. A vertex is local to a sub-digraph if and only if it belongs to this sub-digraph; a vertex is local to another vertex if and only if they belong to the same sub-digraph. An arc is a local arc if and only if both its head and tail belong to the same sub-digraph; otherwise, it is a cut arc. Usually, a sub-digraph includes all its local vertices and all its arcs whose head or tail is local. Denote by $R = (r_1, r_2, \ldots, r_n)$ the mapping vector whose element $r_i$ ($1 \le r_i \le P$) is the rank of the processor owning vertex $v_i$, where $P$ is the number of processors.
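To make these definitions concrete, the following minimal Python sketch stores a weighted digraph together with the mapping vector R and exposes the head/tail and cut-arc notions used below. The class and field names are illustrative choices only; they are not taken from the paper or from any existing library.

```python
from dataclasses import dataclass, field

@dataclass
class DistributedDigraph:
    """Minimal container for a weighted acyclic digraph distributed over P processors."""
    n: int                                      # number of vertices v_0 .. v_{n-1}
    q: list                                     # q[i]: weight of vertex v_i
    arcs: dict = field(default_factory=dict)    # (i, j) -> w_{i,j}; i is the head, j the tail
    rank: list = field(default_factory=list)    # rank[i]: processor owning v_i (mapping vector R)

    def heads(self, j):
        """D_j: the heads i of all arcs (i, j) entering vertex j."""
        return [i for (i, jj) in self.arcs if jj == j]

    def is_cut_arc(self, i, j):
        """An arc is a cut arc when its head and tail belong to different sub-digraphs."""
        return self.rank[i] != self.rank[j]

    def sources(self):
        """Vertices with no heads."""
        return [j for j in range(self.n) if not self.heads(j)]
```

The sketches in the following sections assume this hypothetical container.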
Given a distributed digraph with $P$ sub-digraphs, the vertex priorities approaches introduced in the first section are equivalent to the solution of the following resource-constrained scheduling model:

$$\min \Pi[\mathbf{f}] \quad \text{subject to} \quad
\begin{cases}
f_i + w_{i,j} \le f_j - q_j, & i \in D_j,\ j = 1, \ldots, n\\
|A_k(t)| \le 1, & k = 1, \ldots, P,\ \forall t
\end{cases} \tag{1}$$

Here, $\mathbf{f} = (f_1, f_2, \ldots, f_n)$ is a scheduling and is represented by a vector of vertex end times, $n$ is the number of vertices, $P$ is the number of processors, $D_j$ is the set of heads of vertex $v_j$,

$$\beta = \max_{1 \le j \le n} \{f_j\}, \qquad \alpha = \min_{1 \le j \le n} \{f_j - q_j\}, \qquad \Pi[\mathbf{f}] = \beta - \alpha \tag{2}$$

denote the end time, the start time, and the execution time of the digraph, respectively, and

$$A_k(t) = \{v_j : f_j - q_j \le t < f_j\} \cap \{v_j : r_j = k\} \tag{3}$$

is the set of vertices executing on processor $k$ at time $t$. Obviously, if each vertex $v_i$ is assigned the priority $f_i$, then the digraph has the final execution time $\Pi[\mathbf{f}]$.
Two constraints are imposed in Eq. (1). The first is the sequence constraint such that a vertex should not execute until all its predecessors have finished, and the second is the resource constraint such that at most one vertex executes concurrently in each processor. A scheduling is feasible if and only if these constraints are satisfied. A scheduling is optimal if and only if $\Pi[\mathbf{f}]$ is minimal.
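As a small illustration of the model, the sketch below checks the two constraints of Eq. (1) for a candidate vector of end times f and evaluates the execution time Π[f] of Eq. (2). It assumes the hypothetical DistributedDigraph container sketched above; it is not part of the authors' framework.

```python
def execution_time(f, q):
    """Pi[f] = beta - alpha with beta = max_j f_j and alpha = min_j (f_j - q_j), as in Eq. (2)."""
    beta = max(f)
    alpha = min(fj - qj for fj, qj in zip(f, q))
    return beta - alpha

def is_feasible(g, f, eps=1e-12):
    """Check the sequence and resource constraints of Eq. (1) for the end times f."""
    # Sequence constraint: f_i + w_{i,j} <= f_j - q_j for every arc (i, j).
    for (i, j), w in g.arcs.items():
        if f[i] + w > f[j] - g.q[j] + eps:
            return False
    # Resource constraint: at most one vertex per processor at any time, i.e. the
    # execution intervals [f_j - q_j, f_j) on one processor must not overlap.
    intervals_by_proc = {}
    for j in range(g.n):
        intervals_by_proc.setdefault(g.rank[j], []).append((f[j] - g.q[j], f[j]))
    for intervals in intervals_by_proc.values():
        intervals.sort()
        for (s0, e0), (s1, e1) in zip(intervals, intervals[1:]):
            if s1 < e0 - eps:
                return False
    return True
```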

3 Parallel forward-backward iterations


Li et al. [21] present a serial technique of Forward-Backward iterations (FB) to reduce the execution time of a feasible scheduling for projects or networks. Here,

we consider its parallel version for a distributed digraph in Algorithm 3.1. Different from the serial forward-backward iterations, a new ranking strategy is introduced that satisfies the sequence constraint of Eq. (1).
Algorithm 3.1 PFB($G$, $r$, $q$, $w$, $\sigma$, $\tau$, $\mathbf{f}^0$, $M_{its}$, $\varepsilon$, $\mathbf{f}$)

INPUT:
  $G$       : local sub-digraph of the digraph;
  $r$       : processor mapping vector for all vertices of the digraph;
  $q$       : weight vector of local vertices;
  $w$       : weight matrix of both local arcs and cut arcs;
  $\sigma$  : forward ranks of local vertices; a smaller rank has a higher priority;
  $\tau$    : backward ranks of local vertices; a larger rank has a higher priority;
  $\mathbf{f}^0$ : initial scheduling of local vertices;
  $M_{its}$ : maximal number of forward-backward iterations;
  $\varepsilon$ : convergence error threshold.

OUTPUT:
  $\mathbf{f}$ : final scheduling of local vertices.

BEGIN
$h = 0$, $m$ = number of local vertices.
$\alpha^0$ = start time of the initial scheduling, $\beta^0$ = end time of the initial scheduling.
DO in PARALLEL {
  (1) Execute the backward iteration for schedule $\mathbf{f}^{h+1/2}$ from $\mathbf{f}^h$.
    (1.1) Compute ranks for local vertices from $\mathbf{f}^h$ and sort.
      (1.1.1) Compute $\tau$ from $\mathbf{f}^h$ using a rank strategy as discussed in the next section.
      (1.1.2) Order the local vertex indices $\{i_g\}_{g=1,\ldots,m}$ satisfying
              $(\tau_{i_1}, f^h_{i_1}) \succeq (\tau_{i_2}, f^h_{i_2}) \succeq \cdots \succeq (\tau_{i_m}, f^h_{i_m})$.
              Here, $(a_1, b_1) \succeq (a_2, b_2)$ means $(a_1 > a_2) \,||\, ((a_1 = a_2) \,\&\, (b_1 > b_2))$.
    (1.2) $\beta^{h+1/2} = \beta^h$, $I = (-\infty, \beta^{h+1/2})$ // the initially available interval.
    (1.3) FOR each vertex $i_g$: $g = 1, 2, \ldots, m-1, m$ DO {
      (1.3.1) Let $t_{i_g} = f^h_{i_g}$ and go to step (1.3.4) if $v_{i_g}$ is a sink.
      (1.3.2) Receive every $s^{h+1/2}_j$ from each remote tail $v_j$.
      (1.3.3) Compute $t_{i_g} = \min_{(v_{i_g}, v_l) \in E} \{ s^{h+1/2}_l - w_{i_g, l} \}$.
      (1.3.4) Update the latest end time $f^{h+1/2}_{i_g} = \max\{ t : (t \le t_{i_g}) \,\&\, ((t - q_{i_g}, t) \subseteq I) \}$.
      (1.3.5) Update the latest start time $s^{h+1/2}_{i_g} = f^{h+1/2}_{i_g} - q_{i_g}$.
      (1.3.6) Update $I = I \setminus (s^{h+1/2}_{i_g}, f^{h+1/2}_{i_g})$.
      (1.3.7) Send $s^{h+1/2}_{i_g}$ to each processor owning heads of $v_{i_g}$.
    } END for $g = 1, 2, \ldots, m-1, m$.
    (1.4) Synchronize the latest start time across all processors:
          $\alpha^{h+1/2} = \min\{ s^{h+1/2}_l : v_l \in V \}$.
  (2) Execute the forward iteration for schedule $\mathbf{f}^{h+1}$ from $\mathbf{f}^{h+1/2}$.
    (2.1) Compute ranks for local vertices using $\mathbf{f}^{h+1/2}$ and sort.
      (2.1.1) Compute $\sigma$ from $\mathbf{f}^{h+1/2}$ using a rank strategy as discussed in the next section.
      (2.1.2) Order the local vertex indices $\{i_g\}_{g=1,\ldots,m}$ satisfying
              $(\sigma_{i_1}, s^{h+1/2}_{i_1}) \preceq (\sigma_{i_2}, s^{h+1/2}_{i_2}) \preceq \cdots \preceq (\sigma_{i_m}, s^{h+1/2}_{i_m})$,
              where $\preceq$ is defined analogously with $<$ in place of $>$.
    (2.2) $\alpha^{h+1} = \alpha^{h+1/2}$, $I = (\alpha^{h+1}, +\infty)$ // the initially available interval.
    (2.3) FOR each vertex $i_g$: $g = 1, 2, \ldots, m-1, m$ DO {
      (2.3.1) Let $t_{i_g} = s^{h+1/2}_{i_g}$ and go to step (2.3.4) if $v_{i_g}$ is a source.
      (2.3.2) Receive every $f^{h+1}_l$ from each remote head $v_l$.
      (2.3.3) Compute $t_{i_g} = \max_{(v_j, v_{i_g}) \in E} \{ f^{h+1}_j + w_{j, i_g} \}$.
      (2.3.4) Update the earliest start time $s^{h+1}_{i_g} = \min\{ t : (t \ge t_{i_g}) \,\&\, ([t, t + q_{i_g}) \subseteq I) \}$.
      (2.3.5) Update the earliest end time $f^{h+1}_{i_g} = s^{h+1}_{i_g} + q_{i_g}$.
      (2.3.6) Update $I = I \setminus (s^{h+1}_{i_g}, f^{h+1}_{i_g})$.
      (2.3.7) Send $f^{h+1}_{i_g}$ to each processor owning tails of $v_{i_g}$.
    } END for $g = 1, 2, \ldots, m-1, m$.
    (2.4) Synchronize the earliest end time across all processors:
          $\beta^{h+1} = \max\{ f^{h+1}_j : v_j \in V \}$.
  (3) $h = h + 1$.
} UNTIL ($h > M_{its}$ or $|(\beta^{h-1} - \alpha^{h-1}) - (\beta^{h} - \alpha^{h})| < \varepsilon$).
Remark 3.1 The sequence of vertices $v_{i_g}$ ($g = 1, 2, \ldots, m-1, m$) determined by the ranks computed in step (1.1.1) or step (2.1.1) satisfies the sequence constraint of Eq. (1).

Remark 3.2 In step (1.3.4), $t_{i_g}$ is the latest end time satisfying the sequence constraint, and $f_{i_g}^{h+1/2}$ is the latest end time satisfying both constraints.

Remark 3.3 In step (2.3.4), $t_{i_g}$ is the earliest start time satisfying the sequence constraint, and $s_{i_g}^{h+1}$ is the earliest start time satisfying both constraints.

Remark 3.4 The last row in Algorithm 3.1 is the termination condition. If the sequence $\Pi[\mathbf{f}^h]$ converges non-increasingly with $h$, the final output is the solution; otherwise, we should take the output having the shortest execution time in the solution history.
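To make the control flow of the forward iteration easier to follow, here is a deliberately simplified, serial Python sketch for a simple digraph (unit vertex weights, zero arc weights). It emulates the per-processor resource constraint within a single process, omits the message passing of steps (1.3.2), (1.3.7), (2.3.2), (2.3.7) and the interval bookkeeping of the general algorithm, and is only an illustration of the idea, not the authors' implementation.

```python
def forward_pass(g, priority):
    """Earliest-start list scheduling of a simple digraph (unit vertex weights, zero arc
    weights). priority[j]: a smaller value is scheduled earlier among the ready vertices
    of one processor. Returns the vector f of vertex end times, measured in steps."""
    preds = {j: [] for j in range(g.n)}
    for (i, j) in g.arcs:
        preds[j].append(i)
    f = [None] * g.n
    t, remaining = 0, g.n
    while remaining:
        for p in set(g.rank):
            # vertices of processor p whose predecessors have all finished by time t
            ready = [j for j in range(g.n)
                     if f[j] is None and g.rank[j] == p
                     and all(f[i] is not None and f[i] <= t for i in preds[j])]
            if ready:                                   # resource constraint: one vertex per step
                j = min(ready, key=lambda v: priority[v])
                f[j] = t + 1
                remaining -= 1
        t += 1
    return f
```

The backward iteration of Algorithm 3.1 is the mirror image of this pass: it assigns latest end times bounded by β, and the start or end times produced by one pass feed the ranking strategy of the next pass in step (1.1.1) or step (2.1.1).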


4 Vertex ranking strategies


In step (1.1) and step (2.1), the vertex ranking strategies are crucial not only for the convergence but also for the quality of the sequence $\Pi[\mathbf{f}^h]$. Li et al. [21] assign each vertex the rank of its end time after the last backward iteration and the rank of its start time after the last forward iteration, Ozdamar et al. [26] present another similar but more complex strategy, and Mo et al. [32] use the Shortest processor-Boundary Path strategy (SBP), for which vertices are ranked by their distance away from the sub-digraph boundaries. The local approaches for vertex priorities introduced in the first section can also give rank sequences. However, no matter what strategies are used, one problem remains open: whether and how $\Pi[\mathbf{f}^h]$ converges.
In the remainder of this section, a new vertex ranking strategy called the Cut Arc Preference strategy (CAP) is proposed. It not only makes the above open problem answerable but also gives better execution time. Moreover, its computational complexity is similar to that of other local approaches for vertex priorities in the literature such as SBP. This strategy mainly differs from traditional strategies in the viewpoint that cut arcs should be preferentially executed, since their earlier finish indicates larger inherent parallelism.
In step (2.1.1), after a backward iteration, we rank a vertex by the earliest start time of all its downstream cut arcs as follows:

$$\sigma_i = \min\Bigl( \bigl\{ s_j^{h+1/2} - w_{k,j} : (v_k, v_j) \in S_i \bigr\} \cup \{+\infty\} \Bigr) \tag{4}$$

Here, $S_i$ is the set of cut arcs whose heads are the successors of vertex $v_i$, $w_{k,j}$ is the weight of cut arc $(v_k, v_j)$, $s_j^{h+1/2}$ is the start time after the last backward iteration, and the superscript $h$ represents the iteration step. A smaller $\sigma_i$ indicates a higher priority.
Similarly, in step (1.1.1), after a forward iteration, we rank a vertex by the latest end time of all its upstream cut arcs as follows:

$$\tau_i = \max\Bigl( \bigl\{ f_j^{h} + w_{j,k} : (v_j, v_k) \in B_i \bigr\} \cup \{-\infty\} \Bigr) \tag{5}$$

Here, $B_i$ is the set of local cut arcs whose tails are the predecessors of vertex $v_i$, $w_{j,k}$ is the weight of cut arc $(v_j, v_k)$, $f_j^{h}$ is the end time after the last forward iteration, and the superscript $h$ represents this iteration. A larger $\tau_i$ indicates a higher priority.
Strategy CAP naturally satisfies the sequence constraint in steps (2.1.2) and (1.1.2). In fact, assuming vertex $v_a$ is a predecessor of vertex $v_b$, vertex set $S_b$ must be a subset of $S_a$, so $\sigma_a \le \sigma_b$. Similarly, assuming vertex $v_c$ is a successor of vertex $v_d$, vertex set $B_d$ must be a subset of $B_c$, so $\tau_c \ge \tau_d$.
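A possible way to evaluate the CAP ranks of formulas (4) and (5) is sketched below. It assumes the hypothetical DistributedDigraph container introduced in Section 2, treats a vertex as a successor or predecessor of itself for the membership tests of S_i and B_i (as the worked example below suggests), and uses a brute-force reachability search, so it is an illustration rather than the authors' implementation.

```python
import math

def cap_ranks(g, s_half, f_prev):
    """CAP ranks: sigma_i from formula (4) and tau_i from formula (5).

    s_half[j] : start time s_j^{h+1/2} after the last backward iteration
    f_prev[j] : end time f_j^h after the last forward iteration
    """
    succ = {i: [] for i in range(g.n)}
    for (i, j) in g.arcs:
        succ[i].append(j)

    def reachable(i):
        """Vertex i together with all of its successors (plain DFS)."""
        seen, stack = {i}, [i]
        while stack:
            for j in succ[stack.pop()]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen

    cut_arcs = [(k, j) for (k, j) in g.arcs if g.rank[k] != g.rank[j]]
    sigma = [math.inf] * g.n
    tau = [-math.inf] * g.n
    for v in range(g.n):
        down = reachable(v)                                # v and its successors
        up = {u for u in range(g.n) if v in reachable(u)}  # v and its predecessors
        for (k, j) in cut_arcs:
            w = g.arcs[(k, j)]
            if k in down:                                  # head is a successor of v: arc is in S_v
                sigma[v] = min(sigma[v], s_half[j] - w)
            if j in up and g.rank[j] == g.rank[v]:         # local cut arc whose tail precedes v: arc is in B_v
                tau[v] = max(tau[v], f_prev[k] + w)
    return sigma, tau
```

Within each sub-digraph, sigma and tau then serve as the primary keys of the sorts in steps (2.1.2) and (1.1.2), with the start or end times breaking ties.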
Let CAP-F be the forward iteration using formula (4) and CAP-B be the backward iteration using formula (5), and denote the forward-backward iteration as CAP-PFB. Figure 1 gives an example to illustrate these iterations for a scheduling. Figure 1a gives a distributed simple digraph with two sub-digraphs divided by the dashed line. Each circle and its central number represent a vertex; the weight of each vertex is normalized to 1.0. Each arrow represents an arc showing the data dependency between two vertices; the weight of each arc is equal to zero. Three cut arcs exist along the dashed line. Figure 1b gives an initial scheduling. The horizontal axis is the execution steps


Fig. 1 One backward-forward iteration of CAP-PFB

representing the execution time, and the vertical axis is the executed vertices across the two processors. This scheduling requires 7 steps in total.
Taking the set of end times $\{f_i^0\}_{i=1}^{10}$ in Fig. 1b as the input, Fig. 1c assigns each vertex a rank defined by formula (5). All local vertices in the 0th processor have the rank $-\infty$ since no local upstream cut arcs exist. $B_9 = \{(v_2, v_9), (v_3, v_7)\}$, $f_3^0 = 3$, $f_2^0 = 2$, so $\tau_9 = 3$. Similarly, we have $\tau_7 = \tau_{10} = 3$ and $\tau_8 = f_4^0 = 4$. After one backward iteration as stated in Algorithm 3.1, a better scheduling is generated in Fig. 1d, where one step is reduced.
Taking the set of start times $\{s_i^{1/2}\}_{i=1}^{10}$ in Fig. 1d as the input, Fig. 1e updates the vertex ranks defined by formula (4). All local vertices in the 1st processor have the rank $+\infty$ since no local downstream cut arcs exist. $S_1 = \{(v_2, v_9), (v_4, v_8)\}$, $s_9^{1/2} = 4$, $s_8^{1/2} = 6$, so $\sigma_1 = 4$. Similarly, we have $\sigma_2 = 4$, $\sigma_3 = 3$, and $\sigma_4 = 6$. After one forward iteration as stated in Algorithm 3.1, a better scheduling is generated again in Fig. 1f, where another step is further reduced.
Until now, one backward-forward iteration finishes and an optimal scheduling with 5 steps is gained. However, the ranking strategies presented in [21, 26] cannot improve the result in Fig. 1b.

5 Convergence analysis for CAP-PFB


This section gives the convergence analysis of algorithm CAP-PFB for simple digraphs. Without loss of generality, we normalize the vertex weights to $q_1 = q_2 = \cdots = q_n = 1$.


Lemma 5.1 If the distributed acyclic digraph is simple, then given a local vertex $v_k$ in the forward iteration, we can conclude

$$f_{j_c}^{h+1} \le s_k^{h+1/2}, \qquad c = 1, 2, \ldots, L \tag{6}$$

Here, $J = \{v_{j_c}\}_{c=1}^{L}$ is a special list of vertices ordered by step (2.1.2), and each vertex is the head of a cut arc whose tail is a predecessor of vertex $v_k$.
Proof We use induction on the size $L$ of the list $J$. Let $L = 1$. Assume vertex $v_{j_1}$ belongs to processor $r_{j_1}$. Let $H_1$ be the set of vertices ordered before $v_{j_1}$ in processor $r_{j_1}$ together with $v_{j_1}$ itself. Because $\sigma$ satisfies the sequence constraint and $\sigma_{j_1}$ is the smallest, we have

$$f_i^{h+1/2} \le f_{j_1}^{h+1/2}, \qquad \forall v_i \in H_1,\ i \ne j_1 \tag{7}$$

Moreover, $H_1$ is scheduled first in processor $r_{j_1}$, so

$$f_{j_1}^{h+1} = \alpha^{h+1} + |H_1| = \alpha^{h+1/2} + |H_1| \le f_{j_1}^{h+1/2} \le s_k^{h+1/2} \tag{8}$$

Here, $|H_1|$ is the weight sum of the vertices in $H_1$.


Assume Lemma 5.1 is correct for $L = C - 1$ such that

$$f_{j_c}^{h+1} \le s_k^{h+1/2}, \qquad c = 1, 2, \ldots, C - 1 \tag{9}$$

then we conclude that it is also correct for $L = C$.


Let $H_C$ be the set of vertices ordered before vertex $v_{j_C}$ in processor $r_{j_C}$ together with $v_{j_C}$ itself. For this set, we conclude that

$$\exists\, v_j \in H_C \quad \text{s.t.} \quad f_{j_C}^{h+1} \le f_j^{h+1/2} \tag{10}$$

Firstly, assume processor $r_{j_C}$ is busy in the interval $[\alpha^{h+1}, f_{j_C}^{h+1})$; then we have $f_{j_C}^{h+1} = \alpha^{h+1} + |H_C|$. Let $j^* = \arg\max\{f_i^{h+1/2} : v_i \in H_C\}$; then $f_{j^*}^{h+1/2} \ge \alpha^{h+1/2} + |H_C|$, so $f_{j_C}^{h+1} \le f_{j^*}^{h+1/2}$.
Secondly, assume $[f_l^{h+1}, s_a^{h+1})$ is the last idle segment in the interval $[\alpha^{h+1}, f_{j_C}^{h+1})$, and let $K \subseteq H_C$ be the set of vertices scheduled in the segment $[s_a^{h+1}, f_{j_C}^{h+1})$; then $f_{j_C}^{h+1} = s_a^{h+1} + |K|$, so

$$\max\bigl\{f_j^{h+1} : (v_j, v_i) \in E\bigr\} \ge s_a^{h+1} \quad \text{for some } v_i \in K \tag{11}$$

in that each arc has zero weight. Let

$$\bar{k} = \arg\min\bigl\{f_i^{h+1/2} : v_i \in K\bigr\} \tag{12}$$

$$\hat{k} = \arg\max\bigl\{f_j^{h+1} : (v_j, v_{\bar{k}}) \in E\bigr\} \tag{13}$$

then

$$s_a^{h+1} \le f_{\hat{k}}^{h+1} \tag{14}$$

so vertex $v_{\hat{k}}$ does not belong to the set $K$, and $(v_{\hat{k}}, v_{\bar{k}})$ is a cut arc whose head belongs to the list $J$. Moreover, $\hat{k} = j_{\hat{c}}$ with $\hat{c} < C$ because $\sigma_{\hat{k}} \le \sigma_{\bar{k}} \le \sigma_{j_C}$. Therefore,

$$f_{\hat{k}}^{h+1} \le s_{\bar{k}}^{h+1/2} \tag{15}$$

from the induction assumption in Eq. (9). So, we have $f_{j_C}^{h+1} \le s_{\bar{k}}^{h+1/2} + |K|$ from the above five equations. Let $j^* = \arg\max\{f_i^{h+1/2} : v_i \in K\}$; then $s_{\bar{k}}^{h+1/2} + |K| \le f_{j^*}^{h+1/2}$, so $f_{j_C}^{h+1} \le f_{j^*}^{h+1/2}$. Equation (10) is obtained.
From Eq. (10) and the inequality $\sigma_{j^*} \le \sigma_{j_C}$, we can further conclude that

$$f_{j_C}^{h+1} \le f_{j^*}^{h+1/2} \le \sigma_{j^*} \le \sigma_{j_C} \le s_k^{h+1/2} \tag{16}$$

Inequality (6) and Lemma 5.1 are correct. $\square$

Lemma 5.2 If the distributed acyclic digraph is simple, the scheduling $\mathbf{f}^{h+1}$ generated by the forward iteration CAP-F is non-increasing such that

$$\Pi\bigl[\mathbf{f}^{h+1}\bigr] \le \Pi\bigl[\mathbf{f}^{h+1/2}\bigr] \tag{17}$$

Here, $\Pi[\mathbf{f}]$ is the parallel execution time as defined in Eq. (2).

Proof Given a vertex $v_k$, assume one of its successors is the head of a cut arc $(v_{j_c}, v_t)$; then we have

$$f_k^{h+1} \le f_{j_c}^{h+1} \le s_t^{h+1/2} < f_t^{h+1/2} \le \beta^{h+1/2} \tag{18}$$

by Lemma 5.1 and the sequence constraint. Otherwise, $v_k$ is a sink or a local predecessor of a sink. Let vertex $v_s$ be that sink; we can also conclude that $f_k^{h+1} \le f_s^{h+1}$ by the sequence constraint and

$$f_s^{h+1} \le f_j^{h+1/2} \le \beta^{h+1/2} \tag{19}$$

by Eq. (10). Therefore, we always have

$$\Pi\bigl[\mathbf{f}^{h+1}\bigr] = \beta^{h+1} - \alpha^{h+1} \le \beta^{h+1/2} - \alpha^{h+1/2} = \Pi\bigl[\mathbf{f}^{h+1/2}\bigr] \tag{20}$$

Lemma 5.2 is obtained. $\square$

Similarly, we have the following conclusion.

Lemma 5.3 If the distributed acyclic digraph is simple, the scheduling $\mathbf{f}^{h+1/2}$ generated by the backward iteration CAP-B is non-increasing such that

$$\Pi\bigl[\mathbf{f}^{h+1/2}\bigr] \le \Pi\bigl[\mathbf{f}^{h}\bigr] \tag{21}$$

Here, $\Pi[\mathbf{f}]$ is the parallel execution time as defined in Eq. (2).
From Lemmas 5.2 and 5.3, we have the following conclusion.


Theorem 5.1 The sequence $\{\Pi[\mathbf{f}^h]\}_{h=0}^{+\infty}$ generated by CAP-PFB of Algorithm 3.1 converges non-increasingly for simple acyclic digraphs starting from any initial scheduling $\mathbf{f}^0$.
6 Complexity analysis for CAP-PFB
The calculation complexity of strategy CAP-PFB is the same as that of strategy FB apart from the vertex ranking strategies. Let $N$ be the total number of vertices and $P$ be the number of processors, and let each sub-digraph $G$ have an equal number of vertices; then step (1.1.1) and step (2.1.1) have equal complexity $O(\frac{N}{P})$. Furthermore, step (1.1.2) and step (2.1.2) have equal complexity $O(\frac{N}{P}\log\frac{N}{P})$. The calculation complexities of step (1.3) and step (2.3) are the same as those of step (1.1.1) and step (2.1.1). The global reduction in step (1.4) or step (2.4) has the calculation complexity of $O(\frac{N}{P}) + O(\log P)$. So, Algorithm 3.1 has the following calculation complexity:

$$T_{cal} = O\Bigl(\frac{N}{P}\Bigr) + O\Bigl(\frac{N}{P}\log\frac{N}{P}\Bigr) + O\Bigl(\frac{N}{P}\Bigr) + O(\log P) = O\Bigl(\frac{N}{P}\log\frac{N}{P}\Bigr) + O(\log P) \tag{22}$$

Assume the average degree of the vertices is $O(D)$ and the digraph is uniformly partitioned; then each processor has $O(\sqrt{N/P})$ cut arcs in the two-dimensional case. So, the message passing complexity is about

$$T_{mes} = O\Bigl(D\sqrt{\tfrac{N}{P}}\Bigr) \tag{23}$$

Similarly, the message passing complexity can be estimated in the three-dimensional case as

$$T_{mes} = O\Bigl(D\bigl(\tfrac{N}{P}\bigr)^{\frac{2}{3}}\Bigr) \tag{24}$$
It is difficult to evaluate the overheads for steps (1.3.2) and (1.3.7) because of the uncertainty of message passing. However, they can be estimated as follows:

$$T_{pas} = O(T_{mes} \cdot L) \tag{25}$$

Here, $L$ is the message passing latency. Similarly, this estimation is suitable for steps (2.3.2) and (2.3.7).
The complexity of Algorithm 3.1 is the sum of $T_{cal}$ and $T_{pas}$. In the case of $N \gg P$, $T_{cal}$ dominates. If we fix $N/P$ and increase $P$, the global reduction will dominate. If the digraph parallelism is small, $T_{pas}$, i.e., the message passing latency, will dominate.
In fact, Algorithm 3.1 is a sequence of iterations, and each iteration has double the overhead of one parallel sweep as introduced in the first section. So, Algorithm 3.1 has the complexity of $2M$ parallel sweeps provided that it converges within $M$ iterations. This implies that Algorithm 3.1 has the disadvantage that it is useful only if the vertex priorities are reusable for the parallel sweeping of a series of digraphs or the parallel sweeping is repeated in real applications.
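For a rough feeling of how these terms scale, the sketch below evaluates the leading-order expressions of Eqs. (22)-(24) with all hidden constants set to one; both the constants and the chosen example values are purely illustrative and are not given in the paper.

```python
import math

def complexity_estimates(N, P, D, dim=2):
    """Leading-order terms of Eqs. (22)-(24), hidden constants taken as 1."""
    t_cal = (N / P) * math.log(N / P) + math.log(P)                   # Eq. (22)
    cut_arcs = (N / P) ** 0.5 if dim == 2 else (N / P) ** (2.0 / 3.0)
    t_mes = D * cut_arcs                                              # Eqs. (23)-(24)
    return t_cal, t_mes

# Example at roughly the Section 7 benchmark scale: ~140 K vertices, average degree ~2.
for procs in (100, 300, 500):
    print(procs, complexity_estimates(140_000, procs, D=2))
```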

Fig. 2 A structured mesh generated by the software [10] in the scale of 120 × 49 across an airfoil

7 Theoretical benchmarks
This section validates Theorem 5.1 for Algorithm 3.1. Given a distributed digraph, we define the speedup $S_P$ of a feasible scheduling as the serial execution time over the parallel execution time for $P$ processors. Here, the execution time is measured in steps. For example, for a simple digraph, each vertex weight is equal to one step and each arc weight is equal to zero. The number of steps can be accurately calculated by symbolic sweeping. Obviously, a larger speedup means a shorter parallel execution time and implies superior vertex priorities.
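As an illustration of how such step counts could be obtained, the sketch below performs a symbolic sweep of a simple digraph under given vertex priorities and reports the speedup as the serial step count over the parallel step count. It reuses the hypothetical forward_pass and execution_time helpers sketched in the earlier sections and is not the benchmarking code used for the figures.

```python
def symbolic_speedup(g, priority):
    """S_P = (serial steps) / (parallel steps) for a simple digraph with unit vertex weights.

    Serially, the n vertices take n steps; in parallel, the step count is the execution
    time Pi[f] of the schedule produced under the given vertex priorities."""
    f = forward_pass(g, priority)              # end times, one step per vertex
    parallel_steps = execution_time(f, [1] * g.n)
    return g.n / parallel_steps
```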
7.1 Simple digraph benchmarks
The simple digraphs come from one flux sweep of the discrete ordinates solution of the neutron transport applications as studied in [32] and [33]. The geometry is a two-dimensional structured mesh in the scale of 120 × 49 across an airfoil as shown in Fig. 2; the discrete ordinates include 24 angles arising from the non-overlapping partitioning of a unit spherical surface. The digraph includes about 140 K vertices and 280 K arcs. Each vertex represents the task of a pair $\langle c_i, a_m \rangle$, where the flux is swept across cell $c_i$ for angle $a_m$, and its weight is equal to one step. Each arc represents the data dependence between two tasks, i.e., $(\langle c_i, a_m \rangle, \langle c_j, a_m \rangle)$, where cell $c_i$ is the upstream neighbor of cell $c_j$ for angle $a_m$, and its weight is equal to zero. It is easily concluded that the digraph is acyclic if and only if each cell is convex [32]. This digraph has the maximal speedup of 666 provided that an unlimited number of processors are available.
The distributed digraph with $P$ sub-digraphs is generated by the non-overlapping mesh partitioning method of Inertial Kernighan-Lin (IKL) implemented in the tool Chaco [14]. For each partitioning, the above tasks defined on each cell are assigned to the same processor.
Figure 3 shows the speedup convergence history with the number of iterations for 400 or 500 processors, respectively. Here, legend CAP-FB represents Algorithm 3.1 coupled with the vertex ranking strategy CAP, and legend FB represents the traditional forward-backward iterations [21] coupled with the vertex ranking strategy LST. The horizontal axis is the number of iterations, where a half means a backward iteration and an integer means a backward-forward iteration.


Fig. 3 Speedup convergence history for forward-backward iterations

Fig. 4 Speedup comparison for six vertex priorities approaches

Fig. 5 Speedup convergence history for forward-backward iterations

Figure 3 shows the non-decreasing speedup convergence. These results coincide with Theorem 5.1. Moreover, the speedup almost converges after one or two iterations. Figure 3 also shows that the vertex ranking strategy CAP is superior to LST.
Figure 4 lists the speedup for six vertex priorities approaches using hundreds of processors. Legends LST, DFHDS, and SBP represent the three traditional approaches introduced in the first section. Legends CAP-FB+LST, CAP-FB+DFHDS, and CAP-FB+SBP refer to the approach CAP-PFB taking the output of LST, DFHDS, and SBP as its input, respectively. These curves show that CAP-PFB can significantly improve the speedup in each case. Most of all, CAP-PFB has increased the speedup for SBP from 201 to 302 when 500 processors are used.
7.2 Non-simple digraphs
Assigning each cut arc a weight of one-third of a step, the above digraph is no longer simple. Figure 5 shows the speedup convergence history for Algorithm 3.1. Though convergence is achieved within 8 iterations, the non-increasing behavior is broken after the 7th iteration. Of course, such breaks perhaps challenge the stopping criteria of Algorithm 3.1. Nevertheless, similar to Fig. 3, Fig. 5 also shows that 2 iterations are enough for satisfactory vertex priorities.


8 Real applications
Embedding the vertex priorities approach given by Algorithm 3.1 into the heuristic sweeping framework [32] for a distributed digraph, we can apply it to solve the neutron transport applications studied in a series of publications such as [27, 29, 32, 33]. For comparison, the traditional forward-backward approach FB [21] and the local approach SBP [32] are also used.
The multi-group neutron transport equations [32] are discretized by the methods of both discrete ordinates and discontinuous finite elements on a two-dimensional unstructured quadrilateral mesh with 57600 cells. 48 angles are used to evenly partition the unit spherical surface, and 24 groups are used to partition the energy distribution. A digraph is constructed from the tasks of pairs $\langle c_i, a_m \rangle$, where the 24 group fluxes are swept across cell $c_i$ for angle $a_m$. Similarly, each arc represents the data dependence between two tasks, i.e., $(\langle c_i, a_m \rangle, \langle c_j, a_m \rangle)$, where cell $c_i$ is the upstream neighbor of cell $c_j$ for angle $a_m$.
The parallel computer is TianHe-1A [31]. It is a distributed memory machine with 1024 nodes; each node includes two Intel Xeon EX5675 2.93 GHz CPUs, each CPU has 6 cores, and each core has a peak performance of 11.72 GFLOPS. This machine has a fat-tree crossbar interconnect network, the dual-direction bandwidth is equal to 160 Gbps, and the message passing MPI [24] latency is about 1.57 microseconds. Though multi-threaded parallelization is supported within each node, only MPI parallelization is considered.
Table 1 lists the elapsed time and the real speedup on TianHe-1A for the heuristic sweeping framework [32] using the three vertex priorities approaches SBP, FB, and CAP-PFB, respectively. The number of processors scales up from 32 to 2048. Here, each processor means a CPU core, and four cores are used in each CPU. In total, 100 physical time steps are performed and 12 parallel sweeps are executed for each time step. Simultaneously, two forward-backward iterations are executed for the vertex priorities of CAP-PFB, and the vertex priorities are reused within each time step.
In the case of 2,048 processors, CAP-PFB reduces the sweeping time by 24 % and by 14 % compared to SBP and FB, respectively. In the case of hundreds of processors, superlinear speedup appears. This phenomenon mainly benefits from
Table 1 The performance results for parallel sweeping using three strategies

Number of processors               32     64    128    256    512   1024   2048
elapsed time    SBP              1781    908    413    193    103     58     41
(seconds)       FB               1552    762    362    167     93     51     36
                CAP-PFB          1380    702    331    151     81     45     31
speedup         SBP               1.0   1.87   4.31   9.22  17.24  30.70  43.41
                FB                1.0   2.04   4.29   9.29  16.69  30.43  43.11
                CAP-PFB           1.0   1.97   4.17   9.14  17.04  30.67  44.52
ratio           CAP-PFB vs SBP   22 %   23 %   17 %   22 %   22 %   21 %   24 %
                CAP-PFB vs FB    11 %  8.0 %  8.5 %  9.6 %   13 %   12 %   14 %


the increased cache hit ratio. In fact, the Level-3 cache size of each CPU is 12 MB, and the memory requirement of each processor decreases from 663 MB to 45 MB while the number of processors increases from 32 to 512.
The good strong scalability also benefits from the larger number of energy groups. The workload of each task and the length of each message are linear with the number of energy groups. When the number of energy groups is large enough, the message passing latency is suppressed.

9 Conclusion
This paper presents a new parallel algorithm, i.e., CAP-PFB, for vertex priorities in the heuristic framework of parallel computing for distributed acyclic digraphs arising from mesh-based scientific applications. The algorithm is the parallel version of the traditional forward-backward iteration technique coupled with a new vertex ranking strategy, i.e., CAP, in which cut arcs are preferred. Compared with the traditional vertex ranking strategies, CAP not only makes the forward-backward iteration converge quickly, but also makes the heuristic framework achieve a higher speedup. In particular, theoretical analysis is given for the property of non-increasing convergence for simple digraphs.
Theoretical benchmarks for simple digraphs show that the new parallel algorithm CAP-PFB is superior to the traditional forward-backward iteration FB and can significantly improve many of the traditional vertex priorities approaches such as LST, DFHDS, and SBP. Real applications for typical neutron transport on TianHe-1A show that the new parallel algorithm can improve the sweeping performance for 2,048 processors by 24 % and by 14 % compared to the traditional approaches SBP and FB, respectively.
Acknowledgements This work is under the auspices of the National Science Foundation (Nos. 61033009, 60903006), the National Basic Key Research Special Fund (No. 2011CB309702), and the National High Technology Research and Development Program of China (2010AA012303).

References
1. Alvarez-Valdés R, Tamarit JM (1989) Heuristic algorithms for resource-constrained project scheduling: a review and an empirical analysis. In: Advances in project scheduling. Elsevier, Amsterdam, pp 113-134
2. Baker RS, Koch KR (1998) An Sn algorithm for the massively parallel CM-200 computer. Nucl Sci Eng 128:312-320
3. Bosilca G, Bouteiller A, Danalis A, Herault T, Lemarinier P, Dongarra J (2010) DAGuE: a generic distributed DAG engine for high performance computing. Innovative Computing Laboratory, Technical Report ICL-UT-10-01, April 11
4. Bey J, Wittum G (1997) Downwind numbering: a robust multigrid method for convection-diffusion problems on unstructured grids. Appl Numer Math 23(1):177-192
5. Boctor F (1990) Some efficient multi-heuristic procedures for resource-constrained project scheduling. Eur J Oper Res 49:3-13
6. Cooper D (1976) Heuristics for scheduling resource-constrained projects: an experimental investigation. Manag Sci 22(11):1186-1194
7. Davis E, Patterson J (1975) A comparison of heuristic and optimum solutions in resource-constrained project scheduling. Manag Sci 21:944-955
8. Drexl A (1991) Scheduling of project networks by job assignment. Manag Sci 37(12):1590-1602
9. Gross JL, Yellen J (eds) (2003) Handbook of graph theory. Series: discrete mathematics and its applications, vol 25. CRC Press, Boca Raton
10. Gridgen (2012) User's manual for version 15. http://www.pointwise.com/gridgen
11. Hackbusch W, Probst T (1997) Downwind Gauss-Seidel smoothing for convection dominated problems. Numer Linear Algebra Appl 4:85-102
12. Hackbusch W, Wittum G (eds) (1993) Incomplete decompositions (ILU): algorithms, theory and applications. Notes on numerical fluid mechanics, vol 41. Vieweg, Wiesbaden
13. Han H, Ilin VP, Kellogg RB, Yuan W (1992) Analysis of flow directed iterations. J Comput Math 10(1):57-76
14. Hendrickson B, Leland R (1994) The Chaco user's guide: version 2.0. Technical Report SAND94-2692, Sandia National Laboratories, Albuquerque, NM
15. Kolisch R, Hartmann S (1999) Heuristic algorithms for solving the resource-constrained project scheduling problem: classification and computational analysis. In: Weglarz J (ed) Project scheduling: recent models, algorithms and applications. Kluwer Academic, Boston, pp 147-178
16. Kolisch R, Hartmann S (2006) Experimental investigation of heuristics for resource-constrained project scheduling: an update. Eur J Oper Res 174(1):23-37
17. Kolisch R (1995) Project scheduling under resource constraints: efficient heuristics for several problem classes. Physica-Verlag, Heidelberg
18. Bang-Jensen J, Gutin G (2001) Digraphs: theory, algorithms and applications. Springer, London
19. Koch KR, Baker RS, Alcouffe RE (1997) Parallel 3-D Sn performance for MPI on Cray-T3D. In: Proc joint intl conference on mathematical methods and supercomputing for nuclear applications, vol 1, pp 377-393
20. Lewis EE, Miller WF (1984) Computational methods of neutron transport. Wiley, New York
21. Li KY, Willis RJ (1992) An iterative scheduling technique for resource-constrained project scheduling. Eur J Oper Res 56:370-379
22. Meng Q, Luitjens J, Berzins M (2010) Dynamic task scheduling for the Uintah framework. In: Proceedings of the 3rd IEEE workshop on many-task computing on grids and supercomputers (MTAGS10)
23. Meng Q, Berzins M, Schmidt J (2011) Using hybrid parallelism to improve memory use in the Uintah framework. In: TeraGrid'11, Salt Lake City, Utah, USA, 18-21 July
24. Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message-passing interface, 2nd edn. MIT Press, Cambridge
25. Notz PK, Pawlowski RP, Sutherland JC (2012) Graph-based software design for managing complexity and enabling concurrency in multiphysics PDE software. ACM Trans Math Software 39(3):1
26. Ozdamar L, Ulusoy G (1996) An iterative local constraint based analysis for solving the resource constrained project scheduling problem. J Oper Manag 14(3):193-208
27. Pautz SD (2002) An algorithm for parallel Sn sweeps on unstructured meshes. Nucl Sci Eng 140:111-136
28. Pautz SD, Pandya T, Adams ML (2011) Scalable parallel prefix solvers for discrete ordinates transport. Nucl Sci Eng 169:245-261
29. Plimpton S, Hendrickson B, Burns S, McLendon W (2000) Parallel algorithms for radiation transport on unstructured grids. In: Proceedings of SuperComputing 2000
30. Thomas P, Salhi S (1997) An investigation into the relationship of heuristic performance with network-resource characteristics. J Oper Res Soc 48(1):34-43
31. Yang X, Liao X, Lu K, Hu Q, Song J, Su J (2011) The TianHe-1A supercomputer: its hardware and software. J Comput Sci Technol 26(3):344-351
32. Mo Z, Zhang A, Wittum G (2009) Scalable heuristic algorithms for the parallel execution of data flow acyclic digraphs. SIAM J Sci Comput 31(5):3626-3642
33. Mo Z, Fu L (2004) Parallel flux sweep algorithm for neutron transport on unstructured grid. J Supercomput 30(1):5-17
