Abstract Data flow acyclic directed graphs (digraphs) are widely used to describe the
data dependency of mesh-based scientific computing. The parallel execution of such
digraphs can approximately depict the flowchart of parallel computing. During
parallel execution, vertex priorities are key performance factors. This paper
first takes the distributed digraph and its resource-constrained parallel scheduling as
the vertex priorities model, and then presents a new parallel algorithm for the solution
of vertex priorities using the well-known technique of forward-backward iterations.
In particular, a more efficient vertex ranking strategy is proposed for each iteration. In
the case of simple digraphs, both theoretical analysis and benchmarks show that the
vertex priorities produced by such an algorithm make the digraph scheduling
time converge non-increasingly with the number of iterations. In other cases of non-simple digraphs, benchmarks also show that the new algorithm is superior to many
traditional approaches. Embedding the new algorithm into the heuristic framework
for the parallel sweeping solution of neutron transport applications, the new vertex
priorities improve the performance by about 20 % while the number of processors
scales up from 32 to 2048.
Keywords Acyclic digraph · Parallel algorithm · Neutron transport
1 Introduction
The data flow acyclic directed graphs (digraphs) [9] are usually used to describe the
data dependency for a wide range of mesh-based scientific computing. Each of these
digraphs consists of weighted vertices and arcs: each vertex often refers to a mesh
cell, and its weight often represents the workloads; each arc often depicts the data
dependency between two neighboring cells, and its weight often represents the dependent overheads.

Z. Mo (✉) · A. Zhang · Z. Yang
Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics,
P.O. Box 8009, Beijing, 100088, China
e-mail: zeyao_mo@iapcm.ac.cn
The parallel sweeping solvers are the numerical kernel for the seven-dimensional
radiation or neutron transport equations [20] when the discrete ordinates methods
(Sn) are used. They are typical mesh-based scientific computing applications whose
data dependency is suitable for digraph description. Baker et al. [2] and Koch et al.
[19] addressed these solvers on rectangular meshes on earlier massively parallel computers, Plimpton et al. [29] and Pautz et al. [27] extended these researches to unstructured meshes, Mo et al. [33] supplemented these works for the cylindrical coordinate system, and recently Pautz et al. [28] presented another heuristic method to improve the
inherent parallelism for long-characteristics Sn discretization. Besides these parallel sweeping solvers, many other mesh-based applications exist and are suitable for
digraph description, for example, the parallel downstream relaxation for the direct solution of upper or lower sparse triangular linear systems arising from the discretization of
convection-dominated problems [4, 11, 13], the well-known ILU factorization [12],
the dense matrix LU factorization and their multi-threaded versions [3], the patch-based structured mesh AMR simulations [22, 25] and their multi-threaded versions
[23], and so on.
The flowchart of parallel computing for the above mesh-based scientific computing
applications can be approximately depicted by the parallel execution of the associated
digraphs. Nevertheless, the solution for the minimal execution time of the digraphs
is still NP-hard [18]. Mo et al. [32] presented a heuristic framework consisting of three
components. The first is the partitioning method assigning digraph vertices across
processors, the second is the parallel sweeping solver for the execution of the distributed
digraph, and the third is the vertex priorities strategy deciding which vertex should be
executed when many vertices are executable in each processor. For a given distributed
digraph, the vertex priorities approach is the most crucial for parallel efficiency.
There are two types of approaches for the calculation of vertex priorities: local and
global. The local approaches only use the data flow locally within each processor,
while the global approaches use the data flow of the digraph across processors.
The First-In-First-Out (FIFO) strategy, the Geometrical Coordinates KBA strategy [29],
the Shortest processor-Boundary Path strategy (SBP) [32],
and the Sweeping Direction Upwind strategy [33] are typical local approaches. The
Largest End Time strategy [7], the Latest Start Time strategy (LST) [17], the Least
Relaxation Time strategy [7], the Maximal Number of Successors strategy [1], the
Hybrid strategies [5, 30], the Sampling strategies [6, 8], and the Depth First sweeping strategies (DFHDS) [27] are the typical global approaches. Usually, the local
approaches are cheaper but less efficient, while the global approaches are favorable when
the vertex priorities are reusable.
Generally, most of the above vertex priorities approaches can be depicted by the well-known resource-constrained scheduling models widely used for many digraph-based projects or networks [9, 15, 16], where the constrained resources refer
to the number of available processors. Each of these models produces a parallel
scheduling in which each vertex has a start and an end time for execution, and the
parallel execution time of the scheduling is equal to the difference between the maximal vertex end time and the minimal vertex start time. Taking the start or the
end time as the vertex priorities, most of the above local or global approaches can be
reproduced.
Heuristically, the vertex priorities approach achieving the minimal parallel execution time is perhaps the most suitable for the parallel sweeping solvers of mesh-based scientific computing. So, taking the vertex priorities output by the above local or
global approaches as the input, the well-known Forward-Backward technique (FB)
presented by Li et al. [21] can be used to iteratively reduce the parallel execution
time. Finally, a better scheduling is reproduced, and each vertex can take the new start or
end time as its priority. The reproduced priorities are heuristically more effective
because larger inherent parallelism is achieved.
Based on the above heuristic observation, this paper presents a new parallel algorithm
for vertex priorities by the parallelization of the forward-backward iterations. In particular, a new efficient vertex ranking strategy is designed for each forward or backward iteration. In the case of simple digraphs, where each vertex has equal weight
and each arc has zero weight, both theoretical analysis and benchmarks show that
the vertex priorities produced by the new algorithm make the parallel execution time
converge non-increasingly with the number of forward and backward iterations. In
other cases of non-simple digraphs, benchmarks also show that the new algorithm is
superior to many traditional approaches. In particular, the new algorithm improves
the speedup of the SBP approach from 201 to 301 when 500 processors are used
for a digraph in the scale of 140 K vertices and 280 K arcs. Furthermore, embedding
the new algorithm into the heuristic framework for the parallel sweeping solution of
neutron transport applications, the new vertex priorities improve the performance by
about 20 % while the number of processors scales up from 32 to 2048.
This paper is organized as follows. The second section defines the resource-constrained scheduling model for vertex priorities, the third section gives the parallel
forward-backward iterations, the fourth section gives the vertex ranking strategy,
and the next two sections give the convergence and the complexity analysis.
The last two sections list the performance results for theoretical benchmarks and real
neutron transport applications. Finally, this paper is concluded.
2 The resource-constrained scheduling model

A simple digraph is such a type of digraph that all its vertices have equal weights
and all its arcs have zero weights. Simple digraphs often arise in the field of mesh-based scientific computing: the weights of the vertices are usually equal to
each other because the computational formulae are similar across mesh cells, and the data
transfer overheads between neighboring vertices are negligible provided that each
vertex has enough computational workloads.
A distributed digraph consists of many non-overlapping sub-digraphs, each
assigned to a processor. A vertex is local to a sub-digraph if and only if it belongs to
this sub-digraph; a vertex is local to another vertex if and only if they belong to the
same sub-digraph. An arc is a local arc if and only if both its head and its tail belong to
the same sub-digraph; otherwise, it is a cut arc. Usually, a sub-digraph includes all its
local vertices and all its arcs whose head or tail is local. Denote by R = (r1, r2, ..., rn)
the mapping vector whose element ri (1 ≤ ri ≤ P) is the rank of the processor owning
vertex vi, where P is the number of processors.
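These notions can be made concrete with a small Python sketch (illustrative only, not the authors' code; the function names `classify_arcs` and `sub_digraph` are ours):

```python
# A minimal sketch of the distributed-digraph terms defined above:
# local arcs, cut arcs, and sub-digraphs under a mapping vector R,
# where R[v] is the rank of the processor owning vertex v.

def classify_arcs(arcs, R):
    """Split arcs (tail, head) into local arcs and cut arcs."""
    local, cut = [], []
    for tail, head in arcs:
        (local if R[tail] == R[head] else cut).append((tail, head))
    return local, cut

def sub_digraph(vertices, arcs, R, k):
    """Vertices local to processor k plus every arc whose head or tail is local."""
    local_vertices = [v for v in vertices if R[v] == k]
    owned_arcs = [(t, h) for t, h in arcs if R[t] == k or R[h] == k]
    return local_vertices, owned_arcs
```

For instance, with the chain 0 → 1 → 2 → 3 and R = (0, 0, 1, 1), the arc (1, 2) is the only cut arc, and the sub-digraph of processor 0 owns vertices {0, 1} together with arcs (0, 1) and (1, 2).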
Given a distributed digraph with P sub-digraphs, the vertex priorities approaches
as introduced in the first section are equivalent to the solution of the resource-constrained
scheduling model as follows:
    min Δ[f]
subject to
    f_i + w_{i,j} ≤ f_j − q_j,   i ∈ D_j, j = 1, ..., n
    |A_k(t)| ≤ 1,   k = 1, ..., P, ∀t                                  (1)

Here, D_j is the set of predecessors of vertex v_j, q_j is the weight of vertex v_j, and
w_{i,j} is the weight of arc (v_i, v_j);

    max_{1≤j≤n} f_j,   min_{1≤j≤n} {f_j − q_j},   Δ[f] = max_{1≤j≤n} f_j − min_{1≤j≤n} {f_j − q_j}    (2)

denote the end time, the start time, and the execution time of the digraph, respectively; and

    A_k(t) = {v_j : f_j − q_j ≤ t < f_j} ∩ {v_j : r_j = k}              (3)

is the set of vertices executed on processor k at time t. Obviously, assigning each vertex
v_i the priority f_i, the digraph has the final execution time Δ[f].
Two constraints are proposed in Eq. (1). The first is the sequence constraint: a vertex
should not execute until all its predecessors have finished. The second is the resource
constraint: at most one vertex executes concurrently in each processor. A scheduling
is feasible if and only if these constraints are satisfied. A scheduling is optimal if and
only if Δ[f] is minimal.
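As a toy illustration (not the authors' implementation; the data layout is assumed here: end times `f` and vertex weights `q` as lists, arc weights `w` as a dict, and the mapping vector `r`), the feasibility of a scheduling under Eq. (1) can be checked as follows:

```python
# Illustrative feasibility check for the two constraints of Eq. (1):
# the sequence constraint along every arc and the one-vertex-at-a-time
# resource constraint on every processor.

def is_feasible(f, q, w, r, arcs, P):
    # Sequence constraint: a vertex starts only after every predecessor
    # has finished and the arc overhead has elapsed.
    for (i, j) in arcs:
        if f[i] + w[(i, j)] > f[j] - q[j]:
            return False
    # Resource constraint: the execution intervals [f[j]-q[j], f[j])
    # assigned to one processor must be pairwise disjoint.
    for k in range(P):
        spans = sorted((f[j] - q[j], f[j]) for j in range(len(f)) if r[j] == k)
        for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
            if s2 < e1:
                return False
    return True
```

For a two-vertex chain with unit weights on one processor, end times (1, 2) are feasible, while (1, 1) violate both constraints.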
3 The parallel forward-backward iterations

Based on the serial forward-backward technique [21],
we consider its parallel version for a distributed digraph in Algorithm 3.1. Different
from the serial forward-backward iterations, a new ranking strategy is introduced
satisfying the sequence constraint of Eq. (1).
Algorithm 3.1 PFB(G, r, q, w, α, β, f0, Mits, ε, f)
INPUT:
    G    : the distributed digraph.
    r    : the mapping vector of vertices onto processors.
    q    : the weights of the vertices.
    w    : the weights of the arcs.
    α    : the start time of the scheduling.
    β    : the end time of the scheduling.
    f0   : the initial scheduling of local vertices.
    Mits : the maximal number of iterations.
    ε    : the convergence tolerance.
OUTPUT:
    f    : the final scheduling of local vertices.
BEGIN
h = 0, m = number of local vertices.
α0 = start time of initial scheduling, β0 = end time of initial scheduling.
DO in PARALLEL {
(1) Execute backward iteration for schedule f^{h+1/2} from f^h.
    (1.1) Compute ranks σ for local vertices from f^h and sort.
        (1.1.1) Compute σ from f^h using a rank strategy as discussed in
                the next section.
        (1.1.2) Order the local vertex indices {i_g}_{g=1,...,m} satisfying
                (σ_{i1}, f^h_{i1}) ≽ (σ_{i2}, f^h_{i2}) ≽ ... ≽ (σ_{im}, f^h_{im}).
                Here, (a1, b1) ≽ (a2, b2) means (a1 > a2) || ((a1 = a2) && (b1 > b2)).
    (1.2) β^{h+1/2} = β^h, I = (−∞, β^{h+1/2}): the initial intervals available.
    (1.3) FOR each vertex i_g : g = 1, 2, ..., m−1, m DO {
        (1.3.1) Let t_{ig} = f^h_{ig} and goto step (1.3.4) if v_{ig} is a sink.
        (1.3.2) Receive the start times s^{h+1/2}_l of the heads of the downstream
                cut arcs of v_{ig}.
        (1.3.3) t_{ig} = min_l { s^{h+1/2}_l − w_{ig,l} } over all successors v_l of v_{ig}.
        (1.3.4) Subject to the resource constraint,
                f^{h+1/2}_{ig} = max{ t : (t ≤ t_{ig}) & ((t − q_{ig}, t) ⊂ I) }.
        (1.3.5) s^{h+1/2}_{ig} = f^{h+1/2}_{ig} − q_{ig}.
        (1.3.6) Remove the interval occupied at the latest start time:
                I = I \ (s^{h+1/2}_{ig}, f^{h+1/2}_{ig}).
        (1.3.7) Send s^{h+1/2}_{ig} to the processors owning the tails of the upstream
                cut arcs of v_{ig}.
    }
(2) Execute forward iteration for schedule f^{h+1} from f^{h+1/2} symmetrically,
    with steps (2.1)-(2.3) mirroring steps (1.1)-(1.3).
h = h + 1.
} UNTIL (h ≥ Mits) or (|Δ[f^h] − Δ[f^{h−1}]| ≤ ε).
END
Remark 3.3 In step (2.3.4), t_{ig} is the earliest start time satisfying the sequence
constraint, and s^{h+1}_{ig} is the earliest start time satisfying both constraints.
Remark 3.4 The last row in Algorithm 3.1 is the terminal condition. If the sequence
Δ[f^h] converges non-increasingly with h, the final output is the solution; otherwise, we
should take the output having the shortest execution time in the solution history.
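The latest-fit placement of steps (1.3.4)-(1.3.6) can be sketched as follows (an illustrative Python fragment under the assumption that the available intervals I are kept as a sorted list of disjoint intervals; `place_latest` is our own name, not the paper's):

```python
# Illustrative latest-fit placement used in the backward iteration: find the
# largest end time t <= bound such that the span (t - q, t) lies inside one
# free interval, then remove that span from the free-interval list.

def place_latest(free, bound, q):
    """free: sorted list of disjoint (lo, hi) free intervals (hi may be inf).
    Returns (end_time, updated_free), or (None, free) if no slot fits."""
    for idx in range(len(free) - 1, -1, -1):   # scan the latest interval first
        lo, hi = free[idx]
        t = min(bound, hi)                     # latest candidate end time here
        if t - q >= lo:                        # the span (t - q, t) fits
            updated = free[:idx]
            if t - q > lo:
                updated.append((lo, t - q))    # left remainder stays free
            if t < hi:
                updated.append((t, hi))        # right remainder stays free
            updated += free[idx + 1:]
            return t, sorted(updated)
    return None, free
```

For example, placing a unit-weight-2 vertex bounded by t = 7 into the single free interval (0, 10) yields the end time 7 and leaves (0, 5) and (7, 10) free.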
4 The vertex ranking strategy CAP

In step (2.1.1) after a backward iteration, we rank a vertex by the earliest start time
of all its downstream cut arcs as follows:

    σ_i = min( { s^{h+1/2}_j − w_{k,j} : (v_k, v_j) ∈ S_i } ∪ {+∞} )    (4)

Here, S_i is the set of cut arcs whose heads are the successors of vertex v_i, w_{k,j} is the
weight of cut arc (v_k, v_j), s^{h+1/2}_j is the start time after the last backward iteration, and
the superscript h represents the iteration step. A smaller σ_i indicates a higher priority.
Similarly, in step (1.1.1) after a forward iteration, we rank a vertex by the latest
end time of all its upstream cut arcs as follows:

    σ_i = max( { f^h_j + w_{j,k} : (v_j, v_k) ∈ B_i } ∪ {−∞} )    (5)

Here, B_i is the set of local cut arcs whose tails are the predecessors of vertex v_i, w_{j,k}
is the weight of cut arc (v_j, v_k), f^h_j is the end time after the last forward iteration, and
the superscript h represents this iteration. A larger σ_i indicates a higher priority.
Strategy CAP naturally satisfies the sequence constraint stated in step (2.1.2) and
step (1.1.2). In fact, assuming vertex v_a is a predecessor of vertex v_b, the set S_b
must be a subset of S_a, so σ_a ≤ σ_b. Similarly, assuming vertex v_c is a successor of
vertex v_d, the set B_d must be a subset of B_c, so σ_c ≥ σ_d.
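The two CAP formulas can be sketched directly (a toy Python illustration, not taken from the paper; the data layout is assumed: cut arcs as (tail, head) pairs in dicts `S` and `B`, and dicts `s`, `f` holding the start/end times of the previous half-iteration):

```python
import math

# Toy sketch of the CAP ranks of formulas (4) and (5).

def cap_rank_forward(i, S, s, w):
    """Formula (4): earliest start time over the downstream cut arcs S[i];
    a smaller rank means a higher priority in the forward iteration."""
    return min((s[head] - w[(tail, head)] for tail, head in S[i]),
               default=math.inf)

def cap_rank_backward(i, B, f, w):
    """Formula (5): latest end time over the upstream cut arcs B[i];
    a larger rank means a higher priority in the backward iteration."""
    return max((f[tail] + w[(tail, head)] for tail, head in B[i]),
               default=-math.inf)
```

With the zero arc weights of the Fig. 1 example, S_1 = {(v2, v9), (v4, v8)} with s_9 = 4 and s_8 = 6 gives σ_1 = 4, and B_9 = {(v2, v9), (v3, v7)} with f_2 = 2 and f_3 = 3 gives σ_9 = 3, matching the text.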
Let CAP-F be the forward iteration using formula (4) and CAP-B be the backward
iteration using formula (5), and denote the forward-backward iteration as CAP-PFB. Figure 1 gives an example to illustrate these iterations for a scheduling. Figure 1a gives
a distributed simple digraph with two sub-digraphs divided by the dashed line. Each
circle and its central number represent a vertex, and the weight of each vertex is normalized to 1.0. Each arrow represents an arc showing the data dependency between two
vertices, and the weight of each arc is equal to zero. Three cut arcs exist along the dashed
line. Figure 1b gives an initial scheduling. The horizontal axis is the execution steps
representing the execution time, and the vertical axis is the executed vertices across two
processors. This scheduling requires 7 steps in total.
Taking the set of end times {f^0_i}_{i=1}^{10} in Fig. 1b as the input, Fig. 1c assigns each vertex a rank defined by formula (5). All local vertices in the 0th processor have the rank
{−∞} since no local upstream cut arcs exist. B_9 = {(v_2, v_9), (v_3, v_7)}, f^0_3 = 3, f^0_2 =
2, so σ_9 = 3. Similarly, we have σ_7 = σ_10 = 3 and σ_8 = f^0_4 = 4. After one backward
iteration as stated in Algorithm 3.1, a better scheduling is generated in Fig. 1d
where one step is reduced.
Taking the set of start times {s^{1/2}_i}_{i=1}^{10} in Fig. 1d as the input, Fig. 1e updates the vertex ranks defined by formula (4). All local vertices in the 1st processor have the
rank {+∞} since no local downstream cut arcs exist. S_1 = {(v_2, v_9), (v_4, v_8)}, s^{1/2}_9 =
4, s^{1/2}_8 = 6, so σ_1 = 4. Similarly, we have σ_2 = 4, σ_3 = 3, and σ_4 = 6. After one forward
iteration as stated in Algorithm 3.1, a better scheduling is generated again in
Fig. 1f where another step is further reduced.
Until now, one backward-forward iteration finishes, and an optimal scheduling with 5
steps is gained. However, the ranking strategies presented in [21, 26] cannot improve
the results in Fig. 1b.
5 Convergence analysis

Lemma 5.1 If the distributed acyclic digraph is simple, then given a local vertex v_k
in the forward iteration, we can conclude

    f^{h+1}_{jc} ≤ s^{h+1/2}_k,   c = 1, 2, ..., L    (6)

Here, J = {v_{jc}}_{c=1}^{L} is a special list of vertices ordered by step (2.1.2), and each vertex
is the head of a cut arc whose tail is a predecessor of vertex v_k.
Proof We use induction on the size L of the list J.

First, let L = 1 and assume vertex v_{j1} belongs to processor r_{j1}. Let H_1 be the set of
vertices scheduled before v_{j1} in processor r_{j1} together with v_{j1} itself. Because σ
satisfies the sequence constraint and σ_{j1} is the smallest, we have

    f^{h+1/2}_i ≤ f^{h+1/2}_{j1},   ∀v_i ∈ H_1, i ≠ j1    (7)

Since the digraph is simple, the vertices of H_1 execute consecutively in the forward
iteration, so

    f^{h+1}_{j1} = α^{h+1} + |H_1| = α^{h+1/2} + |H_1| ≤ f^{h+1/2}_{j1} ≤ s^{h+1/2}_k    (8)

Second, assume that the conclusion holds for the first C − 1 vertices of the list:

    f^{h+1}_{jc} ≤ s^{h+1/2}_k,   c = 1, 2, ..., C − 1    (9)

It remains to show that

    f^{h+1}_{jC} ≤ s^{h+1/2}_k    (10)

Firstly, assume no idle interval exists before vertex v_{jC} in its processor, so that
f^{h+1}_{jC} = α^{h+1} + |H_C|. Let j* = argmax{ f^{h+1/2}_i : v_i ∈ H_C }; then
f^{h+1/2}_{j*} ≥ α^{h+1/2} + |H_C|, so f^{h+1}_{jC} ≤ f^{h+1/2}_{j*}.

Secondly, assume [f^{h+1}_l, s^{h+1}_a) is the last idle interval before vertex v_{jC} in its
processor, and let K be the set of vertices executed consecutively from s^{h+1}_a up to
v_{jC}, so that f^{h+1}_{jC} = s^{h+1}_a + |K|. The forward iteration cannot start vertex v_a
earlier, hence

    max{ f^{h+1}_j : (v_j, v_a) ∈ E } = s^{h+1}_a    (11)

Let vertex v_{k'} attain this maximum; because the interval [f^{h+1}_l, s^{h+1}_a) is idle,

    s^{h+1}_a = f^{h+1}_{k'} > f^{h+1}_l    (12)

so vertex v_{k'} does not belong to the set K, and (v_{k'}, v_a) is a cut arc whose head
belongs to the list J. Moreover, the index c' of v_a in the list satisfies c' < C because
σ_a ≥ σ_{jC}. Therefore,

    f^{h+1}_a ≤ s^{h+1/2}_k    (13)

from the induction assumption in Eq. (9). So, we have

    f^{h+1}_{jC} ≤ s^{h+1/2}_k + |K|    (14)

from the above equations. Let j* = argmax{ f^{h+1/2}_i : v_i ∈ K }; then
s^{h+1/2}_k + |K| ≤ f^{h+1/2}_{j*}, so

    f^{h+1}_{jC} ≤ f^{h+1/2}_{j*}    (15)

In both cases, σ_{j*} ≥ σ_{jC}, so f^{h+1/2}_{j*} ≤ s^{h+1/2}_k and therefore

    f^{h+1}_{jC} ≤ f^{h+1/2}_{j*} ≤ s^{h+1/2}_k    (16)

which proves Eq. (10).
Lemma 5.2 If the distributed acyclic digraph is simple, the scheduling f^{h+1} generated by the forward iteration CAP-F is non-increasing such that

    Δ[f^{h+1}] ≤ Δ[f^{h+1/2}]    (17)

Here, Δ[f] is the parallel execution time as defined in Eq. (2).
Proof Given a vertex v_k, assume one of its successors is the head of a cut arc
(v_{jc}, v_t); then

    f^{h+1}_k ≤ f^{h+1}_{jc} ≤ s^{h+1/2}_t < f^{h+1/2}_t    (18)

by Lemma 5.1 and the sequence constraint. Otherwise, v_k is a sink or a local predecessor
of a sink. Let vertex v_s be the sink; we can also conclude that f^{h+1}_k ≤ f^{h+1}_s by the
sequence constraint and

    f^{h+1}_s ≤ max_{1≤j≤n} f^{h+1/2}_j    (19)

Therefore, the maximal end time does not increase, and

    Δ[f^{h+1}] ≤ Δ[f^{h+1/2}]    (20)
6 Complexity analysis

Denote by N the number of vertices of the digraph. In each iteration of Algorithm 3.1,
the computational complexity consists of the local vertex ranking and sorting, the
local vertex scanning, and the global reduction, so

    T_cal = O((N/P) log(N/P)) + O(N/P) + O(log P)    (22)
Assume the average degree of the vertices is O(D) and the digraph is uniformly
partitioned; then each processor has O(√(N/P)) cut arcs in the two-dimensional case.
So, the message passing complexity is about

    T_mes = O(D √(N/P))    (23)

In the three-dimensional case,

    T_mes = O(D (N/P)^{2/3})    (24)
It is difficult to evaluate the overheads for steps (1.3.2) and (1.3.7) because of the
uncertainty of message passing. However, they can be estimated as follows:

    T_pas = O(T_mes · L)    (25)

Here, L is the message passing latency. Similarly, this estimation is suitable for steps
(2.3.2) and (2.3.7).
The complexity of Algorithm 3.1 is the sum of T_cal and T_pas. In the case of
N ≫ P, T_cal dominates. If we fix N/P and increase P, the global reduction will
dominate. If the digraph parallelism is smaller, T_pas or the message passing latency
will dominate.
In fact, Algorithm 3.1 is a sequence of iterations, and each iteration
has double the overheads of a parallel sweeping as introduced in the first section. So,
Algorithm 3.1 has the complexity of 2M parallel sweepings provided that it converges
within M iterations. This implies that Algorithm 3.1 has the disadvantage that it is
useful if and only if the vertex priorities approach is reusable for the parallel sweeping
of a series of digraphs or the parallel sweeping is repeated for the real applications.
7 Theoretical benchmarks
This section validates Theorem 5.1 for Algorithm 3.1. Given a distributed digraph,
we define the speedup SP of a feasible scheduling as the serial execution time over
the parallel execution time for P processors. Here, the execution time is measured by
steps. For example, for a simple digraph, the vertex weight is equal to one step and
the arc weight is equal to zero. The number of steps can be accurately calculated by
the symbolic sweeping. Obviously, larger speedup means shorter parallel execution
time and implies superior vertex priorities.
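The symbolic sweeping mentioned above amounts to a step-counting simulation for a simple digraph: at every step, each processor executes its highest-priority ready local vertex. A minimal sketch (illustrative only; the function name and data layout are ours, not from the paper):

```python
# Illustrative step-counting simulation ("symbolic sweeping") for a simple
# digraph: unit vertex weights, zero arc weights. At each step, every
# processor runs the highest-priority ready vertex it owns.

def symbolic_sweep(n, arcs, R, P, priority):
    preds = {v: set() for v in range(n)}
    for tail, head in arcs:
        preds[head].add(tail)
    done, steps = set(), 0
    while len(done) < n:
        # A vertex is ready when all its predecessors have executed.
        ready = [v for v in range(n) if v not in done and preds[v] <= done]
        chosen = []
        for k in range(P):                   # at most one vertex per processor
            mine = [v for v in ready if R[v] == k]
            if mine:
                chosen.append(max(mine, key=lambda v: priority[v]))
        done.update(chosen)
        steps += 1
    return steps
```

For a chain of three vertices on one processor this counts 3 steps, and for a diamond split across two processors it counts 3 steps against a serial time of 4, i.e., a speedup of 4/3.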
7.1 Simple digraph benchmarks
The simple digraphs come from one flux sweeping of the discrete ordinates solution
of the neutron transport applications as studied in [32] and [33]. The geometry is a
two-dimensional structured mesh in the scale of 120 × 49 across an airfoil as shown
in Fig. 2; the discrete ordinates include 24 angles arising from the non-overlapping
partitioning of a unit spherical surface. The digraph includes about 140 K vertices and
280 K arcs. Each vertex represents the task of a pair ⟨c_i, a_m⟩ where the flux is swept
across cell c_i for angle a_m, and its weight is equal to one step. Each arc represents
the data dependence between two tasks ⟨c_i, a_m⟩ and ⟨c_j, a_m⟩ where cell c_i is
the upstream neighbor of cell c_j for angle a_m, and its weight is equal to zero. It is
easily concluded that the digraph is acyclic if and only if each cell is convex [32].
This digraph has the maximal speedup of 666 provided that an unlimited number of
processors is available.
The distributed digraph with P sub-digraphs is generated by the non-overlapping
mesh partitioning method of Inertial Kernighan-Lin (IKL) implemented in the tool
Chaco [14]. For each partitioning, the above tasks defined on each cell are assigned to
the same processor.
Figure 3 lists the speedup convergence history with the number of iterations for
400 or 500 processors, respectively. Here, legend CAP-FB represents Algorithm 3.1
coupled with the vertex ranking strategy CAP, and legend FB represents the traditional
forward-backward iterations [21] coupled with the vertex ranking strategy LST. The
horizontal axis is the number of iterations, where a half means a backward iteration
and an integer means a backward-forward iteration.
8 Real applications
Embedding the vertex priorities approach given by Algorithm 3.1 into the heuristic
sweeping framework [32] for a distributed digraph, we can apply it to solve the
neutron transport applications as studied in a series of publications such as [27, 29,
32, 33]. The traditional forward-backward approach FB [21] and the
local approach SBP [32] are used for comparison.
The multi-group neutron transport equations [32] are discretized by the methods
of both discrete ordinates and discontinuous finite element on a two-dimensional unstructured quadrilateral mesh with 57600 cells. 48 angles are used to evenly partition
the unit spherical surface, and 24 groups are used to partition the energy distribution.
A digraph is constructed from the task of a pair ⟨c_i, a_m⟩ where the 24 group-fluxes are
swept across cell c_i for angle a_m. Similarly, each arc represents the data dependence
between two tasks ⟨c_i, a_m⟩ and ⟨c_j, a_m⟩ where cell c_i is the upstream neighbor
of cell c_j for angle a_m.
The parallel computer is TianHe-1A [31]. It is a distributed memory machine with
1024 nodes; each node includes two Intel Xeon EX5675 2.93 GHz CPUs, each CPU
has 6 cores, and each core has the peak performance of 11.72 GFLOPS. This machine has a fat-tree crossbar interconnect network; the dual-direction bandwidth is
equal to 160 Gbps and the MPI [24] message passing latency is about 1.57 microseconds. Though multi-threaded parallelization is supported within each node, only
MPI parallelization is considered.
Table 1 lists the elapsed time and the real speedup on TianHe-1A for the heuristic
sweeping framework [32] using the three vertex priorities approaches of SBP, FB, and
CAP-PFB, respectively. The number of processors scales up from 32 to 2048. Here,
each processor means a CPU core, and four cores are used in each CPU. In total, 100
physical time steps are performed and 12 parallel sweepings are executed for each time
step. Meanwhile, two forward-backward iterations are executed for the vertex
priorities of CAP-PFB, and the vertex priorities are reused within each time step.
In the case of 2,048 processors, CAP-PFB reduces the sweeping time by 24 %
and by 14 % compared to SBP and FB, respectively. In the case of hundreds
of processors, superlinear speedup appears.

Table 1 The performance results for parallel sweeping using three strategies

                            Number of processors
                            32     64     128    256    512    1024   2048
Elapsed time (seconds)
  SBP                       1781   908    413    193    103    58     41
  FB                        1552   762    362    167    93     51     36
  CAP-PFB                   1380   702    331    151    81     45     31
Speedup ratio
  SBP                       1.0    1.87   4.31   9.22   17.24  30.70  43.41
  FB                        1.0    2.04   4.29   9.29   16.69  30.43  43.11
  CAP-PFB                   1.0    1.97   4.17   9.14   17.04  30.67  44.52
Improvement
  CAP-PFB vs SBP            22 %   23 %   17 %   22 %   22 %   21 %   24 %
  CAP-PFB vs FB             11 %   8.0 %  8.5 %  9.6 %  13 %   12 %   14 %

This phenomenon mainly benefits from
the increased cache hit ratio. In fact, the Level-3 cache size of each CPU is 12 MB,
and the memory requirement of each processor decreases from 663 MB to 45 MB as the
number of processors increases from 32 to 512.
The good strong scalability also benefits from the larger number of energy groups.
The workloads of each task and the length of each message are linear in the number
of energy groups. When the number of energy groups is large enough, the message
passing latency will be suppressed.
9 Conclusion
This paper presents a new parallel algorithm, i.e., CAP-PFB, for vertex priorities in
the heuristic framework of parallel computing for distributed acyclic digraphs arising
from mesh-based scientific applications. The algorithm is the parallel version of the
traditional forward-backward iteration techniques coupled with a new vertex ranking
strategy, i.e., CAP, where the cut arc is more preferable. Compared with the traditional
vertex ranking strategies, CAP not only makes the forward-backward iteration
converge fast, but also makes the heuristic framework achieve higher speedup. In particular,
the theoretical analysis is given for the property of non-increasing convergence for
simple digraphs.
Theoretical benchmarks for simple digraphs show that the new parallel algorithm
CAP-PFB is superior to the traditional forward-backward iteration FB and can significantly improve many of the traditional vertex priorities approaches such as LST,
DFHDS, and SBP. Real applications for typical neutron transport on TianHe-1A show
that the new parallel algorithm can improve the sweeping performance for 2,048 processors by 24 % and by 14 % compared to the traditional approaches of SBP
and FB, respectively.
Acknowledgements This work is under the auspices of National Science Foundation (Nos. 61033009,
60903006), National Basic Key Research Special Fund (No. 2011CB309702) and National High Technology Research and Development Program of China (2010AA012303).
References
1. Álvarez-Valdés R, Tamarit JM (1989) Heuristic algorithms for resource-constrained project scheduling: a review and an empirical analysis. In: Advances in project scheduling. Elsevier, Amsterdam, pp 113–134
2. Baker RS, Koch KR (1998) An Sn algorithm for the massively parallel CM-200 computer. Nucl Sci Eng 128:312–320
3. Bosilca G, Bouteiller A, Danalis A, Herault T, Lemarinier P, Dongarra J (2010) DAGuE: a generic distributed DAG engine for high performance computing. Innovative Computing Laboratory, Technical Report ICL-UT-10-01, April 11
4. Bey J, Wittum G (1997) Downwind numbering: a robust multigrid method for convection-diffusion problems on unstructured grids. Appl Numer Math 23(1):177–192
5. Boctor F (1990) Some efficient multi-heuristic procedures for resource-constrained project scheduling. Eur J Oper Res 49:3–13
6. Cooper D (1976) Heuristics for scheduling resource-constrained projects: an experimental investigation. Manag Sci 22(11):1186–1194
7. Davis E, Patterson J (1975) A comparison of heuristic and optimum solutions in resource-constrained project scheduling. Manag Sci 21:944–955