Erlangen
June 2009
Abstract
Computer-based image segmentation is a common task when analyzing and classifying images in a broad range of applications. Problems arise when segmentations have to be computed for huge datasets such as high-resolution microscope or satellite scans or three-dimensional magnetic resonance images appearing in medical image processing: the computation may exceed constraints like available memory and time.
We present a finite element algorithm for image segmentation based on a level set formulation and combined with the domain decomposition method, which enables us to rapidly compute segmentations of large datasets on multi-core CPUs and high-performance distributed parallel computers.
Contents

1 Introduction
2 Mathematical Model of Image Segmentation
  2.1 The Mumford-Shah Energy Functional
  2.2 The Chan-Vese Model
  2.3 The Level Set Formulation
  2.4 Heaviside Regularization
  2.5 Multiple Channels
  2.6 The Euler-Lagrange equation
  2.7 Gradient Descent
  2.8 Weak formulation
  2.9 Finite Element Space Discretization
  2.10 Time Discretization
  2.11 Matrix Formulation
3 Domain Decomposition
4 Implementation in Image
  4.1 Brief Introduction to the Image Framework
  4.2 Parallel Computing Programming Model
  4.3 Design Principles with MPI in Image
  4.4 Partitioning of Triangulations using ParMETIS
  4.5 Distribution of Subdomains
  4.6 Association of Global and Local Degrees of Freedom
  4.7 Handling of Interface Data
  4.8 Non-Blocking MPI Communication
  4.9 Distributed Iterative Solver
    4.9.1 Assembly of Matrices and Adaption of Right Hand Sides
    4.9.2 Schur Complement System Solver
    4.9.3 Backward Substitution
5 Numerical Results
  5.1 Segmentation
    5.1.1 Experimental Order of Convergence
    5.1.2 Artificial Images (Checkerboard, Grayscale Gradient)
    5.1.3 Real World Images (Multiple Channels, Large-Scale Image)
  5.2 Parallel Performance
    5.2.1 Computation Environments
    5.2.2 Scalability Benchmarks (Small-Sized Problem, Large-Scale)
6 Conclusion and Perspective
Acknowledgements
I wish to thank all of my friends for their assistance in my studies and especially in
this work. Special thanks go to Jenny for all the love, care, fun and for constantly
triggering thoughts on and actions in a philosophical and political world that matters
beyond mathematics.
Concerning this work, I am very grateful to Michael Fried for the excellent supervision
and the topic, perfectly matching my personal interests. I had great fun while attaining
knowledge together with Kai Hertel, occasionally spending days and nights on program
code. Furthermore, I want to thank the entire staff (and Saeco) at AM3 for making this
a humane place to work productively. Special thanks go to Eberhard Bänsch, Steffen Basting, Rodolphe Prignitz, Stephan Weller and Rolf Krahl for taking the time whenever mathematical problems arose and the latter two, being LaTeX experts, also for their support concerning typesetting. Thanks are also directed towards the high performance computing team at the university's computing center for operating the woody cluster
and sharing their profound knowledge.
My parents deserve very special thanks for unconditionally supporting me in every
way, enabling me to concentrate on my studies and this work. I deeply wish everyone to
be able to study under similar circumstances and look forward to a time when income
will no longer determine educational chances. Beyond that, I thank my brother Mirko
for the humorous phone calls, often exhilarating me in times of heavy work load.
Thank you!
Notation
Basic Notation
R               Set of real numbers
R⁺              Set of positive real numbers
N               Set of natural numbers
N₀              Set of natural numbers including zero
(x₁, ..., xₙ)   Vector with components x₁, ..., xₙ
eᵢ              i-th unit vector
x · y           Euclidean scalar product of two vectors x and y
‖x‖             Euclidean norm of a vector x
Aᵢⱼ             Entry in the i-th row and j-th column of a matrix A
Aᵀ              Transpose of a matrix A
‖A‖             Norm of a matrix A
κ(A)            Condition number of a matrix A
Operators
For a function f : Ω × R⁺ → R of space and time and a vector-valued function g : Rᵈ × R⁺ → Rᵈ we write:

∂ₜ f       Time derivative of f: ∂ₜ f := ∂f/∂t
∂ᵢ f       Derivative of f with respect to the i-th spatial axis: ∂ᵢ f := ∂f/∂xᵢ
∇f         Gradient with respect to the spatial variables: ∇f = (∂₁ f, ..., ∂_d f)
∇ · g      Divergence with respect to the spatial variables: ∇ · g = Σ_{i=1}^{d} ∂ᵢ gᵢ
Specific symbols
I            Vector-valued image I : Ω → Rᵐ
BV(Ω)        Space of functions of bounded variation
diam(S)      Diameter of a simplex S
T_h          Triangulation T_h = {Sᵢ}_{i=1}^{N_T} of Ω with N_T simplices and h = max_{i=1,...,N_T} diam(Sᵢ)
X_T          Vector space of finite element functions with respect to T
v_h          Discrete function v_h ∈ X_{T_h}
Ωᵢ           i-th segment Ωᵢ of a segmentation
Γ            Segmentation interface separating the segments Ωᵢ
Pᵢ           i-th subdomain of a partitioning P
I            Interface separating the subdomains Pᵢ
Rᵢ           Restriction operator Rᵢ mapping unknowns in the global domain Ω to unknowns of the i-th subdomain Pᵢ
1 Introduction
Image segmentation partitions a given image into multiple segments in such a way that
similar regions are grouped together in one segment. More generally, the segmented
image shares visual characteristics in each segment. The process aims at detecting
objects and their boundaries or at simplifying the image in order to analyze it more easily.
Computer-based image segmentation has become a vital method in several applications
like locating objects in medical imaging or satellite images and enables many people to
focus on the kind of work computers are not able to perform yet.
When it comes to the computational processing of huge datasets like high-resolution
images in two or three dimensions, arising for example in magnetic resonance imaging, problems like high memory consumption and long computation times have to be
addressed.
This work presents an image segmentation algorithm combined with the domain decomposition method which allows for fast computation of large-scale image segmentations on parallel computers. Since the segmentation algorithm has been investigated
thoroughly in the past, we will place emphasis on the domain decomposition technique
in this study.
In chapter 2, we will present a mathematical model of image segmentation based on
the Mumford-Shah functional. Input data may consist of multiple image channels, e.g.
RGB color images. The presented algorithm generates a piecewise constant approximation of a given image with an arbitrary number of segments. The resulting partial
differential equation is discretized in space by the finite element method.
The theoretical background of the employed domain decomposition method will be
introduced in chapter 3. We will shed light on different partitioning approaches and
present a Schur complement method, which is a straightforward approach to decouple
groups of unknowns resulting from the finite element discretized equation. The decoupling is the key to our parallel implementation.
We will describe the most important details concerning the implementation in chapter 4. The algorithms have been embedded in the abstract image processing framework
Image which is briefly introduced together with the used finite element toolbox ALBERTA. Because domain decomposition methods aim at speeding up computations,
they are tightly coupled to computer science and we will describe the algorithms both
from a mathematical and from a computational point of view where appropriate. Concepts such as the distributed memory approach and MPI are briefly discussed. We
will demonstrate parallel partitioning with ParMetis before discussing the distributed
parallel Schur complement solver, which is the core and workhorse of our implementation. Crucial points in the implementation are highlighted along with possible solutions
of which very few appear in the form of actual source code. For the sake of clarity the
chapter closes with an overview of the work flow of the presented algorithms.
Chapter 5 presents experiments addressing the segmentation of example images as well
as the analysis of the parallel performance of the domain decomposition implementation.
Computations have been performed with up to 384 processors on the high-performance cluster woody installed at the computing center of the University of Erlangen-Nürnberg.
We finish this document with concluding remarks and perspectives for further research
in chapter 6.
For piecewise constant functions u the Mumford-Shah functional boils down to:

    F_CV(Γ) = Σ_{i=0}^{N_S−1} ∫_{Ωᵢ} |cᵢ − I|² dx + ν H^{n−1}(Γ)    (2.3)

with the mean values

    cᵢ = (1/|Ωᵢ|) ∫_{Ωᵢ} I,   i = 0, ..., N_S − 1.    (2.4)
The level set method has several advantages over other methods, e.g. it allows for topology changes of the interface Γ. Following Fried [11] we can furthermore extend the level set approach to N_S = 2^{N_L} segments by using N_L level set functions φ = (φ₀, ..., φ_{N_L−1}). Using the Heaviside function H : R → R with

    H(z) := 0 for z ≤ 0,   H(z) := 1 for z > 0,

we define the Heaviside vector H(φ) := (H(φ_{N_L−1}), ..., H(φ₀)). In order to define the segments for N_L > 1 we use the unique binary representation of the segment index i ∈ {0, ..., N_S − 1},

    b(i) := (b_{N_L−1}(i), ..., b₀(i))   with b_j(i) ∈ {0, 1} for all j ∈ {0, ..., N_L − 1},

with

    i = Σ_{j=0}^{N_L−1} b_j(i) 2^j.    (2.7)
Figure 2.1: Two level set functions and the resulting segmentation. (a) Graph of two level set functions φ₀, φ₁ along with the corresponding zero isoline levels on the bottom. (b) Resulting segments Ω₀, Ω₁, Ω₂, Ω₃ and interface Γ for the level set functions in (a).
Using the above, we can now define the interface Γ and the segments Ωᵢ as

    Γ_j := { x ∈ Ω | φ_j(x) = 0 },
    Γ := ∪_{j=0}^{N_L−1} Γ_j = { x ∈ Ω | Π_{j=0}^{N_L−1} φ_j(x) = 0 },    (2.8)
    Ωᵢ := { x ∈ Ω | H(φ(x)) = b(i) }.

Figure 2.1 shows a simple example when two level set functions are used.
For convenience we split the index set J := {0, ..., N_L − 1} into two subsets for every segment index i:

    I(i) := { j ∈ J | b_j(i) = 1 },   Ī(i) := J \ I(i).    (2.9)

The indicator function χᵢ(φ) of the segment Ωᵢ then reads

    χᵢ(φ) := Π_{j∈I(i)} H(φ_j) · Π_{j∈Ī(i)} (1 − H(φ_j)).    (2.10)
In order to reformulate the length of Γ in terms of level set functions we need some definitions from the theory of functions of bounded variation. We only present the basics and refer to the work of Ambrosio, Fusco and Pallara [1] for an in-depth analysis of the Mumford-Shah energy functional with respect to functions of bounded variation.

Definition 2.3.1 (Variation). Let f ∈ L¹(Ω). The variation V(f, Ω) of f in Ω is defined by

    V(f, Ω) := sup { ∫_Ω f ∇·ψ dx | ψ ∈ C₀¹(Ω, Rⁿ), ‖ψ‖_∞ ≤ 1 }.
This result will be of importance in section 2.4 where the discontinuous Heaviside function is going to be replaced by a regularized Heaviside function.
Definition 2.3.2 (Function of bounded variation). A function f ∈ L¹(Ω) is a function of bounded variation in Ω if V(f, Ω) < ∞. The vector space of all functions of bounded variation is denoted by BV(Ω) := { f ∈ L¹(Ω) | V(f, Ω) < ∞ }.

As carried out in detail in [1] it turns out that for a set E ⊂ Ω of finite perimeter the following holds:

    H^{n−1}(∂E ∩ Ω) = V(χ_E, Ω).    (2.11)

Applied to the segments this yields

    H^{n−1}(Γ) = (1/2) Σ_{i=0}^{N_S−1} H^{n−1}(∂Ωᵢ ∩ Ω) = (1/2) Σ_{i=0}^{N_S−1} V(χᵢ, Ω)    (2.12)

with χᵢ = χᵢ(φ).
In practice H^{n−1}(Γ) is approximated by

    H^{n−1}(Γ) ≈ Σ_{j=0}^{N_L−1} V(H(φ_j), Ω) =: L_Γ(φ).    (2.13)

This approximation only suffers from inaccuracy in the case of multiple level set functions, when two or more zero isolevel lines coincide. The length of these overlapping parts would be counted twice or even more often.
Using the functional (2.3) together with the approximation (2.13) leads to the following level set formulation of the Mumford-Shah energy functional for piecewise constant functions:

    F_LS(φ) = Σ_{i=0}^{N_S−1} ∫_{Ωᵢ} |cᵢ − I|² + ν L_Γ(φ)
            = Σ_{i=0}^{N_S−1} ∫_Ω |cᵢ − I|² χᵢ(φ) + ν L_Γ(φ),    (2.14)

    cᵢ = (1/|Ωᵢ|) ∫_{Ωᵢ} I,   i = 0, ..., N_S − 1.
Figure 2.2: (a) shows the regularized Heaviside function H_ε and (b) the regularized delta function δ_ε for ε = 0.1.
    δ_ε(z) := (d/dz) H_ε(z) = (1/π) · ε/(ε² + z²).    (2.16)

Note that lim_{ε→0} H_ε = H and lim_{ε→0} δ_ε = δ, where δ denotes the (distributional) derivative of H.
A second regularization with compact support is given by H̃_ε and δ̃_ε:

    H̃_ε(z) := 0                                   for z < −ε,
    H̃_ε(z) := (1/2)(1 + z/ε + (1/π) sin(πz/ε))    for |z| ≤ ε,
    H̃_ε(z) := 1                                   for z > ε,

    δ̃_ε(z) := (d/dz) H̃_ε(z) = (1/(2ε))(1 + cos(πz/ε))   for |z| ≤ ε,   δ̃_ε(z) := 0 otherwise.
The regularized interface length is defined by

    L_{Γ,ε}(φ) := Σ_{j=0}^{N_L−1} V(H_ε(φ_j), Ω)    (2.17)
               = Σ_{j=0}^{N_L−1} ∫_Ω |∇(H_ε(φ_j))|
               = Σ_{j=0}^{N_L−1} ∫_Ω δ_ε(φ_j) |∇φ_j|.    (2.18)

Analogously to (2.10), the regularized indicator function of the segment Ωᵢ reads

    χ_{i,ε}(φ) := Π_{j∈I(i)} H_ε(φ_j) · Π_{j∈Ī(i)} (1 − H_ε(φ_j)).    (2.19)

Replacing H and the interface length in (2.14) by their regularized counterparts yields

    F_ε(φ) = Σ_{i=0}^{N_S−1} ∫_Ω |cᵢ − I|² χ_{i,ε}(φ) + ν L_{Γ,ε}(φ),
    cᵢ = (1/|Ωᵢ|) ∫_{Ωᵢ} I,   i = 0, ..., N_S − 1.    (2.20)
For a vector-valued image I : Ω → R^{N_C} with N_C channels the fidelity term is applied channel-wise,

    Σ_{i=0}^{N_S−1} ∫_{Ωᵢ} (1/N_C) ‖cᵢ − I‖²,   cᵢ^k = (1/|Ωᵢ|) ∫_{Ωᵢ} I^k,   k = 1, ..., N_C.    (2.21)
With e_l as the l-th unit vector, the above condition is equivalent to:

    (d/ds) [ F_ε(φ + s ψ e_l) ] |_{s=0} = 0   for all l ∈ J = {0, ..., N_L − 1}.    (2.23)
Differentiating the regularized indicator function (2.19) in the direction ψ e_l leaves a product over the remaining indices I(i) \ {l} and Ī(i) \ {l}, multiplied by ±δ_ε(φ_l + sψ) ψ depending on whether l ∈ I(i) or l ∈ Ī(i); with the binary representation of the segment index b(i) defined in (2.7) we arrive at

    (d/ds) [ χ_{i,ε}(φ + s ψ e_l) ] |_{s=0} = (−1)^{1−b_l(i)} δ_ε(φ_l + sψ) ψ χ^l_{i,ε}(φ) |_{s=0}
                                            = (−1)^{1−b_l(i)} δ_ε(φ_l) ψ χ^l_{i,ε}(φ),    (2.24)

where χ^l_{i,ε}(φ) denotes the product (2.19) with the factor containing φ_l omitted.
For the length term only the summand containing φ_l depends on s, so that

    (d/ds) [ L_{Γ,ε}(φ + s ψ e_l) ] |_{s=0}
      = (d/ds) [ ∫_Ω δ_ε(φ_l + sψ) |∇(φ_l + sψ)| ] |_{s=0}
      = ∫_Ω δ'_ε(φ_l) ψ |∇φ_l| + ∫_Ω δ_ε(φ_l) (∇φ_l · ∇ψ) / |∇φ_l|.

Integration by parts of the second term (the boundary integral vanishes due to the homogeneous Neumann boundary condition ∂ₙφ_l = 0 on ∂Ω) gives

      = ∫_Ω δ'_ε(φ_l) ψ |∇φ_l| − ∫_Ω ∇·( δ_ε(φ_l) ∇φ_l / |∇φ_l| ) ψ
      = ∫_Ω δ'_ε(φ_l) ψ |∇φ_l| − ∫_Ω δ'_ε(φ_l) |∇φ_l| ψ − ∫_Ω δ_ε(φ_l) ∇·( ∇φ_l / |∇φ_l| ) ψ
      = − ∫_Ω δ_ε(φ_l) ∇·( ∇φ_l / |∇φ_l| ) ψ,    (2.25)

where we used ∇·( δ_ε(φ_l) ∇φ_l / |∇φ_l| ) = δ'_ε(φ_l) |∇φ_l| + δ_ε(φ_l) ∇·( ∇φ_l / |∇φ_l| ).
Now it is time to combine (2.24) and (2.25) such that the derivative from (2.23) becomes:

    (d/ds) [ F_ε(φ + s ψ e_l) ] |_{s=0}
      = (1/N_C) Σ_{k=1}^{N_C} Σ_{i=0}^{N_S−1} ∫_Ω |cᵢ^k − I^k|² (d/ds)[ χ_{i,ε}(φ + s ψ e_l) ]|_{s=0}
        + ν (d/ds)[ L_{Γ,ε}(φ + s ψ e_l) ]|_{s=0}
      = Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l δ_ε(φ_l) χ^l_{i,ε}(φ) ψ − ν ∫_Ω δ_ε(φ_l) ∇·( ∇φ_l / |∇φ_l| ) ψ
      = 0   for all l ∈ J,    (2.26)

with

    gᵢ^l := (−1)^{1−b_l(i)} (1/N_C) Σ_{k=1}^{N_C} |cᵢ^k − I^k|².
Because we chose ψ ∈ C^∞(Ω̄, R) to be an arbitrary test function, we obtain the pointwise Euler-Lagrange equations

    Σ_{i=0}^{N_S−1} gᵢ^l δ_ε(φ_l) χ^l_{i,ε}(φ) − ν δ_ε(φ_l) ∇·( ∇φ_l / |∇φ_l| ) = 0   in Ω,
    δ_ε(φ_l) ∂ₙφ_l / |∇φ_l| = 0   on ∂Ω,    (2.27)

for all l ∈ J.
Note that the length L_{Γ,ε} of the interface in the energy functional (2.20) now appears in the second term, with ∇·( ∇φ_l / |∇φ_l| ) being the curvature of the isolines of the level set function φ_l, in particular of its zero isoline level.
Let us recall the two regularization approaches defined in section 2.4. Chan and Vese observed in [7] that only local minima of the non-convex functional may be found with the second regularization H̃_ε and δ̃_ε, respectively. The small compact support supp(δ̃_ε) = [−ε, ε] would be responsible for making the algorithm depend on the initial level set function, so that only local minima may be obtained. The first introduced regularization δ_ε is not equal to zero everywhere and tends to compute global minima.
A gradient descent approach for minimizing (2.20) with respect to each level set function φ_l, using an artificial time variable t, leads to the evolution equations

    ∂ₜφ_l / δ_ε(φ_l) − ν ∇·( ∇φ_l / |∇φ_l| ) = − Σ_{i=0}^{N_S−1} gᵢ^l χ^l_{i,ε}(φ)   in Ω × (0, T],
    ∂ₙφ_l / |∇φ_l| = 0                                                              on ∂Ω × (0, T],    (2.28)
    φ_l(·, 0) = φ_l^0(·)                                                            in Ω.
Replacing the modulus |∇φ_l| by a regularization Q_ε(|∇φ_l|), which avoids division by zero for vanishing gradients, yields the regularized evolution equations

    ∂ₜφ_l / δ_ε(φ_l) − ν ∇·( ∇φ_l / Q_ε(|∇φ_l|) ) = − Σ_{i=0}^{N_S−1} gᵢ^l χ^l_{i,ε}(φ)   in Ω × (0, T],
    ∂ₙφ_l / Q_ε(|∇φ_l|) = 0                                                             on ∂Ω × (0, T],    (2.30)
    φ_l(·, 0) = φ_l^0(·)                                                                in Ω.
The corresponding weak formulation of the first equation in (2.30) can now be written as: for all ψ ∈ C^∞(Ω̄ × [0, T], R) and l ∈ J

    ∫_Ω (∂ₜφ_l / δ_ε(φ_l)) ψ − ν ∫_Ω ∇·( ∇φ_l / Q_ε(|∇φ_l|) ) ψ = − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ) ψ.

Integration by parts,

    ∫_Ω (∂ₜφ_l / δ_ε(φ_l)) ψ + ν ∫_Ω (∇φ_l · ∇ψ) / Q_ε(|∇φ_l|) − ν ∫_{∂Ω} (∂ₙφ_l / Q_ε(|∇φ_l|)) ψ
      = − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ) ψ,

and dropping of the boundary term because of the Neumann boundary conditions results in:

    ∫_Ω (∂ₜφ_l / δ_ε(φ_l)) ψ + ν ∫_Ω (∇φ_l · ∇ψ) / Q_ε(|∇φ_l|) = − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ) ψ.    (2.31)
For vertices a₀, ..., a_k of a simplex S the set

    S' = { x = Σ_{i=0}^{k} λᵢ aᵢ | λᵢ ∈ R, 0 ≤ λᵢ ≤ 1 and Σ_{i=0}^{k} λᵢ = 1 }

is called a k-sub-simplex of S.
Every discrete function v_h ∈ X_T can be written in terms of the finite element basis B_T = {ψ₁, ..., ψ_{N_B}} as

    v_h(x) = Σ_{i=1}^{N_B} vᵢ ψᵢ(x).    (2.32)
We can now formulate the spatially discretized version of the weak evolution equation (2.31) as follows: for all j ∈ {1, ..., N_B}

    ∫_Ω (∂ₜφ_{h,l} / δ_ε(φ_{h,l})) ψ_j + ν ∫_Ω (∇φ_{h,l} · ∇ψ_j) / Q_ε(|∇φ_{h,l}|)
      = − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ_h) ψ_j.    (2.33)
Discretizing in time with time steps τ_m := t_m − t_{m−1} and evaluating the nonlinear terms at the previous time step φ^{m−1}_{h,l} leads to the semi-implicit scheme: for all l ∈ J, j ∈ {1, ..., N_B}

    ∫_Ω (φ^m_{h,l} − φ^{m−1}_{h,l}) ψ_j / (τ_m δ_ε(φ^{m−1}_{h,l})) + ν ∫_Ω (∇φ^m_{h,l} · ∇ψ_j) / Q_ε(|∇φ^{m−1}_{h,l}|)
      = − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ^{m−1}_h) ψ_j.    (2.35)
Using the function's representation φ^m_{h,l} = Σ_{k=1}^{N_B} φ^m_{k,l} ψ_k in the basis B_T from (2.32) we reformulate (2.35) to: for all l ∈ J, j ∈ {1, ..., N_B}

    Σ_{k=1}^{N_B} φ^m_{k,l} ∫_Ω ψ_j ψ_k / (τ_m δ_ε(φ^{m−1}_{h,l})) + ν Σ_{k=1}^{N_B} φ^m_{k,l} ∫_Ω ∇ψ_j · ∇ψ_k / Q_ε(|∇φ^{m−1}_{h,l}|)
      = Σ_{k=1}^{N_B} φ^{m−1}_{k,l} ∫_Ω ψ_j ψ_k / (τ_m δ_ε(φ^{m−1}_{h,l})) − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ^{m−1}_h) ψ_j.    (2.36)
Defining our system matrices A^l ∈ R^{N_B × N_B} and the corresponding right hand sides f^l ∈ R^{N_B} by

    A^l_{jk} := ∫_Ω ψ_j ψ_k / (τ_m δ_ε(φ^{m−1}_{h,l})) + ν ∫_Ω ∇ψ_j · ∇ψ_k / Q_ε(|∇φ^{m−1}_{h,l}|),
    f^l_j  := Σ_{k=1}^{N_B} φ^{m−1}_{k,l} ∫_Ω ψ_j ψ_k / (τ_m δ_ε(φ^{m−1}_{h,l})) − Σ_{i=0}^{N_S−1} ∫_Ω gᵢ^l χ^l_{i,ε}(φ^{m−1}_h) ψ_j    (2.37)

for all l ∈ J, a linear system A^l φ^m_l = f^l has to be solved for the coefficient vector φ^m_l = (φ^m_{1,l}, ..., φ^m_{N_B,l})ᵀ in every time step.
The matrix A^l is symmetric and we note that for all v ∈ R^{N_B} with v ≠ 0 and v_h := Σ_{j=1}^{N_B} v_j ψ_j it holds that

    vᵀ A^l v = Σ_{j,k=1}^{N_B} ∫_Ω v_j ψ_j v_k ψ_k / (τ_m δ_ε(φ^{m−1}_{h,l})) + ν Σ_{j,k=1}^{N_B} ∫_Ω v_j ∇ψ_j · v_k ∇ψ_k / Q_ε(|∇φ^{m−1}_{h,l}|)
            = ∫_Ω (Σ_j v_j ψ_j)(Σ_k v_k ψ_k) / (τ_m δ_ε(φ^{m−1}_{h,l})) + ν ∫_Ω (Σ_j v_j ∇ψ_j) · (Σ_k v_k ∇ψ_k) / Q_ε(|∇φ^{m−1}_{h,l}|)
            = ∫_Ω v_h² / (τ_m δ_ε(φ^{m−1}_{h,l})) + ν ∫_Ω ‖∇v_h‖² / Q_ε(|∇φ^{m−1}_{h,l}|)
            > 0.

Thus the matrix A^l is symmetric and positive definite. This will be important for the selection of an appropriate solver in chapter 3.
Figure 3.1: Partition of Ω into two non-overlapping subdomains P₁ and P₂. The interface I = ∂P₁ ∩ ∂P₂ separates the subdomains from each other.
3.1 Partitioning
First of all we shall introduce the terms partition and interface:
Definition 3.1.1 (Partition, Interface). A set of subsets Pᵢ ⊂ Ω, i = 1, ..., N_P, is called a partition P = {Pᵢ}_{i=1,...,N_P} of Ω if

  (1) Pᵢ ≠ ∅ for all i ∈ {1, ..., N_P},
  (2) Ω̄ = ∪_{i=1}^{N_P} P̄ᵢ,
  (3) P̊ᵢ ∩ P̊ⱼ = ∅ for all i, j ∈ {1, ..., N_P}, i ≠ j.

Then Pᵢ is called a subdomain of Ω for every i ∈ {1, ..., N_P}. The induced interface I is defined by

    I := ∪_{i,j ∈ {1,...,N_P}, i ≠ j} ( ∂Pᵢ ∩ ∂Pⱼ ).

Figure 3.1 illustrates the definitions of partition and interface in a basic example.
Figure 3.1 illustrates the definitions of partition and interface in a basic example.
Note 3.1.1. The partition interface I is not to be confused with the segmentation interface Γ defined in (2.8). They may, but usually will not, coincide. The same applies to the subdomains Pᵢ, which are often denoted by Ωᵢ in domain decomposition literature. However, Γ and Ωᵢ will always refer to the segmentation algorithm in this work, whereas
the subdomains Pi and the partition interface I are related to the domain decomposition
technique.
As we are about to develop a domain decomposition method for an algorithm using the
finite element discretization for the computational domain we only allow for partitions
with subdomains consisting of complete simplices. We demand that for a partition
P = {Pᵢ}_{i=1,...,N_P} of Ω with respect to a conforming triangulation T = {Sⱼ}_{j=1,...,N_T} the following holds for some index subsets Jᵢ ⊂ {1, ..., N_T}:

    for all i ∈ {1, ..., N_P}:   P̄ᵢ = ∪_{j ∈ Jᵢ} S̄ⱼ.
Figure 3.2: (a) Chessboard-like partition of a globally refined triangulation into four subdomains P₁, ..., P₄ with interface I. (b) The same strategy applied to a locally refined triangulation, where the subdomain P₃ contains far more simplices than P₁, P₂ and P₄.
A simple structured approach assigns a simplex lying in the block with multi-index (j₁, ..., jₙ) of an m₁ × ··· × mₙ grid of blocks to the partition with number

    Σ_{l=1}^{n} (j_l − 1) Π_{k=1}^{l−1} m_k.
This results in a chessboard-like partition of an associated triangulation T as illustrated in figure 3.2(a) for a globally refined triangulation where all simplices are of the
same volume and arranged in a structured way.
Problems arise when it comes to locally refined triangulations like the one shown in
figure 3.2(b). Here, the number of unknowns in the subdomain P3 is much greater than
the number in the subdomains P1 , P2 and P4 . We will later on assign every subdomain
Pi to one processor. An imbalance, as illustrated in figure 3.2(b), results in a disastrous
parallel efficiency, because the processor dealing with P₃ would still be computing while the others would already have finished their task. Three processors would waste CPU cycles in idle mode. This behavior becomes even worse for larger numbers of CPUs.
Figure 3.3(b) gives an example of what the dual graph looks like for a small two-dimensional mesh.
Metis and its parallelized offspring ParMetis address exactly the mentioned demands and thus are perfect candidates for partitioning a given triangulation T into
equal-sized subdomains with minimal interface size. In addition to the high quality of
the obtained partitions, Metis and ParMetis are very fast. For further details on the
algorithms used we refer to the work of Karypis and Kumar, particularly [13] and [14].
Figure 3.3 illustrates the typical workflow for partitioning a given triangulation with
Metis. We will discuss the remaining implementation issues in section 4.4.
Note 3.1.2. As of this writing, the partitioning routines implemented in Metis do not
guarantee that the resulting subdomains Pi are contiguous. In practice we have only
been able to observe non-contiguous subdomains with Metis in particular non-realistic
cases where the number of partitions was almost the number of simplices. However, our
algorithm is prepared for non-contiguous subdomains.
Note 3.1.3. Metis tries to achieve a minimal edgecut in the dual graph which means
a minimal size of the interface I in terms of adjacent simplices (and not in terms of
the geometrical length) while trying to keep the number of graph vertices (simplices) in
each partition equal. However, the sizes of the interface parts Ii := I Pi touching
one particular subdomain may vary. This has to be considered when designing and
implementing the algorithms.
Figure 3.3: Typical workflow for partitioning a triangulation with Metis: (a) triangulation, (b) corresponding dual graph G = (T, E), (c) partitioned dual graph, (d) resulting subdomains P₁, ..., P₄ and interface I.
After reordering the unknowns such that the interior unknowns of the subdomains P₁, ..., P_M come first, followed by the interface unknowns, the linear system reads

    A x = f    (3.1)

with

    x = (x_{P₁}, ..., x_{P_M}, x_I)ᵀ   and   f = (f_{P₁}, ..., f_{P_M}, f_I)ᵀ

and the block matrix

    A = ( A_{P₁P₁}                               A_{P₁I}
                    A_{P₂P₂}                     A_{P₂I}
                              ...                ...
                                     A_{P_MP_M}  A_{P_MI}
          A_{IP₁}   A_{IP₂}   ...    A_{IP_M}    A_{II}   ),    (3.2)

or, in compact form,

    A = ( A_PP  A_PI
          A_IP  A_II ).    (3.3)
Figure 3.4: (a) shows a partitioning of a triangulation with 584 simplices into 4 subdomains and (b) shows the block structure of the corresponding matrix A after reordering the unknowns. Here the interface I has been divided into parts I₁, ..., I₅, I_X in order to illustrate the adjacency structure in the matrix more clearly.
Note that A_PP is a block diagonal matrix. Furthermore, the reordering conserves the sparseness, symmetry and positive definiteness of A because rows and columns are exchanged simultaneously.
Let us now perform a block Gaussian elimination to eliminate the block A_IP in (3.3). We therefore multiply equation (3.1) with

    L := ( I                  0
           −A_IP A_PP^{-1}    I )

and obtain

    ( A_PP   A_PI                         ) ( x_P )   ( f_P                      )
    ( 0      A_II − A_IP A_PP^{-1} A_PI   ) ( x_I ) = ( f_I − A_IP A_PP^{-1} f_P ).    (3.4)

The matrix

    S := A_II − A_IP A_PP^{-1} A_PI    (3.5)

is called the Schur complement matrix of A associated with the interface variables x_I. Together with

    f̃_I := f_I − A_IP A_PP^{-1} f_P    (3.6)

we obtain the Schur complement system

    S x_I = f̃_I.    (3.7)
Due to the block structure of A_PP its inverse is again block diagonal,

    A_PP^{-1} = diag( A_{P₁P₁}, A_{P₂P₂}, ..., A_{P_MP_M} )^{-1}
              = diag( A_{P₁P₁}^{-1}, A_{P₂P₂}^{-1}, ..., A_{P_MP_M}^{-1} ).    (3.8)

Hence a system A_PP z_P = y_P in fact naturally decouples into M systems

    A_{PᵢPᵢ} z_{Pᵢ} = y_{Pᵢ},   i ∈ {1, ..., M}.

These systems can therefore be solved independently in parallel.
The Schur complement S inherits symmetry and positive definiteness from A. For a symmetric positive definite block matrix M = ( A  B ; Bᵀ  C ) and arbitrary vectors y, z we have

    ( yᵀ zᵀ ) M ( y ; z ) = yᵀ A y + yᵀ B z + zᵀ Bᵀ y + zᵀ C z = yᵀ A y + 2 yᵀ B z + zᵀ C z.

Minimizing over y for fixed z yields y = −A^{-1} B z, and inserting this minimizer gives

    min_y ( yᵀ zᵀ ) M ( y ; z ) = zᵀ Bᵀ A^{-1} A A^{-1} B z − 2 zᵀ Bᵀ A^{-1} B z + zᵀ C z
                                = zᵀ C z − zᵀ Bᵀ A^{-1} B z
                                = zᵀ ( C − Bᵀ A^{-1} B ) z
                                = zᵀ S z,

which is positive for z ≠ 0 since M is positive definite. Thus S is symmetric and positive definite.
For every subdomain Pᵢ we consider the local subdomain matrix

    Aⁱ := ( Aⁱ_PP   Aⁱ_PI
            Aⁱ_IP   Aⁱ_II )

with Aⁱ_PP ∈ R^{Nⁱ_P × Nⁱ_P}, Aⁱ_PI = (Aⁱ_IP)ᵀ ∈ R^{Nⁱ_P × Nⁱ_I} and Aⁱ_II ∈ R^{Nⁱ_I × Nⁱ_I}. Let Rᵢ be a restriction operator which maps the unknowns of the global domain Ω to the corresponding unknowns of the subdomain P̄ᵢ. This restriction operator can be represented by a matrix Rᵢ ∈ {0, 1}^{Nⁱ × N} consisting only of zeros and ones. The transpose Rᵢᵀ is called the prolongation operator and extends a vector xⁱ ∈ R^{Nⁱ} from the subdomain P̄ᵢ to the global domain by inserting zeros outside of P̄ᵢ. The global system matrix A can then be expressed as the sum of the local subdomain matrices Aⁱ:

    A = Σ_{i=1}^{M} Rᵢᵀ Aⁱ Rᵢ.    (3.9)

The block matrix A_II appearing in (3.5) can thus be written as the sum of the Aⁱ_II by only using the interface part Rⁱ_I ∈ {0, 1}^{Nⁱ_I × N_I} of the restriction Rᵢ = ( Rⁱ_P ; Rⁱ_I ):

    A_II = Σ_{i=1}^{M} (Rⁱ_I)ᵀ Aⁱ_II Rⁱ_I.    (3.10)
We have seen in (3.8) that the matrix A_PP can be inverted block-wise, and the matrix A_IP A_PP^{-1} A_PI in (3.5) is

    A_IP A_PP^{-1} A_PI
      = ( A_{IP₁}  ...  A_{IP_M} ) diag( A_{P₁P₁}^{-1}, ..., A_{P_MP_M}^{-1} ) ( A_{P₁I} ; ... ; A_{P_MI} )
      = Σ_{i=1}^{M} A_{IPᵢ} A_{PᵢPᵢ}^{-1} A_{PᵢI}
      = Σ_{i=1}^{M} (Rⁱ_I)ᵀ Aⁱ_IP (Aⁱ_PP)^{-1} Aⁱ_PI Rⁱ_I.    (3.11)
Together with (3.10), the Schur complement (3.5) becomes

    S = Σ_{i=1}^{M} (Rⁱ_I)ᵀ ( Aⁱ_II − Aⁱ_IP (Aⁱ_PP)^{-1} Aⁱ_PI ) Rⁱ_I = Σ_{i=1}^{M} (Rⁱ_I)ᵀ Sⁱ Rⁱ_I    (3.12)

with the local subdomain Schur complements

    Sⁱ := Aⁱ_II − Aⁱ_IP (Aⁱ_PP)^{-1} Aⁱ_PI.    (3.13)

We showed that the global system matrix A as well as the Schur complement matrix S can be obtained by summing up local subdomain matrices. The decoupling property (3.8) and equation (3.12) will allow us to implement a distributed iterative solver for the Schur complement system which will be described in detail in the next section and in chapter 4.
We now consider a scalability experiment where the triangulation is refined and the number of subdomains is increased in such a way that the ratio of subdomain diameter to mesh size H/h is held constant. Roughly this means that the number of unknowns in a subdomain does not change. Because as per (3.14) the condition number bound depends on H/h, we can expect the parts of the algorithm dealing with the subdomain solves to be scalable.
The Schur complement matrix, in contrast, does not share this characteristic: its condition number depends on (Hh)^{-1} and thus deteriorates as the number of subdomains increases because of smaller subdomain diameters H. The actual impact on runtime characteristics will be discussed in chapter 5.
4 Implementation in Image
This chapter concentrates on the implementation of the Schur complement domain decomposition method presented in chapter 3, which is ultimately combined with the
image segmentation algorithm described in chapter 2. All source code that emerged
from this work has been completely integrated into the Image project which was initiated by Kai Hertel and the author with conceptual and mathematical mentoring by
Michael Fried.
Image makes extensive use of the open source library ALBERTA for the implementation of finite element based image processing operators. Our domain decomposition
code was built on top of this library and as a result of this work we have been able
to parallelize all finite element based image operators implemented in Image with only
minor modifications. Because the basic implementation of a parallelized second order
problem like the image segmentation algorithm still follows ALBERTA's control flow,
we refer to Schmidt and Siebert [18] for a full documentation of the ALBERTA toolbox.
After briefly introducing the Image framework, we will concentrate on the distributed
iterative solver for the Schur complement domain decomposition method in this chapter.
of this document. Our code as well as most of the used libraries are written in the C
programming language with both efficiency and maintainability in mind.
We shall now present the essence of concepts and structures we are about to use with
the domain decomposition implementation later on. A self-contained presentation of
Image would go beyond the scope of this work and we refer to Kai Hertel's diploma thesis [12] for a more detailed documentation of Image's usage and internals.
The basic data type Image operates on is the img_list, a linked list of image channels. Image operators are described by the img_operator structure:
struct _img_operator {
    const char                  *name;
    const char                  *friendly_name;
    const img_channel_reference  accept_mask;
    img_list *(*run)(const img_list *channels_in,
                     const img_optset optset,
                     img_outarr outarr
                     /* further parameters not shown */);
    /* further members not shown */
};
Figure 4.3: Typical example of the workflow with 4 MPI processes (P0, ..., P3) and multiple operators in Image (readmesh, convert to fe, segment ddm, output): The master reads an image file and enhances the contrast. Then a macro triangulation is read and the image is converted into a finite element function. The segmentation operator with built-in domain decomposition support distributes the work among all available MPI processes and gathers the results in the master process where the data is written to files by an output method. The worker processes are idle when no parallel operator is running.
Image was designed to run on single- and multi-processor systems. In a single-processor environment all parallel operators are disabled. From now on we will assume a multi-processor environment. Then one process becomes the master process and all remaining processes are worker processes waiting for the assignment of jobs from the master process. Each process in the current MPI runtime environment (MPI_COMM_WORLD) is identified by an int rank := MPI_Comm_rank() ∈ {0, ..., size − 1}, where int size := MPI_Comm_size() denotes the number of MPI processes. For the sake of simplicity we will abbreviate the process holding the MPI rank i with Pi. The process P0 is defined as the master process.
The master process is responsible for the following tasks:

- Initialization of subsystems (e.g. GraphicsMagick)
- Read initial image data from files
- Read macro triangulation
- Refine triangulation
- Run serial operators
- Set up and start parallel operators on worker processes when needed. In the case of a finite element based operator this includes:
  - Initialize partitioning of the triangulation by building the dual graph
  - Call ParMetis on master and worker processes
int * xadj ;
// indices for adjncy array
int * adjncy ; // adjacency list for vertices
int * vtxdist ; // distribution of graph vertices
Listing 4.4: Distributed Compressed Row Storage (DCRS)
The array vtxdist has size+1 entries and stores the distribution of the graph's vertices among the processes. The process Pi is responsible for the vertices from vtxdist[i] up to vtxdist[i+1]-1. Note that the vtxdist array is identical for every process because the process owning a particular vertex has to be identifiable by ParMetis. The processes deal with local vertex numbering and the global index of a local vertex j in Pi is vtxdist[i]+j. The local vertex j in Pi is adjacent to the global vertices adjncy[xadj[j]], ..., adjncy[xadj[j+1]-1]. Figure 4.5 shows a simple example.
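To make the array semantics concrete, here is a small hypothetical example (not the graph of figure 4.5): a cycle graph with the four global vertices 0-1-2-3-0, distributed over two processes.

    /* Hypothetical 4-vertex cycle graph, global vertices 0,1 on P0 and 2,3 on P1;
     * adjncy stores global vertex indices. */
    int vtxdist[] = { 0, 2, 4 };            /* identical on both processes      */

    /* on process P0: */
    int xadj_p0[]   = { 0, 2, 4 };
    int adjncy_p0[] = { 1, 3,   0, 2 };     /* neighbors of global 0, global 1  */

    /* on process P1: */
    int xadj_p1[]   = { 0, 2, 4 };
    int adjncy_p1[] = { 1, 3,   2, 0 };     /* neighbors of global 2, global 3  */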
Because ALBERTA organizes the triangulation with a binary tree, there is no integer readily available for the identification of a simplex. We therefore have to traverse the triangulation and tag every element with a unique integer before building the graph
in the distributed CRS format. A new structure img_el_parinfo is introduced which
allows the storage of partitioning data on each leaf element of the binary tree.
Figure 4.5: Distributed CRS format for a small graph consisting of 8 vertices. The adjacency information for ParMetis is distributed among 2 processes.
typedef struct {
unsigned int part , id ;
} img_el_parinfo ;
Listing 4.6: Declaration of img_el_parinfo
ALBERTA's macro LEAF_DATA(EL *el) provides a pointer to memory associated with leaf elements, but since operators usually want to store custom data on leaf elements besides the img_el_parinfo information, the operator has to provide a pointer to a function with the prototype img_el_parinfo *get_el_parinfo(EL *el) which points to the correct memory location inside a LEAF_DATA memory area. The tagging of the mesh is then performed by the function img_alberta_mesh_tag, which is also presented here as a demonstration of ALBERTA's mesh traversal routines:
static int img_alberta_mesh_tag(MESH *mesh,
                                img_el_parinfo *(*get_el_parinfo)(EL *el))
{
    /* signature and traversal set-up reconstructed along ALBERTA's
     * standard leaf traversal routines */
    TRAVERSE_STACK *stack = get_traverse_stack();
    const EL_INFO  *el_info;
    int             count = 0;

    for (el_info = traverse_first(stack, mesh, -1, CALL_LEAF_EL);
         el_info;
         el_info = traverse_next(stack, el_info))
    {
        img_el_parinfo *el_parinfo = get_el_parinfo(el_info->el);
        el_parinfo->id = count++;
    }
    free_traverse_stack(stack);

    return count;
}

Listing 4.7: Definition of img_alberta_mesh_tag
After the triangulation has been tagged, the adjacency information is gathered by traversing the mesh another time in a similar way. In each element we iterate through all neighbors and fill xadj, adjncy and vtxdist accordingly. The MPI_Bcast function is used to transfer parameters for ParMetis to the worker processes. The arrays xadj and adjncy are distributed with the help of the function MPI_Scatter, which sends equally sized parts to all processes. ParMetis is then executed via a call to ParMETIS_V3_PartKway() with the parameter controlling the number of desired partitions set to the number of worker processes. The result of the partitioning process is stored in an array int *part which holds a partition number for each vertex of the dual graph and thus for each simplex of the triangulation. We iterate another time through the triangulation and store the partition number in each leaf element's member el_parinfo->part. Every simplex of the triangulation is now tagged with a partition index and we can begin to distribute the subdomains among the worker processes.
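A sketch of how such a call might look is given below; it is not the original Image code, the option and weight values are illustrative defaults, and the integer/real types (idxtype and float in ParMetis 3.x) depend on the installed ParMetis version.

    #include <mpi.h>
    #include <parmetis.h>

    /* Partition the distributed dual graph (DCRS arrays vtxdist, xadj, adjncy)
     * into nparts parts; the resulting part number of each local vertex is
     * written to the array part. */
    void partition_dual_graph(idxtype *vtxdist, idxtype *xadj, idxtype *adjncy,
                              int nparts, idxtype *part, MPI_Comm comm)
    {
        int   wgtflag = 0, numflag = 0, ncon = 1, edgecut = 0;
        int   options[3] = { 0, 0, 0 };        /* use default options        */
        float ubvec[1]   = { 1.05f };          /* allowed load imbalance     */
        float tpwgts[nparts];                  /* target partition weights   */

        for (int i = 0; i < nparts; i++)
            tpwgts[i] = 1.0f / (float)nparts;  /* equal-sized partitions     */

        ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL, &wgtflag,
                             &numflag, &ncon, &nparts, tpwgts, ubvec,
                             options, &edgecut, part, &comm);
    }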
struct macro_data {
int dim ;
// dimension of the mesh
int n_total_vertices ; // number of vertices
int n_macro_elements ; // number of macro elements
REAL_D * coords ;
// vertex coordinates
int * mel_vertices ;
// macro element vertices
int * neigh ;
// macro element neighbors
S_CHAR * boundary ;
// boundary type if no neighbor
U_CHAR * el_type ;
// not needed by our implementation
};
typedef struct macro_data MACRO_DATA ;
Listing 4.8: Declaration of MACRO_DATA
The interface DOF i in the worker process Pj is identified in the master's interface array by the index assoc_wi2mi[j-1][i].
Note 4.7.1. wi2mi is an abbreviation for worker-interface-to-master-interface.
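A minimal sketch of how such an association vector might be used on the master (array names are hypothetical, not taken from the Image sources):

    /* Accumulate the interface contribution received from worker Pj (j >= 1)
     * into the master's global interface array. */
    void add_worker_interface(double *master_iface, const double *worker_iface,
                              int n_worker_iface, int **assoc_wi2mi, int j)
    {
        for (int i = 0; i < n_worker_iface; i++)
            master_iface[assoc_wi2mi[j - 1][i]] += worker_iface[i];
    }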
Figure 4.10: Runtime behavior when using the blocking MPI_Recv function in the master process. Worker process P1 needs more time for computation than the other ones. The master process as well as P2 and P3 are waiting although data could be transferred to the master process. The small blocks following each MPI_Recv (colored orange) correspond to the processing of received data in lines 9-10 of listing 4.9.
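The original listing 4.9 is not reproduced here; the following minimal sketch (with illustrative names) shows the blocking pattern the figure refers to: the master receives the worker results in a fixed order, so a single slow worker delays the processing of all others.

    #include <mpi.h>

    #define TAG_RESULT 1   /* illustrative message tag */

    void gather_results_blocking(int size, double **recv_buf, const int *recv_count,
                                 void (*process_result)(int rank, const double *data))
    {
        for (int p = 1; p < size; p++) {
            MPI_Status status;
            MPI_Recv(recv_buf[p], recv_count[p], MPI_DOUBLE, p, TAG_RESULT,
                     MPI_COMM_WORLD, &status);
            process_result(p, recv_buf[p]);   /* data handling, cf. figure 4.10 */
        }
    }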
Figure 4.12: Runtime behavior with non-blocking MPI communication in the master process in the same setting as in figure 4.10. The call to MPI_Irecv is not shown because it immediately returns. The master process waits for any of the worker processes with MPI_Waitany which returns once a receive operation is completed. The received data is instantly processed which again is indicated by the small blocks following each MPI communication in the master process (cf. lines 14-15 in listing 4.11).
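Again, the original listing 4.11 is not reproduced; a minimal sketch of the non-blocking pattern (illustrative names) posts all receives at once and processes the results in whatever order they arrive.

    #include <mpi.h>
    #include <stdlib.h>

    #define TAG_RESULT 1   /* illustrative message tag */

    void gather_results_nonblocking(int size, double **recv_buf, const int *recv_count,
                                    void (*process_result)(int rank, const double *data))
    {
        MPI_Request *req = malloc((size - 1) * sizeof(MPI_Request));

        for (int p = 1; p < size; p++)
            MPI_Irecv(recv_buf[p], recv_count[p], MPI_DOUBLE, p, TAG_RESULT,
                      MPI_COMM_WORLD, &req[p - 1]);

        for (int n = 0; n < size - 1; n++) {
            int idx;
            MPI_Waitany(size - 1, req, &idx, MPI_STATUS_IGNORE);
            process_result(idx + 1, recv_buf[idx + 1]);   /* worker rank = idx + 1 */
        }
        free(req);
    }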
In each worker process the local subdomain matrix Aⁱ and the right hand side fⁱ are assembled, and the interface part of the right hand side is adapted according to (3.6) by solving

    Aⁱ_PP wⁱ_P = fⁱ_P    (4.1)

and setting

    f̃ⁱ_I = fⁱ_I − Aⁱ_IP wⁱ_P.    (4.2)

Up to this point all tasks have been carried out in parallel without any communication. Now each worker process sends f̃ⁱ_I to the master process where the right hand side for the Schur complement system is obtained by summing up the subdomain contributions f̃_I = Σ_{i=1}^{M} (Rⁱ_I)ᵀ f̃ⁱ_I. Instead of the prolongation matrix (Rⁱ_I)ᵀ the association vectors assoc_wi2mi described in section 4.7 are used.
As described in section 3.2.3, we will not form the matrices S or Sⁱ explicitly because of the high computational costs for the inverses (Aⁱ_PP)^{-1}. We recall the local subdomain Schur complements (3.13) and their relation to the global Schur complement from (3.12),

    Sⁱ = Aⁱ_II − Aⁱ_IP (Aⁱ_PP)^{-1} Aⁱ_PI,    (4.3)
    S  = Σ_{i=1}^{M} (Rⁱ_I)ᵀ Sⁱ Rⁱ_I,         (4.4)

which gives us a recipe for implementing the matrix-by-vector multiplication with the Schur complement matrix in a distributed manner. For each iteration of the outer iterative solver we have to compute a matrix-by-vector multiplication r_I = S x_I by performing the following operations in our implementation (a code sketch follows the list):
1. First of all, the master process gathers the interface DOFs of x_I for each worker process by using assoc_wi2mi (cf. 4.7) and sends them accordingly via MPI. This corresponds to the application of the restriction operator Rⁱ_I in (4.4). Each worker process Pi now holds the portion xⁱ_I affecting its interface part.
2. Each worker process computes yⁱ_P = Aⁱ_PI xⁱ_I.
3. Each worker process solves Aⁱ_PP zⁱ_P = yⁱ_P using the standard Conjugate Gradient solver implemented in ALBERTA. We will need a high accuracy for this solution as stated in note 3.2.1.
4. Each worker process computes the subdomain result rⁱ_I = Aⁱ_II xⁱ_I − Aⁱ_IP zⁱ_P.
5. The master process receives and sums up the subdomain results rⁱ_I to obtain r_I = Σ_{i=1}^{M} (Rⁱ_I)ᵀ rⁱ_I. We again employ the efficient association vectors assoc_wi2mi instead of a multiplication with the matrix (Rⁱ_I)ᵀ. Additionally, the non-blocking MPI communication described in section 4.8 is used to receive the interface data rⁱ_I from the worker processes. This allows us to process data as soon as it is available and thus prevents unnecessary delay in the master process in the case where worker processes do not terminate their computations in order.
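The worker-side part of one such multiplication (steps 2-4) might look as follows; the matrix type and the helper routines are hypothetical stand-ins for the corresponding ALBERTA-based operations in Image.

    #include <stdlib.h>

    /* Hypothetical placeholders for the actual Image/ALBERTA routines: */
    typedef struct sparse_matrix sparse_matrix;
    extern void matvec(const sparse_matrix *A, const double *x, double *y);      /* y  = A x  */
    extern void matvec_sub(const sparse_matrix *A, const double *x, double *y);  /* y -= A x  */
    extern void cg_solve(const sparse_matrix *A, const double *b, double *x, double tol);

    /* Steps 2-4: x_I is the interface portion received from the master (step 1),
     * r_I is the local result sent back to the master (step 5). */
    void worker_schur_matvec(const sparse_matrix *A_PP, const sparse_matrix *A_PI,
                             const sparse_matrix *A_IP, const sparse_matrix *A_II,
                             const double *x_I, double *r_I,
                             int n_interior, double tol_sub)
    {
        double *y_P = calloc(n_interior, sizeof(double));
        double *z_P = calloc(n_interior, sizeof(double));

        matvec(A_PI, x_I, y_P);            /* step 2: y_P = A^i_PI x^i_I          */
        cg_solve(A_PP, y_P, z_P, tol_sub); /* step 3: solve A^i_PP z_P = y_P (CG) */
        matvec(A_II, x_I, r_I);            /* step 4: r_I = A^i_II x^i_I          */
        matvec_sub(A_IP, z_P, r_I);        /*         r_I -= A^i_IP z_P           */

        free(y_P);
        free(z_P);
    }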
The remaining steps besides this matrix-by-vector multiplication, like the computation of the descent direction and the updates of the residual and the solution, are all performed by ALBERTA's Conjugate Gradient solver. Because ALBERTA allows us to easily exchange the matrix-by-vector multiplication for every implemented iterative solver (e.g. GMRes and BiCGstab), we would be able to use these as well in the case of non-symmetric, positive definite matrices.
Note that the steps 2-4 can be carried out in parallel without communication. Only the steps 1 and 5 involve communication via MPI, which has been optimized in our implementation in order to obtain better scalability.
Besides the matrix-by-vector multiplication, the Conjugate Gradient method only requires the computation of scalar products and the sum of two vectors in each iteration. These are computed serially on the master, but as outlined in section 4.7, the vectors are plain C arrays of the size of the interface and we are able to use optimized BLAS routines for these operations.
Furthermore, the mean values

    cᵢ^k = (1/|Ωᵢ|) ∫_{Ωᵢ} I^k = ( 1 / Σ_{j=1}^{M} |Ωᵢ ∩ Pⱼ| ) Σ_{j=1}^{M} ∫_{Ωᵢ ∩ Pⱼ} I^k

have to be computed for each channel k and each segment Ωᵢ at the end of a time step (cf. section 2.5). This is accomplished by computing the volumes and integrals locally in the worker processes and employing the function MPI_Allreduce() with the MPI reduce operation set to MPI_SUM in order to sum up the local contributions and distribute the result back to all processes.
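A minimal sketch of this reduction for one channel (array names are hypothetical):

    #include <mpi.h>

    /* Each process passes its local volumes |Omega_i ∩ P_j| and local integrals
     * of I^k over Omega_i ∩ P_j; the global sums become available everywhere. */
    void reduce_segment_means(const double *local_vol, double *global_vol,
                              const double *local_int, double *global_int, int n_seg)
    {
        MPI_Allreduce(local_vol, global_vol, n_seg, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Allreduce(local_int, global_int, n_seg, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        /* afterwards: c_i^k = global_int[i] / global_vol[i] on every process */
    }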
Figure 4.13 gives an impression of the parallel work flow for the initialization phase
and one timestep.
Note 4.9.1. A major feature of this implementation is that no matrices have to be
stored or assembled in the master process. All operations that have to be carried out
in the master process are usually considered to be performance-critical. With the underlying domain decomposition approach based upon the finite element method we are
able to assemble the subdomain matrices in parallel in a very natural way without any
communication.
Figure 4.13: Timeline of initialization and one timestep in our implementation with 4 processes: Initialization (orange), assembly (green), Schur complement solver (yellow) and solving for interior variables (blue). The arrows indicate MPI communication between the processes.
5 Numerical Results
In this chapter, we will turn to numerical results of the presented algorithms. In the
first part we will present numerical results obtained by the segmentation algorithm. The
second part will analyze the runtime behavior of the parallelization with benchmarks.
All presented results have been computed using the domain decomposition method
implementation of the segmentation algorithm in the Image project.
Linear Lagrange elements have been used for all calculations. The ALBERTA library allows us to use elements of higher order, but difficulties arise when it comes to the computation of the integrals in the segmentation equation's right hand side (cf. (2.34)) and of the mean values: a non-linear zero isoline of the level set functions would no longer split the elements into simplices and the geometry would become hard to handle.
5.1 Segmentation
First of all, we will verify the correctness of the parallel segmentation algorithm by
presenting the experimental order of convergence in a case with a known solution. We
will also show examples where no exact solution is known but many features of the
Mumford-Shah segmentation method can be recognized. The parallel performance like
timing and efficiency of the used domain decomposition technique is omitted here and will be the subject of the second part (section 5.2).
The piecewise constant approximation then reads

    u(x) = c₀ for x ∈ Ω₀,   u(x) = c₁ for x ∈ Ω₁.
We therefore have to solve the following evolution equations (cf. equation (2.28)):

    ∂ₜφ / δ_ε(φ) − ν ∇·( ∇φ / |∇φ| ) = Σ_{i=0}^{N_S−1} fᵢ χ_{i,ε}(φ)   in Ω × (0, T],
    ∂ₙφ / |∇φ| = 0                                                    on ∂Ω × (0, T],    (5.1)
    φ(·, 0) = φ⁰(·)                                                   in Ω.

In the case of only one level set function the right hand side simplifies to:

    Σ_{i=0}^{1} fᵢ χ_{i,ε}(φ) = (c₀ − I)² − (c₁ − I)².

We now turn to a very special case and assume that the initial data φ⁰ as well as the given image I only depend on x₁. Then φ⁰ has straight isolines with curvature ∇·( ∇φ⁰ / |∇φ⁰| ) = 0. We furthermore restrict ourselves to solutions φ(x₁, t) depending only on x₁ in space and exhibiting a non-vanishing gradient ∇φ(x₁, t) for all t ∈ [0, T]. The curvature of the isolines of such solutions analogously vanishes and (5.1) reads:

    ∂ₜφ / δ_ε(φ) = (c₀ − I)² − (c₁ − I)²   in Ω × (0, T].    (5.2)

If we fix the parameter ε = 1, we obtain with the definition of the regularized delta function from (2.16):

    π ∂ₜφ (1 + φ²) = (c₀ − I)² − (c₁ − I)².    (5.3)

With f := (1/π)((c₀ − I)² − (c₁ − I)²) equation (5.3) reads:

    ∂ₜφ (1 + φ²) = f   in Ω × (0, T].    (5.4)
We now require that the zero isoline level of φ and thus the segments Ω₀ and Ω₁ do not change over time. Then f neither depends on the level set function φ nor on the time t, and the mean values c₀ and c₁ are constants.
Under the above assumptions equation (5.4) is an ordinary differential equation for each fixed x₁ ∈ [−1, 1] with the following real-valued solution

    φ(t) = ( (A(t))^{2/3} − 4 ) / ( 2 (A(t))^{1/3} )    (5.5)

with

    A(t) = 4 ( 3tf + 3φ⁰ + (φ⁰)³ + sqrt( 4 + 9t²f² + 6tf (3φ⁰ + (φ⁰)³) + (3φ⁰ + (φ⁰)³)² ) ).    (5.6)
Figure 5.1: Computation of the experimental order of convergence in the case of a known solution.
Let us now consider suitable initial conditions and an image I we are able to use with Image. We define the original image by a grayscale image consisting of four stripes of constant intensity depending only on x₁ (with values 0 for −1 ≤ x₁ < −0.5, 0.75 for 0 ≤ x₁ < 0.5 and 1 for 0.5 ≤ x₁ ≤ 1) and choose an initial level set function φ⁰ depending only on x₁, both of which fulfill the above assumptions. Figure 5.1 shows the image I and the initial level set function φ⁰.
In order to compute the experimental order of convergence, we numerically compute the discrete solution φ_h with Dirichlet boundary conditions φ_h|_∂Ω = φ on a series of triangulations T_{h_j} with mesh sizes h_j and measure the differences of the solutions on consecutive refinement levels,

    err_j := sup_{t∈[0,T]} ( ∫_Ω |φ_{h_{j−1}} − φ_{h_j}|² )^{1/2} = ‖φ_{h_{j−1}} − φ_{h_j}‖_{L^∞(0,T; L²(Ω))}.

The experimental order of convergence between two refinement levels is then given by

    EOC_j := ln( err_{j−1} / err_j ) / ln( h_{j−1} / h_j ).
The parameters for this computation were chosen as follows: the time step was coupled to the mesh size as h_j², the subdomain solver tolerance was tol_sub = 1.0·10⁻¹² and the Schur complement solver tolerance was tol_schur = 1.0·10⁻⁸; the remaining parameter values were 0.5, 1.0, 1.0, 1.0·10⁻⁸ and 1.0.
                        16 CPUs                          256 CPUs
 j    h_j             err_j          EOC_j             err_j          EOC_j
 3    2.5·10⁻¹        4.330·10⁻²       –                 –              –
 4    1.25·10⁻¹       3.025·10⁻²     5.173·10⁻¹       3.025·10⁻²        –
 5    6.25·10⁻²       2.038·10⁻²     5.698·10⁻¹       2.038·10⁻²     5.698·10⁻¹
 6    3.125·10⁻²      1.390·10⁻²     5.512·10⁻¹       1.390·10⁻²     5.512·10⁻¹
 7    1.562·10⁻²      9.668·10⁻³     5.245·10⁻¹       9.668·10⁻³     5.245·10⁻¹
 8    7.812·10⁻³      6.805·10⁻³     5.066·10⁻¹       6.805·10⁻³     5.066·10⁻¹

Table 5.3: Experimental order of convergence (err_j = ‖φ_{h_{j−1}} − φ_{h_j}‖_{L^∞,L²}). Note that we have not been able to compute the error for refinement level 3 on 256 CPUs because the triangulation consists of exactly 256 simplices in this case and ParMetis did not supply every worker process with a simplex, which is what our implementation requires.
The experimental order of convergence stabilizes around 1/2, which is the same result Fried obtained in [11]. Our segmentation algorithm with domain decomposition parallelization is thus able to reproduce the solutions of the original serial version of the code. Furthermore, there are no differences between the computations performed using 16 and 256 processors.
Note 5.1.1. In addition, we verified the correctness of the domain decomposition code
with computations of the experimental order of convergence for the heat equation and
mean curvature flow.
the very coarse mesh in these regions of the example.
Checkerboard
Figure 5.4(a) shows the original checkerboard image and a corresponding adapted mesh.
For this experiment the solver tolerances were again set to tol_sub = 1.0·10⁻¹² and tol_schur = 1.0·10⁻⁸; the remaining parameter values were 1.0·10⁻², 0.28, 1.0, 1.0·10⁻², 1.0·10⁻⁸ and 255.0.
Figure 5.6: Three steps (t0 = 0, t1 = 0.14 and t2 = 0.28) of the segmentation evolution
for the checkerboard image. The upper row shows the original image with the
interface and the lower row reveals the corresponding segmented images.
Grayscale Gradient
We now turn to a more interesting scenario with a grayscale gradient in figure 5.7(a). The image thus consists of more than two color levels and it is not clear, even for humans, where exactly the interface should be placed in the fading right part. Nevertheless, we expect a sane segmentation algorithm to recognize the circle's left boundary reliably. We used exactly the same parameters as in the previous experiment and obtained the results depicted in figures 5.7(c) and 5.7(d). The interface front immediately moved to the hard edge on the left side and stabilized in the fading part on the right side. The experiment was repeated with different parameter values; the resulting segmented images only differed marginally from the presented one. For higher curvature weights we obtained a slightly rounded interface where it leaves the full circle's boundary.
Figure 5.9: Segmented road sign image using different curvature weights ν ((a) ν₁ = 0.01, (b) ν₂ = 0.1). In every row the left image shows the result at t = 1.0 before adding the second level set function and the right one is the segmentation at t = 2.0 with 2 level set functions and thus four colors.
Large-Scale Image
The next example is a high resolution photograph consisting of 2000 × 2000 pixels.
let n ∈ N be the number of used compute nodes. Then the p = 4n CPUs are assigned to one master and p − 1 worker processes.
Let R_n be the real execution time of the solving process with n nodes. We define the relative speedup by

    S_n := R_1 / R_n.

The efficiency then is defined by

    E_n := S_n / n.

An efficiency close to one indicates an ideal utilization of the processors (linear speedup). Values above one may also occur, for example in the following situations:

- when vectors entirely fit in the processors' caches
- if the interface I, separating the subdomains, suddenly induces a Schur complement system which the Conjugate Gradient method is able to solve faster
- if the partitioning results in a better load-balancing
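As a small worked example with the one- and two-node timings later reported in table 5.13,

    S_2 = R_1 / R_2 = 47.68 s / 14.42 s ≈ 3.3,    E_2 = S_2 / 2 ≈ 1.65,

i.e. an efficiency well above one.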
On the other hand, we expect the absolute speedup, referring to the execution time of
a corresponding serial implementation, to be below one for very low numbers of CPUs,
because of the communication and management overhead of the domain decomposition
implementation.
Let us keep these considerations in mind and turn to benchmarks in the following two
sections.
Small-Sized problem
The described domain decomposition method and its implementation certainly aim at
the solution of large-scale problems with respect to the spatial discretization. Nevertheless, we will show the characteristics of our implementation when applied to a
small-sized problem.
We start off with the segmentation of the coastline (figure 5.10) with all parameters
except for the mesh refinement set as above. The refinement process was stopped earlier
to obtain a coarser mesh. In order to observe the correlation between high local detail
density and a locally refined mesh, figure 5.12(a) shows the mesh as an overlay on the
original image. Figure 5.12(b) presents a partitioning produced with the help of ParMetis.
The adapted mesh clearly shows coarse areas in the upper right part and fine structures in the center and bottom. Note that the partitioning is not based on the geometrical size but on the number of simplices. For example, the upper right partition
(red) covers a larger area than the one in the lower left corner (blue). Building the dual
graph and partitioning the mesh with ParMetis took about 80 milliseconds in this
experiment.
Besides the timing information we also captured additional data such as the condition numbers of the Schur complement matrices and the number of iterations needed by the Schur complement CG solver. For the sake of clarity we will only provide these additional data for the first time step of the computation. The timing, however, was measured for 10 time steps.
 nodes    N_P      N_I     κ(S)    CG iterations   time R_n [s]   speedup S_n   efficiency E_n
 serial     –        –       –          –             40.72            –              –
    1    20,204     318   130.1        41             47.68          1.00           1.00
    2     8,605     692   338.7        59             14.42          3.30           1.65
    4     3,984   1,174   421.8        75              8.34          5.71           1.42
    8     1,904   1,896   278.9        75              8.01          5.95           0.74
   16       921   2,926   207.6        69              8.16          5.84           0.36
   24       603   3,631   271.4        81             10.75          4.43           0.18
   32       446   4,319   312.6        82             17.78          4.42           0.13

Table 5.13: Benchmark for a segmentation on a coarse mesh inducing 60,929 total degrees of freedom: average number of interior degrees of freedom per subdomain N_P, number of interface unknowns N_I, condition number κ(S) and needed CG iterations for the first time step, and timings. The first row gives the execution time of the serial code for comparison.
The runtime of the serial code beat the computation with one cluster node (4 CPUs),
but the runs with two and four nodes revealed a reduction of the execution time. Eight
nodes did not improve the time significantly and more nodes even caused the time to
rise slightly again. Figure 5.15 illustrates the deteriorating performance graphically. As
there was no significant growth of the Schur complement matrix's condition number or of the needed number of CG iterations, we have to investigate the issue further. The cause of the stagnation and decline in efficiency is rather a computational than a mathematical one and can be revealed when analyzing the parallel runtime behavior with the Intel Trace Analyzer. Figure 5.14 shows the timeline of two CG iterations for 16 nodes (64 CPUs).
Figure 5.14: Analysis of two parallel CG iterations on 16 nodes (64 CPUs) with Intels
Trace Analyzer in a timeline view. Each horizontal bar represents one
process, starting with the master process in the first row. Application code
is marked blue and MPI routines including waiting are marked red. Black
lines indicate communication.
The time needed to solve the local subdomain problems almost fell below the time
needed for the distribution of the interface data via MPI. One iteration roughly took
three milliseconds and a few processes were affected by some kind of jitter. Note that
the first 3 worker processes received their data faster than all the rest, which was caused
by the fact that the 4 involved CPUs accessed the same physical memory in one cluster
node and did not need any indirection via the InfiniBand network.
Simply put, this experiment's problem size was too small for the parallel algorithm to
obtain a performance gain from large numbers of CPUs. However, the execution time
is still reduced to one fifth of the serial execution time by employing four nodes (16
CPUs).
Because our implementation aims at the solution of large systems, we will now turn
to a more realistic scenario where parallelization is vitally needed.
Figure 5.15: Scalability limitations of the parallel algorithm with small-sized problems (60,929 degrees of freedom). The panels show the actual versus the ideal linear speedup, the average number of interior unknowns per subdomain, the number of interface unknowns, the Schur complement matrix condition number and the number of CG iterations for the Schur complement system until tolerance 10⁻⁸, each plotted over the number of cluster nodes.
Large-Scale
This benchmark will investigate the behavior for the same image and the same parameters as in the previous experiment, but this time with a very fine triangulation. In areas depicting many details the mesh's simplices will be as small as a pixel (whose size is determined by the overall mesh diameter). The mesh consists of 4,022,596 elements at the end of the refinement process and the corresponding finite element space is defined by 2,012,758 global degrees of freedom. Figure 5.16 shows a partitioning of the mesh into 511 subdomains. Employing ParMetis once again, the partitioning process took 1.7 seconds.
 nodes    N_P       N_I      κ(S)      CG iterations   time R_n [s]   speedup S_n   efficiency E_n
 serial      –         –         –           –           see text           –              –
    1    670,250    2,008    2,473.0        139         11,011.34          1.00           1.00
    2    286,942    4,166    2,458.6        154          6,111.34          1.80           0.90
    4    133,725    6,883    4,258.3        245          3,589.77          3.06           0.76
    8     64,567   11,182    5,520.2        260          1,570.59          7.01           0.87
   16     31,692   16,181    5,759.1        216            644.35         17.08           1.06
   32     15,657   24,289    7,796.5        335            261.81         42.05           1.31
   48     10,382   29,637    9,884.6        370            149.18         73.81           1.53
   64      7,756   35,054    7,357.9        296            105.12        104.75           1.63
   80      6,186   39,342   11,241.2        371            125.95         87.42           1.09
   96      5,143   42,890    6,390.0        333            137.85         79.87           0.83

Table 5.17: Benchmark for a segmentation on a fine mesh with 2,012,758 total degrees of freedom (notation as in table 5.13).
Figure 5.18: (Super-)linear relative speedup with up to 64 nodes (256 CPUs) when solving a large-scale problem. The panels show the speedup, the average number of interior unknowns per subdomain, the number of CG iterations and the remaining quantities as in figure 5.15, each plotted over the number of cluster nodes.
The serial code had to be abandoned for the computations with this problem because of memory exhaustion during the assembly of the system matrix. We observed a superlinear speedup between 16 and 64 cluster nodes (64-256 CPUs), which is owed to the faster solution of the smaller subdomain problems and to cache effects.
When going beyond 64 cluster nodes, performance stagnation and regression began. We confirmed this tendency with up to 128 nodes (512 CPUs) in further experiments. Although the symptoms look similar to the ones observed for the small-sized problem, the cause of the decline is now a different one. Using Intel's Trace Analyzer once again we obtain a different behavior, which is again shown as a parallel timeline in figure 5.19.
Figure 5.19: Timeline for two distributed CG iterations on 16 nodes (64 CPUs) operating
on 16,181 interface unknowns (see figure 5.14 for an explanation of the figure's semantics).
The communication latency no longer was the bottleneck. The processes received
their parts of the interface data and solved their local subdomain problem, where the
latter clearly dominated. Note that the non-blocking communication described in 4.8
can be observed in this figure: Each worker process was able to send its result back to
the master process immediately upon completion of the local computation.
Taking a closer look reveals a load-imbalance between the subdomains. The partitioning strategy used in ParMetis tries to balance the number of simplices between
the processes, but this does not guarantee an equal workload for the solution of the
local subdomain problems. The more processes are involved the more the performance
deteriorates when load-imbalance occurs. For example, if one process needs twice the
time of all other processes, then these processes waste valuable CPU cycles and the
parallel performance stagnates.
A slight load-imbalance was present in almost every experiment we conducted. However, when using larger numbers of processors the imbalance deteriorates due to local
phenomena of the underlying algorithm and the used image. The partitioning of the
64
65
67
68
References
[1] L. Ambrosio, N. Fusco, and D. Pallara. Functions of Bounded Variation and Free Discontinuity Problems. Oxford Mathematical Monographs, 2000.
[2] T. J. Barth, T. F. Chan, and W. Tang. A Parallel Non-Overlapping Domain Decomposition Algorithm for Compressible Fluid Flow Problems on Triangulated Domains. In J. Mandel, C. Farhat, and X.-C. Cai, editors, Tenth International Conference on Domain Decomposition Methods, pages 23-41. AMS, Contemporary Mathematics 218, 1998.
[3] Christoph Börgers. The Neumann-Dirichlet domain decomposition method with inexact solvers on the subdomains. Numerische Mathematik, 55:123-136, 1989.
[4] James H. Bramble, Joseph E. Pasciak, and Apostol T. Vassilev. Analysis of non-overlapping domain decomposition algorithms with inexact solves. Math. Comput., 67(221):1-19, 1998.
[5] Susanne C. Brenner. The Condition Number of the Schur Complement in Domain Decomposition. Numer. Math., 83:187-203, 1998.
[6] Tony F. Chan, Berta Yezrielev Sandberg, and Luminita Aura Vese. Active contours without edges for vector-valued images. Journal of Visual Communication and Image Representation, 11:130-141, 2000.
[7] Tony F. Chan and Luminita Aura Vese. Active Contours without Edges. IEEE Transactions on Image Processing, 10(2):266-277, 2001.
[8] Alexandre Ern and Jean-Luc Guermond. Theory and Practice of Finite Elements. Springer, New York, Berlin, Heidelberg, 2004.
[9] Lawrence Craig Evans. Partial Differential Equations. Graduate Studies in Mathematics. American Mathematical Society, United States of America, 1998.
[10] Michael Fried. Berechnung des Krümmungsflusses von Niveauflächen. Diplomarbeit, Institut für Angewandte Mathematik, Universität Freiburg, 1993.
[11] Michael Fried. Multichannel Image Segmentation Using Adaptive Finite Elements. Computing and Visualization in Science, 12(3):125-135, 2005.
[12] Kai Hertel. Image Processing Algorithms Incorporating Textures for the Segmentation of Satellite Data based upon the Finite Element Method. Diploma thesis, Chair of Applied Mathematics III, Friedrich-Alexander-Universität Erlangen-Nürnberg, March 2009. http://www10.informatik.uni-erlangen.de/~kai/publications/diplomathesis.pdf.
[13] George Karypis and Vipin Kumar. Multilevel Graph Partitioning Schemes. In Proc. 24th Intern. Conf. Par. Proc., III, pages 113-122. CRC Press, 1995.
[14] George Karypis and Vipin Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359-392, 1998.
[15] QingKai Liu, ZeYao Mo, and LinBo Zhang. A parallel adaptive finite-element package based on ALBERTA. Int. J. Comput. Math., 85(12):1793-1805, 2008.
[16] David Mumford and Jayant Shah. Optimal Approximations by Piecewise Smooth Functions and Associated Variational Problems. Communications on Pure and Applied Mathematics, 42:577-685, 1989. Originally published in 1988.
[17] Yousef Saad. Iterative Methods for Sparse Linear Systems, Second Edition. Society for Industrial and Applied Mathematics, April 2003.
[18] Alfred Schmidt and Kunibert G. Siebert. Design of Adaptive Finite Element Software: The Finite Element Toolbox ALBERTA. Lecture Notes in Computational Science and Engineering. Springer, Berlin, Heidelberg, New York, 2005.
[19] Andrea Toselli and Olof Widlund. Domain Decomposition Methods - Algorithms and Theory, volume 34 of Springer Series in Computational Mathematics. Springer, 2004.
André Gaul