YUAN LIU
TRITA-ICT-EX-2011:4
Hybrid Parallel Computation of
OpenFOAM Solver on Multi-Core Cluster
Systems
YUAN LIU
Abstract
Acknowledgement
Contents v
List of Tables x
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Project Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Subject investigated . . . . . . . . . . . . . . . . . . . 3
1.2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.5 Thesis Organization . . . . . . . . . . . . . . . . . . . 4
2 DieselFoam 5
2.1 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 File Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 aachenbomb case directory . . . . . . . . . . . . . . . . . . . . 7
2.3.1 System directory . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Constant directory . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Time directory . . . . . . . . . . . . . . . . . . . . . . 12
2.4 dieselFoam Solvers and UEqn Object . . . . . . . . . . . . . . 12
2.4.1 Directory structure . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Linear Solver . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.3 PISO algorithm . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Velocity equation and Navier-Stokes equations . . . . . 15
2.5 Running program . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Parallel Computing 19
3.1 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Symmetric multiprocessing (SMP) . . . . . . . . . . . 20
3.1.2 SMP cluster . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Parallel programming models . . . . . . . . . . . . . . . . . . 21
3.2.1 Message Passing Interface (MPI) . . . . . . . . . . . . 21
3.2.2 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Hybrid parallel model . . . . . . . . . . . . . . . . . . 22
3.2.4 Implementation . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Comparison of parallel programming models . . . . . . . . . 26
3.3.1 Pros and Cons of OpenMP and MPI . . . . . . . . . . 26
3.3.2 Trade off between OpenMP and MPI . . . . . . . . . . 26
3.3.3 Pros and Cons of hybrid parallel model . . . . . . . . . 28
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 OpenFOAM in parallel 31
4.1 Running in parallel . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Domain decomposition methodology . . . . . . . . . . . . . . 32
4.3 Simple decomposition . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 MPI communication description . . . . . . . . . . . . . . . . . 37
4.4.1 MPI Program structure . . . . . . . . . . . . . . . . . 37
4.4.2 Environment routines . . . . . . . . . . . . . . . . . . . 37
4.4.3 Communication schedule schemes . . . . . . . . . . . . 37
4.4.4 Communication routines . . . . . . . . . . . . . . . . . 42
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Prototype Design 45
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Running platforms . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Performance profiling . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Profiling input . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Collaboration diagram for PBICG . . . . . . . . . . . . . . . . 48
5.6 Profiling result . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.7 Candidate analysis . . . . . . . . . . . . . . . . . . . . . . . . 51
5.8 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.8.1 Task parallelism . . . . . . . . . . . . . . . . . . . . . . 57
5.8.2 Data parallelism . . . . . . . . . . . . . . . . . . . . . 61
5.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6 Evaluation 63
6.1 Test case: Jacobi iterative method . . . . . . . . . . . . . . . . 63
6.2 Performance results . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Bibliography 75
List of Figures
Chapter 1
Introduction
...electromagnetics and the pricing of financial options, etc. [12]. Being free of charge, it clearly reduces cost compared with a commercial package. Another feature of OpenFOAM is that its core technology is based on the C++ programming language. By using its set of C++ modules, programmers can design CFD solver applications in a flexible and extensible manner.
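For illustration, a momentum-type equation written in this equation-mimicking C++ syntax looks roughly as follows (a sketch in the style of the OpenFOAM user guide; U, phi, p, rho and mu denote the usual field objects and are assumptions here, not code taken from this thesis):

solve
(
    fvm::ddt(rho, U)
  + fvm::div(phi, U)
  - fvm::laplacian(mu, U)
 ==
  - fvc::grad(p)
);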
Volvo Technology AB (VTEC) is an innovation, research and development company within the Volvo Group. Its mission is to develop leading products in existing and future technology areas of high importance to Volvo [3]. The CAE department, which uses computer software to support engineering tasks, is interested in applying CFD tools in its daily simulation work in areas such as automobiles, aerospace and boats. Such efforts reduce the hardware cost and the manpower of the simulation process by moving the model simulation from the physical and chemical world into a computer-based virtual world.
In this thesis, a hybrid parallel programming method is used to accelerate the execution of OpenFOAM solvers beyond what is traditionally achieved. The OpenFOAM application is natively based on a single-level, pure MPI model in which one MPI process is executed on each node. To improve its performance and reduce the overhead caused by MPI communication, one good method is to map multi-level parallelism onto the OpenFOAM solver: coarse-grain parallelism is handled by message passing (MPI) among nodes, and fine-grain parallelism is achieved by OpenMP loop-level parallelism inside each node. Moreover, the report compares each parallelism prototype, analyses and benchmarks it against the others, and lays a solid foundation for future work.
1.2.4 Contributions
This project makes the following contributions:
Chapter 2

DieselFoam
2.1 Geometry
The dieselFoam geometry is a block, 0.01 m in width and length and 0.1 m in height, shown in Figure 2.2. It is filled with air. At the top of the model an injector injects n-Heptane (C7H16) into the combustion chamber, where the discrete droplets evaporate and burn. Gas-phase reactions then take place, ranging from a reaction scheme with 5 species and one reaction up to a scheme involving approximately 300 reactions and 56 species [7].
$HOME/OpenFOAM/tutorials
Figure 2.3: Flow chart of the dieselFoam solver (the solver constructs the UEqn, YEqn, hEqn, rhoEqn and k/epsilon objects, calls the fvMatrix equation-solve function for each of them, checks the termination condition, displays and prints the results, and then the program terminates)
aachenbomb
    system
        controlDict
        fvSchemes
        fvSolution
        decomposeParDict
    constant
        ...Properties
        polyMesh
            points
            cells
            faces
            boundary
    0
        epsilon
        ...
solvers
{
    p
    {
        solver          PCG;
        preconditioner  DIC;
        tolerance       1e-06;
        relTol          0;
    }
    U
    {
        solver          PBiCG;
        preconditioner  DILU;
        tolerance       1e-05;
        relTol          0;
    }
}
PISO
{
    nCorrectors                 2;
    nNonOrthogonalCorrectors    0;
    pRefCell                    0;
    pRefValue                   0;
}
ddtSchemes
{
    default         Euler;
}

gradSchemes
{
    default         Gauss linear;
    grad(p)         Gauss linear;
}

divSchemes
{
    default         none;
    div(phi,U)      Gauss linear;
}

laplacianSchemes
{
    default         none;
    laplacian(nu,U) Gauss linear corrected;
    laplacian((1|A(U)),p) Gauss linear corrected;
}

snGradSchemes
{
    default         corrected;
}

fluxRequired
{
    default         no;
    p;
}
newSolver
    newSolver.cpp
    newHeader.h
    Make
        files
        options

The files entry specifies the object files and whether they are compiled into an executable or into a library.
dieselFoam uses the PISO algorithm, whose steps can be divided into:

if (momentumPredictor)
{
    solve(UEqn == -fvc::grad(p));
}
mkdir -p $FOAM_RUN
cp -r $FOAM_TUTORIALS $FOAM_RUN
cd $FOAM_RUN/tutorials/simpleFoam
blockMesh
simpleFoam
paraFoam
2.6 Conclusion
Nowadays, CFD researchers find that even when they increase hardware resources, performance is still limited by the endless demands of large fluid applications. These problems are caused by:
1. Most CFD programs are large-scale in their design and simulation procedure, and these characteristics are growing very quickly.
Therefore, even the most powerful computer sometimes cannot give a satisfactory result, or cannot deliver an acceptable simulation in a cost-efficient manner. Although improving the numerical algorithms used in a program can raise performance to some extent, many studies indicate that parallelization seems to be the only way to significantly improve the efficiency of CFD programs.
The next chapter introduces parallel programming and its programming models.
Chapter 3
Parallel Computing
Parallelism is an old term that appears everywhere in daily life. Looking around, people perform many activities in a parallel manner, such as building construction and automotive manufacturing, all involving a great deal of work done simultaneously. The most familiar example is the assembly line, where the components of a product are worked on at the same time by several workers.

The concept of parallel computing in computer science is to decompose a given task into several small ones and execute or solve them at the same time in order to make the computation faster. With the emergence of large-scale applications and the development of computer system architecture and network technology, parallel programming has become more and more important. By now, parallel programming is mature enough to accelerate many industrial applications, building on previous research results.
In this chapter, we discuss parallel computer architectures in Section 3.1. Section 3.2 briefly describes the characteristics and implementation of three parallel programming models: OpenMP, MPI and the hybrid parallel model. In the last section, a comparison of the different parallel paradigms is given.
3.1 Architectures
With the development of computer architecture, a number of processors can access a single memory in an efficient way. Hence, the concept of the clustered SMP was proposed and has become more and more popular as a parallel architecture. More importantly, an SMP cluster is an excellent platform for the hybrid programming model according to previous research [13]. In this section, we briefly present single
(Figure: in an SMP node, several CPUs, each with a private cache, share one memory over a system bus; in the cluster, several such SMP nodes are connected by a system interconnection.)
project shown in later chapters. Figure 3.2 shows the architecture of a simple
SMP cluster.
3.2.2 OpenMP
OpenMP is an API for shared memory parallel programming. It is an another
parallel programming standard which based on compiler directives for C/C++
or FORTRAN. OpenMP provides a set of directives, library routines and
environment variables. It uses high level abstraction by fork-join model, can
be implemented on shared memory system.
Recently, OpenMP model can be implemented on cluster of distributed
virtual shared memory, but need to be supported by specific compiler, e.g.
Intel Cluster OpenMP. In SMP or DSM (Distributed shared memory) system,
OpenMP thread is allowed to access the global memory address.
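A minimal fork-join example (illustrative only, not taken from the thesis) showing the directive together with two of the library routines:

#include <omp.h>
#include <cstdio>

int main()
{
    // Fork: the master thread spawns a team of threads.
    #pragma omp parallel
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    // Join: the team has synchronized and only the master thread continues.
    return 0;
}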
OpenMP constructs:

#pragma omp parallel for
for (...)
{
    ...
}
Listing 3.2 shows another type of communication scheme: all other threads keep computing while the master thread (or a few threads) performs the communication.
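The thesis listing itself is not reproduced here, but a minimal sketch of this overlap pattern could look as follows (all names such as sendHalo, recvHalo and neighbour are illustrative assumptions; MPI must have been initialized with at least MPI_THREAD_FUNNELED for the master thread to make MPI calls):

#include <mpi.h>

// The master thread exchanges halo data over MPI while the other threads of
// the team keep computing on the interior cells.
void overlapStep(double* sendHalo, double* recvHalo, int haloSize,
                 int neighbour, int nInterior, double* interior)
{
    #pragma omp parallel
    {
        #pragma omp master
        MPI_Sendrecv(sendHalo, haloSize, MPI_DOUBLE, neighbour, 0,
                     recvHalo, haloSize, MPI_DOUBLE, neighbour, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // "nowait" lets the computing threads start immediately instead of
        // waiting for the communication to finish.
        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < nInterior; ++i)
            interior[i] += 1.0;   // placeholder for the real interior update

        #pragma omp barrier       // meet here before the halo data is used
    }
}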
3.2.4 Implementation
Listing 3.3 is pseudo code for the hybrid parallel model. The code uses the standard MPI_Init and MPI_Finalize calls to initialize and finalize the parallel program, as in a general MPI program, and inside the parallel region each process spawns a number of threads.

From Figure 3.4 we can also see that each MPI process spawns several threads under the #pragma omp parallel directive. Thread parallelism only occurs inside the parallel region; the rest of the program still runs in a single thread.
MPI_Init()
MPI_Comm_rank()
MPI_Comm_size()

// MPI communication part
MPI_Send()
MPI_Recv()

omp_set_num_threads(4)
// each process starts four threads
#pragma omp parallel
{
    // OpenMP parallel region: loop-level computation done by the threads
}

MPI_Finalize()
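In a real hybrid code, one would also typically request a thread-support level from the MPI library explicitly; a minimal sketch (not part of the thesis listing) is:

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv)
{
    int provided = 0;
    // MPI_THREAD_FUNNELED: only the master thread will make MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        // computation carried out by the OpenMP threads of this MPI process
    }

    printf("process %d of %d finished\n", rank, size);
    MPI_Finalize();
    return 0;
}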
OpenMP                                            MPI
Shared memory architectures only, which are       Can be used on shared or distributed
more expensive than distributed memory            memory architectures
architectures
Threads communicate through shared                Processes communicate through
memory access                                     memory copies
Through this comparison, we find that both OpenMP and MPI have their drawbacks:
3.4 Conclusion
In order to get better performance, a hybrid programming model for the OpenFOAM application on an SMP cluster is investigated in this thesis. The method is based on a combination of coarse-grain and fine-grain parallelism: domain decomposition is handled by MPI message passing between nodes, and time-consuming loop iterations are parallelized by OpenMP threads.

Compared to the hybrid parallel model, another popular programming model on SMP clusters is the pure MPI implementation. Some previous research shows that MPI performs better in certain situations [10] [18]. Since the native parallel programming model of OpenFOAM is a pure MPI implementation, this analysis is necessary and will be presented in the next chapter. Finally, we compare the two implementations in the evaluation chapter.
Chapter 4
OpenFOAM in parallel
# machine file
linux01
linux02
linux03
    n       (2 2 1);
    delta   0.001;
}
we also make the available nodes match the number of mesh points in the x, y and z directions exactly.
The simple decomposition algorithm uses two data structures, one called pointIndices and the other called finalDecomp.
2. After step one, the nodes in the x direction are sorted. Then the sorted nodes are mapped to the corresponding processors in the finalDecomp list. Since processor 0 and processor 1 are available in the x direction, each processor gets 4 points. This step is shown in Figure 4.2.

3. Repeat steps 1 and 2 for the y-axis. This step is shown in Figure 4.3.

4. Combining the x and y axes gives the x-y result shown in Figure 4.4, which can be obtained from Equation 4.1.

5. Repeat steps 1 and 2 for the z-axis. This step is shown in Figure 4.5.

6. Combining the x, y and z axes gives the x-y-z result shown in Figure 4.6, which can be obtained from Equation 4.2. A small code sketch of this combined mapping is given after this list.
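As a rough illustration of this combined mapping (a sketch only; the actual OpenFOAM simpleGeomDecomp source is not reproduced here), a point whose sorted position falls into bin ix, iy, iz along the three directions, with n = (nx ny nz) processors per direction, ends up on the following processor:

// ix, iy, iz: the point's bin index in the x, y and z directions
// nx, ny:     number of processors requested in the x and y directions
int processorOfPoint(int ix, int iy, int iz, int nx, int ny)
{
    // combine the per-direction assignments into one processor index
    return ix + nx*(iy + ny*iz);
}

For the example above with n = (2 2 1), a point in bin (1, 1, 0) is assigned to processor 1 + 2*(1 + 2*0) = 3.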
This simple example only considers the case where the amounts are equal, but in practice the number of points often does not divide evenly among the processors. Consider 9 points with the same processors available as in the previous example. In such a situation, each processor in the x, y or z direction still gets 4 points, but processor 1 on the x-axis has to take one more point. If the hierarchical decomposition specifies the order yxz, then processor 1 on the y-axis gets the extra work instead. The result can be calculated from Equation 4.3.
#include <mpi.h>
(Figures: structure of an MPI program, ending with program termination; in the linear communication scheme, all processes communicate directly with the master process.)
Figure 4.9: When k = 0, the tree contains only the root R; we record it as B0
Step one:
Node 0 receives data from node 1.
Node 2 receives data from node 3.
Figure 4.10: When k = 1, the binomial tree contains the current root R1 and the existing sub-tree B0; we record it as B1
Figure 4.11: When k = 2, the binomial tree contains the current root R2 and the existing sub-trees B0 and B1; we record it as B2
Step two:
Node 0 receives data from node 2.
Node 4 receives data from node 6.
Step three:
Node 0 receives data from node 4.
Figure 4.12: When k = 3, the binomial tree contains the current root R3 and the existing sub-trees B0, B1 and B2
This works because the data communication in each step involves only peer-to-peer exchanges, which reduces the communication overhead. The measurements show that the tree-based communication scheme is more efficient than the linear scheme as the number of nodes increases; furthermore, when the number of messages is large but their sizes are relatively small, communication benefits from the binomial tree algorithm [15].
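A minimal MPI sketch of this binomial-tree gather pattern (illustrative only, not the OpenFOAM Pstream implementation; the combining operation is simply a sum here):

#include <mpi.h>

// At step k (step = 2^k), every rank that is a multiple of 2*step receives
// from the partner step ranks above it, so the data reach rank 0 in
// ceil(log2(size)) steps, matching the steps described above.
void binomialGather(double& value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2)
    {
        if (rank % (2*step) == 0)
        {
            int partner = rank + step;
            if (partner < size)
            {
                double received;
                MPI_Recv(&received, 1, MPI_DOUBLE, partner, 0, comm,
                         MPI_STATUS_IGNORE);
                value += received;   // combine, here a simple sum
            }
        }
        else
        {
            MPI_Send(&value, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;                   // this rank has handed off its data
        }
    }
}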
These are the same as the standard gather and scatter, but the data types may be diverse and more operators are allowed, such as binary operators and the assignment operator.

There are variants, combineGather and combineScatter, which can be applied to list and map structures.
There are two types of reduction for each gather/scatter type. Since the real reduction work is carried out inside Pstream::gather, the reduction simply calls Pstream::gather and Pstream::scatter in sequence.
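As an illustration of this gather-then-scatter structure (a sketch in plain MPI rather than OpenFOAM's Pstream; all names are assumptions):

#include <mpi.h>
#include <vector>

double reduceSum(double value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // gather: collect all values on the master process
    std::vector<double> all(size);
    MPI_Gather(&value, 1, MPI_DOUBLE, all.data(), 1, MPI_DOUBLE, 0, comm);

    // the real reduction work happens on the master
    double sum = 0.0;
    if (rank == 0)
        for (double v : all) sum += v;

    // scatter: send the combined value back to every process
    MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, comm);
    return sum;
}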
4.5 Conclusion
The native MPI message passing model used in OpenFOAM is certainly a workable solution. In practice, however, most experiments show that MPI implementations require a large amount of "halo" message communication between SMP nodes, which does not seem to be the most efficient parallelism approach for an SMP cluster. Shared memory parallelism such as OpenMP, in theory, provides a good solution for parallelism inside a node. This leads to the idea that, by integrating OpenMP thread parallelism constructs into the native MPI model and replacing the extra message exchange with shared memory access, the application may achieve better performance than before.
Next, we discuss the hybrid parallel prototype of dieselFoam. The next chapter covers time profiling and candidate analysis.
Chapter 5
Prototype Design
The parallelization follows a top-down scheme: it starts from the top level of the whole program and then parallelizes each sub-domain of the mesh. The structure of the prototype is coarse-grain parallelism between different sub-domains using MPI, and fine-grain loop-level parallelism inside the solver functions.

In the previous chapters we described the basic structure of an OpenFOAM program, the parallel programming methods and the domain decomposition methodology in OpenFOAM. Now it is time to identify the places in the program that allow us to accelerate it, and then to apply hybrid techniques to this code to obtain a faster execution time for the physical simulation.

This chapter focuses on the profiling of dieselFoam; based on its input and output, the results and parallelization candidates are discussed in the following sections.
5.1 Methodology
This section is an overview of the next two chapters. It presents the methodology by which the prototype is implemented and analyzed. The stages are designed to be machine independent and can be summarized as follows:
1. Sequential implementation
Parallel programming usually starts from a correct sequential implementation. Since this step is already done, what we do here is record the correct result in order to verify the parallel implementation.
2. Profiling
The profiling work identifies the bottlenecks of the application: which parts consume the most time and where the computationally intensive parts are. The following steps build on this.
3. Task-parallelism part
Some functions in the program have a number of obviously concurrent stages. These parts can easily be handled by OpenMP task parallelism. We identify the sequence of execution and restructure the program logic if necessary.
4. Data-parallelism part

6. Prototype simulation
The parallelized code runs on the SMP cluster. An analysis based on a comparison of the results is provided, showing the overhead of communication and thread synchronization. This step also includes some suggestions for future work.
(Figure: flow of the parallel solver: sub-domain initialization, boundary exchange, convergence check, termination.)
Amul(Tmul)
Listing 5.1: Amul function code
1 2 3
6 5 4
7 8 9
From the mesh above, we see that cell 1 influences cells 2 and 6, cell 2 influences cells 1, 3 and 5, and so on. Hence, we can obtain a matrix of coefficients like the one below:
D N . . . N . . .
O D N . N . . . .
. O D N . . . . .
. . O D N . . . N
. O . O D N . N .
O . . . O D N . .
. . . . . O D N .
. . . . O . O D N
. . . O . . . O D

(dots denote zero entries)
        | 1 0 5 0 |
M =     | 0 2 0 6 |        (5.1)
        | 7 0 3 0 |
        | 0 8 0 4 |
        | d(0)  0     u(0)  0    |
        | 0     d(1)  0     u(1) |        (5.2)
        | l(0)  0     d(2)  0    |
        | 0     l(1)  0     d(3) |
| 1 0 5 0 |   | b1 |     | c1 |
| 0 2 0 6 |   | b2 |     | c2 |
| 7 0 3 0 |   | b3 |  =  | c3 |        (5.3)
| 0 8 0 4 |   | b4 |     | c4 |
        | 1 0 5 0 |
M =     | 0 2 6 0 |        (5.4)
        | 7 8 3 0 |
        | 0 0 0 4 |
From the above equations we can see that these iterations cannot execute at the same time, since iteration 0 and iteration 1 both write data into TpsiPtr[2]. Iterations 0 and 1 clearly exhibit a data race; sometimes the indices happen to be favourable, but this cannot be guaranteed. In short, the Amul function cannot simply have an OpenMP directive applied to it and must be executed in order.
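A hedged sketch of this kind of face loop (not the original OpenFOAM source, which Listing 5.1 refers to; lPtr/uPtr hold the owner/neighbour cell of each face, lowerPtr/upperPtr the off-diagonal coefficients, psiPtr the input and ApsiPtr the result vector):

void amulSketch(int nFaces, const int* lPtr, const int* uPtr,
                const double* lowerPtr, const double* upperPtr,
                const double* psiPtr, double* ApsiPtr)
{
    for (int face = 0; face < nFaces; ++face)
    {
        // Different faces share cell indices, so a naive
        // "#pragma omp parallel for" here lets two iterations update the
        // same ApsiPtr entry at once: the data race described above.
        ApsiPtr[uPtr[face]] += lowerPtr[face]*psiPtr[lPtr[face]];
        ApsiPtr[lPtr[face]] += upperPtr[face]*psiPtr[uPtr[face]];
    }
}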
Precondition
Listing 5.2: Precondition function code
register label sface;
fvm::ddt
fvm::ddt is one of the discretization schemes in OpenFOAM's equation category; other schemes such as fvm::div and fvm::grad play the same role, and they can be specified in the fvSchemes file. fvm::ddt and fvm::div are also the most time-consuming functions in the Navier-Stokes equations of our project target.

After code analysis, we found that such functions are difficult to parallelize, since they are built in a class-intensive manner. Also, no obvious for-loop iterations, concurrently executable stages or tree-like traversals can be found in them.
5.8 OpenMP
In the shared memory programming model, OpenMP uses the fork-join model to execute each partial sub-domain in parallel. This can be achieved by either task parallelism or data parallelism. From the preceding analysis of the dieselFoam source code, the time-consuming parts that can be parallelized appear mainly in two places:
This section discusses these two parts separately, according to the different parallel models.
#pragma omp parallel sections
{
    #pragma omp section
    {
        fvScalarMatrix* T1 = new fvScalarMatrix(fvm::ddt(rho, Yi)());
    }
    #pragma omp section
    {
        fvScalarMatrix* T2 = new fvScalarMatrix(mvConvection->fvmDiv(phi, Yi));
    }
    #pragma omp section
    {
        fvScalarMatrix* T3 = new fvScalarMatrix(fvm::laplacian(turbulence->muEff(), Yi)());
    }
}
...
solve(*T1 + *T2 == *T3);
The idea is to insert each term of the equation into a general matrix object so that these object construction calls can be executed concurrently; after all object values have been returned, the solve subroutine is called.
Based on the result of that thesis, we can go further and re-implement this part with the OpenMP 3.0 tasking construct.

The tasking directive specifies explicit tasks. A task may be executed immediately by the encountering thread or deferred and picked up later by another thread of the team: a task pool with its data environment is created by one thread, and the other threads can retrieve work from it. The whole process is optimized by the compiler. Another advantage of the tasking construct is that it is highly composable: it can be nested inside other tasks or work-sharing constructs. For instance, in some other solver applications the equation solve may sit inside a for-loop iteration; there the tasking construct can be more efficient than the sections construct.
With the tasking construct, the parallelized code has almost the same structure as the sections version in this example, because the outer loop of the Navier-Stokes equations cannot be parallelized with tasks due to its time-dependent nature:

#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp task
        {
            *T1 = new fvScalarMatrix(fvm::ddt(rho, Yi)());
        }
        // ... tasks creating *T2 and *T3 ...
        #pragma omp taskwait
        solve(*T1 + *T2 == *T3);
    }
}
fvScalarMatrix Eqn
(
    fvm::ddt(p)
  + fvm::div(id, p)
 ==
    rho
  + U
);

if (U.predictor())
{
    Eqn.solve();
}
#pragma omp taskwait
if (flag)
{
for loop
Besides, if the loop contains a nested loop, one thing should be mentioned: if the for construct is applied to the inner loop, the overhead grows, since the construct incurs a considerable cost in every outer-loop iteration, together with synchronization overhead. The efficient way is therefore to apply the loop construct to the outermost loop.
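A small sketch contrasting the two placements (illustrative only; the array names are assumptions):

#include <vector>

// a, b, c are ny*nx arrays stored row-major.
void addInner(std::vector<double>& a, const std::vector<double>& b,
              const std::vector<double>& c, int nx, int ny)
{
    // Inefficient: the parallel-for construct is entered ny times, paying
    // fork/join and scheduling overhead in every outer iteration.
    for (int j = 0; j < ny; ++j)
    {
        #pragma omp parallel for
        for (int i = 0; i < nx; ++i)
            a[j*nx + i] = b[j*nx + i] + c[j*nx + i];
    }
}

void addOuter(std::vector<double>& a, const std::vector<double>& b,
              const std::vector<double>& c, int nx, int ny)
{
    // Preferred: one parallel region around the outermost loop.
    #pragma omp parallel for
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
            a[j*nx + i] = b[j*nx + i] + c[j*nx + i];
}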
5.9 Conclusion
After the profiling experiments and the prototype implementation, we found that the majority of the time is spent in the linear equation calculations. In most programs, loop iterations in these equations take up a large proportion of the time. After carefully investigating the read-write dependencies, we can add work-sharing constructs to them in order to obtain higher performance. Besides, we can also leverage the OpenMP 3.0 tasking feature, which can parallelize situations that were almost impossible or performed badly with OpenMP 2.5. By combining work-sharing and tasking constructs, we can get better performance when the number of elements or loop iterations is unbalanced.
Chapter 6
Evaluation
In the previous chapters we discussed the function analysis and the prototype implementation. This chapter continues with the performance analysis. First, in Section 6.1, a Jacobi iterative method case study is presented on the Linux SMP cluster, which is the same platform used for the prototype performance study. Then, the performance results of the PBICG solver prototype are given. Last but not least, in Section 6.3, the comparison between pure OpenMP, MPI and the hybrid parallel model is discussed. All performance studies are conducted on a small SMP cluster in which each SMP node contains two sockets of quad-core Intel Xeon CPUs; the compiler environment is GCC 4.3 with OpenMPI.
Speedup
In the simplest terms, the criterion for evaluating a parallel program is the reduction in time. Hence, we use the straightforward measure given by the ratio of the execution time of the serial run to the execution time of the parallel run on the cluster. The speedup is given by Equation 6.1:

S = T_S / T_P        (6.1)
S = 1 / ((1 - f) + f/s)        (6.3)

where f is the fraction of the program that can be parallelized and s is the number of processors. Because of the two properties above, Amdahl's law bounds the performance achievable in reality. As the serial proportion increases, the acceleration obtained becomes much smaller than the price paid. For example, if the serial portion of the code is twenty percent, the maximum speedup is five no matter how many processors are used. This is why the applicability of parallel computing can be limited: it is very useful when the number of processors is relatively small, but as the scale increases the benefit begins to fall.
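Taking Equation 6.3 to its limit (with f read as the parallelizable fraction and s as the number of processors, an assumption consistent with the reconstruction above) gives the bound used in the example:

\[
S_{\max} \;=\; \lim_{s \to \infty} \frac{1}{(1-f) + f/s} \;=\; \frac{1}{1-f},
\qquad f = 0.8 \;\Rightarrow\; S_{\max} = \frac{1}{0.2} = 5 .
\]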
Efficiency
The efficiency of a parallel program measures how well the computational work is shared among the N processors, where T_S is the execution time of the sequential program and T_P is the execution time on the cluster. Equation 6.4 is given as follows:

E = T_S / (n * T_P)        (6.4)

or

E = S / n        (6.5)  [2]
The results of the case study are shown in Figure 6.1, which compares the speedup of the two programming models. As we can see, the performance of the hybrid parallel model is clearly higher than that of the pure MPI model.

With the hybrid programming model, the Jacobi iterative kernel obtains a higher computation-to-communication ratio than before. Hence, as the efficiency equation above suggests, the lower communication overhead leads to higher efficiency.
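For reference, a minimal sketch of one such hybrid Jacobi sweep (illustrative only, not the thesis's test code; the grid is split into horizontal blocks, one per MPI process, up/down may be MPI_PROC_NULL at the physical boundaries, and all names are assumptions):

#include <mpi.h>
#include <vector>

// u and unew hold (nyLocal + 2) rows of nx points each; rows 0 and
// nyLocal + 1 are halo rows shared with the neighbouring processes.
void jacobiSweep(std::vector<double>& u, std::vector<double>& unew,
                 int nx, int nyLocal, int up, int down, MPI_Comm comm)
{
    // coarse grain: exchange halo rows with the neighbouring MPI processes
    MPI_Sendrecv(&u[1*nx],           nx, MPI_DOUBLE, up,   0,
                 &u[(nyLocal+1)*nx], nx, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[nyLocal*nx],     nx, MPI_DOUBLE, down, 1,
                 &u[0],              nx, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    // fine grain: OpenMP loop-level parallelism over the local rows
    #pragma omp parallel for
    for (int j = 1; j <= nyLocal; ++j)
        for (int i = 1; i < nx - 1; ++i)
            unew[j*nx + i] = 0.25*(u[j*nx + i - 1] + u[j*nx + i + 1]
                                 + u[(j - 1)*nx + i] + u[(j + 1)*nx + i]);
}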
Figure 6.1: Comparison of speedup between the hybrid parallel model and MPI processes on the Jacobi iterative model (speedup plotted against processes * threads, with the ideal model shown for reference)
6.3 Discussion
Based on the benchmarks obtained in the evaluation, two comparisons are presented: OpenMP versus MPI, and MPI versus the hybrid parallel model. Since the OpenMP implementation is assessed on a single computational node while MPI and the hybrid parallel model are tested on the Linux cluster, it is more meaningful to compare them separately.
Figure 6.2: Comparison of speedup between OpenMP threads and MPI processes on the Preconditioner function (speedup plotted against processes * threads, with the ideal model shown for reference)
Figure 6.3: Comparison of speedup between MPI processes and the hybrid parallel model (2 OpenMP threads * 4 MPI processes) with different mapping topologies on the PBICG solver (speedup plotted against processes * threads, with the ideal model shown for reference)
job to the Volvo HPC cluster. When the serial fraction increases, the performance gained by the OpenMP constructs begins to fall, because OpenMP is limited by the ratio between the serial part and the potentially parallelizable part. Moreover, if the test object contains a large data set and complex tree-based function calls, then, unlike in our test results, OpenMP gives considerable overhead created by synchronization and memory consumption [5] [19], while the HPC cluster improves the MPI performance through cache effects.
(Figure: speedup of the hybrid parallel model (8 OpenMP threads * 2 MPI processes) compared with the ideal model, plotted against processes * threads.)
Although the difference between the two is very small, theory suggests that, compared with one MPI process per SMP node, creating several MPI processes per SMP node produces unnecessary intra-node communication [10] [9]. Since MPI implementations require data copying for intra-node communication, this wastes memory bandwidth and reduces performance even though intra-node bandwidth is higher than inter-node bandwidth. In other words, the balance between the number of MPI processes and OpenMP threads depends on the overlap of computation and communication. In our case, fewer threads cause more intra-node communication overhead from the exchange of shared domains.

At the same time, we should notice that more threads also cause more overhead for thread creation, destruction and synchronization. The created threads should be given a large amount of computational work, otherwise they may reduce performance. This approach is therefore only suitable for computationally intensive programs.
6.4 Conclusion
Based on our performance results, compared to two MPI processes deployed on two SMP cluster nodes, the hybrid parallelism obtained by adding 8 OpenMP threads on each node increases the speedup of the PBICG solve function by 604%. According to the time scale of the UEqn object in Table 6.5 and Amdahl's law, if the original time is taken as 1, we can predict the running time of the whole UEqn object as given by Equation 6.6.
Chapter 7

Summary of Conclusion and Future Work
...processor. In other words, this concerns how the domain is decomposed onto each node. More sub-domains lead to communication overhead and reduce the efficiency of the OpenMP constructs. Fewer sub-domains relieve the communication overhead problem, but then the program cannot fully utilize the cluster and more computational work piles up on each node. Moreover, a variety of problems such as load imbalance and synchronization overhead also need to be considered. For this reason, how to choose the optimal ratio between MPI processes and OpenMP threads, also known as the mapping strategy, is an important subject for future study.
Hence, based on this thesis work, more code analysis should be carried out in order to find the best domain decomposition approach and the best mapping topology between application and hardware. From the conclusions of the previous chapter, a large part of the dieselFoam classes and functions still shows a speedup of one, or close to one; all of them can be targets of future research. In addition, optimizations such as cache improvements and OpenMP loop scheduling should also be taken into account in the future, so that better load balance can be obtained.
At present, the hybrid parallel model is a hot topic in accelerating CFD problems. This thesis shows that the PBICG solver, originally designed for MPI only and scaling up to eight processes, can achieve excellent acceleration by adding OpenMP constructs, and it proposes a reasonable approach for parallelizing OpenFOAM applications. To gain further speedup of the running time, more work should be carried out on the proposals suggested above; in this way, more performance improvement can be achieved.
Bibliography
[1] http://universe-review.ca/R13-10-NSeqs.htm.
[2] http://www.clear.rice.edu/comp422/homework/parallelefficiency.html.
[3] http://www.volvo.com.
[10] Gabriele Jost, Georg Hager, and Rolf Rabenseifner. Communication characteristics and hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. 2009 Parallel, Distributed and Network-based Processing, pages 427-436, 2009.
[12] Henry Weller, Chris Greenshields, Mattijs Janssens, Andy Heather, Sergio Ferraris, Graham Macpherson, Helene Blanchonnet, and Jenya Collings. OpenFOAM, The Open Source CFD Toolbox - User Guide. OpenCFD, 1.7.1 edition, August 2010.
[13] Gabriele Jost, Haoqiang Jin, Dieter an Mey, and Ferhat F. Hatay. Comparing the OpenMP, MPI, and hybrid programming paradigms on an SMP cluster. NAS Technical Report NAS-03-019, page 10, November 2003.
[14] Mark D. Hill and Michael R. Marty. Amdahl's law in the multicore era. IEEE Computer, pages 33-38, July 2008.
[18] Rolf Rabenseifner, Georg Hager, and Gabriele Jost. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. 2009 Parallel, Distributed and Network-based Processing, pages 427-436, 2009.