Parallel and Distributed Programming On Low Latency Clusters

Parallel and Distributed Programming on Low Latency Clusters
BY
VITTORIO GIOVARA
B.Sc. (Politecnico di Torino) 2007
THESIS
Submitted as a partial fulfillment of the requirements

for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Chicago, 2010
Chicago, Illinois
To my mother,
without whose continuous love
and support I would never have made it.
iii
ACKNOWLEDGMENTS
I want to thank all my family, my mother Silvana, my grandmother Nenna and my dear
Tanino who help me with love and support every day of my life.
Then I would like to thank all the faculty members that assisted me with this project, in
particular professor Bartolomeo Montrucchio and professor Carlo Ragusa for all the time spent
with me trying to make the software run, and researcher Fabio Freschi for giving me useful
suggestions during development.
Finally I would like to thank all my friends that were near me during these years, Al-
berto Grand, whose patience and kindness towards me are really extraordinary, and Salvatore
Campione, who is an encouraging model for my studies.
V. G.
iv
TABLE OF CONTENTS
CHAPTER PAGE
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Evolution of parallel and distributed systems . . . . . . . . . . 1
1.2 Computer architecture classification . . . . . . . . . . . . . . . 4
1.3 Thesis Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Parallel and distributed application developing . . . . . . . . . 8
2.2 Technological requirements . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 SMP processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 GPGPU computing . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4 NUMA machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.5 Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Scientific software advance . . . . . . . . . . . . . . . . . . . . . 14
3 TECHNOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Parallel applications with OpenMP . . . . . . . . . . . . . . . . 16
3.1.1 Amdahl’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.2 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2.1 Sequential program with OpenMP enhancements . . . . . . . . 22
3.1.2.2 OpenMP schedulers performance . . . . . . . . . . . . . . . . . 24
3.1.2.2.1 Static Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2.2.2 Dynamic Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2.2.3 Guided Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2.3 OpenMP enhancement results . . . . . . . . . . . . . . . . . . . 27
3.2 Infiniband . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Distributed execution with MPI . . . . . . . . . . . . . . . . . . 29
3.3.1 MPI over Infiniband . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2.1 Single message over Infiniband with MPI . . . . . . . . . . . . 31
3.3.2.2 Multiple messages over Infiniband with MPI . . . . . . . . . . 33
3.3.2.3 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Code Flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Test Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
v
TABLE OF CONTENTS (Continued)
CHAPTER PAGE
4.4 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Compiler optimizations . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Native switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.2 Loop unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5.3 IEEE compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5.4 Library Striping . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 General Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Hardware Support . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Applied Directives . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 MPI Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.2 DO directive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.3 REDUCTION directive . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.4 Avoiding data dependency . . . . . . . . . . . . . . . . . . . . . 52
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Reduced test case . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.2 Final test case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Appendix D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
CITED LITERATURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
vi
LIST OF TABLES
TABLE PAGE
I MAXIMUM DATA THROUGHPUT IN DIFFERENT CONFIGU-

RATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
II MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY . . . . . 34
III PARTIAL RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
IV FINAL RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
V FUNCTION RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
vii
LIST OF FIGURES
FIGURE PAGE
1 Approach levels for parallelization . . . . . . . . . . . . . . . . . . . . . . . 3
2 Classification scheme of computer architecture classification . . . . . . . 5
3 Image showing the tree splitting procedure of a sequential task . . . . . 17
4 Graph plotting of the theoretical curve from Amdahl’s Law . . . . . . . 19
5 Graph plotting of Amdahl’s Law for multiprocessors . . . . . . . . . . . 20
6 Performance overview of an OpenMP threaded program . . . . . . . . . 23
7 OpenMP static scheduler performance chart . . . . . . . . . . . . . . . . 24
8 OpenMP dynamic scheduler performance chart . . . . . . . . . . . . . . . 25
9 OpenMP guided scheduler performance chart . . . . . . . . . . . . . . . . 26
10 OpenMP scheduler overview . . . . . . . . . . . . . . . . . . . . . . . . . . 27
11 Time v. size for a single message . . . . . . . . . . . . . . . . . . . . . . . 32
12 Time v. size for 1024 consecutive messages . . . . . . . . . . . . . . . . . 33
13 Flowchart of the main functions implementated in the code . . . . . . . 38
14 Standard problem #4 representation . . . . . . . . . . . . . . . . . . . . . 39
15 S state field representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
16 Call graph scheme of the target software . . . . . . . . . . . . . . . . . . . 42
17 Implementation scheme overview . . . . . . . . . . . . . . . . . . . . . . . 46
viii
LIST OF ABBREVIATIONS
API Application Programming Interface
SMP Symmetric Multi-Processing
OpenMP Open Multi-Processing
MPI Message Passing Interface
IPC Inter Process Communication
PML Point-to-point Messaging Layer
BTL Byte Transfer Layer
SISD Single Instruction Single Data
SIMD Single Instruction Multiple Data
MISD Multiple Instructions Single Data
MIMD Multiple Instructions Multiple Data
SPMP Single Program Multiple Data
MPMD Multiple Program Multiple Data
SSE Streaming SIMD Extensions
SSSE3 Supplemental Streaming SIMD Extensions 3
UMA Uniform Memory Access
NUMA Non-Uniform Memory Access

ix
LIST OF ABBREVIATIONS (Continued)
GPU Graphics Processing Units
GPGPU General-Purpose computing on Graphics Process-
ing Units
ECC Error-Correcting Code
LLG Landau-Liftshitz-Gilbert equation
x
SUMMARY
The goal of this thesis is to increase performance and data throughput of Sally3D, an electro-
magnetic field analyzer and micromagnetic modeler for nanomagnets, developed at “Politecnico
di Torino” by the Electrical Engineering department.
This target has been achieved by means of open standards, such as OpenMP and MPI, that
offer robust parallel programming paradigm and an efficient message passing API; in order to
reduce latency in message passing between the two machines, a point-to-point Infiniband link
has been implemented.
Results will be provided, showing that it is possible to achieve a 80% speed improvement
thanks to optimized code, OpenMP multithreading and MPI communication. The used hard-
ware consists of two computers with two quad-core Intel Xeon processors, running at 2.5 GHz,
supplied with 32 GB of RAM and a 20 Gb/s Infiniband network card.
xi
CHAPTER 1
INTRODUCTION
1.1 Evolution of parallel and distributed systems
Until some decades ago computer applications were written in a sequential style in which
the instructions were executed in a fixed order; the programs relied on a single processing unit
and the throughput was dependent on the processor speed.
Nowadays however the technological trend is to control processor frequency and voltage in
order to consume less power and generate less heat and in this modern architecture sequential
programming is not effective. For this reason a new execution paradigm has been exploited:
parallel programming.
Parallel computing is a simultaneous execution of operations at different levels: the most
widely used form of parallelism are bit-level, augmenting the bit size of words, instruction-
level, exploiting instruction pipelines in processor architectures, loop-level, distributing data
independent instructions in a loop among different cores, and task-level, using complete threads
distribution among the cores.
In order to be able to use parallel applications, hardware support must be present. There are
many kinds of parallel-oriented computers like multi-core, single processor with many processing
units, symmetric multiprocessing, a machine with more than one (multicore) processor, cluster
and grid computing, closely coupled computers connected with high-end networks, and finally
1
2
graphics processing units which are used for general purpose computation and are suited for
linear and array operations.
On the other hand parallel applications bring some drawbacks at different levels: manually
programming threads and concurrent processes is a difficult task, as data dependency must be
carefully handled, and poor programming styles may lead to performance degradation. More-
over in a parallel environment several problems are introduced, such as deadlock or starvation,
in which execution cannot continue due to resource dependency conflicts.
Subsequently there has been an increasingly research effort to circumvent the difficulties
of parallel programming, trying to achieve the automatic parallelization from the compiler.
However complete automatic parallelization is a very complex operation requiring computa-
tional power that has not yet been reached; for this reasons several other approaches have been
proposed.
A quite simple and somewhat effective technology is loop unrolling activated by proper
compiler optimizations; instead of translating a loop into a sequence of operations followed by
a jump, the cycle is transformed in a completely sequential program, preventing a lot of jumps
and processor flushes. This is quite beneficial for pipelined processors that present a high
overhead for jump operations, but there is an increased code size proportional to the dimension
of the loop and there is still exponential complexity in unrolling very large cycles.
A more effective way was introduced a few years ago in which the programmers could insert
hints as compiler directives: in this way it is possible to define sections of code that can be safely
parallelized, exploiting the full capabilities of multicore processors. The interaction level in this
3
methodology is more advanced with respect to loop unrolling as it requires deeper knowledge of
the program and of dependency between variables; however even limited insertion of compiler
directives has a major effect on parallelization and program throughput.
The next figure (Figure 1) shows different parallelization methodology and in-depth level
approach; as it may seem obvious, full parallelization is fully achieved when it is set up as a
goal during a program design, but it is possible to adapt the project during development at
different stages, each requiring an action of different difficulty.
Figure 1. Approach levels for parallelization

4
1.2 Computer architecture classification
As soon as parallel computation theory began to gain popularity, there was a shift in
computer architecture design and a precise classification was needed. From a single processor
model that operates on a single data stream, it was possible to consider multiple or single
instructions operating on multiple or single data; representation of each classification is gathered
in the Flynn’s taxonomy:
SISD computers are traditional machines with a single processor operating on a single instruc-
tion (or data) stream, often stored in a single memory. This is the oldest architecture
design and was the leading model in computer markets until a decade ago, when the first
MMX extension was added to Intel processors.
SIMD is the general modern architecture commonly found in current processors in the form
of SSE, Altivec and VIS1 instructions among others; most recently GPUs have started
to exceed this paradigm with emphasis on vectorial parallelization. Multimedia opera-
tions are the prime beneficiaries for this application as well as cryptography and data
compression.
MISD architecture is an uncommon one as there is no performance benefit from this design,
but it is often found in mission critical applications, in which a dependable system must
be developed. As a matter of fact operating on single data with multiple identical in-
1
Visual Instruction Set, technology present in SPARC processors.
5
structions may lead to error detection and error correction with means of hardware and
time redundancy.
MIMD systems are suited for computer clusters in which a shared or distributed memory is
used; processors may function asynchronously and independently. Parallelism is achieved
because at any time computers may be executing different instructions on different data.
Figure 2. Classification scheme of computer architecture classification

6
There might be some other classification for the MIMD class, in which the concept of
“instruction” is extended to the notion of “program”:
SPMD multi processors execute the same program at the same time, but at independent
points in the code while working on the different data;
MPMD implementation of a client/server model in which a master feeds other nodes with
data and coordinates the workload distribution; so each node executes a different set of
programs on different data and reports its result to the master.
1.3 Thesis Contents
In this thesis it is described how to make use of such levels of parallelization directives for
a completely serial numerical program, in order to increase computational performance over a
distributed and parallel environment. For this reason a MIMD system will be exploited.
The program consists in an equation solver written in FORTRAN language adapt for com-
putation of electromagnetic field analysis, with high level plotter resolution. Since the program
is already provided, it is not possible to abstract to a very high level methodology; for this
reason what has been selected for parallelization technology is OpenMP which offers a set of
compiler directives to extend sequential sections of code on every core of the machine.
As for the distributed part of the algorithm, two technologies have been adopted: MPI
and Infiniband. MPI is an high level API for performing Inter Process Communication on
the same machine or on different nodes available for many different programming languages
(even for those which do not have IPC mechanism capabilities). Infiniband on the other hand
7
was chosen for its outstanding performance in sending small quantities of data with very little
latency.
After introduction, this document will present a general background and previous work
regarding parallel application methodologies, followed by a thorough description of the tech-
nologies used in this research. Then the main algorithm of the program will be outlined,
showing the critical points in which a possible performance increase may be achieved through
parallelization or distribution; finally some results will be submitted, tracing the throughput
growth of the program with OpenMP and MPI directives.

CHAPTER 2
BACKGROUND
2.1 Parallel and distributed application developing
Historically, parallel and distributed computing has been considered to be “the high end of
computing”, and has been used to model difficult scientific and engineering problems found in
the real world. Some examples (source: Livermore Computing Center):
• Atmosphere, Earth, Environment;
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics;
• Bioscience, Biotechnology, Genetics;
• Chemistry, Molecular Sciences;
• Geology, Seismology;
• Mechanical Engineering - from prosthetics to spacecraft;
• Electrical Engineering, Circuit Design, Microelectronics;
• Computer Science, Mathematics;
• processing of large amounts of data in sophisticated ways such as:
– Databases, data mining;
– Oil exploration;
– Web search engines, web based business services;

8
9
– Medical imaging and diagnosis;
– Pharmaceutical design;
– Management of national and multi-national corporations;
– Financial and economic modeling;
– Advanced graphics and virtual reality, particularly in the entertainment industry;
– Networked video and multi-media technologies;
– Collaborative work environments.
2.2 Technological requirements
2.2.1 SMP processors
As demands for performance increases and as the cost of microprocessors continues to drop,
the single processor model has been abandoned in favor of an SMP organization. An SMP
architecture refers a computer system composed of multiple processors connected to a single
shared memory and to a shared I/O controller.
Operating system support is necessary for enabling this feature. Moreover programs have to
be rewritten or at least reconsidered in order to access every resource available. For this reason
there has been a continuous improvement to compiler software, trying to simplify program
parallelization for developers.
Resorting to a SMP architecture can bring many advantages (1):

10
1. Performance – workload can be spread among more processors, running different tasks
in parallel; moreover interrupt management can affect only one processor at time, avoiding
processes suspension and pipeline stalls;
2. Incremental Growth – adding additional processors increases performance even more,
up to a certain extent;
3. Scaling – vendors can offer more systems with different SMP configuration;
4. Transparency – the operating system hides SMP management from the user, as it
handles thread scheduling and processes synchronization;
5. Availability – it is possible to set up the processor to execute the same instruction
on all the symmetric processors, being able to sustain hardware failures (sort of MISD
architecture).
2.2.2 Multithreading
Multithreading is a technique to exploit thread-level parallelism; unit of execution becomes
a single thread of the program in memory. Once again, it is necessary to enable this feature in
software, through the operating system support (2).
It is possible to increase execution parallelism by using one of the following implementations:
interleaved multithreading (fine-grained ) at every clock cycle the processor switches exe-
cution from one thread to another, unless one is not ready (blocked for data dependency
or memory latency);
11
blocked multithreading (coarse-grained ) instructions of the threads are continuously exe-
cuted, until an event causes delay or cache miss; in that case execution is switched to
another thread;
simultaneous multithreading (SMT or Hyperthreading) instructions from multiple threads
are simultaneously executed, exploiting intrinsic parallelism of the execution units of the
processor;
chip multithreading one or more processors is simulated on the physical chip, each handling
separate thread sets; in this way pipeline execution is much simplified.
The Simultaneous Multithreading technique has been implemented in most modern proces-
sors as it has shown the best performance benefits in a variety of applications during testing.
2.2.3 GPGPU computing
General-purpose computing on graphics processing units refers to a technique that allows
general purpose execution through the processors present in modern video cards (namely,
GPUs). This methodology allows to exploit the GPU computing power, that is usually re-
served for computer graphics, for almost any kind of operations; since the graphics processing
unit is composed of a lot of array processors, using a GPGPU programming language enables
automatic streaming execution.
Applications that especially benefit from streaming execution are multimedia-related, such
as digital signal processing (for audio/video or image manipulation), but there are also many
implementations of computer clusters, physics simulators, mathematical solvers and raytracing

12
done with GPGPU. Moreover there is older array-based software that receives a positive impact
from this rather new technology, like cryptography, DNA folding, neural networks and medical
imaging.
2.2.4 NUMA machines
While general purpose processors adopt a uniform memory access (UMA), it is not un-
common to find systems whose access time is not uniform and depends on the position of the
processor (NUMA, non-uniform).
NUMA machines are usually physically distributed but logically shared, meaning that one
node can directly access memory of another node and that not all processors have equal access
time to all memories; a software layer is often needed to guarantee program access and workload
distribution.
Memory is mapped like a global address space, merging the linked SMP memory; this feature
provides a user-friendly programming perspective to memory as data sharing between tasks is
both fast and uniform due to the proximity of memory to CPUs.
However there is a lack of scalability between memory and CPUs because adding more CPUs
can geometrically increase traffic on the shared memory-CPU path. Moreover there is a whole
synchronization construct that needs to be implemented to insure “correct” access of global
memory. One final disadvantage is that it is becoming increasingly difficult and expensive to
design and produce shared memory machines with ever increasing numbers of processors.
13
2.2.5 Clusters
A cluster is an alternative or an addition to symmetric multiprocessing for achieving high
performance; it is possible to define “cluster” as a group of computers interconnected through
some network interface, working together as unified computing resource.
It is possible to create large clusters that can by far outperform any standalone machine,
with the advantage that is is relatively easy to add new components, even in small increments;
both clusters and SMP systems provide a configuration for high performance applications and
they can both introduce advantages and disadvantages.
For example an SMP system is easier to manage and has less problems in running single-
processor software, while clusters require an in-depth program revision, with load balancing and
work distribution; on the other hand, though, clusters dominate the final performance outcome
and offer more solutions for availability.
Clusters are historically divided in:
High-availability clusters for improving the availability offered by the cluster itself; they
usually exploit redundancy so that when one node fails, it can be immediately substituted
by a spare one (active or passive standby);
Load-balancing clusters with the primary purpose of distributing evenly the workload of a
given task or service among the rest of the cluster;
Compute clusters used for computational activity, rather than services; nodes are tightly
coupled and usually computation implies a consistent quantity of communication involved;

14
usually programs can be easily ported to this environment through simple instruction
routines (e.g. MPI);
Grid computing similar to compute clusters, they focus more on the final computational
throughput rather than workload distribution and tightly coupled jobs; computation con-
sists of many independent jobs which do not have to share data during the computation
process.
2.3 Scientific software advance
Using parallelization technologies such as OpenMP and MPI, is not new in scientific soft-
ware. As a matter of fact it is normal to find quite a number of projects that exploit those
technologies.
For example it is possible to cite the Folding@Home project, from the Stanford University’s
chemistry department, currently the most powerful distributed computing cluster, which is
being developed using an MPI layer between its nodes; or it is possible to find many entries
from the TOP500 list1 , like the Pleiades and the Ranger that use Infiniband as connection link
among the clusters.
As for electromagnetic field analyzers, there has been some previous work with OpenMP:
(3) and (4) describe a possible implementation for Hybrid solvers, but the addressed software
has different solving and modeling routines. The proposed work doesn’t rely on standard FEM
1
project ranking and detailing the 500 most powerful known computer systems in the world.
15
approach, but takes on a Finite Formulation of nonlinear Magneto-static algorithm which can
be safely parallelized and distributed; see (5) for more information.

CHAPTER 3
TECHNOLOGY
3.1 Parallel applications with OpenMP
OpenMP is an application programming interface (API) that offers a set of compiler direc-
tives, library routines and environment variables to enable shared memory multiprocessing for
C, C++ and FORTRAN programs.
OpenMP stands for Open Multi-Processing and it is implemented in many open source and
commercial compilers, like Intel C++ and FORTRAN Compilers (ifort and icc) and GNU Com-
piler Collection (gcc). Among the key factors for its popularity there is the easiness of handling
threads and shared variables and the simplicity of porting programs to a multiprogramming
scheme with very little code change; moreover OpenMP enables parallel execution control for
languages that cannot usually handle multi threading and synchronization primitives, like, for
instance, FORTRAN.
With this technology the main program forks a set number of parallel threads which carry
out a task, dividing the work load on different cores; by default every thread executes its section
of code independently. After execution of the parallel job, threads are then joined back in the
main (or master) thread, resuming normal sequential programming; in this way it is possible
to divide the sequence of program execution in a tree-like structure (as shown in Figure 3).
16
17
Figure 3. Image showing the tree splitting procedure of a sequential task
OpenMP exploits preprocessor directives for thread creation and synchronization, workload
distribution and sharing, data and function management, while retaining compatibility with
unsupported compilers. In order to prevent data corruption due to overlapping threads, all
variables of the parallel section must have a declared visibility scope, either shared or private.
One directive is particularly suited for loop parallelization as it offers a fine-grained control
on the scheduling for the threads and on the distribution of the loop among the thread pool.
Other directives may directly manage thread interaction and synchronization objects (critical
regions and variable locking).

18
However, it is important to clarify that using OpenMP on an N processor machine does not
reduce the execution time by N. As a matter of fact there are a couple of reasons for this to
apply:
• Symmetric Multi Processor computer have increased computational power, but the mem-
ory bandwidth does not scale proportionally to the number of processors (or cores); per-
formance degradation occurs especially when the shared memory bandwidth is filled up
and data transfer is slowed down;
• synchronization overhead, critical region management, context switch costs and load bal-
ancing among the threads may reduce the final speedup;
• not every portion of the code can be actually parallelized:
• the theoretical limit imposed by Amdahl’s Law for parallel applications that regulates the
maximum theoretical speedup holds.
3.1.1 Amdahl’s Law
Amdahl’s Law is a method used for finding the maximum speed improvement in parallel
computing environments. The speedup highly depends on the size of the parallelizable code
(6).
The formula states that the potential speedup of the program directly depends on the
fraction of code P that can be parallelized
1
speedup = (3.1)
1−P
19
Basically if none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup),
if all of the code is parallelized, P = 1 and the speedup is infinite (in theory), if 50% of the
code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast, and
so on; the next figure (Figure 4) shows the theoretical speedup curve with infinite processors.
Figure 4. Graph plotting of the theoretical curve from Amdahl’s Law
When the code has parts that cannot be parallelized, the relationship can be updated to
1
speedup = P
(3.2)
N +S
20
where N is the number of processors, P the portion of parallelizable code and S the portion
of serial code (corresponding to (1 − P )).
The following figure (Figure 5) shows a set of examples with different parallelizable code over
a variable number of processors. It is possible to see not only that a 95% parallelizable program
has a maximum speed improvement in the order 20x notwithstanding the high number of
processors available, but also that a highly sequential program cannot achieve any acceleration
whatsoever.
Figure 5. Graph plotting of Amdahl’s Law for multiprocessors

21
3.1.2 Benchmarking
In order to understand the possible benefit from using OpenMP, some tests have been run
targeting the best possible configuration about the number of threads and the thread size. A
simple test program was used with a complex and long loop containing some processor inten-
sive operations (mainly mathematical operations like power and square root). The particular
case of an “interesting” loop has been chosen because it showed with enough simplicity the
effort/benefit ratio of OpenMP.
The two main configuration variables that characterized the benchmarks were the scheduler
type and the chunk size, plus the total number of threads involved in the program. The chunk
size is an integer positive value representing the number of iterations each thread has to manage,
while the scheduler type may be:
STATIC loop iterations are divided in fixed chunk number of iterations;
DYNAMIC loop iterations are divided in chunk number of iterations, but then dynamically
assigned to thread when one task is completed;
GUIDED the chunk size is rearranged proportionally to its value allowing unassigned iteration
to gain priority over completed tasks.
Other type of schedulers may be auto and runtime in which one of the above scheduler is
selected accordingly to the CPU load and the set up environment. As it can be foretold, guided-
scheduled threads work best with very small chunk sizes (with respect to the total number of
22
iterations), as the scheduling algorithm is more efficient when it can control a pool of threads
on its whole, while the static and dynamic scheduling prefer having a medium chunk size value.
Beware that setting a static number of threads may reduce the total performance of the
application. As a matter of fact the thread number in the main program has been left to the
default value for this very reason.
The test program partially emulates some computationally intensive routines of the target
software; the main loop is composed of several mathematical functions that are known to stress
the processor and require a long cpu time to be carried out. The code is reported in appendix
B.
3.1.2.1 Sequential program with OpenMP enhancements
In this first test the program is speeded up with increasingly higher number of threads avail-
able, also overcoming the eight physical cores actually present. All three scheduling algorithms
are evaluated. The value of the first column (one thread) may be safely considered as reference
for the program without OpenMP optimizations.
It is possible to see that there is a huge impact when inserting a second thread (50% time
reduction) and then it asinthotically tends to a given value, fully respecting Amdahl’s Law. It’s
interesting to notice that the three schedulers perform in same range of values and that the
best performance is achieved in the region of 8-9 threads (given the eight-core machines used).
After this value all the schedulers, static and the dynamic in particular, suffer from excessive
context switches and interference from the operating system preemption mechanism.
23
Figure 6. Performance overview of an OpenMP threaded program

24
3.1.2.2 OpenMP schedulers performance
Having evaluated the performance of the different threads, now the three types of available
schedulers are compared; moreover for each scheduler a different order of chunk size is tested.
3.1.2.2.1 Static Scheduler
The static scheduler works as expected (Figure 7) showing a very good performance increase
in region of 7-8 threads with 10-100 as chunk value. It is interesting to notice that for very high
chunk size OpenMP can’t reduce the execution time, and this holds for every type of scheduler;
the reason of this behavior resides in how OpenMP manages iterations – all iterations of the
loop are assigned to a single thread and therefore there is not any benefit.
Figure 7. OpenMP static scheduler performance chart

25
3.1.2.2.2 Dynamic Scheduler
Because of its dynamic behavior, the dynamic scheduler shows very peculiar results with
different configurations. For example, as shown in Figure 8, there are high chunks and little
number of threads that present even an additional overhead, or small chunks that cannot leave
the average value regardless of the thread number.
Even with this disparity however, the best execution time reduction is achieved in region
7-9 by chunks of medium order.
Figure 8. OpenMP dynamic scheduler performance chart

26
3.1.2.2.3 Guided Scheduler
The final scheduler presented here is the most straightforward and the best performing,
thanks to the more advanced algorithm of the guided scheduling. As a matter of fact for a
chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations
divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than
1), the size of each chunk is determined in the same way with the restriction that the chunks do
not contain fewer than k iterations (except for the last chunk to be assigned, which may have
fewer than k iterations) – source (7).
As anticipated, this algorithm works best with very small chunks, as it can apply its algo-
rithm without interferences, and always in the 8-9 threads region.
Figure 9. OpenMP guided scheduler performance chart

27
3.1.2.3 OpenMP enhancement results
This last section resumes the global results from the point of view of the scheduler. As
reference value, the maximum time execution reduction has been selected from each chunk of
each scheduling algorithm; all these results come from the 7-9 threads region.
The test run shows that the scheduler that performed best is the guided scheduler with
chunk size in the order of the units, and for this reason it has been chosen as default scheduler
in all OpenMP directives inserted.
Figure 10. OpenMP scheduler overview

28
3.2 Infiniband
Infiniband is the union of two competing transport designs, Next Generation I/O from Intel,
Microsoft and Sun, and Future I/O from Compaq, IBM and Hewlett-Packard. It has become
the de facto standard for high speed cluster interconnection, outperforming Ethernet in both
transfer rate and latency.
This technology implements a modern interconnection link using a point-to-point bidi-
rectional serial transfer, supporting several signaling rates. It is used for high-performance
computing either for high-speed connection between processors and peripherals as well as for
low-latency networking.
The standard transmission rate is of 2.5 Gbit/s, but double and quad data rates currently
achieve 5 Gbit/s and 10 Gbit/s respectively. Moreover it is possible to join links in units of 4 or
12 elements enabling even further transfer speed (up to 120 Gbit/s). However it is important
to state that a fault prevention for transmitted data is adopted using information redundancy:
every 10 bits sent carry only 8 bits of useful information, reducing the useful data transmission
rate. Table I summarizes the various configuration effective data rate.
Most notably, there is no standard programming interface for the device, only a set of
functions (referenced as verbs) must be present, leaving implementation to the vendors. The
most commonly accepted implementation is provided by the OpenFabric alliance. Being a
transport layer there are many protocol that can run on Infiniband, from TCP/IP to OpenIB
(described in section 3.3.1).

29
TABLE I
MAXIMUM DATA THROUGHPUT IN DIFFERENT CONFIGURATIONS
useful data Single Data Rate Double Data Rate Quad Data Rate
1X 2 Gbit/s 4 Gbit/s 8 Gbit/s
raw data Single Data Rate Double Data Rate Quad Data Rate
1X 2.5 Gbit/s 5 Gbit/s 10 Gbit/s
3.3 Distributed execution with MPI
MPI is a high level language-independent API used both for parallel computing and for one-
to-one, one-to-many and many-to-many inter process communication (IPC). It has become the
de facto standard for process communication despite of lack of sponsorship by any association.
Originally it was developed by William Gropp and Ewing Lusk among others.
This set of API is used for high-performance computing for its scalability, portability and
performance, as it implements a distributed shared memory system with very few directives. It
usually resides on level 5 of the OSI model, but, as there is no strict constraint on this point,
there are many implementation that offer different transport, network and data link layers.
MPI is available for many programming languages including C, C++, FORTRAN and
Java; sometimes implementations benefit from the bounded language, for example using object-
oriented programming in C++ and Java, and from the hardware they run on. Among the
30
most diffused library it is possible to find OpenMPI, MPICH2 and MVAPICH2 which differ
only for threading support, network availability (e.g. Ethernet or Infiniband) and hardware
optimizations.
3.3.1 MPI over Infiniband
One of the most widely used environments for MPI is Infiniband; as a matter of fact thanks
to Infiniband low latency a small packet sent through a connection link doesn’t present a major
overhead with respect to Ethernet for example. In order to set up a distributed system of this
kind there is need of additional software for managing the Infiniband sub net (OpenSM) and
for handling the transport layer (OpenIB).
MPI and Infiniband modularity allow different configurations, and it is common use to
transmit packet with either Infiniband or a TCP/IP stack. This is possible because the transport
layer of MPI is handled by two routines (among others): the Point-to-Point Messaging Layer
and the Byte Transfer Layer. The PML abstracts the communication mechanism with buffers,
synchronization points and acknowledge messages; the BTL on the other hand translates the
byte messages into the network layer byte sequence – OpenIB is a BTL protocol for sending
messages on Infiniband.
Subsequently the functions (or verbs) available in the Infiniband drivers are invoked and
control is moved from user space to kernel space, where the message is finally sent across the
network link.
This seemingly complex structure allows to reduce code complexity and increase inter-
compatibility and maintainability between different implementations.

31
3.3.2 Benchmarks
As it has been done with OpenMP, some tests were also performed on the MPI installation
and on Infiniband structure to check that machine configuration was correct and that devices
were running at full speed. The program makes heavy use of the MPI Send and MPI Recv
directives and utilizes timing function with resolution of milliseconds. It has been noticed
that a warm-up phase (exchanging some messages between the nodes) is necessary before any
measurement is done, because the whole structure of MPI plus Infiniband must be activated.
3.3.2.1 Single message over Infiniband with MPI
In this test the transfer time of messages over Infiniband with MPI directives is evaluated;
message size increases quadratically and time is measured with millisecond precision. Data is
displayed in a semi-logarithmic scale so that the whole slope can be shown.
Two different MPI implementation are compared, and it possible to notice that OpenMPI
outperforms MVAPICH in small and large quantities of data, but it is slower in medium-sized
messages. With MVAPICH it is not possible to send data over 2 GB, due to implementation
limits; OpenMPI doesn’t suffer from this behavior, but on the other hand it has a sort of latency
of 3.5 seconds before programs start executing (and this is not recorded in this test).
Other types of MPI implementation exist, most notably MPICH and Lam-MPI, from which
both MVAPICH and OpenMPI derived, but they lack of support for Infiniband; any packet
transmitted would revert to plain TCP/IP.

32
Figure 11. Time v. size for a single message

33
3.3.2.2 Multiple messages over Infiniband with MPI
Using the same structure of above, here is tested the time v. size with multiple messages
(1024 messages exchanged for each tested size). The results are similar to the previous case.
Figure 12. Time v. size for 1024 consecutive messages

34
3.3.2.3 Latency
One final test has been run to determine the expected latency in message passing; this has
been achieved by sending a 0-length packet using some data types available in MPI. However,
due to the modularity of the MPI over Infiniband structure, the MPI initialization overhead
must be removed: for this reason the same test is to be repeated both on a single machine and
on the two machines.
The latency value measured with this method is 8 µs which is compatible with the Infiniband
board specifications. The complete table of results follows.
TABLE II
MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY
Test type µ-seconds

Single node 26
Two nodes 34
Latency 8
CHAPTER 4
ALGORITHM
4.1 Overview
The target application is a suite of programs called Sally3d, and it has been ported from a
VMS system to standard FORTRAN, with a standard makefile instead of terminal scripts and
it can be compiled on any UNIX based operating systems.
The software is designed for electromagnetic field analysis and micromagnetic modeling of
nanomagnets; for this purpose, magnetization dynamics in nanomagnets is described by the
Landau-Lifshitz-Gilbert (LLG) equation which rules the gyromagnetic precession of magneti-
zation vector field around the so-called micromagnetic effective field.
The effective field takes phenomenologically into account the interactions occurring in mag-
netic materials such as short-range (exchange, anisotropy) and long-range interactions (mag-
netostatics, Zeeman). Magnetization dynamics in a ferromagnetic body is described by the
following Landau-Lifshitz-Gilbert (LLG) equation:
∂m ∂m

= −m × heff [m] − α , (4.1)
∂t ∂t
where m = m(r, t) is the magnetization vector field normalized to the saturation magneti-
zation, Ms , time is measured in unit of (γMs )−1 (γ is the absolute value of the gyromagnetic
35
36
ratio), α is the dimensionless damping parameter, heff [m(r, t)] is the effective field operator
which can be obtained by the variational derivative of the free energy functional:
δgL [m]
heff [m] = − , (4.2)
δm
where
Z " 2 #
1 l ex 12
gL [m] = |∇m| − hm · m + ϕ(m) − ha · m dV , (4.3)
VΩ Ω 2 2
p
ϕ(m) is the anisotropy energy density and lex = (2A)/µ0 Ms2 is the exchange length (A
is the exchange constant and µ0 the vacuum permeability), hm and ha are the demagnetizing
and applied fields, respectively, and VΩ is the body volume.
In addition, the homogeneous Neumann boundary condition ∂m/∂n = 0 is imposed at the
body surface. In order to obtain a spatially discretized version of eq. (Equation 4.1) a partition
of the region Ω in N cells Ωk , with volume Vk is considered and is assumed that the cells are
small enough that the vector fields m(r, t) and heff [m(r, t)] can be considered spatially uniform
within each cell. Symbols mk (t) and heffk denote the vectors associated with the generic k-th
cell. Beside the cell vectors, the mesh vectors m = (m1 , . . . , mN )T ∈ R3N containing the whole
collection of cell vectors are defined.
Now it is possible to write down the discretized LLG equation in the following form that
consist of a system of ordinary differential equations:

37
dmk dmk
= −mk × heffk [m] + αmk × , (4.4)
dt dt
where mk is the average magnetization of the k-th cell. It is worth noting that the ef-
fective field in the k-th cell depends on the magnetization of the whole cell collection due to
the magnetostatic interaction, namely heffk = heffk [m]. The numerical solution of equation
(Equation 4.4) will provide the time evolution of magnetization.
4.2 Code Flowchart
The kernel of the micromagnetic solver integrates over time the LLG equation discretized
with respect to space. At every time step, the next value of the magnetic vector is computed
by collecting the different finite elements of the magnetic field; this operation is performed by
the GILBERT routine and it is reported in Figure 13. The equation is a non linear differential
equation, whose solution is obtained through Newton-Raphson method of approximation; this
is performed by the GINT function.
The section of code which has been parallelized and distributed (outlined with yellow in
the next figure) implements the magnetostatic and anisotropic field solvers; also the part that
combines together the different field elements has been updated with OpenMP and MPI direc-
tives. This development scheme has been chosen on the grounds that the real computational
bottleneck resulted particularly in the magnetostatic solver and partially in the anisotropic
solver.
38
Figure 13. Flowchart of the main functions implementated in the code

39
4.3 Test Case
In order to carefully analyze the performance of the program and to identify the possible
parallelization points, as well as to obtain useful data, a particular test was prepared. The test
case is the fourth standard problem of micromagnetics, proposed by Bob McMichael, Roger
Koch and Thomas Schrefl.
Quoting (8), the problem focuses on dynamic aspects of micromagnetic computations. The
initial state is an equilibrium s-state (Figure 15) which is obtained after applying and slowly
reducing a saturating fild along the [1,1,1] direction to zero. Fields of magnitude sufficient to
reverse the magnetization of the rectangle are applied to this initial state and the time evolution
of the magnetization are examined as the system moves towards equilibrium in the new fields.
The problem will be run for two different applied fields.
At t = 0 one field will be applied to the equilibrium s-state: the field is composed of
µHx = −24.6 mT, µHy = 4.3 mT, µHz = 0.0 mT (corresponding to approximately 25 mT,
directed 170 degrees counterclockwise from the positive x axis).
Figure 14. Standard problem #4 representation

40
The problem was chosen so that resolving the dynamics should be easier for the 170 degree
applied field than for the 190 degree applied field. Preliminary simulations reveal that, in the
case of the field applied at 170 degrees, the magnetization in the center of the rectangle rotates
in the same direction as at the ends during reversal. In the 190 degree case, however, the center
rotates the opposite direction as the ends resulting in a more complicated reversal. The field
amplitudes were chosen to be about 1.5 times the coercivity in each case.
Figure 15. S state field representation
4.4 Profiling
Thanks to the standardization of the program code, it was possible to exploit the gprof
utility, available in the gcc suite. This utility allows to obtain procedure level timing information
with reasonable resolution, as well as a complete call graph view for identifying the most
computational expensive functions.
According to the profiler, whose graph call has been reported in Figure 16, the following
functions were the most time consuming:

41
• calc intmudua
• curledge and the calling calc hdmg tet
• calc mudua
• campo effettivo
Most of the software is composed of very small routines that are called with very high
frequency, thus very difficult to optimize and to measure (in fact they are not even reported in
profiler reports); only the noted functions have an observable impact on the overall execution
time.
4.5 Compiler optimizations
Once again, due to the porting operation that has been performed, several compiler opti-
mizations became available and were subsequently added in order to increase the throughput
of the program. Most of the additions have been chosen following official gcc documentation
and manual pages.
4.5.1 Native switch
The key for optimization relies on the native machine capabilities; in order to activate
at once all the features of a given architecture and of a given processor is required to set
-march=native. In this way all processor specific instructions can be accessed and all floating
point capabilities fully exploited, setting the right processor architecture and the available SSE
flags. Moreover the floating point instructions are specifically set to use any SSE extension
(-mfpmath=sse enabled by default).

42
Figure 16. Call graph scheme of the target software

43
A similar optimization is achieved also in the Intel FORTRAN Compiler with the -axS -xS
switches.
4.5.2 Loop unrolling
Among the loop transformation techniques, loop unrolling has achieved wide success in
compiler theory. Its goal is to increase the execution speed of the program at the expense of
size. Loop unrolling is performed by reducing (if not eliminating) the number of the “end of
loop”; in this way the number of jumps and of conditional branches decreases, and thanks to
the larger, size the number of cache hits increases (in big caches).
This optimization is pulled in by the -O3 flag.
4.5.3 IEEE compliance
Due to the highly mathematical nature of the software, the -ffast-math flag has been
added: this flag activates a set of optimization that allow some general speedups by discarding
some return codes and by skipping some redundant operations (like rejecting the sign of zero
or not considering Nan and +-Inf number types).
The main drawback to this optimization is that it is not possible to guarantee IEEE, ISO
and or ANSI compliance that specify arithmetic compatibility, exceptions and operand order
in floating point operations.
4.5.4 Library Striping
One final type of optimization has been inserted at linking time. The following options
try to decrease load time for library functions, modifying the executable header (ELF in this
44
context) and symbol handling (9). These options must be passed with the -Wl flag so that the
compiler can forward them to the linker.
More specifically the -O1 switch performs in this way: as symbols get inserted in the ELF
header, they are stored in hash tables; the default configuration is to keep the hash keys small,
performing string comparison with collisions. This optimization shifts the reduction towards
short hash chains, increasing hash keys length and header size, but actually reducing symbol
look-ups.
CHAPTER 5
IMPLEMENTATION
5.1 General Scheme
Analyzing the functions of 4.4 from several profiling sessions a common pattern has been
found.
As a matter of fact, every function contained one or more loops, carrying quite a number
of instructions over arrays and matrices. For this reason a general plan has been decided and
summed up in Figure 17.
As first step, the standard sequential loop is parallelized to fully exploit all the eight cores
each single machine can offer. By setting up proper shared/private variables lists, the loop
is divided among a given number of OpenMP threads and each carries out a portion of that
iteration; as soon as a thread ends, a new one is created and assigned a element, until the whole
loop section is completed.
The second step in this strategy is to split in two distinct and equal parts before exploiting
OpenMP. Each part is submitted to a node of the cluster and separately executed; at the end of
the loop data is exchanged back with MPI and merged so that the two machines can continue
working on complete arrays. Thanks to Infiniband, latency for exchanged data sets is reduced
to a minimum.
45
46
Even though OpenMP requires little software modifications, in order to obtain the maximum
possible throughput from the software, some updates have been carried out, mainly reducing
portions of redundant code.
Figure 17. Implementation scheme overview
It should be noted, however, that the software is not embarrassingly parallel; as a matter of
fact there were a number of modification to the software in order to apply parallelization and
47
distributed computing. The synchronization object mostly used is the implicit blocking offered
by the send() and recv() mechanism; since data is exchanged between the two machines in
the same manner, until either of them is ready to process data, the other cannot continue.
In other sections of the code, synchronization was achieved by native OpenMP directives, as
shown in 5.3.4.
5.2 Hardware Support
The hardware selected for implementing the cluster consists of two computer, each supplied
with:
• two quad core Intel Xeon E5420 running at 2.5 GHz frequency, with 6 MB of L2 cache;
• an Intel Server Board S5000PSLSATAR motherboard;
• 32 GB of ECC DDR2 RAM;
• one Infiniband card from Mellanox, model ConnectX IB MHGH28-XTC DDR HCA PCI-e
2.0 x8 Memory Free.
The two machines are connected together with an end-to-end Infiniband link, running at
full speed as the cards are mounted on the PCI Express x16 v1.1 slot. The focus for building
these computer has been to search for low-cost components that could enable high performance
results.
5.3 Applied Directives
In this section some example code has been extracted from the source of the program and
explained.
48
5.3.1 MPI Layer
The following sections of code show some sample “header” and “epilogue” MPI functions
that enable slitting the array and merging it back. The header part analyzes the rank variable
which differs for every node of the MPI cluster: inside the if clause the array range is defined by
setting start INDEX and end INDEX variables (which intuitively represent the range beginning
and ending). So the first node works on the first half of the array and the second node on the
second half, allowing both machines to operate on separated data subsets.
Some preprocessor directives have been inserted in order to maintain compatibility on non
MPI system.
#ifdef MPI_ENABLED
if (rank .eq. 0) then
start_INDEX = 1
end_INDEX = NEDGE/2
else if (rank .eq. 1) then
start_INDEX = ( NEDGE/2 ) + 1
end_INDEX = NEDGE
endif
#else
start_INDEX = 1
end_INDEX = NEDGE
#endif
49
DO M=start_INDEX,end_INDEX
[...]
So after loop has terminated, the array on which the iteration worked must be synchronized
on both nodes; this is done with a couple of MPI SEND and MPI RECV instructions. The rank
variable is checked again to be able to tell which portions of the array must be updated.
#ifdef MPI_ENABLED
tag = 1
ISIZE = NEDGE - NEDGE/2
if (rank .eq. 0) then
dest = 1
source = 1
call MPI_RECV(BINTMU( (NEDGE/2) + 1), ISIZE , MPI_REAL8,
& source, tag, MPI_COMM_WORLD, stat, err)
call MPI_SEND(BINTMU, NEDGE/2, MPI_REAL8, dest, tag,
& MPI_COMM_WORLD, err)
else if (rank .eq. 1) then
dest = 0
source = 0
call MPI_SEND(BINTMU( (NEDGE/2) + 1), ISIZE , MPI_REAL8,
& dest, tag, MPI_COMM_WORLD, err)
call MPI_RECV(BINTMU, NEDGE/2, MPI_REAL8, source, tag,

50
& MPI_COMM_WORLD, stat,err)
endif
#endif
5.3.2 DO directive
The DO directive is the most common in this configuration. It requires a list of shared and
private variables: for the latter case, a new memory position is allocated for each thread.
Workload is distributed accordingly to the selected scheduler as described in 3.1.
!$OMP PARALLEL SHARED(IFAEXT,BINTMU,AMAG,TM)
!$OMP& PRIVATE(I,KH,KK,NPOS,IMAG,KCOMP)
!$OMP DO SCHEDULE(GUIDED)
DO I=start_INDEX,end_INDEX
[...]
BINTMU(I)=BINTMU(I)-
+ IFAEXT(I,KH,2)*TM(NPOS+(IMAG-1)*3+KCOMP)*AMAG(IMAG,KCOMP)
[...]
ENDDO
!$OMP END DO
!$OMP END PARALLEL

51
5.3.3 REDUCTION directive
One of the possible benefits in parallelization is to use a mathematical property for addition
and subtraction clauses: since variating the order doesn’t change the result, the reduction
directive allows to execute out-of-order loop instances and to compute the final value at the
end of the iteration.
Without this directive the target variable could have suffered from various synchronization
problems, as reading and writing to a shared position doesn’t guarantee a correct result.
!$OMP PARALLEL PRIVATE (M,K,DOT)
!$OMP& SHARED(H_DEMG,AMAG,VOLTET,NPNMAG)
!$OMP& REDUCTION(+:VOLUME)
!$OMP& REDUCTION(-:DEMG_ENE)
!$OMP DO SCHEDULE(GUIDED)
DO M=1,NPNMAG
DOT=0.D0
DO K=1,3
DOT=DOT+H_DEMG(M,K)*AMAG(M,K)
ENDDO
VOLUME=VOLUME+VOLTET(M)
DEMG_ENE=DEMG_ENE-VOLTET(M)*DOT/2.D0
ENDDO
!$OMP END DO
52
!$OMP END PARALLEL
Unfortunately this option is available for non-array operators only, so it has been applied
few times.
5.3.4 Avoiding data dependency
One of main problems of OpenMP and parallel programming in general is data dependency
and this is usually resolved by modifying the algorithm structure or by means of synchronization
objects.
In order to avoid inserting a critical region (corresponding to a CRITICAL or ATOMIC OpenMP
directive) for shared constructs which could have negatively affected performance, an array
with self data references has been converted into a matrix and indexed with the working thread
number; in this way every array element of the matrix was automatically dereferenced from
itself as there could only be one single thread working on a given line at the same time.
!$OMP PARALLEL DEFAULT(PRIVATE)
!$OMP& SHARED [...]
#ifdef _OPENMP
INUM_TH = omp_get_num_threads()
#endif
[...]
DO L=1,6
LATO=(MCNT_E(L,ITET))
53
AUS=SIGN(1,LATO)*
& ( FMDUA(MATFE(L,1)) - FMDUA(MATFE(L,2)) )
LATO=ABS(LATO)
#ifdef _OPENMP
INUM = omp_get_thread_num()+1
#else
INUM = 1
#endif
AMUDUAW(INUM ,LATO) = AMUDUAW
+ (INUM, LATO) + AUS
ENDDO
[...]
!$OMP END DO
!$OMP END PARALLEL
At the end of operation, the original array is rebuilt with a simple loop on the number of
generated threads (known in INUM TH).
DO ILATO=1, NEDGE
#ifdef _OPENMP
DO III=1, INUM_TH
#else
III=1
54
#endif
AMUDUA(ILATO) = AMUDUA(ILATO) + AMUDUAW(III,ILATO)
#ifdef _OPENMP
ENDDO
#endif
ENDDO
5.4 Results
5.4.1 Reduced test case
During development the test case was run to understand if the current implementation was
providing good results. The simulation had duration of 8 ps only and was composed of just
1000 elements (see Figure 14), but it was already possible to notice some good improvements
to the software. Further work has been done after these results were produced.
The following table (Table III) resumes the total execution time in seconds; in the table the
label OMP stands for OpenMP, MPI for OpenMPI over Infiniband and OPT for optimiza-
tions, while for each field a * stands for enabled and a - for disabled.
It is possible to notice that the software has received a speed boost of 87.5% from the old
configuration to the newer optimized MPI over Infiniband plus OpenMP environment.
Not surprisingly the most effective contribution to the software is the optimizations section:
this is because the ability to access all the SSE extensions with the loop unrolling configuration
(see 4.5) adds some SIMD execution to the software already.

55
TABLE III
PARTIAL RESULTS
OMP MPI OPT seconds

* * * 133
* * - 400
* - * 186
* - - 487
- * * 200
- * - 792
- - * 246
- - - 1062
However it is important to take in consideration what targets had this project. It is true
that the most cumbersome code for the processor has been dutely parallelized, but the software
is composed of a high number of other functions that are either closely serialized or with very
small duration time. The sections that have been parallelized and distributed have received
a speed boost, but the final software performance suffers from the presence of serial code and
from the high number of small functions.
This explains also why the optimizations bring such an improvement, as they affect all the
software without distinction. So a more sensible comparison can only be done if the optimization
element is kept constant.
5.4.2 Final test case
With the analysis of the previous data, it was possible to understand what was really needed
to be measured and to be improved, so development continued focusing on the new ratio. In

56
the end, when all the most computational-expensive functions were addressed, it was possible
to launch the final test case with the same characteristics of before and to obtain the following
results:
TABLE IV
FINAL RESULTS
OMP MPI seconds

* * 59
* - 129
- * 174
- - 249
The total speed improvements of OpenMP and MPI elements only correspond to a raw 76%
increment. This is very good results, because not only it is comparable to the speedup intro-
duced by the optimizations, but also it out does the results obtained from the Intel FORTRAN
compiler v10 (obtained through other tests) by a rough 23%.
By looking at the single functions contribution more in detail, it is possible to see the effect
of OpenMP and MPI over Infiniband with no overhead from the other routines.
From the above table it is possible to understand the actual impact of the technologies used
to increase the throughput of the software.

57
TABLE V
FUNCTION RESULTS
Function Name Normal OpenMP MPI OpenMP+MPI

calc intmudua 24.5 s 4.7 s 14.4 s 2.8 s
calc hdmg tet 16.9 s 3.0 s 10.8 s 1.7 s
calc mudua 12.1 s 1.9 s 7.0 s 1.1 s
campo effettivo 17.7 s 4.5 s 9.9 s 2.3 s
Having a look at the OpenMP section, there is an aggressive reduction, by a factor of 6-8x:
this is a very good result as it means that the code was able to exploit every processor available
to the maximum extent, with very little overhead and no synchronization problems.
As for MPI on the other hand, there is a 2x factor of speed improvement; this is sensible as
the code was almost split in two, so it is normal that the overall reduction corresponds to half
execution time. It is interesting to notice that this effect applies perfectly when merging MPI
with OpenMP. As a matter of fact, thanks to the Infiniband channel used, communication time
is negligible, and so only the small MPI overhead can influence execution.
CHAPTER 6
CONCLUSION
In this thesis, it has been demonstrated that to achieve best results, a complete review
of the software must be taken into account. Highly serialized software, written thoughout a
mathematical model, must be reorganized to allow better parallelization.
However there are technologies that can have a direct impact on performance, in particular
OpenMP and MPI. With very little software modification and simple code analisys, it has been
possible to introduce a significant improvement in the overall execution time. Furthermore
the standard, clean and stable environment of the GCC suite enabled accessing important
optimization controls that increased the quality of the software where it was not been done
with OpenMP or MPI.
For this reason this project shows significant room for improvement. First of all, algorithm
optimization are necessary to obtain high performance; secondly it could be possible to take
advantage of FORTRAN library functions for otherwise long routines – even more for the
high number of small operations repeated several times. In the third place, software analysis
must continue in order to extract precise timing information from profiling and to identify the
other computational-expensive functions that could receive a significant improvement from the
inclusion of OpenMP and MPI directives.
Finally thanks to the high scalability of cluster system, it should be fairly easy and much
convenient to add new elements that can contribute in the computation deployment; in fact it
58
59
would be possible to connect more components to the cluster using an Infiniband switch, at the
sole cost of some increased latency. In fact due to the applied middleware of open standards,
OpenMP and MPI, porting software to other architectures and expanding its routines to use
further nodes of the clusters should not be considered a complex task.

APPENDICES
60
61
Appendix A
STREAMING SIMD EXTENSIONS
Introduced by Intel in its line of Pentium III processors, SIMD technology allows for
SIMD execution. While older processors could only process one data element per instruc-
tion, SIMD technology allows instructions to handle multiple data elements, making processing
much quicker.
SSE’s use of SIMD technology allows for data processing in applications such as 3D graphics
to benefit greatly from the availability of extended floating point registers. In contrast to the
preceding MMX technology, SSE registers have an increased width, allowing more bits to be
stored and more speed facilities for applications. Initially eight new 128-bit registers known
as XMM0 through XMM7 were added; SSE2 extends MMX instructions to operate on XMM
registers, allowing the programmer to completely avoid the eight 64-bit MMX registers “aliased”
on the original floating point register stack.
More precisely SSE2 adds new mathematical instructions for double-precision (64-bit) float-
ing point and also extends MMX instructions to operate on 128-bit XMM registers. SSE integer
instructions introduced with later extensions would still operate on 64-bit MMX registers be-
cause the new XMM registers require operating system support (this behavior changed only
with SSE4 onward). SSE2 enables the programmer to perform SIMD math of virtually any
type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the
need to touch the obsolete MMX/FPU registers.

62
Appendix A (Continued)
SSE3, SSSE3 and SSE4 are further revisions to the architecture and introduce new operating
conditions (column access to registers), new instructions (that can act on 64-bit MMX or 128-
bit XMM registers and simplify implementation of DSP and 3D code) and conversion utility
that avoid pipeline stalls.
In a multi-tasking environment, the Streaming SIMD Extensions require support from the
operating system: the SIMD registers must be handled properly by the operating system’s
context switching code. When the system switches control from one process to another, the old
process’s SIMD registers must be saved away, and the saved values of the new process’s SIMD
registers must be loaded into the processor. The Pentium III processor prohibits programs from
using the Streaming SIMD Extensions unless the operating system tells the processor at system
startup time that it is aware of the SIMD registers, and will manage them properly.
63
Appendix B
OPENMP TEST PROGRAM
The test program has been designed to simulate some computationally intensive routines of
the target software; in the main loop a lot of mathematical functions are executed over a set
of arrays, without creating data dependencies between the iterations. Statistics are printed at
the beginning and at the end of the program; in order to obtain the total execution time the
function gettimeofday() is used; the loop is repeated ten times obtaining a more
for (u=1;u<=32; u++) {
printf("%d threads\t", u);
omp_set_num_threads(u);
totaltime = 0;
for (t = 0; t< 10; t++){
gettimeofday (&timing_start, NULL);
#pragma omp parallel for \
shared (a, b, c, d, f, g, h, chunk) private (i, u, t) \
schedule (guided, chunk)
for (i=0; i < N; i++){
c[i] = pow (sqrt(a[i] * b[i] / (b[i] + a[i]) ), 3);

64
Appendix B (Continued)
d[i] = sqrt (c[i] * b[i] / (c[i] + pow (a[i], 4) ) );
e[i] = pow (pow (c[i], d[i]), pow (d[i], c[i]) );
f[i] = sqrt (pow (a[i], c[i] + b[i]) );
h[i] = a[i] * b[i] * c[i] * d[i] * e[i] * f[i] * g[i] * h[i];
gettimeofday (&timing_end, NULL);
totaltime+= (timing_end.tv_sec - timing_start.tv_sec) * 1000000 \
+ (timing_end.tv_usec - timing_start.tv_usec);
printf("%d\n", totaltime/10);
}
65
Appendix C
OPENMP FORTRAN REFERENCE
The general OpenMP directive begins with !$OMP indicating the starting of an OpenMP
configuration; any directive has to be declared with an entry and a closing section, such as:
!$OMP directive [clause ...]
[...]
!$OMP END directive
The first directive must be PARALLEL, which wraps the code section that must be executed in
parallel, and it is closed by the corresponding END PARALLEL. Syntax for this directive (clause)
may be any combination of the following:
IF (condition) parallel execution is activated only when condition is met;
PRIVATE (list) list of private variables;
SHARED (list) list of shared variables;
DEFAULT (type) type of visibility for variables not listed before;
FIRSTPRIVATE (list) list of private variables that are automatically initialized;
REDUCTION (operator: list) performs an out-of-order operation of kind operator on the
variable list;
COPYIN (list) for copying values of variable list among threads;

66
Appendix C (Continued)
NUM THREADS (num) statically set the number of threads to generate.
Another important OpenMP directive is DO that specifies the next loop can be executed in
parallel by the thread team. Syntax follows:
SCHEDULE (type [, chunk )] describes how iterations of the loop are divided (in chunk s)
among the threads;
ORDERED performs iteration in order, sequential style;
PRIVATE (list) list of private variables;
FIRSTPRIVATE (list) list of private variables that are automatically initialized;
LASTPRIVATE (list) list of private variables that are initialized when iteration ends;
SHARED (list) list of shared variables;
REDUCTION (operator | intrinsic : list) performs an out-of-order operation of kind op-
erator (or intrinsic function) on the variable list;
COLLAPSE (n) performs some loop collapsing (for n loops).
Other parallelizing directives that don’t require any particular clause configuration are:
• SECTIONS: statically splits the code into sections which are assigned each to a single thread
in the pool;
• WORKSHARE: divides the execution of the enclosed code block into separate units of work;
• TASK: defines an explicit task, which may be executed by the encountering thread, or
deferred for execution by any other thread in the team.

67
As for synchronization management, it is possible to find the following directives, which
don’t need any other clause as well:
• MASTER: specifies a region of code that is executed only by one thread;
• CRITICAL: identifies a critical region in which only one thread at a time can access;
• BARRIER: implements a barrier region where execution is stopped until all threads are
ready to continue;
• ATOMIC: defines a single instruction critical region, in which memory is accessed atomically
from all the threads.
Finally it is possible to use some OpenMP related functions to further adapt the software to
a multiprogrammed system; this set of routines may be used for a variety of application such as
obtaining information from single threads, setting configuration about the number of threads,
getting environment data (like number of processors), locking variables, timing and so on. For
example:
• OMP SET NUM THREADS(): sets the number of threads that must be started;
• OMP GET NUM THREADS(): returns the number of threads of the parallel region;
• OMP GET THREAD NUM(): returns the number identifying a single thread in the pool;
• OMP GET THREAD LIMIT(): returns the maximum number of OpenMP threads available
to a program;
• OMP GET NUM PROCS(): returns the number of processors that are available to the program;
68
• OMP INIT LOCK(): initializes a lock on the variable, setting the lock to “unset”;
• OMP DESTROY LOCK(): eliminates the lock on the variable;
• OMP SET LOCK(): sets the lock on the given variable;
• OMP UNSET LOCK(): unsets the lock on the given variable;
• OMP TEST LOCK(): tests the lock on the given variable.

69
Appendix D
MPI FORTRAN REFERENCE
MPI routines are added to a standard FORTRAN program by including the mpif.h library.
After this, the MPI layer must be initialized with MPI INIT(), before using any MPI related
functions, and it must be closed with MPI FINALIZE(), before ending the program.
By using:
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
the program becomes aware of running in a MPI environment, as the reported functions
save the number of the instance of the program in the rank variable, and the total number of
instances created in numtasks, respectively.
Then now it is possible to use point-to-point communication routines that are present in
many variants, like blocking, synchronous, non-blocking, buffered, and they are all described
by the following:
call MPI_SEND(buffer, quantity, type, destination, tag, MPI_COMM_WORLD, ierr)
call MPI_RECV(buffer, quantity, type, source, tag, MPI_COMM_WORLD, stat, ierr)
The syntax for the equivalent functions is similar:
buffer : represents either data that has be sent or the memory location in which it must be
saved;
70
Appendix D (Continued)
quantity : tells how much data of type is sent in the message;
type : sets one of the high MPI data types for the transfer;
destination/source : describes the number of the instance of the program that has to send
or receive the buffer;
tag : identifies the message number that must be sent or received;
MPI COMM WORLD : reads from the macro in which MPI configuration is saved;
stat : represents the status of the transfer;
ierr : is the error variable in case communication fails.
MPI allows also for collective communication (a sort of “multicasting”) by means of functions
such as:
• MPI BCAST() a message is sent to all the nodes;
• MPI SCATTER() a message is split and sent to all the nodes;
• MPI GATHER() a message is received from all the nodes;
• MPI ALLTOALL()Each task in a group performs a scatter operation, sending a distinct
message to all the tasks in the group
that require information about the data buffers of both the sender and the receiver.
In order to compile an MPI-enabled program, it is not possible to directly call the compiler,
but it is necessary to resort to the wrapper of the MPI distribution, which correctly set paths
and libraries; also for launching executables a special wrapper must be used with proper syntax.
71
Appendix D (Continued)
In case of OpenMPI, the MPI implementation selected for this project, the compiler is
called mpif90 while the launching wrapper is mpirun; this software must be called specifying
the number of instances of the program to run (-np) and the list of hosts that have to execute
it (-host). So for example in a two-machine cluster environment in which each node has to
execute an instance of the program, the correct syntax is
$ mpirun -np 2 -host host1,host2 program [args]
It is possible to share some environment variables among the nodes with the -x switch; this
is required for an OpenMP+MPI system as the number of threads depends on the value of
OMP NUM THREADS. The resulting command line instruction is:
$ mpirun -np 2 -host host1,host2 -x OMP NUM THREADS program [args]

CITED LITERATURE
1. Stallings, W.: Computer Organization & Architecture - Designing for Performance. Pear-
son - Prentice Hall, 2006.
2. Hennessy, J. L. and Patterson, D. A.: Computer Architecture: A Quantitative Approach.

Morgan Kaufmann, 1990.
3. Lu, J., Li, Y., Sun, C., and Yamada, S.: A parallel computation model for nonlinear
electromagnetic field analysis by harmonic balance finite element method. Tech-
nical Report 0-7803-2018-2, Faculty of Science and Technology, Griffith University
Australia and Faculty of Technology, Kanazawa University Japan, 1995.
4. Ito, F. and Amemiya, N.: Application of parallelized SOR method to electromagnetic field
analysis of superconductors. Technical Report 1051-8223/04, Faculty of Engineer-
ing, Yokohama National University, 2004.
5. Giuffrida, C., Gruosso, G., and Repetto, M.: Finite formulation of nonlinear magnesto-
statics with integral boundary conditions. Technical Report 0018-9464, Electrical
Engineering Department, Politecnico di Torino and Electronic and Information En-
gineering Department, Politecnico di MIlano, 2006.
6. Silberschatz, A., Galvin, P. B., and Gagne, G.: Operating System Concepts. Pearson
Education, 2006.
7. Barney, B.: OpenMP. Lawrence Livermore National Laboratory, https://computing.

llnl.gov/tutorials/openMP/.
8. McMichael, R. D.: µMAG – Micromagnetic Modeling Activity Group. Center for Theo-
retical and Computational Materials Science, http://www.ctcms.nist.gov/~rdm/
mumag.html.
9. Moser, J. R.: Optimizing linker load times. LWN.net - Your Linux info source, http:
//lwn.net/Articles/192082/, 2006.
10. Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., and Menon, R.: Parallel
Programming in OpenMP. Morgan Kaufmann Publishers, 2001.
72
73
CITED LITERATURE (Continued)
11. Dagum, L. and Menon, R.: OpenMP: An Industry Standard API fo Shared Memory
Programming. Computational Science and Engineering, 1998.
12. Gropp, W., Lusk, E., and Skjellum, A.: Using MPI - Portable Parallel Programming with
the Message-Passing Interface. Scientific and Engineering Computation Series. The
MIT Press, 1999.
13. Reinders, J.: VTuneTM Performanc Analyzer Essentials. Intel Press, 2007.
14. Stevens, W. R.: UNIX Network Programming: Networking APIs: Sockets and XTI. Pren-
tice Hall, 1998.
15. Butenhof, D. R.: Programming with POSIX

R Threads. Addison-Wesley Professional
Computing Series, 1997.
16. Shipman, G. M., Woodall, T. S., Graham, R. L., Maccabe, A. B., and Bridges, P. G.:
Infiniband scalability in Open MPI. Technical Report 1-4244-0054-6/06, Advanced
Computing Laboratory, Los Alamos National Laboratory and Dept. of Computer
Science, University of New Mexico, 2006.
17. Sur, S., Koop, M. J., and Panda, D. K.: High-performance and scalable MPI over In-
finiband with reduced memory usage: An in-depth performance analysis. Technical
Report 0-7695-2700-0/06, Department of Computer Science Engineering, Ohio State
University, 2006.
18. Quintero, D., Conrad, N., Desjarlais, R., Kahle, M.-E., Kim, J.-H., Nguyen, H.-N., Pir-
raglia, T., Pizzano, F., Simon, R., Yao, S. L., and Lascu, O.: Implementing
InfiniBand on IBM System p. IBM Redbooks, 2007.
19. Gray, A., Hein, J., and Booth, S.: Improved MPI with RDMA. Technical report, EPCC,
Univeristy of Edinburgh, June 2005.
20. T., U. and J., R. B. S.: Multithreaded processors. The Computer Journal, 3, 2002.
21. R., B.: High Performance Cluster Computing: Architectures and Systems. Prentice Hall,
1999.
22. R., B.: High Performance Cluster Computing: Programming and Applications. Prentice
Hall, 1999.
74
CITED LITERATURE (Continued)
23. Barney, B.: Message Passing Interface (MPI). Lawrence Livermore National Laboratory,
https://computing.llnl.gov/tutorials/mpi/.
24. Hablot, L., Gluck, O., Mignot, J.-C., Genaud, S., and Primet, P. V.-B.: Comparison and
tuning of MPI implementations in a grid context. Technical Report 1-4244-1388-5,
Laboratoir de l’Informatique du Parallelisme, Universite de Lyon, 2007.
VITA
NAME: Vittorio Giovara
EDUCATION: B.Sc. equiv., Computer Engineering,

Politecnico di Torino, Turin, Italy, 2007
M.Sc. equiv., Computer Engineering,

Politecnico di Torino, Turin, Italy, 2009, under the advising of
professors Bartolomeo Montrucchio and Carlo Ragusa
Master of Science in Electrical and Computer Engineering, University

of Illinois at Chicago, Chicago, Illinois, 2009, under the advising of
professors Bartolomeo Montrucchio and Zhichun Zhu
HONORS: PROFICIENCY Certificate in English, Cambridge University,

Turin, Italy, 2004
BTP certification, XX Winter Olympics, TOBO, Turin, Italy, 2006
TOP-UIC Fellowship, Politecnico di Torino, Turin, Italy, 2008
PROFESSIONAL: Project manager for GLE-MiPS, a VHDL description for

processor architecture, focusing on the educational implementation,
http://gle-mips.googlecode.com
Developer of Hedgewars, a strategy game, managing the Mac OS X

and iPhone versions, http://www.hedgewars.org
Editor and author for ProjectSymphony, a collection of academic

essays and homework reports publically available,
http://www.scribd.com/ProjectSymphony
75

Parallel and Distributed Programming On Low Latency Clusters

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Parallel and Distributed Programming On Low Latency Clusters

Uploaded by

Copyright:

Available Formats

Parallel and Distributed Programming on Low Latency Clusters

Submitted as a partial fulfillment of the requirements

without whose continuous love

and support I would never have made it.

suggestions during development.

Campione, who is an encouraging model for my studies.

I MAXIMUM DATA THROUGHPUT IN DIFFERENT CONFIGU-

II MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY . . . . . 34

III PARTIAL RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

1 Approach levels for parallelization . . . . . . . . . . . . . . . . . . . . . . . 3

2 Classification scheme of computer architecture classification . . . . . . . 5

3 Image showing the tree splitting procedure of a sequential task . . . . . 17

4 Graph plotting of the theoretical curve from Amdahl’s Law . . . . . . . 19

5 Graph plotting of Amdahl’s Law for multiprocessors . . . . . . . . . . . 20

6 Performance overview of an OpenMP threaded program . . . . . . . . . 23

7 OpenMP static scheduler performance chart . . . . . . . . . . . . . . . . 24

8 OpenMP dynamic scheduler performance chart . . . . . . . . . . . . . . . 25

9 OpenMP guided scheduler performance chart . . . . . . . . . . . . . . . . 26

10 OpenMP scheduler overview . . . . . . . . . . . . . . . . . . . . . . . . . . 27

11 Time v. size for a single message . . . . . . . . . . . . . . . . . . . . . . . 32

12 Time v. size for 1024 consecutive messages . . . . . . . . . . . . . . . . . 33

13 Flowchart of the main functions implementated in the code . . . . . . . 38

14 Standard problem #4 representation . . . . . . . . . . . . . . . . . . . . . 39

15 S state field representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

16 Call graph scheme of the target software . . . . . . . . . . . . . . . . . . . 42

17 Implementation scheme overview . . . . . . . . . . . . . . . . . . . . . . . 46

API Application Programming Interface

SMP Symmetric Multi-Processing

OpenMP Open Multi-Processing

MPI Message Passing Interface

IPC Inter Process Communication

PML Point-to-point Messaging Layer

BTL Byte Transfer Layer

SISD Single Instruction Single Data

SIMD Single Instruction Multiple Data

MISD Multiple Instructions Single Data

MIMD Multiple Instructions Multiple Data

SPMP Single Program Multiple Data

MPMD Multiple Program Multiple Data

SSE Streaming SIMD Extensions

SSSE3 Supplemental Streaming SIMD Extensions 3

UMA Uniform Memory Access

NUMA Non-Uniform Memory Access

GPU Graphics Processing Units

GPGPU General-Purpose computing on Graphics Process-

ECC Error-Correcting Code

LLG Landau-Liftshitz-Gilbert equation

di Torino” by the Electrical Engineering department.

has been implemented.

supplied with 32 GB of RAM and a 20 Gb/s Infiniband network card.

1.1 Evolution of parallel and distributed systems

and the throughput was dependent on the processor speed.

Parallel computing is a simultaneous execution of operations at different levels: the most

level, exploiting instruction pipelines in processor architectures, loop-level, distributing data

distribution among the cores.

linear and array operations.

in which execution cannot continue due to resource dependency conflicts.

However complete automatic parallelization is a very complex operation requiring computa-

compiler optimizations; instead of translating a loop into a sequence of operations followed by

directives has a major effect on parallelization and program throughput.

different stages, each requiring an action of different difficulty.

Figure 1. Approach levels for parallelization

1.2 Computer architecture classification