BY
VITTORIO GIOVARA
B.Sc. (Politecnico di Torino) 2007
THESIS
Chicago, Illinois
To my mother,
ACKNOWLEDGMENTS
I want to thank all my family, my mother Silvana, my grandmother Nenna and my dear Tanino, who help and support me with love every day of my life.
Then I would like to thank all the faculty members that assisted me with this project, in particular professor Bartolomeo Montrucchio and professor Carlo Ragusa for all the time spent with me trying to make the software run, and researcher Fabio Freschi for giving me useful advice.
Finally I would like to thank all my friends who were near me during these years, Alberto Grand, whose patience and kindness towards me are really extraordinary, and Salvatore.
V. G.
TABLE OF CONTENTS

1 INTRODUCTION
1.1 Evolution of parallel and distributed systems
1.2 Computer architecture classification
1.3 Thesis Contents
2 BACKGROUND
2.1 Parallel and distributed application developing
2.2 Technological requirements
2.2.1 SMP processors
2.2.2 Multithreading
2.2.3 GPGPU computing
2.2.4 NUMA machines
2.2.5 Clusters
2.3 Scientific software advance
3 TECHNOLOGY
3.1 Parallel applications with OpenMP
3.1.1 Amdahl's Law
3.1.2 Benchmarking
3.1.2.1 Sequential program with OpenMP enhancements
3.1.2.2 OpenMP schedulers performance
3.1.2.2.1 Static Scheduler
3.1.2.2.2 Dynamic Scheduler
3.1.2.2.3 Guided Scheduler
3.1.2.3 OpenMP enhancement results
3.2 Infiniband
3.3 Distributed execution with MPI
3.3.1 MPI over Infiniband
3.3.2 Benchmarks
3.3.2.1 Single message over Infiniband with MPI
3.3.2.2 Multiple messages over Infiniband with MPI
3.3.2.3 Latency
4 ALGORITHM
4.1 Overview
4.2 Code Flowchart
4.3 Test Case
4.4 Profiling
4.5 Compiler optimizations
4.5.1 Native switch
4.5.2 Loop unrolling
4.5.3 IEEE compliance
4.5.4 Library Striping
5 IMPLEMENTATION
5.1 General Scheme
5.2 Hardware Support
5.3 Applied Directives
5.3.1 MPI Layer
5.3.2 DO directive
5.3.3 REDUCTION directive
5.3.4 Avoiding data dependency
5.4 Results
5.4.1 Reduced test case
5.4.2 Final test case
6 CONCLUSION
APPENDICES
Appendix A
Appendix B
Appendix C
Appendix D
CITED LITERATURE
VITA
LIST OF TABLES

IV FINAL RESULTS
V FUNCTION RESULTS
SUMMARY
The goal of this thesis is to increase the performance and data throughput of Sally3D, an electromagnetic field analyzer and micromagnetic modeler for nanomagnets developed at Politecnico di Torino.
This target has been achieved by means of open standards, such as OpenMP and MPI, that offer a robust parallel programming paradigm and an efficient message passing API; in order to reduce the latency of message passing between the two machines, a point-to-point Infiniband link has been set up.
Results will be provided, showing that it is possible to achieve an 80% speed improvement thanks to optimized code, OpenMP multithreading and MPI communication. The hardware used consists of two computers, each with two quad-core Intel Xeon processors running at 2.5 GHz, connected by the Infiniband link.
CHAPTER 1
INTRODUCTION
Until some decades ago computer applications were written in a sequential style, in which the instructions were executed in a fixed order and the programs relied on a single processing unit.
Nowadays, however, the technological trend is to limit processor frequency and voltage in order to consume less power and generate less heat, and on such architectures sequential programming is no longer effective. For this reason a new execution paradigm has been exploited:
parallel programming.
The most widely used forms of parallelism are bit-level, augmenting the bit size of words, instruction- and loop-level, distributing independent instructions in a loop among different cores, and task-level, using complete threads that run in parallel.
In order to be able to use parallel applications, hardware support must be present. There are many kinds of parallel-oriented computers: multi-core, a single processor with many processing units; symmetric multiprocessing, a machine with more than one (multicore) processor; cluster and grid computing, closely coupled computers connected with high-end networks; and finally graphics processing units, which are used for general purpose computation and are well suited for highly data-parallel workloads.
On the other hand, parallel applications bring some drawbacks at different levels: manually programming threads and concurrent processes is a difficult task, as data dependencies must be carefully handled, and poor programming styles may lead to performance degradation. Moreover, a parallel environment introduces several problems, such as deadlock and starvation, that do not exist in sequential programs.
Subsequently there has been an increasing research effort to circumvent the difficulties of parallel programming, trying to achieve automatic parallelization in the compiler. However, fully automatic parallelization requires an amount of computational power that has not yet been reached; for this reason several other approaches have been proposed.
A quite simple and somewhat effective technique is loop unrolling, activated by proper compiler switches: by replicating the body of the loop instead of performing a jump, the cycle is transformed into a mostly sequential block of code, preventing a lot of jumps and processor flushes. This is quite beneficial for pipelined processors, which present a high overhead for jump operations, but the code size increases proportionally to the dimension of the loop and unrolling very large cycles still has a prohibitive cost.
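As an illustration of the technique (a generic sketch, not taken from the Sally3D sources), the rolled and manually unrolled versions of a simple FORTRAN loop could look like the following, where the unrolled variant performs four updates for every end-of-loop jump:

      PROGRAM UNROLL
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1024
      DOUBLE PRECISION :: X(N), Y(N), A
      INTEGER :: I
      A = 2.0D0
      X = 1.0D0
      Y = 0.0D0
!     rolled version: one multiply-add, one loop test and one jump
!     per iteration
      DO I = 1, N
         Y(I) = Y(I) + A * X(I)
      ENDDO
!     manually unrolled by a factor of 4: the end-of-loop jump is taken
!     once every four elements (N is assumed to be a multiple of 4)
      DO I = 1, N, 4
         Y(I)   = Y(I)   + A * X(I)
         Y(I+1) = Y(I+1) + A * X(I+1)
         Y(I+2) = Y(I+2) + A * X(I+2)
         Y(I+3) = Y(I+3) + A * X(I+3)
      ENDDO
      PRINT *, Y(N)
      END PROGRAM UNROLL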
A more effective way was introduced a few years ago in which the programmers could insert
hints as compiler directives: in this way it is possible to define sections of code that can be safely
parallelized, exploiting the full capabilities of multicore processors. The interaction level in this
methodology is more advanced with respect to loop unrolling, as it requires deeper knowledge of the program and of the dependencies between variables; however, even a limited insertion of compiler directives can bring a noticeable benefit.
The next figure (Figure 1) shows different parallelization methodologies and their level of in-depth approach; as it may seem obvious, full parallelization is achieved only when it is set up as a goal during program design, but it is possible to adapt the project during development at a later stage.
As soon as parallel computation theory began to gain popularity, there was a shift in computer architecture design and a precise classification was needed. Starting from a single processor model that operates on a single data stream, it became possible to consider single or multiple instruction streams combined with single or multiple data streams, leading to the following classification.
SISD computers are traditional machines with a single processor operating on a single instruction (or data) stream, often stored in a single memory. This is the oldest architecture design and was the leading model in computer markets until a decade ago, when the first multi-core processors appeared.
SIMD is the general modern architecture commonly found in current processors in the form of SSE, Altivec and VIS (Visual Instruction Set, present in SPARC processors) instructions, among others; most recently GPUs have started to expose the same model for general purpose computation. Multimedia applications are the prime beneficiaries of this approach, as well as cryptography and data compression.
MISD architecture is an uncommon one, as there is no performance benefit from this design, but it is often found in mission critical applications, in which a dependable system must be developed. As a matter of fact, operating on single data with multiple identical instructions may lead to error detection and error correction by means of hardware and time redundancy.
MIMD systems are suited for computer clusters, in which a shared or a distributed memory is employed, because at any time the computers may be executing different instructions on different data. There are further sub-classifications of the MIMD class, based on the concept of how the program is distributed among the nodes:
SPMD multiple processors execute the same program at the same time, but at independent points of its control flow;
MPMD implementation of a client/server model in which a master feeds the other nodes with data and coordinates the workload distribution, so each node executes a different set of instructions.
This thesis describes how to make use of such parallelization directives in a distributed and parallel environment. For this reason a MIMD system will be exploited.
The program consists of an equation solver, written in the FORTRAN language and suited for electromagnetic field analysis, with high resolution plotting. Since the program is already provided, it is not possible to abstract to a very high level methodology; for this reason the technology selected for parallelization is OpenMP, which offers a set of compiler directives to extend sequential sections of code to every core of the machine.
As for the distributed part of the algorithm, two technologies have been adopted: MPI and Infiniband. MPI is a high level API for performing Inter Process Communication on the same machine or on different nodes, available for many different programming languages (even for those which do not have IPC mechanisms of their own). Infiniband, on the other hand, was chosen for its outstanding performance in sending small quantities of data with very little latency.
After this introduction, the document will present a general background and previous work in the field, followed by a description of the technologies used in this research. Then the main algorithm of the program will be outlined, showing the critical points in which a possible performance increase may be achieved through parallelization or distribution; finally some results will be presented, tracing the throughput improvements obtained.
CHAPTER 2
BACKGROUND
Historically, parallel and distributed computing has been considered to be "the high end of computing", and has been used to model difficult scientific and engineering problems found in many disciplines, for example:
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics;
• Geology, Seismology;
– Oil exploration;
– Pharmaceutical design;
As the demand for performance increases and the cost of microprocessors continues to drop, the single processor model has been abandoned in favor of an SMP organization. An SMP machine contains two or more identical processors that share the same main memory and I/O resources under a single operating system instance. Operating system support is necessary for enabling this feature. Moreover programs have to be rewritten, or at least reconsidered, in order to access every resource available. For this reason there has been a continuous improvement to compiler software, trying to simplify program development for such machines. The advantages of an SMP organization are:
1. Performance – the workload can be spread among more processors, running different tasks in parallel; moreover interrupt management can affect only one processor at a time, avoiding global slowdowns up to a certain extent;
2. Availability – the same process can be run on all the symmetric processors, so the system is able to sustain hardware failures (a sort of MISD architecture);
3. Scaling – vendors can offer more systems with different SMP configurations;
4. Transparency – the operating system hides SMP management from the user.
2.2.2 Multithreading
With multithreading, the processor keeps the state of more than a single thread of the program in memory. Once again, it is necessary to enable this feature both in the operating system and in the application. Several forms of multithreading exist:
interleaved multithreading (fine-grained) at every clock cycle the processor switches execution from one thread to another, unless one is not ready (blocked for data dependency or memory latency);
blocked multithreading (coarse-grained) instructions of a single thread are sequentially executed until an event causes a delay, such as a cache miss; in that case execution is switched to another thread;
simultaneous multithreading instructions from more than one thread are simultaneously executed, exploiting the intrinsic parallelism of the execution units of the processor;
chip multithreading one or more processors are simulated on the physical chip, each handling its own thread.
The Simultaneous Multithreading technique has been implemented in most modern proces-
sors as it has shown the best performance benefits in a variety of applications during testing.
GPGPU computing enables general purpose execution on the processors present in modern video cards (namely, GPUs). This methodology allows the GPU computing power, which is usually reserved for computer graphics, to be exploited for almost any kind of operation; since the graphics processing unit is composed of a large number of array processors, using a GPGPU programming language enables a stream-oriented, massively parallel execution model.
Applications that especially benefit from streaming execution are multimedia-related, such as digital signal processing (for audio/video or image manipulation), but there are many other computations that can be done with GPGPU. Moreover, older array-based software receives a positive impact from this rather new technology, in fields like cryptography, DNA folding, neural networks and medical imaging.
While general purpose processors adopt a uniform memory access (UMA) scheme, it is not uncommon to find systems whose access time is not uniform and depends on the relative position of memory and processor.
NUMA machines are usually physically distributed but logically shared, meaning that one
node can directly access memory of another node and that not all processors have equal access
time to all memories; a software layer is often needed to guarantee program access and workload
distribution.
Memory is mapped as a global address space, merging the memory of the linked SMP nodes; this feature simplifies programming, since any node can address any location directly. However there is a lack of scalability between memory and CPUs, because adding more CPUs can geometrically increase traffic on the shared memory-CPU path; moreover coherence must be maintained across the whole memory. One final disadvantage is that it is becoming increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
2.2.5 Clusters
It is possible to create large clusters that can by far outperform any standalone machine, with the advantage that it is relatively easy to add new components, even in small increments; both clusters and SMP systems provide a configuration for high performance applications, each with its own strengths. For example an SMP system is easier to manage and has fewer problems in running single-processor software, while clusters require an in-depth program revision, with load balancing and work distribution; on the other hand, though, clusters dominate the final performance outcome. Several categories of cluster can be identified:
High-availability clusters for improving the availability offered by the cluster itself; they usually exploit redundancy, so that when one node fails it can be immediately substituted by another one;
Load-balancing clusters with the primary purpose of distributing evenly the workload of a set of services among the nodes;
Compute clusters used for computational activity rather than services; nodes are tightly coupled, and usually programs can be easily ported to this environment through simple instruction changes;
Grid computing similar to compute clusters, but focused more on the final computational throughput rather than on workload distribution and tightly coupled jobs; computation consists of many independent jobs which do not have to share data during the computation process.
Using parallelization technologies such as OpenMP and MPI is not new in scientific software; as a matter of fact it is normal to find quite a number of projects that exploit them.
For example it is possible to cite the Folding@Home project, from the Stanford University's chemistry department, currently the most powerful distributed computing cluster, which is developed using an MPI layer between its nodes; or it is possible to find many entries from the TOP500 list (a project ranking and detailing the 500 most powerful known computer systems in the world), like Pleiades and Ranger, that use Infiniband as the connection link between their nodes.
As for electromagnetic field analyzers, there has been some previous work with OpenMP: (3) and (4) describe a possible implementation for hybrid solvers, but the addressed software has different solving and modeling routines. The proposed work doesn't rely on a standard FEM approach, but takes on a Finite Formulation of the nonlinear magneto-static algorithm.
CHAPTER 3
TECHNOLOGY
OpenMP is an application programming interface (API) that offers a set of compiler directives, library routines and environment variables to enable shared memory multiprocessing for the C, C++ and FORTRAN programming languages.
OpenMP stands for Open Multi-Processing and it is implemented in many open source and commercial compilers, like the Intel C++ and FORTRAN compilers (icc and ifort) and the GNU Compiler Collection (gcc). Among the key factors for its popularity there are the ease of handling threads and shared variables and the simplicity of porting programs to a multiprogramming scheme with very little code change; moreover OpenMP enables parallel execution control for languages that cannot usually handle multithreading and synchronization primitives, like, for instance, FORTRAN.
With this technology the main program forks a set number of parallel threads which carry
out a task, dividing the work load on different cores; by default every thread executes its section
of code independently. After execution of the parallel job, threads are then joined back in the
main (or master) thread, resuming normal sequential programming; in this way it is possible
to divide the sequence of program execution in a tree-like structure (as shown in Figure 3).
OpenMP exploits preprocessor directives for thread creation and synchronization, workload
distribution and sharing, data and function management, while retaining compatibility with
unsupported compilers. In order to prevent data corruption due to overlapping threads, all
variables of the parallel section must have a declared visibility scope, either shared or private.
One directive is particularly suited for loop parallelization, as it offers fine-grained control over the scheduling of the threads and over the distribution of the loop among the thread pool. Other directives directly manage thread interaction and synchronization objects (critical sections, barriers, atomic operations).
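A minimal, self-contained sketch of this loop-oriented usage (an illustration, not code taken from Sally3D) could be the following: the iterations are divided among the thread pool, the array is shared and the index is private to each thread.

      PROGRAM OMP_MINIMAL
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000000
      DOUBLE PRECISION :: V(N)
      INTEGER :: I
      V = 1.0D0
!     the iterations are split among the threads of the pool: V is
!     shared by all threads, every thread owns a private copy of I
!$OMP PARALLEL DO SHARED(V) PRIVATE(I)
      DO I = 1, N
         V(I) = SQRT(V(I) + DBLE(I))
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, V(N)
      END PROGRAM OMP_MINIMAL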
However, it is important to clarify that using OpenMP on an N processor machine does not
reduce the execution time by N. As a matter of fact there are a couple of reasons for this to
apply:
• Symmetric Multi Processor computers have increased computational power, but the memory bandwidth does not scale proportionally to the number of processors (or cores); performance degradation occurs especially when the shared memory bandwidth is saturated;
• synchronization overhead, critical region management, context switch costs and load balancing all introduce additional costs;
• the theoretical limit imposed by Amdahl's Law for parallel applications, which regulates the maximum achievable speedup.
Amdahl’s Law is a method used for finding the maximum speed improvement in parallel
computing environments. The speedup highly depends on the size of the parallelizable code
(6).
The formula states that the potential speedup of the program directly depends on the fraction P of code that can be made parallel:
\[ \text{speedup} = \frac{1}{1 - P} \qquad (3.1) \]
Basically if none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup),
if all of the code is parallelized, P = 1 and the speedup is infinite (in theory), if 50% of the
code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast, and
so on; the next figure (Figure 4) shows the theoretical speedup curve with infinite processors.
When the code has parts that cannot be parallelized, the relationship can be updated to
\[ \text{speedup} = \frac{1}{\frac{P}{N} + S} \qquad (3.2) \]
where N is the number of processors, P the portion of parallelizable code and S = 1 - P the portion of serial code.
The following figure (Figure 5) shows a set of examples with different parallelizable code over
a variable number of processors. It is possible to see not only that a 95% parallelizable program
has a maximum speed improvement on the order of 20x notwithstanding the high number of
processors available, but also that a highly sequential program cannot achieve any acceleration
whatsoever.
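As a worked example of Equation 3.2 (with numbers chosen here only for illustration): a program that is 95% parallelizable (P = 0.95, S = 0.05) running on the N = 8 cores of one of the machines used in this work is bounded by
\[ \text{speedup} = \frac{1}{\frac{0.95}{8} + 0.05} = \frac{1}{0.16875} \approx 5.9, \]
while letting N grow without limit only raises the bound to 1/0.05 = 20, the asymptote visible in Figure 5.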
3.1.2 Benchmarking
In order to understand the possible benefit from using OpenMP, some tests have been run, targeting the best possible configuration for the number of threads and the chunk size. A simple test program was used, with a complex and long loop containing some processor intensive operations (mainly mathematical operations like power and square root). The particular case of an "interesting" loop has been chosen because it reproduces, with enough simplicity, the behavior of the loops found in the target software.
The two main configuration variables that characterized the benchmarks were the scheduler type and the chunk size, plus the total number of threads involved in the program. The chunk size is a positive integer value representing the number of iterations each thread has to manage; the available schedulers treat it as follows:
STATIC loop iterations are divided into chunks of the given size and statically assigned to the threads before the loop starts;
DYNAMIC loop iterations are divided into chunks of the given size, but they are dynamically assigned to the threads as each one completes its previous chunk;
GUIDED the chunk size is rearranged proportionally to the number of remaining iterations, allowing unassigned iterations to be distributed in progressively smaller blocks.
Other types of scheduler are auto and runtime, in which one of the above schedulers is selected according to the CPU load and the environment set up. As can be foreseen, guided-scheduled threads work best with very small chunk sizes (with respect to the total number of iterations), as the scheduling algorithm is more efficient when it can control the pool of threads as a whole, while the static and dynamic scheduling prefer a medium chunk size value.
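The following self-contained fragment (an illustration, not the benchmark source reported in Appendix B) shows how the three schedulers are requested: only the SCHEDULE clause changes between the three loops, with CHUNK playing the role of the chunk size discussed above.

      PROGRAM SCHED_DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: NITER = 100000, CHUNK = 10
      DOUBLE PRECISION :: W(NITER)
      INTEGER :: I
      W = 0.0D0
!     chunks assigned to the threads before the loop starts
!$OMP PARALLEL DO SCHEDULE(STATIC,CHUNK) SHARED(W) PRIVATE(I)
      DO I = 1, NITER
         W(I) = SQRT(DBLE(I))
      ENDDO
!$OMP END PARALLEL DO
!     chunks handed out as each thread becomes free
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,CHUNK) SHARED(W) PRIVATE(I)
      DO I = 1, NITER
         W(I) = W(I) + DBLE(I) ** 0.25D0
      ENDDO
!$OMP END PARALLEL DO
!     chunk size shrinks with the remaining iterations, never below CHUNK
!$OMP PARALLEL DO SCHEDULE(GUIDED,CHUNK) SHARED(W) PRIVATE(I)
      DO I = 1, NITER
         W(I) = W(I) + SQRT(W(I))
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, W(NITER)
      END PROGRAM SCHED_DEMO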
Beware that setting a static number of threads may reduce the total performance of the application; as a matter of fact the thread number in the main program has been left to the value of the OMP_NUM_THREADS environment variable.
The test program partially emulates some computationally intensive routines of the target
software; the main loop is composed of several mathematical functions that are known to stress
the processor and require a long CPU time to be carried out. The code is reported in Appendix B.
In this first test the program is run with an increasingly higher number of threads available, also going beyond the eight physical cores actually present. All three scheduling algorithms are evaluated. The value of the first column (one thread) may be safely considered as the sequential reference time.
It is possible to see that there is a huge impact when adding a second thread (50% time reduction) and that the execution time then asymptotically tends to a given value, fully respecting Amdahl's Law. It is interesting to notice that the three schedulers perform in the same range of values and that the
best performance is achieved in the region of 8-9 threads (given the eight-core machines used).
After this value all the schedulers, static and the dynamic in particular, suffer from excessive
context switches and interference from the operating system preemption mechanism.
Having evaluated the performance with different numbers of threads, the three types of available schedulers are now compared; moreover, for each scheduler, different orders of magnitude of the chunk size are tested.
The static scheduler works as expected (Figure 7), showing a very good performance increase in the region of 7-8 threads with 10-100 as chunk value. It is interesting to notice that for very high chunk sizes OpenMP can't reduce the execution time, and this holds for every type of scheduler; the reason for this behavior resides in how OpenMP manages iterations – all iterations of the loop are assigned to a single thread and therefore there is no benefit.
Because of its dynamic behavior, the dynamic scheduler shows very peculiar results with different configurations. For example, as shown in Figure 8, some combinations of high chunk values and few threads even present an additional overhead, while very small chunks cannot keep every thread busy. Even with this disparity, however, the best execution time reduction is achieved in the region
The final scheduler presented here is the most straightforward and the best performing,
thanks to the more advanced algorithm of the guided scheduling. As a matter of fact for a
chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations
divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than
1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer). As anticipated, this algorithm works best with very small chunks, as it can apply its adaptive sizing over the whole iteration space.
This last section summarizes the global results from the point of view of the scheduler. As reference value, the maximum execution time reduction has been selected for each chunk size of each scheduling algorithm; all these results come from the 7-9 threads region.
The test run shows that the scheduler that performed best is the guided scheduler with a chunk size in the order of the units, and for this reason it has been chosen as the default scheduler for the rest of this work.
3.2 Infiniband
Infiniband is the union of two competing transport designs, Next Generation I/O from Intel, Microsoft and Sun, and Future I/O from Compaq, IBM and Hewlett-Packard. It has become the de facto standard for high speed cluster interconnection, outperforming Ethernet in both latency and throughput. The physical link is a bidirectional serial connection, supporting several signaling rates. It is used in high-performance computing both for high-speed connection between processors and peripherals and for low-latency networking.
The standard (single data rate) transmission speed is 2.5 Gbit/s, but double and quad data rates currently achieve 5 Gbit/s and 10 Gbit/s respectively. Moreover it is possible to aggregate links in units of 4 or 12 lanes, enabling even higher transfer speeds (up to 120 Gbit/s). However it is important to note that fault prevention for transmitted data is adopted using information redundancy: every 10 bits sent carry only 8 bits of useful information, reducing the useful data transmission rate to 80% of the raw signaling rate, as summarized in Table I.
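As a worked example of this encoding overhead: a 4X link at double data rate signals at 20 Gbit/s, so it carries at most 20 × 8/10 = 16 Gbit/s of useful data, which is the value reported in Table I.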
Most notably, there is no standard programming interface for the device: only a set of functions (referred to as verbs) must be present, leaving the implementation to the vendors. At the transport layer there are many protocols that can run on Infiniband, from TCP/IP to OpenIB.
TABLE I

useful data    Single Data Rate   Double Data Rate   Quad Data Rate
1X             2 Gbit/s           4 Gbit/s           8 Gbit/s
4X             8 Gbit/s           16 Gbit/s          32 Gbit/s
12X            24 Gbit/s          48 Gbit/s          96 Gbit/s

raw data       Single Data Rate   Double Data Rate   Quad Data Rate
1X             2.5 Gbit/s         5 Gbit/s           10 Gbit/s
4X             10 Gbit/s          20 Gbit/s          40 Gbit/s
12X            30 Gbit/s          60 Gbit/s          120 Gbit/s
MPI is a high level language-independent API used both for parallel computing and for one-
to-one, one-to-many and many-to-many inter process communication (IPC). It has become the
de facto standard for process communication despite not being sponsored by any standards association.
Originally it was developed by William Gropp and Ewing Lusk among others.
This set of APIs is used in high-performance computing for its scalability, portability and performance, as it implements message passing across distributed memory systems with very few directives. It usually resides at level 5 of the OSI model but, as there is no strict constraint on this point, there are many implementations that offer different transport, network and data link layers.
MPI is available for many programming languages including C, C++, FORTRAN and Java; sometimes implementations benefit from the language they are bound to, for example using object-oriented programming in C++ and Java, and from the hardware they run on. Among the most widespread libraries are OpenMPI, MPICH2 and MVAPICH2, which differ only in threading support, network availability (e.g. Ethernet or Infiniband) and hardware optimizations.
One of the most widely used environments for MPI is Infiniband; as a matter of fact, thanks to Infiniband's low latency, a small packet sent through the connection link doesn't present a major overhead with respect to Ethernet, for example. In order to set up a distributed system of this kind, additional software is needed for managing the Infiniband subnet (OpenSM) and an MPI implementation built with Infiniband support.
MPI and Infiniband modularity allow different configurations, and it is common to transmit packets with either Infiniband or a TCP/IP stack. This is possible because the transport
layer of MPI is handled by two routines (among others): the Point-to-Point Messaging Layer
and the Byte Transfer Layer. The PML abstracts the communication mechanism with buffers,
synchronization points and acknowledge messages; the BTL on the other hand translates the
byte messages into the network layer byte sequence – OpenIB is a BTL protocol for sending
messages on Infiniband.
Subsequently the functions (or verbs) available in the Infiniband drivers are invoked and
control is moved from user space to kernel space, where the message is finally sent across the
network link.
This seemingly complex structure allows the developer to reduce code complexity and increase interoperability.
3.3.2 Benchmarks
As was done with OpenMP, some tests were also performed on the MPI installation and on the Infiniband structure, to check that the machine configuration was correct and that the devices were running at full speed. The program makes heavy use of the MPI_SEND and MPI_RECV directives and utilizes a timing function with a resolution of milliseconds. It has been noticed that a warm-up phase (exchanging some messages between the nodes) is necessary before any measurement is done, because the whole structure of MPI plus Infiniband must be activated.
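The following self-contained program is a sketch of this kind of test, not the actual benchmark source: the buffer size, iteration count and the use of MPI_WTIME (instead of the millisecond-resolution timer mentioned above) are illustrative assumptions. Two nodes bounce a fixed-size buffer back and forth, with the first exchanges acting as the warm-up phase.

      PROGRAM PINGPONG
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER, PARAMETER :: N = 131072
      DOUBLE PRECISION :: BUF(N), T0, T1
      INTEGER :: RANK, PEER, IERR, K
      INTEGER :: STAT(MPI_STATUS_SIZE)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      PEER = 1 - RANK
      BUF = 1.0D0
      DO K = 1, 110
!        the first ten exchanges only warm up the MPI/Infiniband stack
!        and are excluded from the measured interval
         IF (K .EQ. 11) T0 = MPI_WTIME()
         IF (RANK .EQ. 0) THEN
            CALL MPI_SEND(BUF, N, MPI_DOUBLE_PRECISION, PEER, 1,
     +                    MPI_COMM_WORLD, IERR)
            CALL MPI_RECV(BUF, N, MPI_DOUBLE_PRECISION, PEER, 1,
     +                    MPI_COMM_WORLD, STAT, IERR)
         ELSE
            CALL MPI_RECV(BUF, N, MPI_DOUBLE_PRECISION, PEER, 1,
     +                    MPI_COMM_WORLD, STAT, IERR)
            CALL MPI_SEND(BUF, N, MPI_DOUBLE_PRECISION, PEER, 1,
     +                    MPI_COMM_WORLD, IERR)
         ENDIF
      ENDDO
      T1 = MPI_WTIME()
      IF (RANK .EQ. 0) PRINT *, 'mean round trip [s]:', (T1-T0)/100.D0
      CALL MPI_FINALIZE(IERR)
      END PROGRAM PINGPONG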
In this test the transfer time of messages over Infiniband with MPI directives is evaluated; the message size is increased progressively and time is measured with millisecond precision. The measured data is shown in the corresponding figure.
Two different MPI implementations are compared, and it is possible to notice that OpenMPI outperforms MVAPICH for small and large quantities of data, but it is slower for medium-sized messages. With MVAPICH it is not possible to send data over 2 GB, due to implementation limits; OpenMPI doesn't suffer from this behavior, but on the other hand it has a start-up latency of about 3.5 seconds before the program starts executing (and this is not recorded in this test).
Other MPI implementations exist, most notably MPICH and LAM/MPI, from which MVAPICH and OpenMPI respectively derive, but they lack support for Infiniband; any packet would have to go through the TCP/IP stack instead of the native verbs.
Using the same structure as above, the transfer time versus message size is tested here with multiple messages (1024 messages exchanged for each tested size). The results are similar to the previous case.
3.3.2.3 Latency
One final test has been run to determine the expected latency in message passing; this has been achieved by sending a 0-length packet using some of the data types available in MPI. However, due to the modularity of the MPI over Infiniband structure, the MPI initialization overhead must be removed: for this reason the same test has been repeated both on a single machine and across the Infiniband link between the two nodes.
The latency value measured with this method is 8 µs, which is compatible with the values expected for Infiniband hardware.
CHAPTER 4
ALGORITHM
4.1 Overview
The target application is a suite of programs called Sally3D; it has been ported from a VMS system to standard FORTRAN, with a standard makefile instead of terminal scripts driving the build. The software is designed for electromagnetic field analysis and micromagnetic modeling of nanomagnets.
The effective field takes phenomenologically into account the interactions occurring in magnetic materials, both short-range (exchange, anisotropy) and long-range (magnetostatic) interactions. The resulting magnetization dynamics is described by the Landau-Lifshitz-Gilbert (LLG) equation:
\[ \frac{\partial m}{\partial t} = -\,m \times h_{\mathrm{eff}}[m] + \alpha\, m \times \frac{\partial m}{\partial t}, \qquad (4.1) \]
where $m = m(r, t)$ is the magnetization vector field normalized to the saturation magnetization $M_s$, time is measured in units of $(\gamma M_s)^{-1}$ ($\gamma$ is the absolute value of the gyromagnetic ratio), $\alpha$ is the dimensionless damping parameter, and $h_{\mathrm{eff}}[m(r, t)]$ is the effective field operator,
which can be obtained by the variational derivative of the free energy functional:
\[ h_{\mathrm{eff}}[m] = -\,\frac{\delta g_L[m]}{\delta m}, \qquad (4.2) \]
where
\[ g_L[m] = \frac{1}{V_\Omega} \int_\Omega \left[ \frac{l_{ex}^2}{2}\,|\nabla m|^2 - \frac{1}{2}\, h_m \cdot m + \varphi(m) - h_a \cdot m \right] dV, \qquad (4.3) \]
$\varphi(m)$ is the anisotropy energy density and $l_{ex} = \sqrt{2A/(\mu_0 M_s^2)}$ is the exchange length ($A$ is the exchange constant and $\mu_0$ the vacuum permeability); $h_m$ and $h_a$ are the demagnetizing and applied fields, the former depending on the magnetization distribution over the body volume and on the body surface. In order to obtain a spatially discretized version of Equation 4.1, a partition
of the region $\Omega$ into $N$ cells $\Omega_k$, with volume $V_k$, is considered, and it is assumed that the cells are small enough that the vector fields $m(r, t)$ and $h_{\mathrm{eff}}[m(r, t)]$ can be considered spatially uniform within each cell. The symbols $m_k(t)$ and $h_{\mathrm{eff},k}$ denote the vectors associated with the generic k-th cell. Besides the cell vectors, the mesh vector $m = (m_1, \ldots, m_N)^T \in \mathbb{R}^{3N}$, containing the whole collection of cell magnetizations, is introduced.
Now it is possible to write down the discretized LLG equation in the following form, holding for each cell:
\[ \frac{dm_k}{dt} = -\,m_k \times h_{\mathrm{eff},k}[m] + \alpha\, m_k \times \frac{dm_k}{dt}, \qquad (4.4) \]
where $m_k$ is the average magnetization of the k-th cell. It is worth noting that the effective field in the k-th cell depends on the magnetization of the whole cell collection due to the magnetostatic interaction, namely $h_{\mathrm{eff},k} = h_{\mathrm{eff},k}[m]$. The numerical solution of Equation 4.4 is the task of the micromagnetic solver kernel, described next.
The kernel of the micromagnetic solver integrates over time the LLG equation discretized
with respect to space. At every time step, the next value of the magnetic vector is computed
by collecting the different finite elements of the magnetic field; this operation is performed by
the GILBERT routine, whose structure is reported in Figure 13. The equation is a non-linear differential equation and must be integrated numerically.
the next figure) implements the magnetostatic and anisotropic field solvers; also the part that
combines together the different field elements has been updated with OpenMP and MPI direc-
tives. This development scheme has been chosen on the grounds that the real computational
bottleneck lies mainly in the magnetostatic solver and partially in the anisotropic solver.
In order to carefully analyze the performance of the program and to identify the possible parallelization points, as well as to obtain useful data, a particular test was prepared. The test case is the fourth standard problem of micromagnetics, proposed by Bob McMichael, Roger Koch and others within the µMAG group.
Quoting (8), the problem focuses on dynamic aspects of micromagnetic computations. The initial state is an equilibrium s-state (Figure 15), which is obtained after applying and slowly reducing a saturating field along the [1,1,1] direction to zero. Fields of magnitude sufficient to reverse the magnetization of the rectangle are applied to this initial state and the time evolution of the magnetization is examined as the system moves towards equilibrium in the new fields.
At t = 0 one field will be applied to the equilibrium s-state: the field has components µ0Hx = -24.6 mT, µ0Hy = 4.3 mT, µ0Hz = 0.0 mT (corresponding to a magnitude of approximately 25 mT applied at 170 degrees from the positive x axis).
The problem was chosen so that resolving the dynamics should be easier for the 170 degree
applied field than for the 190 degree applied field. Preliminary simulations reveal that, in the
case of the field applied at 170 degrees, the magnetization in the center of the rectangle rotates
in the same direction as at the ends during reversal. In the 190 degree case, however, the center
rotates in the opposite direction to the ends, resulting in a more complicated reversal. The field
amplitudes were chosen to be about 1.5 times the coercivity in each case.
4.4 Profiling
Thanks to the standardization of the program code, it was possible to exploit the gprof utility, available in the gcc suite. This utility makes it possible to obtain procedure level timing information with reasonable resolution, as well as a complete call graph view for identifying the most expensive routines.
According to the profiler, whose call graph is reported in Figure 16, the following routines dominate the execution time:
• calc_intmudua
• calc_mudua
• campo_effettivo
Most of the software is composed of very small routines that are called with very high frequency, and are thus very difficult to optimize and to measure (in fact they are not even reported in the profiler output); only the noted functions have an observable impact on the overall execution
time.
Once again, due to the porting operation that has been performed, several compiler opti-
mizations became available and were subsequently added in order to increase the throughput
of the program. Most of the additions have been chosen following the official gcc documentation.
The key optimization relies on the native machine capabilities: in order to activate at once all the features of a given architecture and of a given processor it is sufficient to set -march=native. In this way all processor specific instructions can be accessed and all floating point capabilities fully exploited, setting the right processor architecture and the available SSE flags. Moreover the floating point instructions are specifically set to use any available SSE extension.
A similar optimization is achieved in the Intel FORTRAN Compiler with the -axS -xS switches.
Among the loop transformation techniques, loop unrolling has achieved wide success in compiler theory. Its goal is to increase the execution speed of the program at the expense of size. Loop unrolling is performed by reducing (if not eliminating) the number of "end of loop" tests; in this way the number of jumps and of conditional branches decreases and, with sufficiently large caches, the bigger code size does not penalize the cache hit rate.
Due to the highly mathematical nature of the software, the -ffast-math flag has been added: this flag activates a set of optimizations that allow some general speedups by discarding some error return codes and by skipping some redundant operations (like tracking the sign of zero).
The main drawback of this optimization is that it is no longer possible to guarantee compliance with the IEEE, ISO and ANSI rules that specify arithmetic compatibility, exceptions and operand ordering.
One final type of optimization has been inserted at linking time. The following options try to decrease the load time for library functions, modifying the executable header (ELF in this context) and the symbol handling (9). These options must be passed through the -Wl flag, so that the compiler driver forwards them to the linker.
More specifically, the -Wl,-O1 switch works as follows: as symbols get inserted in the ELF header they are stored in hash tables, and the default configuration keeps the hash keys small, resolving collisions through string comparisons. This optimization favors short hash chains instead, increasing the hash key length and the header size, but actually reducing the cost of symbol look-ups.
CHAPTER 5
IMPLEMENTATION
Analyzing the functions listed in Section 4.4 over several profiling sessions, a common pattern has been found.
As a matter of fact, every function contained one or more loops, carrying quite a number of instructions over arrays and matrices. For this reason a general plan has been decided and applied to each of them.
As a first step, the standard sequential loop is parallelized to fully exploit all the eight cores each single machine can offer. By setting up proper shared/private variable lists, the loop is divided among a given number of OpenMP threads and each one carries out a portion of the iterations; as soon as a thread ends, a new chunk is assigned to it, until the whole iteration space has been covered.
The second step in this strategy is to split the loop into two distinct and equal parts before exploiting
OpenMP. Each part is submitted to a node of the cluster and separately executed; at the end of
the loop data is exchanged back with MPI and merged so that the two machines can continue
working on complete arrays. Thanks to Infiniband, latency for exchanged data sets is reduced
to a minimum.
Even though OpenMP requires few software modifications, some updates have been carried out in order to obtain the maximum possible throughput, mainly reducing data dependencies among iterations.
It should be noted, however, that the software is not embarrassingly parallel; as a matter of fact a number of modifications to the software were needed in order to apply parallelization and distributed computing. The synchronization object mostly used is the implicit blocking offered
by the send() and recv() mechanism; since data is exchanged between the two machines in
the same manner, until either of them is ready to process data, the other cannot continue.
In other sections of the code, synchronization was achieved by native OpenMP directives, as
shown in 5.3.4.
The hardware selected for implementing the cluster consists of two computers, each supplied
with:
• two quad-core Intel Xeon E5420 processors running at 2.5 GHz, with 6 MB of L2 cache;
• one Mellanox Infiniband card, model ConnectX IB MHGH28-XTC DDR HCA PCI-e.
The two machines are connected together with an end-to-end Infiniband link, running at
full speed as the cards are mounted on the PCI Express x16 v1.1 slot. The focus for building
these computers has been to search for low-cost components that could enable high performance results.
In this section some example code has been extracted from the source of the program and
explained.
The following sections of code show some sample "header" and "epilogue" MPI fragments that enable splitting the array and merging it back. The header part analyzes the rank variable, which differs for every node of the MPI cluster: inside the if clause the array range is defined by setting the start_INDEX and end_INDEX variables (which intuitively represent the range beginning and ending). So the first node works on the first half of the array and the second node on the second half.
Some preprocessor directives have been inserted in order to maintain compatibility on non-MPI systems.
#ifdef MPI_ENABLED
!     the rank test selects which half of the array this node works on
      if (rank .eq. 0) then
         start_INDEX = 1
         end_INDEX   = NEDGE/2
      else
         start_INDEX = ( NEDGE/2 ) + 1
         end_INDEX   = NEDGE
      endif
#else
      start_INDEX = 1
      end_INDEX   = NEDGE
#endif
DO M=start_INDEX,end_INDEX
[...]
So, after the loop has terminated, the array on which the iteration worked must be synchronized on both nodes; this is done with a couple of MPI_SEND and MPI_RECV instructions. The rank variable is checked again to tell which portion of the array must be exchanged.
#ifdef MPI_ENABLED
      tag = 1
      if (rank .eq. 0) then
         dest   = 1
         source = 1
      else
         dest   = 0
         source = 0
      endif
!     the MPI_SEND/MPI_RECV pair exchanging the two halves follows here
#endif
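A self-contained sketch of what such an epilogue amounts to is shown below; FIELD, NEDGE, RANK and TAG stand in for the actual Sally3D variables, and the calls are illustrative rather than a copy of the program source. Node 0 owns the first half of the array and node 1 the second half; each sends its own half and receives the other, and the send/receive ordering is chosen so that the two blocking calls cannot deadlock.

      SUBROUTINE MERGE_HALVES(FIELD, NEDGE, RANK, TAG)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: NEDGE, RANK, TAG, HALF, IERR
      INTEGER :: STAT(MPI_STATUS_SIZE)
      DOUBLE PRECISION :: FIELD(NEDGE)
      HALF = NEDGE / 2
      IF (RANK .EQ. 0) THEN
!        send the locally computed first half, then receive the second
         CALL MPI_SEND(FIELD(1), HALF, MPI_DOUBLE_PRECISION, 1,
     +                 TAG, MPI_COMM_WORLD, IERR)
         CALL MPI_RECV(FIELD(HALF+1), NEDGE-HALF,
     +                 MPI_DOUBLE_PRECISION, 1, TAG,
     +                 MPI_COMM_WORLD, STAT, IERR)
      ELSE
!        receive the first half, then send the locally computed second
         CALL MPI_RECV(FIELD(1), HALF, MPI_DOUBLE_PRECISION, 0,
     +                 TAG, MPI_COMM_WORLD, STAT, IERR)
         CALL MPI_SEND(FIELD(HALF+1), NEDGE-HALF,
     +                 MPI_DOUBLE_PRECISION, 0, TAG,
     +                 MPI_COMM_WORLD, IERR)
      ENDIF
      END SUBROUTINE MERGE_HALVES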
5.3.2 DO directive
The DO directive is the most common in this configuration. It requires a list of shared and
private variables: for the latter case, a new memory position is allocated for each thread.
!$OMP PARALLEL
!$OMP& PRIVATE(I,KH,KK,NPOS,IMAG,KCOMP)
!$OMP DO SCHEDULE(GUIDED)
      DO I=start_INDEX,end_INDEX
      [...]
      BINTMU(I)=BINTMU(I)-
     + IFAEXT(I,KH,2)*TM(NPOS+(IMAG-1)*3+KCOMP)*AMAG(IMAG,KCOMP)
      [...]
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
One of the possible benefits in parallelization comes from a mathematical property of addition and subtraction: since varying the order of the operations doesn't change the result, the REDUCTION directive allows loop instances to execute out of order and the final value to be computed at the end of the parallel region.
Without this directive the target variable could have suffered from various synchronization
problems, as reading and writing to a shared position doesn’t guarantee a correct result.
!$OMP PARALLEL
!$OMP& SHARED(H_DEMG,AMAG,VOLTET,NPNMAG)
!$OMP& PRIVATE(M,K,DOT)
!$OMP& REDUCTION(+:VOLUME)
!$OMP& REDUCTION(-:DEMG_ENE)
!$OMP DO SCHEDULE(GUIDED)
      DO M=1,NPNMAG
      DOT=0.D0
      DO K=1,3
      DOT=DOT+H_DEMG(M,K)*AMAG(M,K)
      ENDDO
      VOLUME=VOLUME+VOLTET(M)
      DEMG_ENE=DEMG_ENE-VOLTET(M)*DOT/2.D0
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
Unfortunately this option is available for non-array variables only, so it has been applied only a few times.
One of the main problems of OpenMP, and of parallel programming in general, is data dependency; it is usually resolved by modifying the algorithm structure or by means of synchronization objects.
In order to avoid protecting shared constructs with a synchronization directive, which could have negatively affected performance, an array with self data references has been converted into a matrix indexed by the working thread number; in this way every element update was automatically decoupled, as only one single thread can work on a given row at the same time.
#ifdef _OPENMP
INUM_TH = omp_get_num_threads()
#endif
[...]
DO L=1,6
LATO=(MCNT_E(L,ITET))
AUS=SIGN(1,LATO)*
LATO=ABS(LATO)
#ifdef _OPENMP
INUM = omp_get_thread_num()+1
#else
INUM = 1
#endif
ENDDO
[...]
!$OMP END DO
At the end of the operation, the original array is rebuilt with a simple loop on the number of threads:
      DO ILATO=1, NEDGE
#ifdef _OPENMP
      DO III=1, INUM_TH
#else
      III=1
#endif
!     the contribution stored for thread III is accumulated back into
!     the original array element ILATO here
#ifdef _OPENMP
      ENDDO
#endif
      ENDDO
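A compact, self-contained sketch of this pattern, with hypothetical names (the work matrix is capped at 16 threads here), is the following: each thread accumulates into its own column of a work matrix, even when the written index does not follow the loop index, and the original array is rebuilt afterwards by summing over the thread index, so no synchronization is needed inside the loop.

      PROGRAM THREAD_ROWS
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: NEDGE = 1000, MAXTH = 16
      DOUBLE PRECISION :: B(NEDGE), WORK(NEDGE, MAXTH)
      INTEGER :: I, J, ITH, NTH
      WORK = 0.0D0
      NTH  = 1
!$OMP PARALLEL PRIVATE(I, J, ITH) SHARED(WORK, NTH)
!$OMP SINGLE
      NTH = OMP_GET_NUM_THREADS()
!$OMP END SINGLE
      ITH = OMP_GET_THREAD_NUM() + 1
!$OMP DO
      DO I = 1, NEDGE
!        the written index J does not follow the loop index, so two
!        threads could target the same element: writing only into the
!        column owned by thread ITH avoids any race condition
         J = 1 + MOD(7 * I, NEDGE)
         WORK(J, ITH) = WORK(J, ITH) + DBLE(I)
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
!     rebuild the original array by summing the per-thread columns
      DO I = 1, NEDGE
         B(I) = SUM(WORK(I, 1:NTH))
      ENDDO
      PRINT *, B(1)
      END PROGRAM THREAD_ROWS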
5.4 Results
During development the test case was run to understand if the current implementation was
providing good results. The simulation had a duration of only 8 ps and was composed of just
1000 elements (see Figure 14), but it was already possible to notice some good improvements
to the software. Further work has been done after these results were produced.
The following table (Table III) summarizes the total execution time in seconds; in the table the
label OMP stands for OpenMP, MPI for OpenMPI over Infiniband and OPT for optimiza-
tions, while for each field a * stands for enabled and a - for disabled.
It is possible to notice that the software has received a speed boost of 87.5% from the old
configuration to the newer optimized MPI over Infiniband plus OpenMP environment.
Not surprisingly the most effective contribution comes from the compiler optimizations: this is because the ability to access all the SSE extensions, together with the loop unrolling configuration, benefits the whole program rather than selected routines.
TABLE III
PARTIAL RESULTS
However it is important to take into consideration the targets of this project. It is true that the most processor-intensive code has been duly parallelized, but the software is composed of a high number of other functions that are either strictly serial or of very short duration. The sections that have been parallelized and distributed have received a speed boost, but the final software performance suffers from the presence of serial code and the related overheads.
This also explains why the optimizations bring such an improvement, as they affect all the software without distinction. So a more sensible comparison can only be done if the optimizations are kept enabled on both sides of the comparison.
With the analysis of the previous data, it was possible to understand where further work was really needed. In the end, when all the most computationally expensive functions had been addressed, it was possible to launch the final test case with the same characteristics as before and to obtain the following results:
TABLE IV
FINAL RESULTS
The total speed improvement from the OpenMP and MPI elements alone corresponds to a raw 76% increment. This is a very good result, because not only is it comparable to the speedup introduced by the optimizations, but it also outdoes the results obtained with the Intel FORTRAN Compiler.
By looking at the contribution of the single functions in more detail, it is possible to see the effect
of OpenMP and MPI over Infiniband with no overhead from the other routines.
From the following table it is possible to understand the actual impact of the technologies used in this project.
TABLE V
FUNCTION RESULTS
Having a look at the OpenMP section, there is an aggressive reduction, by a factor of 6-8x:
this is a very good result as it means that the code was able to exploit every processor available
to the maximum extent, with very little overhead and no synchronization problems.
As for MPI, on the other hand, there is a 2x factor of speed improvement; this is reasonable, as the code was split almost exactly in two, so it is normal that the overall reduction corresponds to half the
execution time. It is interesting to notice that this effect applies perfectly when merging MPI
with OpenMP. As a matter of fact, thanks to the Infiniband channel used, communication time
is negligible, and so only the small MPI overhead can influence execution.
CHAPTER 6
CONCLUSION
In this thesis, it has been demonstrated that, in order to achieve the best results, a complete review of the software must be taken into account. Highly serialized software, written throughout a long period of time, does not lend itself easily to parallelization.
However there are technologies that can have a direct impact on performance, in particular OpenMP and MPI. With very little software modification and simple code analysis, it has been possible to obtain a considerable speed improvement; moreover the standard, clean and stable environment of the GCC suite enabled access to important optimization controls that increased the quality of the software where this had not been done before.
For this reason this project still shows significant room for improvement. First of all, algorithm optimizations are necessary to obtain high performance; secondly it could be possible to take advantage of FORTRAN library functions for otherwise long routines – even more so for the high number of small operations repeated several times. In the third place, software analysis must continue in order to extract precise timing information from profiling and to identify the other computationally expensive functions that could receive a significant improvement from the techniques presented here.
Finally, thanks to the high scalability of cluster systems, it should be fairly easy and very convenient to add new elements that can contribute to the computation; in fact it
would be possible to connect more components to the cluster using an Infiniband switch, at the sole cost of some increased latency. Moreover, thanks to the adopted middleware of open standards, OpenMP and MPI, porting the software to other architectures and expanding its routines to use additional nodes should require only limited effort.
Appendix A
Introduced by Intel in its line of Pentium III processors, the Streaming SIMD Extensions (SSE) bring SIMD execution to general purpose processors. While older processors could only process one data element per instruction, SIMD technology allows instructions to handle multiple data elements, making processing much quicker.
SSE’s use of SIMD technology allows for data processing in applications such as 3D graphics
to benefit greatly from the availability of extended floating point registers. In contrast to the
preceding MMX technology, SSE registers have an increased width, allowing more data to be stored and processed per instruction. Initially eight new 128-bit registers, known as XMM0 through XMM7, were added; SSE2 extends MMX instructions to operate on XMM registers, allowing the programmer to completely avoid the eight 64-bit MMX registers "aliased" onto the x87 floating point stack.
More precisely SSE2 adds new mathematical instructions for double-precision (64-bit) float-
ing point and also extends MMX instructions to operate on 128-bit XMM registers. SSE integer
instructions introduced with later extensions would still operate on 64-bit MMX registers be-
cause the new XMM registers require operating system support (this behavior changed only
with SSE4 onward). SSE2 enables the programmer to perform SIMD math of virtually any
type (from 8-bit integer to 64-bit float) entirely within the XMM vector-register file, without the need to use the legacy MMX or x87 registers.
SSE3, SSSE3 and SSE4 are further revisions to the architecture and introduce new operating
conditions (column access to registers), new instructions (that can act on 64-bit MMX or 128-
bit XMM registers and simplify the implementation of DSP and 3D code) and conversion utility instructions.
In a multi-tasking environment, the Streaming SIMD Extensions require support from the
operating system: the SIMD registers must be handled properly by the operating system’s
context switching code. When the system switches control from one process to another, the old
process’s SIMD registers must be saved away, and the saved values of the new process’s SIMD
registers must be loaded into the processor. The Pentium III processor prohibits programs from
using the Streaming SIMD Extensions unless the operating system tells the processor at system
startup time that it is aware of the SIMD registers, and will manage them properly.
Appendix B
The test program has been designed to simulate some computationally intensive routines of
the target software; in the main loop a lot of mathematical functions are executed over a set
of arrays, without creating data dependencies between the iterations. Statistics are printed at
the beginning and at the end of the program; in order to obtain the total execution time the
function gettimeofday() is used; the loop is repeated ten times, obtaining a more reliable average measurement.
omp_set_num_threads(u);
totaltime = 0;
[...]
/* accumulate the elapsed time of each run, in microseconds */
totaltime += (timing_end.tv_sec - timing_start.tv_sec) * 1000000
           + (timing_end.tv_usec - timing_start.tv_usec);
[...]
printf("%d\n", totaltime/10);
}
Appendix C
The general OpenMP directive begins with !$OMP, indicating the start of an OpenMP construct; any directive has to be declared with an entry and a closing section, such as:
[...]
The first directive must be PARALLEL, which wraps the code section that must be executed in parallel, and it is closed by the corresponding END PARALLEL. The directive accepts a list of clauses, such as SHARED and PRIVATE, each followed by the relevant variable list.
Another important OpenMP directive is DO, which specifies that the next loop can be executed in parallel; among its clauses there are:
SCHEDULE (type[, chunk]) describes how the iterations of the loop are divided (in chunks) and assigned to the threads;
LASTPRIVATE (list) list of private variables whose value, at the end of the loop, is taken from the sequentially last iteration;
Other parallelizing directives that don’t require any particular clause configuration are:
• SECTIONS: statically splits the code into sections which are assigned each to a single thread
in the pool;
• WORKSHARE: divides the execution of the enclosed code block into separate units of work;
• TASK: defines an explicit task, which may be executed by the encountering thread or deferred to another thread in the pool;
• BARRIER: implements a barrier region where execution is stopped until all threads are
ready to continue;
• ATOMIC: defines a single-instruction critical region, in which memory is accessed atomically.
Finally it is possible to use some OpenMP related functions to further adapt the software to
a multiprogrammed system; this set of routines may be used for a variety of purposes, such as
obtaining information from single threads, setting configuration about the number of threads,
getting environment data (like number of processors), locking variables, timing and so on. For
example:
• OMP_SET_NUM_THREADS(): sets the number of threads that must be started;
• OMP_GET_NUM_THREADS(): returns the number of threads of the parallel region;
• OMP_GET_THREAD_NUM(): returns the number identifying a single thread in the pool;
• OMP_GET_THREAD_LIMIT(): returns the maximum number of OpenMP threads available to a program;
• OMP_GET_NUM_PROCS(): returns the number of processors that are available to the program;
• OMP_INIT_LOCK(): initializes a lock on the variable, setting the lock to "unset";
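A minimal example (not taken from the thesis code) that combines some of the routines above is the following: the number of threads is set explicitly, every thread reports its identifier, and a lock protects the update of a shared counter.

      PROGRAM OMP_RUNTIME
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER (KIND=OMP_LOCK_KIND) :: LCK
      INTEGER :: COUNTER
      COUNTER = 0
      CALL OMP_SET_NUM_THREADS(4)
      CALL OMP_INIT_LOCK(LCK)
!$OMP PARALLEL SHARED(COUNTER, LCK)
      PRINT *, 'thread', OMP_GET_THREAD_NUM(), 'of',
     +   OMP_GET_NUM_THREADS(), 'on', OMP_GET_NUM_PROCS(), 'procs'
!     the lock serializes the update of the shared counter
      CALL OMP_SET_LOCK(LCK)
      COUNTER = COUNTER + 1
      CALL OMP_UNSET_LOCK(LCK)
!$OMP END PARALLEL
      CALL OMP_DESTROY_LOCK(LCK)
      PRINT *, 'threads counted:', COUNTER
      END PROGRAM OMP_RUNTIME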
Appendix D
MPI routines are added to a standard FORTRAN program by including the mpif.h header file. After this, the MPI layer must be initialized with MPI_INIT(), before using any MPI related functions, and it must be closed with MPI_FINALIZE(), before ending the program.
By using MPI_COMM_RANK() and MPI_COMM_SIZE(), the program becomes aware of running in an MPI environment, as these functions save the number of the instance of the program in the rank variable and the total number of running instances in another variable.
It is then possible to use the point-to-point communication routines, which come in many variants (blocking, synchronous, non-blocking, buffered) and are all described by the following parameters:
buffer : represents either the data that has to be sent or the memory location in which it must be saved;
type : sets one of the MPI data types for the transfer;
destination/source : describes the number of the instance of the program that has to send or receive the message;
MPI_COMM_WORLD : the default communicator, from which the MPI configuration is read;
MPI also allows for collective communication (a sort of "multicasting") by means of broadcast, scatter and gather functions that require information about the data buffers of both the sender and the receiver.
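A minimal skeleton showing the calls described above in order (a sketch, not taken from Sally3D) could be:

      PROGRAM MPI_SKELETON
!     initialization, rank/size query, one blocking point-to-point
!     exchange between instances 0 and 1, finalization
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: RANK, NPROCS, IERR
      INTEGER :: STAT(MPI_STATUS_SIZE)
      DOUBLE PRECISION :: VAL
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
      IF (RANK .EQ. 0) THEN
         VAL = 42.0D0
         CALL MPI_SEND(VAL, 1, MPI_DOUBLE_PRECISION, 1, 99,
     +                 MPI_COMM_WORLD, IERR)
      ELSE IF (RANK .EQ. 1) THEN
         CALL MPI_RECV(VAL, 1, MPI_DOUBLE_PRECISION, 0, 99,
     +                 MPI_COMM_WORLD, STAT, IERR)
         PRINT *, 'instance', RANK, 'of', NPROCS, 'received', VAL
      ENDIF
      CALL MPI_FINALIZE(IERR)
      END PROGRAM MPI_SKELETON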
In order to compile an MPI-enabled program, it is not possible to directly call the compiler,
but it is necessary to resort to the wrapper provided by the MPI distribution, which correctly sets paths
and libraries; also for launching executables a special wrapper must be used with proper syntax.
In the case of OpenMPI, the MPI implementation selected for this project, the compiler wrapper is called mpif90 while the launching wrapper is mpirun; this software must be called specifying the number of instances of the program to run (-np) and the list of hosts that have to execute it (-host). So, for example, in a two-machine cluster environment in which each node has to run one instance of the program, the launch command takes a form such as mpirun -np 2 -host node1,node2 ./program.
It is possible to share some environment variables among the nodes with the -x switch; this is required for an OpenMP+MPI system, as the number of threads depends on the value of the OMP_NUM_THREADS variable.
1. Stallings, W.: Computer Organization & Architecture - Designing for Performance. Pear-
son - Prentice Hall, 2006.
3. Lu, J., Li, Y., Sun, C., and Yamada, S.: A parallel computation model for nonlinear
electromagnetic field analysis by harmonic balance finite element method. Tech-
nical Report 0-7803-2018-2, Faculty of Science and Technology, Griffith University
Australia and Faculty of Technology, Kanazawa University Japan, 1995.
4. Ito, F. and Amemiya, N.: Application of parallelized SOR method to electromagnetic field
analysis of superconductors. Technical Report 1051-8223/04, Faculty of Engineer-
ing, Yokohama National University, 2004.
5. Giuffrida, C., Gruosso, G., and Repetto, M.: Finite formulation of nonlinear magneto-
statics with integral boundary conditions. Technical Report 0018-9464, Electrical
Engineering Department, Politecnico di Torino and Electronic and Information En-
gineering Department, Politecnico di Torino and Electronic and Information Engineering Department, Politecnico di Milano, 2006.
6. Silberschatz, A., Galvin, P. B., and Gagne, G.: Operating System Concepts. Pearson
Education, 2006.
8. McMichael, R. D.: µMAG – Micromagnetic Modeling Activity Group. Center for Theo-
retical and Computational Materials Science, http://www.ctcms.nist.gov/~rdm/
mumag.html.
9. Moser, J. R.: Optimizing linker load times. LWN.net - Your Linux info source, http:
//lwn.net/Articles/192082/, 2006.
10. Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., and Menon, R.: Parallel
Programming in OpenMP. Morgan Kaufmann Publishers, 2001.
11. Dagum, L. and Menon, R.: OpenMP: An Industry Standard API for Shared Memory
Programming. Computational Science and Engineering, 1998.
12. Gropp, W., Lusk, E., and Skjellum, A.: Using MPI - Portable Parallel Programming with
the Message-Passing Interface. Scientific and Engineering Computation Series. The
MIT Press, 1999.
13. Reinders, J.: VTune Performance Analyzer Essentials. Intel Press, 2007.
14. Stevens, W. R.: UNIX Network Programming: Networking APIs: Sockets and XTI. Pren-
tice Hall, 1998.
16. Shipman, G. M., Woodall, T. S., Graham, R. L., Maccabe, A. B., and Bridges, P. G.:
Infiniband scalability in Open MPI. Technical Report 1-4244-0054-6/06, Advanced
Computing Laboratory, Los Alamos National Laboratory and Dept. of Computer
Science, University of New Mexico, 2006.
17. Sur, S., Koop, M. J., and Panda, D. K.: High-performance and scalable MPI over In-
finiband with reduced memory usage: An in-depth performance analysis. Technical
Report 0-7695-2700-0/06, Department of Computer Science Engineering, Ohio State
University, 2006.
18. Quintero, D., Conrad, N., Desjarlais, R., Kahle, M.-E., Kim, J.-H., Nguyen, H.-N., Pir-
raglia, T., Pizzano, F., Simon, R., Yao, S. L., and Lascu, O.: Implementing
InfiniBand on IBM System p. IBM Redbooks, 2007.
19. Gray, A., Hein, J., and Booth, S.: Improved MPI with RDMA. Technical report, EPCC,
University of Edinburgh, June 2005.
20. T., U. and J., R. B. S.: Multithreaded processors. The Computer Journal, 3, 2002.
21. R., B.: High Performance Cluster Computing: Architectures and Systems. Prentice Hall,
1999.
22. R., B.: High Performance Cluster Computing: Programming and Applications. Prentice
Hall, 1999.
23. Barney, B.: Message Passing Interface (MPI). Lawrence Livermore National Laboratory,
https://computing.llnl.gov/tutorials/mpi/.
24. Hablot, L., Gluck, O., Mignot, J.-C., Genaud, S., and Primet, P. V.-B.: Comparison and
tuning of MPI implementations in a grid context. Technical Report 1-4244-1388-5,
Laboratoire de l'Informatique du Parallélisme, Université de Lyon, 2007.
VITA