
PERFORMANCE IMPROVEMENTS WITH GPUS FOR MARINE BIODIVERSITY

Lev Lafayette1, Mitch Turnbull2, Mark Wilcox3, Eric A. Treml4


1 Department of Infrastructure, University of Melbourne, Melbourne, Australia
  lev.lafayette@unimelb.edu.au
2 Nyriad, Cambridge, New Zealand
  mitch.turnbull@nyriad.com
3 Nyriad, Cambridge, New Zealand
  mark.wilcox@nyriad.com
4 School of BioSciences, University of Melbourne, Melbourne, Australia
  eric.treml@unimelb.edu.au

ABSTRACT
Identifying probable dispersal routes for marine populations is a data- and processing-intensive task
for which traditional high performance computing systems are suitable, even for single-threaded
applications. Whilst processing dependencies between the datasets exist, a large degree of independence
between sets allows the use of job arrays to significantly improve processing time. Identification of
bottlenecks within the code base suitable for GPU optimisation, however, has led to additional
performance improvements which can be coupled with the existing benefits from job arrays. This case
study offers an example of how to optimise single-threaded applications for GPU architectures to
achieve significant performance improvements. Further development is suggested with the expansion
of the GPU capability of the University of Melbourne's "Spartan" HPC system.

KEYWORDS
High Performance Computing, General-Purpose Computing on Graphics Processing Units, Marine
Biodiversity

1. UNIVERSITY OF MELBOURNE HPC AND MARINE SPATIAL ECOLOGY WITH JOB ARRAYS
1.1. University of Melbourne HPC Systems
From 2011 to 2016, the University of Melbourne provided general researcher access to a medium-
sized HPC cluster called "Edward", designed in a traditional fashion: compute nodes connected
by a fast interconnect, along with attached storage, a node deployment system, and so on.
However, as "Edward" was being retired, an analysis of actual job metrics indicated that the
overwhelming majority of jobs were single-node or even single-core, especially as job arrays.
The successor system, "Spartan", was therefore designed with a view to high throughput
rather than high performance. A small traditional HPC partition with a high-speed interconnect
was combined with a much larger partition built on OpenStack virtual machines from the
NeCTAR research cloud. This proved to be a highly efficient and optimised approach in terms of
both finances and throughput [1].

Broadly speaking, parallelisation can occur through task-parallel techniques, data-parallel
techniques, or a combination of both. If a dataset requires simultaneous processing and
communication between tasks, a task-parallel strategy using message-passing libraries is
recommended. For a large quantity of datasets which can be processed independently of each
other, a data-parallel strategy is recommended. In HPC environments this is commonly done
with job arrays in submission scripts, which offer administrative ease and time-efficiency in job
distribution compared to, for example, launching jobs through a shell-script loop. Whilst job
arrays have an overhead cost, they are a convenient method for users dealing with data-parallel
jobs.
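As an illustration, a data-parallel workload of this kind can be launched with a Slurm job array. The script below is a minimal sketch only; the partition name, module name, script name, and file-naming scheme are assumptions for illustration, not Spartan's actual configuration:

```shell
#!/bin/bash
# Minimal Slurm job-array sketch: one array task per independent dataset.
# Partition, module, and function names are illustrative assumptions.
#SBATCH --job-name=dispersal
#SBATCH --partition=cloud
#SBATCH --ntasks=1
#SBATCH --time=24:00:00
#SBATCH --array=1-442

module load MATLAB

# Each array task processes one independent dataset, selected by the
# task ID that Slurm exports to the job environment.
matlab -nodisplay -r "run_simulation(${SLURM_ARRAY_TASK_ID}); exit"
```

Submitted once with `sbatch`, the scheduler then distributes the 442 independent tasks across available nodes, which is the administrative convenience job arrays provide over a shell-script loop.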

1.2. Job Arrays for Marine Spatial Ecology


A specific example of a large number of computational tasks designed for single-threaded
applications with modest memory requirements is research into marine biodiversity and
population connectivity, which has significant implications for the design of marine protected
areas. In particular, there is a lack of quantitative methods to incorporate, for example, larval
dispersal via ocean currents, population persistence, and impact on fisheries. The Marine
Spatial Ecology and Conservation (MSEC) laboratory at the University of Melbourne has been
engaged in several research projects to identify the probable dispersal routes and spatial
population structure for marine species, and to integrate these connectivity estimates into marine
conservation planning. Current projects include reef resilience in Port Phillip Bay (Melbourne,
Australia), marine biodiversity and population connectivity of the Indo-Pacific, reconciling
competing objectives for the design of marine reserve networks, conservation networks and
ecological neighbours, and hydrodynamic modelling of coastal seas. Nearly fifty publications
have come from the laboratory's research in the past two decades [2].

2. CODE REVIEW FOR GPGPU OPTIMISATION


2.1. GPGPU Processing Capability at the University of Melbourne
As part of a computational pipeline, a graphics processing unit (GPU) can be used to process
data as if it were in graphic form, creating a General-Purpose computing on Graphics Processing
Units (GPGPU) environment. GPUs typically have a significantly lower clock speed than
Central Processing Units (CPUs), but provide massive parallelisation of a limited set of
operational functions on a data stream (e.g., map, reduce) and are especially suited to particular
computational problems (e.g., matrices and vectors), in a manner that broadly fits the SIMD
(single instruction stream, multiple data stream) architecture of Flynn's taxonomy, making them
particularly well suited for pleasingly parallel problems. Whilst the utilisation of GPGPUs saw
notable results in the early 2000s, they became particularly prevalent as the ratio of heat to
clock frequency increased in standard CPUs, resulting in a general flattening of typical CPU
clock speeds by the mid-2000s.

There are a number of architectural constraints on GPUs. They are, to a very large extent,
independent of their host system: object code needs to be compiled for the GPU (e.g., using
OpenCL or nvcc), there is no shared memory between the GPU and CPU, and any unprocessed
data must be transferred to the GPGPU environment and then back to the CPU environment
when processing is complete. In addition, GPUs typically have only small amounts of cache
memory, if any, compensating with GPU pipelining and very high memory transfer rates
between the GPU and the host [3].

Neither the Edward nor the Spartan systems at the University of Melbourne initially had
significant GPGPU capability. The Edward HPC system included two nodes with GPUs
attached, each with dual quad-core CPUs and two NVIDIA 2070 GPUs with 8 GB of memory.
These saw minimal usage on Edward, and as a result a significant expansion was not initially
developed for the Spartan HPC/cloud hybrid. A small 3-node partition for general availability
was implemented on Spartan, each node with 12 cores, 251 GB of RAM, and four NVIDIA
K80 GPUs. An additional 2-node partition with the same specifications has been established
for the ARC Centre of Excellence for Particle Physics at the Terascale (CoEPP).
2.2. Nyriad's Review
During the first half of 2017, Nyriad reviewed the HPC infrastructure, the existing MATLAB(R)
source code and sample data, and wrote a test suite designed to run the CPU and GPU versions
at the same time. This allowed the researcher to modify the code and still make use of the GPU
to run the simulation. As part of the test suite, a significant amount of test data was generated at
larger scales than were provided as samples. The goal of the approach was to check for
equivalence of the algorithms without taking away the researcher's ability to evolve the code,
providing a 'sandbox' environment that could enable running simulations on a larger scale. A
significant benefit of the GPU approach is that the code could be run on the researcher's laptop
and workstation, and when the code was ready to run on more than one node, it was already in a
form which could easily be distributed across many GPUs. There were two review stages: the
first for optimisation of the existing MATLAB(R) code base, followed by identification of
functions that could be distributed and rewritten for GPUs.
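The equivalence checking described above can be sketched in outline. The following is a minimal illustration (in Python rather than the project's MATLAB(R); the function names are hypothetical) of comparing a reference implementation's output against a candidate port's output within a floating-point tolerance:

```python
# Sketch of an equivalence check between two implementations of the same
# algorithm (e.g., a CPU reference and a GPU port), allowing a small
# tolerance for floating-point differences. Names are illustrative only.

def max_abs_difference(reference, candidate):
    # Largest element-wise absolute difference between the two results.
    return max(abs(r - c) for r, c in zip(reference, candidate))

def check_equivalence(reference, candidate, tolerance=1e-6):
    # Results must have the same shape and agree element-wise
    # within the given tolerance.
    if len(reference) != len(candidate):
        return False
    return max_abs_difference(reference, candidate) <= tolerance
```

Running such a check after every code change lets the researcher evolve the CPU version freely while retaining confidence that the GPU version still computes the same result.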

Firstly, three sections were identified in the code as computationally intensive: Del2, the Cell
Wall Flux Calculation, and the Anti-diffusion Velocity Calculation. The Del2 function was a
built-in MATLAB(R) function that required data padding to avoid edge effects and also used 3D
arrays, which was unnecessary for the simulation. For the Cell Wall Flux Calculation, pre-
computation of variables (UL, UR, VB, and VT) was replaced with in-line calculation. For the
Anti-diffusion Velocity Calculation function, dCdx calculations were only done on cells that had
non-zero larvae densities. This was achieved by multiplication with a logical matrix in which
cells were 1 if larvae were present and 0 otherwise. Indiscriminately computing all the values
and removing unwanted ones with the logical multiply removed the overhead of selecting values
and conditionally computing them. The expressions for dCdx calculations were also moved
inline with the Vd calculations. In addition, the sections for cell wall flux and anti-diffusion
velocity calculations were renamed and moved to the functions updateDensityWithFlux and
updateDensityWithAntiDiffusionVelocity, respectively. In itself this was not necessary to achieve
the performance improvement, but it improves the readability of the main code loops and
separates these pieces of code into a form suitable for GPU acceleration.
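The logical-multiply optimisation can be sketched as follows (in Python with plain lists rather than MATLAB(R) matrices; the one-dimensional layout and the names are illustrative, not those of the actual code base):

```python
# Sketch of the logical-mask optimisation: compute a gradient-like
# quantity for every cell, then zero out cells with no larvae by
# multiplying with a 0/1 mask, rather than testing each cell with a
# branch. Names (density, dCdx) are illustrative, not from the code.

def dCdx_masked(density, dx):
    n = len(density)
    # Central finite difference, with indices clamped at the edges.
    grad = []
    for i in range(n):
        left = density[max(i - 1, 0)]
        right = density[min(i + 1, n - 1)]
        grad.append((right - left) / (2.0 * dx))
    # Logical mask: 1 where larvae are present, 0 otherwise.
    mask = [1.0 if d > 0.0 else 0.0 for d in density]
    # All values are computed unconditionally; the multiply discards
    # the unwanted ones without per-cell branching.
    return [g * m for g, m in zip(grad, mask)]
```

On SIMD-style hardware this trades a small amount of wasted arithmetic for the elimination of branches, which is usually a net win on GPUs.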

For the GPU optimisation, GPU acceleration was added to the DisperseLarvae and
RunSimulation functions in the code base. The three main operations of the simulation, which
account for the bulk of the execution time, were rewritten as optimised GPU functions in CUDA
C++, or "kernels" in CUDA terminology. Functions are provided to support both double and
single precision calculations. MATLAB(R) uses a gpuArray type to represent matrices in GPU
memory. There are built-in functions in MATLAB(R) through which the user can treat these as
normal matrices, with operations carried out on the GPU using standard MATLAB(R) code.
Whilst convenient, this is not efficient; specialised GPU code has to be added in if statements
that test the USEGPU flag. As the CPU and GPU versions share code, maintaining the bulk of
the GPU code is made simpler. The GPU code works by moving several matrices to the GPU,
enabling MATLAB(R) to perform operations and matrix creation on the GPU when a gpuArray
is used.
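The flag-based dispatch described above follows a common pattern: keep the driver code shared between CPU and GPU paths, and branch on the flag only around the hot operations. A minimal sketch of the pattern (in Python, with a stand-in for the GPU kernel; all names are illustrative assumptions, not the project's actual functions):

```python
# Sketch of the USEGPU dispatch pattern: the bulk of the simulation
# code is shared, and only the hot inner operation branches on the
# flag. gpu_elementwise_add stands in for a real CUDA kernel launch;
# here it simply produces the same result, so the paths stay equivalent.

USEGPU = False

def cpu_elementwise_add(a, b):
    return [x + y for x, y in zip(a, b)]

def gpu_elementwise_add(a, b):
    # Placeholder for a kernel launch; it must return the same values
    # as the CPU path so results remain interchangeable.
    return [x + y for x, y in zip(a, b)]

def update_density(density, flux):
    # Shared driver code: only the hot operation is dispatched.
    if USEGPU:
        return gpu_elementwise_add(density, flux)
    return cpu_elementwise_add(density, flux)
```

Because both paths share the surrounding code and must produce interchangeable results, the CPU version doubles as a reference for validating the GPU kernels.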

3. PERFORMANCE IMPROVEMENTS
3.1. Initial Bottleneck Identification
Nyriad's code review identified bottlenecks that were suitable for GPGPU workloads. On the
University of Melbourne HPC system, "Spartan", using a single GPU, a 90x performance
improvement was achieved over the original code, and a 3.75x improvement over the CPU
version with 12 threads available, for the 4.6 GB Atlantic Model simulating 442 reefs. The
simulation, previously taking 8 days to complete on one of the most powerful nodes (GPU
or physical), could be completed in 2 hours. On the other hand, for the 4 MB South Africa
Benguela Region dataset the GPU version is faster than the original code, but slower than the
improved CPU implementation.

Nyriad is still working on optimisations for smaller datasets, but these are less likely to be
suitable for GPU acceleration due to the high latency of moving data back and forth between
GPU and system memory. The original version has been modified to use double precision,
which has given a performance increase to the original code with one CPU available. For the
Atlantic model the single precision version was estimated to take 515 hours to complete, and
the double precision version is now predicted to take 258 hours. Note that these performance
improvements are in addition to the improvements in performance from conducting the tasks as
independent data-parallel job arrays.

Table 1. South Africa Benguela Region (single 10-day simulation)

Threads   Original   Improved   GPU
1         28s        13s        14s
12        26s        11s        14s

Table 2. Atlantic Model

Threads   Original   Improved   GPU
1         258h*      21h*       2h
12        180h*      7.5h       2h

* Performance estimates provided after running the simulation for 2 hours

3.2. Additional Refactoring


Although a significant performance improvement has been achieved, significant further
improvements are possible. A larger refactor of the codebase to simulate many reefs in
parallel would enable greater utilisation of all the GPU compute capabilities available on
each Spartan node. This would work by combining multiple reef simulations into a single
simulation distributed across multiple GPUs in parallel, reducing the GPU idle time spent
waiting for the CPU process to provide more work. Once again, the most significant bottleneck
from the GPU's perspective is CPU wait-time.

Figure 1 shows the single-GPU utilisation of the large Atlantic data set simulating 25 reefs for
100 days. The average utilisation is 48% when the GPU is used. As a Spartan GPU node has
two K80 cards, each with two GPU cores for a total of four GPU devices, this shows that the
node is not being fully utilised. Processing with a faster GPU (such as the Tesla V100) would
obviously see an even greater performance improvement.

Figure 1. GPU Utilisation of a Single K80 GPU Chip on a Spartan GPU Node

If the code is refactored to process reefs in parallel, we anticipate that utilisation of the node
would improve at both the per-GPU and multi-GPU level, significantly reducing the single-
simulation time by fully utilising the Spartan GPU node on which it is run. With this change we
predict a performance improvement of over 5x compared to the existing GPU code, meaning
that while using more resources on a node, the execution time of a single simulation would be
greatly reduced. Smaller datasets would also likely achieve some improvement, as per-GPU
utilisation would increase. Figure 2 demonstrates the performance increase of the current two
versions, and the predicted performance of the multithreaded GPU version, when running a
single simulation on the Atlantic data set of 442 reefs over 100 days.

Figure 2. Single simulation performance increase of code versions compared to original code
for the Atlantic model on a full Spartan GPU node.

4. FURTHER DEVELOPMENTS
4.1. Spartan's GPGPU Expansion
With notable performance improvements available to a range of job profiles, a significant
expansion of Spartan's GPGPU capacity has just been implemented. The partition, funded by
Linkage Infrastructure, Equipment and Facilities (LIEF) grants from the Australian Research
Council, has come together as a partnership between the University of Melbourne, La Trobe
University, Deakin University, and the Royal Melbourne Institute of Technology (RMIT).
Within the University of Melbourne, which makes up the bulk of the share (c. 80%), the
research bodies include Melbourne Bioinformatics (the successor to the Victorian Life Sciences
Computation Initiative), the Melbourne School of Engineering, the Medical and Dental Health
School, St Vincent's Hospital, and Research Platforms. The partition is composed of 68 nodes
and 272 NVIDIA P100 GPGPU cards.

The major usage of the new system will be for turbulent flows, theoretical and computational
chemistry, and genomics, representative of the needs of the major participants. Specific software
applications have already been optimised for GPGPU (e.g., through library extensions or
compilation variants) for these research projects, including NAMD and GROMACS (both
molecular dynamics simulation applications), AMBER (biomolecular force field simulations),
MATLAB (a numerical computing environment), and an in-house developed application,
HiPSTAR (fluid dynamics). At the time of writing, installation of these applications with
GPGPU optimisation had just been completed and beta-users of the project were being
introduced to the new partition. Future publications will review the performance changes for the
research datasets.

4.2. Future Collaborations


Nyriad's review found that there is significant opportunity in the use of data integrity
and mathematical equivalence algorithmic techniques for porting code to GPUs with
minimal impact on the research workflow. Firstly, the use of the original language to
take advantage of SIMD parallelism on the CPU enabled the researcher to understand how the
problem mapped to a GPU or to the HPC cluster environment. When changes needed to be
made, the researcher was more confident in updating the parallel version in their chosen
language, which made updating the native CUDA C++ implementations much easier for the
GPU developer.

The generation of a larger dataset derived from a smaller dataset hosted on the CPU enabled
mirroring of the researcher's code on the CPU and GPU, proving that the GPU could perform
the same operation at much higher resolution. By checking data integrity, confidence could be
built with the researcher about the benefits of being more ambitious with their simulations.
Finally, the availability of consumer GPUs made this collaboration highly effective, as both the
end user's hardware and the HPC cluster contained heterogeneous processing capability.

The University of Melbourne and Nyriad will continue their research collaborations, especially
in the GPGPU environment for data integrity and mathematical equivalence, scalability testing
and hybrid clusters to enable more scientific programming users to progressively scale their
work up to larger systems.

ACKNOWLEDGEMENTS
The authors would like to thank the University of Melbourne for their support in developing this
work.

REFERENCES
[1] Lev Lafayette, Greg Sauter, Linh Vu, Bernard Meade, "Spartan: Performance and Flexibility:
An HPC-Cloud Chimera", OpenStack Summit, Barcelona, October 27, 2016.
[2] For example, Keyse, J., Treml, E. A., Huelsken, T., Barber, P., DeBoer, T., Kochzius, M.,
Muryanto, A., Gardner, J., Liu, L., Penny, S., Riginos, C. (2018), Journal of Biogeography,
February 2018.
[3] Shigeyoshi Tsutsui, Pierre Collet (eds.) (2013), Massively Parallel Evolutionary Computation
on GPGPUs, Springer-Verlag.

Authors
Lev Lafayette is the Senior HPC and Training Officer at the University of Melbourne, where he
has been since 2015. Prior to that he worked in a similar role at the Victorian Partnership for
Advanced Computing for several years. He has also worked at the Ministry of Foreign Affairs
and Cooperation, Timor-Leste, and the Parliament of Victoria. He has degrees in politics,
philosophy, and sociology; project management; technology management; and adult and
tertiary education.
