MIKE 2017
DHI headquarters
Agern Allé 5
DK-2970 Hørsholm
Denmark
+45 4516 9200 Telephone
+45 4516 9333 Support
+45 4516 9292 Telefax
mike@dhigroup.com
www.mikepoweredbydhi.com
LIMITED LIABILITY The liability of DHI is limited as specified in Section III of your
‘DHI Software Licence Agreement’:
‘IN NO EVENT SHALL DHI OR ITS REPRESENTATIVES
(AGENTS AND SUPPLIERS) BE LIABLE FOR ANY DAMAGES
WHATSOEVER INCLUDING, WITHOUT LIMITATION,
SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL
DAMAGES OR DAMAGES FOR LOSS OF BUSINESS PROFITS
OR SAVINGS, BUSINESS INTERRUPTION, LOSS OF
BUSINESS INFORMATION OR OTHER PECUNIARY LOSS
ARISING OUT OF THE USE OF OR THE INABILITY TO USE
THIS DHI SOFTWARE PRODUCT, EVEN IF DHI HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THIS
LIMITATION SHALL APPLY TO CLAIMS OF PERSONAL
INJURY TO THE EXTENT PERMITTED BY LAW. SOME
COUNTRIES OR STATES DO NOT ALLOW THE EXCLUSION
OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL,
SPECIAL, INDIRECT, INCIDENTAL DAMAGES AND,
ACCORDINGLY, SOME PORTIONS OF THESE LIMITATIONS
MAY NOT APPLY TO YOU. BY YOUR OPENING OF THIS
SEALED PACKAGE OR INSTALLING OR USING THE
SOFTWARE, YOU HAVE ACCEPTED THAT THE ABOVE
LIMITATIONS OR THE MAXIMUM LEGALLY APPLICABLE
SUBSET OF THESE LIMITATIONS APPLY TO YOUR
PURCHASE OF THIS SOFTWARE.’
MIKE 21 Flow Model FM
Parallelisation using GPU
Benchmarking report

CONTENTS

2     Methodology ..................................................................................................................... 2
2.1   GPU Parallelisation ........................................................................................................... 2
2.2   Hardware .......................................................................................................................... 2
2.3   Software ........................................................................................................................... 3
2.4   Performance of the GPU Parallelisation ........................................................................... 3
7.2   Ribe Polder ..................................................................................................................... 35
7.3   EA2D Test 8A ................................................................................................................. 35
8     Discussion ...................................................................................................................... 37
9     Conclusions .................................................................................................................... 39
10    References ..................................................................................................................... 40
2 Methodology

2.1 GPU Parallelisation

By default, the program uses one MPI process per GPU, but it is possible to assign more processes to the same GPU. In this way, simulations where the hydrodynamic calculations are less time-consuming than the calculations performed in the other modules will benefit from the MPI parallelisation.
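The idea of letting several MPI processes share one GPU can be sketched with a small, hypothetical example (this is not MIKE code; mpi4py and the round-robin mapping are assumptions used only for illustration):

```python
from mpi4py import MPI

def assign_gpu(n_gpus: int) -> int:
    """Map each MPI rank to a GPU in round-robin fashion, so several ranks
    may share a GPU when the number of ranks exceeds the number of GPUs."""
    rank = MPI.COMM_WORLD.Get_rank()
    return rank % n_gpus

if __name__ == "__main__":
    # With 4 MPI processes and 2 GPUs, ranks 0 and 2 use GPU 0, ranks 1 and 3 use GPU 1.
    gpu_id = assign_gpu(n_gpus=2)
    print(f"rank {MPI.COMM_WORLD.Get_rank()} -> GPU {gpu_id}")
```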
2.2 Hardware
The benchmarks have been performed using the following hardware platforms and
GPUs:
Table 2.1  Hardware platforms

No.   Computer                              Processor                                           Memory   Operating system                GPUs
1     DELL Precision T7610 (workstation)    2 x Intel® Xeon® E5-2687W v2 (8 cores, 3.40 GHz)    32 GB    Windows 10 Enterprise, 64-bit    2 x GeForce GTX Titan

Table 2.2  GPU specifications

GPU   Number of CUDA cores   Compute Capability   Memory (GB)   GPU Clock (MHz)   Bandwidth (GB/s)   Single/Double precision floating point performance
2.3 Software
All benchmarks have been performed using the MIKE 2017 Release, which uses the CUDA 8.0 library. In the present benchmarks the NVIDIA graphics driver 378.78 has been used for hardware platform 1, with the driver running in TCC mode. For platform 2 the NVIDIA graphics driver 381.65 has been used. For platform 3 the NVIDIA graphics driver 377.35 has been used, with the driver running with ECC enabled. For platform 4 the NVIDIA graphics driver 375.51 has been used, also with ECC enabled.
By default, the calculations performed on the GPU are carried out in double precision. However, since some GPUs have a significantly lower double precision floating point performance than single precision floating point performance, it is possible to force the calculations on the GPU to be performed in single precision. For this reason the benchmarking has been done using both single and double precision calculations.
Be aware that using single precision will affect the accuracy of the simulation results, since single precision calculations are less accurate than double precision calculations.
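The practical consequence of the precision choice can be illustrated with a generic NumPy sketch (not MIKE functionality): the smallest resolvable change of a stored quantity scales with the machine epsilon of the chosen precision.

```python
import numpy as np

# Generic illustration: the smallest resolvable change of a stored value scales
# with machine epsilon, so single precision leaves far less headroom for small
# water-level changes on top of a large total depth.
for dtype in (np.float32, np.float64):
    eps = np.finfo(dtype).eps
    depth = dtype(100.0)              # e.g. a 100 m total water depth
    print(dtype.__name__, eps, depth * eps)
# float32: eps ~1.2e-07 -> changes below roughly 1e-05 m are lost at 100 m depth
# float64: eps ~2.2e-16 -> changes below roughly 2e-14 m are lost
```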
The ratio between the specified theoretical single and double precision floating point performance is not equal to the measured performance ratio between single and double precision. This becomes evident when comparing the values in the last column of Table 2.2, where the theoretical single precision performance is a factor of 3-4 higher than the theoretical double precision performance, with the measured difference between single and double precision presented in the benchmarking below.
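As a quick check against the report's own timings (plain arithmetic on the tables below, no additional data): Table 4.2 gives 2164.98 s in single precision and 3071.29 s in double precision for Mesh E with the higher-order scheme on one GPU.

```python
# Measured SP/DP ratio from Table 4.2 (Mesh E, higher-order scheme, 1 GPU).
time_sp = 2164.98   # seconds, single precision
time_dp = 3071.29   # seconds, double precision
print(round(time_dp / time_sp, 2))   # ~1.42, well below the theoretical factor of 3-4
```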
3.1.1 Description
In the western part of the Mediterranean Sea the tides are dominated by the Atlantic tide entering through the Strait of Gibraltar, while the tides in the eastern part are dominated by astronomical tides, forced directly by the Earth-Moon-Sun interaction.
3.1.2 Setup
The bathymetry is shown in Figure 3.1. Simulations are performed using five meshes with different resolutions (see Table 3.1). The meshes are generated by specifying maximum element areas of 0.04, 0.005, 0.00125, 0.0003125 and 0.000078125 degree², respectively. The simulation period for the benchmarks covers 2 days starting 1 January 2004 for the simulations using meshes A, B and C. The simulation period is reduced to 6 hours for the simulations using mesh D and to 3 hours for mesh E.
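The refinement between successive meshes follows directly from these numbers (simple arithmetic on the maximum element areas listed above):

```python
# Maximum element areas (degree^2) for meshes A-E and the refinement factor
# from one mesh to the next.
areas = [0.04, 0.005, 0.00125, 0.0003125, 0.000078125]
factors = [round(coarse / fine, 1) for coarse, fine in zip(areas, areas[1:])]
print(factors)   # [8.0, 4.0, 4.0, 4.0] -> mesh B is 8x finer than A, then 4x per step
```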
At the Atlantic boundary a time-varying level boundary is applied. The tidal elevation data is based on global tidal analysis (Andersen, 1995).
For the bed resistance the Manning formulation is used with a Manning number of 32. For
the eddy viscosity the Smagorinsky formulation is used with a Smagorinsky factor of 1.5.
Tidal potential is applied with 11 components (default values).
The shallow water equations are solved using both the first-order scheme and the higher-
order scheme in time and space.
The averaged time step for the simulations using Mesh A, B, C, D and E is 17.65s, 5.61s,
2.86s, 1.43s and 0.69s, respectively, for both the first-order scheme and the higher-order
scheme in time and space.
3.2.1 Description
The model area is located in the southern part of Jutland, Denmark, around the city of Ribe. The area is protected from storm floods in the Wadden Sea to the west by a dike. The water course Ribe Å runs through the area and crosses the dike through a lock.
The flood condition where the dike is breached during a storm flood is illustrated by numerical modelling. The concept applied to model the breach failure in the hydrodynamic model is based on prescribing the breach by a dynamic bathymetry that changes in accordance with the relation applied for the temporal development of the breach. Use of this method requires that the location of the breach is defined and known at an early stage, so that it can be resolved properly and built into the bathymetry. The shape and temporal development of the breach are defined by a time-varying distribution along the dike crest. It is further defined how far, normal to the crest line, the breach can be felt. Within this distance the bathymetry follows the level of the breach; if the local level is lower than the breach level, no changes are introduced. The area of influence of the breach will therefore increase with time.
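A minimal sketch of this update rule (hypothetical code, not the MIKE implementation; the function name, arguments and distance-based area of influence are assumptions made only to illustrate the description above):

```python
import numpy as np

def apply_breach(bed_level, dist_to_crest, breach_level, influence_width):
    """Within the breach's area of influence the bed follows the (time-varying)
    breach level; cells already lower than the breach level are left unchanged."""
    in_reach = dist_to_crest <= influence_width       # cells the breach can "feel"
    lowered = np.minimum(bed_level, breach_level)     # never raise the bed
    return np.where(in_reach, lowered, bed_level)

# Example: a 5-cell transect across the dike with the breach level at 2.0 m.
bed = np.array([4.0, 3.5, 1.5, 3.0, 4.5])
dist = np.array([0.0, 5.0, 10.0, 40.0, 80.0])
print(apply_breach(bed, dist, breach_level=2.0, influence_width=50.0))
# -> [2.  2.  1.5 2.  4.5]: lowered within 50 m, except the cell already below 2.0 m
```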
The breach and flood modelling has been carried out based on a historical high water event (24 November 1981), shown in Figure 3.2. Characteristic of this event is that high tide occurs at the same time as the extreme water level. Højer sluice is located about 40 km south of the breach, while Esbjerg is located about 20 km to the north. Based on the high water statistics for Ribe, the extreme high water level has been estimated for an event with a return period of 10,000 years. The observed water level at Højer is then adjusted gradually over two tidal cycles to the extreme high water level estimated for the given return period at Ribe, as indicated in Figure 3.2. The water level time series established in this way are shown in Figure 3.2.
Runoff from the catchment is included as specified discharges for the two streams Ribe Å and Kongeåen.
The crossing between the dike and Ribe Å is shown in Figure 3.4. The crossing is in the
form of a navigational chamber lock. It is represented in the model bathymetry as a
culvert that can be closed by a gate. The points defining the dike next to the creek are
modified to have increased levels in order to ensure a well‐defined bathymetry where flow
only occurs through the cells defining the creek proper. The sluice is defined as a check
valve allowing only flow towards the sea.
A constant discharge of 9.384 m³/s and 14.604 m³/s, respectively, is applied for the two streams Ribe Å and Kongeåen. For the bed resistance the Manning formulation is used
with a Manning number of 18. For the eddy viscosity the Smagorinsky formulation is used
with a Smagorinsky factor of 0.28.
3.2.2 Setup
The bathymetry is shown in Figure 3.3. The computational mesh contains 173,101 elements. A satisfactory resolution of the breach is obtained by a fine mesh of structured triangles and rectangles, as shown in Figure 3.4. The areas inshore and offshore of the dike are defined by a relatively fine mesh to avoid instabilities due to humps or holes caused by large elements with centroids just outside the area of influence of the breach. The simulation period is 42 hours.
The shallow water equations are solved using both the first-order scheme and higher-
order scheme in space and time.
The averaged time step is 0.21s for both the first-order scheme and higher-order scheme
in space and time.
3.3.1 Description
The modelled area is approximately 0.4 km by 0.96 km and entirely covers the DEM provided and shown in Figure 3.5. Ground elevations span a range of ~21 m to ~37 m. The flooding is driven by two sources:
• a uniformly distributed rainfall event, illustrated by the hyetograph in Figure 3.6. This is applied to the modelled area only (the rest of the catchment is ignored).
• a point source at the location (264896, 664747) (map projection: British National Grid), illustrated by the inflow time series in Figure 3.7. (This may, for example, be assumed to arise from a surcharging culvert.)
The DEM is a 0.5 m resolution Digital Terrain Model (no vegetation or buildings) created from LiDAR data collected on 13 August 2009 and provided by the Environment Agency (http://www.geomatics-group.co.uk). The model grid resolution should be 2 m (or ~97,000 nodes in the 0.388 km² area modelled).
All buildings at the real location (Cockenzie Street and surrounding streets in Glasgow,
UK) are ignored and the modelling is carried out using the “bare-earth” DEM provided.
All boundaries in the model area are closed (no flow) and the initial condition is dry bed.
The model is run until time T = 5 hours to allow the flood to settle in the lower parts of the
modelled domain.
3.3.2 Setup
Simulations are performed using four meshes with different resolutions (see Table 3.2). The four meshes use regular quadrilateral elements with grid spacings of 2 m, 1 m, 0.5 m and 0.25 m, respectively. Mesh A corresponds to the original mesh used in the EA2D test, and the additional meshes are obtained by refining this mesh.
The shallow water equations are solved using the first-order scheme in time and space.
The averaged time step for the simulations using Mesh A, B, C and D is 0.22 s, 0.10 s, 0.05 s and 0.027 s, respectively.
3.4.1 Description
The modelled area is approximately 0.4 km by 0.96 km and entirely covers the DEM provided and shown in Figure 3.8. Ground elevations span a range of ~21 m to ~37 m.
Figure 3.9  Inflow hydrograph applied for the EA2D Test 8B at the upstream end of the culvert
The DEM is a 0.5 m resolution Digital Terrain Model (no vegetation or buildings) created from LiDAR data collected on 13 August 2009 and provided by the Environment Agency (http://www.geomatics-group.co.uk). The model grid resolution should be 2 m (or ~97,000 nodes in the 0.388 km² area modelled).
The presence of a large number of buildings in the modelled area is taken into account.
Building outlines are provided with the dataset. Roof elevations are not provided.
All boundaries in the model area are closed (no flow) and the initial condition is dry bed.
The model is run until time T = 5 hours to allow the flood to settle in the lower parts of the
modelled domain.
3.4.2 Setup
Simulations are performed using four meshes with different resolutions (see Table 3.3). The four meshes use regular quadrilateral elements with grid spacings of 2 m, 1 m, 0.5 m and 0.25 m, respectively. Mesh A corresponds to the original mesh used in the EA2D test, and the additional meshes are obtained by refining this mesh.
The shallow water equations are solved using the first-order scheme in time and space.
The averaged time step for the simulations using Mesh A, B, C and D is 0.27 s, 0.15 s, 0.076 s and 0.025 s, respectively.
Table 4.1  Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space

Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             5.02          1                   5.27          1
         2             5.98          0.83                6.34          0.83
Mesh B   1             29.38         1                   36.89         1
         2             26.46         1.11                29.55         1.34
Mesh C   1             147.02        1                   207.64        1
         2             96.04         1.53                126.29        1.64
Mesh D   1             125.35        1                   186.56        1
         2             67.80         1.84                97.67         1.91
Mesh E   1             503.46        1                   765.17        1
         2             255.98        1.96                382.45        2.00
Table 4.2  Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using higher-order scheme in time and space
Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             11.24         1                   12.62         1
         2             13.10         0.85                14.21         0.88
Mesh B   1             90.22         1                   118.44        1
         2             68.32         1.32                81.63         1.45
Mesh C   1             558.93        1                   783.70        1
         2             331.30        1.68                428.26        1.82
Mesh D   1             518.49        1                   717.68        1
         2             264.90        1.95                372.80        1.92
Mesh E   1             2164.98       1                   3071.29       1
         2             1095.47       1.97                1518.02       2.02
Figure 4.1 Speedup factor, tGPU(1)/tGPU(n), for two GPUs relative to a single GPU using first-order
scheme in time and space. Blue line: single precision; Black line: double precision;
Red line: Ideal speedup factor
Figure 4.2 Speedup factor, tGPU(1)/tGPU(n), for two GPUs relative to a single GPU using higher-
order scheme in time and space. Blue line: single precision; Black line: double
precision; Red line: Ideal speedup factor
Table 4.3 Computational time, tCPU(n), using no GPU acceleration (1 and 16 subdomains and 1
thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-
order scheme and higher-order scheme in time and space
Mesh     No. of domains   First-order Time (s)   First-order Speedup Factor   Higher-order Time (s)   Higher-order Speedup Factor
         n                tCPU(n)                tCPU(1)/tCPU(n)              tCPU(n)                 tCPU(1)/tCPU(n)
Mesh A   1                57.87                  1                            174.61                  1
         16               5.73                   10.09                        15.27                   11.43
Mesh B   1                1253.81                1                            4848.75                 1
         16               94.07                  13.32                        314.16                  15.34
Mesh C   1                10866.48               1                            37241.11                1
         16               844.58                 12.86                        2818.74                 13.21
Mesh D   1                10917.15               1                            37830.65                1
         16               964.27                 11.32                        3406.35                 11.10
Table 4.4  Speedup factors, tCPU(1)/tGPU(n) and tCPU(16)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space
Table 4.5  Speedup factors, tCPU(1)/tGPU(n) and tCPU(16)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using higher-order scheme in time and space
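The factors reported in Tables 4.4 and 4.5 follow directly from the timings above; as a worked example using the first-order, single-precision Mesh C times from the GPU and CPU tables:

```python
# Worked example of the speedup factors in Tables 4.4/4.5, using the first-order,
# single-precision Mesh C timings given above.
t_cpu_1  = 10866.48   # 1 CPU subdomain, no GPU
t_cpu_16 = 844.58     # 16 CPU subdomains, no GPU
t_gpu_1  = 147.02     # 1 GPU

print(round(t_cpu_1 / t_gpu_1, 1))    # tCPU(1)/tGPU(1)  -> ~73.9
print(round(t_cpu_16 / t_gpu_1, 1))   # tCPU(16)/tGPU(1) -> ~5.7
```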
Figure 4.3  Speedup factor, tCPU(1)/tGPU(n), for one and two GeForce GTX Titan cards using first-order scheme. Blue line: single precision; Black line: double precision; solid line: 1 GPU; dashed line: 2 GPUs
Figure 4.4  Speedup factor, tCPU(1)/tGPU(n), for one and two GeForce GTX Titan cards using higher-order scheme. Blue line: single precision; Black line: double precision; solid line: 1 GPU; dashed line: 2 GPUs
Table 4.6  Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space

Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             1920.08       1                   2127.53       1
         2             1355.13       1.41                1563.44       1.36
Table 4.7  Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using higher-order scheme in time and space
Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             3983.48       1                   4573.40       1
         2             2920.90       1.36                3405.28       1.34
Table 4.8 Computational time, tCPU(n), using no GPU acceleration (1 and 16 domains and 1
thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-
order scheme and higher-order scheme in time and space
Mesh     No. of domains   First-order Time (s)   First-order Speedup Factor   Higher-order Time (s)   Higher-order Speedup Factor
         n                tCPU(n)                tCPU(1)/tCPU(n)              tCPU(n)                 tCPU(1)/tCPU(n)
Mesh A   1                24071.79               1                            92536.13                1
         16               3387.64                7.10                         15621.08                5.92
Table 4.9  Speedup factors, tCPU(1)/tGPU(n) and tCPU(16)/tGPU(n). Simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space
Table 4.10  Speedup factors, tCPU(1)/tGPU(n) and tCPU(16)/tGPU(n). Simulations are carried out using single precision (SP) and double precision (DP) and higher-order scheme in time and space
Table 4.11  Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space

Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             115.97        1                   141.78        1
         2             99.35         1.16                111.16        1.27
Mesh B   1             631.94        1                   803.02        1
         2             415.50        1.52                508.99        1.57
Mesh C   1             3590.63       1                   4704.55       1
         2             2084.08       1.72                2711.03       1.73
Mesh D   1             27741.35      1                   34994.81      1
         2             13910.62      1.99                18216.11      1.92
Figure 4.5 Speedup factor, tGPU(1)/tGPU(n), for two GPUs relative to a single GPU. Blue line:
single precision; Black line: double precision; Red line: Ideal speedup factor
Table 4.12 Computational time, tCPU(n), using no GPU acceleration (1 and 16 domains and 1
thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-
order scheme in time and space
Mesh     No. of domains   Time (s)    Speedup Factor
         n                tCPU(n)     tCPU(1)/tCPU(n)
Mesh A   1                1494.00     1
         16               157.48      9.48
Mesh B   1                14751.77    1
         16               1791.74     8.23
Mesh C   1                106463.08   1
         16               14514.62    8.10
Table 4.13  Speedup factors, tCPU(1)/tGPU(n) and tCPU(16)/tGPU(n). Simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space
Table 4.14  Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space

Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             99.42         1                   122.11        1
         2             84.04         1.18                94.80         1.28
Mesh B   1             449.79        1                   484.17        1
         2             266.65        1.68                296.16        1.63
Mesh C   1             2038.29       1                   2445.55       1
         2             1120.32       1.81                1275.39       1.91
Mesh D   1             11673.43      1                   14927.29      1
         2             5822.06       2.00                7179.54       2.07
Figure 4.6 Speedup factor, tGPU(1)/tGPU(n), for two GeForce GTX Titan cards relative to a single
GeForce GTX Titan card. Blue line: single precision; Black line: double precision;
Red line: Ideal speedup factor
Table 4.15 Computational time, tCPU(n), using no GPU acceleration (1 and 16 domains and 1
thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-
order scheme in time and space
Mesh     No. of domains   Time (s)   Speedup Factor
         n                tCPU(n)    tCPU(1)/tCPU(n)
Mesh A   1                762.43     1
         16               107.07     7.12
Mesh B   1                6322.43    1
         16               953.09     6.63
Mesh C   1                47032.28   1
         16               8434.83    5.57
Table 4.16  Speedup factors, tCPU(1)/tGPU(n) and tCPU(16)/tGPU(n). Simulations are carried out using single precision (SP) and double precision (DP) and using first-order scheme in time and space
Computational time, tGPU(1), per mesh, using single precision (SP) and double precision (DP) and the first-order and higher-order schemes in time and space (Mesh D, DP: 22745.63 s).
Table 6.2  Computational time, tGPU(n), using GPU acceleration (multi domains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double precision (DP) and using higher-order scheme in time and space

Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             11.85         1                   13.73         1
Mesh B   1             104.53        1                   151.12        1
Mesh C   1             650.18        1                   1009.37       1
Mesh D   1             585.02        1                   924.44        1
Mesh E   1             2385.04       1                   4937.59       1
Figure 6.1 Speedup factor, tGPU(1)/tGPU(n), using double precision and higher-order scheme in
time and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue
line: Mesh D; Green line: Mesh E; Red line: Ideal speedup factor
Mesh     No. of GPUs   SP Time (s)   SP Speedup Factor   DP Time (s)   DP Speedup Factor
         n             tGPU(n)       tGPU(1)/tGPU(n)     tGPU(n)       tGPU(1)/tGPU(n)
Mesh A   1             137.87        1                   161.3         1
Mesh B   1             691.93        1                   873.54        1
Mesh C   1             3963.79       1                   5213.29       1
Mesh D   1             29312.43      1                   38412.53      1
Figure 6.2 Speedup factor, tGPU(1)/tGPU(n), using double precision and first-order scheme in time
and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue line:
Mesh D; Green line: Mesh E; Red line: Ideal speedup factor
Mesh     Time (s)
         tGPU(1)
Mesh A   9.65
Mesh B   52.03
Mesh C   282.89
Mesh D   254.52
Mesh E   1226.95
Table 7.2  Computational time, tGPU(n), using GPU acceleration (multi domains and 1 thread) and speedup factor, tGPU(1)/tGPU(n). The simulations are carried out using double precision (DP) and using higher-order scheme in time and space

Mesh     No. of GPUs   Time (s)   Speedup Factor
         n             tGPU(n)    tGPU(1)/tGPU(n)
Mesh A   1             18.86      1
         4             25.3       0.74
Mesh B   1             157.87     1
         4             106.26     1.48
Mesh C   1             1056.56    1
         4             362.11     2.91
Mesh D   1             948.59     1
         4             294.82     3.21
Mesh E   1             5015.65    1
         4             1073.62    4.67
Figure 7.1 Speedup factor, tGPU(1)/tGPU(n), using double precision and higher-order scheme in
time and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue
line: Mesh D; Green line: Mesh E; Red line: Ideal speedup factor
Mesh     No. of GPUs   Time (s)   Speedup Factor
         n             tGPU(n)    tGPU(1)/tGPU(n)
Mesh A   1             165.14     1
         4             162.03     1.01
Mesh B   1             879.11     1
         4             551.8      1.59
Mesh C   1             5239.08    1
         4             1757.38    2.98
Mesh D   1             43723.99   1
         4             10991.37   3.97
Figure 7.2 Speedup factor, tGPU(1)/tGPU(n), using double precision and first-order scheme in time
and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue line:
Mesh D; Green line: Mesh E; Red line: Ideal speedup factor
8 Discussion
The performance strongly depends on the graphics card. For simulations having a small
number of elements in the computational grid, the best performance is obtained with the
GeForce GTX Titan card while for simulations with a large number of elements the best
performance is obtained with the GeForce GTX 1080 Ti card. Since the Tesla K80 is a
dual card consisting of two GPUs, it can be seen as one entity having two GPUs or it can be thought of as two separate GPUs. In this report, the latter view has been chosen. This
approach eliminates the time spent on domain decomposition and communication
between subdomains from the comparison. However, if considering both GPUs on the
Tesla K80 card as one entity, then the Tesla K80 card and the GeForce GTX 1080 Ti
card show similar performance for simulations having a large number of elements. This is
illustrated in Figure 8.1, which shows the speedup for the various graphics cards tested in
this report compared to the performance of the GeForce GTX Titan card.
Considering double precision calculations, the GeForce GTX 1080 Ti card is, for very large meshes, up to 36% faster than the GeForce GTX Titan and up to 60% faster than the Tesla K80 card (using only 1 of the 2 GPUs on the card). This might seem odd, since the Tesla line of graphics cards is designed for numerical calculations whereas the GeForce cards are primarily used for gaming. The explanation is found in the hardware specifications for the various cards, where the GeForce GTX 1080 Ti is superior in a number of specifications compared to the other cards, see Table 2.2. When utilising the entire Tesla K80 card (both GPUs), the Tesla K80 is up to 37% faster than a single GeForce GTX Titan card and performs about the same as the GeForce GTX 1080 Ti.
Figure 8.1  Comparison of the performance using different graphics cards. The speedup factor is the ratio of the computational time using the GeForce GTX Titan card to the computational time using the other cards, for the Mediterranean Sea case with higher-order scheme in time and space. Black line: GeForce GTX 1080 Ti card; Blue line: Tesla K80 card (1 GPU); Green line: Tesla K80 (2 GPUs)
When the number of wet elements in the considered problem is sufficiently high, it is possible to obtain nearly ideal speedup using multiple GPUs relative to using one GPU. Depending on the considered problem it is even possible to obtain superlinear speedup (Table 7.2, for instance, contains a speedup factor of 4.67 on four GPUs). The communication overhead naturally increases when using multiple GPUs, but as long as the problem is large enough to give each GPU a full workload, the scalability over multiple GPUs is very good.
The Tesla K80 card has been tested on hardware platform 3 (Windows) and hardware
platform 4 (Linux), and the card shows similar performance for the two platforms.
9 Conclusions
The overall conclusions of the benchmarks are:
• The numerical scheme and the implementation of the GPU version of the MIKE 21 Flow Model FM are identical to the CPU version of MIKE 21 Flow Model FM. Simulations without flooding and drying produce identical results using the two versions. Simulations with extensive flooding and drying produce results that may contain small differences.
• The performance of the GPU version of MIKE 21 Flow Model FM depends highly on the graphics card and the model setup. When the performance is evaluated by comparison with a single-core (no parallelisation) CPU simulation, it also depends highly on the specifications of the CPU.
• The speedup factor of simulations with no flooding and drying increases with an increasing number of elements in the computational mesh. When the number of elements becomes larger than approximately 400,000, there is only a very limited further increase in the speedup factor.
• The use of multiple GPUs shows excellent performance. To obtain the optimal speedup factor, a large number of elements is required for each subdomain.
10 References
/1/ Andersen, O.B., 1995, Global ocean tides from ERS-1 and TOPEX/POSEIDON altimetry, J. Geophys. Res., 100 (C12), 25,249-25,259.
/2/ Christensen, B.B., Drønen, N., Klagenberg, P., Jensen, J., Deigaard, R. and Sørensen, P., 2013, Multiscale modelling of coastal flooding, Coastal Dynamics 2013, paper no. 053.
/3/ Néelz, S. and Pender, G., 2010, Benchmarking of 2D Hydraulic Modelling Packages, report published by the Environment Agency, www.environment-agency.gov.uk. Copies of this report are available from the publications catalogue: http://publications.environment-agency.gov.uk