
2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

Revealing Potential Performance Improvements By Utilizing Hybrid Work-Sharing For Resource-Intensive Seismic Applications

Patrick Siegl, Rainer Buchty, Mladen Berekovic
Chair for Chip Design for Embedded Computing (C3E)
Technische Universität Carolo-Wilhelmina zu Braunschweig
[siegl|buchty|berekovic]@c3e.cs.tu-bs.de

Abstract—Heterogeneous system architectures are becoming more and more of a commodity in the scientific community. While it remains challenging to fully exploit such architectures, the benefits in performance and hybrid speed-up, gained by using a host processor and accelerators in parallel in a non-monolithic manner, are significant. Energy efficiency is hereby becoming an increasingly critical challenge for future high-performance computing (HPC) systems, which aim to exceed the Exascale barrier with several competing architecture concepts, ranging from high-performance CPUs combined with GPUs acting as floating-point accelerators to computationally weak CPUs paired with dedicated and highly performant FPGA-based accelerators. In this paper, we realize and evaluate a hybrid computing approach based on a two-dimensional seismic streaming algorithm on several heterogeneous system architectures, including conventional HPC approaches based on powerful CPUs and GPUs. Furthermore, we elaborate the effort on an embedded system platform claiming to be a mini supercomputer [1]. Several CPU and accelerator combinations are utilized in a manual work-sharing manner with the aim of achieving significant performance speed-ups, complemented by a detailed energy-efficiency study. Based on roofline models and experimental evaluations, the paper provides an insight into the fact that hybrid computing is mostly unconditionally beneficial for balanced systems regarding performance as well as energy efficiency, aiding the programmer in the decision whether or not costly, manually tuned, homogeneous implementations are worthwhile.

Keywords—narrow stencil computing, (embedded) hybrid work-sharing, multi-GPU processing, performance evaluation

I. INTRODUCTION

Accelerators in heterogeneous systems are mostly used to offload entire to-be-accelerated computations in a monolithic manner, meaning that the CPU waits for the accelerator to finish its calculation [2]. Efforts previously spent on hand-tuning algorithms to exploit the specific properties of that CPU would thereby be wasted. Several techniques developed by AMD (see HSA [3]) and Nvidia (see OpenACC [4]) address the serial-versus-parallel distribution on a heterogeneous system, but are missing a simultaneous parallel heterogeneous execution of parallel code on both host processor and accelerator. Some attempts at exploiting shared computation, leveraging the benefits of hand-tuned CPU code and simultaneously running accelerators, have been evaluated with promising results [5], [6]. OpenMP seems to take the direction towards a parallel heterogeneous computation (see the proposed OpenMP clause hetero() [7]), which likewise shows significant performance improvements on some benchmarks [8]. Hence, this paper targets a hybrid approach that is analyzed with regard to speed-up and energy efficiency, both crucial aspects of future HPC systems. For this, a seismic streaming algorithm was chosen, which provides required features of typical and future HPC applications, and was therefore subsequently implemented and evaluated. In order to cover a fair amount of heterogeneous architectures, not only typical HPC systems were chosen but also a dedicated embedded system. This was done with respect to recent developments showing the potential benefits of clustered low-performance CPUs with dedicated HPC accelerators. As a result of these measurements, the paper presents an overview of suitable aspects of a hybrid implementation framework, addressing the question whether or not costly manual optimizations are beneficial for the given instruction-set architectures (ISAs).

The paper is structured as follows: Section II reviews the current situation of the HPC market and future trends towards Exascale. Section III provides an overview of the chosen seismic algorithm, while Section IV explains the implemented application, providing various hybrid execution models. The evaluated heterogeneous systems are introduced in Section V, which also includes a homogeneous performance projection based on roofline models. Experimental results are analyzed in Section VI. The paper closes with Section VII, which contains concluding remarks and an outlook to future work.

II. STATE OF THE ART

The HPC market is driven mainly by three major challenges: energy efficiency, computing power, and reliability. Some of them can only be addressed by a trade-off during hardware design, while others can also be addressed plainly by software. Since 2007, HPC systems have also been ranked by energy efficiency, showing that heterogeneous systems improve this efficiency by a wide margin [9]. In such heterogeneous systems, mostly GPUs distributed by the vendor Nvidia and, alternatively, the Many-Integrated-Core architecture offered by Intel (e.g., the no. 1 system TSUBAME-KFC [10]) are utilized as accelerators. Based on such a heterogeneous approach, the major direction nowadays seems to be the use of a strong CPU, interconnected to an accelerator via a high-performance bus (with one notable exception being the IBM BlueGene/Q series).

Although the Top500-listed systems are driven by performant architectures like x86-64, POWER, and SPARC, scientists are investigating the energy-efficient but computationally weak ARM architecture as a potential CPU for future Exascale systems [11]. Additionally, two further major trends can be identified, which drive heterogeneous computing towards the actual Exascale target:

1) Near-data computing, minimizing data movement [12].
2) FPGA-accelerated computing, where specific algorithms are synthesized and loaded into the FPGA hardware of choice [13].

Listing 1. Pseudo-code of the seismic forward propagation algorithm

// loop over the time steps
for( t = 0; t < TimeSteps; t++ ){
  // inject the seismic pulse into the Actual Pressure Field
  APF[xpulse][ypulse] += SeismicPulse[t];
  // iterate over the X and Y dimensions
  for( i = 2; i < (dimX-2); i++ ){
    for( j = 2; j < (dimY-2); j++ ){
      // calculate the 2-dimensional stencil and
      // store the result into the Next Pressure Field
      NPF[i][j] = 2.0f * APF[i][j] - PPF[i][j] + VEL[i][j]
                  * ( 16.0f * ( APF[i][j-1] + APF[i][j+1]
                              + APF[i-1][j] + APF[i+1][j] )
                    -  1.0f * ( APF[i][j-2] + APF[i][j+2]
                              + APF[i-2][j] + APF[i+2][j] )
                    - 60.0f * APF[i][j] );
    }
  }
  // switch the pointers of the pressure fields
  TMP = PPF; PPF = APF; APF = NPF; NPF = TMP;
}

Fig. 1. Hybrid work-sharing / partitioning: the pressure field is split between processor and accelerator; due to the stencil, each side additionally needs a small overlap of the other side's data at the split.

Fig. 2. Seismic x86 & OpenCL application, possible hybrid-/homogeneous-usage scenarios. Main processor: plain C (incl. pthreads), SIMD SSE (incl. pthreads), OpenCL CPU; accelerator: OpenCL GPU std., OpenCL GPU image.

Fig. 3. Seismic ARM & Epiphany application, possible hybrid-/homogeneous-usage scenarios. Main processor: plain C (incl. pthreads), SIMD NEON (incl. pthreads); accelerator: Epiphany.
Even though it does not particularly address realistic workloads, the High-Performance LINPACK (HPL) benchmark is still being used as an indicator for the fastest machine on earth [14]. Using more realistic benchmarks for evaluating systems has been proposed, but no suitable HPL successor has yet been established [15]. One application area featuring potentially realistic benchmarks is the field of geo-sciences. In respect of these applications, particularly seismic imaging algorithms are crucial for the HPC market and have been driven by the demands of the oil & gas industry [15].

III. SEISMIC REVERSE TIME MIGRATION

Seismic imaging is a geological process which analyses the geological subsurface of a particular earth model by injecting acoustic signals into a specific subsurface. Within seismic imaging, three stages are crucial: 1) data acquisition, 2) data processing, and 3) data interpretation [2]. While all three stages are important in the industry for finding materials within a subsurface, the paper concentrates on the compute-intensive second stage.

The widely used Reverse Time Migration (RTM) technique is one of several algorithms (see e.g. Kirchhoff [16]) used within the data-processing stage to recalculate a specific subsurface back in time in relation to a given source subsurface [17], [18]. Combined with the result subsurface and the forward-based computation (in relation to the source subsurface), RTM can be used to evaluate the foreseen history of the subsurface.

The paper focuses on forward modeling the acoustic wave propagation, based on research initiated by Perrone et al. [19] and Grosser et al. [2], [20], [21]. Hereby, a 2-dimensional seismic forward propagation algorithm has been chosen to evaluate heterogeneous system architectures (Listing 1).

A. Seismic Forward Propagation Algorithm

The 2-dimensional seismic algorithm uses four matrices and one vector (Listing 1). While one of the four matrices represents the velocity field (VEL), the other three matrices contain the actual (APF; time step 0: source subsurface), previous (PPF), and next (NPF; time step n: result subsurface) pressure-field information. A vector is used to keep the seismic pulse information, which gets injected into a specific position.

Unoptimized, the given algorithm requires 15 single-precision FLOPs¹ per stencil (ten additions/subtractions and five multiplications). Exchanging the minus-one multiplication for a negation can eliminate one FLOP, whereby the operational intensity (see roofline models [22], [23]) decreases from 0.3125 FLOPs/byte to 0.2917 FLOPs/byte (= 14 FLOPs / (4 byte * (11 ld + 1 st))).

Depending on the ISA, four-wide fused-multiply instructions providing two FLOPs per cycle can be utilized for the algorithm (ref.: x86-64 SSE). A reduction in memory consumption can be achieved by combining the PPF and NPF fields: because only NPF and APF are needed for the following time step, PPF can be overwritten directly by the newly generated pressure field instead of writing it into NPF.

¹ FLOPs: number of floating-point operations; Flops: floating-point operations per second.
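A minimal sketch of this in-place variant (hypothetical code, not the authors' exact implementation): since PPF[i][j] is read only by the stencil centered at (i, j), its slot can safely receive the newly computed value, saving one field and one pointer swap per time step.

// sketch: PPF doubles as the output field (in-place update)
float old = PPF[i][j];                  // previous value, read exactly once
PPF[i][j] = 2.0f * APF[i][j] - old
          + VEL[i][j] * ( /* stencil sum as in Listing 1 */ );
// ... after the i/j loops, only two pointers rotate:
TMP = PPF; PPF = APF; APF = TMP;        // PPF buffer now holds time step t+1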
Hybrid computing, in which both the CPU and the accelerator work in parallel while using the same algorithm on a shared amount of the same problem, seems to be a viable approach to extract the most Flops from heterogeneous systems (Fig. 1). To support such work-sharing, an axis can be partitioned into an area being processed by the CPU and another area processed by the accelerator. Depending on which of the heterogeneous ISAs is being evaluated, data exchange is needed (due to the overlapping stencil in the split region), as well as one or more mandatory synchronization barriers (global / local) during each time step; a minimal sketch of such a time step follows below.
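The following sketch (hypothetical helper names; the evaluated application realizes this with pthreads and OpenCL as described in Section IV) illustrates one hybrid time step with a static split along one axis and the halo exchange required by the two-point-wide stencil:

// sketch of one hybrid time step (assumed helpers, not the paper's API)
int split = (int)(ratio * dimX);            // static partitioning along X

for( t = 0; t < TimeSteps; t++ ){
    launch_accelerator_rows(split, dimX - 2);   // asynchronous launch
    compute_cpu_rows(2, split);                 // host works in parallel
    wait_for_accelerator();                     // global barrier

    // exchange the two overlapping halo rows on each side of the split,
    // since the stencil reaches two points across the boundary
    copy_rows_to_accelerator(split - 2, split);
    copy_rows_from_accelerator(split, split + 2);

    swap_pressure_fields();                     // as in Listing 1
}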

IV. IMPLEMENTATION ON VARIOUS ARCHITECTURES

The evaluated homogeneous x86-64 CPU and OpenCL GPU implementations are based on those that had already been evaluated by Grosser et al. [2]. Based on the homogeneous implementations, further ones targeting new ISAs and the hybrid partitioning have been developed (Figs. 2 and 3).

To conclude which implementation yields the most speed-up and Flops, an evaluation application (targeting x86-64 CPUs interconnected to GPUs featuring OpenCL support) has been developed which contains all of the CPU implementations (plain C; threaded plain C; SSE intrinsics with various variants: default, aligned, unaligned, aligned not grouped; threaded SSE intrinsics with the same variants; OpenCL for CPU) and all of the GPU implementations (OpenCL GPU standard: non-aligned global access, local memory utilization; OpenCL GPU image: aligned global access, float4-vector image primitives). Here, the plain C code represents the canonical / naive implementation (single thread; no SIMD; compiler optimizations allowed) that has been used as the performance reference of the individual platform. In fact, an AVX intrinsics code has been developed but dropped, because the seismic algorithm requires an efficient shuffle instruction² (a sketch of the SSE inner loop follows at the end of this section).

Under user control, the application provides the manual ability to either run just the CPU or just the accelerators in a homogeneous fashion. Furthermore, the seismic image can be manually partitioned into two parts so that both CPU and n accelerators can process in parallel in a hybrid fashion (Fig. 2).

Targeting a heterogeneous ARM system, a similar application based on the latter has been developed, which likewise provides various CPU implementations (plain C; threaded plain C; NEON intrinsics (aligned); threaded NEON intrinsics (aligned)) and an assembly-tuned Epiphany implementation supporting all 16 Epiphany cores (Fig. 3). Likewise, hybrid work-sharing is supported as well.

² In detail: the 256-bit AVX 1.0 provides two lanes for shuffling, each 128 bit wide. Shuffling elements to the left or to the right results in losing the elements in-between; reloading them turned out to be slower than SSE.
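As referenced above, a minimal sketch of how the unaligned SSE inner loop of the stencil may look (an illustration under the data layout of Listing 1, not the authors' exact code; remainder iterations and the threaded row partitioning are omitted):

#include <xmmintrin.h>   // SSE intrinsics

// process four j-positions of row i per iteration
for( j = 2; j + 3 < (dimY - 2); j += 4 ){
    __m128 c   = _mm_loadu_ps( &APF[i][j] );                  // centre points
    __m128 n1  = _mm_add_ps(                                  // +-1 neighbours
                   _mm_add_ps( _mm_loadu_ps(&APF[i][j-1]),
                               _mm_loadu_ps(&APF[i][j+1]) ),
                   _mm_add_ps( _mm_loadu_ps(&APF[i-1][j]),
                               _mm_loadu_ps(&APF[i+1][j]) ) );
    __m128 n2  = _mm_add_ps(                                  // +-2 neighbours
                   _mm_add_ps( _mm_loadu_ps(&APF[i][j-2]),
                               _mm_loadu_ps(&APF[i][j+2]) ),
                   _mm_add_ps( _mm_loadu_ps(&APF[i-2][j]),
                               _mm_loadu_ps(&APF[i+2][j]) ) );
    // 16*n1 - n2 - 60*c, then 2*c - PPF + VEL*(...) as in Listing 1
    __m128 lap = _mm_sub_ps( _mm_mul_ps(_mm_set1_ps(16.0f), n1),
                 _mm_add_ps( n2, _mm_mul_ps(_mm_set1_ps(60.0f), c) ) );
    __m128 r   = _mm_add_ps( _mm_sub_ps( _mm_add_ps(c, c),
                                         _mm_loadu_ps(&PPF[i][j]) ),
                             _mm_mul_ps( _mm_loadu_ps(&VEL[i][j]), lap ) );
    _mm_storeu_ps( &NPF[i][j], r );
}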
V. PERFORMANCE EVALUATION OVERVIEW

The majority of the evaluated systems are based on Intel's x86-64 ISA in combination with Nvidia's Tesla or Fermi microarchitectures. Because the focus of the paper is also on low-power stencil processing, a further combination of an ARM-v7 ISA and an Epiphany-III microarchitecture has also been evaluated (Tables I to III). Two out of the three Intel systems were equipped with two Nvidia GPUs (the systems equipped with the C1060 and the 8600GTS), so that it was possible to compare the following hybrid combinations: 1) CPU and one or two GPUs, and 2) just two GPUs under CPU control.

TABLE I. CONFIGURATION OF THE MAIN PROCESSORS.

Processor                    | Frequency | Cores (Threads) | Compiler  | SDK                  | Global Memory
Intel Core i7-2600k          | 3.40 GHz  | 4 (8)           | GCC 4.6.2 | Intel OpenCL SDK 1.5 | 8 GB
Intel Xeon E5405             | 2.00 GHz  | 4 (4)           | GCC 4.3.2 | Intel OpenCL SDK 1.5 | 8 GB
Intel Core i5-660            | 3.33 GHz  | 2 (4)           | GCC 4.6.2 | Intel OpenCL SDK 1.5 | 8 GB
Xilinx Zynq XC7Z020-1CLG400C | 677 MHz   | 2 (2)           | LLVM 3.4  | -                    | 1 GB

TABLE II. CONFIGURATION OF THE ACCELERATORS.

Accelerator                     | Frequency | SDK                                    | Interconnect                              | Accelerator Memory | Global Memory
Nvidia GeForce GTX560Ti (GF114) | 822 MHz   | Nvidia OpenCL 295.20                   | PCIe 2.0 x16                              | 1 GB               | 8 GB
Nvidia Tesla C1060              | 602 MHz   | Nvidia OpenCL 285.05.09                | both via PCIe 2.0 x16 host interface card | 4 GB               | 8 GB
Nvidia GeForce 8600GTS          | 675 MHz   | Nvidia OpenCL 290.10                   | both via PCIe 2.0 x16, second over DMI    | 256 MB             | 8 GB
Adapteva Epiphany E16G301       | 600 MHz   | Epiphany Libs 2014-06-25 (E-GCC 4.8.2) | custom AXI bus to eLink glue-logic        | 512 KB             | 1 GB

TABLE III. EVALUATED HETEROGENEOUS SYSTEMS.

Device Model                    | Type | Single / Double Peak Performance [GFlops] | Peak Memory Bandwidth [Gbyte/sec] | TDP (chip only) [watts] | Energy Efficiency (single) [GFLOPs/joule]
Intel Core i7-2600k             | CPU  | 217.6 / 108.8  | 21.2  | 95    | 2.3
Nvidia GeForce GTX560Ti (GF114) | GPU  | 1263.4 / 105.3 | 128.3 | 170   | 7.4
Intel Xeon E5405                | CPU  | 149.1 / 74.6   | 25.6  | 80    | 1.8
Nvidia Tesla C1060              | GPU  | 933.0 / 78.0   | 102   | 187.8 | 5.0
Intel Core i5-660               | CPU  | 53.3 / 26.7    | 21.2  | 73    | 0.7
Nvidia GeForce 8600GTS          | GPU  | 139.2 / -      | 32    | 71    | 2.0
Xilinx Zynq XC7Z020-1CLG400C    | CPU  | 10.8 / 2.7     | 4.2   | ~4.8³ | ~2.2
Adapteva Epiphany E16G301       | ACC  | 19.2 / -       | 1.3   | ~0.9  | ~20.6

³ 5.8 W measured during the pthread + NEON seismic workload on the CPU; subtracting 0.95 W for the Adapteva Epiphany E16G301: ~4.85 W.

A. Roofline Models

With an operational intensity of 0.2917 FLOPs/byte, the given seismic algorithm is clearly affected by the memory bandwidth (Figs. 4 to 7).

Fig. 4. Roofline models: Intel Core i7-2600k & Nvidia GeForce GTX560Ti (GF114). Attainable GFlops over operational intensity [FLOPs/byte]; ceilings: peak single/double precision and peak stream bandwidth; the stencil's operational intensity is marked.
Fig. 5. Roofline models: Intel Xeon E5405 & 2x Nvidia Tesla C1060.
Fig. 6. Roofline models: Intel Core i5-660 & 2x Nvidia GeForce 8600GTS.
Fig. 7. Roofline models: Xilinx Zynq XC7Z020-1CLG400C & Adapteva Epiphany E16G301.
Fig. 8. Evaluation of the various implementations / combinations (speed-up over the plain C baseline for the image resolutions 2200x748, 4400x1496, 6600x2244, and 8800x2992): Intel Core i7-2600k & Nvidia GeForce GTX560Ti (GF114).
Fig. 9. Evaluation of the various implementations / combinations: Intel Xeon E5405 & 2x Nvidia Tesla C1060.
Fig. 10. Evaluation of the various implementations / combinations: Intel Core i5-660 & 2x Nvidia GeForce 8600GTS.
Fig. 11. Performance and energy-efficiency results (speed-up and nJ/stencil): Intel Core i7-2600k & Nvidia GeForce GTX560Ti (GF114).
Fig. 12. Performance and energy-efficiency results: Intel Xeon E5405 & 2x Nvidia Tesla C1060.
Fig. 13. Performance and energy-efficiency results: Intel Core i5-660 & 2x Nvidia GeForce 8600GTS.

It is recognizable that all of the Intel CPUs have nearly the same expected attainable GFlops, which is based on their almost equal peak memory bandwidth (Table III and Figs. 11 to 13). However, the attached GPUs provide significantly more attainable Flops and, due to their inequality, they will be the decisive criterion for the hybrid approach. The system equipped with the Nvidia 8600GTS seems to be more balanced, because the GPU does not even achieve twice the amount of GFlops compared to its CPU (Fig. 6). The ARM system is rather unusual, because the accelerator is substantially weaker compared to its CPU or to the GPUs. In addition, the Epiphany accelerator shares the CPU's memory (Fig. 7).

Roofline models are a growing instrument for making rough, quick-and-simple estimations of a specific ISA's performance regarding various algorithms. Still, they lack important criteria such as energy efficiency and reliability, which are significantly growing in importance.

VI. EXPERIMENTAL RESULTS

In order to have a common baseline, the generated speed-ups were measured against the plain C implementation. Depending on the available memory size, various seismic image resolutions (2200x748, 4400x1496, 6600x2244, and 8800x2992; time steps: 3000) were analyzed. Based on the homogeneous variants, several resulting hybrid combinations were evaluated; three of them are illustrated (Figs. 9 and 10). These show that the homogeneous SSE-tuned algorithm runs faster than the OpenCL version, which indicates certain overheads, such as scheduling, that deserve additional introspection. It turns out, however, that if both mentioned implementations are used in a hybrid computation, the smart techniques implemented within OpenCL now provide a significant speed-up, which results from a more efficient dynamic utilization of the CPU in contrast to the statically SSE-tuned code. Focusing just on the GPUs, it is beneficial to use the image-processing primitives to achieve the highest speed-up. For this paper, the implementation and partitioning with the highest speed-up was used to determine the maximum hybrid speed-up. Because the evaluated algorithm falls into the streaming category, which does not profit highly from caches (just from the prefetching mechanism), the average speed-ups of the various image resolutions and the energy efficiencies have been used for further analysis (Figs. 11 to 14).
The GPU in the first system comes with a significant speed-up of about 26.96 (Fig. 11). Using the hybrid variant slightly decreases the speed-up as well as the energy efficiency, which is a result of the hereby-inefficient CPU and its little speed-up compared to the GPU. For strong GPUs as seen here, it seems to be of significance to use the CPU only to shuffle data back and forth to keep the GPU busy, instead of partitioning the to-be-computed data in a hybrid fashion.

The second evaluated system (Fig. 12) comes with two GPUs, both of which create a major speed-up. Contrary to the expectation of a doubled performance when using two GPUs, we see a drop in further speed-up as a result of synchronization overhead that needs to be handled by the CPU. The latter seems to be too slow for synchronizing both GPUs; given the rather small data exchange, the PCIe bus cannot be the limiting factor. This results in no further speed-up if the CPU is also used in a hybrid computation. It has been determined that one thread on the CPU needs to take responsibility for one GPU. Also, the energy efficiency is significantly better when using the GPUs instead of the CPU. With two GPUs, synchronization and data-exchange overhead leads to slightly larger energy consumption per stencil and less energy efficiency. Still, the resulting efficiency is significantly superior to any computation including the CPU.

As shown in Section V-A, the third system is well balanced in terms of Flops, resulting in a speed-up which is almost the same on the CPU as on a single GPU. It seems that the CPU is fast enough to handle synchronization and data exchange, leading to a doubled speed-up if both GPUs are used and an even further improved speed-up if hybrid work-sharing is utilized⁴ (Fig. 12). However, energy efficiency drops when using the hybrid work-sharing approach (Fig. 13), while utilizing only the two GPUs does not decrease it noticeably. Some could argue that the CPU needs to be included in the calculation of the GPUs' energy efficiency. But in fact, the GPUs only require some weak embedded engine, which can restart the GPUs for each time step and realize both the eventual synchronization and the data exchange fast enough.

⁴ This might be attributed to the different ages/generations of the used CPUs and GPUs (i5-660: release year 2010; 8600GTS: 2007), so that the CPU can easily keep up with the synchronization. In comparison, the second system features a reverse set-up (E5405: 2007; C1060: 2009), resulting in insufficient speed-up.
Fig. 14. Performance and energy-efficiency results: Xilinx Zynq XC7Z020-1CLG400C & Adapteva Epiphany E16G301.

Because aggressive compiler optimizations were allowed throughout the implementations, the speed-up of the NEON code on the fourth system is not as significant as expected (Fig. 14). Still, the speed-up on the Xilinx Zynq could be improved by changing the memory-layout accesses to more suitable ones. Indicating a high energy efficiency of 133 nJ/stencil, the Zynq is about as energy-efficient as an Intel i5-660, while only the Intel i7-2600k provides a clearly better result. Because of the limited bandwidth of the shared memory bus, the accelerator does not provide any speed-up; the same applies especially to the hybrid case. An upper-bound speed-up of the Epiphany can be measured if data fetching is not considered: if all 16 cores are used, it takes 15861 cycles to compute a tiny 128x128 image⁵. This results in a theoretical speed-up of 43.7 compared to the naive CPU implementation⁶.

⁵ Using 3 out of 4 SRAM banks; double buffering; 1 time step.
⁶ Image: 2200x748; time steps: 3000; time spent: 348010.967 ms.
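From footnotes 5 and 6, the theoretical factor of 43.7 can be reproduced by a plain recalculation of the quoted numbers (ignoring boundary handling):

\[ t_{\mathrm{CPU}} = \frac{348010.967\,\mathrm{ms}}{2200 \cdot 748 \cdot 3000} \approx 70.5\,\mathrm{ns/stencil}, \qquad t_{\mathrm{Epiphany}} = \frac{15861\,\mathrm{cycles} / 600\,\mathrm{MHz}}{128 \cdot 128} \approx 1.61\,\mathrm{ns/stencil}, \]

and 70.5 ns / 1.61 ns is approximately 43.7.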
VII. CONCLUSION

Numerous algorithms have been extensively hand-tuned to fit specific ISAs. Keeping the cost and time of such optimizations in mind, the resulting code should be exploited in heterogeneous systems. The hybrid approach is thereby a good fit to achieve a tremendous performance boost while utilizing both the already present CPU and accelerators that are becoming more and more of a commodity these days. This is particularly the case for systems that are well balanced in terms of Flops, memory, and bus bandwidth. On systems with a stronger imbalance in the FLOP ratio between CPU and accelerator, it can be seen that the hybrid approach does not lead to a speed-up, due to the communication and synchronization; in such cases, the utilization of multiple accelerators can achieve further speed-up. During the evaluation, it was recognizable that the CPU requires at least one thread per accelerator to keep the accelerators permanently active, which decreases the amount of computation that can possibly be performed by the CPU. On the targeted embedded system, the limiting factor was the bus bandwidth between CPU and accelerator.

As a next step and based on this study, the authors plan to evaluate the new OpenMP clause hetero, which can potentially achieve hybrid computation in a similar manner as presented in this paper. Subsequent research shall focus especially on dynamic run-time partitioning (adaptive load balancing) to address recent GPUs with boost support.

REFERENCES

[1] Primeur Magazine, "A live report from the Adapteva A-1 smallest supercomputer in the world launch at ISC'14," www.primeurmagazine.com/weekly/AE-PR-07-14-104.html, June 2014, ISC 2014 session: HPC Startups: Innovation Brought to Life.
[2] T. Grosser, A. Gremm, S. Veith, G. Heim, W. Rosenstiel, V. Medeiros, and M. Eusebio de Lima, "Exploiting heterogeneous computing platforms by cataloging best solutions for resource intensive seismic applications," in INTENSIVE 2011, The Third International Conference on Resource Intensive Applications and Services, 2011, pp. 30-36.
[3] G. Kyriazis, "HSA: A Technical Review," 1st ed., AMD, August 2012.
[4] "The OpenACC API," 2nd ed., OpenACC, August 2013.
[5] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba, "Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment," in 7th International Meeting on High Performance Computing for Computational Science (VECPAR'06), 2006, pp. 41-50.
[6] T. Odajima, T. Boku, T. Hanawa, J. Lee, and M. Sato, "GPU/CPU work sharing with parallel language XcalableMP-dev for parallelized accelerated computing," in ICPP Workshops '12, 2012, pp. 97-106.
[7] T. R. W. Scogland, W. chun Feng, B. Rountree, and B. R. de Supinski, "CoreTSAR: Adaptive worksharing for heterogeneous systems," in ISC'14, Leipzig, Germany, June 2014.
[8] T. R. W. Scogland, B. Rountree et al., "Heterogeneous task scheduling for accelerated OpenMP," in IPDPS'12, May 2012, pp. 144-155.
[9] W.-c. Feng and K. Cameron, "The Green500 list: Encouraging sustainable supercomputing," Computer, vol. 40, no. 12, pp. 50-55, Dec. 2007.
[10] Green500, "The Green500 List - June 2014," http://green500.org/lists/green201406, 2014.
[11] N. Puzovic, "Mont-Blanc: Towards energy-efficient HPC systems," in Conf. Computing Frontiers, ser. CF '12. ACM, 2012, pp. 307-308.
[12] P. Dlugosch, D. Brown et al., "An efficient and scalable semiconductor architecture for parallel automata processing," IEEE Transactions on Parallel and Distributed Systems, vol. 99, no. PrePrints, p. 1, 2014.
[13] A. Putnam, A. Caulfield et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in ISCA'14, June 2014.
[14] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present, and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803-820, 2003.
[15] A. Gara, "The long term impact of codesign," in SC Companion '12, 2012, pp. 2212-2246.
[16] O. Yilmaz, Seismic Data Analysis, 2nd ed., ser. Investigations in Geophysics. Society of Exploration Geophysicists, Jan. 2001, vol. 10.
[17] B. Biondi and G. Shan, "Prestack imaging of overturned reflections by reverse time migration," in Expanded Abstracts, Soc. of Expl. Geophys., 72nd Ann. Internat. Mtg., 2002, pp. 1284-1287.
[18] R. Clapp, H. Fu, and O. Lindtjorn, "Selecting the right hardware for reverse time migration," The Leading Edge, vol. 29, no. 1, 2010.
[19] M. Perrone, "Finding oil with cells: Seismic imaging using a cluster of Cell processors," in Second SHARCNET Symposium on GPU and Cell Computing, May 2009.
[20] A. Gremm, "Acceleration, clustering, and performance evaluation of seismic applications," Master's thesis, Eberhard-Karls-Universität Tübingen, June 2011.
[21] P. Siegl, "Hybrid acceleration of a seismic application by combining traditional methods with OpenCL," Master's thesis, Eberhard-Karls-Universität Tübingen, April 2012.
[22] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009.
[23] G. Ofenbeck, R. Steinmann, V. C. Cabezas, D. G. Spampinato, and M. Püschel, "Applying the roofline model," in ISPASS'14, 2014.

