
Performance-Energy Optimizations for Shared Vector Accelerators in Multicores

Spiridon F. Beldianu, Member, IEEE, and Sotirios G. Ziavras, Senior Member, IEEE

Abstract: For multicore processors with a private vector coprocessor (VP) per core, VP resources may not be highly utilized due to limited data-level parallelism (DLP) in applications. Also, under low VP utilization static power dominates the total energy consumption. We enhance here our previously proposed VP sharing framework for multicores in order to increase VP utilization while reducing the static energy. We describe two power-gating (PG) techniques to dynamically control the VP's width based on utilization figures. Floating-point results on an FPGA prototype show that the PG techniques reduce the energy needs by 30-35 percent with negligible performance reduction as compared to a multicore with the same amount of hardware resources where, however, each core is attached to a private VP.

Index Terms: Accelerator, vector processor, multicore processing, performance, energy


1 INTRODUCTION

HARDWARE accelerators achieve high performance with low energy cost due to customization. VPs, single-instruction, multiple-data (SIMD) units and graphics processing units (GPUs) are in the forefront due to the pervasiveness of DLP-dominant applications [3], [8], [9], [10], [12], [14], [15], [22], [30]. However, VPs are not always efficient for various reasons. (i) The effective DLP of programs is often reduced due to intermittent control. (ii) The vector length may vary across applications or within an application [2]. (iii) Applications may have unknown data dependencies within instruction sequences. This is serious for dynamic environments with limited capabilities and strict performance and/or energy requirements. (iv) As the ratio of arithmetic operations to memory references decreases, ALU utilization is reduced. (v) Finally, parallelism efficiency may decrease with increased resource counts [16].

Although GPUs originally contained dedicated graphics pipelines, modern general-purpose GPUs [16] have replaced them with numerous execution units and rely on optimized routines for high performance. VPs and GPUs differ in many aspects, such as: (i) GPUs have many more parallel execution units than VPs (thousands compared to dozens), therefore GPUs are less area and power efficient with limited DLP. (ii) GPUs' hierarchical structure is more complex. (iii) VPs hide memory latencies using deeply pipelined block transfers whereas GPUs rely on complex multithreading. VPs also may optionally use multithreading. (iv) Since GPUs lack a control processor, hardware is needed to identify address proximity in addresses. (v) GPUs employ branch synchronization markers and stacks to handle masks whereas VPs rely on the compiler. (vi) Contrary to VPs that run all scalar code on the host, each SIMD unit in a GPU supports scalar operations since the host and the GPU use different address spaces and data transfers take thousands of clock cycles. (vii) Finally, GPU programming requires specialized languages such as CUDA or OpenCL [16]; in contrast, VP programming involves the offloading of rather simple macros or directives. These differences increase the area overheads and the power consumption of GPUs. Also, the computer architecture field relies on heterogeneity (without any single solution being the panacea), so it applies to GPUs as well for DLP. Our work has value for systems implemented on FPGAs or ASICs. FPGAs often achieve better performance and lower energy consumption than GPUs. For sliding-window applications involving vector operations, FPGAs achieve 11x and 57x speedups compared to GPUs and multicores, respectively, while consuming much less energy [29].

VP architectures for multicores must adhere to three design requirements: (i) High resource utilization of vector pipelines achieved via software-based DLP and hardware-based simultaneous VP sharing by aggregating many vector instruction streams. We assume VPs composed of vector lanes [7]. A lane contains a subset of the entire VP's vector register file (VRF), an ALU and a memory load/store unit. (ii) Reduced impact of static power on the energy budget via VP sharing. Static power increasingly becomes critical due to reduced feature sizes and increased transistor counts [5]. Performance does not often follow an upward trend with finer resources, primarily because of decreases in average transistor utilizations. Further, transistor scaling may create a utility economics wall that will force chip areas to be frequently powered down [6]. Finally, (iii) VP sharing that can facilitate efficient runtime resource and power management. In contrast, a per-core VP leaves much less room for runtime management. To satisfy the three design requirements while releasing resources for other use, we investigate two VP architectural contexts [10], [14]. We propose here an energy estimation model, two PG techniques and energy management frameworks for adjusting the number of active lanes based on the utilization.

The authors are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102.
Manuscript received 8 June 2013; revised 3 Nov. 2013; accepted 4 Dec. 2013. Date of publication 20 Jan. 2014; date of current version 11 Feb. 2015. Recommended for acceptance by Z. Tari. Digital Object Identifier no. 10.1109/TC.2013.2295820.

2 RELATED WORK

VIRAM emphasized vector lanes [7]. SODA [8] and AnySP [9] contain SIMD integer pipelines for software defined radio. Soft VPs for FPGAs appeared in [12], [15]. All these VPs are designed to interface a single scalar core and use fixed resources. Their rigidity becomes a weakness in dynamic environments with desired energy/performance levels. Lanes may be time-shared by short-vector or scalar threads [19]. In contrast, we target at simultaneous multithreading for maximizing the utilization of lane units.

Starting at 32 nm, subthreshold leakage is high even for high-k metal gate technology; for the 28 nm TSMC High Performance (HP) process, suitable for GPUs and CPUs, leakage reaches up to 40 percent of the dissipated power [27], [28]. To reduce static power, modules may be turned off by gating the ground or the supply voltage. Dynamic voltage and frequency scaling (DVFS) reduces the clock frequency and/or the voltage but is less beneficial for leakage-dominant components (e.g., SRAMs or register files) [23]. Multi-threshold CMOS circuits reduce static power in low voltage and power, and high-performance applications [1], [11]. DVFS or multi-thresholding can be complementary to our work. We do not intend to combine existing power reduction techniques; such an approach can be time consuming and will deviate from the main objective of showing that VP sharing can be enhanced via intelligent power management.

A sleep instruction used with PG can turn off a unit [24]. PG overheads increase for finer granularity. PG techniques turn on/off chosen cores as the utilization varies in datacenters [21]. The core population that minimizes the energy is derived theoretically in [13], [17]. Instruction-level energy estimation finds statically the number of active streaming multiprocessors (SMs) that minimize the dynamic energy for CUDA workloads on NVIDIA GPUs [22]. This approach is not quite effective as static power is ignored and the optimization is coarse since it relies on active SMs (instead of active CUDA cores in an SM). Using static-time application profiling, GPU leakage power is reduced by shutting down the shader, geometry or other execution units [30]. L1 and L2 cache leakage is reduced by applying PG to SRAMs [31]. Our PG-driven framework could help us develop techniques for minimizing GPU energy at fine levels. Memory-bound applications will benefit mostly due to underutilized cores in SMs.

Intel Advanced Vector eXtensions (AVX) were introduced in 2011 for Sandy Bridge processors. The AltiVec SIMD instruction set is found in PowerPCs. However, vector instruction extensions generally consume large core area while showing low resource utilization on the average.

2.1 Intel Phi Coprocessor and NVIDIA Kepler GPU

In contrast to our proposed coprocessor design approach, the systems discussed in this section are not efficient for applications with varying DLP. They target exclusively at applications with sustained high DLP. However, they are two of the most advanced affordable commercial platforms for data-intensive computing. Similar to our motivation for this work, the designers of the Intel Xeon Phi coprocessor recognized recently the importance of resource centralization for vector processing in high-performance computing [3]. Phi is implemented on a large 4.6 in x 5.9 in printed circuit board, is IP-network addressable and contains 61 CPU cores in a ring. Phi differs substantially from our approach since each Phi core is attached to a private vector processing unit (VPU). A core runs in-order up to four threads; a thread uses up to 32 registers in the 512-bit wide VPU. Phi is easier to program than GPUs by adding simple directives to compiled code. Its efficiency is prohibitively low with reduced DLP, so Intel recommends about 240 threads running in parallel. Phi supports PG for cores and entire VPUs. Not only is ours a very different on-chip vector architecture due to vector-lane sharing across cores, but we also propose in this paper lane-based dynamic power management within vector units.

NVIDIA's most advanced Kepler chip has enormous complexity and 7.1 billion transistors [26]. It contains up to 15 streaming multiprocessors (SMXs) and has been manufactured with TSMC's 28 nm process. An SMX contains 192 single-precision CUDA cores, 64 double-precision units, 32 special-function units, and 32 load/store units. CPU cores can launch simultaneously work for a single GPU and SMXs may create work dynamically. Thus, a subset of SMXs, or units in them, may be sometimes idle due to work imbalance. However, Kepler was optimized based on performance per watt when all units are active [26], therefore it does not support runtime power management despite the large number of resources in SMXs.

Our work differs from these platforms as follows: (a) It facilitates on-chip vector-unit sharing among cores instead of assigning exclusive accelerators to cores. (b) Our shared accelerator deals gracefully with one or many simultaneously running applications that display varying DLP. (c) And, our accelerator implements fine-grain, low-overhead energy management using information extracted statically and dynamically. Our work here can be adapted to develop a PG runtime framework for Kepler- and Phi-like systems. For Phi, the VPU design should be lane-based to support the PG of individual lanes. Also, a mechanism must be devised for individual VPU lanes to be shared across cores. Each SMX in Kepler schedules threads in groups of 32, which are called warps. With up to 64 warps per SMX, 2,048 threads may be present in an SMX. The objective will become to minimize the energy consumption of individual SMXs by adjusting dynamically their number of active CUDA cores based on the number of threads/warps.

3 ARCHITECTURES FOR VP SHARING IN MULTICORES

Intelligent VP sharing on multicores increases VP efficiency and application throughput compared to multicores with per-core private VPs [14]. Let us summarize two of our VP sharing contexts [10], [14]. Coarse-grain Temporal Sharing (CTS) multiplexes temporally sequences of vector instructions arriving from the cores. A core gets exclusive VP control and releases it by executing lock and unlock instructions. The VP runs a vector thread to completion, or until it stalls, before switching to another thread. CTS increases lane utilization primarily for interleaved long sequences of scalar and vector code. The utilization of VP lanes increases but not necessarily of their functional units since a single thread may not fully utilize lane resources. Fine-grain Temporal Sharing (FTS) multiplexes spatially in each lane instructions from different vector threads in order to increase the utilization of all vector units. It resembles simultaneous multithreading for scalar codes. FTS achieves better performance and energy savings [10], [14].

Fig. 1. A representative architecture containing two scalar cores and M vector lanes in a shared VP.

Fig. 1 shows a representative dual-core for VP sharing. Each lane contains a subset of elements from the vector register file, an FPU and a memory load/store (LDST) unit. A vector controller (VC) receives instructions of two types from its attached core: instructions to move and process vector data (forwarded to lanes) and control instructions (forwarded to the Scheduler). Control instructions facilitate communications between cores and the scheduler for acquiring VP resources, getting the VP status and changing the vector length. The VC has a two-stage pipeline for instruction decoding, and hazard detection and vector register renaming, respectively. The VC broadcasts a vector instruction to lanes by pushing it along with vector element ranges into small instruction FIFOs.

Each lane in our proof-of-concept implementation has a vector flag register file (VFRF), and separate ALU and LDST instruction FIFOs for each VC. Logic in ALU and LDST arbitrates instruction execution. LDST interfaces a memory crossbar (MC). LDST instructions use non-unit or indexed vector stride. Shuffle instructions interchange elements between lanes using patterns stored in vector registers. LDST instructions are executed and committed in order. ALU instructions may commit out of order. The ALU contains a write back (WB) arbiter to resolve simultaneous stores. Vector register elements are distributed across lanes using low-order interleaving; the number of elements in a lane is configurable. Scratchpad memory is chosen to drastically reduce the energy and area requirements. Scratchpad memory can yield, with code optimization, much better performance than cache for regular memory accesses [25]. Also, cacheless vector processors often outperform cache-based systems [20]. Cores run (un)lock routines for exclusive DMA access. For a VP with M lanes, K vector elements per lane and a vector length VL for an application, up to K x M / VL vector registers are available. The scheduler makes lane assignment and configuration decisions.

We prototyped in VHDL a dual-core on a Xilinx Virtex-6 XC6VLX130T FPGA. The VP has M = 2, 4, 8, 16 or 32 lanes and an M-bank memory. The soft core is MicroBlaze, a 32-bit RISC by Xilinx that employs the Harvard architecture and uses the 32-bit fast simplex link (FSL) bus for coprocessors. The VP is scalable with the lane population, except for the MUXF8 FPGA primitive that relates to the crossbar. Without loss of generality, the crossbar could be replaced with an alternative interconnect for larger multicores. The LUT counts for a VP lane and the core are 1,100 and 750, respectively. The flip-flop counts are 3,642 and 1,050, respectively. For a TSMC 40 nm HP ASIC implementation, the lane and core gate counts are 453,000 and 212,000, respectively. These resource numbers boost further our claim that coprocessor sharing is critical.

TABLE 1
Benchmarks

4 BENCHMARKING

Benchmarking involves handwritten assembly-language code for the VP in the form of macros embedded in C code compiled with Xilinx MicroBlaze gcc. Five applications were developed: 32-tap finite impulse response filter (FIR32), 32-point decimation-in-time radix-2 butterfly FFT (FFT), 1,024 x 1,024 dense matrix multiplication (MM), LU decomposition, and sparse matrix vector multiplication (SpMVM). They are summarized in Table 1. They have similarities with GPU-oriented benchmarks [16] (e.g., matrix-vector products and LU Laplace solver). The cores in our evaluation run independent threads. Several scenarios were created for each benchmark involving loop unrolling, various VLs and instruction rearrangement optimizations. The FIR32 combinations are: (i) CTS or FTS; (ii) VL = 32, 64, 128 or 256; (iii) no loop unrolling, or unrolling once or three times; and (iv) instruction rearrangement. FFT assumes a five-stage butterfly for complex multiply, add and shuffle operations. Eight scenarios involve: (i) CTS or FTS; (ii) VL = 32 or 64; (iii) no loop unrolling or unrolling once; and (iv) instruction rearrangement. MM uses SAXPY for 14 scenarios of: (i) CTS or FTS; (ii) VL = 32, 64, 128 or 256; (iii) no loop unrolling or unrolling once; and (iv) instruction rearrangement.

LU generates the L(ower) and U(pper) triangular matrices for a dense 128 x 128 matrix using the Doolittle algorithm for three scenarios. VL is decreased successively in Gaussian elimination. SpMVM uses a matrix in compressed row storage format and has two stages. In stage SpMVM-k1, non-zeros of a matrix row are multiplied with respective vector elements. In stage SpMVM-k2, products are added along each matrix row; to balance the workload across lanes, matrix rows are ordered according to their number of non-zeros.

5 PERFORMANCE AND POWER MODELS

5.1 Performance Model

The average ALU utilization of a lane is defined as the average number of results produced in 100 clock cycles. Similarly, the average LDST utilization is the average number of 32-bit words sent and received via the crossbar in 100 clock cycles. The ALU or LDST utilization $U_{ALU/LDST}$ is the product of the average instruction throughput $IT_{ALU/LDST}$ (i.e., instructions issued in 100 clock cycles) and the number of elements from a vector register located in this lane (i.e., VL/M):

$U_{ALU/LDST} = IT_{ALU/LDST} \times VL/M.$   (1)

The execution time (clock cycles) of a given kernel is inversely proportional to the product of the ALU utilization per lane and the number of active lanes; $K_{kernel}$, determined by experimentation, is a constant dependent on the kernel workload (i.e., number of FPU operations) divided by 100:

$t_{exec} = \frac{K_{kernel}}{M \times U_{ALU}}.$   (2)
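As a worked illustration of Eqs. (1) and (2), the short Python sketch below (ours; the instruction throughput, vector length and kernel constant are arbitrary placeholder values rather than measured ones) computes a lane's ALU utilization and the resulting kernel execution time.

    def alu_utilization(it_alu, VL, M):
        # Eq. (1): results produced per 100 clock cycles in one lane.
        return it_alu * VL / M

    def execution_time(K_kernel, M, U_alu):
        # Eq. (2): kernel execution time in clock cycles.
        return K_kernel / (M * U_alu)

    # Hypothetical kernel: 2 ALU instructions issued per 100 cycles, VL = 64, 8 lanes.
    U = alu_utilization(2.0, 64, 8)        # -> 16.0 results per 100 cycles per lane
    print(U, execution_time(5000.0, 8, U))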
TABLE 2
Dynamic Power Model Equations

5.2 Dynamic and Static Power Estimation

Static information is extracted from the application using profilers embedded in software development environments for the cores (e.g., GNU gprof, Intel VTune Amplifier XE). Special runtime registers monitor instruction path utilizations inside lanes. Dynamic power estimation relies on lane unit activity rates. Like power factor approximation [18], it focuses on a unit's input statistics. Input activity rates are deduced via timing simulations. Our model uses fixed combinations of process corner and values for voltage, frequency and temperature; it is simple to extend for other values since only constants change, as discussed below. Execution times and utilization figures are obtained with ModelSim simulations using the RTL system model. The Xilinx XPower tool provides the dynamic power dissipation based on data stored in simulation record files (these .vcd files record the switching activities of all logic and wires, which are generated by ModelSim during the timing simulations with the place-and-route netlist). Timing simulations employ real floating-point data. For high accuracy, power measurements are taken during time windows where executing kernels do not stall due to DMA transfers.

Results show a linear dependence between an ALU's dynamic power $P_{ALU}$ and utilization (or activity rate) that can be approximated with $P_{ALU} = \sum_i K_{exe_i} \, w_i \, U_{ALU}$, where $w_i$ is the fraction of utilization that targets execution unit $i$; $K_{exe_i}$ is a constant coefficient measured in mW per percent of utilization (mW/percent) and determined by experimentation. VRF utilization is assumed proportional to $2 U_{ALU} + U_{LDST}$ since a LDST instruction has either a read or a write, whereas an ALU instruction has one or two reads and one write. VRF and LDST dynamic power depend linearly on VRF and LDST utilization, respectively. Moreover, the crossbar and memory dynamic power depend almost linearly on LDST utilization. Small errors are due to fine-grain effects (e.g., access patterns and toggling rates in netlist signals due to data randomness).

Table 2 summarizes the power model equations for VP components. All Ks are constant coefficients measured in mW per percent of utilization (mW/percent), and are found by experimentation followed by linear approximation. They are shown in Table 3 along with their standard deviation and the mean absolute dynamic power estimation error of VP components. The estimation of dynamic power using unit utilizations results in a +/-13 percent confidence interval.

The total dynamic power dissipation for M lanes and L memory banks is:

$P^{D}_{TOTAL} = 2 P_{VC} + M \cdot P_{LANE} + L \cdot P_{MEM\_BANKS} = 2 P_{VC} + M ( P_{ALU\_CTRL} + P_{ALU\_EXE} + P_{LDST} + P_{VRF} ) + L \cdot P_{MEM\_BANKS}.$   (3)

For a vector kernel, the ratio $U_{LDST}/U_{ALU} = a$ is assumed constant (it is known in advance or the number of memory accesses depends on some computed values). The total dynamic power is:

$P^{D}_{TOTAL} = M \, U_{ALU} \Big[ \frac{2 K_{VC} (1+a)}{VL} + K^{DATA}_{ALU\_CTRL} + \sum_i K_{exe_i} w_i + a K^{DATA}_{LDST} + K_{VRF} (1 + 2a) + a K_{MEM\_BANKS} + \frac{M}{VL} \big( K^{INSTR}_{ALU\_CTRL} + a K^{INSTR}_{LDST} \big) \Big].$   (4)
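The following Python sketch (our illustration) evaluates Eq. (4); every coefficient value in it is a made-up placeholder rather than one of the measured constants of Table 3, and the dictionary keys are simply our own names for the K terms.

    def total_dynamic_power(M, VL, U_alu, a, K):
        # Eq. (4): total dynamic power (mW) of the shared VP for one kernel.
        exe = sum(k_exe * w for k_exe, w in K["exe_units"])  # sum_i K_exe_i * w_i
        bracket = (2 * K["VC"] * (1 + a) / VL
                   + K["ALU_CTRL_DATA"] + exe + a * K["LDST_DATA"]
                   + K["VRF"] * (1 + 2 * a) + a * K["MEM_BANKS"]
                   + (M / VL) * (K["ALU_CTRL_INSTR"] + a * K["LDST_INSTR"]))
        return M * U_alu * bracket

    coeffs = {"VC": 0.20, "ALU_CTRL_DATA": 0.10, "LDST_DATA": 0.15, "VRF": 0.05,
              "MEM_BANKS": 0.12, "ALU_CTRL_INSTR": 0.02, "LDST_INSTR": 0.02,
              "exe_units": [(0.30, 0.6), (0.40, 0.4)]}   # (K_exe_i, w_i) pairs
    print(total_dynamic_power(M=8, VL=64, U_alu=40.0, a=0.5, K=coeffs))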

TABLE 3
Mean Absolute Error for Dynamic Power Estimation, and K Coefficients with Their Standard Deviation

TABLE 4
Static Power Consumption Breakdown for the 16 x 16-VP (16 Lanes and 16 Memory Banks) on the XC6VLX130T FPGA (the Internal Supply Voltage Relative to GND Is 1 V; the Junction Temperature Is 85 C)

For a kernel with fixed VL: (i) The first part in the equation is constant. (ii) The second part increases linearly with M. M has a small impact on the dynamic energy since $K_{kernel} ( K^{INSTR}_{ALU\_CTRL} + a K^{INSTR}_{LDST} ) / VL$ is small, especially for large VL; e.g., for FIR32 with VL = 64 and the loop unrolled three times, $E^{D} = K_{kernel} (324.5 + M \times 5.625)$. The number of lane controllers needed increases with M. (iii) For identical VP sharing and kernel, the dynamic energy should stay almost unchanged independent of the number of lanes since it basically depends on utilizations.

The total dynamic energy is obtained from Equations (2) and (4):

$E^{D} = P^{D}_{TOTAL} \, t_{exec} = K_{kernel} \Big[ \frac{2 K_{VC} (1+a)}{VL} + K^{DATA}_{ALU\_CTRL} + \sum_i K_{exe_i} w_i + a K^{DATA}_{LDST} + K_{VRF} (1 + 2a) + a K_{MEM\_BANKS} + \frac{M}{VL} \big( K^{INSTR}_{ALU\_CTRL} + a K^{INSTR}_{LDST} \big) \Big].$   (5)

As transistors shrink, the delay/power gap between FPGA and ASIC designs narrows. Power measurements on our prototype are adjusted since the VP does not utilize all FPGA resources; XPower results are scaled based on the fraction of used FPGA primitives. Table 4 shows the static power dissipation breakdown for the 16 x 16-VP (16 lanes and 16 memory banks).

6 ENERGY MINIMIZATION

We prototyped various M x L-VPs with M in {2, 4, 8, 16} lanes and L = 16 memory banks. PG is implemented with sleep transistors, isolation cells and circuits to control power signals; the static power $P^{LANE}_{OFF}$ of a lane in the OFF state is not zero. Commercial FPGAs lack PG and standby power states [4]. We assume that the static power of PGed lanes can be removed except for 15 percent [24]. The vector controllers, crossbar and memory banks are always active. The static power $P^{M}_{ST}$ of M active lanes is given by Equation (6), where $P^{VP}_{ST}$ and $P^{LANE}_{ST}$ are shown in Table 4:

$P^{M}_{ST} = P^{VP}_{ST} - (L - M) \big( P^{LANE}_{ST} - P^{LANE}_{OFF} \big).$   (6)

6.1 Total Energy Minimization

Fig. 2 shows the normalized energy consumption for various execution scenarios, each involving 10,000 FPU operations. Normalization is in reference to the 2 x 16-VP that yields minimum LU energy in Fig. 2b; the 16 x 16-VP consumes 2.4 times more energy. LU's performance does not change noticeably starting with the 4 x 16-VP, regardless of the population of active lanes, due to stalls caused by scalar divisions on the core and CPU memory accesses. More static power is consumed by extra lanes without any substantial improvement. Static energy approaches dynamic energy on a 45 nm FPGA for medium to low activity. An ASIC simulation with Synopsys tools for a 40 nm TSMC HP process shows that the static power of our shared VP is about 33 percent of the total power dissipation. Also, the static power increases exponentially with the temperature.

Fig. 2. Normalized energy consumption (bars, values on left axis) for a workload with 10 K floating-point operations and various kernels. Normalization is in reference to the 2 x 16-VP; nu: no loop unrolling; u1: loop unrolled once. FTS is applied to two threads. The normalized speed-up (values on right axis) is shown as a line.

The optimal lane population that minimizes the energy is small for low-scalability kernels. For a scalable kernel (e.g., FFT with VL = 64), the energy drops with additional lanes. It is imperative to develop a methodology for activating and deactivating vector lanes under workload changes in order to optimize the energy and performance.

Using Equations (2), (5) and (6), the total energy consumption is:

$E_{TOTAL} = t_{exec} \big( P^{D} + P^{M}_{ST} \big) = E^{D} + \frac{K_{kernel} \big[ P^{VP}_{ST} - (L - M)( P^{LANE}_{ST} - P^{LANE}_{OFF} ) \big]}{M \, U^{M}_{ALU}},$   (7)

where $U^{M}_{ALU}$ is the ALU utilization of the M x L-VP. From Equation (5), the dynamic energy is almost independent of M, as expected for a good VP; it really depends on the amount and type of work. Thus, our objective becomes to find the optimal value of M that minimizes the static energy:

$M_{min} = \arg\min_{M \in F} \frac{P^{VP}_{ST} - (L - M)( P^{LANE}_{ST} - P^{LANE}_{OFF} )}{M \, U^{M}_{ALU}},$   (8)

where F is the set of permissible values for M.
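The minimization of Eq. (8) reduces to a small search over the permissible lane counts. The Python sketch below is our illustration of that search; the static-power figures and the profiled utilizations are invented placeholders, not the Table 4 values.

    def static_power(M, L, P_vp_st, P_lane_st, P_lane_off):
        # Eq. (6): static power with M active lanes out of L.
        return P_vp_st - (L - M) * (P_lane_st - P_lane_off)

    def best_lane_count(F, L, U_alu, P_vp_st, P_lane_st, P_lane_off):
        # Eq. (8): the dynamic energy barely depends on M, so only the
        # static-energy term P_ST^M / (M * U_ALU^M) has to be minimized.
        return min(F, key=lambda M: static_power(M, L, P_vp_st, P_lane_st, P_lane_off)
                                     / (M * U_alu[M]))

    # Hypothetical profiled ALU utilizations (percent per lane) for one kernel pair.
    U = {4: 70.0, 8: 52.0, 16: 30.0}
    print(best_lane_count([4, 8, 16], 16, U, P_vp_st=900.0, P_lane_st=40.0, P_lane_off=6.0))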

6.2 Dynamic PG with Static Information (DPGS)

When a VP request/release event occurs under DPGS, a priori kernel information is used to compute the optimal lane population that minimizes the energy. Since the static power variables in Equation (8) are fixed for a given VP, the only information needed is $U^{M}_{ALU}$ for all permissible values of M. A simple way to obtain $U^{M}_{ALU}$ is to employ offline simulations of single kernel executions and combinations involving kernel pairs (since vector threads from both cores may run concurrently). A lookup table can contain the optimum value of M for kernel pairs $(u_i, u_j)$, where $i, j \in Q$ and Q is the set of all known kernels, including the idle one. Fig. 3a presents hardware extensions to our VP architecture for supporting software-controlled DPGS. They include a PG sequencer (configured by software) and other elements (sleep transistors and isolation cells). Operating system interrupt routines for power management run on a core or a power control unit (e.g., similar to the Intel i7 Nehalem). However, obtaining offline the combined utilization of VP units for all pairs of vector kernels is impractical since kernels may start executing randomly. Also, some kernels may not be known a priori.
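A DPGS lookup table can be as simple as a mapping from kernel pairs to a precomputed lane count. The Python sketch below is our own illustration of that idea; the kernel identifiers and lane counts are hypothetical and not taken from the profiled benchmarks.

    # Hypothetical offline-profiled table; "idle" stands for a core with no VP work.
    OPTIMAL_LANES = {
        frozenset(["fir32", "idle"]): 4,
        frozenset(["fir32", "fft"]): 8,
        frozenset(["mm", "fft"]): 16,
    }

    def dpgs_lane_count(kernel_a, kernel_b, default=8):
        # Return the precomputed energy-optimal lane count for the kernel pair the
        # two cores currently issue; fall back to a default for unprofiled pairs.
        return OPTIMAL_LANES.get(frozenset([kernel_a, kernel_b]), default)

    print(dpgs_lane_count("fft", "fir32"))   # -> 8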

6.3 Adaptive PG with Profiled Information (APGP)

For effective PG-driven energy minimization, we use embedded hardware profilers to dynamically monitor lane unit utilizations. The decision to adjust the population of active lanes based on a core's request/release involves specialized hardware. To find the optimal number of lanes that minimizes the energy consumption at runtime, we use the next theorem.

Theorem 1. If the total energy consumption of a kernel for the M-lane VP configuration is smaller than the total energy consumption for the N-lane configuration, then

$\frac{U^{M}_{ALU}}{U^{N}_{ALU}} > RTh_{M/N},$   (9)

where $RTh_{M/N}$ is a constant depending on M and N. Additionally, $RTh_{M/N} \le 1$ for M > N.

Proof. From $E^{M}_{TOTAL} < E^{N}_{TOTAL}$, a conclusion in Section 5.2 ($E^{M}_{D} \approx E^{N}_{D}$) and Equation (7):

$\frac{U^{M}_{ALU}}{U^{N}_{ALU}} > \frac{N}{M} \cdot \frac{P^{VP}_{ST} - (L - M)( P^{LANE}_{ST} - P^{LANE}_{OFF} )}{P^{VP}_{ST} - (L - N)( P^{LANE}_{ST} - P^{LANE}_{OFF} )},$   (10)

where the right-hand term is $RTh_{M/N}$. For M > N, $U^{M}_{ALU} \le U^{N}_{ALU}$ since a lane's ALU utilization decreases, or stays constant, when the number of lanes increases. Thus, $RTh_{M/N} \le 1$.

Ideal scalability implies $U^{M}_{ALU} = U^{N}_{ALU}$. To evaluate Equation (9) after a VP event, we profile unit utilizations for various VP configurations. We propose a dynamic process where the VP state is changed successively in the right direction (i.e., increasing or decreasing the number of active lanes) until optimality is reached. Since for most of our scenarios minimum energy is reached for M in {4, 8, 16}, our runtime framework assumes four possible VP states with 0, 4, 8 and 16 active lanes. Fig. 3b shows hardware extensions for APGP. Each profiler, attached to a VC, monitors ALU and LDST utilizations for the kernel.
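The relative threshold of Eq. (10) can be evaluated directly from the static-power terms. The Python sketch below (ours; the power figures are placeholders) computes RTh_{M/N} and applies the test of Eq. (9) to two profiled utilizations.

    def relative_threshold(M, N, L, P_vp_st, P_lane_st, P_lane_off):
        # RTh_{M/N} from Eq. (10).
        delta = P_lane_st - P_lane_off
        return (N / M) * (P_vp_st - (L - M) * delta) / (P_vp_st - (L - N) * delta)

    def m_beats_n(U_alu_M, U_alu_N, M, N, L, P_vp_st, P_lane_st, P_lane_off):
        # Eq. (9): the M-lane configuration is preferred over the N-lane one only
        # if the profiled utilization ratio exceeds the relative threshold.
        return U_alu_M / U_alu_N > relative_threshold(M, N, L, P_vp_st, P_lane_st, P_lane_off)

    # Hypothetical figures (mW) for a 16-bank VP: does 16 L beat 8 L?
    print(relative_threshold(16, 8, 16, 900.0, 40.0, 6.0))       # ~0.72
    print(m_beats_n(30.0, 52.0, 16, 8, 16, 900.0, 40.0, 6.0))    # False: stay at 8 L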

Fig. 3. Hardware for (a) DPGS and (b) APGP. In DPGS or APGP, software or the PG controller, respectively, configures the PG Register. VP Profiler:
aggregates utilizations from VCs. ST: Sleep Transistor (Header or Footer).

It captures the average ALU utilization based on the instruction stream flowing through the VC during a chosen time window, as per Equation (1). Simulations show that a window of 1,024 clock cycles gives highly accurate results.

The PG controller (PGC) aggregates utilizations from both threads (using the profilers) and implements the PGC state machine of Fig. 4. We use two types of thresholds: (i) the absolute threshold $ATh_{M \to N}$, which is used if the ratio $U^{M}_{ALU}/U^{N}_{ALU}$ is not available for the current kernel combination and $M \to N$ represents a transition from the M- to the N-lane VP, and (ii) the relative threshold $RTh_{M/N}$ of Equation (9). $RTh_{M/N}$ is used to compare utilizations with profiled M and N lanes. Absolute thresholds are chosen empirically such that, for a given ALU utilization, the probability that the current configuration will be kept is minimum if a configuration of lower energy consumption exists. Absolute thresholds enable the PGC to initiate a state transition if there is a probability that the current state is not optimal. For example, $ATh_{8 \to 16}$ is such that the probability $P( U^{8}_{ALU} < ATh_{8 \to 16} \mid \text{16 L min energy} ) \approx 0$; $RTh_{M/N} < 1$ for M > N, as per the theorem. Besides these thresholds, the PG controller contains the profiled utilization registers $U^{M}_{ALU}$, M in {4, 8, 16} (one for each VP configuration).

As per APGP, after a VP request/release event that may change the utilization figures, and thus the optimal configuration, the utilization registers are reinitialized. Bit Vld in Fig. 4 shows if the ALU utilization register $U^{M}$ for the M-lane VP contains an updated value (the ALU subscript in the utilization variable is dropped to simplify the discussion). If the VP is initially idle (0 L), the PGC will power up eight lanes to enter the 8 L state. We bypass the 4 L configuration since 8 L has the highest probability to be the optimal energy state for our scenarios. The VP uses data from at least one profile window in order to update utilization figures. If one of the inequalities based on the absolute threshold is met, the PGC initiates a transition to another state.

Fig. 4. PG controller state machine and PGC registers for state transitions under APGP. INT, PW and CFG are transitional VP (i.e., non-operating) states. 4 L, 8 L and 16 L are stable VP operating states that represent the 4-, 8- and 16-lane VP configurations. M L is a PGC state with M active lanes, M in {0, 4, 8, 16}; INT is a PGC state where the PGC asserts an interrupt and waits for an Interrupt Acknowledge (INT_ACK); PW is a PGC state where some of the VP lanes are powered up/down; CFG is a PGC state where the Scheduler is reconfigured to a new VP state. Threshold registers are fixed during runs and utilization registers are updated for every profile window. The registers store 8-bit integers. The Vld bit is used to show that the ALU utilization register $U^{M}$, with M = 4, 8 or 16, for the M-lane VP configuration does not contain an updated value.
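The following Python sketch is a much-simplified software model of the PGC decision loop; it is our own reading of Fig. 4 and the text above, keeps only the absolute-threshold transitions between the stable 4 L, 8 L and 16 L states, and omits the relative-threshold test of Eq. (9) as well as the transitional INT/PW/CFG states. The threshold values assumed here are the ones reported in Section 7.

    # Absolute thresholds (percent), as reported in Section 7.
    ATH = {(4, 8): 50.0, (8, 16): 60.0, (8, 4): 50.0, (16, 8): 72.0}

    class PGCModel:
        def __init__(self):
            self.state = 0        # number of active lanes; 0 means the VP is idle
            self.profiles = {}    # last profiled U^M per visited stable state

        def vp_event(self):
            # A VP request/release may change the optimal configuration:
            # invalidate stale profiles; an idle VP wakes up directly into 8 L.
            self.profiles.clear()
            if self.state == 0:
                self.state = 8

        def profile_window(self, u_alu):
            # Called once per 1,024-cycle profile window with the aggregated
            # ALU utilization of the current configuration.
            self.profiles[self.state] = u_alu
            up = {4: 8, 8: 16}.get(self.state)
            down = {8: 4, 16: 8}.get(self.state)
            if up is not None and u_alu > ATH[(self.state, up)]:
                self.state = up      # high utilization: more lanes may lower energy
            elif down is not None and u_alu < ATH[(self.state, down)]:
                self.state = down    # low utilization: shed lanes to cut static power

    pgc = PGCModel()
    pgc.vp_event()              # 0 L -> 8 L on the first VP acquire request
    pgc.profile_window(75.0)    # above ATh(8 -> 16): move to the 16 L state
    print(pgc.state)            # -> 16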

TABLE 5
Time and Energy Overheads for PGC State Transition

After each profile window, the utilization register for the current state is updated. A transition between two stable VP operating states involves the following steps and three transitional VP non-operating states:

1. INT state: Stop the Scheduler from acknowledging new VP acquire requests and send a hardware interrupt to core(s) that have acquired VP resources.
2. PW state: After ACKs from all cores, reconfigure the PG Sequencer for a new VP state.
3. CFG state: Reconfigure the Scheduler with the chosen number of active lanes and enable it to acknowledge new VP acquire requests.

In a new state, the utilization register is updated after a full profile window. If one of the inequalities is met, a transition occurs. Up to three transitions are allowed after a VP event in order to avoid repetitive transitions that increase time/energy overheads. The resources consumed by the profilers and the PGC are less than 1 percent of the VP's resources. As PGC events are rare, the PGC's dynamic power consumption is insignificant compared to that of the VP.

7 SIMULATION MODEL AND EXPERIMENTAL SETUP

Our simulator models vector-thread executions for various VP configurations. The model is based on performance and power figures gathered from RTL and netlist simulations, as described in Section 5. It contains information necessary to compute the execution time and energy consumption for any combination of kernels $(u_i, u_j)$ running in any VP state. Each combination of kernels $(u_i, u_j)$ is represented by the utilizations $U^{M}(u_i)$ and $U^{M}(u_j)$ and the total power $P^{M}(u_i, u_j)$ for kernels running on the M-lane VP, for all values of M in {4, 8, 16}.

The model accounts for all time and energy overheads due to state transitions, as shown in Table 5. Since our lane implementation is almost eight times larger in area than a floating-point multiply unit in [24], which is PGed in one clock cycle, we assume that a VP lane wakes up in eight clock cycles. One lane is powered up/down at a time by the PG Sequencer to avoid excessive currents in the power net. VP components that are not woken up or gated during state transitions consume static energy as usual.

For diverse workloads, we created benchmarks composed of random threads running on cores. Each thread has VP busy and idle periods. During idle periods the core is often busy either with memory transfers or scalar code. A thread busy period is denoted by a vector kernel $u_i$ and a workload expressed in a random number of floating-point operations; a thread idle period is denoted by a random number of VP clock cycles. Ten fundamental vector kernels were used to create execution scenarios. Two versions of each kernel in Section 4 were first produced with low and high ALU utilization, respectively. The kernel workload is uniformly distributed so that enough data exists in the vector memory for processing without the need for DMA. By adding an idle kernel, 55 unique kernel pairs were produced plus 10 scenarios with a single active kernel on a core. Based on Section 6.3, we get the values: $ATh_{4 \to 8} = 50\%$, $ATh_{8 \to 16} = 60\%$, $ATh_{8 \to 4} = 50\%$, $ATh_{16 \to 8} = 72\%$, $RTh_{8/4} = 0.6739$, and $RTh_{16/8} = 0.7581$.

8 EXPERIMENTAL RESULTS

Fig. 5 shows the breakdown of VP normalized execution time and energy consumption when the majority of kernels have low ALU utilization. The ratio of low to high utilization kernels in a thread is 4:1. The idle periods between consecutive kernels in a thread are uniformly distributed in the ranges [1,000, 4,000], [5,000, 10,000] and [10,000, 30,000] clock cycles.

Our conclusions are: (i) FTS generally produces the lowest energy consumption. For either CTS or FTS, DPGS or APGP minimizes the energy consumption compared to scenarios without PG. (ii) Except for two scenarios, 2x(1cpu_8 L) and 2cpu_16 L_FTS, FTS with DPGS or APGP also minimizes the execution time. These PG schemes also yield 30-35 percent and 18-25 percent less energy as compared to 2x(1cpu_8 L) and 2cpu_16 L_FTS, respectively. (iii) Scenarios with two cores and a private per-core VP yield lower execution time than CTS because CTS does not sustain high utilization across all lane units. (iv) DPGS or APGP applied to CTS reduces the energy compared to 2x scenarios. As idle periods decrease, CTS becomes less effective; e.g., a 5 percent gain in consumption for DPGS-driven CTS comes with a slowdown of 70 percent compared to 2x(1cpu_4 L).

Fig. 5. Normalized execution time (a, c, e) and normalized energy consumption (b, d, f) where the majority of kernels in a thread have low ALU utilization, for various idle periods. The ratio of low to high utilization kernels in a thread is 4:1. E_st and E_dyn are the energy consumptions due to static and dynamic activities, respectively. 2x means two cores/CPUs of the type that follows in parentheses, such as (1cpu_4 L), which means one core having a private VP with four lanes. Whenever CTS or FTS shows, it implies two cores with VP sharing.

Finally, (v) time-energy overheads due to state transitions are negligible; they are not shown in Figs. 5, 6, and 7. The total time overhead is upper-bounded by 0.3 and 0.7 percent of the total execution time for DPGS and APGP, respectively. The total energy overhead is upper-bounded by 0.23 and 0.57 percent of the total energy consumption for DPGS and APGP, respectively.

Fig. 6 shows the normalized execution time and energy consumption for threads containing kernels with balanced ALU utilization figures; the ratio between low and high utilization kernels in a thread is 1:1. FTS under DPGS or APGP yields the minimum energy while the performance is better than FTS with eight fixed lanes.

Fig. 7 shows the normalized execution time and energy consumption for threads dominated by high ALU utilization kernels; the ratio between low and high utilization kernels is 1:4. As the number of thread kernels with high ALU utilization increases, the portion of time spent in the 16 L state increases for FTS under DPGS or APGP.

Fig. 6. Normalized execution time (a, c, e) and normalized energy consumption (b, d, f) for threads with balanced utilization kernels, for various idle
periods. The ratio of low to high utilization kernels in a thread is 1:1.

The performance of the PG schemes is better than that of a fixed VP with eight lanes, and approaches the performance of the 16 L FTS-driven configuration. As expected, the energy is reduced drastically with FTS and DPGS or APGP compared to all other scenarios.

9 CONCLUSIONS

We proposed two energy reduction techniques to dynamically control the width of shared VPs in multicores. We first introduced an energy estimation model based on theory and observations deduced from experimental results. Although we presented detailed results for an FPGA prototype, ASIC simulations show that this model is also valid for ASICs; only the values of some model coefficients, which depend on the chosen hardware platform anyway, must change. For given vector kernels, the VP's dynamic energy does not vary substantially with the number of vector lanes. Consequently, we proposed two PG techniques to dynamically control the number of lanes in order to minimize the VP's static energy. DPGS uses a priori information of lane utilizations to choose the optimal number of lanes. APGP uses embedded hardware utilization profilers for runtime decisions.

Fig. 7. Normalized execution time (a, c, e) and normalized energy consumption (b, d, f) for threads dominated by high utilization kernels, for various
idle periods. The ratio of low to high utilization kernels in a thread is 1:4.

To find each time the optimal number of lanes that minimizes the static energy, the VP state is changed to reach optimality for the given workload. Benchmarking shows that PG reduces the total energy by 30-35 percent while maintaining performance comparable to a multicore with the same amount of VP resources and per-core VPs.

REFERENCES
[1] T. Hiramoto and M. Takamiya, Low Power and Low Voltage MOSFETs with Variable Threshold Voltage Controlled by Back-Bias, IEICE Trans. Electronics, vol. E83, no. 2, pp. 161-169, 2000.
[2] M. Woh et al., Analyzing the Scalability of SIMD for the Next Generation Software Defined Radio, Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 5388-5391, Mar./Apr. 2008.
[3] G. Chrysos, Intel Xeon Phi Coprocessor, Proc. Hot Chips Symp., Aug. 2012.
[4] S. Ishihara, M. Hariyama, and M. Kameyama, A Low-Power FPGA Based on Autonomous Fine-Grain Power Gating, IEEE Trans. Very Large Scale Integration Systems, vol. 19, no. 8, pp. 1394-1406, Aug. 2011.
[5] M. Keating, D. Flynn, R. Aitken, A. Gibsons, and K. Shi, Low Power Methodology Manual for System on Chip Design. Springer, 2008.
[6] H. Esmaeilzadeh, E. Blem, R.S. Amant, K. Sankaralingam, and D. Burger, Dark Silicon and the End of Multicore Scaling, Proc. 38th Ann. Int'l Symp. Computer Architecture, pp. 365-376, 2011.
[7] C. Kozyrakis and D. Patterson, Scalable Vector Processors for Embedded Systems, IEEE Micro, vol. 23, no. 6, pp. 36-45, Nov./Dec. 2003.

[8] Y. Lin et al., SODA: A Low-Power Architecture for Software Radio, Proc. 33rd IEEE Ann. Int'l Symp. Computer Architecture, pp. 89-101, 2006.
[9] M. Woh et al., AnySP: Anytime Anywhere Anyway Signal Processing, IEEE Micro, vol. 30, no. 1, pp. 81-91, Jan./Feb. 2010.
[10] S.F. Beldianu and S.G. Ziavras, On-Chip Vector Coprocessor Sharing for Multicores, Proc. 19th Euromicro Int'l Conf. Parallel, Distributed and Network-Based Processing, pp. 431-438, Feb. 2011.
[11] M. Anis, S. Areibi, and M. Elmasry, Design and Optimization of Multithreshold CMOS (MTCMOS) Circuits, IEEE Trans. Computer-Aided Design, vol. 22, no. 10, pp. 1324-1342, Oct. 2003.
[12] H. Yang and S.G. Ziavras, FPGA-Based Vector Processor for Algebraic Equation Solvers, Proc. IEEE Int'l Systems-on-Chip Conf., pp. 115-116, 2005.
[13] J. Li and J.F. Martinez, Power-Performance Considerations of Parallel Computing on Chip Multiprocessors, ACM Trans. Architecture and Code Optimization, vol. 2, pp. 397-422, Dec. 2005.
[14] S.F. Beldianu and S.G. Ziavras, Multicore-Based Vector Coprocessor Sharing for Performance and Energy Gains, ACM Trans. Embedded Computing Systems, vol. 13, no. 2, article 17, Sept. 2013.
[15] J. Yu, C. Eagleston, C.H.-Y. Chou, M. Perreault, and G. Lemieux, Vector Processing as a Soft Processor Accelerator, ACM Trans. Reconfigurable Technology and Systems, vol. 2, no. 2, pp. 1-34, June 2009.
[16] A. Bakhoda et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software, pp. 163-174, Apr. 2009.
[17] V.A. Korthikanti and G. Agha, Towards Optimizing Energy Costs of Algorithms for Shared Memory Architectures, Proc. ACM Symp. Parallelism in Algorithms and Architectures, pp. 157-165, 2010.
[18] S. Powell and P. Chau, Estimating Power Dissipation of VLSI Signal Processing Chips: The PFA Techniques, Proc. IEEE Workshop VLSI Signal Processing, pp. 250-259, 1990.
[19] S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis, Vector Lane Threading, Proc. Int'l Conf. Parallel Processing, pp. 55-64, Aug. 2006.
[20] L. Oliker et al., Evaluation of Cache-Based Superscalar and Cacheless Vector Architectures for Scientific Computations, Proc. 18th Ann. Int'l Conf. Supercomputing, Nov. 2003.
[21] J. Leverich et al., Power Management of Datacenter Workloads Using Per-Core Power Gating, IEEE Computer Architecture Letters, vol. 8, no. 2, pp. 48-51, July-Dec. 2009.
[22] Y. Wang and N. Ranganathan, An Instruction-Level Energy Estimation and Optimization Methodology for GPU, Proc. IEEE 11th Int'l Conf. Computer and Information Technology, pp. 621-628, Aug./Sept. 2011.
[23] W. Wang and P. Mishra, System-Wide Leakage-Aware Energy Minimization Using Dynamic Voltage Scaling and Cache Reconfiguration in Multitasking Systems, IEEE Trans. Very Large Scale Integration Systems, vol. 20, no. 5, pp. 902-910, May 2012.
[24] S. Roy, N. Ranganathan, and S. Katkoori, A Framework for Power-Gating Functional Units in Embedded Microprocessors, IEEE Trans. Very Large Scale Integration Systems, vol. 17, no. 11, pp. 1640-1649, Nov. 2009.
[25] O. Avissar, R. Barua, and D. Stewart, An Optimal Memory Allocation Scheme for Scratch-Pad-Based Embedded Systems, ACM Trans. Embedded Computing Systems, vol. 1, no. 1, pp. 6-26, 2002.
[26] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110: The Fastest, Most Efficient HPC Architecture Ever Built, White Paper, NVIDIA Corp., 2012.
[27] Reducing System Power and Cost with Artix-7 FPGAs, Xilinx, 2013, http://www.xilinx.com/support/documentation/white_papers/wp423-Reducing-Sys-Power-Cost-28nm.pdf, 2014.
[28] 28 Nanometer Process Technology, http://www.nu-vista.com:8080/download/brochures/2011_28 Nanometer Process Technology.pdf, 2012.
[29] J. Fowers, G. Brown, P. Cooke, and G. Stitt, A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, Proc. 20th ACM/SIGDA Int'l Symp. FPGAs, pp. 47-56, Feb. 2012.
[30] P.-H. Wang, C.-L. Yang, Y.-M. Chen, and Y.-J. Cheng, Power Gating Strategies on GPUs, ACM Trans. Architecture and Code Optimization, vol. 8, no. 3, article 3, Oct. 2011.
[31] Y. Wang, S. Roy, and N. Ranganathan, Run-Time Power-Gating in Caches of GPUs for Leakage Energy Savings, Proc. Design, Automation and Test in Europe Conf. & Exhibition (DATE), pp. 300-303, Mar. 2012.

Spiridon F. Beldianu received the BS degree in electrical engineering and the MS degree in signal processing from Technical University, Iasi, Romania, in 2001 and 2002, respectively, and the PhD degree in computer engineering from the New Jersey Institute of Technology in 2012. His research interests are high-performance computing, on-chip vector coprocessor sharing, reconfigurable computing and scheduling for input-queued packet switches. He is a senior scientist at Broadcom, San Jose, California. He is a member of the IEEE.

Sotirios G. Ziavras received the diploma in electrical engineering from the National Technical University of Athens (NTUA), Greece, and the DSc degree in computer science from George Washington University (GWU). He is a professor of electrical and computer engineering, the director of the Computer Architecture and Parallel Processing Laboratory (CAPPL), and the associate provost for Graduate Studies at the New Jersey Institute of Technology (NJIT). He was with the Center for Automation Research at the University of Maryland, College Park from 1988 to 1989. He was a visiting professor of electrical and computer engineering at George Mason University in Spring 1990. He joined NJIT in Fall 1990 as an assistant professor. He has received several honors such as an award from the Hellenic Republic for his academic performance at NTUA, an industry-funded Distinguished assistantship at GWU, the NJIT Excellence in Teaching Award for Graduate Education, and the Richard E. Merwin fellowship at GWU. He has published 170 papers, and did early work on chip multiprocessors embedded in FPGAs for data-intensive computing and energy-grid applications. His main research interests are multicore processors, reconfigurable computing, accelerators, parallel processing, and embedded computing. He is a senior member of the IEEE.
