Abstract: Markov Chain Monte Carlo (MCMC) is a method to draw samples from a given probability distribution. Its frequent use
for solving probabilistic inference problems, where large-scale data are repeatedly processed, means that MCMC runtimes can
be unacceptably large. This paper focuses on population-based MCMC, a popular family of computationally intensive MCMC
samplers; we propose novel, highly optimized accelerators in three parallel hardware platforms (multi-core CPUs, GPUs and
FPGAs), in order to address the performance limitations of sequential software implementations. For each platform, we jointly
exploit the nature of the underlying hardware and the special characteristics of population-based MCMC. We focus particularly
on the use of custom arithmetic precision, introducing two novel methods which employ custom precision in the largest part of
the algorithm in order to reduce runtime, without causing sampling errors. We apply these methods to all platforms. The FPGA
accelerators are up to 114x faster than multi-core CPUs and up to 53x faster than GPUs when doing inference on mixture models.
Index Terms: Field Programmable Gate Array, Graphics Processing Unit, Markov Chain Monte Carlo, Parallel Tempering,
Custom Arithmetic Precision
0018-9340 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TC.2015.2439256, IEEE Transactions on Computers
variance) or use simpler models and/or fewer data, compromising the effectiveness of the analysis. In order to handle this computational burden, it is insufficient to rely solely on more efficient MCMC algorithms. It is equally important to leverage the power of modern hardware accelerators, such as multi-core Central Processing Units (CPUs), Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs). Even more crucially, understanding of the properties of the underlying hardware is not only necessary during the implementation stage of MCMC; certain computational or algorithmic structures map favorably to existing parallel architectures and this should affect choices during the algorithmic and architectural design stages of population-based MCMC and MCMC methods in general.
Here, we present an endeavour towards this direction, targeting Parallel Tempering (PT) [17], the most popular population-based MCMC method. We present optimized implementations in three hardware platforms: multi-core CPUs, GPUs and FPGAs. As a benchmark, we perform Bayesian inference on a mixture model [15], which is representative of multi-modal problems. In more detail, the contributions of this work are:
1) Three optimized accelerators (for CPUs, GPUs and FPGAs) for population-based MCMC which employ double precision and deliver speedups of up to 16.1x, 165x and 174x respectively over a single-core CPU.
2) Two novel, custom precision methods for population-based MCMC, which allow us to reduce runtimes without affecting sampling quality. The methods either use a weighting scheme to correct errors (Weighted PT - WPT) or use custom precision in parts of the algorithm which do not affect output accuracy (Mixed-Precision PT - MPPT). A theoretical proof of the accuracy of the latter is included.
3) Mappings of these methods to CPUs, GPUs and FPGAs, which result in further speedups of up to 1.4x, 3.2x and 6.5x respectively over the baseline implementations.
4) A precision optimization process for WPT and MPPT on FPGAs and a kernel optimization process for GPU accelerators, which maximize sampling throughput.
5) An investigation of the way the performance of the accelerators scales with the size of the chain population and the size of the data set and results on the power and cost efficiency of each accelerator.
2 BACKGROUND
2.1 MCMC principles
MCMC draws samples from a distribution $p(x)$ by constructing a Markov chain [5]. Each state of the chain represents a random sample. A transition kernel is used to generate a sample, given the previous one (i.e. update the chain). The kernel comprises two steps: 1) Propose a sample using an easy-to-sample distribution, e.g. a Normal with mean equal to the previous sample, 2) Compute the probability of the proposed sample according to the target density and accept or reject the sample using some criterion. The details of each step vary between kernels ([7], [14], [18]), making each kernel suitable for distributions with specific characteristics (e.g. correlations, multi-modality). The goal is to achieve fast convergence and mixing.
2.2 Population-based MCMC and PT
When sampling from multi-modal distributions, elementary kernels (e.g. Metropolis [18]) tend to get stuck in one of the modes [6], [14], resulting in slow convergence and mixing. Here, we focus on a family of MCMC methods designed to tackle this problem: population-based MCMC. Population-based MCMC instantiates a population of Markov chains, each of them sampling from a slightly different distribution. Certain types of interactions are introduced between the chains, which help to improve the algorithm's convergence and mixing properties [14].
More specifically, we examine a representative population-based method called Parallel Tempering (PT) [17]. Fig. 1 shows the pseudocode of PT. The chain population consists of $m$ chains and chain $j$ ($j \in \{1, ..., m\}$) samples from a distribution with density $p_j(x)$:
$$p_j(x) = p(x)^{1/T_j}, \quad j \in \{1, ..., m\} \qquad (3)$$
where $T_j$ (with $1 = T_1 < T_2 < ... < T_m$) is the temperature of chain $j$. The density $p_1(x)$ is the target density $p(x)$ (since $T_1 = 1$). The remaining densities are tempered versions of $p(x)$. Temperatures increase with higher $j$, which results in gradually smoother densities, i.e. closer to uniform. Practically, this means that hot chains move quickly in the state space (they jump more easily between the now smoothed modes), while cold chains move slowly but sample from distributions closer to the target $p(x)$.
The algorithm performs $N$ iterations (loop in line 1 of Fig. 1) in order to draw $N$ samples from the target distribution $p_1(x) = p(x)$. Samples from the tempered (auxiliary) distributions are discarded. Every iteration comprises a Global update (line 2) and a Global exchange (line 9).
During the Global update of iteration $i$, the current samples of all chains ($x_1^{(i)}, ..., x_m^{(i)}$) are updated using separate Metropolis transition kernels (loop in line 3). Each kernel samples from the distribution of the corresponding chain (see (3)). The proposal and accept/reject steps of the kernel are shown in lines 4 and 5-7. The latter requires computing the probability of the proposed sample using the target density of the chain ($p_j(y_j^{(i)})$ for iteration $i$ and chain $j$).
During Global exchanges, PT attempts interactions between chain pairs; sample exchanges are proposed between all odd pairs of neighboring chains, i.e. $(1, 2), ..., (m-1, m)$ or all even pairs of neighboring chains, i.e. $(2, 3), ..., (m-2, m-1)$ (in a rotating order - line 10). These exchanges push samples from the hot chains to the colder ones and eventually to the first (coldest) chain, helping it escape from isolated modes (thus enhancing mixing).
The choice of the number of chains and their temperatures is important for this enhancement to be significant [11], [14] but it is outside the scope of this paper. The temperature of chain $j$ is set to $\left(\frac{m}{m+1-j}\right)^2$, which is suitable for the target in Section 2.4 [16]. Only the samples $x_1^{((B+1):N)}$
from chain 1 are kept ($B$ samples are removed as burn-in) and used to get (2) (with $x^{(i)} = x_1^{(i)}$).
Inputs:
    Densities $p_{1:m}$ (from (3)), initial samples ($x_{1:m}^{(1)}$), proposal variances ($\sigma_{1:m}^2$), number of MCMC samples ($N$), number of burn-in samples ($B$)
Algorithm:
    1: for $i = 2, ..., N$
    2:   Global update:
    3:   for $j = 1, ..., m$
    4:     $y_j^{(i)} \leftarrow x_j^{(i-1)} + \mathcal{N}(0, \sigma_j^2)$
    5:     Do $x_j^{(i)} \leftarrow y_j^{(i)}$ with probability:
    6:       $a(x_j^{(i-1)}, y_j^{(i)}) = \min\left(1, \frac{p_j(y_j^{(i)})}{p_j(x_j^{(i-1)})}\right)$
    7:     Otherwise do $x_j^{(i)} \leftarrow x_j^{(i-1)}$
    8:   end for
    9:   Global exchange:
    10:  Choose even chain pairs ((1, 2), (3, 4), ...) or odd chain pairs ((2, 3), (4, 5), ...) (in turn)
    11:  for all chosen chain pairs $(q, r)$
    12:    Exchange the samples $x_q^{(i)}$ and $x_r^{(i)}$ with probability:
    13:      $e(x_q^{(i)}, x_r^{(i)}) = \min\left(1, \frac{p_q(x_r^{(i)}) \, p_r(x_q^{(i)})}{p_q(x_q^{(i)}) \, p_r(x_r^{(i)})}\right)$
    14:  end for
    15: end for
Outputs:
    Samples from the first chain $x_1^{((B+1):N)}$
Fig. 1. Parallel Tempering algorithm
2.3 Parallelism in the algorithm
There are two forms of parallelism in PT: 1) Inter-chain parallelism during Global updates (loop in line 3 of Fig. 1); all chains can be updated in parallel since they do not interact during this stage. Global exchanges can also be parallelized since pairs of chains communicate but each exchange is independent. Unfortunately, using more than 100-200 chains does not offer any improvement in mixing [11], [16] and thus inter-chain parallelism can be exploited up to a limit. Therefore it is crucial for an accelerator to achieve high performance even for moderate numbers of chains. 2) Intra-chain parallelism during the computation of the probability density of a proposed sample (line 6 in Fig. 1). This parallelism is density-dependent. Due to the large diversity of models and corresponding densities, it is impossible to investigate the effect of parallelising this part in the general case. Nevertheless, it is common in Bayesian problems to use independent and identically distributed (i.i.d.) data. This leads to a density which is equal to the product of the sub-densities of all the data (or the sum, for log-densities). This work only considers this form of density parallelism (Section 2.4), because: 1) It is widely used and representative of many applications (e.g. [3], [16], [19]), 2) Several MCMC algorithms are designed to address this specific form of likelihood [20], 3) The kinds of computations involved in the case study of Section 2.4 (Gaussian evaluations, reductions, exponents, logarithms) are very common, even in non-i.i.d. problems.
2.4 Case study: Bayesian inference on mixture models
Mixture models are a powerful family of probabilistic models used in numerous fields [15]. Multi-modal target distributions often appear when performing inference on these models. Hence, they are representative of the problems on which PT is applied. A Gaussian mixture model taken from [16] is used here. This model obeys the i.i.d. principle described in the previous section. It allows us to easily scale the data-related computational load, i.e. the available parallelism in intra-chain computations.
A set of i.i.d. data $D_{1:n}$, where $D_l \in \mathbb{R}$ for $l \in \{1, ..., n\}$, is given. Each observation is distributed according to:
$$p(D_l | \mu_{1:k}, \sigma_{1:k}, a_{1:k-1}) = \sum_{i=1}^{k} a_i f(D_l | \mu_i, \sigma_i) \qquad (4)$$
Here, $f$ denotes the density of a univariate Gaussian distribution, $k$ is the number of mixture components and $\mu_{1:k}$, $\sigma_{1:k}$ and $a_{1:k-1}$ are the parameters of the model (means, variances and weights of components respectively). We use $k = 4$, $\sigma_i = \sigma = 0.55$ and $a_i = a = 1/k$ for $i \in \{1, ..., k\}$ and $\mu_{1:4}$ is the unknown parameter, as in [16]. The prior distribution on $\mu_{1:4}$ is a four-dimensional uniform. The data $D_{1:n}$ are simulated using $\mu_{1:4} = (-3, 0, 3, 6)$. Due to the i.i.d. assumption, the likelihood is a product of sub-densities:
$$p(D_{1:n} | \mu_{1:4}) = \prod_{j=1}^{n} p(D_j | \mu_{1:4}, \sigma_{1:4}, a_{1:3}) \qquad (5)$$
If $p(\mu_{1:4})$ is the prior, the posterior density of $\mu_{1:4}$ (which is the target density and admits 24 modes) is given, up to a normalizing constant, by:
$$p(\mu_{1:4} | D_{1:n}) \propto p(D_{1:n} | \mu_{1:4}) \, p(\mu_{1:4}) \qquad (6)$$
The main parameters of PT and the parameters of the mixture model target distribution are shown in Table 1.
TABLE 1: Algorithm and model parameters
Symbol   Description                         Range (in Section 5)
m        Number of PT chains                 [8, 32768]
T        PT temperature set                  $(m/(m+1-j))^2$ for chain j
s        Dimension of sampled state space    4
n        Number of data                      [128, 32768]
d        Dimension of data                   1
3 RELATED WORK
CPU-based work on accelerating PT has focused on lifting the inter-processor communication bottleneck caused by
chain interactions. In [12], this is done via a decentralized exchange method, implemented on a CPU cluster. The method scales better than centralized code. In [21], a scheme to allocate chains to CPU cores in order to minimize CPU idle time is proposed. In Section 4.1.1 we show how these overheads can be avoided in FPGAs.
Most GPU and FPGA-based works have been limited to exploiting inter-chain parallelism in a straightforward manner. In [16], a GPU implementation of PT achieves one to two orders of magnitude acceleration over a CPU by mapping each chain to a separate thread. In [22], a GPU is used to accelerate DEMCMC, another population-based MCMC method, achieving a 100x speedup over a CPU. [4] and [3] use small numbers of parallel PT chains in FPGA implementations but the focus of these works is on exploiting the form of the target density to maximize performance. [23] and [24] present FPGA architectures for PT, which exploit custom arithmetic precision and outperform a GPU by up to 1.5x. [25] propose a precision optimization method for FPGAs, applicable to any MCMC algorithm, which delivers a 4x speedup over double precision.
In contrast to the above works which focus on specific platforms, we study the suitability of all three platforms for accelerating PT. This is also the first work that jointly examines how each platform's performance scales with the number of chains and the size of the data. Previous works investigate only one or none of the above. In addition, we present, for the first time in MCMC literature, a power and cost efficiency assessment of PT in various platforms.
Section 4.1.3 builds upon [16] by introducing intra-chain parallelism and other enhancements (including optimal GPU kernel configuration) in order to maximize the performance of the GPU implementation. Unlike [16], we also show how speedups scale with data size.
Additionally, this paper focuses on the use of custom precision as a novel means to accelerate MCMC computations. One of the methods presented here to achieve this (MPPT) is based on the idea proposed in [23] but the present paper also: 1) Introduces an enhanced FPGA architecture for MPPT, 2) Applies the MPPT method to the CPU and GPU, 3) Proposes an entirely novel method (WPT) which is also based on custom precision but uses importance weights to correct errors. WPT is mapped to CPUs, GPUs and FPGAs, 4) Includes results on FPGA precision optimization for WPT/MPPT, GPU kernel optimization and the scaling of performance with data size, 5) Presents highly optimized CPU and GPU implementations for all methods, in contrast to the naive implementations of [23].
Custom precision in MCMC has also been employed in [24] and [25]. In [24], custom precision is used throughout the MCMC method, resulting in biased sampling. No formal way is provided to quantify the bias (and choose a precision accordingly) before sampling starts. [25] follow the same approach but supplement it with a precision optimization method for FPGAs; pre-runs are used to quantify the bias of all candidate precisions and the minimum precision which satisfies a user-defined bias threshold is chosen. In contrast, the custom precision methods presented here: 1) Are tailored for PT and exploit specific characteristics of the algorithm, 2) Guarantee unbiased sampling (despite the use of reduced precision) by trading MCMC mixing for speedup instead of MCMC accuracy for speedup.
4 ACCELERATING PT
The PT code in Fig. 1 was first implemented in C++ using double precision arithmetic and without exploiting parallelism. We use this sequential implementation as reference. Section 4.1 describes the baseline accelerators (which use double precision) in each platform. Section 4.2 proposes two custom precision methods to improve PT's efficiency. Section 4.3 maps these methods to the three platforms.
4.1 Baseline accelerators
4.1.1 FPGA
Here, we propose a baseline hardware architecture for PT in double floating point precision. Fig. 2 shows the block diagram of the architecture. There are four computational blocks (Sample Proposal, Probability Evaluation, Accept/Reject and Exchange), three memories (Sample, Probability and Data) and three random number generators.
The implementation is based on extensively pipelining all computational blocks. Pipelining is possible because PT chains perform the same computations on different data during Global updates and exchanges. The system can be thought of as a long pipeline which works iteratively, performing the same steps for each Global update/exchange. One MCMC iteration (outer loop in Fig. 1) includes the following: The current samples of all parallel chains ($x_j$ for all $j \in \{1, ..., m\}$) are read from Sample memory and forwarded to the Sample Proposal block. The Sample Proposal block proposes candidate samples $y_j$ by adding Gaussian random numbers to the current samples. The candidates are then moved to the Probability Evaluation block (shown in Fig. 3), which computes the candidate probabilities $p_j(y_j)$. This is done by calculating $n$ probability sub-densities ($n$ is the number of data), finding their logarithms and summing them to get the total log-density. For big $n$, the whole computation takes a large number of cycles and becomes the bottleneck of the system. To reduce this bottleneck, we instantiate as many parallel pipelines inside this block as allowed by the chip size. Moreover, since each sub-density requires that a datum be read from the Data memory, the Data memory is designed to provide data to all pipelines in parallel at the same cycle (each memory address stores multiple data). An adder tree performs the reduction. The Accept/Reject block receives the candidate probabilities $p_j(y_j)$ and reads the previous probabilities of each chain ($p_j(x_j)$) from Probability memory. It also computes the temperature $T_j = \left(\frac{m}{m+1-j}\right)^2$. All these values (along with a uniform random number) are used to find the Metropolis ratio and accept or reject each candidate sample.
The above steps comprise the update operation. The updated samples also pass through the Exchange block before they are written back to Sample memory. Unlike
[...] to finish before starting the Global exchange. As soon as a chain is updated, it moves to the Exchange block while the next chains are processed by the Update block.
[Fig. 2: block diagram of the baseline FPGA architecture. Recoverable labels: Sample memory, Sample Proposal, Probability Evaluation, Accept/Reject, Exchange, Probability memory, Data memory, accept/reject signal, interface to host PC.]
[Fig. 3: Probability Evaluation block. Recoverable labels: FIFOs, Multiplexer, Data memory, data memory address, proposed sample $y^{(i)}$, datum, adders.]
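The sequential PT loop of Fig. 1, which the accelerators in this section parallelise, can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the authors' C++ reference implementation. The function names (`log_posterior`, `temperatures`, `parallel_tempering`), the single proposal step size shared by all chains, the treatment of 0.55 as a standard deviation and the bounds of the uniform prior are all assumptions made for the example. All densities are handled in the log domain, so tempering by $1/T_j$ becomes a division of the log-density difference.

```python
import math
import random


def log_posterior(mu, data, sigma=0.55, lo=-10.0, hi=10.0):
    """Unnormalised log of (6) for the mixture of (4), with a_i = 1/k.

    Assumptions (not from the paper): sigma is a standard deviation and the
    uniform prior covers [lo, hi] in every dimension.
    """
    if any(v < lo or v > hi for v in mu):
        return -math.inf  # zero prior density outside the box
    k = len(mu)
    norm = 1.0 / (k * sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for d in data:  # i.i.d. data: the log-likelihood of (5) is a sum over data
        s = sum(math.exp(-0.5 * ((d - v) / sigma) ** 2) for v in mu)
        if s == 0.0:
            return -math.inf
        total += math.log(norm * s)
    return total


def temperatures(m):
    """T_j = (m / (m + 1 - j))^2 for j = 1..m, so that T_1 = 1."""
    return [(m / (m + 1 - j)) ** 2 for j in range(1, m + 1)]


def parallel_tempering(log_p, x0, m, n, burn_in, step=0.3, seed=0):
    """Return samples B+1..N of chain 1, following the steps of Fig. 1."""
    rng = random.Random(seed)
    temps = temperatures(m)
    xs = [list(x0) for _ in range(m)]  # current sample of every chain
    lps = [log_p(x) for x in xs]       # untempered log p(x) of every chain
    kept = [list(xs[0])] if burn_in < 1 else []
    for i in range(2, n + 1):
        # Global update: one Metropolis step per chain (lines 3-8).
        for j in range(m):
            y = [v + rng.gauss(0.0, step) for v in xs[j]]
            lp_y = log_p(y)
            # From (3): log p_j(y) - log p_j(x) = (log p(y) - log p(x)) / T_j
            if math.log(1.0 - rng.random()) < (lp_y - lps[j]) / temps[j]:
                xs[j], lps[j] = y, lp_y
        # Global exchange: alternate between the two sets of neighbour pairs.
        for q in range(i % 2, m - 1, 2):
            r = q + 1
            # Log of the ratio in line 13 of Fig. 1, simplified using (3).
            log_e = (lps[r] - lps[q]) * (1.0 / temps[q] - 1.0 / temps[r])
            if math.log(1.0 - rng.random()) < log_e:
                xs[q], xs[r] = xs[r], xs[q]
                lps[q], lps[r] = lps[r], lps[q]
        if i > burn_in:
            kept.append(list(xs[0]))
    return kept
```

A call such as `parallel_tempering(lambda mu: log_posterior(mu, data), [0.0] * 4, m=8, n=1000, burn_in=200)` returns the retained first-chain samples. The inner loop over `data` in `log_posterior` is the part that the Probability Evaluation block spreads over parallel pipelines, and the loop over `j` is the part that is streamed through the pipeline chain by chain.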
precision and then apply the temperature $T_j$ and do all other operations in double precision. Weights are assigned to the samples of the first chain based on the fact that the desired target density is $p_1(x)$ ($= p(x)$), while samples are actually taken from $\tilde{p}_1(x)$ ($= \tilde{p}(x)$), the custom-precision version of $p_1(x)$. Thus $\tilde{p}_1(x)$ functions as the importance sampling distribution in IS; each sample $x_1^{(i)}$ from the first chain at time $i$, where $i \in \{1, ..., N\}$, is assigned the weight:
$$w_i = \frac{p_1(x_1^{(i)})}{\tilde{p}_1(x_1^{(i)})} \qquad (8)$$
Integral (1) is then estimated by the IS sum:
$$E_p^{IS}[f(x)] = \frac{1}{N} \sum_{i=1}^{N} w_i f(x_1^{(i)}) \qquad (9)$$
The PT steps remain the same as in Fig. 1, with the only differences being the use of $\tilde{p}(x)$ instead of $p(x)$ everywhere and the computation of the weights.
Generally, the density $\tilde{p}_1(x)$ is a close approximation to $p_1(x)$ and therefore it can be considered a good (efficient) importance sampling distribution [5]. This efficiency drops only for very low precisions (see Section 5). IS also requires that the importance sampling distribution has a wider support than the target distribution [5]. To guarantee this, we saturate the computation of $\tilde{p}_1(x)$, i.e. we transform zero values to the minimum value representable by the employed precision configuration. This guarantees that the support of $\tilde{p}_1(x)$ is larger than that of $p_1(x)$. Once meeting this requirement, the samples and weights generated by WPT can be used to estimate (1) with zero bias compared to the double precision (baseline) sampler.
4.2.3 Mixed-precision PT method
PT has the distinctive property of running many Markov chains but keeping samples from the first chain only. Therefore, only the first chain's target distribution affects the output estimate (2). Auxiliary chains only contribute to the acceleration of the first chain's mixing. This means that they can sample from any distribution with the only possible effect being a change in the mixing speed.
We propose a method called Mixed-Precision PT (MPPT), which exploits this fact to increase performance. It uses double precision only when updating the first PT chain, which means that the first chain samples from the correct distribution $p_1(x) = p(x)$. The auxiliary chains (indexes $j \in \{2, ..., m\}$) sample from tempered versions of the custom precision approximation $\tilde{p}(x)$:
$$\tilde{p}_j(x) = \tilde{p}(x)^{1/T_j} \qquad (10)$$
In order to guarantee the correctness of the first chain's target distribution, we also change the way the algorithm exchanges samples. Exchanges between chains $q > 1$ and $r > 1$ at MCMC iteration $i$ are accepted with probability:
$$e(x_q^{(i)}, x_r^{(i)}) = \min\left(1, \frac{\tilde{p}_q(x_r^{(i)}) \, \tilde{p}_r(x_q^{(i)})}{\tilde{p}_q(x_q^{(i)}) \, \tilde{p}_r(x_r^{(i)})}\right) \qquad (11)$$
Exchanges that include the first (coldest) chain and chain $r$ (here, $r$ is always 2) are accepted with probability:
$$e(x_1^{(i)}, x_r^{(i)}) = \min\left(1, \frac{p_1(x_r^{(i)}) \, \tilde{p}_r(x_1^{(i)})}{p_1(x_1^{(i)}) \, \tilde{p}_r(x_r^{(i)})}\right) \qquad (12)$$
Notice the use of both double and custom precision densities in (12). The above equations replace the one in line 13 of Fig. 1. By altering the exchange operations which include chain 1, we guarantee that the detailed balance condition [5] holds for all exchange moves (see Appendix for a proof). This is necessary to guarantee that the distribution of chain 1 remains the same as in double precision samplers. Thus MPPT induces no loss in sampling accuracy.
In [30], Gaussian process approximations of the distributions of some PT chains were used, while retaining the target distribution of the first chain. MPPT uses a custom precision-based approximation rather than a probabilistic approximation. Moreover, in contrast to our work, [30] do not investigate the effect of the approximation on mixing.
The price for not using double precision for auxiliary chains is that mixing can be negatively affected, since sample exchanges between the first and second chains are sometimes less likely to succeed. The trade-off between precision and mixing is explored in Section 5.5.
4.3 Custom precision accelerators
4.3.1 FPGA
WPT architecture: The FPGA architecture for WPT is shown in Fig. 4, comprising: 1) A custom precision PT system (which is the system of Fig. 2 with the Probability Evaluation block in custom precision), 2) A block which computes the probabilities of all samples of the first chain in double precision ($p_1(x_1^{(i)})$ for all $i$), 3) A Weight Evaluation block, which uses the probabilities in custom and double precision to find the weights in (8).
The double precision Probability Evaluation block processes samples from the first chain only. The custom precision Probability Evaluation block processes samples from all $m$ chains. We use a dual port Data memory with one port assigned to each block to feed the blocks with data at different rates.
[Fig. 4: block diagram. Recoverable labels: Custom precision PT, FIFOs, Probability Evaluation (CP), Probability Evaluation (DP), Weight Evaluation, interface to host PC.]
Fig. 4. FPGA architecture for WPT: DP stands for double precision, CP stands for custom precision
The accelerator's throughput depends on the number of custom precision pipelines in the Probability Evaluation block (inside Custom precision PT), again symbolized by
$P$, and is given again by (7). Compared to the baseline architecture, the use of reduced precision leads to a net gain in $P$ (and subsequently in throughput), despite WPT's resource overhead due to the extra blocks of Fig. 4.
MPPT architecture: Fig. 5 illustrates the FPGA architecture for MPPT. There are two Probability Evaluation blocks. The double precision block is responsible for: 1) Computing $p_1(x_1^{(i)})$ for all first chain updates, 2) Computing $p_1(x_r^{(i)})$ (where $r = 2$) in (12) for all exchanges between chains 1 and 2. The custom precision block is responsible for: 1) Computing $\tilde{p}_j(x_j^{(i)})$ for $j \in \{2, ..., m\}$ (auxiliary chains), 2) Computing $\tilde{p}_r(x_1)$ (where $r = 2$) in (12) for exchanges between chains 1 and 2. Fig. 5 provides a simplified view of the pipelines inside the two blocks. An extra Sample Proposal block is needed to feed the double precision block. A dual port Data memory is used as in WPT. As for previous architectures, throughput is given by (7). The throughput increases with smaller precisions because we can use larger $P$ (as in WPT). Although mixing decreases with lower precisions, the MPPT accelerator achieves a net gain in performance compared to the baseline accelerator (see Section 5).
4.3.2 Multi-core CPU
It is straightforward to apply the two custom precision methods to the CPU. CPUs have inherent support only for single and double precision floating point arithmetic, so our only choice for the precision of the weight computation in WPT and the auxiliary chains in MPPT is single precision.
4.3.3 GPU
Applying the two custom precision methods to the GPU requires the use of extra kernels compared to the baseline implementation of Section 4.1.3. Only single precision can be used as reduced precision. For WPT, the global update kernel runs exclusively in single precision and, after it terminates, a second kernel is invoked to evaluate the density of the first chain in double precision. The weight is computed in software. For MPPT, the global update comprises two kernel invocations, which correspond to the custom and double precision pipelines of the FPGA architecture. The first kernel (single precision) computes 1) $\tilde{p}_j(x_j^{(i)})$ for $j \in \{2, ..., m\}$ for all auxiliary chain updates and 2) $\tilde{p}_r(x_1)$ (where $r = 2$) in (12) for the exchange between chains 1 and 2. The second kernel (double precision) computes 1) $p_1(x_1^{(i)})$ for the first chain update and 2) $p_1(x_r^{(i)})$ (where $r = 2$) in (12) for the exchange between chains 1 and 2. The second kernel only processes two density evaluations and therefore underutilizes the GPU, which becomes inefficient for small numbers of PT chains (see Section 5.4). In both methods, the global exchange kernel runs in double precision.
5 INVESTIGATION AND RESULTS
5.1 Evaluation platforms
We evaluate the performance of the samplers using the following devices: For the multi-core CPU, we use a pair of Intel Xeon E5-2660 v2 processors with 10 cores each (placed on separate sockets). We use 16 GB of RAM and Intel's C++ compiler version 2015.1.
For the GPU platform, we use Nvidia's GTX285 and GTX480 devices. Actual runs were performed only for GTX480 (hosted by an Intel Core 2 Q9550 CPU with 8 GB of RAM). The GTX285 measurements came from the GPGPU-Sim simulator [31] (version 3.2.2). This simulator can estimate the runtime of a CUDA kernel (an accuracy of 97-98% is reported in [31]). CUDA 1.3 (for GTX285) and 2.0 (for GTX480) were used for compilation.
For the FPGA platform, we present results for Xilinx Virtex 6 (LX240T) and Virtex 7 (VX1140T) devices. Actual runs were performed only for LX240T (placed on an ML605 board and hosted by an Intel Core i7-2600 CPU with 4 GB of RAM). The FPGA was clocked at 200 MHz and communicated with the host PC through a PCI-Express bus, using the RIFFA framework [32]. Performance estimates for VX1140T come from combining post-place and route resource utilization and (7). The Xilinx Power Estimator was used to get power results.
The sequential reference implementation in C++ ran on the multi-core CPU device with one CPU core activated and with all compiler and Cilk optimizations deactivated.
5.2 Performance metrics
The performance criterion to compare PT accelerators is the mixing (exploration) achieved by the accelerator per second of runtime, given a fixed selection of PT tuning parameters (number of chains, temperatures) and a particular target distribution (implying a fixed data set size). This criterion depends on two factors: 1) How many MCMC samples can be generated per second, i.e. the raw throughput, 2) How fast the MCMC samples explore the distribution, i.e. how much exploration is achieved with a given number of samples. Here, since the PT algorithm and all its tuning parameters are fixed, the second factor depends only on precision (precision affects exploration, Sections 4.2.2-3).
For double precision (baseline) accelerators, only the first factor affects performance. It is enough to find the Raw Speedup ($Speedup_{raw}$) of each accelerator against the reference sequential CPU sampler to make comparisons (speedup refers to samples/sec). Nevertheless, for custom precision accelerators, the second factor has to be taken into account too. Specifically, the performance of WPT is affected by: 1) Different first chain mixing speed compared to the baseline sampler (due to custom precision), i.e. the difference in mixing when sampling from $\tilde{p}_1$ instead of $p_1$. 2) The importance sampling mechanism; the more the importance distribution $\tilde{p}_1$ deviates from the IS target distribution $p_1$, the more the importance weights take extreme values. This makes the IS estimator in (9) less efficient.
The sampling efficiency of MPPT is affected by the use of custom precision for the auxiliary chains ($\tilde{p}_{2:m}$), which changes the mixing speed of the first chain (compared to the baseline sampler). Due to these reasons, one MCMC sample from WPT or MPPT has a different exploration value compared to one sample from baseline samplers.
0018-9340 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TC.2015.2439256, IEEE Transactions on Computers
Fig. 5. FPGA architecture for MPPT: The double precision (DP) Probability Evaluation block has one pipeline and the custom precision (CP) Probability Evaluation block has multiple pipelines.
[...] distribution of Section 2.4. We also compare with the sampler of [16], where each thread is assigned one PT chain without exploiting intra-chain parallelism. Although investigating performance scaling with data size is specific to the i.i.d. [...] of CUDA blocks and data per thread (given m and n), as will be demonstrated in Section 5.5. Finally, CPU implementations use the optimal granularities for chain and [...]

Scaling the number of chains: In Fig. 6, we set the number of data to n = 128 and vary the number of chains (m). The mean Speedup_eff from 30 independent runs is shown for all samplers. Error bars represent one standard deviation. The deviations are small compared to the mean values in all cases, showing that the speedups are not due to random behaviour in the runs. The baseline multi-core CPU (with 20 cores) achieves a peak speedup of 13.8x. This is not reached with fewer than 2048 chains due to inter-core communication overhead during exchanges. The WPT and MPPT CPU samplers reach a speedup of 18.4x. This improvement is due to the use of single precision for the majority of density evaluations.

[Fig. 6 plot; x-axis: Number of chains, from 8 to 32768]
Fig. 6. Scaling of Speedup_eff with number of chains m. Baseline and custom precision accelerators are included. n = 128 was used. Error bars represent standard deviation from 30 runs. V7 measurements are projections and have no error bars.

The baseline GPUs achieve significantly higher peak Speedup_eff. To reach peak performance (165x for GTX480, 78x for GTX285), m = 32768 chains are used. For m close to a few hundred, speedups are in the range 15x-50x. This slow scaling is due to the lack of enough parallelism to fully utilize the GPU. These speedups are up to 4.4x higher than those of the state-of-the-art implementation of [16], which achieves the same performance as our implementation only for m = 32768. WPT and MPPT samplers achieve similar Speedup_eff on the GPU. They outperform baseline implementations by up to 3.2x (GTX285) and 2.4x (GTX480). For fewer than 512 chains, baseline samplers are faster than WPT/MPPT by up to 1.3x. This is because two update kernels (in double and single precision) are called in WPT/MPPT (Section 4.3.3). The kernels' runtime is stable or increases slightly when they process fewer than 100-200 chains (due to GPU under-utilization). This means that, e.g., calling a double precision kernel to process only the first chain of MPPT takes almost the same time as the baseline (double precision) kernel takes to process 32 or 128 chains. On top of that, MPPT involves the cost of calling a second (single precision) kernel for the auxiliary chains. This results in lower performance than the baseline GPU until enough chains are used, at which point the benefits from processing the auxiliary chains in single precision outweigh the cost of running two kernels. For all GPU samplers, the main bottleneck resource which limits the peak speedup is the number of registers in each streaming processor. When we manually doubled the number of registers per processor in the GPGPU-Sim model of GTX285, the peak Speedup_eff of the baseline sampler increased by 1.7x. Increasing the processing capabilities of the processors (e.g. by adding more floating point units) had minimal effect on peak performance.

On the FPGA platform, the baseline architecture on the LX240T and VX1140T is 26x and 44x-165x faster than the reference respectively. WPT increases these speedups to 65x-128x and 66x-854x respectively and MPPT to 98x-138x and 76x-997x respectively. These speedups are due to increased P, resulting from reduced precisions. Peak performance is reached with only 8-32 chains for the baseline and 32-128 chains for the custom precision architectures. These speedups come without compromising sampling quality.

Comparing across platforms, the big Virtex 7 FPGA with the baseline sampler is 24x-36x and 0.9x-28x faster than the baseline CPU and GPU (on the GTX480) implementations respectively. These speedups increase to 57x-92x and 2.2x-76x when WPT/MPPT are used in all platforms. This shows that the two custom precision methods are more suitable for mapping on the FPGA because they can exploit the fully custom precision of the platform, while CPUs and GPUs can only use single or double precision. Similarly, the small Virtex 6 FPGA is slower than the small GTX285 GPU for m > 128 when the baseline sampler is used but only for m > 2048 when the WPT/MPPT methods are used. Nevertheless, the GTX285's peak performance is higher than the LX240T's by 1.8x-3.1x (depending on the method). All FPGA architectures reach their peak performance much earlier than GPUs, since their flexibility allows them to utilize resources efficiently and exploit modest amounts of parallelism. For m < 512, the Virtex 6 FPGA is faster than both GPUs when using MPPT/WPT. For m < 128, even the baseline Virtex 6 is faster than all GPU samplers. In PT literature, the values of m typically range from less than 10 to a few dozen
more sub-densities are summed during density evaluation, leading to more arithmetic error and rougher density approximations. This negatively affects the mixing/efficiency of WPT/MPPT (Section 5.5). Higher precisions need to be

[Figure plot; y-axis: Effective speedup vs. sequential SW (log scale); point labels include (14,11), (16,11) and (24,11); panel (m,n)=(8,128)]

TABLE 4
Power efficiency of the proposed accelerators
Accelerator (device) ES/J ES/(J sec) ES/($sec)
Baseline (E5-2660v2) 1.5 4.5 101 1.0 101
so speedups (and the way they scale) depend largely on the problem. Nevertheless, the trend shown in Figure 6 (FPGAs reach peak performance earlier than GPUs) will still hold as long as chains can be streamed through the likelihood datapath. Moreover, the CPU/GPU/FPGA design is largely unaffected by the target distribution; only the module/kernel which computes the probability density needs to change. The rest of the system is generic.
In summary, multi-core CPUs are preferable for specific MCMC variants and GPUs are suitable for population-based and other parallel MCMC methods, especially if the density computation is SIMD. The FPGA should be preferred for SIMD and non-SIMD MCMC methods which have communication-bound operations and/or are robust to precision reduction and when energy consumption is critical. For problems which are extremely intensive computationally, a possible solution is the use of multiple GPUs and FPGAs and the exploitation of heterogeneous computing techniques to optimally allocate tasks to devices.

REFERENCES
[1] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann Machines for Collaborative Filtering," in Proceedings of the 24th International Conference on Machine Learning, ser. ICML '07. New York, NY, USA: ACM, 2007, pp. 791-798.
[2] A. U. Asuncion, P. Smyth, and M. Welling, "Asynchronous distributed estimation of topic models for document analysis," Statistical Methodology, vol. 8, no. 1, pp. 3-17, 2011.
[3] N. B. Asadi, T. H. Meng, and W. H. Wong, "Reconfigurable computing for learning Bayesian networks," in Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays. New York, NY, USA: ACM, 2008, pp. 203-211.
[4] F. Belletti et al., "Janus: An FPGA-Based System for High-Performance Scientific Computing," Computing in Science Engineering, vol. 11, no. 1, pp. 48-58, 2009.
[5] J. S. Liu, MC strategies in scientific computing. Springer, 2001.
[6] J. Yan, M. Cowles, S. Wang, and M. Armstrong, "Parallelizing MCMC for Bayesian spatiotemporal geostatistical models," Statistics and Computing, vol. 17, no. 4, pp. 323-335, 2007.
[7] S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, Handbook of Markov Chain Monte Carlo, 1st ed. Chapman and Hall/CRC, 2011.
[8] L. Bottolo et al., "GUESS-ing Polygenic Associations with Multiple Phenotypes Using a GPU-Based Evolutionary Stochastic Search Algorithm," PLoS Genet, vol. 9, no. 8, p. e1003657, 2013.
[9] S. Zierke and J. Bakos, "FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods," BMC Bioinformatics, vol. 11, no. 1, pp. 1-12, 2010.
[10] J. Gross, W. Janke, and M. Bachmann, "Massively parallelized replica-exchange simulations of polymers on GPUs," Computer Physics Communications, vol. 182, no. 8, pp. 1638-1644, 2011.
[11] D. J. Earl and M. W. Deem, "Parallel tempering: Theory, applications, and new perspectives," Phys. Chem. Chem. Phys., vol. 7, pp. 3910-3916, 2005.
[12] Y. Li, M. Mascagni, and A. Gorin, "A decentralized parallel implementation for parallel tempering algorithm," Parallel Comput., vol. 35, pp. 269-283, May 2009.
[13] L. Bottolo and S. Richardson, "Evolutionary stochastic search for Bayesian model exploration," Bayesian Analysis, vol. 5, no. 3, pp. 583-618, 09 2010.
[14] A. Jasra, D. A. Stephens, and C. C. Holmes, "On population-based simulation for static inference," Statistics and Computing, pp. 263-279, 2007.
[15] G. McLachlan and D. Peel, Finite mixture models. Wiley, 2004.
[16] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, "On the Utility of Graphics Cards to Perform Massively Parallel Simulation of Advanced Monte Carlo Methods," Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 769-789, 2010.
[17] C. J. Geyer, "Markov Chain Monte Carlo Maximum Likelihood," in Computing Science and Statistics, Proceedings of the 23rd Symposium on the Interface, 1991, pp. 156-163.
[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines," The Journal of Chemical Physics, vol. 21, no. 6, 1953.
[19] M. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West, "Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures," Journal of Computational and Graphical Statistics, vol. 19, no. 2, pp. 419-438, 2010.
[20] D. Maclaurin and R. P. Adams, "Firefly Monte Carlo: Exact MCMC with Subsets of Data," ArXiv e-prints, Mar. 2014.
[21] D. J. Earl and M. W. Deem, "Optimal Allocation of Replicas to Processors in Parallel Tempering Simulations," The Journal of Physical Chemistry B, vol. 108, no. 21, pp. 6844-6849, 2004.
[22] W. Zhu, A. Yaseen, and Y. Li, "DEMCMC-GPU: An Efficient Multi-Objective Optimization Method with GPU Acceleration on the Fermi Architecture," New Generation Computing, vol. 29, no. 2, pp. 163-184, 2011.
[23] G. Mingas and C. Bouganis, "A Custom Precision Based Architecture for Accelerating Parallel Tempering MCMC on FPGAs without Introducing Sampling Error," in FCCM, 2012 IEEE 20th Annual International Symposium on, 2012, pp. 153-156.
[24] G. Mingas and C.-S. Bouganis, "Parallel tempering MCMC acceleration using reconfigurable hardware," in ARC, ser. Lecture Notes in Computer Science. Springer, 2012, vol. 7199, pp. 227-238.
[25] G. Mingas, F. Rahman, and C.-S. Bouganis, "On Optimizing the Arithmetic Precision of MCMC Algorithms," FCCM, Annual IEEE Symposium on, vol. 0, pp. 181-188, 2013.
[26] Intel, "Intel Cilk Plus," http://software.intel.com/, 2011.
[27] Intel, "Intel C and C++ Compilers," http://software.intel.com, 2011.
[28] M. Harris et al., "Optimizing parallel reduction in CUDA," NVIDIA Developer Technology, vol. 2, p. 45, 2007.
[29] L. Breyer, G. O. Roberts, and J. S. Rosenthal, "A note on geometric ergodicity and floating-point roundoff error," Statistics & Probability Letters, vol. 53, no. 2, pp. 123-127, June 2001.
[30] M. Fielding, D. J. Nott, and S.-Y. Liong, "Efficient MCMC Schemes for Computationally Expensive Posterior Distributions," Technometrics, vol. 53, no. 1, pp. 16-28, 2011.
[31] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, 2009, pp. 163-174.
[32] M. Jacobsen, Y. Freund, and R. Kastner, "RIFFA: A Reusable Integration Framework for FPGA Accelerators," in FCCM, 2012 IEEE 20th Annual International Symposium on, 2012, pp. 216-219.
[33] A. Rafique, N. Kapre, and G. Constantinides, "Enhancing performance of Tall-Skinny QR factorization using FPGAs," in FPL, 2012 22nd International Conference on, Aug 2012, pp. 443-450.

Grigorios Mingas is a PhD candidate and Research Assistant in the Department of Electrical and Electronic Engineering of Imperial College London, UK. He received the M.Eng. degree in Electrical and Computer Engineering in 2010 from the Aristotle University of Thessaloniki, Greece. His main research interests are reconfigurable and high-performance computing for MCMC/SMC, linear algebra and SLAM acceleration.

Christos-Savvas Bouganis is a Senior Lecturer in the Department of Electrical and Electronic Engineering of Imperial College London, UK. He is an editorial board member of IET Computers and Digital Techniques and Journal of Systems Architecture. His research interests include the theory and practice of reconfigurable computing and design automation, mainly targeting digital signal processing algorithms.