
2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines

High Performance Sparse LU Solver FPGA Accelerator Using a Static Synchronous Data Flow Model
Mohamed W. Hassan (1), Ahmed E. Helal (1), Yasser Y. Hanafy (1, 2)
(1) Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA
(2) Arab Academy for Science & Technology
mwasfy@vt.edu, ammhelal@vt.edu, yhanafy@vt.edu

I. PROBLEM SCOPE AND PROPOSED SOLUTION

Interconnection matrices resulting from domain decomposition in sparse LU solvers of linear systems [1] appear in several scientific problems, such as parallel circuit simulation [2]. These matrices are semi-dense and symmetric in topology, but not in the values of their nonzero elements (NZEs). Dense matrix solvers perform well on GPUs, but sparse matrix solvers perform poorly. The hardware utilization of previous implementations on massively parallel platforms (including multicores, GPUs, and FPGAs) never exceeded 20%. Irregular computation and memory access are the main reasons for this poor performance. The hardware utilization of sparse solvers on multicores is very high, owing to the dynamic hardware scheduling policies adopted by most multicore processors, but multicores have limited scalability.

Previous FPGA implementations adopted dynamic dataflow, which incurred a large overhead and poor utilization of hardware resources [4], and did not scale to a large number of cores. In this paper we introduce a static synchronous dataflow FPGA implementation that addresses the main problem of sparse solvers. We propose a parallel LU solver algorithm that hides the latency of both memory access and inter-processor communication. The low clock frequency of FPGA designs led to three major design decisions: 1) the selection of deeply pipelined floating-point operators [3], 2) a static scheduling strategy, and 3) a customized data storage format that organizes memory access and eliminates time-consuming address calculations. All three were essential to maximize the utilization of the available hardware resources.

Figure 1 shows the task graph of the modified parallel LU solve algorithm, which hides the latency of memory access and inter-processor communication, together with the equations that prove that the execution time is pure computation for up to 256 processing elements (PEs). The hardware model is synthesized on a Virtex-7 FPGA and tested with several interconnection matrices, shown in Table 2; the synthesis results are shown in Table 1. Table 3 compares the maximum performance achieved by different implementations (multicore, GPU, and FPGA); the proposed design achieves 113.2 GFLOPS with more than 40% utilization using 256 PEs. Figure 2 shows the scalability of the design from 64 PEs to 256 PEs.

Fig. 2. Performance scalability (% of peak performance)

REFERENCES
[1] G. Kron, Diakoptics: The Piecewise Solution of Large-Scale Systems. Macdonald, 1963.
[2] P. Li, "Parallel Circuit Simulation: A Historical Perspective and Recent Developments," Foundations and Trends in Electronic Design Automation, vol. 5, pp. 211-318, 2012.
[3] Z. Ling, G. R. Morris, and V. K. Prasanna, "High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs," Parallel and Distributed Systems, IEEE Transactions on, vol. 18, pp. 1377-1392, 2007.
[4] N. Kapre and A. DeHon, "Parallelizing sparse Matrix Solve for SPICE circuit simulation using FPGAs," in Field-Programmable Technology, 2009. FPT 2009. International Conference on, 2009, pp. 190-198.
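
The static scheduling strategy and the customized storage format named in Section I can be illustrated with a small software sketch. It is not the authors' hardware design: it is a sequential Python illustration, assuming LU factorization without pivoting and nonzero pivots, and the names `build_static_schedule`, `run_schedule`, `slot`, and `schedule` are hypothetical. The idea shown is that the operation list and every operand address are derived once from the sparsity pattern, so the numeric phase performs no index searches or address calculations.

```python
def build_static_schedule(pattern, n):
    """Symbolic phase: derive the LU operation list from the sparsity pattern only.

    pattern: set of (row, col) nonzero positions; the diagonal must be present.
    Returns (slot, schedule). slot maps (row, col) to an index in a flat value
    array; schedule lists ("div", target, pivot) and ("mac", target, l, u) ops
    whose operands are already resolved to flat indices.
    """
    filled = set(pattern)
    ops = []
    for k in range(n):
        rows = [i for i in range(k + 1, n) if (i, k) in filled]
        cols = [j for j in range(k + 1, n) if (k, j) in filled]
        for i in rows:
            ops.append(("div", (i, k), (k, k)))       # L[i,k] = A[i,k] / U[k,k]
            for j in cols:
                filled.add((i, j))                    # record fill-in, if any
                ops.append(("mac", (i, j), (i, k), (k, j)))
    # Flat storage layout fixed before any numeric work (assumed row-major order).
    slot = {pos: idx for idx, pos in enumerate(sorted(filled))}
    schedule = [(name, *(slot[p] for p in operands)) for name, *operands in ops]
    return slot, schedule


def run_schedule(values, schedule):
    """Numeric phase: replay the precomputed operations on the flat array."""
    for op in schedule:
        if op[0] == "div":
            _, target, pivot = op
            values[target] /= values[pivot]
        else:
            _, target, l_idx, u_idx = op
            values[target] -= values[l_idx] * values[u_idx]
    return values


# Small matrix that is symmetric in topology but not in values:
# [[4, 1, 0],
#  [2, 5, 1],
#  [0, 3, 6]]
entries = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 5.0,
           (1, 2): 1.0, (2, 1): 3.0, (2, 2): 6.0}
slot, schedule = build_static_schedule(set(entries), 3)
values = [0.0] * len(slot)
for pos, v in entries.items():
    values[slot[pos]] = v
run_schedule(values, schedule)   # values now holds L (below diagonal) and U
```

Because the schedule depends only on the pattern, it can be reused for every matrix with the same topology, which is the typical situation in circuit simulation where values change between iterations but the structure does not.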

978-1-4799-9969-9/15 $31.00 2015 IEEE 29


DOI 10.1109/FCCM.2015.21
