I. INTRODUCTION
General matrix-matrix multiplication (GEMM) is one of the most crucial operations in computational science and modeling. The operation multiplies a matrix A of size m × k with a matrix B of size k × n and gives a result matrix C of size m × n. In many linear solvers and graph problems, such as the algebraic multigrid method [1], breadth-first search [2], finding shortest paths [3], colored intersection [4] and sub-graphs [5], it is required to exploit the sparsity of the two input matrices and the result matrix, because their dense forms normally need huge storage space and computation cost for the zero entries. Therefore general sparse matrix-matrix multiplication (SpGEMM) becomes a common building block in these applications.
1530-2075/14 $31.00 © 2014 IEEE
DOI 10.1109/IPDPS.2014.47
c_i = (a_i · b_1, a_i · b_2, . . . , a_i · b_p),
where the operation · is the dot product of the two vectors.
We first give the sparsity of the matrix A consideration. Without loss of generality, we assume that the ith row of A only consists of two nonzero entries, in the kth and the lth columns, respectively. Thus a_i becomes (a_ik, a_il). Since all other entries are zeros, we do not record them explicitly and ignore their influence on the dot products in the calculation of the ith row of C. Then we obtain
c_i = (a_ik b_k1 + a_il b_l1, a_ik b_k2 + a_il b_l2, . . . , a_ik b_kp + a_il b_lp).
We can see that in this case only entries in the kth and the lth rows of B contribute to the ith row of C. Then row vector form, instead of column vector form, is used for the matrix B. So we obtain
c_i = a_ik b_k + a_il b_l.
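The row-wise formulation above can be sketched in code: each nonzero a_ij in row i of A selects row j of B, which is scaled by a_ij and accumulated into row i of C. The following serial C++ sketch uses a dense accumulator and CSR input arrays; the function and parameter names are ours for illustration, not the paper's.

```cpp
#include <vector>

// Computes one row i of C = A*B via the row-wise formulation
// c_i = sum over nonzeros a_ij of (a_ij * b_j), where b_j is row j of B.
// A and B are in CSR form (row pointers, column indices, values).
std::vector<double> spgemm_row(int i, int n,
                               const std::vector<int>& a_ptr,
                               const std::vector<int>& a_col,
                               const std::vector<double>& a_val,
                               const std::vector<int>& b_ptr,
                               const std::vector<int>& b_col,
                               const std::vector<double>& b_val) {
    std::vector<double> acc(n, 0.0);              // dense accumulator for row i of C
    for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {
        int j = a_col[p];                         // nonzero a_ij selects row j of B
        for (int q = b_ptr[j]; q < b_ptr[j + 1]; ++q)
            acc[b_col[q]] += a_val[p] * b_val[q]; // c_i += a_ij * b_j
    }
    return acc;
}
```

A dense accumulator is the simplest choice for exposition; sparse accumulators (heaps, sorted merges) trade memory for extra bookkeeping, which is the subject of the later sections.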
Table I
OVERVIEW OF EVALUATED SPARSE MATRICES
[Columns: Name, Plot, nnz(A), nnzr(A), nnz and nnzr of the intermediate matrix, nnz(C), nnzr(C). The 23 evaluated matrices are: cant, economics, epidemiology, filter3D, pwtk, ship, harbor, protein, spheres, 2cubes_sphere, accelerator, cage12, hood, m133-b3, majorbasis, mario002, mono_500Hz, offshore, patents_main, poisson3Da, QCD, scircuit and webbase-1M. The numeric entries are garbled in this extraction and are omitted.]
V. PERFORMANCE CONSIDERATIONS
A. Memory Pre-allocation for the Result Matrix
Compared to the SpGEMM, other sparse matrix multiplication operations, such as multiplication of a sparse matrix and a dense matrix [7], [8], [23] and its special case, sparse matrix-vector multiplication [21], [25], [6], pre-allocate a dense matrix or a dense vector of trivially predictable size and store entries to predictable memory addresses. However, because the number of nonzero entries in the result sparse matrix C is unknown in advance, precise memory allocation for the SpGEMM is impossible before the real computation. Nor is the physical address of each new entry known in advance.
VI. METHODOLOGY
A. Algorithm Design
Our SpGEMM algorithm is composed of four stages:
(1) calculating upper bound, (2) binning, (3) computing the
result matrix, and (4) arranging data.
The first stage, calculating the upper bound, generates the upper bound on the number of nonzero entries in each row of the result matrix C. We create an array U of size m, where m is the number of rows of C, for the upper-bound sizes of the rows. We use one GPU thread to compute each entry of the array U. Algorithm 2 describes this procedure.
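Assuming both input matrices are stored in CSR, the per-row upper bound is the sum, over the nonzeros a_ij in row i of A, of the number of nonzeros in row j of B. The serial C++ sketch below shows the work that one GPU thread performs for one entry U[i]; the names are illustrative, not the paper's.

```cpp
#include <vector>

// Upper bound on nnz per row of C = A*B:
// U[i] = sum over nonzeros a_ij in row i of A of nnz(row j of B).
// Only the row pointers of B and the structure of A are needed.
std::vector<int> upper_bound_rows(const std::vector<int>& a_ptr,
                                  const std::vector<int>& a_col,
                                  const std::vector<int>& b_ptr) {
    int m = static_cast<int>(a_ptr.size()) - 1;
    std::vector<int> U(m, 0);
    for (int i = 0; i < m; ++i)                     // one GPU thread per row i
        for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {
            int j = a_col[p];                       // column in A = row index in B
            U[i] += b_ptr[j + 1] - b_ptr[j];        // nnz of row j of B
        }
    return U;
}
```

The bound counts every multiplied pair without merging duplicate column indices, so it equals the size of the intermediate matrix row rather than the final compressed row.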
C. Load Balancing
Because the distribution patterns of the nonzero entries in both input sparse matrices are very diverse (consider the plots of the matrices in Table I), input space-based data decomposition [17], [9] normally does not bring efficient load balancing. One exception is computing the SpGEMM for huge sparse matrices on large-scale distributed memory systems, where 2D and 3D decompositions on the input space have demonstrated good load balancing and scalability by utilizing efficient communication strategies [29], [30], [2]. However, in this paper we mainly consider load balancing for fine-grained parallelism on GPU shared memory architectures.
Therefore we use the other group of load balancing methods, based on output space decomposition. Dalton et al. [12] presented a method that sorts the rows of the intermediate matrix C~, divides it into 3 sub-matrices that include the rows in different size ranges, and uses differentiated ESC methods for the sub-matrices. We have a similar consideration, but our implementation is completely different. We do not strictly sort the rows of the intermediate matrix C~, but just assign rows to a fixed number of bins through a much faster linear-time traverse on the CPU. And we decompose the output space in a more detailed way that guarantees much more efficient load balancing. We will demonstrate that our method is load balanced in all stages, maximizing resource utilization of the GPUs.
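A minimal sketch of such a linear-time binning pass is shown below, assuming power-of-two size ranges; the exact bin boundaries used by the algorithm may differ, and the names are ours for illustration.

```cpp
#include <vector>

// One linear CPU pass that assigns each result row to one of a fixed number
// of bins by its (upper-bound) size, instead of fully sorting the rows.
// Bin 0 holds sizes <= 1, bin b (b >= 1) holds sizes in (2^(b-1), 2^b],
// and the last bin collects everything larger.
std::vector<std::vector<int>> bin_rows(const std::vector<int>& U, int num_bins) {
    std::vector<std::vector<int>> bins(num_bins);
    for (int i = 0; i < static_cast<int>(U.size()); ++i) {
        int b = 0;
        while (b < num_bins - 1 && U[i] > (1 << b))
            ++b;                      // advance until the size fits the range
        bins[b].push_back(i);         // O(1) append; overall O(m) traverse
    }
    return bins;
}
```

Each bin can then be processed by a kernel specialized for its size range, which is what makes the decomposition cheaper than a full sort while still grouping rows of similar cost.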
1 Actually, according to the CSR format standard, the column indices in each row do not necessarily have to be sorted. But most implementations choose to do so; thus our method reasonably makes this assumption.
Figure 1. Two steps of an example of the heap method. From (a) to (b), the root entry is fused to the first entry in the result sequence since they share the same index. From (b) to (c), the root entry is inserted into the sequence since they have different indices. After each step, the heap property is reconstructed.
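The fuse-or-insert behavior in Figure 1 can be sketched with a standard min-heap over (column index, value) pairs: the popped root is added into the last result entry when the indices match, and appended otherwise. This is our serial illustration of the idea, not the paper's GPU implementation.

```cpp
#include <queue>
#include <utility>
#include <vector>

// Heap accumulation of one intermediate row: the min-heap yields entries in
// ascending column order, and duplicates are fused by adding their values.
std::vector<std::pair<int, double>>
heap_merge_row(const std::vector<std::pair<int, double>>& entries) {
    using E = std::pair<int, double>;
    std::priority_queue<E, std::vector<E>, std::greater<E>> heap(
        entries.begin(), entries.end());
    std::vector<E> result;
    while (!heap.empty()) {
        E e = heap.top();
        heap.pop();
        if (!result.empty() && result.back().first == e.first)
            result.back().second += e.second;  // same index: fuse (add values)
        else
            result.push_back(e);               // new index: insert into sequence
    }
    return result;
}
```

After the loop, the result sequence is the compressed row: sorted by column index with each index appearing once.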
Figure 2.
Figure 4. Performance comparison of merging 32-bit keys, 32-bit key-32-bit value pairs and 32-bit key-64-bit value pairs through 5 GPU merge algorithms: ranking merge, merge path, basic odd-even merge, basic bitonic merge and advanced bitonic merge, on two different GPUs: nVidia GeForce GTX Titan and AMD Radeon HD 7970.
Table II
TWO MACHINES USED FOR BENCHMARKING

Machine 1:
- CPU: One Intel Xeon E5-2630 (six Sandy Bridge cores, 2.3 GHz, boost up to 2.8 GHz, Hyper-Threading on, 15 MB L3 cache)
- GPU: One nVidia GeForce GTX Titan (14 Kepler cores, 2688 CUDA cores, 876 MHz, 4.7 Tflops in single precision, 1.6 Tflops in double precision, 672 KB scratchpad memory)
- System software and library: Ubuntu Linux 12.04, Intel C++ Compiler 14.0, Intel MKL 11.0, nVidia CUDA SDK 5.5, CUSPARSE v2, CUSP 0.4.0 and GPU driver version 319.32

Machine 2:
- CPU: One Intel Core i7-3770 (four Ivy Bridge cores, 3.4 GHz, boost up to 3.9 GHz, Hyper-Threading off, 8 MB L3 cache)
- GPU: One AMD Radeon HD 7970 (32 GCN cores, 2048 Radeon cores, 1 GHz, 4 Tflops in single precision, 1 Tflops in double precision, 2048 KB scratchpad memory)
- System software and library: Ubuntu Linux 12.04, g++ compiler 4.6.3, AMD APP SDK 2.8 and GPU driver version 13.4
Figure 5. Global memory requirement comparison of the precise method, our hybrid method and the upper bound method on the benchmark suite. The
memory requirement of the precise method includes the two input matrices and the result matrix. The memory requirements of the other two methods also
contain additional intermediate matrices.
Figure 6.
Even though the nVidia GPU and the AMD GPU have similar global memory bandwidth, and the former even offers higher peak computational power, the AMD GPU outperforms the nVidia GPU on most matrices. According to the performance behaviors of the merge algorithms in Figure 4, we can see that scratchpad memory capacity is very important for GPU SpGEMM performance.
Figure 7.
VIII. CONCLUSION
In this paper we demonstrated an efficient SpGEMM algorithm on GPUs for solving the three challenging problems in the SpGEMM. In experiments on a benchmark suite composed of 23 matrices with diverse sparsity structures, our SpGEMM algorithm delivered excellent absolute and relative performance as well as space efficiency over the previous GPU SpGEMM methods. Moreover, on average, our approach obtained up to twice the performance of the state-of-the-art CPU SpGEMM method.
ACKNOWLEDGMENT
The authors would like to thank Jianbin Fang at the
Delft University of Technology for supplying access to
the machine with the nVidia GeForce GTX Titan GPU.
The authors also thank the anonymous reviewers for their
insightful comments on this paper.
REFERENCES
[1] N. Bell, S. Dalton, and L. Olson, "Exposing fine-grained parallelism in algebraic multigrid methods," SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C123-C152, 2012.
[2] J. Gilbert, S. Reinhardt, and V. Shah, "High-performance graph algorithms from parallel sparse matrices," in Applied Parallel Computing. State of the Art in Scientific Computing, ser. Lecture Notes in Computer Science, B. Kagstrom, E. Elmroth, J. Dongarra, and J. Wasniewski, Eds. Springer Berlin Heidelberg, 2007, vol. 4699, pp. 260-269.