The utilization of Graphical Processing Units (GPUs) for the element-by-element (EbE) finite element method (FEM) is demonstrated. EbE FEM is a long-known technique, by which a conjugate gradient (CG) type iterative solution scheme can be entirely decomposed into computations on the element level, i.e., without assembling the global system matrix. In our implementation, NVIDIA's parallel computing solution, the Compute Unified Device Architecture (CUDA), is used to perform the required element-wise computations in parallel. Since element matrices need not be stored, the memory requirement can be kept extremely low. It is shown that this low-storage but computation-intensive technique is better suited for GPUs than those requiring the massive manipulation of large data sets.
Index Terms: CUDA, EbE FEM, GPU, parallel FEM.
only the computation-intensive parts of the program are executed on the GPU) will fall behind codes that take full advantage of it [9], i.e., perform all the necessary computations on the GPU.

The aim of this paper is to show that modern high performance computing platforms (GPUs) offer considerable computing capacity that can be fully utilized only if the applied algorithm fits their specific architecture. This property is in contrast to traditional (multi-)CPU based program design patterns, where the efficiency of an algorithm is simply estimated using its computational complexity.

Relying on the fact that it is cheaper to recompute element matrices than to continuously cache them between the device and the system memory, the EbE FEM technique is revisited here, and its implementation on the CUDA architecture is presented. It is also demonstrated that EbE FEM greatly extends the scale of problems that can be solved on devices having limited memory capacity but a massively parallel architecture.

II. CUDA, A MASSIVELY PARALLEL COMPUTING ENVIRONMENT

GPUs were designed for total computational throughput rather than for fast execution of serial calculations. Therefore they have the potential to dramatically speed up computation-intensive applications over multi-core CPUs. To achieve high computational throughput, GPUs have hundreds of lightweight cores and execute tens of thousands of threads simultaneously. Programs executed on GPUs are called kernels.

The reason why GPUs can be so effective is the way thread execution is organized. On traditional CPUs, programs are executed for a certain amount of time and then interrupted (time-division multiplexing). During the interruption, the CPU has to save the current state of the inner registers and load a previously saved state for another thread.

A detailed overview of this topic can be found in [10], [11].

III. IMPLEMENTATION OF EBE FEM ON CUDA

A. Disassembling Matrix Manipulations to the Element Level

The finite element assembling procedure relies on some functions by which the element matrix K_e and the RHS vector b_e of (2) are computed. These functions depend, among others, on the type of PDE to be solved as well as on the applied shape functions. The computed element matrices and RHS vectors are then assembled to form the global system matrix K and RHS b. Let this assembly step be represented by an operator \mathcal{A}, which is defined differently for matrices and vectors, as follows:

K = \mathcal{A}_{e \in E}(K_e) = \sum_{e \in E} N_e^T K_e N_e,   (3)

b = \mathcal{A}_{e \in E}(b_e) = \sum_{e \in E} N_e^T b_e,   (4)

where E is the set of elements, and the matrix N_e represents the transition between local and global numbering of the unknown variables for the e-th element. Contrary to the sparse global system matrix K, the matrix K_e of size n x n (n being the number of local degrees-of-freedom) is usually dense.

Using the above concept, the matrix-vector product, which is the basis of iterative solvers, can be reformulated in terms of element-wise computations as

K x = \mathcal{A}_{e \in E}(K_e N_e x) = \sum_{e \in E} N_e^T K_e (N_e x).   (5)

This means that the product of an assembled global matrix and a vector is equivalent to the assembled vector of the elementary matrix-vector products. According to (4), the elementary contributions can be accumulated in a vector whose size is equal to the number of global degrees-of-freedom (DoF), hence only vectors have to be stored during the computations. The elementary matrix-vector products in (5) can be computed for each element separately, which enables parallel realization [4].

The other building block of iterative solution methods is the inner product of two DoF-sized vectors. This operation is obviously independent of the mesh structure and connectivity, and its parallel execution is straightforward.

One more advantage of the EbE implementation worth mentioning is that no global numbering of unknowns and finite elements is required at all. This feature can be utilized in several ways, for instance with adaptive mesh generation or mesh reduction techniques [12].

B. Concurrency: Global Updates and Coloring

On shared memory architectures (like GPUs) an important question is how the partial products are summarized. The challenge during a global update is to ensure that different threads do not access the same memory space simultaneously. Such concurrent access is called a race condition and results in an indefinite outcome. Treatment of such cases is traditionally of two kinds. One solution is the atomic update, when the memory space is protected during I/O, causing other threads accessing the same memory place to wait until the operation is fully completed.

The other solution is a kind of coloring of the problem [13], [14]. In this case the mesh is considered as a graph, with the unknown variables (DoF) being the nodes of the graph and the elements representing the connections between them. This graph is then colored in a way that any two elements having the same color do not share a common unknown. Different colors are then processed serially (one after the other), while elements with the same color are processed simultaneously in parallel.

C. Element-by-Element Formulation of the BiCG Solver

In an EbE-implemented BiCG solver the computations can be grouped into so-called EbE steps and DoF steps, respectively. The former refers to the matrix-vector product of (5), while the latter means a vector-vector product or the initialization of variables. The BiCG algorithm requires several auxiliary vectors and complex variables during the iterations. The function of these variables is identical to that outlined in [2, Chapter 2.3.5]. The vectors are DoF-sized, and can therefore be handled (stored) the same way as the vector of unknowns.

The way the variables are stored gives the real modularity of the EbE method. Contrary to traditional FEM methods using global numbering, a dynamic storage structure is used instead. The structure can be thought of as an index array (pointers in the actual implementation) keeping the information on how local unknowns correspond to global ones. This is functionally equivalent to the role of N_e in (3), (4).

KISS et al.: PARALLEL REALISATION OF THE ELEMENT-BY-ELEMENT FEM TECHNIQUE BY CUDA

Algorithm 1 shows the EbE-implemented BiCG, which is functionally equivalent to that presented in [2, Chapter 2.3.5], and is implemented in terms of EbE and DoF iterations. The label EbE iteration indicates the computation of the element matrices. To avoid race conditions during global updates, the elements are colored. The iteration goes through all colors serially, and performs the computations on the elements having the actual color in parallel. The label DoF iteration indicates the computation of the vector-vector products. This iteration is performed simultaneously on all global unknowns. The label global update means that the value of a global variable is affected. To avoid race conditions, atomic updates are used to access global variables.

D. Some Drawbacks of the EbE Implementation

The lack of assembling, which makes the method convenient for GPU parallel execution, also raises several difficulties. The first one is related to preconditioning, which traditionally assumes the system matrix to be in an assembled form. To overcome this problem one can use specific element-by-element preconditioners [14], [15].

In this paper a simple Jacobi preconditioner is used [2], because it can be represented by a diagonal matrix, which can be stored the same way as the DoF-sized auxiliary vectors. The Jacobi preconditioner is implemented as a DoF step (see Algorithm 1, lines 10-11, 31 and 34-35).

The second problem is related to the required extra computations: since element matrices are not stored, they must be recomputed in each iteration, which is obviously redundant when dealing with linear problems. However, this extra computation becomes necessary for non-linear or coupled problems, where a kind of fixed-point iteration technique can be realized this way [12].
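The two core building blocks discussed above (the element-wise matrix-vector product of (5) and the element coloring) can be sketched in a few lines. The NumPy snippet below is a serial stand-in for the CUDA kernels; the function names are ours, and the greedy coloring is just one simple strategy, not necessarily the scheme of [13], [14].

```python
import numpy as np

def greedy_element_coloring(elems):
    """Assign a color to each element so that elements sharing a DoF differ.

    'elems' is a list of connectivity arrays (global DoF indices per element).
    Elements of one color touch disjoint DoFs, so their scatter-adds could
    run concurrently without atomic updates.
    """
    colors = []
    for conn in elems:
        taken = {c for other, c in zip(elems, colors) if set(conn) & set(other)}
        c = 0
        while c in taken:
            c += 1
        colors.append(c)
    return colors

def ebe_matvec(elems, elem_mats, x):
    """Compute y = K x element by element, without ever assembling K."""
    y = np.zeros_like(x)
    for conn, Ke in zip(elems, elem_mats):
        # gather local values (N_e x), multiply by the dense K_e,
        # then scatter-add the local result into the global vector (N_e^T ...)
        y[conn] += Ke @ x[conn]
    return y
```

On a GPU, the loop body of `ebe_matvec` becomes a kernel launched once per color: within one color the scatter-adds write disjoint entries of `y`, so no atomic updates are needed.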
IV. RESULTS
The chosen test problem is a static conduction problem with
inhomogeneous conductivity. The equation to be solved is
therefore the Laplace equation with spatially varying conduc-
tivity. The domain is discretized by tetrahedral elements and
linear nodal shape functions are used. The global unknowns
(DoF) are the potential values at the nodes of the mesh. The
element matrices are computed using analytical expressions
[16].
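For linear nodal shape functions on a tetrahedron, such an analytical expression is short enough to show. The sketch below (our own notation, not necessarily that of [16]) computes the element stiffness matrix of the conduction problem from the constant shape-function gradients, assuming a scalar conductivity that is constant over the element.

```python
import numpy as np

def tet_stiffness(verts, sigma=1.0):
    """Element stiffness matrix for -div(sigma grad u) on a tetrahedron.

    verts: (4, 3) array of vertex coordinates; sigma: constant scalar
    conductivity on the element. Linear shape functions
    N_i = a_i + b_i*x + c_i*y + d_i*z have constant gradients, so
    K_e[i, j] = sigma * V * (grad N_i . grad N_j), with V the volume.
    """
    M = np.hstack([np.ones((4, 1)), verts])   # row j: [1, x_j, y_j, z_j]
    C = np.linalg.inv(M)                      # column i: coeffs (a,b,c,d) of N_i
    grads = C[1:, :].T                        # row i: grad N_i = (b_i, c_i, d_i)
    vol = abs(np.linalg.det(M)) / 6.0         # tetrahedron volume
    return sigma * vol * (grads @ grads.T)
```

The rows of the resulting 4 x 4 matrix sum to zero, since the shape-function gradients sum to zero; this is a quick sanity check on any implementation.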
To study the accuracy and speed of the proposed method, the Utah Torso model was investigated by solving an ECG forward problem [17] (cf. Fig. 1).
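The overall solution loop can also be made concrete with a small serial sketch. The version below uses plain CG with the Jacobi preconditioner instead of BiCG, and is exercised on a toy 1D reaction-diffusion problem rather than the torso model; element matrices are recomputed inside every matrix-vector product, in the EbE spirit, and all names are illustrative.

```python
import numpy as np

def ebe_pcg(elems, elem_mat, n_dof, b, tol=1e-10, max_iter=500):
    """Jacobi-preconditioned CG in which A*v is formed element by element."""
    def matvec(v):
        y = np.zeros(n_dof)
        for conn in elems:
            Ke = elem_mat(conn)        # recomputed on the fly, never stored
            y[conn] += Ke @ v[conn]    # gather, local product, scatter-add
        return y

    # Jacobi preconditioner: accumulate the global diagonal element-wise
    diag = np.zeros(n_dof)
    for conn in elems:
        diag[conn] += np.diag(elem_mat(conn))

    x = np.zeros(n_dof)
    r = b - matvec(x)
    z = r / diag
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

On a GPU, `matvec` would become a colored EbE kernel and the dot products DoF kernels; only `b`, `x`, `diag` and a few DoF-sized auxiliary vectors are ever stored.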
IEEE TRANSACTIONS ON MAGNETICS, VOL. 48, NO. 2, FEBRUARY 2012

Fig. 1. Utah torso model. Solid faces correspond to organs.

TABLE I
RESULTS FOR THE UTAH TORSO PROBLEM

Run time statistics obtained for several different mesh sizes are shown in Table I. The computations have been carried out on an HP XW8600 workstation, having 8 GB memory, an NVIDIA GTX 590 GPU and a quad-core Intel Xeon X3440 CPU.

As also outlined in [7], the performance of the GPU-accelerated matrix-vector multiplication (MxV) highly depends on the structure of the matrix, i.e., the distribution and number of non-zero elements. Unlike methods using GPUs only for accelerating the computation of the MxV, the proposed method does not rely on the system matrix, hence no such degradation may occur. This results in a uniform performance, irrespective of the domain (matrix) structure.

Another advantage is the memory efficiency. Since sparse matrix storage inherently requires some extra storage overhead (for the row and column information), the efficiency of memory occupancy is limited. On the contrary, the proposed EbE FEM only needs to store the meshing information and several DoF-sized auxiliary vectors required for the CG iterations.

V. CONCLUSION

In this paper the EbE FEM method is re-formulated to fit the GPU architecture. The method has extremely low memory consumption and can take full advantage of the massively parallel execution environment. Not only does the algorithm outperform traditional CUDA-accelerated FEM methods [5]-[7], but it is also

ACKNOWLEDGMENT

The work reported in the paper has been developed in the framework of the project "Talent care and cultivation in the scientific workshops of BME". This project was supported by the grant TÁMOP-4.2.2.B-10/1/KMR-2010-0009.

REFERENCES

[1] P. P. Silvester and R. L. Ferrari, Finite Elements for Electrical Engineers. Cambridge, U.K.: Cambridge University Press, 1990.
[2] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. Philadelphia, PA: SIAM, 1994.
[3] G. F. Carey, E. Barragy, R. McLay, and M. Sharma, "Element-by-element vector and parallel computations," Commun. Appl. Numer. Methods, vol. 4, no. 3, pp. 299-307, 1988.
[4] G. F. Carey and B.-N. Jiang, "Element-by-element linear and nonlinear solution schemes," Commun. Appl. Numer. Methods, vol. 2, no. 2, pp. 145-153, 1986.
[5] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: Conjugate gradients and multigrid," ACM Trans. Graph., vol. 22, pp. 917-924, Jul. 2003.
[6] C. Cecka, A. Lew, and E. Darve, "Introduction to assembly of finite element methods on graphics processors," in IOP Conf. Series: Materials Science and Engineering, 2010, vol. 10, no. 1, p. 012009.
[7] A. Cevahir, A. Nukada, and S. Matsuoka, "Fast conjugate gradients with multiple GPUs," in ICCS 2009, G. G. van Albada, J. Dongarra, and P. Sloot, Eds., 2009, vol. 5544, pp. 893-903.
[8] W. Hackbusch and B. N. Khoromskij, "Direct Schur complement method by domain decomposition based on H-matrix approximation," Comput. Vis. Sci., vol. 8, pp. 179-188, Dec. 2005.
[9] I. Kiss, S. Gyimóthy, and J. Pávó, "Acceleration of moment method using CUDA," COMPEL: Int. J. Comput. Math. Elect. Eng., vol. 31, no. 6, to be published.
[10] T. R. Halfhill, "Parallel processing with CUDA," Microproc. J., 2008.
[11] D. Kirk and W.-M. Hwu, Programming Massively Parallel Processors: A Hands-On Approach. San Mateo, CA: Morgan Kaufmann, 2010.
[12] S. Gyimóthy and I. Sebestyén, "Symbolic description of field calculation problems," IEEE Trans. Magn., vol. 34, no. 5, pp. 3427-3430, 1998.
[13] C. Farhat and L. Crivelli, "A general approach to nonlinear FE computations on shared-memory multiprocessors," Comput. Methods Appl. Mech. Eng., vol. 72, no. 2, pp. 153-171, Feb. 1989.
[14] A. J. Wathen, "An analysis of some element-by-element techniques," Comput. Methods Appl. Mech. Eng., vol. 74, no. 3, pp. 271-287, Sep. 1989.
[15] G. Golub and Q. Ye, "Inexact preconditioned conjugate gradient method with inner-outer iterations," SIAM J. Sci. Comput., vol. 21, no. 4, pp. 1305-1320, 2000.
[16] A. Nentchev, "Numerical Analysis and Simulation in Microelectronics by Vector Finite Elements," Ph.D. dissertation, Tech. Univ. Wien, Vienna, 2008.
[17] R. MacLeod, C. Johnson, and P. Ershler, "Construction of an inhomogeneous model of the human torso for use in computational ECG," in IEEE Eng. in Med. and Biology Society Annu. Conf., 1991, pp. 688-689.