Downloaded from ascelibrary.org by University of Leeds on 05/17/15. Copyright ASCE. For personal use only; all rights reserved.
Abstract: Parallel computing in civil engineering has largely been restricted to monotonic shock or blast loading handled with explicit algorithms, which are characteristically feasible to parallelize. In the present paper, efficient parallelization strategies for a highly demanded implicit nonlinear finite-element analysis (FEA) program for real-scale reinforced concrete (RC) structures under cyclic loading are proposed.
A quantitative comparison of state-of-the-art parallel strategies in terms of factorization was carried out, leading to a problem-optimized solver that successfully embraces the penalty method and the banded nature of the system. In particular, the penalty method employed imparts considerable smoothness to the global response, which yields practical superiority of the parallel triangular system solution over advanced solvers such as the parallel preconditioned conjugate gradient method. Other salient issues in parallelization are also addressed. By virtue of the parallelization, the analysis platform offers unprecedented access to physics-based mechanisms and probabilistic randomness at the entire system level and realistically reproduces global degradation and localized damage, as reflected in the application to an RC structure. Equipped with accuracy, stability, and scalability, the parallel platform is believed to serve as a fertile ground for the introduction of further physical mechanisms into various research fields, as well as for the earthquake engineering community. DOI: 10.1061/(ASCE)CP.1943-5487.0000138. © 2012 American Society of Civil Engineers.
CE Database subject headings: Finite element method; Parallel processing; Earthquake engineering; Concrete structures; Algorithms;
Cyclic loads.
Author keywords: Finite-element method; Parallel processing; Earthquake engineering; Concrete; Algorithms.
(PCGM). Other parallelization issues regarding cyclic global data distribution, the "divide-and-conquer" strategy, domain decomposition, and so on shall also be dealt with. Finally, along with a brief review of the physics-based degrading material models introduced in this paper, a practical application to a complex three-dimensional (3D) shear wall system exposed to cyclic loading shall be provided to demonstrate the parallel efficiency, as well as the accuracy, of the developed implicit, nonlinear parallel analysis program.

Key Characteristics of the Serial Version Program

As summarized in Table 1, the mainstream is twofold: static analysis (A) for initial loading followed by nonlinear analysis (B). At the beginning of the nonlinear analysis (B 1), the global stiffness matrix is augmented by additional penalty elements and then stored in its factorized form (B 2). The ensuing displacement-controlled analysis stage consists of two major loops: the main iteration loop (B 3), to obtain the required external forces corresponding to the target displacement, and the inner iteration loop (C), for modified NR iteration using the factorized stiffness.

Although full NR iteration is common in most nonlinear analyses for its fast convergence rate, the modified NR iteration using the initial stiffness was adopted to take full advantage of the penalty method. Indeed, the penalty method essentially imparts sufficient smoothness to the global force-displacement response, and as confirmed in most simulations here, marching with the initial stiffness yields a converged response within several iteration steps, allowing triangular system solving with KLU to be performed repeatedly. Consequently, it is possible to save the expensive cost of tangent stiffness reconstruction and its redistribution across processors, leaving only two factorizations in the static and nonlinear analysis stages; notably, these remain significant bottlenecks in the main execution stream.

Among a multitude of advanced parallel strategies, three representative ones, in terms of factorization, are studied in this paper: (1) broadcasting, (2) pipelining, and (3) the look-ahead method (for details of each, see Casanova et al. 2009). Before moving forward, it is useful to define the key procedures of the serial version factorization. Let P(k) be the preparation procedure at step k: preparation of factors for the subrows below the kth diagonal term, performed on processor Pk (Pk = the processor holding the kth column and diagonal term). Let U(k) be the update procedure at step k: update of submatrix A(i, j) for i, j > k with the precalculated factors from the kth diagonal term.

The first and simplest parallelization strategy is the broadcasting scheme utilizing the direct broadcasting command (i.e., MPI_Bcast) at each step. The key stream can be summarized as: P(k) → broadcast to all processors → U(k). It is remarkably easy to understand and implement, and indeed the broadcasting command can be almost freely interleaved into the routine. The key drawback of this approach, however, is that all other processors need to wait for the data from the sender processor Pk until P(k) is fully finished on Pk at each step, causing unnecessary waiting cost between processors. Furthermore, the broadcasting command itself suffers communication inefficiency as the number of processors increases.

To expand on this adverse nature, it is instructive to review the cost analysis of two algorithms: (1) a parallel factorization followed by triangular system solving (Karniadakis and Kirby 2003), and (2) a parallel Gaussian elimination (Casanova et al. 2009), both based on a simple broadcasting approach. The total costs generally read

Total running time ≈ α′/p + β′p   (1)

where α′ = αn³/3; and β′ = (βn²/2 + L) or (βn² + L) for the former and the latter of the two algorithms, respectively. Here α = basic operation cost per element; β = transfer cost per element; L = communication startup cost; n = system size; and p = total number of processors.
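The trade-off captured by Eq. (1) can be made concrete with a small numerical sketch (Python; the values of α, β, and L below are illustrative assumptions of ours, not measurements from the paper):

```python
# Cost model of Eq. (1): total running time ~ alpha'/p + beta'*p.
# The first term is the distributed arithmetic, the second the
# accumulated broadcast overhead that grows with the processor count.

def total_time(p, n, alpha=1e-9, beta=1e-8, L=1e-6):
    """Estimated time of broadcast-based parallel factorization followed
    by triangular solving; alpha' = alpha*n^3/3, beta' = beta*n^2/2 + L."""
    alpha_prime = alpha * n ** 3 / 3.0
    beta_prime = beta * n ** 2 / 2.0 + L
    return alpha_prime / p + beta_prime * p

def optimal_p(n, alpha=1e-9, beta=1e-8, L=1e-6):
    """Balancing the two terms gives the best count p* = sqrt(alpha'/beta')."""
    return ((alpha * n ** 3 / 3.0) / (beta * n ** 2 / 2.0 + L)) ** 0.5
```

For n = 2000 and these assumed constants, the model speeds up until roughly a dozen processors and then slows down again as the β′p term dominates, which is exactly the behavior sketched in Fig. 1.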
[Fig. 1: cost of the broadcasting method versus number of processors]

The second term in Eq. (1) cannot be ignored, and simply increasing the total number of processors cannot guarantee parallel efficiency. Indeed, the total cost will undesirably increase with the growth in the total number of processors through the second term of Eq. (1), as shown in Fig. 1. Hence, the simple broadcasting scheme is taken as the simplest yet poorest one in the later discussion and is used as the comparison base for the other advanced parallel strategies.

The second algorithm is the pipelined algorithm, following the key notion of "pipelining," in which logical topology is well incorporated. In this scheme, every processor knows its logically closest neighbor, and upon receiving the crucial data, a processor always passes it on to the next processor.
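The hand-off pattern just described can be sketched as a toy ring model (our illustration, not the authors' MPI code): the owner of step k's data sends it to its logical neighbor, and every processor forwards the message immediately upon receipt, overlapping transfers with update work.

```python
# Toy model of the pipelined scheme on a logical ring: data prepared on
# the owner of step k reaches processor m after (m - owner) mod p hops,
# and each hop can overlap with the update work U(k) of other steps.

def ring_hops(p, owner):
    """Hop count at which each processor receives the owner's message."""
    return {m: (m - owner) % p for m in range(p)}

def forwarding_order(p, owner):
    """Processors in the order the pipelined message visits them."""
    return [(owner + step) % p for step in range(p)]

hops = ring_hops(8, owner=2)        # e.g. processor 3 gets the data first
order = forwarding_order(8, owner=2)
```

Because the message leaves the owner after a single point-to-point send, the sender is not blocked the way it is in a p-wide broadcast, and successive factorization steps can flow through the ring back to back.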
…be the well-known "cache effect" considering the memory hierarchy of modern CPUs; essentially, the major task in the pipelined algorithm in this paper involves a single vector manipulation, and the amount of data appears to decrease with increasing step number.

Table 3. Pseudocode of the optimized parallel algorithm for the upper triangular system, successfully exploiting column-based cyclic distribution and banded nature (columns: Line | Major tasks)

Table 4 (parallel PCGM main loop; recovered rows):
Calculate r̃_{k+1}: the type of M determines a proper parallel algorithm, r̃_{k+1} = M⁻¹ r_{k+1} (if the stopping criterion is met, exit)
Calculate β_k and then c_{k+1}: β_k = (r̃_{k+1} · r_{k+1}) / (r̃_k · r_k) = (Σ_{m=0}^{p−1} r̃_{k+1}^{(m)} · r_{k+1}^{(m)}) / (Σ_{m=0}^{p−1} r̃_k^{(m)} · r_k^{(m)}); c_{k+1}^{(m)} = r̃_{k+1}^{(m)} + β_k c_k^{(m)}
End of main loop
Note: p = total processors, and superscript (m) denotes local storage on processor P_m.
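The summation rows of Table 4 amount to assembling β_k from per-processor partial inner products. A serial stand-in for the MPI reduction (array and function names are ours) looks like:

```python
# beta_k = (r~_{k+1} . r_{k+1}) / (r~_k . r_k), assembled from the local
# slices held by each processor under column-based cyclic distribution.

def cyclic_slices(v, p):
    """Processor m owns entries m, m + p, m + 2p, ... of vector v."""
    return [v[m::p] for m in range(p)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def distributed_beta(rt_new, r_new, rt_old, r_old, p):
    num = sum(dot(x, y) for x, y in zip(cyclic_slices(rt_new, p),
                                        cyclic_slices(r_new, p)))
    den = sum(dot(x, y) for x, y in zip(cyclic_slices(rt_old, p),
                                        cyclic_slices(r_old, p)))
    return num / den  # the global summation; in MPI this is a reduction
```

The result is independent of p, so the processor count can be changed without affecting the PCGM iterates themselves.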
…multiplication and two inner products, and for β_k one inner product; these fall into embarrassingly parallelizable forms (denoted as summation in Table 4). The only remaining task is the calculation …

Fig. 6. Column-based cyclic allocation on all processors for factorization and solving

The local internal force on each subdomain V_k, k ∈ [1, p−1], is calculated on processor P_k and then passed to the master processor P_0. Finally, the global summation is done on P_0, and afterward P_0 calculates a new unbalance force, determines convergence, and so on:
F_internal^i = Σ_{k=1}^{p−1} F_internal,k^i = Σ_{k=1}^{p−1} [ ∫_{V_k} Bᵀ σ_c^i dV_k + ∫_{V_k} Bᵀ σ_s^i dV_k ]   (3)

where i = iteration step number; F_internal^i = global internal force vector on master node P_0; F_internal,k^i = local internal force vector on processor P_k; p = number of total processors; V_k = subdomain k on processor P_k; B = strain-displacement matrix; and σ_c^i and σ_s^i = current concrete and steel stresses, respectively.

Special attention has been paid to the decomposition of perfectly bonded steels (realized by 3D truss elements in this paper), which might be shared by several subdomains. For steels lying on the boundary between subdomains, the processor with the smaller ID number is assumed to have priority to hold those steels (e.g., a steel shared by subdomains 1 and 2 is assigned to subdomain 1). With these efforts, as shown in Fig. 5, the desired speed-up in the construction of the new internal force through nonlinear element update has been achieved, despite the inclusion of the multidirectional smeared crack model and the nonlinear steel material.

[Fig. 5: speed-up, defined as t4/tp, of the nonlinear element update versus number of processors]

For practical knowledge, it is necessary to touch on effective data management in the parallelization. The key equation to be solved in most parallel explicit programs looks more or less like Eq. (4), with the assumption of lumped mass on nodes (from Danielson and Namburu 1998):

M ü_t = P_t − F_t   (4)

where M = mass = diag(m_ii), i ∈ [1, n]; ü_t = acceleration at time t; P = external force; and F = internal resistance force, possibly including material/geometry nonlinearity.

Because of the lumped mass assumption, it is obvious that Eq. (4) leads to an embarrassingly parallelizable situation involving only vector manipulations. In terms of straightforward domain decomposition, various parallel tools have been developed and utilized: e.g., METIS and its parallel version ParMETIS by Karypis and Kumar (1995a, b) for weighted domain partitioning, which accounts for the imbalance resulting from multiple nonlinear materials and for domain distribution across heterogeneous processors (Sziveri and Topping 2000).

In general, such favorable conditions do not hold for implicit programs such as the one dealt with in this paper. As a successful remedy to the obstacle of global data management, column-based cyclic allocation is exploited. As shown in Fig. 6, each column of the global stiffness matrix is cyclically distributed across all processors, as are the factorized triangular matrices (cf. row-based cyclic allocation, which performs almost equally). Indeed, the cyclic allocation scheme has proved to balance the computation load very effectively (as an extreme case, if the system size is sufficiently large and the stiffness matrix is almost fully populated, the computation cost of each processor asymptotically converges to the same value). However, block-based cyclic allocation is regarded to possess better performance, and thus it shall be a natural extension in the future.

Indexing problems naturally emerge from the cyclic data distribution, being far more complicated than in the serial version. To keep the portability and object-oriented nature of the parallel algorithms, a one-to-one mapping function for the index is a successful tool. In the present platform, an overloaded operator () serves as this function, in which a global term a(i, j) with global indexes i, j ∈ [0, n−1] exactly indicates the corresponding term ã_IJ in the compact storage, where I ∈ [0, 2b−1], b = band width, J ∈ [0, r−1], and r = n/p = number of columns per processor.
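The overloaded operator described above can be reconstructed as a small mapping routine. The sketch below is our interpretation of the stated index ranges (the band-offset convention I = i − j + b is an assumption of ours), not the authors' C++ implementation:

```python
# Map a global banded entry a(i, j), i, j in [0, n-1], to its compact
# counterpart ~a_IJ: the owning processor m = j mod p (column-based
# cyclic allocation, Fig. 6), the local column J = j // p with
# J in [0, r-1] and r = n/p, and a band offset I derived from i - j.

class BandedCyclicIndex:
    def __init__(self, n, p, b):
        self.n, self.p, self.b = n, p, b
        self.r = n // p                     # columns per processor

    def __call__(self, i, j):
        """Return (processor, I, J) for the global term a(i, j)."""
        if abs(i - j) > self.b:
            raise ValueError("entry lies outside the band")
        m = j % self.p                      # cyclic owner of column j
        I = i - j + self.b                  # assumed band-offset convention
        J = j // self.p                     # local column index
        return m, I, J

idx = BandedCyclicIndex(n=16, p=4, b=3)
```

For example, idx(5, 4) places the entry one row below the diagonal of column 4 on processor 0, local column 1; out-of-band indices are rejected, mirroring the fact that only the band is stored.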
…with the uniformly decomposed domain, mainly with respect to solid elements.

Based on the distinct characteristics of the present analysis platform, i.e., local iteration-free material models and a total strain-based smeared crack, the errors that might arise from parallelization appear to be negligible. Indeed, a test simulation with a simple system of 40 concrete elements and 30 perfectly bonded steel bars was conducted up to severely damaged states involving crushing and steel yielding under cyclic loading. Results revealed that the mean square errors (against the 1-CPU case) from the analyses with 8 CPUs and 16 CPUs were 2.0 × 10⁻¹⁶ and 2.04 × 10⁻¹⁴, respectively, whereas the error was about zero with 4 and 6 CPUs.

If any further complicated problems are to be tackled in future extensions, there should be pertinent consideration of dynamic load balance and domain decomposition techniques, as well as of error management in parallel processing, all of which fortunately are well established and available in the literature.

Physical Mechanisms and Randomness in the Entire Domain Made Accessible by the Parallel Platform

To alleviate this pathological nature mainly with a physically plausible remedy, a 3D interlocking mechanism has been proposed. Random particles used in the mechanism were generated from a Gaussian distribution, in accordance with major trends in tribology (e.g., Jackson and Green 2005). At each step, the active contacting areas of each hemisphere-indentation couple yield the tangent shear stiffness, comparable to Walraven's 2D interlocking model (Walraven 1994). By virtue of the parallel platform, the random distribution of ideal particles in an unstructured manner across the whole H-shaped wall system was made possible [Fig. 8(c)].

As shown in Fig. 7(d) for the reinforcing bar, an integrated model has been proposed by the authors: mainly based on the well-known Menegotto and Pinto (1973) steel model for smooth transition, the initiation of compressive buckling by Dhakal and Maekawa (2002), and the strain parameter concept for early buckling in the positive strain regime (Rodriguez et al. 1999). With the parallel platform, the topological transition, defined by the loss of surrounding elements due to crushing or spalling, is queried at each step so as to realistically lengthen the compressive buckling length of a bar, whereas it is simply assumed constant during analysis in most existing research.
Fig. 7. (a) Thorenfeldt compressive model generalized by an un/reloading model; (b) tension softening regime defined on three orthogonal crack surfaces; (c) the fabric of rigid hemisphere-soft indentation proposed by the authors for nonlinear shear across an opened crack; (d) reinforcing steel bar model incorporating compressive buckling
Fig. 8. (a) Dimensional details of the H-shaped wall system; (b) reinforcement layout; (c) random particle distribution across the entire domain from the Gaussian distribution for the 3D interlocking model
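The random particle field of Fig. 8(c) amounts to drawing one Gaussian particle size per site across the whole mesh. A minimal sketch (the mean size, scatter, and truncation at zero are our assumptions; the paper specifies only that the sizes follow a Gaussian distribution):

```python
import random

# One interlocking-particle size per concrete element, drawn from a
# Gaussian and clipped at zero so no site gets a nonphysical diameter.

def particle_field(n_elems, mean_d=1.0, cov=0.25, seed=7):
    """cov = coefficient of variation (std/mean) of the particle size."""
    rng = random.Random(seed)
    return [max(rng.gauss(mean_d, cov * mean_d), 0.0)
            for _ in range(n_elems)]

sizes = particle_field(7784)   # one entry per hexahedral concrete element
```

Fixing the seed keeps a parallel run reproducible: every processor can regenerate the sizes of the elements in its own subdomain without any communication.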
The dimensional details, reinforcement layout, and random particle distribution across the entire domain are provided in Fig. 8. In both web and flange, the concrete strength is 21.7 MPa and the strain at the peak is 0.00204; all steel bars are D6 type (7 mm diameter), and the yielding stress and the corresponding strain are 605 MPa and 0.00318, respectively. A total of 7,784 linear hexahedral elements for concrete and 4,692 space truss elements for perfectly bonded reinforcing bars were generated for the finite-element modeling, leading to 11,212 nodes in total.

From the first excursion of the displacement loading, diagonal cracks were initiated. They eventually extended over the full height of the web part at approximately 45° as the cyclic loading proceeded. These diagonal cracks were widely distributed on the web, the crack directions remained unchanged during the entire experiment, and they were effectively captured by the multidirectional smeared crack model in the present program.

It was observed during the experiment that the complete realignment of the previously opened crack surfaces became impossible, and consequently grinding between the crack surfaces took place. This can be regarded as a sort of interlocking. Crushing at the intersection zones of diagonal cracks and severe grinding were believed to be the major factors that caused such localized damage with vertical directivity. Since the present 3D interlocking model utilizes a random particle distribution over the entire domain, it was possible to predict the localized damage with vertical directivity, as emphasized in Fig. 9(b). The 3D interlocking also appears to capture the irrecoverable bulging phenomena in the out-of-plane direction under cyclic loading. Indeed, the randomness of material properties appears to serve as an essential factor in bringing about such localized damage and failure, as similarly identified …

[Figure: normalized cost T(p) of nonlinear element update and factorization versus number of processors]
Fig. 9. (a) Resultant force-displacement response comparison; (b) deformed shape (amplified) revealing localized damage on web with vertical
directivity marked by dashed line; (c) ultimate damage at the end of experiment (Palermo and Vecchio 2002, with permission from ACI Publishing)
The authors are also deeply grateful to Professor S. Krishnan for his consistent support for parallel simulations.
Concluding Remarks

As demonstrated so far, parallelization appears to be essentially problem-dependent: how well the method embraces the key features of the problem under consideration determines the ultimate parallel efficiency. In the quantitative comparison among representative parallel strategies, particularly in terms of factorization, some practical knowledge was attained: (1) the performance of the "broadcasting" strategy tends to deteriorate beyond a certain range of total processors; (2) contrary to anticipation, the most advanced "look-ahead" strategy appears to exhibit poor performance with a small number of processors; and (3) only the "pipelined" strategy reveals overall stable performance. Optimization of the pipelined factorization was then carried out, successfully taking advantage of the penalty method and the banded nature of the system. Because the penalty method imparts remarkable smoothness to the global response, the parallel triangular system solving was able to achieve practical superiority over advanced parallel solvers such as the parallel PCGM, as confirmed by quantitative comparison.

The implemented "divide-and-conquer" approach for all other embarrassingly parallelizable tasks performs favorably on the master-slaves concept after nonoverlapping uniform domain decomposition. Especially for a moderate-size RC structure, the master-slaves approach enables the nonlinear element update to be done without any intercommunication between slave processors, leading to clear scalability. By this successful parallelization of the nonlinear element update procedure, the developed parallel platform could be imbued with a multitude of physical mechanisms to describe progressive and localized damage phenomena at the entire system level. It should be stressed, however, that the platform shall harmonize with reliable parallel libraries, e.g., a parallel sparse matrix solver and a dynamic load balance scheme, in future research to achieve general applicability.

By filling the gap between microscopic physics and global degradation with localization, the parallel platform offers unprecedented access to physics-based mechanisms (e.g., the multidirectional smeared crack model, the 3D interlocking model, and nonlinear steel with evolving buckling length) and even to probabilistic randomness at the entire system level. Indeed, the random distribution of a crucial mechanical parameter (i.e., the interlocking particle size in this paper) across the entire domain appears to be essential for the irrecoverable localization, as reflected in the application to a real-scale RC structure. Equipped with accuracy, stability, and scalability, the implicit nonlinear FEA program in its parallel version is believed to serve as a fertile ground for the introduction of further physical mechanisms or further sophistication of the probabilistic material properties across the entire domain.

References

Andrade, J. E., Baker, J. W., and Ellison, K. C. (2007). "Random porosity fields and their influence on the stability of granular media." Int. J. Numer. Anal. Methods Geomech., 32(10), 1147–1172.
Casanova, H., Legrand, A., and Robert, Y. (2009). Parallel algorithms, CRC Press, Boca Raton, FL.
Cheng, F. Y., Mertz, G. E., Sheu, M. S., and Ger, J. F. (1993). "Computed versus observed inelastic seismic low-rise RC shear walls." J. Struct. Eng., 119(11), 3255–3275.
Colotti, V. (1993). "Shear behavior of RC structural walls." J. Struct. Eng., 119(3), 728–746.
Crisfield, M. A., and Wills, J. (1989). "Analysis of R/C panels using different concrete models." J. Eng. Mech., 115(3), 578–597.
Danielson, K. T., Akers, S. A., O'Daniel, J. L., Adley, M. D., and Garner, S. B. (2008). "Large-scale parallel computation methodologies for highly nonlinear concrete and soil applications." J. Comput. Civ. Eng., 22(2), 140–146.
Danielson, K. T., and Namburu, R. R. (1998). "Nonlinear dynamic finite element analysis on parallel computers using FORTRAN 90 and MPI." Adv. Eng. Softw., 29(3–6), 179–186.
DeGroot, A. J., Sherwood, R. J., Badders, D. C., and Hoover, C. G. (1997). "Parallel contact algorithms for explicit finite element analysis (DYNA3D)." Proc., 4th U.S. National Congress on Computational Mechanics, Austin, TX.
Dhakal, R., and Maekawa, K. (2002). "Modeling for postyielding buckling of reinforcement." J. Struct. Eng., 128(9), 1139–1147.
Hoover, C. G., DeGroot, A. J., Maltby, J. D., and Procassini, R. J. (1995). "ParaDyn-DYNA3D for massively parallel computers." Engineering, Research, Development and Technology FY94, UCRL 53868-94, Lawrence Livermore National Laboratory, Livermore, CA.
Jackson, R. L., and Green, I. (2005). "A statistical model of elasto-plastic asperity contact between rough surfaces." Tribol. Int., 39(9), 906–914.
Karniadakis, G. E., and Kirby, R. M., II. (2003). Parallel scientific computing in C++ and MPI (a seamless approach to parallel algorithms and their implementation), Cambridge University Press, Cambridge, UK.
Karypis, G., and Kumar, V. (1995a). "A fast and high quality multilevel scheme for partitioning irregular graphs." Technical Rep. No. TR 95-035, Dept. of Computer Science, Univ. of Minnesota, Minneapolis.
Karypis, G., and Kumar, V. (1995b). "A parallel algorithm for multilevel graph partitioning and sparse matrix ordering." Technical Rep. No. TR 95-036, Dept. of Computer Science, Univ. of Minnesota, Minneapolis.
Kim, S. W., and Vecchio, F. J. (2008). "Modeling of shear-critical reinforced concrete structures repaired with fiber-reinforced polymer composites." J. Struct. Eng., 134(8), 1288–1299.
Menegotto, M., and Pinto, P. (1973). "Method of analysis of cyclically loaded RC plane frames including changes in geometry and nonelastic […]"
[…] "reinforced concrete solid." Can. J. Civ. Eng., 24(3), 460–470.
Shahinpoor, M. (1980). "Statistical mechanical considerations on the random packing of granular materials." Powder Technol., 25(2), 163–176.
Sotelino, E. D. (2003). "Parallel processing techniques in structural engineering applications." J. Struct. Eng., 129(12), 1698–1706.
Spacone, E., and El-Tawil, S. (2004). "Nonlinear analysis of steel-concrete […]"
[…] "theory for reinforced concrete elements subjected to shear." J. Am. Concr. Inst., 83(2), 219–231.
Walraven, J. (1994). "Rough cracks subjected to earthquake loading." J. Struct. Eng., 120(5), 1510–1524.
Wilkinson, B., and Allen, M. (1999). Parallel programming, Prentice Hall, Upper Saddle River, NJ.