Downloaded from ascelibrary.org by University of Leeds on 05/17/15. Copyright ASCE. For personal use only; all rights reserved.
Abstract: Parallel computing in civil engineering has largely been restricted to monotonic shock or blast loading handled with explicit algorithms, which are characteristically feasible to parallelize. In the present paper, efficient parallelization strategies for a highly demanded implicit nonlinear finite-element analysis (FEA) program for real-scale reinforced concrete (RC) structures under cyclic loading are proposed.
A quantitative comparison of state-of-the-art parallel strategies in terms of factorization was carried out, leading to a problem-optimized solver that successfully embraces the penalty method and the banded nature of the system. In particular, the penalty method employed imparts considerable smoothness to the global response, which yields practical superiority of the parallel triangular system solution over advanced solvers such as the parallel preconditioned conjugate gradient method. Other salient issues in parallelization are also addressed. By virtue of the parallelization, the analysis platform offers unprecedented access to physics-based mechanisms and probabilistic randomness at the entire system level and realistically reproduces global degradation and localized damage, as reflected in the application to an RC structure. Equipped with accuracy, stability, and scalability, the parallel platform is believed to serve as a fertile ground for the introduction of further physical mechanisms into various research fields, as well as for the earthquake engineering community. DOI: 10.1061/(ASCE)CP.1943-5487.0000138. © 2012 American Society of Civil Engineers.
CE Database subject headings: Finite element method; Parallel processing; Earthquake engineering; Concrete structures; Algorithms;
Cyclic loads.
Author keywords: Finite-element method; Parallel processing; Earthquake engineering; Concrete; Algorithms.
(PCGM). Other parallelization issues regarding cyclic global data distribution, the "divide-and-conquer" strategy, domain decomposition, and so on shall also be dealt with. Finally, along with a brief review of the physics-based degrading material models introduced in this paper, a practical application to a complex three-dimensional (3D) shear wall system exposed to cyclic loading shall be provided to demonstrate the parallel efficiency, as well as the accuracy, of the developed implicit, nonlinear parallel analysis program.

Key Characteristics of the Serial Version Program

As summarized in Table 1, the mainstream is twofold: static analysis (A) for initial loading followed by nonlinear analysis (B). At the beginning of the nonlinear analysis (B 1), the global stiffness matrix is augmented by additional penalty elements and then stored in its factorized form (B 2). The ensuing displacement-controlled analysis stage consists of two major loops: the main iteration loop (B 3), to obtain the required external forces corresponding to the target displacement, and the inner iteration loop (C), for modified NR iteration using the factorized stiffness.

Although full NR iteration is common in most nonlinear analyses for its fast convergence rate, the modified NR iteration using the initial stiffness was adopted to take full advantage of the penalty method. Indeed, the penalty method essentially imparts sufficient smoothness to the global force-displacement response, and as confirmed in most simulations here, marching with the initial stiffness yields a converged response within several iteration steps, allowing triangular system solving with KLU to be performed repeatedly. Consequently, it is possible to save the expensive cost of tangent stiffness reconstruction and its redistribution across processors, leaving only two factorizations in the static and nonlinear analysis stages; notably, these remain significant bottlenecks in the main execution stream.

Among a multitude of advanced parallel strategies, three representative ones, in terms of factorization, are studied in this paper: (1) broadcasting, (2) pipelining, and (3) the look-ahead method (for details of each, see Casanova et al. 2009). Before moving forward, it is useful to define the key procedures of the serial version factorization. Let P(k) be the preparation procedure at step k: preparation of factors for the subrows below the kth diagonal term, performed on processor Pk (Pk = the processor holding the kth column and diagonal term). Let U(k) be the update procedure at step k: update of submatrix A(i, j) for i, j > k with the precalculated factors from the kth diagonal term.

The first and simplest parallelization strategy is the broadcasting scheme utilizing the direct broadcasting command (i.e., MPI_Bcast) at each step. The key stream can be summarized as: P(k) → broadcast to all processors → U(k). It is remarkably easy to understand and implement, and indeed the broadcasting command can be almost freely interleaved into the routine. The key drawback of this approach, however, is that all other processors need to wait for the data from the sender processor Pk until P(k) is fully finished on Pk at each step, causing unnecessary waiting cost between processors. Furthermore, the broadcasting command itself suffers communication inefficiency as the number of processors increases.

To expand on this adverse nature, it is instructive to review the cost analysis of two algorithms: (1) a parallel factorization followed by triangular system solving (Karniadakis and Kirby 2003), and (2) a parallel Gaussian elimination (Casanova et al. 2009), both based on a simple broadcasting approach. The total costs generally read

Total running time ≈ α′/p + β′p   (1)

where α′ = αn³/3; and β′ = (βn²/2 + L) or (βn² + L) for the former and the latter of the two algorithms, respectively. Here α = basic operation cost per element; β = transfer cost per element; L = communication startup cost; n = system size; and p = total number of processors.
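The trade-off captured by Eq. (1) can be made concrete with a small numerical sketch (Python; the values of α, β, and L below are illustrative assumptions of ours, not measurements from the paper):

```python
# Cost model of Eq. (1): total running time ~ alpha'/p + beta'*p.
# The first term is the distributed arithmetic, the second the
# accumulated broadcast overhead that grows with the processor count.

def total_time(p, n, alpha=1e-9, beta=1e-8, L=1e-6):
    """Estimated time of broadcast-based parallel factorization followed
    by triangular solving; alpha' = alpha*n^3/3, beta' = beta*n^2/2 + L."""
    alpha_prime = alpha * n ** 3 / 3.0
    beta_prime = beta * n ** 2 / 2.0 + L
    return alpha_prime / p + beta_prime * p

def optimal_p(n, alpha=1e-9, beta=1e-8, L=1e-6):
    """Balancing the two terms gives the best count p* = sqrt(alpha'/beta')."""
    return ((alpha * n ** 3 / 3.0) / (beta * n ** 2 / 2.0 + L)) ** 0.5
```

For n = 2000 and these assumed constants, the model speeds up until roughly a dozen processors and then slows down again as the β′p term dominates, which is exactly the behavior sketched in Fig. 1.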
[Fig. 1: cost of the broadcasting method versus number of processors]

The second term in Eq. (1) cannot be ignored, and simply increasing the total number of processors cannot guarantee parallel efficiency. Indeed, the total cost will undesirably increase with the growth in the total number of processors through the second term of Eq. (1), as shown in Fig. 1. Hence, the simple broadcasting scheme is taken as the simplest yet poorest one in the later discussion and is used as the comparison base for the other advanced parallel strategies.

The second algorithm is the pipelined algorithm, following the key notion of "pipelining," in which logical topology is well incorporated. In this scheme, every processor knows its logically closest neighbor, and upon receiving the crucial data, a processor always passes it on to the next processor.
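The hand-off pattern just described can be sketched as a toy ring model (our illustration, not the authors' MPI code): the owner of step k's data sends it to its logical neighbor, and every processor forwards the message immediately upon receipt, overlapping transfers with update work.

```python
# Toy model of the pipelined scheme on a logical ring: data prepared on
# the owner of step k reaches processor m after (m - owner) mod p hops,
# and each hop can overlap with the update work U(k) of other steps.

def ring_hops(p, owner):
    """Hop count at which each processor receives the owner's message."""
    return {m: (m - owner) % p for m in range(p)}

def forwarding_order(p, owner):
    """Processors in the order the pipelined message visits them."""
    return [(owner + step) % p for step in range(p)]

hops = ring_hops(8, owner=2)        # e.g. processor 3 gets the data first
order = forwarding_order(8, owner=2)
```

Because the message leaves the owner after a single point-to-point send, the sender is not blocked the way it is in a p-wide broadcast, and successive factorization steps can flow through the ring back to back.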
…be the well-known "cache effect" considering the memory hierarchy of modern CPUs; essentially, the major task in the pipelined algorithm in this paper involves a single vector manipulation, and the amount of data appears to decrease with increasing step number.

Table 3. Pseudocode of the optimized parallel algorithm for the upper triangular system, successfully exploiting column-based cyclic distribution and banded nature (columns: Line | Major tasks)

Table 4 (parallel PCGM main loop; recovered rows):
Calculate r̃_{k+1}: the type of M determines a proper parallel algorithm, r̃_{k+1} = M⁻¹ r_{k+1} (if the stopping criterion is met, exit)
Calculate β_k and then c_{k+1}: β_k = (r̃_{k+1} · r_{k+1}) / (r̃_k · r_k) = (Σ_{m=0}^{p−1} r̃_{k+1}^{(m)} · r_{k+1}^{(m)}) / (Σ_{m=0}^{p−1} r̃_k^{(m)} · r_k^{(m)}); c_{k+1}^{(m)} = r̃_{k+1}^{(m)} + β_k c_k^{(m)}
End of main loop
Note: p = total processors, and superscript (m) denotes local storage on processor P_m.
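The summation rows of Table 4 amount to assembling β_k from per-processor partial inner products. A serial stand-in for the MPI reduction (array and function names are ours) looks like:

```python
# beta_k = (r~_{k+1} . r_{k+1}) / (r~_k . r_k), assembled from the local
# slices held by each processor under column-based cyclic distribution.

def cyclic_slices(v, p):
    """Processor m owns entries m, m + p, m + 2p, ... of vector v."""
    return [v[m::p] for m in range(p)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def distributed_beta(rt_new, r_new, rt_old, r_old, p):
    num = sum(dot(x, y) for x, y in zip(cyclic_slices(rt_new, p),
                                        cyclic_slices(r_new, p)))
    den = sum(dot(x, y) for x, y in zip(cyclic_slices(rt_old, p),
                                        cyclic_slices(r_old, p)))
    return num / den  # the global summation; in MPI this is a reduction
```

The result is independent of p, so the processor count can be changed without affecting the PCGM iterates themselves.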
…multiplication and two inner products, and for β_k one inner product; these fall into embarrassingly parallelizable forms (denoted as summation in Table 4). The only remaining task is the calculation …

Fig. 6. Column-based cyclic allocation on all processors for factorization and solving

The local internal force on each subdomain V_k, k ∈ [1, p−1], is calculated on processor P_k and then passed to the master processor P_0. Finally, the global summation is done on P_0, and afterward P_0 calculates a new unbalance force, determines convergence, and so on:
F_internal^i = Σ_{k=1}^{p−1} F_internal,k^i = Σ_{k=1}^{p−1} [ ∫_{V_k} Bᵀ σ_c^i dV_k + ∫_{V_k} Bᵀ σ_s^i dV_k ]   (3)

where i = iteration step number; F_internal^i = global internal force vector on master node P_0; F_internal,k^i = local internal force vector on processor P_k; p = number of total processors; V_k = subdomain k on processor P_k; B = strain-displacement matrix; and σ_c^i and σ_s^i = current concrete and steel stresses, respectively.

Special attention has been paid to the decomposition of perfectly bonded steels (realized by 3D truss elements in this paper), which might be shared by several subdomains. For steels lying on the boundary between subdomains, the processor with the smaller ID number is assumed to have priority to hold those steels (e.g., a steel shared by subdomains 1 and 2 is assigned to subdomain 1). With these efforts, as shown in Fig. 5, the desired speed-up in the construction of the new internal force through nonlinear element update has been achieved, despite the inclusion of the multidirectional smeared crack model and the nonlinear steel material.

[Fig. 5: speed-up, defined as t4/tp, of the nonlinear element update versus number of processors]

For practical knowledge, it is necessary to touch on effective data management in the parallelization. The key equation to be solved in most parallel explicit programs looks more or less like Eq. (4), with the assumption of lumped mass on nodes (from Danielson and Namburu 1998):

M ü_t = P_t − F_t   (4)

where M = mass = diag(m_ii), i ∈ [1, n]; ü_t = acceleration at time t; P = external force; and F = internal resistance force, possibly including material/geometry nonlinearity.

Because of the lumped mass assumption, it is obvious that Eq. (4) leads to an embarrassingly parallelizable situation involving only vector manipulations. In terms of straightforward domain decomposition, various parallel tools have been developed and utilized: e.g., METIS and its parallel version ParMETIS by Karypis and Kumar (1995a, b) for weighted domain partitioning, which accounts for the imbalance resulting from multiple nonlinear materials and for domain distribution across heterogeneous processors (Sziveri and Topping 2000).

In general, such favorable conditions do not hold for implicit programs such as the one dealt with in this paper. As a successful remedy to the obstacle of global data management, column-based cyclic allocation is exploited. As shown in Fig. 6, each column of the global stiffness matrix is cyclically distributed across all processors, as are the factorized triangular matrices (cf. row-based cyclic allocation, which performs almost equally). Indeed, the cyclic allocation scheme has proved to balance the computation load very effectively (as an extreme case, if the system size is sufficiently large and the stiffness matrix is almost fully populated, the computation cost of each processor asymptotically converges to the same value). However, block-based cyclic allocation is regarded to possess better performance, and thus it shall be a natural extension in the future.

Indexing problems naturally emerge from the cyclic data distribution, being far more complicated than in the serial version. To keep the portability and object-oriented nature of the parallel algorithms, a one-to-one mapping function for the index is a successful tool. In the present platform, an overloaded operator () serves as this function, in which a global term a(i, j) with global indexes i, j ∈ [0, n−1] exactly indicates the corresponding term ã_IJ in the compact storage, where I ∈ [0, 2b−1], b = band width, J ∈ [0, r−1], and r = n/p = number of columns per processor.
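The overloaded operator described above can be reconstructed as a small mapping routine. The sketch below is our interpretation of the stated index ranges (the band-offset convention I = i − j + b is an assumption of ours), not the authors' C++ implementation:

```python
# Map a global banded entry a(i, j), i, j in [0, n-1], to its compact
# counterpart ~a_IJ: the owning processor m = j mod p (column-based
# cyclic allocation, Fig. 6), the local column J = j // p with
# J in [0, r-1] and r = n/p, and a band offset I derived from i - j.

class BandedCyclicIndex:
    def __init__(self, n, p, b):
        self.n, self.p, self.b = n, p, b
        self.r = n // p                     # columns per processor

    def __call__(self, i, j):
        """Return (processor, I, J) for the global term a(i, j)."""
        if abs(i - j) > self.b:
            raise ValueError("entry lies outside the band")
        m = j % self.p                      # cyclic owner of column j
        I = i - j + self.b                  # assumed band-offset convention
        J = j // self.p                     # local column index
        return m, I, J

idx = BandedCyclicIndex(n=16, p=4, b=3)
```

For example, idx(5, 4) places the entry one row below the diagonal of column 4 on processor 0, local column 1; out-of-band indices are rejected, mirroring the fact that only the band is stored.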
…with the uniformly decomposed domain, mainly with respect to solid elements.

Based on the distinct characteristics of the present analysis platform, i.e., local iteration-free material models and a total strain-based smeared crack, the errors that might arise from parallelization appear to be negligible. Indeed, a test simulation with a simple system of 40 concrete elements and 30 perfectly bonded steel bars was conducted up to severely damaged states involving crushing and steel yielding under cyclic loading. Results revealed that the mean square errors (against the 1-CPU case) from the analyses with 8 CPUs and 16 CPUs were 2.0 × 10⁻¹⁶ and 2.04 × 10⁻¹⁴, respectively, whereas the error was about zero with 4 and 6 CPUs.

If any further complicated problems are to be tackled in future extensions, there should be pertinent consideration of dynamic load balance and domain decomposition techniques, as well as of error management in parallel processing, all of which fortunately are well established and available in the literature.

Physical Mechanisms and Randomness in the Entire Domain Made Accessible by the Parallel Platform

To alleviate this pathological nature mainly with a physically plausible remedy, a 3D interlocking mechanism has been proposed. Random particles used in the mechanism were generated from a Gaussian distribution, in accordance with major trends in tribology (e.g., Jackson and Green 2005). At each step, the active contacting areas of each hemisphere-indentation couple yield the tangent shear stiffness, comparable to Walraven's 2D interlocking model (Walraven 1994). By virtue of the parallel platform, the random distribution of ideal particles in an unstructured manner across the whole H-shaped wall system was made possible [Fig. 8(c)].

As shown in Fig. 7(d) for the reinforcing bar, an integrated model has been proposed by the authors: mainly based on the well-known Menegotto and Pinto (1973) steel model for smooth transition, the initiation of compressive buckling by Dhakal and Maekawa (2002), and the strain parameter concept for early buckling in the positive strain regime (Rodriguez et al. 1999). With the parallel platform, the topological transition, defined by the loss of surrounding elements due to crushing or spalling, is queried at each step so as to realistically lengthen the compressive buckling length of a bar, whereas it is simply assumed constant during analysis in most existing research.
Fig. 7. (a) Thorenfeldt compressive model generalized by an un/reloading model; (b) tension softening regime defined on three orthogonal crack surfaces; (c) the fabric of rigid hemisphere-soft indentation proposed by the authors for nonlinear shear across an opened crack; (d) reinforcing steel bar model incorporating compressive buckling
Fig. 8. (a) Dimensional details of the H-shaped wall system; (b) reinforcement layout; (c) random particle distribution across the entire domain from the Gaussian distribution for the 3D interlocking model
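The random particle field of Fig. 8(c) amounts to drawing one Gaussian particle size per site across the whole mesh. A minimal sketch (the mean size, scatter, and truncation at zero are our assumptions; the paper specifies only that the sizes follow a Gaussian distribution):

```python
import random

# One interlocking-particle size per concrete element, drawn from a
# Gaussian and clipped at zero so no site gets a nonphysical diameter.

def particle_field(n_elems, mean_d=1.0, cov=0.25, seed=7):
    """cov = coefficient of variation (std/mean) of the particle size."""
    rng = random.Random(seed)
    return [max(rng.gauss(mean_d, cov * mean_d), 0.0)
            for _ in range(n_elems)]

sizes = particle_field(7784)   # one entry per hexahedral concrete element
```

Fixing the seed keeps a parallel run reproducible: every processor can regenerate the sizes of the elements in its own subdomain without any communication.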
The dimensional details, reinforcement layout, and random particle distribution across the entire domain are provided in Fig. 8. In both web and flange, the concrete strength is 21.7 MPa and the strain at the peak is 0.00204; all steel bars are D6 type (7 mm diameter), and the yielding stress and the corresponding strain are 605 MPa and 0.00318, respectively. A total of 7,784 linear hexahedral elements for concrete and 4,692 space truss elements for perfectly bonded reinforcing bars were generated for the finite-element modeling, leading to 11,212 nodes in total.

From the first excursion of the displacement loading, diagonal cracks were initiated. They eventually extended over the full height of the web part at approximately 45° as the cyclic loading proceeded. These diagonal cracks were widely distributed on the web, the crack directions remained unchanged during the entire experiment, and they were effectively captured by the multidirectional smeared crack model in the present program.

It was observed during the experiment that the complete realignment of the previously opened crack surfaces became impossible, and consequently grinding between the crack surfaces took place. This can be regarded as a sort of interlocking. Crushing at the intersection zones of diagonal cracks and severe grinding were believed to be the major factors that caused such localized damage with vertical directivity. Since the present 3D interlocking model utilizes a random particle distribution over the entire domain, it was possible to predict the localized damage with vertical directivity, as emphasized in Fig. 9(b). The 3D interlocking also appears to capture the irrecoverable bulging phenomena in the out-of-plane direction under cyclic loading. Indeed, the randomness of material properties appears to serve as an essential factor in bringing about such localized damage and failure, as similarly identified …

[Figure: normalized cost T(p) of nonlinear element update and factorization versus number of processors]
Fig. 9. (a) Resultant force-displacement response comparison; (b) deformed shape (amplified) revealing localized damage on web with vertical
directivity marked by dashed line; (c) ultimate damage at the end of experiment (Palermo and Vecchio 2002, with permission from ACI Publishing)
The authors are also deeply grateful to Professor S. Krishnan for his consistent support for parallel simulations.
Concluding Remarks

As demonstrated so far, parallelization appears to be essentially problem-dependent: how well the method embraces the key features of the problem under consideration determines the ultimate parallel efficiency. In the quantitative comparison among representative parallel strategies, particularly in terms of factorization, some practical knowledge was attained: (1) the performance of the "broadcasting" strategy tends to deteriorate beyond a certain range of total processors; (2) contrary to anticipation, the most advanced "look-ahead" strategy appears to exhibit poor performance with a small number of processors; and (3) only the "pipelined" strategy reveals overall stable performance. Optimization of the pipelined factorization was then carried out, successfully taking advantage of the penalty method and the banded nature of the system. Because the penalty method imparts remarkable smoothness to the global response, the parallel triangular system solving was able to achieve practical superiority over advanced parallel solvers such as the parallel PCGM, as confirmed by quantitative comparison.

The implemented "divide-and-conquer" approach for all other embarrassingly parallelizable tasks performs favorably on the master-slaves concept after nonoverlapping uniform domain decomposition. Especially for a moderate-size RC structure, the master-slaves approach enables the nonlinear element update to be done without any intercommunication between slave processors, leading to clear scalability. By this successful parallelization of the nonlinear element update procedure, the developed parallel platform could be imbued with a multitude of physical mechanisms to describe progressive and localized damage phenomena at the entire system level. It should be stressed, however, that the platform shall harmonize with reliable parallel libraries, e.g., a parallel sparse matrix solver and a dynamic load balance scheme, in future research to achieve general applicability.

By filling the gap between microscopic physics and global degradation with localization, the parallel platform offers unprecedented access to physics-based mechanisms (e.g., the multidirectional smeared crack model, the 3D interlocking model, and nonlinear steel with evolving buckling length) and even to probabilistic randomness at the entire system level. Indeed, the random distribution of a crucial mechanical parameter (i.e., the interlocking particle size in this paper) across the entire domain appears to be essential for the irrecoverable localization, as reflected in the application to a real-scale RC structure. Equipped with accuracy, stability, and scalability, the implicit nonlinear FEA program in its parallel version is believed to serve as a fertile ground for the introduction of further physical mechanisms or further sophistication of the probabilistic material properties across the entire domain.

References

Andrade, J. E., Baker, J. W., and Ellison, K. C. (2007). "Random porosity fields and their influence on the stability of granular media." Int. J. Numer. Anal. Methods Geomech., 32(10), 1147–1172.
Casanova, H., Legrand, A., and Robert, Y. (2009). Parallel algorithms, CRC Press, Boca Raton, FL.
Cheng, F. Y., Mertz, G. E., Sheu, M. S., and Ger, J. F. (1993). "Computed versus observed inelastic seismic low-rise RC shear walls." J. Struct. Eng., 119(11), 3255–3275.
Colotti, V. (1993). "Shear behavior of RC structural walls." J. Struct. Eng., 119(3), 728–746.
Crisfield, M. A., and Wills, J. (1989). "Analysis of R/C panels using different concrete models." J. Eng. Mech., 115(3), 578–597.
Danielson, K. T., Akers, S. A., O'Daniel, J. L., Adley, M. D., and Garner, S. B. (2008). "Large-scale parallel computation methodologies for highly nonlinear concrete and soil applications." J. Comput. Civ. Eng., 22(2), 140–146.
Danielson, K. T., and Namburu, R. R. (1998). "Nonlinear dynamic finite element analysis on parallel computers using FORTRAN 90 and MPI." Adv. Eng. Softw., 29(3–6), 179–186.
DeGroot, A. J., Sherwood, R. J., Badders, D. C., and Hoover, C. G. (1997). "Parallel contact algorithms for explicit finite element analysis (DYNA3D)." Proc., 4th U.S. National Congress on Computational Mechanics, Austin, TX.
Dhakal, R., and Maekawa, K. (2002). "Modeling for postyielding buckling of reinforcement." J. Struct. Eng., 128(9), 1139–1147.
Hoover, C. G., DeGroot, A. J., Maltby, J. D., and Procassini, R. J. (1995). "ParaDyn-DYNA3D for massively parallel computers." Engineering, Research, Development and Technology FY94, UCRL 53868-94, Lawrence Livermore National Laboratory, Livermore, CA.
Jackson, R. L., and Green, I. (2005). "A statistical model of elasto-plastic asperity contact between rough surfaces." Tribol. Int., 39(9), 906–914.
Karniadakis, G. E., and Kirby, R. M., II. (2003). Parallel scientific computing in C++ and MPI (a seamless approach to parallel algorithms and their implementation), Cambridge University Press, Cambridge, UK.
Karypis, G., and Kumar, V. (1995a). "A fast and high quality multilevel scheme for partitioning irregular graphs." Technical Rep. No. TR 95-035, Dept. of Computer Science, Univ. of Minnesota, Minneapolis.
Karypis, G., and Kumar, V. (1995b). "A parallel algorithm for multilevel graph partitioning and sparse matrix ordering." Technical Rep. No. TR 95-036, Dept. of Computer Science, Univ. of Minnesota, Minneapolis.
Kim, S. W., and Vecchio, F. J. (2008). "Modeling of shear-critical reinforced concrete structures repaired with fiber-reinforced polymer composites." J. Struct. Eng., 134(8), 1288–1299.
Menegotto, M., and Pinto, P. (1973). "Method of analysis of cyclically loaded RC plane frames including changes in geometry and nonelastic […]"
[…] "reinforced concrete solid." Can. J. Civ. Eng., 24(3), 460–470.
Shahinpoor, M. (1980). "Statistical mechanical considerations on the random packing of granular materials." Powder Technol., 25(2), 163–176.
Sotelino, E. D. (2003). "Parallel processing techniques in structural engineering applications." J. Struct. Eng., 129(12), 1698–1706.
Spacone, E., and El-Tawil, S. (2004). "Nonlinear analysis of steel-concrete […]"
[…] "theory for reinforced concrete elements subjected to shear." J. Am. Concr. Inst., 83(2), 219–231.
Walraven, J. (1994). "Rough cracks subjected to earthquake loading." J. Struct. Eng., 120(5), 1510–1524.
Wilkinson, B., and Allen, M. (1999). Parallel programming, Prentice Hall, Upper Saddle River, NJ.