
A Fast Parallel Cholesky Decomposition Algorithm for Tridiagonal Symmetric Matrices

Ilan Bar-On†    Bruno Codenotti‡    Mauro Leoncini§

Abstract

In this paper we present a new parallel algorithm for computing the LL^T decomposition of real symmetric positive definite tridiagonal matrices. The algorithm consists of a preprocessing and a factoring stage. In the preprocessing stage it determines a rank-(p-1) correction to the original matrix (p = number of processors) by precomputing selected components x_k of the L factor, k = 1,...,p-1. In the factoring stage it performs p independent factorizations of matrices of order n/p, n being the order of the matrix. The algorithm is especially suited for machines with both vector and processor parallelism, as confirmed by the experiments carried out on a Connection Machine CM5 with 32 nodes. Let x̂_k and x̂'_k denote the components computed in the preprocessing stage and the corresponding values (re)computed in the factorization stage, respectively. Assuming that the ratios x̂_k/x̂'_k, k = 1,...,p-1, are close to 1, we are able to prove that the algorithm is stable in the backward sense. The above assumption is justified both experimentally and theoretically. In fact we found experimentally that these ratios are close to 1 even for ill-conditioned matrices, and we have proved by an a priori analysis that they remain close to 1 provided that preprocessing is performed with suitably larger precision.

* Research produced with the help of the National Science Foundation, Infrastructure Grant number CDA-8722788, and the Cooperation Agreement CNR-MOSA.
† Department of Computer Science, Technion, Haifa 32000, Israel. Author's present address: Science and Mathematics, University of Texas of the Permian Basin, 4901 E. University, Odessa, Texas, 79762, U.S.A. Email: baron_i@gusher.utpb.edu
‡ Istituto di Matematica Computazionale del CNR, Via S. Maria 46, 56126 Pisa, Italy. Supported by the ESPRIT III Basic Research Programme of the EC under contract No. 9072 (Project GEPPCOM).
§ Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy. Supported by the ESPRIT III Basic Research Programme of the EC under contract No. 9072 (Project GEPPCOM), and by M.U.R.S.T. 40% funds.

Keywords: Parallel algorithm, Cholesky decomposition, LR and QR algorithms, eigenvalues, symmetric, tridiagonal, and band matrices, CM5.
AMS(MOS) Subject Classification: 15A18, 15A23, 65F05, 65F15, 65Y05.

1 Introduction
We consider the problem of computing the Cholesky decomposition of very large real symmetric positive definite tridiagonal matrices. Cholesky decomposition is a valuable tool in many diagonalization techniques for computing eigenvalues and singular values of matrices. Rutishauser's cubically convergent LR algorithm is based on the iterative application of Cholesky decomposition [21]. The divide and conquer approach can also be combined with it [4, 5]. More recently, the Cholesky decomposition, or one of its variants, has been used in connection with the accurate computation of the singular values of bidiagonal matrices [11, 15], and of the eigenvalues of specially structured symmetric tridiagonal matrices [9]. Moreover, it has been shown that Francis' QR algorithm (see [16, 17]) can be implemented using a band Cholesky decomposition [3]. Cholesky decomposition, followed by the parallel solution of the respective bidiagonal systems [7], is one of the most natural approaches to the solution of positive definite linear systems [2, 12, 23, 24, 26], and as such has received a great deal of attention.

The classical sequential algorithms for computing the Cholesky decomposition cannot be efficiently parallelized, nor directly vectorized. It is thus natural to seek an algorithm directly amenable to an efficient parallel implementation. In this paper we introduce a new algorithm which borrows ideas from the substructured parallel cyclic reduction algorithm for the solution of tridiagonal systems [19, 28]. Parallel cyclic reduction consists of three stages:

1. (Almost) local forward and backward Gaussian Elimination steps. During this stage only one communication is required, usually with an adjacent processor.
2. Solution of a reduced system with one equation per processor.
3. Local backsubstitution.

Our algorithm consists of three stages as well. Let A be an N x N tridiagonal matrix and assume for simplicity that N = np, where p is the number of available processors. Also, viewing A as block partitioned, let T_i denote its n x n diagonal blocks, i = 1,...,p. Finally, let L be the Cholesky factor of A, and let x_i, i = 1,...,N, denote its diagonal elements. The stages are as follows.

1. Local forward and backward Gaussian Elimination steps. This stage, which requires no communication, returns a reduced tridiagonal matrix B of order 2p-3.
2. Computation of x_{n(i-1)}, i = 2,...,p, by applying suitable transformations to B. This is the only stage which requires (tree-like structured) communication between the processors.
3. Local factorization of the p n x n matrices T'_i, where T'_1 = T_1 and T'_i, i = 2,...,p, is a rank-one update (involving x_{n(i-1)}) of T_i.

We refer to 1) and 2) together as the preprocessing stage. The time complexity of our algorithm is about 8 N/p + 15 log p if p processors are available. If log p << N/p the complexity is governed by the factor 8n. Under this circumstance, the parallel algorithm requires about 4 times the number of flops of the classical sequential algorithm, with a (theoretical) speedup close to p/4.

We show that our parallel algorithm is also computationally efficient in practice. We report the results obtained on a Connection Machine CM5 supercomputer with 32 nodes and 128 vector units altogether [27]. We achieve very satisfactory performance on large matrices (say N of the order of 2^24). For smaller size matrices, very good performance can still be obtained by appropriately scaling down the number of processors involved.

A natural competitor with our algorithm is the recursive doubling algorithm for the LU decomposition of tridiagonal matrices [24]. When recursive doubling is used in the LR algorithm (and to compute the Cholesky rather than the LU decomposition) it achieves parallel complexity roughly 12 log N, using an unbounded (i.e. linear in N) number of processors [25]. In the more realistic case of p << N, and using parallel prefix instead of recursive doubling, the parallel time complexity becomes roughly 27 N/p, which is more than 3 times larger than ours.

Cholesky decomposition is componentwise stable, and the variant presented here retains this property. With respect to the classical algorithm, the backward error affecting the coefficient matrix is further influenced, in the first diagonal entry of each block T_i, by the factor |x̂_{n(i-1)}/x̂'_{n(i-1)}|, i = 2,...,p. Here x̂_{n(i-1)} and x̂'_{n(i-1)} denote the components of the L factor computed in the preprocessing and recomputed in the actual factorization stage, respectively. We find experimentally that these factors are close to 1 even for ill-conditioned matrices. We also prove, by an a priori analysis, that they stay close to 1 provided that preprocessing is performed with suitably larger precision.

This paper is organized as follows. In Section 2 we define concepts and notations used throughout the rest of the paper. In Section 3 we review the LR algorithm for computing the eigenvalues of symmetric tridiagonal matrices. This provides the motivation for the development of a parallel algorithm for computing the Cholesky decomposition of such matrices. In Section 4 we describe a sequential algorithm that computes the Cholesky factors and discuss its implementation, computational cost, and numerical accuracy. In Section 5 we describe the parallel algorithm, providing details of the preprocessing stage, and in Section 6 we analyze its computational cost and suitability to vectorization. In Section 7 we present the experimental results obtained on the CM5, and in Section 8 we present the error analysis which shows the numerical accuracy of the algorithm. The technical details of the a priori analysis are in the Appendix. We conclude with some suggestions for further work.

2 Definitions and Main Notations

We denote by R^n the set of real vectors of order n and by e_i the n-vector whose entries are all zero except the ith one, which is 1. When needed, we emphasize that a particular vector e_i is in R^n by writing e_i^(n). We denote by M(n) the set of real n x n matrices, and by A^T the transpose of A. We denote a tridiagonal symmetric matrix T in M(n) by

        ( a_1  b_2                 )
        ( b_2  a_2  b_3            )
    T = (      b_3  ...  ...       )        (1)
        (           ...  ...  b_n  )
        (                b_n  a_n  )
In this paper we assume that T is unreduced, that is, b_i != 0, i = 2,...,n. We say that a nonsingular matrix P in M(m) is a cyclic transformation if

    P = ( H    0 )
        ( h^T  1 ),    H in M(m-1),  h in R^(m-1).

Note that, by this definition, H is nonsingular as well.

We say that the computation of the Cholesky decomposition of a matrix A is componentwise stable if the computed Cholesky factors are the exact decomposition of a small componentwise perturbation of A.

We measure the time complexity of a sequential algorithm by counting the number of flops, i.e., floating point operations. We also refer to the flop count as the number of (arithmetic) steps. The time complexity of a parallel algorithm implemented on a p processor machine is the maximum, over the p processors, of the number of steps performed. We refer to this measure as the number of parallel steps. The speedup of a parallel algorithm A over a sequential algorithm B is the ratio

    S_p(n) = T_B(n) / T_{A,p}(n),

where T_B(n) is the (time) complexity of B on inputs of size n and T_{A,p}(n) is the complexity of A on inputs of size n with p processors. Obviously, for any parallel algorithm there is some sequential algorithm for which S_p(n) <= p, for otherwise a sequential simulation of the parallel algorithm would beat the (supposedly) best known sequential one. However, in this paper we are interested in comparing the running time of the parallel algorithm with that of the classical sequential method. Hence, we may obtain superlinear speedups due to a more efficient use of the architecture resources, namely, data transmission and vectorization.

3 An overview of the LR algorithm


The LR algorithm developed by Rutishauser was termed by Wilkinson "the most significant advance which has been made in connection with the eigenvalue problem since the advent of automatic computers" (see [29], p. 485). This algorithm is very simple and efficient and computes the eigenvalues of tridiagonal symmetric matrices with a cubic rate of convergence.

The LR algorithm iteratively computes a sequence of tridiagonal matrices that gradually converge to a diagonal matrix with the same eigenvalues. Starting with the original matrix A_0 = A and with eig = 0, for s = 0, 1, ..., the sth step consists of the following stages:

1. choose an appropriate shift y_s;
2. find the Cholesky decomposition of B_s = A_s - y_s I = L_s L_s^T;
3. set A_{s+1} = L_s^T L_s and eig = eig + y_s.

As soon as the last off-diagonal element becomes negligible, eig is a new exposed eigenvalue. It is easy to see that the third stage of this algorithm can be efficiently parallelized. In addition, after a few steps, the shifts y_s in the first stage can be read off the last diagonal element of the matrix (see Rutishauser and Schwarz [22]). It follows that the main difficulty in implementing the LR algorithm on a parallel machine lies in the Cholesky decomposition. This is one major motivation to focus our attention on the development of an efficient parallel implementation of Cholesky decomposition. For further discussions on the LR algorithm the reader is encouraged to see Wilkinson [29], Parlett [20], Grad and Zakrajsek [18], and Bar-On [3].
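The three stages above can be sketched in a few lines. The following is an illustrative (purely sequential) NumPy version using a dense Cholesky factorization; the shift strategy, which simply stays safely below the current smallest eigenvalue, is a placeholder for the read-off rule mentioned above:

```python
import numpy as np

def lr_step(A, shift):
    # one LR step: factor A - shift*I = L L^T, then recombine as A' = L^T L
    L = np.linalg.cholesky(A - shift * np.eye(A.shape[0]))
    return L.T @ L

# 4x4 symmetric tridiagonal Toeplitz matrix: diagonal 2, off-diagonal 1
n = 4
A = 2 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)

eig = 0.0
for s in range(100):
    # illustrative shift, kept safely below the current smallest eigenvalue
    # (the paper reads the shift off the last diagonal entry instead)
    y = 0.9 * np.linalg.eigvalsh(A)[0]
    A = lr_step(A, y)
    eig += y
    if abs(A[-1, -2]) < 1e-13:      # last off-diagonal element negligible
        break

lam_min = eig + A[-1, -1]           # exposed eigenvalue of the original matrix
```

For the matrix above, lam_min converges to 2 - 2 cos(pi/5), the smallest eigenvalue of the original matrix; the other eigenvalues are exposed by deflating and repeating.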

4 Cholesky decomposition
In this section we describe a sequential algorithm to compute the Cholesky decomposition of a symmetric tridiagonal matrix which is particularly suitable to implement the LR algorithm, and we analyze its computational and numerical properties. Consider the Cholesky decomposition stage in the LR algorithm described in Section 3, and let (1) be the matrix to be factored. We have T = L L^T, where

        ( d_1                    )
        ( y_2  d_2               )
    L = (      y_3  ...          )        (2)
        (           ...  d_{n-1} )
        (            y_n    d_n  )

is lower bidiagonal. Instead of computing the decomposition (2), and taking into account that this process must be repeatedly applied over the LR iterations, we compute the quantities x_i and z_i using the following recurrences:

    z_i = b_i^2 / x_{i-1},   x_i = a_i - z_i,   i = 1,...,n,        (3)

with x_0 = 1 (and b_1 = 0). Note that in recurrences (3) we only use the a_i's and the b_i^2's (rather than the b_i's). It can be easily proved by induction that x_i = d_i^2 and z_i = y_i^2. Now, if we set

            ( f_1  g_1                    )
            ( g_1  f_2  g_2               )
    L^T L = (      g_2  ...  ...          )
            (           ...  ...  g_{n-1} )
            (           g_{n-1}     f_n   )

then we can efficiently compute the quantities f_i and g_i as follows:

    g_i^2 = x_i z_{i+1},   f_i = x_i + z_{i+1},   i = 1,...,n,

with z_{n+1} = 0. This process can therefore be iterated. If needed, the elements of the matrix (implicitly) generated at the sth step of the LR algorithm can be easily recovered. By using this variant of the Cholesky decomposition, which we call the revised decomposition, we avoid the computation of square roots.
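Recurrences (3) translate directly into code. A minimal sketch (NumPy; the function name is ours), checked against a dense Cholesky factorization:

```python
import numpy as np

def revised_cholesky(a, b2):
    # recurrences (3): z_i = b_i^2 / x_{i-1}, x_i = a_i - z_i, with x_0 = 1.
    # a[i] is the diagonal, b2[i] the *squared* off-diagonal (b2[0] = 0).
    n = len(a)
    x, z = np.empty(n), np.empty(n)
    xprev = 1.0
    for i in range(n):
        z[i] = b2[i] / xprev
        x[i] = a[i] - z[i]
        xprev = x[i]
    return x, z        # x_i = d_i^2, z_i = y_i^2: no square roots needed

# check on the Toeplitz matrix with diagonal 2 and off-diagonal -1
n = 6
a = np.full(n, 2.0)
b2 = np.ones(n); b2[0] = 0.0          # squares of the off-diagonal entries
x, z = revised_cholesky(a, b2)

T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L = np.linalg.cholesky(T)
assert np.allclose(x, np.diag(L) ** 2)
assert np.allclose(z[1:], np.diag(L, k=-1) ** 2)
```

Note that only the squares b_i^2 enter the recurrence, so the sign of the off-diagonal entries is irrelevant, as expected from (3).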

Complexity. The purpose of this paragraph is to point out the rather poor performance that one gets by using both the classical and the revised Cholesky decompositions on sequential computers. Table 1 shows the running times, observed on a DEC Alpha 7000 Model 660 Super Scalar machine, of the following routines: the BLAS routine "dgemm", which performs matrix multiplication; the LAPACK routines "dpotrf" and "dpbtrf" [1], which perform the Cholesky decomposition on dense and tridiagonal matrices, respectively; the private routine "trid", which computes the above revised decomposition. The revised decomposition is more efficient than the classical one, due primarily to the absence of square root computations. However, the Mflops column puts into evidence that it is still very inefficient with respect to the dense computations dgemm and dpotrf, mainly because of the low number of flops per memory reference.

    Routine    n        Flops    Time    Mflops
    dgemm      400      2n^3     0.95    135.48
    dpotrf     600      n^3/3    0.99     72.11
    dpbtrf     200000   2n       1.01      0.39
    trid       200000   2n       0.08      5.00

    Table 1: LAPACK Computational Routines

Numerical Stability. Cholesky decomposition is componentwise stable, and this variant retains this property. Usually the entries of the given matrix are known up to some perturbation, so that it is very useful to investigate the "structure" of the perturbations introduced by rounding. To show this, let us denote the computed value of a by â = fl(a), and assume that the standard operations satisfy

    fl(a op b) = (a op b)(1 + eps),   |eps| <= u,

where op stands for +, -, *, or /, and u is the machine relative precision. For example, u is roughly 10^-16 in standard double precision. Then the actual computation of the decomposition can be formulated as follows:

    ẑ_i = c_i(1 + delta_i)/x̂_{i-1} = ĉ_i/x̂_{i-1},   i = 1,...,n,
    x̂_i = (a_i - ẑ_i)(1 + delta'_i) = â_i - ẑ_i,

with c_i = b_i^2 and â_i = a_i(1 + delta'_i); the factor (1 + delta'_i) multiplying ẑ_i can be absorbed into ĉ_i, so that |delta_i| <= u and the relative perturbations of a_i and c_i are bounded by u and about 2u, respectively. For the classical error bounds for Cholesky decomposition see [30] and [14].
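The floating point model above is easy to check empirically. In the sketch below, single precision plays the role of the working precision (u is about 6e-8) and double precision serves as the reference for the exact result:

```python
import numpy as np

# empirical check of fl(a op b) = (a op b)(1 + eps), |eps| <= u,
# with single precision as the working precision and double as reference
u = np.float64(np.finfo(np.float32).eps) / 2     # unit roundoff, about 6e-8
rng = np.random.default_rng(0)
ops = (np.add, np.subtract, np.multiply, np.divide)
for _ in range(1000):
    a, b = (np.float32(t) for t in rng.uniform(0.1, 10.0, 2))
    for op in ops:
        exact = op(np.float64(a), np.float64(b))     # reference value
        computed = np.float64(op(a, b))              # rounded to single
        # correctly rounded IEEE arithmetic satisfies the model
        assert abs(computed - exact) <= u * abs(exact) * (1 + 1e-6)
```

The tiny extra factor in the assertion only accounts for the fact that the double precision reference is itself rounded for the division operator.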

5 Parallel Cholesky decomposition

5.1 Mathematical formulation

Let T in M(n) be the unreduced symmetric tridiagonal matrix in (1). In block notation, T can be written as

        ( T_1   U_2^T                     )
        ( U_2   T_2    U_3^T              )
    T = (       U_3    ...   ...          )
        (              ...   ...   U_q^T  )
        (                    U_q     T_q  ),

where T_i in M(n_i) and

    U_{i+1} = b_{m_i+1} e_1^(n_{i+1}) (e_{n_i}^(n_i))^T in M(n_{i+1}, n_i),    m_i = sum_{j=1}^{i} n_j,

and the Cholesky factor of the decomposition in (2) as

        ( L_1                 )
        ( R_2   L_2           )
    L = (       ...   ...     ),    L_i in M(n_i),    R_{i+1} = y_{m_i+1} e_1^(n_{i+1}) (e_{n_i}^(n_i))^T.
        (          R_q   L_q  )

By equating L L^T and T we obtain

    L_1 L_1^T = T_1 = T'_1,

and

    L_k L_k^T = T_k - R_k R_k^T = T_k - y_{m_{k-1}+1}^2 e_1^(n_k) (e_1^(n_k))^T = T'_k,    k = 2,...,q.

Our parallel algorithm precomputes the "perturbations" y_{m_{k-1}+1}^2, then applies the transformation

    a'_{m_{k-1}+1} = a_{m_{k-1}+1} - y_{m_{k-1}+1}^2,    k = 2,...,q,

thus reducing the computation of the Cholesky decomposition of T to q independent instances of the same problem, i.e., the computation of the Cholesky factors of T'_1,...,T'_q.

We now show some preliminary facts about these perturbations, which we will later use to prove the correctness of our parallel algorithm. For i = 1,...,n, let T^(i) denote the leading principal submatrix of T of order i,

            ( a_1  b_2                 )
            ( b_2  a_2  b_3            )
    T^(i) = (      b_3  ...  ...       )
            (           ...  ...  b_i  )
            (                b_i  a_i  ).

Then it easily follows that

    y_{m_k+1}^2 = b_{m_k+1}^2 (e_{m_k})^T (T^(m_k))^{-1} e_{m_k}.

Actually, our parallel algorithm does not explicitly compute the perturbation y_{m_k+1}^2 of the first diagonal element of the block T_{k+1}. Instead it computes the quantity

    x_{m_k} = a'_{m_k} = a_{m_k} - b_{m_k}^2 (e_{m_k-1})^T (T^(m_k-1))^{-1} e_{m_k-1}        (4)

(i.e. a perturbed last element of the preceding block), and then obtains

    a'_{m_k+1} = a_{m_k+1} - b_{m_k+1}^2 / x_{m_k}

by using the recurrences (3). The perturbation originally sought can be expressed, in terms of the computed quantity, as y_{m_k+1}^2 = b_{m_k+1}^2 / x_{m_k}.

Lemma 1 Let P_i, i = 1,...,j, be a sequence of cyclic transformations. Then

    P = P_j ... P_2 P_1 = prod_{i=1}^{j} ( H_i    0 )  =  ( H    0 )
                                         ( h_i^T  1 )     ( h^T  1 )

is a cyclic transformation.
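Lemma 1 is easy to sanity-check numerically. An illustrative NumPy sketch, with a hypothetical helper `cyclic` that assembles a cyclic transformation from H and h:

```python
import numpy as np

def cyclic(H, h):
    # assemble P = [[H, 0], [h^T, 1]] from H in M(m-1) and h in R^(m-1)
    m = H.shape[0] + 1
    P = np.zeros((m, m))
    P[:-1, :-1] = H
    P[-1, :-1] = h
    P[-1, -1] = 1.0
    return P

rng = np.random.default_rng(0)
m = 5
P1 = cyclic(rng.standard_normal((m - 1, m - 1)), rng.standard_normal(m - 1))
P2 = cyclic(rng.standard_normal((m - 1, m - 1)), rng.standard_normal(m - 1))
P = P2 @ P1

# the product keeps the cyclic form: zero last column above a trailing 1,
# and the leading block is simply the product of the leading blocks
assert np.allclose(P[:-1, -1], 0.0)
assert np.isclose(P[-1, -1], 1.0)
assert np.allclose(P[:-1, :-1], P2[:-1, :-1] @ P1[:-1, :-1])
```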

Lemma 2 Let P^(m_k) be a cyclic transformation such that

    P^(m_k) T^(m_k) = ( H_k    0 ) ( T^(m_k-1)            b_{m_k} e_{m_k-1} )   ( T̃^(m_k-1)   *       )
                      ( h_k^T  1 ) ( b_{m_k} e_{m_k-1}^T  a_{m_k}           ) = (     0      ã_{m_k} ).    (5)

Then we have

    ã_{m_k} = a_{m_k} + b_{m_k} h_k^T e_{m_k-1} = x_{m_k}.    (6)

Proof: From the second equality in (5), we have T^(m_k-1) h_k = -b_{m_k} e_{m_k-1}, hence h_k = -b_{m_k} (T^(m_k-1))^{-1} e_{m_k-1}, and (6) follows from (4).

In the preprocessing stage of our parallel algorithm we apply a sequence of parallel cyclic transformations to obtain the values x_{m_k}, k = 1,...,q-1, called pivots.
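A small numerical check of Lemma 2 (NumPy sketch, values chosen to make the matrix positive definite): the cyclic transformation that annihilates the last off-diagonal element of T^(m) produces exactly the pivot x_m of the recurrences (3).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
a = 2.0 + rng.uniform(0.0, 1.0, m)     # diagonal a_1..a_m (diagonally dominant)
b = rng.uniform(0.5, 1.0, m)           # off-diagonals b_2..b_m (b[0] unused)

# h = -b_m * inv(T^(m-1)) e_{m-1}, as in the proof of Lemma 2
Tm1 = np.diag(a[:-1])
for i in range(1, m - 1):
    Tm1[i, i - 1] = Tm1[i - 1, i] = b[i]
h = -b[-1] * np.linalg.solve(Tm1, np.eye(m - 1)[:, -1])

# (6): a~_m = a_m + b_m h^T e_{m-1}
atilde = a[-1] + b[-1] * h[-1]

# pivot x_m from the recurrences (3)
x = 1.0
for i in range(m):
    x = a[i] - (b[i] ** 2 / x if i > 0 else 0.0)

assert np.isclose(atilde, x)
```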

5.2 The algorithm

We assume for simplicity that the tridiagonal matrix is of order N = np, with p being the number of processors. We initially distribute the entries of the matrix among the processors, so that each processor stores n consecutive rows. We denote these blocks of rows by

          ( b_{(i-1)n+1}  a_{(i-1)n+1}  b_{(i-1)n+2}             )
    B_i = (     ...            ...           ...                 ) in M(n, n+2),
          (                b_{in}       a_{in}       b_{in+1}    )

for i = 1,...,p. Our Parallel Cholesky algorithm consists of three stages.

(i) Diagonalization, (ii) Bottom-Up and Top-Down Sweeps, (iii) Factorization.


In stage (i) each processor performs locally O(n) parallel steps independently. In stage (ii) the processors perform O(log p) operations which require interprocessor communication. Finally, in stage (iii) each processor performs O(n) parallel steps independently. Altogether, the number of parallel steps is O(n + log p).

Stage (i): Diagonalization. Let

        ( b_1  a_1  z^T           )
    B = (  z    A    c            ) in M(n, n+2)        (7)
        (       c^T  a_n  b_{n+1} )

denote the block assigned to a generic processor, where A is a tridiagonal matrix of order n-2, z = b_2 e_1^(n-2) and c = b_n e_{n-2}^(n-2). Each processor i, 1 < i < p, performs a forward Gaussian elimination procedure to eliminate b_n in the last row, and then a backward Gaussian elimination procedure to eliminate b_2 in the first row. In matrix notation this amounts to applying a cyclic transformation

              ( 1  f^T    )
    B' = PB = (     I     ) B,    f = -b_2 A^{-1} e_1^(n-2),   g = -b_n A^{-1} e_{n-2}^(n-2),
              (    g^T  1 )

so that

         ( b_1   v   0 ... 0   y       )
    B' = (  z         A        c       )        (8)
         (  y   0 ... 0   w   b_{n+1}  ),

since the two fill-in entries coincide (y = b'_2 = b'_n) by symmetry. Processor 1 performs the forward Gaussian elimination step only, to eliminate b_n in the last row, obtaining

    B' = (      A          c       ) in M(n, n+1),
         ( 0 ... 0    w   b_{n+1}  )

and processor p remains idle. By the end of stage (i), the in-between rows and columns do not further contribute to the search for the pivots x_{m_k}, and they can be ignored. We now consider the matrix T^(0), of order 2p-3, formed using the relevant elements from the blocks B' computed by processors 1 through p-1. Note that processor 1 contributes one row, while processor i, 1 < i < p, contributes two rows, i.e.,

            ( w_1^(0)   b_2^(0)                                           )
            ( b_2^(0)   v_2^(0)   y_2^(0)                                 )
    T^(0) = (           y_2^(0)   w_2^(0)   b_3^(0)                       )
            (                       ...                                   )
            (                 b_{p-1}^(0)   v_{p-1}^(0)   y_{p-1}^(0)     )
            (                               y_{p-1}^(0)   w_{p-1}^(0)     ),

where b_i^(0) = b_{(i-1)n+1}. Processor i = 2,...,p-1 stores the submatrix

    T_i^(0) = ( b_i^(0)  v_i^(0)  y_i^(0)                )
              (          y_i^(0)  w_i^(0)  b_{i+1}^(0)   ),

while processor 1 stores

    X_1^(0) = ( w_1^(0)  b_2^(0) ) = ( x_1^(0)  b_2^(0) ).

Stage (ii): Bottom-Up and Top-Down Sweeps. This stage consists of two sequences of cyclic transformations, called Bottom-Up and Top-Down sweeps, that involve the submatrices stored in the different processors according to a tree-like pattern. Each sweep corresponds to the merging of two submatrices, with the generation of a new submatrix using the extreme rows.

Bottom-Up sweeps are performed as follows. For s = 1,...,log p - 1 and i = 2,...,p/2^s - 1, first merge the matrices T_{2i-1}^(s-1) and T_{2i}^(s-1),

    ( T_{2i-1}^(s-1) )   ( b_{2i-1}^(s-1)  v_{2i-1}^(s-1)  y_{2i-1}^(s-1)                                         )
    (                ) = (                 y_{2i-1}^(s-1)  w_{2i-1}^(s-1)  b_{2i}^(s-1)                           )
    ( T_{2i}^(s-1)   )   (                                 b_{2i}^(s-1)    v_{2i}^(s-1)   y_{2i}^(s-1)            )
                         (                                                 y_{2i}^(s-1)   w_{2i}^(s-1)   b_{2i+1}^(s-1) )

and then eliminate y_{2i-1}^(s-1) in the top row and y_{2i}^(s-1) in the bottom row, by applying a cyclic transformation P_i^(s); the extreme rows take the form (b_{2i-1}^(s-1), v_i^(s), 0, 0, y_i^(s)) and (y_i^(s), 0, 0, w_i^(s), b_{2i+1}^(s-1)). Finally, form the matrix T_i^(s) using the extreme rows:

    T_i^(s) = ( b_i^(s)  v_i^(s)  y_i^(s)              )
              (          y_i^(s)  w_i^(s)  b_{i+1}^(s) ),    b_i^(s) = b_{2i-1}^(s-1),   b_{i+1}^(s) = b_{2i+1}^(s-1).

For i = 1 the merging operation involves only three rows,

    ( X_1^(s-1) )   ( x_1^(s-1)  b_2^(s-1)                                  )
    (           ) = (            b_2^(s-1)  v_2^(s-1)  y_2^(s-1)            )
    ( T_2^(s-1) )   (                       y_2^(s-1)  w_2^(s-1)  b_3^(s-1) )

and y_2^(s-1) is eliminated from the bottom row, which becomes (x_1^(s), b_3^(s-1)); the new submatrix is then

    X_1^(s) = ( x_1^(s)  b_2^(s) ),    b_2^(s) = b_3^(s-1).

The Top-Down sweeps are performed in a similar way. For s = log p - 2, log p - 3, ..., 0 and odd i (i.e. i = 3, 5, ..., p/2^s - 1), let the nonnegative integer l and the positive odd j be such that i = 2^l j + 1 (l and j are uniquely determined). First merge X_j^(s+l) and T_i^(s), yielding the matrix

    ( X_j^(s+l) )   ( x_j^(s+l)  b_i^(s)                                )
    (           ) = (            b_i^(s)  v_i^(s)  y_i^(s)              )        (9)
    ( T_i^(s)   )   (                     y_i^(s)  w_i^(s)  b_{i+1}^(s) ),

then eliminate y_i^(s) in the bottom row and form X_i^(s) = ( x_i^(s)  b_{i+1}^(s) ). Bottom-Up and Top-Down sweeps are depicted, for the case of p = 8, in Figure 1.

[Figure 1: A flowchart for p = 2^3 processors, showing the Diagonalization stage, the Bottom-Up sweeps (s = 1, 2), the Top-Down sweeps (s = 1, 0), and the final Factorization stage.]

Theorem 1 Stages (i) and (ii) of the Parallel Cholesky algorithm correctly compute the pivots x_{m_k}, k = 1,...,p-1.

Proof: With respect to any processor k, 1 <= k <= p-1, each transformation computed during stages (i) and (ii) of the algorithm is a cyclic transformation applied to the submatrix T^(m_k). It follows from Lemma 1 that the whole sequence is still a cyclic transformation applied to T^(m_k). Since this annihilates the off-diagonal element b_{m_k}, the proof follows from Lemma 2.

Note that the pivotal elements x_1^(s) are computed from the tridiagonal symmetric matrix

          ( X_1^(s-1)                       )
          (    T_3^(s-2)                    )
    T^s = (          ...                    )
          (             T_{2^s-1}^(0)       )
          (                     T_{2^s}     )

of order n + 2s - 1 (consecutive submatrices are chained through the off-diagonal entries b, and T_{2^s} is the corresponding original block of order n). Analogously, for i = 2^l j + 1, the pivot elements x_i^(s) are computed from the tridiagonal symmetric matrix

            ( X_j^(s+l)                         )
            (    T_{2i-1}^(s-1)                 )
    T_i^s = (          ...                      )
            (             T_{2^s i-1}^(0)       )
            (                     T_{2^s i}     )

of order n + 2s + 1.

Stage (iii): Factorization. The parallel factorization of the independent blocks is straightforward. Processor 1 computes the Cholesky decomposition of its original block T_1, while processors i = 2,...,p modify their blocks according to the rule

    a'_{(i-1)n+1} = a_{(i-1)n+1} - b_{(i-1)n+1}^2 / x_{(i-1)n},

and then compute their decompositions.
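The three stages can be simulated sequentially. In the sketch below (NumPy; helper and variable names are ours) the pivots are simply read off a sequential run of the recurrences (3) instead of the tree reduction of stage (ii); this is enough to check that the rank-one updates make the p independent block factorizations reproduce the global factor:

```python
import numpy as np

def revised_cholesky(a, b2, x0=1.0):
    # recurrences (3): x_i = a_i - b_i^2 / x_{i-1}, starting from x_0
    x = np.empty(len(a))
    xprev = x0
    for i in range(len(a)):
        x[i] = a[i] - b2[i] / xprev
        xprev = x[i]
    return x

p, n = 4, 5
N = p * n
rng = np.random.default_rng(2)
a = 3.0 + rng.uniform(0.0, 1.0, N)        # diagonally dominant => pos. definite
b2 = np.zeros(N)
b2[1:] = rng.uniform(0.25, 1.0, N - 1)    # b2[i] = b_i^2 couples rows i-1, i

# reference: all pivots x_i = d_i^2 in one sequential sweep
x_ref = revised_cholesky(a, b2)

# "preprocessing": only the p-1 pivots x_{kn} at the block boundaries
pivots = [x_ref[k * n - 1] for k in range(1, p)]

# "factoring": p independent factorizations of the rank-one updated blocks
x_par = np.empty(N)
for k in range(p):
    lo, hi = k * n, (k + 1) * n
    x0 = 1.0 if k == 0 else pivots[k - 1]  # a'_{kn+1} = a_{kn+1} - b^2 / x_{kn}
    x_par[lo:hi] = revised_cholesky(a[lo:hi], b2[lo:hi], x0)

assert np.allclose(x_par, x_ref)           # glued blocks = global factor
```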

6 Parallel computational cost


In this section we study the computational cost of the Parallel Cholesky algorithm of Section 5.2.

Stage (i). To determine the cost of this stage, we have to give the details of the forward and backward Gaussian Elimination procedures. We denote the blocks in each processor as in (7), and the computed transformations as in (8). Since we compute the revised decomposition introduced in Section 4, in what follows we actually use the squares c_i of the off-diagonal elements b_i of the matrix T.

Forward Gaussian elimination:
1. Set z = c_2 and w = a_2.
2. For i = 3,...,n, set

       t = c_i/w;   z = z t/w;   w = a_i - t.

Backward Gaussian elimination:
1. Set v = a_{n-1}.
2. For i = n-2,...,1, set

       t = c_{i+1}/v;   v = a_i - t.

The flop count for Stage (i) is therefore about 6n.

Stage (ii). Let

        ( b_1  v_1  y_1                )
    T = ( y_1  w_1  b_2                )   =>   ( b_1  v   y       )
        (      b_2  v_2  y_2           )        ( y    w   b_3     )
        (           y_2  w_2  b_3      )

denote a typical transformation in the Bottom-Up sweep. Again, we consider the squares of the off-diagonal elements, c_2 = b_2^2 and z_i = y_i^2, i = 1, 2, and compute z = y^2. Therefore, we perform the following calculation:

    alpha = w_1 v_2;   beta = c_2/alpha;   gamma = 1 - beta;
    t_1 = z_1/(gamma w_1);   t_2 = z_2/(gamma v_2);
    v = v_1 - t_1;   w = w_2 - t_2;   z = beta t_1 t_2;

which takes 11 parallel steps. Similarly, let

        ( x_1  b_2           )        ( x_1  b_2           )
    T = ( b_2  v_2  y_2      )   =>   ( b_2  v_2  y_2      )
        (      y_2  w_2  b_3 )        (      x_2  b_3      )

denote a typical transformation in the Top-Down sweep. Then we compute

    t = v_2 - c_2/x_1;   x_2 = w_2 - z_2/t;

in 4 parallel steps.

Stage (iii). The number of parallel steps is 2n (see Section 4).

The total number Tp of parallel steps is therefore:

    T_p = 6n + 11 log p + 4 log p + 2n = 8n + 15 log p.
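The 11-step Bottom-Up merge can be checked against a direct elimination of the middle 2x2 block (illustrative values; NumPy is used only for the reference computation):

```python
import numpy as np

# entries of a typical 4-row section (see Stage (ii) above): diagonals
# v1, w1, v2, w2, inner coupling b2, fill-ins y1, y2; squares are tracked
v1, w1, v2, w2 = 4.0, 5.0, 6.0, 3.0
b2, y1, y2 = 1.0, 0.5, 0.7
c2, z1, z2 = b2**2, y1**2, y2**2

# the 11-step merge
alpha = w1 * v2
beta = c2 / alpha
gamma = 1.0 - beta
t1 = z1 / (gamma * w1)
t2 = z2 / (gamma * v2)
v = v1 - t1
w = w2 - t2
z = beta * t1 * t2

# reference: Schur complement after eliminating the block [[w1, b2], [b2, v2]]
Dinv = np.linalg.inv(np.array([[w1, b2], [b2, v2]]))
v_ref = v1 - y1**2 * Dinv[0, 0]
w_ref = w2 - y2**2 * Dinv[1, 1]
z_ref = (y1 * y2 * Dinv[0, 1])**2       # square of the new fill-in y

assert np.isclose(v, v_ref)
assert np.isclose(w, w_ref)
assert np.isclose(z, z_ref)
```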


Assuming that log p << n, we conclude that the cost of the parallel algorithm is governed by the factor 8n. Hence, the parallel algorithm requires about 4 times the number of flops of the sequential algorithm. The theoretical speedup is thus p/4. However, on vector, pipelined, and super-scalar machines, the flop count determines the true performance of an algorithm only to within a constant factor. Actually, an algorithm with a worse flop count might perform better in case it can be vectorized. We show now that our parallel algorithm can be satisfactorily vectorized.

Vectorization. Let N = pn, where p = 2^r is the number of "physical" processors, each with vectorization capability. One possibility to exploit this additional power relies on employing some parallel slackness. In other words, we assume that the number of available processors is larger than p and let each physical processor simulate many such "logical" processors. More precisely, let n = mP, where P = 2^t, so that N = qm, with q = pP = 2^{r+t}. We let each physical processor perform the tasks of P corresponding logical processors. The number of flops in stages (i) and (iii) is still approximately 8n. The number of flops in stage (ii) increases to about 15(r + P), which is still negligible for (r + P) << n. However, the main stages of the algorithm, namely stages (i) and (iii), can now be vectorized, with each processor working on vectors of length P = 2^t. We provide an example of this sort in the next section.

7 Numerical examples
In this section we present some experimental results obtained on a CM5 parallel supercomputer with p = 32 nodes. Each node is in turn composed of 4 vector units, controlled by a SPARC microprocessor, and 32 Mbytes of memory. The running time and speedup for the largest problems we could experiment on are shown in Table 2. We have computed the Cholesky factorization of several classes of tridiagonal matrices, including: (a) the symmetric tridiagonal Toeplitz matrix with the diagonal element equal to 2 and the off-diagonal element equal to 1; (b) matrices obtained from the matrix defined in (a) by varying the diagonal elements; and (c) random tridiagonal matrices. The order of the test matrices is N = pn = qm, where p = 32 is the number of "physical" processors actually available, and q is the number of "logical" processors (see Section 6). Table 2 gives the running times for each of the following stages of the algorithm:

1. D - Logical Diagonalization.
2. I - Bottom-Up and Top-Down stages performed by the logical processors within any physical processor.
3. E - Bottom-Up and Top-Down stages performed by the physical processors.
4. C - Logical Factorization.
5. S - Sequential algorithm.

Clearly, the speedup is larger than p/4. Besides the additional parallelization due to having 4 vector units, we gain a factor of 40 due to vectorization.

    N        m       D      I      E      C      total   S        speedup
    3*2^24   3*2^8   1.444  0.033  0.014  0.562  2.054   2704.12  1316
    2^25     2^9     0.963  0.033  0.014  0.372  1.376   1791.73  1296
    2^24     2^8     0.480  0.033  0.014  0.186  0.713    880.85  1235

    Table 2: Computational examples on the CM5, q = 2^16.

In Table 3 we depict similar results for matrices of smaller size. The decrease in performance is due to the shorter vector length and the increased effect of communication overheads. Thus, as the matrix size becomes smaller, the better strategy is to use fewer processors. (We were not able to report on such experiments due to the fixed system partition at ICSI.) The performance we observed suggests that we should use vector sizes of 128 and blocks of order 64, so that for N = 4 * 2^13 * p we should use p processors.

    N      q      m     D      I      E      C      total   S       speedup
    2^20   2^14   2^6   0.037  0.012  0.014  0.016  0.079   50.43   638
    2^19   2^13   2^6   0.022  0.007  0.014  0.010  0.053   24.78   468
    2^18   2^13   2^5   0.011  0.007  0.014  0.005  0.037   12.36   334

    Table 3: Smaller size matrices.


8 Error Analysis
The main results of the a priori analysis(see the Appendix) is that the pivots xnk and x0nk , computed by processor k at the end of stages (ii) and (iii), ^ ^ respectively, satisfy jxnk =x0nk j = (1 + ); j j = O((n + log p) ^); ^ ^ where ^ represent the input precision, provided we use some higher precision < ^ in the computation. To appreciate the signi cant of this result we proceed with the following. i.e. the actual transformation applied to the matrix. As in (7), let B denote the block assigned to a given processor p, and let x be the pivot computed (during the rst two stages) by processor p 1. Processor p computes the following recurrences (see Section 3)
0 1

A posteriori error analysis. Consider stage (iii) of the parallel algorithm,

zi = ci=xi ; xi = ai zi ; i = 1; : : : ; n: Taking the rounding errors into account, we have zi = ci=xi ; ci = ci(1 + i ); j ij ; ^ ^ ^ ^ xi = ai zi ; ai = ai(1 + i); j ij ; ^ ^ ^ ^ where x is the computed pivot. In this analysis there is a discrepancy, i.e. ^ xn is not the same as the pivot transmitted to processor p + 1. To x this ^ problem, de ne x0 as the pivot computed by processor p 1 by the end ^ of stage (iii); then, the rst step above can be written as z = (c =x0 )(1 + ^ ^ 0 =x ) = c =x0 , where c = (1 + 0 )c , and (1 + 0 ) = (1 + )(^0 =x ): )(^ ^ x ^ ^ ^ x ^ The above argument shows that the solution computed by our parallel algorithm is the exact solution of a system in which the rst o diagonal elements of each block are further perturbed (with respect to the classical sequential algorithm) by the factor x0 =x . Hence, when j(^0 x )=x j = O( ), ^ ^ x ^ ^ the algorithm is componentwise stable in the backward sense, and this is con rmed by the a priori analysis. We found experimentally that j(^0 x )=x j is relatively small even using x ^ ^ standard double precision, on very ill conditioned matrices see Table 4. The table contains results related to three di erent kinds of test: (i) Test 1:
1 0 0 1 1 0 1 0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

20

random diagonally dominant matrices; (ii) Test 2: the Toeplitz tridiagonal symmetric matrix with $a_{ii} = 2$ and $a_{i+1,i} = a_{i,i+1} = 1$; (iii) Test 3: random tridiagonal matrices. We have added an appropriate shift to the diagonal elements to assure positive definiteness of the perturbed matrix.
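The three test families above can be sketched as follows (a minimal numpy sketch; the matrix size, the shift value, and the helper name `cholesky_pivots` are our own illustrative choices, not the paper's code):

```python
import numpy as np

def cholesky_pivots(a, b):
    # Pivots of the LDL^T factorization of the symmetric tridiagonal
    # matrix diag(a) + offdiag(b):  x_0 = a_0,  x_i = a_i - b_{i-1}^2 / x_{i-1}.
    x = np.empty_like(a)
    x[0] = a[0]
    for i in range(1, len(a)):
        x[i] = a[i] - b[i - 1] ** 2 / x[i - 1]
    return x

rng = np.random.default_rng(0)
n, shift = 200, 1e-6

# Test 1: random diagonally dominant tridiagonal matrix.
b1 = rng.random(n - 1)
a1 = np.ones(n)
a1[:-1] += b1
a1[1:] += b1

# Test 2: the Toeplitz matrix a_ii = 2, off-diagonals 1 (nearly singular
# for large n), kept safely positive definite by the shift.
a2 = np.full(n, 2.0) + shift
b2 = np.ones(n - 1)

# Test 3: random symmetric tridiagonal, shifted past its smallest
# eigenvalue to enforce positive definiteness.
a3 = rng.standard_normal(n)
b3 = rng.standard_normal(n - 1)
T3 = np.diag(a3) + np.diag(b3, 1) + np.diag(b3, -1)
a3 = a3 - np.linalg.eigvalsh(T3)[0] + shift

for a, b in ((a1, b1), (a2, b2), (a3, b3)):
    assert np.all(cholesky_pivots(a, b) > 0)  # positive pivots <=> SPD
```

Positivity of all pivots is exactly the positive-definiteness test used throughout the paper's analysis.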

Shift        Test 1   Test 2   Test 3
$10^{-4}$      14       15       16
$10^{-8}$      12       14       16
$10^{-12}$     12       14        8
$10^{-14}$     12       14        6

Table 4: Error analysis for $N = 2^{23}$ and $m = 2^{8}$. Each Test column gives the number of correct digits in the computed factorization as produced by the a posteriori error bound.

9 Further work
The efficiency of the LR scheme, accelerated with our algorithm in the decomposition stage, should be compared with that of other algorithms for the computation of the eigensystem of tridiagonal symmetric matrices, notably QR [6] and divide and conquer algorithms [4, 10, 13]. Possible generalizations of this work include the cases of block-tridiagonal and of band matrices. In fact, for both kinds of matrices the algorithmic framework appears to be essentially the same. In addition, it is possible to apply similar ideas to the development of a parallel band version of the QR algorithm.

10 Acknowledgments
The authors are indebted to the referees for a number of suggestions that helped to improve the quality and the clarity of the paper. The first author is also indebted to Prof. F.T. Luk for his encouragement and advice.


References
[1] E. Anderson et al., LAPACK Users' Guide, SIAM, 1992.
[2] I. Bar-On, A practical parallel algorithm for solving band symmetric positive definite systems of linear equations, ACM Trans. Math. Softw., 13 (1987), pp. 323–332.
[3] I. Bar-On, Fast parallel LR and QR algorithms for symmetric band matrices, Tech. Report 749, Technion, Computer Science Department, 1992.
[4] I. Bar-On, A new divide and conquer parallel algorithm for computing the eigenvalues of a symmetric tridiagonal matrix, Tech. Report 832, Technion, Computer Science Department, 1994.
[5] I. Bar-On, Interlacing properties for tridiagonal symmetric matrices with applications to parallel computing, SIAM J. on Matrix Analysis and Applications, 17 (1996).
[6] I. Bar-On and B. Codenotti, A fast and stable parallel QR algorithm for symmetric tridiagonal matrices, Linear Algebra and its Applications, 220 (1995), pp. 63–96.
[7] I. Bar-On and M. Leoncini, Fast and reliable parallel solution of bidiagonal systems, SIAM J. on Numerical Analysis, (1995). Submitted.
[8] I. Bar-On and M. Leoncini, Well defined tridiagonal systems, in the 6th ILAS Conference (submitted), 1996.
[9] J. Barlow and J. Demmel, Computing accurate eigensystems of scaled diagonally dominant matrices, SIAM J. on Numerical Analysis, 27 (1990), pp. 762–791.
[10] J. Cuppen, A divide and conquer method for the symmetric eigenproblem, Numerische Mathematik, 36 (1981), pp. 177–195.
[11] J. Demmel and W. Kahan, Accurate singular values of bidiagonal matrices, SIAM J. Sci. Stat. Comput., 11 (1990), pp. 873–912.
[12] J. J. Dongarra and A. H. Sameh, On some parallel banded system solvers, Parallel Computing, 1 (1984), pp. 223–235.

[13] J. J. Dongarra and D. Sorensen, A fully parallel algorithm for the symmetric eigenvalue problem, SIAM J. Sci. Stat. Comput., 8 (1987), pp. s139–s154.
[14] Z. Drmac, M. Omladic, and K. Veselic, On the perturbation of the Cholesky factorization, SIAM J. on Matrix Analysis and Applications, 15 (1994), pp. 1319–1332.
[15] K. V. Fernando and B. Parlett, Accurate singular values and differential qd algorithms, Numerische Mathematik, 67 (1994), pp. 191–229.
[16] J. Francis, The QR transformation, part I, Computer J., 4 (1961), pp. 265–271.
[17] J. Francis, The QR transformation, part II, Computer J., 4 (1962), pp. 332–345.
[18] J. Grad and E. Zakrajsek, LR algorithm with Laguerre shift for symmetric tridiagonal matrices, The Computer Journal, 15 (1972), pp. 268–270.
[19] S. L. Johnsson, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Stat. Comput., 8 (1987), pp. 354–392.
[20] B. N. Parlett, Laguerre's method applied to the matrix eigenvalue problem, Mathematics of Computation, 18 (1964), pp. 464–485.
[21] H. Rutishauser, Solution of eigenvalue problems with the LR transformation, Nat. Bur. Standards, AMS, 49 (1958), pp. 47–81.
[22] H. Rutishauser and H. Schwarz, The LR transformation method for symmetric matrices, Numerische Mathematik, 5 (1963), pp. 273–289.
[23] A. Sameh and D. Kuck, On stable parallel linear system solvers, J. Assoc. Comput. Mach., 25 (1978), pp. 81–91.
[24] H. S. Stone, An efficient parallel algorithm for the solution of a tridiagonal system of equations, J. Assoc. Comput. Mach., 20 (1973), pp. 27–38.

[25] H. S. Stone, Parallel tridiagonal equation solvers, ACM Trans. Math. Softw., 1 (1975), pp. 289–307.
[26] P. Swarztrauber, A parallel algorithm for solving general tridiagonal equations, Mathematics of Computation, 33 (1979), pp. 185–199.
[27] Thinking Machines Corporation, The Connection Machine CM-5 Technical Summary, Oct. 1991.
[28] H. Wang, A parallel method for tridiagonal equations, ACM Trans. Math. Softw., 7 (1981), pp. 170–183.
[29] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, 1965. Reprinted in Oxford Science Publications, 1988.
[30] J. H. Wilkinson, A priori error analysis of algebraic processes, in Proc. International Congress Math., Moscow: Izdat., 1968, pp. 629–639.

A A priori error analysis


We show in this section that we can assure overall stability of the parallel algorithm by using some higher precision in the preprocessing stage. Let $T \in M(N)$ be an unreduced symmetric and positive definite matrix of order $N = np$, with $p = 2^r$. Let $\hat\varepsilon$ denote the precision of the input, and $\varepsilon < \hat\varepsilon$ the precision by which we perform the computations.

Lemma 3 Let $A \in M(m)$ be a submatrix of $T$, and let

$$\bar A = A + \delta\bar A, \qquad \hat A = A + \delta\hat A, \qquad |\delta\bar A|, |\delta\hat A| \le O(\varepsilon)|A|,$$

denote corresponding componentwise perturbations of $A$. Let

$$\lambda_1 > \lambda_2 > \dots > \lambda_m > 0$$

denote the eigenvalues of $A$, and similarly $\bar\lambda_i$ and $\hat\lambda_i$ the eigenvalues of the perturbed matrices, respectively. Let $A'$, $\bar A'$ and $\hat A'$ denote the corresponding leading submatrices of order $(m-1)$, and let $\lambda'_i$, $\bar\lambda'_i$ and $\hat\lambda'_i$ denote the corresponding eigenvalues. Then

$$\frac{e_m^T \hat A^{-1} e_m}{e_m^T \bar A^{-1} e_m} = 1 + \delta, \qquad |\delta| = O(m\varepsilon\,\lambda_1/\lambda_N),$$

where $\lambda_i$, $i = 1,\dots,N$, are the eigenvalues of $T$.


Proof: By standard perturbation analysis,

$$|\bar\lambda_i - \lambda_i|,\ |\hat\lambda_i - \lambda_i| \le O(\varepsilon)\lambda_1, \qquad |\bar\lambda'_i - \lambda'_i|,\ |\hat\lambda'_i - \lambda'_i| \le O(\varepsilon)\lambda'_1,$$

and as $\lambda_1 \ge \lambda_i \ge \lambda_m \ge \lambda_N$, also

$$\bar\lambda_i, \hat\lambda_i = \lambda_i\bigl(1 + O(\varepsilon\,\lambda_1/\lambda_N)\bigr), \qquad \bar\lambda'_i, \hat\lambda'_i = \lambda'_i\bigl(1 + O(\varepsilon\,\lambda_1/\lambda_N)\bigr).$$

Hence,

$$\frac{e_m^T \hat A^{-1} e_m}{e_m^T \bar A^{-1} e_m} = \frac{\prod_{i=1}^{m-1}\hat\lambda'_i}{\prod_{i=1}^{m}\hat\lambda_i}\,\cdot\,\frac{\prod_{i=1}^{m}\bar\lambda_i}{\prod_{i=1}^{m-1}\bar\lambda'_i} = \prod_{i=1}^{m-1}\frac{1+\hat\delta'_i}{1+\bar\delta'_i}\,\cdot\,\prod_{i=1}^{m}\frac{1+\bar\delta_i}{1+\hat\delta_i}, \qquad (10)$$

with $|\bar\delta_i|, |\hat\delta_i|, |\bar\delta'_i|, |\hat\delta'_i| \le O(\varepsilon\,\lambda_1/\lambda_m) = O(\varepsilon\,\lambda_1/\lambda_N)$. This simple, but usually overestimated, bound will decree the use of

$$\varepsilon = \hat\varepsilon\,\lambda_N/\lambda_1$$

in the following calculations. A better bound, usually $|\delta| = O(m\varepsilon)$, could be derived using the methods in [8]. In this case the precision of the computation could be greatly reduced. However, to simplify the discussion, we proceed with the former scheme.
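The manipulation behind (10) rests on the identity $e_m^T A^{-1} e_m = \det(A')/\det(A) = \prod\lambda'_i/\prod\lambda_i$, which follows from Cramer's rule. A quick numerical check of this identity (our own sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8
# A random SPD tridiagonal matrix A (diagonally dominant for simplicity).
b = rng.random(m - 1)
a = 1.0 + np.concatenate(([b[0]], b[:-1] + b[1:], [b[-1]]))
A = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)

lam = np.linalg.eigvalsh(A)                  # eigenvalues of A
lam_lead = np.linalg.eigvalsh(A[:-1, :-1])   # eigenvalues of the leading (m-1) block

corner = np.linalg.inv(A)[-1, -1]            # e_m^T A^{-1} e_m
ratio = np.prod(lam_lead) / np.prod(lam)     # det(A') / det(A)
assert abs(corner - ratio) < 1e-9 * abs(corner)
```

Since the determinant of a symmetric matrix is the product of its eigenvalues, the corner entry of the inverse is exactly the ratio of eigenvalue products used in (10).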

Theorem 2 Let $\hat x_{i_p}$ and $\hat x'_{i_p}$ denote the pivots computed by processor $i_p$ at the end of stages (ii) and (iii), respectively. Then

$$\hat x_{i_p}/\hat x'_{i_p} = 1 + \delta, \qquad |\delta| = O((n + \log p)\,\hat\varepsilon).$$

Here we recall from Section 5.2 that $\hat x_{i_p}$ as well as $\hat x'_{i_p}$ are computed from the matrix $T_{i_p}$.
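The effect stated in the theorem can be illustrated informally (this is a serial proxy, not the paper's two-stage parallel computation): evaluate the trailing pivot of a well-conditioned block in a lower working precision, playing the role of the preprocessing value, and in double precision, and check that their ratio is $1 + \delta$ with small $\delta$:

```python
import numpy as np

def trailing_pivot(a, b, dtype):
    # Run the pivot recurrence x_i = a_i - b_{i-1}^2 / x_{i-1} entirely
    # in the given precision and return the last pivot of the block.
    a = a.astype(dtype)
    b = b.astype(dtype)
    x = a[0]
    for i in range(1, len(a)):
        x = a[i] - b[i - 1] * b[i - 1] / x
    return x

rng = np.random.default_rng(2)
n = 256
b = rng.random(n - 1)
a = 1.0 + np.concatenate(([b[0]], b[:-1] + b[1:], [b[-1]]))  # diagonally dominant

x_low = trailing_pivot(a, b, np.float32)    # lower-precision "preprocessing" pivot
x_high = trailing_pivot(a, b, np.float64)   # full-precision "factoring" pivot
assert abs(x_low / x_high - 1) < 1e-4       # ratio is 1 + delta with small delta
```

For diagonally dominant blocks the recurrence does not amplify rounding errors, so the observed $\delta$ is close to the working precision rather than to the worst-case bound.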

Proof: We derive the proof in the following pages. We begin with a review of the basic transformations performed in the three stages of the parallel algorithm. For the diagonalization stage we have

$$\begin{pmatrix} v & z e_1^T & \\ z e_1 & A & c e_n \\ & c e_n^T & w \end{pmatrix} \Longrightarrow \begin{pmatrix} v^{(0)} & & u \\ & A & \\ u & & w^{(0)} \end{pmatrix},$$

where $z = b_1^2$, $c = b_n^2$, and

$$v^{(0)} = v - z\,e_1^T A^{-1} e_1, \qquad w^{(0)} = w - c\,e_n^T A^{-1} e_n, \qquad u = c z\,e_n^T A^{-1} e_1.$$

Note that we have replaced the off-diagonal elements by their squares, which are the values actually involved in the modified recurrences (3). For the middle transformations in the Bottom-Up sweep we have

0v B B u B B @
1 1

( 1

1)

w
( )

u
1

( 1

1)

v
( 1

c u
2

( 2

1)

where

0v B u B )B B @ u
( )

w
( 2

u
2

1)

1 C C C C A

1)

c
2

u
s
1)

w
(

( )

1 C C C; C A
(11) (12) (13) (14)

+ s; u ; u= s= d s ; d =w =1 ; =
( 1) 1 1 1 1 ( 1 1)

s s;

d =v = d cd ;
2 2 1 2 1 2

= s u ; s= d
( ) 2 2

1)

+ s;
1)

( 2

Here, we assume that the computation of d (d ) is carried out by accumulating the corresponding partial shifts and then deducing their sum from the corresponding element w (v ). Finally, for the Top-Down sweep (and the rst transformations in the Bottom-Up sweep) we have
1 2

$$\begin{pmatrix} x_0 & c & \\ c & v^{(s+1)} & u^{(s+1)} \\ & u^{(s+1)} & w^{(s+1)} \end{pmatrix} \Longrightarrow \begin{pmatrix} x_0 & c & \\ & d & u^{(s+1)} \\ & & x_1 \end{pmatrix},$$

with

$$x_1 = w^{(s+1)} - \sigma^{(s+1)}, \qquad \sigma^{(s+1)} = \frac{u^{(s+1)}}{d}, \qquad\qquad (15)$$
$$d = v^{(s+1)} - \sigma_{s+1}, \qquad \sigma_{s+1} = \sigma_s + s_{s+1}, \qquad s_{s+1} = \frac{c}{x_0}. \qquad\qquad (16)$$

We proceed to consider the roundoff perturbations related to these transformations. We begin with a preliminary discussion of the perturbations in the internal diagonal elements in the Bottom-Up and Top-Down sweeps, i.e., $d_1$, $d_2$ in (13), and $d$ in (16).
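In exact arithmetic, each of the elimination steps reviewed above is a Schur-complement elimination of interior boundary rows; the recurrences are its specialization to the squared off-diagonal variables. The generic dense form of one Bottom-Up step, and the fact that it preserves the corresponding block of the inverse, can be checked as follows (an illustrative numpy sketch with made-up entries, not the paper's implementation):

```python
import numpy as np

# A small SPD matrix with the 4x4 block pattern of one Bottom-Up step:
# two outer boundary rows coupled through an interior 2x2 block.
M = np.diag([5.0, 4.0, 4.0, 5.0])
M[0, 1] = M[1, 0] = 1.0   # coupling of the first boundary row (u1)
M[1, 2] = M[2, 1] = 0.5   # interior coupling (c)
M[2, 3] = M[3, 2] = 1.0   # coupling of the second boundary row (u2)

# Eliminate the interior block (rows/cols 1, 2): the Schur complement
# onto rows/cols 0, 3 yields the updated outer diagonals and the new,
# longer-range coupling between the two outer boundary rows.
B = M[np.ix_([0, 3], [1, 2])]
D = M[np.ix_([1, 2], [1, 2])]
S = M[np.ix_([0, 3], [0, 3])] - B @ np.linalg.solve(D, B.T)

# The Schur complement preserves the outer block of the inverse.
assert np.allclose(np.linalg.inv(S), np.linalg.inv(M)[np.ix_([0, 3], [0, 3])])
```

This inverse-preservation property is what allows the sweep to merge blocks without revisiting the eliminated interior rows.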
Lemma 4 The computed value of $d_1$ in Step $(s+1)$ satisfies

$$\hat d_1 = \bar w_1 - \bar\sigma_1^{(s)}, \qquad \bar w_1 = w_1(1+\varepsilon_{w_1}), \quad |\varepsilon_{w_1}| \le \varepsilon,$$
$$\bar\sigma_1^{(s)} = \sum_{i=0}^{s} s_i, \qquad s_i = \hat s_i(1+\theta_i), \quad |\theta_i| = O(s\varepsilon).$$

Proof: We have that

$$\hat d_1 = (\hat w_1 - \hat\sigma_1^{(s)})(1+\delta_{d_1}) = \bar w_1 - \bar\sigma_1^{(s)}, \qquad \bar w_1 = \hat w_1(1+\delta_{d_1}) = w_1(1+\varepsilon_{w_1}),$$

with $|\varepsilon_{w_1}| \le |\delta_{d_1}| \le \varepsilon$, and $\bar\sigma_1^{(s)} = \hat\sigma_1^{(s)}(1+\delta_{d_1})$. Then we obtain

$$\hat\sigma_1^{(s)} = \sum_{i=0}^{s} \hat s_i \prod_{j=i}^{s-1}(1+\phi_j),$$

where $\phi_0 = 0$ and $|\phi_j| \le \varepsilon$ for $j = 0,\dots,(s-1)$. Hence,

$$s_i = \hat s_i \prod_{j=i}^{s-1}(1+\phi_j)(1+\delta_{d_1}) = \hat s_i(1+\theta_i),$$

with $|\theta_i| = O(s\varepsilon)$.
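The accumulation model used in the lemma above — recursive summation of $s$ terms perturbs each term by a product of at most $s$ rounding factors — can be observed directly (our sketch; the number of terms is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
s = 64
shifts = rng.random(s).astype(np.float32)

# Accumulate the partial shifts in working precision, one addition per step,
# as the algorithm does when forming sigma.
acc = np.float32(0.0)
for v in shifts:
    acc = np.float32(acc + v)

exact = float(np.sum(shifts.astype(np.float64)))  # reference sum
eps32 = float(np.finfo(np.float32).eps)

# Standard analysis of recursive summation: |acc - exact| <= s * eps * sum|shifts|.
assert abs(acc - exact) <= s * eps32 * exact
```

The observed error is typically far below this worst-case bound, which matches the remark that the lemma's bound is usually an overestimate.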

Corollary 1 Using similar notations we have

$$\hat d_2 = \bar v_2 - \sum_{i=0}^{s} s_i, \qquad \bar v_2 = v_2(1+\varepsilon_{v_2}), \quad |\varepsilon_{v_2}| \le \varepsilon, \qquad s_i = \hat s_i(1+\theta_i), \quad |\theta_i| = O(s\varepsilon).$$

Finally, the computed value of $d$ in Step $(s-1)$ satisfies

$$\hat d = \bar v - \bar\sigma_{s}, \qquad \bar v = v(1+\varepsilon_v), \quad |\varepsilon_v| \le \varepsilon,$$
$$\hat x_1 = \bar w - \bar\sigma^{(s)}, \qquad \bar w = w(1+\varepsilon_w), \quad |\varepsilon_w| \le \varepsilon,$$

where $\bar\sigma_{s}$ and $\bar\sigma^{(s)}$ denote the perturbed accumulated sums, as above.

We then proceed to consider the roundoff perturbations in Stage (i) and Stage (ii).

Theorem 3 The computed transformation in Stage (i) is affected by perturbations of the order of $\varepsilon$, with the following structure:

$$\begin{pmatrix} v & z e_1^T & \\ z e_1 & A & c e_n \\ & c e_n^T & w \end{pmatrix} \Longrightarrow \begin{pmatrix} \bar v^{(0)} & & \hat u \\ & \bar A & \\ \hat u & & \bar w^{(0)} \end{pmatrix},$$

where $\hat u = (1+\delta_u)u$, and $|\delta_u| = O(n\hat\varepsilon)$.

Proof: The forward elimination procedure can be viewed as a Cholesky factorization of $A$ (see Section 3), and therefore produces a matrix $\bar A$ which differs from $A$ by an $O(\varepsilon)$ perturbation. Hence,

$$\hat\beta = c\,(e_n^T \bar A^{-1} e_n)(1+\varepsilon_1) = \bar c\,(e_n^T \bar A^{-1} e_n), \qquad |\varepsilon_1| \le \varepsilon,$$

where $\beta = c\,(e_n^T A^{-1} e_n)$. Then

$$\bar c = (1+\varepsilon_0)(1+\varepsilon_1)c = (1+\delta_c)c,$$

and $|\delta_c| = O(\varepsilon)$. Similarly, the backward elimination produces another perturbation of $A$, which we denote by $\hat A$. Thus

$$\hat\alpha = (1+\varepsilon_2)(e_1^T \hat A^{-1} e_1)\,z = \frac{e_1^T \hat A^{-1} e_1}{e_1^T \bar A^{-1} e_1}\,(1+\varepsilon_2)(e_1^T \bar A^{-1} e_1)\,z,$$

with $|\varepsilon_2| \le \varepsilon$, so that

$$\hat\alpha = \bar z\,(e_1^T \bar A^{-1} e_1), \qquad \bar z = (1+\varepsilon_0)(1+\varepsilon_2)(1+\delta)z = (1+\delta_z)z,$$

with $|\delta_z| = O(n\hat\varepsilon)$ by Lemma 3. Finally,

$$\hat u = z c\,(e_n^T \hat A^{-1} e_1)\,(1+\varepsilon_2)\prod_{i=3}^{2n}(1+\varepsilon_i), \qquad |\varepsilon_j| \le \varepsilon, \quad j = 3,\dots,2n,$$

and from $u = zc\,(e_n^T A^{-1} e_1)$ we get

$$\hat u = (1+\delta_u)u = \frac{(1+\varepsilon_2)\prod_{i=3}^{2n}(1+\varepsilon_i)}{(1+\delta_c)(1+\delta_z)}\,u,$$

which completes the proof. We further note that $|\delta_u| = O(n\hat\varepsilon)$, and that $\hat x = \bar w^{(0)}$, with $\bar w = w(1+\varepsilon_w)$, $|\varepsilon_w| \le \varepsilon$, for $i_p = 1$. We proceed with the perturbations in the Bottom-Up sweep.

Theorem 4 The computed transformations in each of the Bottom-Up steps are affected by $O(\varepsilon)$ perturbations, except possibly for the first step, where we may have $O(n\hat\varepsilon)$ perturbations, i.e.

$$\begin{pmatrix} \bar v_1^{(s-1)} & \hat u_1^{(s-1)} & & \\ \hat u_1^{(s-1)} & \bar w_1 & \bar c & \\ & \bar c & \bar v_2 & \hat u_2^{(s-1)} \\ & & \hat u_2^{(s-1)} & \bar w_2^{(s-1)} \end{pmatrix} \Longrightarrow \begin{pmatrix} \bar v_1^{(s)} & & & \hat u^{(s)} \\ & \bar w_1 & \bar c & \\ & \bar c & \bar v_2 & \\ \hat u^{(s)} & & & \bar w_2^{(s)} \end{pmatrix},$$

with $\hat u^{(s)} = (1+\delta_u)u^{(s)}$ and $|\delta_u| = O(\varepsilon)$.

Proof: From (12) we have that

$$\hat s_2 = \frac{\hat u_2}{\hat d_2\,\widehat{(1-\lambda)}}\;\frac{1+\varepsilon_5}{1+\varepsilon_4}, \qquad \widehat{(1-\lambda)} = (1-\hat\lambda)(1+\varepsilon_3), \qquad \hat\lambda = \frac{c\,(1+\varepsilon_2)}{\hat d_1\hat d_2\,(1+\varepsilon_1)},$$

with $|\varepsilon_i| \le \varepsilon$, $i = 1,\dots,5$. Hence, collecting the factors,

$$\hat s_2 = (1+\delta_{s_2})\,\frac{u_2}{\tilde d_2(1-\tilde\lambda)}, \qquad |\delta_{s_2}| = O(|\delta_{u_2}| + \varepsilon),$$

where, using Lemma 4 and Corollary 1,

$$\tilde d_2 = (1+\delta_{d_2})\hat d_2 = (\bar v_2 - \bar\sigma_2)(1+\delta_{d_2}) = \tilde v_2 - \tilde\sigma_2.$$

The perturbation in the original element $v_2$ of $T$ is now determined by

$$\tilde v_2 = v_2(1+\varepsilon_{v_2})(1+\delta_{d_2}) = v_2(1+\tilde\varepsilon_{v_2}), \qquad |\tilde\varepsilon_{v_2}| = O(|\delta_{u_2}| + \varepsilon),$$

since $|\varepsilon_{v_2}| \le \varepsilon$. Similarly,

$$\hat s_1 = (1+\delta_{s_1})\,\frac{u_1}{\tilde d_1(1-\tilde\lambda)}, \qquad |\delta_{s_1}| = O(|\delta_{u_1}| + \varepsilon),$$

with

$$\tilde d_1 = (1+\delta_{d_1})\hat d_1 = (\bar w_1 - \bar\sigma_1)(1+\delta_{d_1}) = \tilde w_1 - \tilde\sigma_1.$$

Here, the perturbation in $w_1$ is determined by

$$\tilde w_1 = w_1(1+\varepsilon_{w_1})(1+\delta_{d_1}) = w_1(1+\tilde\varepsilon_{w_1}), \qquad |\tilde\varepsilon_{w_1}| = O(|\delta_{u_1}| + \varepsilon).$$

Finally, from (13), we get

$$\tilde\lambda = \frac{c\,(1+\varepsilon_2)(1+\delta_{d_1})(1+\delta_{d_2})}{\tilde d_1\tilde d_2\,(1+\varepsilon_1)} = \frac{\tilde c}{\tilde d_1\tilde d_2},$$

so that $\tilde c = (1+\delta_c)c$ with $|\delta_c| = O(|\delta_{u_1}| + |\delta_{u_2}| + \varepsilon)$, and from

$$\hat u^{(s)} = \hat\lambda\,\hat s_1\hat s_2\,(1+\varepsilon_8)(1+\varepsilon_9), \qquad |\varepsilon_8|, |\varepsilon_9| \le \varepsilon,$$

together with $u^{(s)} = \tilde\lambda\,s_1 s_2$, we get $\hat u^{(s)} = (1+\delta_u)u^{(s)}$ with $|\delta_u| = O(\varepsilon)$. Therefore,

$$|\tilde\varepsilon_{v_2}|, |\tilde\varepsilon_{w_1}| = O(\varepsilon), \qquad s = 2,\dots,(r-1),$$

except, possibly, for the first step, where $|\tilde\varepsilon_{v_2}|, |\tilde\varepsilon_{w_1}| = O(n\hat\varepsilon)$.
Finally, we consider the perturbations in the Top-Down sweep.

Theorem 5 The computed transformations in each Top-Down step are affected by $O(\varepsilon)$ perturbations, except possibly for the last step, where we may have $O(n\hat\varepsilon)$ perturbations, i.e.

$$\begin{pmatrix} \hat x_0 & \bar c & \\ \bar c & \bar v^{(s+1)} & \hat u^{(s+1)} \\ & \hat u^{(s+1)} & \bar w^{(s+1)} \end{pmatrix} \Longrightarrow \begin{pmatrix} \hat x_0 & \bar c & \\ & \hat d & \hat u^{(s+1)} \\ & & \hat x_1 \end{pmatrix}.$$

Proof: We have from (16),

$$\hat s_{s+1} = (c/\hat x_0)(1+\varepsilon_0) = \tilde c/\hat x_0, \qquad |\varepsilon_0| \le \varepsilon,$$

where $\hat x_0$ is the computed pivot received from the previous step, so that $\tilde c = c(1+\delta_c)$ with $|\delta_c| = O(\varepsilon)$. Then, with $|\varepsilon_2| \le \varepsilon$, and using Corollary 1,

$$\sigma^{(s+1)} = (1+\delta_\sigma)\hat\sigma^{(s+1)} = (1+\delta_\sigma)(1+\delta_u)(1+\varepsilon_2)\,\frac{u^{(s+1)}}{\tilde d},$$

$$\tilde d = (\bar v - \bar\sigma_{s+1})(1+\delta_d) = \tilde v - \tilde\sigma_{s+1}.$$

Here, the perturbation in $v$ is given by

$$\tilde v = v(1+\varepsilon_v)(1+\delta_d) = v(1+\tilde\varepsilon_v), \qquad |\tilde\varepsilon_v| = O(|\delta_u| + \varepsilon),$$

with $|\tilde\varepsilon_v| = O(\varepsilon)$, $s = r-1,\dots,1$, except possibly for step $s = 1$, where $|\tilde\varepsilon_v| = O(n\hat\varepsilon)$. Finally, the perturbation in $w$ is given by

$$\hat x_1 = \bar w - \bar\sigma^{(s+1)}, \qquad \bar w = w(1+\varepsilon_w), \quad |\varepsilon_w| \le \varepsilon.$$

As a result of the above analysis, we may obtain the following first gross a priori bound.

Corollary 2 The computed pivots $\hat x_{i_p}$ and $\hat x'_{i_p}$ satisfy

$$\hat x_{i_p}/\hat x'_{i_p} = 1 + \delta, \qquad |\delta| = O(N\hat\varepsilon).$$

Proof: The pivots are both derived from the submatrix $T_{i_p}$, and thus can be related to $O(\hat\varepsilon)$ perturbations in the original matrix. The rest follows from Lemma 3.

Finally, let $\bar T_{i_p}$ correspond to the matrix derived from the above $O(\hat\varepsilon)$ perturbations in the original matrix, and let $\hat T_{i_p}$ denote the matrix actually computed. Note that $\hat T_{i_p}$ corresponds to $O(\varepsilon)$ perturbations in $\bar T_{i_p}$.

Corollary 3 Let us denote the eigenvalues of $\bar T_{i_p}$ by

$$\bar\lambda^{(i_p)}_1 > \bar\lambda^{(i_p)}_2 > \dots > \bar\lambda^{(i_p)}_{np-1} > \bar\lambda^{(i_p)}_{np} > 0,$$

and similarly those of $\hat T_{i_p}$ by $\hat\lambda^{(i_p)}_i$. Then,

$$\hat\lambda^{(i_p)}_{np} \ge \lambda_N\bigl(1 + O(\hat\varepsilon)\bigr) > 0,$$

so that the matrix $\hat T_{i_p}$ is positive definite.

Proof: Let $\lambda^{(i_p)}_i$ denote the corresponding eigenvalues of $T_{i_p}$. Then,

$$\bar\lambda^{(i_p)}_{np} \ge \lambda^{(i_p)}_{np}\bigl(1 + O(\hat\varepsilon)\bigr) \ge \lambda_N\bigl(1 + O(\hat\varepsilon)\bigr) > 0,$$

and therefore

$$\hat\lambda^{(i_p)}_{np} = \bar\lambda^{(i_p)}_{np} + O(\varepsilon\,\bar\lambda^{(i_p)}_1) = \bar\lambda^{(i_p)}_{np}\bigl(1 + O(\varepsilon\,\lambda_1/\lambda_N)\bigr) = \bar\lambda^{(i_p)}_{np}\bigl(1 + O(\hat\varepsilon)\bigr),$$

as required.

Proof of Theorem 2. Noting that the pivots are computed from $O((n + \log p)\,\hat\varepsilon)$ perturbations in $T_{i_p}$, the proof follows from Lemma 3.