
Numerical Linear Algebra

And Its Applications

Xiao-Qing Jin (Department of Mathematics, University of Macau, Macau, P. R. China)
Yi-Min Wei (Department of Mathematics, Fudan University, Shanghai, P. R. China)

August 29, 2008

To Our Families

CONTENTS

Preface

Chapter 1  Introduction
    1.1 Basic symbols
    1.2 Basic problems in NLA
    1.3 Why shall we study numerical methods?
    1.4 Matrix factorizations (decompositions)
    1.5 Perturbation and error analysis
    1.6 Operation cost and convergence rate
    Exercises

Chapter 2  Direct Methods for Linear Systems
    2.1 Triangular linear systems and LU factorization
    2.2 LU factorization with pivoting
    2.3 Cholesky factorization
    Exercises

Chapter 3  Perturbation and Error Analysis
    3.1 Vector and matrix norms
    3.2 Perturbation analysis for linear systems
    3.3 Error analysis on floating point arithmetic
    3.4 Error analysis on partial pivoting
    Exercises

Chapter 4  Least Squares Problems
    4.1 Least squares problems
    4.2 Orthogonal transformations
    4.3 QR decomposition
    Exercises

Chapter 5  Classical Iterative Methods
    5.1 Jacobi and Gauss-Seidel method
    5.2 Convergence analysis
    5.3 Convergence rate
    5.4 SOR method
    Exercises

Chapter 6  Krylov Subspace Methods
    6.1 Steepest descent method
    6.2 Conjugate gradient method
    6.3 Practical CG method and convergence analysis
    6.4 Preconditioning
    6.5 GMRES method
    Exercises

Chapter 7  Nonsymmetric Eigenvalue Problems
    7.1 Basic properties
    7.2 Power method
    7.3 Inverse power method
    7.4 QR method
    7.5 Real version of QR algorithm
    Exercises

Chapter 8  Symmetric Eigenvalue Problems
    8.1 Basic spectral properties
    8.2 Symmetric QR method
    8.3 Jacobi method
    8.4 Bisection method
    8.5 Divide-and-conquer method
    Exercises

Chapter 9  Applications
    9.1 Introduction
    9.2 Background of BVMs
    9.3 Strang-type preconditioner for ODEs
    9.4 Strang-type preconditioner for DDEs
    9.5 Strang-type preconditioner for NDDEs
    9.6 Strang-type preconditioner for SPDDEs

Bibliography

Index

Preface
Numerical linear algebra, also called matrix computation, has been a center of scientific and engineering computing since 1946, when the first modern computer was born. Most problems in science and engineering eventually become problems in matrix computation. Therefore, it is important for us to study numerical linear algebra. This book gives an elementary introduction to matrix computation and it also includes some new results obtained in recent years. In the beginning of this book, we first give an outline of numerical linear algebra in Chapter 1.
In Chapter 2, we introduce Gaussian elimination, a basic direct method, for solving
general linear systems. Usually, Gaussian elimination is used for solving a dense linear system of medium size with no special structure. The operation cost of Gaussian elimination is $O(n^3)$, where $n$ is the size of the system. The pivoting technique is also
studied.
In Chapter 3, in order to discuss effects of perturbation and error on numerical
solutions, we introduce vector and matrix norms and study their properties. The error
analysis on floating point operations and on the partial pivoting technique is also given.
In Chapter 4, linear least squares problems are studied. We will concentrate on
the problem of finding the least squares solution of an overdetermined linear system
Ax = b where A has more rows than columns. Some orthogonal transformations and
the QR decomposition are used to design efficient algorithms for solving least squares
problems.
We study classical iterative methods for the solution of Ax = b in Chapter 5.
Iterative methods are quite different from direct methods such as Gaussian elimination.
Direct methods based on an LU factorization of the matrix A are prohibitive in terms
of computing time and computer storage if A is quite large. Usually, in most large
problems, the matrices are sparse. The sparsity may be lost during the LU factorization procedure, and then, at the end of the factorization, storage becomes a crucial issue. For this kind of problem, we can use a class of methods called iterative methods. We
only consider some classical iterative methods in this chapter.
In Chapter 6, we introduce another class of iterative methods called Krylov subspace methods, which were proposed more recently. We will only study two members of this class: the conjugate gradient (CG) method and the generalized minimum residual (GMRES) method. The CG method, proposed in 1952, is one of the best known iterative methods for solving symmetric positive definite linear systems. The GMRES method was proposed in 1986 for solving nonsymmetric linear systems. The
preconditioning technique is also studied.
Eigenvalue problems are particularly interesting in scientific computing. In Chapter 7, nonsymmetric eigenvalue problems are studied. We introduce some well-known methods such as the power method, the inverse power method and the QR method.
The symmetric eigenvalue problem with its nice properties and rich mathematical
theory is one of the most interesting topics in numerical linear algebra. In Chapter 8,
we will study this topic. The symmetric QR iteration method, the Jacobi method, the
bisection method and a divide-and-conquer technique will be discussed in this chapter.
In Chapter 9, we will briefly survey some of the latest developments in using bound-
ary value methods for solving systems of ordinary differential equations with initial
values. These methods require the solutions of one or more nonsymmetric, large and
sparse linear systems. Therefore, we will use the GMRES method in Chapter 6 with
some preconditioners for solving these linear systems. One of the main results is that if an $A_{\nu_1,\nu_2}$-stable boundary value method is used for an $m$-by-$m$ system of ODEs, then the preconditioned matrix can be decomposed as $I + L$ where $I$ is the identity matrix and the rank of $L$ is at most $2m(\nu_1 + \nu_2)$. It follows that when the GMRES method is applied to the preconditioned system, the method will converge in at most $2m(\nu_1 + \nu_2) + 1$ iterations. Applications to different delay differential equations are also
given.

If any other mathematical topic is as fundamental to the mathematical sciences as calculus and differential equations, it is numerical linear algebra.

L. Trefethen and D. Bau III

Acknowledgments: We would like to thank Professor Raymond H. F. Chan of the Department of Mathematics, Chinese University of Hong Kong, for his constant encouragement, long-standing friendship, and financial support; and Professor Z. H. Cao of the Department of Mathematics, Fudan University, for his many helpful discussions and useful suggestions. We would also like to thank our friend Professor Z. C. Shi for his encouraging support and valuable comments. Of course, special appreciation goes to two important institutions in the authors' lives: the University of Macau and Fudan University, for providing a wonderful intellectual atmosphere for writing this book. Most of the writing was done during evenings, weekends and holidays. Finally, thanks are also due to our families for their endless love, understanding, encouragement and support, essential to the completion of this book. The most heartfelt thanks to all of them!
The publication of the book is supported in part by the research grants No. RG024/01-02S/JXQ/FST, No. RG031/02-03S/JXQ/FST and No. RG064/03-04S/JXQ/FST from the University of Macau; the research grant No. 10471027 from the National Natural Science Foundation of China; and some financial support from the Shanghai Education Committee and Fudan University.

Authors' words on the corrected and revised second printing: In this second printing, we have corrected some minor mathematical and typographical mistakes in the first printing of the book. We would like to thank all those people who pointed these out to us. Additional comments and some revisions have been made in Chapter 7. The references have been updated. More exercises are also to be found in the book. The second printing of the book is supported by the research grant No. RG081/04-05S/JXQ/FST.
Chapter 1

Introduction

Numerical linear algebra (NLA) is also called matrix computation. It has been a center of scientific and engineering computing since the first modern computer came into the world around 1946. Most problems in science and engineering are eventually turned into problems in NLA. Thus, it is very important for us to study NLA. This book gives an elementary introduction to NLA and it also includes some new results obtained in recent years.

1.1 Basic symbols


We will use the following symbols throughout this book.

• Let $\mathbb{R}$ denote the set of real numbers, $\mathbb{C}$ denote the set of complex numbers, and $\mathrm{i} \equiv \sqrt{-1}$.

• Let $\mathbb{R}^n$ denote the set of real $n$-vectors and $\mathbb{C}^n$ denote the set of complex $n$-vectors. Vectors will almost always be column vectors.

• Let $\mathbb{R}^{m\times n}$ denote the linear vector space of $m$-by-$n$ real matrices and $\mathbb{C}^{m\times n}$ denote the linear vector space of $m$-by-$n$ complex matrices.

• We will use upper case letters such as $A$, $B$, $C$, etc., to denote matrices and use lower case letters such as $x$, $y$, $z$, etc., to denote vectors.

• The symbol $a_{ij}$ will denote the $(i,j)$-th entry in a matrix $A$.

• The symbol $A^T$ will denote the transpose of the matrix $A$ and $A^*$ will denote the conjugate transpose of the matrix $A$.

• Let $a_1, \dots, a_m \in \mathbb{R}^n$ (or $\mathbb{C}^n$). We will use $\mathrm{span}\{a_1, \dots, a_m\}$ to denote the linear vector space of all the linear combinations of $a_1, \dots, a_m$.

• Let $\mathrm{rank}(A)$ denote the rank of the matrix $A$.

• Let $\dim(S)$ denote the dimension of the vector space $S$.

• We will use $\det(A)$ to denote the determinant of the matrix $A$ and use $\mathrm{diag}(a_{11}, \dots, a_{nn})$ to denote the $n$-by-$n$ diagonal matrix:
\[
\mathrm{diag}(a_{11}, \dots, a_{nn}) = \begin{bmatrix}
a_{11} & 0 & \cdots & 0 \\
0 & a_{22} & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & a_{nn}
\end{bmatrix}.
\]

• For a matrix $A = [a_{ij}]$, the symbol $|A|$ will denote the matrix with entries $(|A|)_{ij} = |a_{ij}|$.

• The symbol $I$ will denote the identity matrix, i.e.,
\[
I = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & 1
\end{bmatrix},
\]
and $e_i$ will denote the $i$-th unit vector, i.e., the $i$-th column vector of $I$.

• We will use $\|\cdot\|$ to denote a norm of a matrix or a vector. The symbols $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$ will denote the $p$-norm with $p = 1, 2, \infty$, respectively.

• As in MATLAB, in algorithms, $A(i,j)$ will denote the $(i,j)$-th entry of matrix $A$; $A(i,:)$ and $A(:,j)$ will denote the $i$-th row and the $j$-th column of $A$, respectively; $A(i_1:i_2, k)$ will express the column vector constructed by using entries from the $i_1$-th entry to the $i_2$-th entry in the $k$-th column of $A$; $A(k, j_1:j_2)$ will express the row vector constructed by using entries from the $j_1$-th entry to the $j_2$-th entry in the $k$-th row of $A$; $A(k:l, p:q)$ will denote the $(l-k+1)$-by-$(q-p+1)$ submatrix constructed by using the rows from the $k$-th row to the $l$-th row and the columns from the $p$-th column to the $q$-th column.

1.2 Basic problems in NLA


NLA includes the following three main problems, which will be studied in this book:

(1) Find the solution of linear systems
\[
Ax = b
\]
where $A$ is an $n$-by-$n$ nonsingular matrix and $b$ is an $n$-vector.

(2) Linear least squares problems: for any $m$-by-$n$ matrix $A$ and an $m$-vector $b$, find an $n$-vector $x$ such that
\[
\|Ax - b\|_2 = \min_{y \in \mathbb{R}^n} \|Ay - b\|_2 .
\]

(3) Eigenvalue problems: for any $n$-by-$n$ matrix $A$, find a part (or all) of its eigenvalues and corresponding eigenvectors. We remark here that a complex number $\lambda$ is called an eigenvalue of $A$ if there exists a nonzero vector $x \in \mathbb{C}^n$ such that
\[
Ax = \lambda x,
\]
where $x$ is called an eigenvector of $A$ associated with $\lambda$.

Besides these main problems, there are many other fundamental problems in NLA, for instance, total least squares problems, matrix equations, generalized inverses, inverse problems of eigenvalues, and singular value problems, etc.

1.3 Why shall we study numerical methods?


To answer this question, let us consider the following linear system
\[
Ax = b
\]
where $A$ is an $n$-by-$n$ nonsingular matrix and $x = (x_1, x_2, \dots, x_n)^T$. If we use the well-known Cramer's rule, then we have the following solution:
\[
x_1 = \frac{\det(A_1)}{\det(A)}, \quad x_2 = \frac{\det(A_2)}{\det(A)}, \quad \dots, \quad x_n = \frac{\det(A_n)}{\det(A)},
\]
where $A_i$, for $i = 1, 2, \dots, n$, is the matrix $A$ with its $i$-th column replaced by the vector $b$. Then we should compute $n+1$ determinants $\det(A_i)$, $i = 1, 2, \dots, n$, and $\det(A)$. There are
\[
[n!(n-1)](n+1) = (n-1)(n+1)!
\]
multiplications. When $n = 25$, by using a computer with 10 billion operations per second, we need
\[
\frac{24 \cdot 26!}{10^{10} \cdot 3600 \cdot 24 \cdot 365} \approx 30.6 \ \text{billion years}.
\]
If one uses Gaussian elimination, it requires
\[
\sum_{i=1}^{n} (i-1)(i+1) = \sum_{i=1}^{n} i^2 - n = \frac{1}{6}n(n+1)(2n+1) - n = O(n^3)
\]
multiplications. Then, in less than one second, we could solve a 25-by-25 linear system by using the same computer. From the above discussion, we note that when solving the same problem by different numerical methods, the results can be very different. Therefore, it is essential for us to study the properties of numerical methods.
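These estimates are easy to reproduce. The following short Python sketch (our own illustration, not from the book) evaluates both operation counts for $n = 25$ and the assumed speed of $10^{10}$ operations per second:

    import math

    n = 25
    ops_per_sec = 1e10                      # "10 billion operations per second"

    # Cramer's rule: (n - 1)(n + 1)! multiplications
    cramer_ops = (n - 1) * math.factorial(n + 1)
    years = cramer_ops / ops_per_sec / (3600 * 24 * 365)
    print(f"Cramer's rule: {cramer_ops:.3e} multiplications, about {years:.1e} years")

    # Gaussian elimination: n(n+1)(2n+1)/6 - n multiplications
    ge_ops = n * (n + 1) * (2 * n + 1) // 6 - n
    print(f"Gaussian elimination: {ge_ops} multiplications, "
          f"about {ge_ops / ops_per_sec:.1e} seconds")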

1.4 Matrix factorizations (decompositions)


For any linear system $Ax = b$, if we can factorize (decompose) $A$ as $A = LU$, where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix, then we have
\[
\begin{cases}
Ly = b, \\
Ux = y.
\end{cases}
\qquad (1.1)
\]
By substitution, we can easily solve (1.1) and then $Ax = b$. Therefore, matrix factorizations (decompositions) are very important tools in NLA. The following theorem is basic and useful in linear algebra, see [17].

Theorem 1.1 (Jordan Decomposition Theorem) If $A \in \mathbb{C}^{n\times n}$, then there exists a nonsingular matrix $X \in \mathbb{C}^{n\times n}$ such that
\[
X^{-1} A X = J \equiv \mathrm{diag}(J_1, J_2, \dots, J_p),
\]
or $A = X J X^{-1}$, where $J$ is called the Jordan canonical form of $A$ and
\[
J_i = \begin{bmatrix}
\lambda_i & 1 & 0 & \cdots & 0 \\
0 & \lambda_i & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & 1 \\
0 & \cdots & \cdots & 0 & \lambda_i
\end{bmatrix} \in \mathbb{C}^{n_i \times n_i},
\]
for $i = 1, 2, \dots, p$, are called Jordan blocks with $n_1 + \cdots + n_p = n$. The Jordan canonical form of $A$ is unique up to the permutation of the diagonal Jordan blocks. If $A \in \mathbb{R}^{n\times n}$ with only real eigenvalues, then the matrix $X$ can be taken to be real.

1.5 Perturbation and error analysis


The solutions provided by numerical algorithms are seldom absolutely correct. Usu-
ally, there are two kinds of errors. First, errors appear in input data caused by prior
computations or measurements. Second, there may be errors caused by algorithms
themselves because of approximations made within algorithms. Thus, we need to carry
out a perturbation and error analysis.

(1) Perturbation.
For a given $x$, we want to compute the value of a function $f(x)$. Suppose there is a perturbation $\delta x$ of $x$ and $|\delta x|/|x|$ is very small. We want to find a positive number $c(x)$, as small as possible, such that
\[
\frac{|f(x + \delta x) - f(x)|}{|f(x)|} \le c(x)\,\frac{|\delta x|}{|x|} .
\]
Then $c(x)$ is called the condition number of $f(x)$ at $x$. If $c(x)$ is large, we say that the function $f$ is ill-conditioned at $x$; if $c(x)$ is small, we say that the function $f$ is well-conditioned at $x$.
Remark: Whether a computational problem is ill-conditioned or not has no relation with the numerical method that we use.
(2) Error.
By using some numerical method, we calculate the value of a function $f$ at a point $x$ and we obtain $y$. Because of rounding errors (or chopping errors), usually
\[
y \ne f(x).
\]
If there exists $\delta x$ such that
\[
y = f(x + \delta x), \qquad |\delta x| \le \varepsilon |x|,
\]
where $\varepsilon$ is a positive constant closely related to the numerical method and the computer used, then we say that the method is stable if $\varepsilon$ is small, and the method is unstable if $\varepsilon$ is large.
Remark: Whether a numerical method is stable or not has no relation with the computational problem that we face.

With the perturbation and error analysis, we obtain
\[
\frac{|y - f(x)|}{|f(x)|} = \frac{|f(x + \delta x) - f(x)|}{|f(x)|} \le c(x)\,\frac{|\delta x|}{|x|} \le c(x)\,\varepsilon .
\]
Therefore, whether a numerical result is accurate depends on both the stability of the numerical method and the condition number of the computational problem.

1.6 Operation cost and convergence rate


Usually, numerical algorithms are divided into two classes:

(i) direct methods;

(ii) iterative methods.

By using direct methods, one can obtain an accurate solution of computational prob-
lems within finite steps in exact arithmetic. By using iterative methods, one can only
obtain an approximate solution of computational problems within finite steps.
The operation cost is an important measurement of algorithms. The operation cost of an algorithm is the total number of operations $+$, $-$, $\times$, $\div$ used in the algorithm. We remark that the speed of an algorithm only partially depends on the operation cost. In modern computers, the speed of operations is much faster than that of data transfer. Therefore, sometimes, the speed of an algorithm depends mainly on the total amount of data transfer.
For direct methods, usually, we use the operation cost as a main measurement of
the speed of algorithms. For iterative methods, we need to consider

(i) operation cost in each iteration;

(ii) convergence rate of the method.

For a sequence $\{x_k\}$ provided by an iterative algorithm, if $\{x_k\} \to x$, the exact solution, and if $\{x_k\}$ satisfies
\[
\|x_k - x\| \le c\,\|x_{k-1} - x\|, \qquad k = 1, 2, \dots,
\]
where $0 < c < 1$ and $\|\cdot\|$ is any vector norm (see Chapter 3 for details), then we say that the convergence rate is linear. If it satisfies
\[
\|x_k - x\| \le c\,\|x_{k-1} - x\|^p, \qquad k = 1, 2, \dots,
\]
where $0 < c < 1$ and $p > 1$, then we say that the convergence rate is superlinear.
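The difference between the two rates is easy to observe numerically. The following small Python sketch (our own toy examples, not from the book) contrasts a linearly convergent fixed-point iteration with Newton's method, whose quadratic convergence is a special case of superlinear convergence:

    import math

    # Linear convergence: fixed-point iteration x_{k+1} = cos(x_k)
    x_star = 0.7390851332151607          # fixed point of cos
    x = 1.0
    print("fixed-point iteration (errors shrink by a roughly constant factor):")
    for k in range(5):
        x = math.cos(x)
        print(k + 1, abs(x - x_star))

    # Superlinear (quadratic) convergence: Newton's method for x^2 = 2
    y = 1.0
    print("Newton's method (correct digits roughly double per step):")
    for k in range(5):
        y = 0.5 * (y + 2.0 / y)
        print(k + 1, abs(y - math.sqrt(2.0)))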

Exercises:

1. A matrix is strictly upper triangular if it is upper triangular with zero diagonal elements. Show that if $A$ is a strictly upper triangular matrix of order $n$, then $A^n = 0$.

2. Let $A \in \mathbb{C}^{n\times m}$ and $B \in \mathbb{C}^{m\times l}$. Prove that
\[
\mathrm{rank}(AB) \ge \mathrm{rank}(A) + \mathrm{rank}(B) - m .
\]

3. Let
\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
\]
where $A_{ij}$, for $i, j = 1, 2$, are square matrices with $\det(A_{11}) \ne 0$, and satisfy
\[
A_{11} A_{21} = A_{21} A_{11} .
\]
Then
\[
\det(A) = \det(A_{11} A_{22} - A_{21} A_{12}) .
\]

4. Show that $\det(I - uv^*) = 1 - v^* u$, where $u, v \in \mathbb{C}^m$ are column vectors.

5. Prove Hadamard's inequality for $A \in \mathbb{C}^{n\times n}$:
\[
|\det(A)| \le \prod_{j=1}^{n} \|a_j\|_2 ,
\]
where $a_j = A(:,j)$ and $\|a_j\|_2 = \Big( \sum_{i=1}^{n} |A(i,j)|^2 \Big)^{1/2}$. When does the equality hold?

6. Let $B$ be nilpotent, i.e., there exists an integer $k > 0$ such that $B^k = 0$. Show that if $AB = BA$, then
\[
\det(A + B) = \det(A) .
\]

7. Let $A$ be an $m$-by-$n$ matrix and $B$ be an $n$-by-$m$ matrix. Show that the matrices
\[
\begin{bmatrix} AB & 0 \\ B & 0 \end{bmatrix} \qquad \text{and} \qquad \begin{bmatrix} 0 & 0 \\ B & BA \end{bmatrix}
\]
are similar. Conclude that the nonzero eigenvalues of $AB$ are the same as those of $BA$, and
\[
\det(I_m + AB) = \det(I_n + BA) .
\]

8. A matrix $M \in \mathbb{C}^{n\times n}$ is Hermitian positive definite if it satisfies
\[
M = M^*, \qquad x^* M x > 0,
\]
for all $x \ne 0$, $x \in \mathbb{C}^n$. Let $A$ and $B$ be Hermitian positive definite matrices.

(1) Show that the matrix product $AB$ has positive eigenvalues.

(2) Show that $AB$ is Hermitian if and only if $A$ and $B$ commute.

9. Show that any matrix $A \in \mathbb{C}^{n\times n}$ can be written uniquely in the form
\[
A = B + \mathrm{i}C,
\]
where $B$ and $C$ are Hermitian.

10. Show that if $A$ is skew-Hermitian, i.e., $A^* = -A$, then all its eigenvalues lie on the imaginary axis.

11. Let
\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}.
\]
Assume that $A_{11}$, $A_{22}$ are square, and $A_{11}$, $A_{22} - A_{21} A_{11}^{-1} A_{12}$ are nonsingular. Let
\[
B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
\]
be the inverse of $A$. Show that
\[
B_{22} = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1}, \qquad B_{12} = -A_{11}^{-1} A_{12} B_{22},
\]
\[
B_{21} = -B_{22} A_{21} A_{11}^{-1}, \qquad B_{11} = A_{11}^{-1} - B_{12} A_{21} A_{11}^{-1} .
\]

12. Suppose that $A$ and $B$ are Hermitian with $A$ being positive definite. Show that $A + B$ is positive definite if and only if all the eigenvalues of $A^{-1}B$ are greater than $-1$.

13. Let $A$ be idempotent, i.e., $A^2 = A$. Show that each eigenvalue of $A$ is either 0 or 1.

14. Let $A$ be a matrix with all entries equal to one. Show that $A$ can be written as $A = ee^T$, where $e^T = (1, 1, \dots, 1)$, and that $A$ is positive semi-definite. Find the eigenvalues and eigenvectors of $A$.

15. Prove that any matrix $A \in \mathbb{C}^{n\times n}$ has a polar decomposition $A = HQ$, where $H$ is Hermitian positive semi-definite and $Q$ is unitary. We recall that $M \in \mathbb{C}^{n\times n}$ is a unitary matrix if $M^{-1} = M^*$. Moreover, if $A$ is nonsingular, then $H$ is Hermitian positive definite and the polar decomposition of $A$ is unique.
Chapter 2

Direct Methods for Linear Systems

The problem of solving linear systems is central in NLA. For solving linear systems, in
general, we have two classes of methods. One is called the direct method and the other
is called the iterative method. By using direct methods, within finite steps, one can
obtain an accurate solution of computational problems in exact arithmetic. By using
iterative methods, within finite steps, one can only obtain an approximate solution of
computational problems.
In this chapter, we will introduce a basic direct method called Gaussian elimination for solving general linear systems. Usually, Gaussian elimination is used for solving a dense linear system of medium size with no special structure.

2.1 Triangular linear systems and LU factorization

We first study triangular linear systems.

2.1.1 Triangular linear systems

We consider the following nonsingular lower triangular linear system
\[
Ly = b \qquad (2.1)
\]
where $b = (b_1, b_2, \dots, b_n)^T \in \mathbb{R}^n$ is a known vector, $y = (y_1, y_2, \dots, y_n)^T$ is an unknown vector, and $L = [l_{ij}] \in \mathbb{R}^{n\times n}$ is given by
\[
L = \begin{bmatrix}
l_{11} & 0 & \cdots & \cdots & 0 \\
l_{21} & l_{22} & 0 & & \vdots \\
l_{31} & l_{32} & l_{33} & \ddots & \vdots \\
\vdots & \vdots & \vdots & \ddots & 0 \\
l_{n1} & l_{n2} & l_{n3} & \cdots & l_{nn}
\end{bmatrix}
\]
with $l_{ii} \ne 0$, $i = 1, 2, \dots, n$. By the first equation in (2.1), we have
\[
l_{11} y_1 = b_1,
\]
and then
\[
y_1 = \frac{b_1}{l_{11}} .
\]
Similarly, by the second equation in (2.1), we have
\[
y_2 = \frac{1}{l_{22}} (b_2 - l_{21} y_1) .
\]
In general, if we have already obtained $y_1, y_2, \dots, y_{i-1}$, then by using the $i$-th equation in (2.1), we have
\[
y_i = \frac{1}{l_{ii}} \Big( b_i - \sum_{j=1}^{i-1} l_{ij} y_j \Big) .
\]
This algorithm is called the forward substitution method, which needs $O(n^2)$ operations.
Now, we consider the following nonsingular upper triangular linear system
\[
Ux = y \qquad (2.2)
\]
where $x = (x_1, x_2, \dots, x_n)^T$ is an unknown vector, and $U \in \mathbb{R}^{n\times n}$ is given by
\[
U = \begin{bmatrix}
u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\
0 & u_{22} & u_{23} & & \vdots \\
0 & 0 & u_{33} & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & \cdots & \cdots & 0 & u_{nn}
\end{bmatrix}
\]
with $u_{ii} \ne 0$, $i = 1, 2, \dots, n$. Beginning from the last equation of (2.2), we can obtain $x_n, x_{n-1}, \dots, x_1$ step by step. We have $x_n = y_n / u_{nn}$, and $x_i$ is given by
\[
x_i = \frac{1}{u_{ii}} \Big( y_i - \sum_{j=i+1}^{n} u_{ij} x_j \Big)
\]
for $i = n-1, \dots, 1$. This algorithm is called the backward substitution method, which also needs $O(n^2)$ operations.
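As an illustration, here is a minimal Python sketch of forward and backward substitution (using NumPy; the function and variable names are our own, not from the book):

    import numpy as np

    def forward_substitution(L, b):
        """Solve L y = b for a nonsingular lower triangular L (O(n^2) operations)."""
        n = L.shape[0]
        y = np.zeros(n)
        for i in range(n):
            # y_i = (b_i - sum_{j<i} l_ij * y_j) / l_ii
            y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
        return y

    def backward_substitution(U, y):
        """Solve U x = y for a nonsingular upper triangular U (O(n^2) operations)."""
        n = U.shape[0]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            # x_i = (y_i - sum_{j>i} u_ij * x_j) / u_ii
            x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x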
For a general linear system
\[
Ax = b \qquad (2.3)
\]
where $A \in \mathbb{R}^{n\times n}$ and $b \in \mathbb{R}^n$ are known, if we can factorize the matrix $A$ into
\[
A = LU
\]
where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix, then we can find the solution of (2.3) by the following two steps:

(1) Use the forward substitution method to find the solution $y$ of $Ly = b$.

(2) Use the backward substitution method to find the solution $x$ of $Ux = y$.

Now the problem that we are facing is how to factorize the matrix $A$ into $A = LU$. We therefore introduce Gaussian transform matrices.

2.1.2 Gaussian transform matrix


Let
\[
L_k = I - l_k e_k^T
\]
where $I \in \mathbb{R}^{n\times n}$ is the identity matrix, $l_k = (0, \dots, 0, l_{k+1,k}, \dots, l_{nk})^T \in \mathbb{R}^n$ and $e_k \in \mathbb{R}^n$ is the $k$-th unit vector. Then for any $k$,
\[
L_k = \begin{bmatrix}
1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & \cdots & -l_{k+1,k} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & -l_{nk} & 0 & \cdots & 1
\end{bmatrix}
\]
is called a Gaussian transform matrix. Such a matrix is a unit lower triangular matrix. We remark that a unit triangular matrix is a triangular matrix with ones on its diagonal. For any given vector
\[
x = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n,
\]
we have
\[
L_k x = (x_1, \dots, x_k, \ x_{k+1} - x_k l_{k+1,k}, \dots, x_n - x_k l_{nk})^T = (x_1, \dots, x_k, 0, \dots, 0)^T
\]
if we take
\[
l_{ik} = \frac{x_i}{x_k}, \qquad i = k+1, \dots, n,
\]
with $x_k \ne 0$. It is easy to check that
\[
L_k^{-1} = I + l_k e_k^T
\]
by noting that $e_k^T l_k = 0$.

For a given matrix $A \in \mathbb{R}^{n\times n}$, we have
\[
L_k A = (I - l_k e_k^T) A = A - l_k (e_k^T A)
\]
and
\[
\mathrm{rank}(l_k (e_k^T A)) = 1 .
\]
Therefore, $L_k A$ is a rank-one modification of the matrix $A$.

2.1.3 Computation of LU factorization


We consider the following simple example. Let
\[
A = \begin{bmatrix} 1 & 5 & 9 \\ 2 & 4 & 7 \\ 3 & 3 & 10 \end{bmatrix}.
\]
By using the Gaussian transform matrix
\[
L_1 = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -3 & 0 & 1 \end{bmatrix},
\]
we have
\[
L_1 A = \begin{bmatrix} 1 & 5 & 9 \\ 0 & -6 & -11 \\ 0 & -12 & -17 \end{bmatrix}.
\]
Followed by using the Gaussian transform matrix
\[
L_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix},
\]
we have
\[
L_2 (L_1 A) \equiv U = \begin{bmatrix} 1 & 5 & 9 \\ 0 & -6 & -11 \\ 0 & 0 & 5 \end{bmatrix}.
\]
Therefore, we finally have
\[
A = LU
\]
where
\[
L \equiv (L_2 L_1)^{-1} = L_1^{-1} L_2^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \end{bmatrix}.
\]
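For readers who wish to check this small example numerically, here is a minimal Python/NumPy sketch (the variable names are ours):

    import numpy as np

    A = np.array([[1., 5., 9.],
                  [2., 4., 7.],
                  [3., 3., 10.]])

    # Gaussian transform matrices L1 = I - l1 e1^T and L2 = I - l2 e2^T
    L1 = np.array([[1., 0., 0.], [-2., 1., 0.], [-3., 0., 1.]])
    L2 = np.array([[1., 0., 0.], [0., 1., 0.], [0., -2., 1.]])

    U = L2 @ (L1 @ A)                          # upper triangular factor
    L = np.linalg.inv(L1) @ np.linalg.inv(L2)  # L = L1^{-1} L2^{-1}

    print(U)                          # [[1, 5, 9], [0, -6, -11], [0, 0, 5]]
    print(L)                          # [[1, 0, 0], [2, 1, 0], [3, 2, 1]]
    print(np.allclose(L @ U, A))      # True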
For a general $n$-by-$n$ matrix $A$, we can use $n-1$ Gaussian transform matrices $L_1, L_2, \dots, L_{n-1}$ such that $L_{n-1} \cdots L_1 A$ is an upper triangular matrix. In fact, let $A^{(0)} \equiv A$ and assume that we have already found $k-1$ Gaussian transform matrices $L_1, \dots, L_{k-1} \in \mathbb{R}^{n\times n}$ such that
\[
A^{(k-1)} = L_{k-1} \cdots L_1 A = \begin{bmatrix} A_{11}^{(k-1)} & A_{12}^{(k-1)} \\ 0 & A_{22}^{(k-1)} \end{bmatrix}
\]
where $A_{11}^{(k-1)}$ is a $(k-1)$-by-$(k-1)$ upper triangular matrix and
\[
A_{22}^{(k-1)} = \begin{bmatrix}
a_{kk}^{(k-1)} & \cdots & a_{kn}^{(k-1)} \\
\vdots & \ddots & \vdots \\
a_{nk}^{(k-1)} & \cdots & a_{nn}^{(k-1)}
\end{bmatrix}.
\]
If $a_{kk}^{(k-1)} \ne 0$, then we can use the Gaussian transform matrix
\[
L_k = I - l_k e_k^T
\]
where
\[
l_k = (0, \dots, 0, l_{k+1,k}, \dots, l_{nk})^T
\]
with
\[
l_{ik} = \frac{a_{ik}^{(k-1)}}{a_{kk}^{(k-1)}}, \qquad i = k+1, \dots, n,
\]
such that the last $n-k$ entries in the $k$-th column of $L_k A^{(k-1)}$ become zeros. We therefore have
\[
A^{(k)} \equiv L_k A^{(k-1)} = \begin{bmatrix} A_{11}^{(k)} & A_{12}^{(k)} \\ 0 & A_{22}^{(k)} \end{bmatrix}
\]
where $A_{11}^{(k)}$ is a $k$-by-$k$ upper triangular matrix. After $n-1$ steps, we obtain $A^{(n-1)}$, which is the upper triangular matrix that we need. Let
\[
L = (L_{n-1} \cdots L_1)^{-1}, \qquad U = A^{(n-1)},
\]

then $A = LU$. Now we want to show that $L$ is a unit lower triangular matrix. By noting that $e_j^T l_i = 0$ for $j < i$, we have
\[
\begin{aligned}
L &= L_1^{-1} \cdots L_{n-1}^{-1} \\
  &= (I + l_1 e_1^T)(I + l_2 e_2^T) \cdots (I + l_{n-1} e_{n-1}^T) \\
  &= I + l_1 e_1^T + \cdots + l_{n-1} e_{n-1}^T \\
  &= I + [l_1, l_2, \dots, l_{n-1}, 0] \\
  &= \begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
l_{21} & 1 & 0 & \cdots & 0 \\
l_{31} & l_{32} & 1 & \ddots & \vdots \\
\vdots & \vdots & \vdots & \ddots & 0 \\
l_{n1} & l_{n2} & l_{n3} & \cdots & 1
\end{bmatrix}.
\end{aligned}
\]
This computational process of the LU factorization is called Gaussian elimination. Thus, we have the following algorithm.

Algorithm 2.1 (Gaussian elimination)

    for k = 1 : n − 1
        A(k+1 : n, k) = A(k+1 : n, k)/A(k, k)
        A(k+1 : n, k+1 : n) = A(k+1 : n, k+1 : n) − A(k+1 : n, k)A(k, k+1 : n)
    end
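A direct Python transcription of Algorithm 2.1 might look as follows (a sketch: it overwrites A with the factors and assumes all pivots are nonzero; the function name is ours):

    import numpy as np

    def lu_in_place(A):
        """Gaussian elimination without pivoting (Algorithm 2.1).

        Overwrites A so that its strictly lower part holds the multipliers of L
        and its upper part holds U.  Assumes every pivot A[k, k] is nonzero.
        """
        n = A.shape[0]
        for k in range(n - 1):
            A[k+1:, k] = A[k+1:, k] / A[k, k]                  # multipliers l_{ik}
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # rank-one update
        return A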

The operation cost of Gaussian elimination is
\[
\sum_{k=1}^{n-1} \big[ (n-k) + 2(n-k)^2 \big] = \frac{n(n-1)}{2} + \frac{n(n-1)(2n-1)}{3}
= \frac{2}{3}n^3 + O(n^2) = O(n^3).
\]
We remark that in Gaussian elimination, $a_{kk}^{(k-1)}$, $k = 1, \dots, n-1$, are required to be nonzero. We have the following theorem.

Theorem 2.1 The entries $a_{ii}^{(i-1)} \ne 0$, $i = 1, \dots, k$, if and only if all the leading principal submatrices $A_i$ of $A$, $i = 1, \dots, k$, are nonsingular.

Proof: By induction, for $k = 1$, it is obviously true. Assume that the statement is true up to $k-1$. We want to show that if $A_1, \dots, A_{k-1}$ are nonsingular, then
\[
A_k \ \text{is nonsingular} \iff a_{kk}^{(k-1)} \ne 0 .
\]
By assumption, we know that
\[
a_{ii}^{(i-1)} \ne 0, \qquad i = 1, \dots, k-1 .
\]
By using $k-1$ Gaussian transform matrices $L_1, \dots, L_{k-1}$, we obtain
\[
A^{(k-1)} = L_{k-1} \cdots L_1 A = \begin{bmatrix} A_{11}^{(k-1)} & A_{12}^{(k-1)} \\ 0 & A_{22}^{(k-1)} \end{bmatrix} \qquad (2.4)
\]
where $A_{11}^{(k-1)}$ is an upper triangular matrix with nonzero diagonal entries $a_{ii}^{(i-1)}$, $i = 1, \dots, k-1$. Therefore, the $k$-th leading principal submatrix of $A^{(k-1)}$ has the following form
\[
\begin{bmatrix} A_{11}^{(k-1)} & * \\ 0 & a_{kk}^{(k-1)} \end{bmatrix}.
\]
Let $(L_1)_k, \dots, (L_{k-1})_k$ denote the $k$-th leading principal submatrices of $L_1, \dots, L_{k-1}$, respectively. By using (2.4), we obtain
\[
(L_{k-1})_k \cdots (L_1)_k A_k = \begin{bmatrix} A_{11}^{(k-1)} & * \\ 0 & a_{kk}^{(k-1)} \end{bmatrix}.
\]
By noting that $L_i$, $i = 1, \dots, k-1$, are unit lower triangular matrices, we immediately know that
\[
\det(A_k) = a_{kk}^{(k-1)} \det(A_{11}^{(k-1)}) \ne 0
\]
if and only if $a_{kk}^{(k-1)} \ne 0$.

Thus, we have

Theorem 2.2 If all the leading principal submatrices $A_i$ of a matrix $A \in \mathbb{R}^{n\times n}$ are nonsingular for $i = 1, \dots, n-1$, then there exists a unique LU factorization of $A$.

2.2 LU factorization with pivoting


Before we study pivoting techniques, we first consider the following simple example:
\[
\begin{bmatrix} 0.3\times 10^{-11} & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} 0.7 \\ 0.9 \end{bmatrix}.
\]
If we use Gaussian elimination with 10-decimal-digit floating point arithmetic, we have
\[
\widehat{L} = \begin{bmatrix} 1 & 0 \\ 0.3333333333\times 10^{12} & 1 \end{bmatrix}
\qquad \text{and} \qquad
\widehat{U} = \begin{bmatrix} 0.3\times 10^{-11} & 1 \\ 0 & -0.3333333333\times 10^{12} \end{bmatrix}.
\]
Then the computed solution is
\[
\widehat{x} = (0.0000000000, \ 0.7000000000)^T,
\]
which is not good compared with the accurate solution
\[
x = (0.2000000000006\ldots, \ 0.6999999999994\ldots)^T .
\]
If we just interchange the first equation and the second equation, we have
\[
\begin{bmatrix} 1 & 1 \\ 0.3\times 10^{-11} & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} 0.9 \\ 0.7 \end{bmatrix}.
\]
By using Gaussian elimination with 10-decimal-digit floating point arithmetic again, we have
\[
\widehat{L} = \begin{bmatrix} 1 & 0 \\ 0.3\times 10^{-11} & 1 \end{bmatrix},
\qquad
\widehat{U} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.
\]
Then the computed solution is
\[
\widehat{x} = (0.2000000000, \ 0.7000000000)^T,
\]
which is very good. So we need to introduce permutations into Gaussian elimination. We first define a permutation matrix.

Definition 2.1 A permutation matrix P is an identity matrix with permuted rows.

The important properties of permutation matrices are included in the following lemma. Its proof is straightforward.

Lemma 2.1 Let $P, P_1, P_2 \in \mathbb{R}^{n\times n}$ be permutation matrices and $X \in \mathbb{R}^{n\times n}$. Then

(i) $PX$ is the same as $X$ with its rows permuted, and $XP$ is the same as $X$ with its columns permuted.

(ii) $P^{-1} = P^T$.

(iii) $\det(P) = \pm 1$.

(iv) $P_1 P_2$ is also a permutation matrix.

Now we introduce the main theorem of this section.

Theorem 2.3 If $A$ is nonsingular, then there exist permutation matrices $P_1$ and $P_2$, a unit lower triangular matrix $L$, and a nonsingular upper triangular matrix $U$ such that
\[
P_1 A P_2 = LU .
\]
Only one of $P_1$ and $P_2$ is necessary.

Proof: We use induction on the dimension $n$. For $n = 1$, it is obviously true. Assume that the statement is true for $n-1$. If $A$ is nonsingular, then it has a nonzero entry. Choose permutation matrices $P_1'$ and $P_2'$ such that the $(1,1)$-th position of $P_1' A P_2'$ is nonzero. Now we write a desired factorization and solve for the unknown components:
\[
P_1' A P_2' = \begin{bmatrix} a_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde{A}_{22} \end{bmatrix}
= \begin{bmatrix} u_{11} & U_{12} \\ L_{21} u_{11} & L_{21} U_{12} + \widetilde{A}_{22} \end{bmatrix},
\qquad (2.5)
\]
where $A_{22}$, $\widetilde{A}_{22}$ are $(n-1)$-by-$(n-1)$ matrices, and $L_{21}$, $U_{12}^T$ are $(n-1)$-by-1 matrices. Solving for the components of this 2-by-2 block factorization, we get
\[
u_{11} = a_{11} \ne 0, \qquad U_{12} = A_{12},
\]
and
\[
L_{21} u_{11} = A_{21}, \qquad A_{22} = L_{21} U_{12} + \widetilde{A}_{22} .
\]
Therefore, we obtain
\[
L_{21} = \frac{A_{21}}{a_{11}}, \qquad \widetilde{A}_{22} = A_{22} - L_{21} U_{12} .
\]
We want to apply induction to $\widetilde{A}_{22}$, but to do so we need to check that
\[
\det(\widetilde{A}_{22}) \ne 0 .
\]
Since
\[
\det(P_1' A P_2') = \pm\det(A) \ne 0
\]
and also
\[
\det(P_1' A P_2') = \det\begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix}
\det\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde{A}_{22} \end{bmatrix}
= u_{11} \det(\widetilde{A}_{22}),
\]
we know that
\[
\det(\widetilde{A}_{22}) \ne 0 .
\]
Therefore, by the induction assumption, there exist permutation matrices $\widetilde{P}_1$ and $\widetilde{P}_2$ such that
\[
\widetilde{P}_1 \widetilde{A}_{22} \widetilde{P}_2 = \widetilde{L}\widetilde{U}, \qquad (2.6)
\]
where $\widetilde{L}$ is a unit lower triangular matrix and $\widetilde{U}$ is a nonsingular upper triangular matrix. Substituting (2.6) into (2.5) yields
\[
\begin{aligned}
P_1' A P_2' &= \begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde{P}_1^T \widetilde{L}\widetilde{U}\widetilde{P}_2^T \end{bmatrix} \\
&= \begin{bmatrix} 1 & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \widetilde{P}_1^T \widetilde{L} \end{bmatrix}
\begin{bmatrix} u_{11} & U_{12} \\ 0 & \widetilde{U}\widetilde{P}_2^T \end{bmatrix} \\
&= \begin{bmatrix} 1 & 0 \\ L_{21} & \widetilde{P}_1^T \widetilde{L} \end{bmatrix}
\begin{bmatrix} u_{11} & U_{12}\widetilde{P}_2 \\ 0 & \widetilde{U} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \widetilde{P}_2^T \end{bmatrix} \\
&= \begin{bmatrix} 1 & 0 \\ 0 & \widetilde{P}_1^T \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \widetilde{P}_1 L_{21} & \widetilde{L} \end{bmatrix}
\begin{bmatrix} u_{11} & U_{12}\widetilde{P}_2 \\ 0 & \widetilde{U} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \widetilde{P}_2^T \end{bmatrix},
\end{aligned}
\]
so we get a desired factorization of $A$:
\[
P_1 A P_2 \equiv \begin{bmatrix} 1 & 0 \\ 0 & \widetilde{P}_1 \end{bmatrix} P_1' \, A \, P_2' \begin{bmatrix} 1 & 0 \\ 0 & \widetilde{P}_2 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ \widetilde{P}_1 L_{21} & \widetilde{L} \end{bmatrix}
\begin{bmatrix} u_{11} & U_{12}\widetilde{P}_2 \\ 0 & \widetilde{U} \end{bmatrix}.
\]

This row-column interchange strategy is called complete pivoting. We therefore have the following algorithm.

Algorithm 2.2 (Gaussian elimination with complete pivoting)

    for k = 1 : n − 1
        choose p, q (k ≤ p, q ≤ n) such that
            |A(p, q)| = max { |A(i, j)| : i = k : n, j = k : n }
        A(k, 1 : n) ↔ A(p, 1 : n)
        A(1 : n, k) ↔ A(1 : n, q)
        if A(k, k) ≠ 0
            A(k+1 : n, k) = A(k+1 : n, k)/A(k, k)
            A(k+1 : n, k+1 : n) = A(k+1 : n, k+1 : n) − A(k+1 : n, k)A(k, k+1 : n)
        else
            stop
        end
    end

We remark that although the LU factorization with complete pivoting can overcome some shortcomings of the LU factorization without pivoting, the cost of complete pivoting is very high. Usually, it requires $O(n^3)$ comparisons of matrix entries for pivoting.

In order to reduce the cost of pivoting, the LU factorization with partial pivoting is proposed. In partial pivoting, at the $k$-th step, we choose $a_{pk}^{(k-1)}$ from the submatrix $A_{22}^{(k-1)}$ which satisfies
\[
|a_{pk}^{(k-1)}| = \max\big\{ |a_{ik}^{(k-1)}| : k \le i \le n \big\}.
\]
When $A$ is nonsingular, the LU factorization with partial pivoting can be carried out until we finally obtain
\[
PA = LU .
\]
In this algorithm, the cost of comparisons of matrix entries for pivoting is $O(n^2)$. We have

Algorithm 2.3 (Gaussian elimination with partial pivoting)

    for k = 1 : n − 1
        choose p (k ≤ p ≤ n) such that
            |A(p, k)| = max { |A(i, k)| : i = k : n }
        A(k, 1 : n) ↔ A(p, 1 : n)
        if A(k, k) ≠ 0
            A(k+1 : n, k) = A(k+1 : n, k)/A(k, k)
            A(k+1 : n, k+1 : n) = A(k+1 : n, k+1 : n) − A(k+1 : n, k)A(k, k+1 : n)
        else
            stop
        end
    end
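In practice one rarely codes this by hand. A minimal sketch using SciPy's LU routines, which implement partial pivoting (assuming SciPy is available; the matrix is the small example from the beginning of this section), could look like this:

    import numpy as np
    from scipy.linalg import lu, lu_factor, lu_solve

    A = np.array([[0.3e-11, 1.0],
                  [1.0,     1.0]])
    b = np.array([0.7, 0.9])

    # P, L, U such that A = P @ L @ U (P is the row permutation)
    P, L, U = lu(A)
    print(np.allclose(P @ L @ U, A))   # True

    # Solving Ax = b via the pivoted LU factorization
    lu_piv = lu_factor(A)
    x = lu_solve(lu_piv, b)
    print(x)                           # approximately [0.2, 0.7]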

2.3 Cholesky factorization


Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite, i.e., it satisfies
\[
A = A^T, \qquad x^T A x > 0,
\]
for all $x \ne 0$, $x \in \mathbb{R}^n$. We have

Theorem 2.4 Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite. Then there exists a lower triangular matrix $L \in \mathbb{R}^{n\times n}$ with positive diagonal entries such that
\[
A = LL^T .
\]

This factorization is called the Cholesky factorization.

Proof: Since $A$ is positive definite, all the leading principal submatrices of $A$ are positive definite. By Theorem 2.2, there exist a unit lower triangular matrix $\widetilde{L}$ and an upper triangular matrix $U$ such that
\[
A = \widetilde{L}U .
\]
Let
\[
D = \mathrm{diag}(u_{11}, \dots, u_{nn}), \qquad \widetilde{U} = D^{-1} U,
\]
where $u_{ii} > 0$, for $i = 1, \dots, n$. Then we have
\[
\widetilde{U}^T D \widetilde{L}^T = A^T = A = \widetilde{L} D \widetilde{U} .
\]
Therefore,
\[
\widetilde{L}^T \widetilde{U}^{-1} = D^{-1} \widetilde{U}^{-T} \widetilde{L} D .
\]
We note that $\widetilde{L}^T \widetilde{U}^{-1}$ is a unit upper triangular matrix and $D^{-1} \widetilde{U}^{-T} \widetilde{L} D$ is a lower triangular matrix. Hence
\[
\widetilde{L}^T \widetilde{U}^{-1} = I = D^{-1} \widetilde{U}^{-T} \widetilde{L} D,
\]
which implies $\widetilde{U} = \widetilde{L}^T$. Thus
\[
A = \widetilde{L} D \widetilde{L}^T .
\]
Let
\[
L = \widetilde{L}\,\mathrm{diag}(\sqrt{u_{11}}, \dots, \sqrt{u_{nn}}) .
\]
We finally have
\[
A = LL^T .
\]

Thus, when a matrix $A$ is symmetric positive definite, we can find the solution of the system $Ax = b$ by the following three steps:

(1) Find the Cholesky factorization of $A$: $A = LL^T$.

(2) Find the solution $y$ of $Ly = b$.

(3) Find the solution $x$ of $L^T x = y$.

From Theorem 2.4, we know that we do not need pivoting in the Cholesky factorization. Also, we can calculate $L$ directly by comparing the corresponding entries on the two sides of $A = LL^T$. We have the following algorithm.

Algorithm 2.4 (Cholesky factorization)

    for k = 1 : n
        A(k, k) = sqrt(A(k, k))
        A(k+1 : n, k) = A(k+1 : n, k)/A(k, k)
        for j = k + 1 : n
            A(j : n, j) = A(j : n, j) − A(j : n, k)A(j, k)
        end
    end

The operation cost of the Cholesky factorization is $n^3/3$.
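A minimal NumPy sketch of the three solution steps above (the test matrix and names are ours; np.linalg.cholesky returns the lower triangular factor):

    import numpy as np

    # A symmetric positive definite test matrix and a right-hand side
    A = np.array([[4., 2., 2.],
                  [2., 5., 3.],
                  [2., 3., 6.]])
    b = np.array([1., 2., 3.])

    L = np.linalg.cholesky(A)          # step (1): A = L L^T, L lower triangular
    y = np.linalg.solve(L, b)          # step (2): forward solve  L y = b
    x = np.linalg.solve(L.T, y)        # step (3): backward solve L^T x = y

    print(np.allclose(A @ x, b))       # True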



Exercises:

1. Let $S, T \in \mathbb{R}^{n\times n}$ be upper triangular matrices such that
\[
(ST - \lambda I)x = b
\]
is a nonsingular system. Find an algorithm of $O(n^2)$ operations for computing $x$.

2. Show that the $LDL^T$ factorization of a symmetric positive definite matrix $A$ is unique.

3. Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite. Find an algorithm for computing an upper triangular matrix $U \in \mathbb{R}^{n\times n}$ such that $A = UU^T$.

4. Let $A = [a_{ij}] \in \mathbb{R}^{n\times n}$ be a strictly diagonally dominant matrix, i.e.,
\[
|a_{kk}| > \sum_{\substack{j=1 \\ j \ne k}}^{n} |a_{kj}|, \qquad k = 1, 2, \dots, n .
\]
Prove that a strictly diagonally dominant matrix is nonsingular, and that a strictly diagonally dominant symmetric matrix with positive diagonal entries is positive definite.

5. Let
\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\]
with $A_{11}$ being a $k$-by-$k$ nonsingular matrix. Then
\[
S = A_{22} - A_{21} A_{11}^{-1} A_{12}
\]
is called the Schur complement of $A_{11}$ in $A$. Show that after $k$ steps of Gaussian elimination without pivoting, $A_{22}^{(k)} = S$.

6. Let $A$ be a symmetric positive definite matrix. At the end of the first step of Gaussian elimination, we have
\[
\begin{bmatrix} a_{11} & a_1^T \\ 0 & A_{22} \end{bmatrix}.
\]
Prove that $A_{22}$ is also symmetric positive definite.

7. Let $A = [a_{ij}] \in \mathbb{R}^{n\times n}$ be a strictly diagonally dominant matrix. After one step of Gaussian elimination, we have
\[
\begin{bmatrix} a_{11} & a_1^T \\ 0 & A_{22} \end{bmatrix}.
\]
Show that $A_{22}$ is also strictly diagonally dominant.

8. Show that if $PAQ = LU$ is obtained via Gaussian elimination with pivoting, then $|u_{ii}| \ge |u_{ij}|$, for $j = i+1, \dots, n$.

9. Let $H = A + \mathrm{i}B$ be a Hermitian positive definite matrix, where $A, B \in \mathbb{R}^{n\times n}$.

(1) Prove that the matrix
\[
C = \begin{bmatrix} A & -B \\ B & A \end{bmatrix}
\]
is symmetric positive definite.

(2) How can one solve
\[
(A + \mathrm{i}B)(x + \mathrm{i}y) = b + \mathrm{i}c, \qquad x, y, b, c \in \mathbb{R}^n,
\]
using real number computations only?

10. Develop an algorithm to solve a tridiagonal system by using Gaussian elimination with partial pivoting.

11. Show that if a singular matrix $A \in \mathbb{R}^{n\times n}$ has a unique LU factorization, then $A_k$ is nonsingular for $k = 1, 2, \dots, n-1$.
Chapter 3

Perturbation and Error Analysis

In this chapter, we will discuss the effects of perturbation and error on numerical solutions. The error analysis on floating point operations and on the partial pivoting technique is also
The error analysis on floating point operations and on partial pivoting technique is also
given. It is well-known that the essential notions of distance and size in linear vector
spaces are captured by norms. We therefore need to introduce vector and matrix norms
and study their properties before we develop our perturbation and error analysis.

3.1 Vector and matrix norms


We first introduce vector norms.

3.1.1 Vector norms


Let
\[
x = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n .
\]

Definition 3.1 A vector norm on $\mathbb{R}^n$ is a function that assigns to each $x \in \mathbb{R}^n$ a real number $\|x\|$, called the norm of $x$, such that the following three properties are satisfied for all $x, y \in \mathbb{R}^n$ and all $\alpha \in \mathbb{R}$:

(i) $\|x\| > 0$ if $x \ne 0$, and $\|x\| = 0$ if and only if $x = 0$;

(ii) $\|\alpha x\| = |\alpha| \, \|x\|$;

(iii) $\|x + y\| \le \|x\| + \|y\|$.

A useful class of vector norms is the $p$-norm defined by
\[
\|x\|_p \equiv \Big( \sum_{i=1}^{n} |x_i|^p \Big)^{1/p}
\]
where $1 \le p$. The following $p$-norms are the most commonly used norms in practice:
\[
\|x\|_1 = \sum_{i=1}^{n} |x_i|, \qquad
\|x\|_2 = \Big( \sum_{i=1}^{n} |x_i|^2 \Big)^{1/2}, \qquad
\|x\|_\infty = \max_{1 \le i \le n} |x_i| .
\]
The Cauchy-Schwarz inequality concerning $\|\cdot\|_2$ is given as follows,
\[
|x^T y| \le \|x\|_2 \|y\|_2
\]
for $x, y \in \mathbb{R}^n$, which is a special case of the Hölder inequality given as follows,
\[
|x^T y| \le \|x\|_p \|y\|_q
\]
where $1/p + 1/q = 1$.

A very important property of vector norms on $\mathbb{R}^n$ is that all the vector norms on $\mathbb{R}^n$ are equivalent, as the following theorem says, see [35].

Theorem 3.1 If $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ are two norms on $\mathbb{R}^n$, then there exist two positive constants $c_1$ and $c_2$ such that
\[
c_1 \|x\|_\alpha \le \|x\|_\beta \le c_2 \|x\|_\alpha
\]
for all $x \in \mathbb{R}^n$.

For example, if $x \in \mathbb{R}^n$, then we have
\[
\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2,
\qquad
\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty
\]
and
\[
\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty .
\]
We remark that for any sequence of vectors $\{x_k\}$, where $x_k = (x_1^{(k)}, \dots, x_n^{(k)})^T \in \mathbb{R}^n$, and $x = (x_1, \dots, x_n)^T \in \mathbb{R}^n$, by Theorem 3.1, one can prove that
\[
\lim_{k\to\infty} \|x_k - x\| = 0 \iff \lim_{k\to\infty} |x_i^{(k)} - x_i| = 0,
\]
for $i = 1, \dots, n$.
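A quick NumPy check of these norms and of the equivalence inequalities (a sketch; numpy.linalg.norm computes the $p$-norms directly, and the vector is our own example):

    import numpy as np

    x = np.array([3.0, -4.0, 12.0])

    n1 = np.linalg.norm(x, 1)          # |3| + |4| + |12| = 19
    n2 = np.linalg.norm(x, 2)          # sqrt(9 + 16 + 144) = 13
    ninf = np.linalg.norm(x, np.inf)   # max |x_i| = 12

    n = x.size
    print(n2 <= n1 <= np.sqrt(n) * n2)       # True
    print(ninf <= n2 <= np.sqrt(n) * ninf)   # True
    print(ninf <= n1 <= n * ninf)            # True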

3.1.2 Matrix norms


Let
\[
A = [a_{ij}]_{i,j=1}^{n} \in \mathbb{R}^{n\times n} .
\]
We now turn our attention to matrix norms.

Definition 3.2 A matrix norm is a function that assigns to each $A \in \mathbb{R}^{n\times n}$ a real number $\|A\|$, called the norm of $A$, such that the following four properties are satisfied for all $A, B \in \mathbb{R}^{n\times n}$ and all $\alpha \in \mathbb{R}$:

(i) $\|A\| > 0$ if $A \ne 0$, and $\|A\| = 0$ if and only if $A = 0$;

(ii) $\|\alpha A\| = |\alpha| \, \|A\|$;

(iii) $\|A + B\| \le \|A\| + \|B\|$;

(iv) $\|AB\| \le \|A\| \, \|B\|$.

An important property of matrix norms on $\mathbb{R}^{n\times n}$ is that all the matrix norms on $\mathbb{R}^{n\times n}$ are equivalent. For the relation between a vector norm and a matrix norm, we have

Definition 3.3 If a matrix norm $\|\cdot\|_M$ and a vector norm $\|\cdot\|_v$ satisfy
\[
\|Ax\|_v \le \|A\|_M \|x\|_v ,
\]
for $A \in \mathbb{R}^{n\times n}$ and $x \in \mathbb{R}^n$, then these norms are called mutually consistent.

For any vector norm $\|\cdot\|_v$, we can define a matrix norm in the following natural way:
\[
\|A\|_M \equiv \max_{x \ne 0} \frac{\|Ax\|_v}{\|x\|_v} = \max_{\|x\|_v = 1} \|Ax\|_v .
\]
The most important matrix norms are the matrix $p$-norms induced by the vector $p$-norms for $p = 1, 2, \infty$. We have the following theorem.

Theorem 3.2 Let
\[
A = [a_{ij}]_{i,j=1}^{n} \in \mathbb{R}^{n\times n} .
\]
Then we have

(i) $\|A\|_1 = \max\limits_{1 \le j \le n} \sum\limits_{i=1}^{n} |a_{ij}|$.

(ii) $\|A\|_\infty = \max\limits_{1 \le i \le n} \sum\limits_{j=1}^{n} |a_{ij}|$.

(iii) $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$, where $\lambda_{\max}(A^T A)$ is the largest eigenvalue of $A^T A$.

Proof: We only give the proof of (i) and (iii). In the following, we always assume that $A \ne 0$.

For (i), we partition the matrix $A$ by columns:
\[
A = [a_1, \dots, a_n] .
\]
Let
\[
\nu = \|a_{j_0}\|_1 = \max_{1 \le j \le n} \|a_j\|_1 .
\]
Then for any vector $x \in \mathbb{R}^n$ which satisfies $\|x\|_1 = \sum_{i=1}^{n} |x_i| = 1$, we have
\[
\|Ax\|_1 = \Big\| \sum_{j=1}^{n} x_j a_j \Big\|_1 \le \sum_{j=1}^{n} |x_j| \, \|a_j\|_1
\le \Big( \sum_{j=1}^{n} |x_j| \Big) \max_{1 \le j \le n} \|a_j\|_1 = \|a_{j_0}\|_1 = \nu .
\]
Let $e_{j_0}$ denote the $j_0$-th unit vector; then
\[
\|Ae_{j_0}\|_1 = \|a_{j_0}\|_1 = \nu .
\]
Therefore
\[
\|A\|_1 = \max_{\|x\|_1 = 1} \|Ax\|_1 = \nu = \max_{1 \le j \le n} \|a_j\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| .
\]

For (iii), we have
\[
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \max_{\|x\|_2 = 1} [(Ax)^T (Ax)]^{1/2} = \max_{\|x\|_2 = 1} [x^T (A^T A) x]^{1/2} .
\]
Since $A^T A$ is positive semi-definite, its eigenvalues can be assumed to be in the following order:
\[
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0 .
\]
Let
\[
v_1, v_2, \dots, v_n \in \mathbb{R}^n
\]
denote the orthonormal eigenvectors corresponding to $\lambda_1, \lambda_2, \dots, \lambda_n$, respectively. Then for any vector $x \in \mathbb{R}^n$ with $\|x\|_2 = 1$, we have
\[
x = \sum_{i=1}^{n} \alpha_i v_i, \qquad \sum_{i=1}^{n} \alpha_i^2 = 1 .
\]
Therefore,
\[
x^T A^T A x = \sum_{i=1}^{n} \lambda_i \alpha_i^2 \le \lambda_1 .
\]
On the other hand, letting $x = v_1$, we have
\[
x^T A^T A x = v_1^T A^T A v_1 = v_1^T \lambda_1 v_1 = \lambda_1 .
\]
Thus
\[
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \sqrt{\lambda_1} = \sqrt{\lambda_{\max}(A^T A)} .
\]

We have the following theorem for the norm $\|\cdot\|_2$.

Theorem 3.3 Let $A \in \mathbb{R}^{n\times n}$. Then we have

(i) $\|A\|_2 = \max\limits_{\|x\|_2 = 1} \max\limits_{\|y\|_2 = 1} |y^* A x|$, where $x, y \in \mathbb{C}^n$.

(ii) $\|A^T\|_2 = \|A\|_2 = \sqrt{\|A^T A\|_2}$.

(iii) $\|A\|_2 = \|QAZ\|_2$, for any orthogonal matrices $Q$ and $Z$. We recall that a matrix $M \in \mathbb{R}^{n\times n}$ is called orthogonal if $M^{-1} = M^T$.

Proof: We only prove (i). We first introduce the dual norm $\|\cdot\|^D$ of a vector norm $\|\cdot\|$, defined as follows,
\[
\|y\|^D = \max_{\|x\| = 1} |y^* x| .
\]
For $\|\cdot\|_2$, we have by the Cauchy-Schwarz inequality,
\[
|y^* x| \le \|y\|_2 \|x\|_2
\]
with equality when $x = y/\|y\|_2$. Therefore, the dual norm of $\|\cdot\|_2$ is given by
\[
\|y\|_2^D = \max_{\|x\|_2 = 1} |y^* x| = \max_{\|x\|_2 = 1} \|y\|_2 \|x\|_2 = \|y\|_2 .
\]
So, $\|\cdot\|_2$ is its own dual. Now, we consider
\[
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2^D
= \max_{\|x\|_2 = 1} \max_{\|y\|_2 = 1} |(Ax)^* y|
= \max_{\|x\|_2 = 1} \max_{\|y\|_2 = 1} |y^* A x| .
\]

Another useful norm is the Frobenius norm, which is defined by
\[
\|A\|_F \equiv \Big( \sum_{j=1}^{n} \sum_{i=1}^{n} |a_{ij}|^2 \Big)^{1/2} .
\]
One of the most important properties of $\|\cdot\|_F$ is that for any orthogonal matrices $Q$ and $Z$,
\[
\|A\|_F = \|QAZ\|_F .
\]
In the following, we will extend our discussion on norms to the field $\mathbb{C}$. We remark that from the viewpoint of norms, there is no essential difference between matrices or vectors over the field $\mathbb{R}$ and matrices or vectors over the field $\mathbb{C}$.

Definition 3.4 Let $A \in \mathbb{C}^{n\times n}$. Then the set of all the eigenvalues of $A$ is called the spectrum of $A$ and
\[
\rho(A) = \max\{ |\lambda| : \lambda \ \text{belongs to the spectrum of} \ A \}
\]
is called the spectral radius of $A$.

For the relation between the spectral radius and matrix norms, we have

Theorem 3.4 Let $A \in \mathbb{C}^{n\times n}$. Then

(i) For any matrix norm, we have
\[
\rho(A) \le \|A\| .
\]

(ii) For any $\epsilon > 0$, there exists a norm $\|\cdot\|$ defined on $\mathbb{C}^{n\times n}$ such that
\[
\|A\| \le \rho(A) + \epsilon .
\]

Proof: For (i), let x Cn satisfy


x 6= 0, Ax = x, || = (A).
Then we have
(A)kxeT1 k = kxeT1 k = kAxeT1 k kAk kxeT1 k.
Hence
(A) kAk.
For (ii), by using Theorem 1.1 (Jordan Decomposition Theorem), we know that
there is a nonsingular matrix X Cnn such that

1 1
2 2

.. ..
X 1 AX = . .

n1 n1
n
where i = 1 or 0. For any given > 0, let
D = diag(1, , 2 , , n1 ),
then
1 1
2 2

1 1 .. ..
D X AXD = . . .

n1 n1
n
Now, define
kGk = kD1 X 1 GXD k , G Cnn .
It is easy to see this matrix norm k k actually is induced by the vector norm defined
as follows:
kxkXD = k(XD )1 xk , x Cn .
Therefore,
kAk = kD1 X 1 AXD k = max (|i | + |i |) (A) + ,
1in

where n = 0.
(k)
We remark that for any sequence of matrices {A(k) } where A(k) = [aij ] Rnn ,
and A = [aij ] Rnn ,
(k)
lim kA(k) Ak = 0 lim aij = aij ,
k k
32 CHAPTER 3. PERTURBATION AND ERROR ANALYSIS

for i, j = 1, , n.

Theorem 3.5 Let $A \in \mathbb{C}^{n\times n}$. Then
\[
\lim_{k\to\infty} A^k = 0 \iff \rho(A) < 1 .
\]

Proof: We first assume that
\[
\lim_{k\to\infty} A^k = 0 .
\]
Let $\lambda$ be an eigenvalue of $A$ such that $\rho(A) = |\lambda|$. Then $\lambda^k$ is an eigenvalue of $A^k$ for any $k$. By Theorem 3.4 (i), we know that for any $k$,
\[
\rho(A)^k = |\lambda|^k = |\lambda^k| \le \rho(A^k) \le \|A^k\| .
\]
Therefore,
\[
\lim_{k\to\infty} \rho(A)^k = 0,
\]
which implies $\rho(A) < 1$.

Conversely, assume that $\rho(A) < 1$. By Theorem 3.4 (ii), there exists a matrix norm $\|\cdot\|$ such that $\|A\| < 1$. Therefore, we have
\[
0 \le \|A^k\| \le \|A\|^k \to 0, \qquad k \to \infty,
\]
i.e.,
\[
\lim_{k\to\infty} A^k = 0 .
\]
By using Theorem 3.5, one can easily prove the following important theorem.

Theorem 3.6 Let $A \in \mathbb{C}^{n\times n}$. Then

(i) $\sum\limits_{k=0}^{\infty} A^k$ is convergent if and only if $\rho(A) < 1$.

(ii) When $\sum\limits_{k=0}^{\infty} A^k$ converges, we have
\[
\sum_{k=0}^{\infty} A^k = (I - A)^{-1} .
\]
Moreover, there exists a norm $\|\cdot\|$ defined on $\mathbb{C}^{n\times n}$ such that for any $m$,
\[
\Big\| (I - A)^{-1} - \sum_{k=0}^{m} A^k \Big\| \le \frac{\|A\|^{m+1}}{1 - \|A\|} .
\]

Corollary 3.1 Let $\|\cdot\|$ be a norm defined on $\mathbb{C}^{n\times n}$ with $\|I\| = 1$ and let $A \in \mathbb{C}^{n\times n}$ satisfy $\|A\| < 1$. Then $I - A$ is nonsingular and satisfies
\[
\|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|} .
\]

Proof: Just note that
\[
\|(I - A)^{-1}\| = \Big\| \sum_{k=0}^{\infty} A^k \Big\| \le 1 + \sum_{k=1}^{\infty} \|A\|^k = \frac{1}{1 - \|A\|} .
\]
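A small numerical illustration of Theorem 3.6 and Corollary 3.1 (a sketch using NumPy; the matrix is our own toy example with spectral radius below one):

    import numpy as np

    A = np.array([[0.2, 0.4],
                  [0.1, 0.3]])
    print(max(abs(np.linalg.eigvals(A))))   # spectral radius, about 0.46 < 1

    # Partial sums of the Neumann series sum_k A^k approach (I - A)^{-1}
    S = np.zeros_like(A)
    P = np.eye(2)
    for k in range(60):
        S += P
        P = P @ A
    print(np.allclose(S, np.linalg.inv(np.eye(2) - A)))   # True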

3.2 Perturbation analysis for linear systems


We first consider the following simple example $Ax = b$ given by:
\[
\begin{bmatrix} 2.0001 & 1.9999 \\ 1.9999 & 2.0001 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} 4 \\ 4 \end{bmatrix}.
\]
The solution of this linear system is $x = (1, 1)^T$. If there is a small perturbation on $b$, say,
\[
\delta b = (1\times 10^{-4}, \ -1\times 10^{-4})^T,
\]
the system becomes
\[
\begin{bmatrix} 2.0001 & 1.9999 \\ 1.9999 & 2.0001 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} 4.0001 \\ 3.9999 \end{bmatrix}.
\]
The solution of this perturbed system is $\widehat{x} = (1.5, 0.5)^T$. Therefore, we have
\[
\frac{\|x - \widehat{x}\|}{\|x\|} = \frac{1}{2}, \qquad
\frac{\|\delta b\|}{\|b\|} = \frac{1}{40000},
\]
i.e., the relative error of the solution is 20000 times that of the perturbation on $b$.
Thus, when we solve a linear system Ax = b, a good measurement, which can
tell us how sensitive the computed solution is to small input perturbations, is needed.
The condition number of matrices is then defined. It relates perturbations of x to
perturbations of A and b.

Definition 3.5 Let $\|\cdot\|$ be any matrix norm and $A$ be a nonsingular matrix. The condition number of $A$ is defined as follows,
\[
\kappa(A) \equiv \|A\| \, \|A^{-1}\| . \qquad (3.1)
\]



Obviously, the condition number depends on the matrix norm used. When $\kappa(A)$ is small, then $A$ is said to be well-conditioned, whereas if $\kappa(A)$ is large, then $A$ is said to be ill-conditioned. Note that for any $p$-norm, we have
\[
1 = \|I\| = \|A A^{-1}\| \le \|A\| \, \|A^{-1}\| = \kappa(A) .
\]

Let $\widehat{x}$ be an approximation of the exact solution $x$ of $Ax = b$. The error vector is defined as follows,
\[
e = \widehat{x} - x,
\]
i.e.,
\[
\widehat{x} = x + e . \qquad (3.2)
\]
The absolute error is given by
\[
\|e\| = \|\widehat{x} - x\|
\]
for any vector norm. If $x \ne 0$, then the relative error is defined by
\[
\frac{\|e\|}{\|x\|} = \frac{\|\widehat{x} - x\|}{\|x\|} .
\]
By substituting (3.2) into $Ax = b$, we have
\[
A(x + e) = Ax + Ae = \widehat{b} .
\]
Therefore,
\[
Ax = b, \qquad Ae = \widehat{b} - b .
\]
Thus $\widehat{x}$ is the exact solution of $A\widehat{x} = \widehat{b}$, where $\widehat{b}$ is a perturbed vector of $b$. Since $x = A^{-1}b$ and $\widehat{x} = A^{-1}\widehat{b}$, we have
\[
\|x - \widehat{x}\| = \|A^{-1}(b - \widehat{b})\| \le \|A^{-1}\| \, \|b - \widehat{b}\| . \qquad (3.3)
\]
Similarly,
\[
\|b\| = \|Ax\| \le \|A\| \, \|x\|,
\]
i.e.,
\[
\frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|} . \qquad (3.4)
\]
Combining (3.3), (3.4) and (3.1), we obtain the following theorem, which gives the effect of perturbations of the vector $b$ on the solution of $Ax = b$ in terms of the condition number.

Theorem 3.7 Let $\widehat{x}$ be an approximate solution of the exact solution $x$ of $Ax = b$, and let $\widehat{b} = A\widehat{x}$. Then
\[
\frac{\|x - \widehat{x}\|}{\|x\|} \le \kappa(A)\,\frac{\|b - \widehat{b}\|}{\|b\|} .
\]
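For the 2-by-2 example at the beginning of this section, the bound of Theorem 3.7 can be checked directly (a NumPy sketch; variable names are ours):

    import numpy as np

    A = np.array([[2.0001, 1.9999],
                  [1.9999, 2.0001]])
    b = np.array([4.0, 4.0])
    b_hat = np.array([4.0001, 3.9999])

    x = np.linalg.solve(A, b)           # (1, 1)
    x_hat = np.linalg.solve(A, b_hat)   # (1.5, 0.5)

    kappa = np.linalg.cond(A, 2)        # kappa_2(A) = ||A||_2 ||A^{-1}||_2 = 20000
    lhs = np.linalg.norm(x - x_hat) / np.linalg.norm(x)            # 0.5
    rhs = kappa * np.linalg.norm(b - b_hat) / np.linalg.norm(b)    # also 0.5
    print(kappa, lhs <= rhs + 1e-12)    # bound holds, with near equality here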

The next theorem includes the effect of perturbations of the coefficient matrix $A$ on the solution of $Ax = b$ in terms of the condition number.

Theorem 3.8 Let $A$ be a nonsingular matrix and $\widehat{A}$ be a perturbed matrix of $A$ such that
\[
\|A - \widehat{A}\| \, \|A^{-1}\| < 1 .
\]
If $Ax = b$ and $\widehat{A}\widehat{x} = \widehat{b}$, where $\widehat{b}$ is a perturbed vector of $b$, then
\[
\frac{\|x - \widehat{x}\|}{\|x\|} \le
\frac{\kappa(A)}{1 - \kappa(A)\dfrac{\|A - \widehat{A}\|}{\|A\|}}
\left( \frac{\|A - \widehat{A}\|}{\|A\|} + \frac{\|b - \widehat{b}\|}{\|b\|} \right).
\]

Proof: Let
\[
E = A - \widehat{A} \qquad \text{and} \qquad \delta = b - \widehat{b} .
\]
By subtracting $Ax = b$ from $\widehat{A}\widehat{x} = \widehat{b}$, we have
\[
A(x - \widehat{x}) = -E\widehat{x} + \delta .
\]
Furthermore, we get
\[
\frac{\|x - \widehat{x}\|}{\|x\|} \le \|A^{-1}E\| \frac{\|\widehat{x}\|}{\|x\|}
+ \|A^{-1}\| \frac{\|Ax\|}{\|x\|} \frac{\|\delta\|}{\|b\|} .
\]
By using $\|\widehat{x}\| \le \|x - \widehat{x}\| + \|x\|$ and $\|Ax\| \le \|A\|\,\|x\|$, we then have
\[
\frac{\|x - \widehat{x}\|}{\|x\|} \le \|A^{-1}E\| \frac{\|x - \widehat{x}\|}{\|x\|}
+ \|A^{-1}E\| + \|A^{-1}\|\,\|A\| \frac{\|\delta\|}{\|b\|},
\]
i.e.,
\[
(1 - \|A^{-1}E\|) \frac{\|x - \widehat{x}\|}{\|x\|} \le \|A^{-1}E\| + \kappa(A)\frac{\|\delta\|}{\|b\|} .
\]
Since
\[
\|A^{-1}E\| \le \|A^{-1}\| \, \|E\| = \|A^{-1}\| \, \|A - \widehat{A}\| < 1,
\]
we get
\[
\frac{\|x - \widehat{x}\|}{\|x\|} \le (1 - \|A^{-1}E\|)^{-1}
\left( \|A^{-1}E\| + \kappa(A)\frac{\|\delta\|}{\|b\|} \right).
\]
By using
\[
\|A^{-1}E\| \le \|A^{-1}\| \, \|E\| = \kappa(A)\frac{\|E\|}{\|A\|},
\]
we finally have
\[
\frac{\|x - \widehat{x}\|}{\|x\|} \le
\frac{\kappa(A)}{1 - \kappa(A)\dfrac{\|E\|}{\|A\|}}
\left( \frac{\|E\|}{\|A\|} + \frac{\|\delta\|}{\|b\|} \right).
\]
Theorems 3.7 and 3.8 give upper bounds for the relative error of $\widehat{x}$ in terms of the condition number of $A$. From Theorems 3.7 and 3.8, we know that if $A$ is well-conditioned, i.e., $\kappa(A)$ is small, the relative error in $\widehat{x}$ will be small if the relative errors in both $A$ and $b$ are small.

Corollary 3.2 Let $\|\cdot\|$ be any matrix norm with $\|I\| = 1$ and $A$ be a nonsingular matrix with $A + \widetilde{A}$ being a perturbed matrix of $A$ such that
\[
\|A^{-1}\widetilde{A}\| < 1 .
\]
Then $A + \widetilde{A}$ is nonsingular and
\[
\frac{\|(A + \widetilde{A})^{-1} - A^{-1}\|}{\|A^{-1}\|} \le
\frac{\kappa(A)\dfrac{\|\widetilde{A}\|}{\|A\|}}{1 - \kappa(A)\dfrac{\|\widetilde{A}\|}{\|A\|}} .
\]

Proof: We first prove that
\[
\|(A + \widetilde{A})^{-1} - A^{-1}\| \le \frac{\|\widetilde{A}\| \, \|A^{-1}\|^2}{1 - \|A^{-1}\widetilde{A}\|} .
\]
Note that
\[
A + \widetilde{A} = A(I + A^{-1}\widetilde{A}) = A[I - (-A^{-1}\widetilde{A})] .
\]
Let $F = -A^{-1}\widetilde{A}$ and $r = \|A^{-1}\widetilde{A}\|$. Now,
\[
(A + \widetilde{A})^{-1} = (I - F)^{-1} A^{-1} .
\]
Therefore, by noting that $\|F\| = r < 1$ and Corollary 3.1, we have
\[
\|(A + \widetilde{A})^{-1}\| = \|(I - F)^{-1} A^{-1}\| \le \|(I - F)^{-1}\| \, \|A^{-1}\|
\le \frac{\|A^{-1}\|}{1 - \|F\|} = \frac{\|A^{-1}\|}{1 - r} .
\]
By using the identity
\[
B^{-1} = A^{-1} - B^{-1}(B - A)A^{-1},
\]
we have
\[
(A + \widetilde{A})^{-1} - A^{-1} = -(A + \widetilde{A})^{-1} \widetilde{A} A^{-1} .
\]
Then
\[
\|(A + \widetilde{A})^{-1} - A^{-1}\| \le \|(A + \widetilde{A})^{-1}\| \, \|\widetilde{A}\| \, \|A^{-1}\|
\le \frac{\|A^{-1}\|^2 \|\widetilde{A}\|}{1 - r} .
\]
Finally, we obtain
\[
\frac{\|(A + \widetilde{A})^{-1} - A^{-1}\|}{\|A^{-1}\|} \le \frac{\|A^{-1}\| \, \|\widetilde{A}\|}{1 - r}
\le \frac{\|A^{-1}\| \, \|\widetilde{A}\|}{1 - \|A^{-1}\| \, \|\widetilde{A}\|}
= \frac{\kappa(A)\dfrac{\|\widetilde{A}\|}{\|A\|}}{1 - \kappa(A)\dfrac{\|\widetilde{A}\|}{\|A\|}} .
\]

3.3 Error analysis on floating point arithmetic


In computers, floating point numbers $f$ are expressed as
\[
f = \pm\delta \times \beta^J, \qquad L \le J \le U,
\]
where $\beta$ is the base, $J$ is the order, and $\delta$ is the fraction. Usually, $\delta$ has the following form:
\[
\delta = 0.d_1 d_2 \cdots d_t
\]
where $t$ is the length (precision) of $\delta$, $d_1 \ne 0$, and $0 \le d_i < \beta$, for $i = 2, \dots, t$. Let
\[
F = \{0\} \cup \{ f : f = \pm\delta \times \beta^J, \ 0 \le d_i < \beta, \ d_1 \ne 0, \ L \le J \le U \} .
\]
Then $F$ contains
\[
2(\beta - 1)\beta^{t-1}(U - L + 1) + 1
\]
floating point numbers. These numbers are symmetrically distributed in the intervals $[m, M]$ and $[-M, -m]$, where
\[
m = \beta^{L-1}, \qquad M = \beta^U (1 - \beta^{-t}) . \qquad (3.5)
\]
We remark that $F$ is only a finite set, which cannot contain all the real numbers in these two intervals.

Let $\mathrm{fl}(x)$ denote the floating point number of any real number $x$. Then
\[
\mathrm{fl}(x) = 0, \qquad \text{for } x = 0 .
\]

If $m \le |x| \le M$, by rounding, $\mathrm{fl}(x)$ is the floating point number that minimizes the distance to $x$ over all of $F$:
\[
|\mathrm{fl}(x) - x| = \min_{f \in F} |f - x| .
\]
By chopping, $\mathrm{fl}(x)$ minimizes the distance to $x$ over the floating point numbers with $|f| \le |x|$:
\[
|\mathrm{fl}(x) - x| = \min_{|f| \le |x|} |f - x| .
\]
For example, let $\beta = 10$, $t = 3$, $L = 0$ and $U = 2$. We consider the floating point expression of $x = 5.45627$. By rounding, we have $\mathrm{fl}(x) = 0.546 \times 10$. By chopping, we have $\mathrm{fl}(x) = 0.545 \times 10$. The following theorem gives an estimate of the relative error of floating point expressions.

Theorem 3.9 Let $m \le |x| \le M$, where $m$ and $M$ are defined by (3.5). Then
\[
\mathrm{fl}(x) = x(1 + \epsilon), \qquad |\epsilon| \le u,
\]
where $u$ is the machine precision, i.e.,
\[
u = \begin{cases}
\dfrac{1}{2}\beta^{1-t}, & \text{by rounding}, \\
\beta^{1-t}, & \text{by chopping}.
\end{cases}
\]

Proof: In the following, we assume that $x \ne 0$ and $x > 0$. Let $\gamma$ be an integer satisfying
\[
\beta^{\gamma - 1} \le x < \beta^{\gamma} . \qquad (3.6)
\]
Since the order of the floating point numbers in $[\beta^{\gamma-1}, \beta^{\gamma})$ is $\gamma$, all the numbers
\[
0.d_1 d_2 \cdots d_t \times \beta^{\gamma}
\]
are distributed in the interval with distance $\beta^{\gamma - t}$. For the rounding error, by (3.6), we have
\[
|\mathrm{fl}(x) - x| \le \frac{1}{2}\beta^{\gamma - t} = \frac{1}{2}\beta^{\gamma - 1}\beta^{1-t} \le \frac{1}{2}\beta^{1-t}\,x,
\]
i.e.,
\[
\frac{|\mathrm{fl}(x) - x|}{x} \le \frac{1}{2}\beta^{1-t} .
\]
For the chopping error, we have
\[
|\mathrm{fl}(x) - x| \le \beta^{\gamma - t} = \beta^{\gamma - 1}\beta^{1-t} \le \beta^{1-t}\,x,
\]
i.e.,
\[
\frac{|\mathrm{fl}(x) - x|}{x} \le \beta^{1-t} .
\]

The proof is complete.

Let us now consider the rounding error of elementary operations. Let $a, b \in F$ and let $\circ$ represent any of the elementary operations $+, -, \times, \div$. By Theorem 3.9, we immediately have

Theorem 3.10 We have
\[
\mathrm{fl}(a \circ b) = (a \circ b)(1 + \epsilon), \qquad |\epsilon| \le u .
\]

Theorem 3.11 If $|\epsilon_i| \le u$ and $nu \le 0.01$, then
\[
1 - 1.01nu \le \prod_{i=1}^{n} (1 + \epsilon_i) \le 1 + 1.01nu .
\]

Proof: Since $|\epsilon_i| \le u$, we have
\[
(1 - u)^n \le \prod_{i=1}^{n} (1 + \epsilon_i) \le (1 + u)^n . \qquad (3.7)
\]
For the lower bound of $(1 - u)^n$, by using the Taylor expansion of $(1 - x)^n$, i.e.,
\[
(1 - x)^n = 1 - nx + \frac{n(n-1)}{2}(1 - \xi x)^{n-2} x^2, \qquad 0 < \xi < 1,
\]
we have
\[
1 - nx \le (1 - x)^n .
\]
Therefore,
\[
1 - 1.01nu \le 1 - nu \le (1 - u)^n . \qquad (3.8)
\]
Now, we estimate the upper bound of $(1 + u)^n$. By using the Taylor expansion of $e^x$, we have
\[
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots
= 1 + x + x\Big( \frac{x}{2} + \frac{x^2}{3!} + \cdots \Big).
\]
Therefore, when $0 \le x \le 0.01$, we know that, by using $e^{0.01} < 2$,
\[
1 + x \le e^x \le 1 + x + x e^{0.01}\frac{x}{2} \le 1 + 1.01x . \qquad (3.9)
\]
Let $x = u$. By the left inequality of (3.9), we have
\[
(1 + u)^n \le e^{nu} . \qquad (3.10)
\]
Let $x = nu$. By the right inequality of (3.9), we have
\[
e^{nu} \le 1 + 1.01nu . \qquad (3.11)
\]
Combining (3.10) and (3.11), we have
\[
(1 + u)^n \le 1 + 1.01nu . \qquad (3.12)
\]
By (3.7), (3.8) and (3.12), the proof is complete.

We consider the following example.

Example 3.1. For given $x, y \in \mathbb{R}^n$, estimate an upper bound of
$$|fl(x^Ty) - x^Ty|.$$
Let
$$S_k = fl\Big(\sum_{i=1}^{k}x_iy_i\Big).$$
By Theorem 3.10, we have
$$S_1 = x_1y_1(1 + \gamma_1), \qquad |\gamma_1| \leq u,$$
and
$$S_k = fl(S_{k-1} + fl(x_ky_k)) = [S_{k-1} + x_ky_k(1 + \gamma_k)](1 + \delta_k), \qquad |\gamma_k|, |\delta_k| \leq u.$$
Therefore,
$$fl(x^Ty) = S_n = \sum_{i=1}^{n}x_iy_i(1 + \gamma_i)\prod_{j=i}^{n}(1 + \delta_j) = \sum_{i=1}^{n}(1 + \epsilon_i)x_iy_i,$$
where
$$1 + \epsilon_i = (1 + \gamma_i)\prod_{j=i}^{n}(1 + \delta_j)$$
with $\delta_1 = 0$. Thus, if $nu \leq 0.01$, we then have by Theorem 3.11,
$$|fl(x^Ty) - x^Ty| \leq \sum_{i=1}^{n}|\epsilon_i|\,|x_iy_i| \leq 1.01nu\sum_{i=1}^{n}|x_iy_i|.$$
Before we finish this section, let us briefly discuss the floating point analysis of elementary matrix operations. We first introduce the following notations:
$$|E| = [\,|e_{ij}|\,],$$
where $E = [e_{ij}] \in \mathbb{R}^{n \times n}$, and
$$|E| \leq |F| \iff |e_{ij}| \leq |f_{ij}|$$
for $i, j = 1, 2, \cdots, n$. Let $A, B \in \mathbb{R}^{n \times n}$ be matrices with entries in $F$, and $\alpha \in F$. By Theorem 3.10, we have
$$fl(\alpha A) = \alpha A + E, \qquad |E| \leq u|\alpha A|,$$
and
$$fl(A + B) = (A + B) + E, \qquad |E| \leq u|A + B|.$$
From Example 3.1, we also have
$$fl(AB) = AB + E, \qquad |E| \leq 1.01nu|A|\,|B|.$$
Note that $|A|\,|B|$ may be much larger than $|AB|$. Therefore the relative error of $fl(AB)$ may not be small.

3.4 Error analysis on partial pivoting


We will show that if Gaussian elimination with partial pivoting is used to solve $Ax = b$, then the computational solution $\hat{x}$ satisfies
$$(A + E)\hat{x} = b,$$
where $E$ is an error matrix. An upper bound of $E$ is also given. We first study the rounding error of the LU factorization of $A$.

Lemma 3.1 Let $A \in \mathbb{R}^{n \times n}$ with floating point entries. Assume that $A$ has an LU factorization and $6nu \leq 1$, where $u$ is the machine precision. Then by using Gaussian elimination, we have
$$\widetilde{L}\widetilde{U} = A + E,$$
where
$$|E| \leq 3nu(|A| + |\widetilde{L}|\,|\widetilde{U}|).$$
Proof: We use induction on $n$. Obviously, Lemma 3.1 is true for $n = 1$. Assume that the lemma holds for $n - 1$. Now, we consider a matrix $A \in \mathbb{R}^{n \times n}$:
$$A = \begin{pmatrix} \alpha & w^T \\ v & A_1 \end{pmatrix},$$
where $A_1 \in \mathbb{R}^{(n-1) \times (n-1)}$. At the first step of Gaussian elimination, we compute the vector $l_1 = fl(v/\alpha)$ and modify the matrix $A_1$ as
$$\widetilde{A}_1 = fl(A_1 - fl(l_1w^T)).$$
By Theorem 3.10, we have
$$l_1 = v/\alpha + f, \qquad |f| \leq \frac{u}{|\alpha|}|v|, \tag{3.13}$$
and
$$\widetilde{A}_1 = A_1 - l_1w^T + F, \qquad |F| \leq (2 + u)u(|A_1| + |l_1|\,|w|^T). \tag{3.14}$$
For $\widetilde{A}_1$, by using the induction assumption, we obtain an LU factorization with a unit lower triangular matrix $\widetilde{L}_1$ and an upper triangular matrix $\widetilde{U}_1$ such that
$$\widetilde{L}_1\widetilde{U}_1 = \widetilde{A}_1 + E_1,$$
where
$$|E_1| \leq 3(n-1)u(|\widetilde{A}_1| + |\widetilde{L}_1|\,|\widetilde{U}_1|).$$
Thus, we have
$$\widetilde{L}\widetilde{U} = \begin{pmatrix} 1 & 0 \\ l_1 & \widetilde{L}_1 \end{pmatrix}\begin{pmatrix} \alpha & w^T \\ 0 & \widetilde{U}_1 \end{pmatrix} = A + E,$$
where
$$E = \begin{pmatrix} 0 & 0 \\ \alpha f & E_1 + F \end{pmatrix}.$$
By using (3.14), we obtain
$$|\widetilde{A}_1| \leq (1 + 2u + u^2)(|A_1| + |l_1|\,|w|^T).$$
Therefore, by using the condition $6nu \leq 1$, we have
$$\begin{aligned}
|E_1 + F| &\leq |E_1| + |F| \\
&\leq 3(n-1)u(|\widetilde{A}_1| + |\widetilde{L}_1|\,|\widetilde{U}_1|) + (2+u)u(|A_1| + |l_1|\,|w|^T) \\
&\leq 3(n-1)u\big[(1 + 2u + u^2)(|A_1| + |l_1|\,|w|^T) + |\widetilde{L}_1|\,|\widetilde{U}_1|\big] + (2+u)u(|A_1| + |l_1|\,|w|^T) \\
&\leq u\big(3n - 1 + [6n + 3(n-1)u - 5]u\big)(|A_1| + |l_1|\,|w|^T) + 3(n-1)u(|\widetilde{L}_1|\,|\widetilde{U}_1|) \\
&\leq 3nu(|A_1| + |l_1|\,|w|^T + |\widetilde{L}_1|\,|\widetilde{U}_1|).
\end{aligned}$$
Combining with (3.13), we obtain
$$\begin{aligned}
|E| &= \begin{pmatrix} 0 & 0 \\ |\alpha|\,|f| & |E_1 + F| \end{pmatrix}
\leq 3nu\begin{pmatrix} 0 & 0 \\ |v| & |A_1| + |l_1|\,|w|^T + |\widetilde{L}_1|\,|\widetilde{U}_1| \end{pmatrix} \\
&\leq 3nu\left[\begin{pmatrix} |\alpha| & |w|^T \\ |v| & |A_1| \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ |l_1| & |\widetilde{L}_1| \end{pmatrix}\begin{pmatrix} |\alpha| & |w|^T \\ 0 & |\widetilde{U}_1| \end{pmatrix}\right]
= 3nu(|A| + |\widetilde{L}|\,|\widetilde{U}|).
\end{aligned}$$
The proof is complete.

Corollary 3.3 Let $A \in \mathbb{R}^{n \times n}$ be nonsingular with floating point entries and $6nu \leq 1$. Assume that by using Gaussian elimination with partial pivoting, we obtain
$$\widetilde{L}\widetilde{U} = PA + E,$$
where $\widetilde{L} = [l_{ij}]$ is a unit lower triangular matrix with $|l_{ij}| \leq 1$, $\widetilde{U}$ is an upper triangular matrix and $P$ is a permutation matrix. Then $E$ satisfies the following inequality:
$$|E| \leq 3nu(|PA| + |\widetilde{L}|\,|\widetilde{U}|).$$

After we obtain the LU factorization of $A$, the problem of solving $Ax = b$ becomes the problem of solving the following two triangular systems:
$$\widetilde{L}y = Pb, \qquad \widetilde{U}x = y.$$
Therefore, we need to estimate the rounding error of solving triangular systems.

Lemma 3.2 Let $S \in \mathbb{R}^{n \times n}$ be a nonsingular triangular matrix with floating point entries and $1.01nu \leq 0.01$. By using the method proposed in Section 2.1.1 to solve $Sx = b$, we obtain a computational solution $\hat{x}$ which satisfies
$$(S + H)\hat{x} = b,$$
where
$$|H| \leq 1.01nu|S|.$$

Proof: We use induction on $n$. Without loss of generality, let $S = L$ be a lower triangular matrix. Obviously, Lemma 3.2 is true for $n = 1$. Assume that the lemma is true for $n - 1$. Now, we consider a lower triangular matrix $L \in \mathbb{R}^{n \times n}$. Let $\hat{x}$ be the computational solution of $Lx = b$ and partition $L$, $b$ and $\hat{x}$ as follows:
$$L = \begin{pmatrix} l_{11} & 0 \\ l_1 & L_1 \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 \\ c \end{pmatrix}, \qquad \hat{x} = \begin{pmatrix} \hat{x}_1 \\ \hat{y} \end{pmatrix},$$
where $c, \hat{y} \in \mathbb{R}^{n-1}$ and $L_1 \in \mathbb{R}^{(n-1) \times (n-1)}$. By Theorem 3.10, we have
$$\hat{x}_1 = fl(b_1/l_{11}) = \frac{b_1}{l_{11}(1 + \epsilon_1)}, \qquad |\epsilon_1| \leq u. \tag{3.15}$$
Note that $\hat{y}$ is the computational solution of the $(n-1)$-by-$(n-1)$ system
$$L_1y = fl(c - \hat{x}_1l_1).$$
By the induction assumption, we have
$$(L_1 + H_1)\hat{y} = fl(c - \hat{x}_1l_1),$$
where
$$|H_1| \leq 1.01(n-1)u|L_1|. \tag{3.16}$$
By Theorem 3.10 again, we obtain
$$fl(c - \hat{x}_1l_1) = fl(c - fl(\hat{x}_1l_1)) = (I + D_\epsilon)^{-1}(c - \hat{x}_1l_1 - \hat{x}_1D_\delta l_1),$$
where
$$D_\epsilon = \mathrm{diag}(\epsilon_2, \cdots, \epsilon_n), \qquad D_\delta = \mathrm{diag}(\delta_2, \cdots, \delta_n),$$
with $|\epsilon_i|, |\delta_i| \leq u$, $i = 2, \cdots, n$. Therefore,
$$\hat{x}_1l_1 + \hat{x}_1D_\delta l_1 + (I + D_\epsilon)(L_1 + H_1)\hat{y} = c,$$
and then
$$(L + H)\hat{x} = b,$$
where
$$H = \begin{pmatrix} \epsilon_1l_{11} & 0 \\ D_\delta l_1 & H_1 + D_\epsilon(L_1 + H_1) \end{pmatrix}.$$
By using (3.15), (3.16) and the condition $1.01nu \leq 0.01$, we have
$$\begin{aligned}
|H| &\leq \begin{pmatrix} |\epsilon_1|\,|l_{11}| & 0 \\ |D_\delta|\,|l_1| & |H_1| + |D_\epsilon|(|L_1| + |H_1|) \end{pmatrix}
\leq \begin{pmatrix} u|l_{11}| & 0 \\ u|l_1| & |H_1| + u(|L_1| + |H_1|) \end{pmatrix} \\
&\leq u\begin{pmatrix} |l_{11}| & 0 \\ |l_1| & [1.01(n-1) + 1 + 1.01(n-1)u]|L_1| \end{pmatrix}
\leq 1.01nu|L|.
\end{aligned}$$
We then have the main theorem of this section.

Theorem 3.12 Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix with floating point entries and $1.01nu \leq 0.01$. If Gaussian elimination with partial pivoting is used to solve $Ax = b$, we obtain a computational solution $\hat{x}$ which satisfies
$$(A + \Delta A)\hat{x} = b,$$
where
$$\|\Delta A\|_\infty \leq u(3n + 5.04n^3)\rho\,\|A\|_\infty \tag{3.17}$$
with the growth factor
$$\rho \equiv \frac{1}{\|A\|_\infty}\max_{i,j,k}|a_{ij}^{(k)}|.$$

Proof: By using Gaussian elimination with partial pivoting, we have the following two triangular systems:
$$\widetilde{L}y = Pb, \qquad \widetilde{U}x = y.$$
By using Lemma 3.2, the computational solution $\hat{x}$ satisfies
$$(\widetilde{L} + F)(\widetilde{U} + G)\hat{x} = Pb,$$
i.e.,
$$(\widetilde{L}\widetilde{U} + F\widetilde{U} + \widetilde{L}G + FG)\hat{x} = Pb, \tag{3.18}$$
where
$$|F| \leq 1.01nu|\widetilde{L}|, \qquad |G| \leq 1.01nu|\widetilde{U}|. \tag{3.19}$$
Substituting $\widetilde{L}\widetilde{U} = PA + E$ into (3.18), we have
$$(A + \Delta A)\hat{x} = b,$$
where
$$\Delta A = P^T(E + F\widetilde{U} + \widetilde{L}G + FG).$$
By using (3.19), Corollary 3.3 and the condition $1.01nu \leq 0.01$, we have
$$|\Delta A| \leq P^T\big(3nu|PA| + (3n + 2.04n)u|\widetilde{L}|\,|\widetilde{U}|\big) = nuP^T\big(3|PA| + 5.04|\widetilde{L}|\,|\widetilde{U}|\big). \tag{3.20}$$
By Corollary 3.3 again, the absolute values of the entries of $\widetilde{L}$ are less than or equal to 1. Therefore, we have
$$\|\widetilde{L}\|_\infty \leq n. \tag{3.21}$$
We now define
$$\rho \equiv \frac{1}{\|A\|_\infty}\max_{i,j,k}|a_{ij}^{(k)}|,$$
and then we have
$$\|\widetilde{U}\|_\infty \leq n\rho\|A\|_\infty. \tag{3.22}$$
Substituting (3.21) and (3.22) into (3.20), we obtain (3.17). The proof is complete.

We remark that $\|\Delta A\|_\infty$ is usually very small compared with the initial error in the given data. Thus, Gaussian elimination with partial pivoting is numerically stable.
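The experiment below, in the spirit of Theorem 3.12, computes the partial-pivoting LU factorization of a random matrix, a computable surrogate for the growth factor ($\max_{ij}|\widetilde u_{ij}|/\|A\|_\infty$), and the backward error $\|P^TA - \widetilde L\widetilde U\|_\infty$. It assumes SciPy is available; it is an illustration, not part of the text.

    import numpy as np
    from scipy.linalg import lu

    rng = np.random.default_rng(1)
    n = 200
    A = rng.standard_normal((n, n))

    P, L, U = lu(A)                       # SciPy returns A = P @ L @ U
    u = np.finfo(float).eps / 2           # unit roundoff of double precision

    rho = np.max(np.abs(U)) / np.linalg.norm(A, np.inf)
    backward = np.linalg.norm(P.T @ A - L @ U, np.inf)
    bound = u * (3*n + 5.04*n**3) * rho * np.linalg.norm(A, np.inf)

    print("rho =", rho)
    print("backward error =", backward, " vs  theoretical bound =", bound)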

Exercises:
1. Let
$$A = \begin{pmatrix} 1 & 0.999999 \\ 0.999999 & 1 \end{pmatrix}.$$
Compute $A^{-1}$, $\det(A)$ and the condition number of $A$.

2. Prove that $\|AB\|_F \leq \|A\|_2\|B\|_F$ and $\|AB\|_F \leq \|A\|_F\|B\|_2$.

3. Prove that $\|A\|_2^2 \leq \|A\|_1\|A\|_\infty$ for any square matrix $A$.

4. Show that
$$\left\|\begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix}\right\|_2 \leq \left\|\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\right\|_2.$$

5. Let $A$ be nonsingular. Show that
$$\|A^{-1}\|_2^{-1} = \min_{\|x\|_2 = 1}\|Ax\|_2.$$

6. Show that if $S$ is real and $S = -S^T$, then $I - S$ is nonsingular and the matrix
$$(I - S)^{-1}(I + S)$$
is orthogonal. This is known as the Cayley transform of $S$.

7. Prove that if both $A$ and $A + E$ are nonsingular, then
$$\frac{\|(A + E)^{-1} - A^{-1}\|}{\|A^{-1}\|} \leq \|(A + E)^{-1}\|\,\|E\|.$$

8. Let $A \in \mathbb{R}^{n \times n}$ be nonsingular and let $x, y, z \in \mathbb{R}^n$ be such that $Ax = b$ and $Ay = b + z$. Show that
$$\frac{\|z\|_2}{\|A\|_2} \leq \|x - y\|_2 \leq \|A^{-1}\|_2\|z\|_2.$$

9. Let $A = [a_{ij}]$ be an $m$-by-$n$ matrix. Define
$$|||A|||_l = \max_{i,j}|a_{ij}|.$$
Is $|||\cdot|||_l$ a matrix norm? Give a reason for your answer.

10. Show that if $X \in \mathbb{C}^{n \times n}$ is nonsingular, then $\|A\|_X = \|X^{-1}AX\|_2$ defines a matrix norm.

11. Let $A = LDL^T \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix and
$$D = \mathrm{diag}(d_{11}, \cdots, d_{nn}).$$
Show that
$$\kappa_2(A) \geq \frac{\max_i\{d_{ii}\}}{\min_i\{d_{ii}\}}.$$

12. Verify that
$$\|xy^*\|_F = \|xy^*\|_2 = \|x\|_2\|y\|_2,$$
for any $x, y \in \mathbb{C}^n$.

13. Show that if $0 \neq v \in \mathbb{R}^n$ and $E \in \mathbb{R}^{n \times n}$, then
$$\left\|E\Big(I - \frac{vv^T}{v^Tv}\Big)\right\|_F^2 = \|E\|_F^2 - \frac{\|Ev\|_2^2}{v^Tv}.$$

14. Let A Rmn and x Rn . Show that

|A| |x| kAk |x|

where = max |xi |/ min |xi |.


i i

15. Prove the Sherman–Morrison–Woodbury formula. Let $U$, $V$ be $n$-by-$k$ rectangular matrices with $k \leq n$ and let $A$ be a nonsingular $n$-by-$n$ matrix. Then
$$T = I + V^TA^{-1}U$$
is nonsingular if and only if $A + UV^T$ is nonsingular. In this case, we have
$$(A + UV^T)^{-1} = A^{-1} - A^{-1}UT^{-1}V^TA^{-1}.$$
Chapter 4

Least Squares Problems

In this chapter, we study linear least squares problems:
$$\min_{y \in \mathbb{R}^n}\|Ay - b\|_2,$$
where the data matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ and the observation vector $b \in \mathbb{R}^m$ are given. We introduce some well-known orthogonal transformations and the QR decomposition for constructing efficient algorithms for these problems. For the literature on least squares problems, we refer to [15, 21, 42, 44, 45, 48].

4.1 Least squares problems


In practice, suppose we are given $m$ points $t_1, t_2, \cdots, t_m$ with data $y_1, y_2, \cdots, y_m$ on these points, and functions $\phi_1(t), \phi_2(t), \cdots, \phi_n(t)$ defined on these points. We then try to find $f(x, t)$ defined by
$$f(x, t) \equiv \sum_{j=1}^{n}x_j\phi_j(t)$$
such that the residuals defined by
$$r_i(x) \equiv y_i - f(x, t_i) = y_i - \sum_{j=1}^{n}x_j\phi_j(t_i), \qquad i = 1, 2, \cdots, m,$$
are as small as possible. In matrix form, we have
$$r(x) = b - Ax,$$
where
$$A = \begin{pmatrix} \phi_1(t_1) & \cdots & \phi_n(t_1) \\ \vdots & & \vdots \\ \phi_1(t_m) & \cdots & \phi_n(t_m) \end{pmatrix},$$
and
$$b = (y_1, \cdots, y_m)^T, \qquad x = (x_1, \cdots, x_n)^T, \qquad r(x) = (r_1(x), \cdots, r_m(x))^T.$$
When $m = n$, we can require that $r(x) = 0$, and $x$ can be found by solving the system $Ax = b$. When $m > n$, we require that $r(x)$ reach its minimum in the norm $\|\cdot\|_2$. We therefore introduce the following definition of the least squares problem.

Definition 4.1 Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Find $x \in \mathbb{R}^n$ such that
$$\|b - Ax\|_2 = \|r(x)\|_2 = \min_{y \in \mathbb{R}^n}\|r(y)\|_2 = \min_{y \in \mathbb{R}^n}\|b - Ay\|_2. \tag{4.1}$$
This is called the least squares (LS) problem and $r(x)$ is called the residual.

In the following, we only consider the case of
$$\mathrm{rank}(A) = n < m.$$

We first study the solutions $x$ of the following equation:
$$Ax = b, \qquad A \in \mathbb{R}^{m \times n}. \tag{4.2}$$
The range of the matrix $A$ is defined by
$$R(A) \equiv \{y \in \mathbb{R}^m : y = Ax,\ x \in \mathbb{R}^n\}.$$
It is easy to see that
$$R(A) = \mathrm{span}\{a_1, \cdots, a_n\},$$
where $a_i$, $i = 1, \cdots, n$, are the column vectors of $A$. The null space of $A$ is defined by
$$N(A) \equiv \{x \in \mathbb{R}^n : Ax = 0\}.$$
The dimension of $N(A)$ is denoted by $\mathrm{null}(A)$. The orthogonal complement of a subspace $S \subseteq \mathbb{R}^n$ is defined by
$$S^\perp \equiv \{y \in \mathbb{R}^n : y^Tx = 0, \text{ for all } x \in S\}.$$
We have the following theorems for (4.2).

Theorem 4.1 The equation (4.2) has a solution $\iff$ $\mathrm{rank}(A) = \mathrm{rank}([A, b])$.

Theorem 4.2 Let $x$ be a particular solution of (4.2). Then the solution set of (4.2) is given by $x + N(A)$.

Corollary 4.1 Assume that the equation (4.2) has a solution. The solution is unique $\iff$ $\mathrm{null}(A) = 0$.

We have the following essential theorem for the solution of (4.1).

Theorem 4.3 The LS problem (4.1) always has a solution. The solution is unique if and only if $\mathrm{null}(A) = 0$.

Proof: Since
$$\mathbb{R}^m = R(A) \oplus R(A)^\perp,$$
the vector $b$ can be expressed uniquely as
$$b = b_1 + b_2,$$
where $b_1 \in R(A)$ and $b_2 \in R(A)^\perp$. For any $x \in \mathbb{R}^n$, since $b_1 - Ax \in R(A)$ is orthogonal to $b_2$, we have
$$\|r(x)\|_2^2 = \|b - Ax\|_2^2 = \|(b_1 - Ax) + b_2\|_2^2 = \|b_1 - Ax\|_2^2 + \|b_2\|_2^2.$$
Note that $\|r(x)\|_2^2$ reaches its minimum if and only if $\|b_1 - Ax\|_2^2$ reaches its minimum. Since $b_1 \in R(A)$, $\|r(x)\|_2^2$ reaches its minimum if and only if
$$Ax = b_1,$$
i.e.,
$$\|b_1 - Ax\|_2^2 = 0.$$
Thus, by Corollary 4.1, we know that the solution of $Ax = b_1$ is unique, i.e., the solution of (4.1) is unique, if and only if
$$\mathrm{null}(A) = 0.$$

Let
$$\mathcal{X} = \{x \in \mathbb{R}^n : x \text{ is a solution of (4.1)}\}.$$
We have

Theorem 4.4 A vector $x \in \mathcal{X}$ if and only if
$$A^TAx = A^Tb. \tag{4.3}$$

Proof: Let $x \in \mathcal{X}$. By Theorem 4.3, we know that $Ax = b_1$, where $b_1 \in R(A)$, and
$$r(x) = b - Ax = b - b_1 = b_2 \in R(A)^\perp.$$
Therefore,
$$A^Tr(x) = A^Tb_2 = 0.$$
Substituting $r(x) = b - Ax$ into $A^Tr(x) = 0$, we obtain (4.3).
Conversely, let $x \in \mathbb{R}^n$ satisfy
$$A^TAx = A^Tb.$$
Then for any $y \in \mathbb{R}^n$, we have
$$\|b - A(x + y)\|_2^2 = \|b - Ax\|_2^2 - 2y^TA^T(b - Ax) + \|Ay\|_2^2 = \|b - Ax\|_2^2 + \|Ay\|_2^2 \geq \|b - Ax\|_2^2.$$
Thus, $x \in \mathcal{X}$.

We therefore have the following algorithm for LS problems:

(1) Compute $C = A^TA$ and $d = A^Tb$.

(2) Find the Cholesky factorization $C = LL^T$.

(3) Solve the triangular linear systems $Ly = d$ and $L^Tx = y$.

We remark that the computation of $A^TA$ usually costs $O(n^2m)$ operations, and some information in the matrix $A$ could be lost. For example, we consider
$$A = \begin{pmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{pmatrix}.$$
We have
$$A^TA = \begin{pmatrix} 1 + \epsilon^2 & 1 & 1 \\ 1 & 1 + \epsilon^2 & 1 \\ 1 & 1 & 1 + \epsilon^2 \end{pmatrix}.$$
Assume that $\epsilon = 10^{-3}$ and a 6-digit decimal floating point system is used. Then $1 + \epsilon^2 = 1 + 10^{-6}$ is rounded to 1, which means that the computed $A^TA$ is singular!
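The loss of information when forming $A^TA$ is easy to reproduce. Below we use single precision (unit roundoff about $6 \times 10^{-8}$) instead of a 6-digit decimal system; the Gram matrix becomes exactly singular while $A$ itself still has full column rank. This is only an illustration under that assumption.

    import numpy as np

    eps = 1e-4
    A = np.array([[1, 1, 1],
                  [eps, 0, 0],
                  [0, eps, 0],
                  [0, 0, eps]], dtype=np.float32)

    C = A.T @ A                          # 1 + eps**2 rounds to 1 in single precision
    print(np.linalg.matrix_rank(C))      # 1 -- the normal equations have lost the problem
    print(np.linalg.matrix_rank(A))      # 3 -- A still has full column rank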
We note that the solution $x$ of (4.3) can be expressed as
$$x = (A^TA)^{-1}A^Tb.$$
If we let
$$A^\dagger = (A^TA)^{-1}A^T,$$
then the LS solution $x$ can be written as
$$x = A^\dagger b.$$
Actually, the $n$-by-$m$ matrix $A^\dagger$ is the Moore–Penrose generalized inverse of $A$, which is unique, see [14, 17, 42]. In general, we have

Definition 4.2 Let $X \in \mathbb{R}^{n \times m}$. If it satisfies the following conditions:
$$AXA = A, \quad XAX = X, \quad (AX)^T = AX, \quad (XA)^T = XA,$$
then $X$ is called the Moore–Penrose generalized inverse of $A$, denoted by $A^\dagger$.

Now we develop the perturbation analysis of LS problems. Assume that there is a perturbation $\delta b$ of $b$, and let $x$ and $x + \delta x$ denote the solutions of the following LS problems, respectively:
$$\min_x\|b - Ax\|_2, \qquad \min_x\|(b + \delta b) - Ax\|_2.$$
Then
$$x = A^\dagger b,$$
and
$$x + \delta x = A^\dagger(b + \delta b) = A^\dagger\bar{b},$$
where $\bar{b} = b + \delta b$. We have

Theorem 4.5 Let $b_1$ and $\bar{b}_1$ denote the orthogonal projections of $b$ and $\bar{b}$ onto $R(A)$, respectively. If $b_1 \neq 0$, then
$$\frac{\|\delta x\|_2}{\|x\|_2} \leq \kappa_2(A)\frac{\|\delta b_1\|_2}{\|b_1\|_2},$$
where $\kappa_2(A) = \|A\|_2\|A^\dagger\|_2$ and $\bar{b}_1 = b_1 + \delta b_1$.

Proof: Let $b_2$ denote the orthogonal projection of $b$ onto $R(A)^\perp$. Then $b = b_1 + b_2$ and $A^Tb_2 = 0$. Note that
$$A^\dagger b = A^\dagger b_1 + A^\dagger b_2 = A^\dagger b_1 + (A^TA)^{-1}A^Tb_2 = A^\dagger b_1.$$

Similarly, $A^\dagger\bar{b} = A^\dagger\bar{b}_1$. Therefore,
$$\|\delta x\|_2 = \|A^\dagger\bar{b} - A^\dagger b\|_2 = \|A^\dagger(\bar{b}_1 - b_1)\|_2 \leq \|A^\dagger\|_2\|\bar{b}_1 - b_1\|_2 = \|A^\dagger\|_2\|\delta b_1\|_2. \tag{4.4}$$
Since $Ax = b_1$, we have
$$\|b_1\|_2 \leq \|A\|_2\|x\|_2. \tag{4.5}$$
By combining (4.4) and (4.5), the proof is complete.

We remark that the condition number $\kappa_2(A)$ is important for LS problems. When $\kappa_2(A)$ is large, we say that the LS problem is ill-conditioned. When $\kappa_2(A)$ is small, we say that the LS problem is well-conditioned.

Theorem 4.6 Suppose that the column vectors of $A$ are linearly independent. Then
$$\kappa_2(A)^2 = \kappa_2(A^TA).$$

Proof: By Theorem 3.3 (ii) and the given condition, we have
$$\|A\|_2^2 = \|A^TA\|_2, \qquad \|A^\dagger\|_2^2 = \|A^\dagger(A^\dagger)^T\|_2 = \|(A^TA)^{-1}\|_2.$$
Therefore,
$$\kappa_2(A)^2 = \|A\|_2^2\|A^\dagger\|_2^2 = \|A^TA\|_2\|(A^TA)^{-1}\|_2 = \kappa_2(A^TA).$$
4.2 Orthogonal transformations


In order to construct efficient algorithms for solving LS problems, we introduce some
well-known orthogonal transformations.

4.2.1 Householder transformation


We first introduce the following definition of Householder transformation.

Definition 4.3 Let $\omega \in \mathbb{R}^n$ with $\|\omega\|_2 = 1$. Define $H \in \mathbb{R}^{n \times n}$ as follows:
$$H = I - 2\omega\omega^T. \tag{4.6}$$
The matrix $H$ is called a Householder transformation.

Theorem 4.7 Let $H$ be defined as in (4.6). Then $H$ has the following properties:

(i) $H$ is a symmetric orthogonal matrix.

(ii) $H^2 = I$.

(iii) $H$ is called a reflection because $Hx$ is the reflection of $x \in \mathbb{R}^n$ in the plane through $0$ perpendicular to $\omega$.

Proof: We only prove (iii). Note that any vector $x \in \mathbb{R}^n$ can be expressed as
$$x = u + \alpha\omega,$$
where $u \in \mathrm{span}\{\omega\}^\perp$ and $\alpha \in \mathbb{R}$. By using $u^T\omega = 0$ and $\omega^T\omega = 1$, we have
$$Hx = (I - 2\omega\omega^T)(u + \alpha\omega) = u + \alpha\omega - 2\omega\omega^Tu - 2\alpha\omega\omega^T\omega = u - \alpha\omega.$$

Theorem 4.8 For any $0 \neq x \in \mathbb{R}^n$, we can construct a unit vector $\omega$ such that the Householder transformation defined as in (4.6) satisfies
$$Hx = \alpha e_1,$$
where $\alpha = \pm\|x\|_2$.

Proof: Note that $Hx = (I - 2\omega\omega^T)x = x - 2(\omega^Tx)\omega$. Let
$$\omega = \frac{x - \alpha e_1}{\|x - \alpha e_1\|_2}.$$
We then have
$$Hx = x - 2(\omega^Tx)\omega = x - \frac{2(\|x\|_2^2 - \alpha x_1)}{\|x - \alpha e_1\|_2^2}(x - \alpha e_1)
= \Big(1 - \frac{2(\|x\|_2^2 - \alpha x_1)}{\|x - \alpha e_1\|_2^2}\Big)x + \frac{2\alpha(\|x\|_2^2 - \alpha x_1)}{\|x - \alpha e_1\|_2^2}e_1, \tag{4.7}$$
where $x_1$ is the first component of the vector $x$. Setting the coefficient of $x$ to zero, we obtain the equation
$$1 - \frac{2(\|x\|_2^2 - \alpha x_1)}{\|x - \alpha e_1\|_2^2} = 0.$$
Solving this equation for $\alpha$, we have $\alpha = \pm\|x\|_2$. Substituting it into (4.7), we therefore have
$$Hx = \pm\|x\|_2e_1.$$

We remark that for any vector $0 \neq x \in \mathbb{R}^n$, by Theorem 4.8, one can construct a Householder matrix $H$ such that the last $n - 1$ components of $Hx$ are zeros. We can use the following two steps to construct the unit vector $\omega$ of $H$:

(1) compute $v = x - \alpha e_1$ with $\alpha = \pm\|x\|_2$;

(2) compute $\omega = v/\|v\|_2$.

Now a natural question is: how do we choose the sign in front of $\|x\|_2$? Usually, we choose
$$v = x + \mathrm{sign}(x_1)\|x\|_2e_1,$$
where $x_1 \neq 0$ is the first component of the vector $x$, see [38]. Since
$$H = I - 2\omega\omega^T = I - \frac{2}{v^Tv}vv^T = I - \beta vv^T,$$
where $\beta = 2/v^Tv$, we only need to compute $\beta$ and $v$ instead of forming $\omega$. Thus, we have the following algorithm.

Algorithm 4.1 (Householder transformation)

    function: [v, β] = house(x)
        n = length(x)
        σ = x(2 : n)^T x(2 : n)
        v(1) = x(1) + sign(x(1)) * sqrt(x(1)^2 + σ)
        v(2 : n) = x(2 : n)
        if σ = 0
            β = 0
        else
            β = 2/(v(1)^2 + σ)
        end
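A NumPy transcription of Algorithm 4.1 is sketched below; the function name house and the return convention follow the text, not any standard library, and we assume $x_1 \neq 0$ as in the discussion above.

    import numpy as np

    def house(x):
        """Return (v, beta) with (I - beta*v*v^T) x = -sign(x_1)*||x||_2 * e_1 (assumes x[0] != 0)."""
        x = np.asarray(x, dtype=float)
        sigma = np.dot(x[1:], x[1:])
        v = x.copy()
        v[0] = x[0] + np.sign(x[0]) * np.sqrt(x[0]**2 + sigma)
        beta = 0.0 if sigma == 0.0 else 2.0 / (v[0]**2 + sigma)
        return v, beta

    x = np.array([3.0, 1.0, 5.0, 1.0])
    v, beta = house(x)
    H = np.eye(4) - beta * np.outer(v, v)
    print(H @ x)          # approximately (-6, 0, 0, 0), i.e. -sign(x_1)*||x||_2 * e_1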
4.2.2 Givens rotation

A Givens rotation is defined as follows:
$$G(i, k, \theta) = I + s(e_ie_k^T - e_ke_i^T) + (c - 1)(e_ie_i^T + e_ke_k^T),$$
i.e., $G(i, k, \theta)$ agrees with the identity matrix except for the four entries
$$\begin{pmatrix} g_{ii} & g_{ik} \\ g_{ki} & g_{kk} \end{pmatrix} = \begin{pmatrix} c & s \\ -s & c \end{pmatrix},$$
where $c = \cos\theta$ and $s = \sin\theta$. It is easy to prove that $G(i, k, \theta)$ is an orthogonal matrix. Let $x \in \mathbb{R}^n$ and $y = G(i, k, \theta)x$. We then have
$$y_i = cx_i + sx_k, \qquad y_k = -sx_i + cx_k, \qquad y_j = x_j, \quad j \neq i, k.$$
If we want to make $y_k = 0$, then we only need to take
$$c = \frac{x_i}{\sqrt{x_i^2 + x_k^2}}, \qquad s = \frac{x_k}{\sqrt{x_i^2 + x_k^2}}.$$
Therefore,
$$y_i = \sqrt{x_i^2 + x_k^2}, \qquad y_k = 0.$$
We remark that for any vector $0 \neq x \in \mathbb{R}^n$, one can construct a Givens rotation $G(i, k, \theta)$ acting on $x$ so as to zero out a chosen component of $x$.
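The construction of $(c, s)$ just described can be sketched as follows; the helper name givens is ours.

    import numpy as np

    def givens(xi, xk):
        # return (c, s) with c*xi + s*xk = sqrt(xi^2 + xk^2) and -s*xi + c*xk = 0
        r = np.hypot(xi, xk)              # sqrt(xi^2 + xk^2) without overflow
        if r == 0.0:
            return 1.0, 0.0               # nothing to rotate
        return xi / r, xk / r

    xi, xk = 3.0, 4.0
    c, s = givens(xi, xk)
    G = np.array([[c, s], [-s, c]])
    print(G @ np.array([xi, xk]))         # [5, 0]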

4.3 QR decomposition
Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. By Theorem 3.3 (iii), for any orthogonal matrix $Q$, we have
$$\|Ax - b\|_2 = \|Q^T(Ax - b)\|_2.$$
Therefore, the LS problem
$$\min_x\|Q^TAx - Q^Tb\|_2$$
is equivalent to (4.1). We wish to find a suitable orthogonal matrix $Q$ such that the original LS problem becomes an easily solvable LS problem. We have

Theorem 4.9 (QR decomposition) Let $A \in \mathbb{R}^{m \times n}$ ($m \geq n$). Then $A$ has a QR decomposition:
$$A = Q\begin{pmatrix} R \\ 0 \end{pmatrix}, \tag{4.8}$$
where $Q \in \mathbb{R}^{m \times m}$ is an orthogonal matrix and $R \in \mathbb{R}^{n \times n}$ is an upper triangular matrix with nonnegative diagonal entries. The decomposition is unique when $m = n$ and $A$ is nonsingular.

Proof: We use induction on $n$. When $n = 1$, the result is true by Theorem 4.8. Now, we assume that the theorem is true for all matrices in $\mathbb{R}^{p \times (n-1)}$ with $p \geq n - 1$. Let the first column of $A \in \mathbb{R}^{m \times n}$ be $a_1$. By Theorem 4.8 again, there exists an orthogonal matrix $Q_1 \in \mathbb{R}^{m \times m}$ such that
$$Q_1^Ta_1 = \|a_1\|_2e_1.$$
Therefore, we have
$$Q_1^TA = \begin{pmatrix} \|a_1\|_2 & v^T \\ 0 & A_1 \end{pmatrix}.$$
For the matrix $A_1 \in \mathbb{R}^{(m-1) \times (n-1)}$, we obtain by the induction assumption,
$$A_1 = Q_2\begin{pmatrix} R_2 \\ 0 \end{pmatrix},$$
where $Q_2 \in \mathbb{R}^{(m-1) \times (m-1)}$ is an orthogonal matrix and $R_2$ is an upper triangular matrix with nonnegative diagonal entries. Thus, let
$$Q = Q_1\begin{pmatrix} 1 & 0 \\ 0 & Q_2 \end{pmatrix}, \qquad \begin{pmatrix} R \\ 0 \end{pmatrix} = \begin{pmatrix} \|a_1\|_2 & v^T \\ 0 & R_2 \\ 0 & 0 \end{pmatrix}.$$
Then $Q$ and $R$ are matrices satisfying the conditions of the theorem.
When $A \in \mathbb{R}^{m \times m}$ is nonsingular, we want to show that the QR decomposition is unique. Let
$$A = QR = \widetilde{Q}\widetilde{R},$$
where $Q, \widetilde{Q} \in \mathbb{R}^{m \times m}$ are orthogonal matrices, and $R, \widetilde{R} \in \mathbb{R}^{m \times m}$ are upper triangular matrices with nonnegative diagonal entries. Since $A$ is nonsingular, we know that the diagonal entries of $R$ and $\widetilde{R}$ are positive. Therefore, the matrix
$$\widetilde{Q}^TQ = \widetilde{R}R^{-1}$$
is both orthogonal and upper triangular with positive diagonal entries. Thus
$$\widetilde{Q}^TQ = \widetilde{R}R^{-1} = I,$$
i.e.,
$$\widetilde{Q} = Q, \qquad \widetilde{R} = R.$$

A complex version of the QR decomposition is needed later on.

Corollary 4.2 Let $A \in \mathbb{C}^{m \times n}$ ($m \geq n$). Then $A$ has a QR decomposition:
$$A = Q\begin{pmatrix} R \\ 0 \end{pmatrix},$$
where $Q \in \mathbb{C}^{m \times m}$ is a unitary matrix and $R \in \mathbb{C}^{n \times n}$ is an upper triangular matrix with nonnegative diagonal entries. The decomposition is unique when $m = n$ and $A$ is nonsingular.

Now we use the QR decomposition to solve the LS problem (4.1). Suppose that $A \in \mathbb{R}^{m \times n}$ ($m \geq n$) has linearly independent columns, $b \in \mathbb{R}^m$, and $A$ has a QR decomposition (4.8). Let $Q$ be partitioned as
$$Q = [Q_1\ Q_2],$$
and
$$Q^Tb = \begin{pmatrix} Q_1^T \\ Q_2^T \end{pmatrix}b = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}.$$
Then
$$\|Ax - b\|_2^2 = \|Q^TAx - Q^Tb\|_2^2 = \|Rx - c_1\|_2^2 + \|c_2\|_2^2.$$
Hence $x$ is the solution of the LS problem (4.1) if and only if it is the solution of $Rx = c_1$. Note that it is much easier to get the solution of (4.1) by solving $Rx = c_1$, since $R$ is an upper triangular matrix. We have the following algorithm for LS problems:

(1) Compute a QR decomposition of $A$.

(2) Compute $c_1 = Q_1^Tb$.

(3) Solve the upper triangular system $Rx = c_1$.
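The three-step QR approach can be written compactly with NumPy; the sketch below relies on numpy.linalg.qr for step (1) rather than the Householder algorithm developed below, and checks the result against the normal equations characterization of Theorem 4.4.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 5
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    Q1, R = np.linalg.qr(A)                       # reduced QR: Q1 is m-by-n, R is n-by-n
    c1 = Q1.T @ b                                 # step (2)
    x = np.linalg.solve(R, c1)                    # step (3): solve Rx = c1

    print(np.allclose(A.T @ A @ x, A.T @ b))      # True: x satisfies A^T A x = A^T b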
Finally, we discuss how to use Householder transformations to compute the QR decomposition of $A$. Let $m = 7$ and $n = 5$. Assume that we have already found Householder transformations $H_1$ and $H_2$ such that
$$H_2H_1A = \begin{pmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & 0 & + & + & + \\ 0 & 0 & + & + & + \\ 0 & 0 & + & + & + \\ 0 & 0 & + & + & + \\ 0 & 0 & + & + & + \end{pmatrix}.$$
Now we construct a Householder transformation $\widetilde{H}_3 \in \mathbb{R}^{5 \times 5}$ such that
$$\widetilde{H}_3\begin{pmatrix} + \\ + \\ + \\ + \\ + \end{pmatrix} = \begin{pmatrix} \times \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}.$$
Let $H_3 = \mathrm{diag}(I_2, \widetilde{H}_3)$. We obtain
$$H_3H_2H_1A = \begin{pmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times & \times \end{pmatrix}.$$
In general, after $n$ such steps, we reduce the matrix $A$ to the form
$$H_nH_{n-1}\cdots H_1A = \begin{pmatrix} R \\ 0 \end{pmatrix},$$
where $R$ is an upper triangular matrix with nonnegative diagonal entries. By setting $Q = H_1\cdots H_n$, we obtain
$$A = Q\begin{pmatrix} R \\ 0 \end{pmatrix}.$$
Thus, we have the following algorithm.

Algorithm 4.2 (QR decomposition: Householder transformation)

    for j = 1 : n
        [v, β] = house(A(j : m, j))
        A(j : m, j : n) = (I_{m-j+1} - βvv^T) A(j : m, j : n)
        if j < m
            A(j + 1 : m, j) = v(2 : m - j + 1)
        end
    end
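A dense NumPy sketch of Algorithm 4.2 follows. It overwrites a copy of $A$ with $R$ above the diagonal and stores the essential parts of the Householder vectors below it, exactly as the pseudocode does; the compact house defined here repeats the sketch given after Algorithm 4.1 so that the block is self-contained.

    import numpy as np

    def house(x):
        # Householder vector (Algorithm 4.1); assumes x[0] != 0
        x = np.asarray(x, dtype=float)
        sigma = np.dot(x[1:], x[1:])
        v = x.copy()
        v[0] = x[0] + np.sign(x[0]) * np.sqrt(x[0]**2 + sigma)
        beta = 0.0 if sigma == 0.0 else 2.0 / (v[0]**2 + sigma)
        return v, beta

    def householder_qr(A):
        # overwrite A with R (upper triangle) and the Householder vectors (below the diagonal)
        A = np.array(A, dtype=float)
        m, n = A.shape
        betas = []
        for j in range(n):
            v, beta = house(A[j:, j])
            A[j:, j:] -= beta * np.outer(v, v @ A[j:, j:])   # apply I - beta v v^T
            if j < m - 1:
                A[j + 1:, j] = v[1:]
            betas.append(beta)
        return A, betas

    B = np.array([[4.0, 1.0], [2.0, 3.0], [2.0, 5.0]])
    F, betas = householder_qr(B)
    print(np.triu(F)[:2, :])              # the triangular factor R (up to signs)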
We remark that the QR decomposition is not only a basic tool for solving LS
problems but also an important tool for solving some other fundamental problems in
NLA.

Exercises:
1. Let $A \in \mathbb{R}^{m \times n}$ have full column rank. Prove that $A + E$ also has full column rank if $E$ satisfies $\|E\|_2\|A^\dagger\|_2 < 1$, where $A^\dagger = (A^TA)^{-1}A^T$.

2. Let $U = [u_{ij}]$ be a nonsingular upper triangular matrix. Show that
$$\kappa_\infty(U) \geq \frac{\max_i|u_{ii}|}{\min_i|u_{ii}|},$$
where $\kappa_\infty(U) = \|U\|_\infty\|U^{-1}\|_\infty$.

3. Let $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ have full column rank. Show that
$$\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}\begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}$$
has a solution where $x$ minimizes $\|Ax - b\|_2$.

4. Let $x \in \mathbb{R}^n$ and $P$ be a Householder transformation such that $Px = \|x\|_2e_1$. Let $G_{12}, G_{23}, \cdots, G_{n-1,n}$ be Givens rotations, and let $Q = G_{12}G_{23}\cdots G_{n-1,n}$. Suppose $Qx = \|x\|_2e_1$. Is $P$ equal to $Q$? Give a proof or a counterexample.

5. Let $A \in \mathbb{R}^{m \times n}$. Show that $X = A^\dagger$ minimizes $\|AX - I\|_F$ over all $X \in \mathbb{R}^{n \times m}$. What is the minimum?

6. Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{C}^2$. Find an algorithm to compute the unitary matrix
$$Q = \begin{pmatrix} c & \bar{s} \\ -s & c \end{pmatrix}, \qquad c \in \mathbb{R},\ c^2 + |s|^2 = 1,$$
such that $Qx = \begin{pmatrix} \alpha \\ 0 \end{pmatrix}$.

7. Suppose an $m$-by-$n$ matrix $A$ has the form
$$A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix},$$
where $A_1$ is an $n$-by-$n$ nonsingular matrix and $A_2$ is an $(m-n)$-by-$n$ arbitrary matrix. Prove that $\|A^\dagger\|_2 \leq \|A_1^{-1}\|_2$.

8. Consider the following well-known ill-conditioned matrix
$$A = \begin{pmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{pmatrix}, \qquad |\epsilon| \ll 1.$$
(a) Choose a small $\epsilon$ such that $\mathrm{rank}(A) = 3$. Then compute $\kappa_2(A)$ to show that $A$ is ill-conditioned.
(b) Find the LS solution with $A$ given as above and $b = (3, \epsilon, \epsilon, \epsilon)^T$ by using
(i) the normal equations method;
(ii) the QR method.

9. Let $A = BC$ where $B \in \mathbb{C}^{m \times r}$ and $C \in \mathbb{C}^{r \times n}$ with
$$r = \mathrm{rank}(A) = \mathrm{rank}(B) = \mathrm{rank}(C).$$
Show that
$$A^\dagger = C^*(CC^*)^{-1}(B^*B)^{-1}B^*.$$

10. Let $A = U\Sigma V^* \in \mathbb{C}^{m \times n}$, where $U \in \mathbb{C}^{m \times n}$ satisfies $U^*U = I$, $V \in \mathbb{C}^{n \times n}$ satisfies $V^*V = I$ and $\Sigma$ is an $n$-by-$n$ diagonal matrix. Show that
$$A^\dagger = V\Sigma^\dagger U^*.$$

11. Prove that
$$A^\dagger = \lim_{\mu \to 0}(A^*A + \mu I)^{-1}A^* = \lim_{\mu \to 0}A^*(AA^* + \mu I)^{-1}.$$

12. Show that
$$R(A) \cap N(A^*) = \{0\}.$$

13. Let $A = [a_{ij}] \in \mathbb{C}^{n \times n}$ be idempotent. Then
$$R(A) \oplus N(A) = \mathbb{C}^n, \qquad \mathrm{rank}(A) = \sum_{i=1}^{n}a_{ii}.$$

14. Let $A \in \mathbb{C}^{m \times n}$. Prove that
$$R(AA^\dagger) = R(AA^*) = R(A),$$
$$R(A^\dagger A) = R(A^*A) = R(A^\dagger) = R(A^*),$$
$$N(AA^\dagger) = N(AA^*) = N(A^\dagger) = N(A^*),$$
$$N(A^\dagger A) = N(A^*A) = N(A).$$
Therefore $AA^\dagger$ and $A^\dagger A$ are orthogonal projectors.

15. Prove Corollary 4.2.
Chapter 5

Classical Iterative Methods

We study classical iterative methods for the solution of Ax = b. Iterative methods,


originally proposed by Gauss in 1823, Liouville in 1837, and Jacobi in 1845, are quite
different from direct methods such as Gaussian elimination, see [2].
Direct methods based on an LU factorization of A become prohibitive in terms of
computing time and computer storage if the matrix A is quite large. In some practical
situation such as the discretization of partial differential equations, the matrix size can
be as large as several hundreds of thousands. For such problems, direct methods become
impractical. Furthermore, most large problems are sparse, and usually the sparsity is
lost during LU factorizations. Therefore, we have to face a very large matrix with
many nonzero entries at the end of LU factorizations, and then the storage becomes a
crucial issue. For such kind of problems, we can use a class of methods called iterative
methods. In this chapter, we only consider some classical iterative methods.
We remark that the disadvantage of classical iterative methods is that the convergence rate may be slow, or the iteration may even diverge, and a suitable stopping criterion needs to be chosen.

5.1 Jacobi and Gauss-Seidel method


5.1.1 Jacobi method
Consider the following linear system
$$Ax = b,$$
where $A = [a_{ij}] \in \mathbb{R}^{n \times n}$. We can write the matrix $A$ in the following form:
$$A = D - L - U,$$
where
$$D = \mathrm{diag}(a_{11}, a_{22}, \cdots, a_{nn}),$$
$$L = -\begin{pmatrix} 0 & & & \\ a_{21} & 0 & & \\ \vdots & \ddots & \ddots & \\ a_{n1} & \cdots & a_{n,n-1} & 0 \end{pmatrix}, \qquad
U = -\begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ & 0 & \ddots & \vdots \\ & & \ddots & a_{n-1,n} \\ & & & 0 \end{pmatrix}.$$
Then it is easy to see that
$$x = B_Jx + g,$$
where
$$B_J = D^{-1}(L + U), \qquad g = D^{-1}b.$$
The matrix $B_J$ is called the Jacobi iteration matrix. The corresponding iteration
$$x_k = B_Jx_{k-1} + g, \qquad k = 1, 2, \cdots, \tag{5.1}$$
is known as the Jacobi method, given an initial vector $x_0 = \big(x_1^{(0)}, x_2^{(0)}, \cdots, x_n^{(0)}\big)^T$.

5.1.2 Gauss-Seidel method

In the Jacobi method, to compute the components of the vector
$$x_{k+1} = \big(x_1^{(k+1)}, x_2^{(k+1)}, \cdots, x_n^{(k+1)}\big)^T,$$
only the components of the vector $x_k$ are used. However, note that to compute $x_i^{(k+1)}$, we could use $x_1^{(k+1)}, x_2^{(k+1)}, \cdots, x_{i-1}^{(k+1)}$, which are already available. Thus a natural modification of the Jacobi method is to rewrite the Jacobi iteration (5.1) in the following form:
$$x_k = (D - L)^{-1}Ux_{k-1} + (D - L)^{-1}b, \qquad k = 1, 2, \cdots. \tag{5.2}$$
The idea is to use each new component as soon as it is available in the computation of the next component. The iteration (5.2) is known as the Gauss-Seidel method.
Note that the matrix $D - L$ is a lower triangular matrix with $a_{11}, \cdots, a_{nn}$ on the diagonal. Because these entries are assumed to be nonzero, the matrix $D - L$ is nonsingular. The matrix
$$B_{GS} = (D - L)^{-1}U$$
is called the Gauss-Seidel iteration matrix.
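Both iterations are easy to sketch with NumPy using the splitting $A = D - L - U$; the test matrix below is a strictly diagonally dominant example chosen only for illustration.

    import numpy as np

    def jacobi(A, b, x0, iters=50):
        D = np.diag(np.diag(A))
        LU = D - A                               # L + U in the book's notation
        x = x0.copy()
        for _ in range(iters):
            x = np.linalg.solve(D, LU @ x + b)   # x_k = D^{-1}((L+U) x_{k-1} + b)
        return x

    def gauss_seidel(A, b, x0, iters=50):
        DL = np.tril(A)                          # D - L
        U = DL - A                               # so that A = (D - L) - U
        x = x0.copy()
        for _ in range(iters):
            x = np.linalg.solve(DL, U @ x + b)   # x_k = (D-L)^{-1}(U x_{k-1} + b)
        return x

    A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    x0 = np.zeros(3)
    print(jacobi(A, b, x0), gauss_seidel(A, b, x0), np.linalg.solve(A, b))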
5.2 Convergence analysis

5.2.1 Convergence theorems

It is often hard to make a good initial approximation $x_0$. Thus, it is desirable to have conditions that guarantee the convergence of the Jacobi and Gauss-Seidel methods for an arbitrary choice of the initial approximation.
Both the Jacobi iteration and the Gauss-Seidel iteration can be expressed as
$$x_{k+1} = Bx_k + g, \qquad k = 0, 1, \cdots. \tag{5.3}$$
For the Jacobi iteration, we have
$$B_J = D^{-1}(L + U), \qquad g = D^{-1}b;$$
and for the Gauss-Seidel iteration, we have
$$B_{GS} = (D - L)^{-1}U, \qquad g = (D - L)^{-1}b.$$
The iteration (5.3) is called a linear stationary iteration, where $B \in \mathbb{R}^{n \times n}$ is called the iteration matrix, $g \in \mathbb{R}^n$ the constant term, and $x_0 \in \mathbb{R}^n$ the initial vector. In the following, we give a convergence theorem.

Theorem 5.1 The iteration (5.3) converges for an arbitrary initial guess $x_0$ if and only if $B^k \to 0$ as $k \to \infty$.

Proof: From $x = Bx + g$ and $x_{k+1} = Bx_k + g$, we have
$$x - x_{k+1} = B(x - x_k). \tag{5.4}$$
Because this is true for any value of $k$, we can write
$$x - x_k = B(x - x_{k-1}). \tag{5.5}$$
Substituting (5.5) into (5.4), we have
$$x - x_{k+1} = B^2(x - x_{k-1}).$$
Continuing this process $k$ times, we can write
$$x - x_{k+1} = B^{k+1}(x - x_0).$$
This shows that $\{x_k\}$ converges to the solution $x$ for any choice of $x_0$ if and only if $B^k \to 0$ as $k \to \infty$.
Recall that $B^k \to 0$ as $k \to \infty$ if and only if the spectral radius $\rho(B) < 1$. Since $|\lambda_i| \leq \|B\|$, a convenient way to check whether $\rho(B) < 1$ is to check whether $\|B\| < 1$ in the row-sum or column-sum norm. Note that the converse is not true. Combining the result of Theorem 5.1 with the above observation, we have the following theorem.

Theorem 5.2 The iteration (5.3) converges for any choice of $x_0$ if and only if $\rho(B) < 1$. Moreover, if $\|B\| < 1$ for some matrix norm, then the iteration (5.3) converges.

Let us consider the examples:
$$A_1 = \begin{pmatrix} 1 & 2 & -2 \\ 1 & 1 & 1 \\ 2 & 2 & 1 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 2 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & -2 \end{pmatrix}.$$
It is easy to verify that for $A_1$, the Jacobi method converges even though the Gauss-Seidel method does not. For $A_2$, the Jacobi method diverges while the Gauss-Seidel method converges.
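The claim about $A_1$ and $A_2$ can be checked numerically by comparing the spectral radii of the Jacobi and Gauss-Seidel iteration matrices ($\rho < 1$ means convergence by Theorem 5.2); the sketch below uses the matrices as written above.

    import numpy as np

    def spectral_radii(A):
        D = np.diag(np.diag(A))
        DL = np.tril(A)                                   # D - L
        B_J = np.linalg.solve(D, D - A)                   # D^{-1}(L + U)
        B_GS = np.linalg.solve(DL, DL - A)                # (D - L)^{-1} U
        return max(abs(np.linalg.eigvals(B_J))), max(abs(np.linalg.eigvals(B_GS)))

    A1 = np.array([[1.0, 2.0, -2.0], [1.0, 1.0, 1.0], [2.0, 2.0, 1.0]])
    A2 = np.array([[2.0, -1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, -2.0]])
    print(spectral_radii(A1))   # (0.0, 2.0):  Jacobi converges, Gauss-Seidel diverges
    print(spectral_radii(A2))   # (~1.118, 0.5): Jacobi diverges, Gauss-Seidel converges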

5.2.2 Sufficient conditions for convergence

We now apply Theorem 5.2 to give a sequence of criteria that guarantee the convergence of the Jacobi and (or) Gauss-Seidel methods for any choice of the initial approximation $x_0$.

Definition 5.1 Let $A = [a_{ij}] \in \mathbb{R}^{n \times n}$. If
$$|a_{ii}| > \sum_{\substack{j=1 \\ j \neq i}}^{n}|a_{ij}|, \qquad i = 1, 2, \cdots, n,$$
then the matrix $A$ is called strictly diagonally dominant. The matrix $A$ is called weakly diagonally dominant if, for $i = 1, 2, \cdots, n$,
$$|a_{ii}| \geq \sum_{\substack{j=1 \\ j \neq i}}^{n}|a_{ij}|$$
with at least one strict inequality.

Theorem 5.3 If $A$ is strictly diagonally dominant, then the Jacobi method converges for any initial approximation $x_0$.
Proof: Because $A = [a_{ij}]$ is strictly diagonally dominant, we have
$$|a_{ii}| > \sum_{\substack{j=1 \\ j \neq i}}^{n}|a_{ij}|, \qquad i = 1, 2, \cdots, n.$$
Recall that the Jacobi iteration matrix
$$B_J = D^{-1}(L + U)$$
is given by
$$B_J = -\begin{pmatrix} 0 & \frac{a_{12}}{a_{11}} & \cdots & \frac{a_{1n}}{a_{11}} \\ \frac{a_{21}}{a_{22}} & 0 & \cdots & \frac{a_{2n}}{a_{22}} \\ \vdots & & \ddots & \vdots \\ \frac{a_{n1}}{a_{nn}} & \cdots & \frac{a_{n,n-1}}{a_{nn}} & 0 \end{pmatrix}.$$
The absolute row sum of each row is less than 1, which means
$$\|B_J\|_\infty < 1.$$
Thus, by Theorem 5.2, the Jacobi method converges.

Theorem 5.4 If $A$ is strictly diagonally dominant, then the Gauss-Seidel method converges for any initial approximation $x_0$.

Proof: The Gauss-Seidel iteration matrix is given by
$$B_{GS} = (D - L)^{-1}U.$$
Let $\lambda$ be an eigenvalue of this matrix and $x = (x_1, x_2, \cdots, x_n)^T$ a corresponding eigenvector, normalized so that its largest component has magnitude 1. Then from
$$B_{GS}x = \lambda x,$$
we have
$$Ux = \lambda(D - L)x,$$
i.e.,
$$-\sum_{j=i+1}^{n}a_{ij}x_j = \lambda\sum_{j=1}^{i}a_{ij}x_j, \qquad 1 \leq i \leq n,$$
which can be rewritten as
$$\lambda a_{ii}x_i = -\lambda\sum_{j=1}^{i-1}a_{ij}x_j - \sum_{j=i+1}^{n}a_{ij}x_j, \qquad 1 \leq i \leq n. \tag{5.6}$$
Let $x_k$ be the component of $x$ with magnitude 1. Then by (5.6), we have
$$|\lambda|\,|a_{kk}| \leq |\lambda|\sum_{j=1}^{k-1}|a_{kj}| + \sum_{j=k+1}^{n}|a_{kj}|,$$
i.e.,
$$|\lambda|\Big(|a_{kk}| - \sum_{j=1}^{k-1}|a_{kj}|\Big) \leq \sum_{j=k+1}^{n}|a_{kj}|,$$
or
$$|\lambda| \leq \frac{\sum_{j=k+1}^{n}|a_{kj}|}{|a_{kk}| - \sum_{j=1}^{k-1}|a_{kj}|}. \tag{5.7}$$
Since $A$ is strictly diagonally dominant, we have
$$|a_{kk}| - \sum_{j=1}^{k-1}|a_{kj}| > \sum_{j=k+1}^{n}|a_{kj}|.$$
Thus, from (5.7), we conclude that $|\lambda| < 1$, i.e., $\rho(B_{GS}) < 1$. By Theorem 5.2, the Gauss-Seidel method converges.

We now discuss the convergence of the Jacobi and Gauss-Seidel methods for symmetric positive definite matrices.

Theorem 5.5 Let $A$ be symmetric with diagonal entries $a_{ii} > 0$, $i = 1, 2, \cdots, n$. Then the Jacobi method converges if and only if both $A$ and $2D - A$ are positive definite.

Proof: Since
$$B_J = D^{-1}(L + U) = D^{-1}(D - A) = I - D^{-1}A,$$
and
$$D = \mathrm{diag}(a_{11}, a_{22}, \cdots, a_{nn})$$
with $a_{ii} > 0$, $i = 1, 2, \cdots, n$, we have
$$B_J = I - D^{-1}A = D^{-1/2}(I - D^{-1/2}AD^{-1/2})D^{1/2}.$$
It is easy to see that $I - D^{-1/2}AD^{-1/2}$ is symmetric and similar to $B_J$. Hence the eigenvalues of $B_J$ are real.
Now, we first suppose that the Jacobi method converges; then $\rho(B_J) < 1$ by Theorem 5.2. The absolute values of the eigenvalues of
$$I - D^{-1/2}AD^{-1/2}$$
are less than 1, i.e., the eigenvalues of $D^{-1/2}AD^{-1/2}$ lie in $(0, 2)$. Thus $A$ is positive definite. On the other hand, the eigenvalues of $2I - D^{-1/2}AD^{-1/2}$ are positive, so the matrix
$$2I - D^{-1/2}AD^{-1/2}$$
is positive definite. Since
$$D^{-1/2}(2D - A)D^{-1/2} = 2I - D^{-1/2}AD^{-1/2},$$
we know that $2D - A$ is positive definite too.
Conversely, since
$$D^{1/2}(I - B_J)D^{-1/2} = D^{-1/2}AD^{-1/2}$$
and $A$ is positive definite, it follows that the eigenvalues of $I - B_J$ are positive, i.e., the eigenvalues of $B_J$ are less than 1. Because $2D - A$ is positive definite and
$$D^{-1/2}(2D - A)D^{-1/2} = D^{1/2}(I + B_J)D^{-1/2},$$
we can deduce that the eigenvalues of $I + B_J$ are positive, i.e., the eigenvalues of $B_J$ are greater than $-1$. Thus $\rho(B_J) < 1$. By Theorem 5.2 again, the Jacobi method converges.

Theorem 5.6 Let $A$ be a symmetric positive definite matrix. Then the Gauss-Seidel method converges for any initial approximation $x_0$.

Proof: Let $\lambda$ be an eigenvalue of the iteration matrix $B_{GS}$ and $u$ a corresponding eigenvector. Then
$$(D - L)^{-1}Uu = \lambda u.$$
Since $A$ is symmetric positive definite, we have $U = L^T$ and
$$\lambda(D - L)u = L^Tu.$$
Therefore,
$$\lambda u^*(D - L)u = u^*L^Tu.$$
Let
$$u^*Du = \sigma, \qquad u^*Lu = \alpha + i\beta.$$
We then have
$$u^*L^Tu = (Lu)^*u = \overline{u^*Lu} = \alpha - i\beta.$$
Thus
$$\lambda[\sigma - (\alpha + i\beta)] = \alpha - i\beta.$$
Taking the modulus of both sides, we have
$$|\lambda|^2 = \frac{\alpha^2 + \beta^2}{(\sigma - \alpha)^2 + \beta^2}.$$
On the other hand,
$$0 < u^*Au = u^*(D - L - L^T)u = \sigma - 2\alpha.$$
Hence, since $\sigma = u^*Du > 0$,
$$(\sigma - \alpha)^2 + \beta^2 = \sigma^2 - 2\sigma\alpha + \alpha^2 + \beta^2 = \sigma(\sigma - 2\alpha) + \alpha^2 + \beta^2 > \alpha^2 + \beta^2.$$
So we get $|\lambda| < 1$. By Theorem 5.2, the Gauss-Seidel method converges.

Definition 5.2 A matrix $A$ is called irreducible if there is no permutation matrix $P$ such that
$$PAP^T = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix},$$
where $A_{11}$ and $A_{22}$ are square matrices.

We have the following lemma, see [41].

Lemma 5.1 If a matrix $A$ is strictly diagonally dominant, or irreducible and weakly diagonally dominant, then $A$ is nonsingular.

Furthermore,

Theorem 5.7 We have

(i) If $A$ is strictly diagonally dominant, then both the Jacobi and the Gauss-Seidel methods converge. In fact,
$$\|B_{GS}\|_\infty \leq \|B_J\|_\infty < 1.$$

(ii) If $A$ is irreducible and weakly diagonally dominant, then both the Jacobi and the Gauss-Seidel methods converge. Moreover,
$$\rho(B_{GS}) < \rho(B_J) < 1.$$

For the proof of this theorem, we refer to [14, 41].

5.3 Convergence rate

We consider a stationary linear iteration method,
$$x_{k+1} = Mx_k + g, \qquad k = 0, 1, \cdots,$$
where $M$ is an $n$-by-$n$ matrix. If $I - M$ is nonsingular, then there exists a unique solution $x^*$ of
$$(I - M)x = g.$$
The error vectors $y_k$ are defined as $y_k = x_k - x^*$, for $k = 0, 1, \cdots$, and then
$$y_k = My_{k-1} = \cdots = M^ky_0.$$
Using matrix and vector norms, we have
$$\|y_k\| \leq \|M^k\|\,\|y_0\|,$$
with equality possible for each $k$ for some vector $y_0$. Thus, if $y_0$ is not the zero vector, then $\|M^k\|$ gives us a sharp upper-bound estimate for the ratio $\|y_k\|/\|y_0\|$. Since the initial vector $y_0$ is unknown in practical problems, $\|M^k\|$ serves as a measure for the comparison of different iterative methods.

Definition 5.3 Let $M$ be an $n$-by-$n$ iteration matrix. If $\|M^k\| < 1$ for some positive integer $k$, then
$$R_k(M) \equiv -\ln[(\|M^k\|)^{1/k}] = -\frac{\ln\|M^k\|}{k}$$
is called the average rate of convergence for $k$ iterations of $M$.

In terms of actual computations, the significance of the average rate of convergence $R_k(M)$ is given as follows. Clearly, the quantity
$$\sigma = \Big(\frac{\|y_k\|}{\|y_0\|}\Big)^{1/k}$$
is the average reduction factor per iteration for the norm of the error. If $\|M^k\| < 1$, then by Definition 5.3, we have
$$\sigma \leq (\|M^k\|)^{1/k} = e^{-R_k(M)},$$
where $e$ is the base of the natural logarithm.
If $M$ is symmetric (or Hermitian, or normal, i.e., $M^*M = MM^*$), then by using the spectral radius of the iteration matrix $M$, we have
$$\|M^k\|_2 = [\rho(M)]^k,$$
and thus,
$$R_k(M) = -\ln\rho(M),$$
which is independent of $k$.
Next we will consider the asymptotic convergence rate
$$R_\infty(M) \equiv \lim_{k \to \infty}R_k(M).$$

Theorem 5.8 We have
$$R_\infty(M) = -\ln\rho(M).$$

Proof: We only need to prove that
$$\lim_{k \to \infty}\|M^k\|^{1/k} = \rho(M).$$
Since
$$[\rho(M)]^k = \rho(M^k) \leq \|M^k\|,$$
we have
$$\rho(M) \leq \|M^k\|^{1/k}.$$
On the other hand, for any $\epsilon > 0$, consider the matrix
$$B_\epsilon = \frac{1}{\rho(M) + \epsilon}M.$$
It is obvious that $\rho(B_\epsilon) < 1$ and then $\lim_{k \to \infty}B_\epsilon^k = 0$. Hence, there exists a natural number $K$ such that, for $k \geq K$,
$$\|B_\epsilon^k\| \leq 1,$$
i.e.,
$$\|M^k\| \leq [\rho(M) + \epsilon]^k.$$
Thus,
$$\rho(M) \leq \|M^k\|^{1/k} \leq \rho(M) + \epsilon,$$
which means
$$\lim_{k \to \infty}\|M^k\|^{1/k} = \rho(M).$$

5.4 SOR method

The Gauss-Seidel method is very slow when $\rho(B_{GS})$ is close to unity. However, the convergence rate of the Gauss-Seidel iteration can, in certain cases, be improved by introducing a parameter $\omega$, known as the relaxation parameter. The resulting method is called the successive overrelaxation (SOR) method.

5.4.1 Iterative form

Let $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ and $A = D - L - U$ be defined as in Section 5.1.1. Consider the solution of $Ax = b$ again. The motivation of the SOR method is to improve the Gauss-Seidel iteration by taking an appropriately weighted average of $x_i^{(k)}$ and $x_i^{(k+1)}$, yielding the following algorithm:
$$x_i^{(k+1)} = (1 - \omega)x_i^{(k)} + \omega\Big(\sum_{j=1}^{i-1}c_{ij}x_j^{(k+1)} + \sum_{j=i+1}^{n}c_{ij}x_j^{(k)} + g_i\Big). \tag{5.8}$$
Here $D^{-1}(L + U) = [c_{ij}]$, $x_k = \big(x_1^{(k)}, x_2^{(k)}, \cdots, x_n^{(k)}\big)^T$, and $g = D^{-1}b = (g_1, g_2, \cdots, g_n)^T$. In matrix form, we have
$$x_{k+1} = L_\omega x_k + \omega(D - \omega L)^{-1}b,$$
where
$$L_\omega = (D - \omega L)^{-1}[(1 - \omega)D + \omega U]$$
is called the iteration matrix of the SOR method and $\omega$ is the relaxation parameter. We have three cases depending on the value of $\omega$:

(1) if $\omega = 1$, (5.8) is equivalent to the Gauss-Seidel method;

(2) if $\omega < 1$, (5.8) is called underrelaxation;

(3) if $\omega > 1$, (5.8) is called overrelaxation.
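A componentwise sketch of the SOR iteration (5.8) follows; setting omega = 1 reproduces Gauss-Seidel, and the test matrix is only illustrative.

    import numpy as np

    def sor(A, b, omega, x0, iters=100):
        n = len(b)
        x = x0.astype(float).copy()
        for _ in range(iters):
            for i in range(n):
                s = b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]   # uses updated x_j for j < i
                x[i] = (1 - omega) * x[i] + omega * s / A[i, i]
        return x

    A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(sor(A, b, 1.2, np.zeros(3)), np.linalg.solve(A, b))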
5.4.2 Convergence criteria

It is natural to ask for which range of $\omega$ the SOR iteration converges. To this end, we first prove the following important result, see [14, 41].

Theorem 5.9 The SOR iteration cannot converge for every initial approximation if $\omega$ lies outside the interval $(0, 2)$.

Proof: Recall that the SOR iteration matrix $L_\omega$ is given by
$$L_\omega = (D - \omega L)^{-1}[(1 - \omega)D + \omega U],$$
where $A = [a_{ij}] = D - L - U$. The matrix $(D - \omega L)^{-1}$ is a lower triangular matrix with $1/a_{ii}$, $i = 1, 2, \cdots, n$, as diagonal entries, and the matrix
$$(1 - \omega)D + \omega U$$
is an upper triangular matrix with $(1 - \omega)a_{ii}$, $i = 1, 2, \cdots, n$, as diagonal entries. Therefore,
$$\det(L_\omega) = (1 - \omega)^n.$$
Since the determinant of a matrix is equal to the product of its eigenvalues, we conclude that
$$\rho(L_\omega) \geq |1 - \omega|,$$
where $\rho(L_\omega)$ is the spectral radius of $L_\omega$. By Theorem 5.2, the spectral radius of the iteration matrix must be less than 1 for convergence. We then conclude that $0 < \omega < 2$ is required for the convergence of the SOR method.

Theorem 5.10 If $A$ is symmetric positive definite, then
$$\rho(L_\omega) < 1$$
for all $0 < \omega < 2$.

Proof: Let $\lambda$ be any eigenvalue of the SOR iteration matrix
$$L_\omega = (D - \omega L)^{-1}[(1 - \omega)D + \omega L^T]$$
and $x$ a corresponding eigenvector. We have
$$[(1 - \omega)D + \omega L^T]x = \lambda(D - \omega L)x,$$
or
$$x^*[(1 - \omega)D + \omega L^T]x = \lambda x^*(D - \omega L)x.$$
Let
$$x^*Dx = \sigma, \qquad x^*Lx = \alpha + i\beta.$$
Therefore, $x^*L^Tx = \alpha - i\beta$ and then
$$(1 - \omega)\sigma + \omega(\alpha - i\beta) = \lambda[\sigma - \omega(\alpha + i\beta)].$$
Taking the modulus of both sides, we obtain
$$|\lambda|^2 = \frac{[(1 - \omega)\sigma + \omega\alpha]^2 + \omega^2\beta^2}{(\sigma - \omega\alpha)^2 + \omega^2\beta^2}. \tag{5.9}$$
Note that
$$[(1 - \omega)\sigma + \omega\alpha]^2 + \omega^2\beta^2 - [(\sigma - \omega\alpha)^2 + \omega^2\beta^2] = [\sigma - \omega(\sigma - \alpha)]^2 - (\sigma - \omega\alpha)^2 = \omega\sigma(\omega - 2)(\sigma - 2\alpha).$$
Since $A$ is symmetric positive definite, we have
$$\sigma > 0, \qquad \sigma - 2\alpha > 0.$$
Therefore, if $0 < \omega < 2$, we have
$$[(1 - \omega)\sigma + \omega\alpha]^2 + \omega^2\beta^2 < (\sigma - \omega\alpha)^2 + \omega^2\beta^2. \tag{5.10}$$
Thus, for $0 < \omega < 2$, we obtain by (5.9) and (5.10),
$$|\lambda|^2 < 1,$$
i.e., the SOR method converges.

5.4.3 Optimal $\omega$ in the SOR iteration

For a further comparison of the Jacobi, Gauss-Seidel and SOR methods, we impose another condition on the matrices. This condition allows us to compute $\rho(B_{GS})$ and $\rho(L_\omega)$ explicitly in terms of $\rho(B_J)$.

Definition 5.4 A matrix $M$ has property A if there exists a permutation $P$ such that
$$PMP^T = \begin{pmatrix} D_{11} & M_{12} \\ M_{21} & D_{22} \end{pmatrix},$$
where $D_{11}$ and $D_{22}$ are diagonal matrices.

If $M$ has property A, then we can write
$$PMP^T = \widehat{D} - \widehat{L} - \widehat{U},$$
where
$$\widehat{D} = \begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \end{pmatrix}, \qquad \widehat{L} = -\begin{pmatrix} 0 & 0 \\ M_{21} & 0 \end{pmatrix}, \qquad \widehat{U} = -\begin{pmatrix} 0 & M_{12} \\ 0 & 0 \end{pmatrix}.$$
Let
$$\widehat{B}_J(\alpha) \equiv \widehat{D}^{-1}\Big(\alpha\widehat{L} + \frac{1}{\alpha}\widehat{U}\Big).$$
We have

Theorem 5.11 The eigenvalues of $\widehat{B}_J(\alpha)$ are independent of $\alpha$.

Proof: Just note that the matrix
$$\widehat{B}_J(\alpha) = -\begin{pmatrix} 0 & \frac{1}{\alpha}D_{11}^{-1}M_{12} \\ \alpha D_{22}^{-1}M_{21} & 0 \end{pmatrix}$$
is similar to the matrix
$$\begin{pmatrix} I & 0 \\ 0 & \alpha I \end{pmatrix}^{-1}\widehat{B}_J(\alpha)\begin{pmatrix} I & 0 \\ 0 & \alpha I \end{pmatrix} = -\begin{pmatrix} 0 & D_{11}^{-1}M_{12} \\ D_{22}^{-1}M_{21} & 0 \end{pmatrix} = \widehat{B}_J(1).$$

Definition 5.5 Let $M = D - L - U$ and
$$B_J(\alpha) = \alpha D^{-1}L + \frac{1}{\alpha}D^{-1}U.$$
If the eigenvalues of $B_J(\alpha)$ are independent of $\alpha$, then $M$ is called consistently ordered.

Note that $B_J(1) = B_J$ is the Jacobi iteration matrix. From Theorem 5.11, we know that if $M$ has property A, then $PMP^T$ is consistently ordered, where $P$ is a permutation matrix such that
$$PMP^T = \begin{pmatrix} D_{11} & M_{12} \\ M_{21} & D_{22} \end{pmatrix}$$
with $D_{11}$ and $D_{22}$ diagonal. It is not true that consistent ordering implies property A.

Example 5.1. Any block tridiagonal matrix
$$\begin{pmatrix} D_1 & A_1 & & \\ B_1 & D_2 & \ddots & \\ & \ddots & \ddots & A_{n-1} \\ & & B_{n-1} & D_n \end{pmatrix}$$
is consistently ordered when the $D_i$ are diagonal.

The following theorem gives a relation between the eigenvalues of $B_J$ and the eigenvalues of $L_\omega$, see [14, 41].

Theorem 5.12 If $A$ is consistently ordered and $\omega \neq 0$, then

(i) The eigenvalues of $B_J$ appear in $\pm$ pairs.

(ii) If $\mu$ is an eigenvalue of $B_J$ and $\lambda$ satisfies
$$(\lambda + \omega - 1)^2 = \lambda\omega^2\mu^2, \tag{5.11}$$
then $\lambda$ is an eigenvalue of $L_\omega$.

(iii) If $\lambda \neq 0$ is an eigenvalue of $L_\omega$ and $\mu$ satisfies (5.11), then $\mu$ is an eigenvalue of $B_J$.

Corollary 5.1 If $A$ is consistently ordered, then
$$\rho(B_{GS}) = [\rho(B_J)]^2.$$
This means that the convergence rate of the Gauss-Seidel method is twice that of the Jacobi method.

Proof: The choice $\omega = 1$ is equivalent to the Gauss-Seidel method. Therefore, by (5.11), we have $\lambda^2 = \lambda\mu^2$ and then
$$\lambda = \mu^2.$$

To get the most benefit from overrelaxation, we would like to find an optimal $\omega$, denoted by $\omega_{opt}$, minimizing $\rho(L_\omega)$. We have the following theorem, see [14, 41].
Theorem 5.13 Suppose that $A$ is consistently ordered and $B_J$ has real eigenvalues with $\rho = \rho(B_J) < 1$. Then
$$\omega_{opt} = \frac{2}{1 + \sqrt{1 - \rho^2}}, \qquad \rho(L_{\omega_{opt}}) = \frac{\rho^2}{(1 + \sqrt{1 - \rho^2})^2},$$
and
$$\rho(L_\omega) = \begin{cases} \omega - 1, & \omega_{opt} \leq \omega < 2, \\[4pt] 1 - \omega + \frac{1}{2}\omega^2\rho^2 + \omega\rho\sqrt{1 - \omega + \frac{1}{4}\omega^2\rho^2}, & 0 < \omega \leq \omega_{opt}. \end{cases}$$

Exercises:
1. Determine whether the Jacobi method and the Gauss-Seidel method converge for the following examples:
$$A_1 = \begin{pmatrix} 1 & 2 & -2 \\ 1 & 1 & 1 \\ 2 & 2 & 1 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 2 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & -2 \end{pmatrix}.$$

2. Show that the Jacobi method converges for 2-by-2 symmetric positive definite systems.

3. Show that if $A = M - N$ is singular, then we can never have $\rho(M^{-1}N) < 1$ even if $M$ is nonsingular.

4. Consider $Ax = b$ where
4. Consider Ax = b where
1 0
A= 0 1 0 .
0 1

(1) For which values of the parameters is $A$ positive definite?
(2) For which values of the parameters does the Jacobi method converge?
(3) For which values of the parameters does the Gauss-Seidel method converge?

5. Let $A \in \mathbb{R}^{n \times n}$ be nonsingular. Show that there exists a permutation matrix $P$ such that the diagonal entries of $PA$ are nonzero.

6. Prove that
$$A = \begin{pmatrix} 4 & -1 & -1 & 0 \\ -1 & 4 & 0 & -1 \\ -1 & 0 & 4 & -1 \\ 0 & -1 & -1 & 4 \end{pmatrix}$$
is consistently ordered.

7. Let $A \in \mathbb{R}^{n \times n}$. Then $\rho(A) < 1$ if and only if $I - A$ is nonsingular and each eigenvalue of $(I - A)^{-1}(I + A)$ has a positive real part.

8. Let $B \in \mathbb{R}^{n \times n}$ satisfy $\rho(B) = 0$. Show that for any $g, x_0 \in \mathbb{R}^n$, the iterative formula
$$x_{k+1} = Bx_k + g, \qquad k = 0, 1, \cdots,$$
converges to the exact solution of $x = Bx + g$ in at most $n$ iterations.

9. If $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ is irreducible with $a_{ij} \geq 0$ and $B = [b_{ij}] \in \mathbb{R}^{n \times n}$ with $b_{ij} \geq 0$, show that $A + B$ is irreducible.

10. Let
$$A = \begin{pmatrix} 0 & a & 0 \\ 0 & 0 & b \\ c & 0 & 0 \end{pmatrix}.$$
Is $A^k$ irreducible ($k = 1, 2, 3$)?

11. Let
$$A = \begin{pmatrix} 2 & -1 \\ 1 & 0 \end{pmatrix}.$$
Show that
(1)
$$A^k = k\begin{pmatrix} 1 + 1/k & -1 \\ 1 & 1/k - 1 \end{pmatrix}, \qquad k = 1, 2, \cdots.$$
(2)
$$\lim_{k \to \infty}\frac{\|A^k\|_p}{k} = 2, \qquad p = 1, 2, \infty.$$
(3) $\rho(A^k) = 1$.

12. Let
$$B_k = B_{k-1} + B_{k-1}(I - AB_{k-1}), \qquad k = 1, 2, \cdots.$$
Show that if $\|I - AB_0\| = c < 1$, then
$$\lim_{k \to \infty}B_k = A^{-1}$$
and
$$\|A^{-1} - B_k\| \leq \frac{c^{2^k}}{1 - c}\|B_0\|.$$

13. Prove Theorem 5.12.

14. Prove Theorem 5.13.
Chapter 6

Krylov Subspace Methods

In this chapter, we will introduce a class of iterative methods called Krylov subspace
methods. Among Krylov subspace methods developed for large sparse problems, we will
mainly study two methods: the conjugate gradient (CG) method and the generalized
minimum residual (GMRES) method. The CG method proposed by Hestenes and
Stiefel in 1952 is one of the best known iterative methods for solving symmetric positive
definite linear systems, see [16]. The GMRES method was proposed by Saad and
Schultz in 1986 for solving nonsymmetric linear systems, see [34]. As usual, let us
begin our discussion from the steepest descent method.

6.1 Steepest descent method

We consider the linear system
$$Ax = b,$$
where $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix and $b \in \mathbb{R}^n$ is a known vector. We define the following quadratic function:
$$\varphi(x) \equiv x^TAx - 2b^Tx. \tag{6.1}$$

Theorem 6.1 Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix. Then finding the solution of $Ax = b$ is equivalent to finding the minimum of the function (6.1).

Proof: Note that
$$\frac{\partial\varphi}{\partial x_i} = 2(a_{i1}x_1 + \cdots + a_{in}x_n) - 2b_i, \qquad i = 1, 2, \cdots, n.$$
We therefore have
$$\mathrm{grad}\,\varphi(x) = 2(Ax - b) = -2r, \tag{6.2}$$
where $\mathrm{grad}\,\varphi(x)$ denotes the gradient of $\varphi(x)$ and $r = b - Ax$. If $\varphi(x)$ reaches its minimum at a point $x^*$, then
$$\mathrm{grad}\,\varphi(x^*) = 0,$$
i.e., $Ax^* = b$, which means that $x^*$ is the solution of the system.
Conversely, if $x^*$ is the solution of the system, then for any vector $y$, we have
$$\varphi(x^* + y) = (x^* + y)^TA(x^* + y) - 2b^T(x^* + y) = x^{*T}Ax^* - 2b^Tx^* + y^TAy = \varphi(x^*) + y^TAy.$$
Since $A$ is symmetric positive definite, we have $y^TAy \geq 0$. Hence,
$$\varphi(x^* + y) \geq \varphi(x^*),$$
i.e., $\varphi(x)$ reaches its minimum at the point $x^*$.

How do we find the minimum of (6.1)? Usually, for a given initial vector $x_0$, we choose a direction $p_0$, and then we try to find a point
$$x_1 = x_0 + \alpha_0p_0$$
on the line $x = x_0 + \alpha p_0$ such that
$$\varphi(x_1) = \varphi(x_0 + \alpha_0p_0) \leq \varphi(x_0 + \alpha p_0).$$
It means that along this line, $\varphi(x)$ reaches its minimum at the point $x_1$. Afterwards, starting from $x_1$, we choose another direction $p_1$, and then we try to find a point
$$x_2 = x_1 + \alpha_1p_1$$
on the line $x = x_1 + \alpha p_1$ such that
$$\varphi(x_2) = \varphi(x_1 + \alpha_1p_1) \leq \varphi(x_1 + \alpha p_1),$$
i.e., along this line, $\varphi(x)$ reaches its minimum at the point $x_2$. Step by step, we have
$$p_0, p_1, \cdots, \quad \text{and} \quad \alpha_0, \alpha_1, \cdots,$$
where $\{p_k\}$ are line search directions and $\{\alpha_k\}$ are step sizes. In general, at a point $x_k$, we choose a direction $p_k$ and then determine a step size $\alpha_k$ along the line $x = x_k + \alpha p_k$ such that
$$\varphi(x_k + \alpha_kp_k) \leq \varphi(x_k + \alpha p_k).$$
We then obtain $x_{k+1} = x_k + \alpha_kp_k$. We remark that different ways of choosing the line search directions and step sizes give different algorithms for minimizing (6.1).
In the following, we first consider how to determine a step size $\alpha_k$. Starting from a point $x_k$ along a direction $p_k$, we want to find a step size $\alpha_k$ on the line $x = x_k + \alpha p_k$ such that
$$\varphi(x_k + \alpha_kp_k) \leq \varphi(x_k + \alpha p_k).$$
Let
$$f(\alpha) = \varphi(x_k + \alpha p_k) = (x_k + \alpha p_k)^TA(x_k + \alpha p_k) - 2b^T(x_k + \alpha p_k) = \alpha^2p_k^TAp_k - 2\alpha r_k^Tp_k + \varphi(x_k),$$
where $r_k = b - Ax_k$. We have
$$\frac{df}{d\alpha} = 2\alpha p_k^TAp_k - 2r_k^Tp_k = 0.$$
Therefore,
$$\alpha_k = \frac{r_k^Tp_k}{p_k^TAp_k}. \tag{6.3}$$
Once we get $\alpha_k$, we can compute
$$x_{k+1} = x_k + \alpha_kp_k.$$
Is $\varphi(x_{k+1}) \leq \varphi(x_k)$? We consider
$$\varphi(x_{k+1}) - \varphi(x_k) = \varphi(x_k + \alpha_kp_k) - \varphi(x_k) = \alpha_k^2p_k^TAp_k - 2\alpha_kr_k^Tp_k = -\frac{(r_k^Tp_k)^2}{p_k^TAp_k} \leq 0.$$
If $r_k^Tp_k \neq 0$, then $\varphi(x_{k+1}) < \varphi(x_k)$.
Now we consider how to choose a direction $p_k$. It is well known that the steepest descent direction of $\varphi(x)$ is the negative direction of the gradient, i.e., $p_k = r_k$ by (6.2). Thus we have the steepest descent method. In order to discuss the convergence rate of the method, we introduce the following lemma first.

Lemma 6.1 Let 0 < 1 n be the eigenvalues of a symmetric positive definite


matrix A and P (t) be a real polynomial of t. Then

kP (A)xkA max |P (j )| kxkA , x Rn ,


1jn

where kxkA xT Ax.
84 CHAPTER 6. KRYLOV SUBSPACE METHODS

Proof: Let y1 , y2 , , yn be the eigenvectors of A corresponding to the eigenvalues


1 , 2 , , n , respectively. Suppose that y1 , y2 , , yn also form an orthonormal basis
Pn
of Rn . Therefore, for any x Rn , we have x = i yi and furthermore,
i=1
T
P
n P
n
xT P (A)AP (A)x = i P (i )yi A i P (i )yi
i=1 i=1

P
n P
n
= i i2 P 2 (i ) max P 2 (j ) i i2
i=1 1jn i=1

= max P 2 (j )xT Ax.


1jn

Then
kP (A)xkA max |P (j )| kxkA .
1jn

For the steepest descent method, we have the following convergence theorem.

Theorem 6.2 Let $0 < \lambda_1 \leq \cdots \leq \lambda_n$ be the eigenvalues of a symmetric positive definite matrix $A$. Then the sequence $\{x_k\}$ produced by the steepest descent method satisfies
$$\|x_k - x^*\|_A \leq \Big(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\Big)^k\|x_0 - x^*\|_A,$$
where $x^*$ is the exact solution of $Ax = b$.

Proof: The $x_k$ produced by the steepest descent method satisfies
$$\varphi(x_k) \leq \varphi(x_{k-1} + \alpha r_{k-1}), \qquad \forall\alpha \in \mathbb{R}.$$
By noting that
$$\varphi(x) + x^{*T}Ax^* = (x - x^*)^TA(x - x^*),$$
we have
$$(x_k - x^*)^TA(x_k - x^*) \leq (x_{k-1} + \alpha r_{k-1} - x^*)^TA(x_{k-1} + \alpha r_{k-1} - x^*) = [(I - \alpha A)(x_{k-1} - x^*)]^TA[(I - \alpha A)(x_{k-1} - x^*)], \tag{6.4}$$
for any $\alpha \in \mathbb{R}$. Let $P(t) = 1 - \alpha t$. By using Lemma 6.1, we have from (6.4),
$$\|x_k - x^*\|_A \leq \|P(A)(x_{k-1} - x^*)\|_A \leq \max_{1 \leq j \leq n}|P(\lambda_j)|\,\|x_{k-1} - x^*\|_A, \tag{6.5}$$
for any $\alpha \in \mathbb{R}$. By using properties of the Chebyshev approximation, see [33], we have
$$\min_{\alpha}\max_{\lambda_1 \leq t \leq \lambda_n}|1 - \alpha t| = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}. \tag{6.6}$$
Substituting (6.6) into (6.5), we obtain
$$\|x_k - x^*\|_A \leq \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\|x_{k-1} - x^*\|_A.$$
Thus,
$$\|x_k - x^*\|_A \leq \Big(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\Big)^k\|x_0 - x^*\|_A.$$

6.2 Conjugate gradient method

In this section, we introduce the conjugate gradient (CG) method, which is one of the most important Krylov subspace methods.

6.2.1 Conjugate gradient method

The basic idea of the CG method is given as follows. For a given initial vector $x_0$, in the first step, we still choose the direction of the negative gradient, i.e., $p_0 = r_0$. Then we have
$$\alpha_0 = \frac{r_0^Tp_0}{p_0^TAp_0}, \qquad x_1 = x_0 + \alpha_0p_0, \qquad r_1 = b - Ax_1.$$
Afterwards, at the $(k+1)$-st step ($k \geq 1$), we want to choose a direction $p_k$ in the plane
$$\pi_2 = \{x = x_k + \xi r_k + \eta p_{k-1} : \xi, \eta \in \mathbb{R}\}$$
such that $\varphi$ decreases most rapidly. Consider $\varphi$ on $\pi_2$:
$$\varphi(\xi, \eta) = \varphi(x_k + \xi r_k + \eta p_{k-1}) = (x_k + \xi r_k + \eta p_{k-1})^TA(x_k + \xi r_k + \eta p_{k-1}) - 2b^T(x_k + \xi r_k + \eta p_{k-1}).$$
By direct computation, we have
$$\frac{\partial\varphi}{\partial\xi} = 2(\xi r_k^TAr_k + \eta r_k^TAp_{k-1} - r_k^Tr_k), \qquad \frac{\partial\varphi}{\partial\eta} = 2(\xi r_k^TAp_{k-1} + \eta p_{k-1}^TAp_{k-1}),$$
where we use $r_k^Tp_{k-1} = 0$ (see Theorem 6.3 later). Setting
$$\frac{\partial\varphi}{\partial\xi} = \frac{\partial\varphi}{\partial\eta} = 0,$$
we find a unique minimum point
$$\tilde{x} = x_k + \xi_0r_k + \eta_0p_{k-1}$$
of $\varphi$ on the plane $\pi_2$, where $\xi_0$ and $\eta_0$ satisfy:
$$\begin{cases} \xi_0r_k^TAr_k + \eta_0r_k^TAp_{k-1} = r_k^Tr_k, \\ \xi_0r_k^TAp_{k-1} + \eta_0p_{k-1}^TAp_{k-1} = 0. \end{cases} \tag{6.7}$$
Note that from (6.7), if $r_k \neq 0$ then $\xi_0 \neq 0$. We therefore can choose
$$p_k = \frac{1}{\xi_0}(\tilde{x} - x_k) = r_k + \frac{\eta_0}{\xi_0}p_{k-1}$$
as a new direction, which is the optimal direction for minimizing $\varphi$ on the plane $\pi_2$. Let $\beta_{k-1} = \eta_0/\xi_0$. Then by using the second equation in (6.7), we have
$$\beta_{k-1} = -\frac{r_k^TAp_{k-1}}{p_{k-1}^TAp_{k-1}}.$$
Note that $p_k$ satisfies $p_k^TAp_{k-1} = 0$ (see Theorem 6.3 later), i.e., $p_k$ and $p_{k-1}$ are mutually A-conjugate.
Once we get $p_k$, we can determine $\alpha_k$ by using (6.3) and then compute
$$x_{k+1} = x_k + \alpha_kp_k.$$
In conclusion, we have the following formulas:
$$\alpha_k = \frac{r_k^Tp_k}{p_k^TAp_k}, \qquad x_{k+1} = x_k + \alpha_kp_k, \qquad r_{k+1} = b - Ax_{k+1},$$
$$\beta_k = -\frac{r_{k+1}^TAp_k}{p_k^TAp_k}, \qquad p_{k+1} = r_{k+1} + \beta_kp_k.$$
After a few elementary computations, we obtain
$$\alpha_k = \frac{r_k^Tr_k}{p_k^TAp_k}, \qquad \beta_k = \frac{r_{k+1}^Tr_{k+1}}{r_k^Tr_k}.$$
Thus, the scheme of the CG method, one of the most popular and successful iterative methods for solving symmetric positive definite systems $Ax = b$, is given as follows. At the initialization step, for $k = 0$, we choose $x_0$ and then calculate
$$r_0 = b - Ax_0.$$
While $r_k \neq 0$, in the iteration steps, we have

    k = k + 1
    if k = 1
        p_0 = r_0
    else
        β_{k-2} = r_{k-1}^T r_{k-1} / r_{k-2}^T r_{k-2}
        p_{k-1} = r_{k-1} + β_{k-2} p_{k-2}
    end
    α_{k-1} = r_{k-1}^T r_{k-1} / p_{k-1}^T A p_{k-1}
    x_k = x_{k-1} + α_{k-1} p_{k-1}
    r_k = r_{k-1} - α_{k-1} A p_{k-1}

where $r_k$, $p_k$ are vectors and $\alpha_k$, $\beta_k$ are scalars, $k = 0, 1, \cdots$. The vector $x_k$ is the approximation to the exact solution after the $k$-th iteration. When $r_k = 0$, the solution is $x^* = x_k$.

6.2.2 Basic properties

Theorem 6.3 The vectors $\{r_i\}$ and $\{p_i\}$ satisfy the following properties:

(1) $p_i^Tr_j = 0$, $0 \leq i < j \leq k$;

(2) $r_i^Tr_j = 0$, $i \neq j$, $0 \leq i, j \leq k$;

(3) $p_i^TAp_j = 0$, $i \neq j$, $0 \leq i, j \leq k$;

(4) $\mathrm{span}\{r_0, \cdots, r_k\} = \mathrm{span}\{p_0, \cdots, p_k\} = \mathcal{K}(A, r_0, k + 1)$, where
$$\mathcal{K}(A, r_0, k + 1) \equiv \mathrm{span}\{r_0, Ar_0, \cdots, A^kr_0\}$$
is called the Krylov subspace.
Proof: We use induction. For $k = 1$, we have
$$p_0 = r_0, \qquad r_1 = r_0 - \alpha_0Ap_0, \qquad p_1 = r_1 + \beta_0p_0.$$
Then
$$p_0^Tr_1 = r_0^Tr_1 = r_0^T(r_0 - \alpha_0Ap_0) = r_0^Tr_0 - \alpha_0p_0^TAp_0 = 0,$$
provided $\alpha_0 = r_0^Tr_0/p_0^TAp_0$, and
$$p_1^TAp_0 = (r_1 + \beta_0r_0)^TAr_0 = r_1^TAr_0 - \frac{r_1^TAr_0}{r_0^TAr_0}r_0^TAr_0 = 0.$$
Now, we assume that the theorem is true for $k$ and prove that it also holds for $k + 1$.
For (1), by using the assumption and $r_{k+1} = r_k - \alpha_kAp_k$, we have
$$p_i^Tr_{k+1} = p_i^Tr_k - \alpha_kp_i^TAp_k = 0, \qquad 0 \leq i \leq k - 1,$$
and also
$$p_k^Tr_{k+1} = p_k^Tr_k - \frac{p_k^Tr_k}{p_k^TAp_k}p_k^TAp_k = 0.$$
Thus, (1) is true for $k + 1$.
For (2), we have by the assumption,
$$\mathrm{span}\{r_0, \cdots, r_k\} = \mathrm{span}\{p_0, \cdots, p_k\}.$$
By (1), we know that $r_{k+1}$ is orthogonal to this subspace. Therefore, (2) is true for $k + 1$.
For (3), by using the assumption, (2) and
$$p_{k+1} = r_{k+1} + \beta_kp_k, \qquad r_{i+1} = r_i - \alpha_iAp_i,$$
we have
$$p_{k+1}^TAp_i = \frac{1}{\alpha_i}r_{k+1}^T(r_i - r_{i+1}) + \beta_kp_k^TAp_i = 0, \qquad i = 0, 1, \cdots, k - 1.$$
By the definition of $\beta_k$, we have
$$p_{k+1}^TAp_k = (r_{k+1} + \beta_kp_k)^TAp_k = r_{k+1}^TAp_k - \frac{r_{k+1}^TAp_k}{p_k^TAp_k}p_k^TAp_k = 0.$$
Then, (3) holds for $k + 1$.
For (4), we know by using the assumption that
$$r_k, p_k \in \mathcal{K}(A, r_0, k + 1) = \mathrm{span}\{r_0, Ar_0, \cdots, A^kr_0\}.$$
Therefore,
$$r_{k+1} = r_k - \alpha_kAp_k \in \mathcal{K}(A, r_0, k + 2) = \mathrm{span}\{r_0, Ar_0, \cdots, A^{k+1}r_0\},$$
and
$$p_{k+1} = r_{k+1} + \beta_kp_k \in \mathcal{K}(A, r_0, k + 2) = \mathrm{span}\{r_0, Ar_0, \cdots, A^{k+1}r_0\}.$$
By (2) and (3), we note that the vectors $r_0, \cdots, r_{k+1}$ and $p_0, \cdots, p_{k+1}$ are linearly independent. Thus, (4) is true for $k + 1$.

We remark that, by Theorem 6.3, the CG method obtains the exact solution of an $n$-by-$n$ system in at most $n$ steps. Therefore, from a theoretical viewpoint, the CG method is a direct method.

Theorem 6.4 The $x_k$ obtained from the CG method satisfies
$$\varphi(x_k) = \min\{\varphi(x) : x \in x_0 + \mathcal{K}(A, r_0, k)\} \tag{6.8}$$
or
$$\|x_k - x^*\|_A = \min\{\|x - x^*\|_A : x \in x_0 + \mathcal{K}(A, r_0, k)\}, \tag{6.9}$$
where $\|x\|_A = \sqrt{x^TAx}$ and $x^*$ is the exact solution of $Ax = b$.

Proof: Since (6.8) and (6.9) are equivalent, we only need to prove (6.9). Suppose that $r_l = 0$ at the $l$-th step of the CG method; then we have
$$x^* = x_l = x_{l-1} + \alpha_{l-1}p_{l-1} = x_{l-2} + \alpha_{l-2}p_{l-2} + \alpha_{l-1}p_{l-1} = x_0 + \alpha_0p_0 + \cdots + \alpha_{l-1}p_{l-1}.$$
For $k < l$, we have
$$x_k = x_0 + \alpha_0p_0 + \cdots + \alpha_{k-1}p_{k-1} \in x_0 + \mathcal{K}(A, r_0, k).$$
Let $x$ be any vector in $x_0 + \mathcal{K}(A, r_0, k)$. Then by Theorem 6.3 (4), we have
$$x = x_0 + \gamma_0p_0 + \cdots + \gamma_{k-1}p_{k-1}.$$
Moreover,
$$x^* - x = (\alpha_0 - \gamma_0)p_0 + \cdots + (\alpha_{k-1} - \gamma_{k-1})p_{k-1} + \alpha_kp_k + \cdots + \alpha_{l-1}p_{l-1}.$$
Since
$$x^* - x_k = \alpha_kp_k + \cdots + \alpha_{l-1}p_{l-1},$$
we have by using Theorem 6.3 (3),
$$\|x^* - x\|_A^2 = \|(\alpha_0 - \gamma_0)p_0 + \cdots + (\alpha_{k-1} - \gamma_{k-1})p_{k-1}\|_A^2 + \|\alpha_kp_k + \cdots + \alpha_{l-1}p_{l-1}\|_A^2 \geq \|\alpha_kp_k + \cdots + \alpha_{l-1}p_{l-1}\|_A^2 = \|x^* - x_k\|_A^2.$$

6.3 Practical CG method and convergence analysis


In this section, we give a practical algorithm of the CG method and analyze the con-
vergence rate of the CG method.

6.3.1 Practical CG method

From Theorem 6.3, we know that the CG method obtains an exact solution after at most $n$ steps in exact arithmetic, where $n$ is the size of the system. In other words, the CG method can be thought of as a direct method. When $n$ is very large, in practice, we use the CG method as an iterative method and stop the iteration when

(i) $\|r_k\|$ is less than $\epsilon$, where $r_k = b - Ax_k$ and $\epsilon$ is a given error bound; or

(ii) the number of iterations reaches $k_{max}$, the largest number of iterations allowed, where $k_{max} \leq n$.

We then have the following practical algorithm for solving symmetric positive definite systems $Ax = b$. At the initialization step $k = 0$, we choose an initial vector $x$ and calculate
$$r = b - Ax, \qquad \rho = r^Tr.$$
While $\sqrt{\rho} > \epsilon\|b\|_2$ and $k < k_{max}$, in the iteration steps, we have

    k = k + 1
    if k = 1
        p = r
    else
        β = ρ/ρ_old;  p = r + βp
    end
    w = Ap;  α = ρ/p^T w;  x = x + αp
    r = r - αw;  ρ_old = ρ;  ρ = r^T r

where $r$, $p$, $w$ are vectors and $\alpha$, $\beta$, $\rho$ are scalars.
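A NumPy sketch of this practical algorithm is given below; the function name cg is ours, and the 1D Laplacian test matrix is only illustrative.

    import numpy as np

    def cg(A, b, x0=None, tol=1e-10, kmax=1000):
        x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
        r = b - A @ x
        rho = r @ r
        k = 0
        while np.sqrt(rho) > tol * np.linalg.norm(b) and k < kmax:
            k += 1
            if k == 1:
                p = r.copy()
            else:
                p = r + (rho / rho_old) * p
            w = A @ p
            alpha = rho / (p @ w)
            x = x + alpha * p
            r = r - alpha * w
            rho_old, rho = rho, r @ r
        return x, k

    n = 100
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # symmetric positive definite
    b = np.ones(n)
    x, k = cg(A, b)
    print(k, np.linalg.norm(A @ x - b))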


We remark that:

(1) We only have the matrix-vector multiplications in the algorithm. If the matrix
is sparse or it has a special structure, then these multiplications can be done
efficiently by using some sparse solvers or fast solvers.

(2) We do not need to estimate any parameter in the algorithm unlike the SOR
method.

(3) For each iteration, we could use the parallel algorithms for the vector operations.

Now, we briefly discuss how to use the CG method to solve general linear systems $Ax = b$. Since we cannot apply the CG method directly to such a system, instead of solving $Ax = b$, we can use the CG method to solve the normalized system
$$A^TAx = A^Tb.$$
When the system is well-conditioned, the normalized CG method is suitable. But if the system is ill-conditioned, then the condition number of the normalized system can become very large because $\kappa_2(A^TA) = (\kappa_2(A))^2$. Hence the normalized CG method is not suitable for ill-conditioned systems.

6.3.2 Convergence analysis

We have the following theorem for the convergence rate of the CG method.

Theorem 6.5 If $A = I + B$, where $I$ is the identity matrix and $\mathrm{rank}(B) = p$, then by using at most $p + 1$ iterations, the CG method can obtain the exact solution of $Ax = b$.

Proof: Since $A = I + B$, it is easy to show that
$$\mathrm{span}\{r_0, Ar_0, \cdots, A^kr_0\} = \mathrm{span}\{r_0, Br_0, \cdots, B^kr_0\}.$$
Note that $\mathrm{rank}(B) = p$ and then
$$\dim\big(\mathrm{span}\{r_0, Br_0, \cdots, B^kr_0\}\big) \leq p + 1.$$
By Theorem 6.3, we know that
$$\dim\big(\mathrm{span}\{r_0, \cdots, r_p, r_{p+1}\}\big) \leq p + 1$$
and also
$$r_i^Tr_{p+1} = 0, \qquad i = 0, 1, \cdots, p.$$
Therefore $r_{p+1} = 0$, i.e., $Ax_{p+1} = b$.

Theorem 6.6 Let $A \in \mathbb{R}^{n \times n}$ be symmetric positive definite and $x^*$ be the exact solution of $Ax = b$. We then have
$$\|x^* - x_k\|_A \leq 2\Big(\frac{\sqrt{\kappa_2} - 1}{\sqrt{\kappa_2} + 1}\Big)^k\|x^* - x_0\|_A,$$
where $x_k$ is produced by the CG method and
$$\kappa_2 = \kappa_2(A) = \|A\|_2\|A^{-1}\|_2.$$

Proof: By Theorem 6.3, we know that for any $x \in x_0 + \mathcal{K}(A, r_0, k)$,
$$x - x^* = x_0 - x^* + a_{k1}r_0 + a_{k2}Ar_0 + \cdots + a_{kk}A^{k-1}r_0 = -A^{-1}\big(r_0 - a_{k1}Ar_0 - \cdots - a_{kk}A^kr_0\big) = -A^{-1}P_k(A)r_0,$$
where $P_k(\lambda) = 1 - \sum_{j=1}^{k}a_{kj}\lambda^j$, so that $P_k(0) = 1$. Let $\mathbb{P}_k$ be the set of all polynomials $P_k$ of degree less than or equal to $k$ with $P_k(0) = 1$. By Theorem 6.4 and Lemma 6.1, we have
$$\begin{aligned}
\|x_k - x^*\|_A &= \min\{\|x - x^*\|_A : x \in x_0 + \mathcal{K}(A, r_0, k)\} \\
&= \min_{P_k \in \mathbb{P}_k}\|A^{-1}P_k(A)r_0\|_A = \min_{P_k \in \mathbb{P}_k}\|P_k(A)A^{-1}r_0\|_A \\
&\leq \min_{P_k \in \mathbb{P}_k}\max_{1 \leq i \leq n}|P_k(\lambda_i)|\,\|A^{-1}r_0\|_A
\leq \min_{P_k \in \mathbb{P}_k}\max_{a_1 \leq \lambda \leq a_2}|P_k(\lambda)|\,\|x^* - x_0\|_A,
\end{aligned}$$
where
$$0 < a_1 = \lambda_1 \leq \cdots \leq \lambda_n = a_2$$
are the eigenvalues of $A$. By the well-known approximation theorem of Chebyshev polynomials, see [33], there exists a unique solution of the optimization problem
$$\min_{P_k \in \mathbb{P}_k}\max_{a_1 \leq \lambda \leq a_2}|P_k(\lambda)|,$$
given by
$$\widetilde{P}_k(\lambda) = \frac{T_k\big(\frac{a_2 + a_1 - 2\lambda}{a_2 - a_1}\big)}{T_k\big(\frac{a_2 + a_1}{a_2 - a_1}\big)}.$$
Here $T_k(z)$ is the $k$-th Chebyshev polynomial defined recursively by
$$T_k(z) = 2zT_{k-1}(z) - T_{k-2}(z)$$
with $T_0(z) = 1$ and $T_1(z) = z$. By the properties of Chebyshev polynomials, see [33] again, we know that
$$\max_{a_1 \leq \lambda \leq a_2}|\widetilde{P}_k(\lambda)| = \frac{1}{T_k\big(\frac{a_2 + a_1}{a_2 - a_1}\big)} \leq 2\Big(\frac{\sqrt{\kappa_2} - 1}{\sqrt{\kappa_2} + 1}\Big)^k,$$
where $\kappa_2 = \kappa_2(A) = \|A\|_2\|A^{-1}\|_2$. Therefore,
$$\|x^* - x_k\|_A \leq 2\Big(\frac{\sqrt{\kappa_2} - 1}{\sqrt{\kappa_2} + 1}\Big)^k\|x^* - x_0\|_A.$$
Furthermore, we have the following theorem, see [2].

Theorem 6.7 If the eigenvalues $\lambda_j$ of a symmetric positive definite matrix $A$ are ordered such that
$$0 < \lambda_1 \le \cdots \le \lambda_p \le b_1 \le \lambda_{p+1} \le \cdots \le \lambda_{n-q} \le b_2 < \lambda_{n-q+1} \le \cdots \le \lambda_n,$$
where $b_1$ and $b_2$ are two constants, then
$$\frac{\|x^* - x_k\|_A}{\|x^* - x_0\|_A} \le 2\left(\frac{\alpha - 1}{\alpha + 1}\right)^{k-p-q}\max_{\lambda\in[b_1,b_2]}\prod_{j=1}^{p}\frac{\lambda - \lambda_j}{\lambda_j},$$
where $\alpha = (b_2/b_1)^{1/2} \ge 1$.



From Theorem 6.7, we note that as $n$ increases, if $p$ and $q$ are constants independent of $n$ and $\lambda_1$ is uniformly bounded away from zero, then the convergence rate is linear, i.e., the number of iterations required is independent of $n$. We also notice that the more clustered the eigenvalues are, the faster the convergence will be.

Corollary 6.1 If the eigenvalues $\lambda_j$ of a symmetric positive definite matrix $A$ are ordered such that
$$0 < \varepsilon < \lambda_1 \le \cdots \le \lambda_p \le 1 - \delta \le \lambda_{p+1} \le \cdots \le \lambda_{n-q} \le 1 + \delta < \lambda_{n-q+1} \le \cdots \le \lambda_n,$$
where $0 < \delta < 1$, then
$$\frac{\|x^* - x_k\|_A}{\|x^* - x_0\|_A} \le 2\left(\frac{1+\delta}{\varepsilon}\right)^p\delta^{k-p-q},$$
where $k \ge p + q$.

Proof: For the $\alpha$ given in Theorem 6.7, we have, with $b_1 = 1 - \delta$ and $b_2 = 1 + \delta$,
$$\alpha = \left(\frac{b_2}{b_1}\right)^{1/2} = \left(\frac{1+\delta}{1-\delta}\right)^{1/2}.$$
Therefore,
$$\frac{\alpha - 1}{\alpha + 1} = \frac{1 - \sqrt{1 - \delta^2}}{\delta} < \delta.$$
For $1 \le j \le p$ and $\lambda \in [1-\delta, 1+\delta]$, we have
$$0 \le \frac{\lambda - \lambda_j}{\lambda_j} \le \frac{1+\delta}{\varepsilon}.$$
Thus, by using Theorem 6.7, we obtain
$$\frac{\|x^* - x_k\|_A}{\|x^* - x_0\|_A} \le 2\left(\frac{\alpha - 1}{\alpha + 1}\right)^{k-p-q}\max_{\lambda\in[b_1,b_2]}\prod_{j=1}^{p}\frac{\lambda - \lambda_j}{\lambda_j} \le 2\left(\frac{1+\delta}{\varepsilon}\right)^p\delta^{k-p-q}.$$


6.4 Preconditioning
From Section 6.3, we know that if the matrix $A$ of the system
$$Ax = b$$
is well-conditioned or its spectrum is clustered, then the CG method converges quickly. Therefore, in order to speed up the convergence, we usually precondition the system, i.e., instead of solving the original system, we solve the preconditioned system
$$\widetilde{A}\widetilde{x} = \widetilde{b}, \qquad (6.10)$$
where
$$\widetilde{A} = C^{-1}AC^{-1}, \qquad \widetilde{x} = Cx, \qquad \widetilde{b} = C^{-1}b,$$
and $C$ is symmetric positive definite. We wish that the preconditioned matrix $\widetilde{A}$ has better spectral properties than those of $A$.

By using the CG method on (6.10), we have
$$\begin{aligned}
\alpha_k &= \frac{\widetilde{r}_k^T\widetilde{r}_k}{\widetilde{p}_k^T\widetilde{A}\widetilde{p}_k},\\
\widetilde{x}_{k+1} &= \widetilde{x}_k + \alpha_k\widetilde{p}_k,\\
\widetilde{r}_{k+1} &= \widetilde{r}_k - \alpha_k\widetilde{A}\widetilde{p}_k, \qquad (6.11)\\
\beta_k &= \frac{\widetilde{r}_{k+1}^T\widetilde{r}_{k+1}}{\widetilde{r}_k^T\widetilde{r}_k},\\
\widetilde{p}_{k+1} &= \widetilde{r}_{k+1} + \beta_k\widetilde{p}_k,
\end{aligned}$$
where $\widetilde{x}_0$ is any given initial vector, $\widetilde{r}_0 = \widetilde{b} - \widetilde{A}\widetilde{x}_0$ and $\widetilde{p}_0 = \widetilde{r}_0$. Let
$$\widetilde{x}_k = Cx_k, \qquad \widetilde{r}_k = C^{-1}r_k, \qquad \widetilde{p}_k = Cp_k, \qquad M = C^2.$$
Substituting them into (6.11), we actually have
$$\begin{aligned}
&w_k = Ap_k, \qquad \alpha_k = \rho_k/(p_k^T w_k),\\
&x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k w_k,\\
&z_{k+1} = M^{-1}r_{k+1}, \qquad \rho_{k+1} = r_{k+1}^T z_{k+1},\\
&\beta_k = \rho_{k+1}/\rho_k, \qquad p_{k+1} = z_{k+1} + \beta_k p_k,
\end{aligned}$$
where $x_0$ is any given initial vector, $r_0 = b - Ax_0$, $z_0 = M^{-1}r_0$, $\rho_0 = r_0^T z_0$ and $p_0 = z_0$.


We then have the following preconditioned algorithm. At the initialization step $k = 0$, we choose an initial vector $x$, calculate $r = b - Ax$, solve $Mz = r$ for $z$, and set $\rho = r^T z$. While $\sqrt{r^T r} > \varepsilon\,\|b\|_2$ and $k < k_{\max}$, we perform the iteration steps

    k = k + 1
    if k = 1
        p = z
    else
        \beta = \rho/\tilde{\rho};  p = z + \beta p
    end
    w = Ap;  \alpha = \rho/(p^T w);  x = x + \alpha p
    r = r - \alpha w
    Solve M z = r for z
    \tilde{\rho} = \rho;  \rho = r^T z

where $z$, $r$, $p$, $w$ are vectors and $\alpha$, $\beta$, $\rho$, $\tilde{\rho}$ are scalars. This algorithm is called the preconditioned conjugate gradient (PCG) method. Note that the PCG method has the following properties:
(i) $r_i^T M^{-1} r_j = 0$, for $i \ne j$.

(ii) $p_i^T A p_j = 0$, for $i \ne j$.

(iii) The approximation $x_k$ satisfies
$$\|\widetilde{x}^* - \widetilde{x}_k\|_{\widetilde{A}} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\|\widetilde{x}^* - \widetilde{x}_0\|_{\widetilde{A}},$$
where $\widetilde{x}^*$ is the exact solution of (6.10) and $\kappa = \lambda_n/\lambda_1$, with $\lambda_n$ being the largest eigenvalue and $\lambda_1$ the smallest eigenvalue of $M^{-1}A$ (note that $M^{-1}A$ and $\widetilde{A}$ are similar and hence have the same eigenvalues).
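A minimal Python sketch of this PCG iteration, using a diagonal (Jacobi) preconditioner $M = \mathrm{diag}(A)$ purely as an example, is given below; the function name pcg and the choice of $M$ are our own, and the loop is organized equivalently to the algorithm above:

    import numpy as np

    def pcg(A, b, M_diag, eps=1e-8, kmax=None):
        # PCG sketch with a diagonal preconditioner M = diag(M_diag);
        # solving M z = r is then a componentwise division.
        n = b.shape[0]
        kmax = n if kmax is None else kmax
        x = np.zeros(n)
        r = b - A @ x
        z = r / M_diag
        rho = r @ z
        p = z.copy()
        k = 0
        while np.linalg.norm(r) > eps * np.linalg.norm(b) and k < kmax:
            k += 1
            w = A @ p
            alpha = rho / (p @ w)
            x += alpha * p
            r -= alpha * w
            z = r / M_diag            # solve M z = r
            rho_old, rho = rho, r @ z
            p = z + (rho / rho_old) * p
        return x, k

    # Example: M taken as the diagonal of A.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x, k = pcg(A, b, np.diag(A))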
A good preconditioner $M = C^2$ is chosen with two criteria in mind, see [2, 19, 22]:

(1) $Mz = d$ is easy to solve.

(2) The spectrum of $M^{-1}A$ is clustered and (or) $M^{-1}A$ is well-conditioned compared with $A$.

Usually, it is not easy to choose a preconditioner which satisfies both criteria. Now we briefly discuss the following three classes of preconditioners.

1. Diagonal preconditioners. If the diagonal entries of the coefficient matrix $A$ differ greatly in magnitude, one could use the matrix
$$M = \mathrm{diag}(a_{11}, \ldots, a_{nn})$$
as a preconditioner, hopefully speeding up the convergence of the CG method. For a block matrix
$$A = \begin{pmatrix} A_{11} & \cdots & A_{1k}\\ \vdots & \ddots & \vdots\\ A_{k1} & \cdots & A_{kk}\end{pmatrix},$$
if the $A_{ii}^{-1}$ are easy to obtain, then one could use the block diagonal matrix
$$M = \mathrm{diag}(A_{11}, \ldots, A_{kk})$$
as a preconditioner.

2. Preconditioners based on incomplete Cholesky factorization. If one first computes an incomplete Cholesky factorization
$$A = LL^T + R,$$
then one can use the matrix $M = LL^T$ as a preconditioner. We could require that the matrix $L$ has the same sparsity structure as the matrix $A$ and also that $LL^T \approx A$.

3. Optimal (circulant) preconditioners. This class of preconditioners was proposed quite recently, see [9, 22, 37]. A circulant matrix is defined as follows:
$$C_n = \begin{pmatrix}
c_0 & c_{-1} & \cdots & c_{2-n} & c_{1-n}\\
c_1 & c_0 & c_{-1} & \cdots & c_{2-n}\\
\vdots & c_1 & c_0 & \ddots & \vdots\\
c_{n-2} & \ddots & \ddots & \ddots & c_{-1}\\
c_{n-1} & c_{n-2} & \cdots & c_1 & c_0
\end{pmatrix},$$
where $c_k = c_{k-n}$ for $1 \le k \le n-1$. It is well known that circulant matrices can be diagonalized by the Fourier matrix $F_n$, see [13], i.e.,
$$C_n = F_n^*\Lambda_n F_n, \qquad (6.12)$$
where the entries of $F_n$ are given by
$$(F_n)_{j,k} = \frac{1}{\sqrt{n}}\,e^{2\pi\mathrm{i}(j-1)(k-1)/n}, \qquad \mathrm{i} \equiv \sqrt{-1},$$
for $1 \le j, k \le n$, and $\Lambda_n$ is a diagonal matrix holding the eigenvalues of $C_n$. We note that $\Lambda_n$ can be obtained in $O(n\log n)$ operations by taking the fast Fourier transform (FFT) of the first column of $C_n$. Once $\Lambda_n$ is obtained, the products $C_n y$ and $C_n^{-1}y$ for any vector $y$ can be computed by FFTs in $O(n\log n)$ operations. For the Fourier matrix $F_n$, when there is no ambiguity, we shall simply write $F$.
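As an illustration of the $O(n\log n)$ operation counts mentioned above, the following sketch uses NumPy's FFT to obtain the eigenvalues of a circulant matrix from its first column and to apply $C_n$ and $C_n^{-1}$ to a vector; the function names are our own:

    import numpy as np

    def circulant_eigenvalues(c):
        # c is the first column of the circulant matrix C_n;
        # its eigenvalues are given by the FFT of c.
        return np.fft.fft(c)

    def circulant_matvec(c, y):
        # Compute C_n y in O(n log n) operations via FFTs
        # (circular convolution of c and y).
        lam = np.fft.fft(c)
        return np.real(np.fft.ifft(lam * np.fft.fft(y)))

    def circulant_solve(c, y):
        # Compute C_n^{-1} y, assuming all eigenvalues are nonzero.
        lam = np.fft.fft(c)
        return np.real(np.fft.ifft(np.fft.fft(y) / lam))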
Now, we study a kind of preconditioner called the optimal preconditioner, see [9, 11, 22]. Given any unitary matrix $U \in \mathbb{C}^{n\times n}$, let $\mathcal{M}_U$ be the set of all matrices simultaneously diagonalized by $U$, i.e.,
$$\mathcal{M}_U = \{U^*\Lambda U \mid \Lambda \text{ is an } n\text{-by-}n \text{ diagonal matrix}\}. \qquad (6.13)$$
We note that in (6.13), when $U = F$, the Fourier matrix, $\mathcal{M}_F$ is the set of all circulant matrices. Let $\delta(A)$ denote the diagonal matrix whose diagonal is equal to the diagonal of the matrix $A$. We have the following lemma, see [22, 27, 39].

Lemma 6.2 For any arbitrary $A = [a_{pq}] \in \mathbb{C}^{n\times n}$, let $c_U(A)$ be the minimizer of $\|W - A\|_F$ over all $W \in \mathcal{M}_U$. Then

(i) $c_U(A)$ is uniquely determined by $A$ and is given by
$$c_U(A) = U^*\delta(UAU^*)U.$$

(ii) If $A$ is Hermitian, then $c_U(A)$ is also Hermitian. Furthermore, if $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and largest eigenvalues respectively, then we have
$$\lambda_{\min}(A) \le \lambda_{\min}(c_U(A)) \le \lambda_{\max}(c_U(A)) \le \lambda_{\max}(A).$$
In particular, if $A$ is positive definite, then so is $c_U(A)$.

(iii) If $A$ is normal and stable, i.e., $A^*A = AA^*$ and the real parts of all the eigenvalues of $A$ are negative, then $c_U(A)$ is also stable.

(iv) When $U$ is the Fourier matrix $F$, we then have
$$c_F(A) = \sum_{j=0}^{n-1}\left(\frac{1}{n}\sum_{p-q\equiv j\ (\mathrm{mod}\ n)} a_{pq}\right)Q^j,$$
where $Q$ is the $n$-by-$n$ circulant matrix given by
$$Q \equiv \begin{pmatrix}
0 & & & & 1\\
1 & 0 & & & \\
 & 1 & 0 & & \\
 & & \ddots & \ddots & \\
0 & & & 1 & 0
\end{pmatrix}.$$

The matrix $c_U(A)$ is called the optimal preconditioner of $A$, and the matrix $c_F(A)$ is called the optimal circulant preconditioner of $A$. We remark that $c_F(A)$ is a good preconditioner for solving a large class of structured linear systems $Ax = b$, for instance, Toeplitz systems, Hankel systems, etc., see [9, 11, 12, 22].
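For a dense matrix $A$, part (iv) of the lemma says that the first column of $c_F(A)$ is obtained by averaging the entries of $A$ along each wrapped diagonal. A small sketch of this computation (our own helper, for illustration only) is:

    import numpy as np

    def optimal_circulant_first_column(A):
        # First column c of c_F(A): c[j] = (1/n) * sum of a_{pq} over p - q = j (mod n).
        n = A.shape[0]
        c = np.zeros(n, dtype=A.dtype)
        for p in range(n):
            for q in range(n):
                c[(p - q) % n] += A[p, q]
        return c / n

    # The preconditioner can then be applied through FFTs as in (6.12),
    # e.g. with the circulant_solve sketch given earlier.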

6.5 GMRES method


In this section, we introduce the generalized minimum residual (GMRES) method to
solve general systems
Ax = b
where A Rnn is nonsingular. The GMRES method was proposed by Saad and
Schultz in 1986, which is one of the most important Krylov subspace methods for
nonsymmetric systems, see [33, 34].

6.5.1 Basic properties of GMRES method


For the GMRES method, in the $k$-th iteration, we are going to find a solution $x_k$ of the LS problem
$$\min_{x\in x_0 + \mathcal{K}(A, r_0, k)} \|b - Ax\|_2,$$
where
$$\mathcal{K}(A, r_0, k) \equiv \mathrm{span}\{r_0, Ar_0, \ldots, A^{k-1}r_0\}$$
with $r_0 = b - Ax_0$. Let $x \in x_0 + \mathcal{K}(A, r_0, k)$. We have
$$x = x_0 + \sum_{j=0}^{k-1}\gamma_j A^j r_0$$
and then
$$r = b - Ax = b - Ax_0 - \sum_{j=0}^{k-1}\gamma_j A^{j+1}r_0 = r_0 - \sum_{j=1}^{k}\gamma_{j-1}A^j r_0.$$
Hence
$$r = P_k(A)r_0,$$
where $P_k \in \mathcal{P}_k$, with $\mathcal{P}_k$ being the set of all polynomials $P_k$ of degree less than or equal to $k$ such that $P_k(0) = 1$. We therefore have the following theorem.

Theorem 6.8 Let $x_k$ be the solution after the $k$-th GMRES iteration. Then we have
$$\|r_k\|_2 = \min_{P\in\mathcal{P}_k}\|P(A)r_0\|_2 \le \|P_k(A)r_0\|_2$$
for any $P_k \in \mathcal{P}_k$. Furthermore,
$$\frac{\|r_k\|_2}{\|r_0\|_2} \le \|P_k(A)\|_2.$$

Moreover, we have

Theorem 6.9 The GMRES method will obtain the exact solution of Ax = b within n
iterations, where A Rnn .

Proof: The characteristic polynomial of $A$ is given by
$$p_A(z) = \det(zI - A),$$
where the degree of $p_A(z)$ is $n$ and $p_A(0) = (-1)^n\det(A) \ne 0$. Then
$$P_n(z) = \frac{p_A(z)}{p_A(0)} \in \mathcal{P}_n.$$
By the Hamilton-Cayley Theorem, see [17], we know that $p_A(A) = 0$ and hence
$$P_n(A) = \frac{p_A(A)}{p_A(0)} = 0.$$
By Theorem 6.8, we have
$$r_n = b - Ax_n = 0.$$
Thus $x_n$ is the exact solution of $Ax = b$.

If $A$ is diagonalizable, i.e., $A = V\Lambda V^{-1}$ where $\Lambda$ is a diagonal matrix, then we have
$$P(A) = VP(\Lambda)V^{-1}.$$
If $V$ can be chosen unitary, then $A$ is a normal matrix.

Theorem 6.10 Let $A = V\Lambda V^{-1}$. We have
$$\frac{\|r_k\|_2}{\|r_0\|_2} \le \kappa_2(V)\,\min_{P\in\mathcal{P}_k}\max_i|P(\lambda_i)|,$$
where the $\lambda_i$ are the eigenvalues of the matrix $A$.

Proof: By Theorem 6.8, we have
$$\begin{aligned}
\|r_k\|_2 &= \min_{P\in\mathcal{P}_k}\|P(A)r_0\|_2 = \min_{P\in\mathcal{P}_k}\|VP(\Lambda)V^{-1}r_0\|_2\\
&\le \min_{P\in\mathcal{P}_k}\|V\|_2\|V^{-1}\|_2\|P(\Lambda)\|_2\|r_0\|_2\\
&= \kappa_2(V)\min_{P\in\mathcal{P}_k}\|P(\Lambda)\|_2\|r_0\|_2\\
&= \kappa_2(V)\min_{P\in\mathcal{P}_k}\max_i|P(\lambda_i)|\;\|r_0\|_2.
\end{aligned}$$

We remark that when $A$ is normal, then $\kappa_2(V) = 1$. We therefore have
$$\frac{\|r_k\|_2}{\|r_0\|_2} \le \min_{P\in\mathcal{P}_k}\max_i|P(\lambda_i)|.$$

Theorem 6.11 If $A$ is diagonalizable and has exactly $k$ distinct eigenvalues, then the GMRES method will terminate in at most $k$ iterations.

Proof: We construct a polynomial as follows,
$$P(z) = \prod_{i=1}^{k}\frac{\lambda_i - z}{\lambda_i} \in \mathcal{P}_k,$$
where the $\lambda_i$ are the distinct eigenvalues of $A$. By Theorem 6.10, we know that $r_k = 0$, i.e., $Ax_k = b$.

We should emphasize that in general, the behavior of the GMRES method cannot
be determined by eigenvalues alone. In fact, it is shown in [20] that any nonincreasing
convergence rate is possible for the GMRES method applied to some problem with
nonnormal matrix. Moreover, that problem can have any desired distribution of eigen-
values. Thus, for instance, eigenvalues tightly clustered around 1 are not necessarily
good for nonnormal matrices, as they are for normal ones. However, we have the
following two theorems.

Theorem 6.12 If $b$ is a linear combination of $k$ eigenvectors of $A$, say
$$b = \sum_{l=1}^{k}\beta_l u_{i_l},$$
then the GMRES method (with initial vector $x_0 = 0$) will terminate in at most $k$ iterations.

Proof: We first extend the set of eigenvectors $\{u_{i_l}\}$ to a basis of $\mathbb{R}^n$, i.e., the vectors
$$u_{i_1}, u_{i_2}, \ldots, u_{i_k}, v_1, \ldots, v_{n-k}$$
form a basis of $\mathbb{R}^n$. Let $x^*$ be the exact solution of $Ax = b$. Then
$$x^* = \sum_{l=1}^{k}\xi_l u_{i_l} + \sum_{j=1}^{n-k}\zeta_j v_j.$$
Moreover, we have
$$Ax^* = \sum_{l=1}^{k}\xi_l Au_{i_l} + \sum_{j=1}^{n-k}\zeta_j Av_j = \sum_{l=1}^{k}\xi_l\lambda_{i_l}u_{i_l} + \sum_{j=1}^{n-k}\zeta_j Av_j = b = \sum_{l=1}^{k}\beta_l u_{i_l}.$$
Hence,
$$\zeta_j = 0, \quad j = 1, 2, \ldots, n-k; \qquad \xi_l = \beta_l/\lambda_{i_l}, \quad l = 1, 2, \ldots, k.$$
We therefore have
$$x^* = \sum_{l=1}^{k}(\beta_l/\lambda_{i_l})u_{i_l}.$$
Let
$$P_k(z) = \prod_{l=1}^{k}\frac{\lambda_{i_l} - z}{\lambda_{i_l}} \in \mathcal{P}_k.$$
Note that $P_k(\lambda_{i_l}) = 0$ for $1 \le l \le k$, and
$$P_k(A)x^* = \sum_{l=1}^{k}P_k(\lambda_{i_l})(\beta_l/\lambda_{i_l})u_{i_l} = 0.$$
We thus have
$$\|r_k\|_2 \le \|P_k(A)r_0\|_2 = \|P_k(A)b\|_2 = \|P_k(A)Ax^*\|_2 = \|AP_k(A)x^*\|_2 = 0,$$
where we choose the initial vector $x_0 = 0$.

Theorem 6.13 When the GMRES method is applied for solving a linear system $Ax = b$ where $A = I + L$, the method will converge in at most $\mathrm{rank}(L) + 1$ iterations.

Proof: We first recall that the minimal polynomial of $r_0$ with respect to $A$ is the nonzero monic polynomial $p$ of lowest degree such that $p(A)r_0 = 0$, see [33]. By Theorem 6.8, the GMRES method must converge within $\nu$ iterations, where $\nu$ is the degree of the minimal polynomial of the residual $r_0$ with respect to $A = I + L$. Let $\mu$ be the degree of the minimal polynomial of $r_0$ with respect to $L$. Since
$$\sum_{i=0}^{k}\alpha_i(I + L)^i r_0 = 0$$
implies
$$\sum_{i=0}^{k}\beta_i L^i r_0 = 0$$
for some constants $\beta_i$, and vice versa, we have $\nu = \mu$. Moreover, from the definition of $\mu$, the set
$$\{r_0, Lr_0, \ldots, L^{\mu-1}r_0\}$$
is linearly independent. Let $\mathcal{B}$ be the column space of $L$. Then the dimension of $\mathcal{B}$ is equal to the rank of $L$. Since $L^i r_0 \in \mathcal{B}$ for $i \ge 1$, we have
$$\{Lr_0, \ldots, L^{\mu-1}r_0\} \subseteq \mathcal{B}.$$
Thus, $\mu - 1 \le \mathrm{rank}(L)$, i.e., $\mu \le \mathrm{rank}(L) + 1$. Hence, the GMRES method converges within
$$\nu = \mu \le \mathrm{rank}(L) + 1$$
iterations.

6.5.2 Implementation of GMRES method


Recall that the LS problem from the $k$-th iteration of the GMRES method is
$$\min_{x\in x_0 + \mathcal{K}(A, r_0, k)}\|b - Ax\|_2.$$

Suppose that we have a matrix
$$V_k = [v_1^k, v_2^k, \ldots, v_k^k]$$
whose columns form an orthonormal basis of $\mathcal{K}(A, r_0, k)$. Then any $z \in \mathcal{K}(A, r_0, k)$ can be written as
$$z = \sum_{l=1}^{k}u_l v_l^k = V_k u,$$
where $u = (u_1, u_2, \ldots, u_k)^T \in \mathbb{R}^k$. Thus, once we find $V_k$, we can convert the original LS problem in the Krylov subspace into an LS problem in $\mathbb{R}^k$ as follows. Let $x_k$ be the solution after the $k$-th iteration. We then have
$$x_k = x_0 + V_k y_k,$$

where the vector $y_k$ minimizes
$$\min_{y\in\mathbb{R}^k}\|b - A(x_0 + V_k y)\|_2 = \min_{y\in\mathbb{R}^k}\|r_0 - AV_k y\|_2.$$
This is a standard linear LS problem that can be solved by a QR decomposition.


One could use the Gram-Schmidt orthogonalization to find an orthonormal basis of $\mathcal{K}(A, r_0, k)$. The algorithm is given as follows:

(1) Define $r_0 = b - Ax_0$ and $v_1 = \dfrac{r_0}{\|r_0\|_2}$.

(2) For $i = 1, 2, \ldots, k-1$, compute
$$v_{i+1} = \frac{Av_i - \sum\limits_{j=1}^{i}\left((Av_i)^T v_j\right)v_j}{\left\|Av_i - \sum\limits_{j=1}^{i}\left((Av_i)^T v_j\right)v_j\right\|_2}.$$
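A sketch of this orthogonalization in Python is given below; it follows the classical Gram-Schmidt formula above exactly (in floating point one would usually prefer the modified variant), and the function name is our own:

    import numpy as np

    def krylov_basis(A, r0, k):
        # Orthonormal basis v_1, ..., v_k of K(A, r0, k) via the
        # Gram-Schmidt formula above.
        v = [r0 / np.linalg.norm(r0)]
        for i in range(k - 1):
            Av = A @ v[i]
            w = Av - sum(((Av @ vj) * vj) for vj in v)
            nrm = np.linalg.norm(w)
            if nrm == 0.0:          # breakdown: a division by zero would occur
                break
            v.append(w / nrm)
        return np.column_stack(v)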

The Gram-Schmidt procedure above produces the columns of the matrix $V_k$, which form an orthonormal basis for $\mathcal{K}(A, r_0, k)$. We note that a breakdown happens when a division by zero occurs. We have the following theorem describing when a breakdown happens.

Theorem 6.14 Let $A$ be nonsingular, let the vectors $v_j$ be generated by the above algorithm, and let $i$ be the smallest integer for which
$$Av_i - \sum_{j=1}^{i}\left((Av_i)^T v_j\right)v_j = 0. \qquad (6.14)$$
Then $x^* = A^{-1}b \in x_0 + \mathcal{K}(A, r_0, i)$.

Proof: Since by (6.14),
$$Av_i = \sum_{j=1}^{i}\left((Av_i)^T v_j\right)v_j \in \mathcal{K}(A, r_0, i),$$
we know that
$$A\mathcal{K}(A, r_0, i) \subseteq \mathcal{K}(A, r_0, i).$$
Note that the columns of $V_i = [v_1, v_2, \ldots, v_i]$ form an orthonormal basis for $\mathcal{K}(A, r_0, i)$, i.e.,
$$\mathcal{K}(A, r_0, i) = \mathrm{span}\{v_1, v_2, \ldots, v_i\},$$
and then
$$AV_i = V_i H, \qquad (6.15)$$
where $H \in \mathbb{R}^{i\times i}$ is nonsingular since $A$ is nonsingular. There exists a vector $y \in \mathbb{R}^i$ such that $x_i - x_0 = V_i y$ because $x_i - x_0 \in \mathcal{K}(A, r_0, i)$. We therefore have
$$\|r_i\|_2 = \|b - Ax_i\|_2 = \|r_0 - A(x_i - x_0)\|_2 = \|r_0 - AV_i y\|_2. \qquad (6.16)$$
Let $\beta = \|r_0\|_2$ and $e_1 = (1, 0, \ldots, 0)^T \in \mathbb{R}^i$. Then $r_0 = \beta V_i e_1$. Since $V_i$ is a matrix with orthonormal columns, we have by (6.15) and (6.16),
$$\|r_i\|_2 = \|V_i(\beta e_1 - Hy)\|_2 = \|\beta e_1 - Hy\|_2.$$
Setting $y = \beta H^{-1}e_1$ gives $r_i = 0$, i.e.,
$$x_i = A^{-1}b = x^* \in x_0 + \mathcal{K}(A, r_0, i).$$

If the Gram-Schmidt process does not break down, we can use it to carry out the GMRES method in the following efficient way. Let $h_{ij} = (Av_j)^T v_i$. By the Gram-Schmidt algorithm, we obtain a $(k+1)$-by-$k$ matrix $\bar{H}_k$ which is upper Hessenberg, i.e., its entries $h_{ij}$ satisfy $h_{ij} = 0$ if $i > j+1$. This process produces a sequence of matrices $\{V_k\}$ with orthonormal columns such that
$$AV_k = V_{k+1}\bar{H}_k.$$
Therefore, we have
$$r_k = b - Ax_k = r_0 - A(x_k - x_0) = \beta V_{k+1}e_1 - AV_k y_k = V_{k+1}(\beta e_1 - \bar{H}_k y_k),$$
where $\beta = \|r_0\|_2$ and $y_k$ is the solution of
$$\min_{y\in\mathbb{R}^k}\|\beta e_1 - \bar{H}_k y\|_2.$$
Hence
$$x_k = x_0 + V_k y_k.$$
The GMRES iterations are ended when one finds a vector $x$ such that, for a given $\varepsilon$,
$$\|r\|_2 = \|b - Ax\|_2 \le \varepsilon\|b\|_2.$$
We then have the following GMRES algorithm for solving
$$Ax = b,$$
where $A \in \mathbb{R}^{n\times n}$ and $b \in \mathbb{R}^n$ are known, see [33]. At the initialization step, let
$$r_0 = b - Ax_0, \qquad \beta = \|r_0\|_2, \qquad v_1 = r_0/\beta.$$
In the iteration steps, we have

    for j = 1 : m
        w_j = Av_j
        for i = 1 : j
            h_{ij} = w_j^T v_i
            w_j = w_j - h_{ij} v_i
        end
        h_{j+1,j} = \|w_j\|_2;  if h_{j+1,j} = 0, set m = j and go to (*)
        v_{j+1} = w_j/h_{j+1,j}
    end
    (*) compute y_m, the minimizer of \|\beta e_1 - \bar{H}_m y\|_2
    x_m = x_0 + V_m y_m

where $\bar{H}_m = (h_{ij})$, $1 \le i \le m+1$, $1 \le j \le m$. The matrix $V_m = [v_1, v_2, \ldots, v_m] \in \mathbb{R}^{n\times m}$ with $m \le n$ in the algorithm has orthonormal columns.
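A compact Python sketch of this GMRES algorithm is given below; restarting, which is used in practice, is omitted, and the small least squares problem for $y_m$ is solved with numpy.linalg.lstsq for simplicity (the function name gmres is our own):

    import numpy as np

    def gmres(A, b, x0=None, m=50, tol=1e-8):
        # Basic GMRES sketch: the Arnoldi/Gram-Schmidt process builds V and the
        # (m+1)-by-m Hessenberg matrix H, then y_m minimizes ||beta*e1 - H y||_2
        # and x_m = x0 + V_m y_m.
        n = b.shape[0]
        x0 = np.zeros(n) if x0 is None else x0
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r0 / beta
        for j in range(m):
            w = A @ V[:, j]
            for i in range(j + 1):
                H[i, j] = w @ V[:, i]
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] <= tol * beta:   # (near-)breakdown: stop early
                m = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(m + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[: m + 1, : m], e1, rcond=None)
        return x0 + V[:, :m] @ y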

Exercises:

1. Suppose that $x_k$ is generated by the steepest descent method. Prove that
$$\varphi(x_k) \le \left(1 - \frac{1}{\kappa_2(A)}\right)\varphi(x_{k-1}),$$
where $\kappa_2(A) = \|A\|_2\|A^{-1}\|_2$.

2. Let $A$ be a symmetric positive definite matrix. Define $\|x\|_A \equiv \sqrt{x^T Ax}$. Show that it is a vector norm.

3. Let $A \in \mathbb{R}^{n\times n}$ be symmetric positive definite, and let $p_1, p_2, \ldots, p_k \in \mathbb{R}^n$ be mutually $A$-conjugate, i.e., $p_i^T Ap_j = 0$ for $i \ne j$. Prove that $\{p_1, p_2, \ldots, p_k\}$ is linearly independent.

4. Let $x_k$ be produced by the CG method. Show that
$$\|x_k - x^*\|_2 \le 2\sqrt{\kappa_2}\left(\frac{\sqrt{\kappa_2} - 1}{\sqrt{\kappa_2} + 1}\right)^k\|x_0 - x^*\|_2,$$
where $\kappa_2 = \kappa_2(A) = \|A\|_2\|A^{-1}\|_2$.

5. Suppose $z_k$ and $r_k$ are generated by the PCG method. Show that if $r_k \ne 0$, then $z_k^T r_k > 0$.

6. Find an efficient algorithm for solving $A^T Ax = A^T b$ by the CG method.

7. Show that if $A \in \mathbb{R}^{n\times n}$ is symmetric positive definite and has exactly $k$ distinct eigenvalues, then the CG method will terminate in at most $k$ iterations.

8. Let the initial vector $x_0 = 0$. When the GMRES method is used to solve the linear system $Ax = b$ where
$$A = \begin{pmatrix}
0 & & & & 1\\
1 & 0 & & & \\
 & 1 & 0 & & \\
 & & 1 & 0 & \\
 & & & 1 & 0
\end{pmatrix}$$
and $b = (1, 0, 0, 0, 0)^T$, what is the convergence rate?

9. Let
$$A = \begin{pmatrix} I & Y\\ 0 & I\end{pmatrix}.$$
When the GMRES method is used to solve $Ax = b$, what is the maximum number of iterations required to converge?

10. Prove that the coefficient matrix of the LS problem arising in the GMRES method has full column rank.

11. Show that $c_U(A) = U^*\delta(UAU^*)U$.
Chapter 7

Nonsymmetric Eigenvalue
Problems

Eigenvalue problems are particularly interesting in NLA. In this chapter, we study


nonsymmetric eigenvalue problems. Some well-known methods, such as the power
method, the inverse power method and the QR method, are discussed.

7.1 Basic properties


Let $A \in \mathbb{C}^{n\times n}$. We recall that a complex number $\lambda$ is called an eigenvalue of $A$ if there exists a nonzero vector $x \in \mathbb{C}^n$ such that
$$Ax = \lambda x.$$
Here $x$ is called an eigenvector of $A$ associated with $\lambda$. It is well known that $\lambda$ is an eigenvalue of $A$ if and only if
$$\det(\lambda I - A) = 0.$$
Let
$$p_A(\lambda) = \det(\lambda I - A)$$
be the characteristic polynomial of $A$. By the Fundamental Theorem of Algebra, we know that $p_A(\lambda)$ has $n$ roots in $\mathbb{C}$, i.e., $A$ has $n$ eigenvalues.

Now suppose that $p_A(\lambda)$ has the following factorization:
$$p_A(\lambda) = (\lambda - \lambda_1)^{n_1}(\lambda - \lambda_2)^{n_2}\cdots(\lambda - \lambda_p)^{n_p},$$
where $n_1 + n_2 + \cdots + n_p = n$ and $\lambda_i \ne \lambda_j$ for $i \ne j$. The number $n_i$ is called the algebraic multiplicity of $\lambda_i$, and the number
$$m_i = n - \mathrm{rank}(\lambda_i I - A)$$


is called the geometric multiplicity of $\lambda_i$. Actually, $m_i$ is the dimension of the eigenspace of $\lambda_i$; we remark that the eigenspace of $\lambda_i$ is the solution space of $(\lambda_i I - A)x = 0$. It is easy to see that $m_i \le n_i$ for $i = 1, 2, \ldots, p$. If $n_i = 1$, then $\lambda_i$ is called a simple eigenvalue. If $m_i < n_i$ for some eigenvalue $\lambda_i$, then $A$ is called defective. If the geometric multiplicity is equal to the algebraic multiplicity for each eigenvalue of $A$, then $A$ is said to be nondefective. Note that $A$ is diagonalizable if and only if $A$ is nondefective.
Let $A, B \in \mathbb{C}^{n\times n}$. We recall that $A$ and $B$ are called similar matrices if there is a nonsingular matrix $X \in \mathbb{C}^{n\times n}$ such that
$$B = XAX^{-1}.$$
The transformation
$$A \longrightarrow B = XAX^{-1}$$
is called a similarity transformation with similarity matrix $X$. Similar matrices have the same eigenvalues, and if $x$ is an eigenvector of $A$, then $y = Xx$ is an eigenvector of $B$. By the Jordan Decomposition Theorem (Theorem 1.1), we know that any $n$-by-$n$ matrix is similar to its Jordan canonical form. If the similarity matrix is required to be unitary, we then have the following theorem, perhaps the most fundamentally useful theorem in NLA, see [17].

Theorem 7.1 (Schur Decomposition Theorem) Let $A \in \mathbb{C}^{n\times n}$ with eigenvalues $\lambda_1, \ldots, \lambda_n$ in any prescribed order. Then there exists a unitary matrix $U \in \mathbb{C}^{n\times n}$ such that
$$U^*AU = T = [t_{ij}],$$
where $T$ is an upper triangular matrix with diagonal entries $t_{ii} = \lambda_i$, $i = 1, \ldots, n$. Furthermore, if $A \in \mathbb{R}^{n\times n}$ and all the eigenvalues of $A$ are real, then $U$ may be chosen to be real and orthogonal.

Proof: Let $x_1 \in \mathbb{C}^n$ be a normalized eigenvector of $A$ associated with the eigenvalue $\lambda_1$. The vector $x_1$ can be extended to a basis of $\mathbb{C}^n$:
$$x_1, y_2, \ldots, y_n.$$
By applying the Gram-Schmidt orthonormalization procedure to this basis, we obtain an orthonormal basis of $\mathbb{C}^n$:
$$x_1, z_2, \ldots, z_n.$$
Let
$$U_1 = [x_1, z_2, \ldots, z_n]$$
be a unitary matrix. By a simple calculation, we obtain
$$U_1^*AU_1 = \begin{pmatrix}\lambda_1 & *\\ 0 & A_1\end{pmatrix}.$$
The matrix $A_1 \in \mathbb{C}^{(n-1)\times(n-1)}$ has the eigenvalues $\lambda_2, \ldots, \lambda_n$. Let $x_2 \in \mathbb{C}^{n-1}$ be a normalized eigenvector of $A_1$ associated with $\lambda_2$, and repeat the construction: determine a unitary matrix $V_2 \in \mathbb{C}^{(n-1)\times(n-1)}$ such that
$$V_2^*A_1V_2 = \begin{pmatrix}\lambda_2 & *\\ 0 & A_2\end{pmatrix}.$$
Let
$$U_2 = \begin{pmatrix}1 & 0\\ 0 & V_2\end{pmatrix}.$$
Then the matrices $U_2$ and $U_1U_2$ are unitary, and
$$U_2^*U_1^*AU_1U_2 = \begin{pmatrix}\lambda_1 & * & *\\ & \lambda_2 & *\\ 0 & & A_2\end{pmatrix}.$$
Continuing this process, we can produce unitary matrices $U_1, U_2, \ldots, U_{n-1}$ such that the matrix
$$U = U_1U_2\cdots U_{n-1}$$
is unitary and $U^*AU$ has the desired form.

Theorem 7.2 (Real Schur Decomposition Theorem [17]) Let $A \in \mathbb{R}^{n\times n}$. Then there exists an orthogonal matrix $Q \in \mathbb{R}^{n\times n}$ such that
$$Q^TAQ = \begin{pmatrix}
R_{11} & R_{12} & \cdots & R_{1m}\\
 & R_{22} & \cdots & R_{2m}\\
 & & \ddots & \vdots\\
0 & & & R_{mm}
\end{pmatrix},$$
where each $R_{ii}$ is either a real number or a 2-by-2 matrix having a pair of complex conjugate eigenvalues.

In general, one cannot hope to reduce a real matrix to a strictly upper triangular form by an orthogonal similarity transformation, because the diagonal entries would then be the eigenvalues, which need not be real.

7.2 Power method


The power method is an iterative algorithm for computing the eigenvalue with the largest absolute value and a corresponding eigenvector. We now introduce the basic idea of this method. For simplicity, we first suppose that the matrix $A \in \mathbb{C}^{n\times n}$ is diagonalizable, i.e., $A$ has the following Jordan decomposition:
$$A = X\Lambda X^{-1},$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with the ordering
$$|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|,$$
and $X = [x_1, \ldots, x_n] \in \mathbb{C}^{n\times n}$. Let $u_0 \in \mathbb{C}^n$ be any vector. Since the column vectors $x_1, \ldots, x_n$ form a basis of $\mathbb{C}^n$, we have
$$u_0 = \sum_{j=1}^{n}\alpha_j x_j,$$
where $\alpha_j \in \mathbb{C}$. We therefore have
$$A^k u_0 = \sum_{j=1}^{n}\alpha_j A^k x_j = \sum_{j=1}^{n}\alpha_j\lambda_j^k x_j = \lambda_1^k\left(\alpha_1 x_1 + \sum_{j=2}^{n}\alpha_j\left(\frac{\lambda_j}{\lambda_1}\right)^k x_j\right),$$
and then
$$\lim_{k\to\infty}\frac{A^k u_0}{\lambda_1^k} = \alpha_1 x_1.$$
When $\alpha_1 \ne 0$ and $k$ is sufficiently large, we know that the vector
$$u_k = \frac{A^k u_0}{\lambda_1^k} \qquad (7.1)$$
is a good approximate eigenvector of $A$.
In practice, we cannot use (7.1) directly to compute an approximate eigenvector, since we do not know the eigenvalue $\lambda_1$ in advance and the operation cost of forming $A^k$ is very large when $k$ is large. We therefore propose the following iterative algorithm:
$$\begin{aligned}
y_k &= Au_{k-1},\\
\mu_k &= \xi_{j_k}^{(k)}, \qquad (7.2)\\
u_k &= y_k/\mu_k,
\end{aligned}$$
where $u_0 \in \mathbb{C}^n$ is any given initial vector, usually with $\|u_0\| = 1$, and $\mu_k = \xi_{j_k}^{(k)}$ denotes the component of $y_k$ with the largest absolute value. This iterative algorithm is called the power method. We have the following theorem for the convergence of the power method.
Theorem 7.3 Let $A \in \mathbb{C}^{n\times n}$ have $p$ distinct eigenvalues satisfying
$$|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_p|,$$
and suppose that the geometric multiplicity of $\lambda_1$ is equal to its algebraic multiplicity. If the projection of the initial vector $u_0$ on the eigenspace of $\lambda_1$ is nonzero, then $\{u_k\}$ produced by (7.2) converges to an eigenvector $x_1$ associated with $\lambda_1$. Also, $\{\mu_k\}$ produced by (7.2) converges to $\lambda_1$.

Proof: We note that $A$ has the Jordan decomposition
$$A = X\,\mathrm{diag}(J_1, \ldots, J_p)\,X^{-1}, \qquad (7.3)$$
where $X \in \mathbb{C}^{n\times n}$, $J_i \in \mathbb{C}^{n_i\times n_i}$ is the Jordan block associated with $\lambda_i$ $(i = 1, \ldots, p)$, and
$$n_1 + n_2 + \cdots + n_p = n.$$
Since the geometric multiplicity of $\lambda_1$ is the same as its algebraic multiplicity, we have
$$J_1 = \lambda_1 I_{n_1},$$
where $I_{n_1} \in \mathbb{R}^{n_1\times n_1}$ is the identity matrix. Let $y = X^{-1}u_0$ and decompose $y$ and $X$ as
$$y = (y_1^T, y_2^T, \ldots, y_p^T)^T, \qquad X = [X_1, X_2, \ldots, X_p],$$
where $y_i \in \mathbb{C}^{n_i}$ and $X_i \in \mathbb{C}^{n\times n_i}$, for $i = 1, \ldots, p$. By using (7.3), we have
$$\begin{aligned}
A^k u_0 &= X\,\mathrm{diag}(J_1^k, \ldots, J_p^k)\,X^{-1}u_0\\
&= X_1J_1^ky_1 + X_2J_2^ky_2 + \cdots + X_pJ_p^ky_p\\
&= \lambda_1^kX_1y_1 + X_2J_2^ky_2 + \cdots + X_pJ_p^ky_p\\
&= \lambda_1^k\left(X_1y_1 + X_2(\lambda_1^{-1}J_2)^ky_2 + \cdots + X_p(\lambda_1^{-1}J_p)^ky_p\right).
\end{aligned}$$
Note that the spectral radius of $\lambda_1^{-1}J_i$ satisfies
$$\rho(\lambda_1^{-1}J_i) = |\lambda_i|/|\lambda_1| < 1,$$
for $i = 2, 3, \ldots, p$. Therefore,
$$\lim_{k\to\infty}\frac{1}{\lambda_1^k}A^ku_0 = X_1y_1. \qquad (7.4)$$
Since the projection of the initial vector $u_0$ on the eigenspace of $\lambda_1$ is nonzero, we have $X_1y_1 \ne 0$. Let
$$x_1 = \gamma^{-1}X_1y_1,$$
where $\gamma$ is the component of $X_1y_1$ with the largest absolute value. Obviously, $x_1$ is an eigenvector of $A$ associated with $\lambda_1$. Let $\gamma_k$ be the component of $A^ku_0$ with the largest absolute value; then the component of $\lambda_1^{-k}A^ku_0$ with the largest absolute value is $\gamma_k/\lambda_1^k$. By (7.2), we have
$$u_k = \frac{Au_{k-1}}{\mu_k} = \frac{A^ku_0}{\mu_k\mu_{k-1}\cdots\mu_1} = \frac{A^ku_0}{\gamma_k} = \frac{A^ku_0/\lambda_1^k}{\gamma_k/\lambda_1^k}.$$
By using (7.4), we know that $\{u_k\}$ is convergent and
$$\lim_{k\to\infty}u_k = \frac{\lim\limits_{k\to\infty}(A^ku_0/\lambda_1^k)}{\lim\limits_{k\to\infty}(\gamma_k/\lambda_1^k)} = \frac{X_1y_1}{\gamma} = x_1.$$
By using $Au_{k-1} = \mu_ku_k$ and the fact that $\{u_k\}$ converges to an eigenvector whose component of largest absolute value equals $1$, we immediately know that $\{\mu_k\}$ converges to $\lambda_1$.

We remark that, from the proof of Theorem 7.3, the convergence rate of the power method is determined by the ratio $|\lambda_2|/|\lambda_1|$. Under the conditions of the theorem, we know that
$$\frac{|\lambda_2|}{|\lambda_1|} < 1.$$
The smaller $|\lambda_2|/|\lambda_1|$ is, the faster the convergence will be. When $|\lambda_2|/|\lambda_1|$ is close to $1$, the convergence will be very slow. In order to speed up the convergence of the power method, we could apply the method to $A - \mu I$, where $\mu$ is called a shift. The shift $\mu$ can be chosen such that the gap between the eigenvalue of largest absolute value and the other eigenvalues becomes larger, and therefore the convergence rate of the power method is improved.
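As an illustration (not part of the text), a minimal Python sketch of the iteration (7.2) is given below; the normalization uses the component of largest absolute value, as described above, and the function name power_method is our own:

    import numpy as np

    def power_method(A, u0, kmax=1000, tol=1e-10):
        # Power method (7.2): u_k is scaled so that its component of largest
        # absolute value equals 1; mu converges to the dominant eigenvalue.
        u = u0 / u0[np.argmax(np.abs(u0))]
        mu = 0.0
        for _ in range(kmax):
            y = A @ u
            mu = y[np.argmax(np.abs(y))]   # component of largest absolute value
            u_new = y / mu
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        return mu, u

    # To accelerate convergence one may apply the same routine to A - sigma*I
    # for a suitable shift sigma, as remarked above.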

7.3 Inverse power method


If one uses the power method on $A^{-1}$ to obtain the eigenvalue of $A$ with the smallest absolute value and its corresponding eigenvector, the resulting algorithm is called the inverse power method. Its basic iterative scheme is given as follows:
$$\begin{aligned}
Ay_k &= z_{k-1},\\
\mu_k &= \xi_{j_k}^{(k)},\\
z_k &= y_k/\mu_k,
\end{aligned}$$
where $z_0 \in \mathbb{C}^n$ is any given initial vector and $\mu_k = \xi_{j_k}^{(k)}$ is the component of $y_k$ with the largest absolute value. From Theorem 7.3, we know that if the eigenvalues of $A$ satisfy
$$|\lambda_n| < |\lambda_{n-1}| \le \cdots \le |\lambda_1|,$$
then $\{z_k\}$ converges to an eigenvector associated with $\lambda_n$ and $\{\mu_k\}$ converges to $\lambda_n^{-1}$. The convergence rate of the inverse power method is determined by $|\lambda_n|/|\lambda_{n-1}|$.

In practice, the inverse power method is usually applied to the matrix
$$A - \mu I$$
to compute an approximate eigenvector when an approximation $\mu$ to a distinct eigenvalue $\lambda_i$ of $A$ is available in advance. Therefore, we have the following inverse power method with a shift $\mu$:
$$\begin{aligned}
(A - \mu I)v_k &= z_{k-1},\\
z_k &= v_k/\|v_k\|_2. \qquad (7.5)
\end{aligned}$$
From (7.5), we know that each iteration of the inverse power method requires the solution of a linear system. Hence, its operation cost per iteration is much larger than that of the power method. However, one can compute an LU factorization of $A - \mu I$ with partial pivoting once in advance; each later iteration then only requires the solution of two triangular systems.

Suppose that the eigenvalues of $A - \mu I$ are ordered as follows:
$$0 < |\lambda_1 - \mu| < |\lambda_2 - \mu| \le |\lambda_3 - \mu| \le \cdots \le |\lambda_n - \mu|.$$
From Theorem 7.3 again, we know that the sequence $\{z_k\}$ produced by (7.5) converges to an eigenvector associated with $\lambda_1$. The convergence rate is determined by the ratio $|\lambda_1 - \mu|/|\lambda_2 - \mu|$. The closer $\mu$ is to $\lambda_1$, the faster the convergence will be. But if $\mu$ is close to an eigenvalue of $A$, then $A - \mu I$ is close to a singular matrix, and one needs to solve an ill-conditioned linear system in each iteration of the inverse power method. However, practical computations show that this ill-conditioning has little effect on the convergence of the method. Usually, a single iteration already produces a good approximate eigenvector of $A$ when $\mu$ is close to an eigenvalue of $A$.
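A sketch of the shifted inverse power iteration (7.5) in Python is given below; it factorizes $A - \mu I$ once with scipy.linalg.lu_factor so that each step needs only two triangular solves, and the function name is our own:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def inverse_power_shift(A, mu, z0, kmax=50, tol=1e-12):
        # Shifted inverse power method (7.5): factorize A - mu*I once with
        # partial pivoting; each iteration then costs two triangular solves.
        n = A.shape[0]
        lu, piv = lu_factor(A - mu * np.eye(n))
        z = z0 / np.linalg.norm(z0)
        for _ in range(kmax):
            v = lu_solve((lu, piv), z)
            z_new = v / np.linalg.norm(v)
            done = 1.0 - abs(z_new @ z) < tol
            z = z_new
            if done:
                break
        # Rayleigh quotient of the original matrix gives the eigenvalue estimate.
        return (z @ (A @ z)) / (z @ z), z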

7.4 QR method
In this section, we introduce the well-known QR method, which is one of the most important developments in matrix computations. For any given $A_0 = A \in \mathbb{C}^{n\times n}$, the basic iterative scheme of the QR algorithm is given as follows:
$$A_{m-1} = Q_mR_m, \qquad A_m = R_mQ_m, \qquad (7.6)$$
for $m = 1, 2, \ldots$, where $Q_m$ is a unitary matrix and $R_m$ is an upper triangular matrix. For simplicity in the later analysis, we require the diagonal entries of $R_m$ to be nonnegative. By (7.6), one can easily obtain
$$A_m = Q_m^*A_{m-1}Q_m, \qquad (7.7)$$
i.e., each matrix in the sequence $\{A_m\}$ is similar to the matrix $A$. By using (7.7) repeatedly, we have
$$A_m = \widetilde{Q}_m^*A\widetilde{Q}_m, \qquad (7.8)$$
where $\widetilde{Q}_m = Q_1Q_2\cdots Q_m$. Substituting $A_m = Q_{m+1}R_{m+1}$ into (7.8), we obtain
$$\widetilde{Q}_mQ_{m+1}R_{m+1} = A\widetilde{Q}_m.$$
Therefore,
$$\widetilde{Q}_mQ_{m+1}R_{m+1}R_m\cdots R_1 = A\widetilde{Q}_mR_m\cdots R_1,$$
i.e.,
$$\widetilde{Q}_{m+1}\widetilde{R}_{m+1} = A\widetilde{Q}_m\widetilde{R}_m,$$
where $\widetilde{R}_k = R_kR_{k-1}\cdots R_1$, for $k = m, m+1$. Moreover, we have
$$A^m = \widetilde{Q}_m\widetilde{R}_m. \qquad (7.9)$$

Theorem 7.4 Suppose that the eigenvalues of $A$ satisfy
$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0.$$
Let $Y$ be the matrix whose $i$-th row $y_i^T$ satisfies
$$y_i^TA = \lambda_iy_i^T.$$
If $Y$ has an LU factorization, then the entries below the diagonal of the matrix
$$A_m = [\alpha_{ij}^{(m)}]$$
produced by (7.6) tend to zero as $m \to \infty$. At the same time,
$$\alpha_{ii}^{(m)} \to \lambda_i,$$
for $i = 1, 2, \ldots, n$.

Proof: Let
X = Y 1 , = diag(1 , , n ).
Then A = XY . By assumption, Y has an LU factorization

Y = LU

where L is a unit lower triangular matrix and U is an upper triangular matrix. Hence

Am = Xm Y = Xm LU = X(m Lm )m U
(7.10)
= X(I + Em )m U,

where I + Em = m Lm . Since L is a unit lower triangular matrix and |i | < |j |


for i > j, we have
lim Em = 0. (7.11)
m

Let X = QR where Q is a unitary matrix and R is an upper triangular matrix. Since


X is nonsingular, we can require that diagonal entries of R are positive. Substituting
X = QR into (7.10), we obtain

Am = QR(I + Em )m U = Q(I + REm R1 )Rm U. (7.12)

When m is sufficiently large, I + REm R1 is nonsingular and has the following QR


decomposition,
I + REm R1 = Qbm R
bm , (7.13)
bm are positive. By using (7.11) and (7.13), it is easy to
where diagonal entries of R
show that
lim Qb m = lim Rbm = I. (7.14)
m m

Substituting (7.13) into (7.12), we have

b m )(R
Am = (QQ bm Rm U ),

which is a QR decomposition of Am . In order to guarantee all the diagonal entries of


the upper triangular matrix to be positive, we define

1 n
D1 = diag ,, ,
|1 | |n |

and
u11 unn
D2 = diag ,, ,
|u11 | |unn |

where uii is the i-th diagonal entry of U . Hence, we have


b m Dm D2 )(D1 Dm R
Am = (QQ bm Rm U ).
1 2 1

Comparing with (7.9) and noting that the QR decomposition is unique, we obtain
e m = QQ
Q b m Dm D2 , em = D1 Dm R
R bm Rm U.
1 2 1

Substituting them into (7.8), we have


b m Q AQQ
Am = D2 (D1 )m Q b m D1m D2 .

Note that
A = XY = XX 1 = QRR1 Q .
We finally have
b RR1 Q
Am = D2 (D1 )m Q b m Dm D2 .
m 1

When m , we know that by (7.14) the entries under the diagonal of the matrix
Am produced by (7.6) tend to zero. At the same time,
(m)
ii i ,

for i = 1, 2, , n.

From Theorem 7.4, we know that the sequence {Am } produced by (7.6) converges to
the Schur decomposition of A.
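Although not part of the text's derivation, a tiny sketch of the basic iteration (7.6) using numpy.linalg.qr may help to visualize Theorem 7.4; under its assumptions, the diagonal of $A_m$ approaches the eigenvalues (function name is our own):

    import numpy as np

    def basic_qr_iteration(A, m=200):
        # Basic (unshifted) QR algorithm: A_{m-1} = Q_m R_m, A_m = R_m Q_m.
        Am = A.copy().astype(float)
        for _ in range(m):
            Q, R = np.linalg.qr(Am)
            Am = R @ Q
        return Am   # diag(Am) approximates the eigenvalues under Theorem 7.4

    # Example:
    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    print(np.sort(np.diag(basic_qr_iteration(A))))
    print(np.sort(np.linalg.eigvalsh(A)))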

7.5 Real version of QR algorithm


If A Rnn , we wish that we could find a fast and effective QR algorithm with real
number operations only. We concentrate on developing the real analog of (7.6) as
follows: let A1 = A and then construct an iterative algorithm:

Am = Qm Rm ,
(7.15)

Am+1 = Rm Qm ,

for m = 1, 2, , where Qm Rnn is an orthogonal matrix and Rm Rnn is an upper


triangular matrix. A difficulty associated with (7.15) is that Am can never converge
to a strictly upper triangular form in the case that A has complex eigenvalues. By
Theorem 7.2, we can expect that (7.15) converges to the real Schur decomposition of
A.
We note that in practice that (7.15) is not a good iterative method because the
following two reasons:

(i) the operation cost in each iteration is too large;

(ii) the convergence rate is too slow.


We therefore need to reduce the operation cost in each iteration and increase the
convergence rate of the method. Thus, the upper Hessenberg reduction and the shift
strategy are introduced.

7.5.1 Upper Hessenberg reduction


We construct an orthogonal matrix Q0 such that

QT0 AQ0 = H (7.16)

has some special structure (Hessenberg) with many zero entries. Afterwards, we can
apply the QR algorithm (7.15) to the matrix H in (7.16). Then the operation cost per
iteration can be dramatically reduced.
For A = [ij ] Rnn , at the first step, we can choose a Householder transformation
H1 such that the first column of H1 A has many zero entries (at most n 1 zero
entries). In order to keep a similarity transformation, we need to add one more column
transformation:
H1 AH1 .
Hence H1 could have the following form

1 0
H1 = e1 (7.17)
0 H

to keep the zero entries in the first column unchanged. By using H1 defined as in (7.17),
we have " #
11 e1
aT2 H
H1 AH1 = e 1 A22 H
e 1 a1 H e1 (7.18)
H

where aT1 = (21 , 31 , n1 ), aT2 = (12 , 13 , 1n ), and A22 is the (n 1)-by-(n 1)


principal submatrix in the lower right corner of A. We know by (7.18) that the best
choice of the Householder transformation H e 1 should be

e 1 a1 = pe1
H (7.19)

where p R and e1 Rn1 is the first unit vector. Therefore, the first column of the
matrix in (7.18) has n 2 zeros by using H1 defined by (7.17) and (7.19).
Afterwards, for Ae22 = H e 1 , we could find a Householder transformation
e 1 A22 H

e 1 0
H2 = b2
0 H

such that
e2A
(H e22 H
e 2 )e1 = (, , 0, , 0)T .
Let
1 0
H2 = e2 .
0 H
We therefore have
h11 h12
. ..

h21 h22 .. .
.
0 h32 ..
H2 H1 AH1 H2 =
..
.

0 0 .

.. .. .. ..
. . . .
0 0
After n 2 steps, we have found n 2 Householder transformations H1 , H2 , , Hn2 ,
such that
Hn2 H2 H1 AH1 H2 Hn2 = H
where
h11 h12 h13 h1,n1 h1n
h21 h22 h23 h2,n1 h2n

0 h32 h33 h3,n1 h3n

H=
.. .. ..

. 0 h43 . .
.. .. .. .. ..
. . . . .
0 0 0 hn,n1 hnn
with hij = 0 for i > j + 1, and is called the upper Hessenberg matrix. Let
Q0 = H1 H2 Hn2
and therefore,
QT0 AQ0 = H,
which is called the upper Hessenberg decomposition of A. We have the following
algorithm by using Householder transformations.

Algorithm 7.1 (Upper Hessenberg decomposition)




for k = 1 : n 2


[v, ] = house(A(k + 1 : n, k))
A(k + 1 : n, k : n) = (I vv T )A(k + 1 : n, k : n)



A(1 : n, k + 1 : n) = A(1 : n, k + 1 : n)(I vv T )

end

The operation cost of the Hessenberg decomposition by Householder transforma-


tions is O(n3 ). Of course, the Hessenberg decomposition can also be obtained by using
Givens rotations but the operation cost will be double of that by using Householder
transformations. Although the Hessenberg decomposition of a matrix is not unique,
we have the following theorem.

Theorem 7.5 Suppose that A Rnn has the following two upper Hessenberg decom-
positions:
U T AU = H, V T AV = G, (7.20)
where
U = [u1 , u2 , , un ], V = [v1 , v2 , , vn ]
are n-by-n orthogonal matrices, and H = [hij ], G = [gij ] are upper Hessenberg matrices.
If u1 = v1 and all the entries hi+1,i are not zero, then there exists a diagonal matrix D
with diagonal entries being either 1 or 1 such that

U = V D, H = DGD.

Proof: Assume that we have proved for some m, 1 m < n, that

uj = j vj , j = 1, 2, , m, (7.21)

where 1 = 1, j = 1 or 1. Now, we want to show that there exists m+1 = 1 or 1


such that
um+1 = m+1 vm+1 .
From (7.20), we have
AU = U H, AV = V G.
Comparing with the m-th column of above matrix equalities respectively, we obtain

Aum = h1m u1 + + hmm um + hm+1,m um+1 , (7.22)

and
Avm = g1m v1 + + gmm vm + gm+1,m vm+1 . (7.23)
Multiplying (7.22) and (7.23) by uTi and viT , respectively, we have

him = uTi Aum , gim = viT Avm , i = 1, 2, , m.

Therefore by (7.21),
him = i m gim , i = 1, 2, , m. (7.24)

Substituting (7.24) into (7.22), and using (7.21) and (7.23), we obtain

hm+1,m um+1 = m (Avm 21 g1m v1 2m gmm vm )

= m (Avm g1m v1 gmm vm ) (7.25)

= m gm+1,m vm+1 .

Hence,
|hm+1,m | = |gm+1,m |.
Since hm+1,m 6= 0, from (7.25), we know that

um+1 = m+1 vm+1 ,

where m+1 = 1 or 1. By induction, the proof is complete.

We note that for an upper Hessenberg matrix H = [hij ], if hi+1,i 6= 0, i =


1, 2, , n 1, then it is irreducible. Theorem 7.5 said that if QT AQ = H is irreducible
upper Hessenberg where Q is an orthogonal matrix, then Q and H are determined
completely by the first column of Q (up to 1).
Now, suppose that H Rnn is an upper Hessenberg matrix, we want to apply a QR
iteration to H. Firstly, we need to construct a QR decomposition for H. Since H has
a special structure, we can use n 1 Givens rotations to obtain the QR decomposition.
For simplicity, in the following, we consider the case of n = 5. Suppose that we have
already found two Givens rotations P12 and P23 such that


0

P23 P12 H =
0 0 h 33 .

0 0 h43
0 0 0

Then we construct a Givens rotation P34 = G(3, 4, 3 ) such that the 3 satisfies

cos 3 sin 3 h33
= .
sin 3 cos 3 h43 0

Hence,

0

P34 P23 P12 H =
0 0 .

0 0 0
0 0 0

Therefore, it is easy to see that for n-by-n upper Hessenberg matrix H, we can construct
n 1 Givens rotations P12 , P23 , , Pn1,n such that

Pn1,n Pn2,n1 P12 H = R

is an upper triangular matrix. Let

Q = (Pn1,n Pn2,n1 P12 )T .

Then we have H = QR. In order to complete a QR iteration, we have to compute


e = RQ = RP12
H T T T
P23 Pn1,n .
T is different from R only in the first two columns. Since R is an upper
Note that RP12
T should have the following form (for n = 5),
triangular matrix, RP12




RP12T
= 0 0 .

0 0 0
0 0 0 0
T P T is different from RP T
Similarly, RP12 only in the second column and the third
23 12
column. Hence,



RP12 P23 =
T T
0
.
0 0 0
0 0 0 0
e which is also an upper Hessenberg matrix.
Continuously, we finally obtain the matrix H
It is easy to know that the operation cost of a QR iteration for an upper Hessenberg
matrix is O(n2 ). Note that the operation cost of a QR iteration for a general matrix
is O(n3 ).

7.5.2 QR iteration with single shift


Now, we only need to discuss the Hessenberg form. From Theorem 7.4, we know
that the convergence rate of the basic QR algorithm is linear and is depending on the
distance between eigenvalues. In order to speed up the convergence rate, we introduce
the shift strategy. The QR iteration with a single shift is given as follows:

Hm m I = Qm Rm ,
(7.26)

Hm+1 = Rm Qm + m I,

for m = 1, 2, , where H1 = H Rnn is a given upper Hessenberg matrix satisfying


the conditions of Theorem 7.4 and m R. We consider how to choose a shift if all
the eigenvalues of H are assumed to be real. Since Hm is upper Hessenberg, there are
(m) (m)
only two nonzero entries hn,n1 and hnn in the last row (for n = 5):




0
Hm = .
0 0

(m) (m)
0 0 0 hn,n1 hnn
(m) (m)
When the QR algorithm converges, hn,n1 will be very small and hnn will approach to
(m)
an eigenvalue of H. Therefore, we can choose a shift m = hnn . In fact, we can prove
(m)
that if hn,n1 = is very small, then after one iteration of the QR algorithm, we have
(m+1)
hn,n1 = O(2 ). (7.27)
(m)
From the discussion above, we know that there are n 1 steps to reduce Hm hnn I
to be an upper triangular matrix. Assume that the first n 2 steps are completed:


0

b
H=
0 0 .
0 0 0
0 0 0 0
Actually, we only need to study the 2-by-2 submatrix


Hm2 =
0
b In the (n 1)-th step of reduction, we want
in the lower right corner of the matrix H.
to determine c = cos and s = sin such that

c s
= .
s c 0
It is easy to see that
p
c = /, s = /, = 2 + 2 .
Thus, after a few simple computations on the 2-by-2 submatrix in the lower right corner
(m)
of the matrix Hm+1 = Rm Qm + hnn I, we have
(m+1)
hn,n1 = s2 = 2 / 2 = O(2 ),

i.e., (7.27) holds. Through a shift, the convergence rate of the QR iteration is expected
to be quadratic. If H has complex eigenvalues, then a double shift strategy can be used
to speed up the convergence rate of the QR iteration.

7.5.3 QR iteration with double shift


Note that difficulties with (7.26) can be expected if the submatrix
" #
(m) (m)
hpp hpn
G= (m) (m) , p=n1
hnp hnn

in the lower right corner of Hm has a pair of complex conjugate eigenvalues 1 and 2 .
(m)
We cannot expect that hnn tends to an eigenvalue of A. A way around this difficulty
is to perform the following QR algorithm with double shifts:


H 1 I = Q1 R1 ,






H1 = R1 Q1 + 1 I,
(7.28)


H1 2 I = Q2 R2 ,






H2 = R2 Q2 + 2 I,

where H = Hm . Let
M (H 1 I)(H 2 I). (7.29)
By a few simple computations, we have

M = QR, (7.30)

and
H2 = QT HQ, (7.31)
where
Q = Q1 Q2 , R = R 2 R1 .
By (7.29), we obtain
M = H 2 sH + tI,
where
s = 1 + 2 = h(m) (m)
pp + hnn R,

and
t = 1 2 = det(G) R.

Hence M is also real. If 1 , 2 are not eigenvalues of H and diagonal entries of R1 ,


R2 in each iteration are chosen to be positive, then by using (7.30), Q is also real.
By (7.31), it then follows that H2 is real. Therefore, under an assumption of without
rounding error, by using (7.28), H2 is still a real upper Hessenberg matrix. In practice,
because of rounding error, usually, H2 is not real. In order to keep the reality of H2 ,
by using (7.30) and (7.31), we propose the following process to compute H2 :
(1) Compute M = H 2 sH + tI.
(2) Compute the QR decomposition of M : M = QR.
(3) Compute H2 = QT HQ.
Note that the operation cost of forming the matrix M in (1) needs O(n3 ). We
remark that actually, we do not need to form matrix M explicitly, see [14]. For a
practical implementation of the QR algorithm with the shift strategy, we refer to [14,
19, 47].

Exercises:
1. Show that if T Cnn is upper triangular and normal, then T is diagonal.
2. Let A, B Cnn . Prove that the spectrum of AB is equal to the spectrum of BA.
3. Let A Cnn , x Cn and X = [x, Ax, , An1 x]. Show that if X is nonsingular, then
X 1 AX is an upper Hessenberg matrix.
4. Suppose that A Cnn has distinct eigenvalues. Show that if Q AQ = T is the Schur
decomposition and AB = BA, then Q BQ is upper triangular.
5. Suppose that A Rnn and z Rn . Find a detailed algorithm for computing an
orthogonal matrix Q such that QT AQ is upper Hessenberg and QT z is a multiple of e1
where e1 is the first unit vector.
6. Suppose that W , Y Rnn and define matrices C, B by

W Y
C = W + iY, B = .
Y W
Show that if R is an eigenvalue of C, then is also an eigenvalue of B. What is the
relation between two corresponding eigenvectors?
7. Suppose that
w x
A= R22
y z
has eigenvalues i, where 6= 0. Find an algorithm that determines c = cos and
s = sin stably such that
T
c s w x c s
= ,
s c y z s c

where = 2 .

8. Find a 2-by-2 diagonal matrix D that minimizes kD1 ADkF where



w x
A= .
y z

9. Let H = H1 be a given matrix. We generate matrices Hk via

Hk k I = Qk Rk , Hk+1 = Rk Qk + k I.

Show that
(Q1 Qj )(Rj R1 ) = (H 1 I) (H j I).

10. Show that if


I Z
Y = ,
0 I
then
1 p
2 (Y ) kY k2 kY 1 k2 = 2 + 2 + 4 2 + 4 ,
2
where = kZk2 .
11. Let A be a matrix with real diagonal entries. Show that
X
|Im()| max |aij |,
i
j6=i

where is any eigenvalue of A and Im() denotes the imaginary part of a complex number.
12. Show that if
1
(A + AT )
2
is positive definite, then Re() > 0, where is any eigenvalue of A and Re() denotes the
real part of a complex number.
13. Let B be a matrix with kBk2 < 1. Show that I B is nonsingular and the eigenvalues
of I + 2(B I)1 have negative real parts.
Chapter 8

Symmetric Eigenvalue Problems

The symmetric eigenvalue problem with its nice properties and rich mathematical the-
ory is one of the most pleasing topics in NLA. In this chapter, we will study symmetric
eigenvalue problems. We begin by introducing some basic spectral properties of sym-
metric matrices. Then the symmetric QR method, the Jacobi method and the bisection
method are discussed. Finally, a divide-and-conquer algorithm is described.

8.1 Basic spectral properties


We first introduce some basic properties of eigenvalues and eigenvectors of any sym-
metric matrix A Rnn . It is well known that the eigenvalues of any symmetric matrix
A are real and there is an orthonormal basis of Rn formed by the eigenvectors of A.

Theorem 8.1 (Spectral Decomposition Theorem [17]) If A Rnn is symmet-


ric, then there exists an orthogonal matrix Q Rnn such that

QT AQ = diag(1 , , n ).

The eigenvalues of a symmetric matrix have minimax properties based on the values
called the Rayleigh quotient:
xT Ax
.
xT x
We have the following theorem and its proof can be found in [19].

Theorem 8.2 (Courant-Fischer Minimax Theorem) Let A Rnn be symmetric


and its eigenvalues be ordered as

1 n .


Then
$$\lambda_i = \max_{\dim(S)=i}\ \min_{0\ne u\in S}\frac{u^TAu}{u^Tu} = \min_{\dim(S)=n-i+1}\ \max_{0\ne u\in S}\frac{u^TAu}{u^Tu},$$
where $S$ ranges over subspaces of $\mathbb{R}^n$.

The next theorem [46] shows the sensitivity of eigenvalues of symmetric matrices.

Theorem 8.3 (Weyl, Wielandt-Hoffman Theorem) If A, B Rnn are sym-


metric matrices and the eigenvalues of A, B are ordered respectively as follows,

1 (A) n (A), 1 (B) n (B),

then
|i (A) i (B)| kA Bk2 , i = 1, 2, , n,
and
n
X
(i (A) i (B))2 kA Bk2F .
i=1

Theorem 8.3 tells us that the eigenvalues of any symmetric matrix are well conditioned,
i.e., small perturbations on the entries of A cause only small changes in the eigenvalues
of A.

Theorem 8.4 (Cauchy Interlace Theorem [17, 46]) If A Rnn is symmetric


and Ar denotes the r-by-r leading principal submatrix of A, then

$$\lambda_{r+1}(A_{r+1}) \le \lambda_r(A_r) \le \lambda_r(A_{r+1}) \le \cdots \le \lambda_2(A_{r+1}) \le \lambda_1(A_r) \le \lambda_1(A_{r+1}),$$
for $r = 1, 2, \ldots, n-1$.

As for the sensitivity of eigenvectors, we have the following theorem, see [46].

Theorem 8.5 Suppose A, A + E Rnn are symmetric matrices and

Q = [q1 , Q2 ] Rnn

is an orthogonal matrix where q1 is a unit eigenvector of A. Partition the matrices


$Q^TAQ$ and $Q^TEQ$ as follows:
$$Q^TAQ = \begin{pmatrix}\lambda & 0\\ 0 & D_{22}\end{pmatrix}, \qquad Q^TEQ = \begin{pmatrix}\varepsilon & e^T\\ e & E_{22}\end{pmatrix},$$
where $D_{22}, E_{22} \in \mathbb{R}^{(n-1)\times(n-1)}$. If
$$d = \min_{\mu}|\lambda - \mu| > 0, \qquad \|E\|_2 \le d/4,$$
where $\mu$ ranges over the eigenvalues of $D_{22}$, then there exists a unit eigenvector $\widetilde{q}_1$ of $A + E$ such that
$$\sin\theta = \sqrt{1 - |q_1^T\widetilde{q}_1|^2} \le \frac{4}{d}\|e\|_2 \le \frac{4}{d}\|E\|_2,$$
where $\theta = \arccos|q_1^T\widetilde{q}_1|$.
The angle $\theta$ thus serves as a measure of the distance between $q_1$ and $\widetilde{q}_1$. We can see that the sensitivity of a single eigenvector to perturbation depends on the separation of its corresponding eigenvalue from the rest of the eigenvalues.
The eigenvalues of any symmetric matrix are closely related with the singular values
of the matrix. The singular value decomposition [19] is essential in NLA.

Theorem 8.6 (Singular Value Decomposition Theorem) Let $A \in \mathbb{R}^{m\times n}$ with $\mathrm{rank}(A) = r$. Then there exist orthogonal matrices
$$U = [u_1, \ldots, u_m] \in \mathbb{R}^{m\times m}, \qquad V = [v_1, \ldots, v_n] \in \mathbb{R}^{n\times n}$$
such that
$$U^TAV = \begin{pmatrix}\Sigma_r & 0\\ 0 & 0\end{pmatrix},$$
where
$$\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$$
with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$.
The $\sigma_i$ are called the singular values of $A$. The vectors $u_i$ and $v_i$ are called the $i$-th left singular vector and the $i$-th right singular vector, respectively. The next corollary is the Weyl, Wielandt-Hoffman Theorem for singular values.

Corollary 8.1 Let $A, B \in \mathbb{R}^{n\times n}$ and let their singular values be ordered respectively as follows:
$$\sigma_1(A) \ge \cdots \ge \sigma_n(A), \qquad \sigma_1(B) \ge \cdots \ge \sigma_n(B).$$
Then
$$|\sigma_i(A) - \sigma_i(B)| \le \|A - B\|_2, \qquad i = 1, 2, \ldots, n,$$
and
$$\sum_{i=1}^{n}(\sigma_i(A) - \sigma_i(B))^2 \le \|A - B\|_F^2.$$

The corollary shows that the singular values of any real matrix are also well conditioned,
i.e., small perturbations on the entries of A cause only small changes in the singular
values of A.

8.2 Symmetric QR method


The symmetric QR method is a QR iteration for solving symmetric eigenvalue problem.
It applies the QR algorithm to any symmetric matrix A by using its symmetry. In
order to construct an efficient QR method, we first reduce the symmetric matrix A to
a tridiagonal matrix T and then apply the QR iteration to the matrix T .

8.2.1 Tridiagonal QR iteration


Let A be symmetric and suppose that A can be decomposed as

QT AQ = T,

where Q is an orthogonal matrix and T is an upper Hessenberg matrix. Then, T should


be symmetric tridiagonal. In this case we only need to handle with an eigenvalue
problem of a symmetric tridiagonal matrix T .
Partition $A$ as follows,
$$A = \begin{pmatrix}\alpha_1 & v_0^T\\ v_0 & A_0\end{pmatrix},$$
where $A_0 \in \mathbb{R}^{(n-1)\times(n-1)}$. By using Householder transformations, we can reduce the matrix $A$ to a symmetric tridiagonal matrix as follows: at the $k$-th step,

(1) Compute a Householder transformation $\widetilde{H}_k \in \mathbb{R}^{(n-k)\times(n-k)}$ such that
$$\widetilde{H}_kv_{k-1} = \beta_ke_1, \qquad \beta_k \in \mathbb{R}.$$

(2) Compute
$$\begin{pmatrix}\alpha_{k+1} & v_k^T\\ v_k & A_k\end{pmatrix} = \widetilde{H}_kA_{k-1}\widetilde{H}_k,$$
where $A_k \in \mathbb{R}^{(n-k-1)\times(n-k-1)}$.

If we use the $\alpha_k$, $\beta_k$ and $\widetilde{H}_k$ generated by the reduction above to define
$$T = \begin{pmatrix}
\alpha_1 & \beta_1 & & 0\\
\beta_1 & \alpha_2 & \ddots & \\
 & \ddots & \ddots & \beta_{n-1}\\
0 & & \beta_{n-1} & \alpha_n
\end{pmatrix},$$
$$H_k = \mathrm{diag}(I_k, \widetilde{H}_k) \in \mathbb{R}^{n\times n}, \qquad Q = H_1H_2\cdots H_{n-2},$$
where
$$\begin{pmatrix}\alpha_{n-1} & \beta_{n-1}\\ \beta_{n-1} & \alpha_n\end{pmatrix} = \widetilde{H}_{n-2}A_{n-3}\widetilde{H}_{n-2},$$

then we have
QT AQ = T.
From the reduction above, it is easy to see that the main operation cost of the k-th
step is to compute Hk Ak1 Hk . Let

Hk = I vv T , v Rnk .

By using the symmetry of Ak1 , we then have

Hk Ak1 Hk = Ak1 vwT wv T ,

where
1
w = u (v T u)v, u = Ak1 v.
2
Since only the upper triangular portion of this matrix needs to be computed, we see
that the transition from Ak1 to Ak can be computed in 4(nk)2 operations only. Given
a symmetric matrix A Rnn , the following algorithm overwrites A with T = QT AQ,
where T is a tridiagonal matrix and Q is a product of Householder transformations.

Algorithm 8.1 (Householder tridiagonalization)




for k = 1 : n 2

[v, ] = house(A(k + 1 : n, k))



u = A(k + 1 : n, k + 1 : n)v



w = u (uT v/2)v

A(k + 1, k) = kA(k + 1 : n, k)k2



A(k, k + 1) = A(k + 1, k)



A(k + 1 : n, k + 1 : n) = A(k + 1 : n, k + 1 : n) vwT wv T

end

This algorithm requires 4n3 /3 operations. If Q is explicitly required, then it can be


formed with additional 4n3 /3 operations.
After a symmetric matrix $A$ has been reduced to a symmetric tridiagonal matrix $T$, our aim turns to choosing a suitable shift for the QR iteration. Consider the following QR iteration with a single shift $\mu_k$:
$$T_k - \mu_kI = Q_kR_k, \qquad T_{k+1} = R_kQ_k + \mu_kI, \qquad (8.1)$$
for $k = 1, 2, \ldots$, where $T_1 = T$ is a symmetric tridiagonal matrix, and so is each matrix $T_k$ in (8.1). Just as in the QR algorithm for the nonsymmetric case, $T_k$ is assumed to be irreducible, i.e., its sub-diagonal entries are nonzero. Let us discuss how to choose the shift $\mu_k$. We can take $\mu_k = T_k(n, n)$, the $(n, n)$-th entry, as the shift at each iteration. However, a better way is to select
$$\mu_k = \alpha_n + d - \mathrm{sign}(d)\sqrt{d^2 + \beta_{n-1}^2},$$
where $d = (\alpha_{n-1} - \alpha_n)/2$. This is the well-known Wilkinson shift, see [46]. Note that $\mu_k$ is just the eigenvalue of the matrix
$$T_k(n-1:n,\, n-1:n) = \begin{pmatrix}\alpha_{n-1} & \beta_{n-1}\\ \beta_{n-1} & \alpha_n\end{pmatrix}$$
which is closer to $\alpha_n$.
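As a small illustration (the helper name wilkinson_shift is our own), the shift can be computed from the trailing 2-by-2 block in the cancellation-free form used in Algorithm 8.2 below:

    import numpy as np

    def wilkinson_shift(alpha_nm1, alpha_n, beta_nm1):
        # Eigenvalue of [[alpha_{n-1}, beta_{n-1}], [beta_{n-1}, alpha_n]]
        # that is closer to alpha_n, written in a cancellation-free form.
        d = (alpha_nm1 - alpha_n) / 2.0
        if d == 0.0:
            return alpha_n - abs(beta_nm1)
        return alpha_n - beta_nm1**2 / (d + np.sign(d) * np.hypot(d, beta_nm1))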

8.2.2 Implicit symmetric QR iteration


For the explicit QR algorithm (8.1), it is possible to execute the transition from

T I = QR

to
T = RQ + I
without explicitly forming the matrix T I. The essence of (8.1) is to transform
T to T by orthogonal similarity transformations. It follows from Theorem 7.5 that T
can be determined completely by the first column of Q. From the process of the QR
decomposition of T I by Givens rotations, we know that

Qe1 = G1 e1 ,

where G1 = G(1, 2, 1 ) is a rotation which makes the second entry in the first column
of T I to be zero. The 1 can be computed from

cos 1 sin 1 1
= .
sin 1 cos 1 1 0

Let
B = G1 T GT1 .
Then B has the following form (n = 4),

+ 0
0
B= +
.

0 0

Let Gi = G(i, i+1, i ), i = 2, 3. Then Gi of this form can be used to chase the unwanted
nonzero entry + out of the matrix B as follows:

0 0 0 0
G2 +
G3 0

B 0 0 .
0 + 0 0

In general, if Z = G1 G2 Gn1 , then

Ze1 = G1 e1 = Qe1

and ZT Z T is tridiagonal. Thus, from Theorem 7.5, the tridiagonal matrix ZT Z T


produced by this zero-chasing technique is essentially the same as the tridiagonal matrix
T obtained by the explicit method (8.1). Overall, we obtain

Algorithm 8.2 (Implicit symmetric QR iteration with Wilkinson shift)




d = (T (n 1, n 1) T (n, n))/2 p



= T (n, n) T (n, n 1)2 /(d + sign(d) d2 + T (n, n 1)2 )



x = T (1, 1) ; z = T (2, 1)



for k = 1 : n 1


[c, s] = givens(x, z)

T = Gk T GTk , where Gk = G(k, k + 1, k )



if k < n 1



x = T (k + 1, k); z = T (k + 2, k)



end

end

This algorithm requires about 30n operations and n square roots. Of course, the tridi-
agonal matrix T would be stored in a pair of n-vectors in any practical implementation.

8.2.3 Implicit symmetric QR algorithm


Algorithm 8.2 is a base of the symmetric QR algorithm the standard means for com-
puting the spectral decomposition of any symmetric matrix. By applying Algorithm
8.2, we develop the following algorithm.

Algorithm 8.3 (Implicit symmetric QR algorithm)

(1) Input A (real symmetric matrix).



(2) Tridiagonalization: Compute the tridiagonalization of A by Algorithm 8.1,

T = U0T AU0 .

Set Q = U0 .

(3) Criterion for convergence:

(i) For i = 1, , n 1, let ti+1,i and ti,i+1 be zero if

|ti+1,i | = |ti,i+1 | (|ti,i | + |ti+1,i+1 |)u,

where u is the machine precision.


(ii) Find the largest integer m 0 and the smallest integer l 0 such that

T11 0 0
T = 0 T22 0 ,
0 0 T33

where T11 Rll , T22 R(nlm)(nlm) is an irreducible tridiagonal ma-


trix and T33 Rmm is a diagonal matrix.
(iii) If m = n, then output; otherwise

(4) QR iteration: Apply Algorithm 8.2 to T22 :

T22 = GT22 GT , G = G1 G2 Gnlm1 .

(5) Set Q = Qdiag(Il , G, Im ), then go to (3).

If we only need to compute the eigenvalues, this algorithm requires about 4n3 /3
operations. If we need both the eigenvalues and eigenvectors, it requires about 9n3
operations. It can be shown [46] that the computed eigenvalues i , i = 1, 2, , n,
obtained by Algorithm 8.3, satisfy

QT (A + E)Q = diag(1 , , n ),

where Q Rnn is orthogonal and kEk2 kAk2 u where u is the machine precision.
Using Theorem 8.3, we have

|i i | kAk2 u, i = 1, 2, , n,

where {i } are the eigenvalues of A. The absolute error in each i is small and the
relative error is less than the machine precision u. If Q = [q1 , , qn ] is the matrix
of computed orthonormal eigenvectors, then the accuracy of each qi depends on the
separation of i from the rest of eigenvalues.

8.3 Jacobi method


The Jacobi method is one of the earliest methods for solving the symmetric eigenvalue problem. It was developed by Jacobi in 1846, see [19]. It is well known that a real symmetric matrix can be reduced to a diagonal matrix by orthogonal similarity transformations. The Jacobi method exploits the symmetry of the matrix and chooses suitable rotations to reduce a symmetric matrix to diagonal form. The Jacobi method is usually much slower than the symmetric QR algorithm. However, it remains of interest because it is easy to program and, as has been recognized in recent years, it is inherently parallel.

8.3.1 Basic idea


Let $A = [a_{ij}] \in \mathbb{R}^{n\times n}$ be a symmetric matrix and
$$\mathrm{off}(A) \equiv \left(\|A\|_F^2 - \sum_{i=1}^{n}a_{ii}^2\right)^{1/2} = \left(\sum_{i=1}^{n}\sum_{\substack{j=1\\ j\ne i}}^{n}a_{ij}^2\right)^{1/2}.$$
The idea of the Jacobi method is to systematically reduce $\mathrm{off}(A)$ to zero. The basic tools for doing this are the Jacobi rotations defined as follows,
$$J(p, q, \theta) = I + \sin\theta\,(e_pe_q^T - e_qe_p^T) + (\cos\theta - 1)(e_pe_p^T + e_qe_q^T),$$

where p < q and ek denotes the k-th unit vector. Note that Jacobi rotations are no
different from Givens rotations, see Section 4.2.2. We change the name in this section
to honour the inventor. The basic step in a Jacobi procedure involves:

(1) Choose $p$ and $q$ for a rotation with $1 \le p < q \le n$.

(2) Compute a rotation angle $\theta$ such that
$$\begin{pmatrix}b_{pp} & b_{pq}\\ b_{qp} & b_{qq}\end{pmatrix} = \begin{pmatrix}c & s\\ -s & c\end{pmatrix}^T\begin{pmatrix}a_{pp} & a_{pq}\\ a_{qp} & a_{qq}\end{pmatrix}\begin{pmatrix}c & s\\ -s & c\end{pmatrix} \qquad (8.2)$$
is diagonal, i.e., $b_{pq} = b_{qp} = 0$, where $c = \cos\theta$ and $s = \sin\theta$.

(3) Overwrite $A$ with $B = [b_{ij}] = J^TAJ$, where $J = J(p, q, \theta)$.



Note that the matrix $B$ agrees with the matrix $A$ except in the $p$-th row (column) and the $q$-th row (column). The relations are:
$$\begin{aligned}
b_{ip} &= b_{pi} = c\,a_{ip} - s\,a_{iq}, \qquad i \ne p, q,\\
b_{iq} &= b_{qi} = s\,a_{ip} + c\,a_{iq}, \qquad i \ne p, q,\\
b_{pp} &= c^2a_{pp} - 2sc\,a_{pq} + s^2a_{qq},\\
b_{qq} &= s^2a_{pp} + 2sc\,a_{pq} + c^2a_{qq},\\
b_{pq} &= b_{qp} = (c^2 - s^2)a_{pq} + sc\,(a_{pp} - a_{qq}).
\end{aligned}$$


Let us first consider the actual computation of $s = \sin\theta$ and $c = \cos\theta$ such that $b_{pq} = b_{qp} = 0$ in (8.2). This is equivalent to
$$a_{pq}(c^2 - s^2) + (a_{pp} - a_{qq})cs = 0. \qquad (8.3)$$
If $a_{pq} = 0$, then we just set $c = 1$ and $s = 0$. Otherwise define
$$\tau = \frac{a_{qq} - a_{pp}}{2a_{pq}}, \qquad t = \tan\theta = \frac{s}{c},$$
and conclude from (8.3) that $t$ solves the quadratic equation
$$t^2 + 2\tau t - 1 = 0.$$
Then,
$$t = -\tau \pm \sqrt{1 + \tau^2}.$$
We select $t$ to be the root of smaller absolute value, which ensures that $|\theta| \le \pi/4$ and has the effect of minimizing $\|B - A\|_F^2$, because
$$\|B - A\|_F^2 = 4(1 - c)\sum_{\substack{i=1\\ i\ne p,q}}^{n}(a_{ip}^2 + a_{iq}^2) + \frac{2a_{pq}^2}{c^2}.$$
After $t$ is determined, we can obtain $c$ and $s$ from the formulas
$$c = \frac{1}{\sqrt{1 + t^2}}, \qquad s = tc.$$
We summarize the computation of Jacobi rotation J(p, q, ) as follows. Given a
symmetric matrix A Rnn and indices p, q with 1 p < q n, the following
algorithm computes a cosine-sine pair such that bpq = bqp = 0 where bjk is the (j, k)-th
entry of the matrix B = J(p, q, )T AJ(p, q, ).

Algorithm 8.4


function : [c, s] = sym(A, p, q)



if A(p, q) 6= 0



= (A(q, q) A(p, p))/(2A(p, q))



if 0



t = 1/( + 1 + 2)



else


t = 1/( + 1 + 2 )

end



c = 1/ 1 + t2



s = tc



else



c=1



s=0


end

Once J(p, q, ) is determined, then the updated

A J(p, q, )T AJ(p, q, )

can be computed in 6n operations.


How can we choose the integers p and q? Since the Frobenius norm is preserved by
the orthogonal transformation, we have kBkF = kAkF . Note that

a2pp + a2qq + 2a2pq = b2pp + b2qq + 2b2pq = b2pp + b2qq ,

and then
P
n
off(B)2 = kBk2F b2ii
i=1

P
n
(8.4)
= kAk2F a2ii + (a2pp + a2qq b2pp b2qq )
i=1

= off(A)2 2a2pq .

Our goal is to minimize off(B), the best choice of p, q should be

|apq | = max |aij |.


1i<jn

This is the basic idea of the classical Jacobi method.


140 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

8.3.2 Classical Jacobi method


Given a symmetric matrix A Rnn and a tolerance > 0, the following algorithm
overwrites A with U T AU , where U is orthogonal and off(U T AU ) kAkF .

Algorithm 8.5 (Classical Jacobi method)




U = In ; eps = kAkF

while off(A) > eps





choose p, q such that |apq | = max |aij |
i6=j

[c, s] = sym(A, p, q)



A = J(p, q, )T AJ(p, q, )



U = U J(p, q, )

end

Since |apq | is the largest off-diagonal entry, we have

off(A)2 N (a2pq + a2qp ),

where N = n(n 1)/2. It follows from (8.4) that



2 1
off(B) 1 off(A)2 .
N
If Ak denotes the matrix A after k-th Jacobi iteration and A0 = A, we have by induc-
tion,

2 1 k
off(Ak ) 1 off(A0 )2 .
N
This implies that the classical Jacobi method converges linearly. However, actually,
the asymptotic convergence rate of the Jacobi method is quadratic. We can prove that
for k large enough, there is a constant c such that

off(Ak+N ) c off(Ak )2 ,

see [19] and references therein. Therefore, the off-diagonal norm will approach to
zero at a quadratic rate after a sufficient number of iterations.
Another advantage of the Jacobi method is easy to compute the eigenvectors. If
the iteration stops after the k-th rotation, we then have

Ak = JkT Jk1
T
J1T AJ1 J2 Jk .

Denote
Qk = J1 J2 Jk .
8.3. JACOBI METHOD 141

Thus
AQk = Qk Ak .
Since off-diagonal entries of Ak are tiny and then diagonal entries of Ak are good
approximations to the eigenvalues of A, the identity above shows that the columns of Qk
are good approximations to the eigenvectors of A and all the approximate eigenvectors
are orthonormal. We can obtain Qk , the approximate eigenvectors, during Jacobi
iterative process.

8.3.3 Parallel Jacobi method


Jacobi method to the symmetric eigenvalue problem is inherently parallelism. To illus-
trate this, let A R88 be symmetric. If one has a parallel computer with 4 processors,
then one can group the 28 subproblems into 7 groups of rotations as follows:

group (1) : (1, 2), (3, 4), (5, 6), (7, 8);
group (2) : (1, 3), (2, 4), (5, 7), (6, 8);
group (3) : (1, 4), (2, 3), (5, 8), (6, 7);
group (4) : (1, 5), (2, 6), (3, 7), (4, 8);
group (5) : (1, 6), (2, 5), (3, 8), (4, 7);
group (6) : (1, 7), (2, 8), (3, 5), (4, 6);
group (7) : (1, 8), (2, 7), (3, 6), (4, 5).

Note that all 4 rotations within each group are nonconflicting. For instance, the
subproblems J(2i 1, 2i, i ), i = 1, 2, 3, 4, in the first group can be carried out in
parallel. When we compute J(1, 2, 1 )T AJ(1, 2, 1 ), it has no effect on the rotations
(3, 4), (5, 6) and (7, 8). Then the computation of

A = AJ(1, 2, 1 ), A = AJ(3, 4, 2 ),

A = AJ(5, 6, 3 ), A = AJ(7, 8, 4 ),

can be executed in parallel by 4 processors. Similarly, the computation of

A = J(1, 2, 1 )T A, A = J(3, 4, 2 )T A,

A = J(5, 6, 3 )T A, A = J(7, 8, 4 )T A,

can also be carried out in parallel by 4 processors. For the example above, it only needs
1/4 computing time of a computer with a single processor. A parallel Jacobi algorithm
can be found in [19].
142 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

8.4 Bisection method


In this section we present the bisection method for symmetric tridiagonal eigenvalue
problems. Combining the bisection method with the tridiagonal skill, we can obtain
a numerical method for a specified eigenvalue and its corresponding eigenvector of a
symmetric matrix. For a given tridiagonal matrix

a1 b1 0
..
b1 a2 .
T = . .
Rnn ,
(8.5)
. . . . bn1
0 bn1 an
we consider the computation of eigenvalues of T . Without loss of generality, we assume
that bi 6= 0, i = 1, 2, , n 1, i.e., T is an irreducible symmetric tridiagonal matrix.
Otherwise, T can be divided into several smaller irreducible symmetric tridiagonal
matrices.
Let pi () be the characteristic polynomial of the i-by-i leading principal submatrix
Ti of T , i = 1, 2, , n. Then these polynomials satisfy a three-term recurrence:
p0 () 1, p1 () = a1 ,
(8.6)
pi () = (ai )pi1 () b2i1 pi2 (), i = 2, 3, , n.

Since T is symmetric, the roots of polynomial pi () (i = 1, 2 , n) are real. The


following interlacing property is very important.

Theorem 8.7 (Sturm Sequence Property) Let the symmetric tridiagonal matrix
T in (8.5) be irreducible. Then the eigenvalues of Ti1 strictly separate the eigenvalues
of Ti :

i (Ti ) < i1 (Ti1 ) < i1 (Ti ) < < 2 (Ti ) < 1 (Ti1 ) < 1 (Ti ).

Moreover, if sn () denotes the number of sign changes in the sequence

{p0 (), p1 (), , pn ()}

then sn () is equal to the number of eigenvalues of T that are less than , where pi ()
are defined by (8.6). If pi () = 0, then pi1 ()pi+1 () < 0.

Proof: It follows from Theorem 8.4 that the eigenvalues of Ti1 weakly separate
those of Ti . Next we will show that the separation must be strict. Assume that
pi () = pi1 () = 0 for some i and . Since T is irreducible and, we note that by (8.6),

p0 () = p1 () = = pi () = 0,
8.4. BISECTION METHOD 143

which is a contradiction with p0 () 1. Thus we have a strict separation. The assertion


about sn () is developed in [46].

The bisection method for computing a specified eigenvalue of T can be stated as


follows:

Algorithm 8.6 (Bisection algorithm) Let 1 < 2 < < n be the eigenvalues
of T , i.e., the roots of pn (), and be a tolerance. Suppose that the desired eigenvalue
is m for a given m n. Then

(1) Find an interval [l0 , u0 ] including m . Since |i | (T ) kT k , we can take


l0 = kT k and u0 = kT k .

l0 + u0
(2) Compute r1 = and sn (r1 ).
2

(3) If sn (r1 ) m, then m [l0 , r1 ], set l1 = l0 and u1 = r1 ; otherwise m [r1 , u0 ],


set l1 = r1 and u1 = u0 .

l1 + u1
(4) If |l1 u1 | < , take r2 = as an approximate value of m . Otherwise go
2
to (2).

From the algorithm above, we can see that the main operation cost is to compute
sn (). However in practice, sn () cannot be obtained through computing the value of
pi () because it is difficult to evaluate polynomials of high order. In order to avoid
such a problem, we define

pi ()
qi () = , i = 1, 2, , n.
pi1 ()

From (8.6), we have

b2i1
q1 () = p1 () = a1 , qi () = ai , i = 2, 3, , n.
qi1 ()

It is easy to check that sn () is exactly the number of negative values in the sequence
of q1 (), , qn (). The following is a practical algorithm for computing sn ().
144 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

Algorithm 8.7 (Compute sign changes)



x = [a1 , a2 , , an ]



y = [0, b1 , , bn1 ]



s = 0; q = x(1)



for k = 1 : n



if q < 0



s=s+1


end

if k < n



if q = 0



q = |y(k + 1)|u



end



q = x(k + 1) y(k + 1)2 /q



end


end

where u is the machine precision.


When qi () = 0, we treat qi as a positive number. Therefore, we can use a small
positive number |bi |u instead of qi () in Algorithm 8.7. If we store b2i in advance, then
Algorithm 8.7 needs 3n operations. If an eigenvalue is computed by using the bisection
method m times on average, then the operation cost is 3nm. Thus, it is efficient to
compute the eigenvalues of any symmetric tridiagonal matrix by the bisection method.
On the other hand, the rounding error analysis shows that the bisection method is
numerically stable, see [46].

8.5 Divide-and-conquer method

A divide-and-conquer method is a numerical method developed by Dongarra and


Sorensen in 1987 for computing all the eigenvalues and eigenvectors of symmetric tridi-
agonal matrices, see [19]. The basic idea is to tear the original symmetric tridiagonal
matrix into 2k symmetric tridiagonal matrices with smaller sizes and then compute
the spectral decomposition of each smaller symmetric tridiagonal matrix. Once we ob-
tained these smaller spectral decompositions, we then combine them together to form a
spectral decomposition of the original matrix. Thus, this method is suitable for parallel
computing.
8.5. DIVIDE-AND-CONQUER METHOD 145

8.5.1 Tearing
Let T Rnn be given as follows,

a1 b1 0
..
b1 a2 .
T =
.. ..
.

. . bn1
0 bn1 an
Without loss of generality, assume n = 2m. Let

v = (0, , 0, 1, , 0, , 0)T Rn .
| {z } | {z }
m1 m1

Consider the matrix


T = T vv T
where , R are needed to be determined. It is easy to see that T is identical to T
except its 4 middle entries:

am bm
.
bm am+1 2
If we set = bm , then
T1 0
T = + vv T ,
0 T2
where
a1 b1 0
b1 a2 b2

.. .. ..
T1 = . . .

bm2 am1 bm1
0 bm1 am
and
am+1 bm+1 0
bm+1 am+2 bm+2

.. .. ..
T2 = . . .

bn2 an1 bn1
0 bn1 an
with
am = am , am+1 = am+1 2 .
Therefore, T is divided into a sum of a partitioned matrix and a rank-one matrix. If
we divide T1 and T2 repeatedly, then finally we can divide T into 2k blocks.
146 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

8.5.2 Combining
Once we obtained the spectral decompositions of T1 and T2 :

QT1 T1 Q1 = D1 , QT2 T2 Q2 = D2 ,

where Q1 , Q2 Rmm are orthogonal matrices, and D1 , D2 Rmm are diagonal


matrices, our aim now is to compute the spectral decomposition of T from that of T1
and T2 , i.e., to find an orthogonal matrix V such that

V T T V = diag(1 , , n ).

Let
Q1 0
U= ,
0 Q2
then
T
Q1 0 T1 0 Q1 0
UT TU = + vv T
0 Q2 0 T2 0 Q2

= D + zz T ,

where
D = diag(D1 , D2 ), z = U T v.
Now the problem of finding the spectral decomposition of T is reduced to the problem
of computing the spectral decomposition of D +zz T . We will consider how to compute
the spectral decomposition of D + zz T quickly and stably.

Lemma 8.1 Let D = diag(d1 , , dn ) Rnn with d1 > d2 > > dn . Assume that
0 6= R and z = (z1 , z2 , , zn )T Rn with zi 6= 0 for all i. Let u Rn and R
satisfy
(D + zz T )u = u, u 6= 0.
Then z T u 6= 0 and D I is nonsingular.

Proof: If z T u = 0, then Du = u with u 6= 0, i.e., is the eigenvalue of D and u is


the eigenvector associated with . Since D is a diagonal matrix with distinct entries,
there must exist some i such that di = and u = ei with 6= 0, where ei is the i-th
unit vector. Thus
0 = z T u = z T ei = zi ,
which implies zi = 0, a contradiction. Therefore, z T u 6= 0.
8.5. DIVIDE-AND-CONQUER METHOD 147

On the other hand, if D I is singular, then there exists some i such that eTi (D
I) = 0, and then
0 = eTi (D I)u = z T ueTi z.
Since z T u 6= 0, we have eTi z = zi = 0, a contradiction. Thus, D I is nonsingular.

Theorem 8.8 Let D = diag(d1 , , dn ) Rnn with d1 > d2 > > dn . Assume
that 0 6= R and z = (z1 , z2 , , zn )T Rn with zi 6= 0 for all i. Suppose that the
spectral decomposition of D + zz T is

V T (D + zz T )V = diag(1 , , n ),

where V = [v1 , , vn ] is an orthogonal matrix and 1 n . Then

(i) 1 , , n are n roots of the function

f () = 1 + z T (D I)1 z.

(ii) If > 0, then


1 > d 1 > 2 > > n > d n ;
if < 0, then
d1 > 1 > d 2 > > d n > n .

(iii) There exists a constant i 6= 0 such that

vi = i (D i I)1 z, i = 1, 2, , n.

Proof: From the assumption, we have

(D + zz T )vi = i vi , kvi k2 = 1.

It follows from Lemma 8.1 that D i I is nonsingular. Thus,

vi = z T vi (D i I)1 z, i = 1, 2, , n, (8.7)

thereby establishing (iii). Note that D + zz T has distinct eigenvalues. Otherwise, if


i = j , then vi and vj are linearly dependent, which contradicts with the orthogonality
between vi and vj .
By multiplying z T to the both sides of (8.7) and noting that z T vi 6= 0, we have

1 = z T (D i I)1 z,
148 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

i.e.,
f (i ) = 0, i = 1, 2, , n.
Thus, i , i = 1, 2, , n, are the roots of f (). Next we prove that f () has exactly n
zeros. Note that
z12 zn2
f () = 1 + + + ,
d1 dn
and moreover,
0 z12 zn2
f () = + + .
(d1 )2 (dn )2
Thus, f () is strictly monotone between the poles di and di+1 . If > 0, f () is strictly
increasing; if < 0, f () is strictly decreasing. Therefore, it is easy to see that f ()
has exactly n roots, one of each in the intervals

(dn , dn1 ), , (d2 , d1 ), (d1 , ),

if > 0; and one of each in the intervals

(, dn ), (dn , dn1 ), , (d2 , d1 ),

if < 0. Thus, (i) and (ii) are established.

By Theorem 8.8, we can compute the spectral decomposition of D +zz T efficiently


in the following two steps:

(1) Find roots 1 , 2 , , n of f (). There is a unique root of f () in each of the


intervals (di+1 , di ) and f () is strictly monotone in the interval. Thus, this step
can be implemented quickly and stably by using a Newton-like method [14].

(2) Compute
(D i I)1 z
vi = , i = 1, 2, , n.
k(D i I)1 zk2

The spectral decomposition of a general D + zz T can be turned into the case as


in Theorem 8.8. To this end, we can prove the following theorem constructively.

Theorem 8.9 Let D = diag(d1 , , dn ) Rnn and z Rn . Then there exists an


orthogonal matrix V and a permutation of {1, 2, , n} such that

(i) V T z = (1 , , r , 0, , 0)T where i 6= 0, for i = 1, 2, , r.

(ii) V T DV = diag(d(1) , , d(n) ) where d(1) > d(2) > > d(r) .
8.5. DIVIDE-AND-CONQUER METHOD 149

Proof: Suppose that two indices i < j satisfy di = dj . Then we can set a rotation
Pij = G(i, j, ) such that the j-th component of Pij z is zero. It is easy to show that
PijT DPij = D. After several steps, we can find an orthogonal matrix V1 which is a
product of some rotations such that

V1T DV1 = D, V1T z = (1 , , n )T

with the property that if i j 6= 0 (i 6= j), then di 6= dj .


If i = 0, j 6= 0 for i < j, then we can find a permutation matrix to interchange the
columns i and j. In such a way, we can find a permutation matrix P1 which permutes
all the nonzero j to the front, i.e.,

P1T V1T z = (1 (1) , , 1 (n) )T

with
1 (i) 6= 0, i = 1, 2, , r,
and
1 (i) = 0, i = r + 1, , n.
Here 1 is a permutation of {1, 2, , n}. It follows from the construction of P1 that

P1T V1T DV1 P1 = P1T DP1 = diag(d1 (1) , , d1 (n) ),

where the first r diagonal entries d1 (1) , , d1 (r) are distinct.


Finally, we can find another permutation matrix P2 of order r such that

P2T diag(d1 (1) , , d1 (r) )P2 = diag(1 , , r )

where 1 > 2 > > r . Let

V = V1 P1 diag(P2 , Inr )

and be a permutation of {1, 2, , n} determined by P1 and P2 . Then,

V T z = ((1) , , (r) , 0, , 0)T = (1 , , r , 0, , 0)T ,

where i 6= 0 for i = 1, 2, , r, and

V T DV = diag(d(1) , , d(n) )

with
d(1) > d(2) > > d(r) .
The proof is complete.
150 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

For any D = diag(d1 , , dn ) Rnn and z Rn , by Theorem 8.9, we can construct


an orthogonal matrix V such that

T T D1 + T 0
V (D + zz )V = ,
0 D2

where
D1 = diag(d(1) , , d(r) ) Rrr , d(1) > > d(r) ;

D2 = diag(d(r+1) , , d(n) ) R(nr)(nr) ;


and
= (1 , , r )T , i 6= 0, i = 1, 2, , r.
Then we only need to compute the spectral decomposition of D1 + T instead of
D + zz T .
Finally, we briefly introduce the parallel computation of the divide-and-conquer
method. For simplicity, we use a parallel computer with 4 processors to compute the
spectral decomposition of a 4N -by-4N symmetric tridiagonal matrix T . It can be
summarized by the following 4 steps:

(1) Tear
T1 0
T = + vv T , T1 R2N 2N , v R4N ;
0 T2
and
Ti1 0
Ti = + i i iT ,
0 Ti2
where Tij RN N and i R2N , for i = 1, 2.

(2) Compute the spectral decompositions of T11 , T12 , T21 and T22 by 4 processors in
parallel.

(3) Combine the spectral decompositions of T11 , T12 to form a spectral decomposition
of T1 , and combine the spectral decompositions of T21 , T22 to form a spectral
decomposition of T2 . These can be implemented by 4 processors at the same
time.

(4) Combine the spectral decompositions of T1 and T2 to from a spectral decompo-


sition of T . This can be derived by 4 processors in parallel.

From discussions above, we know that the divide-and-conquer method can be used
for computing all the eigenvalues and eigenvectors of any large symmetric tridiagonal
matrix in parallel.
8.5. DIVIDE-AND-CONQUER METHOD 151

Exercises:
1. Compute the Schur decomposition of

1 2
A= .
2 3

2. Show that if X Rnr with r n, and kX T X Ik2 = < 1, then

min (X) 1 ,

where min denotes the smallest singular value.

3. Show that A = B + iC is Hermitian if and only if



B C
M=
C B

is symmetric. Relate the eigenvalues and eigenvectors of A to those of M .

4. Relate the singular values and the singular vectors of A = B + iC to those of



B C
,
C B

where B, C Rmn .

5. Use the singular value decomposition to show that if A Rmn with m n, then there
exist a matrix Q Rmn with QT Q = I and a positive semi-definite matrix P Rnn
such that A = QP .

6. Let
I B
A=
B I
with kBk2 < 1. Show that
1 + kBk2
kAk2 kA1 k2 = .
1 kBk2

7. Let
a1 b1 0
c1 a2 b2

.. .. ..
A=
. . . ,

.. ..
. . bn1
0 cn1 an

where bi ci > 0. Then there exists a diagonal D such that D1 AD is a symmetric


tridiagonal matrix.
152 CHAPTER 8. SYMMETRIC EIGENVALUE PROBLEMS

8. Let
2 1
1 2 1
T =

.
1 2 1
1 2

(1) Is T positive definite?


(2) How many eigenvalues of T lie in the interval [0, 2]?

9. Let A, E Rnn be two symmetric matrices. Show that if A is positive definite and
kA1 k2 kEk2 < 1, then A + E is also positive definite.
10. Let A Rmn with m n, and assume that the singular values of A are ordered as

1 2 n .

Show that
kAuk2 kAuk2
i = max min = min max ,
dim(S)=i 06=uS kuk2 dim(S)=ni+1 06=uS kuk2

where S is any subspace of Rn .


11. Let A = [aij ] Rnn be symmetric and satisfy

(1) aii > 0, i = 1, 2, , n,


(2) aij 0, i 6= j,
Pn
(3) ai1 > 0,
i=1
Pn
(4) aij = 0, j = 2, 3, , n.
i=1

Prove that the eigenvalues of A are nonnegative.


12. Let A be symmetric and have bandwidth p. Show that if we perform the shifted QR
iteration A I = QR and A = RQ + I, then A has bandwidth p.
Chapter 9

Applications

In this chapter, we will briefly survey some of the latest developments in using bound-
ary value methods (BVMs) for solving initial value problems of systems of ordinary
differential equations (ODEs). These methods require the solution of one or more
nonsymmetric, large and sparse linear systems. Therefore, we will use the GMRES
method studied in Chapter 6 with some preconditioners for solving these linear sys-
tems. One of the main results is that if an A1 ,2 -stable BVM is used for an n-by-n
system of ODEs, then the preconditioned matrix can be decomposed as I + L where
I is the identity matrix and the rank of L is at most 2n(1 + 2 ). When the GMRES
method is applied to the preconditioned systems, the method will converge in at most
2n(1 + 2 ) + 1 iterations. Applications to different kinds of delay differential equations
(DDEs) are also given. For a literature on BVMs for ODEs and DDEs, we refer to
[3, 4, 5, 8, 10, 23, 24, 25, 26, 29, 30].

9.1 Introduction
Let us begin with the initial value problem:
0
y (t) = Jn y(t) + g(t), t (t0 , T ],
(9.1)

y(t0 ) = z,
where y(t), g(t) : R Rn , z Rn , and Jn Rnn . The initial value methods (IVMs),
such as the Runge-Kutta methods, are well-known methods for solving (9.1), see [40].
Recently, another class of methods called the boundary value methods (BVMs) has
been proposed in [5]. Using BVMs to discretize (9.1), we obtain a linear system
M u = b.
The advantage of using BVMs is that the methods are more stable and the resulting
linear system M u = b is hence more well-conditioned. However, this system is in

153
154 CHAPTER 9. APPLICATIONS

general large and sparse (with band-structure), and solving it is a major problem in
the application of BVMs. The GMRES method studied in Chapter 6 will be used for
solving M u = b. In order to speed up the convergence of the GMRES iterations, a
preconditioner S called the Strang-type block-circulant preconditioner [10] is used to
precondition the discrete system. The advantage of the Strang-type preconditioner is
that if an A1 ,2 -stable BVM is used for solving (9.1), then S is invertible and the
preconditioned matrix can be decomposed as

S 1 M = I + L,

where the rank of L is at most 2n(1 + 2 ) which is independent of the integration step
size. It follows that the GMRES method applied to the preconditioned system will
converge in at most 2n(1 + 2 ) + 1 iterations in exact arithmetic.
The outline of this chapter is as follows. In Section 9.2, we will give some background
knowledge about the linear multistep formulas (LMFs) and BVMs. Then, we will
investigate the properties of the Strang-type block-circulant preconditioner for ODEs
in Section 9.3. The convergence and cost analysis of the method will also be given
with a numerical example. Finally, we discuss the applications of the Strang-type
preconditioner with BVMs for solving different kinds of delay differential equations
(DDEs) in Sections 9.49.6.

9.2 Background of BVMs


We give some background knowledge on LMFs and BVMs in this section.

9.2.1 Linear multistep formulas


Consider an initial value problem
0
y = f (t, y), t (t0 , T ],

y(t0 ) = y0 ,

where y(t) : R R and f (t, y) : R2 R. The -step linear multistep formula (LMF)
over a uniform mesh with step size h is defined as follows:

X
X
j ym+j = h j fm+j , m = 0, 1, , (9.2)
j=0 j=0

where ym is the discrete approximation to y(tm ) and fm denotes f (tm , ym ).


To get the solution of (9.2), we need initial conditions

y0 , y1 , , y1 .
9.2. BACKGROUND OF BVMS 155

Since only y0 is provided from the original problem, we have to find additional condi-
tions for the remaining values
y1 , y2 , , y1 .
The equation (9.2) with 1 additional conditions is called initial value methods
(IVMs). An IVM is called implicit if 6= 0 and explicit if = 0. If an IVM is
applied to an initial value problem on the interval [t0 , tN +1 ], we have the following
discrete problem 1
P
(i yi hi fi )
i=0
..
.

AN y = hBN f + 0 y1 h0 f1

, (9.3)
0

..
.
0
where
y = (y , y+1 , , yN +1 )T , f = (f , f+1 , , fN +1 )T ,


.. . . .. . .
. . . .

.. ..
AN = 0 . , BN = 0 . .

.. .. .. ..
. . . .
0 0
Note that the matrices AN , BN RN N are lower triangular band Toeplitz matrices
with lower bandwidth . We recall that a matrix is said to be Toeplitz if its entries
are constant along its diagonals. Moreover, the linear system (9.3) can be solved
easily by forward recursion. A classical example of IVM is the second order backward
differentiation formula (BDF),

3ym+2 4ym+1 + ym = 2hfm+1 ,

which is a two-step method with 0 = 1, 1 = 4, 2 = 3 and 1 = 2.


Instead of using an IVM with initial conditions for solving (9.1), we can also use
the so-called boundary value methods (BVMs). Given 1 , 2 0 such that 1 + 2 = ,
then the corresponding BVM requires 1 initial additional conditions

y0 , y1 , , y1 1 ,

and 2 final additional conditions

yN , yN +1 , , yN +2 1 ,
156 CHAPTER 9. APPLICATIONS

which are called (1 , 2 )-boundary conditions. Note that the class of BVMs contains
the class of IVMs (i.e., 1 = , 2 = 0).
The discrete problem generated by a -step BVM with (1 , 2 )-boundary conditions
can be written in the following matrix form
P

1 1

(i yi hi fi )
i=0
.
..


0 y1 1 h0 f1 1

0

..
Ay = hBf + . ,

0

yN h fN

..
.

P 2
(1 +i yN 1+i h1 +i fN 1+i )
i=1

where
y = (y1 , y1 +1 , , yN 1 )T , f = (f1 , f1 +1 , , fN 1 )T ,
A and B R(N 1 )(N 1 ) are defined as follows,

1 1
.. .. .. .. .. .. .. ..
. . . . . . . .


A = 0 . . . . . . ..
. , B = 0 .. .. ..
. . . . (9.4)

.. .. .. .. .. ..
. . . . . .
0 1 0 1

Note that the coefficient matrices are band Toeplitz with lower bandwidth 1 and
upper bandwidth 2 . An example of BVMs is the third order generalized backward
differentiation formula (GBDF),

2ym+1 + 3ym 6ym1 + ym2 = 6hfm ,

which is a three-step method with (2, 1)-boundary conditions where

0 = 1, 1 = 6, 2 = 3, 3 = 2, 2 = 6.

Although IVMs are more efficient than BVMs (which cannot be solved by forward
recursion), the advantage in using BVMs over IVMs comes from their stability prop-
erties. For example, the usual BDF are not A-stable for > 2 but the GBDF are
A1 ,2 -stable for any 1, see for instance [1] and [5, p. 79 and Figures 5.15.3].
9.2. BACKGROUND OF BVMS 157

9.2.2 Block-BVMs and their matrix forms

Let = 1 + 2 . By using the -step block-BVM based on LMF over a uniform mesh
h = (T t0 )/s for solving (9.1), we have:

2
X 2
X
i+1 ym+i = h i+1 fm+i , m = 1 , . . . , s 2 . (9.5)
i=1 i=1

Here, ym is the discrete approximation to y(tm ),

fm = Jn ym + gm , gm = g(tm ).

Also, (9.5) requires 1 initial conditions and 2 final conditions which are provided by
the following 1 additional equations:


X
X
(j) (j)
i yi = h i fi , j = 1, . . . , 1 1, (9.6)
i=0 i=0

and

X
X
(j) (j)
i ysi = h i fsi , j = s 2 + 1, . . . , s. (9.7)
i=0 i=0

The coefficients {(j) }, { (j) } in (9.6) and (9.7) should be chosen such that the
truncation errors for these initial and final conditions are of the same order as that in
(9.5). By combining (9.5), (9.6), (9.7) and the initial condition y(t0 ) = y0 = z, the
discrete system of (9.1) is given by the following block form

e In hB
M y (A e Jn )y = e1 z + h(B
e In )g. (9.8)

Here

e1 = (1, 0, , 0)T Rs+1 , y = (y0T , , ysT )T R(s+1)n ,

g = (g0T , , gsT )T R(s+1)n ,


158 CHAPTER 9. APPLICATIONS

e B
A, e R(s+1)(s+1) given by:

1 0
(1)
(1)
0
. .. ..
.. . . 0

(1 1) ( 1)
0 1

0

0
Ae= .. .. ..

,
. . .
.. .. ..
. . .


0
(s +1) (s +1)
0 0 2 2

.. ..
. .
(s) (s)
0

0 0
0
(1)

(1)

.. .. ..
. . .

(1 1) ( 1)
0 1 0

0

0
e=
B .. .. ..

,
. . .
.. .. ..
. . .


0
(s +1) (s +1)
0 0 2 2

.. ..
. .
(s) (s)
0
and is the tensor product.
We recall that the tensor product of A = (aij ) Rmn and B Rpq is defined as
follows:
a11 B a12 B a1n B
a21 B a22 B a2n B

AB .. .. ..
. . .
am1 B am2 B amn B
which is an mp-by-nq matrix. The basic properties of the tensor product can be found
in [19, 22].
9.3. STRANG-TYPE PRECONDITIONER FOR ODES 159

We remark that usually the linear system (9.8) is large and sparse (with band-
structure), and solving it is a major problem in the application of the BVMs. We
will use the GMRES method in Chapter 6 for solving (9.8). In order to speed up the
convergence rate of the GMRES iterations, we will use a preconidtioner S called the
Strang-type block-circulant preconditioner.

9.3 Strang-type preconditioner for ODEs


Now, we construct the Strang-type block-circulant preconditioner S for solving (9.8).
We will show that the main advantages of the Strang-type preconditioner are:

(1) S is invertible if an A1 ,2 -stable BVM is used.

(2) The spectrum of the preconditioned system is clustered.

(3) The operation cost for each iteration of the preconditioned GMRES method is
smaller than that of direct solvers.

9.3.1 Construction of preconditioner

We first recall the definition of Strangs circulant preconditioner for Toeplitz matrices.
Given any Toeplitz matrix
Tl = [tij ]li,j=1 = [tq ],

Strangs preconditioner s(Tl ) is a circulant matrix with diagonals given by




tq , 0 q bl/2c,



[s(Tl )]q = tql , bl/2c < q < l,





[s(Tl )]l+q , 0 < q < l,

see [9, 22, 37].


Neglecting the perturbations of Ae and B,
e we propose the following preconditioner
S R (s+1)n(s+1)n called the Strang-type block circulant preconditioner for M given
in (9.8):
S = s(A) In hs(B) Jn , (9.9)
160 CHAPTER 9. APPLICATIONS

where
1 0 1 1
.. .. .. .. ..
. . . . .

.. ..
0 . . 0

.. .. ..
. . . 0
.. .. ..
s(A) =
. . .

.. .. ..
0 . . .

.. ..
. .

.. .. .. .. ..
. . . . .
1 +1 0 1
and s(B) is defined similarly by using {i }i=0 instead of {i }i=0 in s(A). The {i }i=0
and {i }i=0 here are the coefficients given in (9.5). We remark that actually s(A),
s(B) are just Strangs circulant preconditioners for Toeplitz matrices A, B respectively,
where A, B are given by (9.4).
We will show that the preconditioner S is invertible provided that the given BVM
is A1 ,2 -stable and the eigenvalues of Jn are in

C {q C : Re(q) < 0}

where Re() denotes the real part of a complex number. The stability of a BVM is
closely related to two characteristic polynomials of degree = 1 + 2 , defined as
follows:
X2 X2
j+1
(z) j+1 z and (z) j+1 z j+1 . (9.10)
j=1 j=1

The A1 ,2 -stability polynomial is defined by

(z, q) (z) q(z) (9.11)

where z, q C.

Consider now the equation (z, q) = 0. It defines a mapping between the complex
z-plane and the complex q-plane. For every z C which is a root of (z, q), (9.11)
provides
(z)
q = q(z) = .
(z)
Let
(ei )
qC:q= , 0 < 2 . (9.12)
(ei )
9.3. STRANG-TYPE PRECONDITIONER FOR ODES 161

The is the set corresponding to the roots on the unit circumference and is called the
boundary locus. We have the following definition and lemma, see [5].

Definition 9.1 Consider a BVM with an A1 ,2 -stability polynomial (z, q) defined by


(9.10). The region

D1 ,2 = {q C : (z, q) has 1 zeros inside |z| = 1 and 2 zeros outside |z| = 1}

is called the region of A1 ,2 -stability of the given BVM. Moreover, the BVM is said to
be A1 ,2 -stable if
C D1 ,2 .

Lemma 9.1 If a BVM is A1 ,2 -stable and is defined by (9.12), then Re(q) 0 for
all q .

Now, we want to show that the preconditioner S is invertible under the stability
condition.

Theorem 9.1 If the BVM for (9.1) is A1 ,2 -stable and hk (Jn ) D1 ,2 where k (Jn ),
k = 1, , n, are the eigenvalues of Jn , then the preconditioner S defined by (9.9) is
invertible.

Proof: Since s(A) and s(B) are circulant matrices, their eigenvalues are given by
1 1 (z)
gA (z) z 2 + . . . + 1 + 1 1 + . . . + 0 1 = 1
z z z
and
1 1 (z)
gB (z) z 2 + . . . + 1 + 1 1 + . . . + 0 1 = 1 ,
z z z
2ij
evaluated at j = e s+1 where i 1, for j = 0, , s, see [9, 13]. The eigenvalues
jk (S) of S are therefore given by

jk (S) = gA (j ) hk (Jn )gB (j ), j = 0, , s, k = 1, , n.

Since the BVM is A1 ,2 -stable, the -degree polynomial

[z, hk (Jn )] = (z) hk (Jn )(z)

has no roots on the unit circle |z| = 1 if hk (Jn ) D1 ,2 . Thus for all k = 1, , n,
and any arbitrary |z| = 1, we have
1
gA (z) hk (Jn )gB (z) = [z, hk (Jn )] 6= 0.
z 1
162 CHAPTER 9. APPLICATIONS

It follows that
jk (S) 6= 0, j = 0, , s, k = 1, , n.
Thus S is invertible.

In particular, we have

Corollary 9.1 If the BVM is A1 ,2 -stable and k (Jn ) C , then the preconditioner
S is invertible.

9.3.2 Convergence rate and operation cost


We have the following theorem for the convergence rate.

Theorem 9.2 We have


S 1 M = In(s+1) + L
where
rank(L) 2n.

Proof: Let E = M S. We have by (9.8) and (9.9),



E= A e s(A) In h Be s(B) Jn = LA In hLB Jn .

It is easy to check that LA and LB are (s + 1)-by-(s + 1) matrices with nonzero entries
only in the following four corners: a 1 -by-( + 1) block in the upper left; a 1 -by-1
block in the upper right; a 2 -by-( + 1) block in the lower right; and a 2 -by-2 block
in the lower left. By noting that = 1 + 2 , we then have

rank(LA ) , rank(LB ) .

Therefore,
rank(LA In ) = rank(LA ) n n
and
rank(LB Jn ) = rank(LB ) n n.
Thus,
S 1 M = In(s+1) + S 1 E = In(s+1) + L,
where the rank of L is at most 2n.

Therefore, when the GMRES method is applied to

S 1 M y = S 1 b,
9.3. STRANG-TYPE PRECONDITIONER FOR ODES 163

by Theorems 6.13 and 9.2, we know that the method will converge in at most 2n + 1
iterations in exact arithmetic.
Regarding the cost per iteration, the main work in each iteration for the GMRES
method is the matrix-vector multiplication

e In hB
S 1 M z = (s(A) In hs(B) Jn )1 (A e Jn )z,

e B
see Section 6.5. Since A, e are band matrices and Jn is assumed to be sparse, the
matrix-vector multiplication

e In hB
M z = (A e Jn )z

can be done very fast.


Now we compute S 1 (M z). Note that any circulant matrix can be diagonalized by
the Fourier matrix F , see Section 6.4 and [13]. Since s(A) and s(B) are circulant, we
have the following decompositions by (6.12),

s(A) = F A F , s(B) = F B F

where A , B are diagonal matrices containing the eigenvalues of s(A), s(B) respec-
tively. It follows that

S 1 (M z) = (F In )(A In hB Jn )1 (F In )(M z).

This product can be obtained by using FFTs and solving s + 1 linear systems of order
n. Since Jn is sparse, the matrix

A In hB Jn

will also be sparse. Thus S 1 (M z) can be obtained by solving s + 1 sparse linear


systems of order n. It follows that the total number of operations per iteration is

1 n(s + 1) log(s + 1) + 2 (s + 1)nq,

where q is the number of nonzeros of Jn , and 1 and 2 are some positive constants.
For comparing the computational cost of the method with direct solvers for the linear
system (9.8), we refer to [10].

9.3.3 Numerical result


Now we give an example to illustrate the efficiency of the preconditioner S by solving
a test problem given in [4]. The experiments were performed in MATLAB. We used
the MATLAB-provided M-file gmres to solve the preconditioned systems. We should
164 CHAPTER 9. APPLICATIONS

emphasize that in all of our tests in this chapter, the zero vector is the initial guess
and the stopping criterion is
krq k2
< 106 ,
kr0 k2
where rq is the residual after q iterations. The BVM we used is the third order general-
ized Adams method (GAM). Its formula and the initial and final additional conditions
can be found in [5].
Example 9.1. Heat equation:


u 2u

= ,

t x2


u

u(0, t) = (, t) = 0, t [0, 2],

x




u(x, 0) = x, x [0, ].

We discretize the partial differential operator 2 /x2 with central differences and step
size equals to /(n + 1). The system of ODEs obtained is:
0
y (t) = Tn y(t), t [0, 2]

y(0) = (x1 , x2 , , xn )T ,
where Tn is a scaled discrete Laplacian matrix

2 1

1 ... ...
(n + 1)2


.
Tn = .. .. ..
2 . . .

1 2 1
1 1
Table 9.1 lists the number of iterations required for convergence of the GMRES
method for different n and s. In the table, I means no preconditioner is used and S
denotes the Strang-type block-circulant preconditioner defined by (9.9). We see that
the number of iterations required for convergence, when S is used, is much less than
that when no preconditioner is used. The numbers under the column S stay almost a
constant for increasing s and n.

9.4 Strang-type preconditioner for DDEs


Now, we study delay differential equations (DDEs).
9.4. STRANG-TYPE PRECONDITIONER FOR DDES 165

Table 9.1: Number of iterations for convergence.

n s I S n s I S
24 6 19 4 48 6 47 4
12 70 4 12 167 4
24 152 4 24 359 4
48 227 3 48 >400 3
96 314 3 96 >400 3

9.4.1 Differential equations with multi-delays


Consider the solution of differential equations with multi-delays:
(1) (s)
y0 (t) = Jn y(t) + Dn y(t 1 ) + + Dn y(t s ) + f (t), t t0 ,
(9.13)

y(t) = (t), t t0 ,
(1) (s)
where y(t), f (t), (t) : R Rn ; Jn , Dn , , Dn Rnn ; and 1 , , s > 0 are some
rational numbers.
In order to find a reasonable numerical solution, we require that the solution of
(9.13) is asymptotically stable. We have the following lemma, see [31, 43].

Lemma 9.2 For any s 1, if


1
(Jn ) max (Jn + JnT ) < 0
2
and
s
X
(Jn ) + kDn(j) k2 < 0, (9.14)
j=1

then the solution of (9.13) is asymptotically stable.

In the following, for simplicity, we only consider the case of s = 2 in (9.13). The
generalization to any arbitrary s is straightforward. Let

h = 1 /m1 = 2 /m2

be the step size where m1 and m2 are positive integers with m2 > m1 (2 > 1 ). For
(9.13), by using a BVM with (1 , 2 )-boundary conditions over a uniform mesh

tj = t0 + jh, j = 0, , r1 ,
166 CHAPTER 9. APPLICATIONS

on the interval [t0 , t0 + r1 h], we have


X
X
i yp+i1 = h i (Jn yp+i1 + Dn(1) yp+i1 m1 + Dn(2) yp+i1 m2 + fp+i1 ),
i=0 i=0
(9.15)
for p = 1 , , r1 1, where = 1 + 2 . By providing the values

ym2 , , ym1 , , y0 , y1 , , y1 1 , yr1 , , yr1 +2 1 , (9.16)

(9.15) can be written in a matrix form as

Ry = b

where
R A In hB Jn hC (1) Dn(1) hC (2) Dn(2) , (9.17)

y = (yT1 , yT1 +1 , , yrT1 1 )T Rn(r1 1 ) ,

b Rn(r1 1 ) depends on f , the boundary values and the coefficients of the method.
The matrices A, B R(r1 1 )(r1 1 ) are defined as in (9.4) and C (1) , C (2) R(r1 1 )(r1 1 )
are defined as follows:

0 0

. . . . . .

.. . . . . .. . . . .
. . . . . .
(1)
C = , C =(2) .
. .
0 . . 0 . .

.. .. .. .. .. ..
. . . . . .
0 0 0 0

We remark that the first column of C (1) is given by

(0, , 0, , , 0 , 0, , 0 )T
| {z } | {z }
m1 2 r1 m1 21 1

and the first column of C (2) is given by

(0, , 0, , , 0 , 0, , 0 )T .
| {z } | {z }
m2 2 r1 m2 21 1
9.4. STRANG-TYPE PRECONDITIONER FOR DDES 167

9.4.2 Construction of preconditioner


The Strang-type block-circulant preconditioner for (9.17) is defined as follows:

Se s(A) In hs(B) Jn hs(C (1) ) Dn(1) hs(C (2) ) Dn(2) (9.18)

where s(E) is Strangs circulant preconditioner of Toeplitz matrix E, for E = A, B,


C (1) , C (2) respectively.
Now we discuss the invertibility of the Strang-type preconditioner Se defined by
(9.18). Since any circulant matrix can be diagonalized by the Fourier matrix F , we
have by (6.12),
s(E) = F E F,
where E is the diagonal matrix holding the eigenvalues of s(E), for E = A, B, C (1) ,
C (2) respectively. Therefore, we obtain

Se = (F In )(A In hB Jn hC (1) Dn(1) hC (2) Dn(2) )(F In ).

Note that the j-th block of

A In hB Jn hC (1) Dn(1) hC (2) Dn(2)

is given by

Sej = [A ]jj In h[B ]jj Jn h[C (1) ]jj Dn(1) h[C (2) ]jj Dn(2) ,

for j = 1, 2, , r1 1 . Let
2ij
wj = e r1 1 .
We have
[A ]jj = (wj )/wj1 , [B ]jj = (wj )/wj1 ,

[C (1) ]jj = wjm1 +2 + + 0 wjm1 1 = (wj )/wjm1 +1 ,


and
[C (2) ]jj = wjm2 +2 + + 0 wjm2 1 = (wj )/wjm2 +1 ,
where (z) and (z) are defined as in (9.10). Therefore,

1 h i
Sej = wjm2 (wj )In h(wj )Jn hwjm1 (wj )Dn(1) h(wj )Dn(2) .
wjm2 +1

We therefore only need to prove that Sej , j = 1, 2, , r1 1 , are invertible. We have


the following theorem.
168 CHAPTER 9. APPLICATIONS

Theorem 9.3 If the BVM with (1 , 2 )-boundary conditions is A1 ,2 -stable and (9.14)
holds, then for any arbitrary R, the matrix

eim2 (ei )In h(ei )Jn heim1 (ei )Dn(1) h(ei )Dn(2)

is invertible. It follows that the Strang-type preconditioner Se defined by (9.18) is also


invertible.

Proof: Suppose that there exist x Cn with kxk2 = 1 and R such that
h i
eim2 (ei )In h(ei )Jn heim1 (ei )Dn(1) h(ei )Dn(2) x = 0.

Then
h i
x (ei )In h(ei )Jn heim1 (ei )Dn(1) heim2 (ei )Dn(2) x = 0,

i.e.,

(ei ) h(ei )x Jn x heim1 (ei )x Dn(1) x heim2 (ei )x Dn(2) x = 0.

We therefore have
(1) (2)
(ei ) (hx Jn x + heim1 x Dn x + heim2 x Dn x)(ei )

(1) (2)
= ei , h(x Jn x + eim1 x Dn x + eim2 x Dn x) = 0

where (z, q) is given by (9.11). Thus,

h(x Jn x + eim1 x Dn(1) x + eim2 x Dn(2) x) ,

where is the boundary locus defined by (9.12). Since the BVM is A1 ,2 -stable, from
Lemma 9.1, we know that

Re(x Jn x + eim1 x Dn(1) x + eim2 x Dn(2) x) 0.

By Cauchy-Schwarz inequality, we have

Re(eim1 x Dn(1) x) |x Dn(1) x| kxk2 kDn(1) xk2 kDn(1) k2 kxk2 = kDn(1) k2 ,


(2) (2)
and similarly, Re(eim1 x Dn x) kDn k2 . Note that

(Jn ) = max Re(x Jn x) Re(x Jn x).


kxk2 =1
9.4. STRANG-TYPE PRECONDITIONER FOR DDES 169

Thus we have
(Jn ) + kDn(1) k2 + kDn(2) k2 0,

which is a contradiction to (9.14). Therefore, the matrix



eim2 (ei )In h(ei )Jn heim1 (ei )Dn(1) h(ei )Dn(2)

is invertible and it follows that the Strang-type preconditioner Se is also invertible.

9.4.3 Convergence rate


Now, we discuss the convergence rate of the preconditioned GMRES method with the
Strang-type block-circulant preconditioner. We have the following result for the spectra
of preconditioned matrices, see [23].

Theorem 9.4 Let R be given by (9.17) and Se be given by (9.18). Then we have

Se1 R = In(r1 1 ) + L

where
rank(L) (2 + m1 + m2 + 21 + 2)n.

By Theorems 6.13 and 9.4, when the GMRES method is applied to

Se1 Ry = Se1 b,

the method will converge in at most (2 + m1 + m2 + 21 + 2)n + 1 iterations in exact


arithmetic.
We know from Theorem 9.4 that if the step size h = 1 /m1 = 2 /m2 is fixed,
the number of iterations for convergence of the GMRES method, when applied to the
preconditioned system
Se1 Ry = Se1 b,

will be independent of r1 and therefore is independent of the length of the interval


that we considered. We should emphasize that the numerical example in Section 9.4.4
shows a much faster convergence rate than that predicted by the estimate provided by
Theorem 9.4. For the operation cost of our algorithm, we refer to [23, 30].
170 CHAPTER 9. APPLICATIONS

9.4.4 Numerical result


We illustrate the efficiency of our preconditioner by solving the following example. The
BVM we used is the third order GBDF for t [0, 4].

Example 9.2. Consider


(1) (2)
y0 (t) = Jn y(t) + Dn y(t 0.5) + Dn y(t 1), t 0,

y(t) = (sin t, 1, , 1)T , t 0,
where

10 2
.. .. 2
1
2 . . . ..

.. .. .. 1
1 . . .
,
Jn = 1 . . . , Dn(1) = .. ..
n . . 1
.. .. ..
. . . 2 1 2
1 2 10
and
21
. ..
1 1 .. .
Dn(2) = .
n ..
.
..
. 1
1 2
In practice, we do not have the boundary values
y1 , , y1 1 , yr1 , , yr1 +2 1 ,
provided by (9.16). Instead of giving the above values, as in Section 9.2, 1 1 initial
additional equations and 2 final additional equations are given. We remark that after
introducing the additional equations, the matrices A, B, C (1) and C (2) in (9.17) are
Toeplitz matrices with small rank perturbations. Neglecting the small rank perturba-
tions, we can also construct the Strang-type preconditioner (9.18).
Table 9.2 shows the number of iterations required for convergence of the GMRES
method with different combinations of matrix size n and step size h. In the table, I
means no preconditioner is used and Se denotes the Strang-type block-circulant pre-
conditioner defined by (9.18). We see that the numbers of iterations required for
convergence increase slowly for increasing n and decreasing h under the column S. e

9.5 Strang-type preconditioner for NDDEs


In this section, we study neutral delay differential equations (NDDEs).
9.5. STRANG-TYPE PRECONDITIONER FOR NDDES 171

Table 9.2: Number of iterations for convergence.

n h I Se n h I Se
24 1/10 52 9 48 1/10 53 12
1/20 97 11 1/20 98 14
1/40 185 15 1/40 189 14
1/80 367 19 1/80 378 17

9.5.1 Neutral delay differential equations


We consider the solution of NDDE:
0
y (t) = Ln y0 (t ) + Mn y(t) + Nn y(t ), t t0 ,
(9.19)

y(t) = (t), t t0 ,

where y(t), (t) : R Rn ; Ln , Mn , Nn Rnn , and > 0 is a constant.


As in Section 9.4.1, we want to find an asymptotically stable solution for (9.19).
We have the following lemma, see [18, 28].

Lemma 9.3 Let Ln , Mn and Nn be any matrices with kLn k2 < 1. Then the solution of
(9.19) is asymptotically stable if Re(i ) < 0 where i , i = 1, , n, are the eigenvalues
of matrix
(In Ln )1 (Mn + Nn )
with || 1.

Let h = /k1 be the step size where k1 is a positive integer. For (9.19), by using a
BVM with (1 , 2 )-boundary conditions over a uniform mesh

tj = t0 + jh, j = 0, , r2 ,

on the interval [t0 , t0 + r2 h], we have



X
X
X
i yp+i1 = i Ln yp+i1 k1 + h i (Mn yp+i1 + Nn yp+i1 k1 ), (9.20)
i=0 i=0 i=0

for p = 1 , , r2 1, where = 1 + 2 . By providing the values

yk1 , , y0 , y1 , , y1 1 , yr2 , , yr2 +2 1 , (9.21)


172 CHAPTER 9. APPLICATIONS

(9.20) can be written in a matrix form as

Hy = b

where
H A In A(1) Ln hB Mn hB (1) Nn , (9.22)
y= (yT1 , yT1 +1 , , yrT2 1 )T R n(r2 1 )
,
b Rn(r2 1 ) depends on the boundary values and the coefficients of the method. In
(9.22), the matrices A, B R(r2 1 )(r2 1 ) are defined as in (9.4), and A(1) , B (1)
R(r2 1 )(r2 1 ) are given as follows:

0 0

. . . . . .

.. . . . . .. . . . .
. . . . . .
(1)
A = , B (1)
= ,

0 . . . 0 . . .

.. .. .. .. .. ..
. . . . . .
0 0 0 0

see [3]. We remark that the first column of A(1) is given by:

(0, , 0, , , 0 , 0, , 0 )T
| {z } | {z }
k1 2 r2 k1 21 1

and the first column of B (1) is given by

(0, , 0, , , 0 , 0, , 0 )T .
| {z } | {z }
k1 2 r2 k1 21 1

9.5.2 Construction of preconditioner


The Strang-type block-circulant preconditioner for (9.22) is defined as follows:

S s(A) In s(A(1) ) Ln hs(B) Mn hs(B (1) ) Nn (9.23)

where s(E) is Strangs circulant preconditioner of Toeplitz matrix E, for E = A, B,


A(1) , B (1) respectively. By using (6.12) again, we have

S = (F In ) (A In A(1) Ln hB Mn hB (1) Nn ) (F In ),

where E is the diagonal matrix holding the eigenvalues of s(E), for E = A, B, A(1) ,
B (1) respectively.
9.5. STRANG-TYPE PRECONDITIONER FOR NDDES 173

Now we discuss the invertibility of the Strang-type preconditioner S. Let


2ij
wj = e r2 1 .

We have
[A ]jj = (wj )/wj1 , [B ]jj = (wj )/wj1 ,

[A(1) ]jj = wjk1 +2 + + 0 wjk1 1 = (wj )/wjk1 +1 ,


and
[B (1) ]jj = wjk1 +2 + + 0 wjk1 1 = (wj )/wjk1 +1 ,
where (z) and (z) are defined as in (9.10). Thus the j-th block of

A In A(1) Ln hB Mn hB (1) Nn

in S is given by

Sj = [A ]jj In [A(1) ]jj Ln h[B ]jj Mn h[B (1) ]jj Nn

1 h i
= wjk1 ((wj )In h(wj )Mn ) (wj )Ln h(wj )Nn ,
wjk1 +1

for j = 1, 2, , r2 1 . In order to prove that S is invertible, we only need to show


that Sj , j = 1, 2, , r2 1 , are invertible. Let

eik1 (ei )In h(ei )Mn (ei )Ln h(ei )Nn

= eik1 (In eik1 Ln )D

where
D (ei )In h(In eik1 Ln )1 (Mn + eik1 Nn )(ei ). (9.24)
Hence, we are required to show that is invertible for any R in order to prove
that Sj is invertible. Assume that kLn k2 < 1, we have In eik1 Ln is nonsingular for
any R. Therefore, we only need to show D is invertible for any R. We have
the following theorem, see [3]

Theorem 9.5 If the BVM with (1 , 2 )-boundary conditions is A1 ,2 -stable and Re(i ) <
0 where i , i = 1, , n, are the eigenvalues of matrix

(In Ln )1 (Mn + Nn )

with || 1, then for any R, the matrix D defined by (9.24) is invertible. It follows
that the Strang-type preconditioner S defined as in (9.23) is also invertible.
174 CHAPTER 9. APPLICATIONS

Proof: Let
U (In eim Ln )1 (Mn + eim Nn ).
Then D can be written as
D = (z)In hU (z).
Note that the eigenvalues of D are given by
i (D) = (z) hi (U )(z), i = 1, , n,
where i (U ), i = 1, , n, denote the eigenvalues of U . Since we know that
Re[i (U )] < 0, i = 1, , n,
it follows that hi (U ) C . Note that the BVM is A1 ,2 -stable and then we have
hi (U ) C D1 ,2 .
Therefore, the A1 ,2 -stability polynomial defined by (9.11)
[z, hi (U )] (z) hi (U )(z)
has no roots on the unit circle |z| = 1. Thus, for any |z| = 1, we have
i (D) = (z) hi (U )(z) = [z, hi (U )] 6= 0, i = 1, , n.
It follows that D is invertible. Therefore, the Strang-type preconditioner S defined as
in (9.23) is also invertible.

9.5.3 Convergence rate


We have the following result for the spectra of preconditioned matrices, see [3].

Theorem 9.6 Let H be given by (9.22) and S be given by (9.23). Then we have
S 1 H = In(r2 1 ) + L
where
rank(L) 2( + k1 + 1 + 1)n.
By Theorems 6.13 and 9.6 , when the GMRES method is applied to
S 1 Hy = S 1 b,
the method will converge in at most 2(+k1 +1 +1)n+1 iterations in exact arithmetic.
We observe from Theorem 9.6 that if the step size h = /k1 is fixed, the number of
iterations for convergence of the GMRES method, when applied for solving
S 1 Hy = S 1 b,
is independent of r2 , i.e., the length of the interval that we considered.
9.5. STRANG-TYPE PRECONDITIONER FOR NDDES 175

9.5.4 Numerical result


We illustrate the efficiency of our preconditioner by solving the following example.
Example 9.3. Consider
0
y (t) = Ln y0 (t 1) + Mn y(t) + Nn y(t 1), t 0,

y(t) = (1, 1, , 1)T , t 0,

where

8 2 1
2 1 .. .. ..
. .. 2 . . .

1
1 .. .
, .. .. ..
Ln = .. .. Mn = 1 . . . 1 ,
n . . 1
.. .. ..
1 2 . . . 2
1 2 8

and
1 2
. ..
1 1 . . .
Nn = .
n ..
.
..
. 1
1 2
Example 9.3 is solved by using the fifth order GAM for t [0, 4]. In practice, we
do not have the boundary values

y1 , , y1 1 , yr2 , , yr2 +2 1 ,

provided by (9.21). Again as in Section 9.2, instead of giving the above values, 1 1
initial additional equations and 2 final additional equations are given. After introduc-
ing the additional equations, the matrices A, A(1) , B and B (2) in (9.22) are Toeplitz
matrices with small rank perturbations. We can also construct the Strang-type pre-
conditioner (9.23) by neglecting the small rank perturbations.
Table 9.3 lists the number of iterations required for convergence of the GMRES
method for different n and k1 . In the table, I means no preconditioner is used and S
denotes the Strang-type block-circulant preconditioner defined by (9.23). We see that
the number of iterations required for convergence, when S is used, is much less than
that when no preconditioner is used. We should emphasize that our numerical example
shows a much faster convergence rate than that predicted by the estimate provided by
Theorem 9.6.
176 CHAPTER 9. APPLICATIONS

Table 9.3: Number of iterations for convergence ( means out of memory).

n k1 I S n k1 I S
24 10 43 7 48 10 44 6
20 83 7 20 83 6
40 161 7 40 163 6
80 * 7 80 * 6

9.6 Strang-type preconditioner for SPDDEs


In this section, we study the solution of singular perturbation delay differential equa-
tions (SPDDEs).

9.6.1 Singular perturbation delay differential equations


We consider the solution of SPDDE:
0

x (t) = V (1) x(t) + V (2) x(t ) + C (1) y(t) + C (2) y(t ), t t0 ,






y0 (t) = F (1) x(t) + F (2) x(t ) + G(1) y(t) + G(2) y(t ), t t0 ,
(9.25)



x(t) = (t), t t0 ,





y(t) = (t), t t0 ,

where
x(t), (t) : R Rm ; y(t), (t) : R Rn ;

V (1) , V (2) Rmm ; C (1) , C (2) Rmn ;

F (1) , F (2) Rnm ; G(1) , G(2) Rnn ;

and > 0, 0 < 1 are constants. We can rewrite SPDDE (9.25) as the following
initial value problem:
0

z (t) = P z(t) + Qz(t ), t t0 ,

(9.26)

(t)
z(t) = , t t0 ,
(t)
9.6. STRANG-TYPE PRECONDITIONER FOR SPDDES 177

where P and Q R(m+n)(m+n) are defined as follows,



V (1) C (1) V (2) C (2)
P = , Q= .
1 F (1) 1 G(1) 1 F (2) 1 G(2)

see [24]. Let


h = /k2
be the step size where k2 is a positive integer. For (9.26), by using a BVM with
(1 , 2 )-boundary conditions over a uniform mesh

tj = t0 + jh, j = 0, , v,

on the interval [t0 , t0 + vh], we have



X
X
i zp+i1 = h i (P zp+i1 + Qzp+i1 k2 ), (9.27)
i=0 i=0

for p = 1 , , v 1, where = 1 + 2 . By providing the values

zk2 , , z0 , z1 , , z1 1 , zv , , zv+2 1 , (9.28)

(9.27) can be written in a matrix form as

Kv = b, (9.29)

where
K A Im+n hB P hU Q. (9.30)
The vector v in (9.29) is defined by

v = (zT1 , zT1 +1 , , zTv1 )T R(m+n)(v1 ) .

The right-hand side b R(m+n)(v1 ) of (9.29) depends on the boundary values and
the coefficients of the method. The matrices A, B R(v1 )(v1 ) in (9.30) are defined
as in (9.4) and U R(v1 )(v1 ) in (9.30) is defined as the matrix C (1) in (9.17).

9.6.2 Construction of preconditioner


The Strang-type block-circulant preconditioner can be constructed for solving (9.29):

Sb s(A) Im+n hs(B) P hs(U ) Q, (9.31)

where s(E) is Strangs circulant preconditioner of matrix E, for E = A, B, U respec-


tively. We have the following theorem for the invertibility of our preconditioner Sb and
for the convergence rate of our method, see [24].
178 CHAPTER 9. APPLICATIONS

Theorem 9.7 If the BVM with (1 , 2 )-boundary conditions is A1 ,2 -stable,


1
(P ) max (P + P T ) < 0, (P ) + kQk2 < 0,
2
then the Strang-type preconditioner Sb defined by (9.31) is invertible. Moreover, when
the GMRES method is applied to

Sb1 Kv = Sb1 b,

the method will converge in at most O(m + n) iterations in exact arithmetic.

9.6.3 Numerical result


We test the following SPDDE.
Example 9.4. Consider
0

x (t) = Ax(t) + Bx(t 1) + Cy(t) + Dy(t 1), t 0,






y0 (t) = Ex(t) + F x(t 1) + Gy(t) + Hy(t 1), t 0, = 0.001,



x(t) = (1, 1, . . . , 1)T , t 0,





y(t) = (2, 2, . . . , 2)T , t 0,
where

10 2 1
.. .. .. 2 1
2 . . . .. ..
1 . .
.. .. ..
A= 1 . . . 1 , B=
.. ..

,
. . 1
.. .. ..
. . . 2 1 2 mm
1 2 10 mm

1 2 1


..
.

1 ... ...
C= , D = 3C, E=
.. ..

,
1 . . 1
mn 1 2 1 nm

1 1 2
.. ..
. . 1 ...
F =
..

, G =
.. ..

,
. 1 . .
1 1 nm
1 2 nn
9.6. STRANG-TYPE PRECONDITIONER FOR SPDDES 179

and
5 1
.. ..
2 . .
H=
.. ..

.
. . 1
2 5 nn

Example 9.4 is solved by using the third order GAM for t [0, 4]. Table 9.4 lists
the number of iterations required for convergence of the GMRES method for different
m, n and k2 . In the table, Sb denotes the Strang-type block-circulant preconditioner
defined by (9.31).

Table 9.4: Number of iterations for convergence.

k2 m n I Sb k2 m n I Sb
24 8 2 55 29 48 8 2 100 34
16 4 83 57 16 4 131 63
32 8 109 90 32 8 177 90
180 CHAPTER 9. APPLICATIONS
Bibliography

[1] P. Amodio, F. Mazzia and D. Trigiante, Stability of Some Boundary Value Methods
for the Solution of Initial Value Problems, BIT, vol. 33 (1993), pp. 434451.

[2] O. Axelsson, Iterative Solution Methods, Cambridge University Press, Cambridge,


1996.

[3] Z. Bai, X. Jin and L. Song, Strang-type Preconditioners for Solving Linear Systems
from Neutral Delay Differential Equations, Calcolo, vol. 40 (2003), pp. 2131.

[4] D. Bertaccini, A Circulant Preconditioner for the Systems of LMF-Based ODE


Codes, SIAM J. Sci. Comput., vol. 22 (2000), pp.767786.

[5] L. Brugnano and D. Trigiante, Solving Differential Problems by Multistep Initial


and Boundary Value Methods, Gordon and Berach Science Publishers, Amsterdam,
1998.

[6] Z. Cao, Numerical Linear Algebra (in Chinese), Fudan University Press, Shanghai,
1996.

[7] R. Chan and X. Jin, A Family of Block Preconditioners for Block Systems, SIAM
J. Sci. Statist. Comput., vol. 13 (1992), pp. 12181235.

[8] R. Chan, X. Jin and Y. Tam, Strang-type Preconditioners for Solving System of
ODEs by Boundary Value Methods, Electron. J. Math. Phys. Sci., vol. 1 (2002),
pp. 1446.

[9] R. Chan and M. Ng, Conjugate Gradient Methods for Toeplitz Systems, SIAM
Review, vol. 38 (1996), pp. 427482.

[10] R. Chan, M. Ng and X. Jin, Strang-type Preconditioners for Systems of LMF-Based


ODE Codes, IMA J. Numer. Anal., vol. 21 (2001), pp. 451462.

[11] T. Chan, An Optimal Circulant Preconditioner for Toeplitz Systems, SIAM J. Sci.
Statist. Comput., vol. 9 (1988), pp. 766771.

181
182 BIBLIOGRAPHY

[12] W. Ching, Iterative Methods for Queuing and Manufacturing Systems, Springer-
Verlag, London, 2001.

[13] P. Davis, Circulant Matrices, 2nd edition, AMS Chelsea Publishing, Rhode Island,
1994.

[14] J. Demmel, Applied Numerical Linear Algebra, SIAM Press, Philadelphia, 1997.

[15] H. Diao, Y. Wei and S. Qiao, Displacement Rank of the Drazin Inverse, J. Comput.
Appl. Math., vol. 167 (2004), pp. 147161.

[16] M. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear
Systems, J. Res. Nat. Bur. Stand., vol. 49 (1952), pp. 409436.

[17] R. Horn and C. Johnson, Matrix Analysis, Cambridge University Press, Cam-
bridge, 1985.

[18] G. Hu and T. Mitsui, Stability Analysis of Numerical Methods for Systems of


Neutral Delay-Differential Equations, BIT, vol. 35 (1995), pp. 504515.

[19] G. Golub and C. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins
University Press, Baltimore, 1996.

[20] A. Greenbaum, Iterative Methods for Solving Linear Systems, SIAM Press,
Philadephia, 1997.

[21] M. Gulliksson, X. Jin and Y. Wei, Perturbation Bounds for Constrained and
Weighted Least Squares Problems, Linear Algebra Appl., vol. 349 (2002), pp. 221
232.

[22] X. Jin, Developments and Applications of Block Toeplitz Iterative Solvers, Kluwer
Academic Publishers, Dordrecht; and Science Press, Beijing, 2002.

[23] X. Jin, S. Lei and Y. Wei, Circulant Preconditioners for Solving Differential Equa-
tions with Multi-Delays, Comput. Math. Appl., vol. 47 (2004), pp. 14291436.

[24] X. Jin, S. Lei and Y. Wei, Circulant Preconditioners for Solving Singular Pertur-
bation Delay Differential Equations. Numer. Linear Algebra Appl., vol. 12 (2005),
pp. 327336.

[25] X. Jin, V. Sin and L. Song, Circulant Preconditioned WR-BVM Methods for ODE
systems, J. Comput. Appl. Math., vol. 162 (2004), pp. 201211.

[26] X. Jin, V. Sin and L. Song, Preconditioned WR-LMF-Based Method for ODE
systems, J. Comput. Appl. Math., vol. 162 (2004), pp. 431444.
BIBLIOGRAPHY 183

[27] X. Jin, Y. Wei and W. Xu, A Stability Property of T. Chans Preconditioner, SIAM
J. Matrix Anal. Appl., vol. 25 (2003), pp. 627629.

[28] J. Kuang, J. Xiang and H. Tian The Asymptotic Stability of One Parameter Meth-
ods for Neutral Differential Equations, BIT, vol. 34 (1994), pp. 400408.

[29] S. Lei and X. Jin, BCCB Preconditioners for Systems of BVM-Based Numerical
Integrators, Numer. Linear Algebra Appl., vol. 11 (2004), pp. 2540.

[30] F. Lin, X. Jin and S. Lei, Strang-type Preconditioners for Solving Linear Systems
form Delay Differential Equations, BIT, vol. 43 (2003), pp. 136149.

[31] T. Mori, N. Fukuma and M. Kuwahara, Simple Stability Criteria for Single and
Composite Linear Systems with Time Delays, Int. J. Control, vol. 34 (1981), pp.
11751184.

[32] T. Mori, E. Noldus, and M. Kuwahara, A Way to Stabilize Linear Systems with
Delayed State, Automatica, vol. 19 (1983), pp. 571573.

[33] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company,
Boston, 1996.

[34] Y. Saad and M. Schultz, GMRES: A Generalized Minimal Residual Algorithm for
Solving Nonsymmetric Linear Systems, SIAM J. Sci. Stat. Comput., vol. 7 (1986),
pp. 856869.

[35] G. Stewart and J. Sun, Matrix Perturbation Theory, Academic Press, San Diego,
1990.

[36] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer-Verlag,


New York, 1992.

[37] G. Strang, A Proposal for Toeplitz Matrix Calculations, Stud. Appl. Math., vol.
74 (1986), pp. 171176.

[38] L. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM Press, Philadelphia,
1997.

[39] E. Trytyshnikov, Optimal and Super-Optimal Circulant Preconditioners, SIAM J.


Matrix Anal. Appl., vol. 13 (1992), pp. 459473.

[40] S. Vandewalle and R. Piessens, On Dynamic Iteration Methods for Solving Time-
Periodic Differential Equations, SIAM J. Num. Anal., vol. 30 (1993), pp. 286303.

[41] R. Varga, Matrix Iterative Analysis, 2nd edition, Springer-Verlag, Berlin, 2000.
184 BIBLIOGRAPHY

[42] G. Wang, Y. Wei and S. Qiao, Generalized Inverses: Theory and Computations,
Science Press, Beijing, 2004.

[43] S. Wang, Further Results on Stability of X(t) = AX(t) + BX(t ), Syst. Cont.
Letters, vol. 19 (1992), pp. 165168.

[44] Y. Wei, J. Cai and M. Ng, Computing Moore-Penrose Inverses of Toeplitz Matrices
by Newtons Iteration, Math. Comput. Modelling, vol. 40 (2004), pp. 181191.
(2)
[45] Y. Wei and N. Zhang, Condition Number Related with Generalized Inverse AT,S
and Constrained Linear Systems, J. Comput. Appl. Math., vol. 157 (2003), pp.
5772.

[46] J. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.

[47] S. Xu, L. Gao and P. Zhang, Numerical Linear Algebra (in Chinese), Peking Uni-
versity Press, Beijing, 2000.

[48] N. Zhang and Y. Wei, Solving EP Singular Linear Systems, Int. J. Computer
Mathematics, vol. 81 (2004), pp. 13951405.
Index

backward substitution, 11 Gauss-Seidel, 66


bisection method, 144 Gauss elimination, 4, 14, 41
BVM, 155 GBDF, 158
Givens rotation, 57
Cauchy Interlace Theorem, 132 GMRES method, 83, 155
Cauchy-Schwartz inequality, 29, 170 Gram-Schmidt, 106
characteristic polynomial, 111 growth factor, 45
Cholesky factorization, 20, 52, 99
chopping error, 5, 38 Hermitian positive definite matrix, 7
circulant matrix, 99, 160 Hessenberg decomposition, 122
classical Jacobi method, 142 Hessenberg matrix, 122
condition number, 5, 34 Householder transformation, 57
conjugate gradient method, 87
convergence rate, 73, 175 idempotent, 8
Courant-Fischer Minimax Theorem, 131 ill-conditioned, 5
implicit symmetric QR, 136
DDE, 155 initial additional conditions, 157
defective, 112 inverse power method, 116
determinant, 2 invertibility, 168
diagonally dominant, 22, 69 IVM, 155
diagonal matrix, 2
divide-and-conquer method, 146 Jacobi method, 65, 139
double shift, 127 Jordan Decomposition Theorem, 4, 31

eigenvalue, 3, 28, 76, 114 Krylov subspace method, 83


eigenvector, 3, 29, 76, 114
LDLT , 22
error, 5, 25
least squares problem, 49
factorization, 4, 20 LMF, 156
fast Fourier transform (FFT), 99, 165 LS, 50
final additional conditions, 157 LU factorization, 9, 41, 65, 119
forward substitution, 10
Fourier matrix, 99, 164 matrix norm, 25
MATLAB, 2, 165
GAM, 165 Moore-Penrose inverse, 53

185
186 INDEX

nilpotent, 7 successive overrelaxation, 75


NLA, 1, 9, 111, 131 superlinear, 6
norm, symmetric positive definite matrix, 22
Frobenius norm k kF , 30
k k1 , 2, 27 tensor product, 158
k k2 , 2, 28 Toeplitz matrix, 157
k k , 2, 27 triangular system, 9
nonsingular, 9, 43, 112 tridiagonal matrix, 134
normal matrix, 102 tridiagonalization, 135
nullspace, 50
unitary matrix, 8
ODE, 155 vector norm, 25
orthogonal matrix, 29, 54, 120
well-conditioned, 5
parallel computing, 146 Weyl, Wielandt-Hoffman theorem, 132
PCG method, 98 Wilkinson shift, 137
permutation matrix, 16, 43
perturbation, 5
pivoting, 15, 41
power method, 114
preconditioner,
block-circulant preconditioner, 156
optimal preconditioner, 99
Strangs circulant preconditioner, 161

QR algorithm, 120

range, 50
rank-one matrix, 147
rank-one modification, 12
rounding error, 5, 38

Schur decomposition, 112


shift, 125
similarity transformation, 112
single shift, 125
singular value, 3, 133
singular value decomposition, 133
singular vector, 133
Spectral Decomposition Theorem, 131
stability, 162
Sturm Sequence Property, 144

You might also like