Afternotes On Numerical Analysis

Afternotes on Numeri
al Analysis
Being a series of le tures on elementary numeri al analysis
presented at the University of Maryland at College Park
and re orded after the fa t by
G. W. Stewart
Univerity of Maryland
College Park, MD
Contents
Prefa e ix
Nonlinear Equations 1
Le ture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
By the dawn's early light . . . . . . . . . . . . . . . . . . . . . 3
Interval bise tion . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Relative error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Le ture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Newton's method . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Re ipro als and square roots . . . . . . . . . . . . . . . . . . . 11
Lo al onvergen e analysis . . . . . . . . . . . . . . . . . . . . . 12
Slow death . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Le ture 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A quasi-Newton method . . . . . . . . . . . . . . . . . . . . . . 17
Rates of onvergen e . . . . . . . . . . . . . . . . . . . . . . . . 20
Iterating for a xed point . . . . . . . . . . . . . . . . . . . . . 21
Multiple zeros . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Ending with a proposition . . . . . . . . . . . . . . . . . . . . . 25
Le ture 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
The se ant method . . . . . . . . . . . . . . . . . . . . . . . . . 27
Convergen e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Rate of onvergen e . . . . . . . . . . . . . . . . . . . . . . . . 31
Multipoint methods . . . . . . . . . . . . . . . . . . . . . . . . 33
Muller's method . . . . . . . . . . . . . . . . . . . . . . . . . . 33
The linear-fra tional method . . . . . . . . . . . . . . . . . . . 34
Le ture 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
A hybrid method . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Errors, a ura y, and ondition numbers . . . . . . . . . . . . . 40
Floating-Point Arithmeti 43
Le ture 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Floating-point numbers . . . . . . . . . . . . . . . . . . . . . . 45
Over ow and under ow . . . . . . . . . . . . . . . . . . . . . . 47
Rounding error . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Floating-point arithmeti . . . . . . . . . . . . . . . . . . . . . 49
Le ture 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Computing sums . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Ba kward error analysis . . . . . . . . . . . . . . . . . . . . . . 55
Perturbation analysis . . . . . . . . . . . . . . . . . . . . . . . . 57
Cheap and hippy hopping . . . . . . . . . . . . . . . . . . . . 58
Le ture 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
v
vi Afternotes on Numeri al Analysis
Can ellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
The quadrati equation . . . . . . . . . . . . . . . . . . . . . . 61
That fatal bit of rounding error . . . . . . . . . . . . . . . . . . 63
Envoi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Linear Equations 67
Le ture 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Matri es, ve tors, and s alars . . . . . . . . . . . . . . . . . . . 69
Operations with matri es . . . . . . . . . . . . . . . . . . . . . 70
Rank-one matri es . . . . . . . . . . . . . . . . . . . . . . . . . 73
Partitioned matri es . . . . . . . . . . . . . . . . . . . . . . . . 74
Le ture 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
The theory of linear systems . . . . . . . . . . . . . . . . . . . . 77
Computational generalities . . . . . . . . . . . . . . . . . . . . 78
Triangular systems . . . . . . . . . . . . . . . . . . . . . . . . . 79
Operation ounts . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Le ture 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Memory onsiderations . . . . . . . . . . . . . . . . . . . . . . . 83
Row-oriented algorithms . . . . . . . . . . . . . . . . . . . . . . 83
A olumn-oriented algorithm . . . . . . . . . . . . . . . . . . . 84
General observations on row and olumn orientation . . . . . . 86
Basi linear algebra subprograms . . . . . . . . . . . . . . . . . 86
Le ture 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Positive-denite matri es . . . . . . . . . . . . . . . . . . . . . 89
The Cholesky de omposition . . . . . . . . . . . . . . . . . . . 90
E onomi s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Le ture 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Inner-produ t form of the Cholesky algorithm . . . . . . . . . . 97
Gaussian elimination . . . . . . . . . . . . . . . . . . . . . . . . 98
Le ture 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
BLAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Upper Hessenberg and tridiagonal systems . . . . . . . . . . . . 110
Le ture 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Ve tor norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Matrix norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Relative error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Sensitivity of linear systems . . . . . . . . . . . . . . . . . . . . 116
Le ture 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
The ondition of a linear system . . . . . . . . . . . . . . . . . 119
Arti ial ill- onditioning . . . . . . . . . . . . . . . . . . . . . . 120
Rounding error and Gaussian elimination . . . . . . . . . . . . 122
Comments on the error analysis . . . . . . . . . . . . . . . . . . 125
Contents vii
Le ture 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Introdu tion to a proje t . . . . . . . . . . . . . . . . . . . . . . 127
More on norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
The wonderful residual . . . . . . . . . . . . . . . . . . . . . . . 128
Matri es with known ondition numbers . . . . . . . . . . . . . 129
Invert and multiply . . . . . . . . . . . . . . . . . . . . . . . . . 130
Cramer's rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Submission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Polynomial Interpolation 133
Le ture 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Quadrati interpolation . . . . . . . . . . . . . . . . . . . . . . 135
Shifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Polynomial interpolation . . . . . . . . . . . . . . . . . . . . . . 137
Lagrange polynomials and existen e . . . . . . . . . . . . . . . 137
Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Le ture 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Syntheti division . . . . . . . . . . . . . . . . . . . . . . . . . . 141
The Newton form of the interpolant . . . . . . . . . . . . . . . 142
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Existen e and uniqueness . . . . . . . . . . . . . . . . . . . . . 143
Divided dieren es . . . . . . . . . . . . . . . . . . . . . . . . . 144
Le ture 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Error in interpolation . . . . . . . . . . . . . . . . . . . . . . . 147
Error bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Convergen e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chebyshev points . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Numeri al Integration 155
Le ture 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Numeri al integration . . . . . . . . . . . . . . . . . . . . . . . 157
Change of intervals . . . . . . . . . . . . . . . . . . . . . . . . . 158
The trapezoidal rule . . . . . . . . . . . . . . . . . . . . . . . . 158
The omposite trapezoidal rule . . . . . . . . . . . . . . . . . . 160
Newton{Cotes formulas . . . . . . . . . . . . . . . . . . . . . . 161
Undetermined oe ients and Simpson's rule . . . . . . . . . . 162
Le ture 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
The Composite Simpson rule . . . . . . . . . . . . . . . . . . . 165
Errors in Simpson's rule . . . . . . . . . . . . . . . . . . . . . . 166
Treatment of singularities . . . . . . . . . . . . . . . . . . . . . 167
Gaussian quadrature: The idea . . . . . . . . . . . . . . . . . . 169
Le ture 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Gaussian quadrature: The setting . . . . . . . . . . . . . . . . . 171
viii Afternotes on Numeri al Analysis
Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . 171

Existen e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Zeros of orthogonal polynomials . . . . . . . . . . . . . . . . . . 174
Gaussian quadrature . . . . . . . . . . . . . . . . . . . . . . . . 175
Error and onvergen e . . . . . . . . . . . . . . . . . . . . . . . 176
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Numeri al Dierentiation 179
Le ture 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Numeri al dierentiation and integration . . . . . . . . . . . . . 181
Formulas from power series . . . . . . . . . . . . . . . . . . . . 182
Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Bibliography 187
Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Referen es . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Index 191
Prefa e
In the spring of 1993, I took my turn at tea hing our upper-division ourse
in introdu tory numeri al analysis. The topi s overed were nonlinear equa-
tions, omputer arithmeti , linear equations, polynomial interpolation, numer-
i al integration, and numeri al dierentiation. The ontinuation of the ourse
is a sequen e of two graduate ourses, whi h oer a sele tion of omplementary
topi s. I taught Tuesday-Thursday lasses of eighty-ve minutes.
The textbook was Numeri al Analysis by David Kin aid and Ward Cheney.
However, I usually treat textbooks as supplemental referen es and seldom look
at them while I am preparing le tures. The pra ti e has the advantage of
giving students two views of the subje t. But in the end all I have is a handful
of sket hy notes and vague re olle tions of what I said in lass.
To nd out what I was a tually tea hing I de ided to write down ea h
le ture immediately after it was given while it was still fresh in my mind. I
all the results afternotes. If I had known what I was letting myself in for,
I would have never undertaken the job. Writing to any kind of deadline is
di ult; writing to a self-imposed deadline is torture. Yet now I'm glad I did
it. I learned a lot about the anatomy of a numeri al analysis ourse.
I also had an ulterior motive. Most numeri al analysis books, through no
fault of their authors, are a bit ponderous. The reason is they serve too many
masters. They must instru t the student (and sometimes the tea her). They
must also ontain enough topi s to allow the instru tor to sele t his or her
favorites. In addition, many authors feel that their books should be referen es
that students an take with them into the real world. Now there are various
ways to ombine these fun tions in a single book | and they all slow down the
exposition. In writing these afternotes, I was urious to see if I ould give the
subje t some narrative drive by s rewing down on the fo us. You will have to
judge how well I have su eeded.
So what you have here is a repli a of what I said in lass. Not a slavish
repli a. The bla kboard and the printed page are dierent instruments and
must be handled a ordingly. I orre ted errors, big and small, whenever I
found them. Moreover, when I saw better ways of explaining things, I did not
hesitate to rework what I had originally presented. Still, the orresponden e
is lose, and ea h se tion of the notes represents about a lass period's worth
of talking.
In making these notes available, I hope that they will be a useful supple-
ment for people taking a numeri al ourse or studying a onventional textbook
on their own. They may also be a sour e of ideas for someone tea hing nu-
meri al analysis for the rst time. To in rease their utility I have appended a
brief bibliography.
The notes were originally distributed over the Internet, and they have
beneted from the feedba k. I would like to thank Stu Antman, Rogerio Brito,
ix
x Afternotes on Numeri al Analysis
John Carroll, Bob Funderli , David Goldberg, Murli Gupta, Ni k Higham,

Walter Homan, Keith Lindsay, Dean S hulze, and Larry Shampine for their
omments. I am also indebted to the people at SIAM who saw the notes
through produ tion: to Vi kie Kearn for resurre ting them from the oblivion
of my lass dire tory, to Jean Anderson for a painstaking job of opy editing,
and to Corey Gray for an elegant design.
Above all I owe one to my wife, Astrid S hmidt-Nielsen, who was a patient
and en ouraging workstation widow throughout the writing of these notes.
They are dedi ated to her.
G. W. Stewart
College Park, MD
Nonlinear Equations
1
Le ture 1
Nonlinear Equations
By the Dawn's Early Light
Interval Bise tion
Relative Error
By the dawn's early light

1. For a simple example of a nonlinear equation, onsider the problem of
aiming a annon to hit a target at distan e d. The annon is assumed to have
muzzle velo ity V0 and elevation .
To determine how far the annon ball travels, note that the verti al ompo-
nent of the muzzle velo ity is V0 sin . Sin e the ball is moving verti ally against
the a eleration of gravity g, its verti al position y(t) satises the dierential
equation
y(0) = 0;

00
y (t) = g; y0 (0) = V sin :
0
The solution is easily seen to be

y(t) = V0 t sin
1 gt2 :
2
Thus the ball will hit the ground at time
T= 0
2V sin :
g
Sin e the horizontal omponent of the velo ity is V0 os , the ball will travel a
distan e of T V0 os . Thus to nd the elevation we have to solve the equation
2V02 os sin = d;
g
or equivalently
f ()
2V02 os sin d = 0: (1.1)
g
2. Equation (1.1) exhibits a number of features asso iated with nonlinear
equations. Here is a list.
The equation is an idealization. For example, it does not take into a ount
the resistan e of air. Again, the derivation assumes that the muzzle of the
annon is level with the ground | something that is obviously not true. The
lesson is that when you are presented with a numeri al problem in the abstra t
3
4 Afternotes on Numeri al Analysis
it is usually a good idea to ask about where it ame from before working hard
to solve it.
The equation may not have a solution. Sin e os sin assumes a maximum
of 21 at = 4 , there will be no solution if
V02
d> :
g
Solutions, when they exist, are not unique. If there is one solution, then there
are innitely many, sin e sin and os are periodi . These solutions represent
a rotation of the annon elevation through a full ir le. Any resolution of the
problem has to take these spurious solutions into a ount.
If d < V02=4g, and < 2 is a solution, then 2 is also a solution. Both
solutions are meaningful, but as far as the gunner is on erned, one may be
preferable to the other. You should nd out whi h.
The fun tion f is simple enough to be dierentiated. Hen e we an use a
method like Newton's method.
In fa t, (1.1) an be solved dire tly. Just use the relation 2 sin os = sin 2.
It is rare for things to turn out this ni ely, but you should try to simplify before
looking for numeri al solutions.
If we make the model more realisti , say by in luding air resistan e, we may
end up with a set of dierential equations that an only be solved numeri ally.
In this ase, analyti derivatives will not be available, and one must use a
method that does not require derivatives, su h as a quasi-Newton method
(x3.1).
Interval bise tion
3. In pra ti e, a gunner may determine the range by trial and error, raising and
lowering the annon until the target is obliterated. The numeri al analogue
of this pro ess is interval bise tion. From here on we will onsider the general
problem of solving the equation
f (x) = 0: (1.2)
4. The theorem underlying the bise tion method is alled the intermediate
value theorem.
If f is ontinuous on [a; b and g lies between f (a) and f (b), then
there is a point x 2 [a; b su h that g = f (x).
1. Nonlinear Equations 5
5. The intermediate value theorem an be used to establish the existen e of a

solution of (1.2). Spe i ally if sign[f (a) 6= sign[f (b), then zero lies between
f (a) and f (b). Consequently, there is a point x in [a; b su h that f (x) = 0.
6. We an turn this observation into an algorithm that bra kets the root
in intervals of ever de reasing width. Spe i ally, suppose that sign[f (a) 6=
sign[f (b), and to avoid degenera ies suppose that neither f (a) nor f (b) is zero.
In this ase, we will all the interval [a; b a nontrivial bra ket for a root of f .
Here, and always throughout these notes, [a; b will denote the set of points
between a and b, in lusive, with no impli ation that a b.
Now let = a+2 b . There are three possibilities.
1. f ( ) = 0. In this ase we have found a solution of (1.2).
2. f ( ) 6= 0 and sign[f ( ) 6= sign[f (b). In this ase [ ; b is a nontrivial
bra ket.
3. f ( ) 6= 0 and sign[f (a) 6= sign[f ( ). In this ase [a; is a nontrivial
bra ket.
Thus we either solve the problem or we end up with a nontrivial bra ket that
is half the size of the original. The pro ess an be repeated indenitely, ea h
repetition either solving the problem or redu ing the length of the bra ket by
a fa tor of two. Figure 1.1 illustrates several iterations of this pro ess. The
numbers following the letters a and b indi ate at whi h iteration these points
were sele ted.
7. These onsiderations lead to the following algorithm. The input is a non-
trivial bra ket [a; b and the fun tion values fa and fb at the endpoints. In
addition we need a stopping riterion eps 0. The algorithm usually returns
a bra ket of length not greater than eps. (For the ex eption see x1.9.)
while (abs(b-a) > eps){
= (b+a)/2;
if ( ==a || ==b)
return;
f = f( );
if (f == 0){
a = b = ;
fa = fb = f ;
return;
(1.3)
}
if (sign(f ) != sign(fb))
{a = ; fa = f ;}
else
{b = ; fb = f ;}
}
return;
a1 a2 a4 b3 b1
Figure 1.1. Interval bise tion.
8. The hardest part about using the bise tion algorithm is nding a bra ket.
On e it is found, the algorithm is guaranteed to onverge, provided the fun tion
is ontinuous. Although later we shall en ounter algorithms that onverge
mu h faster, the bise tion method onverges steadily. If L0 = jb aj is the
length of the original bra ket, after k iterations the bra ket has length
L0
Lk =
2k :
Sin e the algorithm will stop when Lk eps, it will require

L0
log2
eps
iterations to onverge. Thus if L0 = 1 and eps = 10 6 , the iteration will

require 20 iterations.
9. The statement
if ( ==a || ==b)
return;
is a on ession to the ee ts of rounding error. If eps is too small, it is possible
for the algorithm to arrive at the point where (a+b)/2 evaluates to either a or
b, after whi h the algorithm will loop indenitely. In this ase the algorithm,
having given its all, simply returns.1
Relative error
10. The onvergen e riterion used in (1.3) is based on absolute error ; that is,
it measures the error in the result without regard to the size of the result. This
may or may not be satisfa tory. For example, if eps = 10 6 and the zero in
question is approximately one, then the bise tion routine will return roughly
six a urate digits. However, if the root is approximately 10 7 , we an expe t
no gures of a ura y: the nal bra ket an a tually ontain zero.
11. If a ertain number of signi ant digits are required, then a better measure
of error is relative error. Formally, if y is an approximation to x 6= 0, then the
relative error in y is the number
=
jy xj :
jxj
Alternatively, y has relative error , if there is a number with jj = su h
that
y = x(1 + ):
12. The following table of approximations to e = 2:7182818 : : : illustrates the
relation of relative error and signi ant digits.
Approximation
2: 2 10 1
2:7 6 10 3
2:71 3 10 3
2:718 1 10 4
2:7182 3 10 5
2:71828 6 10 7
An examination of this table suggests the following.

If x and y agree to k de imal digits, then the relative error in y will
be approximately 10 k .
1
Thanks to Urs von Matt for pointing this out.
13. If we ex lude tri ky ases like x = 2:0000 and y = 1:9999, in whi h the
notion of agreement of signi ant digits is not well dened, the relation between
agreement and relative error is not di ult to establish. Let us suppose, say,
that x and y agree to six gures. Writing x above y, we have
x = X1 X2 X3 X4 X5 X6 X7 X8 ;
y = Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 :
Now sin e the digits X7 and Y7 must disagree, the smallest dieren e between
x and y is obtained when, e.g.,
X7 X8 = 40;
y7 y8 = 38:
Thus jx yj 2, whi h is a lower bound on the dieren e. On the other hand,

if X7 is nine while Y7 is zero, then jy xj < 10, whi h is an upper bound. Thus
2 jx yj < 100:
Sin e 1 X1 9, it is easy to see that
107 jxj < 108 :
Hen e
0:2 10 7 < jy jxjxj < 10 5 ;
i.e., the relative error is near 10 6 .
14. Returning to interval bise tion, if we want a relative error of in the
answer, we might repla e the onvergen e riterion in the while statement
with
while(abs(b-a)/min(abs(a),abs(b)) > rho)
Note that this riterion is dangerous if the initial bra ket straddles zero, sin e
the quantity min(abs(a),abs(b)) ould approa h zero.
Le ture 2
Nonlinear Equations
Newton's Method
Re ipro als and Square Roots
Lo al Convergen e Analysis
Slow Death
Newton's method
1. Newton's method is an iterative method for solving the nonlinear equation
f (x) = 0: (2.1)
Like most iterative methods, it begins with a starting point x0 and produ es
su essive approximations x1 , x2 , . . . . If x0 is su iently near a root x of (2.1),
the sequen e of approximations will approa h x . Usually the onvergen e is
quite rapid, so that on e the typi al behavior of the method sets in, it requires
only a few iterations to produ e a very a urate approximation to the root.
(The point x is also alled a zero of the fun tion f . The distin tion is that
equations have roots while fun tions have zeros.)
Newton's method an be derived in two ways: geometri ally and analyti-
ally. Ea h has its advantages, and we will treat ea h in turn.
2. The geometri approa h is illustrated in Figure 2.1. The idea is to draw
a tangent to the urve y = f (x) at the point A = (x0 ; f (x0 )). The abs issa
x1 of the point C = (x1 ; 0) where the tangent interse ts the axis is the new
approximation. As the gure suggests, it will often be a better approximation
to x than x0 .
To derive a formula for x1 , onsider the distan e BC from x0 to x1 , whi h
satises
BC = BAd :
tan ACB
But BA = f (x0 ) and tan ABC
d = 0
f (x0 ) (remember the derivative is negative
at x0 ). Consequently,
f (x0 )
x1 = x0 :
f 0 (x )
0
If the iteration is arried out on e more, the result is point D in Figure 2.1. In
general, the iteration an be ontinued by dening
f (xk )
xk+1 = xk ; k = 0; 1; : : : :
f 0(xk )
9
B C D
Figure 2.1. Geometri illustration of Newton's method.
3. The analyti derivation of Newton's method begins with the Taylor expan-
sion
1
f (x) = f (x0 ) + f 0 (x0 )(x x0 ) + f 00 (0 )(x x0 )2 ;
2
where as usual 0 lies between x and x0 . Now if x0 is near the zero x of f and
f 0 (x0 ) is not too large, then the fun tion
f^(x) = f (x0 ) + f 0 (x0 )(x x0 )
provides a good approximation to f (x) in the neighborhood of x . For example,
if jf 00 (x)j 1 and jx x0 j 10 2 , then jf^(x) f (x)j 10 4 . In this ase it is
reasonable to assume that the solution of the equation f^(x) = 0 will provide a
good approximation to x . But this solution is easily seen to be
f (x0 )
x1 = x0 ;
f 0(x0 )
whi h is just the Newton iteration formula.
4. In some sense the geometri and analyti derivations of Newton's method

say the same thing, sin e y = f^(x) is just the equation of the tangent line
AC in Figure 2.1. However, in other respe ts the approa hes are omplemen-
tary. For example, the geometri approa h shows (informally) that Newton's
method must onverge for a fun tion shaped like the graph in Figure 2.1 and
a starting point for whi h the fun tion is positive. On the other hand the
analyti approa h suggests fruitful generalizations. For example, if f 0 does not
vary too mu h we might skip the evaluation of f 0(xk ) and iterate a ording to
the formula
f (xk )
xk+1 = xk ; k = 1; 2; : : : : (2.2)
f 0 (x )
0
Re ipro als and square roots

5. If a > 0, the fun tion
f (x) =
1 a
x
has the single positive zero x = a 1 . If Newton's method is applied to this
fun tion, the result is the iteration
xk+1 = 2xk ax2k : (2.3)
In fa t, the graph in Figure 2.1 is just a graph of x 1 0:5. Among other
things, it shows that the iteration will onverge from any starting point x0 > 0
that is less than a1 . Be ause the iteration requires no divisions, it has been
used as an alternative to hard-wired division in some omputers.
6. If a > 0, the fun tion
f (x) = x2 a
p
has the single positive zero x = a. If Newton's method is applied to this
fun tion, the result is the iteration
1
xk+1 = xk +
a
: (2.4)
2 xk
This formula for approximating the square root was known p to the Babyloni-
ans. Again, itpis easy to see geometri ally that if x0 > a then the iteration
onverges to a.
7. Newton's method does not have to onverge. For example, if x0 is too large
in the iteration (2.3) for the re ipro al, then x1 will be less than zero and the
subsequent iterates will diverge to 1.
Lo al onvergen e analysis
8. We are going to show that if x0 is su iently near a zero of x of f and
f 0(x ) 6= 0;
then Newton's method onverges | ultimately with great rapidity. To simplify
things, we will assume that f has derivatives of all orders. We will also set
f (x)
'(x) = x ;
f 0 (x)
so that
xk+1 = '(xk ):
The fun tion ' is alled the iteration fun tion for Newton's method. Note that
f (x )
'(x ) = x = x :
f 0 (x )
Be ause x is unaltered by ', it is alled a xed point of '.
Finally we will set
ek = xk x :
The quantity ek is the error in xk as an approximation to x . To say that
xk ! x is the same as saying that ek ! 0.
9. The lo al onvergen e analysis of Newton's method is typi al of many
onvergen e analyses. It pro eeds in three steps.
1. Obtain an expression for ek+1 in terms of ek .
2. Use the expression to show that ek ! 0.
3. Knowing that the iteration onverges, assess how fast it onverges.
10. The error formula an be derived as follows. Sin e xk+1 = '(xk ) and
x = '(x ),
ek+1 = xk+1 x = '(xk ) '(x ):
By Taylor's theorem with remainder,
'(xk ) '(x ) = '0 (k )(xk x );
where k lies between xk and x . It follows that
ek+1 = '0 (k )ek : (2.5)
This is the error formula we need to prove onvergen e.
11. At rst glan e, the formula (2.5) appears di ult to work with sin e it
depends on k whi h varies from iterate to iterate. However | and this is the
essen e of any lo al onvergen e theorem | we may take our starting point

lose enough to x so that
j'0 (k )j C < 1; k = 0; 1; : : : :
Spe i ally, we have
f (x)f 00(x)
'0 (x) = :
f 0 (x)2
Sin e f (x ) = 0 and f 0(x ) 6= 0, it follows that '0 (x ) = 0. Hen e by ontinuity,
there is an interval I = [x ; x + about x su h that if x 2 I then
j'0 (x)j C < 1:
Now suppose that x0 2 I . Then sin e 0 is between x and x0 , it follows
that 0 2 I . Hen e from (2.5),
je j j'0 (k )jje j C je j C < ;
1 0 0
and x1 2 I . Now sin e x1 is in I , so is x2 by the same reasoning. Moreover,

je j C je j C je j:
2 1
2
0
By indu tion, if xk 1 2 I , so is xk , and

jek j C jek j C k je j:
1 0
Sin e C k ! 0, it follows that ek ! 0; that is, the sequen e x , x , . . . onverges

0 1
to x , whi h is what we wanted to show.
12. To assess the rate of onvergen e, we turn to a higher-order Taylor expan-
sion. Sin e '0 (x ) = 0, we have
1
'(xk ) '(x ) = '00 (k )(xk x )2 ;
2
where k lies between xk and x . Hen e
1
ek+1 = '00 (k )e2k :
2
Sin e k approa hes x along with xk , it follows that
ek+1 1 00 f 00 (x )
lim = ' (x ) 0 :
k!1 ek 2
2 2f (x )
In other words,
f 00(x ) 2
ek+1
= (2.6)
2f 0 (x ) ek :
A sequen e whose errors behave like this is said to be quadrati ally onvergent .
13. To see informally what quadrati onvergen e means, suppose that the
multiplier of e2k in (2.6) is one and that e0 = 10 1 . Then e1
= 10 2 , e3
= 10 4 ,

e4 = 10 , e5 = 10 , and so on. Thus if x is about one in magnitude, the
8 16
rst iterate is a urate to about two pla es, the se ond to four, the third to
eight, the fourth to sixteen, and so on. In this ase ea h iteration of Newton's
method doubles the number of a urate gures.
For example, if the formula (2.4) is used to approximate the square root of
ten, starting from three, the result is the following sequen e of iterates.
3:
3:16
3:1622
3:16227766016
3:16227766016838
Only the orre t gures are displayed, and they roughly double at ea h iter-
ation. The last iteration is ex eptional, be ause the omputer I used arries
only about fteen de imal digits.
14. For a more formal analysis, re all that the number of signi ant gures in
an approximation is roughly the negative logarithm of the relative error (see
x1.12). Assume that x 6= 0, and let k denote the relative error in xk . Then
from (2.6) we have
k+1
jx2 f 00(x )j
= 2jf 0 (x )j 2k K2k :

Hen e
log k+1 = 2 log k log K:
As the iteration onverges, log k ! 1, and it overwhelms the value of log K .
Hen e
log k+1
= 2 log k ;
whi h says that xk+1 has twi e as many signi ant gures as xk .
Slow death
15. The onvergen e analysis we have just given shows that if Newton's method
onverges to a zero x for whi h f 0 (x ) 6= 0 then in the long run it must onverge
quadrati ally. But the run an be very long indeed.
For example, in x2.5 we noted that the iteration
xk+1 = 2xk ax2k
will onverge to a 1 starting from any point less than a 1 . In parti ular, if
a < 1, we an take a itself as the starting value.
But suppose that a = 10 10 . Then
x1 = 2 10 10
+ 10 30 = 2 10 :
10
Thus for pra ti al purposes the rst iterate is only twi e the size of the starting
value. Similarly, the se ond iterate will be about twi e the size of the rst.
This pro ess of doubling the sizes of the iterates ontinues until xk = 1010 , at
whi h point quadrati onvergen e sets in. Thus we must have 2 10 10
k = 1010
or k
= 66 before we begin to see quadrati onvergen e. That is a lot of work
to ompute the re ipro al of a number.
16. All this does not mean that the iteration is bad, just that it needs a good
starting value. Sometimes su h a value is easy to obtain. For example, suppose
that a = f 2e , where 12 f < 1 and we know e. These onditions are satised
if a is represented as a binary oating-point number on a omputer. Then
a 1 = f 1 2 e . Sin e 1 < f 1 2, the number 2 e < a 1 provides a good
starting value.
Le ture 3
Nonlinear Equations
A Quasi-Newton Method
Rates of Convergen e
Iterating for a Fixed Point
Multiple Zeros
Ending with a Proposition
A quasi-Newton method
1. One of the drawba ks of Newton's method is that it requires the omputa-
tion of the derivative f 0 (xk ) at ea h iteration. There are three ways in whi h
this an be a problem.
1. The derivative may be very expensive to ompute.
2. The fun tion f may be given by an elaborate formula, so that it is
easy to make mistakes in dierentiating f and writing ode for the
derivative.
3. The value of the fun tion f may be the result of a long numeri al
al ulation. In this ase the derivative will not be available as a
formula.
2. One way of getting around this di ulty is to iterate a ording to the
formula
f (xk )
xk+1 = xk ;
gk
where gk is an easily omputed approximation to f 0 (xk ). Su h an iteration
is alled a quasi-Newton method.2 There are many quasi-Newton methods,
depending on how one approximates the derivative. For example, we will later
examine the se ant method in whi h the derivative is approximated by the
dieren e quotient
f (xk ) f (xk 1)
gk = : (3.1)
x x k k 1
Here we will analyze the simple ase where gk is onstant, so that the iteration
takes the form
f (xk )
xk+1 = xk : (3.2)
g
We will all this method the onstant slope method. In parti ular, we might
take g = f 0 (x0 ), as in (2.2). Figure 3.1 illustrates the ourse of su h an
2
The term \quasi-Newton" usually refers to a lass of methods for solving systems of
simultaneous nonlinear equations.
17
Figure 3.1. The onstant slope method.
iteration.
3. On e again we have a lo al onvergen e theorem. Let
f (x)
'(x) = x
g
be the iteration fun tion and assume that be the iteration fun tion and assume
that
f 0 (x )

0
j' (x)j 1

< 1: (3.3)
g
Arguing as in x2.10, we an show that

ek+1 = '0 (k )ek ; (3.4)
where k lies between xk and x . Sin e j'0 (x )j < 1, there is an interval I
about x su h that
j'0 (x)j C < 1 whenever x 2 I:
Arguing as in x2.11, we nd that if x0 2 I then

jek j C k je j;
0
whi h implies that the iteration onverges.

4. To assess the quality of the onvergen e, note that (3.4) implies that
ek+1
lim
k!1 e
= '0 (x );
k
or asymptoti ally
ek+1
= '0 (x )ek :
Thus if '0 (x ) 6= 0, ea h iteration redu es the error by roughly a fa tor of
'0 (x ). Su h onvergen e is alled linear onvergen e.
5. It is worth noting that the onvergen e proof for the onstant slope method
(3.2) is the same as the proof for Newton's method itself | with one important
ex eption. The iteration fun tion for Newton's method satises '0 (x ) = 0,
from whi h it follows that the onstant C < 1 exists regardless of the fun tion
f (provided that f 0 (x ) 6= 0). For the quasi-Newton method, we must postulate
that j'0 (x )j < 1 in order to insure the existen e of the onstant C < 1. This
dieren e is a onsequen e of the fa t that the denominator g in the quasi-
Newton method is a free parameter, whi h, improperly hosen, an ause the
method to fail.
6. The methods also dier in their rates of onvergen e: Newton's method
onverges quadrati ally, while the onstant slope method onverges linearly
(ex ept in the unlikely ase where we have hosen g = f 0(x ), so that '0 (x ) =
0). Now in some sense all quadrati onvergen e is the same. On e it sets in,
it doubles the number of signi ant gures at ea h step. Linear onvergen e
is quite dierent. Its speed depends on the ratio
xk+1 x
= lim :
k!1 xk x
If is near one, the onvergen e will be slow. If it is near zero, the onvergen e
will be fast.
7. It is instru tive to onsider the ase where
xk+1 x = (xk x );
so that the error is redu ed exa tly by a fa tor of at ea h iteration. In this
ase
xk x = k (x0 x ):
It follows that to redu e the error by a fa tor of , we must have k or

k=

log :
log
The following table gives values of k for representative values of and .

:99 :90 :50 :10 :01
10 5
1146 110 17 6 3
10 10
2292 219 34 11 6
10 15 3437 328 50 16 8
The numbers show that onvergen e an range from quite fast to very slow.
With = 0:01 the onvergen e is not mu h worse than quadrati onver-
gen e | at least in the range of redu tions we are onsidering here. When
= 0:99 the onvergen e is very slow. Note that su h rates, and even slower
ones, arise in real life. The fa t that some algorithms reep toward a solution
is one of the things that keeps super omputer manufa turers in business.
Rates of onvergen e
8. We are going to derive a general theory of iteration fun tions. However,
rst we must say something about rates of onvergen e. Here we will assume
that we have a sequen e x0 , x1 , x2 , . . . onverging to a limit x .
9. If there is a onstant satisfying 0 < jj < 1 su h that
xk+1 x
lim
k!1 x x
= ; (3.5)
k
then the sequen e fxk g is said to onverge linearly with ratio (or rate) . We
have already treated linear onvergen e above.3
10. If the ratio in (3.5) onverges to zero, the onvergen e is said to be su-
perlinear. Certain types of superlinear onvergen e an be hara terized as
follows. If there is a number p > 1 and a positive onstant C su h that
k!1
lim jxjxk+1 x xjp j = C;
k
then the sequen e is said to onverge with order p. When p = 2 the onvergen e
is quadrati . When p = 3 the onvergen e is ubi . In general, the analysis
of quadrati onvergen e in x2.14 an be adapted to show that the number of
orre t gures in a sequen e exhibiting pth order onvergen e in reases by a
If = 1, the onvergen e is sometimes alled
3
sublinear.The sequen e f g onverges
1
k
sublinearly to zero.
fa tor of about p from iteration to iteration. Note that p does not have to be
an integer. Later we shall see the se ant method (3.1) typi ally onverges with
order p = 1:62. . . .4
11. You will not ordinarily en ounter rates of onvergen e greater than ubi ,
and even ubi onvergen e o urs only in a few spe ialized algorithms. There
are two reasons. First, the extra work required to get higher-order onvergen e
may not be worth it | espe ially in nite pre ision, where the a ura y that
an be a hieved is limited. Se ond, higher-order methods are often less easy
to apply. They generally require higher-order derivatives and more a urate
starting values.
Iterating for a xed point
12. The essential identity of the lo al onvergen e proof for Newton's method
and the quasi-Newton method suggests that they both might be subsumed un-
der a general theory. Here we will develop su h a theory. Instead of beginning
with an equation of the form f (x) = 0, we will start with a fun tion ' having
a xed point x | that is, a point x for whi h '(x ) = x | and ask when the
iteration
xk+1 = '(xk ); k = 0; 1; : : : (3.6)
onverges to x . This iterative method for nding a xed point is alled the
method of su essive substitutions.
13. The iteration (3.6) has a useful geometri interpretation, whi h is illus-
trated in Figures 3.2 and 3.3. The xed point x is the abs issa of the inter-
se tion of the graph of '(x) with the line y = x. The ordinate of the fun tion
'(x) at x0 is the value of x1 . To turn this ordinate into an abs issa, re e t it in
the line y = x. We may repeat this pro ess to get x2 , x3 , and so on. It is seen
that the iterates in Figure 3.2 zigzag into the xed point, while in Figure 3.3
they zigzag away: the one iteration onverges if you start near enough to the
xed point, whereas the other diverges no matter how lose you start. The
xed point in the rst example is said to be attra tive, and the one in the
se ond example is said to be repulsive .
14. It is the value of the derivative of ' at the xed point that makes the
dieren e in these two examples. In the rst the absolute value of the derivative
is less than one, while in the se ond it is greater than one. (The derivatives
here are both positive. It is instru tive to draw iteration graphs in whi h
the derivatives at the xed point are negative.) These examples along with
our earlier onvergen e proofs suggest that what is ne essary for a method of
su essive substitutions to onverge is that the absolute value of the derivative
be less than one at the xed point. Spe i ally, we have the following result.
4
It is also possible for a sequen e to onverge superlinearly but not with order p > 1. The
sequen e k1! is an example.
x0 x1 x2 x3
Figure 3.2. An attra tive xed point.
If
j'0 (x )j < 1;
then there is an interval I = [x ; x + su h that the iteration
(3.6) onverges to x whenever x0 2 I . If '0 (x ) 6= 0, then the
onvergen e is linear with ratio '0 (x ). On the other hand, if
0 = '0 (x ) = '00 (x ) = = '(p 1) (x ) 6= '(p) (x ); (3.7)
then the onvergen e is of order p.
15. We have essentially seen the proof twi e over. Convergen e is established
exa tly as for Newton's method or the onstant slope method. Linear onver-
gen e in the ase where '0 (x ) 6= 0 is veried as it was for the onstant slope
method. For the ase where (3.7) holds, we need to verify that the onvergen e
is of order p. In the usual notation, by Taylor's theorem
1
ek+1 = '(p) (k )epk :
p!
Sin e k ! x , it follows that
lim ek+1 = p1! '(p) (x ) 6= 0;
k!1 epk
x2 x1 x0
Figure 3.3. A repulsive xed point.
whi h establishes the pth-order onvergen e.

16. Armed with this result, we an return to Newton's method and the onstant
slope method. For Newton's method we have
f (x )f 00 (x )
' 0 (x ) = =0
f 0 (x )2
(remember that f 0 (x ) is assumed to be nonzero). Thus Newton's method is
seen to be at least quadrati ally onvergent. Sin e
f 00(x )
'00 (x ) = 0 ;
f (x )
Newton's method will onverge faster than quadrati ally only when f 00(x ) = 0.
For the onstant slope method we have
f 0 (x )
'0 (x ) = 1
:
g
The requirement that the absolute value of this number be less than one is
pre isely the ondition (3.3).
Multiple zeros
17. Up to now we have onsidered only a simple zero of the fun tion f , that
is, a zero for whi h f 0 (x ) 6= 0. We will now onsider the ase where
0 = f 0(x ) = f 00(x ) = = f (m 1) (x ) 6= f (m) (x ):
By Taylor's theorem
f (m) (x )
f (x) = (x x )m ;
m!
where x lies between x and x. If we set g(x) = f (m) (x )=m!, then
f (x) = (x x )m g(x); (3.8)
where g is ontinuous at x and g(x ) 6= 0. Thus, when x is near x , the
fun tion f (x) behaves like a polynomial with a zero of multipli ity m at x .
For this reason we say that x is a zero of multipli ity m of f .
18. We are going to use the xed-point theory developed above to assess the
behavior of Newton's method at a multiple root. It will be most onvenient to
use the form (3.8). We will assume that g is twi e dierentiable.
19. Sin e f 0 (x) = m(x x )m 1 g(x) + (x x )m g0 (x), the Newton iteration
fun tion for f is
'(x) = x
(x x )m g(x) =x (x x )g(x) :
m(x x ) g(x) + (x x ) g (x)
m 1 m 0 mg(x) (x x )g0 (x)
From this we see that ' is well dened at x and
'(x ) = x :
A ording to xed-point theory, we have only to evaluate the derivative of
' at x to determine if x is an attra tive xed point. We will skip the slightly
tedious dierentiation and get straight to the result:
' 0 (x ) = 1
1:
m
Therefore, Newton's method onverges to a multiple zero from any su iently
lose approximation, and the onvergen e is linear with ratio 1 m1 . In parti -
ular for a double root, the ratio is 12 , whi h is omparable with the onvergen e
of interval bise tion.
Ending with a proposition

20. Although roots that are exa tly multiple are not ommon in pra ti e,
the above theory says something about how Newton's method behaves with
a nearly multiple root. An amusing example is the following proposition for
omputing a zero of a polynomial of odd degree.
Let
f (x) = xn + an 1 xn 1 + + a0 ; (3.9)
where n is odd. Sin e f (x) > 0 for large positive x and f (x) < 0 for large
negative x, by the intermediate value theorem f has a real zero. Moreover,
you an see graphi ally that if x0 is greater than the largest zero of f , then
Newton's method onverges from x0 . The proposition, then, is to hoose a
very large value of x0 (there are ways of hoosing x0 to be greater than the
largest root), and let Newton's method do its thing.
The trouble with this proposition is that if x is very large, the term xn in
(3.9) dominates the others, and f (x)
= xn . In other words, from far out on the
x-axis, f appears to have a zero of multipli ity n at zero. If Newton's method
is applied, the error in ea h iterate will be redu ed by a fa tor of only 1 n1
(when n = 100, this is a painfully slow 0:99). In the long run, the iterates will
arrive near a zero, after whi h quadrati onvergen e will set in. But, as we
have had o asion to observe, in the long run we are all dead.
Le ture 4
Nonlinear Equations
The Se ant Method
Convergen e
Rate of Convergen e
Multipoint Methods
Muller's Method
The Linear-Fra tional Method
The se ant method

1. Re all that in a quasi-Newton method we iterate a ording to the formula
f (xk )
x =x k+1 k ; (4.1)
gk
where the numbers gk are hosen to approximate f 0 (xk ). One way of al ulating
su h an approximation is to hoose a step size hk and approximate f 0 (xk ) by
the dieren e quotient
f (xk + hk ) f (xk )
g =
k :
hk
There are two problems with this approa h.
First, we have to determine the numbers hk . If they are too large, the
approximations to the derivative will be ina urate and onvergen e will be
retarded. If they are too small, the derivatives will be ina urate owing to
rounding error (we will return to this point in Le ture 24).
Se ond, the pro edure requires one extra fun tion evaluation per iteration.
This is a serious problem if fun tion evaluations are expensive.
2. The key to the se ant method is to observe that on e the iteration is
started we have two nearby points, xk and xk 1, where the fun tion has been
evaluated. This suggests that we approximate the derivative by
f (xk ) f (xk 1)
g =k :
xk xk 1
The iteration (4.1) then takes the form

f (xk )(xk xk 1 ) xk 1 f (xk ) xk f (xk 1)
xk+1 = xk = : (4.2)
f (x k ) f ( x k 1 ) f (xk ) f (xk 1)
This iteration is alled the se ant method.
27
x2
x0 x1
Figure 4.1. The se ant method.
3. The se ant method derives its name from the following geometri interpreta-
tion of the iteration. Given x0 and x1 , draw the se ant line through the graph
of f at the points (x0 ; f (x0 )) and (x1 ; f (x1 )). The point x2 is the abs issa
of the interse tion of the se ant line with the x-axis. Figure 4.1 illustrates
this pro edure. As usual, a graph of this kind an tell us a lot about the
onvergen e of the method in parti ular ases.
4. If we set
f (u)(u v) vf (u) uf (v)
'(u; v) = u = f (u) f (v) ; (4.3)
f (u) f (v)
then the iteration (4.2) an be written in the form
xk+1 = '(xk ; xk 1 ):
Thus ' plays the role of an iteration fun tion. However, be ause it has two
arguments, the se ant method is alled a two-point method.
5. Although ' is indeterminate for u = v, we may remove the indetermina y
by setting
f (u)
'(u; u) = u :
f 0(u)
In other words, the se ant method redu es to Newton's method in the on uent
ase where xk = xk 1 . In parti ular, it follows that
'(x ; x ) = x ;
so that x is a xed point of the iteration.
Convergen e
6. Be ause the se ant method is a two-point method, the xed-point theory
developed above does not apply. In fa t, the onvergen e analysis is onsider-
ably more ompli ated. But it still pro eeds in the three steps outlined in x2.9:
(1) nd a re ursion for the error, (2) show that the iteration onverges, and
(3) assess the rate of onvergen e. Here we will onsider the rst two steps.
7. It is a surprising fa t that we do not need to know the spe i form (4.3)
of the iteration fun tion to derive an error re urren e. Instead we simply use
the fa t that if we input the answer we get the answer ba k. More pre isely, if
one of the arguments of ' is the zero x of f , then ' returns x ; i.e.,
'(u; x ) x and '(x ; v) x :
Sin e '(u; x ) and '(x ; v) are onstant, their derivatives with respe t to u
and v are zero:
'u (u; x ) 0 and 'v (x ; v) 0:
The same is true of the se ond derivatives:
'uu (u; x ) 0 and 'vv (x ; v) 0:
8. To get an error re ursion, we begin by expanding ' about (x ; x ) in a

two-dimensional Taylor series. Spe i ally,
'(x + p; x + q) = '(x ; x ) + 'u (x ; x )p + 'v (x ; x )q
+ 12 ['uu (x + p; x + q)p2
+ 2'uv (x + p; x + q)pq + 'vv (x + p; x + q)q2;
where 2 [0; 1. Sin e '(x ; x ) = x and 'u (x ; x ) = 'v (x ; x ) = 0,
'(x + p; x + q) = x + 12 ['uu (x + p; x + q)p2 (4.4)
+ 2'uv (x + p; x + q)pq + 'vv (x + p; x + q)q2 :
The term ontaining the ross produ t pq is just what we want, but the terms
in p2 and q2 require some massaging. Sin e 'uu (x + p; x ) = 0, it follows
from a Taylor expansion in the se ond argument that
'uu (x + p; x + q) = 'uuv (x + p; x + q q)q;
where q 2 [0; 1. Similarly,
'vv (x + p; x + q) = 'uvv (x + pp; x + q)p;
where p 2 [0; 1. Substituting these values in (4.4) gives
pq
'(x + p; x + q) = x + ['uuv (x + p; x + q q)p
2 (4.5)
+ 2'uv (x + p; x + q) + 'uvv (x + pp; x + q)q:
9. Turning now to the iteration proper, let the starting values be x0 and x1 ,
and let their errors be e0 = x0 x and e1 = x1 x . Taking p = e1 and q = e0
in (4.5), we get
e2 = '(x + e1 ; x + e0 ) x
= e12e0 ['uuv (x + e1 ; x + e e0 )e1
0
+ 2'uv (x + e1 ; x + e0 ) + 'uvv (x + e e1 ; x + e0 )e0

e1 e0 1
r(e1; e0 ):
2
(4.6)
This is the error re urren e we need.
10. We are now ready to establish the onvergen e of the method. First note
that
r(0; 0) = 2'uv (x ; x ):
Hen e there is a > 0 su h that if juj; jvj then
jvr(u; v)j C < 1:
Now let je0 j; je1 j . From the error re urren e (4.6) it follows that je2 j
C je1 j < je1 j . Hen e
je1 r(e2 ; e1 )j C < 1;
and je3 j C je2 j C 2je1 j. By indu tion
jek j C k je j;
1
1
and sin e the right-hand side of this inequality onverges to zero, we have
ek ! 0; i.e., the se ant method onverges from any two starting values whose
errors are less than in absolute value.
Rate of onvergen e
11. We now turn to the onvergen e rate of the general two-point method.
The rst thing to note is that sin e
ek ek
ek+1 =
2 r(ek ; ek 1 ) (4.7)
1
and r(0; 0) = 2'uv (x ; x ), we have

ek+1
lim
k!1 e e
= 'uv (x ; x ): (4.8)
k k 1
6 0, we shall say that the sequen e fxk g exhibits two-point

If 'uv (x ; x ) =
onvergen e.
12. We are going to show that two-point onvergen e is superlinear of order
p
p=
1 + 5 = 1:618 : : : :
2
This number is the largest root of the equation
p2 p 1 = 0: (4.9)
Now there are two ways to establish this fa t. The rst is to derive (4.9)
dire tly from (4.8), whi h is the usual approa h. However, sin e we already
know the value of p, we an instead set
sk =
jek j
+1
(4.10)
jek jp
and use (4.8) to verify that the sk have a nonzero limit.
13. You should be aware that some people obje t to this way of doing things
be ause (they say) it hides the way the result | in this ase the parti ular value
of p | was derived. On the other hand, the mathemati ian Gauss is reported
to have said that after you build a athedral you don't leave the s aolding
around; and he ertainly would have approved of the following proof. Both
sides have good arguments to make; but as a pra ti al matter, when you know
or have guessed the solution of a problem it is often easier to verify that it
works than to derive it from rst prin iples.
14. From (4.10) we have
jek j = sk 1jek 1 jp
and
jek+1j = sk jepk j = sk spk 1jek 1 jp :
2
From (4.7),
sk spk 1jek 1 jp 2
jrk j jr(ek ; ek 1 )j = s je jpje j = sk spk 1

1
jek jp 1
2
p 1
:
k 1 k 1 k 1
Sin e p2 p 1 = 0, we have jek 1 jp 2

p 1
= 1 and
jrk j = sk spk : 1
1
Let k = log jrk j and k = log sk . Then our problem is to show that the
sequen e dened by
k = k (p 1)k 1
has a limit.
Let = limk!1 k . Then the limit , if it exists, must satisfy
= (p 1) :
Thus we must show that the sequen e of errors dened by
(k ) = (k ) (p 1)(k 1 )
onverges to zero.
15. The onvergen e of the errors to zero an easily be established from rst
prin iples. However, with an eye to generalizations I prefer to use the following
result from the theory of dieren e equations.
If the roots of the equation
xn a1 xn 1
an = 0
all lie in the unit ir le and limk!1 k = 0, then the sequen e fk g
generated by the re ursion
k = k + a1 k 1 + an k n
onverges to zero, whatever the starting values 0 ; : : : ; n 1 .
16. In our appli ation n = 1 and k = k , and k = k . The equation

whose roots are to lie in the unit ir le is x + (p 1) = 0. Sin e p 1
= 0:618,
the onditions of the above result are satised, and k ! . It follows that
the numbers sk have a nonzero limit. In other words, two-point onvergen e
is superlinear of order p = 1:618. . . .
Multipoint methods
17. The theory we developed for the se ant method generalizes to multipoint
iterations of the form
xk+1 = '(xk ; xk 1 ; : : : ; xk n+1 ):
Again the basi assumption is that if one of the arguments is the answer x
then the value of ' is x . Under this assumption we an show that if the
starting points are near enough x then the errors satisfy
ek+1
lim = '12:::n (x ; x ; : : : ; x );
k!1 ek ek 1 ek n+1
where the subs ript i of ' denotes dierentiation with respe t to the ith argu-
ment.
18. If '12:::n (x ; x ; : : : ; x ) 6= 0, we say that the sequen e exhibits n-point
onvergen e. As we did earlier, we an show that n-point onvergen e is the
same as pth-order onvergen e, where p is the largest root of the equation
pn pn 1
p 1:
The following is a table of the onvergen e rates as a fun tion of n.
n p
2 1:61
3 1:84
4 1:93
5 1:96
The upper bound on the order of onvergen e is two, whi h is ee tively at-
tained for n = 3. For this reason multipoint methods of order four or greater
are seldom en ountered.
Muller's method
19. The se ant method is sometimes alled an interpolatory method, be ause
it approximates a zero of a fun tion by a line interpolating the fun tion at two
points. A useful iteration, alled Muller's method, an be obtained by tting
a quadrati polynomial at three points. In outline, the iteration pro eeds
as follows. The input is three points xk , xk 1 , xk 2 , and the orresponding
fun tion values.
1. Find a quadrati polynomial g(x) su h that g(xi ) = f (xi ), (i = k; k
1; k 2).
2. Let xk+1 be the zero of g that lies nearest xk .
x1 x3 x2
This way to x4.
Figure 4.2. A horrible example.
It is a worthwhile exer ise to work out the details.

20. Muller's method has the advantage that it an produ e omplex iterates
from real starting values. This feature is not shared by either Newton's method
or the se ant method.
The linear-fra tional method
21. Fun tions like the one pi tured in Figure 4.2 ome up o asionally, and
they are di ult to solve. The gure shows the ourse of the se ant method
starting from a bra ket [x1 ; x2 . The third iterate x3 joins x2 to the right of
the zero, and be ause the fun tion is at there, x4 is large and negative.
22. The trouble with the se ant method in this ase is that a straight line is
not a good approximation to a fun tion that has a verti al asymptote, followed
by a zero and then a horizontal asymptote. On the other hand, the fun tion
x a
g(x) =
bx
has a verti al asymptote at x = b , a zero at x = a, and a horizontal asymptote
at y = b 1 and therefore should provide a better approximation.
Sin e there are three free parameters in the fun tion g, it is determined
by three points. Thus a three-point interpolatory method similar to Muller's
method an be based on linear-fra tional fun tions.
23. In implementing the linear-fra tional method it is easy to get lost in the
details and end up with an inde ipherable mess. My rst en ounter with the
method was as a single fortran statement that extended over several ards
(Yes, ards!), and I was awed by the ingenuity of the programmer. It was
only some years later that I learned better. By deriving the results we need in
simple steps, we an ee tively write a little program as we go along. Here is
how it is done.
24. Most low-order interpolation problems are simplied by shifting the origin.
In parti ular we take yi = xi xk (i = k; k 1; k 2) and determine a, b, and
so that
y a
g(y) =
by
satises
f (xi) = g(yi ); i = k; k 1; k 2;
or equivalently
yi a = f (xi )(byi ); i = k; k 1; k 2: (4.11)
The fun tion g is zero when yk+1 = a, and the next point is given by
xk+1 = xk + a:
25. Sin e at any one time there are only three points, there is no need to keep
the index k around. Thus we start with three points x0, x1, x2, and their
orresponding fun tion values f0, f1, f2. We begin by setting
y0 = x0 x2;
y1 = x1 x2:
From (4.11) we have

y0 a = f0(b y0 );
y1 a = f1(b y1 );
and
a = f2 : (4.12)
If we add this last equation to the pro eeding two we get
y0 = fy0 b + df0 ;
y1 = fy1 b + df1 ;
(4.13)
where
fy0 = f0 y0;
fy1 = f1 y1;
and
df0 = f2 f0;
df1 = f2 f1:
The equations (4.13) an be solved for by Cramer's rule:
fy0 y1 fy1 y0
= :
fy0 df1 fy1 df0
Thus from (4.12) the next iterate is

x3 = x2 + f2 :
26. Be ause we have hosen our origin arefully and have taken are to dene
appropriate intermediate variables, the above development leads dire tly to
the following simple program. The input is the three points x0, x1, x2, and
their orresponding fun tion values f0, f1, f2. The output is the next iterate
x3.
y0 = x0 - x2;
y1 = x1 - x2;
fy0 = f0*y0;
fy1 = f1*y1;
df0 = f2 - f0;
df1 = f2 - f1;
= (fy0*y1-fy1*y0)/(fy0*df1-fy1*df0);
x3 = x2 + f2* ;
Le ture 5
Nonlinear Equations
A Hybrid Method
Errors, A ura y, and Condition Numbers
A hybrid method
1. The se ant method has the advantage that it onverges swiftly and requires
only one fun tion evaluation per iteration. It has the disadvantage that it
an blow up in your fa e. This an happen when the fun tion is very at so
that f 0(x) is small ompared with f (x) (see Figure 4.2). Newton's method is
also sus eptible to this kind of failure; however, the se ant method an fail in
another way that is uniquely its own.
2. The problem is that in pra ti e the fun tion f will be evaluated with error.
Spe i ally, the program that evaluates f at the point x will return not f (x)
but f~(x) = f (x) + e(x), where e(x) is an unknown error. As long as f (x) is
large ompared to e(x), this error will have little ee t on the ourse of the
iteration. However, as the iteration approa hes x , e(x) may be ome larger
than f (x). Then the approximation to f 0 that is used in the se ant method
will have the value
[f (xk ) f (xk 1) + [e(xk ) e(xk 1 ) :
xk xk 1
Sin e the terms in e dominate those in f , the value of this approximate deriva-
tive will be unpredi table. It may have the wrong sign, in whi h ase the
se ant method may move away from x . It may be very small ompared to
f~(xk ), in whi h ase the iteration will take a wild jump. Thus, if the fun -
tion is omputed with error, the se ant method may behave errati ally in the
neighborhood of the zero it is supposed to nd.
3. We are now going to des ribe a wonderful ombination of the se ant
method and interval bise tion.5 The idea is very simple. At any stage of
the iteration we work with three points a, b, and . The points a and b are
the points from whi h the next se ant approximation will be omputed; that
is, they orrespond to the points xk and xk 1 . The points b and form a
proper bra ket for the zero. If the se ant method produ es an undesirable
approximation, we take the midpoint of the bra ket as our next iterate. In
5
The following presentation owes mu h to Jim Wilkinson's elegant te hni al report \Two
Algorithms Based on Su essive Linear Interpolation," Computer S ien e, Stanford Univer-
sity, TR CS-60, 1967.
37
this way the speed of the se ant method is ombined with the se urity of the
interval bise tion method. We will now ll in the details.
4. Let fa, fb, and f denote the values of the fun tion at a, b, and . These
fun tion values are required to satisfy
1: fa; fb; f 6= 0;
2: sign(fb) 6= sign(f ); (5.1)
3: jfbj jf j:
At the beginning of the algorithm the user will be required to furnish points
b and = a satisfying the rst two of these onditions. The user must also
provide a onvergen e riterion eps. When the algorithm is nished, the bra k-
eting points b and will satisfy j bj eps.
5. The iterations take pla e in an endless while loop, whi h the program leaves
upon onvergen e. Although the user must see that the rst two onditions
in (5.1) are satised, the program an take are of the third ondition, sin e
it has to anyway for subsequent iterations. In parti ular, if jf j < jfbj, we
inter hange b and . In this ase, a and b may no longer be a pair of su essive
se ant iterates, and therefore we set a equal to .
while(1){
if (abs(f ) < abs(fb))
{
t = ; = b; b = t; (5.2)
t = f ; f = fb; fb = t;
a = ; fa = f ;
}
6. We now test for onvergen e, leaving the loop if the onvergen e riterion
is met.
if (abs(b- ) <= eps)
break;
7. The rst step of the iteration is to ompute the se ant step s at the points
a and b and also the midpoint m of b and . One of these is to be ome our
next iterate. Sin e jfbj jf j, it is natural to expe t that x will be nearer to
b than , and of ourse it should lie in the bra ket. Thus if s lies between b
and m, then the next iterate will be s; otherwise it will be m.
8. Computing the next iterate is a matter of some deli a y, sin e we annot
say a priori whether b is to the left or right of . It is easiest to ast the tests
in terms of the dieren es ds = s b and m = m b. The following ode does
the tri k. When it is nished, dd has been omputed so that the next iterate
is b + dd. Note the test to prevent division by zero in the se ant step.
dm = ( -b)/2;
df = (fa-fb);
if (df == 0)
ds = dm;
else
ds = -fb*(a-b)/df;
if (sign(ds)!=sign(dm) || abs(ds) > abs(dm))
dd = dm;
else
dd = ds;
9. At this point we make a further adjustment to dd. The explanation is best

left for later (x5.16).
if (abs(dd) < eps)
dd = 0.5*sign(dm)*eps;
10. The next step is to form the new iterate | all it d | and evaluate the
fun tion there.
d = b + dd;
fd = f(d);
11. We must now rename our variables in su h a way that the onditions
of (5.1) are satised. We take are of the ondition that fd be nonzero by
returning if it is zero.
if (fd == 0){
b = = d; fb = f = fd;
break;
}
12. Before taking are of the se ond ondition in (5.1), we make a provisional
assignment of new values to a, b, and .
a = b; b = d;
fa = fb; fb = fd;
13. The se ond ondition in (5.1) says that b and form a bra ket for x .
If the new values fail to do so, the ure is to repla e by the old value of b.
The reasoning is as follows. The old value of b has a dierent sign than the
old value of . The new value of b has the same sign as the old value of .
Consequently, the repla ement results in a new value of that has a dierent
sign than the new value of b.
In making the substitution, it is important to remember that the old value
of b is now ontained in a.
c
a b
Figure 5.1. A problem with .
if (sign(fb) == sign(f )){

= a; f = fa;
}
14. The third ondition in (5.1) is handled at the top of the loop; see (5.2).
15. Finally, we return after leaving the while loop.
}
return;
16. To explain the adjustment of dd in x5.9, onsider the graph in Figure 5.1.
Here d is always on the side of x that is opposite , and the value of
is not hanged by the iteration. This means that although b is onverging
superlinearly to x , the length of the bra ket onverges to a number that is
greater than zero | presumably mu h greater than eps. Thus the algorithm
annot onverge until its errati asymptoti behavior for es some bise tion
steps.
The ure for this problem lies in the extra ode introdu ed in x5.9. If the
step size dd is less than eps in absolute value, it is for ed to have magnitude
0.5*eps. This will usually be su ient to push s a ross the zero to the same
side as , whi h insures that the next bra ket will be of length less than eps |
just what is needed to meet the onvergen e riterion.
Errors, a ura y, and ondition numbers
17. We have already observed in x5.2 that when we attempt to evaluate the
fun tion f at a point x the value will not be exa t. Instead we will get a
perturbed value
f~(x) = f (x) + e(x):
The error e(x) an ome from many sour es. It may be due to rounding error in
the evaluation of the fun tion, in whi h ase it will behave irregularly. On the
other hand, it may be dominated by approximations made in the evaluation of
the fun tion. For example, an integral in the denition of the fun tion may have
been evaluated numeri ally. Su h errors are often quite smooth. But whether
or not the error is irregular or smooth, it is unknown and has an ee t on the
zeros of f that annot be predi ted. However, if we know something about the
size of the error, we an say something about how a urately we an determine
a parti ular zero.
18. Let x be a zero of f , and suppose we have a bound on the size of the
error; i.e.,
je(x)j :
If x1 is a point for whi h f (x1 ) > , then
f~(x1 ) = f (x1 ) + e(x1 ) f (x1 ) > 0;
i.e., f~(x1 ) has the same sign as f (x1 ). Similarly, if f (x2 ) < , then f~(x2 )
is negative along with f (x2 ), and by the intermediate value theorem f has a
zero between x1 and x2 . Thus, whenever jf (x)j > , the values of f~(x) say
something about the lo ation of the zero in spite of the error.
To put the point another way, let [a; b be the largest interval about x for
whi h
x 2 [a; b =) f (x) :
As long as we are outside that interval, the value of f~(x) provides useful in-
formation about the lo ation of the zero. However, inside the interval [a; b
the value of f~(x) tells us nothing, sin e it ould be positive, negative, or zero,
regardless of the sign of f (x).
19. The interval [a; b is an interval of un ertainty for the zero x : we know
that x is in it, but there is no point in trying to pin it down further. Thus,
a good algorithm will return a point in [a; b, but we should not expe t it to
provide any further a ura y. Algorithms that have this property are alled
stable algorithms.
20. The size of the interval of un ertainty varies from problem to problem. If
the interval is small, we say that the problem is well onditioned. Thus, a stable
algorithm will solve a well- onditioned problem a urately. If the interval is
large, the problem is ill onditioned. No algorithm, stable or otherwise, an be
expe ted to return an a urate solution to an ill- onditioned problem. Only if
we are willing to go to extra eort, like redu ing the error e(x), an we obtain
a more a urate solution.
21. A number that quanties the degree of ill- onditioning of a problem is
alled a ondition number. To derive a ondition number for our problem,
let us ompute the half-width of the interval of un ertainty [a; b under the
assumption that
f 0 (x ) 6= 0:
eps
a a
b b
eps
Figure 5.2. Ill- and well- onditioned roots.
From the approximation

f (x)
= f (x ) + f 0 (x )(x x ) = f 0 (x )(x x )
it follows that jf (x)j when jf 0 (x )(x x )j <
. Hen e,
jx x j < jf 0(x )j ;

or equivalently

[a; b
= x jf 0(x )j ; x + jf 0(x )j :

Thus the number 1=jf 0 (x )j tells us how mu h the error is magnied in
the solution and serves as a ondition number. A zero with a large derivative
is well onditioned: one with a small derivative is ill onditioned. Figure 5.2
illustrates these fa ts.
Floating-Point Arithmeti
43
Le ture 6
Floating-Point Numbers
Over ow and Under ow
Rounding Error
Floating-point numbers
1. Anyone who has worked with a s ienti hand al ulator is familiar with
oating-point numbers. Right now the display of my al ulator ontains the
hara ters
2:597 03 (6.1)
whi h represent the number
2:597 10 3 :
The hief advantage of oating-point representation is that it an en ompass
numbers of vastly diering magnitudes. For example, if we onne ourselves
to six digits with ve after the de imal point, then the largest number we an
represent is 9:99999 = 10, and the smallest is 0:00001 = 10 5 . On the other
hand, if we allo ate two of those six digits to represent a power of ten, then
we an represent numbers ranging between 10 99 and 1099 . The pri e to be
paid is that these oating-point numbers have only four gures of a ura y, as
opposed to as mu h as six for the xed-point numbers.
2. A base- oating-point number onsists of a fra tion f ontaining the
signi ant gures of the number and exponent e ontaining its s ale.6 The
value of the number is
f e:
3. A oating-point number a = f e is said to be normalized if
1
f < 1:
In other words, a is normalized if the base- representation of its fra tion has
the form
f = 0:x1 x2 : : : ;
where x1 6= 0. Most omputers work hie y with normalized numbers, though
you may en ounter unnormalized numbers in spe ial ir umstan es.
The term \normalized" must be taken in ontext. For example, by our
denition the number (6.1) from my al ulator is not normalized, while the
6
The fra tion is also alled the mantissa and the exponent the hara teristi .
45
s exponent fraction
01 9 32
Figure 6.1. A oating-point word.
number 0:2597 10 2 is. This does not mean that there is something wrong
with my al ulator | just that my al ulator, like most, uses a dierent nor-
malization in whi h 1 f < 10.
4. Three bases for oating-point numbers are in ommon use.
name base where found
binary 2 most omputers
de imal 10 most hand al ulators
hex 16 IBM mainframes and lones
In most omputers, binary is the preferred base be ause, among other things,
it fully uses the bits of the fra tion. For example, the binary representation
of the fra tion of the hexade imal number one is :00010000. . . . Thus, this
representation wastes the three leading bits to store quantities that are known
to be zero.
5. Even binary oating-point systems dier, something that in the past has
made it di ult to produ e portable mathemati al software. Fortunately, the
IEEE has proposed a widely a epted standard, whi h most PCs and work-
stations use. Unfortunately, some manufa turers with an investment in their
own oating-point systems have not swit hed. No doubt they will eventually
ome around, espe ially sin e the people who produ e mathemati al software
are in reasingly relu tant to jury-rig their programs to onform to inferior
systems.
6. Figure 6.1 shows the binary representation of a 32-bit IEEE standard
oating-point word. One bit is devoted to the sign of the fra tion, eight bits to
the exponent, and twenty-three bits to the fra tion. This format an represent
numbers ranging in size from roughly 10 38 to 1038 . Its pre ision is about
seven signi ant de imal digits. A uriosity of the system is that the leading
bit of a normalized number is not represented, sin e it is known to be one.
The shortest oating-point word in a system is usually alled a single pre i-
sion number. Double pre ision numbers are twi e as long. The double pre ision
IEEE standard devotes one bit to the sign of the fra tion, eleven bits to the
exponent, and fty-two bits to the fra tion. This format an represent num-
bers ranging from roughly 10 307 to 10307 to about fteen signi ant gures.
Some implementations provide a 128-bit oating-point word, alled a quadruple
pre ision number, or quad for short.
6. Floating-Point Arithmeti 47
Over ow and under ow

7. Sin e the set of real numbers is innite, they annot all be represented by
a word of nite length. For oating-point numbers, this limitation omes in
two avors. First, the range of the exponent is limited; se ond, the fra tion
an only represent a nite number of the numbers between 1 and one. The
rst of these limitations leads to the phenomena of over ow and under ow,
olle tively known as exponent ex eptions ; the se ond leads to rounding error.
We will begin with exponent ex eptions.
8. Whenever an arithmeti operation produ es a number with an exponent
that is too large, the result is said to have over owed. For example, in a de imal
oating-point system with a two-digit exponent, the attempt to square 1060
will result in over ow.
Similarly an arithmeti operation that produ es an exponent that is too
small is said to have under owed. The attempt to square 10 60 will result in
under ow in a de imal oating-point system with a two-digit exponent.
9. Over ow is usually a fatal error whi h will ause many systems to stop with
an error message. In IEEE arithmeti , the over ow will produ e a spe ial word
standing for innity, and if the exe ution is ontinued it will propagate. An
under ow is usually set to zero and the program ontinues to run. The reason
is that with proper s aling, over ows an often be eliminated at the ost of
generating harmless under ows.
10. To illustrate this point, onsider the problem of omputing
p
= a2 + b2 ;
where a = 1060 and b = 1. On a four-digit de imal omputer the orre tly
rounded answer is 1060 . However, if the exponent has only two digits, then the
omputation will over ow while omputing a2 .
The ure is to ompute in the form
s

a 2 b 2
=s + s ;
s
where s is a suitable s aling fa tor, say
s = maxfjaj; jbjg = 1060 :
With this value of s, we must ompute
s
12 + 10160 :
2
= 10 60
Now when (1=1060 )2 is squared, it under ows and is set to zero. This does no
harm, be ause 10 120 is insigni ant ompared with the number one, to whi h
it is added. Continuing the omputation, we obtain = 1060 , whi h is what

we wanted.
Although the above example is a simple al ulation, it illustrates a te h-
nique of wide appli ability| one that should be used more often.
Rounding error
11. The se ond limitation that a xed word length imposes on oating-point
arithmeti is that most numbers annot be represented exa tly. For example,
the square root of seven is
2:6457513 : : : :
On a ve-digit de imal omputer, the digits beyond the fourth must be dis-
arded. There are two traditional ways of doing this: onventional rounding,
whi h yields
2:6458;
and hopping (also alled trun ation ), whi h yields
2:6457:
12. It is instru tive to ompute a bound on the relative error that rounding
introdu es. The pro ess is su iently well illustrated by rounding to ve digits.
Thus onsider the number
a = X:XXXXY;
whi h is rounded to
b = X:XXXZ:
Let us say we round up if Y 5 and round down if Y < 5. Then it is easy to
see that
jb aj 5 10 5 :
On the other hand, the leading digit of a is assumed nonzero, and hen e jaj 1.
It follows that
jb aj 5 10 5 = 1 10 4 :
jaj 2
More generally, rounding a to t de imal digits gives a number b satisfying
jb aj = 1 10 t+1 :
jaj 2
13. The same argument an be used to show that when a is hopped it gives
a number b satisfying
jb aj = 10 t+1 :
jaj
This bound is twi e as large as the bound for rounding, as might be expe ted.
However, as we shall see later, there are other, more ompelling reasons for
preferring rounding to hopping.
14. The bounds for t-digit binary numbers are similar:
jb aj = 2 t rounding,
jaj 2 t+1 hopping.
15. These bounds an be put in a form that is more useful for rounding-
error analysis. Let b = (a) denote the result of rounding or hopping a on a
parti ular ma hine, and let M denote the upper bound on the relative error.
If we set
b a
= ;
a
then b = a(1 + ) and jj M . In other words,
(a) = a(1 + ); jj :
M (6.2)
16. The number M in (6.2) is hara teristi of the oating-point arithmeti

of the ma hine in question and is alled the rounding unit for the ma hine
(also alled ma hine epsilon ). In some omputations we may need to know
the rounding unit. Many programming languages have library routines that
return ma hine hara teristi s, su h as the rounding unit or the exponent
range. However, it is also possible to ompute a reasonable approximation.
The key observation is that the rounding unit is at most slightly greater
than the largest number x for whi h the omputed value of 1 + x is equal
to one. For example, if in six-digit binary arithmeti with rounding we take
x = 2 7 , then the rounded value of 1 + x is exa tly one. Thus, we an obtain
an approximation to M by starting with a large value of x and diminishing it
until 1 + x evaluates to one. The following fragment does just that.
x = 1;
while (1+x != 1)
x = x/2;
Unfortunately, this ode an be undone by well-meaning but unsophisti-
ated ompilers that optimize the test 1+x != 1 to x != 0 or perform the test
in extended pre ision registers.
Floating-point arithmeti
17. Most omputers provide instru tions to add, subtra t, multiply, and divide
oating-point numbers, and those that do not usually have software to do the
same thing. In general a ombination of oating-point numbers will not be
representable as a oating-point number of the same size. For example, the

produ t of two ve-digit numbers will generally require ten digits for its rep-
resentation. Thus, the result of a oating-point operation an be represented
only approximately.
18. Ideally, the result of a oating-point operation should be the exa t result
orre tly rounded. More pre isely, if (a b) denotes the result of omputing
a b in oating-point and M is the rounding unit, then we would like to have
( f. x6.15)
(a b) = (a b)(1 + ); jj M:
Provided no exponent ex eptions o ur, the IEEE standard arithmeti satises
this bound. So do most other oating-point systems, at least when = ; .
However, some systems an return a dieren e with a large relative error, and
it is instru tive to see how this an ome about.
19. Consider the omputation of the dieren e 1 0:999999 in six-digit de imal
arithmeti . The rst step is to align the operands thus:
1:000000
0:999999
If the omputation is done in seven -digit registers, the omputer an go on to
al ulate
1:000000
0:999999
0:000001
and normalize the result to the orre t answer: :100000 10 6 . However, if
the omputer has only six -digit registers, the trailing 9 will be lost during the
alignment. The resulting omputation will pro eed as follows
1:00000
0:99999 (6.3)
0:00001
giving a normalized answer of :100000 10 5 . In this ase, the omputed answer
has a relative error of ten!
20. The high relative error in the dieren e is due to the absen e of an extra
guard digit in the omputation. Unfortunately some omputer manufa turers
fail to in lude a guard digit, and their oating-point systems are an annoyan e
to people who have to hold to high standards, e.g., designers of library routines
for spe ial fun tions. However, the vast majority of people never noti e the
absen e of a guard digit, and it is instru tive to ask why.
21. The erroneous answer :100000 10 5 omputed in (6.3) is the dieren e
between 1 and 0:99999 instead of 1 and 0:999999. Now the relative error in
0:99999 as an approximation to 0:999999 is about 9 10 6 , whi h is of the same

order of magnitude as the rounding unit M . This means that the omputed
result ould have been obtained by rst making a very slight perturbation in
the arguments and then performing the subtra tion exa tly. We an express
this mathemati ally by saying that
(a b) = a(1 + a ) b(1 + b ); ja j; jb j :
M (6.4)
(Note that we may have to adjust the rounding unit upward a little for this
bound to hold.)
22. Equation (6.4) is an example of a ba kward error analysis. Instead of
trying to predi t the error in the omputation, we have shown that whatever
the error it ould have ome from very slight perturbations in the original data.
The power of the method lies in the fa t that data usually omes equipped with
errors of its own, errors that are far larger than the rounding unit. When put
beside these errors, the little perturbations in (6.4) are insigni ant. If the
omputation gives an unsatisfa tory result, it is due to ill- onditioning in the
problem itself, not to the omputation.
We will return to this point later. But rst an example of a nontrivial
omputation.
Le ture 7
Computing Sums
Ba kward Error Analysis
Perturbation Analysis
Cheap and Chippy Chopping
Computing sums
1. The equation (a + b) = (a + b)(1 + ) (jj M) is the simplest example
of a rounding-error analysis, and its simplest generalization is to analyze the
omputation of the sum
sn = (x1 + x2 + + xn ):
There is a slight ambiguity in this problem, sin e we have not spe ied the
order of summation. For deniteness assume that the x's are summed left to
right.
2. The tedious part of the analysis is the repeated appli ation of the error
bounds. Let
si = (x1 + x2 + + xi ):
Then
s2 = (x1 + x2 ) = (x1 + x2 )(1 + 1 ) = x1 (1 + 1 ) + x2 (1 + 1 );
where j1 j M. Similarly,
s3 = (s2 + x3 ) = (s2 + x3 )(1 + 2 )
= x1 (1 + 1 )(1 + 2 ) +
x2 (1 + 1 )(1 + 2 ) +
x3 (1 + 2 ):
Continuing in this way, we nd that
sn = (sn 1 + xn) = (sn 1 + xn )(1 + n 1 )
= x1 (1 + 1 )(1 + 2 ) (1 + n 1 ) +
x2 (1 + 1 )(1 + 2 ) (1 + n 1 ) +
x3 (1 + 2 ) (1 + n 1 ) + (7.1)

xn 1 (1 + n 2 )(1 + n 1 ) +
xn (1 + n 1 );
where ji j M (i = 1; 2; : : : ; n 1).
53
3. The expression (7.1) is not very informative, and it will help to introdu e
some notation. Let the quantities i be dened by
1 + 1 = (1 + 1 )(1 + 2 ) (1 + n 1 );
1 + 2 = (1 + 1 )(1 + 2 ) (1 + n 1 );
1 + 3 = (1 + 2 ) (1 + n 1 );

1 + n 1 = (1 + n 2 )(1 + n 1 );
1 + n = (1 + n 1 ):
Then
sn = x1 (1+ 1 )+ x2 (1+ 2 )+ x3(1+ 3 )+ + xn 1(1+ n 1 )+ xn(1+ n ): (7.2)
4. The number 1 + i is the produ t of numbers 1 + j that are very near one.
Thus we should expe t that 1 + i is itself near one. To get an idea of how
near, onsider the produ t
1 + n 1 = (1 + n 2 )(1 + n 1 ) = 1 + (n 2 + n 1 ) + n 2 n 1 : (7.3)
Now jn 2 + n 1 j 2M and jn 2 n 1 j 2M. If, say, M = 10 15 , then
2M = 2 10 15 while 2M = 10 30 . Thus the third term on the right-hand side
of (7.3) is insigni ant ompared to the se ond term and an be ignored. If we
ignore it, we get
n 1 = n 2 + n 1
or
jn 1j < jn 2j + jn 1 j 2M:
In general,
j1 j < (n 1)M ; (7.4)
j j < (n i + 1) ; i = 2; 3; : : : ; n:
i M
5. The approximate bounds (7.4) are good enough for government work, but
there are fastidious individuals who will insist on rigorous inequalities. For
them we quote the following result.
If nM 0:1 and i M (i = 1; 2; : : : ; n), then
(1 + 1 )(1 + 2 ) (1 + n ) = 1 + ;
where
1:06nM :
Thus if we set
0M = 1:06M ;
then the approximate bounds (7.4) be ome quite rigorously
j j (n 1)0 ;
1 M
(7.5)
jij (n i + 1)0 ;M i = 2; 3; : : : ; n:
The quantity 0M is sometimes alled the adjusted rounding unit.
6. The requirement that nM 0:1 is a restri tion on the size of n, and it is
reasonable to ask if it is one we need to worry about. To get some idea of what
it means, suppose that M = 10 15 . Then for this inequality to fail we must
have n 1014 . If we start summing numbers on a omputer that an add at
the rate of 1se = 10 6 se , then the time required to sum 1014 numbers is
108 se = 3:2 years:
In other words, don't hold your breath waiting for nM to be ome greater than
0:1.
Ba kward error analysis
7. The expression
sn = x1 (1+ 1 )+ x2(1+ 2 )+ x3 (1+ 3 )+ + xn 1(1+ n 1 )+ xn(1+ n); (7.6)
along with the bounds on the i , is alled a ba kward error analysis be ause the
rounding errors made in the ourse of the omputation are proje ted ba kward
onto the original data. An algorithm that has su h an analysis is alled stable
(or sometimes ba kward stable ).
We have already mentioned in onne tion with the sum of two numbers
(x6.22) that stability in the ba kward sense is a powerful property. Usually the
ba kward errors will be very small ompared to errors that are already in the
input. In that ase it is the latter errors that are responsible for the ina ura ies
in the answer, not the rounding errors introdu ed by the algorithm.
8. To emphasize this point, suppose you are a numeri al analyst and are
approa hed by a ertain Dr. Xyz who has been adding up some numbers.
Xyz: I've been trying to ompute the sum of ten numbers, and the answers I
get are nonsense, at least from a s ienti viewpoint. I wonder if the omputer
is fouling me up.
You: Well it ertainly has happened before. What pre ision were you using?
Xyz: Double. I understand that it is about fteen de imal digits.
You: Quite right. Tell me, how a urately do you know the numbers you were
summing?
Xyz: Pretty well, onsidering that they are experimental data. About four
digits.
You: Then it's not the omputer that is ausing your poor results.
Xyz: How an you say that without even looking at the numbers? Some sort
of magi ?
You: Not at all. But rst let me ask another question.
Xyz: Shoot.
You: Suppose I took your numbers and twiddled them in the sixth pla e.
Could you tell the dieren e?
Xyz: Of ourse not. I already told you that we only know them to four pla es.
You: Then what would you say if I told you that the errors made by the
omputer ould be a ounted for by twiddling your data in the fourteenth pla e
and then performing the omputations exa tly?
Xyz: Well, I nd it hard to believe. But supposing it's true, you're right. It's
my data that's the problem, not the omputer.
9. At this point you might be tempted to bow out. Don't. Dr. Xyz wants to
know more.
Xyz: But what went wrong? Why are my results meaningless?
You: Tell me, how big are your numbers?
Xyz: Oh, about a million.
You: And what is the size of your answer?
Xyz: About one.
You: And the answers you ompute are at least an order of magnitude too
large.
Xyz: How did you know that. Are you a mind reader?
You: Common sense, really. You have to an el ve digits to get your answer.
Now if you knew your numbers to six or more pla es, you would get one or more
a urate digits in your answer. Sin e you know only four digits, the lower two
digits are garbage and won't an el. You'll get a number in the tens or greater
instead of a number near one.
Xyz: What you say makes sense. But does that mean I have to remeasure my
numbers to six or more gures to get what I want?
You: That's about it.
Xyz: Well I suppose I should thank you. But under the ir umstan es, it's not
easy.
You: That's OK. It omes with the territory.
10. The above dialogue is arti ial in three respe ts. The problem is too
simple to be hara teristi of real life, and no s ientist would be as naive as
Dr. Xyz. Moreover, people don't roll over and play dead like Dr. Xyz: they
require a lot of onvin ing. But the dialogue illustrates two important points.
The rst point is that a ba kward error analysis is a useful tool for removing
the omputer as a suspe t when something goes wrong. The se ond is that
ba kward stability is seldom enough. We want to know what went wrong. What
is there in the problem that is ausing di ulties? To use the terminology we
introdu ed for zeros of fun tions: When is the problem ill- onditioned?
Perturbation analysis
11. To answer the question just posed, it is a good idea to drop any onsid-
erations of rounding error and ask in general what ee ts known errors in the
xi will have on the sum
= x1 + x2 + + xn :
Spe i ally, we will suppose that
x~i = xi (1 + i ); jij ; (7.7)
and look for a bound on the error in the sum
~ = x~1 + x~2 + + x~n :
Su h a pro edure is alled a perturbation analysis be ause it assesses the ee ts
of perturbations in the arguments of a fun tion on the value of the fun tion.
The analysis is easy enough to do. We have
j~ j jx jj j + jx jj j + + jxnjjnj:
1 1 2 2
From (7.7), we obtain the following bound on the absolute error:

j~ j (jx j + jx j + + jxnj):
1 2
We an now obtain a bound on the relative error by dividing by jj. Spe i ally,
if we set
= 1
jx j + jx2 j + + jxnj ;
jx + x + + x j
1 2 n
then
j~ j : (7.8)
jj
The number , whi h is never less than one, tells how the rounding errors made
in the ourse of the omputation are magnied in the result. Thus it serves
as a ondition number for the problem. (Take a moment to look ba k at the
dis ussion of ondition numbers in x5.21.)
12. In Dr. Xyz's problem, the experimental errors in the fourth pla e an be
represented by = 10 4 , in whi h ase the bound be omes
relative error = 10 4 :
Sin e there were ten x's of size about 1,000,000, while the sum of the x's was
about one, we have = 107 , and the bound says that we an expe t no
a ura y in the result, regardless of any additional rounding errors.
13. We an also apply the perturbation analysis to bound the ee ts of round-
ing errors on the sum. In this ase the errors i orrespond to the errors i in
(7.2). Thus from (7.5) we have
jij (n 1)0 :
M
It then follows from (7.8) that

jsn j (n 1)0 ;
jj M
where as usual
=
jx j + jx j + + jxnj
1 2
jx + x + + xnj
1 2
is the ondition number for the sum.

Cheap and hippy hopping
14. When the terms xi in the sum are all positive (or all negative), the on-
dition number is one; i.e., the problem is perfe tly onditioned. In this ase,
the bound on the relative error due to rounding redu es to
jsn j = (n 1)0 :
jj M
This inequality predi ts that rounding error will a umulate slowly as terms
are added to the sum. However, the analysis on whi h the bound was based
assumes that the worst happens all the time, and one might expe t that the
fa tor n 1 is an overestimate.
In fa t, if we sum positive numbers with rounded arithmeti , the fa tor
will be an overestimate, sin e the individual rounding errors will be positive or
negative at random and will tend to an el one other. On the other hand, if we
are summing positive numbers with hopped arithmeti , the errors will tend
to be in the same dire tion (downward), and they will reinfor e one another.
In this ase the fa tor n 1 is realisti .
15. We don't have to resort to a lengthy analysis to see how this phenomenon
omes about. Instead, let's imagine that we take two six-digit numbers, and
do two things with them. First, we round the numbers to ve digits and sum
them exa tly; se ond, we hop the numbers to ve digits and on e again sum
them exa tly. The following table shows what happens.
number = rounded + error = hopped + error
1374.8 = 1375 0.2 = 1374 + 0.8
3856.4 = 3856 + 0.4 = 3856 + 0.4
total 5231.2 = 5231 + 0.2 = 5230 + 1.2
As an be seen from the table, the errors made in rounding have opposite
signs and an el ea h other in the sum to yield a small error of 0:2. With
hopping, however, the errors have the same sign and reinfor e ea h other to
yield a larger error of 1:2. Although we have summed only two numbers to
keep things simple, the errors in sums with more terms tend to behave in the
same way: errors from rounding tend to an el, while errors from hopping
reinfor e. Thus rounding is to be preferred in an algorithm in whi h it may be
ne essary to sum numbers all having the same sign.
16. The above example makes it lear that you annot learn everything about a
oating-point system by studying the bounds for its arithmeti . In the bounds,
the dieren e between rounding and hopping is a simple fa tor of two, yet
when it omes to sums of positive numbers the dieren e in the two arithmeti s
is a matter of the a umulation of errors. In parti ular, the fa tor n 1 in
the error bound (7.8) re e ts how the error may grow for hopped arithmeti ,
while it is unrealisti forprounded arithmeti . (On statisti al grounds it an be
argued that the fa tor n is realisti for rounded arithmeti .)
To put things in a dierent light, binary, hopped arithmeti has the same
bound as binary, rounded arithmeti with one less bit. Yet on the basis of
what we have seen, we would be glad to sa ri e the bit to get the rounding.
Le ture 8
Can ellation
The Quadrati Equation
That Fatal Bit of Rounding Error
Envoi
Can ellation
1. Many al ulations seem to go well until they have to form the dieren e
between two nearly equal numbers. For example, if we attempt to al ulate
the sum
37654 + 25:874 37679 = 0:874
in ve-digit oating-point, we get
(37654 + 25:874) = 37680
and
(37680 37679) = 1:
This result does not agree with the true sum to even one signi ant gure.
2. The usual explanation of what went wrong is to say that we an elled most
of the signi ant gures in the al ulation of (37860 37679) and therefore
the result annot be expe ted to be a urate. Now this is true as far as it
goes, but it onveys the mistaken impression that the an ellation aused the
ina ura y. However, if you look losely, you will see that no error at all was
made in al ulating (37860 37679). Thus the sour e of the problem must
lie elsewhere, and the an ellation simply revealed that the omputation was
in trouble.
In fa t, the sour e of the trouble is in the addition that pre eded the an el-
lation. Here we omputed (37654+25:874) = 37680. Now this omputation is
the same as if we had repla ed 25:874 by 26 and omputed 37654 + 26 exa tly.
In other words, this omputation is equivalent to throwing out the three digits
0:874 in the number 25:874. Sin e the answer onsists of just these three digits,
it is no wonder that the nal omputed result is wildly ina urate. What has
killed us is not the an ellation but the loss of important information earlier
in the omputation. The an ellation itself is merely a death erti ate.
The quadrati equation
3. To explore the matter further, let us onsider the problem of solving the
quadrati equation
x2 bx + = 0;
61
whose roots are given by the quadrati formula

p
b b2 4
r= :
2
If we take
b = 3:6778 and = 0:0020798;
then the roots are
r1 = 3:67723441190 : : : and r2 = 0:00056558809 : : : :
4. An attempt to al ulate the smallest root in ve-digit arithmeti gives the

following sequen e of operations.
1: b2 : 1:3526 10+1
2: 4 : 8:3192 10 3
3: b2 4
p : 1:3518 10+1
4: b 2 p 4 : 3:6767 10+0 (8.1)
5: b pb2 4 : 1:1000 10 3
6: (b b2 4 )=2 : 5:5000 10 4
The omputed value 0:00055000 diers from the true value 0:000565 : : : of the
root in its se ond signi ant gure.
5. A ording to the onventional wisdom on an ellation, the algorithm failed
at step 5, where we an eled three-odd signi ant gures in omputing the
dieren e 3:6778 3:67667. However, from the point of view taken here, the
an ellation only reveals a loss of information that o urred earlier. In this
ase the oending operation is in step 3, where we ompute the dieren e
(13:453 0:0083192) = 13:518:
This al ulation orresponds to repla ing the number 0:0083192 by 0:008 and
performing the al ulation exa tly. This is in turn equivalent to repla ing the
oe ient = 0:0020798 by ~ = 0:002 and performing the al ulation exa tly.
Sin e the oe ient ontains riti al information about r2 , it is no wonder
that the hange auses the omputed value of r2 to be ina urate.
6. Can anything be done to save the algorithm? It depends. If we don't save
after using it in step 3, then the answer is we an do nothing: the numbers
that we have at hand, namely b and the omputed value of b2 4 , simply do
not have the information ne essary to re over an a urate value of r2 . On the
other hand, if we keep around, then we an do something.
7. The rst thing to observe is that there is no problem in al ulating the

largest root r1 , sin e taking the plus sign in the quadrati formula entails no
an ellation. Thus after step 4 in (8.1) we an pro eed as follows.
p
5:0 b + b2 p4 : 7:3545 10+0
6:0 r1 = (b + b2 4 )=2 : 3:6773 10+0
The result agrees with the true value of r1 to almost the last pla e.
To al ulate r2 , we next observe that = r1 r2 , so that

r2 = :
r1
Sin e we have already omputed r1 a urately, we may use this formula to
re over r2 as follows.
7: r2 = =r1 : 5:6558 10 4
The omputed value is as a urate as we an reasonably expe t.

8. Many al ulations in whi h an ellation o urs an be salvaged by rewriting
the formulas. The major ex eptions are intrinsi ally ill- onditioned problems
that are poorly dened by their data. The problem is that the ina ura ies
that were built into the problem will have to reveal themselves one way or
another, so that any attempt to suppress an ellation at one point will likely
introdu e it at another.
For example, the dis riminant b2 4 is equal to (r1 r2)2 , so that an el-
lation in its omputation is an indi ation that the quadrati has nearly equal
roots. Sin e f 0(r1 ) = 2r1 b = r1 r2 , these nearby roots will be ill onditioned
(see x5.21).
That fatal bit of rounding error
9. Consider the behavior of the solution of the dieren e equation
xk+1 = 2:25xk 0:5xk 1 ; (8.2)
where
x1 =
1 and x = 1 : (8.3)
3 2
12
The solution is
41 k ;
xk = k = 1; 2; 3; : : : :
3
Consequently the omputed solution should de rease indenitely, ea h su es-
sive omponent being a fourth of its prede essor.
10. Figure 8.1 ontains a graph of log2 xk as a fun tion of k. Initially this
-5
-10
-15
log2[x(k)]
-20
-25
-30
-35
-40
0 5 10 15 20 25 30 35 40
k
Figure 8.1. Computed solution of xk+1 = 2:25xk 0:5xk 1 .
graph des ends linearly with a slope of 2, as one would expe t of any fun tion
proportional to (1=4)k . However, at k = 20 the graph turns around and begins
to as end with a slope of one. What has gone wrong?
11. The answer is that the dieren e equation (8.2) has two prin ipal solutions:
1 k and 2k :

4
Any solution an be expanded as a linear ombination of these two solutions;
i.e., the most general form of a solution is
k

1
+ 2k :
4
Now in prin iple, the xk dened by (8.2) and (8.3) should have an expansion
in whi h = 0; however, be ause of rounding error, is ee tively nonzero,
though very small. As time goes by, the in uen e of this solution grows until
it dominates. Thus the des ending part of the graph represents the interval
in whi h the ontribution of 2k is negligible, while the as ending portion
represents the interval in whi h 2k dominates.
12. It is possible to give a formal rounding-error analysis of the omputation
of xk . However, it would be tedious, and there is a better way of seeing
what is going on. We simply assume that all the rounding error is made at the
beginning of the al ulation and that the remaining al ulations are performed
exa tly.
Spe i ally, let us assume that errors made in rounding have given us x1
and x2 that satisfy
1
x1 = (4+0 + 2 56 );
3
1
x2 = (4 1 + 2 55 )
3
(note that 2 is the rounding unit for IEEE 64-bit arithmeti ). Then the
56
general solution is
1
xk = (41 k + 2k 57 ):
3
The turnaround point for this solution o urs when
41 k = 2k 57 ;
whi h gives a value of k between nineteen and twenty. Obviously, our simplied
analysis has predi ted the results we a tually observed.
13. All this illustrates a general te hnique of wide appli ability. It frequently
happens that an algorithm has a riti al point at whi h a little bit of rounding
error will ause it to fail later. If you think you know the point, you an
onrm it by rounding at that point but allowing no further rounding errors.
If the algorithm goes bad, you have spotted a weak point, sin e it is unlikely
that the rounding errors you have not made will somehow orre t your fatal
bit of error.
Envoi
14. We have now seen three ways in whi h rounding error an manifest it-
self.
1. Rounding error an a umulate, as it does during the omputation
of a sum. Su h a umulation is slow and is usually important only
for very long al ulations.
2. Rounding error an be revealed by an ellation. The o urren e of
an ellation is invariably an indi ation that something went wrong
earlier in the al ulation. Sometimes the problem an be ured by
hanging the details of the algorithm; however, if the sour e of the
an ellation is an intrinsi ill- onditioning in the problem, then it's
ba k to the drawing board.
3. Rounding error an be magnied by an algorithm until it dominates
the numbers we a tually want to ompute. Again the al ulation
does not have to be lengthy. There are no easy xes for this kind of
problem.
It would be wrong to say that these are the only ways in whi h rounding
error makes itself felt, but they a ount for many of the problems observed in
pra ti e. If you think you have been bitten by rounding error, you ould do
worse than ask if the problem is one of the three listed above.
Linear Equations
67
Le ture 9
Linear Equations
Matri es, Ve tors, and S alars
Operations with Matri es
Rank-One Matri es
Partitioned Matri es
Matri es, ve tors, and s alars

1. An m n matrix A is a re tangular array of numbers of the form
0
a11 a12 a1;n 1 a1n
1
B
B a21 a22 a2;n 1 a2n C
C
A=
B
.. .. .. .. C
:
B
B . . . . C
C
B
am 1;1 an 1;2 am ;n 1 1 am 1;n
C
A
am1 am2 am;n 1 amn

We write A 2 Rmn . If m = n, so that A is square, we say that A is of order
n.
2. The numbers aij are alled the elements of the matrix A. By onvention, the
rst subs ript, i, alled the row index, indi ates the row in whi h the element
lies. The se ond subs ript, j , alled the olumn index, indi ates the olumn in
whi h the element lies. Indexing usually begins at one, whi h makes for minor
problems in the language C, whose arrays begin with an index of zero.
3. An n-ve tor x is an array of the form
0 1
x1
B
x
B 2C
C
x = B ..
B C :
.C
A
xn
We write x 2 Rn . The number n is alled the dimension. The numbers xj are
alled the omponents of x.
4. Note that by onvention, all ve tors are olumn ve tors; that is, their
omponents are arranged in a olumn. Obje ts like (x1 x2 xn ) whose
omponents are arranged in a row are alled row ve tors. We generally write
row ve tors in the form xT (see the denition of the transpose operation below
in x9.18).
5. We will make no distin tions between Rn1 and Rn : it is all the same to
us whether we all an obje t an n 1 matrix or an n-ve tor. Similarly, we will
69
not distinguish between the real numbers R, also alled s alars , and the set of
1-ve tors, and the set of 1 1 matri es.
6. Matri es will be designated by upper- ase Latin or Greek letters, e.g., A,
, et . Ve tors will be designated by lower- ase Latin letters, e.g., x, y, et .
S alars will be designated by lower- ase Latin and Greek letters. Some attempt
will be made to use an asso iated lower- ase letter for the elements of a matrix
or the omponents of a ve tor. Thus the elements of A will be aij or possibly
ij . In parti ular note the asso iation of with x and with y.
Operations with matri es

7. Matri es are more than stati arrays of numbers: they have an algebra. We
will be parti ularly on erned with the following four operations with matri-
es:
1. multipli ation by a s alar,
2. the matrix sum,
3. the matrix produ t,
4. the matrix transpose.
8. Any matrix A an be multiplied by a s alar . The result is the matrix A
dened by
A = (aij ):
9. If A and B have the same dimensions, then their sum is the matrix A + B
dened by
A + B = (aij + bij ):
We express the fa t that A and B have the same dimensions by saying that
their dimensions are onformal for summation.
10. A matrix whose elements are all zero is alled a zero matrix and is written
0 regardless of its dimensions. It is easy to verify that
A + 0 = 0 + A = 0;
so that 0 is an additive identity for matri es.
11. If A is an l m matrix and B is an m n matrix, then the produ t AB
is an l n matrix dened by
m
X

AB = aik bkj :
k=1
9. Linear Equations 71
Note that for the produ t AB to be dened, the number of olumns of A

must be the same as the number of rows of B . In this ase we say that the
dimensions onform for multipli ation.
12. The matrix produ t is so widely used that it is useful to have a re ipe for
doing the bookkeeping. Here is mine. To nd the (2; 3)-element of the matrix
produ t 0 1
a11 a12 a13 a14 Bbb11 b12 b113 b14 b15

0 1
b22 b223 b24 b25 C

a21 a222 a323 a424 C ;
B 1 B 21 C
a31 a32 a33 a34 bb31 b32 b333 b34 b35 C

AB
A
41 b42 b443 b44 b45

pla e your left index nger on a21 and your right index nger on b13 (these
are the elements with the supers ript one). As you do so, say, \Times." Now
move your ngers to a22 and b23 , saying \plus" as you move them and \times"
as you land. Continue in this manner, alternating \plus" and \times." At the
end you will have omputed
a21 b13 + a22 b23 + a23 b33 + a24 b43 ;
whi h is the (2; 3)-element. You may feel foolish doing this, but you'll get the
right answer.
13. The matrix In of order n whose diagonal elements are one and whose
o-diagonal elements are zero [we write In = diag(1; 1; : : : ; 1) is alled the
identity matrix. If A is any m n matrix, then it is easy to verify that
Im A = AIn ;
so that identity matri es are multipli ative identities. When the ontext makes
the order lear, we drop the subs ript and simply write I for an identity matrix.
14. The identity matrix is a spe ial ase of a useful lass of matri es alled
diagonal matri es . A matrix D is diagonal if its only nonzero entries lie on its
diagonal, i.e., if dij = 0 whenever i 6= j . We write
diag(d1 ; : : : ; dn )
for a diagonal matrix whose diagonal entries are d1 ; : : : ; dn .
15. Sin e we have agreed to regard an n-ve tor as an n 1 matrix, the above
denitions an be transferred dire tly to ve tors. Any ve tor an be multiplied
by a s alar. Two ve tors of the same dimension may be added. Only 1-ve tors,
i.e., s alars, an be multiplied.
16. A parti ularly important ase of the matrix produ t is the matrix-ve tor
produ t Ax. Among other things it is useful as an abbreviated way of writing
systems of equations. Rather than say that we shall solve the system
b1 = a11 x1 + a12 x2 + + a1n xn
b2 = a21 x1 + a22 x2 + + a2n xn
: : :
bn = an1 x1 + an2 x2 + + annxn
we an simply write that we shall solve the equation
b = Ax;
where A is of order n.
17. Both the matrix sum and the matrix produ t are asso iative; that is,
(A + B ) + C = A + (B + C ) and (AB )C = A(BC ). The produ t distributes
over the sum; e.g., A(B + C ) = AB + AC . In addition, the matrix sum is
ommutative: A + B = B + A. Unfortunately the matrix produ t is not
ommutative: in general AB 6= BA. It is easy to forget this fa t when you are
manipulating formulas involving matri es.
18. The nal operation we shall use is the matrix transpose. If A is an m n
matrix, then the transpose of A is the n m matrix AT dened by
AT = (aji ):
Thus the transpose is the matrix obtained by re e ting a matrix through its
diagonal.
19. The transpose intera ts ni ely with the other matrix operations:
1: (A)T = (AT );
2: (A + B )T = AT + B T;
3: (AB )T = B T AT :
Note that the transposition reverses the order of a produ t.
20. If x is a ve tor, then xT is a row ve tor. If x and y are n-ve tors, then
yT x = x1 y1 + x2 y2 + + xn yn
is a s alar alled the inner produ t of x and y. In parti ular the number
p
kxk = x x T
is the Eu lidean length (or two-norm) of the ve tor x.

Rank-one matri es
21. If x; y 6= 0, then any matrix of the form

0 1
x1 y1 x1 y2 x1 y3
Bx2 y1
B
x2 y2 x2 y3 C
C
W = xyT = B
Bx3 y1 x3 y2 x3 y3 C
C (9.1)

.. .. .. A
. . .
has rank one; that is, its olumns span a one-dimensional spa e. Conversely,
any rank-one matrix W an be represented in the form xyT . Rank-one matri es
arise frequently in numeri al appli ations, and it's important to know how to
deal with them.
22. The rst thing to note is that one does not store a rank-one matrix as a
matrix. For example, if x and y are n-ve tors, then the matrix xyT requires n2
lo ations to store, as opposed to 2n lo ations to store x and y. To get some idea
of the dieren e, suppose that n = 1000. Then xyT requires one million words
to store as a matrix, as opposed to 2000 to store x and y individually| the
storage diers by a fa tor of 500.
23. If we always represent a rank-one matrix W = xyT by storing x and y,
the question arises of how we perform matrix operations with W | how, say,
we an ompute the matrix-ve tor produ t = W b? An elegant answer to this
question may be obtained from the equation
= W b = (xyT )b = x(yT b) = (yT b)x; (9.2)
in whi h the last equality follows from the fa t that yT b is a s alar.
This equation leads to the following algorithm.
1. Compute = yT b (9.3)
2. Compute = x
This algorithm requires 2n multipli ations and n 1 additions. This should be
ontrasted with the roughly n2 multipli ations and additions required to form
an ordinary matrix ve tor produ t.
24. The above example illustrates the power of matrix methods in deriving
e ient algorithms. A person ontemplating the full matrix representation
(9.1) of xyT would no doubt ome up with what amounts to the algorithm (9.3),
albeit in s alar form. But the pro ess would be arduous and error prone. On
the other hand, the simple manipulations in (9.2) yield the algorithm dire tly
and in a way that relates it naturally to operations with ve tors. We shall see
further examples of the power of matrix te hniques in deriving algorithms.
Partitioned matri es
25. Partitioning is a devi e by whi h we an express matrix operations at a
level between s alar operations and operations on full matri es. A partition
of a matrix is a de omposition of the matrix into submatri es. For example,
onsider the matrix
0 1
a11 a12 a13 a14 a15 a16 a17
Ba21 a22 a23 a24 a25 a26 a27 C
B
C
A=B
Ba31
B
a32 a33 a34 a35 a36 C:
a37 C
C
a41 a42 a43 a44 a45 a46 a47 A
a51 a52 a53 a54 a55 a56 a57
The partitioning indu ed by the lines in the matrix allows us to write the
matrix in the form !
A
A= A A A ;
11 A 12 A13
21 22 23
where ! !
A11 = aa11 aa12 ; A12 = aa13 aa14 aa15 ; et .

21 22 23 24 25
26. The power of partitioning lies in the following fa t.

If the partitions of two matri es are onformal, the submatri es
may be treated as s alars for the purposes of performing matrix
operations.
For example, if the partitions below are onformal, we have
! ! !
A11 A12 + B11 B12 = A11 + B11 A12 + B12 :
A21 A22 B21 B22 A21 + B21 A22 + B22
Similarly, we may write a matrix produ t in the form
! ! !
A11 A12 B11 B12 = A11 B11 + A12 B21 A11 B12 + A12 B22 ;
A21 A22 B21 B22 A21 B11 + A22 B21 A21 B12 + A22 B22
again provided the partitions are onformal. The one thing to be areful about
here is that the produ ts of the submatri es do not ommute like s alar prod-
u ts.
27. As a simple but important example of the use of matrix partitions, onsider
the matrix-ve tor produ t Ax, where A is of order n. If we partition A by
olumns in the form
A = (a1 a2 : : : an );
then 0 1
1
B 2 C
B C
Ax = (a1 a2 : : : an ) B .. C = 1 a1 + 2 a2 + + n an :
B C
.A
n
From this formula we an draw the following useful on lusion.
The matrix-ve tor produ t Ax is the linear ombination of the
olumns of A whose oe ients are the omponents of x.
Le ture 10
Linear Equations
The Theory of Linear Systems
Computational Generalities
Triangular Systems
Operation Counts
The theory of linear systems

1. For some time to ome we will be on erned with the solution of the equation
Ax = b; (10.1)
where A is of order n. Before attempting to solve any numeri al problem it
is a good idea to nd out if it has a unique solution. Fortunately, the theory
of linear systems provides a number of onditions that an be used to he k if
(10.1) has a solution.
Let A be of order n. Then the following statements are equiva-
lent.
1. For any ve tor b, the system Ax = b has a solution.
2. If a solution of the system Ax = b exists, it is unique.
3. For all x, Ax = 0 =) x = 0.
4. The olumns (rows) of A are linearly independent.
5. There is a matrix A 1 su h that A 1 A = AA 1 = I .
6. det(A) 6= 0.
2. Although the above onditions all have their appli ations, as a pra ti al
matter the ondition det(A) 6= 0 an be quite misleading. The reason is that
the determinant hanges violently with minor res aling. Spe i ally, if A is of
order n, then
det(A) = n det(A):
To see the impli ations of this equality, suppose that n = 30 (rather small by
today's standards) and that det(A) = 1. Then
det(0:1 A) = 10 30
:
In other words, dividing the elements of A by ten redu es the determinant by
a fa tor of 10 30 . It is not easy to determine whether su h a volatile quantity
77
is truly zero | whi h is why the determinant is used primarily in theoreti al

settings.
3. A matrix A satisfying any of the onditions of x10.1 is said to be nonsin-
gular. If A is nonsingular, the matrix A 1 guaranteed by the fth ondition is
alled the inverse of A. The inverse intera ts ni ely with matrix operations of
multipli ation and transposition.
Let A and B be of order n.
1. The produ t AB is nonsingular if and only if A and B are
nonsingular. In this ase
(AB ) 1 = B 1A 1 :
2. The matrix AT is nonsingular and
(AT ) 1 = (A 1 )T :
A onvenient shorthand for (AT ) 1 is A T .
Computational generalities
4. The solution of the linear system Ax = b an be written in terms of the
matrix inverse as follows
A 1 b = A 1 (Ax) = Ix = x:
This suggests the following algorithm for solving linear systems.
1. Compute C = A 1 (10.2)
2. x = Cb
With very few ex eptions, this is a bad algorithm. In fa t, you would not even
use it for s alar equations. For example, if we were to try to ompute the
solution of 10x = 2 in this way, we would end up with
1. = 1=10
2. x = 2
If instead we write x = 2=10, we save an operation and a little rounding
error. The situation is the same with the general invert-and-multiply algorithm
(10.2), ex ept that (10.2) is mu h more expensive than its alternatives and a
lot less stable.
5. What are the alternatives? Later we are going to show how to fa tor a
matrix A in the form7
A = LU;
This is a slight oversimpli ation. For stability we have to inter hange rows of A as we
7
fa tor it.
where L is lower triangular (i.e., its elements are zero above the diagonal)
and U is upper triangular (its elements are zero below the diagonal). This
fa torization is alled an LU de omposition of the matrix A. Now if A is
nonsingular, then so are L and U . Consequently, if we write the system Ax = b
in the form LUx = b, we have
Ux = L 1 b y: (10.3)
Moreover, by the denition of y
Ly = b: (10.4)
Thus, if we have a method for solving triangular systems, we an use the
following algorithm to solve the system Ax = b.
1. Fa tor A = LU
2. Solve Ly = b
3. Solve Ux = y
To implement this general algorithm we must be able to fa tor A and to
solve triangular systems. We will begin with triangular systems.
Triangular systems
6. A matrix L is lower triangular if
i < j =) ìj = 0:
This is a fan y way of saying that the elements of L lying above the diagonal
are zero. For example, a lower triangular matrix of order ve has the form
0 1
`11 0 0 0 0
B`21
B
`22 0 0 0 C C
L=B B
` 31 `32 ` 33 0 0 C:
C
`42 `43 `44 0 A

B C
`41
`51 `52 `53 `54 `55

A matrix U is upper triangular if
i > j =) uij = 0:
The elements of an upper triangular matrix lying below the diagonal are zero.
7. For deniteness we will onsider the solution of the lower triangular system
Lx = b of order n. The basi fa t about su h systems is that if the diagonal
elements ìi of L are nonzero, then L is nonsingular and the system has a
unique solution. We shall establish this fa t by des ribing an algorithm to
ompute the solution.
8. The algorithm is su iently well illustrated by the ase n = 5. We begin

by writing the system Lx = b in s alar form.
b1 = `11 x1
b2 = `21 x1 + `22 x2
b3 = `31 x1 + `32 x2 + `33 x3
b4 = `41 x1 + `42 x2 + `43 x3 + `44 x4
b5 = `51 x1 + `52 x2 + `53 x3 + `54 x4 + `55 x5
The rst equation in this system involves only x1 and may be solved forthwith:
b1
x1 = :
`11
Knowing the value of x1 , we an substitute it into the se ond equation and
solve to get
b ` x
x2 = 2 21 1 :
`22
Substituting x1 and x2 into the third equation and solving, we get
b3 `31 x1 `32 x2
x3 = :
`33
Continuing in this manner, we get
b4 `41 x1 `42 x2 `43 x3
x4 =
`44
and
b5 `51 x1 `52 x2 `53 x3 `54 x4
x5 = ;
`55
whi h ompletes the solution of the system. Sin e by hypothesis the diagonal
elements ìi are nonzero, these formulas uniquely determine the xi .
9. The pro edure sket hed above is quite general and leads to the follow-
ing forward-substitution algorithm for solving a lower triangular system. The
lower triangular matrix is ontained in a doubly subs ripted array l. The
omponents of the solution overwrite the right-hand side b.8
for (i=1; i<=n; i++){
for (j=1; j<i; j++)
b[i = b[i - l[i[j*b[j; (10.5)
b[i = b[i/l[i[i;
}
We have already noted that the indexing onventions for matri es, in whi h the rst
8
element is the (1,1)-element, are in onsistent with C array onventions in whi h the rst
element of the array a is a[0[0. In most C ode presented here, we will follow the matrix
onvention. This wastes a little storage for the unused part of the array, but that is a small
pri e to pay for onsisten y.
Operation ounts
10. To get an idea of how mu h it osts to solve a triangular system, let us
ount the number of multipli ations required by the algorithm. There is one
multipli ation in the statement
b[i = b[i - l[i[j*b[j;
This statement is exe uted for j running from 1 to i and for i running from
1 to n. Hen e the total number of multipli ations is
n i n n(n + 1)
XX
1=
X
i= = n ;
2
(10.6)
i=1 j =1 i=1 2 2
the last approximation holding for large n. There are a like number of addi-
tions.
11. Before we try to say what an operation ount like (10.6) a tually means,
let us dispose of a te hni al point. In deriving operation ounts for matrix
pro esses, we generally end up with sums nested two or three deep, and we
are interested in the dominant term, i.e., the term with the highest power of
n. We an obtain this term by repla ing sums with integrals and adjusting the
limits of integration to make life easy. If this pro edure is applied to (10.6),
the result is Z nZ i Z n
1 dj di = i di = n2 ;
2
0 0 0
whi h is the dominant term in the sum (10.6).

12. It might be thought that one ould predi t the running time of an algorithm
by ounting its arithmeti operations and multiplying by the time required for
an operation. For example, if the ombined time for a oating-point addition
and multipli ation is , then it might be expe ted that it would take time n2
2
to solve a triangular system of order n.

Unfortunately, this expe tation is not realized in pra ti e. The reason is
that the algorithm is busy with tasks other than oating-point arithmeti . For
example, the referen e to l[i[j requires that an element of l be retrieved
from memory. Again, the test j<i must be performed ea h time the inner loop
is exe uted. This overhead in ates the time, so that an operation ount based
solely on arithmeti will underestimate the total time.
Nonetheless, operation ounts of the kind we have introdu ed here an be
useful. There are two reasons.
1. The additional overhead in an algorithm is generally proportional to
the number of arithmeti operations. Although an arithmeti ount
does not predi t the running time, it predi ts how the running time
will in rease with n | linearly, quadrati ally, et .
2. In onsequen e, if algorithms have dierent orders of omplexity |

that is, dominant terms with dierent powers of n | then the one
with the lower order will ultimately run faster.
13. Some are must be taken in omparing algorithms of the same order.
The presumption is that the one with the smaller order onstant will run
faster. Certainly, if the order onstants dier signi antly, this is a reasonable
assumption: we would expe t an algorithm with a ount of 2n2 to outperform
an algorithm with a ount of 100n2 . But when the order onstants are nearly
equal, all bets are o, sin e the a tually order onstant will vary a ording to
the details of the algorithm. Someone who says that an algorithm with a ount
of n2 is faster than an algorithm with a ount of 2n2 is sti king out the old
ne k.
Le ture 11
Linear Equations
Memory Considerations
Row-Oriented Algorithms
A Column-Oriented Algorithm
General Observations
Basi Linear Algebra Subprograms
Memory onsiderations
1. Virtual memory is one of the more important advan es in omputer systems
to ome out of the 1960s. The idea is simple, although the implementation
is ompli ated. The user is supplied with very large virtual memory. This
memory is subdivided into blo ks of modest size alled pages . Sin e the entire
virtual memory annot be ontained in fast, main memory, most of its pages
are maintained on a slower ba king store, usually a disk. Only a few a tive
pages are ontained in the main memory.
When an instru tion referen es a memory lo ation, there are two possibil-
ities.
1. The page ontaining the lo ation is in main memory (a hit). In this
ase the lo ation is a essed immediately.
2. The page ontaining the lo ation is not in main memory (a miss).
In this ase the system sele ts a page in main memory and swaps it
with the one that is missing.
2. Sin e misses involve a time- onsuming ex hange of data between main
memory and the ba king store, they are to be avoided if at all possible. Now
memory lo ations that are near one another are likely to lie on the same page.
Hen e one strategy for redu ing misses is to arrange to a ess memory sequen-
tially, one neighboring lo ation after another. This is a spe ial ase of what is
alled lo ality of referen e.
Row-oriented algorithms
3. The algorithm (10.5) is one that preserves lo ality of referen e. The reason
is that the language C stores doubly subs ripted arrays by rows, so that in the
ase n = 5 the matrix L might be stored as follows.
`11 0 0 0 0 `21 `22 0 0 0 `31 `32 `33 0 0 `41 `42 `43 `44 0 `51 `52 `53 `54 `55
83
Now if you run through the loops in (10.5), you will nd that the elements of
L are a essed in the following order.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
`11 0 0 0 0 `21 `22 0 0 0 `31 `32 `33 0 0 `41 `42 `43 `44 0 `51 `52 `53 `54 `55
Clearly the a esses here tend to be sequential. On e a row is in main memory,
we mar h along it a word at a time.
4. A matrix algorithm like (10.5) in whi h the inner loops a ess the elements
of the matrix by rows is said to be row oriented. Provided the matrix is stored
by rows, as it is in C, row-oriented algorithms tend to intera t ni ely with
virtual memories.
5. The situation is quite dierent with the language fortran, in whi h ma-
tri es are stored by olumns. For example, in fortran the elements of the
matrix L will appear in storage as follows.
`11 `21 `31 `41 `51 0 `22 `32 `42 `52 0 0 `33 `43 `53 0 0 0 `44 `54 0 0 0 0 `55
If the fortran equivalent of algorithm (10.5) is run on this array, the memory
referen es will o ur in the following order.
1 2 4 7 11 3 5 8 12 6 9 13 10 14 15
`11 `21 `31 `41 `51 0 `22 `32 `42 `52 0 0 `33 `43 `53 0 0 0 `44 `54 0 0 0 0 `55
Clearly the referen es are jumping all over the pla e, and we an expe t a high
miss rate for this algorithm.
A olumn-oriented algorithm
6. The ure for the fortran problem is to get another algorithm | one that
is olumn oriented. Su h an algorithm is easy to derive from a partitioned
form of the problem.
Spe i ally, let the system Lx = b be partitioned in the form
! ! !
11 0 1 1
`21 L22 x2 = b2 ;
where L22 is lower triangular. (Note that we now use the Greek letter to
denote individual elements of L, so that we do not onfuse the ve tor `21 =
(21 ; : : : ; n1 )T with the element 21 .) This partitioning is equivalent to the
two equations
11 1 = 1 ;
`21 1 + L22 x2 = b2 :
The rst equation an be solved as usual:

1
1 = :
11
Knowing the rst omponent 1 of x we an write the equation in the form
L22 x2 = b2 1 `21 : (11.1)
But this equation is a lower triangular system of order one less than the original.
Consequently we an repeat the above pro edure and redu e it to a lower
triangular system of order two less than the original, and so on until we rea h
a 1 1 system, whi h an be readily solved.
7. The following fortran ode implements this algorithm.
do 20 j=1,n
b(j) = b(j)/l(j,j)
do 10 i=j+1,n
b(i) = b(i) - b(j)*l(i,j)
(11.2)
10 ontinue
20 ontinue
At the jth iteration of the outer loop, the omponents x1 , . . . , xj 1 have
already been omputed and stored in b(1), . . . , b(j-1). Thus we have to
solve the system involving the matrix
!
jj 0
`j+1,j Lj+1,j+1 :
The statement
b(j) = b(j)/l(j,j)
overwrites b(j) with the rst omponent of this solution. The loop
do 10 i=j+1,n
b(i) = b(i) - b(j)*l(i,j)
10 ontinue
then adjusts the right-hand side as in (11.1). The algorithm then ontinues to
solve the smaller system whose matrix is Lj+1,j+1.
8. Running through the algorithm, we nd that the memory referen es to the
elements of L o ur in the following order.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
`11 `21 `31 `41 `51 0 `22 `32 `42 `52 0 0 `33 `43 `53 0 0 0 `44 `54 0 0 0 0 `55
Clearly, the referen es here are mu h more lo alized than in the row-riented
algorithm, and we an expe t fewer misses.
General observations on row and olumn orientation

9. Although we have onsidered row- and olumn-oriented algorithms in terms
of their intera tions with virtual memories, many omputers have more than
two levels of memory. For example, most omputers have a limited amount of
superfast memory alled a a he . Small blo ks of main memory are swapped
in and out of the a he, mu h like the pages of a virtual memory. Obviously, an
algorithm that has good lo ality of referen e will use the a he more e iently.
10. It is important to keep things in perspe tive. If the entire matrix an t into
main memory, the distin tion between row- and olumn-oriented algorithms
be omes largely a matter of a he ee ts, whi h may not be all that important.
A megaword of memory an hold a matrix of order one thousand, and many
workstations in ommon use today have even more memory.
To illustrate this point, I ran row- and olumn-oriented versions of Gaus-
sian elimination (to be treated later in x13) oded in C on my spar ip
workstation. For a matrix of order 1000, the row-oriented version ran in about
690 se onds, while the olumn-oriented version took 805 se onds. Not mu h
dieren e. On a de workstation, the times were 565 se onds for the row-
oriented version and 1219 se onds for the olumn-oriented version. A more
substantial dieren e, but by no means an order of magnitude.9
11. Finally, the problems treated here illustrate the need to know something
about the target omputer and its ompilers when one is designing algorithms.
An algorithm that runs e iently on one omputer may not run well on an-
other. As we shall see, there are ways of hiding some of these ma hine de-
penden ies in spe ially designed subprograms; but the general rule is that
algorithms have to run on a tual ma hines, and ma hines have idiosyn rasies
whi h must be taken into a ount.
Basi linear algebra subprograms
12. One way of redu ing the dependen y of algorithms on the hara teristi s
of parti ular ma hines is to perform frequently used operations by invoking
subprograms instead of writing in-line ode. The hief advantage of this ap-
proa h is that manufa turers an provide sets of these subprograms that have
been optimized for their individual ma hines. Subprograms to fa ilitate matrix
omputations are olle tively known as Basi Linear Algebra Subprograms, or
blas for short.
13. To illustrate the use of the blas, onsider the loop
for (j=1; j<i; j++)
b[i = b[i - l[i[j*b[j;
The programs were ompiled by the g ompiler with the optimization ag set and were
9
run on unloaded ma hines.

from the algorithm (10.5). In expanded form, this omputes

b[i = b[i - l[i[1*b[1 - l[i[2*b[2 - ...
- l[i[i-1*b[i-1;
that is, it subtra ts the dot produ t of the ve tors
0 1 0 1
l[i[1 b[1
B C B C
B l[i[2 C B b[2 C
B
.. C and B
.. C
B
. C
A
B
. C
A
l[i[i-1 b[i-1
from b[i. Consequently, if we write a little fun tion
float dot(n, float x[, float y[)
{
float d=0;
for (i=0; i<n; i++)
d = d + x[i*y[i;
return d;
}
we an rewrite (10.5) in the form
for (i=1; i<=n; i++)
b[i = (b[i - dot(i-1, &l[i[1, &b[1))/l[i[i;
Not only do we now have the possibility of optimizing the ode for the fun tion
dot, but the row-oriented algorithm itself has been onsiderably simplied.
14. As another example, onsider the following statements from the olumn-
oriented algorithm (11.2).
do 10 i=j+1,n
b(i) = b(i) - b(j)*l(i,j)
10 ontinue
Clearly these statements ompute the ve tor
0 1 0 1
b(j+1) l(j+1,j)
B C B C
Bb(j+2)C Bl(j+2,j)C
B
.. C b(j) B .. C :
B
. C
A
B
. C
A
b(n) l(n,j)
Consequently, if we write a little fortran program axpy (for ax + y)
subroutine axpy(n, a, x, y)
integer n
real a, x(*), y(*)
do 10 i=1,n
y(i) = y(i) + a*x(i)
10 ontinue
return
end
then we an rewrite the program in the form

do 20 j=1,n
b(j) = b(j)/l(j,j)
all axpy(n-j, -b(j), l(j+1,j), b(j+1))
20 ontinue
On e again the ode is simplied, and we have the opportunity to optimize
axpy for a given ma hine.
15. The programs dot and axpy are two of the most frequently used ve tor
blas; however, there are many more. For example, s al multiplies a ve tor
by a s alar, and opy opies a ve tor into the lo ations o upied by another.
It should be stressed that the names and alling sequen e we have given in
these examples are deliberate oversimpli ations, and you should read up on
the a tual blas before writing produ tion ode.
Le ture 12
Linear Equations
Positive-Denite Matri es
The Cholesky De omposition
E onomi s
Positive-denite matri es

1. A matrix A of order n is symmetri if AT = A, or equivalently if
aij = aji ; i; j = 1; : : : ; n:
Be ause of this stru ture a symmetri matrix is entirely represented by the
elements on and above its diagonal, and hen e an be stored in half the mem-
ory required for a general matrix. Moreover, many matrix algorithms an be
simplied for symmetri matri es so that they have smaller operation ounts.
Let us see what symmetry buys us for linear systems.
2. The best way to solve a linear system Ax = b is to fa tor A in the form
LU , where L is lower triangular and U is upper triangular and then solve the
resulting triangular systems (see x10.5). When A is symmetri , it is reasonable
to expe t the fa torization to be symmetri ; that is, one should be able to fa tor
A in the form A = RT R, where R is upper triangular. However, not just any
symmetri matrix has su h a fa torization.
To see why, suppose A is nonsingular and x 6= 0. Then R is nonsingular,
and y = Rx 6= 0. Hen e,
X
xT Ax = xT RT Rx = (Rx)T (Rx) = yT y = yi2 > 0: (12.1)
Thus a nonsingular matrix A that an be fa tored in the form RT R has the
following two properties.
1: A is symmetri . (12.2)
6 0 =) xT Ax > 0:
2: x =
Any matrix with these two properties is said to be positive denite.10
3. Positive-denite matri es o ur frequently in real life. For example, a vari-
ant of the argument (12.1) shows that if X has linearly independent olumns
then X T X is positive denite. This means, among other things, that positive-
denite matri es play an important role in least squares and regression, where
systems like (X T X )b = are ommon. Again, positive-denite matri es arise
10
Warning: Some people drop symmetry when dening a positive-denite matrix.
89
in onne tion with ellipti partial dierential equations, whi h o ur every-

where. In fa t, no advan ed student in the hard s ien es or engineering an
fail to meet with a positive-denite matrix in the ourse of his or her studies.
4. Positive-denite matri es are nonsingular. To see this we will show that
Ax = 0 =) x = 0
(see x10.1). Suppose on the ontrary that Ax = 0 but x 6= 0. Then 0 =
xT Ax, whi h ontradi ts the positive-deniteness of A. Thus x = 0, and A is
nonsingular.
5. Square matri es lying on the diagonal of a partitioned positive-denite
matrix are positive denite. In parti ular, if we partition A in the form
!
A = a A
aT ;

then > 0 and A is positive denite. To see that > 0, set x = (1; 0; : : : ; 0).
Then ! !
a T
0 < x Ax = (1 0) a A 0 = :
T 1

To see that A is positive denite, let y 6= 0 and set xT = (0 yT ). Then
! !
0 < x Ax = (0 y ) aT
0 = yT A y:
T T
a A y
The Cholesky de omposition

6. We have seen that any nonsingular matrix A that an be fa tored in the
form RTR is positive denite. The onverse is also true. If A is positive
denite, then A an be fa tored in the form A = RT R, where R is upper
triangular. If, in addition, we require the diagonal elements of R to be positive,
the de omposition is unique and is alled the Cholesky de omposition or the
Cholesky fa torization of A. The matrix R is alled the Cholesky fa tor of
A. We are going to establish the existen e and uniqueness of the Cholesky
de omposition by giving an algorithm for omputing it.
7. We will derive the Cholesky algorithm in the same way we derived the
olumn-oriented method for solving lower triangular equations: by onsider-
ing a suitable partition of the problem, we solve part of it and at the same
time redu e it to a problem of smaller order. For the Cholesky algorithm the
partition of the equation A = RT R is
! ! !
aT = 0 rT :
a A r RT 0 R
Writing out this equation by blo ks of the partition, we get the three equations
1: = 2 ;
2: aT = rT ;
3: A = RT R + rrT:
Equivalently, p
1: = ;
2: rT = 1 aT ; (12.3)
3: RTR = A rrT:
The rst two equations are, in ee t, an algorithm for omputing the rst
row of R. The (1,1)-element of R is well dened, sin e > 0. Sin e 6= 0,
rT is uniquely dened by the se ond equation.
The third equation says that R is the Cholesky fa tor of the matrix
A^ = A rrT = A 1 aaT
[the last equality follows from the rst two equations in (12.3). This matrix is
of order one less than the original matrix A, and onsequently we an ompute
its Cholesky fa torization by applying our algorithm re ursively. However, we
must rst establish that A^ is itself positive denite, so that it has a Cholesky
fa torization.
8. The matrix A^ is learly symmetri , sin e
A^T = (A rrT)T = AT (rT )T rT = A rrT :
Hen e it remains to show that for any nonzero ve tor y
^ = yT A y 1 (aT y)2 > 0:
yT Ay
To do this we will use the positive-deniteness of A. If is any s alar, then
! !
0 < ( yT ) a A
aT = 2 + 2aT y + yTA y:
y

If we now set = 1 aT y, then it follows after a little manipulation that
0 < 2 + 2aT y + yT A y = yT A y 1 (aT y)2 ;
whi h is what we had to show.
9. Before we ode the algorithm sket hed above, let us examine its relation
to an elimination method for solving a system of equations Ax = b. We begin
by writing the equation A^ = A 1 aaT in s alar form as follows:
^ ij = ij 111 1i 1j = ij 111 i1 1j :
Here we have put the subs ripting of A ba k into the partition, so that = 11
and aT = (12 ; : : : ; 1n ). The se ond equality follows from the symmetry of
A.
Now onsider the system
11 x1 + 12 x2 + 13 x3 + 14 x4 = b1
21 x1 + 22 x2 + 23 x3 + 24 x4 = b2
31 x1 + 32 x2 + 33 x3 + 34 x4 = b3
41 x1 + 42 x2 + 43 x3 + 44 x4 = b4
If the rst equation is solved for x1 , the result is
x1 = 111 (b1 12 x2 13 x3 14 x4 ):
Substituting x1 in the last three of the original equations and simplifying, we
get
(22 111 21 12 )x2 + (23 111 21 13 )x3 + (24 111 21 14 )x4
= b2 111 21 b1
(32 111 31 12 )x2 + (33 111 31 13 )x3 + (34 111 31 14 )x4
= b3 111 31 b1 (12.4)
(42 111 41 12 )x2 + (43 111 41 13 )x3 + (44 111 41 14 )x4
= b4 111 41 b1
In this way we have redu ed our system from one of order four to one of order
three. This pro ess is alled Gaussian elimination.
Now if we ompare the elements ij 111 i1 1j of the matrix A^ produ ed
by the Cholesky algorithm with the oe ients of the system (12.4), we see
that they are the same. In other words, Gaussian elimination and the Cholesky
algorithm produ e the same submatrix, and to that extent are equivalent. This
is no oin iden e: many dire t algorithms for solving linear systems turn out
to be variants of Gaussian elimination.
10. Let us now turn to the oding of Cholesky's algorithm. There are two
ways to save time and storage.
1. Sin e A and A^ are symmetri , it is unne essary to work with the
lower half | all the information we need is in the upper half. The
same applies to the other submatri es generated by the algorithm.
2. On e and aT have been used to ompute and rT , they are no
longer needed. Hen e their lo ations an be used to store and rT .
As the algorithm pro eeds, the matrix R will overwrite the upper
half of A row by row.
11. The overwriting of A by R is standard pro edure, dating from the time
when storage was dear and to be onserved at all osts. Perhaps now that
storage is bounteous, people will quit evi ting A and give R a home of its own.
Time will tell.
12. The algorithm pro eeds in n stages. At the rst stage, the rst row of
R is omputed and the (n 1) (n 1) matrix A in the southeast orner
is modied. At the se ond stage, the se ond row of R is omputed and the
(n 2) (n 2) matrix in the southeast orner is modied. The pro ess
ontinues until it falls out of the southeast orner. Thus the algorithm begins
with a loop on the row of R to be omputed.
do 40 k=1,n
At the beginning of the kth stage the array that ontained A has the form
illustrated below for n = 6 and k = 3:
0 1
11 12 13 14 15 16
B 0 22 23 24 25 26 C
B
C
B 0 0 33 34 35 36 C
B
B 0
B
0 0 44 45
C
46 C
C
:
B C
0 0 0 0 55 56 A
0 0 0 0 0 26
The omputation of the kth row of R is straightforward:
a(k,k) = sqrt(a(k,k))
do 10 j=k+1,n
a(k,j) = a(k,j)/a(k,k)
10 ontinue
At this point the array has the form
0 1
11 12 13 14 15 16
B 0 22 23 24 25 26 C
B
C
B
B 0 0 33 34 35 36 C
C
:
B 0 0 0 44 45 46 C
B
C
B C
0 0 0 0 55 56 A
0 0 0 0 0 26
We must now adjust the elements beginning with 44 . We will do it by
olumns.
do 30 j=k+1,n
do 20 i=k+1,j
a(i,j) = a(i,j) - a(k,i)*a(k,j)
20 ontinue
30 ontinue
Finally we must nish o the loop in k.

40 ontinue
Here is what the algorithm looks like when the ode is assembled in one
pla e.
do 40 k=1,n
do 10 j=k+1,n
10 ontinue
do 30 j=k+1,n
do 20 i=k+1,j
a(i,j) = a(i,j) - a(k,i)*a(k,j)
20 ontinue
30 ontinue
40 ontinue
E onomi s
13. Sin e our ode is in fortran, we have tried to preserve olumn orientation
by modifying A by olumns. Unfortunately, this strategy does not work. The
kth row of R is stored over the kth row of A, and we must repeatedly ross
a row of A in modifying A . The oending referen e is a(k,i) in the inner
loop.
do 20 i=k+1,j
a(i,j) = a(i,j) - a(k,i)*a(k,j)
20 ontinue
There is really nothing to be done about this situation, unless we are willing
to provide an extra one-dimensional array | all it r. We an then store the
urrent row of R in r and use it to adjust the urrent A . This results in the
following ode.
do 40 k=1,n
do 10 j=k+1,n
r(j) = a(k,j)
10 ontinue
do 30 j=k+1,n
do 20 i=k+1,j
a(i,j) = a(i,j) - r(i)*r(j)
20 ontinue
30 ontinue
40 ontinue
The hief drawba k to this alternative is that it requires an extra parameter

in the alling sequen e for the array r.
A se ond possibility is to work with the lower half of the array A and
ompute L = RT. For then the rows of R be ome olumns of L.
14. The integration te hnique of x10.11 gives an operation ount for the
Cholesky algorithm as follows. The expression
a(i,j) = a(i,j) - r(i)*r(j)
in the inner loop ontains one addition and one multipli ation. It is exe uted
for k=k+1,j, j=k+1,n, and k=1,n. Consequently the number of additions and
multipli ations will be approximately
Z nZ nZ j
1
di dj dk = n3 :
0 k k 6
15. The fa t that the Cholesky algorithm is an O(n3 ) algorithm has important
onsequen es for the solution of linear systems. Given the Cholesky de ompo-
sition of A, we solve the linear system Ax = b by solving the two triangular
systems
1: RT y = b;
2: Rx = y:
Now a triangular system requires 12 n2 operations to solve, and the two systems
together require n2 operations. To the extent that the operation ounts re e t
a tual performan e, we will spend more time in the Cholesky algorithm when
1 n3 > n2 ;
6
or when n > 6. For somewhat larger n, the time spent solving the triangular
systems is insigni ant ompared to the time spent omputing the Cholesky
de omposition. In parti ular, having omputed the Cholesky de omposition
of a matrix of moderate size, we an solve several systems having the same
matrix at pra ti ally no extra ost.
16. In x10.4 we have depre ated the pra ti e of omputing a matrix inverse
to solve a linear system. Now we an see why. A good way to al ulate the
inverse X = (x1 x2 xn ) of a symmetri positive-denite matrix A is to
ompute the Cholesky de omposition and use it to solve the systems
Axj = ej ; j = 1; 2; : : : ; n;
where ej is the j th olumn of the identity matrix. Now if these solutions
are omputed in the most e ient way, they require 13 n3 additions and mul-
tipli ations | twi e as many as the Cholesky de omposition. Thus the invert-
and-multiply approa h is mu h more expensive than using the de omposition
dire tly to solve the linear system.
Le ture 13
Linear Equations
Inner-Produ t Form of the Cholesky Algorithm
Gaussian Elimination
Inner-produ t form of the Cholesky algorithm

1. The version of the Cholesky algorithm just des ribed is sometimes alled
the outer-produ t form of the algorithm be ause the expression A^ = A aaT
involves the outer produ t aaT . We are now going to des ribe a form whose
omputational unit is the inner produ t.
2. The inner-produ t form of the algorithm su essively omputes the Cholesky
de ompositions of the leading prin ipal submatri es
0 1
!
11 12 13
A1 = (11 ); A2 = 11 12 ; A3 = B
21 22 23 C
A; ::::
21 22
31 32 33
To get things started, note that the Cholesky fa tor of A is the s alar =
p 1 11
.
11
Now assume that we have omputed the Cholesky fa tor Rk 1 of Ak 1 ,
and we wish to ompute the Cholesky fa tor Rk of Ak . Partition the equation
Ak = RkT Rk in the form
! ! !
Ak 1 ak = RkT 1 0 Rk 1 rk :
aTk kk rkT kk 0 kk
This partition gives three equations:
1: Ak 1 = RkT 1Rk 1 ;
2: ak = RkT 1 rk ;
3: kk = rkT rk + 2kk :
The rst equation simply onrms that Rk 1 is the Cholesky fa tor of
Ak 1 . But the se ond and third equations an be turned into an algorithm for
omputing the kth olumn of R: namely,
1: Solve RkT 1 rk = ak
q
2: kk = kk rkT rk
Sin e RkT 1 is a lower triangular matrix, the system in the rst step an be easily
solved. Moreover, sin e Ak is positive denite, kk must exist; i.e., kk rkT rk
97
must be greater than zero, so that we an take its square root. Thus we an
ompute the Cholesky fa tors of A1 , A2 , A3 , and so on until we rea h An = A.
The details are left as an exer ise.11
3. The bulk of the work done by the inner-produ t algorithm is in the solution
of the system RkT 1 rk = ak , whi h requires 12 k2 additions and multipli ations.
Sin e this solution step must be repeated for k = 1; 2; : : : ; n, the total operation
ount for the algorithm is 16 n3 , the same as for the outer-produ t form of the
algorithm.
4. In fa t the two algorithms not only have the same operation ount, they
perform the same arithmeti operations. The best way to see this is to position
yourself at the (i; j )-element of the array ontaining A and wat h what happens
as the two algorithms pro eed. You will nd that for both algorithms the (i; j )-
element is altered as follows:
ij 1i 1j ;
ij 1i 1j 2i 2j ;

ij 1i 1j 2i 2j i ;ii
1 1;j :
Then, depending on whether or not i = j , the square root of the element will
be taken to give ii , or the element will be divided by ii to give ij .
One onsequen e of this observation is that the two algorithms are the
same with respe t to rounding errors: they give the same answers to the very
last bit.
Gaussian elimination
5. We will now turn to the general nonsymmetri system of linear equations
Ax = b. Here A is to be fa tored into the produ t A = LU of a lower
triangular matrix and an upper triangular matrix. The approa h used to derive
the Cholesky algorithm works equally well with nonsymmetri matri es; here,
however, we will take another line that suggests important generalizations.
6. To motivate the approa h, onsider the linear system
11 x1 + 12 x2 + 13 x3 + 14 x4 = b1
21 x1 + 22 x2 + 23 x3 + 24 x4 = b2
31 x1 + 32 x2 + 33 x3 + 34 x4 = b3
41 x1 + 42 x2 + 43 x3 + 44 x4 = b4
If we set
mi1 = i1 =11 ; i = 2; 3; 4;
11
If you try for a olumn-oriented algorithm, you will end up omputing inner produ ts; a
row-oriented algorithm requires axpy's.
and subtra t mi1 times the rst equation from the ith equation (i = 2; 3; 4),
we end up with the system
11 x1 + 12 x2 + 13 x3 + 14 x4 = b1
022 x2 + 023 x3 + 024 x4 = b02
032 x2 + 033 x3 + 034 x4 = b03
042 x2 + 043 x3 + 044 x4 = b04
where
0ij = ij mi1 1j and b0i = bi mi1 b1 :
Note that the variable x1 has been eliminated from the last three equations.
Be ause the numbers mi1 multiply the rst equation in the elimination they
are alled multipliers.
Now set
mi2 = a0i2 =a022 ; i = 3; 4;
and subtra t mi2 times the se ond equation from the ith equation (i = 3; 4).
The result is the system
11 x1 + 12 x2 + 13 x3 + 14 x4 = b1
022 x2 + 023 x3 + 024 x4 = b02
0033 x3 + 0034 x4 = b003
0043 x3 + 0044 x4 = b004
where
00ij = 0ij mi2 02j and b00i = b0i mi2 b02 :
Finally set
mi3 = a00i3 =a0033 ; i=4
and subtra t mi3 times the third equation from the fourth equation. The result
is the upper triangular system
11 x1 + 12 x2 + 13 x3 + 14 x4 = b1
022 x2 + 023 x3 + 024 x4 = b02 (13.1)
0033 x3 + 0034 x4 = b003
00044 x4 = b0004
where
000ij = 00ij mi3 003j and b000i = b00i mi3 b003 :
Sin e the system (13.1) is upper triangular, it an be solved by the te hniques
we have already dis ussed.
7. The algorithm we have just des ribed for a system of order four extends in
an obvious way to systems of any order. The triangularization of the system is
usually alled Gaussian elimination, and the solution of the resulting triangular
system is alled ba k substitution. Although there are sli ker derivations, this
one has the advantage of showing the exibility of the algorithm. For example,
if some of the elements to be eliminated are zero, we an skip their elimination
with a orresponding savings in operations. We will put this exibility to use
later.
8. We have not yet onne ted Gaussian elimination with an LU de ompo-
sition. One way is to partition the equation A = LU appropriately, derive
an algorithm, and observe that the algorithm is the same as the elimination
algorithm we just derived. However, there is another way.
9. Let A1 = A and set
0 1
1 0 0 0
B
M1 = B m21 1 0 0CC
;
m31
B
0 1 0CA
m41 0 0 1
where the mij 's are the multipliers dened above. Then it follows that
A2 M1 A1
0 10 1 0 1
1 0 0 0 11 12 13 14 11 12 13 14
B
=B m21 1 0 0C
C B21
B
22 23 24 C
C B 0
B
=B 022 023 024 C
C
:
m
B
31 0 1 0A 31
CB
32 33 34 C
A 0 032 033 0
34 CA
m41 0 0 1 41 42 43 44 0 042 043 044

Next set 0 1
1 0 0 0
B
M2 = B0 1 0 0CC
:
0 m32 1 0C
B
A
0 m42 0 1
Then
A3 M2 A2
0 10 1 0 1
1 0 0 0 11 12 13 14 11 12 13 14
B
=B0 1 0 0C
CB 0
B
022 023 024 C
C
= B 0
B
022 023 024 CC
:
0
B
m32 1 0A 0
CB
032 033 034 A 0 0
C B
0033 0034 C
A
0 m42 0 1 0 042 043 044 0 0 0043 0044

Finally set 0 1
1 0 0 0
B
M3 = B0 1 0 0CC
:
0 0 1 0C
B
A
0 0 m43 1
Then
U M A
3 3
0 10 1 0 1
1 0 0 0 11 12 13 14 11 12 13 14
B
0 1 0 0 CB
0 0 0 0 C
B 0
B
022 023 024 C
=B 0 0
B
CB
CB
22 23 24 C
=
1 0A 0 0 0033 0034 A 0 0 0033 0034 C
C B
C
A
:
0 0 m43 1 0 0 0043 0044 0 0 0 00044
In other words the produ t U = M3 M2 M1 A is the upper triangular ma-
trix | the system of oe ients | produ ed by Gaussian elimination. If we set
L = M1 1 M2 1 M3 1 , then
A = LU:
Moreover, sin e the inverse of a lower triangular matrix is lower triangular
and the produ t of lower triangular matri es is lower triangular, L itself is
lower triangular. Thus we have exhibited an LU fa torization of A.
10. We have exhibited an LU fa torization, but we have not yet omputed it.
To do that we must supply the elements of L. And here is a surprise. The
(i; j )-element of L is just the multiplier mij .
To see this, rst note that Mk 1 may be obtained from Mk by ipping the
sign of the multipliers; e.g.,
0 1
1 0 0 0
M2 1 = B
B
0 1 0 0C C
:
0 m32 1 0C
B
A
0 m42 0 1
You an establish this by showing that the produ t is the identity.
It is now easy to verify that
0 10 1 0 1
1 0 0 0 1 0 0 0 1 0 0 0
M2 1 M3 1 = B
B
0 1 0 0C C B0 1
B
0 0C C = B
C B0
B
1 0 0CC
0 m32 1 0A 0 0 1 0A 0 m32 1 0C
B CB
A
0 m42 0 1 0 0 m43 1 0 m42 m43 1

and
L = M1 1 M2 1 M3 1
0 10 1 0 1
1 0 0 0 1 0 0 0 1 0 0 0
=B
B
m21 1 0 0C C B0
B
1 0 0C Bm21
B
1 0 0C
C = B ;
C C
m31
B
0 1 0A 0 m32 1 0A m31 m32 1 0C
CB
A
m41 0 0 1 0 m42 m43 1 m41 m42 m43 1

whi h is what we wanted to show.
11. On e again, the argument does not depend on the order of the system.
Hen e we have the following general result.
If Gaussian elimination is performed on a matrix of order n to give

an upper triangular matrix U , then A = LU , where L is a lower
triangular matrix with ones on its diagonal. For i > j the (i; j )-
element of L is the multiplier mij .
Be ause the diagonal elements of L are one, it is said to be unit lower trian-
gular.
12. The following ode overwrites A with its LU fa torization. When it is
done, the elements of U o upy the upper half of the array ontaining A,
in luding the diagonal, and the elements of L o upy the lower half, ex luding
the diagonal. (The diagonal elements of L are known to be one and do not
have to be stored.)
do 40 k=1,n
do 10 i=k+1,n
a(i,k) = a(i,k)/a(k,k)
10 ontinue
do 30 j=k+1,n
do 20 i=k+1,n
a(i,j) = a(i,j) - a(i,k)*a(k,j)
20 ontinue
30 ontinue
40 ontinue
13. An operation ount for Gaussian elimination an be obtained in the usual

way by integrating the loops:
Z nZ nZ n
1
di dj dk = n2 :
0 k k 3
The ount is twi e that of the Cholesky algorithm, whi h is to be expe ted,
sin e we an no longer take advantage of symmetry.
Le ture 14
Linear Equations
Pivoting
BLAS
Upper Hessenberg and Tridiagonal Systems
Pivoting
1. The leading diagonal elements at ea h stage of Gaussian elimination play a
spe ial role: they serve as divisors in the formulas for the multipliers. Be ause
of their pivotal role they are alled | what else | pivots. If the pivots are all
nonzero, the algorithm goes to ompletion, and the matrix has an LU fa tor-
ization. However, if a pivot is zero the algorithm mis arries, and the matrix
may or may not have an LU fa torization. The two ases are illustrated by the
matrix !
0 1 ;
1 0
whi h does not have an LU fa torization and the matrix
! ! !
0 1 = 1 0 0 1 ;
0 0 0 1 0 0
whi h does, but is singular. In both ases the algorithm fails.
2. In some sense the failure of the algorithm is a blessing | it tells you that
something has gone wrong. A greater danger is that the algorithm will go on
to ompletion after en ountering a small pivot. The following example shows
what an happen.12
0 1
0:001 2:000 3:000
A1 = B
1:000 3:712 4:623 C
A ;
2:000 1:072 5:643
0 1
1:000 0:000 0:000
1000:
M1 = B 1:000 0:000 C
A;
2000: 0:000 1:000
0 1
0:001 2:000 3:000
A2 = 0:000
B
2004: 3005: C
A;
0:000 4001: 6006:
12
This example is from G. W. Stewart, Introdu tion to Matrix Computations, A ademi
Press, New York, 1973.
103
0 1
1:000 0:000 0:000
M2 = B 0:000 1:000 0:000 C A;
0:000 1:997 1:000
0 1
0:001 2:000 3:000
A3 = B 0:000 2004: 3005: CA:
0:000 0:000 5:000
The (3; 3)-element of A3 was produ ed by an elling three signi ant gures
in numbers that are about 6000, and it annot have more than one gure of
a ura y. In fa t the true value is 5:922. . . .
3. As was noted earlier, by the time an ellation o urs in a omputation, the
omputation is already dead. In our example, death o urs in the passage from
A1 to A2 , where large multiples of the rst row were added to the se ond and
third, obliterating the signi ant gures in their elements. To put it another
way, we would have obtained the same de omposition if we had started with
the matrix 0 1
0:001 2:000 3:000
A~1 = B 1:000 4:000 5:000 C A:
2:000 1:000 6:000
Clearly, there will be little relation between the solution of the system A1 x = b
and A~1 x~ = b.
4. If we think in terms of linear systems, a ure for this problem presents itself
immediately. The original system has the form
0:001x1 + 2:000x2 + 3:000x3 = b1
1:000x1 + 3:712x2 + 4:623x3 = b2
2:000x1 + 1:072x2 + 5:643x3 = b3
If we inter hange the rst and third equations, we obtain an equivalent system
2:000x1 + 1:072x2 + 5:643x3 = b3
0:001x1 + 2:000x2 + 3:000x3 = b1
1:000x1 + 3:712x2 + 4:623x3 = b2
whose matrix 0 1
2:000 1:072 5:643
A^1 = B
0:001 2:000 3:000 C A
1:000 3:712 4:623
an be redu ed to triangular form without di ulty.
5. This suggests the following supplement to Gaussian elimination for om-
puting the LU de omposition.
At the kth stage of the elimination, determine an index pk (k

pk n) for whi h jp k j is largest and inter hange rows k and pk
k
before pro eeding with the elimination.

This strategy is alled partial pivoting.13 It insures that the multipliers are not
greater than one in magnitude, sin e we divide by the largest element in the
pivot olumn. Thus a gross breakdown of the kind illustrated in x14.2 annot
o ur in the ourse of the elimination.
6. We have motivated the idea of partial pivoting by onsidering a system
of equations; however, it is useful to have a pure matrix formulation of the
pro ess. The problem is to des ribe the inter hange of rows in matrix terms.
The following easily veried result does the job.
Let P denote the matrix obtained by inter hanging rows k and p
of the identity matrix. Then P A is the matrix obtained by inter-
hanging rows k and p of A.
The matrix P is alled a (k; p) elementary permutation.
7. To des ribe Gaussian elimination with partial pivoting in terms of matri es,
let Pk denote a (k; pk ) elementary permutation, and, as usual, let Mk denote
the multiplier matrix. Then
U = Mn 1 Pn 1 M P M P A
2 2 1 1 (14.1)
is upper triangular.
8. On e we have de omposed A in the form (14.1), we an use the de ompo-
sition to overwrite a ve tor b with the solution of the linear system Ax = b.
The algorithm goes as follows.
1: b = Mk Pk b; k = 1; 2; : : : ; n 1
2: b = U 1 b
9. The notation in the above algorithm must be properly interpreted. When
we al ulate Pk b, we do not form the matrix Pk and multiply b by it. Instead
we retrieve the index pk and inter hange bk with bp . Similarly, we do not
k
ompute Mk b by matrix multipli ation; rather we ompute

bi = bi mik bk ; i = k + 1; : : : ; n:
Finally, the notation U 1 b does not mean that we invert U and multiply.
Instead we overwrite b with the solution of the triangular system Uy = b.
13
There is a less frequently used strategy, alled omplete pivoting, in whi h both rows and
olumns are inter hanged to bring the largest element of the matrix into the pivot position.
Notation like this is best suited for the lassroom or other situations where
mis on eptions are easy to orre t. It is risky in print, sin e someone will
surely take it literally.
10. One drawba k of the de omposition (14.1) is that it does not provide
a simple fa torization of the original matrix A. However, by a very simple
modi ation of the Gaussian elimination algorithm, we an obtain an LU fa -
torization of Pn 1 P2 P1 A.
11. The method is best derived from a simple example. Consider A with its
third and fth rows inter hanged:
0 1
a11 a12 a13 a14 a15
Ba21 a22 a23 a24 a25 C
B
C
Ba51 a52 a53 a54 C:
a55 C
B
B C
a41 a42 a43 a44 a45 A
a31 a32 a33 a34 a35
If one step of Gaussian elimination is performed on this matrix, we get
0 1
a11 a12 a13 a14 a15
Bm21
B
a022 a023 a024 a025 C
C
Bm51
B
a052 a053 a054 a055 C
C;
a042 a043 a044 0
B C
m41 a45 A
m31 a032 a033 a034 a035
where the numbers mij and a0ij are the same as the numbers we would have
obtained by Gaussian elimination on the original matrix | after all, they are
omputed by the same formulas:
mi1 = mi1 =m11 ;
a0ij = aij mi1 a1j :
If we perform a se ond step of Gaussian elimination, we get
0 1
a11 a12 a13 a14 a15
B
m
B 21 a022 a023 a024 a025 CC
Bm51
B
m52 a0053 a0054 a0055 C
C;
a0043 a0044 a0045 A
B C
m41 m42
m31 m32 a0033 a0034 a0035
where on e again the mij and the a00ij are from Gaussian elimination on the
original matrix. Now note that this matrix diers from the one we would
get from Gaussian elimination on the original matrix only in having its third
and fth rows inter hanged. Thus if at the third step of Gaussian elimination
we de ide to use the fth row as a pivot and ex hange both the row of the
submatrix and the multipliers, it would be as if we had performed Gaussian

elimination without pivoting on the original matrix with its third and fth
rows inter hanged.
12. This last observation is ompletely general.
If in the ourse of Gaussian elimination with partial pivoting both
the multipliers and the matrix elements are inter hanged, the re-
sulting array ontains the LU de omposition of
Pn 1 P P A:
2 1
13. The following ode implements this variant of Gaussian elimination.

do 60 k=1,n-1
maxa = abs(a(k,k))
p(k) = k
do 10 i=k+1,n
if (abs(a(i,k)) .gt. maxa) then
maxa = abs(a(i,k))
p(k) = i
end if
10 ontinue
do 20 j=1,n
temp = a(k,j)
a(k,j) = a(p(k),j) (14.2)
a(p(k),j) = temp
20 ontinue
do 30 i=k+1,n
30 ontinue
do 50 j=k+1,n
do 40 i=k+1,n
a(i,j) = a(i,j) - a(i,k)*a(k,j)
40 ontinue
50 ontinue
60 ontinue
The loop ending at 10 nds the index of the pivot row. The loop ending at 20
swaps the pivot row with row k. Note that the loop goes from 1 to n, so that
the multipliers are also swapped. The rest of the ode is just like Gaussian
elimination without pivoting.
14. To solve the linear system Ax = b, note that
LUx = Pn 1 P P Ax = Pn P P b:
2 1 1 2 1
Thus we rst perform the inter hanges on the ve tor b and pro eed as usual
to solve the two triangular systems involving L and U .
BLAS
15. Although we have re ommended the blas for matrix omputations, we
have ontinued to ode at the s alar level. The reason is that Gaussian elim-
ination is a exible algorithm that an be adapted to many spe ial purposes.
But to adapt it you need to know the details at the lowest level.
16. Nonetheless, the algorithm (14.2) oers many opportunities to use the
blas. For example, the loop
maxa = abs(a(k,k))
p(k) = k
do 10 i=k+1,n
if (abs(a(i,k)) .gt. maxa) then
maxa = abs(a(i,k))
p(k) = i
end if
10 ontinue
an be repla ed with a all to a blas that nds the position of the largest
omponent of the ve tor
(a(k,k); a(k+1,k); : : : ; a(n,k))T :
(In the anoni al blas, the subprogram is alled imax.) The loop
do 20 j=1,n
temp = a(k,j)
a(k,j) = a(p(k),j)
a(p(k),j) = temp
20 ontinue
an be repla ed by a all to a blas (swap in the anon) that swaps the ve tors
(a(k,k); a(k,k+1); : : : ; a(k,n))
and
(a(p(k),k); a(p(k),k+1); : : : ; a(p(k),n)):
The loop
do 30 i=k+1,n
30 ontinue
an be repla ed by a all to a blas (s al) that multiplies a ve tor by a s alar

to ompute
a(k,k) 1 (a(k,k+1); a(k,k+2); : : : ; a(k,n))T :
17. The two inner loops

do 50 j=k+1,n
do 40 i=k+1,n
a(i,j) = a(i,j) - a(i,k)*a(k,j)
40 ontinue
50 ontinue
in whi h most of the work is done, are the most interesting of all. The innermost
loop an, of ourse, be repla ed by an axpy that omputes the ve tor
0 1 0 1
a(k+1,j) a(k+1,k)
B .. C
a(k,j) B .. C
:
. A . A
a(n,j) a(n,k):
However, the two loops together ompute the dieren e

0
a(k+1,k+1) a(k+1,n)
1 0
a(k+1,k)
1
B .. .. C B .. C
(a(k,k+1); : : : ; a(k,n));
. . A . A
a(n,k+1) a(n,n) a(n,k)
i.e., the sum of a matrix and an outer produ t. Sin e this operation o urs
frequently, it is natural to assign its omputation to a blas, whi h is alled
ger in the anon.
18. The subprogram ger is dierent from the blas we have onsidered so far
in that it ombines ve tors and matri es. Sin e it requires O(n2 ) it is alled a
level-two blas.
The use of level-two blas an redu e the dependen y of an algorithm on
array orientations. For example, if we repla e the loops in x14.17 with an
invo ation of ger, then the latter an be oded in olumn- or row-oriented
form as required. This feature of the level-two blas makes the translation
from fortran, whi h is olumn oriented, to C, whi h is row oriented, mu h
easier. It also makes it easier to write ode that takes full advantage of ve tor
super omputers.14
14
There is also a level-three blas pa kage that performs operations between matri es. Used
with a te hnique alled blo king, they an in rease the e ien y of some matrix algorithms,
espe ially on ve tor super omputers, but at the ost of twisting the algorithms they benet
out of their natural shape.
Upper Hessenberg and tridiagonal systems

19. As we have mentioned, Gaussian elimination an be adapted to matri es
of spe ial stru ture. Spe i ally, if an element in the pivot olumn is zero,
we an save operations by skipping the elimination of that element. We will
illustrate the te hnique with an upper Hessenberg matrix.
20. A matrix A is upper Hessenberg if
i > j + 1 =) aij = 0:
Diagrammati ally, A is upper Hessenberg if it has the form
0 1
X X X X X
B
B X X X X XC
C
A=B
BO
B
X X X C:
XC
C
O O X X XA
O O O X X
Here we are using a onvention of Jim Wilkinson in whi h a O stands for a zero
element, while an X stands for an element that may or may not be zero (the
presumption is that it is not).
21. From this diagrammati form it is lear that in the rst step of Gaussian
elimination only the (2; 1)-element needs to be eliminated | the rest are al-
ready zero. Thus we have only to subtra t a multiple of the rst row from the
se ond to get a matrix of the form
0 1
X X X X X
BO X X X XC
B
C
XC :
B C
BO X X X
B C
O O X X XA
O O O X X
At the se ond stage, we subtra t a multiple of the se ond row from the third
to get a matrix of the form
0 1
X X X X X
BO X X X XC
B
C
C:
B
BO O X X XC
B C
O O X X XA
O O O X X
Two more steps of the pro ess yield the matri es
0 1 0 1
X X X X X X X X X X
BO X X X XC BO X X X XC
B B
C C
B
BO
B
O X X XCC
C
and B
BO
B
O X X XCC;
C
O O O X XA O O O X XA
O O O X X O O O O X
the last of whi h is upper triangular.

22. Here is ode for the redu tion of an upper Hessenberg matrix to triangular
form. For simpli ity, we leave out the pivoting. The multipliers overwrite the
subdiagonal elements, and the nal triangular form overwrites the matrix.
do 20 k=1,n-1
a(k+1,k) = a(k+1,k)/a(k,k)
do 10 j=k+1,n
a(k+1,j) = a(k+1,j) - a(k+1,k)*a(k,k+1)
10 ontinue
20 ontinue
23. An operation ount for this algorithm an be obtained as usual by

integrating over the loops:
Z nZ n
1
dj dk = n2 :
0 k 2
Thus, by taking advantage of the zero elements in an upper Hessenberg matrix
we have redu ed its triangularization from an order n3 pro ess to an order
n2 pro ess. In fa t the triangularization is of the same order as the ba k
substitution.
24. Even greater simpli ations are obtained when the matrix is tridiagonal ,
i.e., of the form 0 1
X X O O O
B
B X X X O OC
C
OC :
B C
B O X X X
B C
O O X X XA
O O O X X
Here we not only redu e the number of rows to be eliminated, but (in the
absen e of pivoting) when two rows are ombined, only the diagonal element
of the se ond is altered, so that the omputation redu es to statement
a(k+1,k+1) = a(k+1,k+1) - a(k+1,k)*a(k,k+1)
The result is an O(n) algorithm. Of ourse, we would not waste a square array
to store the 3n 2 numbers that represent a tridiagonal matrix. Instead we
might store the diagonal elements in linear arrays. The details of this algorithm
are left as an exer ise.
Le ture 15
Linear Equations
Ve tor Norms
Matrix Norms
Relative error
Sensitivity of Linear Systems
Ve tor norms
1. We are going to onsider the sensitivity of linear systems to errors in their
oe ients. To do so, we need some way of measuring the size of the errors in
the oe ients and the size of the resulting perturbation in the solution. One
possibility is to report the errors individually, but for matri es this amounts
to n2 numbers | too many to examine one by one. Instead we will summarize
the sizes of the errors in a single number alled a norm. There are norms for
both matri es and ve tors.
2. A ve tor norm is a fun tion k k : Rn ! R that satises
1: x 6= 0 =) kxk > 0;
2: kxk = jj kxk; (15.1)
3: kx + yk kxk + kyk:
The rst ondition says that the size of a nonzero ve tor is positive. The
se ond says that if a ve tor is multiplied by a s alar its size hanges propor-
tionally. The third is a generalization of the fa t that one side of a triangle
is not greater than the sum of the other two sides: see Figure 15.1. A useful
variant of the triangle inequality is
kx yk kxk kyk:
3. The onditions satised by a ve tor norm are satised by the absolute value
fun tion on the line | in fa t, the absolute value is a norm on R1 . This means
that many results in analysis an be transferred mutatis mutandis from the
real line to Rn .
4. Although there are innitely many ve tor norms, the ones most ommonly
found in pra ti e are the one-, two-, and innity-norms. They are dened as
follows:
1: kxk1 = q i jxi j;
P
2: kxk2 =
P
i xi ;
2
1: kxk1 = maxi jxi j:
113
x+y
y
Figure 15.1. The triangle inequality.
In R3 the two-norm of a ve tor is its Eu lidean length. Hen e the two-

norm is also alled the Eu lidean norm. The one norm is sometimes alled the
Manhattan norm, be ause it is the distan e you have to travel to get from A
to B in Manhattan (think about it). The innity-norm is also alled the max
norm. All three norms are easy to ompute. They satisfy
kxk kxk kxk1:
1 2
Matrix norms
5. Matrix norms are dened in analogy with ve tor norms. Spe i ally, a
matrix norm is a fun tion k k : Rmn ! R that satises
1: A 6= 0 =) kAk > 0;
2: kAk = jj kAk;
3: kA + B k kAk + kB k:
6. The triangle inequality allows us to bound the norm of the sum of two
ve tors in terms of the norms of the individual ve tors. To get bounds on the
produ ts of matri es, we need another property. Spe i ally, let k k stand for
a family of norms dened for all matri es. Then we say that k k is onsistent
if
kAB k kAkkB k;
whenever the produ t AB is dened. A ve tor norm k kv is onsistent with a
matrix norm k kM if kAxkv kAkM kxkv .
7. The requirement of onsisten y frustrates attempts to generalize the ve tor
innity-norm in a natural way. For if we dene kAk = maxi jaij j, then
! ! !

1 1 1 1

=

2 2

= 2:

1 1 1 1

2 2

But ! !

1 1 1 1 = 1 1 = 1:

1 1 1 1
This is one reason why the matrix one- and innity-norms have ompli ated
denitions. Here they are | along with the two-norm, whi h gets new name:
1: kAk1 = max i jaij j;
P
q j
F: kAkF =
P
a2ij ;
i;jP
1: kAk1 = maxi j jaij j:
The norm k kF is alled the Frobenius norm.15
The one, Frobenius, and innity norms are onsistent. When A is a ve tor,
the one- and innity-norms redu e to the ve tor one- and innity-norms, and
the Frobenius norm redu es to the ve tor two-norm.
Be ause the one-norm is obtained by summing the absolute values of the
elements in ea h olumn and taking the maximum, it is sometimes alled the
olumn-sum norm. Similarly, the innity-norm is alled the row-sum norm.
Relative error
8. Just as we use the absolute value fun tion to dene the relative error in
a s alar, we an use norms to dene relative errors in ve tors and matri es.
Spe i ally, the relative error in y as an approximation to x is the number
=
ky xk :
kxk
The relative error in a matrix is dened similarly.
9. For s alars there is a lose relation between relative error and the number
of orre t digits: if the relative error in y is , then x and y agree to roughly
log de imal digits. This simple relation does not hold for the omponents
of a ve tor, as the following example shows.
Let 0 1 0 1
1:0000 1:0002
x=B 0:0100 A and y = B 0:0103 A :
C C
0:0001 0:0002
In the innity-norm, the relative error in y as an approximation to x is 3 10 4 .
But the relative errors in the individual omponents are 2 10 4 , 3 10 2 , and 1.
The large omponent is a urate, but the smaller omponents are ina urate
in proportion as they are small. This is generally true of the norms we have
15
We will en ounter the matrix two-norm later in x17.
introdu ed: the relative error gives a good idea of the a ura y of the larger
omponents but says little about small omponents.
10. It sometimes happens that we are given the relative error of y as an
approximation to x and want the relative error of x as an approximation to y.
The following result says that when the relative errors are small, the two are
essentially the same.
If
ky xk < 1; (15.2)
kxk
then
kx y k :
kyk 1
To see this, note that from (15.2), we have
kxk ky xk kxk ky k
or
(1 )kxk kyk:
Hen e
kx yk ky xk :
kyk (1 )kxk 1
If = 0:1, then =(1 ) = 0:111. . . , whi h diers insigni antly from .
Sensitivity of linear systems
11. Usually the matrix of a linear system will not be known exa tly. For exam-
ple, the elements of the matrix may be measured. Or they may be omputed
with rounding error. In either ase, we end up solving not the true system
Ax = b;
but a perturbed system
A~x~ = b:
It is natural to ask how lose x is to x~. This is a problem in matrix perturbation
theory. From now on, k k will denote both a onsistent matrix norm and a
ve tor norm that is onsistent with the matrix norm.16
12. Let E = A~ A so that
A~ = A + E:
The rst order of business is to determine onditions under whi h A~ is nonsin-
gular.
16
We ould also ask about the sensitivity of the solution to perturbations in b. This is a
very easy problem, whi h we leave as an exer ise.
Let A be nonsingular. If
kA E k < 1;
1
(15.3)
then A + E is nonsingular.
To establish this result, we will show that under the ondition (15.3) if
x 6= 0 then (A + E )x 6= 0. Sin e A is nonsingular (A + E )x = A(I + A 1 E )x 6= 0
if and only if (I + A 1 E )x 6= 0. But
k(I + A 1E )xk = kx + A 1Exk kxk kA 1 E kkxk = (1 kA 1 E k)kxk > 0;
whi h establishes the result.
13. We are now in a position to establish the fundamental perturbation theo-
rem for linear systems.
Let A be nonsingular and let A~ = A + E . If
Ax = b and A~x~ = b;
where b is nonzero, then
kx~ xk kA E k:1
(15.4)
kx~k
If in addition
kA E k < 1;
1
then A~ is nonsingular and

kx~ xk kA E k : 1
(15.5)
kxk 1 kA E k 1
To establish (15.4), multiply the equation (A + E )~x = b by A 1 to get

(I + A 1 E )~x = A 1 b = x:
It follows that
x x~ = A 1 E x~;
or on taking norms
kx x~k kA 1 E kkx~k:
Sin e x~ 6= 0 (be ause b 6= 0), this inequality is equivalent to (15.4).
If = kA 1 E k < 1, the inequality (15.4) says that the relative error in x
as an approximation to x~ is less than one. Hen e by the result in x15.10, the
relative error in x~ as an approximation to x is less than or equal to =(1 ),
whi h is just the inequality (15.5).
Le ture 16
Linear Equations
The Condition of a Linear System
Arti ial Ill-Conditioning
Rounding Error and Gaussian Elimination
Comments on the Error Analysis
The ondition of a linear system

1. As it stands, the inequality
kx~ xk kA 1 E k
kxk 1 kA 1 E k
is not easy to interpret. By weakening it we an make it more intelligible.
First note that
kA E k kA kkE k = (A) kkEAkk ;
1 1
where
(A) = kAkkA 1
k:
If
(A)
kE k < 1;
kAk
then we an write
kE k
(A)
kx~ xk kAk :
kxk 1 (A) kkE
Ak
k
2. Now let's disassemble this inequality. First note that if (A)kE k=kAk is at
all small, say less than 0:1, then the denominator on the right is near one and
has little ee t. Thus we an onsider the approximate inequality
kx~ xk < (A) kE k :
kxk kAk
The fra tion on the left,
kx~ xk ;
kxk
is the relative error in x~ as an approximation to x. The fra tion on the right,
kE k ;
kAk
119
is the relative error in A + E as an approximation to A. Thus the number

(A) mediates the transfer of error from the matrix A to the solution x. As
usual we all su h a number a ondition number | in parti ular, (A) is the
ondition number of A with respe t to inversion.17
3. The ondition number is always greater than one:
1 kI k kAA 1 k kAkkA 1 k = (A):
This means | unfortunately | that the ondition number is a magni ation
onstant: the bound on the error is never diminished in passing from the
matrix to the solution.
4. To get a feel for what the ondition number means, suppose that A is
rounded on a ma hine with rounding unit M , so that a~ij = aij (1 + ij ), where
jij j M. Then in any of the usual matrix norms,
kE k kAk:
M
If we solve the linear system A~x~ = b without further error, we get a solution
that satises
kx~ xk (A) : (16.1)
kxk M
In parti ular, if M = 10 t and (A) = 10k , the solution x~ an have relative

error as large as 10 t+k . Thus the larger omponents of the solution an be
ina urate in their (t k)th signi ant gures (the smaller omponents an be
mu h less a urate; see x15.9). This justies the following rule of thumb. If
(A) = 10k expe t to lose at least k digits in solving the system Ax = b.
Arti ial ill- onditioning

5. Unfortunately, the rule of thumb just stated is subje t to the quali ation
that the matrix must in some sense be balan ed. To see why onsider the
matrix !
1
A= 1 2 ; 1
whose inverse is !
A = 21 11 :
1
The ondition number of this matrix in the innity-norm is nine.

Now multiply the rst row of A by 10 4 to get the matrix
!
^
A= 1 10 4
10 4
2 ;
17
Although it looks like you need a matrix inverse to ompute the ondition number, there
are reliable ways of estimating it from the LU de omposition.
whose inverse is !
A^ 1 = 2 10 1 :
4
104 1
The ondition number of A^ is about 6 104 .
If we now introdu e errors into the fth digits of the elements of A and A^ |
su h errors as might be generated by rounding to four pla es | the innity-
norms of the error matri es will be about kAk1 10 4 = kA^k1 10 4 . Thus,
for A, the error in the solution of Ax = b is approximately
(A)
kE k1 = 9 10 ;4
kAk1
while for A^ the predi ted error is
kE^ k
(A^) ^ 1 = 6:
kAk1
Thus we predi t a small error for A and a large one for A^. Yet the passage from
A to A^ is equivalent to multiplying the rst equation in the system Ax = b by
10 4 , an operation whi h should have no ee t on the a ura y of the solution.
6. What's going on here? Is A^ ill onditioned or is it not? The answer is \It
depends."
7. There is a sense in whi h A^ is ill onditioned. It has a row of order 10 4 ,
and a perturbation of order 10 4 an ompletely hange that row | even make
it zero. Thus the linear system Ax ^ = b is very sensitive to perturbations of
order 10 in the rst row, and that fa t is re e ted in the large ondition
4
number.
8. On the other hand, the errors we get by rounding the rst row of A^ are not
all of order 10 4 . Instead the errors are bounded by
! !
10 4 10 4 10 4 = 10 8
10 8
:
1 2 10 4
2 10 4
The solution of Ax ^ = ^b is insensitive to errors of this form, and hen e the

ondition number is misleading.
9. To put it another way, the ondition number of A^ has to be large to a om-
modate errors that an never o ur in our parti ular appli ation, a situation
that is alled arti ial ill- onditioning. Unfortunately, there is no me hani al
way to distinguish real from arti ial ill- onditioning. When you get a large
ondition number, you have to go ba k to the original problem and take a hard
look to see if it is truly sensitive to perturbations or is just badly s aled.
Rounding error and Gaussian elimination

10. One of the early triumphs of rounding-error analysis was the ba kward
analysis of Gaussian elimination. Although the omplete analysis is tedious,
we an get a good idea of the te hniques and results by looking at examples.
We will start with a 2 2 matrix.
11. Let !
3 : 000 2
A = 1:000 2:000 :: 000
If we perform Gaussian elimination on this matrix in four-digit arithmeti , the

rst (and only) multiplier is
m21 = (a21 =a11 ) = (1:000=3:000) = 0:3333:
Note that if we dene
a~21 = 0:9999;
then
m21 = a~21 =a11 = 0:9999=3:000 = 0:3333:
In other words, the multiplier we ompute with rounding error is the same
multiplier we would get by doing exa t omputations on A with its (2; 1)-
element slightly altered.
Let us ontinue the elimination. The redu ed (2; 2)-element is
a022 = (a22 m21 a12 ) = (2:000 0:3333 2:000) = (2:000 0:6666) = 1:333:
If we repla e a22 by
a~22 = 1:9996;
then
a022 = a~22 m21 a12 = 1:9996 0:3333 2:000 = 1:9996 0:6666 = 1:333:
On e again, the omputed value is the result of exa t omputing
with slightly perturbed input.
To summarize, if we repla e A by the nearby matrix
!
A~ = 30::0000 2:0000
9999 1:9996
and perform Gaussian elimination on A~ without rounding, we get the same
results as we did by performing Gaussian elimination with rounding on A.
12. The above example seems ontrived. What is true of a 2 2 matrix may
be false for a large matrix. And if the matrix is nearly singular, things might
be even worse.
However, onsider the 100 100 matrix

A=I 0:01eeT ;
where eT = (1; 1; : : : ; 1). This matrix has the form
0:01
0 1
0:99 0:01
B
B 0:01 0:99 0:01CC
B
.. .. .. C :
B
. . . C A
0:01 0:01 0:99
It is singular, whi h is about as ill onditioned as you an get.18 On a ma hine
with rounding unit 2:2 10 16 , I omputed the LU de omposition of A and
multiplied the fa tors to get a matrix A~ = (LU ). In spite of the very large
amount of omputation involved, I found that
kA~ Ak1 = 7:4 10 16 ;
whi h is less than four times the rounding unit.
13. Let us now turn from examples to analysis. The tedious part of the
rounding-error analysis of Gaussian elimination is keeping tra k of the errors
made at ea h stage of the elimination. However, you an at h the avor of
the analysis by looking at the rst stage.
First we ompute the multipliers:
a (1 + i1 )
m = (a =a ) = i1
i1 i1 ;
11
a11
where as usual ji1 j M . It follows that if we set
a~i1 = ai1 (1 + i1 ); (16.2)
then
mi1 = a~i1 =a11 : (16.3)
The next step is to redu e the submatrix:
a0ij = (aij mi1 a1j ) = [aij mi1 a1j (1 + ij )(1 + ij )
= [aij mi1 a1j + aij ij mi1 a1j (ij + ij + ij ij ):
It follows that if we set
a~ij = aij + aij ij mi1 a1j (ij + ij + ij ij ); (16.4)
18
However, this does not keep Gaussian elimination from going to ompletion, sin e only
the element u100;100 is zero.
then
a0ij = a~ij mi1 a1j : (16.5)
Now it follows from (16.3) and (16.5) that the matrix
a n
0 1
a11 a12 1
B m21
B
a022 a0 n
C
2 C
B
.. .. ..C ;
B
. . .C
A
mn1 a0n2 a0nn

whi h results from performing Gaussian elimination with rounding error on
A, is the same as what we would get from performing Gaussian elimination
without rounding error on
a n
0 1
a11 a12 1
Ba
B
~21 a~22 a~ n C
A~ = B ..
2 C
B
.. ..C :
. . .C
A
a~n1 a~n2 a~nn
Moreover, A and A~ are near ea h other. From (16.4) it follows that for j > 1
ja~ij aij j (jaij j + 3jmi1 jja1j j)M:
If we assume that the elimination is arried out with pivoting so that jmi1 j 1
and set = maxi;j jaij j, then the bound be omes
ja~ij aij j 4M :
Similarly, (16.2) implies that this bound is also satised for j = 1.
14. All this gives the avor of the ba kward rounding-error analysis of Gaussian
elimination; however, there is mu h more to do to analyze the solution of a
linear system. We will skip the details and go straight to the result.
If Gaussian elimination with partial pivoting is used to solve the
n n system Ax = b on a omputer with rounding unit M , the
omputed solution x~ satises
(A + E )~x = b;
where
kE k '(n) : (16.6)
kAk M
Here ' is a slowly growing fun tion of n that depends on the norm,
and is the ratio of the largest element en ountered in the ourse
of the elimination to the largest element of A.
Comments on the error analysis

15. The ba kward error analysis shows that the omputed solution is the exa t
solution of a slightly perturbed matrix; that is, Gaussian elimination is a stable
algorithm. As we have observed in x6.22 and x7.7, a ba kward error analysis
is a powerful tool for understanding what went wrong in a omputation. In
parti ular, if the bound (16.6) is small, then the algorithm annot be blamed for
ina ura ies in the solution. Instead the responsible parties are the ondition
of the problem and (usually) errors in the initial data.
16. The fun tion ' in the bound is a small power of n, say n2 . However,
any mathemati ally rigorous ' is invariably an overestimate, and the error is
usually of order n or less, depending on the appli ation.
17. The number in the bound is alled the growth fa tor be ause it measures
the growth of elements during the elimination. If it is large, we an expe t a
large ba kward error. For example, the growth fa tor in the example of x14.2,
where Gaussian elimination failed, was large ompared to the rounding unit.
In this light, partial pivoting an be seen as a way of limiting the growth by
keeping the multipliers less than one.
18. Unfortunately, even with partial pivoting the growth fa tor an be on the
order of 2n . Consider, for example, the matrix
0 1
1 0 0 0 1
B 1 1 0 0 1C
B
C
W =B B
1 1 1 0 1 C:
C
1 1 1 1 1C
B
A
1 1 1 1 1
Gaussian elimination with partial pivoting applied to this matrix yields the
following sequen e of matri es.
0 1 0 1 0 1
1 0 0 0 1 1 0 0 0 1 1 0 0 0 1
B0 1 0 0 2C B0 1 0 0 2C B0 1 0 0 2C
B B B
C C C
B0 1 1 0 2 B0 0 1 0 4 B0 0 1 0 4
B C B C B C
C C C
0 1 1 1 2A 0 0 1 1 4A 0 0 0 1 8C
B C B C B
A
0 1 1 1 2 0 0 1 1 4 0 0 0 1 8
0 1
1 0 0 0 1
B0 1 0 0 2C
B
C
B0 0 1 0 4C:
B C
0 0 0 1 8C
B
A
0 0 0 0 16
Clearly, if Gaussian elimination is performed on a matrix of order n having
this form, the growth fa tor will be 2n 1 .
19. Does this mean that Gaussian elimination with partial pivoting is not to
be trusted? The re eived opinion has been that examples like the one above
o ur only in numeri al analysis texts: in real life there is little growth and
often a de rease in the size of the elements. Re ently, however, a naturally
o urring example of exponential growth has been en ountered | not surpris-
ingly in a matrix that bears a family resemblan e to W . Nonetheless, the
re eived opinion stands. Gaussian elimination with partial pivoting is one of
the most stable and e ient algorithms ever devised. Just be a little areful.
Le ture 17
Linear Equations
Introdu tion to a Proje t
More on Norms
The Wonderful Residual
Matri es with Known Condition Numbers
Invert and Multiply
Cramer's Rule
Submission
Introdu tion to a proje t

1. In this proje t we will use matlab to investigate the stability of three
algorithms for solving linear systems. The algorithms are Gaussian elimination,
invert-and-multiply, and Cramer's rule.
More on norms
2. We have mentioned the matrix two-norm in passing. Be ause the two-
norm is expensive to ompute, it is used hie y in mathemati al investigations.
However, it is ideal for our experiments; and sin e matlab has a fun tion norm
that omputes the two-norm, we will use it in this proje t. From now on k k
will denote the ve tor and matrix two-norms.
3. The matrix two-norm is dened by19
kAk = kmax
xk
kAxk:
=1
Here is what you need to know about the two-norm for this proje t.
1. The matrix two-norm of a ve tor is its ve tor two-norm.
2. The matrix two-norm is onsistent; that is, kAB k kAkkB k, when-
ever AB is dened.
3. kxyT k = kxkkyk.
4. kdiag(d1 ; : : : ; dn )k = maxi fjdi jg.
5. If U T U = I and V T V = I (we say U and V are orthogonal ), then
kU TAV k = kAk.
19
If you nd this denition onfusing, think of it this way. Given a ve tor x of length one,
the matrix A stret hes or shrinks it into a ve tor of length kAxk. The matrix two-norm of A
is the largest amount it an stret h or shrink a ve tor.
127
All these properties are easy to prove from the denition of the two-norm, and
you might want to try your hand at it. For the last property, you begin by
establishing it for the ve tor two-norm.
With these preliminaries out of the way, we are ready to get down to
business.
The wonderful residual
4. How an you tell if an algorithm for solving the linear system Ax = b is
stable | that is, if the omputed solution x~ satises a slightly perturbed system
(A + E )~x = b; (17.1)
where
kE k = O( )?
kAk M
One way is to have a ba kward rounding-error analysis, as we do for Gaussian

elimination. But la king that, how an we look at a omputed solution and
determine if it was omputed stably?
5. One number we annot trust is the relative error
kx~ xk :
kxk
We have seen that even with a stable algorithm the relative error depends on
the ondition of the problem.
6. If x~ satises (17.1), we an obtain a lower bound on kE k by omputing
the residual
r = b Ax~:
Spe i ally, sin e 0 = b (A + E )~x = r + E x~, we have krk kE x~k kE kkx~k.
It follows that
kE k krk :
kAk kAkkx~k
Thus if the relative residual r
kAkkx~k
has a large norm, we know that the solution was not omputed stably.
On the other hand, if the relative residual is small, the result was omputed
stably. To see this, we must show that there is a small matrix E su h that
(A + E )~x = b. Let
rx~T
E=
kx~k2 :
Then
rx~T x~ rkx~k2
b (A + E )~x = (b Ax~) E x~ = r
kx~k2 = r kx~k2 = 0;
so that (A + E )~x = b. But
kE k = krx~ k ; T
kAk kx~k kAk 2
and it is easy to see that krx~ k = krkkx~k. Hen e,

T
kE k = krk :
kAk kAkkx~k
7. What we have shown is that the relative residual norm
krk
kAkkx~k
is a reliable indi ation of stability. A stable algorithm will yield a relative
residual norm that is of the order of the rounding unit; an unstable algorithm
will yield a larger value.
Matri es with known ondition numbers
8. To investigate the ee ts of onditioning, we need to be able to gener-
ate nontrivial matri es of known ondition number. Given an order n and a
ondition number we will take A in the form
A = UDV T ;
where U and V are random orthogonal matri es (i.e., random matri es satis-
fying U T U = V T V = I ), and
D = diag(1; ; ; : : : ; 1 ):
1 2
n 1 n 1
The fa t that the ondition number of A is follows dire tly from the properties
of the two-norm enumerated in x17.3.
9. The rst part of the proje t is to write a fun tion
fun tion a = ondmat(n, kappa)
to generate a matrix of order n with ondition number . To obtain a ran-
dom orthogonal matrix, use the matlab fun tion rand to generate a random,
normally distributed matrix. Then use the fun tion qr to fa tor the random
matrix into the produ t QR of an orthogonal matrix and an upper triangular
matrix, and take Q for the random orthogonal matrix.
You an he k the ondition of the matrix you generate by using the fun -
tion ond.
Invert and multiply

10. The purpose here is to ompare the stability of Gaussian elimination
with the invert-and-multiply algorithm for solving Ax = b. Write a fun tion
fun tion invmult(n, kap)
where n is the order of the matrix A and kap is a ve tor of ondition numbers.
For ea h omponent kap(i), the fun tion should do the following.
1. Generate a random n n matrix A of ondition kap(i).
2. Generate a (normally distributed) random n-ve tor x.
3. Cal ulate b = Ax.
4. Cal ulate the solution of the system Ax = b by Gaussian elimination.
5. Cal ulate the solution of the system Ax = b by inverting A and
multiplying b by the inverse.
6. Print
[i, kap(i); reg, rrg; rei, rri
where
reg is the relative error in the solution by Gaussian elimination,
rrg is the relative residual norm for Gaussian elimination,
rei is the relative error in the invert-and-multiply solution,
rri is the relative residual norm for invert-and-multiply.
11. The matlab left divide operator \\" is implemented by Gaussian elimi-
nation. To invert a matrix, use the fun tion inv.
Cramer's rule
12. The purpose here is to ompare the stability of Gaussian elimination with
Cramer's rule for solving the 2 2 system Ax = b. For su h a system, Cramer's
rule an be written in the form
x1 = (b1 a22 b2 a12 )=d;
x2 = (b2 a11 b1 a21 )=d;
where
d = a11 a22 a21 a12 :
13. Write a fun tion
fun tion ramer(kap)
where kap is a ve tor of ondition numbers. For ea h omponent kap(i), the
fun tion should do the following.
1. Generate a random 2 2 matrix A of ondition kap(i).

2. Generate a (normally distributed) random 2-ve tor x.
3. Cal ulate b = Ax.
4. Cal ulate the solution of the system Ax = b by Gaussian elimination.
5. Cal ulate the solution of the system Ax = b by Cramer's rule.
6. Print
[i, kap(i); reg, rrg; re , rr
where
reg is the relative error in the solution by Gaussian elimination,
rrg is the relative residual norm for Gaussian elimination,
re is the relative error in the solution by Cramer's rule,
rr is the relative residual norm for Cramer's rule.
Submission
14. Run your programs for
kap = (1; 104 ; 108 ; 1012 ; 1016 )
using the matlab ommand diary to a umulate your results in a le. Edit
the diary le and at the top put a brief statement in your own words of what
the results mean.
Polynomial Interpolation
133
Le ture 18
Quadrati Interpolation
Shifting
Lagrange Polynomials and Existen e
Uniqueness
Quadrati interpolation
1. Muller's method for nding a root of the equation f (t) = 0 is a three-
point iteration (see x4.19). Given starting values x0 , x1 , x2 and orresponding
fun tion values f0 , f1 , f2 , one determines a quadrati polynomial
p(t) = a0 + a1 t + a2 t2
satisfying
p(xi ) = fi ; i = 0; 1; 2 : (18.1)
The next iterate x3 is then taken to be the root nearest x2 of the equation
p(t) = 0.
2. At the time the method was presented, I suggested that it would be instru -
tive to work through the details of its implementation. One of the details is the
determination of the quadrati polynomial p satisfying (18.1), an example of
quadrati interpolation. Sin e quadrati interpolation exhibits many features
of the general interpolation problem in readily digestible form, we will treat it
rst.
3. If the equations (18.1) are written out in terms of the oe ients a0 , a1 ,
a2 , the result is the linear system
0 10 1 0 1
1 x0 x20 a0 f0
1 x1 A a1 A = f1 A :
x21 C
B B C B C
1 x2 x22 a2 f2
In prin iple, we ould nd the oe ients of the interpolating polynomial by
solving this system using Gaussian elimination. There are three obje tions to
this pro edure.
4. First, it is not at all lear that the matrix of the system | it is alled
a Vandermonde matrix | is nonsingular. In the quadrati ase it is possible
to see that it is nonsingular by performing one step of Gaussian elimination
and verifying that the determinant of the resulting 2 2 system is nonzero.
However, this approa h breaks down in the general ase.
135
5. A se ond obje tion is that the pro edure is too expensive. This obje tion
is not stri tly appli able to the quadrati ase; but in general the pro edure
represents an O(n3 ) solution to a problem whi h, as we will see, an be solved
in O(n2 ) operations.
6. Another obje tion is that the approa h an lead to ill- onditioned systems.
For example, if x0 = 100, x1 = 101, x2 = 102, then the matrix of the system is
0 1
1 100 10; 000
V =B 1 101 10; 201C
A:
1 102 10; 404
The ondition number of this system is approximately 2108 .
Now the unequal s ale of the olumns of V suggests that there is some
arti ial ill- onditioning in the problem (see x16.5) | and indeed there is. But
if we res ale the system, so that its matrix assumes the form
0 1
1 1:00 1:0000
V^ = B
1 1:01 1:0201C
A;
1 1:02 1:0404
the ondition number hanges to about 105 | still un omfortably large, though
perhaps good enough for pra ti al purposes. This ill- onditioning, by the way,
is real and will not go away with further s aling.
Shifting
7. By rewriting the polynomial in the form
p(t) = b0 + b1 (t x2 ) + b2 (t x2 )2 ;
we an simplify the equations and remove the ill- onditioning. Spe i ally, the
equations for the oe ients bi be ome
0 10 1 0 1
1 x0 x2 (x0 x2 )2 b0 f0
1 x1 x2 (x1 x2 )2 C A b1 A = f1 A :
B B C B C
1 0 0 b2 f2
From the third equation we have
b0 = f2 ;
from whi h it follows that ! ! !
x0 x2 (x0 x2 )2 b1 = f0 f2 :
x1 x2 (x1 x2 )2 b2 f1 f 2
For our numeri al example, this equation is
! ! !
2 4 b 1 = f0 f2 ;
1 1 b2 f1 f2
whi h is very well onditioned.
18. Polynomial Interpolation 137
Polynomial interpolation
8. The quadrati interpolation problem has a number of features in ommon
with the general problem.
1. It is of low order. High-order polynomial interpolation is rare.
2. It was introdu ed to derive another numeri al algorithm. Not all
polynomial interpolation problems originate in this way, but many
numeri al algorithms require a polynomial interpolant.
3. The appearan e of the problem and the nature of its solution hange
with a hange of basis.20 When we posed the problem in the natural
basis 1, t, t2 , we got an ill- onditioned 3 3 system. On the other
hand, posing the problem in the shifted basis 1, t x2 , (t x2 )2 lead
to a well- onditioned 2 2 system.
9. The general polynomial interpolation problem is the following.
Given points (x0 ; f0 ), (x1 ; f1 ), . . . , (xn ; fn ), where the xi are dis-
tin t, determine a polynomial p satisfying
1: deg(p) n;
2: p(xi ) = fi; i = 0; 1; : : : ; n:
10. If we write p in the natural basis to get

p(t) = a0 + a1 t + a2 t2 + + an tn ;
the result is the linear system
1 x0 x20 xn0 a0
0 10 1 0 1
f0
B1
B
x1 x21 xn1 C C B a1 C
B C
C = B
B f1 C
B C
B ..
B
.. .. .. C B .. C B .. C
CB : (18.2)
. . . . A . A . C A
1 xn x2n xnn an fn
The matrix of this system is alled a Vandermonde matrix. The dire t solution
of Vandermonde systems by Gaussian elimination is not re ommended.
Lagrange polynomials and existen e
11. The existen e of the interpolating polynomial p an be established in the
following way. Suppose we are given n + 1 polynomials `j (t) that satisfy the
following onditions:
6= j ,
`j (xi ) = 10 ifif ii = (18.3)
j.
20
The term \basis" is used here in its usual sense. The spa e of, say, quadrati polynomials
is a ve tor spa e. The fun tions 1, t, t2 form a basis for that spa e. So do the fun tions 1,
t a, (t a)2 .
Then the interpolating polynomial has the form

p(t) = f0 `0 (t) + f1 `1 (t) + + fn`n (t): (18.4)
To establish this result, note that when the right-hand side of (18.4) is evalu-
ated at xi , all the terms ex ept the ith vanish, sin e `j (xi ) vanishes for j 6= i.
This leaves fiì (xi ), whi h is equal to fi , sin e ì (xi ) = 1. In equations,
Xn
p(xi ) = fj `j (xi ) = fiì (xi ) = fi :
j =0
12. We must now show that polynomials `j having the properties (18.3) a tu-
ally exist. For n = 2, they are
`0 (t) = t x )(t x ) ; `1 (t) = (t x )(t x )
x x )(x x ) ;
( 1 2 0 2
x
( 0 x )(x x )
1 0 2 ( 1 0 1 2
(t x )(t x )
`2 (t) = x x )(x x ) :
( 2
0
0 2
1
It is easy to verify that these polynomials have the desired properties.

13. Generalizing from the quadrati ase, we see that the following polynomials
do the job:
iY
=n
t xi
`j (t) = ; j = 0; : : : ; n:
i=0 xj xi
i6=j
These polynomials are alled Lagrange polynomials.
14. One onsequen e of the existen e theorem is that equation (18.2) has a
solution for any right-hand side. In other words, the Vandermonde matrix for
n + 1 distin t points x0 , . . . , xn is nonsingular.
Uniqueness
15. To establish the uniqueness of the interpolating polynomial, we use the
following result from the theory of equations.
If a polynomial of degree n vanishes at n + 1 distin t points, then
the polynomial is identi ally zero.
16. Now suppose that in addition to the polynomial p the interpolation prob-
lem has another solution q. Then r(t) = p(t) q(t) is of degree not greater
than n. But sin e r(xi ) = p(xi ) q(xi ) = fi fi = 0, the polynomial r vanishes
at n + 1 points. Hen e r vanishes identi ally, or equivalently p = q.
17. The ondition that deg(p) n in the statement of the interpolation
problem appears unnatural to some people. \Why not require the polynomial
to be exa tly of degree n?" they ask. The uniqueness of the interpolant provides
an answer.
Suppose, for example, we try to interpolate three points lying on a straight
line by a quadrati . Now the line itself is a linear polynomial that interpolates
the points. By the uniqueness theorem, the result of the quadrati interpolation
must be that same straight line. What happens, of ourse, is that the oe ient
of t2 omes out zero.
Le ture 19
Syntheti Division
The Newton Form of the Interpolant
Evaluation
Existen e and Uniqueness
Divided Dieren es
Syntheti division
1. The interpolation problem does not end with the determination of the inter-
polating polynomial. In many appli ations one must evaluate the polynomial
at a point t. As we have seen, it requires no work at all to determine the
Lagrange form of the interpolant: its oe ients are the values fi themselves.
On the other hand, the individual Lagrange polynomials are tri ky to evaluate.
For example, produ ts of the form
(x0 xi ) (xi 1 xi )(xi+1 xi ) (xn xi )
an easily over ow or under ow.
2. Although the oe ients of the natural form of the interpolant
p(t) = an tn + an 1 tn 1 + an 1 tn 1 + + a1 t + a0 (19.1)
are not easy to determine, the polynomial an be e iently and stably evalu-
ated by an algorithm alled syntheti division or nested evaluation.
3. To derive the algorithm, write (19.1) in the nested form
p(t) = (( (((an )t + an 1 )t + an 2 ) )t + a1 )t + a0 : (19.2)
(It is easy to onvin e yourself that (19.1) and (19.2) are the same polynomial
by looking at, say, the ase n = 3. More formally, you an prove the equality
by an easy indu tion.) This form naturally suggests the su essive evaluation
an ;
(a n )t + a n 1 ;
((an )t + an 1 )t + an 2 ;

(( ((an )t + an 1 )t + an 2 ) )t + a1 ;
(( ((an )t + an 1 )t + an 2 ) )t + a1 )t + a0 :
At ea h step in this evaluation the previously al ulated value is multiplied by
t and added to a oe ient. This leads to the following simple algorithm.
141
p = a[n;
for (i=n-1; i>=0; i--)
p = p*t + a[i;
4. Syntheti division is quite e ient, requiring only n additions and n mul-
tipli ations. It is also quite stable. An elementary rounding-error analysis
will show that the omputed value of p(t) is the exa t value of a polynomial p~
whose oe ients dier from those of p by relative errors on the order of the
rounding unit.
The Newton form of the interpolant
5. The natural form of the interpolant is di ult to determine but easy to
evaluate. The Lagrange form, on the other hand, is easy to determine but
di ult to evaluate. It is natural to ask, \Is there a ompromise?" The
answer is, \Yes, it is the Newton form of the interpolant."
6. The Newton form results from hoosing the basis
1; t x0 ; (t x0 )(t x1 ); : : : ; (t x0 )(t x1 ) (t xn 1 ); (19.3)
or equivalently from writing the interpolating polynomial in the form
p(t) = 0 + 1 (t x0 ) + 2 (t x0 )(t x1 ) + (19.4)
+ n (t x0 )(t x1 ) (t xn 1 ):
To turn this form of the interpolant into an e ient omputational tool, we
must show two things: how to determine the oe ients and how to evaluate
the resulting polynomial. The algorithm for evaluating p(t) is a variant of syn-
theti division, and it will be onvenient to derive it while the latter algorithm
is fresh in our minds.
Evaluation
7. To derive the algorithm, rst write (19.4) in nested form:
p(t) = (( ((( n )(t xn 1 )
+ n 1 )(t xn 2 ) + n 2 ) )(t x1 ) + 1 )(t x0 ) + 0 :
From this we see that the nested Newton form has the same stru ture as the
nested natural form. The only dieren e is that at ea h nesting the multiplier
t is repla ed by (t xi ). Hen e we get the following algorithm.
p = [n;
for (i=n-1; i>=0; i--)
p = p*(t-x[i) + [i;
8. This algorithm requires 2n additions and n multipli ations. It is ba kward
stable.
Existen e and uniqueness

9. The existen e of the Newton form of the interpolant is not a foregone
on lusion. Just be ause one has written down a set of n + 1 polynomials, it
does not follow that all polynomials of degree n an be expressed in terms of
them. For this reason we will now establish the existen e of the Newton form,
before going on to show how to ompute it.21
10. We begin by evaluating p(xi ), where p is in the Newton form (19.4). Now
p(x0 ) = 0 + 1 (x0 x0 ) + 2 (x0 x0 )(x0 x1 ) +
+ n (x0 x0 )(x0 x1 ) (x0 xn 1 )
= 0 ;
the last n terms vanishing be ause they ontain the zero fa tor (x0 x0 ).
Similarly,
p(x1 ) = 0 + 1 (x1 x0 ) + 2 (x1 x0 )(x1 x1 ) +
+ n (x1 x0 )(x1 x1 ) (x1 xn 1 )
= 0 + 1 (x1 x0 ):
In general, p(xi ) will ontain only i +1 nonzero terms, sin e the last n i terms
ontain the fa tor (xi xi ).
It now follows from the interpolation onditions fi = p(xi ) that we must
determine the oe ients i to satisfy
f0 = 0 ;
f1 = 0 + 1 (x1 x0 ); (19.5)

fn = 0 + 1 (xn x0 ) + + n (xn x0 )(xn x1 ) (xn xn 1 ):
This is a lower triangular system whose diagonal elements are nonzero. Hen e
the system is nonsingular, and there are unique oe ients i that satisfy the
interpolation onditions.
11. The matrix of the system (19.5), namely
0
1 0 0 0
1
B1
B
(x 1 x 0 ) 0 0 C
C
B1
B
( x 2 x 0 ) ( x2 x )(
0 x2 x1 ) 0 C
C ;
B .. . . ..
B C
.
.. .. . C
A
1 (xn x0 ) (xn x0 )(xn x1 ) (xn x )(xn x ) (xn xn )

0 1 1
The existen e also follows from the fa t that any sequen e fpi g
21 n
of polynomials su h
that pi is exa tly of degree i forms a basis for the set of polynomials of degree n (see x23.6).
i=0
The approa h taken here, though less general, has pedagogi al advantages.
is analogous to the Vandermonde matrix in the sense that its (i; j )-element is
the (j 1)th basis element evaluated at xi 1 . The orresponding analogue for
the Lagrange basis is the identity matrix. The in reasingly simple stru ture of
these matri es is re e ted in the in reasing ease with whi h we an form the
interpolating polynomials in their respe tive bases.
12. An interesting onsequen e of the triangularity of the system (19.5) is
that the addition of new points to the interpolation problem does not ae t
the oe ients we have already omputed. In other words,
0 is the 0-degree polynomial interpolating
(x0 ; f0 );
0 + 1 (t x1 ) is the 1-degree polynomial interpolating
(x0 ; f0 ); (x1 ; f1 );
0 + 1 (t x0 ) + 2 (t x0 )(t x1 ) is the 2-degree polynomial interpolating
(x0 ; f0 ); (x1 ; f1 ); (x1 ; f1 );
and so on.
Divided dieren es
13. In prin iple, the triangular system (19.5) an be solved in O(n2 ) operations
to give the oe ients of the Newton interpolant. Unfortunately, the oe-
ients of this system an easily over ow or under ow. However, by taking a
dierent view of the problem, we an derive a substantially dierent algorithm
that will determine the oe ients in O(n2 ) operations.
14. We begin by dening the divided dieren e f [x0 ; x1 ; : : : ; xk to be the
oe ient of xk in the polynomial interpolating (x0 ; f0 ), (x1 ; f1 ), . . . , (xk ; fk ).
From the observations in x19.12, it follows that
f [x0 ; x1 ; : : : ; xk = k ;
i.e., f [x0 ; x1 ; : : : ; xk is the oe ient of (t x0 )(t x1 ) (t xk 1) in the
Newton form of the interpolant.
15. From the rst equation in the system (19.5), we nd that
f [x0 = f0 ;
and from the se ond
f1 a0 f [x1 f [x0
f [x0 ; x1 = = x x :
x1 x0 1 0
Thus the rst divided dieren e is obtained from zeroth-order divided dier-
en es by subtra ting and dividing, whi h is why it is alled a divided dieren e.
16. The above expression for f [x0 ; x1 is a spe ial ase of a more general
relation:
f [x ; x ; : : : ; xk f [x0 ; x1 ; : : : ; xk 1
f [x0 ; x1 ; : : : ; xk = 1 2 : (19.6)
k x x 0
To establish it, let

p be the polynomial interpolating (x0 ; f0 ); (x1 ; f1 ); : : : ; (xk ; fk );
q be the polynomial interpolating (x0 ; f0 ); (x1 ; f1 ); : : : ; (xk 1 ; fk 1 );
r be the polynomial interpolating (x1 ; f1 ); (x2 ; f2 ); : : : ; (xk ; fk ):
(19.7)
Then p, q, and r have leading oe ients f [x0 ; x1 ; : : : ; xk , f [x0 ; x1 ; : : : ; xk 1 ,
and f [x1 ; x2 ; : : : ; xk , respe tively.
Now we laim that
t x0
p(t) = q(t) + [r(t) q(t): (19.8)
xk x0
To see this, we will show that the right-hand side evaluates to fi = p(xi ) when
t = xi (i = 0; : : : ; k). The laim then follows by the uniqueness of interpolating
polynomials.
For t = x0 ,
x x
q(x0 ) + 0 0 [r(x0 ) q(x0 ) = f0 ;
x x
k 0
sin e q interpolates (x0 ; f0 ). For t = xi (i = 1; : : : ; k 1),

xi x0
q (x i ) + [r(x ) q(xi ) = fi + xi xk (fi fi ) = fi :
xk x0 i x0 xk
Finally, for t = xk ,
xk x0
q(xk ) + [r(x ) q(x0 ) = q(xk ) + [r(xk ) q(xk ) = r(xk ) = fk :
xk x0 0
From (19.8) and (19.7) it follows that the oe ient of xk in p is
f [x ; x ; : : : ; xk f [x0 ; x1 ; : : : ; xk 1
f [x ; x ; : : : ; x = 1 2
k ;
0 1
xk x0
whi h is what we wanted to show.
17. A onsequen e of (19.6) is that we an ompute dieren e quotients
re ursively beginning with the original fi. Spe i ally, let dij denote the (i; j )-
entry in the following table (the indexing begins with d00 = f [x0 ):
f0 = f [x0
f1 = f [x1 f [x0 ; x1 (19.9)
f2 = f [x2 f [x1 ; x2 f [x0 ; x1 ; x2
f3 = f [x3 f [x2 ; x3 f [x1 ; x2 ; x3 f [x0 ; x1 ; x2 ; x3
Then it follows that

di;j di ;j
dij = 1 1 1
;
xi xi j
from whi h the array an be built up olumnwise. The diagonal elements are
the oe ients of the Newton interpolant.
18. It is not ne essary to store the whole two-dimensional array (19.9). The
following ode assumes that the fi have been stored in the array and over-
writes the array with the oe ients of the Newton interpolant. Note that
the olumns of (19.9) are generated from the bottom up to avoid overwriting
elements prematurely.
for (j=2; j<=n; j++)
for (i=n; i>=j; i--)
[i = ( [i- [i-1)/(x[i-x[i-j)
The operation ount for this algorithm is n2 additions and 21 n2 divisions.
Le ture 20
Error in Interpolation
Error Bounds
Convergen e
Chebyshev Points
Error in interpolation
1. Up to this point we have treated the ordinates fi as arbitrary numbers. We
will now shift our point of view and assume that the fi satisfy
fi = f (xi );
where f is a fun tion dened on some interval of interest. As usual we will
assume that f has as many derivatives as we need.
2. Let p be the polynomial interpolating f at x0 , x1 , . . . , xn . Sin e poly-
nomial interpolation is often done in the hope of nding an easily evaluated
approximation to f , it is natural to look for expressions for the error
e(t) = f (t) p(t):
In what follows, we assume that t is not equal to x0 , x1 , . . . , xn (after all, the
error is zero at the points of interpolation).
3. To nd an expression for the error, let q(u) be the polynomial of degree
n +1 that interpolates f at the points x0 , x1 , . . . , xn , and t. The Newton form
of this interpolant is
q(u) = 0 + 1 (u x0 ) + + n (u x0 ) (u xn 1 )
+ f [x0 ; : : : ; xn ; t(u x0 ) (u xn 1 )(u xn ):
Now (see x19.12) 0 + 1 (u x0 ) + + n (u x0 ) (u xn 1 ) is just the
polynomial p(u). Hen e if we set
!(u) = (u x0 ) (u xn 1 )(u xn );
we have
q(u) = p(u) + f [x0 ; : : : ; xn ; t!(u):
But by onstru tion q(t) = f (t). Hen e
f (t) = p(t) + f [x0 ; : : : ; xn ; t!(t);
147
or
e(t) = f (t) p(t) = f [x0 ; : : : ; xn ; t!(t); (20.1)
whi h is the expression we are looking for.
4. Although (20.1) reveals an elegant relation between divided dieren es and
the error in interpolation, it does not allow us to bound the magnitude of the
error. However, we an derive another, more useful, expression by onsidering
the fun tion
'(u) = f (u) p(u) f [x0 ; : : : ; xn ; t!(u):
Here, as above, we regard u as variable and t as xed.
Sin e p(xi ) = f (xi ) and !(xi ) = 0,
'(xi ) = f (xi ) p(xi ) f [x0 ; : : : ; xn ; t!(xi ) = 0:
Moreover, by (20.1),
'(t) = f (t) p(t) f [x0; : : : ; xn ; t!(t) = 0:
In other words, if I is the smallest interval ontaining x0 , . . . , xn and t, then
'(u) has at least n + 2 zeros in I:
By Rolle's theorem, between ea h of these zeros there is a zero of '0 (u):
'0 (u) has at least n + 1 zeros in I:
Similarly,
'00 (u) has at least n zeros in I:
Continuing, we nd that
'(n+1) (u) has at least one zero in I .
Let be one of the zeros of '(n+1) lying in I .
To get our error expression, we now evaluate '(n+1) at . Sin e p(u) is a
polynomial of degree n,
p(n+1) ( ) = 0:
Sin e !(u) = un+1 + ,
!(n+1) ( ) = (n + 1)!
Hen e
0 = '(n+1) ( ) = f (n+1) ( ) f [x0 ; : : : ; xn ; t(n + 1)!;
or
f (n+1) ( )
f [x0 ; : : : ; xn ; t = (20.2)
(n + 1)! :
In view of (20.1), we have the following result.
Let p be the polynomial interpolating f at x0 , x1 , . . . , xn . Then

f (n+1) ( )
f (t) p(t) =
(n + 1)! (t x0 ) (t xn ); (20.3)
where lies in the smallest interval ontaining x0 , x1 , . . . , xn and
t.
5. The point t = is a fun tion of t. The proof says nothing of the properties
of t , other than to lo ate it in a ertain interval. However, it is easy to show
that
f (n+1) (t ) is ontinuous:
Just apply l'Hôpital's rule to the expression
f (t) p(t)
f (n+1) (t ) = (n + 1)!
(t x0 ) (t xn )
at the points x0 , . . . , xn.
6. A useful and informative orollary of (20.2) is the following expression.
1 f (n) ();
f [x0 ; x2 ; : : : ; xn =
n!
where lies in the interval ontaining x1 , x2 , . . . , xn .
In parti ular, if the points x0 , . . . , xn luster about a point t, the nth dieren e
quotient is an approximation to f (n) (t).
Error bounds
7. We will now show how to use (20.3) to derive error bounds in a simple ase.
Let `(t) be the linear polynomial interpolating f (t) at x0 and x1 , and suppose
that
jf 00(t)j M
in some interval of interest. Then
00
jf (t) `(t)j = jf 2()j j(t x0)(t x1)j M2 j(t x0)(t x1)j:
The further treatment of this bound depends on whether t lies outside or inside
the interval [x0 ; x1 .
8. If t lies outside [x0 ; x1 , we say that we are extrapolating the polynomial
approximation to f . Sin e j(t x0 )(t x1 )j qui kly be omes large as t moves
away from [x0 ; x1 , extrapolation is a risky business. To be sure, many nu-

meri al algorithms are based on a little judi ious extrapolation. But when
the newspapers arry a story asserting that the population of the U.S. will
be ome zero in 2046 and two weeks later a learned worthy announ es that the
world population will be ome innite in 2045, you an bet that someone has
been pra ti ing unsafe extrapolation. E onomists and astrologers are mighty
extrapolators.
9. If t lies inside [x0 ; x1 , we say that we are interpolating the polynomial
approximation to f . (Note the new sense of the term \interpolation.") In this
ase we an get a uniform error bound. Spe i ally, the fun tion j(t x0 )(t x1 )j
attains its maximum in [x0 ; x1 at the point t = (x0 + x1 )=2, and that maximum
is (x1 x0 )2 =4. Hen e
M
t 2 [x0 ; x1 =) jf (t) `(t)j
8 (x1 x0 ) : (20.4)
2
10. As an appli ation of this bound, suppose that we want to ompute heap
and dirty sines by storing values of the sine fun tion at equally spa ed points
and using linear interpolation to ompute intermediate values. The question
then arises of how small the spa ing h between the points must be to a hieve
a pres ribed a ura y.
Spe i ally, suppose we require the approximation to be a urate to 10 4 .
Sin e the absolute value of the se ond derivative of the sine is bounded by one,
we have from (20.4)
j sin t `(t)j h8 :
2
Thus we must take h2 =8 10 4 or

p
h :01 8 = 0:0283 : : : :
Convergen e
11. The method just des ribed for approximating the sine uses many inter-
polants over small intervals. Another possibility is to use a single high-order
interpolant to represent the fun tion f over the entire interval of interest. Thus
suppose that for ea h n = 0, 1, 2, . . . we hoose n equally spa ed points and
let pn interpolate f at these points. If the sequen e of polynomials fpn g1 n=0
onverges uniformly to f , we know there will be an n for whi h pn will be a
su iently a urate approximation.
n=4 n=8
1 1
0.5
0.5
0
-0.5
-0.5 -1
-5 0 5 -5 0 5
n = 12 n = 16
1 5
0
0
-1
-5
-2
-10
-3
-4 -15
-5 0 5 -5 0 5
Figure 20.1. Equally spa ed interpolation.
12. Unfortunately, equally spa ed interpolation an diverge. The following

example is due to Runge. Consider the fun tion
f (t) =
1
1 + t2
on the interval [ 5; 5. Figure 20.1 exhibits plots of the fun tion and its in-
terpolant for n = 4, 8, 12, and 16. You an see that the interpolants are
diverging at the ends of the intervals, an observation whi h an be established
mathemati ally.
Chebyshev points
13. Although it is not a omplete explanation, part of the problem with
Runge's example is the polynomial !(t) = (t x0 )(t x1 ) (t xn ), whi h
appears in the error expression (20.3). Figure 20.2 ontains plots of ! for
equally spa ed points and for a set of interpolation points alled Chebyshev
points. The ! based on equally spa ed points is the one that peaks at the ends
of the interval, just about where the error in the interpolant is worst. On the
other hand, the ! based on the Chebyshev points is uniformly small.
14. The Chebyshev points ome from an attempt to adjust the interpolation
points to ontrol the size of !. For deniteness, let us onsider the interval
-3
x 10 n = 16
1
0.8
0.6
0.4
0.2
-0.2
-0.4
-0.6
-0.8
-1
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
Figure 20.2. ! (t) for equally spa ed and Chebyshev points.
[ 1; 1 and suppose that

jf n (t)j M;
( +1)
1 t 1:
Then from (20.3),
jf (t) pn(t)j M max j!(x)j:
n! x2 ; [ 1 1
This bound will be minimized if we hoose the interpolating points so that

max j!(x)j
x2[ ;
1 1
is minimized.
It an be shown
min max j!(x)j = 2 n ;
!(x)=(x x )(x x 0 (x x ) x2[ 1;1
1) n
and the minimum is a hieved when

xi = os

2(n i) + 1 ; i = 0; 1; : : : ; n:
2n + 2
These are the Chebyshev points.
n=4 n=8
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
-0.2 -0.2
-5 0 5 -5 0 5
n = 12 n = 16
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
-5 0 5 -5 0 5
Figure 20.3. Chebyshev interpolation.
15. Figure 20.3 shows what happens when 1=(1 + x2 ) is interpolated at the
Chebyshev points. It appears to be onverging satisfa torily.
16. Unfortunately, there are fun tions for whi h interpolation at the Chebyshev
points fails to onverge. Moreover, better approximations of fun tions like
1=(1+ x2 ) an be obtained by other interpolants | e.g., ubi splines. However,
if you have to interpolate a fun tion of modest or high degree by a polynomial,
you should onsider basing the interpolation on the Chebyshev points.
Numeri al Integration
155
Le ture 21
Change of Intervals
The Trapezoidal Rule
The Composite Trapezoidal Rule
Newton{Cotes Formulas
Undetermined Coe ients and Simpson's Rule
Numeri al integration
1. The dierential al ulus is a s ien e; the integral al ulus is an art. Given
a formula for a fun tion | say e x or e x | it is usually possible to work your
2
way through to its derivative. The same is not true of the integral. We an
al ulate Z
e x dx
easily enough, but Z
e x 2
dx (21.1)
annot be expressed in terms of the elementary algebrai and trans endental
fun tions.
2. Sometimes it is possible to dene away the problem. For example the
integral (21.1) is so important in probability and statisti s that there is a well-
tabulated error fun tion
erf(x) = p2
Z x
e x dx;
2
0
whose properties have been extensively studied. But this approa h is spe ial-
ized and not suitable for problems in whi h the fun tion to be integrated is
not known in advan e.
3. One of the problems with an indenite integral like (21.1) is that the
solution has to be a formula. The denite integral, on the other hand, is a
number, whi h in prin iple an be omputed. The pro ess of evaluating a
denite integral of a fun tion from values of the fun tion is alled numeri al
integration or numeri al quadrature.22
The word \quadrature" refers to nding a square whose area is the same as the area
22
under a urve.
157
Change of intervals
4. A typi al quadrature formula is Simpson's rule :
Z 1
f (x) dx
1 2 1 1
= 6 f (0) + 3 f 2 + 6 f (1):
0
Now a rule like this would not be mu h good if it ould only be used to
integrate fun tions over the interval [0; 1. Fortunately, by performing a linear
transformation of variables, we an use the rule over an arbitrary interval [a; b.
Sin e this pro ess is used regularly in numeri al integration, we des ribe it now.
5. The tri k is to express x as a linear fun tion of another variable y. The
expression must be su h that x = a when y = 0 and x = b when y = 1. This
is a simple linear interpolation problem whose solution is
x = a + (b a)y:
It follows that
dx = (b a)dy:
Hen e if we set
g(y) = f [a + (b a)y;
we have Z b Z 1
f (x) dx = (b a) g(y) dy:

a 0
6. For Simpson's rule we have

1 = f a + b ; g(1) = f (b):
g(0) = f (a); g

2 2
Hen e the general form of Simpson's rule is
b a + b

Z
b a
f (x) dx = f (a) + 4f + f (b) :
a 6 2
7. This te hnique easily generalizes to arbitrary hanges of interval, and we
will silently invoke it whenever ne essary.
The trapezoidal rule
8. The simplest quadrature rule in wide use is the trapezoidal rule. Like
Newton's method, it has both a geometri and an analyti derivation. The
geometri derivation is illustrated in Figure 21.1. The idea is to approximate
the area under the urve y = f (x) from x = 0 to x = h by the area of the
trapezoid ABCD. Now the area of a trapezoid is the produ t of its base with
21. Numeri al Integration 159
B
f(h)
f(0)
h
A D
Figure 21.1. The trapezoidal rule.
its average height. In this ase the length of the base is h, while the average
height is [f (0) + f (h)=2. In this way we get the trapezoidal rule
Z h h
f (x) dx
= [f (0) + f (h): (21.2)
0 2
9. The idea behind the analyti derivation is to interpolate f (x) at 0 and h

by a linear polynomial `(x) and take 0h `(x) dx as an approximation to the
R
integral of f (x). The interpolant is

f (1) f (0)
`(x) = f (0) + x;
h
and an easy integration shows that 0
R h `(x) dx
is the same as the right-hand
side of (21.2).
10. An advantage of the analyti approa h is that it leads dire tly to an error
formula. From the theory of polynomial interpolation (see x20.4) we know that
f 00 (x )
f (x) `(x) = x(x h);
2
where x 2 [0; h and f 00 (x ) is a ontinuous fun tion of x. If Th (f ) denotes
the right-hand side of (21.2), then
Z h Z h
f (x) dx Th(f ) = [f (x) `(x) dx =

1 Z h f 00( )x(x h) dx: (21.3)
x
0 0 2 0
Now the fun tion x(x h) is nonpositive on [0; h. Hen e by the mean value
theorem for integrals, for some u and = u, both in [0; h, we have
Z h
f 00 () Z h f 00() 3
f (x) dx Th(f ) = x(x h) dx = (21.4)
0 2 0 12 h :
11. The error formula (21.4) shows that if f 00 (x) is not large on [0; h and h
is small, the trapezoidal rule gives a good approximation to the integral. For
example if jf 00 (x)j 1 and h = 10 2 , the error in the trapezoidal rule is less
than 10 7 .
The omposite trapezoidal rule
12. The trapezoidal rule annot be expe ted to give a urate results over a
large interval. However, by summing the results of many appli ations of the
trapezoidal rule over smaller intervals, we an obtain an a urate approxima-
tion to the integral over any interval [a; b.
13. We begin by dividing [a; b into n equal intervals by the points
a = x0 < x1 < < xn 1 < xn = b:
Spe i ally, if
b a
h=
n
is the ommon length of the intervals, then
xi = a + ih; i = 0; 1; : : : ; n:
Rx
Next we approximate x f (x) dx by the trapezoidal rule:
i
i 1
Z xi h
f (x) dx
= [f (xi 1 ) + f (xi ):
xi 21
Finally, we sum these individual approximations to get the approximation

Z b n h
f (x) dx
X
= [f (xi 1 ) + f (xi ):
a i=1 2
After some rearranging, this sum be omes
Z b

f (x0 ) f (xn )
a
f (x) dx = h
2 + f (x1 ) + + f (xn 1 ) + 2 :
This formula is alled the omposite trapezoidal rule.23
14. We an also derive an error formula for the omposite trapezoidal rule.
Let CTh (f ) denote the approximation produ ed by the omposite trapezoidal
rule. Then from the error formula (21.4),
Z b h3 Xn n
h 2 (b a ) X
f (x) dx CTh (f ) = f 00 (i ) = 00
a 12 i=1 12 n i=1 f (i );
23
Be ause the omposite rule is used far more often than the simple rule, people often drop
the quali ation \ omposite" and simply all it the trapezoidal rule.
where i 2 [xi 1 ; xi . Now the fa tor n1 i f 00 (i ) is just the arithmeti mean
P
of the numbers f 00 (i ). Hen e it lies between the largest and the smallest of
these numbers, and it follows from P the intermediate value theorem that there
is an 2 [a; b su h that f 00 () = n1 i f 00 (i ). Putting all this together, we get
the following result.
Let CTh (f ) denote the approximation produ ed by the omposite
trapezoidal rule applied to f on [a; b. Then
Z b
f (x) dx CTh(f ) =
(b a)f 00 () h2 :
a 12
15. This is strong stu. It says that we an make the approximate integral
as a urate as we want simply by adding more points ( ompare this with
polynomial interpolation where onvergen e annot be guaranteed). Moreover,
be ause the error de reases as h2 , we get twi e the bang out of our added points.
For example, doubling the number of points redu es the error by a fa tor of
four.
Newton{Cotes formulas
16. From the analyti derivation of the trapezoidal rule, we see that the rule
integrates any linear polynomial exa tly. This suggests that we generalize the
trapezoidal rule by requiring that our rule integrate exa tly any polynomial
of degree n. Sin e a polynomial of degree n has n + 1 free parameters, it is
reasonable to look for the approximate integral as a linear ombination of the
fun tion evaluated at n + 1 xed points or abs issas. Su h a quadrature rule
is alled a Newton{Cotes formula.24
17. Let x0 , x1 , . . . , xn be points in the interval [a; b.25 Then we wish to
determine onstants A0 , A1 , . . . , An , su h that
Z b
deg(f ) n =) f (x) dx = A0 x0 + A1 x1 + + An xn : (21.5)
a
This problem has an elegant solution in terms of Lagrange polynomials.
Let ì be the ith Lagrange polynomial over x0 , x1 , . . . , xn . Then
Z b
Ai = ì (x) dx (21.6)
a
are the unique oe ients satisfying (21.5).
24
Stri tly speaking, the abs issas are equally spa ed in a Newton{Cotes formula. But no
one is making us be stri t.
25
In point of fa t, the points do not have to lie in the interval, and sometimes they don't.
But mostly they do.
18. To prove the assertion rst note that the rule must integrate the ith
Lagrange polynomial. Hen e
Z b X n
ì (x) = Aj ì (xj ) = Ai ì (xi ) = Ai ;
a j =0
whi h says that the only possible value for the Ai is given by (21.6).
Now let deg(f ) n. Then
Xn
f (x) = f (xi)ì (x):
i=0
Hen e Z b n Z b n
X X
`j (x) dx = f (xi ) ì (x) dx = f (xi )Ai ;
a i=0 a i=0
whi h is just (21.5).
Undetermined oe ients and Simpson's rule
19. Although the expressions (21.6) have a ertain elegan e, they are di ult
to evaluate. An alternative for low-order formulas is to use the exa tness
property (21.5) to write down a system of equations for the oe ients, a
te hnique known as the method of undetermined oe ients.
20. We will illustrate the te hnique with a three-point formula over the interval
[0; 1 based on the points 0, 12 , and 1. First note that the exa tness property
requires that the rule integrate the fun tion that is identi ally one. In other
words, Z 1
1 A0 + 1 A1 + 1 A2 = 1 dx = 1:
0
The rule must also integrate the fun tion x, whi h gives
1
0 A0 + A1 + 1 A2 = x dx = 1 :
Z 1
2 0 2
Finally, the rule must integrate the fun tion x2 . This gives a third equation
1
0 A0 + 4 A1 + 1 A2 = x2 dx = 31 :
Z 1
The solution of these three equations is

A0 = A2 =
1
6
and
A1 = :
2
3
Thus our rule is

Z 1
1
f (x) dx
2 1 + 1 f (1);
= f (0) + f
6 3 2 6
0
whi h is just Simpson's rule.

Le ture 22
The Composite Simpson Rule
Errors in Simpson's Rule
Treatment of Singularities
Gaussian Quadrature: The Idea
The Composite Simpson rule

1. There is a omposite version of Simpson's rule for an interval [a; b. To
derive it, let
b a
h=
n
and
xi = a + ih; i = 0; 1; : : : ; n:
For brevity set
fi = f (xi ):
Then by Simpson's rule
Z xi h
f (x) dx = (fi + 4fi+1 + fi+2 ):
+2
xi 3
2. We now wish to approximate the integral of f over [a; b by summing the
results of Simpson's rule over [xi ; xi+2 . However, ea h appli ation of Simpson's
rule involves two of the intervals [xi ; xi+1 . Thus the total number of intervals
must be even. For the moment, therefore, we will assume that n is even.
3. The summation an be written as follows:
Rb

h a f (x) dx = f0 + 4f1 + f2
3
+ f2 + 4f3 + f4 +
: : :
+ fn 4 + 4fn 3 + fn 2
+ fn 2 + 4fn 1 + fn :
This sum teles opes into
Z b h
f (x) dx
= (f0 + 4f1 + 2f2 + 4f3 + 2f4 + + 2fn 2 + 4fn 1 + fn); (22.1)
a 3
whi h is the omposite Simpson rule.
4. Here is a little fragment of ode that omputes the sum (22.1). As above
we assume that n is an even, positive integer.
165
simp = f[0 + 4*f[1 + f[n;

for (i=2; i<=n-2; i=i+2)
simp = 2*f[i + 4*f[i+1;
simp = h*simp/3;
5. When n is odd, we have an extra interval [xn 1 ; xn over whi h we must

integrate f . There are three possibilities.
First, we an use the trapezoidal rule to integrate over the interval. The
problem with this solution is that the error in the trapezoidal rule is too big,
and as h de reases it will dominate the entire sum.
Se ond, we an evaluate f at
x +x
xn = n 1 n1
2 2
and approximate the integral over [xn 1 ; xn by
h
6 (fn 1 + 4fn + fn ):
1
2
This solution works quite well. However, it has the drawba k that it is not
suitable for tabulated data, where additional fun tion values are unobtainable.
A third option is to on o t a Newton{Cotes-type formula of the form
Z xn
f (x) dx
= A0 fn 2 + A1 fn 1 + A2 fn
xn 1
and use it to integrate over the extra interval. The formula an be easily
derived by the method of undetermined oe ients. It is sometimes alled the
half-simp or semi-simp rule.
Errors in Simpson's rule

6. It is more di ult to derive an error formula for Simpson's rule than for
the trapezoidal rule. In the error formula for the trapezoidal rule (see the
right-hand side of (21.3)) the polynomial x(x h) does not hange sign on
the interval [0; h. This means that we an invoke the mean value theorem for
integrals to simplify the error formula. For Simpson's rule the orresponding
polynomial x(x h)(x 2h) does hange sign, and hen e dierent te hniques
are required to get an error formula. Sin e these te hniques are best studied
in a general setting, we will just set down the error formula for Simpson's rule
over the interval [a; b:
a + b
+ f (b) = (b a) f (4) ( ); (22.2)
Z b
b a 5
f (x) dx f (a) + 4f
a 6 2 2880
where 2 [a; b.
7. The presen e of the fa tor f (4) ( ) on the right-hand side of (22.2) implies
that the error vanishes when f is a ubi : Although Simpson's rule was derived
to be exa t for quadrati s, it is also exa t for ubi s. This is no oin iden e,
as we shall see when we ome to treat Gaussian quadrature.
8. The error formula for the omposite Simpson rule an be obtained from
(22.2) in the same way as we derived the error formula for the omposite trape-
zoidal rule. If CSh (f ) denotes the result of applying the omposite Simpson
rule to f over the interval [a; b, then
Z b
f (x) dx CSh (f ) =
(b a)f (4) ( ) h4 ;
a 180
where 2 [a; b.
Treatment of singularities
9. It sometimes happens that one has to integrate a fun tion with a singularity.
For example, if

f (x)
= px ;
R
when x is near zero, then 01 f (x) dx exists. However, we annot use the trape-
zoidal rule or Simpson's rule to evaluate the integral be ause f (0) is undened.
Of ourse one an try to al ulate the integral by a Newton{Cotes formula
based on points that ex lude zero; e.g., x0 = 41 and x1 = 43 . However, we will
still not obtain very good results, sin e f is not at all linear on [0; 1. A better
approa h is to in orporate the singularity into the quadrature rule itself.
10. First dene p
g(x) = xf (x):
Then g(x) = when x is near zero, so that g is well behaved. Thus we should
seek a rule that evaluates the integral
Z 1
g(x)x dx;
1
2
where g is a well-behaved fun tion on [0; 1. The fun tion x is alled a weight
1
2
fun tion be ause it assigns a weight or degree of importan e to ea h value of

the fun tion g.
11. We will use the method of undetermined oe ients to determine a quadra-
ture rule based on the points x0 = 14 and x1 = 43 . Taking g(x) = 1, we get the
following equation for the oe ients A0 and A1 :
Z 1
1 x dx = 2 = A0 + A1 :
1
2
0
Taking g(x) = x, we get a se ond equation:

Z 1
2 1
x x dx = = A0 + A1 :
1 3
3 4 4
2
Solving these equations, we get A0 = 35 and A1 = 13 . Hen e our formula is

Z 1
g(x)x dx
5 1 2 3
= 3g 4 + 3g 4 :
1
2
12. In transforming this formula to another interval, say [0; h, are must be
taken to transform the weighting fun tion properly. For example, if we wish
to evaluate Z h
g(x)x dx;
1
2
we make the transformation x = hy, whi h gives

Z h p Z 1
g(x)x dx = h g(hx)x dx:

1 1
2 2
0 0
Owing
p to the weight fun tion x , the transformed integral is multiplied by
1
2
h, rather than h as in the unweighted ase.

13. The ee tiveness of su h a transformation an be seen by omparing it
with an unweighted Newton{Cotes formula on the same points. The formula
is easily seen to be
Z h
h h 3h

f (x) dx = f +f :
0 2 4 4
The following matlab ode ompares the results of the two formulas for h =
0:1 applied to the fun tion
os x p
p
2 x x sin x;
whose integral is px os x:
x = .025;
f0 = .5* os(x)/sqrt(x) - sqrt(x)*sin(x);
g0 = .5* os(x) - x*sin(x);
x = .075;
f1 = .5* os(x)/sqrt(x) - sqrt(x)*sin(x);
g1 = .5* os(x) - x*sin(x);
[.05*(f0+f1), sqrt(.1)*(5*g0/3 + g1/3), sqrt(.1)* os(.1)
The true value of the integral is 0:3146. The Newton{Cotes approximation is
0.2479. The weighted approximation is 0:3151 | a great improvement.
Gaussian quadrature: The idea

14. We have observed that the oe ients Ai in the integration rule
Z b
f (x) dx
= A0 f (x0 ) + A1 f (x1 ) + + An f (xn )
a
represent n + 1 degrees of freedom that we an use to make the rule exa t for
polynomials of degree n or less. The key idea behind Gaussian quadrature is
that the abs issas x0 , x1 , . . . , xn represent another n + 1 degrees of freedom,
whi h an be used to extend the exa tness of the rule to polynomials of degree
2n + 1.
15. Unfortunately, any attempt to derive a Gaussian quadrature rule by the
method of undetermined oe ients (and abs issas) must ome to grips with
the fa t that the resulting equations are nonlinear. For example, when n = 1
and [a; b = [0; 1, the equations obtained by requiring the rule to integrate 1,
x, x2 , and x3 are
1 = A0 + A1
1
2
= x0 A0 + x1 A1
1
3
= x20 A0 + x21 A1
1
4
= x30 A0 + x31 A1
Although this tangle of equations an be simplied, in general the approa h
leads to ill- onditioned systems. As an alternative, we will do as Gauss did and
approa h the problem through the theory of orthogonal polynomials, a theory
that has wide appli ability in its own right.
16. But rst let's digress a bit and onsider a spe ial ase. Rather than freeing
all the abs issas, we ould x n 1 of them, allowing only one to be free. For
example, to get a three-point rule that integrates ubi s, we might take x0 = 0
and x2 = 1, leaving x1 to be determined. This leads to the equations
1 = A0 + A1 + A2
1
2
= x1 A1 + A2
1
3
= x21 A1 + A2
1
4
= x31 A1 + A2
When these equations are solved, the result is Simpson's rule. Thus the un-
expe ted a ura y of Simpson's rule an be explained by the fa t that it is
a tually a onstrained Gaussian quadrature formula.
Le ture 23
Gaussian Quadrature: The Setting
Orthogonal Polynomials
Existen e
Zeros of Orthogonal Polynomials
Gaussian Quadrature
Error and Convergen e
Examples
Gaussian quadrature: The setting

1. The Gauss formula we will a tually derive has the form
Z b
f (x)w(x) dx
= A0 f (x0 ) + A1 f (x1 ) + + An f (xn );
a
where w(x) is a weight fun tion that is greater than zero on the interval [a; b.
2. The in orporation of a weight fun tion reates no ompli ations in the
theory. However, it makes our integrals, whi h are already too long, even more
umbersome. Sin e the interval [a; b and the weight w(x) do not hange, we
will suppress them along with the variable of integration and write
R
Z b
f= f (x)w(x) dx:
a
R R R
3.
R
RegardedR as anR operator on fun tions, is linear. That is, f = f and
(f + g) = f + g. We will make extensive use of linearity in what follows.
Orthogonal polynomials
4. Two fun tions f and g are said to be orthogonal if
R
fg = 0:
R
The term \orthogonal" derives from the fa t that the integral fg an be
regarded as an inner produ t of f and g. Thus two polynomials are orthogonal
if their inner produ t is zero, whi h is the usual denition of orthogonality in
Rn .
5. A sequen e of orthogonal polynomials is a sequen e fpi g1 i=0 of polynomials
with deg(pi ) = i su h that
i 6= j =) pi pj = 0:
R
(23.1)
171
Sin e orthogonality is not altered by multipli ation by a nonzero onstant, we

may normalize the polynomial pi so that the oe ient of xi is one: i.e.,
pi (x) = xi + ai;i 1 xi 1 + + ai0 :
Su h a polynomial is said to be moni .
6. Our immediate goal is to establish the existen e of orthogonal polynomials.
Although we ould, in prin iple, determine the oe ients aij of pi in the
natural basis by using the orthogonality onditions (23.1), we get better results
by expressing pn+1 in terms of lower-order orthogonal polynomials. To do this
we need the following general result.
Let fpi g1
i=0 be a sequen e of polynomials su h that pi is exa tly of
degree i. If
q(x) = an xn + an 1 xn 1 + + a0 ; (23.2)
then q an be written uniquely in the form
q = bn pn + bn 1 pn 1 + + b0 p0 : (23.3)
7. In establishing this result, we may assume that the polynomials pi are

moni . The proof is by indu tion. For n = 0, we have
q(x) = a0 = a0 1 = a0 p0 (x):
Hen e we must have b0 = a0 .
Now assume that q has the form (23.2). Sin e pn is the only polynomial in
the sequen e pn , pn 1 , . . . , p0 that ontains xn and sin e pn is moni , it follows
that we must have bn = an . Then the polynomial q an pn is of degree n 1.
Hen e by the indu tion hypothesis, it an be expressed uniquely in the form
q an pn = bn 1 pn 1 + + b0 p0 ;
whi h establishes the result.
8. A onsequen e of this result is the following.
The polynomial pn+1 is orthogonal to any polynomial q of degree n
or less.
For from (23.3) it follows that
pn+1 q = bn pn+1 pn + + b0 pn+2 p0 = 0;
R R R
the last equality following from the orthogonality of the polynomials pi .

Existen e
9. To establish the existen e of orthogonal polynomials, we begin by omputing
the rst two. Sin e p0 is moni and of degree zero,
p0 (x) 1:
Sin e p1 is moni and of degree one, it must have the form
p1 (x) = x 1 :
To determine 1 , we use orthogonality:
0 = p1 p0 = (x 1 ) 1 = x 1 1:
R R R R
R
Sin e the fun tion 1 is positive in the interval of integration, 1 > 0, and it
follows that R
x
1 = R :
1
10. In general we will seek pn+1 in the form
pn+1 = xpn n+1 pn n+1 pn 1 n+1 pn 2 :
As in the onstru tion of p1 , we use orthogonality to determine the oe ients
n+1 , n+1 , n+1 , . . . .
To determine n+1 , write
0 = pn+1pn = xpnpn n+1 pn pn n+1 pn 1 pn n+1 pn 2 pn :
R R R R R
By orthogonality, 0 = pn 1 pn = pn 2 pn = . Hen e
R R
R R
xp2n n+1 p2n = 0:
R
Sin e p2n > 0, we may solve this equation to get
R
xp2n
n+1 = R :
p2n
For n+1 , write
R R R
0 = pn+1 pn 1 =R xpn pn 1 n+1R pn pn 1
n+1 pn 1 pn 1 n+1 pn 2 pn 1 :
Dropping terms that are zero be ause of orthogonality, we get
R R
xpnpn 1 n+1 p2n 1 =0
or R
xp n pn 1
n+1 = R :
p2n 1
11. The formulas for the remaining oe ients are similar to the formula for
k+1 ; e.g., R
xp n pn 2
n+1 = R :
p2n 2
R
However, Rthere is a surprise here. The numerator xpn pn 2 an be written in
the form xpn 2 pn . Sin e xpn 2 is of degree n 1 it is orthogonal to pn ; i.e.,
R
xpn 2 pn 1 = 0. Hen e k+1 = 0, and likewise the oe ients of pn 3 , pn 4 ,
. . . are zero.
12. To summarize:
The orthogonal polynomials an be generated by the following re-
urren e:
p0 = 1;
p1 = x 1 ;
pn+1 = xpn n+1 pn n+1 pn 1 ; n = 1; 2; : : : ;
where R R
xp2 xp p
n+1 = 2n and n+1 = R n2 n 1 :
R
pn pn 1
The rst two equations in the re urren e merely start things o. The
right-hand side of the third equation ontains three terms and for that reason
is alled the three-term re urren e for the orthogonal polynomials.
Zeros of orthogonal polynomials
13. It will turn out that the abs issas of our Gaussian quadrature formula will
be the zeros of pn+1 . We will now show that
The zeros of pn+1 are real, simple, and lie in the interval [a; b.
14. Let x0 , x1 , . . . , xk be the zeros of odd multipli ity of pn+1 in [a; b; i.e,, x0 ,
x1 , . . . , xk are the points at whi h pn+1 hanges sign in [a; b. If k = n, we are
through, sin e the xi are the n + 1 zeros of pn+1 .
Suppose then that k < n and onsider the polynomial
q(x) = (x x0 )(x x1 ) (x xk ):
Sin e deg(q) = k + 1 < n + 1, by orthogonality
R
pn+1 q = 0:
On the other hand pn+1 (x)q(x) annot hange sign on [a; b | ea h sign hange
in pn+1 (x) is an elled by a orresponding sign hange in q(x). It follows that
pn+1 q 6= 0;
R
whi h is a ontradi tion.

Gaussian quadrature
15. The Gaussian quadrature formula is obtained by onstru ting a Newton{
Cotes formula on the zeros of the orthogonal polynomial pn+1 .
Let x0 , x1 , . . . , xn be the zeros of the orthogonal polynomial pn+1
and set R
Ai = ì ; i = 0; 1; : : : ; n;
where ì is the ith Lagrange polynomial over x0 , x1 , . . . , xn . For
any fun tion f let
Gn f = A0 f (x0 ) + A1 f (x1 ) + + An f (xn):
Then
deg(f ) 2n + 1 =) f = Gn f:
R
16. To establish this result, rst note that by onstru tion the integration
formula Gn f is exa t for polynomials of degree less than or equal to n (see
x21.17).
Now let deg(f ) 2n + 1. Divide f by pn+1 to get
f = pn+1 q + r; deg(q); deg(r) n: (23.4)
Then
Gn f = Pi Ai f (xi)
P
= Pi Ai [pn+1 (xi )q(xi ) + r(xi ) by (23.4)

= i Ai r(xi ) be ause pn+1(xi ) = 0
= G n r
be ause RGn is exa t for deg(r) n
R
= Rr
= R (pn+1 q + r) be ause pn+1 q = 0 for deg(q) n
= f by (23.4).
Quod erat demonstrandum.
17. An important orollary of these results is that the oe ients Ai are
positive. To see this note that
2

6= j ,
ì (xj ) = ì (xj ) = 10 ifif ii = j.
Sin e `2i (x) 0 and deg(`2i ) = 2n,

R X
0 < `2i = Gn `2i = Ai `2i (xj ) = Ai :
j
18. Sin e A0 + +An = 1, no oe ient an be larger than 1. Consequently,

R R
we annot have a situation in whi h large oe ients reate large intermediate

results that suer an ellation when they are added.
Error and onvergen e
19. Gaussian quadrature has error formulas similar to the ones for Newton{
Cotes formulas. Spe i ally
R
f
(2n+2)
( ) R p2 ;
Gn f = f(2n + 2)! n+1
where 2 [a; b.
20. A onsequen e of the positivity of the oe ients Ai is that Gaussian
quadrature onverges for any ontinuous fun tion; that is,
f ontinuous =) nlim
R
!1 Gn f = f:
The proof | it is a good exer ise in elementary analysis | is based on the
Weierstrass approximation theorem, whi h says that for any ontinuous fun -
tion f there is a sequen e of polynomials that onverges uniformly to f .
Examples
21. Parti ular Gauss formulas arise from parti ular hoi es of the interval [a; b
and the weight fun tion w(x). The workhorse is Gauss{Legendre quadrature,26
in whi h [a; b = [ 1; 1 and w(x) 1, so that the formula approximates the
integral Z 1
f (x) dx:
1
The orresponding orthogonal polynomials are alled Legendre polynomials.

22. If we take [a; b = [0; 1 and w(x) = e x , we get a formula to approximate
Z 1
f (x)e x dx:
0
This is Gauss{Laguerre quadrature.

26
Curious bedfellows! Gauss and Legendre be ame involved in a famous priority dispute
over the invention of least squares, and neither would enjoy seeing their names oupled this
way.
23. If we take [a; b = [ 1; 1 and w(x) = e x , we get a formula to approx-

2
imate Z 1
f (x)e x dx:
2
1
This is Gauss{Hermite quadrature.
24. There are many other Gauss formulas suitable for spe ial purposes. Most
mathemati al handbooks have tables of the abs issas and oe ients. The
automati generation of Gauss formulas is an interesting subje t in its own
right.
Numeri al Dierentiation
179
Le ture 24
Numeri al Dierentiation
Numeri al Dierentiation and Integration
Formulas from Power Series
Limitations
Numeri al dierentiation and integration

1. We have already noted that formulas are easier to dierentiate than to
integrate. When it omes to numeri s the opposite is true. The graphs in
Figure 24.1 suggest why.
The fun tion of interest is represented by the straight line. The verti al
bars represent the values of the fun tion with a little error thrown in. The
dashed line is the resulting pie ewise linear approximation, and the area under
it is the approximate integral generated by the omposite trapezoidal rule.
It is lear from the gure that the errors in the individual points tend
to wash out in the integral as the dashed line os illates above and below the
solid line. On the other hand, if we approximate the derivative by the slope
of the dashed line between onse utive points, the results vary widely. In two
instan es the slope is negative!
At the end of this le ture we will see why numeri al dierentiation is by its
nature a tri ky pro ess, but rst let us derive some dierentiation formulas.
Figure 24.1. A straight line.
181
Formulas from power series

2. From elementary al ulus we know that
f (x + h) f (x)
f 0 (x) = + O(h2 ): (24.1)
h
This suggests that we attempt to approximate the derivatives of f as a linear
ombination of values of f near the point x. For deniteness, we will work
with the points f (x h), f (x), and f (x + h), where h is presumed small.
3. We begin by writing the Taylor expansion of f (x + h) and f (x h) about
x.
h2 h3 h4
f (x + h) = f (x) + hf 0 (x) + f 00 (x) + f 000 (x) + f (4) (x) +
2 6 24
f (x) = f (x)
h2 h3 000 h4
f (x h) = f (x) hf 0 (x) + f 00 (x) f (x) + f (4) (x) +
2 6 24
To derive a formula like (24.1), we take a linear ombination of these three
values that annihilates f (x) on the right, leaving f 0 (x) as its leading term. In
the display below, the oe ients of the ombination appear before the olons,
and the result appears below the line.
= f (x) + hf 0 (x) + f 00 ( )
h 2
1 : f (x + h)
2
1 : f (x) = f (x)
hf 0 (x) + f 00 ( )
h2
0 : f (x h) = f (x)
2
hf 0 (x) + f 00 ( )
h2
f (x + h) f (x) =
2
Note that we have repla ed terms in h2 by orresponding remainder terms.

Dividing by h, we obtain the formula
f (x + h) f (x) h 00
f 0(x) = 2 [x; x + h:
h 2 f ( );
This formula is alled a forward-dieren e approximation to the derivative
be ause it looks forward along the x-axis to get an approximation to f 0(x).
4. A linear ombination of f (x h), f (x), and f (x + h) has three degrees of
freedom. In deriving the forward-dieren e formula, we used only two of them.
By using all three, we not only an eliminate f (x), but we an eliminate the
terms in h2 .
Spe i ally,
24. Numeri al Dierentiation 183
= f (x) + hf 0 (x) + f 00 (x) + f 000 (+ )

h2 h3
1 : f (x + h)
2 6
0 : f (x) = f (x)
hf 0 (x) + f 00 (x) f 000 ( )
h2 h3
1 : f (x h) = f (x)
2 6
h3 f 000 (+ ) + f 000 ( )
f (x + h) f (x h) = 2hf 0 (x) +
3 2
Note that the error term onsists of two evaluations of f 000 , one at + 2 [x; x + h
from trun ating the series for f (x + h) and the other at 2 [x h; x from
trun ating the series for f (x h). If f 000 is ontinuous, the average of these two
values an be written as f 000 ( ), where 2 [x h; x + h. Hen e we have the
entral-dieren e formula
f (x + h) f (x h) h2 000
f 0(x) = f ( ); 2 [x h; x + h:
2h 6
5. Sin e the error in the entral-dieren e formula is of order h2 , it is ultimately
more a urate than a forward-dieren e s heme. And on the fa e of it, both
require two fun tion evaluations, so that it is no less e onomi al. However, in
many appli ations we will be given f (x) along with x. When this happens,
a forward-dieren e formula requires only one additional fun tion evaluation,
ompared with two for a entral-dieren e formula.
6. To get a formula for the se ond derivative, we hoose the oe ients to
pi k o the rst two terms of the Taylor expansion:
= f (x) + hf 0 (x) + f 00 (x) + f 000 (x) + f (4) (+ )
h2 h3 h4
1 : f (x + h)
2 6 24
2 : f (x ) = f (x)
hf 0 (x) + f 00 (x) f 000 (x) +
h2 h3 h4
1 : f (x + h ) = f (x) f (4) ( )
2 6 24
h4 (f (4) + ) + f (4) ( )
f (x + h) 2 f (x ) + f (x h) = h2 f 00 (x) +
6 2
where + 2 [x; x + h and 2 [x; x h. It follows that

f (x + h) 2f (x) + f (x h) h2
f 00 (x) = 2 [x h; x + h:
6 f ( );
(4)
h2
7. The three formulas derived above are workhorses that are put to servi e
in many appli ations. However, the te hnique is quite exible and an be
used to derive formulas for spe ial o asions. For instan e, if we want an
approximation to f 0(x) in terms of f (x), f (x + h), and f (x + 2h), we form the
following ombination:
3 : f (x ) = f (x)
= f (x) + hf 0 (x) + f 00 (x) +
f 000 (1 )
h2 h3
4 : f (x + h)
2 63
4h 000
1 : f (x + h ) = f (x) + 2hf 0 (x) + 2h2 f 00 (x) + f (2 )
6
2h3 000 4h3 000
3f (x) + 4f (x + h) f (x + 2h) = 2hf 00 (x) + f (1 ) f (2 )
3 3
Hen e
f 0 (x) =
3f (x) + 4f (x + h) f (x + 2h) + h2 [2f 000 ( ) f 000 ( ):
2h 3 2 1
The error term does not depend on a single value of f 000 ; however, if h is
small, it is approximately
h2 000
3 f (x):
8. The te hnique just des ribed has mu h in ommon with the method of
undetermined oe ients for nding integration rules. There are other, more
systemati ways of deriving formulas to approximate derivatives. But this one
is easy to remember if you are stranded on a desert island without a textbook.
Limitations
9. As we indi ated at the beginning of this le ture, errors in the values of f
an ause ina ura ies in the omputed derivatives. In fa t the errors do more:
they pla e a limit on the a ura y to whi h a given formula an ompute
derivatives.
10. To see how this omes about, onsider the forward-dieren e formula
f (x + h) f (x) h2 00
D(f ) = + 2 f ( );
h
where D(f ) denotes the operation of dierentiating f at x. If we dene the
operator Dh by
f (x + h) f (x)
Dh (f ) = ;
h
then the error in the forward-dieren e approximation is
h 00
Dh (f ) D(f ) = f ( ):
2
In parti ular, if
jf 00(t)j M
24. Numeri al Dierentiation 185
for all t in the region of interest, then

jDh(f ) D(f )j M2 h: (24.2)
Thus the error goes to zero with h, and we an make the approximation Df
as a urate as we like by taking h small enough.
11. In pra ti e, however, we do not evaluate f (t), but
f~(t) = f (t) + e(t);
where e(t) represents the error in omputing f (t). Sin e Dh is a linear operator,
it follows that
Dh (f~) Dh (f ) = Dh (e):
In parti ular, if we know that
je(t)j
in the region of interest, then
jD (f~) D (f )j = je(t + h) e(t)j 2 :
h h
h h
Combining this inequality with (24.2), we nd that the error in what we a tu-
ally ompute is
jDh (f~) D(f )j 2h + M2 h:
12. To the extent that this bound is realisti , it says that there is a limit
on how a urately we an approximate derivatives by the forward-dieren e
formula. For large h the error formula term Mh=2 dominates, while for small
h the term 2=h dominates. The minimum o urs approximately when they
are equal, that is, when Mh=2 = 2=h, or
r

h=2 :
M
At this point the error bound is
p
2 M:
For example,
p if equals the rounding unit M and M = 1, then the mini-
mum error is 2 M . In other words, we annot expe t more than half ma hine
pre ision in su h a forward dieren e.
13. To illustrate this point, the matlab ode
for i=1:14,
x(i) = (sin(pi/3.2+10^(-i))-sin(pi/3.2))/10^(-i);
y(i) = x(i) - os(pi/3.2);
end
uses forward dieren ing to ompute the derivative of sin(=3:2) with h =

10 1 , . . . , 10 14 . The array y ontains the error in these approximations.
Here is the output.
x y
0:51310589790214 0:04246433511746
0:55140366014496 0:00416657287465
0:55515440565301 0:00041582736659
0:55552865861230 0:00004157440730
0:55556607565510 0:00000415736450
0:55556981726212 0:00000041575748
0:55557019096319 0:00000004205641
0:55557023426189 0:00000000124229
0:55557025646635 0:00000002344675
0:55557003442175 0:00000019859785
0:55556670375267 0:00000352926693
0:55555560152243 0:00001463149717
The error de reases until h = 10 8 and then in reases. Sin e M is 10 16 , this
is in rough agreement with our analysis.
14. It is worth noting that if you turn the output sideways, the nonzero digits
of y plot a graph of the logarithm of the error. The slope is the same going
down and oming up, again as predi ted by our analysis.
15. It is important not to get an exaggerated fear of numeri al dierentiation.
It is an inherently sensitive pro edure. But as the above example shows, we
an often get a good many digits of a ura y before it breaks down, and this
a ura y is often su ient for the purposes at hand.
Bibliogaphy
Introdu tion
Referen es
Introdu tion
1. The following referen es fall into three lasses. The rst onsists of ele-
mentary books on numeri al analysis and programming. The number of su h
books is legion, and I have listed far less than a tithe. The se ond lass onsists
of books treating the individual topi s in these notes. They are a good sour e
of additional referen es. I have generally avoided advan ed treatises (hen e the
absen e of Wilkinson's magisterial Algebrai Eigenvalue Problem ). The third
lass onsists of books on pa kages of programs, prin ipally from numeri al
linear algebra. They illustrate how things are done by people who know how
to do them.
2. No bibliography in numeri al analysis would be omplete without referen -
ing Netlib, an extensive olle tion of numeri al programs available through the
Internet. Its URL is
http://www.netlib.org
The Guide to Available Mathemati al Software (GAMS) at
http://gams.nist.gov
ontains pointers to additional programs and pa kages. Be warned that the
preferred way to a ess the Internet hanges frequently, and by the time you
read this you may have to nd another way to a ess Netlib or GAMS.
Referen es
E. Anderson, Z. Bai, C. Bis hof, J. Demmel, J. Dongarra, J. Du Croz, A. Green-
baum, S. Hammarling, A. M Kenney, S. Ostrou hov, and D. Sorensen.
LAPACK Users' Guide. SIAM, Philadelphia, se ond edition, 1995.
K. E. Atkinson. An Introdu tion to Numeri al Analysis. John Wiley, New
York, 1978.

A. Bjor k. Numeri al Methods for Least Squares Problems. SIAM, Philadel-
phia, 1994.
R. P. Brent. Algorithms for Minimization without Derivatives. Prenti e{Hall,
Englewood Clis, New Jersey, 1973.
T. Coleman and C. Van Loan. Handbook for Matrix Computations. SIAM
Publi ations, Philadelphia, 1988.
187
S. D. Conte and C. de Boor. Elementary Numeri al Analysis: An Algorithmi

Approa h. M Graw{Hill, New York, third edition, 1980.
Germund Dahlquist and Ake Bjor k. Numeri al Methods. Prenti e-Hall,
B. N. Datta. Numeri al Linear Algebra and Appli ations. Brooks/Cole, Pa i
Grove, California, 1995.
P. J. Davis. Interpolation and Approximation. Blaisdell, New York, 1961.
Reprinted by Dover, New York, 1975.
P. J. Davis and P. Rabinowitz. Methods of Numeri al Integration. A ademi
Press, New York, 1967.
J. E. Dennis and R. B. S hnabel. Numeri al Methods for Un onstrained Opti-
mization and Nonlinear Equations. Prenti e-Hall, Englewook Clis, New
Jersey, 1983.
J. J. Dongarra, J. R. Bun h, C. B. Moler, and G. W. Stewart. LINPACK
User's Guide. SIAM, Philadelphia, 1979.
L. Elden and L. Wittmeyer-Ko h. Numeri al Analysis: An Introdu tion.
A ademi Press, New York, 1993.
G. Evans. Pra ti al Numeri al Integration. Wiley, New York, 1990.
G. E. Forsythe, M. A. Mal olm, and C. B. Moler. Computer Methods for
Mathemati al Computations. Prenti e-Hall, Englewood Clis, N.J., 1977.
C.-E. Froberg. Introdu tion to Numeri al Analysis. Adison{Wesley, Reading,
Massa husetts, 1969.
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins
University Press, Baltimore, Maryland, 2nd edition, 1989.
N. J. Higham. A ura y and Stability of Numeri al Algorithms. SIAM,
Philadelphia, 1995.
A. S. Householder. The Numeri al Treatment of a Single Nonlinear Equation.
M Graw{Hill, New York, 1970.
E. Isaa son and H. B. Keller. Analysis of Numeri al Methods. John Wiley,
New York, 1966.
D. Kahaner, S. Nash, and C. Moler. Numeri al Methods and Software. Pren-
tise Hall, Englewood Clis, New Jersey, 1988.
B. W. Kernighan and D. M. Rit hie. The C Programming Language. Prenti e
Hall, Englewood Clis, New Jersey, se ond edition, 1988.
D. Kin aid and W. Cheney. Numeri al Analysis: Mathemati s of S ienti
Computing. Brooks/Cole, Pa i Grove, CA, 1991.
Bibliography 189
C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. Prenti e

Hall, Englewood Clis, New Jersey, 1974.
M. J. Maron and R. J. Lopez. Numeri al Analysis: A Pra ti al Approa h.
Wadsworth, Belmont, California, third edition, 1991.
C. Moler, J. Little, and S. Bangert. Pro-Matlab User's Guide. The Math
Works, Shereborn, MA, 1987.
J. M. Ortega. Numeri al Analysis: A Se ond Course. A ademi Press, New
York, 1972.
P. H. Sterbenz. Floating-Point Computation. Prenti e-Hall, Englewood Clis,
New Jersey, 1974.
G. W. Stewart. Introdu tion to Matrix Computations. A ademi Press, New
York, 1973.
J. Stoer and R. Bulirs h. Introdu tion to Numeri al Analysis. Springer Verlag,
New York, se ond edition, 1993.
D. S. Watkins. Fundamentals of Matrix Computations. John Wiley & Sons,
New York, 1991.
J. H. Wilkinson. Rounding Errors in Algebrai Pro esses. Prenti e-Hall,
J. H. Wilkinson and C. Reins h. Handbook for Automati Computation. Vol. II
Linear Algebra. Springer, New York, 1971.
Index
Itali s signify a dening entry. The abbreviation qv (quod vide ) means to look
for the topi in question as a main entry. The letter \n" indi ates a footnote.
absolute error, 7 C, 69, 80n, 86
as onvergen e riterion, 7 storage of arrays, 83, 109
absolute value a he memory, 86
as a norm, 113 an ellation, 61{63, 65, 176
astrologer, 150 death before an ellation,
61{62, 104
ba k substitution, 100 in a sum of three numbers, 61
ba kward error analysis, 55{57 , in ill- onditioned problems, 63
125, 128 in the quadrati formula, 62
ba kward stability, 55 re overy from, 62{63, 65
Gaussian elimination, 122{124 revealing loss of information,
stable algorithm, 55 61, 62
sum of n numbers, 53{55 hara teristi , see oating-point
sum of two numbers, 50{51 arithmeti
ba kward stablity, see ba kward Chebyshev points, 152
error analysis Cholesky algorithm, 90{95, 98
balan ed matrix, 120, 136 and rounding error, 98
base, see oating-point arithmeti and solution of linear systems,
basi linear algebra subprogram, 95
see blas Cholesky de omposition, 90
binary arithmeti , see Cholesky fa tor, 90
oating-point arithmeti Cholesky fa torization, 90
blas, 86{88 olumn orientation, 94{95
axpy, 87, 98n, 109 e onomization of operations,
opy, 88 92
dot, 87 e onomization of storage, 92
ger, 109 equivalen e of inner and outer
imax, 108 produ t forms, 98
in Gaussian elimination, implementation, 92{95
108{109 inner produ t form, 97{98
level-three, 109n operation ount, 95, 98
level-two, 109 relation to Gaussian
s al, 88, 109 elimination, 91{92
swap, 108 olumn index, see matrix
bra ket for a root, see nonlinear olumn orientation, 84{85, 87, 94,
equations 98n
191
and level-two blas, 109 multipoint methods, 33

general observations, 86 noninteger orders, 21
olumn ve tor, see ve tor two-point methods, 32
omponent of a ve tor, see ve tor Cramer's rule, 127, 130
omputer arithmeti , see ompared with Gaussian
oating-point arithmeti elimination, 130{131
ondition ubi onvergen e, 20, 21
arti ial ill- onditioning,
120{121 de imal arithmeti , see
ondition number, 41 , 42 oating-point arithmeti
ondition number with respe t determinant, 77
to inversion, 120 unsuitability as measure of
generating matri es of known singularity, 77{78
ondition, 129 diagonal matrix, 71
ill- onditioning, 41, 51, 57, 63, two-norm of, 127
65 dieren e equation, 32
ill- onditioning of ee ts of rounding error, 63{65
Vandermonde, 136 dieren e quotient, 27
ill- onditioning revealed by dierentiation
an ellation, 63 ompared with integration,
linear system, 119{120, 169 157, 181
roots of nonlinear equations, dimension of a ve tor, see ve tor
41{42 divided dieren e, 144{146
roots of quadrati equations, and error in interpolant, 148
63 as oe ient of Newton
sum of n numbers, 57 interpolant, 144
well- onditioning, 41 omputation, 145{146
ondition number, see ondition operation ount, 146
onformity, see partitioning, relation to derivative, 149
produ t, sum e onomist, see astrologer
onstant slope method, 17 , 22 element of a matrix, see matrix
as su essive-substitution elementary permutation, 105
method, 23 error bounds for oating-point
onvergen e analysis, 18{19 operations, see rounding
failure, 19 error
linear onvergen e, 19 error fun tion, 157
onvergen e, see onvergen e of Eu lidean length of a ve tor, see
order p, ubi norm, two-norm
onvergen e, et . exponent, see oating-point
onvergen e of order p, 20 , 22 arithmeti
and signi ant gures, 20
limitations for large p, 21 xed point, 21
Index 193
xed-point number, 45 and the Cholesky algorithm,

oating-point arithmeti 91{92
advantage over xed-point ba k substitution, 100
arithmeti , 45 ba kward error analysis,
avoidan e of over ow, 47{48 122{124
base, 45{46 blas in implementation,
binary, 46, 49 108{109
hara teristi , 45n ompared with Cramer's rule,
de imal, 46 130{131
double pre ision, 46 ompared with
error bounds for oating-point invert-and-multiply, 130
operations, see rounding omplete pivoting, 105n
error exponential growth, 125{126
exponent ex eption, 47 for Hessenberg matri es,
oating-point number, 45{46 110{111
oating-point operations, for tridiagonal matri es, 111
49{50 general assessment, 126
fra tion, 45 growth fa tor, 125
guard digit, 50 implementation, 102
hexade imal, 46 multiplier, 99 , 101{102, 122,
high relative error in 123
subtra tion, 50 operation ount, 102, 111
IEEE standard, 46 , 47, 50 partial pivoting, 104{108, 125
mantissa, 45n pivot, 103
normalization, 45{46 pivoting, 103{108, 111
over ow, 47{48 , 141, 144 stability, 125
quadruple pre ision, 46 Gaussian quadrature, 167
rounding error, qv onvergen e, 176
single pre ision, 46 derivation, 175
under ow, 47{48 , 141, 144 error formula, 176
unnormalized number, 45 Gauss{Hermite quadrature,
fortran, 94 177
storage of arrays, 84, 109 Gauss{Laguerre quadrature,
forward-substitution algorithm, 80 176
fra tion, see oating-point Gauss{Legendre quadrature,
arithmeti 176
introdu tion, 169
GAMS, 187 positivity of oe ients,
Gauss, C. F., 31, 169, 176n 175{176
Gaussian elimination, 92 , 98{100 , guard digit, see oating-point
108, 127, 128, 135 arithmeti
and LU de omposition,
100{102 Hessenberg matrix, 110
and Gaussian elimination, linear equations, see linear system

110{111 linear fra tional method, 34{36
hexade imal arithmeti , see linear independen e, 77
oating-point arithmeti linear system, 77, 104
and partial pivoting, 105{108
identity matrix, see matrix and relative residual, 128{129
IEEE standard, see oating-point and rounding error, 120
arithmeti arti ial ill- onditioning,
ill- onditioning, 123 120{121
inner produ t, 98n, 171 ondition, 119{120
Cholesky algorithm, 97{98 Cramer's rule, qv
omputed by blas, 87 existen e of solutions, 77
integral invert-and-multiply algorithm,
denite, 157 78, 95, 127, 130
indenite, 157 lower triangular system,
integration 79{81, 84{85, 90, 97
ompared with dierentiation, matrix representation, 72
157, 181 nonsymmetri system, 98
intermediate value theorem, 4 operation ount, qv
interpolation, 35 perturbation analysis, 116{117
interpolatory method, 34 positive-denite system, 89, 95
interval bise tion, 4{6 solution by LU de omposition,
ombined with se ant method, 78{79, 89
37{40 triangular system, 95
onvergen e, 6 uniqueness of solutions, 77
implementation, 5{6 lo ality of referen e, 83 , 85, 86
inverse matrix, 78 , 105, 120n, 130 lower triangular matrix, 79 , 79
and linear systems, 78 inverse, 101
al ulation of, 95 produ t, 101
lower triangular matrix, 101 unit lower triangular matrix,
of a produ t, 78 102
of a transpose, 78 lower triangular system
invert-and-multiply algorithm, see nonsingularity, 143
linear system LU de omposition, 79 , 98, 102,
Lagrange interpolation, 144 120n, 123
Lagrange polynomials, 137{138 , and Gaussian elimination,
161 100{102
least squares, 89, 176 existen e, 103
Legendre, A. M., 176n solution of linear systems,
linear ombination, 75 78{79, 89
linear onvergen e, 19, 20 , 22 with partial pivoting, 106{107
rate of onvergen e, 20 mantissa, see oating-point
Index 195
arithmeti rate of onvergen e, 33

matlab, 127
matrix, 69 natural basis interpolation,
olumn index, 69 135{137, 141
diagonal matrix, qv evaluation by syntheti
element, 69 division, 141{142
identity matrix, 71 ill- onditioning of
inverse matrix, qv Vandermonde, 136
lower triangular matrix, qv Vandermonde matrix, qv
nonsingular matrix, qv Netlib, 187
normally distributed matrix, Newton interpolation, 142 , 144
129 addition of new points, 144
operations, see multipli ation oe ients as divided
by a s alar, sum, produ t, dieren es, 144
transpose omputation of oe ients,
order, 69 145{146
orthogonal matrix, qv evaluation by syntheti
partitioning, qv division, 142
positive denite matrix, qv existen e and uniqueness,
rank-one matrix, qv 143{144
represented by upper- ase Newton's method, 9{15 , 22, 34
letters, 70 analyti derivation, 10
row index, 69 as on uent ase of se ant
symmetri matrix, 89 method, 29
unit lower triangular matrix, as su essive-substitution
102 method, 23
upper triangular matrix, 79, al ulating re ipro als, 11
79 al ulating square roots, 11
Vandermonde matrix, qv onvergen e analysis, 12{14,
zero matrix, 70 19
matrix perturbation theory, 116 onvergen e to multiple zero,
moni polynomial, 172 24
Muller's method, 33{34, 135 derivative evaluation, 4, 17
multiple zero, 24 divergen e, 11
behavior of Newton's method, failure, 37
24 geometri derivation, 9
multipli ation by a s alar quadrati onvergen e, 14, 19,
and norms, 113 23
and transposition, 72 retarded onvergen e, 14{15,
matrix, 70 25
ve tor, 71 starting values, 15
multipoint method, 33 , 34, 135 nonlinear equations, 4
analyti solutions, 4
bra ket for a root, 5 one-norm, 113{115

ondition of roots, see row-sum norm, 115
ondition, roots of triangle inequality, 114
nonlinear equations triangle inequality., 113
ee ts of rounding error, 6 two-norm, 72, 113{114 , 115n,
errors in the fun tion, 37, 127{128
40{42 ve tor norm, 113{114
existen e of solutions, 4, 5 numeri al dierentiation
general observations, 3{4 entral-dieren e formula, 183
hybrid of se ant method and ompared with numer ial
interval bise tion, 37{40 integration, 181
interpolatory method, 33 error analysis, 184{186
interval bise tion, qv forward-dieren e formula,
Muller's method, qv 182, 184
multipoint method, qv se ond derivative, 183
Newton's method, qv three-point
polynomial equations, 25 ba kward-dieren e
quadrati formula, qv formula, 184
quasi-Newton method, 4, 17n, numeri al integration, 41, 157
17, 27 and Lagrange polynomials,
root and zero ontrasted, 9 161{162
se ant method, qv hange of intervals, 158
su essive substitutions ompared with numeri al
method, qv dierentiation, 181
two-point method, qv Gauss{Hermite quadrature,
uniqueness of solutions, 4 177
nonsingular matrix, 78 , 90 Gauss{Laguerre quadrature,
perturbation of, 117 176
norm Gauss{Legendre quadrature,
and orthogonal matri es, 127 176
olumn-sum norm, 115 Gaussian quadrature, qv
onsisten y, 114 , 116, 127 Newton{Cotes formulas,
Eu lidean norm, 114 161{162 , 166, 167
Frobenius norm, 115 Simpson's rule, qv
innity-norm, 113{115 trapezoidal rule, qv
Manhattan norm, 114 treatment of singularities,
matrix norm, 114{115 167{168
max norm, 114 undetermined oe ients,
normwise relative error, 162{163 , 167, 169
115{116 , 117, 120, 128 weight fun tion, 167 , 176
of a diagonal matrix, 127 numeri al quadrature, 157
of a rank-one matrix, 127
operation ount, 81{82
Index 197
approximation by integrals, bases, 143n, 172

81, 95 evaluation by syntheti
Cholesky algorithm, 95 division, 141{142
divided dieren e, 146 moni , 172
Gaussian elimination, 102 number of distin t zeros, 138
Gaussian elimination for polynomial interpolation, 137
Hessenberg matri es, 111 approximation to the sine, 150
Gaussian elimination for at Chebyshev points, 151{153
tridiagonal matri es, 111 onvergen e, 150{153, 161
interpretation and aveats, error bounds, 149{150
81{82 error in interpolant, 147{149
lower triangular system, 81 existen e, 137{138
syntheti division, 142 extrapolation and
order of a matrix, see matrix interpolation, 149{150
orthogonal fun tion, 171 failure of onvergen e, 151, 153
orthogonal matrix, 127 general features, 137
and two-norm, 127 Lagrange interpolation,
random, 129 137{138, 141
orthogonal polynomials, 169, 171 linear interpolation, 149
existen e, 173{174 natural basis interpolation, qv
Legendre polynomials, 176 Newton interpolation, qv
normalization, 172 quadrati interpolation,
orthogonality to polynomials 135{136
of lesser degree, 172 Runge's example, 151
reality of roots, 174{175 shift of origin, 136
three-term re urren e, 174 uniqueness, 138{139
outer produ t, 97 Vandermonde matrix, qv
over ow, see oating-point positive-denite matrix, 89
arithmeti al ulation of inverse, 95
overwriting, 92, 93, 102 nonsingularity, 90
partitioned, 90
partitioning, 74 , 90 without symmetry
by olumns, 74 requirement, 89n
onformity, 74 pre ision, see oating-point
matrix operations, 74 arithmeti
paritioned sum, 74 produ t, 70
partitioned produ t, 74 and transposition, 72
positive-denite matrix, 90 asso iativity, 72
perturbation analysis, 57 onformity, 71
linear system, 116{117 distributivity, 72
sum of n numbers, 57 inner produ t, 72
pivoting, see Gaussian elimination inverse of, 78
polynomial
matrix, 70{71 error bounds for oating-point

matrix-ve tor, 71 , 75 operations, 50
non ommutativity of matrix general observations, 65{66
produ t, 72, 74 in linear systems, 120
of partitioned matri es, 74 inferiority of hopped
of triangular matri es, 101 arithmeti , 58{59
rank-one matrix and a ve tor, ma hine epsilon, 49
73 magni ation, 65
re ipe for matrix produ t, 71 and relative error, 49
rounding, 48
quadrati onvergen e, 14 , 19{20 rounding unit, 49 , 120, 123
doubling of signi ant gures, statisti al analysis, 59
14 trun ation, 48
of Newton's method, 14, 19 rounding unit, see rounding error
quadrati formula, 61{63 rounding-error analysis
dis riminant, 63 a ura y of a sum of positive
revised, 63 numbers, 58{59
rank-one matrix, 73 a ura y of omputed sum, 58
omputing with, 73 ba kward error analysis, qv
storing, 73 an ellation, qv
two-norm of, 127 Cholesky algorithm, 98
re ipro al al ulated by Newton's dieren e equation, 64{65
method, 11 Gaussian elimination, qv
regression, 89 numeri al dierentiation,
relative error, 7 , 57, 128 184{186
and signi ant gures, 7{8 simpli ation of error bounds,
as onvergen e riterion, 8 54{55
normwise, see norm single error strategy, 65
and rounding error, 49 sum of n numbers, 53{55
relative residual, 128 sum of two numbers, 50{51
and stability of linear systems, row index, see matrix
128{129 row orientation, 83{84, 87, 98n
residual, 128 and level-two blas, 109
rounding error, 40, 47, 48{49 general observations, 86
a umulation, 55, 58{59, 65 row ve tor, see ve tor
adjusted rounding unit, 55 s alar, 70
an ellation, qv as 1 1 matrix, 69
hopping, 48{49 multipli ation by, 70
omputation of the rounding represented by lower- ase
unit, 49 Latin or Greek letter, 70
dieren e equation, 63{65 se ant method, 27 , 34
error bounds, 48{49 , 59 as an interpolatory method, 33
Index 199
ombined with interval ommutativity, 72

bise tion, 37{40 distributivity, 72
onvergen e, see two-point matrix, 70
method, 21 of partitioned matri es, 74
failure, 34, 37 ve tor, 71
geometri derivation, 28 superlinear onvergen e, 20
Newton's method as on uent not of order p, 21n
ase, 29 two-point methods, 32
quasi-Newton method, 17 symmetri matrix
signi ant gures e onomization of operations,
and quadrati onvergen e, 14 92
and relative error, 7{8 syntheti division, 141{142
simple zero, 24 evaluation of Newton
Simpson's rule, 158 interpolant, 142
as a partial Gauss quadrature, operation ount, 142
169 stability, 142
omposite rule, 165{166
derived by undetermined transpose, 69{70, 72
oe ients, 162{163 and matrix operations, 72
error formula, 166 inverse of, 78
error in omposite rule, 167 trapezoidal rule, 158{160
exa t for ubi s, 167, 169 analyti derivation, 159
half-simp rule, 166 omposite rule, 160 , 181
singular matrix, 123 error formula, 159{160
spline interpolant, 153 error in the omposite rule,
square root 160{161
al ulated by Newton's geometri derivation, 158{159
method, 11 triangle inequality, see norm
stable algorithm, 41, 55, 128{129 tridiagonal matrix, 111
ba kward error analysis, qv and Gaussian elimination, 111
dialogue on stability, 55{56 two-point method, 28
Gaussian elimination, 125 onvergen e analysis, 29{32
syntheti division, 142 rate of onvergen e, 32
Stewart, G. W., 103 under ow, see oating-point
sublinear onvergen e, 20n arithmeti
su essive substitution method, 21
onvergen e, 21 Vandermonde matrix, 137 , 144
geometri interpretation, 21 ill- onditioning, 136
sum, 70 nonsingularity, 135, 138
and transposition, 72 ve tor, 69
asso iativity, 72 olumn ve tor, 69
omformity, 70 omponent, 69
dimension, 69
as n 1 matrix, 69
n-ve tor, 69
represented by lower- ase
Latin letters, 70
row ve tor, 69 , 72
ve tor operations, see
multipli ation by a s alar,
sum, produ t, transpose
ve tor super omputer, 109
virtual memory, 83 , 84, 86
page, 83
page hit, 83
page miss, 83 , 83{85
Weierstrass approximation
theorem, 176
weight fun tion, 171
Wilkinson, J. H., 37, 110
zero matrix, see matrix
zero of a fun tion, 9

Afternotes On Numerical Analysis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Afternotes On Numerical Analysis

Uploaded by

Copyright:

Available Formats

Afternotes on Numeri

Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . 171

John Carroll, Bob Funderli , David Goldberg, Murli Gupta, Ni k Higham,

By the dawn's early light

The solution is easily seen to be

5. The intermediate value theorem an be used to establish the existen e of a

Figure 1.1. Interval bise tion.

iterations to onverge. Thus if L0 = 1 and eps = 10 6 , the iteration will

An examination of this table suggests the following.

Thus jx yj  2, whi h is a lower bound on the di eren e. On the other hand,

Figure 2.1. Geometri illustration of Newton's method.

4. In some sense the geometri and analyti derivations of Newton's method

Re ipro als and square roots

essen e of any lo al onvergen e theorem | we may take our starting point

and x1 2 I . Now sin e x1 is in I , so is x2 by the same reasoning. Moreover,

By indu tion, if xk 1 2 I , so is xk , and

Sin e C k ! 0, it follows that ek ! 0; that is, the sequen e x , x , . . . onverges

Figure 3.1. The onstant slope method.

Arguing as in x2.10, we an show that

Arguing as in x2.11, we nd that if x0 2 I then

whi h implies that the iteration onverges.

It follows that to redu e the error by a fa tor of , we must have k   or

Figure 3.2. An attra tive xed point.

Figure 3.3. A repulsive xed point.

whi h establishes the pth-order onvergen e.

Ending with a proposition

The se ant method

The iteration (4.1) then takes the form

Figure 4.1. The se ant method.

8. To get an error re ursion, we begin by expanding ' about (x ; x ) in a

+ 2'uv (x + e1 ; x + e0 ) + 'uvv (x + e e1 ; x + e0 )e0

and r(0; 0) = 2'uv (x ; x ), we have

6 0, we shall say that the sequen e fxk g exhibits two-point

jrk j  jr(ek ; ek 1 )j = s je jpje j = sk spk 1

Sin e p2 p 1 = 0, we have jek 1 jp 2

onverges to zero, whatever the starting values 0 ; : : : ; n 1 .

16. In our appli ation n = 1 and k = k  , and k = k  . The equation

This way to x4.

Figure 4.2. A horrible example.

It is a worthwhile exer ise to work out the details.

From (4.11) we have

Thus from (4.12) the next iterate is

9. At this point we make a further adjustment to dd. The explanation is best

Figure 5.1. A problem with .

if (sign(fb) == sign(f )){

Figure 5.2. Ill- and well- onditioned roots.

From the approximation

Figure 6.1. A oating-point word.

Over ow and under ow

it is added. Continuing the omputation, we obtain = 1060 , whi h is what

16. The number M in (6.2) is hara teristi of the oating-point arithmeti

representable as a oating-point number of the same size. For example, the

0:99999 as an approximation to 0:999999 is about 9 10 6 , whi h is of the same

Xyz: Double. I understand that it is about fteen de imal digits.

From (7.7), we obtain the following bound on the absolute error:

It then follows from (7.8) that

is the ondition number for the sum.

whose roots are given by the quadrati formula

4. An attempt to al ulate the smallest root in ve-digit arithmeti gives the

7. The rst thing to observe is that there is no problem in al ulating the

The omputed value is as a urate as we an reasonably expe t.

Figure 8.1. Computed solution of xk+1 = 2:25xk 0:5xk 1 .

Matri es, ve tors, and s alars

Thus jx yj 2, whi h is a lower bound on the dieren e. On the other hand,

It follows that to redu e the error by a fa tor of , we must have k or

+ 2'uv (x + e1 ; x + e0 ) + 'uvv (x + e e1 ; x + e0 )e0

jrk j jr(ek ; ek 1 )j = s je jpje j = sk spk 1

onverges to zero, whatever the starting values 0 ; : : : ; n 1 .

16. In our appli ation n = 1 and k = k , and k = k . The equation

16. The number M in (6.2) is hara teristi of the oating-point arithmeti

Note that for the produ t AB to be dened, the number of olumns of A

a31 a32 a33 a34 bb31 b32 b333 b34 b35 C

2. In onsequen e, if algorithms have dierent orders of omplexity |

Positive-denite matri es

in onne tion with ellipti partial dierential equations, whi h o ur every-

At the kth stage of the elimination, determine an index pk (k

However, the two loops together ompute the dieren e

In parti ular, if M = 10 t and (A) = 10k , the solution x~ an have relative

The ondition number of this matrix in the innity-norm is nine.