Numerical Analysis Lecture Notes

Walter Gautschi
Numerical Analysis
Second Edition
Walter Gautschi
Department of Computer Sciences
Purdue University
250 N. University Street
West Lafayette, IN 47907-2066
wgautschi@purdue.edu
ISBN 978-0-8176-8258-3
e-ISBN 978-0-8176-8259-0
DOI 10.1007/978-0-8176-8259-0
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941359
Mathematics Subject Classification (2010): 65-01, 65D05, 65D07, 65D10, 65D25, 65D30, 65D32,
65H04, 65H05, 65H10, 65L04, 65L05, 65L06, 65L10
© Springer Science+Business Media, LLC 1997, 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
www.birkhauser-science.com
TO
ERIKA
Preface to the Second Edition
In this second edition, the outline of chapters and sections has been preserved. The
subtitle "An Introduction", as suggested by several reviewers, has been deleted. The
content, however, is brought up to date, both in the text and in the notes. Many
passages in the text have been either corrected or improved. Some biographical
notes have been added as well as a few exercises and computer assignments. The
typographical appearance has also been improved by printing vectors and matrices
consistently in boldface types.
With regard to computer language in illustrations and exercises, we now uniformly
adopt Matlab. For readers not familiar with Matlab, there are a number of
introductory texts available. Some, like Moler [2004], Otto and Denier [2005], and
Stanoyevitch [2005], combine Matlab with numerical computing; others, like
Knight [2000], Higham and Higham [2005], Hunt, Lipsman and Rosenberg [2006],
and Driscoll [2009], are more exclusively focused on Matlab.
The major novelty, however, is a complete set of detailed solutions to all exercises
and machine assignments. The solution manual is available to instructors upon
request at the publisher's website http://www.birkhauser-science.com/978-0-8176-8258-3. Selected solutions are also included in the text to give students an idea of
what is expected. The bibliography has been expanded to reflect technical advances
in the field and to include references to new books and expository accounts. As a
result, the text has undergone an expansion in size of about 20%.
West Lafayette, Indiana
November 2011
Walter Gautschi
Preface to the First Edition
The book is designed for use in a graduate program in Numerical Analysis that
is structured so as to include a basic introductory course and subsequent more
specialized courses. The latter are envisaged to cover such topics as numerical
linear algebra, the numerical solution of ordinary and partial differential equations,
and perhaps additional topics related to complex analysis, to multidimensional
analysis, in particular optimization, and to functional analysis and related functional
equations. Viewed in this context, the first four chapters of our book could serve as
a text for the basic introductory course, and the remaining three chapters (which
indeed are at a distinctly higher level) could provide a text for an advanced course
on the numerical solution of ordinary differential equations. In a sense, therefore,
the book breaks with tradition in that it no longer attempts to deal with all
major topics of numerical mathematics. It is felt by the author that some of the
current subdisciplines, particularly those dealing with linear algebra and partial
differential equations, have developed into major fields of study that have attained
a degree of autonomy and identity that justifies their treatment in separate books
and separate courses on the graduate level. The term "Numerical Analysis" as
used in this book, therefore, is to be taken in the narrow sense of the numerical
analogue of Mathematical Analysis, comprising such topics as machine arithmetic,
the approximation of functions, approximate differentiation and integration, and the
approximate solution of nonlinear equations and of ordinary differential equations.
What is being covered, on the other hand, is done so with a view toward
stressing basic principles and maintaining simplicity and student-friendliness as far
as possible. In this sense, the book is "An Introduction". Topics that, even though
important and of current interest, require a level of technicality that transcends the
bounds of simplicity striven for, are referenced in detailed bibliographic notes at the
end of each chapter. It is hoped, in this way, to place the material treated in proper
context and to help, indeed encourage, the reader to pursue advanced modern topics
in more depth.
A significant feature of the book is the large collection of exercises that
are designed to help the student develop problem-solving skills and to provide
interesting extensions of topics treated in the text. Particular attention is given to
machine assignments, where the student is encouraged to implement numerical
techniques on the computer and to make use of modern software packages.
The author has taught the basic introductory course and the advanced course on
ordinary differential equations regularly at Purdue University for the last 30 years
or so. The former, typically, was offered both in the fall and spring semesters, to a
mixed audience consisting of graduate (and some good undergraduate) students in
mathematics, computer science, and engineering, while the latter was taught only in
the fall, to a smaller but also mixed audience. Written notes began to materialize in
the 1970s, when the author taught the basic course repeatedly in summer courses on
Mathematics held in Perugia, Italy. Indeed, for some time, these notes existed only
in the Italian language. Over the years, they were progressively expanded, updated,
and transposed into English, and along with that, notes for the advanced course were
developed. This, briefly, is how the present book evolved.
A long gestation period such as this, of course, is not without dangers, the
most notable one being a tendency for the material to become dated. The author
tried to counteract this by constantly updating and revising the notes, adding newer
developments when deemed appropriate. There are, however, benefits as well: over
time, one develops a sense for what is likely to stand the test of time and what
may only be of temporary interest, and one selects and deletes accordingly. Another
benefit is the steady accumulation of exercises and the opportunity to have them
tested on a large and diverse student population.
The purpose of academic teaching, in the author's view, is twofold: to transmit
knowledge, and, perhaps more important, to kindle interest and even enthusiasm
in the student. Accordingly, the author did not strive for comprehensiveness
even within the boundaries delineated but rather tried to concentrate on what is
essential, interesting and intellectually pleasing, and teachable. In line with this,
an attempt has been made to keep the text uncluttered with numerical examples and
other illustrative material. Being well aware, however, that mastery of a subject does
not come from studying alone but from active participation, the author provided
many exercises, including machine projects. Attributions of results to specific
authors and citations to the literature have been deliberately omitted from the body
of the text. Each chapter, as already mentioned, has a set of appended notes that
help the reader to pursue related topics in more depth and to consult the specialized
literature. It is here where attributions and historical remarks are made, and where
citations to the literature, both textbook and research, appear.
The main text is preceded by a prologue, which is intended to place the book in
proper perspective. In addition to other textbooks on the subject, and information
on software, it gives a detailed list of topics not treated in this book, but definitely
belonging to the vast area of computational mathematics, and it provides ample
references to relevant texts. A list of numerical analysis journals is also included.
The reader is expected to have a good background in calculus and advanced
calculus. Some passages of the text require a modest degree of acquaintance with
linear algebra, complex analysis, or differential equations. These passages, however,
can easily be skipped, without loss of continuity, by a student who is not familiar
with these subjects.
It is a pleasure to thank the publisher for showing interest in this book and
cooperating in producing it. The author is also grateful to Soren Jensen and Manil
Suri, who taught from this text, and to an anonymous reader; they all made many
helpful suggestions on improving the presentation. He is particularly indebted to
Prof. Jensen for substantially helping in preparing the exercises to Chap. 7. The
author further acknowledges assistance from Carl de Boor in preparing the notes to
Chap. 2 and from Werner C. Rheinboldt in preparing the notes to Chap. 4. Last but
not least, he owes a measure of gratitude to Connie Wilson for typing a preliminary
version of the text and to Adam Hammer for assisting the author with the more
intricate aspects of LaTeX.
West Lafayette, Indiana
January 1997
Walter Gautschi
Contents
Prologue
  P1 Overview
  P2 Numerical Analysis Software
  P3 Textbooks and Monographs
    P3.1 Selected Textbooks on Numerical Analysis
    P3.2 Monographs and Books on Specialized Topics
  P4 Journals
1 Machine Arithmetic and Related Matters
  1.1 Real Numbers, Machine Numbers, and Rounding
    1.1.1 Real Numbers
    1.1.2 Machine Numbers
    1.1.3 Rounding
  1.2 Machine Arithmetic
    1.2.1 A Model of Machine Arithmetic
    1.2.2 Error Propagation in Arithmetic Operations: Cancellation Error
  1.3 The Condition of a Problem
    1.3.1 Condition Numbers
    1.3.2 Examples
  1.4 The Condition of an Algorithm
  1.5 Computer Solution of a Problem; Overall Error
  1.6 Notes to Chapter 1
  Exercises and Machine Assignments to Chapter 1
    Exercises
    Machine Assignments
    Selected Solutions to Exercises
    Selected Solutions to Machine Assignments
2 Approximation and Interpolation
  2.1 Least Squares Approximation
    2.1.1 Inner Products
    2.1.2 The Normal Equations
    2.1.3 Least Squares Error; Convergence
    2.1.4 Examples of Orthogonal Systems
  2.2 Polynomial Interpolation
    2.2.1 Lagrange Interpolation Formula: Interpolation Operator
    2.2.2 Interpolation Error
    2.2.3 Convergence
    2.2.4 Chebyshev Polynomials and Nodes
    2.2.5 Barycentric Formula
    2.2.6 Newton's Formula
    2.2.7 Hermite Interpolation
    2.2.8 Inverse Interpolation
  2.3 Approximation and Interpolation by Spline Functions
    2.3.1 Interpolation by Piecewise Linear Functions
    2.3.2 A Basis for $S_1^0(\Delta)$
    2.3.3 Least Squares Approximation
    2.3.4 Interpolation by Cubic Splines
    2.3.5 Minimality Properties of Cubic Spline Interpolants
  2.4 Notes to Chapter 2
  Exercises
3 Numerical Differentiation and Integration
  3.1 Numerical Differentiation
    3.1.1 A General Differentiation Formula for Unequally Spaced Points
    3.1.2 Examples
    3.1.3 Numerical Differentiation with Perturbed Data
  3.2 Numerical Integration
    3.2.1 The Composite Trapezoidal and Simpson's Rules
    3.2.2 (Weighted) Newton-Cotes and Gauss Formulae
    3.2.3 Properties of Gaussian Quadrature Rules
    3.2.4 Some Applications of the Gauss Quadrature Rule
    3.2.5 Approximation of Linear Functionals: Method of Interpolation vs. Method of Undetermined Coefficients
    3.2.6 Peano Representation of Linear Functionals
    3.2.7 Extrapolation Methods
  3.3 Notes to Chapter 3
  Exercises
4 Nonlinear Equations
  4.1 Examples
    4.1.1 A Transcendental Equation
    4.1.2 A Two-Point Boundary Value Problem
    4.1.3 A Nonlinear Integral Equation
    4.1.4 s-Orthogonal Polynomials
  4.2 Iteration, Convergence, and Efficiency
  4.3 The Methods of Bisection and Sturm Sequences
    4.3.1 Bisection Method
    4.3.2 Method of Sturm Sequences
  4.4 Method of False Position
  4.5 Secant Method
  4.6 Newton's Method
  4.7 Fixed Point Iteration
  4.8 Algebraic Equations
    4.8.1 Newton's Method Applied to an Algebraic Equation
    4.8.2 An Accelerated Newton Method for Equations with Real Roots
  4.9 Systems of Nonlinear Equations
    4.9.1 Contraction Mapping Principle
    4.9.2 Newton's Method for Systems of Equations
  4.10 Notes to Chapter 4
  Exercises
5 Initial Value Problems for ODEs: One-Step Methods
  5.1 Examples
  5.2 Types of Differential Equations
  5.3 Existence and Uniqueness
  5.4 Numerical Methods
  5.5 Local Description of One-Step Methods
  5.6 Examples of One-Step Methods
    5.6.1 Euler's Method
    5.6.2 Method of Taylor Expansion
    5.6.3 Improved Euler Methods
    5.6.4 Second-Order Two-Stage Methods
    5.6.5 Runge-Kutta Methods
  5.7 Global Description of One-Step Methods
    5.7.1 Stability
    5.7.2 Convergence
    5.7.3 Asymptotics of Global Error
  5.8 Error Monitoring and Step Control
    5.8.1 Estimation of Global Error
    5.8.2 Truncation Error Estimates
    5.8.3 Step Control
  5.9 Stiff Problems
    5.9.1 A-Stability
    5.9.2 Padé Approximation
    5.9.3 Examples of A-Stable One-Step Methods
    5.9.4 Regions of Absolute Stability
  5.10 Notes to Chapter 5
  Exercises
6 Initial Value Problems for ODEs: Multistep Methods
  6.1 Local Description of Multistep Methods
    6.1.1 Explicit and Implicit Methods
    6.1.2 Local Accuracy
    6.1.3 Polynomial Degree vs. Order
  6.2 Examples of Multistep Methods
    6.2.1 Adams-Bashforth Method
    6.2.2 Adams-Moulton Method
    6.2.3 Predictor-Corrector Methods
  6.3 Global Description of Multistep Methods
    6.3.1 Linear Difference Equations
    6.3.2 Stability and Root Condition
    6.3.3 Convergence
    6.3.4 Asymptotics of Global Error
    6.3.5 Estimation of Global Error
  6.4 Analytic Theory of Order and Stability
    6.4.1 Analytic Characterization of Order
    6.4.2 Stable Methods of Maximum Order
    6.4.3 Applications
  6.5 Stiff Problems
    6.5.1 A-Stability
    6.5.2 A(α)-Stability
  6.6 Notes to Chapter 6
  Exercises
7 Two-Point Boundary Value Problems for ODEs
  7.1 Existence and Uniqueness
    7.1.1 Examples
    7.1.2 A Scalar Boundary Value Problem
    7.1.3 General Linear and Nonlinear Systems
  7.2 Initial Value Techniques
    7.2.1 Shooting Method for a Scalar Boundary Value Problem
    7.2.2 Linear and Nonlinear Systems
    7.2.3 Parallel Shooting
  7.3 Finite Difference Methods
    7.3.1 Linear Second-Order Equations
    7.3.2 Nonlinear Second-Order Equations
  7.4 Variational Methods
    7.4.1 Variational Formulation
    7.4.2 The Extremal Problem
    7.4.3 Approximate Solution of the Extremal Problem
  7.5 Notes to Chapter 7
  Exercises
References
Index
Prologue
P1 Overview
Numerical Analysis is the branch of mathematics that provides tools and methods
for solving mathematical problems in numerical form. The objective is to develop
detailed computational procedures, capable of being implemented on electronic
computers, and to study their performance characteristics. Related fields are Scientific Computation, which explores the application of numerical techniques and
computer architectures to concrete problems arising in the sciences and engineering;
Complexity Theory, which analyzes the number of operations and the amount of
computer memory required to solve a problem; and Parallel Computation, which
is concerned with organizing computational procedures in a manner that allows
running various parts of the procedures simultaneously on different processors.
The problems dealt with in computational mathematics come from virtually
all branches of pure and applied mathematics. There are computational aspects
in number theory, combinatorics, abstract algebra, linear algebra, approximation
theory, geometry, statistics, optimization, complex analysis, nonlinear equations,
differential and other functional equations, and so on. It is clearly impossible
to deal with all these topics in a single text of reasonable size. Indeed, the
tendency today is to develop specialized texts dealing with one or the other
of these topics. In the present text we concentrate on subject matters that are
basic to problems in approximation theory, nonlinear equations, and differential
equations. Accordingly, we have chapters on machine arithmetic, approximation
and interpolation, numerical differentiation and integration, nonlinear equations,
one-step and multistep methods for ordinary differential equations, and boundary
value problems in ordinary differential equations. Important topics not covered
in this text are computational number theory, algebra, and geometry; constructive
methods in optimization and complex analysis; numerical linear algebra; and the
numerical solution of problems involving partial differential equations and integral
equations. Selected texts for these areas are enumerated in Sect. P3.
We now describe briefly the topics treated in this text. Chapter 1 deals with
the basic facts of life regarding machine computation. It recognizes that, although
present-day computers are extremely powerful in terms of computational speed,
reliability, and amount of memory available, they are less than ideal (unless
supplemented by appropriate software) when it comes to the precision available,
and accuracy attainable, in the execution of elementary arithmetic operations. This
raises serious questions as to how arithmetic errors, either present in the input
data of a problem or committed during the execution of a solution algorithm,
affect the accuracy of the desired results. Concepts and tools required to answer
such questions are put forward in this introductory chapter. In Chap. 2, the central
theme is the approximation of functions by simpler functions, typically polynomials
and piecewise polynomial functions. Approximation in the sense of least squares
provides an opportunity to introduce orthogonal polynomials, which are relevant
also in connection with problems of numerical integration treated in Chap. 3. A large
part of the chapter, however, deals with polynomial interpolation and associated
error estimates, which are basic to many numerical procedures for integrating
functions and differential equations. Also discussed briefly is inverse interpolation,
an idea useful in solving equations.
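The effect of rounding in elementary operations, the concern of Chap. 1, can be seen in a small experiment. The following Python sketch is not taken from the text (which uses Matlab); the function names and the test value x = 10^12 are illustrative assumptions. It evaluates sqrt(x + 1) - sqrt(x) in double precision in two algebraically equivalent ways, one of which subtracts nearly equal numbers and therefore suffers from cancellation error.

```python
import math


def naive_difference(x):
    # Subtracts two nearly equal numbers: the leading digits cancel and the
    # rounding errors of the two square roots dominate the result.
    return math.sqrt(x + 1.0) - math.sqrt(x)


def stable_difference(x):
    # Algebraically identical expression that avoids the subtraction.
    return 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))


x = 1.0e12  # illustrative value, chosen so that sqrt(x) is large
naive = naive_difference(x)
stable = stable_difference(x)
relative_gap = abs(naive - stable) / stable
print(naive, stable, relative_gap)
```

With correctly rounded square roots, the two results should agree to only about five significant digits, even though each square root is individually accurate to nearly full machine precision.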
First applications of interpolation theory are given in Chap. 3, where the tasks
presented are the computation of derivatives and definite integrals. Although the
formulae developed for derivatives are subject to the detrimental effects of machine
arithmetic, they are useful, nevertheless, for purposes of discretizing differential
operators. The treatment of numerical integration includes routine procedures, such
as the trapezoidal and Simpson's rules, appropriate for well-behaved integrands, as
well as the more sophisticated procedures based on Gaussian quadrature to deal
with singularities. It is here where orthogonal polynomials reappear. The method of
undetermined coefficients is another technique for developing integration formulae.
It is applied to approximate general linear functionals, the Peano representation
of linear functionals providing an important tool for estimating the error. The
chapter ends with a discussion of extrapolation techniques; although applicable to
more general problems, they are inserted here since the composite trapezoidal rule
together with the Euler-Maclaurin formula provides the best-known application:
Romberg integration.
Chapter 4 deals with iterative methods for solving nonlinear equations and
systems thereof, the pièce de résistance being Newton's method. The emphasis here
lies in the study of, and the tools necessary to analyze, convergence. The special
case of algebraic equations is also briefly given attention.
Chapter 5 is the first of three chapters devoted to the numerical solution of
ordinary differential equations. It concerns itself with one-step methods for solving
initial value problems, such as the Runge-Kutta method, and gives a detailed
analysis of local and global errors. Also included is a brief introduction to stiff
equations and special methods to deal with them. Multistep methods and, in
particular, Dahlquist's theory of stability and its applications, are the subject of
Chap. 6. The final chapter (Chap. 7) is devoted to boundary value problems and their
solution by shooting methods, finite difference techniques, and variational methods.
P2 Numerical Analysis Software

There are many software packages available, both in the public domain and distributed commercially, that deal with numerical analysis algorithms. A widely used
source of numerical software is Netlib, accessible at http://www.netlib.org.
Large collections of general-purpose numerical algorithms are contained in
sources such as Slatec (http://www.netlib.org/slatec) and TOMS
(ACM Transactions on Mathematical Software). Specialized packages relevant
to the topics in the chapters ahead are identified in the Notes to each chapter.
Likewise, specific files needed to do some of the machine assignments in the
Exercises are identified as part of the exercise.
Among the commercial software packages we mention the Visual Numerics
(formerly IMSL) and NAG libraries. Interactive systems include HiQ, Macsyma,
Maple, Mathcad, Mathematica, and Matlab. Many of these packages, in addition
to numerical computation, have symbolic computation and graphics capabilities.
Further information is available in the Netlib file commercial. For more libraries,
and for interactive systems, also see Lozier and Olver [1994, Sect. 3].
In this text we consistently use Matlab as a vehicle for describing algorithms
and as the software tool for carrying out some of the exercises and all machine
assignments.

P3 Textbooks and Monographs

We provide here an annotated list (ordered alphabetically with respect to authors)
of other textbooks on numerical analysis, written at about the same, or higher, level
as the present one. Following this, we also mention books and monographs dealing
with topics in computational mathematics not covered in our (and many other) books
on numerical analysis. Additional books dealing with specialized subject areas, as
well as other literature, are referenced in the Notes to the individual chapters. We
generally restrict ourselves to books written in English and, with a few exceptions,
published within the last 25 years or so. Even so, we have had to be selective. (No
value judgment is to be implied by our selections or omissions.) A reader with access
to the AMS (American Mathematical Society) MathSciNet homepage will have no
difficulty in retrieving a more complete list of relevant items, including older texts.
P3.1 Selected Textbooks on Numerical Analysis

Atkinson [1989] A comprehensive in-depth treatment of standard topics short of
partial differential equations; includes an appendix describing some of the better-known software packages.
Atkinson and Han [2009] An advanced text on theoretical (as opposed to computational) aspects of numerical analysis, making extensive use of functional
analysis.
Bruce, Giblin, and Rippon [1990] A collection of interesting mathematical problems, ranging from number theory and computer-aided design to differential
equations, that require the use of computers for their solution.
Cheney and Kincaid [1994] Although an undergraduate text, it covers a broad
area, has many examples from science and engineering as well as computer
programs; there are many exercises, including machine assignments.
Conte and de Boor [1980] A widely used text for upper-division undergraduate
students; written for a broad audience, with algorithmic concerns in the foreground; has Fortran subroutines for many algorithms discussed in the text.
Dahlquist and Björck [2003, 2008] The first (2003) text, a reprint of the 1974
classic, provides a comprehensive introduction to all major fields of numerical
analysis, striking a good balance between theoretical issues and more practical
ones. The second text expands substantially on the more elementary topics
treated in the first and represents the first volume of more to come.
Deuflhard and Hohmann [2003] An introductory text with emphasis on machine
computation and algorithms; includes discussions of three-term recurrence
relations and stochastic eigenvalue problems (not usually found in textbooks),
but no differential equations.
Fröberg [1985] A thorough and exceptionally lucid exposition of all major topics
of numerical analysis exclusive of algorithms and computer programs.
Hämmerlin and Hoffmann [1991] Similar to Stoer and Bulirsch [2002] in its
emphasis on mathematical theory; has more on approximation theory and
multivariate interpolation and integration, but nothing on differential equations.
Householder [2006] A reissue of one of the early mathematical texts on the
subject, with coverage limited to systems of linear and nonlinear equations and
topics in approximation.
Isaacson and Keller [1994] One of the older but still eminently readable texts,
stressing the mathematical analysis of numerical methods.
Kincaid and Cheney [1996] Related to Cheney and Kincaid [1994] but more
mathematically oriented and unusually rich in exercises and bibliographic items.
Kress [1998] A rather comprehensive text with a strong functional analysis
component.
Neumaier [2001] A text emphasizing robust computation, including interval
arithmetic.
Rutishauser [1990] An annotated translation from the German of an older text
based on posthumous notes by one of the pioneers of numerical analysis;
although the subject matter reflects the state of the art in the early 1970s, the
treatment is highly original and is supplemented by translator's notes to each
chapter pointing to more recent developments.
Schwarz [1989] A mathematically oriented treatment of all major areas of numerical analysis, including ordinary and partial differential equations.
Stoer and Bulirsch [2002] Fairly comprehensive in coverage; written in a style
appealing more to mathematicians than engineers and computer scientists; has
many exercises and bibliographic references; serves not only as a textbook but
also as a reference work.
Todd [1979, 1977] Rather unique books, emphasizing problem-solving in areas
often not covered in other books on numerical analysis.
P3.2 Monographs and Books on Specialized Topics

A collection of outstanding survey papers on specialized topics in numerical
analysis is being assembled by Ciarlet and Lions [1990-2003] in handbooks of
numerical analysis; nine volumes have appeared so far. Another source of surveys
on a variety of topics is Acta numerica, an annual series of books edited by Iserles
[1992-2010], of which 19 volumes have been published so far. For an authoritative
account of the history of numerical analysis from the 16th through the 19th century,
the reader is referred to the book by Goldstine [1977]. For more recent history, see
Bultheel and Cools, eds. [2010].
The related areas of Scientific Computing and Parallel Computing are rather
more recent fields of study. Basic introductory texts are Scott et al. [2005]
and Tveito and Winter [2009]. Texts relevant to linear algebra and differential
equations are Schendel [1984], Ortega and Voigt [1985], Ortega [1989], Golub
and Ortega [1992], [1993], Van de Velde [1994], Burrage [1995], Heath [1997],
Deuflhard and Bornemann [2002], O'Leary [2009], and Quarteroni et al. [2010].
Other texts address topics in optimization, Pardalos et al. [1992] and Gonnet
and Scholl [2009]; computational geometry, Akl and Lyons [1993]; and other
miscellaneous areas, Crandall [1994], [1996], Köckler [1994], Bellomo and Preziosi
[1995], Danaila et al. [2007], and Farin and Hansford [2008]. Interesting historical
essays are contained in Nash, ed. [1990]. Matters regarding the Complexity of
numerical algorithms are discussed in an abstract framework in books by Traub and
Woźniakowski [1980] and Traub, Wasilkowski, and Woźniakowski [1983], [1988],
with applications to the numerical integration of functions and nonlinear equations,
and similarly, applied to elliptic partial differential equations and integral equations,
in the book by Werschulz [1991]. Other treatments are those by Kronsjö [1987], Ko
[1991], Bini and Pan [1994], Wang et al. [1994], Traub and Werschulz [1998], Ritter
[2000], and Novak et al. [2009]. For an in-depth complexity analysis of Newton's
method, the reader is encouraged to study Smale's [1987] lecture.
Material on Computational Number Theory can be found, at the undergraduate
level, in the book by Rosen [2000], which also contains applications to cryptography
and computer science, and in Allenby and Redfern [1989], and at a more advanced
level in the books by Niven et al. [1991], Cohen [1993], and Bach and Shallit
[1996]. Computational methods of factorization are dealt with in the book by
Riesel [1994]. Other useful sources are the set of lecture notes by Pohst [1993]
on algebraic number theory algorithms, and the proceedings volumes edited by
Pomerance [1990] and Gautschi [1994a, Part II]. For algorithms in Combinatorics,
see the books by Nijenhuis and Wilf [1978], Hu and Shing [2002], and Cormen et
al. [2009]. Various aspects of Computer Algebra are treated in the books by Geddes
et al. [1992], Mignotte [1992], Davenport et al. [1993], Mishra [1993], Heck [2003],
and Cox et al. [2007].
Other relatively new disciplines are Computational Geometry and Geometric
Modeling, Computer-Aided Design, and Computational Topology, for which relevant texts are, respectively, Preparata and Shamos [1985], Edelsbrunner [1987],
Mäntylä [1988], Taylor [1992], McLeod and Baart [1998], Gallier [2000], Cohen et
al. [2001], and Salomon [2006]; Hoschek and Lasser [1993], Farin [1997], [1999],
and Prautzsch et al. [2002]; Edelsbrunner [2006], and Edelsbrunner and Harer [2010].
Statistical Computing is covered in general textbooks such as Kennedy and Gentle
[1980], Anscombe [1981], Maindonald [1984], Thisted [1988], Monahan [2001],
Gentle [2009], and Lange [2010]. More specialized texts are Devroye [1986] and
Hörmann et al. [2004] on the generation of nonuniform random variables, Späth
[1992] on regression analysis, Heiberger [1989] on the design of experiments,
Stewart [1994] on Markov chains, Xiu [2010] on stochastic computing and uncertainty quantification, and Fang and Wang [1994], Manno [1999], Gentle [2003],
Liu [2008], Shonkwiler and Mendivil [2009], and Lemieux [2009] on Monte Carlo
and number-theoretic methods. Numerical techniques in Optimization (including
optimal control problems) are discussed in Evtushenko [1985]. An introductory
book on unconstrained optimization is Wolfe [1978]; among more advanced and
broader texts on optimization techniques we mention Gill et al. [1981], Ciarlet
[1989], and Fletcher [2001]. Linear programming is treated in Nazareth [1987] and
Panik [1996], linear and quadratic problems in Sima [1996], and the application of
conjugate direction methods to problems in optimization in Hestenes [1980]. The
most comprehensive text on (numerical and applied) Complex Analysis is the three-volume treatise by Henrici [1988, 1991, 1986]. Numerical methods for conformal
mapping are also treated in Kythe [1998], Schinzinger and Laura [2003], and
Papamichael and Stylianopoulos [2010]. For approximation in the complex domain,
the standard text is Gaier [1987]; Stenger [1993] deals with approximation by
sinc functions, Stenger [2011] providing some 450 Matlab programs. The book by
Iserles and Nørsett [1991] contains interesting discussions on the interface between
complex rational approximation and the stability theory of discretized differential
equations. The impact of high-precision computation on problems and conjectures
involving complex approximation is beautifully illustrated in the set of lectures by
Varga [1990].
For an in-depth treatment of many of the preceding topics, also see the four-volume work of Knuth [1975, 1981, 1973, 2005-2006].
Perhaps the most significant topic omitted in our book is numerical linear algebra
and its application to solving partial differential equations by finite difference or
finite element methods. Fortunately, there are many treatises available that address
these areas. For Numerical Linear Algebra, we refer to the classic work of Wilkinson
[1988] and the book by Golub and Van Loan [1996]. Links and applications
of matrix computation to orthogonal polynomials and quadrature are the subject
of Golub and Meurant [2010]. Other general texts are Jennings and McKeown
[1992], Watkins [2002], [2007], Demmel [1997], Trefethen and Bau [1997], Stewart
[1973], [1998], Meurant [1999], White [2007], Allaire and Kaber [2008], and
Datta [2010]; Higham [2002], [2008] has a comprehensive treatment of error and
stability analyses and the first, equally extensive, treatment of the numerics of matrix
functions. Solving linear systems on vector and shared memory parallel computers
and the use of linear algebra packages on high-performance computers are discussed
in Dongarra et al. [1991], [1998]. The solution of sparse linear systems and the
special data structures and pivoting strategies required in direct methods are treated
in Østerby and Zlatev [1983], Duff et al. [1989], Zlatev [1991], and Davis [2006],
whereas iterative techniques are discussed in the classic texts by Young [2003]
and Varga [2000], and in Il'in [1992], Hackbusch [1994], Weiss [1996], Fischer
[1996], Brezinski [1997], Greenbaum [1997], Saad [2003], Broyden and Vespucci
[2004], Hageman and Young [2004], Meurant [2006], Chan and Jin [2007], Byrne
[2008], and Woźnicki [2009]. The books by Branham [1990] and Björck [1996]
are devoted especially to least squares problems. For eigenvalues, see Chatelin
[1983], [1993], and for a good introduction to the numerical analysis of symmetric
eigenvalue problems, see Parlett [1998]. The currently very active investigation of
large sparse symmetric and nonsymmetric eigenvalue problems and their solution
by Lanczos-type methods has given rise to many books, for example, Cullum and
Willoughby [1985], [2002], Meyer [1987], Sehmi [1989], and Saad [1992]. For
structured and symplectic eigenvalue problems, see Fassbender [2000] and Kressner
[2005], and for inverse eigenvalue problems, Xu [1998] and Chu and Golub [2005].
For readers wishing to test their algorithms on specific matrices, the collection of
test matrices in Gregory and Karney [1978] and the matrix market on the Web
(http://math.nist.gov./MatrixMarket) are useful sources.
Even more extensive is the textbook literature on the numerical solution of Partial Differential Equations. The field has grown so much that there are currently only
a few books that attempt to cover the subject more or less as a whole. Among these
are Birkhoff and Lynch [1984] (for elliptic problems), Hall and Porsching [1990],
Ames [1992], Celia and Gray [1992], Larsson and Thomée [2003], Quarteroni and
Valli [1994], Morton and Mayers [2005], Sewell [2005], Quarteroni [2009], and
Tveito and Winter [2009]. Variational and finite element methods seem to have
attracted the most attention. An early and still frequently cited reference is the
book by Ciarlet [2002] (a reprint of the 1978 original); among more recent texts
we mention Beltzer [1990] (using symbolic computation), Křížek and Neittaanmäki
[1990], Brezzi and Fortin [1991], Schwab [1998], Kwon and Bang [2000] (using
Matlab), Zienkiewicz and Taylor [2000], Axelsson and Barker [2001], Babuška
and Strouboulis [2001], Höllig [2003], Monk [2003] (for Maxwell's equation),
Ern and Guermond [2004], Kythe and Wei [2004], Reddy [2004], Chen [2005],
Elman et al. [2005], Thomée [2006] (for parabolic equations), Braess [2007],
Demkowicz [2007], Brenner and Scott [2008], Bochev and Gunzburger [2009],
Efendiev and Hou [2009], and Johnson [2009]. Finite difference methods are treated
in Ashyralyev and Sobolevski [1994], Gustafsson et al. [1995], Thomas [1995],
[1999], Samarskii [2001], Strikwerda [2004], LeVeque [2007], and Gustafsson
[2008]; the method of lines in Schiesser [1991]; and the more refined techniques
of multigrids and domain decomposition in McCormick [1989], [1992], Bramble
[1993], Shadurov [1995], Smith et al. [1996], Quarteroni and Valli [1999], Briggs
et al. [2000], Toselli and Widlund [2005], and Mathew [2008]. Problems in potential
theory and elasticity are often approached via boundary element methods, for which
representative texts are Brebbia and Dominguez [1992], Chen and Zhou [1992],
Hall [1994], and Steinbach [2008]. A discussion of conservation laws is given in the
classic monograph by Lax [1973] and more recently in LeVeque [1992], Godlewski
and Raviart [1996], Kröner [1997], and LeVeque [2002]. Spectral methods, i.e.,
expansions in (typically) orthogonal polynomials, applied to a variety of problems,
were pioneered in the monograph by Gottlieb and Orszag [1977] and have received
extensive treatments in more recent texts by Canuto et al. [1988], [2006], [2007],
Fornberg [1996], Guo [1998], Trefethen [2000] (in Matlab), Boyd [2001], Peyret
[2002], Hesthaven et al. [2007], and Kopriva [2009].
Early, but still relevant, texts on the numerical solution of Integral Equations are
Atkinson [1976] and Baker [1977]. More recent treatises are Atkinson [1997] and
Kythe and Puri [2002]. Volterra integral equations are dealt with by Brunner and van
der Houwen [1986] and Brunner [2004], whereas singular integral equations are the
subject of Prössdorf and Silbermann [1991].
P4 Journals
Here we list the major journals (in alphabetical order) covering the areas of
numerical analysis and mathematical software.
ACM Transactions on Mathematical Software
Applied Numerical Mathematics
BIT Numerical Mathematics
Calcolo
Chinese Journal of Numerical Mathematics and Applications
Computational Mathematics and Mathematical Physics
Computing
IMA Journal on Numerical Analysis
Journal of Computational and Applied Mathematics
Mathematical Modelling and Numerical Analysis
Mathematics of Computation
Numerical Algorithms
Numerische Mathematik
SIAM Journal on Numerical Analysis
SIAM Journal on Scientific Computing
18.330 Lecture Notes: Boundary-Value Problems
Homer Reid
February 26, 2014

Contents
1 Boundary value problems
1.1 Reconstructing trajectories of particles moving in force fields
1.2 Deflection of a loaded beam
2 ODE Approach to Boundary-Value Problems: The Shooting Method
3 Linear-Algebra Approach to Boundary-Value Problems: The Finite-Difference Method
3.1 Example: The beam equation

1 Boundary value problems
In our discussion of ODEs we considered initial value problems, that is, ODEs
$d\mathbf{u}/dt = \mathbf{f}(t, \mathbf{u})$ in which we are given a vector $\mathbf{u}_0$ specifying all components of
the $\mathbf{u}$ vector at a single time point $t_0$. In such a situation, we are guaranteed
(assuming $\mathbf{f}$ satisfies certain niceness conditions discussed in our unit on ODEs)
the existence of a unique curve $\mathbf{u}(t)$ that satisfies the differential equation and
runs through the point $(t_0, \mathbf{u}_0)$.
An alternative type of ODE is the boundary value problem. In this case,
we are given only partial data for the components of the u vector, but we
are given these data for multiple time points t. Such problems arise in many
fields of science and engineering; for the purposes of numerical analysis they are
interesting not only because they reveal the limitations of the ODE techniques
we discussed previously, but also because they motivate the introduction of
finite-difference solution techniques, which then extend immediately to higher-dimensional PDEs.
1.1 Reconstructing trajectories of particles moving in force fields
For example, suppose we are biologists observing under a microscope the motion
of a bioparticle moving in a time-dependent force field $\mathbf{F}(t) = F(t)\,\hat{\mathbf{x}}$. (For
simplicity, we will consider here the case of 1D motion, although it is easy to
extend the discussion to higher dimensions.) For example, if the bioparticle has
charge $q$ and the $x$-component of the electric field is $E_x(t)$, then the force is
$F(t) = qE_x(t)$.
Suppose we observe that the position of the particle at time $t_1$ is $x_1$, while
at some later time $t_2$ it is at position $x_2$. (Note that we do not observe the
velocity of the particle.) We would like to reconstruct the trajectory that the
particle followed between $t_1$ and $t_2$. We then have a boundary-value problem of
the form
$$
\frac{d^2 x}{dt^2} = \frac{1}{m} F(t), \qquad x(t_1) = x_1, \qquad x(t_2) = x_2, \tag{1}
$$
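where $m$ is the mass of the bioparticle. To phrase this equation in the language
of first-order ODE systems, we define $u_1 = x$, $u_2 = \dot{x}$ and obtain the ODE
system
$$
\frac{d\mathbf{u}}{dt} = \frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} u_2 \\ F(t)/m \end{pmatrix} \tag{2}
$$
subject to the boundary conditions
$$
\mathbf{u}(t_1) = \begin{pmatrix} x_1 \\ ? \end{pmatrix}, \qquad \mathbf{u}(t_2) = \begin{pmatrix} x_2 \\ ? \end{pmatrix}. \tag{3}
$$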
The point is that we don't know the velocity of the particle at either endpoint,
which means we don't have an initial-value problem. This has at least two
immediate implications:
(a) The nice existence and uniqueness theorems for initial-value problems go
completely out the window; for a boundary-value problem like (2) there
may be no solution, or there may be multiple solutions, and these things
may be true even for perfectly nice f functions.
(b) Even assuming there is a solution curve u(t), we can't use the ODE algorithms we discussed previously to find points on it, because all of those
algorithms required that we start with a known point on the curve. In
this case we don't know all the coordinates of even a single point on the
curve, so none of our ODE integrators can get started.
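To make implication (a) concrete, here is a standard textbook illustration. The equation and endpoint values are chosen for this example and do not come from these notes; in particular, the force here depends on position (a unit-mass particle on a unit-stiffness spring) rather than only on time. Imposing $x(0) = 0$ leaves a one-parameter family of solutions, and the second boundary condition may select exactly one member, none, or all of them:

```latex
\begin{aligned}
&\frac{d^2 x}{dt^2} = -x, \quad x(0) = 0 \;\Longrightarrow\; x(t) = c \sin t, \\
&x(\pi/2) = 1: \quad c = 1 \quad \text{(exactly one solution)}, \\
&x(\pi) = 1: \quad c \sin \pi = 0 \neq 1 \quad \text{(no solution)}, \\
&x(\pi) = 0: \quad \text{any } c \quad \text{(infinitely many solutions)}.
\end{aligned}
```

The right-hand side $-x$ is as smooth as one could ask for, so the failure of existence or uniqueness comes from the boundary data, not from any lack of niceness in $f$.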
1.2 Deflection of a loaded beam
Another classic example of a boundary-value problem is the deflection of a
beam of constant cross-section forced to support a position-dependent weight
(mechanical engineers would say subject to a position-dependent load). The
relevant equation here is the Euler-Bernoulli equation,
$$
\alpha \frac{d^4 h}{dx^4} = q(x) \tag{4}
$$
where $h(x)$ is the vertical deflection of the beam at position $x$, $q(x)$ is the
position-dependent loading of the beam¹, and $\alpha$ is a material-dependent rigidity
parameter describing the beam's resistance to bending. Suppose the beam is
affixed rigidly to two supporting walls at positions $x_1$ and $x_2$. This means that
both the beam's deflection and slope are constrained to be 0 at both endpoints,
or in other words
$$
h(x_1) = 0, \qquad h'(x_1) = 0, \qquad h(x_2) = 0, \qquad h'(x_2) = 0.
$$
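If we proceed in the usual way to convert equation (4) to a first-order ODE
system, with $u_1 = h$, $u_2 = h'$, $u_3 = h''$, $u_4 = h'''$, we obtain
$$
\frac{d\mathbf{u}}{dx} = \frac{d}{dx}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} = \begin{pmatrix} u_2 \\ u_3 \\ u_4 \\ q(x)/\alpha \end{pmatrix} \tag{5}
$$
subject to the boundary conditions
$$
\mathbf{u}(x_1) = \begin{pmatrix} 0 \\ 0 \\ ? \\ ? \end{pmatrix}, \qquad \mathbf{u}(x_2) = \begin{pmatrix} 0 \\ 0 \\ ? \\ ? \end{pmatrix}. \tag{6}
$$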
As before, we can't simply use an ODE integrator to solve this equation because we don't have any full point on the solution curve from which to start
integrating.
1 For example, if the beam in question were a bookshelf, and there were heavier books near
the center of the shelf and lighter books near its edges, then the function q(x) would be peaked
near the center of the interval.
2 ODE Approach to Boundary-Value Problems: The Shooting Method
We noted above that our standard bag of ODE tricks for integrating initial-value
problems (such as Euler's method or RK4) can't get started on a boundary-value
problem like (2) or (5), because in order to use e.g. Euler's method we need to
know a point on the solution curve. In a problem like (2) we only know half
of a point on a solution curve: at $t_1$ we know the $u_1$ coordinate of the point,
but not the $u_2$ coordinate.
There is, however, a way to remedy this difficulty. Starting at $t = t_1$, we
guess a number for the $u_2$ coordinate. In the case of (2), this corresponds to
guessing an initial velocity for the particle. Denote our guess by $u_2^{\text{guess}}$. We now
have the coordinates of one full point on a curve at time $t_1$, and we call this
point $\mathbf{u}^{\text{guess}}$:
$$
\mathbf{u}^{\text{guess}} = \begin{pmatrix} u_1 \\ u_2^{\text{guess}} \end{pmatrix}.
$$
The existence and uniqueness theorems now guarantee that there exists a full
curve $\mathbf{u}^{\text{guess}}(t)$ satisfying the differential equation and the condition
$\mathbf{u}^{\text{guess}}(t_1) = \mathbf{u}^{\text{guess}}$. So we can now use any ODE algorithm we like to integrate our equation
to compute more points on this curve. In particular, we can integrate all the
way from $t_1$ to $t_2$ and evaluate the first component $u_1^{\text{guess}}(t_2)$. If this value equals $x_2$,
we're done! We have found our desired solution curve. If not, we have to go
back and try a new value for $u_2^{\text{guess}}$.
This method is known as the shooting method, for obvious reasons: integrating from $t_1$ to $t_2$ with initial position and velocity $(u_1, u_2^{\text{guess}})$ corresponds to
shooting the particle from that position with that velocity, and if we guess
the initial velocity just right then the particle will just pass through position $x_2$
at time $t_2$.
The difficulty is that we now have to solve a root-finding problem to compute
$u_2^{\text{guess}}$. Indeed, for each choice of $u_2^{\text{guess}}$ at time $t_1$ we can integrate the resulting
initial-value problem and compute the value it predicts for the coordinate $u_1$ at
time $t_2$. Denote this value by $u_1^{\text{integrated}}(u_2^{\text{guess}}; t_2)$. Choosing the correct value
of $u_2^{\text{guess}}$ then corresponds to finding a root of the nonlinear equation
$$
u_1^{\text{integrated}}(u_2^{\text{guess}}; t_2) - u_1^{\text{desired}}(t_2) = 0 \tag{7}
$$
where $u_1^{\text{desired}}(t_2)$ is the given boundary value at time $t_2$.
Equation (7), a nonlinear root-finding problem, is much more difficult to
solve than standard initial-value ODE problems. Moreover, for a problem like
(6) in which we are missing two or more necessary components from the initial-condition vector, we face the problem of finding a root of a multidimensional
function, again much harder than simply integrating an ODE.
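The procedure above can be sketched in a few lines of Python for the particle problem (1). This is a minimal illustration under stated assumptions, not the implementation used in these notes: the function names `integrate_rk4` and `shoot`, the use of RK4 with a fixed number of steps, the secant iteration as the root finder for (7), and the test force F(t)/m = -2 are all choices made for this example.

```python
def integrate_rk4(accel, t1, t2, x1, v1, steps=200):
    """Integrate x'' = accel(t) from t1 to t2 with classical RK4.

    The state is u = (x, v) with du/dt = (v, accel(t)), as in system (2).
    Returns the position x(t2).
    """
    h = (t2 - t1) / steps
    x, v = x1, v1
    for i in range(steps):
        t = t1 + i * h
        k1x, k1v = v, accel(t)
        k2x, k2v = v + 0.5 * h * k1v, accel(t + 0.5 * h)
        k3x, k3v = v + 0.5 * h * k2v, accel(t + 0.5 * h)
        k4x, k4v = v + h * k3v, accel(t + h)
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return x


def shoot(accel, t1, t2, x1, x2, v_a=0.0, v_b=1.0, tol=1e-10, max_iter=50):
    """Find the initial velocity whose trajectory reaches x2 at time t2.

    Applies the secant method to the residual on the left side of (7);
    v_a and v_b are two starting guesses for the unknown velocity.
    """
    def residual(v_guess):
        return integrate_rk4(accel, t1, t2, x1, v_guess) - x2

    r_a, r_b = residual(v_a), residual(v_b)
    for _ in range(max_iter):
        if abs(r_b) < tol:
            return v_b
        if r_b == r_a:
            raise ZeroDivisionError("secant step undefined: equal residuals")
        v_new = v_b - r_b * (v_b - v_a) / (r_b - r_a)
        v_a, r_a = v_b, r_b
        v_b, r_b = v_new, residual(v_new)
    raise RuntimeError("shooting iteration did not converge")


# Example (assumed data): F(t)/m = -2 with x(0) = 0 and x(1) = 0.
# The exact trajectory is x(t) = t (1 - t), whose initial velocity is 1.
v0 = shoot(lambda t: -2.0, 0.0, 1.0, 0.0, 0.0, v_a=0.0, v_b=2.0)
print(v0)
```

Because the force in (1) does not depend on position, the residual is an affine function of the guessed velocity, so the secant update lands on the root after a single step. Problems whose right-hand side depends nonlinearly on u generally need several iterations and sensible starting guesses.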
3 Linear-Algebra Approach to Boundary-Value Problems: The Finite-Difference Method
An alternative approach to boundary-value problems is to convert a differential
equation like (2) or (5) into an algebraic equation (more specifically, a linear
system of equations involving a matrix and two vectors) which we then solve
using computational linear algebra. This is the idea behind the finite-difference
method. It has several advantages over the shooting method outlined above,
the most significant of which is that it readily extends to higher dimensions,
where it constitutes one of the most widely used techniques for solving partial
differential equations (PDEs).
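The key idea here is something we discussed in our notes on numerical differentiation: when we work with finite-difference approximations to derivatives,
the operation of differentiation is equivalent to the operation of matrix multiplication. More specifically, if we have a vector $\mathbf{f}$ whose entries are samples of
some function $f(x)$ at evenly-spaced sample points, then there exists a matrix
$A$ such that the matrix-vector product $A\mathbf{f}$ is a vector whose entries are samples
of the second² derivative of $f$, i.e. if we have an interval $[a, b]$ and we define the
vectors
$$
\mathbf{f} = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix}, \qquad
\mathbf{f}'' = \begin{pmatrix} f''_1 \\ f''_2 \\ \vdots \\ f''_N \end{pmatrix},
$$
where
$$
f_n \equiv f(a + nh), \qquad f''_n \equiv f''(a + nh), \qquad n = 1, \dots, N, \qquad h = \frac{b - a}{N + 1},
$$
then the vectors $\mathbf{f}$ and $\mathbf{f}''$ are related³ by
$$
A\mathbf{f} = \mathbf{f}'' \tag{8}
$$
where the matrix $A$ looks like
$$
A = \frac{1}{h^2}
\begin{pmatrix}
-2 & 1 & 0 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & 0 & \cdots & 0 & 0 \\
0 & 1 & -2 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & 0 & \cdots & 1 & -2
\end{pmatrix}.
$$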
2 Of course this technique is not limited to the second derivative; we could alternatively
write down different matrices that, when applied to f , yield vectors of samples of its first
derivative, its fourth derivative, etc.
3 Equation (8) assumes that f (a) = f (b) = 0. Implementation of nontrivial boundary
conditions is discussed in our lecture notes on numerical differentiation.
(Equation (8) assumes that $f$ satisfies the boundary conditions $f(a) = f(b) = 0$;
other boundary conditions may be represented by adding suitable terms to the
RHS.)
Of course, as soon as we write down equation (8) we can immediately proceed
to invert that equation to find a relation predicting values of $f$ from the values
of $f''$:
$$
\mathbf{f} = A^{-1} \mathbf{f}''. \tag{9}
$$
The usefulness of this equation is that, in a boundary-value problem, we typically have a relation expressing $f''$ in terms of some known function. For
example, in (1), the second derivative of the function we seek is related to the
(known) force field $F(t)$. Then all we have to do is replace $\mathbf{f}''$ in (9) with the expression for the second derivative given by the differential equation in question,
and we can immediately solve for samples of the function $f(x)$.
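As a concrete instance of (9), the following Python sketch solves $f'' = g$ on $[a, b]$ with $f(a) = f(b) = 0$ using the tridiagonal matrix in (8). It is an illustration under stated assumptions: the function names `solve_tridiagonal` and `solve_second_derivative_bvp` are hypothetical, the right-hand side $g(x) = -2$ is chosen only because its exact solution $f(x) = x(1 - x)$ on $[0, 1]$ is known, and instead of forming $A^{-1}$ explicitly the sketch substitutes the Thomas algorithm, which solves the same linear system by elimination.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    lower[i] is the entry left of the diagonal in row i (lower[0] unused),
    upper[i] is the entry right of the diagonal (upper[-1] unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_second_derivative_bvp(g, a, b, N):
    """Solve f'' = g(x) on [a, b] with f(a) = f(b) = 0 at N interior points."""
    h = (b - a) / (N + 1)
    xs = [a + n * h for n in range(1, N + 1)]
    off = 1.0 / h**2    # entries next to the diagonal of A in (8)
    main = -2.0 / h**2  # diagonal entries of A in (8)
    rhs = [g(x) for x in xs]  # samples of the known second derivative
    fs = solve_tridiagonal([off] * N, [main] * N, [off] * N, rhs)
    return xs, fs


# Example (assumed data): f'' = -2 on [0, 1]; exact solution f(x) = x (1 - x).
xs, fs = solve_second_derivative_bvp(lambda x: -2.0, 0.0, 1.0, 9)
max_err = max(abs(f - x * (1.0 - x)) for x, f in zip(xs, fs))
print(max_err)
```

The three-point second difference is exact for quadratics, so in this particular test the error is at the level of rounding; for a general right-hand side one would expect an error that shrinks like $h^2$.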
3.1 Example: The beam equation
In this section we'll work through a finite-difference method for solving the
one-dimensional beam equation
$$
\frac{d^4 f}{dx^4} = \frac{1}{\alpha} q(x) \tag{10}
$$
over an interval $[a, b]$ with boundary conditions
$$
f(a) = f'(a) = f(b) = f'(b) = 0. \tag{11}
$$
Finite-difference stencil for $d^4/dx^4$
It is easy to verify that a finite-difference stencil with stepsize $h$ for the fourth
derivative of a function $f(x)$ at a point $x$ is
$$
f^{(4)}_{\text{FD}}(h, x) = \frac{f(x - 2h) - 4f(x - h) + 6f(x) - 4f(x + h) + f(x + 2h)}{h^4}. \tag{12}
$$
This stencil achieves second-order convergence, i.e. if $f^{(4)}(x)$ is the exact
fourth derivative of $f$ at $x$, then we have
$$
f^{(4)}_{\text{FD}}(h, x) - f^{(4)}(x) = O(h^2).
$$
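The second-order claim can be checked numerically. The short Python sketch below is an illustration, not part of the original notes: the function name `fourth_derivative_fd`, the test function sin, the evaluation point x = 1, and the two step sizes are assumptions made for the example. Halving $h$ should reduce the error by a factor close to 4 if the error behaves like $h^2$.

```python
import math


def fourth_derivative_fd(f, x, h):
    """Five-point centered stencil (12) for the fourth derivative of f at x."""
    return (f(x - 2 * h) - 4 * f(x - h) + 6 * f(x)
            - 4 * f(x + h) + f(x + 2 * h)) / h**4


x0 = 1.0
exact = math.sin(x0)  # the fourth derivative of sin is sin itself
err_coarse = abs(fourth_derivative_fd(math.sin, x0, 0.1) - exact)
err_fine = abs(fourth_derivative_fd(math.sin, x0, 0.05) - exact)
ratio = err_coarse / err_fine  # close to 4 for a second-order stencil
print(err_coarse, err_fine, ratio)
```

The step sizes are kept moderate on purpose: the stencil divides by $h^4$, so for very small $h$ rounding error in the numerator is amplified and eventually swamps the $O(h^2)$ truncation error.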
Implementation of boundary conditions
When we attempt to apply (12) at points within 1 or 2 sites of the ends of the
interval, we find that we need values for the quantities $f_{-1}$, $f_0$, $f_{N+1}$, $f_{N+2}$.
The values of $f_0$ and $f_{N+1}$ are fixed by the boundary conditions (11) to be
0. This leaves unspecified the values of $f_{-1}$ and $f_{N+2}$, but the condition that
$f' = 0$ at both endpoints winds up being equivalent to the requirement that
$f_{-1} = f_{N+2} = 0$. (Less trivial boundary conditions could be handled using the
method described in our lecture notes on numerical differentiation.)
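The matrix A
In view of the above considerations, the finite-difference matrix we want is
$$
A = \frac{1}{h^4}
\begin{pmatrix}
6 & -4 & 1 & 0 & 0 & \cdots & 0 & 0 \\
-4 & 6 & -4 & 1 & 0 & \cdots & 0 & 0 \\
1 & -4 & 6 & -4 & 1 & \cdots & 0 & 0 \\
0 & 1 & -4 & 6 & -4 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & 6 & -4 \\
0 & 0 & 0 & 0 & 0 & \cdots & -4 & 6
\end{pmatrix}.
$$
This matrix operates on a vector of samples of $f$ to yield a vector of samples of
$f^{(4)}$:
$$
A\mathbf{f} = \mathbf{f}^{(4)} \tag{13}
$$
where the $n$th elements of $\mathbf{f}$ and $\mathbf{f}^{(4)}$ are respectively
$$
f_n = f(a + nh), \qquad f^{(4)}_n = f^{(4)}(a + nh), \qquad h = \frac{b - a}{N + 1},
$$
and where we have assumed $f_{-1} = f_0 = f_{N+1} = f_{N+2} = 0$.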

The solution
Inverting equation (13), we have
$$
\mathbf{f} = A^{-1} \mathbf{f}^{(4)}. \tag{14}
$$
On the other hand, the differential equation (10) allows us to compute values
of $f^{(4)}$ here in terms of the loading function $q(x)$, i.e. we can put
$$
\mathbf{f}^{(4)} = \frac{1}{\alpha} \mathbf{q}
$$
where the elements of the vector $\mathbf{q}$ are the values of the function $q(x)$ at the
sample points $x_n$. Then equation (14) reads
$$
\mathbf{f} = \frac{1}{\alpha} A^{-1} \mathbf{q}. \tag{15}
$$
We solve this equation numerically using the Julia code reproduced below. The
results, for a forcing function $q(x) = x^2$, are plotted in Figure 1.

[Figure 1: Solution of beam equation with loading function $q(x) = x^2$. Curves shown: beam deflection, beam loading.]

# solve the beam equation on the interval [0:10] given a
# loading function q(x), a stiffness parameter Alpha, and a
# dimension N (where N is the dimension of the solution
# vector, so the stepsize is (b-a)/(N+1) )
using LinearAlgebra   # provides I, the identity used to build A

function SolveBeamEquation(q, Alpha, N)

  b=10.0;
  a=0.0;
  h=(b-a)/(N+1);
  h4=h^4;

  # start by making A a diagonal matrix with 6s on the diagonal
  # (eye(N,N) was removed in Julia 1.0; Matrix{Float64}(I, N, N) replaces it)
  A=6*Matrix{Float64}(I, N, N) / h4;

  # add the -4s on the first upper and lower sub-diagonals
  for n=1:N-1
    A[n,n+1]=-4.0 / h4;
    A[n+1,n]=-4.0 / h4;
  end

  # add the +1s on the second upper and lower sub-diagonals
  for n=1:N-2
    A[n,n+2]=+1.0 / h4;
    A[n+2,n]=+1.0 / h4;
  end

  # form the RHS vector
  # (xVector is just a vector of the sample points)
  # note we interpret q as the positive (upward-directed)
  # loading, so for downward-directed loading we want -q
  xVector = zeros(N);
  RHSVector = zeros(N);
  for n=1:N
    xVector[n]   = a+n*h;
    RHSVector[n] = -q( xVector[n] ) / Alpha;
  end

  # solve the system to obtain the solution vector y
  yVector = A\RHSVector;
  return yVector
end

Chebyshev Spectral Methods
Homer Reid
April 29, 2014
Contents
1 The question
2 The classical answer
3 The modern answer for periodic functions
4 The modern answer for non-periodic functions
5 Chebyshev polynomials
6 Chebyshev spectral methods
The question
In these notes we will concern ourselves with the following basic question: Given
a function $f(x)$ on an interval $x \in [a, b]$,
1. How accurately can we characterize $f$ using only samples of its value at
$N$ sample points $\{x_n\}$ in the interval $[a, b]$?
2. What is the optimal way to choose the $N$ sample points $\{x_n\}$?
What does it mean to characterize a function $f(x)$ over an interval $[a, b]$?
There are at least three possible answers:
1. We may want to evaluate the integral $\int_a^b f(x)\,dx$. In this case, the problem
of characterizing $f$ from $N$ function samples is the problem of designing
an $N$-point quadrature rule.
2. We may want to evaluate the derivative of f at each of our sample points
using the information contained in the sample values. This is the problem
of constructing a differentiation stencil, and it arises when we try to solve
ODEs or PDEs: in that case we are trying to reconstruct f (x) given knowledge of its derivative, so generally upon constructing the differentiation
stencil we will want to invert it.
3. We may want to construct an interpolant $f^{\text{interp}}(x)$ that agrees with $f(x)$
at the sample points but smoothly interpolates between those points in
a way that mimics the original function $f(x)$ as closely as possible. For
example, $f(x)$ may be the result of an experimental measurement or the
result of a costly numerical calculation, and we might want to accelerate
calculation of $f(x)$ at arbitrary values of $x$ by precomputing $f(x_n)$ at just the
sample points $\{x_n\}$ and then interpolating to get values at intermediate
points $x$.
In a sense, the first half of our course was devoted to studying the answer
to this question furnished by classical numerical analysis, while the second half
has been focused on the modern answer. Let's begin by reviewing what the
classical approach had to offer.
The classical answer
Classical numerical analysis answers the question of how to choose the sample
points $\{x_n\}$ in the simplest possible way: We simply take the sample points to
be evenly spaced throughout the interval $[a, b]$:¹
$$ x_n = a + n\Delta, \qquad n = 0, 1, \cdots, N, \qquad \Delta = \frac{b-a}{N}. $$
In this case,
• The quadrature rules one obtains are the usual Newton-Cotes quadrature
rules, which we studied in the first and second weeks of our course. These
work by fitting polynomials through the function samples and then integrating
those polynomials to approximate the integral of the function.
• The differentiation stencils one obtains are the usual finite-difference stencils,
which we studied in the third and fourth weeks of our course. These
may again be interpreted as a form of polynomial interpolation: we are
essentially constructing and differentiating a low-degree approximation to
the Taylor-series polynomial.
• The interpolant one constructs is the unique $N$th degree polynomial $P^{\text{interp}}(x)$
that agrees with the values of the underlying function $f(x)$ at the $N + 1$
sample points. Although we didn't get to this in the first unit of our
course, it turns out to be easy to write down a formula for this polynomial
in terms of the sample points $\{x_n\}$ and the values of $f$ at those points,
$\{f_n\} \equiv \{f(x_n)\}$. For example, for the cases $N = 1, 2, 3$ we have²
$$ P_1^{\text{interp}}(x) = f_1\frac{(x - x_2)}{(x_1 - x_2)} + f_2\frac{(x - x_1)}{(x_2 - x_1)} $$
$$ P_2^{\text{interp}}(x) = f_1\frac{(x - x_2)(x - x_3)}{(x_1 - x_2)(x_1 - x_3)} + f_2\frac{(x - x_1)(x - x_3)}{(x_2 - x_1)(x_2 - x_3)} + f_3\frac{(x - x_1)(x - x_2)}{(x_3 - x_1)(x_3 - x_2)} $$
$$ P_3^{\text{interp}}(x) = f_1\frac{(x - x_2)(x - x_3)(x - x_4)}{(x_1 - x_2)(x_1 - x_3)(x_1 - x_4)} + f_2\frac{(x - x_1)(x - x_3)(x - x_4)}{(x_2 - x_1)(x_2 - x_3)(x_2 - x_4)} $$
$$ \qquad\qquad + f_3\frac{(x - x_1)(x - x_2)(x - x_4)}{(x_3 - x_1)(x_3 - x_2)(x_3 - x_4)} + f_4\frac{(x - x_1)(x - x_2)(x - x_3)}{(x_4 - x_1)(x_4 - x_2)(x_4 - x_3)} $$
The formula of this type for general N is called the Lagrange interpolation
formula; it constructs an N th degree polynomial passing through N + 1
fixed data points (xn , fn ).
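The pattern in these formulas translates almost line for line into code. The following is a minimal Julia sketch of the general Lagrange formula; the function name `lagrange_interp` and the sample data are illustrative, not taken from the course materials.

```julia
# Evaluate the Lagrange interpolating polynomial through (xs[m], fs[m]) at the point x.
function lagrange_interp(xs, fs, x)
    total = 0.0
    for m in eachindex(xs)
        term = fs[m]
        for n in eachindex(xs)
            if n != m
                # each factor vanishes at x = xs[n] and equals 1 at x = xs[m]
                term *= (x - xs[n]) / (xs[m] - xs[n])
            end
        end
        total += term
    end
    return total
end

# Four samples of a cubic: the degree-3 interpolant should reproduce it.
xs = [0.0, 1.0, 2.0, 3.0]
fs = xs .^ 3 .- 2 .* xs
println(lagrange_interp(xs, fs, 1.5))   # compare with 1.5^3 - 2*1.5
```

This direct evaluation costs on the order of $N^2$ operations per point, which is fine for illustration but is not how one would evaluate interpolants in production code.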
1 Technically we have here a set of $N + 1$ points, not $N$ points as we stated above.
2 Do you see the pattern here? The general expression for $P_N$ includes one term for each
sample point $x_m$. The numerator of this term is a product of $N$ linear factors which are
constructed to ensure that the numerator vanishes whenever $x$ equals one of the other sample
points ($x = x_n$, $n \ne m$). The denominator of this term is just a constant chosen to replicate
the value of the numerator at $x = x_m$, which ensures that the fraction evaluates to 1 at
$x = x_m$. Then we just multiply by $f_m$ to obtain a term which yields $f_m$ at $x_m$ and vanishes
at all the other sample points. Summing all such terms for each sample point, we obtain an
$N$th degree polynomial which yields $f_n$ at each sample point $x_n$.
Performance of the classical approach on general functions

How well does the classical approach work?
• Integration: If we divide our interval into $N$ subintervals and approximate
the integral over each subinterval using a $p$th-order Newton-Cotes quadrature
rule, then we saw in Unit 1 that for general functions the error decays
like $\frac{1}{N^{p+1}}$, i.e. algebraically with $N$ (as opposed to exponentially with $N$).
• Differentiation: If we estimate derivative values via a $p$th-order finite-difference
stencil using function samples at points spaced by multiples of
$\Delta$, then the error decays like $\Delta^p$, or like $\frac{1}{N^p}$. [For example, the forward
finite-difference approximation $f'(x) \approx \frac{f(x+\Delta) - f(x)}{\Delta}$ has error proportional
to $\Delta$, while the centered finite-difference $f'(x) \approx \frac{f(x+\Delta) - f(x-\Delta)}{2\Delta}$ has error
proportional to $\Delta^2$.] Thus here again we find convergence algebraic in $N$,
not exponential in $N$.
• Interpolation: Polynomial interpolation in evenly-spaced sample points
is a notoriously badly-behaved procedure due to the Runge phenomenon
(we will discuss it briefly in an appendix). The Runge phenomenon is so
severe that, in some cases, the polynomial interpolant through $N$ evenly-spaced
function samples doesn't just converge slowly as $N \to \infty$.
It doesn't converge at all!
To summarize the results of the classical approach,
Classical approach: To characterize a function over an interval using $N$ function
samples, choose the sample points to be evenly-spaced points and construct
polynomial interpolants. The approach in general yields convergence algebraic
in $N$ for integration and differentiation, but does not converge for interpolation
of some functions.
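To make the algebraic convergence of finite differences concrete, here is a small Julia sketch comparing the forward and centered formulas quoted above. The test function ($\sin$ at $x = 1$) and the step sizes are arbitrary illustrative choices.

```julia
# Forward and centered finite-difference approximations to f'(x) with spacing d.
forward_diff(f, x, d) = (f(x + d) - f(x)) / d
centered_diff(f, x, d) = (f(x + d) - f(x - d)) / (2d)

x0 = 1.0
exact = cos(x0)          # derivative of sin at x0
for d in (1e-1, 1e-2, 1e-3)
    ef = abs(forward_diff(sin, x0, d) - exact)
    ec = abs(centered_diff(sin, x0, d) - exact)
    println("d = ", d, "  forward error = ", ef, "  centered error = ", ec)
end
```

Each tenfold reduction of the spacing should reduce the forward-difference error by roughly a factor of 10 and the centered-difference error by roughly a factor of 100, in line with errors proportional to $\Delta$ and $\Delta^2$.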
Performance of the classical approach on periodic functions

However, as we saw already in PSet 1, there is one exception to the general
rule of algebraic convergence: If the function we are integrating is periodic
over the interval in question, then simple Newton-Cotes quadrature using evenly-spaced
function samples achieves convergence exponential in $N$ (although differentiation and
interpolation continue to behave as above even for periodic functions). This
observation forms the basis of the modern approach, to which we now turn.
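The contrast between periodic and non-periodic integrands is easy to reproduce. The sketch below is a hedged illustration in Julia: the two integrands are hypothetical examples chosen only because their integrals over $[0, 2\pi]$ are known in closed form, and `trapezoid` is an illustrative helper name.

```julia
# Trapezoidal rule on [a, b] with N subintervals (N >= 2).
function trapezoid(f, a, b, N)
    d = (b - a) / N
    return d * ((f(a) + f(b)) / 2 + sum(f(a + n * d) for n in 1:N-1))
end

# Two hypothetical integrands on [0, 2*pi]:
periodic(t) = 1 / (2 + cos(t))     # 2*pi-periodic; integral is 2*pi/sqrt(3)
nonperiodic(t) = exp(t / 4)        # not periodic; integral is 4*(exp(pi/2) - 1)

exact_p = 2 * pi / sqrt(3)
exact_np = 4 * (exp(pi / 2) - 1)
for N in (4, 8, 16, 32)
    err_p = abs(trapezoid(periodic, 0.0, 2 * pi, N) - exact_p)
    err_np = abs(trapezoid(nonperiodic, 0.0, 2 * pi, N) - exact_np)
    println("N = ", N, "  periodic error = ", err_p, "  non-periodic error = ", err_np)
end
```

For the periodic integrand the error should collapse to rounding level within a few doublings of $N$, while for the non-periodic one it should only drop by about a factor of 4 per doubling.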
The modern answer for periodic functions
The classical approach (use evenly-spaced function samples and construct
polynomials) yields slow convergence in general and non-convergence (of the
polynomial interpolant) in some cases.
The modern approach, for periodic functions, retains the evenly-spaced sample points
of the classical approach but throws out the idea of using polynomials
to interpolate them, choosing instead to construct trigonometric interpolants
consisting of linear combinations of sinusoids of various frequencies.³
The performance of the modern approach for periodic functions follows logically
by aggregating a series of observations we made in our discussion of Fourier
analysis:
• If a function $f(t)$ is periodic with period $T$, then it has a Fourier-series
representation of the form
$$ f(t) = \sum_{n=-\infty}^{\infty} \tilde{f}_n e^{in\omega_0 t} $$
The
Modern approach, periodic functions: To characterize a
periodic function over an interval using N function samples,
choose the sample points to be evenly spaced throughout the
interval and construct a trigonometric interpolant consisting of
a sum of N sinusoids. The approach in general yields convergence exponential in N for integration, differentiation, and interpolation.
Performance of the modern approach for periodic functions
3 Linear combinations of sinusoids like $\sum_n [a_n \sin n\omega_0 t + b_n \cos n\omega_0 t]$ are sometimes called
"trigonometric polynomials" since they are in fact polynomials in the variable $e^{i\omega_0 t}$, but I
personally find this terminology a little confusing.
The modern answer for non-periodic functions
The modern answer to the characterization problem (sample at evenly-spaced
points and construct a trigonometric interpolant) works very well for periodic
functions. What do we do if we have a non-periodic function? Easy: we make
it into a periodic function. When you have such a powerful hammer, treat
everything like a nail! Let's review how this construction works.
Construct a smooth periodic version of $f(x)$

To construct a periodic function out of a non-periodic function $f(x)$, we restrict
our attention to the interval $x \in [-1, 1]$ (if you need to consider a different
interval, just shift and scale variables accordingly) and define
$$ g(\theta) = f(\cos\theta). $$
This is a smooth⁴ periodic function. As $\theta$ varies from 0 to $\pi$, $g(\theta)$ traces out the
behavior of $f(x)$ over the interval $[-1, 1]$ [that is, $g(\theta)$ traces out $f(x)$ backwards].
When $\theta$ crosses $\pi$ and continues on to $2\pi$, $g(\theta)$ turns around and begins to retrace
its steps, going backwards over the same terrain it covered between $\theta = 0$ and
$\pi$. Figure 1 (which also appeared in our notes on Clenshaw-Curtis quadrature)
shows an example of a non-periodic function $f(x)$ and the periodic function $g(\theta)$
that captures the behavior of $f$ over the interval $[-1, 1]$.
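Write down a Fourier cosine series for $g(\theta)$

Because $g(\theta)$ is $2\pi$-periodic and even, it has a Fourier cosine series of the
form
$$ g(\theta) = \frac{\tilde{a}_0}{2} + \sum_{\nu=1}^{\infty} \tilde{a}_\nu \cos(\nu\theta) \tag{1} $$
with coefficients
$$ \tilde{a}_\nu = \frac{2}{\pi}\int_0^{\pi} g(\theta)\cos(\nu\theta)\,d\theta. \tag{2} $$
Sample $g(\theta)$ at $N + 1$ evenly-spaced points and construct an interpolant

Now consider sampling the function $g(\theta)$ at $N + 1$ evenly-spaced points distributed
throughout the interval $[0, \pi]$, including the endpoints:
$$ g_n \equiv g(\theta_n) = g\left(\frac{n\pi}{N}\right), \qquad n = 0, 1, \cdots, N \tag{3} $$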
4 Assuming $f$ is smooth. The construction of the $g$ function doesn't do anything to smooth
out discontinuities in $f$ or any of its derivatives; it only smoothes out the discontinuities arising
from the mismatch at the endpoints.

Figure 1: (a) A function $f(t)$ that we want to integrate over the interval $[-1, 1]$.
(b) The function $g(\theta) = f(\cos\theta)$. Note the following facts: (1) $g(\theta)$ is periodic
with period $2\pi$. (2) $g(\theta)$ is an even function of $\theta$. (3) Over the interval $0 \le \theta \le \pi$,
$g(\theta)$ traces out the behavior of $f(t)$ as $t$ varies from $1 \to -1$ [i.e. $g(\theta)$
traces out $f(t)$ backwards.] However, (4) $g(\theta)$ knows nothing about what $f(t)$
does outside the range $-1 < t < 1$, which can make it a little tricky to compare
the two plots. For example, $g(\theta)$ has local minima at $\theta = 0, \pi$ even though $f(t)$
does not have local minima at $t = 1, -1$.
The discrete Fourier transform of the set of samples $\{g_n\}$ yields a set of Fourier
coefficients $\{\tilde{g}_\nu\}$:
$$ \{g_n\} \xrightarrow{\text{DFT}} \{\tilde{g}_\nu\} $$
From the $\{\tilde{g}_\nu\}$ coefficients we can reconstruct the original $\{g_n\}$ samples through
the magic of the inverse DFT:
$$ \{\tilde{g}_\nu\} \xrightarrow{\text{IDFT}} \{g_n\} $$
where the specific form of the reconstruction is
$$ g_n = \sum_{\nu=0}^{N} \tilde{g}_\nu e^{i\nu\theta_n}. \tag{4} $$
Now proceeding exactly as in our discussion of trigonometric interpolation, we
continue equation (4) from the integer variable $n$ to a real-valued variable $\theta$:
$$ g^{\text{interp}}(\theta) = \sum_{\nu=0}^{N} \tilde{g}_\nu e^{i\nu\theta} \tag{5} $$
Note that $g^{\text{interp}}(\theta)$ is (in general) not the same function as the original $g(\theta)$;
the difference is that the sum in (5) is truncated at $\nu = N$, whereas the Fourier
series for the full function $g(\theta)$ will in general contain infinitely many terms.
The form of (5) may be simplified by noting that, because $g(\theta)$ is an even
function of $\theta$, its Fourier series includes only cosine terms:
$$ g^{\text{interp}}(\theta) = \frac{\tilde{a}_0}{2} + \sum_{\nu=1}^{N/2} \tilde{a}_\nu \cos(\nu\theta) \tag{6} $$
where the $\tilde{a}_\nu$ coefficients are related to the $\tilde{g}_\nu$ coefficients computed by the DFT
according to
$$ \tilde{a}_0 = 2\tilde{g}_0, \qquad \tilde{a}_\nu = (\tilde{g}_\nu + \tilde{g}_{-\nu}) = 2\tilde{g}_\nu. $$
[The last equality here follows from the fact that, for an even function $g(\theta)$, the
Fourier series coefficients for positive and negative $\nu$ are equal, $\tilde{g}_\nu = \tilde{g}_{-\nu}$.]
The procedure we have outlined above uses general DFT techniques for
computing the numbers $\tilde{a}_\nu$. In this particular case, because $g(\theta)$ is an even
function, it is possible to accelerate the calculation by a factor of 4 using the
discrete cosine transform, a specialized version of the discrete Fourier transform.
We won't elaborate on this detail here.
Express $g^{\text{interp}}(\theta)$ in terms of the variable $x$

Finally, let's now ask what equation (6) looks like in terms of the original variable
$x$. If we recall the original definition
$$ g(\theta) \equiv f(\cos\theta) \tag{7} $$
we can manipulate this to read
$$ f(x) = g(\arccos x). \tag{8} $$
Now plugging in the approximation (6) yields an approximation to $f$:
$$ f^{\text{interp}}(x) = \frac{\tilde{a}_0}{2} + \sum_{\nu=1}^{N/2} \tilde{a}_\nu \cos(\nu \arccos x) \tag{9} $$
Equation (9) would appear at first blush to define a horribly ugly function of
$x$. It took the twisted⁵ genius of the Russian mathematician P. L. Chebyshev
to figure out that in fact equation (9) defines a polynomial function of $x$. To
understand how this could possibly be the case, we must now make a brief foray
into the world of the Chebyshev polynomials.
5 We intend this adjective in the most admiring possible sense.
Chebyshev polynomials
Trigonometric definition
The definition of the Chebyshev polynomials is inspired by the observation, from
high-school trigonometry, that $\cos(n\theta)$ is a polynomial in $\cos\theta$ for any $n$. For
example,
$$ \cos 2\theta = 2\cos^2\theta - 1 $$
$$ \cos 3\theta = 4\cos^3\theta - 3\cos\theta $$
$$ \cos 4\theta = 8\cos^4\theta - 8\cos^2\theta + 1 $$
The polynomials on the RHS of these equations define the Chebyshev polynomials
for $n = 2, 3, 4$. More generally, the $n$th Chebyshev polynomial $T_n(x)$ is
defined by the equation
$$ \cos n\theta = T_n(\cos\theta) $$
and the first few Chebyshev polynomials are
$$ T_0(x) = 1 $$
$$ T_1(x) = x $$
$$ T_2(x) = 2x^2 - 1 $$
$$ T_3(x) = 4x^3 - 3x $$
$$ T_4(x) = 8x^4 - 8x^2 + 1. $$
Figure 2 plots the first several Chebyshev polynomials. Notice the following
important fact: For all $n$ and all $x \in [-1, 1]$, we have $-1 \le T_n(x) \le 1$. This
boundedness property of the Chebyshev polynomials turns out to be quite useful
in practice.
On the other hand, the Chebyshev polynomials are not bounded between
$-1$ and $1$ for values of $x$ outside the interval $[-1, 1]$ (nor, being polynomials,
could they possibly be). Figure 3 shows what happens to $T_{15}(x)$ as soon as we
get even the slightest little bit outside the range $x \in [-1, 1]$: the polynomial
takes off to $\pm\infty$. In almost all situations involving Chebyshev polynomials we
will be interested in their behavior within the interval $[-1, 1]$.
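The defining relation $\cos n\theta = T_n(\cos\theta)$ and the blow-up outside $[-1, 1]$ can both be checked numerically. The Julia sketch below evaluates $T_n$ with the standard three-term recurrence $T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x)$, which is not derived in these notes but follows from the identity $\cos(k+1)\theta + \cos(k-1)\theta = 2\cos\theta\cos k\theta$. The function name `chebyshev_T` is an illustrative choice.

```julia
# Evaluate T_n(x) with the three-term recurrence T_{k+1} = 2x T_k - T_{k-1}.
function chebyshev_T(n, x)
    n == 0 && return one(x)
    Tprev, Tcur = one(x), x
    for k in 1:n-1
        Tprev, Tcur = Tcur, 2x * Tcur - Tprev
    end
    return Tcur
end

theta = 0.7
println(chebyshev_T(4, cos(theta)) - cos(4theta))   # difference should be at rounding level
println(chebyshev_T(15, 1.0))                        # on the edge of [-1, 1]
println(chebyshev_T(15, 1.05))                       # just outside [-1, 1]: already large
```

The recurrence is the usual way to evaluate $T_n$ because it works for any real $x$, including values outside $[-1, 1]$ where $\arccos x$ is not real.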
Completeness and Orthogonality

The Chebyshev polynomials constitute our first example of an orthogonal family
of polynomials. We will have more to say about this idea later, but for the time
being the salient points are the following:
1. The Chebyshev polynomials are complete: Any $N$th-degree polynomial
can be expressed exactly (and uniquely) as a linear combination of $T_0(x), T_1(x), \cdots, T_N(x)$.
Thus the set of $N + 1$ functions $\{T_n\}$ for $n = 0, \cdots, N$ forms a basis of
the $(N + 1)$-dimensional vector space of $N$-th degree polynomials.
Figure 2: The Chebyshev polynomials $T_0(x)$ through $T_4(x)$ and $T_{15}(x)$, plotted for $x \in [-1, 1]$.

Figure 3: The Chebyshev polynomials $T_n(x)$ take off to $\pm\infty$ for values of $x$
outside the range $[-1, 1]$. Shown here is the case $T_{15}(x)$.
2. The Chebyshev polynomials are orthogonal with respect to the following
inner product:⁶
$$ \langle f, g \rangle \equiv \int_{-1}^{1} \frac{f(x)g(x)\,dx}{\sqrt{1 - x^2}}. $$
Orthogonality means that if we insert $T_n$ and $T_m$ in the inner product we
get zero unless $n = m$:
$$ \langle T_n, T_m \rangle = \frac{\pi}{2}\delta_{nm}. \tag{10} $$
(Strictly, the value $\pi/2$ applies when $n = m \ge 1$; for $n = m = 0$ the inner
product is $\pi$, so the $m = 0$ coefficient below carries an extra factor of 2, just
like the $\tilde{a}_0/2$ term of a Fourier cosine series.)
Taken together, these two properties furnish a convenient way to represent arbitrary
functions as linear combinations of Chebyshev polynomials. The first
property tells us that, given any function $f(x)$, we can write $f(x)$ in the form
$$ f(x) = \sum_{n=0}^{\infty} C_n T_n(x). \tag{11} $$
The second property gives us a convenient way to extract the $C_n$ coefficients:
Just take the inner product of both sides of (11) with $T_m(x)$. Because of orthogonality
(equation 10), every term on the RHS dies except for the one involving
$C_m$, and we find
$$ \langle f, T_m \rangle = \frac{\pi}{2} C_m $$
[where the $\pi/2$ factor here comes from equation (10)]. In other words, the
Chebyshev expansion coefficients of a general function $f(x)$ are
$$ C_m = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)T_m(x)}{\sqrt{1 - x^2}}\,dx. \tag{12} $$
Equations (11) and (12) amount to what we might refer to as the forward
and inverse discrete Chebyshev transforms of a function $f(x)$.
6 An inner product on a vector space $V$ is just a rule that assigns a real number to any pair
of elements in $V$. (Mathematicians would say it is a map $V \times V \to \mathbb{R}$.) The rule has to be
linear (the inner product of a linear combination is a linear combination of the inner products)
and non-degenerate, meaning no non-zero element has vanishing inner product with itself.
Chebyshev spectral methods
Chebyshev spectral methods furnish the second half of the modern solution to
the problem we posed at the beginning of these notes, namely, how best to
characterize a function using samples of its value at N points.
Recall that the first half of the modern solution went like this:
Modern approach, periodic functions: To characterize a
periodic function over an interval using N function samples,
choose the sample points to be evenly spaced throughout the
interval and construct a trigonometric interpolant consisting of
a sum of N sinusoids. The approach in general yields convergence exponential in N for integration, differentiation, and interpolation.
The second half of the modern solution now reads like this:
Modern approach, non-periodic functions: To characterize a non-periodic
function over an interval using $N$ function
samples, map the interval into $[-1, 1]$, choose the sample points
to be Chebyshev points, and construct a polynomial interpolant
consisting of a sum of $N$ Chebyshev polynomials. The approach
in general yields convergence exponential in $N$ for integration,
differentiation, and interpolation.
Let's now investigate how Chebyshev spectral methods work for each of the
various aspects of the characterization problem we considered above.
Chebyshev approximation
As we saw previously, a function $f(x)$ on the interval $[-1, 1]$ may be represented
exactly as a linear combination of Chebyshev polynomials:
$$ f(x) = \sum_{n=0}^{\infty} C_n T_n(x) \tag{13} $$
One way to obtain a formula for the $C$ coefficients in this expansion is to take
the inner product of both sides with $T_m(x)$ and use the orthogonality of the $T$
functions:
$$ C_m = \frac{\langle f, T_m \rangle}{\langle T_m, T_m \rangle} = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)T_m(x)}{\sqrt{1 - x^2}}\,dx. \tag{14} $$
However, there are better ways to compute these coefficients, as discussed below.
If we restrict the sum in (13) to include only its first $N$ terms, we obtain an
approximate representation of $f(x)$, the $N$th Chebyshev approximant:
$$ f^{\text{approx}}(x) = \sum_{n=0}^{N-1} C_n T_n(x) \tag{15} $$
Chebyshev interpolation
The coefficients $C_n$ in formula (15) for the Chebyshev approximant may be
computed using the integral formula (14), but there are easier ways to get them.
These are based on the following alternative characterization of (15):
The $N$-th Chebyshev approximant (15) is the unique $N$-th degree polynomial
that agrees with $f(x)$ at the $N + 1$ Chebyshev
points $x_n = \cos\left(\frac{n\pi}{N}\right)$, $n = 0, 1, \cdots, N$.
Thus, when we construct (15), we are really constructing an interpolant that
smoothly connects $N + 1$ samples of $f(x)$ evaluated at the Chebyshev points.
In particular, the values of $f$ at the Chebyshev points are the only data we need
to construct $f^{\text{approx}}$ in (15). This is not obvious from expression (14), which
would seem to suggest that we need to know $f$ throughout the interval $[-1, 1]$.
How do we use this characterization of (15) to compute the Chebyshev expansion
coefficients $\{C_n\}$ in (15)? There are at least two ways to proceed:
1. We could use the Lagrange interpolation formula to construct the unique
$N$-th degree polynomial running through the data points $\{x_n, f(x_n)\}$ for
the $N + 1$ Chebyshev points $x_n = \cos\left(\frac{n\pi}{N}\right)$, $n = 0, 1, \cdots, N$.
2. We could observe that the $C_n$ coefficients are the coefficients in the Fourier
cosine series of the even $2\pi$-periodic function $g(\theta) = f(\cos\theta)$. The samples
of $g(\theta)$ at evenly-spaced points $g\left(\frac{n\pi}{N}\right)$ are precisely the samples
of $f(x)$ at the Chebyshev points $\cos\left(\frac{n\pi}{N}\right)$, and the Fourier cosine series
coefficients may be computed by computing the discrete cosine transform
of the set of numbers $\{f_n\}$:
$$ \{f_n\} \xrightarrow{\text{DCT}} \{C_n\} $$
where
$$ f_n = f\left(\cos\frac{n\pi}{N}\right), \qquad n = 0, 1, \cdots, N. $$
Option 1 here is discussed in Trefethen, Spectral Methods in MATLAB,
Chapter 6 (see particularly Exercise 6.1).
Here we will focus on option 2. The numbers $C_n$ are just the Fourier cosine
series coefficients of $g(\theta)$, i.e. the numbers we called $\tilde{a}_\nu$ in equation (2):
$$ C_n = \frac{2}{\pi}\int_0^{\pi} f(\cos\theta)\cos(n\theta)\,d\theta. $$
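We compute the integral using a simple $(N + 1)$-point trapezoidal rule:
$$ C_n = \frac{2}{N}\left[\frac{1}{2}f_0 + f_1\cos\left(\frac{n\pi}{N}\right) + f_2\cos\left(\frac{2n\pi}{N}\right) + \cdots + f_{N-1}\cos\left(\frac{(N-1)n\pi}{N}\right) + \frac{1}{2}f_N\cos\left(\frac{Nn\pi}{N}\right)\right] \tag{16} $$
where
$$ f_n \equiv f\left(\cos\frac{n\pi}{N}\right). $$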
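If we write out equation (16) for all of the $C_n$ coefficients at once, we have an
$(N + 1)$-dimensional linear system relating the sets of numbers $\{f_n\}$ and $\{C_n\}$:
$$
\frac{2}{N}
\begin{pmatrix}
\frac{1}{2} & 1 & 1 & 1 & \cdots & \frac{1}{2} \\
\frac{1}{2} & \cos\frac{\pi}{N} & \cos\frac{2\pi}{N} & \cos\frac{3\pi}{N} & \cdots & \frac{1}{2}\cos\pi \\
\frac{1}{2} & \cos\frac{2\pi}{N} & \cos\frac{4\pi}{N} & \cos\frac{6\pi}{N} & \cdots & \frac{1}{2}\cos 2\pi \\
\frac{1}{2} & \cos\frac{3\pi}{N} & \cos\frac{6\pi}{N} & \cos\frac{9\pi}{N} & \cdots & \frac{1}{2}\cos 3\pi \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{1}{2} & \cos\pi & \cos 2\pi & \cos 3\pi & \cdots & \frac{1}{2}\cos N\pi
\end{pmatrix}
\begin{pmatrix} f_0 \\ f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_N \end{pmatrix}
=
\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ \vdots \\ C_N \end{pmatrix}
$$
which we could write in the form
$$ \mathbf{\Lambda}\mathbf{f} = \mathbf{C} \tag{17} $$
where $\mathbf{f}$ and $\mathbf{C}$ are the $(N + 1)$-dimensional vectors of function samples at
Chebyshev points and Chebyshev expansion coefficients, respectively, and the
elements of the $\mathbf{\Lambda}$ matrix are
$$
\Lambda_{nm} =
\begin{cases}
\frac{1}{N}, & m = 0 \\
\frac{2}{N}\cos\left(\frac{nm\pi}{N}\right), & m = 1, \cdots, N - 1 \\
\frac{1}{N}\cos n\pi, & m = N
\end{cases}
$$
where the $n, m$ indices run from 0 to $N$.
Using equation (17) directly is actually not a good way to compute the $C$
coefficients from the $f$ samples, because the computational cost of the matrix-vector
multiplication scales like $N^2$, whereas FFT techniques (the fast cosine
transform) can perform the same computation with cost scaling like $N \log N$.
However, the existence of the $\mathbf{\Lambda}$ matrix is useful for deriving Clenshaw-Curtis
quadrature rules and Chebyshev differentiation matrices, as we will now see.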
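For small $N$ the matrix form is convenient for experimentation. The following Julia sketch builds the matrix of equation (17) from its element formula and applies it to samples of $T_2(x)$, for which the only nonzero coefficient should be $C_2 = 1$. The helper names `chebyshev_points` and `lambda_matrix` are illustrative, and the symbol name for the matrix is taken from the reconstruction above.

```julia
# Chebyshev points x_n = cos(n*pi/N), n = 0..N
chebyshev_points(N) = [cos(n * pi / N) for n in 0:N]

# The (N+1) x (N+1) matrix of equation (17); row index n (coefficient) and
# column index m (sample) both run from 0 to N, stored with an offset of 1.
function lambda_matrix(N)
    L = zeros(N + 1, N + 1)
    for n in 0:N, m in 0:N
        if m == 0
            L[n+1, m+1] = 1 / N
        elseif m == N
            L[n+1, m+1] = cos(n * pi) / N
        else
            L[n+1, m+1] = 2 * cos(n * m * pi / N) / N
        end
    end
    return L
end

N = 8
x = chebyshev_points(N)
f = 2 .* x .^ 2 .- 1          # samples of T_2(x) at the Chebyshev points
C = lambda_matrix(N) * f
println(C)                    # C[3] (i.e. C_2) should be close to 1, the rest close to 0
```

Note that with this trapezoidal normalization the constant function $f(x) = 1$ yields $C_0 = 2$, consistent with the $\tilde{a}_0/2$ convention of the cosine series.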
Chebyshev integration
The Chebyshev spectral approach to integrating a function $f(x)$ goes like this:
1. Construct the $N$th Chebyshev approximant $f^{\text{approx}}(x)$ to $f(x)$ [equation
(15)].
2. Integrate the approximant and take this as an approximation to the integral.
In symbols, we have
$$ \int_{-1}^{1} f(x)\,dx \approx \int_{-1}^{1} f^{\text{approx}}(x)\,dx $$
Insert equation (15):
$$ = \sum_{m=0}^{N} C_m \int_{-1}^{1} T_m(x)\,dx. \tag{18} $$
But the integrals of the Chebyshev polynomials can be evaluated in closed form,
with the result
$$
\int_{-1}^{1} T_m(x)\,dx =
\begin{cases}
\frac{2}{1 - m^2}, & m \text{ even} \\
0, & m \text{ odd.}
\end{cases}
\tag{19}
$$
Thus equation (18) reads
$$ \int_{-1}^{1} f(x)\,dx \approx \sum_{\substack{m=0 \\ m \text{ even}}}^{N} \frac{2C_m}{1 - m^2}. \tag{20} $$
Does this expression look familiar? It is exactly what we found in our discussion
of Clenshaw-Curtis quadrature, except there we interpreted the integral (19) in
the equivalent form
$$ \int_{-1}^{1} T_m(x)\,dx = \int_0^{\pi} \cos(m\theta)\sin\theta\,d\theta. $$
Thus the Chebyshev spectral approach to integration is just Clenshaw-Curtis
quadrature. As we have observed, the $C_m$ coefficients may be computed exactly
up to $m = N$ using $N + 1$ samples of the function $f(x)$ (where the samples
are taken at the Chebyshev points). Indeed, we can write (20) in the form of
a vector-vector product involving the vector $\mathbf{C}$ of Chebyshev expansion coefficients:
$$
\int_{-1}^{1} f(x)\,dx \approx \mathbf{W}^T\mathbf{C}, \qquad
\mathbf{W} = \begin{pmatrix} 2 \\ 0 \\ \frac{2}{1 - 2^2} \\ 0 \\ \frac{2}{1 - 4^2} \\ \vdots \\ \frac{2}{1 - N^2} \end{pmatrix}
$$
Now plugging in equation (17) yields
$$ \int_{-1}^{1} f(x)\,dx \approx \mathbf{W}^T\mathbf{\Lambda}\mathbf{f} \tag{21} $$
$$ = \mathbf{w}^T\mathbf{f} \tag{22} $$
which just illustrates that the weights of the $(N + 1)$-point Clenshaw-Curtis
quadrature rule are the elements of the vector $\mathbf{w}^T = \mathbf{W}^T\mathbf{\Lambda}$.
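The weight construction above can be sketched in a few lines of Julia. One detail has to be handled explicitly in code: with the trapezoidal normalization of equation (16), the coefficients $C_0$ and $C_N$ enter the interpolating polynomial with weight $1/2$, so the first and last entries of $\mathbf{W}$ are halved below. That halving is an assumption made for this sketch and is not spelled out in the text; the names `lam` and `clenshaw_curtis` are illustrative.

```julia
# Element (n, m) of the matrix in equation (17), indices running from 0 to N.
lam(n, m, N) = (m == 0 ? 1.0 : m == N ? cos(n * pi) : 2 * cos(n * m * pi / N)) / N

# Nodes and weights of the (N+1)-point Clenshaw-Curtis rule on [-1, 1].
function clenshaw_curtis(N)
    x = [cos(n * pi / N) for n in 0:N]
    Lambda = [lam(n, m, N) for n in 0:N, m in 0:N]
    # W[m+1] = 2/(1 - m^2) for even m and 0 for odd m, as in equation (19)
    W = [iseven(m) ? 2.0 / (1 - m^2) : 0.0 for m in 0:N]
    # Assumption (not stated in the notes): C_0 and C_N carry weight 1/2 in the interpolant.
    W[1] /= 2
    W[end] /= 2
    w = Lambda' * W
    return x, w
end

x, w = clenshaw_curtis(16)
approx = sum(w .* exp.(x))
println(approx - (exp(1) - exp(-1)))   # error in the integral of exp(x) over [-1, 1]
```

A quick consistency check is that the weights sum to 2, the length of the interval, since the rule must integrate the constant function exactly.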
Chebyshev differentiation
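In the first unit of our course we saw how to use finite-difference techniques to
approximate derivative values from function values. For example, if $\mathbf{f}_{\text{even}}$ is a
vector of function samples taken at evenly-spaced points in an interval $[a, b]$, i.e.
if
$$
\mathbf{f}_{\text{even}} = \begin{pmatrix} f(a) \\ f(a + \Delta) \\ f(a + 2\Delta) \\ \vdots \\ f(b) \end{pmatrix}
$$
then the vector of derivative values at the sample points may be represented in
the centered-finite-difference approximation as a matrix-vector product of the
form
$$ \mathbf{f}'_{\text{even}} = \mathbf{D}^{\text{CFD}}\mathbf{f}_{\text{even}} $$
where⁷
$$
\mathbf{D}^{\text{CFD}} = \frac{1}{2\Delta}
\begin{pmatrix}
 0 &  1 &  0 &  0 & \cdots & 0 & 0 \\
-1 &  0 &  1 &  0 & \cdots & 0 & 0 \\
 0 & -1 &  0 &  1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
 0 &  0 &  0 &  0 & \cdots & 0 & 1 \\
 0 &  0 &  0 &  0 & \cdots & -1 & 0
\end{pmatrix}
$$
As we saw in our discussion of finite-difference techniques, this approximation
will converge like $1/N^2$, i.e. the error between our approximate derivative and
the actual derivative will decay like $1/N^2$.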
Now that we are equipped with Chebyshev spectral methods, we can write
a numerical differentiation stencil whose errors will decay exponentially⁸ in $N$.
Indeed, following the general spirit of Chebyshev spectral methods, all we have
to do is
1. Construct the $N$th Chebyshev approximant $f^{\text{approx}}(x)$ to $f(x)$ [equation
(15)].
2. Differentiate the approximant and take this as an approximation to the
derivative.
The $N$th Chebyshev approximant to $f(x)$ is
$$ f_{\text{approx}}(x) = \sum_{m=0}^{N} C_m T_m(x) $$
Differentiating, we find
$$ f'_{\text{approx}}(x) = \sum_{m=0}^{N} C_m T'_m(x). $$
If we evaluate this formula at each of the $(N + 1)$ Chebyshev points
$x_n = \cos\left(\frac{n\pi}{N}\right)$, $n = 0, 1, \cdots, N$, we obtain a vector $\mathbf{f}'_{\text{cheb}}$ whose entries are approximate
values of the derivative of $f$ at the Chebyshev points, and which is related to
the vector $\mathbf{C}$ of Chebyshev coefficients via a matrix-vector product relationship:
$$
\underbrace{\begin{pmatrix} f'(x_0) \\ f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_N) \end{pmatrix}}_{\mathbf{f}'_{\text{cheb}}}
=
\underbrace{\begin{pmatrix}
T'_0(x_0) & T'_1(x_0) & T'_2(x_0) & \cdots & T'_N(x_0) \\
T'_0(x_1) & T'_1(x_1) & T'_2(x_1) & \cdots & T'_N(x_1) \\
T'_0(x_2) & T'_1(x_2) & T'_2(x_2) & \cdots & T'_N(x_2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T'_0(x_N) & T'_1(x_N) & T'_2(x_N) & \cdots & T'_N(x_N)
\end{pmatrix}}_{\mathbf{T}'}
\underbrace{\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ \vdots \\ C_N \end{pmatrix}}_{\mathbf{C}}
\tag{23}
$$
7 We are here assuming that $f$ vanishes to the left and right of the endpoints; as we saw
earlier in the course, it is easy to generalize to arbitrary boundary values of $f$.
8 Technically: faster than any polynomial in $N$.
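Let's abbreviate this equation by writing
$$ \mathbf{f}'_{\text{cheb}} = \mathbf{T}'\mathbf{C} $$
where $\mathbf{T}'$ is the $(N + 1) \times (N + 1)$-dimensional matrix in (23). If we now plug
in $\mathbf{C} = \mathbf{\Lambda}\mathbf{f}_{\text{cheb}}$ [equation 17], we get
$$ \mathbf{f}'_{\text{cheb}} = \underbrace{\mathbf{T}'\mathbf{\Lambda}}_{\mathbf{D}_{\text{cheb}}}\mathbf{f}_{\text{cheb}} $$
This equation identifies the $(N + 1) \times (N + 1)$ matrix
$$ \mathbf{D}_{\text{cheb}} = \mathbf{T}'\mathbf{\Lambda} $$
as the matrix that operates on a vector of $f$ samples at Chebyshev points to
yield a vector of $f'$ samples at Chebyshev points.
Second derivatives
What if we need to compute second derivatives? Easy! Just go like this:
$$ \mathbf{f}''_{\text{cheb}} = \mathbf{D}_{\text{cheb}}\mathbf{f}'_{\text{cheb}} = \mathbf{D}_{\text{cheb}}\mathbf{D}_{\text{cheb}}\mathbf{f}_{\text{cheb}} = \left(\mathbf{D}_{\text{cheb}}\right)^2\mathbf{f}_{\text{cheb}}. $$
This equation identifies the $(N + 1) \times (N + 1)$ matrix $(\mathbf{D}_{\text{cheb}})^2$, i.e. just the square
of the matrix $\mathbf{D}_{\text{cheb}}$, as the matrix that operates on a vector of $f$ samples at
Chebyshev points to yield a vector of $f''$ samples at Chebyshev points.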
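Here is a hedged Julia sketch of the construction $\mathbf{D}_{\text{cheb}} = \mathbf{T}'\mathbf{\Lambda}$. It relies on two facts not derived in these notes: $T'_m(\cos\theta) = m\sin(m\theta)/\sin\theta$ in the interior, with limiting values $m^2$ at $x = 1$ and $(-1)^{m+1}m^2$ at $x = -1$. As in the quadrature case, the last column of $\mathbf{T}'$ is halved on the assumption that $C_N$ enters the interpolant with weight $1/2$ (the $m = 0$ column is zero anyway). The names `dT`, `lam` and `chebyshev_diff_matrix` are illustrative.

```julia
# T'_m at the Chebyshev point x_n = cos(n*pi/N), with the endpoint limits handled separately.
function dT(m, n, N)
    n == 0 && return float(m^2)                      # x = +1
    n == N && return float((-1)^(m + 1) * m^2)       # x = -1
    t = n * pi / N
    return m * sin(m * t) / sin(t)
end

# Element (n, m) of the matrix in equation (17), indices running from 0 to N.
lam(n, m, N) = (m == 0 ? 1.0 : m == N ? cos(n * pi) : 2 * cos(n * m * pi / N)) / N

function chebyshev_diff_matrix(N)
    Lambda = [lam(n, m, N) for n in 0:N, m in 0:N]
    Tp = [dT(m, n, N) for n in 0:N, m in 0:N]   # rows: points, columns: polynomial degree
    Tp[:, end] ./= 2    # assumption: C_N carries weight 1/2 in the interpolant
    return Tp * Lambda
end

N = 16
x = [cos(n * pi / N) for n in 0:N]
D = chebyshev_diff_matrix(N)
println(maximum(abs.(D * exp.(x) .- exp.(x))))          # first derivative of exp
println(maximum(abs.(D * (D * exp.(x)) .- exp.(x))))    # second derivative via D^2
```

Both printed errors should be many orders of magnitude smaller than what a centered finite difference achieves on a 17-point grid, although the second-derivative error is somewhat larger because rounding errors are amplified by each application of the matrix.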

Clenshaw-Curtis Quadrature
Homer Reid
April 15, 2014
Contents
1 Newton-Cotes Quadrature
2 Fourier-Series Convergence Analysis of the Trapezoidal Rule
3 Clenshaw-Curtis Quadrature
4 Clenshaw-Curtis Quadrature Rules
Newton-Cotes Quadrature
In the first unit of the course we discussed Newton-Cotes quadrature. Recall
that this technique approximates an integral $\int_a^b f(x)\,dx$ by (1) dividing $[a, b]$
into $N$ subintervals of width $\Delta = \frac{b-a}{N}$, (2) approximating $f(x)$ by a $p$-th degree
polynomial $P(x)$ on each subinterval (where $P$ is chosen to match the values of
$f$ at $p + 1$ equally spaced points in the subinterval), and then (3) integrating
$P(x)$ over the subinterval to approximate the integral of $f$. The upshot is that,
for each value of $p$, we obtain a Newton-Cotes quadrature rule for the integral
of our function. As a reminder, the rules obtained for the first three values of $p$
are listed in the following table.

Name               Approximation to $\int_a^b f(x)\,dx$
rectangular rule   $\Delta \sum_{n=0}^{N-1} f(a + n\Delta)$
trapezoidal rule   $\Delta \sum_{n=0}^{N-1} \frac{1}{2}\left[ f(a + n\Delta) + f(a + (n+1)\Delta) \right]$
Simpson's rule     $\Delta \sum_{n=0}^{N-1} \frac{1}{6}\left[ f(a + n\Delta) + 4f\left(a + (n + \tfrac{1}{2})\Delta\right) + f(a + (n+1)\Delta) \right]$
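The three rules in the table are short enough to write out directly. The Julia sketch below is a minimal illustration (the function names and the test integrand $x^3$ on $[0, 1]$ are arbitrary choices), written as literal transcriptions of the table rather than optimized implementations.

```julia
# Composite Newton-Cotes rules on [a, b] with N subintervals of width d = (b - a)/N.
function rectangular(f, a, b, N)
    d = (b - a) / N
    return d * sum(f(a + n * d) for n in 0:N-1)
end

function trapezoidal(f, a, b, N)
    d = (b - a) / N
    return d * sum((f(a + n * d) + f(a + (n + 1) * d)) / 2 for n in 0:N-1)
end

function simpson(f, a, b, N)
    d = (b - a) / N
    return d * sum((f(a + n * d) + 4 * f(a + (n + 0.5) * d) + f(a + (n + 1) * d)) / 6
                   for n in 0:N-1)
end

cube(x) = x^3      # exact integral over [0, 1] is 1/4
for N in (10, 100)
    println(N, "  ", rectangular(cube, 0.0, 1.0, N) - 0.25,
               "  ", trapezoidal(cube, 0.0, 1.0, N) - 0.25,
               "  ", simpson(cube, 0.0, 1.0, N) - 0.25)
end
```

Going from $N = 10$ to $N = 100$, the rectangular-rule error should drop by about a factor of 10 and the trapezoidal-rule error by about a factor of 100, while Simpson's rule integrates a cubic exactly up to rounding.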
When we discussed Newton-Cotes quadrature previously, we offered the following
heuristic convergence analysis: The $p$-th order NC rule models $f$ as a
$p$-th degree polynomial, which means the error in the approximation is a polynomial
that starts at degree $p + 1$. The integral of this error polynomial over
an interval of width $\Delta$ is proportional to $\Delta^{p+2} \sim \frac{1}{N^{p+2}}$. Hence the error in our
approximate estimate of the integral over each subinterval is
$$ \text{error per subinterval} \sim \frac{1}{N^{p+2}} $$
and there are $N$ subintervals, so
$$ \text{total error} = N \cdot (\text{error per subinterval}) \sim \frac{1}{N^{p+1}} \tag{1} $$
In other words, our heuristic convergence analysis suggests that the error should
decay algebraically with $N$, with faster decay for larger values of $p$. However,
this analysis is clearly oversimplified; in particular, equation (1) blindly sums
the errors within each subinterval, without considering the possibility of cancellations
among the errors in different subintervals.
When you investigated the performance of NC quadrature rules in PSet 1,
you found that the heuristic prediction (1) is actually borne out in practice on
a fairly wide class of functions, but with some glaring exceptions. In particular,
although the error incurred by the rectangular and trapezoidal rules did indeed
decay respectively like $\frac{1}{N}$ and $\frac{1}{N^2}$ for most functions, in some special cases
(namely, for periodic functions integrated over their period or an integer multiple
of their period) the error seemed to be decaying exponentially rapidly with $N$.
There is nothing in our heuristic convergence analysis that could possibly explain
this phenomenon.
But now that we are equipped with the tools of Fourier analysis, we can
understand this phenomenon in more detail and, in the process, learn
how the excellent behavior of Newton-Cotes quadrature for periodic functions
can be recovered for non-periodic functions as well. This will lead us to the
numerical integration technique known as Clenshaw-Curtis quadrature.
Fourier-Series Convergence Analysis of the Trapezoidal Rule
Let's consider the integral of a function $f(t)$ over an interval of width $T$, which
we assume without loss of generality to start at $t = 0$. Thus we are trying to
compute
$$ I = \int_0^T f(t)\,dt. $$
The $N$-point trapezoidal-rule approximation to $I$ is
$$ I_N^{\text{trap}} = \frac{T}{N}\left\{ \frac{1}{2}\Big[ f(0) + f(T) \Big] + \sum_{n=1}^{N-1} f\left(\frac{nT}{N}\right) \right\}. \tag{2} $$
This formula is just the second box of the table in the previous section, with $a = 0$,
$b = T$, and $\Delta = \frac{T}{N}$. What we would like to understand is the $N$ dependence
of the error
$$ E_N^{\text{trap}} = \left| I - I_N^{\text{trap}} \right|. $$
To do this, recall from our discussion of Fourier analysis that our function
may be represented over the interval $[0, T]$ in the form
$$ f(t) = \sum_{m=-\infty}^{\infty} \tilde{f}_m e^{im\omega_0 t}, \qquad \omega_0 = \frac{2\pi}{T} \tag{3} $$
where the Fourier series coefficients are
$$ \tilde{f}_m = \frac{1}{T}\int_0^T f(t)e^{-im\omega_0 t}\,dt. \tag{4} $$
In particular, the integral we are trying to compute is precisely just $T$ times the
value of the $m = 0$ Fourier series coefficient:
$$ I = T\tilde{f}_0. $$
Of course, when we are doing Newton-Cotes quadrature on a function $f(t)$ we
don't know its Fourier series coefficients (if we did, we wouldn't need to be
doing quadrature in the first place), but the point is that even without knowing
the values of the $\tilde{f}_m$ we know that the Fourier-synthesized representation (3)
exists, and that is all that we need for this analysis.
We now want to insert the representation (3) into (2). Conveniently, the first
term on the RHS of (2) is precisely what we get by evaluating the Fourier series
(3) at $t = 0$.¹ For the other terms, we simply plug in equation (3) evaluated at
various different values of the argument $t$:
$$
I_N^{\text{trap}} = \frac{T}{N}\left\{ \underbrace{\frac{1}{2}\Big[ f(0) + f(T) \Big]}_{\sum_m \tilde{f}_m} + \sum_{n=1}^{N-1} \underbrace{f\left(\frac{nT}{N}\right)}_{\sum_m \tilde{f}_m e^{im\omega_0\left(\frac{nT}{N}\right)}} \right\}
$$
$$
= \frac{T}{N}\sum_{n=0}^{N-1}\left\{ \sum_{m=-\infty}^{\infty} \tilde{f}_m e^{im\omega_0\left(\frac{nT}{N}\right)} \right\}
$$
$$
= \frac{T}{N}\sum_{n=0}^{N-1}\left\{ \sum_{m=-\infty}^{\infty} \tilde{f}_m e^{2\pi imn/N} \right\} \tag{5}
$$
where I used $\omega_0 = \frac{2\pi}{T}$.
1 This is obviously true when the original function $f(t)$ is periodic with period $T$, but when
$f(0) \ne f(T)$ it is a non-trivial and convenient fact that the first term on the RHS of (2)
is precisely what we get by evaluating the Fourier series (3) at $t = 0$. This, incidentally,
is the reason for starting with a convergence analysis of the trapezoidal rule instead of the
rectangular rule; the latter can be analyzed using Fourier-series techniques as well, but the
analysis is not as nice.
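Now rearrange the sums:
$$ I_N^{\text{trap}} = T \sum_{m=-\infty}^{\infty} \tilde f_m \underbrace{\left\{ \frac{1}{N} \sum_{n=0}^{N-1} e^{2\pi i m n/N} \right\}}_{K_N(m)}. \tag{6} $$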
In the last line here we defined a function K_N(m) which has some interesting
properties:
$$ K_N(m) = \frac{1}{N} \sum_{n=0}^{N-1} e^{2\pi i m n/N} = \frac{1}{N}\Big[ 1 + \zeta + \zeta^2 + \cdots + \zeta^{N-1} \Big] \tag{7} $$
where $\zeta = e^{2\pi i m/N}$. Now, if m = 0 or m is an integer multiple of N, then ζ = 1
and the sum simply yields
$$ K_N(m) = 1 \quad \text{if } m \text{ is an integer multiple of } N. $$
On the other hand, if m is not an integer multiple of N, then ζ ≠ 1 in (7) and
we may sum the geometric series to find
$$ K_N(m) = \frac{1}{N}\left[ \frac{1 - \zeta^N}{1 - \zeta} \right] = \frac{1}{N}\left[ \frac{1 - e^{2\pi i m}}{1 - e^{2\pi i m/N}} \right] = 0 \quad \text{if } m \text{ is not an integer multiple of } N. $$
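The selection property of K_N(m) is easy to confirm numerically. The following sketch sums the defining series directly; the name `K` and the choice N = 8 are illustrative assumptions.

```python
import cmath

def K(N, m):
    """K_N(m) = (1/N) * sum_{n=0}^{N-1} exp(2*pi*i*m*n/N), summed term by term."""
    return sum(cmath.exp(2j * cmath.pi * m * n / N) for n in range(N)) / N

N = 8
for m in range(-8, 17):
    # Expect 1 when m is a multiple of N and 0 otherwise, up to roundoff.
    print(m, round(abs(K(N, m)), 12))
```

Only m = -8, 0, 8, 16 should print a magnitude of 1; every other m should print 0 to within rounding.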
Now going back to (6), we find that the sum over Fourier coefficients is now
restricted to m values that are integer multiples of N, m = pN with p ∈ Z:
$$ I_N^{\text{trap}} = T \sum_{p=-\infty}^{\infty} \tilde f_{pN}. $$
So the N-point trapezoidal-rule approximation to the integral of f picks out
precisely the Fourier-series coefficients with frequencies that are multiples
of Nω0. In particular, the p = 0 (i.e. m = 0) term here is the exact integral I that we are
seeking, and everything else is an error term:
$$ I_N^{\text{trap}} = \underbrace{T \tilde f_0}_{I} + T \sum_{p \neq 0} \tilde f_{pN}. $$
Thus the error $E_N^{\text{trap}} = |I - I_N^{\text{trap}}|$ is just T times the magnitude of the sum of the Nth, 2Nth, etc.
Fourier-series coefficients of our function:
$$ E_N^{\text{trap}} = T \left| \sum_{\substack{p=-\infty \\ p \neq 0}}^{\infty} \tilde f_{pN} \right|. \tag{8} $$
Of course, again, we don't know the numbers $\tilde f_{pN}$, so we can't compute the
RHS of this formula exactly. However, we can use the smoothness-vs.-decay
properties of Fourier analysis to estimate how rapidly it decays with N.
Convergence for continuous nonperiodic functions f (t)

First suppose f(t) is a continuous function that does not satisfy the condition
f(0) = f(T), i.e. f(t) takes different values at the endpoints of the interval over
which we are integrating. In this case, the Fourier-series coefficients we compute
using equation (4) are really the Fourier-series coefficients of a discontinuous
function $f^{\text{per}}(t)$ obtained by slicing out just the portion of f(t) between 0 and
T and periodically repeating it, as illustrated in Figure 1. [$f^{\text{per}}(t)$ is sometimes
known as the T-periodic extension of f(t).] From general Paley-Wiener analysis
we know that, for a discontinuous function, the magnitudes of the Fourier series
coefficients decay like $|\tilde f_n| \sim 1/n$, and hence looking at (8) we might expect
that the error in the trapezoidal rule should decay like 1/N.
However, this point turns out to require more careful scrutiny, because the
error formula (8) actually involves the sum of both positive and negative Fourier
coefficients. For a function f(t) that is smooth on [0, T] but whose periodic extension is discontinuous at the endpoints of the interval, it turns out² that the leading term in the expansion
of $\tilde f_n$ in inverse powers of n has opposite signs for ±n, i.e. we have
$$ \tilde f_N = +\frac{C}{N} + O\!\left(\frac{1}{N^2}\right), \qquad \tilde f_{-N} = -\frac{C}{N} + O\!\left(\frac{1}{N^2}\right), $$
and hence
$$ \tilde f_N + \tilde f_{-N} = O\!\left(\frac{1}{N^2}\right). $$

² For a proof, see J. P. Boyd, Chebyshev and Fourier Spectral Methods, Section 2.9.
[Figure 1 plots: panel (a) shows f(x) versus x; panel (b) shows g(x) versus x.]
Figure 1: (a) A non-periodic function that we might be trying to integrate
over the interval [0, 1]. (b) The actual function whose Fourier-series coefficients
we are computing when we evaluate equation (4). Note that this function is
discontinuous even though the original function was continuous.
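Hence the terms proportional to 1/N cancel out of (8), and we have
$$ E_N^{\text{trap}} = T \left| \sum_{p=1}^{\infty} \Big( \tilde f_{pN} + \tilde f_{-pN} \Big) \right| \sim \sum_{p=1}^{\infty} \frac{\#}{(pN)^2} \sim \frac{1}{N^2}. $$
So there's the 1/N² convergence of the trapezoidal rule.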
Convergence for periodic functions f (t)

On the other hand, suppose that our original function f(t) was not only smooth
but also periodic with period T. This means not only that f(0) = f(T), but
also that f′(0) = f′(T), f″(0) = f″(T), and all higher derivatives agree at the
endpoints. In this case the function whose Fourier series we are computing is
$C^\infty$, and we know from the general Paley-Wiener theorem of Fourier analysis
that its Fourier coefficients decay faster than any polynomial in n, with behavior
like $|\tilde f_N| \sim e^{-\alpha |N|}$ (for some constant α > 0) typical. In such a case we find
$$ E_N^{\text{trap}} = T \left| \sum_{p \neq 0} \tilde f_{pN} \right| \sim \sum_{p \neq 0} e^{-\alpha N |p|} $$
and the sum will be dominated by its first terms,
$$ E_N^{\text{trap}} \sim e^{-\alpha N}. $$
This explains the exponential convergence rate of the trapezoidal rule applied
to periodic functions.
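The two convergence regimes can be seen side by side in a short experiment. This is a sketch under assumed test functions (not taken from the notes): e^t on [0, 1] as the non-periodic case, and 1/(2 + cos 2πt) on [0, 1] as the smooth periodic case, whose exact integral over one period is 1/√3.

```python
import math

def trapezoid(f, T, N):
    """Trapezoidal rule of equation (2) with N subintervals on [0, T]."""
    interior = sum(f(n * T / N) for n in range(1, N))
    return (T / N) * (0.5 * (f(0.0) + f(T)) + interior)

def f_nonperiodic(t):
    return math.exp(t)                       # f(0) != f(1)

def f_periodic(t):
    return 1.0 / (2.0 + math.cos(2.0 * math.pi * t))   # smooth, period 1

EXACT_NONPERIODIC = math.e - 1.0
EXACT_PERIODIC = 1.0 / math.sqrt(3.0)

def errors(N):
    """Return (non-periodic error, periodic error) for the N-point rule."""
    e_np = abs(trapezoid(f_nonperiodic, 1.0, N) - EXACT_NONPERIODIC)
    e_p = abs(trapezoid(f_periodic, 1.0, N) - EXACT_PERIODIC)
    return e_np, e_p

for N in (4, 8, 16, 32):
    print(N, *errors(N))
```

The first error column should fall by roughly a factor of 4 per doubling of N, while the second should collapse to roundoff level by about N = 32.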
Clenshaw-Curtis Quadrature
The discussion of the previous section explains why the simple trapezoidal rule
converges so rapidly for periodic functions, and why it converges relatively slowly
for non-periodic functions. Thus, if we are lucky enough to be integrating a
periodic function over a period, all we have to do is apply the usual trapezoidal
rule and we magically get exponential convergence. But what if we have the
bad fortune of needing to integrate a non-periodic function? Are we stuck with
the slow convergence of the trapezoidal rule?
No! This is actually a general principle of mathematics, and of life more
broadly: You are not helpless. You have options. In particular, in the case
at hand we have the option to convert our non-periodic function into a periodic function, and the process of availing ourselves of this option is known as
Clenshaw-Curtis quadrature.
Constructing a periodic function g from our non-periodic function f
Clenshaw-Curtis quadrature is nicest to formulate when the interval over which
we are trying to integrate our function is [−1, 1], so we will consider that case
here.³ Thus consider the integral
$$ I = \int_{-1}^{1} f(t)\,dt. \tag{9} $$
The interval [−1, 1] happens to be precisely the range of values covered (though
not in the same order) by cos θ as θ ranges from 0 to π, so it is convenient to
use the parameterization t = cos θ and to define a new function
$$ g(\theta) \equiv f(\cos\theta). $$
Figure 2 shows some non-periodic function f(t) together with the function
g(θ) ≡ f(cos θ). Notice the following points about g(θ):
(a) It is a periodic function with period T = 2π.
(b) It is an even function, i.e. g(θ) = g(−θ).
(c) As θ ranges from 0 → π, g(θ) traces out the behavior of f(t) as t ranges
backward from 1 → −1.
(d) g(θ) knows nothing about the behavior of f(t) outside the range −1 ≤ t ≤ 1.
This can make it a little tricky to compare the two plots. For example,
g(θ) has local minima at θ = 0, π even though f(t) does not have local
minima at t = ±1.
³ If you need to integrate a function f(t) over some other interval [a, b], just define
g(u) = f(a + ((b − a)/2)(u + 1)) and apply Clenshaw-Curtis quadrature to integrate g(u) from u = −1
to 1. Don't forget the Jacobian factor (b − a)/2.
[Figure 2 plots: panel (a) shows f(x) versus x; panel (b) is described in the caption below.]
Figure 2: (a) A function f(t) that we want to integrate over the interval [−1, 1].
(b) The function g(θ) = f(cos θ). Note the following facts: (1) g(θ) is periodic
with period 2π. (2) g(θ) is an even function of θ. (3) Over the interval 0 ≤ θ ≤ π,
g(θ) reproduces the behavior of f(t). However, (4) g(θ) knows nothing
about what f(t) does outside the range −1 < t < 1, which can make it a little
tricky to compare the two plots. For example, g(θ) has local minima at θ = 0, π
even though f(t) does not have local minima at t = ±1.
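Property (a) here ensures that the function g(θ) has a Fourier-series representation involving sinusoids whose frequencies are integer multiples of a base frequency ω0 = 2π/T = 1:
$$ g(\theta) = \sum_{n=-\infty}^{\infty} \tilde g_n\, e^{i n \theta}, \qquad \tilde g_n = \frac{1}{2\pi} \int_0^{2\pi} g(\theta)\, e^{-i n \theta}\,d\theta. \tag{10} $$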
Meanwhile, property (b) ensures that this Fourier series contains only cosine
terms, i.e. it is a Fourier cosine series:
$$ g(\theta) = \frac{\tilde a_0}{2} + \sum_{n=1}^{\infty} \tilde a_n \cos n\theta \tag{11} $$
where the $\tilde a_n$ coefficients are related to the $\tilde g_n$ coefficients in (10) according to
$$ \tilde a_0 = 2 \tilde g_0, \qquad \tilde a_n = \big( \tilde g_n + \tilde g_{-n} \big) = 2 \tilde g_n $$
(where we used the fact that the Fourier-series coefficients of an even real-valued
function satisfy $\tilde g_{-n} = \tilde g_n$). The $\tilde a$ coefficients may also be written in the form
$$ \tilde a_n = \frac{1}{\pi} \int_0^{2\pi} g(\theta) \cos(n\theta)\,d\theta. \tag{12} $$
Notice something very important about these integrals: They are integrals of a
periodic function over its period. (Indeed, both g(θ) and cos(nθ) for integer n
are periodic functions over the interval [0, 2π], so the whole integrand is periodic.) That means the integral (12) can be evaluated using a simple N-point
trapezoidal rule with an error that decays exponentially with N.
The integral of f in terms of the Fourier-series coefficients

We now want to rewrite our integral (9) in terms of our newly-constructed
periodic function g. To do this, we simply change variables in (9) according to
t = cos θ:
$$ I = \int_{-1}^{1} f(t)\,dt = \int_0^{\pi} f(\cos\theta) \sin\theta\,d\theta = \int_0^{\pi} g(\theta) \sin\theta\,d\theta. \tag{13} $$
Although g(θ) is a periodic function, we don't obtain an exponentially-convergent
quadrature rule by applying the trapezoidal rule directly to (13) because the
range of integration is only over half the period of the integrand (the integral
runs from 0 to π, whereas the period of the integrand is 2π). However, something brilliant happens when we plug in the Fourier-cosine-series representation
of g(θ):
$$ I = \int_0^{\pi} g(\theta) \sin\theta\,d\theta = \int_0^{\pi} \left\{ \frac{\tilde a_0}{2} + \sum_{n=1}^{\infty} \tilde a_n \cos(n\theta) \right\} \sin\theta\,d\theta. $$
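Rearrange the sum and evaluate the integral:
$$ I = \frac{\tilde a_0}{2} \underbrace{\int_0^{\pi} \sin\theta\,d\theta}_{2} + \sum_{n=1}^{\infty} \tilde a_n \underbrace{\int_0^{\pi} \cos(n\theta) \sin\theta\,d\theta}_{\frac{1 + (-1)^n}{1 - n^2}}. $$
The integral vanishes if n is odd, and yields 2/(1 − n²) if n is even, so we find
simply
$$ I = \tilde a_0 + \sum_{\substack{n=2 \\ n \text{ even}}}^{\infty} \frac{2 \tilde a_n}{1 - n^2} $$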
which we could write in the alternative form
$$ I = \tilde a_0 + \sum_{n=1}^{\infty} \frac{2 \tilde a_{2n}}{1 - 4n^2}. \tag{14} $$
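The integral identity used in this step can be spot-checked numerically. The sketch below uses a composite Simpson rule (a stand-in quadrature chosen here for convenience; the names `simpson`, `numeric_integral` and `closed_form` are illustrative), and then applies (14) to the hand-computable case f(t) = t², for which g(θ) = cos²θ has cosine coefficients a0 = 1, a2 = 1/2 and all others zero.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with n (even) subintervals on [a, b]."""
    h = (b - a) / n
    total = f(a) + f(b)
    for j in range(1, n):
        total += (4.0 if j % 2 else 2.0) * f(a + j * h)
    return total * h / 3.0

def numeric_integral(n):
    """Integral of cos(n*theta)*sin(theta) over [0, pi], by quadrature."""
    return simpson(lambda th: math.cos(n * th) * math.sin(th), 0.0, math.pi)

def closed_form(n):
    """(1 + (-1)^n) / (1 - n^2); the n = 1 case is the limiting value 0."""
    if n == 1:
        return 0.0
    return (1 + (-1) ** n) / (1 - n * n)

for n in range(7):
    print(n, numeric_integral(n), closed_form(n))

# Worked instance of (14) for f(t) = t^2: a0 = 1, a2 = 1/2, all other a_n = 0.
a0, a2 = 1.0, 0.5
integral_from_14 = a0 + 2.0 * a2 / (1.0 - 4.0)
print(integral_from_14)   # should agree with the exact integral of t^2 over [-1, 1], namely 2/3
```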
Equation (14) expresses the integral of our function f(t) in terms of the Fourier-cosine-series coefficients of g(θ), defined by equation (12).
Moreover, the sum in (14) is rapidly convergent, because (assuming the
original function f is a smooth function) the function g(θ) is smooth and periodic, so its Fourier-series coefficients $\tilde a_n$ decay faster than any polynomial in
n. (Note that this would not be the case if we had simply constructed a brute-force periodic extension of f(t) by slicing out its behavior over [−1, 1] and
periodically repeating it; in that case the function would have discontinuities
at the endpoints of the interval and its Fourier coefficients would only decay
algebraically with n.)
Hence, in practice, we can truncate the sum in (14) at some finite number of
terms, i.e. we keep terms up to $\tilde a_N$ for some even integer N, which then defines
the Clenshaw-Curtis approximation to our integral:⁴
$$ I \approx I_N^{\text{CC}} \equiv \tilde a_0 + \sum_{n=1}^{N/2} \frac{2 \tilde a_{2n}}{1 - 4n^2}. \tag{15} $$
Two ways to proceed

Having derived (15), there are now two directions in which we could proceed to
compute numerical integrals.
• We could approximate the Fourier-coefficient integral, equation (12), using an N-point trapezoidal rule.⁵ Since this is an integral of a periodic
function over its period, the error in this procedure will decrease exponentially⁶ with N. Moreover, the trapezoidal-rule approximation to $\tilde a_n$ will
sample g(θ) = f(cos θ) at the same N points for all values of n, and (15)
then amounts to a weighted sum over those function samples; that is, it
amounts to an N-point quadrature rule.
• Alternatively, we could approximate the $\tilde a_n$ coefficients using a fast Fourier
transform and evaluate the sum (15) directly.
Both of these viewpoints are useful in practice. We will consider the first of
these possibilities in the next section, and the second possibility in our lecture
notes on discrete Fourier transforms.

⁴ Some authors weight the last term in this sum (i.e. the term involving $\tilde a_N$) with a factor of
1/2. There are theoretical reasons for doing this, but we won't bother with this complication
here; in any event that term is exponentially suppressed relative to the other terms in the
sum, so its prefactor doesn't matter much.
⁵ Actually we could use any M-point trapezoidal rule here with M not necessarily having
any particular relationship to N; in this case the error in the individual coefficients would
decay like $e^{-\# M}$ while the error in the sum (15) would decay like $e^{-\# N}$, and the overall error
would be determined by the smaller of the two.
⁶ Technically, the proper statement is that the error will decrease faster than any polynomial
in N, which still leaves open the possibility of convergence like $e^{-\sqrt{N}}$, which is faster than any
polynomial but not exponentially fast. We are only guaranteed to get exponential convergence
if the original function f(t) is analytic.
Clenshaw-Curtis Quadrature Rules
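As discussed above, Clenshaw-Curtis quadrature rules are obtained in a two-step
process.
1. We first use an N-point trapezoidal rule to approximate the integral (12):
$$ \tilde a_n = \frac{1}{\pi} \int_0^{2\pi} g(\theta) \cos(n\theta)\,d\theta \tag{16} $$
$$ \approx \sum_{m=0}^{N} C_m\, g(\theta_m) \cos(n\theta_m) \tag{17} $$
2. Next, we insert (17) into (15) and rearrange the order of the summations:
$$ I_N^{\text{CC}} = \underbrace{\sum_{m=0}^{N} C_m\, g(\theta_m)}_{\tilde a_0} + \sum_{n=1}^{N/2} \frac{2}{1 - 4n^2} \underbrace{\left[ \sum_{m=0}^{N} C_m\, g(\theta_m) \cos(2n\theta_m) \right]}_{\tilde a_{2n}} $$
$$ = \sum_{m=0}^{N} \underbrace{C_m \left\{ 1 + \sum_{n=1}^{N/2} \frac{2 \cos(2n\theta_m)}{1 - 4n^2} \right\}}_{w_m} g(\theta_m). $$
This is just an (N + 1)-point quadrature rule:
$$ I_N^{\text{CC}} = \sum_{m=0}^{N} w_m\, g(\theta_m). \tag{18} $$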
Trapezoidal rule points and weights

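To figure out the $C_m$ and $\theta_m$ quantities, first use the fact that the integrand of
(16) is periodic and even, so we may restrict the integration range to [0, π]:
$$ \tilde a_n = \frac{2}{\pi} \int_0^{\pi} g(\theta) \cos(n\theta)\,d\theta. \tag{19} $$
To approximate this using an N-subinterval trapezoidal rule, we break up the
interval [0, π] into N subintervals of width Δ = π/N and write
$$ \tilde a_n \approx \frac{2}{\pi}\,\Delta \left[ \frac{1}{2} g(0) \cos(0) + \sum_{m=1}^{N-1} g(m\Delta) \cos(mn\Delta) + \frac{1}{2} g(\pi) \cos(n\pi) \right] \tag{20} $$
$$ = \frac{1}{N}\, g(0) + \frac{2}{N} \sum_{m=1}^{N-1} g\!\left(\frac{m\pi}{N}\right) \cos\!\left(\frac{mn\pi}{N}\right) + \frac{(-1)^n}{N}\, g(\pi). \tag{21} $$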
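This identifies the $\theta_m$ and $C_m$ quantities in (17):
$$ \theta_m = \frac{m\pi}{N}, \qquad m = 0, 1, \ldots, N \tag{22a} $$
$$ C_m = \begin{cases} \dfrac{1}{N}, & m = 0 \\[4pt] \dfrac{2}{N}, & m = 1, 2, \ldots, N-1 \\[4pt] \dfrac{1}{N}, & m = N. \end{cases} \tag{22b} $$
Note that the factor (−1)ⁿ multiplying g(π) in (21) is just cos(nθ_N) = cos(nπ), which is already supplied by the cosine factor in (17), so C_N does not depend on n.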
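The construction in (17), (18) and (22) translates almost line by line into code. The sketch below builds the nodes t_m = cos θ_m and the weights w_m = C_m {1 + Σ 2 cos(2nθ_m)/(1 − 4n²)} exactly as derived above, without the optional factor of 1/2 on the last term; the function names and the test integrands are illustrative assumptions.

```python
import math

def clenshaw_curtis(N):
    """Nodes t_m = cos(m*pi/N) and weights w_m from (18) with C_m from (22b).

    N must be even. The last term of the n-sum is NOT halved here.
    """
    if N < 2 or N % 2 != 0:
        raise ValueError("N must be an even integer >= 2")
    nodes, weights = [], []
    for m in range(N + 1):
        theta = m * math.pi / N
        C = (1.0 if m in (0, N) else 2.0) / N
        bracket = 1.0 + sum(
            2.0 * math.cos(2 * n * theta) / (1.0 - 4.0 * n * n)
            for n in range(1, N // 2 + 1)
        )
        nodes.append(math.cos(theta))
        weights.append(C * bracket)
    return nodes, weights

def integrate(f, N):
    """Approximate the integral of f over [-1, 1] with the (N+1)-point rule."""
    nodes, weights = clenshaw_curtis(N)
    return sum(w * f(t) for t, w in zip(nodes, weights))

exact = math.e - 1.0 / math.e          # integral of exp(t) over [-1, 1]
for N in (4, 8, 16):
    print(N, abs(integrate(math.exp, N) - exact))
```

For a smooth integrand such as e^t the printed error should drop very quickly with N, in contrast to the 1/N² behavior of the plain trapezoidal rule on the same interval.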
Final form of the Clenshaw-Curtis quadrature rule

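Inserting (22) into (18), we find:
$$ I_N^{\text{CC}} = \sum_{m=0}^{N} w_m\, g(\theta_m) \tag{23} $$
or, in terms of the original function f(t),
$$ I_N^{\text{CC}} = \sum_{m=0}^{N} w_m\, f(t_m) \tag{24} $$
where the Clenshaw-Curtis quadrature points are
$$ t_m = \cos\!\left(\frac{m\pi}{N}\right), \qquad m = 0, 1, \ldots, N $$
and where the definition of the weights looks slightly different depending on
whether N is even or odd:
For even values of N:
$$ w_m = \begin{cases} \dfrac{1}{N^2 - 1}, & m = 0 \\[6pt] \dfrac{2}{N} \left[ 1 + \dfrac{\cos m\pi}{1 - N^2} + \displaystyle\sum_{n=1}^{N/2 - 1} \dfrac{2}{1 - 4n^2} \cos\!\left(\dfrac{2mn\pi}{N}\right) \right], & m = 1, \ldots, N-1 \\[6pt] \dfrac{1}{N^2 - 1}, & m = N. \end{cases} $$
(This expression weights the n = N/2 term of (15) by a factor of 1/2, as described in the footnote to (15).)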
For odd values of N :

Terminology for Describing Convergence Rates
Homer Reid
March 11, 2014
The purpose of this short note is to define the terms first-order convergence
and second-order convergence and contrast them to the terms linear convergence
and quadratic convergence. The terms sound similar, but they mean totally
different things!
Consider a general numerical method that computes an approximation to
some exact quantity Q. In general, our method will have some tunable parameter N that quantifies how much computational cost we are prepared to allow the
method to incur. (For example, in the case of numerical quadrature, N would
be the number of quadrature points, i.e. the number of rectangles in the rectangular rule. Alternatively, in an iterative root-finder like Newton's method,
N would be the number of iterations.) As we increase N , we obtain a more
accurate approximation of Q, but we have to do more work to get there.
Let $Q_N$ be the approximation to Q that we obtain by running our algorithm
with parameter N. For any algorithm worth anything at all we will have
$$ \lim_{N \to \infty} Q_N = Q, $$
but the question is how quickly $Q_N$ approaches Q as we increase N; this is
called the convergence rate of the algorithm. More specifically, define the error
in $Q_N$ as
$$ E_N = |Q_N - Q|. $$
Then there are a few broad classes of ways in which $E_N$ might depend on N, and
the terminology we use to describe the convergence rate differs depending on
which class we are in.
1. $E_N$ decays algebraically (i.e. as a power law) with N.
In this case we write
$$ E_N \sim \frac{1}{N^p} $$
and we say we have pth order convergence or a pth order method.
For example, rectangular-rule quadrature is a 1st order method, while
trapezoidal-rule quadrature is a 2nd order method.
2. $E_N$ decays exponentially with N.
In this case we write
$$ E_N \sim 10^{-\lambda N} $$
for some constant λ.¹ This situation is known as linear convergence. For
example, the bisection method of root-finding exhibits linear convergence.
Linear convergence is so called because the number of
correct digits grows linearly with the number of iterations. If it takes 10
iterations to get 3 good digits, it should take about 10 more iterations to
get the next 3 (for a total of 6 good digits), etc.
3. $E_N$ decays faster than exponentially with N.
An example of this type of situation is furnished by Newton's method for
root-finding, in which case the error after N iterations satisfies an equation
like
$$ \log E_N \sim -2^N. $$
This means that the error decays doubly exponentially with N,
$$ E_N \sim e^{-C\, 2^N} $$
for some constant C.
This is quadratic convergence, and it is again dramatically faster than
linear convergence.
Quadratic convergence is so called because the error at each iteration is
proportional to the square of the error at the previous iteration,
$E_{N+1} \sim E_N^2$, so the number of correct digits roughly doubles with each iteration.
For example, if it takes us 10 iterations to get 3 good digits, then a single
additional iteration should give us about 6 good digits.
¹ Note that we have here chosen 10 as the base of our exponent, but we could just as easily
have written this in the form $E_N \sim e^{-\lambda' N}$, where λ′ = λ ln 10.
More generally, if we have a situation in which x decays exponentially with y, we can
express this as $x \sim 10^{-\alpha y}$, or equivalently as $x \sim e^{-\beta y}$, or equivalently as $x \sim 2^{-\gamma y}$,
the point being that it doesn't matter what base we use for the exponent; all of the above
expressions describe x decaying exponentially with y.
Caveat
You have to be a bit careful with this terminology, because "first-order" and
"linear" are usually synonymous, as are "second-order" and "quadratic"; but
these terms mean very different things when we are talking about convergence
rates.
Indeed, linear convergence (for example) is much faster than first-order convergence. For a first-order method, to obtain one additional significant digit
(i.e. to reduce our error by a factor of 10) we must increase N by a factor of
10; that is, we must do ten times more work. To get two additional digits, we
must do one hundred times more work. Thus the cost of each extra digit grows
geometrically.
In contrast, for a method that exhibits linear convergence, the error decreases
like $10^{-\lambda N}$ for some constant λ. For example, suppose λ = 1/5. In this case, to
get one extra digit we need only increase N by 5; that is, we must do five more
operations. Not five times as many operations as we have done so far, just
five more operations, independent of how many we have done thus far. To get
another digit we only have to do another 5 operations, and so on. Thus the cost
of each extra digit is fixed, no matter how many digits we have obtained so far.
For quadratic convergence, the cost of additional digits actually shrinks as we
proceed.
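The difference between linear and quadratic convergence is easy to see on a toy problem. The following sketch (an illustration added here, with hypothetical function names) finds the root of x² − 2 with bisection on [1, 2] and with Newton's method started at x = 1, recording the error after each iteration.

```python
import math

ROOT = math.sqrt(2.0)

def bisection_errors(steps):
    """Errors of the bracket midpoint after each bisection step on [1, 2]."""
    lo, hi = 1.0, 2.0
    errors = []
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if mid * mid - 2.0 > 0.0:
            hi = mid
        else:
            lo = mid
        errors.append(abs(0.5 * (lo + hi) - ROOT))
    return errors

def newton_errors(steps):
    """Errors after each Newton iteration x <- x - f(x)/f'(x) for f(x) = x^2 - 2."""
    x = 1.0
    errors = []
    for _ in range(steps):
        x = x - (x * x - 2.0) / (2.0 * x)
        errors.append(abs(x - ROOT))
    return errors

for k, (eb, en) in enumerate(zip(bisection_errors(6), newton_errors(6)), start=1):
    print(k, eb, en)
```

The bisection column gains a fixed number of digits per step, while the Newton column roughly doubles its number of correct digits each step until it hits machine precision.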

Ewald Summation
Homer Reid
April 10, 2014
Overview
In the first half of the course, we considered the computation of the electrostatic
potential due to the 1D ionic solid pictured in Figure 1, which consists of an
infinite chain of ions with alternating charges ±Q separated by a distance D.
Figure 1: We want to compute the electrostatic potential at the point r.
Working in units such that Q = D = 1, the quantity we want to compute is
$$ \phi(\mathbf{r}) = \phi(x, y) = \sum_{n=-\infty}^{\infty} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}}. \tag{1} $$
The series (1) is perfectly well defined¹ and convergent and may be used to
compute (x, y) numerically. However, as we saw in the beginning of the course,
the convergence is slow, requiring us to sum upwards of millions of terms to get
6-digit accuracy.
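To make the slow convergence concrete, here is a small sketch that sums series (1) symmetrically over |n| ≤ M and shows how much the answer still moves when one more pair of ions is added. The function name `phi_direct` and the evaluation point (0.25, 0.5) are arbitrary illustrative choices.

```python
import math

def phi_direct(x, y, M):
    """Symmetric partial sum of series (1) over n = -M, ..., M."""
    total = 0.0
    for n in range(-M, M + 1):
        sign = -1.0 if n % 2 else 1.0      # (-1)^n, valid for negative n too
        total += sign / math.sqrt((x - n) ** 2 + y * y)
    return total

x, y = 0.25, 0.5
for M in (10, 100, 1000, 10000):
    change = abs(phi_direct(x, y, M + 1) - phi_direct(x, y, M))
    print(M, phi_direct(x, y, M), change)
```

The change from one more pair of terms shrinks only like 2/M, so each extra digit costs roughly ten times more terms than the last.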
We might also consider higher-dimensional versions of this problem. For
example, suppose instead of a 1D chain we had a two-dimensional lattice of
ions, with ions at positions (in our units) $(n_x, n_y)$ for all integer values of $n_x, n_y$.
Now the potential at a point (x, y) takes the form
$$ \phi^{\text{2D}}(x, y) = \sum_{n_x=-\infty}^{\infty} \sum_{n_y=-\infty}^{\infty} \frac{(-1)^{n_x + n_y}}{\sqrt{(x - n_x)^2 + (y - n_y)^2}}. \tag{2} $$

¹ At least as long as the evaluation point is not on an ion site, i.e. (x, y) ≠ (n, 0), which we
assume.
If we needed to sum 10⁶ terms in (1) to get 6-digit accuracy, we will need to
sum many more terms of (2) to get similar accuracy. If we need to tabulate the
potential at some large number of points throughout the unit cell 0 < x, y < 1,
the computation will start to get seriously expensive.
Ewald summation is a brilliant trick for speeding the convergence of sums
like (1) and (2). In addition to being an extremely valuable practical tool
in fields like computational electromagnetism and particle simulation, it offers
an excellent example of the power of Fourier analysis and of thinking about
numerical problems in the right domain, which, in this case, is the Fourier
domain.
The basic idea

The idea of Ewald summation is to break up the sum (1) into two pieces: a
local term containing only the contributions of ions within some distance of the
origin, and a distant term containing all the other ions:
$$ \phi(\mathbf{r}) = \phi^{\text{local}}(\mathbf{r}) + \phi^{\text{distant}}(\mathbf{r}). \tag{3} $$
When we do this, we find that the two terms have the following properties.
• $\phi^{\text{local}}$ is easily computed by summing just a few terms of the sum (1); we
say $\phi^{\text{local}}$ converges rapidly in real space.
• On the other hand, although the sum that defines $\phi^{\text{distant}}$ is slowly convergent in real space, the function $\phi^{\text{distant}}(\mathbf{r})$ is slowly varying in real
space, which means that its Fourier transform decays rapidly in Fourier
space. We will use this fact to rewrite $\phi^{\text{distant}}(\mathbf{r})$ as a sum over Fourier
components that converges rapidly in Fourier space.
Whenever you hear the phrase "slowly varying in real space" there should
be an alarm bell going ding-ding-ding! in your head and a little guy yelling
"rapidly decaying in Fourier space!" And, indeed, upon Fourier-transforming
$\phi^{\text{distant}}(x)$ we will find a function $\tilde\phi^{\text{distant}}(k)$ which decays rapidly with k and
which, by Poisson summation, will yield a series that is rapidly convergent in
Fourier space. This is the basic idea of Ewald summation.
Now, in principle it would seem easy to effect the separation in (3): We could
simply take the local term to consist of the contributions of all ions within (say)
10 sites of our evaluation point, and the distant term to account for all other
ions. However, this turns out to be the wrong approach, basically because it
destroys the very smoothness property of $\phi^{\text{distant}}$ that makes it well-localized
in Fourier space. Instead, the correct procedure is to use a smoothly varying
window function, together with its complement, to break the Coulomb potential
into short-ranged and long-ranged-but-nonsingular contributions.
An outline of the procedure

Here's a slightly more detailed outline of the procedure we will follow to evaluate
sums like (1) and (2). This basic outline remains valid for both the 1D and 2D
cases, with only some technical details changing.
1. Decomposition of the Coulomb potential into short-range and long-range terms. The Coulomb potential due to a single positively-charged ion
at a distance r is (in our units)
$$ \phi^{\text{Coulomb}}(r) = \frac{1}{r}. $$
This function of r has two key properties: For small r, it is rapidly varying as
a function of r (indeed, it is singular as r → 0). On the other hand, for large r
it is slowly varying as a function of r.
What we would like to do is to break up this potential into two separate
functions, each exhibiting one and only one of these properties. More specifically, we decompose $\phi^{\text{Coulomb}}$ into two pieces, one of which is short-ranged (it
captures the rapid variation for small r but decays rapidly for large r) and the
other of which is long-ranged but non-singular at r = 0:
$$ \phi^{\text{Coulomb}}(r) = \phi^{\text{short}}(r) + \phi^{\text{long}}(r). $$
To construct $\phi^{\text{short}}(r)$, we will multiply $\phi^{\text{Coulomb}}(r)$ by some sort of window
function W(r) that is 1 for small r (preserving the small-distance behavior of
$\phi^{\text{Coulomb}}$) but falls to 0 rapidly for large r. Given a windowing function W(r),
we define
$$ \phi^{\text{short}}(r) = W(r)\, \phi^{\text{Coulomb}}(r), \qquad \phi^{\text{long}}(r) = \Big[ 1 - W(r) \Big] \phi^{\text{Coulomb}}(r) $$
and we define the local and distant contributions to the potential, equation (3),
as
$$ \phi^{\text{local}}(\mathbf{r}) = \sum_{\mathbf{n}} \phi^{\text{short}}(|\mathbf{r} - \mathbf{r}_{\mathbf{n}}|), \qquad \phi^{\text{distant}}(\mathbf{r}) = \sum_{\mathbf{n}} \phi^{\text{long}}(|\mathbf{r} - \mathbf{r}_{\mathbf{n}}|), \tag{4} $$
where the sum over $\mathbf{n}$ runs over all ions in the crystal (in the 1D case, $\mathbf{n}$ is
just the scalar quantity n, but in 2D we have $\mathbf{n} = (n_x, n_y)$ and similarly in
3D). Note that, because $\phi^{\text{short}}(r)$ decays rapidly with r, $\phi^{\text{local}}$ only receives
noticeable contributions from ions in the immediate vicinity of the evaluation
point. On the other hand, because $\phi^{\text{long}}(r)$ is small for small r, $\phi^{\text{distant}}$ excludes
the contributions of nearby ions and only receives significant contributions from
distant ions. These two properties make $\phi^{\text{local}}$ and $\phi^{\text{distant}}$ rapidly convergent
in real space and in Fourier space, respectively.
[Figure 2 plots: panels (a) and (b), each plotted against x.]
Figure 2: (a): The window function W(r) and its complement 1 − W(r). (b):
The bare Coulomb potential $\phi^{\text{coulomb}}$ and its decomposition into short-ranged
and long-ranged contributions $\phi^{\text{short}}$ and $\phi^{\text{long}}$.
2. Evaluation of the local term in real space. Because $\phi^{\text{short}}(r)$ decays
rapidly with r, the sum over ions in $\phi^{\text{local}}$ converges quickly: we only need to
sum a few terms to get a highly accurate representation of the sum. Thus we
simply evaluate this sum as-is.
3. Evaluation of the distant term in Fourier space. On the other hand,
the sum defining $\phi^{\text{distant}}$ is slowly convergent in real space. To improve this
situation, we compute its Fourier transform and evaluate the sum using the
Poisson summation formula:
$$ \phi^{\text{distant}}(\mathbf{r}) = \sum_{\mathbf{n}} \phi^{\text{long}}(|\mathbf{r} - \mathbf{r}_{\mathbf{n}}|) \sim \sum_{\boldsymbol{\nu}} \tilde\phi^{\text{long}}(\boldsymbol{\nu}) $$
where we have written ∼ to indicate that we are omitting certain prefactors
which depend on the details.²
We will find that the Fourier-transformed version of the sum in $\phi^{\text{distant}}$
converges as rapidly as the real-space sum in $\phi^{\text{local}}$, and thus to get the total
value of φ we will only need to sum a few terms in each sum.

² Here ν is the Fourier variable conjugate to n: note that we will be Fourier-transforming
with respect to n, not with respect to r.
[Figure 3 plot: two curves plotted against x.]
Figure 3: The functions erf(x) and erfc(x).
The error function
As discussed in the previous section, the Ewald technique splits the Coulomb
potential into short-ranged and long-ranged components by introducing a window function W (r) which is 1 for small r and falls rapidly to zero for large r. In
principle, there are many different choices of W (r) that could be used; in practice, the particular choice of window function W (r) that people use for Ewald
summation is called the complementary error function. In this section we will
define this function and use it to compute the Fourier transform of $\phi^{\text{long}}$.
2.1 The functions erf and erfc
The total area under the curve of a Gaussian is
$$ \int_{-\infty}^{\infty} e^{-t^2}\,dt = \sqrt{\pi}, $$
which we could write in the alternative form
$$ \frac{2}{\sqrt{\pi}} \int_0^{\infty} e^{-t^2}\,dt = 1. $$
If we truncate the upper limit of this integral at some finite value x, we obtain
a number between 0 and 1 known as the error function erf(x):
$$ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^{x} e^{-t^2}\,dt. $$
This function is 0 at x = 0 and rises rapidly to 1 as x → ∞. If we instead want
a function that is 1 at x = 0 and falls rapidly to 0 as x → ∞, we simply take
1 − erf(x); this function is known as the complementary error function erfc(x):
$$ \operatorname{erfc}(x) = 1 - \operatorname{erf}(x) \tag{5} $$
$$ = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\,dt. \tag{6} $$
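Another way to write erf and erfc is to change variables in the integral to u = t/x,
which yields
$$ \operatorname{erf}(x) = \frac{2x}{\sqrt{\pi}} \int_0^{1} e^{-x^2 u^2}\,du, \qquad \operatorname{erfc}(x) = \frac{2x}{\sqrt{\pi}} \int_1^{\infty} e^{-x^2 u^2}\,du. \tag{7} $$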
In Ewald summation we use the complementary error function as our window
function:
W (r) = erfc(r).
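With W(r) = erfc(r), the short- and long-range pieces are erfc(r)/r and erf(r)/r. The sketch below uses Python's standard `math.erf` and `math.erfc`; the function names `phi_short` and `phi_long` are illustrative. It checks that the pieces add back up to 1/r, that the short-range piece is negligible a few units away, and that the long-range piece stays finite as r → 0 (its limit is 2/√π).

```python
import math

def phi_short(r):
    """Short-range piece W(r)/r with the window W(r) = erfc(r)."""
    return math.erfc(r) / r

def phi_long(r):
    """Long-range piece [1 - W(r)]/r = erf(r)/r, finite as r -> 0."""
    return math.erf(r) / r

for r in (1e-6, 0.5, 1.0, 2.0, 4.0, 6.0):
    print(r, phi_short(r), phi_long(r), phi_short(r) + phi_long(r) - 1.0 / r)

print(2.0 / math.sqrt(math.pi))   # small-r limit of phi_long
```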
2.2 Fourier transform of $\phi^{\text{long}}$ in 1D
The long-range contribution to the single-ion potential is
$$ \phi^{\text{long}}(r) = \phi^{\text{coulomb}}(r) \Big[ 1 - W(r) \Big] = \frac{1}{r} \operatorname{erf}(r) = \frac{2}{\sqrt{\pi}} \int_0^{1} e^{-r^2 u^2}\,du $$
where we used the representation (7) of the error function.
For our purposes we will need the Fourier transform with respect to x of the
function
$$ P_y(x) \equiv \phi^{\text{long}}\!\left( \sqrt{x^2 + y^2} \right) = \frac{2}{\sqrt{\pi}} \int_0^{1} e^{-(x^2 + y^2) u^2}\,du. $$
Note that we are here thinking of $P_y$ as a function of the single variable x, with
the value of y entering as a parameter.
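The Fourier transform of $P_y(x)$ is
$$ \tilde P_y(k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ikx} P_y(x)\,dx = \frac{1}{\pi^{3/2}} \int_{-\infty}^{\infty} \int_0^{1} e^{-y^2 u^2}\, e^{-x^2 u^2 - ikx}\,du\,dx. $$
Swap integrations and complete the square, $x^2 u^2 + ikx = u^2 \left( x + \frac{ik}{2u^2} \right)^2 + \frac{k^2}{4u^2}$:
$$ \tilde P_y(k) = \frac{1}{\pi^{3/2}} \int_0^{1} e^{-y^2 u^2 - \frac{k^2}{4u^2}} \underbrace{\int_{-\infty}^{\infty} e^{-u^2 \left( x + \frac{ik}{2u^2} \right)^2}\,dx}_{\sqrt{\pi}/u}\,du $$
$$ = \frac{1}{\pi} \int_0^{1} \frac{1}{u}\, e^{-y^2 u^2 - \frac{k^2}{4u^2}}\,du. \tag{8} $$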
Unfortunately, this integral can't be evaluated in closed form, but this is no
impediment to using the function $\tilde P_y(k)$ in practice; we just have to come up
with some way to evaluate the integral numerically. (The integral here may be
related to a family of functions known as the "exponential integral" functions
and also to the incomplete Gamma functions, but calling it by fancy names
like that doesn't much help to evaluate it.) The functions $\tilde P_1(k)$ and $\tilde P_{10}(k)$
are plotted on linear and logarithmic scales in Figure 4; note that the function
decays extremely rapidly for large k.
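One simple way to evaluate the integral in (8) numerically is a composite midpoint rule, which never samples the endpoint u = 0. This is only a sketch under that assumed choice of quadrature (the notes do not say which method is used), and the name `P_tilde` and the default of 2000 panels are illustrative. It applies only for k ≠ 0, where the integrand vanishes rapidly as u → 0.

```python
import math

def P_tilde(y, k, n=2000):
    """Midpoint-rule estimate of (1/pi) * int_0^1 exp(-y^2 u^2 - k^2/(4u^2)) / u du.

    Assumes k != 0; for k = 0 the integral diverges logarithmically.
    """
    h = 1.0 / n
    total = 0.0
    for j in range(n):
        u = (j + 0.5) * h                       # midpoint of the j-th panel
        total += math.exp(-(y * u) ** 2 - (k / (2.0 * u)) ** 2) / u
    return total * h / math.pi

for m in (1, 3, 5):
    print(m, P_tilde(1.0, m * math.pi))
```

The printed values should fall by many orders of magnitude from one odd multiple of π to the next, which is the rapid decay referred to in the text.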
As it turns out, the two-dimensional Fourier transform of $\phi^{\text{long}}$ actually
can be evaluated in closed form;³ this is one of the rare situations in which
computations are actually simpler in higher dimensions than in one dimension.
You will work this out in your problem set.

³ Well, "closed form" in the sense that it involves a definite integral with a standard name,
namely, the function erfc. Note that this is not a typo: The two-dimensional FT of $\phi^{\text{long}}$, which
involves erf in real space, involves erfc in Fourier space.
[Figure 4 plots: P_1(k) and P_10(k) versus k, panels (a) and (b).]
Figure 4: The functions $\tilde P_1(k)$ and $\tilde P_{10}(k)$ plotted on (a) linear and (b) logarithmic scales. The important point is that these functions decay extremely
rapidly for large k, which means their Poisson summation is rapidly convergent.
Ewald summation in 1D
In this section we flesh out the details of the Ewald summation procedure for a
one-dimensional chain of ions. The basic setup was outlined in the first section:
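we split the total potential into contributions from local and distant ions:
$$ \phi(x, y) = \phi^{\text{local}}(x, y) + \phi^{\text{distant}}(x, y), $$
$$ \phi^{\text{local}}(x, y) = \sum_{n \in \mathbb{Z}} (-1)^n\, \phi^{\text{short}}\!\left( \sqrt{(x - n)^2 + y^2} \right), $$
$$ \phi^{\text{distant}}(x, y) = \sum_{n \in \mathbb{Z}} (-1)^n\, \phi^{\text{long}}\!\left( \sqrt{(x - n)^2 + y^2} \right), \tag{9} $$
$$ \phi^{\text{short}}(r) = \frac{\operatorname{erfc}(r)}{r}, \qquad \phi^{\text{long}}(r) = \frac{\operatorname{erf}(r)}{r}. $$
We now separately consider the evaluation of each of these sums.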
3.1 Evaluation of $\phi^{\text{local}}$
Evaluating the first term in (9) is easy. It is done by the PhiShort function in
the Julia code included at the end of these notes. For typical values of x, y the
sum converges to 10 decimal places after summing only 6 or 8 summands. So
this term requires no more work.
3.2 Evaluation of $\phi^{\text{distant}}$
To evaluate the second term in (9), we begin by separating the sum into the
contributions of positive and negative ions:⁴
$$ \phi^{\text{distant}}(x, y) = \sum_{n} (-1)^n\, \phi^{\text{long}}\!\left( \sqrt{(x - n)^2 + y^2} \right) $$
$$ = \sum_{n \in \mathbb{Z}} \phi^{\text{long}}\!\left( \sqrt{(x - 2n)^2 + y^2} \right) - \sum_{n \in \mathbb{Z}} \phi^{\text{long}}\!\left( \sqrt{(x - (2n + 1))^2 + y^2} \right). \tag{10} $$
Let's think of the two different summands here as two different functions of n:
$$ f_+(n) \equiv \phi^{\text{long}}\!\left( \sqrt{(x - 2n)^2 + y^2} \right), \qquad f_-(n) \equiv \phi^{\text{long}}\!\left( \sqrt{(x - (2n + 1))^2 + y^2} \right). \tag{11} $$

⁴ This step is necessary if we want to make use of the Fourier transform of $\phi^{\text{long}}$ that
we computed in the previous section. An alternative way to do this calculation would
be directly to compute the Fourier transform of the sign-alternating function
$f(n) = (-1)^n \phi^{\text{long}}\!\left( \sqrt{(x - n)^2 + y^2} \right)$. You can check that such a calculation reproduces the results
derived in the text.
Then (10) takes the form

distant (x, y) =
f+ (n)
nZ
f (n)
(12)
nZ
and applying Poisson summation allows us to rewrite this in the form
$$\phi^{\text{distant}}(x,y) = 2\pi \Big[ \sum_{m\in\mathbb{Z}} \tilde f_+(2\pi m) - \sum_{m\in\mathbb{Z}} \tilde f_-(2\pi m) \Big] \tag{13}$$
where $\tilde f_+(\nu)$, $\tilde f_-(\nu)$ are the Fourier transforms of (11) with respect to the variable
n, which we think of as a continuous variable for these purposes.
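As a quick numerical sanity check of Poisson summation in the form used here, $\sum_n f(n) = 2\pi \sum_m \tilde f(2\pi m)$, the following sketch applies it to a Gaussian $f(t) = e^{-at^2}$. Under the convention $\tilde f(\omega) = \frac{1}{2\pi}\int f(t) e^{-i\omega t}\,dt$ one has $2\pi \tilde f(\omega) = \sqrt{\pi/a}\; e^{-\omega^2/(4a)}$. The Gaussian test function, the value of `a`, and the truncation limits are illustrative assumptions and are not part of the Ewald calculation itself.

```julia
# Poisson summation check for f(t) = exp(-a t^2):
#   sum_n exp(-a n^2)  ==  sqrt(pi/a) * sum_m exp(-pi^2 m^2 / a)
a = 0.5   # illustrative width parameter (an assumption)

# direct real-space sum, truncated where the terms underflow
direct_sum = sum(exp(-a*n^2) for n = -50:50)

# Fourier-space sum; only m = 0, +1, -1 contribute noticeably
poisson_sum = sqrt(pi/a) * sum(exp(-pi^2*m^2/a) for m = -5:5)

println(direct_sum, "  ", poisson_sum)
```

For this value of `a` the Fourier-space sum is already converged after the $m = 0, \pm 1$ terms, which is the same mechanism that makes the distant sum in (13) converge quickly.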
Our next task is to compute these Fourier transforms, which involves the
function $\tilde P_y(k)$ that we computed in the previous section, together with some
simple manipulations involving the properties of Fourier transforms.
Fourier transforms of $f_+(n)$ and $f_-(n)$
In the previous section we worked out the Fourier transform of the function
$P_y(x) \equiv \phi^{\text{long}}\big(\sqrt{x^2+y^2}\big)$:
$$P_y(x) = \int \tilde P_y(k)\, e^{ikx}\, dk$$
where $\tilde P_y(k)$ was defined by (8). Using this, we can write
$$f_+(n) = P_y(x-2n) = \int \tilde P_y(k)\, e^{ik(x-2n)}\, dk.$$
To make this look like the Fourier synthesis of a function of the continuous
variable n, we change variables to $\nu = -2k$ and rewrite it like this:
$$f_+(n) = \int \underbrace{\Big[ \tfrac{1}{2}\, e^{-i\nu x/2}\, \tilde P_y\big(-\tfrac{\nu}{2}\big) \Big]}_{\tilde f_+(\nu)} e^{i\nu n}\, d\nu$$
which identifies the Fourier transform of $f_+(n)$:
$$\text{Fourier transform of } f_+(n) = \tilde f_+(\nu) = \tfrac{1}{2}\, e^{-\frac{1}{2} i\nu x}\, \tilde P_y\big(-\tfrac{\nu}{2}\big). \tag{14}$$
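In exactly analogous fashion, we can obtain the Fourier transform of $f_-(n)$ in
terms of $\tilde P_y(k)$:
$$f_-(n) = P_y(x-2n-1) = \int \tilde P_y(k)\, e^{ik(x-2n-1)}\, dk = \int \underbrace{\Big[ \tfrac{1}{2}\, e^{-i\nu (x-1)/2}\, \tilde P_y\big(-\tfrac{\nu}{2}\big) \Big]}_{\tilde f_-(\nu)} e^{i\nu n}\, d\nu$$
which identifies the Fourier transform of $f_-(n)$:
$$\text{Fourier transform of } f_-(n) = \tilde f_-(\nu) = \tfrac{1}{2}\, e^{-\frac{1}{2} i\nu (x-1)}\, \tilde P_y\big(-\tfrac{\nu}{2}\big). \tag{15}$$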
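Final version of the sum for $\phi^{\text{distant}}$
Inserting (14) and (15) into (13) now yields
$$\phi^{\text{distant}}(x,y) = 2\pi \sum_{m\in\mathbb{Z}} \Big[ \tilde f_+(2\pi m) - \tilde f_-(2\pi m) \Big] = 2\pi \sum_{m\in\mathbb{Z}} e^{-i\pi m x} \Big\{ \tfrac{1}{2} \big[ 1 - e^{i\pi m} \big] \Big\} \tilde P_y(-\pi m).$$
The factor in curly braces here vanishes for even m and yields 1 for odd m (see footnote 5).
Using this fact and the fact that $\tilde P_y(k)$ is an even function of k, we obtain the
final form of $\phi^{\text{distant}}$:
$$\phi^{\text{distant}}(x,y) = 4\pi \sum_{\substack{m=1 \\ m \text{ odd}}}^{\infty} \cos(\pi m x)\, \tilde P_y(\pi m). \tag{16}$$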
5 In particular, there is no contribution to the sum from the m = 0 term, which is a
good thing because the function $\tilde P_y(k)$ is infinite at k = 0. This divergence is basically just
the divergence of the integral $\int \frac{dx}{x}$, and corresponds physically to the fact that an infinite
array of positively-charged ions gives an infinite contribution to the electrostatic potential
at the origin; the cancellation in (16) arises from the equal and opposite contribution of the
negatively-charged ions.
3.3 Putting it all together: Numerical results
Let's use Ewald summation to evaluate $\phi(x,y)$ for a couple of different evaluation points.
First consider the point (x = 0.25, y = 0.25). For this evaluation point,
the brute-force summation of equation (1) requires around 800,000 terms to
converge to a relative tolerance of $10^{-6}$:
Convergence of $\phi(x,y)$ (brute-force summation) for (x, y) = (0.25, 0.25)
n         nth term in sum             $\phi$ after n terms
1         -2.0493756046200877         0.7790515201261021
2         +1.007411529248624          1.7864630493747262
3         -0.6689289797090223         1.117534069665704
4         +0.500964129900977          1.618498199566681
5         -0.40049593015285234        1.2180022694138286
799998    +2.500006250015747e-6       1.3985540298633128
799999    -2.500003125004028e-6       1.3985515298601878
800000    +2.5000000000001218e-6      1.3985540298601877
In contrast, the Ewald-summation technique requires 5 terms in the local

sum and 1 term in the distant sum to achieve the same accuracy:
Convergence of $\phi^{\text{local}}(x,y)$ for (x, y) = (0.25, 0.25)
n    nth term in sum             $\phi^{\text{local}}$ after n terms
1    -0.3893996144303278         1.3559522726396909
2    +0.007629205898998581       1.3635814785386895
3    -3.5342137585185434e-5      1.3635461364011043
4    +2.8775270993747082e-8      1.3635461651763754
5    -3.666105641950547e-12      1.3635461651727092
6    +6.915955958575618e-17      1.3635461651727092
7    -1.8760507978155556e-22     1.3635461651727092
Convergence of $\phi^{\text{distant}}(x,y)$ for (x, y) = (0.25, 0.25)
n    nth term in sum             $\phi^{\text{distant}}$ after n terms
1    0.03500661470136366         0.03500661470136366
3    -1.304318427037022e-11      0.035006614688320475
5    -3.445626248073877e-29      0.035006614688320475
The total potential as computed by Ewald summation is
  $\phi^{\text{local}}(0.25, 0.25)$    = 1.3635461651727092
+ $\phi^{\text{distant}}(0.25, 0.25)$  = 0.0350066146883204
= $\phi(0.25, 0.25)$          = 1.3985527798610296
This number is accurate to machine precision and significantly more accurate
than the number we computed by summing 800,000 terms of the brute-force
sum, which was only correct to 6 digits.
In this example, $\phi^{\text{distant}}$ made only a small contribution to the
overall sum (although an entirely necessary one for obtaining 6 or more
significant figures). This is not always the case. For example, at the evaluation
point (x, y) = (0.25, 2.0) we find
  $\phi^{\text{local}}(0.25, 2.00)$    = 0.0006953085865214333
+ $\phi^{\text{distant}}(0.25, 2.00)$  = 0.0018971781988659407
= $\phi(0.25, 2.00)$          = 0.0025924867853873740
In this case the contribution of the distant ions is about 3× the contribution
of the local ions, and clearly both terms are necessary to get even the first
correct digit of the total potential.
Julia code for 1D Ewald summation
This code includes routines PhiLocal and PhiDistant for summing the short- and
long-ranged contributions to the potential, as well as PhiBF for brute-force
evaluation of the original sum (1).
#
# compute PhiLocal to a relative error tolerance of RelTol.
#
using SpecialFunctions   # provides erfc in current Julia versions

function PhiLocal(x,y,RelTol)
  y2=y^2
  # n=0 term
  r=sqrt(x^2 + y2)
  Sum=erfc(r)/r;
  # the nth loop iteration adds the contributions of
  # the negative ions at \pm (2n-1) and the positive
  # ions at \pm 2n
  ConvergedIters=0;
  for n=1:100000
    # negative ions at \pm (2n-1)
    tn=2*n-1;
    rp=sqrt( (x+tn)^2 + y2)
    rm=sqrt( (x-tn)^2 + y2)
    Delta1 = -(erfc(rp)/rp + erfc(rm)/rm)
    Sum += Delta1;
    println(tn," ",Sum," ",Delta1)
    # positive ions at \pm 2n (distances recomputed for the new tn)
    tn=2*n;
    rp=sqrt( (x+tn)^2 + y2)
    rm=sqrt( (x-tn)^2 + y2)
    Delta2 = erfc(rp)/rp + erfc(rm)/rm
    Sum += Delta2;
    println(tn," ",Sum," ",Delta2)
    # require two consecutive small updates before declaring convergence
    if ( abs(Delta1+Delta2) < RelTol*abs(Sum) )
      ConvergedIters+=1;
    else
      ConvergedIters=0;
    end
    if ConvergedIters==2
      break
    end
  end
  Sum
end
#
# compute PhiDistant to a relative error tolerance of RelTol
#
function PhiDistant(x,y,RelTol)
  ConvergedIters=0;
  Sum=0.0;
  # sum over odd m only, as in equation (16)
  for m=1:2:10000
    Delta=4*pi*cos(pi*m*x) * TildePyk(pi*m, y)
    Sum+=Delta
    println(m," ",Sum," ",Delta)
    if ( m>3 && ( abs(Delta) < RelTol*abs(Sum)) )
      ConvergedIters+=1;
    else
      ConvergedIters=0;
    end
    if ConvergedIters==3
      break;
    end
  end
  Sum
end
#
# the function \tilde P_y(k), computed using numerical integration
# via Simpson's rule (SimpRule is not defined in this listing and is
# assumed to be provided elsewhere)
#
function TildePyk(k,y)
  SimpRule( u -> (u==0.0 ? 0.0 : exp(-0.25*k*k/(u*u) - y*y*u*u)/u),
            0, 1, 1000
          )/pi
end
#
# brute-force evaluation of full sum up to 2N+1 terms
#
function PhiBF(x,y,N)
  y2=y^2
  # n=0 term
  r=sqrt(x^2 + y2)
  Sum=1.0/r;
  # the nth loop iteration adds the contributions of
  # the negative ions at \pm (2n-1) and the positive
  # ions at \pm 2n
  for n=1:N
    tn=2*n-1;
    rp=sqrt( (x+tn)^2 + y2)
    rm=sqrt( (x-tn)^2 + y2)
    Delta = -(1.0/rp + 1.0/rm)
    Sum+=Delta;
    tn=2*n;
    rp=sqrt( (x+tn)^2 + y2)
    rm=sqrt( (x-tn)^2 + y2)
    Delta = (1.0/rp + 1.0/rm)
    Sum+=Delta;
  end
  Sum
end
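The TildePyk routine above calls a helper named SimpRule that does not appear in this listing. The following is a minimal sketch of what such a helper might look like, assuming it implements composite Simpson's rule; the argument order `(f, a, b, N)` is inferred from the call in TildePyk, and the body is a hypothetical stand-in rather than the original routine.

```julia
# Composite Simpson's rule for the integral of f over [a, b] using N subintervals.
# Hypothetical stand-in for the SimpRule helper used by TildePyk; the signature
# (f, a, b, N) is an assumption inferred from that call.
function SimpRule(f, a, b, N)
    iseven(N) || throw(ArgumentError("N must be even for Simpson's rule"))
    h = (b - a) / N
    s = f(a) + f(b)                 # endpoint weights are 1
    for j = 1:N-1
        # interior weights alternate 4, 2, 4, ..., 4
        s += (isodd(j) ? 4 : 2) * f(a + j*h)
    end
    return s * h / 3
end

# Usage: Simpson's rule is exact for cubics, so this integral of x^3 over [0,1]
# agrees with 1/4 up to roundoff.
println(SimpRule(x -> x^3, 0, 1, 10))
```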
The FFT and its Applications
Homer Reid
April 24, 2014
Contents
1 The Discrete Fourier Transform
2 The DFT as Trigonometric Interpolation
3 The DFT as a rectangular-rule approximation to a Fourier-series coefficient
4 The DFT as a change of basis
5 The Fast Fourier Transform
6 Applications of the FFT: Signal Processing
7 Applications of the FFT: PDE Solvers
8 FFT Convolution
9 Circulant Matrices
The Discrete Fourier Transform
In our discussion of Fourier analysis thus far we have assumed that the function
we are Fourier-analyzing, $f(t)$, exists and is computable for arbitrary values of
the real number $t$. We now take up the question of what happens when we have
only discrete samples of $f$, $f_n \equiv f(n\Delta t)$ for integer $n$ and some sampling period
$\Delta t$.
If the integer $n$ runs from $-\infty$ to $\infty$ (i.e. we have evenly-spaced samples
of $f$ over the entire real line) then the tool we use is the semidiscrete Fourier
transform. In practice, this situation does not arise as often as the case in which
we have a finite number $N$ of samples of $f$, $f_n = f(n\Delta t)$ for $n = 0, 1, \dots, N-1$.
In this case, the tool we want is the discrete Fourier transform (DFT).
The DFT maps the $N$ numbers $\{f_n\}$ into a set of $N$ Fourier coefficients
$\{\tilde f_\nu\}$ (see footnote 1):
$$\{f_n\} \xrightarrow{\text{DFT}} \{\tilde f_\nu\}, \qquad \tilde f_\nu = \frac{1}{N} \sum_{n=0}^{N-1} f_n \, e^{-2\pi i \frac{n\nu}{N}}. \tag{1}$$
Note this connection to the usual Fourier transform: If $f(t)$ were a continuous
function on the interval $t \in [0, N]$, then $\tilde f_\nu$ would be the $\nu$th Fourier-series
coefficient of $f(t)$ [that is, the coefficient of the sinusoid $e^{i\nu\omega_0 t}$ in the Fourier
synthesis of $f(t)$, where $\omega_0 = \frac{2\pi}{N}$] evaluated using $N$-point rectangular-rule
quadrature.
Having exchanged our $N$ real-space coefficients $\{f_n\}$ for the $N$ Fourier coefficients $\{\tilde f_\nu\}$, it is easy to go the other way and trade the $\{\tilde f_\nu\}$ back in for the
$\{f_n\}$. This process is the inverse discrete Fourier transform:
$$\{\tilde f_\nu\} \xrightarrow{\text{IDFT}} \{f_n\}, \qquad f_n = \sum_{\nu=0}^{N-1} \tilde f_\nu \, e^{2\pi i \frac{n\nu}{N}}. \tag{2}$$
Another way to think about the IDFT is that it is just the Fourier synthesis:
whereas the DFT analyzes the dataset {fn } into constituent sinusoids, the IDFT
reassembles those sinusoids to recover the original data set {fn }.
The IDFT is periodic with period $N$. Our original data set contained
only $N$ elements, $\{f_n\}$ for $n = 0, 1, \dots, N-1$. Given such a set, it is not
meaningful to ask for values of $f_n$ outside the range $0 \le n \le N-1$. However,
the RHS of (2) is perfectly well defined for any $n$, so we might ask: What do
we get if we evaluate (2) for $n$ outside this range? The answer is that we get
back $f_{n \bmod N}$; that is, we get whichever element of the original data set is
obtained by translating $n$ by integer multiples of $N$ so that it lies in the range
$0 \le n \le N-1$. In other words, the RHS of (2) defines a sequence of numbers
that is defined for all integers $n$ and which consists of an infinite number of
periodic copies of the original data set $\{f_n\}$, repeated with period $N$.
1 Here we are thinking of $n$ as a real-space variable and using $\nu$ as its Fourier-conjugate
variable, but in the DFT literature it is also common to use symbols like $m$ or even $n'$ as the
Fourier conjugate of $n$.
Note on normalization conventions. The normalization convention used
in equations (1) and (2) comports with the convention we used earlier in our
discussion of continuous Fourier transforms. In this convention, the full normalization prefactor (be it $1/N$ for the DFT, $1/T$ for the Fourier-series coefficient of
a $T$-periodic function, or $1/(2\pi)$ for the Fourier transform of a function defined
over the entire real line) appears in the Fourier-analysis equation (the one that
takes us from real space to Fourier space), while there is no prefactor in the
Fourier-synthesis equation (the one that takes us back).
There are two other possible conventions: we could (a) put the full prefactor
in the Fourier-synthesis equation and have no prefactor in the Fourier-analysis
equation, in which case the 1/N factor in (1) would instead be present in (2);
or (b) we could share the prefactor symmetrically between the Fourier-analysis
and Fourier-synthesis equations, in which case both (1) and (2) would have
prefactors of $1/\sqrt{N}$.
Computer implementations of the FFT, including the ones in julia and
matlab, actually tend to use convention (a). That means that when you use
those systems to compute the DFT of a set of numbers $\{f_n\}$, the numbers you
get back are actually $N$ times the quantities $\{\tilde f_\nu\}$ defined by my equation (1).
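To make the normalization conventions concrete, here is a minimal sketch of equations (1) and (2) written as naive sums. The helper names `dft` and `idft` and the four-element test vector are illustrative assumptions; no FFT library is used, so the block runs in plain julia.

```julia
# Naive DFT with the 1/N prefactor of equation (1).
function dft(f)
    N = length(f)
    return [sum(f[n+1] * exp(-2im*pi*nu*n/N) for n = 0:N-1) / N for nu = 0:N-1]
end

# Naive inverse DFT, equation (2): no prefactor in the synthesis direction.
function idft(ft)
    N = length(ft)
    return [sum(ft[nu+1] * exp(2im*pi*nu*n/N) for nu = 0:N-1) for n = 0:N-1]
end

f  = [1.0, 2.0, 3.0, 4.0]
ft = dft(f)

# With the convention of (1), the nu = 0 coefficient is the mean of the data.
println(ft[1])

# A library routine that follows convention (a) would instead return N .* ft.
println(length(f) .* ft)

# Round trip: synthesis undoes analysis.
roundtrip_error = maximum(abs.(idft(ft) .- f))
println(roundtrip_error)
```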
Ways to interpret the DFT
There are multiple different ways to think of the discrete Fourier transform,
among which are these:
• It constructs a continuous function that smoothly interpolates between a
  discrete set of data values.
• It approximates the Fourier-series coefficients of a function from a set of
  function samples.
• It effects a change of basis in the vector space $\mathbb{C}^N$.
In the following sections we will consider each of these in turn.
The DFT as Trigonometric Interpolation
A good way to think of the DFT and the IDFT is that together they constitute
a technique for constructing a smooth interpolating function $f(t)$ guaranteed to
pass through all the data samples $f_n$. Indeed, if in equation (2) we replace the
discrete index $n$ with a continuous variable $t$, we obtain a function
$$f(t) = \sum_{\nu=0}^{N-1} \tilde f_\nu \, e^{2\pi i \frac{\nu t}{N}} \equiv \sum_{\nu=0}^{N-1} \tilde f_\nu \, e^{i\nu\omega_0 t}, \qquad \omega_0 = \frac{2\pi}{N}. \tag{3}$$
We have written equation (3) in a way that is intended to be suggestive of the
Fourier series, and indeed (3) is precisely just the Fourier series of a function
$f(t)$ with period $N$.
However, in comparison with the usual form of the Fourier series, equation
(3) has one big distinction: The sum is over only a finite number of frequencies,
not an infinite number of frequencies. This is because the finite resolution with
which our initial data were sampled imposes an effective upper bound on the
frequencies of the sinusoids that can be contained in the interpolating function
f (t).
An immediate example
Let's do a quick example. (The julia source for this example is listed below.)
Consider the following (randomly generated) set of 9 numbers.
$f_0$ = 0.823648    $f_1$ = 0.910357    $f_2$ = 0.164566
$f_3$ = 0.177329    $f_4$ = 0.278880    $f_5$ = 0.203477
$f_6$ = 0.042301    $f_7$ = 0.068269    $f_8$ = 0.361828
These points are plotted in the figure below.
Figure 1: A randomly-generated set of 9 data points.
The discrete Fourier transform of the set of numbers $\{f_n\}$, using the normalization convention (1), is the following set of complex numbers $\{\tilde f_\nu\}$:
$\tilde f_0$ = 0.336739 + 0.0000000i     $\tilde f_1$ = 0.141727 - 0.0655719i     $\tilde f_2$ = 0.120606 - 0.0453027i
$\tilde f_3$ = 0.005510 - 0.0507717i     $\tilde f_4$ = -0.024390 - 0.0187098i    $\tilde f_5$ = -0.024390 + 0.0187098i
$\tilde f_6$ = 0.005510 + 0.0507717i     $\tilde f_7$ = 0.120606 + 0.0453027i     $\tilde f_8$ = 0.141727 + 0.0655719i
You can easily verify at home that summing the numbers $\tilde f_\nu$ in this table,
weighted by sinusoids $e^{2\pi i n\nu/N}$ (where $N = 9$ in this case), recovers the numbers
in the previous table, i.e. we have
$$f_n = \sum_{\nu=0}^{8} \tilde f_\nu \, e^{i\nu\omega_0 n}, \qquad \omega_0 = \frac{2\pi}{N}. \tag{4}$$
Now consider the function defined by continuing equation (4) to a continuous
variable $t$; that is, just making the simple substitution $n \to t$:
$$f(t) = \sum_{\nu=0}^{8} \tilde f_\nu \, e^{i\nu\omega_0 t} \tag{5}$$
This defines a continuous function of t with the property that f (n) = fn , i.e.
f (t) is guaranteed to pass through the points in the table above whenever t
passes through an integer value. This is illustrated in the following figure.2
Figure 2: The trigonometric interpolant defined by (5).
2 In this figure and the following two figures, we are plotting just the real part of the
interpolant. The imaginary part also exists and is a wiggly function like the real part; at
integer values of t it agrees with the original data (that is, it vanishes, since the original data
were real-valued) but is nonzero elsewhere. The exception is the minimal-variation interpolant
defined below, which is a purely real-valued function (its imaginary part vanishes for all t).
Here's some julia code that reproduces the above example:
using FFTW, Random   # fft is provided by the FFTW package in current Julia versions
N=9;
# create random N-vector
# (the values drawn depend on the Julia version; older versions used srand(0))
Random.seed!(0);
f=rand(N);
# compute FFT
# (the factor of N is just a normalization)
tildef=fft(f)/N;
# define the trigonometric interpolant
# defined by the equation above
function fC(t)
  Sum=0.0 + 0.0im;
  for nu=0:N-1
    Sum+=tildef[nu+1]*exp(2*pi*im*nu*t/N);
  end
  Sum
end
Subtlety: Which trigonometric interpolant do we want?
There is a certain ambiguity present in equation (4): The left-hand side remains
unchanged if we shift the integers $\nu$ in the exponent by multiples of $N$. This is
due to the fact that
$$e^{ipN\omega_0 n} = e^{2\pi i p n} = 1 \qquad \text{for any integers } p, n$$
and thus we can multiply each term in (4) by $e^{ipN\omega_0 n}$ with impunity; this simply
corresponds to shifting $\nu \to \nu + pN$ in the exponent. For example, we could
modify (4) by writing
$$f_n = \sum_{\nu} \tilde f_\nu \, e^{i(\nu+N)\omega_0 n} \tag{6a}$$
or
$$f_n = \sum_{\nu} \tilde f_\nu \, e^{i(\nu+2N)\omega_0 n} \tag{6b}$$
or
$$f_n = \sum_{\nu} \tilde f_\nu \, e^{i(\nu-47N)\omega_0 n} \tag{6c}$$
and the equations remain valid, i.e. summing up all 9 terms on the RHS will
exactly recover the quantity on the LHS, as you can readily verify at home. More
generally, we can even shift the frequencies of different sinusoids by different
integer multiples of $N$, i.e. we can write
$$f_n = \sum_{\nu} \tilde f_\nu \, e^{i(\nu+p_\nu N)\omega_0 n} \tag{7}$$
where $\{p_\nu\}$ is any set of $N$ integers. Every sum of the form (7), including all the
examples on the RHS of (6), reproduces the original data $\{f_n\}$ when evaluated
at integer values of $n$.
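The claim that every choice of shifts reproduces the data at integer arguments can be checked directly. The sketch below uses the nine data values listed earlier and a naive evaluation of equation (1); the names `interp`, `p_none`, `p_6a`, `p_6c` and `err` are illustrative assumptions.

```julia
# The nine data values from the example above
f = [0.823648, 0.910357, 0.164566, 0.177329, 0.278880,
     0.203477, 0.042301, 0.068269, 0.361828]
N  = length(f)
w0 = 2pi/N

# DFT coefficients with the 1/N convention of equation (1), by naive summation
ft = [sum(f[n+1]*exp(-im*nu*w0*n) for n = 0:N-1)/N for nu = 0:N-1]

# Sum of the form (7), with n replaced by an arbitrary argument t;
# p[nu+1] holds the integer shift p_nu applied to frequency nu
interp(t, p) = sum(ft[nu+1]*exp(im*(nu + p[nu+1]*N)*w0*t) for nu = 0:N-1)

p_none = zeros(Int, N)    # equation (4): no shift
p_6a   = ones(Int, N)     # equation (6a): every frequency shifted by +N
p_6c   = fill(-47, N)     # equation (6c): every frequency shifted by -47N

# Largest deviation from the data over the integer arguments n = 0, ..., N-1
err(p) = maximum(abs(interp(n, p) - f[n+1]) for n = 0:N-1)
println(err(p_none), " ", err(p_6a), " ", err(p_6c))

# Between the data points the shifted sums differ: at t = 1/2 the uniform
# shift by N multiplies every term by exp(i*pi) = -1
println(interp(0.5, p_none), " ", interp(0.5, p_6a))
```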
But their respective continuations to continuous variables $t$ define very different functions. For example, here's the function defined by continuing (6a)
from $n \to t$:
Figure 3: The function $f(t) = \sum_{\nu=0}^{N-1} \tilde f_\nu \, e^{i(\nu+N)\omega_0 t}$.
and here's the function defined by continuing (6b):
Figure 4: The function $f(t) = \sum_{\nu=0}^{N-1} \tilde f_\nu \, e^{i(\nu+2N)\omega_0 t}$.
Note that, in every case, the continued function f (t) is guaranteed to run
exactly through our prescribed data points at integer values of t; however, the
behavior of the function in between those points is increasingly erratic as we
include higher and higher frequencies.
The minimal-variation interpolant
So which of the infinite number of possible interpolants do we want? For most
purposes the correct answer is the minimal-variation interpolant. This is the
function $f^{\text{min var}}(t)$ that, out of all possible interpolants of the form
$$f(t) = \sum_{\nu} \tilde f_\nu \, e^{i(\nu+p_\nu N)\omega_0 t} \tag{8}$$
minimizes the mean-square variation, which is a measure of how much the function wiggles over one full period:
$$\text{mean-square variation of } f(t) \equiv \int_0^{N} |f'(t)|^2 \, dt.$$
The minimal-variation interpolant is obtained by shifting $\nu$ values in (6) such
that the $N$ sinusoids it contains cluster as close as possible to the origin. What
this amounts to doing is keeping the sinusoids with frequencies $\nu\omega_0$ intact as-is
(i.e. unshifted) for $\nu = 0, 1, \dots, \lfloor N/2 \rfloor$, but shifting the frequencies according to
$\nu \to \nu - N$ for $\nu = \lfloor N/2 \rfloor + 1, \dots, N-1$. Thus, in our example above with
$N = 9$ the term e.g. $\tilde f_7 e^{7i\omega_0 t}$ is replaced by $\tilde f_7 e^{-2i\omega_0 t}$. (The $-2$ here comes from
$7 - 9 = -2$.) The full minimal-variation interpolant in the example above is
$$f^{\text{min var}}(t) = \tilde f_0 + \tilde f_1 e^{i\omega_0 t} + \tilde f_2 e^{2i\omega_0 t} + \tilde f_3 e^{3i\omega_0 t} + \tilde f_4 e^{4i\omega_0 t} + \tilde f_5 e^{-4i\omega_0 t} + \tilde f_6 e^{-3i\omega_0 t} + \tilde f_7 e^{-2i\omega_0 t} + \tilde f_8 e^{-i\omega_0 t}.$$
This is plotted below. Note that, compared to the interpolants we plotted above,
this function wiggles much less between the data points; intuitively it is clearly
the right function if we want a smooth interpolant through our data points.
Figure 5: The minimal-variation interpolant for our original data set $\{f_n\}$.
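Here's a modified version of the julia function fC from the above snippet
that computes the minimum-variation interpolant:
function fMinVar(t)
  Sum=0.0 + 0.0im;
  # div(N,2) is integer division; floor(N/2) would give a Float64,
  # which cannot be used to index tildef
  for nu=0:div(N,2)
    Sum+=tildef[nu+1]*exp(2*pi*im*nu*t/N);
  end
  # the remaining frequencies are shifted down by N
  for nu=div(N,2)+1 : N-1
    Sum+=tildef[nu+1]*exp(2*pi*im*(nu-N)*t/N);
  end
  Sum
end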
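To see quantitatively why the shifted frequencies give a smoother interpolant, one can compare mean-square variations. The sketch below assumes the variation is taken over one full period, $\int_0^N |f'(t)|^2\,dt$; for a sum of sinusoids with distinct integer frequency indices $k_\nu$, orthogonality over one period then gives the closed form $N \sum_\nu |\tilde f_\nu|^2 (k_\nu \omega_0)^2$. The names `variation`, `k_plain`, `k_minvar` and `k_6a` are illustrative assumptions.

```julia
# The nine data values from the example above
f = [0.823648, 0.910357, 0.164566, 0.177329, 0.278880,
     0.203477, 0.042301, 0.068269, 0.361828]
N  = length(f)
w0 = 2pi/N

# DFT coefficients with the 1/N convention of equation (1), by naive summation
ft = [sum(f[n+1]*exp(-im*nu*w0*n) for n = 0:N-1)/N for nu = 0:N-1]

# Mean-square variation over one period of sum_nu ft[nu] * exp(i*k[nu]*w0*t),
# in closed form (valid when the frequency indices k are distinct integers)
variation(k) = N * sum(abs2(ft[nu+1]) * (k[nu+1]*w0)^2 for nu = 0:N-1)

k_plain  = collect(0:N-1)                               # equation (5)
k_minvar = [nu <= N÷2 ? nu : nu - N for nu = 0:N-1]     # minimal-variation choice
k_6a     = k_plain .+ N                                 # equation (6a)

println(k_minvar)
println(variation(k_minvar), " ", variation(k_plain), " ", variation(k_6a))
```

The three printed variations increase from the minimal-variation choice to the unshifted sum (5) to the shifted sum (6a), in line with the increasingly erratic behavior described for the plots.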
The trigonometric interpolant is a periodic function

If you look at the interpolants plotted in the last several plots, you will notice
that they all satisfy the property f (9) = f (0). More generally, the trigonometric
interpolant f (t) constructed from the DFT of N data points will always be
periodic with period N . If you evaluate this function at points outside the
original set of data points, you will find periodically repeated copies of the
original data points.
Comparison to polynomial interpolation
Given a set of function samples at equally-spaced time points $f_n \equiv f(n\Delta t)$,
the FFT constructs a smooth ($C^\infty$) interpolant function $f(t)$ that is guaranteed
to run through all our samples. There is, of course, another way to construct
a smooth ($C^\infty$) interpolant running through $N$ given data points: we could
simply construct the unique polynomial of degree at most $N-1$ that does the trick.
This, however, turns out to be a very ill-behaved procedure due to the Runge
phenomenon, which we saw earlier in the course. When you have data sampled
at evenly-spaced points, trigonometric interpolation via FFT is a much better-behaved operation than polynomial interpolation (see footnote 3).
On the other hand, when we have the freedom to choose our data samples
at unevenly spaced points, then polynomial interpolation resurfaces as a well-behaved possibility. We will investigate this in our discussion of Chebyshev
polynomials.
3 An exception is spline interpolation, in which we piece together a number of low-degree
polynomial interpolants that each cover only a portion of the data set instead of trying to find
a single high-order polynomial that fits the entire set.
The DFT as a rectangular-rule approximation to a Fourier-series coefficient
Another way to think about what the DFT is doing is that it is computing a
rectangular-rule approximation to the integrals that define the Fourier-series coefficients of some function $f(t)$. This interpretation is somewhat complementary
to the previous one: in the previous section we were talking about constructing
a continuous function from a set of discrete samples, and now we're talking
about sampling a continuous function to obtain a set of discrete samples.
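In this interpretation, we consider a function $f(t)$ defined over an interval
of length $T$, with a Fourier-series representation of the form
$$f(t) = \sum_{\nu=-\infty}^{\infty} \tilde f_\nu \, e^{i\nu\omega_0 t}, \qquad \tilde f_\nu = \frac{1}{T} \int_0^T f(t)\, e^{-i\nu\omega_0 t}\, dt$$
where $\omega_0 = \frac{2\pi}{T}$. Now approximate the integral here using the simple $N$-point
rectangular-rule quadrature scheme, which samples the integrand at points $t = n\Delta$,
$n = 0, \dots, N-1$ with $\Delta = \frac{T}{N}$:
$$\tilde f_\nu \approx \frac{\Delta}{T} \sum_{n=0}^{N-1} f(n\Delta)\, e^{-i\nu\omega_0 n\Delta} = \frac{1}{N} \sum_{n=0}^{N-1} f_n \, e^{-\frac{2\pi i}{N} \nu n}. \tag{9}$$
The second line here is precisely the discrete Fourier transform of the set of
numbers $\{f(0), f(\Delta), f(2\Delta), \dots, f((N-1)\Delta)\}$, i.e. $N$ samples of $f(t)$ at evenly-spaced time points.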
Note that if $f(t)$ is actually periodic with period $T$ (and sufficiently smooth), then the simple rectangular-rule quadrature will converge exponentially rapidly with $N$, i.e. its error will
decay like $e^{-cN}$ for some constant $c > 0$, so the rectangular rule is actually not a bad quadrature scheme
to use for approximating the integrals here. Moreover (as discussed below), the
FFT will compute the entire set of coefficients $\tilde f_\nu$ for $\nu = 0, \dots, N-1$ in one
fell swoop with computation time scaling like $N \log N$, to be compared with the
$N^2$ cost scaling of a naive evaluation of those coefficients using $N$-point
quadrature for each coefficient.
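The rectangular-rule estimate (9) can be tried out on a function whose Fourier-series coefficients are known in closed form. In the sketch below, the helper name `fourier_coeff_rect`, the test function `g`, the period `T = 2` and the choice `N = 16` are illustrative assumptions. Because `g` contains only a handful of low frequencies, the 16-point rule reproduces its coefficients up to roundoff.

```julia
# N-point rectangular-rule estimate of the Fourier-series coefficient with
# index nu of a T-periodic function f, following equation (9)
function fourier_coeff_rect(f, T, nu, N)
    Delta = T / N            # sample spacing
    w0 = 2pi / T             # fundamental angular frequency
    return sum(f(n*Delta) * exp(-im*nu*w0*n*Delta) for n = 0:N-1) / N
end

T = 2.0
# Test function with known coefficients:
#   constant 3                   -> coefficient 3 at nu = 0
#   2*cos(w0*t)                  -> coefficient 1 at nu = +1 and nu = -1
#   0.5*sin(3*w0*t)              -> coefficient -0.25im at nu = 3
g(t) = 3 + 2*cos(2pi*t/T) + 0.5*sin(3*2pi*t/T)

c0 = fourier_coeff_rect(g, T, 0, 16)
c1 = fourier_coeff_rect(g, T, 1, 16)
c3 = fourier_coeff_rect(g, T, 3, 16)
println(c0, " ", c1, " ", c3)
```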
The sampling theorem; aliasing

Notice that in (9) we only get values for the first $N$ Fourier coefficients of
our function, i.e. the coefficients of the sinusoids with frequencies $\nu\omega_0$ with
$\nu = 0, 1, \dots, N-1$. But a general function $f(t)$ will have infinitely many nonzero
Fourier series coefficients. What about those? Why don't we get values for those
from the above procedure?
The answer has to do with the sampling theorem, and the discussion parallels
our discussion in the previous section of the multiple possible interpolants we
can write down to smoothly connect the dots in a given data set. The long
story short is ...
The DFT as a change of basis
Yet another interpretation of the discrete Fourier transform is that it effects a
change of basis between two different bases for the vector space $\mathbb{C}^N$.
To see this, let $\mathbb{C}^N$ be the vector space of all $N$-tuples of complex numbers
and let $Z \in \mathbb{C}^N$ be an element of this vector space. $Z$ is just an $N$-tuple of
complex numbers:
$$Z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{pmatrix}$$
and, as is true for any element of an $N$-dimensional vector space, we may write
a general element $Z$ as a weighted sum of basis vectors:
$$Z = z_1 \hat e_1 + z_2 \hat e_2 + \cdots + z_N \hat e_N \tag{10}$$
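where the basis vectors $\{\hat e_n\}$ are the usual unit vectors with a 1 in the $n$th
position and a zero everywhere else:
$$\hat e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad \hat e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \qquad \cdots, \qquad \hat e_N = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$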
The set of basis vectors $\{\hat e_n\}$ obviously forms an orthonormal basis for $\mathbb{C}^N$, in
the sense that the inner product of any two distinct basis vectors is 0, while the
inner product of a basis vector with itself is 1:
$$\langle \hat e_n, \hat e_m \rangle = \delta_{nm}$$
where the inner product of two vectors $x, y$ is just the Hermitian dot product:
$$\langle x, y \rangle = \sum_n x_n^* \, y_n. \tag{11}$$
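However, the set of vectors $\{\hat e_n\}$ is not the only orthonormal basis for $\mathbb{C}^N$; as
with any vector space, there are infinitely many distinct bases we could choose,
and one particularly convenient one is the cyclotomic basis $\{\hat v_n\}$, in which the
elements of the basis vectors are powers of $\omega = e^{2\pi i/N}$, the primitive $N$th root of
unity. (The cyclotomic basis is also known as the Fourier basis.) The elements
of this basis set are the vectors
$$\hat v_1 = \frac{1}{\sqrt N} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad \hat v_2 = \frac{1}{\sqrt N} \begin{pmatrix} 1 \\ \omega \\ \omega^2 \\ \omega^3 \\ \vdots \\ \omega^{N-1} \end{pmatrix}, \qquad \hat v_3 = \frac{1}{\sqrt N} \begin{pmatrix} 1 \\ \omega^{2} \\ \omega^{2\cdot 2} \\ \omega^{3\cdot 2} \\ \vdots \\ \omega^{(N-1)\cdot 2} \end{pmatrix}$$
and, more generally, the $n$th component of the $p$th basis vector in this set is
defined to be $v_{np} = \frac{1}{\sqrt N}\, \omega^{(n-1)(p-1)} = \frac{1}{\sqrt N}\, e^{2\pi i (n-1)(p-1)/N}$:
$$\hat v_p = \frac{1}{\sqrt N} \begin{pmatrix} 1 \\ \omega^{(p-1)} \\ \omega^{2(p-1)} \\ \omega^{3(p-1)} \\ \vdots \\ \omega^{(N-1)(p-1)} \end{pmatrix}$$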
It is easy to show that the set $\{\hat v_p\}$ is orthonormal under the inner product
(11), i.e.
$$\langle \hat v_p, \hat v_q \rangle = \delta_{pq}$$
What this means is that any element $Z \in \mathbb{C}^N$ may be written uniquely as a
weighted sum over the $\hat v_p$ vectors:
$$Z = \tilde z_1 \hat v_1 + \tilde z_2 \hat v_2 + \cdots + \tilde z_N \hat v_N \tag{12}$$
and it turns out that the $\tilde z_n$ coefficients in this expansion are related to the $z_n$
coefficients in (10) by nothing other than the discrete Fourier transform:
$$\{z_n\} \xrightarrow{\text{DFT}} \Big\{ \tfrac{1}{\sqrt N}\, \tilde z_\nu \Big\}$$
(The $\sqrt N$ factor here just corrects for the slightly different normalization we
used previously: what this equation means is that if you perform the DFT on
the $\{z_n\}$ coefficients in (10), you will get precisely the $\{\tilde z_\nu\}$ coefficients in (12),
but divided by an extra factor of $\sqrt N$.)
The interpretation of the DFT as a change of basis really hammers home
the losslessness property of Fourier analysis. Changing bases in a vector space
changes the coordinates of points but doesn't change the points; we lose zero
information by switching to a different basis.
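The change-of-basis picture can be verified numerically for a small case. The sketch below builds the Fourier basis vectors as the columns of a matrix, checks orthonormality under the Hermitian inner product, and compares the expansion coefficients with a naive DFT in the $1/N$ convention of equation (1). The size `N = 8`, the test vector `Z`, and the variable names are illustrative assumptions; the inner product is taken to conjugate its first argument.

```julia
using LinearAlgebra

N = 8
w = exp(2im*pi/N)     # primitive N-th root of unity

# Column p of V is the p-th Fourier basis vector:
# V[n,p] = w^((n-1)(p-1)) / sqrt(N)
V = [w^((n-1)*(p-1)) / sqrt(N) for n = 1:N, p = 1:N]

# Orthonormality: V' is the conjugate transpose, so V'*V should be the identity
orth_err = norm(V'*V - I)

# Coordinates of an arbitrary vector Z in the Fourier basis, equation (12)
Z = collect(1.0:N) .+ 0.5im
ztilde = V' * Z

# The same vector reassembled from its Fourier-basis coordinates
recon_err = norm(V*ztilde - Z)

# Naive DFT of Z with the 1/N convention of equation (1)
dftZ = [sum(Z[n+1]*exp(-2im*pi*nu*n/N) for n = 0:N-1)/N for nu = 0:N-1]

# The DFT equals the Fourier-basis coordinates divided by sqrt(N)
dft_err = norm(dftZ - ztilde/sqrt(N))

println(orth_err, " ", recon_err, " ", dft_err)
```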
The Fast Fourier Transform
Computational cost scaling of the naive DFT
Suppose we have $N$ numbers $\{f_n\}$ and we want to compute the discrete Fourier
transform:
$$\{f_n\} \xrightarrow{\text{DFT}} \{\tilde f_\nu\}, \qquad \tilde f_\nu = \frac{1}{N} \sum_{n=0}^{N-1} f_n \, e^{-2\pi i \frac{n\nu}{N}}. \tag{13}$$
The set $\{\tilde f_\nu\}$ contains $N$ numbers, and to calculate each one of them using
(13) we have to do a sum involving $N$ summands. Thus the cost of the whole
operation is going to scale like $N^2$:
computational cost of naive $N$-point DFT $\sim N^2$.
So, if it takes our computer 1 second to do a 1000-point DFT, it will take
100 seconds (≈ 1.7 minutes) to do a $10^4$-point DFT,
10,000 seconds (≈ 2.8 hours) to do a $10^5$-point DFT, and
1 million seconds (≈ 11.6 days) to do a $10^6$-point DFT.
In practice this means that we would be limited to running small DFT calculations.
The Fast Fourier Transform

A major reason for the ubiquity of DFT techniques in modern science and
engineering is that there exists an algorithm, the fast Fourier transform (FFT),
for carrying out the entire transformation (13) (that is, computing all $N$ DFT
coefficients) with computational cost that grows like $N \log N$, i.e.
computational cost of $N$-point FFT $\sim N \log N$.
The FFT in computer software packages

All high-level math software packages (including julia and matlab) offer built-in implementations of the FFT (see footnote 4). In julia, for example (the matlab case is
almost identical), you give the FFT a vector of numbers and you get back a
vector of numbers:
using FFTW              # provides fft in current Julia versions
F = collect(1.0:10.0);  # or any other set of 10 numbers
tildeF = fft(F);
Due to the slightly different normalizations, the entry of the tildeF vector for index $\nu$
(that is, tildeF[ν+1], since julia arrays are 1-based) will be equal to $N \tilde f_\nu$, where $\tilde f_\nu$ is defined by (1).
4 In fact, the implementation that is most widely used is known as FFTW (“the fastest
Fourier transform in the West”) and was co-authored by MIT's own Prof. Steven Johnson.
How the FFT works
But how does the FFT algorithm reduce the $N^2$ cost complexity of the naive
discrete Fourier transform to the much slower rate of $N \log N$ growth?
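The notes leave this question open at this point. One standard answer, offered here as an illustration and not as the derivation intended by the notes, is the radix-2 Cooley-Tukey recursion: an $N$-point transform (with $N$ a power of 2) is split into two $N/2$-point transforms over the even-indexed and odd-indexed samples, which are then combined with $N/2$ multiplications by "twiddle factors". Applying the split recursively gives $\log_2 N$ levels of $O(N)$ work. The function name `fft_radix2` is hypothetical, and the result is unnormalized in the sense of convention (a).

```julia
# Recursive radix-2 FFT sketch. Returns the unnormalized transform
#   out[nu+1] = sum_n f[n+1] * exp(-2*pi*i*nu*n/N),
# so divide by N to match the convention of equation (1).
function fft_radix2(f)
    N = length(f)
    ispow2(N) || throw(ArgumentError("length must be a power of 2"))
    N == 1 && return ComplexF64[f[1]]
    even = fft_radix2(f[1:2:end])    # samples n = 0, 2, 4, ...
    odd  = fft_radix2(f[2:2:end])    # samples n = 1, 3, 5, ...
    out = Vector{ComplexF64}(undef, N)
    for k = 0:N÷2-1
        tw = exp(-2im*pi*k/N) * odd[k+1]     # twiddle factor times odd part
        out[k+1]       = even[k+1] + tw
        out[k+1+N÷2]   = even[k+1] - tw
    end
    return out
end

# Usage: compare with the naive O(N^2) sum on 8 sample values
x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
naive = [sum(x[n+1]*exp(-2im*pi*nu*n/8) for n = 0:7) for nu = 0:7]
fft_err = maximum(abs.(fft_radix2(x) .- naive))
println(fft_err)
```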
Applications of the FFT: Signal Processing
MP3 and JPEG Coding
Applications of the FFT: PDE Solvers
FFT Convolution
In science and engineering problems we often need to compute discrete convolutions of the form
$$F_n = \sum_m S_m K_{n-m} \tag{14}$$
where $S_m$ is a sequence of source points and $K_{n-m}$ is a kernel function that
describes how strongly the $m$th source quantity contributes to the $n$th output
quantity.
Examples of discrete convolutions

Electrostatics of a lattice of charges
As an obvious example to keep in mind, consider our 1D ionic solid, consisting
of ions lying at points $x = (mD, 0, 0)$ for all integers $m$, but now with
each ion having a position-dependent charge; let the ion at site $m$ have a charge
$Q_m$. Then the electrostatic potential felt by the ion at site $n$ [i.e. at position
$x_n = (nD, 0, 0)$] is
$$\phi(x_n) \equiv \phi_n = \sum_{m \ne n} \frac{Q_m}{|n-m| D}$$
where we have excluded the self-term $m = n$ from the sum (because we don't
want to count the contribution of an ion to the potential it feels itself). This
is a discrete convolution of the form (14) with source function $S_m = Q_m$ and
kernel function
$$K_n = \begin{cases} 0, & n = 0 \\ \dfrac{1}{|n| D}, & n \ne 0. \end{cases}$$
Arbitrary-precision arithmetic
As we discussed in our unit on floating-point arithmetic, each individual number
stored in a computer has a finite number of digits. What if we need to do
arithmetic on numbers with thousands or millions of digits? In this case we
choose a base $b$ and represent arbitrary-precision numbers in the form
$$x = \sum_{n=0}^{N_x} x_n b^n, \qquad y = \sum_{n=0}^{N_y} y_n b^n$$
where $x_n$, $y_n$ are integers with few enough digits to be stored as individual
numbers inside our computer.
For example, if we are representing the numbers $x = 29415$ and $y = 826$ in
decimal arithmetic, we would have $b = 10$ and
$$x_4 = 2, \quad x_3 = 9, \quad x_2 = 4, \quad x_1 = 1, \quad x_0 = 5, \qquad y_2 = 8, \quad y_1 = 2, \quad y_0 = 6.$$
(Of course on a typical computer we wouldn't need arbitrary-precision arithmetic to multiply these two numbers, but it illustrates the point.)
The product of $x$ and $y$ is
$$xy \equiv z = \sum_{n=0}^{N_z} z_n b^n$$
where the base-$b$ digits of $z$ (before any carries are propagated) are
$$z_0 = x_0 y_0, \qquad z_1 = x_0 y_1 + x_1 y_0, \qquad z_2 = x_0 y_2 + x_1 y_1 + x_2 y_0$$
and in general
$$z_n = \sum_m x_m y_{n-m}.$$
This is a discrete convolution of precisely the form (14), except that we must
pack the digits xn , yn into data vectors in such a way that xn = yn = 0 for
negative values of n. In practice, this corresponds to using data vectors that
are two times longer than necessary to store the actual digits of x and y, then
zero-padding to ensure the second half of the data vector is all zeros.
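The digit-convolution picture can be checked on the worked example $x = 29415$, $y = 826$. The sketch below evaluates the convolution $z_n = \sum_m x_m y_{n-m}$ by direct summation (not by FFT) into a zero-padded output vector and then propagates carries so that every entry becomes a genuine base-$b$ digit. The function name `multiply_digits` and the least-significant-digit-first storage order are illustrative assumptions.

```julia
# Multiply two nonnegative integers stored as base-b digit vectors,
# least significant digit first
function multiply_digits(x, y, b)
    # output vector long enough to hold every digit of the product
    z = zeros(Int, length(x) + length(y))
    # discrete convolution z_n = sum_m x_m y_{n-m}
    for m = 0:length(x)-1, k = 0:length(y)-1
        z[m+k+1] += x[m+1] * y[k+1]
    end
    # propagate carries so that each entry lies in 0:b-1
    carry = 0
    for n = 1:length(z)
        total = z[n] + carry
        z[n]  = total % b
        carry = total ÷ b
    end
    return z
end

x = [5, 1, 4, 9, 2]     # digits of 29415: x_0, x_1, ..., x_4
y = [6, 2, 8]           # digits of 826:   y_0, y_1, y_2
z = multiply_digits(x, y, 10)

# Reassemble the integer from its digits and compare with ordinary multiplication
value = sum(z[n+1] * 10^n for n = 0:length(z)-1)
println(z, "  ", value, "  ", 29415 * 826)
```

For numbers with very many digits, the double loop is the $N^2$ step that FFT-based convolution replaces.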
Discrete convolution by FFT

Periodic source distribution, finite-range kernel
First consider the case in which the source distribution is periodic with
period $N$ (that is, $S_m = S_{m+N}$) while the kernel function $K_{n-m}$ is only nonzero
for $N$ values of its argument. In this case, we can write the source and kernel
sequences in Fourier-synthesized forms as follows:
$$S_m = \sum_{\nu=0}^{N-1} \tilde S_\nu \, e^{2\pi i \nu m/N}, \qquad K_n = \sum_{\nu=0}^{N-1} \tilde K_\nu \, e^{2\pi i \nu n/N}$$
Using this last equation, the quantity that enters the sum (14) may be written
$$K_{n-m} = \sum_{\nu=0}^{N-1} \tilde K_\nu \, e^{2\pi i \nu (n-m)/N}.$$
The discrete convolution then becomes
$$F_n = \sum_{m=0}^{N-1} S_m K_{n-m} = \sum_{m=0}^{N-1} \Big\{ \sum_{\nu=0}^{N-1} \tilde S_\nu \, e^{2\pi i \nu m/N} \Big\} \Big\{ \sum_{\nu'=0}^{N-1} \tilde K_{\nu'} \, e^{2\pi i (n-m)\nu'/N} \Big\}$$
Reorganize the sums:
$$F_n = \sum_{\nu=0}^{N-1} \sum_{\nu'=0}^{N-1} \tilde S_\nu \tilde K_{\nu'} \, e^{2\pi i n \nu'/N} \underbrace{\Big\{ \sum_{m=0}^{N-1} e^{2\pi i m(\nu-\nu')/N} \Big\}}_{N \delta_{\nu,\nu'}}$$
The sum in curly braces here is something we have seen many times: it vanishes
if $\nu \ne \nu'$ and evaluates to $N$ if $\nu = \nu'$. (You can remind yourself how this works
by thinking of the sum as a geometric series with ratio $e^{2\pi i(\nu-\nu')/N}$.)
The double frequency sums then collapse to a single sum:
$$F_n = N \sum_{\nu=0}^{N-1} \tilde S_\nu \tilde K_\nu \, e^{2\pi i n \nu/N}$$
Comparing this to (2), we see that we have obtained a Fourier-synthesized
version of the sequence $F_n$: more specifically, the set of numbers $\{N \tilde S_\nu \tilde K_\nu\}$ is
precisely just the discrete Fourier transform of $\{F_n\}$.
To summarize, we have obtained the following
Algorithm for discrete convolution by FFT:
To compute the discrete convolution $F_n$ of two sequences $S_n$ and $K_n$, we
1. Compute the discrete Fourier transforms of $\{S_n\}$ and $\{K_n\}$:
   $$\{S_n\} \xrightarrow{\text{DFT}} \{\tilde S_\nu\}, \qquad \{K_n\} \xrightarrow{\text{DFT}} \{\tilde K_\nu\}$$
2. Multiply (componentwise) the sequences $\{\tilde S_\nu\}$ and $\{\tilde K_\nu\}$ to obtain the
   discrete Fourier transform of $\{F_n\}$:
   $$\tilde F_\nu = N \tilde S_\nu \tilde K_\nu, \qquad \nu = 0, 1, \dots, N-1$$
3. Finally, compute the inverse DFT of $\{\tilde F_\nu\}$:
   $$\{\tilde F_\nu\} \xrightarrow{\text{IDFT}} \{F_n\}.$$
(The inverse DFT is evaluated in high-level languages like julia via the function
ifft, which behaves similarly to the fft function.)
Using FFT techniques, steps (1) and (3) here have cost scaling like $N \log N$,
while step (2) has cost scaling like $N$, so the overall algorithm has cost scaling
like $N \log N$, as compared to the $N^2$ scaling of a brute-force evaluation of the
sum (14).
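The three steps above can be checked on a small periodic example. To keep the sketch self-contained it uses naive $O(N^2)$ transforms in the $1/N$ convention of equations (1) and (2) in place of a library FFT, so it demonstrates the correctness of the algorithm, including the factor of $N$ in step 2, but not its speed. The helper names `dft`, `idft`, `conv_dft` and `conv_direct`, and the test sequences, are illustrative assumptions; kernel indices are taken modulo $N$.

```julia
# Naive DFT and inverse DFT in the conventions of equations (1) and (2)
function dft(f)
    N = length(f)
    return [sum(f[n+1]*exp(-2im*pi*nu*n/N) for n = 0:N-1)/N for nu = 0:N-1]
end
function idft(ft)
    N = length(ft)
    return [sum(ft[nu+1]*exp(2im*pi*nu*n/N) for nu = 0:N-1) for n = 0:N-1]
end

# Convolution via the three-step algorithm
function conv_dft(S, K)
    N  = length(S)
    St = dft(S)               # step 1
    Kt = dft(K)
    Ft = N .* St .* Kt        # step 2: note the factor of N
    return idft(Ft)           # step 3
end

# Brute-force periodic convolution F_n = sum_m S_m K_{n-m}, indices mod N
function conv_direct(S, K)
    N = length(S)
    return [sum(S[m+1]*K[mod(n-m, N)+1] for m = 0:N-1) for n = 0:N-1]
end

S = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
K = [1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
conv_err = maximum(abs.(conv_dft(S, K) .- conv_direct(S, K)))
println(conv_err)
```

With a library FFT that follows convention (a), the unnormalized forward transforms and the $1/N$ in the inverse transform together account for the factor of $N$, so the product of the two forward transforms can be passed to the inverse transform directly.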
Periodic source distribution, infinite-range kernel
Finite source distribution
Zero padding
Circulant Matrices

Fourier Analysis
Homer Reid
April 8, 2014
Contents
1 Fourier Analysis
2 The Fourier transform
3 Examples of Fourier transforms
4 The smoothness of $f(t)$ and the decay of $\tilde f(\omega)$
5 Fourier series
6 Fourier analysis is a lossless process: Parseval's theorem
7 Poisson summation
8 Fourier analysis and convolution
9 Higher-Dimensional Fourier Transforms
A Exponential Sums
B Gaussian Integrals
Fourier Analysis
Recall that the verb "analyze" means to decompose into constituent pieces. Fourier analysis is the process of decomposing functions into constituent pieces which vary at definite rates; that is, into sinusoids.
Some functions are easy to Fourier-analyze. For example,
$$f(t) = 3\cos 2t + 19 \sin 4t - 0.14 \cos 7t
= \begin{cases}
\phantom{+}3 \times \text{a sinusoid with angular frequency 2} \\
+19 \times \text{a sinusoid with angular frequency 4} \\
-0.14 \times \text{a sinusoid with angular frequency 7}
\end{cases}$$
That's it! We have Fourier-analyzed the function $f$.
On the other hand, what about a function like
$$f(t) = e^{-\alpha|t|} \qquad \text{or} \qquad f(t) = \begin{cases} 1, & |t| < 1 \\ 0, & |t| > 1 \end{cases}$$
How do we break functions like this up into pieces that vary with fixed frequencies?
The fourfold way [1]

The process of Fourier analysis takes slightly different forms (and goes by slightly different names) depending on (a) whether we have access to values of the function $f(t)$ for all values of $t \in [-\infty, \infty]$ or only values within some finite interval $t \in [-\frac{T}{2}, \frac{T}{2}]$, and (b) whether we can query $f(t)$ for its value at arbitrary times $t$ (i.e. we have a continuous function $f(t)$) or instead we have just discrete samples of the function $f(n\Delta t)$ at evenly spaced time points. The following table summarizes the terminology used in the various cases.

|                                                  | Continuous function $f(t)$ | Discrete samples $f_n \equiv f(n\Delta t)$, $n \in \mathbb{Z}$ |
| Infinite domain, $-\infty < t < \infty$          | Fourier transform          | Semidiscrete Fourier transform |
| Finite domain, $-\frac{T}{2} < t < \frac{T}{2}$  | Fourier series             | Discrete Fourier transform     |

Table 1: The fourfold way: The name of the process used to Fourier-analyze a function $f(t)$ depends on whether (a) $f(t)$ is defined for all time or only within a finite window, and (b) whether we have access to $f(t)$ for all real values of $t$ or just discrete samples at evenly spaced points $n\Delta t$.

Footnote 1: I borrowed the idea of a "fourfold way" here from Professor Laurent Demanet.
The Fourier transform
The first entry in the fourfold way is the Fourier transform. This is what we use when values of the function $f(t)$ we are analyzing are available for arbitrary points $t$ on the entire real line. In this case, the Fourier transform of $f(t)$ is defined to be the function [2]
$$\tilde f(\omega) \equiv \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\omega t} f(t) \, dt. \qquad (1)$$
$\tilde f(\omega)$ is a function of frequency that tells us how strongly the complex exponential with frequency $\omega$ is represented in $f(t)$.
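Using this formula, we can immediately answer one of the questions we posed at the outset: How do we decompose a function like $E_\alpha(t) \equiv e^{-\alpha|t|}$ into sinusoids? The answer is to plug $E_\alpha$ into (1):
$$\begin{aligned}
\tilde E_\alpha(\omega) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\omega t} e^{-\alpha|t|} \, dt \\
&= \frac{1}{2\pi} \left[ \int_{-\infty}^{0} e^{(-i\omega + \alpha)t} \, dt + \int_{0}^{\infty} e^{(-i\omega - \alpha)t} \, dt \right] \\
&= \frac{1}{2\pi} \left[ \frac{1}{\alpha - i\omega} + \frac{1}{\alpha + i\omega} \right] \\
&= \frac{\alpha}{\pi(\alpha^2 + \omega^2)}.
\end{aligned}$$
What this means is this: The function $e^{-\alpha|t|}$ may be reconstructed by summing sinusoids with all possible frequencies $\omega$. The amplitude of the frequency-$\omega$ sinusoid in this sum decays, for large $\omega$, like $1/\omega^2$. Note that the threshold defining "large $\omega$" is dependent on $\alpha$: The larger the value of $\alpha$ (i.e. the more rapidly decaying the original exponential) the more sinusoids we have to add with appreciable amplitudes to recover $f(t)$. This is an example of a general phenomenon that we will discuss in the next section.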
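The closed form $\alpha / [\pi(\alpha^2 + \omega^2)]$ can be sanity-checked by evaluating the integral in (1) numerically. The sketch below is only a rough check, not part of the notes: it assumes a midpoint rule on a truncated interval (the cutoff `t_max` and the step count are arbitrary choices), and it uses the evenness of $e^{-\alpha|t|}$ to reduce the transform to $\frac{1}{\pi}\int_0^\infty \cos(\omega t)\, e^{-\alpha t}\, dt$. The function names are hypothetical.

```python
import math

def lorentzian_ft_numeric(omega, alpha, steps=200_000):
    """Midpoint-rule estimate of (1/2pi) * integral of e^{-i omega t} e^{-alpha |t|} dt.

    Because the integrand's t-dependence is even, this equals
    (1/pi) * integral_0^inf cos(omega t) e^{-alpha t} dt.
    """
    t_max = 40.0 / alpha          # truncate where e^{-alpha t} is about e^{-40}
    h = t_max / steps
    total = 0.0
    for j in range(steps):
        t = (j + 0.5) * h
        total += math.cos(omega * t) * math.exp(-alpha * t)
    return total * h / math.pi

def lorentzian_ft_exact(omega, alpha):
    """Closed form derived in the text."""
    return alpha / (math.pi * (alpha**2 + omega**2))

for omega in (0.0, 1.0, 5.0):
    print(omega, lorentzian_ft_numeric(omega, 2.0), lorentzian_ft_exact(omega, 2.0))
```

The two columns should agree to many digits, and the printed values fall off roughly like $1/\omega^2$ once $\omega$ is well above $\alpha$.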
Linear vs. angular frequency

A quick comment regarding terminology: Given a sinusoid like $\sin \omega t$ or $e^{i\omega t}$, strictly speaking we should refer to $\omega$ as the angular frequency. The frequency is $\nu = \frac{\omega}{2\pi}$. $\nu$ is the frequency with which the underlying process repeats itself, while $\omega$ is the frequency with which the phase angle in the sinusoid accumulates one radian worth of phase. (The frequency $\nu$ is sometimes called the linear frequency to distinguish it from the angular frequency.) Linear frequency is measured in units of Hertz (1 Hertz = once per second), while angular frequency is measured in units of radians per second. For example, if we have a pendulum that takes 4 seconds to complete one full cycle, then the linear frequency is $\nu = \frac{1}{4} = 0.25$ Hz and the angular frequency is $\omega = \frac{2\pi}{4} \approx 1.57$ rad/s.

Footnote 2: The theory is nicest if we restrict $f(t)$ to satisfy $\int_{-\infty}^{\infty} |f(t)| \, dt < \infty$, in which case we say $f$ is contained in the function space $L^1(\mathbb{R})$. Fourier transforms can be defined for more general functions, and in some places we will do this without being particularly rigorous about it, but you should be aware that the proper justification of (1) for non-$L^1$ functions involves a bit of a foray into real analysis.
Terminology of domains
In the above discussion, the function we started out with was $f(t)$, and the Fourier-transformed function was $\tilde f(\omega)$. In other words, the original independent variable was the time $t$, and the transformed variable was the frequency $\omega$. We think of the function $f(t)$ as existing in the time domain, while $\tilde f(\omega)$ exists in the frequency domain.
But Fourier analysis is also useful in situations where the original independent variable is something other than time. For example, it may be the position
x, in which case the Fourier-transformed variable is the spatial frequency k. (k
in this case is sometimes called the wavenumber.) In this case it would be
a little confusing to say that f (x) exists in the time domain. Instead, various alternative terminologies have arisen for labeling the two spaces in which
functions live before and after Fourier transformation.
| Field               | Physical variable           | Physical domain | Fourier variable                       | Fourier domain   |
| Signal processing   | time $t$                    | time domain     | frequency $\omega$                     | frequency domain |
| Solid state physics | lattice vector $\mathbf{L}$ | real space      | reciprocal lattice vector $\mathbf{G}$ | reciprocal space |
| Optics              | position $x$                | real space      | wavenumber $k$                         | $k$ space        |
| Quantum mechanics   | position $x$                | position space  | momentum $p$                           | momentum space   |
As indicated in this table, we will refer collectively to the "before" domain and variable as the physical domain and the physical variable, and we will refer to the "after" domain and variable as the Fourier domain and the Fourier variable. [3]
Dimensional analysis
In physics and engineering problems it's important to keep in mind that the functions $f(t)$ and $\tilde f(\omega)$ have different units. Indeed, looking at (1) we see that the RHS contains a $dt$ factor that is not present on the LHS, and therefore
$$\text{units of } \tilde f = \text{units of } f \cdot \text{time} = \frac{\text{units of } f}{\text{frequency}}.$$
Thus, if $f(t)$ has units of volts, then $\tilde f(\omega)$ has units of volts $\cdot$ seconds, or volts per Hertz.

Footnote 3: The term "physical variable" is something of a misnomer since Fourier variables like frequency and spatial wavenumber are certainly physical quantities, but whaddya gonna do.
Properties of Fourier Transforms

There are several properties of the Fourier transform that follow immediately
from the definition.
1. Fourier transform of a derivative. If, for a given function $f(x)$, we know the Fourier transform $\tilde f(k)$, then it's easy to compute the Fourier transform of derivatives of $f(x)$. The easiest way to get at this is to write down the Fourier-synthesized version of $f$:
$$f(x) = \int_{-\infty}^{\infty} \tilde f(k) e^{ikx} \, dk$$
and differentiate both sides with respect to $x$:
$$\frac{d}{dx} f(x) = \int_{-\infty}^{\infty} ik \tilde f(k) e^{ikx} \, dk.$$
The RHS here is the inverse Fourier transform of the Fourier-space function $ik \tilde f(k)$, so we identify this function as the Fourier transform of $\frac{df}{dx}$, i.e. we conclude that
$$\text{FT}\big[f(x)\big] = \tilde f(k) \quad \Longrightarrow \quad \text{FT}\left[\frac{df}{dx}\right] = ik \tilde f(k). \qquad (2)$$
This game can be repeated as many times as we like; for example,
$$\text{FT}\big[f(x)\big] = \tilde f(k) \quad \Longrightarrow \quad \text{FT}\left[\frac{d^2 f}{dx^2}\right] = -k^2 \tilde f(k). \qquad (3)$$
2. Derivative of a Fourier transform. We can also play basically the same trick in the opposite direction. Write down the defining equation of the Fourier transform,
$$\tilde f(k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(x) e^{-ikx} \, dx,$$
and differentiate both sides with respect to $k$:
$$\frac{d}{dk} \tilde f(k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} (-ix) f(x) e^{-ikx} \, dx.$$
What this tells us is that $\frac{d}{dk} \tilde f$ is something like the Fourier transform of the function $x f(x)$:
$$\text{FT}\big[f(x)\big] = \tilde f(k) \quad \Longrightarrow \quad \text{FT}\big[x f(x)\big] = i \frac{d}{dk} \tilde f(k). \qquad (4)$$
Note that this statement is the dual of (3).

3. Fourier transforms of real-valued functions. If $f(x)$ is a real-valued function, then the information contained in $\tilde f(k)$ for $k < 0$ is redundant; values of $\tilde f(k)$ on the negative $k$ axis may be recovered from knowledge of $\tilde f$ on the positive $k$ axis.
To see the precise relationship, write down the defining equation of the Fourier transform with negative argument $-k$:
$$\tilde f(-k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(x) e^{+ikx} \, dx.$$
Since $f(x)$ is real-valued, we can write the RHS in the form
$$\tilde f(-k) = \left[ \frac{1}{2\pi} \int_{-\infty}^{\infty} f(x) e^{-ikx} \, dx \right]^* = \tilde f^*(k).$$
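As a quick worked check of the derivative property (2) (this check is an illustration added here, not part of the original derivation), take the function $E_\alpha(x) = e^{-\alpha|x|}$ whose transform $\tilde E_\alpha(k) = \alpha/[\pi(\alpha^2+k^2)]$ was computed earlier. Away from the origin its derivative is $+\alpha e^{\alpha x}$ for $x<0$ and $-\alpha e^{-\alpha x}$ for $x>0$, so transforming the derivative directly gives

```latex
\mathrm{FT}\!\left[\frac{dE_\alpha}{dx}\right]
 = \frac{1}{2\pi}\left[\int_{-\infty}^{0} \alpha\, e^{(\alpha - ik)x}\,dx
   \;-\; \int_{0}^{\infty} \alpha\, e^{-(\alpha + ik)x}\,dx\right]
 = \frac{\alpha}{2\pi}\left[\frac{1}{\alpha - ik} - \frac{1}{\alpha + ik}\right]
 = \frac{ik\,\alpha}{\pi(\alpha^2 + k^2)}
 = ik\,\tilde E_\alpha(k).
```

This agrees with (2). The same example also illustrates property 3: $E_\alpha$ is real, and $\tilde E_\alpha(-k) = \tilde E_\alpha(k) = \tilde E_\alpha^*(k)$.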

Examples of Fourier transforms
Lorentzians
We already saw our first example of a Fourier transform:
$$E_\alpha(x) = e^{-\alpha|x|} \quad \Longrightarrow \quad \tilde E_\alpha(k) = \frac{\alpha}{\pi(\alpha^2 + k^2)}. \qquad (5)$$
Both $E_\alpha(x)$ and $\tilde E_\alpha(k)$ are pulse functions: they are maximal in the vicinity of the origin, and decay to zero as their respective arguments go to infinity. (These functions are known as Lorentzians.) The width of the two pulses depends on $\alpha$, in inverse ways: the larger the value of $\alpha$, the narrower the pulse in real space and the wider the pulse in Fourier space.
To quantify this a little further, define the full width at half maximum (FWHM) of a pulse as the width between the two points at which the pulse has fallen to 1/2 its peak value. The function $E_\alpha(x)$ has peak value of 1 (at $x = 0$) and falls to 1/2 at $x = \pm \ln 2/\alpha$, so
$$\text{FWHM}(E_\alpha) = \frac{2 \ln 2}{\alpha}. \qquad (6)$$
On the other hand, the function $\tilde E_\alpha$ has peak value $\frac{1}{\pi\alpha}$ (at $k = 0$) and falls to half this value at the points $k = \pm\alpha$, so
$$\text{FWHM}(\tilde E_\alpha) = 2\alpha. \qquad (7)$$
Combining (6) and (7), we have
$$\text{FWHM}(E_\alpha) \cdot \text{FWHM}(\tilde E_\alpha) = 4 \ln 2. \qquad (8)$$
Notice that equation (8) is independent of $\alpha$; the same statement holds for all functions in the family $\{E_\alpha(t)\}$ regardless of the value of $\alpha$.
Gaussians
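Let $G_\sigma(x)$ be a Gaussian of width $\sigma$:
$$G_\sigma(x) \equiv e^{-x^2/\sigma^2}.$$
The Fourier transform of $G_\sigma$ is
$$\tilde G_\sigma(k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ikx} G_\sigma(x) \, dx = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ikx - x^2/\sigma^2} \, dx.$$
Complete the square:
$$\frac{x^2}{\sigma^2} + ikx = \frac{1}{\sigma^2} \left( x + \frac{ik\sigma^2}{2} \right)^2 + \frac{k^2\sigma^2}{4},$$
so that
$$\tilde G_\sigma(k) = \frac{1}{2\pi} e^{-k^2\sigma^2/4} \underbrace{\int_{-\infty}^{\infty} e^{-\frac{1}{\sigma^2}\left(x + \frac{ik\sigma^2}{2}\right)^2} dx}_{\sqrt{\pi}\,\sigma} = \frac{\sigma}{2\sqrt{\pi}} e^{-k^2\sigma^2/4}.$$
Aside from the annoying prefactor [4], the important point here is that $\tilde G_\sigma(k)$ is again a Gaussian in $k$-space, but with a width inversely proportional to that of the original Gaussian:
$$\tilde G_\sigma(k) \propto e^{-k^2/\tilde\sigma^2} = G_{\tilde\sigma}(k) \qquad \text{where} \qquad \tilde\sigma \equiv \frac{2}{\sigma}.$$
The FWHM of a Gaussian $G_\sigma(x)$ is $2\sqrt{\ln 2}\,\sigma$, and hence for Gaussians the equivalent of (8) is
$$\text{FWHM}(G_\sigma) \cdot \text{FWHM}(\tilde G_\sigma) = 2\sqrt{\ln 2}\,\sigma \cdot \frac{4\sqrt{\ln 2}}{\sigma} = 8 \ln 2. \qquad (9)$$
Again, this is independent of $\sigma$: it holds for the entire family of Gaussian pulses.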
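The two width products, $4\ln 2$ for the Lorentzian pair and $8 \ln 2$ for the Gaussian pair, can be checked numerically by locating the half-maximum points with bisection. The sketch below assumes the closed-form transforms derived in the text and an even pulse that decreases monotonically away from its peak at the origin; the helper names are hypothetical.

```python
import math

def fwhm(pulse, x_hi=1.0e6, tol=1.0e-12):
    """Full width at half maximum of an even pulse peaked at 0 and decreasing for x > 0."""
    half = 0.5 * pulse(0.0)
    lo, hi = 0.0, x_hi
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if pulse(mid) > half:
            lo = mid          # still above half maximum: move right
        else:
            hi = mid          # at or below half maximum: move left
    return lo + hi            # twice the midpoint of the final bracket

def lorentzian_width_product(alpha):
    physical = lambda x: math.exp(-alpha * abs(x))
    fourier = lambda k: alpha / (math.pi * (alpha**2 + k**2))
    return fwhm(physical) * fwhm(fourier)

def gaussian_width_product(sigma):
    physical = lambda x: math.exp(-x**2 / sigma**2)
    fourier = lambda k: sigma / (2.0 * math.sqrt(math.pi)) * math.exp(-k**2 * sigma**2 / 4.0)
    return fwhm(physical) * fwhm(fourier)

for width in (0.5, 1.0, 3.0):
    print(width, lorentzian_width_product(width), gaussian_width_product(width))
```

Each row should show the same two numbers regardless of the pulse width, which is the content of (8) and (9).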
The Heisenberg Uncertainty Principle

Both of the previous examples have illustrated a general phenomenon: The narrower we make a pulse in physical space, the wider that pulse is in Fourier space. The precise details depend on the particular shape of the pulse, so that different families of pulses exhibit different versions of the relations (8) and (9), but the general principle is the same.
Equations like (8) and (9) are sometimes known as uncertainty relations due to a phenomenon in quantum mechanics known as the Heisenberg uncertainty principle. In quantum mechanics, the position and momentum of a particle are Fourier-conjugate variables, which means that the more precisely we try to pin down the particle's position in space (i.e. the narrower we make the particle's wavefunction in real space) the less accurately we can resolve its momentum (i.e. the wider the particle's wavefunction in momentum space). The precise statement of the uncertainty principle is
$$(\Delta x)(\Delta p) > \frac{\hbar}{2} \approx 10^{-34} \text{ kg} \cdot \text{m}^2/\text{s},$$
so, for example, if we have an electron (mass $\approx 10^{-30}$ kg) and we try to resolve its position to within $\Delta x \approx 10$ nm, then we can't pin down its momentum to better than $\Delta p \approx 10^{-26}$ kg$\cdot$m/s, which corresponds to a velocity uncertainty of $\Delta v = \Delta p/m \approx 10^{4}$ m/s! This is a huge uncertainty compared to the spatial resolution we are trying to hit.

Footnote 4: It is possible to play games like multiplying $G_\sigma(x)$ by a certain constant prefactor to ensure that $\tilde G_\sigma(k)$ comes out with a nicer prefactor or even a symmetric prefactor (i.e. the same prefactor as $G_\sigma$), but we won't bother.
Fourier transforms of non-pulse functions

Both the Lorentzian and the Gaussian (and their Fourier transforms) are pulse functions: they are localized near zero and decay to zero for large arguments. What happens if we try to take the Fourier transform of a non-pulse function, for example a function like $f(x) = 1$, or $f(x) = x$, or $f(x) = x^2$?
One way to get at the answer is to consider the Fourier transform of the Lorentzian (5) in the limit $\alpha \to 0$. In this case, the real-space function approaches simply the constant value 1, i.e.
$$\lim_{\alpha \to 0} E_\alpha(x) = 1.$$
On the other hand, as $\alpha \to 0$ the Fourier-space function $\tilde E_\alpha(k)$ changes in two ways: (1) its width gets narrower (recall that its FWHM was $2\alpha$) and (2) its height gets taller [indeed, its value at the origin is $\tilde E_\alpha(0) = \frac{1}{\pi\alpha}$]. A limiting process in which a function gets infinitely narrow and infinitely tall sounds like the kind of procedure that defines a Dirac delta function, and indeed it's easy to show that
$$\lim_{\alpha \to 0} \tilde E_\alpha(k) = \lim_{\alpha \to 0} \frac{\alpha}{\pi(\alpha^2 + k^2)} = \delta(k).$$
Thus we have the Fourier-transform pair
$$f(x) \equiv 1 \quad \Longrightarrow \quad \tilde f(k) = \delta(k). \qquad (10)$$
This actually makes sense, if you think about it: The function $f(x) \equiv 1$ already is a sinusoid, namely, a sinusoid with zero frequency. To synthesize this function as a sum of sinusoids, we want to set the coefficients of all sinusoids to zero except the single sinusoid with frequency $k = 0$.
Armed with equation (10) and the derivative identity (4), we can now compute the Fourier transform of functions like $f(x) = x$ or $f(x) = x^2$:
$$f(x) = x \quad \Longrightarrow \quad \tilde f(k) = i \delta'(k) \qquad (11)$$
$$f(x) = x^2 \quad \Longrightarrow \quad \tilde f(k) = -\delta''(k) \qquad (12)$$
where e.g. $\delta'(k)$ is the derivative of the Dirac delta function, which is defined using integration by parts:
$$\int f(u) \delta'(u) \, du = -\int f'(u) \delta(u) \, du = -f'(0).$$
In other words, the object $\delta'$ should be thought of as a gadget similar to $\delta$, except that when integrated against a function $f$ it pulls out minus the derivative of $f$ at the origin, not the value of $f$ like the usual $\delta$ function would do.
As anticipated in Footnote 2 above, Fourier transforms like (10), (11), and (12) are not very nice functions; indeed, they are not even functions at all, but instead are only distributions. [5] This is because the real-space functions $f(x) \in \{1, x, x^2\}$ are not contained in the function space $L^1$, i.e. they do not satisfy $\int_{-\infty}^{\infty} |f(x)| \, dx < \infty$. It is nonetheless convenient to use equations like (12) in a sort of operational sense, but you should be aware that these formal manipulations are sweeping some mathematical subtleties under the rug.

Footnote 5: What this means in essence is that objects like $\delta(k)$ and $\delta'(k)$ are meaningless in isolation, and only make sense when they appear paired with a nice function under an integral sign.
The smoothness of $f(t)$ and the decay of $\tilde f(\omega)$
The specific examples in the previous section illustrate a general principle: The more rapidly varying the function $f(t)$, the less rapidly the function $\tilde f(\omega)$ decays with $\omega$. Contrariwise, if $f(t)$ is not rapidly varying (it is "smooth" in a colloquial sense) then its Fourier transform decays rapidly for large $\omega$.

    $f(t)$ rapidly varying  $\iff$  $\tilde f(\omega)$ slowly decaying as $\omega \to \infty$
    $f(t)$ slowly varying   $\iff$  $\tilde f(\omega)$ rapidly decaying as $\omega \to \infty$

This makes sense: If $f(t)$ is "slow," then it doesn't contain many "fast" frequency components (or the ones it does contain have small amplitudes).
This statement can be quantified by characterizing the smoothness of $f$ in terms of its continuity and that of its derivatives. In particular,

    If $f(t)$ and its first $p - 1$ derivatives are continuous, but its $p$th derivative is discontinuous with bounded variation, then $\tilde f(\omega)$ decays at least as rapidly as $|\omega|^{-(p+1)}$ for $|\omega| \to \infty$.

In particular, if $f(t)$ is $C^\infty$ (it is continuous and all of its derivatives are continuous everywhere, no discontinuities, anytime, anyplace, ever) then $\tilde f(\omega)$ decays for large $\omega$ faster than any polynomial. Functions which decay faster than any polynomial include, for example, $e^{-|\omega|}$ and $e^{-\omega^2}$.
Statements like the one set off above are generally known as Paley-Wiener theorems.
This principle is already illustrated by the particular examples we considered previously. The function $e^{-\alpha|t|}$ is continuous, but its first derivative is not (it has a finite jump at the origin). Thus the statement set off above is satisfied for $p = 1$ and we expect the Fourier transform to decay like $\omega^{-2}$ for large $\omega$, as indeed we found above. On the other hand, the function $e^{-t^2/\sigma^2}$ is $C^\infty$, so its Fourier transform should decay faster than any polynomial in $\omega$ and, indeed, the Fourier transform of this function goes like $e^{-\omega^2\sigma^2/4}$, which decays for large $\omega$ faster than any polynomial in $\omega$.
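One more worked case may help fix the pattern; it is an added illustration using definition (1), not an example carried out in the notes. Take the box function $f(t) = 1$ for $|t| < 1$ and $f(t) = 0$ for $|t| > 1$. This function is itself discontinuous, so the smoothness statement applies with $p = 0$ and predicts decay at least as fast as $|\omega|^{-1}$:

```latex
\tilde f(\omega)
 = \frac{1}{2\pi}\int_{-1}^{1} e^{-i\omega t}\,dt
 = \frac{1}{2\pi}\,\frac{e^{i\omega} - e^{-i\omega}}{i\omega}
 = \frac{\sin\omega}{\pi\,\omega},
\qquad
\left|\tilde f(\omega)\right| \le \frac{1}{\pi\,|\omega|}.
```

So the three cases line up as expected: a jump in $f$ gives $1/|\omega|$ decay, a jump in $f'$ (the function $e^{-\alpha|t|}$) gives $1/\omega^2$ decay, and a $C^\infty$ function (the Gaussian) gives decay faster than any power.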
Simultaneous compact support of $f(t)$ and $\tilde f(\omega)$

Another implication of the smoothness/decay relationship in Fourier analysis, which is also related to the uncertainty-principle ideas of the previous section, has to do with simultaneous compact support of $f$ and $\tilde f$. Recall that a function $f(t)$ is said to have compact support if it is only nonzero on a compact subregion of the real line. For example, the function
$$f(t) = \begin{cases} 1, & |t| < 1 \\ 0, & |t| > 1 \end{cases}$$
has compact support. On the other hand, the Gaussian $e^{-x^2}$ does not have compact support; for large $x$ it is very small but not exactly zero.
In the same vein as we asked above whether or not we could simultaneously squeeze $f(t)$ and $\tilde f(\omega)$ to be narrow pulses, it is interesting to ask if we could find a function $f(t)$ such that both $f$ and $\tilde f$ have compact support. The answer is basically no, except for the trivial case $f = \tilde f = 0$:

    If $f(t)$ and $\tilde f(\omega)$ both have compact support, then $f(t) = \tilde f(\omega) = 0$.
Fourier series
Next suppose $f(t)$ is a periodic function with period $T$. This means that $f(t + T) = f(t)$ for all $t$; the function $f$ repeats itself every $T$ seconds. [6] Suppose we try to compute the Fourier transform $\tilde f(\omega)$ of this periodic function:
$$\tilde f(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\omega t} f(t) \, dt. \qquad (13)$$
There are two distinct cases we need to analyze:
1. The frequency $\omega$ is an integer multiple of $\frac{2\pi}{T}$. In this case, the entire integrand of (13) is periodic with period $T$. Every time interval of width $T$ makes an identical contribution to the integral, and there are an infinite number of such time intervals, so $\tilde f(\omega) = \infty$.
2. The frequency $\omega$ is not an integer multiple of $\frac{2\pi}{T}$. In this case, the integrand of (13) is not periodic. [The $f(t)$ factor is periodic with period $T$, and the $e^{-i\omega t}$ factor is periodic with some period not equal to an integer fraction of $T$, so the overall integrand is not periodic.] Now what happens is that each time interval of width $T$ makes a contribution to the integral that has essentially the same magnitude, but a "random" phase factor. These random phase factors cause all the contributions to the integral to cancel, and we find $\tilde f(\omega) = 0$.
To summarize, if $f(t)$ is periodic with period $T$, its Fourier transform $\tilde f(\omega)$ is zero except when $\omega$ is an integer multiple of $\omega_0 \equiv \frac{2\pi}{T}$, at which $\tilde f(\omega)$ is infinite. One way to think of this situation is to represent $\tilde f(\omega)$ as a train of $\delta$ functions:
$$\tilde f(\omega) = \sum_{n=-\infty}^{\infty} \tilde f_n \, \delta(\omega - n\omega_0), \qquad \omega_0 = \frac{2\pi}{T}.$$
Another way to think about this is to say that if $f(t)$ is periodic with period $T$, its Fourier decomposition only contains sinusoids with frequencies $\omega_n = n\omega_0 = \frac{2\pi n}{T}$ for $n \in \mathbb{Z}$. We can write
$$f(t) = \sum_{n \in \mathbb{Z}} \tilde f_n e^{2\pi i n t/T} \qquad \text{or} \qquad f(t) = \sum_{n \in \mathbb{Z}} \tilde f_n e^{i n \omega_0 t} \qquad \text{where} \qquad \omega_0 = \frac{2\pi}{T}.$$
This is the Fourier series representation of $f$.
To compute the $\tilde f_n$ coefficients, we simply use a finite-time version of the Fourier transform in which we only look at $f(t)$ over one of its periods:
$$\tilde f_n = \frac{1}{T} \int_0^T f(t) e^{-i n \omega_0 t} \, dt.$$

Footnote 6: Or every $T$ minutes, or hours, or whatever time units we are using.
A simple example
For example, let's Fourier-analyze the function $\cos^2 3t$, which is periodic with period $T = \frac{\pi}{3}$.
[Plot of $\cos^2(3t)$ versus $t$.]
Figure 1: The function $f(t) = \cos^2 3t$.
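The base frequency is $\omega_0 = \frac{2\pi}{T} = 6$, and the $n$th Fourier coefficient is
$$\tilde f_n = \frac{1}{T} \int_0^T e^{-i n \omega_0 t} f(t) \, dt.$$
Use $\cos 3t = \frac{1}{2}(e^{3it} + e^{-3it})$, so that $\cos^2 3t = \frac{1}{4}(e^{6it} + 2 + e^{-6it})$:
$$\tilde f_n = \frac{1}{4T} \int_0^T e^{-i n \omega_0 t} \left[ e^{i\omega_0 t} + 2 + e^{-i\omega_0 t} \right] dt.$$
Now use the orthogonality result stated in the Appendix:
$$\tilde f_n = \frac{1}{4} \left[ \delta_{n,1} + 2\delta_{n,0} + \delta_{n,-1} \right].$$
In other words, the Fourier coefficient $\tilde f_n$ is only nonzero for $n \in \{-1, 0, 1\}$. The Fourier-synthesized version of $f(t)$ is
$$\begin{aligned}
f(t) &= \sum_n \tilde f_n e^{i n \omega_0 t} \\
&= \frac{1}{4} e^{-i\omega_0 t} + \frac{1}{2} + \frac{1}{4} e^{i\omega_0 t} \\
&= \frac{1}{2} \left[ 1 + \cos \omega_0 t \right] \\
&= \frac{1}{2} \left[ 1 + \cos 6t \right]. \qquad (14)
\end{aligned}$$
Of course, we could have used standard trigonometric identities to show that $\cos^2 3t = (1 + \cos 6t)/2$, but it's nice to see this result emerging from the full Fourier analysis procedure.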
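The same coefficients can be recovered numerically from the coefficient integral. The sketch below is an illustration only: it approximates $\frac{1}{T}\int_0^T f(t) e^{-in\omega_0 t}\,dt$ by an equally spaced Riemann sum over one period (the sample count is an arbitrary choice), and the function name is hypothetical.

```python
import cmath
import math

def fourier_coefficient(f, n, period, samples=2048):
    """Approximate (1/T) * integral_0^T f(t) e^{-i n omega0 t} dt with `samples` equal steps."""
    omega0 = 2.0 * math.pi / period
    h = period / samples
    total = 0j
    for j in range(samples):
        t = j * h
        total += f(t) * cmath.exp(-1j * n * omega0 * t)
    return total / samples

period = math.pi / 3.0                       # period of cos^2(3t)
f = lambda t: math.cos(3.0 * t) ** 2
for n in range(-3, 4):
    print(n, fourier_coefficient(f, n, period))
```

Only the $n = -1, 0, 1$ entries come out appreciably nonzero, with values close to $\frac14$, $\frac12$, $\frac14$, in line with (14).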
Fourier Cosine and Sine series

Looking at (14), we see that the function $f(t)$ had only cosine terms [7] and no sine terms. This is actually a general phenomenon that happens whenever the function we are analyzing is an even function, i.e. satisfies $f(t) = f(-t)$: even functions have only cosine terms in their Fourier series. Similarly, odd functions [that is, functions for which $f(t) = -f(-t)$] have only sine terms in their Fourier series. We then speak of a Fourier cosine series or a Fourier sine series.
In some cases the function you are analyzing is neither even nor odd, but can be made into an even or odd function just by shifting the origin of coordinates. [8] More generally, any arbitrary function may be decomposed into even and odd pieces like this:
$$f(t) = f_e(t) + f_o(t), \qquad f_e(t) = \frac{1}{2} \left[ f(t) + f(-t) \right], \qquad f_o(t) = \frac{1}{2} \left[ f(t) - f(-t) \right].$$
A more interesting example

As a more interesting example of Fourier series, consider the sawtooth wave depicted below: $f(t)$ is periodic with period $T$, and for $0 < t < T$ we have $f(t) = t$. [Note that $f(t)$ has the units of $t$; for example, if we are measuring time in seconds, then $f(t)$ has units of seconds.]

Footnote 7: Plus a constant term, which may be thought of as a cosine with zero frequency. Note that sines with zero frequency are identically zero.
Footnote 8: For example, the $T$-periodic function $f(t)$ defined to be 0 for $0 < t < \frac{T}{2}$ and 1 for $\frac{T}{2} < t < T$ is neither even nor odd; but $g(t) = f(t + \frac{T}{4})$ is even.

[Plot of the sawtooth function versus $t$.]
Figure 2: Sawtooth function: $f(t) = t$ for $0 \le t \le T$, and $f(t)$ is periodic with period $T$. (In this plot, the $x$ and $y$ axis labels are measured in units of $T$.)
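The Fourier series of this function is
$$f(t) = \sum_{n=-\infty}^{\infty} \tilde f_n e^{i n \omega_0 t}, \qquad \omega_0 = \frac{2\pi}{T},$$
$$\tilde f_n = \frac{1}{T} \int_0^T f(t) e^{-i n \omega_0 t} \, dt = \frac{1}{T} \int_0^T t \, e^{-i n \omega_0 t} \, dt.$$
The $n = 0$ term evaluates to $\tilde f_0 = \frac{T}{2}$. For $n \neq 0$ we integrate by parts:
$$\tilde f_n = \frac{1}{T} \left[ -\frac{1}{i n \omega_0} \, t \, e^{-i n \omega_0 t} \Big|_0^T + \frac{1}{i n \omega_0} \int_0^T e^{-i n \omega_0 t} \, dt \right] = -\frac{1}{i n \omega_0}.$$
(The boundary term contributes $-T/(i n \omega_0)$ because $e^{-i n \omega_0 T} = 1$, and the remaining integral vanishes because it covers a whole number of periods of the exponential.)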
Thus the Fourier series for our function is
$$f(t) = \frac{T}{2} - \frac{1}{i\omega_0} \sum_{n \neq 0} \frac{e^{i n \omega_0 t}}{n}.$$
Note that the units are correct: the LHS has units of time, the first term on the RHS has units of time, and the second term on the RHS has units of (angular frequency)$^{-1}$ = time.
Note also that $\tilde f_n$ decays like $1/|\omega_n|$ for large $|\omega_n| = |n|\omega_0$. This is in accordance with our discussion of Paley-Wiener theorems above, since the function $f(t)$ is discontinuous.
We could also rewrite this series in terms of cosines and sines (and eliminate $\omega_0$ in favor of $T$):
$$f(t) = \frac{T}{2} - \frac{T}{\pi} \sum_{n=1}^{\infty} \frac{1}{n} \sin \frac{2\pi n t}{T}. \qquad (15)$$
Note that this is almost a Fourier sine series; only the first (constant) term doesn't belong. If we consider the modified function $g(t) = f(t) - \frac{T}{2}$, then this term would go away and the Fourier series for $g(t)$ would be a Fourier sine series, which can only be true if $g(t)$ is an odd function. You should look at the graph of $f(t)$ and convince yourself that shifting the entire curve downward by $T/2$ does indeed yield an odd function.
It seems amazing to think that summing up a bunch of sine functions, each one of which is individually a nice smooth function, can reproduce the jagged, discontinuous behavior of the sawtooth function of Figure 2. But it does!
The Gibbs phenomenon

However, it does so with one proviso: If we truncate the series by summing only a finite number of terms (that is, if we perform an incomplete Fourier synthesis of the function $f(t)$), we encounter the Gibbs phenomenon. The Gibbs phenomenon is the appearance of oscillations near discontinuities in the incomplete Fourier synthesis of a discontinuous function. For example, the following plot shows the original sawtooth wave $f(t)$ together with its incomplete Fourier-synthesized versions $f_N(t)$, where $f_N(t) = \sum_{n=-N}^{N} \tilde f_n e^{i n \omega_0 t}$, for $N = 2, 5, 10, 20$.
[Plot of $f(t)$ together with $f_2(t)$, $f_5(t)$, $f_{10}(t)$ and $f_{20}(t)$ near the discontinuity.]
Figure 3: The Gibbs phenomenon. When we truncate the Fourier series (15) at a finite number of terms, we obtain an approximation to the original sawtooth function $f(t)$. Note that, in the regions away from the discontinuity, the approximation more closely hugs the actual function as $N \to \infty$; however, near the discontinuity, the peak error between the function and the approximation does not decrease with increasing $N$. However, the definition of "near the discontinuity" does change with $N$, and for larger $N$ the errors are confined to a narrower region about the discontinuity.
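The behavior described in the caption can be reproduced from the sine-series form (15). The sketch below is an illustration under stated assumptions: it evaluates the truncated series $f_N(t) = \frac{T}{2} - \frac{T}{\pi}\sum_{n=1}^{N} \frac{1}{n}\sin\frac{2\pi n t}{T}$ and measures the error $f_N(t) - f(t)$ at $t = T/[2(N+1)]$, which is the first stationary point of $f_N$ to the right of the jump at $t = 0$. The function names are hypothetical.

```python
import math

def sawtooth_partial_sum(t, period, n_terms):
    """Truncated sine series (15) for the sawtooth f(t) = t on 0 < t < T."""
    s = sum(math.sin(2.0 * math.pi * n * t / period) / n for n in range(1, n_terms + 1))
    return period / 2.0 - period / math.pi * s

def overshoot_near_jump(period, n_terms):
    """f_N(t) - f(t) at t = T / (2 (N + 1)), the first extremum of f_N right of t = 0."""
    t_star = period / (2.0 * (n_terms + 1))
    return sawtooth_partial_sum(t_star, period, n_terms) - t_star

for n_terms in (5, 20, 80, 320):
    print(n_terms,
          overshoot_near_jump(1.0, n_terms),                   # stays near -0.09 T
          sawtooth_partial_sum(0.25, 1.0, n_terms) - 0.25)     # shrinks as N grows
```

The overshoot column settles at roughly 9% of the jump height instead of shrinking, while the error at the fixed interior point $t = T/4$ keeps decreasing; the location of the overshoot, $T/[2(N+1)]$, moves toward the discontinuity as $N$ grows.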
Convergence of Fourier series at points of discontinuity

Figure 3 also illustrates an important point about Fourier-series representations of discontinuous functions: If the original function $f(t)$ is discontinuous at a point $t_0$, then the Fourier series $\sum_n \tilde f_n e^{i n \omega_0 t}$ converges to the midpoint of the discontinuity, i.e. we have
$$\lim_{N \to \infty} \sum_{n=-N}^{N} \tilde f_n e^{i n \omega_0 t_0} = \frac{1}{2} \left[ f(t_0^-) + f(t_0^+) \right]$$
where
$$f(t_0^-) = \lim_{\epsilon \to 0^+} f(t_0 - \epsilon), \qquad f(t_0^+) = \lim_{\epsilon \to 0^+} f(t_0 + \epsilon).$$
In particular, if we construct a Fourier series to represent the behavior over $[0, T]$ of a function that is not periodic on that interval, then evaluating this Fourier series at $t = 0$ will yield $\frac{1}{2}[f(0) + f(T)]$. This behavior is clearly visible in the figure above.
Fourier analysis is a lossless process: Parseval's theorem
A very important property of the process of Fourier analysis is that it is lossless: after going over to the Fourier domain, we have no less information about $f$ than we started out with. This is true no matter which of the four entries in the fourfold way (Table 1) we are talking about. [9]
Fourier synthesis
One important consequence of the losslessness of Fourier analysis is that the inverse process, Fourier synthesis, exists and may be used to recover the original function from its Fourier-analyzed version. (Again, this is true no matter which of the fourfold-way entries we are talking about.) For example, the inverse of equation (1) reads
$$f(t) = \int_{-\infty}^{\infty} e^{i\omega t} \tilde f(\omega) \, d\omega. \qquad (16)$$
This is exactly what you expect: we recover $f(t)$ by summing a bunch of sinusoids $e^{i\omega t}$, with the weight of the frequency-$\omega$ summand given by $\tilde f(\omega)$.
Note that equation (16) is exact: there is no loss in going back and forth between the physical and Fourier domains. If we didn't have the losslessness property of Fourier analysis, we would have to wonder whether or not the function defined by the RHS of (16) was in some way an inexact representation of our function.
Parseval's Theorem
Another important consequence of the losslessness of Fourier analysis is that it allows us to perform certain computations in the Fourier domain with the confidence that these computations yield the same results as if we had performed them in the physical domain. If we didn't have the losslessness property of Fourier analysis, we would have to wonder whether or not we lost something along the way.
This phenomenon is well illustrated by Parseval's theorem. Suppose we have two functions $f(t)$ and $g(t)$ and we want to compute their inner product,
$$\langle f | g \rangle \equiv \int_{-\infty}^{\infty} f^*(t) g(t) \, dt.$$
Insert the Fourier-synthesized versions of $f$ and $g$, equation (16):
$$= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} e^{-i\omega t} \tilde f^*(\omega) \, d\omega \right] \left[ \int_{-\infty}^{\infty} e^{+i\omega' t} \tilde g(\omega') \, d\omega' \right] dt$$
Rearrange the order of integration:
$$= \int_{-\infty}^{\infty} \tilde f^*(\omega) \int_{-\infty}^{\infty} \tilde g(\omega') \underbrace{\left\{ \int_{-\infty}^{\infty} e^{-i(\omega - \omega')t} \, dt \right\}}_{2\pi\delta(\omega - \omega')} d\omega' \, d\omega = 2\pi \int_{-\infty}^{\infty} \tilde f^*(\omega) \tilde g(\omega) \, d\omega.$$
Thus the inner product of the Fourier transforms of $f$ and $g$ is equal, up to the factor of $2\pi$ that comes from our normalization convention in (1), to the inner product of $f$ and $g$.

Footnote 9: Note that we may not start out with complete information on the function $f(t)$; for example, we may only have samples of this function at some limited set of time points. In this case, the process of Fourier analysis (which, for a finite set of function samples, would be the discrete Fourier transform) obviously does not magically give us any more information about the original underlying function $f(t)$ than we started out with, but what's important is that it doesn't lose any information: after computing the DFT, we can always compute the inverse DFT to recover the original function samples we started with.
Plancherel's theorem
If we take the two functions in Parseval's theorem to be the same function, $g = f$, we obtain Plancherel's theorem:
$$\int_{-\infty}^{\infty} |f(t)|^2 \, dt = 2\pi \int_{-\infty}^{\infty} |\tilde f(\omega)|^2 \, d\omega.$$
Fourier-Series Versions of Parseval and Plancherel

The derivations of the Parseval and Plancherel theorems above were for the upper left box of the fourfold way; that is, the case in which we are interested in the behavior of $f(t)$ over all time. If we are instead working in the lower left box, where we are only interested in the behavior of a function over a finite time interval $[0, T]$ (either because the function is periodic with period $T$, or because we only care about its behavior in an interval of width $T$), then the corresponding versions of the Parseval and Plancherel theorems are
$$\int_0^T f^*(t) g(t) \, dt = T \sum_{n=-\infty}^{\infty} \tilde f_n^* \tilde g_n \qquad (17)$$
$$\int_0^T |f(t)|^2 \, dt = T \sum_{n=-\infty}^{\infty} |\tilde f_n|^2 \qquad (18)$$
These are easy to derive by proceeding in exact analogy to the derivation we presented for the infinite-time case.
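Equation (18) can be checked on the sawtooth wave analyzed earlier, for which $\tilde f_0 = T/2$ and $|\tilde f_n| = 1/(|n|\omega_0)$ for $n \neq 0$. The left side is $\int_0^T t^2\,dt = T^3/3$. The sketch below is only a numerical illustration: it truncates the Fourier-domain sum at a finite `n_max` (an arbitrary choice), so the agreement is approximate; the function names are hypothetical.

```python
import math

def sawtooth_energy_physical(period):
    """Left side of (18) for f(t) = t on [0, T]: integral of t^2, i.e. T^3 / 3."""
    return period**3 / 3.0

def sawtooth_energy_fourier(period, n_max):
    """Right side of (18), truncated to |n| <= n_max, using the sawtooth coefficients."""
    omega0 = 2.0 * math.pi / period
    total = (period / 2.0) ** 2                    # n = 0 term
    for n in range(1, n_max + 1):
        total += 2.0 / (n * omega0) ** 2           # terms +n and -n contribute equally
    return period * total

for n_max in (10, 1000, 100_000):
    print(n_max, sawtooth_energy_fourier(2.0, n_max), sawtooth_energy_physical(2.0))
```

The truncated sum approaches $T^3/3$ from below, with a remainder that shrinks like $1/n_{\max}$, as expected for coefficients decaying like $1/|n|$.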
Computational significance of Parseval and Plancherel

Computationally, the significance of the Parseval and Plancherel theorems is
that they allow us to perform computations in either the physical or the Fourier
domain depending on which is easier.
Poisson summation
The computational impact of Parseval's theorem is that it gives us the option of evaluating certain sums in either the physical domain or the Fourier domain depending on which is easier. If we are trying to compute a physical-domain integral or sum that is more rapidly converging in the Fourier domain, we can just evaluate it in that domain, and Parseval's theorem guarantees that we incur no error in the process.
Poisson summation is a similar technique which, computationally, gives us the choice of evaluating sums in the Fourier or physical domain. More specifically, suppose we have a function $f(t)$, and we want to sum the values of this function at evenly spaced time points separated by $\Delta t$. Then Poisson summation tells us we can just as well do the computation by summing samples of $\tilde f(\omega)$ at evenly-spaced frequency points separated by $\frac{2\pi}{\Delta t}$.
In equations, the Poisson summation formula reads
$$\sum_{n=-\infty}^{\infty} f(n\Delta t) = \frac{2\pi}{\Delta t} \sum_{m=-\infty}^{\infty} \tilde f\left( \frac{2\pi m}{\Delta t} \right). \qquad (19)$$
Note that the units are correct: $\tilde f$ has units of
$$\text{units of } \tilde f = \frac{\text{units of } f}{\text{frequency}} = \text{units of } f \cdot \text{time}$$
while $\Delta t$ has units of time; thus
$$\text{units of } \frac{\tilde f}{\Delta t} = \text{units of } f.$$
It's easy to prove equation (19), and we'll do it below, but first let's investigate some practical applications.
Jacobi functions
Recall that the Fourier transform of a Gaussian is a Gaussian, and that, more specifically, the FT of a wide Gaussian is a narrow Gaussian; in particular, the FT of a Gaussian with width $\sigma$ in physical space is a Gaussian with width $\frac{2}{\sigma}$ in Fourier space. Thus, if we ever found ourselves wanting to sum the quantity $e^{-\pi n^2 x}$ over all integer $n$, and we were finding our sum slow to converge (because, say, $x$ might be small and the sum thus slowly convergent) we might be tempted to exploit Poisson summation to evaluate the sum in Fourier space.
To get technical about it, define
$$T_x(n) \equiv e^{-\pi n^2 x} \qquad (20)$$
where we think of $T_x(n)$ as a function of $n$ parameterized by $x$. Now think of $n$ as a continuous physical-space variable whose Fourier-space counterpart variable we will call $\nu$. The Fourier transform of (20) is
$$\begin{aligned}
\tilde T_x(\nu) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\nu n} T_x(n) \, dn \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\nu n - \pi n^2 x} \, dn \\
&= \frac{1}{2\pi\sqrt{x}} e^{-\nu^2/(4\pi x)}.
\end{aligned}$$
(The integral here is evaluated in the same way as the integral that arose when we computed the Fourier transform of a Gaussian.) Now consider the following function of $x$, known as the Jacobi theta function: [10]
$$\theta(x) \equiv \sum_{n=-\infty}^{\infty} e^{-\pi n^2 x} = \sum_{n=-\infty}^{\infty} T_x(n). \qquad (21)$$
Applying the Poisson summation formula (19) with $\Delta t = 1$, we immediately find
$$\theta(x) = 2\pi \sum_{m=-\infty}^{\infty} \tilde T_x(2\pi m) = \frac{1}{\sqrt{x}} \underbrace{\sum_{m=-\infty}^{\infty} e^{-\pi m^2/x}}_{\theta(1/x)}. \qquad (22)$$
But the sum here is nothing but the original $\theta$ function evaluated at the inverse of its original argument! We have proven the functional equation of the Jacobi $\theta$ function:
$$\theta(x) = \frac{1}{\sqrt{x}} \, \theta\left( \frac{1}{x} \right).$$
I find this to be a totally wacky formula. $\theta(x)$ looks like a very complicated function. How could the value of this function at $x$ possibly be related so simply to its value at $1/x$? But it is!
To demonstrate the computational efficacy of (22), write it in the form
X
n=
n2 x
=x
1/2
em
/x
(23)
m=
Suppose we want to compute, to 6-digit accuracy, the value of this sum for
$x = 0.04$. Using the LHS to evaluate the sum, we need to sum 11 terms:

LHS sum in (23) with x = 0.04:

  N    $e^{-\pi N^2 x}$            $\sum_{n=-N}^{N} e^{-\pi n^2 x}$
  0    1.0                         1.000000000000000
  1    0.8819113782981763          2.763822756596353
  2    0.6049225627642709          3.9736678821248947
  3    0.322718983267049           4.619105848658993
  4    0.133905721399763           4.886917291458519
  5    0.04321391826377226         4.973345127986064
  6    0.010846710538160161        4.995038549062384
  7    0.002117494770632841        4.99927353860365
  8    0.00032151151668886733      4.999916561637028
  9    3.796825289201935e-5        4.999992498142812
  10   3.4873423562089973e-6       4.9999994728275245
  11   2.4912565147240595e-7       4.999999971078828

¹⁰ Actually the function defined by equation (21) is only one of several related
functions known collectively as Jacobi theta functions.
On the other hand, if we use the RHS of (23), we only have to sum one term to
get 6-digit accuracy:

RHS sum in (23) with x = 0.04:

  N    $e^{-\pi N^2/x}$            $\sum_{n=-N}^{N} e^{-\pi n^2/x}$
  0    1.0                         1.000000000000000
  1    7.773044498987552e-35       1.000000000000000
  2    3.650603079495543e-137      1.000000000000000
The functional equation of the Jacobi $\theta$ function is upheld to the
accuracy of our calculation:
$$\underbrace{\theta(0.04)}_{4.999999971078828}
= \underbrace{\frac{1}{\sqrt{0.04}}}_{5.00000000000000}
\cdot \underbrace{\theta\!\left(\frac{1}{0.04}\right)}_{1.00000000000}$$
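The two tables above can be reproduced with a few lines of julia. This is only a sketch: the function names `theta_lhs` and `theta_rhs` are made up for this illustration, and `N` denotes the largest $|n|$ kept in each truncated sum.

```julia
# Truncated versions of the two sides of (23); N is the largest |n| retained.
theta_lhs(x, N) = sum(exp(-π * n^2 * x) for n in -N:N)
theta_rhs(x, N) = sum(exp(-π * m^2 / x) for m in -N:N) / sqrt(x)

x = 0.04
for N in (0, 1, 5, 11)
    println("N = ", N, "   LHS = ", theta_lhs(x, N), "   RHS = ", theta_rhs(x, N))
end
```

The RHS column is already at its limiting value for `N = 0`, while the LHS column needs `N` of about 10 to agree with it to six digits.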
Ewald summation
Finally, Poisson summation is the basis of Ewald summation, a wonderful technique for speeding the convergence of real-space sums over particle interactions
that is widely used in computational physics and engineering. We will consider
this topic in detail in a subsequent set of lecture notes.
Proof of Poisson Summation

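This proof is somewhat heuristic, but it captures the essence of the argument.
Start with the LHS of (19) and insert the Fourier-synthesized representation
of $f(n\Delta t)$:
$$\sum_{n=-\infty}^{\infty} f(n\Delta t)
= \sum_{n=-\infty}^{\infty} \int \widetilde{f}(\omega)\, e^{i n \omega \Delta t}\, d\omega$$
Rearrange the summation and integration:
$$= \int \widetilde{f}(\omega)
\underbrace{\left\{ \sum_{n=-\infty}^{\infty} e^{i n \omega \Delta t} \right\}}_{2\pi \sum_{m} \delta(\omega \Delta t - 2\pi m)} d\omega$$
The point of this step is that the sum over $n$ inside the curly brackets yields
zero (all the terms eventually cancel each other) unless $\omega \Delta t$ is an
integer multiple of $2\pi$, in which case that sum is infinite. We summarize this
situation by describing the quantity in the curly brackets as a $\delta$ function
which is only nonzero for $\omega \Delta t$ equal to $2\pi m$ for arbitrary
integers $m$.
$$= 2\pi \sum_{m=-\infty}^{\infty} \int \widetilde{f}(\omega)\, \delta(\omega \Delta t - 2\pi m)\, d\omega$$
Finally, use the $\delta$-function identity
$\delta(ax - b) = \frac{1}{a}\,\delta(x - b/a)$ (for $a > 0$):
$$= \frac{2\pi}{\Delta t} \sum_{m=-\infty}^{\infty} \int \widetilde{f}(\omega)\,
\delta\!\left(\omega - \frac{2\pi m}{\Delta t}\right) d\omega
= \frac{2\pi}{\Delta t} \sum_{m=-\infty}^{\infty} \widetilde{f}\!\left(\frac{2\pi m}{\Delta t}\right).$$
This completes the proof.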
Fourier analysis and convolution
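Another important property of the Fourier-analysis process is that it behaves
multiplicatively under convolutions. Again, this is true no matter which of the
four entries in the fourfold way (Table 1) we are talking about.
Recall that the convolution of two functions $f(t)$ and $g(t)$ is a sum of copies
of $g(t)$, with each copy displaced in time by some time offset $\tau$ and
weighted in the sum by the value of $f$ at time $\tau$:
$$C(t) \equiv f \star g = \int f(\tau)\, g(t - \tau)\, d\tau.$$
Let's compute the Fourier transform of $C(t)$:
$$\begin{aligned}
\widetilde{C}(\omega) &= \frac{1}{2\pi} \int C(t)\, e^{-i\omega t}\, dt \\
&= \frac{1}{2\pi} \int\!\!\int f(\tau)\, g(t - \tau)\, e^{-i\omega t}\, d\tau\, dt
\end{aligned}$$
Insert the Fourier-synthesized versions of $f$ and $g$:
$$= \frac{1}{2\pi} \int\!\!\int
\left[ \int \widetilde{f}(\omega_1)\, e^{i\omega_1 \tau}\, d\omega_1 \right]
\left[ \int \widetilde{g}(\omega_2)\, e^{i\omega_2 (t - \tau)}\, d\omega_2 \right]
e^{-i\omega t}\, d\tau\, dt$$
Rearrange the order of integration:
$$= \frac{1}{2\pi} \int\!\!\int
\underbrace{\left\{ \int d\tau\, e^{i(\omega_1 - \omega_2)\tau} \right\}}_{2\pi\delta(\omega_1 - \omega_2)}
\underbrace{\left\{ \int dt\, e^{i(\omega_2 - \omega)t} \right\}}_{2\pi\delta(\omega_2 - \omega)}
\widetilde{f}(\omega_1)\, \widetilde{g}(\omega_2)\, d\omega_1\, d\omega_2$$
Use the first $\delta$ function to evaluate the $\omega_1$ integral, then use the
second $\delta$ function to evaluate the $\omega_2$ integral:
$$= 2\pi\, \widetilde{f}(\omega)\, \widetilde{g}(\omega).$$
In other words: The frequency-$\omega$ Fourier coefficient of the convolution of
$f$ and $g$ is just the product of the frequency-$\omega$ Fourier coefficients of
$f$ and $g$ (times the factor $2\pi$ that comes from our normalization convention).
This fact has important implications for signal processing. In particular, it
means that the operation of convolution is easier to perform in the frequency
domain than the physical domain.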
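A discrete analogue of this result is easy to check numerically. The sketch below is an illustration, not part of the derivation above: it uses a naive $O(n^2)$ discrete Fourier transform with a $1/n$ prefactor (chosen here to mirror the $1/2\pi$ convention of these notes) and a circular convolution, and the names `dft` and `circular_convolution` are hypothetical helpers. With this normalization the factor $n$ plays the role that $2\pi$ plays in the continuous formula.

```julia
# Normalized DFT: F[k] = (1/n) * sum_j f[j] * exp(-2πi*j*k/n), with j, k = 0, ..., n-1.
function dft(f)
    n = length(f)
    return [sum(f[j+1] * exp(-2π * im * j * k / n) for j in 0:n-1) / n for k in 0:n-1]
end

# Circular convolution: C[t] = sum_τ f[τ] * g[t-τ], with indices taken mod n.
function circular_convolution(f, g)
    n = length(f)
    return [sum(f[τ+1] * g[mod(t - τ, n)+1] for τ in 0:n-1) for t in 0:n-1]
end

f = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, 0.25, -2.0]
g = [0.5, -1.0, 2.0, 0.0, 1.0, 1.5, -0.5, 0.75]
n = length(f)

lhs = dft(circular_convolution(f, g))   # transform of the convolution
rhs = n .* dft(f) .* dft(g)             # n times the product of the transforms
println("largest discrepancy: ", maximum(abs.(lhs .- rhs)))
```

The discrepancy printed at the end should be at the level of floating-point roundoff.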
Higher-Dimensional Fourier Transforms
The entire theory of Fourier analysis generalizes readily to higher dimensions.

For example, let $f(x, y)$ be a function of two variables. By holding $x$ fixed
and Fourier-transforming with respect to $y$, we obtain a mixed
physical-space/Fourier-space function $\widetilde{f}(x, k_y)$:¹¹
$$\widetilde{f}(x, k_y) = \frac{1}{2\pi} \int e^{-i k_y y} f(x, y)\, dy.$$
And now we hold $k_y$ fixed and Fourier-transform $\widetilde{f}(x, k_y)$ with
respect to $x$:
$$\begin{aligned}
\widetilde{f}(k_x, k_y) &= \frac{1}{2\pi} \int e^{-i k_x x}\, \widetilde{f}(x, k_y)\, dx \\
&= \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(k_x x + k_y y)} f(x, y)\, dy\, dx.
\end{aligned}$$
It is typical to write this in the form
$$\widetilde{f}(\mathbf{k}) = \frac{1}{(2\pi)^2} \int e^{-i\mathbf{k}\cdot\mathbf{x}} f(\mathbf{x})\, d\mathbf{x}$$
where the integrations (unless otherwise specified) are generally understood to
range over the full range of the $\mathbf{x}$ variable. Written this way, the
formula for the $D$-dimensional Fourier transform actually looks the same, but
with a prefactor $\frac{1}{(2\pi)^D}$.
The multidimensional version of Fourier synthesis is
$$f(\mathbf{x}) = \int \widetilde{f}(\mathbf{k})\, e^{i\mathbf{k}\cdot\mathbf{x}}\, d\mathbf{k}.$$
Examples of higher-dimensional Fourier transforms

Gaussians. Gaussians in $D$ dimensions are easy to Fourier-transform because
they are separable, i.e. they may be written as a product of $D$ factors each
depending on only one variable.
The Coulomb potential. A less trivial example is the case of the Coulomb
potential in 3 dimensions:
$$\phi^{\text{coulomb}}(\mathbf{r}) = \phi^{\text{coulomb}}(x, y, z) = \frac{1}{\sqrt{x^2 + y^2 + z^2}}$$
¹¹ Since this function lives half in physical space and half in Fourier space we
really should adorn it with a half-tilde instead of the full crown, but I don't
know how to typeset that in LaTeX.
The Fourier transform of this is
$$\widetilde{\phi}^{\text{coulomb}}(\mathbf{k}) = \frac{1}{(2\pi)^3} \int \frac{e^{-i\mathbf{k}\cdot\mathbf{r}}}{|\mathbf{r}|}\, d\mathbf{r}$$
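A convenient way of evaluating 3D integrals like this is to use polar coordinates
in a coordinate system in which $\mathbf{k}$ points in the $z$ direction. In this
coordinate system we have $d\mathbf{r} = r^2 \sin\theta\, dr\, d\theta\, d\varphi$
and $\mathbf{k}\cdot\mathbf{r} = kr\cos\theta$ (where $k = |\mathbf{k}|$ is the
magnitude of $\mathbf{k}$), so the integral becomes
$$= \frac{1}{(2\pi)^3} \int_0^\infty \int_0^\pi \int_0^{2\pi}
\frac{e^{-ikr\cos\theta}}{r}\, r^2 \sin\theta\, d\varphi\, d\theta\, dr$$
The $\varphi$ integral can be done immediately to yield $2\pi$. To do the
$\theta$ integral, change variable to $u = \cos\theta$, $du = -\sin\theta\, d\theta$:
$$\begin{aligned}
&= \frac{1}{(2\pi)^2} \int_0^\infty dr\, r \int_{-1}^{1} e^{-ikru}\, du \\
&= \frac{1}{(2\pi)^2} \int_0^\infty dr\, r\,
\underbrace{\left[ -\frac{1}{ikr}\, e^{-ikru} \right]_{-1}^{1}}_{2\sin(kr)/kr} \\
&= \frac{1}{2\pi^2 k} \int_0^\infty \sin kr\, dr
\end{aligned}$$
Change variables to $t = kr$:
$$= \frac{1}{2\pi^2 k^2} \underbrace{\int_0^\infty \sin t\, dt}_{=1}$$
and thus¹² we conclude that the 3D Fourier transform of the Coulomb potential is
$$\widetilde{\phi}^{\text{coulomb}}(\mathbf{k}) = \frac{1}{2\pi^2 k^2}. \tag{24}$$
A good way to think of (24) is in terms of the Fourier-synthesis picture:
$$\begin{aligned}
\frac{1}{r} &= \phi^{\text{coulomb}}(\mathbf{r}) \\
&= \int \widetilde{\phi}^{\text{coulomb}}(\mathbf{k})\, e^{i\mathbf{k}\cdot\mathbf{r}}\, d\mathbf{k} \\
&= \int \frac{d\mathbf{k}\; e^{i\mathbf{k}\cdot\mathbf{r}}}{2\pi^2 |\mathbf{k}|^2}.
\end{aligned}$$
Thus, we can recover the Coulomb potential by summing plane waves of all
possible wavevectors; the contribution of the plane wave with wavevector
$\mathbf{k}$ is weighted in the sum with a factor $1/(2\pi^2 |\mathbf{k}|^2)$.
¹² Actually the $t$ integral here doesn't quite make sense as we have written it;
the proper justification of the result requires a certain limiting process, which
you will work out in your problem set.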
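One common way to give the $t$ integral a meaning (an assumption on our part; the notes do not say which limiting process they have in mind) is to insert a damping factor $e^{-\epsilon t}$ and let $\epsilon \to 0$. The damped integral has the closed form $\int_0^\infty \sin t\, e^{-\epsilon t}\, dt = 1/(1 + \epsilon^2)$, which tends to 1. The sketch below checks this with a simple trapezoid rule; `damped_sine_integral`, the step size `h`, and the cutoff `T = 40/ε` are illustrative choices.

```julia
# Trapezoid-rule estimate of the integral of sin(t)*exp(-ε*t) over [0, T],
# with T = 40/ε chosen so that the neglected tail is of order exp(-40).
function damped_sine_integral(ε; h=1e-3)
    T = 40 / ε
    n = round(Int, T / h)
    g(t) = sin(t) * exp(-ε * t)
    s = 0.5 * (g(0.0) + g(n * h))
    for j in 1:n-1
        s += g(j * h)
    end
    return s * h
end

for ε in (1.0, 0.1, 0.01)
    println("ε = ", ε, "   numerical = ", damped_sine_integral(ε),
            "   closed form = ", 1 / (1 + ε^2))
end
```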
Parseval, Plancherel, Poisson in higher dimensions

All of the theorems that we derived above expressing the losslessness property of
Fourier analysis extend immediately to the multidimensional case. For example,
Parseval's theorem tells us that we can compute the inner product of two
$D$-dimensional functions equally well in real space or in Fourier space:
$$\int f^*(\mathbf{x})\, g(\mathbf{x})\, d\mathbf{x}
= (2\pi)^D \int \widetilde{f}^*(\mathbf{k})\, \widetilde{g}(\mathbf{k})\, d\mathbf{k}$$
where the integrations on both sides extend over all of $\mathbb{R}^D$.
For the higher-dimensional generalization of the Poisson summation formula,
we have the freedom to choose the sample points with different spacings in the
different dimensions. For example, consider a two-dimensional function $f(x, y)$,
and suppose we want to evaluate the two-dimensional lattice sum
$$\sum_{n_x, n_y = -\infty}^{\infty} f\big(n_x \Delta x,\; n_y \Delta y\big).$$
In other words, we are sampling $f$ on a grid of points that lie $\Delta x$ apart
in the $x$ direction, and $\Delta y$ apart in the $y$ direction. All we have to
do is apply Poisson summation recursively, first in the $y$ direction and then in
the $x$ direction (or vice versa, it doesn't matter). The result is
$$\sum_{n_x, n_y = -\infty}^{\infty} f\big(n_x \Delta x,\; n_y \Delta y\big)
= \left(\frac{2\pi}{\Delta x}\right) \left(\frac{2\pi}{\Delta y}\right)
\sum_{m_x, m_y = -\infty}^{\infty} \widetilde{f}\!\left(\frac{2\pi m_x}{\Delta x}, \frac{2\pi m_y}{\Delta y}\right)$$
where $\widetilde{f}$ is the two-dimensional Fourier transform of $f$.
Exponential Sums
In several places throughout this document, we have invoked certain sum rules
without justification. We'll collect these formulas here just to make sure we
have them all in one place and to emphasize that they are all really just slightly
different twists on the same basic principle.
Continuous, finite-time version

First suppose we are working over a finite interval of the $t$ axis of width $T$,
i.e. we are in the setting of the Fourier series. Let $\omega_0 = \frac{2\pi}{T}$
be the base frequency (the minimal frequency of any sinusoid in the Fourier-series
representation of a function $f(t)$ over our interval). Then our result takes the
form
$$\frac{1}{T} \int_0^T e^{i(n_1 - n_2)\omega_0 t}\, dt = \delta_{n_1, n_2}. \tag{25}$$
(The RHS here is the Kronecker delta: it evaluates to 1 if $n_1 = n_2$ and 0
otherwise.) You can easily prove this result by evaluating the integral yourself.
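If you would rather check (25) numerically, the following sketch approximates the integral by an equally spaced Riemann sum (for periodic integrands like this one, that sum is exact up to roundoff as long as $|n_1 - n_2|$ is smaller than the number of sample points). The name `fourier_inner` and the default values of `T` and `M` are arbitrary choices made for this illustration.

```julia
# Riemann-sum approximation of (1/T) * ∫_0^T exp(i (n1 - n2) ω0 t) dt using M sample points.
function fourier_inner(n1, n2; T=2.0, M=1000)
    ω0 = 2π / T
    h = T / M
    s = 0.0 + 0.0im
    for j in 0:M-1
        s += exp(im * (n1 - n2) * ω0 * (j * h))
    end
    return s * h / T
end

println(fourier_inner(3, 3))   # n1 == n2: should be 1
println(fourier_inner(3, 5))   # n1 != n2: should be 0 up to roundoff
```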
Orthogonality interpretation
A good way to interpret (25) is to say that, for $n_1 \neq n_2$, the functions
$f_{n_1}(t) = e^{i n_1 \omega_0 t}$ and $f_{n_2}(t) = e^{i n_2 \omega_0 t}$ are
orthogonal with respect to the inner product
$\langle f, g \rangle = \frac{1}{T} \int_0^T f^* g\, dt$. The notions of inner
product and orthogonality are borrowed from geometry, and they mean the same
things here: the inner product is an operation that takes two elements and
returns a number, and two elements are orthogonal if they have zero inner product.
Continuous, infinite-time version

Next suppose we are working over the entire real line. Then the appropriate
version of (25) is
$$\int_{-\infty}^{\infty} e^{i(\omega - \omega')t}\, dt = 2\pi\, \delta(\omega - \omega'). \tag{26}$$
Discrete version
The following result was used in our derivation of Poisson summation above,
and will be considered further when we discuss discrete Fourier transforms.
$$\sum_{n=-\infty}^{\infty} e^{inkx} = 2\pi \sum_{m=-\infty}^{\infty} \delta(xk - 2\pi m)$$
What this says is the following: The sum on the LHS yields zero unless $x$ is an
integer multiple of $\frac{2\pi}{k}$. (The sum over $m$ is just allowing for all
possible integer multiples.) If $x$ is an integer multiple of $\frac{2\pi}{k}$,
then the sum on the LHS is infinite (all the summands are equal to 1), but
infinite in such a way that if we multiply the LHS by some function $f(x)$ and
integrate over all $x$ then we get a finite number which depends on the values of
$f(x)$ at the points $x = 2\pi m/k$.
Gaussian Integrals
The basic Gaussian integral

The basic Gaussian integral is
$$\int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi} \tag{27}$$
If we throw a factor $\alpha$ into the exponent, we find instead
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\, dx = \sqrt{\frac{\pi}{\alpha}}. \tag{28}$$
To derive this formula, just change variables in the original Gaussian integral
(27).
You can use dimensional analysis to remember the $\alpha$ dependence of (28)
like this: The entire LHS of (28) has the same units as $x$ because the $dx$
factor in the integral is the only dimensionful quantity in that expression. For
example, if $x$ is measured in meters, then the entire LHS of (28) has units of
meters. On the other hand, since $\alpha x^2$ is the argument of an exponential,
it must be dimensionless, whereupon $1/\alpha$ must have the same units as $x^2$,
and thus $1/\sqrt{\alpha}$ must have the same units as $x$. Since the RHS must
have the same units as $x$, the RHS must be proportional to $1/\sqrt{\alpha}$.
Gaussian integrals with linear and constant terms in the exponent
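It may also happen that the exponent contains additional terms of lower order
in $x$, i.e. we may have
$$I(\alpha, \beta, \gamma) = \int_{-\infty}^{\infty} e^{-\alpha x^2 + \beta x + \gamma}\, dx.$$
The first easy thing to do is to pull a factor of $e^{\gamma}$ out of the integral:
$$I(\alpha, \beta, \gamma) = e^{\gamma} \int_{-\infty}^{\infty} e^{-\alpha x^2 + \beta x}\, dx.$$
To make sense of what's left, complete the square:
$$-\alpha x^2 + \beta x = -\alpha \left( x - \frac{\beta}{2\alpha} \right)^2 + \frac{\beta^2}{4\alpha}.$$
Inserting back into the above, we have
$$I(\alpha, \beta, \gamma) = e^{\gamma + \frac{\beta^2}{4\alpha}}
\int_{-\infty}^{\infty} e^{-\alpha \left( x - \frac{\beta}{2\alpha} \right)^2}\, dx$$
Now just change variables to $y = x - \frac{\beta}{2\alpha}$:
$$= e^{\gamma + \frac{\beta^2}{4\alpha}}
\underbrace{\int_{-\infty}^{\infty} e^{-\alpha y^2}\, dy}_{\sqrt{\pi/\alpha}}
= \sqrt{\frac{\pi}{\alpha}}\; e^{\gamma + \frac{\beta^2}{4\alpha}}.$$
Although it's not obvious from this derivation, the formula actually continues
to hold for imaginary values of $\beta$ and $\gamma$.¹³
¹³ And even some complex values of $\alpha$, though not all: for example, it
clearly fails for $\alpha = 0$ or $\alpha = -1$, among other values, as the
original integral obviously diverges in these cases.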
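As a sanity check on the completed-square result, the sketch below compares a trapezoid-rule evaluation of $\int e^{-\alpha x^2 + \beta x + \gamma}\, dx$ with the closed form $\sqrt{\pi/\alpha}\, e^{\gamma + \beta^2/4\alpha}$ for real $\alpha > 0$. The function names, the truncation half-width `L`, and the number of panels `M` are illustrative assumptions, not anything prescribed by the notes.

```julia
# Trapezoid rule on [-L, L]; the integrand is negligible beyond that for moderate α, β.
function gaussian_integral(α, β, γ; L=20.0, M=200_000)
    h = 2L / M
    s = 0.0
    for j in 0:M
        x = -L + j * h
        w = (j == 0 || j == M) ? 0.5 : 1.0   # endpoint weights of the trapezoid rule
        s += w * exp(-α * x^2 + β * x + γ)
    end
    return s * h
end

# Closed form obtained by completing the square.
closed_form(α, β, γ) = sqrt(π / α) * exp(γ + β^2 / (4α))

println(gaussian_integral(2.0, 1.0, 0.5), "   ", closed_form(2.0, 1.0, 0.5))
```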

Convergence of Infinite Sums
Homer Reid
February 6, 2014
Consider a convergent infinite sum
$$S = \sum_{n=1}^{\infty} f(n) \tag{1}$$
We want to know how accurately we can approximate $S$ by retaining only the
first $N$ terms in the sum. That is, if we define the $N$th partial sum as
$$S_N = \sum_{n=1}^{N} f(n) \tag{2}$$
then we want to estimate the error $E_N$ incurred by approximating $S$ by $S_N$.
$E_N$ is of course just the sum of all summands from $N + 1$ to infinity:
$$E_N \equiv \big| S - S_N \big| = \left| \sum_{n=N+1}^{\infty} f(n) \right|. \tag{3}$$
Error estimates for monotonic summands

In the commonly encountered case in which $f(x)$ is positive and monotonically
decreasing [that is, $y > x$ implies $f(y) < f(x)$], it is easy to estimate the
sum in (3) in terms of definite integrals over $f(x)$. To understand the basic
idea, consider the following plot of the function $f(x)$ over the interval
$[N, M]$. (The particular case we are considering here is $f(x) = 1/x^2$ with
$[N, M] = [10, 15]$, but the general principles are valid for any monotonically
decreasing function over any interval.)
Figure 1: A plot of the function $f(x) = 1/x^2$, here considered over the interval
$[N, M] = [10, 15]$.
The integral $\int_N^M f(x)\, dx$ gives the area under the curve $f(x)$ between
$N$ and $M$. This is the red-shaded region in Figure 2 below.
Figure 2: The integral $\int_{10}^{15} f(x)\, dx$ gives the area under the curve
$f(x)$ between $x = 10$ and $x = 15$.
On the other hand, the sum $\sum_{n=N}^{M-1} f(n)$ gives the area of the
purple-shaded region shown in Figure 3 below.
Figure 3: The sum $\sum_{n=N}^{M-1} f(n)$ gives the area of the shape consisting
of the blue shaded rectangles. Since $f(x)$ is monotonically decreasing, this area
is guaranteed to be greater than the area of the red-shaded area in Figure 2.
The purple-shaded region in Figure 3 is a union of rectangles; the rectangle
between $x = n$ and $x = n + 1$ has width 1 and height $f(n)$. Since the function
$f(x)$ is decreasing, the area of this rectangle is guaranteed to be greater than
the area under the curve $f(x)$ between $n$ and $n + 1$, and thus the area of the
entire purple-shaded region in Figure 3 is greater than the red-shaded region
in Figure 2. In other words, we have
$$\sum_{n=N}^{M-1} f(n) > \int_N^M f(x)\, dx \tag{4}$$
If we instead take the rectangle between $x = n$ and $x = n + 1$ to have height
$f(n + 1)$ instead of height $f(n)$, we obtain the green-shaded region depicted
in Figure 4 below.
Figure 4: The sum $\sum_{n=N}^{M-1} f(n + 1)$ gives the area of the shape
consisting of the green shaded rectangles. Since $f(x)$ is monotonically
decreasing, this area is guaranteed to be less than the area of the red-shaded
area in Figure 2.
In Figure 4, the area of the rectangle between $x = n$ and $x = n + 1$ is
guaranteed to be less than the area under the curve $f(x)$ between $n$ and
$n + 1$, and thus the area of the entire green-shaded region in Figure 4 is
less than the red-shaded region in Figure 2. In other words, we have
$$\sum_{n=N}^{M-1} f(n + 1) < \int_N^M f(x)\, dx \tag{5}$$
which we could alternatively write in the equivalent form
$$\sum_{n=N+1}^{M} f(n) < \int_N^M f(x)\, dx. \tag{6}$$
Inequality (6) is the one that will be useful for our purposes. Taking
$M \to \infty$, the sum on the LHS is just the quantity that enters the
definition of the error (3), and hence we find
$$E_N = S - S_N < \int_N^{\infty} f(x)\, dx. \tag{7}$$
(We have dropped the absolute value signs from (3) because $f(x)$ is positive,
which means $S - S_N$ is always positive.)
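Here is a quick numerical illustration of (7) for the example used in the figures, $f(x) = 1/x^2$. For this summand the exact value $S = \pi^2/6$ is known and the bound is $\int_N^\infty dx/x^2 = 1/N$. The helper names below are our own.

```julia
f(x) = 1 / x^2
partial_sum(N) = sum(f(n) for n in 1:N)
S_exact = π^2 / 6                      # known closed form of the full sum

for N in (10, 100, 1000)
    E = S_exact - partial_sum(N)       # true error E_N
    bound = 1 / N                      # the integral of 1/x^2 from N to infinity
    println("N = ", N, "   E_N = ", E, "   bound = ", bound)
end
```

In each case the true error is positive and slightly smaller than the bound, as (7) requires.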
Application: Binding energy of a 1D ionic solid

Earlier we considered the sum
$$S = -\sum_{n=1}^{\infty} \frac{(-1)^n}{n}.$$
The method we discussed earlier cannot be applied to this sum as it stands
because the summand is not positive and monotonically decreasing. To rectify
this situation, we rewrite the sum as follows:
$$S = \sum_{n=1}^{\infty} \left( \frac{1}{2n - 1} - \frac{1}{2n} \right)
= \sum_{n=1}^{\infty} \frac{1}{2n(2n - 1)}.$$
Now we have a positive and monotonically decreasing summand, so we can
apply (7) to estimate the error in the $N$th partial sum:
$$E_N = S - S_N < \int_N^{\infty} \frac{dx}{2x(2x - 1)}
= -\frac{1}{2} \log\left( 1 - \frac{1}{2N} \right)$$
To find the value of $N$ at which the partial sum becomes correct to 6 digits,
we ask for $E_N$ to be less than $10^{-6}$ times the exact value of the sum,
$\log 2$:
$$-\frac{1}{2} \log\left( 1 - \frac{1}{2N} \right) < 10^{-6} \log 2
\quad \Longrightarrow \quad N > 360{,}674.$$
This corroborates our earlier finding that the 6th digit of the sum stabilized
somewhere between $N = 10^5$ and $N = 10^6$.
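The threshold quoted above can be recovered by solving the inequality for $N$ in closed form: $-\frac{1}{2}\log(1 - \frac{1}{2N}) < \epsilon \log 2$ rearranges to $N > 1/\big(2\,(1 - e^{-2\epsilon \log 2})\big)$. A small sketch (the name `terms_needed` is ours):

```julia
# Value of N above which the bound -(1/2) log(1 - 1/(2N)) is smaller than tol * log(2).
function terms_needed(tol)
    a = 2 * tol * log(2)
    return 1 / (2 * (-expm1(-a)))   # expm1(z) = exp(z) - 1, accurate for tiny z
end

println(terms_needed(1e-6))         # close to 360,674
println(terms_needed(1e-3))         # far fewer terms for 3-digit accuracy
```

To leading order this is just $N \approx 1/(4\epsilon \log 2)$, which makes the cost of each extra digit explicit: ten times as many terms.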
Estimating the error on the fly

In this case, we estimated the relative error by dividing the absolute error by
the known value of the exact solution. In general, of course, we won't know a
priori the exact value of the sum we are computing (otherwise we wouldn't be
computing it). So how do we estimate the relative error during the course of a
calculation?
Easy: just divide by the current partial sum (that is, our best current
approximation to the exact solution) instead of dividing by the exact solution.
For a positive, monotonically decreasing summand, the condition $S_N < S$ is
guaranteed to be satisfied for any $N$. This means that errors measured relative
to $S_N$ are always larger than errors relative to $S$. In other words, for all
$N$ we have
$$\frac{E_N}{S_N} > \frac{E_N}{S}$$
so $E_N / S_N$ gives us an upper bound on the true relative error. (Moreover, in
the later stages of a calculation the difference between $S_N$ and $S$ is small,
so it is a tight upper bound.)
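The recipe above translates directly into a stopping criterion. The sketch below applies it to the rewritten binding-energy sum $\sum 1/(2n(2n-1))$, using the integral bound from (7) in place of the unknown $E_N$; the function name `sum_until_converged` and the safety cap `maxterms` are illustrative choices.

```julia
# Sum 1/(2n(2n-1)) until the integral bound on E_N, divided by the current
# partial sum S_N, falls below the requested relative tolerance.
function sum_until_converged(tol; maxterms=10^8)
    S = 0.0
    for N in 1:maxterms
        S += 1 / (2N * (2N - 1))
        bound = -0.5 * log1p(-1 / (2N))   # upper bound on E_N; log1p(z) = log(1 + z)
        if bound / S < tol
            return S, N
        end
    end
    error("no convergence within maxterms terms")
end

S, N = sum_until_converged(1e-6)
println("stopped at N = ", N, "   S_N = ", S)
println("true relative error = ", (log(2) - S) / log(2))
```

The loop never consults the exact answer $\log 2$; it is used only in the final line to confirm that the true relative error is indeed below the tolerance.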
The Euler-Maclaurin Formula

Inequality (6), which we may write in the form
$$\sum_{n=N+1}^{M} f(n) - \int_N^M f(x)\, dx < 0 \tag{8}$$
is a fairly crude result: It only holds for monotonically decreasing functions,
and it really only expresses a particularly obvious geometric statement.
It turns out that it is possible to refine (8) quite dramatically: We can
relax the constraint that $f(x)$ be positive and monotonically decreasing, and we
can sharpen the inequality into an equality. The result is the Euler-Maclaurin
summation formula, which reads
$$\sum_{n=N+1}^{M} f(n) - \int_N^M f(x)\, dx
= C_0 \Big[ f(M) - f(N) \Big]
+ \sum_{p=1}^{\infty} C_p \Big[ f^{(2p-1)}(M) - f^{(2p-1)}(N) \Big] \tag{9}$$
where $f^{(k)}$ is the $k$th derivative of $f$ and the $C_p$ coefficients decay
rapidly with $p$:
$$C_0 = \frac{1}{2}, \quad C_1 = \frac{1}{12}, \quad C_2 = -\frac{1}{720}, \quad
C_3 = \frac{1}{30240}, \quad C_4 = -\frac{1}{1209600}.$$
In contrast to equation (8), equation (9) holds for general smooth functions $f$,
not just functions that are positive and monotonically decreasing.
The Euler-Maclaurin formula is amazing: It says that the difference between
the sum and integral may be expressed entirely in terms of the behavior at the
endpoints. The formula is used extensively by number theorists, who use it to
evaluate sums in terms of integrals (which are generally much easier to compute).
The Euler-Maclaurin summation formula is somewhat tedious to derive, and
since we won't really use it much in this class we will skip the derivation. (It
is derived in many older numerical analysis books, including Stoer & Bulirsch.)
However, we want to call your attention to one important property: The error
term on the RHS of (9) depends only on the difference between the behavior
of $f$ at $N$ and the behavior of $f$ at $M$. This means, in particular, that if
$f$ is periodic over the interval $[N, M]$ then the entire error term vanishes!
This is our first brush with a general principle of 18.330: amazing magical
things happen when we work with periodic functions. We will encounter this
phenomenon in several places through the remainder of the course.
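To see how quickly the endpoint corrections converge in practice, the sketch below tests the standard form of the Euler-Maclaurin formula, in which the first three corrections are $\frac{1}{2}[f(M) - f(N)]$, $\frac{1}{12}[f'(M) - f'(N)]$ and $-\frac{1}{720}[f'''(M) - f'''(N)]$, on the example $f(x) = 1/x^2$ with $[N, M] = [10, 20]$. The derivatives are entered by hand, and all names are illustrative.

```julia
f(x)   = 1 / x^2
df(x)  = -2 / x^3      # first derivative of f
d3f(x) = -24 / x^5     # third derivative of f

N, M = 10, 20
# Sum minus integral; the integral of 1/x^2 from N to M is 1/N - 1/M.
lhs = sum(f(n) for n in N+1:M) - (1 / N - 1 / M)
# First three endpoint corrections of the Euler-Maclaurin formula.
rhs = (f(M) - f(N)) / 2 + (df(M) - df(N)) / 12 - (d3f(M) - d3f(N)) / 720

println("sum - integral       = ", lhs)
println("endpoint corrections = ", rhs)
println("difference           = ", lhs - rhs)
```

With only three correction terms the two sides already agree to about eight decimal places.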

Invitation to Numerical Analysis
Homer Reid
February 4, 2014
Suppose you're a material scientist. (Or a physicist, or a chemist, or a
biologist, or an electrical engineer, or a structural engineer, etc. The problems
we'll consider are pretty general.) You're in the lab at 4:30 on a Friday
afternoon when your boss asks you to do a quick calculation, for which she needs
an answer accurate to 6 decimal places. The question you face is: How quickly can
you get six digits and get out of the lab to enjoy your weekend?
Contents
1 Binding energy of a one-dimensional ionic solid
  Numerical evaluation of infinite sums
2 Electrostatic potential near a 1D ionic solid
  Ewald summation and a first glimpse at spectral methods
3 Electrostatic potential near a continuous 1D ionic solid
  Numerical integration
4 Electric field of a 1D Solid
  Numerical differentiation
5 Motion of a charged DNA strand near a 1D solid
  Integration of ordinary differential equations
6 Equilibrium points near a 1D Ionic Solid
  Numerical root-finding
7 Connecting the dots
  Numerical interpolation
8 A smattering of other problems we'll discuss in 18.330
Figure 1: A one-dimensional ionic solid consisting of alternating positive and
negative ions (with charges $\pm Q$) and ion-ion separation distance $D$. To
estimate the binding energy per ion, we compute the electrostatic energy of
interaction between the centermost ion (at $x = 0$) and all other ions in the
chain.
Binding energy of a one-dimensional ionic solid
Consider the one-dimensional ionic solid depicted in Figure 1. This is a simple
chain of ions, with charges that alternate between $+Q$ and $-Q$, separated from
each other by a distance $D$. Your boss has asked you to compute the binding
energy per ion of this solid. Think of this as the energy required to pull one ion
out of the chain and move it infinitely far away. Recall from basic electrostatics
that the energy of two charges $q_1$ and $q_2$, separated by a distance $d$, is,
in appropriate units,¹
$$E = \frac{q_1 q_2}{d}. \tag{1}$$
Let's compute the total energy experienced by the ion at the origin in our chain.
This involves summing (1) to account for the contributions of all the other
ions. For example, the contribution of the negatively charged ion at $x = 3D$
is $-Q^2/3D$. Summing all such contributions, the energy felt by the ion at the
origin is
$$E = \sum_{\substack{n=-\infty \\ n \neq 0}}^{\infty} \frac{(-1)^n Q^2}{|n| D}$$
where we have excluded the self-interaction of the ion at the origin. Since the
ion at site $-n$ makes the same contribution to the sum as its counterpart at
site $+n$, we can restrict the sum to just the latter contributions and double
the result. (We can also pull the factor $Q^2/D$ out of the sum.) So we have
$$E = \frac{2Q^2}{D} \sum_{n=1}^{\infty} \frac{(-1)^n}{n}$$
which we can write as
$$E = \frac{2Q^2}{D}\, S \tag{2}$$
where we defined
$$S \equiv \sum_{n=1}^{\infty} \frac{(-1)^n}{n}. \tag{3}$$
Our question now becomes: How do we obtain a numerical value for the sum $S$?
¹ You may have seen this formula written to look something more like
$E = \frac{q_1 q_2}{4\pi\epsilon_0 d}$. Think of (1) as a version of this formula
in which we have absorbed the factor of $4\pi\epsilon_0$ into our units of
electric charge, so that it does not appear explicitly in our formulas.
A bad idea
In mathematics we are often encouraged to adopt divide-and-conquer approaches
to difficult problems. Heeding this advice, we might try to split the alternating
sum in (3) into two non-alternating sums: the first sum will add up the
contributions of just the positive ions [i.e. just the even-$n$ terms in (3)],
while the second sum will handle just the negative ions (the odd-$n$ terms). In
symbols this looks like
$$S = S_+ - S_- \tag{4}$$
$$S_+ = \sum_{n=1}^{\infty} \frac{1}{2n}, \qquad S_- = \sum_{n=1}^{\infty} \frac{1}{2n - 1} \tag{5}$$
Surely it will be easier to evaluate and combine two non-alternating sums than
a single alternating sum, right? Does this approach save us time? Do we get
out of the lab any faster? No. In fact, each of the two individual sums in (5) is
divergent, so equation (4) winds up looking like
$$S = \infty - \infty.$$
Try extracting six good digits out of that!
A better idea
A more promising approach is to use a computer to compute the partial sums
$S_N$, where
$$S_N = \sum_{n=1}^{N} \frac{(-1)^n}{n}. \tag{6}$$
In the limit $N \to \infty$, $S_N$ approaches the number we are trying to compute.
Of course, our computer can't actually sum an infinite number of terms, but
we might evaluate $S_N$ for some large $N$ (say $N = 10^6$), and then again for
some larger $N$ (say $N = 10^7$), and keep going until the first 6 digits of $S_N$
stop changing (they "converge"), whereupon we take those digits as our
approximate evaluation of $S$.
Here's a little computer program that executes this strategy. (In this course
you will find yourself writing many little computer programs like this.) This
program is written in the julia language, but it would look almost the same in
any other programming language.
    N = 10000
    Sum = 0.0
    # Add the terms of (6) two at a time: the odd term -1/n and the even term +1/(n+1)
    for n = 1:2:N
        Sum += -1/n + 1/(n+1)
    end
    Sum
Running this program in julia yields the following output:
    julia> Sum
    -0.6930971830599458
What do we do with this number? Is it correct? Do we report it to our
boss and take off for the weekend? Before we know what to make of this
number (or any number emitted by a numerical code) we had better make
sure we understand how accurate it is. Let's run the above code again for various
different values of the $N$ parameter and see what we get.
  N        $S_N$
  10^4     -0.693097183059946
  10^5     -0.693142180584944
  10^6     -0.693146680560232
  10^7     -0.693147130559867
  10^8     -0.693147175473699

Table 1: Partial sums $S_N$ [equation (6)] for various values of $N$.
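Table 1 can be regenerated by wrapping the program above in a function and looping over $N$. This is a sketch: `partial_sum` is our own name, and the last printed column uses the fact (discussed below) that the limit of the sum is $-\log 2$, so that the approach to the limit can be watched directly.

```julia
# Partial sum S_N of equation (6), accumulated two terms at a time
# exactly as in the program above (N is assumed to be even).
function partial_sum(N)
    total = 0.0
    for n in 1:2:N
        total += -1 / n + 1 / (n + 1)
    end
    return total
end

for p in 4:6
    N = 10^p
    S = partial_sum(N)
    println("N = 10^", p, "   S_N = ", S, "   S_N + log(2) = ", S + log(2))
end
```

The last column shrinks by a factor of about 10 each time $N$ grows by a factor of 10, which is the slow $1/N$ convergence that the rest of this section sets out to beat.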

Based on these results, it looks like the 6th digit of $S_N$ is stabilizing
somewhere around $N = 10^6$. So we have to sum roughly the first one million
terms in our series to get our requisite 6 digits. Plugging this number into (2),
we can report to our boss that the binding energy of our 1D ionic solid is
approximately²
$$E \approx -\frac{2Q^2}{D} \cdot 0.693147$$
² If, for example, we have $Q$ = proton charge and $D$ = 3 Å (both typical values
for ionic solids), the quantity $Q^2/D$ works out to about 4.80 electron volts,
so we find $|E| \approx 6.65$ eV. This is quite a large binding energy (i.e. our
crystal is quite tightly bound). For comparison, at room temperature the average
thermal energy is around 0.026 eV, some 250 times smaller. We would have to heat
this solid to a temperature of roughly 77,000 kelvin before thermal energy would
suffice to dislodge an ion from the chain. Of course, this discussion of binding
energy in ionic crystals is somewhat simplified; for a fuller picture, see e.g.
Chapter 20 of Ashcroft and Mermin, Solid State Physics.
Some questions raised by this example

Although computers are fast, summing a million terms still seems like a lot of
work to have to do. Some questions that this example immediately presents are
1. What determines the threshold value of $N$ needed to yield 6-digit accuracy
in $S_N$? (In this case we found $N \approx 10^6$; what about for other summands?)
2. Is there anything we can do to speed the convergence of the sum?
We will answer question (1) in the first week of 18.330. As for question (2),
we will hint at an answer in the next section of these notes, and will discuss it
in full glory in the second unit of 18.330.
An even better idea... which sadly only works for this special problem
Before leaving this problem, we should pause to point out a curiosity. The values
of the partial sums in Table 1 above seem to be converging toward a familiar
number. Do you recognize the number 0.693147...? It is actually nothing but
the natural logarithm of 2. To understand how this special number arises as
the value of our binding-energy problem, consider the Taylor-series expansion
of the function³ $\log(1 + x)$ about $x = 0$:
$$\log(1 + x) = x - \frac{1}{2} x^2 + \frac{1}{3} x^3 - \cdots
= -\sum_{n=1}^{\infty} \frac{(-1)^n}{n}\, x^n.$$
Inserting $x = 1$ into both sides of this equation and flipping the signs yields
$$-\log 2 = \sum_{n=1}^{\infty} \frac{(-1)^n}{n}.$$
The sum on the RHS here is just what we called $S$.
In this case we obtained a beautiful, exact, closed-form expression for the
quantity we were trying to compute. Unfortunately, such pleasant coincidences
almost never happen in numerical analysis. Moreover, even in cases where
we are able to attach a concise symbolic name to a quantity of interest, this
identification may not be of much use when it comes to obtaining a numerical
value for the quantity. (Indeed, if I asked you to compute the first 6 digits of
$\log(2)$, how would you proceed? You might well wind up summing the first
$10^6$ terms of the Taylor series!)
³ Some authors reserve the symbol "log" for the base-10 logarithm and use the
symbol "ln" for the natural (base-$e$) logarithm. I use both log and ln
interchangeably for the natural logarithm, and denote the base-10 logarithm by
$\log_{10}$.
Electrostatic potential near a 1D ionic solid
Now that you've computed the binding energy of the 1D solid, your boss asks
you to consider a slightly different question: Suppose a charged particle (such
as a charged strand of DNA) finds itself at a point $\mathbf{r}$ in the vicinity
of our solid. What potential energy does the DNA feel? In other words, what is
the electrostatic potential at a point $\mathbf{r}$ near the chain of ions?
Figure 2: We want to compute the electrostatic potential at the point $\mathbf{r}$.
To be concrete, suppose $\mathbf{r}$ lies in the $xy$ plane with coordinates
$\mathbf{r} = (x, y)$. The distance from $\mathbf{r}$ to the $n$th ion in our
chain is $d = \sqrt{(x - nD)^2 + y^2}$, and the electrostatic potential at
$\mathbf{r}$ is the sum of contributions from all ions in the chain:
$$\phi(\mathbf{r}) = \phi(x, y) = \sum_{n=-\infty}^{\infty} \frac{(-1)^n Q}{\sqrt{(x - nD)^2 + y^2}}.$$
For convenience in what follows, let's choose to work with units of charge and
distance such that $Q = D = 1$. Then our sum reads simply
$$\phi(x, y) = \sum_{n=-\infty}^{\infty} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}} \tag{7}$$
Note that we are now talking about evaluating a function of $x$ and $y$ instead
of just a single number as in the previous section. This means our evaluation of
the sum must be efficient, since we will probably need to evaluate it for many
points $(x, y)$.
The slow way: brute-force evaluation

It's easy to modify the simple julia code above to compute the sum (7), and if
we don't need high accuracy then such an approach is adequate. For example,
Figure 3 plots $\phi(x, y)$ for values of $x$ between 0 and 2, with $y$ fixed at
the value $y = 0.1$; to produce a plot like this we really only need to evaluate
$\phi$ to roughly 2-digit accuracy.
Figure 3: A plot of the electrostatic potential $\phi(x, y)$ near the ionic solid
over the interval $0 \le x \le 2$ with $y$ fixed at $y = 0.1$.
However, the brute-force approach becomes costly when we need high accuracy. For
example, at the point $(x, y) = (0.25, 0.25)$ we must sum on the order of $10^6$
terms to get 6-digit accuracy, as shown below:
Convergence of $\phi(x, y)$ (brute-force summation) for $(x, y) = (0.25, 0.25)$

  n        nth term in sum             $\phi$ after n terms
  1        -2.0493756046200877         0.7790515201261021
  2        +1.007411529248624          1.7864630493747262
  3        -0.6689289797090223         1.117534069665704
  4        +0.500964129900977          1.618498199566681
  5        -0.40049593015285234        1.2180022694138286
  ...
  799998   +2.500006250015747e-6       1.3985540298633128
  799999   -2.500003125004028e-6       1.3985515298601878
  800000   +2.5000000000001218e-6      1.3985540298601877

Table 2: Convergence of brute-force summation for $\phi(0.25, 0.25)$.
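A brute-force evaluator along these lines might look as follows. This is a sketch under one assumption inferred from the numbers in Table 2: the "$n$th term" there appears to combine the contributions of the two ions at $+n$ and $-n$, with the $n = 0$ ion included from the start. The function name `phi_brute_force` is ours.

```julia
# Brute-force evaluation of the sum (7) in units where Q = D = 1.
# The ions at +n and -n are added together; N is the largest |n| retained.
function phi_brute_force(x, y, N)
    total = 1 / sqrt(x^2 + y^2)                 # contribution of the n = 0 ion
    for n in 1:N
        sgn = iseven(n) ? 1.0 : -1.0            # (-1)^n
        total += sgn * (1 / sqrt((x - n)^2 + y^2) + 1 / sqrt((x + n)^2 + y^2))
    end
    return total
end

for N in (1, 5, 800_000)
    println("N = ", N, "   phi = ", phi_brute_force(0.25, 0.25, N))
end
```

Because the terms alternate in sign and shrink only like $2/n$, the error after $N$ terms is of order $1/N$, which is why six digits cost roughly a million terms.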

Even on a fast computer, summing millions of numbers may take milliseconds
or even seconds. If that doesn't seem like a long time, suppose your boss wants
the function tabulated on a 1000 × 1000 grid of $(x, y)$ points to generate a 3D
image of the potential energy surface. At 1 second per point, the calculation
your boss is requesting will take you 278 hours! That tap-click-tap sound you
hear is you texting your friends to cancel your weekend plans.
The fast way: Ewald Summation

Can we get 6 good digits for expression (7) without summing a million terms?
It turns out we can, by using a technique known as Ewald summation, and the
principle at work here illustrates a key 18.330 concept.
The basic idea is to break the sum into two pieces: one representing the
contribution of the ions nearest the evaluation point $\mathbf{r}$, and a second
piece accounting for the contribution of distant ions:
$$\phi(x, y) = \phi^{\text{nearby}}(x, y) + \phi^{\text{distant}}(x, y) \tag{8}$$
$$\phi^{\text{nearby}}(x, y) \equiv \sum_{|n| \le N} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}}, \qquad
\phi^{\text{distant}}(x, y) \equiv \sum_{|n| > N} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}} \tag{9}$$
where $N$ defines the threshold between nearby and distant ions; ions further
than $N$ sites away from the origin are considered distant. For example, here are
plots of $\phi^{\text{nearby}}$ and $\phi^{\text{distant}}$ for $N = 20$:
Figure 4: Contributions of (a) nearby and (b) distant ions to the potential
plotted in Figure 3. Note the different y-axis scales.
The plot of nearby , which is extremely cheap to calculate (it involves a sum
of just 41 terms), looks to the naked eye indistinguishable from the plot of the
full sum in Figure 3. This appearance is deceptive; as you can see from the
lower plot in Figure 2, the contribution of distant is relevant already to the 2nd
or 3rd digit of the full sum, so for 6-digit accuracy we must certainly include
this expensive-to-calculate contribution.
However, the lower plot in Figure 2 tells us something else that is interesting:
10
distant does not vary rapidly over the interval in question. Indeed, over this
interval distant is monotonic, and its value changes by less than 1%. (This is in
contrast to the behavior of nearby , which exhibits hair-raising curves and deathdefying dips over the interval.) This means that distant will have a compact
Fourier representation that is, only a small number of terms in its Fourier
series will be relevant and by going over to Fourier space we can convert the
expensive real-space sum in (9) into a Fourier-space sum whose evaluation is no
more costly than that of distant .
To summarize,
• $\phi^{\text{nearby}}$ is a rapidly varying function of x, but one which we can compute
  rapidly in real space;
• $\phi^{\text{distant}}$ is costly to evaluate in real space, but is slowly varying, which
  means we can compute it rapidly in Fourier space.
The upshot is that the sum (7), for which naive evaluation requires summing
millions of terms to yield 6-digit accuracy, is replaced by the two sums in (9),
each of which requires summing just a few terms to get 6-digit accuracy.
Convergence of Ewald Summation
The cursory sketch of Ewald summation that we presented above was slightly
cavalier; in particular, the simple definitions of $\phi^{\text{nearby}}$ and $\phi^{\text{distant}}$ that we gave
in equation (9) were somewhat oversimplified (what's missing is a windowing
function in place of the sharp cutoff). We will discuss these details later.
However, we can't resist giving you a sneak peek at the convergence rate
achieved by actual Ewald summation, to be compared to the slow convergence
visible in Table 2. In actual Ewald summation, the functions defined by equation
(9) are replaced by similar functions we'll here call $\phi^{\text{local}}$ and $\phi^{\text{remote}}$. The
former is defined as a sum of real-space contributions, while the latter is a sum
of Fourier-space contributions. The following tables, to be compared with Table
2, indicate the rates at which these sums converge.
Convergence of $\phi^{\text{local}}(x, y)$ for (x, y) = (0.25, 0.25)
(successive rows correspond to successive values of n)

  nth term in sum               $\phi^{\text{local}}$ after n terms
  -0.3893996144303278           1.3559522726396909
  +0.007629205898998581         1.3635814785386895
  -3.5342137585185434e-5        1.3635461364011043
  +2.8775270993747082e-8        1.3635461651763754
  -3.666105641950547e-12        1.3635461651727092
  +6.915955958575618e-17        1.3635461651727092
  -1.8760507978155556e-22       1.3635461651727092

Convergence of $\phi^{\text{remote}}(x, y)$ for (x, y) = (0.25, 0.25)
(successive rows correspond to successive values of n)

  nth term in sum               $\phi^{\text{remote}}$ after n terms
  0.03500661470136366           0.03500661470136366
  -1.304318427037022e-11        0.035006614688320475
  -3.445626248073877e-29        0.035006614688320475
The total potential as computed by Ewald summation is

  $\phi^{\text{local}}(0.25, 0.25)$  = 1.3635461651727092
  $\phi^{\text{remote}}(0.25, 0.25)$ = 0.0350066146883204
  $\phi(0.25, 0.25)$        = 1.3985527798610296

This number is accurate to machine precision and significantly more accurate
than the number we computed by summing 800,000 terms of the brute-force
sum, which was only correct to 6 digits.
Remark
Notice the progression of this computational example: We began with a
straightforward approach that, while theoretically sound and capable of yielding correct
answers, was not particularly sophisticated or powerful. Then we revisited the
problem from a deeper and more insightful perspective and found a dazzlingly
efficient solution. (Well, so far we have only sketched the solution; you'll have
to trust us for now that it actually does work.)
This example serves as a microcosm of our large-scale syllabus for 18.330. In
the first half of the course, "basic numerical calculus," we will discuss a number
of relatively straightforward approaches to the basic problems of numerical
analysis. These approaches will be theoretically sound and capable of yielding
correct answers, but will not be the most powerful techniques available. In the
second part of the course, "Fourier analysis and spectral methods," we will
revisit various problems from a deeper perspective and learn more powerful and
elegant techniques that yield greater efficiency and accuracy.
Electrostatic potential near a continuous 1D ionic solid
Next your boss announces that, instead of a discrete 1D chain of ions, she needs
to know the potential near a continuous 1D strip characterized by a line charge
density $\lambda(x)$. Think of this as a version of the discrete ion chain in which (1)
the ions all have different charges, not just $\pm Q$; and (2) the ions are all smushed
together (or, equivalently, we zoom out our perspective) so that we don't see
the individual contributions of each ion, but rather just a continuous charge
density. The strip has finite length L.
Figure 5: A charged strip of length L has linear charge density $\lambda(x)$. We want
to compute the electrostatic potential at the point r.
Computing the electrostatic potential at a point r now requires evaluating
a definite integral instead of a sum:
\[ \phi(x, y) = \int_{-L/2}^{L/2} \frac{\lambda(x')\,dx'}{\sqrt{(x - x')^2 + y^2}} \]
For example, consider a strip of length L = 20 with a charge density given by
\[ \lambda(x) = \cos x^2. \]
Suppose we want to evaluate the potential at the point (x, y) = (1, 1). Then
the integral we have to evaluate is
\[ \phi(1, 1) = \int_{-10}^{10} \frac{\cos x^2 \, dx}{\sqrt{(x - 1)^2 + 1^2}} \tag{10} \]
The integrand of this integral is plotted in Figure 6. It should be obvious that
the integral can't be evaluated in closed form. The question is: How do we
evaluate this integral to six-digit accuracy?
Figure 6: The integrand of the integral (10).
The rectangular rule

The simplest technique for evaluating definite integrals numerically is inspired
by the familiar geometric interpretation of the definite integral:
\[ I = \int_a^b f(x)\,dx = \text{area under the curve } f(x) \text{ between } a \text{ and } b. \tag{11} \]
To estimate the area of the geometric shape under the curve f(x), we approximate
it as a union of N rectangles. Each rectangle has base length $\Delta = \frac{b-a}{N}$.
The nth rectangle (n = 1, ..., N) has height $f(a + n\Delta)$ (where $x = a + n\Delta$ is the
x-coordinate of its right edge), so it has area $f(a + n\Delta)\,\Delta$. Thus the N-point
rectangular-rule evaluation of (11) is
\[ I^{\text{rect}}_N = \sum_{n=1}^{N} f(a + n\Delta)\,\Delta. \]
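The rectangular-rule sum above translates almost line for line into code. The following Python sketch follows the sum as written (n running from 1 to N) and applies it to the integral (10); the names `rect_rule` and `integrand` are illustrative, and the charge density is read here as cos(x^2), which is an assumption about the intended formula.

```python
import math

def rect_rule(f, a, b, N):
    """N-point rectangular rule: sum of f(a + n*delta) * delta for n = 1..N."""
    delta = (b - a) / N
    return sum(f(a + n * delta) for n in range(1, N + 1)) * delta

def integrand(x):
    """Integrand of (10), assuming the charge density cos(x^2)."""
    return math.cos(x * x) / math.sqrt((x - 1.0) ** 2 + 1.0)

# Estimate phi(1, 1) with N = 1000 rectangles on [-10, 10].
estimate = rect_rule(integrand, -10.0, 10.0, 1000)
print(estimate)
```

Each call costs N evaluations of the integrand, so the work grows in direct proportion to the number of rectangles.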
Here's a plot of $\phi(x, y = 0.1)$ computed by applying a rectangular rule with
N = 1000 to equation (10).
Figure 7: The function $\phi(x, y = 0.1)$ as computed by applying a 1000-point
rectangular rule to evaluate the integral (10).
Of course, a plot like Figure 7 doesn't give us much information on the
number of significant digits our rectangular rule is achieving. To investigate
this, let $\phi^{\text{rect}}_N$ be the result of using an N-point rectangular rule to evaluate
(10). Figure 8 plots the relative error in $\phi^{\text{rect}}_N$ as a function of N.
Figure 8: The relative error incurred by using an N -point rectangular rule to

evaluate the integral (10).
We learn two things from this plot.
• The rectangular rule works. It yields a decent estimate of the integral,
  and the accuracy of this estimate increases with increasing N.
• However, the accuracy increases very slowly with increasing N.
Indeed, the slope of the line in the log-log plot of Figure 8 is -1, i.e. the
error decreases in proportion to 1/N. We say that we have first-order
convergence. (Note that first-order convergence is not to be confused with linear
convergence, which sounds the same but means something totally different! We
will discuss convergence terminology later in the course.) This means that, if we
have evaluated our integral to 5 significant digits and we would like to improve
this to 6 digits, we have to do ten times as much work to achieve this ten-fold
error reduction. That clicking sound you now hear is your boss angrily tapping
her foot and glaring at you while you wait for this agonizingly slow calculation
to complete.
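First-order convergence is easy to observe on an integral whose exact value is known. The Python sketch below is an illustration only: it swaps in the test integral of exp(x) over [0, 1] (not the integral (10) from the text) and records the relative error of the rectangular rule as N grows by factors of ten.

```python
import math

def rect_rule(f, a, b, N):
    """N-point rectangular rule with samples at a + n*delta, n = 1..N."""
    delta = (b - a) / N
    return sum(f(a + n * delta) for n in range(1, N + 1)) * delta

exact = math.e - 1.0                       # integral of exp(x) over [0, 1]
errors = {}
for N in (10, 100, 1000, 10000):
    approx = rect_rule(math.exp, 0.0, 1.0, N)
    errors[N] = abs(approx - exact) / exact
    print(f"N = {N:6d}   relative error = {errors[N]:.3e}")
```

Each ten-fold increase in N should shrink the printed error by roughly a factor of ten, which is the slope of -1 seen on a log-log plot.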
Questions raised by this example

1. What happens if the range of integration is infinite? (Even the world's
fastest computer can't sum an infinite number of rectangles.)
2. What happens if the integrand has an integrable singularity? (Even the
world's smartest computer can't process an infinitely tall rectangle.)
3. Can we understand theoretically the fact that the error in the rectangular
rule decreases in proportion to 1/N?
4. Most importantly, are there other methods of numerical integration that
can improve on this convergence rate?
We will answer all of these questions in 18.330, first in a somewhat elementary
way during the first unit of the course, and later in a profound and elegant
way during the second unit of the course.
Sneak preview
To give you just a sneak preview of what to expect when we discuss more
sophisticated methods of numerical integration, Figure 9 compares the relative
integration error (the same quantity plotted in Figure 8) vs. number of function
samples for the rectangular rule discussed above and for the Clenshaw-Curtis
rule, a method of numerical integration that we will discuss in the second half
of the course. Already with just N = 150 samples the Clenshaw-Curtis rule
has converged to an error of $10^{-12}$, while (as we saw above) the rectangular
rule needed $N = 10^8$ samples just to achieve an error of $10^{-9}$! This dramatic
improvement in performance is the analog, for numerical integration, of the
performance improvement achieved by Ewald summation over brute-force
summation; it's another demonstration of the power of spectral methods.
An interesting property of the Clenshaw-Curtis rule (and, indeed, of many
sophisticated numerical integration strategies) is that it samples the function
at unevenly spaced points. The following figure shows the sample points used
by the 26-point rectangular and Clenshaw-Curtis rules to integrate a function
over the interval [-10, 10]. Note that the Clenshaw-Curtis points tend to cluster
near the endpoints of the interval, while the rectangular-rule points are evenly
spaced throughout the interval.
Figure 9: Like figure 8, but now comparing the performance of the N -point
rectangular rule to that of the N -point Clenshaw-Curtis rule.
Figure 10: The x points at which the function f (x) is sampled by the 26-point
rectangular and 26-point Clenshaw-Curtis rules.
Electric field of a 1D Solid
Now that we've delivered on our boss's request for values of the electrostatic
potential $\phi(x, y)$ in the vicinity of our 1D solid, suppose she needs answers to a
slightly different question: What is the electric field in the vicinity of our solid?
Recall from basic electrostatics that the component of the electric field in
the x direction is minus the partial derivative of the potential with respect to
x:
\[ E_x(x, y) = -\frac{\partial \phi(x, y)}{\partial x}. \tag{12} \]
(In this section, as before, we will keep y fixed, so we really only have functions
of a single variable x, and partial derivatives are equivalent to total derivatives.)
Of course, taking partial derivatives of functions is usually pretty easy, but
the difficulty in this case is that we don't have a closed-form expression for
the function $\phi$ in (12). Instead, what we essentially have is a "black box" for
computing $\phi$: We can give it any value of x we like, and it will give us back
a numerical value for $\phi$, but there's no expression to differentiate, so we can't
write down an expression for the derivative.
The standard way to differentiate a black-box function f(x) is called
finite-differencing. Recall the definition of the derivative of f at x:
\[ f'(x) \equiv \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta} \]
The idea of finite-differencing is to arrest the limiting process here and evaluate
the ratio on the RHS at some finite value of $\delta$. We call this quantity the
finite-difference approximation to the derivative at step size $\delta$:
\[ f'_{\text{FD}}(\delta; x) \equiv \frac{f(x + \delta) - f(x)}{\delta}. \]
As we compute this quantity for smaller and smaller values of $\delta$, the result
should approach the correct value of $f'$.
To test this out, let's look at how it behaves in a simple case: the function
$f(x) = x^2$. Of course, this is not a black box (we can differentiate it analytically
to find $f'(x) = 2x$), but let's pretend it's a black box and see how closely we
can reproduce the known result by finite-differencing. The following plot shows
the result of finite-differencing to estimate the derivative of $f(x) = x^2$ at the
point x = 2.
Figure 11: Relative error in the finite-difference approximation to the derivative
of the function $f(x) = x^2$ at the point x = 1.
So that seems to make perfect sense. The error is decreasing as $\delta$ decreases.
(More specifically, we appear to have first-order convergence.)
But suppose we get greedy. Although the above plot shows that we can get
7 good digits of $f'$ by taking $\delta \approx 10^{-7}$, suppose we want 12 or 13 good digits.
We can just decrease $\delta$ a little further, right? Let's see what happens.
Figure 12: Same as Figure 11, but now showing a wider range of the x axis.
Whoa?! What happened? Beyond a certain point the error appears to be
increasing for smaller and smaller values of $\delta$. For the smallest values of $\delta$, the
relative error is 1; that is, our approximation is off by 100%. What is going
on?
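The experiment behind Figures 11 and 12 can be reproduced in a few lines. The Python sketch below is illustrative only: the name `fd_derivative` is made up for this example, and it evaluates the forward difference of f(x) = x^2 at x = 1 (the point named in the caption of Figure 11) in ordinary double precision for step sizes from 1e-1 down to 1e-18.

```python
def fd_derivative(f, x, delta):
    """Forward finite-difference approximation to f'(x) at step size delta."""
    return (f(x + delta) - f(x)) / delta

def f(x):
    return x * x

x = 1.0
exact = 2.0 * x                            # known derivative of x^2
rel_errors = {}
for k in range(1, 19):
    delta = 10.0 ** (-k)
    approx = fd_derivative(f, x, delta)
    rel_errors[k] = abs(approx - exact) / abs(exact)
    print(f"delta = 1e-{k:02d}   relative error = {rel_errors[k]:.3e}")
```

For the very smallest step sizes, x + delta is stored as exactly x in double precision, so the difference in the numerator is zero and the estimate is off by 100%.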
Questions raised by this example

1. Can we understand theoretically the first-order convergence observed in
Figure 11?
2. Are there other numerical differentiation algorithms that exhibit better
convergence properties?
3. How is it possible that decreasing $\delta$ below a certain value actually winds up
increasing the error in our approximation?
Motion of a charged DNA strand near a 1D solid
OK, so now you've delivered to your boss an accurate way to compute the
electric field in the vicinity of the 1D solid. What kinds of things might she do
with this information? Here's one possibility. Suppose we place a little strand
of DNA, which we model as a point particle of charge q, at a point near the
solid. The electric field of the solid exerts a force on the particle, which causes
it to accelerate and start moving. What trajectory will it traverse?
To simplify the calculation initially, let's suppose we somehow fix the
y-coordinate of the DNA strand at some fixed value of y, say $y = y_0$ (for example,
perhaps the particle is constrained to move along a wire held parallel to the
solid at a distance of $y_0 = 0.1$ length units). The x-coordinate of the particle
will be a function of time, x = x(t), and our goal is to compute this function.
In the previous section we discussed techniques for computing the x-component
of the electric field at arbitrary points (x, y); for now let's forget about how we
compute this quantity and just denote it $E_x(x, y)$. Then the force on the DNA
strand is $F = qE_x(x, y_0)$, and this force is related to the acceleration of the
particle by Newton's second law of motion:
\[ m \frac{d^2 x(t)}{dt^2} = q E_x\big(x(t), y_0\big) \tag{13} \]
where m is the mass of the particle. This is an ordinary differential equation
for x(t) in terms of the function $E_x$.
When you studied differential equations, you learned some tricks (such as
the method of integrating factors) for computing analytical solutions of ODEs
like (13). Such tricks rely on being able to write down analytical expressions
for certain definite integrals involving the function on the RHS of the ODE.
But in this case there can be no question of applying these tricks, because we
don't even have an analytical expression for the function $E_x$ itself, much less
for definite integrals involving it.
Instead, we must proceed numerically. Our basic protocol for numerical
solution of equations like (13) will be something like this:
1. Given values of the position $x_0$ and velocity $\dot{x}_0$ at some starting time $t_0$,
use equation (13) to compute the acceleration $\ddot{x}_0$.
2. Using the velocity and acceleration at time $t_0$, approximately predict the
new position and velocity at $t_1 \equiv t_0 + \Delta$, where $\Delta$ is a small time step.
Call the new position and velocity $(x_1, \dot{x}_1)$.
3. Now repeat the process from step 1.
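The three-step protocol above can be sketched in Python as follows. This is a minimal illustration that uses the simplest possible prediction in step 2 (straight-line extrapolation over one time step, i.e. the explicit Euler method, which is an assumption of this sketch rather than something the text prescribes). The function names are made up, and `toy_field` is a hypothetical stand-in for the real black-box field $E_x$.

```python
import math

def integrate_trajectory(Ex, q, m, x0, v0, t0, dt, n_steps, y0=0.1):
    """Step m x'' = q Ex(x, y0) forward in time with explicit Euler steps."""
    t, x, v = t0, x0, v0
    trajectory = [(t, x, v)]
    for _ in range(n_steps):
        a = q * Ex(x, y0) / m              # step 1: acceleration from (13)
        x, v = x + dt * v, v + dt * a      # step 2: predict new position, velocity
        t += dt
        trajectory.append((t, x, v))       # step 3: repeat from the new state
    return trajectory

def toy_field(x, y):
    """Hypothetical stand-in for Ex: a linear restoring field."""
    return -x

traj = integrate_trajectory(toy_field, q=1.0, m=1.0, x0=1.0, v0=0.0,
                            t0=0.0, dt=1e-3, n_steps=1000)
t_end, x_end, v_end = traj[-1]
print(t_end, x_end, math.cos(t_end))       # exact solution for this toy field is cos(t)
```

With the toy field the exact trajectory is x(t) = cos(t), which gives a quick sanity check on the stepping loop before the expensive real field is plugged in.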
All of the ODE algorithms we discuss will be variations of this basic theme.
In contrast to topics such as numerical integration and interpolation, our
treatment of ODEs in the first half of the course will not need to be revisited after we
discuss spectral methods; the most widely used methods for integrating ODEs
fall into the category of "basic numerical calculus" and can be introduced already
in the first half of our course.
Boundary-value problems
The basic algorithm we just discussed for integrating ODEs started with the
position and velocity of the particle at a fixed time. For example, perhaps we
know that at time t0 the DNA particle is at position x0 with velocity v0 , and
we want to predict its future trajectory. This is an initial-value problem.
Here's a different sort of problem: Suppose we have the positions of the
particle at two times. For example, we measure experimentally that at time
$t_0$ it is at point $x_0$, while at time $t_1$ it is at point $x_1$; meanwhile, we don't
know the velocities at either point. Can we solve equation (13) to reconstruct
the trajectory followed by the particle in between the two times? This is a
boundary-value problem, and methods for solving it take on a rather different
form from methods for solving initial-value problems. (Indeed, the algorithm
discussed above for IVPs can't even get started for BVPs, because we don't
know the velocity at the starting point.) We will discuss both types of problems
in 18.330.
Equilibrium points near a 1D Ionic Solid
In previous sections we considered the motion of a charged particle constrained
to move along the line $y = y_0$ in the force field of the 1D ionic solid. We now
ask the following question: Which values of the x coordinate correspond to
equilibrium points, that is, points at which the force on the particle vanishes?
In other words, the problem we are considering is
Find x such that $E_x(x) = 0$.
This is a numerical root-finding problem.
Newton's Method
We will study several methods for solving numerical root-finding problems. One
method which is particularly simple to describe and which works well in many
cases is Newton's method. To find a root of a function f(x), Newton's method
goes like this:
1. Make an initial guess $x_1$ as to the location of the root.
2. Compute the two numbers $f(x_1)$ and $f'(x_1)$ (value and derivative of f at
$x_1$).
3. Set
\[ x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}. \]
This is our new guess for the location of the root.
4. Now repeat from step 2, with $x_2$ in place of $x_1$.
For example, suppose we apply Newton's method to find a root of the function
f(x) = tanh(x - 5). The exact root is at x = 5. If we start with an initial
guess of $x_1 = 4.4$ and repeatedly apply the simple algorithm described above,
we obtain the following sequence of numbers:

  n    $x_n$
  1    4.400000000000000
  2    5.154730677706086
  3    4.997518482593209
  4    5.000000010187351
  5    5.000000000000000

After just 4 applications of the method, we have computed our root to
16-digit accuracy!
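The iteration is short enough to try directly. The following Python sketch (the function name `newton` is illustrative) applies the update rule to f(x) = tanh(x - 5) starting from 4.4, using the derivative f'(x) = sech^2(x - 5) written as 1/cosh^2.

```python
import math

def newton(f, fprime, x1, n_iter):
    """Return the list [x1, x2, ...] produced by n_iter Newton updates."""
    xs = [x1]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))    # x_new = x - f(x) / f'(x)
    return xs

def f(x):
    return math.tanh(x - 5.0)

def fprime(x):
    return 1.0 / math.cosh(x - 5.0) ** 2   # sech^2(x - 5)

iterates = newton(f, fprime, 4.4, 4)
for n, xn in enumerate(iterates, start=1):
    print(n, f"{xn:.15f}")
```

The same function can be started from other initial guesses to explore how sensitive the iteration is to its starting point.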
Erratic convergence in Newton's Method

The difficulty with Newton's method is that it is exquisitely sensitive to the
initial guess. We can see this already for the simple problem considered above:
if f(x) = tanh(x - 5.0) and our initial guess is $x_1 = 0$, then the improved
estimate of the root after one iteration is
\[ x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 0 - \frac{\tanh(-5)}{\operatorname{sech}^2(-5)} = 5506.61643 \]
which is over 1,000 times further from the correct root than our starting guess!
Thus Newton's method must be used in practice with considerable care.
But which root does it find?

Another way in which the erratic convergence of Newton's method manifests
itself is when we use the method to find a root of a function which has multiple
roots, for example, a polynomial of degree 2 or higher. In this case, Newton's
method may not converge to a root and, if it does find a root, it can be very
hard to predict which root it will find.
One simple example is the nonlinear function $f(x) = x^2 - 1$, which has
the two roots $x = \pm 1$. In this case, the convergence behavior is relatively
straightforward: If we start with a positive initial guess (i.e. $x_1 > 0$), Newton's
method converges to the root x = +1. If we start with a negative initial guess
($x_1 < 0$), the method converges to the root x = -1.
Only slightly less simple is the nonlinear function $f(z) = z^2 + 1$, which has
the two complex roots $z = \pm i$. In this case, if we start Newton's method with a
complex-valued initial guess lying in the upper half-plane (i.e. an initial guess
with positive imaginary part), the method converges to the root +i, while if we
start with an initial guess lying in the lower half-plane the method converges
to -i. If we start with a real-valued initial guess then the method does not
converge at all!
The conclusions of the previous two paragraphs are plotted in Figure 13. In
these plots, each point z in the complex plane is assigned a color based on
the root to which Newton's method converges when started with initial guess
$z_1 = z$.
Figure 13: Convergence of Newton's method for roots of the polynomials
(a) $f(z) = z^2 - 1$ (top) and (b) $f(z) = z^2 + 1$ (bottom). In these plots, each point z in the
complex plane is assigned a color based on the convergence of Newton's method
when started with initial guess $z_1 = z$. In the upper plot, red (yellow) denotes
convergence to +1 (-1). In the lower plot, red (yellow) denotes convergence to
+i (-i).
Finally, consider the nonlinear function $f(z) = z^3 - 1$, which has the three
complex roots
\[ z = 1, \quad e^{2\pi i/3}, \quad e^{4\pi i/3}. \]
Based on the experience of Figure 13, we might expect the convergence plot for
Newton's method on this function to look like the complex plane divided up
into three sectors. This is not quite what happens:
Figure 14: Convergence of Newton's method for roots of the polynomial
$f(z) = z^3 - 1$. Grey, red, and yellow denote convergence to 1, $e^{2\pi i/3}$, $e^{4\pi i/3}$,
respectively. The darker the color, the more rapid the convergence.
We instead get a fractal, illustrating both the promise and peril of naive use
of numerical root-finding tools.
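A coarse version of the convergence plot for $z^3 - 1$ can be generated with a short script. The Python sketch below is illustrative only: the names `ROOTS` and `newton_root_index`, the iteration cap, the tolerance and the grid are all choices made for this example. It labels each starting point by the root that Newton's method ends up at.

```python
import cmath

# The three cube roots of unity: 1, exp(2*pi*i/3), exp(4*pi*i/3).
ROOTS = [1.0 + 0.0j, cmath.exp(2j * cmath.pi / 3), cmath.exp(4j * cmath.pi / 3)]

def newton_root_index(z, max_iter=100, tol=1e-12):
    """Return 0, 1 or 2 for the root reached from z, or None if none is reached."""
    for _ in range(max_iter):
        if z == 0:
            return None                    # derivative 3 z^2 vanishes; step undefined
        z = z - (z ** 3 - 1) / (3 * z ** 2)
        for k, root in enumerate(ROOTS):
            if abs(z - root) < tol:
                return k
    return None

# Print a coarse character map of the square [-2, 2] x [-2, 2].
for row in range(21):
    y = 2.0 - 0.2 * row
    line = ""
    for col in range(41):
        k = newton_root_index(complex(-2.0 + 0.1 * col, y))
        line += "?" if k is None else "ABC"[k]
    print(line)
```

Even on this coarse grid the regions labelled A, B and C interleave near the boundaries between sectors; a finer grid and a color map would give a picture in the spirit of Figure 14.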
Connecting the dots
Look back at Figure 3 for the potential of the 1D ionic solid as a function of
the x coordinate. Even with the acceleration techniques discussed in previous
sections, it may be quite time-consuming to compute $\phi$ at every point for which
we need a value. But a glance at Figure 3 suggests that perhaps we don't
need to calculate $\phi$ at such a dense grid of points; instead, perhaps we could
tabulate $\phi$ on a coarse grid of points, and then infer values at intermediate
points by somehow "connecting the dots" in some reasonable way. This is the
idea of interpolation.
To see how this works, suppose we have used our computational algorithms
to evaluate $\phi(x, y = 0.1)$ at 5 equally-spaced points between x = 0 and x = 2.
We'd like to draw a curve that runs through these points; by forcing the curve
to match $\phi$ exactly at these points, we hope to find that it approximates $\phi$ in
between those points.
An obvious choice for such a curve is a polynomial. Indeed, given any 5
points in the plane $(x_i, y_i)$, i = 1..5, with distinct x-coordinates, there is a unique
polynomial f(x) of degree at most 4 that runs through all the points. (More
generally, given any N such points there is a unique polynomial of degree at most
N - 1 running through them.) Leaving aside for the moment the question of how
we find this polynomial, let's look at how well it mimics the actual function we
are trying to replicate.
Figure 15: The green dots are values of the function $\phi(x, 0.1)$ from the previous
section evaluated at 5 evenly spaced points in the interval $x \in [0, 2]$. The red
curve is the unique 4th-degree polynomial running through the green dots. For
reference, the dotted curve shows the actual function $\phi(x, 0.1)$ that we are trying
to mimic with the red curve.
Well, the polynomial is not doing a particularly good job of matching the
behavior of the function in between the data points, but perhaps that's to be
expected for such a low-order approximation. Perhaps if we try again with a
higher-degree polynomial we'll have better luck? Let's try fitting an
eighth-degree polynomial through 9 data points.
Figure 16: Like the previous figure, but now showing the unique 8th-degree
polynomial running through 9 evenly-spaced function samples.
Hmmm. In at least some places, the red curve here seems to be doing a
slightly better job of replicating the dashed black curve than we saw previously.
However, there is a troubling spike near the edges of the interval in which the
polynomial deviates significantly from the function we're trying to approximate.
Does this mean again that we simply chose too low a degree? Let's try again
with a polynomial of still higher degree.
Figure 17: Like the previous figure, but now showing the unique 14th-degree
polynomial running through 15 evenly-spaced function samples.
Well, it seems a pattern is emerging: The more we try to force a high-degree
polynomial to conform to some non-polynomial curve, the more the polynomial
bulges out in the regions between the data points, yielding an extremely
inaccurate interpolant. This is known as Runge's phenomenon.
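The growth of the interpolation error with degree can be checked numerically. Because the potential from the text is expensive to evaluate, the Python sketch below swaps in the classic test function 1/(1 + 25 x^2) on [-1, 1] (an assumption made for this illustration, not the function plotted in the figures) and interpolates it through 5, 9 and 15 evenly spaced samples using the Lagrange form of the interpolating polynomial. All function names are illustrative.

```python
def lagrange_interpolant(xs, ys):
    """Return p(x), the polynomial of degree len(xs)-1 through the points (xs, ys)."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def runge(x):
    """Stand-in test function for this sketch."""
    return 1.0 / (1.0 + 25.0 * x * x)

def max_error(n_points, n_test=2001):
    """Largest |p - f| on a fine grid, for n_points evenly spaced samples."""
    xs = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    p = lagrange_interpolant(xs, [runge(x) for x in xs])
    grid = [-1.0 + 2.0 * k / (n_test - 1) for k in range(n_test)]
    return max(abs(p(x) - runge(x)) for x in grid)

for n_points in (5, 9, 15):
    print(n_points, max_error(n_points))
```

For this test function the maximum error grows as more evenly spaced points are added, with the largest deviations near the ends of the interval.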
Questions posed by this example

1. Both the low-degree and high-degree polynomial interpolants we tried in
this example failed (in different ways) to furnish an accurate approximation of the underlying function between the data points. Why are
polynomials such a bad choice of interpolant in this case? Are there any
situations in which polynomial interpolation works well?
2. For the present case (and cases like it), how can we improve on polynomial
interpolation?
A smattering of other problems we'll discuss in 18.330
• Finite-difference approach to boundary-value problems.
• Richardson extrapolation.
• Evaluation of special functions.
• We'll understand the following bizarre and beautiful result:
\[ \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} \]
...as well as the following, perhaps equally beautiful but much more bizarre,
result:
\[ \sum_{n=-\infty}^{\infty} e^{-\pi n^2 x} = \frac{1}{\sqrt{x}} \sum_{n=-\infty}^{\infty} e^{-\pi n^2 / x} \]
• Bump functions. Here's a challenge question for you: Can you design
a single-variable function f(x) that simultaneously satisfies the following
two conditions?
1. f(x) must be everywhere continuous and infinitely differentiable.
2. f(x) must be identically zero except on a finite length of the real line.
(For example, f(x) must vanish identically for x outside the interval
[-1, 1].)
It is not even obvious that such a function can exist, much less how to
construct it, but we will dissect these mysteries in 18.330.

Machine Arithmetic: Fixed-Point and
Floating-Point Numbers
Homer Reid
March 4, 2014
Contents
1 Overview
2 Fixed Point Representation of Numbers
3 Floating-Point Representation of Numbers
4 The Big Floating-Point Kahuna: Catastrophic Loss of Numerical Precision
5 Other Floating-Point Kahunae
6 Fixed-Point and Floating-Point Numbers in Modern Computers
Overview
Consider an irrational real number like $\pi = 3.1415926535...$, represented by an
infinite non-repeating sequence of decimal digits. Clearly an exact specification
of this number requires an infinite amount of information. In contrast,
computers must represent numbers using only a finite quantity of information, which
clearly means we won't be able to represent numbers like $\pi$ without some error.
In principle there are many different ways in which numbers could be
represented on machines, each of which entails different tradeoffs in convenience and
precision. In practice, there are two types of representations that have proven
most useful: fixed-point and floating-point numbers. Modern computers use
both types of representation. Each method has advantages and drawbacks, and
a key skill in numerical analysis is to understand where and how the computer's
representation of your calculation can go catastrophically wrong.
The easiest way to think about computer representation of numbers is to
imagine that the computer represents numbers as finite collections of decimal
digits. Of course, in real life computers store numbers as finite collections of
binary digits. However, for our purposes this fact will be an unimportant
implementation detail; all the concepts and phenomena we need to understand
can be pictured most easily by thinking of numbers inside computers as finite
strings of decimal digits. At the end of our discussion we will cover the minor
points that need to be amended to reflect the base-2 reality of actual computer
numbers.
Fixed Point Representation of Numbers
The simplest way to represent numbers in a computer is to allocate, for each
number, enough space to hold N decimal digits, of which some lie before the
decimal point and some lie after. For example, we might allocate 7 digits to
each number, with 3 digits before the decimal point and 4 digits after. (We
will also allow the number to have a sign, ±.) Then each number would look
something like this, where each box stores a digit from 0 to 9:
  ± [ ][ ][ ] . [ ][ ][ ][ ]
Figure 1: In a 7-digit fixed-point system, each number consists of a string of 7
digits, each of which may run from 0 to 9.
For example, the number 12.34 would be represented in the form
  + [0][1][2] . [3][4][0][0]
Figure 2: The number 12.34 as represented in a 7-digit fixed-point system.
The representable set

The numbers that may be exactly represented form a finite subset of the real
line, which we might call $S_{\text{representable}}$ or maybe just $S_{\text{rep}}$ for short. In the
fixed-point scheme illustrated by Figure 1, the exactly representable numbers
are
$S_{\text{rep}}$ = {
  -999.9999
  -999.9998
  -999.9997
  ...
  -000.0001
  +000.0000
  +000.0001
  +000.0002
  ...
  +999.9998
  +999.9999
}
Notice something about this list of numbers: They are all separated by the
same absolute distance, in this case 0.0001. Another way to say this is that the
density of the representable set is uniform over the real line (at least between
the endpoints, $R^{\max}_{\min} = \pm 999.9999$): any two intervals of equal length
contain the same number of exactly representable fixed-point numbers. For
example, between 1 and 2 there are $10^4$ exactly-representable fixed-point numbers,
and between 101 and 102 there are also $10^4$ exactly-representable fixed-point
numbers.
Rounding error
Another way to characterize the uniform density of the set of exactly
representable fixed-point numbers is to ask this question: Given an arbitrary
real number r in the interval $[R_{\min}, R_{\max}]$, how near is the nearest
exactly-representable fixed-point number? If we denote this number by fi(r), then the
statement that holds for fixed-point arithmetic is:
\[ \text{for all } r \in \mathbb{R},\ R_{\min} < r < R_{\max},\ \exists\, \epsilon \text{ with } |\epsilon| \le \text{EPSABS} \text{ such that } \mathrm{fi}(r) = r + \epsilon. \tag{1} \]
In equation (1), EPSABS is a fundamental quantity associated with a given
fixed-point representation scheme; it is the maximum absolute error incurred in the
approximate fixed-point representation of real numbers. For the particular
fixed-point scheme depicted in Figure 1, we have EPSABS = 0.00005.
The fact that the absolute rounding error is uniformly bounded is
characteristic of fixed-point representation schemes; in floating-point schemes it is the
relative rounding error that is uniformly bounded, as we will see below.
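The bound in equation (1) can be checked with Python's standard `decimal` module. In the sketch below, the function name `fi` mirrors the notation of the text, while the constant names and the choice of round-half-even for ties are assumptions of this example; the text only says that numbers are rounded to the nearest representable value.

```python
from decimal import Decimal, ROUND_HALF_EVEN

QUANTUM = Decimal("0.0001")    # spacing of the 7-digit (3 + 4) fixed-point grid
EPSABS = Decimal("0.00005")    # half the spacing: the maximum rounding error

def fi(r):
    """Round r (a Decimal or a decimal string) to the nearest multiple of 0.0001."""
    return Decimal(r).quantize(QUANTUM, rounding=ROUND_HALF_EVEN)

r = Decimal(24) / Decimal(7)   # 3.428571428..., not exactly representable
rounded = fi(r)
error = abs(rounded - r)
print(rounded, error, error <= EPSABS)
```

The sketch does not check the range limits of the scheme; values beyond ±999.9999 would need separate handling.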
Error-free calculations
There are many calculations that can be performed in a fixed-point system with
no error. For example, suppose we want to add the two numbers 12.34 and
742.55. Both of these numbers are exactly representable in our fixed-point
system, as is their sum (754.89), so the calculation in fixed-point arithmetic
yields the exact result:
  012.3400 + 742.5500 = 754.8900
Figure 3: Arithmetic operations in which both the inputs and the outputs are
exactly representable incur no error.
We repeat that the computer representation of this calculation introduces
no error. In general, arithmetic operations in which both the inputs and outputs
are elements of the representable set incur no error; this is true for both
fixed-point and floating-point schemes.
Non-error-free calculations
On the other hand, here's a calculation that is not error-free.
  024.0000 / 007.0000 = 003.4286
Figure 4: A calculation that is not error-free. The exact answer here is
24/7 = 3.42857142857143..., but with finite precision we must round the answer
to the nearest representable number.
Overflow
The error in (4) is not particularly serious. However, there is one type of calculation that can go seriously wrong in a fixed-point system. Suppose, in the
calculation of Figure 3, that the first summand were 412.34 instead of 12.34.
The correct sum is
412.24 + 742.55 = 1154.89.
However, in fixed-point arithmetic, our calculation looks like this:
+
=
Figure 5: Overflow in fixed-point arithmetic.
The leftmost digit of the result has fallen off the end of our computer! This
is the problem of overflow: the number we are trying to represent does not fit
in our fixed-point system, and our fixed-point representation of this number is
not even close to being correct (154.89 instead of 1154.89). If you are lucky,
your computer will detect when overflow occurs and give you some notification,
but in some unhappy situations the (completely, totally wrong) result of this
calculation may propagate all the way through to the end of your calculation,
yielding highly befuddling results.
The problem of overflow is greatly mitigated by the introduction of floating-point arithmetic, as we will discuss next.
Floating-Point Representation of Numbers
The idea of floating-point representations is to allow the decimal point in Figure 1 to move around (that is, to float) in order to accommodate changes in the scale of the numbers we are trying to represent.
More specifically, if we have a total of 7 digits available to represent numbers, we might set aside 2 of them (plus a sign bit) to represent the exponent, that is, the order of magnitude. That leaves behind 5 boxes for the actual significant digits in our number; this portion of a floating-point number is called the mantissa. A general element of our floating-point representation scheme will then look like this:
Figure 6: A floating-point scheme with a 5-decimal-digit mantissa and a two-decimal-digit exponent.

For example, some of the numbers we represented above in fixed-point form
look like this when expressed in floating-point form:
$12.34 = 1.2340 \times 10^{1}$
$754.89 = 7.5489 \times 10^{2}$
Vastly expanded dynamic range

The choice to take digits from the mantissa to store the exponent does not come
without cost: now we can only store the first 5 significant digits of a number,
instead of the first 7 digits.
However, the choice buys us enormously greater dynamic range: in the number scheme above, we can represent numbers ranging from something like $10^{-103}$ to $10^{+99}$, a dynamic range of more than 200 orders of magnitude.
In contrast, in the fixed-point scheme of Figure 1, the representable numbers
span a piddling 7 orders of magnitude! This is a huge win for the floating-point
scheme.
Of course, the dynamic range of a floating-point scheme is not infinite, and there do exist numbers that are too large to be represented. In the scheme considered above, these would be numbers greater than something like $R_{\max} \approx 10^{100}$; in 64-bit IEEE double-precision binary floating-point (the usual floating-point scheme you will use in numerical computing) the maximum representable number is something closer to $R_{\max} \approx 10^{308}$. We are not being particularly precise in pinning down these maximum representable numbers, because in practice you should never get anywhere near them: if you are doing a calculation in which numbers on the order of $10^{300}$ appear, you are doing something wrong.
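As a quick check of the double-precision range quoted above, the following sketch queries the limits directly. It uses only the built-in functions `floatmax` and `floatmin`; the variable names are just for illustration.

```julia
# Largest finite and smallest positive normal Float64 values (IEEE double precision).
rmax = floatmax(Float64)   # on the order of 1e308
rmin = floatmin(Float64)   # on the order of 1e-308

println("floatmax(Float64) = ", rmax)
println("floatmin(Float64) = ", rmin)

# Going past the top of the range overflows to Inf instead of wrapping around.
println("2 * floatmax      = ", 2 * rmax)
```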
The representable set

Next notice something curious: The number of empty boxes in Figure 6 is the same as the number of empty boxes in Figure 1. In both cases, we have 7 empty boxes, each of which can be filled by any of the 10 digits from 0 to 9; thus in both cases the total number of representable numbers is something like $10^7$. (This calculation omits the complications arising from the presence of sign bits, which give additional factors of 2 but don't change the thrust of the argument.) Thus the sets of exactly representable fixed-point and exactly representable floating-point numbers have roughly the same cardinality. And yet, as we just saw, the floating-point set is distributed over a fantastically wider stretch of the real axis. The only way this can be true is if the two representable sets have very different densities.
In particular, in contrast to fixed-point numbers, the density of the set of exactly representable floating-point numbers is non-uniform. There are more exactly representable floating-point numbers in the interval [1, 2] than there are in the interval [101, 102]. (In fact, there are roughly the same number of exactly representable floating-point numbers in the intervals [1, 2] and [100, 200].)
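The non-uniform density can be observed directly in double precision. The sketch below is an illustration for binary Float64 rather than the 5-digit decimal scheme of the text; the helper `count_between` is a name introduced here, and it relies on the fact that consecutive positive IEEE floats have consecutive bit patterns.

```julia
# Spacing between adjacent Float64 values near 1 and near 101.
gap_near_1   = eps(1.0)     # distance from 1.0 to the next larger Float64
gap_near_101 = eps(101.0)   # distance from 101.0 to the next larger Float64

# For positive floats, the difference of the raw bit patterns counts how many
# representable steps separate two endpoints.
count_between(a::Float64, b::Float64) = reinterpret(Int64, b) - reinterpret(Int64, a)

n_1_2     = count_between(1.0, 2.0)
n_100_200 = count_between(100.0, 200.0)

println("gap near 1:   ", gap_near_1)
println("gap near 101: ", gap_near_101)
println("steps in [1, 2]:     ", n_1_2)
println("steps in [100, 200]: ", n_100_200)
```

The gap near 101 is 64 times the gap near 1, while the two intervals [1, 2] and [100, 200] contain the same number of steps.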
Some classes of exactly representable numbers
1. Integers. All integers in the range $[-I_{\max}, I_{\max}]$ are exactly representable, where $I_{\max}$ depends on the size of the mantissa. For our 5-decimal-digit floating-point scheme, we would have $I_{\max} = 99{,}999$. For 64-bit (double precision) IEEE floating-point arithmetic we have $I_{\max} \approx 10^{16}$.
2. Integers divided by 10 (in decimal floating-point)
3. Integers divided by 2 (in binary floating-point)
4. Zero is always exactly representable.
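A few of these classes can be probed in double precision. This is a minimal sketch using the built-in `maxintfloat`; the particular test values are chosen here for illustration only.

```julia
imax = maxintfloat(Float64)     # 2^53, the end of the contiguous range of exact integers
println(imax)
println(imax + 1.0 == imax)     # true: 2^53 + 1 is not representable and rounds back
println(0.5 + 0.25 == 0.75)     # true: small integers divided by powers of 2 are exact
println(0.1 + 0.2 == 0.3)       # false: 1/10 has no exact binary representation
println(0.0 == -0.0)            # true: zero is exact (and comes in two signs)
```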
Rounding error
For a real number r, let fl(r) be the real number closest to r that is exactly
representable in a floating-point scheme. Then the statement analogous to (1)
is
for all $r \in \mathbb{R}$, $|r| < R_{\max}$, $\exists\, \epsilon$ with $|\epsilon| \le \text{EPSREL}$ such that
$$ \mathrm{fl}(r) = r(1 + \epsilon) \tag{2} $$
where EPSREL is a fundamental quantity associated with a given floating-point representation; it is the maximum relative error incurred in the approximate floating-point representation of real numbers. EPSREL is typically known as machine precision (and often denoted $\epsilon_{\text{machine}}$ or simply EPS). In the decimal floating-point scheme illustrated in Figure 6, we would have EPSREL $\approx 10^{-5}$.
For actual real-world numerical computations using 64-bit (double-precision) IEEE floating-point arithmetic, the number you should keep in mind is EPSREL $\approx 10^{-15}$. Another way to think of this is: double-precision floating-point can represent real numbers to about 15 digits of precision. High-level languages like matlab and julia have built-in commands to inform you of the value of EPSREL on whatever machine you are running on:
julia> eps()
2.220446049250313e-16
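The statement (2) can be spot-checked for a single number. The sketch below compares the double nearest to 0.1 against a high-precision reference built with Julia's `BigFloat`; treating the 256-bit value as "exact" is an assumption of the sketch. Note that `eps()` is the spacing between 1.0 and the next double, so under round-to-nearest the relative rounding error is bounded by half of it.

```julia
u = eps(Float64)                 # spacing between 1.0 and the next Float64
println(1.0 + u > 1.0)           # true
println(1.0 + u / 2 == 1.0)      # true: the sum rounds back to 1.0

# big(0.1) converts the stored double exactly; big"0.1" parses the decimal
# string at BigFloat precision and serves as the reference value here.
relerr = abs(big(0.1) - big"0.1") / big"0.1"
println(Float64(relerr))
println(relerr <= u / 2)         # true under round-to-nearest
```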
The Big Floating-Point Kahuna: Catastrophic Loss of Numerical Precision
In the entire subject of machine arithmetic there is one notion which is so

important that it may be singled out as the most crucial concept in the whole
discussion. If you take away only one idea from our coverage of floating-point
arithmetic, it should be this one:
Never compute a small number as the difference
between two nearly equal large numbers.
The phenomenon that arises when you subtract two nearly equal floating-point
numbers is called catastrophic loss of numerical precision; to emphasize that
it is the main pitfall you need to worry about we will refer to it as the big
floating-point kahuna.
A population dynamics example

As an immediate illustration of what happens when you ignore the admonition above, suppose we attempt to compute the net change in the U.S. population during the month of February 2011 by comparing the nation's total population on February 1, 2011 and March 1, 2011. We find the following data:[1]
Date          US population (thousands)
2011-02-01    311,189
2011-03-01    311,356
Table 1: Monthly U.S. population data for February and March 2011.
These data have enough precision to allow us to compute the actual change
in population (in thousands) to three-digit precision:
$$ 311{,}356 - 311{,}189 = 167. \tag{3} $$
But now suppose we try to do this calculation using the floating-point system
discussed in the previous section, in which the mantissa has 5-digit precision.
The floating representations of the numbers in Table 1 are
$$ \mathrm{fl}(311{,}356) = 3.1136 \times 10^5 $$
$$ \mathrm{fl}(311{,}189) = 3.1119 \times 10^5 $$
[1] http://research.stlouisfed.org/fred2/series/POPTHM/downloaddata?cid=104
Subtracting, we find
$$ 3.1136 \times 10^5 - 3.1119 \times 10^5 = 1.7000 \times 10^2 \tag{4} $$
Comparing (3) and (4), we see that the floating-point version of our answer is 170, to be compared with the exact answer of 167. Thus our floating-point calculation has incurred a relative error of about $2 \times 10^{-2}$. But, as noted above, the value of EPSREL for our 5-significant-digit floating-point scheme is approximately $10^{-5}$! Why is the error in our calculation 2000 times larger than machine precision?
What has happened here is that almost all of our precious digits of precision
are wasted because the numbers we are subtracting are much bigger than their
difference. When we use floating-point registers to store the numbers 311,356
and 311,189, almost all of our precision is used to represent the digits 311,
which are the ones that give zero information for our calculation because they
cancel in the subtraction.
More generally, if we have $N$ digits of precision and the first $M$ digits of $x$ and $y$ agree, then we can only compute their difference to around $N - M$ digits of precision. We have thrown away $M$ digits of precision! When $M$ is
large (close to N ), we say we have experienced catastrophic loss of numerical
precision. Much of your work in practice as a numerical analyst will be in
developing schemes to avoid catastrophic loss of numerical precision.
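The population example can be replayed numerically. The sketch below imitates the 5-digit decimal mantissa by rounding to 5 significant digits with Julia's `round(x, sigdigits = 5)`; the helper name `fl5` is introduced here for illustration, and the simulation is only an approximation of a true decimal machine because the rounded values are themselves stored as binary doubles.

```julia
# Round to 5 significant decimal digits, mimicking the 5-digit mantissa of Figure 6.
fl5(x) = round(x, sigdigits = 5)

a, b = 311_356.0, 311_189.0        # populations in thousands (Table 1)
exact  = a - b                     # 167.0
approx = fl5(a) - fl5(b)           # 311360.0 - 311190.0
relerr = abs(approx - exact) / exact

println("exact  = ", exact)
println("approx = ", approx)
println("relerr = ", relerr)       # roughly 2e-2, far above the 1e-5 rounding level
```

Here N = 5 and the leading M = 3 digits agree, so only about N - M = 2 digits of the difference survive.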
In 18.330 we will refer to catastrophic loss of precision as the big floating-point kahuna. It is the one potential pitfall of floating-point arithmetic that you
must always have in the back of your mind.
The big floating-point kahuna in finite-difference differentiation

In our unit on finite-difference derivatives we noted that the forward-finite-difference approximation to the first derivative of $f(x)$ at a point $x$ is
$$ \frac{f(x+h) - f(x)}{h} \tag{5} $$
where $h$ is the stepsize. In exact arithmetic, the smaller we make $h$ the more closely this quantity approximates the exact derivative. But in your problem set you found that this is only true down to a certain critical stepsize $h_{\text{crit}}$; taking $h$ smaller than this critical stepsize actually makes things worse, i.e. increases the error between $f'_{\text{FD}}$ and the exact derivative. Let's now investigate this phenomenon using floating-point arithmetic. We will differentiate the simplest possible function imaginable, $f(x) = x$, at the point $x = 1$; that is, we will compute the quantity
$$ f'_{\text{FD}}(h, x) = \frac{f(x+h) - f(x)}{h} $$
for various floating-point stepsizes $h$.
Stepsize h = 2/3
First suppose we start with a stepsize of $h = \frac{2}{3}$. This number is not exactly representable; in our 5-decimal-digit floating-point scheme, it is rounded to
$$ \mathrm{fl}(h) = 0.66667 \tag{6a} $$
The sequence of floating-point numbers that our computation generates is now
f(x+h) = 1.6667
f(x) = 1.0000                    (6b)
f(x+h) - f(x) = 0.6667
and thus
$$ \frac{f(x+h) - f(x)}{h} = \frac{0.66670}{0.66667} \tag{6c} $$
The numerator and denominator here begin to differ in their 4th digits, so their ratio deviates from 1 by around $10^{-4}$. Thus we find
$$ f'_{\text{FD}}\left(h = \tfrac{2}{3},\, x = 1\right) = 1 + O(10^{-4}) \tag{6d} $$
and thus, since $f'_{\text{exact}} = 1$,
$$ \text{for } h = \tfrac{2}{3} \text{ the error in } f'_{\text{FD}}(h, x) \text{ is about } 10^{-4}. \tag{6e} $$
Stepsize h = 2/30
Now let's shrink the stepsize by a factor of 10 and try again. Like the old stepsize $h = 2/3$, the new stepsize $h = \frac{2}{30}$ is not exactly representable. In our 5-decimal-digit floating-point scheme, it is rounded to
$$ \mathrm{fl}(h) = 0.066667 \tag{7a} $$
Note that our floating-point scheme allows us to specify this $h$ with just as much precision as we were able to specify the previous value of $h$ [equation (6a)], namely, 5-digit precision. So we certainly don't suffer any loss of precision at this step.
The sequence of floating-point numbers that our computation generates is now
f(x+h) = 1.0667
f(x) = 1.0000                    (7b)
f(x+h) - f(x) = 0.0667
and thus
$$ \frac{f(x+h) - f(x)}{h} = \frac{0.066700}{0.066667} \tag{7c} $$
Now the numerator and denominator begin to disagree in the third significant digit, so the ratio deviates from 1 by around $10^{-3}$, i.e. we have
$$ f'_{\text{FD}}\left(h = \tfrac{2}{30},\, x = 1\right) = 1 + O(10^{-3}) \tag{7d} $$
and thus, since $f'_{\text{exact}} = 1$,
$$ \text{for } h = \tfrac{2}{30} \text{ the error in } f'_{\text{FD}}(h, x) \text{ is about } 10^{-3}. \tag{7e} $$
Comparing equation (7e) to equation (6e) we see that shrinking $h$ by a factor of 10 has increased the error by a factor of ten! What went wrong?
Analysis
The key equations to look at are (6b) and (7b). As we noted above, our floating-point scheme represents $\frac{2}{3}$ and $\frac{2}{30}$ with the same precision, namely 5 digits. Although the second number is 10 times smaller, the floating-point scheme uses the same mantissa for both numbers and just adjusts the exponent appropriately.
The problem arises when we attempt to cram these numbers inside a floating-point register that must also store the quantity 1, as in (6b) and (7b). Because the overall scale of the number is set by the 1, we can't simply adjust the exponent to accommodate all the digits of $\frac{2}{30}$. Instead, we lose digits off the right end; more specifically, we lose one more digit off the right end in (7b) than we did in (6b). However, when we go to perform the division in (6c) and (7c), the denominator is the same 5-digit-accurate $h$ value we started with [eqs. (6a) and (7a)]. This means that each digit we lost by cramming our number together with 1 now amounts to an extra lost digit of precision in our final answer.
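The two stepsize experiments can be replayed in a few lines. This is a sketch under the assumption that rounding every intermediate result to 5 significant digits (the helper `fl5`, a name introduced here) is an adequate stand-in for the 5-digit decimal machine; `fd_error` is likewise a hypothetical helper.

```julia
fl5(x) = round(x, sigdigits = 5)   # simulated 5-digit decimal rounding

# Forward difference for f(x) = x at x = 1, rounding every intermediate result.
function fd_error(h_exact)
    h     = fl5(h_exact)           # stored stepsize, as in (6a) / (7a)
    xph   = fl5(1.0 + h)           # f(x+h), as in (6b) / (7b)
    numer = fl5(xph - 1.0)         # f(x+h) - f(x)
    return abs(numer / h - 1.0)    # deviation from the exact derivative 1
end

println(fd_error(2 / 3))    # about 4.5e-5
println(fd_error(2 / 30))   # about 5e-4
```

The second error is roughly ten times the first, matching the comparison of (6e) and (7e).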
Avoiding the big floating-point kahuna

Once you know that the big floating-point kahuna is lurking out there waiting to bite, it's easier to devise ways to avoid him. To give just one example, suppose we need to compute values of the quantity
$$ f(x, \Delta) = \sqrt{x + \Delta} - \sqrt{x}. $$
When $\Delta \ll x$, the two terms on the RHS are nearly equal, and subtracting them gives rise to catastrophic loss of precision. For example, if $x = 900$, $\Delta =$ 4e-3, the calculation on the RHS becomes
$$ 30.00006667 - 30.00000000 $$
and we waste the first 6 decimal digits of our floating-point precision; in the 5-decimal-digit scheme discussed above, this calculation would yield precisely zero useful information about the number we are seeking.
However, there is a simple workaround. Consider the identity
$$ \left(\sqrt{x+\Delta} - \sqrt{x}\right)\left(\sqrt{x+\Delta} + \sqrt{x}\right) = (x + \Delta) - x = \Delta $$
which we might rewrite in the form
$$ \sqrt{x+\Delta} - \sqrt{x} = \frac{\Delta}{\sqrt{x+\Delta} + \sqrt{x}}. $$
The RHS of this equation is a safe way to compute a value for the LHS; for example, with the numbers considered above, we have
$$ \frac{\text{4e-3}}{30.0000667 + 30.0000000} \approx \text{6.667e-5}. $$
Even if we can't store all the digits of the numbers in the denominator, it doesn't matter; in this way of doing the calculation those digits aren't particularly relevant anyway.
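The rewritten formula can be compared with the naive one in the simulated 5-digit scheme. As before, `fl5` is an illustrative helper that rounds to 5 significant digits, and the double-precision value is used only as a reference for comparison.

```julia
fl5(x) = round(x, sigdigits = 5)   # simulated 5-digit decimal rounding

x, delta = 900.0, 4e-3
a = fl5(sqrt(fl5(x + delta)))      # 30.000 once rounded to 5 digits
b = fl5(sqrt(x))                   # 30.000

naive  = a - b                     # every significant digit cancels
stable = fl5(delta / fl5(a + b))   # no subtraction of nearly equal numbers

reference = sqrt(x + delta) - sqrt(x)   # double precision, for comparison

println("naive     = ", naive)
println("stable    = ", stable)
println("reference = ", reference)
```

The naive difference comes out as zero, while the rearranged form keeps essentially all 5 digits.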
Other Floating-Point Kahunae
Random-walk error accumulation

Consider the following code snippet, which adds a number to itself N times:
function DirectSum(X, N)
    # Add X to itself N times, keeping a simple running total.
    Sum = 0.0
    for n = 1:N
        Sum += X
    end
    Sum
end
Suppose we divide some number $Y$ into $N$ equal parts and add them all up. How accurately do we recover the original value of $Y$? The following figure plots the quantity
$$ \frac{\left|\text{DirectSum}\left(\frac{Y}{N}, N\right) - Y\right|}{Y} $$
for a fixed value of $Y$ and various values of $N$. Evidently we incur significant errors for large $N$.
[Plot: "Relative error in direct summation". Horizontal axis: N, from 100 to 1e+08; vertical axis: relative error, from 1e-16 to 1e-08.]
Figure 7: Relative error in the quantity DirectSum(Y/N, N).
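The plotted quantity can be computed with a short script. This is a sketch, not the code behind the figure: `direct_sum` repeats the DirectSum algorithm under a different name, `direct_relerr` is a helper introduced here, and the value Y = pi is an assumed test value.

```julia
# Same algorithm as DirectSum in the text, renamed to mark it as this sketch's copy.
function direct_sum(x, n)
    total = 0.0
    for _ in 1:n
        total += x
    end
    return total
end

# Relative error |direct_sum(Y/N, N) - Y| / Y.
direct_relerr(y, n) = abs(direct_sum(y / n, n) - y) / y

y = Float64(pi)   # assumed test value
for n in (10^2, 10^4, 10^6)
    println("N = ", n, "   relative error = ", direct_relerr(y, n))
end
```

When Y/N and all partial sums happen to be exactly representable (for example Y = 1 and N a power of 2), the error is exactly zero, so a generic Y is needed to see the effect.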
The cure for random-walk error accumulation

Unlike many problems in life and mathematics, the problem posed in the previous subsection turns out to have a beautiful and comprehensive solution that, in practice, utterly eradicates the difficulty. All we have to do is replace DirectSum with the following function:[2]
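function RecursiveSum(X, N)
    if N < BaseCaseThreshold
        # Few enough terms: sum them directly.
        Sum = DirectSum(X, N)
    else
        # Otherwise split the sum into two halves (see the caution in footnote 2).
        Sum = RecursiveSum(X, N/2) + RecursiveSum(X, N/2)
    end
    Sum
end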
What this function does is the following: If N is less than some threshold value
BaseCaseThreshold (which may be 100 or 1000 or so), we perform the sum
directly. However, for larger values of N we perform the sum recursively: We
evaluate the sum by adding together two return values of RecursiveSum. The
following figure shows that this slight modification completely eliminates the
error incurred in the direct-summation process:
[Plot: "Relative error in direct and recursive summation", with curves labeled Direct and Recursive. Horizontal axis: N, from 100 to 1e+08; vertical axis: relative error, from 1e-16 to 1e-08.]
Figure 8: Relative error in the quantity RecursiveSum(Y/N, N).
[2] Caution: The function RecursiveSum as implemented here actually only works for even values of N. Can you see why? For the full, correctly-implemented version of the function, see the code RecursiveSum.jl available from the Lecture Notes section of the website.
Analysis
Why does such a simple prescription so thoroughly cure the disease? The basic intuition is that, in the case of DirectSum with large values of $N$, by the time we are on the 10,000th loop iteration we are adding X to a number that is $10^4$ times bigger than X. That means we instantly lose 4 digits of precision off the right end of X, giving rise to a random rounding error. As we go to higher and higher loop iterations, we are adding the small number X to larger and larger numbers, thus losing more and more digits off the right end of our floating-point register.
In contrast, in the RecursiveSum approach we never add X to any number
that is more than BaseCaseThreshold times greater than X. This limits the
number of digits we can ever lose off the right end of X. Higher-level additions
are computing the sum of numbers that are roughly equal to each other, in
which case the rounding error is on the order of machine precision (i.e. tiny).
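The idea described above can be sketched for arbitrary N, odd values included, by splitting with integer division. This is one possible variant written for illustration, not the RecursiveSum.jl code referred to in the footnote; the names `direct_sum` and `pairwise_sum` and the keyword `threshold` (standing in for BaseCaseThreshold, assumed to be at least 2) are introduced here.

```julia
# Straightforward running-total summation (the DirectSum algorithm of the text).
function direct_sum(x, n)
    total = 0.0
    for _ in 1:n
        total += x
    end
    return total
end

# Pairwise summation of n copies of x. The integer split n = left + right also
# covers odd n; threshold must be at least 2 for the recursion to terminate.
function pairwise_sum(x, n; threshold = 1000)
    n < threshold && return direct_sum(x, n)
    left  = div(n, 2)          # integer division
    right = n - left
    return pairwise_sum(x, left; threshold = threshold) +
           pairwise_sum(x, right; threshold = threshold)
end

y = Float64(pi)                # assumed test value
n = 10^6 + 1                   # deliberately odd
println("direct:   ", abs(direct_sum(y / n, n) - y) / y)
println("pairwise: ", abs(pairwise_sum(y / n, n) - y) / y)
```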
For a more rigorous analysis of the error in direct and pairwise summation, see the Wikipedia page on the topic,[3] which was written by MIT's own Professor Steven Johnson.
[3] http://en.wikipedia.org/wiki/Pairwise_summation
Fixed-Point and Floating-Point Numbers in Modern Computers
As noted above, modern computers use both fixed-point and floating-point numbers.
Fixed-point numbers: int or integer

Modern computers implement fixed-point numbers in the form of integers, typically denoted int or integer. Integers correspond to the fixed-point diagram of
Figure 1 with zero digits after the decimal place; the quantity EPSABS in equation (1) is 0.5. Rounding is always performed toward zero; for example, 9/2=4,
-9/2=-4. You can get the remainder of an integer division by using the %
symbol to perform modular arithmetic. For example, 19/7 = 2 with remainder
5:
julia> 19%7
5
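In Julia specifically, truncating integer division is spelled `div`, while `/` applied to two integers returns a floating-point result. The following lines illustrate this along with the remainder operator; the last line is a reminder that machine integers overflow by wrapping around silently.

```julia
println(div(9, 2))     # 4
println(div(-9, 2))    # -4  (rounded toward zero)
println(19 % 7)        # 5
println(9 / 2)         # 4.5 (in Julia, / on integers returns a floating-point number)
println(typemax(Int64) + 1 == typemin(Int64))   # true: integer overflow wraps around
```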
Floating-point numbers: float or double

The floating-point standard that has been in use since the 1980s is known as
IEEE 754 floating point (where 754 is the number of the technical document
that introduced it). There are two primary sizes of floating-point numbers,
32-bit (known as single precision and denoted float or float32) and 64-bit
(known as double precision and denoted double or float64).
Single-precision floating-point numbers have a mantissa of approximately 7 digits (EPSREL $\approx 10^{-7}$) while double-precision floating-point numbers have a mantissa of approximately 15 digits (EPSREL $\approx 10^{-16}$).
You will do most of your numerical calculations in double-precision arithmetic, but single precision is still useful for, among other things, storing numbers in data files, since you typically won't need to store all 15 digits of the numbers generated by your calculations.
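The two precisions can be compared directly in Julia, where the types are called `Float32` and `Float64`. The sketch below stores 1/3 in each format; the variable names are illustrative.

```julia
println(eps(Float32))              # about 1.2e-7
println(eps(Float64))              # about 2.2e-16

x32 = Float32(1) / Float32(3)      # 1/3 rounded to single precision
x64 = 1.0 / 3.0                    # 1/3 rounded to double precision
println(x32)
println(x64)

# Relative error committed by storing 1/3 in single precision.
println(abs(Float64(x32) - x64) / x64)
```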
Inf and NaN

The floating-point standard defines special numbers to represent the result of
ill-defined calculations.
If you attempt to divide a non-zero number by zero, the result will be a
special number called Inf. (There is also -Inf.) This special number satisfies
x+Inf=Inf and x*Inf=Inf if x > 0. You will also get Inf in the event of
overflow, i.e. when the result of a floating-point calculation is larger than the
largest representable floating-point number:
julia> exp(1000)
Inf
On the other hand, if you attempt to perform an ill-defined calculation like 0.0/0.0 then the result will be a special number called NaN (not a number). This special number has the property that all arithmetic operations involving NaN result in NaN. (For example, 4.0+NaN=NaN, -1000.0*NaN=NaN.)
What this means is that, if you are running a big calculation in which any
one piece evaluates to NaN (for example, a single entry in a matrix), that NaN will
propagate all the way through the rest of your calculation and contaminate the
final answer. If your calculation takes hours to complete, you will be an unhappy
camper upon arriving the following morning to check your data and discovering
that a NaN somewhere in the middle of the night has corrupted everything. (I
speak from experience.) Be careful!
NaN also satisfies the curious property that it is not equal to itself:
julia> x=0.0 / 0.0
NaN
julia> y=0.0 / 0.0
NaN
julia> x==y
false
This fact can actually be used to test whether a given number is NaN.
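A minimal sketch of the self-inequality test, alongside Julia's built-in `isnan` (which is the more readable choice in practice), plus a few of the special-value rules mentioned above:

```julia
x = 0.0 / 0.0
println(x == x)          # false
println(x != x)          # true: only NaN fails to equal itself
println(isnan(x))        # true: the built-in test

println(1.0 / 0.0)       # Inf
println(-1.0 / 0.0)      # -Inf
println(Inf - Inf)       # NaN
println(exp(1000.0))     # Inf (overflow)
```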
Distinguishing floating-point integers from integer integers

If, in writing a computer program, you wish to define an integer-valued constant
that you want the computer to store as a floating-point number, write 4.0
instead of 4.
Arbitrary-precision arithmetic
In the examples above we discussed the kinds of errors that can arise when
you do floating-point arithmetic with a finite-length mantissa. Of course it
is possible to chain together multiple floating-point registers to create a longer
mantissa and achieve any desired level of floating-point precision. (For example,
by combining two 64-bit registers we obtain a 128-bit register, of which we might
set aside 104 bits for the mantissa, roughly doubling the number of significant
digits we can store.) Software packages that do this are called arbitrary-precision
arithmetic packages; an example is the gnu mp library.[4]
Be forewarned, however, that arbitrary-precision arithmetic packages are not a panacea for numerical woes. The basic issue is that, whereas single-precision and double-precision floating-point arithmetic operations are performed in hardware, arbitrary-precision operations are performed in software, incurring massive overhead costs that may well run to a factor of 100 or greater. So you should think of arbitrary-precision packages as somewhat extravagant luxuries, to be resorted to only in rare cases when there is absolutely no other way to do what you need.
[4] http://gmplib.org
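Julia ships with an arbitrary-precision type, `BigFloat`, so the effect of a longer mantissa is easy to try. The sketch below uses an illustrative value of 1e-20, which is lost entirely in double precision but survives with the default 256-bit `BigFloat` mantissa.

```julia
small = 1e-20

# Double precision: 1 + 1e-20 rounds to exactly 1, so the difference is lost.
println((1.0 + small) - 1.0)

# BigFloat (256-bit mantissa by default) retains enough digits to recover it.
d = (big(1.0) + big(small)) - big(1.0)
println(Float64(d))

# The precision can be raised further within a block.
setprecision(BigFloat, 512) do
    println(precision(BigFloat))
end
```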

Modulation: Wireless Communication and
Lock-in Amplifiers
Homer Reid
April 3, 2014
Contents
1 Overview
2 Analog modulation
2.1 Amplitude modulation (AM)
2.2 Phase and frequency modulation (PM and FM)
3 Digital modulation
3.1 OOK
3.2 BPSK, QPSK, MPSK
3.3 QAM
3.4 Spectral efficiency
4 Multiplex methods
4.1 The cocktail party
4.2 How CDMA works
5 Lock-in amplifiers
5.1 How lock-in amplifiers work
Overview
Consider a bandlimited baseband signal $f^{\text{BB}}(t)$ with bandwidth $\Delta\omega$.[1] A good example to have in mind is music: think of $f^{\text{BB}}(t)$ as the time-dependent voltage $V(t)$ output from your MP3 player to your headphones or speakers. In this case, $f^{\text{BB}}(t)$ is a bandlimited baseband signal with a bandwidth $\Delta\omega \approx 2\pi \cdot 20$ kHz. (The superscript BB stands for baseband.)
For various reasons, it may be desirable to convert the signal $f^{\text{BB}}(t)$ into a new signal $f^{\text{M}}(t)$ whose frequency spectrum has the same bandwidth as the original signal $f^{\text{BB}}(t)$, but is centered around a nonzero frequency called the carrier frequency, $\omega_{\text{carrier}}$. The process of translating frequencies in this way is called modulation. (The M superscript stands for modulated. In some cases we will also refer to $f^{\text{M}}$ as $f^{\text{transmitted}}$ to indicate that it is the signal that is eventually transmitted over a wired or wireless communication channel.)
Modulation is ubiquitous throughout all fields of science and engineering
and forms the essential cornerstone of modern communications technologies. It
also furnishes an example of a highly practical and relevant real-world problem
which would be essentially impossible to tackle without the ideas and techniques
of Fourier analysis.
The purpose of these short notes is to introduce some of the basic techniques
of modulation and compare their spectral efficiencies. We will focus primarily
on communication technologies, but we will also briefly discuss lock-in detection
as an important application of modulation techniques in experimental science.
[1] A bandlimited signal with bandwidth $\Delta\omega$ is a function $f(t)$ whose Fourier transform $\tilde{f}(\omega)$ is zero for frequencies outside an interval of width $\Delta\omega$. A baseband signal is a signal whose frequency spectrum is centered at $\omega = 0$.
2 Analog modulation
2.1 Amplitude modulation (AM)
The simplest way to modulate a signal is just to translate the entire frequency
spectrum of f (t) so that it is centered around the carrier frequency. This process
is called amplitude modulation (AM). Historically, AM was the first modulation
scheme used for wireless communications in the early 20th century, and it remains in use to this day in AM radio. It was used in the first widely available
cellular telephone system, the AMPS system, in the 1980s. It was also used
for terrestrial television transmission until 2009. However, in the mid-20th century it was superseded by FM, and in the late 20th century analog modulation
was essentially replaced altogether (for communications applications, anyway)
by digital modulation. On the other hand, amplitude modulation remains in
widespread use for the purposes of lock-in detection, discussed later.
Implementation of AM transmitters
The simplest way to do AM is just to multiply the carrier signal $f^{\text{carrier}}(t) = \cos \omega_c t$ by the baseband signal $f^{\text{BB}}(t)$:
$$ f^{\text{AM}}(t) = f^{\text{BB}}(t) \cos \omega_c t \tag{1} $$
In other words, the modulated signal is just a sinusoid at the carrier frequency,
but with a time-varying amplitude defined by f BB (t). The baseband signal
modulates the amplitude of the carrier; this is the origin of the name amplitude
modulation.
Spectrum of AM signals
It's easy to determine the frequency spectrum of an AM signal. As a first step, suppose the baseband signal consists of just a single tone with frequency $\omega_{\text{BB}}$ and amplitude $A$:
$$ f^{\text{BB}}(t) = A \cos \omega_{\text{BB}} t. \tag{2} $$
The modulated signal is
$$ f^{\text{AM}}(t) = A \cos(\omega_{\text{BB}} t) \cos(\omega_c t). $$
To compute the frequency spectrum of this signal, we could now apply Fourier analysis techniques, but as it turns out we don't need to, because we can just appeal to the trigonometric identity
$$ \cos a \cos b = \frac{1}{2}\Big[\cos(a + b) + \cos(a - b)\Big] \tag{3} $$
to write
$$ f^{\text{AM}}(t) = \frac{A}{2}\Big[\cos\big((\omega_c + \omega_{\text{BB}})t\big) + \cos\big((\omega_c - \omega_{\text{BB}})t\big)\Big]. \tag{4} $$
This is a frequency spectrum with nonvanishing contributions from just two frequencies, namely, $\omega_c \pm \omega_{\text{BB}}$.
Of course, usually our baseband signal will be more interesting than just the single tone (2). However, any baseband signal can be decomposed into a sum of single tones through the magic of Fourier analysis. For the time being, let's suppose $f^{\text{BB}}$ is a periodic baseband signal that is an even function of time; then its Fourier decomposition looks something like
$$ f^{\text{BB}}(t) = \sum_n \widetilde{f^{\text{BB}}}(\omega_n) \cos \omega_n t. $$
Each term in this sum contributes two terms to the frequency spectrum of the output signal just as in equation (4):
$$ f^{\text{AM}}(t) = \sum_n \frac{\widetilde{f^{\text{BB}}}(\omega_n)}{2}\Big[\cos\big((\omega_c + \omega_n)t\big) + \cos\big((\omega_c - \omega_n)t\big)\Big]. \tag{5} $$
Equation (5) describes a frequency spectrum consisting of two copies of the frequency spectrum of $f^{\text{BB}}(t)$, with the two copies mirrored about the carrier frequency. In particular, the bandwidth of the transmit signal is twice the bandwidth of the baseband signal. Each mirrored copy is called a sideband, and this type of amplitude modulation is known as double-sideband modulation.
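The two-sideband structure of equation (4) can be checked numerically with a small discrete Fourier transform. The sketch below is self-contained and deliberately naive (an O(n^2) sum rather than an FFT); the helper `dft_mag`, the 1-second record of 256 samples, and the tone frequencies of 5 Hz and 50 Hz are all assumptions made for illustration.

```julia
# Magnitude of the k-th DFT coefficient of the samples x, normalized by the length.
function dft_mag(x, k)
    n = length(x)
    return abs(sum(x[j + 1] * exp(-2pi * im * k * j / n) for j in 0:n-1)) / n
end

nsamp = 256                      # samples over a 1-second record (assumed)
t = (0:nsamp-1) ./ nsamp         # sample times in seconds
f_bb, f_c, A = 5, 50, 1.0        # baseband tone (Hz), carrier (Hz), amplitude: assumed

# Equation (1) applied to the single tone (2).
f_am = A .* cos.(2pi * f_bb .* t) .* cos.(2pi * f_c .* t)

for k in (f_c - f_bb, f_c, f_c + f_bb)
    println("bin at ", k, " Hz: ", round(dft_mag(f_am, k), digits = 6))
end
```

Each sideband shows up with magnitude A/4 rather than A/2 because the DFT splits every real cosine equally between its positive-frequency and negative-frequency bins; the carrier bin itself is empty.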
Figures 1 and 2 show the baseband, carrier, and modulated signals in the
time and frequency domains.
Figure 1: Amplitude modulation in the time domain.
Figure 2: Amplitude modulation in the frequency domain. The baseband signal has some frequency spectrum that is nonzero up to a maximum frequency $\omega_{\max}$. The carrier signal has a frequency spectrum that is concentrated at a single point. The modulated signal has a frequency spectrum consisting of two copies (two sidebands) of the baseband frequency spectrum mirrored about the carrier frequency. The modulated signal has bandwidth $2\omega_{\max}$. Single-sideband modulation would produce a similar signal but with only one of the two sidebands present.
Single-sideband AM
As we noted above, the frequency spectrum of a naive AM signal contains two redundant copies of the information we are trying to transmit. This means that the transmit signal actually has twice as much bandwidth as it nominally needs to have to transmit the requisite information.
It is possible to circumvent this redundancy by use of a technique known as single-sideband modulation. This is based on the following modified version of the trig identity (3):
$$ \cos a \cos b - \sin a \sin b = \cos(a + b). $$
To see how single-sideband modulation works in practice, suppose again that our baseband signal consists of the single tone
$$ f^{\text{BB}}(t) = A \cos \omega_{\text{BB}} t. $$
What we do is to form the $\pi/2$-shifted version of this signal:
$$ f^{\text{BB}}_{\pi/2}(t) = A \sin \omega_{\text{BB}} t. $$
Then we multiply $f^{\text{BB}}(t)$ by the original carrier signal $\cos \omega_c t$, and we multiply the $\pi/2$-shifted baseband signal by the $\pi/2$-shifted carrier signal, and we subtract:
$$ f^{\text{SSAM}}(t) = f^{\text{BB}}(t) \cos \omega_c t - f^{\text{BB}}_{\pi/2}(t) \sin \omega_c t $$
For the case of the single-tone baseband signal, the transmit signal now contains only the single tone $\omega_c + \omega_{\text{BB}}$; the lower-sideband tone at $\omega_c - \omega_{\text{BB}}$ has been suppressed. More generally, if $f^{\text{BB}}(t)$ contains a spectrum of frequencies, the transmitted signal will contain only one copy of this spectrum, not the two redundant copies we found above.
However, for baseband signals that are more complicated than a single tone, forming the $\pi/2$-shifted version is expensive: we have to Fourier-decompose the signal into constituent sinusoids and then apply a $\pi/2$ phase shift to each sinusoid. In practice this requires fairly sophisticated digital signal processing techniques, and is not commonly used for wireless AM communications.
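For the single-tone case the suppression of the lower sideband is easy to confirm numerically. The sketch below reuses a naive DFT helper (`dft_mag`, a name introduced here) and the same assumed parameters as one might use for the double-sideband check: 256 samples over 1 second, a 5 Hz tone and a 50 Hz carrier.

```julia
# Magnitude of the k-th DFT coefficient of the samples x, normalized by the length.
function dft_mag(x, k)
    n = length(x)
    return abs(sum(x[j + 1] * exp(-2pi * im * k * j / n) for j in 0:n-1)) / n
end

nsamp = 256                      # samples over a 1-second record (assumed)
t = (0:nsamp-1) ./ nsamp
f_bb, f_c, A = 5, 50, 1.0        # assumed tone, carrier, amplitude

bb         = A .* cos.(2pi * f_bb .* t)   # baseband tone
bb_shifted = A .* sin.(2pi * f_bb .* t)   # its pi/2-shifted version

# Single-sideband signal: in-phase product minus quadrature product.
f_ssam = bb .* cos.(2pi * f_c .* t) .- bb_shifted .* sin.(2pi * f_c .* t)

println("lower sideband (45 Hz): ", round(dft_mag(f_ssam, f_c - f_bb), digits = 6))
println("upper sideband (55 Hz): ", round(dft_mag(f_ssam, f_c + f_bb), digits = 6))
```

Only the upper-sideband bin is populated, with magnitude A/2 in this two-sided normalization.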
2.2 Phase and frequency modulation (PM and FM)
One drawback of amplitude modulation is that all the information is in the amplitude of the received signal, which makes that signal susceptible to noise contamination. This will be evident to anyone who has ever experienced annoying hissing and ringing sounds from an AM radio.
An alternative technique is to modulate the phase and/or frequency of the carrier instead of its amplitude. The former option is called phase modulation (PM); the latter option is called frequency modulation (FM), and collectively they are sometimes known as angle modulation. In the time domain, the signals take the form
$$ f^{\text{PM}}(t) = \cos\Big[\omega_c t + \beta f^{\text{BB}}(t)\Big] $$
$$ f^{\text{FM}}(t) = \cos\Big[\omega_c t + \beta \int_0^t f^{\text{BB}}(t')\, dt'\Big] $$
where $\beta$ is a parameter known as the modulation index that determines the fractional extent to which we allow the carrier phase and frequency to be tweaked by the baseband signal.
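A short sketch of the two formulas for a single-tone baseband signal, for which the FM integral has a closed form. All numerical values (carrier, tone, modulation index) and all function names are assumptions chosen for illustration. The last helper estimates the instantaneous angular frequency of the FM signal, which should equal the carrier frequency plus the modulation index times the baseband signal.

```julia
f_c, f_bb, beta = 20.0, 1.0, 30.0      # carrier (Hz), tone (Hz), modulation index: assumed
w_c, w_bb = 2pi * f_c, 2pi * f_bb      # angular frequencies

baseband(t) = cos(w_bb * t)

# Phases of the PM and FM signals. For this baseband signal the integral from 0 to t
# of cos(w_bb * t') is sin(w_bb * t) / w_bb.
pm_phase(t) = w_c * t + beta * baseband(t)
fm_phase(t) = w_c * t + beta * sin(w_bb * t) / w_bb

f_pm(t) = cos(pm_phase(t))
f_fm(t) = cos(fm_phase(t))

# Instantaneous angular frequency of the FM signal via a centered difference.
inst_freq(t; h = 1e-6) = (fm_phase(t + h) - fm_phase(t - h)) / (2h)

t0 = 0.3
println(inst_freq(t0))
println(w_c + beta * baseband(t0))
```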
Figure 3: An example of a FM signal in the time domain. Note that the amplitude is fixed, but the instantaneous frequency varies.
Angle modulation techniques have the advantage that all the information is
contained in the zero crossings of the signal, which makes them less sensitive to
noise contamination. However, this advantage comes at a cost: for the same
baseband signal, PM and FM signals occupy significantly more bandwidth than
AM signals. A real-world demonstration of this fact may be found in the spacing
of AM and FM radio stations: AM stations are typically spaced about 10 kHz
apart from one another, while FM stations are typically spaced around 500 kHz
from each other, even though they are nominally transmitting baseband signals
of the same bandwidth (music and talk, which occupies up to around 20 kHz).
Digital modulation
AM and FM are techniques for transmitting analog signals. We may also want to transmit a digital signal, that is, a sequence of 0s and 1s. There are many ways to do this, of which we will consider just a few.
3.1 OOK
The simplest form of digital modulation is known as on-off keying (OOK).

In this scheme, the carrier is turned on for the duration of each 1 bit in the
bitstream, and turned off for the duration of each 0 bit.
Figure 4: OOK transmit signal.
3.2
BPSK, QPSK, MPSK
The next most complicated thing we could do would be to tweak the phase or
frequency of the carrier during each bit period, with the tweak depending on the
binary data to be transmitted during that period.
For example, we might give the carrier a 0-degree phase shift during bit
periods in which the transmit bit is 1, and a $\pi$ phase shift during bit periods
in which the transmit bit is 0. This is binary phase-shift keying (BPSK). Of
course, a $\pi$ phase shift to a sinusoid amounts to a sign flip, so BPSK is similar
to OOK except that instead of turning the carrier off during 0 bits we flip its
sign.
Figure 5: BPSK transmit signal.
The next most complicated possibility is quadrature phase-shift keying (QPSK).
In this scheme, we look at two bits at a time to determine the phase of the carrier,
and apply a phase shift of 0, $\pi/2$, $\pi$, or $3\pi/2$ accordingly. Continuing in this
vein, we arrive at general MPSK schemes in which we apply one of $M$ possible
phase shifts to the carrier signal depending on $\log_2 M$ bits from the bitstream.
In addition to PSK schemes, there are also frequency-shift keying (FSK)
schemes, which simply tweak the frequency instead of the phase of the carrier
signal.
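The keying schemes described in this section amount to simple lookups from bits to a carrier amplitude or phase. The julia sketch below shows one possible version; the function names are hypothetical, and the assignment of the bit pairs 00, 01, 10, 11 to the four QPSK phases is an assumption (any fixed one-to-one assignment would do).

```julia
# Illustrative symbol mappers; the bit-to-phase assignments are assumptions.

# OOK: carrier amplitude is 1 during a 1 bit and 0 during a 0 bit.
ook_amplitudes(bits) = [b == 1 ? 1.0 : 0.0 for b in bits]

# BPSK: phase shift 0 for a 1 bit and π for a 0 bit.
bpsk_phases(bits) = [b == 1 ? 0.0 : Float64(π) for b in bits]

# QPSK: consume two bits per symbol; the pairs 00, 01, 10, 11 are assumed to
# map to the phase shifts 0, π/2, π, 3π/2 in that order.
function qpsk_phases(bits)
    iseven(length(bits)) || error("QPSK needs an even number of bits")
    return [(2 * bits[i] + bits[i + 1]) * π / 2 for i in 1:2:length(bits)]
end

qpsk_phases([0, 0, 0, 1, 1, 0, 1, 1])   # phases for the pairs 00, 01, 10, 11
```

An MPSK mapper would follow the same pattern, reading $\log_2 M$ bits per symbol and scaling the resulting integer by $2\pi/M$.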
3.3 QAM
3.4 Spectral efficiency
An important consideration in identifying a digital modulation scheme is the

spectral efficiency. This is the data bitrate of a signal divided by the bandwidth
occupied by the transmitted signal. More efficient modulation schemes are
able to transmit data at a higher rate while occupying the same portion of the
frequency spectrum.
As an example, let's compute the spectral efficiency of QPSK. We will assume
a bitrate of 2 megabit/s and a carrier frequency of $\omega_c = 2\pi \cdot 100$ MHz. Suppose,
for the sake of simplicity, that the data to be transmitted consist of a bitstream
that repeats over and over again the following 8 bits:
...00011011...
In a QPSK scheme with a bitrate of 2 megabit/s, we transmit 2 bits in each
1 μs interval, so the period of our 8-bit sequence is 4 μs, and the baseband
signal is periodic with period $T = 4$ μs. Since the carrier frequency is a multiple
of $2\pi/(4\,\mu\text{s})$, the entire transmit signal is periodic with period $T = 4$ μs and we can
characterize its frequency spectrum by computing its Fourier series coefficients,
which will be defined for frequencies that are integer multiples of $\omega_0 = 2\pi/(4\,\mu\text{s})$. The
carrier frequency is one such frequency: $\omega_c = N\omega_0$, where $N = 4 \cdot 10^4$.
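In a QPSK scheme, the above 8-bit pattern would lead to a transmit signal
of the form

$$ f^{\text{QPSK}}(t) = \begin{cases} \cos \omega_c t, & 0 < t < 1\,\mu\text{s} \\ \sin \omega_c t, & 1 < t < 2\,\mu\text{s} \\ -\cos \omega_c t, & 2 < t < 3\,\mu\text{s} \\ -\sin \omega_c t, & 3 < t < 4\,\mu\text{s} \end{cases} $$

The Fourier series coefficients are

$$ \begin{aligned} \hat f^{\text{QPSK}}_n &= \frac{1}{T} \int_0^T f^{\text{QPSK}}(t)\, e^{-in\omega_0 t}\, dt \\ &= \frac{1}{T} \Bigg[ \int_0^{T/4} \cos(\omega_c t)\, e^{-in\omega_0 t}\, dt + \int_{T/4}^{T/2} \sin(\omega_c t)\, e^{-in\omega_0 t}\, dt \\ &\qquad\quad - \int_{T/2}^{3T/4} \cos(\omega_c t)\, e^{-in\omega_0 t}\, dt - \int_{3T/4}^{T} \sin(\omega_c t)\, e^{-in\omega_0 t}\, dt \Bigg] \end{aligned} $$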
This spectrum is plotted in Figure 6. If we define the bandwidth of the signal
to be the width of the frequency range within which the Fourier coefficients are
within a factor of 10 of their peak amplitude, then the signal has a bandwidth of
roughly $10\,\omega_0/2\pi = 2.5$ MHz, and the bit rate is 2 megabit/s, so we have a spectral
efficiency of $2/2.5 = 0.8$ bit/s/Hz.
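The Fourier coefficients of such a periodic transmit signal can also be estimated numerically. The julia sketch below is a minimal version under stated assumptions: time is measured in units of the period $T$, the four segments are taken to be $\cos$, $\sin$, $-\cos$, $-\sin$ of the carrier (phase shifts 0, $\pi/2$, $\pi$, $3\pi/2$), the integral is approximated by the midpoint rule, and the carrier index `N = 40` is a small illustrative value chosen to keep the computation cheap. The function names are hypothetical.

```julia
# Illustrative sketch. Time is measured in units of the period T, so ω0 = 2π.
# N is the ratio of the carrier frequency to ω0 (a small assumed value here).

# One period of the transmit signal for the repeating bit pattern 00 01 10 11.
function qpsk_example(t, N)
    ωc = 2π * N
    τ = mod(t, 1.0)
    if τ < 0.25
        return cos(ωc * t)
    elseif τ < 0.5
        return sin(ωc * t)
    elseif τ < 0.75
        return -cos(ωc * t)
    else
        return -sin(ωc * t)
    end
end

# n-th Fourier series coefficient (1/T) ∫ f(t) exp(-i n ω0 t) dt,
# approximated by the midpoint rule with M sample points per period.
function fourier_coefficient(f, n; M = 2^14)
    s = 0.0 + 0.0im
    for j in 1:M
        t = (j - 0.5) / M
        s += f(t) * cis(-2π * n * t)
    end
    return s / M
end

N = 40
spectrum = [abs(fourier_coefficient(t -> qpsk_example(t, N), n)) for n in N-10:N+10]
```

With these assumptions the coefficient at the carrier index itself vanishes, because the four phases cancel over one period, and the largest coefficient sits at $n = N - 1$ with magnitude $\sqrt{2}/\pi \approx 0.45$.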
Figure 6: Fourier spectrum of QPSK signal (vertical axis: $|\hat f_n|$ on a logarithmic scale). The x axis labels $n$, the index of
the frequency $n\omega_0$; the carrier frequency is at $\omega_c = 4 \cdot 10^4\, \omega_0$.
Multiplex methods
When multiple people are trying to communicate over the same communications
channel, which may be wired (think of an ethernet network consisting of a
single long cable with multiple computers feeding signals in and out) or wireless
(think of electromagnetic waves propagating through the air), we need multiplex
techniques to allow the channel to be shared.
There are three broad categories of multiplex techniques.
- Time-division multiple access (TDMA), in which multiplexing happens
  in the time domain: one user uses the entire channel (i.e. all available
  frequencies) to transmit his message, then a second user uses the entire
  channel to transmit her message, and so on.
- Frequency-division multiple access (FDMA), in which multiplexing happens
  in the frequency domain: multiple users transmit their messages
  simultaneously, but each user's transmission is restricted to a finite chunk
  of the available frequency spectrum.
- Code-division multiple access (CDMA), in which all users transmit their
  messages at the same time using the same frequencies, and yet the receiver
  is magically able to disentangle one message from another because the
  messages are coded in an orthogonal way.
To summarize:
- TDMA: same frequencies, different times.
- FDMA: same time, different frequencies.
- CDMA: same time, same frequencies, different codes.
TDMA is used, for example, in ethernet networking. In this protocol, multiple
computers are connected to a common wire, and a message sent by one
computer is seen by all computers. Only one computer may be transmitting at
a time.[2] TDMA was also used in early cell phone systems. It is very easy to
design TDMA receivers: basically, the receiver just has to turn on during the
appropriate time interval and then turn off during other time intervals.
FDMA is the most widely used multiplex method. It is used, for example,
in radio broadcasting (each AM and FM channel broadcasts simultaneously at
a different frequency) and in cell-phone networks (different phones communicate
with the base station on different frequencies). FDMA receivers are slightly
trickier to design than TDMA receivers, but still relatively straightforward.
Basically, the receiver applies a filter to exclude incoming signals at all frequencies
other than the frequency of interest, then downconverts (demodulates) from the
carrier frequency to baseband.
[2] But how is this synchronization enforced? What happens if two computers try to transmit
messages at the same time? How do computers know it's their turn to talk? Answer: they
don't! When a computer has a message to send, it just randomly sends it out and hopes
nobody else was trying to send a message at the same time. If someone else was trying to
send a message at the same time, the two messages collide, neither message is received by
anyone, and the two transmitting computers each wait a randomly chosen amount of time
before attempting to resend. This simpleminded protocol actually yields excellent performance
as long as the total message density (the fraction of all time during which some computer is
trying to send a message) doesn't get too high.
CDMA is a relatively recent addition to the fold of multiplex techniques. In
CDMA, each message is coded using a certain simple code in a way that allows
it to be distinguished from other simultaneously received messages. CDMA
receivers are much more difficult to design than TDMA or FDMA receivers, and
their implementation involves a lot of interesting mathematics.
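To make the idea of orthogonal coding concrete, here is a toy julia sketch in which two users transmit simultaneously on the same channel and each message is recovered by correlating with the corresponding code. The particular length-4 codes (two rows of a Walsh-Hadamard matrix) and all function names are assumptions chosen for illustration; real CDMA systems are considerably more involved.

```julia
# Toy CDMA illustration. The length-4 codes below are an assumed choice
# (two mutually orthogonal rows of a Walsh-Hadamard matrix).
const CODE_A = [1, 1, -1, -1]
const CODE_B = [1, -1, 1, -1]

# Spread a bit sequence: each bit becomes +code (bit 1) or -code (bit 0).
spread(bits, code) = reduce(vcat, [(2 * b - 1) .* code for b in bits])

# Recover bits by correlating each block of received chips with one code.
function despread(chips, code)
    L = length(code)
    nbits = length(chips) ÷ L
    bits = zeros(Int, nbits)
    for k in 1:nbits
        block = chips[(k-1)*L+1:k*L]
        bits[k] = sum(block .* code) > 0 ? 1 : 0
    end
    return bits
end

bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]
# Both users transmit at once; the channel simply adds their signals.
received = spread(bits_a, CODE_A) .+ spread(bits_b, CODE_B)
```

Because the two codes have zero inner product, `despread(received, CODE_A)` returns `bits_a` and `despread(received, CODE_B)` returns `bits_b`, even though the two transmissions overlap completely in time and frequency.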
4.1 The cocktail party
A good way to understand the various different multiplex techniques is to think

of a cocktail party in which multiple pairs of people are all trying to talk to each
other in the same small crowded space. Consider two pairs of conversationalists:
Akiko is trying to say something to Bob, while Chen is trying to say something
to Dinara. How can Bob receive the message from Akiko without confusing it
with the message from Chen? The relevant implementations of the protocols
discussed above would look something like this:
1. TDMA: Akiko gets to talk to Bob for 1 minute while Chen and Dinara
wait silently. Then Akiko and Bob have to shut up for 1 minute while
Chen and Dinara converse, etc. The message reception protocol is easy:
Bob just knows to listen when it's his partner's turn to be talking.
2. FDMA: Akiko sings to Bob in a soprano voice, while at the same time Chen
sings to Dinara in a bass voice. Again the message reception protocol is
easy: Bob just tries to tune out the lower-pitched sounds he is hearing
and focuses on the higher-pitched song.
3. CDMA: Akiko talks to Bob in Japanese, while Chen talks to Dinara in
Chinese. Now the message reception protocol is more subtle: Bob is
receiving information at the same time and at the same pitch, so his brain
must piece together only the sounds that make sense in Japanese while
filtering out the sounds that are only meaningful in Chinese.
4.2 How CDMA works
Lock-in amplifiers
Most of the preceding discussion pertained to communications technology, which

is the primary application of modulation theory in engineering. The lock-in
amplifier is an application of modulation techniques to an entirely different
field of endeavor: experimental science and measurement.
The basic idea of lock-in amplifiers is this: Suppose we are trying to measure a DC signal. (DC stands for direct current, as opposed to alternating
current (AC), and just means the signal is constant in time.) For example, in
a solid-state physics experiment, we may be trying to measure the resistance of
a piece of material, which is certainly a time-independent quantity, and we may
do this by connecting the material to a fixed time-independent voltage source
(such as an AAA battery) and using a current meter to measure the DC current
that flows through the sample.
The difficulty with this kind of setup is that our measurement apparatus
(the current meter in this case) will typically be contaminated by noise, an unavoidable presence in all real-world equipment despite the best efforts of device
manufacturers to mitigate its impact. This noise spectrum will typically be
peaked at DC (DC-peaked noise in measurement equipment is often known as
1/f noise, or "one-over-f noise"), which makes DC about the worst frequency at
which we could possibly try to measure our signal.
But if the signal we are trying to measure really is a DC signal, then we're
out of luck, right? We must measure at DC, right? Wrong! We can modulate
our signal at some nonzero frequency, then detect at that frequency. In the
case of the resistance measurement described above, we would simply drive the
sample with an AC voltage at some frequency (typically tens to hundreds of Hz)
instead of a DC signal. Now we have a time-dependent current signal, which
we measure and then filter to extract just the Fourier component we want,
namely the component corresponding to the frequency at which we modulated
the signal, with all other frequencies present in the measured signal understood
to be spurious noise contributions. This technique allows experimentalists to
achieve sensitivity levels far below what would be achievable with the bare noise
floors available on real-world measurement equipment.
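The modulate-then-detect idea can be sketched numerically. In the julia example below, a small response of amplitude `A` at the modulation frequency is buried under a DC offset and white noise; multiplying the measured signal by the reference sinusoid and averaging recovers `A`. All parameter values and the function name are assumptions, and the DC offset plus white noise is only a crude stand-in for the DC-peaked noise described above.

```julia
using Random

# Toy lock-in detection: recover the amplitude A of a component at the
# modulation frequency f from a noisy measurement. All values are assumptions.
function lockin_demo(; A = 0.05, f = 100.0, fs = 1.0e5, T = 10.0,
                       σ = 1.0, offset = 0.3, seed = 1)
    rng = MersenneTwister(seed)
    n = round(Int, fs * T)                  # number of samples (whole periods of f)
    acc = 0.0
    for k in 0:n-1
        t = k / fs
        ref = sin(2π * f * t)               # reference at the modulation frequency
        measured = A * ref + offset + σ * randn(rng)   # signal + DC offset + noise
        acc += measured * ref               # demodulate: multiply by the reference
    end
    return 2 * acc / n                      # the mean of ref^2 is 1/2
end

estimate = lockin_demo()
```

Here the noise in each individual sample is twenty times larger than the signal amplitude, yet averaging over many modulation periods returns an estimate close to `A = 0.05`.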
5.1 How lock-in amplifiers work

Monte Carlo Integration
Homer Reid
March 20, 2014
Contents
1 Monte-Carlo integration
1.1 Monte-Carlo integration
1.2 Comparison to nested quadrature rules
1.3 Applications of Monte-Carlo integration
2 A computational example
3 How it works: deriving the convergence rate of Monte-Carlo integration
3.1 Random variables
3.2 Mean, variance, standard deviation
3.3 Sums and averages of random variables
3.4 Functions of random variables
3.5 Convergence rate of Monte-Carlo integration
3.6 Importance sampling
3.7 Generating random numbers according to a specified probability distribution
A Volume of the D-dimensional ball
1 Monte-Carlo integration
1.1 Monte-Carlo integration
Consider a scalar-valued function $f(x)$ of a $D$-dimensional variable, and suppose
we want to estimate the integral of $f$ over some subregion $R \subset \mathbb{R}^D$. In
Monte-Carlo integration we do this using the following extremely simple rule:

$$ \int_R f(x)\, dx \approx \frac{V}{N} \sum_{n=1}^{N} f(x_n) \tag{1} $$

where $V$ is the volume of $R$, and where the $x_n$ are a set of $N$ randomly chosen
points distributed uniformly throughout our region $R$.
It seems too good to be true to think that such an incredibly simple-minded
procedure could possibly yield anything resembling decent numerical accuracy.
But it does! If $I$ is the exact value of the integral on the LHS of (1) and $I_N$
is the $N$-sample Monte-Carlo approximation on the RHS, then we have the
asymptotic convergence rate

$$ |I - I_N| \propto \frac{1}{\sqrt{N}} \tag{2} $$

This result is slightly tricky to prove, so we postpone the proof to Section 3.

The most important thing about equation (2) is that it is independent of
the dimension D. The error in Monte-Carlo integration decays like the inverse square
root of the number of function samples regardless of the dimension. This is the
critical property that makes the method useful; it stands in marked contrast to
the case of more pedestrian approaches to multidimensional integration, as we will
now see.
1.2 Comparison to nested quadrature rules
Of course, if you know anything about numerical quadrature, you might be
thinking that equation (2) is an appallingly slow convergence rate. Even the
simplest, most brain-dead numerical quadrature algorithm, the rectangular rule,
converges like $1/N$, much faster than $1/\sqrt{N}$, and better quadrature algorithms
converge much more quickly. So why would we ever want to use something that
achieves a lousy convergence rate like (2)?
The answer has to do with a phenomenon sometimes known as the curse
of dimensionality. Consider rectangular-rule quadrature as an example. For a
1D integral over an interval $[a, b]$ subdivided into $N$ subintervals, we have to
evaluate the function $N_{\text{eval}} = N$ times and the error decays like $E \sim 1/N$, as
noted above. Now suppose we have a 2D integral of the form

$$ \int_a^b dx_1 \int_c^d dx_2\, f(x_1, x_2). $$
Suppose we evaluate the inner ($x_2$) integral using an $N$-point rectangular rule
to obtain a function $F(x_1)$, then integrate this function over $x_1$ again using
an $N$-point rectangular rule to compute the full integral. (Such a procedure is
called nested quadrature.) The overall error again decays like $E \sim 1/N$. But
we have to evaluate the function $N_{\text{eval}} = N^2$ times, so now the convergence
with respect to the number of function evaluations is only $E \sim 1/\sqrt{N_{\text{eval}}}$, much
slower than the 1D case. More generally, if we evaluate a D-dimensional integral
using nested rectangular-rule quadrature, the error decays like

$$ \text{error in nested D-dimensional rectangular-rule quadrature} \sim \frac{1}{N_{\text{eval}}^{1/D}} $$
We see that already for D = 2 the simple Monte-Carlo formula (1) achieves
asymptotic convergence equivalent to that of the rectangular rule, while for
D > 2 Monte-Carlo is (asymptotically) better.
Of course, the rectangular rule is only the most naive numerical quadrature
scheme. What if we use something more sophisticated like Simpson's rule?
Well, now the error decreases like $E \sim 1/N^4$, where $N$ is the number of function
samples per dimension, but the total number of function samples grows like[1]
$N_{\text{eval}} \sim (2N)^D$, so we have

$$ \text{error in nested D-dimensional Simpson's-rule quadrature} \sim \frac{1}{N_{\text{eval}}^{4/D}} $$

which is equivalent to Monte-Carlo already for $D = 8$ and worse for dimensions
$D > 8$.
The basic point is that repeated nesting of 1D quadrature schemes is a terrible
way to evaluate high-dimensional integrals, because the number of function
samples needed to achieve a given tolerance grows exponentially with the
dimension (this is the curse of dimensionality). In some special cases (such as
integration over special low-dimensional regions like triangles, spheres, or
hypercubes) there are generalized quadrature schemes that do better, but for
high-dimensional integrals in general Monte-Carlo integration is the
only available option.
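The exponential growth in function evaluations is easy to see in code. The following julia sketch implements a nested left-endpoint rectangular rule on the unit hypercube $[0,1]^D$ and reports the number of function samples it used; the function name and the choice of the unit hypercube are assumptions made for illustration.

```julia
# Nested left-endpoint rectangular rule on the unit hypercube [0,1]^D with N
# points per dimension. Returns the estimate and the number of evaluations.
function nested_rectangle(f, D, N)
    h = 1.0 / N
    total = 0.0
    # Iterate over all N^D grid points (i1, ..., iD) with 1 <= ik <= N.
    for idx in Iterators.product(ntuple(_ -> 1:N, D)...)
        x = [(i - 1) * h for i in idx]
        total += f(x)
    end
    return total * h^D, N^D
end

# Test integrand f(x) = x1 + ... + xD, whose exact integral over [0,1]^D is D/2.
estimate, evals = nested_rectangle(sum, 2, 10)
```

For this integrand the error of the rule is $D/(2N)$ in any dimension, so holding the error fixed while increasing $D$ requires $N^D$ evaluations: 100 samples for $D = 2$ with $N = 10$, but already $10^6$ samples for $D = 6$.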
1.3 Applications of Monte-Carlo integration
Computing the volume of complex high-dimensional shapes

What is the volume of intersection of a 12-dimensional sphere with a 5-dimensional
cylinder? What is the electrostatic potential at the origin due to a constant
charge density contained in a solid cubical region? Given two triangles $T_1, T_2$
in $\mathbb{R}^3$, what is the volume $V(R)$ of the 6-dimensional set of points $\{r_1, r_2\}$ such
that $r_1 \in T_1$, $r_2 \in T_2$, and $|r_1 - r_2| = R$?
[1] Recall that Simpson's rule requires 2 function evaluations per subinterval; this is the origin
of the 2 in this formula.
Questions like this arise in computational geometry and partial differential

equations and may generally be expressed as high-dimensional integrals. In some
cases it may be possible to write out explicit limits of integration delimiting the
region in question, in which case the integrals may be evaluated analytically;
but in general such a calculation may not be possible, and even when possible
it will generally be unwieldy.
On the other hand, it is almost always easy to write a characteristic function
$\chi(x)$ which takes the value 1 for points inside the region in question and 0
otherwise; then the volume of the region is given simply by

$$ V = \int_R \chi(x)\, dx $$

where $R$ is any simple region (for example, a hypercube) encompassing the
region in question. Integrals of this type are easily evaluated using Monte-Carlo
integration; see below for an example.
Path integrals in quantum mechanics and quantum field theory
2 A computational example
As an immediate example of Monte-Carlo integration, let's compute the volume
of $B^D$, the D-dimensional unit ball. This is the set of all points in D-dimensional
space that lie within unit distance of the origin:

$$ B^D = \big\{ x \in \mathbb{R}^D : |x| < 1 \big\} $$

The characteristic function of $B^D$ is

$$ \chi(x) = \begin{cases} 1, & x \in B^D \\ 0, & \text{otherwise} \end{cases} $$
and, in a high-level language like julia, this function may be implemented in a
single line:

    using LinearAlgebra   # provides norm() in current versions of julia

    function chiBall(x)
        norm(x) < 1 ? 1.0 : 0.0
    end

Note that this implementation works for arbitrary dimensions (the dimension
of the x argument is inferred from its length).
Given the characteristic function $\chi$, the volume of $B^D$ may be computed
according to

$$ V_D = \int_R \chi(x)\, dx $$

where $R$ is any region of $\mathbb{R}^D$ containing the unit ball; for example, $R$ could
be all of $\mathbb{R}^D$, or could alternatively be the D-dimensional hypercube defined by
$\{x : -1 \le x_i \le 1,\ i = 1, \ldots, D\}$.
Here's a julia program that evaluates the Monte-Carlo integration formula
(1) over a hypercubic region.

    #
    # MCIntegrate: integrate func over the hypercube with
    # bounds { Lower[1..Dim], Upper[1..Dim]} using a total
    # of N function samples
    #
    function MCIntegrate(func, Lower, Upper, N)

        Lower = Lower[:];               # convert to column vectors
        Upper = Upper[:];
        Dim = length(Lower);
        Delta = Upper - Lower;
        Jacobian = prod(Delta);         # volume of the hypercube

        Sum = 0.0;
        for n = 1:N
            rv = rand(Dim);             # random vector w/ values in [0:1]
            x = Lower + rv.*Delta;      # random point in hypercube
            Sum += func(x);
        end

        return Jacobian * Sum / N       # volume times average sample value
    end
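To test this program on a simple example, we'll compute the volume of the
three-dimensional ball, which is $4\pi/3 \approx 4.189$.

    julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
    4.2352

    julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
    4.0584

    julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
    4.1448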
Each time we call this routine, we obtain a sample of a random variable whose
mean value is the integral we are trying to compute and whose standard
deviation about that mean decreases like $1/\sqrt{N}$. (These concepts are explained more
fully in the following section.) To give you some graphical intuition for how the
process works, the following plot shows the results of 100 calls to MCIntegrate,
as above, for the two values N = 100 and N = 10000. The dashed line is the true
value of the integral. As you can see, in both cases the process is approximating
the true value of the integral, and increasing the number of function samples by
a factor of 100 reduces the fluctuations (the error in our approximate evaluation
of the integral) by a factor of 10.
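The factor-of-10 reduction can be checked numerically rather than read off a plot. The julia sketch below repeats the ball-volume estimate 100 times for each sample count and compares the standard deviations of the results; to keep it self-contained it defines its own small helpers (hypothetical names, written only for this illustration) and uses a seeded generator so that runs are repeatable.

```julia
using LinearAlgebra, Random, Statistics

chi_ball(x) = norm(x) < 1 ? 1.0 : 0.0      # characteristic function of the unit ball

# One N-sample Monte-Carlo estimate of the volume of the unit ball in D
# dimensions, sampling uniformly from the hypercube [-1, 1]^D.
function mc_ball_volume(D, N, rng)
    hits = 0.0
    for _ in 1:N
        x = 2 .* rand(rng, D) .- 1
        hits += chi_ball(x)
    end
    return 2.0^D * hits / N
end

# Standard deviation of `runs` independent N-sample estimates.
function spread_of_estimates(D, N; runs = 100, seed = 0)
    rng = MersenneTwister(seed)
    return std([mc_ball_volume(D, N, rng) for _ in 1:runs])
end

s_100   = spread_of_estimates(3, 100)
s_10000 = spread_of_estimates(3, 10000)
ratio   = s_100 / s_10000                   # expected to be roughly 10
```

Because each standard deviation is itself estimated from only 100 runs, the ratio fluctuates from seed to seed, but it should land near 10.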
Figure 1: Results of 100 calls to MCIntegrate to compute the volume of the
3-dimensional ball using N = 100 and N = 10000 function samples. (Horizontal axis:
number of MC integration runs; vertical axis: value of integral; the exact value
is shown for comparison.)
3 How it works: deriving the convergence rate of Monte-Carlo integration
To understand the convergence rate of Monte-Carlo integration, we first need
to make a brief foray into the field of random variables.
3.1 Random variables
A good way to think about a random variable x is as a black box with a button
on it. Each time we push the button, the black box spits out a number.[2]
Figure 2: Cartoon depiction of a random variable x as a black box with a button
on it. Each time we hit the button, we get out a sample of x.
If we push the button N times and plot the values of the samples emitted,
we might get something like this:
Figure 3: Values of 300 samples of a random variable x, which in this case are
uniformly distributed throughout [0 : 1]. (Horizontal axis: sample index n;
vertical axis: value of nth sample.)
[2] Think of the little machine at the bank or the driver's-license office on which you push
a button and get out a number telling you your position in the line of people waiting to see
a clerk. One distinction is that in that case the numbers that emerge are integers emitted
in ascending order, whereas with a random variable the numbers that emerge are typically
real-valued and (hopefully!) not organized in any particular sequence.
Suppose we segment the real line $x \in [-\infty, \infty]$ into buckets of width $\Delta$ and
ask, after N presses of the button in Figure 2, how many samples of x fall into
the bucket between 7 and $7 + \Delta$. If we do this for larger and larger values of
N we will find that the fraction of the total number of samples falling into any
one bucket tends to a constant times the width of the bucket:[3]

$$ \lim_{N \to \infty} \frac{\#\ \text{samples of } x \text{ falling in the interval } [7, 7 + \Delta]}{N} = P(7)\, \Delta $$

More generally, we may ask for the fractional number of samples falling into any
interval $[x, x + \Delta]$, and the answer as $N \to \infty$ would tend to $P(x)\Delta$, where $P(x)$
is a number that depends on x. $P(x)$ is called the probability density function or
the probability distribution of the random variable. To be a suitable probability
density function, $P(x)$ must satisfy the conditions

$$ P(x) \ge 0 \;\; \forall x \qquad \text{and} \qquad \int_{-\infty}^{\infty} P(x)\, dx = 1. $$

[3] Strictly speaking this equation is only true in the limit $\Delta \to 0$, but that would be too
many limits to be considering all at once; for now just think of $\Delta$ as a small width.
For the case pictured in Figure 3, we have

$$ P(x) = \begin{cases} 1, & x \in [0, 1] \\ 0, & \text{otherwise} \end{cases} $$

which is known as a uniform distribution; we say that the random variable
x is uniformly distributed in the interval [0 : 1].
System-supplied random-number generators in computers, like the rand
functions in matlab or julia and the drand48 function in the standard c
library, typically produce random numbers uniformly distributed in the interval
[0 : 1]. Later we will discuss how to obtain random numbers distributed with
other densities.
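The bucket-counting picture of a probability density can be tried out directly on the system random-number generator. In the julia sketch below (the function name and the choice of 10 buckets are assumptions), the fraction of samples landing in each bucket is divided by the bucket width to give an estimate of the density, which for `rand` should be close to 1 everywhere on the unit interval.

```julia
using Random

# Estimate a probability density from samples: count the samples falling in
# each of `nbuckets` equal-width buckets on [lo, hi), then divide the fraction
# in each bucket by the bucket width.
function density_estimate(samples, lo, hi, nbuckets)
    Δ = (hi - lo) / nbuckets
    counts = zeros(Int, nbuckets)
    for x in samples
        if lo <= x < hi
            k = min(floor(Int, (x - lo) / Δ) + 1, nbuckets)
            counts[k] += 1
        end
    end
    return counts ./ (length(samples) * Δ)
end

samples = rand(MersenneTwister(0), 10^6)     # uniform samples in [0, 1)
P = density_estimate(samples, 0.0, 1.0, 10)  # each entry should be close to 1
```

Multiplying the entries of `P` by the bucket width and summing gives 1, which is the discrete counterpart of the normalization condition on the density.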
3.2 Mean, variance, standard deviation
The black dashed line in Figure 3 is the average value of all the samples of the
random variable emitted from the black box. This is known as the mean value
of the random variable. For a given probability distribution $P(x)$, the mean
may be computed according to

$$ \begin{aligned} \text{mean} = \mu_x &= \int_{-\infty}^{\infty} x P(x)\, dx \\ &\equiv \langle x \rangle \end{aligned} $$

(where the second line defines some useful shorthand for integrating over
probability distributions). For the probability distribution in Figure 3, we have

$$ \mu_x = \int_0^1 x\, dx = \frac{1}{2} $$

in accordance with our intuition.

The quantity that is key for understanding the convergence of Monte-Carlo
integration is the variance $\sigma_x^2$, defined as

$$ \begin{aligned} \text{variance} = \sigma_x^2 &= \int_{-\infty}^{\infty} (x - \mu_x)^2 P(x)\, dx \\ &\equiv \big\langle (x - \mu_x)^2 \big\rangle \end{aligned} $$

This quantity is measuring how much samples of x deviate from their mean
value. The bigger the value of $\sigma_x^2$, the more the random variable is spread out
or fluctuating about its mean.
Note that the specific quantity $\sigma_x^2$ is actually characterizing something like
the square of the deviations about the mean value. In particular, if the random
variable x has units, like say meters, then $\sigma_x^2$ has units of meters$^2$ and hence
cannot be used directly to measure the spread of the quantity we are trying to
characterize. Instead, the number that you want to have in mind to characterize
the spread of values in a random variable is the square root of the variance,
which is called the standard deviation:

$$ \text{standard deviation} = \sigma_x = \sqrt{\sigma_x^2}. $$

For the uniformly distributed variable of Figure 3, we have

$$ \sigma_x^2 = \int_0^1 \Big( x - \frac{1}{2} \Big)^2 dx = \frac{1}{12} $$

so the standard deviation is $\sigma_x = \sqrt{1/12} \approx 0.29$. You should think of this as the
half-width of the interval around the mean within which most of the fluctuations
of the variable are contained.
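The mean of 1/2 and standard deviation of about 0.29 for the uniform distribution are easy to confirm by sampling. A minimal julia sketch, assuming the standard `Statistics` library and a hypothetical helper name:

```julia
using Random, Statistics

# Sample mean and sample standard deviation of N uniform samples on [0, 1).
function uniform_stats(N; seed = 0)
    x = rand(MersenneTwister(seed), N)
    return mean(x), std(x)
end

m, s = uniform_stats(10^6)   # m should be near 1/2, s near sqrt(1/12) ≈ 0.29
```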
3.3 Sums and averages of random variables
It is easy to obtain new random variables from old. For example, given a random
variable x distributed according to some probability distribution P (x), we could
define a new random variable y by summing two samples of x:
y = x + x.
As in Figure 2, the random variable y may be thought of as a machine with a
button on it, which we can press however many times we like to generate samples
of y. In this case, we can think of this machine as containing within it two copies
of the machine of Figure 2. Hitting the button on the y machine amounts to
hitting the buttons on the two x machines and summing their results.
Figure 4: A random variable y defined as the sum of two random variables x.
Hitting the button on the y machine is like hitting the button on the x machine
twice (or, equivalently, hitting the buttons on two identical x machines) and
summing the results.
The very important fact about random variables defined as sums of random
variables is this:
When we add a random variable to itself N times (that is, sum N independent
samples), its mean value increases by a factor of $N$, but its standard deviation
increases by only a factor of $\sqrt{N}$.
Another way to state this is to consider a random variable defined as the
average of N samples of another random variable (this just means we add the
variable to itself N times and divide the result by N):
When we average N samples of a random variable, its mean value does
not change, but its standard deviation decreases by a factor of $\sqrt{N}$.    (3)
This is easy to prove by going through some calculus manipulations similar
to those we did in the previous section, but intuitively all you need to know can
be grasped from the following plot, which is identical to Figure 3 except that
here we are plotting samples of a random variable $y_{10}$ defined as the average of
10 samples of the random variable x:[4]

$$ y_{10} \equiv \frac{1}{10} \big( x + x + x + x + x + x + x + x + x + x \big) $$

Applying the key result from above, we expect that the mean value and standard
deviation of this variable will be

$$ \mu_{y_{10}} = \mu_x, \qquad \sigma_{y_{10}} = \frac{1}{\sqrt{10}}\, \sigma_x. $$

[4] To understand this variable, think of the cartoon of Figure 4, but with 10 copies of the
x machine instead of just 2, and with a factor of 1/10 multiplying the result on its way out
of the box.
Figure 5: Values of 300 samples of a random variable $y_{10} = \frac{1}{10} \sum_{n=1}^{10} x$ defined
by averaging 10 copies of a random variable x, where x is uniformly distributed
in the interval [0, 1] as in Figure 3. Note that $y_{10}$ has the same mean as the
original x, but the amplitude of its fluctuations about that mean (its standard
deviation) is $\sqrt{10} \approx 3$ times smaller than x (compare Figure 3). (Horizontal axis:
sample index n; vertical axis: value of nth sample.)
By comparing Figures 3 and 5, it's easy to see that by averaging 10
samples of x we have obtained a new random variable whose mean is the same
as that of x, but whose fluctuations about that mean are reduced by a factor of
$\sqrt{10} \approx 3$.
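The same comparison can be made numerically instead of by eye. The julia sketch below (function name and sample counts are assumptions) draws many samples of the average of `Navg` uniform random numbers and reports the mean and standard deviation of those averages.

```julia
using Random, Statistics

# Draw `samples` values of the average of Navg uniform random numbers and
# return the sample mean and sample standard deviation of those averages.
function std_of_average(Navg; samples = 100_000, seed = 0)
    rng = MersenneTwister(seed)
    y = [mean(rand(rng, Navg)) for _ in 1:samples]
    return mean(y), std(y)
end

m1,  s1  = std_of_average(1)    # a single uniform sample: std near 0.29
m10, s10 = std_of_average(10)   # average of 10 samples: std near 0.29 / sqrt(10)
```

The means `m1` and `m10` both come out near 1/2, while the ratio `s1 / s10` comes out near $\sqrt{10} \approx 3.2$.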
3.4 Functions of random variables
Similarly, we could define a new random variable z as the result of operating on

the random variable x with some function f (x)
z = f (x).
This yields a random variable whose cartoon depiction looks something like this:
Figure 6: A random variable z defined as the operation of a function f(x) on
a random variable x. Hitting the button on the z machine is like hitting the
button on the x machine and feeding the result into the function f(x).
It's easy to compute the mean and variance of z:

$$ \mu_z = \int f(x) P(x)\, dx \tag{4a} $$

$$ \sigma_z^2 = \int \big[ f(x) - \mu_z \big]^2 P(x)\, dx \tag{4b} $$
These are quantities that depend on P (x) and f (x), but not on anything else.
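As a quick check of these two formulas, take x uniform on [0, 1] and $f(x) = x^2$ (an example chosen here for illustration). The integrals give a mean of $\int_0^1 x^2\,dx = 1/3$ and a variance of $\int_0^1 (x^2 - 1/3)^2\,dx = 1/5 - 1/9 = 4/45$. The julia sketch below, with a hypothetical helper name, estimates the same two numbers by sampling.

```julia
using Random, Statistics

# Sample mean and variance of z = f(x), with x uniform on [0, 1).
function function_stats(f, N; seed = 0)
    z = f.(rand(MersenneTwister(seed), N))
    return mean(z), var(z)
end

mz, vz = function_stats(x -> x^2, 10^6)   # expect roughly 1/3 and 4/45
```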
3.5 Convergence rate of Monte-Carlo integration
We can now assemble the insights of the previous sections to understand the
convergence rate of Monte-Carlo integration. Consider a scalar function of D
variables, $f(x)$. We will consider the evaluation of

$$ I \equiv \frac{1}{V} \int_R f(x)\, dx \tag{5} $$

where $R$ is some subregion of $\mathbb{R}^D$ and $V$ is the volume of $R$. $I$ is just computing
the average value of f over the subregion $R$. (If you want to compute the integral
of f, not its average value, then just multiply $I$ by $V$.)
Let x be a D-dimensional vector of random variables distributed uniformly
throughout the region $R$. This means that the probability distribution function
$P(x)$ is constant inside $R$ and zero everywhere else:

$$ P(x) = \begin{cases} \dfrac{1}{V}, & x \in R \\ 0, & \text{otherwise.} \end{cases} $$

Given this fact, we can rewrite the integral we are trying to evaluate, equation
(5), in the form

$$ I = \int_{\mathbb{R}^D} f(x) P(x)\, dx \tag{6} $$

where now the integral extends over all of $\mathbb{R}^D$.

But now compare equation (6) to equation (4a). We see that the quantity
we are trying to compute is the mean value of a random variable

$$ \mathcal{I} \equiv f(x), $$

where x is distributed according to $P(x)$. The mean value of this random
variable, by (4a), is

$$ \mu_{\mathcal{I}} = I. $$

The variance of this random variable is given by (4b):

$$ \sigma_{\mathcal{I}}^2 \equiv \int_{\mathbb{R}^D} \big[ f(x) - I \big]^2 P(x)\, dx $$

Of course, we don't know how to compute $\sigma_{\mathcal{I}}^2$, but the point is that it exists
and is just some number that depends on the function f and the region $R$ (which
is what defines $P(x)$).
Finally, consider defining a new random variable $\mathcal{I}_N$ by averaging N samples
of $\mathcal{I}$:

$$ \mathcal{I}_N \equiv \frac{1}{N} \sum_{n=1}^{N} \mathcal{I} = \frac{1}{N} \sum_{n=1}^{N} f(x_n) $$

Note that this is just the prescription we gave in equation (1) for Monte-Carlo
integration (up to the factor of $V$), although we are here interpreting it as the
definition of a random variable.
Invoking the general principle of equation (3), we expect that the mean value
and standard deviation of $\mathcal{I}_N$ will be

$$ \mu_{\mathcal{I}_N} = I, \qquad \sigma_{\mathcal{I}_N} = \frac{1}{\sqrt{N}}\, \sigma_{\mathcal{I}} $$

where, again, $\sigma_{\mathcal{I}}$ is some number that depends on f and $R$ but not on N. The
mean value of $\mathcal{I}_N$ is the quantity we are trying to compute, and its standard
deviation decreases like the inverse square root of N.
Thus, when we use Monte-Carlo integration with N function samples to
estimate an integral, we are evaluating a single sample of a random variable
whose mean value is the integral we are trying to compute and whose standard
deviation decreases like $1/\sqrt{N}$. This explains Figure 1.
3.6 Importance sampling
In some cases we may be trying to integrate a function $g(x)$ that may be
decomposed into a product of factors $g(x) = f(x)P(x)$ where $P(x)$ satisfies the
conditions of a probability density, i.e. $P(x) \ge 0$ and $\int P(x)\, dx = 1$. In this
case, referring back to equation (4a), we interpret $\int g(x)\, dx$ as the mean value of
a random variable $\mathcal{I} = f(x)$ where x is a random variable distributed according
to a nonuniform probability distribution $P(x)$:

$$ \text{if } g(x) = f(x)P(x) \text{ with } P(x) \ge 0,\ \int P(x)\, dx = 1, \quad \text{then} \quad \int_{\mathbb{R}^D} g(x)\, dx \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n) \tag{7} $$

where the $x_n$ are samples of a random variable with probability distribution $P(x)$. This technique
is called importance sampling. For functions g(x) that may be decomposed
in this way it is much better to use (7) than the default Monte-Carlo rule
with uniformly distributed evaluation points x, because the importance-sampled
version will more effectively sample the regions of RD that contribute most to
the integral.
Of course, since computer random-number generators typically produce samples of uniformly-distributed random variables, the question arises of how to
generate samples of random variables distributed with non-uniform densities.
We take up this question in the next section.
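Here is a small worked instance of importance sampling in julia, under assumptions chosen for illustration: the integrand is $g(x) = x^2 e^{-x}$ on $[0, \infty)$, split as $f(x) = x^2$ and $P(x) = e^{-x}$, and the samples are drawn with `randexp` from julia's `Random` standard library, which returns exponentially distributed numbers with unit scale. The exact value of the integral is 2. The helper name `importance_sample` is hypothetical.

```julia
using Random

# Estimate ∫ g(x) dx for g(x) = f(x) P(x) by averaging f over N samples
# drawn from the density P. `draw(rng)` must return one sample from P.
function importance_sample(f, draw, N, rng)
    total = 0.0
    for _ in 1:N
        total += f(draw(rng))
    end
    return total / N
end

# g(x) = x^2 exp(-x) on [0, ∞): f(x) = x^2, P(x) = exp(-x), exact integral 2.
estimate = importance_sample(x -> x^2, randexp, 10^6, MersenneTwister(0))
```

A uniform-sampling estimate of the same integral would first have to truncate the infinite domain and would then spend most of its samples where $e^{-x}$ is negligible.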
3.7 Generating random numbers according to a specified probability distribution
Next suppose we want to compute a sequence of random numbers $\{y_n\}$ that are
distributed according to some non-uniform probability distribution $P(y)$. The
general idea will be to compute a sequence of uniformly distributed random
numbers $\{x_n\}$ and then define $y_n$ to be $f(x_n)$, where $f(x)$ is some function.
Let's determine the relationship between $f(x)$ and $P(y)$.
Suppose we compute some large number of samples $N$. The number of $x$
points falling within an interval $[x, x + \Delta_x]$ is approximately $N\Delta_x$. All of these
points are mapped by our procedure into the interval $[y, y + \Delta_y] = [f(x), f(x + \Delta_x)]$.
This latter interval has width $\Delta_y = \Delta_x |f'(x)|$ (the absolute value arises
because $f(x + \Delta_x)$ may be less than $f(x)$, but we still want to define the width of the
interval to be a positive number). Thus, if we are trying to define the probability
density $P(y)$ such that the number of sample points falling in an interval
$[y, y + \Delta_y]$ is $N P(y)\Delta_y$, we should say

$$N P(y)\,\Delta_y = N \Delta_x$$

or, using $y = f(x)$ and $\Delta_y = |f'(x)|\,\Delta_x$,

$$|f'(x)| = \frac{1}{P(f(x))}.$$

This is a differential equation for the function $f(x)$. For example, suppose
we want to generate points $y$ with distribution $P(y) = e^{-y}$. The differential
equation reads

$$|f'| = \frac{1}{e^{-f}}$$

with solution $f(x) = -\log x$. What this means is this: If $\{x_n\}$ is uniformly
distributed in $[0, 1]$ and we define $y_n = -\log(x_n)$, then $y_n$ is distributed in
$[0, \infty]$ with probability density $P(y) = \exp(-y)$.
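The exponential example can be checked numerically with a short sketch. The function name `sample_exponential` and the two sanity checks (sample mean, and fraction of samples below 1) are assumptions chosen for illustration; for the density $\exp(-y)$ the mean is 1 and the probability of $y < 1$ is $1 - e^{-1}$.

```python
import math
import random

def sample_exponential(n, seed=0):
    """Draw n samples with density exp(-y) on [0, inf) via y = -log(x), x uniform."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1],
    # which is still uniform and avoids evaluating log(0).
    return [-math.log(1.0 - rng.random()) for _ in range(n)]

samples = sample_exponential(100_000)
mean = sum(samples) / len(samples)
frac_below_1 = sum(1 for y in samples if y < 1.0) / len(samples)
print(mean, frac_below_1)   # compare with 1 and 1 - exp(-1)
```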
Volume of the D-dimensional ball
The $D$-dimensional ball $B^D$ is the set of points in $\mathbb{R}^D$ that lie within unit distance
of the origin. Let $V^D$ be the $D$-dimensional volume⁵ of $B^D$. From elementary
geometry we know

$B^1$: $V^1 = 2$ (length of line segment $[-1 : 1]$)
$B^2$: $V^2 = \pi$ (area of unit circle, $\pi r^2$ with $r = 1$)
$B^3$: $V^3 = \frac{4\pi}{3}$ (volume of unit sphere, $\frac{4}{3}\pi r^3$ with $r = 1$)

but how do we extend this table to higher $D$? Earlier in these notes we discussed
how to do this using Monte-Carlo integration. Here we'll discuss how to do the
calculation analytically.⁶
One way is to write

$$V^D = \int_{|\mathbf{x}|<1} d^D x \tag{8}$$

and evaluate the integral in polar coordinates. To get a sense of how to do this,
let the cartesian components of a point $\mathbf{x} \in \mathbb{R}^D$ be $\{x_1, \cdots, x_D\}$ and recall that
two-dimensional polar coordinates are defined by

$$x_1^{2D} = r \sin\phi_1 \tag{9a}$$
$$x_2^{2D} = r \cos\phi_1 \tag{9b}$$

while in three dimensions we have⁷

$$x_1^{3D} = r \sin\phi_1 \sin\phi_2 \tag{10a}$$
$$x_2^{3D} = r \sin\phi_1 \cos\phi_2 \tag{10b}$$
$$x_3^{3D} = r \cos\phi_1 \tag{10c}$$

Comparing (9) and (10), we see that the transition is effected by introducing
one new angle ($\phi_2$) and bifurcating the coordinate $x_1^{2D}$ into two new coordinates
$x_1^{3D}$ and $x_2^{3D}$ defined by

$$x_1^{3D} = x_1^{2D} \sin\phi_2, \qquad x_2^{3D} = x_1^{2D} \cos\phi_2.$$
⁵ In colloquial terms, the $D$-dimensional volume $V^D$ of a $D$-dimensional set is, for the
cases $D = 1, 2, 3$, just what we think of as the length, the area, and the volume of a 1-, 2-,
or 3-dimensional shape. The generalization to higher $D$ is conceptually straightforward, if
somewhat difficult to visualize. If we are measuring distances in, say, meters, then $V^D$ has
units of (meters)$^D$.
⁶ We emphasize that this is the rare example of a high-dimensional region whose volume
can be computed analytically; in most cases, something like Monte-Carlo integration is the
only way to proceed.
⁷ Actually the variable we call $x_1^{3D}$ here is what we usually call $y$, while what we call $x_2^{3D}$
is what we usually call $x$: we have performed this swap to improve the logical presentation of
the formulas.

This procedure may be repeated inductively: we just keep splitting up whichever
coordinate is all sines into two new coordinates, of which one is all sines and
the other is all sines except for one cosine. For example, polar coordinates for
the 7-dimensional sphere are

$$\begin{aligned}
x_1^{7D} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \sin\phi_4 \sin\phi_5 \sin\phi_6 \\
x_2^{7D} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \sin\phi_4 \sin\phi_5 \cos\phi_6 \\
x_3^{7D} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \sin\phi_4 \cos\phi_5 \\
x_4^{7D} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \cos\phi_4 \\
x_5^{7D} &= r \sin\phi_1 \sin\phi_2 \cos\phi_3 \\
x_6^{7D} &= r \sin\phi_1 \cos\phi_2 \\
x_7^{7D} &= r \cos\phi_1
\end{aligned}$$
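The Jacobian of the transition from Cartesian to polar coordinates is

$$dx_1 \cdots dx_D = r^{D-1} \sin^{D-2}\phi_1 \, \sin^{D-3}\phi_2 \cdots \sin\phi_{D-2} \; dr \, d\phi_1 \cdots d\phi_{D-1}$$

and the integral (8) splits up into a product of $D$ factors:

$$V^D = \underbrace{\int_0^1 r^{D-1}\,dr}_{1/D} \; \underbrace{\int_0^\pi \sin^{D-2}\phi_1 \, d\phi_1}_{\sqrt{\pi}\,\Gamma[(D-1)/2]/\Gamma[D/2]} \cdots \underbrace{\int_0^\pi \sin\phi_{D-2} \, d\phi_{D-2}}_{2} \; \underbrace{\int_0^{2\pi} d\phi_{D-1}}_{2\pi}$$

Integrals over powers of sin factors may be evaluated using the $\Gamma$ function.
Working out the general case, we obtain the closed-form analytical expression

$$V^D = \frac{\pi^{D/2}}{\Gamma\left(\frac{D}{2} + 1\right)}.$$

The $\Gamma$ function here may be evaluated to yield more explicit formulas which
differ depending on the parity of $D$:

$$V^{2D} = \frac{\pi^D}{D!}, \qquad V^{2D+1} = \frac{2(2\pi)^D}{(2D+1)!!}.$$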
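The closed-form expression and the two parity-specific formulas can be cross-checked with a few lines of Python. This is only a sketch: the helper names `ball_volume` and `double_factorial` are ours, and `math.gamma` from the standard library plays the role of the $\Gamma$ function.

```python
import math

def ball_volume(dim):
    """Volume of the unit ball in `dim` dimensions: pi^(dim/2) / Gamma(dim/2 + 1)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)

def double_factorial(n):
    """n!! = n (n-2) (n-4) ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

for dim in range(1, 8):
    k, odd = divmod(dim, 2)
    if odd:   # dim = 2k + 1
        explicit = 2 * (2 * math.pi) ** k / double_factorial(dim)
    else:     # dim = 2k
        explicit = math.pi ** k / math.factorial(k)
    print(dim, ball_volume(dim), explicit)
```

The first three rows should reproduce the table of $V^1$, $V^2$, $V^3$ from elementary geometry.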

Numerical Differentiation
Homer Reid
February 25, 2014
Suppose we have a black-box function $f(x)$. We can query this function
for its value at any $x$, and we will get back a number, but we don't have an
analytical formula for $f(x)$. How do we estimate values of $f'(x)$?
Contents
1 Finite-difference approximations of the first derivative
  1.1 Forward differencing
  1.2 Backward differencing
  1.3 Centered differencing
  1.4 Higher-order finite-difference formulas
2 Finite-difference approximations of higher derivatives
3 Finite-differencing of multivariable functions
4 Finite-differencing as matrix-vector multiplication
1 Finite-difference approximations of the first derivative
1.1 Forward differencing
The definition of the derivative of a function $f(x)$ at a point $x$ is

$$f'(x) \equiv \left.\frac{df}{dx}\right|_x = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \tag{1}$$

The simplest approach to numerical differentiation is simply to arrest the limiting process here and evaluate the RHS of (1) at a finite value of $h$. This defines
what is known as the forward-finite-difference (FFD) (or just forward-difference)
approximation to the derivative:

$$f'_{\text{FFD}}(h; x) \equiv \frac{f(x+h) - f(x)}{h}. \tag{2}$$

It's easy to assess the error incurred by the forward-difference procedure. Recall
that the Taylor-series expression for the quantity $f(x + h)$ is

$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + O(h^3)$$

Inserting this into (2), we find

$$f'_{\text{FFD}}(h; x) = f'(x) + \frac{h}{2} f''(x) + O(h^2) \tag{3}$$

The first term on the RHS here is the quantity we are trying to compute, and
everything else is an error term. Thus we have

$$f'_{\text{FFD}}(h; x) - f'(x) = \frac{h}{2} f''(x) + O(h^2) \tag{4}$$

As usual in error analysis, this equation is not useful for giving us an actual
number for the error, because we don't know how to evaluate $f''(x)$. The only
important thing is the $h$ dependence: the error is linear in $h$, i.e. we have a
first-order method. To obtain one more digit of accuracy (i.e. $10\times$ smaller
error) we must use a $10\times$ smaller value of $h$.
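A quick numerical experiment makes the first-order behavior visible. The sketch below is illustrative only: the function name `forward_difference` and the test function ($\sin$ at $x = 1$, whose exact derivative is $\cos 1$) are our own choices.

```python
import math

def forward_difference(f, x, h):
    """Forward-difference estimate of f'(x), as in equation (2)."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)   # exact derivative of sin at x0
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    err = abs(forward_difference(math.sin, x0, h) - exact)
    print(f"h = {h:.0e}   error = {err:.3e}")
```

Each tenfold reduction of $h$ should shrink the printed error by roughly a factor of ten, as predicted by (4).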
1.2 Backward differencing
It may happen that values of $f(x + h)$ are not available for positive $h$, for
example if the point $x$ lies at the right endpoint of the interval
over which our function is computable or measurable. (I mean "measurable" in
the experimental sense, not the sense of Lebesgue integration. Think of $f(x)$ as
a quantity reported by an experimental apparatus on which we can't turn the
dial any further than some $x_{\max}$.) Of course values of $f(x + h)$ must exist for at
least some nonzero range of positive $h$, since otherwise the derivative at $x$ would
not be defined, but those values may not be accessible to us for one reason or
another.
In this case, we can use backward differencing:

$$f'_{\text{BD}}(h; x) \equiv \frac{f(x) - f(x-h)}{h}. \tag{5}$$

It's easy to show that backward-differencing, like forward-differencing, is a first-order method.
1.3 Centered differencing
Consider the Taylor-series expansions of $f(x - h)$ and $f(x + h)$:

$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + O(h^4) \tag{6a}$$
$$f(x-h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + O(h^4) \tag{6b}$$

Careful scrutiny of these equations reveals that by subtracting them and dividing by 2 we can eliminate the second-derivative term (and in fact all even-derivative
terms) in (3):

$$\frac{f(x+h) - f(x-h)}{2} = h f'(x) + \frac{h^3}{6} f'''(x) + O(h^5)$$

Now just divide by $h$ to obtain the centered-difference approximation to the
derivative:

$$f'_{\text{CD}}(x) \equiv \frac{f(x+h) - f(x-h)}{2h} \tag{7}$$

The above analysis shows that

$$f'_{\text{CD}}(x) - f'(x) = \frac{h^2}{6} f'''(x) + O(h^4) \tag{8}$$

Thus centered-differencing is a method of order 2.
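The second-order error estimate (8) can be compared against measured errors. In this sketch the function name `centered_difference` and the choice of test function ($e^x$ at $x = 1$, for which every derivative equals $e$) are assumptions made for illustration.

```python
import math

def centered_difference(f, x, h):
    """Centered-difference estimate of f'(x), as in equation (7)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 1.0
exact = math.exp(x0)   # exp is its own derivative
for h in (1e-1, 1e-2, 1e-3):
    err = abs(centered_difference(math.exp, x0, h) - exact)
    predicted = h * h / 6.0 * math.exp(x0)   # leading term of (8)
    print(f"h = {h:.0e}   error = {err:.3e}   (h^2/6) f''' = {predicted:.3e}")
```

The measured error should track the leading term closely, dropping by about a factor of 100 for each tenfold reduction of $h$.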
1.4 Higher-order finite-difference formulas
Formulas like (2), (5), and (7) are known as finite-difference stencils: they
are linear combinations of n function samples that approximate the derivative
with pth-order convergence. The forward-difference, backward-difference, and
centered-difference stencils have (n, p) = (2, 1), (2, 1), (2, 2) respectively.
By increasing the number of function samples n that we are willing to compute, it is easy to construct finite-difference stencils that achieve any desired
convergence order $p$. All you have to do is write down the Taylor expansions of
the quantities

$$\cdots,\ f(x - 2h),\ f(x - h),\ f(x),\ f(x + h),\ f(x + 2h),\ \cdots$$

and construct clever weighted combinations of these that pick off successively
higher-order terms in the error estimates of equations (3) and (8).
However, we generally don't carry out finite-differencing beyond the centered-difference case. The reason is that in constructing formulas of this type we are
essentially constructing and differentiating a polynomial interpolant through
data samples at uniformly-spaced intervals. As we have noted many times, this
procedure is badly-behaved due to the Runge phenomenon: the more you try
to bend and squeeze a high-order polynomial to fit through evenly-spaced data
points, the more it will bulge out in between the points. If you need a numerical
differentiation stencil that achieves a rapid convergence rate, a better idea is to
use non-uniformly spaced points to construct and differentiate a polynomial interpolant. We will revisit this topic when we consider Chebyshev interpolation
later in the course.
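For reference, one widely used higher-order stencil of this kind combines samples at $x \pm h$ and $x \pm 2h$ to cancel the $h^2$ error term, giving fourth-order convergence. The sketch below is not taken from these notes; the function name `five_point_difference` and the test function are our own, and the weights are the standard ones for this stencil.

```python
import math

def five_point_difference(f, x, h):
    """Fourth-order centered estimate of f'(x) from samples at x +/- h and x +/- 2h."""
    return (f(x - 2 * h) - 8 * f(x - h) + 8 * f(x + h) - f(x + 2 * h)) / (12.0 * h)

x0 = 1.0
exact = math.cos(x0)   # exact derivative of sin at x0
for h in (0.2, 0.1, 0.05):
    err = abs(five_point_difference(math.sin, x0, h) - exact)
    print(f"h = {h:<5}  error = {err:.3e}")
```

Halving $h$ should reduce the error by roughly a factor of $2^4 = 16$.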
2 Finite-difference approximations of higher derivatives
We can play similar games to write down approximate formulas for higher
derivatives. For example, go back to equations (6) and suppose that we add
the two equations together instead of subtracting them:

$$f(x+h) + f(x-h) = 2f(x) + h^2 f''(x) + \frac{h^4}{12} f''''(x) + \cdots$$

Clearly all we have to do is subtract off $2f(x)$ and divide by $h^2$ to obtain an
approximation to the second derivative:

$$f''_{\text{CD}}(h; x) \equiv \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} = f''(x) + \frac{h^2}{12} f''''(x) + \cdots \tag{9}$$

We call this the centered-difference approximation to the second derivative;
evidently it achieves second-order convergence.
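The stencil (9) is easy to try out. The following sketch uses a function name of our own choosing, `second_difference`, and tests it on $\sin$ at $x = 1$, where the exact second derivative is $-\sin 1$.

```python
import math

def second_difference(f, x, h):
    """Centered-difference estimate of f''(x), as in equation (9)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

x0 = 1.0
exact = -math.sin(x0)   # second derivative of sin at x0
for h in (0.1, 0.05, 0.025):
    err = abs(second_difference(math.sin, x0, h) - exact)
    print(f"h = {h:<6}  error = {err:.3e}")
```

Halving $h$ should cut the error by about a factor of 4, consistent with second-order convergence. For very small $h$ the division by $h^2$ amplifies rounding error, so $h$ should not be pushed arbitrarily close to zero.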
3 Finite-differencing of multivariable functions
Next suppose we want to differentiate a function of more than one variable, say
$f(x, y)$.
If we are only interested in partial derivatives with respect to a single variable, we can just apply the formulas for the one-dimensional case with the other
variables held fixed. For example:

$$\left.\frac{\partial f}{\partial x}\right|_{(x,y)} \approx \frac{f(x+h, y) - f(x, y)}{h} \qquad \text{(first-order convergence)}$$

$$\left.\frac{\partial f}{\partial y}\right|_{(x,y)} \approx \frac{f(x, y+h) - f(x, y-h)}{2h} \qquad \text{(second-order convergence)}$$

$$\left.\frac{\partial^2 f}{\partial x^2}\right|_{(x,y)} \approx \frac{f(x-h, y) - 2f(x, y) + f(x+h, y)}{h^2} \qquad \text{(second-order convergence)}$$
However, things get a little more interesting when we go to compute mixed
partial derivatives. Consider, for example, the simultaneous double Taylor expansion of $f(x, y)$:

$$f(x + \Delta_x, y + \Delta_y) = f(x, y) + \Delta_x f_x(x, y) + \Delta_y f_y(x, y) + \frac{\Delta_x^2}{2} f_{xx}(x, y) + \Delta_x \Delta_y f_{xy}(x, y) + \frac{\Delta_y^2}{2} f_{yy}(x, y) + O(\Delta^3)$$

By writing out this equation for various possible choices of $\Delta_x$ and $\Delta_y$ and
taking linear combinations of the results, it is possible to kill off various terms
on the RHS to obtain stencils for various partial derivatives. You will explore
this possibility in your problem set this week.
Figure 1: A set of $N = 5$ equally-spaced points in the interior of an interval
$[a, b]$. The spacing between the points is $h = (b - a)/(N + 1)$.
4 Finite-differencing as matrix-vector multiplication
Consider an interval $[x_a, x_b]$ and a function $f(x)$ that vanishes at the endpoints,
i.e. $f(x_a) = f(x_b) = 0$. Suppose we have samples of $f$ at a set of $N$ evenly-spaced points between $x_a$ and $x_b$. More specifically, break up the interval into
$N + 1$ segments of width

$$h = \frac{x_b - x_a}{N + 1}$$

and denote the endpoints of these intervals and the values of $f$ at those points
by

$$x_n = x_a + nh, \qquad f_n \equiv f(x_n), \qquad n = 1, 2, \cdots, N.$$

(For convenience we will also use the notation $x_0 = x_a$ and $x_{N+1} = x_b$.)
Now suppose we try to compute the second derivative of $f$ at the points $x_n$
using the second-order finite-difference stencil (9). We find

$$f''(x_1) \approx \frac{1}{h^2}\Big[ f(x_0) - 2f(x_1) + f(x_2) \Big] = \frac{1}{h^2}\Big[ -2f(x_1) + f(x_2) \Big] \tag{10a}$$

(where we used the boundary condition $f(x_0) = 0$)

$$f''(x_2) \approx \frac{1}{h^2}\Big[ f(x_1) - 2f(x_2) + f(x_3) \Big] \tag{10b}$$
$$f''(x_3) \approx \frac{1}{h^2}\Big[ f(x_2) - 2f(x_3) + f(x_4) \Big] \tag{10c}$$
$$\vdots$$
$$f''(x_{N-1}) \approx \frac{1}{h^2}\Big[ f(x_{N-2}) - 2f(x_{N-1}) + f(x_N) \Big] \tag{10d}$$
$$f''(x_N) \approx \frac{1}{h^2}\Big[ f(x_{N-1}) - 2f(x_N) \Big] \tag{10e}$$

where in the last line we used the boundary condition $f(x_{N+1}) = 0$.

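It's convenient to write equations (10) in the form of a matrix-vector product:

$$\begin{pmatrix} f''_1 \\ f''_2 \\ f''_3 \\ \vdots \\ f''_{N-1} \\ f''_N \end{pmatrix}
\approx \frac{1}{h^2}
\begin{pmatrix}
-2 & 1 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
0 & 1 & -2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{pmatrix}
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix} \tag{11}$$

which we may write using matrix-vector notation in the form

$$\mathbf{f}'' = \mathbf{A}\mathbf{f} \tag{12}$$

where $\mathbf{f}$ and $\mathbf{f}''$ are the vectors of samples of $f$ and of samples of the second derivative of $f$, and $\mathbf{A}$ is the matrix in (11) including its $1/h^2$ prefactor.
The point of equations (11) and (12) is that the operation that takes $\mathbf{f}$ into $\mathbf{f}''$
may be thought of as matrix multiplication. Among the important consequences
of this observation is that it makes it easy to invert the operation, i.e. to obtain
$\mathbf{f}$ from $\mathbf{f}''$:

$$\mathbf{f} = \mathbf{A}^{-1}\mathbf{f}'' \tag{13}$$

The primary use of formulas like (13) is in the application of finite-difference
differentiation to the solution of boundary-value problems and higher-order
PDEs; the technique is known in the PDE world as the finite-difference method.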
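To see the matrix form of the second-derivative stencil in action, here is a minimal pure-Python sketch. The helper names `second_derivative_matrix` and `matvec`, the interval $[0, 1]$, and the test function $\sin(\pi x)$ (which vanishes at both endpoints and satisfies $f'' = -\pi^2 f$) are assumptions made for this example. A dense list-of-lists matrix is used purely for clarity; in practice one would exploit the tridiagonal structure.

```python
import math

def second_derivative_matrix(n, h):
    """N x N tridiagonal matrix with rows (1, -2, 1), scaled by 1/h^2."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = -2.0 / h**2
        if i > 0:
            a[i][i - 1] = 1.0 / h**2
        if i < n - 1:
            a[i][i + 1] = 1.0 / h**2
    return a

def matvec(a, v):
    """Plain matrix-vector product."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

n = 49
h = 1.0 / (n + 1)                                  # interval [0, 1]
xs = [(i + 1) * h for i in range(n)]               # interior points x_1 ... x_N
f = [math.sin(math.pi * x) for x in xs]            # vanishes at both endpoints
f2 = matvec(second_derivative_matrix(n, h), f)     # approximate second derivative
max_err = max(abs(d + math.pi**2 * fx) for d, fx in zip(f2, f))
print(max_err)
```

The printed maximum error should be small compared with $\pi^2 \approx 9.87$ and should shrink like $h^2$ as $N$ grows.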
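Extension to nontrivial boundary conditions
In the example above, we used the boundary conditions $f(x_a) = f(x_b) = 0$.
This simplified equations (10a) and (10e). What if instead we have nontrivial
boundary conditions

$$f(x_a) = f_a, \qquad f(x_b) = f_b$$

where $f_a, f_b$ are some given numbers? In this case, equations (10a) and (10e)
are respectively modified to read

$$f''(x_1) \approx \frac{1}{h^2}\Big[ f_a - 2f(x_1) + f(x_2) \Big] \tag{14a}$$
$$f''(x_N) \approx \frac{1}{h^2}\Big[ f(x_{N-1}) - 2f(x_N) + f_b \Big] \tag{14b}$$

and equation (11) is modified to look like this:

$$\underbrace{\begin{pmatrix} f''_1 \\ f''_2 \\ f''_3 \\ \vdots \\ f''_{N-1} \\ f''_N \end{pmatrix}}_{\mathbf{f}''}
- \frac{1}{h^2}
\underbrace{\begin{pmatrix} f_a \\ 0 \\ 0 \\ \vdots \\ 0 \\ f_b \end{pmatrix}}_{\boldsymbol{\phi}}
\approx
\underbrace{\frac{1}{h^2}
\begin{pmatrix}
-2 & 1 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
0 & 1 & -2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{pmatrix}}_{\mathbf{A}}
\underbrace{\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}}_{\mathbf{f}} \tag{15}$$

What we have done here is to swing the terms involving $f_a$ and $f_b$ in (14) over to
the other side of the equation in (15), that is, away from the side containing the
unknowns and onto the side on which the known quantities reside. Note that
the matrix $\mathbf{A}$ and the vectors $\mathbf{f}$, $\mathbf{f}''$ in this equation are unchanged from equation
(11). All that happens is that the RHS of equation (13) is now augmented by
an additional term:

$$\mathbf{f} = \mathbf{A}^{-1}\left[ \mathbf{f}'' - \frac{1}{h^2}\boldsymbol{\phi} \right] \tag{16}$$

where $\boldsymbol{\phi}$ is the sparse vector in (15) containing the boundary values of $f$.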
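A sketch of how the boundary-value solve might look in practice follows. Everything specific here is an assumption for illustration: the test problem $f'' = e^x$ on $[0, 1]$ with $f(0) = 1$, $f(1) = e$ (exact solution $e^x$), the function name `solve_tridiagonal`, and the use of the Thomas algorithm (a standard $O(N)$ elimination for tridiagonal systems) in place of forming an explicit matrix inverse.

```python
import math

def solve_tridiagonal(rhs):
    """Solve T f = rhs for the tridiagonal T with rows (1, -2, 1) (Thomas algorithm)."""
    n = len(rhs)
    c = [0.0] * n   # modified superdiagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = 1.0 / -2.0
    d[0] = rhs[0] / -2.0
    for i in range(1, n):               # forward elimination
        denom = -2.0 - c[i - 1]
        c[i] = 1.0 / denom
        d[i] = (rhs[i] - d[i - 1]) / denom
    f = [0.0] * n
    f[-1] = d[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        f[i] = d[i] - c[i] * f[i + 1]
    return f

n = 99
h = 1.0 / (n + 1)
xs = [(i + 1) * h for i in range(n)]    # interior points of [0, 1]
fa, fb = 1.0, math.e                    # boundary values f(0), f(1)

# With A = T / h^2, the system A f = f'' - phi / h^2 becomes T f = h^2 f'' - phi.
rhs = [h * h * math.exp(x) for x in xs]
rhs[0] -= fa
rhs[-1] -= fb

f = solve_tridiagonal(rhs)
max_err = max(abs(fi - math.exp(x)) for fi, x in zip(f, xs))
print(max_err)
```

Only the first and last entries of the right-hand side are touched by the boundary values, which mirrors the sparsity of the boundary vector.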

Numerical Integration, Part 1
Homer Reid
February 11, 2014
Contents
1 Numerical quadrature
2 Newton-Cotes rules
  2.1 The rectangular rule
  2.2 The trapezoidal rule
  2.3 Higher-order Newton-Cotes rules
  2.4 Heuristic error analysis
  2.5 Results
3 Miscellaneous points about numerical integration
  3.1 Integration over infinite intervals
  3.2 Integrable singularities
  3.3 Adaptive quadrature
A Nomenclature for Newton-Cotes rules of various orders
1 Numerical quadrature
Numerical integration is the art of approximating definite integrals by finite
sums,

$$\int_a^b f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n) \tag{1}$$

where the sample points xn are some set of N points lying in the interval
[a, b], and wn are an appropriately chosen set of weight coefficients. Numerical
integration is also known as numerical quadrature, and the sets {xn } and {wn }
are known as the quadrature points and the quadrature weights. An algorithm
for choosing {xn } and {wn } is known as a quadrature rule.
The name of the game in numerical quadrature is to obtain accurate estimates of the integral in (1) with the smallest possible number of function
samples. Think of f (x) as an experimentally measured quantity that may take
minutes or hours to evaluate. If your quadrature rule requires you to sample
$f(x)$ at $10^6$ points to get a decent estimate of the integral in (1), your project
will be hopeless. We will find that unsophisticated quadrature rules may indeed
require millions of points to yield decent accuracy, but with a little theoretical
sophistication it is possible to do much better, obtaining 6 or more digits of
accuracy with a few dozens or hundreds of samples.
Our study of numerical quadrature in 18.330 will unfold in multiple installments:
• We begin in these notes with the simplest approaches to constructing
  quadrature rules. We show that these rules work, but in most cases do
  not deliver particularly outstanding performance in the cost-vs-accuracy
  department (and, for this reason, are not commonly used in practice). We
  offer a simple error analysis to suggest why this might be. (We also note
  for future reference a couple of interesting cases in which the simple rules
  do yield excellent performance.)
• Then, later in the course (well into Unit 2), after we have introduced
  some more of the necessary theoretical background, we will discuss more
  sophisticated approaches to constructing quadrature rules. These rules
  do generally yield good performance and are commonly used in numerical
  practice.
• Meanwhile, there are several general points to be made about numerical
  quadrature that do not depend on the choice of quadrature rule. These
  points are addressed in these notes and remain equally valid later, after
  we have introduced more sophisticated quadrature schemes.
Thus, the content of Section 2 of these notes should not be taken as a statement of "this is how we recommend you do numerical quadrature in practice."
Instead, think of this section as the first step in a journey that will culminate in
a discussion of the "right" way to do numerical quadrature. On the other hand,
the more general points made in Section 3 of these notes will not be superseded
by subsequent developments.
2 Newton-Cotes rules
One way to approximate $\int_a^b f(x)\,dx$ is to find a polynomial $P(x)$ that approximates $f$ on the interval $[a, b]$ and integrate that instead, this being easy to do
since we know how to integrate polynomials. This approach leads to the quadrature rules known as Newton-Cotes rules. There is a hierarchy of Newton-Cotes
rules indexed by the degree of the polynomial; the $p$th-order Newton-Cotes rule
uses a $p$th degree polynomial.
Of course, if $f$ is not a polynomial itself then it won't be easy to find a single
polynomial that approximates $f$ over the whole interval. Instead, we consider
subdividing the interval into $N$ subintervals; on each subinterval we approximate
$f$ by a different polynomial chosen appropriately to mimic the behavior of $f$ on
that subinterval. The smaller we make the subintervals, the more accurately we
will be able to approximate $f$ by a polynomial, and thus the more accurate will
be our approximation of the integral.¹
2.1 The rectangular rule
In elementary calculus we learn to understand integration in terms of Riemann
sums: we approximate the shape under the curve $f(x)$ as a union of rectangles;
the area under the curve is approximately the sum of the areas of the rectangles.
Then we consider a limiting process in which the width of the rectangles goes
to zero, and the number of rectangles becomes infinite.
The simplest approach to numerical quadrature is to arrest this limiting
process at some finite number of rectangles $N$. The combined area of those $N$
rectangles then gives us our approximation to the integral. This may be thought
of as the zeroth-degree Newton-Cotes rule, in which we approximate $f(x)$ on
each interval by a zeroth-order polynomial (that is, a constant). The resulting
quadrature rule is known as the rectangular rule, and it is pictured graphically
in Figure 1 for the case $N = 4$.
¹ Technically, the quadrature rule obtained by subdividing an interval into subintervals and
applying a $p$-th order Newton-Cotes rule to each subinterval is known as a composite $p$-th order
Newton-Cotes rule, as distinct from the basic Newton-Cotes rule which does not subdivide.
The basic $p$th-order Newton-Cotes rule for an interval uses precisely $p + 1$ function samples
in that interval, while the composite rule with $M$ subintervals uses approximately $Mp$ total
function samples. However, in practice the phrase "Newton-Cotes quadrature" almost always
refers to the use of composite Newton-Cotes rules, and the adjective "composite" is generally
omitted.
Figure 1: The area of the figure comprised of the shaded rectangles defines the
$N = 4$ rectangular-rule approximation to $\int_a^b f(x)\,dx$.
If we subdivide the interval $[a, b]$ into $N$ equal-length subintervals, each
subinterval has width $\Delta = \frac{b-a}{N}$. The width of each rectangle in Figure 1 is $\Delta$.
The height of the leftmost rectangle in Figure 1 is $f(a)$; thus this rectangle has area
$A = f(a)\Delta$. The height of the second-to-leftmost rectangle is $f(a + \Delta)$, so
this rectangle has area $A = f(a + \Delta)\Delta$. Proceeding in this way and summing
the areas of all the rectangles, the $N = 4$ rectangular-rule approximation to our
integral is

$$I^{\text{rect}}_{N=4} = f(a)\Delta + f(a + \Delta)\Delta + f(a + 2\Delta)\Delta + f(a + 3\Delta)\Delta.$$

More generally, the $N$-point rectangular-rule approximation to $\int_a^b f(x)\,dx$ is

$$I^{\text{rect}}_N = \Delta \sum_{n=0}^{N-1} f(a + n\Delta). \tag{2}$$

Note that this is a quadrature rule of the general form (1): The quadrature
weights are $w_n = \Delta$ (the same weight for all $n$), and the quadrature points are
$x_n = a + n\Delta$ for $n = 0, 1, \cdots, N - 1$.² (Note that $f(b)$ is not referenced by this
rule.)
² Alternatively, we could define the quadrature points to be $x_n = a + (n - 1)\Delta$ for
$n = 1, \cdots, N$. I find this slightly more cumbersome, but it agrees better with the convention of
equation (1).
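The rectangular rule translates almost line for line into code. In this sketch the function name `rectangular_rule` and the test integral $\int_0^1 e^x\,dx = e - 1$ are our own illustrative choices.

```python
import math

def rectangular_rule(f, a, b, n):
    """N-point rectangular rule, equation (2): left-endpoint samples, weight delta."""
    delta = (b - a) / n
    return delta * sum(f(a + k * delta) for k in range(n))

exact = math.e - 1.0   # integral of exp(x) over [0, 1]
for n in (10, 100, 1000):
    print(n, abs(rectangular_rule(math.exp, 0.0, 1.0, n) - exact))
```

The printed errors should fall by roughly a factor of ten each time $N$ grows tenfold.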
Figure 2: The area of the figure comprised of the shaded trapezoids defines the
$N = 4$ trapezoidal-rule approximation to $\int_a^b f(x)\,dx$.
2.2 The trapezoidal rule
A quick glance at Figure 1 shows that the rectangles are a crude approximation to the area under the curve. We can obtain a better approximation by
considering trapezoids instead of rectangles. A trapezoid is basically just a rectangle, except that its upper edge is slanted into a straight line connecting the
values of f (x) at the left and right endpoints of the interval. This is illustrated
in Figure 2. One way to interpret Figure 2 is to say that in each interval we
are approximating f by a first-order polynomial (a line), so this is a first-order
Newton-Cotes rule. It is known as the trapezoidal rule.
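To write down the formula for the trapezoidal rule, note that the areas of
the trapezoids in Figure 2 are

$$\text{area of leftmost trapezoid} = \frac{1}{2}\Big[ f(a) + f(a + \Delta) \Big]\Delta$$
$$\text{area of second-to-leftmost trapezoid} = \frac{1}{2}\Big[ f(a + \Delta) + f(a + 2\Delta) \Big]\Delta$$

etc. Continuing in this way and summing up the area of all the trapezoids, the
$N = 4$ trapezoidal-rule approximation to our integral is

$$\begin{aligned}
I^{\text{trap}}_{N=4} ={}& \frac{1}{2}\Big[ f(a) + f(a + \Delta) \Big]\Delta \\
&+ \frac{1}{2}\Big[ f(a + \Delta) + f(a + 2\Delta) \Big]\Delta \\
&+ \frac{1}{2}\Big[ f(a + 2\Delta) + f(a + 3\Delta) \Big]\Delta \\
&+ \frac{1}{2}\Big[ f(a + 3\Delta) + f(b) \Big]\Delta
\end{aligned}$$

which we may write in collapsed form as

$$I^{\text{trap}}_{N=4} = \frac{1}{2} f(a)\Delta + f(a + \Delta)\Delta + f(a + 2\Delta)\Delta + f(a + 3\Delta)\Delta + \frac{1}{2} f(b)\Delta.$$

More generally, the $N$-point trapezoidal-rule approximation to $\int_a^b f(x)\,dx$ is

$$I^{\text{trap}}_N = \sum_{n=1}^{N-1} f(a + n\Delta)\Delta + \frac{\Delta}{2} f(a) + \frac{\Delta}{2} f(b). \tag{3}$$

Again, this is a quadrature rule of the general form (1). The quadrature points
are $x_n = a + n\Delta$ for $n = 0, \cdots, N$. (Note that there is one more quadrature
point than in the rectangular rule, namely, the point $x = b$.) The weights are

$$w_n = \begin{cases} \dfrac{\Delta}{2}, & n = 0, N \\[1ex] \Delta, & n = 1, 2, \cdots, N - 1. \end{cases}$$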
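A direct transcription of the trapezoidal rule, with half weights at the two endpoints and full weights at the interior points, might look as follows. The function name `trapezoidal_rule` and the test integral $\int_0^1 e^x\,dx = e - 1$ are assumptions made for illustration.

```python
import math

def trapezoidal_rule(f, a, b, n):
    """Trapezoidal rule on n subintervals: interior weights delta, endpoint weights delta/2."""
    delta = (b - a) / n
    interior = sum(f(a + k * delta) for k in range(1, n))
    return delta * (interior + 0.5 * f(a) + 0.5 * f(b))

exact = math.e - 1.0   # integral of exp(x) over [0, 1]
for n in (10, 100, 1000):
    print(n, abs(trapezoidal_rule(math.exp, 0.0, 1.0, n) - exact))
```

Here a tenfold increase in $N$ should reduce the error by roughly a factor of 100.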
Figure 3: For a third-order Newton-Cotes rule, we write down the unique cubic
polynomial that agrees with $f(x)$ at 4 evenly spaced points throughout one
subinterval (dashed curve). The area under this curve (blue shape) is the third-order Newton-Cotes approximation to the integral of $f$ over this subinterval.
2.3 Higher-order Newton-Cotes rules
The rectangular and trapezoidal rules correspond, respectively, to the zeroth-degree and first-degree Newton-Cotes rules. It is possible to continue this process. To derive the general $p$-th order Newton-Cotes rule, we proceed as follows:
1. Consider $p+1$ evenly spaced points in the subinterval $[x_n, x_{n+1}]$, including
   the endpoints. For example, for $p = 2$ we consider

   $$x_n, \qquad \frac{1}{2}\Big[ x_n + x_{n+1} \Big], \qquad x_{n+1};$$

   for $p = 3$ we consider

   $$x_n, \qquad \frac{1}{3}\Big[ 2x_n + x_{n+1} \Big], \qquad \frac{1}{3}\Big[ x_n + 2x_{n+1} \Big], \qquad x_{n+1};$$

   etc.
2. Find the unique $p$th order polynomial $P(x)$ that agrees with $f(x)$ at the
   above points. For $p = 0$ (rectangular rule) this is just the constant $P(x) = f(x_n)$.
   For $p = 1$ (trapezoidal rule) $P(x)$ is the line connecting $f(x_n)$ to
   $f(x_{n+1})$. For $p = 2$, $P(x)$ is the unique parabola running through the
   three points

   $$\big( x_n, f(x_n) \big), \qquad \big( x_m, f(x_m) \big), \qquad \big( x_{n+1}, f(x_{n+1}) \big) \qquad \text{where } x_m \equiv \frac{x_n + x_{n+1}}{2}.$$

   In every case, $P(x)$ is a polynomial whose coefficients are linear combinations of function samples $f(x)$.
3. Integrate $P(x)$ from $x_n$ to $x_{n+1}$. This yields an approximate expression
   relating the integral of $f(x)$ over the subinterval to a linear combination
   of values of $f$ at the sample points.
4. Finally, combine the quadrature rules for all subintervals to obtain a composite quadrature rule for the overall interval.
The second-order Newton-Cotes rule is called Simpson's rule, for reasons
discussed in the Appendix.
Newton-Cotes rules for high orders are not useful because of the Runge phenomenon: when we try to fit a high-order polynomial through evenly-spaced
data points, the polynomial tends to blow up in the regions between the points,
killing the accuracy of our integral estimate. We previewed the Runge phenomenon in our invitation lecture on the first day of class, and we will discuss
it again in our unit on interpolation.
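As a concrete instance of the recipe for $p = 2$: integrating the parabola through the two endpoints and the midpoint of a subinterval of width $\Delta$ gives the well-known weights $\frac{\Delta}{6}\,(1, 4, 1)$. The sketch below applies that result on each subinterval; the function name `simpson_rule` and the test integrands are our own choices.

```python
import math

def simpson_rule(f, a, b, n):
    """Composite Simpson's rule with n subintervals (2n + 1 distinct sample points)."""
    delta = (b - a) / n
    total = 0.0
    for k in range(n):
        left = a + k * delta
        right = left + delta
        mid = 0.5 * (left + right)
        # Exact integral of the parabola through the three samples on this subinterval.
        total += delta / 6.0 * (f(left) + 4.0 * f(mid) + f(right))
    return total

print(simpson_rule(lambda x: x**3, 0.0, 2.0, 3))                    # exact value is 4
print(abs(simpson_rule(math.exp, 0.0, 1.0, 10) - (math.e - 1.0)))   # error for exp on [0, 1]
```

For simplicity this version re-evaluates $f$ at shared subinterval endpoints; a production implementation would reuse those samples.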
2.4 Heuristic error analysis
A heuristic, hand-wavy error analysis of the Newton-Cotes rules goes something
like the following. This analysis is in fact more of a mnemonic device designed
to remind us of how to arrive at the correct error estimate; it should not be
mistaken for a rigorous demonstration.
• Consider a single subinterval of width $\Delta$. Within this interval, we approximate $f(x)$ by a $p$th-degree polynomial $P$, where

  $$P(x) = C_0 + C_1 x + \cdots + C_p x^p.$$

• There exists³ a point $x_0$ within the interval such that the first $p + 1$ terms
  in the Taylor-series expansion of $f$ about $x_0$ agree with $P(x)$. In other
  words, we can write $f$ in the form

  $$f(x) = C_0 + C_1 x + \cdots + C_p x^p + C_{p+1} x^{p+1} + \cdots$$

• This means that the difference between $f(x)$ and $P(x)$ is a polynomial
  that starts at order $p + 1$:

  $$f(x) - P(x) = C_{p+1} x^{p+1} + C_{p+2} x^{p+2} + \cdots$$

  and thus the error in our approximate evaluation of the integral over this
  subinterval is

  $$\text{error in this subinterval} = \int \Big[ f(x) - P(x) \Big]\,dx = \int \Big[ C_{p+1} x^{p+1} + \cdots \Big]\,dx \propto \Delta^{p+2} + \text{higher order terms}$$

  since integrating $x^{p+1}$ over an interval of width $\Delta$ gives us something
  proportional to $\Delta^{p+2}$.
Thus we conclude that the error in each subinterval is proportional to $\Delta^{p+2}$.
Of course, $\Delta$ is related to $N$ (the number of subintervals) according to $\Delta \propto \frac{1}{N}$,
and thus the result of the above analysis is

$$\text{error per subinterval} \propto \frac{1}{N^{p+2}}.$$

On the other hand, we have a total of $N$ subintervals, and we must add together
all the errors in all the subintervals to get the total error; this gives us an extra
factor of $N$ upstairs, so we find

$$\text{total error} \propto N \cdot \frac{1}{N^{p+2}} = \frac{1}{N^{p+1}}.$$

In particular, for the rectangular ($p = 0$) and trapezoidal ($p = 1$) rules, we find
the expected convergence rates

$$\text{error in rectangular rule} \propto \frac{1}{N}, \qquad \text{error in trapezoidal rule} \propto \frac{1}{N^2}.$$

Our analysis does not furnish the constant of proportionality, but that's OK
because here we are only concerned with the dependence on $N$.
³ For example, if $p = 0$ (rectangular rule) we just take $x_0$ to be the left endpoint of the
interval. For $p = 1$, the existence of the point $x_0$ is guaranteed by the mean value theorem:
There is a point between $x_n$ and $x_{n+1}$ at which the derivative of $f$ equals the slope of the
straight line connecting $f(x_n)$ to $f(x_{n+1})$. Similar arguments hold for higher $p$ values.
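The predicted rates can be checked empirically by comparing the errors at $N$ and $2N$: if the error behaves like $1/N^q$, then $\log_2$ of the error ratio estimates $q$. The sketch below is self-contained; the helper `observed_order` and the test integral $\int_0^1 e^x\,dx$ are assumptions chosen for illustration.

```python
import math

def rectangular_rule(f, a, b, n):
    """Left-endpoint rectangular rule with n subintervals."""
    delta = (b - a) / n
    return delta * sum(f(a + k * delta) for k in range(n))

def trapezoidal_rule(f, a, b, n):
    """Trapezoidal rule with n subintervals."""
    delta = (b - a) / n
    interior = sum(f(a + k * delta) for k in range(1, n))
    return delta * (interior + 0.5 * f(a) + 0.5 * f(b))

def observed_order(rule, f, a, b, exact, n=100):
    """Estimate q in error ~ 1/N^q from the errors at N and 2N."""
    e1 = abs(rule(f, a, b, n) - exact)
    e2 = abs(rule(f, a, b, 2 * n) - exact)
    return math.log(e1 / e2) / math.log(2.0)

exact = math.e - 1.0   # integral of exp(x) over [0, 1]
q_rect = observed_order(rectangular_rule, math.exp, 0.0, 1.0, exact)
q_trap = observed_order(trapezoidal_rule, math.exp, 0.0, 1.0, exact)
print(q_rect, q_trap)  # expected to be close to 1 and 2 (p + 1 for p = 0, 1)
```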
2.5 Results
Figure 4 plots the convergence of the rectangular and trapezoidal rules for various values of $N$ in an approximation of the integral

$$I = \int_1^2 \log^2 x \, dx.$$

The quantity plotted in the figure is

$$\epsilon^{\text{rel}} \equiv \frac{\left| I_N^{\text{approx}} - I^{\text{exact}} \right|}{I^{\text{exact}}}.$$
[Plot omitted: relative error (vertical axis, 1e-11 to 0.1) versus N (horizontal axis, 100 to 100000).]
Figure 4: Convergence of the trapezoidal and rectangular rules vs. the number
of intervals for evaluating the integral $\int_1^2 \log^2 x \, dx$.
This figure clearly demonstrates the expected convergence of both methods:

As N increases over three orders of magnitude, the error in the rectangular rule
decreases by three orders of magnitude, while the error in the trapezoidal rule
decreases by six orders of magnitude.
This figure also demonstrates the weaknesses of these approaches: Even
for this extremely smooth, well-behaved function, the trapezoidal rule requires
almost 1000 function samples to achieve 6-digit accuracy, while the rectangular
rule has still not achieved 6-digit accuracy even at $N = 10^6$ samples. The more
sophisticated quadrature rules we introduce later will achieve 6-digit accuracy
with many fewer function evaluations.
3 Miscellaneous points about numerical integration
As noted at the beginning of these notes, the Newton-Cotes approach is generally not the best method for deriving quadrature rules, and we will eventually
replace it with better strategies that achieve superior accuracy-vs-cost performance. On the other hand, already at this point there are certain general points
we can make about numerical quadrature that will remain valid even after we
have moved on to more sophisticated quadrature rules.
3.1 Integration over infinite intervals
To handle improper integrals like

$$\int_0^\infty f(x)\,dx$$

we simply make a change of variables that maps the interval $[0, \infty]$ to a finite
interval. There are many possible choices of map. One popular example is

$$x = \frac{u}{1 - u}, \qquad dx = \frac{du}{(1 - u)^2}$$

under which the interval $x \in [0, \infty]$ is mapped into $u \in [0, 1]$. Our improper
integral is then transformed into a proper integral:

$$\int_0^\infty f(x)\,dx = \int_0^1 f\left( \frac{u}{1 - u} \right) \frac{du}{(1 - u)^2}.$$

Note that the integral on the RHS appears to have a quadratic singularity as
$u \to 1$. However, as $u \to 1$ the argument of $f$ is approaching $\infty$, and $f$ must
vanish at $\infty$ (otherwise the integral $\int_0^\infty f(x)\,dx$ would not be convergent), so
the singularity cancels, at least when $f$ decays as fast as $1/x^2$ or faster.
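Here is a minimal sketch of the change of variables in code. Two choices in it are our own and not prescribed by the text: the function name `integrate_half_line`, and the use of midpoint samples in $u$, which conveniently never evaluate the integrand at $u = 1$ where the factor $1/(1-u)^2$ would divide by zero. The test integrand $e^{-x}$ has exact integral 1.

```python
import math

def integrate_half_line(f, n):
    """Estimate the integral of f over [0, inf) using x = u/(1-u) and midpoint samples in u."""
    du = 1.0 / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * du            # midpoints stay strictly inside (0, 1)
        x = u / (1.0 - u)
        total += f(x) / (1.0 - u) ** 2
    return total * du

print(integrate_half_line(lambda x: math.exp(-x), 1000))   # exact value is 1
```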
3.2 Integrable singularities
Consider the following integral:

$$I = \int_0^1 \frac{\exp(-x)}{\sqrt{x}}\,dx \tag{4}$$

Although the integrand blows up at the origin, the integral is perfectly well-defined. We say that we have an integrable singularity at the origin.
Note the important distinction between
integrable and non-integrable singularities. The function $f(x) = \exp(-x)/\sqrt{x}$ has an integrable singularity at
the origin. In contrast, $f(x) = \exp(-x)/x$ has a non-integrable singularity at
the origin. There is no point in attempting to devise a strategy for estimating
$\int_0^1 \exp(-x)\,dx/x$, because the integral does not exist.
Numerical integration in the presence of integrable singularities is something
of an art form. There are many approaches, some of which are better suited to
particular problems than others; it's impossible to give a general prescription.
Instead, we'll survey some common techniques.
Singularity subtraction
It may be possible to isolate the portion of the integrand that causes the difficulty and integrate it analytically. For example, in (4) we can write

$$I = \underbrace{\int_0^1 \frac{dx}{\sqrt{x}}}_{I_1} + \underbrace{\int_0^1 \frac{\exp(-x) - 1}{\sqrt{x}}\,dx}_{I_2}$$

The first integral may be evaluated analytically:

$$I_1 = \int_0^1 \frac{dx}{\sqrt{x}} = 2\sqrt{x}\,\Big|_0^1 = 2.$$

The second integral can't be evaluated analytically, but its integrand is now
nonsingular at the origin, as we can see by Taylor-expanding the numerator:

$$\frac{\exp(-x) - 1}{\sqrt{x}} = \frac{-x + \frac{1}{2}x^2 - \frac{1}{6}x^3 + \cdots}{\sqrt{x}} = -x^{1/2} + \frac{1}{2}x^{3/2} - \frac{1}{6}x^{5/2} + \cdots$$

This function goes to zero politely as $x \to 0$, so we can evaluate $I_2$ using
standard (nonsingular) numerical quadrature techniques.
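The subtraction can be tried out with a plain trapezoidal rule. In this sketch the names `i2_integrand` and `trapezoidal_rule` are our own, the value 0 is supplied by hand at $x = 0$ (the limit of the subtracted integrand), and the reference value $\sqrt{\pi}\,\mathrm{erf}(1)$ is the closed form of $\int_0^1 e^{-x}/\sqrt{x}\,dx$ in terms of the error function.

```python
import math

def i2_integrand(x):
    """(exp(-x) - 1)/sqrt(x), with its limiting value 0 at x = 0."""
    if x == 0.0:
        return 0.0
    return math.expm1(-x) / math.sqrt(x)   # expm1(-x) = exp(-x) - 1, accurate for small x

def trapezoidal_rule(f, a, b, n):
    """Trapezoidal rule with n subintervals."""
    delta = (b - a) / n
    interior = sum(f(a + k * delta) for k in range(1, n))
    return delta * (interior + 0.5 * f(a) + 0.5 * f(b))

estimate = 2.0 + trapezoidal_rule(i2_integrand, 0.0, 1.0, 2000)   # I = I1 + I2 with I1 = 2
reference = math.sqrt(math.pi) * math.erf(1.0)
print(estimate, reference)
```

Because the subtracted integrand still behaves like $-\sqrt{x}$ near the origin, its derivative is unbounded there, so the trapezoidal rule converges more slowly than its usual $1/N^2$ rate on this problem.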
Singularity cancellation
Another strategy is to introduce new integration variables in such a way that the
Jacobian of the transformation cancels troublesome factors in the denominator.
For example, for (4) we can put
$$u = \sqrt{x}, \qquad du = \frac{dx}{2\sqrt{x}}$$
whereupon our integral becomes
$$\int_0^1 \frac{e^{-x}}{\sqrt{x}}\,dx = 2\int_0^1 e^{-u^2}\,du.$$
This integral has no singularities and may again be evaluated via straightforward
numerical quadrature.
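To see what the substitution buys, the following sketch compares a midpoint rule applied directly to the singular integrand with the same rule applied after the change of variables $u = \sqrt{x}$. It assumes the integrand of (4) is $\exp(-x)/\sqrt{x}$; the panel count and variable names are arbitrary choices for illustration.

```python
import math

def midpoint(g, a, b, n):
    """Composite midpoint rule with n equal panels on [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

n = 1000
reference = math.sqrt(math.pi) * math.erf(1.0)    # check value via the error function

# Direct attack on the singular integrand exp(-x) / sqrt(x).
naive = midpoint(lambda x: math.exp(-x) / math.sqrt(x), 0.0, 1.0, n)

# After u = sqrt(x) the integrand 2 exp(-u**2) is smooth on [0, 1].
cancelled = midpoint(lambda u: 2.0 * math.exp(-u * u), 0.0, 1.0, n)

err_naive = abs(naive - reference)
err_cancelled = abs(cancelled - reference)
print(err_naive, err_cancelled)
```

With the same number of function samples, the transformed integral should come out several orders of magnitude more accurate than the direct one.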
Epsilon expansion
If neither of the above methods works, you may try to introduce a smoothing
parameter $\epsilon$ which removes the singularity for finite $\epsilon$, then try extrapolating
to the limit $\epsilon \to 0$. For example, in the above case, we might define
$$I_\epsilon = \int_0^1 \frac{\exp(-x)\,dx}{(x^2 + \epsilon^2)^{1/4}}$$
which is nonsingular for finite $\epsilon$, but tends to $I$ as $\epsilon \to 0$.
3.3 Adaptive quadrature
In general the functions we will be integrating behave differently in different
portions of their domain of definition. For example, consider the function plotted
in Figure 5.
[Figure 5: plot of f(x) against x for 0 ≤ x ≤ 10.]
Figure 5: An example of a function whose rate of variation is different in different
regions. If we use a trapezoidal rule to integrate this function from 0 to 10, we
will clearly need to use trapezoids of narrow width throughout the region
4 < x < 6, but it would be wasteful to use the same small width for x < 4 or
x > 6.
Suppose we try to integrate this function from 0 to 10 using a trapezoidal
rule. Clearly we are going to need a lot of very narrow trapezoids to capture the
behavior of the function in the region 4 < x < 6; we will need to take the panel width $\Delta$ quite
small in that region to get accurate results. But then it would be wasteful to
use such a small value of $\Delta$ in the regions x < 4 or x > 6; there we can obtain
accurate results with much coarser resolution.
This motivates the notion of adaptive quadrature, in which we use quadrature rules of different accuracy for different regions of our integration domain.
The most sophisticated integration codes implement a form of automatic adaptivity: they divide the range of integration into subintervals and estimate their
own error on each subinterval. (To estimate the error in a quadrature scheme,
you compare the results of a coarse-grained and a fine-grained quadrature scheme, for example trapezoidal rules with N = 100 and N = 200. If
the results aren't very different, that means that refining the
coarse-grained rule didn't change things much, and thus that the coarse-grained
rule was already reasonably accurate.) When they deem the error on a subinterval to be too
large, they recompute the integral there using a more accurate quadrature rule (for
example, a trapezoidal rule with a smaller value of $\Delta$).
The subject of adaptivity is particularly important in the context of ODE
integrators, and we will revisit it during our discussion of that subject.
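The idea of automatic adaptivity can be sketched as a recursive trapezoidal rule: on each subinterval, compare a one-panel and a two-panel estimate, and subdivide only where they disagree. This is a minimal illustration, not how any particular production code works; the function `adaptive_trapezoid`, the tolerance-splitting strategy, and the peaked test integrand are all assumptions made for the example.

```python
import math

def adaptive_trapezoid(f, a, b, tol, depth=0, max_depth=40):
    """Return (estimate, accepted_panels) for the integral of f over [a, b].

    The difference between a 1-panel and a 2-panel trapezoidal estimate
    serves as the local error indicator.
    """
    m = 0.5 * (a + b)
    coarse = 0.5 * (b - a) * (f(a) + f(b))
    fine = 0.25 * (b - a) * (f(a) + 2.0 * f(m) + f(b))
    if abs(fine - coarse) < tol or depth >= max_depth:
        return fine, 1
    # Not accurate enough: split the interval and share the tolerance.
    left, n_left = adaptive_trapezoid(f, a, m, 0.5 * tol, depth + 1, max_depth)
    right, n_right = adaptive_trapezoid(f, m, b, 0.5 * tol, depth + 1, max_depth)
    return left + right, n_left + n_right

# Hypothetical integrand with a sharp peak near x = 5 and flat tails.
peaked = lambda x: 1.0 / (1.0 + 25.0 * (x - 5.0) ** 2)
exact = 0.4 * math.atan(25.0)      # antiderivative is atan(5 (x - 5)) / 5

estimate, panels = adaptive_trapezoid(peaked, 0.0, 10.0, 1.0e-6)
print(estimate, exact, panels)
```

The recursion places narrow panels near the peak and wide ones in the tails, which is the behavior the paragraph above describes.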
Nomenclature for Newton-Cotes rules of various orders

Integration of Ordinary Differential Equations
Homer Reid
February 13, 2014
Contents
1 Overview
  1.1 ODEs and ODE Integrators
  1.2 Comparison to numerical quadrature
2 Examples of ODE systems
  2.1 Motion of particles in force fields
  2.2 Molecular dynamics
  2.3 Electric circuits
  2.4 Chemical reactions
  2.5 Meteorology and chaos
  2.6 Charge renormalization in quantum electrodynamics
3 ODE Integration Algorithms
  3.1 Forward (Explicit) Euler
  3.2 Improved Euler
  3.3 Runge-Kutta Methods
4 Stability
  4.1 Stability of the forward Euler method
  4.2 The backward (implicit) Euler method
  4.3 Stability of the backward Euler method
  4.4 Stability in the multidimensional case
  4.5 Stability in the nonlinear case
5 Pathological cases
  5.1 Non-uniqueness
  5.2 Blowup in finite time
  5.3 Conditions for existence and uniqueness
1 Overview
1.1 ODEs and ODE Integrators
A first-order ordinary differential equation (ODE) is an equation of the form
$$\frac{du}{dt} = f(t, u) \tag{1}$$
for some function $f(t, u)$. (We will always call the independent variable $t$ and
the dependent variables $u$.) More specifically, (1), supplemented by the value of $u$ at a single starting time, is an example of an initial
value problem; one also encounters ODEs posed in the form of boundary value
problems, to be discussed below.
Given an ODE like (1) with a reasonably well-behaved RHS function $f$, and
given a single point $(t_0, u_0)$, an extensive and well-established theory assures
us that there is a unique solution curve $u(t)$ passing through this point (i.e.
satisfying $u(t_0) = u_0$). However, the theory doesn't tell us how to write down
an analytical expression for this solution curve, and in general no analytical
expression exists. Instead, we must resort to numerical methods to compute
points that lie approximately on the curve $u(t)$.
An ODE integrator is an algorithm that takes an ODE like (1) and a point
$(t_0, u_0)$ and produces a new point $(t_1, u_1)$ that (at least approximately) lies on
the unique solution curve passing through $(t_0, u_0)$. More generally, we will usually want to compute a whole sequence of points $(t_1, u_1), (t_2, u_2), \ldots, (t_{\max}, u_{\max})$
up to some maximum time $t_{\max}$.
ODE systems
In general we may have two or more ODEs that we need to solve at the same
time. For example, denote the populations of the U.S. and Uruguay respectively
by $u_1(t)$ and $u_2(t)$. Suppose the U.S. and Uruguayan birth rates are respectively
$\beta_1$ and $\beta_2$, and suppose that every year a fraction $\gamma_a$ of the U.S. population
emigrates to Uruguay, while a fraction $\gamma_b$ of the Uruguayan population emigrates
to the U.S. Then the differential equations that model the population dynamics
of the two countries are
$$\frac{du_1}{dt} = (\beta_1 - \gamma_a)\,u_1 + \gamma_b\,u_2 \tag{2a}$$
$$\frac{du_2}{dt} = \gamma_a\,u_1 + (\beta_2 - \gamma_b)\,u_2 \tag{2b}$$
These two ODEs are coupled; each one depends on the other, so we can't solve them
separately but must instead solve them simultaneously. We generally write systems
like (2) in the form
$$\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u}) \tag{3}$$
where $\mathbf{u}$ is now a time-dependent $d$-dimensional vector and $\mathbf{f}$ is a $d$-dimensional
vector-valued function. For the case of (2), the dimension $d = 2$ and the $\mathbf{u}$ vector
is
$$\mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$$
As in the case of the one-dimensional system (1), for reasonably well-behaved
functions $\mathbf{f}$ we are guaranteed to have a unique solution curve $\mathbf{u}(t)$ passing
through any given point $(t_0, \mathbf{u}_0)$, but again in general we cannot write down
an analytical expression for such a curve; we must use numerical methods to
generate a sequence of points $(t_0, \mathbf{u}_0), (t_1, \mathbf{u}_1), \ldots, (t_{\max}, \mathbf{u}_{\max})$.
Actually, in the case of (2), the function $\mathbf{f}(t, \mathbf{u})$ is linear, $\mathbf{f} = \mathbf{A}\mathbf{u}$ for a
constant matrix $\mathbf{A}$. In this special case, (3) can be solved analytically in terms
of the matrix exponential to yield, for initial conditions $\mathbf{u}(t = 0) = \mathbf{u}_0$,
$$\mathbf{u}(t) = e^{\mathbf{A}t}\,\mathbf{u}_0.$$
However, this only works for linear ODE systems; for nonlinear systems in
general no analytical solution is possible.
Higher-order ODEs
Equations (1) and (3) are first-order ODEs, i.e. they only involve first derivatives. What if we have a higher-order ODE? Consider, for example, the one-dimensional motion of a particle subject to a position-dependent force:
$$\frac{d^2 u_1}{dt^2} = \frac{1}{m}F(u_1). \tag{4}$$
It turns out that we can squeeze high-order ODEs like this into the framework
of the above discussion simply by giving clever names to some of our variables.
More specifically, let's assign the name $u_2$ to the first derivative of $u_1$ in (4):
$$\frac{du_1}{dt} \equiv u_2. \tag{5a}$$
Now we can reinterpret the second-order equation (4) as a first-order equation
for $u_2$:
$$\frac{du_2}{dt} = \frac{1}{m}F(u_1). \tag{5b}$$
Equations (5) constitute a 2-dimensional system of first-order ODEs:
$$\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u}), \qquad \mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \qquad \mathbf{f} = \begin{pmatrix} u_2 \\ F(u_1)/m \end{pmatrix}.$$
Thus we can use all the same tricks we use to solve first-order ODE systems;
there is no need to develop special methods for higher-order ODEs. This is a
remarkable example of the efficacy of using the right notation.
We can play this trick to convert any system of ODEs, of any order, to a
first-order system. In general, a $d$-dimensional system of $p$-th order ODEs
can always be rewritten as a $pd$-dimensional system of first-order ODEs.
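In code, the reduction of the second-order equation (4) to a first-order system amounts to packaging the position and velocity into one state vector. The sketch below is illustrative only: the factory name `make_first_order_rhs`, the list-based state, and the spring force used as an example are assumptions, not part of the notes.

```python
def make_first_order_rhs(force, mass):
    """Turn m u1'' = F(u1) into the first-order system u' = f(t, u).

    The state is u = [u1, u2] with u2 = du1/dt; `force` is a callable
    F(u1) supplied by the caller.
    """
    def f(t, u):
        u1, u2 = u
        return [u2, force(u1) / mass]
    return f

# Hypothetical example: a linear spring, F(x) = -k x with k = 4, and mass m = 2.
spring_constant = 4.0
rhs = make_first_order_rhs(lambda x: -spring_constant * x, mass=2.0)

# [du1/dt, du2/dt] at position 1 and velocity 3.
derivative = rhs(0.0, [1.0, 3.0])
print(derivative)
```

Any integrator written for first-order systems can then be handed `rhs` directly.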
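Here's another example of this reduction process: The third-order nonlinear ODE
$$\dddot{x} + A\ddot{x} - (\dot{x})^2 + x = 0$$
is equivalent, upon defining $\{u_1, u_2, u_3\} \equiv \{x, \dot{x}, \ddot{x}\}$, to the following system:
$$\frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} u_2 \\ u_3 \\ -u_1 + u_2^2 - A u_3 \end{pmatrix}.$$
This system has been called the simplest dissipative chaotic flow.¹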
1.2 Comparison to numerical quadrature
Consider the problem of evaluating the integral
$$\int_a^b f(t)\,dt \tag{6}$$
which we studied earlier in our unit on numerical quadrature. This problem may
be recast in the language of ODEs as follows: Define $u(t)$ to be the function
$$u(t) \equiv \int_a^t f(t')\,dt'.$$
Then $u(t)$ satisfies the first-order initial-value ODE problem
$$\frac{du}{dt} = f(t), \qquad u(a) = 0 \tag{7}$$
and the integral (6) we want to compute is the value of the solution curve $u(t)$
at $t = b$.
Comparing (7) to (1), we notice an important distinction: The RHS function
$f(t)$ in (7) depends only on $t$, not on $u$. This means that integrating functions
is easier than integrating ODEs. In particular, suppose we use a numerical
quadrature rule of the form
$$\int_a^b f(t)\,dt \approx \sum_i w_i f(t_i)$$
to estimate our integral. Because the integrand only depends on $t$, we could
(for example) evaluate all the function samples $f(t_i)$ in parallel, as they are
independent of each other. No such approach is possible for general ODEs like
(1) or (3); we must instead proceed incrementally, computing one point on our
solution curve at a time and then using this point as the springboard to compute
the next point.
1 Reference: J. C. Sprott, Physics Letters A 228, 271 (1997).
2 Examples of ODE systems
2.1 Motion of particles in force fields
Historically, techniques for numerically integrating ODEs were first developed to
study the motion of celestial bodies (planets, comets, etc.) in the solar system.
Consider a planet of mass $m$ at a distance $r$ from the sun, which has mass
$M$. The planet experiences a gravitational force directed radially inward with
magnitude $F = GMm/r^2$ (where $G$ is Newton's gravitational constant). Then
Newton's second law reads
$$m\ddot{\mathbf{r}} = -\frac{GMm}{r^2}\,\hat{\mathbf{r}} \tag{8}$$
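This is a three-dimensional second-order system of ODEs, which we may express
as a six-dimensional first-order system as follows. First, define the $u$ variables
to be
$$\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}, \qquad \begin{pmatrix} u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{pmatrix} \tag{9}$$
where $x, y, z$ are the components of $\mathbf{r}$. Then equation (8) is equivalent to the
following system:
$$\frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} u_4 \\ u_5 \\ u_6 \\ -\mu u_1/(u_1^2 + u_2^2 + u_3^2)^{3/2} \\ -\mu u_2/(u_1^2 + u_2^2 + u_3^2)^{3/2} \\ -\mu u_3/(u_1^2 + u_2^2 + u_3^2)^{3/2} \end{pmatrix} \tag{10}$$
where $\mu = GM$.²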
Of course, in the real solar system we have more than simply one planet, and
planets experience gravitational attractions to each other in addition to their
attraction to the sun. You will work out some implications of this fact in your
problem set.
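The right-hand side of the six-dimensional system (10) is short enough to write out directly. In this sketch the function name `kepler_rhs`, the keyword `mu` (standing for the product $GM$), and the sample state are illustrative assumptions; units are arbitrary.

```python
def kepler_rhs(t, u, mu=1.0):
    """Right-hand side of system (10) for a single planet.

    u = [x, y, z, vx, vy, vz]; mu stands for G*M.
    """
    x, y, z, vx, vy, vz = u
    r_cubed = (x * x + y * y + z * z) ** 1.5
    return [vx, vy, vz,
            -mu * x / r_cubed,
            -mu * y / r_cubed,
            -mu * z / r_cubed]

# Hypothetical state: planet on the x axis at distance 2, moving in +y.
dudt = kepler_rhs(0.0, [2.0, 0.0, 0.0, 0.0, 1.0, 0.0], mu=4.0)
print(dudt)
```

The first three components simply copy the velocity, while the last three give the inward acceleration of magnitude $\mu/r^2$.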
2 To derive e.g. the 4th component of this equation, we write the $x$ component of equation
(8) as follows:
$$\frac{d}{dt}\dot{x} = \ddot{x} = -\frac{GM}{x^2 + y^2 + z^2}\,\hat{r}_x \tag{11}$$
where $\hat{r}_x$ is the $x$ component of the radially-directed unit vector $\hat{\mathbf{r}}$, which we may write in the
form
$$\hat{r}_x = \frac{x}{(x^2 + y^2 + z^2)^{1/2}}.$$
Plugging this into (11) and renaming the variables according to (9) yields the fourth component of equation (10).
2.2 Molecular dynamics
2.3 Electric circuits
2.4 Chemical reactions
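Ozone (O$_3$) in the Earth's atmosphere disintegrates through a process of reacting
with oxygen monomers and dimers. The primary reactions are
$$\mathrm{O_3 + O_2} \;\underset{k_2}{\overset{k_1}{\rightleftharpoons}}\; \mathrm{O + 2O_2}, \qquad \mathrm{O_3 + O} \;\overset{k_3}{\longrightarrow}\; \mathrm{2O_2}.$$
Label the concentrations of O, O$_2$, O$_3$ respectively by $u_1(t)$, $u_2(t)$, $u_3(t)$. Then
the above reaction system implies the following nonlinear ODE system:
$$\frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} k_1 u_3 u_2 - k_2 u_1 u_2^2 - k_3 u_3 u_1 \\ k_1 u_3 u_2 - k_2 u_1 u_2^2 + 2k_3 u_3 u_1 \\ -k_1 u_3 u_2 + k_2 u_1 u_2^2 - k_3 u_3 u_1 \end{pmatrix}.$$
This system is not even close to being linear.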
2.5 Meteorology and chaos
2.6 Charge renormalization in quantum electrodynamics
Figure 1: Euler's method illustrated for the 1D case. We are given an ODE
$du/dt = f(t, u)$ and a single point $(t_n, u_n)$. The dashed line denotes the unique
solution curve through this point; we know it exists, but we don't have an
analytical expression for it. What we do know is its slope at the given point
[this slope is just $s = f(t_n, u_n)$], so we move along the straight line of slope $s$ until we have traveled
a horizontal distance $\Delta t = h$ on the $t$ axis.
3 ODE Integration Algorithms
3.1 Forward (Explicit) Euler
The simplest possible ODE integration algorithm is Euler's method (sometimes
known as the forward Euler or explicit Euler method to contrast it with an
alternative version we will discuss below). The idea behind this method is
pictured in Figure 1. Given a point $(t_n, \mathbf{u}_n)$ on a solution curve, the RHS of
equation (3) tells us the slope of the tangent line to the solution curve at that
point. Euler's method is simply to move along this line until we have traveled
a horizontal distance $\Delta t \equiv h$. ($h$ is known as the step size.) In equations, the
Euler step transitions from one point to the next according to the rule
$$(t_n, \mathbf{u}_n) \to (t_{n+1}, \mathbf{u}_{n+1})$$
$$t_{n+1} = t_n + h, \tag{12a}$$
$$\mathbf{u}_{n+1} = \mathbf{u}_n + h\,\mathbf{f}(t_n, \mathbf{u}_n). \tag{12b}$$
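For the special case of a linear ODE system
$$\dot{\mathbf{u}} = \mathbf{A}\mathbf{u},$$
equation (12b) takes the form
$$\mathbf{u}_{n+1} = \mathbf{u}_n + h\mathbf{A}\,\mathbf{u}_n$$
or
$$\mathbf{u}_{n+1} = (\mathbf{I} + h\mathbf{A})\,\mathbf{u}_n \tag{13}$$
where $\mathbf{I}$ is the $n \times n$ identity matrix (here $n$ denotes the dimension of the system, not the step index). So each step of the forward Euler method
requires us to do a single matrix-vector multiplication. If $\mathbf{A}$ is a sparse matrix,
this can be done in $O(n)$ operations. (We haven't discussed sparse matrices or
operation counts yet, so this observation is made for future reference.)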
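The update rule (12) translates almost line for line into code. The following is a minimal sketch for systems stored as Python lists; the names `euler_step` and `euler_integrate` and the decay test problem are illustrative choices, not something prescribed by the notes.

```python
def euler_step(f, t, u, h):
    """One forward Euler step, equations (12a) and (12b), for a list-valued state u."""
    dudt = f(t, u)
    return t + h, [ui + h * di for ui, di in zip(u, dudt)]

def euler_integrate(f, t0, u0, h, n_steps):
    """Return the list of points (t_n, u_n) produced by n_steps Euler steps."""
    t, u = t0, list(u0)
    path = [(t, u)]
    for _ in range(n_steps):
        t, u = euler_step(f, t, u, h)
        path.append((t, u))
    return path

# Hypothetical test problem: du/dt = -u with u(0) = 1, integrated to t = 1.
path = euler_integrate(lambda t, u: [-u[0]], 0.0, [1.0], 0.01, 100)
t_end, u_end = path[-1]
print(t_end, u_end[0])
```

For this linear problem each step multiplies $u$ by $(1 - h)$, so the final value should equal $0.99^{100}$ up to rounding, close to but not equal to the exact $e^{-1}$.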
Error analysis
How accurate is Euler's method? Consider the simplest case of a one-dimensional
ODE system (the extension to a general $n$-dimensional system is immediate).
Given a point $(t_0, u_0)$, we know there is a unique solution curve $u(t)$ passing
through this point. The Taylor expansion of this function around the point $t_0$
takes the form
$$u(t) = \underbrace{u(t_0)}_{u_0} + (t - t_0)\,\underbrace{u'(t_0)}_{f(t_0, u_0)} + \frac{1}{2}(t - t_0)^2\,u''(t_0) + \cdots \tag{14}$$
Note that, in this expansion, $u(t_0)$ is just $u_0$, and $u'(t_0)$ is just $f(t_0, u_0)$, i.e. the
value of the RHS function in our ODE at the initial point.
If we use (14) to compute the actual value of $u$ at the point $t_0 + h$, we find
$$u(t_0 + h) = u_0 + h f(t_0, u_0) + \frac{1}{2}h^2 u''(t_0) + \cdots \tag{15}$$
On the other hand, the Euler-method approximation to $u(t_0 + h)$ is precisely
the first two terms in this expansion:
$$u^{\text{Euler}}(t_0 + h) = u_0 + h f(t_0, u_0). \tag{16}$$
Thus the error between the Euler-method approximation and the actual value
is
$$u(t_0 + h) - u^{\text{Euler}}(t_0 + h) = \frac{1}{2}h^2 u''(t_0) + \cdots$$
This result depends on $u''(t_0)$, which we don't know. However, what's important is that it tells us the error is proportional to $h^2$. If we try again with
one-half the step size $h$, everything on the RHS stays the same except the factor
$h^2$, which now decreases by a factor of 4. To summarize,
$$\text{error in each step of the Euler method} \propto h^2. \tag{17}$$
On the other hand, in general we will not be taking just a single step of the
Euler method, but will instead want to use it to integrate over some interval
$[t_a, t_b]$. If we break up this interval into $N = \frac{t_b - t_a}{h}$ steps of width $h$, then (17)
tells us that the error in each step is proportional to $h^2$, but there are $N$ steps,
so the total error is proportional to $N h^2 \propto h$. In other words,
$$\text{overall error in the Euler method} \propto h. \tag{18}$$
The Euler method is a method of order 1.
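The first-order scaling of the overall error can be checked numerically by halving the step size and watching the error at a fixed final time. This sketch assumes a scalar test problem $du/dt = -u$, $u(0) = 1$ with known solution $e^{-t}$; the helper `euler_final_value` is a hypothetical name.

```python
import math

def euler_final_value(f, t0, u0, t_end, n_steps):
    """Integrate du/dt = f(t, u) from t0 to t_end using n_steps Euler steps."""
    h = (t_end - t0) / n_steps
    t, u = t0, u0
    for _ in range(n_steps):
        u = u + h * f(t, u)
        t = t + h
    return u

f = lambda t, u: -u                     # test problem with exact solution exp(-t)
exact = math.exp(-1.0)

# Global error at t = 1 for step sizes h = 0.1, 0.05, 0.025.
errors = [abs(euler_final_value(f, 0.0, 1.0, 1.0, n) - exact) for n in (10, 20, 40)]
ratios = [errors[i] / errors[i + 1] for i in range(2)]
print(errors, ratios)
```

Each halving of $h$ should cut the error roughly in half, i.e. the ratios should be close to 2.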
Figure 2: The improved Euler method. (a) Starting at a point $(t_n, u_n)$, we take
the usual forward Euler step by moving a horizontal distance $h$ along a line of
slope $s$ (dashed black line in the figure), where $s = f(t_n, u_n)$ is the value of
$f$ at the starting point. This takes us to the Euler point, $(t_{n+1}, u^{\text{Euler}}_{n+1})$. (b)
When we get to the Euler point, we sample the value of $f$ there. Call this value
$s' = f(t_{n+1}, u^{\text{Euler}}_{n+1})$; $s'$ is the slope of a tangent line (solid red line) to the ODE
solution curve through the Euler point (dashed red curve). (c) Now we go back
to the starting point and draw a line of slope $\frac{1}{2}(s + s')$ (solid black line). (The
slope of this line is intermediate between the slope of the dashed black and
dashed green lines in the figure.) Moving a horizontal distance $h$ along this line
takes us to the improved Euler point.
3.2 Improved Euler
Another possibility is the improved Euler method, pictured for the 1D case in
Figure 2. Like the Euler method, it computes a successor point to $(t_n, u_n)$
by moving a horizontal distance $h$ along a straight line. The difference is that,
whereas in the original Euler method this line has slope $s$, in the improved
Euler method the line has slope $\frac{1}{2}(s + s')$. Here $s$ and $s'$ are the values of
the ODE function $f(t, u)$ at the starting point $(t_n, u_n)$ and at the Euler point
$(t_{n+1}, u^{\text{Euler}}_{n+1})$. (The Euler point is just the point to which the usual Euler
method takes us.)
The idea is that by averaging the slopes of the solution curves at the starting
point and at the Euler point, we get a better approximation to what is happening
in between those points. Thus, if we draw a line whose slope is the average of
the two slopes, we expect that moving along this line should be better than just
moving along the line whose slope is $s$, as we do in the original Euler method.
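In equations, the improved Euler method takes the step
$$(t_n, \mathbf{u}_n) \to (t_{n+1}, \mathbf{u}_{n+1})$$
where
$$t_{n+1} = t_n + h \tag{19a}$$
$$\mathbf{u}_{n+1} = \mathbf{u}_n + \frac{h}{2}\Big[\mathbf{f}(t_n, \mathbf{u}_n) + \mathbf{f}\big(t_{n+1}, \mathbf{u}^{\text{Euler}}_{n+1}\big)\Big] \tag{19b}$$
where $\mathbf{u}^{\text{Euler}}_{n+1} = \mathbf{u}_n + h\,\mathbf{f}(t_n, \mathbf{u}_n)$.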
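A single improved Euler step, following (19), might look like the sketch below for a scalar ODE. The function name `improved_euler_step` and the test problem $du/dt = -u$ are assumptions made for illustration.

```python
def improved_euler_step(f, t, u, h):
    """One improved Euler step for a scalar ODE du/dt = f(t, u)."""
    s = f(t, u)                      # slope at the starting point
    u_euler = u + h * s              # the Euler point
    s_prime = f(t + h, u_euler)      # slope sampled at the Euler point
    return t + h, u + 0.5 * h * (s + s_prime)

# Hypothetical test problem: du/dt = -u, starting from (0, 1) with h = 0.1.
t1, u1 = improved_euler_step(lambda t, u: -u, 0.0, 1.0, 0.1)
print(t1, u1)
```

Here $s = -1$ and $s' = -0.9$, so the step lands at $u_1 = 1 - 0.05 \times 1.9 = 0.905$, compared with $0.9$ for plain Euler and $e^{-0.1} \approx 0.9048$ for the exact solution.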
Error analysis
To analyze the error in the improved Euler method, consider again the 1D case:
we are at a point $(t_0, u_0)$, which we know lies on a solution curve $u(t)$, and we
want to get to the point $u(t_0 + h)$. We will compare an exact expansion for
this quantity with the approximate version computed by the improved Euler
method, and this will allow us to estimate the error in the latter.
Exact expansion for $u(t_0 + h)$. As above, we can write down an expression
for the exact value of $u(t_0 + h)$ by Taylor-expanding $u(t)$ about $t_0$:
$$u^{\text{exact}}(t_0 + h) = u(t_0) + h u'(t_0) + \frac{h^2}{2}u''(t_0) + \frac{h^3}{6}u'''(t_0) + \cdots \tag{20}$$
In our error analysis of the Euler method above, we observed that the first two
terms in this expansion involve simply
$$u(t_0) = u_0, \qquad u'(t_0) = f(t_0, u_0).$$
To get at $u''$, we differentiate once more:
$$u''(t_0) = \frac{d}{dt}u'(t)\Big|_{t_0} = \frac{d}{dt}f\big(t, u(t)\big)\Big|_{t_0}.$$
We now evaluate this total derivative by making use of the partial derivatives
of $f$:
$$u''(t_0) = \frac{\partial f}{\partial t}\Big|_{t_0, u_0} + \frac{\partial f}{\partial u}\Big|_{t_0, u_0}\frac{du}{dt} = f_t(t_0, u_0) + f_u(t_0, u_0)\,f(t_0, u_0),$$
where we are using the shorthand
$$f_t \equiv \frac{\partial f}{\partial t}, \qquad f_u \equiv \frac{\partial f}{\partial u}.$$
Inserting into (20), we have
$$u^{\text{exact}}(t_0 + h) = u_0 + h f(t_0, u_0) + \frac{h^2}{2}\Big[f_t(t_0, u_0) + f_u(t_0, u_0)\,f(t_0, u_0)\Big] + O(h^3) \tag{21}$$
Improved Euler approximation to $u(t_0 + h)$. On the other hand, consider
the approximation to $u(t_0 + h)$ that we get from the improved Euler method:
$$u^{\text{improved}}(t_0 + h) = u_0 + \frac{h}{2}\Big[f(t_0, u_0) + f\big(t_0 + h,\, u_0 + h f(t_0, u_0)\big)\Big] \tag{22}$$
Let's expand the second term in square brackets here³:
$$f\big(t_0 + h,\, u_0 + h f(t_0, u_0)\big) = f(t_0, u_0) + f_t(t_0, u_0)\,h + f_u(t_0, u_0)\,h f(t_0, u_0) + \cdots \tag{23}$$
Inserting (23) into (22) and collecting terms, we have
$$u^{\text{improved}}(t_0 + h) = u_0 + \frac{h}{2}\Big[2f(t_0, u_0) + f_t(t_0, u_0)\,h + f_u(t_0, u_0)\,h f(t_0, u_0) + \cdots\Big]$$
$$= u_0 + h f(t_0, u_0) + \frac{h^2}{2}\Big[f_t(t_0, u_0) + f_u(t_0, u_0)\,f(t_0, u_0)\Big] + O(h^3)$$
Comparison. Comparing this against expression (21), we see that the improved Euler method has succeeded in replicating the first three terms in the
Taylor-series expansion of $u(t_0 + h)$ (whereas the usual Euler method only replicates the first two terms), so the error decays one order more rapidly than in
the ordinary Euler method:
$$u^{\text{exact}}(t_0 + h) - u^{\text{improved}}(t_0 + h) = O(h^3)$$
Overall error. Of course, as before, this only gives us the error per step, i.e.
$$\text{error in each step of the improved Euler method} \propto h^3$$
so if we use improved Euler to integrate an ODE from $a$ to $b$ using steps of
width $h$, then the number of steps grows linearly as we shrink $h$, so as before
the global error decays one order less rapidly than the local error:
$$\text{overall error in the improved Euler method} \propto h^2.$$
The improved Euler method is a method of order 2.
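The second-order claim can be checked the same way as for plain Euler: halve $h$ and see whether the global error drops by about a factor of 4. The sketch assumes the scalar test problem $du/dt = -u$, $u(0) = 1$; the helper name `improved_euler_final_value` is hypothetical.

```python
import math

def improved_euler_final_value(f, t0, u0, t_end, n_steps):
    """Integrate du/dt = f(t, u) from t0 to t_end with n_steps improved Euler steps."""
    h = (t_end - t0) / n_steps
    t, u = t0, u0
    for _ in range(n_steps):
        s = f(t, u)                          # slope at the starting point
        s_prime = f(t + h, u + h * s)        # slope at the Euler point
        u = u + 0.5 * h * (s + s_prime)
        t = t + h
    return u

f = lambda t, u: -u                          # exact solution is exp(-t)
exact = math.exp(-1.0)

errors = [abs(improved_euler_final_value(f, 0.0, 1.0, 1.0, n) - exact)
          for n in (10, 20, 40)]
ratios = [errors[i] / errors[i + 1] for i in range(2)]
print(errors, ratios)
```

Ratios near 4 are consistent with an overall error proportional to $h^2$.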

3 All we are doing in this step is expanding $f(t, u)$ in a two-variable Taylor series in $t$ and
$u$ around the point $(t_0, u_0)$ and keeping only the linear terms in the expansion:
$$f(t_0 + \Delta t,\, u_0 + \Delta u) = f(t_0, u_0) + f_t(t_0, u_0)\,\Delta t + f_u(t_0, u_0)\,\Delta u + \cdots$$
3.3 Runge-Kutta Methods
Although the error analysis for improved Euler is a little tricky, the idea of the
method is straightforward: Instead of simply sampling f (t, u) at the left endpoint of the interval we are traversing, we sample it at both the left and right
endpoints and take the average between the two. This gives us a better representation of the behavior of the function over the interval than just sampling at
one endpoint.
It's also pretty clear how we might improve further on the method: Just
sample $f(t, u)$ at even more points, now including points inside the interval, and
do some kind of averaging to get a better sense of the behavior of $f$ throughout
the interval.
This is the motivation for Runge-Kutta methods.⁴ There is a family of these
methods, indexed by the number of function samples they take on each step
and the order of convergence they achieve. For example, one of the simplest Runge-Kutta methods is known as the midpoint method and is defined by the following
algorithm: Given an ODE $\frac{du}{dt} = f(t, u)$ and a point $(t_n, u_n)$, we compute the
successor point as follows:
$$s_1 = f(t_n, u_n)$$
$$s_2 = f\left(t_n + \frac{h}{2},\; u_n + \frac{h}{2}s_1\right)$$
$$(t_{n+1}, u_{n+1}) = \big(t_n + h,\; u_n + h s_2\big)$$
What this algorithm does is the following: It first takes an Euler step with
stepsize h/2 and samples the function f at the resulting point, yielding the
value s2 . This is an estimate of the slope of ODE solution curves near the
midpoint of the interval we are traversing. Then we simply proceed from the
starting point to the successor point by moving a horizontal distance h along a
line of slope s2 . Thus the midpoint method is almost identical to the original
Euler method, in the sense that it travels to the successor point by moving a
distance h along a straight line; the only difference is that we use a more refined
technique to estimate the slope of that straight line.
The most popular Runge-Kutta method is the fourth-order method, known
colloquially as RK4. This method again travels a horizontal distance $h$ along
a straight line, but now the slope of the line is obtained as a weighted average
of four function samples throughout the interval of interest. More specifically,
RK4 is the following refinement of the midpoint method:
$$s_1 = f(t_n, u_n)$$
$$s_2 = f\left(t_n + \frac{h}{2},\; u_n + \frac{h}{2}s_1\right)$$
$$s_3 = f\left(t_n + \frac{h}{2},\; u_n + \frac{h}{2}s_2\right)$$
$$s_4 = f(t_n + h,\; u_n + h s_3)$$
$$(t_{n+1}, u_{n+1}) = \left(t_n + h,\; u_n + \frac{h}{6}\big(s_1 + 2s_2 + 2s_3 + s_4\big)\right)$$
Here $s_1$ is the ODE slope at the left end of the interval, $s_2$ and $s_3$ are samples
of the ODE slope midway through the interval, and $s_4$ is a sample of the ODE
slope at the right end of the interval. We compute a weighted average of all
these slopes, $s_{\text{avg}} = (s_1 + 2s_2 + 2s_3 + s_4)/6$, and then we proceed to our successor
point by moving a horizontal distance $h$ along a line of slope $s_{\text{avg}}$.
4 Note: Runge rhymes with cowabunga.
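As with all the methods we have discussed, it is easy to generalize RK4 to
ODE systems of arbitrary dimensions. Given an ODE $\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u})$ and a point
$(t_n, \mathbf{u}_n)$, RK4 computes the successor point as follows:
$$\mathbf{s}_1 = \mathbf{f}(t_n, \mathbf{u}_n)$$
$$\mathbf{s}_2 = \mathbf{f}\left(t_n + \frac{h}{2},\; \mathbf{u}_n + \frac{h}{2}\mathbf{s}_1\right)$$
$$\mathbf{s}_3 = \mathbf{f}\left(t_n + \frac{h}{2},\; \mathbf{u}_n + \frac{h}{2}\mathbf{s}_2\right)$$
$$\mathbf{s}_4 = \mathbf{f}(t_n + h,\; \mathbf{u}_n + h\mathbf{s}_3)$$
$$(t_{n+1}, \mathbf{u}_{n+1}) = \left(t_n + h,\; \mathbf{u}_n + \frac{h}{6}\big(\mathbf{s}_1 + 2\mathbf{s}_2 + 2\mathbf{s}_3 + \mathbf{s}_4\big)\right)$$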
Error analysis in RK4
Although we won't present the full derivation, it is possible to show that RK4
is a fourth-order method: with a stepsize $h$, the error per step decreases like $h^5$
and the overall error decreases like $h^4$.
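A scalar RK4 step and a quick empirical check of its order can be sketched as follows. The test problem $du/dt = -u$, $u(0) = 1$ and the helper names `rk4_step` and `rk4_final_value` are illustrative assumptions; the check is a numerical experiment, not a substitute for the derivation.

```python
import math

def rk4_step(f, t, u, h):
    """One RK4 step for a scalar ODE du/dt = f(t, u)."""
    s1 = f(t, u)
    s2 = f(t + 0.5 * h, u + 0.5 * h * s1)
    s3 = f(t + 0.5 * h, u + 0.5 * h * s2)
    s4 = f(t + h, u + h * s3)
    return t + h, u + (h / 6.0) * (s1 + 2.0 * s2 + 2.0 * s3 + s4)

def rk4_final_value(f, t0, u0, t_end, n_steps):
    """Integrate from t0 to t_end with n_steps RK4 steps and return u(t_end)."""
    h = (t_end - t0) / n_steps
    t, u = t0, u0
    for _ in range(n_steps):
        t, u = rk4_step(f, t, u, h)
    return u

f = lambda t, u: -u                      # exact solution is exp(-t)
exact = math.exp(-1.0)

errors = [abs(rk4_final_value(f, 0.0, 1.0, 1.0, n) - exact) for n in (10, 20, 40)]
ratios = [errors[i] / errors[i + 1] for i in range(2)]
print(errors, ratios)
```

Halving $h$ should shrink the global error by roughly $2^4 = 16$, so ratios in that neighborhood are what the fourth-order claim predicts.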
4 Stability
4.1 Stability of the forward Euler method
Consider the following initial-value problem:
$$\frac{du}{dt} = -\lambda u, \qquad u(0) = 1 \tag{24}$$
with solution
$$u(t) = e^{-\lambda t}. \tag{25}$$
Consider applying Euler's method with stepsize $h$ to this problem. The sequence
of points we get is the following:
$$(t_0, u_0) = (0, 1)$$
$$(t_1, u_1) = (h,\; 1 - \lambda h)$$
$$(t_2, u_2) = \big(2h,\; 1 - \lambda h - \lambda h(1 - \lambda h)\big) = \big(2h,\; (1 - \lambda h)^2\big)$$
$$(t_3, u_3) = \big(3h,\; (1 - \lambda h)^2 - \lambda h(1 - \lambda h)^2\big) = \big(3h,\; (1 - \lambda h)^3\big)$$
and in general
$$(t_N, u_N) = \big(Nh,\; (1 - \lambda h)^N\big).$$
In other words, the Euler-method estimate of the value of $u(t)$ after $N$ timesteps
is
$$u^{\text{Euler}}(Nh) = (1 - \lambda h)^N.$$
More generally, if we had started with initial condition $u(0) = u_0$, then after $N$
timesteps we would have
$$u^{\text{Euler}}(Nh) = (1 - \lambda h)^N u_0. \tag{26}$$
Notice something troubling here: If $h > 2/\lambda$, then the quantity $(1 - \lambda h)$ is
negative with magnitude greater than 1, which means that $u^{\text{Euler}}(Nh)$ grows
in magnitude and flips sign each time we compute a new step. This cannot
come close to capturing the correct behavior of the exact function $u(t)$, which (for $\lambda > 0$) is
always positive and decays monotonically to zero. Figure 3 shows the result
of applying Euler's method, with $h = 0.42$, to the problem (24) with $\lambda = 5$.
[Figure 3: plot of u(t) against t, with curves labeled "Forward Euler" and "Exact solution".]
Figure 3: Instability of the forward Euler method with stepsize $h = 0.42$ applied
to the ODE $\frac{du}{dt} = -5u$.
We diagnose this problem by saying that Euler's method applied to (24)
with stepsize $h$ is unstable if $h > 2/\lambda$. More broadly, we say that Euler's method
for this problem is conditionally stable: it is stable for some values of $h$, and
unstable for others.
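The stability threshold is easy to reproduce numerically. The sketch below applies forward Euler to $du/dt = -\lambda u$ with $\lambda = 5$, once with $h = 0.42$ (just above $2/\lambda = 0.4$) and once with $h = 0.30$; the helper name `euler_decay` and the second step size are choices made here for illustration.

```python
lam = 5.0      # decay rate in du/dt = -lam * u

def euler_decay(h, n_steps, u0=1.0):
    """Forward Euler iterates for du/dt = -lam * u, starting from u0."""
    values = [u0]
    for _ in range(n_steps):
        values.append(values[-1] + h * (-lam * values[-1]))
    return values

unstable = euler_decay(0.42, 20)    # h > 2/lam: amplification factor 1 - lam*h = -1.1
stable = euler_decay(0.30, 20)      # h < 2/lam: amplification factor 1 - lam*h = -0.5

print(unstable[:4], unstable[-1])
print(stable[:4], stable[-1])
```

With $h = 0.42$ the iterates alternate in sign and grow by 10% per step, while with $h = 0.30$ they still alternate in sign but shrink toward zero. Staying below the stability limit prevents blowup; it does not by itself make the result accurate.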
Figure 4: The implicit Euler method (also known as the backward Euler
method). As in the forward Euler method, we proceed from $(t_n, u_n)$ to
$(t_{n+1}, u_{n+1})$ by moving on a straight line until we have traveled a horizontal
distance $h$ along the $t$ axis. The difference is that now the slope of the line is
the slope of the ODE solution curve through the new point $(t_{n+1}, u_{n+1})$. Because we don't know this point a priori, we must solve an implicit equation to
find it, hence the name of the technique.
4.2 The backward (implicit) Euler method
Instability in ODE integration schemes may be remedied by using backward or
implicit methods, of which the simplest is the implicit version of the usual Euler
method, known as the backward Euler or implicit Euler method and illustrated
in Figure 4. As in the case of the forward Euler method, we proceed from the
old point $(t_n, \mathbf{u}_n)$ to the new point $(t_{n+1}, \mathbf{u}_{n+1})$ by moving along a straight line
until we have traveled a horizontal distance $h$ along the $t$ axis. The difference
is that now the slope of this line is chosen to be the slope of the ODE solution
curve through the new point $(t_{n+1}, \mathbf{u}_{n+1})$. But since we don't know where this
point is a priori, we have to solve implicitly for it. In equations, the implicit
Euler method for proceeding from one point to the next is
$$(t_n, \mathbf{u}_n) \to (t_{n+1}, \mathbf{u}_{n+1})$$
$$t_{n+1} = t_n + h \tag{27a}$$
$$\mathbf{u}_{n+1} = \mathbf{u}_n + h\,\mathbf{f}(t_{n+1}, \mathbf{u}_{n+1}). \tag{27b}$$
For the typical case of a nonlinear function f , solving the implicit equation
(27b) is significantly more costly than simply implementing the explicit equation
(12b).
For the special case of a linear ODE system
$$\dot{\mathbf{u}} = \mathbf{A}\mathbf{u},$$
equation (27b) takes the form
$$\mathbf{u}_{n+1} = \mathbf{u}_n + h\mathbf{A}\,\mathbf{u}_{n+1}$$
which we may solve to obtain
$$\mathbf{u}_{n+1} = (\mathbf{I} - h\mathbf{A})^{-1}\,\mathbf{u}_n. \tag{28}$$
Thus each iteration of the implicit Euler algorithm requires us to invert an $n \times n$
matrix (or, essentially equivalently, to solve an $n \times n$ linear system). This is
much more costly than simply evaluating a matrix-vector product, which is all
that we need for the explicit Euler method [equation (13)].
Error analysis of the implicit Euler method

It is easy to mimic the analysis we performed of the usual (explicit) Euler
method to show that the implicit Euler method is a first-order method, i.e. the
overall error decays like $h^p$ with $p = 1$. This is the same convergence rate as the
explicit Euler method, so all the extra cost of the implicit Euler method doesn't
buy us anything on this front.
4.3 Stability of the backward Euler method
What the implicit method does buy us is unconditional stability. Consider
applying the backward Euler method with stepsize $h$ to the problem (24). At
each timestep, the equation we have to solve, equation (27b), reads
$$u_{n+1} = u_n - \lambda h\,u_{n+1}$$
which we can solve to find
$$u_{n+1} = \frac{1}{1 + \lambda h}\,u_n.$$
Starting from an initial point $(t, u) = (0, u_0)$, the value of $u$ after $N$ timesteps
is now
$$u^{\text{backward Euler}}(Nh) = \frac{1}{(1 + \lambda h)^N}\,u_0. \tag{29}$$
Comparing this result to (26), we see the advantage of the implicit technique:
assuming $\lambda > 0$, there is no value of $h$ for which (29) grows with $N$. We say
that the implicit Euler method is unconditionally stable.
[Figure 5: plot of u(t) against t, with curves labeled "Backward Euler" and "Exact solution".]
Figure 5: Stability of the backward Euler method with stepsize $h = 0.42$ applied
to the ODE $\frac{du}{dt} = -5u$.
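For a scalar ODE, one way to carry out the implicit solve in (27b) is Newton's method on the residual $g(v) = v - u_n - h f(t_{n+1}, v)$. The notes do not prescribe a particular solver, so Newton's method, the function name `backward_euler_step`, and the requirement that the caller supply $\partial f/\partial u$ are all assumptions of this sketch.

```python
def backward_euler_step(f, dfdu, t, u, h, newton_iters=20, tol=1e-12):
    """One backward Euler step for a scalar ODE, solving (27b) by Newton's method.

    dfdu(t, u) must return the partial derivative of f with respect to u.
    """
    t_new = t + h
    v = u                                    # initial guess for u_{n+1}
    for _ in range(newton_iters):
        g = v - u - h * f(t_new, v)          # residual of the implicit equation
        dg = 1.0 - h * dfdu(t_new, v)        # derivative of the residual
        step = g / dg
        v -= step
        if abs(step) < tol:
            break
    return t_new, v

# Test problem (24) with lambda = 5 and the step size h = 0.42 used in the figures.
lam, h = 5.0, 0.42
f = lambda t, u: -lam * u
dfdu = lambda t, u: -lam

t, u = 0.0, 1.0
values = [u]
for _ in range(20):
    t, u = backward_euler_step(f, dfdu, t, u, h)
    values.append(u)
print(values[:4], values[-1])
```

Each step should divide $u$ by $1 + \lambda h = 3.1$, so the iterates stay positive and decay monotonically even though this $h$ makes forward Euler unstable.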
4.4 Stability in the multidimensional case
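We carried out the analysis above for a one-dimensional linear ODE, but it is
easy to extend the conclusions to a higher-dimensional linear ODE. Recall from
18.03 that the $N$-dimensional linear ODE system
$$\frac{d\mathbf{u}}{dt} = \mathbf{A}\mathbf{u}, \qquad \mathbf{u}(0) = \mathbf{u}_0$$
(where $\mathbf{A}$ is an $N \times N$ matrix with constant coefficients, assumed here to be diagonalizable) has the solution
$$\mathbf{u}(t) = C_1 e^{\lambda_1 t}\mathbf{v}_1 + C_2 e^{\lambda_2 t}\mathbf{v}_2 + \cdots + C_N e^{\lambda_N t}\mathbf{v}_N \tag{30}$$
where $(\lambda_i, \mathbf{v}_i)$ are the eigenpairs of $\mathbf{A}$, and where the $C_i$ coefficients are determined by expanding the initial-condition vector in the basis of eigenvectors:
$$\mathbf{u}_0 = C_1\mathbf{v}_1 + C_2\mathbf{v}_2 + \cdots + C_N\mathbf{v}_N.$$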
If we were to use Euler's method with stepsize $h$ to integrate this ODE, we
would find that the condition for stability would be
$$h < \frac{2}{|\lambda_{\max}|} \tag{31}$$
where $\lambda_{\max}$ is the eigenvalue lying farthest to the left in the complex plane, i.e.
the eigenvalue with the largest negative real part (corresponding to the most
rapidly decaying term in (30)). On the other hand, the timescale over which we
will generally want to investigate the system is determined by the least rapidly
decaying term in (30), i.e.
$$t_{\max} \sim \frac{2}{|\lambda_{\min}|} \tag{32}$$
where $\lambda_{\min}$ is the eigenvalue with the smallest negative real part.⁵
Comparing (31) and (32), we see that the number of ODE timesteps we
would need to take to integrate our system stably using Euler's method is
roughly
$$\frac{t_{\max}}{h} \sim \frac{|\lambda_{\max}|}{|\lambda_{\min}|}.$$
If the dynamic range spanned by the eigenvalues is large, we will need many
small timesteps to integrate our ODE. Systems with this property are said to
be stiff, and stiff ODEs are a good candidate for investigation using implicit
integration algorithms.
4.5 Stability in the nonlinear case
So far we have discussed stability for linear ODEs. For nonlinear ODEs, we
typically consider the system in the vicinity of a fixed point and investigate
the stability with respect to small perturbations. More specifically, for an autonomous nonlinear ODE system of the form
$$\frac{d\mathbf{u}}{dt} = \mathbf{f}(\mathbf{u}) \tag{33}$$
(where we suppose the RHS function is independent of $t$; that's what autonomous means) we suppose $\mathbf{u}_0$ is a fixed point, that is, a zero of the RHS,
i.e.
$$\mathbf{f}(\mathbf{u}_0) = 0. \tag{34}$$
The unique solution of (33) passing through the point $(t_0, \mathbf{u}_0)$ is $\mathbf{u}(t) \equiv \mathbf{u}_0$,
constant independent of time; you can check using (34) that this is indeed a
solution of (33). We now consider small perturbations around this solution, i.e.
we put
$$\mathbf{u}(t) = \mathbf{u}_0 + \delta\mathbf{u}(t)$$
and consider the Taylor-series expansion of the RHS of (33) about the fixed
point:
$$\frac{d\mathbf{u}}{dt} = \mathbf{f}(\mathbf{u}_0 + \delta\mathbf{u}) = \mathbf{f}(\mathbf{u}_0) + \mathbf{J}\,\delta\mathbf{u} + O(\delta\mathbf{u}^2)$$
where $\mathbf{J}$ is the Jacobian of $\mathbf{f}$ at $\mathbf{u}_0$. Since $\mathbf{f}(\mathbf{u}_0) = 0$ and $\mathbf{u}_0$ is constant, neglecting terms of quadratic and higher
order in $\delta\mathbf{u}$ leaves the linear system $\frac{d}{dt}\delta\mathbf{u} = \mathbf{J}\,\delta\mathbf{u}$, which we can investigate using standard
linear stability analysis techniques.
5 For simplicity in this discussion we are supposing that none of the eigenvalues have positive
real parts. If any eigenvalues have positive real parts, then the underlying ODE itself is
unstable (small perturbations in initial conditions lead to exponentially diverging outcomes).
5 Pathological cases
Above we noted that for an ODE $\frac{du}{dt} = f(t, u)$ and an initial point $(t_0, u_0)$, we
are guaranteed the existence of a unique solution curve as long as the function
$f(t, u)$ is reasonably well-behaved. In this section we will first look at a couple
of illustrations of what can go wrong if $f$ is badly behaved; having seen what
can go wrong, we will then be in a position to quantify more rigorously the
conditions that must be satisfied to guarantee existence and uniqueness.
5.1 Non-uniqueness
One example of what can go wrong if $f$ is badly behaved is furnished by the
following initial-value problem:
$$\frac{du}{dt} = \sqrt{u}, \qquad u(0) = 0. \tag{35}$$
Solving this equation by separation of variables, we find
$$\frac{du}{\sqrt{u}} = dt \quad\Longrightarrow\quad 2\sqrt{u} = t + C$$
and applying the initial condition yields the solution
$$u(t) = \frac{t^2}{4}. \tag{36}$$
But now consider using one of the ODE integration schemes discussed above to
integrate this equation. Using Euler's method with stepsize $h$, for example, we
find
$$(t_0, u_0) = (0, 0)$$
$$(t_1, u_1) = (h,\; 0 + h\sqrt{0}) = (h, 0)$$
$$(t_2, u_2) = (2h,\; 0 + h\sqrt{0}) = (2h, 0)$$
and indeed after $N$ timesteps we find
$$(t_N, u_N) = (Nh,\; 0 + h\sqrt{0}) = (Nh, 0).$$
This procedure appears to be tracing out the solution curve
$$u(t) = 0. \tag{37}$$
But we already figured out that the solution is given by equation (36)! What
went wrong?!
What went wrong is that the RHS function $f(t, u) = \sqrt{u}$, though seemingly
innocuous enough, is not sufficiently well-behaved to guarantee the uniqueness
of solutions to (35). Indeed, you can readily verify that (37) is a solution to
(35) that is every bit as valid as (36). But aren't ODE solutions supposed to
be unique?
The behavior of $f$ that disqualifies it in this case is that it is not differentiable
at $u = 0$: the derivative $\partial f/\partial u = 1/(2\sqrt{u})$ blows up there. Nonexistence of a derivative of $f$ violates the conditions required
for existence and uniqueness and can give rise to nonunique solutions such as
(36) and (37).
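The two competing solutions are easy to see numerically. The following is a small sketch, with a hypothetical helper `euler` and arbitrarily chosen stepsize and perturbation: started exactly at $u = 0$, Euler's method never leaves the curve (37), while a tiny positive perturbation of the initial value sends it off along a curve close to (36).

```julia
# Forward Euler for du/dt = f(u); returns the value after nsteps steps.
function euler(f, u0, h, nsteps)
    u = u0
    for _ in 1:nsteps
        u += h * f(u)
    end
    return u
end

f(u) = sqrt(u)
h, nsteps = 1e-3, 2000                 # integrate out to t = 2
u_stuck = euler(f, 0.0, h, nsteps)     # starts exactly on the curve u(t) = 0
u_moved = euler(f, 1e-12, h, nsteps)   # tiny positive perturbation of u(0)
```

At $t = 2$ the solution (36) has the value $t^2/4 = 1$; the perturbed run lands near that value, while the unperturbed run is still exactly zero.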
5.2 Blowup in finite time
A different kind of pathological behavior is exhibited by the ODE
$$\frac{du}{dt} = u^2, \qquad u(0) = 1. \tag{38}$$
Again using separation of variables to solve, we find
$$\frac{du}{u^2} = dt \quad\Longrightarrow\quad -\frac{1}{u} = t + C$$
and applying the initial conditions yields
$$u(t) = \frac{1}{1 - t}.$$
In this case, uniqueness is not a problem, but existence is problematic at the
point $t = 1$. The solution blows up in finite time, and doesn't exist for $t \geq 1$.
The behavior of $f(t, u)$ that causes trouble here is that it grows superlinearly
in $u$. The function $f(t, u) = u$ grows linearly in $u$, and $f(t, u) = \sqrt{u}$ grows
sublinearly in $u$ (although this function leads to different pathologies, as noted
above), but $f(t, u) = u^2$ grows superlinearly in $u$. This violates the conditions
required for existence of ODE solutions and can give rise to situations in which
a solution exists only for a finite range of the $t$ variable.
In fact, you can easily repeat the above analysis for the more general differential equation
$$\frac{du}{dt} = u^p, \qquad u(0) = 1 \tag{39}$$
to find the solution
$$u(t) = \begin{cases} e^{t}, & p = 1 \\ \left[1 - (p-1)\,t\right]^{-\frac{1}{p-1}}, & p > 1. \end{cases}$$
For $p = 1$ we have existence and uniqueness for all time.[6] But for any $p > 1$ the
function $u(t)$ blows up (i.e. ceases to exist) at the finite time $t = \frac{1}{p-1}$.

[6] The function $e^t$ does grow without bound, and for large values of $t$ it assumes values that
in practice are ridiculously large, but it never becomes infinite for finite $t$.
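As a quick sanity check on the $p > 1$ solution of (39), one can compare a finite-difference derivative of the claimed $u(t)$ against $u^p$. This is only a sketch; the names `u_exact` and `dudt`, the choice $p = 3$, and the sample time are arbitrary.

```julia
# Claimed solution of du/dt = u^p with u(0) = 1, for p > 1.
u_exact(t, p) = (1 - (p - 1) * t)^(-1 / (p - 1))

# Centered finite-difference estimate of du/dt.
dudt(t, p; dt = 1e-6) = (u_exact(t + dt, p) - u_exact(t - dt, p)) / (2dt)

p = 3.0
t_blowup = 1 / (p - 1)                           # blowup time, 0.5 for p = 3
residual = dudt(0.25, p) - u_exact(0.25, p)^p    # should be close to zero
```

Evaluating `u_exact` at times approaching `t_blowup` from below shows the solution growing without bound as $t \to 1/(p-1)$.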
5.3 Conditions for existence and uniqueness
The above two cases illustrate the two basic ways in which the solution to an
ODE $du/dt = f(t, u)$ can fail to exist or be unique: (a) either $f$ or some derivative
of $f$ can blow up (fail to exist) at some point in our domain of interest, or (b)
$f$ can grow superlinearly in $u$.
To exclude these pathological cases, mathematicians invented a name for
functions that do not exhibit either of them. Functions $f(t, u)$ that are free of
both pathologies (a) and (b) are said to be Lipschitz, and the basic existence
and uniqueness theorem states that a solution to $du/dt = f(t, u)$ exists and is
unique if $f$ is Lipschitz.
We have intentionally avoided excessive rigor in this discussion in order to get
the main points across somewhat informally; the topic of Lipschitz functions and
existence and uniqueness of ODE solutions is discussed in detail in every ODE textbook
and many numerical analysis textbooks.

Orthogonal Polynomials, Gaussian Quadrature,
and Integral Equations
Homer Reid
May 1, 2014
In the previous set of notes we arrived at the definition of Chebyshev polynomials
$T_n(x)$ via the following logic:
Given a function $f(x)$ on the interval $[-1, 1]$, define $g(\theta) = f(\cos\theta)$.
Being an even $2\pi$-periodic function, $g(\theta)$ has a Fourier cosine series
expansion $g(\theta) = \frac{\tilde a_0}{2} + \sum_\nu \tilde a_\nu \cos(\nu\theta)$, whereupon our original function
$f(x)$ has the expansion $f(x) = \frac{\tilde a_0}{2} + \sum_\nu \tilde a_\nu \cos(\nu \arccos x)$, or
$$f(x) = \frac{\tilde a_0}{2} + \sum_\nu \tilde a_\nu T_\nu(x), \qquad T_\nu(x) = \cos(\nu \arccos x),$$
which defines the function $T_\nu(x)$.

In these notes we will investigate the following alternative characterization of
the functions {Tn }:
The Chebyshev polynomials $\{T_n(x)\}$ are the unique polynomials normalized to $T_n(1) = 1$ and orthogonal with respect to the inner product
$$\langle f, g \rangle = \int_{-1}^{1} \frac{f(x)\,g(x)\,dx}{\sqrt{1 - x^2}}.$$
Contents
1 Orthogonal Sets of Polynomials
2 Roots of orthogonal polynomials
3 Gaussian quadrature
4 Integral equations and Nyström's method

1 Orthogonal Sets of Polynomials
An orthogonal set of polynomials is fully specified by three ingredients:

1. An interval of the real line [a, b] over which we will be integrating.
2. A weight function W (x) defined over [a, b].
3. A normalization convention, which just defines an overall multiplicative
prefactor.
You can think of items 1 and 2 here as together specifying an inner product on
the vector space of real-valued functions on the interval $[a, b]$:
$$\langle f, g \rangle \equiv \int_a^b f(x)\,g(x)\,W(x)\,dx. \tag{1}$$
An inner product is just a rule for assigning a real number to any pair of
functions $f, g$, and different choices of $[a, b]$ and $W(x)$ yield different inner
products. [Note that the inner product is linear, i.e. $\langle \alpha f, g \rangle = \alpha \langle f, g \rangle$ and
$\langle f + g, h \rangle = \langle f, h \rangle + \langle g, h \rangle$.] For our purposes, the most important fact about the
inner product is that it doesn't vanish when you stick the same nonzero function into
both slots, i.e. $\langle f, f \rangle \neq 0$.[1]
Given an inner product and a normalization convention, an orthogonal set of
polynomials is simply a collection of polynomials $\{Q_n(x)\}$ (where $n$ indexes the
degree of the polynomial, i.e. $Q_0$ is a constant, $Q_1(x)$ is a linear function, $Q_2(x)$
is a second-degree polynomial, etc.) that satisfy the normalization convention
and that are orthogonal with respect to the inner product, i.e.
$$\langle Q_n, Q_m \rangle = 0 \quad \text{for } n \neq m.$$
Examples
The following table summarizes the ingredients that define some of the commonly used sets of orthogonal polynomials.

Name        Symbol     Interval              Weight function       Normalization
Legendre    $P_n(x)$   $[-1, 1]$             $1$                   $P_n(1) = 1$
Chebyshev   $T_n(x)$   $[-1, 1]$             $1/\sqrt{1 - x^2}$    $T_n(1) = 1$
Laguerre    $L_n(x)$   $[0, \infty)$         $e^{-x}$              $L_n(0) = 1$
Hermite     $H_n(x)$   $(-\infty, \infty)$   $e^{-x^2}$            $\langle H_n, H_n \rangle = \sqrt{\pi}\, 2^n n!$

[1] Note that one convenient way to define a normalization convention [item (3) above] would
be to scale all functions $f$ such that $\langle f, f \rangle = 1$, but this is not the convention that is typically
used.
Construction from inner product

Given an inner product and a normalization convention, there is a simple constructive procedure for computing every element in the corresponding family
of orthogonal polynomials. It is the analogue for polynomial vector spaces of
the usual Gram-Schmidt orthogonalization process used to construct orthogonal
bases in geometry, and it goes like this:
1. First, choose Q0 (x) to be the unique degree-zero polynomial (i.e. constant)
that satisfies the normalization convention. (In most cases we simply have
Q0 = 1.)
2. Now construct $Q_1$ as the product of a linear factor times $Q_0$:
$$Q_1(x) = A_1 (x - B_1)\, Q_0 \tag{2}$$
where
$$B_1 = \frac{\langle x Q_0, Q_0 \rangle}{\langle Q_0, Q_0 \rangle} \tag{3}$$
and $A_1$ is chosen to ensure that $Q_1$ satisfies the normalization condition.
[Just to clarify: the numerator of (3) is the inner product (1) with the
function $f(x)$ taken to be $x Q_0(x)$ and the function $g(x)$ taken to be $Q_0(x)$.]
You can easily verify that $Q_1(x)$ as defined by (2) is orthogonal to $Q_0$ by
construction.
3. Now construct $Q_2$ as the product of a linear factor times $Q_1$ plus a constant
factor times $Q_0$:
$$Q_2(x) = A_2 \Big[ (x - B_2)\, Q_1 - C_2\, Q_0 \Big] \tag{4}$$
where
$$B_2 = \frac{\langle x Q_1, Q_1 \rangle}{\langle Q_1, Q_1 \rangle}, \qquad C_2 = \frac{\langle Q_1, Q_1 \rangle}{A_1 \langle Q_0, Q_0 \rangle}.$$
You can easily verify that $Q_2(x)$ as defined by (4) is orthogonal to both
$Q_1$ and $Q_0$.
4. Now construct $Q_3$ as the product of a linear factor times $Q_2$ plus a constant
factor times $Q_1$:
$$Q_3(x) = A_3 \Big[ (x - B_3)\, Q_2 - C_3\, Q_1 \Big] \tag{5}$$
where
$$B_3 = \frac{\langle x Q_2, Q_2 \rangle}{\langle Q_2, Q_2 \rangle}, \qquad C_3 = \frac{\langle Q_2, Q_2 \rangle}{A_2 \langle Q_1, Q_1 \rangle}.$$
You can easily verify, as before, that $Q_3(x)$ as defined by (5) is orthogonal
to both $Q_2$ and $Q_1$.
What is surprising is that this $Q_3$ is also orthogonal to $Q_0$. Indeed, more
generally...
5. ...we construct the general element $Q_n(x)$ as the product of a linear factor
times $Q_{n-1}(x)$ plus a constant factor times $Q_{n-2}$:
$$Q_n(x) = A_n \Big[ (x - B_n)\, Q_{n-1} - C_n\, Q_{n-2} \Big] \tag{6}$$
where
$$B_n = \frac{\langle x Q_{n-1}, Q_{n-1} \rangle}{\langle Q_{n-1}, Q_{n-1} \rangle}, \qquad C_n = \frac{\langle Q_{n-1}, Q_{n-1} \rangle}{A_{n-1} \langle Q_{n-2}, Q_{n-2} \rangle} \tag{7}$$
and $A_n$ is chosen to ensure that $Q_n$ satisfies the normalization condition.
Again, what is surprising here is that the polynomial constructed in (6) is
orthogonal not only to $Q_{n-1}$ and $Q_{n-2}$ but indeed to all previous members of the set, $Q_{n-3}, Q_{n-4}, \dots, Q_1, Q_0$. In constructing (6) it seems like
we are only ensuring orthogonality against $Q_{n-2}$ and $Q_{n-1}$. But the orthogonality against the previous members of the set turns out to follow
for free, automatically, from the way the previous functions were defined.
This is not obvious.
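The construction (6)-(7) can be sketched in a few lines of julia for the Legendre case ($W(x) = 1$ on $[-1, 1]$, normalization $Q_n(1) = 1$). Everything here is an illustrative assumption rather than code from the notes: polynomials are stored as coefficient vectors, and the helper names `polymul`, `inner`, `polyval`, and `legendre_by_recurrence` are made up for this example.

```julia
# A polynomial is a coefficient vector c, meaning c[1] + c[2]*x + c[3]*x^2 + ...

# Product of two polynomials (convolution of their coefficients).
function polymul(p, q)
    r = zeros(length(p) + length(q) - 1)
    for i in eachindex(p), j in eachindex(q)
        r[i + j - 1] += p[i] * q[j]
    end
    return r
end

# Inner product (1) with W(x) = 1 on [-1, 1], using the exact moments:
# the integral of x^k over [-1, 1] is 2/(k+1) for even k and 0 for odd k.
function inner(p, q)
    r = polymul(p, q)
    total = 0.0
    for k in 0:length(r)-1
        if iseven(k)
            total += 2 * r[k + 1] / (k + 1)
        end
    end
    return total
end

polyval(p, x) = sum(p[k + 1] * x^k for k in 0:length(p)-1)

# Build Q_0, ..., Q_N via (6)-(7), normalizing each element so that Q_n(1) = 1.
function legendre_by_recurrence(N)
    x = [0.0, 1.0]                       # the polynomial "x"
    Q = [[1.0]]                          # Q_0 = 1
    A = Float64[]                        # A[n] holds A_n
    for n in 1:N
        Qm1 = Q[n]                       # Q_{n-1}
        B = inner(polymul(x, Qm1), Qm1) / inner(Qm1, Qm1)
        unnorm = polymul([-B, 1.0], Qm1) # (x - B_n) Q_{n-1}
        if n >= 2
            Qm2 = Q[n - 1]               # Q_{n-2}
            C = inner(Qm1, Qm1) / (A[n - 1] * inner(Qm2, Qm2))
            unnorm[1:length(Qm2)] .-= C .* Qm2
        end
        An = 1 / polyval(unnorm, 1.0)    # enforce Q_n(1) = 1
        push!(A, An)
        push!(Q, An .* unnorm)
    end
    return Q
end

Q = legendre_by_recurrence(3)
overlap_30 = inner(Q[4], Q[1])           # <Q_3, Q_0>, never imposed explicitly
```

The last line checks the "surprising" fact from step 4: the recurrence only enforces orthogonality against the two previous elements, yet $\langle Q_3, Q_0 \rangle$ comes out zero to rounding error.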
Recurrence relations
By scrutinizing the general case of the inductive procedure discussed above, it
is generally possible to write down recurrence relations that relate the next
element in a set of orthogonal polynomials to previous elements. For example,
the Legendre polynomials satisfy the recurrence
$$P_{n+1}(x) = \frac{2n+1}{n+1}\, x\, P_n(x) - \frac{n}{n+1}\, P_{n-1}(x).$$
The Chebyshev polynomials satisfy the recurrence
$$T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x). \tag{8}$$
The Laguerre polynomials satisfy the recurrence
$$L_{n+1}(x) = \frac{(2n+1) - x}{n+1}\, L_n(x) - \frac{n}{n+1}\, L_{n-1}(x). \tag{9}$$
The Hermite polynomials satisfy the recurrence
$$H_{n+1}(x) = 2x\, H_n(x) - 2n\, H_{n-1}(x). \tag{10}$$
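Recurrences like these are how orthogonal polynomials are typically evaluated in practice. As a small illustration, the sketch below evaluates $T_n(x)$ using (8) with the starting values $T_0 = 1$, $T_1 = x$, so that the result can be compared against the closed form $\cos(n \arccos x)$. The function name `chebyshev_T` is a hypothetical choice.

```julia
# Evaluate T_n(x) by running the three-term recurrence (8) upward from T_0, T_1.
function chebyshev_T(n, x)
    n == 0 && return one(x)
    Tprev, Tcur = one(x), x            # T_0 and T_1
    for _ in 1:n-1
        Tprev, Tcur = Tcur, 2x * Tcur - Tprev
    end
    return Tcur
end

via_recurrence = chebyshev_T(5, 0.3)
via_cosine     = cos(5 * acos(0.3))    # closed form, valid for |x| <= 1
```

The same loop structure works for (9) and (10) once the coefficients and starting values are swapped in.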
Differential equations
Many sets of orthogonal polynomials arise as solutions to differential equations.
For example, the $n$th Legendre polynomial $P_n(x)$ satisfies
$$(1 - x^2)\, \frac{d^2 P_n}{dx^2} - 2x\, \frac{dP_n}{dx} + n(n+1)\, P_n(x) = 0$$
and the $n$th Chebyshev polynomial satisfies
$$(1 - x^2)\, \frac{d^2 T_n}{dx^2} - x\, \frac{dT_n}{dx} + n^2\, T_n(x) = 0.$$
Generating functions
It is curious, and in some cases useful, to note that many sets of orthogonal
polynomials have a generating function which encodes the properties of the
entire set of functions and from which individual functions can be recovered by
performing algebraic and derivative manipulations.
For example, for the Legendre polynomials we have
$$P_n(x) = \frac{1}{2^n\, n!}\, \frac{d^n}{dx^n} (x^2 - 1)^n.$$
For the Chebyshev polynomials, it turns out that $T_n(x)$ arises as precisely the
coefficient of $y^n$ in the expansion of the quantity $(1 - xy)/(1 - 2xy + y^2)$ in
powers of $y$:
$$\frac{1 - xy}{1 - 2xy + y^2} = \sum_{n=0}^{\infty} T_n(x)\, y^n.$$
Differentiating each side of this equation $n$ times with respect to $y$, setting $y \to 0$, and dividing by $n!$ then yields
an expression for $T_n(x)$.
Properties of orthogonal polynomials

There are a few properties that are common to all sets of orthogonal
polynomials.
1. The first $N+1$ elements in the set, $Q_0, \dots, Q_N$, constitute a basis for the vector space
of all polynomials of degree $\leq N$. What this means is that any arbitrary
degree-$N$ polynomial $F(x)$ may be represented exactly and uniquely as a
linear combination of the $Q_n$ functions:
$$F(x) = c_0 Q_0(x) + c_1 Q_1(x) + \cdots + c_N Q_N(x). \tag{11a}$$
2. Only the constant element in the set has nonvanishing integral over the
interval with respect to the weight function, i.e.
$$\int_a^b Q_n(x)\,W(x)\,dx = 0, \qquad n \geq 1. \tag{11b}$$
This is actually just a consequence of orthogonality: We must have $\langle Q_n, Q_0 \rangle = 0$ for $n \neq 0$, but $Q_0$ is just a constant and may be pulled out of the integral
(1), leaving behind (11b).
2 Roots of orthogonal polynomials
For many applications, including Gaussian quadrature as discussed in the following section, we need to compute the roots of the $N$th element in some set of
orthogonal polynomials, i.e. we need the $N$ points $x_n$ that satisfy
$$Q_N(x_n) = 0, \qquad n = 1, 2, \dots, N. \tag{12}$$
It turns out to be easy to compute the numbers $x_n$ using numerical eigenvalue
techniques, and indeed numerical eigenvalue techniques are the preferred way
to compute these roots, as other methods tend to be numerically unstable. The
trick is to make use of the recurrence relation (6) to write $x Q_n$ in terms of other
$Q$ functions:
$$x Q_n = \alpha_n Q_{n-1} + \beta_n Q_n + \gamma_n Q_{n+1} \tag{13}$$
where the $\alpha, \beta, \gamma$ coefficients may be written down in closed form and take
different forms for various different sets of orthogonal polynomials; for example,
in the case of Legendre polynomials we have
$$\alpha_n = \frac{n}{2n+1}, \qquad \beta_n = 0, \qquad \gamma_n = \frac{n+1}{2n+1}.$$
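If we now write out equation (13) for $n = 0, 1, \dots, N-1$, we obtain an $N \times N$
linear system of equations:
$$
\begin{pmatrix}
\beta_0 & \gamma_0 & 0 & \cdots & 0 & 0 & 0 \\
\alpha_1 & \beta_1 & \gamma_1 & \cdots & 0 & 0 & 0 \\
0 & \alpha_2 & \beta_2 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \beta_{N-3} & \gamma_{N-3} & 0 \\
0 & 0 & 0 & \cdots & \alpha_{N-2} & \beta_{N-2} & \gamma_{N-2} \\
0 & 0 & 0 & \cdots & 0 & \alpha_{N-1} & \beta_{N-1}
\end{pmatrix}
\begin{pmatrix} Q_0(x) \\ Q_1(x) \\ Q_2(x) \\ \vdots \\ Q_{N-3}(x) \\ Q_{N-2}(x) \\ Q_{N-1}(x) \end{pmatrix}
= x
\begin{pmatrix} Q_0(x) \\ Q_1(x) \\ Q_2(x) \\ \vdots \\ Q_{N-3}(x) \\ Q_{N-2}(x) \\ Q_{N-1}(x) \end{pmatrix}
-
\begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \gamma_{N-1} Q_N(x) \end{pmatrix}
$$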
What this equation says is that $x$ is almost an eigenvalue of the matrix on the
LHS. The only thing that spoils the eigenvalue condition is the extra term in the
last slot of the second vector on the RHS. However, this term vanishes whenever
$x$ is a root of $Q_N$! This means that the roots of $Q_N$ are precisely the eigenvalues
of the tridiagonal matrix on the LHS.
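Here's a little julia code that will compute and return an $N$-dimensional
vector containing the roots of the $N$th Legendre polynomial, $P_N(x)$:

using LinearAlgebra            # provides eigvals

function LegendreRoots(N)
    A = zeros(N, N)            # tridiagonal matrix from (13); the diagonal (beta_n) is zero
    for n = 0:N-2
        A[n+1, n+2] = (n + 1) / (2n + 1)    # superdiagonal entry gamma_n
        A[n+2, n+1] = (n + 1) / (2n + 3)    # subdiagonal entry alpha_{n+1}
    end
    eigvals(A)                 # the eigenvalues are the roots of P_N
end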
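One possible refinement, offered here as a sketch rather than as part of the original notes: the tridiagonal matrix above is not symmetric, but it is similar to a symmetric tridiagonal matrix whose off-diagonal entries are $\sqrt{\alpha_{n+1}\gamma_n}$, which for the Legendre coefficients works out to $k/\sqrt{4k^2 - 1}$ with $k = n + 1$. A symmetric eigensolver then returns real, sorted eigenvalues. The function name `legendre_roots_symmetric` is hypothetical.

```julia
using LinearAlgebra

# Roots of P_N as eigenvalues of the symmetrized tridiagonal matrix.
# Off-diagonal entries: sqrt(alpha_{n+1} * gamma_n) = k / sqrt(4k^2 - 1), k = n + 1.
function legendre_roots_symmetric(N)
    offdiag = [k / sqrt(4k^2 - 1) for k in 1:N-1]
    return eigvals(SymTridiagonal(zeros(N), offdiag))
end

roots2 = legendre_roots_symmetric(2)   # roots of P_2 = (3x^2 - 1)/2
roots3 = legendre_roots_symmetric(3)   # roots of P_3 = (5x^3 - 3x)/2
```

The design choice is purely numerical: the similarity transform leaves the eigenvalues unchanged, so both matrices encode the same roots.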
3 Gaussian quadrature
In this section we consider the evaluation of integrals of the form
$$\int_a^b f(x)\,W(x)\,dx \tag{14}$$
where $W(x)$ is some weight function and $f(x)$ is an arbitrary function whose
integral (times $W$) we are trying to compute. We would like to construct an
$N$-point quadrature rule consisting of $N$ points and weights $\{\{x_n\}, \{w_n\}\}$ such
that
$$\sum_{n=1}^{N} w_n f(x_n) \approx \int_a^b f(x)\,W(x)\,dx. \tag{15}$$
Note that the sum on the LHS here only involves samples of $f$, not $W$; the
weight function $W(x)$ is baked in to the definition of the quadrature weights
$w_n$.
Let $\{Q_n(x)\}$ be the set of orthogonal polynomials defined with respect
to an inner product of the form (1) with interval $[a, b]$ and weight function $W(x)$
matching those of the integral we are trying to compute in (14). [In the common
case in which $W(x) = 1$, these will be just the Legendre polynomials $\{P_n(x)\}$.]
It's easy to construct an $N$-point quadrature rule that exactly integrates polynomials up to degree $N - 1$
If you give me any set of $N$ distinct points $\{x_n\}$ distributed throughout the interval $[a, b]$,
I can find a set of $N$ weights $\{w_n\}$ such that the quadrature rule $[\{x_n\}, \{w_n\}]$
exactly integrates all polynomials of degree $N - 1$ or less. All I have to do
is to require my quadrature rule to be exact for the first $N$ elements in the
orthogonal set $\{Q_n\}$. Since any polynomial of degree $N - 1$ or lower can be
exactly represented as a linear combination of these elements, its integral will
be computed exactly by our quadrature rule.
The condition that our quadrature rule be exact for the first $N$ polynomials
in the set $\{Q_n\}$ amounts to a set of $N$ simultaneous linear equations on the $N$
quadrature weights $\{w_n\}$. Indeed, the requirement that my quadrature rule be
exact when I use it to integrate the function $Q_0$ gives me the condition
$$w_1 Q_0(x_1) + w_2 Q_0(x_2) + \cdots + w_N Q_0(x_N) = \int_a^b Q_0(x)\,W(x)\,dx. \tag{16a}$$
The condition that the rule be exact for $Q_1$ yields
$$w_1 Q_1(x_1) + w_2 Q_1(x_2) + \cdots + w_N Q_1(x_N) = \int_a^b Q_1(x)\,W(x)\,dx. \tag{16b}$$
Proceeding similarly, I obtain a total of $N$ equations, culminating in
$$w_1 Q_{N-1}(x_1) + w_2 Q_{N-1}(x_2) + \cdots + w_N Q_{N-1}(x_N) = \int_a^b Q_{N-1}(x)\,W(x)\,dx. \tag{16c}$$
Equations (16) together constitute an $N \times N$ linear system for the quadrature
weights $w_n$.
Note also that the RHS of this system is simpler than it looks: as we noted
earlier, all the RHS integrals vanish except for the one involving Q0 , so the RHS
vector of our linear system has only one nonzero entry.
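The linear system (16) is small enough to set up directly. The sketch below does so for the Legendre case ($W = 1$ on $[-1, 1]$), where the RHS vector is $(2, 0, \dots, 0)$ because $\int_{-1}^{1} P_0\,dx = 2$. The helper names `legendre_values` and `quadrature_weights`, and the choice of three equally spaced points, are assumptions for illustration.

```julia
# Values P_0(x), ..., P_{N-1}(x), computed with the Legendre recurrence.
function legendre_values(N, x)
    P = ones(N)
    if N >= 2
        P[2] = x
    end
    for n in 1:N-2
        P[n+2] = ((2n + 1) * x * P[n+1] - n * P[n]) / (n + 1)
    end
    return P
end

# Solve the N x N system (16) for the weights, given arbitrary distinct points xs.
function quadrature_weights(xs)
    N = length(xs)
    M = zeros(N, N)
    for (n, x) in enumerate(xs)
        M[:, n] = legendre_values(N, x)   # row k holds P_{k-1} at every point
    end
    rhs = zeros(N)
    rhs[1] = 2.0                          # only the P_0 integral is nonzero
    return M \ rhs
end

w = quadrature_weights([-1.0, 0.0, 1.0])
```

For these three equally spaced points the weights come out as $(1/3, 4/3, 1/3)$, which is Simpson's rule on $[-1, 1]$.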
But Gauss discovered a way to construct an $N$-point quadrature rule
that exactly integrates polynomials up to degree $2N - 1$
The preceding development tells me that, given any choice of $N$ points $\{x_n\}$,
I can find a set of $N$ weights that makes the quadrature rule (15) exact for all
polynomials up to degree $N - 1$.
However, among all possible ways to choose the set of quadrature points
$\{x_n\}$, there is one choice that is distinguished: It is the set of roots of the
polynomial $Q_N(x)$. It is an astonishing fact that the quadrature rule (15),
computed with the $\{x_n\}$ taken as the roots of $Q_N$ and the weights computed as
discussed above, is exact for all polynomials up to degree $2N - 1$. This massively
expands the space of functions over which our quadrature rule is exact; the
technique is known as Gaussian quadrature.
The proof of this statement is amazingly simple. Let $f(x)$ be a polynomial
of degree $2N - 1$ or less. If we divide[3] $f(x)$ by the polynomial $Q_N(x)$, we obtain
some quotient $p(x)$ and some remainder $r(x)$, and because $Q_N$ has degree $N$ we
are guaranteed that $p(x)$ and $r(x)$ both have degree $N - 1$ or less. In other
words, any polynomial $f$ of degree $2N - 1$ or less may be written exactly in the form
$$f(x) = Q_N(x)\,p(x) + r(x), \qquad \deg p,\ \deg r \leq N - 1. \tag{17}$$
But now look at what happens when I apply the quadrature rule (15) to $f(x)$:
$$\int_a^b f(x)\,W(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n) = \sum_{n=1}^{N} w_n \Big[ \underbrace{Q_N(x_n)}_{=0}\, p(x_n) + r(x_n) \Big].$$
The first term vanishes because the quadrature points are roots of $Q_N$! This
leaves behind
$$\sum_{n=1}^{N} w_n\, r(x_n) = \int_a^b r(x)\,W(x)\,dx \quad \text{(exactly)}. \tag{18}$$
In other words, using our quadrature rule to integrate the function (17) is equivalent to integrating just the function $r(x)$. But this function is exactly integrated
by our quadrature rule because it has degree $N - 1$ or less and our quadrature rule
handles all such functions exactly.

[3] The operation at work here is synthetic division. Do you remember this from high school?
Meanwhile, we can evaluate the exact integral $\int_a^b f(x)\,W(x)\,dx$ another way,
by expanding the function $p(x)$ in (17) in the set of functions $\{Q_n\}$ [cf. equation
(11a)]. Since $p$ has degree $N - 1$ or less, this expansion includes only terms up to
$Q_{N-1}$:
$$p(x) = \sum_{n=0}^{N-1} c_n\, Q_n(x)$$
and hence (17) reads
$$f(x) = Q_N(x) \sum_{n=0}^{N-1} c_n\, Q_n(x) + r(x).$$
Integrating, we find
$$\int_a^b f(x)\,W(x)\,dx = \sum_{n=0}^{N-1} c_n \underbrace{\int_a^b Q_N(x)\,Q_n(x)\,W(x)\,dx}_{=0} + \int_a^b r(x)\,W(x)\,dx \tag{19}$$
$$= \int_a^b r(x)\,W(x)\,dx \tag{20}$$
because $Q_N$ is orthogonal to $Q_n$ for all $n \leq N - 1$.

Comparing (18) to (20) we see that our quadrature rule is exact for all
functions which can be decomposed in the form (17), that is, for all polynomials
of degree $2N - 1$ or less. Isn't this beautiful? I love this.
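The degree-$(2N-1)$ exactness is easy to observe numerically. The following sketch builds an $N$-point Gauss-Legendre rule using only ingredients from these notes: nodes from an eigenvalue problem and weights from the linear system (16). It assumes $W = 1$ on $[-1, 1]$ and uses a symmetrized form of the tridiagonal matrix (off-diagonal entries $k/\sqrt{4k^2-1}$), which has the same eigenvalues; the helper names `legendre_values`, `gauss_legendre`, and `quad` are hypothetical.

```julia
using LinearAlgebra

# Values P_0(x), ..., P_{N-1}(x) from the Legendre recurrence.
function legendre_values(N, x)
    P = ones(N)
    N >= 2 && (P[2] = x)
    for n in 1:N-2
        P[n+2] = ((2n + 1) * x * P[n+1] - n * P[n]) / (n + 1)
    end
    return P
end

# Nodes: roots of P_N via a symmetric tridiagonal eigenproblem.
# Weights: solution of the system (16), whose RHS is (2, 0, ..., 0) for W = 1.
function gauss_legendre(N)
    xs = eigvals(SymTridiagonal(zeros(N), [k / sqrt(4k^2 - 1) for k in 1:N-1]))
    M = hcat([legendre_values(N, x) for x in xs]...)
    rhs = zeros(N)
    rhs[1] = 2.0
    return xs, M \ rhs
end

quad(f, xs, ws) = sum(ws .* f.(xs))

xs, ws = gauss_legendre(3)
err_deg5 = abs(quad(x -> x^4 + x^5, xs, ws) - 2/5)   # degree 5 = 2N - 1
err_deg6 = abs(quad(x -> x^6, xs, ws) - 2/7)         # degree 6 = 2N
```

With $N = 3$ the degree-5 test integrand is integrated to rounding error, while the degree-6 integrand shows a clearly nonzero error, which is where the exactness guarantee ends.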
Gauss vs. Clenshaw-Curtis

For an interesting discussion of the relative merits of Gaussian vs. Clenshaw-Curtis quadrature, see the article "Is Gauss Quadrature Better than Clenshaw-Curtis?", by N. Trefethen, SIAM Review 50 p. 67, available online here: http://epubs.siam.org/doi/pdf/10.1137/060659831.
4 Integral equations and Nyström's method
Motivation: The 1D Semiconductor

In previous discussions, we considered the computation of the electrostatic
potential in a one-dimensional crystalline ionic solid. Let's now generalize this
in two ways: we will treat the underlying charge density as continuous rather
than discrete, and we will consider the case of a semiconducting rather than an
ionic material.
In a semiconductor, the local charge density $\rho(x)$ depends strongly on the
local electrostatic potential $\phi(x)$. A simple model of this dependence is furnished
by
$$\rho(x) = \rho_0\, e^{\phi(x)/V_T} \tag{21}$$
where the thermal voltage $V_T$ is the thermal energy $kT$ divided by the electron charge $e$,
$V_T = \frac{kT}{e} \approx 0.026$ volts at room temperature.
Consider now a 1D semiconductor of length $L$ characterized by a local line-charge density[4] $\rho(x)$. The electrostatic potential $\phi(x)$ is determined by
$\rho$ according to
$$\phi(x) = \int_{-L/2}^{L/2} \frac{\rho(x')\,dx'}{|x - x'|}.$$
However, from (21) we also have that $\rho$ is determined by $\phi$ according to
$$\rho(x) = \rho_0\, e^{\phi(x)/V_T}.$$
Combining, we obtain
$$\phi(x) = \rho_0 \int_{-L/2}^{L/2} \frac{e^{\phi(x')/V_T}}{|x - x'|}\,dx'. \tag{22}$$
This is an integral equation for the electrostatic potential.

Integral equations are much harder than differential equations, for the following reason: In the case of a differential equation, we can always work locally
to figure out, for example, the next point on a solution curve given just a single
point on that curve. Indeed, this is precisely the M.O. of the ODE solvers that
we discussed in the first unit of our course. In doing this, we know nothing about
the global behavior of the solution curve, know nothing about what the solution
is doing far away from our given point, and nonetheless can infer incremental
knowledge from the local information contained in the differential equation.
On the other hand, in an equation like (22) there is no notion of proceeding
locally: To do anything at all with the RHS of the equation requires global
knowledge of the function $\phi$.

[4] The line-charge density $\rho(x)$ is defined such that the total charge in the interval $[x, x + dx]$
is $\rho(x)\,dx$.
Nyström's Method
Nyström's method uses Gaussian quadrature to convert an integral equation
into a linear system of equations. The most general setting is to consider an
integral equation of the form
$$\int_a^b K(x, x')\,S(x')\,dx' = F(x) \tag{23}$$
where $K(x, x')$ is a known kernel function, $F(x)$ is a known forcing function, and
$S(x)$ is an unknown source function for which we are trying to solve. Nyström's
method is to use an $N$-point quadrature rule for the interval $[a, b]$:
$$\int_a^b K(x, x')\,S(x')\,dx' \approx \sum_{n=1}^{N} w_n\, K(x, x_n)\, S(x_n).$$
We then require that equation (23) be satisfied at each of the $N$ quadrature
points $x_n$. This gives us $N$ equations:
$$w_1 K(x_1, x_1) S(x_1) + w_2 K(x_1, x_2) S(x_2) + \cdots + w_N K(x_1, x_N) S(x_N) = F(x_1)$$
$$w_1 K(x_2, x_1) S(x_1) + w_2 K(x_2, x_2) S(x_2) + \cdots + w_N K(x_2, x_N) S(x_N) = F(x_2)$$
and so on down to
$$w_1 K(x_N, x_1) S(x_1) + w_2 K(x_N, x_2) S(x_2) + \cdots + w_N K(x_N, x_N) S(x_N) = F(x_N).$$
This is an $N \times N$ linear system of the form
$$
\begin{pmatrix}
w_1 K(x_1, x_1) & w_2 K(x_1, x_2) & \cdots & w_N K(x_1, x_N) \\
w_1 K(x_2, x_1) & w_2 K(x_2, x_2) & \cdots & w_N K(x_2, x_N) \\
\vdots & \vdots & \ddots & \vdots \\
w_1 K(x_N, x_1) & w_2 K(x_N, x_2) & \cdots & w_N K(x_N, x_N)
\end{pmatrix}
\begin{pmatrix} S(x_1) \\ S(x_2) \\ \vdots \\ S(x_N) \end{pmatrix}
=
\begin{pmatrix} F(x_1) \\ F(x_2) \\ \vdots \\ F(x_N) \end{pmatrix}
$$
which we solve for the values of our unknown source distribution at the quadrature points.
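Here is a minimal sketch of the method on a deliberately contrived test problem, chosen so that the answer is known in advance. The kernel $K(x, x') = (x - x')^2$, the interval $[-1, 1]$, and the source $S(x) = 1 + x$ are all assumptions made for this example; with them, $F(x) = 2x^2 - \tfrac{4}{3}x + \tfrac{2}{3}$ by direct integration. The 3-point Gauss-Legendre nodes and weights are hard-coded, and the function name `nystrom_solve` is hypothetical.

```julia
# Assemble and solve the Nystrom linear system for S at the quadrature points.
function nystrom_solve(K, F, xs, ws)
    N = length(xs)
    A = [ws[n] * K(xs[m], xs[n]) for m in 1:N, n in 1:N]
    b = [F(x) for x in xs]
    return A \ b
end

# 3-point Gauss-Legendre rule on [-1, 1].
xs = [-sqrt(3/5), 0.0, sqrt(3/5)]
ws = [5/9, 8/9, 5/9]

K(x, xp) = (x - xp)^2                    # contrived polynomial kernel
F(x) = 2x^2 - (4/3) * x + 2/3            # forcing produced by S(x) = 1 + x
S = nystrom_solve(K, F, xs, ws)          # should reproduce 1 .+ xs
```

Because the integrand $K(x, x')S(x')$ is a polynomial of degree 3 in $x'$, the 3-point rule integrates it exactly and the computed samples match $S(x_n) = 1 + x_n$ to rounding error. For general kernels the quadrature is only approximate, and systems arising from equations of the form (23) can be poorly conditioned, so this example should not be read as typical.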
18.330 Introduction to Numerical Analysis

Spring 2015
Problem Set 1
Due: Thursday, 2/19/2015, at the beginning of class
Problem 1. Consider the infinite sum and the $N$th partial sum
$$S = \sum_{n=1}^{\infty} f(n), \qquad S_N = \sum_{n=1}^{N} f(n)$$
for the summand function $f(n) = \frac{1}{n^4}$. Define the relative error in the $N$th partial
sum to be $E_N \equiv \left| \frac{S_N - S}{S} \right|$.
(a) Estimate how large we must choose $N$ to ensure that $S_N$ agrees with $S$
to 9-digit precision. (That is, estimate the smallest value of $N$ such that
$E_N < 10^{-9}$.)
(b) Write a computer program involving a simple loop to evaluate $S_N$. Plot
$E_N$ versus $N$ and assess the accuracy of your prediction from Part (a).
Note: Although not necessary to solve this problem, it is interesting that the
infinite sum here may be evaluated in closed form:
$$S = \sum_{n=1}^{\infty} \frac{1}{n^4} = \frac{\pi^4}{90}.$$
We will prove this statement later in the semester when we discuss Fourier
analysis.
Problem 2. (This is a simple exercise that foreshadows a concept we will
discuss in detail in a couple of weeks.) Many numerical sums involve summands
of widely-varying magnitudes. However, in some cases we might find ourselves
summing many numbers of roughly equal magnitudes. As a particularly blatant
example, consider the quantity $P_N$ defined as the sum of $N$ equal numbers as
follows:
$$P_N \equiv \sum_{n=1}^{N} \frac{\pi}{N}.$$
(The fact that the summand is independent of $n$ here is not a typo!) Now
consider the following julia program for computing the quantity $P_N$.

function PN(N)
    Summand = pi/N;
    Sum = 0.0;
    for n = 1:N
        Sum += Summand;
    end
    Sum
end

(a) Consider the quantity
$$E_N \equiv \frac{|P_N - \pi|}{\pi}.$$
State, in words, how you expect the quantity $E_N$ to depend on $N$ for values
of $N$ in the range $10^2 < N < 10^9$.
(b) Now write a computer program that computes $E_N$ for general values of $N$.
Plot, on a log-log plot, $E_N$ versus $N$ for values in the range $10^2 < N < 10^9$.
(If you use julia, you may copy-and-paste the above code snippet for the
function PN; if you use another language it will be easy enough to port
this snippet to that language.) How do the results compare with your
expectations as stated in Part (a)?
Problem 3. In this problem you will derive the composite second-order Newton-Cotes quadrature rule (Simpson's rule) for integrating over an interval $[\alpha, \beta]$,
subdivided into $M$ subintervals.
(a) As a preliminary warmup, suppose you are given an $N$-point quadrature
rule $\{x_n, w_n\}$ for integrating over the interval $[-1, 1]$. That is, $\{x_n\}$ are
$N$ points lying in the range $-1 \leq x_n \leq 1$, and $\{w_n\}$ are $N$ weights such
that
$$\int_{-1}^{1} f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n). \tag{1}$$
Construct from $\{x_n, w_n\}$ a new quadrature rule for integrating over a
general interval $[u, v]$. That is, given $\{x_n, w_n\}$ find $\{x'_n, w'_n\}$ such that
$u \leq x'_n \leq v$ and
$$\int_u^v f(x)\,dx \approx \sum_{n=1}^{N} w'_n f(x'_n). \tag{2}$$
(b) Next derive the basic (not composite) second-order Newton-Cotes quadrature rule for integrating a function over the interval $[-1, 1]$, as follows:
(1) Given a function $f(x)$ defined on this interval, construct the unique
second-degree polynomial $P(x) = ax^2 + bx + c$ that agrees with $f(x)$ at the
three points $x = -1, 0, 1$. [Your answer will involve expressions for $a, b, c$
in terms of $f(-1), f(0), f(1)$.] (2) Integrate $P(x)$ over the interval $[-1, 1]$
to obtain an approximation to the integral of $f$ over this interval in terms
of the three samples $f(-1), f(0), f(1)$. Express this result in the form (1)
to obtain a quadrature rule $\{x_n, w_n\}$ for integrating $f$ over $[-1, 1]$.
(c) Combine your answers to parts (a) and (b) to write down the basic (not
composite) Simpson's rule for integrating $f$ over $[u, v]$.
(d) Finally, given an interval $[\alpha, \beta]$, subdivide the interval into $M$ equal-width
subintervals, apply the basic Simpson's rule to integrate $f$ over each subinterval, and sum the results to obtain the composite Simpson's rule for
integrating $f$ over $[\alpha, \beta]$. How many samples of $f$ does this rule require?
(Be careful not to overcount).
Problem 4. Write a computer program that implements the composite 0th-,
1st-, and 2nd-order Newton-Cotes quadrature rules (that is, the composite rectangular, trapezoidal, and Simpson's rules) for integrating an arbitrary function
over an arbitrary interval, subdivided into $M$ subintervals. Use your program
to approximate the following integrals. In each case, plot the relative error
$E = \frac{|I^{\text{approx}} - I^{\text{exact}}|}{|I^{\text{exact}}|}$ versus $N$ for values of $N$ in the range $[10, 10^7]$. (Here $N$ is
the number of function samples required by the quadrature rule, $I^{\text{approx}}$ is the
approximation to the integral obtained by numerical quadrature, and $I^{\text{exact}}$ is
the exact value of the integral.) How do the results compare with your expectations?
Z

2
(a) Ia
ecos (x+1) +2 sin(4x+1) dx
0
Z
(b) Ib
cos cos2 (x+1)+2 sin(4x+1)
dx
tanh x dx
p
|x |
arctan(x) arctan(x)
dx
x
(c) Ic
0
Z
(d) Id
0
Note: Although not strictly necessary to work this problem, for your error
comparisons you may use the following table of accurate integral values:
Ia = 2.5193079820307612557
Ib = 4.4889560612699568830
Ic = 6.6388149923287733132
Id = 1.7981374998645790990
Extra credit (5%) Unlike the other integrals in Problem 4, it turns out that
the improper integral of Problem 4(d) may be evaluated analytically in closed
form. Do so. Hint: Replace the number $\pi$ with a variable $u$ and let $F(u)$ be
the value of the integral in this case. Differentiate $F$ with respect to $u$ and see
what you get.
Extra credit (10%): Mathematics evolving around us in real time.
Just under two years ago, on April 17, 2013, a little-known mathematician
at the University of New Hampshire submitted to the journal Annals of Mathematics a paper that solved an extremely old outstanding problem in number
theory. To earn some extra credit on your PSet this week you may do a little
research to learn about this interesting mathematics story evolving around us
in real time. One of many outstanding resources you may find useful in tracking
this down is Terence Tao's blog: http://terrytao.wordpress.com.
(a) Who is the (formerly) little-known mathematician? (He's certainly not
little-known anymore.) Did he follow a traditional career path to achieving
success in mathematics? Name at least one non-academic job he held after
receiving his PhD.
(b) What problem did the mathematician solve? State the problem clearly,
and give a brief (one-sentence) summary of the solution. You do not have to
understand how the solution works. (For example, I don't.)
(c) The solution to the problem involves a certain integer-valued parameter
commonly known as $H_1$, for which it is generally considered desirable to find
the minimal admissible value. What is the significance of $H_1$? What value of
$H_1$ was included in the original paper submitted to Annals of Mathematics?
(d) Since the original paper submission in April 2013, the mathematics
community has succeeded in reducing significantly the minimal admissible value
of $H_1$. What is the best current value of $H_1$ and how recently was it obtained?
Briefly describe the collaborative process by which the improved value of $H_1$ was
attained, and comment on how it differs from how mathematics was done prior
to the 21st century.
(e) Extra extra credit (50,000%). Find an even smaller admissible value
of $H_1$.
18.330 Introduction to Numerical Analysis

Spring 2014
Problem Set 2
Due: Thursday, 2/19/2015, at the beginning of class
Problem 1: Nested quadrature and hybrid analytical/numerical methods.
This problem illustrates a common theme in numerical analysis: Some problems which cannot be fully solved analytically, but which are too computationally expensive to solve fully numerically, may be attacked by hybrid analytical/numerical methods, in which we solve half of the problem analytically and
handle the rest numerically.
One common implementation of the boundary-element method of computational electromagnetism proceeds by discretizing the surfaces of compact objects
into small triangles and pretending that the surface-charge density is constant
over the area of each triangle. Here's an example of what the surface discretization process might look like:
In performing such computations, we will need to evaluate the electric fields
arising from the surface charge densities on individual panels. For example,
consider the triangle $D$ (for "domain") shown in the figure below, which lives
in the $xy$ plane with one vertex at the origin of coordinates.
At a point $\mathbf{x}$, the scalar potential $\phi(\mathbf{x})$ due to a constant unit-strength surface
charge density on this triangle is
$$\phi(\mathbf{x}) = \phi(x, y, z) = \int_0^1 dx' \int_0^{x'} dy'\, \frac{e^{ikr}}{4\pi r} \tag{1}$$
where
$$r = \sqrt{(x - x')^2 + (y - y')^2 + z^2}$$
and $k$ is a parameter related to the angular frequency $\omega$ according to $k = \omega/c$
(where $c$ is the speed of light). In this problem, you will consider the numerical
task of computing $\phi(0, 0, 0)$, i.e. the potential at the origin of coordinates.
(a) Write a computer program that evaluates integral (1) using nested
Simpsons-rule quadrature. (This program will involve applying the composite Simpsons rule to an outer integrand function which is itself evaluated by
applying the composite Simpsons rule to an inner integrand function. Use the
same number of subintervals, N , for both the outer and inner applications of
Simpsons rule.) Note with caution that the integrand is singular at the point
(x0 , y 0 ) = 0. For k = 0.1 and k = 1.0, run your program for various numbers
of subintervals N and plot the error incurred by the approximate quadrature
versus the time required for the computation.
For reference, you may use the following accurate values of (0, 0, 0):
for k = 0.1: = 0.069985354634047037579 + 0.0039744546692343236264i
for k = 1.0: = 0.055919391431515292736 + 0.035568755003092618039i
(b) Although integral (1) cannot be evaluated in closed form for $k \neq 0$, it
can be "half evaluated"; that is, one of the two integrals may be performed analytically, leaving behind a single integral which must be evaluated numerically.
To effect this simplification, replace $y'$ with a new integration variable $u$ defined
such that $y' = x' u$, $dy' = x' \, du$. Rewrite the integral as a double integral over $x'$
and $u$. Now evaluate one of the two integrals analytically, reducing the double
integral to a single integral.
(c) Finally, write a computer program that evaluates the single integral
remaining after part (b) using Simpson's-rule quadrature. For k = 0.1 and
k = 1.0, run your program for various numbers of subintervals N and plot the
error incurred by the approximate quadrature versus the time required for the
computation.
2. ODE integrators and quadrature rules.
(a) Show that the problem of evaluating a definite integral of the form

$$ I = \int_a^b f(x) \, dx $$

may be recast as the problem of integrating the ordinary differential equation

$$ \frac{du}{dt} = f(t) $$

from t = a to t = b, subject to the initial condition

$$ u(a) = 0. $$

(b) Consider using (i) the Euler method and (ii) the improved Euler method
to evaluate the integral $\int_a^b f(x) \, dx$, using a stepsize $h = \frac{b-a}{N}$ for integer N. In
each case, write out the resulting approximation to the integral in the form of
a quadrature rule, and compare to the Newton-Cotes quadrature rules.
3. Choreographed orbits. In class we discussed the use of numerical ODE
integrators to solve the problem of massive bodies interacting via gravitational
forces. It turns out that the problem of two gravitating bodies may actually
be solved analytically in closed form (a subject discussed in advanced courses
on classical mechanics), but already the next simplest case (three massive
interacting bodies) is not generally solvable and must be explored by numerical
techniques. Among other things, this means that many features of the three-body problem remain to be discovered, even after several hundred years of
investigation. In this problem, you will study one type of behavior that was
only discovered within the last 15 years: choreographed orbits.[1]
(a) Consider three planets of equal mass m located at positions r1 , r2 , r3 . Each
planet experiences an attractive gravitational force directed toward the
other two planets. Write down three second-order differential equations
(Newton's laws of motion) governing the time evolution of $\mathbf{r}_1$, $\mathbf{r}_2$, and $\mathbf{r}_3$.
(b) We will assume that the z component of the position and velocity of all
three planets is fixed at 0, so that we need only consider the x and y
components; for example,
$$ \mathbf{r}_1 = (x_1, y_1, 0), \qquad \dot{\mathbf{r}}_1 = (\dot{x}_1, \dot{y}_1, 0). $$
Given this simplification, rewrite the three coupled second-order differential equations of Part (a) in the form of a twelve-dimensional first-order
ODE system. Work in units such that Gm = 1.
(c) Using the improved Euler method, integrate this ODE from t = 0 to t = 20
subject to the following initial conditions:

$$
\begin{array}{ll}
(x_1, y_1) = (0.7, 0.36), & (\dot{x}_1, \dot{y}_1) = (0.99, 0.078) \\
(x_2, y_2) = (1.1, 0.07), & (\dot{x}_2, \dot{y}_2) = (0.1, 0.47) \\
(x_3, y_3) = (0.4, 0.3),  & (\dot{x}_3, \dot{y}_3) = (1.1, 0.53)
\end{array}
\tag{2}
$$

Plot the trajectory of each planet. (That is, for each planet, plot a curve
in the (x, y) plane representing the path that planet traverses as it moves
in time. Plot all three curves on the same graph.) Make sure you choose
a step size small enough to ensure that the orbits are converged within
the scale of the plot axes (that is, re-running the calculation at a smaller
stepsize will not noticeably change the plots).

[1] Reference: http://www.math.utexas.edu/users/jjames/celestHw2Notes.pdf
(d) To investigate the fragility of this special type of orbit, tweak one or more of
the 12 numbers in (2) (say, increase or decrease it by 25% or so) and integrate the system again. Plot the resulting orbits for at least two different
tweaks of initial conditions.
(e) Extra credit (10%): Can you find an alternative set of initial conditions
that leads to trajectories qualitatively similar to what you found in Part
(c)? By "alternative" I mean a set of 12 numbers of which at least 6 differ
by more than 50% from the values given in (2).
Extra credit (10%). Go to the science library and consult the second edition
of the book A Classical Introduction to Modern Number Theory by Ireland &
Rosen. Find the proof of Theorem 20.6.1 on page 359.
(a) Write a brief (around one sentence) description of the logical structure of
this method of proof. You do not need to understand or describe the
content of the conjecture being proven or the hypothesis used in its proof.
(b) Describe the slightly unusual punctuation the authors use to conclude their
description of the proof schema. Have you seen this notation in a mathematics textbook before?
(c) Can you think of any other theorems whose proof proceeds along the same
logical structure as this proof? (I can only think of one, and I stumbled
on it by accident, so "no" is an acceptable answer to this question.)

Richardson Extrapolation
Homer Reid
February 27, 2014
Suppose we are carrying out some sort of numerical procedure that involves
an adjustable parameter $\Delta$ that tunes the accuracy of the method at the expense of
computational cost. As we shrink $\Delta$ toward 0, the accuracy of our calculation
improves, but the calculation becomes more expensive. (Alternatively, we might
characterize computational cost in terms of $N \sim \frac{1}{\Delta}$, in which case the accuracy
improves as $N \to \infty$.) A good example to have in mind is numerical quadrature
via the trapezoidal rule: here the adjustable parameter $\Delta$ is just the width of
the trapezoids, and $N = \frac{b-a}{\Delta}$ is the total number of trapezoids we need to use
to integrate over an interval [a, b].
Let $F(\Delta)$ denote the value returned by our numerical procedure for a given
choice of $\Delta$. Ideally we would like to compute the quantity $F(\Delta = 0)$, but this
is generally impossible as it would require an infinite amount of computation.
Instead, we will have to make do with computing F at finite values of $\Delta$.
We will concern ourselves here with the case in which we know a priori how
the accuracy of our numerical procedure depends on $\Delta$. More specifically, we
will assume that we know our method is a p-th order method; that is, that the
error incurred by our numerical procedure is given by a polynomial in $\Delta$ whose
leading term has degree p, i.e.

$$ F(\Delta) - F(0) = A\Delta^p + O(\Delta^{p+1}) \tag{1} $$

where A is some unknown constant. For example, for the trapezoidal rule we
have p = 2, while for the rectangular rule we have p = 1.
To summarize the situation in symbols, we have

$$ \underbrace{F(0)}_{\text{what we want}} = \underbrace{F(\Delta)}_{\text{what we can compute}} - \underbrace{A\Delta^p}_{\text{dominant error term}} + \underbrace{O(\Delta^{p+1})}_{\text{higher-order error terms}} \tag{2} $$
The quantity p determines how hard we have to work to improve the accuracy
of a given estimate of our quantity. To see this, suppose we have computed $F(\Delta)$
for some value of $\Delta$, and suppose we now want to refine this estimate by adding
roughly one digit of precision; that is to say, we want to decrease the error by
a factor of 10. If p = 1, then to reduce the error by a factor of 10 we must shrink $\Delta$ to
$\frac{1}{10}$ of its value. For something like rectangular-rule integration, this means we have to do
10 times more work just to earn that extra digit! In contrast, if p = 2, then we
only need to do $\sqrt{10} \approx 3$ times more work. Clearly the higher the value of p
the better.
Richardson extrapolation is a technique for increasing the effective value of
p. The idea is to compare two evaluations of $F(\Delta)$, at two different values of $\Delta$,
and use what we know about the $\Delta$ dependence of the error to eliminate the leading-order error term. To see how it works, suppose we have evaluated F at $\Delta$ and
at $\Delta/2$. Applying equation (1) twice, we express the numbers we have obtained
in the form

$$ F(\Delta) = F(0) + A\Delta^p + O(\Delta^{p+1}) \tag{3} $$

$$ F\left(\frac{\Delta}{2}\right) = F(0) + A\left(\frac{\Delta}{2}\right)^p + O(\Delta^{p+1}) \tag{4} $$

Now multiply the second line here by $2^p$, subtract the first line from it, and do
a little algebra to obtain[1]

$$ F(0) = \frac{2^p F\left(\frac{\Delta}{2}\right) - F(\Delta)}{2^p - 1} + O(\Delta^{p+1}) \tag{5} $$

The point is that the error term proportional to $\Delta^p$ in (3) and (4) has cancelled
out of the combination in (5), leaving us with an estimate of our quantity whose
error decays more rapidly with $\Delta$.
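For readers who want the "little algebra" spelled out, the steps leading from (3) and (4) to (5) can be written as follows. This is only a restatement of the manipulation described above, using the same symbols:

```latex
\begin{align*}
2^p F\!\left(\tfrac{\Delta}{2}\right) &= 2^p F(0) + A\Delta^p + O(\Delta^{p+1})
   && \text{(equation (4) multiplied by } 2^p\text{)} \\
2^p F\!\left(\tfrac{\Delta}{2}\right) - F(\Delta) &= (2^p - 1)\,F(0) + O(\Delta^{p+1})
   && \text{(subtract equation (3); the } A\Delta^p \text{ terms cancel)} \\
F(0) &= \frac{2^p F\!\left(\tfrac{\Delta}{2}\right) - F(\Delta)}{2^p - 1} + O(\Delta^{p+1})
   && \text{(divide by } 2^p - 1\text{)}
\end{align*}
```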
The first term on the RHS of (5) defines the Richardson-extrapolated version
of our numerical method at convergence parameter $\Delta$:

$$ F^{\text{Richardson}}(\Delta) \equiv \frac{2^p F\left(\frac{\Delta}{2}\right) - F(\Delta)}{2^p - 1} \tag{6a} $$

or, written in terms of the parameter N,

$$ F^{\text{Richardson}}(N) \equiv \frac{2^p F(2N) - F(N)}{2^p - 1} \tag{6b} $$

If $F(\Delta)$ converges to the exact answer like $\Delta^p$, then $F^{\text{Richardson}}(\Delta)$ converges
to the exact answer like $\Delta^{p+1}$. (But note that each invocation of $F^{\text{Richardson}}$
requires you to do 3N work, instead of the N work you need to do for F.)
In other words, to summarize the situation in symbols again,

$$ \underbrace{F(0)}_{\text{what we want}} = \underbrace{\frac{F(\Delta) - 2^p F\left(\frac{\Delta}{2}\right)}{1 - 2^p}}_{\text{what we can compute}} + \underbrace{O(\Delta^{p+1})}_{\text{dominant error term}} \tag{7} $$

The quantity labeled "what we can compute" in this equation is the Richardson-extrapolated version of our numerical method at convergence parameter $\Delta$.
Comparing this to equation (2), we see that we have effectively improved the
rate of convergence of our numerical approximation scheme.

[1] If you are following along with the algebra at home, you will notice that the $O(\Delta^{p+1})$
term in equation (5) is a linear combination of the $O(\Delta^{p+1})$ terms in (3) and (4). The point
is that any linear combination of two quantities that are each $O(\Delta^{p+1})$ yields a third quantity
that is itself $O(\Delta^{p+1})$, no matter what coefficients we choose in the linear combination (as
long as none of them depend on $\Delta$). This is a feature of the $O(\cdot)$ notation: it completely
ignores multiplicative coefficients and only keeps track of the leading $\Delta$ dependence.
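To make (6b) concrete, here is a minimal julia sketch that applies Richardson extrapolation to the trapezoidal rule (p = 2). The function names `trapezoid` and `richardson` and the test integral of sin(x) over [0, π] are illustrative choices, not part of the notes.

```julia
# Composite trapezoidal rule with N panels of width Δ = (b - a)/N (a p = 2 method).
function trapezoid(f, a, b, N)
    Δ = (b - a) / N
    s = (f(a) + f(b)) / 2
    for i in 1:N-1
        s += f(a + i * Δ)
    end
    return Δ * s
end

# Richardson-extrapolated estimate following (6b):
# (2^p F(2N) - F(N)) / (2^p - 1), where F maps N to an approximate value.
function richardson(F, N, p)
    return (2.0^p * F(2N) - F(N)) / (2.0^p - 1)
end

# Test problem (an assumption for illustration): the integral of sin over [0, π] is 2.
F(N) = trapezoid(sin, 0.0, pi, N)
exact = 2.0

for N in (8, 16, 32)
    println(N, "   plain error: ", abs(F(N) - exact),
               "   Richardson error: ", abs(richardson(F, N, 2) - exact))
end
```

The extrapolated error should fall much faster with N than the plain trapezoidal error, at the price of also evaluating F(2N).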
Terminology
In some cases, the application of Richardson extrapolation to an existing numerical method is assigned a new name, even though the underlying method
is really the same. For example, the application of Richardson extrapolation
to Newton-Cotes quadrature rules is called Romberg integration. On the other
hand, in the world of ODE integrators the combination of Richardson extrapolation with the midpoint method (which you considered in PSet 3) is known as
the Bulirsch-Stoer algorithm.

Nonlinear Root Finding and a Glimpse at
Optimization
Homer Reid
March 13, 2014
Contents
1 Overview
  1.1 Examples of root-finding problems
2 One-dimensional root-finding techniques
  2.1 Bisection
  2.2 Secant
  2.3 Newton-Raphson
3 Newton's method in higher dimensions
4 Newton's method is a local method
5 Computing roots of polynomials
6 A glimpse at numerical optimization
  6.1 Derivative-free optimization of 1D functions
  6.2 Roots can be found more accurately than extrema
1 Overview
Root-finding problems take the general form

find x such that f (x) = 0
where f (x) will generally be some complicated nonlinear function. (It had better
be nonlinear, since otherwise we hardly need numerical methods to solve.)
The multidimensional case
The root-finding problem has an obvious and immediate generalization to the
higher-dimensional case:
$$ \text{find } \mathbf{x} \text{ such that } \mathbf{f}(\mathbf{x}) = 0 \tag{1} $$

where $\mathbf{x}$ is an N-dimensional vector and $\mathbf{f}(\mathbf{x})$ is an M-dimensional vector-valued
function (we do not require M = N). Equation (1) is unambiguous; it is asking
us to find the points $\mathbf{x}$ that $\mathbf{f}$ maps to the origin of the vector space $\mathbb{R}^M$, which is a single unique point in
that space.
Root-finding is an iterative procedure
In contrast to many of the algorithms we have seen thus far, the algorithms
we will present for root-finding are iterative: they start with some initial guess
and then repeatedly apply some procedure to improve this guess until it has
converged (i.e. it is good enough). What this means is that we generally don't
know a priori how much work we will need to do to find our root. That might
make it sound as though root-finding algorithms take a long time to converge.
In fact, in many cases the opposite is true; as we will demonstrate, many of the
root-finding algorithms we present exhibit dramatically faster convergence than
any of the other algorithms we have seen thus far in the course.
1.1
Examples of root-finding problems
Ferromagnets
The mean-field theory of the D-dimensional Ising ferromagnet yields the following equation governing the spontaneous magnetization m:

$$ m = \tanh\left(\frac{2Dm}{T}\right) \tag{2} $$

where T is the temperature.[1] For a given temperature, we solve (2) numerically
to compute m, which characterizes how strongly magnetized our magnet is.

[1] Measured in units of the nearest-neighbor spin coupling J in the Ising Hamiltonian.
Resonance frequencies of structures

A very common application of numerical root-finders is identifying the frequencies at which certain physical structures will resonate. As one example, consider
a one-dimensional model of an optical fiber consisting of a slab of dielectric material with thickness T (we might have something like T = 10 μm) and refractive
index n (for example, silicon has $n \approx 3.4$). Then from Maxwell's equations it's
easy to derive that the following relation

$$ \tanh\left(\frac{nT\omega}{c}\right) - \frac{2n}{1 + n^2} = 0 $$

must hold between T, n, and the angular frequency $\omega$ in order for a resonant
mode to exist. (Here c is the speed of light in vacuum.)
The Riemann $\zeta$ function
The greatest unsolved problem in mathematics today is a root-finding problem.
The Riemann $\zeta$ (zeta) function is defined by a contour integral as

$$ \zeta(s) = \frac{\Gamma(1 - s)}{2\pi i} \oint_C \frac{z^{s-1}}{e^{-z} - 1} \, dz $$

where C is a certain contour in the complex plane. This function has trivial
roots at negative even integers $s = -2, -4, -6, \ldots$, as well as nontrivial roots
at other values of s. To date many nontrivial roots of the equation $\zeta(s) = 0$
have been identified, but they all have the property that their real part is $\frac{1}{2}$. The
Riemann hypothesis is the statement that in fact all nontrivial roots of $\zeta(s) = 0$
have $\mathrm{Re}\, s = \frac{1}{2}$, and if you can prove this statement (or find a counterexample by
producing s such that $\zeta(s) = 0$, $\mathrm{Re}\, s \neq \frac{1}{2}$) then the Clay Mathematics Institute
in Harvard Square will give you a million dollars.
Linear eigenvalue problems
Let $\mathbf{A}$ be an $N \times N$ matrix and consider the problem of determining eigenpairs
$(\mathbf{x}, \lambda)$, where $\mathbf{x}$ is an N-dimensional vector and $\lambda$ is a scalar. These are roots of
the equation

$$ \mathbf{A}\mathbf{x} - \lambda\mathbf{x} = 0. \tag{3} $$

Because both $\lambda$ and $\mathbf{x}$ are unknown, we should think of (3) as an (N+1)-dimensional nonlinear root-finding
problem, where the (N+1)-dimensional vector
of unknowns we seek is $\begin{pmatrix} \mathbf{x} \\ \lambda \end{pmatrix}$, and where the nonlinearity arises because the
$\lambda\mathbf{x}$ term couples the unknowns to each other.

Although (3) is thus a nonlinear problem if we think of it as an (N+1)-dimensional problem, it is separately linear in each of $\lambda$ and $\mathbf{x}$, and for this
reason we call it the linear eigenvalue problem. The linear eigenvalue problem is not typically solved using the methods discussed in these notes; instead,
it is generally solved using a set of extremely well-developed methods of numerical linear algebra (namely, Householder decomposition and QR factorization),
which are implemented by lapack and available in all numerical software packages including julia and matlab.
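As a small illustration of handing the linear eigenvalue problem to a library routine rather than to a root-finder, the following julia sketch calls `eigen` from the standard `LinearAlgebra` library and checks the residual of (3). The 3×3 matrix is a made-up test case.

```julia
using LinearAlgebra

# Hypothetical symmetric test matrix (not from the notes).
A = [2.0 1.0 0.0;
     1.0 3.0 1.0;
     0.0 1.0 4.0]

E = eigen(Symmetric(A))     # dense symmetric eigensolver
λ = E.values[1]             # smallest eigenvalue
x = E.vectors[:, 1]         # corresponding eigenvector

# Residual of equation (3); it should be close to machine precision.
residual = norm(A * x - λ * x)
println("λ = ", λ, "   |Ax - λx| = ", residual)
```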
Nonlinear eigenvalue problems
On the other hand, it may be the case that the matrix $\mathbf{A}$ in (3) depends
on its own eigenvalues and/or eigenvectors. In this case we have a nonlinear
eigenvalue problem and the usual methods of numerical linear algebra do not
apply; in this case we must solve using nonlinear root-finding methods such as
Newton's method.
Nonlinear boundary-value problems
In our unit on boundary-value problems we considered the problem of particle
motion in a time-dependent force field f(t). We considered an ODE boundary-value problem of the form

$$ \frac{d^2 x}{dt^2} = f(t), \qquad x(t_a) = x(t_b) = 0 \tag{4} $$

and we showed that finite-difference techniques allow us to reduce this ODE to
a linear system of equations of the form

$$ \mathbf{A}\mathbf{x} = \mathbf{f} \tag{5} $$

where $\mathbf{A}$ is a matrix with constant entries, $\mathbf{x}$ is a vector of (unknown) samples
of the particle position $x(t_n)$ at time points $t_n$, and $\mathbf{f}$ is a vector of (known)
samples of the forcing function at those time points:

$$ \mathbf{x} = \begin{pmatrix} x(t_1) \\ \vdots \\ x(t_N) \end{pmatrix} = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \text{unknown}, \qquad \mathbf{f} = \begin{pmatrix} f(t_1) \\ \vdots \\ f(t_N) \end{pmatrix} = \text{known}. $$

Equation (5) may be thought of as a linear root-finding problem, i.e. we seek a
root of the N-dimensional linear equation

$$ \mathbf{A}\mathbf{x} - \mathbf{f} = 0. \tag{6} $$

This simple problem has the immediate solution

$$ \mathbf{x} = \mathbf{A}^{-1}\mathbf{f} \tag{7} $$

which may be computed easily via standard methods of numerical linear algebra.
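The linear case (4)-(7) can be sketched in a few lines of julia. The forcing function, the interval [0, 1], the grid size, and the use of the standard three-point second-difference stencil are all assumptions made for this illustration; the forcing is chosen so that the exact solution x(t) = sin(πt) is known.

```julia
using LinearAlgebra

# Assumed test problem: d²x/dt² = f(t) on [0, 1], x(0) = x(1) = 0,
# with f(t) = -π² sin(πt), so that the exact solution is x(t) = sin(πt).
N = 99                          # number of interior grid points
h = 1.0 / (N + 1)               # grid spacing
t = [n * h for n in 1:N]        # interior time points t_1, ..., t_N

# Second-difference matrix: (x_{n-1} - 2 x_n + x_{n+1}) / h^2
A = Tridiagonal(fill(1.0 / h^2, N - 1), fill(-2.0 / h^2, N), fill(1.0 / h^2, N - 1))
f = [-pi^2 * sin(pi * tn) for tn in t]

x = A \ f                       # solves (5); the inverse in (7) is never formed explicitly
err = maximum(abs.(x .- sin.(pi .* t)))
println("max error = ", err)
```

Using the backslash operator rather than an explicit inverse is the usual way to evaluate (7) in practice.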
But now consider the case of particle motion in a position-dependent force
field f(x). (For example, in a 1D gravitational-motion problem we would have
$f(x) = \frac{GM}{x^2}$.) The ODE now takes the form

$$ \frac{d^2 x}{dt^2} = f(x), \qquad x(t_a) = x(t_b) = 0. \tag{8} $$

Again we can use finite-difference techniques to write a system of equations
analogous to (5):

$$ \mathbf{A}\mathbf{x} = \mathbf{f} \tag{9} $$

However, the apparent similarity of (9) to (5) is deceptive, because the RHS
vector in (9) now depends on the unknown vector $\mathbf{x}$! More specifically, in
equation (9) we now have

$$ \mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \text{unknown}, \qquad \mathbf{f} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix} = \text{also unknown!} $$

Thus equation (9) defines a nonlinear root-finding problem,

$$ \mathbf{A}\mathbf{x} - \mathbf{f}(\mathbf{x}) = 0 \tag{10} $$

and no immediate solution like (7) is available; instead we must solve iteratively
using nonlinear root-finding techniques.
2 One-dimensional root-finding techniques

2.1 Bisection
The simplest root-finding method is the bisection method, which basically just
performs a simple binary search. We begin by bracketing the root: this means
finding two points $x_1$ and $x_2$ at which f(x) has different signs, so that we
are guaranteed[2] to have a root between $x_1$ and $x_2$. Then we bisect the interval
$[x_1, x_2]$, computing the midpoint $x_m = \frac{1}{2}(x_1 + x_2)$ and evaluating f at this point.
We now ask whether the sign of f (xm ) agrees with that of f (x1 ) or f (x2 ). In
the former case, we have now bracketed the root in the interval [xm , x2 ]; in the
latter case, we have bracketed the root in the interval [x1 , xm ]. In either case,
we have shrunk the width of the interval within which the root may be hiding
by a factor of 2. Now we again bisect this new interval, and so on.
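The procedure just described can be sketched in julia as follows. The function name `bisect`, the stopping tolerance, and the iteration cap are illustrative choices; the test function and bracket are the ones used in the case study in these notes.

```julia
# Bisection on a bracket [x1, x2] (x1 < x2) at whose endpoints f has opposite signs.
function bisect(f, x1, x2; tol=1e-12, maxiter=200)
    f1, f2 = f(x1), f(x2)
    f1 * f2 < 0 || error("root is not bracketed")
    xm = (x1 + x2) / 2
    for n in 1:maxiter
        xm = (x1 + x2) / 2
        fm = f(xm)
        # Stop when we hit the root exactly or the half-width is below tol.
        if fm == 0 || (x2 - x1) / 2 < tol
            return xm
        end
        if sign(fm) == sign(f1)
            x1, f1 = xm, fm      # sign agrees with f(x1): root lies in [xm, x2]
        else
            x2 = xm              # sign agrees with f(x2): root lies in [x1, xm]
        end
    end
    return xm
end

root = bisect(x -> tanh(x - 5), 3.0, 5.8)
println(root)
```

Note that only the sign of f at the midpoint is used to choose the new bracket, exactly as in the description above.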
Case Study
As a simple case study, let's investigate the convergence of the bisection method
on the function f(x) = tanh(x − 5). The exact root, to 16-digit precision, is
x=5.000000000000000. Suppose we initially bracket the root in the interval
[3.0,5.8] and take the midpoint of the current bracket to be our estimate of the
root at each step; thus, for example, our initial guess is $x_1 = 4.4$. The following
table of numbers illustrates the evolution of the method as it converges to the
exact root.

  n   Bracket                             x_n
  1   [3.00000000e+00, 5.80000000e+00]    4.400000000000000e+00
  2   [4.40000000e+00, 5.80000000e+00]    5.100000000000000e+00
  3   [4.40000000e+00, 5.10000000e+00]    4.750000000000000e+00
  4   [4.75000000e+00, 5.10000000e+00]    4.925000000000000e+00
  5   [4.92500000e+00, 5.10000000e+00]    5.012499999999999e+00
  6   [4.92500000e+00, 5.01250000e+00]    4.968750000000000e+00
  7   [4.96875000e+00, 5.01250000e+00]    4.990625000000000e+00
  8   [4.99062500e+00, 5.01250000e+00]    5.001562499999999e+00
  9   [4.99062500e+00, 5.00156250e+00]    4.996093750000000e+00
 10   [4.99609375e+00, 5.00156250e+00]    4.998828124999999e+00
 11   [4.99882812e+00, 5.00156250e+00]    5.000195312499999e+00
 12   [4.99882812e+00, 5.00019531e+00]    4.999511718749999e+00
 13   [4.99951172e+00, 5.00019531e+00]    4.999853515624999e+00
 14   [4.99985352e+00, 5.00019531e+00]    5.000024414062499e+00
 15   [4.99985352e+00, 5.00002441e+00]    4.999938964843748e+00
 16   [4.99993896e+00, 5.00002441e+00]    4.999981689453124e+00
[2] Assuming the function is continuous. We will not consider the ill-defined problem of
root-finding for discontinuous functions.
The important thing about this table is that the number of correct (red)
digits grows approximately linearly with n. This is what we call linear convergence.[3] Let's now try to understand this phenomenon analytically.
Convergence rate
Suppose the width of the interval within which we initially bracketed the root
was $\Delta_0 = x_2 - x_1$. Then, after one iteration of the method, the width of the
interval within which the root may be hiding has shrunk to $\Delta_1 = \frac{1}{2}\Delta_0$ (note
that this is true regardless of which subinterval we chose as our new bracket;
they both had the same width). After two iterations, the width of the interval
within which the root may be hiding is $\Delta_2 = \frac{1}{2}\Delta_1 = \frac{1}{4}\Delta_0$, and so on. Thus,
after N iterations, the width of the interval within which the root may be hiding
(which we may alternatively characterize as the absolute error with which we
have pinpointed a root) is

$$ \epsilon_N^{\text{bisection}} = 2^{-N} \Delta_0 \tag{11} $$

In other words, the bisection method converges exponentially rapidly. (More
specifically, the bisection method exhibits linear convergence; the number of
correct digits grows linearly with the number of iterations. If we have 6 good
digits after 10 iterations, then we need to do 10 more iterations to get the next
6 digits, for a total of 12 good digits.)
Note that this convergence rate is faster than anything we have seen in the
course thus far: faster than any Newton-Cotes quadrature rule, faster than any
ODE integrator, faster than any finite-difference stencil, all of which exhibit
errors that decay algebraically (as a power law) with N .
The bisection method is extremely robust; if you can succeed in bracketing
the root to begin with, then you are guaranteed to converge to the root. The
robustness stems from the fact that, as long as f is continuous and you can succeed in initially bracketing a root, you are guaranteed to have a root somewhere
in the interval, while the error in your approximation of this root cannot help
but shrink inexorably to zero as you repeatedly halve the width of the bucket
in which it could be hiding.
On the other hand, the bisection method is not the most rapidly-convergent
method. Among other things, the method only uses minimal information about
the values of the function at the interval endpoints: namely, only its sign, and
not its magnitude. This seems somehow wasteful. A method that takes better
advantage of the function information at our disposal is the secant method,
described next.
[3] As emphasized in the lecture notes on convergence terminology, linear convergence is not
to be confused with first-order convergence, which is when the error decreases like 1/n, and
hence the number of correct digits grows like log10 (n).
2.2 Secant
The idea of the secant method is to speed the convergence of the bisection
method by using information about the magnitudes of the function values at
the interval endpoints in addition to their signs. More specifically, suppose we
have evaluated f(x) at two points $x_1$ and $x_2$. We plot the points $(x_1, y_1 = f(x_1))$
and $(x_2, y_2 = f(x_2))$ on a Cartesian coordinate system and draw a straight line
connecting these two points. Then we take the point $x_3$ at which this line crosses
the x-axis as our updated estimate of the root. In symbols, the rule is

$$ x_3 = x_2 - \frac{x_2 - x_1}{f(x_2) - f(x_1)} f(x_2) $$

Then we repeat the process, generating a new point $x_4$ by looking at the points
$(x_2, f(x_2))$ and $(x_3, f(x_3))$, and so on. The general rule is

$$ x_{n+1} = x_n - \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} f(x_n) \tag{12} $$

As we might expect, the error in the secant method decays more rapidly than
that in the bisection method; the error obeys roughly $\epsilon_{n+1} \propto \epsilon_n^p$ with
$p = \frac{1 + \sqrt{5}}{2} \approx 1.6$, so the number of correct digits grows by a factor of about 1.6 per iteration.
One drawback of the secant method is that, in contrast to the bisection
method, it does not maintain a bracket of the root. This makes the method less
robust than the bisection method.
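A minimal julia sketch of the secant iteration (12) follows. The function name `secant`, the stopping rule based on the size of the last step, and the starting points 4.4 and 5.1 are illustrative assumptions; the test function is the tanh(x − 5) example used elsewhere in these notes.

```julia
# Secant iteration, equation (12), started from two points x1 and x2.
function secant(f, x1, x2; tol=1e-12, maxiter=50)
    f1, f2 = f(x1), f(x2)
    for n in 1:maxiter
        f2 == f1 && error("secant line is horizontal; cannot continue")
        x3 = x2 - f2 * (x2 - x1) / (f2 - f1)   # where the secant line crosses the x-axis
        x1, f1 = x2, f2                        # discard the oldest point
        x2, f2 = x3, f(x3)
        abs(x2 - x1) < tol && return x2        # stop when the update is tiny
    end
    return x2
end

root = secant(x -> tanh(x - 5), 4.4, 5.1)
println(root)
```

Unlike the bisection routine, nothing here checks that the two current points bracket the root, which is the loss of robustness mentioned above.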
2.3 Newton-Raphson
Take another look at equation (12). Suppose that $x_{n-1}$ is close to $x_n$, i.e.
imagine $x_{n-1} = x_n + h$ for some small number h. Then the quantity multiplying
$f(x_n)$ in the second term of (12) is something like the inverse of the finite-difference approximation to the derivative of f at $x_n$:

$$ \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} \approx \frac{1}{f'(x_n)} $$

If we assume that this approximation is trying to tell us something, we are led
to consider the following modified version of (12):

$$ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \tag{13} $$

This prescription for obtaining an improved root estimate from an initial root estimate is called Newton's method (also known as the Newton-Raphson method).
Alternative derivation of Newton-Raphson
Another way to understand the Newton-Raphson iteration (13) is to expand the
function f(x) in a Taylor series about the current root estimate $x_n$:

$$ f(x) = f(x_n) + (x - x_n) f'(x_n) + \frac{1}{2}(x - x_n)^2 f''(x_n) + \cdots \tag{14} $$

If we evaluate (14) at the actual root $x_0$, then the LHS is zero (because $f(x_0) = 0$
since $x_0$ is a root), whereupon we find

$$ 0 = f(x_n) + (x_0 - x_n) f'(x_n) + O\left[(x_0 - x_n)^2\right] $$

If we neglect the quadratic and higher-order terms in this equation, we can solve
immediately for the root $x_0$:

$$ x_0 = x_n - \frac{f(x_n)}{f'(x_n)} \tag{15} $$

This reproduces equation (13).

To summarize: Newton's method approximates f(x) as a linear function and
jumps directly to the point at which this linear function is zeroed out. From
this, we can expect that the method will work well in the vicinity of a single root
(where the function really is approximately linear) but less well in the vicinity
of a multiple root and perhaps not well at all when we aren't in the vicinity of
a root. We will quantify these predictions below.
Convergence of Newton-Raphson
Suppose we have run the Newton-Raphson algorithm for n iterations, so that
our best present estimate of the root is $x_n$. Let $x_0$ be the actual root. As above,
let's express this root using the Taylor-series expansion of f(x) about the point
$x = x_n$:

$$ f(x_0) = 0 = f(x_n) + f'(x_n)(x_0 - x_n) + \frac{1}{2} f''(x_n)(x_0 - x_n)^2 + O\left[(x_0 - x_n)^3\right] $$

Divide both sides by $f'(x_n)$ and rearrange a little:

$$ \underbrace{x_0 - x_n + \frac{f(x_n)}{f'(x_n)}}_{x_0 - x_{n+1}} = -\frac{1}{2}\frac{f''(x_n)}{f'(x_n)}(x_0 - x_n)^2 + O\left[(x_0 - x_n)^3\right] $$

But now the quantity on the LHS is telling us the distance between the root
and $x_{n+1}$, the next iteration of the Newton method. In other words, if we define
the error after n iterations as $\epsilon_n = |x_0 - x_n|$, then

$$ \epsilon_{n+1} = C\epsilon_n^2 $$

(where C is some constant). In other words, the error squares on each iteration.
To analyze the implications of this fact for convergence, it's easiest to take
logarithms on both sides:

$$ \log \epsilon_{n+1} \sim 2 \log \epsilon_n \sim 4 \log \epsilon_{n-1} $$

and so on, working backwards until we find

$$ \log \epsilon_{n+1} \sim 2^{n+1} \log \epsilon_0 $$

where $\epsilon_0$ is the error in our initial root estimate. Note that the logarithm of
the error grows exponentially in magnitude with n (becoming ever more negative), which means that the error itself decays
doubly exponentially with n: we have something like

$$ \epsilon_n \sim e^{-A e^{Bn}} \tag{16} $$

for positive constants A and B.

Another way to characterize (16) is to say that the number of correct digits
uncovered by Newton's method roughly doubles on each iteration; we say Newton's method exhibits quadratic convergence.
Case study
Let's apply Newton's method to find a root of the function tanh(x − 5). The
exact root, to 16-digit precision, is x=5.000000000000000. We will start the
method at an initial guess of $x_1 = 4.4$ and iterate using (13). This produces
the following table of numbers, in which correct digits are printed in red:

  n   x_n
  1   4.400000000000000
  2   5.154730677706086
  3   4.997518482593209
  4   5.000000010187351
  5   5.000000000000000

After 3 iterations, I have 4 good digits; after 4 iterations, 8 good digits; after 5
iterations, 16 good digits. This is quadratic convergence.
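The table above can be regenerated with a few lines of julia. This is a minimal sketch of iteration (13); the function names `newton`, `f`, and `fprime` are illustrative, and the derivative sech²(x − 5) of tanh(x − 5) is supplied by hand.

```julia
# Newton iteration (13): returns the starting point followed by niter updates.
function newton(f, fprime, x; niter=4)
    xs = [x]
    for n in 1:niter
        x = x - f(x) / fprime(x)
        push!(xs, x)
    end
    return xs
end

f(x)      = tanh(x - 5)
fprime(x) = sech(x - 5)^2       # derivative of tanh(x - 5)

xs = newton(f, fprime, 4.4)
for (n, xn) in enumerate(xs)
    println(n, "   ", xn)
end
```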
Double roots
What happens if f(x) has a double root at $x = x_0$? A double root means
that both $f(x_0) = 0$ and $f'(x_0) = 0$. Since our error analysis above assumed
$f'(x_0) \neq 0$, we might expect it to break down if this condition is not satisfied,
and indeed in this case Newton's method exhibits only linear convergence.
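A quick numerical illustration, using the made-up example f(x) = (x − 1)², which has a double root at x = 1: for this function the Newton update reduces to x − (x − 1)/2, so the error is halved on every step instead of being squared. The function name `newton_errors` is an illustrative choice.

```julia
# Hypothetical example with a double root at x0 = 1.
f(x)      = (x - 1)^2
fprime(x) = 2 * (x - 1)

# Run Newton's method and record the error |x - 1| after each iteration.
function newton_errors(x, niter)
    errs = Float64[]
    for n in 1:niter
        x = x - f(x) / fprime(x)
        push!(errs, abs(x - 1))
    end
    return errs
end

errs = newton_errors(2.0, 10)
ratios = errs[2:end] ./ errs[1:end-1]   # ratio of successive errors
println(ratios)
```

A constant ratio between successive errors is the signature of linear convergence.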
3 Newton's method in higher dimensions
One advantage of Newton's method over simple methods like bisection is that it
extends readily to multidimensional root-finding problems. Consider the problem of finding a root $\mathbf{x}_0$ of a vector-valued function:

$$ \mathbf{f}(\mathbf{x}) = 0 \tag{17} $$

where $\mathbf{x}$ is an N-dimensional vector and $\mathbf{f}$ is an N-dimensional vector of functions. (Although in the introduction we stated that root-finding problems may
be defined in which the dimensions of $\mathbf{f}$ and $\mathbf{x}$ are different, Newton's method
only applies to the case in which they are the same.)
The linear case
There is one case of the system (17) that you already know how to solve: the
case in which the system is linear, i.e. $\mathbf{f}(\mathbf{x})$ is just matrix multiplication of $\mathbf{x}$ by
a matrix with x-independent coefficients:

$$ \mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x} = 0 \tag{18} $$

In this case, we know there is always the one (trivial) root $\mathbf{x} = 0$, and the condition for the existence of a nontrivial root is the vanishing of the determinant of
$\mathbf{A}$. If $\det \mathbf{A} \neq 0$, then there is no point trying to find a nontrivial root, because
none exists. On the other hand, if $\det \mathbf{A} = 0$ then $\mathbf{A}$ has a zero eigenvalue and
it's easy to solve for the corresponding eigenvector, which is a nontrivial root of
(18).
The nonlinear case
The vanishing-of-determinant condition for the existence of a nontrivial root of
(18) is very nice: it tells us exactly when we can expect a nontrivial solution to
exist.
For more general nonlinear systems there is no such nice condition for the
existence of a root[4], and thus it is convenient indeed that Newton's method for
root-finding has an immediate generalization to the multi-dimensional case. All
we have to do is write out the multidimensional generalization of (14) for the
Taylor expansion of a multivariable function around the point $\mathbf{x}$:

$$ \mathbf{f}(\mathbf{x} + \boldsymbol{\Delta}) = \mathbf{f}(\mathbf{x}) + \mathbf{J}\boldsymbol{\Delta} + O(\Delta^2) \tag{19} $$

where the Jacobian matrix $\mathbf{J}$ is the matrix of first partial derivatives of $\mathbf{f}$:

$$ \mathbf{J}(\mathbf{x}) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_N} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_N}{\partial x_1} & \frac{\partial f_N}{\partial x_2} & \cdots & \frac{\partial f_N}{\partial x_N}
\end{pmatrix} $$

where all partial derivatives are to be evaluated at $\mathbf{x}$.

[4] At least, this is the message they give you in usual numerical analysis classes, but it is not
quite the whole truth. For polynomial systems it turns out there is a beautiful generalization
of the determinant known as the resultant that may be used, like the determinant, to yield a
criterion for the existence of a nontrivial root. I hope we will get to discuss resultants later in
the course, but for now you can read about it in the wonderful books Ideals, Varieties, and
Algorithms and Using Algebraic Geometry, both by Cox, Little, and O'Shea.
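Now suppose we have an estimate $\mathbf{x}$ for the root of the nonlinear system $\mathbf{f}(\mathbf{x}) = 0$.
Let's compute the increment $\boldsymbol{\Delta}$ that we need to add to $\mathbf{x}$ to jump to the exact
root of the system. Setting (19) equal to zero and ignoring higher-order terms,
we find

$$ 0 = \mathbf{f}(\mathbf{x} + \boldsymbol{\Delta}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}\boldsymbol{\Delta} $$

or

$$ \boldsymbol{\Delta} = -\mathbf{J}^{-1}\mathbf{f}(\mathbf{x}) $$

In other words, if $\mathbf{x}_n$ is our best guess as to the location of the root after n
iterations of Newton's method, then our best guess after n + 1 iterations will be

$$ \mathbf{x}_{n+1} = \mathbf{x}_n - \mathbf{J}^{-1}\mathbf{f}(\mathbf{x}_n) \tag{20} $$

where $\mathbf{J}$ is evaluated at $\mathbf{x}_n$. This is an immediate generalization of (13); indeed, in the 1D case $\mathbf{J}$ reduces
simply to $f'$ and we recover (13).
However, computationally, (20) is more expensive than (13): it requires us
to solve a linear system of equations on each iteration.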
Example
As a case study in the use of Newton's method in multiple dimensions, consider
the following two-dimensional nonlinear system:

$$ \mathbf{f}(\mathbf{x}) = \begin{pmatrix} x_1^2 - \cos(x_1 x_2) \\ e^{x_1 x_2} + x_2 \end{pmatrix} $$

The Jacobian matrix is

$$ \mathbf{J}(\mathbf{x}) = \begin{pmatrix} 2x_1 + x_2 \sin(x_1 x_2) & x_1 \sin(x_1 x_2) \\ x_2 e^{x_1 x_2} & x_1 e^{x_1 x_2} + 1 \end{pmatrix} $$

This example problem has a solution at

$$ \mathbf{x}_0 = \begin{pmatrix} 0.926175 \\ -0.582852 \end{pmatrix} $$

Here's a julia routine called NewtonSolve that computes a root of this system.
Note that the body of the NewtonStep routine is only three lines long.
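using LinearAlgebra   # provides norm()

# The two-component function whose root we seek
function f(x)
  x1 = x[1]
  x2 = x[2]
  [x1^2 - cos(x1*x2); exp(x1*x2) + x2]
end

# The 2x2 Jacobian matrix of f
function J(x)
  x1 = x[1]
  x2 = x[2]
  J11 = 2*x1 + x2*sin(x1*x2)
  J12 = x1*sin(x1*x2)
  J21 = x2*exp(x1*x2)
  J22 = x1*exp(x1*x2) + 1
  [J11 J12; J21 J22]
end

# One Newton update, equation (20); the backslash solves the linear system
function NewtonStep(x)
  fVector = f(x)
  jMatrix = J(x)
  x - jMatrix \ fVector
end

# Iterate until the norm of f(x) falls below a fixed threshold
function NewtonSolve()
  x = [1.0; 1.0]   # arbitrary initial guess
  residual = norm(f(x))
  while residual > 1.0e-12
    x = NewtonStep(x)
    residual = norm(f(x))
  end
  x
end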
4 Newton's method is a local method
Newtons method exhibits outstanding local convergence, but terrible global

convergence. One way to think of this is to say that Newtons method is more
of a root-polisher than a root-finder : If you are already near a root, you can
use Newtons method to zero in on that root to high precision, but if you arent
near a root and dont know where to start looking then Newtons method may
be useless.
To give just one example, consider the function $\tanh(x - 5)$ that we considered
above. Suppose we didn't know that this function had a root at $x = 5$, and
suppose we started looking for a root near $x = 0$. Setting $x_1 = 0$ and executing
one iteration of Newton's method yields
\[ x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 0 - \frac{\tanh(-5)}{\operatorname{sech}^2(-5)} = 5506.61643 \]
Newton's method has sent us completely out of the ballpark! What went
wrong?
What went wrong here is that the function $\tanh(x - 5)$ has a very gentle slope
at $x = 0$; in fact, the function is almost flat there (more specifically, its slope
is $\operatorname{sech}^2(x - 5) \approx 2 \cdot 10^{-4}$), and so, when we approximate the function as a line
with that slope and jump to the point at which that line crosses the $x$ axis, we
wind up something like 5,000 units away. This is what we get for attempting to
use Newton's method with a starting point that is not close to a root.
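The overshoot is easy to reproduce. The following minimal sketch applies a single Newton step to $\tanh(x - 5)$ from a far starting point and from a nearby one; the names ftanh, ftanh_prime, and newton_step_1d are illustrative and not taken from the notes.

```julia
# One Newton step, equation (13), for f(x) = tanh(x - 5)
ftanh(x)       = tanh(x - 5)
ftanh_prime(x) = sech(x - 5)^2     # the derivative of tanh is sech^2

newton_step_1d(x) = x - ftanh(x) / ftanh_prime(x)

x2_far  = newton_step_1d(0.0)   # start far from the root: jumps thousands of units
x2_near = newton_step_1d(4.5)   # start near the root: moves closer to x = 5
println(x2_far)
println(x2_near)
```

Starting at 4.5 the step lands within 0.1 of the root, whereas starting at 0 it lands near 5506.6, matching the calculation above.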
Newton's method applied to polynomials

We get particularly spectacular examples of the sketchy global convergence
properties of Newton's method when we apply the method to the computation
of roots of polynomials.
One obvious example of what can go wrong is the use of Newton's method
to compute the roots of
\[ P(x) = x^2 + 1 = 0. \qquad (21) \]
The Newton iteration (13) applied to (21) yields the sequence of points
\[ x_{n+1} = x_n - \frac{x_n^2 + 1}{2x_n}. \qquad (22) \]
If we start with any real-valued initial guess $x_1$, then the sequence of points
generated by (22) is guaranteed to remain real-valued for all $n$, and thus we can
never hope to converge to the correct roots $\pm i$.
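A short experiment makes the point concrete. The sketch below iterates (22) from a real starting value and from a complex one; the helper names newton_map and iterate_newton, the starting values, and the iteration count of 40 are illustrative choices.

```julia
# Newton iteration (22) for P(x) = x^2 + 1
newton_map(x) = x - (x^2 + 1) / (2x)

# Hypothetical helper: apply the map n times starting from x
function iterate_newton(x, n)
    for _ in 1:n
        x = newton_map(x)
    end
    return x
end

x_real    = iterate_newton(0.7, 40)           # real start: every iterate stays real
z_complex = iterate_newton(0.7 + 0.3im, 40)   # start in the upper half-plane
println(x_real)
println(z_complex)
```

The real sequence wanders without settling, since no real number satisfies (21), while the complex start in the upper half-plane converges to the root $i$.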
Newton fractals
We get a graphical depiction of phenomena like this by plotting, in the complex
plane, the set of points $\{z_0\}$ at which Newton's method, when started at $z_0$ for
a function $f(z)$, converges to a specific root. [More specifically: For each point $z$
in some region of the complex plane, we run Newton's method on the function $f$
starting at $z$. If the iteration converges to the $m$th root in $N$ iterations, we plot
a color whose RGB value is determined by the tuple $(m, N)$.] You can generate
plots like this using the Julia function PlotNewtonConvergence, which takes as
its single argument a vector of the polynomial coefficients sorted in decreasing
order. Here's an example for the function $f(z) = z^3 - 1$.
julia> PlotNewtonConvergence([1 0 0 -1])
Figure 1: Newton fractal for the function $f(z) = z^3 - 1$.
The three roots of $f(z)$ are $1$, $e^{2\pi i/3}$, $e^{4\pi i/3}$. The variously colored regions
in the plot indicate points in the complex plane for which Newton's method
converges to the various roots; for example, red points converge to $e^{2\pi i/3}$, and
yellow points converge to $e^{4\pi i/3}$. What you see is that for starting points in
the immediate vicinity of each root, convergence to that root is guaranteed, but
elsewhere in the complex plane all bets are off; there are large red and yellow
regions that lie nowhere near the corresponding roots, and the fantastically
intricate boundaries of these regions indicate the exquisite sensitivity of Newton's
method to the exact location of the starting point.
This type of plot is known as a Newton fractal, for obvious reasons. Thus
the global convergence behavior of Newton's method for polynomial root-finding
yields beautiful pictures, but not a very happy time for actual numerical
root-finders.
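The classification behind such a plot can be sketched in a few lines. The function newton_basin below is a hypothetical stand-in, not the PlotNewtonConvergence code from the notes: it runs Newton's method on $f(z) = z^3 - 1$ from a starting point and returns the tuple $(m, N)$ described above, with $m = 0$ when no root is reached. The tolerance, the iteration cap, and the coarse grid are assumptions made for illustration.

```julia
# Hypothetical helper: run Newton's method on f(z) = z^3 - 1 from z0 and
# return (m, N), where m is the index of the root reached (0 if none)
# and N is the number of iterations used.
function newton_basin(z0; tol=1.0e-10, maxiter=100)
    cube_roots = (1.0 + 0.0im, exp(2π*im/3), exp(4π*im/3))
    z = complex(float(z0))
    for N in 0:maxiter
        for (m, r) in enumerate(cube_roots)
            if abs(z - r) < tol
                return (m, N)
            end
        end
        z = z - (z^3 - 1) / (3z^2)    # Newton update, equation (13)
    end
    return (0, maxiter)
end

# Classify a coarse grid of starting points in the square [-2,2] x [-2,2]
pts  = LinRange(-2, 2, 8)
grid = [newton_basin(x + y*im)[1] for y in pts, x in pts]
println(grid)
```

Mapping each entry of grid to a color, on a much finer grid, gives a picture of the kind shown in Figure 1.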
Computing roots of polynomials
In the previous section we observed that Newton's method exhibits spectacularly
sketchy global convergence when we use it to compute roots of polynomials.
So what should you do to compute the roots of a polynomial $P(x)$? For an
arbitrary $N$th-degree polynomial with real or complex coefficients, the fundamental
theorem of algebra guarantees that $N$ complex roots exist (counted with
multiplicity), but on the other hand Galois theory guarantees for $N \ge 5$ that
there is no general formula expressing these roots in terms of the coefficients
using only arithmetic operations and radicals, so finding them is a task for numerical
analysis. Although specialized techniques for this problem do exist (one such is
the Jenkins-Traub method), a method which works perfectly well in practice
and requires only standard tools is to find a matrix whose characteristic
polynomial is $P(x)$ and compute the eigenvalues of this matrix using standard
methods of numerical linear algebra.
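The companion matrix
Such a matrix is called the companion matrix, and for a monic (see footnote 5)
polynomial $P(x)$ of the form
\[ P(x) = x^n + C_{n-1} x^{n-1} + C_{n-2} x^{n-2} + \cdots + C_1 x + C_0 \]
the companion matrix takes the form
\[ \mathbf{C}_P = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 & -C_0 \\
1 & 0 & 0 & \cdots & 0 & -C_1 \\
0 & 1 & 0 & \cdots & 0 & -C_2 \\
0 & 0 & 1 & \cdots & 0 & -C_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & -C_{n-1}
\end{pmatrix} \]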
Given the coefficients of $P$, it is a simple task to form the matrix $\mathbf{C}_P$ and
compute its eigenvalues numerically. You can find an example of this calculation
in the PlotNewtonConvergence.jl code mentioned in the previous section.
5 A monic polynomial is one for which the coefficient of the highest-degree monomial is 1. If
your polynomial is not monic (suppose the coefficient of its highest-order monomial is $A \ne 1$),
just consider the polynomial obtained by dividing all coefficients by $A$. This polynomial is
monic and has the same roots as your original polynomial.
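As a minimal sketch of this recipe (an illustration written for this section, not the code in PlotNewtonConvergence.jl), the hypothetical function companion_roots below builds the companion matrix from a coefficient vector sorted in decreasing order and hands it to the standard eigenvalue routine eigvals from Julia's LinearAlgebra library.

```julia
using LinearAlgebra  # for eigvals

# Hypothetical helper: roots of a polynomial given as a vector of coefficients
# sorted in decreasing order, e.g. [1, 0, 0, -1] for z^3 - 1.
function companion_roots(coeffs)
    c = coeffs ./ coeffs[1]         # make the polynomial monic (footnote 5)
    n = length(c) - 1               # degree of the polynomial
    C = zeros(eltype(c), n, n)
    for k in 2:n
        C[k, k-1] = 1               # ones on the subdiagonal
    end
    for k in 1:n
        C[k, n] = -c[n + 2 - k]     # last column: -C_0, -C_1, ..., -C_{n-1}
    end
    return eigvals(C)
end

roots_cubic = companion_roots([1, 0, 0, -1])   # roots of z^3 - 1
roots_quad  = companion_roots([1, 0, 1])       # roots of x^2 + 1
println(roots_cubic)
println(roots_quad)
```

For $x^2 + 1$ this returns the complex pair $\pm i$ that the real-valued Newton iteration could never reach.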
A glimpse at numerical optimization
A problem which bears a superficial similarity to that of root-finding, but which
in many ways is quite distinct, is the problem of optimization: given
some complicated nonlinear function $f(x)$, we ask to
    find $x$ such that $f(x)$ has an extremum at $x$
where the extremum may be a global or local maximum or minimum. This
problem also has an obvious generalization to scalar-valued functions of
vector-valued variables, i.e.
    find $\mathbf{x}$ such that $f(\mathbf{x})$ has an extremum at $\mathbf{x}$.
Numerical optimization is a huge field into which we can't delve very deeply in
18.330; what follows is only the most cursory of overviews, although the point at
the end regarding the accuracy of root-finding vs. optimization is an important
one.
6.1 Derivative-free optimization of 1D functions
Golden-section search. The golden-section search algorithm, perhaps the
simplest derivative-free optimization method for 1D functions, is close in spirit
to the bisection method of root finding. Recall that the bisection method for
finding a root of a function $f(x)$ begins by finding an initial interval $[a_0, b_0]$
within which the root is known to lie; the method then proceeds to generate a
sequence of pairs, i.e.
\[ [a_0, b_0],\ [a_1, b_1],\ \ldots,\ [a_n, b_n],\ \ldots \]
with the property that the root is always known to be contained within the
interval in question, i.e. with the property
\[ \operatorname{sign} f(a_n) \ne \operatorname{sign} f(b_n) \]
preserved for all $n$.
Golden-section search does something similar, but instead of generating a
sequence of pairs $[a_n, b_n]$ it produces a sequence of triples $[a_n, b_n, c_n]$, i.e.
\[ [a_0, b_0, c_0],\ [a_1, b_1, c_1],\ \ldots,\ [a_n, b_n, c_n],\ \ldots \]
with the properties that $a_n < b_n < c_n$ and that each triple is guaranteed to bracket
the minimum, in the sense that $f(b_n)$ is always lower than either of $f(a_n)$ or
$f(c_n)$, i.e. the property
\[ f(a_n) > f(b_n) \quad \text{and} \quad f(b_n) < f(c_n) \qquad (23) \]
is preserved for all $n$.

To start the golden-section search algorithm, we need to identify an initial
triple $[a_0, b_0, c_0]$ satisfying property (23). Then, we iterate the following
algorithm that inputs a bracketing triple $[a_n, b_n, c_n]$ and outputs a new, smaller,
bracketing triple $[a_{n+1}, b_{n+1}, c_{n+1}]$:
1. Choose (see footnote 6) a new point $x$ that lies a fraction $\gamma$ of the way into
the larger of the intervals $[a_n, b_n]$ and $[b_n, c_n]$.
2. Evaluate $f$ at $x$.
3. If $f(x) < f(b_n)$, then $x$ becomes the new middle point and $b_n$ becomes an
endpoint, so our new bracketing triple is
\[ [a_{n+1}, b_{n+1}, c_{n+1}] = \begin{cases} [b_n, x, c_n], & \text{if } x > b_n \\ [a_n, x, b_n], & \text{if } x < b_n. \end{cases} \]
4. Otherwise, $b_n$ remains the middle point and $x$ becomes an endpoint, so our
new bracketing triple is
\[ [a_{n+1}, b_{n+1}, c_{n+1}] = \begin{cases} [a_n, b_n, x], & \text{if } x > b_n \\ [x, b_n, c_n], & \text{if } x < b_n. \end{cases} \]
Do you see how this works? The decision-making process in steps (3-4) guarantees
the preservation of property (23), while meanwhile the shrinking of the
intervals in Step 1 guarantees that our bracket converges inexorably to a smaller
and smaller interval within which the minimum could be hiding. Distinguishing the
cases $x > b_n$ and $x < b_n$ just ensures that we always have $a_n < b_n < c_n$.
How do we choose the optimal shrinking fraction $\gamma$? One elegant approach
is to choose $\gamma$ to ensure that the ratio of the lengths of the two subintervals
$[a_n, b_n]$ and $[b_n, c_n]$ remains constant even as the overall width of the bracketing
interval shrinks toward zero. With a little effort you can show that this property
is ensured by taking
\[ \gamma = \frac{3 - \sqrt{5}}{2} = 0.381966011250105, \]
which equals $2 - \varphi$, where $\varphi = (1 + \sqrt{5})/2$ is the golden ratio. A
$\gamma$-fraction of an interval is known as the golden section of that interval,
which explains the name of the algorithm.
6 A more specific description of this step is that we set
\[ x = \begin{cases} b_n + \gamma (c_n - b_n), & \text{if } (c_n - b_n) > (b_n - a_n) \\ b_n + \gamma (a_n - b_n), & \text{if } (c_n - b_n) < (b_n - a_n). \end{cases} \]
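The four steps translate almost line for line into code. The following is a minimal sketch under stated assumptions: the function name golden_section, the keyword tol, and the stopping rule (halt once the bracket is narrower than tol) are illustrative choices, and the caller is assumed to supply an initial triple satisfying (23).

```julia
# A minimal sketch of golden-section search for a bracketing triple a < b < c
# with f(a) > f(b) and f(b) < f(c). Names and stopping rule are illustrative.
function golden_section(f, a, b, c; tol=1.0e-8)
    gamma_frac = (3 - sqrt(5)) / 2      # shrinking fraction, about 0.381966
    fb = f(b)
    while c - a > tol
        # Step 1: move a fraction gamma_frac into the larger subinterval
        x = (c - b > b - a) ? b + gamma_frac*(c - b) : b + gamma_frac*(a - b)
        fx = f(x)                       # Step 2
        if fx < fb                      # Step 3: x becomes the new middle point
            if x > b
                a = b
            else
                c = b
            end
            b, fb = x, fx
        else                            # Step 4: b stays in the middle
            if x > b
                c = x
            else
                a = x
            end
        end
    end
    return b
end

# Usage: minimize (x - 2)^2 starting from the bracketing triple [0, 1, 5]
xmin = golden_section(x -> (x - 2)^2, 0.0, 1.0, 5.0)
println(xmin)
```

Only one new function evaluation is needed per iteration, because the value at the middle point is carried along from the previous step.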
6.2 Roots can be found more accurately than extrema
An important distinction between numerical root-finding and derivative-free
numerical optimization is that the former can generally be done much more
accurately. Indeed, if a function $f(x)$ has a root at a point $x_0$, then in many
cases we will be able to approximate $x_0$ to roughly machine precision, that
is, to 15-decimal-digit accuracy on a typical modern computer. In contrast, if
$f(x)$ has an extremum at $x_0$, then in general we will only be able to pin down
the value of $x_0$ to something like the square root of machine precision, that
is, to just 8-digit accuracy! This is a huge loss of precision compared to the
root-finding case.
To understand the reason for this, suppose $f$ has a minimum at $x_0$, and let
the value of this minimum be $f_0 \equiv f(x_0)$. Then, in the vicinity of $x_0$, $f$ has a
Taylor-series expansion of the form
\[ f(x) = f_0 + \frac{1}{2} (x - x_0)^2 f''(x_0) + O\big((x - x_0)^3\big) \qquad (24) \]
where the important point is that the linear term is absent because the derivative
of $f$ vanishes at $x_0$.
Now suppose we try to evaluate $f$ at floating-point numbers lying very close
to, but not exactly equal to, the nearest floating-point representation of $x_0$.
(Actually, for the purposes of this discussion, let's assume that $x_0$ is exactly
floating-point representable, and moreover that the magnitudes of $x_0$, $f_0$, and
$f''(x_0)$ are all on the order of 1. The discussion could easily be extended to
relax these assumptions at the expense of some cluttering of the ideas.) In
64-bit floating-point arithmetic, where we have approximately 15-decimal-digit
registers, the floating-point numbers that lie closest to $x_0$ without being equal
to $x_0$ are something like (see footnote 7) $x_{\text{nearest}} \approx x_0 \pm 10^{-15}$. We then find
\[ f(x_{\text{nearest}}) = \underbrace{f_0}_{\sim 1.0} + \underbrace{\tfrac{1}{2} (x_{\text{nearest}} - x_0)^2 f''(x_0)}_{\sim 1.0\text{e-}30} + \underbrace{O\big((x_{\text{nearest}} - x_0)^3\big)}_{\sim 1.0\text{e-}45}. \]
Since $x_{\text{nearest}}$ deviates from $x_0$ by something like $10^{-15}$, we find that $f(x_{\text{nearest}})$
deviates from $f(x_0)$ by something like $10^{-30}$, i.e. the digits begin to disagree in
the 30th decimal place. But our floating-point registers can only store 15 decimal
digits, so the difference between $f(x_0)$ and $f(x_{\text{nearest}})$ is completely lost; the two
function values are utterly indistinguishable to our computer.
7 This is where the assumption that $|x_0| \sim 1$ comes in; the more general statement would be
that the nearest floating-point numbers not equal to $x_0$ would be something like $x_0 \pm 10^{-15} |x_0|$.
Moreover, as we consider points $x$ lying further and further away from $x_0$,
we find that $f(x)$ remains floating-point indistinguishable from $f(x_0)$ over a
wide interval near $x_0$. Indeed, the condition that $f(x)$ be floating-point distinct
from $f(x_0)$ requires that $(x - x_0)^2$ fit into a floating-point register that is also
storing $f_0 \sim 1$. This means that we need (see footnote 8)
\[ (x - x_0)^2 \gtrsim \epsilon_{\text{machine}} \qquad (25) \]
or
\[ |x - x_0| \gtrsim \sqrt{\epsilon_{\text{machine}}} \qquad (26) \]
This explains why, in general, we can only pin down minima to within the
square root of machine precision, i.e. to roughly 8 decimal digits on a modern
computer.
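Conditions (25) and (26) can be checked directly in 64-bit arithmetic. The sketch below uses the model function $f(x) = f_0 + (x - x_0)^2$ with $x_0 = f_0 = 1$; the name fmin_model and the two test offsets are illustrative choices.

```julia
# f(x) = f0 + (x - x0)^2 with x0 = f0 = 1, as in equation (24)
fmin_model(x) = 1.0 + (x - 1.0)^2

sqrt_eps = sqrt(eps(Float64))     # square root of machine precision, about 1.5e-8

# Offset well below sqrt(eps): f is indistinguishable from its minimum value
same_below = fmin_model(1.0 + 1.0e-9) == fmin_model(1.0)
# Offset well above sqrt(eps): the difference becomes visible
same_above = fmin_model(1.0 + 1.0e-7) == fmin_model(1.0)

println(sqrt_eps)
println(same_below)   # true
println(same_above)   # false
```

An offset of $10^{-9}$ contributes only $10^{-18}$ to the function value, which is rounded away when added to $f_0 = 1$; an offset of $10^{-7}$ contributes $10^{-14}$, which survives.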
On the other hand, suppose the function $g(x)$ has a root at $x_0$. In the vicinity
of $x_0$ we have the Taylor expansion
\[ g(x) = (x - x_0) g'(x_0) + \frac{1}{2} (x - x_0)^2 g''(x_0) + \cdots \qquad (27) \]
which differs from (24) by the presence of a linear term. Now there is generally
no problem distinguishing $g(x_0)$ from $g(x_{\text{nearest}})$ or $g$ at other floating-point
numbers lying within a few machine epsilons of $x_0$, and hence in general we will
be able to pin down the value of $x_0$ to close to machine precision. (Note that
this assumes that $g$ has only a single root at $x_0$; if $g$ has a double root there,
i.e. $g'(x_0) = 0$, then this analysis falls apart. Compare this to the observation
we made earlier that the convergence of Newton's method is worse for double
roots than for single roots.)
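The same kind of check for a root shows the contrast. This sketch uses $g(x) = x - x_0$ with $x_0 = 1$, that is, equation (27) with $g'(x_0) = 1$; the name groot_model is an illustrative choice.

```julia
# g(x) = x - x0 with x0 = 1, as in equation (27) with g'(x0) = 1
groot_model(x) = x - 1.0

x_next   = nextfloat(1.0)         # closest floating-point number above x0
gap      = x_next - 1.0           # equals eps(Float64), about 2.2e-16
distinct = groot_model(x_next) != groot_model(1.0)

println(gap)
println(distinct)   # true
```

Because of the linear term, even the nearest floating-point neighbor of $x_0$ produces a function value that differs from $g(x_0) = 0$.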
Figure 2 illustrates these points. The upper panel in this figure plots,
for the function $f(x) = f_0 + (x - x_0)^2$ [corresponding to equation (24) with
$x_0 = f_0 = \frac{1}{2} f''(x_0) = 1$], the deviation of $f(x)$ from its value $f(x_0)$ versus the
deviation of $x$ from $x_0$ as computed in standard 64-bit floating-point arithmetic.
Notice that $f(x)$ remains indistinguishable from $f(x_0)$ until $x$ deviates from $x_0$
by at least $10^{-8}$; thus a computer minimization algorithm cannot hope to pin
down the location of $x_0$ to better than this accuracy.
In contrast, the lower panel of Figure 2 plots, for the function $g(x) =
(x - x_0)$ [corresponding to equation (27) with $x_0 = g'(x_0) = 1$], the deviation
of $g(x)$ from $g(x_0)$ versus the deviation of $x$ from $x_0$, again as computed in
standard 64-bit floating-point arithmetic. In this case our computer is easily
able to distinguish points $x$ that deviate from $x_0$ by as little as $2 \cdot 10^{-16}$. This
is why numerical root-finding can, in general, be performed with many orders
of magnitude better precision than minimization.
8 This is where the assumptions that $|f_0| \sim 1$ and $|f''(x_0)| \sim 1$ come in; the more general
statement would be that we need $(x - x_0)^2 |f''(x_0)| \gtrsim \epsilon_{\text{machine}} |f_0|$.
[Figure 2 plot, two panels: horizontal axes show the deviation of x from x0; vertical axes show the deviation of the function value from its value at x0.]
Figure 2: In standard 64-bit floating-point arithmetic, function extrema can
generally be pinned down only to roughly 8-digit accuracy (upper), while roots
can typically be identified with close to 15-digit accuracy (lower).
