Numerical Analysis
Second Edition
Walter Gautschi
Department of Computer Sciences
Purdue University
250 N. University Street
West Lafayette, IN 47907-2066
wgautschi@purdue.edu
ISBN 978-0-8176-8258-3
e-ISBN 978-0-8176-8259-0
DOI 10.1007/978-0-8176-8259-0
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941359
Mathematics Subject Classification (2010): 65-01, 65D05, 65D07, 65D10, 65D25, 65D30, 65D32,
65H04, 65H05, 65H10, 65L04, 65L05, 65L06, 65L10
© Springer Science+Business Media, LLC 1997, 2012
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
www.birkhauserscience.com
TO
ERIKA
In this second edition, the outline of chapters and sections has been preserved. The
subtitle "An Introduction", as suggested by several reviewers, has been deleted. The
content, however, is brought up to date, both in the text and in the notes. Many
passages in the text have been either corrected or improved. Some biographical
notes have been added as well as a few exercises and computer assignments. The
typographical appearance has also been improved by printing vectors and matrices
consistently in boldface types.
With regard to computer language in illustrations and exercises, we now adopt
uniformly Matlab. For readers not familiar with Matlab, there are a number of
introductory texts available, some, like Moler [2004], Otto and Denier [2005],
Stanoyevitch [2005] that combine Matlab with numerical computing, others, like
Knight [2000], Higham and Higham [2005], Hunt, Lipsman and Rosenberg [2006],
and Driscoll [2009], more exclusively focused on Matlab.
The major novelty, however, is a complete set of detailed solutions to all exercises
and machine assignments. The solution manual is available to instructors upon
request at the publisher's website http://www.birkhauserscience.com/9780817682583. Selected solutions are also included in the text to give students an idea of
what is expected. The bibliography has been expanded to reflect technical advances
in the field and to include references to new books and expository accounts. As a
result, the text has undergone an expansion in size of about 20%.
West Lafayette, Indiana
November 2011
Walter Gautschi
The book is designed for use in a graduate program in Numerical Analysis that
is structured so as to include a basic introductory course and subsequent more
specialized courses. The latter are envisaged to cover such topics as numerical
linear algebra, the numerical solution of ordinary and partial differential equations,
and perhaps additional topics related to complex analysis, to multidimensional
analysis, in particular optimization, and to functional analysis and related functional
equations. Viewed in this context, the first four chapters of our book could serve as
a text for the basic introductory course, and the remaining three chapters (which
indeed are at a distinctly higher level) could provide a text for an advanced course
on the numerical solution of ordinary differential equations. In a sense, therefore,
the book breaks with tradition in that it no longer attempts to deal with all
major topics of numerical mathematics. It is felt by the author that some of the
current subdisciplines, particularly those dealing with linear algebra and partial
differential equations, have developed into major fields of study that have attained
a degree of autonomy and identity that justifies their treatment in separate books
and separate courses on the graduate level. The term "Numerical Analysis" as
used in this book, therefore, is to be taken in the narrow sense of the numerical
analogue of Mathematical Analysis, comprising such topics as machine arithmetic,
the approximation of functions, approximate differentiation and integration, and the
approximate solution of nonlinear equations and of ordinary differential equations.
What is being covered, on the other hand, is done so with a view toward
stressing basic principles and maintaining simplicity and student-friendliness as far
as possible. In this sense, the book is "An Introduction". Topics that, even though
important and of current interest, require a level of technicality that transcends the
bounds of simplicity striven for, are referenced in detailed bibliographic notes at the
end of each chapter. It is hoped, in this way, to place the material treated in proper
context and to help, indeed encourage, the reader to pursue advanced modern topics
in more depth.
A significant feature of the book is the large collection of exercises that
are designed to help the student develop problem-solving skills and to provide
interesting extensions of topics treated in the text. Particular attention is given to
It is a pleasure to thank the publisher for showing interest in this book and
cooperating in producing it. The author is also grateful to Søren Jensen and Manil
Suri, who taught from this text, and to an anonymous reader; they all made many
helpful suggestions on improving the presentation. He is particularly indebted to
Prof. Jensen for substantially helping in preparing the exercises to Chap. 7. The
author further acknowledges assistance from Carl de Boor in preparing the notes to
Chap. 2 and to Werner C. Rheinboldt for helping with the notes to Chap. 4. Last but
not least, he owes a measure of gratitude to Connie Wilson for typing a preliminary
version of the text and to Adam Hammer for assisting the author with the more
intricate aspects of LaTeX.
West Lafayette, Indiana
January 1997
Walter Gautschi
Contents
Prologue
P1 Overview
P2 Numerical Analysis Software
P3 Textbooks and Monographs
P3.1 Selected Textbooks on Numerical Analysis
P3.2 Monographs and Books on Specialized Topics
P4 Journals
1 Machine Arithmetic and Related Matters
1.1 Real Numbers, Machine Numbers, and Rounding
1.1.1 Real Numbers
1.1.2 Machine Numbers
1.1.3 Rounding
1.2 Machine Arithmetic
1.2.1 A Model of Machine Arithmetic
1.2.2 Error Propagation in Arithmetic Operations: Cancellation Error
1.3 The Condition of a Problem
1.3.1 Condition Numbers
1.3.2 Examples
1.4 The Condition of an Algorithm
1.5 Computer Solution of a Problem; Overall Error
1.6 Notes to Chapter 1
Exercises and Machine Assignments to Chapter 1
Exercises
Machine Assignments
Selected Solutions to Exercises
Selected Solutions to Machine Assignments
4 Nonlinear Equations
4.1 Examples
4.1.1 A Transcendental Equation
4.1.2 A Two-Point Boundary Value Problem
4.1.3 A Nonlinear Integral Equation
4.1.4 s-Orthogonal Polynomials
4.2 Iteration, Convergence, and Efficiency
4.3 The Methods of Bisection and Sturm Sequences
4.3.1 Bisection Method
4.3.2 Method of Sturm Sequences
4.4 Method of False Position
4.5 Secant Method
4.6 Newton's Method
4.7 Fixed Point Iteration
4.8 Algebraic Equations
4.8.1 Newton's Method Applied to an Algebraic Equation
4.8.2 An Accelerated Newton Method for Equations with Real Roots
4.9 Systems of Nonlinear Equations
4.9.1 Contraction Mapping Principle
4.9.2 Newton's Method for Systems of Equations
4.10 Notes to Chapter 4
Exercises and Machine Assignments to Chapter 4
Exercises
Machine Assignments
Selected Solutions to Exercises
Selected Solutions to Machine Assignments
References
Index
Prologue
P1 Overview
Numerical Analysis is the branch of mathematics that provides tools and methods
for solving mathematical problems in numerical form. The objective is to develop
detailed computational procedures, capable of being implemented on electronic
computers, and to study their performance characteristics. Related fields are Scientific Computation, which explores the application of numerical techniques and
computer architectures to concrete problems arising in the sciences and engineering;
Complexity Theory, which analyzes the number of operations and the amount of
computer memory required to solve a problem; and Parallel Computation, which
is concerned with organizing computational procedures in a manner that allows
running various parts of the procedures simultaneously on different processors.
The problems dealt with in computational mathematics come from virtually
all branches of pure and applied mathematics. There are computational aspects
in number theory, combinatorics, abstract algebra, linear algebra, approximation
theory, geometry, statistics, optimization, complex analysis, nonlinear equations,
differential and other functional equations, and so on. It is clearly impossible
to deal with all these topics in a single text of reasonable size. Indeed, the
tendency today is to develop specialized texts dealing with one or the other
of these topics. In the present text we concentrate on subject matters that are
basic to problems in approximation theory, nonlinear equations, and differential
equations. Accordingly, we have chapters on machine arithmetic, approximation
and interpolation, numerical differentiation and integration, nonlinear equations,
one-step and multistep methods for ordinary differential equations, and boundary
value problems in ordinary differential equations. Important topics not covered
in this text are computational number theory, algebra, and geometry; constructive
methods in optimization and complex analysis; numerical linear algebra; and the
numerical solution of problems involving partial differential equations and integral
equations. Selected texts for these areas are enumerated in Sect. P3.
We now describe briefly the topics treated in this text. Chapter 1 deals with
the basic facts of life regarding machine computation. It recognizes that, although
present-day computers are extremely powerful in terms of computational speed,
reliability, and amount of memory available, they are less than ideal unless
supplemented by appropriate software when it comes to the precision available,
and accuracy attainable, in the execution of elementary arithmetic operations. This
raises serious questions as to how arithmetic errors, either present in the input
data of a problem or committed during the execution of a solution algorithm,
affect the accuracy of the desired results. Concepts and tools required to answer
such questions are put forward in this introductory chapter. In Chap. 2, the central
theme is the approximation of functions by simpler functions, typically polynomials
and piecewise polynomial functions. Approximation in the sense of least squares
provides an opportunity to introduce orthogonal polynomials, which are relevant
also in connection with problems of numerical integration treated in Chap. 3. A large
part of the chapter, however, deals with polynomial interpolation and associated
error estimates, which are basic to many numerical procedures for integrating
functions and differential equations. Also discussed briefly is inverse interpolation,
an idea useful in solving equations.
First applications of interpolation theory are given in Chap. 3, where the tasks
presented are the computation of derivatives and definite integrals. Although the
formulae developed for derivatives are subject to the detrimental effects of machine
arithmetic, they are useful, nevertheless, for purposes of discretizing differential
operators. The treatment of numerical integration includes routine procedures, such
as the trapezoidal and Simpson's rules, appropriate for well-behaved integrands, as
well as the more sophisticated procedures based on Gaussian quadrature to deal
with singularities. It is here where orthogonal polynomials reappear. The method of
undetermined coefficients is another technique for developing integration formulae.
It is applied to approximate general linear functionals, the Peano representation
of linear functionals providing an important tool for estimating the error. The
chapter ends with a discussion of extrapolation techniques; although applicable to
more general problems, they are inserted here since the composite trapezoidal rule
together with the Euler-Maclaurin formula provides the best-known application:
Romberg integration.
Chapter 4 deals with iterative methods for solving nonlinear equations and
systems thereof, the pièce de résistance being Newton's method. The emphasis here
lies in the study of, and the tools necessary to analyze, convergence. The special
case of algebraic equations is also briefly given attention.
Chapter 5 is the first of three chapters devoted to the numerical solution of
ordinary differential equations. It concerns itself with one-step methods for solving
initial value problems, such as the Runge-Kutta method, and gives a detailed
analysis of local and global errors. Also included is a brief introduction to stiff
equations and special methods to deal with them. Multistep methods and, in
particular, Dahlquist's theory of stability and its applications, are the subject of
Chap. 6. The final chapter (Chap. 7) is devoted to boundary value problems and their
solution by shooting methods, finite difference techniques, and variational methods.
Atkinson and Han [2009] An advanced text on theoretical (as opposed to computational) aspects of numerical analysis, making extensive use of functional
analysis.
Bruce, Giblin, and Rippon [1990] A collection of interesting mathematical problems, ranging from number theory and computer-aided design to differential
equations, that require the use of computers for their solution.
Cheney and Kincaid [1994] Although an undergraduate text, it covers a broad
area, has many examples from science and engineering as well as computer
programs; there are many exercises, including machine assignments.
Conte and de Boor [1980] A widely used text for upper-division undergraduate
students; written for a broad audience, with algorithmic concerns in the foreground; has Fortran subroutines for many algorithms discussed in the text.
Dahlquist and Björck [2003, 2008] The first (2003) text, a reprint of the 1974
classic, provides a comprehensive introduction to all major fields of numerical
analysis, striking a good balance between theoretical issues and more practical
ones. The second text expands substantially on the more elementary topics
treated in the first and represents the first volume of more to come.
Deuflhard and Hohmann [2003] An introductory text with emphasis on machine
computation and algorithms; includes discussions of three-term recurrence
relations and stochastic eigenvalue problems (not usually found in textbooks),
but no differential equations.
Fröberg [1985] A thorough and exceptionally lucid exposition of all major topics
of numerical analysis exclusive of algorithms and computer programs.
Hämmerlin and Hoffmann [1991] Similar to Stoer and Bulirsch [2002] in its
emphasis on mathematical theory; has more on approximation theory and
multivariate interpolation and integration, but nothing on differential equations.
Householder [2006] A reissue of one of the early mathematical texts on the
subject, with coverage limited to systems of linear and nonlinear equations and
topics in approximation.
Isaacson and Keller [1994] One of the older but still eminently readable texts,
stressing the mathematical analysis of numerical methods.
Kincaid and Cheney [1996] Related to Cheney and Kincaid [1994] but more
mathematically oriented and unusually rich in exercises and bibliographic items.
Kress [1998] A rather comprehensive text with a strong functional analysis
component.
Neumaier [2001] A text emphasizing robust computation, including interval
arithmetic.
Rutishauser [1990] An annotated translation from the German of an older text
based on posthumous notes by one of the pioneers of numerical analysis;
although the subject matter reflects the state of the art in the early 1970s, the
treatment is highly original and is supplemented by translator's notes to each
chapter pointing to more recent developments.
Schwarz [1989] A mathematically oriented treatment of all major areas of numerical analysis, including ordinary and partial differential equations.
Pomerance [1990] and Gautschi [1994a, Part II]. For algorithms in Combinatorics,
see the books by Nijenhuis and Wilf [1978], Hu and Shing [2002], and Cormen et
al. [2009]. Various aspects of Computer Algebra are treated in the books by Geddes
et al. [1992], Mignotte [1992], Davenport et al. [1993], Mishra [1993], Heck [2003],
and Cox et al. [2007].
Other relatively new disciplines are Computational Geometry and Geometric
Modeling, Computer-Aided Design, and Computational Topology, for which relevant texts are, respectively, Preparata and Shamos [1985], Edelsbrunner [1987],
Mäntylä [1988], Taylor [1992], McLeod and Baart [1998], Gallier [2000], Cohen et
al. [2001], and Salomon [2006]; Hoschek and Lasser [1993], Farin [1997], [1999],
and Prautsch et al. [2002]; Edelsbrunner [2006], and Edelsbrunner and Harer [2010].
Statistical Computing is covered in general textbooks such as Kennedy and Gentle
[1980], Anscombe [1981], Maindonald [1984], Thisted [1988], Monahan [2001],
Gentle [2009], and Lange [2010]. More specialized texts are Devroye [1986] and
Hörmann et al. [2004] on the generation of nonuniform random variables, Späth
[1992] on regression analysis, Heiberger [1989] on the design of experiments,
Stewart [1994] on Markov chains, Xiu [2010] on stochastic computing and uncertainty quantification, and Fang and Wang [1994], Manno [1999], Gentle [2003],
Liu [2008], Shonkwiler and Mendivil [2009], and Lemieux [2009] on Monte Carlo
and number-theoretic methods. Numerical techniques in Optimization (including
optimal control problems) are discussed in Evtushenko [1985]. An introductory
book on unconstrained optimization is Wolfe [1978]; among more advanced and
broader texts on optimization techniques we mention Gill et al. [1981], Ciarlet
[1989], and Fletcher [2001]. Linear programming is treated in Nazareth [1987] and
Panik [1996], linear and quadratic problems in Sima [1996], and the application of
conjugate direction methods to problems in optimization in Hestenes [1980]. The
most comprehensive text on (numerical and applied) Complex Analysis is the three-volume treatise by Henrici [1988, 1991, 1986]. Numerical methods for conformal
mapping are also treated in Kythe [1998], Schinzinger and Laura [2003], and
Papamichael and Stylianopoulos [2010]. For approximation in the complex domain,
the standard text is Gaier [1987]; Stenger [1993] deals with approximation by
sinc functions, Stenger [2011] providing some 450 Matlab programs. The book by
Iserles and Nørsett [1991] contains interesting discussions on the interface between
complex rational approximation and the stability theory of discretized differential
equations. The impact of high-precision computation on problems and conjectures
involving complex approximation is beautifully illustrated in the set of lectures by
Varga [1990].
For an in-depth treatment of many of the preceding topics, also see the four-volume work of Knuth [1975, 1981, 1973, 2005-2006].
Perhaps the most significant topic omitted in our book is numerical linear algebra
and its application to solving partial differential equations by finite difference or
finite element methods. Fortunately, there are many treatises available that address
these areas. For Numerical Linear Algebra, we refer to the classic work of Wilkinson
[1988] and the book by Golub and Van Loan [1996]. Links and applications
of matrix computation to orthogonal polynomials and quadrature are the subject
of Golub and Meurant [2010]. Other general texts are Jennings and McKeown
[1992], Watkins [2002], [2007], Demmel [1997], Trefethen and Bau [1997], Stewart
[1973], [1998], Meurant [1999], White [2007], Allaire and Kaber [2008], and
Datta [2010]; Higham [2002], [2008] has a comprehensive treatment of error and
stability analyses and the first, equally extensive, treatment of the numerics of matrix
functions. Solving linear systems on vector and shared memory parallel computers
and the use of linear algebra packages on high-performance computers are discussed
in Dongarra et al. [1991], [1998]. The solution of sparse linear systems and the
special data structures and pivoting strategies required in direct methods are treated
in Østerby and Zlatev [1983], Duff et al. [1989], Zlatev [1991], and Davis [2006],
whereas iterative techniques are discussed in the classic texts by Young [2003]
and Varga [2000], and in Il'in [1992], Hackbusch [1994], Weiss [1996], Fischer
[1996], Brezinski [1997], Greenbaum [1997], Saad [2003], Broyden and Vespucci
[2004], Hageman and Young [2004], Meurant [2006], Chan and Jin [2007], Byrne
[2008], and Woźnicki [2009]. The books by Branham [1990] and Björck [1996]
are devoted especially to least squares problems. For eigenvalues, see Chatelin
[1983], [1993], and for a good introduction to the numerical analysis of symmetric
eigenvalue problems, see Parlett [1998]. The currently very active investigation of
large sparse symmetric and nonsymmetric eigenvalue problems and their solution
by Lanczos-type methods has given rise to many books, for example, Cullum and
Willoughby [1985], [2002], Meyer [1987], Sehmi [1989], and Saad [1992]. For
structured and symplectic eigenvalue problems, see Fassbender [2000] and Kressner
[2005], and for inverse eigenvalue problems, Xu [1998] and Chu and Golub [2005].
For readers wishing to test their algorithms on specific matrices, the collection of
test matrices in Gregory and Karney [1978] and the matrix market on the Web
(http://math.nist.gov./MatrixMarket) are useful sources.
Even more extensive is the textbook literature on the numerical solution of Partial Differential Equations. The field has grown so much that there are currently only
a few books that attempt to cover the subject more or less as a whole. Among these
are Birkhoff and Lynch [1984] (for elliptic problems), Hall and Porsching [1990],
Ames [1992], Celia and Gray [1992], Larsson and Thomée [2003], Quarteroni and
Valli [1994], Morton and Mayers [2005], Sewell [2005], Quarteroni [2009], and
Tveito and Winter [2009]. Variational and finite element methods seem to have
attracted the most attention. An early and still frequently cited reference is the
book by Ciarlet [2002] (a reprint of the 1978 original); among more recent texts
we mention Beltzer [1990] (using symbolic computation), Křížek and Neittaanmäki
[1990], Brezzi and Fortin [1991], Schwab [1998], Kwon and Bang [2000] (using
Matlab), Zienkiewicz and Taylor [2000], Axelsson and Barker [2001], Babuška
and Strouboulis [2001], Höllig [2003], Monk [2003] (for Maxwell's equations),
Ern and Guermond [2004], Kythe and Wei [2004], Reddy [2004], Chen [2005],
Elman et al. [2005], Thomée [2006] (for parabolic equations), Braess [2007],
Demkowicz [2007], Brenner and Scott [2008], Bochev and Gunzburger [2009],
Efendiev and Hou [2009], and Johnson [2009]. Finite difference methods are treated
in Ashyralyev and Sobolevski [1994], Gustafsson et al. [1995], Thomas [1995],
[1999], Samarskii [2001], Strikwerda [2004], LeVeque [2007], and Gustafsson
[2008]; the method of lines in Schiesser [1991]; and the more refined techniques
of multigrids and domain decomposition in McCormick [1989], [1992], Bramble
[1993], Shadurov [1995], Smith et al. [1996], Quarteroni and Valli [1999], Briggs
et al. [2000], Toselli and Widlund [2005], and Mathew [2008]. Problems in potential
theory and elasticity are often approached via boundary element methods, for which
representative texts are Brebbia and Dominguez [1992], Chen and Zhou [1992],
Hall [1994], and Steinbach [2008]. A discussion of conservation laws is given in the
classic monograph by Lax [1973] and more recently in LeVeque [1992], Godlewski
and Raviart [1996], Kröner [1997], and LeVeque [2002]. Spectral methods, i.e.,
expansions in (typically) orthogonal polynomials, applied to a variety of problems,
were pioneered in the monograph by Gottlieb and Orszag [1977] and have received
extensive treatments in more recent texts by Canuto et al. [1988], [2006], [2007],
Fornberg [1996], Guo [1998], Trefethen [2000] (in Matlab), Boyd [2001], Peyret
[2002], Hesthaven et al. [2007], and Kopriva [2009].
Early, but still relevant, texts on the numerical solution of Integral Equations are
Atkinson [1976] and Baker [1977]. More recent treatises are Atkinson [1997] and
Kythe and Puri [2002]. Volterra integral equations are dealt with by Brunner and van
der Houwen [1986] and Brunner [2004], whereas singular integral equations are the
subject of Prössdorf and Silbermann [1991].
P4 Journals
Here we list the major journals (in alphabetical order) covering the areas of
numerical analysis and mathematical software.
ACM Transactions on Mathematical Software
Applied Numerical Mathematics
BIT Numerical Mathematics
Calcolo
Chinese Journal of Numerical Mathematics and Applications
Computational Mathematics and Mathematical Physics
Computing
IMA Journal on Numerical Analysis
Journal of Computational and Applied Mathematics
Mathematical Modelling and Numerical Analysis
Mathematics of Computation
Numerical Algorithms
Numerische Mathematik
SIAM Journal on Numerical Analysis
SIAM Journal on Scientific Computing
Contents
1 Boundary value problems
1.1 Reconstructing trajectories of particles moving in force fields
1.2 Deflection of a loaded beam
1.1 Reconstructing trajectories of particles moving in force fields
For example, suppose we are biologists observing under a microscope the motion
of a bioparticle moving in a time-dependent force field $\mathbf{F}(t) = F(t)\,\hat{\mathbf{x}}$. (For
simplicity, we will consider here the case of 1D motion, although it is easy to
extend the discussion to higher dimensions.) For instance, if the bioparticle has
charge $q$ and the $x$-component of the electric field is $E_x(t)$, then the force is
$F(t) = qE_x(t)$.
Suppose we observe that the position of the particle at time $t_1$ is $x_1$, while
at some later time $t_2$ it is at position $x_2$. (Note that we do not observe the
velocity of the particle.) We would like to reconstruct the trajectory that the
particle followed between $t_1$ and $t_2$. We then have a boundary-value problem of
the form
$$\frac{d^2x}{dt^2} = \frac{1}{m}F(t), \qquad x(t_1) = x_1, \qquad x(t_2) = x_2, \tag{1}$$
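where $m$ is the mass of the bioparticle. To phrase this equation in the language
of first-order ODE systems, we define $u_1 = x$, $u_2 = \dot{x}$ and obtain the ODE
system
$$\frac{d\mathbf{u}}{dt} = \frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = \begin{pmatrix} u_2 \\ F(t)/m \end{pmatrix} \tag{2}$$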
subject to the boundary conditions
x1
u(t1 ) =
,
?
u(t2 ) =
x2
?
.
(3)
The point is that we don't know the velocity of the particle at either endpoint,
which means we don't have an initial-value problem. This has at least two
immediate implications:
(a) The nice existence and uniqueness theorems for initial-value problems go
completely out the window; for a boundary-value problem like (2) there
may be no solution, or there may be multiple solutions, and these things
may be true even for perfectly nice f functions.
(b) Even assuming there is a solution curve u(t), we can't use the ODE algorithms
we discussed previously to find points on it, because all of those
algorithms required that we start with a known point on the curve. In
this case we don't know all the coordinates of even a single point on the
curve, so none of our ODE integrators can get started.
1.2 Deflection of a loaded beam
$$h(x_1) = 0, \qquad h'(x_1) = 0, \qquad h(x_2) = 0, \qquad h'(x_2) = 0.$$
u1
u2
du
d
u3
u2 =
=
(5)
u3
u4
dt
dt
u4
q(u1 )/
subject to the boundary conditions
$$\mathbf{u}(x_1) = \begin{pmatrix} 0 \\ 0 \\ ? \\ ? \end{pmatrix}, \qquad \mathbf{u}(x_2) = \begin{pmatrix} 0 \\ 0 \\ ? \\ ? \end{pmatrix}. \tag{6}$$
As before, we can't simply use an ODE integrator to solve this equation because
we don't have any full point on the solution curve from which to start
integrating.
1 For example, if the beam in question were a bookshelf, and there were heavier books near
the center of the shelf and lighter books near its edges, then the function q(x) would be peaked
near the center of the interval.
We noted above that our standard bag of ODE tricks for integrating initial-value
problems (such as Euler's method or RK4) can't get started on a boundary-value
problem like (2) or (5), because in order to use e.g. Euler's method we need to
know a point on the solution curve. In a problem like (2) we only know half
of a point on a solution curve: at $t_1$ we know the $u_1$ coordinate of the point,
but not the $u_2$ coordinate.
There is, however, a way to remedy this difficulty. Starting at $t = t_1$, we
guess a number for the $u_2$ coordinate. In the case of (2), this corresponds to
guessing an initial velocity for the particle. Denote our guess by $u_2^{\text{guess}}$. We now
have the coordinates of one full point on a curve at time $t_1$, and we call this
point $\mathbf{u}^{\text{guess}}$:
$$\mathbf{u}^{\text{guess}} = \begin{pmatrix} u_1 \\ u_2^{\text{guess}} \end{pmatrix}$$
The existence and uniqueness theorems now guarantee that there exists a full
curve $\mathbf{u}^{\text{guess}}(t)$ satisfying the differential equation and the condition
$\mathbf{u}^{\text{guess}}(t_1) = \mathbf{u}^{\text{guess}}$. So we can now use any ODE algorithm we like to integrate our equation
to compute more points on this curve. In particular, we can integrate all the
way from $t_1$ to $t_2$ and evaluate the position $u_1^{\text{guess}}(t_2)$. If this value equals $x_2$,
we're done! We have found our desired solution curve. If not, we have to go
back and try a new value for $u_2^{\text{guess}}$.
This method is known as the shooting method, for obvious reasons: integrating
from $t_1$ to $t_2$ with initial position and velocity $u_1$, $u_2^{\text{guess}}$ corresponds to
"shooting" the particle from that position with that velocity, and if we guess
the initial velocity just right then the particle will just pass through position $x_2$
at time $t_2$.
The difficulty is that we now have to solve a root-finding problem to compute
$u_2^{\text{guess}}$. Indeed, for each choice of $u_2^{\text{guess}}$ at time $t_1$ we can integrate the resulting
initial-value problem and compute the value it predicts for the coordinate $u_1$ at
time $t_2$. Denote this value by $u_1^{\text{integrated}}(u_2^{\text{guess}}; t_2)$. Choosing the correct value
of $u_2^{\text{guess}}$ then corresponds to finding a root of the nonlinear equation
$$u_1^{\text{integrated}}(u_2^{\text{guess}}; t_2) - u_1^{\text{desired}}(t_2) = 0. \tag{7}$$
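The shooting procedure can be sketched in a few lines. The following is a minimal illustration rather than the notes' own implementation: it assumes RK4 as the integrator and a secant iteration as the root finder for (7), and the names `rk4_final` and `shoot`, the step count, and the two starting velocity guesses are all hypothetical choices.

```python
def rk4_final(force, m, t1, t2, x1, v1, steps=200):
    """Integrate x'' = F(t)/m from t1 to t2 with RK4 and return x(t2).

    The state is u = (x, v) with du/dt = (v, F(t)/m), as in system (2).
    """
    h = (t2 - t1) / steps
    t, x, v = t1, x1, v1
    for _ in range(steps):
        k1x, k1v = v, force(t) / m
        k2x, k2v = v + 0.5 * h * k1v, force(t + 0.5 * h) / m
        k3x, k3v = v + 0.5 * h * k2v, force(t + 0.5 * h) / m
        k4x, k4v = v + h * k3v, force(t + h) / m
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        t += h
    return x


def shoot(force, m, t1, t2, x1, x2, v_a=-10.0, v_b=10.0, tol=1e-12, max_iter=50):
    """Find the initial velocity whose trajectory hits x2 at time t2.

    Secant iteration on residual(v) = x_integrated(t2; v) - x2, i.e. equation (7).
    v_a and v_b are two arbitrary starting guesses (an assumption of this sketch).
    """
    def residual(v):
        return rk4_final(force, m, t1, t2, x1, v) - x2

    r_a, r_b = residual(v_a), residual(v_b)
    for _ in range(max_iter):
        if abs(r_b) < tol or r_b == r_a:
            break
        # Secant update: new guess from the line through the last two residuals.
        v_a, v_b, r_a = v_b, v_b - r_b * (v_b - v_a) / (r_b - r_a), r_b
        r_b = residual(v_b)
    return v_b


# Example: F(t) = 6t, m = 1, x(0) = 0, x(1) = 2 has exact solution x = t^3 + t,
# so the initial velocity we are looking for is 1.
v0 = shoot(lambda t: 6.0 * t, 1.0, 0.0, 1.0, 0.0, 2.0)
```

Because equation (1) is linear in x, the residual is a linear function of the guessed velocity and the secant iteration lands on the root after a single update; for a nonlinear force law more iterations (or a bracketing method) would be needed.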
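$$\mathbf{f} = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix}, \qquad \mathbf{f}'' = \begin{pmatrix} f''_1 \\ f''_2 \\ \vdots \\ f''_N \end{pmatrix},$$
where
$$f_n \equiv f(a + nh), \qquad f''_n \equiv f''(a + nh), \qquad n = 1, \dots, N, \qquad h = \frac{b-a}{N+1},$$
$$\mathbf{A} = \frac{1}{h^2}\begin{pmatrix}
-2 & 1 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
0 & 1 & -2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{pmatrix}. \tag{8}$$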
2 Of course this technique is not limited to the second derivative; we could alternatively
write down different matrices that, when applied to f , yield vectors of samples of its first
derivative, its fourth derivative, etc.
3 Equation (8) assumes that f (a) = f (b) = 0. Implementation of nontrivial boundary
conditions is discussed in our lecture notes on numerical differentiation.
(Equation (8) assumes that f satisfies the boundary conditions f (a) = f (b) = 0;
other boundary conditions may be represented by adding suitable terms to the
RHS.)
Of course, as soon as we write down equation (8) we can immediately proceed
to invert that equation to find a relation predicting values of $f$ from the values
of $f''$:
$$\mathbf{f} = \mathbf{A}^{-1}\mathbf{f}''. \tag{9}$$
The usefulness of this equation is that, in a boundary-value problem, we typically
have a relation expressing $f''$ in terms of some known function. For
example, in (1), the second derivative of the function we seek is related to the
(known) force field $F(t)$. Then all we have to do is replace $\mathbf{f}''$ in (9) with the
expression for the second derivative given by the differential equation in question,
and we can immediately solve for samples of the function $f(x)$.
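As a concrete illustration of solving with the second-difference matrix, here is a minimal sketch. Rather than forming the inverse explicitly, it solves the tridiagonal system with the Thomas algorithm, which is a substitution made for this example and not something the notes prescribe; the function name `solve_second_difference` and the test problem are hypothetical.

```python
import math


def solve_second_difference(rhs, a, b):
    """Solve A f = rhs for f, where A = (1/h^2) * tridiag(1, -2, 1).

    rhs holds samples of f'' at the N interior points a + n*h, n = 1..N,
    with h = (b - a)/(N + 1), and f(a) = f(b) = 0 is assumed.
    """
    n = len(rhs)
    h = (b - a) / (n + 1)
    off = 1.0 / h**2              # sub- and super-diagonal entries
    diag = [-2.0 / h**2] * n      # main diagonal (modified in place below)
    d = list(rhs)
    # Forward elimination of the sub-diagonal.
    for i in range(1, n):
        w = off / diag[i - 1]
        diag[i] -= w * off
        d[i] -= w * d[i - 1]
    # Back substitution.
    f = [0.0] * n
    f[-1] = d[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        f[i] = (d[i] - off * f[i + 1]) / diag[i]
    return f


# Example: f'' = -pi^2 sin(pi x) on [0, 1] with f(0) = f(1) = 0, exact f = sin(pi x).
N = 99
h = 1.0 / (N + 1)
xs = [n * h for n in range(1, N + 1)]
f_num = solve_second_difference([-math.pi**2 * math.sin(math.pi * x) for x in xs], 0.0, 1.0)
max_err = max(abs(fn - math.sin(math.pi * x)) for fn, x in zip(f_num, xs))
```

The cost of the tridiagonal solve grows linearly with N, whereas forming a dense inverse would not; the maximum error in this example shrinks roughly like h squared as N grows.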
3.1
In this section we'll work through a finite-difference method for solving the
one-dimensional beam equation
1
d4 f
= q(x)
dx4
(10)
(11)
d4
dx4
It is easy to verify that a finite-difference stencil with stepsize $h$ for the fourth
derivative of a function $f(x)$ at a point $x$ is
$$f^{(4)}_{\text{FD}}(h, x) = \frac{f(x-2h) - 4f(x-h) + 6f(x) - 4f(x+h) + f(x+2h)}{h^4}. \tag{12}$$
This stencil achieves second-order convergence, i.e. if $f^{(4)}(x)$ is the exact
fourth derivative of $f$ at $x$, then we have
$$f^{(4)}_{\text{FD}}(h, x) - f^{(4)}(x) = O(h^2).$$
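The second-order claim is easy to check numerically. The sketch below applies the five-point stencil with weights 1, -4, 6, -4, 1 (the same weights that appear in the rows of the beam matrix) to sin(x), whose fourth derivative is again sin(x); the function name and the test point x = 1 are arbitrary choices for illustration.

```python
import math


def fourth_derivative_fd(f, x, h):
    """Centered five-point finite-difference estimate of the fourth derivative of f at x."""
    return (f(x - 2 * h) - 4 * f(x - h) + 6 * f(x) - 4 * f(x + h) + f(x + 2 * h)) / h**4


# The fourth derivative of sin is sin, so the exact answer at x = 1 is sin(1).
errors = []
for h in (0.2, 0.1, 0.05):
    approx = fourth_derivative_fd(math.sin, 1.0, h)
    errors.append(abs(approx - math.sin(1.0)))

# Halving h should divide the error by roughly 2^2 = 4 if the error is O(h^2).
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
```

Pushing h much smaller than this is counterproductive in floating point, since the division by h to the fourth power amplifies rounding errors in the numerator.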
Implementation of boundary conditions
When we attempt to apply (12) at points within 1 or 2 sites of the ends of the
interval, we find that we need values for the quantities $f_{-1}$, $f_0$, $f_{N+1}$, $f_{N+2}$.
The values of $f_0$ and $f_{N+1}$ are fixed by the boundary conditions (11) to be
0. This leaves unspecified the values of $f_{-1}$ and $f_{N+2}$, but the condition that
$f' = 0$ at both endpoints winds up being equivalent to the requirement that
$f_{-1} = f_{N+2} = 0$. (Less trivial boundary conditions could be handled using the
method described in our lecture notes on numerical differentiation.)
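The matrix A
In view of the above considerations, the finite-difference matrix we want is
$$\mathbf{A} = \frac{1}{h^4}\begin{pmatrix}
6 & -4 & 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
-4 & 6 & -4 & 1 & 0 & 0 & \cdots & 0 & 0 \\
1 & -4 & 6 & -4 & 1 & 0 & \cdots & 0 & 0 \\
0 & 1 & -4 & 6 & -4 & 1 & \cdots & 0 & 0 \\
0 & 0 & 1 & -4 & 6 & -4 & \cdots & 0 & 0 \\
0 & 0 & 0 & 1 & -4 & 6 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 6 & -4 \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & -4 & 6
\end{pmatrix}.$$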
This matrix operates on a vector of samples of $f$ to yield a vector of samples of
$f^{(4)}$:
$$\mathbf{A}\mathbf{f} = \mathbf{f}^{(4)} \tag{13}$$
where the $n$th elements of $\mathbf{f}$ and $\mathbf{f}^{(4)}$ are respectively
$$f_n = f(a + nh), \qquad f^{(4)}_n = f^{(4)}(a + nh), \qquad h = \frac{b-a}{N+1}. \tag{14}$$
On the other hand, the differential equation (10) allows us to compute values
of $f^{(4)}$ here in terms of the loading function $q(x)$, i.e. we can put
f (4) =
1
q
where the elements of the vector $\mathbf{q}$ are the values of the function $q(x)$ at the
sample points $x_n$. Then equation (13) reads
1
f = A1
q
(15)
We solve this equation numerically using the Julia code reproduced below. The
results, for a forcing function $q(x) = x^2$, are plotted in Figure 1.
[Figure 1: beam deflection and beam loading.]
Contents
1 The question
5 Chebyshev polynomials
The question
In these notes we will concern ourselves with the following basic question: Given
a function $f(x)$ on an interval $x \in [a, b]$,
1. How accurately can we characterize $f$ using only samples of its value at
N sample points $\{x_n\}$ in the interval $[a, b]$?
2. What is the optimal way to choose the N sample points $\{x_n\}$?
What does it mean to "characterize" a function $f(x)$ over an interval $[a, b]$?
There are at least three possible answers:
1. We may want to evaluate the integral $\int_a^b f(x)\,dx$. In this case, the problem
of characterizing $f$ from N function samples is the problem of designing
an N-point quadrature rule.
2. We may want to evaluate the derivative of $f$ at each of our sample points
using the information contained in the sample values. This is the problem
of constructing a differentiation stencil, and it arises when we try to solve
ODEs or PDEs: in that case we are trying to reconstruct $f(x)$ given knowledge
of its derivative, so generally upon constructing the differentiation
stencil we will want to invert it.
3. We may want to construct an interpolant $f^{\text{interp}}(x)$ that agrees with $f(x)$
at the sample points but smoothly interpolates between those points in
a way that mimics the original function $f(x)$ as closely as possible. For
example, $f(x)$ may be the result of an experimental measurement or the
result of a costly numerical calculation, and we might want to accelerate
calculation of $f(x)$ at arbitrary values of $x$ by precomputing $f(x_n)$ at just the
sample points $\{x_n\}$ and then interpolating to get values at intermediate
points $x$.
In a sense, the first half of our course was devoted to studying the answer
to this question furnished by classical numerical analysis, while the second half
has been focused on the modern answer. Let's begin by reviewing what the
classical approach had to offer.
Classical numerical analysis answers the question of how to choose the sample
points $\{x_n\}$ in the simplest possible way: We simply take the sample points to
be evenly spaced throughout the interval [a, b]:1
$$x_n = a + n\Delta, \qquad n = 0, 1, \dots, N, \qquad \Delta = \frac{b-a}{N}.$$
In this case,
• The quadrature rules one obtains are the usual Newton-Cotes quadrature
rules, which we studied in the first and second weeks of our course. These
work by fitting polynomials through the function samples and then integrating
those polynomials to approximate the integral of the function.
• The differentiation stencils one obtains are the usual finite-difference stencils,
which we studied in the third and fourth weeks of our course. These
may again be interpreted as a form of polynomial interpolation: we are
essentially constructing and differentiating a low-degree approximation to
the Taylor-series polynomial.
• The interpolant one constructs is the unique Nth-degree polynomial $P^{\text{interp}}(x)$
that agrees with the values of the underlying function $f(x)$ at the N + 1
sample points. Although we didn't get to this in the first unit of our
course, it turns out to be easy to write down a formula for this polynomial
in terms of the sample points $\{x_n\}$ and the values of $f$ at those points,
$\{f_n\} \equiv \{f(x_n)\}$. For example, for the cases N = 1, 2, 3 we have2
$$P_1^{\text{interp}}(x) = f_1\frac{(x-x_2)}{(x_1-x_2)} + f_2\frac{(x-x_1)}{(x_2-x_1)}$$
$$P_2^{\text{interp}}(x) = f_1\frac{(x-x_2)(x-x_3)}{(x_1-x_2)(x_1-x_3)} + f_2\frac{(x-x_1)(x-x_3)}{(x_2-x_1)(x_2-x_3)} + f_3\frac{(x-x_1)(x-x_2)}{(x_3-x_1)(x_3-x_2)}$$
$$P_3^{\text{interp}}(x) = f_1\frac{(x-x_2)(x-x_3)(x-x_4)}{(x_1-x_2)(x_1-x_3)(x_1-x_4)} + f_2\frac{(x-x_1)(x-x_3)(x-x_4)}{(x_2-x_1)(x_2-x_3)(x_2-x_4)}$$
$$\qquad\qquad + f_3\frac{(x-x_1)(x-x_2)(x-x_4)}{(x_3-x_1)(x_3-x_2)(x_3-x_4)} + f_4\frac{(x-x_1)(x-x_2)(x-x_3)}{(x_4-x_1)(x_4-x_2)(x_4-x_3)}$$
The formula of this type for general N is called the Lagrange interpolation
formula; it constructs an Nth-degree polynomial passing through N + 1
fixed data points $(x_n, f_n)$.
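The pattern in the formulas above generalizes directly to any number of points. A minimal sketch of the general Lagrange formula follows; the function name `lagrange_interpolate` and the sample data are hypothetical, and the nodes are assumed to be distinct.

```python
def lagrange_interpolate(xs, fs, x):
    """Evaluate at x the polynomial through the points (xs[j], fs[j]).

    Each term is fs[j] times the product over k != j of (x - xs[k]) / (xs[j] - xs[k]),
    which equals 1 at xs[j] and 0 at every other node.
    """
    total = 0.0
    for j, (xj, fj) in enumerate(zip(xs, fs)):
        term = fj
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total


# Example: four samples of a cubic determine it exactly.
cubic = lambda x: x**3 - 2 * x + 1
nodes = [0.0, 1.0, 2.0, 3.0]
values = [cubic(x) for x in nodes]
p_mid = lagrange_interpolate(nodes, values, 1.5)
```

This direct evaluation costs on the order of N squared operations per evaluation point, which is fine for small N; it is meant to mirror the formula, not to be the most efficient or most numerically robust way to interpolate.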
1 Technically
to $\Delta$, while the centered finite difference $f'(x) \approx \frac{f(x+\Delta) - f(x-\Delta)}{2\Delta}$ has error
proportional to $\Delta^2$.] Thus here again we find convergence algebraic in N,
not exponential in N.
Interpolation: Polynomial interpolation in evenly spaced sample points
is a notoriously badly behaved procedure due to the Runge phenomenon
(we will discuss it briefly in an appendix). The Runge phenomenon is so
severe that, in some cases, the polynomial interpolant through N evenly
spaced function samples doesn't just converge slowly as $N \to \infty$.
It doesn't converge at all!
To summarize the results of the classical approach,
Classical approach: To characterize a function over an interval using N function
samples, choose the sample points to be evenly spaced and construct polynomial
interpolants. The approach in general yields convergence algebraic in N for
integration and differentiation, but does not converge for interpolation
of some functions.
fen ein0 t
n=
The
Modern approach, periodic functions: To characterize a
periodic function over an interval using N function samples,
choose the sample points to be evenly spaced throughout the
interval and construct a trigonometric interpolant consisting of
a sum of N sinusoids. The approach in general yields convergence exponential in N for integration, differentiation, and interpolation.
3 Linear combinations of sinusoids like $\sum_n [a_n \sin n\omega_0 t + b_n \cos n\omega_0 t]$ are sometimes called
"trigonometric polynomials" since they are in fact polynomials in the variable $e^{i\omega_0 t}$, but I
personally find this terminology a little confusing.
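$$g(\theta) = \frac{\tilde{a}_0}{2} + \sum_{\nu=1}^{\infty} \tilde{a}_\nu \cos(\nu\theta) \tag{1}$$
with coefficients
$$\tilde{a}_\nu = \frac{2}{\pi}\int_0^{\pi} g(\theta)\cos(\nu\theta)\,d\theta. \tag{2}$$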
Figure 1: (a) A function $f(t)$ that we want to integrate over the interval $[-1, 1]$.
(b) The function $g(\theta) = f(\cos\theta)$. Note the following facts: (1) $g(\theta)$ is periodic
with period $2\pi$. (2) $g(\theta)$ is an even function of $\theta$. (3) Over the interval $0 \le \theta \le \pi$,
$g(\theta)$ traces out the behavior of $f(t)$ as $t$ varies from $1 \to -1$ [i.e. $g(\theta)$
traces out $f(t)$ backwards.] However, (4) $g(\theta)$ knows nothing about what $f(t)$
does outside the range $-1 < t < 1$, which can make it a little tricky to compare
the two plots. For example, $g(\theta)$ has local minima at $\theta = 0, \pi$ even though $f(t)$
does not have local minima at $t = 1, -1$.
The discrete Fourier transform of the set of samples $\{g_n\}$ yields a set of Fourier
coefficients $\{\tilde{g}_\nu\}$:
$$\{g_n\} \xrightarrow{\ \text{DFT}\ } \{\tilde{g}_\nu\}$$
From the $\{\tilde{g}_\nu\}$ coefficients we can reconstruct the original $\{g_n\}$ samples through
the magic of the inverse DFT:
$$\{\tilde{g}_\nu\} \xrightarrow{\ \text{IDFT}\ } \{g_n\}$$
N
X
ge ein .
(4)
=0
N
X
ge ei
(5)
=0
Note that $g^{\text{interp}}(\theta)$ is (in general) not the same function as the original $g(\theta)$;
the difference is that the sum in (6) is truncated at $\nu = N$, whereas the Fourier
series for the full function $g(\theta)$ will in general contain infinitely many terms.
The form of (5) may be simplified by noting that, because $g(\theta)$ is an even
function of $\theta$, its Fourier series includes only cosine terms:
$$g^{\text{interp}}(\theta) = \frac{\tilde{a}_0}{2} + \sum_{\nu=1}^{N/2} \tilde{a}_\nu \cos(\nu\theta) \tag{6}$$
where the $\tilde{a}_\nu$ coefficients are related to the $\tilde{g}_\nu$ coefficients computed by the DFT
according to
$$\tilde{a}_0 = 2\tilde{g}_0, \qquad \tilde{a}_\nu = (\tilde{g}_\nu + \tilde{g}_{-\nu}) = 2\tilde{g}_\nu.$$
[The last equality here follows from the fact that, for an even function $g(\theta)$, the
Fourier series coefficients for positive and negative $\nu$ are equal, $\tilde{g}_\nu = \tilde{g}_{-\nu}$.]
The procedure we have outlined above uses general DFT techniques for
computing the numbers $\tilde{a}_\nu$. In this particular case, because $g(\theta)$ is an even
function, it is possible to accelerate the calculation by a factor of 4 using the
discrete cosine transform, a specialized version of the discrete Fourier transform.
We won't elaborate on this detail here.
(7)
(8)
$$f^{\text{interp}}(x) = \frac{\tilde{a}_0}{2} + \sum_{\nu=1}^{N/2} \tilde{a}_\nu \cos(\nu \arccos x) \tag{9}$$
Equation (9) would appear at first blush to define a horribly ugly function of
x. It took the twisted5 genius of the Russian mathematician P. L. Chebyshev
to figure out that in fact equation (9) defines a polynomial function of x. To
understand how this could possibly be the case, we must now make a brief foray
into the world of the Chebyshev polynomials.
5 We
Chebyshev polynomials
Trigonometric definition
The definition of the Chebyshev polynomials is inspired by the observation, from
high-school trigonometry, that $\cos(n\theta)$ is a polynomial in $\cos\theta$ for any $n$. For
example,
$$\cos 2\theta = 2\cos^2\theta - 1$$
$$\cos 3\theta = 4\cos^3\theta - 3\cos\theta$$
$$\cos 4\theta = 8\cos^4\theta - 8\cos^2\theta + 1$$
The polynomials on the RHS of these equations define the Chebyshev polynomials
for n = 2, 3, 4. More generally, the nth Chebyshev polynomial $T_n(x)$ is
defined by the equation
$$\cos n\theta = T_n(\cos\theta)$$
and the first few Chebyshev polynomials are
$$T_0(x) = 1$$
$$T_1(x) = x$$
$$T_2(x) = 2x^2 - 1$$
$$T_3(x) = 4x^3 - 3x$$
$$T_4(x) = 8x^4 - 8x^2 + 1.$$
Figure 2 plots the first several Chebyshev polynomials. Notice the following
important fact: For all $n$ and all $x \in [-1, 1]$, we have $-1 \le T_n(x) \le 1$. This
boundedness property of the Chebyshev polynomials turns out to be quite useful
in practice.
On the other hand, the Chebyshev polynomials are not bounded between
$-1$ and $1$ for values of $x$ outside the interval $[-1, 1]$ (nor, being polynomials,
could they possibly be). Figure 3 shows what happens to $T_{15}(x)$ as soon as we
get even the slightest little bit outside the range $x \in [-1, 1]$: the polynomial
takes off to $\pm\infty$. In almost all situations involving Chebyshev polynomials we
will be interested in their behavior within the interval $[-1, 1]$.
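The trigonometric definition and the boundedness property can be checked numerically. The sketch below evaluates the polynomials with the standard three-term recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x), which is not derived in these notes (it follows from the identity cos((k+1)t) + cos((k-1)t) = 2 cos(t) cos(kt)); the function name `chebyshev_T` is hypothetical.

```python
import math


def chebyshev_T(n, x):
    """Evaluate the Chebyshev polynomial T_n(x) via the three-term recurrence.

    Starts from T_0 = 1 and T_1 = x and applies T_{k+1} = 2 x T_k - T_{k-1}.
    """
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr


# Check the defining identity cos(n*theta) = T_n(cos(theta)) for n = 15.
theta = 0.7
identity_gap = abs(chebyshev_T(15, math.cos(theta)) - math.cos(15 * theta))

# Inside [-1, 1] the values stay between -1 and 1; just outside, T_15 grows quickly.
inside_max = max(abs(chebyshev_T(15, -1 + 2 * k / 200)) for k in range(201))
outside_value = chebyshev_T(15, 1.05)
```

The recurrence works for any real x, so the same function can be used to see the rapid growth outside the interval that Figure 3 illustrates.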
[Figure 2: the Chebyshev polynomials T0(x), T1(x), T2(x), T3(x), T4(x), and T15(x).]
[Figure 3: T15(x) just outside the interval [−1, 1].]
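$$\langle f, g\rangle \equiv \int_{-1}^{1} \frac{f(x)\,g(x)}{\sqrt{1-x^2}}\,dx.$$
Orthogonality means that if we insert $T_n$ and $T_m$ in the inner product we
get zero unless n = m:
$$\langle T_n, T_m\rangle = \frac{\pi}{2}\,\delta_{nm}. \tag{10}$$
(For n = m = 0 the value is $\pi$ rather than $\pi/2$.)
Taken together, these two properties furnish a convenient way to represent arbitrary
functions as linear combinations of Chebyshev polynomials. The first
property tells us that, given any function $f(x)$, we can write $f(x)$ in the form
$$f(x) = \sum_{n=0}^{\infty} C_n T_n(x). \tag{11}$$
Taking the inner product of both sides of (11) with $T_m$ and using orthogonality, we find
$$\langle f, T_m\rangle = \frac{\pi}{2}\,C_m$$
[where the $\pi/2$ factor here comes from equation (10)]. In other words, the
Chebyshev expansion coefficients of a general function $f(x)$ are
$$C_m = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\,T_m(x)}{\sqrt{1-x^2}}\,dx. \tag{12}$$
Equations (11) and (12) amount to what we might refer to as the forward
and inverse discrete Chebyshev transforms of a function $f(x)$.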
6 An inner product on a vector space V is just a rule that assigns a real number to any pair
of elements in V. (Mathematicians would say it is a map $V \times V \to \mathbb{R}$.) The rule has to be
linear (the inner product of a linear combination is a linear combination of the inner products)
and non-degenerate, meaning no nonzero element has vanishing inner product with itself.
Chebyshev spectral methods furnish the second half of the modern solution to
the problem we posed at the beginning of these notes, namely, how best to
characterize a function using samples of its value at N points.
Recall that the first half of the modern solution went like this:
Modern approach, periodic functions: To characterize a
periodic function over an interval using N function samples,
choose the sample points to be evenly spaced throughout the
interval and construct a trigonometric interpolant consisting of
a sum of N sinusoids. The approach in general yields convergence exponential in N for integration, differentiation, and interpolation.
The second half of the modern solution now reads like this:
Modern approach, non-periodic functions: To characterize a non-periodic
function over an interval using N function
samples, map the interval into $[-1, 1]$, choose the sample points
to be Chebyshev points, and construct a polynomial interpolant
consisting of a sum of N Chebyshev polynomials. The approach
in general yields convergence exponential in N for integration,
differentiation, and interpolation.
Let's now investigate how Chebyshev spectral methods work for each of the
various aspects of the characterization problem we considered above.
Chebyshev approximation
As we saw previously, a function $f(x)$ on the interval $[-1, 1]$ may be represented
exactly as a linear combination of Chebyshev polynomials:
$$f(x) = \sum_{n=0}^{\infty} C_n T_n(x) \tag{13}$$
One way to obtain a formula for the C coefficients in this expansion is to take
the inner product of both sides with $T_m(x)$ and use the orthogonality of the T
functions:
$$C_m = \frac{\langle f, T_m\rangle}{\langle T_m, T_m\rangle} = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\,T_m(x)}{\sqrt{1-x^2}}\,dx. \tag{14}$$
However, there are better ways to compute these coefficients, as discussed below.
If we restrict the sum in (13) to include only its first N terms, we obtain an
approximate representation of $f(x)$, the Nth Chebyshev approximant:
$$f^{\text{approx}}(x) = \sum_{n=0}^{N-1} C_n T_n(x) \tag{15}$$
Chebyshev interpolation
The coefficients $C_n$ in formula (15) for the Chebyshev approximant may be
computed using the integral formula (14), but there are easier ways to get them.
These are based on the following alternative characterization of (15):
The Nth Chebyshev approximant (15) is the unique Nth-degree polynomial
that agrees with $f(x)$ at the N + 1 Chebyshev
points $x_n = \cos\frac{n\pi}{N}$, $n = 0, 1, \dots, N$.
Thus, when we construct (15), we are really constructing an interpolant that
smoothly connects N + 1 samples of $f(x)$ evaluated at the Chebyshev points.
In particular, the values of $f$ at the Chebyshev points are the only data we need
to construct $f^{\text{approx}}$ in (15). This is not obvious from expression (14), which
would seem to suggest that we need to know $f$ throughout the interval $[-1, 1]$.
How do we use this characterization of (15) to compute the Chebyshev expansion
coefficients $\{C_n\}$ in (15)? There are at least two ways to proceed:
1. We could use the Lagrange interpolation formula to construct the unique
Nth-degree polynomial running through the data points $\{x_n, f(x_n)\}$ for
the N + 1 Chebyshev points $x_n = \cos\frac{n\pi}{N}$, $n = 0, 1, \dots, N$.
2. We could observe that the $C_n$ coefficients are the coefficients in the Fourier
cosine series of the even $2\pi$-periodic function $g(\theta) = f(\cos\theta)$. The samples
of $g(\theta)$ at evenly spaced points $g\!\left(\frac{n\pi}{N}\right)$ are precisely just the samples
of $f(x)$ at the Chebyshev points $\cos\frac{n\pi}{N}$, and the Fourier cosine series
coefficients may be computed by computing the discrete cosine transform
of the set of numbers $\{f_n\}$:
$$\{f_n\} \xrightarrow{\ \text{DCT}\ } \{C_n\}$$
where
$$f_n = f\!\left(\cos\frac{n\pi}{N}\right), \qquad n = 0, 1, \dots, N.$$
(16)
where
$$f_n \equiv f\!\left(\cos\frac{n\pi}{N}\right).$$
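If we write out equation (16) for all of the $C_n$ coefficients at once, we have an
(N + 1)-dimensional linear system relating the sets of numbers $\{f_n\}$ and $\{C_n\}$:
$$\frac{2}{N}\begin{pmatrix}
\frac12 & 1 & 1 & 1 & \cdots & \frac12 \\
\frac12 & \cos\frac{\pi}{N} & \cos\frac{2\pi}{N} & \cos\frac{3\pi}{N} & \cdots & \frac12\cos\pi \\
\frac12 & \cos\frac{2\pi}{N} & \cos\frac{4\pi}{N} & \cos\frac{6\pi}{N} & \cdots & \frac12\cos 2\pi \\
\frac12 & \cos\frac{3\pi}{N} & \cos\frac{6\pi}{N} & \cos\frac{9\pi}{N} & \cdots & \frac12\cos 3\pi \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac12 & \cos\pi & \cos 2\pi & \cos 3\pi & \cdots & \frac12\cos N\pi
\end{pmatrix}
\begin{pmatrix} f_0 \\ f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_N \end{pmatrix}
=
\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ \vdots \\ C_N \end{pmatrix} \tag{17}$$
The entry in row $n$ and column $m$ of the matrix in (17), including the prefactor, is
$$\begin{cases}
\dfrac{1}{N}, & m = 0 \\[2mm]
\dfrac{2}{N}\cos\dfrac{nm\pi}{N}, & m = 1, \dots, N-1 \\[2mm]
\dfrac{1}{N}\cos n\pi, & m = N
\end{cases}$$
where the n, m indices run from 0 to N.
Using equation (17) directly is actually not a good way to compute the C
coefficients from the f samples, because the computational cost of the matrix-vector
multiplication scales like $N^2$, whereas FFT techniques (the fast cosine
transform) can perform the same computation with cost scaling like $N \log N$.
However, the existence of the matrix is useful for deriving Clenshaw-Curtis
quadrature rules and Chebyshev differentiation matrices, as we will now see.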
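To make the sample-to-coefficient map concrete, here is a minimal sketch that applies the matrix entries quoted above directly, at the O(N^2) cost the text warns about rather than with a fast cosine transform. One assumption is made explicit here because the notes' normalization is not fully recoverable: when the interpolant is evaluated, the n = 0 and n = N terms are given weight 1/2, which is the convention under which these entries reproduce the samples exactly. The function names are hypothetical.

```python
import math


def chebyshev_coefficients(f, N):
    """Coefficients C_0..C_N from samples of f at the points cos(m*pi/N), m = 0..N.

    C_n = (1/N) f_0 + (2/N) sum_{m=1}^{N-1} cos(n m pi / N) f_m + (1/N) cos(n pi) f_N.
    """
    samples = [f(math.cos(m * math.pi / N)) for m in range(N + 1)]
    coeffs = []
    for n in range(N + 1):
        c = samples[0] / N + math.cos(n * math.pi) * samples[N] / N
        for m in range(1, N):
            c += (2.0 / N) * math.cos(n * m * math.pi / N) * samples[m]
        coeffs.append(c)
    return coeffs


def evaluate_interpolant(coeffs, x):
    """Evaluate C_0/2 + sum_{n=1}^{N-1} C_n T_n(x) + C_N T_N(x)/2 for x in [-1, 1].

    The halved first and last terms are an assumed convention (see the lead-in).
    Uses T_n(x) = cos(n arccos x).
    """
    N = len(coeffs) - 1
    theta = math.acos(x)
    total = 0.5 * coeffs[0] + 0.5 * coeffs[N] * math.cos(N * theta)
    for n in range(1, N):
        total += coeffs[n] * math.cos(n * theta)
    return total


# Example: interpolate exp(x) through 13 Chebyshev points and evaluate between nodes.
C = chebyshev_coefficients(math.exp, 12)
interp_error = abs(evaluate_interpolant(C, 0.3) - math.exp(0.3))
```

For a smooth function such as exp the coefficients fall off very quickly with n, which is the exponential convergence the notes describe; a production implementation would replace the double loop with a fast cosine transform.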
Chebyshev integration
The Chebyshev spectral approach to integrating a function $f(x)$ goes like this:
1. Construct the Nth Chebyshev approximant $f^{\text{approx}}(x)$ to $f(x)$ [equation
(15)].
2. Integrate the approximant and take this as an approximation to the integral.
In symbols, we have
$$\int_{-1}^{1} f(x)\,dx \approx \int_{-1}^{1} f^{\text{approx}}(x)\,dx = \sum_{m=0}^{N} C_m \int_{-1}^{1} T_m(x)\,dx. \tag{18}$$
But the integrals of the Chebyshev polynomials can be evaluated in closed form,
with the result
$$\int_{-1}^{1} T_m(x)\,dx = \begin{cases} \dfrac{2}{1-m^2}, & m \text{ even} \\[2mm] 0, & m \text{ odd.} \end{cases} \tag{19}$$
Thus equation (18) reads
$$\int_{-1}^{1} f(x)\,dx \approx \sum_{\substack{m=0 \\ m \text{ even}}}^{N} \frac{2C_m}{1-m^2}. \tag{20}$$
Does this expression look familiar? It is exactly what we found in our discussion
of Clenshaw-Curtis quadrature, except there we interpreted the integral (19) in
the equivalent form
$$\int_{-1}^{1} T_m(x)\,dx = \int_0^{\pi} \cos(m\theta)\sin\theta\,d\theta.$$
cients:
W=
f (x) dx WT C,
2
0
2
122
0
2
142
..
.
2
1N 2
f (x) dx WT f
(21)
= wt f
(22)
Chebyshev differentiation
In the first unit of our course we saw how to use finite-difference techniques to
approximate derivative values from function values. For example, if $\mathbf{f}_{\text{even}}$ is a
vector of function samples taken at evenly spaced points in an interval $[a, b]$, i.e.
if
$$\mathbf{f}_{\text{even}} = \begin{pmatrix} f(a) \\ f(a + \Delta) \\ f(a + 2\Delta) \\ \vdots \\ f(b) \end{pmatrix}$$
then the vector of derivative values at the sample points may be represented in
the centered-finite-difference approximation as a matrix-vector product of the
form
$$\mathbf{f}'_{\text{even}} = \mathbf{D}^{\text{CFD}}\,\mathbf{f}_{\text{even}}$$
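where7
$$\mathbf{D}^{\text{CFD}} = \frac{1}{2\Delta}\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 & 0 \\
-1 & 0 & 1 & \cdots & 0 & 0 \\
0 & -1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 0 & 1 \\
0 & 0 & 0 & \cdots & -1 & 0
\end{pmatrix}$$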
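$$f^{\text{approx}}(x) = \sum_{m=0}^{N} C_m T_m(x)$$
Differentiating, we find
$$f'_{\text{approx}}(x) = \sum_{m=0}^{N} C_m T'_m(x).$$
If we evaluate this formula at each of the (N + 1) Chebyshev points $x_n = \cos\frac{n\pi}{N}$,
$n = 0, 1, \dots, N$, we obtain a vector $\mathbf{f}'_{\text{cheb}}$ whose entries are approximate
values of the derivative of $f$ at the Chebyshev points, and which is related to
the vector $\mathbf{C}$ of Chebyshev coefficients via a matrix-vector product relationship:
$$\underbrace{\begin{pmatrix} f'(x_0) \\ f'(x_1) \\ \vdots \\ f'(x_N) \end{pmatrix}}_{\mathbf{f}'_{\text{cheb}}}
=
\underbrace{\begin{pmatrix}
T'_0(x_0) & T'_1(x_0) & T'_2(x_0) & \cdots & T'_N(x_0) \\
T'_0(x_1) & T'_1(x_1) & T'_2(x_1) & \cdots & T'_N(x_1) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T'_0(x_N) & T'_1(x_N) & T'_2(x_N) & \cdots & T'_N(x_N)
\end{pmatrix}}_{\mathbf{T}'}
\underbrace{\begin{pmatrix} C_0 \\ C_1 \\ \vdots \\ C_N \end{pmatrix}}_{\mathbf{C}} \tag{23}$$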
7 We are here assuming that f vanishes to the left and right of the endpoints; as we saw
earlier in the course, it is easy to generalize to arbitrary boundary values of f.
8 Technically: faster than any polynomial in N.
Second derivatives
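What if we need to compute second derivatives? Easy! Just go like this:
$$\mathbf{f}''_{\text{cheb}} = \mathbf{D}_{\text{cheb}}\,\mathbf{f}'_{\text{cheb}} = \mathbf{D}_{\text{cheb}}\,\mathbf{D}_{\text{cheb}}\,\mathbf{f}_{\text{cheb}} = \mathbf{D}_{\text{cheb}}^2\,\mathbf{f}_{\text{cheb}}.$$
This equation identifies the $(N+1) \times (N+1)$ matrix $(\mathbf{D}_{\text{cheb}})^2$, i.e. just the square
of the matrix $\mathbf{D}_{\text{cheb}}$, as the matrix that operates on a vector of $f$ samples at
Chebyshev points to yield a vector of $f''$ samples at Chebyshev points.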
Contents
1 Newton-Cotes Quadrature
3 Clenshaw-Curtis Quadrature
Newton-Cotes Quadrature
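Name                Approximation to $\int_a^b f(x)\,dx$
rectangular rule    $\Delta \sum_{n=0}^{N-1} f(a + n\Delta)$
trapezoidal rule    $\Delta \sum_{n=0}^{N-1} \frac{1}{2}\left[ f(a + n\Delta) + f\big(a + (n+1)\Delta\big) \right]$
Simpson's rule      $\Delta \sum_{n=0}^{N-1} \frac{1}{6}\left[ f(a + n\Delta) + 4 f\big(a + (n + \tfrac{1}{2})\Delta\big) + f\big(a + (n+1)\Delta\big) \right]$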
When we discussed Newton-Cotes quadrature previously, we offered the following
heuristic convergence analysis: The pth-order NC rule models $f$ as a
pth-degree polynomial, which means the error in the approximation is a polynomial
that starts at degree p + 1. The integral of this error polynomial over
an interval of width $\Delta$ is proportional to $\Delta^{p+2} \sim \frac{1}{N^{p+2}}$. Hence the error in our
approximate estimate of the integral over each subinterval is
$$\text{error per subinterval} \sim \frac{1}{N^{p+2}} \quad\Longrightarrow\quad \text{total error} \sim \frac{1}{N^{p+1}} \tag{1}$$
In other words, our heuristic convergence analysis suggests that the error should
decay algebraically with N, with faster decay for larger values of p. However,
this analysis is clearly oversimplified; in particular, equation (1) blindly sums
the errors within each subinterval, without considering the possibility of
cancellations among the errors in different subintervals.
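The algebraic decay is easy to observe. The sketch below implements the rectangular, trapezoidal and Simpson rules as composite sums over N subintervals of width (b - a)/N and measures how the error changes when N is doubled; the function names and the test integrand exp(x) on [0, 1] are choices made for this illustration.

```python
import math


def rectangular(f, a, b, N):
    """Composite rectangular rule: width times the left-endpoint samples."""
    d = (b - a) / N
    return d * sum(f(a + n * d) for n in range(N))


def trapezoidal(f, a, b, N):
    """Composite trapezoidal rule: average of the two endpoint samples per subinterval."""
    d = (b - a) / N
    return d * sum(0.5 * (f(a + n * d) + f(a + (n + 1) * d)) for n in range(N))


def simpson(f, a, b, N):
    """Composite Simpson rule with weights (1, 4, 1)/6 on endpoint, midpoint, endpoint."""
    d = (b - a) / N
    return d * sum(
        (f(a + n * d) + 4 * f(a + (n + 0.5) * d) + f(a + (n + 1) * d)) / 6
        for n in range(N)
    )


exact = math.e - 1.0  # integral of exp(x) over [0, 1]
error_ratio = {}
for rule in (rectangular, trapezoidal, simpson):
    e10 = abs(rule(math.exp, 0.0, 1.0, 10) - exact)
    e20 = abs(rule(math.exp, 0.0, 1.0, 20) - exact)
    error_ratio[rule.__name__] = e10 / e20  # a ratio of 2^k indicates error ~ 1/N^k
```

For this integrand the ratios come out near 2, 4 and 16 respectively. The first two match the heuristic estimate (p = 0 and p = 1), while Simpson's rule does better than the 1/N^3 the heuristic predicts for p = 2, which is one symptom of the cancellations the heuristic ignores.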
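Let's consider the integral of a function $f(t)$ over an interval of width $T$, which
we assume without loss of generality to start at t = 0. Thus we are trying to
compute
$$I = \int_0^T f(t)\,dt.$$
The N-point trapezoidal-rule approximation to this integral is
$$I_N^{\text{trap}} = \frac{T}{N}\left\{ \frac{1}{2}\Big[ f(0) + f(T) \Big] + \sum_{n=1}^{N-1} f\!\left(\frac{nT}{N}\right) \right\}. \tag{2}$$
This formula is just the second box of the table in the previous section, with
$a = 0$, $b = T$, and $\Delta = \frac{T}{N}$. What we would like to understand is the N dependence
of the error
$$E_N^{\text{trap}} = I - I_N^{\text{trap}}.$$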
To do this, recall from our discussion of Fourier analysis that our function
may be represented over the interval $[0, T]$ in the form
$$f(t) = \sum_{m=-\infty}^{\infty} \tilde{f}_m e^{im\omega_0 t}, \qquad \omega_0 = \frac{2\pi}{T} \tag{3}$$
where the Fourier series coefficients are
$$\tilde{f}_m = \frac{1}{T}\int_0^T f(t)\,e^{-im\omega_0 t}\,dt. \tag{4}$$
In particular, the integral we are trying to compute is precisely just T times the
value of the m = 0 Fourier series coefficient:
$$I = T\tilde{f}_0.$$
Of course, when we are doing Newton-Cotes quadrature on a function $f(t)$ we
don't know its Fourier series coefficients (if we did, we wouldn't need to be
doing quadrature in the first place), but the point is that even without knowing
the values of the $\tilde{f}_m$ we know that the Fourier-synthesized representation (3)
exists, and that is all that we need for this analysis.
We now want to insert the representation (3) into (2). Conveniently, the first
term on the RHS of (2) is precisely what we get by evaluating the Fourier series
(3) at t = 0.1 For the other terms, we simply plug in equation (3) evaluated at
1 This is obviously true when the original function $f(t)$ is periodic with period T, but when
$f(0) \ne f(T)$ it is a nontrivial and convenient fact that the first term on the RHS of (2)
is precisely what we get by evaluating the Fourier series (3) at t = 0. This, incidentally,
is the reason for starting with a convergence analysis of the trapezoidal rule instead of the
rectangular rule; the latter can be analyzed using Fourier-series techniques as well, but the
analysis is not as nice.
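$$I_N^{\text{trap}} = \frac{T}{N}\Bigg\{ \underbrace{\frac{1}{2}\Big[ f(0) + f(T) \Big]}_{\sum_m \tilde{f}_m} + \sum_{n=1}^{N-1} \underbrace{f\!\left(\frac{nT}{N}\right)}_{\sum_m \tilde{f}_m e^{im\omega_0 \left(\frac{nT}{N}\right)}} \Bigg\}$$
$$= \frac{T}{N}\sum_{n=0}^{N-1}\left\{ \sum_{m=-\infty}^{\infty} \tilde{f}_m e^{im\omega_0\left(\frac{nT}{N}\right)} \right\}$$
$$= \frac{T}{N}\sum_{n=0}^{N-1}\left\{ \sum_{m=-\infty}^{\infty} \tilde{f}_m e^{2\pi i mn/N} \right\} \tag{5}$$
where I used $\omega_0 = \frac{2\pi}{T}$.
$$I_N^{\text{trap}} = T\sum_{m=-\infty}^{\infty} \tilde{f}_m \underbrace{\left\{ \frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i mn/N} \right\}}_{K_N(m)} \tag{6}$$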
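In the last line here we defined a function $K_N(m)$ which has some interesting
properties:
$$K_N(m) = \frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i mn/N} = \frac{1}{N}\Big[ 1 + e^{2\pi i m/N} + e^{4\pi i m/N} + \cdots + e^{2\pi i m(N-1)/N} \Big] \tag{7}$$
$$K_N(m) = \begin{cases} 1, & \text{if } m \text{ is an integer multiple of } N \\ 0, & \text{otherwise.} \end{cases}$$
Thus the error $E_N^{\text{trap}} = I - I_N^{\text{trap}}$ is just the sum of the Nth, 2Nth, etc.
Fourier-series coefficients of our function:
$$E_N^{\text{trap}} = \sum_{\substack{p=-\infty \\ p \ne 0}}^{\infty} \tilde{f}_{pN}. \tag{8}$$
Of course, again, we don't know the numbers $\tilde{f}_{pN}$, so we can't compute the
RHS of this formula exactly. However, we can use the smoothness-vs.-decay
properties of Fourier analysis to estimate how rapidly it decays with N.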
$$\tilde{f}_N + \tilde{f}_{-N} = O\!\left(\frac{1}{N^2}\right)$$
2 For a proof, see J. P. Boyd, Chebyshev and Fourier Spectral Methods, Section 2.9.
Figure 1: (a) A non-periodic function that we might be trying to integrate
over the interval [0, 1]. (b) The actual function whose Fourier-series coefficients
we are computing when we evaluate equation (4). Note that this function is
discontinuous even though the original function was continuous.
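Hence the terms proportional to 1/N cancel out of (8), and we have
$$E_N^{\text{trap}} = \sum_{p=1}^{\infty}\left( \tilde{f}_{pN} + \tilde{f}_{-pN} \right) \sim \sum_{p=1}^{\infty} \frac{1}{(pN)^2} \sim \frac{1}{N^2}.$$
So there's the $1/N^2$ convergence of the trapezoidal rule.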
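The contrast between periodic and non-periodic integrands can be seen directly. The sketch below applies the N-panel trapezoidal rule on [0, T] to one integrand of each kind; the two test functions are arbitrary choices for illustration, and the function name `trapezoid` is hypothetical.

```python
import math


def trapezoid(f, T, N):
    """N-panel trapezoidal rule on [0, T]: (T/N) * { [f(0) + f(T)]/2 + sum of interior samples }."""
    total = 0.5 * (f(0.0) + f(T))
    for n in range(1, N):
        total += f(n * T / N)
    return total * T / N


def periodic(t):
    # Smooth and periodic with period 1.
    return math.exp(math.cos(2 * math.pi * t))


# Non-periodic case: exp(t) on [0, 1]; doubling N should cut the error by about 4.
exact_nonperiodic = math.e - 1.0
err16 = abs(trapezoid(math.exp, 1.0, 16) - exact_nonperiodic)
err32 = abs(trapezoid(math.exp, 1.0, 32) - exact_nonperiodic)
nonperiodic_ratio = err16 / err32

# Periodic case: by N = 16 the result has already stopped changing to near rounding level.
periodic_change = abs(trapezoid(periodic, 1.0, 16) - trapezoid(periodic, 1.0, 32))
```

In the periodic case the only Fourier coefficients that enter the error are those at multiples of N, and for a smooth periodic function these are already negligible at modest N, so refining further changes nothing visible in double precision.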
Clenshaw-Curtis Quadrature
The discussion of the previous section explains why the simple trapezoidal rule
converges so rapidly for periodic functions, and why it converges relatively slowly
for non-periodic functions. Thus, if we are lucky enough to be integrating a
periodic function over a period, all we have to do is apply the usual trapezoidal
rule and we magically get exponential convergence. But what if we have the
bad fortune of needing to integrate a non-periodic function? Are we stuck with
the slow convergence of the trapezoidal rule?
No! This is actually a general principle of mathematics, and of life more
broadly: You are not helpless. You have options. In particular, in the case
at hand we have the option to convert our non-periodic function into a periodic
function, and the process of availing ourselves of this option is known as
Clenshaw-Curtis quadrature.
The interval $[-1, 1]$ happens to be precisely the range of values covered (though
not in the same order) by $\cos\theta$ as $\theta$ ranges from 0 to $\pi$, so it is convenient to
use the parameterization $t = \cos\theta$ and to define a new function
$$g(\theta) \equiv f(\cos\theta).$$
Figure 2 shows some non-periodic function $f(t)$ together with the function
$g(\theta) \equiv f(\cos\theta)$. Notice the following points about $g(\theta)$:
(a) It is a periodic function with period $T = 2\pi$.
(b) It is an even function, i.e. $g(\theta) = g(-\theta)$.
(c) As $\theta$ ranges from $0 \to \pi$, $g(\theta)$ traces out the behavior of $f(t)$ as $t$ ranges
backward from $1 \to -1$.
(d) $g(\theta)$ knows nothing about the behavior of $f(t)$ outside the range $-1 \le t \le 1$.
This can make it a little tricky to compare the two plots. For example,
$g(\theta)$ has local minima at $\theta = 0, \pi$ even though $f(t)$ does not have local
minima at $t = 1, -1$.
3 If you need to integrate a function $f(t)$ over some other interval $[a, b]$, just define
$g(u) = f\!\left(a + \frac{(b-a)}{2}(u + 1)\right)$ and apply Clenshaw-Curtis quadrature to integrate $g(u)$ from $u = -1$
to 1. Don't forget the Jacobian factor.
Figure 2: (a) A function $f(t)$ that we want to integrate over the interval $[-1, 1]$.
(b) The function $g(\theta) = f(\cos\theta)$. Note the following facts: (1) $g(\theta)$ is periodic
with period $2\pi$. (2) $g(\theta)$ is an even function of $\theta$. (3) Over the interval $0 \le \theta \le \pi$,
$g(\theta)$ reproduces the behavior of $f(t)$. However, (4) $g(\theta)$ knows nothing
about what $f(t)$ does outside the range $-1 < t < 1$, which can make it a little
tricky to compare the two plots. For example, $g(\theta)$ has local minima at $\theta = 0, \pi$
even though $f(t)$ does not have local minima at $t = 1, -1$.
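Property (a) here ensures that the function $g(\theta)$ has a Fourier-series representation involving sinusoids whose frequencies are integer multiples of a base frequency $\omega_0 = \frac{2\pi}{T} = 1$:
$$g(\theta) = \sum_{n=-\infty}^{\infty} \tilde g_n e^{in\theta}, \qquad \tilde g_n = \frac{1}{2\pi}\int_0^{2\pi} g(\theta)\, e^{-in\theta}\, d\theta. \qquad (10)$$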
Meanwhile, property (b) ensures that this Fourier series contains only cosine
terms, i.e. it is a Fourier cosine series:
$$g(\theta) = \frac{\tilde a_0}{2} + \sum_{n=1}^{\infty} \tilde a_n \cos n\theta \qquad (11)$$
where the $\tilde a_n$ coefficients are related to the $\tilde g_n$ coefficients in (10) according to
$$\tilde a_0 = 2\tilde g_0, \qquad \tilde a_n = (\tilde g_n + \tilde g_{-n}) = 2\tilde g_n$$
(where we used the fact that the Fourier-series coefficients of an even real-valued
function satisfy $\tilde g_{-n} = \tilde g_n$). The $\tilde a$ coefficients may also be written in the form
$$\tilde a_n = \frac{1}{\pi}\int_0^{2\pi} g(\theta)\cos(n\theta)\, d\theta. \qquad (12)$$
Notice something very important about these integrals: They are integrals of a
periodic function over its period. (Indeed, both $g(\theta)$ and $\cos(n\theta)$ for integer $n$
are periodic functions over the interval $[0 : 2\pi]$, so the whole integrand is periodic.) That means the integral (12) can be evaluated using a simple $N$-point
trapezoidal rule with an error that decays exponentially with $N$.
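$$\begin{aligned}
I \equiv \int_{-1}^{1} f(t)\,dt &= \int_0^{\pi} f(\cos\theta)\sin\theta\,d\theta
= \int_0^{\pi}\Big[\frac{\tilde a_0}{2} + \sum_{n=1}^{\infty}\tilde a_n\cos(n\theta)\Big]\sin\theta\,d\theta \\
&= \frac{\tilde a_0}{2}\underbrace{\int_0^{\pi}\sin\theta\,d\theta}_{=2}
 + \sum_{n=1}^{\infty}\tilde a_n\underbrace{\int_0^{\pi}\cos(n\theta)\sin\theta\,d\theta}_{=\frac{1+(-1)^n}{1-n^2}} \\
&= \tilde a_0 + \sum_{\substack{n=1 \\ n\ \text{even}}}^{\infty}\frac{2\tilde a_n}{1-n^2} \\
&= \tilde a_0 + \sum_{n=1}^{\infty}\frac{2\tilde a_{2n}}{1-4n^2}. \qquad (14)
\end{aligned}$$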
Equation (14) expresses the integral of our function $f(t)$ in terms of the Fourier-cosine-series coefficients of $g(\theta)$, defined by equation (12).
Moreover, the sum in (14) is rapidly convergent, because (assuming the
original function $f$ is a smooth function) the function $g(\theta)$ is smooth and periodic, so its Fourier-series coefficients $\tilde a_n$ decay faster than any polynomial in
$n$. (Note that this would not be the case if we had simply constructed a brute-force periodic extension of $f(t)$ by slicing out its behavior between $[-1 : 1]$ and
periodically repeating it; in that case the function would have discontinuities
at the endpoints of the interval and its Fourier coefficients would only decay
algebraically with $n$.)
Hence, in practice, we can truncate the sum in (14) at some finite number of
terms, i.e. we keep terms up to $\tilde a_N$ for some even integer $N$, which then defines
the Clenshaw-Curtis approximation to our integral:
$$I \approx I_N^{\text{CC}} \equiv \tilde a_0 + \sum_{n=1}^{N/2}\frac{2\tilde a_{2n}}{1-4n^2}. \qquad (15)$$
We could approximate the Fourier-coefficient integral, equation (12), using an $N$-point trapezoidal rule.$^5$ Since this is an integral of a periodic
function over its period, the error in this procedure will decrease exponentially$^6$ with $N$. Moreover, the trapezoidal-rule approximation to $\tilde a_n$ will
sample $g(\theta) = f(\cos\theta)$ at the same $N$ points for all values of $n$, and (15)
then amounts to a weighted sum over those function samples; that is, it
amounts to an $N$-point quadrature rule.
Alternatively, we could approximate the $\tilde a_n$ coefficients using a fast Fourier
transform and evaluate the sum (15) directly.
Both of these viewpoints are useful in practice. We will consider the first of
these possibilities in the next section, and the second possibility in our lecture
notes on discrete Fourier transforms.
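The first viewpoint can be tried out directly. The following is a minimal sketch, not the notes' own code: it estimates the $\tilde a_n$ by a trapezoidal rule on the $N+1$ points $\theta_m = m\pi/N$ (using the evenness of $g$ to fold the integral onto $[0, \pi]$) and then sums the series. The function name `clenshaw_curtis` is hypothetical. One detail is an assumption layered on top of (15): the final $n = N/2$ term is given half weight, which is the usual convention when the coefficients come from an $(N+1)$-point cosine sum.

```julia
# Sketch of Clenshaw-Curtis quadrature on [-1, 1] for even N (hypothetical helper).
function clenshaw_curtis(f, N::Integer)
    @assert N >= 2 && iseven(N)
    theta = [m * pi / N for m in 0:N]      # sample angles theta_m = m*pi/N
    g = [f(cos(th)) for th in theta]       # g(theta) = f(cos(theta))

    # Trapezoidal-rule estimate of a_n; the two endpoint samples carry half weight.
    function a(n)
        s = 0.5 * g[1] + 0.5 * g[end] * cos(n * pi)
        for m in 1:N-1
            s += g[m+1] * cos(n * theta[m+1])
        end
        return 2 * s / N
    end

    total = a(0)
    for n in 1:(div(N, 2) - 1)
        total += 2 * a(2n) / (1 - 4n^2)
    end
    total += a(N) / (1 - N^2)              # last term at half weight (assumption, see above)
    return total
end
```

With $N = 2$ this reduces to Simpson's rule, so it integrates $t^2$ exactly; for a smooth integrand such as $e^t$ a modest $N$ should already be accurate to near machine precision.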
5 Actually we could use any $M$-point trapezoidal rule here, with $M$ not necessarily having
any particular relationship to $N$; in this case the error in the individual coefficients would
decay exponentially with $M$ while the error in the sum (15) would decay exponentially with $N$, and the overall error
would be determined by the smaller of the two decay rates.
6 Technically, the proper statement is that the error will decrease faster than any polynomial
in $N$, which still leaves open the possibility of convergence like $e^{-\sqrt{N}}$, which is faster than any
polynomial but not exponentially fast. We are only guaranteed to get exponential convergence
if the original function $f(t)$ is analytic.
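$$\tilde a_n = \frac{1}{\pi}\int_0^{2\pi} g(\theta)\cos(n\theta)\,d\theta \qquad (16)$$
$$\approx \sum_{m=0}^{N} C_m\, g(\theta_m)\cos(n\theta_m) \qquad (17)$$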
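2. Next, we insert (17) into (15) and rearrange the order of the summations:
$$\begin{aligned}
I_N^{\text{CC}} &= \underbrace{\sum_{m=0}^{N} C_m g(\theta_m)}_{\tilde a_0}
 + \sum_{n=1}^{N/2}\frac{2}{1-4n^2}\underbrace{\Big[\sum_{m=0}^{N} C_m g(\theta_m)\cos(2n\theta_m)\Big]}_{\tilde a_{2n}} \\
&= \sum_{m=0}^{N}\underbrace{C_m\Big[1 + \sum_{n=1}^{N/2}\frac{2\cos(2n\theta_m)}{1-4n^2}\Big]}_{w_m}\, g(\theta_m) \\
&= \sum_{m=0}^{N} w_m\, g(\theta_m). \qquad (18)
\end{aligned}$$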
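$$\theta_m = \frac{m\pi}{N}, \qquad m = 0, 1, \cdots, N \qquad (22a)$$
$$C_m = \begin{cases} \frac{1}{N}, & m = 0 \\ \frac{2}{N}, & m = 1, 2, \cdots, N-1 \\ \frac{1}{N}, & m = N. \end{cases} \qquad (22b)$$
$$I_N^{\text{CC}} = \sum_{m=0}^{N} w_m\, g(\theta_m) \qquad (23)$$
$$= \sum_{m=0}^{N} w_m\, f(t_m), \qquad t_m \equiv \cos\theta_m \qquad (24)$$
For even values of $N$:
$$w_m = \begin{cases}
\frac{1}{N^2-1}, & m = 0 \\
\frac{2}{N}\Big[1 + \frac{\cos m\pi}{1-N^2} + \sum_{n=1}^{N/2-1}\frac{2}{1-4n^2}\cos\frac{2mn\pi}{N}\Big], & m = 1, \cdots, N-1 \\
\frac{1}{N^2-1}, & m = N.
\end{cases}$$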
For odd values of N :
$$\frac{1}{N^p}$$
and we say we have $p$th-order convergence or a $p$th-order method.
E
EN e2
$x \sim 10^{-y}$
or equivalently as
$x \sim e^{-y}$
or equivalently as
$x \sim 2^{-y}$
the point being that it doesn't matter what base we use for the exponent; all of the above
expressions describe $x$ decaying exponentially with $y$.
Caveat
You have to be a bit careful with this terminology, because "first-order" and
"linear" are usually synonymous, as are "second-order" and "quadratic"; but
these terms mean very different things when we are talking about convergence
rates.
Indeed, linear convergence (for example) is much faster than first-order convergence. For a first-order method, to obtain one additional significant digit
(i.e. to reduce our error by a factor of 10) we must increase $N$ by a factor of
10; that is, we must do ten times more work. To get two additional digits, we
must do one hundred times more work. Thus the cost of each extra digit grows
cumulatively.
In contrast, for a method that exhibits linear convergence, the error decreases
like $10^{-\alpha N}$ for some constant $\alpha$. For example, suppose $\alpha = \frac{1}{5}$. In this case, to
get one extra digit we need only increase $N$ by 5; that is, we must do five more
operations. Not five times as many operations as we have done so far, just
five more operations, independent of how many we have done thus far. To get
another digit we only have to do another 5 operations, and so on. Thus the cost
of each extra digit is fixed, no matter how many digits we have obtained so far.
For quadratic convergence, the cost of additional digits actually shrinks as we
proceed.
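The difference in cost per digit can be made concrete with a small sketch. Both error models below are idealizations chosen for illustration (error exactly $1/N$ for the first-order case, exactly $10^{-\alpha N}$ for the linear case), and the function names are hypothetical.

```julia
# N required to reach an error of 10^(-d) under two idealized error models.
n_first_order(d) = 10.0^d                # model: error = 1/N
n_linear(d; alpha=1/5) = d / alpha       # model: error = 10^(-alpha*N)

for d in 1:4
    println(d, " digits: first-order N = ", n_first_order(d),
            ", linear N = ", n_linear(d))
end
```

Under these assumptions each extra digit multiplies the first-order cost by 10 but adds only $1/\alpha = 5$ to the linear cost.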
Overview
In the first half of the course, we considered the computation of the electrostatic
potential due to the 1D ionic solid pictured in Figure 1, which consists of an
infinite chain of ions with alternating charges $\pm Q$ separated by a distance $D$.
$$\phi(x, y) = \sum_{n=-\infty}^{\infty}\frac{(-1)^n}{\sqrt{(x-n)^2 + y^2}} \qquad (1)$$
The series (1) is perfectly well defined$^1$ and convergent and may be used to
compute $\phi(x, y)$ numerically. However, as we saw in the beginning of the course,
the convergence is slow, requiring us to sum upwards of millions of terms to get
6-digit accuracy.
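For reference, a brute-force partial sum of the series (1) takes only a few lines. This is an illustrative sketch (the name `phi_brute_force` is hypothetical); it pairs the ions at $+n$ and $-n$ so that "summing up to $N$" means including all ions with $|n| \le N$.

```julia
# Partial sum of series (1): all ions with |n| <= N, in units where Q = D = 1.
function phi_brute_force(x, y, N)
    total = 1 / sqrt(x^2 + y^2)                    # n = 0 term
    for n in 1:N
        sgn = iseven(n) ? 1 : -1                   # alternating charges (-1)^n
        total += sgn * (1 / sqrt((x - n)^2 + y^2) + 1 / sqrt((x + n)^2 + y^2))
    end
    return total
end
```

Because the paired terms fall off only like $2/n$ with alternating sign, the partial sums oscillate around the limit with an amplitude of order $1/N$, which is why so many terms are needed.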
We might also consider higher-dimensional versions of this problem. For
example, suppose instead of a 1D chain we had a two-dimensional lattice of
ions, with ions at positions (in our units) $(n_x, n_y)$ for all integer values of $n_x, n_y$.
Now the potential at a point $(x, y)$ takes the form
$$\phi^{\text{2D}}(x, y) = \sum_{n_x=-\infty}^{\infty}\sum_{n_y=-\infty}^{\infty}\frac{(-1)^{(n_x+n_y)}}{\sqrt{(x-n_x)^2 + (y-n_y)^2}}. \qquad (2)$$
1 At least as long as the evaluation point is not on an ion site, i.e. $(x, y) \ne (n, 0)$, which we
assume.
If we needed to sum $10^6$ terms in (1) to get 6-digit accuracy, we will need to
sum many more terms of (2) to get similar accuracy. If we need to tabulate the
potential at some large number of points throughout the unit cell $0 < x, y < 1$,
the computation will start to get seriously expensive.
Ewald summation is a brilliant trick for speeding the convergence of sums
like (1) and (2). In addition to being an extremely valuable practical tool
in fields like computational electromagnetism and particle simulation, it offers
an excellent example of the power of Fourier analysis and of thinking about
numerical problems in the right domain, which, in this case, is the Fourier
domain.
$$\phi(\mathbf{r}) = \phi^{\text{local}}(\mathbf{r}) + \phi^{\text{distant}}(\mathbf{r}) \qquad (3)$$
When we do this, we find that the two terms have the following properties.
$\phi^{\text{local}}$ is easily computed by summing just a few terms of the sum (1); we
say $\phi^{\text{local}}$ converges rapidly in real space.
On the other hand, although the sum that defines $\phi^{\text{distant}}$ is slowly convergent in real space, the function $\phi^{\text{distant}}(\mathbf{r})$ is slowly varying in real
space, which means that its Fourier transform decays rapidly in Fourier
space. We will use this fact to rewrite $\phi^{\text{distant}}(\mathbf{r})$ as a sum over Fourier
components that converges rapidly in Fourier space.
Whenever you hear the phrase "slowly varying in real space" there should
be an alarm bell going "ding-ding-ding!" in your head and a little guy yelling
"rapidly decaying in Fourier space!" And, indeed, upon Fourier-transforming
$\phi^{\text{distant}}(x)$ we will find a function $\tilde\phi^{\text{distant}}(k)$ which decays rapidly with $k$ and
which, by Poisson summation, will yield a series that is rapidly convergent in
Fourier space. This is the basic idea of Ewald summation.
Now, in principle it would seem easy to effect the separation in (3): We could
simply take the local term to consist of the contributions of all ions within (say)
10 sites of our evaluation point, and the distant term to account for all other
ions. However, this turns out to be the wrong approach, basically because it
destroys the very smoothness property of $\phi^{\text{distant}}$ that makes it well-localized
$$\phi^{\text{Coulomb}}(r) = \frac{1}{r}.$$
This function of $r$ has two key properties: For small $r$, it is rapidly varying as
a function of $r$ (indeed, it is singular as $r \to 0$). On the other hand, for large $r$
it is slowly varying as a function of $r$.
What we would like to do is to break up this potential into two separate
functions, each exhibiting one and only one of these properties. More specifically, we decompose $\phi^{\text{Coulomb}}$ into two pieces, one of which is short-ranged (it
captures the rapid variation for small $r$ but decays rapidly for large $r$) and the
other of which is long-ranged but nonsingular at $r = 0$:
$$\phi^{\text{Coulomb}}(r) = \phi^{\text{short}}(r) + \phi^{\text{long}}(r).$$
To construct $\phi^{\text{short}}(r)$, we will multiply $\phi^{\text{Coulomb}}(r)$ by some sort of window
function $W(r)$ that is 1 for small $r$ (preserving the small-distance behavior of
$\phi^{\text{Coulomb}}$) but falls to 0 rapidly for large $r$. Given a windowing function $W(r)$,
we define
$$\phi^{\text{short}}(r) = W(r)\,\phi^{\text{Coulomb}}(r), \qquad \phi^{\text{long}}(r) = \big[1 - W(r)\big]\,\phi^{\text{Coulomb}}(r)$$
and we define the local and distant contributions to the potential, equation (3),
as
$$\phi^{\text{local}}(\mathbf{r}) = \sum_{\mathbf{n}}\phi^{\text{short}}(\mathbf{r} - \mathbf{r}_{\mathbf{n}}), \qquad
\phi^{\text{distant}}(\mathbf{r}) = \sum_{\mathbf{n}}\phi^{\text{long}}(\mathbf{r} - \mathbf{r}_{\mathbf{n}}), \qquad (4)$$
where the sum over $\mathbf{n}$ runs over all ions in the crystal (in the 1D case, $\mathbf{n}$ is
just the scalar quantity $n$, but in 2D we have $\mathbf{n} = (n_x, n_y)$ and similarly in
3D). Note that, because $\phi^{\text{short}}(r)$ decays rapidly with $r$, $\phi^{\text{local}}$ only receives
noticeable contributions from ions in the immediate vicinity of the evaluation
point. On the other hand, because $\phi^{\text{long}}(r)$ is small for small $r$, $\phi^{\text{distant}}$ excludes
the contributions of nearby ions and only receives significant contributions from
distant ions. These two properties make $\phi^{\text{local}}$ and $\phi^{\text{distant}}$ rapidly convergent
in real space and in Fourier space, respectively.
Figure 2: (a): The window function $W(r)$ and its complement $1 - W(r)$. (b):
The bare Coulomb potential $\phi^{\text{Coulomb}}$ and its decomposition into short-ranged
and long-ranged contributions $\phi^{\text{short}}$ and $\phi^{\text{long}}$.
2. Evaluation of the local term in real space. Because $\phi^{\text{short}}(r)$ decays
rapidly with $r$, the sum over ions in $\phi^{\text{local}}$ converges quickly: we only need to
sum a few terms to get a highly accurate representation of the sum. Thus we
simply evaluate this sum as is.
3. Evaluation of the distant term in Fourier space. On the other hand,
the sum defining $\phi^{\text{distant}}$ is slowly convergent in real space. To improve this
situation, we compute its Fourier transform and evaluate the sum using the
Poisson summation formula, which re-expresses
$$\phi^{\text{distant}}(\mathbf{r}) = \sum_{\mathbf{n}}\phi^{\text{long}}(\mathbf{r} - \mathbf{r}_{\mathbf{n}})$$
as a sum over samples of the Fourier transform $\tilde\phi^{\text{long}}$.
As discussed in the previous section, the Ewald technique splits the Coulomb
potential into short-ranged and long-ranged components by introducing a window function $W(r)$ which is 1 for small $r$ and falls rapidly to zero for large $r$. In
principle, there are many different choices of $W(r)$ that could be used; in practice, the particular choice of window function $W(r)$ that people use for Ewald
summation is called the complementary error function. In this section we will
define this function and use it to compute the Fourier transform of $\phi^{\text{long}}$.
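2.1
$$\frac{1}{\sqrt{\pi}}\int_{-\infty}^{\infty} e^{-t^2}\,dt = \frac{2}{\sqrt{\pi}}\int_0^{\infty} e^{-t^2}\,dt = 1.$$
If we truncate the upper limit of this integral at some finite value $x$, we obtain
a number between 0 and 1 known as the error function $\text{erf}(x)$:
$$\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^{x} e^{-t^2}\,dt$$
This function is 0 at $x = 0$ and rises rapidly to 1 as $x \to \infty$. If we instead want
a function that is 1 at $x = 0$ and falls rapidly to 0 as $x \to \infty$, we simply take
$1 - \text{erf}(x)$; this function is known as the complementary error function $\text{erfc}(x)$:
$$\text{erfc}(x) = 1 - \text{erf}(x) \qquad (5)$$
$$= \frac{2}{\sqrt{\pi}}\int_x^{\infty} e^{-t^2}\,dt. \qquad (6)$$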
Another way to write erf and erfc is to change variables in the integral to $u = t/x$,
which yields
$$\text{erf}(x) = \frac{2x}{\sqrt{\pi}}\int_0^{1} e^{-x^2u^2}\,du, \qquad
\text{erfc}(x) = \frac{2x}{\sqrt{\pi}}\int_1^{\infty} e^{-x^2u^2}\,du. \qquad (7)$$
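The first form in (7) has a fixed, finite integration range, which makes it easy to check numerically. The sketch below uses a hand-written composite Simpson's rule (the helper `simpson` and the name `erf_via_7` are hypothetical, introduced only for this illustration) and needs nothing beyond Base Julia.

```julia
# Composite Simpson's rule on [a, b] with n (even) subintervals.
function simpson(f, a, b, n)
    @assert iseven(n)
    h = (b - a) / n
    s = f(a) + f(b)
    for i in 1:n-1
        s += (isodd(i) ? 4 : 2) * f(a + i * h)
    end
    return s * h / 3
end

# erf(x) from the u-integral in equation (7).
erf_via_7(x) = 2x / sqrt(pi) * simpson(u -> exp(-x^2 * u^2), 0.0, 1.0, 200)
```

The complementary error function then follows from (5) as `1 - erf_via_7(x)`, although for large `x` that subtraction loses accuracy and a library routine would be the better choice in practice.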
In Ewald summation we use the complementary error function as our window
function:
W (r) = erfc(r).
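2.2
$$P_y(x) \equiv \phi^{\text{long}}\big(\sqrt{x^2+y^2}\big) = \frac{\text{erf}\big(\sqrt{x^2+y^2}\big)}{\sqrt{x^2+y^2}}
= \frac{2}{\sqrt{\pi}}\int_0^{1} e^{-(x^2+y^2)u^2}\,du.$$
Note that we are here thinking of $P_y$ as a function of the single variable $x$, with
the value of $y$ entering as a parameter.
The Fourier transform of $P_y(x)$ is
$$\begin{aligned}
\tilde P_y(k) &= \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ikx} P_y(x)\,dx \\
&= \frac{1}{\pi^{3/2}}\int_{-\infty}^{\infty}\int_0^{1} e^{-y^2u^2}\, e^{-x^2u^2 - ikx}\,du\,dx \\
&= \frac{1}{\pi^{3/2}}\int_0^{1} e^{-y^2u^2 - \frac{k^2}{4u^2}}
\underbrace{\int_{-\infty}^{\infty} e^{-u^2\left(x + \frac{ik}{2u^2}\right)^2}\,dx}_{\sqrt{\pi}/u}\,du
\end{aligned}$$
$$\tilde P_y(k) = \frac{1}{\pi}\int_0^{1}\frac{1}{u}\, e^{-\frac{k^2}{4u^2} - y^2u^2}\,du. \qquad (8)$$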
Figure 4: The functions $\tilde P_1(k)$ and $\tilde P_{10}(k)$ plotted on (a) linear and (b) logarithmic scales. The important point is that these functions decay extremely
rapidly for large $k$, which means their Poisson summation is rapidly convergent.
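3 Ewald summation in 1D
In this section we flesh out the details of the Ewald summation procedure for a
one-dimensional chain of ions. The basic setup was outlined in the first section:
we split the total potential into contributions from local and distant ions.
$$\begin{aligned}
\phi(x, y) &= \phi^{\text{local}}(x, y) + \phi^{\text{distant}}(x, y) \\
\phi^{\text{local}}(x, y) &= \sum_{n\in\mathbb{Z}} (-1)^n\, \phi^{\text{short}}\Big(\sqrt{(x-n)^2 + y^2}\Big), \qquad (9) \\
\phi^{\text{distant}}(x, y) &= \sum_{n\in\mathbb{Z}} (-1)^n\, \phi^{\text{long}}\Big(\sqrt{(x-n)^2 + y^2}\Big).
\end{aligned}$$
$$\phi^{\text{short}}(r) = \frac{\text{erfc}(r)}{r}, \qquad \phi^{\text{long}}(r) = \frac{\text{erf}(r)}{r}.$$
We now separately consider the evaluation of each of these sums.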
3.1 Evaluation of $\phi^{\text{local}}$
Evaluating the first term in (9) is easy. It is done by the PhiLocal function in
the julia code included at the end of these notes. For typical values of $x, y$ the
sum converges to 10 decimal places after summing only 6 or 8 summands. So
this term requires no more work.
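3.2 Evaluation of $\phi^{\text{distant}}$
To evaluate the second term in (9), we begin by separating the sum into the
contributions of positive and negative ions:$^4$
$$\begin{aligned}
\phi^{\text{distant}}(x, y) &= \sum_{n\in\mathbb{Z}} (-1)^n\, \phi^{\text{long}}\Big(\sqrt{(x-n)^2 + y^2}\Big) \\
&= \sum_{n\in\mathbb{Z}} \phi^{\text{long}}\Big(\sqrt{(x-2n)^2 + y^2}\Big)
 - \sum_{n\in\mathbb{Z}} \phi^{\text{long}}\Big(\sqrt{(x-(2n+1))^2 + y^2}\Big) \qquad (10)
\end{aligned}$$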
Let's think of the two different summands here as two different functions of $n$:
$$f_+(n) \equiv \phi^{\text{long}}\Big(\sqrt{(x-2n)^2 + y^2}\Big), \qquad
f_-(n) \equiv \phi^{\text{long}}\Big(\sqrt{(x-(2n+1))^2 + y^2}\Big) \qquad (11)$$
4 This step is necessary if we want to make use of the Fourier transform of $\phi^{\text{long}}$ that
we computed in the previous section. An alternative way to do this calculation would
be directly to compute the Fourier transform of the sign-alternating function $f(n) =
(-1)^n \phi^{\text{long}}\big(\sqrt{(x-n)^2 + y^2}\big)$. You can check that such a calculation reproduces the results
derived in the text.
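$$\phi^{\text{distant}}(x, y) = \sum_{n\in\mathbb{Z}} f_+(n) - \sum_{n\in\mathbb{Z}} f_-(n) \qquad (12)$$
$$= 2\pi\sum_{m\in\mathbb{Z}}\Big[\tilde f_+(2\pi m) - \tilde f_-(2\pi m)\Big] \qquad (13)$$
where $\tilde f_+(\nu), \tilde f_-(\nu)$ are the Fourier transforms of (11) with respect to the variable
$n$, which we think of as a continuous variable for these purposes.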
Our next task is to compute these Fourier transforms, which involves the
function $\tilde P_y(k)$ that we computed in the previous section, together with some
simple manipulations involving the properties of Fourier transforms.
Fourier transforms of $f_+(n)$ and $f_-(n)$
In the previous section we worked out the Fourier transform of the function
$P_y(x) \equiv \phi^{\text{long}}\big(\sqrt{x^2+y^2}\big)$:
$$P_y(x) = \int_{-\infty}^{\infty}\tilde P_y(k)\, e^{ikx}\,dk$$
where $\tilde P_y(k)$ was defined by (8). Using this, we can write
$$f_+(n) = P_y(x - 2n) = \int_{-\infty}^{\infty}\tilde P_y(k)\, e^{ik(x-2n)}\,dk$$
To make this look like the Fourier synthesis of a function of the continuous
variable $n$, we change variables to $\nu = -2k$ and (using the fact that $\tilde P_y$ is even) rewrite it like this:
$$= \int_{-\infty}^{\infty}\underbrace{\Big[\frac{1}{2}\, e^{-i\nu x/2}\,\tilde P_y\Big(\frac{\nu}{2}\Big)\Big]}_{\tilde f_+(\nu)}\, e^{i\nu n}\,d\nu$$
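$$\phi^{\text{distant}}(x, y) = 2\pi\sum_{m\in\mathbb{Z}} e^{-im\pi x}\Big\{\frac{1}{2}\Big[1 - e^{im\pi}\Big]\Big\}\tilde P_y(m\pi)$$
The factor in curly braces here vanishes for even $m$ and yields 1 for odd $m$.$^5$
Using this fact and the fact that $\tilde P_y(k)$ is an even function of $k$, we obtain the
final form of $\phi^{\text{distant}}$:
$$\phi^{\text{distant}}(x, y) = 4\pi\sum_{\substack{m=1 \\ m\ \text{odd}}}^{\infty}\cos(m\pi x)\,\tilde P_y(m\pi). \qquad (16)$$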
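Equations (8) and (16) together can be evaluated with nothing beyond Base Julia. The sketch below is an independent illustration rather than the listing at the end of these notes: `simpson`, `tilde_P`, and `phi_distant` are hypothetical names, the Simpson helper stands in for whatever quadrature routine is available, and the cutoff `mmax` is an arbitrary choice.

```julia
# Composite Simpson's rule on [a, b] with n (even) subintervals.
function simpson(f, a, b, n)
    @assert iseven(n)
    h = (b - a) / n
    s = f(a) + f(b)
    for i in 1:n-1
        s += (isodd(i) ? 4 : 2) * f(a + i * h)
    end
    return s * h / 3
end

# Equation (8); the integrand tends to 0 as u -> 0, so u = 0 is handled explicitly.
tilde_P(k, y) = simpson(u -> u == 0 ? 0.0 : exp(-k^2 / (4u^2) - y^2 * u^2) / u,
                        0.0, 1.0, 1000) / pi

# Equation (16), truncated at odd m <= mmax.
function phi_distant(x, y; mmax=21)
    total = 0.0
    for m in 1:2:mmax
        total += 4pi * cos(m * pi * x) * tilde_P(m * pi, y)
    end
    return total
end
```

For moderate $y$ the $m = 1$ term dominates: the factor $e^{-k^2/(4u^2)}$ with $k = m\pi$ suppresses the $m = 3$ term by many orders of magnitude, so a small `mmax` is usually enough.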
3.3
Let's use Ewald summation to evaluate $\phi(x, y)$ for a couple of different evaluation points.
First consider the point $(x = 0.25, y = 0.25)$. For this evaluation point,
the brute-force summation of equation (1) requires around 800,000 terms to
converge to a relative tolerance of $10^{-6}$:
Convergence of $\phi(x, y)$ (brute-force summation) for $(x, y) = (0.25, 0.25)$
n         nth term                    sum after n terms
1                                     0.7790515201261021
2                                     1.7864630493747262
3                                     1.117534069665704
4                                     1.618498199566681
5                                     1.2180022694138286
799998    +2.500006250015747e-6       1.3985540298633128
799999    -2.500003125004028e-6       1.3985515298601878
800000    +2.5000000000001218e-6      1.3985540298601877
$\phi^{\text{local}}$ = 1.3635461651727092
$\phi^{\text{distant}}$ = 0.0350066146883204
$\phi$ = 1.3985527798610296
$\phi^{\text{local}}$ = 0.0006953085865214333
$\phi^{\text{distant}}$ = 0.0018971781988659407
$\phi$ = 0.0025924867853873740
In this case the contribution of the distant ions is about $3\times$ the contribution
of the local ions, and clearly both terms are necessary to get even the first
correct digit of the total potential.
This code includes routines PhiLocal and PhiDistant for summing the short- and long-ranged contributions to the potential, as well as PhiBF for brute-force
evaluation of the original sum (1).
#
# compute PhiLocal to a relative error tolerance of RelTol.
# (in current Julia versions erfc is provided by the SpecialFunctions package)
#
using SpecialFunctions
function PhiLocal(x,y,RelTol)
  y2=y^2
  # n=0 term
  r=sqrt(x^2 + y2)
  Sum=erfc(r)/r;
  # the nth loop iteration adds the contributions of
  # the positive ions at r=\pm 2n and the negative ions
  # at r=\pm (2n-1)
  ConvergedIters=0;
  for n=1:100000
    # negative ions at odd sites
    tn=2*n-1;
    rp=sqrt( (x+tn)^2 + y2)
    rm=sqrt( (x-tn)^2 + y2)
    Delta1 = -(erfc(rp)/rp + erfc(rm)/rm)
    Sum += Delta1;
    println(tn," ",Sum," ",Delta1)
    # positive ions at even sites
    tn=2*n;
    rp=sqrt( (x+tn)^2 + y2)
    rm=sqrt( (x-tn)^2 + y2)
    Delta2 = erfc(rp)/rp + erfc(rm)/rm
    Sum += Delta2;
    println(tn," ",Sum," ",Delta2)
    # require two consecutive small updates before declaring convergence
    if ( abs(Delta1+Delta2) < RelTol*abs(Sum) )
      ConvergedIters+=1;
    else
      ConvergedIters=0;
    end
    if ConvergedIters==2
      break
    end
  end
  Sum
end
#
# compute PhiDistant to a relative error tolerance of RelTol
#
function PhiDistant(x,y,RelTol)
ConvergedIters=0;
Sum=0.0;
for m=1:2:10000
Delta=4*pi*cos(pi*m*x) * TildePyk(pi*m, y)
Sum+=Delta
println(m," ",Sum," ",Delta)
if ( m>3 && ( abs(Delta) < RelTol*abs(Sum)) )
ConvergedIters+=1;
else
ConvergedIters=0;
end
if ConvergedIters==3
break;
end
end
Sum
end
#
# the function \tilde P_y(k), computed using numerical integration
# via Simpson's rule
#
function TildePyk(k,y)
  SimpRule( u -> (u==0.0 ? 0.0 : exp(-0.25*k*k/(u*u) - y*y*u*u)/u),
            0, 1, 1000
          )/pi
end
#
# brute-force evaluation of full sum up to 2N+1 terms
#
function PhiBF(x,y,N)
y2=y^2
# n=0 term
r=sqrt(x^2 + y2)
Contents
1 The Discrete Fourier Transform
8 FFT Convolution
9 Circulant Matrices
In our discussion of Fourier analysis thus far we have assumed that the function
we are Fourier-analyzing, $f(t)$, exists and is computable for arbitrary values of
the real number $t$. We now take up the question of what happens when we have
only discrete samples of $f$, $f_n \equiv f(n\Delta t)$ for integer $n$ and some sampling period
$\Delta t$.
If the integer $n$ runs from $-\infty$ to $\infty$ (i.e. we have evenly-spaced samples
of $f$ over the entire real line) then the tool we use is the semidiscrete Fourier
transform. In practice, this situation does not arise as often as the case in which
we have a finite number $N$ of samples of $f$, $f_n = f(n\Delta t)$ for $n = 0, 1, \cdots, N-1$.
In this case, the tool we want is the discrete Fourier transform (DFT).
The DFT maps the $N$ numbers $\{f_n\}$ into a set of $N$ Fourier coefficients
$\{\tilde f_\nu\}$:$^1$
$$\{f_n\} \xrightarrow{\text{DFT}} \{\tilde f_\nu\}, \qquad
\tilde f_\nu = \frac{1}{N}\sum_{n=0}^{N-1} f_n\, e^{-2\pi i\frac{n\nu}{N}}. \qquad (1)$$
Note this connection to the usual Fourier transform: If $f(t)$ were a continuous
function on the interval $t \in [0, N]$, then $\tilde f_\nu$ would be the $\nu$th Fourier-series
coefficient of $f(t)$ [that is, the coefficient of the sinusoid $e^{i\nu\omega_0 t}$ in the Fourier
synthesis of $f(t)$, where $\omega_0 = \frac{2\pi}{N}$] evaluated using $N$-point rectangular-rule
quadrature.
Having exchanged our $N$ real-space coefficients $\{f_n\}$ for the $N$ Fourier coefficients $\{\tilde f_\nu\}$, it is easy to go the other way and trade the $\{\tilde f_\nu\}$ back in for the
$\{f_n\}$. This process is the inverse discrete Fourier transform:
$$\{\tilde f_\nu\} \xrightarrow{\text{IDFT}} \{f_n\}, \qquad
f_n = \sum_{\nu=0}^{N-1}\tilde f_\nu\, e^{2\pi i\frac{n\nu}{N}}. \qquad (2)$$
Another way to think about the IDFT is that it is just the Fourier synthesis:
whereas the DFT analyzes the dataset $\{f_n\}$ into constituent sinusoids, the IDFT
reassembles those sinusoids to recover the original data set $\{f_n\}$.
The IDFT is periodic with period $N$. Our original data set contained
only $N$ elements, $\{f_n\}$ for $n = 0, 1, \cdots, N-1$. Given such a set, it is not
meaningful to ask for values of $f_n$ outside the range $0 \le n \le N-1$. However,
the RHS of (2) is perfectly well defined for any $n$, so we might ask: What do
we get if we evaluate (2) for $n$ outside this range? The answer is that we get
back $f_{n \bmod N}$; that is, we get whichever element of the original data set is
1 Here we are thinking of $n$ as a real-space variable and using $\nu$ as its Fourier-conjugate
variable, but in DFT literature it is also common to use symbols like $m$ or even $n'$ as the
Fourier conjugate of $n$.
prefactors of $1/\sqrt{N}$.
Computer implementations of the FFT, including the ones in julia and
matlab, actually tend to use convention (a). That means that when you use
those systems to compute the DFT of a set of numbers $\{f_n\}$, the numbers you
get back are actually $N$ times the quantities $\{\tilde f_\nu\}$ defined by my equation (1).
A good way to think of the DFT and the IDFT is that together they constitute
a technique for constructing a smooth interpolating function $f(t)$ guaranteed to
pass through all the data samples $f_n$. Indeed, if in equation (2) we replace the
discrete index $n$ with a continuous variable $t$, we obtain a function
$$f(t) = \sum_{\nu=0}^{N-1}\tilde f_\nu\, e^{2\pi i\frac{\nu t}{N}}
= \sum_{\nu=0}^{N-1}\tilde f_\nu\, e^{i\nu\omega_0 t}, \qquad \omega_0 = \frac{2\pi}{N}. \qquad (3)$$
An immediate example
Let's do a quick example. (The julia source for this example is listed below.)
Consider the following (randomly generated) set of 9 numbers.
f0 = 0.823648    f1 = 0.910357    f2 = 0.164566
f3 = 0.177329    f4 = 0.278880    f5 = 0.203477
f6 = 0.042301    f7 = 0.068269    f8 = 0.361828
[Figure: the data points $f_n$ plotted against $n$.]
The discrete Fourier transform of the set of numbers $\{f_n\}$, using the normalization convention (1), is the following set of complex numbers $\{\tilde f_\nu\}$:
You can easily verify at home that summing the numbers $\tilde f_\nu$ in this table,
weighted by sinusoids $e^{2\pi i n\nu/N}$ (where $N = 9$ in this case), recovers the numbers
in the previous table, i.e. we have
$$f_n = \sum_{\nu=0}^{8}\tilde f_\nu\, e^{i\nu\omega_0 n}, \qquad \omega_0 = \frac{2\pi}{N}. \qquad (4)$$
$$f(t) = \sum_{\nu=0}^{8}\tilde f_\nu\, e^{i\nu\omega_0 t} \qquad (5)$$
This defines a continuous function of $t$ with the property that $f(n) = f_n$, i.e.
$f(t)$ is guaranteed to pass through the points in the table above whenever $t$
passes through an integer value. This is illustrated in the following figure.$^2$
[Figure: the interpolant (5) plotted against $t$, together with the data points.]
2 In this figure and the following two figures, we are plotting just the real part of the
interpolant. The imaginary part also exists and is a wiggly function like the real part; at
integer values of $t$ it agrees with the original data (that is, it vanishes, since the original data
were real-valued) but is nonzero elsewhere. The exception is the minimal-variation interpolant
defined below, which is a purely real-valued function (its imaginary part vanishes for all $t$).
and thus we can multiply each term in (4) by $e^{ipN\omega_0 n}$ with impunity; this simply
corresponds to shifting $\nu \to \nu + pN$ in the exponent. For example, we could
modify (4) by writing
$$f_n = \sum_{\nu}\tilde f_\nu\, e^{i(\nu+N)\omega_0 n} \qquad (6a)$$
or
$$f_n = \sum_{\nu}\tilde f_\nu\, e^{i(\nu+2N)\omega_0 n} \qquad (6b)$$
or
$$f_n = \sum_{\nu}\tilde f_\nu\, e^{i(\nu-47N)\omega_0 n} \qquad (6c)$$
and the equations remain valid, i.e. summing up all 9 terms on the RHS will
exactly recover the quantity on the LHS, as you can readily verify at home. More
generally, we can even shift the frequencies of different sinusoids by different
integer multiples of $N$, i.e. we can write
$$f_n = \sum_{\nu}\tilde f_\nu\, e^{i(\nu+p_\nu N)\omega_0 n} \qquad (7)$$
where $\{p_\nu\}$ is any set of $N$ integers. Every sum of the form (7), including all the
examples on the RHS of (6), reproduces the original data $\{f_n\}$ when evaluated
at integer values of $n$.
But their respective continuations to continuous variables $t$ define very different functions. For example, here's the function defined by continuing (6a)
from $n \to t$:
[Figure: the continuation $\sum_{\nu=0}^{N-1}\tilde f_\nu\, e^{i(\nu+N)\omega_0 t}$ plotted against $t$.]
[Figure: the continuation $\sum_{\nu}\tilde f_\nu\, e^{i(\nu+2N)\omega_0 t}$ plotted against $t$.]
Note that, in every case, the continued function f (t) is guaranteed to run
exactly through our prescribed data points at integer values of t; however, the
behavior of the function in between those points is increasingly erratic as we
include higher and higher frequencies.
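minimizes the mean-square variation, which is a measure of how much the function wiggles over one full period:
$$\text{mean-square variation of } f(t) \equiv \frac{1}{N}\int_0^{N}\big|f'(t)\big|^2\,dt.$$
$N = 9$ the term e.g. $\tilde f_7 e^{7i\omega_0 t}$ is replaced by $\tilde f_7 e^{-2i\omega_0 t}$. (The $-2$ here comes from
$7 - 9 = -2$.) The full minimal-variation interpolant in the example above is
$$\begin{aligned}
f^{\text{min var}}(t) = \tilde f_0 &+ \tilde f_1 e^{i\omega_0 t} + \tilde f_2 e^{2i\omega_0 t} + \tilde f_3 e^{3i\omega_0 t} + \tilde f_4 e^{4i\omega_0 t} \\
&+ \tilde f_5 e^{-4i\omega_0 t} + \tilde f_6 e^{-3i\omega_0 t} + \tilde f_7 e^{-2i\omega_0 t} + \tilde f_8 e^{-i\omega_0 t}.
\end{aligned}$$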
This is plotted below. Note that, compared to the interpolants we plotted above,
this function wiggles much less between the data points; intuitively it is clearly
the right function if we want a smooth interpolant through our data points.
Figure 5: The minimal-variation interpolant for our original data set $\{f_n\}$.
Here's a modified version of the julia function fC from the above snippet
that computes the minimum-variation interpolant:
function fMinVar(t)
  Sum=0.0 + 0.0im;
  # frequencies 0 .. N/2 (rounded down) are used as is
  for nu=0:div(N,2)
    Sum+=tildef[nu+1]*exp(2*pi*im*nu*t/N);
  end
  # higher frequencies are shifted down by N
  for nu=div(N,2)+1 : N-1
    Sum+=tildef[nu+1]*exp(2*pi*im*(nu-N)*t/N);
  end
  Sum
end
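A self-contained version of this idea, runnable without any packages, might look as follows. It is a sketch under stated assumptions: `dft` is a naive $O(N^2)$ implementation of the normalization in equation (1), `minvar_interp` is a hypothetical name, and `data` holds the nine sample values from the example as they appear in the table above.

```julia
# Naive DFT with the 1/N normalization of equation (1).
function dft(f)
    N = length(f)
    return [sum(f[n+1] * exp(-2pi * im * n * nu / N) for n in 0:N-1) / N
            for nu in 0:N-1]
end

# Minimal-variation interpolant: frequencies above N/2 are shifted down by N.
function minvar_interp(ftilde, t)
    N = length(ftilde)
    s = 0.0 + 0.0im
    for nu in 0:N-1
        k = nu <= div(N, 2) ? nu : nu - N
        s += ftilde[nu+1] * exp(2pi * im * k * t / N)
    end
    return s
end

data = [0.823648, 0.910357, 0.164566, 0.177329, 0.278880,
        0.203477, 0.042301, 0.068269, 0.361828]
ft = dft(data)
```

Evaluating `minvar_interp(ft, n)` at integer `n` should reproduce `data[n+1]`, and for real data with odd $N$ (as here) the imaginary part should vanish at non-integer `t` as well, up to roundoff.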
Another way to think about what the DFT is doing is that it is computing a
rectangular-rule approximation to the integrals that define the Fourier-series coefficients of some function $f(t)$. This interpretation is somewhat complementary
to the previous one: in the previous section we were talking about constructing
a continuous function from a set of discrete samples, and now we're talking
about sampling a continuous function to obtain a set of discrete samples.
In this interpretation, we consider a function $f(t)$ defined over an interval
$T$, with a Fourier-series representation of the form
$$f(t) = \sum_{\nu=-\infty}^{\infty}\tilde f_\nu\, e^{i\nu\omega_0 t}, \qquad
\tilde f_\nu = \frac{1}{T}\int_0^{T} f(t)\, e^{-i\nu\omega_0 t}\,dt$$
where $\omega_0 = \frac{2\pi}{T}$. Now approximate the integral here using the simple $N$-point
rectangular-rule quadrature scheme, which samples the integrand at points $t =
n\Delta$, $n = 0, \cdots, N-1$ with $\Delta = \frac{T}{N}$:
$$\tilde f_\nu \approx \frac{\Delta}{T}\sum_{n=0}^{N-1} f(n\Delta)\, e^{-i\nu\omega_0 n\Delta}
= \frac{1}{N}\sum_{n=0}^{N-1} f_n\, e^{-\frac{2\pi i}{N}n\nu}. \qquad (9)$$
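The rectangular-rule reading of (9) is easy to test on a function whose Fourier-series coefficients are known in closed form. The sketch below (the name `fourier_coeff_rect` is hypothetical) applies the first form of (9) directly to a function handle.

```julia
# N-point rectangular-rule estimate of the nu-th Fourier-series coefficient of f on [0, T].
function fourier_coeff_rect(f, T, nu, N)
    delta = T / N
    omega0 = 2pi / T
    return sum(f(n * delta) * exp(-im * nu * omega0 * n * delta) for n in 0:N-1) * delta / T
end

# Example: f(t) = 3 cos(omega0 t) has coefficients 3/2 at nu = +1 and nu = -1, zero otherwise.
T = 2.0
f(t) = 3 * cos(2pi * t / T)
c1 = fourier_coeff_rect(f, T, 1, 16)
c2 = fourier_coeff_rect(f, T, 2, 16)
```

For a band-limited periodic function like this one the rectangular rule is exact up to roundoff as long as $N$ exceeds the highest frequency present plus $|\nu|$; otherwise aliasing appears.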
can write down to smoothly connect the dots in a given data set. The long
story short is ...
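$$\mathbf{Z} = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{pmatrix}$$
and, as is true for any element of an $N$-dimensional vector space, we may write
a general element $\mathbf{Z}$ as a weighted sum of basis vectors:
$$\mathbf{Z} = z_1\mathbf{e}_1 + z_2\mathbf{e}_2 + \cdots + z_N\mathbf{e}_N \qquad (10)$$
$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
\mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \qquad \cdots \qquad
\mathbf{e}_N = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$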
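$$\mathbf{v}^2 = \frac{1}{\sqrt{N}}\begin{pmatrix} 1 \\ \omega \\ \omega^2 \\ \omega^3 \\ \vdots \\ \omega^{N-1} \end{pmatrix}, \qquad
\mathbf{v}^3 = \frac{1}{\sqrt{N}}\begin{pmatrix} 1 \\ \omega^{2} \\ \omega^{2\cdot 2} \\ \omega^{3\cdot 2} \\ \vdots \\ \omega^{(N-1)\cdot 2} \end{pmatrix}$$
and, more generally, the $n$th component of the $p$th basis vector in this set is
defined to have components $v_n^p = \frac{1}{\sqrt{N}}\,\omega^{(n-1)(p-1)} = \frac{1}{\sqrt{N}}\, e^{2\pi i(n-1)(p-1)/N}$:
$$\mathbf{v}^p = \frac{1}{\sqrt{N}}\begin{pmatrix} 1 \\ \omega^{(p-1)} \\ \omega^{2(p-1)} \\ \omega^{3(p-1)} \\ \vdots \\ \omega^{(N-1)(p-1)} \end{pmatrix}$$
$$\mathbf{Z} = \tilde z_1\mathbf{v}^1 + \tilde z_2\mathbf{v}^2 + \cdots + \tilde z_N\mathbf{v}^N \qquad (12)$$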
and it turns out that the $\tilde z_\nu$ coefficients in this expansion are related to the $z_n$
coefficients in (10) by nothing other than the discrete Fourier transform:
$$\{z_n\} \xrightarrow{\text{DFT}} \Big\{\frac{1}{\sqrt{N}}\,\tilde z_\nu\Big\}$$
(The $\sqrt{N}$ factor here just corrects for the slightly different normalization we
used previously: what this equation means is that if you perform the DFT on
the $\{z_n\}$ coefficients in (10), you will get precisely the $\{\tilde z_\nu\}$ coefficients in (12),
but divided by an extra factor of $\sqrt{N}$.)
The interpretation of the DFT as a change of basis really hammers home
the losslessness property of Fourier analysis. Changing bases in a vector space
changes the coordinates of points but doesn't change the points; we lose zero
information by switching to a different basis.
$$\{f_n\} \xrightarrow{\text{DFT}} \{\tilde f_\nu\}, \qquad
\tilde f_\nu = \frac{1}{N}\sum_{n=0}^{N-1} f_n\, e^{-2\pi i\frac{n\nu}{N}}. \qquad (13)$$
The set $\{\tilde f_\nu\}$ contains $N$ numbers, and to calculate each one of them using
(13) we have to do a sum involving $N$ summands. Thus the cost of the whole
operation is going to scale like $N^2$:
computational cost of naive $N$-point DFT $\propto N^2$.
So, if it takes our computer 1 second to do a 1000-point DFT, it will take
In practice this means that we would be limited to running small DFT calculations.
F = collect(1.0:10.0); # or any other set of 10 numbers
tildeF = fft(F);       # in current Julia versions fft is provided by the FFTW package
Due to the slightly different normalizations, the $\nu$th entry of the tildeF vector
here will be equal to $N\tilde f_\nu$, where $\tilde f_\nu$ is defined by (1).
FFT Convolution
In science and engineering problems we often need to compute discrete convolutions of the form
$$F_n = \sum_{m} S_m K_{n-m} \qquad (14)$$
where we have excluded the self-term $m = n$ from the sum (because we don't
want to count the contribution of an ion to the potential it feels itself). This
is a discrete convolution of the form (14) with source function $S_m = Q(m)$ and
kernel function
$$K_n = \begin{cases} 0, & n = 0 \\ \frac{1}{|n|D}, & n \ne 0. \end{cases}$$
Arbitrary-precision arithmetic
As we discussed in our unit on floating-point arithmetic, each individual number
stored in a computer has a finite number of digits. What if we need to do
arithmetic on numbers with thousands or millions of digits? In this case we
choose a base $b$ and represent arbitrary-precision numbers in the form
$$x = \sum_{n=0}^{N_x} x_n b^n, \qquad y = \sum_{n=0}^{N_y} y_n b^n$$
$$x_3 = 9, \quad x_2 = 4, \quad x_1 = 1, \quad x_0 = 5, \qquad y_2 = 8, \quad y_1 = 2, \quad y_0 = 6.$$
(Of course on a typical computer we wouldn't need arbitrary-precision arithmetic to multiply these two numbers, but it illustrates the point.)
The product of $x$ and $y$ is
$$xy = z = \sum_{n=0}^{N_z} z_n b^n, \qquad z_n = \sum_{m} x_m y_{n-m}.$$
This is a discrete convolution of precisely the form (14), except that we must
pack the digits $x_n, y_n$ into data vectors in such a way that $x_n = y_n = 0$ for
negative values of $n$. In practice, this corresponds to using data vectors that
are two times longer than necessary to store the actual digits of $x$ and $y$, then
zero-padding to ensure the second half of the data vector is all zeros.
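The digit convolution can be written out directly for the example digits above. This sketch uses a plain $O(N^2)$ double loop rather than an FFT, and adds a carry-propagation pass that reduces each $z_n$ back to a single base-$b$ digit; the name `multiply_digits` and the little-endian digit ordering are choices made here for illustration.

```julia
# Multiply two numbers given as little-endian digit vectors (x[1] is the units digit).
function multiply_digits(x::Vector{Int}, y::Vector{Int}; base::Int=10)
    z = zeros(Int, length(x) + length(y))
    # discrete convolution: z_n = sum_m x_m * y_(n-m)
    for n in eachindex(x), m in eachindex(y)
        z[n + m - 1] += x[n] * y[m]
    end
    # carry propagation so that every entry is a valid digit
    carry = 0
    for k in eachindex(z)
        total = z[k] + carry
        z[k] = mod(total, base)
        carry = div(total, base)
    end
    # drop leading zeros (stored at the end of the little-endian vector)
    while length(z) > 1 && z[end] == 0
        pop!(z)
    end
    return z
end
```

With the digits of the example, `multiply_digits([5, 1, 4, 9], [6, 2, 8])` multiplies 9415 by 826. Replacing the double loop by an FFT-based convolution is what brings the cost down from $N^2$ to $N \log N$ for very long numbers.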
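$$S_m = \sum_{\nu=0}^{N-1}\tilde S_\nu\, e^{2\pi i m\nu/N}, \qquad
K_n = \sum_{\nu=0}^{N-1}\tilde K_\nu\, e^{2\pi i n\nu/N}$$
Using this last equation, the quantity that enters the sum (14) may be written
$$K_{n-m} = \sum_{\nu=0}^{N-1}\tilde K_\nu\, e^{2\pi i(n-m)\nu/N}.$$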
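$$\begin{aligned}
F_n &= \sum_{m=0}^{N-1} S_m K_{n-m} \\
&= \sum_{m=0}^{N-1}\Big\{\sum_{\nu=0}^{N-1}\tilde S_\nu\, e^{2\pi i m\nu/N}\Big\}
\Big\{\sum_{\nu'=0}^{N-1}\tilde K_{\nu'}\, e^{2\pi i(n-m)\nu'/N}\Big\} \\
&= \sum_{\nu=0}^{N-1}\sum_{\nu'=0}^{N-1}\tilde S_\nu\tilde K_{\nu'}\, e^{2\pi i n\nu'/N}
\underbrace{\Big\{\sum_{m=0}^{N-1} e^{2\pi i m(\nu-\nu')/N}\Big\}}_{N\delta_{\nu,\nu'}}
\end{aligned}$$
The sum in curly braces here is something we have seen many times: it vanishes
if $\nu \ne \nu'$ and evaluates to $N$ if $\nu = \nu'$. (You can remind yourself how this works
by thinking of the sum as a geometric series in the variable $z = e^{2\pi i(\nu-\nu')/N}$.)
The double frequency sums then collapse to a single sum:
$$F_n = N\sum_{\nu=0}^{N-1}\tilde S_\nu\tilde K_\nu\, e^{2\pi i n\nu/N}$$
1. Compute the DFTs of the source and kernel sequences:
$$\{S_n\} \xrightarrow{\text{DFT}} \{\tilde S_\nu\}, \qquad \{K_n\} \xrightarrow{\text{DFT}} \{\tilde K_\nu\}$$
2. Multiply (componentwise) the sequences $\{\tilde S_\nu\}$ and $\{\tilde K_\nu\}$ to obtain the
discrete Fourier transform of $\{F_n\}$:
$$\tilde F_\nu = N\tilde S_\nu\tilde K_\nu, \qquad \nu = 0, 1, \cdots, N-1$$
3. Compute the inverse DFT to recover the convolution:
$$\{\tilde F_\nu\} \xrightarrow{\text{IDFT}} \{F_n\}.$$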
(The inverse DFT is evaluated in high-level languages like julia via the function
ifft, which behaves similarly to the fft function.)
Using FFT techniques, steps (1) and (3) here have cost scaling like $N \log N$,
while step (2) has cost scaling like $N$, so the overall algorithm has cost scaling
like $N \log N$, as compared to the $N^2$ scaling of a brute-force evaluation of the
sum (14).
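The three steps can be checked against a direct evaluation of the sum. The sketch below deliberately uses naive $O(N^2)$ transforms (`dft` and `idft`, written with the $1/N$ normalization used in these notes) so that it runs without any packages; with an FFT library the same structure applies, but the factor of $N$ in step 2 would change according to that library's normalization. The convolution here is circular, i.e. the index $n - m$ is taken mod $N$, and all function names are hypothetical.

```julia
# Naive DFT and inverse DFT with the 1/N normalization on the forward transform.
function dft(f)
    N = length(f)
    return [sum(f[n+1] * exp(-2pi * im * n * nu / N) for n in 0:N-1) / N for nu in 0:N-1]
end
function idft(ft)
    N = length(ft)
    return [sum(ft[nu+1] * exp(2pi * im * n * nu / N) for nu in 0:N-1) for n in 0:N-1]
end

# Direct circular convolution: F_n = sum_m S_m K_((n-m) mod N).
function circ_conv_direct(S, K)
    N = length(S)
    return [sum(S[m+1] * K[mod(n - m, N) + 1] for m in 0:N-1) for n in 0:N-1]
end

# Steps 1-3: transform, multiply componentwise (with the factor N), transform back.
function circ_conv_dft(S, K)
    N = length(S)
    return idft(N .* dft(S) .* dft(K))
end
```

For real inputs the result of `circ_conv_dft` is complex with a negligible imaginary part, so taking `real.(...)` afterwards is the usual final step.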
Periodic source distribution, infiniterange kernel
Finite source distribution
Zero padding
Circulant Matrices
Contents
1 Fourier Analysis
5 Fourier series
7 Poisson summation
A Exponential Sums
B Gaussian Integrals
Fourier Analysis
Recall that the verb "analyze" means "to decompose into constituent pieces."
Fourier analysis is the process of decomposing functions into constituent pieces
which vary at definite rates, that is, into sinusoids.
Some functions are easy to Fourier-analyze. For example,
f (t) = 3 cos 2t + 19 sin 4t 0.14 cos 7t
+19 a sinusoid with angular frequency 4
=
                                          All real values of t     Discrete samples f_n ≡ f(nΔt), n ∈ Z
Infinite domain, −∞ < t < ∞               Fourier transform        Semidiscrete Fourier transform
Finite domain, −T/2 < t < T/2             Fourier series           Discrete Fourier transform
Table 1: The fourfold way: The name of the process used to Fourier-analyze a
function $f(t)$ depends on whether (a) $f(t)$ is defined for all time or only within
a finite window, and (b) whether we have access to $f(t)$ for all real values of $t$
or just discrete samples at evenly spaced points $n\Delta t$.
1 I borrowed the idea of a "fourfold way" here from Professor Laurent Demanet.
The first entry in the fourfold way is the Fourier transform. This is what we
use when values of the function $f(t)$ we are analyzing are available for arbitrary
points $t$ on the entire real line. In this case, the Fourier transform of $f(t)$ is
defined to be the function$^2$
$$\tilde f(\omega) \equiv \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\omega t} f(t)\,dt. \qquad (1)$$
$\tilde f(\omega)$ is a function of frequency that tells us how strongly the complex exponential
with frequency $\omega$ is represented in $f(t)$.
Using this formula, we can immediately answer one of the questions we
posed at the outset: How do we decompose a function like $E_\alpha(t) \equiv e^{-\alpha|t|}$ into
sinusoids? The answer is to plug $E_\alpha$ into (1):
$$\tilde E_\alpha(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\omega t} e^{-\alpha|t|}\,dt$$
$$= \frac{1}{2\pi}\left[\int_{-\infty}^{0} e^{(-i\omega+\alpha)t}\,dt + \int_{0}^{\infty} e^{(-i\omega-\alpha)t}\,dt\right]$$
$$= \frac{1}{2\pi}\left[\frac{1}{\alpha - i\omega} + \frac{1}{\alpha + i\omega}\right]$$
$$= \frac{\alpha}{\pi(\alpha^2 + \omega^2)}.$$
What this means is this: The function $e^{-\alpha|t|}$ may be reconstructed by summing
sinusoids with all possible frequencies $\omega$. The amplitude of the frequency-$\omega$
sinusoid in this sum decays, for large $\omega$, like $1/\omega^2$. Note that the threshold
defining "large $\omega$" is dependent on $\alpha$: The larger the value of $\alpha$ (i.e. the more
rapidly decaying the original exponential) the more sinusoids we have to add
with appreciable amplitudes to recover $E_\alpha(t)$. This is an example of a general
phenomenon that we will discuss in the next section.
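As a sanity check on the closed form $\alpha/(\pi(\alpha^2+\omega^2))$, one can evaluate the defining integral (1) numerically. The sketch below is a rough illustration, not part of the original notes: it truncates the integral to $[-L, L]$ and applies the trapezoid rule; the function names and the values of L and M are arbitrary choices.

```julia
# Trapezoid-rule approximation of (1/2π) ∫ e^{-iωt} e^{-α|t|} dt on [-L, L].
# L and M are hypothetical truncation / resolution parameters.
function lorentzian_ft(ω, α; L=40.0, M=400_000)
    h = 2L / M
    total = 0.0 + 0.0im
    for j in 0:M
        t = -L + j * h
        w = (j == 0 || j == M) ? 0.5 : 1.0      # trapezoid end-point weights
        total += w * exp(-im * ω * t) * exp(-α * abs(t))
    end
    return total * h / (2π)
end

# Closed form derived in the text.
lorentzian_exact(ω, α) = α / (π * (α^2 + ω^2))

for (ω, α) in ((0.0, 1.0), (2.0, 1.0), (5.0, 3.0))
    num = lorentzian_ft(ω, α)
    println("ω = ", ω, ", α = ", α, ":  numeric = ", real(num),
            "   exact = ", lorentzian_exact(ω, α))
end
```

The imaginary part of the numerical result should be negligible, since $e^{-\alpha|t|}$ is an even function of $t$.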
… is $\omega = 2\pi\nu$. $\nu$ is the frequency with which the underlying process repeats itself,
while $\omega$ is the frequency with which the phase angle in the sinusoid accumulates
one radian worth of phase. (The frequency $\nu$ is sometimes called the linear
frequency to distinguish it from the angular frequency.) Linear frequency is …
2 The theory is nicest if we restrict $f(t)$ to satisfy $\int_{-\infty}^{\infty} |f(t)|\,dt < \infty$, in which case we
say $f$ is contained in the function space $L^1(\mathbb{R})$. Fourier transforms can be defined for more
general functions, and in some places we will do this without being particularly rigorous about
it, but you should be aware that the proper justification of (1) for non-$L^1$ functions involves
a bit of a foray into real analysis.
Terminology of domains
In the above discussion, the function we started out with was $f(t)$, and the
Fourier-transformed function was $\tilde f(\omega)$. In other words, the original independent
variable was the time $t$, and the transformed variable was the frequency $\omega$. We
think of the function $f(t)$ as existing in the time domain, while $\tilde f(\omega)$ exists in
the frequency domain.
But Fourier analysis is also useful in situations where the original independent variable is something other than time. For example, it may be the position
$x$, in which case the Fourier-transformed variable is the spatial frequency $k$. ($k$
in this case is sometimes called the wavenumber.) In this case it would be
a little confusing to say that $f(x)$ exists in the time domain. Instead, various alternative terminologies have arisen for labeling the two spaces in which
functions live before and after Fourier transformation.
Field                 Physical variable   Physical domain   Fourier variable              Fourier domain
Signal processing     time t              time domain       frequency ω                   frequency domain
Solid state physics   lattice vector L    real space        reciprocal lattice vector G   reciprocal space
Optics                position x          real space        wavenumber k                  k space
Quantum mechanics     position x          position space    momentum p                    momentum space
Dimensional analysis
In physics and engineering problems it's important to keep in mind that the
functions $f(t)$ and $\tilde f(\omega)$ have different units. Indeed, looking at (1) we see that
the RHS contains a $dt$ factor that is not present on the LHS, and therefore
$$\text{units of } \tilde f = \text{units of } f \cdot \text{time} = \frac{\text{units of } f}{\text{frequency}}.$$
Thus, if $f(t)$ has units of volts, then $\tilde f(\omega)$ has units of volts·seconds, or volts
per hertz.
3 The term "physical variable" is something of a misnomer since Fourier variables like
frequency and spatial wavenumber are certainly physical quantities, but whaddya gonna do.
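$$\frac{df}{dx} = \int_{-\infty}^{\infty} ik\,\tilde f(k)\,e^{ikx}\,dk$$
The RHS here is the inverse Fourier transform of the Fourier-space function $ik\tilde f(k)$, so we identify this function as the Fourier transform of $\frac{df}{dx}$,
i.e. we conclude that
$$\text{FT}\big[f(x)\big] = \tilde f(k) \quad\Longrightarrow\quad \text{FT}\left[\frac{df}{dx}\right] = ik\,\tilde f(k). \qquad (2)$$
This game can be repeated as many times as we like; for example,
$$\text{FT}\big[f(x)\big] = \tilde f(k) \quad\Longrightarrow\quad \text{FT}\left[\frac{d^2 f}{dx^2}\right] = -k^2\,\tilde f(k). \qquad (3)$$
$$\frac{d\tilde f}{dk} = \frac{1}{2\pi}\int_{-\infty}^{\infty} (-ix)\,f(x)\,e^{-ikx}\,dx.$$
What this tells us is that $\frac{d}{dk}\tilde f$ is something like the Fourier transform of
the function $x f(x)$:
$$\text{FT}\big[f(x)\big] = \tilde f(k) \quad\Longrightarrow\quad \text{FT}\big[x f(x)\big] = i\frac{d}{dk}\tilde f(k). \qquad (4)$$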
e
= f (k).
Lorentzians
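We already saw our first example of a Fourier transform:
$$E_\alpha(x) = e^{-\alpha|x|} \quad\Longrightarrow\quad \tilde E_\alpha(k) = \frac{\alpha}{\pi(\alpha^2 + k^2)}. \qquad (5)$$
$$\text{FWHM}(E_\alpha) = \frac{2\ln 2}{\alpha}. \qquad (6)$$
$$\text{FWHM}(\tilde E_\alpha) = 2\alpha. \qquad (7)$$
$$\text{FWHM}(E_\alpha)\cdot\text{FWHM}(\tilde E_\alpha) = 4\ln 2. \qquad (8)$$
Notice that equation (8) is independent of $\alpha$; the same statement holds for all
functions in the family $\{E_\alpha(t)\}$ regardless of the value of $\alpha$.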
Gaussians
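Let $G_\sigma(x)$ be a Gaussian of width $\sigma$:
$$G_\sigma(x) \equiv e^{-x^2/\sigma^2}.$$
Completing the square in the exponent of the Fourier integral,
$$\frac{x^2}{\sigma^2} + ikx = \frac{1}{\sigma^2}\left(x + \frac{ik\sigma^2}{2}\right)^2 + \frac{k^2\sigma^2}{4},$$
we find
$$\tilde G_\sigma(k) = \frac{1}{2\pi}\, e^{-k^2\sigma^2/4} \underbrace{\int_{-\infty}^{\infty} e^{-\frac{1}{\sigma^2}\left(x + \frac{ik\sigma^2}{2}\right)^2}\,dx}_{\sigma\sqrt{\pi}} = \frac{\sigma}{2\sqrt{\pi}}\, e^{-k^2\sigma^2/4}.$$
Aside from the annoying prefactor⁴, the important point here is that $\tilde G_\sigma(k)$ is
again a Gaussian in k-space, but with a width inversely proportional to that
of the original Gaussian:
$$\tilde G_\sigma(k) \propto e^{-k^2/\tilde\sigma^2} = G_{\tilde\sigma}(k) \quad\text{where}\quad \tilde\sigma \equiv \frac{2}{\sigma}.$$
$$\text{FWHM}(G_\sigma)\cdot\text{FWHM}(\tilde G_\sigma) = 2\sqrt{\ln 2}\,\sigma \cdot \frac{4\sqrt{\ln 2}}{\sigma} = 8\left(\sqrt{\ln 2}\right)^2 = 8\ln 2. \qquad (9)$$
Again, this is independent of $\sigma$: it holds for the entire family of Gaussian pulses.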
$$\hbar \approx 10^{-34}\ \text{kg m}^2/\text{s}$$
so, for example, if we have an electron (mass $\sim 10^{-30}$ kg) and we try to resolve
its position to within $\Delta x \sim 10$ nm, then we can't pin down its velocity to any
better accuracy than $\Delta v \sim 10^5$ m/s! This is a huge uncertainty compared to
the spatial resolution we are trying to hit.
4 … ensure that $\tilde G_\sigma(k)$ comes out with a nicer prefactor or even a symmetric prefactor (i.e. the
same prefactor as $G_\sigma$), but we won't bother.
process in which a function gets infinitely narrow and infinitely tall sounds like
the kind of procedure that defines a Dirac delta function, and indeed it's easy
to show that
$$\lim_{\alpha\to 0} \tilde E_\alpha(k) = \lim_{\alpha\to 0} \frac{\alpha}{\pi(\alpha^2 + k^2)} = \delta(k).$$
Thus we have the Fourier-transform pair
$$f(x) \equiv 1 \quad\Longleftrightarrow\quad \tilde f(k) = \delta(k). \qquad (10)$$
This actually makes sense, if you think about it: The function $f(x) \equiv 1$ already
is a sinusoid, namely, a sinusoid with zero frequency. To synthesize this function
as a sum of sinusoids, we want to set the coefficients of all sinusoids to zero except
the single sinusoid with frequency $k = 0$.
Armed with equation (10) and the derivative identity (4), we can now compute the Fourier transform of functions like $f(x) = x$ or $f(x) = x^2$:
$$f(x) = x \quad\Longrightarrow\quad \tilde f(k) = i\,\delta'(k) \qquad (11)$$
$$f(x) = x^2 \quad\Longrightarrow\quad \tilde f(k) = -\delta''(k) \qquad (12)$$
where e.g. $\delta'(k)$ is the derivative of the Dirac delta function, which is defined
using integration by parts:
$$\int f(u)\,\delta'(u)\,du = -\int f'(u)\,\delta(u)\,du = -f'(0).$$
In other words, the object $\delta'$ should be thought of as a gadget similar to $\delta$, except
that when integrated against a function $f$ it pulls out minus the derivative of $f$
at the origin, not the value of $f$ like the usual $\delta$ function would do.
5 What this means in essence is that objects like $\delta(k)$ and $\delta'(k)$ are meaningless in isolation,
and only make sense when they appear paired with a nice function under an integral sign.
This makes sense: If $f(t)$ is slow, then it doesn't contain many fast
frequency components (or the ones it does contain have small amplitudes).
This statement can be quantified by characterizing the smoothness of $f$ in
terms of its continuity and that of its derivatives. In particular,
    If $f(t)$ and its first $p - 1$ derivatives are continuous, but its $p$th
    derivative is discontinuous with bounded variation, then $\tilde f(\omega)$
    decays at least as rapidly as $|\omega|^{-(p+1)}$ for $|\omega| \to \infty$.
In particular, if $f(t)$ is $C^\infty$ (it is continuous and all of its derivatives are continuous everywhere, no discontinuities, anytime, anyplace, ever) then $\tilde f(\omega)$ decays
for large $\omega$ faster than any polynomial (that is, faster than $|\omega|^{-n}$ for every $n$). Functions which decay faster than any
polynomial include $e^{-|\omega|}$, $e^{-\omega^2}$, etc.
Statements like the boxed statement above are generally known as Paley-Wiener theorems.
This principle is already illustrated by the particular examples we considered
previously. The function $e^{-\alpha|t|}$ is continuous, but its first derivative is not (it
has a finite jump at the origin). Thus the statement in the box is satisfied for
$p = 1$ and we expect the Fourier transform to decay like $\omega^{-2}$ for large $\omega$, as
indeed we found above. On the other hand, the function $e^{-t^2/\sigma^2}$ is $C^\infty$, so its
Fourier transform should decay faster than any polynomial in $\omega$ and, indeed,
the Fourier transform of this function goes like $e^{-\omega^2\sigma^2/4}$, which decays for large
$\omega$ faster than any polynomial in $\omega$.
has compact support. On the other hand, the Gaussian $e^{-x^2}$ does not have
compact support; for large $|x|$ it is very small but not exactly zero.
In the same vein as we asked above whether or not we could simultaneously
squeeze $f(t)$ and $\tilde f(\omega)$ to be narrow pulses, it is interesting to ask if we could
find a function $f(t)$ such that both $f$ and $\tilde f$ have compact support. The answer
is basically no, except for the trivial case $f = \tilde f = 0$:
Fourier series
Next suppose $f(t)$ is a periodic function with period $T$. This means that
$f(t + T) = f(t)$ for all $t$; the function $f$ repeats itself every $T$ seconds.⁶ Suppose
we try to compute the Fourier transform $\tilde f(\omega)$ of this periodic function:
$$\tilde f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\omega t} f(t)\,dt. \qquad (13)$$
There are two distinct cases we need to analyze:
1. The frequency $\omega$ is an integer multiple of $\frac{2\pi}{T}$. In this case, the entire integrand of (13) is periodic with period $T$. Every time interval of width $T$
makes an identical contribution to the integral, and there are an infinite
number of such time intervals, so $\tilde f(\omega) = \infty$.
2. The frequency $\omega$ is not an integer multiple of $\frac{2\pi}{T}$. In this case, the integrand
of (13) is not periodic. [The $f(t)$ factor is periodic with period $T$, and the
$e^{-i\omega t}$ factor is periodic with some period not equal to an integer fraction
of $T$, so the overall integrand is not periodic.] Now what happens is that
each time interval of width $T$ makes a contribution to the integral that
has essentially the same magnitude, but a random phase factor. These
random phase factors cause all the contributions to the integral to cancel,
and we find $\tilde f(\omega) = 0$.
To summarize, if $f(t)$ is periodic with period $T$, its Fourier transform $\tilde f(\omega)$ is
zero except when $\omega$ is an integer multiple of $\omega_0 \equiv \frac{2\pi}{T}$, at which $\tilde f(\omega)$ is infinite.
One way to think of this situation is to represent $\tilde f(\omega)$ as a train of $\delta$ functions:
$$\tilde f(\omega) = \sum_{n=-\infty}^{\infty} \tilde f_n\,\delta(\omega - n\omega_0), \qquad \omega_0 = \frac{2\pi}{T}.$$
Another way to think about this is to say that if $f(t)$ is periodic with period $T$, its
Fourier decomposition only contains sinusoids with frequencies $\omega_n = n\omega_0 = \frac{2n\pi}{T}$
for $n \in \mathbb{Z}$. We can write
$$f(t) = \sum_{n\in\mathbb{Z}} \tilde f_n\, e^{2\pi i n t/T} \quad\text{or}\quad f(t) = \sum_{n\in\mathbb{Z}} \tilde f_n\, e^{in\omega_0 t} \quad\text{where}\quad \omega_0 = \frac{2\pi}{T}.$$
$$\tilde f_n = \frac{1}{T}\int_0^T f(t)\,e^{-in\omega_0 t}\,dt.$$
A simple example
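For example, let's Fourier-analyze the function $\cos^2 3t$, which is periodic with
period $T = \frac{\pi}{3}$ (so that $\omega_0 = 2\pi/T = 6$).
[Figure: plot of cos^2(3t) versus t.]
$$\tilde f_n = \frac{1}{T}\int_0^T e^{-in\omega_0 t} f(t)\,dt$$
$$= \frac{1}{4T}\int_0^T e^{-in\omega_0 t}\left[e^{i\omega_0 t} + 2 + e^{-i\omega_0 t}\right]dt$$
$$= \frac{1}{4}\left[\delta_{n,1} + 2\delta_{n,0} + \delta_{n,-1}\right]$$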
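In other words, the Fourier coefficient $\tilde f_n$ is only nonzero for $n \in \{-1, 0, 1\}$. The
Fourier-synthesized version of $f(t)$ is
$$f(t) = \sum_n \tilde f_n\, e^{in\omega_0 t}$$
$$= \frac{1}{4} e^{-i\omega_0 t} + \frac{1}{2} + \frac{1}{4} e^{i\omega_0 t}$$
$$= \frac{1}{2}\left[1 + \cos \omega_0 t\right]$$
$$= \frac{1}{2}\left[1 + \cos 6t\right].$$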
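The coefficients of $\cos^2 3t$ can also be obtained numerically, which is a useful cross-check on the hand calculation. The sketch below is an illustration only: it assumes the coefficient formula $\tilde f_n = \frac{1}{T}\int_0^T f(t) e^{-in\omega_0 t}\,dt$ and approximates the integral with an equally spaced rectangle rule over one period (which is very accurate for smooth periodic integrands). The name fourier_coeff and the number of sample points M are hypothetical choices.

```julia
# Approximate f̃_n = (1/T) ∫_0^T f(t) e^{-i n ω0 t} dt using M equally spaced
# samples over one period (rectangle rule on a periodic integrand).
function fourier_coeff(f, n, T; M=64)
    ω0 = 2π / T
    h = T / M
    return sum(f(j * h) * exp(-im * n * ω0 * j * h) for j in 0:M-1) * h / T
end

T = π / 3
f(t) = cos(3t)^2

for n in -3:3
    c = fourier_coeff(f, n, T)
    println("n = ", n, "   coefficient ≈ ", round(real(c), digits=12))
end
```

Only the $n = -1, 0, 1$ coefficients should come out appreciably different from zero.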
$$f_e(t) = \frac{1}{2}\left[f(t) + f(-t)\right], \qquad f_o(t) = \frac{1}{2}\left[f(t) - f(-t)\right]. \qquad (14)$$
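[Figure: plot of f(t) versus t.]
$$f(t) = \sum_{n=-\infty}^{\infty} \tilde f_n\, e^{in\omega_0 t}, \qquad \omega_0 = \frac{2\pi}{T}$$
$$\tilde f_n = \frac{1}{T}\int_0^T f(t)\,e^{-in\omega_0 t}\,dt = \frac{1}{T}\int_0^T t\,e^{-in\omega_0 t}\,dt.$$
The $n = 0$ term evaluates to $\tilde f_0 = \frac{T}{2}$. For $n \neq 0$, integration by parts gives
$$\tilde f_n = \frac{1}{T}\left[-\frac{1}{in\omega_0}\, t\,e^{-in\omega_0 t}\,\Big|_0^T + \frac{1}{in\omega_0}\int_0^T e^{-in\omega_0 t}\,dt\right]$$
$$= -\frac{1}{in\omega_0}.$$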
$$f(t) = \frac{T}{2} - \frac{1}{i\omega_0}\sum_{n\neq 0} \frac{e^{in\omega_0 t}}{n}$$
Note that the units are correct: the LHS has units of time, the first term on the
RHS has units of time, and the second term on the RHS has units of (angular
frequency)$^{-1}$ = time.
Note also that $\tilde f_n$ decays like $1/|n|$ for large $|n|$. This is in accordance with
our discussion of Paley-Wiener theorems above, since the function $f(t)$ is discontinuous.
We could also rewrite this series in terms of cosines and sines (and eliminate
$\omega_0$ in favor of $T$):
$$f(t) = \frac{T}{2} - \frac{T}{\pi}\sum_{n=1}^{\infty} \frac{1}{n}\sin\frac{2n\pi t}{T}. \qquad (15)$$
Note that this is almost a Fourier sine series; only the first (constant) term
doesn't belong. If we consider the modified function $g(t) = f(t) - \frac{T}{2}$, then this
term would go away and the Fourier series for $g(t)$ would be a Fourier sine series,
which can only be true if $g(t)$ is an odd function. You should look at the graph
of $f(t)$ and convince yourself that shifting the entire curve downward by $T/2$
does indeed yield an odd function.
It seems amazing to think that summing up a bunch of sine functions, each
one of which is individually a nice smooth function, can reproduce the jagged,
discontinuous behavior of the sawtooth function of Figure 5. But it does!
[Figure 3 plot: the sawtooth f(t) together with the truncated series f_2(t), f_5(t), f_10(t), f_20(t).]
Figure 3: The Gibbs phenomenon. When we truncate the Fourier series (15)
at a finite number of terms, we obtain an approximation to the original sawtooth function $f(t)$. Note that, in the regions away from the discontinuity, the
approximation more closely hugs the actual function as $N \to \infty$; however, near
the discontinuity, the peak error between the function and the approximation
does not decrease with increasing $N$. However, the definition of "near the discontinuity" does change with $N$, and for larger $N$ the errors are confined to a
narrower region about the discontinuity.
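The behavior described in the caption can be reproduced numerically. The sketch below is an illustration under stated assumptions: it takes the sawtooth to be $f(t) = t$ on $0 \le t < T$ with the sine series $f_N(t) = \frac{T}{2} - \frac{T}{\pi}\sum_{n=1}^{N}\frac{1}{n}\sin\frac{2n\pi t}{T}$, and it measures the overshoot as the largest value of $f(t) - f_N(t)$ on a fine grid over $(0, T/2]$, which deliberately leaves out the jump itself. The function names and the grid size M are hypothetical.

```julia
# Partial sum of the sawtooth sine series, truncated after N terms.
sawtooth_partial(t, T, N) = T / 2 - (T / π) * sum(sin(2π * n * t / T) / n for n in 1:N)

# Largest amount by which f(t) = t exceeds the partial sum on (0, T/2],
# estimated on a grid of M points (M is an arbitrary resolution choice).
function overshoot(T, N; M=20_000)
    ts = range(T / M, stop=T / 2, length=M)
    return maximum(t - sawtooth_partial(t, T, N) for t in ts)
end

for N in (10, 40, 160)
    println("N = ", N, "   overshoot / T ≈ ", overshoot(1.0, N))
end
```

The overshoot should stay close to 9% of the jump height $T$ for every $N$, while the location of the peak moves toward the discontinuity as $N$ grows.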
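$$\lim_{N\to\infty}\sum_{n=-N}^{N} \tilde f_n\, e^{in\omega_0 t} = \frac{1}{2}\left[f(t^-) + f(t^+)\right] \quad\text{where}\quad f(t^\pm) = \lim_{\epsilon\to 0^+} f(t \pm \epsilon).$$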
Fourier synthesis
One important consequence of the losslessness of Fourier analysis is that the
inverse process (Fourier synthesis) exists and may be used to recover the
original function from its Fourier-analyzed version. (Again, this is true no matter which of the fourfold-way entries we are talking about.) For example, the
inverse of equation (1) reads
$$f(t) = \int_{-\infty}^{\infty} e^{i\omega t}\,\tilde f(\omega)\,d\omega \qquad (16)$$
This is exactly what you expect: we recover $f(t)$ by summing a bunch of sinusoids $e^{i\omega t}$, with the weight of the frequency-$\omega$ summand given by $\tilde f(\omega)$.
Note that equation (16) is exact: there is no loss in going back and forth
between the physical and Fourier domains. If we didn't have the losslessness
property of Fourier analysis, we would have to wonder whether or not the function defined by the RHS of (16) was in some way an inexact representation of
our function.
Parseval's Theorem
Another important consequence of the losslessness of Fourier analysis is that
it allows us to perform certain computations in the Fourier domain with the
confidence that these computations yield the same results as if we had performed
them in the physical domain. If we didn't have the losslessness property of
Fourier analysis, we would have to wonder whether or not we lost something
along the way.
This phenomenon is well illustrated by Parseval's theorem. Suppose we have
two functions $f(t)$ and $g(t)$ and we want to compute their inner product,
$$\langle f | g \rangle = \int_{-\infty}^{\infty} f^*(t)\,g(t)\,dt$$
9 Note that we may not start out with complete information on the function $f(t)$; for
example, we may only have samples of this function at some limited set of time points. In
this case, the process of Fourier analysis (which, for a finite set of function samples, would
be the discrete Fourier transform) obviously does not magically give us any more information
about the original underlying function $f(t)$ than we started out with, but what's important is
that it doesn't lose any information: after computing the DFT, we can always compute the
inverse DFT to recover the original function samples we started with.
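$$= \int\!\!\int \tilde f^*(\omega)\,\tilde g(\omega')\,\underbrace{\left[\int e^{i(\omega' - \omega)t}\,dt\right]}_{2\pi\delta(\omega - \omega')}\,d\omega'\,d\omega$$
$$= 2\pi\int \tilde f^*(\omega)\,\tilde g(\omega)\,d\omega.$$
Thus, up to a factor of $2\pi$, the inner product of the Fourier transforms of $f$ and $g$ is equal to the inner
product of $f$ and $g$.
Plancherel's theorem
If we take the two functions in Parseval's theorem to be the same function,
$g = f$, we obtain Plancherel's theorem:
$$\int_{-\infty}^{\infty} |f(t)|^2\,dt = 2\pi\int_{-\infty}^{\infty} |\tilde f(\omega)|^2\,d\omega.$$
The corresponding statements for Fourier series read
$$\int_0^T f^*(t)\,g(t)\,dt = T\sum_{n=-\infty}^{\infty} \tilde f_n^*\,\tilde g_n \qquad (17)$$
$$\int_0^T |f(t)|^2\,dt = T\sum_{n=-\infty}^{\infty} |\tilde f_n|^2 \qquad (18)$$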
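Plancherel's theorem is easy to test numerically on a function whose transform is known in closed form. The sketch below is an illustration only: it uses the Gaussian pair $G_\sigma(x) = e^{-x^2/\sigma^2}$ and $\tilde G_\sigma(k) = \frac{\sigma}{2\sqrt{\pi}} e^{-k^2\sigma^2/4}$ (valid for the convention with $\frac{1}{2\pi}$ on the forward transform), and evaluates both integrals with a simple trapezoid rule on a truncated interval. The helper name trapz, the interval $[-20, 20]$ and the value of σ are arbitrary assumptions.

```julia
# Composite trapezoid rule for ∫_a^b g(x) dx with M subintervals.
function trapz(g, a, b, M)
    h = (b - a) / M
    return h * (sum(g(a + j * h) for j in 1:M-1) + (g(a) + g(b)) / 2)
end

σ = 1.5                                                    # arbitrary width
G(x)  = exp(-x^2 / σ^2)                                    # physical-space Gaussian
Gt(k) = σ / (2 * sqrt(π)) * exp(-k^2 * σ^2 / 4)            # its Fourier transform

lhs = trapz(x -> abs2(G(x)), -20.0, 20.0, 4000)            # ∫ |G|^2 dx
rhs = 2π * trapz(k -> abs2(Gt(k)), -20.0, 20.0, 4000)      # 2π ∫ |G̃|^2 dk

println("physical-space side: ", lhs)
println("Fourier-space side:  ", rhs)
println("closed form σ√(π/2): ", σ * sqrt(π / 2))
```

Without the factor $2\pi$ on the Fourier-space side the two numbers would not agree, which is a quick way to catch a mismatch in transform conventions.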
Poisson summation
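$$\sum_{n=-\infty}^{\infty} f(n\Delta t) = \frac{2\pi}{\Delta t}\sum_{m=-\infty}^{\infty} \tilde f\!\left(\frac{2\pi m}{\Delta t}\right). \qquad (19)$$
Note that the units are correct: $\tilde f$ has units of
$$\text{units of } \tilde f = \frac{\text{units of } f}{\text{frequency}} = \text{units of } f \cdot \text{time}$$
while $\Delta t$ has units of time; thus
$$\text{units of } \frac{\tilde f}{\Delta t} = \text{units of } f.$$
It's easy to prove equation (19), and we'll do it below, but first let's investigate
some practical applications.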
Jacobi functions
Recall that the Fourier transform of a Gaussian is a Gaussian, and that, more
specifically, the FT of a wide Gaussian is a narrow Gaussian; in particular, the
FT of a Gaussian with width $\sigma$ in physical space is a Gaussian with width $2/\sigma$
in Fourier space. Thus, if we ever found ourselves wanting to sum the quantity
$e^{-\pi n^2 x}$ over all integer $n$, and we were finding our sum slow to converge (because,
say, $x$ might be small) we might be tempted
to exploit Poisson summation to evaluate the sum in Fourier space.
To get technical about it, define
$$T_x(n) \equiv e^{-\pi n^2 x} \qquad (20)$$
$$\theta(x) \equiv \sum_{n=-\infty}^{\infty} e^{-\pi n^2 x} = \sum_{n=-\infty}^{\infty} T_x(n). \qquad (21)$$
Applying Poisson summation (19) with $\Delta t = 1$,
$$\theta(x) = 2\pi\sum_{m=-\infty}^{\infty} \tilde T_x(2\pi m) = \frac{1}{\sqrt{x}}\underbrace{\sum_{m=-\infty}^{\infty} e^{-\pi m^2/x}}_{\theta(1/x)} \qquad (22)$$
But the sum here is nothing but the original function evaluated at the inverse
of its original argument! We have proven the functional equation of the Jacobi
$\theta$ function:¹⁰
$$\theta(x) = \frac{1}{\sqrt{x}}\,\theta\!\left(\frac{1}{x}\right).$$
I find this to be a totally wacky formula. $\theta(x)$ looks like a very complicated
function. How could the value of this function at $x$ possibly be related so
simply to its value at $1/x$? But it is!
10 Actually the function defined by equation (21) is only one of several related functions
known collectively as Jacobi theta functions.
To demonstrate the computational efficacy of (22), write it in the form
$$\sum_{n=-\infty}^{\infty} e^{-\pi n^2 x} = x^{-1/2}\sum_{m=-\infty}^{\infty} e^{-\pi m^2/x} \qquad (23)$$
Suppose we want to compute, to 6-digit accuracy, the value of this sum for
$x = 0.04$. Using the LHS to evaluate the sum, we need to sum 11 terms:
LHS sum in (23) with x = 0.04:
N     e^{-π N² x}                 Σ_{n=-N}^{N} e^{-π n² x}
0     1.0                         1.000000000000000
1     0.8819113782981763          2.763822756596353
2     0.6049225627642709          3.9736678821248947
3     0.322718983267049           4.619105848658993
4     0.133905721399763           4.886917291458519
5     0.04321391826377226         4.973345127986064
6     0.010846710538160161        4.995038549062384
7     0.002117494770632841        4.99927353860365
8     0.00032151151668886733      4.999916561637028
9     3.796825289201935e-5        4.999992498142812
10    3.4873423562089973e-6       4.9999994728275245
11    2.4912565147240595e-7       4.999999971078828
On the other hand, if we use the RHS of (23), we only have to sum one term to
get 6-digit accuracy:
RHS sum in (23) with x = 0.04:
N     e^{-π N² / x}               Σ_{n=-N}^{N} e^{-π n² / x}
0     1.0                         1.000000000000000
1     7.773044498987552e-35       1.000000000000000
2     3.650603079495543e-137      1.000000000000000
$$\underbrace{\theta(0.04)}_{4.999999971078828} = \underbrace{\frac{1}{\sqrt{0.04}}}_{5.00000000000000}\;\underbrace{\theta\!\left(\frac{1}{0.04}\right)}_{1.00000000000}$$
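The two tables can be regenerated with a few lines of code. The sketch below is an illustration that assumes the definition $\theta(x) = \sum_n e^{-\pi n^2 x}$ and simply compares truncated versions of the two sides of (23); the function name theta_partial is hypothetical.

```julia
# Truncated theta sum: terms n = -N, ..., N of e^{-π n² x}.
theta_partial(x, N) = sum(exp(-π * n^2 * x) for n in -N:N)

x = 0.04
lhs = theta_partial(x, 11)                 # slowly converging side of (23)
rhs = theta_partial(1 / x, 1) / sqrt(x)    # rapidly converging side of (23)

println("LHS with N = 11: ", lhs)
println("RHS with N = 1:  ", rhs)
println("difference:      ", abs(lhs - rhs))
```

The design point is that for small $x$ the terms on the left decay slowly in $n$, while the terms on the right decay like $e^{-\pi m^2/x}$, so a single term beyond $m = 0$ is already far below double-precision resolution.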
Ewald summation
Finally, Poisson summation is the basis of Ewald summation, a wonderful technique for speeding the convergence of real-space sums over particle interactions
that is widely used in computational physics and engineering. We will consider
this topic in detail in a subsequent set of lecture notes.
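$$\sum_{n=-\infty}^{\infty} f(n\Delta t) = \sum_{n=-\infty}^{\infty}\int \tilde f(\omega)\,e^{in\omega\Delta t}\,d\omega$$
$$= \int \tilde f(\omega)\underbrace{\left\{\sum_{n=-\infty}^{\infty} e^{in\omega\Delta t}\right\}}_{2\pi\sum_m \delta(\omega\Delta t - 2\pi m)}\,d\omega$$
The point of this step is that the sum over $n$ inside the curly brackets yields zero
(all the terms eventually cancel each other) unless $\omega\Delta t$ is an integer multiple of
$2\pi$, in which case that sum is infinite. We summarize this situation by describing
the quantity in the curly brackets as a sum of $\delta$ functions which is only nonzero for $\omega\Delta t$
equal to $2\pi m$ for arbitrary integers $m$.
$$= 2\pi\sum_{m=-\infty}^{\infty}\int \tilde f(\omega)\,\delta(\omega\Delta t - 2\pi m)\,d\omega$$
$$= \frac{2\pi}{\Delta t}\sum_{m=-\infty}^{\infty}\int \tilde f(\omega)\,\delta\!\left(\omega - \frac{2\pi m}{\Delta t}\right)d\omega$$
$$= \frac{2\pi}{\Delta t}\sum_{m=-\infty}^{\infty} \tilde f\!\left(\frac{2\pi m}{\Delta t}\right).$$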
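$$\frac{1}{2\pi}\int\!\!\int \left[\int e^{i\omega_1\tau}\,\tilde f(\omega_1)\,d\omega_1\right]\left[\int e^{i\omega_2(t-\tau)}\,\tilde g(\omega_2)\,d\omega_2\right] e^{-i\omega t}\,d\tau\,dt$$
$$= \frac{1}{2\pi}\int\!\!\int \underbrace{\left[\int e^{i(\omega_1-\omega_2)\tau}\,d\tau\right]}_{2\pi\delta(\omega_1-\omega_2)}\;\underbrace{\left[\int e^{i(\omega_2-\omega)t}\,dt\right]}_{2\pi\delta(\omega_2-\omega)}\;\tilde f(\omega_1)\,\tilde g(\omega_2)\,d\omega_1\,d\omega_2$$
Use the first $\delta$ function to evaluate the $\omega_1$ integral, then use the second $\delta$ function
to evaluate the $\omega_2$ integral:
$$= 2\pi\,\tilde f(\omega)\,\tilde g(\omega).$$
In other words: up to the constant factor $2\pi$, the frequency-$\omega$ Fourier coefficient of the convolution of $f$ and
$g$ is just the product of the frequency-$\omega$ Fourier coefficients of $f$ and $g$.
This fact has important implications for signal processing. In particular, it
means that the operation of convolution is easier to perform in the frequency
domain than the physical domain.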
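$$\int e^{-i\mathbf{k}\cdot\mathbf{x}} f(\mathbf{x})\,d\mathbf{x}$$
$$\frac{1}{\sqrt{x^2 + y^2 + z^2}}$$
$$\tilde\phi(\mathbf{k}) = \frac{1}{(2\pi)^3}\int \frac{e^{-i\mathbf{k}\cdot\mathbf{x}}}{r}\,d\mathbf{x}$$
A convenient way of evaluating 3D integrals like this is to use polar coordinates
in a coordinate system in which $\mathbf{k}$ points in the $z$ direction. In this coordinate
system we have $d\mathbf{x} = r^2\sin\theta\,dr\,d\theta\,d\varphi$ and $\mathbf{k}\cdot\mathbf{x} = kr\cos\theta$ (where $k = |\mathbf{k}|$ is the
magnitude of $\mathbf{k}$) so the integral becomes
$$= \frac{1}{(2\pi)^3}\int_0^\infty\!\!\int_0^\pi\!\!\int_0^{2\pi} \frac{e^{-ikr\cos\theta}}{r}\,r^2\sin\theta\,d\varphi\,d\theta\,dr$$
The $\varphi$ integral can be done immediately to yield $2\pi$. To do the $\theta$ integral,
change variable to $u = \cos\theta$, $du = -\sin\theta\,d\theta$:
$$= \frac{1}{(2\pi)^2}\int_0^\infty dr\,r\int_{-1}^{1} e^{-ikru}\,du$$
$$= \frac{1}{(2\pi)^2}\int_0^\infty dr\,r\,\underbrace{\left[-\frac{1}{ikr}\,e^{-ikru}\right]_{-1}^{1}}_{2\sin(kr)/kr}$$
$$= \frac{1}{2\pi^2 k}\int_0^\infty \sin kr\,dr$$
$$= \frac{1}{2\pi^2 k^2}\underbrace{\int_0^\infty \sin t\,dt}_{=1\ (\text{see footnote } 12)}$$
and thus
$$\tilde\phi_{\text{coulomb}}(\mathbf{k}) = \frac{1}{2\pi^2 k^2}. \qquad (24)$$
A good way to think of (24) is in terms of the Fourier-synthesis picture:
$$\frac{1}{r} = \phi_{\text{coulomb}}(\mathbf{r}) = \int \tilde\phi_{\text{coulomb}}(\mathbf{k})\,e^{i\mathbf{k}\cdot\mathbf{r}}\,d\mathbf{k} = \int \frac{d\mathbf{k}\;e^{i\mathbf{k}\cdot\mathbf{r}}}{2\pi^2|\mathbf{k}|^2}.$$
Thus, we can recover the Coulomb potential by summing plane waves of all
possible wavevectors; the contribution of the plane wave with wavevector $\mathbf{k}$ is
weighted in the sum with a factor $1/(2\pi^2|\mathbf{k}|^2)$.
11 Since this function lives half in physical space and half in Fourier space we really should
adorn it with a half-tilde instead of the full crown, but I don't know how to typeset that in
LaTeX.
12 Actually the $t$ integral here doesn't quite make sense as we have written it; the proper
justification of the result requires a certain limiting process, which you will work out in your
problem set.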
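$$\sum_{n_x,n_y=-\infty}^{\infty} f\big(n_x\Delta x,\ n_y\Delta y\big).$$
In other words, we are sampling $f$ on a grid of points that lie $\Delta x$ apart in the $x$
direction, and $\Delta y$ apart in the $y$ direction. All we have to do is apply Poisson
summation recursively, first in the $y$ direction and then in the $x$ direction (or vice
versa, it doesn't matter). The result is
$$\sum_{n_x,n_y=-\infty}^{\infty} f\big(n_x\Delta x,\ n_y\Delta y\big) = \left(\frac{2\pi}{\Delta x}\right)\left(\frac{2\pi}{\Delta y}\right)\sum_{m_x,m_y=-\infty}^{\infty} \tilde f\!\left(\frac{2\pi m_x}{\Delta x},\ \frac{2\pi m_y}{\Delta y}\right).$$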
Exponential Sums
In several places throughout this document, we have invoked certain sum rules
without justification. We'll collect these formulas here just to make sure we
have them all in one place and to emphasize that they are all really just slightly
different twists on the same basic principle.
$$\frac{1}{T}\int_0^T e^{i(n_1 - n_2)\omega_0 t}\,dt = \delta_{n_1,n_2}. \qquad (25)$$
Orthogonality interpretation
A good way to interpret (25) is to say that, for $n_1 \neq n_2$, the functions $f_{n_1}(t) = e^{in_1\omega_0 t}$
and $f_{n_2}(t) = e^{in_2\omega_0 t}$ are orthogonal with respect to the inner product
$\langle f, g\rangle = \frac{1}{T}\int_0^T f^* g\,dt$. The notions of inner product and orthogonality are
borrowed from geometry, and they mean the same things here: the inner product
is an operation that takes two elements and returns a number, and two elements
are orthogonal if they have zero inner product.
Discrete version
The following result was used in our derivation of Poisson summation above,
and will be considered further when we discuss discrete Fourier transforms.
$$\sum_{n=-\infty}^{\infty} e^{inkx} = 2\pi\sum_{m=-\infty}^{\infty}\delta(xk - 2\pi m)$$
What this says is the following: The sum on the LHS yields zero unless $x$ is an
integer multiple of $\frac{2\pi}{k}$. (The sum over $m$ is just allowing for all possible integer
multiples.) If $x$ is an integer multiple of $\frac{2\pi}{k}$, then the sum on the LHS is infinite
(all the summands are equal to 1), but infinite in such a way that if we multiply
the LHS by some function $f(x)$ and integrate over all $x$ then we get a finite
number which depends on the values of $f(x)$ at the points $x = 2\pi m/k$.
Gaussian Integrals
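$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi} \qquad (27)$$
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}. \qquad (28)$$
To derive this formula, just change variables in the original Gaussian integral
(27).
You can use dimensional analysis to remember the $\alpha$ dependence of (28) like
this: The entire LHS of (28) has the same units as $x$ because the $dx$ factor in
the integral is the only dimensionful quantity in that expression. For example,
if $x$ is measured in meters, then the entire LHS of (28) has units of meters.
On the other hand, since $\alpha x^2$ is the argument of an exponential, it must be
dimensionless, whereupon $1/\alpha$ must have the same units as $x^2$, and thus $1/\sqrt{\alpha}$
must have the same units as $x$. Since the RHS must have the same units as $x$,
the RHS must be proportional to $1/\sqrt{\alpha}$.
Gaussian integrals with linear and constant terms in the exponent
It may also happen that the exponent contains additional terms of lower order
in $x$, i.e. we may have
$$I(\alpha, \beta, \gamma) = \int_{-\infty}^{\infty} e^{-\alpha x^2 + \beta x + \gamma}\,dx.$$
The constant term factors out of the integral,
$$I(\alpha, \beta, \gamma) = e^{\gamma}\int_{-\infty}^{\infty} e^{-\alpha x^2 + \beta x}\,dx,$$
and we complete the square in the remaining exponent:
$$-\alpha x^2 + \beta x = -\alpha\left(x - \frac{\beta}{2\alpha}\right)^2 + \frac{\beta^2}{4\alpha}.$$
Inserting back into the above, we have
$$I(\alpha, \beta, \gamma) = e^{\gamma + \frac{\beta^2}{4\alpha}}\int_{-\infty}^{\infty} e^{-\alpha\left(x - \frac{\beta}{2\alpha}\right)^2}\,dx$$
Change variables to $y = x - \frac{\beta}{2\alpha}$:
$$= e^{\gamma + \frac{\beta^2}{4\alpha}}\underbrace{\int_{-\infty}^{\infty} e^{-\alpha y^2}\,dy}_{\sqrt{\pi/\alpha}} = \sqrt{\frac{\pi}{\alpha}}\;e^{\gamma + \frac{\beta^2}{4\alpha}}.$$
Although it's not obvious from this derivation, the formula actually continues
to hold for imaginary values of $\beta$ and $\gamma$.¹³
13 And even some complex values of $\alpha$, though not all; for example, it clearly fails for $\alpha = 0$
or $\alpha = -1$, among other values, as the original integral obviously diverges in these cases.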
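The closed form $I(\alpha,\beta,\gamma) = \sqrt{\pi/\alpha}\,e^{\gamma + \beta^2/(4\alpha)}$ can be compared against direct numerical quadrature for real parameters with $\alpha > 0$. The sketch below is an illustration only; the function names, the truncation interval $[-L, L]$ and the parameter values are arbitrary assumptions.

```julia
# Closed-form value of ∫ exp(-α x² + β x + γ) dx for real α > 0.
gauss_closed(α, β, γ) = sqrt(π / α) * exp(γ + β^2 / (4α))

# Trapezoid-rule estimate of the same integral on the truncated interval [-L, L].
function gauss_numeric(α, β, γ; L=30.0, M=60_000)
    h = 2L / M
    s = 0.0
    for j in 0:M
        x = -L + j * h
        w = (j == 0 || j == M) ? 0.5 : 1.0     # trapezoid end-point weights
        s += w * exp(-α * x^2 + β * x + γ)
    end
    return s * h
end

α, β, γ = 2.0, 1.0, -0.5
println("closed form: ", gauss_closed(α, β, γ))
println("quadrature:  ", gauss_numeric(α, β, γ))
```

For $\beta = \gamma = 0$ and $\alpha = 1$ the same functions reduce to the basic Gaussian integral (27).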
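$$S = \sum_{n=1}^{\infty} f(n) \qquad (1)$$
$$S_N = \sum_{n=1}^{N} f(n) \qquad (2)$$
$$E_N = |S - S_N| = \left|\sum_{n=N+1}^{\infty} f(n)\right|. \qquad (3)$$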
[Figure: plot of $f(x) = \frac{1}{x^2}$ for $10 \le x \le 15$.]
The integral $\int_N^M f(x)\,dx$ gives the area under the curve $f(x)$ between $N$ and
$M$. This is the red-shaded region in Figure 2 below.
[Figure 2: the area under $f(x)$ between 10 and 15, i.e. $\int_{10}^{15} f(x)\,dx$, shaded red.]
On the other hand, the sum $\sum_{n=N}^{M-1} f(n)$ gives the area of the purple-shaded
region shown in Figure 3 below.
[Figure 3 plot.]
Figure 3: The sum $\sum_{n=N}^{M-1} f(n)$ gives the area of the shape consisting of the
blue shaded rectangles. Since $f(x)$ is monotonically decreasing, this area is
guaranteed to be greater than the area of the red-shaded area in Figure 2.
The purple-shaded region in Figure 3 is a union of rectangles; the rectangle
between $x = n$ and $x = n + 1$ has width 1 and height $f(n)$. Since the function
$f(x)$ is decreasing, the area of this rectangle is guaranteed to be greater than
the area under the curve $f(x)$ between $n$ and $n + 1$, and thus the area of the
entire purple-shaded region in Figure 3 is greater than the red-shaded region
in Figure 2. In other words, we have
$$\sum_{n=N}^{M-1} f(n) > \int_N^M f(x)\,dx \qquad (4)$$
[Figure 4 plot.]
Figure 4: The sum $\sum_{n=N}^{M-1} f(n + 1)$ gives the area of the shape consisting of the
green shaded rectangles. Since $f(x)$ is monotonically decreasing, this area is
guaranteed to be less than the area of the red-shaded area in Figure 2.
Similarly, since $f(x)$ is decreasing, the area of the green rectangle of height $f(n+1)$ between $x = n$ and
$x = n + 1$ is guaranteed to be less than the area under the curve $f(x)$ between
$n$ and $n + 1$, and thus the area of the entire green-shaded region in Figure 4 is
less than the red-shaded region in Figure 2. In other words, we have
$$\sum_{n=N}^{M-1} f(n + 1) < \int_N^M f(x)\,dx \qquad (5)$$
or, relabeling the summation index,
$$\int_N^M f(x)\,dx > \sum_{n=N+1}^{M} f(n). \qquad (6)$$
Inequality (6) is the one that will be useful for our purposes. Taking $M \to \infty$,
the sum on the RHS is just the quantity that enters the definition of the error
(3), and hence we find
$$E_N = S - S_N < \int_N^{\infty} f(x)\,dx. \qquad (7)$$
(We have dropped the absolute value signs from (3) because $f(x)$ is positive,
which means $S - S_N$ is always positive.)
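The bound (7) is easy to try out on a sum whose exact value is known. The sketch below is an illustration that assumes $f(x) = 1/x^2$, for which $S = \sum_{n\ge 1} 1/n^2 = \pi^2/6$ and $\int_N^\infty f(x)\,dx = 1/N$; the function names are hypothetical.

```julia
# Test function: positive and monotonically decreasing.
f(x) = 1 / x^2

S_exact = π^2 / 6                          # known value of Σ 1/n²
partial_sum(N) = sum(f(n) for n in 1:N)    # S_N
integral_bound(N) = 1 / N                  # ∫_N^∞ dx / x²

for N in (10, 100, 1000)
    E = S_exact - partial_sum(N)
    println("N = ", N, "   E_N = ", E, "   bound = ", integral_bound(N))
end
```

In this example the bound is not only valid but fairly tight, since the true error behaves like $1/N - 1/(2N^2)$ for large $N$.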
$$\sum_{n=1}^{\infty} \frac{(-1)^n}{n}.$$
$$\sum_{n=1}^{\infty}\left[\frac{1}{2n-1} - \frac{1}{2n}\right] = \sum_{n=1}^{\infty} \frac{1}{2n(2n-1)}.$$
so $E_N/S_N$ gives us an upper bound on the true relative error. (Moreover, in the
later stages of a calculation the difference between $S_N$ and $S$ is small, so it is a
tight upper bound.)
$$\sum_{n=N+1}^{\infty} f(n) < \int_N^{\infty} f(x)\,dx \qquad (8)$$
$$\sum_{n=N+1}^{M} f(n) - \int_N^M f(x)\,dx = C_0\Big[f(M) - f(N)\Big] + \sum_{p=1}^{\infty} C_p\Big[f^{(2p-1)}(M) - f^{(2p-1)}(N)\Big] \qquad (9)$$
where $f^{(k)}$ is the $k$th derivative of $f$ and the $C_p$ coefficients
decay rapidly with $p$:
$$C_0 = \frac{1}{2}, \quad C_1 = \frac{1}{12}, \quad C_2 = -\frac{1}{720}, \quad C_3 = \frac{1}{30240}, \quad C_4 = -\frac{1}{1209600}.$$
In contrast to equation (8), equation (9) holds for general smooth functions $f$,
not just functions that are positive and monotonically decreasing.
The Euler-Maclaurin formula is amazing: It says that the difference between
the sum and integral may be expressed entirely in terms of the behavior at the
endpoints. The formula is used extensively by number theorists, who use it to
evaluate sums in terms of integrals (which are generally much easier to compute).
The Euler-Maclaurin summation formula is somewhat tedious to derive, and
since we won't really use it much in this class we will skip the derivation. (It
is derived in many older numerical analysis books, including Stoer & Bulirsch.)
However, we want to call your attention to one important property: The error
term on the RHS of (9) depends only on the difference between the behavior
of $f$ at $N$ and the behavior of $f$ at $M$. This means, in particular, that if $f$ is
periodic over the interval $[N, M]$ then the entire error term vanishes!
This is our first brush with a general principle of 18.330: amazing magical
things happen when we work with periodic functions. We will encounter this
phenomenon in several places through the remainder of the course.
Contents
1 Binding energy of a one-dimensional ionic solid
Numerical integration
4 Electric field of a 1D Solid
Numerical differentiation
5 Motion of a charged DNA strand near a 1D solid
Numerical root-finding
7 Connecting the dots
Numerical interpolation
8 A smattering of other problems we'll discuss in 18.330
$$E = \sum_{\substack{n=-\infty \\ n\neq 0}}^{\infty} \frac{(-1)^n Q^2}{|n| D}$$
where we have excluded the self-interaction of the ion at the origin. Since the
ion at site $-n$ makes the same contribution to the sum as its counterpart at
site $+n$, we can restrict the sum to just the latter contributions and double the
result. (We can also pull the factor $Q^2/D$ out of the sum.) So we have
$$E = \frac{2Q^2}{D}\sum_{n=1}^{\infty} \frac{(-1)^n}{n}$$
$$E = \frac{2Q^2}{D}\,S$$
where we defined
$$S \equiv \sum_{n=1}^{\infty} \frac{(-1)^n}{n}. \qquad (3)$$
Our question now becomes: How do we obtain a numerical value for the sum
$S$?
1 You may have seen this formula written to look something more like $E = \frac{q_1 q_2}{4\pi\epsilon_0 d}$. Think
of (1) as a version of this formula in which we have absorbed the factor of $4\pi\epsilon_0$ into our units
of electric charge, so that it does not appear explicitly in our formulas.
A bad idea
In mathematics we are often encouraged to adopt divide-and-conquer approaches
to difficult problems. Heeding this advice, we might try to split the alternating
sum in (3) into two nonalternating sums: the first sum will add up the contributions of just the positive ions [i.e. just the even $n$ terms in (3)], while the
second sum will handle just the negative ions (the odd $n$ terms). In symbols
this looks like
$$S = S_+ - S_- \qquad (4)$$
$$S_+ = \sum_{n=1}^{\infty} \frac{1}{2n}, \qquad S_- = \sum_{n=1}^{\infty} \frac{1}{2n-1} \qquad (5)$$
Surely it will be easier to evaluate and combine two nonalternating sums than
a single alternating sum, right? Does this approach save us time? Do we get
out of the lab any faster? No. In fact, each of the two individual sums in (5) is
divergent, so equation (4) winds up looking like
$$S = \infty - \infty.$$
Try extracting six good digits out of that!
A better idea
A more promising approach is to use a computer to compute the partial sums
$S_N$, where
$$S_N = \sum_{n=1}^{N} \frac{(-1)^n}{n}. \qquad (6)$$
In the limit $N \to \infty$, $S_N$ approaches the number we are trying to compute.
Of course, our computer can't actually sum an infinite number of terms, but
we might evaluate $S_N$ for some large $N$ (say $N = 10^6$), and then again for
some larger $N$ (say $N = 10^7$), and keep going until the first 6 digits of $S_N$ stop
changing (they converge), whereupon we take those digits as our approximate
evaluation of $S$.
Here's a little computer program that executes this strategy. (In this course
you will find yourself writing many little computer programs like this.) This
program is written in the julia language, but it would look almost the same in
any other programming language.
N=10000
Sum=0.0
# add the terms of (6) in pairs: odd n contributes -1/n, even n+1 contributes +1/(n+1)
for n=1:2:N
    Sum += -1/n + 1/(n+1)
end
Sum
Running this program in julia yields the following output:
julia> Sum
-0.6930971830599458
What do we do with this number? Is it correct? Do we report it to our
boss and take off for the weekend? Before we know what to make of this
number (or any number emitted by a numerical code) we had better make
sure we understand how accurate it is. Let's run the above code again for various
different values of the N parameter and see what we get.
N       S_N
10^4    -0.693097183059946
10^5    -0.693142180584944
10^6    -0.693146680560232
10^7    -0.693147130559867
10^8    -0.693147175473699
10
10
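$$E \approx -0.693147\cdot\frac{2Q^2}{D}$$
$$\log(1 + x) = -\sum_{n=1}^{\infty} \frac{(-1)^n}{n}\,x^n.$$
Inserting $x = 1$ into both sides of this equation and flipping the signs yields
$$-\log 2 = \sum_{n=1}^{\infty} \frac{(-1)^n}{n}.$$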
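Knowing the exact answer makes it possible to measure how slowly the brute-force partial sums converge. The sketch below is an illustration that assumes the sign convention $S_N = \sum_{n=1}^{N} (-1)^n/n$, so that the limit is $-\log 2$; the function name S_partial is hypothetical.

```julia
# Partial sum S_N = Σ_{n=1}^{N} (-1)^n / n, accumulated in pairs (N assumed even).
function S_partial(N)
    s = 0.0
    for n in 1:2:N
        s += -1 / n + 1 / (n + 1)
    end
    return s
end

for N in (10^2, 10^4, 10^6)
    err = abs(S_partial(N) + log(2))
    println("N = ", N, "   S_N = ", S_partial(N), "   error = ", err,
            "   N * error = ", N * err)
end
```

The product of $N$ and the error should settle near $1/2$, i.e. the error only shrinks like $1/(2N)$: each additional correct digit costs ten times more terms.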
Now that you've computed the binding energy of the 1D solid, your boss asks
you to consider a slightly different question: Suppose a charged particle (such
as a charged strand of DNA) finds itself at a point $\mathbf{r}$ in the vicinity of our
solid. What potential energy does the DNA feel? In other words, what is the
electrostatic potential at a point $\mathbf{r}$ near the chain of ions?
$$\phi(\mathbf{r}) = \phi(x, y) = \sum_{n=-\infty}^{\infty} \frac{(-1)^n Q}{\sqrt{(x - nD)^2 + y^2}}.$$
For convenience in what follows, let's choose to work with units of charge and
distance such that $Q = D = 1$. Then our sum reads simply
$$\phi(x, y) = \sum_{n=-\infty}^{\infty} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}} \qquad (7)$$
Note that we are now talking about evaluating a function of $x$ and $y$ instead of
just a single number as in the previous section. This means our evaluation of
the sum must be efficient, since we will probably need to evaluate it for many
points $(x, y)$.
[Figure 3 plot: $\phi(x, y = 0.1)$ versus $x$.]
Figure 3: A plot of the electrostatic potential $\phi(x, y)$ near the ionic solid over
the interval $0 \le x \le 2$ with $y$ fixed at $y = 0.1$.
need high accuracy. For example, at the point (x, y) = (0.25, 0.25) we must sum
on the order of 106 terms to get 6digit accuracy, as shown below:
Convergence of $\phi(x, y)$ (brute-force summation) for (x, y) = (0.25, 0.25)

    n        nth term                   sum after n terms
    1        -2.0493756046200877        0.7790515201261021
    2        +1.007411529248624         1.7864630493747262
    3        -0.6689289797090223        1.117534069665704
    4        +0.500964129900977         1.618498199566681
    5        -0.40049593015285234       1.2180022694138286
    ...
    799998   +2.500006250015747e-6      1.3985540298633128
    799999   -2.500003125004028e-6      1.3985515298601878
    800000   +2.5000000000001218e-6     1.3985540298601877
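A brute-force evaluation of the sum (7) might look like the sketch below. It assumes, consistent with the first rows of the table, that the "nth term" combines the two ions at +n and -n; the name `phi_brute` is ours.

```julia
# Brute-force evaluation of equation (7): the n = 0 term plus the paired
# contributions of the ions at +n and -n, for n = 1, ..., nmax.
function phi_brute(x, y, nmax)
    s = 1 / sqrt(x^2 + y^2)                      # n = 0 term
    for n in 1:nmax
        pair = 1 / sqrt((x - n)^2 + y^2) + 1 / sqrt((x + n)^2 + y^2)
        s += (-1)^n * pair
    end
    return s
end

for nmax in (1, 5, 1000, 100_000)
    println(nmax, "  ", phi_brute(0.25, 0.25, nmax))
end
```

The printed partial sums keep oscillating around the limit, and the size of the oscillation shrinks only in proportion to 1/nmax.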
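$$\phi^{\text{nearby}}(x, y) = \sum_{|n| \le N} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}}, \qquad (8)$$
$$\phi^{\text{distant}}(x, y) = \sum_{|n| > N} \frac{(-1)^n}{\sqrt{(x - n)^2 + y^2}}, \qquad (9)$$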
where N defines the threshold between nearby and distant ions; ions further
than N sites away from the origin are considered distant. For example, here are
plots of $\phi^{\text{nearby}}$ and $\phi^{\text{distant}}$ for N = 20:
Figure 4: Contributions of (a) nearby and (b) distant ions to the potential
plotted in Figure 3. Note the different y-axis scales.
The plot of $\phi^{\text{nearby}}$, which is extremely cheap to calculate (it
involves a sum of just 41 terms), looks to the naked eye indistinguishable from
the plot of the full sum in Figure 3. This appearance is deceptive; as you can
see from the lower plot in Figure 4, the contribution of $\phi^{\text{distant}}$
is relevant already to the 2nd or 3rd digit of the full sum, so for 6-digit
accuracy we must certainly include this expensive-to-calculate contribution.
However, the lower plot in Figure 4 tells us something else that is interesting:
$\phi^{\text{distant}}$ does not vary rapidly over the interval in question.
Indeed, over this interval $\phi^{\text{distant}}$ is monotonic, and its value
changes by less than 1%. (This is in contrast to the behavior of
$\phi^{\text{nearby}}$, which exhibits hair-raising curves and death-defying
dips over the interval.) This means that $\phi^{\text{distant}}$ will have a
compact Fourier representation (that is, only a small number of terms in its
Fourier series will be relevant), and by going over to Fourier space we can
convert the expensive real-space sum in (9) into a Fourier-space sum whose
evaluation is no more costly than that of $\phi^{\text{nearby}}$.
To summarize,
• $\phi^{\text{nearby}}$ is a rapidly varying function of x, but one which we can
  compute rapidly in real space;
• $\phi^{\text{distant}}$ is costly to evaluate in real space, but is slowly
  varying, which means we can compute it rapidly in Fourier space.
The upshot is that the sum (7), for which naive evaluation requires summing
millions of terms to yield 6-digit accuracy, is replaced by the two sums (8)
and (9), each of which requires summing just a few terms to get 6-digit accuracy.
Convergence of Ewald Summation
The cursory sketch of Ewald summation that we presented above was slightly
cavalier; in particular, the simple definitions of $\phi^{\text{nearby}}$ and
$\phi^{\text{distant}}$ that we gave in equations (8) and (9) were somewhat
oversimplified (what's missing is the presence of a windowing function instead
of a sharp cutoff). We will discuss these details later.
However, we can't resist giving you a sneak peek at the convergence rate
evinced by actual Ewald summation, to be compared to the slow convergence
visible in Table 2. In actual Ewald summation, the functions defined by
equations (8) and (9) are replaced by similar functions we'll here call
$\phi^{\text{local}}$ and $\phi^{\text{remote}}$. The former is defined as a sum
of real-space contributions, while the latter is a sum of Fourier-space
contributions. The following tables, to be compared with Table 2, indicate the
rates at which these sums converge.
Real-space sum ($\phi^{\text{local}}$):

    term                        running sum
    -0.3893996144303278         1.3559522726396909
    +0.007629205898998581       1.3635814785386895
    -3.5342137585185434e-5      1.3635461364011043
    +2.8775270993747082e-8      1.3635461651763754
    -3.666105641950547e-12      1.3635461651727092
    +6.915955958575618e-17      1.3635461651727092
    -1.8760507978155556e-22     1.3635461651727092

Fourier-space sum ($\phi^{\text{remote}}$):

    term                        running sum
    0.03500661470136366         0.03500661470136366
    -1.304318427037022e-11      0.035006614688320475
    3.445626248073877e-29       0.035006614688320475

    $\phi^{\text{local}}$   = 1.3635461651727092
    $\phi^{\text{remote}}$  = 0.0350066146883204
    $\phi(0.25, 0.25)$ = 1.3985527798610296
Remark
Notice the progression of this computational example: We began with a
straightforward approach that, while theoretically sound and capable of yielding
correct answers, was not particularly sophisticated or powerful. Then, we
revisited the problem from a deeper and more insightful perspective and found⁴
a dazzlingly efficient solution.
4 Well, so far we have only sketched the solution; you'll have to trust us for
now that it actually does work.
Next your boss announces that, instead of a discrete 1D chain of ions, she needs
to know the potential near a continuous 1D strip characterized by a line charge
density $\lambda(x)$. Think of this as a version of the discrete ion chain in
which (1) the ions all have different charges, not just $\pm Q$; and (2) the ions
are all smushed together (or, equivalently, we zoom out our perspective) so that
we don't see the individual contributions of each ion, but rather just a
continuous charge density. The strip has finite length L.
Figure 5: A charged strip of length L has linear charge density $\lambda(x)$. We
want to compute the electrostatic potential at the point r.
$$\phi(x, y) = \int_{-L/2}^{L/2} \frac{\lambda(x')\, dx'}{\sqrt{(x - x')^2 + y^2}} \qquad (10)$$
$$\int_a^b f(x)\, dx \qquad (11)$$
To estimate the area of the geometric shape under the curve f(x), we approximate
it as a union of N rectangles. Each rectangle has base length
$\Delta = \frac{b - a}{N}$. The nth rectangle has height $f(a + n\Delta)$ (where
$x = a + n\Delta$ is the x-coordinate of its right edge), so it has area
$f(a + n\Delta)\,\Delta$. Thus the N-point rectangular-rule evaluation of (11) is
$$I_N^{\text{rect}} = \sum_{n=1}^{N} f(a + n\Delta)\, \Delta.$$
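The rectangular rule translates almost line for line into code. The sketch below is ours (the name `rect_rule` and the test integrand $x^2$ on [0, 1] are illustrative assumptions, not taken from the notes); it sums the sample values $f(a + n\Delta)$ for n = 1, ..., N and multiplies by the base length.

```julia
# N-point rectangular rule: sum of f(a + n*Δ) * Δ for n = 1, ..., N.
function rect_rule(f, a, b, N)
    Δ = (b - a) / N
    total = 0.0
    for n in 1:N
        total += f(a + n * Δ)
    end
    return total * Δ
end

exact = 1 / 3                       # integral of x^2 over [0, 1]
for N in (10, 100, 1000)
    approx = rect_rule(x -> x^2, 0.0, 1.0, N)
    println("N = ", N, "   error = ", abs(approx - exact))
end
```

For this integrand the error falls only in proportion to 1/N, so each additional digit of accuracy costs ten times as many function samples.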
[Figure: Phi plotted against x; data from ContinuousSolidPhi.dat.]
Of course, a plot
number of significant
this, let rect
be the
N
(10). Figure (8) plots
[Figure 8: relative integration error of the rectangular rule versus the number
of function samples N.]
Sneak preview
To give you just a sneak preview of what to expect when we discuss more
sophisticated methods of numerical integration, Figure 9 compares the relative
integration error (same quantity plotted in Figure 8) vs. number of function
samples for the rectangular rule discussed above and for the Clenshaw-Curtis
rule, a method of numerical integration that we will discuss in the second half
of the course. Already with just N = 150 samples the Clenshaw-Curtis rule
has converged to an error of $10^{-12}$, while (as we saw above) the rectangular
rule needed $N = 10^8$ samples just to achieve an error of $10^{-9}$! This
dramatic improvement in performance is the analog, for numerical integration, of
the performance improvement achieved by Ewald summation over brute-force
summation; it's another demonstration of the power of spectral methods.
An interesting property of the Clenshaw-Curtis rule (and, indeed, of many
sophisticated numerical integration strategies) is that it samples the function
at unevenly spaced points. The following figure shows the sample points used
by the 26-point rectangular and Clenshaw-Curtis rules to integrate a function
over the interval [-10, 10]. Note that the Clenshaw-Curtis points tend to cluster
near the endpoints of the interval, while the rectangular-rule points are evenly
spaced throughout the interval.
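The clustering can be seen by generating the sample points directly. The sketch below assumes the standard choice of Clenshaw-Curtis nodes, the Chebyshev extreme points $\cos(k\pi/n)$ mapped onto the interval; it computes only the points, not the quadrature weights, and the evenly spaced grid used for comparison (both endpoints included) is our own choice.

```julia
# Sample points of an (n+1)-point Clenshaw-Curtis rule on [a, b]:
# the Chebyshev extreme points cos(k*pi/n), mapped linearly from [-1, 1].
function clenshaw_curtis_points(a, b, n)
    return [(a + b) / 2 + (b - a) / 2 * cos(k * pi / n) for k in 0:n]
end

# 26 points of each kind on [-10, 10].
cc   = clenshaw_curtis_points(-10.0, 10.0, 25)
even = collect(range(-10.0, stop = 10.0, length = 26))

println("Clenshaw-Curtis gap at the edge:   ", cc[1] - cc[2])
println("Clenshaw-Curtis gap in the middle: ", cc[13] - cc[14])
println("Evenly spaced gap:                 ", even[2] - even[1])
```

The gap between neighboring Clenshaw-Curtis points is much smaller at the edge of the interval than in the middle, while the evenly spaced gap is the same everywhere.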
Figure 9: Like Figure 8, but now comparing the performance of the N-point
rectangular rule to that of the N-point Clenshaw-Curtis rule.
Figure 10: The x points at which the function f(x) is sampled by the 26-point
rectangular and 26-point Clenshaw-Curtis rules.
Now that we've delivered on our boss's request for values of the electrostatic
potential $\phi(x, y)$ in the vicinity of our 1D solid, suppose she needs answers
to a slightly different question: What is the electric field in the vicinity of
our solid?
Recall from basic electrostatics that the component of the electric field in
the x direction is minus the partial derivative of the potential with respect to
x:
$$E_x(x, y) = -\frac{\partial \phi(x, y)}{\partial x}. \qquad (12)$$
(In this section, as before, we will keep y fixed, so we really only have
functions of a single variable x, and partial derivatives are equivalent to
total derivatives.)
Of course, taking partial derivatives of functions is usually pretty easy, but
the difficulty in this case is that we don't have a closed-form expression for
the function $\phi$ in (12). Instead, what we essentially have is a black box for
computing $\phi$: We can give it any value of x we like, and it will give us back
a numerical value for $\phi$, but there's no expression to differentiate, so we
can't write down an expression for the derivative.
The standard way to differentiate a black-box function f(x) is called
finite-differencing. Recall the definition of the derivative of f at x:
$$f'(x) \equiv \lim_{\Delta \to 0} \frac{f(x + \Delta) - f(x)}{\Delta}.$$
The idea of finite-differencing is to arrest the limiting process here and
evaluate the ratio on the RHS at some finite value of $\Delta$. We call this
quantity the finite-difference approximation to the derivative at step size
$\Delta$:
$$f'_{\text{FD}}(\Delta; x) \equiv \frac{f(x + \Delta) - f(x)}{\Delta}.$$
As we compute this quantity for smaller and smaller values of $\Delta$, the
result should approach the correct value of $f'$.
To test this out, let's look at how it behaves in a simple case: the function
$f(x) = x^2$. Of course, this is not a black box (we can differentiate it
analytically to find $f'(x) = 2x$), but let's pretend it's a black box and see
how closely we can reproduce the known result by finite-differencing. The
following plot shows the result of finite-differencing to estimate the
derivative of $f(x) = x^2$ at the point x = 2.
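The same experiment takes only a few lines of code. In this sketch the step size is called Δ and the helper name `fd_derivative` is ours; the exact derivative at x = 2 is 4.

```julia
# Forward finite-difference approximation to f'(x) with step size Δ.
fd_derivative(f, x, Δ) = (f(x + Δ) - f(x)) / Δ

f(x) = x^2
for Δ in (1e-1, 1e-2, 1e-3, 1e-4)
    approx = fd_derivative(f, 2.0, Δ)
    println("Δ = ", Δ, "   estimate = ", approx, "   error = ", abs(approx - 4.0))
end
```

For this particular f the quotient equals 4 + Δ in exact arithmetic, so over this range of step sizes the error shrinks in direct proportion to the step.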
[Figure 11: result of finite-differencing $f(x) = x^2$ at x = 2, plotted against
the step size.]
Figure 12: Same as Figure 11, but now showing a wider range of the x axis.
OK, so now you've delivered to your boss an accurate way to compute the
electric field in the vicinity of the 1D solid. What kinds of things might she do
with this information? Here's one possibility. Suppose we place a little strand
of DNA, which we model as a point particle of charge q, at a point near the
solid. The electric field of the solid exerts a force on the particle, which
causes it to accelerate and start moving. What trajectory will it traverse?
To simplify the calculation initially, let's suppose we somehow fix the
y-coordinate of the DNA strand at some fixed value of y, say $y = y_0$ (for
example, perhaps the particle is constrained to move along a wire held parallel
to the solid at a distance of $y_0 = 0.1$ length units). The x-coordinate of the
particle will be a function of time, x = x(t), and our goal is to compute this
function.
In the previous section we discussed techniques for computing the x-component
of the electric field at arbitrary points (x, y); for now let's forget about how
we compute this quantity and just denote it $E_x(x, y)$. Then the force on the
DNA strand is $F = qE_x(x, y_0)$, and this force is related to the acceleration
of the particle by Newton's second law of motion:
$$m \frac{d^2 x(t)}{dt^2} = q E_x\big(x(t), y_0\big). \qquad (13)$$
fall into the category of basic numerical calculus and can be introduced already
in the first half of our course.
Boundary-value problems
The basic algorithm we just discussed for integrating ODEs started with the
position and velocity of the particle at a fixed time. For example, perhaps we
know that at time $t_0$ the DNA particle is at position $x_0$ with velocity
$v_0$, and we want to predict its future trajectory. This is an initial-value
problem.
Here's a different sort of problem: Suppose we have the positions of the
particle at two times. For example, we measure experimentally that at time
$t_0$ it is at point $x_0$, while at time $t_1$ it is at point $x_1$; meanwhile,
we don't know the velocities at either point. Can we solve equation (13) to
reconstruct the trajectory followed by the particle in between the two times?
This is a boundary-value problem, and methods for solving it take on a rather
different form from methods for solving initial-value problems. (Indeed, the
algorithm discussed above for IVPs can't even get started for BVPs, because we
don't know the velocity at the starting point.) We will discuss both types of
problems in 18.330.
Newton's Method
We will study several methods for solving numerical root-finding problems. One
method which is particularly simple to describe and which works well in many
cases is Newton's method. To find a root of a function f(x), Newton's method
goes like this:
1. Make an initial guess $x_1$ as to the location of the root.
2. Compute the two numbers $f(x_1)$ and $f'(x_1)$ (value and derivative of f at
$x_1$).
3. Set
$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}.$$
    n    x_n
    1    4.400000000000000
    2    5.154730677706086
    3    4.997518482593209
    4    5.000000010187351
    5    5.000000000000000
After just 4 applications of the method, we have computed our root to 16-digit
accuracy!
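The update step of Newton's method is easy to put into code. In the sketch below, the function g(x) = x^2 - 25 is a hypothetical stand-in with a root at x = 5; it is not the function that produced the table above, so the intermediate iterates differ. The names `newton`, `g` and `gprime` are ours.

```julia
# Newton's method: repeat x <- x - f(x)/f'(x) a fixed number of times.
function newton(f, fprime, x1, nsteps)
    x = x1
    for _ in 1:nsteps
        x = x - f(x) / fprime(x)
    end
    return x
end

# Hypothetical test function with a root at x = 5 (not the function
# behind the table above).
g(x)      = x^2 - 25
gprime(x) = 2x

for nsteps in 0:5
    println(nsteps, "  ", newton(g, gprime, 4.4, nsteps))
end
```

Once the iterate is close to the root, the number of correct digits roughly doubles with each step, the same rapid convergence the table shows.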
x2 = x1
which is over 1,000 times further from the correct root than our starting guess!
Thus Newton's method must be used in practice with considerable care.
(a) $f(z) = z^2 - 1$
(b) $f(z) = z^2 + 1$
Figure 13: Convergence of Newton's method for roots of the polynomials
$f(z) = z^2 - 1$ (top) and $f(z) = z^2 + 1$ (bottom). In these plots, each point
z in the complex plane is assigned a color based on the convergence of Newton's
method when started with initial guess $z_1 = z$. In the upper plot, red
(yellow) denotes convergence to +1 (-1). In the lower plot, red (yellow) denotes
convergence to +i (-i).
Finally, consider the nonlinear function $f(z) = z^3 - 1$, which has the three
complex roots
$$z = 1, \quad e^{2\pi i/3}, \quad e^{4\pi i/3}.$$
Based on the experience of Figure 13, we might expect the convergence plot for
Newton's method on this function to look like the complex plane divided up
into three sectors. This is not quite what happens:
Figure 14: Convergence of Newton's method for roots of the polynomial
$f(z) = z^3 - 1$. Grey, red, and yellow denote convergence to $1$,
$e^{2\pi i/3}$, $e^{4\pi i/3}$. The darker the color, the more rapid the
convergence.
We instead get a fractal, illustrating both the promise and peril of naive use
of numerical root-finding tools.
Look back at Figure 3 for the potential of the 1D ionic solid as a function of
the x coordinate. Even with the acceleration techniques discussed in previous
sections, it may be quite time-consuming to compute $\phi$ at every point for
which we need a value. But a glance at Figure 3 suggests that perhaps we don't
need to calculate $\phi$ at such a dense grid of points. Instead, perhaps we
could tabulate $\phi$ on a coarse grid of points, and then infer values at
intermediate points by somehow connecting the dots in some reasonable way. This
is the idea of interpolation.
To see how this works, suppose we have used our computational algorithms
to evaluate $\phi(x, y = 0.1)$ at 5 equally spaced points between x = 0 and
x = 2. We'd like to draw a curve that runs through these points; by forcing the
curve to match $\phi$ exactly at these points, we hope to find that it
approximates $\phi$ in between those points.
An obvious choice for such a curve is a polynomial. Indeed, given any 5
points in the plane $(x_i, y_i)$, i = 1, ..., 5, with distinct $x_i$, there is a
unique polynomial f(x) of degree at most 4 that runs through all the points.
(More generally, given any N such points there is a unique polynomial of degree
at most N - 1 running through them.) Leaving aside for the moment the question
of how we find this polynomial, let's look at how well it mimics the actual
function we are trying to replicate.
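One way to evaluate the interpolating polynomial, without ever writing down its coefficients, is the Lagrange form. The sketch below is one possible approach rather than the method the course goes on to develop; the function name `lagrange_interp` is ours, and the sample data use cos(3x) as a stand-in for the potential.

```julia
# Evaluate at x the unique polynomial of degree at most length(xs) - 1 that
# passes through the points (xs[i], ys[i]), using the Lagrange form.
function lagrange_interp(xs, ys, x)
    total = 0.0
    for i in eachindex(xs)
        basis = 1.0
        for j in eachindex(xs)
            if j != i
                basis *= (x - xs[j]) / (xs[i] - xs[j])
            end
        end
        total += ys[i] * basis
    end
    return total
end

# Stand-in data (an assumption): 5 equally spaced samples of cos(3x) on [0, 2].
xs = collect(range(0.0, stop = 2.0, length = 5))
ys = cos.(3 .* xs)
println(lagrange_interp(xs, ys, 0.25), "  vs  ", cos(3 * 0.25))
```

By construction the polynomial reproduces the data exactly at the sample points; how well it does in between is the question the following figures explore.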
Figure 15: The green dots are values of the function $\phi(x, 0.1)$ from the
previous section evaluated at 5 evenly spaced points in the interval
x = [0, 2]. The red curve is the unique 4th-degree polynomial running through
the green dots. For reference, the dotted curve shows the actual function
$\phi(x, 0.1)$ that we are trying to mimic with the red curve.
Well, the polynomial is not doing a particularly good job of matching the
behavior of the function in between the data points, but perhaps that's to be
expected for such a low-order approximation. Perhaps if we try again with a
higher-degree polynomial we'll have better luck? Let's try fitting an
eighth-degree polynomial through 9 data points.
Figure 16: Like the previous figure, but now showing the unique 8th-degree
polynomial running through 9 evenly spaced function samples.
Hmmm. In at least some places, the red curve here seems to be doing a
slightly better job of replicating the dashed black curve than we saw previously.
However, there is a troubling spike near the edges of the interval in which the
polynomial deviates significantly from the function we're trying to approximate.
Does this mean again that we simply chose too low a degree? Let's try again
with a polynomial of still higher degree.
Figure 17: Like the previous figure, but now showing the unique 14th-degree
polynomial running through 15 evenly spaced function samples.
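$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$$
...as well as the following, perhaps equally beautiful but much more bizarre,
result:
$$\sum_{n=-\infty}^{\infty} e^{-\pi n^2 x} = \frac{1}{\sqrt{x}} \sum_{n=-\infty}^{\infty} e^{-\pi n^2 / x}$$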
Bump functions. Here's a challenge question for you: Can you design
a single-variable function f(x) that simultaneously satisfies the following
two conditions?
1. f(x) must be everywhere continuous and infinitely differentiable.
2. f(x) must be identically zero except on a finite length of the real line.
(For example, f(x) must vanish identically for x outside the interval
[-1, 1].)
It is not even obvious that such a function can exist, much less how to
construct it, but we will dissect these mysteries in 18.330.
Overview
Figure 1: In a 7-digit fixed-point system, each number consists of a string of 7
digits, each of which may run from 0 to 9.
For example, the number 12.34 would be represented in the form
are
$$S^{\text{rep}} = \{ -999.9999,\ -999.9998,\ -999.9997,\ \ldots,\ -000.0001,\ +000.0000,\ +000.0001,\ +000.0002,\ \ldots,\ +999.9998,\ +999.9999 \}$$
Notice something about this list of numbers: They are all separated by the
same absolute distance, in this case 0.0001. Another way to say this is that the
density of the representable set is uniform over the real line (at least between
the endpoints, $R^{\max}_{\min} = \pm 999.9999$): Between any two real numbers
$r_1$ and $r_2$ separated by the same distance lie the same number of exactly
representable fixed-point numbers. For example, between 1 and 2 there are $10^4$
exactly representable fixed-point numbers, and between 101 and 102 there are
also $10^4$ exactly representable fixed-point numbers.
Rounding error
Another way to characterize the uniform density of the set of exactly
representable fixed-point numbers is to ask this question: Given an arbitrary
real number r in the interval $[R_{\min}, R_{\max}]$, how near is the nearest
exactly representable fixed-point number? If we denote this number by fi(r),
then the statement that holds for fixed-point arithmetic is:
$$\text{for all } r \in \mathbb{R},\ R_{\min} < r < R_{\max},\ \exists\, \epsilon \text{ with } |\epsilon| \le \text{EPSABS such that } \mathrm{fi}(r) = r + \epsilon. \qquad (1)$$
In equation (1), EPSABS is a fundamental quantity associated with a given
fixed-point representation scheme; it is the maximum absolute error incurred in
the approximate fixed-point representation of real numbers. For the particular
fixed-point scheme depicted in Figure 1, we have EPSABS = 0.00005.
The fact that the absolute rounding error is uniformly bounded is
characteristic of fixed-point representation schemes; in floating-point schemes
it is the relative rounding error that is uniformly bounded, as we will see
below.
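The bound on the absolute rounding error can be checked numerically. The sketch below models the rounding function fi by rounding to the nearest multiple of 0.0001 in ordinary double precision, which is an approximation of the 7-digit scheme, not an implementation of it; the sample values of r are arbitrary.

```julia
# Round r to the nearest multiple of 0.0001, mimicking the 7-digit
# fixed-point system with 4 digits after the decimal point.
fi(r) = round(r * 10^4) / 10^4

EPSABS = 0.00005
for r in (12.34, 24 / 7, 3.14159265, 742.55)
    println(r, "  ->  ", fi(r), "   |fi(r) - r| = ", abs(fi(r) - r))
end
```

In every case the distance between r and fi(r) stays at or below EPSABS, regardless of how large or small r is.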
Error-free calculations
There are many calculations that can be performed in a fixed-point system with
no error. For example, suppose we want to add the two numbers 12.34 and
742.55. Both of these numbers are exactly representable in our fixed-point
system, as is their sum, 754.89.
Figure 3: Arithmetic operations in which both the inputs and the outputs are
exactly representable incur no error.
Non-error-free calculations
On the other hand, here's a calculation that is not error-free.
Figure 4: A calculation that is not error-free. The exact answer here is
24/7 = 3.42857142857143..., but with finite precision we must round the answer
to the nearest representable number.
Overflow
The error in Figure 4 is not particularly serious. However, there is one type of
calculation that can go seriously wrong in a fixed-point system. Suppose, in the
calculation of Figure 3, that the first summand were 412.34 instead of 12.34.
The correct sum is
412.34 + 742.55 = 1154.89.
However, in fixed-point arithmetic, our calculation looks like this:
Figure 5: Overflow in fixed-point arithmetic.
The leftmost digit of the result has fallen off the end of our computer! This
is the problem of overflow: the number we are trying to represent does not fit
in our fixed-point system, and our fixed-point representation of this number is
not even close to being correct (154.89 instead of 1154.89). If you are lucky,
your computer will detect when overflow occurs and give you some notification,
but in some unhappy situations the (completely, totally wrong) result of this
calculation may propagate all the way through to the end of your calculation,
yielding highly befuddling results.
The problem of overflow is greatly mitigated by the introduction of
floating-point arithmetic, as we will discuss next.
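Overflow of this kind is easy to imitate with integers. The sketch below is an assumed model of the register, not a description of real hardware: each number is stored as an integer count of 0.0001 units, only the lowest 7 decimal digits of a result are kept, and signs are ignored. The helper names are ours.

```julia
# Store each number as an integer count of 0.0001 units, and keep only the
# lowest 7 decimal digits of any result (an assumed model of the register).
to_fixed(r)   = round(Int, r * 10^4)
wrap7(k)      = mod(k, 10^7)
from_fixed(k) = k / 10^4

ok       = wrap7(to_fixed(12.34)  + to_fixed(742.55))
overflow = wrap7(to_fixed(412.34) + to_fixed(742.55))

println(from_fixed(ok))        # the sum fits in 7 digits
println(from_fixed(overflow))  # the leading digit of 1154.89 is lost
```

Nothing in the wrapped result signals that a digit was dropped, which is why overflow can go unnoticed until much later in a calculation.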
$10^{100}$; in 64-bit IEEE double-precision binary floating-point (the usual
floating-point scheme you will use in numerical computing) the maximum
representable number is something closer to $R_{\max} \approx 10^{308}$. We are
not being particularly precise in pinning down these maximum representable
numbers, because in practice you should never get anywhere near them: if you are
doing a calculation in which numbers on the order of $10^{300}$ appear, you are
doing something wrong.
    Month            US population (thousands)
    February 2011    311,189
    March 2011       311,356
Table 1: Monthly U.S. population data for February and March 2011.
These data have enough precision to allow us to compute the actual change
in population (in thousands) to three-digit precision:
$$311{,}356 - 311{,}189 = 167. \qquad (3)$$
But now suppose we try to do this calculation using the floating-point system
discussed in the previous section, in which the mantissa has 5-digit precision.
The floating-point representations of the numbers in Table 1 are
$$\text{fl}(311{,}356) = 3.1136 \times 10^5$$
$$\text{fl}(311{,}189) = 3.1119 \times 10^5$$
1 http://research.stlouisfed.org/fred2/series/POPTHM/downloaddata?cid=104
Subtracting, we find
$$3.1136 \times 10^5 - 3.1119 \times 10^5 = 1.7000 \times 10^2 \qquad (4)$$
Comparing (3) and (4), we see that the floating-point version of our answer is
170, to be compared with the exact answer of 167. Thus our floating-point
calculation has incurred a relative error of about $2 \times 10^{-2}$. But, as
noted above, the value of EPSREL for our 5-significant-digit floating-point
scheme is approximately $10^{-5}$! Why is the error in our calculation 2000
times larger than machine precision?
What has happened here is that almost all of our precious digits of precision
are wasted because the numbers we are subtracting are much bigger than their
difference. When we use floating-point registers to store the numbers 311,356
and 311,189, almost all of our precision is used to represent the digits 311,
which are the ones that give zero information for our calculation because they
cancel in the subtraction.
More generally, if we have N digits of precision and the first M digits of
x and y agree, then we can only compute their difference to around N - M
digits of precision. We have thrown away M digits of precision! When M is
large (close to N), we say we have experienced catastrophic loss of numerical
precision. Much of your work in practice as a numerical analyst will be in
developing schemes to avoid catastrophic loss of numerical precision.
In 18.330 we will refer to catastrophic loss of precision as the big
floating-point kahuna. It is the one potential pitfall of floating-point
arithmetic that you must always have in the back of your mind.
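The population example can be replayed in code. The sketch below imitates a 5-significant-digit decimal register by rounding with Julia's `round(x, sigdigits = 5)`; the helper name `fl5` is ours, and ordinary double precision is doing the arithmetic underneath.

```julia
# Mimic a 5-significant-digit decimal floating-point register.
fl5(x) = round(x, sigdigits = 5)

a, b = 311_356.0, 311_189.0
exact   = a - b                  # 167
rounded = fl5(fl5(a) - fl5(b))   # what the 5-digit scheme computes

println("exact difference:   ", exact)
println("5-digit difference: ", rounded)
println("relative error:     ", abs(rounded - exact) / exact)
```

The relative error of the difference comes out near 2%, even though each stored input is accurate to a relative error of order $10^{-5}$.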
$$\frac{f(x+h) - f(x)}{h}$$
for various floating-point stepsizes h.
Stepsize h = 2/3
$$\text{fl}(h) = 0.66667 \qquad (6a)$$
$$f(x+h) = 1.6667, \qquad f(x) = 1.0000, \qquad f(x+h) - f(x) = 0.6667 \qquad (6b)$$
and thus
$$\frac{f(x+h) - f(x)}{h} = \frac{0.66670}{0.66667} \qquad (6c)$$
The numerator and denominator here begin to differ in their 4th digits, so their
ratio deviates from 1 by around $10^{-4}$. Thus we find
$$f'_{\text{FD}}\left(h = \tfrac{2}{3}, x\right) = 1 + O(10^{-4}) \qquad (6d)$$
and thus, since $f'_{\text{exact}} = 1$,
for h = 2/3 the error in $f'_{\text{FD}}(h, x)$ is about $10^{-4}$. (6e)
Stepsize h = 2/30
Now let's shrink the stepsize by a factor of 10 and try again. Like the old
stepsize h = 2/3, the new stepsize h = 2/30 is not exactly representable. In our
5-decimal-digit floating-point scheme, it is rounded to
$$\text{fl}(h) = 0.066667 \qquad (7a)$$
Note that our floating-point scheme allows us to specify this h with just as much
precision as we were able to specify the previous value of h [equation (6a)],
namely 5-digit precision. So we certainly don't suffer any loss of precision at
this step.
The sequence of floating-point numbers that our computation generates is
now
$$f(x+h) = 1.0667, \qquad f(x) = 1.0000, \qquad f(x+h) - f(x) = 0.0667 \qquad (7b)$$
and thus
$$\frac{f(x+h) - f(x)}{h} = \frac{0.066700}{0.066667} \qquad (7c)$$
Now the numerator and denominator begin to disagree in the third decimal
place, so the ratio deviates from 1 by around $10^{-3}$, i.e. we have
$$f'_{\text{FD}}\left(h = \tfrac{2}{30}, x\right) = 1 + O(10^{-3}) \qquad (7d)$$
and thus, since $f'_{\text{exact}} = 1$,
for h = 2/30 the error in $f'_{\text{FD}}(h, x)$ is about $10^{-3}$. (7e)
Analysis
The key equations to look at are (6b) and (7b). As we noted above, our
floating-point scheme represents 2/3 and 2/30 with the same precision, namely
5 digits. Although the second number is 10 times smaller, the floating-point
scheme uses the same mantissa for both numbers and just adjusts the exponent
appropriately.
The problem arises when we attempt to cram these numbers inside a
floating-point register that must also store the quantity 1, as in (6b) and
(7b). Because the overall scale of the number is set by the 1, we can't simply
adjust the exponent to accommodate all the digits of 2/30. Instead, we lose
digits off the right end; more specifically, we lose one more digit off the
right end in (7b) than we did in (6b). However, when we go to perform the
division in (6c) and (7c), the denominator is the same 5-digit-accurate h value
we started with [eqs. (6a) and (7a)]. This means that each digit we lost by
cramming our number together with 1 now amounts to an extra lost digit of
precision in our final answer.
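The two stepsize experiments can be imitated as follows. This sketch assumes the test function is f(x) = x evaluated at x = 1 (consistent with f(x) = 1.0000 and an exact derivative of 1 in the equations above) and again uses `round(x, sigdigits = 5)` to stand in for a 5-digit register; the names `fl5` and `fd_ratio_5digit` are ours.

```julia
# Mimic a 5-significant-digit decimal floating-point register.
fl5(x) = round(x, sigdigits = 5)

# Finite-difference quotient for f(x) = x at x = 1, rounding every stored
# quantity to 5 digits. The final ratio is left unrounded so that its
# deviation from the exact derivative f'(x) = 1 stays visible.
function fd_ratio_5digit(h_exact)
    h    = fl5(h_exact)
    fxh  = fl5(1.0 + h)
    diff = fl5(fxh - 1.0)
    return diff / h
end

for h in (2 / 3, 2 / 30)
    println("h = ", h, "   error = ", abs(fd_ratio_5digit(h) - 1.0))
end
```

Shrinking the stepsize by a factor of 10 makes the error about 10 times larger, because one more digit of h is pushed off the end of the register when it is added to 1.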
$$f(x, \Delta) = \sqrt{x + \Delta} - \sqrt{x}.$$
When $x \gg \Delta$, the two terms on the RHS are nearly equal, and subtracting
them gives rise to catastrophic loss of precision. For example, if x = 900 and
$\Delta$ = 4e-3, the calculation on the RHS becomes
$$30.00006667 - 30.00000000$$
and we waste the first 6 decimal digits of our floating-point precision; in the
5-decimal-digit scheme discussed above, this calculation would yield precisely
zero useful information about the number we are seeking.
However, there is a simple workaround. Consider the identity
$$\left(\sqrt{x + \Delta} - \sqrt{x}\right)\left(\sqrt{x + \Delta} + \sqrt{x}\right) = (x + \Delta) - x = \Delta,$$
which we might rewrite in the form
$$\sqrt{x + \Delta} - \sqrt{x} = \frac{\Delta}{\sqrt{x + \Delta} + \sqrt{x}}.$$
The RHS of this equation is a safe way to compute a value for the LHS; for
example, with the numbers considered above, we have
$$\frac{4\text{e-}3}{30.0000667 + 30.0000000} \approx 6.667\text{e-}5.$$
Even if we can't store all the digits of the numbers in the denominator, it
doesn't matter; in this way of doing the calculation those digits aren't
particularly relevant anyway.
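The difference between the two formulas shows up clearly in single precision. The sketch below uses Julia's Float32 type rather than the 5-decimal-digit scheme of the text, so the number of digits lost differs; the function names are ours.

```julia
# Two ways to compute sqrt(x + Δ) - sqrt(x).
naive_diff(x, Δ)  = sqrt(x + Δ) - sqrt(x)
stable_diff(x, Δ) = Δ / (sqrt(x + Δ) + sqrt(x))

# Single precision makes the loss of digits easy to see.
x32, Δ32 = 900.0f0, 4.0f-3
println("naive  (Float32): ", naive_diff(x32, Δ32))
println("stable (Float32): ", stable_diff(x32, Δ32))
println("stable (Float64): ", stable_diff(900.0, 4e-3))
```

The stable single-precision result agrees with the double-precision value to nearly all of its digits, while the naive single-precision result is already wrong in the third or fourth digit.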
[Figure: relative error of direct summation versus N.]
[Figure: relative error of direct and recursive summation versus N.]
2 Caution: The function RecursiveSum as implemented here actually only works for
even values of N. Can you see why? For the full, correctly implemented version
of the function, see the code RecursiveSum.jl available from the Lecture Notes
section of the website.
Analysis
Why does such a simple prescription so thoroughly cure the disease? The basic
intuition is that, in the case of DirectSum with large values of N, by the time
we are on the 10,000th loop iteration we are adding X to a number that is
$10^4$ times bigger than X. That means we instantly lose 4 digits of precision
off the right end of X, giving rise to a random rounding error. As we go to
higher and higher loop iterations, we are adding the small number X to larger
and larger numbers, thus losing more and more digits off the right end of our
floating-point register.
In contrast, in the RecursiveSum approach we never add X to any number
that is more than BaseCaseThreshold times greater than X. This limits the
number of digits we can ever lose off the right end of X. Higher-level additions
are computing the sum of numbers that are roughly equal to each other, in
which case the rounding error is on the order of machine precision (i.e. tiny).
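The contrast between the two approaches can be sketched as follows. This is our own minimal version, not the course's DirectSum or RecursiveSum.jl code: the function names, the default threshold of 100 (playing the role of BaseCaseThreshold), and the single-precision test vector are all assumptions made for illustration. Unlike the version footnoted above, it handles odd lengths as well.

```julia
# Direct summation: add the entries one at a time, left to right.
function direct_sum(v)
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

# Pairwise summation: split the vector in half, sum each half recursively,
# and add the two results. Short pieces are summed directly.
function pairwise_sum(v, threshold = 100)
    n = length(v)
    if n <= threshold
        return direct_sum(v)
    end
    mid = n ÷ 2
    return pairwise_sum(view(v, 1:mid), threshold) +
           pairwise_sum(view(v, mid+1:n), threshold)
end

v = fill(0.1f0, 10^6)          # one million copies of 0.1 in single precision
println("direct:   ", direct_sum(v))
println("pairwise: ", pairwise_sum(v))
```

The pairwise result lands very close to 100000, while the direct result drifts away from it by a visibly large amount.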
For a more rigorous analysis of the error in direct and pairwise summation,
see the Wikipedia page on the topic³, which was written by MIT's own Professor
Steven Johnson.
3 http://en.wikipedia.org/wiki/Pairwise_summation
As noted above, modern computers use both fixed-point and floating-point
numbers.
julia> exp(1000)
Inf
On the other hand, if you attempt to perform an ill-defined calculation like
0.0/0.0 then the result will be a special number called NaN ("not a number").
This special number has the property that all arithmetic operations involving
NaN result in NaN. (For example, 4.0+NaN=NaN and 1000.0*NaN=NaN.)
What this means is that, if you are running a big calculation in which any
one piece evaluates to NaN (for example, a single entry in a matrix), that NaN will
propagate all the way through the rest of your calculation and contaminate the
final answer. If your calculation takes hours to complete, you will be an unhappy
camper upon arriving the following morning to check your data and discovering
that a NaN somewhere in the middle of the night has corrupted everything. (I
speak from experience.) Be careful!
NaN also satisfies the curious property that it is not equal to itself:
julia> x=0.0 / 0.0
NaN
julia> y=0.0 / 0.0
NaN
julia> x==y
false
julia>
This fact can actually be used to test whether a given number is NaN.
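A short sketch of both ways to detect a NaN, and of how one bad entry contaminates a whole reduction. The array `data` is made up for illustration; `isnan`, `sum` and `any` are standard Julia functions.

```julia
x = 0.0 / 0.0

println(x == x)      # false: NaN is not equal to itself
println(x != x)      # true, so this comparison detects NaN
println(isnan(x))    # true: the built-in test

# NaN propagates through arithmetic and through reductions.
data = [1.0, 2.0, x, 4.0]
println(sum(data))               # NaN
println(any(isnan, data))        # true: a cheap check before a long run continues
```

Checking intermediate results with `any(isnan, ...)` at a few strategic points is an inexpensive way to catch the problem early rather than the following morning.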
Arbitrary-precision arithmetic
In the examples above we discussed the kinds of errors that can arise when
you do floating-point arithmetic with a finite-length mantissa. Of course it
is possible to chain together multiple floating-point registers to create a
longer mantissa and achieve any desired level of floating-point precision. (For
example, by combining two 64-bit registers we obtain a 128-bit register, of
which we might set aside 104 bits for the mantissa, roughly doubling the number
of significant digits we can store.) Software packages that do this are called
arbitrary-precision arithmetic packages; an example is the GNU MP library⁴.
4 http://gmplib.org
Be forewarned, however, that arbitrary-precision arithmetic packages are not
a panacea for numerical woes. The basic issue is that, whereas single-precision
and double-precision floating-point arithmetic operations are performed in
hardware, arbitrary-precision operations are performed in software, incurring
massive overhead costs that may well run to a factor of 100 or greater. So you
should think of arbitrary-precision packages as somewhat extravagant luxuries,
to be resorted to only in rare cases when there is absolutely no other way to do
what you need.
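Julia ships with an arbitrary-precision floating-point type, BigFloat, which makes it easy to experiment. The sketch below reuses the square-root difference with x = 900 and a step of 0.004 as a test case; that choice of example is ours.

```julia
# BigFloat is Julia's built-in arbitrary-precision floating-point type.
x = BigFloat(900)
Δ = parse(BigFloat, "0.004")

naive_big = sqrt(x + Δ) - sqrt(x)              # many digits to spare
naive_f64 = sqrt(900.0 + 0.004) - sqrt(900.0)  # ordinary double precision

println("BigFloat precision (bits): ", precision(BigFloat))
println("BigFloat result: ", naive_big)
println("Float64 result:  ", naive_f64)

# Precision can be raised for a block of code.
wider = setprecision(BigFloat, 512) do
    sqrt(BigFloat(900) + parse(BigFloat, "0.004")) - sqrt(BigFloat(900))
end
println("512-bit result:  ", wider)
```

With this many bits, even the cancellation-prone formula keeps far more correct digits than double precision can hold, at the price of every operation being carried out in software.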
Contents
1 Overview
2 Analog modulation
  2.1 Amplitude modulation (AM)
  2.2 Phase and frequency modulation (PM and FM)
3 Digital modulation
  3.1 OOK
  3.2 BPSK, QPSK, MPSK
  3.3 QAM
  3.4 Spectral efficiency
4 Multiplex methods
  4.1 The cocktail party
  4.2 How CDMA works
5 Lock-in amplifiers
  5.1 How lock-in amplifiers work
Overview
1 A bandlimited signal with bandwidth $\Delta\omega$ is a function f(t) whose
Fourier transform $\tilde{f}(\omega)$ is zero for frequencies outside an interval
of width $\Delta\omega$. A baseband signal is a signal whose frequency spectrum
is centered at $\omega = 0$.
2 Analog modulation
2.1 Amplitude modulation (AM)
The simplest way to modulate a signal is just to translate the entire frequency
spectrum of $f(t)$ so that it is centered around the carrier frequency. This process
is called amplitude modulation (AM). Historically, AM was the first modulation
scheme used for wireless communications in the early 20th century, and it remains in use to this day in AM radio. It was used in the first widely available
cellular telephone system, the AMPS system, in the 1980s. It was also used
for terrestrial television transmission until 2009. However, in the mid-20th century it was superseded by FM, and in the late 20th century analog modulation
was essentially replaced altogether (for communications applications, anyway)
by digital modulation. On the other hand, amplitude modulation remains in
widespread use for the purposes of lock-in detection, discussed later.
Implementation of AM transmitters
The simplest way to do AM is just to multiply the carrier signal $f^{\text{carrier}}(t) = \cos\omega_c t$
by the baseband signal $f^{BB}(t)$:
$$f^{AM}(t) = f^{BB}(t)\,\cos\omega_c t \qquad (1)$$
In other words, the modulated signal is just a sinusoid at the carrier frequency,
but with a time-varying amplitude defined by $f^{BB}(t)$. The baseband signal
modulates the amplitude of the carrier; this is the origin of the name amplitude
modulation.
Spectrum of AM signals
It's easy to determine the frequency spectrum of an AM signal. As a first step,
suppose the baseband signal consists of just a single tone with frequency $\omega_{BB}$
and amplitude $A$:
$$f^{BB}(t) = A\cos\omega_{BB} t. \qquad (2)$$
The modulated signal is
$$f^{AM}(t) = A\cos(\omega_{BB} t)\,\cos(\omega_c t).$$
To compute the frequency spectrum of this signal, we could now apply Fourier
analysis techniques, but as it turns out we don't need to, because we can just
appeal to the trigonometric identity
$$\cos a\,\cos b = \frac{1}{2}\Big[\cos(a+b) + \cos(a-b)\Big] \qquad (3)$$
to write
$$f^{AM}(t) = \frac{A}{2}\Big[\cos(\omega_c + \omega_{BB})t + \cos(\omega_c - \omega_{BB})t\Big]. \qquad (4)$$
Each term in this sum contributes two terms to the frequency spectrum of the
output signal, just as in equation (4):
$$f^{AM}(t) = \sum_n f^{BB}(\omega_n)\Big[\cos(\omega_c + \omega_n)t + \cos(\omega_c - \omega_n)t\Big]. \qquad (5)$$
Single-sideband AM
As we noted above, the frequency spectrum of a naive AM signal contains two
redundant copies of the information we are trying to transmit. This means that
the transmit signal actually has twice as much bandwidth as it nominally needs
to transmit the requisite information.
It is possible to circumvent this redundancy by use of a technique known as
single-sideband modulation. This is based on the following modified version of
the trig identity (3):
$$\cos a\,\cos b - \sin a\,\sin b = \cos(a+b).$$
To see how single-sideband modulation works in practice, suppose again that
our baseband signal consists of the single tone
$$f^{BB}(t) = A\cos\omega_{BB} t.$$
What we do is to form the $\pi/2$-shifted version of this signal:
$$f^{BB}_{\pi/2}(t) = A\sin\omega_{BB} t.$$
Then we multiply $f^{BB}(t)$ by the original carrier signal $\cos\omega_c t$, we multiply the $\pi/2$-shifted baseband signal by the $\pi/2$-shifted carrier signal, and we
subtract:
$$f^{SSAM}(t) = f^{BB}(t)\,\cos\omega_c t - f^{BB}_{\pi/2}(t)\,\sin\omega_c t$$
For the case of the single-tone baseband signal, the transmit signal now contains
only the single tone $\omega_c + \omega_{BB}$; the lower-sideband tone at $\omega_c - \omega_{BB}$ has been
suppressed. More generally, if $f^{BB}(t)$ contains a spectrum of frequencies, the
transmitted signal will contain only one copy of this spectrum, not the two
redundant copies we found above.
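The single-tone case can be checked numerically. The following is a minimal sketch, assuming arbitrary illustrative values for the amplitude and the two frequencies (none of them come from the notes): it builds the naive AM signal and the single-sideband signal and compares each against the tones predicted by the trig identities.

```julia
# Single-tone check of naive AM versus single-sideband AM.
# All numerical values below are assumed, illustrative choices.
A    = 1.5              # baseband amplitude
w_bb = 2pi * 3.0        # baseband angular frequency
w_c  = 2pi * 40.0       # carrier angular frequency
t    = range(0.0, 1.0; length = 2001)

f_bb       = A .* cos.(w_bb .* t)     # baseband tone
f_bb_shift = A .* sin.(w_bb .* t)     # its pi/2-shifted version

# Naive AM: contains both sidebands.
f_am = f_bb .* cos.(w_c .* t)
# Single-sideband AM: subtract the product of the two shifted signals.
f_ssam = f_bb .* cos.(w_c .* t) .- f_bb_shift .* sin.(w_c .* t)

upper = A .* cos.((w_c + w_bb) .* t)  # upper-sideband tone
lower = A .* cos.((w_c - w_bb) .* t)  # lower-sideband tone

err_am  = maximum(abs.(f_am .- 0.5 .* (upper .+ lower)))
err_ssb = maximum(abs.(f_ssam .- upper))
println("max deviation of AM from (upper + lower)/2: ", err_am)
println("max deviation of SSB-AM from upper tone:    ", err_ssb)
```

Both deviations should be at the level of floating-point roundoff, which is consistent with the single-sideband signal containing only the tone at the sum frequency.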
However, for baseband signals that are more complicated than a single tone,
forming the $\pi/2$-shifted version is expensive: we have to Fourier-decompose
the signal into constituent sinusoids and then apply a $\pi/2$ phase shift to each
sinusoid. In practice this requires fairly sophisticated digital signal processing
techniques, and is not commonly used for wireless AM communications.
2.2 Phase and frequency modulation (PM and FM)
Angle modulation techniques have the advantage that all the information is
contained in the zero crossings of the signal, which makes them less sensitive to
noise contamination. However, this advantage comes at a cost: for the same
baseband signal, PM and FM signals occupy significantly more bandwidth than
AM signals. A real-world demonstration of this fact may be found in the spacing
of AM and FM radio stations: AM stations are typically spaced about 10 kHz
apart from one another, while FM stations are typically spaced around 500 kHz
from each other, even though they are nominally transmitting baseband signals
of the same bandwidth (music and talk, which occupy up to around 20 kHz).
3 Digital modulation
AM and FM are techniques for transmitting analog signals. We may also want
to transmit a digital signal, that is, a sequence of 0s and 1s. There are many
ways to do this, of which we will consider just a few.
3.1 OOK
3.2 BPSK, QPSK, MPSK
The next most complicated thing we could do would be to tweak the phase or
frequency of the carrier during each bit period, with the tweak depending on the
binary data to be transmitted during that period.
For example, we might give the carrier a 0-degree phase shift during bit
periods in which the transmit bit is 1, and a $\pi$ (180-degree) phase shift during bit periods
in which the transmit bit is 0. This is binary phase-shift keying (BPSK). Of
course, a $\pi$ phase shift to a sinusoid amounts to a sign flip, so BPSK is similar
to OOK except that instead of turning the carrier off during 0 bits we flip its
sign.
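As an illustration of the sign-flip picture, here is a minimal BPSK sketch. The carrier frequency, bit period, sampling density, and the correlation-based receiver are all assumptions made for the example, not details taken from the notes.

```julia
# BPSK sketch: bit 1 -> phase 0, bit 0 -> phase pi (a sign flip of the carrier).
# Carrier frequency, bit period, and sampling are assumed values.
w_c    = 2pi * 8.0     # carrier angular frequency (8 cycles per bit period)
T_bit  = 1.0           # bit period
n_samp = 400           # samples per bit period

# Sample times within the k-th bit period.
bit_times(k) = (k - 1) * T_bit .+ (0:n_samp-1) .* (T_bit / n_samp)

function bpsk_modulate(bits)
    signal = Float64[]
    for (k, b) in enumerate(bits)
        phase = b == 1 ? 0.0 : Float64(pi)
        append!(signal, cos.(w_c .* bit_times(k) .+ phase))
    end
    return signal
end

# Demodulate by correlating each bit period with the unshifted carrier:
# the correlation is positive for a 1 bit and negative for a 0 bit.
function bpsk_demodulate(signal)
    n_bits = div(length(signal), n_samp)
    bits = Int[]
    for k in 1:n_bits
        chunk = signal[(k-1)*n_samp+1 : k*n_samp]
        corr = sum(chunk .* cos.(w_c .* bit_times(k)))
        push!(bits, corr > 0 ? 1 : 0)
    end
    return bits
end

bits = [1, 0, 0, 1, 1, 0]
rx = bpsk_demodulate(bpsk_modulate(bits))
println("sent:     ", bits)
println("received: ", rx)
```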
3.3 QAM
3.4 Spectral efficiency
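$$f^{QPSK}(t) = \begin{cases} \cos\omega_c t, & 0 < t < 1\,\text{s} \\ \sin\omega_c t, & 1 < t < 2\,\text{s} \\ \cos\omega_c t, & 2 < t < 3\,\text{s} \\ \quad\vdots \end{cases}$$
$$\int_{T/2}^{3T/2} \cos(\omega_c t)\, e^{-i n\omega_0 t}\, dt$$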
Figure 6: Fourier spectrum of QPSK signal (vertical axis: $f_n$, logarithmic scale). The x axis labels $n$, the index of
the frequency $n\omega_0$; the carrier frequency is at $\omega_c = 4 \times 10^4\, \omega_0$.
4 Multiplex methods
When multiple people are trying to communicate over the same communications
channel, which may be wired (think of an ethernet network consisting of a
single long cable with multiple computers feeding signals in and out) or wireless
(think of electromagnetic waves propagating through the air), we need multiplex
techniques to allow the channel to be shared.
There are three broad categories of multiplex techniques.
• Time-division multiple access (TDMA), in which multiplexing happens
in the time domain: one user uses the entire channel (i.e. all available
frequencies) to transmit his message, then a second user uses the entire
channel to transmit her message, and so on.
• Frequency-division multiple access (FDMA), in which multiplexing happens in the frequency domain: multiple users transmit their messages
simultaneously, but each user's transmission is restricted to a finite chunk
of the available frequency spectrum.
• Code-division multiple access (CDMA), in which all users transmit their
messages at the same time using the same frequencies, and yet the receiver
is magically able to disentangle one message from another because the
messages are coded in an orthogonal way.
To summarize:
• TDMA: same frequencies, different times.
• FDMA: same time, different frequencies.
• CDMA: same time, same frequencies, different codes.
TDMA is used, for example, in ethernet networking. In this protocol, multiple computers are connected to a common wire, and a message sent by one
computer is seen by all computers. Only one computer may be transmitting at
a time.² TDMA was also used in early cell phone systems. It is very easy to
design TDMA receivers: basically, the receiver just has to turn on during the
appropriate time interval and then turn off during other time intervals.
² But how is this synchronization enforced? What happens if two computers try to transmit
messages at the same time? How do computers know it's their turn to talk? Answer: they
don't! When a computer has a message to send, it just randomly sends it out and hopes
nobody else was trying to send a message at the same time. If someone else was trying to
send a message at the same time, the two messages collide, neither message is received by
anyone, and the two transmitting computers each wait a randomly chosen amount of time
before attempting to resend. This simple-minded protocol actually yields excellent performance
as long as the total message density (the fraction of all time during which some computer is
trying to send a message) doesn't get too high.
FDMA is the most widely used multiplex method. It is used, for example,
in radio broadcasting (each AM and FM channel broadcasts simultaneously at
a different frequency) and in cellphone networks (different phones communicate with the base station on different frequencies). FDMA receivers are slightly
trickier to design than TDMA receivers, but still relatively straightforward. Basically, the receiver applies a filter to exclude incoming signals at all frequencies
other than the frequency of interest, then downconverts (demodulates) from the
carrier frequency to baseband.
CDMA is a relatively recent addition to the fold of multiplex techniques. In
CDMA, each message is coded using a certain simple code in a way that allows
it to be distinguished from other simultaneously received messages. CDMA receivers are much more difficult to design than TDMA or FDMA receivers, and
their implementation involves a lot of interesting mathematics.
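To make the idea of orthogonal coding concrete, here is a toy sketch with two users. The length-4 codes are rows of a Walsh-Hadamard matrix, which is an assumed choice for illustration; the notes do not specify which codes are used, and real CDMA systems involve much more than this.

```julia
# CDMA toy example with two users sharing the channel at the same time.
# The codes are an assumed choice: two orthogonal Walsh-Hadamard rows.
code_a = [1, 1, -1, -1]
code_b = [1, -1, 1, -1]

# Spread each bit (mapped to +1/-1) over the chips of the user's code.
spread(bits, code) = vcat([(2b - 1) .* code for b in bits]...)

# Recover one user's bits by correlating each block of chips with that
# user's code; the other user's contribution cancels by orthogonality.
function despread(rx, code)
    L = length(code)
    n_bits = div(length(rx), L)
    return [sum(rx[(k-1)*L+1 : k*L] .* code) > 0 ? 1 : 0 for k in 1:n_bits]
end

bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]

# Same time, same frequencies: the channel simply adds the two signals.
rx = spread(bits_a, code_a) .+ spread(bits_b, code_b)

println("user A decoded: ", despread(rx, code_a))
println("user B decoded: ", despread(rx, code_b))
```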
4.1 The cocktail party
4.2 How CDMA works
5 Lock-in amplifiers
5.1 How lock-in amplifiers work
Contents
1 Monte-Carlo integration
1.1 Monte-Carlo integration
1.2 Comparison to nested quadrature rules
1.3 Applications of Monte-Carlo integration
2 A computational example
1 Monte-Carlo integration
1.1 Monte-Carlo integration
Consider a scalar-valued function $f(\mathbf{x})$ of a $D$-dimensional variable, and suppose we want to estimate the integral of $f$ over some subregion $R \subset \mathbb{R}^D$. In
Monte-Carlo integration we do this using the following extremely simple rule:
$$\int_R f(\mathbf{x})\, d\mathbf{x} \approx \frac{V}{N} \sum_{n=1}^{N} f(\mathbf{x}_n) \qquad (1)$$
where $V$ is the volume of $R$, and where the $\mathbf{x}_n$ are a set of $N$ randomly chosen
points distributed uniformly throughout our region $R$.
It seems too good to be true to think that such an incredibly simple-minded
procedure could possibly yield anything resembling decent numerical accuracy.
But it does! If $I$ is the exact value of the integral on the LHS of (1) and $I_N$
is the $N$-sample Monte-Carlo approximation on the RHS, then we have the
asymptotic convergence rate
$$|I - I_N| \sim \frac{1}{\sqrt{N}} \qquad (2)$$
1.2 Comparison to nested quadrature rules
For one-dimensional integrals, even the rectangular rule
converges like $1/N$, much faster than $1/\sqrt{N}$, and better quadrature algorithms
converge much more quickly. So why would we ever want to use something that
achieves a lousy convergence rate like (2)?
The answer has to do with a phenomenon sometimes known as the curse
of dimensionality. Consider rectangular-rule quadrature as an example. For a
1D integral over an interval $[a, b]$ subdivided into $N$ subintervals, we have to
evaluate the function $N_{\text{eval}} = N$ times and the error decays like $E \sim 1/N$, as
noted above. Now suppose we have a 2D integral of the form
$$\int_a^b dx_1 \int_c^d dx_2\, f(x_1, x_2).$$
Suppose we evaluate the inner ($x_2$) integral using an $N$-point rectangular rule
to obtain a function $F(x_1)$, then integrate this function over $x_1$, again using
an $N$-point rectangular rule, to compute the full integral. (Such a procedure is
called nested quadrature.) The overall error again decays like $E \sim 1/N$. But
we have to evaluate the function $N_{\text{eval}} = N^2$ times, so now the convergence
with respect to the number of function evaluations is only $E \sim 1/\sqrt{N_{\text{eval}}}$, much
slower than the 1D case. More generally, if we evaluate a $D$-dimensional integral
using nested rectangular-rule quadrature, the error decays like
$$\text{error in nested $D$-dimensional rectangular-rule quadrature} \sim \frac{1}{N_{\text{eval}}^{1/D}}$$
We see that already for $D = 2$ the simple Monte-Carlo formula (1) achieves
asymptotic convergence equivalent to that of the rectangular rule, while for
$D > 2$ Monte-Carlo is (asymptotically) better.
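The scaling for $D = 2$ can be observed directly. The sketch below is an illustration under assumed choices: a left-endpoint rectangular rule, the test integrand $e^{x_1 + x_2}$, and the unit square as the integration region. It prints the error alongside the number of function evaluations.

```julia
# Nested N-point rectangular rule for a 2D integral over [a,b] x [c,d].
# The integrand and limits below are assumed test values.
function rect1d(g, a, b, N)
    h = (b - a) / N
    return h * sum(g(a + (n - 1) * h) for n in 1:N)   # left-endpoint rule
end

# Inner integral over x2 defines F(x1); the outer rule integrates F over x1.
nested2d(f, a, b, c, d, N) =
    rect1d(x1 -> rect1d(x2 -> f(x1, x2), c, d, N), a, b, N)

f(x1, x2) = exp(x1 + x2)
exact = (exp(1.0) - 1.0)^2     # integral of exp(x1 + x2) over the unit square

for N in (10, 100, 1000)
    err = abs(nested2d(f, 0.0, 1.0, 0.0, 1.0, N) - exact)
    println("N = $N   N_eval = $(N^2)   error = $err")
end
```

Each tenfold increase in $N$ costs a hundredfold increase in function evaluations but should reduce the error only by roughly a factor of ten.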
Of course, the rectangular rule is only the most naive numerical quadrature
scheme. What if we use something more sophisticated like Simpson's rule?
Well, now the error decreases like $E \sim 1/N^4$, where $N$ is the number of function
samples per dimension, but the total number of function samples grows like
$N_{\text{eval}} \sim (2N)^D$, so we have
$$\text{error in nested $D$-dimensional Simpson's-rule quadrature} \sim \frac{1}{N_{\text{eval}}^{4/D}}$$
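1.3 Applications of Monte-Carlo integration
2 A computational example
$$\chi(\mathbf{x}) = \begin{cases} 1, & \mathbf{x} \in B^D \\ 0, & \text{otherwise} \end{cases}$$
$$\int_R \chi(\mathbf{x})\, d\mathbf{x}$$
where $R$ is any region of $\mathbb{R}^D$ containing the unit ball; for example, $R$ could
be all of $\mathbb{R}^D$, or could alternatively be the $D$-dimensional hypercube defined by
$\{\mathbf{x} : -1 \le x_i \le 1,\ i = 1, \ldots, D\}$.
Here's a julia program that evaluates the Monte-Carlo integration formula
(1) over a hypercubic region.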
#
# MCIntegrate: integrate func over the hypercube with
# bounds { Lower[1..Dim], Upper[1..Dim]} using a total
# of N function samples
#
function MCIntegrate(func, Lower, Upper, N)
  Lower = Lower[:];
  Upper = Upper[:];
  Dim = length(Lower);
  Delta = Upper - Lower;
  Jacobian = prod(Delta);
  Sum = 0.0;
  for n = 1:N
    # random vector w/ values \in [0:1]
    x = rand(Dim);
    # random point in hypercube
    x = Lower + Delta .* x;
    Sum += func(x);
  end
  # volume of hypercube times average function value
  Jacobian * Sum / N;
end
To test this program on a simple example, we'll compute the volume of the
three-dimensional unit ball, which is $\frac{4\pi}{3} \approx 4.189$.
julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
4.2352
julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
4.0584
julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
4.1448
Each time we call this routine, we obtain a sample of a random variable whose
mean value is the integral we are trying
to compute and whose standard deviation about that mean decreases like $1/\sqrt{N}$. (These concepts are explained more
fully in the following section.) To give you some graphical intuition for how the
process works, the following plot shows the results of 100 calls to MCIntegrate,
as above, for the two values N = 100 and N = 10000. The dashed line is the true
value of the integral. As you can see, in both cases the process is approximating
the true value of the integral, and increasing the number of function samples by
a factor of 100 reduces the fluctuations (the error in our approximate evaluation of the
integral) by a factor of 10.
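The experiment described above can be reproduced with a short self-contained script. This is a sketch under stated assumptions: `chi_ball` and `mc_integrate` are illustrative names for a compact restatement of formula (1), not the author's routines, and the spread is measured with the sample standard deviation rather than read off a plot.

```julia
using Statistics   # mean, std

# Characteristic function of the unit ball: 1 inside, 0 outside.
chi_ball(x) = sum(abs2, x) <= 1.0 ? 1.0 : 0.0

# Minimal Monte-Carlo integrator over the hypercube [lower, upper]
# (illustrative restatement of formula (1)).
function mc_integrate(func, lower, upper, N)
    delta = upper .- lower
    total = 0.0
    for n in 1:N
        total += func(lower .+ delta .* rand(length(lower)))
    end
    return prod(delta) * total / N
end

lower, upper = fill(-1.0, 3), fill(1.0, 3)
runs_small = [mc_integrate(chi_ball, lower, upper, 100)   for _ in 1:100]
runs_large = [mc_integrate(chi_ball, lower, upper, 10000) for _ in 1:100]

println("exact value        = ", 4pi / 3)
println("mean, std (N=100)   = ", mean(runs_small), ", ", std(runs_small))
println("mean, std (N=10000) = ", mean(runs_large), ", ", std(runs_large))
```

Because the results are random, the printed numbers change from run to run; the ratio of the two standard deviations should come out near 10.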
[Figure: value of integral (vertical axis) versus number of MC integration runs (horizontal axis, 0 to 100), showing the series N=100 and N=10000 together with a dashed line at the exact value.]
3.1 Random variables
A good way to think about a random variable x is as a black box with a button
on it. Each time we push the button, the black box spits out a number.²
If we push the button N times and plot the values of the samples emitted,
we might get something like this:
² Think of the little machine at the bank or the driver's-license office on which you push
a button and get out a number telling you your position in the line of people waiting to see
a clerk. One distinction is that in that case the numbers that emerge are integers emitted
in ascending order, whereas with a random variable the numbers that emerge are typically
real-valued and (hopefully!) not organized in any particular sequence.
Figure 3: Values of 300 samples of a random variable x (plotted against sample index n), which in this case are
uniformly distributed throughout [0 : 1].
More generally, we may ask for the fractional number of samples falling into any
interval $[x, x + \Delta]$, and the answer as $N \to \infty$ would tend to $P(x)\Delta$,³ where $P(x)$
is a number that depends on $x$. $P(x)$ is called the probability density function or
the probability distribution of the random variable. To be a suitable probability
density function, $P(x)$ must satisfy the conditions
$$P(x) \ge 0 \ \ \forall x \qquad \text{and} \qquad \int P(x)\, dx = 1.$$
³ Strictly speaking this equation is only true in the limit $\Delta \to 0$, but that would be too
many limits to be considering all at once; for now just think of $\Delta$ as a small width.
$$P(x) = \begin{cases} 1, & x \in [0, 1] \\ 0, & \text{otherwise} \end{cases}$$
which is known as a uniform distribution; we say that the random variable
x is uniformly distributed in the interval [0 : 1].
System-supplied random-number generators in computers, like the rand
functions in matlab or julia and the drand48 function in the standard c
library, typically produce random numbers uniformly distributed in the interval
[0 : 1]. Later we will discuss how to obtain random numbers distributed with
other densities.
3.2
The black dashed line in Figure (3) is the average value of all the samples of the
random variable emitted from the black box. This is known as the mean value
of the random variable. For a given probability distribution $P(x)$, the mean
may be computed according to
$$\begin{aligned} \text{mean} \equiv \bar{x} &= \int x\, P(x)\, dx \\ &= \langle x \rangle \end{aligned}$$
(where the second line defines some useful shorthand for integrating over probability distributions). For the uniform distribution on $[0, 1]$, we have
$$\bar{x} = \int_0^1 x\, dx = \frac{1}{2}.$$
$$\text{variance} \equiv \sigma_x^2 = \left\langle (x - \bar{x})^2 \right\rangle$$
This quantity is measuring how much samples of $x$ deviate from their mean
value. The bigger the value of $\sigma_x^2$, the more the random variable is spread out
or fluctuating about its mean.
Note that the specific quantity $\sigma_x^2$ is actually characterizing something like
the square of the deviations about the mean value. In particular, if the random
variable $x$ has units, like say meters, then $\sigma_x^2$ has units of meters$^2$ and hence
cannot be used directly to measure the spread of the quantity we are trying to
characterize. Instead, the number that you want to have in mind to characterize the spread of values in a random variable is the square root of the variance,
which is called the standard deviation:
$$\text{standard deviation} \equiv \sigma_x = \sqrt{\sigma_x^2}.$$
For the uniformly distributed variable of Figure 3, we have
$$\sigma_x^2 = \int_0^1 \left(x - \frac{1}{2}\right)^2 dx = \frac{1}{12}$$
so the standard deviation is $\sigma_x = \sqrt{1/12} \approx 0.29$. You should think of this as the
half-width of the interval around the mean within which most of the fluctuations
of the variable are contained.
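These two numbers are easy to check empirically. The following minimal sketch draws a large number of uniform samples (the sample count is an arbitrary choice) and compares the sample mean and sample standard deviation against $1/2$ and $\sqrt{1/12}$.

```julia
using Statistics   # mean, std

N = 1_000_000
x = rand(N)        # samples uniformly distributed in [0, 1)

println("sample mean = ", mean(x), "   (expected 1/2)")
println("sample std  = ", std(x), "   (expected sqrt(1/12) = ", sqrt(1 / 12), ")")
```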
3.3
It is easy to obtain new random variables from old. For example, given a random
variable x distributed according to some probability distribution P (x), we could
define a new random variable y by summing two samples of x:
y = x + x.
As in Figure 2, the random variable y may be thought of as a machine with a
button on it, which we can press however many times we like to generate samples
of y. In this case, we can think of this machine as containing within it two copies
of the machine of Figure 2. Hitting the button on the y machine amounts to
hitting the buttons on the two x machines and summing their results.
The very important fact about random variables defined as sums of random
variables is this:
When we add a random variable to itself $N$ times, its mean value increases
by a factor of $N$, but its standard deviation increases by only a factor of $\sqrt{N}$.
$$y_{10} \equiv \frac{1}{10}\Big(x + x + x + x + x + x + x + x + x + x\Big)$$
Applying the key result from above, we expect that the mean value and standard
deviation of this variable⁴ will be
$$\bar{y}_{10} = \bar{x}, \qquad \sigma_{y_{10}} = \frac{1}{\sqrt{10}}\, \sigma_x.$$
⁴ To understand this variable, think of the cartoon of Figure (4), but with 10 copies of the
x machine instead of just 2, and with a factor of 1/10 multiplying the result on its way out
of the box.
Figure 5: Values of 300 samples of a random variable $y_{10} = \frac{1}{10}\sum_{n=1}^{10} x$ (plotted against sample index n) defined
by averaging 10 copies of a random variable x, where x is uniformly distributed
in the interval [0, 1] as in Figure 3. Note that $y_{10}$ has the same mean as the
original x, but
the amplitude of its fluctuations about that mean (its standard
deviation) is $\sqrt{10} \approx 3$ times smaller than that of x (compare Figure 3).
By comparing Figures (3) and (5), it's easy to see that by averaging 10
samples of x we have obtained a new random variable whose mean is the same
as that of x, but whose fluctuations about that mean are reduced by a factor of
$\sqrt{10} \approx 3$.
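The same comparison can be made numerically rather than by eye. In this minimal sketch (sample counts are arbitrary choices), each sample of `y10` is the average of 10 fresh uniform samples, and the ratio of standard deviations is compared with $\sqrt{10}$.

```julia
using Statistics   # mean, std

n_samples = 100_000
x   = rand(n_samples)                           # single uniform samples
y10 = [mean(rand(10)) for _ in 1:n_samples]     # averages of 10 samples each

println("mean(x) = ", mean(x), "   mean(y10) = ", mean(y10))
println("std(x) / std(y10) = ", std(x) / std(y10),
        "   (expected sqrt(10) = ", sqrt(10), ")")
```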
3.4
$$\bar{z} = \int f(x)\, P(x)\, dx \qquad (4a)$$
$$\sigma_z^2 = \int \big[f(x) - \bar{z}\big]^2 P(x)\, dx \qquad (4b)$$
These are quantities that depend on $P(x)$ and $f(x)$, but not on anything else.
3.5
We can now assemble the insights of the previous sections to understand the
convergence rate of Monte-Carlo integration. Consider a scalar function of $D$
variables, $f(\mathbf{x})$. We will consider the evaluation of
$$I \equiv \frac{1}{V} \int_R f(\mathbf{x})\, d\mathbf{x} \qquad (5)$$
where $R$ is some subregion of $\mathbb{R}^D$ and $V$ is the volume of $R$. $I$ is just computing
the average value of $f$ over the subregion $R$. (If you want to compute the integral
of $f$, not its average value, then just multiply $I$ by $V$.)
Let $\mathbf{x}$ be a $D$-dimensional vector of random variables distributed uniformly
throughout the region $R$. This means that the probability distribution function
$P(\mathbf{x})$ is constant inside $R$ and zero everywhere else:
$$P(\mathbf{x}) = \begin{cases} \dfrac{1}{V}, & \mathbf{x} \in R \\ 0, & \text{otherwise.} \end{cases}$$
Given this fact, we can rewrite the integral we are trying to evaluate, equation
(5), in the form
$$I = \int_{\mathbb{R}^D} f(\mathbf{x})\, P(\mathbf{x})\, d\mathbf{x} \qquad (6)$$
Of course, we don't know how to compute $\sigma_I^2$, but the point is that it exists
and is just some number that depends on the function $f$ and the region $R$ (which
is what defines $P(\mathbf{x})$).
Finally, consider defining a new random variable $I_N$ by averaging $N$ samples
of $I$:
$$I_N \equiv \frac{1}{N} \sum_{n=1}^{N} I = \frac{1}{N} \sum_{n=1}^{N} f(\mathbf{x}_n)$$
Note that this is just the prescription we gave in equation (1) for Monte-Carlo
integration, although we are here interpreting it as the definition of a random
variable.
Invoking the general principle of equation (3), we expect that the mean value
and standard deviation of $I_N$ will be
$$\bar{I}_N = \bar{I}, \qquad \sigma_{I_N} = \frac{1}{\sqrt{N}}\, \sigma_I$$
where, again, $\sigma_I$ is some number that depends on $f$ and $R$ but not on $N$. The
mean value of $I_N$ is the quantity we are trying to compute, and its standard
deviation decreases like the inverse square root of $N$.
Thus, when we use Monte-Carlo integration with $N$ function samples to
estimate an integral, we are evaluating a single sample of a random variable
whose mean value
is the integral we are trying to compute and whose standard deviation
decreases like $1/\sqrt{N}$. This explains Figure 1.
3.6 Importance sampling
In some cases we may be trying to integrate a function $g(x)$ that may be decomposed into a product of factors $g(x) = f(x)P(x)$ where
$P(x)$ satisfies the
conditions of a probability density, i.e. $P(x) \ge 0$ and $\int P(x)\, dx = 1$. In this
case, referring back to equation (4a), we interpret $\int g(x)\, dx$ as the mean value of
$f(x)$ over samples of $x$ drawn from $P(x)$.
3.7
Next suppose we want to compute a sequence of random numbers $\{y_n\}$ that are
distributed according to some nonuniform probability distribution $P(y)$. The
general idea will be to compute a sequence of uniformly distributed random
numbers $\{x_n\}$ and then define $y_n$ to be $f(x_n)$, where $f(x)$ is some function.
Let's determine the relationship between $f(x)$ and $P(y)$.
Suppose we compute some large number of samples $N$. The number of $x$
points falling within an interval $[x, x + \Delta_x]$ is approximately $N\Delta_x$. All of these
points are mapped by our procedure into the interval $[y, y + \Delta_y] = [f(x), f(x + \Delta_x)]$. This latter interval has width $\Delta_y = \Delta_x |f'(x)|$ (the absolute value arises
because $f(x + \Delta_x)$ may be less than $f(x)$, but we still want to define the width of the
interval to be a positive number). Thus, if we are trying to define the probability
density $P(y)$ such that the number of sample points falling in an interval $[y, y + \Delta_y]$ is $N P(y)\Delta_y$, we should say
$$N P(y)\, \Delta_y = N \Delta_x$$
or, using $y = f(x)$ and $\Delta_y = |f'(x)|\, \Delta_x$,
$$|f'(x)| = \frac{1}{P(f(x))}$$
This is a differential equation for the function $f(x)$. For example, suppose
we want to generate points $y$ with distribution $P(y) = e^{-y}$. The differential
equation reads
$$|f'| = \frac{1}{e^{-f}}$$
with solution $f(x) = -\log x$. What this means is this: If $\{x_n\}$ is uniformly
distributed in $[0, 1]$ and we define $y_n = -\log(x_n)$, then $y_n$ is distributed in
$[0, \infty]$ with probability density $P(y) = \exp(-y)$.
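The exponential example can be tested empirically. In the sketch below, the one departure from the text is an assumption made for numerical safety: since `rand` can return exactly 0, the code applies $-\log$ to $1 - x$ (which is just as uniformly distributed as $x$) to avoid evaluating $\log 0$. The checks use two standard properties of the density $e^{-y}$: its mean is 1, and the probability of $y > 1$ is $e^{-1}$.

```julia
using Statistics   # mean

N = 1_000_000
x = rand(N)              # uniform in [0, 1)
y = -log.(1 .- x)        # 1 - x is uniform in (0, 1], which avoids log(0)

frac_above_1 = count(v -> v > 1.0, y) / N

println("sample mean of y       = ", mean(y), "   (expected 1)")
println("fraction of y above 1  = ", frac_above_1, "   (expected exp(-1) = ", exp(-1), ")")
```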
The $D$-dimensional ball $B^D$ is the set of points in $\mathbb{R}^D$ that lie within unit distance
of the origin. Let $V^D$ be the $D$-dimensional volume of $B^D$. From elementary
geometry we know
$$V^1 = 2, \qquad V^2 = \pi, \qquad V^3 = \frac{4\pi}{3}$$
but how do we extend this table to higher $D$? Earlier in these notes we discussed
how to do this using Monte-Carlo integration. Here we'll discuss how to do the
calculation analytically.
One way is to write
$$V^D = \int_{|\mathbf{x}| < 1} d^D x \qquad (8)$$
and evaluate the integral in polar coordinates. To get a sense of how to do this,
let the cartesian components of a point $\mathbf{x} \in \mathbb{R}^D$ be $\{x_1, \cdots, x_D\}$ and recall that
two-dimensional polar coordinates are defined by
$$x_1^{\text{2D}} = r \sin\phi_1 \qquad (9a)$$
$$x_2^{\text{2D}} = r \cos\phi_1 \qquad (9b)$$
while three-dimensional polar coordinates are defined by
$$x_1^{\text{3D}} = r \sin\phi_1 \sin\phi_2 \qquad (10a)$$
$$x_2^{\text{3D}} = r \sin\phi_1 \cos\phi_2 \qquad (10b)$$
$$x_3^{\text{3D}} = r \cos\phi_1 \qquad (10c)$$
Comparing (9) and (10), we see that the transition is effected by introducing
one new angle ($\phi_2$) and bifurcating the coordinate $x_1^{\text{2D}}$ into two new coordinates
$x_1^{\text{3D}}$ and $x_2^{\text{3D}}$ defined by
$$x_1^{\text{3D}} = x_1^{\text{2D}} \sin\phi_2, \qquad x_2^{\text{3D}} = x_1^{\text{2D}} \cos\phi_2.$$
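the other is all sines except for one cosine. For example, polar coordinates for
the 7-dimensional sphere are
$$\begin{aligned}
x_1^{\text{7D}} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \sin\phi_4 \sin\phi_5 \sin\phi_6 \\
x_2^{\text{7D}} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \sin\phi_4 \sin\phi_5 \cos\phi_6 \\
x_3^{\text{7D}} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \sin\phi_4 \cos\phi_5 \\
x_4^{\text{7D}} &= r \sin\phi_1 \sin\phi_2 \sin\phi_3 \cos\phi_4 \\
x_5^{\text{7D}} &= r \sin\phi_1 \sin\phi_2 \cos\phi_3 \\
x_6^{\text{7D}} &= r \sin\phi_1 \cos\phi_2 \\
x_7^{\text{7D}} &= r \cos\phi_1
\end{aligned}$$
The Jacobian of the transition from Cartesian to polar coordinates is
$$dx_1 \cdots dx_D = r^{D-1} \sin^{D-2}\phi_1\, \sin^{D-3}\phi_2 \cdots \sin\phi_{D-2}\; dr\, d\phi_1 \cdots d\phi_{D-1}$$
and the integral (8) splits up into a product of $D$ factors:
$$V^D = \underbrace{\int_0^1 r^{D-1}\, dr}_{1/D}\; \underbrace{\int_0^\pi \sin^{D-2}\phi_1\, d\phi_1}_{\sqrt{\pi}\,\Gamma[(D-1)/2]\,/\,\Gamma[D/2]} \cdots \underbrace{\int_0^\pi \sin\phi_{D-2}\, d\phi_{D-2}}_{2}\; \underbrace{\int_0^{2\pi} d\phi_{D-1}}_{2\pi}$$
Integrals over powers of $\sin$ factors may be evaluated using the $\Gamma$ function.
Working out the general case, we obtain the closed-form analytical expression
$$V^D = \frac{\pi^{D/2}}{\Gamma\!\left(\frac{D}{2} + 1\right)}.$$
The $\Gamma$ function here may be evaluated to yield more explicit formulas which
differ depending on the parity of $D$:
$$V^{2D} = \frac{\pi^D}{D!}, \qquad V^{2D+1} = \frac{2(2\pi)^D}{(2D+1)!!}$$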
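The parity formulas are straightforward to evaluate. The following sketch uses them directly, so that no special-function library is required; the function name `ball_volume` is an illustrative choice, and the double factorial $(2k+1)!!$ is computed as the product of the odd integers up to $2k+1$.

```julia
# Volume of the unit ball in dimension `dim`, from the even/odd closed forms:
#   dim = 2k     ->  pi^k / k!
#   dim = 2k + 1 ->  2 (2 pi)^k / (2k + 1)!!
function ball_volume(dim::Integer)
    k = div(dim, 2)
    if iseven(dim)
        return pi^k / factorial(k)
    else
        double_factorial = prod(1:2:(2k + 1))   # 1 * 3 * 5 * ... * (2k + 1)
        return 2 * (2pi)^k / double_factorial
    end
end

for dim in 1:7
    println("V^", dim, " = ", ball_volume(dim))
end
```

The first three values reproduce the elementary results $2$, $\pi$, and $4\pi/3$.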
Contents
1 Finite-difference approximations of the first derivative
1.1 Forward differencing
1.2 Backward differencing
1.3 Centered differencing
1.4 Higher-order finite-difference formulas
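1 Finite-difference approximations of the first derivative
1.1 Forward differencing
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \qquad (1)$$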
The simplest approach to numerical differentiation is simply to arrest the limiting process here and evaluate the RHS of (1) at a finite value of $h$. This defines
what is known as the forward-finite-difference (FFD) (or just forward-difference)
approximation to the derivative:
$$f'_{\text{FFD}}(h; x) \equiv \frac{f(x+h) - f(x)}{h}. \qquad (2)$$
It's easy to assess the error incurred by the forward-difference procedure. Recall
that the Taylor-series expression for the quantity $f(x+h)$ is
$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + O(h^3)$$
so that
$$f'_{\text{FFD}}(h; x) = f'(x) + \frac{h}{2} f''(x) + O(h^2) \qquad (3)$$
The first term on the RHS here is the quantity we are trying to compute, and
everything else is an error term. Thus we have
$$f'_{\text{FFD}}(h; x) - f'(x) = \frac{h}{2} f''(x) + O(h^2) \qquad (4)$$
As usual in error analysis, this equation is not useful for giving us an actual
number for the error, because we don't know how to evaluate $f''(x)$. The only
important thing is the $h$ dependence: the error is linear in $h$, i.e. we have a
first-order method. To obtain one more digit of accuracy (i.e. 10× smaller
error) we must use a 10× smaller value of $h$.
1.2 Backward differencing
It may happen that values of $f(x+h)$ are not available for positive $h$. This
may happen, for example, if the point $x$ lies at the right endpoint of the interval
over which our function is computable or measurable. (I mean measurable in
the experimental sense, not the sense of Lebesgue integration. Think of $f(x)$ as
a quantity reported by an experimental apparatus on which we can't turn the
dial any further than some $x_{\max}$.) Of course values of $f(x+h)$ must exist for at
least some nonzero range of positive $h$, since otherwise the derivative at $x$ would
not be defined, but those values may not be accessible to us for one reason or
another.
In this case, we can use backward differencing:
$$f'_{\text{BD}}(h; x) \equiv \frac{f(x) - f(x-h)}{h}. \qquad (5)$$
1.3 Centered differencing
$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + O(h^4) \qquad (6a)$$
$$f(x-h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + O(h^4) \qquad (6b)$$
Careful scrutiny of these equations reveals that by subtracting them and dividing by 2 we can eliminate the second-derivative term (and in fact all even-derivative terms) in (3):
$$\frac{f(x+h) - f(x-h)}{2} = h f'(x) + \frac{h^3}{6} f'''(x) + O(h^5)$$
Now just divide by $h$ to obtain the centered-difference approximation to the
derivative:
$$f'_{\text{CD}}(x) \equiv \frac{f(x+h) - f(x-h)}{2h} \qquad (7)$$
The above analysis shows that
$$f'_{\text{CD}}(x) - f'(x) = \frac{h^2}{6} f'''(x) + O(h^4) \qquad (8)$$
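The difference between first-order and second-order convergence is easy to see in a small experiment. The sketch below assumes the test function $f = \sin$ evaluated at $x = 1$ (an arbitrary choice) and prints the errors of the forward-difference formula (2) and the centered-difference formula (7) as $h$ shrinks.

```julia
# Forward versus centered differences on an assumed test case: f = sin at x = 1.
f(x)  = sin(x)
x0    = 1.0
exact = cos(x0)

ffd(f, x, h) = (f(x + h) - f(x)) / h            # forward difference, eq. (2)
cfd(f, x, h) = (f(x + h) - f(x - h)) / (2h)     # centered difference, eq. (7)

for h in (1e-1, 1e-2, 1e-3)
    println("h = ", h,
            "   FFD error = ", abs(ffd(f, x0, h) - exact),
            "   CD error = ", abs(cfd(f, x0, h) - exact))
end
```

Each tenfold reduction in $h$ should shrink the forward-difference error by about 10 and the centered-difference error by about 100, as long as $h$ is not so small that roundoff dominates.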
1.4 Higher-order finite-difference formulas
Formulas like (2), (5), and (7) are known as finite-difference stencils: they
are linear combinations of $n$ function samples that approximate the derivative
with $p$th-order convergence. The forward-difference, backward-difference, and
centered-difference stencils have $(n, p) = (2, 1), (2, 1), (2, 2)$ respectively.
By increasing the number of function samples $n$ that we are willing to compute, it is easy to construct finite-difference stencils that achieve any desired
convergence order $p$. All you have to do is write down the Taylor expansions of
the quantities
$$\cdots, f(x - 2h),\ f(x - h),\ f(x),\ f(x + h),\ f(x + 2h), \cdots$$
and construct clever weighted combinations of these that pick off successively
higher-order terms in the error estimates of equations (3) and (8).
However, we generally don't carry out finite-differencing beyond the centered-difference case. The reason is that in constructing formulas of this type we are
essentially constructing and differentiating a polynomial interpolant through
data samples at uniformly spaced intervals. As we have noted many times, this
procedure is badly behaved due to the Runge phenomenon: the more you try
to bend and squeeze a high-order polynomial to fit through evenly spaced data
points, the more it will bulge out in between the points. If you need a numerical
differentiation stencil that achieves a rapid convergence rate, a better idea is to
use nonuniformly spaced points to construct and differentiate a polynomial interpolant. We will revisit this topic when we consider Chebyshev interpolation
later in the course.
We can play similar games to write down approximate formulas for higher
derivatives. For example, go back to equations (6) and suppose that we add
the two equations together instead of subtracting them:
$$f(x+h) + f(x-h) = 2 f(x) + h^2 f''(x) + \frac{h^4}{12} f''''(x) + \cdots$$
or, upon rearranging,
$$\frac{f(x+h) - 2 f(x) + f(x-h)}{h^2} = f''(x) + \frac{h^2}{12} f''''(x) + \cdots \qquad (9)$$
Next suppose we want to differentiate a function of more than one variable, say
$f(x, y)$.
If we are only interested in partial derivatives with respect to a single variable, we can just apply the formulas for the one-dimensional case with the other
variables held fixed. For example:
$$\left.\frac{\partial f}{\partial x}\right|_{(x,y)} \approx \frac{f(x+h, y) - f(x, y)}{h} \qquad \text{(first-order convergence)}$$
$$\left.\frac{\partial f}{\partial y}\right|_{(x,y)} \approx \frac{f(x, y+h) - f(x, y-h)}{2h} \qquad \text{(second-order convergence)}$$
$$\left.\frac{\partial^2 f}{\partial x^2}\right|_{(x,y)} \approx \frac{f(x-h, y) - 2 f(x, y) + f(x+h, y)}{h^2} \qquad \text{(second-order convergence)}$$
However, things get a little more interesting when we go to compute mixed
partial derivatives. Consider, for example, the simultaneous double Taylor expansion of $f(x, y)$:
$$f(x + \Delta_x, y + \Delta_y) = f(x, y) + \Delta_x f_x(x, y) + \Delta_y f_y(x, y) + \frac{\Delta_x^2}{2} f_{xx}(x, y) + \Delta_x \Delta_y f_{xy}(x, y) + \frac{\Delta_y^2}{2} f_{yy}(x, y) + O(\Delta^3)$$
By writing out this equation for various possible choices of $\Delta_x$ and $\Delta_y$ and
taking linear combinations of the results, it is possible to kill off various terms
on the RHS to obtain stencils for various partial derivatives. You will explore
this possibility in your problem set this week.
Consider an interval $[x_a, x_b]$ and a function $f(x)$ that vanishes at the endpoints,
i.e. $f(x_a) = f(x_b) = 0$. Suppose we have samples of $f$ at a set of $N$ evenly spaced points between $x_a$ and $x_b$. More specifically, break up the interval into
$N + 1$ segments of width
$$h = \frac{x_b - x_a}{N + 1}$$
and denote the endpoints of these intervals and the values of $f$ at those points
by
$$x_n = x_a + n h, \qquad f_n = f(x_n), \qquad n = 1, 2, \cdots, N.$$
(For convenience we will also use the notation $x_0 = x_a$ and $x_{N+1} = x_b$.)
Now suppose we try to compute the second derivative of $f$ at the points $x_n$ using the finite-difference formula (9):
$$f''(x_1) \approx \frac{1}{h^2}\Big[f(x_0) - 2 f(x_1) + f(x_2)\Big] = \frac{1}{h^2}\Big[-2 f(x_1) + f(x_2)\Big] \qquad (10a)$$
$$f''(x_2) \approx \frac{1}{h^2}\Big[f(x_1) - 2 f(x_2) + f(x_3)\Big] \qquad (10b)$$
$$f''(x_3) \approx \frac{1}{h^2}\Big[f(x_2) - 2 f(x_3) + f(x_4)\Big] \qquad (10c)$$
$$\vdots$$
$$f''(x_{N-1}) \approx \frac{1}{h^2}\Big[f(x_{N-2}) - 2 f(x_{N-1}) + f(x_N)\Big] \qquad (10d)$$
$$f''(x_N) \approx \frac{1}{h^2}\Big[f(x_{N-1}) - 2 f(x_N)\Big] \qquad (10e)$$
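$$\begin{pmatrix} f''_1 \\ f''_2 \\ f''_3 \\ \vdots \\ f''_{N-1} \\ f''_N \end{pmatrix} \approx \frac{1}{h^2} \begin{pmatrix} -2 & 1 & 0 & \cdots & 0 & 0 \\ 1 & -2 & 1 & \cdots & 0 & 0 \\ 0 & 1 & -2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -2 & 1 \\ 0 & 0 & 0 & \cdots & 1 & -2 \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix} \qquad (11)$$
which we may write using matrix-vector notation in the form
$$\mathbf{f}'' = \mathbf{A} \mathbf{f} \qquad (12)$$
where $\mathbf{f}$ and $\mathbf{f}''$ are the vectors of samples of $f$ and samples of the second derivative of $f$, respectively.
The point of equations (11) and (12) is that the operation that takes $\mathbf{f}$ into $\mathbf{f}''$
may be thought of as matrix multiplication. Among the important consequences
of this observation is that it makes it easy to invert the operation, that is, to obtain
$\mathbf{f}$ from $\mathbf{f}''$:
$$\mathbf{f} = \mathbf{A}^{-1} \mathbf{f}'' \qquad (13)$$
The primary use of formulas like (13) is in the application of finite-difference
differentiation to the solution of boundary-value problems and higher-order
PDEs; the technique is known in the PDE world as the finite-difference method.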
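If instead $f$ takes nonzero values $f_a = f(x_a)$ and $f_b = f(x_b)$ at the endpoints, the first and last of equations (10) become
$$f''(x_1) \approx \frac{1}{h^2}\Big[f_a - 2f(x_1) + f(x_2)\Big] \qquad (14a)$$
$$f''(x_N) \approx \frac{1}{h^2}\Big[f(x_{N-1}) - 2f(x_N) + f_b\Big] \qquad (14b)$$
and the matrix equation takes the form
$$\underbrace{\begin{pmatrix} f''_1 \\ f''_2 \\ f''_3 \\ \vdots \\ f''_{N-1} \\ f''_N \end{pmatrix}}_{\mathbf{f}''}
- \frac{1}{h^2}\underbrace{\begin{pmatrix} f_a \\ 0 \\ 0 \\ \vdots \\ 0 \\ f_b \end{pmatrix}}_{\boldsymbol{\phi}}
\approx \underbrace{\frac{1}{h^2}
\begin{pmatrix}
-2 & 1 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
0 & 1 & -2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{pmatrix}}_{\mathbf{A}}
\underbrace{\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}}_{\mathbf{f}} \qquad (15)$$
What we have done here is to swing the terms involving $f_a$ and $f_b$ in (14) over to the other side of the equation in (15), that is, away from the side containing the unknowns and onto the side on which the known quantities reside. Note that the matrix $\mathbf{A}$ and the vectors $\mathbf{f}$, $\mathbf{f}''$ in this equation are unchanged from equation (11). All that happens is that the RHS of equation (13) is now augmented by an additional term:
$$\mathbf{f} = \mathbf{A}^{-1}\Big[\mathbf{f}'' - \frac{1}{h^2}\boldsymbol{\phi}\Big] \qquad (16)$$
where $\boldsymbol{\phi}$ is the sparse vector in (15) containing the boundary values of $f$.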
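To make (16) concrete, here is a minimal sketch that solves $f''(x) = g(x)$ with prescribed boundary values. Rather than forming $\mathbf{A}^{-1}$ explicitly, it solves the tridiagonal system with the Thomas algorithm (a swapped-in but equivalent technique); the function name `solve_bvp` and the test problem are assumptions made for illustration.

```python
import math

def solve_bvp(g, xa, xb, fa, fb, N):
    """Solve f''(x) = g(x) on [xa, xb] with f(xa) = fa, f(xb) = fb.

    Uses the (1, -2, 1)/h^2 finite-difference matrix and moves the
    boundary values to the known side of the equation.
    """
    h = (xb - xa) / (N + 1)
    x = [xa + n * h for n in range(1, N + 1)]
    # Known side, multiplied through by h^2: h^2 * g minus boundary terms.
    rhs = [h * h * g(xn) for xn in x]
    rhs[0] -= fa
    rhs[-1] -= fb
    # Thomas algorithm: forward elimination on the tridiagonal system.
    diag = [-2.0] * N
    for i in range(1, N):
        w = 1.0 / diag[i - 1]       # sub-diagonal entry (1) over the pivot
        diag[i] -= w                # super-diagonal entry is also 1
        rhs[i] -= w * rhs[i - 1]
    # Back substitution.
    f = [0.0] * N
    f[-1] = rhs[-1] / diag[-1]
    for i in range(N - 2, -1, -1):
        f[i] = (rhs[i] - f[i + 1]) / diag[i]
    return x, f

# Hypothetical test: f(x) = sin(pi x) + 1 + x, so f'' = -pi^2 sin(pi x).
x, f = solve_bvp(lambda t: -math.pi ** 2 * math.sin(math.pi * t),
                 0.0, 1.0, 1.0, 2.0, 99)
max_err = max(abs(fi - (math.sin(math.pi * xi) + 1.0 + xi))
              for xi, fi in zip(x, f))
```

Because the stencil is second-order accurate, `max_err` should shrink by roughly a factor of 4 each time the number of intervals is doubled.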
Contents
1 Numerical quadrature
2 Newton-Cotes rules
2.1 The rectangular rule
2.2 The trapezoidal rule
2.3 Higher-order Newton-Cotes rules
2.4 Heuristic error analysis
2.5 Results
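Numerical quadrature
Numerical integration is the problem of approximating a definite integral by a weighted sum of samples of the integrand:
$$\int_a^b f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n) \qquad (1)$$
where the sample points $x_n$ are some set of $N$ points lying in the interval $[a,b]$, and $w_n$ are an appropriately chosen set of weight coefficients. Numerical integration is also known as numerical quadrature, and the sets $\{x_n\}$ and $\{w_n\}$ are known as the quadrature points and the quadrature weights. An algorithm for choosing $\{x_n\}$ and $\{w_n\}$ is known as a quadrature rule.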
The name of the game in numerical quadrature is to obtain accurate estimates of the integral in (1) with the smallest possible number of function
samples. Think of f (x) as an experimentally measured quantity that may take
minutes or hours to evaluate. If your quadrature rule requires you to sample
$f(x)$ at $10^6$ points to get a decent estimate of the integral in (1), your project
will be hopeless. We will find that unsophisticated quadrature rules may indeed
require millions of points to yield decent accuracy, but with a little theoretical
sophistication it is possible to do much better, obtaining 6 or more digits of
accuracy with a few dozens or hundreds of samples.
Our study of numerical quadrature in 18.330 will unfold in multiple installments:
• We begin in these notes with the simplest approaches to constructing quadrature rules. We show that these rules work, but in most cases do not deliver particularly outstanding performance in the cost-vs-accuracy department (and, for this reason, are not commonly used in practice). We offer a simple error analysis to suggest why this might be. (We also note for future reference a couple of interesting cases in which the simple rules do yield excellent performance.)
• Then, later in the course (well into Unit 2), after we have introduced some more of the necessary theoretical background, we will discuss more sophisticated approaches to constructing quadrature rules. These rules do generally yield good performance and are commonly used in numerical practice.
• Meanwhile, there are several general points to be made about numerical quadrature that do not depend on the choice of quadrature rule. These points are addressed in these notes and remain equally valid later, after we have introduced more sophisticated quadrature schemes.
Thus, the content of Section 2 of these notes should not be taken as a statement of "this is how we recommend you do numerical quadrature in practice." Instead, think of this section as the first step in a journey that will culminate in a discussion of the right way to do numerical quadrature. On the other hand, the more general points made in Section 3 of these notes will not be superseded by subsequent developments.
Newton-Cotes rules
One way to approximate $\int_a^b f(x)\,dx$ is to find a polynomial $P(x)$ that approximates $f$ on the interval $[a,b]$ and integrate that instead, which is easy because we know how to integrate polynomials. This approach leads to the quadrature rules known as Newton-Cotes rules. There is a hierarchy of Newton-Cotes rules indexed by the degree of the polynomial; the $p$th-order Newton-Cotes rule uses a $p$th-degree polynomial.
Of course, if $f$ is not a polynomial itself then it won't be easy to find a single polynomial that approximates $f$ over the whole interval. Instead, we consider subdividing the interval into $N$ subintervals; on each subinterval we approximate $f$ by a different polynomial chosen appropriately to mimic the behavior of $f$ on that subinterval. The smaller we make the subintervals, the more accurately we will be able to approximate $f$ by a polynomial, and thus the more accurate will be our approximation of the integral.
2.1
The rectangular rule
Figure 1: The area of the figure comprised of the shaded rectangles defines the $N = 4$ rectangular-rule approximation to $\int_a^b f(x)\,dx$.
If we subdivide the interval $[a,b]$ into $N$ equal-length subintervals, each subinterval has width $\Delta = \frac{b-a}{N}$. The width of each rectangle in Figure 1 is $\Delta$. The height of the leftmost rectangle in Figure 1 is $f(a)$; thus this rectangle has area $A = f(a)\Delta$. The height of the second-to-leftmost rectangle is $f(a+\Delta)$, so this rectangle has area $A = f(a+\Delta)\Delta$. Proceeding in this way and summing the areas of all the rectangles, the $N = 4$ rectangular-rule approximation to our integral is
$$I^{\text{rect}}_{N=4} = f(a)\Delta + f(a+\Delta)\Delta + f(a+2\Delta)\Delta + f(a+3\Delta)\Delta.$$
More generally, the $N$-point rectangular-rule approximation to $\int_a^b f(x)\,dx$ is
$$I^{\text{rect}}_{N} = \Delta \sum_{n=0}^{N-1} f(a + n\Delta). \qquad (2)$$
Note that this is a quadrature rule of the general form (1): The quadrature weights are $w_n = \Delta$ (the same weight for all $n$), and the quadrature points are $x_n = a + n\Delta$ for $n = 0, 1, \ldots, N-1$.² (Note that $f(b)$ is not referenced by this rule.)
² Alternatively, we could define the quadrature points to be $x_n = a + (n-1)\Delta$ for $n = 1, \ldots, N$. I find this slightly more cumbersome, but it agrees better with the convention of equation (1).
Figure 2: The area of the figure comprised of the shaded trapezoids defines the $N = 4$ trapezoidal-rule approximation to $\int_a^b f(x)\,dx$.
2.2
The trapezoidal rule
A quick glance at Figure 1 shows that the rectangles are a crude approximation to the area under the curve. We can obtain a better approximation by considering trapezoids instead of rectangles. A trapezoid is basically just a rectangle, except that its upper edge is slanted into a straight line connecting the values of $f(x)$ at the left and right endpoints of the interval. This is illustrated in Figure 2. One way to interpret Figure 2 is to say that in each interval we are approximating $f$ by a first-degree polynomial (a line), so this is a first-order Newton-Cotes rule. It is known as the trapezoidal rule.
To write down the formula for the trapezoidal rule, note that the areas of the trapezoids in Figure 2 are
$$\text{area of leftmost trapezoid} = \frac{\Delta}{2}\Big[f(a) + f(a+\Delta)\Big]$$
$$\text{area of second-to-leftmost trapezoid} = \frac{\Delta}{2}\Big[f(a+\Delta) + f(a+2\Delta)\Big]$$
etc. Continuing in this way and summing up the areas of all the trapezoids, the $N = 4$ trapezoidal-rule approximation is
$$I^{\text{trap}}_{N=4} = \frac{\Delta}{2}\Big[f(a) + f(a+\Delta)\Big] + \frac{\Delta}{2}\Big[f(a+\Delta) + f(a+2\Delta)\Big] + \frac{\Delta}{2}\Big[f(a+2\Delta) + f(a+3\Delta)\Big] + \frac{\Delta}{2}\Big[f(a+3\Delta) + f(b)\Big].$$
More generally, the $N$-point trapezoidal-rule approximation to $\int_a^b f(x)\,dx$ is
$$I^{\text{trap}}_{N} = \frac{\Delta}{2} f(a) + \Delta\sum_{n=1}^{N-1} f(a + n\Delta) + \frac{\Delta}{2} f(b). \qquad (3)$$
Again, this is a quadrature rule of the general form (1). The quadrature points are $x_n = a + n\Delta$ for $n = 0, \ldots, N$. (Note that there is one more quadrature point than in the rectangular rule, namely, the point $x = b$.) The weights are
$$w_n = \begin{cases} \dfrac{\Delta}{2}, & n = 0, N \\[1ex] \Delta, & n = 1, 2, \ldots, N-1. \end{cases}$$
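The two rules above translate almost line for line into code. The following is a minimal sketch; the function names and the test integrand $e^x$ on $[0,1]$ are illustrative choices rather than anything prescribed by the notes.

```python
import math

def rectangular_rule(f, a, b, N):
    """N-point rectangular rule: weights delta at x_n = a + n*delta, n = 0..N-1."""
    delta = (b - a) / N
    return delta * sum(f(a + n * delta) for n in range(N))

def trapezoidal_rule(f, a, b, N):
    """Trapezoidal rule: weight delta/2 at the two endpoints, delta in between."""
    delta = (b - a) / N
    interior = sum(f(a + n * delta) for n in range(1, N))
    return delta * (0.5 * f(a) + interior + 0.5 * f(b))

exact = math.e - 1.0                       # integral of exp(x) over [0, 1]
rect = rectangular_rule(math.exp, 0.0, 1.0, 100)
trap = trapezoidal_rule(math.exp, 0.0, 1.0, 100)
```

For the same number of subintervals the trapezoidal estimate costs only one extra function sample, yet for this integrand it is accurate to several more digits than the rectangular estimate.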
Figure 3: For a third-order Newton-Cotes rule, we write down the unique cubic polynomial that agrees with $f(x)$ at 4 evenly spaced points throughout one subinterval (dashed curve). The area under this curve (blue shape) is the third-order Newton-Cotes approximation to the integral of $f$ over this subinterval.
2.3
Higher-order Newton-Cotes rules
The rectangular and trapezoidal rules correspond, respectively, to the zeroth-degree and first-degree Newton-Cotes rules. It is possible to continue this process. To derive the general $p$th-order Newton-Cotes rule, we proceed as follows:
1. Consider $p+1$ evenly spaced points in the subinterval $[x_n, x_{n+1}]$, including the endpoints. For example, for $p = 2$ we consider
$$x_n, \qquad \tfrac{1}{2}\big[x_n + x_{n+1}\big], \qquad x_{n+1};$$
for $p = 3$ we consider
$$x_n, \qquad \tfrac{1}{3}\big[2x_n + x_{n+1}\big], \qquad \tfrac{1}{3}\big[x_n + 2x_{n+1}\big], \qquad x_{n+1};$$
etc.
2. Find the unique $p$th-degree polynomial $P(x)$ that agrees with $f(x)$ at the above points. For $p = 0$ (rectangular rule) this is just the constant $P(x) =$
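2.4
Heuristic error analysis
This means that the difference between $f(x)$ and $P(x)$ is a polynomial that starts at order $p+1$:
$$f(x) - P(x) = C_{p+1}x^{p+1} + C_{p+2}x^{p+2} + \cdots$$
and thus the error in our approximate integral over a subinterval of width $\Delta = (b-a)/N$, with $x$ measured from the left end of the subinterval, is
$$\text{error in this subinterval} = \int_0^{\Delta} \big[f(x) - P(x)\big]\,dx = \int_0^{\Delta} \big[C_{p+1}x^{p+1} + \cdots\big]\,dx \propto \Delta^{p+2} \propto \frac{1}{N^{p+2}}.$$
On the other hand, we have a total of $N$ subintervals, and we must add together all the errors in all the subintervals to get the total error; this gives us an extra factor of $N$ upstairs, so we find
$$\text{total error} \propto N \cdot \frac{1}{N^{p+2}} = \frac{1}{N^{p+1}}.$$
Our analysis does not furnish the constant of proportionality, but that's OK because here we are only concerned with the dependence on $N$.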
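The predicted scaling can be checked empirically: if the total error behaves like $1/N^{p+1}$, then doubling $N$ should divide the error by $2^{p+1}$. The sketch below estimates the exponent this way for $p = 0$ and $p = 1$; the helper `observed_order` and the test integrand are assumptions made for illustration.

```python
import math

def rectangular_rule(f, a, b, N):
    delta = (b - a) / N
    return delta * sum(f(a + n * delta) for n in range(N))

def trapezoidal_rule(f, a, b, N):
    delta = (b - a) / N
    interior = sum(f(a + n * delta) for n in range(1, N))
    return delta * (0.5 * f(a) + interior + 0.5 * f(b))

def observed_order(rule, f, a, b, exact, N):
    """Estimate q in error ~ 1/N^q from the errors at N and 2N."""
    err_n = abs(rule(f, a, b, N) - exact)
    err_2n = abs(rule(f, a, b, 2 * N) - exact)
    return math.log2(err_n / err_2n)

exact = math.e - 1.0   # integral of exp(x) over [0, 1]
order_rect = observed_order(rectangular_rule, math.exp, 0.0, 1.0, exact, 64)
order_trap = observed_order(trapezoidal_rule, math.exp, 0.0, 1.0, exact, 64)
```

With this integrand the estimates come out close to 1 for the rectangular rule and close to 2 for the trapezoidal rule, in line with $p + 1$.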
2.5
Results
Figure 4 plots the convergence of the rectangular and trapezoidal rules for various values of N in an approximation of the integral
Z 2
I=
log2 x dx.
1
[Figure 4: relative error $|I_N^{\text{approx}} - I^{\text{exact}}| \,/\, |I^{\text{exact}}|$ of the rectangular and trapezoidal rules versus $N$, for $N$ between 100 and 100000.]
As noted at the beginning of these notes, the Newton-Cotes approach is generally not the best method for deriving quadrature rules, and we will eventually replace it with better strategies that achieve superior accuracy-vs-cost performance. On the other hand, already at this point there are certain general points we can make about numerical quadrature that will remain valid even after we have moved on to more sophisticated quadrature rules.
3.1
To evaluate an integral over an infinite interval, such as
$$\int_0^\infty f(x)\,dx,$$
we simply make a change of variables that maps the interval $[0,\infty]$ to a finite interval. There are many possible choices of map. One popular example is
$$x = \frac{u}{1-u}, \qquad dx = \frac{du}{(1-u)^2}$$
under which the range $x \in [0,\infty]$ is mapped into $u \in [0,1]$. Our improper integral is then transformed into a proper integral:
$$\int_0^\infty f(x)\,dx = \int_0^1 f\!\left(\frac{u}{1-u}\right)\frac{du}{(1-u)^2}.$$
Note that the integrand on the RHS appears to have a quadratic singularity as $u \to 1$. However, as $u \to 1$ the argument of $f$ is approaching $\infty$, and $f$ must vanish at $\infty$ (otherwise the integral $\int_0^\infty f(x)\,dx$ would not be convergent), so the singularity cancels.
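A minimal sketch of this change of variables follows. It samples the transformed integrand at the midpoints of $N$ equal subintervals of $[0,1]$, a choice made here (not taken from the notes) so that the endpoint $u = 1$ is never evaluated; the function name `integrate_half_line` and the test integrand $e^{-x}$ are likewise illustrative.

```python
import math

def integrate_half_line(f, N=2000):
    """Approximate the integral of f over [0, infinity) via x = u / (1 - u)."""
    delta = 1.0 / N
    total = 0.0
    for n in range(N):
        u = (n + 0.5) * delta            # midpoint of the n-th subinterval
        x = u / (1.0 - u)                # mapped sample point in [0, infinity)
        total += f(x) / (1.0 - u) ** 2   # Jacobian factor dx/du
    return delta * total

# Test integrand exp(-x), whose integral over [0, infinity) equals 1.
approx = integrate_half_line(lambda x: math.exp(-x))
```

The factor $1/(1-u)^2$ grows large near $u = 1$, but it multiplies values of $f$ that have already decayed to essentially zero there, which is the cancellation described above.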
3.2
Integrable singularities
Consider an integral such as
$$I = \int_0^1 \frac{\exp(-x)}{\sqrt{x}}\,dx \qquad (4)$$
Although the integrand blows up at the origin, the integral is perfectly well-defined. We say that we have an integrable singularity at the origin.
Note the important distinction between integrable and nonintegrable singularities. The function $f(x) = \exp(-x)/\sqrt{x}$ has an integrable singularity at the origin. In contrast, $f(x) = \exp(-x)/x$ has a nonintegrable singularity at the origin. There is no point in attempting to devise a strategy for estimating $\int_0^1 \exp(-x)\,dx/x$, because the integral does not exist.
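One strategy is to subtract off the singular part of the integrand and integrate it analytically. For the integral (4), we write
$$I = \underbrace{\int_0^1 \frac{dx}{\sqrt{x}}}_{I_1} + \underbrace{\int_0^1 \frac{\exp(-x) - 1}{\sqrt{x}}\,dx}_{I_2}$$
The integral $I_1$ can be evaluated analytically ($I_1 = 2$). The integrand of $I_2$ is nonsingular: expanding the exponential in a Taylor series about the origin gives
$$\frac{\exp(-x) - 1}{\sqrt{x}} = \frac{-x + \frac{1}{2}x^2 - \frac{1}{6}x^3 + \cdots}{\sqrt{x}} = -x^{1/2} + \frac{1}{2}x^{3/2} - \frac{1}{6}x^{5/2} + \cdots$$
This function goes to zero politely as $x \to 0$, so we can evaluate $I_2$ using standard (nonsingular) numerical quadrature techniques.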
Singularity cancellation
Another strategy is to introduce new integration variables in such a way that the
Jacobian of the transformation cancels troublesome factors in the denominator.
For example, for (4) we can put
$$u = \sqrt{x}, \qquad du = \frac{dx}{2\sqrt{x}}$$
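Both strategies can be tried side by side. The sketch below assumes that (4) is $I = \int_0^1 e^{-x}/\sqrt{x}\,dx$; under the substitution $u = \sqrt{x}$ this becomes $I = 2\int_0^1 e^{-u^2}\,du$, a step worked out here rather than quoted from the notes. The midpoint rule and all function names are illustrative choices.

```python
import math

def midpoint_rule(g, a, b, N):
    """Composite midpoint rule; never samples the endpoints a or b."""
    delta = (b - a) / N
    return delta * sum(g(a + (n + 0.5) * delta) for n in range(N))

# Singularity subtraction: I = I1 + I2 with I1 = integral of x^(-1/2) = 2.
def i2_integrand(x):
    return math.expm1(-x) / math.sqrt(x)   # (exp(-x) - 1) / sqrt(x)

I_subtraction = 2.0 + midpoint_rule(i2_integrand, 0.0, 1.0, 2000)

# Singularity cancellation: u = sqrt(x) gives I = 2 * integral of exp(-u^2).
I_cancellation = midpoint_rule(lambda u: 2.0 * math.exp(-u * u), 0.0, 1.0, 2000)

# Closed form for comparison: sqrt(pi) * erf(1).
I_reference = math.sqrt(math.pi) * math.erf(1.0)
```

In this experiment the cancellation approach is the more accurate of the two, because its integrand is smooth on the whole interval, whereas the subtracted integrand still behaves like $\sqrt{x}$ near the origin.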
Epsilon expansion
If neither of the above methods works, you may try to introduce a smoothing parameter $\epsilon$ which removes the singularity for finite $\epsilon$, then try extrapolating to the limit $\epsilon \to 0$. For example, in the above case, we might define
Z 1
exp(x)dx
p
I =
0
x 2 + 2
which is nonsingular for finite $\epsilon$, but tends to $I$ as $\epsilon \to 0$.
3.3
Adaptive quadrature
Contents
1 Overview
1.1 ODEs and ODE Integrators
1.2 Comparison to numerical quadrature
4 Stability
4.1 Stability of the forward Euler method
4.2 The backward (implicit) Euler method
4.3 Stability of the backward Euler method
4.4 Stability in the multidimensional case
4.5 Stability in the nonlinear case
5 Pathological cases
5.1 Nonuniqueness
5.2 Blowup in finite time
5.3 Conditions for existence and uniqueness
1
Overview
1.1
ODEs and ODE Integrators
An ordinary differential equation (ODE) is an equation of the form
$$\frac{du}{dt} = f(t, u) \qquad (1)$$
for some function $f(t,u)$. (We will always call the independent variable $t$ and the dependent variables $u$.) More specifically, (1) is an example of an initial value problem; one also encounters ODEs posed in the form of boundary value problems, to be discussed below.
Given an ODE like (1) with a reasonably well-behaved RHS function $f$, and given a single point $(t_0, u_0)$, an extensive and well-established theory assures us that there is a unique solution curve $u(t)$ passing through this point (i.e. satisfying $u(t_0) = u_0$). However, the theory doesn't tell us how to write down an analytical expression for this solution curve, and in general no analytical expression exists. Instead, we must resort to numerical methods to compute points that lie approximately on the curve $u(t)$.
An ODE integrator is an algorithm that takes an ODE like (1) and a point $(t_0, u_0)$ and produces a new point $(t_1, u_1)$ that (at least approximately) lies on the unique solution curve passing through $(t_0, u_0)$. More generally, we will usually want to compute a whole sequence of points $(t_1, u_1), (t_2, u_2), \ldots, (t_{\max}, u_{\max})$ up to some maximum time $t_{\max}$.
ODE systems
In general we may have two or more ODEs that we need to solve at the same time. For example, denote the populations of the U.S. and Uruguay respectively by $u_1(t)$ and $u_2(t)$. Suppose the U.S. and Uruguayan birth rates are respectively $\beta_1$ and $\beta_2$, and suppose that every year a fraction $\mu_a$ of the U.S. population emigrates to Uruguay, while a fraction $\mu_b$ of the Uruguayan population emigrates to the U.S. Then the differential equations that model the population dynamics of the two countries are
$$\frac{du_1}{dt} = (\beta_1 - \mu_a)u_1 + \mu_b u_2 \qquad (2a)$$
$$\frac{du_2}{dt} = \mu_a u_1 + (\beta_2 - \mu_b)u_2 \qquad (2b)$$
These two ODEs are coupled; each one depends on the other, so we can't solve them separately but must instead solve them simultaneously. We generally write systems like (2) in the form
$$\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u}) \qquad (3)$$
where $\mathbf{u}$ is now a time-dependent $d$-dimensional vector and $\mathbf{f}$ is a $d$-dimensional vector-valued function. For the case of (2), the dimension is $d = 2$ and the $\mathbf{u}$ vector is
$$\mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}.$$
As in the case of the one-dimensional system (1), for reasonably well-behaved functions $\mathbf{f}$ we are guaranteed to have a unique solution curve $\mathbf{u}(t)$ passing through any given point $(t_0, \mathbf{u}_0)$, but again in general we cannot write down an analytical expression for such a curve; we must use numerical methods to generate a sequence of points $(t_0, \mathbf{u}_0), (t_1, \mathbf{u}_1), \ldots, (t_{\max}, \mathbf{u}_{\max})$.
Actually, in the case of (2), the function $\mathbf{f}(t, \mathbf{u})$ is linear, $\mathbf{f} = \mathbf{A}\mathbf{u}$ for a constant matrix $\mathbf{A}$. In this special case, (3) can be solved analytically in terms of the matrix exponential to yield, for initial conditions $\mathbf{u}(t=0) = \mathbf{u}_0$,
$$\mathbf{u}(t) = e^{\mathbf{A}t}\mathbf{u}_0.$$
However, this only works for linear ODE systems; for nonlinear systems in general no analytical solution is possible.
Higher-order ODEs
Equations (1) and (3) are first-order ODEs, i.e. they only involve first derivatives. What if we have a higher-order ODE? Consider, for example, the one-dimensional motion of a particle subject to a position-dependent force:
$$\frac{d^2u_1}{dt^2} = \frac{1}{m}F(u_1). \qquad (4)$$
It turns out that we can squeeze higher-order ODEs like this into the framework of the above discussion simply by giving clever names to some of our variables. More specifically, let's assign the name $u_2$ to the first derivative of $u_1$ in (4):
$$\frac{du_1}{dt} \equiv u_2. \qquad (5a)$$
Now we can reinterpret the second-order equation (4) as a first-order equation for $u_2$:
$$\frac{du_2}{dt} = \frac{1}{m}F(u_1). \qquad (5b)$$
Equations (5) constitute a 2-dimensional system of first-order ODEs:
$$\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u}), \qquad \mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \qquad \mathbf{f} = \begin{pmatrix} u_2 \\ F(u_1)/m \end{pmatrix}.$$
Thus we can use all the same tricks we use to solve first-order ODE systems; there is no need to develop special methods for higher-order ODEs. This is a remarkable example of the efficacy of using the right notation.
We can play this trick to convert any system of ODEs, of any order, to a first-order system. In general, a $d$-dimensional system of $p$th-order ODEs can always be rewritten as a $pd$-dimensional system of first-order ODEs.
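The renaming trick is easy to express in code. The sketch below builds the first-order RHS function for the second-order equation (4); the helper name `second_order_to_first_order` and the linear spring force used to exercise it are hypothetical, chosen only for illustration.

```python
def second_order_to_first_order(F, m):
    """Return f(t, u) for u = (u1, u2), equivalent to u1'' = F(u1) / m."""
    def f(t, u):
        u1, u2 = u
        # First component: du1/dt = u2; second: du2/dt = F(u1)/m.
        return (u2, F(u1) / m)
    return f

# Hypothetical example: spring force F(u1) = -4 * u1 acting on mass m = 2.
f = second_order_to_first_order(lambda u1: -4.0 * u1, 2.0)
rhs = f(0.0, (1.0, 0.5))   # position 1.0, velocity 0.5
```

The returned function has the same `f(t, u)` signature as any other first-order system, so it can be handed unchanged to a general-purpose ODE integrator.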
$$\frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} u_2 \\ u_3 \\ -u_1 + u_2^2 - Au_3 \end{pmatrix}.$$
This system has been called the simplest dissipative chaotic flow.
1.2
Comparison to numerical quadrature
Consider the problem of evaluating the integral
$$\int_a^b f(t)\,dt \qquad (6)$$
which we studied earlier in our unit on numerical quadrature. This problem may be recast in the language of ODEs as follows: Define $u(t)$ to be the function
$$u(t) \equiv \int_a^t f(t')\,dt'.$$
Then $u(t)$ satisfies the initial-value problem
$$\frac{du}{dt} = f(t), \qquad u(a) = 0 \qquad (7)$$
and the integral (6) we want to compute is the value of the solution curve $u(t)$ at $t = b$.
Comparing (7) to (1), we notice an important distinction: The RHS function $f(t)$ in (7) depends only on $t$, not on $u$. This means that integrating functions is easier than integrating ODEs. In particular, suppose we use a numerical quadrature rule of the form
$$\int_a^b f(t)\,dt \approx \sum_i w_i f(t_i)$$
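2
2.1
$$m\frac{d^2\mathbf{r}}{dt^2} = -\frac{GMm}{r^2}\hat{\mathbf{r}} \qquad (8)$$
$$\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}, \qquad \begin{pmatrix} u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{pmatrix} \qquad (9)$$
where $x, y, z$ are the components of $\mathbf{r}$. Then equation (8) is equivalent to the following system:
$$\frac{d}{dt}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} u_4 \\ u_5 \\ u_6 \\ -\mu u_1/(u_1^2 + u_2^2 + u_3^2)^{3/2} \\ -\mu u_2/(u_1^2 + u_2^2 + u_3^2)^{3/2} \\ -\mu u_3/(u_1^2 + u_2^2 + u_3^2)^{3/2} \end{pmatrix} \qquad (10)$$
where $\mu = GM$.²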
Of course, in the real solar system we have more than simply one planet, and
planets experience gravitational attractions to each other in addition to their
attraction to the sun. You will work out some implications of this fact in your
problem set.
² To derive e.g. the 4th component of this equation, we write the $x$ component of equation (8) as follows:
$$\frac{d}{dt}\dot{x} = \ddot{x} = -\frac{GM}{x^2 + y^2 + z^2}\,\hat{r}_x \qquad (11)$$
where $\hat{r}_x$ is the $x$ component of the radially directed unit vector $\hat{\mathbf{r}}$, which we may write in the form
$$\hat{r}_x = \frac{x}{(x^2 + y^2 + z^2)^{1/2}}.$$
Plugging this into (11) and renaming the variables according to (9) yields the fourth component of equation (10).
2.2
Molecular dynamics
2.3
Electric circuits
2.4
Chemical reactions
3
2O2
O3 + O
u1
k1 u3 u2 k2 u1 u22 k3 u3 u1
u2 = k1 u3 u2 + k2 u1 u22 + k3 u3 u1 .
dt
u3
k1 u3 u2 + k2 u1 u22 k3 u3 u1 .
This system is not even close to being linear.
2.5
2.6
Figure 1: Euler's method illustrated for the 1D case. We are given an ODE $du/dt = f(t,u)$ and a single point $(t_n, u_n)$. The dashed line denotes the unique solution curve through this point; we know it exists, but we don't have an analytical expression for it. What we do know is its slope at the given point [this slope is just $s = f(t_n, u_n)$], so we move along a straight line of this slope until we have traveled a horizontal distance $\Delta t = h$ on the $t$ axis.
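3
3.1
In symbols, the Euler method takes us from $(t_n, u_n)$ to $(t_{n+1}, u_{n+1})$, where
$$t_{n+1} = t_n + h, \qquad (12a)$$
$$u_{n+1} = u_n + hf(t_n, u_n). \qquad (12b)$$
For the special case of a linear ODE system $d\mathbf{u}/dt = \mathbf{A}\mathbf{u}$, equation (12b) takes the form
$$\mathbf{u}_{n+1} = (\mathbf{I} + h\mathbf{A})\,\mathbf{u}_n \qquad (13)$$
where $\mathbf{I}$ is the $n \times n$ identity matrix. So each step of the forward Euler method requires us to do a single matrix-vector multiplication. If $\mathbf{A}$ is a sparse matrix, this can be done in $O(n)$ operations. (We haven't discussed sparse matrices or operation counts yet, so this observation is made for future reference.)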
Error analysis
How accurate is Euler's method? Consider the simplest case of a one-dimensional ODE system (the extension to a general $n$-dimensional system is immediate). Given a point $(t_0, u_0)$, we know there is a unique solution curve $u(t)$ passing through this point. The Taylor expansion of this function around the point $t_0$ takes the form
$$u(t) = \underbrace{u(t_0)}_{u_0} + (t - t_0)\underbrace{u'(t_0)}_{f(t_0, u_0)} + \frac{1}{2}(t - t_0)^2 u''(t_0) + \cdots \qquad (14)$$
Note that, in this expansion, $u(t_0)$ is just $u_0$, and $u'(t_0)$ is just $f(t_0, u_0)$, i.e. the value of the RHS function in our ODE at the initial point.
If we use (14) to compute the actual value of $u$ at the point $t_0 + h$, we find
$$u(t_0 + h) = u_0 + hf(t_0, u_0) + \frac{1}{2}h^2u''(t_0) + \cdots \qquad (15)$$
On the other hand, the Euler method (12b) approximates this value as
$$u^{\text{Euler}}(t_0 + h) = u_0 + hf(t_0, u_0). \qquad (16)$$
Thus the error between the Euler-method approximation and the actual value is
$$u(t_0 + h) - u^{\text{Euler}}(t_0 + h) = \frac{1}{2}h^2u''(t_0) + \cdots$$
This result depends on $u''(t_0)$, which we don't know. However, what's important is that it tells us the error is proportional to $h^2$. If we try again with one-half the step size $h$, everything on the RHS stays the same except the factor $h^2$, which now decreases by a factor of 4. To summarize,
$$\text{error in each step of the Euler method} \propto h^2. \qquad (17)$$
On the other hand, in general we will not be taking just a single step of the Euler method, but will instead want to use it to integrate over some interval $[t_a, t_b]$. If we break up this interval into $N = \frac{t_b - t_a}{h}$ steps of width $h$, then (17) tells us that the error in each step is proportional to $h^2$, but there are $N$ steps, so the total error is proportional to $Nh^2 \propto h$. In other words,
$$\text{overall error in the Euler method} \propto h. \qquad (18)$$
The Euler method is a method of order 1.
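The first-order behavior in (18) is easy to observe numerically. The following sketch applies the update (12) to the test problem $du/dt = u$, $u(0) = 1$ on $[0,1]$ (a test problem chosen here for illustration, with exact answer $e$) and halves the stepsize twice.

```python
import math

def forward_euler(f, t0, u0, h, n_steps):
    """Take n_steps forward Euler steps of size h starting from (t0, u0)."""
    t, u = t0, u0
    for _ in range(n_steps):
        u = u + h * f(t, u)   # equation (12b)
        t = t + h             # equation (12a)
    return t, u

f = lambda t, u: u            # test ODE du/dt = u, exact solution exp(t)
errors = []
for n_steps in (100, 200, 400):
    _, u_end = forward_euler(f, 0.0, 1.0, 1.0 / n_steps, n_steps)
    errors.append(abs(u_end - math.e))

# Ratio of successive errors when h is halved.
ratios = [errors[i] / errors[i + 1] for i in range(2)]
```

Each halving of $h$ roughly halves the error at $t = 1$, so the entries of `ratios` sit close to 2.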
Figure 2: The improved Euler method. (a) Starting at a point $(t_n, u_n)$, we take the usual forward Euler step by moving a horizontal distance $h$ along a line of slope $s$ (dashed black line in the figure), where $s = f(t_n, u_n)$ is the value of $f$ at the starting point. This takes us to the Euler point, $(t_{n+1}, u^{\text{Euler}}_{n+1})$. (b) When we get to the Euler point, we sample the value of $f$ there. Call this value $s' = f(t_{n+1}, u^{\text{Euler}}_{n+1})$. $s'$ is the slope of a tangent line (solid red line) to the ODE solution curve through the Euler point (dashed red curve). (c) Now we go back to the starting point and draw a line of slope $\frac{1}{2}(s + s')$ (solid black line). (The slope of this line is intermediate between the slope of the dashed black and dashed green lines in the figure.) Moving a horizontal distance $h$ along this line takes us to the improved Euler point.
3.2
Improved Euler
Another possibility is the improved Euler method, pictured for the 1D case in Figure 2. Like the Euler method, it computes a successor point to $(t_n, u_n)$ by moving a horizontal distance $h$ along a straight line. The difference is that, whereas in the original Euler method this line has slope $s$, in the improved Euler method the line has slope $\frac{1}{2}(s + s')$. Here $s$ and $s'$ are the values of the ODE function $f(t,u)$ at the starting point $(t_n, u_n)$ and at the Euler point $(t_{n+1}, u^{\text{Euler}}_{n+1})$. (The Euler point is just the point to which the usual Euler method takes us.)
The idea is that by averaging the slopes of the solution curves at the starting point and at the Euler point, we get a better approximation to what is happening in between those points. Thus, if we draw a line whose slope is the average of the two slopes, we expect that moving along this line should be better than just moving along the line whose slope is $s$, as we do in the original Euler method.
In symbols, the improved Euler method takes us from $(t_n, u_n)$ to $(t_{n+1}, u_{n+1})$, where
$$t_{n+1} = t_n + h \qquad (19a)$$
$$u_{n+1} = u_n + \frac{h}{2}\Big[f(t_n, u_n) + f\big(t_{n+1}, u^{\text{Euler}}_{n+1}\big)\Big] \qquad (19b)$$
where $u^{\text{Euler}}_{n+1} = u_n + hf(t_n, u_n)$.
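Equations (19) can be sketched in a few lines. As before, the test problem $du/dt = u$, $u(0) = 1$ on $[0,1]$ and the helper names are illustrative assumptions; the point is to see how the error at $t = 1$ responds when $h$ is halved.

```python
import math

def improved_euler_step(f, t, u, h):
    """One step of the improved Euler method, equations (19a)-(19b)."""
    s = f(t, u)                      # slope at the starting point
    u_euler = u + h * s              # ordinary Euler point
    s_prime = f(t + h, u_euler)      # slope at the Euler point
    return t + h, u + 0.5 * h * (s + s_prime)

def integrate(f, t0, u0, h, n_steps):
    t, u = t0, u0
    for _ in range(n_steps):
        t, u = improved_euler_step(f, t, u, h)
    return t, u

f = lambda t, u: u                   # test ODE du/dt = u, exact solution exp(t)
_, u_coarse = integrate(f, 0.0, 1.0, 0.01, 100)
_, u_fine = integrate(f, 0.0, 1.0, 0.005, 200)
ratio = abs(u_coarse - math.e) / abs(u_fine - math.e)
```

Halving $h$ reduces the error by a factor close to 4, which is the signature of a second-order method, at the cost of two evaluations of $f$ per step instead of one.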
Error analysis
To analyze the error in the improved Euler method, consider again the 1D case: we are at a point $(t_0, u_0)$, which we know lies on a solution curve $u(t)$, and we want to get to the point $u(t_0 + h)$. We will compare an exact expansion for this quantity with the approximate version computed by the improved Euler method, and this will allow us to estimate the error in the latter.
Exact expansion for $u(t_0 + h)$. As above, we can write down an expression for the exact value of $u(t_0 + h)$ by Taylor-expanding $u(t)$ about $t_0$:
$$u^{\text{exact}}(t_0 + h) = u(t_0) + hu'(t_0) + \frac{h^2}{2}u''(t_0) + \frac{h^3}{6}u'''(t_0) + \cdots \qquad (20)$$
In our error analysis of the Euler method above, we observed that the first two terms in this expansion were determined by
$$u(t_0) = u_0, \qquad u'(t_0) = f(t_0, u_0).$$
To get at $u''$, we now go like this:
$$u''(t_0) = \frac{d}{dt}u'(t_0) = \frac{d}{dt}f(t_0, u_0).$$
We now evaluate this total derivative by making use of the partial derivatives of $f$:
$$u''(t_0) = \left.\frac{\partial f}{\partial t}\right|_{t_0, u_0} + \left.\frac{\partial f}{\partial u}\right|_{t_0, u_0}\frac{du}{dt} = f_t(t_0, u_0) + f_u(t_0, u_0)f(t_0, u_0),$$
where we are using the shorthand
$$f_t \equiv \frac{\partial f}{\partial t}, \qquad f_u \equiv \frac{\partial f}{\partial u}.$$
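Inserting these results into (20) gives
$$u^{\text{exact}}(t_0 + h) = u_0 + hf(t_0, u_0) + \frac{h^2}{2}\Big[f_t(t_0, u_0) + f_u(t_0, u_0)f(t_0, u_0)\Big] + O(h^3) \qquad (21)$$
Comparing (21) with the corresponding expansion of the improved Euler update (19b) shows that the two agree through order $h^2$, so the error in each step of the improved Euler method is proportional to $h^3$ and the overall error is proportional to $h^2$.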
3.3
Runge-Kutta Methods
Although the error analysis for improved Euler is a little tricky, the idea of the method is straightforward: Instead of simply sampling $f(t,u)$ at the left endpoint of the interval we are traversing, we sample it at both the left and right endpoints and take the average between the two. This gives us a better representation of the behavior of the function over the interval than just sampling at one endpoint.
It's also pretty clear how we might improve further on the method: Just sample $f(t,u)$ at even more points, now including points inside the interval, and do some kind of averaging to get a better sense of the behavior of $f$ throughout the interval.
This is the motivation for Runge-Kutta methods. There is a family of these methods, indexed by the number of function samples they take on each step and the order of convergence they achieve. For example, the simplest Runge-Kutta method is known as the midpoint method and is defined by the following algorithm: Given an ODE $\frac{du}{dt} = f(t,u)$ and a point $(t_n, u_n)$, we compute the successor point as follows:
$$s_1 = f(t_n, u_n)$$
$$s_2 = f\Big(t_n + \frac{h}{2},\; u_n + \frac{h}{2}s_1\Big)$$
$$(t_{n+1}, u_{n+1}) = \big(t_n + h,\; u_n + hs_2\big)$$
What this algorithm does is the following: It first takes an Euler step with stepsize $h/2$ and samples the function $f$ at the resulting point, yielding the value $s_2$. This is an estimate of the slope of ODE solution curves near the midpoint of the interval we are traversing. Then we simply proceed from the starting point to the successor point by moving a horizontal distance $h$ along a line of slope $s_2$. Thus the midpoint method is almost identical to the original Euler method, in the sense that it travels to the successor point by moving a distance $h$ along a straight line; the only difference is that we use a more refined technique to estimate the slope of that straight line.
The most popular Runge-Kutta method is the fourth-order method, known colloquially as RK4. This method again travels a horizontal distance $h$ along a straight line, but now the slope of the line is obtained as a weighted average of four function samples throughout the interval of interest. More specifically, the successor point is computed as follows:
$$s_1 = f(t_n, u_n)$$
$$s_2 = f\Big(t_n + \frac{h}{2},\; u_n + \frac{h}{2}s_1\Big)$$
$$s_3 = f\Big(t_n + \frac{h}{2},\; u_n + \frac{h}{2}s_2\Big)$$
$$s_4 = f(t_n + h,\; u_n + hs_3)$$
$$(t_{n+1}, u_{n+1}) = \Big(t_n + h,\; u_n + \frac{h}{6}\big[s_1 + 2s_2 + 2s_3 + s_4\big]\Big)$$
Here $s_1$ is the ODE slope at the left end of the interval, $s_2$ and $s_3$ are samples of the ODE slope midway through the interval, and $s_4$ is a sample of the ODE slope at the right end of the interval. We compute a weighted average of all these slopes, $s_{\text{avg}} = (s_1 + 2s_2 + 2s_3 + s_4)/6$, and then we proceed to our successor point by moving a horizontal distance $h$ along a line of slope $s_{\text{avg}}$.
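As with all the methods we have discussed, it is easy to generalize RK4 to ODE systems of arbitrary dimensions. Given an ODE $\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u})$ and a point $(t_n, \mathbf{u}_n)$, RK4 computes the successor point as follows:
$$\mathbf{s}_1 = \mathbf{f}(t_n, \mathbf{u}_n)$$
$$\mathbf{s}_2 = \mathbf{f}\Big(t_n + \frac{h}{2},\; \mathbf{u}_n + \frac{h}{2}\mathbf{s}_1\Big)$$
$$\mathbf{s}_3 = \mathbf{f}\Big(t_n + \frac{h}{2},\; \mathbf{u}_n + \frac{h}{2}\mathbf{s}_2\Big)$$
$$\mathbf{s}_4 = \mathbf{f}(t_n + h,\; \mathbf{u}_n + h\mathbf{s}_3)$$
$$(t_{n+1}, \mathbf{u}_{n+1}) = \Big(t_n + h,\; \mathbf{u}_n + \frac{h}{6}\big[\mathbf{s}_1 + 2\mathbf{s}_2 + 2\mathbf{s}_3 + \mathbf{s}_4\big]\Big)$$
Error analysis in RK4
Although we won't present the full derivation, it is possible to show that RK4 is a fourth-order method: with a stepsize $h$, the error per step decreases like $h^5$ and the overall error decreases like $h^4$.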
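The vector form of RK4 can be sketched with plain Python lists standing in for vectors. The helper names and the test system (a harmonic oscillator, $u_1' = u_2$, $u_2' = -u_1$, integrated over one period) are illustrative assumptions.

```python
import math

def rk4_step(f, t, u, h):
    """One RK4 step for a system; u and f(t, u) are lists of equal length."""
    def shifted(u, a, s):
        # Componentwise u + a * s.
        return [ui + a * si for ui, si in zip(u, s)]
    s1 = f(t, u)
    s2 = f(t + h / 2, shifted(u, h / 2, s1))
    s3 = f(t + h / 2, shifted(u, h / 2, s2))
    s4 = f(t + h, shifted(u, h, s3))
    u_next = [ui + h / 6 * (a + 2 * b + 2 * c + d)
              for ui, a, b, c, d in zip(u, s1, s2, s3, s4)]
    return t + h, u_next

def oscillator(t, u):
    # u = [position, velocity] for u1'' = -u1.
    return [u[1], -u[0]]

n_steps = 100
h = 2 * math.pi / n_steps
t, u = 0.0, [1.0, 0.0]
for _ in range(n_steps):
    t, u = rk4_step(oscillator, t, u, h)
```

After one full period the exact solution returns to $(1, 0)$; with only 100 steps RK4 reproduces this to about six digits, which a first-order method could not match without vastly more steps.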
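4
Stability
4.1
Stability of the forward Euler method
Consider the initial-value problem
$$\frac{du}{dt} = -\lambda u, \qquad u(0) = 1 \qquad (24)$$
with solution
$$u(t) = e^{-\lambda t}. \qquad (25)$$
Consider applying Euler's method with stepsize $h$ to this problem. The sequence of points we get is the following:
$$(t_0, u_0) = (0, 1)$$
$$(t_1, u_1) = (h,\; 1 - \lambda h)$$
$$(t_2, u_2) = \big(2h,\; 1 - \lambda h - \lambda h(1 - \lambda h)\big) = \big(2h,\; (1 - \lambda h)^2\big)$$
$$(t_3, u_3) = \big(3h,\; (1 - \lambda h)^2 - \lambda h(1 - \lambda h)^2\big) = \big(3h,\; (1 - \lambda h)^3\big)$$
and in general
$$(t_N, u_N) = \big(Nh,\; (1 - \lambda h)^N\big).$$
In other words, the Euler-method estimate of the value of $u(t)$ after $N$ timesteps is
$$u^{\text{Euler}}(Nh) = (1 - \lambda h)^N.$$
More generally, if we had started with initial condition $u(0) = u_0$, then after $N$ timesteps we would have
$$u^{\text{Euler}}(Nh) = (1 - \lambda h)^N u_0. \qquad (26)$$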
Figure 3: Instability of the forward Euler method with stepsize $h = 0.42$ applied to the ODE $\frac{du}{dt} = -5u$. (Plot of $u(t)$ versus $t$ comparing the forward Euler iterates with the exact solution.)
Figure 4: The implicit Euler method (also known as the backward Euler method). As in the forward Euler method, we proceed from $(t_n, u_n)$ to $(t_{n+1}, u_{n+1})$ by moving on a straight line until we have traveled a horizontal distance $h$ along the $t$ axis. The difference is that now the slope of the line is the slope of the ODE solution curve through the new point $(t_{n+1}, u_{n+1})$. Because we don't know this point a priori, we must solve an implicit equation to find it; hence the name of the technique.
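4.2
The backward (implicit) Euler method
In symbols, the backward Euler method takes us from $(t_n, u_n)$ to $(t_{n+1}, u_{n+1})$, where
$$t_{n+1} = t_n + h \qquad (27a)$$
$$u_{n+1} = u_n + hf(t_{n+1}, u_{n+1}). \qquad (27b)$$
For the typical case of a nonlinear function $f$, solving the implicit equation (27b) is significantly more costly than simply implementing the explicit equation (12b).
For the special case of a linear ODE system
$$\dot{\mathbf{u}} = \mathbf{A}\mathbf{u},$$
equation (27b) takes the form
$$\mathbf{u}_{n+1} = \mathbf{u}_n + h\mathbf{A}\,\mathbf{u}_{n+1}$$
which we may solve to obtain
$$\mathbf{u}_{n+1} = \big(\mathbf{I} - h\mathbf{A}\big)^{-1}\mathbf{u}_n. \qquad (28)$$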
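4.3
Stability of the backward Euler method
Applying (27b) to the test problem (24) gives $u_{n+1} = u_n - \lambda h u_{n+1}$, i.e.
$$u_{n+1} = \frac{1}{(1 + \lambda h)}u_n.$$
Starting from an initial point $(t, u) = (0, u_0)$, the value of $u$ after $N$ timesteps is now
$$u^{\text{backward Euler}}(Nh) = \frac{1}{(1 + \lambda h)^N}u_0. \qquad (29)$$
Comparing this result to (26), we see the advantage of the implicit technique: assuming $\lambda > 0$, there is no value of $h$ for which (29) grows with $N$. We say that the implicit Euler method is unconditionally stable.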
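The contrast between (26) and (29) shows up immediately in a small experiment. The sketch below uses $\lambda = 5$ and $h = 0.42$, the values quoted in the figure captions, for the decaying test problem $du/dt = -\lambda u$; the function names are illustrative.

```python
def forward_euler_decay(lam, h, n_steps, u0=1.0):
    """Forward Euler iterates for du/dt = -lam * u: u_{n+1} = (1 - lam*h) u_n."""
    u = [u0]
    for _ in range(n_steps):
        u.append(u[-1] - h * lam * u[-1])
    return u

def backward_euler_decay(lam, h, n_steps, u0=1.0):
    """Backward Euler iterates: u_{n+1} = u_n / (1 + lam*h)."""
    u = [u0]
    for _ in range(n_steps):
        u.append(u[-1] / (1.0 + h * lam))
    return u

forward = forward_euler_decay(5.0, 0.42, 20)
backward = backward_euler_decay(5.0, 0.42, 20)
```

With these values $1 - \lambda h = -1.1$, so the forward Euler iterates alternate in sign and grow in magnitude even though the exact solution decays, while the backward Euler iterates decay monotonically toward zero.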
Figure 5: Stability of the backward Euler method with stepsize $h = 0.42$ applied to the ODE $\frac{du}{dt} = -5u$. (Plot of $u(t)$ versus $t$ comparing the backward Euler iterates with the exact solution.)
4.4
Stability in the multidimensional case
We carried out the analysis above for a one-dimensional linear ODE, but it is easy to extend the conclusions to a higher-dimensional linear ODE. Recall from 18.03 that the $N$-dimensional linear ODE system
$$\frac{d\mathbf{u}}{dt} = \mathbf{A}\,\mathbf{u}, \qquad \mathbf{u}(0) = \mathbf{u}_0$$
(where $\mathbf{A}$ is an $N \times N$ matrix with constant coefficients) has the solution
$$\mathbf{u}(t) = C_1e^{\lambda_1 t}\mathbf{v}_1 + C_2e^{\lambda_2 t}\mathbf{v}_2 + \cdots + C_Ne^{\lambda_N t}\mathbf{v}_N \qquad (30)$$
where $(\lambda_i, \mathbf{v}_i)$ are the eigenpairs of $\mathbf{A}$, and where the $C_i$ coefficients are determined by expanding the initial-condition vector in the basis of eigenvectors:
$$\mathbf{u}_0 = C_1\mathbf{v}_1 + C_2\mathbf{v}_2 + \cdots + C_N\mathbf{v}_N.$$
Applying the forward Euler stability condition to each term in (30), we find that the stepsize must satisfy
$$h < \frac{2}{|\lambda_{\max}|} \qquad (31)$$
where $\lambda_{\max}$ is the eigenvalue lying farthest to the left in the complex plane, i.e. the eigenvalue with the largest negative real part (corresponding to the most rapidly decaying term in (30)). On the other hand, the timescale over which we will generally want to investigate the system is determined by the least rapidly decaying term in (30), i.e.
$$t_{\max} \sim \frac{2}{|\lambda_{\min}|} \qquad (32)$$
where $\lambda_{\min}$ is the eigenvalue with the smallest negative real part.⁵
Comparing (31) and (32), we see that the number of ODE timesteps we would need to take to integrate our system stably using Euler's method is roughly
$$\frac{t_{\max}}{h} \sim \frac{|\lambda_{\max}|}{|\lambda_{\min}|}.$$
If the dynamic range spanned by the eigenvalues is large, we will need many small timesteps to integrate our ODE. Systems with this property are said to be stiff, and stiff ODEs are a good candidate for investigation using implicit integration algorithms.
4.5
So far we have discussed stability for linear ODEs. For nonlinear ODEs, we typically consider the system in the vicinity of a fixed point and investigate the stability with respect to small perturbations. More specifically, for an autonomous nonlinear ODE system of the form
$$ \frac{d\mathbf{u}}{dt} = \mathbf{f}(\mathbf{u}) \tag{33} $$
(where we suppose the RHS function is independent of $t$; that's what "autonomous" means) we suppose $\mathbf{u}_0$ is a fixed point, that is, a zero of the RHS:
$$ \mathbf{f}(\mathbf{u}_0) = 0. \tag{34} $$
The unique solution of (33) passing through the point $(t_0, \mathbf{u}_0)$ is $\mathbf{u}(t) \equiv \mathbf{u}_0$, constant independent of time; you can check using (34) that this is indeed a solution of (33). We now consider small perturbations around this solution, i.e. we put
$$ \mathbf{u}(t) = \mathbf{u}_0 + \delta\mathbf{u}(t) $$
and consider the Taylor-series expansion of the RHS of (33) about the fixed point:
$$ \frac{d\,\delta\mathbf{u}}{dt} = \mathbf{f}(\mathbf{u}_0 + \delta\mathbf{u}) = \mathbf{f}(\mathbf{u}_0) + \mathbf{J}\,\delta\mathbf{u} + O(\delta\mathbf{u}^2) $$
where $\mathbf{J}$ is the Jacobian of $\mathbf{f}$ at $\mathbf{u}_0$. Since $\mathbf{f}(\mathbf{u}_0) = 0$ by (34), neglecting terms of quadratic and higher order in $\delta\mathbf{u}$ leaves the linear system $d\,\delta\mathbf{u}/dt = \mathbf{J}\,\delta\mathbf{u}$, which we can investigate using standard linear stability analysis techniques.

[5] For simplicity in this discussion we are supposing that none of the eigenvalues have positive real parts. If any eigenvalues have positive real parts, then the underlying ODE itself is unstable (small perturbations in initial conditions lead to exponentially diverging outcomes).
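As an illustration of this linearization procedure, the following Python sketch uses a damped pendulum as the nonlinear system. This example system, the damping value, and the helper names `jacobian_2d` and `eigenvalues_2x2` are all assumptions made for illustration and do not come from the notes. It estimates the Jacobian at two fixed points by central differences and inspects the real parts of its eigenvalues.

```python
import cmath
import math

def pendulum(u, damping=0.5):
    """Autonomous RHS f(u) for u = (theta, omega); fixed points at theta = k*pi, omega = 0."""
    theta, omega = u
    return [omega, -math.sin(theta) - damping * omega]

def jacobian_2d(f, u0, eps=1e-6):
    """Central-difference estimate of the 2x2 Jacobian of f at u0."""
    J = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        up, um = list(u0), list(u0)
        up[j] += eps
        um[j] -= eps
        fp, fm = f(up), f(um)
        for i in range(2):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def eigenvalues_2x2(J):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

eig_down = eigenvalues_2x2(jacobian_2d(pendulum, [0.0, 0.0]))      # hanging straight down
eig_up = eigenvalues_2x2(jacobian_2d(pendulum, [math.pi, 0.0]))    # balanced upright

print("down:", eig_down)   # both real parts negative: perturbations decay
print("up  :", eig_up)     # one positive real part: perturbations grow
```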
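5 Pathological cases
5.1 Nonuniqueness
Consider the initial-value problem
$$ \frac{du}{dt} = \sqrt{u}, \qquad u(0) = 0. \tag{35} $$
Separating variables and integrating gives
$$ 2\sqrt{u} = t + C \quad\Longrightarrow\quad u(t) = \frac{t^2}{4}. \tag{36} $$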
But now consider using one of the ODE integration schemes discussed above to integrate this equation. Using Euler's method with stepsize $h$, for example, we find
$$ \begin{aligned}
(t_0, u_0) &= (0, 0) \\
(t_1, u_1) &= (h,\; 0 + h\sqrt{0}) = (h, 0) \\
&\;\;\vdots \\
(t_N, u_N) &= (Nh,\; 0 + h\sqrt{0}) = (Nh, 0).
\end{aligned} \tag{37} $$
But we already figured out that the solution is given by equation (36)! What went wrong?!
What went wrong is that the RHS function $f(t, u) = \sqrt{u}$, though seemingly innocuous enough, is not sufficiently well-behaved to guarantee the uniqueness of solutions to (35). Indeed, you can readily verify that (37) is a solution to (35) that is every bit as valid as (36). But aren't ODE solutions supposed to be unique?
The behavior of $f$ that disqualifies it in this case is that it is not differentiable at $u = 0$. Nonexistence of any derivative of $f$ violates the conditions required for existence and uniqueness and can give rise to nonunique solutions such as (36) and (37).
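The nonuniqueness is easy to reproduce numerically. The short Python sketch below (the helper name `euler` and the stepsize are illustrative assumptions) runs Euler's method on $du/dt = \sqrt{u}$ from $u(0) = 0$ and compares the result with the other valid solution $u = t^2/4$ at the same final time.

```python
import math

def euler(f, t0, u0, h, n_steps):
    """Plain forward Euler; returns the final (t, u)."""
    t, u = t0, u0
    for _ in range(n_steps):
        u = u + h * f(t, u)
        t = t + h
    return t, u

f = lambda t, u: math.sqrt(u)

t_end, u_euler = euler(f, 0.0, 0.0, h=0.01, n_steps=200)
u_other = t_end ** 2 / 4            # the solution u = t^2/4 evaluated at the same time

print("Euler result  :", u_euler)   # stays on the identically-zero solution
print("t^2/4 solution:", u_other)   # a different, equally valid solution
```

Euler's method never leaves $u = 0$ because every update adds $h\sqrt{0} = 0$; the integrator has silently picked one of the two solutions.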
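5.2
Consider the initial-value problem
$$ \frac{du}{dt} = u^2, \qquad u(0) = 1. \tag{38} $$
Separating variables and integrating gives
$$ -\frac{1}{u} = t + C \quad\Longrightarrow\quad u(t) = \frac{1}{1 - t}. $$
More generally, for $du/dt = u^p$ with $u(0) = 1$, the solution is
$$ u(t) = \frac{1}{\big[\,1 - (p-1)\,t\,\big]^{\frac{1}{p-1}}}, \qquad p > 1. $$
For $p = 1$ we have existence and uniqueness for all time.[6] But for any $p > 1$ the function $u(t)$ blows up (i.e. ceases to exist) at the finite time $t = \frac{1}{p-1}$.

[6] The function $e^t$ does grow without bound, and for large values of $t$ it assumes values that in practice are ridiculously large, but it never becomes infinite for finite $t$.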
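As a quick sanity check on finite-time blowup, the Python sketch below evaluates the closed-form solution of $du/dt = u^p$, $u(0) = 1$ obtained by separation of variables, $u(t) = [1 - (p-1)t]^{-1/(p-1)}$, and confirms by finite differences that it satisfies the ODE. The function names are hypothetical and the formula is the assumption being tested.

```python
def u_exact(t, p):
    """Closed-form solution of du/dt = u**p, u(0) = 1, for p > 1 and t < 1/(p-1)."""
    return (1.0 - (p - 1.0) * t) ** (-1.0 / (p - 1.0))

def blowup_time(p):
    """Finite time at which the solution ceases to exist."""
    return 1.0 / (p - 1.0)

def ode_residual(t, p, dt=1e-6):
    """Central-difference du/dt minus u**p; near zero if u_exact solves the ODE."""
    dudt = (u_exact(t + dt, p) - u_exact(t - dt, p)) / (2 * dt)
    return dudt - u_exact(t, p) ** p

for p in (2.0, 3.0):
    t_star = blowup_time(p)
    print("p =", p, "blowup at t =", t_star)
    for frac in (0.5, 0.9, 0.99, 0.999):
        print("   t =", frac * t_star, " u =", u_exact(frac * t_star, p))
```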
5.3
The above two cases illustrate the two basic ways in which the solution to an ODE $\frac{du}{dt} = f(t, u)$ can fail to exist or be unique: (a) either $f$ or some derivative of $f$ can blow up (fail to exist) at some point in our domain of interest, or (b) $f$ can grow superlinearly in $u$.
To exclude these pathological cases, mathematicians invented a name for functions that do not exhibit either of them. Functions $f(t, u)$ that are free of both pathologies (a) and (b) are said to be Lipschitz, and the basic existence and uniqueness theorem states that a solution to $\frac{du}{dt} = f(t, u)$ exists and is unique if $f$ is Lipschitz.
We have intentionally avoided excessive rigor in this discussion in order to get the main points across somewhat informally; the topic of Lipschitz functions and existence and uniqueness of ODEs is discussed in detail in every ODE textbook and many numerical analysis textbooks.
$$ \frac{a_0}{2} + \sum_{n \ge 1} a_n T_n(x), \qquad T_n(x) = \cos(n \arccos x), \qquad \langle f, g \rangle = \int_{-1}^{1} \frac{f(x)\,g(x)\,dx}{\sqrt{1 - x^2}}. $$
Contents
1 Orthogonal Sets of Polynomials
3 Gaussian quadrature
An inner product is just a rule for assigning a real number to any pair of functions $f, g$, and different choices of $[a, b]$ and $W(x)$ yield different inner products. [Note that the inner product is linear, i.e. $\langle \alpha f, g \rangle = \alpha \langle f, g \rangle$ and $\langle f + g, h \rangle = \langle f, h \rangle + \langle g, h \rangle$.] For our purposes, the most important fact about the inner product is that it doesn't vanish when you stick the same function into both slots, i.e. $\langle f, f \rangle \ne 0$ for any $f$ that is not identically zero.[1]
Given an inner product and a normalization convention, an orthogonal set of polynomials is simply a collection of polynomials $\{Q_n(x)\}$ (where $n$ indexes the degree of the polynomial, i.e. $Q_0$ is a constant, $Q_1(x)$ is a linear function, $Q_2(x)$ is a second-degree polynomial, etc.) that satisfy the normalization convention and that are orthogonal with respect to the inner product, i.e.
$$ \langle Q_n, Q_m \rangle = 0 \qquad \text{for } n \ne m. $$
Examples
The following table summarizes the ingredients that define some of the commonly used sets of orthogonal polynomials.

Name         Symbol      Interval               Weight function          Normalization
Legendre     $P_n(x)$    $[-1, 1]$              $1$                      $P_n(1) = 1$
Chebyshev    $T_n(x)$    $[-1, 1]$              $1/\sqrt{1 - x^2}$       $T_n(1) = 1$
Laguerre     $L_n(x)$    $[0, \infty)$          $e^{-x}$                 $L_n(0) = 1$
Hermite      $H_n(x)$    $(-\infty, \infty)$    $e^{-x^2}$               $\langle H_n, H_n \rangle = \sqrt{\pi}\, 2^n n!$
[1] Note that one convenient way to define a normalization convention [item (3) above] would be to scale all functions $f$ such that $\langle f, f \rangle = 1$, but this is not the convention that is typically used.
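$$ Q_1(x) = A_1 (x - B_1)\, Q_0 \tag{2} $$
where[2]
$$ B_1 = \frac{\langle x Q_0, Q_0 \rangle}{\langle Q_0, Q_0 \rangle}. \tag{3} $$
$$ B_2 = \frac{\langle x Q_1, Q_1 \rangle}{\langle Q_1, Q_1 \rangle}, \qquad C_2 = \frac{\langle Q_1, Q_1 \rangle}{A_1 \langle Q_0, Q_0 \rangle}, $$
$$ B_3 = \frac{\langle x Q_2, Q_2 \rangle}{\langle Q_2, Q_2 \rangle}, \qquad C_3 = \frac{\langle Q_2, Q_2 \rangle}{A_2 \langle Q_1, Q_1 \rangle}, $$
$$ B_n = \frac{\langle x Q_{n-1}, Q_{n-1} \rangle}{\langle Q_{n-1}, Q_{n-1} \rangle}, \qquad C_n = \frac{\langle Q_{n-1}, Q_{n-1} \rangle}{A_{n-1} \langle Q_{n-2}, Q_{n-2} \rangle}. \tag{7} $$

[2] Just to clarify: The numerator of equation (3) is the inner product (1) with the function $f(x)$ taken to be $x Q_0(x)$ and the function $g(x)$ taken to be $Q_0(x)$.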
Recurrence relations
By scrutinizing the general case of the inductive procedure discussed above, it is generally possible to write down recurrence relations that relate the next element in a set of orthogonal polynomials to previous elements. For example, the Legendre polynomials satisfy the recurrence
$$ P_{n+1}(x) = \frac{2n + 1}{n + 1}\, x P_n(x) - \frac{n}{n + 1}\, P_{n-1}(x). $$
The Chebyshev polynomials satisfy the recurrence
$$ T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x). \tag{8} $$
k
(2n + 1) x
Ln (x)
Ln+1 (x).
(n + 1)
k+1
(9)
(10)
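The Legendre and Chebyshev recurrences translate directly into code. The Python sketch below (function names are illustrative assumptions) evaluates $P_n(x)$ and $T_n(x)$ by stepping the three-term recurrences upward from the degree-0 and degree-1 polynomials, and compares the Chebyshev values with the closed form $T_n(x) = \cos(n \arccos x)$.

```python
import math

def legendre(n, x):
    """P_n(x) via P_{k+1} = ((2k+1) x P_k - k P_{k-1}) / (k+1), with P_0 = 1, P_1 = x."""
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def chebyshev(n, x):
    """T_n(x) via T_{k+1} = 2 x T_k - T_{k-1}, with T_0 = 1, T_1 = x."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(1, n):
        t_prev, t = t, 2 * x * t - t_prev
    return t

x = 0.3
for n in range(6):
    print(n, legendre(n, x), chebyshev(n, x), math.cos(n * math.acos(x)))
```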
Differential equations
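Many sets of orthogonal polynomials arise as solutions to differential equations. For example, the $n$th Legendre polynomial $P_n(x)$ satisfies
$$ (1 - x^2)\frac{d^2 P_n}{dx^2} - 2x\frac{dP_n}{dx} + n(n + 1) P_n(x) = 0 $$
and the $n$th Chebyshev polynomial $T_n(x)$ satisfies
$$ (1 - x^2)\frac{d^2 T_n}{dx^2} - x\frac{dT_n}{dx} + n^2 T_n(x) = 0. $$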
Generating functions
It is curious, and in some cases useful, to note that many sets of orthogonal polynomials have a generating function which encodes the properties of the entire set of functions and from which individual functions can be recovered by performing algebraic and derivative manipulations.
For example, for the Legendre polynomials we have
$$ P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} (x^2 - 1)^n. $$
For the Chebyshev polynomials, it turns out that $T_n(x)$ arises as precisely the coefficient of $y^n$ in the expansion of the quantity $(1 - xy)/(1 - 2xy + y^2)$ in powers of $y$:
$$ \frac{1 - xy}{1 - 2xy + y^2} = \sum_{n=0}^{\infty} T_n(x)\, y^n. $$
Differentiating each side of this equation $n$ times with respect to $y$, setting $y \to 0$, and dividing by $n!$ then yields an expression for $T_n(x)$.
(11a)
2. Only the constant element in the set has nonvanishing integral over the interval with respect to the weight function, i.e.
$$ \int_a^b Q_n(x) W(x)\, dx = 0, \qquad n \ge 1. \tag{11b} $$
For many applications, including Gaussian quadrature as discussed in the following section, we need to compute the roots of the $N$th element in some set of orthogonal polynomials, i.e. we need the $N$ points $x_n$ that satisfy
$$ Q_N(x_n) = 0, \qquad n = 1, 2, \cdots, N. \tag{12} $$
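For the Legendre polynomials the coefficients in the recurrence $x Q_n = \alpha_n Q_{n-1} + \beta_n Q_n + \gamma_n Q_{n+1}$ are
$$ \alpha_n = \frac{n}{2n + 1}, \qquad \beta_n = 0, \qquad \gamma_n = \frac{n + 1}{2n + 1}. $$
Writing out this recurrence for $n = 0, 1, \cdots, N - 1$ in matrix form gives
$$
\begin{pmatrix}
\beta_0 & \gamma_0 & 0 & \cdots & 0 & 0 \\
\alpha_1 & \beta_1 & \gamma_1 & \cdots & 0 & 0 \\
0 & \alpha_2 & \beta_2 & \ddots & 0 & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & \alpha_{N-2} & \beta_{N-2} & \gamma_{N-2} \\
0 & 0 & \cdots & 0 & \alpha_{N-1} & \beta_{N-1}
\end{pmatrix}
\begin{pmatrix} Q_0(x) \\ Q_1(x) \\ Q_2(x) \\ \vdots \\ Q_{N-2}(x) \\ Q_{N-1}(x) \end{pmatrix}
= x \begin{pmatrix} Q_0(x) \\ Q_1(x) \\ Q_2(x) \\ \vdots \\ Q_{N-2}(x) \\ Q_{N-1}(x) \end{pmatrix}
- \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ \gamma_{N-1} Q_N(x) \end{pmatrix}.
$$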
What this equation says is that $x$ is almost an eigenvalue of the matrix on the LHS. The only thing that spoils the eigenvalue condition is the extra term in the last slot of the second vector on the RHS. However, this term vanishes whenever $x$ is a root of $Q_N$! This means that the roots of $Q_N$ are precisely the eigenvalues of the tridiagonal matrix on the LHS.
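Here's a little julia code that will compute and return an $N$-dimensional vector containing the roots of the $N$th Legendre polynomial, $P_N(x)$ (for $N \ge 2$):

    using LinearAlgebra                    # provides eigvals

    function LegendreRoots(N)
        A = zeros(N, N)
        A[1,2] = 1                         # first row: x P_0 = P_1
        for n = 1:N-2
            A[n+1,n]   = n/(2*n+1)         # subdiagonal entry
            A[n+1,n+2] = (n+1)/(2*n+1)     # superdiagonal entry
        end
        A[N,N-1] = (N-1)/(2*N-1)           # last row has no superdiagonal entry
        eigvals(A)                         # eigenvalues = roots of P_N
    end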
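The root/eigenvalue connection can be checked without an eigensolver. The Python sketch below is an independent illustration, not a port of the routine above, and `char_poly` is a hypothetical name. It evaluates the characteristic polynomial $\det(x\mathbf{I} - \mathbf{A})$ of the Legendre tridiagonal matrix using the standard determinant recurrence for tridiagonal matrices, and confirms that it vanishes at the known roots of $P_2$ and $P_3$.

```python
import math

def char_poly(N, x):
    """det(x*I - A) for the N x N Legendre tridiagonal matrix (zero diagonal).

    Uses D_{k+1} = x*D_k - (sub_k * super_{k-1}) * D_{k-1}, with D_0 = 1, D_1 = x,
    where sub_k = k/(2k+1) and super_{k-1} = k/(2k-1) are the off-diagonal entries.
    """
    d_prev, d = 1.0, x
    for k in range(1, N):
        sub_k = k / (2 * k + 1)
        super_km1 = k / (2 * k - 1)
        d_prev, d = d, x * d - sub_k * super_km1 * d_prev
    return d

roots_p2 = [-1 / math.sqrt(3), 1 / math.sqrt(3)]        # roots of P_2 = (3x^2 - 1)/2
roots_p3 = [-math.sqrt(0.6), 0.0, math.sqrt(0.6)]       # roots of P_3 = (5x^3 - 3x)/2

print([char_poly(2, r) for r in roots_p2])   # all essentially zero
print([char_poly(3, r) for r in roots_p3])   # all essentially zero
print(char_poly(3, 0.5))                     # nonzero away from a root
```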
Gaussian quadrature
$$ \int_a^b f(x) W(x)\, dx \tag{14} $$
where $W(x)$ is some weight function and $f(x)$ is an arbitrary function whose integral (times $W$) we are trying to compute. We would like to construct an $N$-point quadrature rule consisting of $N$ points and weights $(\{x_n\}, \{w_n\})$ such that
$$ \sum_{n=1}^{N} w_n f(x_n) \approx \int_a^b f(x) W(x)\, dx. \tag{15} $$
Note that the sum on the LHS here only involves samples of $f$, not $W$; the weight function $W(x)$ is baked in to the definition of the quadrature weights $w_n$.
Let $\{Q_n(x)\}$ be the set of orthogonal polynomials defined with respect to an inner product of the form (1) with interval $[a, b]$ and weight function $W(x)$ matching those of the integral we are trying to compute in (14). [In the common case in which $W(x) = 1$, these will be just the Legendre polynomials $\{P_n(x)\}$.]
It's easy to construct an $N$-point quadrature rule that exactly integrates polynomials up to degree $N - 1$
If you give me any set of $N$ distinct points $\{x_n\}$ distributed throughout the interval $[a, b]$, I can find a set of $N$ weights $\{w_n\}$ such that the quadrature rule $[\{x_n\}, \{w_n\}]$ exactly integrates all polynomials of degree $N - 1$ or less. All I have to do is to require my quadrature rule to be exact for the first $N$ elements in the orthogonal set $\{Q_n\}$. Since any polynomial of degree $N - 1$ or lower can be exactly represented as a linear combination of these elements, its integral will be computed exactly by our quadrature rule.
The condition that our quadrature rule be exact for the first $N$ polynomials in the set $\{Q_n\}$ amounts to a set of $N$ simultaneous linear equations on the $N$ quadrature weights $\{w_n\}$. Indeed, the requirement that my quadrature rule be exact when I use it to integrate the function $Q_0$ gives me the condition
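$$ w_1 Q_0(x_1) + w_2 Q_0(x_2) + \cdots + w_N Q_0(x_N) = \int_a^b Q_0(x) W(x)\, dx. \tag{16a} $$
The condition that the rule be exact for $Q_1$ yields
$$ w_1 Q_1(x_1) + w_2 Q_1(x_2) + \cdots + w_N Q_1(x_N) = \int_a^b Q_1(x) W(x)\, dx, \tag{16b} $$
and so on, up to the condition for $Q_{N-1}$:
$$ w_1 Q_{N-1}(x_1) + w_2 Q_{N-1}(x_2) + \cdots + w_N Q_{N-1}(x_N) = \int_a^b Q_{N-1}(x) W(x)\, dx. \tag{16c} $$
Equations (16) together constitute an $N \times N$ linear system for the quadrature weights $w_n$.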
Note also that the RHS of this system is simpler than it looks: as we noted
earlier, all the RHS integrals vanish except for the one involving Q0 , so the RHS
vector of our linear system has only one nonzero entry.
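Here is a small Python sketch of this construction for the Legendre case ($W = 1$ on $[-1, 1]$). The function names and the tiny Gaussian-elimination helper are illustrative assumptions, not part of the notes. Row $n$ of the matrix holds $P_n$ evaluated at each point, and the right-hand side has the single nonzero entry $\int_{-1}^{1} P_0\, dx = 2$.

```python
def legendre(n, x):
    """P_n(x) from the three-term recurrence, with P_0 = 1 and P_1 = x."""
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def solve(M, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    A = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, n):
            m = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= m * A[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(A[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (A[r][n] - s) / A[r][r]
    return x

def weights_for_points(points):
    """Weights making the rule exact for P_0, ..., P_{N-1} on [-1, 1]."""
    N = len(points)
    M = [[legendre(n, x) for x in points] for n in range(N)]
    rhs = [2.0] + [0.0] * (N - 1)     # only the P_0 integral is nonzero
    return solve(M, rhs)

w = weights_for_points([-1.0, 0.0, 1.0])
print(w)   # equally spaced points reproduce the Simpson weights 1/3, 4/3, 1/3
```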
But Gauss discovered a way to construct an $N$-point quadrature rule that exactly integrates polynomials up to degree $2N - 1$
The preceding development tells me that, given any choice of $N$ points $\{x_n\}$, I can find a set of $N$ weights that makes the quadrature rule (15) exact for all polynomials up to degree $N - 1$.
However, among all possible ways to choose the set of quadrature points $\{x_n\}$, there is one choice that is distinguished: It is the set of roots of the polynomial $Q_N(x)$. It is an astonishing fact that the quadrature rule (15), computed with the $\{x_n\}$ taken as the roots of $Q_N$ and the weights computed as discussed above, is exact for all polynomials up to degree $2N - 1$. This massively expands the space of functions over which our quadrature rule is exact; the technique is known as Gaussian quadrature.
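The degree-$(2N - 1)$ claim can be checked directly for $N = 3$. The Python sketch below hard-codes the standard 3-point Gauss-Legendre rule on $[-1, 1]$ (nodes $0, \pm\sqrt{3/5}$, weights $8/9, 5/9$) and measures the error on the monomials $x^k$; the variable names are illustrative.

```python
import math

nodes = [-math.sqrt(0.6), 0.0, math.sqrt(0.6)]    # roots of P_3
weights = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]

def gauss3(f):
    """Apply the 3-point Gauss-Legendre rule on [-1, 1]."""
    return sum(w * f(x) for x, w in zip(nodes, weights))

def exact_monomial(k):
    """Exact integral of x**k over [-1, 1]."""
    return 0.0 if k % 2 else 2.0 / (k + 1)

errors = [abs(gauss3(lambda x, k=k: x ** k) - exact_monomial(k)) for k in range(8)]
for k, e in enumerate(errors):
    print("degree", k, "error", e)   # rounding-level through degree 5, not at degree 6
```

Three samples suffice for every polynomial through degree 5, whereas three arbitrary points would only guarantee exactness through degree 2.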
The proof of this statement is amazingly simple. Let $f(x)$ be a polynomial of degree $2N - 1$ or less. If we divide[3] $f(x)$ by the polynomial $Q_N(x)$, we obtain some quotient $p(x)$ and some remainder $r(x)$, and because $Q_N$ has degree $N$ we are guaranteed that $p(x)$ and $r(x)$ both have degree $N - 1$ or less. In other words, any polynomial $f$ of degree $2N - 1$ or less may be written exactly in the form
$$ f(x) = Q_N(x)\, p(x) + r(x), \qquad \deg p,\ \deg r \le N - 1. \tag{17} $$
But now look at what happens when I apply the quadrature rule (15) to $f(x)$:
$$ \int_a^b f(x) W(x)\, dx \approx \sum_{n=1}^{N} w_n f(x_n) = \sum_{n=1}^{N} w_n \Big[ \underbrace{Q_N(x_n)}_{=0}\, p(x_n) + r(x_n) \Big]. $$
The first term vanishes because the quadrature points are roots of $Q_N$! This leaves behind
$$ = \sum_{n=1}^{N} w_n r(x_n) = \int_a^b r(x) W(x)\, dx \quad \text{(exactly)}. \tag{18} $$
In other words, using our quadrature rule to integrate the function (17) is equivalent to integrating just the function $r(x)$. But this function is exactly integrated by our quadrature rule because it has degree at most $N - 1$ and our quadrature rule handles all such functions exactly.

[3] The operation at work here is synthetic division; do you remember this from high school?
Meanwhile, we can evaluate the exact integral $\int_a^b f(x) W(x)\, dx$ another way, by expanding the function $p(x)$ in (17) in the set of functions $\{Q_n\}$ [cf. equation (11b)]. Since $p$ has degree at most $N - 1$, this expansion includes only terms up to $Q_{N-1}$:
$$ p(x) = \sum_{n=0}^{N-1} c_n Q_n(x), $$
so that
$$ f(x) = Q_N(x) \sum_{n=0}^{N-1} c_n Q_n(x) + r(x). $$
Integrating, we find
$$ \begin{aligned}
\int_a^b f(x) W(x)\, dx &= \sum_{n=0}^{N-1} c_n \underbrace{\int_a^b Q_N(x) Q_n(x) W(x)\, dx}_{=0} + \int_a^b r(x) W(x)\, dx && (19) \\
&= \int_a^b r(x) W(x)\, dx. && (20)
\end{aligned} $$
1
VT
Combining, we obtain
Z
L/2
(x) = 0
L/2
e(x )
dx0
x x0 
(22)
linecharge density (x) is defined such that the total charge in the interval [x, x + dx]
is (x)dx.
Nyström's Method
Nyström's method uses Gaussian quadrature to convert an integral equation into a linear system of equations. The most general setting is to consider an integral equation of the form
$$ \int_a^b K(x, x')\, S(x')\, dx' = F(x) \tag{23} $$
where $K(x, x')$ is a known kernel function, $F(x)$ is a known forcing function, and $S(x)$ is an unknown source function for which we are trying to solve. Nyström's method is to use an $N$-point quadrature rule for the interval $[a, b]$:
$$ \int_a^b K(x, x')\, S(x')\, dx' \approx \sum_{n=1}^{N} w_n K(x, x_n)\, S(x_n). $$
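Evaluating at each of the quadrature points $x = x_1, \cdots, x_N$ then yields a linear system of the form
$$
\begin{pmatrix}
w_1 K(x_1, x_1) & w_2 K(x_1, x_2) & \cdots & w_N K(x_1, x_N) \\
w_1 K(x_2, x_1) & w_2 K(x_2, x_2) & \cdots & w_N K(x_2, x_N) \\
\vdots & \vdots & \ddots & \vdots \\
w_1 K(x_N, x_1) & w_2 K(x_N, x_2) & \cdots & w_N K(x_N, x_N)
\end{pmatrix}
\begin{pmatrix} S(x_1) \\ S(x_2) \\ \vdots \\ S(x_N) \end{pmatrix}
=
\begin{pmatrix} F(x_1) \\ F(x_2) \\ \vdots \\ F(x_N) \end{pmatrix}
$$
which we solve for the values of our unknown source distribution at the quadrature points.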
$$ S = \sum_{n=1}^{\infty} f(n), \qquad S_N = \sum_{n=1}^{N} f(n), \qquad f(n) = \frac{1}{n^4}. $$
(a) Estimate how large we must choose $N$ to ensure that $S_N$ agrees with $S$ to 9-digit precision. (That is, estimate the smallest value of $N$ such that $E_N < 10^{-9}$.)
(b) Write a computer program involving a simple loop to evaluate SN . Plot
EN versus N and assess the accuracy of your prediction from Part (a).
Note: Although not necessary to solve this problem, it is interesting that the infinite sum here may be evaluated in closed form:
$$ S = \sum_{n=1}^{\infty} \frac{1}{n^4} = \frac{\pi^4}{90}. $$
We will prove this statement later in the semester when we discuss Fourier
analysis.
Problem 2. (This is a simple exercise that foreshadows a concept we will discuss in detail in a couple of weeks.) Many numerical sums involve summands of widely varying magnitudes. However, in some cases we might find ourselves summing many numbers of roughly equal magnitudes. As a particularly blatant example, consider the quantity $P_N$ defined as the sum of $N$ equal numbers as follows:
$$ P_N \equiv \sum_{n=1}^{N} \frac{\pi}{N}. $$
(The fact that the summand is independent of $n$ here is not a typo!) Now consider the following julia program for computing the quantity $P_N$.
function PN(N)
Summand = pi/N;
Sum=0.0;
for n=1:N
Sum += Summand;
end
Sum
end
State, in words, how you expect the quantity $E_N$ to depend on $N$ for values of $N$ in the range $10^2 < N < 10^9$.
EN
(b) Now write a computer program that computes $E_N$ for general values of $N$. Plot, on a log-log plot, $E_N$ versus $N$ for values in the range $10^2 < N < 10^9$. (If you use julia, you may copy-and-paste the above code snippet for the function PN; if you use another language it will be easy enough to port this snippet to that language.) How do the results compare with your expectations as stated in Part (a)?
Problem 3. In this problem you will derive the composite secondorder NewtonCotes quadrature rule (Simpsons rule) for integrating over an interval [, ],
subdivided into M subintervals.
(a) As a preliminary warmup, suppose you are given an $N$-point quadrature rule $\{x_n, w_n\}$ for integrating over the interval $[-1, 1]$. That is, $\{x_n\}$ are $N$ points lying in the range $-1 \le x_n \le 1$, and $\{w_n\}$ are $N$ weights such that
$$ \int_{-1}^{1} f(x)\, dx \approx \sum_{n=1}^{N} w_n f(x_n). \tag{1} $$
n=1
(b) Next derive the basic (not composite) second-order Newton-Cotes quadrature rule for integrating a function over the interval $[-1, 1]$, as follows: (1) Given a function $f(x)$ defined on this interval, construct the unique second-degree polynomial $P(x) = ax^2 + bx + c$ that agrees with $f(x)$ at the three points $x = -1, 0, 1$. [Your answer will involve expressions for $a, b, c$ in terms of $f(-1), f(0), f(1)$.] (2) Integrate $P(x)$ over the interval $[-1, 1]$ to obtain an approximation to the integral of $f$ over this interval in terms of the three samples $f(-1), f(0), f(1)$. Express this result in the form (1) to obtain a quadrature rule $\{x_n, w_n\}$ for integrating $f$ over $[-1, 1]$.
(c) Combine your answers to parts (a) and (b) to write down the basic (not
composite) Simpsons rule for integrating f over [u, v].
(d) Finally, given an interval [, ], subdivide the interval into M equalwidth
subintervals, apply the basic Simpsons rule to integrate f over each subinterval, and sum the results to obtain the composite Simpsons rule for
integrating f over [, ]. How many samples of f does this rule require?
(Be careful not to overcount).
Problem 4. Write a computer program that implements the composite 0th-, 1st-, and 2nd-order Newton-Cotes quadrature rules (that is, the composite rectangular, trapezoidal, and Simpson's rules) for integrating an arbitrary function over an arbitrary interval, subdivided into $M$ subintervals. Use your program to approximate the following integrals. In each case, plot the relative error
$$ E \equiv \frac{\left| I^{\text{approx}} - I^{\text{exact}} \right|}{\left| I^{\text{exact}} \right|} $$
versus $N$ for values of $N$ in the range $[10, 10^7]$. (Here $N$ is the number of function samples required by the quadrature rule, $I^{\text{approx}}$ is the approximation to the integral obtained by numerical quadrature, and $I^{\text{exact}}$ is the exact value of the integral.) How do the results compare with your expectations?
Z
2
(a) Ia
ecos (x+1) +2 sin(4x+1) dx
0
Z
(b) Ib
dx
tanh x dx
p
x 
arctan(x) arctan(x)
dx
x
(c) Ic
0
Z
(d) Id
0
Note: Although not strictly necessary to work this problem, for your error
comparisons you may use the following table of accurate integral values:
Ia = 2.5193079820307612557
Ib = 4.4889560612699568830
Ic = 6.6388149923287733132
Id = 1.7981374998645790990
Extra credit (5%) Unlike the other integrals in Problem 4, it turns out that
the improper integral of Problem 4(d) may be evaluated analytically in closed
form. Do so. Hint: Replace the number with a variable u and let F (u) be
the value of the integral in this case. Differentiate F with respect to u and see
what you get.
Extra credit (10%): Mathematics evolving around us in real time.
Just under two years ago, on April 17, 2013, a little-known mathematician at the University of New Hampshire submitted to the journal Annals of Mathematics a paper that solved an extremely old outstanding problem in number theory. To earn some extra credit on your PSet this week you may do a little research to learn about this interesting mathematics story evolving around us in real time. One of many outstanding resources you may find useful in tracking this down is Terence Tao's blog: http://terrytao.wordpress.com.
(a) Who is the (formerly) little-known mathematician? (He's certainly not little-known anymore.) Did he follow a traditional career path to achieving success in mathematics? Name at least one non-academic job he held after receiving his PhD.
(b) What problem did the mathematician solve? State the problem clearly, and give a brief (one-sentence) summary of the solution. You do not have to understand how the solution works. (For example, I don't.)
(c) The solution to the problem involves a certain integer-valued parameter
commonly known as H1 , for which it is generally considered desirable to find
the minimal admissible value. What is the significance of H1 ? What value of
H1 was included in the original paper submitted to Annals of Mathematics?
(d) Since the original paper submission in April 2013, the mathematics
community has succeeded in reducing significantly the minimal admissible value
of H1 . What is the best current value of H1 and how recently was it obtained?
Briefly describe the collaborative process by which the improved value of H1 was
attained, and comment on how it differs from how mathematics was done prior
to the 21st century.
(e) Extra extra credit (50,000%). Find an even smaller admissible value
of H1 .
(x x0 )2 + (y y 0 )2 + z 2 .
r 1 = (x 1 , y 1 , 0).
Given this simplification, rewrite the three coupled second-order differential equations of Part (a) in the form of a twelve-dimensional first-order ODE system. Work in units such that $Gm = 1$.
(c) Using the improved Euler method, integrate this ODE from t = 0 to t = 20
subject to the following initial conditions:
(x1 , y1 )
(0.7, 0.36),
(x 1 , y 1 )
(0.99, 0.078)
(x2 , y2 )
(1.1, 0.07),
(x 2 , y 2 )
(0.1, 0.47)
(x3 , y3 )
(0.4, 0.3),
(x 3 , y 3 )
(1.1, 0.53)
(2)
Plot the trajectory of each planet. (That is, for each planet, plot a curve in the $(x, y)$ plane representing the path that planet traverses as it moves in time. Plot all three curves on the same graph.) Make sure you choose a step size small enough to ensure that the orbits are converged within the scale of the plot axes (that is, rerunning the calculation at a smaller stepsize will not noticeably change the plots).

[1] Reference: http://www.math.utexas.edu/users/jjames/celestHw2Notes.pdf
(d) To investigate the fragility of this special type of orbit, tweak one or more of
the 12 numbers in (2) (say, increase or decrease it by 25% or so) and integrate the system again. Plot the resulting orbits for at least two different
tweaks of initial conditions.
(e) Extra credit (10%): Can you find an alternative set of initial conditions
that leads to trajectories qualitatively similar to what you found in Part
(c)? By alternative I mean a set of 12 numbers of which at least 6 differ
by more than 50% from the values given in (2).
Extra credit (10%). Go to the science library and consult the second edition
of the book A Classical Introduction to Modern Number Theory by Ireland &
Rosen. Find the proof of Theorem 20.6.1 on page 359.
(a) Write a brief (around one sentence) description of the logical structure of
this method of proof. You do not need to understand or describe the
content of the conjecture being proven or the hypothesis used in its proof.
(b) Describe the slightly unusual punctuation the authors use to conclude their
description of the proof schema. Have you seen this notation in a mathematics textbook before?
(c) Can you think of any other theorems whose proof proceeds along the same
logical structure as this proof? (I can only think of one, and I stumbled
on it by accident, so no is an acceptable answer to this question.)
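$$ F(\Delta) = \underbrace{F(0)}_{\text{what we want}} + A\Delta^p + O(\Delta^{p+1}) \tag{2} $$
The quantity $p$ determines how hard we have to work to improve the accuracy of a given estimate of our quantity. To see this, suppose we have computed $F(\Delta)$ for some value of $\Delta$, and suppose we now want to refine this estimate by adding roughly one digit of precision; that is to say, we want to decrease the error by a factor of 10. If $p = 1$, then to reduce the error by 10 we must decrease $\Delta$ by a factor of 10.
Now suppose we evaluate our method at two values of the convergence parameter, $\Delta$ and $\Delta/2$:
$$ \begin{aligned}
F(\Delta) &= F(0) + A\Delta^p + O(\Delta^{p+1}) && (3) \\
F\!\left(\frac{\Delta}{2}\right) &= F(0) + A\left(\frac{\Delta}{2}\right)^p + O(\Delta^{p+1}) && (4)
\end{aligned} $$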
Now multiply the second line here by $2^p$, subtract the first line from it, and do a little algebra to obtain[1]
$$ F(0) = \frac{2^p F\!\left(\frac{\Delta}{2}\right) - F(\Delta)}{2^p - 1} + O(\Delta^{p+1}) \tag{5} $$
The point is that the error term proportional to $\Delta^p$ in (3) and (4) has cancelled out of the combination in (5), leaving us with an estimate of our quantity whose error decays more rapidly with $\Delta$.
The first term on the RHS of (5) defines the Richardson-extrapolated version of our numerical method at convergence parameter $\Delta$:
$$ F^{\text{Richardson}}(\Delta) \equiv \frac{2^p F\!\left(\frac{\Delta}{2}\right) - F(\Delta)}{2^p - 1} \tag{6a} $$
or, written in terms of the parameter $N$,
$$ F^{\text{Richardson}}(N) \equiv \frac{2^p F(2N) - F(N)}{2^p - 1}. \tag{6b} $$
[1] If you are following along with the algebra at home, you will notice that the $O(\Delta^{p+1})$ term in equation (5) is a linear combination of the $O(\Delta^{p+1})$ terms in (3) and (4). The point is that any linear combination of two quantities that are each $O(\Delta^{p+1})$ yields a third quantity that is itself $O(\Delta^{p+1})$, no matter what coefficients we choose in the linear combination (as long as none of them depend on $\Delta$). This is a feature of the $O()$ notation: it completely ignores multiplicative coefficients and only keeps track of the leading $\Delta$ dependence.

The quantity labeled "what we can compute" in this equation is the Richardson-extrapolated version of our numerical method at convergence parameter $\Delta$. Comparing this to equation (2), we see that we have effectively improved the rate of convergence of our numerical approximation scheme.
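As a concrete illustration, the Python sketch below applies the $N$-form of the extrapolation formula, $[2^p F(2N) - F(N)]/(2^p - 1)$, to the composite trapezoidal rule, whose leading error has $p = 2$. The test integrand $\int_0^\pi \sin x\, dx = 2$ and the function names are assumptions chosen for illustration.

```python
import math

def trapezoid(f, a, b, N):
    """Composite trapezoidal rule with N subintervals (leading error ~ 1/N^2)."""
    h = (b - a) / N
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, N)) + 0.5 * f(b))

def richardson(F, N, p):
    """Richardson-extrapolated estimate built from F(N) and F(2N)."""
    return (2 ** p * F(2 * N) - F(N)) / (2 ** p - 1)

F = lambda N: trapezoid(math.sin, 0.0, math.pi, N)   # exact value is 2

err_plain = abs(F(16) - 2.0)                  # best unextrapolated estimate used
err_rich = abs(richardson(F, 8, p=2) - 2.0)   # combines F(8) and F(16)

print("trapezoid, N = 16    :", err_plain)
print("Richardson, N = 8, 16:", err_rich)
```

Both estimates use the same function samples, yet the extrapolated value is far more accurate; for the trapezoidal rule this combination reproduces Simpson's rule.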
Terminology
In some cases, the application of Richardson extrapolation to an existing numerical method is assigned a new name, even though the underlying method is really the same. For example, the application of Richardson extrapolation to Newton-Cotes quadrature rules is called Romberg integration. On the other hand, in the world of ODE integrators the combination of Richardson extrapolation with the midpoint method (which you considered in PSet 3) is known as the Bulirsch-Stoer algorithm.
Contents
1 Overview
1.1 Examples of root-finding problems
2 One-dimensional root-finding techniques
2.1 Bisection
2.2 Secant
2.3 Newton-Raphson
Overview
(1)
1.1
Ferromagnets
The mean-field theory of the $D$-dimensional Ising ferromagnet yields the following equation governing the spontaneous magnetization $m$:
$$ m = \tanh\left(\frac{2Dm}{T}\right) \tag{2} $$
2n
nT
=0
c
1 + n2
must hold between T , n, and the angular frequency in order for a resonant
mode to exist. (Here c is the speed of light in vacuum.)
The Riemann $\zeta$ function
The greatest unsolved problem in mathematics today is a root-finding problem. The Riemann $\zeta$ (zeta) function is defined by a contour integral as
$$ \zeta(s) = \frac{\Gamma(1 - s)}{2\pi i} \oint_C \frac{z^{s-1}}{e^{-z} - 1}\, dz $$
where $C$ is a certain contour in the complex plane. This function has trivial roots at negative even integers $s = -2, -4, -6, \cdots$, as well as nontrivial roots at other values of $s$. To date many nontrivial roots of the equation $\zeta(s) = 0$ have been identified, but they all have the property that their real part is $\frac{1}{2}$. The Riemann hypothesis is the statement that in fact all nontrivial roots of $\zeta(s) = 0$ have $\operatorname{Re} s = \frac{1}{2}$, and if you can prove this statement (or find a counterexample by producing $s$ such that $\zeta(s) = 0$, $\operatorname{Re} s \ne \frac{1}{2}$) then the Clay Mathematics Institute in Harvard Square will give you a million dollars.
Linear eigenvalue problems
Let $\mathbf{A}$ be an $N \times N$ matrix and consider the problem of determining eigenpairs $(\mathbf{x}, \lambda)$, where $\mathbf{x}$ is an $N$-dimensional vector and $\lambda$ is a scalar. These are roots of the equation
$$ \mathbf{A}\mathbf{x} - \lambda\mathbf{x} = 0. \tag{3} $$
Because both $\lambda$ and $\mathbf{x}$ are unknown, we should think of (3) as an $(N + 1)$-dimensional nonlinear root-finding problem, where the $(N + 1)$-dimensional vector of unknowns we seek is $\begin{pmatrix} \mathbf{x} \\ \lambda \end{pmatrix}$, and where the nonlinearity arises because the unknowns $\lambda$ and $\mathbf{x}$ multiply each other in (3). In practice, however, it is generally solved using a set of extremely well-developed methods of numerical linear algebra (namely, Householder decomposition and QR factorization), which are implemented by lapack and available in all numerical software packages including julia and matlab.
Nonlinear eigenvalue problems
On the other hand, it may be the case that the matrix $\mathbf{A}$ in (3) depends on its own eigenvalues and/or eigenvectors. In this case we have a nonlinear eigenvalue problem and the usual methods of numerical linear algebra do not apply; instead we must solve using nonlinear root-finding methods such as Newton's method.
Nonlinear boundaryvalue problems
In our unit on boundary-value problems we considered the problem of a particle moving in a time-dependent force field $f(t)$. We considered an ODE boundary-value problem of the form
$$ \frac{d^2 x}{dt^2} = f(t), \qquad x(t_a) = x(t_b) = 0 \tag{4} $$
(5)
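$$ \mathbf{x} = \begin{pmatrix} x(t_1) \\ \vdots \\ x(t_N) \end{pmatrix} \equiv \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \text{unknown}, \qquad \mathbf{f} = \begin{pmatrix} f(t_1) \\ \vdots \\ f(t_N) \end{pmatrix} = \text{known}. \tag{6} $$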
(7)
which may be computed easily via standard methods of numerical linear algebra.
But now consider the case of particle motion in a position-dependent force field $f(x)$. (For example, in a 1D gravitational-motion problem we would have $f(x) = -\frac{GM}{x^2}$.) The ODE now takes the form
$$ \frac{d^2 x}{dt^2} = f(x), \qquad x(t_a) = x(t_b) = 0. \tag{8} $$
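$$ \mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \text{unknown}, \qquad \mathbf{f} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix} = \text{also unknown!} \tag{10} $$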
and no immediate solution like (7) is available; instead we must solve iteratively
using nonlinear rootfinding techniques.
2 One-dimensional root-finding techniques
2.1 Bisection
The simplest root-finding method is the bisection method, which basically just performs a simple binary search. We begin by bracketing the root: this means finding two points $x_1$ and $x_2$ at which $f(x)$ has different signs, so that we are guaranteed[2] to have a root between $x_1$ and $x_2$. Then we bisect the interval $[x_1, x_2]$, computing the midpoint $x_m = \frac{1}{2}(x_1 + x_2)$ and evaluating $f$ at this point. We now ask whether the sign of $f(x_m)$ agrees with that of $f(x_1)$ or $f(x_2)$. In the former case, we have now bracketed the root in the interval $[x_m, x_2]$; in the latter case, we have bracketed the root in the interval $[x_1, x_m]$. In either case, we have shrunk the width of the interval within which the root may be hiding by a factor of 2. Now we again bisect this new interval, and so on.
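The procedure just described takes only a few lines of code. The following Python sketch (the function name `bisect` and its signature are illustrative assumptions) keeps the half-interval whose endpoint signs still differ, and is run here on $f(x) = \tanh(x - 5)$ with the starting bracket $[3.0, 5.8]$.

```python
import math

def bisect(f, x1, x2, n_iter):
    """Bisection: repeatedly halve a bracket [x1, x2] on which f changes sign."""
    f1 = f(x1)
    if f1 * f(x2) > 0:
        raise ValueError("f(x1) and f(x2) must have different signs")
    for _ in range(n_iter):
        xm = 0.5 * (x1 + x2)
        fm = f(xm)
        if (fm > 0) == (f1 > 0):
            x1, f1 = xm, fm      # sign agrees with f(x1): root lies in [xm, x2]
        else:
            x2 = xm              # sign agrees with f(x2): root lies in [x1, xm]
    return x1, x2

lo, hi = bisect(lambda x: math.tanh(x - 5.0), 3.0, 5.8, n_iter=16)
print("bracket:", lo, hi)
print("width  :", hi - lo, "expected:", 2.8 / 2 ** 16)
```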
Case Study
As a simple case study, let's investigate the convergence of the bisection method on the function $f(x) = \tanh(x - 5)$. The exact root, to 16-digit precision, is x=5.000000000000000. Suppose we initially bracket the root in the interval [3.0,5.8] and take the midpoint of the interval to be our guess as to the starting value; thus, for example, our initial guess is $x_0 = 4.4$. The following table of numbers illustrates the evolution of the method as it converges to the exact root.

 n   Bracket                             x_n
 1   [3.00000000e+00, 5.80000000e+00]    4.400000000000000e+00
 2   [4.40000000e+00, 5.80000000e+00]    5.100000000000000e+00
 3   [4.40000000e+00, 5.10000000e+00]    4.750000000000000e+00
 4   [4.75000000e+00, 5.10000000e+00]    4.925000000000000e+00
 5   [4.92500000e+00, 5.10000000e+00]    5.012499999999999e+00
 6   [4.92500000e+00, 5.01250000e+00]    4.968750000000000e+00
 7   [4.96875000e+00, 5.01250000e+00]    4.990625000000000e+00
 8   [4.99062500e+00, 5.01250000e+00]    5.001562499999999e+00
 9   [4.99062500e+00, 5.00156250e+00]    4.996093750000000e+00
10   [4.99609375e+00, 5.00156250e+00]    4.998828124999999e+00
11   [4.99882812e+00, 5.00156250e+00]    5.000195312499999e+00
12   [4.99882812e+00, 5.00019531e+00]    4.999511718749999e+00
13   [4.99951172e+00, 5.00019531e+00]    4.999853515624999e+00
14   [4.99985352e+00, 5.00019531e+00]    5.000024414062499e+00
15   [4.99985352e+00, 5.00002441e+00]    4.999938964843748e+00
16   [4.99993896e+00, 5.00002441e+00]    4.999981689453124e+00

[2] Assuming the function is continuous. We will not consider the ill-defined problem of root-finding for discontinuous functions.

The important thing about this table is that the number of correct digits grows approximately linearly with $n$. This is what we call linear convergence.[3] Let's now try to understand this phenomenon analytically.
Convergence rate
Suppose the width of the interval within which we initially bracketed the root was $\Delta_0 = x_2 - x_1$. Then, after one iteration of the method, the width of the interval within which the root may be hiding has shrunk to $\Delta_1 = \frac{1}{2}\Delta_0$ (note that this is true regardless of which subinterval we chose as our new bracket; they both had the same width). After two iterations, the width of the interval within which the root may be hiding is $\Delta_2 = \frac{1}{2}\Delta_1 = \frac{1}{4}\Delta_0$, and so on. Thus, after $N$ iterations, the width of the interval within which the root may be hiding (which we may alternatively characterize as the absolute error with which we have pinpointed a root) is
$$ \Delta_N^{\text{bisection}} = 2^{-N} \Delta_0 \tag{11} $$
2.2
Secant
The idea of the secant method is to speed the convergence of the bisection
method by using information about the magnitudes of the function values at
the interval endpoints in addition to their signs. More specifically, suppose we
have evaluated $f(x)$ at two points $x_1$ and $x_2$. We plot the points $(x_1, y_1 = f(x_1))$
and $(x_2, y_2 = f(x_2))$ on a Cartesian coordinate system and draw a straight line
connecting these two points. Then we take the point $x_3$ at which this line crosses
the x-axis as our updated estimate of the root. In symbols, the rule is
$$x_3 = x_2 - \frac{x_2 - x_1}{f(x_2) - f(x_1)}\, f(x_2).$$
Then we repeat the process, generating a new point $x_4$ by looking at the points
$(x_2, f(x_2))$ and $(x_3, f(x_3))$, and so on. The general rule is
$$x_{n+1} = x_n - \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\, f(x_n). \tag{12}$$
As we might expect, the error in the secant method decays more rapidly than
that in the bisection method; the number of correct digits is multiplied by roughly
$p = \frac{1+\sqrt{5}}{2} \approx 1.6$ on each iteration, i.e. it grows roughly like $p^n$ after $n$ iterations.
One drawback of the secant method is that, in contrast to the bisection
method, it does not maintain a bracket of the root. This makes the method less
robust than the bisection method.
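As an illustration of rule (12), here is a minimal Python sketch (not the Julia used elsewhere in these notes). The function name `secant`, the stopping tolerance, the test function tanh(x - 5), and the two starting points are assumptions made for this example.

```python
import math

def secant(f, x_prev, x_curr, tol=1e-14, max_iter=50):
    """Iterate the secant rule (12) and return the list of all iterates."""
    xs = [x_prev, x_curr]
    f_prev, f_curr = f(x_prev), f(x_curr)
    for _ in range(max_iter):
        if f_curr == f_prev:
            break                     # secant line is horizontal; cannot continue
        x_next = x_curr - (x_curr - x_prev) / (f_curr - f_prev) * f_curr
        xs.append(x_next)
        x_prev, f_prev = x_curr, f_curr
        x_curr, f_curr = x_next, f(x_next)
        if abs(x_curr - x_prev) < tol:
            break                     # successive iterates agree to within tol
    return xs

xs = secant(lambda x: math.tanh(x - 5.0), 4.4, 5.8)
for n, x in enumerate(xs, start=1):
    print(n, f"{x:.15f}")
```

Note that nothing in the loop keeps the root bracketed: the two most recent points may both lie on the same side of the root, which is the robustness drawback mentioned above.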
2.3 Newton-Raphson
Take another look at equation (12). Suppose that $x_{n-1}$ is close to $x_n$, i.e.
imagine $x_{n-1} = x_n + h$ for some small number $h$. Then the quantity multiplying
$f(x_n)$ in the second term of (12) is something like the inverse of the finite-difference approximation to the derivative of $f$ at $x_n$:
$$\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} \approx \frac{1}{f'(x_n)}.$$
If we assume that this approximation is trying to tell us something, we are led
to consider the following modified version of (12):
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}. \tag{13}$$
This prescription for obtaining an improved root estimate from an initial root estimate is called Newton's method (also known as the Newton-Raphson method).
Alternative derivation of Newton-Raphson
Another way to understand the Newton-Raphson iteration (13) is to expand the
function $f(x)$ in a Taylor series about the current root estimate $x_n$:
$$f(x) = f(x_n) + (x - x_n) f'(x_n) + \frac{1}{2}(x - x_n)^2 f''(x_n) + \cdots \tag{14}$$
If we evaluate (14) at the actual root $x_0$, then the LHS is zero (because $f(x_0) = 0$
since $x_0$ is a root), whereupon we find
$$0 = f(x_n) + (x_0 - x_n) f'(x_n) + O[(x_0 - x_n)^2].$$
If we neglect the quadratic and higher-order terms in this equation, we can solve
immediately for the root $x_0$:
$$x_0 \approx x_n - \frac{f(x_n)}{f'(x_n)}. \tag{15}$$
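Retaining the quadratic term of (14) and dividing through by $f'(x_n)$ gives instead
$$\underbrace{x_0 - \left[x_n - \frac{f(x_n)}{f'(x_n)}\right]}_{x_0 - x_{n+1}} = -\frac{1}{2}\,\frac{f''(x_n)}{f'(x_n)}\,(x_0 - x_n)^2 + O\left[(x_0 - x_n)^3\right].$$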
But now the quantity on the LHS is telling us the distance between the root
and $x_{n+1}$, the next iteration of the Newton method. In other words, if we define
the error after $n$ iterations as $\epsilon_n = x_0 - x_n$, then
$$\epsilon_{n+1} = C \epsilon_n^2$$
(where $C$ is some constant). In other words, the error squares on each iteration.
To analyze the implications of this fact for convergence, it's easiest to take
logarithms on both sides:
$$\log \epsilon_{n+1} \sim 2 \log \epsilon_n \sim 4 \log \epsilon_{n-1} \tag{16}$$
n    x_n
1    4.400000000000000
2    5.154730677706086
3    4.997518482593209
4    5.000000010187351
5    5.000000000000000
After 3 iterations, I have 4 good digits; after 4 iterations, 8 good digits; after 5
iterations, 16 good digits. This is quadratic convergence.
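The rapid growth in the number of good digits can be reproduced with a few lines of code. The sketch below is in Python (not the Julia used elsewhere in these notes) and assumes that the iterates above come from applying (13) to f(x) = tanh(x - 5) starting from 4.4, an assumption that is consistent with the listed values; the function name `newton` is hypothetical.

```python
import math

def newton(f, fprime, x, n_iter):
    """Apply the Newton update (13) n_iter times; return all iterates."""
    xs = [x]
    for _ in range(n_iter):
        x = x - f(x) / fprime(x)
        xs.append(x)
    return xs

f = lambda x: math.tanh(x - 5.0)                  # assumed test function
fprime = lambda x: 1.0 / math.cosh(x - 5.0) ** 2  # its derivative, sech^2(x - 5)

xs = newton(f, fprime, 4.4, 4)
for n, x in enumerate(xs, start=1):
    print(n, f"{x:.15f}", f"error = {abs(x - 5.0):.1e}")
```

Printing the error alongside each iterate makes the pattern visible: the exponent of the error roughly doubles (or better) from one line to the next.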
Double roots
What happens if $f(x)$ has a double root at $x = x_0$? A double root means
that both $f(x_0) = 0$ and $f'(x_0) = 0$. Since our error analysis above assumed
$f'(x_0) \neq 0$, we might expect it to break down if this condition is not satisfied,
and indeed in this case Newton's method exhibits only linear convergence.
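A small experiment makes the loss of quadratic convergence concrete. The following Python sketch (not the Julia used elsewhere in these notes) applies Newton's method to the assumed example f(x) = (x - 1)^2, which has a double root at x = 1; the helper name `newton_errors` and the starting point are hypothetical.

```python
def newton_errors(f, fprime, x, root, n_iter):
    """Run Newton's method and record |x_n - root| after each step."""
    errs = [abs(x - root)]
    for _ in range(n_iter):
        x = x - f(x) / fprime(x)
        errs.append(abs(x - root))
    return errs

# f(x) = (x - 1)^2 has a double root at x = 1 (assumed example)
errs = newton_errors(lambda x: (x - 1.0) ** 2,
                     lambda x: 2.0 * (x - 1.0),
                     2.0, 1.0, 10)
ratios = [errs[k + 1] / errs[k] for k in range(len(errs) - 1)]
print(ratios)
```

For this function the Newton update reduces to x - (x - 1)/2, so the error is exactly halved at every step: linear convergence, one binary digit per iteration, no better than bisection.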
One advantage of Newton's method over simple methods like bisection is that it
extends readily to multidimensional root-finding problems. Consider the problem of finding a root $\mathbf{x}_0$ of a vector-valued function:
$$\mathbf{f}(\mathbf{x}) = \mathbf{0} \tag{17}$$
where $\mathbf{x}$ is an $N$-dimensional vector and $\mathbf{f}$ is an $N$-dimensional vector of functions. (Although in the introduction we stated that root-finding problems may
be defined in which the dimensions of $\mathbf{f}$ and $\mathbf{x}$ are different, Newton's method
only applies to the case in which they are the same.)
The linear case
There is one case of the system (17) that you already know how to solve: the
case in which the system is linear, i.e. $\mathbf{f}(\mathbf{x})$ is just matrix multiplication of $\mathbf{x}$ by
a matrix with x-independent coefficients:
$$\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x} = \mathbf{0} \tag{18}$$
In this case, we know there is always the one (trivial) root $\mathbf{x} = \mathbf{0}$, and the condition for the existence of a nontrivial root is the vanishing of the determinant of
$\mathbf{A}$. If $\det \mathbf{A} \neq 0$, then there is no point trying to find a nontrivial root, because
none exists. On the other hand, if $\det \mathbf{A} = 0$ then $\mathbf{A}$ has a zero eigenvalue and
it's easy to solve for the corresponding eigenvector, which is a nontrivial root of
(18).
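For a 2x2 matrix the determinant test and the nontrivial root can be written out by hand. The Python sketch below (not the Julia used elsewhere in these notes) is only an illustration under stated assumptions: the helper names `det2` and `null_vector2` are hypothetical, the matrix is assumed to be nonzero, and exact integer entries are used so that the comparison with zero is meaningful (floating-point entries would need a tolerance).

```python
def det2(A):
    """Determinant of a 2x2 matrix given as a list of two rows."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def null_vector2(A):
    """For a nonzero 2x2 matrix A, return a nontrivial x with A x = 0,
    or None if det A != 0 (only the trivial root exists)."""
    if det2(A) != 0:
        return None
    a, b = A[0] if any(A[0]) else A[1]   # pick a nonzero row (a, b)
    return [-b, a]                        # orthogonal to that row

A = [[2, 4], [1, 2]]                      # singular: second row is half the first
x = null_vector2(A)
Ax = [A[i][0] * x[0] + A[i][1] * x[1] for i in range(2)]
print(x, Ax)
```

Because the rows of a singular 2x2 matrix are proportional, a vector orthogonal to one nonzero row is annihilated by both rows.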
The nonlinear case
The vanishing-of-determinant condition for the existence of a nontrivial root of
(18) is very nice: it tells us exactly when we can expect a nontrivial solution to
exist.
For more general nonlinear systems there is no such nice condition for the
existence of a root⁴, and thus it is convenient indeed that Newton's method for
root-finding has an immediate generalization to the multidimensional case. All
we have to do is write out the multidimensional generalization of (14) for the
Taylor expansion of a multivariable function around the point $\mathbf{x}$:
$$\mathbf{f}(\mathbf{x} + \boldsymbol{\Delta}) = \mathbf{f}(\mathbf{x}) + \mathbf{J}\boldsymbol{\Delta} + O(\Delta^2) \tag{19}$$
4 At least, this is the message they give you in usual numerical analysis classes, but it is not
quite the whole truth. For polynomial systems it turns out there is a beautiful generalization
of the determinant known as the resultant that may be used, like the determinant, to yield a
criterion for the existence of a nontrivial root. I hope we will get to discuss resultants later in
the course, but for now you can read about them in the wonderful books Ideals, Varieties, and
Algorithms and Using Algebraic Geometry, both by Cox, Little, and O'Shea.
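Here $\mathbf{J}$ is the Jacobian matrix of $\mathbf{f}$,
$$\mathbf{J}(\mathbf{x}) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_N} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_N}{\partial x_1} & \frac{\partial f_N}{\partial x_2} & \cdots & \frac{\partial f_N}{\partial x_N}
\end{pmatrix}$$
where all partial derivatives are to be evaluated at $\mathbf{x}$.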
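Now suppose we have an estimate $\mathbf{x}$ for the root of the nonlinear system $\mathbf{f}(\mathbf{x}) = \mathbf{0}$.
Let's compute the increment $\boldsymbol{\Delta}$ that we need to add to $\mathbf{x}$ to jump to the exact
root of the system. Setting (19) equal to zero and ignoring higher-order terms,
we find
$$\mathbf{0} = \mathbf{f}(\mathbf{x} + \boldsymbol{\Delta}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}\boldsymbol{\Delta}$$
or
$$\boldsymbol{\Delta} = -\mathbf{J}^{-1} \mathbf{f}(\mathbf{x}).$$
In other words, if $\mathbf{x}_n$ is our best guess as to the location of the root after $n$
iterations of Newton's method, then our best guess after $n + 1$ iterations will be
$$\mathbf{x}_{n+1} = \mathbf{x}_n - \mathbf{J}^{-1}(\mathbf{x}_n)\, \mathbf{f}(\mathbf{x}_n). \tag{20}$$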
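For example, for the two-dimensional system
$$\mathbf{f}(\mathbf{x}) = \begin{pmatrix} x_1^2 - \cos(x_1 x_2) \\ e^{x_1 x_2} + x_2 \end{pmatrix}, \qquad
\mathbf{J}(\mathbf{x}) = \begin{pmatrix} 2x_1 + x_2 \sin(x_1 x_2) & x_1 \sin(x_1 x_2) \\ x_2 e^{x_1 x_2} & x_1 e^{x_1 x_2} + 1 \end{pmatrix},$$
the iteration (20) may be coded in Julia as follows:

using LinearAlgebra   # provides norm

# the vector-valued function whose root we seek
function f(x)
  x1 = x[1]
  x2 = x[2]
  [x1^2 - cos(x1*x2); exp(x1*x2) + x2]
end

# Jacobian matrix of f, evaluated at x
function J(x)
  x1 = x[1]
  x2 = x[2]
  J11 = 2*x1 + x2*sin(x1*x2)
  J12 = x1*sin(x1*x2)
  J21 = x2*exp(x1*x2)
  J22 = x1*exp(x1*x2) + 1
  [J11 J12; J21 J22]
end

# one Newton step: solve J*delta = f for delta, then subtract it from x
function NewtonStep(x)
  fVector = f(x)
  jMatrix = J(x)
  x - jMatrix \ fVector
end

# iterate until the residual norm falls below the tolerance
function NewtonSolve()
  x = [1.0; 1.0]   # initial guess
  residual = norm(f(x))
  while residual > 1.0e-12
    x = NewtonStep(x)
    residual = norm(f(x))
  end
  x
end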
$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} \approx 5.5 \times 10^{3}.$$
Newton's method has sent us completely out of the ballpark! What went
wrong?
What went wrong here is that the function $\tanh(x - 5)$ has a very gentle slope
at $x = 0$; in fact, the function is almost flat there (more specifically, its slope
is $\operatorname{sech}^2(x - 5) \approx 2 \times 10^{-4}$), and so, when we approximate the function as a line
with that slope and jump to the point at which that line crosses the x-axis, we
wind up something like 5,000 units away. This is what we get for attempting to
use Newton's method with a starting point that is not close to a root.
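The size of this jump is easy to verify. The following Python sketch (not the Julia used elsewhere in these notes) takes a single Newton step on tanh(x - 5) from the starting point x = 0 discussed above; the helper name `newton_step` is hypothetical.

```python
import math

def newton_step(f, fprime, x):
    """One Newton update: follow the tangent line at x down to the x-axis."""
    return x - f(x) / fprime(x)

f = lambda x: math.tanh(x - 5.0)
fprime = lambda x: 1.0 / math.cosh(x - 5.0) ** 2   # sech^2(x - 5)

x1 = 0.0
slope = fprime(x1)            # nearly flat: roughly 2e-4
x2 = newton_step(f, fprime, x1)
print(f"slope at x1 = {slope:.3e}, next iterate x2 = {x2:.1f}")
```

Since the function value at x = 0 is close to -1 and the slope is roughly 2e-4, the tangent line crosses the axis thousands of units to the right of the true root at x = 5.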
$$x_{n+1} = x_n - \frac{x_n^2 + 1}{2 x_n}. \tag{22}$$
If we start with any real-valued initial guess $x_1$, then the sequence of points
generated by (22) is guaranteed to remain real-valued for all $n$, and thus we can
never hope to converge to the correct roots $\pm i$.
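The contrast between real and complex starting points can be seen directly. This Python sketch (not the Julia used elsewhere in these notes) iterates (22) from two assumed starting values, one real and one complex; the function name and the starting values are hypothetical choices for illustration.

```python
def newton_x2_plus_1(x, n_iter):
    """Iterate the update (22) for f(x) = x^2 + 1, n_iter times."""
    for _ in range(n_iter):
        x = x - (x * x + 1) / (2 * x)
    return x

real_result = newton_x2_plus_1(0.5, 10)             # real start: stays real
complex_result = newton_x2_plus_1(0.5 + 0.5j, 20)   # complex start: can reach +i

print(real_result, complex_result)
```

The real iterate wanders along the real axis without settling down, while the iterate started in the upper half-plane converges to the root +i.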
Newton fractals
We get a graphical depiction of phenomena like this by plotting, in the complex
plane, the set of points $\{z_0\}$ at which Newton's method, when started at $z_0$ for
a function $f(z)$, converges to a specific root. [More specifically: For each point $z$
in some region of the complex plane, we run Newton's method on the function $f$
starting at $z$. If the iteration converges to the $m$th root in $N$ iterations, we plot
a color whose RGB value is determined by the tuple $(m, N)$.] You can generate
plots like this using the Julia function PlotNewtonConvergence, which takes as
its single argument a vector of the polynomial coefficients sorted in decreasing
order. Here's an example for the function $f(z) = z^3 - 1$.
julia> PlotNewtonConvergence([1 0 0 -1])
The three roots of $f(z)$ are $1$, $e^{2\pi i/3}$, $e^{4\pi i/3}$. The variously colored regions
in the plot indicate points in the complex plane for which Newton's method
converges to the various roots; for example, red points converge to $e^{2\pi i/3}$, and
yellow points converge to $e^{4\pi i/3}$. What you see is that for starting points in
the immediate vicinity of each root, convergence to that root is guaranteed, but
elsewhere in the complex plane all bets are off; there are large red and yellow
regions that lie nowhere near the corresponding roots, and the fantastically intricate boundaries of these regions indicate the exquisite sensitivity of Newton's
method to the exact location of the starting point.
This type of plot is known as a Newton fractal, for obvious reasons. Thus
the global convergence behavior of Newton's method for polynomial root-finding
yields beautiful pictures, but not a very happy time for actual numerical root-finders.
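The per-point computation behind such a plot can be sketched in a few lines. The Python code below (not the Julia used elsewhere in these notes, and not the PlotNewtonConvergence routine itself) classifies a single starting point for f(z) = z^3 - 1; the function name `newton_basin`, the tolerance, and the iteration cap are assumptions made for this example.

```python
import cmath

# the three cube roots of unity: 1, e^{2 pi i/3}, e^{4 pi i/3}
ROOTS = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]

def newton_basin(z, max_iter=100, tol=1e-10):
    """Run Newton's method on f(z) = z^3 - 1 starting from z.
    Returns (m, n): the index m of the root reached and the number n of
    iterations used, or (None, n) if the iteration fails to converge."""
    for n in range(max_iter):
        for m, root in enumerate(ROOTS):
            if abs(z - root) < tol:
                return m, n
        denom = 3 * z * z                 # f'(z)
        if denom == 0:
            return None, n                # derivative vanishes; step undefined
        z = z - (z ** 3 - 1) / denom
    return None, max_iter

print(newton_basin(1.1 + 0j))
print(newton_basin(ROOTS[1] + 0.05))
print(newton_basin(-1.0 + 0.1j))
```

Mapping the returned pair (m, n) to a color for every point of a grid in the complex plane would produce a picture of the kind described above.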
In the previous section we observed that Newton's method exhibits spectacularly sketchy global convergence when we use it to compute roots of polynomials.
So what should you do to compute the roots of a polynomial $P(x)$? For an arbitrary $N$th-degree polynomial with real or complex coefficients, the fundamental
theorem of algebra guarantees that $N$ complex roots exist (counted with multiplicity), but on the other
hand Galois theory guarantees for $N \geq 5$ that there is no nice formula expressing these roots in terms of the coefficients, so finding them is a task for numerical
analysis. Although specialized techniques for this problem do exist (one such is
the Jenkins-Traub method), a method which works perfectly well in practice
and requires only standard tools is to find a matrix whose characteristic polynomial is $P(x)$ and compute the eigenvalues of this matrix using standard
methods of numerical linear algebra.
The companion matrix
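Such a matrix is called the companion matrix, and for a monic⁵ polynomial
$P(x)$ of the form
$$P(x) = x^n + C_{n-1} x^{n-1} + C_{n-2} x^{n-2} + \cdots + C_1 x + C_0$$
the companion matrix takes the form
$$\mathbf{C}_P = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 & -C_0 \\
1 & 0 & 0 & \cdots & 0 & -C_1 \\
0 & 1 & 0 & \cdots & 0 & -C_2 \\
0 & 0 & 1 & \cdots & 0 & -C_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & -C_{n-1}
\end{pmatrix}.$$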
5 A monic polynomial is one for which the coefficient of the highest-degree monomial is 1. If
your polynomial is not monic (suppose the coefficient of its highest-order monomial is $A \neq 1$),
just consider the polynomial obtained by dividing all coefficients by $A$. This polynomial is
monic and has the same roots as your original polynomial.
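To see that the roots of P really are eigenvalues of the companion matrix, one can check the eigenvalue relation directly for a polynomial with known roots. The Python sketch below (not the Julia used elsewhere in these notes) builds the matrix with the last column holding the negated coefficients and verifies that the row vector (1, r, r^2, ...) is a left eigenvector with eigenvalue r for each root r. The helper names `companion` and `left_multiply` and the example cubic are assumptions for illustration; in practice one would pass the matrix to a standard eigenvalue routine instead.

```python
def companion(c):
    """Companion matrix (list of rows) of the monic polynomial
    x^n + c[n-1] x^(n-1) + ... + c[1] x + c[0]."""
    n = len(c)
    rows = [[0] * n for _ in range(n)]
    for i in range(1, n):
        rows[i][i - 1] = 1            # ones on the subdiagonal
    for i in range(n):
        rows[i][n - 1] = -c[i]        # last column holds -C_0, ..., -C_{n-1}
    return rows

def left_multiply(v, M):
    """Row vector times matrix: returns the row vector v M."""
    n = len(v)
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]

# P(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)  (assumed example)
c = [-6, 11, -6]
C = companion(c)
checks = []
for root in (1, 2, 3):
    v = [root ** k for k in range(len(c))]        # (1, r, r^2)
    checks.append(left_multiply(v, C) == [root * vk for vk in v])
print(C, checks)
```

The check works because the first n - 1 columns simply shift the powers of r, while the last column reproduces r^n by virtue of P(r) = 0.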
6.1
$$[a_1, b_1],\; \ldots,\; [a_n, b_n],\; \ldots$$
with the property that the root is always known to be contained within the
interval in question, i.e. with the property
$$\operatorname{sign} f(a_n) \neq \operatorname{sign} f(b_n)$$
preserved for all $n$.
Golden-section search does something similar, but instead of generating a
sequence of pairs $[a_n, b_n]$ it produces a sequence of triples $[a_n, b_n, c_n]$, i.e.
$$[a_0, b_0, c_0],\; [a_1, b_1, c_1],\; \ldots,\; [a_n, b_n, c_n],\; \ldots$$
with the properties that $a_n < b_n < c_n$ and that each triple is guaranteed to bracket
the minimum, in the sense that $f(b_n)$ is always lower than either of $f(a_n)$ or
$f(c_n)$, i.e. the properties
$$f(a_n) > f(b_n) \quad \text{and} \quad f(b_n) < f(c_n) \tag{23}$$
$[a_n, b_n]$ and $[b_n, c_n]$ remains constant even as the overall width of the bracketing
interval shrinks toward zero. With a little effort you can show that this property
is ensured by taking $\gamma$ to be the golden ratio,
$$\gamma = \frac{3 - \sqrt{5}}{2} = 0.381966011250105\ldots$$
and a fraction $\gamma$ of an interval is known as the golden section of that interval,
which explains the name of the algorithm.
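A compact version of the bracketing-triple update can be sketched as follows. This Python code (not the Julia used elsewhere in these notes) is one possible arrangement, not necessarily the one intended in these notes: it assumes a function with a single minimum on the starting interval, places the interior point a fraction (3 - sqrt(5))/2 of the way into the interval, and always probes the larger of the two subintervals. The name `golden_section_min`, the tolerance, and the test function are hypothetical.

```python
import math

GAMMA = (3.0 - math.sqrt(5.0)) / 2.0      # about 0.381966

def golden_section_min(f, a, c, tol=1e-8):
    """Shrink a bracketing triple (a, b, c) with f(b) < f(a), f(c)
    until its width falls below tol; return the interior point b."""
    b = a + GAMMA * (c - a)
    fb = f(b)
    while (c - a) > tol:
        if (c - b) > (b - a):             # right subinterval is larger: probe it
            x = b + GAMMA * (c - b)
            fx = f(x)
            if fx < fb:
                a, b, fb = b, x, fx       # new triple (b, x, c)
            else:
                c = x                     # new triple (a, b, x)
        else:                             # left subinterval is larger: probe it
            x = b - GAMMA * (b - a)
            fx = f(x)
            if fx < fb:
                c, b, fb = b, x, fx       # new triple (a, x, b)
            else:
                a = x                     # new triple (x, b, c)
    return b

xmin = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
print(xmin)
```

Each pass costs one new function evaluation and shrinks the bracket by the same constant factor, so the method converges linearly, much like bisection does for roots.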
6.2
1.0e30
1.0e45
Since $x_{\text{nearest}}$ deviates from $x_0$ by something like $10^{-15}$, we find that $f(x_{\text{nearest}})$
deviates from $f(x_0)$ by something like $10^{-30}$, i.e. the digits begin to disagree in
the 30th decimal place. But our floating-point registers can only store 15 decimal
digits, so the difference between $f(x_0)$ and $f(x_{\text{nearest}})$ is completely lost; the two
function values are utterly indistinguishable to our computer.
7 This is where the assumption that $x_0 \sim 1$ comes in; the more general statement would be
that the nearest floating-point numbers not equal to $x_0$ would be something like $x_0 \pm 10^{-15} x_0$.
Moreover, as we consider points $x$ lying further and further away from $x_0$,
we find that $f(x)$ remains floating-point indistinguishable from $f(x_0)$ over a
wide interval near $x_0$. Indeed, the condition that $f(x)$ be floating-point distinct
from $f(x_0)$ requires that $(x - x_0)^2$ fit into a floating-point register that is also
storing $f_0 \sim 1$. This means that we need⁸
$$(x - x_0)^2 \gtrsim \epsilon_{\text{machine}} \tag{25}$$
or
$$|x - x_0| \gtrsim \sqrt{\epsilon_{\text{machine}}} \tag{26}$$
This explains why, in general, we can only pin down minima to within the
square root of machine precision, i.e. to roughly 8 decimal digits on a modern
computer.
On the other hand, suppose the function $g(x)$ has a root at $x_0$. In the vicinity
of $x_0$ we have the Taylor expansion
$$g(x) = (x - x_0)\, g'(x_0) + \frac{1}{2}(x - x_0)^2 g''(x_0) + \cdots \tag{27}$$
which differs from (24) by the presence of a linear term. Now there is generally
no problem distinguishing $g(x_0)$ from $g(x_{\text{nearest}})$ or $g$ at other floating-point
numbers lying within a few machine epsilons of $x_0$, and hence in general we will
be able to pin down the value of $x_0$ to close to machine precision. (Note that
this assumes that $g$ has only a single root at $x_0$; if $g$ has a double root there,
i.e. $g'(x_0) = 0$, then this analysis falls apart. Compare this to the observation
we made earlier that the convergence of Newton's method is worse for double
roots than for single roots.)
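Both statements can be checked in a few lines of 64-bit arithmetic. The Python sketch below (not the Julia used elsewhere in these notes) uses the assumed model functions f(x) = 1 + (x - 1)^2 and g(x) = x - 1, i.e. x0 = f0 = 1; the offsets 1e-9 and 1e-7 are illustrative choices on either side of the square root of machine epsilon.

```python
import sys

eps = sys.float_info.epsilon          # machine epsilon, about 2.2e-16
x0 = 1.0

f = lambda x: 1.0 + (x - x0) ** 2     # minimum at x0, with f0 = 1
g = lambda x: x - x0                  # simple root at x0

tiny = x0 + eps                       # nearest floating-point number above x0
small = x0 + 1.0e-9                   # inside sqrt(eps), which is about 1.5e-8
large = x0 + 1.0e-7                   # outside sqrt(eps)

print(f(tiny) == f(x0))    # True: the minimum cannot resolve this offset
print(f(small) == f(x0))   # True: still indistinguishable
print(f(large) == f(x0))   # False: finally distinguishable
print(g(tiny) == g(x0))    # False: the root resolves a single machine epsilon
```

The quadratic term is swallowed by rounding until the offset exceeds roughly the square root of machine epsilon, whereas the linear term of g survives even for the smallest representable offset.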
Figure 6.2 illustrates these points. The upper panel in this figure plots,
for the function $f(x) = f_0 + (x - x_0)^2$ [corresponding to equation (24) with
$x_0 = f_0 = \frac{1}{2} f''(x_0) = 1$], the deviation of $f(x)$ from its value at $x_0$ versus the
deviation of $x$ from $x_0$ as computed in standard 64-bit floating-point arithmetic.
Notice that $f(x)$ remains indistinguishable from $f(x_0)$ until $x$ deviates from $x_0$
by at least $10^{-8}$; thus a computer minimization algorithm cannot hope to pin
down the location of $x_0$ to better than this accuracy.
In contrast, the lower panel of Figure 6.2 plots, for the function $g(x) =
(x - x_0)$ [corresponding to equation (27) with $x_0 = g'(x_0) = 1$], the deviation
of $g(x)$ from $g(x_0)$ versus the deviation of $x$ from $x_0$, again as computed in
standard 64-bit floating-point arithmetic. In this case our computer is easily
able to distinguish points $x$ that deviate from $x_0$ by as little as $2 \times 10^{-16}$. This
8 This is where the assumptions that $f_0 \sim 1$ and $f''(x_0) \sim 1$ come in; the more general
statement would be that we need $(x - x_0)^2 f''(x_0) \gtrsim \epsilon_{\text{machine}} f_0$.
[Figure 6.2: upper panel, deviation of f(x) from f(x0) versus deviation of x from x0; lower panel, deviation of g(x) from g(x0) versus deviation of x from x0.]