
ELEMENTARY

NUMERICAL
METHODS
&
FOURIER
ANALYSIS



F. KOUTNÝ







_______________________________
Zlín, CZE 2006







CONTENTS Page

PREFACE

INTRODUCTION 1

1 INTERPOLATION 3
 1.1 Polynomial Approximation 3
 1.2 Approximation by Splines 9

2 INTEGRATION 17
 2.1 Numerical Quadrature 19
  Simpson's Method 21
  Gauss Method 23
  Chebyshev Formulas 25
  Precision and Richardson Extrapolation 25
  On Multiple Integration 27
 2.2 Monte Carlo Methods 28

3 NONLINEAR EQUATIONS 35
 3.1 Solution of the Equation f(x) = 0 35
  Bisection Method 36
  Multiple Equidistant Partition 38
  Regula falsi 39
  Quadratic Interpolation Method 41
  Inverse Interpolation Method 43
  Newton-Raphson Method 44
  Simple Iteration Method 45
  Localization and Global Determination of Roots 47
 3.2 Systems of Nonlinear Equations 49
  Newton Method 51
  Genetic Algorithm 54

4 ORD. DIFFERENTIAL EQUATIONS 57
 4.1 Runge-Kutta Formulas 60
 4.2 Systems of Ordinary Diff. Equations 65
 4.3 Adams Methods 68
 4.4 Boundary Problems 71

5 LINEAR SPACES & FOURIER SERIES 75
 5.1 Linear Spaces 75
 5.2 Normed Linear Spaces 79
 5.3 Scalar Product. Hilbert Space 82
 5.4 Orthogonality. General Fourier Series 85
 5.5 Trigonometric Fourier Series 93
 5.6 Numerical Calculation of Fourier Coefficients 99
 5.7 Fejér's Summation of Fourier Series 114
 5.8 Fourier Integral 116

REFERENCES 120
INDEX 121









PREFACE

This text completes the Mathematical Base for Applications in the field of numerical calculations, which is an inevitable part of the practical use of mathematical knowledge. The area of numerical mathematics is very rich and varied, and only a sketchy glimpse of its basic fragments can be given here. I tried to emphasize the creative and sportive side of mathematics.
The text as a whole is unbalanced. Some topics are just mentioned in a few words, some are treated concisely, and others are discussed in more detail. For example, the chapter on differential equations offers but a glance at the huge area of various methods.
Simple numerical examples are computed by means of today's standard Microsoft EXCEL to illustrate the explained method. In some more complicated methods the programming language PASCAL was also used.
Methods of Fourier analysis belong to the tools applied very frequently in many engineering areas. They represent a bridge from functional analysis to various numerical methods.
I wish the readers may find something useful, interesting and inspiring here.
Please accept my apology for the heterogeneity of the text and for my imperfections and lapses concerning the language and explanations, as well as for errors of all kinds.

F. Koutný






INTRODUCTION

The aim of the following chapters is to show some applications of mathematical analysis in a numerical manner. At the present time this is necessarily connected with the use of a computer as a self-evident computation tool.
There are different number types according to the number of memory units (bytes) needed for their storage in computer memory. But in any case the number of memory units is finite. This implies that in computers we can work only with rational numbers and rational approximations. The presence of rounding errors can cause a big global error or even a breakdown of the computation when a large number of iterations is carried out.
Example. For n ∈ N let us define

$$I_n = \int_0^1 t^n \cos t \,\mathrm{d}t.$$

Due to $\lim_{n\to\infty} t^n = 0$ on [0, 1) and $|\cos t| \le 1$, the Lebesgue theorem [1,2,8,9] says

$$\lim_{n\to\infty} I_n = \int_0^1 \lim_{n\to\infty} t^n \cos t \,\mathrm{d}t = \int_0^1 0 \cdot \cos t \,\mathrm{d}t = 0.$$
Further on, I_0 ≥ I_1 ≥ I_2 ≥ ... Obviously, $I_0 = \int_0^1 \cos t \,\mathrm{d}t = \sin 1$. Integration by parts yields the following recurrence:

$$I_n = \int_0^1 t^n \cos t \,\mathrm{d}t = \left[t^n \sin t\right]_0^1 - n \int_0^1 t^{n-1} \sin t \,\mathrm{d}t$$
$$= \sin 1 - n \left( \left[-t^{n-1} \cos t\right]_0^1 + (n-1) \int_0^1 t^{n-2} \cos t \,\mathrm{d}t \right) = \sin 1 + n \cos 1 - n(n-1)\, I_{n-2}.$$

Using this relation to compute the sequence {I_n : n = 2, 4, ...}, e.g. by EXCEL, one obtains the values I_iter in the second column of the following table:


 n     I_iter              I_G
 0      0.841470984808     0.841470984808
 2      0.239133626928     0.239133626928
 4      0.133076685140     0.133076685140
 6      0.090984265821     0.090984265821
 8      0.068770545780     0.068770545782
10      0.055144923310     0.055144923195
12      0.045968778333     0.045968793697
14      0.039385610319     0.039382814230
16      0.033761402064     0.034432462358
18      0.235923458870     0.030579002875
20     -7.800340E+01       0.027495989810
22      3.605030E+04       0.024974369569
24     -1.989975E+07       0.022874203587
26      1.293484E+10       0.021098356794
28     -9.778737E+12       0.019577332382
30      8.507502E+15       0.018260115763



INTRODUCTION


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

2
It can be seen that starting with the 18th member the sequence I_iter(n) interrupts its monotone decrease and starts to diverge. Let us find out the reason for this behavior.
The recurrence relation can be written as follows:

$$I_{2k} = 2k \cos 1 + \sin 1 - \frac{(2k)!}{(2k-2)!}\, I_{2k-2}.$$
The first members of the sequence I_{2k} are:

I_0 = sin 1,
I_2 = 2 cos 1 + sin 1 − (2!/0!) I_0 = 2! [cos 1 + (1/2! − 1) sin 1],
I_4 = 4 cos 1 + sin 1 − (4!/2!) · 2! [cos 1 + (1/2! − 1) sin 1] = 4! [(1/3! − 1) cos 1 + (1/4! − 1/2! + 1) sin 1],
I_6 = 6 cos 1 + sin 1 − (6!/4!) · 4! [(1/3! − 1) cos 1 + (1/4! − 1/2! + 1) sin 1] = 6! [(1/5! − 1/3! + 1) cos 1 + (1/6! − 1/4! + 1/2! − 1) sin 1],
I_8 = 8 cos 1 + sin 1 − (8!/6!) · 6! [(1/5! − 1/3! + 1) cos 1 + (1/6! − 1/4! + 1/2! − 1) sin 1] = 8! [(1/7! − 1/5! + 1/3! − 1) cos 1 + (1/8! − 1/6! + 1/4! − 1/2! + 1) sin 1],
...
The coefficients of cos 1 and sin 1 are (for k > 2)

(2k)! [1/(2k−1)! − 1/(2k−3)! + ...] = 2k [1 − (2k−1)(2k−2) + (2k−1)(2k−2)(2k−3)(2k−4) − ...],
(2k)! [1/(2k)! − 1/(2k−2)! + ...] = 1 − (2k)(2k−1) + (2k)(2k−1)(2k−2)(2k−3) − ... .
In the brackets there are divergent alternating series. They cause an early overflow of the range of the chosen numerical type. The numbers generated in the computer then differ from the correct numbers, and this is why the computation of I_n fails. The iteration results can be compared with the results of numerical integration by the Gauss three-node formula, I_G (Chapter 2), in which no problems with numerical stability arise if n does not exceed reasonable limits.
This warning example shows that numerical computing is also a kind of art. Sometimes several different methods need to be tested, evaluated, compared and adapted to find a suitable method for the problem being solved.
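The collapse is easy to reproduce in code. The following is a minimal sketch in PASCAL (the language the Preface mentions as the secondary computing tool of this text); the program name and output formatting are illustrative assumptions, not taken from the original computation.

```pascal
program UnstableRecurrence;
{ Forward recurrence I(n) = sin 1 + n cos 1 - n(n-1) I(n-2) for
  I(n) = integral of t^n cos t dt over [0,1]. Every step multiplies
  the inherited rounding error by -n(n-1), so the error explodes. }
var
  I: real;
  n: integer;
begin
  I := sin(1.0);                                 { I(0) = sin 1 }
  writeln(0:3, I:22:12);
  n := 2;
  while n <= 30 do
  begin
    I := sin(1.0) + n * cos(1.0) - n * (n - 1) * I;
    writeln(n:3, I:22:12);
    n := n + 2
  end
end.
```

For n around 20 the printed values already have the wrong sign and magnitude, exactly as in the I_iter column above.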






1 INTERPOLATION
A relatively frequent task is to reveal the structure, internal links and relations in a data set obtained e.g. by measurement. If some stochastic influences can be expected, statistical methods are to be used. In the following text, however, mostly deterministic quantities will be considered, i.e. to any independent quantity x a numerical object O is assigned uniquely, x → O. The simplest and most frequent case is when both x and O are numbers, i.e. O is a value of a scalar function f of a real variable.

1.1 Polynomial Approximation
For the sake of simplicity a set of pairs {(x_i, y_i): i = 1, ..., n} is considered in which i ≠ j implies x_i ≠ x_j (repeated equal abscissas x_i are eliminated). Now an acceptable function f is to be found that fits them well, i.e. makes the magnitudes of the differences f(x_i) − y_i, i = 1, ..., n, small. In R^n with a metric d this means minimizing d(f, y). This problem can be viewed from different standpoints.
If the sought function f is supposed to be continuous, y_i = f(x_i), the problem can be reduced to a search for a polynomial P approximating f well [1,2].

Weierstrass theorem.
If f is a continuous function on a closed interval [a, b], f ∈ C^0([a, b]), then to any ε > 0 there is a polynomial P_ε such that |f(x) − P_ε(x)| < ε for any x ∈ [a, b].

If ε_k is taken as 1/k or 1/2^k, there exists a sequence of polynomials P_k with |P_k(x) − f(x)| < ε_k for all x ∈ [a, b], i.e. a sequence that converges uniformly to the function f on the interval [a, b].

Remark. The Weierstrass theorem can be formulated also for functions of several variables on m-dimensional intervals and for vector functions. The choice of basis enables another generalization. A polynomial is a linear combination of the elements of the basis B = {1 = x^0, x = x^1, x^2, ..., x^n}; a trigonometric polynomial is an element of the space with the basis {1 = cos 0x, cos x, sin x, ..., cos nx, sin nx} or {e^{ikx}: k = 0, ±1, ..., ±n}; orthogonal polynomials on some intervals also generate linear spaces, etc.

The Weierstrass theorem creates an existential background for seeking a polynomial

P_n(x) = a_n x^n + a_{n−1} x^{n−1} + ... + a_1 x + a_0

that approximates a continuous function f in the sense that on the set {x_i: i = 0, ..., m} the sum of squares of the differences

$$S(a_0, ..., a_n) = \sum_{i=0}^{m} [P_n(x_i) - f(x_i)]^2$$

is minimal. To assure uniqueness it must be n ≤ m. The case n < m defines the problem of linear regression [3]. Interpolation corresponds to the case n = m, S(a_0, ..., a_n) = 0.

1 INTERPOLATION

F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

4
The unknown coefficients a_0, ..., a_n in this case are given by the following system of n+1 equations:

a_n x_i^n + a_{n−1} x_i^{n−1} + ... + a_1 x_i + a_0 = y_i, i = 0, 1, ..., n.

In matrix form,

$$M\mathbf{a} = \begin{pmatrix} 1 & x_0 & \dots & x_0^n \\ 1 & x_1 & \dots & x_1^n \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \dots & x_n^n \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix} = \mathbf{y}. \qquad (S)$$
The determinant of the matrix M is the so-called Vandermonde determinant (A.-T. Vandermonde (1735-1796), a French mathematician, one of the founders of the theory of determinants).
Theorem. If the x_j are pairwise different, x_j ≠ x_k for j ≠ k, then det M ≠ 0.

Proof. det M does not change if a multiple of one column of M is added to another column of M. Multiplying the first column by x_0 and subtracting it from the second column annuls the element in the first row of the second column. Multiplying the first column by x_0^2 and subtracting it from the third column annuls the element in the first row of the third column. This process can be continued up to the last column, from which the x_0^n-multiple of the first column is subtracted. So one gets

$$D = \det M = \det\begin{pmatrix} 1 & 0 & 0 & \dots & 0 \\ 1 & x_1 - x_0 & x_1^2 - x_0^2 & \dots & x_1^n - x_0^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n - x_0 & x_n^2 - x_0^2 & \dots & x_n^n - x_0^n \end{pmatrix} = \det\begin{pmatrix} x_1 - x_0 & x_1^2 - x_0^2 & \dots & x_1^n - x_0^n \\ \vdots & \vdots & & \vdots \\ x_n - x_0 & x_n^2 - x_0^2 & \dots & x_n^n - x_0^n \end{pmatrix},$$

the last step being the expansion along the first row. Because $x_j^k - x_0^k = (x_j - x_0)(x_j^{k-1} + x_j^{k-2}x_0 + \dots + x_0^{k-1})$, the factor x_1 − x_0 appears in all members of the first row of the remaining determinant, x_2 − x_0 in all members of the second row, ..., x_n − x_0 in all members of the nth row. Thus

$$D = (x_1 - x_0)(x_2 - x_0)\cdots(x_n - x_0)\,\det\begin{pmatrix} 1 & x_1 + x_0 & \dots & x_1^{n-1} + x_1^{n-2}x_0 + \dots + x_0^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n + x_0 & \dots & x_n^{n-1} + x_n^{n-2}x_0 + \dots + x_0^{n-1} \end{pmatrix}.$$

By equivalent operations, i.e. operations not changing the value of the determinant (successive subtracting of multiples of the preceding columns, as above), this determinant can be rearranged to the Vandermonde determinant of x_1, ..., x_n:

$$D = (x_1 - x_0)(x_2 - x_0)\cdots(x_n - x_0)\,\det\begin{pmatrix} 1 & x_1 & \dots & x_1^{n-1} \\ 1 & x_2 & \dots & x_2^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \dots & x_n^{n-1} \end{pmatrix}.$$

Repeating the whole procedure with the new determinant (of an n×n matrix), we obtain

$$D = (x_1 - x_0)(x_2 - x_0)\cdots(x_n - x_0) \cdot (x_2 - x_1)(x_3 - x_1)\cdots(x_n - x_1)\,\det\begin{pmatrix} 1 & x_2 & \dots & x_2^{n-2} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \dots & x_n^{n-2} \end{pmatrix},$$

etc. Finally we get

$$D = \prod_{0 \le k < j \le n} (x_j - x_k).$$

Because x_j ≠ x_k for j ≠ k, all the factors x_j − x_k ≠ 0. Thus their product D ≠ 0.

Regularity of M (i.e. D ≠ 0) is the necessary and sufficient condition for the existence and uniqueness of the interpolation polynomial P(x) of the nth degree whose graph goes through the n+1 nodes (x_i, y_i), i = 0, 1, ..., n, i.e. P(x_i) = y_i.
In an interpolation polynomial we are usually interested in its values between the interpolation nodes rather than in its coefficients. Then the interpolation polynomial needs to be written in some shape that contains the values x_i, y_i explicitly.
The Lagrange interpolation polynomial for n+1 different interpolation nodes,

$$L_n(x) = \frac{(x-x_1)(x-x_2)\cdots(x-x_n)}{(x_0-x_1)(x_0-x_2)\cdots(x_0-x_n)}\,y_0 + \dots + \frac{(x-x_0)(x-x_1)\cdots(x-x_{n-1})}{(x_n-x_0)(x_n-x_1)\cdots(x_n-x_{n-1})}\,y_n = \sum_{k=0}^{n} \frac{\prod_{i=0,\,i\neq k}^{n}(x-x_i)}{\prod_{i=0,\,i\neq k}^{n}(x_k-x_i)}\,y_k,$$

is one of the possible forms [4-7]. But this one is more convenient for theoretical considerations than for practical use.
Remark. Nodes are to be chosen with forethought. When x_k = 2/((4k+1)π), k = 0, 1, 2, 3, 4, are chosen for nodes of the function sin(1/x) on the interval [1/30, 21/30], the corresponding Lagrange interpolation polynomial is constant, L_1(x) ≡ 1, because sin(1/x_k) = sin((4k+1)π/2) = 1 at every node. This follows from the system (S): the coefficient a_0 = D/D = 1, while in the numerator determinants for k = 1, ..., 4 two columns consist of mere 1's, thus a_1 = ... = a_4 = 0. Another negative example is represented by the Lagrange interpolation polynomial L_2(x) constructed for the same function on the interval [0.1, 0.6] with the nodes (i/10, sin(10/i)), i = 1, ..., 6.

1 INTERPOLATION

F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

6

Fig. 1.1 Function sin(1/x) and its interpolation polynomials L_1, L_2 for bad choices of nodes x_i.

In most cases of practical interpolation equidistant nodes, x_{i+1} − x_i = h, i = 0, 1, ..., n, are preferable. There are many kinds of interpolation polynomials. For example, Newton's forward interpolation polynomial is

$$N_n(x) = y_0 + \frac{\Delta y_0}{1!}\,q^{[1]} + \frac{\Delta^2 y_0}{2!}\,q^{[2]} + \frac{\Delta^3 y_0}{3!}\,q^{[3]} + \dots + \frac{\Delta^n y_0}{n!}\,q^{[n]}.$$
It is written here in a shape resembling the Taylor polynomial, but the symbols used need some explanation. Δ denotes the difference operator, which in repeated use gives the following relations:

Δy_0 = y_1 − y_0,
Δ²y_0 = Δy_1 − Δy_0 = y_2 − 2y_1 + y_0,
Δ³y_0 = Δ²y_1 − Δ²y_0 = y_3 − 2y_2 + y_1 − (y_2 − 2y_1 + y_0) = y_3 − 3y_2 + 3y_1 − y_0,

$$\Delta^n y_0 = \Delta^{n-1}y_1 - \Delta^{n-1}y_0 = \Delta^{n-2}y_2 - 2\Delta^{n-2}y_1 + \Delta^{n-2}y_0 = \Delta^{n-3}y_3 - 3\Delta^{n-3}y_2 + 3\Delta^{n-3}y_1 - \Delta^{n-3}y_0$$
$$= \dots = \binom{n}{0}y_n - \binom{n}{1}y_{n-1} + \binom{n}{2}y_{n-2} - \dots + \binom{n}{n-1}(-1)^{n-1}y_1 + \binom{n}{n}(-1)^n y_0.$$
A practical example makes computing the differences easily understandable. The following table shows the values of x_i and y_i for i = 0, ..., 6 as well as the differences Δ^k y_i obtained by the procedure described above.

i   x_i   y_i    Δy_i   Δ²y_i   Δ³y_i   Δ⁴y_i
0   -3    -100   65     -38     18      0
1   -2    -35    27     -20     18      0
2   -1    -8     7      -2      18      0
3    0    -1     5       16     18
4    1     4     21      34
5    2     25    55
6    3     80

1 INTERPOLATION

F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

7
The fact that the third differences are constant (and the differences of the 4th and higher orders are therefore zero) says that the corresponding interpolation polynomial is of the 3rd degree (a cubic polynomial).
The interpolation nodes are assumed to be x_i = x_0 + ih, i = 0, 1, 2, ... Let q = (x − x_0)/h, and further we define

q^[1] = q, q^[2] = q(q−1), q^[3] = q(q−1)(q−2), ..., q^[k] = q(q−1)···(q−k+1).

Example. Let seven values of the sine function be given, y_i = sin(i·10°), i = 0, 1, ..., 6, and the value of sin 5° is to be determined.
Obviously x_0 = 0, h = 10°, q = (5 − 0)/10 = 0.5 and

$$\sin 5^\circ \approx N_6(q) = y_0 + \frac{\Delta y_0}{1!}\,q^{[1]} + \frac{\Delta^2 y_0}{2!}\,q^{[2]} + \dots + \frac{\Delta^6 y_0}{6!}\,q^{[6]}.$$

Defining Δ⁰y_i = y_i, q^[0] = 1 enables calculating the differences Δ^k y in tabular form:

i  x_i  sin x_i     Δ¹y_i       Δ²y_i        Δ³y_i        Δ⁴y_i       Δ⁵y_i       Δ⁶y_i
0   0   0.00000000  0.17364818  -0.00527621  -0.00511590  0.00031576  0.00014585  -0.00001403
1  10   0.17364818  0.16837197  -0.01039211  -0.00480014  0.00046161  0.00013182
2  20   0.34202014  0.15797986  -0.01519225  -0.00433853  0.00059343
3  30   0.50000000  0.14278761  -0.01953078  -0.00374510
4  40   0.64278761  0.12325683  -0.02327587
5  50   0.76604444  0.09998096
6  60   0.86602540

Computation of the sums may be arranged as follows:

k        0           1           2            3            4           5          6
Δ^k y_0  0.00000000  0.17364818  -0.00527621  -0.00511590  0.00031576  0.00014585 -0.00001403
k!       1           1           2            6            24          120        720
q^[k]    1.00000000  0.5         -0.25        0.375        -0.9375     3.28125    -14.765625
N_k      0.00000000  0.08682409  0.08748362   0.08716387   0.0871515   0.0871555  0.08715581

In the last row there are the sums $N_n(0.5) = \sum_{k=0}^{n} \frac{\Delta^k y_0}{k!}\,q^{[k]}$. The error of the approximation is

E = sin 5° − N_6(0.5) = 0.08715574 − 0.08715581 = −7·10⁻⁸.
The calculation can be carried out very easily in EXCEL. E.g. for x = 6, 8, ..., 14 only q is changed in the corresponding formula. Results are shown in the table below:

x            6            8            10          12          14
q            0.6          0.8          1.0         1.2         1.4
N_6          0.10452852   0.13917313   0.17364818  0.20791168  0.24192188
sin x − N_6  -5.64·10⁻⁸   -2.55·10⁻⁸   0           1.52·10⁻⁸   1.99·10⁻⁸




1 INTERPOLATION

F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

8
Values of the interpolation polynomial P can also be obtained by repeated (iterated) linear interpolation, which may be summarized in the simple formulas

$$P(x) = \frac{(x_{i+1}-x)\,y_i + (x-x_i)\,y_{i+1}}{x_{i+1}-x_i} = \frac{1}{x_{i+1}-x_i}\begin{vmatrix} y_i & x_i - x \\ y_{i+1} & x_{i+1} - x \end{vmatrix}, \qquad i = 0, \dots, n-1.$$

In the above case the following table is obtained.

i  x_i  sin x_i     L_{i,i+1}   L_{i,...,i+2}  L_{i,...,i+3}  L_{i,...,i+4}  L_{i,...,i+5}  L_{i,...,i+6}
0   0   0.00000000  0.08682409  0.08748362     0.08716387     0.0871515      0.0871555      0.08715581
1  10   0.17364818  0.08946219  0.08556515     0.08706520     0.0871914      0.0871590
2  20   0.34202014  0.10505036  0.07656490     0.08605543     0.0875158
3  30   0.50000000  0.14303098  0.05758383     0.08216103
4  40   0.64278761  0.21138869  0.02809119     (sin 5° − L_{0,...,6} = −7.0408·10⁻⁸)
5  50   0.76604444  0.31613012
6  60   0.86602540

Here the following notation is used. $L_{i,i+1} = \frac{(x_{i+1}-x)\,y_i + (x-x_i)\,y_{i+1}}{x_{i+1}-x_i}$ corresponds to the intervals [x_i, x_{i+1}], i = 0, ..., 6−1, and x = 5°. With those values the second step is performed by linear interpolation, $L_{i,i+1,i+2} = \frac{(x_{i+2}-x)\,L_{i,i+1} + (x-x_i)\,L_{i+1,i+2}}{x_{i+2}-x_i}$, over the intervals [x_i, x_{i+2}], i = 0, ..., 6−2. In the same way the third step is carried out over the intervals [x_i, x_{i+3}], i = 0, ..., 6−3, etc. A graphical scheme of the iterated linear interpolation (the Aitken-Neville algorithm) is shown in Fig. 1.2.

Fig. 1.2 Interpolation as the iteration of linear interpolations (Aitken-Neville algorithm).
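The scheme translates directly into two nested loops over one reused array. The following PASCAL sketch (names and formatting are illustrative) recomputes the first row of the Neville table above.

```pascal
program Neville;
{ Aitken-Neville iterated linear interpolation of sin x at x = 5 degrees
  from the nodes x_i = i*10 degrees of the example above. }
const
  n = 6;
var
  x: array[0..n] of real;
  L: array[0..n] of real;    { current column of the Neville scheme }
  i, step: integer;
  xx: real;
begin
  xx := 5.0;
  for i := 0 to n do
  begin
    x[i] := i * 10.0;
    L[i] := sin(x[i] * pi / 180.0)
  end;
  for step := 1 to n do
    for i := 0 to n - step do
      { L[i] becomes L_{i,...,i+step}(xx) }
      L[i] := ((x[i + step] - xx) * L[i] + (xx - x[i]) * L[i + 1])
              / (x[i + step] - x[i]);
  writeln('L_{0..6}(5) = ', L[0]:12:8)
end.
```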

Remark. Interpolation is meaningful only when values of the function cannot be obtained easily. Its right place was in work with laboriously computed tables of functions. Linear interpolation between two neighboring values was the usual practice, quadratic interpolation was used rarely, and interpolations of higher order were rather of academic value. Today, values of functions given analytically can be computed very effectively with sufficient precision by computers, and interpolation has been displaced to the background of many numerical methods. There are many types of interpolation polynomials, but we do not see any reason to mention them here. A discussion of the preciseness of interpolation can be limited to noticing the similarity between Newton's polynomial and Taylor's polynomial. But evaluation of derivatives of unknown functions appears to be problematic.
So far the interpolation polynomial has been defined only by giving its values at n+1 different nodes. Defining the Taylor polynomial of the nth degree by giving the values of a function and its first n derivatives at a single point appears in some sense opposite to the above interpolation. Between these two extreme cases a lot of combinations may exist, when at some points the values of a function as well as values of its derivatives are given. In such cases the so-called Hermite interpolation can be defined [6].

Another possibility is represented by piecewise connecting polynomials of the same degree, with or without some further conditions on their interconnections. As a rule, continuity and smoothness of the links, respectively, are of special importance.
Example. Let a function f be approximated by a continuous and piecewise linear function S_1 assembled of connecting lines S_{1,i} between two neighboring points (x_i, y_i), (x_{i+1}, y_{i+1}),

S_{1,i}(x) = y_i + a_i (x − x_i), i = 0, 1, ..., n−1.
Here n new constants a_i are introduced, and a_i = (y_{i+1} − y_i)/(x_{i+1} − x_i). The coefficients a_i can be determined from the condition that S_{1,i}(x) = y_i + a_i(x − x_i) shall go through the point (x_{i+1}, y_{i+1}), i.e. S_{1,i}(x_{i+1}) = y_{i+1} = y_i + a_i(x_{i+1} − x_i). This way we obtain the following system of n equations for the n unknowns a_i:

$$A\mathbf{a} = \begin{pmatrix} x_1 - x_0 & 0 & \dots & 0 \\ 0 & x_2 - x_1 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & x_n - x_{n-1} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{pmatrix} = \begin{pmatrix} y_1 - y_0 \\ y_2 - y_1 \\ \vdots \\ y_n - y_{n-1} \end{pmatrix} = \mathbf{b}.$$
The matrix A is a diagonal one, A = diag(x_1 − x_0, ..., x_n − x_{n−1}), and its inverse is also diagonal, A^{−1} = diag(1/(x_1 − x_0), ..., 1/(x_n − x_{n−1})). Therefore

$$\mathbf{a} = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{pmatrix} = A^{-1}\mathbf{b} = \begin{pmatrix} (y_1 - y_0)/(x_1 - x_0) \\ (y_2 - y_1)/(x_2 - x_1) \\ \vdots \\ (y_n - y_{n-1})/(x_n - x_{n-1}) \end{pmatrix}.$$

This trivial example of a continuous broken line may inspire the use of some more versatile functions instead of linear segments. Using polynomials of the second or third degree (quadratic or cubic polynomials) offers the simplest possibility.

1.2 Spline Approximation
Webster's Encyclopedic Unabridged Dictionary (1996) says: spline = 1. a long, narrow, thin strip of wood, metal etc., 2. a long flexible strip of wood or the like used in drawing curves.

Let the points (x_i, y_i), i = 0, 1, ..., n, x_0 < x_1 < ... < x_n, be given. Let us seek a function consisting of n polynomial segments that goes through those points smoothly.

Quadratic spline.
A quadratic spline is a smooth (differentiable) piecewise quadratic function S_2. Hence it consists of n parabolic arcs S_{2,i}, each spanned between two neighboring points (x_i, y_i), (x_{i+1}, y_{i+1}). Those arcs can be written as

S_{2,i}(x) = y_i + a_i(x − x_i) + b_i(x − x_i)², i = 0, 1, ..., n−1.

Now the number of unknown coefficients a_i, b_i has been doubled to 2n. The requirement that S_2 shall go through all the points (x_i, y_i), i.e.

S_{2,i}(x_{i+1}) = y_i + a_i(x_{i+1} − x_i) + b_i(x_{i+1} − x_i)² = y_{i+1}, i = 0, 1, ..., n−1,

provides only n conditions. Thus further n conditions are to be added. Smooth joining of the individual segments S_{2,i}, S_{2,i+1}, i.e. S'_{2,i}(x_{i+1}) = S'_{2,i+1}(x_{i+1}) at the points x_1, ..., x_{n−1}, provides another n−1 equations:

a_i + 2b_i(x_{i+1} − x_i) = a_{i+1} + 2b_{i+1}(x_{i+1} − x_{i+1}) = a_{i+1}, i = 0, 1, ..., n−2.

(As a matter of fact, the segment S_{2,i}(x) is used only on the interval [x_i, x_{i+1}], and derivatives from the left or right should strictly be used; but for the sake of simplicity we use the fact that the polynomials S_{2,i}(x) and their derivatives are defined on the whole R¹.)
One equation is still to be given. Let it be, e.g., S'_2(x_n) = S'_{2,n−1}(x_n) = c_n.
This completes the system of 2n equations for the 2n unknowns a_0, b_0, ..., a_{n−1}, b_{n−1}, A a = c, whose matrix is banded and therefore easily invertible:

$$A = \begin{pmatrix}
1 & x_1-x_0 & 0 & 0 & 0 & \dots & 0 & 0\\
1 & 2(x_1-x_0) & -1 & 0 & 0 & \dots & 0 & 0\\
0 & 0 & 1 & x_2-x_1 & 0 & \dots & 0 & 0\\
0 & 0 & 1 & 2(x_2-x_1) & -1 & \dots & 0 & 0\\
\vdots & & & & & \ddots & & \vdots\\
0 & 0 & 0 & 0 & 0 & \dots & 1 & x_n-x_{n-1}\\
0 & 0 & 0 & 0 & 0 & \dots & 1 & 2(x_n-x_{n-1})
\end{pmatrix}.$$

The right-side vector is

$$\mathbf{c} = \left( \frac{y_1-y_0}{x_1-x_0},\ 0,\ \frac{y_2-y_1}{x_2-x_1},\ 0,\ \frac{y_3-y_2}{x_3-x_2},\ 0,\ \dots,\ \frac{y_n-y_{n-1}}{x_n-x_{n-1}},\ c_n \right)^T.$$

In order to simplify the writing, the following notation is introduced:

h_i = x_{i+1} − x_i, D_i = (y_{i+1} − y_i)/(x_{i+1} − x_i) for i = 0, 1, ..., n−1.
In manual computing the system A a = c may be represented by the augmented matrix

$$(A\,|\,\mathbf{c}) = \left(\begin{array}{cccccc|c}
1 & h_0 & & & & & D_0\\
1 & 2h_0 & -1 & & & & 0\\
 & & 1 & h_1 & & & D_1\\
 & & 1 & 2h_1 & -1 & & 0\\
 & & & & \ddots & & \vdots\\
 & & & & 1 & h_{n-1} & D_{n-1}\\
 & & & & 1 & 2h_{n-1} & c_n
\end{array}\right).$$

1 INTERPOLATION

F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

11
This is transformed by equivalent operations so that the matrix A becomes the 2n×2n unit matrix and the column c turns into the vector of coefficients a = (a_0, b_0, ..., a_{n−1}, b_{n−1})^T.

Example. Let the points (x_0, y_0) = (1, 2), (x_1, y_1) = (4, 1), (x_2, y_2) = (6, 1) be given. Let us find a quadratic spline S_2(x) running through those points while S'_2(x_2) = c. Here h_0 = 3, h_1 = 2, D_0 = −1/3, D_1 = 0, and the corresponding augmented matrix is transformed by equivalent operations:

$$(A\,|\,\mathbf{c}) = \left(\begin{array}{cccc|c}
1 & 3 & 0 & 0 & -1/3\\
1 & 6 & -1 & 0 & 0\\
0 & 0 & 1 & 2 & 0\\
0 & 0 & 1 & 4 & c
\end{array}\right) \rightarrow \dots \rightarrow \left(\begin{array}{cccc|c}
1 & 0 & 0 & 0 & (3c-2)/3\\
0 & 1 & 0 & 0 & (1-3c)/9\\
0 & 0 & 1 & 0 & -c\\
0 & 0 & 0 & 1 & c/2
\end{array}\right),$$

i.e.

$$\begin{pmatrix} a_0 \\ b_0 \\ a_1 \\ b_1 \end{pmatrix} = \begin{pmatrix} (3c-2)/3 \\ (1-3c)/9 \\ -c \\ c/2 \end{pmatrix}.$$

If, for example, c = 1, we obtain

S_{2,0}(x) = 2 + (1/3)(x − 1) − (2/9)(x − 1)²,
S_{2,1}(x) = 1 − (x − 4) + (x − 4)²/2.



Fig. 1.3 Quadratic splines S_2 when the derivative at the point x_2 = 6 is dS_2(x_2)/dx = c = 1, 0.5, 0, −0.5, −1.

Cubic spline.
A cubic spline is a smooth function S_3 composed of arcs of polynomials of the 3rd degree, i.e. arcs of cubic parabolas S_{3,i} between two neighboring points (x_i, y_i), (x_{i+1}, y_{i+1}). Those polynomial segments can again be written as

S_{3,i}(x) = y_i + a_i(x − x_i) + b_i(x − x_i)² + c_i(x − x_i)³, i = 0, 1, ..., n−1.

Now 3n unknown coefficients a_i, b_i, c_i, i = 0, 1, ..., n−1, are to be determined. Thus 3n conditions must be given.
The fact that S_3 contains all the points (x_i, y_i) provides the first n conditions

S_{3,i}(x_{i+1}) = y_i + a_i(x_{i+1} − x_i) + b_i(x_{i+1} − x_i)² + c_i(x_{i+1} − x_i)³ = y_{i+1}, i = 0, 1, ..., n−1.

Smoothness of the joints at the points x_1, ..., x_{n−1} gives the next n−1 equations

S'_{3,i}(x_{i+1}) = a_i + 2b_i(x_{i+1} − x_i) + 3c_i(x_{i+1} − x_i)² = a_{i+1} = S'_{3,i+1}(x_{i+1}).

In polynomials of the 3rd degree, equality of the second derivatives of neighboring segments can also be required, i.e. further n−1 equations are added:

S''_{3,i}(x_{i+1}) = 2b_i + 6c_i(x_{i+1} − x_i) = 2b_{i+1} = S''_{3,i+1}(x_{i+1}).

Now those n + (n−1) + (n−1) = 3n − 2 equations need to be completed with some two reasonable requirements depending on the situation.
For example:
Equality of all the derivatives at the boundary points can be required for k = 0, 1, 2, i.e. S^(k)_{3,0}(x_0) = S^(k)_{3,n−1}(x_n), which gives the periodic spline.
Values of the first derivatives at the first and the last points can be prescribed, i.e. S'_{3,0}(x_0) = a_0 = d_0 and S'_{3,n−1}(x_n) = a_{n−1} + 2b_{n−1}(x_n − x_{n−1}) + 3c_{n−1}(x_n − x_{n−1})² = d_n.
But most preferably, S''_{3,0}(x_0) = S''_{3,n−1}(x_n) = 0 is required, which defines the so-called natural spline (mechanically realized by a thin elastic strip with fixed endpoints going through all the points (x_0, y_0), ..., (x_n, y_n)).
The natural spline minimizes the functional $\int_{x_0}^{x_n} (f''(x))^2 \,\mathrm{d}x$ on the set of all twice differentiable functions on [x_0, x_n] that fulfill the conditions f(x_i) = y_i (Holladay). From differential geometry [9] we know that the curvature of the graph of f is κ = f''(1 + f'²)^{−3/2}, and if f'² ≪ 1, then κ ≈ f''.
It could be said that the complexity of computing the coefficients of a spline increases very steeply with its degree. This was seen in the transition from the linear to the quadratic spline. Now the simplifying symbols introduced for the quadratic spline, i.e.

h_i = x_{i+1} − x_i, D_i = (y_{i+1} − y_i)/h_i, i = 0, ..., n−1,

can prove their usefulness.
The conditions for the second derivatives,

2b_i + 6c_i(x_{i+1} − x_i) = 2b_i + 6c_i h_i = 2b_{i+1},

imply

c_i = (b_{i+1} − b_i)/(3h_i) (c)

for i = 0, ..., n−2. The condition of continuity of the spline yields

y_i + a_i h_i + b_i h_i² + c_i h_i³ = y_{i+1},

thus a_i = [(y_{i+1} − y_i) − b_i h_i² − c_i h_i³]/h_i. Using (c) and simple rearrangements give

a_i = (y_{i+1} − y_i)/h_i − b_i h_i − (b_{i+1} − b_i) h_i/3 = D_i − (b_{i+1} + 2b_i) h_i/3. (a)

So both a_i and c_i are expressed as functions of the b_i for i = 0, ..., n−2.
The identity of the first derivatives at the internal nodes implies

a_i + 2b_i h_i + 3c_i h_i² = a_{i+1}, i.e. 2b_i + 3c_i h_i = (a_{i+1} − a_i)/h_i.

Substitution for a_i by (a) and for c_i by (c) yields

2b_i + (b_{i+1} − b_i) = [D_{i+1} − (b_{i+2} + 2b_{i+1}) h_{i+1}/3 − D_i + (b_{i+1} + 2b_i) h_i/3]/h_i,

i.e.

b_i + b_{i+1} = [D_{i+1} − D_i − (b_{i+2} + 2b_{i+1}) h_{i+1}/3 + (b_{i+1} + 2b_i) h_i/3]/h_i.

Multiplication of the last equation by 3h_i gives

h_i b_i + 2(h_i + h_{i+1}) b_{i+1} + h_{i+1} b_{i+2} = 3(D_{i+1} − D_i).
Thus, with the natural condition b_0 = 0, the matrix of the system for the coefficients b_1, ..., b_{n−1} of the natural spline is

$$B = \begin{pmatrix}
2(h_0+h_1) & h_1 & 0 & \dots & 0 & 0\\
h_1 & 2(h_1+h_2) & h_2 & \dots & 0 & 0\\
\vdots & & \ddots & & & \vdots\\
0 & 0 & 0 & \dots & h_{n-2} & 2(h_{n-2}+h_{n-1})
\end{pmatrix}$$

and the right-hand-side vector is

$$\mathbf{p} = \begin{pmatrix} 3(D_1 - D_0)\\ 3(D_2 - D_1)\\ \vdots\\ 3(D_{n-1} - D_{n-2}) \end{pmatrix}.$$

The tridiagonal matrix B is regular. The coefficients b_1, ..., b_{n−1} are the components of the vector b = B^{−1}p. Now it is sufficient to define b_n = 0. Then the coefficients a_{n−1} and c_{n−1} are also defined by equations (a) and (c).

Example. Let us construct the natural spline going through the nodes of the foregoing example, i.e. (x_0, y_0) = (1, 2), (x_1, y_1) = (4, 1), (x_2, y_2) = (6, 1). In this case the coefficients of the two cubic parts

S_{3,0}(x) = y_0 + a_0(x − x_0) + b_0(x − x_0)² + c_0(x − x_0)³,
S_{3,1}(x) = y_1 + a_1(x − x_1) + b_1(x − x_1)² + c_1(x − x_1)³

can be calculated directly from the defining conditions:

S_{3,0}(x_1) = S_{3,1}(x_1) = y_1, i.e. a_0 + b_0(x_1 − x_0) + c_0(x_1 − x_0)² = (y_1 − y_0)/(x_1 − x_0),
S'_{3,0}(x_1) = S'_{3,1}(x_1), i.e. a_0 + 2b_0(x_1 − x_0) + 3c_0(x_1 − x_0)² − a_1 = 0,
S''_{3,0}(x_1) = S''_{3,1}(x_1), i.e. b_0 + 3c_0(x_1 − x_0) − b_1 = 0,
S_{3,1}(x_2) = y_2, i.e. a_1 + b_1(x_2 − x_1) + c_1(x_2 − x_1)² = (y_2 − y_1)/(x_2 − x_1),
S''_{3,0}(x_0) = 2b_0 = 0,
S''_{3,1}(x_2) = 0, i.e. b_1 + 3c_1(x_2 − x_1) = 0.

Substituting x_1 − x_0 = 3, (y_1 − y_0)/(x_1 − x_0) = −1/3, x_2 − x_1 = 2, (y_2 − y_1)/(x_2 − x_1) = 0 and eliminating b_0 = 0 yields the following augmented matrix for the remaining coefficients, which in the next steps is transformed onto the 5×5 unit matrix and the vector of coefficients (a_0, c_0, a_1, b_1, c_1)^T:

$$\left(\begin{array}{ccccc|c}
1 & 9 & 0 & 0 & 0 & -1/3\\
1 & 27 & -1 & 0 & 0 & 0\\
0 & 9 & 0 & -1 & 0 & 0\\
0 & 0 & 1 & 2 & 4 & 0\\
0 & 0 & 0 & 1 & 6 & 0
\end{array}\right) \rightarrow \dots \rightarrow \left(\begin{array}{ccccc|c}
1 & 0 & 0 & 0 & 0 & -13/30\\
0 & 1 & 0 & 0 & 0 & 1/90\\
0 & 0 & 1 & 0 & 0 & -2/15\\
0 & 0 & 0 & 1 & 0 & 1/10\\
0 & 0 & 0 & 0 & 1 & -1/60
\end{array}\right).$$

Both parts of the sought spline are then

S_{3,0}(x) = 2 − (13/30)(x − 1) + (1/90)(x − 1)³,
S_{3,1}(x) = 1 − (2/15)(x − 4) + (1/10)(x − 4)² − (1/60)(x − 4)³.

Remark. For the sake of simplifying the calculations and using the general relations, let us assemble the table:

i   x_i   y_i   h_i   D_i
0   1     2     3     -1/3
1   4     1     2     0
2   6     1

Because in this case n = 2, the system of equations for the coefficients b_i is reduced to a single equation (the last one in the system B b = p for a general n):

h_0 b_0 + 2(h_0 + h_1) b_1 = 3(D_1 − D_0).

The condition b_0 = 0 yields b_1 = (3/2)(D_1 − D_0)/(h_0 + h_1) = 1/10.
When we put b_2 = 0, the relations mentioned above enable calculating

a_0 = D_0 − (b_1 + 2b_0) h_0/3 = −13/30, a_1 = D_1 − (b_2 + 2b_1) h_1/3 = −2/15,
c_0 = (b_1 − b_0)/(3h_0) = 1/90, c_1 = (b_2 − b_1)/(3h_1) = −1/60.

The resulting natural cubic spline is shown in Fig. 1.4. Its derivative at the endpoint x = 6 is S'_{3,1}(6) = 1/15. This value and the preserved internal nodes determine a quadratic spline composed of the two parabolic segments

S_{2,0}(x) = 2 + (x − 1)[−3/5 + (4/45)(x − 1)], S_{2,1}(x) = 1 + (x − 4)[−1/15 + (1/30)(x − 4)].


Fig. 1.4 Cubic versus quadratic spline for the same slope S'_3(6) = S'_2(6) = 1/15.

Example. Let us compare the Newton polynomial with the natural spline for the function f(x) = (2x/π) sin(πx/2) on the interval [0, 4], with nodes given by the abscissas x = 0, 1, 2, 3, 4.




Fig. 1.5 Approximation of f(x) = (2x/π) sin(πx/2) on [0, 4] by the cubic spline S_3 and the Newton polynomial N.
1 INTERPOLATION

F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

16
Results are shown in Fig. 1.5. Because f is an even function and f''(0) = 2 while S''_3(0) = 0, the spline S_3 cannot be expected to provide a good approximation near x = 0. This is shown also by the error f(x) − S_3(x). Nevertheless, the average error (on [0, 4]) of the spline is smaller by about a third than that of the Newton polynomial, and for the total error

max {|f(x) − S_3(x)|: x ∈ [0, 4]} ≈ (1/2) max {|f(x) − N(x)|: x ∈ [0, 4]}.

The approximation can be enhanced if the derivative at both endpoints is respected. To do so, two interpolation nodes are added from their close neighborhoods, x = −0.01 and x = 4.01. Then x_0 = −0.01, x_1 = 0, x_2 = 1, ..., x_5 = 4, x_6 = 4.01. The reduction of the error to about one third of the original error is shown in Fig. 1.6. Further improvement would be obtained by adding more nodes, of course.


Fig. 1.6 Approximation of the function f(x) = (2x/π) sin(πx/2) by the natural spline with added nodes at x = −0.01 and 4.01.

Final remarks. Splines are a very useful tool, especially in two- and three-dimensional spaces, due to their ability to provide elegant smooth solutions of many problems solved by FEM. For the classical computation tools (pencil and paper), however, their computing is too complex. They represent a typical area for computers due to the huge number of elementary operations, but they are treacherous for the computing abilities of human individuals.




2 INTEGRATION
It happens very often that for a given function no simple analytically expressible antiderivative can be found. This is the case with sin x², exp(−x²), sin x / x, etc. Sometimes, also, not the function f itself but its integral over some area is physically meaningful, while values of f can be obtained only in some difficult way (measurement).
As a rule, numerical integration deals with functions integrable without any doubts over the considered interval, and such functions are approximated locally by some easily integrable functions. Polynomials are the first choice at hand, and their success with continuous functions is guaranteed by the Weierstrass theorem.
The definition of the Riemann integral [9] is based on the substitution of the given function f by a step function. Let

D = {x_0, x_1, ..., x_n: a = x_0 < x_1 < ... < x_{n−1} < x_n = b}

be a partition of an interval [a, b] and t_k ∈ I_k = [x_{k−1}, x_k] for every k = 1, ..., n. The Riemann sum

$$R_n = \sum_{k=1}^{n} f(t_k)(x_k - x_{k-1})$$

represents the area generated by a composition of rectangle areas.
Let us define M_k = sup {f(x): x ∈ I_k}, m_k = inf {f(x): x ∈ I_k}. Obviously,

$$m_k (x_k - x_{k-1}) \le \int_{x_{k-1}}^{x_k} f \le M_k (x_k - x_{k-1}),$$

and

$$\int_{x_{k-1}}^{x_k} f \approx \frac{M_k + m_k}{2}\,(x_k - x_{k-1}) \qquad \text{(i)}$$

is an improved estimate of the integral over the partial interval [x_{k−1}, x_k]. The right-hand side can be interpreted geometrically as the area of the trapezoid of height h_k = x_k − x_{k−1} and with parallel sides of lengths M_k, m_k.
If f is monotone on [x_{k−1}, x_k], then M_k + m_k = f(x_{k−1}) + f(x_k), and the right side of relation (i) may be rewritten as

(f(x_{k−1}) + f(x_k)) h_k/2.
The additivity of the integral immediately gives the so-called trapezoid method for a general (non-equidistant) partition of the interval [a, b]:

$$\int_a^b f = \sum_{k=1}^{n} \int_{x_{k-1}}^{x_k} f \approx \sum_{k=1}^{n} (f(x_{k-1}) + f(x_k))\,h_k/2 = L_n.$$

For an equidistant partition, x_k − x_{k−1} = h = (b − a)/n, k = 1, ..., n, the sum L_n can be written as

$$L_n = h \sum_{k=1}^{n} \frac{f(x_{k-1}) + f(x_k)}{2},$$

i.e. as the sum of the moving averages of the values of the function f multiplied by the step length.
This can be written shortly in the form of a scalar product,

L_n = [(b − a)/(2n)] (A·F), (T)

where A = (1, 2, 2, ..., 2, 1)^T and
F = (f(a), f(a + (b−a)/n), f(a + 2(b−a)/n), ..., f(a + (n−1)(b−a)/n), f(b))^T.
When integrating functions that do not change (oscillate) fast, the trapezoid formula (T) yields quite acceptable results also with a constant step. Otherwise the step length x_k − x_{k−1} needs to be adapted with respect to the rate of change of the integrand.
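A minimal PASCAL sketch of the composite formula (T) with constant step is shown below; it reproduces the L_n column of Example 1 that follows. The program layout is illustrative.

```pascal
program Trapezoid;
{ Composite trapezoid rule (T) with constant step for f(x) = sqrt(x)
  on [0,1]; prints n and L_n for n = 1, 2, 4, ..., 256. }
var
  n, k: integer;
  h, s: real;
begin
  n := 1;
  while n <= 256 do
  begin
    h := 1.0 / n;
    s := (sqrt(0.0) + sqrt(1.0)) / 2;        { endpoint weights 1/2 }
    for k := 1 to n - 1 do
      s := s + sqrt(k * h);                  { interior weights 1 }
    writeln(n:4, (h * s):12:6);
    n := 2 * n
  end
end.
```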
Example 1. Let us calculate $I = \int_0^1 \sqrt{x}\,\mathrm{d}x$. The antiderivative of f(x) = √x is (2/3)x^{3/2}, so I = 2/3 = 0.6666666... The integrand f is increasing, but its derivative from the right at 0 is infinite. For various n ∈ N and the constant step h = 1/n the corresponding L_n can easily be computed in EXCEL.

n     L_n
1     0.5
2     0.603 553
4     0.643 283
8     0.658 130
16    0.663 581
32    0.665 559
64    0.666 271
128   0.666 526
256   0.666 617

The convergence L_n → 2/3 is slow because secants do not provide a proper approximation of √x near the origin. To improve the results it would be convenient to shorten the partition steps when approaching zero. This can be done in the simplest way by a uniform partition on the ordinate axis, choosing y_k = k/n, k = 0, 1, ..., n. Then x_k = k²/n² and

$$L_n = \frac{1}{2} \sum_{k=1}^{n} \left( \frac{k-1}{n} + \frac{k}{n} \right)\left( \frac{k^2}{n^2} - \frac{(k-1)^2}{n^2} \right) = \frac{1}{2n^3} \sum_{k=1}^{n} (2k-1)(2k-1) = \frac{1}{2n^3} \sum_{k=1}^{n} (2k-1)^2.$$

Due to

$$\sum_{k=1}^{n} (2k-1)^2 = \sum_{k=1}^{2n} k^2 - \sum_{k=1}^{n} (2k)^2 = \frac{2n(2n+1)(4n+1)}{6} - 4\,\frac{n(n+1)(2n+1)}{6} = \frac{n(4n^2-1)}{3},$$

one gets

$$L_n = \frac{1}{2n^3}\,\frac{n(4n^2-1)}{3} = \frac{4n^2-1}{6n^2} = \frac{2}{3} - \frac{1}{6n^2} = I - \frac{1}{6n^2}.$$
At the same time this formula gives the total error of this method of numerical integration for this function (approximation of √x by the linear spline with n+1 nodes shown above),

|L_n − I| = 1/(6n²).

It also says that L_n < I for any n ∈ N (Fig. 2.1). If the required precision is ε = 10^{−2m}, then n > 10^m/√6. E.g. ε = 10^{−4} needs n > 100/√6 ≈ 40.8, i.e. n ≥ 41.

Fig. 2.1 Approximation of √x by a linear spline with 6 nodes equidistant on the y axis.

This transparent example shows that the precision of numerical computing does not depend only on the number of interpolation nodes but also on their distribution. A proper choice of interpolation nodes needs to respect and utilize knowledge of the integrand.

2.1 Numerical Quadrature
The principle of numerical quadrature consists in approximating the given function f by some simple function p on the integration interval [a, b] and putting

$$\int_a^b f \approx \int_a^b p.$$

Most often f is given in some way and p is an interpolation function (a polynomial or a spline). The symbol ≈ indicates that f may be given with some uncertainty, e.g. as a data set from a measurement. The approximating function p is then constructed to fit the data in some sense. So far only step functions (with 1 interpolation node in each partial interval) and piecewise linear functions (with 2 interpolation nodes in each partial interval) have been used in computing integrals. It can be expected, of course, that formulas based on more interpolation nodes will be more precise.
To simplify considerations let a general interval [a, b] be reduced onto the interval [−1, 1] by the following homothetic transform (Fig. 2.2):

$$x(t) = \frac{b-a}{2}\,t + \frac{b+a}{2}.$$

Fig. 2.2 Homothetic transform of interval [a, b] onto interval [−1, 1].

Then, obviously,

$$\int_a^b p(x)\,\mathrm{d}x = \int_{-1}^{1} p(x(t))\,\mathrm{d}x(t) = \frac{b-a}{2} \int_{-1}^{1} P(t)\,\mathrm{d}t,$$

where P(t) = p(x(t)).
The trapezoid formula can now be written as

$$\frac{b-a}{2}\,(1-(-1))\,\frac{P(-1)+P(1)}{2} = \frac{b-a}{2}\,(P(-1)+P(1)) = (b-a)\,\frac{P(-1)+P(1)}{2},$$

i.e. the length of the integration interval multiplied by the mean of the function values at the endpoints (nodes) of the interval [−1, 1].
If the number of interpolation nodes is increased to n > 2, the general integration formula can be sought in the form

$$\int_a^b p(x)\,\mathrm{d}x = \frac{b-a}{2} \int_{-1}^{1} P(t)\,\mathrm{d}t = \frac{b-a}{2}\,(A_1 P(t_1) + \dots + A_n P(t_n)), \qquad (G)$$

where the coefficients A_i are also called weights and the t_i are interpolation nodes (i = 1, ..., n). For a linear function P we have n = 2 and A_1 = A_2 = 1, t_1 = −1, t_2 = 1. If n > 2 and the A_k, t_k are chosen from a richer set, one can expect an increase of the precision of the integration formula.

Example 2. Let us choose, e.g., t_{−1} = −7/8, t_0 = 0, t_1 = 7/8. The three corresponding coefficients A_i can be found from the requirement that the formula be precise for polynomials up to the second degree, i.e.

$$\int_{-1}^{1} t^0\,\mathrm{d}t = A_{-1} + A_0 + A_1 = 2,$$
$$\int_{-1}^{1} t^1\,\mathrm{d}t = -\tfrac{7}{8} A_{-1} + 0 \cdot A_0 + \tfrac{7}{8} A_1 = 0,$$
$$\int_{-1}^{1} t^2\,\mathrm{d}t = (\tfrac{7}{8})^2 A_{-1} + 0 \cdot A_0 + (\tfrac{7}{8})^2 A_1 = \tfrac{2}{3}.$$

The second equation yields A_1 = A_{−1}. Putting this into the third equation gives A_1 = 64/147. The first equation then implies A_0 = 2(1 − A_1) = 166/147. Thus the sought formula is

$$\int_a^b p(x)\,\mathrm{d}x = \frac{b-a}{2 \cdot 147}\,[64\,P(-7/8) + 166\,P(0) + 64\,P(7/8)] = \frac{b-a}{147}\left( 32\,p\!\left(\frac{15a+b}{16}\right) + 83\,p\!\left(\frac{a+b}{2}\right) + 32\,p\!\left(\frac{a+15b}{16}\right) \right).$$

But this does not seem very convenient for practical use because it does not fit the principal requirement of mathematical esthetics, maximal simplicity. In spite of this, however, it shows the advantage of symmetry of the nodes t_k with respect to the center of the integration interval.
We have said already that one way to enhance the precision of integration formulas consists in increasing the number n of interpolation nodes. Equidistant nodes and the interpolation polynomial lead to the Newton-Cotes integration formulas [4-8]. However, a more practical way towards higher precision consists in repeated partitioning of the integration area while applying a fixed, simple integration formula (with a small number of nodes and an interpolation polynomial of low order). This was illustrated by Example 1 with the trapezoid formula. Therefore, the further text will show several formulas limited to the numbers n = 2, 3.

Simpson's Formula
Let us put t_{−1} = −1, t_0 = 0, t_1 = 1 and seek the corresponding integration formula

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = A_{-1} P(t_{-1}) + A_0 P(t_0) + A_1 P(t_1).$$

As in Example 2 we obtain the following equations for the coefficients A:

$$\int_{-1}^{1} t^0\,\mathrm{d}t = A_{-1} + A_0 + A_1 = 2,$$
$$\int_{-1}^{1} t^1\,\mathrm{d}t = -1 \cdot A_{-1} + 0 \cdot A_0 + 1 \cdot A_1 = 0,$$
$$\int_{-1}^{1} t^2\,\mathrm{d}t = (-1)^2 A_{-1} + 0 \cdot A_0 + 1^2 A_1 = \tfrac{2}{3}.$$

This system expressed in matrix form is

$$\begin{pmatrix} 1 & 1 & 1\\ -1 & 0 & 1\\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} A_{-1}\\ A_0\\ A_1 \end{pmatrix} = \begin{pmatrix} 2\\ 0\\ 2/3 \end{pmatrix}.$$

The vector A can be found by equivalent operations on the augmented matrix:

$$\left(\begin{array}{ccc|c} 1 & 1 & 1 & 2\\ -1 & 0 & 1 & 0\\ 1 & 0 & 1 & 2/3 \end{array}\right) \rightarrow \left(\begin{array}{ccc|c} 1 & 1 & 1 & 2\\ 0 & 1 & 2 & 2\\ 0 & 0 & 2 & 2/3 \end{array}\right) \rightarrow \left(\begin{array}{ccc|c} 1 & 1 & 1 & 2\\ 0 & 1 & 0 & 4/3\\ 0 & 0 & 1 & 1/3 \end{array}\right) \rightarrow \left(\begin{array}{ccc|c} 1 & 0 & 0 & 1/3\\ 0 & 1 & 0 & 4/3\\ 0 & 0 & 1 & 1/3 \end{array}\right).$$

Hence

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = [P(-1) + 4P(0) + P(1)]/3$$

or

$$\int_a^b p(x)\,\mathrm{d}x = \frac{b-a}{6}\left( p(a) + 4\,p\!\left(\frac{a+b}{2}\right) + p(b) \right).$$

Because $\int_{-1}^{1} t^{2k+1}\,\mathrm{d}t = 0$ for k = 0, 1, 2, ..., the formula above is correct also for cubic polynomials (and the same holds for the formula from Example 2, of course).
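Applied per subinterval, the formula gives a three-line composite rule. The following PASCAL sketch evaluates it on the non-uniform partition x_k = (k/5)² used in Example 3 below; the loop bounds are specific to that example.

```pascal
program SimpsonRule;
{ Simpson's formula applied on each subinterval [x_{k-1}, x_k] of the
  partition x_k = (k/5)^2, for f(x) = sqrt(x) on [0,1]. }
var
  k: integer;
  xa, xb, s: real;
begin
  s := 0.0;
  for k := 1 to 5 do
  begin
    xa := sqr((k - 1) / 5.0);
    xb := sqr(k / 5.0);
    s := s + (xb - xa) / 6.0
           * (sqrt(xa) + 4 * sqrt((xa + xb) / 2) + sqrt(xb))
  end;
  writeln('I = ', s:10:6)      { 0.666379, cf. the table of Example 3 }
end.
```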
Example 3. Let us calculate $I = \int_0^1 \sqrt{x}\,\mathrm{d}x$ for the uniform partition of [0, 1] on the y axis, y_k = k/5, k = 1, ..., 5 (cf. Fig. 2.1).

The calculation performed in EXCEL may be summarized in the following table:

                           SIMPSON          TRAPEZOID
Step   x_i     p(x_i)      A_i   I           A_i   I
 1     0.00    0            1                 1
       0.02    0.141421     4                 2
       0.04    0.2          1    0.005105     1    0.004828
 2     0.04    0.2          1                 1
       0.10    0.316228     4                 2
       0.16    0.4          1    0.037298     1    0.036974
 3     0.16    0.4          1                 1
       0.26    0.509902     4                 2
       0.36    0.6          1    0.101320     1    0.100990
 4     0.36    0.6          1                 1
       0.50    0.707107     4                 2
       0.64    0.8          1    0.197327     1    0.196995
 5     0.64    0.8          1                 1
       0.82    0.905539     4                 2
       1.00    1            1    0.325329     1    0.324997
                                 I = 0.666379      I = 0.664784

As expected, Simpson's formula gives a more precise result than the trapezoid formula while the number of nodes remains the same.

Gauss Quadrature
The nodes t_k and the coefficients A_k can be required to give the quadrature formula maximal preciseness, i.e. to integrate exactly polynomials of as high a degree as possible. With two nodes t_{−1}, t_1 and two coefficients A_{−1}, A_1 one gets the formula

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = A_{-1} P(t_{-1}) + A_1 P(t_1).$$

There are 4 unknowns on the right side, thus 4 equations are needed:

$$\int_{-1}^{1} t^0\,\mathrm{d}t = A_{-1} + A_1 = 2,$$
$$\int_{-1}^{1} t^1\,\mathrm{d}t = A_{-1} t_{-1} + A_1 t_1 = 0,$$
$$\int_{-1}^{1} t^2\,\mathrm{d}t = A_{-1} t_{-1}^2 + A_1 t_1^2 = \tfrac{2}{3},$$
$$\int_{-1}^{1} t^3\,\mathrm{d}t = A_{-1} t_{-1}^3 + A_1 t_1^3 = 0.$$

This system is nonlinear, which is a complication, but symmetry with respect to 0 simplifies the situation. The equality A_{−1} = A_1 and the first equation immediately imply A_{−1} = A_1 = 1; the second and fourth equations are satisfied automatically due to t_{−1} = −t_1, and the third one gives t_1² = 1/3. Then t_1 = −t_{−1} = √3/3 = 0.577350269...
The Gaussian formula with 2 nodes,

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = P(-\sqrt3/3) + P(\sqrt3/3),$$

or, in the original variable x,

$$\int_a^b p(x)\,\mathrm{d}x = \frac{b-a}{2}\left( p\!\left(\frac{(3+\sqrt3)a + (3-\sqrt3)b}{6}\right) + p\!\left(\frac{(3-\sqrt3)a + (3+\sqrt3)b}{6}\right) \right),$$

is as precise as Simpson's formula with 3 nodes.
Let now n = 3 and let us look for the maximally precise formula

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = A_{-1} P(t_{-1}) + A_0 P(t_0) + A_1 P(t_1).$$

The symmetry with respect to 0 gives t_0 = 0, t_{−1} = −t_1, A_{−1} = A_1, which reduces the number of unknowns to three: t_1, A_0, A_1. These can be determined from the equations for the even powers of t (odd powers result in zeros and are useless):

$$\int_{-1}^{1} t^0\,\mathrm{d}t = A_0 + 2A_1 = 2,$$
$$\int_{-1}^{1} t^2\,\mathrm{d}t = 2A_1 t_1^2 = \tfrac{2}{3},$$
$$\int_{-1}^{1} t^4\,\mathrm{d}t = 2A_1 t_1^4 = \tfrac{2}{5}.$$

Dividing the last equation by the second one yields t_1² = 3/5, thus t_1 = √(3/5) = √0.6. Then the second equation yields A_1 = 1/(3t_1²) = 5/9. Substituting this into the first equation gives A_0 = 2 − 10/9 = 8/9. Thus the Gaussian 3-node formula is

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = \frac{1}{9}\,[5P(-\sqrt{3/5}) + 8P(0) + 5P(\sqrt{3/5})]$$

or

$$\int_a^b p(x)\,\mathrm{d}x = \frac{b-a}{18}\left( 5\,p\!\left(\frac{(a+b) - (b-a)\sqrt{0.6}}{2}\right) + 8\,p\!\left(\frac{a+b}{2}\right) + 5\,p\!\left(\frac{(a+b) + (b-a)\sqrt{0.6}}{2}\right) \right).$$
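The corresponding composite rule differs from the Simpson sketch above only in the nodes and weights. The following PASCAL program applies the 3-node formula on the same partition x_k = (k/5)² and reproduces the total of Example 4 that follows.

```pascal
program Gauss3;
{ Gaussian 3-node formula on each subinterval of the partition
  x_k = (k/5)^2, for f(x) = sqrt(x) on [0,1]. }
var
  k: integer;
  xa, xb, mid, off, s: real;
begin
  s := 0.0;
  for k := 1 to 5 do
  begin
    xa := sqr((k - 1) / 5.0);
    xb := sqr(k / 5.0);
    mid := (xa + xb) / 2;
    off := (xb - xa) / 2 * sqrt(0.6);      { node offset t = sqrt(3/5) }
    s := s + (xb - xa) / 18.0
           * (5 * sqrt(mid - off) + 8 * sqrt(mid) + 5 * sqrt(mid + off))
  end;
  writeln('I = ', s:10:6)                  { 0.666688 }
end.
```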

Example 4. We stay with the integral $I = \int_0^1 \sqrt{x}\,\mathrm{d}x$ and the same partition of the integration area as for Simpson's rule in Example 3. Results obtained in EXCEL with the Gaussian 3-point formula are shown in the table below:

Step   Interval        Nodes x_i   p(x_i)     A_i   I
 1     [0.00, 0.04]    0.004508    0.067142    5
                       0.02        0.141421    8
                       0.035492    0.188393    5    0.005353
 2     [0.04, 0.16]    0.053524    0.231353    5
                       0.10        0.316228    8
                       0.146476    0.382722    5    0.037335
 3     [0.16, 0.36]    0.18254     0.427247    5
                       0.26        0.509902    8
                       0.33746     0.580913    5    0.101334
 4     [0.36, 0.64]    0.391556    0.625745    5
                       0.50        0.707107    8
                       0.608444    0.780028    5    0.197333
 5     [0.64, 1.00]    0.680573    0.824968    5
                       0.82        0.905539    8
                       0.959427    0.979504    5    0.325333
                                                    I = 0.666688

Remark. Deriving the Gauss formulas shown here is very simple. But for formulas with more nodes we would have to use the traditional way through orthogonal polynomials. This interpretation can be found in many books on numerical mathematics, e.g. [4-8]. Legendre polynomials create an orthogonal basis in the set of all polynomials on the interval [−1, 1]. If their roots are taken as the interpolation nodes t_k, the above nonlinear system of equations is, due to orthogonality, reduced to a linear system for the A_k. This is the principal idea of Gauss quadrature.
Remark. Because the Gaussian formulas do not contain the endpoints of the integration intervals, they can be used also for estimating convergent improper integrals of functions whose absolute value tends to infinity. The Laguerre and Hermite orthogonal polynomials allow extending Gaussian-type quadrature also to infinite intervals [4-8].

Chebyshev Formulas
Generally, these formulas are of the type

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = A\,(P(t_1) + P(t_2) + \dots + P(t_n)),$$

with only one coefficient A (all the A_k in the general formula (G) are the same). Thus, e.g., for three nodes

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = A\,(P(t_{-1}) + P(t_0) + P(t_1)).$$

The equation

$$\int_{-1}^{1} t^0\,\mathrm{d}t = 3A = 2$$

implies A = 2/3. If the nodes are symmetric with respect to t_0 = 0, then t_1 = −t_{−1} is obtained from

$$\int_{-1}^{1} t^2\,\mathrm{d}t = 2A\,t_1^2 = \tfrac{4}{3}\,t_1^2 = \tfrac{2}{3}.$$

Hence t_1 = 1/√2 = 0.707106781..., and

$$\int_{-1}^{1} P(t)\,\mathrm{d}t = \tfrac{2}{3}\,[P(-1/\sqrt2) + P(0) + P(1/\sqrt2)]$$

or

$$\int_a^b p(x)\,\mathrm{d}x = \frac{b-a}{3}\left( p\!\left(\frac{a+b}{2} - \frac{b-a}{2\sqrt2}\right) + p\!\left(\frac{a+b}{2}\right) + p\!\left(\frac{a+b}{2} + \frac{b-a}{2\sqrt2}\right) \right).$$

This formula is of the same preciseness order as Simpson's.

Precision and Richardson's Extrapolation
For every integration formula, examples of its failure can be constructed. This is a consequence of the fact that all the information concerning the integrand is limited to a small number of interpolation nodes, while the success of the integration formula generally depends on the global quality of the approximation. For smooth functions with small oscillations the error is substantially given by the remainder of the Taylor polynomial around the middle of the integration interval, i.e. by the degree of the lowest power of the integration variable at which the integration formula ceases to be precise. An overview is shown in the following table.

Rule                Trapezoid   Simpson, Gauss 2, Chebyshev 3   Gauss 3
Degree of error p   2           4                               6

As said above, numerical computing of integrals is carried out preferably by a simple quadrature formula (the Simpson or Gauss 3-node formula) using a convenient (non-uniform) partition of the area of integration. Refining the partition is usually performed by bisection of the foregoing partial subintervals. Suppose a partition D_m gives m partial intervals. A quadrature rule used on them yields the sum S_m of partial results as a numerical approximation of the integral I,

I = S_m + R_p h^p.

On the right side p > 1 is the degree of preciseness of the quadrature formula, R_p is the corresponding remainder and h = max {x_i − x_{i−1}: i = 1, ..., m} is the norm of the partition.
By bisection of the m partial intervals, 2m new intervals are obtained. Summing up the results of the same quadrature formula on those new intervals gives a more precise estimate S_2m,

I = S_2m + R_p (h/2)^p.

Multiplying this equation by 2^p yields

2^p I = 2^p S_2m + R_p h^p.

Subtracting I = S_m + R_p h^p from this equation gives (2^p − 1) I = 2^p S_2m − S_m and the following enhanced estimate of the integral I:

$$I \approx \frac{2^p S_{2m} - S_m}{2^p - 1} = S_{2m} + \frac{S_{2m} - S_m}{2^p - 1}.$$

This is the principle of Richardson's extrapolation.
Obviously, the higher the precision p of the quadrature formula (degree of error), the smaller the correction (S_2m − S_m)/(2^p − 1). The following table illustrates this well by bisection of the intervals from the tables in Examples 3 and 4 and the corresponding quadrature rules.

Rule        m = 5     2m = 10   S_2m − S_m   R. extrapolation
Trapezoid   0.664784  0.666131   0.001347    0.666580
Simpson     0.666379  0.666580   0.000201    0.666594
Gauss 3     0.666688  0.666674  -0.000014    0.666674

The next table shows the effect of Richardson's extrapolation on the trapezoid rule with constant step h = 1/n in the computation of $\int_0^1 \sqrt{x}\,\mathrm{d}x$:

n     L_n        R. extrapolation
1     0.5
2     0.603 553  0.638 071
4     0.643 283  0.656 526
8     0.658 130  0.663 079
16    0.663 581  0.665 398
32    0.665 559  0.666 218
64    0.666 271  0.666 508
128   0.666 526  0.666 611
256   0.666 617  0.666 647
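In code, the extrapolation is one extra line after two runs of the basic rule. The following PASCAL sketch (trapezoid rule, p = 2) reproduces the right-hand column of the last table; the function name Trap is an illustrative choice.

```pascal
program Richardson;
{ Richardson extrapolation for the trapezoid rule (p = 2) applied to
  I = integral of sqrt(x) over [0,1]. }
var
  n: integer;
  Sm, S2m: real;

function Trap(n: integer): real;
var
  k: integer;
  s: real;
begin
  s := 0.5 * (sqrt(0.0) + sqrt(1.0));
  for k := 1 to n - 1 do
    s := s + sqrt(k / n);          { '/' is real division in Pascal }
  Trap := s / n
end;

begin
  n := 1;
  while n <= 128 do
  begin
    Sm := Trap(n);
    S2m := Trap(2 * n);
    { enhanced estimate S_2m + (S_2m - S_m)/(2^p - 1), p = 2 }
    writeln(2 * n:4, S2m:12:6, (S2m + (S2m - Sm) / 3):12:6);
    n := 2 * n
  end
end.
```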

On Multiple Integration
Numerical calculation of integrals is limited neither to one-dimensional intervals nor to finite integration areas. The Fubini theorem [9] is very important in multidimensional areas because it may transform a multiple integration into several successive integrations in one dimension.

Example. Let us estimate the volume of the ball B = {(x, y, z): x² + y² + z² < 1} by the 3-node Gauss rule. In spherical coordinates B = {(r, θ, φ): 0 < r < 1, 0 < θ < π, 0 < φ < 2π}. The volume is the Lebesgue measure of the set B ([9]),

$$V = \mu(B) = 2 \int_{(0,1)\times(0,\pi/2)\times(0,2\pi)} r^2 \sin\theta \;\mathrm{d}r\,\mathrm{d}\theta\,\mathrm{d}\varphi.$$

The Fubini theorem gives

$$V = 4\pi \int_{(0,1)\times(0,\pi/2)} r^2 \sin\theta \;\mathrm{d}r\,\mathrm{d}\theta.$$

The Gaussian rule for a two-dimensional interval X×Y reads

$$\int_{X\times Y} f(x, y)\,\mathrm{d}x\,\mathrm{d}y \approx \mu(X)\,\mu(Y) \sum_{j,k=-1}^{1} A_j A_k\, f(x_j, y_k),$$

where A_{−1} = A_1 = 5/18, A_0 = 8/18, x_0 is the midpoint of X, x_{±1} = x_0 ± √0.6 μ(X)/2, etc. (Fig. 2.3).

Fig. 2.3 Nodes for Gaussian cubature of the function f(x) g(y) in an interval X×Y.

The interpolation nodes for the integral over (0, 1)×(0, π/2) are elements of the Cartesian product

{r_{−1}, r_0, r_1} × {θ_{−1}, θ_0, θ_1} = {0.5(1 − √0.6), 0.5, 0.5(1 + √0.6)} × {(π/4)(1 − √0.6), π/4, (π/4)(1 + √0.6)}.

Then

$$V \approx 4\pi\,\frac{1 \cdot (\pi/2)}{18 \cdot 18}\left( 25\,(0.5(1-\sqrt{0.6}))^2 \sin\!\left(\tfrac{\pi}{4}(1-\sqrt{0.6})\right) + 40\,(0.5)^2 \sin\tfrac{\pi}{4} + \dots \right) = 4\pi\,\frac{\pi/2}{18 \cdot 18}\,68.755494 = 4\pi \cdot 0.333336 \approx \frac{4\pi}{3}.$$
But the Fubini theorem also gives

$$V = 4\pi \int_0^{\pi/2} \sin\theta\,\mathrm{d}\theta \int_0^1 r^2\,\mathrm{d}r.$$

The first integral is estimated by the 3-node Gauss formula as follows:

$$\int_0^{\pi/2} \sin\theta\,\mathrm{d}\theta \approx \frac{\pi}{36}\left( 5\sin\!\left(\tfrac{(1-\sqrt{0.6})\pi}{4}\right) + 8\sin\tfrac{\pi}{4} + 5\sin\!\left(\tfrac{(1+\sqrt{0.6})\pi}{4}\right) \right) = \frac{\pi}{36}\,(5 \cdot 0.176108 + 8 \cdot \tfrac{\sqrt2}{2} + 5 \cdot 0.984371) = 1.0000081.$$

The Gauss formula gives for the second integral the estimate

$$\int_0^1 r^2\,\mathrm{d}r \approx \frac{1}{18 \cdot 4}\left( 5(1-\sqrt{0.6})^2 + 8 \cdot 1^2 + 5(1+\sqrt{0.6})^2 \right) = \frac{5 \cdot 3.2 + 8}{18 \cdot 4} = \frac{1}{3}.$$

Thus

V ≈ 4π · 1 · (1/3) = 4π/3.
Remark. The numerical integration was used here only on a two-dimensional interval. The success of the method was conditioned by its integrand being a product of functions each dependent on just one variable. In n dimensions the same Gauss formula would need 3^n interpolation nodes and 3^n coefficients, which would make it very cumbersome. Moreover, complicated cases must be solved individually [6].
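For completeness, the 3×3 product rule used above can be written out in a few lines. In the PASCAL sketch below the weights 5, 8, 5 and the nodes shifted to the rectangle (0, 1)×(0, π/2) are taken from the derivation above, while the program structure itself is only illustrative.

```pascal
program BallVolume;
{ 3x3 Gauss product rule for V = 4*pi * integral of r^2 sin(theta)
  over (0,1) x (0,pi/2); the exact value is 4*pi/3 = 4.18879... }
var
  t, w: array[0..2] of real;    { nodes on [-1,1] and weights (times 18) }
  j, k: integer;
  r, th, s, V: real;
begin
  t[0] := -sqrt(0.6); t[1] := 0.0; t[2] := sqrt(0.6);
  w[0] := 5.0; w[1] := 8.0; w[2] := 5.0;
  s := 0.0;
  for j := 0 to 2 do
    for k := 0 to 2 do
    begin
      r  := 0.5 * (1.0 + t[j]);            { node in (0,1) }
      th := (pi / 4) * (1.0 + t[k]);       { node in (0,pi/2) }
      s  := s + w[j] * w[k] * sqr(r) * sin(th)
    end;
  V := 4 * pi * (pi / 2) / (18.0 * 18.0) * s;
  writeln('V      = ', V:12:6);
  writeln('4*pi/3 = ', 4 * pi / 3:12:6)
end.
```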


2.2 Monte Carlo Methods
The Monte Carlo methods are based on the law of large numbers of probability theory, which says approximately this: as the number of experiments increases, the relative frequency of an event approaches its theoretical probability [3].
Let f: D → R¹ be a function, u = sup {f(x): x ∈ D} and l = inf {f(x): x ∈ D}. Of course, f = l + (f − l), where g = f − l is a non-negative function.
When estimating an integral, the two procedures shown in Fig. 2.4 may be applied.
1. Estimate by global probability. The set G = {(x, y): x ∈ D, 0 < y < g(x)} (i.e. the area limited from the upper side by the graph of g and by 0 from below) is merged into a simple set M whose measure μ(M) = ∫_M 1 is known (or can be determined simply) and which can be covered uniformly by randomly generated points. Now a proper criterion is needed to identify the inclusion of a point in G and determine the ratio μ(G)/μ(M). Suppose n random points are uniformly distributed over the set M and m of them fall into G. Then for n → ∞ the probability that m/n ≈ μ(G)/μ(M) tends to 1, by the law of large numbers. Thus, if n is large, one can put

$$\mu(G) \approx \frac{m}{n}\,\mu(M) \qquad \text{and} \qquad \int_D f \approx \left( \frac{m}{n} + \frac{l}{u-l} \right) \mu(M).$$
2. Estimate by the integration area average. In the integration area D random
points x
i
, i = 1, ..., n are generated and the sum S
n
=

=
n
i
i
x g
1
) ( is calculated.
Obviously,

D
g
) ( l u n
S
n

(u l) = S
n
/n and

D
f S
n
/n +

D
l .



Fig. 2.4 Integration of a real function by Monte Carlo Method.

2 INTEGRATION


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

30
The following examples should elucidate the explanation.
Example 1. Let us compute I =

0
sin x dx = cos x

0
= 2 by Monte Carlo method.
Obviously, l = 0, u = 1 and the considered arc of sinusoid defines the set
G = {(x, y): x[0, ] y[0, sin x]}. This is inlaid into the rectangle M = [0, ][0, 1],
(M) = . Now both the possibilities above can be used.
1. n random points (x, y) are generated, where x is a random number from the interval
[0, ) and y is a random number from [0, 1). The points (x, y) belong to the uniform
distribution on the set M = [0, )[0, 1). If y sin x, the point (x, y) belongs to the
measured set G = {(x, y): 0x<, 0ysin x} and the counter m is increased by 1,
otherwise m remains unchanged. The table below shows the slow convergence
m/n2 in the horizontal direction and the randomness of estimates (m/n) in the
vertical direction.
These results can be processed statistically. In all cases the averages of 5
calculations for given numbers n are included in the 95percent confidence intervals
(a s t
0.05
(4)/5, a + s t
0.05
(4)/5), where a is the average, s is the standard
deviation and t the critical value of the t-distribution .

n
calculation 100 1000 10 000 100 000 1 000 000
1 1.91637 2.01690 1.99365 1.99950 2.00109
2 1.69646 2.01690 2.00811 1.99501 1.99974
3 1.91637 1.85668 1.98266 1.99786 2.00013
4 2.01062 2.01062 2.00999 2.00396 2.00282
5 1.82212 1.98235 1.99491 1.99702 1.99682
Average p 1.87239 1.97669 1.99786 1.99867 2.00012
Standard deviation s 0.10626 0.06135 0.01010 0.00302 0.00200
Confidence limit s t
0.05
(4)/5 0.132192 0.07616 0.01254 0.00375 0.00248

2. The Fubini theorem yields
(G) =

G
dx dy =

] , 0 [
(

] sin , 0 [ x
dy) dx =

] , 0 [
sin x dx.
This means that the average of sin x over [0, ] shall be calculated. This can be done
by n times repeated generating the point x as a random number from the interval
[0, ) and setting up the sums S
G
=
x
sin x and S
M
=
x
= n. The ratio S
G
/ S
M

converges to the ratio of measures (G)/(M). Therefore,
(G) (S
G
/S
M
) (M) = (S
G
/n) = S
G
/n.
The procedure (ii) is illustrated in the following table.


2 INTEGRATION


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

31

n
Calculation 100 1000 10 000 100 000 1 000 000
1 1.84798 1.95824 1.99301 1.99863 1.99815
2 2.26357 2.01396 1.98859 2.00301 2.00010
3 1.94581 1.99292 1.98872 2.00591 2.00068
4 1.95159 1.93058 1.99574 2.00086 2.00095
5 2.00085 1.95867 1.99054 1.99968 1.99962
Average p 2.001960 1.970874 1.991320 2.001618 1.999900
Standard deviation s 0.156398 0.032690 0.003050 0.002899 0.001106
Confidence limit s t
0.05
(4)/5 0.194162 0.040584 0.003787 0.003599 0.001373

Remark. Also the interval [0, 1] on the y-axis can be taken for the integration area, evidently:
(A) = (
arcsin
arcsin
y
y

0
1
dx) dy =
0
1

( 2 arcsin y) dy = 2
0
1

arcsin y dy.
Then random numbers y[0, 1) would be generated and the values x can be obtained by numerical
solution of the equation f(x) = sin x y = 0 in the interval 0x</2. Now the corresponding sums
S
G
=
y
( 2arcsin y ) and S
M
=
M
= n or S
G
=
y
arcsin y and S
M
=
M
/2 = n/2 are to be
calculated.

Example 2. Let us estimate the area of an m-foil (Fig. 2.5, a generalization for
quatrefoil) which is given as follows:
x(t) = acos (mt/2) cos t , y(t) = acos (mt/2) sin t, t[0, 2).
Let r(t) be the distance of a point (x(t), y(t)) of the curve from the origin (0, 0).
Obviously
r
2
(t) = x
2
(t) + y
2
(t) = a
2
cos
2
(mt/2).
The area of the m-foil can be computed easily in polar coordinates (t, r), r 0. The
corresponding Jacobian is
J =
) , (
) , (
t r
y x

= r.
The substitution theorem gives the area of the m-foil
I =
[ ] { }

U ) 2 , 0 [ : ) ( , 0 t t r
r dr dt .
The integration area is bounded. Hence, the integral exists and the Fubini theorem
yields
I =

)) ( , 0 [ ) 2 , 0 [
(
t r
r dr ) dt =

) 2 , 0 [
r
2
(t)/2 dt =

) 2 , 0 [
a
2
/2 cos
2
(mt/2) dt.
Further on,
I = a
2
/2

) 2 , 0 [
cos
2
(mt/2) dt = a
2
/2

) 2 , 0 [
2
cos 1 mt +
dt = a
2
/2,
i.e. the area is a half of the whole circle for any number m of foils.

2 INTEGRATION


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

32
The calculation of I was transferred to integration over a one-dimensional interval.
In the integration by Monte Carlo method we can proceed in the same way as in
Example 1.
But the Monte Carlo method can also be used in two dimensions, when calculating
[ ] { }


U
) 2 , 0 [ : ) ( , 0 t t r
r dr dt.
The surface element in the Cartesian coordinates x, y is simply dxdy. However, in polar
coordinates dr d must be multiplied by the length r, (and generally by the absolute
value of Jacobian), to obtain the corresponding surface element. In this way a weight
is assigned to the random points in the rectangle of polar coordinates that arranges
uniform covering of the circle x
2
+ y
2
a
2
in Cartesian coordinates.



Fig. 2.5 Seven-foil.

Both the Monte Carlo calculations of the integral I are shown in the following table.


I =

) 2 , 0 [
r
2
(t)/2 dt I =
[ ] { }

U ) 2 , 0 [ : ) ( , 0 t t r
r dr dt
n = 100 n = 10 000 n = 1 000 000 n = 100 n = 10 000 n = 1 000 000
Calculation 1 1.45509 1.59752 1.57007 1.34212 1.52297 1.56992
2 1.76443 1.59352 1.57015 1.77171 1.58822 1.56974
3 1.58099 1.58101 1.57024 1.62027 1.59032 1.57273
4 2.05876 1.55375 1.56987 1.69898 1.55693 1.56971
5 1.35604 1.58872 1.57214 1.54655 1.57229 1.57151
Average 1.643062 1.582904 1.570494 1.595926 1.566146 1.570722


2 INTEGRATION


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

33
It is advantageous in computing integrals by Monte Carlo method that greater
dimension is manifested only by more complicated conditions when the frequency sums
are created. This is shown by the following example.
Example 3. Let us estimate the volume V of the Viviani window [5,6,9], i.e. the set of
points (x, y, z) that fulfill the following two conditions:
x
2
+ y
2
+ z
2
a
2
, (xa/2)
2
+ y
2
(a/2)
2
.
The volume of the cube circumscribing the ball x
2
+ y
2
+ z
2
a
2
is (2a)
3
= 8a
3
. From
[9], Chapter 3, we know V = ) (
3
4
3
2
a
3
. Hence, Q = V/(8a
3
) =

12

1
9
0.15069.
Now the ratio Q will be determined by random generating the points (x, y, z) in the
cube [0, 1] [0, 1] [0, 1] representing the part of the cube in the octant x, y, z > 0. First
the counter m (incidence number) is annulled. Then a random point (x, y, z) is generated
n times whose coordinates x, y, z are random numbers from [0, 1). If x
2
+ y
2
+ z
2
1
and x
2
+ (y 1/2)
2
1/4, m is increased by 1. The Viviani window lies in 4 octants (of
the half-space x 0) that take the quarter of its volume each. Therefore, for great n we
obviously get Q m/(2n).
Results of calculating Q with n = 1 000 000 are shown in the table below:

Calculation 1 2 3 4 5
Q 0.15081 0.15062 0.15115 0.15043 0.15073

The average Q = 0.15074 differs from the exact value by 510
5
.

Example 4. Let us compute the following non-elementary integral
I =

] 1 , 0 [ ] 1 , 0 [
cos (x
2
y) dx dy.
1. Using the Gauss 3-node quadrature in 2 dimensions leads to the following table for
values of cos (x
2
y) at the nodes (x
j
, y
k
){x
1
, x
0
, x
1
}{y
1
, y
0
, y
1
}:


y
1
y
0
y
1


0.112702 0.500000 0.887298
x
1

0.112702 0.999999 0.999980 0.999936
x
0

0.500000 0.999603 0.992198 0.975498
x
1

0.887298 0.996066 0.923516 0.765764

The corresponding coefficients A
jk
= A
j
A
k
can be arranged in the matrix
A =
2
18
1 1
|
|

\
|
25 40 25
40 64 40
25 40 25
.
The Gauss quadrature yields
G =
2
18
1 1

=
1
1 j

=
1
1 k
A
jk
f(x
j
, y
k
) = 0.967 557

2 INTEGRATION


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

34
2. The Fubini theorem gives
I =

] 1 , 0 [ ] 1 , 0 [
cos (x
2
y) dx dy =

|
|

\
|
1
0
1
0
2
) ( cos y y x d dx =
1
0
2
2
1
0
sin
x
y x

dx =

1
0
2
2
sin
x
x
dx.
The integrand may be expanded into the following uniformly convergent series
2
2
sin
x
x
=
2
5 2 3 2
2
...
! 5
) (
! 3
) (
x
x x
x +
= 1
! 3
4
x
+
! 5
8
x

(x = 1 gives its convergent majorizing series 1 +
! 3
1
+
! 5
1
+ = sinh 1 = 1.1752
in the integration area) that can be integrated term by term
2
2
1
0
sin
x
x

dx = x
! 3 5
5

x
+
! 5 9
9

x

1
0
= 1
! 3 5
1

+
! 5 9
1

=0 k
)! 1 2 ( ) 1 4 (
) 1 (
+ +

k k
k

The last sum converges very fast and may be easily obtained in EXCEL

k
)! 1 2 ( ) 1 4 (
) 1 (
+ +

k k
k

0
1
1
0.033333333
2
0.000925926
3
1.52625E05
4
1.62102E07

The second column of the table above gives the sum I = 0.967 577.

3. The two Monte-Carlo methods mentioned above give results presented in the
following table:

Computation MC1 MC2
1 0.967437 0.967512
2 0.967869 0.967614
3 0.967465 0.967488
4 0.967463 0.967695
5 0.967539 0.967473
Average 0.967555 0.967556
St. Deviation 0.000161 0.000085





3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

35
3 NONLINEAR EQUATIONS

Solution of the equation P(x) = 0, where P is a polynomial with coefficients from a
number field is one of the basic topics in algebra.
Every linear binomial az + b, a, b C
1
, a0, has always the root z = b/a (C
1
is the
set of complex numbers).
Every quadratic trinomial az
2
+ bz + c ( a, b, c C
1
, a0), has always two roots
z
1,2
=
a
ac b b
2
4
2


(if b
2
4ac = 0, then z
1
= z
2
= b/(2a) is its double root).
In polynomials of the 3
rd
and 4
th
degree it is still possible to express the roots
algebraically (in cubic equations by Cardans formulas whose practical usage is
limited). It is well known (and proved by E. Galois (1811-1832) by means of the theory
of groups founded by him) that the equations of the 5
th
and higher degrees are generally
unsolvable in algebraic form. The fundamental theorem of algebra says that every
polynomial over the field of complex numbers has at least one complex root z
1
. Division
of the original polynomial by (zz
1
) yields a polynomial of the n1 degree, which again
has at least a root z
2
, etc. A simple consequence of it is that the polynomial of the nth
degree has generally n complex roots. C. F. Gauss (1777-1855) proved the fundamental
theorem of algebra (incompletely in todays point of view) in 1799 and later he gave 3
other proofs. A very short and elegant proof can be obtained by the Liouville theorem
known from the theory of functions of a complex variable [9].
There are several special types of ever solvable algebraic equations, e.g. reciprocal
ones. Also very important is the fact that a real polynomial of an odd degree has at least
one real root. (This follows from the continuity of the polynomial, the dominant role of
the highest power of the variable x and from x
2k+1
as x). There are many
theorems concerning localization or estimating roots of polynomials. But here we
follow only general aims connected with real roots.
Solution of systems of linear algebraic equations and methods of linear algebra
represent another large area of numerical methods. But today, for example, algebraic
operations with vectors and matrices, matrix inversion, calculation of determinants etc.
belong to the standard wide spread software (e.g. in EXCEL), therefore these topics can
be omitted here.

3.1. Solution of the Equation f(x) = 0
R
1
is considered automatically a metric space with Euclidean metric d(x, y) = x y.
A continuous function maps a connected compact [a, b] on a connected compact [c, d].

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

36
If f(a).f(b)<0 (e.g. f(a) < 0 < f(b) ), there exists such an r[a, b] that f(r) = 0
(B. Bolzano).
The problem is now how that r can be found constructively. This may be done in
many ways.

Bisection Method
Let f be continuous on [a, b], f(a) f(b)<0. A zero r of the function f is to be found with
an error less than a small >0. A computation algorithm may be as follows:
(0) Put s
a
= sign f(a), x
a
= a, x
b
= b. Introduce a cycle counter i and set i = 0.
(i) Increase i by 1. Put r = (x
a
+ x
b
)/2 and calculate y
r
= f(r). If y
r
= 0, the zero is found
and the calculation is finished. If not, put s
r
= sign y
r
.
If s
r
= s
a
then put x
a
= r and t
i
= 1 else put x
b
= r and t
i
= 0.
(e) If x
b
x
a
< , the calculation is finished. Otherwise repeat the step (i).

The additional sequence {t
i
} may be taken as a binary expansion of a real number t
when r is expressed as a convex combination of the endpoints a, b:
r = tb + (1t) a = a + (b a) t.
Example. Let us search for a root of the function f(x) = 1 (4/x) sin (20/x) in interval
[1, 4]. It can be seen in Fig. 4.1 that f has five zeros in [1, 4].

-4
-3
-2
-1
0
1
2
3
4
5
1 2 3 4
x
y


Fig. 4.1 Function f(x) = 1 (4/x) sin (20/x) on interval [1, 4].
1
2
3
4 5

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

37

i x
a
x
b
t
0 1.000000000 4.000000000 0
1 2.500000000 4.000000000 1
2 2.500000000 3.250000000 0
3 2.500000000 2.875000000 0
4 2.687500000 2.875000000 1
5 2.781250000 2.875000000 1
6 2.828125000 2.875000000 1
7 2.828125000 2.851562500 0
8 2.828125000 2.839843750 0
9 2.828125000 2.833984375 0
10 2.828125000 2.831054688 0
11 2.828125000 2.829589844 0
12 2.828857422 2.829589844 1
13 2.829223633 2.829589844 1
14 2.829223633 2.829406738 0
15 2.829223633 2.829315186 0
16 2.829269409 2.829315186 1
17 2.829292297 2.829315186 1
18 2.829292297 2.829303741 0
19 2.829292297 2.829298019 0
20 2.829295158 2.829298019 1
21 2.829296589 2.829298019 1
22 2.829297304 2.829298019 1
23 2.829297662 2.829298019 1
24 2.829297841 2.829298019 1
25 2.829297930 2.829298019 1
26 2.829297930 2.829297975 0
27 2.829297952 2.829297975 1
28 2.829297952 2.829297964 0
29 2.829297952 2.829297958 0
30 2.829297955 2.829297958 1
31 2.829297955 2.829297957 0
32 2.829297956 2.829297957 1
33 2.829297956 2.829297956 0

The bisection method can be
well traced in the neighboring
table that shows the successive
reducing the length of intervals
at whose endpoints the function
f has opposite signs.
The first iterations (bold in
the table) are shown on the axis
0x in Fig. 4.1. The table shows
also the connection between
sign changes and additional
values of t
i
.
The binary number
t =
i=

1
33
t
i
2
i
= 0.609 765 985 ...
corresponds to
r = tb + (1t) a = 4t + 1 t =
1+3t = 2.829 297 95...
Of course, the values x
a
and x
b

converge to the same number
which can be seen from the last
rows of the table.


Remark. In bisection method a family of centered closed intervals {I
i
: i=0, 1, ...} is constructed whose
lengths decrease as quickly as 2
i
. A well known theorem on compact sets says that
I
{I
i
: i=0, 1, ...}
[9], which proves the existence of a zero of f. If in the kth step at any of endpoints of I
k
the function f
becomes zero the search is finished and for i = k+1, k+2, ... it can be put I
i
= {r}.

Remark. The limit point of the bisection algorithm depends on the choice of interval [a, b], f(a) f(b)<0.
If, e.g., a =1 and b = 2, 3, 5 one would obtain another root r = 1.542 927 532...

It is evident that under these conditions the algorithm can be generalized to iterated
partition of the interval [x
a
, x
b
] in a ratio q:(1q), 0<q<1. The next table presents results
of such algorithms in calculating zeros of the above function f for different q in the same
interval [1, 4] with the tolerance = 10
10
. The sequence {t
i
} would need a more
complicated way of generating. Also a question on fastest convergence could be
analyzed in some simple situations.


3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

38
q Number of iterations Zero r
0.1 98 2.829 297 956 1...
0.2 47 2.829 297 956 1...
0.3 37 2.829 297 956 1...
0.4 37 2.829 297 956 1...
0.5 35 1.542 927 532 7...
0.6 35 1.046 341 398 4...
0.7 40 1.046 341 398 4...
0.8 35 1.046 341 398 4...
0.9 63 1.046 341 398 4...

Multiple Uniform Partition
Let f be a function continuous on an interval [a, b], f(a)f(b)<0 and h = (ba)/m, m>2.
Points ih for i = 0, 1, , m create the uniform partition of [a, b] into m parts [x
i1
, x
i
] of
the same length.
A further algorithm for calculating a zero r with a tolerance >0 could be like this:
(0) Set s
a
= sign f(a), x
a
= a, x
b
= b. Define counter i and set i = 0.
(i) Increase i by 1. Put h = (x
b
x
a
)/m. For j = 0, 1, , m put x
j
= a + jh and calculate
y
j
= f(x
j
). If y
j
= 0, then r = x
j
and the calculation is finished. Otherwise compute
s
a
sign y
j
. If s
a
sign y
j
> 0 the integer j is increased by 1 until s
a
sign y
j
= 1.
Then x
a
= x
j1
, x
b
= x
j
and t
i
= j1.
(e) If x
b
x
a
< , finish the computation otherwise go to (i).

The following table shows two examples.
m = 3 m = 10
i x
a
x
b
t
i
x
a
x
b
t
i

0 1.0000000000 2.0000000000 0 1.0000000000 2.0000000000 0
1 1.3333333333 1.6666666667 1 1.0000000000 1.1000000000 0
2 1.4444444444 1.5555555556 1 1.0400000000 1.0500000000 4
3 1.5185185185 1.5555555556 2 1.0460000000 1.0470000000 6
4 1.5308641975 1.5432098765 1 1.0463000000 1.0464000000 3
5 1.5390946502 1.5432098765 2 1.0463400000 1.0463500000 4
6 1.5418381344 1.5432098765 2 1.0463410000 1.0463420000 1
7 1.5427526292 1.5432098765 2 1.0463413000 1.0463414000 3
8 1.5429050450 1.5430574607 1 1.0463413900 1.0463414000 9
9 1.5429050450 1.5429558502 0 1.0463413980 1.0463413990 8
10 1.5429219800 1.5429389151 1 1.0463413983 1.0463413984 3
11 1.5429219800 1.5429276251 0 1.0463413984 1.0463413984 7
12 1.5429257434 1.5429276251 2
13 1.5429269979 1.5429276251 2
14 1.5429274160 1.5429276251 2
15 1.5429274857 1.5429275554 1
16 1.5429275322 1.5429275554 2
17 1.5429275322 1.5429275399 0
18 1.5429275322 1.5429275347 0
19 1.5429275322 1.5429275330 0
20 1.5429275327 1.5429275330 2
21 1.5429275327 1.5429275328 0


3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

39
Like in the bisection method the number t = 0.t
1
t
2
t
3
... = t
1
m
1
+ t
2
m
2
+ t
3
m
3
+
(m-adic expansion of t) enables to write r as a convex combination of the points a, b:
r = tb + (1 t) a = a + (b a) t.
This can be shown by the function from Example 1. Let [1, 2] be the initial interval
of iteration. Fig. 4.1 shows 3 zeros of f in this interval. The described algorithm finds
one of them. The above table presents iterations for m=3 and m=10.
For m = 3 the triadic expansion of the root
r
3
= 1 + (0.11212221010222120002)
3
, which is r
3
= 1.5429275327 decimally.
In m = 10 the decimal number 0.t
1
t
2
= 0.04634139837 is obtained directly and no
explanation comment is needed (the numbers in the last rows are rounded).
It is evident that the information about a function with an a priori unknown behavior
is increasing together with increasing m. EXCEL with m=10 may be a very efficient
tool for a quick determination of a zero without programming. If the length of the initial
interval is chosen as an entire power of 10, i.e. [k.10
l
, (k+1) 10
l
], kZ, then every
iteration step means the reduction of an foregoing interval [x
a
, x
b
] to one tenth of its
length, thus increasing the precision of the foregoing root estimate by a decimal order.
This can be done by successive copying and easy arrangement.
The following table shows the first five steps of this process.

k = 1 k = 2 k = 3 k = 4 k = 5
j x
j
y
j
x
j
y
j
x
j
y
j
x
j
y
j
x
j
y
j

0 1 2.65178 1 2.65178 1.040 0.43095 1.0460 0.02333 1.04630 0.00283
1 1.1 3.25168 1.01 2.22702 1.041 0.36345 1.0461 0.01650 1.04631 0.00215
2 1.2 1.02 1.69678 1.042 0.29574 1.0462 0.00967 1.04632 0.00146
3 1.3 1.03 1.08885 1.043 0.22785 1.0463 0.00283 1.04633 0.00078
4 1.4 1.04 0.43095 1.044 0.15980 1.0464 0.00401 1.04634 9.6E05
5 1.5 1.05 0.25040 1.045 0.09162 1.0465 1.04635 0.00059
6 1.6 1.06 1.046 0.02333 1.0466 1.04636
7 1.7 1.07 1.047 0.04504 1.0467 1.04637
8 1.8 1.08 1.048 1.0468 1.04638
9 1.9 1.09 1.049 1.0469 1.04639


Regula falsi (false rule)
The principle of regula falsi consists in substituting the zero of a considered function
that changes its sign at the endpoints of an interval by the zero of the linear binomial
through the same endpoints (Fig. 4.2).
Suppose the function f is continuous in interval [a, b], f(a) f(b)<0. The points
(a, f(a)), (b, f(b)) determine a linear function L(t), 0t1, that can be written as convex
combination of those points,
t
|

\
|
) (a f
a
+ (1 t)
|

\
|
) (b f
b
=
|

\
|
+
+
) ( ) ( ) (
) (
b f t a f t
b t a t
1
1
.

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

40



Fig. 4. 2 Principle of regula falsi.

The intersection point of L(t) and the x-axis is defined by
tf(a) + (1 t) f(b) = t [f(a) f(b)] + f(b) = 0 ,
which gives t = f(b) / [f(b) f(a)]. Thus, the false zero r of the function f is
r = t a + (1 t) b = b + t (a b) = b f(b)
) ( ) ( b f a f
b a

= b f(b)
) ( ) ( a f b f
a b

.
This can be rearranged to
r =
) ( ) (
) ( ) (
a f b f
a bf b af

= det
|

\
|
) (
) (
b f b
a f a
/ det
|

\
|
) ( 1
) ( 1
b f
a f
.
Now f(r) can be computed. If f(r) = 0, the root of f is found. In the opposite case that
of the endpoints a, b, at which the function f has the same sign as f(r), is replaced by r.
This way the original interval length is reduced and the procedure can be repeated.
Setting r
1
= a, r
0
= b allows to define a sequence {r
n
} converging to the zero of f as
follows
r
n+1
=
) ( ) (
) ( ) (
1
1 1

n n
n n n n
r f r f
r f r r f r
, n = 0, 1, 2, ..., r
1
= a, r
0
= b.
If f is concave or convex in [a, b], the sequence r
n
converges to zero of the function
f from the left or right side while the second endpoint remains fixed. This so called
primitive form of regula falsi, can, e.g. for fixed a, be written as
r
n+1
=
) ( ) (
) ( ) (
a f r f
a f r r af
n
n n

, n = 0, 1, 2, ..., r
0
= b.
In order to accelerate the convergence the regula falsi is combined with the bisection
method. Results from EXCEL are shown in the following table.



3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

41
Regula falsi Regula falsi + Bisection method
x
i
f(x
i
) r
i
f(r
i
) x
i
f(x
i
) r
i
f(r
i
)
1 2.65178 2.725407 0.27685 1 2.65178 2.725407 0.27685
4 1.958924 4 1.958924
2.725407 0.27685 2.883236 0.156584 2.725407 0.27685 2.831146 0.005272
4 1.958924 3.362704 1.39174
2.725407 0.27685 2.826218 0.00877 2.725407 0.27685 2.82917 0.00036
2.883236 0.156585 2.831146 0.005272
2.826218 0.00877 2.829241 0.00016 2.82917 0.00036 2.829298 1.4E07
2.883236 0.156585 2.830158 0.002452
2.829241 0.00016 2.829297 2.9E06
2.883236 0.156585
2.829297 2.7E06 2.829298 4.9E08
2.883236 0.156585

If the requirement of opposite signs of the continuous function f at the endpoints of
an interval is omitted and the warranty of existence of the root within the interior of
considered interval is lost, the corresponding iteration
r
n+1
=
) ( ) (
) ( ) (
1
1 1

n n
n n n n
r f r f
r f r r f r
= r
n
f(r
n
)
) ( ) (
1
1

n n
n n
r f r f
r r

is called secant method. However, if f >0 or f <0 and f(a) f(b)<0, the sequence {r
n
}
converges to the only root of f in (a, b).


Quadratic Interpolation Method
Let fC
0
([a, b]), f(a) f(b)<0. Let us put x
0
= a, x
1
= b, x
1/2
= (x
0
+ x
1
)/2. The nodes
(x
0
, f(x
0
)), (x
1/2
, f(x
1/2
)), (x
1
, f(x
1
)) define the interpolation parabola
P(x) = A(xx
0
)
2
+ B(xx
0
) + y
0
,
whose coefficients
A = 2[ f(x
0
) 2 f(x
1/2
) + f(x
1
) ]/(x
1
x
0
)
2
, B = [3f(x
0
) + 4 f(x
1/2
) f(x
1
) ]/(x
1
x
0
)
can be easily found from conditions P(x
1/2
) = f(x
1/2
), P(x
1
) = f(x
1
).
The roots of P are
r
1,2
=

B B A f x
A
2
0
4
2
( )
.
After substitution and simple arrangements we get
r
1,2
=
x x
1 0
2
B B f x A
A

2
0
4
2
( )
,
where A = f(x
0
) 2 f(x
1/2
) + f(x
1
), B = 3f(x
0
) 4 f(x
1/2
) + f(x
1
). The trinomial P has two
roots but only one is needed, just that one lying in the interval [x
0
, x
1
]. Fortunately, it
can be determined quite simply:
r =
x x
1 0
2

A
A x f B x f B
2
) ( 4 ) ) ( (
0
2
1
+sign
.

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

42
The interpolation parabola can be expressed by coefficients A, B
P(x) = 2A
x x
x x

|
\

|
0
1 0
2
B
x x
x x

|
\

|
0
1 0
+ y
0
.

0

Fig. 4. 3 Solution of the equation f(x) = 1 (4/x) sin (20/x) = 0 by quadratic interpolation.


Step x
i
f(x
i
) A, B r f(r)
1 2.65178 0.47309 2.869955 0.117602
1 2.5 0.58297 3.66453
4 1.95892
2.5 0.58297 0.273015 2.833481 0.011944
2 2.685 0.36913 0.15468
2.87 0.11773
2.833481 0.01194 0.026531 2.829382 0.00024
3 2.759241 0.19186 0.434133
2.685 0.36913
2.829382 0.00024 0.004385 2.829298 1.110
6

4 2.794312 0.09800 0.200864
2.759241 0.19186

In Fig. 4.3 two interpolation parabolas in the function f(x) = 1 (4/x) sin (20/x) are
shown, P
1
on interval [1, 4] and P
2
on [2.5, 2.87]. It can be seen that the graph of the
polynomial P
2
is optically almost indistinguishable from the graph of the function f.
-3
-2
-1
0
1
2
3
4
5
1 2 3 4
x
y
f (x)

P
1
( x )
P
2
( x )

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

43
This method contains automatically also the bisection method and this warrants the
convergence in the worst case [10]. Numerical problems could arise when A0. But
those can be avoided by using linear interpolation
r = x
0

f x f x
x x
( ) ( )
1 0
1 0

f(x
0
)
when A becomes small, say, A <10
8
.

Inverse Interpolation Method
Let a continuous function f : [a, b] R
1
fulfill the condition f(a) f(b)<0. If = f(a),
= f(b), then either < 0 < or < 0 < . Let us e.g. choose < 0 < .
The linear interpolation polynomial on the interval [, ], P
1
() = a +

(a), is at
the point 0 equal to
r
1
= P
1
(0) = a

(b a).
Let r
1
be taken as an initial estimate of the root. When
1
= f(r
1
) is calculated the three
nodes (, a), (
1
, r
1
), (, b) allow to construct a quadratic interpolation polynomial
P
2
() and put r
2
= P
2
(0). Calculating
2
= f(r
2
) and adding the point (
2
, r
2
) to the
former set of interpolation nodes enables to construct a cubic interpolation polynomial
P
3
() and put r
3
= P
3
(0). Then again
3
= f(r
3
) can be calculated and the point (
3
, r
3
)
added to the set of interpolation nodes, etc. This process continues until r
n
r
n1
<
(or f(r
n
) < ) where is the given tolerance.
Aitkens algorithm enables to do this by repeated linear interpolation and calculation
of function values at each new point r
n
, which can be done in EXCEL. In each step the
columns are lengthened by one and the other operations are just copied. The following
table shows the computation of the root of the function f(x) = 1 (4/x) sin (20/x) for the
tolerance =510
7
in the third step already (r = 2.829 298).

Step x
i
f(x
i
) r
1,i
f(r
1,i
) r
2,i
f(r
2,i
) r
3,i
f(r
3,i
)
1 1 2.65178 2.725407 0.27685
4 1.958924
2 1 2.65178 2.92654 0.28463 2.83028 0.0028
2.725407 0.27685 2.883236 0.156584
4 1.958924
3 1 2.65178 2.92654 0.28463 2.829296 5.1E06 2.829298 1.25E07
2.725407 0.27685 2.82923 0.00019 2.829298 5.83E08
2.83028 0.0028 2.828605 0.00197
4 1.958924
4 1 2.65178 2.92654 0.28463 2.829298 2.3E10 2.829298 3.33E16
2.725407 0.27685 2.829298 8.7E09 2.829298 3.33E16 2.829298 3.33E16
2.829298 1.25E07 2.829298 5.28E11 2.828605 0.00197
2.83028 0.00280 2.828605
4 1.958924

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

44
Remark. Sharp monotony of the function f C
0
([a, b]) is a sufficient condition for the convergence of
the sequence {r
n
} to the root r. But the efficiency may seduce to risking. With an isolated problem of
finding a root the initial interval can easily be changed and the calculation repeated. A worse situation
arises, if finding the root is but a small part of a larger algorithm whose results are used as inputs of
linked up complex computations. Then the initial localization should assure uniqueness of the root. Also
recurrent interpolation of a small degree with fixed number of interpolation nodes is preferable.
Inverse interpolation can be done with any suitable function, e.g. spline,
trigonometric polynomial etc.

Newton-Raphson Method
If the function is supposed to be not only continuous but also differentiable (smooth),
the limit
x
n+1
= x
n
f(x
n
)
n n
x x
1
lim
) ( ) (
1
1

n n
n n
x f x f
x x

can be considered in the secant method. On the right hand side there is the reciprocal
value of the derivative f (x
n
). If f (x
n
) 0, the Newtons iteration is obtained
x
n+1
= x
n

) (
) (
n
n
x f
x f

, n = 0, 1, ...
It can be proved that if f exists on [a, b], f and f do not change their signs (i.e. f is
convex or concave) and f(a) f(b)<0, this iteration converges to an x(a, b).

Example. Calculation of a positive root of a positive number, i.e. finding a positive root
of the function f(x) = x
b
a, a, b, x > 0. Then f (x) = bx
b1
>0 and f(x) = b(b1)x
b2
.
If b = 1, we have immediately x = a, the function f is convex for b>1 and concave for
b<1. The corresponding Newtons iteration is then
x
n+1
= x
n

1

b
n
b
n
bx
a x
= |

\
|

b
1
1 x
n
+
1 b
n
x b
a
.
For example, for b = 5 and b = e
1

x
n+1
= 0.8 x
n
+ 0.2
4
n
x
a
and x
n+1
= (1e) x
n
+ e a
-1
e 1
n
x ,
and with a = 20, x
0
= 2 we get

x
1
x
2
x
3
x
4
x
5
x
6
x
7

b = 5 1.85 1.821 1.82056 1.820564
b = e
1
80.4 734.3 2261.2 3287.1 3437.87 3440.061 3440.06141

Remark. In practice, however, functions allowing such a simple verification of convergence assumptions
are rather exceptional. Another problem arises when we are not able to do more than laboriously compute
values of the considered function (e.g. by numerical integration or solving a complicated equation). Then
the derivative must be approximated by divided difference, i.e. Newtons iteration is transformed to the
modified secant method. On the other hand, if the root is localized well, the convergence of Newtons

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

45
iteration is very fast. The following table shows this in determining roots of f(x) = 1 (4/x) sin (20/x) for
x
0
= 1, 2, 3.

n xn f(xn) f(xn) xn f(xn) f(xn) xn f(xn) f(xn)
0 1 36.29835 2.65178 2 8.93474 2.088042 3 2.914046 0.501132
1 1.073055 62.58095 1.78139 2.233699 6.03273 0.187363 2.828029 2.847219 0.003620
2 1.044590 68.19115 0.11960 2.264757 5.27164 0.011776 2.829299 2.850369 2.01E06
3 1.046344 68.35873 0.00016 2.266991 5.2164 6.17E05 2.829298 2.850368 6.09E13
4 1.046341 68.35856 1.98E10 2.267003 5.2161 1.73E09 2.829298 2.850368 3.33E16
5 1.046341 68.35856 4.66E15 2.267003 5.2161 4.44E16

These calculations as well as starting the iteration from another initial point can be
done comfortably in EXCEL.
Wrong estimate of the starting point or
careless using can cause that the Newton
method fails. The table on the left side shows
iterations of the function
f(x) = 1 (4/x) sin (20/x)
(Example 1) for the starting point x
0
= 4.
It can be seen that the sequence of
iterations {x
i
: i = 0, 1, } diverges.


Simple Iteration Method
The equation f(x) = 0 can be reshaped to x = f(x) + x = g(x). The equality x = g(x) means
that x is a fixed point of the function g. In accordance with Banach fixed point theorem
in spaces R
k
provided with the Euclidean metric it is sufficient to verify that
x x
x g x g

) ( ) (
q < 1 for x, x from the considered subset [1,2,9]. This condition is
fulfilled, if g is differentiable and g(x) q < 1.
The fixed point of the function g can be interpreted as the intersection point of the
straight line y = x with the curve y = g(x). It can be found by iteration
x
n+1
= g(x
n
), n = 0, 1, ...

Example. Let us solve the equation
sin (2x) = cos x
in the segment (0, /2) . This can be rewritten as x =
2
1
arsin (cos x)
and the corresponding iteration is
x
n+1
=
2
1
arsin (cos x
n
) = g(x
n
)
n x
n
f(x
n
) f(x
n
)
0 4 0.114847 1.958924
1 13.0569 0.02485 0.693881
2 14.86861 0.023081 0.737799
3 17.0976 0.01885 0.784612
4 24.53529 0.008551 0.881341
5 78.538 0.00032 0.987170
6 2976.076 6.07E09 0.999991
7 1.6E+08 3.6E23 1.000000
8 2.79E+22 7.33E66 1.000000
9 1.4E+65 6E194 1.000000


3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

46
Obviously, g(x) =
x
x
2
cos 1
sin
2
1

=
2
1
<1, which assures the convergence of the
sequence {x
n
} for every initial point x
0
[0, /2].
The equation
sin (2x) = cos x
can also be solved by adding x on its both sides and obtaining the iteration
x
n+1
= x
n
+ cos x
n
sin (2x
n
) = g(x
n
).

n x
n


0 0

1 0.785398163

2 0.392699082

3 0.589048623

4 0.490873852

5 0.539961237

6 0.515417545

7 0.527689391

8 0.521553468

9 0.524621429

10 0.523087449

11 0.523854439

12 0.523470944

13 0.523662691

14 0.523566818

15 0.523614755

16 0.523590786

17 0.523602770

18 0.523596778

19 0.523599774

20 0.523598276

21 0.523599025

22 0.523598651

23 0.523598838

24 0.523598744

25 0.523598791

26 0.523598768

27 0.523598779

28 0.523598774

29 0.523598777

30 0.523598775

31 0.523598776

Then g(x) = 1 sin x 2 cos 2x. But in this case the
convergence would be secured only on a part of R
1
.



















Fig. 4.4 Convergence of the sequence {x: n = 0, 1, } given by the iteration
x
n+1
=
2
1
arsin (cos x
n
).

The results can be checked up very simply. It is
sin (2x) = 2 sin x cos x = cos x and
cos x (2 sin x 1) = 0, i.e. arcsin
2
1
= 0.5235987756
(x =
2

is eliminated).



3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

47
Localization and Simultaneous Determination of Roots
A very simple estimate of the roof of a function f continuous in interval [a, b] consists
in defining (n+1) equidistant points,
x
0
=a, x
1
= a + (ba)/n, x
2
= a + 2(ba)/n, ..., x
n
= b ,
computing values f(x
i
) and determining m = min {f(x
i
): i=0, ..., n}. If m is very small
the corresponding x
m
can be an estimate of the root.
This method cannot fail but a too high price is paid due to work consumed in
computation of the other n1 values f(x
i
). The work consumed on calculation of one
value of the function is sometimes called a Horner. The search for a zero of a function f
with the tolerance (ba)10
6
would need 10
6
computations of values of f, i.e. one
million Horners. Thus, this method is inconvenient and the search for a root of f in
interval [a, b] has to be rationalized. This, in the end, is the leading idea of the
preceding methods working with efficiency of a few Horners.
Computing value at newly added points supplies further information concerning the
considered function f, especially about the existence of its roots. Then, in the original
interval [a, b] some intervals [x
i1
, x
i
] may be taken that satisfy the condition
f(x
i1
) f(x
i
)<0 while their length x
i
x
i1
is small in comparison to ba. In those partial
intervals local methods of computing the zeros of f can be used.
The added points can be taken in a deterministic way, e.g. by partition of the original
interval by equidistant points. But the partition points may be chosen randomly too.
Then, the root localization can look like this:
A small number n (e.g. n=10) of random numbers x are repeatedly generated
in interval [a, b], values f(x) are calculated, their minimum m is determined and the
corresponding coordinate x
m
is recorded. The values x
m
are cumulating round zeros of
the function f.
However, a more clever procedure consists in recording only those values x at which
the value f(x) drops below the level of some prescribed tolerance >0. Generating
random numbers x uniformly distributed in interval [a, b] and calculation of f(x) is
stopped when, say, 100 values x satisfying f(x) < are found. Ordering them
produces clusters round roots of f.

Example. Results obtained with the function f(x) = 1 (4/x) sin (20/x) on interval
[1, 4] with = 0.05 are shown in Fig. 4.5. If the process is repeated the frequencies may
be different because the concentration of points around a root x
k
is random. The
tolerance = 0.05 produces some graphical offsets in the two upper rows of points in
Fig. 4.5. This may be removed by choosing smaller.
So for = 0.000 005 and eliminating x
k
= x
k1
for k>1, we got at once the following
ordered set of roots of f :

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

48
{1.046 34, 1.300 66, 1.542 93, 2.267 00, 2.829 30} .
But the cost was too high: N 45.810
6
Horners. In this example using N = (ba)/
equidistant points with the total work amount of 0.6 10
6
Horners would seem much
more reasonable.
The random choice of points from [a, b] can, however, be modified in various ways.
A certain and sometimes even substantial reduction of computing work N can be
attained by introducing an indicator that reveals if the random point x differs from the
roots that had already been found.

0
0.5
1
1.5
2
2.5
3
3.5
4
0 20 40 60 80 100
k
x
k

Fig. 4. 5 Ordered sequence {x
k
: f(x
k
)<0.05, k = 1, ..., 100} in f(x) = 1 (4/x) sin (20/x).

A more detailed imagination concerning the simultaneous calculation of all roots can
be obtained for different tolerances . But must be also taken as a compromise
between the preciseness and the amount of work exerted with very small .



3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

49
3.2 Systems of Nonlinear Equations
With the experience gained in solution of one equation f(x) = 0 we can start considering
systems of n equations of n unknowns, nN, n>1. This may be written in the vector
form
f(x) = 0
where x, 0R
n
, n > 1, and components of f, f
i
(x), are continuous in a simple, connected
area GR
n
, like an interval, ball etc.
Opposite signs of components f
i
(x)C
0
(G) in some points, parts of the border G or
subareas of G, in contrast to n=1, do not warrant the existence of a root xG,
unfortunately. So we are in a quite different and much more complicated situation. The
success in finding the root x is assured only in very special cases. E.g. algebraic
polynomial of the nth degree in one complex variable can be taken as a function of two
variables, the real and imaginary part of the complex number z = x + iy:
P(x, y) = R(x, y) + i I(x, y).
Obviously, P(x, y) = 0 if and only if
R(x, y) = 0,
I(x, y) = 0.

There are several well-known ways how to solve systems of nonlinear equations. An
important role belongs to the dimension n: systems of two equations are surely more
transparent than those of 10 equations.
From a general standpoint several typical methods can be classified.
Elimination method is the most natural and the logically simplest method. One
equation, let it be the first one,
f
1
(x
1
, ..., x
n1
, x
n
) = 0
is viewed as an implicit prescription for some variable, say x
n
. If the conditions of the
corresponding theorem [9] are fulfilled, x
n
= g
n
(x
1
, ..., x
n1
). Thus, the number of
unknowns is reduced by 1. The second equation of the system is then transformed to
f
2
(x
1
, ..., x
n1
, x
n
) = f
2
(x
1
, ..., x
n1
, g
n
(x
1
, ..., x
n1
)) = F
2
(x
1
, ..., x
n1
) = 0.
Again, if the conditions of the corresponding theorem are fulfilled, this equation
gives x
n1
= g
n1
(x
1
, ..., x
n2
). The third equation is analogously transformed to
f
3
(x
1
, ... , x
n2
, g
n1
(x
1
, ..., x
n2
), g
n
(x
1
, ... , x
n2
, g
n1
(x
1
, ..., x
n2
))) = F
3
(x
1
, ..., x
n2
) = 0.
This procedure can continue and after n1 steps we obtain an equation of an
unknown
F
1
(x
1
) = 0,
whose solution should already be found quite easily.
Now the backward process can start. The root x
1
is put into x
2
= g
2
(x
1
), the couple
x
1
, x
2
is put into x
3
= g
3
(x
1
, x
2
), etc.

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

50
The sketched process can be realized in some cases when, for example, the number n
is small, e.g. n = 2 or, may be, n = 3. So it is not a mere academic possibility.
Simple iteration method. A convenient domain G is needed in which the system
f(x) = 0 can be transformed it into an equivalent form x = g(x) enabling the iteration
x
k+1
= g(x
k
).
Assuring the convergence of this iteration requires some special conditions, e.g.
those needed for the Banach fixed point theorem [9]:
g x g x ( ) ( )
k k +

1
q x x
k k +

1
, 0 q <1.
Newtons method. If the domain G is large and functions f
i
(x) satisfy special
conditions of existence of partial derivatives
j
i
x
f

and Jacobian det


|
|

\
|

j
i
x
f
0 in G,
then the first terms of Taylors expansion allow to define Newtons iteration (see
below).
Transform to a minimization problem. Functions f
i
(x) are required to be continuous
in an area GR
n
in which an interval Q can be taken. The problem f(x) = 0, xQ is
solved by minimizing a convenient norm of f(x) or finding out points at which the
norm of f(x) drops below a prescribed . Thus, individual points xQ are at hand
and a convenient minimizing procedure must be defined. If a choice of Q does not
give a solution, then simply another choice of Q is tried.

The two first methods do not need further explanation and will be illustrated by
examples.

Example of elimination. Find intersection points of the three surfaces: (1) the sphere
x
2
+ y
2
+ z
2
= a
2
with the center x=0 and radius a>0, (2) the cone x
2
y
2
+ z
2
= 0
and (3) the plane x + y +z = 0. Totally, one gets the system:
x
2
+ y
2
+ z
2
= a
2
,
x
2
y
2
+ z
2
= 0,
x + y +z = 0.
The last equation gives z(x, y) = (x + y). This being put in the second equation gives
x
2
y
2
+ (x + y)
2
= 2 x
2
+ 2 x y = 2 x (x + y) = 0; thus either x = 0 or y = x.
If x = 0, then z(0, y) = (0 + y) = y, which after being put in the first equation
yields 0
2
+ y
2
+ (y)
2
= a
2
, i.e. y = a/ 2 . The solution in this case represents the
point P
1
= (0, a/ 2 , a/ 2 ) and the centrally symmetric point P
2
=
(0, a/ 2 , a/ 2 ). If y = x, it is z(x, y) = (x x) = 0, which after being put in the
first equation yields x
2
+ (x)
2
+ 0
2
= a
2
, i.e. x = a/ 2 ; thus another two
intersection points are obtained P
3
= (a/ 2 , a/ 2 , 0), P
4
= ( a/ 2 , a/ 2 , 0).

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

51
If the plane is not going through the sphere center x = 0, e.g. x + y +z = 1, the
corresponding system of equations would not be solved so easily.

Example of the simple iteration method.
Let the following system of
three nonlinear equations for
three unknowns be given
(arranged for the simple
iteration method)
) (
. sin
2
1
sin
4
1
, cos
4
1
2
sin
, cos
2
1
2
sin
2
2
2
I
y x z
z
x
y
z z
y
x

+ =
+ =
+ + =
The table on the left presents
the iterations
|
|
|

\
|
+
+
+
1
1
1
k
k
k
z
y
x
= x
k+1
= f(x
k
) =

|
|
|
|
|
|

\
|
+
+
+ +
k k
k
k
k k
k
y x
z
x
z z
y
sin sin
4
1
cos
4
1
2
sin
cos
2
1
2
sin
2
2
2

starting from the initial point
x
0
= y
0
= z
0
= 1.

The simple iteration methods are used carefully with the consciousness of their slow
convergence and possible failure. The Newtons method, conversely, is tried very often
and hazardously - mainly due to its fast convergence.

Newtons Method
As it was mentioned above, the functions f
i
(x) are supposed to have all partial
derivatives
j
i
x
f

in G, i, j = 1, ..., n and their Jacobian det


|
|

\
|

j
i
x
f
0. Let x
k
be an initial
estimate of the root x, f(x) = 0. The first terms in the Taylor expansion of f at the point
x
k
can be written as follows
f(x
k
+ h) = f(x
k
) +
k
j
i
x
f
x
|
|

\
|

h + Reminder.
k x
k
y
k
z
k

0 1 1 1
1 1.749576692 0.614501115 0.243717138
2 0.847060990 1.016966929 0.046181321
3 0.988452769 0.660980993 0.284881690
4 0.885512052 0.723527884 0.132566017
5 0.867111294 0.678392760 0.181154853
6 0.857364591 0.669965514 0.168433763
7 0.850047087 0.665572093 0.167549928
8 0.847748622 0.662243725 0.167637375
9 0.846197309 0.661196264 0.166896669
10 0.845515851 0.660490860 0.166868370
11 0.845175134 0.660180218 0.166759041
12 0.845000803 0.660025106 0.166720939
13 0.844917888 0.659945695 0.166702947
14 0.844875819 0.659907924 0.166692162
15 0.844855252 0.659888764 0.166687684
16 0.844845067 0.659879395 0.166685221
17 0.844840018 0.659874756 0.166684048
18 0.844837530 0.659872456 0.166683469
19 0.844836297 0.659871323 0.166683178
20 0.844835688 0.659870761 0.166683036
21 0.844835387 0.659870484 0.166682966
22 0.844835238 0.659870346 0.166682931
23 0.844835165 0.659870279 0.166682913
24 0.844835128 0.659870245 0.166682905
25 0.844835110 0.659870228 0.166682901
26 0.844835102 0.659870220 0.166682899
27 0.844835097 0.659870216 0.166682898
28 0.844835095 0.659870214 0.166682897
29 0.844835094 0.659870213 0.166682897
30 0.844835093 0.659870213 0.166682897
31 0.844835093 0.659870212 0.166682897
32 0.844835093 0.659870212 0.166682897

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

52
Our goal is to annul f by a convenient choice of h. Assumptions f(x
k
+ h) = 0 and
Reminder = 0 give the following system of linear equations
k
j
i
x
f
x
|
|

\
|

h = f(x
k
)
with the regular matrix
k
j
i
x
f
x
|
|

\
|

( i. e. det
k
j
i
x
f
x
|
|

\
|

0 ) and solution
h =
1
|
|

\
|

k
j
i
x
f
x
f(x
k
).
An improved estimate x
k+1
can be defined by x
k+1
= x
k
+ h. This defines the Newton (or
Gauss-Newton) iteration
x
k+1
= x
k

1
|
|

\
|

k
j
i
x
f
x
f(x
k
).
In one-dimensional case, n=1, the Jacobi matrix is reduced to one element f , its
inverse to 1/f and the foregoing recurrence turns into x
k+1
= x
k
f(x
k
)/ f (x
k
).
In complicated functions partial derivatives cannot be given analytically and are
substituted by divided symmetric differences
j
n j
x
x x x f

) , ... , , ... , (
1

h
x h x x f x h x x f
n j i n j i
2
) , ... , , ... , ( ) , ... , , ... , (
1 1
+
,
where h is small, e.g. h = 10
6
.

Example. Let us try to find a solution of the above system from the elimination example
x
2
+ y
2
+ z
2
a
2
= 0,
x
2
y
2
+ z
2
= 0,
x + y + z = 0.
The corresponding Jacobi matrix is J =
|
|

\
|

1 1 1
2 2 2
2 2 2
z y x
z y x
. Its inverse J
1
can be found
when transforming (JI
3
) to (I
3
J
1
) by equivalent operations as follows
|
|

\
|

1 0 0
0 1 0
0 0 1
1 1 1
2 2 2
2 2 2
z y x
z y x

|
|

\
|

1 0 0
0 1 1
0 0 1
1 1 1
0 4 0
2 2 2
y
z y x

|
|
|
|

\
|

1 0 0
0
0 0
1 1 1
0 1 0
1
4
1
4
1
2
1
y y
x
x
z
x
y


|
|
|
|
|

\
|

+
x z
x
x z y
x y
x z y
x y
y y
x z
z
x z y
y z
x z y
y z
) ( 4 ) ( 4
4
1
4
1
) ( 4 ) ( 4
0
1 0 0
0 1 0
1 0 1
.

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

53
The Newtonian iteration is then
|
|
|

\
|
+
+
+
1
1
1
k
k
k
z
y
x
=
|
|
|

\
|
k
k
k
z
y
x

|
|
|
|
|

\
|

+
k k
k
k k k
k k
k k k
k k
k k
k k
k
k k k
k k
k k k
k k
x z
x
x z y
x y
x z y
x y
y y
x z
z
x z y
y z
x z y
y z
) ( 4 ) ( 4
4
1
4
1
) ( 4 ) ( 4
0

|
|
|
|

\
|
+ +
+
+ +
k k k
k k k
k k k
z y x
z y x
a z y x
2 2 2
2 2 2 2
.

All elements of the inverse J
1
must be well defined, i.e. all denominators must be
0. This is a trouble-making condition because generally the set of iteration points is not
known a priori. Several trials can be made in a hope that some guess will be good.
Thus, we choose the value of radius a and the starting point, e.g. a = 1 and x
0
= 2,
y
0
= 2, z
0
= 1. Results obtained in several first steps and carried out in EXCEL are
shown in the next table.

k x
k
J
1
(x
k
) f(x
k
) J
1
(x
k
) f(x
k
) x
k+1
= x
k
J
1
(x
k
) f(x
k
)
2 0.125 0.375 1 8 0.375 1.625
0 2 0.125 0.125 0 1 0.875 1.125
1 0 0.5 2 1 1.5 0.5

1.625 0.16993464 0.06535948 0.23529412 3.15625 0.642565359 0.982434641
1 1.125 0.2222222 0.22222222 0 1.625 0.340277778 0.784722222
0.5 0.05228758 0.2875817 0.76470588 0 0.302287582 0.197712418

0.98 0.26618861 0.1575402 0.16949153 0.6088 0.223811386 0.756188614
2 0.78 0.3205128 0.32051282 0 0.392 0.069487179 0.710512821
0.2 0.05432421 0.478053 0.83050847 5.5511E17 0.154324207 0.045675793

0.75 0.33450704 0.29049296 0.0625 0.0691 0.040180458 0.709819542
3 0.71 0.3521127 0.35211268 0 0.0609 0.002887324 0.707112676
0.05 0.01760563 0.6426056 0.9375 0.01 0.047293134 0.002706866

0.7098 0.35221699 0.3495374 0.00378947 0.00381374 0.002683013 0.707116987
4 0.7071 0.3535568 0.35355678 0 0.00383292 6.78122E06 0.707106781
0.0027 0.00133979 0.7030942 0.99621053 3.55618E17 0.002689794 1.02059E05

0.70712 0.35354678 0.35353678 1.4142E05 2.32466E05 1.32186E05 0.707106781
5 0.70711 0.3535518 0.35355178 0 1.41424E05 3.21881E06 0.707106781
0.00001 4.9998E06 0.7070886 0.99998586 4.5511E17 9.99981E06 1.86936E10

0.70710678 0.35355339 0.35355339 2.6446E10 3.3561E09 1.18655E09 0.707106781
6 0.70710678 0.3535534 0.35355339 0 3.4969E20 1.18655E09 0.707106781
1.87E10 0 0.7071068 1 1.87E10 1.87E10 2.47268E20

When starting at points (5, 5, 2) or (0, 1, 1) or (20, 100, 50) the results of
Newtons iteration are x = (2/2, 2/2, 0). The start at the points (1, 2, 2) or (2, 1, 1)
would lead to x = (0, 2/2, 2/2). So, in this case the Newtons iteration appears to be
relatively stable. But usually the Newtons iteration (also called Gauss-Newton

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

54
iteration) must be modified to improve its convergence behavior (e.g. Marquardt's
modification in nonlinear regression mentioned in [3]).

Genetic Algorithm
Random sifting of a chosen interval was used for simultaneous finding all roots of f. In
the systems of equations it can be used as well. For example, we know that all roots of
the following system
x
2
+ y
2
+ z
2
1 = 0,
x
2
y
2
+ z
2
= 0,
x + y +z = 0
must lie on the surface of the unit ball (the 1
st
equation) and consequently also in the
unit cube [1, 1] [1, 1] [1, 1]. The system can be solved simultaneously by
choosing a small number >0 and taking any point (x, y, z)
T
as its solution if
F(x, y, z) = x
2
+ y
2
+ z
2
1 + x
2
y
2
+ z
2
+ x + y +z < .
But the requirement for preciseness must be held in reasonable limits. E.g. for =0.02
the amount of 2 522 980 Horners provided following estimates of root vectors
r
1
=
|
|
|

\
|
1
1
1
z
y
x
=
|
|

\
|

704 . 0
713 . 0
012 . 0
, r
2
=
|
|

\
|

014 . 0
704 . 0
711 . 0
, r
3
=
|
|

\
|

704 . 0
707 . 0
007 . 0
, r
4
=
|
|

\
|

003 . 0
702 . 0
704 . 0
.
These relatively close estimates could further be improved by other methods. E.g. the
Newtons method yields the following iterations for the first root vector r
1
:
x
1
0
=
|
|

\
|

704 . 0
713 . 0
012 . 0
, x
1
1
=
|
|

\
|
9 21 707 . 0
131 707 . 0
086 000 . 0
, x
1
2
=
|
|

\
|
79 106 707 . 0
78 106 707 . 0
01 000 000 . 0
.

Let us generally seek the minimum of a function f(x) in a given interval
Q = [a
1
, b
1
] ... [a
n
, b
n
].
The following one-to-one mapping
x
i
= a
i
+ (b
i
a
i
) t
i
, i = 1, ..., n,
transforms the parallelepiped Q onto the cube I = [0, 1] ... [0, 1] R
n
, x: IQ.
Each t
i
can be with a given tolerance represented in the binary system as a sequence of 0
and 1 of the length m and the whole nm matrix corresponds to the vector t uniquely.
The search for the x minimizing f in Q is transformed to changes of t that can be
controlled by primitive mechanisms well-known from the nature.

Principle of genetic algorithm.
First a population P is generated with not a very big number p of random points
(individuals) t
k
, P = {t
k
: k = 1, ..., p}, e.g. p < 200. Then development of the population
P is set off by two mechanisms taken over from genetics (see [15] or [3]).

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

55
Cross-over. Two elements (parents) t
u
, t
v
P are chosen randomly and a set of random
numbers {r
i
: 1 r
i
m, i = 1, ..., n} is generated. Both the matrices of the elements t
u
,
t
v
are split in the same way corresponding to {r
i
}. The segments are composed by cross
over, i.e. the first segment of t
u
is connected with the second segment of t
v
and the
second segment of t
u
is connected with the first one of t
v
. This way two new elements
(children) are created: t
uv
= t
u,1
t
v,2
, t
vu
= t
v,1
t
u,2
. This process is shown in the
following schema for the case of n = 3, m = 12 and sequence {r
i
} = {8, 2, 5}.

Parent t
u
Parent t
v

0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1
1 1 1 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0
1 1 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1

Child t
uv
Child t
vu

0 1 0 1 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0
1 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0
1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 1


Mutation. A point (individual)l t
k
P, 1kp, is chosen randomly and every randomly
chosen element in each row turns to its complement. For demonstration the first child
from the cross-over example and the same sequence {r
i
} are taken. The corresponding
mutation looks like this:

Individual t
k

Mutant t
km

0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1
1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0
1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1

This process may be modified also. The number of mutations is chosen substantially
less than the number of cross-overs (e.g. 10 or 20 less). Mutations increase the
variability of population P and prevent its premature localization (homogenization).

Evolution of population. Each new point (individual) is evaluated by a properly chosen
objective function. If its value at the new point is better than that of the worst individual
in the population P, the worst individual is replaced by the new one. In the search for
minimum of a function f(x) it self can simply be taken as objective function. Then in the
population P the candidate t
w
for elimination is defined as follows
f(x(t
w
)) = max { f(x(t
k
)) : k = 1, ..., p}.
Thus, the new element t
e
generated by cross-over or mutation is compared to t
w
.
If
f(x(t
e
)) < f(x(t
w
)),

3 NONLINEAR EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

56
then t
e
substitutes t
w
in P and t
w
is excluded.
This way the quality of population cannot decrease.
It is obvious that x(t
e
) can never get out of the interval Q. This is a worthy property
that provides a considerable portion of robustness (stability) for the genetic algorithm.
The described evolution process is infinite and must be stopped when, e.g. a given
number of steps is exceeded or some proper condition is satisfied or simply by users
intervention.

Example. Let us return to finding the solution of the system
x
2
+ y
2
+ z
2
1 = x
2
y
2
+ z
2
= x + y + z = 0
in the cube [1, 1] [1, 1] [1, 1]. The problem can be solved by minimizing the
function
F(x, y, z) = x
2
+ y
2
+ z
2
1 + x
2
y
2
+ z
2
+ x + y +z.
Results of five runs of a program for the search of the root vector by genetic algorithm
are shown in the following table.

Run 1 2 3 4 5
x 0.70691 0.00049 0.00018 0.00006 0.00049
y 0.70721 0.70703 0.70697 0.70715 0.70703
z 0.00055 0.70715 0.70734 0.70691 0.70740
F(x, y, z) 0.00080 0.00082 0.00083 0.00066 0.00094
Work, Horner 2 834 947 1 526 1 567 1 454

The computing process converged tree times to the root x
3
(runs 2, 4, 5), ones to x
2

(run 1) and ones to x
1
(run 3).
Advantages of genetic algorithm are: (i) robustness, (ii) relative universality,
(iii) ability to find solutions or at least their very good estimates in very complicated
problems, in which other methods fail.
For example, it enables easy establishing parameters of regression functions (no
matter if parameters are involved in linear or non-linear way), estimates of parameters
of distribution functions [3]. Repeating the calculations enables also estimating
confidence intervals for the sought parameters (such an opportunity occurred in x
3
in the
foregoing example due to repetition). Thus, given problem needs to be transferred to an
equivalent minimizing problem.




4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

57
4 ORDINARY DIFFERENTIAL EQUATIONS
The initial value problem for the ordinary differential equation of the first order solved
with respect to the derivative
x
y
d
d
= f(x, y(x)), y(x
0
) = y
0
,
can be principally transformed in the integral equation
y(x) = y
0
+

x
x
0
f(t, y(t)) dt
and solved by successive approximations. However, the numerical effectiveness of such
an approach is generally poor. Also power series and other general means could be used
but, on the other hand, there have been evolved many purpose-made efficient methods.
The simplest approach to numerical solution (simple Euler method) was shown in
the end of Section 5.1 in [9], where the initial value problem
x
y
d
d
= f(x, y(x)), y(x
0
) = y
0
,
in interval [x
0
, x
e
] was solved approximately by substituting the smooth line (x, y(x)) by
the broken line connecting the points (x
i
, y
i
),
x
i
= x
0
+ (x
e
x
0
) i/n and y
i+1
= y
i
+ (x
i+1
x
i
) f(x
i
, y
i
), i = 0, 1, , n1.
It is clear that the derivative f(x, y) generally changes on [x
i
, x
i+1
] which must cause a
local error
e(x
i+1
) =

+1 i
i
x
x
f(t, y(t)) dt (x
i+1
x
i
) f(x
i
, y
i
)
and a global error
E(x
i+1
) = y
0
+

+1
0
i
x
x
f(t, y(t)) dt y
i+1
= y(x
i+1
) y
i+1

at a point x = x
i+1
.
To reduce those errors the function f(x, y(x)) must be approximated on [x
i
, x
i+1
] by a
more convenient function than by mere constant f(x
i
, y
i
).

Taylor expansion
In the sufficiently smooth function y(x) the first idea leads to Taylor expansion,
y(x+h) = y(x) + y(x) h + y(x) h
2
/2 + y(x) h
3
/3! +
Considering y(x) = f(x, y(x)) and h
i
= x
i+1
x
i
we obtain
y
i+1
= y
i
+ f(x
i
, y
i
) h
i
+
x d
d
f(x, y(x))
i
x x=

! 2
2
i
h
+
2
x d
d
2
f(x, y(x))
i
x x=

! 3
3
i
h
+
+
1
1

n
n
x d
d
f(x, y(x))
i
x x=

! n
h
n
i
+ R
n+1
, (Tn)

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

58
where R
n+1
is the remainder. For the sake of simplicity the number n is small as a rule.
If, for example, n = 3 we get
y
i+1
= y
i
+

f(x
i
, y
i
) h
i
+ [
x

f(x, y(x)) +
y

f(x, y(x)) y(x)]


i
x x=

! 2
2
i
h

+ [
2
2
x

f(x, y(x)) +
y

f(x, y(x)) y(x)]


i
x x=

! 3
3
i
h


= y
i
+ f(x
i
, y
i
) h
i
+ [
x

f(x, y(x)) +
y

f(x, y(x)) f(x, y(x))]


i
x x=

2
2
i
h

+ [
2
2
x

f(x, y(x)) + 2
y x

2
f(x, y(x)) f(x,y(x)) +
2
2
y

f(x, y(x)) f
2
(x, y(x))
+
x

f(x, y(x))
y

f(x, y(x)) + (
y

f(x, y(x)))
2
f(x, y(x))]
i
x x=

6
3
i
h
. (T3)
Example 1. Let us try to solve the initial value problem
y = sin (x
2
+ y
2
), y(0) = 0
on interval [0, 5].
Denoting h
i
= x
i+1
x
i
= 0.2 = 5/25, s
i
= sin (x
i
2
+ y
i
2
), c
i
= cos (x
i
2
+ y
i
2
) we obtain
for n =1: y
i+1
= y
i
+ sin (x
i
2
+ y
i
2
) h
i
= y
i
+ s
i
h
i
,
n = 2: y
i+1
= y
i
+ sin (x
i
2
+ y
i
2
) + cos (x
i
2
+ y
i
2
) (x
i
+ y
i
sin (x
i
2
+ y
i
2
)) h
i
2

= y
i
+ s
i
h
i
+ c
i
(x
i
+ y
i
s
i
) h
i
2
,
n = 3: y
i+1
= y
i
+ s
i
h
i
+ c
i
(x
i
+ y
i
s
i
) h
i
2
+ [c
i
(1+ s
i
2
) + 2 (x
i
+ y
i
s
i
)(y
i
c
i
2
s
i
] h
i
3
/3.
The following table shows the values y
i
obtained by Taylor approximation in EXCEL.

x
i
=ih n=1 (Euler) T
2
(n=2) T
3
(n=3)
0 0.00000 0.00000 0.00000
0.2 0.00000 0.00000 0.00267
0.4 0.00800 0.01599 0.02129
0.6 0.03987 0.06380 0.07158
0.8 0.11063 0.15829 0.16819
1 0.23202 0.30997 0.32091
1.2 0.40588 0.51118 0.51911
1.4 0.60577 0.70059 0.69802
1.6 0.75126 0.77109 0.76410
1.8 0.75470 0.70490 0.70359
2 0.63082 0.54623 0.55002
2.2 0.44063 0.33899 0.34880
2.4 0.25089 0.16281 0.18447
2.6 0.16206 0.14918 0.18262
2.8 0.25849 0.33873 0.37203
3 0.45821 0.52509 0.53742
3.2 0.50084 0.43304 0.44648
3.4 0.32578 0.20344 0.22928
3.6 0.16909 0.11267 0.15838
3.8 0.25104 0.32563 0.37108
4 0.43781 0.44396 0.46092
4.2 0.34479 0.21611 0.24595
4.4 0.16741 0.09603 0.15079
4.6 0.26997 0.34977 0.40008
4.8 0.40750 0.33336 0.35992
5 0.22004 0.07821 0.12817

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

59
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5
x
y
T1
T2
T3

Fig. 4.1 Taylor approximations T
k
(k=1, 2, 3) of solution of the equation
y = sin (x
2
+y
2
), y(0)=0 on interval [0, 5]. Step h = 0.2.

Estimating errors
Choosing a smaller step usually brings a improvement of the approximate solution.
An example is shown in Fig. 4.2.
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5
x
y
T1 = Euler
T2
T3

Fig. 4.2 Taylor approximations T
k
(k=1, 2, 3) of solution of the equation
y = sin (x
2
+y
2
), y(0)=0 on interval [0, 5]. Step h = 0.1.

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

60
Due to insufficient information concerning the function y and its derivatives a
practical estimate of global error is based on repeated calculation due bisectioning the
step and the Richardson extrapolation mentioned in numerical quadrature,
e(y(x, h/2)) y(x) y(x, h/2)
1 2
) ; ( ) 2 / ; (

n
h x y h x y
.
Here n is the order of the method, i.e. in our case n = 1, 2, 3. For example, at x = 5

h = 0.2 h = 0.1 E
T1 0.22004 0.16795 0.05209
T2 0.07821 0.13884 0.02021
T3 0.12817 0.15429 0.00373

This principle can be used generally also in other methods and if the order of
method is unknown, the worst estimate should be used, i.e.
e(y(x, h/2)) ) ; ( ) 2 / ; ( h x y h x y .

4.1 Runge-Kutta Formulas
As shown above the computation for n = 3 is already quite complicated and relatively
much of preparation work is needed to get the second derivative
2
x d
d
2
f(x, y(x))
i
x x=
.
Therefore other methods were looked for to attain effectiveness with relatively
simple formulas. Runge-Kutta formulas belong to them.
The solution of the problem
x
y
d
d
= f(x, y(x)), y(x
0
) = y
0

is sought as a sequence {y
i
} given by the following general recurrence
y
i+1
= y
i
+ h
i
(
1
k
1
+ +
r
k
r
), i = 0, 1, , n, 1< rN (RK)
where
k
1
= f(x
i
, y
i
),
k
j
= f(x
i
+
j
h
i
, y
i
+
j
h
i
k
j1
) , j = 2, , r .
The coefficients
1
, ,
r
,
2
, ,
r
,
2
, ,
r
are chosen so that y
i+1
obtained by
(RK) is equal to y
i+1
obtained by Taylor approximation. This generally leads to a
complicated system of algebraic equations whose number is less than the number of
unknowns and, therefore, there remains a freedom in choosing a proper solution.
To illustrate this let us consider the simplest case r = 2, i.e.
k
1
= f(x
i
, y
i
), k
2
= f(x
i
+ h
i
, y
i
+ h
i
k
1
) ,
where the indices in , are omitted. We have

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

61
k
2
= f(x
i
+ h
i
, y
i
+ h
i
k
1
) f(x
i
, y
i
) + f/x (x
i
, y
i
) h
i
+ f/y (x
i
, y
i
) h
i
k
1
),
thus,
y
i+1
= y
i
+ h
i
(
1
k
1
+
2
k
2
)
= y
i
+ h
i
[
1
f(x
i
, y
i
) +
2
(f(x
i
, y
i
) + f/x (x
i
, y
i
) h
i
+ f/y (x
i
, y
i
) h
i
k
1
)]
= y
i
+ [(
1
+
2
) f(x
i
, y
i
)] h
i
+ [
2
f/x (x
i
, y
i
) + f/y (x
i
, y
i
) f(x
i
, y
i
))}] h
i
2
.
The corresponding part of Taylor expansion is
y
i+1
= y
i
+ h
i
[f(x
i
, y
i
) h
i
+ [
x

f(x, y(x)) +
y

f(x, y(x)) y(x)]


i
x x=

! 2
2
i
h
].
Hence,

1
+
2
= 1,

2
= 1/2,
= 1/2 ,
which represents 3 equations for 4 unknowns
1
,
2
, , .
When is left free, the two first equations yield

2
=1/(2),
1
= 1 1/(2)
and
y
i+1
= y
i
+ h
i
[(1
2
1
) k
1
+
2
1

k
2
]
= y
i
+ h
i
[(1
2
1
) f(x
i
, y
i
) +
2
1

f(x
i
+ h
i
, y
i
+ h
i
f(x
i
, y
i
))]
= y
i
+ h
i
[(1
2
1
) f(x
i
, y
i
) +
2
1

f(x
i
+ h
i
, y
i
+ h
i
f(x
i
, y
i
)/2)] .
So, for example, we get for
=1 : y
i+1
= y
i
+ [f(x
i
, y
i
) +

f(x
i
+ h
i
, y
i
+ h
i
f(x
i
, y
i
)/2)] h
i
/2, (Heune)
=
2
1
: y
i+1
= y
i
+ f(x
i
+ h
i
/2, y
i
+

f(x
i
, y
i
) h
i
/2) h
i
. (Euler, modified)

The most favored method of this clas is the Runge-Kutta formula for r = 4:
k
1
= f(x
i
, y
i
),
k
2
= f(x
i
+ h
i
/2, y
i
+

(

h
i
/2) k
1
) ,
k
3
= f(x
i
+ h
i
/2, y
i
+

(

h
i
/2) k
2
) , (RK4)
k
4
= f(x
i+1
, y
i
+

h
i
k
3
) ,
y
i+1
= y
i
+ h
i
( k
1
+ 2k
2
+ 2 k
3
+ k
4
)/6 .
This formula is of the 4
th
order (like the Simpson formula in numerical integration to
which it is reduced if f does not depend on y, i.e. f(x, y) = f(x) ).

Example 2. Let us go back to the initial value problem
y = sin (x
2
+ y
2
), y(0) = 0
on interval [0, 5].

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

62
With the constant step h = 0.2 one obtains (EXCEL)

i xi RK4 k1 k2 k3 k4
0 0 0
1 0.2 0.002666 0 0.01 0.010001 0.03999333
2 0.4 0.021320 0.039996 0.089923 0.090014 0.15973995

The 0
th
row represents the initial condition. In the 1
st
row (i =1): x
1
= x
0
+h,
k
1
= sin (x0x0 + y0y0), k
2
= sin((x0+h/2)(x0+h/2) + (y0+h/2k1)(y0+h/2k1)),
etc. Now the first row just needs to be copied into the following rows, i = 2, 3,
In the next table there are also values of y
i
obtained in separate tables for the half
step h = 0.1 and the step h = 0.01 as well as differences
(h
1
, h
2
) = y(RK4, x
i
, h=h
1
) y(RK4, x
i
, h=h
2
) .

x
i
RK4(h=0.2) RK4(h=0.1) RK4(h=0.01) (0.1, 0.2) (0.01, 0.1)
0 0.000000 0.000000 0.000000 0.000000 0.000000
0.2 0.002666 0.002667 0.002667 0.000000 0.000000
0.4 0.021320 0.021320 0.021320 0.000000 0.000000
0.6 0.071761 0.071761 0.071760 0.000001 0.000000
0.8 0.168590 0.168589 0.168589 0.000001 0.000000
1 0.320612 0.320619 0.320619 0.000007 0.000000
1.2 0.514339 0.514368 0.514369 0.000029 0.000002
1.4 0.686654 0.686703 0.686707 0.000050 0.000003
1.6 0.754313 0.754425 0.754433 0.000112 0.000008
1.8 0.697428 0.697529 0.697536 0.000101 0.000007
2 0.546857 0.546930 0.546935 0.000072 0.000005
2.2 0.351340 0.351427 0.351433 0.000087 0.000007
2.4 0.197073 0.197221 0.197231 0.000148 0.000010
2.6 0.197647 0.197767 0.197775 0.000119 0.000008
2.8 0.364620 0.364769 0.364777 0.000149 0.000008
3 0.499880 0.499947 0.499954 0.000067 0.000008
3.2 0.423020 0.423179 0.423192 0.000159 0.000013
3.4 0.234504 0.234828 0.234854 0.000325 0.000026
3.6 0.183847 0.184239 0.184269 0.000392 0.000030
3.8 0.351454 0.351785 0.351804 0.000331 0.000020
4 0.409604 0.409835 0.409855 0.000231 0.000020
4.2 0.239132 0.239711 0.239757 0.000579 0.000046
4.4 0.182983 0.183768 0.183825 0.000785 0.000057
4.6 0.353942 0.354359 0.354386 0.000417 0.000027
4.8 0.313511 0.314253 0.314300 0.000742 0.000047
5 0.162218 0.163826 0.163926 0.001608 0.000099

The greatest error arises at the end of the considered interval, i.e. at x = 5. Taking
y(RK4, 5, h=0.01) for precise solution, we indeed can see that
E(y(x, 0.1)) y(x) y(x, 0.1) 0.000099
1 2
) 2 . 0 ; ( ) 1 . 0 ; (
4

x y x y

15
0016 . 0
0.000107.

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

63
Fig. 4.3 shows that both the approximations y(RK4, x
i
, h=h
1
) and y(RK4, x
i
, h=h
2
)
are practically indistinguishable within the chosen scale. The differences between the
numerical solutions are visualized in Fig. 4.4.
Obviously, as x increases the argument of the derivative, x
2
+ y
2
, increases also. But
the sine function holds the derivative in the interval [1, 1]. Therefore, the amplitude of
oscillations of y must drop to zero if x tends to . This behavior is indicated in Fig. 4.5.

0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5
x
y
RK4(h=0.2)
RK4(h=0.1)

Fig. 4.3 Runge-Kutta approximations RK4 for the initial problem y = sin (x
2
+y
2
), y(0)=0
on interval [0, 5]. Steps h = 0.2 and h = 0.1.


0.0000
0.0002
0.0004
0.0006
0.0008
0.0010
0.0012
0.0014
0.0016
0.0018
0 1 2 3 4 5
x

y(0.1)-y(0.2)
y(0.01)-y(0.1)

Fig. 4.4 Differences in Runge-Kutta approximations RK4 for the initial problem
y = sin (x
2
+y
2
), y(0) = 0
on interval [0, 5].

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

64
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 5 10 15 20 25
x
y

Fig. 4.5 The approximation RK4 for the initial problem y = sin (x
2
+y
2
), y(0)=0
on interval [0, 25]. Step h = 0.01.

A more detailed information on other methods, errors, step choice and further topics
connected with numerical solution of ordinary differential equations can be found
elsewhere, e.g. in [5-7].

Fehlbergs methods
belong to formulas of the Runge-Kutta type as well. One of them is just quoted here
from [6]. The recurrence of the 5
th
order for the step h is given as follows
y
i+1
= y
i
+ h

(
135
16
k
1
+
825 12
656 6
k
3
+
430 56
561 28
k
4

50
9
k
5
+
55
2
k
6
) ,
where
k
1
= f(x
i
, y
i
),
k
2
= f(x
i
+
4
1
h, y
i
+
4
1
hk
1
),
k
3
= f(x
i
+
8
3
h, y
i
+
32
1
h(3k
1
+ 9k
2
)),
k
4
= f(x
i
+
13
12
h, y
i
+
197 2
1
h(1 932k
1
7 200k
2
+ 7 296 k
3
)),
k
5
= f(x
i
+ h, y
i
+ h(
216
439
k
1
8k
2
+
513
680 3
k
3

104 4
845
k
4
)),
k
6
= f(x
i
+
2
1
h, y
i
+ h(
27
8
k
1
+ 2k
2

565 2
544 3
k
3
+
104 4
859 1
k
4

40
11
k
5
)).
The six values of k
i
enable to create another combination
y*
i+1
= y*
i
+ h

(
216
25
k
1
+
565 2
408 1
k
3
+
104 4
197 2
k
4

5
1
k
5
) ,

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

65
which is of the 4
th
order. The value y
i+1
y*
i+1
presents the local error estimate and
can be used for immediate adapting the step h in accordance to the required tolerance.

4.2 Systems of Ordinary Differential Equations
As we know, ordinary differential equation of the order s>1
y
(s)
(x) = f(x, y(x), , y
(s1)
(x))
with initial conditions y(x
0
)=y
0
, y(x
0
) = y
0
, , y
(s1)
(x
0
) = y
0
(s1)

can be transferred to the system of s differential equations of the first order [9]. In
vector form
x d
d
y(x) = f(x, y
T
(x)), y(x
0
) = y
0
,
where
y(x) =
|
|
|

\
|
) (
) (
) (
2
1
x y
x y
x y
s
M
=
|
|
|
|

\
|

) (
) (
) (
) 1 (
x y
x y
x y
s
M
, f(x, y
T
(x)) =
|
|
|
|

\
|
)) ( ..., ), ( ), ( , (
)) ( ..., ), ( ), ( , (
)) ( ..., ), ( ), ( , (
2 1
2 1 2
2 1 1
x y x y x y x f
x y x y x y x f
x y x y x y x f
s s
s
s
M
.
Now the above formulas need to be applied in the vector form too. Putting y
k,i
= y
k
(x
i
),
then, for example, RK4 is then generalized as follows:
k
1
= f(x
i
, y
i
T
) =
|
|
|

\
|
) ..., , , , (
: : : : :
) ..., , , , (
, 2 , 1 ,
, 2 , 1 , 1
s i i i i s
s i i i i
y y y x f
y y y x f
=
|
|
|

\
|
s
k
k
, 1
1 , 1
: ,
k
2
= f(x
i
+ h
i
/2, (y
i
+

(

h
i
/2) k
1
)
T
)
=
|
|
|
|
|

\
|
+ + + +
+ + + +
)
2
..., ,
2
,
2
,
2
(
: : : : : : : : : : : : : :
)
2
..., ,
2
,
2
,
2
(
, 1 , 2 , 1 2 , 1 , 1 1 ,
, 1 , 2 , 1 2 , 1 , 1 1 , 1
s
i
s i
i
i
i
i
i
i s
s
i
s i
i
i
i
i
i
i
k
h
y k
h
y k
h
y
h
x f
k
h
y k
h
y k
h
y
h
x f
=
|
|
|

\
|
s
k
k
, 2
1 , 2
: ,
k
3
= f(x
i
+ h
i
/2, (y
i
+

(

h
i
/2) k
2
)
T
)
=
|
|
|
|
|

\
|
+ + + +
+ + + +
)
2
..., ,
2
,
2
,
2
(
: : : : : : : : : : : : : :
)
2
..., ,
2
,
2
,
2
(
, 2 , 2 , 2 2 , 1 , 2 1 ,
, 2 , 2 , 2 2 , 1 , 2 1 , 1
s
i
s i
i
i
i
i
i
i s
s
i
s i
i
i
i
i
i
i
k
h
y k
h
y k
h
y
h
x f
k
h
y k
h
y k
h
y
h
x f
=
|
|
|

\
|
s
k
k
, 3
1 , 3
: ,
k
4
= f(x
i+1
, (y
i
+

h
i
k
3
)
T
)
=
|
|
|

\
|
+ + +
+ + +
+
+
) ..., , , , (
: : : : : : : : :
) ..., , , , (
, 3 , 2 , 3 2 , 1 , 3 1 , 1
, 3 , 2 , 3 2 , 1 , 3 1 , 1 1
s i s i i i i i i s
s i s i i i i i i
k h y k h y k h y x f
k h y k h y k h y x f
=
|
|
|

\
|
s
k
k
, 4
1 , 4
: ,

y
i+1
= y
i
+
6
i
h
( k
1
+ 2 k
2
+ 2 k
3
+ k
4
) .


4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

66
Example 3. Let us solve the equation
y 2( yy + x) cos (2(x
2
+y
2
)) = 0, y(0) = y(0) =0
using the RK4 formula with the constant step h = 0.01 on interval [0, 2].
The corresponding system of the firstorder equations is
|

\
|
2
1
y
y
x d
d
=
|
|

\
|
+ + ) cos( ) ( 2
2
1
2
2 1
2
y x y y x
y
,
|

\
|
) 0 (
) 0 (
2
1
y
y
=
|

\
|
0
0

where y
1
= y, y
2
= y.
The first two steps are shown in the following table (EXCEL).

xi y1 = RK4 k1 k2 k3 k4
0.00 0
0
0.01 0.000000 0.000 0.000 0.000 0.000
0.000100 0.000 0.010 0.010 0.020
0.02 0.000001 0.000100 0.000100 0.000101 0.000101
0.000400 0.020000 0.030000 0.030000 0.040000

The solved equation has arisen by differentiating the equation y = sin (x
2
+ y
2
)
with respect to x as can be easily verified. Thus, the solution y
1
(x) can be compared with
the previous solution y(x); this is done in Fig. 4.6. Fig. 4.7 shows the difference between
solutions of equations y = sin (x
2
+ y
2
) and y 2( yy + x) cos (2(x
2
+y
2
)) = 0.

0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 0.5 1 1.5 2
x
y
y1
y

Fig. 4.6 Comparison of solutions of the initial value problems y = sin (x
2
+y
2
), y(0)=0 and
y
1
= 2( y
1
y
1
+ x) cos (2(x
2
+y
1
2
)), y
1
(0)= y
1
(0)=0 on interval [0, 2]. RK4, step h = 0.01.

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

67
-0.01
0
0.01
0 0.5 1 1.5 2
x
y - y
1

Fig. 4.7 The difference between solutions of the initial value problems y= sin (x
2
+y
2
), y(0)=0 and
y
1
= 2( y
1
y
1
+ x) cos (2(x
2
+y
1
2
)), y
1
(0)= y
1
(0)=0 on interval [0, 20]. RK4, step h = 0.01.

A good imagination concerning the preciseness of approximate solutions can be
obtained from Fig. 4.8 in simple linear equation y+ 4y = 0, y(0)=0, y(0)=2. Even
such a simple case illustrates the warning that a small step itself may give no warranty
of sufficient quality of numerical results.

-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 x
y
h=0.02
h=0.01
h=0.005
sin(2x)

Fig. 4.8 Comparison of approximate solutions of the initial problem y+ 4y = 0, y(0)=0, y(0)=2
obtained by RK4 for variable steps h with the exact solution.


4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

68
4.3 Adams Methods
The methods mentioned so far are called one-step methods. They namely enable the
transition from the point (x
i
, y
i
) to the point (x
i+1
, y
i+1
) while all the information
concerning the foregoing steps with indices i1, i2, is neglected.
A very natural idea is to approximate the right side of the differential equation
y(x) = f(x, y(x)) by the interpolation polynomial p
i,k
(x) based on k values f
j
= f (x
j
, y
j
),
j = ik, , i at equidistant points x
i
(k1) h, , x
i
h, x
i
, where kN. The function
f(x, y(x)) is then substituted by that interpolation polynomial of the degree k1 and the
value y
i+1
at the point x
i+1
is obtained by integration
y
i+1
= y
i
+

+1 i
i
x
x
p
i,k
(x) dx .
For k = 1 the p
i,k
(x) turns to the constant f (x
i
, y
i
) and the Euler method is obtained.
For k = 2 we have linear interpolation polynomial p
i,2
(x) = f
i
+
h
f f
i i 1

(x x
i
) and
y
i+1
= y
i
+

+1 i
i
x
x
p
i,2
(x) dx = y
i
+ f
i
h +
h
f f
i i 1

+h x
x
i
i
(x x
i
) dx = y
i
+
2
3
1

i i
f f
h .
This process can continue for k = 3, 4, The corresponding formulas represent k-step
Adams-Bashford methods. As an example we show the derivation of the very frequently
used 4-step Adams-Bashford method
y
i+1
= y
i
+
24
h
(55 f
i
59 f
i1
+ 37 f
i2
9 f
i3
) . (A-B)
by the Lagrange interpolation polynomial for f(x, y),
p
i,4
(x) =
) )( )( (
) )( )( (
3 2 1
3 2 1




i i i i i i
i i i
x x x x x x
x x x x x x
f
i
+
) )( )( (
) )( )( (
3 1 2 1 1
3 2




i i i i i i
i i i
x x x x x x
x x x x x x
f
i1

+
) )( )( (
) )( )( (
3 2 1 2 2
3 1




i i i i i i
i i i
x x x x x x
x x x x x x
f
i2
+
) )( )( (
) )( )( (
2 3 1 3 3
2 1




i i i i i i
i i i
x x x x x x
x x x x x x
f
i3
.
Because x
i2
= x
i3
+ h, x
i1
= x
i3
+ 2h, x
i
= x
i3
+ 3h , x
i+1
= x
i3
+ 4h we obtain
y
i+1
= y
i
+

+1 i
i
x
x
p
i,4
(x) dx = y
i
+

+
+

h x
h x
i
i
4
3
3
3
p
i,4
(x) dx .
The substitution t =
h
x x
i 3

( x = x
i3
+ th) leads to
y
i+1
= y
i
+ h

4
3
p
i,4
(x(t)) dt = y
i
+ h

4
3
(

+



...
3 2
) )( )( (
3 2 1
i
i i i
f
h h h
x x x x x x
dt
= y
i
+ h

4
3
(


3 2 1
! 3
) 1 )( 2 )( 3 (
! 2
) 2 )( 3 (
! 2
) 1 )( 3 (
! 3
) 1 )( 2 (
i i i i
f
t t t
f
t t t
f
t t t
f
t t t

= y
i
+ h (I
0
+ I
1
+ I
2
+ I
3
).
The four summands in parentheses are as follows:

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

69
I
0
=

+
4
3
2 3
) 2 3 (
6
t t t
f
i
=
4
3
2 3
4
)
4
(
6
t t
t f
i
+ =
6
i
f
4
55
,
I
1
=

4
3
2 3 1
) 3 4 (
2
t t t
f
i
=
2
1 i
f
12
18 16 3
4
3
2 3 4
t t t +
=
2
1 i
f
4
59
,
I
2
=

4
3
2 3 2
) 6 5 (
2
t t t
f
i
=
2
2 i
f
12
18 16 3
4
3
2 3 4
t t t +
=
2
2 i
f
12
37
,
I
3
=

+

4
3
2 3 3
) 6 11 6 (
6
t t t
f
i
=
6
3 i
f
4
24 22 8
4
3
2 3 4
t t t t +
=
6
3 i
f
4
9
.

The formula (A-B) is predictive one: the value y
i+1
at x
i+1
is determined by means of
values at the four foregoing abscissas x
i
, x
i1
, x
i2
, x
i3
. Consequently, it still represents a
kind of extrapolation. But, generally, extrapolation is less precise than interpolation.

Including y
i+1
in the interpolation formula defines y
i+1
implicitly. The
corresponding formulas (obtained usually by backward Newton interpolation
polynomial) are called Adams-Moulton formulas. The Lagrange interpolation
polynomial corresponding to four nodes (x
i+1
, f
i+1
), (x
i
, f
i
), (x
i1
, f
i1
), (x
i2
, f
i2
) is
c
i,4
(x) =
) )( )( (
) )( )( (
2 1 1 1 1
2 1
+ + +



i i i i i i
i i i
x x x x x x
x x x x x x
f
i+1
+
) )( )( (
) )( )( (
2 1 1 1
2 1 1
+ +
+


i i i i i i
i i i
x x x x x x
x x x x x x
f
i

+
) )( )( (
) )( )( (
2 1 1 1 1
2 1
+
+


i i i i i i
i i i
x x x x x x
x x x x x x
f
i1
+
) )( )( (
) )( )( (
1 2 2 1 2
1 1
+
+


i i i i i i
i i i
x x x x x x
x x x x x x
f
i2

Its integration over the interval [x
i
, x
i+1
] yields the following 4
th
order formula
y
i+1
= y
i
+
24
h
(9 f
i+1
+ 19 f
i
5 f
i1
+ f
i2
) . (A-M)
For example, the second summand in c
i,4
(x) ) gives (substitution t =
h
x x
i 2

, x = x
i2
+ th)

+1 i
i
x
x
) )( )( (
) )( )( (
2 1 1
2 1 1
+
+


i i i i i i
i i i
x x x x x x
x x x x x x
f
i
dx
= f
i

+
+

h x
h x
i
i
3
2
2
2
) 2 ( ) (
) )( )( 3 (
2 2 2
h h h
x x h x x h x x
i i i



dx = f
i

2
h

3
2
(t3)(t1)t dt = f
i
h
24
19
.

Both (A-B) and (A-M) formulas can be used autonomously. But more usually they
are combined in the predictor-corrector methods whose algorithms consist of the
following steps [5-7]:
1. P (prediction): y
P
i+1
= y
i
+
24
h
(55 f
i
59 f
i1
+ 37 f
i2
9 f
i3
) (A-B)
2. E (evaluation of f
i+1
): Computation of f
P
i+1
= f(x
i+1
, y
P
i+1
)

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

70
3. C (correction): y
C
i+1
= y
i
+
24
h
(9 f
P
i+1
+ 19 f
i
5 f
i1
+ f
i2
) (A-M)
4. E (evaluation of f
i+1
): Computation of f
i+1
= f(x
i+1
, y
C
i+1
)
Some of the steps may be repeated several times. This way algorithms denoted by
various combinations of letters P, E, C can be generated. The first step P, however,
assumes the knowledge of f
0
, f
1
, f
2
, f
3
, i.e. also of the first values y
1
, y
2
, y
3
. Those must
be computed by another method (Runge-Kutta method RK4 as a rule).
The expression y
C
i+1
y
P
i+1
is an upper estimate of the local error of corrector.
Closer estimates can be found e.g. in [6-7].

Example 4. The algorithm PECE (i.e. consisting of steps P, E, C, E) will be
demonstrated by solving the initial value problem
y = sin (x
2
+ y
2
), y(0) = 0.
Let us choose h = 0.01. The first y
i
were obtained by RK4 method. All the computation
can easily be performed in EXCEL.

x
i
f(x
i
,y
i
) y
i
(predict.) y
i
(correct.)
0.00 0.0000000000 0.0000000000
0.01 0.0001000000 0.0000003333
0.02 0.0004000000 0.0000026667
0.03 0.0009000000 0.0000090000
0.04 0.0015999998 0.0000213333 0.0000213333
0.05 0.0024999991 0.0000416667 0.0000416667
0.06 0.0035999974 0.0000720000 0.0000720000

5.01 0.0059511841 0.1633709537 0.1633707731
5.02 0.0943528736 0.1638130325 0.1638128862

0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 5 10 15 20 25 30 35 40 45 50
x
y

Fig. 4.8 The solution of the initial problems y= sin (x
2
+y
2
), y(0)=0 by PECE method.

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

71
4.4 Boundary Problems
A solution of a differential equation of the nth degree or system of n differential
equations of the first degree can be determined by n initial conditions at one point or by
some n conditions at more points.
For example, the function y(t) = sin t obviously satisfies the equation y + y = 0
(n=2) whose general solution is y(t) = c
1
sin t + c
2
cos t. The constants c
1
= 1, c
2
= 0
can be given by initial conditions y(0) = 0, y(0) = 1 or y() = 0, y() = 1 etc. but
also by y(0) = 0, y() = 0 or y(0) = 1, y() = 1 etc.
In the latter case the values of y or its derivatives are given at boundary points of the
interval [0, 1]. In such cases, when some values of y, y, are given in two points, we
speak about boundary problems.
In the case of boundary problem
y + y = 0, y(0) = 0, y() = 0
its solution is found very easily due to the evident correspondence between the feasible
boundary values y(0), y() and the function y.
As we know the general solution of a linear equation of the nth order
y
(n)
+ a
n1
y
(n1)
+ + a
1
y + a
0
y = b
can be written as linear combination
y(x) = c
1
e
1
(x) + + c
n
e
n
(x),
where c
1
, , c
n
are constants and e
k
(x), k = 1, , n, are linearly independent solutions
of the equation (see e.g. [9]). Hence, in linear differential equations the boundary
problems as well as initial problems lead to systems of linear algebraic equations.
The right arena of numerical methods, however, is in the field of nonlinear analysis.
Explained methods of getting approximate solutions of initial value problems in
differential equations and solutions of nonlinear equations (Chapter 3) enable to solve
boundary problems by means of initial value problems. The boundary problem of the
2
nd
order
y = f(x, y, y) , y(x
0
) = y
0
, y(x
1
) = y
1

can be solved by introducing y(0) = as a new unknown and solving the corresponding
initial problem. This way one gets a new function
() = y(x
1
, ) y
1
,
whose root is to be found.

Example 5. Let us solve the boundary problem
y 2( yy + x) cos (2(x
2
+y
2
)) = 0, y(0) = 0, y(3) =1
using the RK4 formula with the constant step h = 0.02 on interval [0, 3]. The initial
problem for y(0) = y(0) = 0 was solved in Example 3. Choosing initial condition
y(0)= gives values of the function () = y(3, ) 1. Its root may be found, for

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

72
example, by the regula falsi method. This procedure can be carried out in EXCEL
easily. Results are arranged in the following table and shown in Fig. 4.9


0 0.5 0.24 0.214 0.2143
()
0.46648 0.516663 0.05127 0.0006 2.16E06

0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
0 0.5 1 1.5 2 2.5 3
x
y
y(0)=0
y(0)=0.5
y(0)=0.24
y(0)=0.2143
Boundary val.

Fig. 4.10 Solution of the boundary problem y 2( yy + x) cos (2(x
2
+y
2
)) = 0, y(0) = 0, y(3) =1
using solutions of initial problems (R-K method) and regula falsi.

Preciseness of solution found this way depends on preciseness of used methods, of
course, first of all on tolerance limits in solution of the initial problems.
In differential equations of the 2
nd
order this method recalls shooting (giving both
positions of gun and target and the barrel elevation) and, therefore, it is also called
shooting method [6,7]. In equations of higher orders or systems of equations the
transition to initial problems remains preserved but leads now to systems of nonlinear
equations (Chapter 3). Though the shooting method cannot be taken as the most
economic it is relatively transparent. This is advantageous in any uncertain terrain. The
linearity or nonlinearity of the differential equation play a second-rate role.
There are many other methods traditionally explained in courses of numerical
mathematics [5-7]. The simplest idea is to substitute derivatives by their finite
difference approximations and this way to transform the boundary problem to a system
of equations of type f(p) = o dealt in Chapter 3 with. But this can be made only with
relatively rough differences to attain solvability of the system f(p) = o. Therefore, such
a method offers rather first contacts and reconnaissance of newly arisen situations.
A simple method in linear problems is offered also by Taylor expansion, whose
coefficients may be determined by additive conditions, e.g. boundary ones.

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

73
Example 6. The solution of the problem
y + 4y = x, y(0) = 0, y(1) = 1
is sought in the form of polynomial
p(x) = a
1
x + a
2
x
2
+ a
3
x
3
+ a
4
x
4
+ a
5
x
5
.
The first condition, y(0) = 0, is fulfilled automatically and the second one gives one
equation for five coefficients, i.e. four other equations are needed. These can be
obtained by choosing four inside points of the open interval (0, 1), at which the equation
y + 4y = x is satisfied, e.g. x
k
= k/5, k = 1, , 4. The fifth equation is the second
boundary condition. Thus,
Aa =
|
|
|
|

\
|
1 1 1 1 1
55072 . 11 3184 . 9 848 . 6 56 . 4 2 . 3
63104 . 4 8384 . 4 464 . 4 44 . 3 4 . 2
32096 . 1 0224 . 2 656 . 2 64 . 2 6 . 1
16128 . 0 4864 . 0 232 . 1 16 . 2 8 . 0
.
|
|
|
|
|

\
|
5
4
3
2
1
a
a
a
a
a
=
|
|
|
|

\
|
1
8 . 0
6 . 0
4 . 0
2 . 0
= b
In EXCEL we easily find the inverse A
1
and a = A
1
b =
|
|
|
|
|

\
|

11648857 . 0
15605911 . 0
19743135 . 1
02987798 . 0
89500568 . 1
. A comparison
with the exact solution y(x) =
4
x
+
4
3
2 sin
) 2 ( sin x
is shown in Fig. 4.11. It may be verified
that the maximal error does not exceed 0.0004, maxy(x) p(x)< 0.0004.

0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8 1
x
y
p(x)
y(x)

Fig. 4.11 Approximate and exact solutions of the boundary problem y + 4y = x, y(0) = 0, y(1) = 1.

4 ORDINARY DIFFERENTIAL EQUATIONS


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

74
Of course, the polynomial choice is not the best one. In 1934 Kantorovich suggested
the so called collocation method. The solution of the boundary problem is sought as a
linear combination p(t) of linearly independent functions that fulfill the boundary
conditions. The function p is to satisfy the differential equation at a chosen set of points.

Example 7. The solution of the problem
y + 4y = x, y(0) = 0, y(1) = 1
is sought in the form of trigonometric polynomial
p(x) = x + a
1
sin(x) + a
2
sin(2x) + a
3
sin(3x).
The linear combination c(x) = a
1
sin(x) + a
2
sin(2x) + a
3
sin(3x) is annulated at
x = 0, 1. Substituting y by p one gets
(
2
4) sin(x) a
1
+ 4(
2
1) sin(2x) a
2
+ (9
2
4) sin(3x) a
3
= 3x
The collocation set, e.g. {
1
/
4
,
1
/
2
,
3
/
4
}, gives system of equations for unknowns a
1
, a
2
, a
3

|
|

\
|

98135 . 59 47842 . 35 150437 . 4


82644 . 84 0 869604 . 5
98135 . 59 47842 . 35 150437 . 4
.
|
|
|

\
|
3
2
1
a
a
a
=
|
|

\
|
25 . 2
5 . 1
75 . 0
,
whose solution is a = (0.308480785, 0.021139612, 0.003662304)
T
. The following
table offers a comparison of p with the exact solution y.

x p y yp
0.1 0.1858631 0.1888650 0.0030019
0.2 0.3646985 0.3711972 0.0064986
0.3 0.5305929 0.5407242 0.0101313
0.4 0.6788045 0.6916844 0.0128799
0.5 0.8048185 0.8190559 0.0142374
0.6 0.9036556 0.9187576 0.0151020
0.7 0.9708029 0.9878114 0.0170085
0.8 1.0049085 1.0244609 0.0195525
0.9 1.0107142 1.0282418 0.0175276

Boundary problems are closely connected to calculus of variations [9] and this link
resulted in many new methods. Their principle consists in constructing a functional
whose minimizing leads to the given equation and, at the same time, it enables to
approximate the solution as a finite linear combination of functions from a convenient
functional space
y
n
(x) = c
1

1
(x) + + c
n

n
(x).
(Ritz method, Galerkin method). A systematic construction of a basis of functions with
small supporters is performed in finite element methods [6] (B-splines). Those areas,
however, need new notions, definitions and a more profound study far beyond the scope
of this elementary frame.



5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

75

5 LINEAR SPACES AND
FOURIER SERIES

In pure mathematics as well as in numerical applications some common features appear
that can be accumulated into more general objects. A very significant role belongs to the
notion of linear space created by the transfer of the structure of common vector space
(known e.g. from linear algebra) into an abstract set.

5.1 Linear Space
A set X furnished with two binary operations (i) addition + and (ii) multiplication by
the scalar from a convenient number field T (R
1
or C
1
as usual), i.e. +: XX X and
.: TX X, is called linear space, if
I. (X, +) is commutative (Abelian) additive group, i.e. for x, y, z X
there exists a zero (neutral) element o that x+o = x for every xX,
(x + y) + z = x + (y + z) (associativity),
x + y = y + x (commutative law),
the equation x + y = o has a unique solution y = x
1
= x (the inverse element
x
1
= x),
II. (X, .) is characterized by following properties
.(.x) = (.).x for , T, xX,
.(x + y) = .x + .y (1
st
distributive law),
( + ).x = .x + .x (2
nd
distributive law),
1.x = x.
The dot for denoting multiplication by scalar is omitted as a rule.

The sum {
i
x
i
: i = 1, ..., n}, where x
1
, ..., x
n
X,
1
, ...,
n
T, is called linear
combination of elements x
1
, ..., x
n
X. The set
env {x
1
, ..., x
n
} = {
i
x
i
:
1
, ...,
n
T & i = 1, ..., n}
consisting of all linear combinations of x
1
, ..., x
n
X is the linear envelope of x
1
, ..., x
n
.
Elements x
1
, ..., x
n
are linearly dependent, if there is at least one non-zero ntuple of
scalars
1
, ...,
n
,
i i
n
=

1
>0, that {
i
x
i
: i = 1, ..., n} = o. Elements x
1
, ..., x
n
are
linearly independent, if {
i
x
i
: i = 1, ..., n} = o
i = 1, ..., n
(
i
= 0). Any maximal
linearly independent set of elements of X is called basis of X. The space X is linear
envelope or hull of its basis. The cardinal number (number of elements) of a basis of the
space X is called dimension of X.
Example 1. The sets of all planar or spatial vectors V
2
, V
3
are equivalent to the
elements (points) of Euclidean spaces R
2
or R
3
, respectively. Those objects are then

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

76
freely exchangeable and vectors will be expressed in matrix forms. Then, e.g., the linear
combination of planar vectors x, y may be written as
z = ax + by = a
|

\
|
2
1
x
x
+ b
|

\
|
2
1
y
y
=
|

\
|
+
+
2 2
1 1
by ax
by ax
, a, b R
1
.
Any vector x from R
2
can be written as
x = x
1 |

\
|
0
1
+ x
2

|

\
|
1
0
.
Similarly y = y
1 |

\
|
0
1
+ y
2

|

\
|
1
0
,
then
z = ax
1 |

\
|
0
1
+ ax
2

|

\
|
1
0
+ by
1 |

\
|
0
1
+ by
2

|

\
|
1
0
= (ax
1
+ by
1
) |

\
|
0
1
+ (ax
2
+ by
2
)
|

\
|
1
0
.
Therefore, a fundamental role of the vectors e
1
= |

\
|
0
1
, e
2
=
|

\
|
1
0
is evident, they
represent a basis of R
2
. Their linear independency can be verified very easily. Let us
assume conversely there are a, bR
1
that a+b>0 and ae
1
+ be
2
= a |

\
|
0
1
+ b
|

\
|
1
0
=
|

\
|
b
a
= o =
|

\
|
0
0
. Then a+b = 0, which contradicts to a+b>0.
Let us note the matrix with columns e
1
, e
2
is unity matrix, I
2
= (e
1
, e
2
), det I
2
= 1 0.
Generally, annulling a linear combination is equivalent to homogeneous system of
linear equations
ax + by = o (x, y)
|

\
|
b
a
=
|

\
|
2 2
1 1
y x
y x
|

\
|
b
a
=
|

\
|
0
0
.
This system has the trivial solution, a = b = 0, if and only if its determinant
2 2
1 1
y x
y x
0.
Three and more vectors in the plane are linearly dependent. Thus, any maximal
linearly independent set of elements in R
2
has two elements that can be taken as a basis.
Due to their simplicity unit vectors e
1
, e
2
are preferred as a rule. Dimension of R
2
, equal
to the cardinal number of the basis, is then 2, of course.
These considerations can be simply transferred onto R
n
.

Example 2. The space of real nn matrices. Obviously,
|
|
|
|

\
|
nn n n
n
n
x x x
x x x
x x x
...
... ... ... ...
...
...
2 1
2 22 21
1 12 11
= x
11
|
|
|

\
|
0 ... 0 0
... ... ... ...
0 ... 0 0
0 ... 0 1
+ x
12
|
|
|

\
|
0 ... 0 0
... ... ... ...
0 ... 0 0
0 ... 1 0
+ ... + x
1n
|
|
|

\
|
0 ... 0 0
... ... ... ...
0 ... 0 0
1 ... 0 0

+ x
21
|
|
|

\
|
0 ... 0 0
... ... ... ...
0 ... 0 1
0 ... 0 0
+ x
22
|
|
|

\
|
0 ... 0 0
... ... ... ...
0 ... 1 0
0 ... 0 0
+ ... + x
2n
|
|
|

\
|
0 ... 0 0
... ... ... ...
1 ... 0 0
0 ... 0 0

+ ...
+ x
n1
|
|
|

\
|
0 ... 0 1
... ... ... ...
0 ... 0 0
0 ... 0 0
+ x
n2
|
|
|

\
|
0 ... 1 0
... ... ... ...
0 ... 0 0
0 ... 0 0
+ ... + x
nn
|
|
|

\
|
1 ... 0 0
... ... ... ...
0 ... 0 0
0 ... 0 0
.

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

77
Let I
jk
be the matrix, whose elements are 0 except the one on the jth row and in kth
column that is 1. It can be shown that the set I = {I
jk
: j, k = 1, ..., n} is a basis in the
space of all nn matrices, card I = n
2
.
Example 3. Set l of all sequences x = {x
i
: i = 1, 2, ...} of that kind that

=1 i
i
x is
convergent is a linear space. If a, b R
1
and x, y l we have namely
ax by
i i
i
+
=

1

ax
i
i=

1
+
by
i
i=

1
= a
x
i
i=

1
+ b
y
i
i=

1
< + .
Thus, ax + by l. If e
k
is a unit vector, whose components are e
kj
= 1 signkj,
jN, the countable set { e
k
: k=1, 2, } is a basis of l.

Example 4. Set B of powers of a variable z, B = {z
i
: i I, z C
1
}. If the set of indices
is I = {0, 1, 2, ..., n} and T = C
1
, then env B is the linear space of complex polynomials
of the nth degree. If I = {0} N, T = R
1
, zR
1
, the space of power series of a real
variable is obtained.

Sets C
0
([a, b]), C
n
([a, b]), L([a, b]) of functions on [a, b] continuous, n-times
continuously differentiable, integrable in the Lebesgue sense are obviously linear spaces
too. Those spaces, however, are too rich (their cardinals are uncountable) to take in
them a basis by simple means. But elements of the mentioned sets can be sorted into
classes whose elements are equivalent in some sense. In the sets C
n
([a, b]), C
0
([a, b])
various metrics can be introduced and all functions belonging to a small neighborhood
of some chosen function declared equivalent. We know that e.g. in the set L([a, b]) the
Dirichlet function (equal 0 at irrational x and equal 1 at rational x, [9]) is equivalent to
the function equal 0 identically and to any function different from zero only on a set of
zero measure, e.g. except arbitrary final or countable set. Nevertheless, the collection of
equivalence classes still remains uncountable. The notion of basis can be reformulated
as follows.
Basis of a linear space X is a collection B = {x
a
: aA, x
a
X}, if
1. elements of B are linearly independent (i.e. b x
a a a A

= o
(aA)
(b
a
= 0) ),
2. X env B (= linear envelope of B), i.e. every xX can be written as x = b x
a a a A

).
The proof of existence of uncountable bases B is founded on the axiom of choice
and belongs to set-theoretical comprising. Real numerical calculations are confined
to the set of rational numbers Q and its final Cartesian products, bases are reduced to
final bases and corresponding approximations.

If x, y X the set { x + (1) y : R
1
} is called the straight line through points
x, y. The set {x + (1)y : [0, 1]} is the segment between x and y (connecting line)

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

78
or also the convex combination of elements x and y. The term of the convex
combination can easily be extended onto more elements by induction:
If x
1
, ..., x
n
X and
1
, ...,
n
0 and
1
+ ... +
n
= 1, then
1
x
1
+ ... +
n
x
n
is
called convex combination of elements x
1
, ..., x
n
.
A set AX is convex, if
x, y A
([0, 1])
( x + (1) y A)
(i.e. with any two its points x, y A also their whole connecting line belongs to A).
The set of all convex combinations of elements x
1
, ..., x
n
X is called their convex
hull (envelope). Several simple examples are illustrated in Fig. 5.1.


a) b) c)
Fig. 5.1 Convex envelops a) of points (1, 0), (0, 1), b) of set {[0, 1]{0}}, {{0}[0, 1]},
c) of the vertices of regular 17-angle.

Open ellipsoids with the center a R
n
, especially equi-axial ellipsoids or balls with a
radius r>0, i.e. (x
1
a
1
)
2
+ ... + (x
n
a
n
)
2
< r
2
are another example of bounded convex sets
in R
n
furnished with the Euclidean metric [1,2,9]. Linear subspaces or cones with the
apex a R
n
are examples of unbounded convex sets in R
n

If there is such a topology in X that to any two points x, y X there exist open
disjoint neighborhoods U
x
, U
y
(xU
x
, yU
y
, U
x
U
y
=) and both the linear operations
(summation and multiplication by scalar) are continuous in this topology, we say X is a
topological linear space. If every neighborhood of any point xX contains a convex
open set, then X is a locally convex topological linear space. We say a system S
x
of
neighborhoods of xX is complete, if in any neighborhood U(x) there is S(x)S
x
, i.e.
xS(x)U(x). Therefore, in topological linear space the topology is given by one
complete system of neighborhoods of one of its points, e.g. of o. This system can be
extended onto the entire space. In metric linear space (X, d) the system S
o
can be
defined as the collection {U(o,
n
1
) : nN}, where U(o,
n
1
) = {x: xX, d(x, o)<
n
1
} is a
convex neighborhood of o. Thus, a topology of a very simple structure can be
introduced in (X, d).

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

79

5.2 Normed Linear Spaces
In some linear spaces a real functional called norm can be defined that is analogous to
the length of a vector in R
n
or to the absolute value of a number.
A norm on a linear space X is a mapping v: XR
1
with the following properties:
(1) v(x) = 0 x = o,
(2) v(x+y) v(x) + v(y) (triangle inequality),
(3) v(ax) = av(x) for any aT, xX.

Examples.
1. In R
n
norms can be defined as follows

e
(x) =
2 2
1
...
n
x x + (Euclidean norm),

m
(x) = max {
1
x , ...,
n
x } (maximum norm),

s
(x) =
1
x + ... +
n
x (summation norm),
(two last names we have chosen arbitrarily, they are not common).

2. In the linear space of nm matrices of complex numbers the norm can be defined as
the following functional

m
(A) = max {
=
m
j
j
a
1
1
, ...,
=
m
j
nj
a
1
}
(maximum of sums of absolute values of elements in a row).
If Z is an nm matrix, one gets
m
(Z) = max{
m
(z
k
): k = 1, , m }, where z
k
is the
kth column of the matrix Z. The norm
s
(Z) could be defined similarly as maximal
value of
s
(z
k
). The Euclidean norm is defined as follows

e
(A) =

= =
n
i
m
j
ij
a
1 1
2
=

= =
n
i
m
j
ij
a
1 1
2

(in real matrices the absolute value in summands can be omitted).
The purpose of introducing the matrix norm can be illustrated on infinite sum of
powers of an nn matrix A (A
0
is the nn unit matrix I
n
)
S = A
0
+ A + A
2
+ A
3
+ ...
The properties of norm imply

m
(S)
m
(A
0
) +
m
(A) +
m
(A
2
) +
m
(A
3
) + ...
Obviously,
m
(A
0
) =
m
(I
n
) = 1. Let
m
(A) = q. The norm of the product of two nn
matrices A, B can be evaluated as follows

m
(A.B) = max { a b
k kj k
n
j
n
1 1 1 = =

, ..., a b
nk kj k
n
j
n
= =

1 1
}
max { a b
k kj k
n
j
n
1 1 1 = =

, ..., a b
nk kj k
n
j
n
= =

1 1
}
max { a
ik k
n
=

1
.max { b
kj k
n
=

1
: j = 1,..., n} : i = 1, ..., n}

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

80
= max { a
ik k
n
=

1
.
m
(B) : i = 1, ..., n} =
m
(A)
m
(B).
Then

m
(A
2
) (
m
(A))
2
= q
2
,
m
(A
3
) (
m
(A
2
))
m
(A) q
2
q = q
3
, etc.
and

m
(S) 1 +
m
(A) + (
m
(A))
2
+ (
m
(A))
3
+ ... = 1 + q + q
2
+ q
3
+ ...
If
m
(A) = q < 1, the geometric series 1 + q + q
2
+ q
3
+ ... converges to
1
1 q
and the
geometric series of powers of A defines the matrix S = A
0
+ A + A
2
+ A
3
+ ... This leads
to defining power series and corresponding operator of an arbitrary square matrix X
f (X) = a
0
I
n
+ a
1
X + a
2
X
2
+ ...
For example sin X = I
n

X
3
3!
+
X
5
5!
...
A normed space is a metric space because its norm defines automatically the metric
d(x, y) = x y .
This may be immediately used to transfer the Banach fixed point theorem [9] to the
problem of solving the system of linear equations by simple iteration method:
A small Theorem. Let Ax = b be system of linear algebraic equations, where x, bR
n

and A is a regular matrix, I = I
n
=A
0
. If
m
(IA) q < 1, the iteration process
x
i+1
= (I A) x
i
+ b (i)
converges to the solution of Ax = b.
Proof. Adding x to both sides of the equation Ax = b yields Ax + x = b + x, thus
x = b + x Ax = b + (IA) x. This gives the iteration (i). The right side b + (IA) x
represents a mapping f: R
n
R
n
. Let us choose x, x. Obviously,

m
(f (x) f ( x)) =
m
((IA)(x x)) q
m
(x x) <
m
(x x),
i.e. f is a contraction. The Banach theorem assures the existence of a unique fixed
point x of f.
Let us follow another way to complete the situation. Let x
0
be an initial
approximation of solution of Ax = b. The relation (i) gives the following sequence
x
1
= (I A) x
0
+ b,
x
2
= (I A) x
1
+ b = (IA) [(I A) x
0
+ b] + b = (I A)
2
x
0
+ [(I A) + I] b ,
x
3
= (I A) x
2
+ b = (I A)
3
x
0
+ [(I A)
2
+ (I A) + I] b,
...
x
n
= (I A) x
n1
+ b = (I A)
n
x
0
+ [(I A)
n1
+ ... + (I A) + I] b.
Let B
n
=[( I A)
n1
+ ... + (I A) + I]. If
m
(I A) q < 1 and n, the sequence
{B
n
} converges to some matrix B and (IA)
n
x
0
o, so it holds x

= lim x
n
= B b.
Because x

is the solution of the original equation A x = b, it is B = A


1
.
The table on the following page shows two simple examples of both convergent and
divergent iterations (i) calculated by EXCEL. In the first case v
m
(I A) < 1 and the
iteration converges to the solution of the system Ax = b. For the sake of completeness

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

81
also elements of matrices B
n
are shown. They illustrate the convergence of {B
n
} to the
inverse matrix A
1
very clearly.


A =
05 01
02 04
. .
. .
|
\

|
, b =
2
2 6 .
|
\

|
(v
m
(I A) = 0.8)
A =
|

\
|
4 . 0 2 . 0
1 . 0 5 . 0
, b =
|

\
|
6 . 2
0 . 1

Iteration B
n
= I + (IA) + ... + (IA)
n
Iteration
n x
n,1
x
n,2
b
n,11
b
n,12
b
n,21
b
n,22
x
i,1
x
i,2

0 1 1 1 0 0 1 1 1
1 2.4 3 1.5 0.1 0.2 1.6 2.6 3
2 2..9 3.92 1.77 0.21 0.42 1.98 2.6 4.92
3 3.058 4.372 1.927 0.303 0.606 2.23 5.392 5.032
4 3.0918 4.6116 2.0241 0.3745
0.749
2.3986 6.5848 6.6976
5 3.08474 4.7486 2.08695 0.42711 0.85422 2.51406 11.547 5.3016
6 3.06751 4.832212 2.128897 0.46496 0.92992 2.593858 15.790 8.090352
7 3.050534 4.885825 2.157441 0.49187 0.98373 2.649307 25.4945 4.296155
8 3.036684 4.921388 2.177094 0.51086 1.02173 2.687957 36.81207 10.27658
9 3.026203 4.945496 2.190720 0.52423 1.04846 2.714947 57.2458 1.403537
10 3.018552 4.962057 2.200205 0.53361 1.06722 2.733814 84.72829 14.89127
11 3.013070 4.973524 2.206824 0.54019 1.08037 2.747010 129.582 5.41089
12 3.009183 4.981500 2.211449 0.54479 1.08959 2.756243 193.9134 25.26978
13 3.006441 4.987064 2.214683 0.54802 1.09604 2.762705 294.397 21.0208
14 3.004514 4.990950 2.216946 0.55028 1.10056 2.767227 442.6977 48.86693
15 3.003162 4.993667 2.218529 0.55186 1.10373 2.770392 669.933 56.6194
16 3.002214 4.995568 2.219637 0.55297 1.10594 2.772608 1009.562 102.615
17 3.001550 4.996898 2.220413 0.55375 1.10749 2.774159 1525.6 137.743
18 3.001085 4.997829 2.220956 0.55429 1.10858 2.775245 2301.181 225.0749
19 3.000760 4.998480 2.221336 0.55467 1.10934 2.776005 3475.28 322.591
20 3.000532 4.998936 2.221602 0.55493 1.10987 2.776537 5244.177 504.101
21 3.000372 4.999255 2.221788 0.55512 1.11024 2.776909 7917.68 743.775
22 3.000261 4.999479 2.221918 0.55525 1.11050 2.777170 11949.89 1139.87
23 3.000182 4.999635 2.222009 0.55534 1.11069 2.777352 18039.8 1703.46
24 3.000128 4.999745 2.222073 0.55541 1.11081 2.777480 27229.08 2588.491
25 3.000089 4.999821 2.222118 0.55545 1.11090 2.777569 41103.5 3890.12
26 3.000063 4.999875 2.222149 0.55548 1.11097 2.777632 62043.22 5889.222
27 3.000044 4.999912 2.222171 0.55550 1.11101 2.777676 93654.8 8872.51
28 3.000031 4.999939 2.222186 0.55552 1.11104 2.777706 141368.4 13410.04
29 3.000021 4.999957 2.222197 0.55553 1.11106 2.777728 213395 20225.1
30 3.000015 4.999970 2.222205 0.55554 1.11108 2.777743 322113.4 30546.49
31 3.000011 4.999979 2.222210 0.55554 1.11109 2.777753 486226 46092.2
32 3.000007 4.999985 2.222214 0.55555 1.11109 2.777761 733946.8 69592.43
33 3.000005 4.999990 2.222216 0.55555 1.11110 2.777766 1107880 105031
34 3.000004 4.999993 2.222218 0.55555 1.11110 2.777769 1672323 158559.9
35 3.000003 4.999995 2.222219 0.55555 1.11111 2.777772 2524341 239326
36 3.000002 4.999996 2.222220 0.55555 1.11111 2.777774 3810443 361275.2
37 3.000001 4.999998 2.222221 0.55555 1.11111 2.777775 5751793 545321
38 3.000001 4.999998 2.222221 0.55555 1.11111 2.777776 8682221 823168.7
39 3.000001 4.999999 2.222222 0.55555 1.11111 2.777776 1.3E+07 1242540
40 3 4.999999 2.222222 0.55556 1.11111 2.777777 19782728 1875608
41 3 4.999999 2.222222 0.55556 1.11111 2.777777 3E+07 2831178
42 3 5 2.222222 0.55556 1.11111 2.777777 45075597 4273626
43 3 5 20/9 5/9 10/9 25/9 6.8E+07 6450941


5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

82
In the second case the corresponding system Ax = b has the same solution as in the
first case but the iterations (i) diverge due to v
m
(IA) > 1. The convergence of iteration
(i) can be attained if Ax = b is multiplied by transposed matrix A
T
, i.e. transferred to
(A
T
A) x = A
T
b,
where
A
T
A =
029 003
0 03 017
. .
. .
|
\

|
, A
T
b =
102
0 94
.
.
|
\

|
.
The fact that any normed space is the metric space enables to implement various
notions from the theory of metric spaces. Especially those connected to convergence,
limit points, completeness (i.e. closure with regard to limits) etc. For example: any
normed space is locally convex because any set open with respect to norm contains a
ball with a proper radius and the ball is convex.
Complete normed linear spaces are shortly called Banach spaces [1,2].


5.3 Scalar Product. Hilbert Space
The scalar (inner) product in the Euclidean space R
n
is defined by the equality (x, y) =
x.y =
=
n
i
i i
y x
1
. The validity of commutative law, (x, y) =(y, x) is obvious as well as the
inequality (x, x) 0. Let S
n
=
=
n
i
i i
y x
1
. If {S
n
} is convergent,
n
lim S
n
= S R
1
, the
scalar product can be naturally generalized to infinite-dimensional vectors from R

by
infinite series

=1 i
i i
y x .
To avoid convergence problems it is advantageous to assume, for example, that
coordinates of points x, y generate absolutely convergent series or that the series

=1
2
i
i
x ,

=1
2
i
i
y assigned to scalar products (x, x), (y, y) are convergent.

For n N it holds the following Cauchy-Schwarz inequality
(x, y)
2
(x, x) (y, y)
or, in components,
(
=
n
i
i i
y x
1
)
2
(
=
n
i
i
x
1
2
) (
=
n
i
i
y
1
2
).
Proof. Be z = (y, y) x (x, y) y. Evidently,
(z, z) = ((y, y) x (x, y) y, (y, y) x (x, y) y) = (y, y)
2
(x, x) (y, y) (x, y)
2

= (y, y) [(y, y) (x, x) (x, y)
2
] 0,
That means: for (y, y) > 0 the expression in brackets is 0.

Scalar product is more frequently defined for complex vectors [1-2, 11-14], i.e.
elements from C
n
. For x, yC
n
one puts (x, y) = x y
i i i
n
=

1
, where bar denotes the

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

83
conjugated complex number. Then (y, x) = y x
i i i
n
=

1
=
=
n
i
i i
y x
1
= ) , ( y x . Now it is
simple to introduce the scalar product also in C

.
By means of the scalar product it is possible to define the Euclidean norm for x from
R
n
or C
n
: v
e
(x) = ) , ( x x =
2
1

=
n
i
i
x . Consequently, spaces R
n
or C
n
provided with
scalar product (y, x) = y x
i i i
n
=

1
are normed automatically.

Inspired by the real or complex spaces of final or countably infinite dimension we
can consider the abstract space X.

Definition. Suppose X is an abstract linear space over the field T and for every x, y X
a real or complex function (x, y) is defined with the following properties
1. if c
1
, c
2
T, x
1
, x
2
, y X then (c
1
x
1
+ c
2
x
2
, y) = c
1
(x
1
, y) + c
2
(x
2
, y) (linearity),
2. (x, y) = ) , ( x y for arbitrary x, y X, where the bar denotes the conjugated complex
number; in real vector space X (x, y) = ) , ( x y = (y, x) (commutative law)
3. (x, x) > 0 for every x X , x o.
The function (x, y) is called scalar product and X is called unitary space.

In the unitary space X
the norm x = ) , ( x x is defined, thus x y + x + y ,
the Cauchy-Schwarz inequality (x, y) x y holds.

Proof. The equality (x, x) = 0 and the propriety 3) of the scalar product imply x = o. For
aT, xX it holds ax =
( , ) ax ax
= a
( , ) x x
= a x . Further on,
x y +
2
= (x + y, x + y) = (x, x) + (x, y) + (y, x) + (y, y)
x
2
+ 2 x y + y
2
= ( x + y )
2
,
which implies
x y + x + y (triangle inequality).
The validity of the Cauchy-Schwarz inequality in X could be proved in the same way as
in R
n
.

The unitary space can be normed and this also allows to introduce the metric
d(x, y) = x y = ( , ) x y x y .

The complete unitary space is called Hilbert space.


5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

84
Theorem. Scalar product is a function continuous with respect to norm.
Proof. Let x
n
x, y
n
y, (i.e. x x
n
, y y
n
0) and x
n
, y
n
m < +.
Then
( , ) ( , ) x y x y
n n
= ( , ) ( , ) ( , ) ( , ) x y x y x y x y
n n n n
+
( , ) ( , ) x y x y
n n n
+ ( , ) ( , ) x y x y
n
( , ) x y y
n n
+ ( , ) x x y
n

m y y
n
+ x x
n
y .
It follows from x x
n
0, y y
n
0 that the right hand side of the last
inequality also converges to 0, thus,
n
lim (x
n
, y
n
) = (x, y).

The following rough schematic of the successive specialization can be set up as a
summary:

X = abstract set (space) X, T, +, . linear space
X, topology T (X, T ) + topology T topological linear space

(X, T ), neighborhood
convergence, completeness



(X, metric d) metric space
(X, norm ) normed linear space
Completeness with respect to norm
Banach space


(X, scalar product ( , )) unitary space
Completeness Hilbert space


Example 1. Let l
2
be the set of such infinite sequences a = {a
i
: i = 1, ...} that

=1 i i i
a a converges. For a, b l
2
let us define the scalar product (a, b) =

=1 i
i i
b a .
From (a, b)
2
(a, a) (b, b) =

=1 i i i
a a

=1 i i i
b b < + it follows that (a, b) < +.
The completeness of l
2
, i.e. a
(n)
a a l
2
, can be proved as follows.
Let {a
(n)
: n = 1, 2, } be the Cauchy sequence, i.e. to every >0 there is an n
0
N that
for any n, m > n
0
it holds (a
(n)
, a
(m)
) =
2
1
) ( ) (

=

k
m
k
n
k
a a < . Then, obviously,

5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

85
2
1
) ( ) (

=

K
k
m
k
n
k
a a < for KN. For n we get
) (n
k
a a
k
and
2
1
) (

=

K
k
m
k k
a a .
Because K is an arbitrary positive integer, it holds also
2
1
) (

=

k
m
k k
a a . The
difference between a = {a
k
} and {
) (m
k
a } l
2
can be made arbitrarily small, so a itself
belongs to l
2
.

Example 2. L
2
([a, b]) denotes the set of real functions whose square is integrable,
fL
2
([a, b]) if and only if

b
a
f
2
< +. Let (f, g) =

b
a
fg for f, g L
2
([a, b]). The
unitary space L
2
([a, b]), (.,.)) is complete, thus, L
2
([a, b]) is a Hilbert space (a
consequence of the Lebesgue theorem [1,2,9]).
In L
2
the Cauchy-Schwarz inequality implies:
(f, g)
2
= (

b
a
fg )
2
(

b
a
fg )
2

b
a
f
2

b
a
g
2
< +.
Remark. Let L
p
for p1 be the set of such functions f that f
p
is integrable (in the Lebesgue sense). In
functional analysis the completeness of the space L
p
for p1 is proved [1,2] (the idea of the proof is
similar to that in l
2
as shown in example 1). This implies the completeness of L
2
(p=2). If {f
n
} is a
sequence in L
2
, f
n
f, then fL
2
.

5.4 Orthogonality. General Fourier Series
Definition. X be a unitary space with a scalar product (., .). If f, gX and (f, g) = 0,
we say f, g are orthogonal with respect to the scalar product (., .).

Example 1. Let
I

= (
ij
), where i, j = 1, 2, ... and
ij
=

=
j i
j i
for
for
0
1
(Kroneckers delta). Then
the columns
e
1
=
1
0
0
M
|
\

|
|
|
|
, e
2
=
0
1
0
M
|
\

|
|
|
|
, e
3
=
0
0
1
M
|
\

|
|
|
|
, ...
are orthogonal. Obviously, (e
i
, e
j
) =
ij
, if i, j = 1, 2, ....

Example 2. Functions sin nx, cos nx are orthogonal on any interval of the length 2.
Namely

+

a
a
sin nx cos nx dx =
+

a
a
nx
n
2
sin
2
1
=
n
a n a n
2
) ( sin ) ( sin
2 2
+


5 LINEAR SPACES AND FOURIER SERIES


F. KOUTNY: ELEMENTARY NUMERICAL METHODS & FOURIER ANALYSIS

86
=
n
a n a n )) ( 2 cos 1 ( ) ( 2 cos 1 +
=
n
na na 2 cos 2 cos
= 0.

Example 3. Let us consider the sequence {f
k
(x) = e
ikx
: k Z} on interval [, ] and the
scalar product (f
k
, f
l
) =


l k
f f
2
1
=
2
1



e
ikx
e
ilx
dx =
2
1



e
i(kl)x
dx
(i = 1 ). Thus, for any k, l Z it is
(f
k
, f
l
) =
2
1


e
i(kl)x
dx =

= =


. for 1
2
) (
, for 0
) ( 2i
e e
) ( i ) ( i
l k
l k
l k
l k l k


Example 4. Let us find polynomials orthogonal to q(x) = x
2
1 on [1, 1]. First,
calculate
J
n
=

1
1
x
n
(x
2
1) =

1
1
x
n+2
x
n
=
3
1
+ n
(1 + (1)
n+2
)
1
1
+ n
(1 + (1)
n
)
=

+ +

n
n
n n
odd for
even for
0
) 3 )( 1 (
4
.
Thus, all odd powers and their linear combinations, i.e. all polynomials
p
k
(x) = x [a
0
+ a
1
x
2
+ ... + a
k
x
2k
], kN,
are orthogonal to the function q(x) on [1, 1]. However, also convergent series of odd
powers of x are orthogonal to q. This e.g. holds for the expansions of sin x or sinh x.
Namely, (x
2
1) sin x and (x
2
1) sinh x are odd functions and its integral over an
interval symmetric with respect to 0 is equal to zero.

In example 1 the system of unit vectors was shown like an orthogonal basis. Now a
question can be asked whether it is possible to construct an orthogonal system for an
arbitrary system of linearly independent elements. The answer is positive and given in
the following.

Orthogonalization Theorem. Let a
1
, a
2
, ... be a sequence of linearly independent
elements from a Hilbert space X. Then there is an orthogonal system b
1
, b
2
, ... in X that
env {a
1
, a
2
, ...} = env {b
1
, b
2
, ...}.
Proof. Let us choose $b_1 = a_1$ and put $b_2 = a_2 - \lambda b_1$. The coefficient $\lambda$ is taken so that $b_2$ is orthogonal to $b_1$, i.e. $(b_2, b_1) = (a_2, b_1) - \lambda (b_1, b_1) = 0$. The last equation gives $\lambda = (a_2, b_1)/(b_1, b_1)$, i.e.

$$b_2 = a_2 - \frac{(a_2, b_1)}{(b_1, b_1)}\, b_1.$$

We can go on and choose $b_3 = a_3 - \lambda b_1 - \mu b_2$. The equations $(b_3, b_1) = (b_3, b_2) = 0$ yield $\lambda = (a_3, b_1)/(b_1, b_1)$, $\mu = (a_3, b_2)/(b_2, b_2)$, i.e.

$$b_3 = a_3 - \frac{(a_3, b_1)}{(b_1, b_1)}\, b_1 - \frac{(a_3, b_2)}{(b_2, b_2)}\, b_2.$$

Thus, the following recurrence can be anticipated,

$$b_n = a_n - \frac{(a_n, b_1)}{(b_1, b_1)}\, b_1 - \frac{(a_n, b_2)}{(b_2, b_2)}\, b_2 - \ldots - \frac{(a_n, b_{n-1})}{(b_{n-1}, b_{n-1})}\, b_{n-1},$$

and proved by mathematical induction. Let the vectors $b_1, b_2, \ldots, b_m$, $m \ge 3$, be orthogonal.
Then

$$b_{m+1} = a_{m+1} - \frac{(a_{m+1}, b_1)}{(b_1, b_1)}\, b_1 - \frac{(a_{m+1}, b_2)}{(b_2, b_2)}\, b_2 - \ldots - \frac{(a_{m+1}, b_m)}{(b_m, b_m)}\, b_m$$

is orthogonal to $b_1, b_2, \ldots, b_m$, because

$$(b_{m+1}, b_k) = (a_{m+1}, b_k) - (b_k, b_k)\,\frac{(a_{m+1}, b_k)}{(b_k, b_k)} = 0 \quad\text{for } k = 1, \ldots, m.$$

Because $b_1, b_2, \ldots, b_m$ are linear combinations of $a_1, a_2, \ldots, a_m$ for every m, it is also

$$\mathrm{env}\,\{a_1, a_2, \ldots, a_m\} = \mathrm{env}\,\{b_1, b_2, \ldots, b_m\}.$$
The construction of an orthogonal basis exhibited here is called Gram–Schmidt orthogonalization.
Example 1. Let us orthogonalize the following vectors from $\mathbb{R}^3$, i.e. find mutually orthogonal vectors $b_1, b_2, b_3$ spanning the same space:

$$a_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix},\quad a_2 = \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix},\quad a_3 = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix}.$$

The linear independence of $a_1, a_2, a_3$ follows from the fact that the rank of the matrix $(a_1, a_2, a_3)$ is 3:

$$\det \begin{pmatrix} 1 & 2 & 1 \\ 1 & 0 & 2 \\ 0 & 1 & 1 \end{pmatrix} = -3 \ne 0.$$

The Gram–Schmidt algorithm gives

$$b_1 = a_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix},$$

$$b_2 = a_2 - \frac{(a_2, b_1)}{(b_1, b_1)}\, b_1 = \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix} - \frac{2\cdot 1 + 0\cdot 1 + 1\cdot 0}{1\cdot 1 + 1\cdot 1 + 0\cdot 0} \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix},$$

$$b_3 = a_3 - \frac{(a_3, b_1)}{(b_1, b_1)}\, b_1 - \frac{(a_3, b_2)}{(b_2, b_2)}\, b_2 = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} - \frac{3}{2} \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} - \frac{0}{3} \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix} = \begin{pmatrix} -1/2 \\ 1/2 \\ 1 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} -1 \\ 1 \\ 2 \end{pmatrix}.$$

Conversely, the orthogonality of $b_1, b_2, b_3$ can be verified easily by directly computing $(b_1, b_2)$, $(b_1, b_3)$, $(b_2, b_3)$. For example, with $b_3$ scaled to $(-1, 1, 2)^T$, $(b_2, b_3) = 1\cdot(-1) + (-1)\cdot 1 + 1\cdot 2 = 0$.
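A minimal sketch of this computation in NumPy (the helper name gram_schmidt and the test vectors are ours; the standard scalar product of $\mathbb{R}^3$ is assumed):

import numpy as np

def gram_schmidt(vectors):
    """Return an orthogonal (not normalized) system spanning the same space."""
    basis = []
    for a in vectors:
        b = a.astype(float).copy()
        for c in basis:
            b -= (a @ c) / (c @ c) * c   # subtract the projection of a onto c
        basis.append(b)
    return basis

a1 = np.array([1, 1, 0])
a2 = np.array([2, 0, 1])
a3 = np.array([1, 2, 1])
b1, b2, b3 = gram_schmidt([a1, a2, a3])
print(b1, b2, b3)                   # [1. 1. 0.] [ 1. -1.  1.] [-0.5  0.5  1. ]
print(b1 @ b2, b1 @ b3, b2 @ b3)    # all three products are 0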
Example 2. The sequence $\{x^0, x, x^2, \ldots\}$ is a linearly independent system on any interval $I \subset \mathbb{R}^1$ of positive length (a polynomial or power series is identically zero if and only if all its coefficients are 0; the roots of a polynomial are isolated points). In the set $L^2([-1, 1])$ of square-integrable functions on the interval $[-1, 1]$ let us define the scalar product

$$(f, g) = \int_{-1}^{1} fg$$

and search for the corresponding orthogonal polynomials $P_0(x), P_1(x), \ldots$.

$P_0(x)$ can be taken as the first member of the sequence $\{1, x, x^2, \ldots\}$, i.e. $P_0(x) = 1$. The next polynomial is

$$P_1(x) = x - P_0\,\frac{(x, P_0)}{(P_0, P_0)} = x - 1\cdot\frac{\int_{-1}^{1} x\,dx}{\int_{-1}^{1} 1\,dx} = x.$$

It is $(P_1, P_1) = \int_{-1}^{1} x^2\,dx = 2/3$. And further on,

$$P_2(x) = x^2 - P_0\,\frac{(x^2, P_0)}{(P_0, P_0)} - P_1\,\frac{(x^2, P_1)}{(P_1, P_1)} = x^2 - \frac{1}{2}\int_{-1}^{1} x^2\,dx - \frac{3}{2}\,x\int_{-1}^{1} x^3\,dx = x^2 - \frac{1}{3} = \frac{3x^2 - 1}{3},$$

$$(P_2, P_2) = \frac{1}{9}\int_{-1}^{1} (3x^2 - 1)^2\,dx = \frac{8}{45},$$

$$P_3(x) = x^3 - P_0\,\frac{(x^3, P_0)}{(P_0, P_0)} - P_1\,\frac{(x^3, P_1)}{(P_1, P_1)} - P_2\,\frac{(x^3, P_2)}{(P_2, P_2)} = x^3 - x\cdot\frac{3}{2}\cdot\frac{2}{5} = x\Bigl(x^2 - \frac{3}{5}\Bigr)$$

etc.
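The same construction can be repeated symbolically. A short sketch assuming SymPy is available (the helper name inner is ours):

import sympy as sp

x = sp.symbols('x')

def inner(f, g):
    # scalar product (f, g) = integral of f*g over [-1, 1]
    return sp.integrate(f * g, (x, -1, 1))

P = []
for n in range(4):
    p = x**n
    for q in P:
        p -= inner(x**n, q) / inner(q, q) * q
    P.append(sp.expand(p))

print(P)    # [1, x, x**2 - 1/3, x**3 - 3*x/5]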
An orthogonal system $\{b_1, b_2, \ldots\}$ in which $(b_n, b_n) = 1$ for every $n \in \mathbb{N}$ is called orthonormal (i.e. consisting of orthogonal unit vectors).

Obviously, the unit vector corresponding to an arbitrary $b \ne 0$ is $e_b = b/\lVert b\rVert$, where $\lVert b\rVert = \sqrt{(b, b)}$. Thus, if $b_1, b_2, \ldots, b_m$ is an orthogonal system, then

$$e_1 = b_1/\lVert b_1\rVert,\quad e_2 = b_2/\lVert b_2\rVert,\quad \ldots,\quad e_m = b_m/\lVert b_m\rVert$$

is the corresponding orthonormal system.
Definition. Suppose X is a Hilbert space, $\{e_1, e_2, \ldots\}$ an orthonormal system of its elements and $x \in X$. The infinite series

$$\sum_{k=1}^{\infty} (x, e_k)\, e_k$$

is called the Fourier series of x with regard to the system $\{e_1, e_2, \ldots\}$, and the numbers $(x, e_k)$ are the Fourier coefficients of x in that system.

This definition is purely formal. However, the following theorem says more.
Theorem. If $\{e_1, e_2, \ldots\}$ is an orthonormal system in a Hilbert space X, then for any $x \in X$ the series $\sum_{k=1}^{\infty} (x, e_k)^2$ converges and

$$\sum_{k=1}^{\infty} (x, e_k)^2 \le \lVert x\rVert^2 = (x, x) \qquad\text{(Bessel's inequality).}$$
Proof. It holds

$$0 \le \Bigl\lVert x - \sum_{k=1}^{\infty} (x, e_k)\, e_k \Bigr\rVert^2 = \Bigl(x - \sum_{k=1}^{\infty} (x, e_k)\, e_k,\; x - \sum_{k=1}^{\infty} (x, e_k)\, e_k\Bigr)$$

$$= (x, x) - \Bigl(\sum_{k=1}^{\infty} (x, e_k)\, e_k,\, x\Bigr) - \Bigl(x,\, \sum_{k=1}^{\infty} (x, e_k)\, e_k\Bigr) + \Bigl(\sum_{k=1}^{\infty} (x, e_k)\, e_k,\, \sum_{k=1}^{\infty} (x, e_k)\, e_k\Bigr).$$

For real elements, due to the orthonormality of the $e_k$ and the commutativity of the scalar product,

$$0 \le (x, x) - 2\sum_{k=1}^{\infty} \bigl((x, e_k)\, e_k, x\bigr) + \sum_{k=1}^{\infty} (x, e_k)^2 = (x, x) - 2\sum_{k=1}^{\infty} (x, e_k)^2 + \sum_{k=1}^{\infty} (x, e_k)^2 = \lVert x\rVert^2 - \sum_{k=1}^{\infty} (x, e_k)^2.$$

For complex elements

$$\Bigl(\sum_{k=1}^{\infty} (x, e_k)\, e_k,\, x\Bigr) = \sum_{k=1}^{\infty} (x, e_k)(e_k, x) = \sum_{k=1}^{\infty} (x, e_k)\,\overline{(x, e_k)} = \sum_{k=1}^{\infty} |(x, e_k)|^2$$

and similarly

$$\Bigl(x,\, \sum_{k=1}^{\infty} (x, e_k)\, e_k\Bigr) = \Bigl(\sum_{k=1}^{\infty} (x, e_k)\, e_k,\, \sum_{k=1}^{\infty} (x, e_k)\, e_k\Bigr) = \sum_{k=1}^{\infty} |(x, e_k)|^2.$$
Consequence. It follows from the convergence of $\sum_{k=1}^{\infty} (x, e_k)^2$ that $\lim_{k\to\infty} (x, e_k) = 0$.
Definition. Let X be a Hilbert space and $\{e_1, e_2, \ldots\}$ an orthonormal system in X.

$\{e_1, e_2, \ldots\}$ is called a basis of X if for every $x \in X$

$$x = \sum_{k=1}^{\infty} (x, e_k)\, e_k.$$

$\{e_1, e_2, \ldots\}$ is called complete if

$$(\forall k \in \mathbb{N})\ \bigl((e_k, x) = 0\bigr) \;\Rightarrow\; x = o,$$

i.e. the zero element o is the only element annulling all scalar products $(e_k, x)$.

$\{e_1, e_2, \ldots\}$ is called closed if the Bessel inequality turns to equality, i.e. $x \in X$ implies

$$\sum_{k=1}^{\infty} (x, e_k)^2 = \lVert x\rVert^2 \qquad\text{(Parseval equality).}$$
Approximation theorem. Let $m \in \mathbb{N}$, let x be an arbitrary element of a real Hilbert space X, and let $\{e_1, \ldots, e_m\}$ be an orthonormal system in X. A linear combination $\sum_{k=1}^{m} b_k e_k$ differs from x minimally,

$$\Bigl\lVert x - \sum_{k=1}^{m} b_k e_k \Bigr\rVert = \min,$$

if and only if $b_k = (x, e_k)$ (i.e. the $b_k$ are Fourier coefficients).
Proof. Obviously, instead of minimizing the norm we can seek the minimum of its square:

$$S(b_1, \ldots, b_m) = \Bigl\lVert x - \sum_{k=1}^{m} b_k e_k \Bigr\rVert^2 = \Bigl(x - \sum_{k=1}^{m} b_k e_k,\; x - \sum_{k=1}^{m} b_k e_k\Bigr)$$

$$= \sum_{k=1}^{m} b_k^2 - 2\sum_{k=1}^{m} b_k (x, e_k) + (x, x) = \sum_{k=1}^{m} \bigl[b_k - (x, e_k)\bigr]^2 - \sum_{k=1}^{m} (x, e_k)^2 + (x, x).$$

This function of $b_1, \ldots, b_m$ attains its minimum if $\sum_{k=1}^{m} [b_k - (x, e_k)]^2 = 0$, i.e. $b_k = (x, e_k)$, $k = 1, \ldots, m$. In other words, S is minimal when the $b_k$ are the Fourier coefficients of x in the orthonormal system $\{e_1, \ldots, e_m\}$.
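The statement is easy to test numerically. A quick sketch assuming NumPy (the orthonormal system is produced by a QR factorization; all names are ours):

import numpy as np

rng = np.random.default_rng(0)
E = np.linalg.qr(rng.standard_normal((5, 3)))[0]   # columns: orthonormal e1, e2, e3 in R^5
x = rng.standard_normal(5)

b_opt = E.T @ x                                    # Fourier coefficients (x, e_k)
err_opt = np.linalg.norm(x - E @ b_opt)

b_other = b_opt + 0.1 * rng.standard_normal(3)     # any perturbed coefficients
err_other = np.linalg.norm(x - E @ b_other)
print(err_opt <= err_other)                        # True: the Fourier choice is minimal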
Example. Let us find the Fourier expansion of the polynomial

$$P(x) = 5\,x^2 (x - 0.8)^2 (x - 2)^2 + 0.5$$

in the orthogonal system of the Legendre polynomials

$$P_n(x) = \frac{1}{2^n n!}\,\frac{d^n}{dx^n}(x^2 - 1)^n$$

on the interval [0, 2].

The Fourier coefficients of P with respect to the system $\{P_0, P_1, \ldots\}$ on a general interval [a, b], b > a, are

$$A_k = \frac{(P, P_k)}{(P_k, P_k)}, \quad\text{where}\quad (P, P_k) = \int_a^b P(x)\, P_k\Bigl(\frac{x - (a+b)/2}{(b-a)/2}\Bigr) dx.$$
In our case a = 0, b = 2, thus (a+b)/2 = 1 and

$$(P, P_k) = \int_0^2 P(x)\, P_k(x-1)\,dx, \qquad (P_j, P_k) = \int_0^2 P_j(x-1)\, P_k(x-1)\,dx.$$
Because P is of the 6th degree, let us confine ourselves to the first Legendre polynomials:

$$P_0(x) = 1,$$
$$P_1(x) = x,$$
$$P_2(x) = (3x^2 - 1)/2,$$
$$P_3(x) = x\,(5x^2 - 3)/2,$$
$$P_4(x) = (35x^4 - 30x^2 + 3)/8,$$
$$P_5(x) = x\,(63x^4 - 70x^2 + 15)/8,$$
$$P_6(x) = (231x^6 - 315x^4 + 105x^2 - 5)/16,$$
$$P_7(x) = x\,(429x^6 - 693x^4 + 315x^2 - 35)/16.$$
The coefficients $A_k$ could be computed by integrating the corresponding polynomials, but that would be dull work. So we simply use the 3-node Gauss quadrature. The orthogonality of the polynomials $P_k$ can be verified by the numerical calculation of the matrix elements $a_{jk} = (P_j, P_k)$. For the given tolerance $\varepsilon = 10^{-8}$ and n = 20 partial intervals

$$\{[(i-1)/10,\; i/10] : i = 1, \ldots, 20\}$$

we obtained $(P_j, P_k) = 0$ for $j \ne k$ and $(P_j, P_j) = 2/(2j+1)$, $j = 0, 1, \ldots$, i.e.

$$\bigl((P_j, P_k)\bigr)_{j,k=0}^{6} = \mathrm{diag}\Bigl(2,\; \frac{2}{3},\; \frac{2}{5},\; \frac{2}{7},\; \frac{2}{9},\; \frac{2}{11},\; \frac{2}{13}\Bigr).$$
The Fourier coefficients $A_k = (P, P_k)/(P_k, P_k)$ are as follows:

A_0 = 0.9876190, A_1 = 0.4571429, A_2 = 0.1523810, A_3 = 0.7111111, A_4 = 0.6815584, A_5 = 0.2539683, A_6 = 0.3463203.

Consequently, the partial sum of the Fourier series ($n \le 6$) is

$$S_n(x) = \sum_{k=0}^{n} A_k P_k(x - 1).$$
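For readers who want to check the coefficients programmatically, here is a hedged sketch assuming NumPy and SciPy; it replaces the 3-node Gauss quadrature of the text by scipy.integrate.quad, so the values agree only up to quadrature tolerance:

import numpy as np
from numpy.polynomial import legendre as L
from scipy.integrate import quad

P = lambda x: 5 * (x * (x - 0.8) * (x - 2))**2 + 0.5

def legendre_poly(k, t):
    # Legendre polynomial P_k(t): coefficient vector with a single 1 at position k
    return L.legval(t, [0] * k + [1])

A = []
for k in range(7):
    num = quad(lambda x: P(x) * legendre_poly(k, x - 1), 0, 2)[0]   # (P, P_k)
    den = 2 / (2 * k + 1)                                           # (P_k, P_k)
    A.append(num / den)
print(np.round(A, 7))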
Fig. 5.2 provides a visual evaluation of the approximation of the polynomial P by the sums $S_n$ for n = 2, ..., 6 ($S_6(x) = P(x)$). Later the same polynomial P will be used to show the computation of partial sums of the corresponding trigonometric Fourier series.
Hence, Fourier series may be expected to converge in elements of the Hilbert space X. But we do not know whether a series converges to the original element. This is true when the considered orthogonal system is a basis of X. Moreover, the various kinds of convergence in function spaces make the problem of Fourier series convergence even more complicated.
Fig. 5.2 Polynomial $P(x) = 5[x(x-0.8)(x-2)]^2 + 0.5$ and its Legendre approximations $S_n(x)$, n = 2, ..., 6; $S_6(x) = P(x)$.
The convergence of Fourier series in special orthonormal systems belongs to the permanently inspiring problems of mathematical analysis. For example, H. Lebesgue in his Leçons sur les séries trigonométriques (1906) displayed Fourier series as a fertile area for his new concept of integration. Many monographs have been written on Fourier series. Even the calculation of Fourier coefficients is not trivial, because it depends on how the scalar product is defined and what kind of set-theoretical approach has been chosen.
5.5 Trigonometric Fourier Series

The class of functions piecewise continuous on bounded intervals is rich enough for most practical cases. Such functions, as well as their products, are integrable. Thus the standard scalar product $(x, y) = \int_I xy$ is defined, and the sequences

$$\{1, \cos x, \sin x, \cos 2x, \sin 2x, \cos 3x, \sin 3x, \ldots\},$$
$$\{\ldots, e^{-3ix}, e^{-2ix}, e^{-ix}, 1, e^{ix}, e^{2ix}, e^{3ix}, \ldots\}$$

are orthogonal systems on any interval I of length $2\pi$. For even functions those systems reduce to

$$\{1, \cos x, \cos 2x, \cos 3x, \ldots\},$$

for odd functions to

$$\{\sin x, \sin 2x, \sin 3x, \ldots\}.$$
Let $x_0 \in \mathbb{R}^1$ and let $f: [x_0, +\infty) \to \mathbb{R}^1$ be an l-periodic function, l > 0, i.e. for $x \in [x_0, +\infty)$

$$f(x + l) = f(x).$$

A trigonometric orthogonal system on an interval $[a, a+l) \subset [x_0, +\infty)$ can be obtained very simply from $\{\ldots, e^{-2ix}, e^{-ix}, 1, e^{ix}, e^{2ix}, \ldots\}$ by taking $e^{2\pi imx/l}$ instead of $e^{imx}$ for $m \in \mathbb{Z}$. Then for $a \ge x_0$ the scalar product is
$$(e^{2\pi imx/l}, e^{2\pi inx/l}) = \int_a^{a+l} e^{2\pi imx/l}\, e^{-2\pi inx/l}\,dx = \int_a^{a+l} e^{2\pi i(m-n)x/l}\,dx$$

$$= \begin{cases} \bigl[x\bigr]_a^{a+l} = l & \text{for } m = n, \\[6pt] \Bigl[\dfrac{l}{2\pi i(m-n)}\, e^{2\pi i(m-n)x/l}\Bigr]_a^{a+l} = 0 & \text{for } m \ne n. \end{cases}$$
Thus, the system

$$\frac{1}{\sqrt{l}}\,\bigl\{\ldots,\; e^{-2\pi i\,3x/l},\; e^{-2\pi i\,2x/l},\; e^{-2\pi i\,x/l},\; 1,\; e^{2\pi i\,x/l},\; e^{2\pi i\,2x/l},\; e^{2\pi i\,3x/l},\; \ldots\bigr\}$$

is orthonormal on the interval [a, a+l).
Because

$$\int_a^{a+l} 1\,dx = l,\qquad \int_a^{a+l} \cos^2 \frac{2\pi kx}{l}\,dx = \frac{1}{2}\int_a^{a+l} \Bigl(1 + \cos\frac{4\pi kx}{l}\Bigr) dx = \frac{l}{2},$$

$$\int_a^{a+l} \sin^2 \frac{2\pi kx}{l}\,dx = \frac{1}{2}\int_a^{a+l} \Bigl(1 - \cos\frac{4\pi kx}{l}\Bigr) dx = \frac{l}{2},$$
the orthogonal system

$$\Bigl\{1,\; \cos\frac{2\pi x}{l},\; \sin\frac{2\pi x}{l},\; \cos\frac{4\pi x}{l},\; \sin\frac{4\pi x}{l},\; \ldots\Bigr\}$$

can be transformed, by multiplying by $\sqrt{2/l}$ or $1/\sqrt{l}$, respectively, into the orthonormal one

$$\Bigl\{\frac{1}{\sqrt{l}},\; \sqrt{\frac{2}{l}}\cos\frac{2\pi x}{l},\; \sqrt{\frac{2}{l}}\sin\frac{2\pi x}{l},\; \sqrt{\frac{2}{l}}\cos\frac{4\pi x}{l},\; \sqrt{\frac{2}{l}}\sin\frac{4\pi x}{l},\; \ldots\Bigr\}.$$
Specially, the period $l = 2\pi$ gives

$$\Bigl\{\frac{1}{\sqrt{2\pi}},\; \frac{\cos x}{\sqrt{\pi}},\; \frac{\sin x}{\sqrt{\pi}},\; \frac{\cos 2x}{\sqrt{\pi}},\; \frac{\sin 2x}{\sqrt{\pi}},\; \ldots\Bigr\}.$$
The square roots are rather annoying; therefore the (trigonometric) Fourier series of a function f integrable on (a, a+l) is presented in the form

$$\frac{a_0}{2} + \sum_{k=1}^{\infty} \Bigl(a_k \cos\frac{2\pi kx}{l} + b_k \sin\frac{2\pi kx}{l}\Bigr)$$

with coefficients

$$a_k = \frac{2}{l}\int_a^{a+l} f(x)\cos\frac{2\pi kx}{l}\,dx, \qquad b_k = \frac{2}{l}\int_a^{a+l} f(x)\sin\frac{2\pi kx}{l}\,dx.$$
The Fourier series is often written in the phasor form,

$$r_0 + \sum_{k=1}^{\infty} r_k \cos\Bigl(\frac{2\pi kx}{l} - \varphi_k\Bigr).$$

Obviously, $r_0 = a_0/2$ and $r_k = \sqrt{a_k^2 + b_k^2}$ for k = 1, 2, ...
Determining the phase $\varphi_k \in (-\pi, \pi]$ on a computer may be a bit complicated when arctan is the only inverse trigonometric function available in the software. Then

$$\varphi_k = \begin{cases} \arctan\dfrac{b_k}{a_k} & \text{for } a_k > 0, \\[6pt] \arctan\dfrac{b_k}{a_k} + \pi\,\mathrm{sign}(b_k) & \text{for } a_k < 0, \\[6pt] \dfrac{\pi}{2}\,\mathrm{sign}(b_k) & \text{for } a_k = 0. \end{cases}$$
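Where a two-argument arctangent is available, the whole case analysis collapses into a single call. A minimal sketch in Python (math.atan2 performs exactly the branch selection above):

import math

def phase(a_k, b_k):
    """Phase phi_k in (-pi, pi] of a_k cos(wx) + b_k sin(wx) = r_k cos(wx - phi_k)."""
    return math.atan2(b_k, a_k)

print(phase(-1.0, 0.5))   # ~2.678; plain arctan(b/a) would give the wrong quadrant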
Later it will be illustrated by several examples that, in general, the convergence of a Fourier series is not very fast. A. N. Kolmogorov (in 1922) constructed a function integrable in the Lebesgue sense whose Fourier series diverges everywhere. Let $L^2(I)$ denote the set of functions $f: I \to \mathbb{R}^1$ whose square is integrable in the Lebesgue sense. The measure of the set of points at which the Fourier series of a function $f \in L^2(I)$ diverges is zero (Carleson, 1966). In particular, the convergence of the Fourier series of a continuous function is guaranteed only almost everywhere, and the validity of the equality $\lim S_n(f, x) = f(x)$ must be verified using a convenient sufficient condition of convergence. There are several tests (criteria) of that kind [8,11].
For the sake of simplicity let $l = 2\pi$, $a = -\pi$,

$$a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\cos kt\,dt, \qquad b_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\sin kt\,dt.$$
Let us consider the partial sum

$$S_n(f, x) = \frac{a_0}{2} + \sum_{k=1}^{n} (a_k \cos kx + b_k \sin kx).$$

Substitution for $a_k$, $b_k$ yields

$$S_n(f, x) = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\Bigl[\frac{1}{2} + \sum_{k=1}^{n} (\cos kx \cos kt + \sin kx \sin kt)\Bigr] dt = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\Bigl[\frac{1}{2} + \sum_{k=1}^{n} \cos k(x - t)\Bigr] dt.$$

The sum in brackets in the last integrand is equal to

$$\frac{\sin[(2n+1)(t-x)/2]}{2\sin[(t-x)/2]}.$$
This can be verified simply by the Euler formula $\cos u = (e^{iu} + e^{-iu})/2$. Then

$$\frac{1}{2} + \cos u + \cos 2u + \ldots + \cos nu = \frac{1}{2}\bigl[(1 + e^{iu} + \ldots + e^{inu}) + (1 + e^{-iu} + \ldots + e^{-inu})\bigr] - \frac{1}{2}$$

$$= \frac{1}{2}\Bigl[\frac{e^{i(n+1)u} - 1}{e^{iu} - 1} + \frac{e^{-i(n+1)u} - 1}{e^{-iu} - 1}\Bigr] - \frac{1}{2} = \frac{1}{2}\cdot\frac{(e^{i(n+1)u} - 1)(e^{-iu} - 1) + (e^{-i(n+1)u} - 1)(e^{iu} - 1)}{(e^{iu} - 1)(e^{-iu} - 1)} - \frac{1}{2}$$

$$= \frac{2\cos nu - 2\cos(n+1)u - 2\cos u + 2}{2\,(2 - 2\cos u)} - \frac{1}{2} = \frac{\cos nu - \cos(n+1)u}{2\,(1 - \cos u)}.$$

Further, using the trigonometric relations

$$\cos\alpha - \cos\beta = 2\sin\frac{\alpha + \beta}{2}\,\sin\frac{\beta - \alpha}{2} \qquad\text{and}\qquad \sin^2\frac{u}{2} = \frac{1 - \cos u}{2}\,,$$

we obtain

$$\frac{1}{2} + \cos u + \cos 2u + \ldots + \cos nu = \frac{2\sin[(2n+1)u/2]\,\sin(u/2)}{4\sin^2(u/2)} = \frac{\sin[(n + \frac{1}{2})u]}{2\sin(u/2)}\,.$$
Substitution and a simple rearrangement yield the following Dirichlet integral:

$$S_n(f, x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\,\frac{\sin[(2n+1)(t-x)/2]}{\sin[(t-x)/2]}\,dt = \frac{1}{2\pi}\int_{-\pi-x}^{\pi-x} f(x+z)\,\frac{\sin[(n+\frac{1}{2})z]}{\sin(z/2)}\,dz.$$

The function

$$D_n(z) = \frac{\sin[(n+\frac{1}{2})z]}{2\pi\sin(z/2)}$$

(or an equivalent one) is called the Dirichlet kernel.
The courses of two Dirichlet kernels are shown in Fig. 5.3.

Fig. 5.3 Two Dirichlet kernels $D_n$ (n = 128 and n = 512) near zero.
Due to the l'Hospital rule,

$$\lim_{z\to 0} \frac{\sin[(n+\frac{1}{2})z]}{\sin(z/2)} = \lim_{z\to 0} \frac{(n+\frac{1}{2})\cos[(n+\frac{1}{2})z]}{\frac{1}{2}\cos(z/2)} = 2n + 1,$$

so it obviously holds that

$$\lim_{n\to\infty}\, \lim_{z\to 0}\, D_n(z) = +\infty.$$
On the other hand, it is clear that $f(x) = 1$ gives $a_k = b_k = 0$ for $k = 1, 2, \ldots$ and $a_0 = 2$. Therefore,

$$1 = S_n(1, x) = \int_{-\pi}^{\pi} D_n(z)\,dz = \int_{-\pi}^{\pi} D_n(t - x)\,dt$$
for every $n \in \mathbb{N}$. If $n \to \infty$, a new object, called the δ-functional (or δ-distribution, or Dirac's δ-function), arises. For a continuous function f,

$$\int_{-\pi}^{\pi} \delta(t - x)\, f(t)\,dt = f(x).$$

This means that the Fourier series of a bounded differentiable function converges to the corresponding value of that function. This simple criterion is very useful in many practical applications.

The last equality also says that the pointwise convergence of the Fourier series at a point x depends only on the local behavior of the considered function. This assertion is usually called the theorem on localization.
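A small numerical sketch (NumPy assumed) of the kernel $D_n$ and of the unit integral derived above; the grid deliberately avoids the removable singularity at z = 0:

import numpy as np

def dirichlet(z, n):
    # Dirichlet kernel D_n(z) = sin((n + 1/2) z) / (2 pi sin(z/2))
    return np.sin((n + 0.5) * z) / (2 * np.pi * np.sin(z / 2))

z = np.linspace(-np.pi, np.pi, 200000)   # even point count: z = 0 is not a grid point
dz = z[1] - z[0]
for n in (8, 128, 512):
    print(n, dirichlet(z, n).sum() * dz)  # each value is close to 1.0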
In some classes of functions more general than the continuously differentiable ones there are special sufficient conditions for the convergence of the Fourier series to the original function value. Here merely one such case will be mentioned. Generally, the difference between the nth partial Fourier sum $S_n(f, x)$ and the value f(x) can be written as

$$S_n(f, x) - f(x) = \int_{-\pi}^{\pi} [f(x+z) - f(x)]\, D_n(z)\,dz.$$
Dini condition. If f is a Lebesgue integrable function and the integral

$$\int_{-\delta}^{\delta} \Bigl|\frac{f(x+t) - f(x)}{t}\Bigr|\,dt$$

exists for a fixed x and an arbitrary $\delta > 0$, then the sequence of partial Fourier sums $\{S_n(f, x)\}$ converges to the value f(x).
The proof is based on the following Riemann–Lebesgue lemma: for every function g integrable on [a, b],

$$\lim_{c\to\infty} \int_a^b g(x)\sin cx\,dx = 0.$$
Proof. If g is a continuously differentiable function on [a, b], then for c > 0

$$\int_a^b g(x)\sin cx\,dx = \Bigl[-g(x)\,\frac{\cos cx}{c}\Bigr]_a^b + \int_a^b g'(x)\,\frac{\cos cx}{c}\,dx,$$

and the right side converges to 0 for $c \to +\infty$. For an arbitrary integrable function g and an arbitrarily small $\varepsilon > 0$ there exists a continuously differentiable function $g_\varepsilon$ such that

$$\int_a^b |g(x) - g_\varepsilon(x)|\,dx < \varepsilon/2.$$
Further,

$$\Bigl|\int_a^b g(x)\sin cx\,dx\Bigr| = \Bigl|\int_a^b [g(x) - g_\varepsilon(x) + g_\varepsilon(x)]\sin cx\,dx\Bigr|$$

$$\le \Bigl|\int_a^b [g(x) - g_\varepsilon(x)]\sin cx\,dx\Bigr| + \Bigl|\int_a^b g_\varepsilon(x)\sin cx\,dx\Bigr|$$

$$\le \int_a^b |g(x) - g_\varepsilon(x)|\,dx + \Bigl|\int_a^b g_\varepsilon(x)\sin cx\,dx\Bigr| \le \frac{\varepsilon}{2} + \Bigl|\int_a^b g_\varepsilon(x)\sin cx\,dx\Bigr|.$$

The second summand converges to 0 for $c \to +\infty$, as was shown above.
Now

$$S_n(f, x) - f(x) = \int_{-\pi}^{\pi} [f(x+z) - f(x)]\, D_n(z)\,dz = \int_{-\pi}^{\pi} \frac{f(x+z) - f(x)}{2\pi\sin(z/2)}\,\sin\Bigl[\Bigl(n+\frac{1}{2}\Bigr)z\Bigr]\,dz.$$

The integrability of $\dfrac{f(x+z) - f(x)}{z}$ implies the integrability of

$$\frac{f(x+z) - f(x)}{z}\cdot\frac{z}{2\sin(z/2)}\,,$$

and the Riemann–Lebesgue lemma can be applied to the last integral. Hence this integral converges to 0 for $n \to \infty$.
The Dini condition implies the following important

Theorem. Let f be a bounded periodic function whose discontinuity points are at most of the first kind (i.e. both limits

$$\lim_{h\to 0+} f(x+h) = f(x+0), \qquad \lim_{h\to 0+} f(x-h) = f(x-0)$$

exist, possibly with $f(x-0) \ne f(x+0)$), and let both one-sided derivatives

$$f'(x-0) = \lim_{h\to 0+}\frac{f(x) - f(x-h)}{h}, \qquad f'(x+0) = \lim_{h\to 0+}\frac{f(x+h) - f(x)}{h}$$

exist. Then

$$\lim_{n\to+\infty} S_n(f, x) = \frac{f(x-0) + f(x+0)}{2},$$

i.e. the Fourier series of the function f converges to the average of its limits from the left and from the right.
5.6 Numerical Calculation of Fourier Coefficients

Our considerations will further be confined to bounded functions piecewise continuous on intervals of length l > 0. If such a function $f: [x_0, x_0+l] \to \mathbb{R}^1$ is given analytically, the corresponding Fourier coefficients $a_k$, $b_k$ may be calculated by various numerical methods. However, if the values of f are collected by measurements, the most frequent case results in a data sequence corresponding to (h = l/n)

$$x_i = x_0 + ih, \quad i = 0, 1, \ldots, n.$$

Then the Fourier coefficients can be calculated by formulas of the Newton–Cotes type. But then the number of partial intervals n must be an entire multiple of the degree m of the corresponding interpolation polynomial, $n/m \in \mathbb{N}$. For example, in Simpson's integration m = 2 and n would have to be even. This restriction vanishes for m = 1, of course, i.e. for the trapezoid rule as the simplest Newton–Cotes formula. Then
$$a_k = \frac{2}{l}\int_a^{a+l} f(x)\cos\frac{2\pi kx}{l}\,dx \approx \frac{2}{x_n - x_0}\sum_{i=1}^{n}\int_{x_0+(i-1)h}^{x_0+ih} f(x)\cos\frac{2\pi kx}{l}\,dx \approx \frac{2}{nh}\sum_{i=1}^{n} h\,\frac{f(x_{i-1})\cos\frac{2\pi k x_{i-1}}{l} + f(x_i)\cos\frac{2\pi k x_i}{l}}{2}.$$
If l-periodicity is assumed, $f(x_0) = f(x_n)$ and

$$a_k \approx \frac{2}{n}\sum_{i=1}^{n} f(x_i)\cos\Bigl(\frac{2\pi k}{l}x_i\Bigr).$$

Similarly,

$$b_k \approx \frac{2}{n}\sum_{i=1}^{n} f(x_i)\sin\Bigl(\frac{2\pi k}{l}x_i\Bigr).$$
It is important to notice that for k > 1 all the trigonometric functions take only values already calculated for k = 1. Computing them anew for k = 2, 3, ... is just repeated and therefore useless work. It is said that this circumstance was noticed and exploited already by Gauss (1777–1855). It is also the basic idea of the so-called Fast Fourier Transform (FFT), whose algorithm was published in 1965 by Cooley and Tukey [6,7]. The FFT algorithm, usually implemented for $n = 2^s$, is a standard part of the software of common noise and vibration analyzers. It is included in EXCEL as well. A minimal sketch of its use for the coefficients $a_k$, $b_k$ follows.
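The sketch below assumes NumPy; the sign convention of numpy.fft.rfft matches the sums above, so $a_k$ and $b_k$ are read off directly from the real and imaginary parts:

import numpy as np

def fourier_coefficients(y):
    """Trapezoid-rule coefficients a_k, b_k of one period sampled at n points."""
    n = len(y)
    F = np.fft.rfft(y)        # F_k = sum_j y_j exp(-2 pi i k j / n)
    a = 2 * F.real / n        # a_k = (2/n) sum_j y_j cos(2 pi k j / n)
    b = -2 * F.imag / n       # b_k = (2/n) sum_j y_j sin(2 pi k j / n)
    return a, b               # a[0]/2 is the mean value r_0

t = np.arange(64) / 64                                   # one period, l = 1
y = 0.5 + 3 * np.cos(2*np.pi*t) + 2 * np.sin(4*np.pi*t)  # known test spectrum
a, b = fourier_coefficients(y)
print(round(a[0] / 2, 6), round(a[1], 6), round(b[2], 6))   # 0.5 3.0 2.0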
A very simple algorithm for computing Fourier coefficients could be as follows. Let a set of points $\{(x_i, y_i): i = 0, 1, \ldots, n\}$ be given, where

$$x_i = x_0 + ih, \qquad y_i = f(x_i).$$

For $i = 0, 1, \ldots, n$ the values

$$c_i = \cos\Bigl(\frac{2\pi}{l} x_i\Bigr) = \cos\frac{2\pi i}{n}, \qquad s_i = \sin\frac{2\pi i}{n}$$

are calculated and stored as arrays (sequences) $C = \{c_1, \ldots, c_n\}$, $S = \{s_1, \ldots, s_n\}$. As said above, no further values of sine and cosine are needed. Obviously,

$$r_0 = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad a_1 = \frac{2}{n}\sum_{i=1}^{n} y_i c_i, \qquad b_1 = \frac{2}{n}\sum_{i=1}^{n} y_i s_i.$$
When calculating $a_k$, $b_k$ for k > 1, the sequences $C_k = \{C, \ldots, C\}$, $S_k = \{S, \ldots, S\}$ are generated by simply repeating C and S k times. Now the members of $C_k$ and $S_k$ corresponding to the indices k, 2k, ..., nk are taken, multiplied by $y_1, y_2, \ldots, y_n$ and summed up. In the end the resulting sums are multiplied by 2/n. So, instead of calculating values of sine and cosine, one manipulates merely with indices. This procedure can be written (in programming languages) as

$$a_k = \frac{2}{n}\sum_{i=1}^{n} y_i\, c_{ik \bmod n}, \qquad b_k = \frac{2}{n}\sum_{i=1}^{n} y_i\, s_{ik \bmod n},$$

where $ik \bmod n$ is the residual after ik is divided by n (i.e. $ik - (ik \bmod n)$ is a multiple of n).
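A direct transcription of this index manipulation (a sketch; the function name slow_dft is ours). It trades repeated sine and cosine evaluations for index arithmetic, exactly as described:

import math

def slow_dft(y, kmax):
    """Coefficients a_k, b_k, k = 1..kmax, of the periodic sample y_1, ..., y_n."""
    n = len(y)
    c = [math.cos(2 * math.pi * i / n) for i in range(n)]   # stored once, reused for all k
    s = [math.sin(2 * math.pi * i / n) for i in range(n)]
    a, b = [], []
    for k in range(1, kmax + 1):
        a.append(2 / n * sum(y[i - 1] * c[i * k % n] for i in range(1, n + 1)))
        b.append(2 / n * sum(y[i - 1] * s[i * k % n] for i in range(1, n + 1)))
    return a, b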
The numerical calculation and some properties of Fourier series will now be illustrated by several examples. We start with a differentiable periodic function whose Fourier sums converge relatively fast, so that the original function can be well approximated by Fourier sums with a small index (Example 1). For a continuous function with a break some problems concerning accuracy appear (Example 2). These problems become even worse for a discontinuous function (the step function in Example 3). The local character of the behavior becomes especially evident when the length of the upper level of the step function is shortened (Gibbs phenomenon, Example 4).
Example 1. Let us turn back to the polynomial of the 6th degree,

$$f(x) = x^2\Bigl(x - \frac{4}{5}\Bigr)^2 (x - 2)^2,$$

which was used above to show the expansion in the orthogonal basis of Legendre polynomials. In this case the function f is differentiable everywhere in $\mathbb{R}^1$ and its trigonometric Fourier series converges to its functional values,

$$\lim_{n\to\infty} S_n(f, x) = f(x).$$
The relatively fast decrease of the absolute values of its Fourier coefficients is illustrated in Fig. 5.4.

Fig. 5.4 Fast convergence of the Fourier coefficients to 0 for the differentiable function $f(x) = x^2(x - 4/5)^2(x - 2)^2$.
The next figure shows the convergence of the partial Fourier sums to the values of f.

Fig. 5.5 Relatively fast convergence of the Fourier sums $S_n(f, x)$ (n = 2, 4, 512) to the values of the differentiable function $f(x) = x^2(x - 4/5)^2(x - 2)^2$.
A comparison of Fig. 5.5 with Fig. 5.2 shows that, for $2 \le n \le 5$, the approximation quality of the partial trigonometric Fourier sums $S_n(f, x)$ of the polynomial P seems to be better than that of the sums obtained with the Legendre polynomials.
Example 2. Let us deal with the behavior of the partial sums $S_n(f, x)$ corresponding to the continuous periodic function with a discontinuous first derivative,

$$f(x) = 1 - |\cos \pi x|.$$
The corresponding partial sums $S_n(f, x)$ are presented in Fig. 5.6. It can be seen that in this case the Fourier series also converges relatively fast. The greatest error arises at the cusp point (x = 0.5), but already the sum $S_{32}$ provides a very good approximation which, except in the near vicinity of x = 0.5, is optically almost indistinguishable from f(x) and $S_{512}$.

Fig. 5.6 Convergence of the partial Fourier sums (n = 2, 8, 32, 512) of the continuous function $f(x) = 1 - |\cos \pi x|$.
Consider now the function

$$f_p(x) = 1 - \sqrt[p]{|\cos \pi x|}, \qquad p = 2, 4, 8, \ldots$$

on the interval [0, 1]. Its graphs for different p are shown in Fig. 5.7, and the amplitudes of its harmonics in dependence on p in Fig. 5.8.

Fig. 5.7 Functions $f_p(x) = 1 - \sqrt[p]{|\cos \pi x|}$ for p = 2, 8, 32, 128.
Fig. 5.8 Flattening of the amplitude spectra accompanying the sharpening of the cusp of the function $f_p$ (cf. Fig. 5.7), for p = 2, 8, 32, 128, 512.
Example 3. Consider the following piecewise continuous function:

$$f(x) = \begin{cases} 2/3 & \text{for } x \in [0,1) \cup [2,3) \cup [4,5) \cup \ldots \\ 1/3 & \text{for } x \in [1,2) \cup [3,4) \cup [5,6) \cup \ldots \end{cases}$$

Let us deal just with its first period, i.e. the interval [0, 2) (Fig. 5.9):

$$f(x) = \begin{cases} 2/3 & \text{for } 0 \le x < 1, \\ 1/3 & \text{for } 1 \le x < 2. \end{cases}$$

Fig. 5.9 The first period of the chosen step function f.
The phasor form of the Fourier series of f has the following properties: $r_0 = a_0/2 = 0.5$ (the arithmetic average); the amplitudes with even indices are $r_2 = r_4 = \ldots = 0$; the amplitudes with odd indices, $r_1, r_3, r_5, \ldots$, are as follows:

0.21221 0.07074 0.04244 0.03032 0.02358 0.01930 0.01633 0.01415 0.01249 0.01118
0.01011 0.00923 0.00850 0.00787 0.00733 0.00686 0.00644 0.00608 0.00575 0.00545
0.00519 0.00495 0.00473 0.00453 0.00435 0.00418 0.00402 0.00388 0.00374 0.00362
0.00350 0.00339 0.00329 0.00319 0.00310 0.00301 0.00293 0.00286 0.00278 0.00271
0.00265 0.00259 0.00253 0.00247 0.00242 0.00236 0.00231 0.00227 0.00222 0.00218
0.00214 0.00210 0.00206 0.00202 0.00199 0.00195 0.00192 0.00189 0.00186 0.00183
0.00180 0.00177 0.00174 0.00172 0.00169 0.00167 0.00164 0.00162 0.00160 0.00158
0.00156 0.00154 0.00152 0.00150 0.00148 0.00146 0.00144 0.00142 0.00141 0.00139
0.00138 0.00136 0.00135 0.00133 0.00132 0.00130 0.00129 0.00128 0.00126 0.00125
0.00124 0.00123 0.00121 0.00120 0.00119 0.00118 0.00117 0.00116 0.00115 0.00114
0.00113 0.00112 0.00111 0.00110 0.00109 0.00108 0.00107 0.00107 0.00106 0.00105
0.00104 0.00103 0.00103 0.00102 0.00101 0.00100 0.00100 0.00099 0.00098 0.00098
0.00097 0.00096 0.00096 0.00095 0.00095 0.00094 0.00093 0.00093 0.00092
These values, belonging to the odd indices from 1 to 257, confirm that in some cases the convergence of a Fourier series can indeed be slow. Several partial Fourier sums $S_n(x)$ are shown in Fig. 5.10. It can be seen how the sums $S_n(x)$, as continuous functions, smooth the discontinuities of the given step function at the points x = 0, 1, 2, ... At the same time, the sums $S_n$ preserve oscillations in the left and right neighborhoods of the discontinuity points (Fig. 5.10). This phenomenon is denoted as the Gibbs effect. In both directions from the discontinuity points, the sums $S_n$ resemble records of damped oscillations (Fig. 5.11).
Fig. 5.10 Several partial sums $S_n(x)$ (n = 5, 9, 17, 33, 65, 129, 257) of the step function f on the interval [0, 2).

Fig. 5.11 Gibbs phenomenon in the sums $S_n(x)$ (n = 33, ..., 513) in the left neighborhood of x = 1.
Example 4. Let the length of the upper constant level of the step function from Example 3 decrease with increasing p, i.e.

$$f(x) = \begin{cases} 2/3 & \text{for } 0 \le x < 2/p, \\ 1/3 & \text{for } 2/p \le x < 2. \end{cases}$$

The oscillations of the partial sums shown in Fig. 5.12 demonstrate the localization theorem; Fig. 5.13 shows the corresponding amplitudes.

Fig. 5.12 Oscillations of the sums $S_n(x)$ for increasing asymmetry 1/p : (1 - 1/p) of the levels of f on [0, 2), p = 2, 8, 32.
Fig. 5.13 Amplitude spectra for the shortening upper level 2/p of the step function (p = 2, 8, 32, 64, 128).
A more detailed explanation of the Gibbs phenomenon can be found e.g. in [2]. Here only an illustration by means of numerical calculations on the step function will be given. Fig. 5.14 shows that the maximum overshoot near the discontinuity point changes only very little with the length n of the Fourier sum $S_n(x)$, and that it increases slowly with the decreasing length 2/p of the upper level (Fig. 5.15).
Fig. 5.14 The maximum overshoot $G_n$ in the left neighborhood of the discontinuity point, for n = 2, 4, 8, ..., 512.
Fig. 5.15 The maximum overshoot in $S_n$ as the upper level length 2/p drops (p = n = 2, 4, ..., 512).
The changes of the overshoot G at the discontinuity point are more important and interesting. Its proportionality to the step magnitude was obtained numerically for the considered step function (Fig. 5.16),

$$f(x) = \begin{cases} 2s/3 & \text{for } x < 1, \\ s/3 & \text{for } x > 1. \end{cases}$$

Fig. 5.16 Dependence of the Gibbs overshoot G (here $G_{32}$) on the step magnitude s.
Example 5. A very important application of Fourier analysis is the identification of frequencies in periodic processes or vibrations of various kinds. Fig. 5.17 shows a time record of the displacement x(t) in free oscillations of a simple dynamical system with small damping.

Fig. 5.17 Free oscillations of a simple dynamical system.
The oscillations are non-periodic due to the damping. Nevertheless, the amplitude spectrum of several oscillations can unveil valuable information. Let five oscillations from Fig. 5.17 be considered and the number of summands in the partial Fourier sum be limited by, e.g., m = 19. The corresponding amplitude spectrum is shown in Fig. 5.18.

Fig. 5.18 Amplitude spectrum of the oscillations in Fig. 5.17.
The following fact should also be noticed: the sum $S_{19}(t)$ is periodic while the original function x(t) is not. Hence the biggest differences can be expected near the ends of the time interval [0, 24.6]. This is shown in Fig. 5.19.

Fig. 5.19 Approximation of the non-periodic function x(t) by the sum $S_{19}(t)$ with period about 4.913 s.
It may be useful to say that the considered function was obtained by numerical solution of the non-homogeneous nonlinear differential equation

$$\ddot{x} + 0.05\,\dot{x} + 2\,\mathrm{sign}(x)\,|x|^{0.6} = 1.3$$

with the initial conditions x(0) = 1, $\dot{x}(0) = 0$. Thus, nonlinear oscillations with variable lengths of periods (returns into the equilibrium position from the same side) are substituted by linear oscillations.

The presence of all frequencies in the amplitude spectrum is typical for nonlinear vibrations. Though the frequency 1/5 Hz is dominant, frequencies different from its entire multiples (belonging to harmonic components) are present in the spectrum (subharmonic and superharmonic components).

The fifth (maximal) component represents a harmonic oscillator with period equal to one fifth of the total time interval, 24.6/5 ≈ 4.913 s, and frequency ≈ 0.2 Hz. This harmonic component,

$$H_1(t) = r_0 + r_5\cos(10\pi t/T + \varphi_5) = 0.51839 + 0.35633\cos(0.2035\cdot 2\pi t + 0.01348),$$

would be useful in case of linearization. It is shown as the dotted line in Fig. 5.19. The decrease of the amplitudes of the linearized oscillator is given by the factor $e^{-\beta t}$, where $\beta = -\ln x(T)/T = -\ln 0.754/24.6 = 0.05747$.
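The record in Fig. 5.17 can be regenerated along these lines. A hedged sketch assuming SciPy (the integrator and its step control are our choices, not the author's):

import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, u):
    x, v = u
    # x'' + 0.05 x' + 2 sign(x) |x|^0.6 = 1.3, rewritten as a first-order system
    return [v, 1.3 - 0.05 * v - 2 * np.sign(x) * abs(x) ** 0.6]

T = 24.6                                            # five oscillations
sol = solve_ivp(rhs, (0, T), [1.0, 0.0], max_step=0.01, dense_output=True)

t = np.linspace(0, T, 2048, endpoint=False)
x = sol.sol(t)[0]
r = 2 * np.abs(np.fft.rfft(x)) / len(t)             # amplitude spectrum r_k
print(np.argmax(r[1:20]) + 1)                       # dominant index (the text reports k = 5)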
The trajectory of the nonlinear oscillator in the phase plane $0x\dot{x}$ is shown in Fig. 5.20. The nonlinearity causes an asymmetry and a flattening in the left part near the vertical axis (x = 0). Also the convergence of the trajectory to the center of attraction (the equilibrium position),

$$\lim_{t\to\infty} x(t) = x_\infty = \Bigl(\frac{1.3}{2}\Bigr)^{1/0.6} = 0.48774, \qquad \lim_{t\to\infty} \dot{x}(t) = 0,$$

is evident.

Fig. 5.20 Trajectory of the considered nonlinear oscillator in the phase plane.
Example 6. The damped nonlinear oscillator

$$\ddot{x} + 0.1\,\dot{x} + 5\arctan(0.6\,x) = 1.5$$

is excited from the initial rest state, x(0) = 0, $\dot{x}(0) = 0$, by random step displacements $\Delta x(t_i) = X$, where X belongs to the normal distribution N(0, 0.01). The time instants $t_i = n_i h$ are given by the integration step $h = \pi/180$ and by sums of random numbers

$$n_i = \sum_{j=1}^{i} r_j, \qquad r_j \in \{0, 1, \ldots, 9\}.$$
An example of the phase trajectory $(x(t), \dot{x}(t))$ is shown in Fig. 5.21 and the function x(t) in Fig. 5.22.

Fig. 5.21 Phase trajectory of the randomly excited nonlinear oscillator.

Fig. 5.22 The time course of the displacement x(t) of the randomly excited nonlinear oscillator.
The Fourier analysis of the function x(t) was carried out on the interval [0, 95] for the index range k = 0, 1, ..., 200. The corresponding amplitude spectrum, shown in Fig. 5.23, has a distinct maximum at k = 22. The dominating harmonic oscillations have period $T_{22} = 95/22 \approx 4.32$ s and frequency $22/95 \approx 0.23$ Hz.

Fig. 5.23 Amplitude spectrum of the function x(t) from Fig. 5.22.
5.7 Fejér's Summation of Fourier Series

Many methods have been invented to accelerate the convergence of infinite series or to assign a generalized sum to some divergent series. The method of arithmetical averages is one of the simplest summation methods.

Let $\{a_1, a_2, \ldots\}$ be a sequence of real numbers and

$$s_1 = a_1,\quad s_2 = a_1 + a_2,\quad \ldots,\quad s_k = a_1 + \ldots + a_k,\ \ldots$$

the sequence of its partial sums. If the limit

$$\lim_{k\to\infty} \frac{s_1 + \ldots + s_k}{k}$$

exists, we call it the generalized sum of the series $\sum_{j=1}^{\infty} a_j$ with respect to averages.
The average $\sigma_k = \dfrac{s_1 + \ldots + s_k}{k}$ can be rearranged into the shape

$$\sigma_k = \frac{a_1 + (a_1 + a_2) + \ldots + (a_1 + a_2 + \ldots + a_k)}{k} = \frac{k a_1 + (k-1) a_2 + \ldots + a_k}{k} = a_1 + \Bigl(1 - \frac{1}{k}\Bigr) a_2 + \ldots + \frac{k - (k-1)}{k}\, a_k.$$

This is the kth partial sum of the series $\sum_{j=1}^{\infty} c_j a_j$ with $c_j = 1 - \dfrac{j-1}{k}$; evidently $0 < c_j \le 1$. If $\sum_{j=1}^{\infty} a_j$ is convergent, then so is $\sum_{j=1}^{\infty} c_j a_j$, i.e. the sequence of averages $\sigma_k$ is convergent (Abel's theorem [9]). The same conclusion follows from the Dirichlet theorem.
Hence, nothing new is obtained for convergent series. This can be illustrated by examples. The following table presents several partial sums $s_n$ and their averages $\sigma_n$ for the series

$$\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} \doteq 1.644\,934\,067, \qquad \sum_{k=1}^{\infty} \frac{(-1)^{k-1}}{k^2} = \frac{\pi^2}{12} \doteq 0.822\,467\,033.$$
n                                 10^1     10^2     10^3      10^4       10^5        10^6
sum 1/k^2:             s_n        1.549    1.635    1.6439    1.6448     1.64492     1.64493
                       sigma_n    1.412    1.599    1.6381    1.6440
sum (-1)^(k-1)/k^2:    s_n        0.818    0.822    0.8225    0.82247    0.822467    0.822467
                       sigma_n    0.835    0.824    0.8226    0.82248    0.822468    0.822467
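The table can be reproduced with a few lines of NumPy (a sketch; cumulative sums give both s_n and sigma_n at once):

import numpy as np

N = 10**6
s = np.cumsum(1.0 / np.arange(1, N + 1)**2)     # partial sums s_1, ..., s_N
sigma = np.cumsum(s) / np.arange(1, N + 1)      # averages (s_1 + ... + s_n)/n
for n in (10, 100, 1000, 10000, 100000, 1000000):
    print(n, round(s[n - 1], 6), round(sigma[n - 1], 6))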
The idea of summation with respect to averages was developed and further generalized in particular by E. Cesàro. That is why such methods are also called Cesàro methods, and the possibility shown above is shortly denoted as the (C, 1)-method. Generalized sums can be assigned also to some divergent series. For example, if |x| < 1, then

$$\frac{1}{1+x} = 1 - x + x^2 - \ldots + (-1)^n x^n + \ldots$$

Obviously, $\lim_{x\to 1-0} \frac{1}{1+x} = \frac{1}{2}$, but the right side at x = 1, i.e. $1 - 1 + 1 - 1 + \ldots$, does not fulfill even the necessary condition of convergence, $\lim_{n\to\infty} a_n = 0$. The corresponding partial sums are

$$s_{2k} = 0, \quad s_{2k+1} = 1 \quad\text{for } k = 0, 1, \ldots,$$
and the averaged sums

$$\sigma_1 = 1,\quad \sigma_2 = \frac{1}{2},\quad \sigma_3 = \frac{2}{3},\quad \sigma_4 = \frac{1}{2},\quad \ldots,\quad \sigma_{2k+1} = \frac{k+1}{2k+1},\quad \sigma_{2k} = \frac{k}{2k} = \frac{1}{2},\ \ldots$$

converge to 1/2 if $k \to \infty$. This is in full accord with the result of the power series.

Remark. It should be said that the summation based on power series (Poisson–Abel summation) is more general than the average method (Cesàro).
Fejér utilized the Cesàro method in the summation of trigonometric Fourier series (1904). One of his results is the following

Fejér's theorem. For $f \in C^0([0, 2\pi])$ let $S_n(f, x)$ be the partial trigonometric Fourier sum of f and

$$\sigma_n(f, x) = \frac{S_0(f, x) + \ldots + S_n(f, x)}{n + 1}.$$

Then the sequence of continuous functions $\{\sigma_n(f, x) : n = 0, 1, \ldots\}$ converges uniformly to the function f.
In other words: the partial Fejér sums give an arbitrarily precise approximation of any continuous function on $[0, 2\pi]$. Substitution and simple rearrangements yield

$$\sigma_n(f, x) = r_0 + \sum_{k=1}^{n} \Bigl(1 - \frac{k}{n+1}\Bigr) r_k \cos\Bigl(\frac{2\pi kx}{l} - \varphi_k\Bigr).$$

It can be seen from here that, in comparison with $S_n(f, x)$, the computation of $\sigma_n(f, x)$ becomes more complicated just due to the factors $(1 - \frac{k}{n+1})$.
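A minimal sketch of Fejér summation built directly from this weighted formula (NumPy assumed; the coefficients below are those of the step function of Example 2 further on, computed analytically):

import numpy as np

def fejer_sum(a, b, x, n, l=2.0):
    """sigma_n(f, x) from the coefficients a_k, b_k of an l-periodic function."""
    s = a[0] / 2 * np.ones_like(x)
    for k in range(1, n + 1):
        damp = 1 - k / (n + 1)                  # Fejer factor 1 - k/(n+1)
        w = 2 * np.pi * k / l
        s += damp * (a[k] * np.cos(w * x) + b[k] * np.sin(w * x))
    return s

# step function f = 9/10 on [0,1), 1/10 on [1,2): a_0 = 1, a_k = 0 for k >= 1,
# b_k = 0.8 (1 - cos(pi k)) / (pi k)
n = 64
k = np.arange(1, n + 1)
a = np.concatenate(([1.0], np.zeros(n)))
b = np.concatenate(([0.0], 0.8 * (1 - np.cos(np.pi * k)) / (np.pi * k)))
x = np.linspace(0.0, 2.0, 9)
print(fejer_sum(a, b, x, n))                    # smooth, overshoot-free profile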
Example 1. The Fejér sums $\sigma_n(f, x)$ of the continuous periodic function

$$f(x) = 1 - \sqrt{|\cos(\pi x/2)|}$$

on the interval [0, 2] are shown in Fig. 5.24.

Fig. 5.24 Fejér sums $\sigma_n$ (n = 8, 64, 512) of a continuous function with a cusp, $f(x) = 1 - \sqrt{|\cos(\pi x/2)|}$.
Example 2. The damping effect of Fejér summation can be shown on the step function

$$f(x) = \begin{cases} 9/10 & \text{for } 0 \le x < 1, \\ 1/10 & \text{for } 1 \le x < 2. \end{cases}$$

Fig. 5.25 Fejér sums $\sigma_n$ (n = 8, 64, 512) of the step function.

Fejér sums remove the Gibbs phenomenon (they suppress the oscillations) in the neighborhoods of discontinuity points. Fig. 5.25 shows at the same time that the continuous sum $\sigma_{512}(f, x)$ approximates the step function very faithfully.
5.8 Fourier Integral

We will sketch just the main features of what happens when the period of the considered function tends to infinity. Let f be Lebesgue integrable (i.e. absolutely integrable) on a finite interval [0, l], l > 0, $f \in L([0, l])$. The coefficients of the corresponding Fourier series

$$S_\infty(f, x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \Bigl(a_k \cos\frac{2\pi kx}{l} + b_k \sin\frac{2\pi kx}{l}\Bigr)$$

are (k = 0, 1, ...)

$$a_k = \frac{2}{l}\int_0^l f(t)\cos\frac{2\pi kt}{l}\,dt, \qquad b_k = \frac{2}{l}\int_0^l f(t)\sin\frac{2\pi kt}{l}\,dt.$$
After putting them into $S_\infty(f, x)$ and simple rearranging we obtain

$$S_\infty(f, x) = \frac{1}{l}\int_0^l f(t)\,dt + \sum_{k=1}^{\infty} \frac{2}{l}\int_0^l f(t)\cos\frac{2\pi k}{l}(t - x)\,dt.$$

Here $\int_0^l f(t)\,dt$ is a finite number due to $f \in L([0, l])$, and $\lim_{l\to\infty} \frac{1}{l}\int_0^l f(t)\,dt = 0$. Let us introduce a new variable $u(k) = \frac{2\pi k}{l} = \Delta u\, k$, where $\Delta u = \frac{2\pi}{l}$ is the step of u(k). For large l,

$$S_\infty(f, x) \approx \sum_{k=1}^{\infty} \frac{2}{l}\int_0^l f(t)\cos u(t - x)\,dt = \frac{1}{\pi}\sum_{k=1}^{\infty} \Bigl(\int_0^l f(t)\cos u(t - x)\,dt\Bigr)\Delta u.$$
The sum on the right-hand side can be viewed as an integral sum that converges to

$$\frac{1}{\pi}\int_0^\infty \Bigl(\int_0^\infty f(t)\cos u(t - x)\,dt\Bigr) du = \frac{1}{\pi}\int_0^\infty \Bigl(\cos ux\int_0^\infty f(t)\cos ut\,dt + \sin ux\int_0^\infty f(t)\sin ut\,dt\Bigr) du$$

when $l \to \infty$ (i.e. $\Delta u \to 0$). If

$$a(u) = \frac{1}{\pi}\int_0^\infty f(t)\cos ut\,dt, \qquad b(u) = \frac{1}{\pi}\int_0^\infty f(t)\sin ut\,dt,$$

one can write (Fubini theorem)

$$\frac{1}{\pi}\int_0^\infty \Bigl(\int_0^\infty f(t)\cos u(t - x)\,dt\Bigr) du = \int_0^\infty [a(u)\cos ux + b(u)\sin ux]\,du.$$

The formal substitution of symbols $\{\int_0^\infty,\; u,\; du\} \to \{\sum_{k=0}^{\infty},\; k,\; \Delta k = 1\}$ transfers the right-hand integral into the common Fourier series of a 2π-periodic function.
The assumption of Lebesgue integrability over the infinite interval [0, ∞) enables a substantial simplification of the existential considerations. After all, it implies the existence of the coefficients a(u), b(u), due to

$$|a(u)| = \frac{1}{\pi}\Bigl|\int_0^\infty f(t)\cos ut\,dt\Bigr| \le \frac{1}{\pi}\int_0^\infty |f(t)\cos ut|\,dt \le \frac{1}{\pi}\int_0^\infty |f(t)|\,dt < +\infty$$

and analogically

$$|b(u)| \le \frac{1}{\pi}\int_0^\infty |f(t)\sin ut|\,dt \le \frac{1}{\pi}\int_0^\infty |f(t)|\,dt < +\infty.$$

To avoid separate considerations of the cosine (even) part and the sine (odd) part of the trigonometric Fourier expansion, we use the Euler formula. The complex Fourier coefficient (or the so-called (direct) Fourier transform of the function f) is then

$$g(u) = \int_0^\infty f(t)\, e^{-iut}\,dt.$$

Now the following problem is at hand: what is the relation between the complex function g(u) and the original real function f(x)? The answer is given by the inverse Fourier transform,

$$f(x) = S_\infty(f, x) = \int_0^\infty g(u)\, e^{iux}\,du.$$
And as for the finite Fourier sums $S_n(f, x)$, we have the following

Theorem. If $\int_{-d}^{d} \Bigl|\dfrac{f(x+t) - f(x)}{t}\Bigr|\,dt$ exists for every d > 0 (Dini condition), then

$$S_n(f, x) \to f(x) = S_\infty(f, x) = \int_0^\infty g(u)\, e^{iux}\,du.$$

Thus, the Dini condition warrants the convergence of Fourier integrals.
The transition to the complex domain enables symmetric manipulations on the whole real axis. The direct and the inverse Fourier transforms are then defined as follows:

$$g(u) = \frac{1}{2\pi}\int_{-\infty}^{\infty} f(t)\, e^{-iut}\,dt, \qquad f(x) = \int_{-\infty}^{\infty} g(u)\, e^{iux}\,du,$$

or

$$g(u) = \int_{-\infty}^{\infty} f(t)\, e^{-iut}\,dt, \qquad f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} g(u)\, e^{iux}\,du,$$

or, symmetrically,

$$g(u) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} f(t)\, e^{-iut}\,dt, \qquad f(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} g(u)\, e^{iux}\,du.$$
The theory of complex functions, especially the residue theorem, is used for efficient computation of the corresponding integrals.
Example. For a > 0 the Fourier transform of the function $f(t) = 5 e^{-at}$ (obviously $f \in L([0, \infty))$) is

$$g(u) = \int_0^\infty f(t)\, e^{-iut}\,dt = 5\int_0^\infty e^{-(a+iu)t}\,dt = \frac{-5}{a + iu}\Bigl[e^{-(a+iu)t}\Bigr]_0^\infty = \frac{5}{a + iu} = 5\,\frac{a - iu}{a^2 + u^2} = \int_0^\infty f(t)(\cos ut - i\sin ut)\,dt.$$

If f is extended as an even function (Fig. 5.26), we get

$$a(u) = \operatorname{Re} g(u) = \int_0^\infty f(t)\cos ut\,dt = \frac{5a}{a^2 + u^2};$$

if f is extended as an odd function (Fig. 5.26), we get

$$b(u) = -\operatorname{Im} g(u) = \int_0^\infty f(t)\sin ut\,dt = \frac{5u}{a^2 + u^2}.$$
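A numerical cross-check of this transform, assuming SciPy: quadratures of the real and imaginary parts should reproduce 5/(a + iu):

import numpy as np
from scipy.integrate import quad

a, u = 1.5, 2.0
re = quad(lambda t: 5 * np.exp(-a * t) * np.cos(u * t), 0, np.inf)[0]
im = quad(lambda t: -5 * np.exp(-a * t) * np.sin(u * t), 0, np.inf)[0]
print(complex(re, im), 5 / (a + 1j * u))   # both approximately (1.2 - 1.6j)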
Fig. 5.26 Even and odd extensions of the exponential function to the negative half-axis.
The Fourier transform also provides the computation of some types of integrals, though in the foregoing case integration by parts can be used with the same result.

Solutions of linear differential equations are regularly presented as linear combinations of exponentials. Integral transforms and operator calculus are effective tools for finding solutions of complicated dynamical systems, transfer functions etc. If the real part of the complex variable p in the Laplace transform can be reduced to 0, p = iω (e.g. if $f \in L((-\infty, \infty))$), the Laplace transform turns into the Fourier transform and the Laplace transmission functions become frequency characteristics.
REFERENCES

1. KOLMOGOROV, A. N. – FOMIN, S. V.: Foundations of Function Theory and Functional Analysis (Czech translation). SNTL, Prague 1975.
2. ŠVEC, M. – ŠALÁT, T. – NEUBRUNN, T.: Mathematical Analysis of Functions of a Real Variable (in Slovak). ALFA, Bratislava 1987.
3. www.koutny-math.com (Prelude to Probability and Statistics)
4. DEMIDOVICH, B. P. – MARON, I. A.: Fundamentals of Numerical Mathematics (in Russian). NAUKA, Moscow 1966.
5. RALSTON, A.: A First Course in Numerical Analysis. McGraw-Hill, New York etc. 1965.
6. VITÁSEK, E.: Numerical Methods (in Czech). SNTL, Prague 1987.
7. CONTE, S. D. – DE BOOR, C.: Elementary Numerical Analysis. An Algorithmic Approach. McGraw-Hill, New York 1980.
8. JARNÍK, V.: Integral Calculus II (in Czech). NČSAV, Prague 1955.
9. www.koutny-math.com (Mathematical Base for Applications)
10. KOUTNY, F.: Numerische Lösung der Gl. f(t) = 0. Aplikace matematiky, Vol. 19, 1974, p. 290.
11. HARDY, G. H. – ROGOSINSKI, W. W.: Fourier Series. University Press, Cambridge 1962.
12. FICHTENHOLZ, G. M.: Course of Differential and Integral Calculus I–III (in Russian). FIZMATGIZ, Moscow 1963.
13. ACHIESER, N. I. – GLASMANN, I. M.: Theorie der linearen Operatoren im Hilbert-Raum. Akademie-Verlag, Berlin 1958.
14. REKTORYS, K. et al.: Survey of Applied Mathematics (in Czech, 5th edition). SNTL, Prague 1988.
15. GOLDBERG, D. E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley Longman Inc., USA and Canada, 1998.
INDEX

Adams methods 68
Adams–Bashford methods 68
Adams–Moulton methods 69
Aitken–Neville algorithm 8
Amplitude spectrum 109
Approximation, polynomial 3
Basis of linear space 75
Bessel's inequality 89
Bisection method 36
Boundary problems 71
C^0([a, b]) 3
C^n([a, b]) 77
Calculation of Fourier coefficients 99
Chebyshev formulas 25
Collocation method 74
Complete system 89
Convergence of Fourier series 92
Convex envelope, hull 78
Difference operator 6
Dini condition 97
Dirichlet integral 96
Dirichlet kernel 96
Elimination method 49
Fehlberg 64
Fejér theorem 115
Fejér's summation 114
Fourier coefficients 88
Fourier integral 116
Fourier series 88
Fourier transform, direct 118
Fourier transform, inverse 118
Gauss quadrature 23
Genetic algorithm 54
Gibbs phenomenon 105
Gram–Schmidt orthogonalization 87
Hilbert space 83
Horner 47
Hull, linear 75
Initial problem 57
Interpolation 3
Interpolation polynomial, Hermite 9
Interpolation polynomial, Lagrange 5
Interpolation polynomial, Newton 6
Inverse interpolation method 43
L([a, b]) 77
Legendre polynomial 90
Linear dependence 75
Linear envelope 75
Linear space 75
Local error 60
Localization of roots 47
Localization theorem 97
Monte Carlo methods 28
Multiple integration 27
Newton's method 44, 51
Nonlinear equations 35
Normed linear space 79
Numerical quadrature 19
Orthogonality 85
Orthonormality 88
Parseval equality 89
Phase form of Fourier series 94
Predictor-corrector methods 69
Quadratic interpolation method 41
Regula falsi 39
Richardson's extrapolation 25
Riemann–Lebesgue lemma 97
RK4 61
Runge–Kutta formulas 60
Scalar product 82
Shooting method 72
Simple iteration 50
Simple iteration method 45
Simpson's formula 21
Space L of Lebesgue integrable functions 77
Space l^2 84
Space L^2 of square integrable functions 85
Spline 9
Spline, cubic 12
Spline, natural 12
Spline, periodic 12
Spline, quadratic 10
Systems of Diff. Equations 65
Systems of nonlinear equations 49
Taylor expansion 57
Trapezoid formula 20
Trapezoid method 17
Vandermonde determinant 4
Weierstrass theorem 3