
Numerical Methods
1. Introduction

Most mathematical problems in engineering and physics, as well as in many other disciplines, cannot be solved exactly (analytically) because of their complexity. They have to be tackled either by using analytical approximations, that is, by choosing a simplified (and therefore only approximately correct) analytical formulation that can then be solved exactly, or by using numerical methods. In both cases approximations are involved, but this need not be a disadvantage since in most practical cases the parameters used in the mathematical model (the basic equations) will only be known approximately, from measurements or otherwise. Furthermore, if the accuracy is high there is no practical advantage in having an exact solution, when only a few decimal figures will have practical significance.
An ideal property sought in the design of numerical methods is that they should be “exact-in-the-limit”; that is, if we refine the discretization, take more terms in a series, or perform a larger number of iterations, etc. (depending on the nature of the method), the result should only improve, approaching the exact solution asymptotically. Of course, not all methods will have this property.
Since approximations will be present, an important aspect of using numerical methods is to be able to determine the magnitude of the error involved. Naturally, we cannot know the error itself; that would imply knowing the exact solution! But we can determine error bounds, or limits, so we can know that the error cannot exceed a certain measure.
There are several sources for these errors. One of them could be an inaccurate
mathematical model; that is, the mathematical description of the problem does not represent
correctly the physical situation. Possible reasons for this are:

Incomplete or incorrect theory


Idealizations/Approximations/Uncertainty

The second one could be due for example to idealization of the geometry of the problem or the
properties of materials involved and the uncertainty of the value of material parameters, etc.
This type of error concerns the mathematical model and the only possibility to reduce it is to improve the mathematical description of the physical problem. It does not concern the numerical calculation. It is, though, an important part of the more general problem of “numerical or computer modelling”.

Errors concerning the numerical calculations can be of three kinds:


i) Blunder: mistakes, computer or programming bugs. This type of error cannot really be dismissed; in practice the possibility of its occurrence must always be considered, particularly when the result is not known approximately in advance and there is then no means of judging the ‘reasonableness’ of the obtained result.
ii) Truncation error: This is an approximation error, where for example the value of a
function is calculated with a finite number of terms in an infinite series, or an integral is
estimated by the sum of a finite number of trapezoidal areas over (non-infinitesimal) segments.
This is also called discretization error in some cases. It is important to have an estimate or a
bound (limit) for this type of error.
iii) Round off error: Arithmetic calculations can almost never be carried out with
complete accuracy. Most numbers have infinite decimal representation, which must be rounded.
But even if the data in a problem can be initially expressed exactly by finite decimal
representation, divisions may introduce numbers that must be rounded; multiplication will also
introduce more digits. This type of error has a random character that makes it difficult to deal
with.
So far the error we have been discussing is the absolute error, defined by: error = true value – approximation. A problem with this definition is that it doesn’t take into account the magnitude of the value being measured; for example, an absolute error of 1 cm has a very different significance in the length of a 100 m bridge or of a 10 cm bolt. Another definition that reflects this significance is the relative error, defined as: relative error = absolute error/true value. For example, in the previous case, the bridge and the bolt have relative errors of 10^−4 and 0.1 respectively; or in percent, 0.01% and 10%.

2. Machine representation of numbers


Not only do numbers have to be truncated to be represented (for manual calculation or in a machine), but all the intermediate results are subject to the same restriction (over the complete calculation process). Of the continuum of real numbers, only a discrete set can be represented. Additionally, only a limited range of numbers can be represented in a given machine. Also, even if two numbers can be represented, it is possible that the result of an arithmetic operation between them is not representable. Finally, there will be occasions where results fall outside the range of representable numbers (underflow or overflow errors).
Example: Some computers accept real numbers in a floating-point format, with say, two digits
for the exponent; that is, in the form:
mantissa exponent
± 0.xxxxxx E ±xx

which can only represent numbers in the range: 10^−100 ≤ |y| < 10^99.
This is a normalized form, where the mantissa is defined such that:

0.1 ≤ mantissa < 1

We will see this in more detail next.

Floating-Point Arithmetic and Roundoff Error


Any number can be represented in a normalised format of the form x = ±a×10^b in a decimal representation. We can also use any other number as the exponent base, giving the more general form x = ±a×β^b; common examples are binary, octal and hexadecimal representations. In all of these, a is called the mantissa, a number between 0.1 and 1 (in the decimal case), which can be infinite in length (for example for π or for 1/3 or 1/9), β is the base (10 for decimal representation) and b is the exponent (positive or negative), which can also be arbitrarily large.
If we want to represent these numbers in a computer or use them in calculations, we need
to truncate both the mantissa (to a certain number of places), and the exponent, which limits the
range (or size) of numbers that can be considered.

Now, if A is the set of numbers exactly representable in a given machine, the question
arises of how to represent a number x not belonging to A ( x ∉ A ).
This is encountered not only when reading data into a computer, but also when
representing intermediate results in the computer during a calculation. Results of the elementary
arithmetic operations between two numbers need not belong to A. Let’s see first how a number
is represented (truncated) in a machine.
A machine representation can in most cases be obtained by rounding:

x  →  fl(x)

Here and from now on, fl(x) will represent the truncated form of x (this is, with a
truncated mantissa and limited exponent), and not just its representation in normalized floating-
point format.
For example, in a computer with t = 4 digit representation for the mantissa:

fl(π) = 0.3142E 1
fl(0.142853) = 0.1429E 0
fl(14.28437) = 0.1428E 2

In general, x is first represented in normalized form (floating-point format): x = a×10^b (if we consider a decimal system), where a is the mantissa and b the exponent, with 10^−1 ≤ |a| < 1 (so that the first digit of |a| after the decimal point is not 0).
Then suppose that the decimal representation of |a| is given by:

|a| = 0.α1α2α3α4 ... αi ...    with 0 ≤ αi ≤ 9, α1 ≠ 0    (2.1)


(where a can have infinite terms!)

Then, we form:

a' = 0.α1α2···αt             if 0 ≤ αt+1 ≤ 4
a' = 0.α1α2···αt + 10^−t     if αt+1 ≥ 5                    (2.2)

That is, only t digits are kept in the mantissa and the last one is rounded: αt is incremented by 1 if the next digit αt+1 ≥ 5, and all digits after αt are deleted.

The machine representation (truncated form of x) will be:

fl(x) = sign(x) · a' · 10^b    (2.3)

The exponent b must also be limited. So a floating-point system will be characterized by


the parameters: t, length of the mantissa; β, the base (10 in the decimal case as above) and L
and U, the limits for the exponent b: L ≤ b ≤ U.
To clarify some of the features of the floating-point representation, let’s examine the
system (β, t, L, U) = (10, 2, –1, 2).
The mantissa can have only two digits: Then, there are only 90 possible forms for the
mantissa (excluding 0 and negative numbers):

0.10, 0.11, 0.12, . . . , . . , 0.98, 0.99

The exponent can vary from –1 to 2, so it can only take 4 possible values: –1, 0, 1 and 2.
Then, including now zero and negative numbers, this system can represent exactly only
2×90×4 + 1 = 721 numbers: The set of floating-point numbers is finite.
The smallest positive number in the system is 0.10×10^−1 = 0.01 and the largest is 0.99×10^2 = 99.


We can also see that the spacing between numbers in the set varies; the numbers are not equally spaced. At the small end (near 0): 0, 0.01, 0.02, etc., the spacing is 0.01, while at the other extreme: 97, 98, 99, the spacing is 1 (100 times bigger in absolute terms).
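As a rough check on these figures, the following Matlab sketch (assuming the toy system (β, t, L, U) = (10, 2, −1, 2) described above) enumerates the representable values:

m = (10:99)/100;                       % the 90 possible normalized mantissas 0.10 ... 0.99
e = -1:2;                              % the 4 possible exponent values
v = sort(reshape(m'*10.^e, 1, []));    % all 360 positive representable numbers
count = 2*numel(v) + 1                 % positive, negative and zero: 721
smallest = v(1)                        % 0.01
largest = v(end)                       % 99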
In a system like this, we are interested in the relative error produced when any number is
truncated in order to be represented in the system. (We are here excluding the underflow or
overflow errors produced when we try to represent a number which is outside the range defined
by L and U, for example 1000 or even 100 in the example above.)

It can be shown that the relative error of fl(x) is bounded by:

|fl(x) − x| / |x| ≤ 5 × 10^−t = eps    (2.4)

This limit eps is defined here as the machine precision.

Demonstration:
From (2.1) and (2.2), the normalized decimal representation of x and its truncated floating-point form, we have that the maximum possible difference between the two forms is 5 at the decimal position t+1, that is:

|fl(x) − x| ≤ 5 × 10^−(t+1) × 10^b

also, since |x| ≥ 0.1×10^b, i.e. 1/|x| ≤ 10×10^−b, we obtain the condition (2.4).

From equation (2.4) we can write:

fl(x) = x(1 + ε)    (2.5)

where |ε| ≤ eps, for all numbers x. The quantity (1 + ε) in (2.5) cannot be distinguished from 1 in this machine representation, and the maximum value of ε is eps. So, we can also define the machine precision eps as the smallest positive machine number g for which 1 + g > 1.

Definition: machine precision    eps = min{ g | g + 1 > 1, g > 0 }    (2.6)
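As an illustration of definition (2.6), a minimal Matlab sketch that estimates the machine precision by repeated halving (for IEEE double precision this gives about 2.2×10^−16, the value returned by Matlab’s built-in eps):

g = 1;
while 1 + g/2 > 1
   g = g/2;              % keep halving while 1 + g/2 is still distinguishable from 1
end
g                        % smallest power of 2 for which 1 + g > 1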

Error in basic operations


The result of arithmetic operations between machine numbers will also have to be represented as machine numbers. Then, for each arithmetic operation, and assuming that x and y are already machine numbers, we will have:

x + y  →  fl(x + y) = (x + y)(1 + ε1)    (2.7)
x − y  →  fl(x − y) = (x − y)(1 + ε2)    (2.8)
x * y  →  fl(x * y) = (x * y)(1 + ε3)    (2.9)
x / y  →  fl(x / y) = (x / y)(1 + ε4)    with all |εi| ≤ eps    (2.10)

If x and y are not floating-point numbers (machine numbers) they will have to be converted
first giving:
x + y  →  fl(x + y) = fl(fl(x) + fl(y))
and similarly for the rest.

Let’s examine for example the subtraction of two such numbers: z = x – y , ignoring higher
order error terms:

fl(z) = fl(fl(x) − fl(y))
      = (x(1 + ε1) − y(1 + ε2))(1 + ε3)
      = ((x − y) + xε1 − yε2)(1 + ε3)
      = (x − y) + xε1 − yε2 + (x − y)ε3

Then:

|fl(z) − z| / |z| = |ε3 + (xε1 − yε2)/(x − y)| ≤ eps (1 + (|x| + |y|)/|x − y|)

We can see that if x approaches y the relative error can blow up, especially for large values of x and y. The maximal error bounds are pessimistic and in practical calculations errors may tend to cancel. For example, in adding 20000 numbers rounded to, say, 4 decimal places, the maximum error would be 0.5×10^−4×20000 = 1 (imagining the maximum absolute rounding error of 0.00005 in every case), while it is extremely improbable that this case occurs. From a statistical point of view, one can expect that in about 90% of the cases the error will not exceed 0.005.

Example
Let’s compute the difference between a = 1200 and b = 1194 using a floating-point system with a 3-digit mantissa:

fl(a − b) = fl(fl(1200) − fl(1194)) = fl(0.120×10^4 − 0.119×10^4) = 0.001×10^4 = 10

where the correct value is 6, giving a relative error of 0.667 (or 66.7%).
The machine precision for this system is eps = 5×10^−t = 5×10^−3, and the error bound above gives a limit for the relative error of eps (1 + (|a| + |b|)/|a − b|) = 2.0
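A small Matlab sketch of this example, rounding every quantity to 3 significant digits to mimic the 3-digit mantissa (it uses round(x, 3, 'significant'), available in recent Matlab versions):

a = 1200;  b = 1194;
fa = round(a, 3, 'significant');           % fl(a) = 0.120e4
fb = round(b, 3, 'significant');           % fl(b) = 0.119e4 = 1190
z  = round(fa - fb, 3, 'significant')      % gives 10 instead of the exact 6
rel_error = abs(z - (a - b))/abs(a - b)    % 0.667, i.e. 66.7%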

Error propagation in calculations


One of the important tasks in computer modelling is to find algorithms where the error
propagation remains bounded.
In this context, an algorithm is defined as a finite sequence of elementary operations that
prescribe how to calculate the solution to a problem from given input data (as in a sequence of
computer instructions). Problems can arise when one is not careful, as is shown in the following
example:

Example
Assume that we want to calculate the sum of three floating-point numbers: a, b, c.
This has to be done in sequence, that is, using any of the next two algorithms:

i) (a + b) + c or
ii) a + (b + c)

If the numbers are in floating-point format with t = 8 decimals and their values are for example:

a = 0.23371258E-4
b = 0.33678429E 2
c = -0.33677811E 2

The two algorithms will give the results:

i)  0.64100000E-3
ii) 0.64137126E-3

The exact result (which needs 14 decimal digits to calculate) is

0.641371258E-3

Exercise 2.1
Show, using an error analysis, why the case ii) gives a more accurate result for the numbers of
the example above. Neglect higher order error terms; that is, products of the form: ε1ε2.

Example
Determine the error propagation in the calculation of y = (x − a)^2 using floating-point arithmetic, by two different algorithms, when x and a are already floating-point numbers.

a) direct calculation: y = (x − a)^2

fl(y) = [(x − a)(1 + ε1)]^2 (1 + ε2)

fl(y) = (x − a)^2 (1 + ε1)^2 (1 + ε2)

and, only preserving first order error terms:

fl(y) = (x − a)^2 (1 + ε1)^2 (1 + ε2) ≈ (x − a)^2 (1 + 2ε1)(1 + ε2) ≈ (x − a)^2 (1 + 2ε1 + ε2)

then:  fl(y) − y = (x − a)^2 (2ε1 + ε2),  or

Δy = (fl(y) − y)/y = 2ε1 + ε2

We can see that the relative error in the calculation of y using this algorithm is given by 2ε1 + ε2, so its magnitude is less than 3 eps.
b) Using the expanded form: y = x^2 − 2ax + a^2

fl(y) = [(x^2 (1 + ε1) − 2ax(1 + ε2))(1 + ε3) + a^2 (1 + ε4)](1 + ε5)

This is, taking the square of x first (with its error) and subtracting the product 2ax (with its error) and the error in the subtraction, and then adding the last term with its error and the corresponding error due to that addition. Expanding this, and keeping only first order error terms, we get:

fl(y) = [x^2 − 2ax + x^2 ε1 − 2ax ε2 + (x^2 − 2ax)ε3 + a^2 (1 + ε4)](1 + ε5)

fl(y) = x^2 − 2ax + a^2 + x^2 (ε1 + ε3) − 2ax(ε2 + ε3) + a^2 ε4 + (x^2 − 2ax + a^2)ε5

from where we can write:

Δy = (fl(y) − y)/y = ε5 + x^2/(x − a)^2 (ε1 + ε3) − 2ax/(x − a)^2 (ε2 + ε3) + a^2/(x − a)^2 ε4

and we can see that there will be problems with this calculation if (x − a)^2 is too small compared with either x^2 or a^2. The first term above is bounded by eps, while the second will be eps multiplied by the amplification factor x^2/(x − a)^2. For example, if x = 15 and a = 14, the three amplification factors will be respectively 225, 420 and 196, which gives a total error bound of (1 + 450 + 840 + 196) eps = 1487 eps, compared with 3 eps for algorithm a).
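The effect can be seen numerically with a rough Matlab sketch, using single precision arithmetic and values chosen (as an illustrative assumption) so that x − a is small compared with x and a:

x = single(15.1);  a = single(15.0);
y_exact = (15.1 - 15.0)^2;                 % reference value computed in double precision
y1 = (x - a)^2;                            % algorithm a): direct calculation
y2 = x^2 - 2*a*x + a^2;                    % algorithm b): expanded form, with cancellation
rel1 = abs(double(y1) - y_exact)/y_exact   % small relative error
rel2 = abs(double(y2) - y_exact)/y_exact   % noticeably larger relative error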

Exercise 2.2
For y = a – b compare the error bounds when a and b are and are not already defined as
floating-point numbers.

Exercise 2.3
Determine the error propagation characteristics of two algorithms to calculate
a) y = x^2 − a^2 and b) y = (x − 1)^3. Assume in both cases that x and a are floating-point numbers.

3. Root Finding: Solution of nonlinear equations


This is a problem of very common occurrence in engineering and physics. We will study in this section several numerical methods to find roots of equations. The methods can be broadly classified as “bracketing methods” and “open methods”. In the first case, the solution lies inside an interval limited by a lower and an upper bound, and the methods always converge as the iterations progress. In contrast, the open methods are based on procedures that require one or two starting points which do not necessarily enclose the root. Because of this, they sometimes diverge; however, when they do converge, they usually do so faster than the bracketing methods.

Bracketing Methods

The Bisection Method


If a function is continuous in the interval [a, b] and f(a) and f(b) are of different sign
(f(a)f(b)<0), then at least one root of f(x) will lie in that interval.
The bisection method is an iterative procedure that starts with two points where f(x) has
different sign, that is, that “bracket” a root of the function.
We now define the point

c = (a + b)/2    (3.1)

as the midpoint of the interval [a, b], and examine the sign of f(c). There are now three possibilities: f(c) = 0, in which case the solution is c; otherwise f(c) is either positive or negative. The next interval in which to search for the root will be either [a, c] or [c, b], according to whether f(a) and f(c), or f(c) and f(b), are of different sign. (Equivalently, we can search for the case f(a)f(c) < 0 or f(c)f(b) < 0.)
The process then continues, each time halving the size of the search interval and “bracketing”
the solution as shown in Figs. 3.1 and 3.2.

Fig. 3.1 and Fig. 3.2: successive bracketing of the root; the interval [an, bn] is halved at its midpoint cn to give the new interval [an+1, bn+1].

CONVERGENCE AND ERROR


Since we know that the solution lies in the interval [a, b], the absolute error for c is bounded by (b − a)/2. Then, we can see that after n iterations, halving the interval in each iteration, the search interval will have reduced in length to (b − a)/2^n, so the maximum error after n iterations is:

|α − cn| ≤ (b − a)/2^n    (3.2)

where α is the exact position of the root and cn is the nth approximation found by this method.
Furthermore, if we want to find the solution with a tolerance ε (that is, α − cn ≤ ε ), we can
calculate the maximum number of iterations required from the expression above. Naturally, if at
one stage the solution lies at the middle of the current interval the search finishes early.
An approximate relative error (or percent error) at iteration n+1 can be defined as:

ε = |cn+1 − cn| / |cn+1|

but from the figure above we can see that cn+1 − cn = (bn+1 − an+1)/2, and since cn+1 = (bn+1 + an+1)/2, the relative error at iteration n+1 can be written as:

ε = |bn+1 − an+1| / |bn+1 + an+1|    (3.3)

This expression can also be used to stop iterations.

Exercise 3.1
Demonstrate that the number of iterations required to achieve a tolerance ε is the smallest integer that satisfies:

n ≥ (log(b − a) − log ε) / log 2    (3.4)

Example
The function: f ( x) = cos(3x) , has one root in the interval [0, 1]. The following simple Matlab
program implements the bisection method to find this root.

a=0; b=1;      %limits of the search interval [a,b]
eps=1e-6;      %sets tolerance to 1e-6
fa=ff(a);
fb=ff(b);
if(fa*fb>0)
   disp('root not in the interval selected')
else
   n=ceil((log(b-a)-log(eps))/log(2));  %number of iterations, rounded up to closest integer
   disp('Iteration number        a              b              c')
   for i=1:n
      c=a+0.5*(b-a);          %c is set as the midpoint between a and b
      disp([sprintf('%8d',i),'        ',sprintf('%15.8f',a,b,c)])
      fc=ff(c);
      if(fa*fc)<0             %the root is between a and c
         b=c;
         fb=fc;
      elseif(fa*fc)>0         %the root is between c and b
         a=c;
         fa=fc;
      else
         return               %f(c) is exactly zero: c is the root
      end
   end
end

together with the function definition:


function y=ff(x)
%****************************
y=cos(3*x);
%****************************

And the corresponding results are:


Iteration number a b c
1 0.00000000 1.00000000 0.50000000
2 0.50000000 1.00000000 0.75000000
3 0.50000000 0.75000000 0.62500000
4 0.50000000 0.62500000 0.56250000
5 0.50000000 0.56250000 0.53125000
6 0.50000000 0.53125000 0.51562500
7 0.51562500 0.53125000 0.52343750
8 0.52343750 0.53125000 0.52734375
9 0.52343750 0.52734375 0.52539063
10 0.52343750 0.52539063 0.52441406
11 0.52343750 0.52441406 0.52392578
12 0.52343750 0.52392578 0.52368164
13 0.52343750 0.52368164 0.52355957
14 0.52355957 0.52368164 0.52362061
15 0.52355957 0.52362061 0.52359009
16 0.52359009 0.52362061 0.52360535
17 0.52359009 0.52360535 0.52359772
18 0.52359772 0.52360535 0.52360153
19 0.52359772 0.52360153 0.52359962
20 0.52359772 0.52359962 0.52359867

Provided that the solution lies in the initial interval, and since the search interval is
continually divided by two, we can see that this method will always converge to the solution and
will find it within a required precision in a finite number of iterations.
However, due to the rather blind choice of solution (it is always chosen as the middle of
the interval), the error doesn’t vary monotonically. For the previous example
f ( x) = cos(3x) = 0 :
Iteration
number c f(c) error %
1 0.50000000 0.07073720 4.50703414
2 0.75000000 -0.62817362 -43.23944878
3 0.62500000 -0.29953351 -19.36620732
4 0.56250000 -0.11643894 -7.42958659
5 0.53125000 -0.02295166 -1.46127622
6 0.51562500 0.02391905 1.52287896
7 0.52343750 0.00048383 0.03080137
8 0.52734375 -0.01123469 -0.71523743
9 0.52539063 -0.00537552 -0.34221803

10 0.52441406 -0.00244586 -0.15570833


11 0.52392578 -0.00098102 -0.06245348
12 0.52368164 -0.00024860 -0.01582605
13 0.52355957 0.00011762 0.00748766
14 0.52362061 -0.00006549 -0.00416920
15 0.52359009 0.00002606 0.00165923
16 0.52360535 -0.00001971 -0.00125498
17 0.52359772 0.00000317 0.00020212
18 0.52360153 -0.00000827 -0.00052643
19 0.52359962 -0.00000255 -0.00016215
20 0.52359867 0.00000031 0.00001998

We can see that the error is not continually decreasing, although in the end it has to be small. This is due to the rather “brute force” nature of the algorithm. The approximation to the solution is chosen blindly as the midpoint of the interval without any attempt at guessing its position inside the interval. For example, if at some iteration n the magnitudes of f(an) and f(bn) are very different, say |f(an)| >> |f(bn)|, it is likely that the solution is closer to b than to a, if the function is smooth.
A possible way to improve it is to select the point c by interpolating the values at a and b.
This is called the “regula falsi” method or method of false position.

Regula Falsi Method


In this case, the next point is obtained by linear interpolation between the values at a and b.

From Fig. 3.3 we can see that:

f(b)/f(a) = (c − b)/(c − a)    (3.5)

from where

c = (a f(b) − b f(a)) / (f(b) − f(a))

or alternatively:

c = a + f(a)(a − b) / (f(b) − f(a))    (3.6)

Fig. 3.3
The algorithm is the same as the bisection method except for the calculation of the point c.
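As a sketch, with the same variable names as in the bisection listing above, only the line computing c changes; following (3.6) it becomes:

c = a + fa*(a-b)/(fb-fa);    %regula falsi: linear interpolation between (a,fa) and (b,fb)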

In this case, for the same function, ( f ( x) = cos(3x) = 0 ) the solution, within the same absolute
tolerance (10−6) is found in only 4 iterations:

Iteration number a b c
0 0.00000000 1.00000000 0.00000000
1 0.50251446 1.00000000 0.50251446
2 0.50251446 0.53237237 0.53237237
3 0.52359536 0.53237237 0.52359536
4 0.52359536 0.52359878 0.52359878

and for the error:


Iter.
number c f(c) error %
1 0.50251446 0.06321078 4.02680812
2 0.53237237 -0.02631774 -1.67563307
3 0.52359536 0.00001025 0.00065258
4 0.52359878 -0.00000000 -0.00000008

We can see that the error decreases much more rapidly than in the bisection method. The size of
the interval also decreases more rapidly. In this case the successive values are:
Iter.
number b-a b-a in bisection method
1 0.49748554 0.5
2 0.02985791 0.25
3 0.00877701 0.125
4 0.00000342 0.0625

However, this is not always the case and on some occasions the search interval can remain large. In particular, one of the limits can remain stuck while the other converges to the solution. In that case the length of the interval tends to a finite value instead of converging to zero.
In the following example, for the function f(x) = x^10 − 1, the solution requires 70 iterations to reach a tolerance of 10^−6 with the regula falsi method, while only 24 are needed with the bisection method. We can also see that the right side of the interval remains stuck at 1.3 and the size of the interval will tend to 0.3 in the limit instead of converging to zero. The figures show the
interpolating lines at each iteration. The corresponding approximations are the points where
these lines cross the x-axis.

Fig. 3.4 Standard regula falsi Fig. 3.5 Modified regula falsi

Modified regula falsi method


A modified version of the regula falsi method that improves this situation consists of an algorithm that detects when one of the limits gets stuck and then divides the function value retained at that limit by 2, changing the slope of the interpolating line. For the same example, this modified version reaches the solution in 13 iterations. The sequence of approximations and interpolating lines is shown in Fig. 3.5.

Open Methods
These methods start with only one point or two but not necessarily bracketing the root. One of
the simplest is the fixed point iteration.

Fixed point iteration


This method is also called one-point iteration or successive substitution. For a function of the form f(x) = 0, the method simply consists of re-arranging it into the form x = g(x).

Example
For the function 0.5x^2 − 1.1x + 0.505 = 0, the iteration can be set up as x = 0.5x^2 − 0.1x + 0.505, or

x = ((x − 0.1)^2 + 1)/2

With an initial guess introduced in the right hand side, a new value of x is obtained and the iteration can continue.
Starting from the value: x0 = 0.5, the successive values are:

Iteration number x error %


1 0.58000000 13.79310345
2 0.61520000 5.72171651
3 0.63271552 2.76830889
4 0.64189291 1.42973889
5 0.64682396 0.76234834
6 0.64950822 0.41327570
7 0.65097964 0.22603166
8 0.65178928 0.12421806
9 0.65223571 0.06844503
10 0.65248214 0.03776824
11 0.65261826 0.02085727
12 0.65269347 0.01152336
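A minimal Matlab sketch of the loop that generates the table above (using the rearrangement x = ((x − 0.1)^2 + 1)/2 and the approximate relative error between successive iterates):

x = 0.5;                          %initial guess x0
for i = 1:12
   xnew = ((x - 0.1)^2 + 1)/2;    %fixed point step x = g(x)
   err = 100*(xnew - x)/xnew;     %approximate relative error in percent
   fprintf('%4d %14.8f %14.8f\n', i, xnew, err);
   x = xnew;
end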

Fig. 3.6 Plot of y = x and y = g(x)        Fig. 3.7 Close-up showing the successive approximations

Convergence
This is a very simple method but convergence to the solution is not guaranteed. The following figures show situations where the method converges and where it diverges.
In cases (a) and (b) the method converges, while in cases (c) and (d) it diverges.

Fig. 3.8 Four different cases: (a) convergence, (b) convergence, (c) divergence, (d) divergence

From Fig. 3.8 a-d we can see that it is relatively easy to determine when the method will converge, so the best way of ensuring success is to plot the functions y = g(x) and y = x. In a more rigorous way, we can also see that for convergence to occur the slope of g(x) should be smaller in magnitude than that of x in the region of search; that is, |g'(x)| < 1.
If divergence is predicted, a different form of re-writing the problem f(x) = 0 in the form x = g(x) needs to be found that satisfies the condition above.
For example, for the function f(x) = 3x^2 + 3x − 1 = 0, with a solution at x0 = 0.2637626, we can separate it in the following two forms:

(a) x = g(x) = (−3x^2 + 1)/3    and    (b) x = g(x) = 3x^2 + 4x − 1

In the first case, g'(x) = −2x, and then g'(x0) = −0.5275252, while in the second case, g'(x) = 6x + 4, and then g'(x0) = 5.5825757.

Fig. 3.9(a) x = g(x) = (−3x^2 + 1)/3        Fig. 3.9(b) x = g(x) = 3x^2 + 4x − 1
Fig. 3.9 illustrates the main deficiency of this method. Convergence often depends on how the
problem is formulated. Additionally, divergence can also occur if the initial guess is not
sufficiently close to the solution.

Newton-Raphson Method
This is one of the most used methods for root finding. It also needs only one point to start the iterations but, unlike the fixed point iteration, it will converge to the solution provided the function is monotonic in the region of interest.
Starting from a point x0, the tangent to the function f(x) at that point (a line with the slope of the derivative of f) is extrapolated to find the point where it crosses the x-axis, providing a new approximation. The same procedure is repeated until the error tolerance is achieved. The method needs repeated evaluation of the function and its derivative, and an appropriate stopping criterion is the value of the function at the successive approximations: f(xn).

Fig. 3.10 The tangent at x0 (slope = f'(x0)) crosses the x-axis at the new approximation x1.

From Fig. 3.10 we can see that at stage n: f'(xn) = f(xn)/(xn − xn+1), so the next approximation is found as:

xn+1 = xn − f(xn)/f'(xn)    (3.7)

Example
For the same function as in the previous examples, f(x) = cos(3x) = 0, and for the same tolerance of 10^−6, the solution is found in 3 iterations starting from x0 = 0.3 (after 3 iterations the accuracy is better than 10^−8). Starting from 0.5, only 2 iterations are sufficient.

Iteration number x f(x)


0 0.30000000 0.62160997
1 0.56451705 -0.12244676
2 0.52339200 0.00062033
3 0.52359878 -0.00000000
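A minimal Matlab sketch of these iterations, using (3.7) with f(x) = cos(3x) and its derivative f'(x) = −3 sin(3x):

x = 0.3;                                 %starting point x0
for i = 1:10
   x = x - cos(3*x)/(-3*sin(3*x));       %Newton-Raphson step (3.7)
   fprintf('%4d %14.8f %16.8e\n', i, x, cos(3*x));
   if abs(cos(3*x)) < 1e-8, break; end   %stop when |f(x)| is small enough
end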

The method can also be derived from the Taylor series expansion. This also provides a useful
insight on the rate of convergence of the method.

Considering the Taylor expansion truncated to the first order (see Appendix):

f(xi+1) = f(xi) + f'(xi)(xi+1 − xi) + f''(ξ)/2! (xi+1 − xi)^2    (3.8)

Considering now the exact solution xr and the Taylor expansion evaluated at this point:

f(xr) = 0 = f(xi) + f'(xi)(xr − xi) + f''(ξ)/2! (xr − xi)^2

and reordering (assuming a single root, so the first derivative ≠ 0):

xr = xi − f(xi)/f'(xi) − f''(ξ)/(2! f'(xi)) (xr − xi)^2    (3.9)

Using now (3.7) for xi+1: xi+1 = xi − f(xi)/f'(xi), and substituting in (3.9) gives:

xr = xi+1 − f''(ξ)/(2! f'(xi)) (xr − xi)^2

which can be reordered as:

(xr − xi+1) = − f''(ξ)/(2! f'(xi)) (xr − xi)^2    (3.10)

The error at stage i can be written as the difference between xr and xi: Ei = (xr − xi); then, from (3.10) we can write:

Ei+1 = − f''(ξ)/(2 f'(xi)) Ei^2

Assuming convergence, both ξ and xi should eventually approach xr, so the previous equation can be re-arranged in the form:

Ei+1 = − f''(xr)/(2 f'(xr)) Ei^2    (3.11)

We can see that the relation between the errors at successive iterations is quadratic. That means that on each Newton-Raphson iteration the number of correct decimal digits should roughly double. This is what is called quadratic convergence.
Although the convergence rate is generally quite good, there are cases that show poor or no convergence. An example is when there is an inflexion point near the root; in that case the iteration values can progressively diverge from the solution. Another case is when the root is a multiple root, that is, when the first derivative is also zero at the root.

Stopping the iterations


It can be demonstrated that the absolute error (difference between the current approximation and the correct value) can be written as a multiple of the difference between the last two consecutive approximations:

En = (α − xn) = Cn (xn − xn−1)

and the constant Cn tends to 1 as the method converges. Then, a convenient criterion for stopping the iterations is:

|Cn (xn − xn−1)| < ε,    (3.12)

choosing a value for Cn, usually 1.

The Secant Method


One of the difficulties of the Newton-Raphson method is the need to evaluate the derivative. This can be very inconvenient or difficult in some cases. However, the derivative can be approximated using a finite difference expression, for example the backward difference:

f'(xi) ≅ (f(xi) − f(xi−1)) / (xi − xi−1)    (3.13)

If we now substitute this in the expression for the Newton-Raphson iterations, the following equation is obtained:

xi+1 = xi − f(xi)(xi − xi−1) / (f(xi) − f(xi−1))    (3.14)

which is the formula for the secant method.

Example
Use the secant method to find the root of f(x) = e^−x − x. Start with the estimates x−1 = 0 and x0 = 1. The exact result is: 0.56714329…
First iteration:
x−1 = 0, f(x−1) = 1.0;  x0 = 1, f(x0) = −0.63212
then x1 = 1 − (−0.63212)(1 − 0)/(−0.63212 − 1) = 0.61270,  ε ≈ 8%
Second iteration:
x0 = 1, f(x0) = −0.63212;  x1 = 0.61270, f(x1) = −0.07081
then x2 = 0.61270 − (−0.07081)(0.61270 − 1)/(−0.07081 − (−0.63212)) = 0.563838325

Note that in this case the 2 points are at the same side of the root (not bracketing it).
Using Excel, a simple calculation can be made giving:

i xi f(xi) error %
-1 0 1
0 1 -0.63212
1 0.612700047 -0.070814271 8.032671349
2 0.563838325 0.005182455 -0.582738902
3 0.567170359 -4.24203E-05 0.004772880
4 0.567143307 -2.53813E-08 2.92795E-06
5 0.567143290 1.24234E-13 7.22401E-08
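The same calculation can be sketched in Matlab using (3.14):

f = @(x) exp(-x) - x;           %function whose root is sought
x0 = 0;  x1 = 1;                %two starting estimates (they need not bracket the root)
for i = 1:6
   x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));   %secant update, eq. (3.14)
   fprintf('%4d %16.9f %14.5e\n', i, x2, f(x2));
   x0 = x1;  x1 = x2;           %shift to the two most recent points
end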

The secant method doesn’t require the evaluation of the derivative of the function as the
Newton-Raphson method does but still, it suffers from the same problems. The convergence of
the method is similar to that of Newton and similarly, it has severe problems if the derivative is
zero or near zero in the region of interest.

Multiple Roots
We have seen that some of the methods have poor convergence if the derivative is very small or zero. For higher order zeros (multiple roots), the function is zero and so are the first n−1 derivatives (n is the order of the root). In this case the Newton-Raphson method (and the secant method) will converge poorly.
We can notice however, that if the function f(x) has a multiple root at x = α, the function:

g(x) = f(x)/f'(x)    (3.15)

has a simple root at x = α (if the root of f is of order n, the root of the derivative is of order n−1).
We can then use the standard Newton-Raphson method on the function g(x).

Exercise 3.2
Use the Newton-Raphson method to find a root of the function f(x) = 1 − x e^(1−x). Start the iterations with x0 = 0.

Extrapolation and Acceleration


Observing how the iterations proceed in the methods we have just studied, one can expect some advantage if an extrapolated value is calculated from the iterates in order to improve the next guess and accelerate the convergence.
One of the best known methods for extrapolation and acceleration is Aitken's method. Considering the sequence of values xn, the extrapolated sequence can be constructed by:

x̄n = xn + α (xn − xn−1)    where    α = (xn−1 − xn) / (xn − 2xn−1 + xn−2)    (3.16)

We can use this expression embedded in the fixed point iteration, for example, in the following form. Starting from a value for x0, the first two iterates are found using the standard method:

x1=g(x0);
x2=g(x1);

Now we can apply Aitken's extrapolation (3.16):

alpha=(x1-x2)/(x2-2*x1+x0);   %note the sign: alpha = (x_{n-1}-x_n)/(x_n-2x_{n-1}+x_{n-2})
xbar=x2+alpha*(x2-x1);

and now we can refresh the initial guess:

x0=xbar;

and re-start the iterations.
Similarly for the Newton-Raphson method, where the evaluations of x1 and x2 are replaced by the corresponding N-R steps and the function and its derivative need to be calculated at each stage:

f0=f(x0);  df0=df(x0);        %function and derivative at x0
x1=x0-f0/df0;
f1=f(x1);  df1=df(x1);        %function and derivative at x1
x2=x1-f1/df1;
alpha=(x1-x2)/(x2-2*x1+x0);
xbar=x2+alpha*(x2-x1);
x0=xbar;

Example
For the function f(x) = cos(3x), using the fixed-point method with and without acceleration gives the results that follow. The iterations are started with x0 = 0.5 and the alternative form g(x) = (2x + cos(3x))/2 is used.

Iter. Fixed Point method Accelerated FP method


Number x error % x error %
1 0.5353686 4.50703414 0.52359385 -1.12230396
2 0.51771753 -2.24787104 0.52359878 -0.00023538
3 0.52653894 1.12323492 0.52359878 0
4 0.52212875 -0.56153005
5 0.52433378 0.28075410
6 0.52323127 -0.14037569
7 0.52378253 0.07018767
8 0.5235069 -0.03509381
9 0.52364471 0.01754690
10 0.52357581 -0.00877345
11 0.52361026 0.00438673
12 0.52359303 -0.00219336
13 0.52360165 0.00109668
14 0.52359734 -0.00054834
15 0.52359949 0.00027417
16 0.52359842 -0.00013709
17 0.52359896 0.00006854
18 0.52359869 -0.00003427
19 0.52359882 0.00001714
20 0.52359875 -0.00000857

Higher Order Methods


All the methods we have seen consist of approximating the function using a straight line and looking for the intersection of that line with the axis. An obvious extension of this idea is to use a higher order approximation, for example fitting a second order curve (a parabola) to the function and looking for the zero of this instead. A method that implements this is Muller's method.

Muller’s method
Using three points, we can find the equation of a parabola that fits the function and then, find the
zeros of the parabola.

For example, for the function y = x^6 − 2 (used to find the value of the 6th root of 2), the function and the approximating parabola are shown in Fig. 3.11. In this case, the 3 points are chosen as x1 = 0.5, x2 = 1.5 and x3 = 1.0.

The parabola passing through the points (x1, y1), (x2, y2) and (x3, y3) can be written as:

y = y3 + c2 (x − x3) + d1 (x − x3)(x − x2)

where

c1 = (y2 − y1)/(x2 − x1),   c2 = (y3 − y2)/(x3 − x2)   and   d1 = (c2 − c1)/(x3 − x1)

Fig. 3.11 The function y = x^6 − 2, the interpolating parabola and the successive approximations x1, x2, x3, x4.

Solving for the zero closest to x3 gives:

x4 = x3 − 2y3 / (s + sign(s)·√(s^2 − 4 y3 d1))    (3.17)

where s = c2 + d1 (x3 − x2).
The results of the iterations are:

Iteration number x error


1 1.07797900 0.07797900
2 1.37190697 0.29392797
3 1.37170268 -0.00020429
4 1.37148553 -0.00021716
5 1.17938786 -0.19209767
6 1.14443703 -0.03495083
7 1.12268990 -0.02174712
8 1.12242784 -0.00026206
9 1.12246205 0.00003420
10 1.12246205 0.00000000
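A minimal Matlab sketch of these iterations, following (3.17) for y = x^6 − 2 (note that the square root may become complex, which is how the method can locate complex roots):

f = @(x) x.^6 - 2;                   %function used to find the 6th root of 2
x1 = 0.5;  x2 = 1.5;  x3 = 1.0;      %the three starting points
for i = 1:10
   c1 = (f(x2) - f(x1))/(x2 - x1);
   c2 = (f(x3) - f(x2))/(x3 - x2);
   d1 = (c2 - c1)/(x3 - x1);
   s  = c2 + d1*(x3 - x2);
   x4 = x3 - 2*f(x3)/(s + sign(s)*sqrt(s^2 - 4*f(x3)*d1));   %eq. (3.17)
   fprintf('%4d %14.8f\n', i, x4);
   x1 = x2;  x2 = x3;  x3 = x4;      %shift the three points
end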

The Muller method requires three points to start the iterations but it doesn't require the evaluation of derivatives, as the Newton-Raphson method does. It can also be used to find complex roots.

4. Interpolation and Approximation


There are many occasions when there is a need to convert discrete data, from measurements or
calculations into a smooth function. This could be necessary for example, to obtain estimates of
the values between data points. It could be also necessary if one wants to represent a function by
a simpler one that approximates its behaviour. Two approaches to this task can be used.
Interpolation consists of fitting a smooth curve exactly to the data points and creating the curve
that spans these points. In approximation, the curve doesn’t necessarily fit exactly the data
points but it approximates the overall behaviour of the data, following an established criterion.

Lagrange Interpolation
The basic interpolation problem can be formulated as:
Given a set of nodes, {xi, i =0,…,n} and corresponding data values {yi, i =0,…,n}, find the
polynomial p(x) of degree less or equal to n such that p(xi) = yi.

Consider the family of functions:

Li^(n)(x) = ∏_{j=0, j≠i}^{n} (x − xj)/(xi − xj),    i = 0, 1, …, n    (4.18)

We can see that they are polynomials of order n and have the property (interpolatory condition):

Li^(n)(xj) = δij = 1 if i = j, 0 if i ≠ j    (4.19)

Then, if we define the polynomial:

pn(x) = Σ_{k=0}^{n} yk Lk^(n)(x)    (4.20)

then:

pn(xi) = Σ_{k=0}^{n} yk Lk^(n)(xi) = yi    (4.21)

The uniqueness of this interpolation polynomial can also be demonstrated (that is, that there is only one polynomial of order n or less that satisfies this condition).

Lagrange Polynomials
In more detail, from the general definition (4.18), the equation for the first order polynomial (straight line) passing through two points (x1, y1) and (x2, y2) is:

p1(x) = L1^(1) y1 + L2^(1) y2 = (x − x2)/(x1 − x2) y1 + (x − x1)/(x2 − x1) y2    (4.22)

The second order polynomial (parabola) passing through three points is:

p2(x) = L1^(2) y1 + L2^(2) y2 + L3^(2) y3

p2(x) = (x − x2)(x − x3)/((x1 − x2)(x1 − x3)) y1 + (x − x1)(x − x3)/((x2 − x1)(x2 − x3)) y2 + (x − x1)(x − x2)/((x3 − x1)(x3 − x2)) y3    (4.23)

In general, we can see that the interpolation polynomials have the form given in (4.20) for any order.
Each of the Lagrange interpolation functions Lk^(n) associated with each of the nodes xk (given in general by (4.18)) is:

Lk^(n)(x) = [(x − x1)(x − x2)···(x − xk−1)(x − xk+1)···(x − xn)] / [(xk − x1)(xk − x2)···(xk − xk−1)(xk − xk+1)···(xk − xn)] = N(x)/D    (4.24)

The denominator has the same form as the numerator, but evaluated at xk: D = N(xk).

Example
Find the interpolating polynomial that passes through the three points (x1, y1) = (−2, 4), (x2, y2) = (0, 2) and (x3, y3) = (2, 8).
Substituting in (4.20), or more specifically, (4.23):

p2(x) = (x − 0)(x − 2)/((−2 − 0)(−2 − 2)) · 4 + (x + 2)(x − 2)/((0 + 2)(0 − 2)) · 2 + (x + 2)(x − 0)/((2 + 2)(2 − 0)) · 8

p2(x) = (x^2 − 2x)/8 · 4 + (x^2 − 4)/(−4) · 2 + (x^2 + 2x)/8 · 8 = L1^(2) y1 + L2^(2) y2 + L3^(2) y3

p2(x) = x^2 + x + 2
Fig. 4.1 shows the complete interpolating polynomial p2(x) and the three Lagrange interpolation polynomials Lk^(2)(x), k = 1, 2, 3, corresponding to each of the nodal points. Notice that the function corresponding to one node has a value 1 at that node and 0 at the other two.

Fig. 4.1
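Since the interpolating parabola through 3 points is unique, this result can be checked with a quick Matlab sketch using the built-in polyfit (a degree-2 fit to exactly 3 points reproduces the interpolating polynomial):

xd = [-2 0 2];  yd = [4 2 8];
c = polyfit(xd, yd, 2)                %returns [1 1 2], i.e. the coefficients of x^2 + x + 2
x = -2:0.1:2;
plot(x, polyval(c, x), xd, yd, 'o')   %the parabola passes exactly through the data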

Exercise 4.1
Find the 5th order Lagrange interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and
yd = {2, 1, 3.4, 3.8, 5.8, 4.8}.

Exercise 4.2
Show that an arbitrary polynomial of order n can be represented exactly by p(x) = Σ_{i=0}^{n} p(xi) Li(x), using an arbitrary set of (distinct) data points xi.

Newton Interpolation
It can be easily demonstrated that the polynomial interpolating a set of points is unique (Exercise
4.2), and the Lagrange method allows us to find it. The Newton interpolation method gives
eventually the same result but it can be more convenient in some cases. In particular, it is
simpler to extend the interpolation adding extra points, which in the Lagrange method would
need a total re-calculation of the interpolation functions.

The general form of a polynomial used to interpolate n + 1 data points is:

f(x) = b0 + b1 x + b2 x^2 + b3 x^3 + ··· + bn x^n    (4.25)

Newton's method, like Lagrange's, gives us a procedure to find the coefficients bi.

From Fig. 4.2, we can write:

(y − y1)/(x − x1) = (y2 − y1)/(x2 − x1)

which can be re-arranged as:

y = y1 + (y2 − y1)/(x2 − x1) · (x − x1)    (4.26)

This is Newton's expression for the first order interpolation polynomial (or linear interpolation).

Fig. 4.2 Linear interpolation between the points (x1, y1) and (x2, y2).

That is, the Newton form of the equation of a straight line that passes through the 2 points (x1, y1) and (x2, y2) is

p(x) = a0 + a1 (x − x1);    where:  a0 = y1,  a1 = (y2 − y1)/(x2 − x1)    (4.27)

Similarly, the general expression for a second order polynomial passing through the 3 data points (x1, y1), (x2, y2) and (x3, y3) can be written as:

p2(x) = b0 + b1 x + b2 x^2

or, re-arranging:

p2(x) = a0 + a1 (x − x1) + a2 (x − x1)(x − x2)    (4.28)

Substituting the values for the 3 points, we get, after some re-arrangement:

a0 = y1
a1 = (y2 − y1)/(x2 − x1)
a2 = [ (y3 − y2)/(x3 − x2) − (y2 − y1)/(x2 − x1) ] / (x3 − x1)    (4.29)

The individual terms in the above expressions are usually called “divided differences” and denoted by the symbol D. That is,

Dyi = (yi+1 − yi)/(xi+1 − xi),   D^2 yi = (Dyi+1 − Dyi)/(xi+2 − xi),   D^3 yi = (D^2 yi+1 − D^2 yi)/(xi+3 − xi),   etc.

In this form, (4.29) takes the form: a0 = y1, a1 = Dy1 and a2 = D^2 y1.



The general form of the Newton interpolation polynomial is then an extension of (4.27) and (4.28):

pn(x) = a0 + a1 (x − x1) + a2 (x − x1)(x − x2) + ···    (4.30)

or,

pn(x) = a0 + Σ_{i=1}^{n} ai Wi(x)    with    Wi(x) = ∏_{j=1}^{i} (x − xj)    (4.31)

with the coefficients: a0 = y1, ai = D^i y1.

Example
We can consider the previous example of finding the interpolating polynomial that passes through the three points (x1, y1) = (−2, 4), (x2, y2) = (0, 2) and (x3, y3) = (2, 8).
In this case it is usual and convenient to arrange the calculations in a table with the following quantities in each column:

xi    yi    Dyi                                          D^2 yi
−2    4
            Dy1 = (y2 − y1)/(x2 − x1) = (2 − 4)/(0 − (−2)) = −1
 0    2                                                  D^2 y1 = (Dy2 − Dy1)/(x3 − x1) = (3 − (−1))/(2 − (−2)) = 1
            Dy2 = (y3 − y2)/(x3 − x2) = (8 − 2)/(2 − 0) = 3
 2    8

Then, the coefficients are: a0 = 4, a1 = −1 and a2 = 1, and the polynomial is:

p2(x) = 4 − (x + 2) + (x + 2)(x − 0) = x^2 + x + 2

Note that it is the same polynomial found using Lagrange interpolation.
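The divided-difference table can also be built in Matlab; the sketch below (for the same three data points) stores the table column by column and then evaluates the Newton polynomial (4.30):

xd = [-2 0 2];  yd = [4 2 8];
n = numel(xd);
D = zeros(n);  D(:,1) = yd(:);           %first column holds the data values
for j = 2:n
   for i = 1:n-j+1
      D(i,j) = (D(i+1,j-1) - D(i,j-1))/(xd(i+j-1) - xd(i));   %divided differences
   end
end
a = D(1,:)                               %Newton coefficients: a0 = 4, a1 = -1, a2 = 1
x = -2:0.5:2;                            %evaluate pn(x) = a0 + a1 W1(x) + a2 W2(x) + ...
p = a(1)*ones(size(x));  W = ones(size(x));
for k = 2:n
   W = W.*(x - xd(k-1));                 %Wk(x) = (x - x1)...(x - x_{k-1})
   p = p + a(k)*W;
end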

One important property of Newton's construction of the interpolating polynomial is that it makes it easy to extend the interpolation by including more points. If additional points are included, the new higher order polynomial can be easily constructed from the previous one:

pn+1(x) = pn(x) + an+1 Wn+1(x)    with    Wn+1(x) = Wn(x)(x − xn+1)    and    an+1 = D^(n+1) y1    (4.32)

In this way, it has many similarities with the Taylor expansion, where additional terms increase the order of the polynomial. These similarities allow a treatment of the error in the same way as is done with Taylor expansions.

Exercise 4.3
Find the 5th order interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1, 3.4, 3.8, 5.8, 4.8}, this time using Newton interpolation.

Some Practical Notes:


In some of the examples seen here we have used equally spaced data. This is by no means necessary and the data can have any distribution. However, if the data is equally spaced, simpler expressions for the divided differences can be derived (not done here).
Both in the Lagrange and the Newton methods, the interpolating function is obtained in a form that is different from the standard polynomial expression (4.25). The coefficients of this expression can be obtained from the Lagrange or Newton forms by setting up a system of equations for the known values at the data points. However, the resulting system is notoriously ill-conditioned (see chapter on Matrix Computations) and the results can be highly inaccurate.
____________________________

One of the problems with interpolation of data points is that this technique is very sensitive to noisy data. A very small change in the values of the data can lead to a drastic change in the interpolating function. This is illustrated in the following example:
Fig. 4.3 shows the interpolating polynomial (in blue) for the data:
xd = {0, 1, 2, 3, 4, 5}
yd = {2, 1, 3, 4.8, 5.8, 4.8}
Now, if we add two more points with a slight amount of noise: xd' = {2.2, 2.7} and yd' = {3.5, 4.25} (shown with the filled black markers), the new interpolating polynomial (red line) shows a dramatic difference from the first one.

Fig. 4.3

Hermite Interpolation
The problem arises because the extra points force a higher degree polynomial and this can have a
higher oscillatory behaviour. Another approach that avoids this problem is to use data for the
derivative of the function too. If we also ask for the derivative values to be matched at the
nodes, the oscillations will be prevented. This is done with the “Hermite Interpolation”. The
development is rather similar to that of the Newton’s method but more complicated due to the
involvement of the derivative values. It can also be constructed easily with the help of a table
(as in Newton’s) and divided differences. We will not cover here the details of the derivation
but simply, the procedure to find it.

The table is similar to that for the Newton interpolation but we enter the data points twice (see below), and the derivative values are placed between repeated data points, in alternate rows, as the first divided differences. The initial set-up for 2 points is marked with red circles.

i    xi    yi    Dyi                               D^2 yi                          D^3 yi
1    x1    y1
                 y1'
1    x1    y1                                      A = (Dy1 − y1')/(x2 − x1)
                 Dy1 = (y2 − y1)/(x2 − x1)                                         C = (B − A)/(x2 − x1)
2    x2    y2                                      B = (y2' − Dy1)/(x2 − x1)
                 y2'
2    x2    y2

The corresponding interpolating polynomial is:

H2(x) = y1 + y1'(x − x1) + A(x − x1)^2 + C(x − x1)^2 (x − x2)    (4.33)

The coefficients of the successive terms are marked in the table with blue squares.

Fig. 4.4 shows the function y(x) = sin(πx) compared with the curves interpolated using the Hermite, Newton and cubic spline methods (see next section). The data points are marked by circles. Newton interpolation results in a polynomial of order 4 while Hermite gives a polynomial of order 7. (If 7 points are used for the Newton interpolation the results are quite similar.)

Fig. 4.4

Another approach to avoid the oscillations present when using high order polynomials is to
use lower order polynomials to interpolate subsets of the data and assemble the overall
approximating function piecewise. This is what is called “spline interpolation”.

Spline Interpolation
Any piecewise interpolation of data by low order functions is called spline interpolation; the simplest and most widely used is the piecewise linear interpolation, or simply joining consecutive data points by straight lines.

First Order Splines


If we have a set of data points (xi, yi), i = 1, …, n, the first order splines (straight lines) can be defined as:

f(x) = y1 + m1 (x − x1);    x1 ≤ x ≤ x2    (4.34)
f(x) = y2 + m2 (x − x2);    x2 ≤ x ≤ x3    (4.35)
...
f(x) = yi + mi (x − xi);    xi ≤ x ≤ xi+1    (4.36)

with the slopes given by: mi = (yi+1 − yi)/(xi+1 − xi)

Quadratic and Cubic Splines


First order splines are straightforward. To fit a line in each interval between data points, we need only 2 pieces of data: the data values at the 2 points. If we now want to fit a higher order function, for example a second order polynomial (a parabola), we need to determine 3 coefficients and, in general, for an m-order spline we need m+1 equations.
If we want a smooth function over the complete domain, we should impose continuity of the representation in contiguous intervals, and require as many derivatives as possible to be continuous too at the adjoining points.
For n+1 points xi, i = 0,…,n, there are n intervals and consequently n functions to determine. If we use quadratic splines (second order), their equations will be of the form fi(x) = ai x^2 + bi x + ci, and we need 3 equations per interval to find the 3 coefficients, that is, a total of 3n equations.
We can establish the following conditions to be satisfied:
1) The values of the functions corresponding to adjacent intervals must be equal at the common interior nodes (no discontinuities at the nodes). This can be represented by:

fi(xi) = ai xi^2 + bi xi + ci = ai+1 xi^2 + bi+1 xi + ci+1    (4.37)

for i = 1,…,n−1; that is, at the node xi, the boundary between intervals i and i+1, the functions defined in each of these intervals must coincide. This gives us a total of 2n−2 equations (there are 2 equalities in the line above and n−1 interior points).
2) The first and last functions must pass through the first and last points (end points). This gives us another 2 equations.
3) The first derivative at the interior nodes must be continuous. This can be written as fi'(xi) = fi+1'(xi) for i = 1,…,n−1. Then:

2ai xi + bi = 2ai+1 xi + bi+1    (4.38)

and we have another n−1 equations.
All these give us a total of 3n−1 equations when we need 3n. An additional condition must then be established. Usually this can be chosen at the end points, for example stating that a1 = 0 (this corresponds to asking for the second derivative to be zero at the first point and results in the two first points being joined by a straight line).

For example, for a set of 4 data points, we can establish the necessary equations as listed above,
giving a total of 8 equations for 8 unknowns (having fixed a1 = 0 already). This can be solved
by the matrix techniques that we will study later.
Quadratic splines have some shortcomings that are not present in cubic splines, so these are preferred. However, their calculation is even more cumbersome than that of quadratic splines. In this case, the function and the first and second derivatives are continuous at the nodes.
Because of their popularity, cubic splines are commonly found in computer libraries; for example, Matlab has a standard function that calculates them.

To illustrate the advantages of this technique, Fig. 4.5 shows the cubic interpolation splines (green line) for the same function described in the last example, compared with the highly oscillatory result of a single polynomial interpolation through the 8 data points (which gives a polynomial of order 7).

Fig. 4.5

Using the built-in Matlab function, the code for plotting this graph is simply:

xd=[0,1,2,2.2,2.7,3,4,5]';
yd=[2,1,3,3.5,4.25,4.8,5.8,4.8]';
x=0:0.05:5;
y=spline(xd,yd,x);
plot(x,y,'g','Linewidth',2.5)
hold on        %keep the spline curve when adding the data markers
plot(xd,yd,'ok','MarkerSize',10,'MarkerFaceColor','w','LineWidth',2)

Note that the drawing of the 7th-order interpolating polynomial (red line in Fig. 4.5) is not included in this piece of code and that the last line simply draws the markers.

Exercise 4.4
Using Matlab, plot the function f(x) = 0.1·x·e^(1.2·sin²x) in the interval [0, 4] and construct a cubic spline interpolation using the values of the function at the points xd = 0:0.5:4 (in Matlab notation). Use Excel to construct a table of divided differences for this function at the points xd and find the coefficients of the Newton interpolation polynomial. Use Matlab to plot the corresponding polynomial in the same figure as the splines and compare the results.

If the main objective is to create a smooth function to represent the data, it is sometimes preferable to choose a function that doesn't necessarily pass exactly through the data points but approximates their overall behaviour. This is what is called “approximation”. The problems here are how to choose the approximating function and what is considered the best choice.

Approximation
There are many different ways to approximate a function and you have seen some of them
in detail already. Taylor expansions, least squares curve fitting and Fourier series are examples
of this.
Methods like the “least squares” look for a single function or polynomial to approximate
the desired function. Another approach, of which the Fourier series is an example, consists of
page 29 E763 (part 2) Numerical Methods

using a family of simple functions to build an expansion that approximates the given function.
The problem then is to find the appropriate set of coefficients for that expansion. Taylor series
are somehow related, the main difference is that while the other methods based on expansions
attempt to find an overall approximation, Taylor series are meant to approximate the function at
one particular point and its close vicinity.

Least Squares Curve Fitting


One of the most common problems of approximation is fitting a curve to experimental data. In this case, the usual objective is to find the curve that approximates the data minimizing some measure of the error. In the method of least squares the measure chosen is the sum of the squares of the differences between the data and the approximating curve.
If the data is given by (xi, yi), i = 1, …, n, we can define the error of the approximation by:

E = Σ_{i=1}^{n} (yi − f(xi))^2    (4.39)
The commonest choices for the approximating functions are polynomials, straight lines,
parabolas, etc. Then, the minimization of the error will give the necessary relations to determine
the coefficients.

Approximation by a Straight Line


The equation of a straight line is f(x) = a + bx, so the error to minimise is:

E(a, b) = Σ_{i=1}^{n} (yi − (a + b xi))^2    (4.40)

The error is a function of the parameters a and b that define the straight line. Then, the minimization of the error can be achieved by making the derivatives of E with respect to a and b equal to zero. These conditions give:

∂E/∂a = −2 Σ_{i=1}^{n} (yi − (a + b xi)) = 0    ⇒    Σ yi − n a − b Σ xi = 0    (4.41)

∂E/∂b = −2 Σ_{i=1}^{n} (yi − (a + b xi)) xi = 0    ⇒    Σ xi yi − a Σ xi − b Σ xi^2 = 0    (4.42)

which can be simplified to:

n a + b Σ xi = Σ yi    (4.43)

a Σ xi + b Σ xi^2 = Σ xi yi    (4.44)
Solving the system for a and b gives:
a = ( ∑ x_i^2 ∑ y_i − ∑ x_i ∑ x_i y_i ) / ( n ∑ x_i^2 − (∑ x_i)^2 )    and    b = ( n ∑ x_i y_i − ∑ x_i ∑ y_i ) / ( n ∑ x_i^2 − (∑ x_i)^2 )        (4.45)

(all sums running from i = 1 to n).

Example
Fitting a straight line to the data given by:
xd = {0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3} and
yd = {0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95}

Then, ∑_{i=1}^{10} xd_i = 15.15,  ∑_{i=1}^{10} yd_i = 15.08,  ∑_{i=1}^{10} xd_i^2 = 32.8425  and  ∑_{i=1}^{10} xd_i yd_i = 32.577

and the parameters of the straight line are:

a = 0.01742473648290    and    b = 0.98387806172746

Fig. 4.6 shows the approximating straight line (red) together with the curve for Newton
interpolation (black) and for cubic splines (blue). The problems occurring with the use of
higher order polynomial interpolation are also evident.

[Fig. 4.6: least squares line fit, cubic splines and Newton interpolation for the data above.]
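The values of a and b above can be checked with a few lines of Matlab; this is only a minimal sketch applying the closed-form expressions (4.45) to the data of the example:

% Least squares straight line fit using the closed-form expressions (4.45)
xd = [0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3];
yd = [0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95];
n  = length(xd);
Sx = sum(xd);  Sy = sum(yd);  Sxx = sum(xd.^2);  Sxy = sum(xd.*yd);
den = n*Sxx - Sx^2;
a = (Sxx*Sy - Sx*Sxy)/den           % intercept, approx. 0.0174
b = (n*Sxy - Sx*Sy)/den             % slope, approx. 0.9839
plot(xd, yd, 'ok', xd, a + b*xd, 'r-')   % data and fitted line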

Example
The data (xd, yd) shown in Fig. 4.7 appears to behave in an exponential manner. Then, defining
the variable zd i = log( yd i ) , zd should vary linearly with xd.
We can then fit a straight line to the pair of variables (xd, zd). If the fitting function is the
straight line z(x), then the function

y = e^{z(x)}

approximates the original data (it is the least squares fit of log(yd), rather than of yd itself).

[Fig. 4.7: the data (xd, yd), which grow in an approximately exponential manner.]
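A minimal Matlab sketch of this procedure follows; the data here are hypothetical (generated only for illustration, since the values behind Fig. 4.7 are not listed in the notes):

% Fit y = exp(z(x)) by fitting a straight line to z = log(y)
xd = 0:0.25:3;                                    % hypothetical data
yd = exp(0.2 + 1.05*xd).*(1 + 0.05*randn(size(xd)));
zd = log(yd);                  % transformed data: fit a straight line to (xd, zd)
p  = polyfit(xd, zd, 1);       % p(1) = slope, p(2) = intercept of z(x)
x  = linspace(0, 3, 200);
y  = exp(polyval(p, x));       % back-transformed fit y = e^{z(x)}
plot(xd, yd, 'ok', x, y, 'r-')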

Second Order Approximation


If a second order curve is used for the approximation, f(x) = a + bx + cx^2, there are 3 parameters
to find and the error is given by:

E(a, b, c) = ∑_{i=1}^{n} ( y_i − (a + b x_i + c x_i^2) )^2        (4.45)

Making the derivatives of E with respect to a, b and c equal to zero will give the necessary 3
equations for the coefficients of the parabola. The expressions are similar to those found for the
straight line fit although more complicated.
Matlab has standard functions to perform least squares approximations with polynomials of any
order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the
function:

coeff = polyfit(xd, yd, m)

returns the coefficients of the polynomial.


If now x is the array of values where the approximation is wanted,

y = polyval(coeff, x)

returns the values of the polynomial fit at the points x.
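For example, a second-order fit to the data of the earlier straight-line example could be obtained as in the following sketch (any pair of data vectors of equal length can be used instead):

xd = [0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3];      % data from the earlier example
yd = [0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95];
coeff = polyfit(xd, yd, 2);              % coefficients of the least squares parabola (m = 2)
x = linspace(0, 3, 200);
y = polyval(coeff, x);                   % fit evaluated at the points x
plot(xd, yd, 'ok', x, y, 'r-')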

Exercise 4.5
Find the coefficients of a second order polynomial that approximates the function s(x) = e^x in
the interval [−1, 1] in the least squares sense, using the points xd = [−1, 0, 1]. Plot the function
and the approximating polynomial together with the Taylor approximation (Taylor series
truncated to second order) for comparison.

Approximation using Continuous Functions


The same idea of "least squares", that is, to try to minimize an error expression in the least
squares sense, can also be applied to the approximation of continuous functions. Imagine that instead of
discrete data sets we have a rather complicated analytical expression representing the function of
interest. This case is rather different from all we have seen previously because now the value of
the desired quantity is known exactly at every point, but it might be desirable to have a
simpler expression representing this behaviour, at least in a certain domain, for further analysis
purposes. For example, if the function is s(x) and we want to approximate it using a second order
polynomial, we can formulate the problem as minimizing the square of the difference over the
domain of interest, that is:

E = ∫ (ax^2 + bx + c − s(x))^2 dx        (4.46)

Again, asking for the derivatives of E with respect to a, b and c to be zero will give us the
necessary equations to find these coefficients. There is one problem though: the resulting matrix
problem, particularly for large systems (high order polynomials), is normally very ill-
conditioned, so some special precautions need to be taken.
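As an illustration, here is a minimal Matlab sketch that builds and solves the 3x3 system of equations obtained by setting the derivatives of (4.46) to zero, for s(x) = e^x on [-1, 1] (the choice of s, the basis ordering and the use of the built-in integral function are assumptions of this sketch, not part of the notes):

% Least squares fit of a quadratic a*x^2 + b*x + c to s(x) on [-1, 1]
s   = @(x) exp(x);                                 % function to be approximated
phi = {@(x) x.^2, @(x) x, @(x) ones(size(x))};     % basis: x^2, x, 1
M = zeros(3); rhs = zeros(3,1);
for i = 1:3
  for j = 1:3
    M(i,j) = integral(@(x) phi{i}(x).*phi{j}(x), -1, 1);   % inner products of basis terms
  end
  rhs(i) = integral(@(x) phi{i}(x).*s(x), -1, 1);          % right-hand side
end
coef = M\rhs                                       % coef = [a; b; c]

For only three unknowns the system is well behaved; the ill-conditioning mentioned above becomes a problem when the polynomial order grows.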

Approximation using Orthogonal Functions


We can extend the same ideas of least squares to the approximation of a given function by an
expansion in terms of a set of “basis functions”. You are already familiar with this idea from the
use of Fourier expansions to approximate functions, but the concept is not restricted to use of
sines and cosines as basis functions.
First of all, we call a base a family of functions that satisfies a set of properties. Similarly to
what you know about vectors, for example, the set of unit vectors x̂, ŷ and ẑ constitutes a base in
the 3D space because any vector in that space can be expressed as a combination of these three.
Naturally, we don't actually need these vectors to be perpendicular to each other, but that helps
(as we'll see later). However, the base must be complete, that is, no vector in 3D escapes this
representation. For example, if we only consider the vectors x̂ and ŷ, no combination of them
can possibly have a component along the ẑ axis and the resultant vectors are restricted to the xy-
plane. In the same sense, we want a set of basis functions to be complete (as sines and cosines
are) in order that any function can have its representation by an expansion. The difference now
is that, unlike the 3D space, we need an infinite set (that is the dimension of the space of
functions).
So, if we select a complete set of basis functions φ_k(x), we can represent our function by:

f(x) ≈ f̃_n(x) = ∑_{k=1}^{n} c_k φ_k(x)        (4.47)

an expansion truncated to n terms.

In this context, the error of this approximation is the difference between the exact and
the approximate functions, r_n(x) = f(x) − f̃_n(x), and we can use the least squares ideas again,
seeking to minimise the norm of this error; that is, the quantity (error residual):

R_n = ‖r_n(x)‖^2 ≡ ∫ ( f(x) − f̃_n(x) )^2 dx        (4.48)

with the subscript n because the error residual above is also a function of the truncation level.
This concept of norm is analogous to that of the norm or modulus of a vector: ‖v‖^2 = v·v = ∑_{i=1}^{N} v_i^2.
In order to extend it to functions we need to introduce the inner product of functions (analogous
to the dot product of vectors):
If we have two functions f and g defined over the same domain Ω, their inner product is
the quantity:

⟨f, g⟩ = ∫ f(x) g(x) dx        (4.49)

The inner product satisfies the following properties:


1. ⟨f, f⟩ > 0 for all nonzero functions
2. ⟨f, g⟩ = ⟨g, f⟩ for all functions f and g.
3. ⟨f, αg_1 + βg_2⟩ = α⟨f, g_1⟩ + β⟨f, g_2⟩ for all functions f and g and scalars α and β.

Note: The above definition of inner product is sometimes extended using a weighting function
w(x), in the form ⟨f, g⟩ = ∫ w(x) f(x) g(x) dx, and provided that w(x) is non-negative it satisfies
all the required properties.

In a similar form as with the dot product of vectors, ‖f(x)‖^2 = ⟨f(x), f(x)⟩.
Using the inner product definition, the error expression (4.48) can be written as:

R_n = ‖f(x) − f̃_n(x)‖^2 = ∫ ( f(x) − f̃_n(x) )^2 dx = ‖f(x)‖^2 − 2⟨f(x), f̃_n(x)⟩ + ‖f̃_n(x)‖^2        (4.50)

and if we write f̃_n(x) as f̃_n(x) = ∑_{k=1}^{n} c_k φ_k(x), as above, we get:

R_n = ‖f(x)‖^2 − 2 ∑_{k=1}^{n} c_k ⟨f(x), φ_k(x)⟩ + ∑_{k=1}^{n} ∑_{j=1}^{n} c_k c_j ⟨φ_k(x), φ_j(x)⟩        (4.51)

We can see that the error residual Rn, is a function of the coefficients ck of the expansion and
then, to find those values that minimize this error, we can make the derivatives of Rn with respect
to ck equal to zero for all k. That is:

∂R_n/∂c_k = 0    for k = 1, …, n.        (4.52)

The first term is independent of ck, so it will not contribute and the other two will yield the
general equation:
∑_{j=1}^{n} c_j ⟨φ_k(x), φ_j(x)⟩ = ⟨f(x), φ_k(x)⟩    for k = 1, …, n.        (4.53)

Writing this down in detail gives:

(k = 1)    ⟨φ_1, φ_1⟩c_1 + ⟨φ_1, φ_2⟩c_2 + ⋯ + ⟨φ_1, φ_n⟩c_n = ⟨f, φ_1⟩
(k = 2)    ⟨φ_2, φ_1⟩c_1 + ⟨φ_2, φ_2⟩c_2 + ⋯ + ⟨φ_2, φ_n⟩c_n = ⟨f, φ_2⟩
⋯
(k = n)    ⟨φ_n, φ_1⟩c_1 + ⟨φ_n, φ_2⟩c_2 + ⋯ + ⟨φ_n, φ_n⟩c_n = ⟨f, φ_n⟩

which can be written as a matrix problem of the form: Φc = s, where the matrix Φ contains all
the inner products (in all combinations), the vector c is the list of coefficients and s is the list of
values in the right hand side (which are all known: we can calculate them all).
We can find the coefficients by solving the system of equations, but we can see that this will
be a much easier task if all the cross products of different basis functions yield zero; that is, if

⟨φ_k(x), φ_j(x)⟩ = 0    for all j ≠ k        (4.54)

This is what is called orthogonality and it is a very useful property of the functions in a base
(similar to what happens with a base of perpendicular vectors: all dot products between
different members of the base are zero).


Then, if the basis functions φ_k(x) are orthogonal, the solution of the system above is
straightforward:

c_k = ⟨f(x), φ_k(x)⟩ / ⟨φ_k(x), φ_k(x)⟩        (4.55)
Remark:
You can surely recognise here the properties of the sine and cosine functions involved in Fourier
expansions: they form a complete set, they are orthogonal, ⟨φ_k(x), φ_j(x)⟩ = 0 where φ_k is
either sin kπx or cos kπx, and the coefficients of the expansions are given by the same
expression (4.55) above.
Fourier expansions are convenient, but sinusoidal functions are not the only or simplest
possibility. There are many other sets of orthogonal functions that can form a base; in particular,
we are interested in polynomials because of the convenience of calculations.

Families of Orthogonal Polynomials


There are many different families of polynomials that can be used in this manner. Some are
more useful than others in particular problems. They often originate as solutions of some
differential equations.

1. Legendre Polynomials
They are orthogonal polynomials in the interval [−1, 1], with weighting function 1. That is,
⟨P_i(x), P_j(x)⟩ = ∫_{−1}^{1} P_i(x) P_j(x) dx = 0    if i ≠ j.

They are usually normalised so that P_n(1) = 1, and their norm in this case is:

⟨P_n(x), P_n(x)⟩ = ∫_{−1}^{1} P_n(x) P_n(x) dx = 2/(2n + 1)        (4.56)
The first few are:

P_0(x) = 1
P_1(x) = x
P_2(x) = (3x^2 − 1)/2
P_3(x) = (5x^3 − 3x)/2
P_4(x) = (35x^4 − 30x^2 + 3)/8
P_5(x) = (63x^5 − 70x^3 + 15x)/8
P_6(x) = (231x^6 − 315x^4 + 105x^2 − 5)/16
P_7(x) = (429x^7 − 693x^5 + 315x^3 − 35x)/16

etc.

[Fig. 4.8: the first few Legendre polynomials on [−1, 1].]

In general they can be defined by the expression:    P_n(x) = ((−1)^n / (2^n n!)) d^n/dx^n (1 − x^2)^n        (4.57)

They also satisfy the recurrence relation:    (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) − n P_{n−1}(x)        (4.58)
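The recurrence (4.58) gives a convenient way of evaluating the polynomials; the following minimal Matlab sketch evaluates and plots P_0 to P_7 on [−1, 1]:

% Evaluate Legendre polynomials P_0 ... P_7 with the recurrence (4.58)
x = linspace(-1, 1, 400);
P = zeros(8, length(x));
P(1,:) = 1;                 % P_0
P(2,:) = x;                 % P_1
for n = 1:6                 % (n+1)P_{n+1} = (2n+1) x P_n - n P_{n-1}
  P(n+2,:) = ((2*n+1)*x.*P(n+1,:) - n*P(n,:))/(n+1);
end
plot(x, P)                  % compare with Fig. 4.8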

2. Chebyshev Polynomials
The general, compact definition of these polynomials is:    T_n(x) = cos(n cos^{−1} x)        (4.59)

and they satisfy the following orthogonality condition:

⟨T_i(x), T_j(x)⟩ = ∫_{−1}^{1} T_i(x) T_j(x) / √(1 − x^2) dx =  { 0 if i ≠ j;   π/2 if i = j ≠ 0;   π if i = j = 0 }        (4.60)

That is, they are orthogonal in the interval [−1, 1] with the weighting function w(x) = 1/√(1 − x^2).
They are characterised by having all their oscillations of the same amplitude, within the interval
[−1, 1]; all their zeros also occur in this interval.
The first few Chebyshev polynomials are:

T_0(x) = 1
T_1(x) = x
T_2(x) = 2x^2 − 1
T_3(x) = 4x^3 − 3x
T_4(x) = 8x^4 − 8x^2 + 1
T_5(x) = 16x^5 − 20x^3 + 5x
T_6(x) = 32x^6 − 48x^4 + 18x^2 − 1
T_7(x) = 64x^7 − 112x^5 + 56x^3 − 7x

… etc.

[Fig. 4.9: the first few Chebyshev polynomials on [−1, 1].]

They can also be constructed from the recurrence relation:

T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x)    for n ≥ 1.        (4.61)

These are not the only possibilities. Other families of polynomials commonly used are the
Hermite polynomials, which are orthogonal over the complete real axis with weighting function
exp(−x^2), and the Laguerre polynomials, orthogonal in [0, ∞) with weighting function e^{−x}.

Example
Approximate the function f(x) = 1/(1 + a^2 x^2), with a = 4, in the interval [−1, 1] in the least squares
sense using Legendre polynomials up to order 8.

The approximation is f̃(x) = ∑_k c_k P_k(x) and the coefficients are:

c_k = (1/‖P_k(x)‖^2) ∫_{−1}^{1} f(x) P_k(x) dx,    with ‖P_k(x)‖^2 = 2/(2k + 1).

From the expression above we can see that the calculation of the coefficients will involve the
integrals:
I_m = ∫_{−1}^{1} x^m / (1 + a^2 x^2) dx

which satisfy the recurrence relation:

I_m = 2/((m − 1)a^2) − I_{m−2}/a^2    with    I_0 = (2/a) tan^{−1} a
We can also see that due to symmetry Im = 0 for m odd. (Integral of an odd function over the
interval [−1, 1]). Also, because of this, only even numbered coefficients are necessary. (Odd
coefficients of the expansion are zero).
The coefficients are then:

c_0 = (1/2) ∫_{−1}^{1} f(x) dx = (1/2) ∫_{−1}^{1} dx/(1 + a^2 x^2) = I_0/2

c_2 = (5/2) ∫_{−1}^{1} P_2(x) f(x) dx = (5/4) ∫_{−1}^{1} (3x^2 − 1)/(1 + a^2 x^2) dx = (5/4)(3I_2 − I_0)

c_4 = (9/2) ∫_{−1}^{1} P_4(x) f(x) dx = (9/16)(35I_4 − 30I_2 + 3I_0)

c_6 = (13/2) ∫_{−1}^{1} P_6(x) f(x) dx = (13/32)(231I_6 − 315I_4 + 105I_2 − 5I_0)

c_8 = (17/2) ∫_{−1}^{1} P_8(x) f(x) dx = (17/256)(6435I_8 − 12012I_6 + 6930I_4 − 1260I_2 + 35I_0)
The results are shown in Fig. 4.10, where the red line corresponds to the approximating curve.
The green line at the bottom shows the absolute value of the difference between the two curves.
The total error of the approximation (integral of this difference, divided by the integral of the
original curve) is 10.5%.

[Fig. 4.10: the function, its Legendre approximation (red) and the absolute error (green) on [−1, 1].]
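A Matlab sketch of this calculation is given below; it evaluates the coefficients numerically (with trapz rather than the exact I_m recurrence, so the numbers are only approximate) and reproduces an overall error of the order of 10%, consistent with the 10.5% quoted above:

% Least squares (Legendre) approximation of f(x) = 1/(1 + a^2 x^2) on [-1, 1]
a = 4;  f = @(x) 1./(1 + a^2*x.^2);
nmax = 8;                                     % highest polynomial order used
x = linspace(-1, 1, 801);
P = zeros(nmax+1, length(x));                 % P(k+1,:) holds P_k(x)
P(1,:) = 1;  P(2,:) = x;
for n = 1:nmax-1
  P(n+2,:) = ((2*n+1)*x.*P(n+1,:) - n*P(n,:))/(n+1);   % recurrence (4.58)
end
fa = zeros(size(x));                          % the truncated expansion
for k = 0:nmax
  ck = (2*k+1)/2 * trapz(x, f(x).*P(k+1,:));  % c_k = <f,P_k>/||P_k||^2
  fa = fa + ck*P(k+1,:);
end
err = trapz(x, abs(f(x) - fa))/trapz(x, f(x)) % total error, roughly 10%
plot(x, f(x), 'b', x, fa, 'r', x, abs(f(x)-fa), 'g')   % cf. Fig. 4.10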

Exercise 4.6
Use cosine functions ( cos(nπ x) ) instead of Legendre polynomials (in the form of a Fourier
series) and compare the results.
Remarks:
We can observe in the figure of the example above that the error oscillates through the
domain. This is typical of least squares approximation, where the overall error is minimised.
In this context, Chebyshev polynomials are the best possible choice: the error obtained
with their approximation is the smallest possible among polynomials up to the same degree.

Furthermore, these polynomials have other useful properties. We have seen before the interpolation
of data by higher order polynomials using equally spaced data points, and the problems that this
causes were quite clear. However, one can then ask whether there is a different distribution of points
that helps in minimising the error. The answer is yes: the optimum arrangement is to locate the
data points on the zeros of the Chebyshev polynomial of the order necessary to give the required
number of data points. These occur for:

x_k = cos( (2k − 1)π / (2n) )    for k = 1, …, n.
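A small Matlab sketch of this idea follows, interpolating the function of the previous example at Chebyshev points and at equally spaced points (polyfit/polyval are used here only for convenience; for large n a more stable construction of the interpolant would be preferable):

% Interpolation at Chebyshev points x_k = cos((2k-1)*pi/(2n))
a = 4;  f = @(x) 1./(1 + a^2*x.^2);
n  = 11;                                    % number of interpolation points
k  = 1:n;
xc = cos((2*k - 1)*pi/(2*n));               % Chebyshev points in [-1, 1]
xe = linspace(-1, 1, n);                    % equally spaced points
x  = linspace(-1, 1, 400);
pc = polyval(polyfit(xc, f(xc), n-1), x);   % interpolant on Chebyshev points
pe = polyval(polyfit(xe, f(xe), n-1), x);   % interpolant on equispaced points
plot(x, f(x), 'k', x, pc, 'b', x, pe, 'r')  % the equispaced one oscillates strongly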

Approximation to a Point
The approximation methods we have studied so far attempt to reach a "global" approximation to
a function; that is, to minimise the error over a complete domain. This might be desirable in
many cases, but there could be others where it is more important to achieve high accuracy
at one particular point of the domain and in its close vicinity. One such method is the Taylor
approximation, where the function is approximated by a Taylor polynomial.

Taylor Polynomial Approximation


Among polynomial approximation methods, the Taylor polynomial gives the maximum possible
"order of contact" between the function and the polynomial; that is, the n-order Taylor
polynomial agrees with the function and with its first n derivatives at the point of contact.
The Taylor polynomial for approximating a function f(x) at a point a is given by:

p(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)^2 + ⋯ + (f^{(n)}(a)/n!)(x − a)^n        (4.62)

and the error incurred is given by (f^{(n+1)}(ξ)/(n + 1)!)(x − a)^{n+1}, where ξ is a point between a and x.
(See Appendix).

Fig. 4.11 shows the function (blue line):

y(x) = sin πx + cos 2πx

with the Taylor approximation (red line) using a Taylor polynomial of order 9 (Taylor series
truncated at order 9). For comparison, the Newton interpolation using 9 equally spaced points
(indicated with * and a black line) is also shown, giving a single polynomial of order 9. While
the Taylor approximation is very good in the vicinity of zero (9 derivatives are also matched
there), it deteriorates badly away from zero.

[Fig. 4.11: the function, its order-9 Taylor approximation and the Newton interpolation on [−1, 1].]

If a function varies rapidly or has a pole, polynomial approximations will not be able to achieve a
high degree of accuracy. In that case an approximation using rational functions will be better:
the polynomial in the denominator provides the means to represent rapid variations and poles.

Padé Approximation
Padé approximants are rational functions, or ratios of polynomials, that fit the value of a
function and a number of its derivatives at one point. They usually provide an approximation
that is better than that of the Taylor polynomials, in particular in the case of functions containing
poles.
A Padé approximation (or Padé approximant) to a function f(x) that can be represented by a Taylor
series in [a, b] is a ratio between two polynomials P_m(x) and Q_n(x) of orders m and n respectively:

f(x) ≈ R_n^m(x) = P_m(x)/Q_n(x) = (a_m x^m + ⋯ + a_2 x^2 + a_1 x + a_0)/(b_n x^n + ⋯ + b_2 x^2 + b_1 x + b_0)        (4.63)

For simplicity, we’ll consider only approximations to a function at x = 0. For other values, a
simple transformation of variables can be used.
If the Taylor approximation at x = 0 (Maclaurin series) to f(x) is:

t(x) = ∑_{i=0}^{k} c_i x^i = c_k x^k + ⋯ + c_2 x^2 + c_1 x + c_0,    with k = m + n        (4.64)

we can write:    t(x) ≈ R_n^m(x) = P_m(x)/Q_n(x),    or    t(x) Q_n(x) = P_m(x).        (4.65)

Considering now this equation in its expanded form:

(c_k x^k + ⋯ + c_2 x^2 + c_1 x + c_0)(b_n x^n + ⋯ + b_2 x^2 + b_1 x + b_0) = a_m x^m + ⋯ + a_2 x^2 + a_1 x + a_0        (4.66)

we can establish a system of equations to find the coefficients of P and Q. First of all, we can
force this condition to apply at x = 0 (exact matching of the function at x = 0). This will give:
t (0)Qn (0) = Pm (0) or:
c0b0 = a0 (4.67)

but, since the ratio R doesn’t change if we multiply the numerator and denominator by any
number, we can choose the value of b0 = 1 and this gives us the value of a0 .
Taking now the first derivative of (4.66) will give:

(k c_k x^{k−1} + ⋯ + 2c_2 x + c_1)(b_n x^n + ⋯ + b_1 x + b_0) +
(c_k x^k + ⋯ + c_1 x + c_0)(n b_n x^{n−1} + ⋯ + 2b_2 x + b_1) = m a_m x^{m−1} + ⋯ + 2a_2 x + a_1        (4.68)

and again, forcing this equation to be satisfied at x = 0 (exact matching of the first derivative),
gives:

c_1 b_0 + c_0 b_1 = a_1        (4.69)

which gives an equation relating the coefficients a_1 and b_1:    a_1 − c_0 b_1 = c_1 b_0        (4.70)


To continue with the process of taking derivatives of (4.66) we can establish first some general
formulae. If we call g(x) the product on the left hand side of (4.66), we can
apply the product rule repeatedly to form the derivatives, giving:

g'(x) = t(x)Q'(x) + t'(x)Q(x)
g''(x) = t(x)Q''(x) + 2t'(x)Q'(x) + t''(x)Q(x)
⋯
g^{(i)}(x) = ∑_{j=0}^{i} [ i!/(j!(i−j)!) ] t^{(j)}(x) Q^{(i−j)}(x)        (4.71)

and since we are interested in the values of the derivatives at x = 0, we have:

g^{(i)}(0) = ∑_{j=0}^{i} [ i!/(j!(i−j)!) ] t^{(j)}(0) Q^{(i−j)}(0)        (4.72)

The first derivative of a polynomial, say Q_n(x), is n b_n x^{n−1} + ⋯ + 2b_2 x + b_1, so the second
derivative will be (n−1)n b_n x^{n−2} + ⋯ + 2·3 b_3 x + 2b_2, and so on. These derivatives
evaluated at x = 0 are successively: b_1, 2b_2, (2·3)b_3, (2·3·4)b_4, …, j! b_j

Then, we can write (4.72) as:

g^{(i)}(0) = ∑_{j=0}^{i} [ i!/(j!(i−j)!) ] j! c_j (i−j)! b_{i−j} = i! ∑_{j=0}^{i} c_j b_{i−j}

and equating this to the ith derivative of P_m(x) evaluated at x = 0, that is i! a_i, gives:

a_i = ∑_{j=0}^{i} c_j b_{i−j}, but since b_0 = 1 we can finally write:

a_i − ∑_{j=0}^{i−1} c_j b_{i−j} = c_i    for i = 1, …, k, (k = m + n).        (4.73)
where we take the coefficients ai = 0 for i > m and bi = 0 for i > n .

Example
Consider the function f(x) = 1/√(1 − x). This function has a pole at x = 1 and polynomial
approximations will not perform well.
The Taylor coefficients of this function are given by:    c_j = (1·3·5⋯(2j − 1))/(2^j j!)
So the first five are:  c_0 = 1,  c_1 = 1/2,  c_2 = 3/8,  c_3 = 5/16,  c_4 = 35/128,  which gives a
polynomial of order 4.
From the equations above we can calculate the Padé coefficients. We choose: m = n =2, so
k = m + n = 4.

Taking b0 = 1 , c0b0 = a0 gives a0 = 1

The other equations (from asking the derivatives to fit) are:



a_i − ∑_{j=0}^{i−1} c_j b_{i−j} = c_i    for i = 1, …, k, (k = m + n)

and in detail:

a_1 − c_0 b_1 = c_1
a_2 − c_0 b_2 − c_1 b_1 = c_2
a_3 − c_0 b_3 − c_1 b_2 − c_2 b_1 = c_3
a_4 − c_0 b_4 − c_1 b_3 − c_2 b_2 − c_3 b_1 = c_4

but for m = n = 2 we have a_3 = a_4 = b_3 = b_4 = 0 and the system can be written as:

a_1 − c_0 b_1 = c_1
a_2 − c_0 b_2 − c_1 b_1 = c_2
− c_1 b_2 − c_2 b_1 = c_3
− c_2 b_2 − c_3 b_1 = c_4

and we can see that the last two equations can be solved for the coefficients b_i. Re-writing
this pair:

c_2 b_1 + c_1 b_2 = − c_3
c_3 b_1 + c_2 b_2 = − c_4
In general, when n = m = k/2, as in this case, the matrix that defines the system will be of the
form:

[ r_0       r_1       r_2       …   r_{n−1} ]
[ r_{−1}    r_0       r_1       …   r_{n−2} ]
[ r_{−2}    r_{−1}    r_0       …   r_{n−3} ]
[ …         …         …         …   …       ]
[ r_{−n+1}  r_{−n+2}  r_{−n+3}  …   r_0     ]

This is a special kind of matrix called a Toeplitz matrix, which has the same element along each
diagonal, so it is defined by just 2n − 1 numbers. There are efficient methods for solving systems
with this type of matrix. Solving the system gives:

a = [1, −3/4, 1/16]    and    b = [1, −5/4, 5/16]


giving:

R_2^2(x) = (a_0 + a_1 x + a_2 x^2)/(b_0 + b_1 x + b_2 x^2) = (1 − 0.75x + 0.0625x^2)/(1 − 1.25x + 0.3125x^2)
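The calculation can be checked with a few lines of Matlab; this is only a sketch for the particular case m = n = 2, using the Taylor coefficients listed above:

% Pade approximant R_2^2 of f(x) = 1/sqrt(1-x) from its Taylor coefficients
c  = [1, 1/2, 3/8, 5/16, 35/128];          % c_0 ... c_4 (Maclaurin series)
M  = [c(3) c(2); c(4) c(3)];               % [c2 c1; c3 c2]
b  = [1; M \ [-c(4); -c(5)]];              % b = [b0; b1; b2], with b0 = 1
a0 = c(1);
a1 = c(2) + c(1)*b(2);                     % a1 = c1 + c0*b1
a2 = c(3) + c(2)*b(2) + c(1)*b(3);         % a2 = c2 + c1*b1 + c0*b2
a  = [a0; a1; a2]                          % expect [1; -3/4; 1/16]
b                                          % expect [1; -5/4; 5/16]
x  = linspace(0, 0.99, 200);
R  = (a(1) + a(2)*x + a(3)*x.^2)./(b(1) + b(2)*x + b(3)*x.^2);
plot(x, 1./sqrt(1-x), 'b', x, R, 'r')      % cf. Fig. 4.12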

Fig. 4.12 shows the function f(x) = 1/√(1 − x) (blue line), the Padé approximant (red line)

R_2^2(x) = (1 − 0.75x + 0.0625x^2)/(1 − 1.25x + 0.3125x^2)

and the Taylor polynomial up to 4th order, P(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4, with a
green line.

[Fig. 4.12: the function, its Padé approximant R_2^2 and the 4th-order Taylor polynomial near the pole at x = 1.]

It is clear that the Padé approximant gives a better fit, particularly closer to the pole.

Increasing the order, the Padé approximation gets closer to the singularity.
Fig. 4.13 shows the approximations for m = n = 1, 2, 3 and 4.
The poles closest to 1 of these functions are:

R_1^1(x): 1.333
R_2^2(x): 1.106
R_3^3(x): 1.052
R_4^4(x): 1.031

[Fig. 4.13: Padé approximants R_m^m, m = 1, …, 4, near the pole at x = 1.]
We can see that even R11 ( x ) , a ratio of linear functions of x, gives a better approximation than
the 4th order Taylor polynomial.

Exercise 4.7
Using the Taylor (Maclaurin) expansion of f(x) = cos x, truncated to order 4:

t(x) = ∑_{k=0}^{4} c_k x^k = 1 − x^2/2 + x^4/24

find the Padé approximant:

R_2^2(x) = P_2(x)/Q_2(x) = (a_2 x^2 + a_1 x + a_0)/(b_2 x^2 + b_1 x + b_0)

5. MATRIX COMPUTATIONS
We have seen earlier that a number of issues arise when we consider errors in the
calculations dealing with machine numbers. When matrices are involved, the problems of
accuracy of representation, error propagation and sensitivity of the solutions to small variations
in the data are much more important.
Before discussing any methods of solving matrix equations, we consider first the rather
fundamental matrix property of ‘condition number’.

‘Condition’ of a Matrix
We have seen that multiplying or dividing two floating-point numbers gives an error of the
order of the 'last preserved bit'. If, say, two numbers are held to 8 decimal digits, the resulting
product (or quotient) will effectively have its least significant bit 'truncated' and therefore have
a relative uncertainty of ±10^{−8}.
By contrast, with matrices and vectors, multiplying (that is, evaluating y = Ax) or
‘dividing’ (that is, solving Ax = y for x) can lose in some cases ALL significant figures!
Before examining this problem we have to define matrix and vector norms.

VECTOR AND MATRIX NORMS


To introduce the idea of ‘length’ into vectors and matrices, we have to consider norms:

VECTOR NORMS
If x^T = (x_1, x_2, …, x_n) is a real or complex vector, a general norm is denoted by ‖x‖_N and is
defined by:

‖x‖_N = ( ∑_{i=1}^{n} |x_i|^N )^{1/N}        (5.1)

So the usual "Euclidean norm", or "length", is ‖x‖_2 or simply ‖x‖:

‖x‖_2 = ( x_1^2 + … + x_n^2 )^{1/2}        (5.2)

Other norms are used, e.g. ‖x‖_1 and ‖x‖_∞, the latter corresponding to the greatest in
magnitude |x_i| (show this as an exercise).

MATRIX NORM
If A is an n-by-n real or complex matrix, we denote its norm defined by:

‖A‖_N = max_{x≠0} ( ‖Ax‖_N / ‖x‖_N )    for any choice of the vector x        (5.3)

According to our choice of N in defining the vector norms by (5.1), we have the
corresponding norms ‖A‖_1, ‖A‖_2, ‖A‖_∞, the Euclidean N = 2 being the commonest. Note that
(ignoring the question "how do we find its value"), for given A and N, ‖A‖ has some specific
numerical value ≥ 0.

‘Condition’ of a Linear System Ax = y


This is in the context of Hadamard’s general concept of a ‘well-posed-problem’, which is
roughly one where the result is not too sensitive to small changes in the problem specification.

Definition: The problem of finding x, satisfying Ax = y, is well posed or well conditioned if:

(i) a unique x satisfies Ax = y , and


(ii) small changes in either A or y result in small changes in x.

For a quantitative measure of “how well conditioned” a problem is, we need to estimate
the amount of variation in x when y varies and/or the variation in x when A changes slightly or
the corresponding changes in y when either (or both) x and A vary.
Suppose A is fixed, but y changes slightly to y + δy, with the associated x changing to
x + δx. We have:
Ax = y, (5.4)

and so: A (x + δx) = y + δy

Subtracting gives:
A δx = δy or δx = A–1 δy (5.5)

From our definition (5.3), we must have for any A and z:

‖Az‖/‖z‖ ≤ ‖A‖    and so    ‖Az‖ ≤ ‖A‖·‖z‖        (5.6)

Taking the norm of both sides of (5.5) and using inequality (5.6) gives:

‖δx‖ = ‖A^{−1}δy‖ ≤ ‖A^{−1}‖·‖δy‖        (5.7)

Taking the norm of (5.4) and using inequality (5.6) gives:

‖y‖ = ‖Ax‖ ≤ ‖A‖·‖x‖        (5.8)

Finally, multiplying corresponding sides of (5.7) and (5.8) and dividing by ‖x‖·‖y‖ gives our
fundamental result:

‖δx‖/‖x‖ ≤ ‖A‖·‖A^{−1}‖ ( ‖δy‖/‖y‖ )        (5.9)

For any square matrix A we introduce its condition number and define:

cond(A) = ‖A‖·‖A^{−1}‖        (5.10)

We note that a ‘good’ condition number is small, near to 1.



Relevance of the Condition Number


The quantity ‖δy‖/‖y‖ can be interpreted as a measure of relative uncertainty in the vector
y. Similarly, ‖δx‖/‖x‖ is the associated relative uncertainty in the vector x.
From equations (5.9) and (5.10), cond(A) gives an upper bound (worst case) factor of
degradation of precision between y and x = A^{−1}y. Note that if we re-write the equations from
(5.4) for A^{−1} instead of A (and reversing x and y), equation (5.9) will be the same but with x and
y reversed and (5.10) would remain unchanged. These two equations therefore give the
important result that:
If A denotes the precise transformation y = Ax and δx, δy are related small changes in x
and y, the ratio:

( ‖δx‖/‖x‖ ) / ( ‖δy‖/‖y‖ )    must lie between 1/cond(A) and cond(A).

Numerical Example
Here is an example using integers for total precision. Suppose:

100 99
A= 
 99 98
We then have:
 1000 1000
Ax = y: A = 
− 1000 1000
Shifting x slightly gives:
 1001 1199
A = 
− 999 1197 
Alternatively, shifting y slightly:
 803 1001
A = 
− 801  999 

So a small change in y can cause a big change in x, or vice versa. We have this clear moral,
concerning any matrix multiplication or (effectively) inversion:
For given A, either multiplying (Ax) or 'dividing' (A^{−1}y) can be catastrophic, the degree
of catastrophe depending on cond(A) and on the 'direction' of change in x or y.
In the above example, cond(A) is about 40 000.
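This behaviour is easy to reproduce in Matlab (a minimal sketch; cond uses the 2-norm by default, so the exact value depends on the norm chosen):

% Sensitivity of A*x = y for an ill-conditioned matrix
A  = [100 99; 99 98];
x  = [1000; -1000];   y  = A*x            % y = (1000, 1000)^T
x2 = [1001; -999];    y2 = A*x2           % a 0.1% change in x ...
disp(norm(y2 - y)/norm(y))                % ... gives a change in y of about 20%
cond(A)                                   % of the order of 4e4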

Matrix Computations
We now consider methods for solving matrix equations. The most common problems
encountered are of the form:
Ax=y (5.11)

or A x = 0, requiring det(A) = 0 (5.12)

(we will consider this a special case of (5.11)) or



A x = k^2 B x        (5.13)

where A (and B) are known n×n matrices and x and k2 are unknown.
Usually B (and sometimes A) is positive definite (meaning that x^T B x > 0 for all x ≠ 0).
A is sometimes complex (and also B), but numerically the difference is straightforward, and so
we will consider only real matrices.

Types of Matrices A and B (Sparsity Patterns)


We can classify the problems (or the matrices) according to their sparsity (that is, the
proportion of zero entries). This will have a strong relevance on the choice of solution method.
In this form, the main categories are: dense, where most of the matrix elements are nonzero and
sparse, where a large proportion of them are zero. Among the sparse, we can distinguish several
types:
Banded matrices: all nonzero elements are grouped in a band around the main diagonal
(fixed and variable band).
Arbitrarily sparse matrices.
Finally, sparse matrices with any pattern, but where the elements are not stored; that is,
elements are 'generated' or calculated each time they are needed in the solving algorithm.

* * * * 0 0
* * * * * 0

* * * * * *
 
* * * * * *
0 * * * * *
 
0 0 * * * *
Zeros and non-zeros in band matrix of semi-bandwidth 4

We can also distinguish between different solution methods, basically:

DIRECT where the solution emerges in a finite number of calculations (if we temporarily
ignore round-off error due to finite word-length).

INDIRECT or iterative, where a step-by-step procedure converges towards the correct solution.

Indirect methods can be specially suited to sparse matrices (especially when the order is large) as
they can often be implemented without the need to store the entire matrix A (or intermediate
forms of matrices) in high speed memory.

All the common direct routines are available in software libraries and in books and journals,
most commonly in Fortran or Fortran90/95, but also some in C.

Direct Methods of Solving Ax = y

– Gauss elimination or LU decomposition


The classic solution method of (5.11) is by the Gauss method. For example, given the system:
[ 1  4   7 ] [x1]   [ 1]
[ 2  5   8 ] [x2] = [ 1]        (5.14)
[ 3  6  11 ] [x3]   [ 1]

we subtract 2 times the first row from the second row, and then we subtract 3 times the first row
from the third row, to give:

[ 1   4    7 ] [x1]   [ 1]
[ 0  -3   -6 ] [x2] = [-1]        (5.15)
[ 0  -6  -10 ] [x3]   [-2]

and then subtracting 2 times the second row from the third row gives:

[ 1   4   7 ] [x1]   [ 1]
[ 0  -3  -6 ] [x2] = [-1]        (5.16)
[ 0   0   2 ] [x3]   [ 0]

The steps from (5.14) to (5.16) are termed ‘triangulation’ or ‘forward-elimination’. The
triangular form of the left-hand matrix of (5.16) is crucial; it allows the next steps.
The third row immediately gives:
x3 = 0 (5.17a)
and substitution into row 2 gives:

(–3) x2 + 0 = –1 and so: x2 =1/3 (5.17b)


and then into row 1 gives:
x1 + 4 (1/3) + 0 = 1 and so: x1 = –1/3 (5.17c)

The steps through (5.17) are termed ‘back-substitution’. We now ignore a complication of
‘pivoting’, a technique sometimes required to improve numerical stability and which consists of
changing the order of rows and columns.
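A minimal Matlab sketch of forward elimination and back-substitution (without pivoting) for the system (5.14) is given below:

% Gauss elimination (no pivoting) and back-substitution for A*x = y
A = [1 4 7; 2 5 8; 3 6 11];  y = [1; 1; 1];
n = length(y);
for k = 1:n-1                          % forward elimination (triangulation)
  for i = k+1:n
    m = A(i,k)/A(k,k);
    A(i,k:n) = A(i,k:n) - m*A(k,k:n);
    y(i) = y(i) - m*y(k);
  end
end
x = zeros(n,1);
for i = n:-1:1                         % back-substitution
  x(i) = (y(i) - A(i,i+1:n)*x(i+1:n))/A(i,i);
end
x                                      % expect [-1/3; 1/3; 0]
detA = prod(diag(A))                   % determinant = product of the pivots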

Important points about this algorithm:

1. When performed on a dense matrix, (or on a sparse matrix but not taking advantage of
the zeros), computing time is proportional to n3 (n: order of the matrix). This means that
doubling the order of the matrix will increase computation time by up to 8 times!
2. The determinant comes immediately as the product of the diagonal elements of (5.16).
3. Algorithms that take advantage of the special ‘band’ and ‘variable-band’ are very
straightforward, by changing the limits of the loops when performing row or column operations,
and some ‘book-keeping’. For example, in a matrix of ‘semi-bandwidth’ 4, the first column has
non-zero elements only in the first 4 rows as in the figure. Then only those 4 numbers need
storing, and only the 3 elements below the diagonal need ‘eliminating’ in the first column.
4. Oddly, it turns out that, in our context, one should NEVER find the inverse matrix A^{-1} in
order to solve Ax = y for x. Even if it needs doing for a number of different right-hand-side
vectors, y, it is better to ‘keep a record’ of the triangular form of (5.16) and back-substitute as
necessary.
5. Other methods very similar to Gauss are due to Crout and Choleski. The latter is (only)
for use with symmetric matrices. Its advantage is that time AND storage are half that of the
orthodox Gauss.
6. There are variations in the implementation of the basic method developed to take
advantage of the special type of sparse matrices encountered in some cases, for example when
solving problems using finite differences or finite elements. One of these is the frontal method;
here, elimination takes place in carefully controlled manner, with intermediate results being kept
in backing-store. Another variation consists of only storing the ‘nonzero’ elements in the matrix,
at the expense of a great deal of ‘book-keeping’ and reordering (renumbering) rows and columns
through the process, in the search for the best compromise between numerical stability and fill-
in.

Iterative Methods of Solving Ax = y


We will outline 3 methods: (1) Gradient methods and its most popular version, the conjugate
gradient method (2a) the Jacobi (or simultaneous displacement) and (2b) the closely related
Gauss-Seidel (or successive displacement) algorithm.

1.– Gradient Methods


Although known and used for decades, the method has come, in the 1980’s, to be adopted
as one of the most popular iterative algorithms for solving linear systems of equations: A x = y.
The fundamental idea of this approach is to introduce an error residual A x – y for some trial
vector x and proceed to minimize the residual with respect to the components of x..
The equation to be solved for x:
Ax=y (5.18)

can be recast as: finding x to minimize the ‘error-residual’, a column vector r defined as a
function of x by:
r=y–Ax (5.19)

The general idea of this kind of methods is to search for the solution (minimum of the error
residual) in a multidimensional space (of the components of vector x), starting from a point x0
and choosing a direction to move. The optimum distance to move along that direction can then
be calculated.
In the steepest descent method, the simplest form of this method, these directions are
chosen as those of the gradient of the error residual at each iteration point. Because of this, they
will be mutually orthogonal and then there will be no more than n different directions. In 2D
(see figure below) this means that every time we’ll have to make a change of direction at right
angle to the previous, but this will not always allow us to reach the minimum or at least not in an
efficient way.
The norm ‖r‖ of this residual vector is an obvious choice for the quantity to minimise; in practice
‖r‖^2, the square of the norm of the residual, is used since it is non-negative, it is only zero when
the error is zero, and there are no square roots to calculate. However, using (5.19) gives (if A is
symmetric: A^T = A):

‖r‖^2 = (y − Ax)^T(y − Ax) = x^T A A x − 2x^T A y + y^T y        (5.20)

which is rather awkward to compute, because of the product AA. Another possible choice of
error functional (measure of the error), valid for the case where the matrix A is positive definite
and which is also minimised for the correct solution, is the functional h^2 = r^T A^{−1} r (instead of
r^T r as in (5.20)). This gives a simpler form:

h^2 = r^T A^{−1} r = (y − Ax)^T A^{−1} (y − Ax) = y^T A^{−1} y − 2x^T y + x^T A x        (5.21)

or, because the first term in (5.21) is independent of the variables and will play no part in the
minimisation, we can drop it and we finally have:

h^2 = x^T A x − 2x^T y        (5.22)

The method proceeds by evaluating the error functional at a point x0, choosing a direction,
in this case, the direction of the gradient of the error functional, and finding the minimum value
of the functional along that line. That is, if p gives the direction, the line is defined by all the
points (x + α p), where α is a scalar parameter. The next step is to find the value of α that
minimizes the error. This gives the next point x1. (Since in this case, α is the only variable, it is
simple to calculate the gradient of the error functional as a function of α and finding the
corresponding minimum).

[Fig. 5.1: successive, mutually perpendicular search directions u and v in the steepest descent method, starting from x0 and moving to x1.]
Several variations appear at this point. It would seem obvious that the best direction to
choose is that of the gradient (its negative or downwards) and that is the choice taken in the
conventional “steepest-gradient” or “steepest descent” method. In this case the consecutive
directions are all perpendicular to each other as illustrated in the figure above. However, as
mentioned earlier, this conduces to poor convergence due to the discrete nature of the steps.
The more efficient and popular “Conjugate Gradient Method” looks for a direction which
is ‘A–orthogonal’ (or conjugate) to the previous one instead (pTAq = 0 instead of pTq = 0).

Exercise 5.1:
Show that the value of α that minimizes the error functional along the line (x + αp) in the two cases
mentioned above (using the squared norm of the residual as error functional, or the functional
h^2), for a symmetric matrix A, is given respectively by:

α = p^T A(y − Ax)/‖Ap‖^2    and    α = p^T(y − Ax)/(p^T A p)

A useful feature of this method (as can be observed from the expressions above) is that
reference to the matrix A is only via simple matrix products; for given values of A,
y, x_i and p_i, we need only form A times a vector (Ax or Ap_i) and A^T times a vector. These can
be formed from a given sparse A without unnecessary multiplications (by zero elements) or storage.
A further advantage of the conjugate gradient method is that it is guaranteed to converge in
at most n steps, where n is the order of the matrix.

The Conjugate Gradient Algorithm


The basic steps of the algorithm are as follows:
1) Choose a starting point x_0.
2) Choose a direction to move. In the first step we choose that of the gradient of h^2, and this
coincides with the direction of the residual r:
∇h^2 = ∇(x^T A x − 2x^T y)|_{x_0} = 2(A x_0 − y) = −2r_0
so we choose: p_0 = r_0.
3) Calculate the distance to move (the parameter α):  α = p^T(y − Ax)/(p^T A p), using p_0 and x_0.
4) Calculate the new point:  x_k = x_{k−1} + α p_{k−1}
5) Calculate the new residual:  r_k = y − A(x_{k−1} + α p_{k−1}) = y − A x_{k−1} − α A p_{k−1} = r_{k−1} − α A p_{k−1}
6) Calculate the next direction. Here is where the methods differ. For the conjugate gradient
method the new vector p is not along the gradient of r but instead:

p_k = r_k + β p_{k−1}    where    β = (r_k^T r_k)/(r_{k−1}^T r_{k−1})
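A minimal Matlab sketch of these steps follows; the test matrix, starting point and stopping test are illustrative choices only (A must be symmetric positive definite for h^2 to be a valid error functional):

% Conjugate gradient iteration for A*x = y (A symmetric positive definite)
n = 7;
A = diag(4*ones(n,1)) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
y = ones(n,1);
x = zeros(n,1);                        % starting point x_0
r = y - A*x;                           % initial residual
p = r;                                 % first direction: along the gradient
rho = r'*r;
for k = 1:n                            % converges in at most n steps
  Ap    = A*p;
  alpha = (p'*r)/(p'*Ap);              % optimum step along p
  x     = x + alpha*p;
  r     = r - alpha*Ap;                % updated residual
  rho_new = r'*r;
  if sqrt(rho_new) < 1e-12, break, end
  beta = rho_new/rho;                  % next (A-orthogonal) direction
  p    = r + beta*p;
  rho  = rho_new;
end
x, norm(A*x - y)                       % solution and final residual norm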

More robust versions of the algorithm use a preliminary ‘pre-conditioning’ of the matrix
A, to alleviate the problem that the condition number of (5.20) is the square of the condition
number of A, – a serious problem if A is not ‘safely’ positive-definite. This leads to the popular
‘PCCG’ or Pre-Conditioned-Conjugate-Gradient algorithm, as a complete package and to many
more variations in implementation that can be found as commercial packages.

2.– Jacobi and Gauss-Seidel Methods


Two algorithms that are simple to implement are the closely related Jacobi (simultaneous
displacement) and Gauss-Seidel (successive displacement or ‘relaxation’).

a) Jacobi Method or Simultaneous Displacement


Suppose the set of equations for solution are:

a11x1 + a12x2 + a13x3 = y1


a21x1 + a22x2 + a23x3 = y2 (5.23)
a31x1 + a32x2 + a33x3 = y3

This can be re-organised to:

x1 = ( y1 – a12x2 – a13x3)/a11 (5.24a)



x2 = ( y2 – a23x3 – a21x1)/a22 (5.24b)


x3 = ( y3 – a31x1 – a32x2)/a33 (5.24c)

Suppose we had the vector x(0) = [x1, x2, x3](0), and substituted it into the right-hand side of
(5.24), to yield on the left-hand side the new vector x(1) = [x1, x2, x3](1). Successive substitutions
will give the sequence of vectors:

x(0), x(1), x(2), x(3), . . . .

Because (5.24) is merely a rearrangement of the equations for solution, the ‘correct’
solution substituted into (5.24) must be self-consistent, i.e. yield itself! The sequence will:
either converge to the correct solution
or diverge.

b) Gauss-Seidel Method or Successive Displacement


Note that when eq. (5.24a) is applied, a ‘new’ value of x1 would be available, which could
be used instead of the ‘previous’ x1 value when solving (5.24b). And similarly for x1 and x2
when applying (5.24c). This is the Gauss-Seidel or successive displacement iterative scheme,
illustrated here with an example, showing that the computer program (in FORTRAN/90) is
barely more complicated than writing down the equations.

! Example of successive displacement
program successive_displacement
  implicit none
  integer :: i
  real :: x1, x2, x3
  x1 = 0.
  x2 = 0.
  x3 = 0.                          ! Equations being solved are:
  do i = 1, 10
    x1 = (4. + x2 - x3)/4.         ! 4*x1 -  x2 +   x3 = 4
    x2 = (9. - 2.*x3 - x1)/6.      !   x1 + 6*x2 + 2*x3 = 9
    x3 = (2. + x1 + 2.*x2)/5.      !  -x1 - 2*x2 + 5*x3 = 2
    print *, x1, x2, x3
  end do
  stop
end program successive_displacement

1 1.333333 1.133334
1.05 0.9472222 0.9888889
0.9895833 1.00544 1.000093
1.001337 0.9997463 1.000166
0.9998951 0.9999621 0.9999639
0.9999996 1.000012 1.000005
1.000002 0.9999981 0.9999996
0.9999996 1 1
1 1 1
1 1 1

Whether the algorithm converges or not depends on the matrix A and (surprisingly) not on
the right-hand-side vector y of (5.23). Convergence does not even depend on the 'starting value'
of the vector, which only affects the necessary number of iterations.
We will skip over any formal proof of convergence. But to give the sharp criteria for
convergence,− first one splits A as:

A=L+D+U

where L, D and U are the lower, diagonal and upper triangular parts of A. Then the schemes
converge if-and-only-if all the eigenvalues of the matrix:

D–1(U + L) {for simultaneous displacement}


or
(D + L)–1U {for successive displacement}

lie within the unit circle.


For applications, it is simpler to use some sufficient (but not necessary!) conditions, when
possible, such as:

(a) If A is symmetric and positive-definite, then successive displacement converges.


(b) If, in addition, ai,j < 0 for all i ≠ j, then simultaneous displacement also converges.

Condition (a) is a commonly satisfied condition. Usually, the successive method


converges in about half the computer time of the simultaneous, but, strictly there are matrices
where one method converges and the other does not, and vice-versa.

Matrix Eigenvalue Problems


The second class of matrix problem that occurs frequently in the numerical solution of
differential equations, as well as in many other areas, is the matrix eigenvalue problem. On many
occasions the matrices will be large and very sparse; in others they will be dense, and normally
the solution methods will have to take these characteristics into account in order to achieve a
solution in an efficient way. There are a number of different methods to solve matrix eigenvalue
problems involving dense or sparse matrices, each of them with different characteristics and
better adapted to different types of matrices and requirements.

Choice of method
The choice of method will depend on the characteristics of the problem (type of the
matrices) and on the solution requirements. For example, most methods suitable for dense
matrices calculate all eigenvectors and eigenvalues of the system. However, in many problems
arising from the numerical solution of PDEs one is only interested in one or just a few
eigenvalues and/or eigenvectors. Also, in many cases the matrices will be large and sparse.
In what follows we will concentrate on methods which are suitable for sparse matrices (of
course the same methods can be applied to dense matrices).
The problem to solve can have two different forms:

Ax = λx (standard eigenvalue problem) (5.25)

Ax = λ Bx (generalized eigenvalue problem) (5.26)



Sometimes the generalized eigenvalue problem can be converted into the form (5.25) simply
by premultiplying by the inverse of B:

B^{−1}Ax = λx

however, if A and B are symmetric, the new matrix on the left hand side (B^{−1}A) will have lost
this property. Instead, it is preferable to decompose (factorise) the matrix B in the form
B = LL^T (Choleski factorisation, possible if B is positive definite). Substituting in (5.26) and
premultiplying by L^{−1} gives:

L^{−1}Ax = λL^T x

and since (L^T)^{−1} L^T = I, the identity matrix,

L^{−1}A(L^{−1})^T L^T x = λL^T x;    putting L^T x = y and L^{−1}A(L^{−1})^T = Ã gives:

Ãy = λy

The matrix Ã is symmetric if A and B are symmetric and the eigenvalues are not
modified. The eigenvectors x can be obtained from y simply by back-substitution. However, if
the matrices A and B are sparse, this method is not convenient because Ã will generally be
dense. In the case of sparse matrices, it is more convenient to solve the generalized problem
directly.

Solution Methods
We can again classify solution methods as:

1.- Direct: (or transformation methods)

i) Jacobi rotations Converts the matrix of (5.25) to diagonal form


ii) QR (or QZ for complex) Converts the matrices to triangular form
iii) Conversion to Tri-diagonal Normally to be followed by either i) or ii)

All these methods give all eigenvalues. We will not examine any of these in detail.

2.- Vector Iteration Methods:


These are better suited for sparse matrices and also for the case when only a few eigenvalues and
eigenvectors are needed.

The Power Method: or Direct iteration


For the standard problem (5.25) the algorithm consists of the repeated multiplication of a
starting vector by the matrix A. This can be seen to produce a reinforcement of the component of
the trial vector along the direction of the eigenvector of largest absolute eigenvalue, causing the trial
vector to converge gradually to that eigenvector. The algorithm can be described schematically
by:

Choose starting vector → Premultiply by A → Normalize → Check convergence
(if not converged, premultiply by A again; if converged, STOP)

The normalization step is necessary because otherwise the iteration vector can grow
indefinitely in length over the iterations.
How does this algorithm work?
If φ_i, i = 1, …, N are the eigenvectors of A, we can write any vector of N components as a
superposition of them (they constitute a base in the space of N dimensions). In particular, for the
starting vector:

x_0 = ∑_{i=1}^{N} α_i φ_i        (5.27)

When we multiply by A we get x̃_1 = A x_0, or:

x̃_1 = A ∑_{i=1}^{N} α_i φ_i = ∑_{i=1}^{N} α_i A φ_i = ∑_{i=1}^{N} α_i λ_i φ_i        (5.28)

If λ_1 is the eigenvalue of largest absolute value, we can also write this as:

x̃_1 = λ_1 ∑_{i=1}^{N} α_i (λ_i/λ_1) φ_i

This is then normalized by:    x_1 = x̃_1/‖x̃_1‖

Then, after n iterations of multiplications by A and normalization we will get:

x_n = C ∑_{i=1}^{N} α_i (λ_i/λ_1)^n φ_i        (5.29)

where C is a normalization constant.


From this expression (5.29) we can see that since λ1 ≥ λi , for all i, the coefficient of all
φi for i ≠ 1, will tend to zero as n increases and then the vector xn will gradually converge to φ1.

This method (the power method) finds the dominant eigenvector, that is the eigenvector
that corresponds to the largest eigenvalue (in absolute value). To find the eigenvalue we can see
that pre-multiplying (5.25) by the transpose of φi will give:

φ_i^T A φ_i = λ_i φ_i^T φ_i

where φ_i^T A φ_i and φ_i^T φ_i are scalars, so we can write:

λ_i = (φ_i^T A φ_i) / (φ_i^T φ_i)        (5.30)

This is known as the Rayleigh quotient and can be used to obtain the eigenvalue from the
eigenvector. This expression has interesting properties; if we only know an estimate of the
eigenvector, (5.30) will give an estimate of the eigenvalue with a higher order of accuracy than
that of the eigenvector itself.
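The following minimal Matlab sketch combines the power method with the Rayleigh quotient (5.30); the test matrix (the tridiagonal matrix of Exercise 5.4), the starting vector and the tolerance are illustrative choices (cf. Exercise 5.2):

% Power method: dominant eigenvalue/eigenvector of A
n = 7;
A = diag(-4*ones(n,1)) + diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
x = ones(n,1);                          % starting vector
lam_old = 0;
for it = 1:500
  x   = A*x;                            % premultiply by A
  x   = x/norm(x);                      % normalize
  lam = x'*A*x;                         % Rayleigh quotient (here x'*x = 1)
  if abs(lam - lam_old) < 1e-6*abs(lam), break, end
  lam_old = lam;
end
lam                                     % eigenvalue of largest absolute value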

Exercise 5.2
Write a short program to calculate the dominant eigenvector and the corresponding
eigenvalue of a matrix like that in (7.14) but of order 7, using the power method. Terminate
the iterations when the relative difference between two successive estimates of the eigenvalue is
within a tolerance of 10^{−6}.

Exercise 5.3 (advanced)


Using the same reasoning used to explain the power method for the standard eigenvalue problem,
(5.27)-(5.29), develop the corresponding algorithm for the generalized eigenvalue problem
Ax = λBx.
The power method in the form presented above can only be used to find the dominant
eigenvector. However, modified forms of the basic idea can be developed to find any
eigenvector of the matrix. One of these modifications is the inverse iteration.

Inverse Iteration
For the system Ax = λx we can write (1/λ)x = A^{−1}x, from where we can see that the
eigenvalues of A^{−1} are 1/λ, the reciprocals of the eigenvalues of A, and the eigenvectors are the
same. In this form, if we are interested in finding the eigenvector of A corresponding to the
smallest eigenvalue in absolute value (closest to zero), we can notice that for that eigenvalue λ
its reciprocal is the largest, and so it can be found using the power method on the matrix A^{−1}.

The procedure then is as follows:

1. Choose a starting vector x_0.
2. Find x̃_1 = A^{−1}x_0 or, instead, solve A x̃_1 = x_0 (avoiding the explicit calculation of A^{−1}).
3. Normalize: x_1 = C x̃_1 (C is a normalization factor chosen so that ‖x_1‖ = 1).
4. Re-start.

Exercise 5.4
Write a program and calculate by inverse iteration the smallest eigenvalue of the
tridiagonal matrix A of order 7 where the elements are –4 in the main diagonal and 1 in the
subdiagonals. You can use the algorithms given in section 1 of the Appendix for the solution of
the linear system of equations.

An important extension of the inverse iteration method allows finding any eigenvalue of
the spectrum (spectrum of a matrix: set of eigenvalues) not just that of smallest eigenvalue in
absolute value.

Shifted Inverse Iteration


Suppose that the system Ax = λx has the set of eigenvalues {λ_i}. If we construct the
matrix Ã as Ã = A − σI, where I is the identity matrix and σ is a real number, we have:

Ãx = Ax − σx = λx − σx = (λ − σ)x        (5.31)

And we can see that the matrix Ã has the same eigenvectors as A and its eigenvalues are
{λ_i − σ}, that is, the same eigenvalues as A but shifted by σ. Then, if we apply the inverse
iteration method to the matrix Ã, the procedure will yield the eigenvalue (λ_i − σ) closest to zero;
that is, we can find the eigenvalue λ_i of A closest to the real number σ.
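A sketch of shifted inverse iteration in Matlab is given below; the matrix, the shift σ and the tolerance are illustrative choices (cf. Exercise 5.5), and the linear system is solved at each step rather than forming the inverse explicitly:

% Shifted inverse iteration: eigenvalue of A closest to the shift sigma
n = 7;
A = diag(-4*ones(n,1)) + diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
sigma = -3.5;                             % look for the eigenvalue closest to sigma
As = A - sigma*eye(n);                    % shifted matrix
x  = ones(n,1);  lam_old = 0;
for it = 1:200
  x   = As\x;                             % solve (A - sigma*I)*x_new = x_old
  x   = x/norm(x);                        % normalize
  lam = x'*A*x;                           % Rayleigh quotient: eigenvalue estimate
  if abs(lam - lam_old) < 1e-6*abs(lam), break, end
  lam_old = lam;
end
lam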

Rayleigh Iteration
Another extension of this method is known as Rayleigh iteration. In this case, the
Rayleigh quotient is used to calculate an estimate of the eigenvalue at each iteration and the shift
is updated using this value.
Since the convergence rate of the power method depends on the relative value of the
coefficients of each eigenvector in the expansion of the trial vector during the iterations (as in
(5.27)) and these are affected by the ratio between the eigenvalues λi and λ1, the convergence
will be fastest when this ratio is largest as we can see from (5.29). The same reasoning applied
to the shifted inverse iteration method leads to the conclusion that the convergence rate will be
fastest when the shift is chosen as close as possible to the target eigenvalue. In this form, the
Rayleigh iteration has faster convergence than the ordinary shifted inverse iteration.

Exercise 5.5
Write a program using shifted inverse iteration to find the eigenvalue of the matrix A of
the previous exercise which lies closest to 3.5.
Then, write a modified version of this program using Rayleigh quotient update of the shift
in every iteration (Rayleigh iteration). Compare the convergence of both procedures (by the
number of iterations needed to achieve the same tolerance for the relative difference between
successive estimates of the eigenvalue). Use a tolerance of 10–6 in both programs.

6. Numerical Differentiation and Integration


Numerical Differentiation
For a function of one variable f(x), the derivative at a point x = a is defined as:

f'(a) = lim_{h→0} [ f(a + h) − f(a) ] / h        (6.1)

This suggests that, choosing a small value for h, the derivative can be reasonably
approximated by the forward difference:

f'(a) ≈ [ f(a + h) − f(a) ] / h        (6.2)

An equally valid approximation can be the backward difference:

f'(a) ≈ [ f(a) − f(a − h) ] / h        (6.3)

Graphically, we can see the meaning of each of these expressions in Fig. 6.1.
The derivative at x_c is the slope of the tangent to the curve at the point C. The slopes of the
chords between the points A and C, and B and C, are the values of the backward and forward
difference approximations to the derivative, respectively.
We can see that a better approximation to the derivative is obtained by the slope of the
chord between points A and B, labelled "central difference" in Fig. 6.1.

[Fig. 6.1: forward, backward and central differences as slopes of chords through the points A (x_a, y_a), C (x_c, y_c) and B (x_b, y_b).]

We can understand this better analysing the error in each approximation by the use of
Taylor expansions.
Considering the expansions for the points a+h and a−h:

f(a + h) = f(a) + f'(a)h + (f^{(2)}(a)/2!)h^2 + (f^{(3)}(a)/3!)h^3 + ⋯        (6.4)

f(a − h) = f(a) − f'(a)h + (f^{(2)}(a)/2!)h^2 − (f^{(3)}(a)/3!)h^3 + ⋯        (6.5)

where in both cases the remainder (error) can be represented by a term of the form (f^{(4)}(ξ)/4!)h^4
(see Appendix), so we could write instead

f(a + h) = f(a) + f'(a)h + (f^{(2)}(a)/2!)h^2 + (f^{(3)}(a)/3!)h^3 + O(h^4)

where the symbol O(h^4) means: "a term of the order of h^4".

Truncating (6.4) to first order we can then see that f(a + h) = f(a) + f'(a)h + O(h^2), so
for the forward difference formula we have:

f'(a) = [ f(a + h) − f(a) ] / h + O(h)        (6.6)

and we can see that the error of this approximation is of the order of h. A similar result is
obtained for the backward difference.
We can also see that by subtracting (6.4) and (6.5) and discarding terms of order h^3 and
higher we can obtain a better approximation:

f(a + h) − f(a − h) = 2 f'(a) h + O(h^3)

from where we obtain the central difference formula:

f'(a) = [ f(a + h) − f(a − h) ] / (2h) + O(h^2)        (6.7)

which has an error of the order of h^2.
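These error orders are easy to verify numerically; a quick Matlab sketch (f(x) = sin x at a = 1 is an arbitrary test case):

% Error of forward and central differences as h is reduced
f = @sin;  a = 1;  fp = cos(1);               % exact derivative for comparison
h = 10.^(-(1:6));
fwd = (f(a+h) - f(a))./h;                     % forward difference, O(h)
cen = (f(a+h) - f(a-h))./(2*h);               % central difference, O(h^2)
[h' abs(fwd-fp)' abs(cen-fp)']                % errors decrease as h and as h^2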


Expressions (6.2) and (6.3) are “two-point” formulae for the first derivative. Many more
different expressions with different degrees of accuracy can be constructed using a similar
procedure and using more points. For example, the Taylor expansions for the function at a
number of points can be used to eliminate all the terms except the desired derivative.

Example:
Considering the 3 points x_0, x_1 = x_0 + h and x_2 = x_0 + 2h and taking the Taylor expansions at x_1
and x_2 we can construct a three-point forward difference formula:

f(x_1) = f(x_0) + f'(x_0)h + (f^{(2)}(x_0)/2!)h^2 + O(h^3)        (6.8)

f(x_2) = f(x_0) + f'(x_0)2h + (f^{(2)}(x_0)/2!)4h^2 + O(h^3)        (6.9)

Multiplying (6.8) by 4 and subtracting to eliminate the second derivative term:

4 f(x_1) − f(x_2) = 3 f(x_0) + 2 f'(x_0)h + O(h^3)

from where we can extract for the first derivative:

f'(x_0) = [ −3 f(x_0) + 4 f(x_1) − f(x_2) ] / (2h) + O(h^2)        (6.10)

Exercise 6.1
Considering the 3 points x0, x1 = x0 − h and x2 = x0 − 2h and the Taylor expansions at x1 and x2
find a three-point backward difference formula. What is the order of the error?

Exercise 6.2

Using the Taylor expansions for f(a+h) and f(a–h) show that a suitable formula for the
second derivative is:

f^{(2)}(a) ≈ [ f(a − h) − 2 f(a) + f(a + h) ] / h^2        (6.11)

Show also that the error is O(h^2).

Exercise 6.3
Use the Taylor expansions for f(a+h), f(a–h), f(a+2h) and f(a–2h) to show that the
following are formulae for f'(a) and f^{(2)}(a), and that both have an error of the order of h^4:

f'(a) ≈ [ f(a − 2h) − 8 f(a − h) + 8 f(a + h) − f(a + 2h) ] / (12h)

f^{(2)}(a) ≈ [ −f(a − 2h) + 16 f(a − h) − 30 f(a) + 16 f(a + h) − f(a + 2h) ] / (12h^2)

Expressions for the derivatives can also be found using other methods. For example, if the
function is interpolated with a polynomial using, say, n points, the derivative (first, second, etc)
can be estimated by calculating the derivative of the interpolating polynomial at the point of
interest.

Example:
Considering the 3 points x1, x2 and x3 with x1 < x2 < x3 (this time, not necessarily equally spaced)
and respective function values y1, y2 and y3, we can use the Lagrange interpolation polynomial to
approximate y(x):
f(x) ≈ L(x) = L_1(x) y_1 + L_2(x) y_2 + L_3(x) y_3        (6.12)

where:

L_1(x) = (x − x_2)(x − x_3) / [(x_1 − x_2)(x_1 − x_3)],    L_2(x) = (x − x_1)(x − x_3) / [(x_2 − x_1)(x_2 − x_3)]
and    L_3(x) = (x − x_1)(x − x_2) / [(x_3 − x_1)(x_3 − x_2)]        (6.13)

so the first derivative can be approximated by:

f'(x) ≈ L'(x) = L_1'(x) y_1 + L_2'(x) y_2 + L_3'(x) y_3

which is:

f'(x) ≈ (2x − x_2 − x_3)/[(x_1 − x_2)(x_1 − x_3)] y_1 + (2x − x_1 − x_3)/[(x_2 − x_1)(x_2 − x_3)] y_2 + (2x − x_1 − x_2)/[(x_3 − x_1)(x_3 − x_2)] y_3

This general expression will give the value of the derivative at any of the points x_1, x_2 or x_3.
For example, at x_1:

f'(x_1) ≈ (2x_1 − x_2 − x_3)/[(x_1 − x_2)(x_1 − x_3)] y_1 + (x_1 − x_3)/[(x_2 − x_1)(x_2 − x_3)] y_2 + (x_1 − x_2)/[(x_3 − x_1)(x_3 − x_2)] y_3        (6.14)

Exercise 6.4
Show that if the points are equally spaced by the distance h in (6.12) and the expression is
evaluated at x1, x2 and x3, the expression reduces respectively to the 3-point forward difference
formula (6.10), the central difference and the 3-point backward difference formulae.
page 59 E763 (part 2) Numerical Methods

Central Difference Expressions for different derivative


The following expressions are central difference approximations for the derivatives:
f ( xi +1 ) − f ( xi −1 )
i) f ' ( xi ) =
2h
f ( xi +1 ) − 2 f ( xi ) + f ( xi −1 )
ii) f ' ' ( xi ) =
h2
f ( xi + 2 ) − 2 f ( xi +1 ) + 2 f ( xi −1 ) − f ( xi −2 )
iii) f ' ' ' ( xi ) =
2h 3
f ( xi + 2 ) − 4 f ( xi +1 ) + 6 f ( xi ) − 4 f ( xi −1 ) + f ( xi −2 )
iv) f ( 4 ) ( xi ) =
h4

Naturally, many more expressions can be developed using more points and/or different methods.

Exercise 6.5
Derive expressions iii) and iv) above. What is the order of the error for each of the 4 expressions
above?

Partial Derivatives
∂f ( x, y )
For a function of two variables: f(x,y), the partial derivative: is defined as:
∂x
∂f ( x, y ) f ( x + h, y ) − f ( x , y )
= lim , which clearly, is a function of y.
∂x h →0 h
Again, we can approximate this expression by a difference assuming that h is sufficiently small.
∂f ( x, y )
Then, for example a central difference expression for is:
∂x
∂f ( x, y ) f ( x + h, y ) − f ( x − h, y )
= (6.15)
∂x 2h
Similarly,
∂f ( x, y ) f ( x, y + h) − f ( x, y − h)
=
∂y 2h
In this form, the gradient of f is given by:
∂f ( x, y ) ∂f ( x, y ) 1
∇f ( x, y ) = xˆ + yˆ ≈ (( f ( x + h, y ) − f ( x − h, y )) xˆ + ( f ( x, y + h) − f ( x, y − h)) yˆ )
∂x ∂y 2h

Exercise 6.6
Using central difference formulae for the second derivatives derive an expression for the
Laplacian ( ∇ 2 f ) of a scalar function f.

Numerical Integration
In general, numerical integration methods approximate the definite integral of a function f by a
weighted sum of function values at several points in the interval of integration. In general these
E763 (part 2) Numerical Methods page 60

methods are called “quadrature” and there are different methods to choose the points and the
weights.

Trapezoid Rule
The simplest method to approximate the integral of a function is the trapezoid rule. In this case,
the interval of integration is divided into a number of subintervals on which the function is
simply approximated by a straight line as shown in the figure below.
The integral (area under the curve) is
then approximated by the sum of the areas of
the trapezoids based on each subinterval.
The area of the trapezoid with base in the
interval [xi, xi+1] is:

( f i + f i +1 )
( xi +1 − xi )
2

and the total area is then the sum of all the


terms of this form. If we denote by

hi = ( xi +1 − xi ) x1 ... x(i-1) x(i) x(i+1)


Fig. 6.2
the width of the subinterval [xi, xi+1], the integral can be approximated by:

xn
1 n−1
∫ f ( x)dx ≈ 2 ∑ hi ( fi + fi +1) (6.16)
x1 i =1

If all the subintervals are of the same width (the points are equally spaced), (6.14) reduces to:

xn
 f1 + f n n−1 
∫ f ( x)dx ≈h + ∑ f i  (6.17)
x1  2 i =2 

Exercise 6.7
Using the trapezoid rule:
(a) Calculate the integral of exp(−x2) between 0 and 2.
(b) Calculate the integral of 1/x between 1 and 2
In both cases use 10 and 100 subintervals (11 and 101 points respectively).

It can be shown that the error incurred with the application of the trapezoid rule in one interval is
given by the term:
(b − a )3 ( 2)
E=− f (ξ ) (6.18)
12
where ξ is a point inside the interval and the error is defined as the difference between the exact
integral (I) and the area of the trapezoid (A): E = I − A.
If this is applied to the Trapezoid rule using a number of subintervals, the error term changes to:
page 61 E763 (part 2) Numerical Methods

(b − a )h 2 ( 2)
E=− f (ξ h ) (6.19)
12
where now ξ h is a point in the complete interval [a, b] and depends on h. Considering the Error
as the sum of the individual errors in each subinterval, we can write this term in the form:

h 3 n ( 2) h2 n
E=− ∑
12 i =1
f (ξi ) = − ∑ hf ( 2) (ξi )
12 i =1
(6.20)

where ξi are points in each subintervals. Expression (6.20), in the limit when n → ∞ and
h → 0 corresponds to the integral of f(2) over the interval [a, b]. Then, we can write (6.20) as:

h2
E≈− ( f ' (b) − f ' (a)) (6.21)
12

Two options are open now, we can use this term to estimate the error incurred or equivalently, to
determine the number of equally spaced points required for a determined precision or, include
this term in the calculation to form a corrected form of the trapezoid rule:

b
 f1 + f n n−1  h 2
∫ f ( x)dx ≈h + ∑ f i  − ( f ' (b) − f ' (a)) (6.22)
a  2 i = 2  12

Simpson’s Rule
In the case of the trapezoid rule, the function is approximated by a straight line and this can be
done repeatedly by subdividing the interval. A higher degree of accuracy using the same
number of subintervals can be obtained with a better approximation than the straight line. For
example, choosing a quadratic approximation could give a better result. This is the Simpson’s
rule.
Consider the function f(x) and the interval [a, b]. Defining the points x0, x1 and x2, as:

a+b b−a
x0 = a , x1 = , x2 = b and defining h =
2 2
Using Lagrange interpolation to generate a second order polynomial approximation to f(x) gives
as in (6.12):

f ( x) ≈ f ( x0 ) L0 ( x) + f ( x1 ) L1 ( x) + f ( x2 ) L2 ( x)
where:
( x − x1 )( x − x2 ) 1
L0 ( x) = = ( x − x1 )( x − x2 )
( x0 − x1 )( x0 − x2 ) 2h 2
( x − x0 )( x − x2 ) 1
L1 ( x) = = − 2 ( x − x0 )( x − x2 )
( x1 − x0 )( x1 − x2 ) h
( x − x0 )( x − x1 ) 1
and L2 ( x) = = ( x − x0 )( x − x1 )
( x2 − x0 )( x2 − x1 ) 2h 2
Then,
b a + 2h

∫ f ( x )dx ≈ ∫ ( f ( x0 ) L0 ( x) + f ( x1) L1( x) + f ( x2 ) L2 ( x))dx (6.23)


a a
E763 (part 2) Numerical Methods page 62

b a + 2h a+2h a+2h
or ∫ f ( x)dx ≈ f ( x0 ) ∫ L0 ( x)dx + f ( x1 ) ∫ L1( x)dx + f ( x2 ) ∫ L2 ( x)dx
a a a a

The individual integrals are:


a+2h a+2h
1
∫ L0 ( x)dx = 2h 2 ∫ ( x − x1)( x − x2 )dx
a a

and with the change of variable: x → t = x − x1

a+2h h h
1 1  t 3 t2  h
∫ L0 ( x)dx = ∫ t (t − h)dt = 2 3
−h  =
a 2h 2 −h 2h  2  3
−h

a+2h a+2h
4h h
Similarly, ∫ L1( x)dx = 3
and ∫ L2 ( x)dx = 3
a a

Then, substituting in (6.23):

b
h
∫ f ( x)dx ≈ 3 ( f (a) + 4 f (a + h) + f (a + 2h)) (6.24)
a

Example
1.25
Use of the Simpson’s rule to calculate the integral: ∫ (sin πx + 0.5)dx
0.25
We have: x0 = a = 0.25, x2 = x0 + 2h = b = 1.25 ,
then x1 = x0 + h = 0.75 and h = 0.5

1
The figure shows the function (in blue) and the
2nd order Lagrange interpolation used to
calculate the integral with the Simpson’s rule.
The exact value of this integral is:

1.25 x2
0
∫ (sin πx + 0.5)dx = 0.950158158 x0 x1
0.25
Applying the Simpson’s rule to this
function gives: 0 0.5 1 1.5
Fig. 6.3

1.25
h
∫ (sin πx + 0.5)dx ≈ 3 ( f (a) + 4 f (a + h) + f (a + 2h)) = 0.97140452
0.25
page 63 E763 (part 2) Numerical Methods

As with the trapezoid rule, higher accuracy can be obtained subdividing the interval of
integration and adding the result of the integrals over each subintervals.

Exercise 6.8
Write down the expression for the composite Simpson’s rule using n subintervals and use it to
1.25
calculate the integral ∫ (sin πx + 0.5)dx of the example above using 10 subintervals.
0.25

The Midpoint Rule


Another simple method to approximate a definite integral is the midpoint rule where the function
is simply approximated be a constant value over the interval; the value of the function at the
midpoint. In such a way, the integral of f(x) in the interval [a, b] is approximated by the area of
the rectangle of base (b − a) and height f(c), where c = (a + b)/2.
Consider the integral:
b
I = ∫ f ( x)dx (6.25)
a
and the Taylor expansion for the function f, truncated to first order and centred in the midpoint
of the interval [a, b]:
p1 ( x) = f (c) + ( x − c) f ' (c) + O(h 2 ) (6.26)
where c = (a + b) 2 .
1
The error term in (6.26) is actually: R1 ( x) = ( x − c) 2 f ( 2) (ξ ) so the error in (6.25) when
2
b
p1 ( x) is used instead of f(x) is given by: E = ∫ R1 ( x)dx and applying the Integral Mean Value
a
Theorem to this (see Appendix), we have:

( )
b
1 ( 2) 1 b 1
E= f (ξ ) ∫ ( x − c) 2 dx = f ( 2) (ξ )( x − c) a = f ( 2) (ξ ) (b − c)3 − (a − c)3
2 a
6 6
(b − a )3
but since c = a + h and c = b − h, where h = (b − a)/2, we have: (b − c)3 − (a − c)3 =
4
and the error is:
1
E=(b − a )3 f ( 2) (ξ ) (6.27)
24
which is half of the estimate for the single interval trapezoid rule.

We can write then:


b
1
∫ f ( x)dx = (b − a) f (c) + 24 (b − a)
3
f ( 2) (ξ ) (6.28)
a
for some ξ in the interval [a, b].
Similarly to the case of the trapezoid rule and the Simpson’s rule, the midpoint rule can be
used in a composite manner, after subdividing the interval of integration in a number N of
E763 (part 2) Numerical Methods page 64

subintervals to achieve higher precision. In that case, the expression for the integral becomes:

b N
h 2 (b − a ) ( 2) b−a
∫ f ( x)dx = h∑ f (ci ) + 24
f (η ) , h=
N
(6.29)
a i =1

where ci are the midpoints of each of the n subintervals and η is a point between a and b.

Exercise 6.9
1.25
Use the Midpoint Rule to calculate the integral: ∫ (sin πx + 0.5)dx using 2 and 10 subintervals.
0.25
compare the result with that of the Simpson’s rule in the example above.

Gaussian Quadrature
In the trapezoid, Simpson’s and midpoint rule, the definite integral of a function f(x) is
approximated by the exact integral of a polynomial that approximates the function. In all these
cases, the evaluation points are chosen arbitrarily, often equally spaced. However, it is rather
clear that the precision attained is dependent of the position of these points, giving then another
route to optimisation. Then the weighting coefficients are determined by the choice of method.
Considering again the general approach at the approximation of the definite integral, the problem
can be written in the form:
n 1
Gn ( f ) = ∑ win f ( xin ) ≈ ∫ f ( x)dx (6.30)
i =1 −1

(The interval of integration is here chosen as [ −1,1], but obviously, any other interval can be
mapped into this by a change of variable.)
The objective now is to find for a given n, the best choice of evaluation points xin (called here
“Gauss points”) and weights win to get maximum precision in the approximation. Compared to
the criterion for the trapezoid and Simpson’s rules, this is equivalent to try to find for a fixed n
the best choice for xin and win the approximation (6.30) is exact for a polynomial of degree N,
with N (>n) as large as possible. (That is we go beyond the degree of approximation of the
previous rules.)
This is equivalent to say:
1 n
∫x
k
dx = ∑ win ( xin ) k for k = 0, 1, 2, …, N (6.31)
−1 i =1

with N as large as possible (note the equal sign now). We can simplify the problem to these
functions now because any polynomial is just a superposition of terms of the form x k , so if the
integral is exact for each of them, it will be for any polynomial containing those terms.
Expression (6.31) is a system of equations that the Gauss points and weights (unknown) need to
satisfy. The problem is then, to find the xin and win (n of each). This is a nonlinear problem
that cannot be solved directly. We also have to determine the value of N. It can be shown that
this number is N = 2n − 1. This is also rather natural since (6.31) consists of N+1 equations and
we need to find 2n parameters.
page 65 E763 (part 2) Numerical Methods

Finding the weights


For a set of weights and Gauss points, let’s consider the Lagrange interpolation polynomials of
order n, associated to each of the Gauss points:

n
x − xkn
Lnj ( x) = ∏ x nj − xkn
(6.32)
k =1,k ≠ j
then, since the expression (6.31) should be exact for any polynomial up to order N = 2n − 1, and
Lnj ( x) is of order n, we have:
1 n
∫ L j ( x)dx = ∑ wi L j ( xi )
n n n n
(6.33)
−1 i =1

but since the Lnj ( x) are interpolation polynomials, Lnj ( xin ) = δ i j (that is, they are 1 if i=j and 0
otherwise), all the terms in the sum in (6.33) are zero except for i=j and we can have:

∫ Li ( x)dx = wi
n n
(6.34)
−1

With this, we have the weights for a given set of Gauss points. We have to find now the best
choice for these.

If P(x) is an arbitrary polynomial of degree ≤ 2n − 1, we can write: P(x) = Pn(x) Q(x) + R(x); (Q
and R are respectively the quotient polynomial and reminder polynomial of the division of P by
Pn). Pn(x) is of order n and then, Q and R are of degree n − 1 or less. Then we have:

1 1 1

∫ P( x)dx = ∫ Pn ( x)Q( x)dx + ∫ R( x)dx (6.35)


−1 −1 −1

If we now define the polynomial Pn(x) by its roots and choose these as the Gauss points:

n
Pn ( x) = ∏ ( x − xin ) , (6.36)
i =1

then, the integral of the product Pn(x) Q(x), which is a polynomial of degree ≤ 2n − 1, must be
given exactly by the quadrature expression:

1 n
∫ Pn ( x)Q( x)dx = ∑ wi Pn ( xi )Q( xi )
n n n
(6.37)
−1 i =1
but since the Gaussian points are the roots of Pn(x), (6.37) must be zero; that is:

∫ Pn ( x)Q( x)dx = 0 (6.38)


−1
for all polynomials Q(x) of degree n − 1 or less. This implies that Pn(x) must be a member of a
E763 (part 2) Numerical Methods page 66

family of orthogonal polynomials1.. In particular, Legendre polynomials are a good choice


because they are orthogonal in the interval [-1, 1] with a weighting function w(x) = 1.

Going back now to the integral of the arbitrary polynomial P(x) of degree ≤ 2n − 1, and equation
(6.35), we have that if we choose Pn(x) as in (6.36), (6 38) is satisfied and then, (6.35) reduces
to:

1 1

∫ P( x)dx = ∫ R( x)dx (6.39)


−1 −1

but since R(x) is of degree n − 1 or less, the interpolation using Lagrange polynomials for the n
points will give the exact representation of R (see Exercise 4.2). That is:

n
R ( x) = ∑ R ( xin ) Li ( x) exactly. (6.40)
i =1
Then,
1 1 n n 1

∫ R( x)dx = ∫ ∑ R( xi ) Li ( x)dx = ∑ R( xi ) ∫ Li ( x)dx


n n
(6.41)
−1 −1 i =1 i =1 −1

but we have seen before (6.34) the integral of the Lagrange interpolation polynomial for point i
is the value of win , so:
1 n
∫ R( x)dx = ∑ wi R( xi )
n n
(6.40)
−1 i =1

Now, since P(x) = Pn(x) Q(x) + R(x) and Pn ( xin ) = 0 (see 6.36), P ( xin ) = R ( xin ) and from (6.39):

1 n
∫ P( x)dx = ∑ wi P( xi )
n n
(6.41)
−1 i =1

which tells us that the integral of the arbitrary polynomial P(x) of order 2n − 1can be calculated
exactly using the sets of Gauss points xin - zeros of the nth order Legendre polynomial and the
weights win determined by(6.34) – the integral over the interval [−1, 1] of the Lagrange
polynomial of order n corresponding to the Gauss point xin .
Back to the start then, we have now a set of n Gauss points and weights that yield the exact
evaluation of the integral of a polynomial up to degree 2n − 1. These should also give the best
result to the integral of an arbitrary function f:

1
Remember that for orthogonal polynomials in [−1, 1], all their roots lie in [−1, 1], and satisfy:
1 1

∫ pi ( x) p j ( x)dx = δ i ∫ pi ( x)q( x)dx = 0 for any polynomial q of degree ≤ i −1.


j
. Additionally,
−1 −1
page 67 E763 (part 2) Numerical Methods

1 n
∫ f ( x)dx ≈ ∑ wi
n
f ( xin ) (6.42)
−1 i =1

Gauss nodes and weights for different orders are given in the following table.

Gaussian Quadrature: Nodes and Weights

n Nodes xin Weghts win


1 0.0 2.0
2 ± 3 3 = ±0.577350269189 1.0
3 0 8 9 = 0.888888888889
± 15 5 = ±0.774596669241 5 9 = 0.555555555556

4 ± 525 − 70 30 35 =±0.339981043585 (18 + 30 ) 36 = 0.652145154863


± 525 + 70 30 35 =±0.861136311594 (18 − 30 ) 36 = 0.347854845137
5 0 128 225 = 0.568888888889
± 5 − 2 10 7 3 = ±0.538469310106 (322 + 13 70 ) 900 =0.478628670499
± 5 + 2 10 7 3 = ±0.906179845939 (322 − 13 70 ) 900 =0.236926885056
6 ±0.238619186 0.4679139
±0.661209386 0.3607616
±0.932469514 0.1713245

Example
1
−x
For the integral: ∫e sin 5 x dx the results of the calculation using Gauss quadrature are listed in
−1
the table:
n Integral Error %
2 −0.307533965529
3 0.634074857001
4 0.172538331616 28.71
5 0.247736352452 −2.35
6 0.241785750244 0.10
The error is also calculated compared with the exact value: 0.24203832101745.
The results with few Gauss points are not very good because the function varies strongly in
the interval of integration. However, for n = 6, the error is very small.

Exercise 6.10
Compare the results of the example above with those of the Simpson’s and Trapezoid Rules for
E763 (part 2) Numerical Methods page 68

n subintervals.

Change of Variable
The above procedure was developed for definite integrals over the interval [−1, 1]. Integrals
over other intervals can be calculated after a change of variables. For example if the integral to
calculate is:
b

∫ f (t )dt
a

To change from the interval [a, b] to [−1, 1] the following change of variable can be made:

t − ( a + b) 2 b−a a+b b−a


x← or t → x+ so dt = dx
(b − a) 2 2 2 2

Then, the integral can be approximated by:

b
b−a n n b−a n a+b
∫ f (t )dt ≈ ∑ wi f  2 xi + 2 
2 i =1
(6.43)
a

Exercise 6.11
1.25
Use Gaussian quadrature to calculate: ∫ (sin πx + 0.5)dx
0.25
page 69 E763 (part 2) Numerical Methods

7. Solution of Partial Differential Equations


The Finite Difference Method: Solution of Differential Equations by
Finite Differences
We will study here a direct method to solve differential equations which is based on the use of
finite differences. This consists of the direct substitution of derivatives by finite differences and
thus the problem reduces to an algebraic form.

One Dimensional Problems:


Let’s consider a problem given by the equation:

Lf = g (7.1)

where L is a linear differential operator and g(x) is a known function. The problem is to find the
function f(x) satisfying equation (7.1) over a given region (interval) [a, b] and subjected to some
boundary conditions at a and b.
The basic idea is to substitute the derivatives by appropriate difference formulae like those
seen in section 6. This will convert the problem into a system of algebraic equations.
In order to apply systematically the difference approximations we proceed as follow:

–– First, we divide the interval [a, b] into N equal subintervals of length h : h = (b–a)/N,
defining the points xi : xi = a + ih
–– Next, we approximate all derivatives in the operator L by appropriate difference formulae
(h must be sufficiently small – N large – to do this accurately).
–– Finally, we formulate the corresponding difference equation at each point xi. This will
generate a linear system of N–1 equations on the N–1 unknown values of fi = f(xi).

Example:
If we take Lf = g to be a general second order differential equation:

cf ' ' (x) + df' (x) + ef (x) = g(x) (7.2)

with constant coefficients c, d and e. x ∈[a,b] and boundary conditions:

f(a) at a and f(b) at b. (7.3)

Using (6.11) and (6.7) at point i:

fi −1 − 2 fi + fi+1 fi+1 − fi −1
fi ' ' ≈ and fi ' ≈ (7.4)
h2 2h

results for the equation at point i:

fi−1 − 2 fi + fi +1 f −f
c + d i +1 i−1 + e fi = gi (7.5)
h2 2h
E763 (part 2) Numerical Methods page 70

or:
 d 
( 
) d 
 c − h fi−1 + −2c + eh2 fi +  c + h fi +1 = gi h2
 2   2 
(7.6)

for all i except i = 1 and N–1, where for i = 1: fi-1 = f(a) and
for i = N–1: fi+1 = f(b)

This can be written as a matrix problem of the form: A f = g, where f = { fi} g = {h2gi}
are vectors of order N–1. The matrix A has only 3 elements per row:



d 
( )  d 
ai ,i−1 =  c − h , ai,i = −2c + eh 2 , ai,i +1 =  c + h
2   2 

Exercise 7.1
Formulate the algorithm to solve a general second order differential equation over the
interval [a, b], with Dirichlet boundary conditions at a and b (known values of f at a and b).
Write a short computer program to implement it and use it to solve:

f’’ + 2f’ + 5f = e–xsin(x) in [0,5]; f(0) = f(5) = 0

Two Dimensional Problems


We consider now the problem Lf = g in 2 dimensions, where f is a function of 2 variables,
say: x and y. In this case, we need to approximate derivatives in x and y in L. To do this, we
superimpose a regular grid over the domain of the problem and (as for the 1–D case) we only
consider the values of f and g at the nodes of the grid.

Example:
Referring to the figure below, the problem consists of finding the potential distribution
between the inner and outer conducting surfaces of square cross-section when a fixed voltage is
applied between them. The equation describing this problem is the Laplace equation:
2
∇ φ = 0 with the boundary conditions φ = 0 in the outer conductor and φ = 1 in the inner
conductor.
2 ∂ 2φ ∂ 2 φ
To approximate ∇φ= + (7.7)
∂ x2 ∂ y2

we can choose for convenience the same spacing h for x and y. Ignoring the symmetry of the
problem, the whole cross-section is discretized (as in Fig. 8.1). With this choice, there are 56
free nodes (for which the potential is not known).
page 71 E763 (part 2) Numerical Methods

1 2 3 4 5 6 7 8 9

φ=0
10 11 12 18

19 20 22

23 26
φ=1
30

34

38

47

48 56

Fig. 8.1 Finite difference mesh for solution of the electrostatic field in square coaxial line.
On this mesh, we only consider the unknown internal nodal N
values and only those nodes are numbered. An internal point of
the grid, xi, labelled by O in the figure, is surrounded by other O
W xi E
four, which for convenience are labelled by N, S, E and W. For
this point we can approximate the derivatives in (7.7) by: S

∂ 2φ 1 ∂ 2φ 1
≈ (φ − 2φO + φ E ) and ≈ (φ − 2φO + φS ) (7.8)
∂ x2 h 2 W ∂ y2 h 2 N

Then the equation ∇ 2φ = 0 becomes simply:

∇ 2φ ≈ (φ N + φ S + φE + φ W − 4φO ) h2 = 0 or

φ N + φ S + φ E + φW − 4φO = 0 (7.9)

Formulating this equation for each point in the grid and using the boundary conditions
where appropriate, we end up with a system of N equations, one for each of the N free points of
the grid. Applying equation (7.9) to point 1 of the mesh gives:

0 + φ10 + φ 2 + 0 − 4φ1 = 0 (7.10)

and to point 2: 0 + φ11 + φ3 + φ1 − 4φ2 = 0 (7.11)

A typical interior point such as 11 gives:

φ2 + φ20 + φ12 + φ10 − 4φ11 = 0 (7.12)


E763 (part 2) Numerical Methods page 72

and one near the inner conductor, 13 will give:

φ4 + 1+ φ14 + φ12 − 4φ13 = 0 (7.13)

In this way, we can assemble all 56 equations from the 56 mesh points of the figure, in
terms of the 56 unknowns. The resulting 56 equations can be expressed as:

Ax=y (7.14a)

−4 1 L 1 K   φ1   0 
 1 −4 1 K 1 φ   0 
  2   
 1 −4 K 1   φ3   0 
 ⋅  ⋅   ⋅ 
    
 ⋅  ⋅   ⋅ 
 ⋅  φ  =  −1
   13   
or   ⋅   ⋅  (7.14b)
  ⋅   ⋅ 
    
 K −4 1 φ 55   0 
 1 −4  φ56   0 

The unknown vector x of (7.14a) is simply (φ1, φ2, . . . , φ56)T.


The right-hand-side vector y of eqs. (7.14) consists of zeros except for the –1’s coming
from equations like (7.13) corresponding to points next to the inner conductor with potential 1;
namely, from points 12 to 17, 20, 21,..., 41 to 45.
The 56×56 matrix A has mostly zero elements, except for ‘–4’ on the diagonal, and either
two, three or four ‘+1’ somewhere else on each matrix row, the number and distribution
depending on the geometric node location.
One has to be careful not to confuse the row-and-column numbers of the 2-D array A with
the ‘x and y coordinates’ of the physical problem. Each number 1 to 56 of the mesh points in the
figure above corresponds precisely to the row number of matrix A and the row number of
column vector x.
Equation (7.14) is standard, as given in (5.11), and could be solved with a standard library
package, with a Gauss elimination solver routine, or better, a sparse or a band-matrix version of
Gauss elimination. Better still would be the Gauss-Seidel or successive displacement. Section 2
in the Appendix shows details of such a solution.

Enforcing Boundary Conditions

In the description so far we have seen the implementation of the Dirichlet boundary
condition; that is a condition where the values of the desired function are known at the edges of
the region of interest (ends of the interval in 1-D or boundary of a 2-D or 3-D region). This has
been implemented in a straightforward way in equations (7.10), (7.11) and (7.13).
Frequently, other types of boundary conditions appear, for example, when the derivatives
of the function are known (instead of the values themselves) at the boundary - this is the
Neumann condition. In occasions, a mixed condition will apply; something like:
page 73 E763 (part 2) Numerical Methods

∂f
f (r) + = K , where the second term refers to the normal derivative (or derivative along the
∂n
normal direction to the boundary. This type of condition appears in some radiation problems
(Sommerfeld radiation condition).
We will see next some forms of dealing with these boundary conditions in the context of
finite difference calculations.

For example in the case of the ‘square coaxial problem’ studied earlier, we can see that the
solution will have symmetry properties which makes unnecessary the calculation of the potential
over the complete cross-section. In fact only one eighth of the cross-section is actually needed.
The new region of interest can be one eighth φ=0
of the complete cross-section: the shaded region
or one quarter of the cross-section, to avoid ∂φ
oblique lines. In any case, the new boundary =0
∂n
conditions needed on the dashed lines that limit
the new region of interest are of the Neumann φ=1
∂φ
type: =0 since the lines of constant
∂n
potential will be perpendicular to those edges. (n
represents the normal to the boundary).

Fig. 8.2
We will need a different strategy to deal with these conditions.
For this condition it is more convenient to define the mesh in a different manner. If we
place the boundary at half the node spacing from the start of the mesh as in the figure below, we
can implement the normal derivative condition in a simple form:

(Note that the node numbers used in the next figure do not correspond to the mesh numbering
defined earlier for the whole cross-section).

φ=0 The boundary condition that applies at point b (not


∂φ
actually part of the mesh) is: = 0 We can
∂n
a b 1 2 approximate this by using a central difference
∂φ with the point 1 and the auxiliary point a outside
=0 ∂φ φ −φ
∂n 10 the mesh. Then: = 0 = a 1 and then,
∂n b h
φa = φ1 .
Fig. 8.3

So the corresponding equation for point 1 is: φ N + φ S + φ E + φW − 4φ 0 = 0


Substituting values we have:
0 + φ10 + φ1 + φ 2 − 4φ1 = 0
or φ2 + φ10 − 3φ1 = 0
A Dirichlet condition can also be implemented in a similar form, if necessary. For
example if part of a straight boundary is subjected to a Neumann condition and the rest to a
Dirichlet condition (as for example in the case of a conductor that does not cover the whole
boundary), a Dirichlet condition will have to be implemented in a mesh separated half of the
E763 (part 2) Numerical Methods page 74

spacing from the edge.

Exercise 7.2:
Consider the Fig. 8.4 and the points 5 ∂φ
and 15 with the point a on the boundary (not =0 φ=V
in the mesh). Use Taylor expansions at points ∂n a

5 and 15 at distances h/2 and 3h/2 from a, to


1 2 3 4 5 6
show that a condition that can be applied
when the function φ has a fixed value V at the
11 12 13 14 15
boundary and we also want the normal
derivative of φ to be zero there is:
9φ5 − φ15 = 8V.
Fig. 8.4

Using the results of Exercise 7.2, the difference equation corresponding to the
discretization of the Laplace equation at point 5 where the boundary condition is φ = V will be:

φ N + φ 4 − 4φ 5 + φ 6 + φ15 = 0 but φ N = φ 5 and also φ15 = 9φ 5 − 8V so finally:

φ4 + 6φ 5 + φ6 = 8V

Exercise 7.3
Using Taylor expansions, derive an equation to implement the Neumann condition using
five points along a line normal to the edge and at distances 0, h, 2h, 3h, and 4h.

Exercise 7.4 (Advanced)


Using two points like 11 and 12 in the figure of Exercise 7.2, find the equation
∂φ
corresponding to the implementation of a radiation condition of the form: φ + p = K , where
∂n
p is a constant with the same dimensions as h.
Repeat using 3 points along the same line and use the extra degree of freedom to increase
the degree of approximation (eliminating second and third derivatives).

Example Heat conduction (diffusion) in a uniform rod.

A uniform rod of length L is initially at temperature 0. It is then connected to a heat source


at temperature 1 at one end while the other is attached to a sink at temperature 0. Find the
temperature distribution along the rod, as time varies. We need the time-domain solution
(transient). In the steady state we would expect a linear distribution of temperature between both
ends.
The equation describing the temperature variation in space and time, assuming a heat
conductivity of 1 is the normalized diffusion equation:

∂ 2u ∂u
= (7.15)
∂ x 2 ∂t

where u(x,t) is the temperature. The boundary and initial conditions are:
page 75 E763 (part 2) Numerical Methods

u(0,t) = 0
 for all t u(x,0) = 0 for x < L
u(L,t) = 1

We can discretize the solution space (x.t) with a regular grid with spacing ∆x and ∆t: The
solution will be sought at positions: xi, i = 1, ...., M–1 (leaving out of the calculation the points
at the ends of the rod), so ∆ x = L/M) and at times: tn, n = 0, 1, 2,... .
Using this discretization, the boundary and initial condition become:

u0 = 0
n
 for all n u0i = 0 for i = 0, ...., M–1 (7.16)
unM = 1
We now discretize equation (7.15), converting the derivatives into differences:
For the time derivative at time n and position i, we can use the central difference formula:

∂ 2u un − 2uni + ui+1
n
= i −1 (7.17)
∂ x 2 i,n ∆x 2
In the right hand side of (7.15), we need to approximate the time derivative at the same point
(i,n). We could use the forward difference:
∂u uin+1 − uin
≈ (7.18)
∂t ∆t
and in this case we get:
uni−1 − 2uin + uin+1 uni +1 − uin
= (7.19)
∆x 2 ∆t
as the difference equation.
n +1 ∆t n  ∆t  n ∆t n
Rearranging: ui = ui−1 + 1 − 2 2  ui + 2 ui +1 (7.20)
∆x 2  ∆x  ∆x

This gives us a form of calculating the temperature u at point


i and time n+1 as a function of u at points i, i–1 and i+1, at n+1
time n. We have then a time-stepping algorithm.
n
i–1 i i+1
Fig. 8.5
Equation (7.20) is valid for n = 0,...., N and i = 1, ...., M–1.
n n
Special cases: for i = 1: ui−1 = 0, and for i = M–1: ui+1 = 1 for all n (at all times).

 ∆t  ∆t
If we call: b =  1− 2 2  and c = 2 , we can rewrite (7.20) as:
 ∆x  ∆x

u1n +1 = bu1n + cun2


n +1 n n n
u2 = cu1 + bu2 + cu3
(7.21)
L
n +1 n n
uM −1 = L cuM −2 + buM−1 + c

which can be written in matrix form as: un +1 = Aun + v, where the matrix A and the vector v
are:
E763 (part 2) Numerical Methods page 76

b c ⋅ ⋅   0
c b c ⋅ ⋅   0
   
A= c b c ⋅ ⋅  and v =  ⋅ (7.22)
⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅  0
   
 ⋅ c b c 

n n
u is the vector containing all values ui . It is known at n = 0, so the matrix equation (7.21) can
be solved for all successive time steps.

Problems with this formulation:


– It is unstable unless ∆ t ≤ ∆ x2/2 (c ≤ 0.5 or b ≥ 0) This is due to the rather poor
approximation to the time derivative provided by the forward difference (7.18). A better scheme
can be constructed using the central difference in the RHS of (7.15) instead of the forward
difference.

Using the central difference for the time derivative:

the Crank-Nicolson method


∂u u n+1 − uni −1
A central difference approximation to the RHS of (7.15) would be: = i
∂t i,n 2∆t
This would involve values of u at three time steps: n–1, n, and n+1, which would be
undesirable. A solution to this is to shrink the gap between the two values, having instead of the
n +1 n −1 n +1 n
difference between ui and ui , the difference between ui and ui . However, if we
consider this a ‘central’ difference, the derivative must be evaluated at the central point which
corresponds to the value n+1/2:
∂u u n+1 − uin
= i (7.23)
∂t i,n+1/ 2 ∆t

We then need to evaluate the left hand side of (7.15) at the same time, but this is not on the grid!
We will have:
∂ 2u un+1/ 2 − 2uin+1/ 2 + ui+1
n+1/ 2
2 = i −1 2 (7.24)
∂ x i,n+1/ 2 ∆x

Since the values of u are restricted to positions in the grid, we have to approximate the values in
the RHS of (7.24). Since those values are at the centre of the intervals, we can approximate
them by the average between the neighbouring grid points:

n +1/ 2 uin + uni +1


ui ≈ and similarly for i–1, and i+1. (7.25)
2

We can now substitute (7.25) into (7.24) and evaluate the equation (7.15) at time n+1/2. After
re-arranging this gives:
n +1 n+1 n+1 n n n
ui−1 − 2dui + ui+1 = −ui−1 + 2eui − ui+1 (7.26)
page 77 E763 (part 2) Numerical Methods

 ∆x2   ∆x 2 
where d = 1 +  and e =  1− 
 ∆t   ∆t 

This form of treating the derivative (evaluating the equation between grid points in order to
have a central difference approximation for the first order derivative is called the Crank-
Nicolson method and has several advantages over the previous formulation (using the forward
difference).

Fig. 8.6 shows the scheme of calculation: the points or


values involved in each calculation. The dark spot n+1
represents the position where the equation is evaluated. This n+1/2
method provides a second order approximation for both n
i–1 i i+1
derivatives and also, very important, it is unconditionally
stable. Fig. 8.6 Crank-Nicolson scheme

We can now write (7.26) for each value of i as in (7.21), considering the special cases at the ends
of the rod and at t = 0, and write the corresponding matrix form. In this case we will get:
n +1 n
Au = Bu − 2v (7.27)
where:
 −2 d 1 ⋅   2e −1 ⋅   0
 1 −2d 1 ⋅   −1 2e −1 ⋅   ⋅
A= , B=   and v =  
 ⋅ ⋅ ⋅ ⋅  ⋅ ⋅ ⋅ ⋅   ⋅
 ⋅ ⋅ 1 −2d  ⋅ ⋅ −1 2e   1
   

Example
Consider now a parabolic equation in 2 space dimensions and time, like the Schroedinger
equation or the diffusion equation in 2D:

∂ 2u ∂ 2 u ∂u
+ =a (7.28)
∂ x2 ∂ y2 ∂t

For example, this could represent the temperature distribution over a 2-dimensional plate. Let’s
consider a square plate of length 1 in x and y and the following boundary conditions:

a) u(0,y,t) = 1 That is, the sides x = 0 and y = 0 are kept at


b) u(1,y,t) = 0 temperature 1 at all times while the sides x = 1
c) u(x,0,t) = 1 and y = 1 are kept at u = 0.
d) u(x,1,t) = 0 The whole plate, except the sides x = 0 and y = 0
e) u(x,y,0) = 0 for x > 0 and y > 0 are at temperature 0 at t = 0.

The following diagram shows the discretization for the x coordinate:

x:
0 1 ... i–1 i i+1 ... R R+1
E763 (part 2) Numerical Methods page 78

There are R+2 points with R unknown values (i = 1, ..., R). The two extreme points: i = 0 and i
= R+1 correspond to x = 0 and x = 1, respectively.
The discretization of the y coordinate can be made in similar form. For convenience, we
can also use R+2 points, where only R are unknown: (j = 1, ..., R).
Discretization of time is done by considering t = n∆ t, where ∆ t is the time step.

Approximation of the time derivative (first order):


As in the previous example, in order to use the central difference and only two time levels,
we use the Crank-Nicolson method. That is, equation (7.28) will be evaluated half way between
time grid points.
This will give:
∂u uni, j+1 − ui,n j
= (7.29)
∂t i, j,n+1/ 2 ∆t

We need to approximate the LHS at the same time using the average of the values at n and n+1:

∇ 2 u(n +1 / 2) = 1
2 (∇2u (n) + ∇2u( n+1) )
Applying this and (7.29) to (7.28) we get:

∇ u
2 (n +1)
+∇ u
2 ( n)
=
∆t
(
2a (n+1)
u −u
(n)
)
or re-writing:
2 (n +1) 2 a ( n +1)  2 (n) 2a ( n) 
∇ u − u =  −∇ u + u  (7.30)
∆t  ∆t 

where u is still a continuous function of position.

Approximation of the space derivative (second order):


Using the central difference for the second order derivatives and using ∆x = ∆y = h, we
get:
1
∇ u = 2 (uN + uS + uW + u E − 4uO )
2
h
and using this in (7.30):

 2a 
uin−+11, j + uin++11, j + uin, +j −11 + uin, +j +11 −  4 + uin, +j 1
 ∆t 
  2a  
= − uin−1, j + uin+1, j + uin, j −1 + uin, j +1 −  4 − uin, j  (7.31)
  ∆t  

Defining now node numbering over the grid and a vector u(n) containing all the unknown values
(for all i and j), (7.31) can be written as a matrix equation of the form:

Au(n+1) = Bu(n) (7.32)


page 79 E763 (part 2) Numerical Methods

Exercise 7.5
A length L of transmission line is terminated at both ends with a short circuit. A unit
impulse voltage is applied in the middle of the line at time t = 0.
The voltage φ(x, t) along the line satisfies the following differential equation:

∂ 2φ 2
2∂ φ
2 +p =0 with the boundary and initial conditions:
∂x ∂t 2

φ (0, t ) = φ ( L, t ) = 0 , φ ( x, t ) = 0 for t < 0 and φ ( L 2,0) = 1

a) By discretizing both the coordinate x and time t as xi = (i – 1)∆x , i = 1, 2, .., R+1 and
tm = m∆t , m = 0, 1, 2, ..., use finite differences to formulate the numerical solution of the
equation above. Show that the problem reduces to a matrix problem of the form:

Φ m +1 = AΦ m + BΦ m −1

where A and B are matrices and Φm is a vector containing the voltages at each of the
discretization points xi at time tm.

b) Choosing R = 7, find the matrices A and B giving the values of their elements, taking
special care at the edges of the grid. Show the discretized equation corresponding to points at
the edge of the grid (for i = 1 or R+1) and consider the boundary condition φ(0) = φ(L) = 0.

∂φ
c) How would the matrices A and B change if the boundary condition was changed to =0
∂x
at x = 0, L (corresponding in this case to an open circuit). Show the discretized equation
corresponding to one of the edge points and propose a way to transform it so it contains only
values corresponding to points in the defined grid.
E763 (part 2) Numerical Methods page 80

8. Solution of Boundary Value Problems


A physical problem is often described by a differential equation of the form:
r r
Lu ( x ) = s ( x ) on a region Ω. (8.1)

r r
where L is a linear differential operator, u( x ) is a function to be found, s( x ) is a known
r
function and x is the position vector of any point in Ω (coordinates).
We also need to impose boundary conditions on the values of u and/or its derivatives on
Γ, the boundary of Ω, in order to have a unique solution to the problem.
The boundary conditions can be written in general as:
r r
Bu ( x ) = t ( x ) on Γ. (8.2)

r
with B, a linear differential operator and t ( x ) a known function. So we will have for example:

r r
B=1: u( x ) = t ( x ) : known values of u on Γ (Dirichlet condition)
r
∂ ∂u( x ) r
B= : = t( x ) : fixed values of the normal derivative (Neumann condition)
∂n ∂n

∂ ∂u r
B= +k : + ku = t( x ) : mixed condition (Radiation condition)
∂n ∂n

A problem described in this form is known as a boundary value problem (differential equation
+ boundary conditions).

Strong and Weak Solutions to Boundary Value Problems


Strong Solution:
This is the direct approach to the solution of the problem, as specified above in (8.1)-(8.2);
for example, the finite difference solution method studied earlier is of this type.

Weak Solution:
This is an indirect approach. Instead of trying to solve the problem directly, we can re-
formulate it as the search for a function that satisfies some conditions also satisfied by the
solution to the original problem (8.1)-(8.2). With a proper definition of these conditions the
search will lead to the correct and unique solution.
Before expanding on the details of this search, we need to introduce the idea of inner
product between two functions. This will allow us to quantify (put a number to) how close a
function is to another or how big or small is an error function.

The commonest definition of inner product between two functions f and g is:

f , g = ∫ fg dΩ for real functions f and g or


page 81 E763 (part 2) Numerical Methods

(8.3)
f , g = ∫ f g * dΩ for complex functions f and g

In general, the inner product between two functions will be a real number obtained by
global operations between the functions over the domain and satisfying some defining properties
(for real functions):

i) f , g = g, f
ii) αf + β g, h = α f , h + β g, h α and β scalars (8.4)
iii) f , f = 0 if and only if f = 0

From this definition we can deduce that:


If for a function r, we have that r, h = 0 for any choice of h, then r = 0.

We will try to use this property for testing an error residual r.

We are now in position to formulate the weak solution of (8.1)-(8.2) as:


r r r
Given the function s( x ) and an arbitrary function h( x ) in Ω; find the function u( x ) that
satisfies:

Lu, h = s, h or Lu − s, h = 0 (8.5)

for any choice of the function h.

Approximate Solutions to the Weak Formulation


The Rayleigh-Ritz Method
r
In this case, instead of trying to find the exact solution u( x ) that satisfies (8.5), we only
look for an approximation. We can choose a set of basis functions and assume that the wanted
function, or rather, a suitable approximation to it, can be represented as an expansion in terms of
the basis functions (as for example with Fourier expansions using sines and cosines).
Now, once we have chosen the basis functions, the unknown is no longer the function u,
but its expansion coefficients; that is: numbers! so the problem (8.5) is converted from “finding a
function u among all possible functions defined on Ω” into the search for a set of numbers. We
now need some method to find these numbers.
We will see two methods to solve (84) using the Rayleigh-Ritz procedure above:

Weighted Residuals
Variational Method

Weighted Residuals
For this approach the problem (8.5) is rewritten in the form: r = Lu – s = 0 (8.6)
where the function r is the residual or error or the difference
between the LHS and the RHS of (7.19) when we try any function in place of u. This residual
E763 (part 2) Numerical Methods page 82

will only be zero when the function we use is the correct solution to the problem.
We can now look at the solution to (7.24) as an optimization (or minimization) problem;
that is: “Find the function u such that r = 0” or in approximate way:
“Find the function u such that r is as small as possible” and here we need to be able to
measure how ‘small’ is the error (or residual).
Using these ideas on the weak formulation, this is now transformed into:
r r r
Given the function s( x ) and an arbitrary function h( x ) in Ω; find the function u( x ) that
satisfies:
r, h = 0 (where r = Lu – s) (8.7)

for any choice of the function h.

Applying the Rayleigh-Ritz procedure, we can put:


N
r r
u( x ) = ∑ d j bj ( x ) (8.8)
j =1

where dj, j = 1, ... , N are scalar coefficients and


r
b j ( x ), j = 1, ... , N are a set of linearly independent basis functions (normally
called trial functions – chosen)

We now need to introduce the function r h. For this, we can also use an expansion, in
general using another set of functions wi ( x ) for example:

N
r r
h( x ) = ∑ ci wi ( x ) (8.9)
i =1

where ci, i = 1, ... , N are scalar coefficients and


r
wi ( x ), i = 1, ... , N a set of linearly independent expansion functions for h
(normally called weighting functions – also chosen)

Using now (8.9) into (8.7) we get:

N N
r r r r
r, h = r( x ), ∑ ci wi ( x ) = ∑ ci r( x ),wi ( x ) = 0 (8.10)
i=1 i=1

for any choice of the coefficients ci.


Since (8.10) has to be satisfied for any choice of the coefficients ci (equivalent to say ‘any
choice of h), we can deduce that:

r,wi = 0 for all i, i = 1, ... , N (8.11)

which is simpler than (8.10). (It comes from the choice: ci = 1 and all others = 0, in sequence).

So, we can conclude that we don’t need to test the residual against any (and all) possible
function h; we only need to test it against the chosen weighting functions wi, i = 1, ... , N.
page 83 E763 (part 2) Numerical Methods

Now, we can expand the function u in terms of the trial (basis) functions as in (8.8) and use
this expansion in r: r = Lu – s. With this, (8.11) becomes:

N
r r r
r , wi = Lu − s, wi = L ∑ d j b j ( x ) − s ( x ), wi ( x ) = 0 for all i
j =1
or
N
r , wi = ∑ d j Lb j , wi − s, wi = 0 for all i (8.12)
j =1

Note that in this expression the only unknowns are the coefficients dj.
N
We can rewrite this expression (8.12) as: ∑ aij d j = si for all i: i = 1, N
j =1
which can be put in matrix notation as:
Ad=s (8.13)

where A = { aij} , d = { dj } and s = { si} with aij = Lb j , wi and si = s, wi

Since the trial functions bj are known, the matrix elements aij are all known and the problem has
been reduced to solve (8.13), finding the coefficients dj of the expansion of the function u.

Different choices of the trial functions and the weighting functions define the different
variations by which this method is known:

i) wi = bi –––––––> method of Galerkin


r r r
ii) wi ( x ) = δ ( x − x i ) –––––––> Point matching method This is equivalent to asking
for the residual to be zero at a fixed number of points on the domain.

Variational Method

The central idea of this method is to find a functional* (or variational expression)
associated to the boundary value problem and for which the solution of the BV problem leads to
a stationary value of the functional. Combining this idea with the Rayleigh-Ritz procedure we
can develop a systematic solution method.

Before introducing numerical techniques to solve variational formulations, let’s examine


an example to illustrate the nature of the variational approach:

* A functional is simply a function of a function over a complete domain which gives as a result
a number. For example: J(φ) = ∫ φ dΩ . This expression gives a numerical value for each
2


r
function φ we use in J(φ). Note that J is not a function of x .
E763 (part 2) Numerical Methods page 84

Example:
The problem is to find the path of light travel between two points. We can start using a
variational approach and for this we need a variational expression related to this problem. In
particular, for this case we can use Fermat’s Principle that says that ‘light travels through the
quickest path’ (Note that this is not necessarily the shortest). We can formulate this statement in
the form:
P2
ds
time = min ∫ (8.14)
v(s)
P1

where v(s) is the velocity point by point along the path. We are not really interested in the actual
time taken to go from P1 to P2, but on the conditions that this minimum imposes on the path
s(x,y) itself. That is, we want to find the actual path between these points that minimize this
time. We can write for the velocity: v = c/n, where n(x,y) is the refractive index.
Let’s consider first a uniform medium, that is, one with n uniform (constant). In that case,
(8.14) becomes:
P2
n n
time = min ∫ ds = min( path length) (8.15)
c c
P1

The integral in this case reduces to the actual length of the path, so the above statement asks for
the path of ‘minimum length’ or the shortest path between P1 to P2. Obviously, the solution in
this case is the straight line joining the points. However, you can see that (8.14) can be applied
to an inhomogeneous medium, and the minimization process should lead to the actual trajectory.
Extending the example a little more, let’s consider an interface between two media with
different refractive indices. Without losing generality, we can consider a situation like that of
the figure.
From above, we know that in each media the path will be a straight line, but what are the
coordinates of the point on the y–axis (interface) where both straight lines meet?

Applying (8.14) gives:


y
 P0 P2 
1
P2 time = min n1 ∫ ds + n2 ∫ ds (8.16)
c  
 P1 P0 
P0

P1

Fig. 8.7

Both integrals correspond to the respective lengths of the two branches of the total path, but we
don’t know the coordinate y0 of the point P0. We can re-write (8.16) in the form:

1
[2 2 2
time = min n1 x1 + (y1 − y0 ) + n2 x 2 + (y2 − y0 )
c
2
] (8.17)

where the only variable (unknown) is y0. To find it we need to find the minimum, and for this
page 85 E763 (part 2) Numerical Methods

we do:
 
d −(y1 − y0 ) −(y2 − y0 )
(time) = 0 = n1 + n2  (8.18)
dy 0  x1 + (y1 − y0 )2
2
x22 + (y2 − y0 )2 

Now, from the figure we can observe that the right hand side can be written in terms of the
angles of incidence and refraction as:
n1 sin α 1 − n2 sin α 2 = 0 (8.19)

as the condition the point P0 must satisfy, and we know this is right because (8.19) is the familiar
Snell’s law.

So, the general idea of this variational approach is to formulate the problem in a form that we
look for a stationary condition (maximum, minimum, inflexion point) on some parameter which
depends on the desired solution. (Like in the above problem, we looked for the time, which
depends on the actual path travelled).
An important property of a variational approach is that precisely because the solution
function produces a stationary value of the functional, this is rather insensitive to small
perturbations of the solution (approximations). This is a very desirable property, particularly for
the application of numerical methods where all solutions are only approximate. To illustrate this
property, let’s analyse another example:

Example
Consider the problem of finding the natural resonant frequencies of a vibrating string (a
chord in a guitar). For this problem, an appropriate variational expression is (do not worry about
where this comes from, but there is a proof in the Appendix):
b 2
 dy 
∫  dx  dx
2
k = s.v. a b (8.20)
∫y
2
dx
a

The above expression corresponds to the k–number or resonant frequencies of a string vibrating
freely and attached at the ends at a and b.
For simplicity, let’s change the limits to –a and a. The first mode of oscillation has the
form:
πx dy Aπ πx
y = Acos then =− sin
2a dx 2a 2a

Using this exact solution in (99): we get for the first mode (prove this):

2 π2 π 1.571
k = then k= ≈
4a 2 2a a

Now, to show how a ‘bad’ approximation of the function y can still give a rather acceptable
value for k, let’s try a simple triangular shape (instead of the correct sinusoidal shape of the
vibrating string):
E763 (part 2) Numerical Methods page 86

  x
 A1 + a  x < 0
y=  
 x 
 A1 −  x > 0
  a  -a a

Fig. 8.8
dy  A/ a x <0
then = and using these values in (99), gives (prove this):
dx − A / a x >0

2 3 1.732
k = then k≈
a2 a

which is not too bad considering how coarse is the approximation to y(x). If instead of the
triangular shape we try a second order (parabolic) shape:

(
y = a 1 − (x a)
2
) then
dy
dx
A
= −2 2 x
a

Substituting these values in (99) gives now:

2 2.5 1.581
k = then k=
a2 a

which is a rather good approximation.

Now, how can we use this method systematically? As said before, the use of the Rayleigh-
Ritz procedure permits to construct a systematic numerical method. In summary, we can specify
the necessary steps as:

BV problem: L u = s ––––––––> Find variational expression J(u)

N
Use Rayleigh-Ritz: u = ∑ d j bj Insert in J(u)
j =1

Find stationary value of J(u) = J({ dj}), that is:


find the coefficients dj

N
Reconstruct u = ∑ d j bj
j=1
u is solution of BV problem <–––––––– then

We will skip here the problem of actually finding the corresponding variational expression
to a boundary value problem, simply saying that there are systematic methods to find them. We
will be concerned here on how to solve a problem, once we already have a variational
formulation.
page 87 E763 (part 2) Numerical Methods

Example
For the problem of the square coaxial (or square capacitor) seen earlier, the BV problem is
defined by the Laplace equation: ∇ 2φ = 0 with some boundary conditions (L = ∇ 2 , u = φ,
s = 0).
An appropriate functional (variational expression) for this case is:

J (φ ) = ∫ (∇φ ) dΩ (given)
2
(8.21)

2
  N 
Using Rayleigh-Ritz: J (φ ) = ∫  ∇ ∑ d j b j (x, y)  dΩ
Ω   j =1 

2
N 
= ∫  ∑ d j ∇b j (x, y) dΩ
Ω  j=1 

N N
J (φ ) = ∫ ∑ ∑ did j∇bi (x, y) ⋅ ∇b j (x, y) dΩ (8.22)
Ω i =1 j =1

This can be written as the matrix equation: J(d) = dTAd,

where d = { dj} and aij = ∫ ∇bi ⋅ ∇b j dΩ


∂J
Now, find stationary value: =0 for all i: i = 1, ... , N
∂d i

N
∂J
so, applying it to (8.22): = ∑ aij d j = 0 , for all i: i = 1, ... , N
∂di j=1

And the problem reduces to the matrix equation: Ad=0 (8.23)

Solving the system of equations (8.23) we obtain the coefficients dj and the unknown
function can be obtained as:
N
u(x,y) = ∑ d j b j (x, y)
j=1

We can see that both methods, the weighted residuals and the variational method transform
the BV problem into an algebraic, matrix problem.

One of the first steps in the implementation of either method is the choice of appropriate
expansion functions to use in the Rayleigh-Ritz procedure: basis or trial functions and weighting
functions.
E763 (part 2) Numerical Methods page 88

The finite element method provides a simple form to construct these functions and
implementing these methods.
page 89 E763 (part 2) Numerical Methods

9. FINITE ELEMENTS
As the name suggests, the finite element method is based on the division of the domain of
interest Ω into ‘elements’, or small pieces Ei that cover Ω completely but without intersections:
They constitute a tessellation (tiling) of Ω:

U Ei = Ω ; Ei I E j = ∅
i

Over the subdivided domain we apply the methods discussed earlier (either weighted
residuals or variational). The basis functions are defined locally in each element and because
each element is small, these functions can be very simple and still constitute overall a good
approximation of the desired function. In this form inside each element Ee, the wanted function
r r
u( x ) in (8.1) is represented by a local approximation u˜e ( x ) valid only in the element number e:
r
Ee. The complete function u( x ) over the whole domain Ω is then simply approximated by the
r r
addition of all the local pieces: u( x ) ≈ ∑ u˜e ( x )
e
An important characteristic of this method is that it is ‘exact-in-the-limit’, that is, the
degree of approximation can only improve when the number of elements increase, the solution
gradually and monotonically converging to the exact value. In this form, a solution can always
be obtained to any degree of approximation, provided the availability of computer resources.

One dimensional problems


Let’s consider first a one dimensional case with Ω as the interval [a, b]. We first divide Ω
into N subintervals, not necessarily equal in size!, defined by N+1 nodes xi, as shown in the
figure. The wanted function u(x) is then locally approximated in each subinterval by a simple
function, for example, a straight line: a function of the type: ax + b, defined differently in each
subinterval. The total approximation u˜(x) is then the superposition of all these locally defined
functions, a piecewise linear approximation to u(x). The amount of error in this approximation
will depend on the size of each element (and consequently on the total number of elements) and
more importantly, on their size in relation to the local variation of the desired function.

u˜e (x) ui+1


u˜(x) u(x)
u(x)

ui ui+1 Ni+1 (x)


ui Ni (x)
1

Ni (x) Ni+1 (x)


a x2 x3 xi b
xi xi+1
Fig. 9.1 Piecewise linear approximation Fig. 9.2 Shape functions

The function u˜(x) , approximation to u(x), is the addition of the locally defined functions
u˜e (x) which are only nonzero in the subinterval e. Now, these local functions u˜e (x) can be
E763 (part 2) Numerical Methods page 90

defined as the superposition of interpolation functions Ni(x) and Ni+1(x) as shown in the figure
above (right). From the figure we can see that the function u˜e (x), local approximation to u(x) in
element e, can be written as:
u˜e (x) = ui Ni (x) + ui+1 Ni+1 (x) (9.1)

Now, if we consider the neighbouring subintervals and the associated interpolation


functions as in the figure below, we can extend the definition of these functions to form the
triangular shapes below:
u˜(x)

Ni (x)
Ni+1 (x)

xi-1 xi xi+1 xi+2


Fig. 9.3 Shape functions and approximation over two adjacent intervals

With this definition, we can write for the function u˜(x) in the complete domain Ω:

Ne Np
u(x) ≈ u˜ (x) = ∑ u˜ e (x) = ∑ ui Ni (x) (9.2)
e=1 i =1

(Np is the number of nodes, Ne is the number of elements). The functions Ni(x) are defined as
the (double-sided) interpolation functions at node i, i = 1, ... , Np so Ni(x) = 1 at node i and 0 at
all other nodes.

This form of expanding u(x) has a very useful property:


— The trial functions Ni(x), known in the FE method as shape functions, are
interpolation functions, and as a consequence, the coefficients of the expansion: ui in
(104), are the nodal values, that is, the values of the wanted function at the nodal
points. Solving the resultant matrix equation will give these values directly.

Exercise 9.1:
Examine this figure and that of the previous page and show that indeed (9.1) is valid in the
element (xi, xi+1) and so (9.2) is valid over the full domain.
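As a minimal illustration of (9.1)–(9.2) (a sketch in Matlab; the function and variable names are only illustrative), the piecewise linear approximation can be evaluated at any point by locating the element containing it and combining the two local shape functions:

function u = fe1d_interp(x_nodes, u_nodes, x)
% Evaluates the piecewise linear approximation (9.2) at the points x.
% x_nodes: node positions (not necessarily equally spaced); u_nodes: nodal values.
u = zeros(size(x));
for p = 1:numel(x)
    i = find(x_nodes(1:end-1) <= x(p) & x(p) <= x_nodes(2:end), 1);  % element containing x(p)
    h = x_nodes(i+1) - x_nodes(i);
    Ni  = (x_nodes(i+1) - x(p))/h;            % shape function of node i
    Ni1 = (x(p) - x_nodes(i))/h;              % shape function of node i+1
    u(p) = u_nodes(i)*Ni + u_nodes(i+1)*Ni1;  % eq. (9.1)
end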

Example
Consider the boundary value problem: $\dfrac{d^2 y}{dx^2} + k^2 y = 0$ for $x \in [a, b]$, with $y(a) = y(b) = 0$.

This corresponds for example, to the equation describing the envelope of the transverse
displacement of a vibrating string attached at both ends. A suitable variational expression
(corresponding to resonant frequencies) is the following, as seen in (8.20) and in the Appendix:

$$J = k^2 = \frac{\displaystyle\int_a^b \left(\frac{dy}{dx}\right)^2 dx}{\displaystyle\int_a^b y^2\, dx} \qquad (9.3)$$

Using now the expansion: $\; y = \displaystyle\sum_{j=1}^{N} y_j N_j(x) \qquad (9.4)$

where N is the number of nodes, yj are the nodal values (unknown) and Nj(x) are the shape
functions.
From (9.3) we have: $k^2 = k^2(y_j) = \dfrac{Q}{R}$, where:
$$Q = \int_a^b \left(\frac{d}{dx}\sum_{j=1}^N y_j N_j(x)\right)^2 dx = \int_a^b \left(\sum_{j=1}^N y_j \frac{dN_j}{dx}\right)^2 dx = \int_a^b \sum_{j=1}^N y_j \frac{dN_j}{dx} \sum_{k=1}^N y_k \frac{dN_k}{dx}\, dx \qquad (9.5)$$
and
$$R = \int_a^b \left(\sum_{j=1}^N y_j N_j(x)\right)^2 dx = \int_a^b \sum_{j=1}^N y_j N_j(x) \sum_{k=1}^N y_k N_k(x)\, dx \qquad (9.6)$$

To find the stationary value, we set: $\dfrac{dk^2}{dy_i} = 0$ for each $y_i$, $i = 1, \ldots, N$. (9.7)

But since $k^2 = \dfrac{Q}{R}$, then $\dfrac{dk^2}{dy_i} = \dfrac{Q'R - QR'}{R^2}$, so $\dfrac{dk^2}{dy_i} = 0 \Rightarrow Q'R = QR'$ and $Q' = \dfrac{Q}{R}R' = k^2 R'$, so finally:
$$\frac{dQ}{dy_i} = k^2 \frac{dR}{dy_i} \quad\text{for all } y_i,\; i = 1, \ldots, N \qquad (9.8)$$

We now have to evaluate these derivatives:

b N
dQ
b
dN 
N
dN  dN j  dNi
= ∫ i  ∑ yk k  dx + ∫  ∑ y j  dx
dyi dx  k =1 dx  dx  dx
a a  j=1
or
dN  dN j   b dN dN j 
b N N
dQ
= 2∫ i  ∑ y j  dx = 2∑ y j  ∫ i
 dx 
dyi dx  j =1 dx  j =1  a dx dx 
a
which can be written as:
N b
dQ dN dN j
= 2∑ aij y j where aij = ∫ i dx (9.9)
dyi j =1 a
dx dx
For the second term:
$$\frac{dR}{dy_i} = \int_a^b N_i(x)\left(\sum_{j=1}^N y_j N_j(x)\right) dx + \int_a^b \left(\sum_{k=1}^N y_k N_k(x)\right) N_i(x)\, dx$$
or
$$\frac{dR}{dy_i} = 2\sum_{j=1}^N y_j \int_a^b N_i(x) N_j(x)\, dx = 2\sum_{j=1}^N b_{ij}\, y_j \quad\text{with}\quad b_{ij} = \int_a^b N_i(x) N_j(x)\, dx \qquad (9.10)$$

Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation:

A y = k2 B y (9.11)

where A = { aij} , B = { bij} and y = { yj} is the vector of nodal values.

Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalue k2
and the corresponding solution vector y (list of nodal values of the function y(x)).
The matrix elements are:
$$a_{ij} = \int_a^b \frac{dN_i}{dx}\frac{dN_j}{dx}\, dx \quad\text{and}\quad b_{ij} = \int_a^b N_i N_j\, dx \qquad (9.12)$$

The shape functions Nᵢ and Nⱼ are only nonzero in the vicinity of nodes i and j respectively (see figure below). So aᵢⱼ = bᵢⱼ = 0 if Nᵢ and Nⱼ do not overlap; that is, if j ≠ i–1, i, i+1. Then the matrices A and B are tridiagonal: they will have no more than 3 nonzero elements per row.

[Figure: shape functions Nᵢ and Nⱼ centred at nodes i and j; when the nodes are not adjacent (…, i–1, i, i+1, …, j–1, j, j+1, …) the functions do not overlap.]

Exercise 9.2:
Define generically the triangular functions Ni, integrate (9.12) and calculate the value of the
matrix elements aij and bij.
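As a sketch of how (9.11) can be assembled and solved numerically (Matlab, assuming for simplicity a uniform mesh of element length h; the 2×2 element matrices used are the standard results for first-order elements, which follow from the integrals of Exercise 9.2):

a = 0; b = 1; Ne = 20; h = (b-a)/Ne; Np = Ne+1;   % uniform mesh of Ne elements
A = zeros(Np); B = zeros(Np);
Ae = [1 -1; -1 1]/h;            % element integrals of dNi/dx dNj/dx
Be = h/6*[2 1; 1 2];            % element integrals of Ni Nj
for e = 1:Ne
    n = [e e+1];                % global node numbers of element e
    A(n,n) = A(n,n) + Ae;       % add the element contributions
    B(n,n) = B(n,n) + Be;
end
A = A(2:Np-1,2:Np-1); B = B(2:Np-1,2:Np-1);   % impose y(a) = y(b) = 0
k2 = sort(eig(A,B));            % eigenvalues k^2 of (9.11)
k2(1)                           % the lowest one approximates (pi/(b-a))^2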

Two Dimensional Finite Elements

In this case, a two dimensional region Ω is subdivided into smaller pieces or ‘elements’ and the subdivision satisfies the same properties as in the one-dimensional case; that is, the elements cover the region of interest completely and there are no intersections (no overlapping). The most common form of subdividing a 2D region is by using triangles. Quadrilateral elements are also used and they have some useful properties, but by far the most common, versatile and easiest to use are triangles with straight sides. There are well-developed methods to produce an appropriate meshing (or subdivision) of a 2D region into triangles and they have maximum flexibility to accommodate intricate shapes of the region of interest Ω.
The process of calculation follows the same route as in the 1D case. Now, a function of
two variables u(x,y) is approximated by shape functions Nj(x,y) defined as interpolation
functions over one element (in this case a triangle). This approximation is given by:

$$u(x,y) \approx \sum_{j=1}^{N} u_j N_j(x,y) \qquad (9.13)$$

where N is the number of nodes in the mesh and the coefficients uj are the nodal values
(unknown). Nj(x,y) are the shape functions defined for every node of the mesh.

[Fig. 9.4]
The figure shows a rectangular region in the xy–plane subdivided into triangles, with the corresponding piecewise planar approximation to a function u(x,y) plotted along the vertical axis. Note that the approximation is composed of flat ‘tiles’ that fit exactly along the edges, so the approximation is continuous over the entire region but its derivatives are not. The approximation shown in the figure uses first order functions, that is, pieces of planes (flat tiles). Other types are also possible but require defining more nodes in each triangle. (While a plane is totally defined by 3 points - e.g. the nodal values, a second order surface, for example, will need 6 points - for example the values at the 3 vertices and at 3 midside points.)

For a first order approximation, the function u(x,y) is approximated in each triangle by a
function of the form (first order in x and y):
$$\tilde u(x,y) = p + qx + ry = (1 \;\; x \;\; y)\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (9.14)$$

where p, q and r are constants with different values in each triangle. Similarly to the one
dimensional case, this function can be written in terms of shape functions (interpolation
polynomials):
u(x,y) ≈ u˜ (x,y) = u1 N1(x, y) + u2 N2 (x, y) + u3 N3 (x, y) (9.15)

for a triangle with nodes numbered 1, 2 and 3, with coordinates (x1 , y1), (x2 , y2) and (x3 , y3).

The shape functions Ni are such that Ni = 1 at node i and 0 at all the others:
It can be shown that the function Ni satisfying this property is:

$$N_i(x,y) = \frac{1}{2A}\,(a_i + b_i x + c_i y) \qquad (9.16)$$

where A is the area of the triangle and

a1 = x2y3 – x3y2 b1 = y2 – y3 c1 = x3 – x2
a2 = x3y1 – x1y3 b2 = y3 – y1 c2 = x1 – x3
a3 = x1y2 – x2y1 b3 = y1 – y2 c3 = x2 – x1

See the derivation in the Appendix.
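These formulas translate directly into code; a minimal sketch (Matlab, with a hypothetical function name) returning the coefficients and the area of a triangle from its vertex coordinates:

function [a, b, c, A] = tricoef(x, y)
% Coefficients of the first-order shape functions Ni = (ai + bi*x + ci*y)/(2A)
% for a triangle with vertices (x(1),y(1)), (x(2),y(2)), (x(3),y(3)).
a = [x(2)*y(3)-x(3)*y(2); x(3)*y(1)-x(1)*y(3); x(1)*y(2)-x(2)*y(1)];
b = [y(2)-y(3); y(3)-y(1); y(1)-y(2)];
c = [x(3)-x(2); x(1)-x(3); x(2)-x(1)];
A = (a(1)+a(2)+a(3))/2;        % area of the triangle, cf. (A6.9)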

The function N1 defined in (9.16) and shown below corresponds to the shape function
(interpolation function) for the node numbered 1 in the triangle shown. Now, the same node can
be a vertex of other neighbouring triangles in which we will also define a corresponding shape
function for node 1 (with expressions like (9.16) but with different values of the constants a, b
and c – different orientation of the plane), building up the complete shape function for node 1:
N1. This is shown in the next figure for a node number i, which belongs to five triangles.

u2 u2
u1 N1 + u2 N2 + u3 N3
u2 N2

u1 N1 2 2
u1 u1

1 1 u3

3 3
Fig. 9.5 Fig. 9.6

Fig. 9.7

Joining all the facets of Ni, for each of the triangles that contain node i, we can refer to this
function as Ni(x,y) and then, considering all the nodes of the mesh, the function u can be written
as:
$$u(x,y) = \sum_{j=1}^{N} u_j N_j(x,y) \qquad (9.17)$$

We can now use this expansion for substitution in a variational expression or in the
weighted residuals expression, to obtain the corresponding matrix problem for the expansion
coefficients. An important advantage of this method is that these coefficients are precisely the
nodal values of the wanted function u, so the result is obtained immediately when solving the
matrix problem. An additional advantage is the sparsity of the resultant matrices.

MATRIX SPARSITY
Considering the global shape function for node i, shown in the figure above, we can see
that it is zero in all other nodes of the mesh. If we now consider the global shape function
corresponding to another node, say, j, we can see that products of the form Ni Nj or of
derivatives of these functions, which will appear in the definition of the matrix elements (as seen
in the 1–D case), will almost always be zero, except when the nodes i and j are either the same
node or immediate neighbours so there is an overlap (see figure above). This implies that the
corresponding matrices will be very sparse, which is very convenient in terms of computer
requirements.

Example:
If we consider a simple mesh (a rectangular region divided into 12 triangular elements, numbered 1 to 12, with 12 nodes, numbered 1 to 12), the matrix sparsity pattern results as shown below:
[Figure: the 12-node, 12-triangle mesh and the corresponding 12×12 sparsity pattern. For example, triangle 2, with nodes 2, 5 and 6, contributes to the matrix entries linking those nodes, and node 11, which belongs to triangles 10, 11 and 12, generates the nonzero entries in row 11.]

Example:
For the problem of finding the potential distribution in the square coaxial, or equivalently, the
temperature distribution (in steady state) between the two square section surfaces, an appropriate
variational expression is:
$$J = \int_\Omega (\nabla\phi)^2\, d\Omega \qquad (9.18)$$

and substituting (9.17):
$$J = \int_\Omega \left(\nabla \sum_{j=1}^N \phi_j N_j\right)^2 dx\, dy = \int_\Omega \left(\sum_{j=1}^N \phi_j \nabla N_j\right)^2 dx\, dy$$
which can be re-written as:
$$J = \int_\Omega \sum_{i=1}^N \phi_i \nabla N_i \cdot \sum_{j=1}^N \phi_j \nabla N_j \; dx\, dy$$
or
$$J = \sum_{i=1}^N \sum_{j=1}^N \phi_i \phi_j \int_\Omega \nabla N_i \cdot \nabla N_j \; dx\, dy = \Phi^{\mathrm T} A\, \Phi \qquad (9.19)$$
where $A = \{a_{ij}\}$, $a_{ij} = \int_\Omega \nabla N_i \cdot \nabla N_j \; dx\, dy$ and $\Phi = \{\phi_j\}$.


Note that again, the coefficients aij can be calculated and the only unknowns are the nodal values
φj.

We now have to find the stationary value; that is, we set $\dfrac{\partial J}{\partial \phi_i} = 0$ for $i = 1, \ldots, N$:
$$\frac{\partial J}{\partial \phi_i} = 0 = 2\sum_{j=1}^N a_{ij}\, \phi_j \quad\text{for } i = 1, \ldots, N \qquad\text{then:}\quad A\,\Phi = 0 \qquad (9.20)$$

Then, the resultant equation is (9.20): AΦ = 0. We need to evaluate first the elements of
the matrix A. For this, we can consider the integral over the complete domain Ω as the sum of
the integrals over each element Ω k, k = 1, … , Ne (Ne elements in the mesh).

$$a_{ij} = \int_\Omega \nabla N_i \cdot \nabla N_j \; dx\, dy = \sum_{k=1}^{N_e} \int_{\Omega_k} \nabla N_i \cdot \nabla N_j \; dx\, dy \qquad (9.21)$$

Before calculating these values, let’s consider the matrix sparsity; that is, let’s see which elements of A are actually nonzero. As discussed earlier (in the matrix sparsity discussion above), aᵢⱼ will only be nonzero if the nodes i and j are both in the same triangle. In this way, the sum in (9.21) will only extend to at most two triangles for each combination of i and j.

Inside one triangle, the shape function Ni(x,y), defined for the node i is:

$$N_i(x,y) = \frac{1}{2A}(a_i + b_i x + c_i y); \quad\text{then:}\quad \nabla N_i = \frac{\partial N_i}{\partial x}\hat{x} + \frac{\partial N_i}{\partial y}\hat{y} = \frac{1}{2A}\left(b_i \hat{x} + c_i \hat{y}\right)$$

And then:
$$a_{ij} = \sum_{k=1}^{N_e} \frac{1}{4A_k^2}\,(b_i b_j + c_i c_j) \int_{\Omega_k} dx\, dy = \sum_{k=1}^{N_e} \frac{1}{4A_k}\,(b_i b_j + c_i c_j) \qquad (9.22)$$

where the sum will only have a few terms (for those values of k corresponding to the triangles containing nodes i and j). The values of Aₖ, the area of the triangle, and of bᵢ, bⱼ, cᵢ and cⱼ will be different for each triangle concerned.
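As a sketch (Matlab, building on the hypothetical tricoef function above), the contribution of one triangle to the matrix A follows directly from (9.22):

function Ae = tristiff(x, y)
% 3x3 element matrix: Ae(i,j) = (bi*bj + ci*cj)/(4A), the triangle's
% contribution to the entries aij of (9.22) for its three nodes.
[~, b, c, A] = tricoef(x, y);
Ae = (b*b' + c*c')/(4*A);

With a connectivity table tri(e,:) listing the three node numbers of each triangle, and node coordinates xn, yn (all names illustrative), the assembly of (9.21) is then simply:

A = sparse(Np, Np);                 % Np nodes; the matrix is very sparse
for e = 1:Ne
    n = tri(e,:);                   % nodes of triangle e
    A(n,n) = A(n,n) + tristiff(xn(n), yn(n));
end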
In particular, considering for example the element a₄₇ corresponding to the mesh in the previous figure, the sum extends over the triangles containing nodes 4 and 7; that is, triangles number 5 and 6:

$$a_{47} = \frac{1}{4A_5}\left(b_4^{(5)} b_7^{(5)} + c_4^{(5)} c_7^{(5)}\right) + \frac{1}{4A_6}\left(b_4^{(6)} b_7^{(6)} + c_4^{(6)} c_7^{(6)}\right)$$
In this case, the integral in (9.22) reduces simply to the area of the triangle and the calculations are easy. However, in other cases the integration can be complicated because it has to be done over the triangle, which can be at any position and orientation in the x–y plane.
For example, solving a variational expression like:
$$J = k^2 = \frac{\displaystyle\int (\nabla\phi)^2\, d\Omega}{\displaystyle\int \phi^2\, d\Omega} \qquad\text{(corresponding to (9.3) for 1D problems)} \qquad (9.23)$$
This results in an eigenvalue problem $A\Phi = k^2 B\Phi$ where the matrix B has the elements:
$$b_{ij} = \int N_i N_j \, d\Omega \qquad (9.24)$$
The elements of A are the same as in (9.21).

Exercise 9.3:
Apply the Rayleigh-Ritz procedure to expression (9.23) and show that indeed the matrix
elements aij and bij have the values given in (9.21) and (9.24).

For the case of integrals like those in (9.24):
$$b_{ij} = \int_\Omega N_i N_j \, d\Omega = \sum_{k=1}^{N_e} \int_{\Omega_k} N_i N_j \, d\Omega$$

The sum is over all triangles containing nodes i and j. For example for the element 5,6 in the
previous mesh, these will be triangles 2 and 7 only.
If we take triangle 7, its contribution to this element is the term:
$$b_{56} = \frac{1}{4A_7^2} \int_{\Omega_7} (a_5 + b_5 x + c_5 y)(a_6 + b_6 x + c_6 y)\, dx\, dy$$
or
$$b_{56} = \frac{1}{4A_7^2} \int_{\Omega_7} \left[ a_5 a_6 + (a_5 b_6 + a_6 b_5)x + (a_5 c_6 + a_6 c_5)y + b_5 b_6 x^2 + (b_5 c_6 + b_6 c_5)xy + c_5 c_6 y^2 \right] dx\, dy$$

Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh. These calculations can be cumbersome if attempted directly. However, it is much simpler to use a transformation of coordinates into a local system. This has the advantage that the integrals can be calculated for just one model triangle and then the result converted back to the x and y coordinates. The most common system of local coordinates used for this purpose is the triangle area coordinates.

Triangle Area Coordinates

For a triangle as in the figure, any point inside can be specified by the coordinates x and y or, for example, by the coordinates ξ₁ and ξ₂. These coordinates are defined with value 1 at one node and zero at the opposite side, varying linearly between these limits. For the sake of symmetry, we can define 3 coordinates (ξ₁, ξ₂, ξ₃), when obviously only two are independent. A formal definition of these coordinates can be made in terms of the areas of the triangles shown in the figure.
[Fig. 9.8: triangle with vertices 1 (1,0,0), 2 (0,1,0) and 3 (0,0,1) in area coordinates, showing a point P and the lines ξ₁ = 0, ξ₁ = 1, ξ₂ = 0 and ξ₂ = 1.]

The advantage of using this local coordinate system is that the actual shape of the triangle is not important. Any point inside is defined by its proportional distance to the sides, irrespective of the triangle shape. In this way, calculations can be made in a model triangle using this system and then mapped back to the global coordinates.

If A₁, A₂ and A₃ are the areas of the triangles formed by P and two of the vertices of the triangle, the area coordinates are defined as the ratios:
$$\xi_i = \frac{A_i}{A}, \quad\text{where } A \text{ is the area of the triangle.}$$
Note that the triangle marked with the dotted line in Fig. 9.9 has the same area as A₁ (same base, same height), so the line of constant ξ₁ is the one marked in the figure.
[Fig. 9.9: point P inside the triangle 1–2–3 defining the sub-triangles of areas A₁, A₂ and A₃, and a line of constant ξ₁.]

Since A1 + A2 + A3 = A we have that ξ1 + ξ2 + ξ3 = 1 , which shows their linear


dependence.

The area of each of these triangles, for example A₁, can be calculated using (see Appendix):
$$A_1 = \frac{1}{2}\det\begin{pmatrix} 1 & x & y \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}$$
where x and y are the coordinates of the point P.

Evaluating this determinant gives: $A_1 = \tfrac{1}{2}\left[(x_2 y_3 - x_3 y_2) + (y_2 - y_3)x + (x_3 - x_2)y\right]$, and using the definitions in the Appendix:
$$A_1 = \tfrac{1}{2}(a_1 + b_1 x + c_1 y)$$
Then,
$$\xi_1 = \frac{A_1}{A} = \frac{1}{2A}(a_1 + b_1 x + c_1 y) \quad\text{or}\quad \xi_1 = N_1(x,y) \qquad (9.25)$$

We have then that these coordinates vary in the same form as the shape functions, which is quite convenient for calculations.
Expression (9.25) also gives us the required relationship between the local coordinates (ξ₁, ξ₂, ξ₃) and the global coordinates (x, y); that is, the expression we need to convert (x, y) into (ξ₁, ξ₂, ξ₃). We can also find the inverse relation, that is, (x, y) in terms of (ξ₁, ξ₂, ξ₃). This is all we need to convert from one system of coordinates to the other.

For the inverse relationship, we have from the Appendix:
$$(N_1\;\; N_2\;\; N_3) = (\xi_1\;\; \xi_2\;\; \xi_3) = (1\;\; x\;\; y)\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}$$
from where:
$$(1\;\; x\;\; y) = (\xi_1\;\; \xi_2\;\; \xi_3)\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}$$
and expanding:
$$1 = \xi_1 + \xi_2 + \xi_3$$
$$x = x_1\xi_1 + x_2\xi_2 + x_3\xi_3 \qquad (9.26)$$
$$y = y_1\xi_1 + y_2\xi_2 + y_3\xi_3$$

Equations (9.25) and (9.26) will allow us to change from one system to the other.
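A short sketch of both conversions (Matlab, again using the hypothetical tricoef helper above; xv, yv hold the vertex coordinates of the triangle):

function xi = xy2area(xv, yv, x, y)
% (x,y) -> area coordinates, using xi_i = Ni(x,y) = (ai + bi*x + ci*y)/(2A), eq. (9.25)
[a, b, c, A] = tricoef(xv, yv);
xi = (a + b*x + c*y)/(2*A);        % column vector (xi1; xi2; xi3)

and, for the inverse, (9.26) is just a weighted sum of the vertex coordinates:

function [x, y] = area2xy(xv, yv, xi)
% area coordinates -> (x,y), using x = sum(xi_i*x_i), y = sum(xi_i*y_i), eq. (9.26)
x = sum(xi(:).*xv(:));  y = sum(xi(:).*yv(:));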

Finally, the evaluation of integrals can be made now in terms of the local coordinates in the usual way:
$$\int_{\Omega_k} f(x,y)\, dx\, dy = \int_{\Omega_k} f(\xi_1, \xi_2, \xi_3)\, |J|\, d\xi_1 d\xi_2$$
where J is the Jacobian of the transformation, $\partial(x,y)/\partial(\xi_1,\xi_2)$. In this case this is simply 2A, so the expression we need to use to transform integrals is:
$$\int_{\Omega_k} f(x,y)\, dx\, dy = 2A \int_{\Omega_k} f(\xi_1, \xi_2, \xi_3)\, d\xi_1 d\xi_2 \qquad (9.27)$$

Example:
The integral (9.24) of the previous exercise is difficult to calculate in terms of x and y for an arbitrary triangle, and needs to be calculated separately for each pair of nodes in each triangle of the mesh. Transforming to (local) area coordinates, this is much simpler:
$$\int_{\Omega_k} N_i N_j\, dx\, dy = 2A \int_{\Omega_k} \xi_i \xi_j \, d\xi_1\, d\xi_2 \qquad (9.28)$$

To determine the limits of integration, we can see that the total area of the triangle can be covered by ribbons like that shown in Fig. 9.10, with a width dξ₁ and length ξ₂. Now, we can note that at the left end of this ribbon ξ₃ = 0, so ξ₂ = 1 – ξ₁. Moving the ribbon from the side 2–3 of the triangle to the vertex 1 (ξ₁ changing from 0 to 1) will cover the complete triangle, so the limits of integration must be:
$$\xi_2: \; 0 \rightarrow 1 - \xi_1 \qquad\text{and}\qquad \xi_1: \; 0 \rightarrow 1$$
[Fig. 9.10: the model triangle with vertices 1, 2 and 3, showing a ribbon of width dξ₁ parallel to the side 2–3 (ξ₁ = 0), sweeping from that side to the vertex 1 (ξ₁ = 1).]
So the integral of (9.28) results in:
$$\int_{\Omega_k} N_i N_j\, dx\, dy = 2A \int_0^1 \int_0^{1-\xi_1} \xi_i \xi_j \, d\xi_2\, d\xi_1 \qquad (9.29)$$

Note that the above conclusion about integration limits is valid for any integral over the
triangle, not only the one used above.

We can now calculate these integrals. Taking first the case where i ≠ j:

a) choosing i = 1 and j = 2 (this is an arbitrary choice – you can check that any other choice, e.g. 1,3, will give the same result):
$$I_{ij} = 2A \int_0^1 \xi_1 \int_0^{1-\xi_1} \xi_2 \, d\xi_2\, d\xi_1 = A \int_0^1 \xi_1 (1-\xi_1)^2 \, d\xi_1 = A\left(\frac{1}{2} - \frac{2}{3} + \frac{1}{4}\right) = \frac{A}{12}$$
b) For i = j and choosing i = 1:
$$I_{ii} = 2A \int_0^1 \xi_1^2 \int_0^{1-\xi_1} d\xi_2\, d\xi_1 = 2A \int_0^1 \xi_1^2 (1-\xi_1) \, d\xi_1 = 2A\left(\frac{1}{3} - \frac{1}{4}\right) = \frac{A}{6}$$

Once calculated in this form, the result can be used for any triangle irrespective of the
shape and position. We can see that for this integral, only the area A will change when applied
to different triangles.
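Combining results a) and b), the full matrix of integrals of NᵢNⱼ over one triangle (the element contribution to the matrix B of (9.24)) can be written down once and for all; as a sketch in Matlab:

function Be = trimass(A)
% Element matrix of the integrals of Ni*Nj over a triangle of area A:
% diagonal entries A/6 (case b above), off-diagonal entries A/12 (case a).
Be = A/12*[2 1 1; 1 2 1; 1 1 2];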

Higher Order Shape Functions

In most cases involving second order differential equations, the resultant weighted residuals expression or the corresponding variational expression can be written without involving second order derivatives. In these cases, first order shape functions will be fine. However, if this is not possible, other types of elements/shape functions must be used. (Note that second order derivatives of a first order function will be zero everywhere.) We can also choose to use different shape functions even if we could use first order polynomials, for example, to get higher accuracy with fewer elements.

Second Order Shape Functions:


To define a first order polynomial in x and y (planar shape functions) we needed 3 degrees of freedom (the nodal values of u define the approximation completely). If we use second order shape functions, we need to fit this type of surface over each triangle. To do this uniquely, we need six degrees of freedom: either the 3 nodal values and the derivatives of u at those points, or the values at six points on the triangle. (This is analogous to trying to fit a second order curve to an interval – while a straight line is completely defined by 2 points, a second order curve will need 3.) Choosing the easier option, we form triangles with 6 points: the 3 vertices and 3 midside points:

[Fig. 9.11: a triangle with 6 nodes given by their area coordinates: vertices 1 (1 0 0), 2 (0 1 0), 3 (0 0 1), and mid-side nodes 4 (½ ½ 0), 5 (0 ½ ½) and 6 (½ 0 ½).]
The figure shows a triangle with 6 points identified by their triangle coordinates. We need to specify now the shape functions. If we take first the function N₁(ξ₁, ξ₂, ξ₃), corresponding to node 1, we know that it should be 1 there and zero at every other node. That is, it must be 0 at ξ₁ = 0 (nodes 2, 3 and 5) and also at ξ₁ = 1/2 (nodes 4 and 6). Then, we can simply write:
$$N_1 = \xi_1(2\xi_1 - 1) \qquad (9.30)$$

In the same form we can write the corresponding shape functions for the other vertices
(nodes 2 and 3). For the mid-side nodes, for example node 4, we have that N4 should be zero at
all nodes except 4 where its value is 1. We can see that all other nodes are either on the side 2–3
(where ξ1 = 0) or on the side 3–1 (where ξ2 = 0). So the function N4 should be:

N4 = 4ξ1ξ2 (9.31)

The following figure shows the two types of second order shape functions, one for vertices and one for mid-side points.
[Fig. 9.12 Shape function N₃ (vertex node).  Fig. 9.13 Shape function N₄ (mid-side node).]
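As a sketch (Matlab), all six second-order shape functions can be evaluated at a point given by its area coordinates; the vertex functions follow the pattern of (9.30) and the mid-side ones that of (9.31), with the node ordering of Fig. 9.11:

function N = shape2(xi1, xi2, xi3)
% Second-order shape functions for the node ordering of Fig. 9.11:
% vertices 1, 2, 3 and mid-side nodes 4 (between 1-2), 5 (between 2-3), 6 (between 3-1).
N = [xi1*(2*xi1-1);     % node 1, eq. (9.30)
     xi2*(2*xi2-1);     % node 2
     xi3*(2*xi3-1);     % node 3
     4*xi1*xi2;         % node 4, eq. (9.31)
     4*xi2*xi3;         % node 5
     4*xi3*xi1];        % node 6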

Exercise 9.4:
For the mesh of second order triangles of the figure, find the corresponding sparsity
pattern.
[Fig. 9.14: a small mesh of second-order triangles with 12 numbered nodes (vertex and mid-side nodes).]

Exercise 9.5:
Use a similar reasoning as that used to define the second order shape functions (9.30)-
(9.31), to find the third order shape functions in terms of triangle area coordinates.
Note that in this case there are 3 different types of functions.

[Fig. 9.15: a third-order triangle with 10 nodes given by their area coordinates: the vertices 1 (1 0 0), 2 (0 1 0) and 3 (0 0 1), two equally spaced nodes along each side (at coordinate values 1/3 and 2/3), and one interior node 10 at (⅓ ⅓ ⅓).]

APPENDIX

1. Taylor theorem
For a continuous function we have $\int_a^x f'(t)\, dt = f(x) - f(a)$; then we can write:
$$f(x) = f(a) + \int_a^x f'(t)\, dt \quad\text{or}\quad f(x) = f(a) + R_0(x) \qquad (A1.1)$$
where the remainder is $R_0(x) = \int_a^x f'(t)\, dt$.
We can now integrate $R_0$ by parts using $u = f'(t)$, $du = f^{(2)}(t)\, dt$, $dv = dt$, $v = -(x-t)$, giving:
$$R_0(x) = \int_a^x f'(t)\, dt = \left[-(x-t)f'(t)\right]_a^x + \int_a^x (x-t) f^{(2)}(t)\, dt$$
which gives, after solving and substituting in (A1.1):
$$f(x) = f(a) + (x-a)f'(a) + \int_a^x (x-t) f^{(2)}(t)\, dt \quad\text{or}\quad f(x) = f(a) + f'(a)(x-a) + R_1(x) \qquad (A1.2)$$
We can also integrate $R_1$ by parts, using this time $u = f^{(2)}(t)$, $du = f^{(3)}(t)\, dt$, $dv = (x-t)\, dt$, $v = -\dfrac{(x-t)^2}{2}$, which gives:
$$R_1(x) = \int_a^x (x-t) f^{(2)}(t)\, dt = \left[-\frac{(x-t)^2}{2} f^{(2)}(t)\right]_a^x + \int_a^x \frac{(x-t)^2}{2} f^{(3)}(t)\, dt$$
and again, after substituting in (A1.2), gives:
$$f(x) = f(a) + f'(a)(x-a) + \frac{f^{(2)}(a)}{2}(x-a)^2 + \int_a^x \frac{(x-t)^2}{2} f^{(3)}(t)\, dt$$

Proceeding in this way we get the expansion:
$$f(x) = f(a) + f'(a)(x-a) + \frac{f^{(2)}(a)}{2!}(x-a)^2 + \frac{f^{(3)}(a)}{3!}(x-a)^3 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n + R_n \qquad (A1.3)$$
where the remainder can be written as:
$$R_n(x) = \int_a^x \frac{(x-t)^n}{n!}\, f^{(n+1)}(t)\, dt \qquad (A1.4)$$
To find a more useful form for the remainder we need to invoke some general mathematical theorems:

First Theorem for the Mean Value of Integrals


If the function $g(t)$ is continuous and integrable in the interval $[a, x]$, then there exists a point $\xi$ between $a$ and $x$ such that:
$$\int_a^x g(t)\, dt = g(\xi)(x-a).$$
This says simply that the integral can be represented by an average value of the function g (ξ )
times the length of the interval. Because this average value must be between the minimum and
the maximum values and the function is continuous, there will be a point ξ for which the
function has this value.
And in a more complex form:

Second Theorem for the Mean Value of Integrals


Now, if the functions $g$ and $h$ are continuous and integrable in the interval and $h$ does not change sign in the interval, there exists a point $\xi$ between $a$ and $x$ such that:
$$\int_a^x g(t)h(t)\, dt = g(\xi)\int_a^x h(t)\, dt$$
If we now use this second theorem on expression (A1.4), with $g(t) = f^{(n+1)}(t)$ and $h(t) = \dfrac{(x-t)^n}{n!}$, we get: $R_n(x) = f^{(n+1)}(\xi)\displaystyle\int_a^x \frac{(x-t)^n}{n!}\, dt$, which can be integrated, giving:
$$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x-a)^{n+1} \qquad (A1.5)$$
for the remainder of the Taylor expansion, where ξ is a point between a and x.
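As an illustrative numerical check of (A1.3)–(A1.5) (a sketch in Matlab), take for example f(x) = eˣ expanded about a = 0; all the derivatives equal 1 at a, and the actual truncation error stays below the bound given by (A1.5):

a = 0; x = 0.5; n = 3;
partial = sum((x-a).^(0:n)./factorial(0:n));   % Taylor sum (A1.3): all f^(k)(0) = 1
actual_error = exp(x) - partial;
bound = exp(x)/factorial(n+1)*(x-a)^(n+1);     % (A1.5) with the maximum of f^(n+1) on [a,x]
[actual_error bound]                           % the error does not exceed the bound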

2. Implementation of Gaussian Elimination


Referring to the steps (24) - (26) to eliminate the non-zero entries in the lower triangle of the matrix, we can see that (in step (25)) to eliminate the unknown x₁ from equation i (i = 2 – N) – that is, to eliminate the matrix elements aᵢ₁ – we need to subtract from eqn. i equation 1 times aᵢ₁/a₁₁. With this, all the elements of column 1 except a₁₁ will be zero. After that, to remove x₂ from equations 3 to N (or elements aᵢ₂, i = 3 – N as in (26)), we need to scale the new equation 2 by the factor aᵢ₂/a₂₂ and subtract it from row i, i = 3 – N. The pattern now repeats in the same form, so the steps for triangularizing matrix A (keeping just the upper triangle) can be implemented by the following code (in Matlab):

for k=1:n-1
v(k+1:n)=A(k+1:n,k)/A(k,k); % find multipliers
for i=k+1:n
A(i,k+1:n)=A(i,k+1:n)-v(i)*A(k,k+1:n);
end
end

In fact, this can be simplified eliminating the second loop, by noting that all operations on
rows can be performed simultaneously. The simpler version of the code is then:

for k=1:n-1
v(k+1:n)=A(k+1:n,k)/A(k,k); % find multipliers
A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-v(k+1:n)*A(k,k+1:n);
end
U=triu(A); % triu() keeps the upper triangle and puts zeros in the lower triangle

The factorization is completed by calculating the lower triangular matrix L. The complete procedure can be implemented as follows:

function [L,U] = GE(A)


%
% Computes LU factorization of matrix A
% input: matrix A
% output: matrices L and U
%
[n,n]=size(A);
for k=1:n-1
A(k+1:n,k)=A(k+1:n,k)/A(k,k); % find multipliers
A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-A(k+1:n,k)*A(k,k+1:n);
end
L=eye(n,n) + tril(A,-1);
U=triu(A);

This function can be used to find the LU factors of a matrix A using dense storage. The
function eye(n,n) returns the identity matrix of order n and tril(A,-1) gives a lower
triangular matrix with the elements of A in the lower triangle, excluding the diagonal, and
setting all others to zero (lij = aij if j ≤ i–1, and 0 otherwise).


The solution of a linear system of equations Ax = b is completed (after the LU
factorization of the matrix A is performed) with a forward substitution using matrix L followed
by backward substitution using the matrix U. Forward substitution consists of finding the first unknown x₁ directly as b₁/l₁₁, substituting this value in the second equation, finding x₂, and so on.
This can be implemented by:

function x = LTriSol(L,b)
%
% Solves the triangular system Lx = b by forward substitution
%
n=length(b);
x=zeros(n,1); % a vector of zeros to start
for j=1:n-1
x(j)=b(j)/L(j,j);
b(j+1:n)=b(j+1:n)-x(j)*L(j+1:n,j);
end
x(n)=b(n)/L(n,n);

Backward substitution can be implemented in a similar form, this time the unknowns are
found from the end upwards:

function x = UTriSol(U,b)
%
% Solves the triangular system Ux = b by backward substitution
%
n=length(b);
x=zeros(n,1);
for j=n:-1:2 % from n to 2 one by one
x(j)=b(j)/U(j,j);
b(1:j-1)=b(1:j-1)-x(j)*U(1:j-1,j);
end
x(1)=b(1)/U(1,1);

With these functions the solution of the system of equations Ax = b can be performed in
three steps by the code:
[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);
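A quick check of these functions (a sketch), comparing with Matlab's built-in solver on a small random system (the matrix is made diagonally dominant so that no pivoting is needed):

n = 6;
A = rand(n) + n*eye(n);       % diagonally dominant test matrix
b = rand(n,1);
[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);
norm(A*x - b)                 % should be of the order of machine precision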

Exercise
Use the functions GE, LTriSol and UTriSol to solve the system of equations
generated by the finite difference modelling of the square coaxial structure, given by equation
(7.14). You will have to complete first the matrix A given only schematically in (7.14), after
applying the boundary conditions. Note that because of the geometry of the structure, not all
rows will have the same pattern.
To input the matrix it might be useful to start with the matlab command:

A = triu(tril(ones(n,n),1),-1)-5*eye(n,n)

This will generate a tridiagonal matrix of order n with –4 on the main diagonal and 1 on the first sub- and superdiagonals. After that you will have to adjust any differences between this matrix and A.
Compare the results with those obtained by Gauss-Seidel.

3. Solution of Finite Difference Equations by Gauss-Seidel


Equation (7.14) is the matrix equation derived by applying the finite differences method to the Laplace equation. Solving this system of equations by the method of Gauss-Seidel, as shown in (5.23) and (5.24), consists of taking the original simultaneous equations, putting all diagonal matrix terms on the left-hand-side as in (5.24) and effectively putting them in a DO loop (after initializing the ‘starting vector’).

Equation (7.9) is the typical equation, and putting the diagonal term on the left-hand-side gives:
$$\phi_O = \tfrac{1}{4}\left(\phi_N + \phi_S + \phi_E + \phi_W\right)$$

We could write 56 lines of code (one per equation) or, even simpler, use subscripted variables inside a loop. In this case the elements of A are all either 0, +1 or –4, and are easily “generated during the algorithm”, rather than actually stored in an array. This simplifies the computer program, and instead of A, the only array needed holds the current value of the vector elements:
$$\mathbf{x}^{\mathrm T} = (\phi_1, \phi_2, \ldots, \phi_{56}).$$

The program can be simplified further by keeping the vector of 56 unknowns x in a 2-D array z(11,11) to be identified spatially with the 2-D Cartesian coordinates of the physical problem (see figure). For example z(3,2) stores the value of φ₁₀. There is obviously scope to improve efficiency since with this arrangement we store values corresponding to nodes with known, fixed voltage values, including all those nodes inside the inner conductor. None of them actually need to be stored, but doing so makes the program simpler. In this case, the program (in old Fortran 77) can be as simple as:

c Gauss-Seidel to solve Laplace between square inner & outer


c conductors. Note that this applies to any potential problem,
c including potential (voltage) distribution or steady state
c current flow or heat flow.
c
dimension z(11,11)
data z/121*0./
do i=4,8
do j=4,8
z(i,j)=1.
enddo
enddo
c
do n=1,30
do i=2,10
do j=2,10
if(z(i,j).lt.1.) z(i,j)=.25*(z(i-1,j)+z(i+1,j)+
$ z(i,j-1)+z(i,j+1))
enddo
enddo
c in each iteration print values in the 3rd row
page A7 E763 (part 2) Numerical Methods Appendix

write(*,6) n,(z(3,j),j=1,11)
enddo
c
write(*,7) z
6 format(1x,i2,11f7.4)
7 format('FINAL RESULTS=',//(/1x,11f7.4))
stop
end

In the program, first z(i,j) is initialized with zeros, then the values corresponding to the
inner conductor are set to one (1 V). After this, the iterations start (to a maximum of 30) and the
Gauss-Seidel equations are solved.
In order to check the convergence, the values of the potentials in one intermediate row (the third) are printed after every iteration. We can see that after 19 iterations there are no more changes (within 4 decimals). Naturally, a more efficient monitoring of convergence can be implemented, whereby the changes are monitored, either on a point-by-point basis or as the norm of the difference, and the iterations are stopped when this value is within a prefixed precision.
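A sketch of such a stopping test, written here in Matlab for the same grid z(11,11) as the Fortran program above (tol is an illustrative tolerance; the norm of the change between successive iterations is monitored):

tol = 1e-4;
for n = 1:100
    zold = z;                              % grid of potentials from the previous iteration
    for i = 2:10
        for j = 2:10
            if z(i,j) < 1                  % skip the nodes fixed at 1 V (inner conductor)
                z(i,j) = 0.25*(z(i-1,j)+z(i+1,j)+z(i,j-1)+z(i,j+1));
            end
        end
    end
    if norm(z - zold,'fro') < tol, break, end   % stop when the change is small enough
end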
The results are:

1 0.0000 0.0000 0.0000 0.2500 0.3125 0.3281 0.3320 0.3330 0.0833 0.0208 0.0000
2 0.0000 0.0000 0.1250 0.3750 0.4492 0.4717 0.4785 0.4181 0.1895 0.0699 0.0000
3 0.0000 0.0469 0.2109 0.4473 0.5225 0.5472 0.5399 0.4737 0.2579 0.1098 0.0000
4 0.0000 0.0908 0.2690 0.4922 0.5653 0.5865 0.5742 0.5082 0.3016 0.1369 0.0000
5 0.0000 0.1229 0.3076 0.5205 0.5902 0.6084 0.5944 0.5299 0.3293 0.1542 0.0000
6 0.0000 0.1446 0.3326 0.5381 0.6047 0.6211 0.6067 0.5435 0.3465 0.1650 0.0000
7 0.0000 0.1586 0.3484 0.5488 0.6133 0.6288 0.6143 0.5519 0.3572 0.1716 0.0000
8 0.0000 0.1675 0.3582 0.5553 0.6184 0.6334 0.6190 0.5572 0.3637 0.1756 0.0000
9 0.0000 0.1729 0.3641 0.5592 0.6215 0.6362 0.6218 0.5604 0.3676 0.1780 0.0000
10 0.0000 0.1763 0.3678 0.5616 0.6234 0.6379 0.6236 0.5624 0.3700 0.1794 0.0000
11 0.0000 0.1783 0.3700 0.5631 0.6245 0.6390 0.6247 0.5636 0.3713 0.1802 0.0000
12 0.0000 0.1795 0.3713 0.5639 0.6253 0.6396 0.6254 0.5643 0.3722 0.1807 0.0000
13 0.0000 0.1802 0.3721 0.5645 0.6257 0.6400 0.6258 0.5647 0.3727 0.1810 0.0000
14 0.0000 0.1807 0.3726 0.5648 0.6259 0.6403 0.6260 0.5649 0.3729 0.1812 0.0000
15 0.0000 0.1810 0.3729 0.5650 0.6261 0.6404 0.6261 0.5651 0.3731 0.1813 0.0000
16 0.0000 0.1811 0.3731 0.5651 0.6262 0.6405 0.6262 0.5652 0.3732 0.1813 0.0000
17 0.0000 0.1812 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1813 0.0000
18 0.0000 0.1813 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1814 0.0000
19 0.0000 0.1813 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
20 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
21 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
22 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
23 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
24 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
25 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
26 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
27 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
28 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
29 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
30 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
FINAL RESULTS=
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.3099 0.6406 1.0000 1.0000 1.0000 1.0000 1.0000 0.6406 0.3099 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

4. A variational formulation for the wave equation


Equation (8.20) in the example on the use of the variational method shows a variational
formulation for the one-dimensional wave equation. In the following we will prove that indeed
both problems are equivalent.
Equation (8.20) stated:
$$k^2 = \text{s.v.}\; \frac{\displaystyle\int_a^b \left(\frac{dy}{dx}\right)^2 dx}{\displaystyle\int_a^b y^2\, dx} \qquad (A4.1)$$

Let’s examine a variation of k2 produced by a small perturbation δy on the solution y:


$$k^2(y+\delta y) = k^2 + \delta k^2 = \frac{\displaystyle\int_a^b \left(\frac{d(y+\delta y)}{dx}\right)^2 dx}{\displaystyle\int_a^b (y+\delta y)^2\, dx}$$
And re-writing:
$$\left(k^2 + \delta k^2\right)\int_a^b (y+\delta y)^2\, dx = \int_a^b \left(\frac{d(y+\delta y)}{dx}\right)^2 dx \qquad (A4.2)$$

Expanding and neglecting higher order variations:
$$\left(k^2 + \delta k^2\right)\int_a^b (y^2 + 2y\,\delta y)\, dx = \int_a^b \left[\left(\frac{dy}{dx}\right)^2 + 2\frac{dy}{dx}\frac{d\delta y}{dx}\right] dx$$
or
$$k^2\int_a^b y^2\, dx + 2k^2\int_a^b y\,\delta y\, dx + \delta k^2\int_a^b y^2\, dx = \int_a^b \left(\frac{dy}{dx}\right)^2 dx + 2\int_a^b \frac{dy}{dx}\frac{d\delta y}{dx}\, dx \qquad (A4.3)$$

and now using (A4.1):
$$2k^2\int_a^b y\,\delta y\, dx + \delta k^2\int_a^b y^2\, dx = 2\int_a^b \frac{dy}{dx}\frac{d\delta y}{dx}\, dx \qquad (A4.4)$$

Now, since we want k² to be stationary about the solution function y, we make δk² = 0, and we examine what conditions this imposes on the function y:
$$k^2\int_a^b y\,\delta y\, dx = \int_a^b \frac{dy}{dx}\frac{d\delta y}{dx}\, dx \qquad (A4.5)$$
Integrating the RHS by parts:
$$k^2\int_a^b y\,\delta y\, dx = \left[\frac{dy}{dx}\,\delta y\right]_a^b - \int_a^b \delta y\, \frac{d^2 y}{dx^2}\, dx$$

Or, re-arranging:
$$\int_a^b \delta y\left(\frac{d^2 y}{dx^2} + k^2 y\right) dx = \left[\delta y\, \frac{dy}{dx}\right]_a^b \qquad (A4.6)$$

Since δy is arbitrary, (A4.6) can only be valid if both sides are zero. That means that y should
satisfy the differential equation:
$$\frac{d^2 y}{dx^2} + k^2 y = 0 \qquad (A4.7)$$
and any of the boundary conditions:
$$\frac{dy}{dx} = 0 \;\text{ at } a \text{ and } b \qquad\text{or}\qquad \delta y = 0 \;\text{ at } a \text{ and } b \;\text{ (fixed values of } y \text{ at the ends).} \qquad (A4.8)$$

Summarizing, we can see that imposing the condition of stationarity of (A4.1) with respect
to small variations of the function y leads to y satisfying the differential equation (A4.7), which
is the wave equation, and any of the boundary conditions (A4.8); that is, either fixed values of y
at the ends (Dirichlet B.C.), or zero normal derivative (Neumann B.C.).

5. Area of a Triangle
For a triangle with nodes 1, 2 and 3 with coordinates (x1, y1), (x2, y2) and (x3, y3):

[Figure: the triangle with vertices 1, 2 and 3 enclosed in a rectangle aligned with the axes; A, B and C are the three corner triangles between the sides of the triangle and the rectangle.]

The area of the triangle is: A = Area of rectangle – area(A) – area(B) – area(C)

Area of rectangle = $(x_2 - x_1)(y_3 - y_2) = (x_2 y_3 + x_1 y_2 - x_1 y_3 - x_2 y_2)$
Area (A) = $\tfrac{1}{2}(x_2 - x_3)(y_3 - y_2) = \tfrac{1}{2}(x_2 y_3 + x_3 y_2 - x_2 y_2 - x_3 y_3)$
Area (B) = $\tfrac{1}{2}(x_3 - x_1)(y_3 - y_1) = \tfrac{1}{2}(x_3 y_3 - x_3 y_1 - x_1 y_3 + x_1 y_1)$
Area (C) = $\tfrac{1}{2}(x_2 - x_1)(y_1 - y_2) = \tfrac{1}{2}(x_2 y_1 - x_2 y_2 - x_1 y_1 + x_1 y_2)$

Then, the area of the triangle is:
$$A = \tfrac{1}{2}\left[(x_2 y_3 - x_3 y_2) + (x_3 y_1 - x_1 y_3) + (x_1 y_2 - x_2 y_1)\right] \qquad (A5.1)$$
which can be written as:
$$A = \frac{1}{2}\det\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix} \qquad (A5.2)$$

6. Shape Functions (Interpolation Functions)


The function u is approximated in each triangle by a first order function (a plane). This will be given by an expression of the form p + qx + ry, which can also be written as a vector product (equation 9.14):
$$\tilde u(x,y) = p + qx + ry = (1 \;\; x \;\; y)\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (A6.1)$$

Evaluating this expression at each node of a triangle (with nodes numbered 1, 2 and 3):
$$\left.\begin{aligned} u_1 &= p + qx_1 + ry_1 \\ u_2 &= p + qx_2 + ry_2 \\ u_3 &= p + qx_3 + ry_3 \end{aligned}\right\} \;\Rightarrow\; \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (A6.2)$$
And from here we can calculate the value of the constants p, q and r in terms of the nodal values and the coordinates of the nodes:
$$\begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A6.3)$$

Replacing (A6.3) in (A6.1):
$$\tilde u(x,y) = (1 \;\; x \;\; y)\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A6.4)$$
and comparing with (9.15), written as a vector product:
$$u(x,y) \approx \tilde u(x,y) = u_1 N_1(x,y) + u_2 N_2(x,y) + u_3 N_3(x,y) = (N_1 \;\; N_2 \;\; N_3)\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A6.5)$$
we have finally:
$$(N_1 \;\; N_2 \;\; N_3) = (1 \;\; x \;\; y)\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1} \qquad (A6.6)$$

Solving the right hand side (inverting the matrix and multiplying) gives the expression for each shape function (9.16):
$$N_i(x,y) = \frac{1}{2A}(a_i + b_i x + c_i y) \qquad (A6.7)$$
where A is the area of the triangle and



a1 = x2y3 – x3y2 b 1 = y2 – y3 c 1 = x3 – x 2
a2 = x3y1 – x1y3 b 2 = y3 – y1 c 2 = x1 – x 3 (A6.8)
a3 = x1y2 – x2y1 b 3 = y1 – y2 c 3 = x2 – x 1

Note that from (A5.1), the area of the triangle can be written as:
$$A = \tfrac{1}{2}(a_1 + a_2 + a_3) \qquad (A6.9)$$
