Numerical Methods
1. Introduction
The second one could be due, for example, to idealization of the geometry of the problem or of the properties of the materials involved, uncertainty in the values of material parameters, etc. This type of error concerns the mathematical model, and the only way to reduce it is to improve the mathematical description of the physical problem. It does not concern the numerical calculation. It is, though, an important part of the more general problem of “numerical or computer modelling”.
A third source of error is rounding: numbers cannot, in general, be stored with complete accuracy. Most numbers have an infinite decimal representation, which must be rounded. But even if the data in a problem can initially be expressed exactly by a finite decimal representation, divisions may introduce numbers that must be rounded, and multiplications will introduce more digits. This type of error has a random character that makes it difficult to deal with.
So far the error we have been discussing is the absolute error, defined by: error = true value − approximation. A problem with this definition is that it doesn't take into account the magnitude of the value being measured; for example, an absolute error of 1 cm has a very different significance in the length of a 100 m bridge and in that of a 10 cm bolt. Another definition that reflects this significance is the relative error, defined as: relative error = absolute error/true value. In the previous case, the bridge and the bolt have relative errors of 10^−4 and 0.1 respectively; or, in percent, 0.01% and 10%.
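A quick numerical check of the bridge/bolt example (a Python sketch for illustration; the notes themselves use Matlab):

```python
def abs_and_rel_error(true_value, approximation):
    """Return (absolute error, relative error) as defined in the text."""
    abs_err = true_value - approximation
    return abs_err, abs_err / true_value

# An absolute error of 1 cm (0.01 m) in a 100 m bridge and in a 10 cm bolt:
_, rel_bridge = abs_and_rel_error(100.0, 100.0 - 0.01)   # relative error 1e-4, i.e. 0.01%
_, rel_bolt   = abs_and_rel_error(0.10, 0.10 - 0.01)     # relative error 0.1,  i.e. 10%
```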
which can only represent numbers in the range: $10^{-100} \le |y| < 10^{99}$.
This is a normalized form, where the mantissa $m$ is defined such that: $0.1 \le |m| < 1$.
Now, if A is the set of numbers exactly representable in a given machine, the question
arises of how to represent a number x not belonging to A ( x ∉ A ).
This is encountered not only when reading data into a computer, but also when
page 3 E763 (part 2) Numerical Methods
representing intermediate results in the computer during a calculation. Results of the elementary
arithmetic operations between two numbers need not belong to A. Let’s see first how a number
is represented (truncated) in a machine.
A machine representation can in most cases be obtained by rounding:
x ⟶ fl(x)
Here and from now on, fl(x) will represent the truncated form of x (that is, with a truncated mantissa and limited exponent), and not just its representation in normalized floating-point format.
For example, in a computer with t = 4 digit representation for the mantissa:
fl(π) = 0.3142E 1
fl(0.142853) = 0.1429E 0
fl(14.28437) = 0.1428E 2
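The truncation fl(x) can be sketched as follows (a Python sketch; the helper simply rounds to t significant decimal digits, which reproduces the three examples above):

```python
import math

def fl(x, t=4):
    """Round x to a t-digit mantissa, e.g. fl(pi) = 0.3142E1 for t = 4."""
    if x == 0:
        return 0.0
    # locate the leading digit, then round to keep t digits of mantissa
    return round(x, (t - 1) - int(math.floor(math.log10(abs(x)))))

print(fl(math.pi))      # 3.142  (0.3142E1)
print(fl(0.142853))     # 0.1429 (0.1429E0)
print(fl(14.28437))     # 14.28  (0.1428E2)
```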
Then, we form:
$$a' = \begin{cases} 0.\alpha_1\alpha_2\cdots\alpha_t & \text{if } 0 \le \alpha_{t+1} \le 4 \\ 0.\alpha_1\alpha_2\cdots\alpha_t + 10^{-t} & \text{if } \alpha_{t+1} \ge 5 \end{cases} \qquad (2.2)$$
That is, only t digits are kept in the mantissa and the last one is rounded: α_t is incremented by 1 if the next digit α_{t+1} ≥ 5, and all digits after α_t are deleted.
The exponent can vary from –1 to 2, so it can only take 4 possible values: –1, 0, 1 and 2.
Then, including now zero and negative numbers, this system can represent exactly only
2×90×4 + 1 = 721 numbers: The set of floating-point numbers is finite.
The smallest positive number in the system is 0.10×10–1 = 0.01
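The count of 721 machine numbers can be verified by enumerating the toy system (a Python sketch; exact fractions are used so that no two values collapse by accident):

```python
from fractions import Fraction

# Enumerate the toy system: 2-digit mantissa 0.10..0.99, exponent -1..2,
# both signs, plus zero.
values = {Fraction(0)}
for m in range(10, 100):            # mantissa digits: 0.10 ... 0.99
    for b in range(-1, 3):          # exponent: -1, 0, 1, 2
        v = Fraction(m, 100) * Fraction(10) ** b
        values.add(v)
        values.add(-v)
print(len(values))   # 721
```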
$$\frac{|fl(x) - x|}{|x|} \le 5 \times 10^{-t} = \text{eps} \qquad (2.4)$$
Demonstration:
From (2.1) and (2.2), the normalized decimal representation of x and its truncated floating-
point form, we have that the maximum possible difference between the two forms is 5 at the
decimal position t+1, that is:
$$|fl(x) - x| \le 5 \times 10^{-(t+1)} \times 10^{b}$$
also, since $|x| \ge 0.1 \times 10^{b}$, that is, $1/|x| \le 10 \times 10^{-b}$, we obtain the condition (2.4).
where ε ≤ eps, for all numbers x. The quantity (1 + ε) in (2.5) cannot be distinguished from 1 in this machine representation, and the maximum value of ε is eps. So, we can also define the machine precision eps as the smallest positive machine number g for which fl(1 + g) > 1.
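This last definition can be tested directly on the machine (a Python sketch; it finds the smallest power of two g for which 1 + g is still distinguishable from 1 in IEEE double precision):

```python
import sys

# Halve eps until 1 + eps/2 is no longer distinguishable from 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)                             # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)   # True
```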
If x and y are not floating-point numbers (machine numbers) they will have to be converted
first giving:
x + y ⟶ fl(x + y) = fl(fl(x) + fl(y))
and similarly for the rest.
Let's examine, for example, the subtraction of two such numbers, z = x − y, ignoring higher order error terms:
We can see that if x approaches y the relative error can blow up, especially for large values of x and y. The maximal error bounds are pessimistic, and in practical calculations errors tend to cancel partially. For example, in adding 20000 numbers rounded to, say, 4 decimal places, the maximum error would be 0.5×10^−4 × 20000 = 1 (imagining the maximum absolute truncation of 0.00005 in every case), while it is extremely improbable that this case occurs. From a statistical point of view, one can expect that in about 90% of the cases the error will not exceed 0.005.
Example
Let's compute the difference between a = 1200 and b = 1194 using a floating-point system with a 3-digit mantissa:
fl(a) − fl(b) = 0.120E4 − 0.119E4 = 0.001E4 = 10
where the correct value is 6, giving a relative error of 0.667 (or 66.7%).
The machine precision for this system is eps = 5×10^−t = 5×10^−3, and the error bound above gives a limit for the relative error of:
$$\text{eps}\left(1 + \frac{a+b}{a-b}\right) = 2.0$$
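The cancellation in this example can be reproduced with a small rounding helper (a Python sketch, rounding to a 3-digit mantissa as in the text):

```python
import math

def fl(x, t=3):
    """Round x to a t-digit mantissa."""
    if x == 0:
        return 0.0
    return round(x, (t - 1) - int(math.floor(math.log10(abs(x)))))

a, b = 1200.0, 1194.0
diff = fl(fl(a) - fl(b))    # fl(1200) = 1200, fl(1194) = 1190  ->  diff = 10
rel_error = abs(diff - (a - b)) / (a - b)
print(diff, rel_error)      # 10.0 0.6666666666666666
```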
Example
Assume that we want to calculate the sum of three floating-point numbers: a, b, c.
This has to be done in sequence, that is, using any of the next two algorithms:
i) (a + b) + c or
ii) a + (b + c)
If the numbers are in floating-point format with t = 8 decimals and their values are for example:
a = 0.23371258E-4
b = 0.33678429E 2
c = -0.33677811E 2
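The effect of the summation order can be checked numerically with the values above (a Python sketch; the helper rounds to an 8-digit mantissa to emulate the t = 8 format):

```python
import math

def fl(x, t=8):
    """Round x to a t-digit mantissa."""
    if x == 0:
        return 0.0
    return round(x, (t - 1) - int(math.floor(math.log10(abs(x)))))

a, b, c = 0.23371258e-4, 0.33678429e2, -0.33677811e2

sum1 = fl(fl(a + b) + c)    # algorithm i):  (a + b) + c
sum2 = fl(a + fl(b + c))    # algorithm ii): a + (b + c)
exact = a + (b + c)         # double precision, effectively exact here

print(sum1, sum2)
```

Algorithm i) first buries the tiny a in the large b, losing its digits; algorithm ii) first forms the small difference b + c, so a still contributes.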
Exercise 2.1
Show, using an error analysis, why the case ii) gives a more accurate result for the numbers of
the example above. Neglect higher order error terms; that is, products of the form: ε1ε2.
Example
Determine the error propagation in the calculation of y = (x − a)² using floating-point arithmetic, by two different algorithms, when x and a are already floating-point numbers.
a) Using the direct form, the subtraction (error ε₁) is followed by the squaring (error ε₂):
$$fl(y) = \left[(x-a)(1+\varepsilon_1)\right]^2 (1+\varepsilon_2)$$
and, preserving only first order error terms:
$$fl(y) = (x-a)^2 (1+\varepsilon_1)^2 (1+\varepsilon_2) \approx (x-a)^2 (1+2\varepsilon_1)(1+\varepsilon_2) \approx (x-a)^2 (1+2\varepsilon_1+\varepsilon_2)$$
so that the relative error is:
$$\Delta y = \frac{fl(y) - y}{y} = 2\varepsilon_1 + \varepsilon_2$$
We can see that the relative error in the calculation of y using this algorithm is given by 2ε₁ + ε₂, so it is less than 3·eps.
b) Using the expanded form: y = x² − 2ax + a²
$$fl(y) = \left[\left(x^2(1+\varepsilon_1) - 2ax(1+\varepsilon_2)\right)(1+\varepsilon_3) + a^2(1+\varepsilon_4)\right](1+\varepsilon_5)$$
That is, taking the square of x first (with its error ε₁), subtracting the product 2ax (with its error ε₂) with the error ε₃ in the subtraction, and then adding the last term with its error ε₄ and the corresponding error ε₅ due to that addition. Expanding this, keeping only first order error terms, we get:
$$fl(y) = \left[x^2 - 2ax + x^2\varepsilon_1 - 2ax\,\varepsilon_2 + (x^2 - 2ax)\varepsilon_3 + a^2(1+\varepsilon_4)\right](1+\varepsilon_5)$$
$$fl(y) = x^2 - 2ax + a^2 + x^2(\varepsilon_1 + \varepsilon_3) - 2ax(\varepsilon_2 + \varepsilon_3) + a^2\varepsilon_4 + (x^2 - 2ax + a^2)\varepsilon_5$$
and we can see that there will be problems with this calculation if (x − a)² is too small compared with either x² or a². The last term above is bounded by eps, while the others are eps multiplied by amplification factors such as x²/(x − a)². For example, if x = 15 and a = 14, the three amplification factors will be respectively 225, 420 and 196, which gives a total error bound of (1 + 450 + 840 + 196)·eps = 1487·eps, compared with 3·eps for algorithm a).
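The difference between the two algorithms becomes dramatic when x and a are nearly equal. A Python sketch with hypothetical values x = 1.2345678, a = 1.2345677 (not from the notes) and an 8-digit mantissa:

```python
import math

def fl(x, t=8):
    """Round x to a t-digit mantissa."""
    if x == 0:
        return 0.0
    return round(x, (t - 1) - int(math.floor(math.log10(abs(x)))))

x, a = 1.2345678, 1.2345677          # hypothetical, nearly equal values
exact = (x - a) ** 2                 # ~1e-14

# a) direct form: y = (x - a)^2
d = fl(x - a)
y_direct = fl(d * d)

# b) expanded form: y = x^2 - 2ax + a^2, rounding after every operation
y_expanded = fl(fl(fl(x * x) - fl(2 * a * x)) + fl(a * a))

print(y_direct, y_expanded)
```

The direct form keeps nearly full accuracy; the expanded form loses essentially all significant digits to cancellation.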
Exercise 2.2
For y = a – b compare the error bounds when a and b are and are not already defined as
floating-point numbers.
Exercise 2.3
Determine the error propagation characteristics of two algorithms to calculate
a) y = x 2 − a2 and b) y = (x − 1)3 . Assume in both cases that x and a are floating point
numbers.
Bracketing Methods
(Figure: the bisection interval [aₙ, bₙ] with midpoint cₙ, and the halved interval [aₙ₊₁, bₙ₊₁] for the next iteration.)
where α is the exact position of the root and cn is the nth approximation found by this method.
Furthermore, if we want to find the solution with a tolerance ε (that is, α − cn ≤ ε ), we can
calculate the maximum number of iterations required from the expression above. Naturally, if at
one stage the solution lies at the middle of the current interval the search finishes early.
An approximate relative error (or percent error) at iteration n+1 can be defined as:
$$\varepsilon = \frac{c_{n+1} - c_n}{c_{n+1}}
\qquad\text{or}\qquad
\varepsilon = \frac{b_{n+1} - a_{n+1}}{b_{n+1} + a_{n+1}} \qquad (3.3)$$
Exercise 3.1
Demonstrate that the number of iterations required to achieve a tolerance ε is the integer that satisfies:
$$n \ge \frac{\log(b-a) - \log\varepsilon}{\log 2} \qquad (3.4)$$
Example
The function f(x) = cos(3x) has one root in the interval [0, 1]. The following simple Matlab program implements the bisection method to find this root.
a=0; b=1; tol=1e-6; f=@(x) cos(3*x);
while b-a > tol
    c=(a+b)/2;
    if f(a)*f(c)<0, b=c;
    elseif f(a)*f(c)>0, a=c;
    else return
    end
end
Provided that the solution lies in the initial interval, and since the search interval is
continually divided by two, we can see that this method will always converge to the solution and
will find it within a required precision in a finite number of iterations.
However, due to the rather blind choice of solution (it is always chosen as the middle of
the interval), the error doesn’t vary monotonically. For the previous example
f ( x) = cos(3x) = 0 :
Iteration    c            f(c)           error %
1 0.50000000 0.07073720 4.50703414
2 0.75000000 -0.62817362 -43.23944878
3 0.62500000 -0.29953351 -19.36620732
4 0.56250000 -0.11643894 -7.42958659
5 0.53125000 -0.02295166 -1.46127622
6 0.51562500 0.02391905 1.52287896
7 0.52343750 0.00048383 0.03080137
8 0.52734375 -0.01123469 -0.71523743
9 0.52539063 -0.00537552 -0.34221803
We can see that the error is not continually decreasing, although in the end it has to become small. This is due to the rather “brute force” nature of the algorithm. The approximation to the solution is chosen blindly as the midpoint of the interval, without any attempt at guessing its position inside the interval. For example, if at some iteration n the magnitudes of f(aₙ) and f(bₙ) are very different, say |f(aₙ)| >> |f(bₙ)|, it is likely that the solution is closer to b than to a, if the function is smooth.
A possible way to improve it is to select the point c by interpolating the values at a and b.
This is called the “regula falsi” method or method of false position.
$$\frac{f(b)}{f(a)} = \frac{c-b}{c-a} \qquad (3.5)$$
from where:
$$c = \frac{a f(b) - b f(a)}{f(b) - f(a)}$$
or alternatively:
$$c = a + \frac{f(a)(a-b)}{f(b) - f(a)} \qquad (3.6)$$
Fig. 3.3
The algorithm is the same as the bisection method except for the calculation of the point c.
In this case, for the same function (f(x) = cos(3x) = 0), the solution within the same absolute tolerance (10^−6) is found in only 4 iterations:
Iteration number a b c
0 0.00000000 1.00000000 0.00000000
1 0.50251446 1.00000000 0.50251446
2 0.50251446 0.53237237 0.53237237
3 0.52359536 0.53237237 0.52359536
We can see that the error decreases much more rapidly than in the bisection method. The size of
the interval also decreases more rapidly. In this case the successive values are:
Iteration    b − a         b − a in bisection method
1 0.49748554 0.5
2 0.02985791 0.25
3 0.00877701 0.125
4 0.00000342 0.0625
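The false-position iteration for this example can be sketched as follows (Python used here for illustration; the stopping criterion on successive values of c is an assumption, the interval, tolerance and function are from the text):

```python
import math

def regula_falsi(f, a, b, tol=1e-6, max_iter=100):
    """False-position iteration: c = a + f(a)(a - b)/(f(b) - f(a)), eq. (3.6)."""
    c_old = a
    for n in range(1, max_iter + 1):
        c = a + f(a) * (a - b) / (f(b) - f(a))
        if abs(c - c_old) < tol:
            return c, n
        if f(a) * f(c) < 0:
            b = c            # root lies in [a, c]
        else:
            a = c            # root lies in [c, b]
        c_old = c
    return c, max_iter

root, n = regula_falsi(lambda x: math.cos(3 * x), 0.0, 1.0)
print(root, n)   # root close to pi/6 = 0.5235987...
```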
However, this is not necessarily always the case and, on some occasions, the search interval can remain large. In particular, one of the limits can remain stuck while the other converges to the solution. In that case the length of the interval tends to a finite value instead of converging to zero.
In the following example, for the function f(x) = x^10 − 1, the solution requires 70 iterations to
reach a tolerance of 10−6 with the regula falsi method while only 24 are needed with the bisection
method. We can also see that the right side of the interval remains stuck at 1.3 and the size of
the interval will tend to 0.3 in the limit instead of converging to zero. The figure shows the
interpolating lines at each iteration. The corresponding approximations are the points where
these lines cross the x-axis.
Fig. 3.4 Standard regula falsi Fig. 3.5 Modified regula falsi
Open Methods
These methods start with only one or two points, not necessarily bracketing the root. One of the simplest is the fixed point iteration.
Example
For the function 0.5x² − 1.1x + 0.505 = 0 the algorithm can be set as: x = 0.5x² − 0.1x + 0.505, or
$$x = \frac{(x-0.1)^2 + 1}{2}$$
With an initial guess introduced in the right hand side, a new value of x is obtained and the iteration can continue.
Starting from the value x₀ = 0.5, the successive values converge towards the root at x = 1.1 − √0.2 ≈ 0.6528.
(Fig. 3.6: plot of y = x and y = g(x). Fig. 3.7: close-up showing the successive approximations.)
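The iteration for this example can be sketched as follows (a Python sketch; x₀ = 0.5 and g are taken from the text):

```python
# Fixed-point iteration x = g(x) = ((x - 0.1)^2 + 1)/2 for
# 0.5x^2 - 1.1x + 0.505 = 0; the nearby root is 1.1 - sqrt(0.2).
g = lambda x: ((x - 0.1) ** 2 + 1) / 2

x = 0.5
for _ in range(100):
    x = g(x)

print(x)   # ~0.6527864
```

Since |g'(x)| ≈ 0.55 near the root, each iteration roughly halves the error (linear convergence).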
Convergence
This is a very simple method but solutions are not guaranteed. The following figures show
situations when the method converges and when it diverges.
In cases (a) and (b) the method converges, while in cases (c) and (d) it diverges.
(Fig. 3.8 (a)-(d): plots of y = x and y = g(x) for the four cases.)
From Fig. 3.8 a-d we can see that it is relatively easy to determine when the method will converge, so the best way of ensuring success is to plot the functions y = g(x) and y = x. More rigorously, we can also see that for convergence to occur the slope of g(x) should be smaller in magnitude than that of x in the region of search; that is, |g'(x)| < 1.
If divergence is predicted, a different way of re-writing the problem f(x) = 0 in the form x = g(x) needs to be found that satisfies the condition above.
For example, for the function f(x) = 3x² + 3x − 1 = 0, with a solution at x₀ = 0.2637626, we can separate it in the following two forms:
$$\text{(a)}\quad x = g(x) = \frac{-3x^2 + 1}{3} \qquad\text{and}\qquad \text{(b)}\quad x = g(x) = 3x^2 + 4x - 1$$
In the first case, g'(x) = −2x, so g'(x₀) = −0.5275252, while for the second case, g'(x) = 6x + 4 and g'(x₀) = 5.5825756, so only the first form can be expected to converge.
(Fig. 3.9(a): x = g(x) = (−3x² + 1)/3. Fig. 3.9(b): x = g(x) = 3x² + 4x − 1.)
Fig. 3.9 illustrates the main deficiency of this method. Convergence often depends on how the
problem is formulated. Additionally, divergence can also occur if the initial guess is not
sufficiently close to the solution.
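The two rearrangements above can be compared numerically (a Python sketch; the starting guesses are assumptions, not from the notes):

```python
# Two rearrangements of 3x^2 + 3x - 1 = 0 (root at x ~ 0.2637626):
ga = lambda x: (-3 * x ** 2 + 1) / 3      # |g'(x0)| ~ 0.53 -> converges
gb = lambda x: 3 * x ** 2 + 4 * x - 1     # |g'(x0)| ~ 5.58 -> diverges

xa = 0.2
for _ in range(100):
    xa = ga(xa)

xb = 0.2
for _ in range(10):
    xb = gb(xb)
    if abs(xb) > 1e6:      # stop once divergence is obvious
        break

print(xa, xb)
```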
Newton-Raphson Method
This is one of the most used methods for root finding. It also needs only one point to start the iterations but, unlike the fixed point iteration, it will converge to the solution provided the function is monotonically varying in the region of interest.
Starting from a point x0, a tangent to the function f(x) (a line with the slope of the
derivative of f) is extrapolated to find the point where it crosses the x-axis, providing a new
approximation. The same procedure is repeated until the error tolerance is achieved. The
method needs repeated evaluation of the function and its derivative and an appropriate stopping
criterion is the value of the function at the successive approximations: f(xn).
(Fig. 3.10: the tangent at x₀, with slope f'(x₀), crosses the x-axis at the new approximation x₁.)
From Fig. 3.10 we can see that at stage n:
$$f'(x_n) = \frac{f(x_n)}{x_n - x_{n+1}}$$
then the next approximation is found as:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad (3.7)$$
Example
For the same function as in the previous examples, f(x) = cos(3x) = 0, and for the same tolerance of 10^−6, the solution is found in 3 iterations starting from x₀ = 0.3 (after 3 iterations the accuracy is better than 10^−8). Starting from 0.5, only 2 iterations are sufficient.
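Iterating (3.7) for this example takes only a few lines (a Python sketch with the starting point x₀ = 0.3 from the text):

```python
import math

f  = lambda x: math.cos(3 * x)
df = lambda x: -3 * math.sin(3 * x)

x = 0.3
for _ in range(3):
    x = x - f(x) / df(x)     # eq. (3.7)

print(x, abs(x - math.pi / 6))   # error well below 1e-8 after 3 iterations
```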
The method can also be derived from the Taylor series expansion. This also provides a useful
insight on the rate of convergence of the method.
Considering the Taylor expansion truncated to the first order (see Appendix):
$$f(x_{i+1}) = f(x_i) + f'(x_i)(x_{i+1} - x_i) + \frac{f^{(2)}(\xi)}{2!}(x_{i+1} - x_i)^2 \qquad (3.8)$$
Considering now the exact solution xr and the Taylor expansion, evaluated at this point:
$$f(x_r) = 0 = f(x_i) + f'(x_i)(x_r - x_i) + \frac{f^{(2)}(\xi)}{2!}(x_r - x_i)^2$$
and reordering (assuming a single root – first derivative ≠ 0):
$$x_r = x_i - \frac{f(x_i)}{f'(x_i)} - \frac{f^{(2)}(\xi)}{2!\, f'(x_i)}(x_r - x_i)^2 \qquad (3.9)$$
Using now (3.7) for $x_{i+1}$: $x_{i+1} = x_i - \dfrac{f(x_i)}{f'(x_i)}$, and substituting in (3.9) gives:
$$x_r = x_{i+1} - \frac{f^{(2)}(\xi)}{2!\, f'(x_i)}(x_r - x_i)^2$$
which can be reordered as:
$$(x_r - x_{i+1}) = -\frac{f^{(2)}(\xi)}{2!\, f'(x_i)}(x_r - x_i)^2 \qquad (3.10)$$
The error at stage i can be written as the difference between x_r and x_i: E_i = x_r − x_i. Then, from (3.10) we can write:
$$E_{i+1} = \frac{-f^{(2)}(\xi)}{2 f'(x_i)}\, E_i^2$$
Assuming convergence, both ξ and x_i should eventually approach x_r, so the previous equation can be re-arranged in the form:
$$E_{i+1} = \frac{-f^{(2)}(x_r)}{2 f'(x_r)}\, E_i^2 \qquad (3.11)$$
We can see that the relation between successive errors is quadratic. That means that on each Newton-Raphson iteration, the number of correct decimal digits should roughly double. This is what is called quadratic convergence.
Although the convergence rate is generally quite good, there are cases that show poor or
no convergence. An example is when there is an inflexion point near the root and in that case,
the iteration values will start to progressively diverge from the solution. Another case is when
the root is a multiple root, that is, when the first derivative is also zero.
Example
Use the secant method to find the root of f(x) = e^{−x} − x. Start with the estimates x₋₁ = 0 and x₀ = 1. The exact result is 0.56714329…
First iteration: x₋₁ = 0, f(x₋₁) = 1.0; x₀ = 1, f(x₀) = −0.63212, then:
$$x_1 = 1 - \frac{-0.63212\,(1 - 0)}{-0.63212 - 1} = 0.61270 \qquad \varepsilon \approx 8\%$$
Second iteration: x₀ = 1, f(x₀) = −0.63212; x₁ = 0.61270, f(x₁) = −0.07081, then:
$$x_2 = 0.61270 - \frac{-0.07081\,(0.61270 - 1)}{-0.07081 - (-0.63212)} = 0.563838$$
Note that in this case the 2 points are at the same side of the root (not bracketing it).
Using Excel, a simple calculation can be made giving:
i xi f(xi) error %
-1 0 1
0 1 -0.63212
1 0.612700047 -0.070814271 8.032671349
2 0.563838325 0.005182455 -0.582738902
3 0.567170359 -4.24203E-05 0.004772880
4 0.567143307 -2.53813E-08 2.92795E-06
5 0.567143290 1.24234E-13 7.22401E-08
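The table can be reproduced with a few lines (a Python sketch; the starting estimates are those given in the example):

```python
import math

f = lambda x: math.exp(-x) - x

x_prev, x = 0.0, 1.0      # the two starting estimates from the text
for _ in range(5):
    x_prev, x = x, x - f(x) * (x_prev - x) / (f(x_prev) - f(x))

print(x)   # 0.56714329...
```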
The secant method doesn't require the evaluation of the derivative of the function, as the Newton-Raphson method does, but it still suffers from the same problems. The convergence of the method is similar to that of Newton's and, similarly, it has severe problems if the derivative is zero or near zero in the region of interest.
Multiple Roots
We have seen that some of the methods have poor convergence if the derivative is very small or zero. For higher order zeros (multiple roots), the function is zero and so are the first n−1 derivatives (n is the order of the root). In this case the Newton-Raphson method (and the secant method) will converge poorly.
We can notice, however, that if the function f(x) has a multiple root at x = α, the function:
$$g(x) = \frac{f(x)}{f'(x)} \qquad (3.15)$$
has a simple root at x = α (if the root of f is of order n, the root of the derivative is of order n−1). We can then use the standard Newton-Raphson method on the function g(x).
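The gain can be demonstrated on a hypothetical test function with a double root (not from the notes), f(x) = (x − 1)²(x + 2); a Python sketch comparing standard Newton with Newton applied to g = f/f':

```python
# Double root at x = 1: standard Newton converges only linearly,
# Newton applied to g = f/f' recovers fast convergence.
f   = lambda x: (x - 1) ** 2 * (x + 2)
df  = lambda x: 3 * (x - 1) * (x + 1)
d2f = lambda x: 6 * x

def newton(step, x, tol=1e-10, max_iter=200):
    """Iterate x -> step(x) until |x - 1| < tol (root known for this test)."""
    for n in range(1, max_iter + 1):
        x = step(x)
        if abs(x - 1.0) < tol:
            return x, n
    return x, max_iter

std = lambda x: x - f(x) / df(x)
# Newton on g = f/f' uses g' = 1 - f*f''/f'^2
mod = lambda x: x - (f(x) / df(x)) / (1 - f(x) * d2f(x) / df(x) ** 2)

x_std, n_std = newton(std, 2.0)
x_mod, n_mod = newton(mod, 2.0)
print(n_std, n_mod)   # the modified version needs far fewer iterations
```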
Exercise 3.2
Use the Newton-Raphson method to find a root of the function f ( x ) = 1 − xe1− x . Start the
iterations with x0 = 0 .
$$\bar{x}_n = x_n + \alpha\,(x_n - x_{n-1}) \qquad\text{where}\qquad \alpha = \frac{x_{n-1} - x_n}{x_n - 2x_{n-1} + x_{n-2}} \qquad (3.16)$$
We can use this expression embedded in the fixed point iteration, for example, in the form:
Starting from a value x0, the first two iterates are found using the standard method:
x1=g(x0);
x2=g(x1);
Now, we can use Aitken's extrapolation in a repeated form:
alpha=(x1-x2)/(x2-2*x1+x0)
xbar=x2+alpha*(x2-x1)
and now we can refresh the initial guess:
x0=xbar
and re-start the iterations.
Similarly for the Newton-Raphson method where the evaluation of x1, and x2 are replaced by
the corresponding forms for the N-R method and the function and its derivative needs to be
calculated at each stage:
f0=f(x0)   % calculation of the function
df0=df(x0) % calculation of the derivative
x1=x0-f0/df0
f1=f(x1); df1=df(x1);
x2=x1-f1/df1
alpha=(x1-x2)/(x2-2*x1+x0)
xbar=x2+alpha*(x2-x1)
x0=xbar
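The accelerated loop above can be sketched and compared against the plain iteration (Python used here; g(x) = (2x + cos 3x)/2, whose fixed point is π/6, and x₀ = 0.5 are taken from the example below):

```python
import math

g = lambda x: (2 * x + math.cos(3 * x)) / 2   # fixed point at x = pi/6
root = math.pi / 6

# plain fixed-point iteration
x, n_plain = 0.5, 0
while abs(x - root) > 1e-10 and n_plain < 200:
    x = g(x)
    n_plain += 1

# Aitken-accelerated iteration, eq. (3.16)
x0, n_aitken = 0.5, 0
while abs(x0 - root) > 1e-10 and n_aitken < 200:
    x1 = g(x0)
    x2 = g(x1)
    alpha = (x1 - x2) / (x2 - 2 * x1 + x0)
    x0 = x2 + alpha * (x2 - x1)
    n_aitken += 1

print(n_plain, n_aitken)   # the accelerated loop needs far fewer passes
```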
Example
For the function f(x) = cos(3x), using the Fixed-Point method with and without acceleration gives the results that follow. The iterations are started with x₀ = 0.5 and the alternative form:
$$g(x) = \frac{2x + \cos(3x)}{2}$$
is used.
Both Newton-Raphson and the secant method approximate the function by a straight line and look for the intersection of that line with the x-axis. An obvious extension of this idea is to use a higher order approximation, for example, fitting a second order curve (a parabola) to the function and looking for the zero of this instead. A method that implements this is the Muller method.
Muller’s method
Using three points, we can find the equation of a parabola that fits the function and then, find the
zeros of the parabola.
Solving for the zero closest to x₃ gives:
$$x_4 = x_3 - \frac{2 y_3}{s + \operatorname{sign}(s)\sqrt{s^2 - 4 y_3 d_1}} \qquad (3.17)$$
where $s = c_2 + d_1 (x_3 - x_2)$.
The Muller method requires three points to start the iterations, but it doesn't require evaluation of derivatives as Newton-Raphson does. It can also be used to find complex roots.
Lagrange Interpolation
The basic interpolation problem can be formulated as:
Given a set of nodes {x_i, i = 0,…,n} and corresponding data values {y_i, i = 0,…,n}, find the polynomial p(x) of degree less than or equal to n such that p(x_i) = y_i.
Consider the family of functions:
$$L_i^{(n)}(x) = \prod_{j=0,\, j \ne i}^{n} \frac{x - x_j}{x_i - x_j}, \qquad i = 0, 1, \ldots, n \qquad (4.18)$$
We can see that they are polynomials of order n and have the property (interpolatory condition):
$$L_i^{(n)}(x_j) = \delta_{i,j} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \qquad (4.19)$$
Then, if we define the polynomial by:
$$p_n(x) = \sum_{k=0}^{n} y_k L_k^{(n)}(x) \qquad (4.20)$$
then:
$$p_n(x_i) = \sum_{k=0}^{n} y_k L_k^{(n)}(x_i) = y_i \qquad (4.21)$$
The uniqueness of this interpolation polynomial can also be demonstrated (that is, there is only one polynomial of order n or less that satisfies this condition).
Lagrange Polynomials
In more detail, from the general definition (4.18), the equation for the first order polynomial (straight line) passing through two points (x₁, y₁) and (x₂, y₂) is:
$$p_1(x) = L_1^{(1)} y_1 + L_2^{(1)} y_2 = \frac{x - x_2}{x_1 - x_2}\, y_1 + \frac{x - x_1}{x_2 - x_1}\, y_2 \qquad (4.22)$$
The second order polynomial (parabola) passing through three points is:
$$p_2(x) = \frac{(x-x_2)(x-x_3)}{(x_1-x_2)(x_1-x_3)}\, y_1 + \frac{(x-x_1)(x-x_3)}{(x_2-x_1)(x_2-x_3)}\, y_2 + \frac{(x-x_1)(x-x_2)}{(x_3-x_1)(x_3-x_2)}\, y_3 \qquad (4.23)$$
In general, we can see that the interpolation polynomials have the form given in (4.20) for any order.
Each of the Lagrange interpolation functions $L_k^{(n)}$ associated to each of the nodes x_k (given in general by (4.18)) is:
$$L_k^{(n)}(x) = \frac{(x - x_1)(x - x_2)\cdots(x - x_{k-1})(x - x_{k+1})\cdots(x - x_n)}{(x_k - x_1)(x_k - x_2)\cdots(x_k - x_{k-1})(x_k - x_{k+1})\cdots(x_k - x_n)} = \frac{N(x)}{D} \qquad (4.24)$$
The denominator has the same form as the numerator and D = N(x_k).
Example
Find the interpolating polynomial that passes through the three points: ( x1 , y1 ) = (−2,4) ,
( x2 , y2 ) = (0,2) and ( x3 , y3 ) = (2,8) .
Substituting in (4.20), or more specifically, (4.23):
$$p_2(x) = \frac{(x-0)(x-2)}{(-2-0)(-2-2)}\,4 + \frac{(x+2)(x-2)}{(0+2)(0-2)}\,2 + \frac{(x+2)(x-0)}{(2+2)(2-0)}\,8$$
$$p_2(x) = \frac{x^2 - 2x}{8}\,4 + \frac{x^2 - 4}{-4}\,2 + \frac{x^2 + 2x}{8}\,8 = L_1^{(2)} y_1 + L_2^{(2)} y_2 + L_3^{(2)} y_3$$
$$p_2(x) = x^2 + x + 2$$
Fig. 4.1 shows the complete interpolating polynomial p₂(x) and the three Lagrange interpolation polynomials $L_k^{(2)}(x)$, k = 1, 2, 3, corresponding to each of the nodal points. Notice that the function corresponding to one node has a value 1 at that node and 0 at the other two.
(Fig. 4.1)
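Formula (4.20) translates almost directly into code (a Python sketch, checked against the worked example above):

```python
def lagrange(xd, yd, x):
    """Evaluate the Lagrange interpolating polynomial (4.20) at x."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xd, yd)):
        L = 1.0
        for j, xj in enumerate(xd):
            if j != k:
                L *= (x - xj) / (xk - xj)   # factor of L_k^(n), eq. (4.18)
        total += yk * L
    return total

xd, yd = [-2.0, 0.0, 2.0], [4.0, 2.0, 8.0]
# p2(x) = x^2 + x + 2 reproduces the data, and e.g. p2(1) = 4
print([lagrange(xd, yd, x) for x in (-2.0, 0.0, 2.0, 1.0)])   # [4.0, 2.0, 8.0, 4.0]
```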
Exercise 4.1
Find the 5th order Lagrange interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and
yd = {2, 1, 3.4, 3.8, 5.8, 4.8}.
Exercise 4.2
Show that an arbitrary polynomial of order n can be represented exactly by
$$p(x) = \sum_{i=0}^{n} p(x_i) L_i(x)$$
using an arbitrary set of (distinct) data points x_i.
Newton Interpolation
It can be easily demonstrated that the polynomial interpolating a set of points is unique (Exercise
4.2), and the Lagrange method allows us to find it. The Newton interpolation method gives
eventually the same result but it can be more convenient in some cases. In particular, it is
simpler to extend the interpolation adding extra points, which in the Lagrange method would
need a total re-calculation of the interpolation functions.
$$f(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \cdots + b_n x^n \qquad (4.25)$$
Newton’s method, and Lagrange’s, give us a procedure to find the coefficients bi.
$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1}$$
which can be re-arranged as:
$$y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) \qquad (4.26)$$
That is, the Newton form of the equation of a straight line that passes through 2 points (x₁, y₁) and (x₂, y₂) is:
$$p(x) = a_0 + a_1(x - x_1); \qquad\text{where}\quad a_0 = y_1, \quad a_1 = \frac{y_2 - y_1}{x_2 - x_1} \qquad (4.27)$$
Similarly, the general expression for a second order polynomial passing through the 3 data points (x₁, y₁), (x₂, y₂) and (x₃, y₃) can be written as:
$$p_2(x) = b_0 + b_1 x + b_2 x^2$$
or, in Newton form:
$$p_2(x) = a_0 + a_1(x - x_1) + a_2(x - x_1)(x - x_2) \qquad (4.28)$$
Substituting the values for the 3 points, we get, after some re-arrangement:
$$a_0 = y_1, \qquad a_1 = \frac{y_2 - y_1}{x_2 - x_1}, \qquad a_2 = \frac{\dfrac{y_3 - y_2}{x_3 - x_2} - \dfrac{y_2 - y_1}{x_2 - x_1}}{x_3 - x_1} \qquad (4.29)$$
The individual terms in the above expression are usually called “divided differences” and
denominated by the symbol D. That is,
$$Dy_i = \frac{y_{i+1} - y_i}{x_{i+1} - x_i}, \qquad D^2 y_i = \frac{Dy_{i+1} - Dy_i}{x_{i+2} - x_i}, \qquad D^3 y_i = \frac{D^2 y_{i+1} - D^2 y_i}{x_{i+3} - x_i}, \quad\text{etc.}$$
The general form of Newton interpolation polynomials is then an extension of (4.27) and (4.28):
$$p_n(x) = a_0 + a_1(x - x_1) + a_2(x - x_1)(x - x_2) + \cdots \qquad (4.30)$$
or:
$$p_n(x) = a_0 + \sum_{i=1}^{n} a_i W_i(x) \qquad\text{with}\qquad W_i(x) = \prod_{j=1}^{i} (x - x_j) \qquad (4.31)$$
with the coefficients: $a_0 = y_1$, $a_i = D^i y_1$.
Example
We can consider the previous example of finding the interpolating polynomial that passes
through the three points: ( x1 , y1 ) = (−2,4) , ( x2 , y2 ) = (0,2) and ( x3 , y3 ) = (2,8) .
In this case it is usual and convenient to arrange the calculations in a table with the following quantities in each column:

x_i    y_i    Dy_i                          D²y_i
−2     4
              Dy₁ = (2−4)/(0−(−2)) = −1
0      2                                    D²y₁ = (3−(−1))/(2−(−2)) = 1
              Dy₂ = (8−2)/(2−0) = 3
2      8

$$p_2(x) = 4 - (x + 2) + (x + 2)(x - 0) = x^2 + x + 2$$
Note that it is the same polynomial found using Lagrange interpolation.
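The divided-difference table can be built programmatically (a Python sketch, checked against the example above):

```python
def divided_differences(xd, yd):
    """Return the Newton coefficients a_i = D^i y_1 (top edge of the table)."""
    coef = list(yd)
    n = len(xd)
    for order in range(1, n):
        # update in place, from the bottom, so lower-order entries survive
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xd[i] - xd[i - order])
    return coef

def newton_eval(xd, coef, x):
    """Evaluate p(x) = a0 + a1(x - x1) + a2(x - x1)(x - x2) + ..."""
    p, w = coef[0], 1.0
    for i in range(1, len(coef)):
        w *= x - xd[i - 1]
        p += coef[i] * w
    return p

xd, yd = [-2.0, 0.0, 2.0], [4.0, 2.0, 8.0]
coef = divided_differences(xd, yd)
print(coef)                      # [4.0, -1.0, 1.0]
print(newton_eval(xd, coef, 1))  # 4.0  (= 1^2 + 1 + 2)
```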
One important property of Newton's construction of the interpolating polynomial is that it makes it easy to extend the interpolation by including more points: if an additional point (x_{n+1}, y_{n+1}) is included, the new higher order polynomial is easily constructed from the previous one as p_{n+1}(x) = p_n(x) + a_{n+1} W_{n+1}(x).
In this way, it has many similarities with the Taylor expansion, where additional terms increase
the order of the polynomial. These similarities allow a treatment of the error in the same way as
it is done with Taylor expansions.
Exercise 4.3
Find the 5th order interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1,
One of the problems with interpolation of data points is that this technique is very sensitive to noisy data. A very small change in the values of the data can lead to a drastic change in the interpolating function. This is illustrated in the following example:
Fig. 4.3 shows the interpolating polynomial (in blue) for the data:
xd = {0, 1, 2, 3, 4, 5}
yd = {2, 1, 3, 4.8, 5.8, 4.8}
Now, if we add two more points with a slight amount of noise: xd’ = {2.2, 2.7} and yd’ = {3.5,
4.25} (shown with the filled black markers), the new interpolation polynomial (red line) shows a
dramatic difference to the first one.
(Fig. 4.3)
Hermite Interpolation
The problem arises because the extra points force a higher degree polynomial, and this can have a more oscillatory behaviour. Another approach that avoids this problem is to use data for the derivative of the function too. If we also ask for the derivative values to be matched at the nodes, the oscillations will be prevented. This is done with “Hermite interpolation”. The development is rather similar to that of Newton's method but more complicated due to the involvement of the derivative values. It can also be constructed easily with the help of a table (as in Newton's) and divided differences. We will not cover here the details of the derivation but simply the procedure to find it.
The table is similar to that for the Newton interpolation, but we enter the data points twice (see below) and the derivative values are placed between repeated data points, in alternate rows, as the first divided differences. The initial set-up for 2 points is marked with red circles.
i    x_i    y_i    Dy_i                              D²y_i                         D³y_i
1    x₁     y₁
                   y₁'
1    x₁     y₁                                       A = (Dy₁ − y₁')/(x₂ − x₁)
                   Dy₁ = (y₂ − y₁)/(x₂ − x₁)                                       C = (B − A)/(x₂ − x₁)
2    x₂     y₂                                       B = (y₂' − Dy₁)/(x₂ − x₁)
                   y₂'
2    x₂     y₂

$$H_2(x) = y_1 + y_1'(x - x_1) + A(x - x_1)^2 + C(x - x_1)^2(x - x_2) \qquad (4.33)$$
The coefficients of the successive terms are marked in the table with blue squares.
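The table and formula (4.33) can be checked on hypothetical data (not from the notes): y(0) = 1, y'(0) = 0, y(1) = 2, y'(1) = 1. A Python sketch, verifying that both values and slopes are matched at the nodes:

```python
# Two-point Hermite interpolation following the table and eq. (4.33),
# with hypothetical data: y(0) = 1, y'(0) = 0, y(1) = 2, y'(1) = 1.
x1, y1, dy1 = 0.0, 1.0, 0.0
x2, y2, dy2 = 1.0, 2.0, 1.0

Dy1 = (y2 - y1) / (x2 - x1)
A = (Dy1 - dy1) / (x2 - x1)
B = (dy2 - Dy1) / (x2 - x1)
C = (B - A) / (x2 - x1)

H2 = lambda x: y1 + dy1 * (x - x1) + A * (x - x1) ** 2 + C * (x - x1) ** 2 * (x - x2)

# H2 matches both values and slopes at the nodes:
h = 1e-6
print(H2(x1), H2(x2))                        # 1.0 2.0
print((H2(x1 + h) - H2(x1 - h)) / (2 * h))   # ~0.0  (matches y'(0))
print((H2(x2 + h) - H2(x2 - h)) / (2 * h))   # ~1.0  (matches y'(1))
```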
(Fig. 4.4)
Another approach to avoid the oscillations present when using high order polynomials is to
use lower order polynomials to interpolate subsets of the data and assemble the overall
approximating function piecewise. This is what is called “spline interpolation”.
Spline Interpolation
Any piecewise interpolation of data by low order functions is called spline interpolation; the simplest and most widely used is the piecewise linear interpolation, or simply, joining consecutive data points by straight lines.
For example, for a set of 4 data points, we can establish the necessary equations as listed above,
giving a total of 8 equations for 8 unknowns (having fixed a1 = 0 already). This can be solved
by the matrix techniques that we will study later.
Quadratic splines have some shortcomings that are not present in cubic splines, so cubic splines are preferred. However, their calculation is even more cumbersome than that of quadratic splines. In this case, the function and the first and second derivatives are continuous at the
nodes.
Because of their popularity, cubic splines are commonly found in computer libraries and for
example, Matlab has a standard function that calculates them.
(Fig. 4.5)
xd=[0,1,2,2.2,2.7,3,4,5]';
yd=[2,1,3,3.5,4.25,4.8,5.8,4.8]';
x=0:0.05:5;
y=spline(xd,yd,x);
plot(x,y,'g','Linewidth',2.5)
plot(xd,yd,'ok','MarkerSize',10,'MarkerFaceColor','w','LineWidth',2)
Note that the drawing of the 7th-order interpolating polynomial (red line in Fig. 4.5) is not included in this piece of code, and that the last line simply draws the markers.
Exercise 4.4
Using Matlab plot the function $f(x) = 0.1x\,e^{1.2\sin x^2}$ in the interval [0, 4] and construct a cubic
spline interpolation using the values of the function at the points xd = 0: 0.5: 4 (in Matlab
notation). Use Excel to construct a table of divided differences for this function at the points xd
and find the coefficients of the Newton interpolation polynomial. Use Matlab to plot the
corresponding polynomial in the same figure as the splines and compare the results.
If the main objective is to create a smooth function to represent the data, it is sometimes preferable to choose a function that doesn't necessarily pass exactly through the data but approximates its overall behaviour. This is what is called “approximation”. The problems here are how to choose the approximating function and what is considered the best choice.
Approximation
There are many different ways to approximate a function and you have seen some of them
in detail already. Taylor expansions, least squares curve fitting and Fourier series are examples
of this.
Methods like the “least squares” look for a single function or polynomial to approximate
the desired function. Another approach, of which the Fourier series is an example, consists of
using a family of simple functions to build an expansion that approximates the given function.
The problem then is to find the appropriate set of coefficients for that expansion. Taylor series
are somehow related, the main difference is that while the other methods based on expansions
attempt to find an overall approximation, Taylor series are meant to approximate the function at
one particular point and its close vicinity.
The simplest case is fitting a straight line y = a + bx to data points (xi, yi), i = 1, …, n, minimising the total squared error:

E(a, b) = Σi=1..n ( yi − (a + b xi) )²    (4.40)
The error is a function of the parameters a and b that define the straight line. Then, the
minimization of the error can be achieved by making the derivatives of E with respect to a and b
equal to zero. These conditions give:
∂E/∂a = −2 Σi=1..n ( yi − (a + b xi) ) = 0   ⇒   Σi=1..n yi − n a − b Σi=1..n xi = 0    (4.41)

∂E/∂b = −2 Σi=1..n ( yi − (a + b xi) ) xi = 0   ⇒   Σi=1..n xi yi − a Σi=1..n xi − b Σi=1..n xi² = 0    (4.42)
which can be simplified to:
n a + b Σ xi = Σ yi    (4.43)

and  a Σ xi + b Σ xi² = Σ xi yi    (4.44)
Solving the system for a and b gives:
a = ( Σ xi² Σ yi − Σ xi Σ xi yi ) / ( n Σ xi² − (Σ xi)² )

and  b = ( n Σ xi yi − Σ xi Σ yi ) / ( n Σ xi² − (Σ xi)² )    (4.45)
Example
Fitting a straight line to the data given by:
xd = {0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3} and
yd = {0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95}
Then, Σ xdi = 15.15, Σ ydi = 15.08, Σ xdi² = 32.8425 and Σ xdi·ydi = 32.577 (sums over i = 1..10).
(See Fig. 4.6.)
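These sums can be plugged directly into the closed-form expressions (4.45); a small Python sketch (the variable names Sx, Sy, etc. are just local choices):

```python
xd = [0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3]
yd = [0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95]
n = len(xd)
Sx = sum(xd)
Sy = sum(yd)
Sxx = sum(x * x for x in xd)
Sxy = sum(x * y for x, y in zip(xd, yd))
den = n * Sxx - Sx ** 2
a = (Sxx * Sy - Sx * Sxy) / den   # intercept, first formula of (4.45)
b = (n * Sxy - Sx * Sy) / den     # slope, second formula of (4.45)
```

This gives a ≈ 0.0174 and b ≈ 0.9839, i.e. a line very close to y = x, as the data suggests.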
Example
The data (xd, yd) shown in Fig. 4.7 appears to behave in an exponential manner. Then, defining
the variable zdi = log(ydi), zd should vary linearly with xd. We can then fit a straight line to
the pair of variables (xd, zd). If the fitting function is the function z(x), then y = e^(z(x))
gives a smooth fit to the original data (Fig. 4.7).
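A sketch of this log-transform fit, using made-up data that is exactly exponential (y = 2·e^(0.8x)) so the recovered parameters can be checked:

```python
import math

# Made-up data that is exactly exponential: y = 2 * exp(0.8 * x)
xd = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
yd = [2.0 * math.exp(0.8 * x) for x in xd]
zd = [math.log(y) for y in yd]     # linear in x: z = ln 2 + 0.8 x

n = len(xd)
Sx, Sz = sum(xd), sum(zd)
Sxx = sum(x * x for x in xd)
Sxz = sum(x * z for x, z in zip(xd, zd))
den = n * Sxx - Sx ** 2
a = (Sxx * Sz - Sx * Sxz) / den    # intercept of the fitted z(x)
b = (n * Sxz - Sx * Sz) / den      # slope of the fitted z(x)
# The exponential model is then y = exp(a) * exp(b * x)
```

Because the transformed data is exactly linear, the fit recovers the slope 0.8 and the factor 2 (to rounding).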
Similarly, a parabola y = a + bx + cx² can be fitted to the data; in that case the error is:

E(a, b, c) = Σi=1..n ( yi − (a + b xi + c xi²) )²    (4.46)

Making the derivatives of E with respect to a, b and c equal to zero will give the necessary 3
equations for the coefficients of the parabola. The expressions are similar to those found for the
straight line fit although more complicated.
Matlab has standard functions to perform least squares approximations with polynomials of any
order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the
function coeff = polyfit(xd, yd, m) returns the coefficients, and
y = polyval(coeff, x) evaluates the resulting polynomial at the points x.
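In Python the same job could be done with numpy's polyfit/polyval; as a fully self-contained sketch, the normal equations can also be assembled and solved directly (the helper name lstsq_poly is ours, not a library function):

```python
def lstsq_poly(xd, yd, m):
    # Normal equations for the least-squares polynomial sum_k c[k] * x**k
    A = [[sum(x ** (i + j) for x in xd) for j in range(m + 1)]
         for i in range(m + 1)]
    s = [sum(y * x ** i for x, y in zip(xd, yd)) for i in range(m + 1)]
    # Gaussian elimination with partial pivoting
    for k in range(m + 1):
        p = max(range(k, m + 1), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        s[k], s[p] = s[p], s[k]
        for r in range(k + 1, m + 1):
            f = A[r][k] / A[k][k]
            for j in range(k, m + 1):
                A[r][j] -= f * A[k][j]
            s[r] -= f * s[k]
    c = [0.0] * (m + 1)
    for k in range(m, -1, -1):
        c[k] = (s[k] - sum(A[k][j] * c[j] for j in range(k + 1, m + 1))) / A[k][k]
    return c

# Data sampled from an exact parabola; the fit should recover its coefficients
xd = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
yd = [2 - 3 * x + 0.5 * x ** 2 for x in xd]
c = lstsq_poly(xd, yd, 2)
```

Since the data lies exactly on the parabola 2 − 3x + 0.5x², the least-squares fit reproduces those coefficients.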
Exercise 4.5
Find the coefficients of a second order polynomial that approximates the function s(x) = e^x in
the interval [−1, 1] in the least squares sense, using the points xd = [−1, 0, 1]. Plot the function
and the approximating polynomial together with the Taylor approximation (Taylor series
truncated to second order) for comparison.
A general approach is to approximate f(x) by an expansion truncated to n terms:

f(x) ≈ f̃n(x) = Σk=1..n ck φk(x)    (4.47)
In this context, the error of this approximation is the function difference between the exact and
the approximate functions: rn(x) = f(x) − f̃n(x), and we can use the least squares ideas again,
seeking to minimise the norm of this error; that is, the quantity (error residual):

Rn = ||rn(x)||² ≡ ∫Ω ( f(x) − f̃n(x) )² dx    (4.48)

with the subscript n because the error residual above is also a function of the truncation level.
This concept of norm is analogous to that of the norm or modulus of a vector:
||v|| = √(v·v) = √( Σi=1..N vi² ).
In order to extend it to functions we need to introduce the inner product of functions, (analogous
to the dot product of vectors):
If we have two functions f and g defined over the same domain Ω, their inner product is
the quantity:
⟨f, g⟩ = ∫Ω f(x) g(x) dx    (4.49)
Note: The above definition of inner product is sometimes extended using a weighting function
w(x) in the form: ⟨f, g⟩ = ∫Ω w(x) f(x) g(x) dx, and provided that w(x) is non-negative it
satisfies all the required properties.
In a similar form as with the dot product of vectors, ||f(x)||² = ⟨f(x), f(x)⟩.
Using the inner product definition, the error expression (4.48) can be written as:

Rn = ||f(x) − f̃n(x)||² = ∫Ω ( f(x) − f̃n(x) )² dx = ||f(x)||² − 2⟨f(x), f̃n(x)⟩ + ||f̃n(x)||²    (4.50)
and if we write f̃n(x) = Σk=1..n ck φk(x) as above, we get:

Rn = ||f(x)||² − 2 Σk=1..n ck ⟨f(x), φk(x)⟩ + Σk=1..n Σj=1..n ck cj ⟨φk(x), φj(x)⟩    (4.51)
We can see that the error residual Rn, is a function of the coefficients ck of the expansion and
then, to find those values that minimize this error, we can make the derivatives of Rn with respect
to ck equal to zero for all k. That is:
∂Rn/∂ck = 0 for k = 1, …, n.    (4.52)
The first term is independent of ck, so it will not contribute and the other two will yield the
general equation:
Σj=1..n cj ⟨φk(x), φj(x)⟩ = ⟨f(x), φk(x)⟩ for k = 1, …, n.    (4.53)
(k = 1)  ⟨φ1, φ1⟩c1 + ⟨φ1, φ2⟩c2 + … + ⟨φ1, φn⟩cn = ⟨f, φ1⟩
(k = 2)  ⟨φ2, φ1⟩c1 + ⟨φ2, φ2⟩c2 + … + ⟨φ2, φn⟩cn = ⟨f, φ2⟩
  …
(k = n)  ⟨φn, φ1⟩c1 + ⟨φn, φ2⟩c2 + … + ⟨φn, φn⟩cn = ⟨f, φn⟩
which can be written as a matrix problem of the form: Φc = s, where the matrix Φ contains all
the inner products (in all combinations), the vector c is the list of coefficients and s is the list of
values in the right hand side (which are all known: we can calculate them all).
We can find the coefficients solving the system of equations, but we can see that this will
be a much easier task if all crossed products of basis functions yielded zero; that is, if
⟨φk, φj⟩ = 0 whenever k ≠ j.
This is what is called orthogonality and is a very useful property of the functions in a base.
(Similar to what happens with a base of perpendicular vectors: all dot products between
different vectors are zero.)
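To make the matrix problem Φc = s concrete, here is a small sketch that projects f(x) = x³ onto the (non-orthogonal) monomial basis {1, x, x²} on [−1, 1], building the Gram matrix from exact monomial integrals and solving the system; the best approximation in this case is the known result (3/5)x:

```python
def mono_int(k):
    # Exact integral of x**k over [-1, 1]
    return 0.0 if k % 2 else 2.0 / (k + 1)

def solve(A, b):
    # Gaussian elimination with partial pivoting
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for j in range(k, n):
                A[r][j] -= f * A[k][j]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]
    return x

# Basis phi_k(x) = x**k, k = 0..2; function to approximate: f(x) = x**3
nb = 3
Phi = [[mono_int(i + j) for j in range(nb)] for i in range(nb)]  # <phi_i, phi_j>
s = [mono_int(i + 3) for i in range(nb)]                         # <f, phi_i>
c = solve(Phi, s)
```

With an orthogonal basis, Φ would be diagonal and no system solve would be needed, which is exactly the point made above.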
1. Legendre Polynomials
They are orthogonal polynomials in the interval [−1, 1], with weighting function 1. That is,
⟨Pi(x), Pj(x)⟩ = ∫−1..1 Pi(x) Pj(x) dx = 0 if i ≠ j. They are usually normalised so that
Pn(1) = 1 and their norm in this case is:

⟨Pn(x), Pn(x)⟩ = ∫−1..1 Pn(x) Pn(x) dx = 2/(2n + 1)    (4.56)
The first few are:
P0(x) = 1
P1(x) = x
P2(x) = (3x² − 1)/2
P3(x) = (5x³ − 3x)/2
P4(x) = (35x⁴ − 30x² + 3)/8
P5(x) = (63x⁵ − 70x³ + 15x)/8
P6(x) = (231x⁶ − 315x⁴ + 105x² − 5)/16
P7(x) = (429x⁷ − 693x⁵ + 315x³ − 35x)/16
etc. (Fig. 4.8 shows the first few Legendre polynomials.)
In general they can be defined by the expression:

Pn(x) = ( (−1)ⁿ / (2ⁿ n!) ) dⁿ/dxⁿ (1 − x²)ⁿ    (4.57)
They also satisfy the recurrence relation: (n + 1) Pn+1 ( x) = (2n + 1) xPn ( x) − nPn−1 ( x) (4.58)
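The recurrence (4.58) gives a convenient way to evaluate P_n(x) numerically; a short sketch, checked against the explicit polynomials listed above:

```python
def legendre(n, x):
    # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x), as in (4.58)
    if n == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p
```

The normalisation Pn(1) = 1 is preserved automatically by the recurrence.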
2. Chebyshev Polynomials
The general, compact definition of these polynomials is: Tn(x) = cos(n cos⁻¹(x))    (4.59)
and they satisfy the following orthogonality condition:
⟨Ti(x), Tj(x)⟩ = ∫−1..1 Ti(x) Tj(x) / √(1 − x²) dx = { 0 if i ≠ j;  π/2 if i = j ≠ 0;  π if i = j = 0 }    (4.60)
That is, they are orthogonal in the interval [−1, 1] with the weighting function w(x) = 1/√(1 − x²).
They are characterised by having all their oscillations of the same amplitude within the interval
[−1, 1]; all their zeros also occur in this interval.
The first few Chebyshev polynomials are:
T0(x) = 1
T1(x) = x
T2(x) = 2x² − 1
T3(x) = 4x³ − 3x
T4(x) = 8x⁴ − 8x² + 1
T5(x) = 16x⁵ − 20x³ + 5x
T6(x) = 32x⁶ − 48x⁴ + 18x² − 1
T7(x) = 64x⁷ − 112x⁵ + 56x³ − 7x
… etc.
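The compact definition (4.59) makes these polynomials trivial to evaluate for x in [−1, 1]; the values can be checked against the explicit polynomials listed above:

```python
import math

def chebyshev(n, x):
    # T_n(x) = cos(n * arccos(x)), valid for x in [-1, 1], as in (4.59)
    return math.cos(n * math.acos(x))
```

For example, chebyshev(3, 0.3) agrees with 4(0.3)³ − 3(0.3) = −0.792 to rounding.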
These are not the only possibilities. Other families of polynomials commonly used are the
Hermite polynomials, which are orthogonal over the complete real axis with weighting function
exp(−x²), and the Laguerre polynomials, orthogonal in [0, ∞) with weighting function e^(−x).
Example
Approximate the function f(x) = 1/(1 + a²x²) with a = 4 in the interval [−1, 1] in the least
squares sense using Legendre polynomials (here up to order 8).

The approximation is f̃(x) = Σk ck Pk(x) and the coefficients are:

ck = ( 1/||Pk(x)||² ) ∫−1..1 f(x) Pk(x) dx,  with ||Pk(x)||² = 2/(2k + 1).
From the expression above we can see that the calculation of the coefficients will involve the
integrals:

Im = ∫−1..1 x^m / (1 + a²x²) dx

which satisfy the recurrence relation:

Im = 2/((m − 1)a²) − Im−2/a²  with  I0 = (2/a) tan⁻¹(a)
We can also see that due to symmetry Im = 0 for m odd. (Integral of an odd function over the
interval [−1, 1]). Also, because of this, only even numbered coefficients are necessary. (Odd
coefficients of the expansion are zero).
The coefficients are then:

c0 = (1/2) ∫−1..1 f(x) dx = (1/2) ∫−1..1 dx/(1 + a²x²) = I0/2

c2 = (5/2) ∫−1..1 P2(x) f(x) dx = (5/4) ∫−1..1 (3x² − 1)/(1 + a²x²) dx = (5/4)(3I2 − I0)

c4 = (9/2) ∫−1..1 P4(x) f(x) dx = (9/16)(35I4 − 30I2 + 3I0)

c6 = (13/2) ∫−1..1 P6(x) f(x) dx = (13/32)(231I6 − 315I4 + 105I2 − 5I0)

c8 = (17/2) ∫−1..1 P8(x) f(x) dx = (17/256)(6435I8 − 12012I6 + 6930I4 − 1260I2 + 35I0)
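The recurrence for Im and the coefficient formulas above can be checked numerically; the sketch below also verifies I2 against a composite Simpson rule (the step count N is an arbitrary choice):

```python
import math

a = 4.0
# I_m = integral over [-1, 1] of x**m / (1 + a**2 x**2), via the recurrence above
I = {0: (2.0 / a) * math.atan(a)}
for m in (2, 4, 6, 8):
    I[m] = 2.0 / ((m - 1) * a * a) - I[m - 2] / (a * a)

# Even-numbered expansion coefficients (the odd ones vanish by symmetry)
c0 = 0.5 * I[0]
c2 = (5.0 / 4.0) * (3 * I[2] - I[0])
c4 = (9.0 / 16.0) * (35 * I[4] - 30 * I[2] + 3 * I[0])

# Cross-check I[2] with a composite Simpson rule on N intervals
def f(x):
    return x ** 2 / (1 + a * a * x * x)
N = 2000
h = 2.0 / N
simpson = (h / 3) * sum((1 if i in (0, N) else (4 if i % 2 else 2)) * f(-1 + i * h)
                        for i in range(N + 1))
```

The two values of I2 agree to many digits, confirming the recurrence.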
The results are shown in Fig. 4.10, where the red line corresponds to the approximating curve.
The green line at the bottom shows the absolute value of the difference between the two curves.
The total error of the approximation (integral of this difference, divided by the integral of the
original curve) is of 10.5%.
Exercise 4.6
Use cosine functions ( cos(nπ x) ) instead of Legendre polynomials (in the form of a Fourier
series) and compare the results.
Remarks:
We can observe in the figure of the example above that the error oscillates through the
domain. This is typical of least squares approximation, where the overall error is minimised.
In this context, Chebyshev polynomials are the best possible choice: the error obtained
with this approximation is the smallest possible for any polynomial up to the same degree.
Furthermore, these polynomials have other useful properties. We have seen before the interpolation
of data by higher order polynomials using equally spaced data points, and the problems that this
causes were quite clear. However, one can then ask if there is a different distribution of points
that helps in minimising the error. The answer is yes: the optimum arrangement is to locate the
data points on the zeros of the Chebyshev polynomial of the order necessary to give the required
number of data points. These occur at:

xk = cos( (2k − 1)π / (2n) ) for k = 1, …, n.
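A quick check that these points are indeed the zeros of T_n:

```python
import math

def chebyshev_nodes(n):
    # Zeros of T_n: x_k = cos((2k - 1) * pi / (2n)), k = 1..n
    return [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]

nodes = chebyshev_nodes(6)
```

Evaluating T6 (via cos(6·arccos x)) at each node gives zero to rounding.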
Approximation to a Point
The approximation methods we have studied so far attempt to reach a "global" approximation to
a function; that is, to minimise the error over a complete domain. This might be desirable in
many cases, but there could be others where it is more important to achieve high accuracy
at one particular point of the domain and in its close vicinity. One such method is the Taylor
approximation, where the function is approximated by a Taylor polynomial.
and the error incurred is given by: f⁽ⁿ⁺¹⁾(ξ) (x − a)ⁿ⁺¹ / (n + 1)!, where ξ is a point between a and x.
(See Appendix).
If a function varies rapidly or has a pole, polynomial approximations will not be able to achieve
a high degree of accuracy. In that case an approximation using rational functions will be better:
the polynomial in the denominator provides the capability for rapid variations and poles.
Padé Approximation
Padé approximants are rational functions, or ratios of polynomials, that fit the value of a
function and a number of its derivatives at one point. They usually provide an approximation
that is better than that of the Taylor polynomials, in particular in the case of functions
containing poles.
A Padé approximation to a function f(x) that can be represented by a Taylor series in [a, b]
(or Padé approximant) is a ratio between two polynomials Pm(x) and Qn(x) of orders m and n
respectively:
f(x) ≈ Rnm(x) = Pm(x)/Qn(x) = ( am x^m + … + a2x² + a1x + a0 ) / ( bn x^n + … + b2x² + b1x + b0 )    (4.63)
For simplicity, we’ll consider only approximations to a function at x = 0. For other values, a
simple transformation of variables can be used.
If the Taylor approximation at x = 0 (Maclaurin series) to f(x) is:
t(x) = Σi=0..k ci x^i = ck x^k + … + c2x² + c1x + c0  with k = m + n    (4.64)
we can write: t(x) ≈ Rnm(x) = Pm(x)/Qn(x), or t(x)Qn(x) = Pm(x).    (4.65)
Considering now this equation in its expanded form:

( ck x^k + … + c1x + c0 )( bn x^n + … + b1x + b0 ) = am x^m + … + a1x + a0    (4.66)
we can establish a system of equations to find the coefficients of P and Q. First of all, we can
force this condition to apply at x = 0 (exact matching of the function at x = 0). This will give:
t (0)Qn (0) = Pm (0) or:
c0b0 = a0 (4.67)
but, since the ratio R doesn’t change if we multiply the numerator and denominator by any
number, we can choose the value of b0 = 1 and this gives us the value of a0 .
Taking now successive derivatives of (4.66): writing g(x) = t(x)Qn(x), the Leibniz rule gives:

g⁽ⁱ⁾(0) = Σj=0..i [ i! / (j!(i − j)!) ] t⁽ʲ⁾(0) Q⁽ⁱ⁻ʲ⁾(0)    (4.72)
The first derivative of a polynomial, say Qn(x), is n bn x^(n−1) + … + 2b2x + b1, so the second
derivative will be (n − 1)n bn x^(n−2) + … + 2·3 b3 x + 2b2, and so on. Then these derivatives
evaluated at x = 0 are successively: b1, 2b2, (2·3)b3, (2·3·4)b4, …, j! bj.
Then, we can write (4.72) as:

g⁽ⁱ⁾(0) = Σj=0..i [ i! / (j!(i − j)!) ] j! cj (i − j)! bi−j = i! Σj=0..i cj bi−j

and equating this to the ith derivative of Pm(x) evaluated at x = 0, that is, i! ai, gives:
ai = Σj=0..i cj bi−j, but since b0 = 1 we can finally write:

ai − Σj=0..i−1 cj bi−j = ci  for i = 1, …, k (k = m + n),    (4.73)

where we take the coefficients ai = 0 for i > m and bi = 0 for i > n.
Example
Consider the function f(x) = 1/√(1 − x). This function has a pole at x = 1 and polynomial
approximations will not perform well.
The Taylor coefficients of this function are given by:

cj = 1·3·5···(2j − 1) / (2^j j!)
So the first five are: c0 = 1, c1 = 1/2, c2 = 3/8, c3 = 5/16, c4 = 35/128, which gives a
polynomial of order 4.
From the equations above we can calculate the Padé coefficients. We choose: m = n =2, so
k = m + n = 4.
ai − Σj=0..i−1 cj bi−j = ci  for i = 1, …, k (k = m + n)
and in detail:
a1 − c0b1 = c1
a2 − c0b2 − c1b1 = c2
a3 − c0b3 − c1b2 − c2b1 = c3
a4 − c0b4 − c1b3 − c2b2 − c3b1 = c4
but for m = n = 2, we have a3 = a4 = b3 = b4 = 0 and the system can be written as:

a1 − c0b1 = c1
a2 − c0b2 − c1b1 = c2
− c1b2 − c2b1 = c3
− c2b2 − c3b1 = c4
and we can see that the last two equations can be solved for the coefficients bi. Re-writing
this system:

c2b1 + c1b2 = −c3
c3b1 + c2b2 = −c4
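Solving this 2×2 system with exact rational arithmetic reproduces the coefficients of the example (b1 = −5/4, b2 = 5/16, then a1 = −3/4, a2 = 1/16); a sketch:

```python
from fractions import Fraction as F

# Maclaurin coefficients of f(x) = 1/sqrt(1 - x): c_j = 1*3*...*(2j-1) / (2^j j!)
c = [F(1)]
for j in range(1, 5):
    c.append(c[-1] * F(2 * j - 1, 2 * j))

# Solve  c2*b1 + c1*b2 = -c3,  c3*b1 + c2*b2 = -c4  by Cramer's rule
det = c[2] * c[2] - c[1] * c[3]
b1 = (-c[3] * c[2] + c[1] * c[4]) / det
b2 = (-c[4] * c[2] + c[3] * c[3]) / det

# Numerator coefficients from a_i = sum_{j<=i} c_j b_{i-j}, with b0 = 1
a0 = c[0]
a1 = c[1] + c[0] * b1
a2 = c[2] + c[1] * b1 + c[0] * b2
```

Using Fractions keeps the whole calculation exact, so the result can be compared digit-for-digit with the quoted R22(x).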
In general, when n = m = k/2, as in this case, the matrix that defines the system will be of the
form:

[ r0     r1     r2    …  rn−1 ]
[ r−1    r0     r1    …  rn−2 ]
[ r−2    r−1    r0    …  rn−3 ]
[ …      …      …     …  …    ]
[ r−n+1  r−n+2  r−n+3 …  r0   ]

This is a special kind of matrix called a Toeplitz matrix, which has the same element along each
diagonal, so it is defined by a total of 2n+1 numbers. There are special methods for solving
systems with this structure efficiently (for example, the Levinson recursion).
Solving the system gives:

R22(x) = ( 1 − 0.75x + 0.0625x² ) / ( 1 − 1.25x + 0.3125x² )

Fig. 4.13 shows f(x) = 1/√(1 − x) (blue line), the Padé approximant (red line) and the Taylor
polynomial up to 4th order:

P(x) = c0 + c1x + c2x² + c3x³ + c4x⁴

It is clear that the Padé approximant gives a better fit, particularly closer to the pole.
For the successive approximants:

R11(x): 1.333
R22(x): 1.106
R33(x): 1.052
R44(x): 1.031

We can see that even R11(x), a ratio of linear functions of x, gives a better approximation than
the 4th order Taylor polynomial.
Exercise 4.7
Using the Taylor (Maclaurin) expansion of f(x) = cos x, truncated to order 4:

t(x) = Σk=0..4 ck x^k = 1 − x²/2 + x⁴/24
find the Padé approximant:
R22(x) = P2(x)/Q2(x) = ( a2x² + a1x + a0 ) / ( b2x² + b1x + b0 )
5. MATRIX COMPUTATIONS
We have seen earlier that a number of issues arise when we consider errors in the
calculations dealing with machine numbers. When matrices are involved, the problems of
accuracy of representation, error propagation and sensitivity of the solutions to small variations
in the data are much more important.
Before discussing any methods of solving matrix equations, we consider first the rather
fundamental matrix property of ‘condition number’.
‘Condition’ of a Matrix
We have seen that multiplying or dividing two floating-point numbers gives an error of the
order of the ‘last preserved bit’. If, say, two numbers are held to 8 decimal digits, the resulting
product (or quotient) will effectively have its least significant bit ‘truncated’ and therefore have
a relative uncertainty of ± 10–8.
By contrast, with matrices and vectors, multiplying (that is, evaluating y = Ax) or
‘dividing’ (that is, solving Ax = y for x) can lose in some cases ALL significant figures!
Before examining this problem we have to define matrix and vector norms.
VECTOR NORMS
If xT = (x1, x2, …, xn) is a real or complex vector, a general norm is denoted by ||x||N and is
defined by:

||x||N = ( Σi=1..n |xi|^N )^(1/N)    (5.1)

The most common is the 2-norm (Euclidean norm):

||x||2 = √( |x1|² + … + |xn|² )    (5.2)
Other norms are used, e.g. x 1 and x ∞ , the latter corresponding to the greatest in
magnitude |xi| (Show this as an exercise).
MATRIX NORM
If A is an n-by-n real or complex matrix, its norm is defined by:

||A||N = max over x ≠ 0 of ||Ax||N / ||x||N    (5.3)
Definition: The problem of finding x, satisfying Ax = y, is well posed or well conditioned if
small changes in the data (A or y) produce only correspondingly small changes in the solution x.
For a quantitative measure of “how well conditioned” a problem is, we need to estimate
the amount of variation in x when y varies and/or the variation in x when A changes slightly or
the corresponding changes in y when either (or both) x and A vary.
Suppose A is fixed, but y changes slightly to y + δy, with the associated x changing to
x + δx. We have:

Ax = y  and  A(x + δx) = y + δy    (5.4)

Subtracting gives:

A δx = δy  or  δx = A⁻¹δy    (5.5)
From the definition of the matrix norm (5.3), for any vector z:

||Az|| / ||z|| ≤ ||A||,  and so  ||Az|| ≤ ||A||·||z||    (5.6)
Taking the norm of both sides of (5.5) and using inequality (5.6) gives:

||δx|| = ||A⁻¹δy|| ≤ ||A⁻¹||·||δy||    (5.7)

Similarly, from (5.4):

||y|| = ||Ax|| ≤ ||A||·||x||    (5.8)
Finally, multiplying corresponding sides of (5.7) and (5.8) and dividing by ||x||·||y|| gives our
fundamental result:
||δx|| / ||x|| ≤ ||A||·||A⁻¹|| ||δy|| / ||y||    (5.9)
For any square matrix A we introduce its condition number, defined as:

cond(A) = ||A||·||A⁻¹||    (5.10)

The ratio ( ||δx||/||x|| ) / ( ||δy||/||y|| ) must lie between 1/cond(A) and cond(A).
Numerical Example
Here is an example using integers for total precision. Suppose:

A = [100  99]
    [ 99  98]

We then have:

Ax = y:  A (1000, −1000)T = (1000, 1000)T

Shifting x slightly gives:

A (1001, −999)T = (1199, 1197)T

Alternatively, shifting y slightly:

A (803, −801)T = (1001, 999)T

So a small change in y can cause a big change in x, or vice versa. We have this clear moral,
concerning any matrix multiplication or (effectively) inversion:

For given A, either multiplying (Ax) or 'dividing' (A⁻¹y) can be catastrophic, the degree
of catastrophe depending on cond(A) and on the 'direction' of change in x or y.

In the above example, cond(A) is about 4×10⁴ (in the infinity norm, 199 × 199 = 39601).
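The whole example can be verified with exact integer arithmetic; the sketch below also computes cond(A) in the infinity norm (maximum absolute row sum), one common choice of norm:

```python
A = [[100, 99], [99, 98]]

def matvec(M, v):
    # 2x2 matrix-vector product, kept in exact integer arithmetic
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

y0 = matvec(A, [1000, -1000])   # the 'reference' product
y1 = matvec(A, [1001, -999])    # x changed by ~0.1%: y changes by ~20%
y2 = matvec(A, [803, -801])
# det(A) = -1, so the inverse is also an integer matrix
Ainv = [[-98, 99], [99, -100]]
cond = max(sum(abs(e) for e in row) for row in A) \
     * max(sum(abs(e) for e in row) for row in Ainv)
```

Since det(A) = −1, both A and A⁻¹ have integer entries, so every result above is exact.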
Matrix Computations
We now consider methods for solving matrix equations. The most common problems
encountered are of the form:
Ax=y (5.11)
A x = λ x    (5.12)

A x = k² B x    (5.13)

where A (and B) are known n×n matrices and x and k² are unknown.
Usually B (and sometimes A) is positive definite (meaning that xTBx > 0 for all x ≠ 0).
A is sometimes complex (and also B), but numerically the difference is straightforward, and so
we will consider only real matrices.
* * * * 0 0
* * * * * 0
* * * * * *
* * * * * *
0 * * * * *
0 0 * * * *
Zeros and non-zeros in band matrix of semi-bandwidth 4
DIRECT where the solution emerges in a finite number of calculations (if we temporarily
ignore round-off error due to finite word-length).
INDIRECT or iterative, where a step-by-step procedure converges towards the correct solution.
Indirect methods can be particularly well suited to sparse matrices (especially when the order is
large) as they can often be implemented without the need to store the entire matrix A (or
intermediate forms of matrices) in high speed memory.
All the common direct routines are available in software libraries and in books and journals,
most commonly in Fortran or Fortran90/95, but also some in C.
Consider for example the system:

[1  4   7] [x1]   [1]
[2  5   8] [x2] = [1]    (5.14)
[3  6  11] [x3]   [1]

we subtract 2 times the first row from the second row, and then we subtract 3 times the first row
from the third row, to give:

[1   4   7] [x1]   [ 1]
[0  −3  −6] [x2] = [−1]    (5.15)
[0  −6 −10] [x3]   [−2]

and then subtracting 2 times the second row from the third row gives:

[1   4   7] [x1]   [ 1]
[0  −3  −6] [x2] = [−1]    (5.16)
[0   0   2] [x3]   [ 0]
The steps from (5.14) to (5.16) are termed ‘triangulation’ or ‘forward-elimination’. The
triangular form of the left-hand matrix of (5.16) is crucial; it allows the next steps.
The third row immediately gives:

x3 = 0    (5.17a)

and substitution into row 2 gives:

−3x2 − 6x3 = −1, so x2 = 1/3    (5.17b)

and finally row 1 gives x1 = 1 − 4x2 − 7x3 = −1/3.    (5.17c)

The steps through (5.17) are termed 'back-substitution'. We now ignore a complication of
‘pivoting’, a technique sometimes required to improve numerical stability and which consists of
changing the order of rows and columns.
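The triangulation and back-substitution steps for system (5.14) can be written out directly; a sketch without pivoting (adequate here, since no zero pivots appear):

```python
A = [[1.0, 4.0, 7.0],
     [2.0, 5.0, 8.0],
     [3.0, 6.0, 11.0]]
y = [1.0, 1.0, 1.0]
n = 3

# Forward elimination (triangulation)
for k in range(n):
    for r in range(k + 1, n):
        f = A[r][k] / A[k][k]
        for j in range(k, n):
            A[r][j] -= f * A[k][j]
        y[r] -= f * y[k]

# Back-substitution
x = [0.0] * n
for k in range(n - 1, -1, -1):
    x[k] = (y[k] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]

det = A[0][0] * A[1][1] * A[2][2]   # product of the diagonal of the triangular form
```

The result is x = (−1/3, 1/3, 0), and the diagonal product gives det(A) = −6.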
1. When performed on a dense matrix (or on a sparse matrix without taking advantage of
the zeros), computing time is proportional to n³ (n: order of the matrix). This means that
doubling the order of the matrix will increase computation time by up to 8 times!
2. The determinant comes immediately as the product of the diagonal elements of (5.16).
3. Algorithms that take advantage of the special 'band' and 'variable-band' structures are very
straightforward, requiring only changes to the limits of the loops when performing row or column
operations, and some 'book-keeping'. For example, in a matrix of 'semi-bandwidth' 4, the first
column has non-zero elements only in the first 4 rows, as in the figure. Then only those 4
numbers need storing, and only the 3 elements below the diagonal need 'eliminating' in the first
column.
4. Oddly, it turns out that, in our context, one should NEVER form the inverse matrix A⁻¹ in
order to solve Ax = y for x. Even if it needs doing for a number of different right-hand-side
vectors y, it is better to 'keep a record' of the triangular form of (5.16) and back-substitute as
necessary.
5. Other methods very similar to Gauss are due to Crout and Choleski. The latter is (only)
for use with symmetric matrices. Its advantage is that time AND storage are half that of the
orthodox Gauss.
6. There are variations in the implementation of the basic method developed to take
advantage of the special type of sparse matrices encountered in some cases, for example when
solving problems using finite differences or finite elements. One of these is the frontal method;
here, elimination takes place in carefully controlled manner, with intermediate results being kept
in backing-store. Another variation consists of only storing the ‘nonzero’ elements in the matrix,
at the expense of a great deal of ‘book-keeping’ and reordering (renumbering) rows and columns
through the process, in the search for the best compromise between numerical stability and fill-
in.
Solving the system Ax = y can be recast as: finding x to minimize the 'error-residual', a column
vector r defined as a function of x by:
The general idea of this kind of method is to search for the solution (minimum of the error
residual) in a multidimensional space (of the components of the vector x), starting from a point
x0 and choosing a direction to move. The optimum distance to move along that direction can then
be calculated.
In the steepest descent method, the simplest form of this approach, these directions are
chosen as those of the gradient of the error residual at each iteration point. Because of this,
consecutive directions will be mutually orthogonal, and then there will be no more than n
different directions. In 2D (see Fig. 5.1) this means that every time we will have to make a
change of direction at right angles to the previous one, but this will not always allow us to
reach the minimum, or at least not in an efficient way.
The norm ||r|| of this residual vector is an obvious choice for the quantity to minimise, or
better, ||r||², the square of the norm of the residual, which is non-negative, is only zero when
the error is zero, and requires no square roots to calculate. However, using (5.19) gives (if A
is symmetric: AT = A):

||r||² = (y − Ax)T(y − Ax) = xTAAx − 2xTAy + yTy    (5.20)
which is rather awkward to compute, because of the product AA. Another possible choice of
error functional (measure of the error), valid for the case where the matrix A is positive
definite and which is also minimised for the correct solution, is the functional h² = rTA⁻¹r
(instead of rTr as in (5.20)). This gives a simpler form:

h² = rTA⁻¹r = yTA⁻¹y − 2xTy + xTAx    (5.21)

or, because the first term in (5.21) is independent of the variables and will play no part in the
minimisation, we can drop it and we finally have:

h² = xTAx − 2xTy    (5.22)
The method proceeds by evaluating the error functional at a point x0, choosing a direction,
in this case, the direction of the gradient of the error functional, and finding the minimum value
of the functional along that line. That is, if p gives the direction, the line is defined by all the
points (x + α p), where α is a scalar parameter. The next step is to find the value of α that
minimizes the error. This gives the next point x1. (Since in this case, α is the only variable, it is
simple to calculate the gradient of the error functional as a function of α and finding the
corresponding minimum).
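A minimal steepest-descent sketch for a small symmetric positive-definite system, using the gradient direction p = r and the step length α = pᵀ(y − Ax)/(pᵀAp) that minimises the functional h² along the line (the matrix and right-hand side are made up for illustration):

```python
# Made-up small SPD system for illustration
A = [[4.0, 1.0],
     [1.0, 3.0]]
y = [1.0, 2.0]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [0.0, 0.0]
for _ in range(100):
    r = [yi - wi for yi, wi in zip(y, matvec(A, x))]   # residual r = y - Ax
    if dot(r, r) < 1e-24:
        break
    alpha = dot(r, r) / dot(r, matvec(A, r))           # optimal step along p = r
    x = [xi + alpha * ri for xi, ri in zip(x, r)]
```

The iterates converge to the exact solution x = (1/11, 7/11); for badly conditioned matrices this convergence would be much slower, which motivates the conjugate gradient variant below.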
Fig. 5.1
Several variations appear at this point. It would seem obvious that the best direction to
choose is that of the gradient (its negative, or downwards) and that is the choice taken in the
conventional "steepest-gradient" or "steepest descent" method. In this case the consecutive
directions are all perpendicular to each other, as illustrated in the figure above. However, as
mentioned earlier, this leads to poor convergence due to the discrete nature of the steps.
The more efficient and popular “Conjugate Gradient Method” looks for a direction which
is ‘A–orthogonal’ (or conjugate) to the previous one instead (pTAq = 0 instead of pTq = 0).
Exercise 5.1:
Show that the value of α that minimizes the error functional along the line (x + αp) in the two
cases mentioned above (using the squared norm of the residual as error functional, or the
functional h²), for a symmetric matrix A, is given respectively by:

α = pTA(y − Ax) / ||Ap||²  and  α = pT(y − Ax) / (pTAp)
A useful feature of this method (as can be observed from the expressions above) is that
reference to the matrix A is only via simple matrix products; for given values of the matrices A,
y, xi and pi, we need only form A times a vector (Ax or Api) and AT times a vector. These can
be formed from a given sparse A without unnecessary multiplication (of non-zeros) or storage.
A further advantage of the conjugate gradient method is that (in exact arithmetic) it is
guaranteed to converge in at most n steps, where n is the order of the matrix.
The new search direction at each step is chosen as:

pk = rk + β pk−1  where  β = (rkT rk) / (rk−1T rk−1)
More robust versions of the algorithm use a preliminary ‘pre-conditioning’ of the matrix
A, to alleviate the problem that the condition number of (5.20) is the square of the condition
number of A, – a serious problem if A is not ‘safely’ positive-definite. This leads to the popular
‘PCCG’ or Pre-Conditioned-Conjugate-Gradient algorithm, as a complete package and to many
more variations in implementation that can be found as commercial packages.
Suppose we had the vector x(0) = [x1, x2, x3](0), and substituted it into the right-hand side of
(5.24), to yield on the left-hand side the new vector x(1) = [x1, x2, x3](1). Successive substitutions
will give the sequence of vectors:
Because (5.24) is merely a rearrangement of the equations for solution, the ‘correct’
solution substituted into (5.24) must be self-consistent, i.e. yield itself! The sequence will:
either converge to the correct solution
or diverge.
x1         x2         x3
1          1.333333   1.133334
1.05       0.9472222  0.9888889
0.9895833  1.00544    1.000093
1.001337   0.9997463  1.000166
0.9998951  0.9999621  0.9999639
0.9999996  1.000012   1.000005
1.000002   0.9999981  0.9999996
0.9999996  1          1
1          1          1
1          1          1
Whether the algorithm converges or not depends on the matrix A and (surprisingly) not on
the right-hand-side vector y of (5.23). Convergence does not even depend on the ‘starting value’
of the vector, which only affects the necessary number of iterations.
We will skip over any formal proof of convergence. But to give the sharp criterion for
convergence, first one splits A as:

A = L + D + U

where L, D and U are the lower triangular, diagonal and upper triangular parts of A. Then the
schemes converge if-and-only-if all the eigenvalues of the iteration matrix (for example,
−D⁻¹(L + U) for the Jacobi scheme) have absolute value less than 1.
Choice of method
The choice of method will depend on the characteristics of the problem (type of the
matrices) and on the solution requirements. For example, most methods suitable for dense
matrices calculate all eigenvectors and eigenvalues of the system. However, in many problems
arising from the numerical solution of PDEs one is only interested in one or just a few
eigenvalues and/or eigenvectors. Also, in many cases the matrices will be large and sparse.
In what follows we will concentrate on methods which are suitable for sparse matrices (of
course the same methods can also be applied to dense matrices).
The problem to solve can have two different forms: the standard eigenvalue problem

Ax = λx    (5.25)

or the generalized eigenvalue problem

Ax = λBx    (5.26)
Sometimes the generalized eigenvalue problem can be converted into the form (5.25) simply
by premultiplying by the inverse of B:

B⁻¹Ax = λx
however, if A and B are symmetric, the new matrix on the left hand side (B⁻¹A) will have lost
this property. Instead, it is preferable to decompose (factorise) the matrix B in the form
B = LLT (Choleski factorisation, possible if B is positive definite). Substituting in (5.26) and
premultiplying by L⁻¹ will give:

L⁻¹Ax = λLTx

and since (LT)⁻¹LT = I, the identity matrix,

L⁻¹A(L⁻¹)T (LTx) = λ (LTx);  putting LTx = y and L⁻¹A(L⁻¹)T = Ã gives:

Ãy = λy

The matrix Ã is symmetric if A and B are symmetric, and the eigenvalues are not
modified. The eigenvectors x can be obtained from y simply by back-substitution. However, if
the matrices A and B are sparse, this method is not convenient because Ã will generally be
dense. In the case of sparse matrices, it is more convenient to solve the generalized problem
directly.
Solution Methods
We can again classify solution methods as direct or iterative. The direct methods give all
eigenvalues; we will not examine any of these in detail.
The basic iteration is:
1. Premultiply the trial vector by A.
2. Normalize the result.
3. Check convergence: if not converged, repeat from step 1; otherwise stop.

The normalization step is necessary because otherwise the iteration vector can grow
indefinitely in length over the iterations.
How does this algorithm work?
If φi, i = 1, …, N are the eigenvectors of A, we can write any vector of N components as a
superposition of them (they constitute a base in the space of N dimensions). In particular, for
the starting vector:

x0 = Σi=1..N αi φi    (5.27)
The first iteration gives:

x̃1 = A Σi=1..N αi φi = Σi=1..N αi Aφi = Σi=1..N αi λi φi    (5.28)

If λ1 is the eigenvalue of largest absolute value, we can also write this as:

x̃1 = λ1 Σi=1..N αi (λi/λ1) φi

This is then normalized by: x1 = x̃1 / ||x̃1||. After n iterations:

xn = C Σi=1..N αi (λi/λ1)ⁿ φi    (5.29)
This method (the power method) finds the dominant eigenvector, that is, the eigenvector
that corresponds to the largest eigenvalue (in absolute value). To find the eigenvalue, we can
see that pre-multiplying (5.25) by the transpose of φi will give:

λi = (φiT A φi) / (φiT φi)    (5.30)
This is known as the Rayleigh quotient and can be used to obtain the eigenvalue from the
eigenvector. This expression has interesting properties; if we only know an estimate of the
eigenvector, (5.30) will give an estimate of the eigenvalue with a higher order of accuracy than
that of the eigenvector itself.
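A sketch of the power method with the Rayleigh quotient (5.30), applied to the 7×7 tridiagonal matrix used in the exercises (−4 on the diagonal, 1 on the subdiagonals). For this matrix the dominant eigenvalue is known in closed form, −4 − 2cos(π/8) ≈ −5.8478, which makes the result easy to check:

```python
import math

n = 7
# Tridiagonal matrix: -4 on the diagonal, 1 on the sub/super-diagonals
A = [[-4.0 if i == j else (1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [1.0] * n
lam_old = 0.0
for _ in range(500):
    w = matvec(A, x)
    lam = dot(x, w) / dot(x, x)        # Rayleigh quotient estimate, as in (5.30)
    nrm = math.sqrt(dot(w, w))
    x = [wi / nrm for wi in w]         # normalization step
    if abs(lam - lam_old) < 1e-12:
        break
    lam_old = lam
```

Note that since the dominant eigenvalue is negative, the normalized vector flips sign at each step, but the Rayleigh quotient is unaffected and converges smoothly.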
Exercise 5.2
Write a short program to calculate the dominant eigenvector and the corresponding
eigenvalue of a matrix like that in (7.14) but of order 7, using the power method. Terminate
iterations when the relative difference between two successive estimates of the eigenvalue is
within a tolerance of 10⁻⁶.
Inverse Iteration
For the system Ax = λx we can write: A⁻¹x = (1/λ)x, from where we can see that the
eigenvalues of A⁻¹ are 1/λ, the reciprocals of the eigenvalues of A, and the eigenvectors are the
same. In this form, if we are interested in finding the eigenvector of A corresponding to the
smallest eigenvalue in absolute value (closest to zero), we can notice that for that eigenvalue λ,
its reciprocal is the largest, and so it can be found using the power method on the matrix A⁻¹.
Exercise 5.4
Write a program and calculate by inverse iteration the smallest eigenvalue of the
tridiagonal matrix A of order 7 where the elements are –4 in the main diagonal and 1 in the
subdiagonals. You can use the algorithms given in section 1 of the Appendix for the solution of
the linear system of equations.
An important extension of the inverse iteration method allows finding any eigenvalue of the spectrum (the spectrum of a matrix is its set of eigenvalues), not just the eigenvalue of smallest absolute value.
Defining the shifted matrix Ã = A − σI for some real number σ, we have:

Ãx = Ax − σx = λx − σx = (λ − σ)x    (5.31)

and we can see that the matrix Ã has the same eigenvectors as A and its eigenvalues are {λ − σ}; that is, the same eigenvalues as A but shifted by σ. Then, if we apply the inverse iteration method to the matrix Ã, the procedure will yield the eigenvalue (λ_i − σ) closest to zero; that is, we can find the eigenvalue λ_i of A closest to the real number σ.
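A sketch of shifted inverse iteration in Python; with σ = 0 it reduces to plain inverse iteration, and the test matrix is the tridiagonal matrix of Exercise 5.4 below:

```python
import numpy as np

def shifted_inverse_iteration(A, sigma=0.0, tol=1e-10, max_iter=1000):
    """Eigenvalue of A closest to sigma, via the power method on (A - sigma*I)^-1."""
    n = A.shape[0]
    B = A - sigma * np.eye(n)          # shifted matrix A~ = A - sigma*I
    x = np.ones(n)
    lam_old = None
    for _ in range(max_iter):
        x = np.linalg.solve(B, x)      # one power-method step on B^-1
        x = x / np.linalg.norm(x)
        lam = (x @ A @ x) / (x @ x)    # Rayleigh quotient on the original A
        if lam_old is not None and abs(lam - lam_old) <= tol:
            break
        lam_old = lam
    return lam, x

# tridiagonal matrix of order 7: -4 on the main diagonal, 1 in the subdiagonals
A = np.diag([-4.0] * 7) + np.diag([1.0] * 6, 1) + np.diag([1.0] * 6, -1)
lam, x = shifted_inverse_iteration(A, sigma=0.0)   # eigenvalue closest to 0
```

Solving the linear system at each step (rather than forming A^{-1} explicitly) is what makes the method practical for large sparse matrices.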
Rayleigh Iteration
Another extension of this method is known as Rayleigh iteration. In this case, the
Rayleigh quotient is used to calculate an estimate of the eigenvalue at each iteration and the shift
is updated using this value.
Since the convergence rate of the power method depends on the relative size of the coefficients of each eigenvector in the expansion of the trial vector during the iterations (as in (5.27)), and these are governed by the ratios λ_i/λ_1, convergence will be fastest when these ratios are smallest in magnitude, as we can see from (5.29). The same reasoning applied to the shifted inverse iteration method leads to the conclusion that the convergence rate will be fastest when the shift is chosen as close as possible to the target eigenvalue. In this form, the Rayleigh iteration has faster convergence than the ordinary shifted inverse iteration.
Exercise 5.5
Write a program using shifted inverse iteration to find the eigenvalue of the matrix A of
the previous exercise which lies closest to 3.5.
Then, write a modified version of this program using Rayleigh quotient update of the shift
in every iteration (Rayleigh iteration). Compare the convergence of both procedures (by the number of iterations needed to achieve the same tolerance for the relative difference between successive estimates of the eigenvalue). Use a tolerance of 10^-6 in both programs.
f'(a) = lim_{h→0} [f(a + h) − f(a)]/h    (6.1)

This suggests that, choosing a small value for h, the derivative can be reasonably approximated by the forward difference:

f'(a) ≈ [f(a + h) − f(a)]/h    (6.2)

or, equally, by the backward difference:

f'(a) ≈ [f(a) − f(a − h)]/h    (6.3)
Fig. 6.1 (points x_a, x_c, x_b)
We can understand this better analysing the error in each approximation by the use of
Taylor expansions.
Considering the expansions for the points a+h and a−h:

f(a + h) = f(a) + f'(a)h + [f⁽²⁾(a)/2!]h² + [f⁽³⁾(a)/3!]h³ + O(h⁴)    (6.4)

f(a − h) = f(a) − f'(a)h + [f⁽²⁾(a)/2!]h² − [f⁽³⁾(a)/3!]h³ + O(h⁴)    (6.5)
where the symbol O(h4) means: “a term of the order of h4 ”
From (6.4):

f'(a) = [f(a + h) − f(a)]/h + O(h)    (6.6)
and we can see that the error of this approximation is of the order of h. A similar result is
obtained for the backward difference.
We can also see that subtracting (6.4) and (6.5) and discarding terms of order of h3 and
higher we can obtain a better approximation:
f(a + h) − f(a − h) = 2f'(a)h + O(h³)

f'(a) = [f(a + h) − f(a − h)]/(2h) + O(h²)    (6.7)
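The difference in error order is easy to check numerically; a quick illustration (the function sin x at a = 1 is our own example):

```python
import math

def forward_diff(f, a, h):
    # forward difference (6.2): error O(h)
    return (f(a + h) - f(a)) / h

def central_diff(f, a, h):
    # central difference (6.7): error O(h^2)
    return (f(a + h) - f(a - h)) / (2.0 * h)

a = 1.0
exact = math.cos(a)                                      # d/dx sin(x) = cos(x)
err_fwd = abs(forward_diff(math.sin, a, 1e-3) - exact)   # roughly (h/2)|f''|
err_cen = abs(central_diff(math.sin, a, 1e-3) - exact)   # roughly (h^2/6)|f'''|
```

With h = 10^-3 the central difference error is several orders of magnitude smaller than the forward difference error, as the O(h) and O(h²) terms predict.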
Example:
Considering the 3 points x0, x1 = x0 + h and x2 = x0 + 2h and taking the Taylor expansions at x1
and x2 we can construct a three-point forward difference formula:
f(x_1) = f(x_0) + f'(x_0)h + [f⁽²⁾(x_0)/2!]h² + O(h³)    (6.8)

f(x_2) = f(x_0) + 2f'(x_0)h + [f⁽²⁾(x_0)/2!]4h² + O(h³)    (6.9)

Multiplying (6.8) by 4 and subtracting (6.9) to eliminate the second derivative term:

4f(x_1) − f(x_2) = 3f(x_0) + 2f'(x_0)h + O(h³)

from where we can extract for the first derivative:

f'(x_0) = [−3f(x_0) + 4f(x_1) − f(x_2)]/(2h) + O(h²)    (6.10)
Exercise 6.1
Considering the 3 points x0, x1 = x0 − h and x2 = x0 − 2h and the Taylor expansions at x1 and x2
find a three-point backward difference formula. What is the order of the error?
Exercise 6.2
Using the Taylor expansions for f(a+h) and f(a–h) show that a suitable formula for the
second derivative is:
f⁽²⁾(a) ≈ [f(a − h) − 2f(a) + f(a + h)]/h²    (6.11)
Exercise 6.3
Use the Taylor expansions for f(a+h), f(a–h), f(a+2h) and f(a–2h) to show that the
following are formulae for f'(a) and f⁽²⁾(a), and that both have an error of the order of h⁴:

f'(a) ≈ [f(a − 2h) − 8f(a − h) + 8f(a + h) − f(a + 2h)]/(12h)

f⁽²⁾(a) ≈ [−f(a − 2h) + 16f(a − h) − 30f(a) + 16f(a + h) − f(a + 2h)]/(12h²)
Expressions for the derivatives can also be found using other methods. For example, if the
function is interpolated with a polynomial using, say, n points, the derivative (first, second, etc)
can be estimated by calculating the derivative of the interpolating polynomial at the point of
interest.
Example:
Considering the 3 points x1, x2 and x3 with x1 < x2 < x3 (this time, not necessarily equally spaced)
and respective function values y1, y2 and y3, we can use the Lagrange interpolation polynomial to
approximate y(x):
f ( x) ≈ L( x) = L1 ( x) y1 + L2 ( x) y2 + L3 ( x) y3 (6.12)
where:
L_1(x) = [(x − x_2)(x − x_3)]/[(x_1 − x_2)(x_1 − x_3)], L_2(x) = [(x − x_1)(x − x_3)]/[(x_2 − x_1)(x_2 − x_3)] and L_3(x) = [(x − x_1)(x − x_2)]/[(x_3 − x_1)(x_3 − x_2)]    (6.13)
Exercise 6.4
Show that if the points are equally spaced by the distance h in (6.12) and the expression is
evaluated at x1, x2 and x3, the expression reduces respectively to the 3-point forward difference
formula (6.10), the central difference and the 3-point backward difference formulae.
Naturally, many more expressions can be developed using more points and/or different methods.
Exercise 6.5
Derive expressions iii) and iv) above. What is the order of the error for each of the 4 expressions
above?
Partial Derivatives
For a function of two variables f(x, y), the partial derivative ∂f(x,y)/∂x is defined as:

∂f(x,y)/∂x = lim_{h→0} [f(x + h, y) − f(x, y)]/h

which, clearly, is a function of y. Again, we can approximate this expression by a difference, assuming that h is sufficiently small. Then, for example, a central difference expression for ∂f(x,y)/∂x is:

∂f(x,y)/∂x ≈ [f(x + h, y) − f(x − h, y)]/(2h)    (6.15)
Similarly,

∂f(x,y)/∂y ≈ [f(x, y + h) − f(x, y − h)]/(2h)

In this form, the gradient of f is given by:

∇f(x,y) = [∂f(x,y)/∂x] x̂ + [∂f(x,y)/∂y] ŷ ≈ (1/2h)[(f(x + h, y) − f(x − h, y)) x̂ + (f(x, y + h) − f(x, y − h)) ŷ]
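These formulas translate directly into code; the test function x²y is our own example:

```python
def grad(f, x, y, h=1e-5):
    """Central-difference approximation to the gradient of f(x, y)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return dfdx, dfdy

# f(x, y) = x^2 * y has gradient (2xy, x^2); at (2, 3) this is (12, 4)
gx, gy = grad(lambda x, y: x * x * y, 2.0, 3.0)
```

For this quadratic test function the central difference is essentially exact, since the O(h²) error term involves the third derivative.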
Exercise 6.6
Using central difference formulae for the second derivatives derive an expression for the
Laplacian ( ∇ 2 f ) of a scalar function f.
Numerical Integration
In general, numerical integration methods approximate the definite integral of a function f by a
weighted sum of function values at several points in the interval of integration. In general these
methods are called “quadrature” and there are different methods to choose the points and the
weights.
Trapezoid Rule
The simplest method to approximate the integral of a function is the trapezoid rule. In this case,
the interval of integration is divided into a number of subintervals on which the function is
simply approximated by a straight line as shown in the figure below.
The integral (area under the curve) is then approximated by the sum of the areas of the trapezoids based on each subinterval. The area of the trapezoid with base in the interval [x_i, x_{i+1}] is:

(x_{i+1} − x_i)(f_i + f_{i+1})/2

so that:

∫_{x_1}^{x_n} f(x)dx ≈ (1/2) ∑_{i=1}^{n−1} h_i (f_i + f_{i+1})    (6.16)

If all the subintervals are of the same width h (the points are equally spaced), (6.16) reduces to:

∫_{x_1}^{x_n} f(x)dx ≈ h [ (f_1 + f_n)/2 + ∑_{i=2}^{n−1} f_i ]    (6.17)
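Expression (6.17) is a one-line loop in code; as a check, the integral of 1/x from 1 to 2 should approach ln 2:

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule (6.17) with n equal subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))      # the end points carry weight 1/2
    for i in range(1, n):
        s += f(a + i * h)        # interior points carry weight 1
    return h * s

approx = trapezoid(lambda x: 1.0 / x, 1.0, 2.0, 100)   # exact value is ln 2
```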
Exercise 6.7
Using the trapezoid rule:
(a) Calculate the integral of exp(−x²) between 0 and 2.
(b) Calculate the integral of 1/x between 1 and 2.
In both cases use 10 and 100 subintervals (11 and 101 points respectively).
It can be shown that the error incurred with the application of the trapezoid rule in one interval is
given by the term:
E = −[(b − a)³/12] f⁽²⁾(ξ)    (6.18)
where ξ is a point inside the interval and the error is defined as the difference between the exact
integral (I) and the area of the trapezoid (A): E = I − A.
If this is applied to the Trapezoid rule using a number of subintervals, the error term changes to:
E = −[(b − a)h²/12] f⁽²⁾(ξ_h)    (6.19)
where now ξ_h is a point in the complete interval [a, b] and depends on h. Considering the error as the sum of the individual errors in each subinterval, we can write this term in the form:
E = −(h³/12) ∑_{i=1}^{n} f⁽²⁾(ξ_i) = −(h²/12) ∑_{i=1}^{n} h f⁽²⁾(ξ_i)    (6.20)
where the ξ_i are points in each subinterval. Expression (6.20), in the limit when n → ∞ and
h → 0 corresponds to the integral of f(2) over the interval [a, b]. Then, we can write (6.20) as:
E ≈ −(h²/12)(f'(b) − f'(a))    (6.21)
Two options are open now: we can use this term to estimate the error incurred (or, equivalently, to determine the number of equally spaced points required for a desired precision), or we can include this term in the calculation to form a corrected form of the trapezoid rule:

∫_a^b f(x)dx ≈ h [ (f_1 + f_n)/2 + ∑_{i=2}^{n−1} f_i ] − (h²/12)(f'(b) − f'(a))    (6.22)
Simpson’s Rule
In the case of the trapezoid rule, the function is approximated by a straight line and this can be
done repeatedly by subdividing the interval. A higher degree of accuracy using the same
number of subintervals can be obtained with a better approximation than the straight line. For
example, choosing a quadratic approximation could give a better result. This is Simpson's rule.
Consider the function f(x) and the interval [a, b]. Defining the points x0, x1 and x2, as:
x_0 = a, x_1 = (a + b)/2, x_2 = b, and defining h = (b − a)/2
Using Lagrange interpolation to generate a second order polynomial approximation to f(x) gives
as in (6.12):
f ( x) ≈ f ( x0 ) L0 ( x) + f ( x1 ) L1 ( x) + f ( x2 ) L2 ( x)
where:
L_0(x) = [(x − x_1)(x − x_2)]/[(x_0 − x_1)(x_0 − x_2)] = (1/2h²)(x − x_1)(x − x_2)

L_1(x) = [(x − x_0)(x − x_2)]/[(x_1 − x_0)(x_1 − x_2)] = −(1/h²)(x − x_0)(x − x_2)

and L_2(x) = [(x − x_0)(x − x_1)]/[(x_2 − x_0)(x_2 − x_1)] = (1/2h²)(x − x_0)(x − x_1)
Then,

∫_a^b f(x)dx ≈ f(x_0) ∫_a^{a+2h} L_0(x)dx + f(x_1) ∫_a^{a+2h} L_1(x)dx + f(x_2) ∫_a^{a+2h} L_2(x)dx

With the substitution t = x − x_1:

∫_a^{a+2h} L_0(x)dx = (1/2h²) ∫_{−h}^{h} t(t − h)dt = (1/2h²) [t³/3 − t²h/2]_{−h}^{h} = h/3

Similarly, ∫_a^{a+2h} L_1(x)dx = 4h/3 and ∫_a^{a+2h} L_2(x)dx = h/3, so that:

∫_a^b f(x)dx ≈ (h/3)(f(a) + 4f(a + h) + f(a + 2h))    (6.24)
Example

Use Simpson's rule to calculate the integral:

∫_{0.25}^{1.25} (sin πx + 0.5)dx

We have x_0 = a = 0.25 and x_2 = x_0 + 2h = b = 1.25, so h = 0.5 and x_1 = x_0 + h = 0.75.

The exact value of this integral is:

∫_{0.25}^{1.25} (sin πx + 0.5)dx = 0.950158158

Applying Simpson's rule to this function gives:

∫_{0.25}^{1.25} (sin πx + 0.5)dx ≈ (h/3)(f(a) + 4f(a + h) + f(a + 2h)) = 0.97140452

Fig. 6.3 shows the function and the 2nd order Lagrange interpolation used to calculate the integral with Simpson's rule.
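The single-interval rule (6.24) reproduces the value quoted in the example:

```python
import math

def simpson(f, a, b):
    """Simpson's rule (6.24) on a single interval [a, b]."""
    h = (b - a) / 2.0
    return (h / 3.0) * (f(a) + 4.0 * f(a + h) + f(b))

f = lambda x: math.sin(math.pi * x) + 0.5
approx = simpson(f, 0.25, 1.25)        # 0.97140452, as in the example
```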
As with the trapezoid rule, higher accuracy can be obtained by subdividing the interval of integration and adding the results of the integrals over each subinterval.
Exercise 6.8
Write down the expression for the composite Simpson's rule using n subintervals and use it to calculate the integral ∫_{0.25}^{1.25} (sin πx + 0.5)dx of the example above using 10 subintervals.
Midpoint Rule

Another simple quadrature rule approximates the integral over [a, b] by (b − a)f(c), the area of the rectangle based on the value at the midpoint c = (a + b)/2. Since the linear term of the Taylor expansion of f about c integrates to zero, the error reduces to the second order term:

E = (1/2) f⁽²⁾(ξ) ∫_a^b (x − c)²dx = (1/6) f⁽²⁾(ξ) (x − c)³ |_a^b = (1/6) f⁽²⁾(ξ) [(b − c)³ − (a − c)³]

but since c = a + h and c = b − h, where h = (b − a)/2, we have (b − c)³ − (a − c)³ = (b − a)³/4 and the error is:

E = (1/24)(b − a)³ f⁽²⁾(ξ)    (6.27)

which is half of the estimate for the single interval trapezoid rule.
As with the previous rules, the interval of integration can be divided into subintervals to achieve higher precision. In that case, the expression for the integral becomes:
∫_a^b f(x)dx = h ∑_{i=1}^{N} f(c_i) + [h²(b − a)/24] f⁽²⁾(η),  h = (b − a)/N    (6.29)

where the c_i are the midpoints of each of the N subintervals and η is a point between a and b.
Exercise 6.9
Use the Midpoint Rule to calculate the integral ∫_{0.25}^{1.25} (sin πx + 0.5)dx using 2 and 10 subintervals. Compare the result with that of Simpson's rule in the example above.
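A sketch of the composite midpoint rule (6.29), with the error term omitted, applied to the same integral:

```python
import math

def midpoint(f, a, b, n):
    """Composite midpoint rule: h times the sum of f at subinterval midpoints."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

f = lambda x: math.sin(math.pi * x) + 0.5
approx = midpoint(f, 0.25, 1.25, 10)   # exact value is 0.950158158
```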
Gaussian Quadrature
In the trapezoid, Simpson's and midpoint rules, the definite integral of a function f(x) is approximated by the exact integral of a polynomial that approximates the function. In all these cases, the evaluation points are chosen arbitrarily, often equally spaced. However, it is rather clear that the precision attained depends on the position of these points, giving then another route to optimisation. The weighting coefficients are then determined by the choice of method.
Considering again the general approach at the approximation of the definite integral, the problem
can be written in the form:
G_n(f) = ∑_{i=1}^{n} w_i^n f(x_i^n) ≈ ∫_{−1}^{1} f(x)dx    (6.30)
(The interval of integration is here chosen as [ −1,1], but obviously, any other interval can be
mapped into this by a change of variable.)
The objective now is to find, for a given n, the best choice of evaluation points x_i^n (called here "Gauss points") and weights w_i^n to get maximum precision in the approximation. Compared to the criterion for the trapezoid and Simpson's rules, this is equivalent to trying to find, for a fixed n, the choice of x_i^n and w_i^n for which the approximation (6.30) is exact for a polynomial of degree N, with N (> n) as large as possible. (That is, we go beyond the degree of approximation of the previous rules.)
This is equivalent to saying:

∫_{−1}^{1} x^k dx = ∑_{i=1}^{n} w_i^n (x_i^n)^k  for k = 0, 1, 2, …, N    (6.31)
with N as large as possible (note the equal sign now). We can simplify the problem to these
functions now because any polynomial is just a superposition of terms of the form x k , so if the
integral is exact for each of them, it will be for any polynomial containing those terms.
Expression (6.31) is a system of equations that the (unknown) Gauss points and weights need to satisfy. The problem, then, is to find the x_i^n and w_i^n (n of each). This is a nonlinear problem that cannot be solved directly. We also have to determine the value of N. It can be shown that this number is N = 2n − 1. This is also rather natural, since (6.31) consists of N + 1 equations and we need to find 2n parameters.
Consider the Lagrange interpolation polynomials defined over the n Gauss points:

L_j^n(x) = ∏_{k=1, k≠j}^{n} (x − x_k^n)/(x_j^n − x_k^n)    (6.32)

Then, since the expression (6.31) should be exact for any polynomial up to degree N = 2n − 1, and L_j^n(x) is of degree n − 1, we have:
∫_{−1}^{1} L_j^n(x)dx = ∑_{i=1}^{n} w_i^n L_j^n(x_i^n)    (6.33)

but since the L_j^n(x) are interpolation polynomials, L_j^n(x_i^n) = δ_ij (that is, they are 1 if i = j and 0 otherwise), all the terms in the sum in (6.33) are zero except for i = j and we have:

∫_{−1}^{1} L_i^n(x)dx = w_i^n    (6.34)
With this, we have the weights for a given set of Gauss points. We have to find now the best
choice for these.
If P(x) is an arbitrary polynomial of degree ≤ 2n − 1, we can write P(x) = P_n(x)Q(x) + R(x), where Q and R are respectively the quotient and remainder polynomials of the division of P by P_n. P_n(x) is of degree n and then Q and R are of degree n − 1 or less. Then we have:

∫_{−1}^{1} P(x)dx = ∫_{−1}^{1} P_n(x)Q(x)dx + ∫_{−1}^{1} R(x)dx    (6.35)
If we now define the polynomial Pn(x) by its roots and choose these as the Gauss points:
P_n(x) = ∏_{i=1}^{n} (x − x_i^n)    (6.36)
then, the integral of the product Pn(x) Q(x), which is a polynomial of degree ≤ 2n − 1, must be
given exactly by the quadrature expression:
∫_{−1}^{1} P_n(x)Q(x)dx = ∑_{i=1}^{n} w_i^n P_n(x_i^n)Q(x_i^n)    (6.37)
but since the Gauss points are the roots of P_n(x), (6.37) must be zero; that is:

∫_{−1}^{1} P_n(x)Q(x)dx = 0    (6.38)

Going back now to the integral of the arbitrary polynomial P(x) of degree ≤ 2n − 1, and equation (6.35), we have that if we choose P_n(x) as in (6.36), (6.38) is satisfied and then (6.35) reduces to:

∫_{−1}^{1} P(x)dx = ∫_{−1}^{1} R(x)dx    (6.39)
but since R(x) is of degree n − 1 or less, the interpolation using Lagrange polynomials for the n
points will give the exact representation of R (see Exercise 4.2). That is:
R(x) = ∑_{i=1}^{n} R(x_i^n) L_i^n(x)  exactly.    (6.40)
Then,

∫_{−1}^{1} R(x)dx = ∑_{i=1}^{n} R(x_i^n) ∫_{−1}^{1} L_i^n(x)dx
but we have seen before, in (6.34), that the integral of the Lagrange interpolation polynomial for point i is the value of w_i^n, so:

∫_{−1}^{1} R(x)dx = ∑_{i=1}^{n} w_i^n R(x_i^n)
Now, since P(x) = Pn(x) Q(x) + R(x) and Pn ( xin ) = 0 (see 6.36), P ( xin ) = R ( xin ) and from (6.39):
∫_{−1}^{1} P(x)dx = ∑_{i=1}^{n} w_i^n P(x_i^n)    (6.41)
which tells us that the integral of an arbitrary polynomial P(x) of degree ≤ 2n − 1 can be calculated exactly using the set of Gauss points x_i^n (the zeros of the nth order Legendre polynomial) and the weights w_i^n determined by (6.34) (the integral over the interval [−1, 1] of the Lagrange polynomial corresponding to the Gauss point x_i^n).

Back to the start then: we now have a set of n Gauss points and weights that yield the exact evaluation of the integral of a polynomial up to degree 2n − 1. These should also give a very good approximation to the integral of an arbitrary function f:
∫_{−1}^{1} f(x)dx ≈ ∑_{i=1}^{n} w_i^n f(x_i^n)    (6.42)

Remember that orthogonal polynomials on [−1, 1] have all their roots in [−1, 1], and satisfy:

∫_{−1}^{1} P_n(x)P_m(x)dx = 0  for n ≠ m
Gauss nodes and weights for different orders are given in the following table.
Example
For the integral ∫_{−1}^{1} e^{−x} sin 5x dx, the results of the calculation using Gauss quadrature are listed in the table:
n Integral Error %
2 −0.307533965529
3 0.634074857001
4 0.172538331616 28.71
5 0.247736352452 −2.35
6 0.241785750244 0.10
The error is also calculated compared with the exact value: 0.24203832101745.
The results with few Gauss points are not very good because the function varies strongly in
the interval of integration. However, for n = 6, the error is very small.
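The table can be reproduced with NumPy's Gauss-Legendre nodes and weights (numpy.polynomial.legendre.leggauss):

```python
import numpy as np

def gauss_quad(f, n):
    """n-point Gauss-Legendre quadrature on [-1, 1]."""
    x, w = np.polynomial.legendre.leggauss(n)   # Gauss points and weights
    return float(np.sum(w * f(x)))

f = lambda x: np.exp(-x) * np.sin(5.0 * x)
for n in (2, 3, 4, 5, 6):
    approx = gauss_quad(f, n)          # compare with exact 0.24203832101745
```

As derived above, with n points the rule is exact for polynomials up to degree 2n − 1; for instance, gauss_quad(lambda x: x**2, 2) gives 2/3 up to rounding.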
Exercise 6.10
Compare the results of the example above with those of Simpson's and the trapezoid rules for n subintervals.
Change of Variable
The above procedure was developed for definite integrals over the interval [−1, 1]. Integrals
over other intervals can be calculated after a change of variables. For example if the integral to
calculate is:
∫_a^b f(t)dt

the change from the interval [a, b] to [−1, 1] can be made with the change of variable t = [(b − a)x + (a + b)]/2, which gives:

∫_a^b f(t)dt ≈ [(b − a)/2] ∑_{i=1}^{n} w_i^n f( [(b − a)/2] x_i^n + (a + b)/2 )    (6.43)
Exercise 6.11
Use Gaussian quadrature to calculate: ∫_{0.25}^{1.25} (sin πx + 0.5)dx
Lf = g (7.1)
where L is a linear differential operator and g(x) is a known function. The problem is to find the
function f(x) satisfying equation (7.1) over a given region (interval) [a, b] and subjected to some
boundary conditions at a and b.
The basic idea is to substitute the derivatives by appropriate difference formulae like those
seen in section 6. This will convert the problem into a system of algebraic equations.
In order to apply the difference approximations systematically, we proceed as follows:
–– First, we divide the interval [a, b] into N equal subintervals of length h : h = (b–a)/N,
defining the points xi : xi = a + ih
–– Next, we approximate all derivatives in the operator L by appropriate difference formulae
(h must be sufficiently small – N large – to do this accurately).
–– Finally, we formulate the corresponding difference equation at each point xi. This will
generate a linear system of N–1 equations on the N–1 unknown values of fi = f(xi).
Example:
If we take Lf = g to be a general second order differential equation,

c f''(x) + d f'(x) + e f(x) = g(x),

we can use the central difference approximations

f_i'' ≈ (f_{i−1} − 2f_i + f_{i+1})/h²  and  f_i' ≈ (f_{i+1} − f_{i−1})/(2h)    (7.4)

to obtain the difference equation at each interior point x_i:

c (f_{i−1} − 2f_i + f_{i+1})/h² + d (f_{i+1} − f_{i−1})/(2h) + e f_i = g_i    (7.5)
or:

(c − dh/2) f_{i−1} + (−2c + eh²) f_i + (c + dh/2) f_{i+1} = g_i h²    (7.6)

for all i except i = 1 and N−1, where for i = 1: f_{i−1} = f(a), and for i = N−1: f_{i+1} = f(b).

This can be written as a matrix problem of the form A f = g, where f = {f_i} and g = {h² g_i} are vectors of order N−1. The matrix A has only 3 elements per row:

a_{i,i−1} = c − dh/2,  a_{i,i} = −2c + eh²,  a_{i,i+1} = c + dh/2
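A sketch of the whole procedure for Dirichlet conditions; the test equation f'' = −π² sin πx with f(0) = f(1) = 0 (exact solution sin πx) is our own choice:

```python
import numpy as np

def fd_bvp(c, d, e, g, a, b, fa, fb, N):
    """Solve c f'' + d f' + e f = g on [a, b] with f(a)=fa, f(b)=fb, per (7.6)."""
    h = (b - a) / N
    x = a + h * np.arange(1, N)                  # interior points x_1 .. x_{N-1}
    lo, di, up = c - d * h / 2, -2 * c + e * h * h, c + d * h / 2
    A = (np.diag([di] * (N - 1)) + np.diag([up] * (N - 2), 1)
         + np.diag([lo] * (N - 2), -1))
    rhs = h * h * g(x)
    rhs[0] -= lo * fa                            # known boundary values move
    rhs[-1] -= up * fb                           # to the right-hand side
    return x, np.linalg.solve(A, rhs)

x, f = fd_bvp(1.0, 0.0, 0.0, lambda x: -np.pi**2 * np.sin(np.pi * x),
              0.0, 1.0, 0.0, 0.0, 50)           # exact solution: sin(pi x)
```

Since the scheme uses central differences throughout, the error decreases as O(h²) when N is increased.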
Exercise 7.1
Formulate the algorithm to solve a general second order differential equation over the
interval [a, b], with Dirichlet boundary conditions at a and b (known values of f at a and b).
Write a short computer program to implement it and use it to solve:
Example:
Referring to the figure below, the problem consists of finding the potential distribution
between the inner and outer conducting surfaces of square cross-section when a fixed voltage is
applied between them. The equation describing this problem is the Laplace equation:
∇²φ = 0, with the boundary conditions φ = 0 on the outer conductor and φ = 1 on the inner conductor.

To approximate

∇²φ = ∂²φ/∂x² + ∂²φ/∂y²    (7.7)
we can choose for convenience the same spacing h for x and y. Ignoring the symmetry of the
problem, the whole cross-section is discretized (as in Fig. 8.1). With this choice, there are 56
free nodes (for which the potential is not known).
Fig. 8.1 Finite difference mesh for solution of the electrostatic field in the square coaxial line (φ = 0 on the outer conductor, φ = 1 on the inner conductor; the 56 free nodes are numbered 1 to 56).
On this mesh, we only consider the unknown internal nodal values and only those nodes are numbered. An internal point of the grid, x_i, labelled O, is surrounded by four others, which for convenience are labelled N, S, E and W. For this point we can approximate the derivatives in (7.7) by:

∂²φ/∂x² ≈ (1/h²)(φ_W − 2φ_O + φ_E)  and  ∂²φ/∂y² ≈ (1/h²)(φ_N − 2φ_O + φ_S)    (7.8)

so that ∇²φ ≈ (φ_N + φ_S + φ_E + φ_W − 4φ_O)/h² = 0, or:

φ_N + φ_S + φ_E + φ_W − 4φ_O = 0    (7.9)
Formulating this equation for each point in the grid and using the boundary conditions
where appropriate, we end up with a system of N equations, one for each of the N free points of
the grid. Applying equation (7.9) to point 1 of the mesh gives:
In this way, we can assemble all 56 equations from the 56 mesh points of the figure, in
terms of the 56 unknowns. The resulting 56 equations can be expressed as:
Ax=y (7.14a)
| −4   1   ⋯   1           | | φ1  |   |  0 |
|  1  −4   1   ⋯   1       | | φ2  |   |  0 |
|      1  −4   ⋯   1       | | φ3  |   |  0 |
|  ⋮            ⋱          | |  ⋮  | = |  ⋮ |
|                          | | φ13 |   | −1 |
|          ⋯  −4   1       | | φ55 |   |  0 |
|          ⋯   1  −4       | | φ56 |   |  0 |
    (7.14b)
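The same 5-point equation (7.9) can also be solved iteratively instead of assembling the matrix. A Jacobi-style sweep on a simpler illustrative geometry (a plain unit square with one edge at φ = 1 and the rest at 0, rather than the coaxial mesh of Fig. 8.1):

```python
import numpy as np

N = 21
phi = np.zeros((N, N))
phi[0, :] = 1.0                        # one edge held at phi = 1, others at 0
for it in range(5000):
    # Jacobi update: each free node becomes the average of its 4 neighbours,
    # which is exactly (7.9) solved for phi_O
    phi[1:-1, 1:-1] = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1]
                              + phi[1:-1, :-2] + phi[1:-1, 2:])
# by symmetry (superposing the four rotated problems gives phi = 1 everywhere),
# the potential at the centre of the square is 0.25
```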
In the description so far we have seen the implementation of the Dirichlet boundary
condition; that is a condition where the values of the desired function are known at the edges of
the region of interest (ends of the interval in 1-D or boundary of a 2-D or 3-D region). This has
been implemented in a straightforward way in equations (7.10), (7.11) and (7.13).
Frequently, other types of boundary conditions appear; for example, when the derivatives of the function are known (instead of the values themselves) at the boundary, this is the Neumann condition. On occasions, a mixed condition will apply; something like:

f(r) + ∂f/∂n = K

where the second term is the normal derivative (the derivative along the direction normal to the boundary). This type of condition appears in some radiation problems (Sommerfeld radiation condition).
We will see next some forms of dealing with these boundary conditions in the context of
finite difference calculations.
For example in the case of the ‘square coaxial problem’ studied earlier, we can see that the
solution will have symmetry properties which makes unnecessary the calculation of the potential
over the complete cross-section. In fact only one eighth of the cross-section is actually needed.
The new region of interest can be one eighth of the complete cross-section (the shaded region in Fig. 8.2) or one quarter of the cross-section, to avoid oblique lines. In either case, the new boundary conditions needed on the dashed lines that limit the new region of interest are of the Neumann type, ∂φ/∂n = 0, since the lines of constant potential will be perpendicular to those edges (n represents the normal to the boundary).

Fig. 8.2 Reduced region of interest, with φ = 0 on the outer conductor, φ = 1 on the inner conductor and ∂φ/∂n = 0 on the symmetry lines.
We will need a different strategy to deal with these conditions.
For this condition it is more convenient to define the mesh in a different manner. If we
place the boundary at half the node spacing from the start of the mesh as in the figure below, we
can implement the normal derivative condition in a simple form:
(Note that the node numbers used in the next figure do not correspond to the mesh numbering
defined earlier for the whole cross-section).
Exercise 7.2:
Consider Fig. 8.4 (with ∂φ/∂n = 0 on the symmetry boundary and φ = V on the conductor) and the points 5 and 15, with the point a on the boundary (not in the mesh). Use Taylor expansions at these points.
Using the results of Exercise 7.2, the difference equation corresponding to the
discretization of the Laplace equation at point 5 where the boundary condition is φ = V will be:
φ4 + 6φ 5 + φ6 = 8V
Exercise 7.3
Using Taylor expansions, derive an equation to implement the Neumann condition using
five points along a line normal to the edge and at distances 0, h, 2h, 3h, and 4h.
∂²u/∂x² = ∂u/∂t    (7.15)
where u(x,t) is the temperature. The boundary and initial conditions are:
u(0, t) = 0 and u(L, t) = 1 for all t;  u(x, 0) = 0 for x < L
We can discretize the solution space (x, t) with a regular grid with spacings ∆x and ∆t. The solution will be sought at positions x_i, i = 1, …, M−1 (leaving out of the calculation the points at the ends of the rod, so ∆x = L/M) and at times t_n, n = 0, 1, 2, …
Using this discretization, the boundary and initial condition become:
u_0^n = 0 and u_M^n = 1 for all n;  u_i^0 = 0 for i = 0, …, M−1    (7.16)
We now discretize equation (7.15), converting the derivatives into differences:
For the space second derivative at time n and position i, we can use the central difference formula:

∂²u/∂x² |_{i,n} ≈ (u_{i−1}^n − 2u_i^n + u_{i+1}^n)/∆x²    (7.17)
In the right hand side of (7.15), we need to approximate the time derivative at the same point
(i,n). We could use the forward difference:
∂u/∂t ≈ (u_i^{n+1} − u_i^n)/∆t    (7.18)
and in this case we get:
(u_{i−1}^n − 2u_i^n + u_{i+1}^n)/∆x² = (u_i^{n+1} − u_i^n)/∆t    (7.19)
as the difference equation.
Rearranging:

u_i^{n+1} = (∆t/∆x²) u_{i−1}^n + (1 − 2∆t/∆x²) u_i^n + (∆t/∆x²) u_{i+1}^n    (7.20)

If we call b = 1 − 2∆t/∆x² and c = ∆t/∆x², we can rewrite (7.20) as:

u_i^{n+1} = c u_{i−1}^n + b u_i^n + c u_{i+1}^n    (7.21)
which can be written in matrix form as: un +1 = Aun + v, where the matrix A and the vector v
are:
A =
| b  c          |
| c  b  c       |
|    c  b  c    |
|       ⋱  ⋱    |
|          c  b |

and v = (0, 0, …, 0, c)ᵀ, where the last entry of v carries the boundary value u_M^n = 1.    (7.22)
u is the vector containing all values ui . It is known at n = 0, so the matrix equation (7.21) can
be solved for all successive time steps.
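A sketch of the explicit update (7.20) for the rod, taking L = 1 and M = 20; note that this forward-difference scheme is only stable when ∆t ≤ ∆x²/2:

```python
import numpy as np

M = 20
dx = 1.0 / M
dt = 0.4 * dx**2                  # satisfies the stability limit dt <= dx^2/2
b = 1.0 - 2.0 * dt / dx**2
c = dt / dx**2
u = np.zeros(M + 1)               # u[0..M] includes the boundary nodes
u[M] = 1.0                        # u(L, t) = 1
for n in range(20000):
    # one step of (7.20); the boundary values u[0] and u[M] stay fixed
    u[1:M] = c * u[0:M-1] + b * u[1:M] + c * u[2:M+1]
# for large t the temperature tends to the linear steady state u(x) = x
x = np.linspace(0.0, 1.0, M + 1)
```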
We then need to evaluate the left hand side of (7.15) at the same time, but this is not on the grid!
We will have:

∂²u/∂x² |_{i,n+1/2} ≈ (u_{i−1}^{n+1/2} − 2u_i^{n+1/2} + u_{i+1}^{n+1/2})/∆x²    (7.24)

Since the values of u are restricted to positions in the grid, we have to approximate the values in the RHS of (7.24). Since those values are at the centre of the time intervals, we can approximate them by the average between the neighbouring grid points:

u_i^{n+1/2} ≈ (u_i^n + u_i^{n+1})/2    (7.25)
We can now substitute (7.25) into (7.24) and evaluate the equation (7.15) at time n+1/2. After
re-arranging this gives:
u_{i−1}^{n+1} − 2d u_i^{n+1} + u_{i+1}^{n+1} = −u_{i−1}^n + 2e u_i^n − u_{i+1}^n    (7.26)
where d = 1 + ∆x²/∆t and e = 1 − ∆x²/∆t.
This form of treating the derivative (evaluating the equation between grid points in order to have a central difference approximation for the first order derivative) is called the Crank-Nicolson method and has several advantages over the previous formulation (using the forward difference); in particular, it is stable for any choice of ∆t.
We can now write (7.26) for each value of i as in (7.21), considering the special cases at the ends
of the rod and at t = 0, and write the corresponding matrix form. In this case we will get:
A u^{n+1} = B u^n − 2v    (7.27)

where:

A =
| −2d    1           |
|   1  −2d    1      |
|        ⋱   ⋱   ⋱   |
|        ⋯    1  −2d |

B =
| 2e   −1           |
| −1   2e   −1      |
|       ⋱   ⋱   ⋱   |
|       ⋯   −1   2e |

and v = (0, …, 0, 1)ᵀ
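The Crank-Nicolson update can be sketched in the same setting as the explicit scheme (same rod, with M = 19 interior unknowns, our own discretization choice):

```python
import numpy as np

M = 19                                   # interior unknowns
dx = 1.0 / (M + 1)
dt = 1e-3
d = 1.0 + dx**2 / dt
e = 1.0 - dx**2 / dt
A = (np.diag([-2.0 * d] * M) + np.diag([1.0] * (M - 1), 1)
     + np.diag([1.0] * (M - 1), -1))
B = (np.diag([2.0 * e] * M) + np.diag([-1.0] * (M - 1), 1)
     + np.diag([-1.0] * (M - 1), -1))
v = np.zeros(M)
v[-1] = 1.0                              # carries the boundary value u(L, t) = 1
u = np.zeros(M)                          # initial condition u(x, 0) = 0
for n in range(5000):
    u = np.linalg.solve(A, B @ u - 2.0 * v)   # A u^{n+1} = B u^n - 2v, (7.27)
x = np.linspace(dx, 1.0 - dx, M)         # interior grid; steady state is u = x
```

In practice the tridiagonal system would be factorized once rather than re-solved at every step.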
Example
Consider now a parabolic equation in 2 space dimensions and time, like the Schroedinger
equation or the diffusion equation in 2D:
∂²u/∂x² + ∂²u/∂y² = a ∂u/∂t    (7.28)
For example, this could represent the temperature distribution over a 2-dimensional plate. Let’s
consider a square plate of length 1 in x and y and the following boundary conditions:
The x coordinate is discretized with the points 0, 1, …, i−1, i, i+1, …, R, R+1.
There are R+2 points with R unknown values (i = 1, ..., R). The two extreme points: i = 0 and i
= R+1 correspond to x = 0 and x = 1, respectively.
The discretization of the y coordinate can be made in similar form. For convenience, we
can also use R+2 points, where only R are unknown: (j = 1, ..., R).
Discretization of time is done by considering t = n∆ t, where ∆ t is the time step.
We need to approximate the LHS at the same time using the average of the values at n and n+1:

∇²u^{(n+1/2)} = (1/2)(∇²u^{(n)} + ∇²u^{(n+1)})
Applying this and (7.29) to (7.28) we get:

∇²u^{(n+1)} + ∇²u^{(n)} = (2a/∆t)(u^{(n+1)} − u^{(n)})

or, re-writing:

∇²u^{(n+1)} − (2a/∆t) u^{(n+1)} = −∇²u^{(n)} − (2a/∆t) u^{(n)}    (7.30)
u_{i−1,j}^{n+1} + u_{i+1,j}^{n+1} + u_{i,j−1}^{n+1} + u_{i,j+1}^{n+1} − (4 + 2a/∆t) u_{i,j}^{n+1}
= −[ u_{i−1,j}^n + u_{i+1,j}^n + u_{i,j−1}^n + u_{i,j+1}^n − (4 − 2a/∆t) u_{i,j}^n ]    (7.31)
Defining now node numbering over the grid and a vector u(n) containing all the unknown values
(for all i and j), (7.31) can be written as a matrix equation of the form:
Exercise 7.5
A length L of transmission line is terminated at both ends with a short circuit. A unit
impulse voltage is applied in the middle of the line at time t = 0.
The voltage φ(x, t) along the line satisfies the following differential equation:
∂²φ/∂x² + p² ∂²φ/∂t² = 0, with the boundary and initial conditions:
a) By discretizing both the coordinate x and time t as xi = (i – 1)∆x , i = 1, 2, .., R+1 and
tm = m∆t , m = 0, 1, 2, ..., use finite differences to formulate the numerical solution of the
equation above. Show that the problem reduces to a matrix problem of the form:
Φ m +1 = AΦ m + BΦ m −1
where A and B are matrices and Φm is a vector containing the voltages at each of the
discretization points xi at time tm.
b) Choosing R = 7, find the matrices A and B giving the values of their elements, taking
special care at the edges of the grid. Show the discretized equation corresponding to points at
the edge of the grid (for i = 1 or R+1) and consider the boundary condition φ(0) = φ(L) = 0.
c) How would the matrices A and B change if the boundary condition was changed to ∂φ/∂x = 0 at x = 0, L (corresponding in this case to an open circuit)? Show the discretized equation
corresponding to one of the edge points and propose a way to transform it so it contains only
values corresponding to points in the defined grid.
where L is a linear differential operator, u(x) is a function to be found, s(x) is a known function and x is the position vector of any point in Ω (coordinates).
We also need to impose boundary conditions on the values of u and/or its derivatives on
Γ, the boundary of Ω, in order to have a unique solution to the problem.
The boundary conditions can be written in general as:

B u(x) = t(x) on Γ    (8.2)

with B a linear differential operator and t(x) a known function. So we will have, for example:

B = 1: u(x) = t(x): known values of u on Γ (Dirichlet condition)
B = ∂/∂n: ∂u(x)/∂n = t(x): fixed values of the normal derivative (Neumann condition)
B = ∂/∂n + k: ∂u/∂n + ku = t(x): mixed condition (radiation condition)
A problem described in this form is known as a boundary value problem (differential equation
+ boundary conditions).
Weak Solution:
This is an indirect approach. Instead of trying to solve the problem directly, we can re-
formulate it as the search for a function that satisfies some conditions also satisfied by the
solution to the original problem (8.1)-(8.2). With a proper definition of these conditions the
search will lead to the correct and unique solution.
Before expanding on the details of this search, we need to introduce the idea of the inner product between two functions. This will allow us to quantify (put a number to) how close one function is to another, or how big or small an error function is.
The commonest definition of inner product between two functions f and g is:

⟨f, g⟩ = ∫_Ω f g* dΩ  for complex functions f and g  (8.3)
In general, the inner product between two functions will be a real number obtained by
global operations between the functions over the domain and satisfying some defining properties
(for real functions):
i)  ⟨f, g⟩ = ⟨g, f⟩
ii)  ⟨αf + βg, h⟩ = α⟨f, h⟩ + β⟨g, h⟩,  α and β scalars  (8.4)
iii)  ⟨f, f⟩ = 0 if and only if f = 0

In terms of the inner product, the problem can then be written as:

⟨Lu, h⟩ = ⟨s, h⟩  or  ⟨Lu − s, h⟩ = 0  (8.5)
Weighted Residuals
Variational Method
Weighted Residuals
For this approach the problem (8.5) is rewritten in the form:  r = Lu − s = 0  (8.6)

where the function r is the residual, or error: the difference between the LHS and the RHS of the differential equation when we try any function in place of u. This residual will only be zero when the function we use is the correct solution to the problem.
We can now look at the solution to (8.6) as an optimization (or minimization) problem; that is: "Find the function u such that r = 0", or in an approximate way:
"Find the function u such that r is as small as possible", and here we need to be able to measure how 'small' the error (or residual) is.
Using these ideas on the weak formulation, this is now transformed into:
Given the function s(x) and an arbitrary function h(x) in Ω, find the function u(x) that satisfies:

⟨r, h⟩ = 0  (where r = Lu − s)  (8.7)
We now need to choose the function h. For this, we can also use an expansion, in general using another set of functions wi(x), for example:

h(x) = Σ_{i=1}^{N} ci wi(x)  (8.9)

so that condition (8.7) becomes:

⟨r, h⟩ = ⟨r(x), Σ_{i=1}^{N} ci wi(x)⟩ = Σ_{i=1}^{N} ci ⟨r(x), wi(x)⟩ = 0  (8.10)

Since the coefficients ci are arbitrary, this condition reduces to:

⟨r, wi⟩ = 0  for all i: i = 1, ..., N  (8.11)

which is simpler than (8.10). (It comes from the choice: ci = 1 and all others = 0, in sequence.) So, we can conclude that we don't need to test the residual against any (and all) possible function h; we only need to test it against the chosen weighting functions wi, i = 1, ..., N.
Now, we can expand the function u in terms of the trial (basis) functions as in (8.8) and use this expansion in r: r = Lu − s. With this, (8.11) becomes:

⟨r, wi⟩ = ⟨Lu − s, wi⟩ = ⟨L Σ_{j=1}^{N} dj bj(x) − s(x), wi(x)⟩ = 0  for all i

or

⟨r, wi⟩ = Σ_{j=1}^{N} dj ⟨Lbj, wi⟩ − ⟨s, wi⟩ = 0  for all i  (8.12)
Note that in this expression the only unknowns are the coefficients dj.
We can rewrite this expression (8.12) as:  Σ_{j=1}^{N} aij dj = si  for all i: i = 1, ..., N

where aij = ⟨Lbj, wi⟩ and si = ⟨s, wi⟩,
which can be put in matrix notation as:
Ad=s (8.13)
Since the trial functions bj are known, the matrix elements aij are all known and the problem has been reduced to solving (8.13), finding the coefficients dj of the expansion of the function u.
Different choices of the trial functions and the weighting functions define the different variants by which this method is known (for example, the Galerkin method, in which the weighting functions are chosen equal to the trial functions: wi = bi).
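As a small illustration of (8.12)-(8.13), the following sketch applies the Galerkin choice wi = bi to a made-up 1D problem Lu = −u″ = s on (0, 1) with u(0) = u(1) = 0, using a sine basis. All problem data here are hypothetical choices for the demonstration, not taken from the text.

```python
import math

# Galerkin weighted residuals for -u'' = s, with basis b_j(x) = sin(j*pi*x)
# and s(x) = pi^2 sin(pi*x), so the exact solution is u = sin(pi*x).
N = 4
M = 2000                      # midpoint-rule points for the inner products

def inner(f, g):
    h = 1.0 / M
    return h * sum(f((k + 0.5) * h) * g((k + 0.5) * h) for k in range(M))

b = [lambda x, j=j: math.sin(j * math.pi * x) for j in range(1, N + 1)]
Lb = [lambda x, j=j: (j * math.pi) ** 2 * math.sin(j * math.pi * x)
      for j in range(1, N + 1)]             # L b_j = -b_j''
s = lambda x: math.pi ** 2 * math.sin(math.pi * x)

A = [[inner(Lb[j], b[i]) for j in range(N)] for i in range(N)]  # a_ij = <Lb_j, w_i>
rhs = [inner(s, b[i]) for i in range(N)]                        # s_i = <s, w_i>

# For this basis A is diagonal, so the system (8.13) decouples:
d = [rhs[i] / A[i][i] for i in range(N)]
```

Here d should come out as (1, 0, 0, 0): the expansion recovers the exact solution because s lies in the span of the basis.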
Variational Method
The central idea of this method is to find a functional* (or variational expression)
associated to the boundary value problem and for which the solution of the BV problem leads to
a stationary value of the functional. Combining this idea with the Rayleigh-Ritz procedure we
can develop a systematic solution method.
* A functional is simply a function of a function over a complete domain, which gives as a result a number. For example: J(φ) = ∫_Ω φ² dΩ. This expression gives a numerical value for each function φ we use in J(φ). Note that J is not a function of x.
Example:
The problem is to find the path of light travel between two points. We can start using a
variational approach and for this we need a variational expression related to this problem. In
particular, for this case we can use Fermat’s Principle that says that ‘light travels through the
quickest path’ (Note that this is not necessarily the shortest). We can formulate this statement in
the form:
time = min ∫_{P1}^{P2} ds/v(s)  (8.14)
where v(s) is the velocity point by point along the path. We are not really interested in the actual time taken to go from P1 to P2, but in the conditions that this minimum imposes on the path s(x, y) itself. That is, we want to find the actual path between these points that minimizes this time. We can write for the velocity: v = c/n, where n(x, y) is the refractive index.
Let’s consider first a uniform medium, that is, one with n uniform (constant). In that case,
(8.14) becomes:
time = min (n/c) ∫_{P1}^{P2} ds = (n/c) min(path length)  (8.15)
The integral in this case reduces to the actual length of the path, so the above statement asks for the path of 'minimum length', or the shortest path between P1 and P2. Obviously, the solution in this case is the straight line joining the points. However, you can see that (8.14) can also be applied to an inhomogeneous medium, and the minimization process should lead to the actual trajectory.
Extending the example a little more, let’s consider an interface between two media with
different refractive indices. Without losing generality, we can consider a situation like that of
the figure.
From above, we know that in each media the path will be a straight line, but what are the
coordinates of the point on the y–axis (interface) where both straight lines meet?
Fig. 8.7
Both integrals correspond to the respective lengths of the two branches of the total path, but we
don’t know the coordinate y0 of the point P0. We can re-write (8.16) in the form:
time = min (1/c)[ n1 √(x1² + (y1 − y0)²) + n2 √(x2² + (y2 − y0)²) ]  (8.17)
where the only variable (unknown) is y0. To find it we need to find the minimum, and for this we do:

d(time)/dy0 = 0 = n1 (−(y1 − y0))/√(x1² + (y1 − y0)²) + n2 (−(y2 − y0))/√(x2² + (y2 − y0)²)  (8.18)
Now, from the figure we can observe that the right hand side can be written in terms of the
angles of incidence and refraction as:
n1 sin α 1 − n2 sin α 2 = 0 (8.19)
as the condition the point P0 must satisfy, and we know this is right because (8.19) is the familiar
Snell’s law.
So, the general idea of this variational approach is to formulate the problem in such a form that we look for a stationary condition (maximum, minimum, inflexion point) of some parameter which depends on the desired solution. (As in the above problem, where we looked for the time, which depends on the actual path travelled.)
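The minimization in this example can also be checked numerically. The sketch below minimizes the travel time (8.17) over y0 by a ternary search (valid because the time is convex in y0) and verifies that the result satisfies Snell's law (8.19); the geometry and refractive indices are made-up values.

```python
import math

# Minimise the travel time (8.17) over y0 and check Snell's law (8.19).
n1, n2 = 1.0, 1.5          # refractive indices (made-up)
x1, y1 = 1.0, 1.0          # P1: horizontal distance x1 from the interface
x2, y2 = 1.0, -1.0         # P2: on the other side of the interface
c = 1.0

def travel_time(y0):
    return (n1 * math.hypot(x1, y1 - y0) + n2 * math.hypot(x2, y2 - y0)) / c

lo, hi = y2, y1            # ternary search on a convex function
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if travel_time(m1) < travel_time(m2):
        hi = m2
    else:
        lo = m1
y0 = (lo + hi) / 2

sin_a1 = (y1 - y0) / math.hypot(x1, y1 - y0)   # sine of angle of incidence
sin_a2 = (y0 - y2) / math.hypot(x2, y2 - y0)   # sine of angle of refraction
```

At the minimizing y0 the quantity n1 sin α1 − n2 sin α2 should vanish, which is exactly condition (8.19).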
An important property of a variational approach is that precisely because the solution
function produces a stationary value of the functional, this is rather insensitive to small
perturbations of the solution (approximations). This is a very desirable property, particularly for
the application of numerical methods where all solutions are only approximate. To illustrate this
property, let’s analyse another example:
Example
Consider the problem of finding the natural resonant frequencies of a vibrating string (a guitar string, say). For this problem, an appropriate variational expression is (do not worry about where this comes from, but there is a proof in the Appendix):
k² = s.v. [ ∫_a^b (dy/dx)² dx ] / [ ∫_a^b y² dx ]  (8.20)
The above expression corresponds to the k–number or resonant frequencies of a string vibrating
freely and attached at the ends at a and b.
For simplicity, let's change the limits to −a and a. The first mode of oscillation has the form:

y = A cos(πx/2a),  then  dy/dx = −(Aπ/2a) sin(πx/2a)

Using this exact solution in (8.20) we get for the first mode (prove this):

k² = π²/4a²,  then  k = π/2a ≈ 1.571/a
Now, to show how a ‘bad’ approximation of the function y can still give a rather acceptable
value for k, let’s try a simple triangular shape (instead of the correct sinusoidal shape of the
vibrating string):
y = A(1 + x/a)  for x < 0
y = A(1 − x/a)  for x > 0

Fig. 8.8

then dy/dx = A/a for x < 0 and dy/dx = −A/a for x > 0, and using these values in (8.20) gives (prove this):

k² = 3/a²,  then  k ≈ 1.732/a
which is not too bad considering how coarse the approximation to y(x) is. If instead of the triangular shape we try a second order (parabolic) shape:

y = A(1 − (x/a)²),  then  dy/dx = −2Ax/a²

which gives:

k² = 2.5/a²,  then  k ≈ 1.581/a
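These three values of k can be reproduced numerically by evaluating the Rayleigh quotient (8.20) with a simple midpoint rule; a sketch with a = A = 1, using the analytic derivatives quoted above:

```python
import math

# Evaluate k = sqrt(Q/R) from (8.20) on [-a, a] by the midpoint rule.
a = 1.0
M = 20000
h = 2 * a / M
xs = [-a + (i + 0.5) * h for i in range(M)]

def k_value(y, dy):
    Q = h * sum(dy(x) ** 2 for x in xs)
    R = h * sum(y(x) ** 2 for x in xs)
    return math.sqrt(Q / R)

k_exact = k_value(lambda x: math.cos(math.pi * x / (2 * a)),
                  lambda x: -math.pi / (2 * a) * math.sin(math.pi * x / (2 * a)))
k_tri = k_value(lambda x: 1 - abs(x) / a,
                lambda x: -math.copysign(1 / a, x))
k_par = k_value(lambda x: 1 - (x / a) ** 2,
                lambda x: -2 * x / a ** 2)
```

The results reproduce 1.571/a, 1.732/a and 1.581/a, and both trial shapes give values above the exact one, as the stationarity (here, minimum) property predicts.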
Now, how can we use this method systematically? As said before, the use of the Rayleigh-Ritz procedure permits the construction of a systematic numerical method. In summary, we can specify the necessary steps as:

Use Rayleigh-Ritz: expand u = Σ_{j=1}^{N} dj bj and insert this in J(u); impose the stationarity of J, which gives a system of equations for the coefficients dj; then reconstruct u = Σ_{j=1}^{N} dj bj, and u is the solution of the BV problem.
We will skip here the problem of actually finding the corresponding variational expression for a boundary value problem, simply saying that there are systematic methods to find them. We will be concerned here with how to solve a problem once we already have a variational formulation.
Example
For the problem of the square coaxial (or square capacitor) seen earlier, the BV problem is defined by the Laplace equation: ∇²φ = 0 with some boundary conditions (L = ∇², u = φ, s = 0).
An appropriate functional (variational expression) for this case is:

J(φ) = ∫_Ω (∇φ)² dΩ  (given)  (8.21)
Using Rayleigh-Ritz:  J(φ) = ∫_Ω [∇ Σ_{j=1}^{N} dj bj(x, y)]² dΩ = ∫_Ω [Σ_{j=1}^{N} dj ∇bj(x, y)]² dΩ

so

J(φ) = ∫_Ω Σ_{i=1}^{N} Σ_{j=1}^{N} di dj ∇bi(x, y)·∇bj(x, y) dΩ  (8.22)
Now, find the stationary value:  ∂J/∂di = 0  for all i: i = 1, ..., N

so, applying it to (8.22):  ∂J/∂di = 2 Σ_{j=1}^{N} aij dj = 0,  for all i: i = 1, ..., N  (8.23)

where aij = ∫_Ω ∇bi·∇bj dΩ.
Solving the system of equations (8.23) we obtain the coefficients dj and the unknown
function can be obtained as:
u(x, y) = Σ_{j=1}^{N} dj bj(x, y)
We can see that both methods, the weighted residuals and the variational method, transform the BV problem into an algebraic, matrix problem.
One of the first steps in the implementation of either method is the choice of appropriate
expansion functions to use in the Rayleigh-Ritz procedure: basis or trial functions and weighting
functions.
The finite element method provides a simple way to construct these functions and to implement these methods.
9. FINITE ELEMENTS
As the name suggests, the finite element method is based on the division of the domain of
interest Ω into ‘elements’, or small pieces Ei that cover Ω completely but without intersections:
They constitute a tessellation (tiling) of Ω:
∪i Ei = Ω ;  Ei ∩ Ej = ∅  (i ≠ j)
Over the subdivided domain we apply the methods discussed earlier (either weighted
residuals or variational). The basis functions are defined locally in each element and because
each element is small, these functions can be very simple and still constitute overall a good
approximation of the desired function. In this form, inside each element Ee, the wanted function u(x) in (8.1) is represented by a local approximation ũe(x) valid only in the element number e: Ee. The complete function u(x) over the whole domain Ω is then simply approximated by the addition of all the local pieces:  u(x) ≈ Σe ũe(x)
An important characteristic of this method is that it is 'exact-in-the-limit'; that is, the degree of approximation can only improve as the number of elements increases, the solution gradually and monotonically converging to the exact value. In this form, a solution can always be obtained to any degree of approximation, provided sufficient computer resources are available.
The function ũ(x), the approximation to u(x), is the addition of the locally defined functions ũe(x), which are only nonzero in the subinterval e. Now, these local functions ũe(x) can be defined as the superposition of interpolation functions Ni(x) and Ni+1(x) as shown in the figure above (right). From the figure we can see that the function ũe(x), the local approximation to u(x) in element e, can be written as:

ũe(x) = ui Ni(x) + ui+1 Ni+1(x)  (9.1)
With this definition, we can write for the function ũ(x) in the complete domain Ω:

u(x) ≈ ũ(x) = Σ_{e=1}^{Ne} ũe(x) = Σ_{i=1}^{Np} ui Ni(x)  (9.2)

(Np is the number of nodes, Ne is the number of elements.) The functions Ni(x) are defined as the (double-sided) interpolation functions at node i, i = 1, ..., Np, so Ni(x) = 1 at node i and 0 at all other nodes.
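A minimal sketch of (9.1)-(9.2) in Python: hat functions on an arbitrary (made-up) 1D mesh, interpolating the sample function u(x) = x².

```python
# Piecewise-linear interpolation with "hat" functions N_i(x), as in (9.2).
nodes = [0.0, 0.3, 0.5, 0.9, 1.0]          # made-up, non-uniform mesh

def N(i, x):
    """Hat function for node i: 1 at nodes[i], 0 at every other node."""
    if i > 0 and nodes[i - 1] <= x <= nodes[i]:
        return (x - nodes[i - 1]) / (nodes[i] - nodes[i - 1])
    if i < len(nodes) - 1 and nodes[i] <= x <= nodes[i + 1]:
        return (nodes[i + 1] - x) / (nodes[i + 1] - nodes[i])
    return 0.0

u_nodal = [x ** 2 for x in nodes]          # nodal values u_i of u(x) = x^2

def u_tilde(x):
    """The global approximation (9.2): sum of nodal values times hats."""
    return sum(u_nodal[i] * N(i, x) for i in range(len(nodes)))
```

By construction ũ(x) reproduces the nodal values exactly and is linear in between, e.g. ũ(0.4) is the average of the values at 0.3 and 0.5.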
Exercise 9.1:
Examine this figure and that of the previous page and show that indeed (9.1) is valid in the
element (xi, xi+1) and so (9.2) is valid over the full domain.
Example
Consider the boundary value problem:  d²y/dx² + k²y = 0  for x ∈ [a, b]  with y(a) = y(b) = 0
This corresponds for example, to the equation describing the envelope of the transverse
displacement of a vibrating string attached at both ends. A suitable variational expression
(corresponding to resonant frequencies) is the following, as seen in (8.20) and in the Appendix:
J = k² = [ ∫_a^b (dy/dx)² dx ] / [ ∫_a^b y² dx ]  (given)  (9.3)
Using now the expansion:  y = Σ_{j=1}^{N} yj Nj(x)  (9.4)
where N is the number of nodes, yj are the nodal values (unknown) and Nj(x) are the shape
functions.
From (9.3) we have:  k² = k²(yj) = Q/R  where:

Q = ∫_a^b [ d/dx Σ_{j=1}^{N} yj Nj(x) ]² dx = ∫_a^b [ Σ_{j=1}^{N} yj dNj/dx ]² dx = ∫_a^b Σ_{j=1}^{N} yj dNj/dx Σ_{k=1}^{N} yk dNk/dx dx  (9.5)
and
R = ∫_a^b [ Σ_{j=1}^{N} yj Nj(x) ]² dx = ∫_a^b Σ_{j=1}^{N} yj Nj(x) Σ_{k=1}^{N} yk Nk(x) dx  (9.6)
To find the stationary value, we do:  dk²/dyi = 0  for each yi, i = 1, ..., N  (9.7)

But since k² = Q/R, then dk²/dyi = (Q′R − QR′)/R², so dk²/dyi = 0 ⇒ Q′R = QR′ and Q′ = (Q/R)R′ = k²R′; so finally:

dQ/dyi = k² dR/dyi  for all yi, i = 1, ..., N  (9.8)
dQ/dyi = ∫_a^b dNi/dx Σ_{k=1}^{N} yk dNk/dx dx + ∫_a^b Σ_{j=1}^{N} yj dNj/dx dNi/dx dx

or

dQ/dyi = 2 ∫_a^b dNi/dx Σ_{j=1}^{N} yj dNj/dx dx = 2 Σ_{j=1}^{N} yj ∫_a^b dNi/dx dNj/dx dx
which can be written as:

dQ/dyi = 2 Σ_{j=1}^{N} aij yj  where  aij = ∫_a^b dNi/dx dNj/dx dx  (9.9)
For the second term:
dR
b N b N
= ∫ Ni (x) ∑ y j N j (x) dx + ∫ ∑ yk Nk (x) Ni (x) dx
dyi
a j =1 a k =1
E763 (part 2) Numerical Methods page 92
or
N b N b
dR
= 2∑ y j ∫ Ni ( x ) N j ( x ) dx = 2∑ bij y j with bij = ∫ Ni ( x ) N j ( x ) dx (9.10)
dyi j =1 a j =1 a
Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation:
A y = k2 B y (9.11)
Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalue k2
and the corresponding solution vector y (list of nodal values of the function y(x)).
The matrix elements are:

aij = ∫_a^b (dNi/dx)(dNj/dx) dx  and  bij = ∫_a^b Ni Nj dx  (9.12)

The shape functions Ni and Nj are only nonzero in the vicinity of nodes i and j respectively (see figure below). So, aij = bij = 0 if Ni and Nj do not overlap; that is, if j ≠ i−1, i, i+1. Then, the matrices A and B are tridiagonal: they will have no more than 3 nonzero elements per row.
Exercise 9.2:
Define generically the triangular functions Ni, integrate (9.12) and calculate the value of the
matrix elements aij and bij.
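For a uniform mesh of spacing h, the integrals of Exercise 9.2 give aii = 2/h, a(i,i±1) = −1/h, bii = 2h/3, b(i,i±1) = h/6 (stated here without derivation, so as not to spoil the exercise entirely). With these values, the eigenproblem (9.11) can be solved by plain inverse iteration with a tridiagonal (Thomas) solver; for a string on [0, L] the first k should approach π/L from above. A sketch:

```python
import math

def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal system by an LU sweep (Thomas algorithm)."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def tri_mul(sub, diag, sup, x):
    """Multiply a tridiagonal matrix by a vector."""
    n = len(x)
    return [(sub[i] * x[i - 1] if i > 0 else 0.0) + diag[i] * x[i]
            + (sup[i] * x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def first_k(n_el=20, L=1.0):
    h = L / n_el
    n = n_el - 1                       # interior nodes: y(0) = y(L) = 0
    Ad, Ao = [2.0 / h] * n, [-1.0 / h] * n      # A, from (9.9)
    Bd, Bo = [2 * h / 3] * n, [h / 6] * n       # B, from (9.10)
    y = [1.0] * n
    for _ in range(100):               # inverse iteration: y <- A^{-1} B y
        z = thomas(Ao, Ad, Ao, tri_mul(Bo, Bd, Bo, y))
        s = math.sqrt(sum(v * v for v in z))
        y = [v / s for v in z]
    num = sum(u * v for u, v in zip(y, tri_mul(Ao, Ad, Ao, y)))
    den = sum(u * v for u, v in zip(y, tri_mul(Bo, Bd, Bo, y)))
    return math.sqrt(num / den)        # Rayleigh quotient gives k^2
```

With 20 elements the computed k is already within a fraction of a percent of π, and (as the variational framework guarantees) it approaches the exact value from above.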
In this case, a two dimensional region Ω is subdivided into smaller pieces or 'elements' and the subdivision satisfies the same properties as in the one-dimensional case; that is, the elements cover the region of interest completely and there are no intersections (no overlapping). The most common way of subdividing a 2D region is by using triangles. Quadrilateral elements are also used and they have some useful properties, but by far the most common, versatile and easiest to use are triangles with straight sides. There are well-developed methods to produce an appropriate meshing (or subdivision) of a 2D region into triangles, and they have maximum flexibility to accommodate intricate shapes of the region of interest Ω.
The process of calculation follows the same route as in the 1D case. Now, a function of
two variables u(x,y) is approximated by shape functions Nj(x,y) defined as interpolation
functions over one element (in this case a triangle). This approximation is given by:
u(x, y) ≈ Σ_{j=1}^{N} uj Nj(x, y)  (9.13)
where N is the number of nodes in the mesh and the coefficients uj are the nodal values
(unknown). Nj(x,y) are the shape functions defined for every node of the mesh.
Fig. 9.4
The figure shows a rectangular region in the xy–plane subdivided into triangles, with the
corresponding piecewise planar approximation to a function u(x,y) plotted along the vertical
axis. Note that the approximation is composed of flat 'tiles' that fit exactly along the edges, so the approximation is continuous over the entire region but its derivatives are not. The approximation shown in the figure uses first order functions, that is, pieces of planes (flat tiles). Other types are also possible but require defining more nodes in each triangle. (While a plane is totally defined by 3 points, e.g. the nodal values, a second order surface, for example, will need 6 points: for example the values at the 3 vertices and at 3 midside points.)
For a first order approximation, the function u(x,y) is approximated in each triangle by a
function of the form (first order in x and y):
ũ(x, y) = p + qx + ry = (1 x y)(p q r)ᵀ  (9.14)
where p, q and r are constants with different values in each triangle. Similarly to the one
dimensional case, this function can be written in terms of shape functions (interpolation
polynomials):
u(x,y) ≈ u˜ (x,y) = u1 N1(x, y) + u2 N2 (x, y) + u3 N3 (x, y) (9.15)
for a triangle with nodes numbered 1, 2 and 3, with coordinates (x1 , y1), (x2 , y2) and (x3 , y3).
E763 (part 2) Numerical Methods page 94
The shape functions Ni are such that Ni = 1 at node i and 0 at all the others:
It can be shown that the function Ni satisfying this property is:

Ni(x, y) = (1/2A)(ai + bi x + ci y)  (9.16)

where A is the area of the triangle and:

a1 = x2y3 − x3y2,  b1 = y2 − y3,  c1 = x3 − x2
a2 = x3y1 − x1y3,  b2 = y3 − y1,  c2 = x1 − x3
a3 = x1y2 − x2y1,  b3 = y1 − y2,  c3 = x2 − x1
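These formulas are easy to check numerically: for any triangle, Ni must be 1 at node i, 0 at the other two nodes, and the three functions must sum to 1 everywhere. The sketch below does this for made-up node coordinates, using the fact that a1 + a2 + a3 = 2A (twice the signed area).

```python
# Check of (9.16) for an arbitrary (made-up, counterclockwise) triangle.
pts = [(0.0, 0.0), (2.0, 0.5), (0.5, 1.5)]      # nodes 1, 2, 3
(x1, y1), (x2, y2), (x3, y3) = pts
a = [x2 * y3 - x3 * y2, x3 * y1 - x1 * y3, x1 * y2 - x2 * y1]
b = [y2 - y3, y3 - y1, y1 - y2]
c = [x3 - x2, x1 - x3, x2 - x1]
twoA = a[0] + a[1] + a[2]                       # = 2A

def N(i, x, y):
    """First order shape function (9.16) for node i (0-based index)."""
    return (a[i] + b[i] * x + c[i] * y) / twoA
```

Since the b's and c's each sum to zero, the sum N1 + N2 + N3 reduces to (a1 + a2 + a3)/2A = 1 at any point, which is the partition-of-unity property used repeatedly below.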
The function N1 defined in (9.16) and shown below corresponds to the shape function
(interpolation function) for the node numbered 1 in the triangle shown. Now, the same node can
be a vertex of other neighbouring triangles in which we will also define a corresponding shape
function for node 1 (with expressions like (9.16) but with different values of the constants a, b
and c – different orientation of the plane), building up the complete shape function for node 1:
N1. This is shown in the next figure for a node number i, which belongs to five triangles.
Fig. 9.5 Fig. 9.6
Fig. 9.7
Joining all the facets of Ni, for each of the triangles that contain node i, we can refer to this
function as Ni(x,y) and then, considering all the nodes of the mesh, the function u can be written
as:
u(x, y) = Σ_{j=1}^{N} uj Nj(x, y)  (9.17)
We can now use this expansion for substitution in a variational expression or in the
weighted residuals expression, to obtain the corresponding matrix problem for the expansion
coefficients. An important advantage of this method is that these coefficients are precisely the
nodal values of the wanted function u, so the result is obtained immediately when solving the
matrix problem. An additional advantage is the sparsity of the resultant matrices.
MATRIX SPARSITY
Considering the global shape function for node i, shown in the figure above, we can see
that it is zero in all other nodes of the mesh. If we now consider the global shape function
corresponding to another node, say, j, we can see that products of the form Ni Nj or of
derivatives of these functions, which will appear in the definition of the matrix elements (as seen
in the 1–D case), will almost always be zero, except when the nodes i and j are either the same
node or immediate neighbours so there is an overlap (see figure above). This implies that the
corresponding matrices will be very sparse, which is very convenient in terms of computer
requirements.
Example:
If we consider a simple mesh of a rectangular region divided into 12 triangular elements with 12 nodes (numbered as in the figure), the resulting matrix sparsity pattern has nonzero entries (i, j) only where nodes i and j share a triangle.
Example:
For the problem of finding the potential distribution in the square coaxial, or equivalently, the
temperature distribution (in steady state) between the two square section surfaces, an appropriate
variational expression is:
J = ∫_Ω (∇φ)² dΩ  (9.18)
and substituting (9.17):

J = ∫_Ω [∇ Σ_{j=1}^{N} φj Nj]² dx dy = ∫_Ω [Σ_{j=1}^{N} φj ∇Nj]² dx dy

which can be re-written as:

J = ∫_Ω Σ_{i=1}^{N} φi ∇Ni · Σ_{j=1}^{N} φj ∇Nj dx dy

or

J = Σ_{i=1}^{N} Σ_{j=1}^{N} φi φj ∫_Ω ∇Ni·∇Nj dx dy = Φᵀ A Φ  (9.19)
Note that again, the coefficients aij can be calculated and the only unknowns are the nodal values
φj.
We now have to find the stationary value; that is, put:  ∂J/∂φi = 0  for i = 1, ..., N

∂J/∂φi = 0 = 2 Σ_{j=1}^{N} aij φj  for i = 1, ..., N,  then:  AΦ = 0  (9.20)
Then, the resultant equation is (9.20): AΦ = 0. We need to evaluate first the elements of
the matrix A. For this, we can consider the integral over the complete domain Ω as the sum of
the integrals over each element Ω k, k = 1, … , Ne (Ne elements in the mesh).
aij = ∫_Ω ∇Ni·∇Nj dx dy = Σ_{k=1}^{Ne} ∫_{Ωk} ∇Ni·∇Nj dx dy  (9.21)
Before calculating these values, let’s consider the matrix sparsity; that is, let’s see which
elements of A are actually nonzero. As discussed earlier (page 45), aij will only be nonzero if
the nodes i and j are both in the same triangle. In this way, the sum in (9.21) will only extend to
at most two triangles for each combination of i and j.
Inside one triangle, the shape function Ni(x, y), defined for the node i, is:

Ni(x, y) = (1/2A)(ai + bi x + ci y);  then:  ∇Ni = (∂Ni/∂x) x̂ + (∂Ni/∂y) ŷ = (1/2A)(bi x̂ + ci ŷ)
And then:

aij = Σ_{k=1}^{Ne} (1/4Ak²)(bi bj + ci cj) ∫_{Ωk} dx dy = Σ_{k=1}^{Ne} (1/4Ak)(bi bj + ci cj)  (9.22)

where the sum will only have a few terms (for those values of k corresponding to the triangles containing nodes i and j). The values of Ak, the area of the triangle, and bi, bj, ci and cj will be different for each triangle concerned.
In particular, considering for example the element a47 corresponding to the mesh in the previous figure, the sum extends over the triangles containing nodes 4 and 7; that is, triangles number 5 and 6:

a47 = (1/4A5)(b4(5) b7(5) + c4(5) c7(5)) + (1/4A6)(b4(6) b7(6) + c4(6) c7(6))

where the superscripts in parentheses indicate the triangle in which the constants are evaluated.
In this case, the integral in (9.22) reduces simply to the area of the triangle and the calculations are easy. However, in other cases the integration can be more complicated because it has to be done over a triangle at an arbitrary position and orientation in the x–y plane.
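For (9.22) itself, however, one element's contribution is straightforward to compute. The sketch below builds the 3×3 local matrix kij = (bi bj + ci cj)/4A for one made-up triangle; since the Ni sum to 1 inside the triangle, their gradients sum to zero, so every row of the local matrix must sum to zero, which gives a convenient sanity check.

```python
# Local (element) stiffness matrix k_ij = (b_i b_j + c_i c_j) / (4A)
# for a single made-up triangle, as in (9.22).
pts = [(0.0, 0.0), (1.0, 0.0), (0.3, 0.8)]
(x1, y1), (x2, y2), (x3, y3) = pts
b = [y2 - y3, y3 - y1, y1 - y2]
c = [x3 - x2, x1 - x3, x2 - x1]
A = 0.5 * ((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))  # triangle area

k_local = [[(b[i] * b[j] + c[i] * c[j]) / (4 * A) for j in range(3)]
           for i in range(3)]
```

The matrix is symmetric with positive diagonal entries; in an assembly loop, each triangle's k_local would be added into the global A at the rows and columns of its three node numbers.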
For example, solving a variational expression like:

J = k² = [ ∫_Ω (∇φ)² dΩ ] / [ ∫_Ω φ² dΩ ]  (corresponding to (9.3) for 1D problems)  (9.23)

results in an eigenvalue problem AΦ = k²BΦ, where the matrix B has the elements:

bij = ∫_Ω Ni Nj dΩ  (9.24)
Exercise 9.3:
Apply the Rayleigh-Ritz procedure to expression (9.23) and show that indeed the matrix
elements aij and bij have the values given in (9.21) and (9.24).
For the case of integrals like those in (9.24):

bij = ∫_Ω Ni Nj dΩ = Σ_{k=1}^{Ne} ∫_{Ωk} Ni Nj dΩ

The sum is over all triangles containing nodes i and j. For example, for the element b56 in the previous mesh, these will be triangles 2 and 7 only.
If we take triangle 7, its contribution to this element is the term:

b56 = (1/4A7²) ∫_{Ω7} (a5 + b5x + c5y)(a6 + b6x + c6y) dx dy

or

b56 = (1/4A7²) ∫_{Ω7} [a5a6 + (a5b6 + a6b5)x + (a5c6 + a6c5)y + b5b6x² + (b5c6 + b6c5)xy + c5c6y²] dx dy
Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh. These calculations can be cumbersome if attempted directly. However, it is much simpler to use a transformation of coordinates into a local system. This has the advantage that the integrals can be calculated for just one model triangle and then the result converted back to the x and y coordinates. The most common system of local coordinates used for this purpose is the triangle area coordinates.
Fig. 9.8
The advantage of using this local coordinate system is that the actual shape of the triangle is not important. Any point inside is defined by its proportional distance to the sides, irrespective of the triangle shape. In this way, calculations can be made in a model triangle using this system and then mapped back to the global coordinates.
The area of each of these triangles, for example A1, can be calculated using (see Appendix):

A1 = (1/2) det[[1, x, y], [1, x2, y2], [1, x3, y3]]

Then,  ξ1 = A1/A = (1/2A)(a1 + b1x + c1y)  or  ξ1 = N1(x, y)  (9.25)
We have then, that these coordinates vary in the same form as the shape functions, which
is quite convenient for calculations.
Expression (9.25) also gives us the required relationship between the local coordinates (ξ1, ξ2, ξ3) and the global coordinates (x, y); that is, the expression we need to convert (x, y) into (ξ1, ξ2, ξ3). We can also find the inverse relation, that is, (x, y) in terms of (ξ1, ξ2, ξ3). This is all we need to convert from one system of coordinates to the other. Equations (9.25) and (9.26) will allow us to change from one system to the other.
Finally, the evaluation of integrals can be made now in terms of the local coordinates in the usual way:

∫_{Ωk} f(x, y) dx dy = ∫_{Ωk} f(ξ1, ξ2, ξ3) |J| dξ1 dξ2

where J is the Jacobian of the transformation of coordinates. In this case |J| = 2A, so the expression we need to use to transform integrals is:

∫_{Ωk} f(x, y) dx dy = 2A ∫_{Ωk} f(ξ1, ξ2, ξ3) dξ1 dξ2
Example:
The integral (9.24) of the previous exercise, is difficult to calculate in terms of x and y for
an arbitrary triangle, and needs to be calculated separately for each pair of nodes in each triangle
of the mesh. Transforming to (local) area coordinates this is much simpler:
Fig. 9.10

So the integral of (9.28) results, with the integration over the model triangle carried out with ξ2 going from 0 to 1 − ξ1, and ξ1 from 0 to 1. Note that this conclusion about the integration limits is valid for any integral over the triangle, not only the one used above.
We can now calculate these integrals. Taking first the case where i ≠ j:
a) choosing i = 1 and j = 2 (this is an arbitrary choice; you can check that any other choice, e.g. 1, 3, will give the same result):

Iij = 2A ∫_0^1 ξ1 ∫_0^{1−ξ1} ξ2 dξ2 dξ1 = A ∫_0^1 ξ1(1 − ξ1)² dξ1 = A(1/2 − 2/3 + 1/4) = A/12

b) and for i = j, choosing i = 1:

Iii = 2A ∫_0^1 ξ1² ∫_0^{1−ξ1} dξ2 dξ1 = 2A ∫_0^1 ξ1²(1 − ξ1) dξ1 = 2A(1/3 − 1/4) = A/6
Once calculated in this form, the result can be used for any triangle irrespective of the
shape and position. We can see that for this integral, only the area A will change when applied
to different triangles.
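Both results can be confirmed numerically, and also against the standard closed form ∫_Ω ξ1^p ξ2^q ξ3^r dΩ = 2A·p!q!r!/(p + q + r + 2)! (a well-known formula for area coordinates, not derived in these notes):

```python
import math

def tri_integral(f, A=1.0, n=1000):
    """Midpoint-rule integral of f(xi1, xi2, xi3) over a triangle of area A,
    using the limits xi2: 0..1-xi1, xi1: 0..1 and the factor 2A."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            u, v = (i + 0.5) * h, (j + 0.5) * h
            if u + v <= 1.0:
                total += f(u, v, 1.0 - u - v)
    return 2 * A * total * h * h

def closed_form(p, q, r, A=1.0):
    """Exact value of the monomial integral in area coordinates."""
    return 2 * A * math.factorial(p) * math.factorial(q) * math.factorial(r) \
           / math.factorial(p + q + r + 2)
```

With p = q = 1, r = 0 the closed form gives A/12, and with p = 2 it gives A/6, matching Iij and Iii above; the midpoint rule reproduces both to the accuracy of the grid.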
In most cases involving second order differential equations, the resultant weighted residuals expression or the corresponding variational expression can be written without involving second order derivatives. In these cases, first order shape functions will be fine. However, if this is not possible, other types of elements/shape functions must be used. (Note that second order derivatives of a first order function will be zero everywhere.) We can also choose to use different shape functions even when first order polynomials would do, for example, to get higher accuracy with fewer elements.
In the same form we can write the corresponding shape functions for the other vertices
(nodes 2 and 3). For the mid-side nodes, for example node 4, we have that N4 should be zero at
all nodes except 4 where its value is 1. We can see that all other nodes are either on the side 2–3
(where ξ1 = 0) or on the side 3–1 (where ξ2 = 0). So the function N4 should be:
N4 = 4ξ1ξ2 (9.31)
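The six nodal conditions can be checked directly in area coordinates. The vertex functions used below, Ni = ξi(2ξi − 1), are the standard second order choice (presumably the content of the missing (9.30)) and are stated here as an assumption; N4, N5 and N6 follow the pattern of (9.31) for the three mid-side nodes.

```python
# Six-node (second order) triangle in area coordinates: each shape function
# should be 1 at its own node and 0 at the other five.
nodes = [(1, 0, 0), (0, 1, 0), (0, 0, 1),              # vertices 1, 2, 3
         (0.5, 0.5, 0), (0, 0.5, 0.5), (0.5, 0, 0.5)]  # mid-sides 4, 5, 6

shape = [
    lambda x: x[0] * (2 * x[0] - 1),   # N1 (assumed form of (9.30))
    lambda x: x[1] * (2 * x[1] - 1),   # N2
    lambda x: x[2] * (2 * x[2] - 1),   # N3
    lambda x: 4 * x[0] * x[1],         # N4, as in (9.31)
    lambda x: 4 * x[1] * x[2],         # N5
    lambda x: 4 * x[2] * x[0],         # N6
]
```

Note how the vertex function vanishes both on the opposite side (ξi = 0) and on the mid-side line ξi = 1/2, which is exactly the reasoning used in the text for N4.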
The following figure shows the two types of second order shape functions, one for vertices
and one for mid-side points.
Fig. 9.12 Shape function N3(x). Fig. 9.13 Shape function N4(x).
Exercise 9.4:
For the mesh of second order triangles of the figure, find the corresponding sparsity
pattern.
Fig. 9.14
Exercise 9.5:
Use a similar reasoning as that used to define the second order shape functions (9.30)-
(9.31), to find the third order shape functions in terms of triangle area coordinates.
Note that in this case there are 3 different types of functions.
Fig. 9.15  Third order triangle: 10 nodes at area coordinates that are multiples of 1/3; vertices at (1 0 0), (0 1 0) and (0 0 1), two nodes along each side (for example (2/3 1/3 0) and (1/3 2/3 0)), and an interior node at (1/3 1/3 1/3).
APPENDIX
1. Taylor theorem
For a continuous function we have ∫_a^x f′(t) dt = f(x) − f(a); then, we can write:

f(x) = f(a) + ∫_a^x f′(t) dt  or  f(x) = f(a) + R0(x)  (A1.1)

where the remainder is R0(x) = ∫_a^x f′(t) dt.
We can now integrate R0 by parts using u = f′(t), du = f″(t) dt, dv = dt, v = −(x − t), giving:

R0(x) = ∫_a^x f′(t) dt = [−(x − t) f′(t)]_a^x + ∫_a^x (x − t) f″(t) dt

which gives, after solving and substituting in (A1.1):

f(x) = f(a) + (x − a) f′(a) + ∫_a^x (x − t) f″(t) dt  or
f ( x ) = f (a ) + f ' (a )( x − a ) + R1 ( x ) (A1.2)
We can also integrate R1 by parts, using this time u = f″(t), du = f‴(t) dt, dv = (x − t) dt, v = −(x − t)²/2, which gives:

R1(x) = ∫_a^x (x − t) f″(t) dt = [−(x − t)² f″(t)/2]_a^x + (1/2) ∫_a^x (x − t)² f‴(t) dt

and again, after substituting in (A1.2):

f(x) = f(a) + f′(a)(x − a) + (f″(a)/2)(x − a)² + (1/2) ∫_a^x (x − t)² f‴(t) dt
Continuing this process, we obtain:

f(x) = f(a) + f′(a)(x − a) + (f″(a)/2)(x − a)² + (f‴(a)/3!)(x − a)³ + ... + (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ + Rn  (A1.3)

where the remainder can be written as:  Rn(x) = ∫_a^x ((x − t)ⁿ/n!) f⁽ⁿ⁺¹⁾(t) dt  (A1.4)
To find a more useful form for the remainder we need to invoke some general mathematical theorems, which lead to:

Rn(x) = (f⁽ⁿ⁺¹⁾(ξ)/(n + 1)!)(x − a)ⁿ⁺¹  (A1.5)

for the remainder of the Taylor expansion, where ξ is a point between a and x.
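A quick numerical check of (A1.3)-(A1.5), for the illustrative choice f(x) = eˣ, a = 0, x = 1, n = 4: the truncation error of the Taylor polynomial matches the integral remainder (A1.4), and lies between the bounds implied by the Lagrange form (A1.5), since e^ξ lies between e^a and e^x.

```python
import math

a, x, n = 0.0, 1.0, 4

# Taylor polynomial of e^x about a = 0: all derivatives at 0 equal 1.
taylor = sum((x - a) ** k / math.factorial(k) for k in range(n + 1))
error = math.exp(x) - taylor

# Integral remainder (A1.4), evaluated by the midpoint rule.
M = 10000
h = (x - a) / M
R_n = h * sum((x - (a + (i + 0.5) * h)) ** n / math.factorial(n)
              * math.exp(a + (i + 0.5) * h) for i in range(M))

# Bounds from the Lagrange form (A1.5): e^a <= e^xi <= e^x on (a, x).
lo = math.exp(a) * (x - a) ** (n + 1) / math.factorial(n + 1)
hi = math.exp(x) * (x - a) ** (n + 1) / math.factorial(n + 1)
```

Here the error is about 0.00995, the lower bound 1/120 ≈ 0.00833 and the upper bound e/120 ≈ 0.0227, consistent with (A1.5).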
for k=1:n-1
v(k+1:n)=A(k+1:n,k)/A(k,k); % find multipliers
for i=k+1:n
A(i,k+1:n)=A(i,k+1:n)-v(i)*A(k,k+1:n);
end
end
In fact, this can be simplified eliminating the second loop, by noting that all operations on
rows can be performed simultaneously. The simpler version of the code is then:
for k=1:n-1
v(k+1:n)=A(k+1:n,k)/A(k,k); % find multipliers
A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-v(k+1:n)*A(k,k+1:n);
end
U=triu(A);   % triu simply puts zeros in the lower triangle
The factorization is completed by calculating the lower triangular matrix L. The complete procedure can be implemented as a function (GE, say) that returns the LU factors of a matrix A using dense storage. The function eye(n,n) returns the identity matrix of order n and tril(A,-1) gives a lower triangular matrix with the elements of A in the lower triangle, excluding the diagonal, so that L = eye(n,n) + tril(A,-1).
function x = LTriSol(L,b)
%
% Solves the triangular system Lx = b by forward substitution
%
n=length(b);
x=zeros(n,1); % a vector of zeros to start
for j=1:n-1
x(j)=b(j)/L(j,j);
b(j+1:n)=b(j+1:n)-x(j)*L(j+1:n,j);
end
x(n)=b(n)/L(n,n);
Backward substitution can be implemented in a similar form; this time the unknowns are found from the end upwards:
function x = UTriSol(L,b)
%
% Solves the triangular system Ux = b by backward substitution
%
n=length(b);
x=zeros(n,1);
for j=n:-1:2                 % from n to 2, one by one
    x(j)=b(j)/U(j,j);
    b(1:j-1)=b(1:j-1)-x(j)*U(1:j-1,j);
end
x(1)=b(1)/U(1,1);
With these functions the solution of the system of equations Ax = b can be performed in
three steps by the code:
[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);
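The same three-step solution can be sketched in NumPy. This is our translation of LTriSol and UTriSol, with a small 2 x 2 example standing in for the LU factors that GE would produce:

```python
import numpy as np

# Column-oriented forward and backward substitution, as in the notes.

def LTriSol(L, b):
    """Forward substitution for Lx = b."""
    b = b.astype(float).copy()
    n = len(b)
    x = np.zeros(n)
    for j in range(n - 1):
        x[j] = b[j] / L[j, j]
        b[j+1:] -= x[j] * L[j+1:, j]   # update the remaining right-hand side
    x[n-1] = b[n-1] / L[n-1, n-1]
    return x

def UTriSol(U, b):
    """Backward substitution for Ux = b, from the end upwards."""
    b = b.astype(float).copy()
    n = len(b)
    x = np.zeros(n)
    for j in range(n - 1, 0, -1):
        x[j] = b[j] / U[j, j]
        b[:j] -= x[j] * U[:j, j]
    x[0] = b[0] / U[0, 0]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])   # example LU factors
U = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([2.0, 7.0])
y = LTriSol(L, b)
x = UTriSol(U, y)
print(np.allclose(L @ (U @ x), b))   # True
```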
Exercise
Use the functions GE, LTriSol and UTriSol to solve the system of equations
generated by the finite difference modelling of the square coaxial structure, given by equation
(7.14). You will first have to complete the matrix A, given only schematically in (7.14), after
applying the boundary conditions. Note that because of the geometry of the structure, not all
rows will have the same pattern.
To input the matrix it might be useful to start with the Matlab command:
A = triu(tril(ones(n,n),1),-1)-5*eye(n,n)
This will generate a tridiagonal matrix of order n with −4 on the main diagonal and 1 on the
first sub- and super-diagonals. After that you will have to adjust any differences between this
matrix and A.
Compare the results with those obtained by Gauss-Seidel.
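The NumPy equivalent of that starter command (our translation; n = 5 is just an example size) shows how the two triangular masks combine into the tridiagonal band:

```python
import numpy as np

# tril(.,1) keeps everything on and below the first super-diagonal,
# triu(.,-1) keeps everything on and above the first sub-diagonal, so
# together they leave a tridiagonal band of ones; subtracting 5*eye
# turns the main diagonal into -4.

n = 5
A = np.triu(np.tril(np.ones((n, n)), 1), -1) - 5 * np.eye(n)
print(A[0])   # [-4.  1.  0.  0.  0.]
```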
Equation (7.9) is the typical equation, and putting the diagonal term on the left-hand side
gives:

φO = (φN + φS + φE + φW)/4
We could write 56 lines of code (one per equation) or, even simpler, use subscripted
variables inside a loop. In this case the elements of A are all either 0, +1 or −4, and are easily
“generated during the algorithm” rather than actually stored in an array. This simplifies the
computer program, and instead of A, the only array needed holds the current values of the
vector elements:
xT = (φ1, φ2, ... , φ56)
The program can be simplified further by keeping the vector of 56 unknowns x in a 2-D
array z(11,11), identified spatially with the 2-D Cartesian coordinates of the physical
problem (see figure). For example, z(3,2) stores the value of φ10. There is obviously scope to
improve efficiency, since with this arrangement we store values corresponding to nodes with
known, fixed voltage values, including all those nodes inside the inner conductor. None of these
actually need to be stored, but doing so makes the program simpler. In this case, the program (in
old Fortran 77) can be as simple as:
      write(*,6) n,(z(3,j),j=1,11)
      enddo
c
      write(*,7) z
    6 format(1x,i2,11f7.4)
    7 format('FINAL RESULTS=',//(/1x,11f7.4))
      stop
      end
In the program, first z(i,j) is initialized with zeros, then the values corresponding to the
inner conductor are set to one (1 V). After this, the iterations start (to a maximum of 30) and the
Gauss-Seidel equations are solved.
In order to check the convergence, the values of the potentials in one intermediate row (the
third) are printed after every iteration. We can see that after 19 iterations there are no more
changes (within 4 decimals). Naturally, a more efficient monitoring of convergence can be
implemented, whereby the changes are monitored, either on a point-by-point basis or as the norm
of the difference, and the iterations are stopped when this value is within a prescribed precision.
The results are:
1 0.0000 0.0000 0.0000 0.2500 0.3125 0.3281 0.3320 0.3330 0.0833 0.0208 0.0000
2 0.0000 0.0000 0.1250 0.3750 0.4492 0.4717 0.4785 0.4181 0.1895 0.0699 0.0000
3 0.0000 0.0469 0.2109 0.4473 0.5225 0.5472 0.5399 0.4737 0.2579 0.1098 0.0000
4 0.0000 0.0908 0.2690 0.4922 0.5653 0.5865 0.5742 0.5082 0.3016 0.1369 0.0000
5 0.0000 0.1229 0.3076 0.5205 0.5902 0.6084 0.5944 0.5299 0.3293 0.1542 0.0000
6 0.0000 0.1446 0.3326 0.5381 0.6047 0.6211 0.6067 0.5435 0.3465 0.1650 0.0000
7 0.0000 0.1586 0.3484 0.5488 0.6133 0.6288 0.6143 0.5519 0.3572 0.1716 0.0000
8 0.0000 0.1675 0.3582 0.5553 0.6184 0.6334 0.6190 0.5572 0.3637 0.1756 0.0000
9 0.0000 0.1729 0.3641 0.5592 0.6215 0.6362 0.6218 0.5604 0.3676 0.1780 0.0000
10 0.0000 0.1763 0.3678 0.5616 0.6234 0.6379 0.6236 0.5624 0.3700 0.1794 0.0000
11 0.0000 0.1783 0.3700 0.5631 0.6245 0.6390 0.6247 0.5636 0.3713 0.1802 0.0000
12 0.0000 0.1795 0.3713 0.5639 0.6253 0.6396 0.6254 0.5643 0.3722 0.1807 0.0000
13 0.0000 0.1802 0.3721 0.5645 0.6257 0.6400 0.6258 0.5647 0.3727 0.1810 0.0000
14 0.0000 0.1807 0.3726 0.5648 0.6259 0.6403 0.6260 0.5649 0.3729 0.1812 0.0000
15 0.0000 0.1810 0.3729 0.5650 0.6261 0.6404 0.6261 0.5651 0.3731 0.1813 0.0000
16 0.0000 0.1811 0.3731 0.5651 0.6262 0.6405 0.6262 0.5652 0.3732 0.1813 0.0000
17 0.0000 0.1812 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1813 0.0000
18 0.0000 0.1813 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1814 0.0000
19 0.0000 0.1813 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
20 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
21 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
22 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
23 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
24 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
25 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
26 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
27 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
28 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
29 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
30 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
FINAL RESULTS=
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.3099 0.6406 1.0000 1.0000 1.0000 1.0000 1.0000 0.6406 0.3099 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
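The Gauss-Seidel iteration described above can be sketched in Python (our own translation of the Fortran program; the position of the inner conductor, the 5 x 5 block of nodes printed as 1.0000, is read off the final results table):

```python
import numpy as np

# 11 x 11 grid: outer boundary held at 0 V, inner conductor held at 1 V.
# Only non-fixed nodes are updated with the quarter rule, in place.

n = 11
z = np.zeros((n, n))
fixed = np.zeros((n, n), dtype=bool)
fixed[0, :] = fixed[-1, :] = fixed[:, 0] = fixed[:, -1] = True  # outer: 0 V
z[3:8, 3:8] = 1.0                                               # inner: 1 V
fixed[3:8, 3:8] = True

for it in range(500):
    change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if fixed[i, j]:
                continue
            new = 0.25 * (z[i-1, j] + z[i+1, j] + z[i, j-1] + z[i, j+1])
            change = max(change, abs(new - z[i, j]))
            z[i, j] = new            # Gauss-Seidel: use updated values at once
    if change < 1e-10:               # stop when the largest change is tiny
        break

print(np.round(z[2], 4))             # the monitored third row
```

The converged third row can be compared with the printed values (0.1814, 0.3734, 0.5653, ...).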
∫_a^b (y + δy)² dx

And re-writing:

(k² + δk²) ∫_a^b (y + δy)² dx = ∫_a^b [d(y + δy)/dx]² dx    (A4.2)
Now, since we want k² to be stationary about the solution function y, we make δk² = 0, and we
examine what conditions this imposes on the function y:

k² ∫_a^b y δy dx = ∫_a^b (dy/dx)(dδy/dx) dx    (A4.5)
Integrating the RHS by parts:

k² ∫_a^b y δy dx = [(dy/dx) δy]_a^b − ∫_a^b (d²y/dx²) δy dx
Or re-arranging:

∫_a^b δy (d²y/dx² + k² y) dx = [(dy/dx) δy]_a^b    (A4.6)
Since δy is arbitrary, (A4.6) can only be valid if both sides are zero. That means that y should
satisfy the differential equation:

d²y/dx² + k² y = 0    (A4.7)

and either of the boundary conditions:

dy/dx = 0 at a and b, or δy = 0 at a and b (fixed values of y at the ends).    (A4.8)
Summarizing, we can see that imposing the condition of stationarity of (A4.1) with respect
to small variations of the function y leads to y satisfying the differential equation (A4.7), which
is the wave equation, together with either of the boundary conditions (A4.8); that is, either fixed
values of y at the ends (Dirichlet B.C.) or zero normal derivative (Neumann B.C.).
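This stationarity can be illustrated numerically. In the sketch below (our own example, using the exact solution y = sin(πx) on [0, 1] with Dirichlet conditions), the quotient k² = ∫(dy/dx)² dx / ∫ y² dx equals π², and perturbing y by ε·sin(2πx) changes k² only to second order in ε:

```python
import numpy as np

# Rayleigh quotient k^2 = ∫ (dy/dx)^2 dx / ∫ y^2 dx, following (A4.2);
# the trial and perturbation functions are our own choice.

x = np.linspace(0.0, 1.0, 4001)

def integral(f):
    """Composite trapezoidal rule on the grid x."""
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2)

def rayleigh(eps):
    y  = np.sin(np.pi * x) + eps * np.sin(2 * np.pi * x)
    yp = np.pi * np.cos(np.pi * x) + eps * 2 * np.pi * np.cos(2 * np.pi * x)
    return integral(yp**2) / integral(y**2)

k2 = rayleigh(0.0)
print(k2)                        # close to pi^2 = 9.8696...
print(rayleigh(1e-3) - k2)       # second order in eps: ~3e-5
print(rayleigh(1e-1) - k2)       # ~0.3, again second order
```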
5. Area of a Triangle
For a triangle with nodes 1, 2 and 3 with coordinates (x1, y1), (x2, y2) and (x3, y3):
[Figure: triangle with vertices 1, 2 and 3 inscribed in its bounding rectangle; A, B and C label
the three corner triangles between the triangle and the rectangle.]
The area of the triangle is: A = area of rectangle − area(A) − area(B) − area(C)

A = ½ [(x2y3 − x3y2) + (x3y1 − x1y3) + (x1y2 − x2y1)]    (A5.1)

which can be written as:

          | 1  x1  y1 |
A = ½ det | 1  x2  y2 |    (A5.2)
          | 1  x3  y3 |
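A quick check (our own example triangle) confirms that the explicit formula (A5.1) and the determinant form (A5.2) give the same area:

```python
import numpy as np

# Compare (A5.1) and (A5.2) for an arbitrary triangle.

x1, y1 = 0.0, 0.0
x2, y2 = 4.0, 1.0
x3, y3 = 1.0, 3.0

A1 = 0.5 * ((x2*y3 - x3*y2) + (x3*y1 - x1*y3) + (x1*y2 - x2*y1))  # (A5.1)
A2 = 0.5 * np.linalg.det(np.array([[1.0, x1, y1],
                                   [1.0, x2, y2],
                                   [1.0, x3, y3]]))               # (A5.2)
print(A1, A2)   # both 5.5 (A2 up to rounding in the determinant)
```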
Evaluating this expression at each node of a triangle (with nodes numbered 1, 2 and 3):

u1 = p + q x1 + r y1
u2 = p + q x2 + r y2    or, in matrix form:    [u1 u2 u3]T = [1 x1 y1; 1 x2 y2; 1 x3 y3] [p q r]T    (A6.2)
u3 = p + q x3 + r y3
And from here we can calculate the value of the constants p, q and r in terms of the nodal values
and the coordinates of the nodes:

[p q r]T = [1 x1 y1; 1 x2 y2; 1 x3 y3]⁻¹ [u1 u2 u3]T    (A6.3)
u(x,y) ≈ ũ(x,y) = u1 N1(x,y) + u2 N2(x,y) + u3 N3(x,y) = (N1 N2 N3) [u1 u2 u3]T    (A6.5)
we have finally:
(N1 N2 N3) = (1 x y) [1 x1 y1; 1 x2 y2; 1 x3 y3]⁻¹    (A6.6)
Solving the right-hand side (inverting the matrix and multiplying) gives the expression for
each shape function (9.16):
Ni(x,y) = (ai + bi x + ci y) / (2A)    (A6.7)

where:

a1 = x2y3 − x3y2    b1 = y2 − y3    c1 = x3 − x2
a2 = x3y1 − x1y3    b2 = y3 − y1    c2 = x1 − x3    (A6.8)
a3 = x1y2 − x2y1    b3 = y1 − y2    c3 = x2 − x1
Note that from (A5.1), the area of the triangle can be written as:
A = ½ (a1 + a2 + a3)    (A6.9)
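The defining property of the shape functions, Ni(xj, yj) = δij, can be verified directly from the coefficients (A6.8); the triangle coordinates below are an arbitrary example of ours:

```python
import numpy as np

# Build the coefficients (A6.8) and check Ni(xj, yj) = delta_ij and (A6.9).

xs = np.array([0.0, 4.0, 1.0])
ys = np.array([0.0, 1.0, 3.0])

a = np.array([xs[1]*ys[2] - xs[2]*ys[1],
              xs[2]*ys[0] - xs[0]*ys[2],
              xs[0]*ys[1] - xs[1]*ys[0]])
b = np.array([ys[1] - ys[2], ys[2] - ys[0], ys[0] - ys[1]])
c = np.array([xs[2] - xs[1], xs[0] - xs[2], xs[1] - xs[0]])
A = 0.5 * np.sum(a)                          # area, from (A6.9)

def N(i, x, y):
    return (a[i] + b[i]*x + c[i]*y) / (2*A)  # shape function (A6.7)

# Evaluate every Ni at every node: the result should be the identity.
vals = np.array([[N(i, xs[j], ys[j]) for j in range(3)] for i in range(3)])
print(np.allclose(vals, np.eye(3)))   # True
```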