
University of California, Irvine

Department of Civil and Environmental Engineering

CEE-290: Models & Data, Jasper A. Vrugt

APPENDIX B: DERIVATION OF GAUSS-NEWTON

May 27, 2014



1 Curve Fitting: Theory

Consider a set of m data points, (x_1, y_1), (x_2, y_2), ..., (x_m, y_m), and a curve (model function), y = f(x, β), that in addition to the variable x also depends on d parameters, β = {β_1, β_2, ..., β_d}, with m ≥ d. It is desired to find the vector of parameters β such that the curve best fits the data. Based on the early work of Carl Friedrich Gauss (1795) we use a least-squares approach. This method finds its optimum when the sum of squared residuals, SSR,

\mathrm{SSR} = \sum_{i=1}^{m} r_i(\beta)^2    (1)

is at its minimum. The residual, r_i(β), is defined to be the difference between the observed value, y_i, and the value predicted by the model,

r_i(\beta) = y_i - f(x_i, \beta).    (2)

The minimum of Equation (1) can be found by setting its gradient to zero,

\frac{\partial \mathrm{SSR}}{\partial \beta_k} = 2 \sum_{i=1}^{m} r_i(\beta) \frac{\partial r_i(\beta)}{\partial \beta_k} = 0 \qquad (k = 1, \ldots, d),    (3)

which is equivalent to

-2 \sum_{i=1}^{m} r_i(\beta) \frac{\partial f(x_i, \beta)}{\partial \beta_k} = 0 \qquad (k = 1, \ldots, d).    (4)

These d gradient equations apply to all least-squares problems, yet with differing expressions for the model (function) and its partial derivatives.
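To make the notation concrete, the following Python/NumPy sketch evaluates the residuals of Equation (2) and the SSR of Equation (1) for a candidate parameter vector; the model and the synthetic data are illustrative only and not part of the derivation.

```python
import numpy as np

def residuals(beta, f, x, y):
    """Residuals r_i(beta) = y_i - f(x_i, beta) of Equation (2)."""
    return y - f(x, beta)

def ssr(beta, f, x, y):
    """Sum of squared residuals of Equation (1)."""
    r = residuals(beta, f, x, y)
    return np.sum(r**2)

# Illustrative example: quadratic model y = beta_1*x + beta_2*x^2
f = lambda x, beta: beta[0]*x + beta[1]*x**2
x = np.linspace(0.0, 1.0, 11)
y = f(x, np.array([2.0, -1.0])) + 0.01*np.random.randn(x.size)
print(ssr(np.array([1.8, -0.9]), f, x, y))
```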

2 Linear Least Squares

If we consider the model y = f(x, β) to be linear in its parameters, for instance, y(x) = β_1 x + β_2 x^2 + β_3 x^3, then the derivatives are

\frac{\partial f(x_i, \beta)}{\partial \beta_k} = x_{ik}    (5)
Substitution of the expressions for the residuals and the derivatives into the gradient equations (4) gives

\frac{\partial \mathrm{SSR}}{\partial \beta_r} = 2 \sum_{i=1}^{m} \Bigl( y_i - \sum_{k=1}^{d} x_{ik}\,\beta_k \Bigr)(-x_{ir}) = 0 \qquad (r = 1, \ldots, d)    (6)

Upon rearrangement we obtain the normal equations


\sum_{i=1}^{m} \sum_{k=1}^{d} x_{ir}\, x_{ik}\, \beta_k = \sum_{i=1}^{m} x_{ir}\, y_i \qquad (r = 1, \ldots, d)    (7)

which can be written in matrix notation as


(X^{T} X)\,\beta = X^{T} Y    (8)

where T denotes the transpose, X is a m × d matrix with the measured x values (also called basis functions) and Y is a m × 1 vector with the corresponding observations. The direct solution for β is then

\beta = (X^{T} X)^{-1} X^{T} Y    (9)

which is easy to solve in a programming language such as MATLAB.
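As a minimal illustration, the Python/NumPy sketch below solves the normal equations of Equation (9) for the cubic-polynomial example above; the synthetic data are invented for demonstration only. (In MATLAB the equivalent one-liner is X\Y.)

```python
import numpy as np

# Synthetic data from y = 2x - x^2 + 0.5x^3 plus a little noise (illustrative only)
x = np.linspace(0.0, 2.0, 25)
y = 2.0*x - 1.0*x**2 + 0.5*x**3 + 0.05*np.random.randn(x.size)

# Design matrix X with the basis functions x, x^2, x^3 as its columns (m x d)
X = np.column_stack([x, x**2, x**3])

# Normal equations, Equation (9): beta = (X^T X)^{-1} X^T Y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable alternative that avoids forming X^T X explicitly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)
```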



3 Non-linear Least Squares

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model, f(x_i, β), that is non-linear in its d unknown parameters, β (m > d). In this case, the derivative

\frac{\partial f(x_i, \beta)}{\partial \beta_k},    (10)

is not constant, but depends on the actual values of the parameters, β. This introduces some problems as Equation (9) no longer gives a direct solution. Instead, we need an approach that iteratively refines β starting from some initial values, β = β_0.

Newton's method  Suppose f : [a, b] → R is a differentiable function defined on the interval [a, b] with values in the real numbers R. The formula for converging on the root can be easily derived. Suppose we have some current approximation, x_j. Then we can derive the formula for a better approximation, x_{j+1}, by referring to Figure 1 below.

Figure 1: The function f is shown in blue and the tangent line in red. We see that x_{j+1} is a better approximation than x_j for the root of the function f(x). [The original figure labels the function, the tangent line, and the successive approximations x_0, x_1, x_2, x_3, x_4 along the x-axis.]

The equation of the tangent line to the curve y = f(x) at the point x = x_j is

y = f(x_j) + f'(x_j)(x - x_j),    (11)

where f'(x_j) denotes the derivative of the function f at x = x_j.



The x-intercept of this line (the value of x such that y = 0) is then used as the next approximation to the root, x_{j+1}. In other words, setting y to zero and x to x_{j+1} gives

0 = f(x_j) + f'(x_j)(x_{j+1} - x_j),    (12)

which gives us the iterative solution

x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)}    (13)

This is Newton's method for finding the root of a function. In words, we start with some initial value of x at j = 0, also referred to as x_0. We then calculate the function value, f(x_0), and its derivative, f'(x_0), at x = x_0. We then use Equation (13) to propose a new value, x_1. This should be a better estimate of the root than x_0. We continue until the absolute value of the change between two successive x-values or function values is smaller than some small tolerance value defined by the user (typically 10^{-6}). In the absence of any prior knowledge of where the zero might lie, a trial-and-error approach might narrow the possibilities to a reasonably small interval. The method will usually converge, provided the initial guess is close enough to the unknown root and f'(x_0) ≠ 0.
If an analytical expression for f'(x) is not readily available, we can use a numerical approximation of the derivative,

f'(x_j) \approx \frac{f(x_j) - f(x_{j-1})}{x_j - x_{j-1}}    (14)

using the values of x and f(x) at the current and previous iteration. This forms the basis of the Secant method. Unlike Newton's method, the Secant method uses this numerical estimate of the first derivative in Equation (13).
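A minimal Python sketch of both root finders is given below; the test function and starting values are illustrative choices only.

```python
def newton(f, dfdx, x0, tol=1e-6, maxit=100):
    """Newton's method, Equation (13): iterate until the change in x is < tol."""
    x = x0
    for _ in range(maxit):
        x_new = x - f(x) / dfdx(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def secant(f, x0, x1, tol=1e-6, maxit=100):
    """Secant method: replace f'(x_j) in Equation (13) by the finite
    difference of Equation (14)."""
    for _ in range(maxit):
        slope = (f(x1) - f(x0)) / (x1 - x0)
        x2 = x1 - f(x1) / slope
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

# Example: the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
print(newton(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))
print(secant(lambda x: x**2 - 2, x0=1.0, x1=2.0))
```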

Gauss-Newton method  Newton's method can also be used to find a minimum or maximum of a function. The derivative is zero at a minimum or maximum, so minima and maxima can be found by applying Newton's method to the derivative. The iteration becomes

x_{j+1} = x_j - \frac{f'(x_j)}{f''(x_j)}    (15)

which, applied to the SSR of our model f(x_i, β), is equivalent to

\beta_{j+1} = \beta_j - H(\beta)^{-1} g(\beta)    (16)

where g(β) denotes the gradient vector of the SSR and H(β) denotes the Hessian matrix of the SSR. The d-vector of differences, Δβ = β_{j+1} - β_j, is also called the shift vector. The elements of the gradient vector are given by

g_k(\beta) = \frac{\partial \mathrm{SSR}}{\partial \beta_k} = 2 \sum_{i=1}^{m} r_i(\beta) \frac{\partial r_i(\beta)}{\partial \beta_k} \qquad (k = 1, \ldots, d),    (17)

Elements of the Hessian can be calculated by differentiating the gradient elements, g_k, with respect to β_l

H_{kl}(\beta) = 2 \sum_{i=1}^{m} \left( \frac{\partial r_i(\beta)}{\partial \beta_k} \frac{\partial r_i(\beta)}{\partial \beta_l} + r_i(\beta) \frac{\partial^2 r_i(\beta)}{\partial \beta_k\, \partial \beta_l} \right)    (18)

The Gauss-Newton method ignores the second-order derivative terms (the second term in the previous expression). Thus, the Hessian matrix is approximated using

H_{kl}(\beta) \approx 2 \sum_{i=1}^{m} J_{ik}(\beta)\, J_{il}(\beta) \qquad (k = 1, \ldots, d),\; (l = 1, \ldots, d)    (19)

where J_{ik} = ∂r_i(β)/∂β_k are entries of the Jacobian (sensitivity) matrix, J_r. If we use matrix notation we can write the gradient and the approximate Hessian as

g(\beta) = 2 J_r(\beta)^{T} r(\beta), \qquad H(\beta) \approx 2 J_r(\beta)^{T} J_r(\beta)    (20)

If we now substitute these equations in the recurrence relation of Equation (16) we derive

\beta_{j+1} = \beta_j - \bigl( J_r(\beta)^{T} J_r(\beta) \bigr)^{-1} J_r(\beta)^{T} r(\beta)    (21)

where r(β) = {r_1(β), ..., r_m(β)} is a m × 1 vector of residuals. If you interpret Equation (21) you will notice a close agreement between this equation and the direct solution for least-squares fitting (Equation (9)). This concludes the derivation of Gauss-Newton.
In words, we start at some initial values, β_0, defined by the user. We calculate the model predictions at β_0, f(x_i, β_0), and compute the residual vector, r. Then, we compute the Jacobian and the approximate Hessian matrices, and use Equation (21) to propose a new solution, β_1. We then proceed with the same steps, until the difference between two successive function values or values of β is smaller than some small threshold.
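The sketch below implements this recipe in Python/NumPy. The finite-difference Jacobian, the damping scalar alpha, and the exponential test model are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def jacobian_r(f, x, beta, h=1e-6):
    """Finite-difference Jacobian J_ik = d r_i / d beta_k; since r_i = y_i - f,
    this equals -df/dbeta_k."""
    J = np.empty((x.size, beta.size))
    for k in range(beta.size):
        db = np.zeros(beta.size)
        db[k] = h
        J[:, k] = -(f(x, beta + db) - f(x, beta - db)) / (2.0*h)
    return J

def gauss_newton(f, x, y, beta0, alpha=1.0, tol=1e-6, maxit=50):
    """Gauss-Newton iteration of Equation (21); the update is scaled by a
    fixed scalar alpha (alpha = 1 is the undamped method)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(maxit):
        r = y - f(x, beta)                           # residuals, Equation (2)
        J = jacobian_r(f, x, beta)                   # Jacobian of the residuals
        dbeta = -np.linalg.solve(J.T @ J, J.T @ r)   # shift vector, Equation (21)
        beta = beta + alpha*dbeta
        if np.max(np.abs(alpha*dbeta)) < tol:
            break
    return beta

# Illustrative example: exponential decay model y = beta_1 * exp(-beta_2 * x)
f = lambda x, beta: beta[0]*np.exp(-beta[1]*x)
x = np.linspace(0.0, 4.0, 30)
y = f(x, np.array([2.0, 1.3])) + 0.01*np.random.randn(x.size)
print(gauss_newton(f, x, y, beta0=[1.0, 1.0]))
```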
Figure 2 provides an illustration of Gauss-Newton for a nonlinear model with two (unknown) parameter values. This problem is equivalent to the slug model used in homework 1. Color coding is used to give insights into the shape of the response surface (the objective function (sum of squared residuals) mapped out in the parameter space). The different iterations of Gauss-Newton are indicated with the blue squares. The connecting line is used to summarize the trajectory towards the global minimum. To avoid overshooting, the shift vector, Δβ = β_{j+1} - β_j, computed from Equation (21) was multiplied with a scalar, α = 0.2. This will be discussed in the next section, and an improved local search method will be discussed that alternates between Gauss-Newton and steepest descent to further improve search efficiency.

Figure 2: Gauss-Newton example. [Contour map of log10(SSR) over the (S, T) parameter space; the blue squares mark the successive Gauss-Newton iterates.]

4 Levenberg-Marquardt

With the Gauss-Newton method the sum of squares SSR may not decrease at every iteration. Although Δβ is a descent direction, the method might overshoot. If divergence occurs, one solution is to employ only a fraction, α, of the increment (shift) vector, Δβ, in the updating formula

\beta_{j+1} = \beta_j + \alpha\, \Delta\beta    (22)

In other words, the increment vector is too long, but it points downhill, so going just a part of the way will decrease the objective function SSR. An optimal value for α can be found by using a line search algorithm, that is, the magnitude of α is determined by finding the value that minimizes the SSR, usually using a direct search method in the interval 0 < α < 1.
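A simple grid-based direct search over α is sketched below; the grid size and the helper names are illustrative. In the damped Gauss-Newton sketch above, the value returned here would replace the fixed scalar alpha.

```python
import numpy as np

def line_search_alpha(ssr_fun, beta, dbeta, n_grid=20):
    """Pick the fraction alpha in (0, 1] that minimizes the SSR along the shift
    vector dbeta; ssr_fun is a callable returning the SSR for a parameter vector."""
    alphas = np.linspace(1.0/n_grid, 1.0, n_grid)
    values = [ssr_fun(beta + a*dbeta) for a in alphas]
    return alphas[int(np.argmin(values))]
```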
In cases where the direction of the shift vector is such that the optimal fraction, α, is close to zero, an alternative method for handling divergence is the use of the Levenberg-Marquardt algorithm. The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,

\Delta\beta = -\bigl( J_r(\beta)^{T} J_r(\beta) + \lambda I_d \bigr)^{-1} J_r(\beta)^{T} r(\beta),    (23)

where I_d is the d-dimensional identity matrix. If λ → ∞, then the direction of Δβ approaches the direction of the negative gradient, -J_r(β)^{T} r(β) (the steepest descent method).
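The sketch below implements the update of Equation (23), reusing the jacobian_r helper from the Gauss-Newton sketch above. The rule for adapting λ (divide by 10 after a successful step, multiply by 10 after a failed one) is a common heuristic and is not prescribed in the text.

```python
import numpy as np

def levenberg_marquardt(f, x, y, beta0, lam=1e-2, tol=1e-6, maxit=100):
    """Levenberg-Marquardt iteration based on Equation (23)."""
    beta = np.asarray(beta0, dtype=float)
    d = beta.size
    ssr = lambda b: np.sum((y - f(x, b))**2)        # Equation (1)
    for _ in range(maxit):
        r = y - f(x, beta)
        J = jacobian_r(f, x, beta)                  # finite-difference Jacobian (see above)
        dbeta = -np.linalg.solve(J.T @ J + lam*np.eye(d), J.T @ r)
        if ssr(beta + dbeta) < ssr(beta):
            beta, lam = beta + dbeta, lam/10.0      # accept: behave more like Gauss-Newton
        else:
            lam *= 10.0                             # reject: move towards steepest descent
        if np.max(np.abs(dbeta)) < tol:
            break
    return beta
```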
The so-called Marquardt parameter, λ, may also be optimized using a line search, but this is inefficient as the shift vector must be re-calculated every time λ is changed. Other more efficient strategies are discussed in class. Figure 3 provides an illustration of Levenberg-Marquardt for the two-parameter slug model of homework 1. It is directly visible that Levenberg-Marquardt initially follows the steepest descent direction, and hence requires fewer iterations than Gauss-Newton to locate the global minimum. By alternating between Gauss-Newton and steepest descent, the Levenberg-Marquardt method is less sensitive to overshooting. This constitutes an important advantage for practical application.

Figure 3: Levenberg-Marquardt example. [Contour map of log10(SSR) over the (S, T) parameter space.]

