is at its minimum. The residual, $r_i(\theta)$, is defined to be the difference between the observed value, $y_i$, and the value predicted by the model,

$$ r_i(\theta) = y_i - f(x_i, \theta). \qquad (2) $$
The minimum of Equation (1) can be found by setting its gradient to zero,

$$ \frac{\partial\,\mathrm{SSR}}{\partial \theta_k} = 2 \sum_{i=1}^{m} r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta_k} \qquad (k = 1, \ldots, d), \qquad (3) $$
which, since $\partial r_i(\theta)/\partial \theta_k = -\,\partial f(x_i, \theta)/\partial \theta_k$, is equivalent to

$$ -2 \sum_{i=1}^{m} r_i(\theta)\, \frac{\partial f(x_i, \theta)}{\partial \theta_k} = 0 \qquad (k = 1, \ldots, d). \qquad (4) $$
These $d$ gradient equations apply to all least squares problems; what differs from problem to problem is the expression for the model (function) and its partial derivatives.
If we consider the model $y = f(x, \theta)$ to be linear in its parameters, for instance $y(x) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$, then the derivatives are

$$ \frac{\partial f(x_i, \theta)}{\partial \theta_k} = x_{ik}. \qquad (5) $$
Substitution of the expressions for the residuals and the derivatives into the gradient equations (4) gives

$$ \frac{\partial\,\mathrm{SSR}}{\partial \theta_r} = -2 \sum_{i=1}^{m} \left( y_i - \sum_{k=1}^{d} x_{ik}\, \theta_k \right) x_{ir} = 0 \qquad (r = 1, \ldots, d). \qquad (6) $$

In matrix notation these are the normal equations, $X^T X \theta = X^T Y$, where $X$ is the $m \times d$ matrix with entries $x_{ik}$ and $Y$ is the $m \times 1$ vector of observations. Their solution is the direct least-squares estimate

$$ \theta = (X^T X)^{-1} X^T Y. \qquad (9) $$
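As a concrete illustration, the following is a minimal NumPy sketch of the direct solution of Equation (9); the synthetic data and all variable names are illustrative, not part of the original notes.

```python
import numpy as np

# Synthetic example: m = 50 observations of a model that is linear in
# its d = 3 parameters, y = theta_1*x_1 + theta_2*x_2 + theta_3*x_3.
rng = np.random.default_rng(1)
m, d = 50, 3
X = rng.uniform(0.0, 1.0, size=(m, d))        # design matrix, X[i, k] = x_ik
theta_true = np.array([2.0, -1.0, 0.5])
Y = X @ theta_true + 0.05 * rng.standard_normal(m)  # noisy observations

# Equation (9): theta = (X^T X)^{-1} X^T Y. In practice we solve the
# normal equations rather than forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)   # close to theta_true
```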
Non-linear least squares is the form of least squares analysis used to fit a set of $m$ observations with a model, $f(x_i, \theta)$, that is non-linear in its $d$ unknown parameters $(m > d)$. In this case, the derivative

$$ \frac{\partial f(x_i, \theta)}{\partial \theta_k}, \qquad (10) $$

is not constant but depends on the actual values of the parameters, $\theta$. This introduces some problems, as Equation (9) no longer gives a direct solution. Instead, we need an iterative approach that refines the parameter values starting from some initial values, $\theta = \theta_0$.
Figure 1: The function $f$ is shown in blue and the tangent line in red. We see that $x_{j+1}$ is a better approximation than $x_j$ for the root $x$ of the function $f(x)$.
The equation of the tangent line to the curve $y = f(x)$ at the point $x = x_j$ is

$$ y = f(x_j) + f'(x_j)(x - x_j). \qquad (11) $$
The x-intercept of this line (the value of $x$ such that $y = 0$) is then used as the next approximation to the root, $x_{j+1}$. In other words, setting $y$ to zero and $x$ to $x_{j+1}$ and solving gives

$$ x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)}. \qquad (13) $$
This is Newton's method for finding the root of a function. In words, we start with some initial value of $x$ at $j = 0$, also referred to as $x_0$. We then calculate the function value, $f(x_0)$, and its derivative, $f'(x_0)$, at $x = x_0$, and use Equation (13) to propose a new value, $x_1$. This estimate of $x$ should be a better estimate of the root than $x_0$. We continue until the absolute value of the change between two successive $x$-values or function values is smaller than some small tolerance value defined by the user (typically $10^{-6}$). In the absence of any prior knowledge of where the zero might lie, a trial-and-error approach can narrow the possibilities to a reasonably small interval. The method will usually converge, provided this initial guess is close enough to the unknown root and $f'(x_0) \neq 0$.
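A minimal Python sketch of this iteration follows; the test function and starting value are illustrative examples, not taken from the text.

```python
def newton(f, fprime, x0, tol=1e-6, max_iter=100):
    """Find a root of f using Newton's method, Equation (13)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)        # Equation (13)
        if abs(x_new - x) < tol:            # stopping rule from the text
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")

# Example: root of f(x) = x^2 - 2, i.e. sqrt(2), starting from x0 = 1.
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)   # 1.41421356...
```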
If an analytical expression for $f'(x)$ is not readily available, we can use a numerical approximation of the derivative,

$$ f'(x_j) \approx \frac{f(x_j) - f(x_{j-1})}{x_j - x_{j-1}}, \qquad (14) $$

using the values of $x$ and $f(x)$ at the current and previous iteration. This forms the basis of the Secant method. Unlike Newton's method, the Secant method uses this numerical estimate of the first derivative to solve Equation (13).
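A corresponding sketch of the Secant method, replacing the analytical derivative with Equation (14); again, the test function and starting values are illustrative.

```python
def secant(f, x0, x1, tol=1e-6, max_iter=100):
    """Find a root of f with the Secant method: Newton's iteration,
    Equation (13), with the derivative replaced by Equation (14)."""
    for _ in range(max_iter):
        slope = (f(x1) - f(x0)) / (x1 - x0)   # Equation (14)
        x2 = x1 - f(x1) / slope               # Equation (13)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    raise RuntimeError("Secant method did not converge")

print(secant(lambda x: x**2 - 2, 1.0, 2.0))   # 1.41421356...
```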
Newton's method can also be used to locate a minimum of a function: since the derivative vanishes at the minimum, we apply the iteration of Equation (13) to $f'(x)$ instead of $f(x)$, giving

$$ x_{j+1} = x_j - \frac{f'(x_j)}{f''(x_j)}. \qquad (15) $$

For the $d$-dimensional problem of minimizing the SSR with respect to the parameters $\theta$, this recurrence generalizes to

$$ \theta_{j+1} = \theta_j - H(\theta_j)^{-1} g(\theta_j), \qquad (16) $$
where $g(\theta)$ denotes the gradient vector of the SSR and $H(\theta)$ denotes the Hessian matrix of the SSR. The $d$-vector of differences, $\Delta\theta = \theta_{j+1} - \theta_j$, is also called the shift vector. The elements of the gradient are given by (see Equation (3))

$$ g_k(\theta) = \frac{\partial\,\mathrm{SSR}}{\partial \theta_k} = 2 \sum_{i=1}^{m} r_i(\theta)\, \frac{\partial r_i(\theta)}{\partial \theta_k} \qquad (k = 1, \ldots, d). \qquad (17) $$
Elements of the Hessian can be calculated by differentiating the gradient elements, $g_k$, with respect to $\theta_l$,

$$ H_{kl}(\theta) = 2 \sum_{i=1}^{m} \left( \frac{\partial r_i(\theta)}{\partial \theta_k}\, \frac{\partial r_i(\theta)}{\partial \theta_l} + r_i(\theta)\, \frac{\partial^2 r_i(\theta)}{\partial \theta_k\, \partial \theta_l} \right). \qquad (18) $$
The Gauss-Newton method ignores the second-order derivative terms (the second term in the previous expression). Thus the Hessian matrix is approximated using

$$ H_{kl}(\theta) \approx 2 \sum_{i=1}^{m} J_{ik}(\theta)\, J_{il}(\theta) \qquad (k = 1, \ldots, d),\ (l = 1, \ldots, d), \qquad (19) $$

where $J_{ik} = \frac{\partial r_i(\theta)}{\partial \theta_k}$ are entries of the $m \times d$ Jacobian (sensitivity) matrix $J_r$. If we use matrix notation we can write the gradient and the approximate Hessian as

$$ g(\theta) = 2\, J_r(\theta)^T r(\theta), \qquad H(\theta) \approx 2\, J_r(\theta)^T J_r(\theta). \qquad (20) $$
If we now substitute these Equations into the recurrence relation of Equation (16) we derive

$$ \theta_{j+1} = \theta_j - \left( J_r(\theta)^T J_r(\theta) \right)^{-1} J_r(\theta)^T r(\theta), \qquad (21) $$

where $r(\theta) = \{r_1(\theta), \ldots, r_m(\theta)\}^T$ is an $m \times 1$ vector of residuals. If you interpret (21) you will notice a close agreement between this Equation and the direct solution for least-squares fitting (Equation (9)). This concludes the derivation of Gauss-Newton.
In words, we start at some initial values, $\theta_0$, defined by the user. We calculate the model predictions at $\theta_0$, $f(x_i, \theta_0)$, and compute the residual vector, $r$. Then, we compute the Jacobian and approximate Hessian matrices, and use Equation (21) to propose a new solution, $\theta_1$. We proceed with the same steps until the difference between two successive function values or values of $\theta$ is smaller than some small threshold.
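The following sketch implements this iteration with NumPy, assuming a user-supplied model and its analytical Jacobian; the exponential-decay example and all names are illustrative, not from the text.

```python
import numpy as np

def gauss_newton(f, jac_f, x, y, theta0, tol=1e-6, max_iter=50):
    """Gauss-Newton iteration, Equation (21). f(x, theta) returns model
    predictions; jac_f(x, theta) returns the m x d matrix of derivatives
    of f with respect to theta. J_r = -jac_f because r = y - f."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        r = y - f(x, theta)                  # residual vector, Equation (2)
        Jr = -jac_f(x, theta)                # Jacobian of the residuals
        step = np.linalg.solve(Jr.T @ Jr, Jr.T @ r)
        theta = theta - step                 # Equation (21)
        if np.max(np.abs(step)) < tol:
            return theta
    raise RuntimeError("Gauss-Newton did not converge")

# Example: exponential decay, y = theta_1 * exp(-theta_2 * x).
f = lambda x, th: th[0] * np.exp(-th[1] * x)
jac_f = lambda x, th: np.column_stack([np.exp(-th[1] * x),
                                       -th[0] * x * np.exp(-th[1] * x)])
x = np.linspace(0.0, 4.0, 30)
y = f(x, [2.0, 0.8]) + 0.01 * np.random.default_rng(0).standard_normal(30)
print(gauss_newton(f, jac_f, x, y, theta0=[1.0, 1.0]))
```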
Figure 2 provides an illustration of Gauss-Newton for a nonlinear model with two (unknown) parameter values. The figure plots the successive iterates on the response surface, the objective function (sum of squared residuals) shown on a $\log_{10}(\mathrm{SSR})$ scale. The steepest descent method is discussed in the next section, and an improved local search method will be discussed that alternates between Gauss-Newton and steepest descent to further improve search efficiency.

Figure 2: Gauss-Newton example.
4 Levenberg-Marquardt
With the Gauss-Newton method the sum of squares SSR may not decrease at every iteration. Although $\Delta\theta$ is a descent direction, the method might overshoot. If divergence occurs, one solution is to employ a fraction, $\alpha$, of the increment (shift) vector, $\Delta\theta$, in the updating formula

$$ \theta_{j+1} = \theta_j + \alpha\, \Delta\theta. \qquad (22) $$

In other words, the increment vector is too long, but it points downhill, so going just part of the way will decrease the objective function SSR. An optimal value for $\alpha$ can be found by using a line search algorithm; that is, the magnitude of $\alpha$ is determined by finding the value that minimizes the SSR, usually using a direct search method in the interval $0 < \alpha < 1$.
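A minimal sketch of such a direct search for $\alpha$ follows; the grid size and helper name are illustrative choices, as the text does not prescribe a specific algorithm.

```python
import numpy as np

def best_fraction(ssr, theta, dtheta, n_grid=20):
    """Direct search for the fraction alpha in Equation (22): evaluate
    SSR(theta + alpha * dtheta) on a grid over 0 < alpha <= 1 and keep
    the alpha with the smallest objective value."""
    alphas = np.linspace(0.05, 1.0, n_grid)
    values = [ssr(theta + a * dtheta) for a in alphas]
    return alphas[int(np.argmin(values))]

# Usage (illustrative): alpha = best_fraction(ssr, theta_j, dtheta)
#                       theta_next = theta_j + alpha * dtheta   # Equation (22)
```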
In cases where the direction of the shift vector is such that the optimal fraction, $\alpha$, is close to zero, an alternative method for handling divergence is the use of the Levenberg-Marquardt algorithm. The normal equations are modified in such a way that the increment vector is rotated towards the direction of steepest descent,

$$ \Delta\theta = -\left( J_r(\theta)^T J_r(\theta) + \lambda I_d \right)^{-1} J_r(\theta)^T r(\theta), \qquad (23) $$

where $I_d$ is the $d$-dimensional identity matrix. If $\lambda \to \infty$, then the direction of $\Delta\theta$ approaches the direction of the negative gradient, $-J_r(\theta)^T r(\theta)$ (steepest descent method).
The so-called Marquardt parameter, $\lambda$, may also be optimized using line search, but this is inefficient, as the shift vector must be re-calculated every time $\lambda$ is changed. The method initially moves in the steepest descent direction, and hence requires fewer iterations than Gauss-Newton to locate the global minimum. By alternating between Gauss-Newton and steepest descent, the Levenberg-Marquardt method is less sensitive to overshooting. This constitutes an important advantage for practical application.

Figure 3: Levenberg-Marquardt example.
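A minimal sketch of Equation (23) with a simple damping schedule (increase $\lambda$ when a step fails to reduce the SSR, decrease it when it succeeds); the factor of 3 and all names are illustrative choices, not from the text. As $\lambda$ shrinks the step approaches the Gauss-Newton update of Equation (21), and as it grows the step rotates toward steepest descent.

```python
import numpy as np

def lm_step(Jr, r, lam, d):
    """One Levenberg-Marquardt shift vector, Equation (23):
    dtheta = -(Jr^T Jr + lam * I_d)^{-1} Jr^T r."""
    A = Jr.T @ Jr + lam * np.eye(d)
    return -np.linalg.solve(A, Jr.T @ r)

def levenberg_marquardt(f, jac_f, x, y, theta0, lam=1e-2,
                        tol=1e-6, max_iter=100):
    """Accept a step if it lowers the SSR and decrease lam (toward
    Gauss-Newton); otherwise increase lam (toward steepest descent)."""
    theta = np.asarray(theta0, dtype=float)
    ssr = lambda th: np.sum((y - f(x, th)) ** 2)
    for _ in range(max_iter):
        r = y - f(x, theta)
        Jr = -jac_f(x, theta)
        dtheta = lm_step(Jr, r, lam, theta.size)
        if ssr(theta + dtheta) < ssr(theta):
            theta, lam = theta + dtheta, lam / 3.0
            if np.max(np.abs(dtheta)) < tol:
                return theta
        else:
            lam *= 3.0
    return theta
```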