
Lecture 2

3B1B Optimization

Michaelmas 2014

Newton's method
Line search

Quasi-Newton methods
Least-Squares and Gauss-Newton methods
Downhill simplex (amoeba) algorithm

A. Zisserman

Optimization for General Functions


$$f(x, y) = e^{x}\,(4x^2 + 2y^2 + 4xy + 2x + 1)$$
[Figure: plot of f(x, y)]

Apply methods developed using quadratic Taylor series expansion

Rosenbrock's function
$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2$$
[Figure: contour plot of the Rosenbrock function]
Minimum is at [1, 1]
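The contour plot can be reproduced with a few lines of Matlab; a minimal sketch, where the grid range and contour levels are illustrative choices rather than values from the slides:

% Sketch: contour plot of the Rosenbrock function (grid and levels are illustrative)
[X, Y] = meshgrid(-2:0.05:2, -1:0.05:3);
F = 100*(Y - X.^2).^2 + (1 - X).^2;
contour(X, Y, F, logspace(-1, 3, 20));   % log-spaced levels show the curved valley
hold on; plot(1, 1, 'r*');               % minimum at [1, 1]
xlabel('x'); ylabel('y'); title('Rosenbrock function');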

Steepest descent
The 1D line minimization must be performed using one of the earlier
methods (usually cubic polynomial interpolation)
[Figure: steepest descent on the Rosenbrock function, full view and zoomed view]
The zig-zag behaviour is clear in the zoomed view (100 iterations)


The algorithm crawls down the valley

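A minimal Matlab sketch of steepest descent on the Rosenbrock function; a crude backtracking rule stands in for the cubic-interpolation line search mentioned above, and the starting point and tolerance are illustrative:

% Sketch: steepest descent with a simple backtracking line search
f    = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
grad = @(x) [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1)); 200*(x(2) - x(1)^2)];
x = [-1.9; 2];                            % illustrative starting point
for n = 1:100
    g = grad(x);
    if norm(g) < 1e-3, break; end
    alpha = 1;                            % backtrack until the step decreases f
    while f(x - alpha*g) > f(x) && alpha > 1e-12
        alpha = alpha/2;
    end
    x = x - alpha*g;                      % step along the negative gradient
end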

Performance issues for optimization algorithms


1. Number of iterations required

2. Cost per iteration

3. Memory footprint

4. Region of convergence

Recall from lecture 1: Newton's method in 1D


Fit a quadratic approximation to f (x) using both first and second derivatives at x.

Expand f (x) locally using a Taylor series.


$$f(x + \delta x) = f(x) + \delta x\, f'(x) + \frac{\delta x^2}{2}\, f''(x) + \text{h.o.t.}$$

Find the $\delta x$ which minimizes this local quadratic approximation.


$$\delta x = -\frac{f'(x)}{f''(x)}$$

Update x

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$
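As a minimal sketch, the 1D update can be coded directly; the example function $f(x) = x^4 - 3x^2 + x$ and the starting point are illustrative choices, not from the lecture:

% Sketch: Newton's method in 1D on an illustrative function f(x) = x^4 - 3x^2 + x
fp  = @(x) 4*x^3 - 6*x + 1;               % f'(x)
fpp = @(x) 12*x^2 - 6;                    % f''(x)
x = 2;                                    % illustrative starting point
for n = 1:20
    step = fp(x)/fpp(x);
    x = x - step;                         % x_{n+1} = x_n - f'(x_n)/f''(x_n)
    if abs(step) < 1e-10, break; end
end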

Recall from lecture 1: Taylor expansion in 2D


A function may be approximated locally by its Taylor series expansion
about a point x0
$$f(\mathbf{x}_0 + \delta\mathbf{x}) \approx f(\mathbf{x}_0) + \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)\begin{pmatrix} \delta x \\ \delta y \end{pmatrix} + \frac{1}{2}\,(\delta x, \delta y)\begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix}\begin{pmatrix} \delta x \\ \delta y \end{pmatrix} + \text{h.o.t.}$$

The expansion to second order is a quadratic function

$$f(\mathbf{x}_0 + \delta\mathbf{x}) = a + \mathbf{g}^\top \delta\mathbf{x} + \frac{1}{2}\,\delta\mathbf{x}^\top \mathbf{H}\,\delta\mathbf{x}$$

Newton's method in ND
Expand f (x) by its Taylor series about the point xn
$$f(\mathbf{x}_n + \delta\mathbf{x}) \approx f(\mathbf{x}_n) + \mathbf{g}_n^\top \delta\mathbf{x} + \frac{1}{2}\,\delta\mathbf{x}^\top \mathbf{H}_n\,\delta\mathbf{x}$$
where the gradient is the vector
$$\mathbf{g}_n = \nabla f(\mathbf{x}_n) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_N}\right]^\top$$

and the Hessian is the symmetric matrix

$$\mathbf{H}_n = \mathbf{H}(\mathbf{x}_n) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_N} & \cdots & \frac{\partial^2 f}{\partial x_N^2} \end{pmatrix}$$

For a minimum we require that $\nabla f(\mathbf{x}) = 0$, and so


$$\nabla f(\mathbf{x}) = \mathbf{g}_n + \mathbf{H}_n\,\delta\mathbf{x} = 0$$

with solution $\delta\mathbf{x} = -\mathbf{H}_n^{-1}\mathbf{g}_n$. This gives the iterative update

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n$$

Assume that H is positive definite (all eigenvalues greater than zero)

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \delta\mathbf{x} = \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n$$

If f(x) is quadratic, then the solution is found in one step.

The method has quadratic convergence (as in the 1D case).

The solution $\delta\mathbf{x} = -\mathbf{H}_n^{-1}\mathbf{g}_n$ is guaranteed to be a downhill direction (provided that $\mathbf{H}$ is positive definite).

For numerical reasons the inverse is not actually computed; instead $\delta\mathbf{x}$ is computed as the solution of $\mathbf{H}\,\delta\mathbf{x} = -\mathbf{g}_n$.

Rather than jumping straight to $\mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n$, it is better to perform a line search, which ensures global convergence:

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \alpha_n \mathbf{H}_n^{-1}\mathbf{g}_n$$

If $\mathbf{H} = \mathbf{I}$ then this reduces to steepest descent.
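A minimal Matlab sketch of this damped Newton iteration on the Rosenbrock function, using the analytic gradient and Hessian (worked out later in the Gauss-Newton example) and a crude backtracking rule in place of a proper line search:

% Sketch: Newton's method with a simple backtracking line search
f = @(z) 100*(z(2) - z(1)^2)^2 + (1 - z(1))^2;
x = [-1.9; 2];                             % illustrative starting point
for n = 1:50
    g = [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1)); 200*(x(2) - x(1)^2)];
    H = [1200*x(1)^2 - 400*x(2) + 2, -400*x(1); -400*x(1), 200];
    if norm(g) < 1e-3, break; end
    dx = -(H\g);                           % solve H*dx = -g rather than inverting H
    alpha = 1;
    while f(x + alpha*dx) > f(x) && alpha > 1e-8
        alpha = alpha/2;                   % damp the step until f decreases
    end
    x = x + alpha*dx;
end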

Newton's method - example


[Figure: Newton method with line search on the Rosenbrock function, full and zoomed views; gradient < 1e-3 after 15 iterations. Ellipses show successive quadratic approximations.]

The algorithm converges in only 15 iterations, compared to hundreds for steepest descent.

However, the method requires computing the Hessian matrix at each iteration, and this is not always feasible.

Quasi-Newton methods
If the problem size is large and the Hessian matrix is dense then it
may be infeasible/inconvenient to compute it directly.

Quasi-Newton methods avoid this problem by keeping a rolling


estimate of H (x), updated at each iteration using new gradient
information.

Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

e.g. in 1D

First derivatives:
$$f'\!\left(x_0 - \tfrac{h}{2}\right) \approx \frac{f_0 - f_{-1}}{h}, \qquad f'\!\left(x_0 + \tfrac{h}{2}\right) \approx \frac{f_1 - f_0}{h}$$
Second derivative:
$$f''(x_0) \approx \frac{f_{-1} - 2f_0 + f_1}{h^2}$$

[Figure: f(x) sampled at $x_{-1}$, $x_0$, $x_{+1}$, each a distance $h$ apart]

For $\mathbf{H}_{n+1}$, build an approximation from $\mathbf{H}_n$, $\mathbf{g}_n$, $\mathbf{g}_{n+1}$, $\mathbf{x}_n$, $\mathbf{x}_{n+1}$.

Quasi-Newton: BFGS
Set $\mathbf{H}_0 = \mathbf{I}$.
Update according to

$$\mathbf{H}_{n+1} = \mathbf{H}_n + \frac{\mathbf{q}_n \mathbf{q}_n^\top}{\mathbf{q}_n^\top \mathbf{s}_n} - \frac{(\mathbf{H}_n \mathbf{s}_n)(\mathbf{H}_n \mathbf{s}_n)^\top}{\mathbf{s}_n^\top \mathbf{H}_n \mathbf{s}_n}$$
where
$$\mathbf{s}_n = \mathbf{x}_{n+1} - \mathbf{x}_n, \qquad \mathbf{q}_n = \mathbf{g}_{n+1} - \mathbf{g}_n$$
The matrix itself is not stored, but rather represented compactly by
a few stored vectors.
The estimate $\mathbf{H}_{n+1}$ is used to form a local quadratic approximation as before.
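The update itself is a one-liner in Matlab; a minimal sketch (the function name is illustrative):

% Sketch: one BFGS update of the Hessian estimate
% s = x_{n+1} - x_n,  q = g_{n+1} - g_n
function H = bfgs_update(H, s, q)
    Hs = H*s;
    H  = H + (q*q')/(q'*s) - (Hs*Hs')/(s'*Hs);
end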

Example
[Figure: BFGS quasi-Newton minimization path on the Rosenbrock function]

The method converges in 25 iterations, compared to 15 for the full Newton method.

In Matlab the optimization function fminunc uses a BFGS quasi-Newton method for medium-scale optimization problems.

Matlab fminunc
>> f='100*(x(2)-x(1)^2)^2+(1-x(1))^2';
>> GRAD='[100*(4*x(1)^3-4*x(1)*x(2))+2*x(1)-2; 100*(2*x(2)-2*x(1)^2) ]';
Choose options for BFGS quasi-Newton
>> OPTIONS=optimset('LargeScale','off', 'HessUpdate','bfgs' );
>> OPTIONS = optimset(OPTIONS,'gradobj','on');
Start point
>> x = [-1.9; 2];
>> [x,fval] = fminunc({f,GRAD},x,OPTIONS);
This produces
x = 0.9998, 0.9996

fval = 3.4306e-008
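In recent Matlab releases the same run is usually written with optimoptions and a single objective function that returns both the value and the gradient; a sketch, assuming the Optimization Toolbox is available:

% Sketch: BFGS quasi-Newton via fminunc in recent Matlab releases
opts = optimoptions('fminunc', 'Algorithm', 'quasi-newton', ...
                    'SpecifyObjectiveGradient', true);
[x, fval] = fminunc(@rosen, [-1.9; 2], opts);

function [f, g] = rosen(x)                 % objective and analytic gradient
    f = 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
    g = [-400*x(1)*(x(2) - x(1)^2) - 2*(1 - x(1));
          200*(x(2) - x(1)^2)];
end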

Non-linear least squares


It is very common in applications for a cost function f (x) to be the
sum of a large number of squared residuals

$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2$$

If each residual depends non-linearly on the parameters x then the minimization of f(x) is a non-linear least squares problem.

Examples arise in non-linear regression (fitting) of data

Linear least squares reminder


The goal is to fit a smooth curve to measured data points $\{s_i, t_i\}$ by minimizing the cost
$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2 = \sum_{i=1}^{M} \left(y(s_i, \mathbf{x}) - t_i\right)^2$$

[Figure: data points $(s_i, t_i)$ and fitted curve $y(s, \mathbf{x})$; $t_i$ is the target value]

For example, the regression function $y(s_i, \mathbf{x})$ might be polynomial:
$$y(s, \mathbf{x}) = x_0 + x_1 s + x_2 s^2 + \ldots$$
In this case the function is linear in the parameters $\mathbf{x}$ and there is a closed-form solution. In general there will not be a closed-form solution for non-linear functions $y(s, \mathbf{x})$.
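For the linear case the closed-form solution is just a least-squares solve; a minimal sketch with made-up data (the sample positions, coefficients and noise level are all illustrative):

% Sketch: closed-form linear least squares for a quadratic regression function
s = linspace(0, 1, 20)';                       % sample positions (illustrative)
t = 2 + 0.5*s - 1.5*s.^2 + 0.05*randn(20, 1);  % noisy target values (illustrative)
A = [ones(size(s)), s, s.^2];                  % design matrix: y(s,x) = x0 + x1*s + x2*s^2
x = A \ t;                                     % least-squares estimate of [x0; x1; x2]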

Non-linear least squares example: aligning a 3D model to an image

Cost function

Transformation parameters:
3D rotation matrix R

translation 3-vector $\mathbf{T} = (T_x, T_y, T_z)^\top$

Image generation:
rotate and translate 3D model by R and T
project to generate the image $I_{\mathbf{R},\mathbf{T}}(x, y)$

Non-linear least squares


$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2 = \|\mathbf{r}\|^2$$

The $M \times N$ Jacobian of the vector of residuals $\mathbf{r}$ is defined as

$$\mathbf{J}(\mathbf{x}) = \begin{pmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial r_M}{\partial x_1} & \cdots & \frac{\partial r_M}{\partial x_N} \end{pmatrix} \qquad (\text{assume } M > N)$$

Consider
$$\frac{\partial}{\partial x_k} \sum_i r_i^2 = \sum_i 2\, r_i \frac{\partial r_i}{\partial x_k}$$
Hence
$$\nabla f(\mathbf{x}) = 2\mathbf{J}^\top \mathbf{r}$$

For the Hessian we require
$$\frac{\partial^2}{\partial x_l \partial x_k} \sum_i r_i^2 = 2 \sum_i \frac{\partial r_i}{\partial x_l}\frac{\partial r_i}{\partial x_k} + 2 \sum_i r_i \frac{\partial^2 r_i}{\partial x_k \partial x_l}$$
Hence
$$\mathbf{H}(\mathbf{x}) = 2\mathbf{J}^\top \mathbf{J} + 2 \sum_{i=1}^{M} r_i \mathbf{R}_i$$
where $\mathbf{R}_i$ is the matrix of second derivatives of the residual $r_i$.

Note that the second-order term in the Hessian H (x) is multiplied by


the residuals ri.
In most problems, the residuals will typically be small.
Also, at the minimum, the residuals will typically be distributed with
mean = 0.

For these reasons, the second-order term is often ignored, giving the
Gauss-Newton approximation to the Hessian:
$$\mathbf{H}(\mathbf{x}) = 2\mathbf{J}^\top \mathbf{J}$$
Hence, explicit computation of the full Hessian can again be avoided.

Example Gauss-Newton
The minimization of the Rosenbrock function
$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2$$
can be written as a least-squares problem with residual vector

$$\mathbf{r} = \begin{pmatrix} 10(y - x^2) \\ 1 - x \end{pmatrix}, \qquad \mathbf{J}(\mathbf{x}) = \begin{pmatrix} \frac{\partial r_1}{\partial x} & \frac{\partial r_1}{\partial y} \\ \frac{\partial r_2}{\partial x} & \frac{\partial r_2}{\partial y} \end{pmatrix} = \begin{pmatrix} -20x & 10 \\ -1 & 0 \end{pmatrix}$$

The true Hessian is
$$\mathbf{H}(\mathbf{x}) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 1200x^2 - 400y + 2 & -400x \\ -400x & 200 \end{pmatrix}$$

The Gauss-Newton approximation of the Hessian is
$$2\mathbf{J}^\top \mathbf{J} = 2\begin{pmatrix} -20x & -1 \\ 10 & 0 \end{pmatrix}\begin{pmatrix} -20x & 10 \\ -1 & 0 \end{pmatrix} = \begin{pmatrix} 800x^2 + 2 & -400x \\ -400x & 200 \end{pmatrix}$$
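A quick numeric check in Matlab: at the minimum [1, 1] the residuals vanish, so the two Hessians agree exactly there (the evaluation point is the only choice made here):

% Sketch: compare the true Hessian and the Gauss-Newton approximation at (1, 1)
x = 1; y = 1;
H_true = [1200*x^2 - 400*y + 2, -400*x; -400*x, 200];
J      = [-20*x, 10; -1, 0];
H_gn   = 2*(J'*J);                         % equals H_true here since r = 0 at (1, 1)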

Summary: Gauss-Newton optimization


For a cost function f (x) that is a sum of squared residuals
$$f(\mathbf{x}) = \sum_{i=1}^{M} r_i^2$$

The Hessian can be approximated as

$$\mathbf{H}(\mathbf{x}) = 2\mathbf{J}^\top \mathbf{J}$$
and the gradient is given by
$$\nabla f(\mathbf{x}) = 2\mathbf{J}^\top \mathbf{r}$$
So, the Newton update step

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \delta\mathbf{x} = \mathbf{x}_n - \mathbf{H}_n^{-1}\mathbf{g}_n$$
computed as $\mathbf{H}\,\delta\mathbf{x} = -\mathbf{g}_n$, becomes
$$\mathbf{J}^\top\mathbf{J}\,\delta\mathbf{x} = -\mathbf{J}^\top\mathbf{r}$$
These are called the normal equations.

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \alpha_n \mathbf{H}_n^{-1}\mathbf{g}_n \quad \text{with} \quad \mathbf{H}(\mathbf{x}_n) = 2\mathbf{J}_n^\top \mathbf{J}_n$$
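A minimal Matlab sketch of this Gauss-Newton iteration for the Rosenbrock residuals, solving the normal equations at each step; a crude backtracking rule again stands in for a proper line search:

% Sketch: Gauss-Newton with a simple backtracking line search
r = @(x) [10*(x(2) - x(1)^2); 1 - x(1)];   % residual vector
J = @(x) [-20*x(1), 10; -1, 0];            % its Jacobian
x = [-1.9; 2];                             % illustrative starting point
for n = 1:50
    rn = r(x);  Jn = J(x);
    g  = 2*(Jn'*rn);                       % gradient of f = ||r||^2
    if norm(g) < 1e-3, break; end
    dx = (Jn'*Jn) \ (-Jn'*rn);             % normal equations: J'J dx = -J'r
    alpha = 1;
    while norm(r(x + alpha*dx))^2 > norm(rn)^2 && alpha > 1e-12
        alpha = alpha/2;
    end
    x = x + alpha*dx;
end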
[Figure: Gauss-Newton method with line search on the Rosenbrock function, full and zoomed views; gradient < 1e-3 after 14 iterations]

Minimization with the Gauss-Newton approximation and line search takes only 14 iterations.

Comparison

Newton:
requires computing the Hessian (i.e. n^2 second derivatives)
exact solution if quadratic

Gauss-Newton:
approximates the Hessian by a Jacobian product
requires only n first derivatives

[Figure: Newton method with line search (gradient < 1e-3 after 15 iterations) and Gauss-Newton method with line search (gradient < 1e-3 after 14 iterations) on the Rosenbrock function]

The downhill simplex (amoeba) algorithm


Due to Nelder and Mead (1965)
A direct method: only uses function evaluations (no derivatives)
a simplex is the polytope in N dimensions with N+1 vertices, e.g.
2D: triangle
3D: tetrahedron
basic idea: move by reflections, expansions or contractions

[Figure: basic simplex moves: start, reflect, expand, contract]

One iteration of the simplex algorithm

Reorder the points so that $f(\mathbf{x}_{N+1}) > \cdots > f(\mathbf{x}_2) > f(\mathbf{x}_1)$ (i.e. $\mathbf{x}_{N+1}$ is the worst point).

Generate a trial point $\mathbf{x}_r$ by reflection
$$\mathbf{x}_r = \bar{\mathbf{x}} + \alpha(\bar{\mathbf{x}} - \mathbf{x}_{N+1})$$
where $\bar{\mathbf{x}} = \sum_i \mathbf{x}_i/(N+1)$ is the centroid and $\alpha > 0$. Compute $f(\mathbf{x}_r)$; there are then 3 possibilities:

1. $f(\mathbf{x}_1) < f(\mathbf{x}_r) < f(\mathbf{x}_N)$ (i.e. $\mathbf{x}_r$ is neither the new best nor the new worst point): replace $\mathbf{x}_{N+1}$ by $\mathbf{x}_r$.

2. $f(\mathbf{x}_r) < f(\mathbf{x}_1)$ (i.e. $\mathbf{x}_r$ is the new best point): assume the direction of reflection is good and generate a new point by expansion
$$\mathbf{x}_e = \mathbf{x}_r + \gamma(\mathbf{x}_r - \bar{\mathbf{x}})$$
where $\gamma > 0$. If $f(\mathbf{x}_e) < f(\mathbf{x}_r)$ then replace $\mathbf{x}_{N+1}$ by $\mathbf{x}_e$; otherwise the expansion has failed, so replace $\mathbf{x}_{N+1}$ by $\mathbf{x}_r$.

3. $f(\mathbf{x}_r) > f(\mathbf{x}_N)$: assume the polytope is too large and generate a new point by contraction
$$\mathbf{x}_c = \bar{\mathbf{x}} + \beta(\mathbf{x}_{N+1} - \bar{\mathbf{x}})$$
where $\beta$ ($0 < \beta < 1$) is the contraction coefficient. If $f(\mathbf{x}_c) < f(\mathbf{x}_{N+1})$ then the contraction has succeeded: replace $\mathbf{x}_{N+1}$ by $\mathbf{x}_c$; otherwise contract again.

Standard values are $\alpha = 1$, $\gamma = 1$, $\beta = 0.5$.
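A minimal Matlab sketch of one such iteration for a 2D problem, following the rules above with alpha = 1, gamma = 1, beta = 0.5 (the function name is illustrative, and the repeated-contraction and shrink steps of full implementations are omitted):

% Sketch: one downhill-simplex iteration in 2D (reflect / expand / contract)
function X = simplex_step(f, X)            % X is 3x2: one vertex per row
    fX = [f(X(1,:)); f(X(2,:)); f(X(3,:))];
    [fX, idx] = sort(fX);  X = X(idx, :);  % X(1,:) best, X(3,:) worst
    xbar = mean(X, 1);                     % centroid of all N+1 vertices
    xr = xbar + 1.0*(xbar - X(3,:));       % reflection (alpha = 1)
    fr = f(xr);
    if fr < fX(1)                          % new best point: try an expansion
        xe = xr + 1.0*(xr - xbar);         % expansion (gamma = 1)
        if f(xe) < fr, X(3,:) = xe; else, X(3,:) = xr; end
    elseif fr < fX(2)                      % neither best nor worst: accept xr
        X(3,:) = xr;
    else                                   % reflection failed: contract
        xc = xbar + 0.5*(X(3,:) - xbar);   % contraction (beta = 0.5)
        if f(xc) < fX(3), X(3,:) = xc; end % otherwise one would contract again
    end
end

Starting from a small simplex such as X = [-1.9 2; -1.8 2; -1.9 2.1] with the Rosenbrock function, repeated calls to simplex_step move the simplex down the valley.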

Example

Path of best vertex


[Figure: path of the best vertex of the downhill simplex on the Rosenbrock function, with a zoomed detail view]

Matlab fminsearch
with 200 iterations

[Figure: snapshots of the simplex showing reflections and contractions]

Example 2: contraction about a minimum

[Figure: snapshots of the simplex contracting about the minimum]

Summary
no derivatives required
deals well with noise in the cost function
is able to crawl out of some local minima (though, of course, can still get stuck)

Matlab fminsearch
Nelder-Mead simplex direct search
>> banana = @(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
Pass the function handle to fminsearch:
>> [x,fval] = fminsearch(banana,[-1.9, 2])
This produces
x = 1.0000 1.0000
fval =4.0686e-010

Google to find out more on using this function

What is next?
Move from general and quadratic optimization problems
to linear programming
Constrained optimization
