
Outline
  Methods for unconstrained optimization (Overview, Line search, Trust-region)
  Convergence
  Descent directions
  Line search
  The Newton Method
Nonlinear Optimization
Overview of methods; the Newton method with line search

Niclas Börlin
Department of Computing Science
Umeå University
niclas.borlin@cs.umu.se

November 19, 2007

© 2007 Niclas Börlin, CS, UmU
Nonlinear Optimization; The Newton method w/ line search

Overview

Most deterministic methods for unconstrained optimization have the following features:
  They are iterative, i.e. they start with an initial guess x_0 of the variables and try to find better points {x_k}, k = 1, ....
  They are descent methods, i.e. at each iteration k,

      f(x_{k+1}) < f(x_k)

  is (at least) required.
  At each iteration k, the nonlinear objective function f is replaced by a simpler model function m_k that approximates f around x_k. The next iterate x_{k+1} = x_k + p is sought as the minimizer of m_k.


The model function m_k is usually defined to be a quadratic function of the form

    m_k(x_k + p) = f_k + p^T ∇f_k + (1/2) p^T B_k p,

where f_k = f(x_k), ∇f_k = ∇f(x_k), and B_k is a matrix, usually a positive definite approximation of the Hessian ∇²f(x_k).
If B_k is positive definite, a minimizer of m_k may be found by solving

    ∇_p m_k(x_k + p) = 0

for p.
If the minimizer of m_k does not produce a better point, the step p is modified to produce a point x_{k+1} = x_k + p that is better.
The modifications come in two major flavours: line search and trust-region.

Line search

In the line search strategy, the algorithm chooses a search direction p_k and tries to solve the following one-dimensional minimization problem

    min_{α>0} f(x_k + α p_k),

where the scalar α is called the step length.
In theory we would like optimal step lengths, but in practice it is more efficient to test trial step lengths until we find one that gives us a good enough point.
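The model minimizer can be sketched in a few lines: with B_k positive definite, setting ∇_p m_k = ∇f_k + B_k p = 0 amounts to a single linear solve. The matrix B and gradient g below are made-up example data, not from the lecture.

```python
import numpy as np

# Minimizer of the quadratic model m_k(x_k + p) = f_k + p^T g + (1/2) p^T B p:
# grad_p m_k = g + B p = 0, i.e. one linear solve.  B and g are example data.
B = np.array([[2.0, 0.0],
              [0.0, 4.0]])   # positive definite model Hessian B_k
g = np.array([2.0, -4.0])    # gradient of f at x_k

p = np.linalg.solve(B, -g)   # model minimizer
print(p)                     # -> [-1.  1.]
```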

Trust-region

In the trust-region strategy, the algorithm defines a region of trust around x_k where the current model function m_k is trusted.
The region of trust is usually defined as

    ‖p‖_2 ≤ Δ,

where the scalar Δ is called the trust-region radius.
A candidate step p is found by approximately solving the following subproblem

    min_p m_k(x_k + p)  s.t.  ‖p‖_2 ≤ Δ.

If the candidate step does not produce a good enough new point, we shrink the trust-region radius and re-solve the subproblem.
[Figure: the trust region around x_k and the candidate step p.]
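One cheap way to approximately solve the subproblem is the Cauchy point: the minimizer of the model along the steepest-descent direction, restricted to the trust region. This is a sketch under that particular choice (the lecture does not prescribe a subproblem solver); g, B, and the radius are made-up example data.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimize the model along the steepest-descent direction -g,
    subject to the trust-region constraint ||p||_2 <= delta."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    # Non-positive curvature along -g: the model decreases all the way to
    # the boundary; otherwise stop at the 1-D minimizer if it lies inside.
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

B = np.eye(2)                      # example model Hessian
g = np.array([3.0, 4.0])           # example gradient, ||g|| = 5
p = cauchy_point(g, B, delta=1.0)
print(np.linalg.norm(p))           # step length equals the radius
```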


In the line search strategy, the direction is chosen first, followed by the distance.
In the trust-region strategy, the maximum distance is chosen first, followed by the direction.
[Figure: a line search step from x_k (left) and a trust-region step from x_k (right).]
Convergence
  Convergence rate
  Linear convergence
  Quadratic convergence
  Local vs. global convergence
  Globalization strategies
Convergence rate

In order to compare different iterative methods, we need an efficiency measure. Since we do not know the number of iterations in advance, the computational complexity measure used by direct methods cannot be used. Instead the concept of a convergence rate is defined.

Assume we have a sequence {x_k} that converges to a solution x*. Define the sequence of errors as

    e_k = x_k - x*

and note that

    lim_{k→∞} e_k = 0.

We say that the sequence {x_k} converges to x* with rate r and rate constant C if

    lim_{k→∞} ‖e_{k+1}‖ / ‖e_k‖^r = C

and C < ∞.
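The definition can be checked numerically: build an error sequence with a known rate and estimate C from consecutive errors. The sequence below is synthetic example data.

```python
# Synthetic error sequence with quadratic rate: ||e_{k+1}|| = 0.1 * ||e_k||^2.
e = [1.0]
for _ in range(4):
    e.append(0.1 * e[-1] ** 2)

# Estimate the rate constant C = lim ||e_{k+1}|| / ||e_k||^r with r = 2:
ratios = [e[k + 1] / e[k] ** 2 for k in range(4)]
print(ratios)   # every ratio is (up to rounding) the rate constant 0.1
```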


In practice there are three important rates of convergence:
  linear convergence, for r = 1 and 0 < C < 1;
  super-linear convergence, for r = 1 and C = 0;
  quadratic convergence, for r = 2.

Linear convergence, examples

For r = 1, C = 0.1 and ‖e_0‖ = 1, the norm of the error sequence becomes

    1, 10^-1, 10^-2, ..., 10^-7   (7 iterations).

For C = 0.99 the corresponding sequence is

    1, 0.99, 0.9801, ..., 0.997·10^-7   (1604 iterations).

Thus the constant C is of major importance for a method with linear convergence.
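The iteration counts quoted above follow from ‖e_k‖ = C^k: the smallest k with C^k ≤ 10^-7 is ceil(-7 / log10 C). A small sketch (the round() guards against floating-point noise when the quotient is an exact integer):

```python
import math

# Smallest k with C**k <= 1e-7, when ||e_{k+1}|| = C * ||e_k|| and ||e_0|| = 1.
def iters_to_tol(C, digits=7):
    return math.ceil(round(digits / -math.log10(C), 9))

print(iters_to_tol(0.1))    # -> 7
print(iters_to_tol(0.99))   # -> 1604
```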

Quadratic convergence, examples

For r = 2, C = 0.1 and ‖e_0‖ = 1, the sequence becomes

    1, 10^-1, 10^-3, 10^-7, ...

For r = 2, C = 3 and ‖e_0‖ = 1, the sequence diverges:

    1, 3, 27, ...

For r = 2, C = 3 and ‖e_0‖ = 0.1, the sequence becomes

    0.1, 0.03, 0.0027, ...,

i.e. it converges despite C > 1.
For quadratic convergence, the constant C is of lesser importance. Instead it is important that the initial approximation is close enough to the solution, i.e. that ‖e_0‖ is small.

Local vs. global convergence

A method is called locally convergent if it produces a sequence converging toward a minimizer x* provided a close enough starting approximation.
A method is called globally convergent if it produces a sequence converging toward a minimizer x* from any starting approximation.
Note that global convergence does not imply convergence toward a global minimizer.


Globalization strategies

The line search and trust-region methods are sometimes called globalization strategies, since they modify a core method (typically locally convergent) to become globally convergent.
There are two efficiency requirements on any globalization strategy:
  Far from the solution, it should stop the method from going out of control.
  Close to the solution, when the core method is efficient, it should interfere as little as possible.

Descent directions

Consider the Taylor expansion of the objective function along a search direction p:

    f(x_k + αp) = f(x_k) + α p^T ∇f_k + (1/2) α^2 p^T ∇²f(x_k + tp) p

for some t ∈ (0, α).
Any direction p such that p^T ∇f_k < 0 will produce a reduction of the objective function for a short enough step.
A direction p such that

    p^T ∇f_k < 0

is called a descent direction.

Since

    cos θ = -p^T ∇f_k / (‖p‖ ‖∇f_k‖),

where θ is the angle between the search direction and the negative gradient, descent directions lie in the same half-plane as the negative gradient.
The search direction corresponding to the negative gradient, p = -∇f_k, is called the direction of steepest descent.
If the search direction has the form

    p_k = -B_k^-1 ∇f_k,

the descent condition

    p_k^T ∇f_k = -∇f_k^T B_k^-1 ∇f_k < 0

is satisfied whenever B_k is positive definite.
[Figure: descent directions relative to the negative gradient -∇f.]
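The claim is easy to verify numerically: for positive definite B_k, p = -B_k^-1 ∇f_k always satisfies p^T ∇f_k < 0. The quadratic f, the point x, and the matrix B below are made-up example data.

```python
import numpy as np

# Example quadratic f(x) = (1/2) x^T A x with gradient g = A x; A, x, B
# are made-up data, with B positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
x = np.array([1.0, -2.0])
g = A @ x                                  # grad f(x) = [1, -3]

B = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # positive definite B_k
p = np.linalg.solve(B, -g)                 # p_k = -B_k^{-1} grad f_k

# Descent condition: p^T grad f = -g^T B^{-1} g < 0 for any PD B.
print(p @ g)                               # negative
```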

Line search
  Overview
  Exact and inexact line searches
  The Sufficient Decrease Condition
  Backtracking
  The Curvature Condition
  The Wolfe Condition

Line search

Each iteration of a line search method computes a search direction p_k and then decides how far to move along that direction.
The next iterate is given by

    x_{k+1} = x_k + α_k p_k.

We will require p_k to be a descent direction. This assures that the objective function will decrease,

    f(x_k + α_k p_k) < f(x_k),

for some small α_k > 0.

Exact and inexact line searches

Consider the function

    φ(α) = f(x_k + α p_k),  α > 0.

Ideally we would like to find the global minimizer of φ in every iteration. This is called an exact line search.
However, it is possible to construct inexact line search methods that produce an adequate reduction of f at a minimal cost.
Inexact line search methods construct a number of candidate values for α and stop when certain conditions are satisfied.
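For a quadratic f, φ(α) is itself a parabola in α, so the exact line search has a closed form; this makes a convenient sanity check. A, b, and x below are made-up example data.

```python
import numpy as np

# Quadratic f(x) = (1/2) x^T A x + b^T x; phi(alpha) = f(x + alpha p) is a
# parabola in alpha, so the exact step is alpha* = -(g^T p) / (p^T A p).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # example data
b = np.array([1.0, -2.0])
x = np.array([2.0, 1.0])

g = A @ x + b                              # grad f(x)
p = -g                                     # steepest-descent direction
alpha = -(g @ p) / (p @ A @ p)             # exact line search step

# At the exact minimizer, phi'(alpha) = grad f(x + alpha p)^T p = 0:
print((A @ (x + alpha * p) + b) @ p)       # ~ 0
```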

The Sufficient Decrease Condition

Mathematically, the descent condition f(x_k + α p_k) < f(x_k) is not enough to guarantee convergence.
Instead, the sufficient decrease condition is formulated from the linear Taylor approximation of φ(α),

    φ(α) ≈ φ(0) + α φ'(0),

or

    f(x_k + α p_k) ≈ f(x_k) + α ∇f_k^T p_k.

The sufficient decrease condition states that the new point must produce at least a fraction 0 < c_1 < 1 of the decrease predicted by the Taylor approximation, i.e.

    f(x_k + α p_k) < f(x_k) + c_1 α ∇f_k^T p_k.

This condition is sometimes called the Armijo condition.
[Figure: φ(α) and the sufficient decrease bound for c_1 = 1, 0.5, 0.1, 0.]


Backtracking

The sufficient decrease condition alone is not enough to guarantee convergence, since it is satisfied for arbitrarily small values of α.
The sufficient decrease condition has to be combined with a strategy that favours large step lengths over small.
A simple such strategy is called backtracking: accept the first element of the sequence

    1, 1/2, 1/4, ..., 2^-i, ...

that satisfies the sufficient decrease condition. Such a step length always exists.
Large step lengths are tested before small ones. Thus, the step length will not be too small.
This technique works well for Newton-type algorithms.

The Curvature Condition

Another approach to approximating the solution of

    min_{α>0} φ(α) ≡ f(x_k + α p_k)

is to solve φ'(α) = 0, which is relaxed to the condition

    |φ'(α_k)| ≤ c_2 |φ'(0)|,

where c_2 is a constant with c_1 < c_2 < 1.
Since φ'(α) = p_k^T ∇f(x_k + α p_k), we get

    |p_k^T ∇f(x_k + α_k p_k)| ≤ c_2 |p_k^T ∇f(x_k)|.

This condition is called the curvature condition.
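A minimal backtracking sketch, assuming the Armijo form given above with the halving sequence 1, 1/2, 1/4, ...; the test function and the choice c_1 = 0.1 are illustrative, not from the lecture.

```python
import numpy as np

def backtracking(f, grad, x, p, c1=0.1):
    """Accept the first alpha in 1, 1/2, 1/4, ... satisfying the
    sufficient decrease (Armijo) condition
        f(x + alpha p) < f(x) + c1 * alpha * grad(x)^T p."""
    fx, slope = f(x), grad(x) @ p
    assert slope < 0, "p must be a descent direction"
    alpha = 1.0
    while f(x + alpha * p) >= fx + c1 * alpha * slope:
        alpha /= 2.0
    return alpha

# Example: f(x) = x1^2 + 10 x2^2 with the steepest-descent direction.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])

x0 = np.array([1.0, 1.0])
alpha = backtracking(f, grad, x0, -grad(x0))
print(alpha)   # -> 0.0625
```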
The Newton Method
  The Newton-Raphson method in ℝ¹
  The Classical Newton minimization method in ℝⁿ
  Geometrical interpretation; the model function
  Properties of the Newton method
  Ensuring a descent direction
  The modified Newton algorithm with line search

The Wolfe Condition

The sufficient decrease condition and the curvature condition,

    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,
    |p_k^T ∇f(x_k + α_k p_k)| ≤ c_2 |p_k^T ∇f(x_k)|,

where 0 < c_1 < c_2 < 1, are collectively called the strong Wolfe conditions.
Step length methods that use the Wolfe conditions are more complicated than backtracking.
Several popular implementations of nonlinear optimization routines are based on the Wolfe conditions, notably the BFGS quasi-Newton method.

The Newton-Raphson method in ℝ¹

Consider the non-linear problem f(x) = 0, where f, x ∈ ℝ.
The Newton-Raphson method for solving this problem is based on the linear Taylor approximation of f around x_k:

    f(x_k + p) ≈ f(x_k) + p f'(x_k).

If f'(x_k) ≠ 0 we solve the linear equation

    f(x_k) + p f'(x_k) = 0

for p and get

    p = -f(x_k)/f'(x_k).

The new iterate is given by

    x_{k+1} = x_k + p_k = x_k - f(x_k)/f'(x_k).
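The ℝ¹ iteration above is a few lines of code; the example equation x² - 2 = 0 (i.e. computing √2) is an illustrative choice.

```python
def newton_raphson(f, fprime, x0, tol=1e-12, maxiter=50):
    """Solve f(x) = 0 via x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    x = x0
    for _ in range(maxiter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / fprime(x)
    return x

# Example: x^2 - 2 = 0, i.e. computing sqrt(2):
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)   # sqrt(2) to machine precision
```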


The Classical Newton minimization method in ℝⁿ

In order to use Newton's method to find a minimizer, we apply the first-order necessary condition to a function f:

    ∇f(x*) = 0    (f'(x*) = 0).

This results in the Newton sequence

    x_{k+1} = x_k - (∇²f(x_k))^-1 ∇f(x_k)    (x_{k+1} = x_k - f'(x_k)/f''(x_k)).

This is often written as x_{k+1} = x_k + p_k, where p_k is the solution of the Newton equation

    ∇²f(x_k) p_k = -∇f(x_k).

This formulation emphasizes that a linear equation system is solved in each step, usually by other means than calculating an inverse.

Geometrical interpretation; the model function

The approximation of the non-linear function ∇f(x) with the linear (in p) polynomial

    ∇f(x_k + p) ≈ ∇f(x_k) + ∇²f(x_k) p

corresponds to approximating the non-linear function f(x) with the quadratic (in p) Taylor expansion

    m_k(x_k + p) ≡ f(x_k) + ∇f(x_k)^T p + (1/2) p^T ∇²f(x_k) p,

i.e. B_k = ∇²f(x_k).
Newton's method can thus be interpreted as follows: at each iteration k, f is approximated by the quadratic Taylor expansion m_k around x_k, and x_{k+1} is calculated as the minimizer of m_k.
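A sketch of the classical method: each step solves the Newton equation rather than forming an inverse. The example function (a simple quadratic, so the quadratic model is exact and one step suffices) is made-up data.

```python
import numpy as np

def newton_min(grad, hess, x0, tol=1e-10, maxiter=50):
    """Classical Newton: solve hess(x_k) p_k = -grad(x_k), then step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)   # Newton equation
    return x

# Example: f(x) = (x1 - 1)^2 + 10 (x2 + 2)^2, minimizer (1, -2).
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 20.0])

x_min = newton_min(grad, hess, [5.0, 5.0])
print(x_min)   # -> [ 1. -2.]
```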

Properties of the Newton method

Advantages:
  It converges quadratically toward a stationary point.

Disadvantages:
  It does not necessarily converge toward a minimizer.
  It may diverge if the starting approximation is too far from the solution.
  It will fail if ∇²f(x_k) is not invertible for some k.
  It requires second-order information ∇²f(x_k).

Newton's method is rarely used in its classical formulation. However, many methods may be seen as approximations of Newton's method.


Ensuring a descent direction

Since the Newton search direction p^N can be written as

    p^N = -B_k^-1 ∇f_k

with B_k = ∇²f_k, p^N will be a descent direction if ∇²f_k is positive definite.
If ∇²f_k is not positive definite, the Newton direction p^N may not be a descent direction.
In that case we choose B_k as a positive definite approximation of ∇²f_k.
Performed in a proper way, this modified algorithm will converge toward a minimizer. Furthermore, close to the solution the Hessian is usually positive definite, so the modification will only be performed far from the solution.

The positive definite approximation B_k of the Hessian may be found with minimal extra effort. The search direction p is calculated as the solution of

    ∇²f(x) p = -∇f(x).

If ∇²f(x) is positive definite, the matrix factorization

    ∇²f(x) = LDL^T

may be used, where the diagonal elements of D are positive.
If ∇²f(x) is not positive definite, at some point during the factorization a diagonal element will be d_ii ≤ 0. In this case, the element may be replaced with a suitable positive entry.
Finally, the factorization is used to calculate the search direction:

    (LDL^T) p = -∇f(x).

The modified Newton algorithm with line search

Specify a starting approximation x_0 and a convergence tolerance ε.
Repeat for k = 0, 1, ...:
  If ‖∇f(x_k)‖ < ε, stop.
  Compute the modified LDL^T factorization of the Hessian.
  Solve

      (LDL^T) p_k^N = -∇f(x_k)

  for the search direction p_k^N.
  Perform a line search to determine the new approximation x_{k+1} = x_k + α_k p_k^N.
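The algorithm above can be sketched end to end. For simplicity this version makes B_k positive definite by flooring the Hessian's eigenvalues (a simple stand-in for the modified LDL^T factorization) and uses backtracking for the line search; the Rosenbrock test function is an illustrative choice, not from the lecture.

```python
import numpy as np

def modified_newton(f, grad, hess, x0, eps=1e-8, maxiter=100):
    """Modified Newton with backtracking line search.  B_k is made
    positive definite by flooring the Hessian eigenvalues (a simple
    stand-in for the modified LDL^T factorization)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < eps:            # convergence test
            break
        w, V = np.linalg.eigh(hess(x))
        B = (V * np.maximum(w, 1e-6)) @ V.T    # positive definite B_k
        p = np.linalg.solve(B, -g)             # search direction
        alpha, fx, slope = 1.0, f(x), g @ p
        for _ in range(60):                    # backtracking (Armijo)
            if f(x + alpha * p) < fx + 1e-4 * alpha * slope:
                break
            alpha /= 2.0
        x = x + alpha * p
    return x

# Example: the Rosenbrock function, minimizer (1, 1).
f = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                           200 * (x[1] - x[0] ** 2)])
hess = lambda x: np.array([[2 - 400 * (x[1] - 3 * x[0] ** 2), -400 * x[0]],
                           [-400 * x[0], 200.0]])

x_min = modified_newton(f, grad, hess, [-1.2, 1.0])
print(x_min)   # close to [1. 1.]
```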