
EE236C (Spring 2011-12)

2. Quasi-Newton methods

• variable metric methods

• quasi-Newton methods

• BFGS update

• limited-memory quasi-Newton methods

2-1
Newton method for unconstrained minimization

minimize f (x)

f convex, twice continuously differentiable

Newton method

x^+ = x − t ∇²f(x)^{−1} ∇f(x)

• advantages: fast convergence, affine invariance


• disadvantages: requires second derivatives, solution of linear equation

can be too expensive for large scale applications
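
as a concrete reference point, a minimal sketch of one damped Newton step in Python (numpy assumed; grad and hess are illustrative callables returning ∇f(x) and ∇²f(x), not part of the slides):

```python
import numpy as np

def newton_step(x, grad, hess, t=1.0):
    """One damped Newton step: x+ = x - t * (Hessian at x)^{-1} * gradient at x."""
    # solve the Newton system rather than forming the inverse explicitly
    dx = np.linalg.solve(hess(x), -grad(x))
    return x + t * dx
```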

Quasi-Newton methods 2-2


Variable metric methods

x^+ = x − t H^{−1} ∇f(x)

H ≻ 0 is approximation of the Hessian at x, chosen to:

• avoid calculation of second derivatives


• simplify computation of search direction

‘variable metric’ interpretation (EE236B, lecture 10, page 11)

∆x = −H^{−1} ∇f(x)

is steepest descent direction at x for quadratic norm

‖z‖_H = (z^T H z)^{1/2}

Quasi-Newton methods 2-3


Quasi-Newton methods

given starting point x^{(0)} ∈ dom f, H_0 ≻ 0


for k = 1, 2, . . . , until a stopping criterion is satisfied

1. compute quasi-Newton direction ∆x = −H_{k−1}^{−1} ∇f(x^{(k−1)})
2. determine step size t (e.g., by backtracking line search)
3. compute x^{(k)} = x^{(k−1)} + t∆x
4. compute H_k

• different methods use different rules for updating H in step 4


• can also propagate H_k^{−1} to simplify calculation of ∆x (see the sketch below)
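
a minimal sketch of this loop in Python (numpy assumed; the update rule for step 4 is passed in as a callable `update`, and all names are illustrative, not a reference implementation):

```python
import numpy as np

def quasi_newton(f, grad, x0, update, tol=1e-8, max_iter=500, alpha=0.01, beta=0.5):
    """Generic quasi-Newton loop; update(H_inv, s, y) returns the new inverse approximation."""
    x = np.asarray(x0, dtype=float)
    H_inv = np.eye(x.size)                     # H_0 = I
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:           # stopping criterion
            break
        dx = -H_inv @ g                        # 1. quasi-Newton direction
        t = 1.0                                # 2. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x_new = x + t * dx                     # 3. new iterate
        s, y = x_new - x, grad(x_new) - g
        H_inv = update(H_inv, s, y)            # 4. propagate the inverse approximation
        x = x_new
    return x
```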

Quasi-Newton methods 2-4


Broyden-Fletcher-Goldfarb-Shanno (BFGS) update

BFGS update

H_k = H_{k−1} + (y y^T)/(y^T s) − (H_{k−1} s s^T H_{k−1})/(s^T H_{k−1} s)

where
s = x^{(k)} − x^{(k−1)},   y = ∇f(x^{(k)}) − ∇f(x^{(k−1)})

inverse update
H_k^{−1} = (I − (s y^T)/(y^T s)) H_{k−1}^{−1} (I − (y s^T)/(y^T s)) + (s s^T)/(y^T s)

• note that y^T s > 0 for strictly convex f ; see page 1-11


• cost of update or inverse update is O(n^2) operations
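
a direct transcription of the inverse update into Python (numpy assumed), expanded into rank-one terms so the O(n^2) cost is visible; it can serve as the `update` argument of the loop sketched on page 2-4:

```python
import numpy as np

def bfgs_inv_update(H_inv, s, y):
    """BFGS update applied to the inverse approximation H_{k-1}^{-1}.

    Implements (I - s y^T/(y^T s)) H^{-1} (I - y s^T/(y^T s)) + s s^T/(y^T s),
    written as rank-one corrections so the cost is O(n^2).
    """
    rho = 1.0 / (y @ s)                        # requires y^T s > 0
    Hy = H_inv @ y
    return (H_inv
            - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + rho * (rho * (y @ Hy) + 1.0) * np.outer(s, s))
```

for example, `quasi_newton(f, grad, x0, bfgs_inv_update)` (names from the sketch on page 2-4) runs BFGS with H_0 = I; one can also check numerically that the returned matrix maps y to s, i.e., the secant condition of page 2-7 holds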

Quasi-Newton methods 2-5


Positive definiteness

if y^T s > 0, BFGS update preserves positive definiteness of H_k

proof: from inverse update formula,

v^T H_k^{−1} v = (v − ((s^T v)/(s^T y)) y)^T H_{k−1}^{−1} (v − ((s^T v)/(s^T y)) y) + (s^T v)^2/(y^T s)

• if H_{k−1} ≻ 0, both terms are nonnegative for all v


• second term is zero only if s^T v = 0; then first term is zero only if v = 0

this ensures that ∆x = −H_k^{−1} ∇f(x^{(k)}) is a descent direction

Quasi-Newton methods 2-6


Secant condition

BFGS update satisfies the secant condition H_k s = y, i.e.,

H_k (x^{(k)} − x^{(k−1)}) = ∇f(x^{(k)}) − ∇f(x^{(k−1)})

interpretation: define second-order approximation at x^{(k)}

f_quad(z) = f(x^{(k)}) + ∇f(x^{(k)})^T (z − x^{(k)}) + (1/2) (z − x^{(k)})^T H_k (z − x^{(k)})

secant condition implies that gradient of f_quad agrees with gradient of f at x^{(k−1)}:

∇f_quad(x^{(k−1)}) = ∇f(x^{(k)}) + H_k (x^{(k−1)} − x^{(k)})
                   = ∇f(x^{(k−1)})

Quasi-Newton methods 2-7


secant method
for f : R → R, BFGS with unit step size gives the secant method

x^{(k+1)} = x^{(k)} − f′(x^{(k)})/H_k,   H_k = (f′(x^{(k)}) − f′(x^{(k−1)}))/(x^{(k)} − x^{(k−1)})

[figure: plot of f′(z) and f_quad(z), with the points x^{(k−1)}, x^{(k)}, x^{(k+1)} marked]
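
a small sketch of this scalar iteration in Python (fp is an illustrative callable returning f′; not part of the slides):

```python
def secant_method(fp, x0, x1, tol=1e-10, max_iter=100):
    """Secant iteration for f'(x) = 0, i.e. scalar BFGS with unit step size."""
    for _ in range(max_iter):
        Hk = (fp(x1) - fp(x0)) / (x1 - x0)     # difference quotient approximating f''
        x0, x1 = x1, x1 - fp(x1) / Hk
        if abs(x1 - x0) <= tol:
            break
    return x1

# example: minimize f(x) = x^4 - 3x by solving f'(x) = 4x^3 - 3 = 0
print(secant_method(lambda x: 4 * x**3 - 3, 1.0, 2.0))   # ~ 0.9086
```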

Quasi-Newton methods 2-8


Convergence

global result

if f is strongly convex, BFGS with backtracking line search (EE236B, lecture 10-6) converges from any x^{(0)}, H_0 ≻ 0

local convergence

if f is strongly convex and ∇²f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

‖x^{(k+1)} − x⋆‖_2 ≤ c_k ‖x^{(k)} − x⋆‖_2 → 0

where c_k → 0 (cf., quadratic local convergence of Newton method)

Quasi-Newton methods 2-9


Example
minimize   c^T x − ∑_{i=1}^{m} log(b_i − a_i^T x)

n = 100, m = 500
[figure: f(x^{(k)}) − f⋆ versus k on a logarithmic scale (10² down to 10^{−12}), for Newton (k up to 9) and BFGS (k up to 140)]

cost per Newton iteration: O(n^3) plus computing ∇²f(x)

cost per BFGS iteration: O(n^2)
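
one way to generate an instance of this test problem in Python (a hedged sketch, not the data behind the figure; choosing b > 0 makes x = 0 strictly feasible, and c = −A^T μ with μ > 0 keeps the objective bounded below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 500
A = rng.standard_normal((m, n))                # rows a_i^T
b = rng.uniform(1.0, 2.0, m)                   # b > 0, so x = 0 is strictly feasible
c = -A.T @ rng.uniform(0.5, 1.5, m)            # c = -A^T mu, mu > 0: f bounded below

def f(x):
    r = b - A @ x
    return c @ x - np.log(r).sum() if np.all(r > 0) else np.inf

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

x0 = np.zeros(n)
# x_bfgs = quasi_newton(f, grad, x0, bfgs_inv_update)   # using the sketches on pages 2-4 and 2-5
```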

Quasi-Newton methods 2-10


Square root BFGS update

to improve numerical stability, can propagate H_k in factored form

if H_{k−1} = L_{k−1} L_{k−1}^T, then H_k = L_k L_k^T with

L_k = L_{k−1} (I + ((α ỹ − s̃) s̃^T)/(s̃^T s̃))

where

ỹ = L_{k−1}^{−1} y,   s̃ = L_{k−1}^T s,   α = ((s̃^T s̃)/(y^T s))^{1/2}

if L_{k−1} is triangular, cost of reducing L_k to triangular is O(n^2)
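
a direct transcription of the factored update in Python (numpy assumed; the subsequent reduction of L_k back to triangular form is omitted):

```python
import numpy as np

def bfgs_sqrt_update(L, s, y):
    """Square-root BFGS: given H_{k-1} = L L^T, return L_k with H_k = L_k L_k^T."""
    s_t = L.T @ s                               # s~ = L_{k-1}^T s
    y_t = np.linalg.solve(L, y)                 # y~ = L_{k-1}^{-1} y
    alpha = np.sqrt((s_t @ s_t) / (y @ s))      # requires y^T s > 0
    return L @ (np.eye(len(s)) + np.outer(alpha * y_t - s_t, s_t) / (s_t @ s_t))
```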

Quasi-Newton methods 2-11


Optimality of BFGS update

X = H_k solves the convex optimization problem

minimize    tr(H_{k−1}^{−1} X) − log det(H_{k−1}^{−1} X) − n
subject to  Xs = y

• cost function is nonnegative, equal to zero only if X = H_{k−1}

• also known as relative entropy between densities N(0, X), N(0, H_{k−1})

optimality result follows from KKT conditions: X = H_k satisfies

X^{−1} = H_{k−1}^{−1} − (1/2)(s ν^T + ν s^T),   Xs = y,   X ≻ 0

with

ν = (1/(s^T y)) (2 H_{k−1}^{−1} y − (1 + (y^T H_{k−1}^{−1} y)/(y^T s)) s)

Quasi-Newton methods 2-12


Davidon-Fletcher-Powell (DFP) update

switch H_{k−1} and X in objective on previous page

minimize    tr(H_{k−1} X^{−1}) − log det(H_{k−1} X^{−1}) − n
subject to  Xs = y

• minimize relative entropy between N(0, H_{k−1}) and N(0, X)

• problem is convex in X^{−1} (with constraint written as s = X^{−1} y)

• solution is ‘dual’ of BFGS formula
H_k = (I − (y s^T)/(s^T y)) H_{k−1} (I − (s y^T)/(s^T y)) + (y y^T)/(s^T y)

(known as DFP update)

pre-dates BFGS update, but is less often used

Quasi-Newton methods 2-13


Limited memory quasi-Newton methods

main disadvantage of quasi-Newton method is need to store H_k or H_k^{−1}

limited-memory BFGS (L-BFGS): do not store H_k^{−1} explicitly

• instead we store the m (e.g., m = 30) most recent values of

s_j = x^{(j)} − x^{(j−1)},   y_j = ∇f(x^{(j)}) − ∇f(x^{(j−1)})

• we evaluate ∆x = −H_k^{−1} ∇f(x^{(k)}) recursively, using

H_j^{−1} = (I − (s_j y_j^T)/(y_j^T s_j)) H_{j−1}^{−1} (I − (y_j s_j^T)/(y_j^T s_j)) + (s_j s_j^T)/(y_j^T s_j)

for j = k, k − 1, . . . , k − m + 1, assuming, for example, H_{k−m}^{−1} = I (see the sketch below)
• cost per iteration is O(nm); storage is O(nm)
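
a matrix-free sketch in Python of this recursion (numpy assumed; S and Y hold the m stored pairs ordered oldest to newest, and the recursion bottoms out at H_{k−m}^{−1} = I):

```python
import numpy as np

def lbfgs_apply(g, S, Y):
    """Apply H_k^{-1} to g using stored pairs (s_j, y_j), without forming any matrix."""
    def apply_rec(g, j):
        if j < 0:                                   # base case: H_{k-m}^{-1} = I
            return g
        s, y = S[j], Y[j]
        rho = 1.0 / (y @ s)
        q = g - rho * (s @ g) * y                   # (I - y s^T/(y^T s)) g
        r = apply_rec(q, j - 1)                     # H_{j-1}^{-1} q
        return r - rho * (y @ r) * s + rho * (s @ g) * s   # left factor plus s s^T term
    return apply_rec(g, len(S) - 1)                 # O(nm) work, O(nm) storage

# the quasi-Newton direction is then dx = -lbfgs_apply(grad(x), S, Y)
```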

Quasi-Newton methods 2-14


References

• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapters 6 and 7

• J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1983)

Quasi-Newton methods 2-15
