
Algorithms for large-scale convex optimization DTU 2010

2. Unconstrained optimization methods

• Newton and quasi-Newton methods
• gradient and conjugate gradient method
• subgradient method
• complexity bounds


Newton method
iteration for minimizing a closed, convex, twice differentiable f:

    x^{(k+1)} = x^{(k)} - t_k ∇²f(x^{(k)})^{-1} ∇f(x^{(k)})

• step size t_k is fixed or determined by line search
• we often suppress the iteration number:

    x^+ = x - t ∇²f(x)^{-1} ∇f(x),    x := x - t ∇²f(x)^{-1} ∇f(x)

• advantages of Newton's method: fast convergence, affine invariance
• disadvantages: requires second derivatives and the solution of a linear equation; can be too expensive for large-scale applications
Unconstrained optimization methods 2-2
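A minimal numpy sketch of the Newton iteration above with fixed step t = 1; the test function f(x) = Σ_i (exp(x_i) + exp(-x_i)) is my own illustrative choice (smooth, strictly convex, minimized at x = 0), not from the slides.

```python
import numpy as np

def newton(grad, hess, x0, t=1.0, tol=1e-10, max_iter=50):
    """Newton iteration x := x - t * hess(x)^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # solve the Newton system rather than forming the inverse
        x = x - t * np.linalg.solve(hess(x), g)
    return x

# illustrative test function f(x) = sum_i (exp(x_i) + exp(-x_i)), minimizer x* = 0
grad = lambda x: np.exp(x) - np.exp(-x)
hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
x = newton(grad, hess, np.array([1.0, -0.5]))
```

From this starting point the iterates shrink toward the origin in a handful of steps, illustrating the fast local convergence mentioned above.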

Backtracking line search


to determine t_k, start at some t = t̂ (e.g., t̂ = 1), and repeat t := βt until

    f(x + tΔx) < f(x) + αt ∇f(x)^T Δx

where x = x^{(k)}, Δx = -∇²f(x^{(k)})^{-1} ∇f(x^{(k)})

[figure: f(x + tΔx) as a function of t ≥ 0, together with the lines f(x) + t∇f(x)^T Δx and f(x) + αt∇f(x)^T Δx; backtracking stops at the first t for which the function value falls below the second line]

t̂ > 0, α ∈ (0, 1/2), β ∈ (0, 1) are algorithm parameters


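A sketch of this backtracking rule (parameter names α, β as above; the quadratic test problem and starting step t0 = 10 are illustrative only):

```python
import numpy as np

def backtracking(f, grad, x, dx, alpha=0.25, beta=0.5, t0=1.0):
    """Shrink t until f(x + t*dx) < f(x) + alpha*t*grad(x)^T dx.
    Assumes dx is a descent direction, so the loop terminates."""
    t = t0
    fx, slope = f(x), grad(x) @ dx
    while f(x + t * dx) >= fx + alpha * t * slope:
        t *= beta
    return t

# example: f(x) = ||x||^2, a gradient step whose initial t is far too long
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 2.0])
t = backtracking(f, grad, x, -grad(x), t0=10.0)
```

For this instance the condition holds exactly for t < 0.75, so the halving sequence 10, 5, 2.5, 1.25 stops at t = 0.625, and the step strictly decreases f.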

Variable metric methods


x^+ = x - tH^{-1}∇f(x)

H ≻ 0 is an approximation of the Hessian at x, chosen to:
• avoid calculation of second derivatives
• simplify computation of the search direction

variable metric interpretation: Δx = -H^{-1}∇f(x) is the steepest descent direction at x for the quadratic norm

    ‖z‖_H = (z^T Hz)^{1/2}

Quasi-Newton methods

given starting point x^{(0)} ∈ dom f, H_0 ≻ 0
for k = 1, 2, . . ., until a stopping criterion is satisfied
1. compute quasi-Newton direction Δx = -H_{k-1}^{-1} ∇f(x^{(k-1)})
2. determine step size t (e.g., by backtracking line search)
3. compute x^{(k)} = x^{(k-1)} + tΔx
4. compute H_k

• can also propagate H_k^{-1} to simplify calculation of Δx
• different methods use different rules for updating H_k in step 4

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update


H_k = H_{k-1} + (yy^T)/(y^T s) - (H_{k-1}ss^T H_{k-1})/(s^T H_{k-1}s)

H_k^{-1} = (I - (sy^T)/(y^T s)) H_{k-1}^{-1} (I - (ys^T)/(y^T s)) + (ss^T)/(y^T s)

where s = x^{(k)} - x^{(k-1)}, y = ∇f(x^{(k)}) - ∇f(x^{(k-1)})

• note that y^T s > 0 for strictly convex f; this follows from f(v) > f(u) + ∇f(u)^T(v - u), evaluated at u = x^{(k-1)}, v = x^{(k)} and at u = x^{(k)}, v = x^{(k-1)}
• cost of the update or inverse update is O(n²) operations
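The inverse update above is a rank-two modification; a sketch, with a numerical check of the inverse secant condition H_k^{-1} y = s (the random instance and the construction of y guaranteeing y^T s > 0 are mine):

```python
import numpy as np

def bfgs_inverse_update(Hinv, s, y):
    """H_k^{-1} = (I - s y^T/(y^T s)) H_{k-1}^{-1} (I - y s^T/(y^T s)) + s s^T/(y^T s)."""
    rho = 1.0 / (y @ s)                       # requires y^T s > 0
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ Hinv @ V + rho * np.outer(s, s)

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Hinv = A @ A.T + n * np.eye(n)                # some H_{k-1}^{-1} > 0
s = rng.standard_normal(n)
y = 2.0 * s + 0.3 * np.roll(s, 1)             # Cauchy-Schwarz gives y^T s >= 1.7 ||s||^2 > 0
Hk_inv = bfgs_inverse_update(Hinv, s, y)
```

Multiplying the update by y makes the first term vanish and the second return s, so H_k^{-1} y = s holds exactly; positive definiteness is preserved as shown on the next slide.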

Positive deniteness
if y^T s > 0, the BFGS update preserves positive definiteness of H_k

proof: from the inverse update formula,

    v^T H_k^{-1} v = (v - ((s^T v)/(y^T s)) y)^T H_{k-1}^{-1} (v - ((s^T v)/(y^T s)) y) + (s^T v)²/(y^T s)

if H_{k-1} ≻ 0, both terms are nonnegative for all v; the second term is zero only if s^T v = 0, and then the first term is zero only if v = 0

this ensures that Δx = -H_k^{-1} ∇f(x^{(k)}) is a descent direction

Secant condition
the BFGS update satisfies the secant condition H_k s = y, i.e.,

    H_k (x^{(k)} - x^{(k-1)}) = ∇f(x^{(k)}) - ∇f(x^{(k-1)})

interpretation: define the second-order approximation at x^{(k)}

    f_quad(z) = f(x^{(k)}) + ∇f(x^{(k)})^T (z - x^{(k)}) + (1/2)(z - x^{(k)})^T H_k (z - x^{(k)})

the secant condition implies that the gradient of f_quad agrees with ∇f at x^{(k-1)}:

    ∇f_quad(x^{(k-1)}) = ∇f(x^{(k)}) + H_k (x^{(k-1)} - x^{(k)}) = ∇f(x^{(k-1)})

secant method: for f : R → R, BFGS with unit step size gives the secant method

    x^{(k+1)} = x^{(k)} - f′(x^{(k)})/H_k,    H_k = (f′(x^{(k)}) - f′(x^{(k-1)}))/(x^{(k)} - x^{(k-1)})

[figure: f(z) and its quadratic approximation f_quad(z), with the iterates x^{(k-1)}, x^{(k)}, x^{(k+1)}]
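A sketch of this scalar secant method; the test problem f(x) = x⁴/4 - x (so f′(x) = x³ - 1, minimizer x* = 1) and the starting points are my own choices:

```python
def secant_minimize(fprime, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: H_k = (f'(x_k) - f'(x_{k-1}))/(x_k - x_{k-1}),
    then x_{k+1} = x_k - f'(x_k)/H_k."""
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if abs(g1) < tol or g1 == g0:     # converged, or H_k would be zero
            break
        x0, x1 = x1, x1 - g1 * (x1 - x0) / (g1 - g0)
    return x1

# minimize f(x) = x**4/4 - x, i.e. find the root of f'(x) = x**3 - 1
xstar = secant_minimize(lambda x: x**3 - 1.0, 0.5, 1.5)
```

Like BFGS, the method uses only first derivatives; the superlinear local convergence on the next slide is the general version of the secant method's classical convergence rate.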

Convergence
global result: if f is strongly convex (∇²f(x) ⪰ mI for some m > 0), BFGS with backtracking line search converges from any x^{(0)} and any H_0 ≻ 0

local convergence: if f is strongly convex and ∇²f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

    ‖x^{(k+1)} - x*‖_2 ≤ c_k ‖x^{(k)} - x*‖_2

where c_k → 0 (cf. the quadratic local convergence of Newton's method)

Example
minimize  c^T x - Σ_{i=1}^m log(b_i - a_i^T x)    (n = 100, m = 500)

[figure: f(x^{(k)}) - f* versus k for Newton and BFGS; Newton reaches accuracy about 10^{-12} in fewer than 10 iterations, BFGS in roughly 100-140 iterations]

• cost per Newton iteration: O(n³) plus the cost of computing ∇²f(x)
• cost per BFGS iteration: O(n²)

Square root BFGS update


to improve numerical stability, can propagate H_k in factored form

if H_{k-1} = L_{k-1}L_{k-1}^T, then H_k = L_k L_k^T with

    L_k = L_{k-1} (I + ((αỹ - s̃)s̃^T)/(s̃^T s̃))

where ỹ = L_{k-1}^{-1} y,  s̃ = L_{k-1}^T s,  α = ((s̃^T s̃)/(y^T s))^{1/2}

if L_{k-1} is triangular, the cost of reducing L_k to triangular form is O(n²)
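A numerical sanity check of this factored update (variable names a, yt, st stand for α, ỹ, s̃; the random instance is mine): L_k L_k^T should reproduce the direct BFGS update of H.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Direct BFGS update H_k = H + yy^T/(y^T s) - Hss^TH/(s^T Hs)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def sqrt_bfgs_update(L, s, y):
    """Factored update: if H = LL^T then H_k = L_k L_k^T with
    L_k = L (I + (a*yt - st) st^T / (st^T st))."""
    yt = np.linalg.solve(L, y)          # yt = L^{-1} y
    st = L.T @ s                        # st = L^T s
    a = np.sqrt((st @ st) / (y @ s))
    return L @ (np.eye(len(s)) + np.outer(a * yt - st, st) / (st @ st))

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)
L = np.linalg.cholesky(H)
s = rng.standard_normal(n)
y = H @ s + 0.5 * s                     # guarantees y^T s = s^T Hs + 0.5 s^T s > 0
Lk = sqrt_bfgs_update(L, s, y)
```

Expanding L_k L_k^T term by term, the cross terms in αỹ - s̃ cancel and the α² coefficient collapses to 1/(y^T s), recovering the rank-two BFGS formula.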

Optimality of BFGS update


X = H_k solves the convex optimization problem

    minimize    tr(H_{k-1}^{-1} X) - log det(H_{k-1}^{-1} X) - n
    subject to  Xs = y

• the cost function is nonnegative, and equal to zero only if X = H_{k-1}
• it is also known as the relative entropy between the densities N(0, X), N(0, H_{k-1})
• the optimality result follows from the KKT conditions: X = H_k satisfies

    X^{-1} = H_{k-1}^{-1} + (1/2)(sν^T + νs^T),    Xs = y,    X ≻ 0

  with ν = (1/(s^T y)) ((1 + (y^T H_{k-1}^{-1} y)/(y^T s)) s - 2H_{k-1}^{-1} y)

Davidon-Fletcher-Powell (DFP) update


switch the roles of H_{k-1} and X in the BFGS objective above:

    minimize    tr(H_{k-1} X^{-1}) - log det(H_{k-1} X^{-1}) - n
    subject to  Xs = y

• minimizes the relative entropy between N(0, H_{k-1}) and N(0, X)
• the problem is convex in X^{-1} (with the constraint written as s = X^{-1} y)
• the solution is the "dual" of the BFGS formula:

    H_k = (I - (ys^T)/(s^T y)) H_{k-1} (I - (sy^T)/(s^T y)) + (yy^T)/(s^T y)

(known as the DFP update) pre-dates the BFGS update, but is less often used

Limited memory quasi-Newton methods


• the main disadvantage of quasi-Newton methods is the need to store H_k or H_k^{-1}
• limited-memory BFGS (L-BFGS): do not store H_k^{-1} explicitly
• instead, store the m (e.g., m = 30) most recent values of

    s_j = x^{(j)} - x^{(j-1)},    y_j = ∇f(x^{(j)}) - ∇f(x^{(j-1)})

• evaluate Δx = -H_k^{-1} ∇f(x^{(k)}) recursively, using

    H_j^{-1} = (I - (s_j y_j^T)/(y_j^T s_j)) H_{j-1}^{-1} (I - (y_j s_j^T)/(y_j^T s_j)) + (s_j s_j^T)/(y_j^T s_j)

  for j = k, k-1, . . . , k-m+1, assuming, for example, H_{k-m}^{-1} = I
• cost per iteration is O(nm); storage is O(nm)
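A sketch of evaluating H_k^{-1} g via this recursion without ever forming a matrix, so the cost stays O(nm); the dense recomputation at the end is only there to check the matrix-free version on a random instance of my own.

```python
import numpy as np

def lbfgs_apply(ss, ys, g):
    """Evaluate H_k^{-1} g from the stored pairs (s_j, y_j), oldest first, using
    H_j^{-1} = (I - r_j s_j y_j^T) H_{j-1}^{-1} (I - r_j y_j s_j^T) + r_j s_j s_j^T
    with r_j = 1/(y_j^T s_j) and base case H_{k-m}^{-1} = I."""
    def apply(j, v):
        if j < 0:
            return v                      # H_{k-m}^{-1} = I
        s, y = ss[j], ys[j]
        r = 1.0 / (y @ s)
        u = v - r * y * (s @ v)           # (I - r y s^T) v
        w = apply(j - 1, u)               # H_{j-1}^{-1} applied recursively
        return w - r * s * (y @ w) + r * s * (s @ v)
    return apply(len(ss) - 1, g)

# dense reference computation, only to verify the recursion
rng = np.random.default_rng(2)
n, m = 6, 3
ss = [rng.standard_normal(n) for _ in range(m)]
ys = [2.0 * s + 0.3 * np.roll(s, 1) for s in ss]   # ensures y_j^T s_j > 0
g = rng.standard_normal(n)
Hinv = np.eye(n)
for s, y in zip(ss, ys):
    r = 1.0 / (y @ s)
    V = np.eye(n) - r * np.outer(y, s)
    Hinv = V.T @ Hinv @ V + r * np.outer(s, s)
```

Production codes usually use the equivalent iterative "two-loop recursion" instead of this direct recursion, but both evaluate the same operator.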


Classical gradient method


x^{(k)} = x^{(k-1)} - t_k ∇f(x^{(k-1)})

• t_k > 0 is the step size (constant or determined by line search)
• advantage: inexpensive
• disadvantage: often very slow; the convergence rate is problem-dependent

simple quadratic example:

    f(x_1, x_2) = (1/2)(x_1² + M x_2²)

with M = 10, t = 0.18

[figure: contour lines of f with the gradient iterates zigzagging slowly toward the minimizer at the origin]
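The example can be reproduced in a few lines (the starting point is my choice): the x_2 component oscillates with contraction factor |1 - tM| = 0.8 per step, while x_1 shrinks only by |1 - t| = 0.82, so progress is slow in both coordinates.

```python
import numpy as np

M, t = 10.0, 0.18
grad = lambda x: np.array([x[0], M * x[1]])   # gradient of f(x) = (x1^2 + M*x2^2)/2

x = np.array([10.0, 1.0])                     # starting point (my choice)
traj = [x.copy()]
for k in range(100):
    x = x - t * grad(x)
    traj.append(x.copy())
# per step: x1 is multiplied by 0.82, x2 by -0.8 (hence the zigzag in x2)
```

After 10 iterations the iterate is still more than distance 1 from the optimum; only after roughly 100 iterations does it reach high accuracy.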

Modications
multistep methods
• heavy-ball method: x^{(k)} = x^{(k-1)} - t∇f(x^{(k-1)}) + s(x^{(k-1)} - x^{(k-2)})
• conjugate gradient
• Nesterov-type methods (next lecture)

spectral gradient method (Barzilai-Borwein)
• use step size t_k = (s_k^T s_k)/(s_k^T y_k) with

    s_k = x^{(k-1)} - x^{(k-2)},    y_k = ∇f(x^{(k-1)}) - ∇f(x^{(k-2)})

• often implemented with non-monotone line search

Unconstrained quadratic minimization


minimize  f(x) = (1/2)x^T Ax - b^T x

with A ∈ S^n_{++}; equivalent to solving Ax = b

• the residual r = b - Ax is the negative gradient at x: r = -∇f(x)
• the conjugate gradient (CG) method was invented by Hestenes and Stiefel around 1951
• it is the most widely used iterative method for solving Ax = b with A ≻ 0
• it can be extended to non-quadratic unconstrained minimization

Krylov sequence
the CG algorithm is a recursive method for computing the Krylov sequence

    x^{(k)} = argmin_{x ∈ K_k} f(x),    k ≥ 0

where K_k is the Krylov subspace

    K_0 = {0},    K_k = span{b, Ab, . . . , A^{k-1}b}  for k ≥ 1

• A^{-1}b ∈ K_n, therefore x^{(n)} = A^{-1}b
• there is a simple two-term recurrence

    x^{(k+1)} = x^{(k)} + a_k r_k + b_k (x^{(k)} - x^{(k-1)})
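The two-term recurrence is usually implemented in the standard Hestenes-Stiefel form; a compact sketch, tested on a random well-conditioned SPD matrix of my own:

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=None):
    """Conjugate gradient method for Ax = b with A symmetric positive definite."""
    x = np.zeros(len(b))
    r = b.copy()                         # residual r = b - Ax = -grad f(x)
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p        # new search direction, A-conjugate to the old
        rr = rr_new
    return x

rng = np.random.default_rng(3)
n = 20
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)              # well-conditioned SPD test matrix
b = rng.standard_normal(n)
x = cg(A, b)
```

Each iteration costs one matrix-vector product plus O(n) vector work, which is what makes CG attractive for large sparse systems.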

Conjugate gradient method as iterative method


in exact arithmetic
• CG was originally proposed as a direct (non-iterative) method
• in theory, it converges in at most n steps

in practice
• due to rounding errors, it can take more than n steps (or fail)
• CG is now used as an iterative method
• with luck (a good spectrum of A), it gives a good approximation in far fewer than n steps

Applications in optimization
nonlinear conjugate gradient methods
• extend the linear CG method to non-quadratic functions
• local convergence similar to linear CG
• limited global convergence theory

inexact and truncated Newton methods
• use the conjugate gradient method to compute an (approximate) Newton step
• less reliable than exact Newton methods, but can handle very large problems

Fletcher-Reeves CG algorithm
the CG algorithm modified to minimize a non-quadratic convex f

given x^{(0)}
for k = 1, 2, . . .
1. return x^{(k-1)} if ‖∇f(x^{(k-1)})‖_2 is sufficiently small
2. if k = 1, p_1 = -∇f(x^{(0)}); else p_k = -∇f(x^{(k-1)}) + β_k p_{k-1} where

    β_k = ‖∇f(x^{(k-1)})‖_2² / ‖∇f(x^{(k-2)})‖_2²

3. update x^{(k)} = x^{(k-1)} + α_k p_k where α_k = argmin_t f(x^{(k-1)} + t p_k)
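A sketch of the algorithm; on a quadratic f(x) = x^T Ax/2 - b^T x the exact line-search step along p is t = -∇f(x)^T p/(p^T Ap), and Fletcher-Reeves then reproduces linear CG, converging in at most n steps (the test problem is my own):

```python
import numpy as np

def fletcher_reeves(grad, line_search, x0, tol=1e-8, max_iter=100):
    """Fletcher-Reeves CG with beta_k = ||g_k||^2 / ||g_{k-1}||^2."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                               # first iteration is a gradient step
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        t = line_search(x, p)
        x = x + t * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x

# quadratic test problem f(x) = x^T A x / 2 - b^T x with exact line search
A = np.diag([1.0, 4.0, 9.0, 16.0])
b = np.ones(4)
grad = lambda x: A @ x - b
exact_step = lambda x, p: -(grad(x) @ p) / (p @ (A @ p))
x = fletcher_reeves(grad, exact_step, np.zeros(4))
```

For non-quadratic f, the exact line search would be replaced by an inexact one (and periodic restarts added, as noted on the next slide).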

some observations
• the first iteration is a gradient step; practical implementations also restart the algorithm by taking a gradient step, for example, every n iterations
• the update is a gradient step with a momentum term:

    x^{(k)} = x^{(k-1)} - α_k ∇f(x^{(k-1)}) + γ_k (x^{(k-1)} - x^{(k-2)})

• with exact line search, the method reduces to linear CG for quadratic f

line search
• exact line search in step 3 implies ∇f(x^{(k)})^T p_k = 0
• therefore p_k is a descent direction at x^{(k-1)}: from step 2 and exact line search in the previous iteration,

    ∇f(x^{(k-1)})^T p_k = -‖∇f(x^{(k-1)})‖_2² < 0

Variations
Polak-Ribière: in step 2, compute β_k from

    β_k = ∇f(x^{(k-1)})^T (∇f(x^{(k-1)}) - ∇f(x^{(k-2)})) / ‖∇f(x^{(k-2)})‖_2²

Hestenes-Stiefel:

    β_k = ∇f(x^{(k-1)})^T (∇f(x^{(k-1)}) - ∇f(x^{(k-2)})) / (p_{k-1}^T (∇f(x^{(k-1)}) - ∇f(x^{(k-2)})))

the formulas are equivalent for quadratic f and exact line search

Interpretation as restarted BFGS method


the BFGS inverse update with H_{k-1} = I:

    H_k^{-1} = I + (1 + (y^T y)/(y^T s)) (ss^T)/(y^T s) - (ys^T + sy^T)/(y^T s)

where y = ∇f(x^{(k)}) - ∇f(x^{(k-1)}), s = x^{(k)} - x^{(k-1)}

• ∇f(x^{(k)})^T s = 0 if x^{(k)} is determined by exact line search
• the quasi-Newton step in iteration k is then

    -H_k^{-1} ∇f(x^{(k)}) = -∇f(x^{(k)}) + ((y^T ∇f(x^{(k)}))/(y^T s)) s

• this is the Hestenes-Stiefel update
• nonlinear CG can therefore be interpreted as L-BFGS with m = 1


Subgradient method
to minimize a nondifferentiable convex function f: choose x^{(0)} and repeat

    x^{(k)} = x^{(k-1)} - t_k g^{(k-1)},    k = 1, 2, . . .

where g^{(k-1)} is any subgradient of f at x^{(k-1)}

step size rules
• fixed step: t_k constant
• fixed length: t_k ‖g^{(k-1)}‖_2 constant (i.e., ‖x^{(k)} - x^{(k-1)}‖_2 constant)
• diminishing: t_k → 0, Σ_{k=1}^∞ t_k = ∞
• optimal step size when f* is known: t_k = (f(x^{(k-1)}) - f*)/‖g^{(k-1)}‖_2²

Convergence results
assumptions
• f has finite optimal value f* and a minimizer x*, with ‖x^{(0)} - x*‖_2 ≤ R
• f is convex and Lipschitz continuous with constant G > 0:

    |f(x) - f(y)| ≤ G ‖x - y‖_2    for all x, y

results (f_best^{(k)} = min_{i≤k} f(x^{(i)}) is the best value up to iteration k)
• convergence requires a diminishing step size: t_k → 0, Σ_{k=1}^∞ t_k = ∞
• with a proper step size, one can derive the bound

    f_best^{(k)} - f* ≤ GR/√k

• the number of iterations to reach f_best^{(k)} - f* ≤ ε is O(1/ε²)

Example: 1-norm minimization


minimize ‖Ax - b‖_1    (A ∈ R^{500×100}, b ∈ R^{500})

a subgradient is given by g = A^T sign(Ax - b)

[figure: f(x^{(k)}) - f* versus k for the fixed step length t_k = s/‖g^{(k-1)}‖_2 with s = 0.1, 0.01, 0.001; the three choices trade off speed of progress against the accuracy of the final plateau]
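A scaled-down version of this experiment (smaller random instance and seed of my own choosing), using the fixed step length rule with s = 0.01 and tracking the best value found, since the subgradient method is not a descent method:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 10                            # scaled-down instance
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: np.linalg.norm(A @ x - b, 1)
subgrad = lambda x: A.T @ np.sign(A @ x - b)

x = np.zeros(n)
f_best = f(x)
s = 0.01                                 # fixed step length
for k in range(3000):
    g = subgrad(x)
    x = x - (s / np.linalg.norm(g)) * g  # step size t_k = s / ||g||_2
    f_best = min(f_best, f(x))
```

As on the slide, f_best decreases quickly at first and then stalls at a level determined by the chosen step length s.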

diminishing step size: t_k = 0.01/√k and t_k = 0.01/k

[figure: f_best^{(k)} - f* versus k over 5000 iterations for the two diminishing step size rules]

Finding a point in the intersection of convex sets


to find a point x ∈ C = C_1 ∩ · · · ∩ C_m (m closed convex sets), minimize

    f(x) = max{dist(x, C_1), . . . , dist(x, C_m)}

where dist(x, C_j) = inf_{z ∈ C_j} ‖x - z‖_2 = ‖x - P_{C_j}(x)‖_2 (P_{C_j} is Euclidean projection on C_j)

• dist(x, C_j) is a convex function if C_j is convex
• f* = 0 if the intersection is nonempty
• to find a subgradient of f, we need a subgradient of the distance to the farthest set C_j

subgradient of the distance to a closed convex set S: for x ∉ S, S is contained in the halfspace

    H = {z | (x - P_S(x))^T (z - P_S(x)) ≤ 0}

[figure: the set S, the point x, its projection P_S(x), and the separating halfspace H]

therefore

    dist(y, S) ≥ (x - P_S(x))^T (y - P_S(x)) / ‖x - P_S(x)‖_2

(for y ∉ H, the r.h.s. is the distance to H; for y ∈ H, the r.h.s. is nonpositive)

hence,

    dist(y, S) ≥ (x - P_S(x))^T (y - x) / ‖x - P_S(x)‖_2 + ‖x - P_S(x)‖_2

conclusion: (x - P_S(x))/dist(x, S) is a subgradient of dist(·, S) at x ∉ S

the subgradient method with optimal step size applied to

    minimize  f(x) = max{dist(x, C_1), . . . , dist(x, C_m)}

if C_j is the farthest set at iteration k (i.e., dist(x^{(k-1)}, C_j) = f(x^{(k-1)})):

    x^{(k)} = x^{(k-1)} - (f(x^{(k-1)})/dist(x^{(k-1)}, C_j)) (x^{(k-1)} - P_{C_j}(x^{(k-1)}))
            = P_{C_j}(x^{(k-1)})

• a version of the famous alternating projections algorithm
• at each step, project the current point onto the farthest set
• for m = 2 sets, the projections alternate onto one set, then the other
• convergence: dist(x^{(k)}, C) → 0 as k → ∞

Alternating projections
first few iterations:

[figure: iterates x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}, . . . projected alternately onto C_1 and C_2]

. . . x^{(k)} eventually converges to a point x* ∈ C_1 ∩ C_2
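A two-set sketch in R², with sets of my own choosing: C_1 the unit disk and C_2 the halfplane x_1 ≥ 0.5; alternating projections converge to a point in the intersection (here the boundary point (0.5, √0.75)).

```python
import numpy as np

def proj_disk(x):                 # projection onto C1 = {x : ||x||_2 <= 1}
    nx = np.linalg.norm(x)
    return x if nx <= 1 else x / nx

def proj_halfplane(x):            # projection onto C2 = {x : x1 >= 0.5}
    return np.array([max(x[0], 0.5), x[1]])

x = np.array([-2.0, 3.0])         # starting point outside both sets
for _ in range(100):
    x = proj_disk(proj_halfplane(x))
```

Each cycle shortens the distance to the intersection, matching the convergence statement dist(x^{(k)}, C) → 0 above.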

Example: Positive semidenite matrix completion


some entries of X ∈ S^n are fixed; find values for the others so that X ∈ C_1 ∩ C_2, with C_1 = S^n_+

• projection onto C_1 by eigenvalue decomposition and truncation:

    P_{C_1}(X) = Σ_{i=1}^n max{0, λ_i} q_i q_i^T    if X = Σ_{i=1}^n λ_i q_i q_i^T

• C_2 is the (affine) set of matrices in S^n with the specified fixed entries
• projection of X onto C_2 by re-setting the specified entries to their fixed values

example: a 100 × 100 matrix missing about 71% of its entries; initialize X^{(0)} with the unknown entries set to 0

[figure: ‖X^{(k+1)} - X^{(k)}‖_F versus k; the successive difference decreases from about 10 to below 10^{-9} in roughly 50 iterations]

Summary: subgradient method

• often very slow
• no good stopping criterion
• theoretical complexity: O(1/ε²) iterations to find an ε-suboptimal point


Complexity bound for rst-order methods


function class: convex with finite minimum; gradient Lipschitz continuous:

    ∃ L > 0:  ‖∇f(x) - ∇f(y)‖_2 ≤ L ‖x - y‖_2    for all x, y

problem: find x with f(x) - f* ≤ ε

algorithm class: any iterative method that selects x^{(k)} in

    x^{(0)} + span{∇f(x^{(0)}), ∇f(x^{(1)}), . . . , ∇f(x^{(k-1)})}

bound: no first-order method in this class can have a worst-case complexity better than O(1/√ε) iterations

• the gradient method is not optimal (its complexity is O(1/ε); see next lecture)

Optimality of the subgradient method


function class: convex with finite minimum, and Lipschitz continuous

problem: find x with f(x) - f* ≤ ε

algorithm class: any iterative method that
• obtains function information via an oracle (black box): for a given x, the oracle returns f(x) and a subgradient at x
• chooses the iterate x^{(k)} in the set x^{(0)} + span{g^{(0)}, g^{(1)}, . . . , g^{(k-1)}}

result: no method in this class can have a worst-case complexity better than O(1/ε²)

hence, the subgradient method is optimal (see the convergence results for the subgradient method)

References
algorithms for unconstrained minimization
• B.T. Polyak, Introduction to Optimization (1987)
• J. Nocedal, S.J. Wright, Numerical Optimization (2006)
• D. Bertsekas, Nonlinear Programming (1999)

subgradient method
• S. Boyd, lecture notes and slides for EE364b, Convex Optimization II
• N.Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998)

fundamental complexity bounds
• Yu. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (2004)
