
Algorithms for large-scale convex optimization DTU 2010

2. Unconstrained optimization methods

• Newton and quasi-Newton methods
• gradient and conjugate gradient method
• subgradient method
• complexity bounds


Newton method
iteration for minimizing a closed, convex, twice differentiable f:

    x^{(k+1)} = x^{(k)} - t_k ∇²f(x^{(k)})^{-1} ∇f(x^{(k)})

• step size t_k is fixed or determined by line search
• we often suppress the iteration number:

    x^+ = x - t ∇²f(x)^{-1} ∇f(x),    x := x - t ∇²f(x)^{-1} ∇f(x)

• advantages of Newton's method: fast convergence, affine invariance
• disadvantages: requires second derivatives and the solution of a linear equation; can be too expensive for large-scale applications
Unconstrained optimization methods 2-2
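A minimal numpy sketch of the Newton iteration above with fixed step t = 1; the test function f(x) = Σ_i (exp(x_i) + exp(-x_i)) is my own illustrative choice (smooth, strictly convex, minimized at x = 0), not from the slides.

```python
import numpy as np

def newton(grad, hess, x0, t=1.0, tol=1e-10, max_iter=50):
    """Newton iteration x := x - t * hess(x)^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # solve the Newton system rather than forming the inverse
        x = x - t * np.linalg.solve(hess(x), g)
    return x

# illustrative test function f(x) = sum_i (exp(x_i) + exp(-x_i)), minimizer x* = 0
grad = lambda x: np.exp(x) - np.exp(-x)
hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
x = newton(grad, hess, np.array([1.0, -0.5]))
```

From this starting point the iterates shrink toward the origin in a handful of steps, illustrating the fast local convergence mentioned above.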

Backtracking line search


to determine t_k, start at some t = t̂ (e.g., t̂ = 1), and repeat t := βt until

    f(x + tΔx) < f(x) + αt ∇f(x)^T Δx

where x = x^{(k)}, Δx = -∇²f(x^{(k)})^{-1} ∇f(x^{(k)})

[figure: f(x + tΔx) as a function of t ≥ 0, together with the lines f(x) + t∇f(x)^T Δx and f(x) + αt∇f(x)^T Δx; backtracking stops at the first t for which the function value falls below the second line]

t̂ > 0, α ∈ (0, 1/2), β ∈ (0, 1) are algorithm parameters


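A sketch of this backtracking rule (parameter names α, β as above; the quadratic test problem and starting step t0 = 10 are illustrative only):

```python
import numpy as np

def backtracking(f, grad, x, dx, alpha=0.25, beta=0.5, t0=1.0):
    """Shrink t until f(x + t*dx) < f(x) + alpha*t*grad(x)^T dx.
    Assumes dx is a descent direction, so the loop terminates."""
    t = t0
    fx, slope = f(x), grad(x) @ dx
    while f(x + t * dx) >= fx + alpha * t * slope:
        t *= beta
    return t

# example: f(x) = ||x||^2, a gradient step whose initial t is far too long
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 2.0])
t = backtracking(f, grad, x, -grad(x), t0=10.0)
```

For this instance the condition holds exactly for t < 0.75, so the halving sequence 10, 5, 2.5, 1.25 stops at t = 0.625, and the step strictly decreases f.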

Variable metric methods


x^+ = x - tH^{-1}∇f(x)

H ≻ 0 is an approximation of the Hessian at x, chosen to:
• avoid calculation of second derivatives
• simplify computation of the search direction

variable metric interpretation: Δx = -H^{-1}∇f(x) is the steepest descent direction at x for the quadratic norm

    ‖z‖_H = (z^T Hz)^{1/2}

Quasi-Newton methods

given starting point x^{(0)} ∈ dom f, H_0 ≻ 0
for k = 1, 2, . . ., until a stopping criterion is satisfied
1. compute quasi-Newton direction Δx = -H_{k-1}^{-1} ∇f(x^{(k-1)})
2. determine step size t (e.g., by backtracking line search)
3. compute x^{(k)} = x^{(k-1)} + tΔx
4. compute H_k

• can also propagate H_k^{-1} to simplify calculation of Δx
• different methods use different rules for updating H_k in step 4

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update


H_k = H_{k-1} + (yy^T)/(y^T s) - (H_{k-1}ss^T H_{k-1})/(s^T H_{k-1}s)

H_k^{-1} = (I - (sy^T)/(y^T s)) H_{k-1}^{-1} (I - (ys^T)/(y^T s)) + (ss^T)/(y^T s)

where s = x^{(k)} - x^{(k-1)}, y = ∇f(x^{(k)}) - ∇f(x^{(k-1)})

• note that y^T s > 0 for strictly convex f; this follows from f(v) > f(u) + ∇f(u)^T(v - u), evaluated at u = x^{(k-1)}, v = x^{(k)} and at u = x^{(k)}, v = x^{(k-1)}
• cost of the update or inverse update is O(n²) operations
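The inverse update above is a rank-two modification; a sketch, with a numerical check of the inverse secant condition H_k^{-1} y = s (the random instance and the construction of y guaranteeing y^T s > 0 are mine):

```python
import numpy as np

def bfgs_inverse_update(Hinv, s, y):
    """H_k^{-1} = (I - s y^T/(y^T s)) H_{k-1}^{-1} (I - y s^T/(y^T s)) + s s^T/(y^T s)."""
    rho = 1.0 / (y @ s)                       # requires y^T s > 0
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ Hinv @ V + rho * np.outer(s, s)

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Hinv = A @ A.T + n * np.eye(n)                # some H_{k-1}^{-1} > 0
s = rng.standard_normal(n)
y = 2.0 * s + 0.3 * np.roll(s, 1)             # Cauchy-Schwarz gives y^T s >= 1.7 ||s||^2 > 0
Hk_inv = bfgs_inverse_update(Hinv, s, y)
```

Multiplying the update by y makes the first term vanish and the second return s, so H_k^{-1} y = s holds exactly; positive definiteness is preserved as shown on the next slide.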

Positive deniteness
if y^T s > 0, the BFGS update preserves positive definiteness of H_k

proof: from the inverse update formula,

    v^T H_k^{-1} v = (v - ((s^T v)/(y^T s)) y)^T H_{k-1}^{-1} (v - ((s^T v)/(y^T s)) y) + (s^T v)²/(y^T s)

if H_{k-1} ≻ 0, both terms are nonnegative for all v; the second term is zero only if s^T v = 0, and then the first term is zero only if v = 0

this ensures that Δx = -H_k^{-1} ∇f(x^{(k)}) is a descent direction

Secant condition
the BFGS update satisfies the secant condition H_k s = y, i.e.,

    H_k (x^{(k)} - x^{(k-1)}) = ∇f(x^{(k)}) - ∇f(x^{(k-1)})

interpretation: define the second-order approximation at x^{(k)}

    f_quad(z) = f(x^{(k)}) + ∇f(x^{(k)})^T (z - x^{(k)}) + (1/2)(z - x^{(k)})^T H_k (z - x^{(k)})

the secant condition implies that the gradient of f_quad agrees with ∇f at x^{(k-1)}:

    ∇f_quad(x^{(k-1)}) = ∇f(x^{(k)}) + H_k (x^{(k-1)} - x^{(k)}) = ∇f(x^{(k-1)})

secant method: for f : R → R, BFGS with unit step size gives the secant method

    x^{(k+1)} = x^{(k)} - f′(x^{(k)})/H_k,    H_k = (f′(x^{(k)}) - f′(x^{(k-1)}))/(x^{(k)} - x^{(k-1)})

[figure: f(z) and its quadratic approximation f_quad(z), with the iterates x^{(k-1)}, x^{(k)}, x^{(k+1)}]
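A sketch of this scalar secant method; the test problem f(x) = x⁴/4 - x (so f′(x) = x³ - 1, minimizer x* = 1) and the starting points are my own choices:

```python
def secant_minimize(fprime, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: H_k = (f'(x_k) - f'(x_{k-1}))/(x_k - x_{k-1}),
    then x_{k+1} = x_k - f'(x_k)/H_k."""
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if abs(g1) < tol or g1 == g0:     # converged, or H_k would be zero
            break
        x0, x1 = x1, x1 - g1 * (x1 - x0) / (g1 - g0)
    return x1

# minimize f(x) = x**4/4 - x, i.e. find the root of f'(x) = x**3 - 1
xstar = secant_minimize(lambda x: x**3 - 1.0, 0.5, 1.5)
```

Like BFGS, the method uses only first derivatives; the superlinear local convergence on the next slide is the general version of the secant method's classical convergence rate.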

Convergence
global result: if f is strongly convex (∇²f(x) ⪰ mI for some m > 0), BFGS with backtracking line search converges from any x^{(0)} and any H_0 ≻ 0

local convergence: if f is strongly convex and ∇²f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

    ‖x^{(k+1)} - x*‖_2 ≤ c_k ‖x^{(k)} - x*‖_2

where c_k → 0 (cf. the quadratic local convergence of Newton's method)

Example
minimize  c^T x - Σ_{i=1}^m log(b_i - a_i^T x)    (n = 100, m = 500)

[figure: f(x^{(k)}) - f* versus k for Newton and BFGS; Newton reaches accuracy about 10^{-12} in fewer than 10 iterations, BFGS in roughly 100-140 iterations]

• cost per Newton iteration: O(n³) plus the cost of computing ∇²f(x)
• cost per BFGS iteration: O(n²)

Square root BFGS update


to improve numerical stability, can propagate H_k in factored form

if H_{k-1} = L_{k-1}L_{k-1}^T, then H_k = L_k L_k^T with

    L_k = L_{k-1} (I + ((αỹ - s̃)s̃^T)/(s̃^T s̃))

where ỹ = L_{k-1}^{-1} y,  s̃ = L_{k-1}^T s,  α = ((s̃^T s̃)/(y^T s))^{1/2}

if L_{k-1} is triangular, the cost of reducing L_k to triangular form is O(n²)
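A numerical sanity check of this factored update (variable names a, yt, st stand for α, ỹ, s̃; the random instance is mine): L_k L_k^T should reproduce the direct BFGS update of H.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Direct BFGS update H_k = H + yy^T/(y^T s) - Hss^TH/(s^T Hs)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def sqrt_bfgs_update(L, s, y):
    """Factored update: if H = LL^T then H_k = L_k L_k^T with
    L_k = L (I + (a*yt - st) st^T / (st^T st))."""
    yt = np.linalg.solve(L, y)          # yt = L^{-1} y
    st = L.T @ s                        # st = L^T s
    a = np.sqrt((st @ st) / (y @ s))
    return L @ (np.eye(len(s)) + np.outer(a * yt - st, st) / (st @ st))

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)
L = np.linalg.cholesky(H)
s = rng.standard_normal(n)
y = H @ s + 0.5 * s                     # guarantees y^T s = s^T Hs + 0.5 s^T s > 0
Lk = sqrt_bfgs_update(L, s, y)
```

Expanding L_k L_k^T term by term, the cross terms in αỹ - s̃ cancel and the α² coefficient collapses to 1/(y^T s), recovering the rank-two BFGS formula.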

Optimality of BFGS update


X = H_k solves the convex optimization problem

    minimize    tr(H_{k-1}^{-1} X) - log det(H_{k-1}^{-1} X) - n
    subject to  Xs = y

• the cost function is nonnegative, and equal to zero only if X = H_{k-1}
• it is also known as the relative entropy between the densities N(0, X), N(0, H_{k-1})
• the optimality result follows from the KKT conditions: X = H_k satisfies

    X^{-1} = H_{k-1}^{-1} + (1/2)(sν^T + νs^T),    Xs = y,    X ≻ 0

  with ν = (1/(s^T y)) ((1 + (y^T H_{k-1}^{-1} y)/(y^T s)) s - 2H_{k-1}^{-1} y)

Davidon-Fletcher-Powell (DFP) update


switch the roles of H_{k-1} and X in the BFGS objective above:

    minimize    tr(H_{k-1} X^{-1}) - log det(H_{k-1} X^{-1}) - n
    subject to  Xs = y

• minimizes the relative entropy between N(0, H_{k-1}) and N(0, X)
• the problem is convex in X^{-1} (with the constraint written as s = X^{-1} y)
• the solution is the "dual" of the BFGS formula:

    H_k = (I - (ys^T)/(s^T y)) H_{k-1} (I - (sy^T)/(s^T y)) + (yy^T)/(s^T y)

(known as the DFP update) pre-dates the BFGS update, but is less often used

Limited memory quasi-Newton methods


• the main disadvantage of quasi-Newton methods is the need to store H_k or H_k^{-1}
• limited-memory BFGS (L-BFGS): do not store H_k^{-1} explicitly
• instead, store the m (e.g., m = 30) most recent values of

    s_j = x^{(j)} - x^{(j-1)},    y_j = ∇f(x^{(j)}) - ∇f(x^{(j-1)})

• evaluate Δx = -H_k^{-1} ∇f(x^{(k)}) recursively, using

    H_j^{-1} = (I - (s_j y_j^T)/(y_j^T s_j)) H_{j-1}^{-1} (I - (y_j s_j^T)/(y_j^T s_j)) + (s_j s_j^T)/(y_j^T s_j)

  for j = k, k-1, . . . , k-m+1, assuming, for example, H_{k-m}^{-1} = I
• cost per iteration is O(nm); storage is O(nm)
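A sketch of evaluating H_k^{-1} g via this recursion without ever forming a matrix, so the cost stays O(nm); the dense recomputation at the end is only there to check the matrix-free version on a random instance of my own.

```python
import numpy as np

def lbfgs_apply(ss, ys, g):
    """Evaluate H_k^{-1} g from the stored pairs (s_j, y_j), oldest first, using
    H_j^{-1} = (I - r_j s_j y_j^T) H_{j-1}^{-1} (I - r_j y_j s_j^T) + r_j s_j s_j^T
    with r_j = 1/(y_j^T s_j) and base case H_{k-m}^{-1} = I."""
    def apply(j, v):
        if j < 0:
            return v                      # H_{k-m}^{-1} = I
        s, y = ss[j], ys[j]
        r = 1.0 / (y @ s)
        u = v - r * y * (s @ v)           # (I - r y s^T) v
        w = apply(j - 1, u)               # H_{j-1}^{-1} applied recursively
        return w - r * s * (y @ w) + r * s * (s @ v)
    return apply(len(ss) - 1, g)

# dense reference computation, only to verify the recursion
rng = np.random.default_rng(2)
n, m = 6, 3
ss = [rng.standard_normal(n) for _ in range(m)]
ys = [2.0 * s + 0.3 * np.roll(s, 1) for s in ss]   # ensures y_j^T s_j > 0
g = rng.standard_normal(n)
Hinv = np.eye(n)
for s, y in zip(ss, ys):
    r = 1.0 / (y @ s)
    V = np.eye(n) - r * np.outer(y, s)
    Hinv = V.T @ Hinv @ V + r * np.outer(s, s)
```

Production codes usually use the equivalent iterative "two-loop recursion" instead of this direct recursion, but both evaluate the same operator.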


Classical gradient method


x^{(k)} = x^{(k-1)} - t_k ∇f(x^{(k-1)})

• t_k > 0 is the step size (constant or determined by line search)
• advantage: inexpensive
• disadvantage: often very slow; the convergence rate is problem-dependent

simple quadratic example:

    f(x_1, x_2) = (1/2)(x_1² + M x_2²)

with M = 10, t = 0.18

[figure: contour lines of f with the gradient iterates zigzagging slowly toward the minimizer at the origin]
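The example can be reproduced in a few lines (the starting point is my choice): the x_2 component oscillates with contraction factor |1 - tM| = 0.8 per step, while x_1 shrinks only by |1 - t| = 0.82, so progress is slow in both coordinates.

```python
import numpy as np

M, t = 10.0, 0.18
grad = lambda x: np.array([x[0], M * x[1]])   # gradient of f(x) = (x1^2 + M*x2^2)/2

x = np.array([10.0, 1.0])                     # starting point (my choice)
traj = [x.copy()]
for k in range(100):
    x = x - t * grad(x)
    traj.append(x.copy())
# per step: x1 is multiplied by 0.82, x2 by -0.8 (hence the zigzag in x2)
```

After 10 iterations the iterate is still more than distance 1 from the optimum; only after roughly 100 iterations does it reach high accuracy.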

Modications
multistep methods
• heavy-ball method: x^{(k)} = x^{(k-1)} - t∇f(x^{(k-1)}) + s(x^{(k-1)} - x^{(k-2)})
• conjugate gradient
• Nesterov-type methods (next lecture)

spectral gradient method (Barzilai-Borwein)
• use step size t_k = (s_k^T s_k)/(s_k^T y_k) with

    s_k = x^{(k-1)} - x^{(k-2)},    y_k = ∇f(x^{(k-1)}) - ∇f(x^{(k-2)})

• often implemented with non-monotone line search

Unconstrained quadratic minimization


minimize  f(x) = (1/2)x^T Ax - b^T x

with A ∈ S^n_{++}; equivalent to solving Ax = b

• the residual r = b - Ax is the negative gradient at x: r = -∇f(x)
• the conjugate gradient (CG) method was invented by Hestenes and Stiefel around 1951
• it is the most widely used iterative method for solving Ax = b with A ≻ 0
• it can be extended to non-quadratic unconstrained minimization

Krylov sequence
the CG algorithm is a recursive method for computing the Krylov sequence

    x^{(k)} = argmin_{x ∈ K_k} f(x),    k ≥ 0

where K_k is the Krylov subspace

    K_0 = {0},    K_k = span{b, Ab, . . . , A^{k-1}b}  for k ≥ 1

• A^{-1}b ∈ K_n, therefore x^{(n)} = A^{-1}b
• there is a simple two-term recurrence

    x^{(k+1)} = x^{(k)} + a_k r_k + b_k (x^{(k)} - x^{(k-1)})
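The two-term recurrence is usually implemented in the standard Hestenes-Stiefel form; a compact sketch, tested on a random well-conditioned SPD matrix of my own:

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=None):
    """Conjugate gradient method for Ax = b with A symmetric positive definite."""
    x = np.zeros(len(b))
    r = b.copy()                         # residual r = b - Ax = -grad f(x)
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p        # new search direction, A-conjugate to the old
        rr = rr_new
    return x

rng = np.random.default_rng(3)
n = 20
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)              # well-conditioned SPD test matrix
b = rng.standard_normal(n)
x = cg(A, b)
```

Each iteration costs one matrix-vector product plus O(n) vector work, which is what makes CG attractive for large sparse systems.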

Conjugate gradient method as iterative method


in exact arithmetic
• CG was originally proposed as a direct (non-iterative) method
• in theory, it converges in at most n steps

in practice
• due to rounding errors, it can take more than n steps (or fail)
• CG is now used as an iterative method
• with luck (a good spectrum of A), it gives a good approximation in far fewer than n steps

Applications in optimization
nonlinear conjugate gradient methods
• extend the linear CG method to non-quadratic functions
• local convergence similar to linear CG
• limited global convergence theory

inexact and truncated Newton methods
• use the conjugate gradient method to compute an (approximate) Newton step
• less reliable than exact Newton methods, but can handle very large problems

Fletcher-Reeves CG algorithm
the CG algorithm modified to minimize a non-quadratic convex f

given x^{(0)}
for k = 1, 2, . . .
1. return x^{(k-1)} if ‖∇f(x^{(k-1)})‖_2 is sufficiently small
2. if k = 1, p_1 = -∇f(x^{(0)}); else p_k = -∇f(x^{(k-1)}) + β_k p_{k-1} where

    β_k = ‖∇f(x^{(k-1)})‖_2² / ‖∇f(x^{(k-2)})‖_2²

3. update x^{(k)} = x^{(k-1)} + α_k p_k where α_k = argmin_t f(x^{(k-1)} + t p_k)
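A sketch of the algorithm; on a quadratic f(x) = x^T Ax/2 - b^T x the exact line-search step along p is t = -∇f(x)^T p/(p^T Ap), and Fletcher-Reeves then reproduces linear CG, converging in at most n steps (the test problem is my own):

```python
import numpy as np

def fletcher_reeves(grad, line_search, x0, tol=1e-8, max_iter=100):
    """Fletcher-Reeves CG with beta_k = ||g_k||^2 / ||g_{k-1}||^2."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                               # first iteration is a gradient step
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        t = line_search(x, p)
        x = x + t * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x

# quadratic test problem f(x) = x^T A x / 2 - b^T x with exact line search
A = np.diag([1.0, 4.0, 9.0, 16.0])
b = np.ones(4)
grad = lambda x: A @ x - b
exact_step = lambda x, p: -(grad(x) @ p) / (p @ (A @ p))
x = fletcher_reeves(grad, exact_step, np.zeros(4))
```

For non-quadratic f, the exact line search would be replaced by an inexact one (and periodic restarts added, as noted on the next slide).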

some observations
• the first iteration is a gradient step; practical implementations also restart the algorithm by taking a gradient step, for example, every n iterations
• the update is a gradient step with a momentum term:

    x^{(k)} = x^{(k-1)} - α_k ∇f(x^{(k-1)}) + γ_k (x^{(k-1)} - x^{(k-2)})

• with exact line search, the method reduces to linear CG for quadratic f

line search
• exact line search in step 3 implies ∇f(x^{(k)})^T p_k = 0
• therefore p_k is a descent direction at x^{(k-1)}: from step 2 and exact line search in the previous iteration,

    ∇f(x^{(k-1)})^T p_k = -‖∇f(x^{(k-1)})‖_2² < 0

Variations
Polak-Ribière: in step 2, compute β_k from

    β_k = ∇f(x^{(k-1)})^T (∇f(x^{(k-1)}) - ∇f(x^{(k-2)})) / ‖∇f(x^{(k-2)})‖_2²

Hestenes-Stiefel:

    β_k = ∇f(x^{(k-1)})^T (∇f(x^{(k-1)}) - ∇f(x^{(k-2)})) / (p_{k-1}^T (∇f(x^{(k-1)}) - ∇f(x^{(k-2)})))

the formulas are equivalent for quadratic f and exact line search

Interpretation as restarted BFGS method


the BFGS inverse update with H_{k-1} = I:

    H_k^{-1} = I + (1 + (y^T y)/(y^T s)) (ss^T)/(y^T s) - (ys^T + sy^T)/(y^T s)

where y = ∇f(x^{(k)}) - ∇f(x^{(k-1)}), s = x^{(k)} - x^{(k-1)}

• ∇f(x^{(k)})^T s = 0 if x^{(k)} is determined by exact line search
• the quasi-Newton step in iteration k is then

    -H_k^{-1} ∇f(x^{(k)}) = -∇f(x^{(k)}) + ((y^T ∇f(x^{(k)}))/(y^T s)) s

• this is the Hestenes-Stiefel update
• nonlinear CG can therefore be interpreted as L-BFGS with m = 1


Subgradient method
to minimize a nondifferentiable convex function f: choose x^{(0)} and repeat

    x^{(k)} = x^{(k-1)} - t_k g^{(k-1)},    k = 1, 2, . . .

where g^{(k-1)} is any subgradient of f at x^{(k-1)}

step size rules
• fixed step: t_k constant
• fixed length: t_k ‖g^{(k-1)}‖_2 constant (i.e., ‖x^{(k)} - x^{(k-1)}‖_2 constant)
• diminishing: t_k → 0, Σ_{k=1}^∞ t_k = ∞
• optimal step size when f* is known: t_k = (f(x^{(k-1)}) - f*)/‖g^{(k-1)}‖_2²

Convergence results
assumptions
• f has finite optimal value f* and a minimizer x*, with ‖x^{(0)} - x*‖_2 ≤ R
• f is convex and Lipschitz continuous with constant G > 0:

    |f(x) - f(y)| ≤ G ‖x - y‖_2    for all x, y

results (f_best^{(k)} = min_{i≤k} f(x^{(i)}) is the best value up to iteration k)
• convergence requires a diminishing step size: t_k → 0, Σ_{k=1}^∞ t_k = ∞
• with a proper step size, one can derive the bound

    f_best^{(k)} - f* ≤ GR/√k

• the number of iterations to reach f_best^{(k)} - f* ≤ ε is O(1/ε²)

Example: 1-norm minimization


minimize ‖Ax - b‖_1    (A ∈ R^{500×100}, b ∈ R^{500})

a subgradient is given by g = A^T sign(Ax - b)

[figure: f(x^{(k)}) - f* versus k for the fixed step length t_k = s/‖g^{(k-1)}‖_2 with s = 0.1, 0.01, 0.001; the three choices trade off speed of progress against the accuracy of the final plateau]
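A scaled-down version of this experiment (smaller random instance and seed of my own choosing), using the fixed step length rule with s = 0.01 and tracking the best value found, since the subgradient method is not a descent method:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 10                            # scaled-down instance
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: np.linalg.norm(A @ x - b, 1)
subgrad = lambda x: A.T @ np.sign(A @ x - b)

x = np.zeros(n)
f_best = f(x)
s = 0.01                                 # fixed step length
for k in range(3000):
    g = subgrad(x)
    x = x - (s / np.linalg.norm(g)) * g  # step size t_k = s / ||g||_2
    f_best = min(f_best, f(x))
```

As on the slide, f_best decreases quickly at first and then stalls at a level determined by the chosen step length s.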

diminishing step size: t_k = 0.01/√k and t_k = 0.01/k

[figure: f_best^{(k)} - f* versus k over 5000 iterations for the two diminishing step size rules]

Finding a point in the intersection of convex sets


to find a point x ∈ C = C_1 ∩ · · · ∩ C_m (m closed convex sets), minimize

    f(x) = max{dist(x, C_1), . . . , dist(x, C_m)}

where dist(x, C_j) = inf_{z ∈ C_j} ‖x - z‖_2 = ‖x - P_{C_j}(x)‖_2 (P_{C_j} is Euclidean projection on C_j)

• dist(x, C_j) is a convex function if C_j is convex
• f* = 0 if the intersection is nonempty
• to find a subgradient of f, we need a subgradient of the distance to the farthest set C_j

subgradient of the distance to a closed convex set S: for x ∉ S, S is contained in the halfspace

    H = {z | (x - P_S(x))^T (z - P_S(x)) ≤ 0}

[figure: the set S, the point x, its projection P_S(x), and the separating halfspace H]

therefore

    dist(y, S) ≥ (x - P_S(x))^T (y - P_S(x)) / ‖x - P_S(x)‖_2

(for y ∉ H, the r.h.s. is the distance to H; for y ∈ H, the r.h.s. is nonpositive)

hence,

    dist(y, S) ≥ (x - P_S(x))^T (y - x) / ‖x - P_S(x)‖_2 + ‖x - P_S(x)‖_2

conclusion: (x - P_S(x))/dist(x, S) is a subgradient of dist(·, S) at x ∉ S

the subgradient method with optimal step size applied to

    minimize  f(x) = max{dist(x, C_1), . . . , dist(x, C_m)}

if C_j is the farthest set at iteration k (i.e., dist(x^{(k-1)}, C_j) = f(x^{(k-1)})):

    x^{(k)} = x^{(k-1)} - (f(x^{(k-1)})/dist(x^{(k-1)}, C_j)) (x^{(k-1)} - P_{C_j}(x^{(k-1)}))
            = P_{C_j}(x^{(k-1)})

• a version of the famous alternating projections algorithm
• at each step, project the current point onto the farthest set
• for m = 2 sets, the projections alternate onto one set, then the other
• convergence: dist(x^{(k)}, C) → 0 as k → ∞

Alternating projections
first few iterations:

[figure: iterates x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}, . . . projected alternately onto C_1 and C_2]

. . . x^{(k)} eventually converges to a point x* ∈ C_1 ∩ C_2
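A two-set sketch in R², with sets of my own choosing: C_1 the unit disk and C_2 the halfplane x_1 ≥ 0.5; alternating projections converge to a point in the intersection (here the boundary point (0.5, √0.75)).

```python
import numpy as np

def proj_disk(x):                 # projection onto C1 = {x : ||x||_2 <= 1}
    nx = np.linalg.norm(x)
    return x if nx <= 1 else x / nx

def proj_halfplane(x):            # projection onto C2 = {x : x1 >= 0.5}
    return np.array([max(x[0], 0.5), x[1]])

x = np.array([-2.0, 3.0])         # starting point outside both sets
for _ in range(100):
    x = proj_disk(proj_halfplane(x))
```

Each cycle shortens the distance to the intersection, matching the convergence statement dist(x^{(k)}, C) → 0 above.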

Example: Positive semidenite matrix completion


some entries of X ∈ S^n are fixed; find values for the others so that X ∈ C_1 ∩ C_2, with C_1 = S^n_+

• projection onto C_1 by eigenvalue decomposition and truncation:

    P_{C_1}(X) = Σ_{i=1}^n max{0, λ_i} q_i q_i^T    if X = Σ_{i=1}^n λ_i q_i q_i^T

• C_2 is the (affine) set of matrices in S^n with the specified fixed entries
• projection of X onto C_2 by re-setting the specified entries to their fixed values

example: a 100 × 100 matrix missing about 71% of its entries; initialize X^{(0)} with the unknown entries set to 0

[figure: ‖X^{(k+1)} - X^{(k)}‖_F versus k; the successive difference decreases from about 10 to below 10^{-9} in roughly 50 iterations]

Summary: subgradient method

• often very slow
• no good stopping criterion
• theoretical complexity: O(1/ε²) iterations to find an ε-suboptimal point


Complexity bound for rst-order methods


function class: convex with finite minimum; gradient Lipschitz continuous:

    ∃ L > 0:  ‖∇f(x) - ∇f(y)‖_2 ≤ L ‖x - y‖_2    for all x, y

problem: find x with f(x) - f* ≤ ε

algorithm class: any iterative method that selects x^{(k)} in

    x^{(0)} + span{∇f(x^{(0)}), ∇f(x^{(1)}), . . . , ∇f(x^{(k-1)})}

bound: no first-order method in this class can have a worst-case complexity better than O(1/√ε) iterations

• the gradient method is not optimal (its complexity is O(1/ε); see next lecture)

Optimality of the subgradient method


function class: convex with finite minimum, and Lipschitz continuous

problem: find x with f(x) - f* ≤ ε

algorithm class: any iterative method that
• obtains function information via an oracle (black box): for a given x, the oracle returns f(x) and a subgradient at x
• chooses the iterate x^{(k)} in the set x^{(0)} + span{g^{(0)}, g^{(1)}, . . . , g^{(k-1)}}

result: no method in this class can have a worst-case complexity better than O(1/ε²)

hence, the subgradient method is optimal (see the convergence results for the subgradient method)

References
algorithms for unconstrained minimization
• B.T. Polyak, Introduction to Optimization (1987)
• J. Nocedal, S.J. Wright, Numerical Optimization (2006)
• D. Bertsekas, Nonlinear Programming (1999)

subgradient method
• S. Boyd, lecture notes and slides for EE364b, Convex Optimization II
• N.Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998)

fundamental complexity bounds
• Yu. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (2004)
