
Convex Optimization – Boyd & Vandenberghe

1. Introduction

• mathematical optimization
• least-squares and linear programming
• convex optimization
• example
• course goals and topics
• nonlinear optimization
• brief history of convex optimization

1–1
Mathematical optimization

(mathematical) optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ b_i,  i = 1, ..., m

• x = (x_1, ..., x_n): optimization variables
• f_0 : R^n → R: objective function
• f_i : R^n → R, i = 1, ..., m: constraint functions

optimal solution x* has smallest value of f_0 among all vectors that
satisfy the constraints

Introduction 1–2
Examples

portfolio optimization
• variables: amounts invested in different assets
• constraints: budget, max./min. investment per asset, minimum return
• objective: overall risk or return variance

device sizing in electronic circuits
• variables: device widths and lengths
• constraints: manufacturing limits, timing requirements, maximum area
• objective: power consumption

data fitting
• variables: model parameters
• constraints: prior information, parameter limits
• objective: measure of misfit or prediction error

Introduction 1–3
Solving optimization problems

general optimization problem
• very difficult to solve
• methods involve some compromise, e.g., very long computation time, or
  not always finding the solution

exceptions: certain problem classes can be solved efficiently and reliably
• least-squares problems
• linear programming problems
• convex optimization problems

Introduction 1–4
Least-squares

    minimize  ‖Ax − b‖₂²

solving least-squares problems
• analytical solution: x* = (AᵀA)⁻¹Aᵀb
• reliable and efficient algorithms and software
• computation time proportional to n²k (A ∈ R^{k×n}); less if structured
• a mature technology

using least-squares
• least-squares problems are easy to recognize
• a few standard techniques increase flexibility (e.g., including weights,
  adding regularization terms)

Introduction 1–5
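The analytical solution above is easy to check numerically; a minimal sketch with NumPy and made-up data (the matrix `A` and vector `b` below are hypothetical):

```python
import numpy as np

# Hypothetical overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Analytical solution x* = (A^T A)^{-1} A^T b (fine for small, well-conditioned A).
x_analytic = np.linalg.solve(A.T @ A, A.T @ b)

# Factorization-based solver, preferred in practice.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_analytic, x_lstsq))  # → True
```

Forming AᵀA squares the condition number of A, which is why library routines based on QR or SVD factorizations are preferred for anything ill-conditioned.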
Linear programming

    minimize    cᵀx
    subject to  a_iᵀx ≤ b_i,  i = 1, ..., m

solving linear programs
• no analytical formula for solution
• reliable and efficient algorithms and software
• computation time proportional to n²m if m ≥ n; less with structure
• a mature technology

using linear programming
• not as easy to recognize as least-squares problems
• a few standard tricks used to convert problems into linear programs
  (e.g., problems involving ℓ₁- or ℓ∞-norms, piecewise-linear functions)

Introduction 1–6
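Although there is no analytical formula, off-the-shelf solvers handle LPs reliably; a small sketch, assuming SciPy is available (the data `c`, `A_ub`, `b_ub` below are made up):

```python
import numpy as np
from scipy.optimize import linprog

# Tiny LP: minimize c^T x subject to a_i^T x <= b_i, x >= 0 (illustrative data).
c = np.array([-1.0, -2.0])           # i.e., maximize x1 + 2 x2
A_ub = np.array([[1.0, 1.0],
                 [1.0, 0.0],
                 [0.0, 1.0]])
b_ub = np.array([4.0, 3.0, 3.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)  # optimal point (1, 3), objective value -7
```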
Convex optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ b_i,  i = 1, ..., m

• objective and constraint functions are convex:

      f_i(αx + βy) ≤ αf_i(x) + βf_i(y)

  if α + β = 1, α ≥ 0, β ≥ 0

• includes least-squares problems and linear programs as special cases

Introduction 1–7
solving convex optimization problems
• no analytical solution
• reliable and efficient algorithms
• computation time (roughly) proportional to max{n³, n²m, F}, where F
  is cost of evaluating the f_i's and their first and second derivatives
• almost a technology

using convex optimization
• often difficult to recognize
• many tricks for transforming problems into convex form
• surprisingly many problems can be solved via convex optimization

Introduction 1–8
Example

m lamps illuminating n (small, flat) patches

[figure: lamp j with power p_j illuminating patch k, at distance r_kj and angle θ_kj]

intensity I_k at patch k depends linearly on lamp powers p_j:

    I_k = Σ_{j=1}^m a_kj p_j,    a_kj = r_kj⁻² max{cos θ_kj, 0}

problem: achieve desired illumination I_des with bounded lamp powers

    minimize    max_{k=1,...,n} |log I_k − log I_des|
    subject to  0 ≤ p_j ≤ p_max,  j = 1, ..., m

Introduction 1–9
how to solve?

1. use uniform power: p_j = p, vary p

2. use least-squares:

       minimize  Σ_{k=1}^n (I_k − I_des)²

   round p_j if p_j > p_max or p_j < 0

3. use weighted least-squares:

       minimize  Σ_{k=1}^n (I_k − I_des)² + Σ_{j=1}^m w_j (p_j − p_max/2)²

   iteratively adjust weights w_j until 0 ≤ p_j ≤ p_max

4. use linear programming:

       minimize    max_{k=1,...,n} |I_k − I_des|
       subject to  0 ≤ p_j ≤ p_max,  j = 1, ..., m

   which can be solved via linear programming

of course these are approximate (suboptimal) solutions

Introduction 1–10
5. use convex optimization: problem is equivalent to

       minimize    f_0(p) = max_{k=1,...,n} h(I_k / I_des)
       subject to  0 ≤ p_j ≤ p_max,  j = 1, ..., m

   with h(u) = max{u, 1/u}

   [figure: plot of h(u) over 0 ≤ u ≤ 4]

f_0 is convex because maximum of convex functions is convex

exact solution obtained with effort ≈ modest factor × least-squares effort

Introduction 1–11
additional constraints: does adding 1 or 2 below complicate the problem?

1. no more than half of total power is in any 10 lamps
2. no more than half of the lamps are on (p_j > 0)

• answer: with (1), still easy to solve; with (2), extremely difficult
• moral: (untrained) intuition doesn't always work; without the proper
  background very easy problems can appear quite similar to very difficult
  problems

Introduction 1–12
Course goals and topics

goals
1. recognize/formulate problems (such as the illumination problem) as
   convex optimization problems
2. develop code for problems of moderate size (1000 lamps, 5000 patches)
3. characterize optimal solution (optimal power distribution), give limits of
   performance, etc.

topics
1. convex sets, functions, optimization problems
2. examples and applications
3. algorithms

Introduction 1–13
Nonlinear optimization

traditional techniques for general nonconvex problems involve compromises

local optimization methods (nonlinear programming)
• find a point that minimizes f_0 among feasible points near it
• fast, can handle large problems
• require initial guess
• provide no information about distance to (global) optimum

global optimization methods
• find the (global) solution
• worst-case complexity grows exponentially with problem size

these algorithms are often based on solving convex subproblems

Introduction 1–14
Brief history of convex optimization

theory (convex analysis): ca. 1900–1970

algorithms
• 1947: simplex algorithm for linear programming (Dantzig)
• 1960s: early interior-point methods (Fiacco & McCormick, Dikin, ...)
• 1970s: ellipsoid method and other subgradient methods
• 1980s: polynomial-time interior-point methods for linear programming
  (Karmarkar 1984)
• late 1980s–now: polynomial-time interior-point methods for nonlinear
  convex optimization (Nesterov & Nemirovski 1994)

applications
• before 1990: mostly in operations research; few in engineering
• since 1990: many new applications in engineering (control, signal
  processing, communications, circuit design, ...); new problem classes
  (semidefinite and second-order cone programming, robust optimization)

Introduction 1–15
Convex Optimization – Boyd & Vandenberghe

2. Convex sets

• affine and convex sets
• some important examples
• operations that preserve convexity
• generalized inequalities
• separating and supporting hyperplanes
• dual cones and generalized inequalities

2–1
Affine set

line through x_1, x_2: all points

    x = θx_1 + (1 − θ)x_2    (θ ∈ R)

[figure: line through x_1 and x_2, with points marked at θ = 1.2, 1, 0.6, 0, −0.2]

affine set: contains the line through any two distinct points in the set

example: solution set of linear equations {x | Ax = b}

(conversely, every affine set can be expressed as solution set of system of
linear equations)

Convex sets 2–2
Convex set

line segment between x_1 and x_2: all points

    x = θx_1 + (1 − θ)x_2

with 0 ≤ θ ≤ 1

convex set: contains line segment between any two points in the set

    x_1, x_2 ∈ C,  0 ≤ θ ≤ 1  ⟹  θx_1 + (1 − θ)x_2 ∈ C

examples (one convex, two nonconvex sets)

[figure: three example sets]

Convex sets 2–3
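The definition can be spot-checked numerically by sampling θ along a segment; a small sketch with two hypothetical indicator functions (a disc, which is convex, and an annulus, which is not):

```python
import numpy as np

def segment_in_set(ind, x1, x2, n=101):
    """Spot-check the convexity definition: is every sampled point
    theta*x1 + (1-theta)*x2 on the segment inside the set?"""
    thetas = np.linspace(0.0, 1.0, n)
    return all(bool(ind(t * x1 + (1 - t) * x2)) for t in thetas)

disc = lambda x: np.linalg.norm(x) <= 1.0             # convex set
annulus = lambda x: 0.5 <= np.linalg.norm(x) <= 1.0   # nonconvex set

x1, x2 = np.array([0.9, 0.0]), np.array([-0.9, 0.0])
print(segment_in_set(disc, x1, x2))     # → True
print(segment_in_set(annulus, x1, x2))  # → False (segment passes through the hole)
```

A passing check is of course only evidence, not a proof of convexity; a failing check is a genuine counterexample.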
Convex combination and convex hull

convex combination of x_1, ..., x_k: any point x of the form

    x = θ_1 x_1 + θ_2 x_2 + ··· + θ_k x_k

with θ_1 + ··· + θ_k = 1, θ_i ≥ 0

convex hull conv S: set of all convex combinations of points in S

Convex sets 2–4
Convex cone

conic (nonnegative) combination of x_1 and x_2: any point of the form

    x = θ_1 x_1 + θ_2 x_2

with θ_1 ≥ 0, θ_2 ≥ 0

[figure: cone generated by x_1 and x_2, with apex at 0]

convex cone: set that contains all conic combinations of points in the set

Convex sets 2–5
Hyperplanes and halfspaces

hyperplane: set of the form {x | aᵀx = b} (a ≠ 0)

[figure: hyperplane aᵀx = b with normal vector a, through a point x_0]

halfspace: set of the form {x | aᵀx ≤ b} (a ≠ 0)

[figure: halfspaces aᵀx ≥ b and aᵀx ≤ b on either side of the hyperplane]

• a is the normal vector
• hyperplanes are affine and convex; halfspaces are convex

Convex sets 2–6
Euclidean balls and ellipsoids

(Euclidean) ball with center x_c and radius r:

    B(x_c, r) = {x | ‖x − x_c‖₂ ≤ r} = {x_c + ru | ‖u‖₂ ≤ 1}

ellipsoid: set of the form

    {x | (x − x_c)ᵀ P⁻¹ (x − x_c) ≤ 1}

with P ∈ S^n_{++} (i.e., P symmetric positive definite)

[figure: ellipsoid with center x_c]

other representation: {x_c + Au | ‖u‖₂ ≤ 1} with A square and nonsingular

Convex sets 2–7
Norm balls and norm cones

norm: a function ‖·‖ that satisfies
• ‖x‖ ≥ 0; ‖x‖ = 0 if and only if x = 0
• ‖tx‖ = |t| ‖x‖ for t ∈ R
• ‖x + y‖ ≤ ‖x‖ + ‖y‖

notation: ‖·‖ is general (unspecified) norm; ‖·‖_symb is particular norm

norm ball with center x_c and radius r: {x | ‖x − x_c‖ ≤ r}

norm cone: {(x, t) | ‖x‖ ≤ t}

Euclidean norm cone is called second-order cone

[figure: second-order cone {(x_1, x_2, t) | ‖(x_1, x_2)‖₂ ≤ t} in R³]

norm balls and cones are convex

Convex sets 2–8
Polyhedra

solution set of finitely many linear inequalities and equalities

    Ax ⪯ b,    Cx = d

(A ∈ R^{m×n}, C ∈ R^{p×n}, ⪯ is componentwise inequality)

[figure: polyhedron P bounded by hyperplanes with normals a_1, ..., a_5]

polyhedron is intersection of finite number of halfspaces and hyperplanes

Convex sets 2–9
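Membership in a polyhedron is just a componentwise check of Ax ⪯ b and Cx = d; a minimal sketch (the unit-box data below are hypothetical):

```python
import numpy as np

def in_polyhedron(x, A, b, C=None, d=None, tol=1e-9):
    """Check x in {x | Ax <= b (componentwise), Cx = d}."""
    ok = bool(np.all(A @ x <= b + tol))
    if C is not None:
        ok = ok and bool(np.allclose(C @ x, d, atol=tol))
    return ok

# Hypothetical polyhedron: the unit box 0 <= x_i <= 1 in R^2,
# written as the four halfspaces x_i <= 1 and -x_i <= 0.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])

print(in_polyhedron(np.array([0.5, 0.25]), A, b))  # → True
print(in_polyhedron(np.array([1.5, 0.0]), A, b))   # → False
```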
Positive semidefinite cone

notation:
• S^n is set of symmetric n × n matrices
• S^n_+ = {X ∈ S^n | X ⪰ 0}: positive semidefinite n × n matrices

      X ∈ S^n_+  ⟺  zᵀXz ≥ 0 for all z

  S^n_+ is a convex cone
• S^n_{++} = {X ∈ S^n | X ≻ 0}: positive definite n × n matrices

example:

    [ x  y ]
    [ y  z ]  ∈ S²_+

[figure: boundary of S²_+ plotted in (x, y, z) space]

Convex sets 2–10
Operations that preserve convexity

practical methods for establishing convexity of a set C

1. apply definition

       x_1, x_2 ∈ C,  0 ≤ θ ≤ 1  ⟹  θx_1 + (1 − θ)x_2 ∈ C

2. show that C is obtained from simple convex sets (hyperplanes,
   halfspaces, norm balls, ...) by operations that preserve convexity
   • intersection
   • affine functions
   • perspective function
   • linear-fractional functions

Convex sets 2–11
Intersection

the intersection of (any number of) convex sets is convex

example:

    S = {x ∈ R^m | |p(t)| ≤ 1 for |t| ≤ π/3}

where p(t) = x_1 cos t + x_2 cos 2t + ··· + x_m cos mt

[figure, for m = 2: plot of p(t) over [0, 2π/3] and the set S in the (x_1, x_2)-plane]

Convex sets 2–12
Affine function

suppose f : R^n → R^m is affine (f(x) = Ax + b with A ∈ R^{m×n}, b ∈ R^m)

• the image of a convex set under f is convex

      S ⊆ R^n convex  ⟹  f(S) = {f(x) | x ∈ S} convex

• the inverse image f⁻¹(C) of a convex set under f is convex

      C ⊆ R^m convex  ⟹  f⁻¹(C) = {x ∈ R^n | f(x) ∈ C} convex

examples
• scaling, translation, projection
• solution set of linear matrix inequality {x | x_1 A_1 + ··· + x_m A_m ⪯ B}
  (with A_i, B ∈ S^p)
• hyperbolic cone {x | xᵀPx ≤ (cᵀx)², cᵀx ≥ 0} (with P ∈ S^n_+)

Convex sets 2–13
Perspective and linear-fractional function

perspective function P : R^{n+1} → R^n:

    P(x, t) = x/t,    dom P = {(x, t) | t > 0}

images and inverse images of convex sets under perspective are convex

linear-fractional function f : R^n → R^m:

    f(x) = (Ax + b)/(cᵀx + d),    dom f = {x | cᵀx + d > 0}

images and inverse images of convex sets under linear-fractional functions
are convex

Convex sets 2–14
example of a linear-fractional function

    f(x) = x / (x_1 + x_2 + 1)

[figure: a set C in the (x_1, x_2)-plane and its image f(C)]

Convex sets 2–15
Generalized inequalities

a convex cone K ⊆ R^n is a proper cone if
• K is closed (contains its boundary)
• K is solid (has nonempty interior)
• K is pointed (contains no line)

examples
• nonnegative orthant K = R^n_+ = {x ∈ R^n | x_i ≥ 0, i = 1, ..., n}
• positive semidefinite cone K = S^n_+
• nonnegative polynomials on [0, 1]:

      K = {x ∈ R^n | x_1 + x_2 t + x_3 t² + ··· + x_n t^{n−1} ≥ 0 for t ∈ [0, 1]}

Convex sets 2–16
generalized inequality defined by a proper cone K:

    x ⪯_K y  ⟺  y − x ∈ K,    x ≺_K y  ⟺  y − x ∈ int K

examples
• componentwise inequality (K = R^n_+)

      x ⪯_{R^n_+} y  ⟺  x_i ≤ y_i,  i = 1, ..., n

• matrix inequality (K = S^n_+)

      X ⪯_{S^n_+} Y  ⟺  Y − X positive semidefinite

these two types are so common that we drop the subscript in ⪯_K

properties: many properties of ⪯_K are similar to ≤ on R, e.g.,

    x ⪯_K y,  u ⪯_K v  ⟹  x + u ⪯_K y + v

Convex sets 2–17
Minimum and minimal elements

⪯_K is not in general a linear ordering: we can have x ⋠_K y and y ⋠_K x

x ∈ S is the minimum element of S with respect to ⪯_K if

    y ∈ S  ⟹  x ⪯_K y

x ∈ S is a minimal element of S with respect to ⪯_K if

    y ∈ S,  y ⪯_K x  ⟹  y = x

example (K = R²_+):
x_1 is the minimum element of S_1;
x_2 is a minimal element of S_2

[figure: sets S_1 and S_2 with points x_1 and x_2]

Convex sets 2–18
Separating hyperplane theorem

if C and D are disjoint convex sets, then there exists a ≠ 0, b such that

    aᵀx ≤ b for x ∈ C,    aᵀx ≥ b for x ∈ D

[figure: disjoint convex sets C and D separated by the hyperplane {x | aᵀx = b}]

the hyperplane {x | aᵀx = b} separates C and D

strict separation requires additional assumptions (e.g., C is closed, D is a
singleton)

Convex sets 2–19
Supporting hyperplane theorem

supporting hyperplane to set C at boundary point x_0:

    {x | aᵀx = aᵀx_0}

where a ≠ 0 and aᵀx ≤ aᵀx_0 for all x ∈ C

[figure: set C with a supporting hyperplane at x_0, normal a]

supporting hyperplane theorem: if C is convex, then there exists a
supporting hyperplane at every boundary point of C

Convex sets 2–20
Dual cones and generalized inequalities

dual cone of a cone K:

    K* = {y | yᵀx ≥ 0 for all x ∈ K}

examples
• K = R^n_+:  K* = R^n_+
• K = S^n_+:  K* = S^n_+
• K = {(x, t) | ‖x‖₂ ≤ t}:  K* = {(x, t) | ‖x‖₂ ≤ t}
• K = {(x, t) | ‖x‖₁ ≤ t}:  K* = {(x, t) | ‖x‖∞ ≤ t}

first three examples are self-dual cones

dual cones of proper cones are proper, hence define generalized inequalities:

    y ⪰_{K*} 0  ⟺  yᵀx ≥ 0 for all x ⪰_K 0

Convex sets 2–21
Minimum and minimal elements via dual inequalities

minimum element w.r.t. ⪯_K

x is minimum element of S iff for all λ ≻_{K*} 0, x is the unique minimizer
of λᵀz over S

[figure: set S with minimum element x]

minimal element w.r.t. ⪯_K

• if x minimizes λᵀz over S for some λ ≻_{K*} 0, then x is minimal

[figure: set S with minimal elements x_1, x_2 and directions λ_1, λ_2]

• if x is a minimal element of a convex set S, then there exists a nonzero
  λ ⪰_{K*} 0 such that x minimizes λᵀz over S

Convex sets 2–22
optimal production frontier

• different production methods use different amounts of resources x ∈ R^n
• production set P: resource vectors x for all possible production methods
• efficient (Pareto optimal) methods correspond to resource vectors x
  that are minimal w.r.t. R^n_+

example (n = 2): x_1, x_2, x_3 are efficient; x_4, x_5 are not

[figure: production set P in the (labor, fuel)-plane with points x_1, ..., x_5]

Convex sets 2–23
Convex Optimization – Boyd & Vandenberghe

3. Convex functions

• basic properties and examples
• operations that preserve convexity
• the conjugate function
• quasiconvex functions
• log-concave and log-convex functions
• convexity with respect to generalized inequalities

3–1
Definition

f : R^n → R is convex if dom f is a convex set and

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

for all x, y ∈ dom f, 0 ≤ θ ≤ 1

[figure: graph of a convex function lying below the chord between (x, f(x)) and (y, f(y))]

• f is concave if −f is convex
• f is strictly convex if dom f is convex and

      f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)

  for x, y ∈ dom f, x ≠ y, 0 < θ < 1

Convex functions 3–2
Examples on R

convex:
• affine: ax + b on R, for any a, b ∈ R
• exponential: e^{ax}, for any a ∈ R
• powers: x^α on R_{++}, for α ≥ 1 or α ≤ 0
• powers of absolute value: |x|^p on R, for p ≥ 1
• negative entropy: x log x on R_{++}

concave:
• affine: ax + b on R, for any a, b ∈ R
• powers: x^α on R_{++}, for 0 ≤ α ≤ 1
• logarithm: log x on R_{++}

Convex functions 3–3
Examples on R^n and R^{m×n}

affine functions are convex and concave; all norms are convex

examples on R^n
• affine function f(x) = aᵀx + b
• norms: ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p} for p ≥ 1; ‖x‖∞ = max_k |x_k|

examples on R^{m×n} (m × n matrices)
• affine function

      f(X) = tr(AᵀX) + b = Σ_{i=1}^m Σ_{j=1}^n A_ij X_ij + b

• spectral (maximum singular value) norm

      f(X) = ‖X‖₂ = σ_max(X) = (λ_max(XᵀX))^{1/2}

Convex functions 3–4
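The two expressions for the spectral norm on the last line can be checked against each other numerically (the matrix `X` below is made up):

```python
import numpy as np

X = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Spectral norm: largest singular value, equivalently sqrt of the
# largest eigenvalue of X^T X.
via_svd = np.linalg.norm(X, 2)
via_eig = np.sqrt(np.max(np.linalg.eigvalsh(X.T @ X)))

print(np.isclose(via_svd, via_eig))  # → True
```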
Restriction of a convex function to a line

f : R^n → R is convex if and only if the function g : R → R,

    g(t) = f(x + tv),    dom g = {t | x + tv ∈ dom f}

is convex (in t) for any x ∈ dom f, v ∈ R^n

can check convexity of f by checking convexity of functions of one variable

example. f : S^n → R with f(X) = log det X, dom f = S^n_{++}

    g(t) = log det(X + tV) = log det X + log det(I + tX^{−1/2} V X^{−1/2})
         = log det X + Σ_{i=1}^n log(1 + tλ_i)

where λ_i are the eigenvalues of X^{−1/2} V X^{−1/2}

g is concave in t (for any choice of X ≻ 0, V); hence f is concave

Convex functions 3–5
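The concavity of g(t) = log det(X + tV) can be spot-checked at a midpoint; a sketch with randomly generated (hypothetical) X ≻ 0 and symmetric V, with t kept small enough that X + tV stays positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)

def logdet(M):
    sign, val = np.linalg.slogdet(M)
    return val if sign > 0 else -np.inf

# Hypothetical X > 0 (shifted Gram matrix) and symmetric V.
B = rng.standard_normal((4, 4))
X = B @ B.T + 5 * np.eye(4)         # positive definite, min eigenvalue >= 5
V = rng.standard_normal((4, 4))
V = (V + V.T) / 2                    # symmetrize

g = lambda t: logdet(X + t * V)

# Midpoint concavity check: g((t1+t2)/2) >= (g(t1)+g(t2))/2.
t1, t2 = -0.3, 0.5
mid = g(0.5 * (t1 + t2))
chord = 0.5 * g(t1) + 0.5 * g(t2)
print(mid >= chord)  # → True
```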
Extended-value extension

extended-value extension f̃ of f is

    f̃(x) = f(x),  x ∈ dom f,    f̃(x) = ∞,  x ∉ dom f

often simplifies notation; for example, the condition

    0 ≤ θ ≤ 1  ⟹  f̃(θx + (1 − θ)y) ≤ θf̃(x) + (1 − θ)f̃(y)

(as an inequality in R ∪ {∞}), means the same as the two conditions
• dom f is convex
• for x, y ∈ dom f,

      0 ≤ θ ≤ 1  ⟹  f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

Convex functions 3–6
First-order condition

f is differentiable if dom f is open and the gradient

    ∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)

exists at each x ∈ dom f

1st-order condition: differentiable f with convex domain is convex iff

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)    for all x, y ∈ dom f

[figure: graph of f lying above its tangent line f(x) + ∇f(x)ᵀ(y − x) at (x, f(x))]

first-order approximation of f is global underestimator

Convex functions 3–7
Second-order conditions

f is twice differentiable if dom f is open and the Hessian ∇²f(x) ∈ S^n,

    ∇²f(x)_ij = ∂²f(x)/(∂x_i ∂x_j),    i, j = 1, ..., n,

exists at each x ∈ dom f

2nd-order conditions: for twice differentiable f with convex domain
• f is convex if and only if

      ∇²f(x) ⪰ 0    for all x ∈ dom f

• if ∇²f(x) ≻ 0 for all x ∈ dom f, then f is strictly convex

Convex functions 3–8
Examples

quadratic function: f(x) = (1/2)xᵀPx + qᵀx + r (with P ∈ S^n)

    ∇f(x) = Px + q,    ∇²f(x) = P

convex if P ⪰ 0

least-squares objective: f(x) = ‖Ax − b‖₂²

    ∇f(x) = 2Aᵀ(Ax − b),    ∇²f(x) = 2AᵀA

convex (for any A)

quadratic-over-linear: f(x, y) = x²/y

    ∇²f(x, y) = (2/y³) (y, −x)(y, −x)ᵀ ⪰ 0

convex for y > 0

[figure: graph of f(x, y) = x²/y]

Convex functions 3–9
log-sum-exp: f(x) = log Σ_{k=1}^n exp x_k is convex

    ∇²f(x) = (1/(1ᵀz)) diag(z) − (1/(1ᵀz)²) zzᵀ    (z_k = exp x_k)

to show ∇²f(x) ⪰ 0, we must verify that vᵀ∇²f(x)v ≥ 0 for all v:

    vᵀ∇²f(x)v = [(Σ_k z_k v_k²)(Σ_k z_k) − (Σ_k v_k z_k)²] / (Σ_k z_k)² ≥ 0

since (Σ_k v_k z_k)² ≤ (Σ_k z_k v_k²)(Σ_k z_k) (from Cauchy–Schwarz inequality)

geometric mean: f(x) = (Π_{k=1}^n x_k)^{1/n} on R^n_{++} is concave

(similar proof as for log-sum-exp)

Convex functions 3–10
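Evaluating log-sum-exp naively overflows for large x_k; the standard fix is to shift by max_k x_k before exponentiating, which leaves the value unchanged. A minimal sketch:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum_k exp(x_k)): shift by the max so the
    largest exponentiated term is exp(0) = 1."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
print(logsumexp(x))  # ≈ 1000 + log 2 ≈ 1000.6931 (naive evaluation overflows)
```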
Epigraph and sublevel set

α-sublevel set of f : R^n → R:

    C_α = {x ∈ dom f | f(x) ≤ α}

sublevel sets of convex functions are convex (converse is false)

epigraph of f : R^n → R:

    epi f = {(x, t) ∈ R^{n+1} | x ∈ dom f, f(x) ≤ t}

[figure: epigraph of a function f]

f is convex if and only if epi f is a convex set

Convex functions 3–11
Jensen's inequality

basic inequality: if f is convex, then for 0 ≤ θ ≤ 1,

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

extension: if f is convex, then

    f(E z) ≤ E f(z)

for any random variable z

basic inequality is special case with discrete distribution

    prob(z = x) = θ,    prob(z = y) = 1 − θ

Convex functions 3–12
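The extension f(E z) ≤ E f(z) is easy to observe by Monte Carlo; a sketch using f = exp (convex) and a standard normal z:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)

f = np.exp
lhs = f(z.mean())    # f(E z): approx exp(0) = 1
rhs = f(z).mean()    # E f(z): approx exp(1/2) ~ 1.65 for z ~ N(0, 1)

print(lhs < rhs)  # → True
```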
Operations that preserve convexity

practical methods for establishing convexity of a function

1. verify definition (often simplified by restricting to a line)
2. for twice differentiable functions, show ∇²f(x) ⪰ 0
3. show that f is obtained from simple convex functions by operations
   that preserve convexity
   • nonnegative weighted sum
   • composition with affine function
   • pointwise maximum and supremum
   • composition
   • minimization
   • perspective

Convex functions 3–13
Positive weighted sum & composition with affine function

nonnegative multiple: αf is convex if f is convex, α ≥ 0

sum: f_1 + f_2 convex if f_1, f_2 convex (extends to infinite sums, integrals)

composition with affine function: f(Ax + b) is convex if f is convex

examples
• log barrier for linear inequalities

      f(x) = −Σ_{i=1}^m log(b_i − a_iᵀx),    dom f = {x | a_iᵀx < b_i, i = 1, ..., m}

• (any) norm of affine function: f(x) = ‖Ax + b‖

Convex functions 3–14
Pointwise maximum

if f_1, ..., f_m are convex, then f(x) = max{f_1(x), ..., f_m(x)} is convex

examples
• piecewise-linear function: f(x) = max_{i=1,...,m} (a_iᵀx + b_i) is convex
• sum of r largest components of x ∈ R^n:

      f(x) = x_[1] + x_[2] + ··· + x_[r]

  is convex (x_[i] is ith largest component of x)

  proof:

      f(x) = max{x_{i_1} + x_{i_2} + ··· + x_{i_r} | 1 ≤ i_1 < i_2 < ··· < i_r ≤ n}

Convex functions 3–15
Pointwise supremum

if f(x, y) is convex in x for each y ∈ A, then

    g(x) = sup_{y ∈ A} f(x, y)

is convex

examples
• support function of a set C: S_C(x) = sup_{y ∈ C} yᵀx is convex
• distance to farthest point in a set C:

      f(x) = sup_{y ∈ C} ‖x − y‖

• maximum eigenvalue of symmetric matrix: for X ∈ S^n,

      λ_max(X) = sup_{‖y‖₂ = 1} yᵀXy

Convex functions 3–16
Composition with scalar functions

composition of g : R^n → R and h : R → R:

    f(x) = h(g(x))

f is convex if
• g convex, h convex, h̃ nondecreasing
• g concave, h convex, h̃ nonincreasing

proof (for n = 1, differentiable g, h)

    f″(x) = h″(g(x)) g′(x)² + h′(g(x)) g″(x)

note: monotonicity must hold for extended-value extension h̃

examples
• exp g(x) is convex if g is convex
• 1/g(x) is convex if g is concave and positive

Convex functions 3–17
Vector composition

composition of g : R^n → R^k and h : R^k → R:

    f(x) = h(g(x)) = h(g_1(x), g_2(x), ..., g_k(x))

f is convex if
• g_i convex, h convex, h̃ nondecreasing in each argument
• g_i concave, h convex, h̃ nonincreasing in each argument

proof (for n = 1, differentiable g, h)

    f″(x) = g′(x)ᵀ ∇²h(g(x)) g′(x) + ∇h(g(x))ᵀ g″(x)

examples
• Σ_{i=1}^m log g_i(x) is concave if g_i are concave and positive
• log Σ_{i=1}^m exp g_i(x) is convex if g_i are convex

Convex functions 3–18
Minimization

if f(x, y) is convex in (x, y) and C is a convex set, then

    g(x) = inf_{y ∈ C} f(x, y)

is convex

examples
• f(x, y) = xᵀAx + 2xᵀBy + yᵀCy with

      [ A   B ]
      [ Bᵀ  C ]  ⪰ 0,    C ≻ 0

  minimizing over y gives g(x) = inf_y f(x, y) = xᵀ(A − BC⁻¹Bᵀ)x

  g is convex, hence Schur complement A − BC⁻¹Bᵀ ⪰ 0

• distance to a set: dist(x, S) = inf_{y ∈ S} ‖x − y‖ is convex if S is convex

Convex functions 3–19
Perspective

the perspective of a function f : R^n → R is the function g : R^n × R → R,

    g(x, t) = t f(x/t),    dom g = {(x, t) | x/t ∈ dom f, t > 0}

g is convex if f is convex

examples
• f(x) = xᵀx is convex; hence g(x, t) = xᵀx/t is convex for t > 0
• negative logarithm f(x) = −log x is convex; hence relative entropy
  g(x, t) = t log t − t log x is convex on R²_{++}
• if f is convex, then

      g(x) = (cᵀx + d) f((Ax + b)/(cᵀx + d))

  is convex on {x | cᵀx + d > 0, (Ax + b)/(cᵀx + d) ∈ dom f}

Convex functions 3–20
The conjugate function

the conjugate of a function f is

    f*(y) = sup_{x ∈ dom f} (yᵀx − f(x))

[figure: graph of f(x) and the line xy; −f*(y) is the intercept of the tangent with slope y]

• f* is convex (even if f is not)
• will be useful in chapter 5

Convex functions 3–21
examples

• negative logarithm f(x) = −log x

      f*(y) = sup_{x>0} (xy + log x)
            = −1 − log(−y)  if y < 0,    ∞ otherwise

• strictly convex quadratic f(x) = (1/2)xᵀQx with Q ∈ S^n_{++}

      f*(y) = sup_x (yᵀx − (1/2)xᵀQx)
            = (1/2) yᵀQ⁻¹y

Convex functions 3–22
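The closed-form conjugate of the quadratic can be checked against a brute-force sup; a sketch for the scalar case f(x) = (1/2)qx², whose conjugate is y²/(2q) (the values of `q` and `y` below are arbitrary):

```python
import numpy as np

# Scalar strictly convex quadratic f(x) = (1/2) q x^2 with q > 0.
q, y = 2.0, 3.0

# Approximate f*(y) = sup_x (y*x - f(x)) by a dense grid search;
# the true maximizer x = y/q = 1.5 lies well inside the grid.
xs = np.linspace(-10, 10, 200_001)
numeric = np.max(y * xs - 0.5 * q * xs**2)

closed_form = y**2 / (2 * q)          # = 2.25
print(np.isclose(numeric, closed_form, atol=1e-6))  # → True
```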
Quasiconvex functions

f : R^n → R is quasiconvex if dom f is convex and the sublevel sets

    S_α = {x ∈ dom f | f(x) ≤ α}

are convex for all α

[figure: quasiconvex function on R, with sublevel sets [a, b] and (−∞, c] for two values of α]

• f is quasiconcave if −f is quasiconvex
• f is quasilinear if it is quasiconvex and quasiconcave

Convex functions 3–23
Examples

• √|x| is quasiconvex on R
• ceil(x) = inf{z ∈ Z | z ≥ x} is quasilinear
• log x is quasilinear on R_{++}
• f(x_1, x_2) = x_1 x_2 is quasiconcave on R²_{++}
• linear-fractional function

      f(x) = (aᵀx + b)/(cᵀx + d),    dom f = {x | cᵀx + d > 0}

  is quasilinear
• distance ratio

      f(x) = ‖x − a‖₂ / ‖x − b‖₂,    dom f = {x | ‖x − a‖₂ ≤ ‖x − b‖₂}

  is quasiconvex

Convex functions 3–24
internal rate of return

• cash flow x = (x_0, ..., x_n); x_i is payment in period i (to us if x_i > 0)
• we assume x_0 < 0 and x_0 + x_1 + ··· + x_n > 0
• present value of cash flow x, for interest rate r:

      PV(x, r) = Σ_{i=0}^n (1 + r)^{−i} x_i

• internal rate of return is smallest interest rate for which PV(x, r) = 0:

      IRR(x) = inf{r ≥ 0 | PV(x, r) = 0}

IRR is quasiconcave: superlevel set is intersection of open halfspaces

    IRR(x) ≥ R  ⟺  Σ_{i=0}^n (1 + r)^{−i} x_i > 0 for 0 ≤ r < R

Convex functions 3–25
Properties

modified Jensen inequality: for quasiconvex f

    0 ≤ θ ≤ 1  ⟹  f(θx + (1 − θ)y) ≤ max{f(x), f(y)}

first-order condition: differentiable f with cvx domain is quasiconvex iff

    f(y) ≤ f(x)  ⟹  ∇f(x)ᵀ(y − x) ≤ 0

[figure: sublevel set with ∇f(x) defining a supporting hyperplane at x]

sums of quasiconvex functions are not necessarily quasiconvex

Convex functions 3–26
Log-concave and log-convex functions

a positive function f is log-concave if log f is concave:

    f(θx + (1 − θ)y) ≥ f(x)^θ f(y)^{1−θ}    for 0 ≤ θ ≤ 1

f is log-convex if log f is convex

• powers: x^a on R_{++} is log-convex for a ≤ 0, log-concave for a ≥ 0
• many common probability densities are log-concave, e.g., normal:

      f(x) = (1/√((2π)^n det Σ)) e^{−(1/2)(x − x̄)ᵀ Σ⁻¹ (x − x̄)}

• cumulative Gaussian distribution function Φ is log-concave

      Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du

Convex functions 3–27
Properties of log-concave functions

• twice differentiable f with convex domain is log-concave if and only if

      f(x) ∇²f(x) ⪯ ∇f(x) ∇f(x)ᵀ    for all x ∈ dom f

• product of log-concave functions is log-concave

• sum of log-concave functions is not always log-concave

• integration: if f : R^n × R^m → R is log-concave, then

      g(x) = ∫ f(x, y) dy

  is log-concave (not easy to show)

Convex functions 3–28
consequences of integration property

• convolution f ∗ g of log-concave functions f, g is log-concave

      (f ∗ g)(x) = ∫ f(x − y) g(y) dy

• if C ⊆ R^n convex and y is a random variable with log-concave pdf then

      f(x) = prob(x + y ∈ C)

  is log-concave

  proof: write f(x) as integral of product of log-concave functions

      f(x) = ∫ g(x + y) p(y) dy,    g(u) = 1 if u ∈ C, 0 if u ∉ C,

  p is pdf of y

Convex functions 3–29
example: yield function

    Y(x) = prob(x + w ∈ S)

• x ∈ R^n: nominal parameter values for product
• w ∈ R^n: random variations of parameters in manufactured product
• S: set of acceptable values

if S is convex and w has a log-concave pdf, then
• Y is log-concave
• yield regions {x | Y(x) ≥ α} are convex

Convex functions 3–30
Convexity with respect to generalized inequalities

f : R^n → R^m is K-convex if dom f is convex and

    f(θx + (1 − θ)y) ⪯_K θf(x) + (1 − θ)f(y)

for x, y ∈ dom f, 0 ≤ θ ≤ 1

example f : S^m → S^m, f(X) = X² is S^m_+-convex

proof: for fixed z ∈ R^m, zᵀX²z = ‖Xz‖₂² is convex in X, i.e.,

    zᵀ(θX + (1 − θ)Y)²z ≤ θzᵀX²z + (1 − θ)zᵀY²z

for X, Y ∈ S^m, 0 ≤ θ ≤ 1

therefore (θX + (1 − θ)Y)² ⪯ θX² + (1 − θ)Y²

Convex functions 3–31
Convex Optimization – Boyd & Vandenberghe

4. Convex optimization problems

• optimization problem in standard form
• convex optimization problems
• quasiconvex optimization
• linear optimization
• quadratic optimization
• geometric programming
• generalized inequality constraints
• semidefinite programming
• vector optimization

4–1
Optimization problem in standard form

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                h_i(x) = 0,  i = 1, ..., p

• x ∈ R^n is the optimization variable
• f_0 : R^n → R is the objective or cost function
• f_i : R^n → R, i = 1, ..., m, are the inequality constraint functions
• h_i : R^n → R are the equality constraint functions

optimal value:

    p* = inf{f_0(x) | f_i(x) ≤ 0, i = 1, ..., m, h_i(x) = 0, i = 1, ..., p}

• p* = ∞ if problem is infeasible (no x satisfies the constraints)
• p* = −∞ if problem is unbounded below

Convex optimization problems 4–2
Optimal and locally optimal points

x is feasible if x ∈ dom f_0 and it satisfies the constraints

a feasible x is optimal if f_0(x) = p*; X_opt is the set of optimal points

x is locally optimal if there is an R > 0 such that x is optimal for

    minimize (over z)  f_0(z)
    subject to         f_i(z) ≤ 0, i = 1, ..., m,   h_i(z) = 0, i = 1, ..., p
                       ‖z − x‖₂ ≤ R

examples (with n = 1, m = p = 0)
• f_0(x) = 1/x, dom f_0 = R_{++}: p* = 0, no optimal point
• f_0(x) = −log x, dom f_0 = R_{++}: p* = −∞
• f_0(x) = x log x, dom f_0 = R_{++}: p* = −1/e, x = 1/e is optimal
• f_0(x) = x³ − 3x: p* = −∞, local optimum at x = 1

Convex optimization problems 4–3
Implicit constraints

the standard form optimization problem has an implicit constraint

    x ∈ D = ∩_{i=0}^m dom f_i  ∩  ∩_{i=1}^p dom h_i

• we call D the domain of the problem
• the constraints f_i(x) ≤ 0, h_i(x) = 0 are the explicit constraints
• a problem is unconstrained if it has no explicit constraints (m = p = 0)

example:

    minimize  f_0(x) = −Σ_{i=1}^k log(b_i − a_iᵀx)

is an unconstrained problem with implicit constraints a_iᵀx < b_i

Convex optimization problems 4–4
Feasibility problem

    find        x
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                h_i(x) = 0,  i = 1, ..., p

can be considered a special case of the general problem with f_0(x) = 0:

    minimize    0
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                h_i(x) = 0,  i = 1, ..., p

• p* = 0 if constraints are feasible; any feasible x is optimal
• p* = ∞ if constraints are infeasible

Convex optimization problems 4–5
Convex optimization problem

standard form convex optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                a_iᵀx = b_i,  i = 1, ..., p

• f_0, f_1, ..., f_m are convex; equality constraints are affine
• problem is quasiconvex if f_0 is quasiconvex (and f_1, ..., f_m convex)

often written as

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                Ax = b

important property: feasible set of a convex optimization problem is convex

Convex optimization problems 4–6
example

    minimize    f_0(x) = x_1² + x_2²
    subject to  f_1(x) = x_1/(1 + x_2²) ≤ 0
                h_1(x) = (x_1 + x_2)² = 0

• f_0 is convex; feasible set {(x_1, x_2) | x_1 = −x_2 ≤ 0} is convex
• not a convex problem (according to our definition): f_1 is not convex, h_1
  is not affine

equivalent (but not identical) to the convex problem

    minimize    x_1² + x_2²
    subject to  x_1 ≤ 0
                x_1 + x_2 = 0

Convex optimization problems 4–7
Local and global optima

any locally optimal point of a convex problem is (globally) optimal

proof: suppose x is locally optimal and y is optimal with f_0(y) < f_0(x)

x locally optimal means there is an R > 0 such that

    z feasible,  ‖z − x‖₂ ≤ R  ⟹  f_0(z) ≥ f_0(x)

consider z = θy + (1 − θ)x with θ = R/(2‖y − x‖₂)

• ‖y − x‖₂ > R, so 0 < θ < 1/2
• z is a convex combination of two feasible points, hence also feasible
• ‖z − x‖₂ = R/2 and

      f_0(z) ≤ θf_0(y) + (1 − θ)f_0(x) < f_0(x)

which contradicts our assumption that x is locally optimal

Convex optimization problems 4–8
Optimality criterion for differentiable f_0

x is optimal if and only if it is feasible and

    ∇f_0(x)ᵀ(y − x) ≥ 0    for all feasible y

[figure: feasible set X with −∇f_0(x) normal to a supporting hyperplane at x]

if nonzero, ∇f_0(x) defines a supporting hyperplane to feasible set X at x

Convex optimization problems 4–9
• unconstrained problem: x is optimal if and only if

      x ∈ dom f_0,    ∇f_0(x) = 0

• equality constrained problem

      minimize f_0(x)  subject to  Ax = b

  x is optimal if and only if there exists a ν such that

      x ∈ dom f_0,    Ax = b,    ∇f_0(x) + Aᵀν = 0

• minimization over nonnegative orthant

      minimize f_0(x)  subject to  x ⪰ 0

  x is optimal if and only if

      x ∈ dom f_0,  x ⪰ 0,    ∇f_0(x)_i ≥ 0 if x_i = 0,    ∇f_0(x)_i = 0 if x_i > 0

Convex optimization problems 4–10
Equivalent convex problems

two problems are (informally) equivalent if the solution of one is readily
obtained from the solution of the other, and vice-versa

some common transformations that preserve convexity:

• eliminating equality constraints

      minimize    f_0(x)
      subject to  f_i(x) ≤ 0,  i = 1, ..., m
                  Ax = b

  is equivalent to

      minimize (over z)  f_0(Fz + x_0)
      subject to         f_i(Fz + x_0) ≤ 0,  i = 1, ..., m

  where F and x_0 are such that

      Ax = b  ⟺  x = Fz + x_0 for some z

Convex optimization problems 4–11
• introducing equality constraints

      minimize    f_0(A_0 x + b_0)
      subject to  f_i(A_i x + b_i) ≤ 0,  i = 1, ..., m

  is equivalent to

      minimize (over x, y_i)  f_0(y_0)
      subject to              f_i(y_i) ≤ 0,  i = 1, ..., m
                              y_i = A_i x + b_i,  i = 0, 1, ..., m

• introducing slack variables for linear inequalities

      minimize    f_0(x)
      subject to  a_iᵀx ≤ b_i,  i = 1, ..., m

  is equivalent to

      minimize (over x, s)  f_0(x)
      subject to            a_iᵀx + s_i = b_i,  i = 1, ..., m
                            s_i ≥ 0,  i = 1, ..., m

Convex optimization problems 4–12
• epigraph form: standard form convex problem is equivalent to

      minimize (over x, t)  t
      subject to            f_0(x) − t ≤ 0
                            f_i(x) ≤ 0,  i = 1, ..., m
                            Ax = b

• minimizing over some variables

      minimize    f_0(x_1, x_2)
      subject to  f_i(x_1) ≤ 0,  i = 1, ..., m

  is equivalent to

      minimize    f̃_0(x_1)
      subject to  f_i(x_1) ≤ 0,  i = 1, ..., m

  where f̃_0(x_1) = inf_{x_2} f_0(x_1, x_2)

Convex optimization problems 4–13
Quasiconvex optimization

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., m
                Ax = b

with f_0 : R^n → R quasiconvex, f_1, ..., f_m convex

can have locally optimal points that are not (globally) optimal

[figure: quasiconvex f_0 with a locally optimal point (x, f_0(x))]

Convex optimization problems 4–14
convex representation of sublevel sets of f_0

if f_0 is quasiconvex, there exists a family of functions φ_t such that:

• φ_t(x) is convex in x for fixed t
• t-sublevel set of f_0 is 0-sublevel set of φ_t, i.e.,

      f_0(x) ≤ t  ⟺  φ_t(x) ≤ 0

example

    f_0(x) = p(x)/q(x)

with p convex, q concave, and p(x) ≥ 0, q(x) > 0 on dom f_0

can take φ_t(x) = p(x) − t q(x):
• for t ≥ 0, φ_t convex in x
• p(x)/q(x) ≤ t if and only if φ_t(x) ≤ 0

Convex optimization problems 4–15
quasiconvex optimization via convex feasibility problems

    φ_t(x) ≤ 0,  f_i(x) ≤ 0,  i = 1, ..., m,  Ax = b    (1)

• for fixed t, a convex feasibility problem in x
• if feasible, we can conclude that t ≥ p*; if infeasible, t ≤ p*

Bisection method for quasiconvex optimization

    given l ≤ p*, u ≥ p*, tolerance ε > 0.
    repeat
        1. t := (l + u)/2.
        2. Solve the convex feasibility problem (1).
        3. if (1) is feasible, u := t; else l := t.
    until u − l ≤ ε.

requires exactly ⌈log₂((u − l)/ε)⌉ iterations (where u, l are initial values)

Convex optimization problems 4–16
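The bisection method above can be sketched directly; here the convex feasibility step is replaced by a dense grid check (illustrative only, not a real feasibility solver), for the quasiconvex example f_0(x) = p(x)/q(x) with p(x) = x² + 1 (convex, ≥ 0) and q(x) = x (concave, > 0) on the interval [0.5, 4]:

```python
import numpy as np

# f_0(x) = (x^2 + 1)/x = x + 1/x on [0.5, 4]; minimum value 2 at x = 1.
xs = np.linspace(0.5, 4.0, 100_001)

# phi_t(x) = p(x) - t q(x); "feasible" = some grid point has phi_t(x) <= 0.
phi_feasible = lambda t: bool(np.any(xs**2 + 1 - t * xs <= 0))

l, u, eps = 0.0, 10.0, 1e-6   # l <= p* <= u
while u - l > eps:
    t = (l + u) / 2
    if phi_feasible(t):
        u = t                 # t >= p*
    else:
        l = t                 # t <= p*

print(round(u, 4))  # → 2.0
```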
Linear program (LP)

minimize   c^T x + d
subject to Gx ⪯ h
           Ax = b

• convex problem with affine objective and constraint functions
• feasible set is a polyhedron

[figure: polyhedron P with optimal point x* and objective direction -c]

Convex optimization problems 4-17
Examples

diet problem: choose quantities x_1, ..., x_n of n foods

• one unit of food j costs c_j, contains amount a_ij of nutrient i
• healthy diet requires nutrient i in quantity at least b_i

to find cheapest healthy diet,

minimize   c^T x
subject to Ax ⪰ b,  x ⪰ 0

piecewise-linear minimization

minimize   max_{i=1,...,m} (a_i^T x + b_i)

equivalent to an LP

minimize   t
subject to a_i^T x + b_i ≤ t,  i = 1, ..., m

Convex optimization problems 4-18
Chebyshev center of a polyhedron

Chebyshev center of

    P = {x | a_i^T x ≤ b_i, i = 1, ..., m}

is center of largest inscribed ball

    B = {x_c + u | ‖u‖_2 ≤ r}

[figure: polyhedron P with largest inscribed ball centered at x_cheb]

• a_i^T x ≤ b_i for all x ∈ B if and only if

    sup{a_i^T (x_c + u) | ‖u‖_2 ≤ r} = a_i^T x_c + r‖a_i‖_2 ≤ b_i

• hence, x_c, r can be determined by solving the LP

maximize   r
subject to a_i^T x_c + r‖a_i‖_2 ≤ b_i,  i = 1, ..., m

Convex optimization problems 4-19
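A small numerical check of the key identity above, sup over the ball equals a_i^T x_c + r‖a_i‖_2; the unit-square polyhedron and its (symmetry-obvious) center are illustrative choices, assuming NumPy:

```python
import numpy as np

# Unit square {x | a_i^T x <= b_i}: x1 <= 1, -x1 <= 0, x2 <= 1, -x2 <= 0
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])

# by symmetry the Chebyshev center of the square is (0.5, 0.5) with r = 0.5;
# verify it is feasible for the LP constraints a_i^T xc + r ||a_i||_2 <= b_i
xc, r = np.array([0.5, 0.5]), 0.5
analytic = A @ xc + r * np.linalg.norm(A, axis=1)
assert np.all(analytic <= b + 1e-9)

# check sup_{||u||_2 <= r} a_i^T (xc + u) = a_i^T xc + r ||a_i||_2 by
# sampling u on the circle of radius r
theta = np.linspace(0, 2 * np.pi, 4001)
U = r * np.column_stack([np.cos(theta), np.sin(theta)])
sampled_sup = ((xc + U) @ A.T).max(axis=0)
print(np.allclose(sampled_sup, analytic, atol=1e-6))
```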
Linear-fractional program

minimize   f_0(x)
subject to Gx ⪯ h
           Ax = b

linear-fractional program

    f_0(x) = (c^T x + d)/(e^T x + f),    dom f_0(x) = {x | e^T x + f > 0}

• a quasiconvex optimization problem; can be solved by bisection
• also equivalent to the LP (variables y, z)

minimize   c^T y + dz
subject to Gy ⪯ hz
           Ay = bz
           e^T y + fz = 1
           z ≥ 0

Convex optimization problems 4-20
generalized linear-fractional program

    f_0(x) = max_{i=1,...,r} (c_i^T x + d_i)/(e_i^T x + f_i),    dom f_0(x) = {x | e_i^T x + f_i > 0, i = 1, ..., r}

a quasiconvex optimization problem; can be solved by bisection

example: Von Neumann model of a growing economy

maximize (over x, x⁺)   min_{i=1,...,n} x⁺_i / x_i
subject to              x⁺ ⪰ 0,  Bx⁺ ⪯ Ax

• x, x⁺ ∈ R^n: activity levels of n sectors, in current and next period
• (Ax)_i, (Bx⁺)_i: produced, resp. consumed, amounts of good i
• x⁺_i / x_i: growth rate of sector i

allocate activity to maximize growth rate of slowest growing sector

Convex optimization problems 4-21
Quadratic program (QP)

minimize   (1/2)x^T Px + q^T x + r
subject to Gx ⪯ h
           Ax = b

• P ∈ S^n_+, so objective is convex quadratic
• minimize a convex quadratic function over a polyhedron

[figure: polyhedron P with optimal point x* and level curves of f_0 through f_0(x*)]

Convex optimization problems 4-22
Examples

least-squares

minimize   ‖Ax - b‖_2²

• analytical solution x* = A†b (A† is pseudo-inverse)
• can add linear constraints, e.g., l ⪯ x ⪯ u

linear program with random cost

minimize   c̄^T x + γ x^T Σ x = E c^T x + γ var(c^T x)
subject to Gx ⪯ h,  Ax = b

• c is random vector with mean c̄ and covariance Σ
• hence, c^T x is random variable with mean c̄^T x and variance x^T Σ x
• γ > 0 is risk aversion parameter; controls the trade-off between expected cost and variance (risk)

Convex optimization problems 4-23
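The unconstrained least-squares case above has the stated closed form; a quick NumPy sketch (random toy data, illustrative only) verifying x* = A†b satisfies the optimality condition A^T(Ax* - b) = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # tall matrix, full column rank (generic)
b = rng.standard_normal(20)

x_star = np.linalg.pinv(A) @ b     # x* = A_dagger b

# optimality: the residual is orthogonal to range(A) (normal equations)
grad = A.T @ (A @ x_star - b)
print(np.allclose(grad, 0, atol=1e-8))
```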
Quadratically constrained quadratic program (QCQP)

minimize   (1/2)x^T P_0 x + q_0^T x + r_0
subject to (1/2)x^T P_i x + q_i^T x + r_i ≤ 0,  i = 1, ..., m
           Ax = b

• P_i ∈ S^n_+; objective and constraints are convex quadratic
• if P_1, ..., P_m ∈ S^n_++, feasible region is intersection of m ellipsoids and an affine set

Convex optimization problems 4-24
Second-order cone programming

minimize   f^T x
subject to ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i,  i = 1, ..., m
           Fx = g

(A_i ∈ R^{n_i × n}, F ∈ R^{p × n})

• inequalities are called second-order cone (SOC) constraints:

    (A_i x + b_i, c_i^T x + d_i) ∈ second-order cone in R^{n_i + 1}

• for n_i = 0, reduces to an LP; if c_i = 0, reduces to a QCQP
• more general than QCQP and LP

Convex optimization problems 4-25
Robust linear programming

the parameters in optimization problems are often uncertain, e.g., in an LP

minimize   c^T x
subject to a_i^T x ≤ b_i,  i = 1, ..., m,

there can be uncertainty in c, a_i, b_i

two common approaches to handling uncertainty (in a_i, for simplicity)

• deterministic model: constraints must hold for all a_i ∈ E_i

minimize   c^T x
subject to a_i^T x ≤ b_i for all a_i ∈ E_i,  i = 1, ..., m,

• stochastic model: a_i is random variable; constraints must hold with probability η

minimize   c^T x
subject to prob(a_i^T x ≤ b_i) ≥ η,  i = 1, ..., m

Convex optimization problems 4-26
deterministic approach via SOCP

• choose an ellipsoid as E_i:

    E_i = {ā_i + P_i u | ‖u‖_2 ≤ 1}    (ā_i ∈ R^n, P_i ∈ R^{n×n})

  center is ā_i, semi-axes determined by singular values/vectors of P_i

• robust LP

minimize   c^T x
subject to a_i^T x ≤ b_i  for all a_i ∈ E_i,  i = 1, ..., m

is equivalent to the SOCP

minimize   c^T x
subject to ā_i^T x + ‖P_i^T x‖_2 ≤ b_i,  i = 1, ..., m

(follows from sup_{‖u‖_2 ≤ 1} (ā_i + P_i u)^T x = ā_i^T x + ‖P_i^T x‖_2)

Convex optimization problems 4-27
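The supremum identity that drives the reformulation above can be checked numerically; the random problem data below are illustrative, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a_bar = rng.standard_normal(n)
P = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# worst case of (a_bar + P u)^T x over ||u||_2 <= 1 is attained at
# u* = P^T x / ||P^T x||_2, giving a_bar^T x + ||P^T x||_2
analytic = a_bar @ x + np.linalg.norm(P.T @ x)
u_star = P.T @ x / np.linalg.norm(P.T @ x)
attained = (a_bar + P @ u_star) @ x

# random unit vectors never exceed the analytic worst case
U = rng.standard_normal((10000, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
sampled = ((a_bar + U @ P.T) @ x).max()

print(np.isclose(attained, analytic) and sampled <= analytic + 1e-9)
```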
stochastic approach via SOCP

• assume a_i is Gaussian with mean ā_i, covariance Σ_i (a_i ∼ N(ā_i, Σ_i))
• a_i^T x is Gaussian r.v. with mean ā_i^T x, variance x^T Σ_i x; hence

    prob(a_i^T x ≤ b_i) = Φ( (b_i - ā_i^T x) / ‖Σ_i^{1/2} x‖_2 )

  where Φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-t²/2} dt is CDF of N(0, 1)

• robust LP

minimize   c^T x
subject to prob(a_i^T x ≤ b_i) ≥ η,  i = 1, ..., m,

with η ≥ 1/2, is equivalent to the SOCP

minimize   c^T x
subject to ā_i^T x + Φ^{-1}(η)‖Σ_i^{1/2} x‖_2 ≤ b_i,  i = 1, ..., m

Convex optimization problems 4-28
Geometric programming

monomial function

    f(x) = c x_1^{a_1} x_2^{a_2} · · · x_n^{a_n},    dom f = R^n_++

with c > 0; exponent a_i can be any real number

posynomial function: sum of monomials

    f(x) = Σ_{k=1}^{K} c_k x_1^{a_1k} x_2^{a_2k} · · · x_n^{a_nk},    dom f = R^n_++

geometric program (GP)

minimize   f_0(x)
subject to f_i(x) ≤ 1,  i = 1, ..., m
           h_i(x) = 1,  i = 1, ..., p

with f_i posynomial, h_i monomial

Convex optimization problems 4-29
Geometric program in convex form

change variables to y_i = log x_i, and take logarithm of cost, constraints

• monomial f(x) = c x_1^{a_1} · · · x_n^{a_n} transforms to

    log f(e^{y_1}, ..., e^{y_n}) = a^T y + b    (b = log c)

• posynomial f(x) = Σ_{k=1}^{K} c_k x_1^{a_1k} x_2^{a_2k} · · · x_n^{a_nk} transforms to

    log f(e^{y_1}, ..., e^{y_n}) = log( Σ_{k=1}^{K} e^{a_k^T y + b_k} )    (b_k = log c_k)

• geometric program transforms to convex problem

minimize   log( Σ_{k=1}^{K} exp(a_0k^T y + b_0k) )
subject to log( Σ_{k=1}^{K} exp(a_ik^T y + b_ik) ) ≤ 0,  i = 1, ..., m
           Gy + d = 0

Convex optimization problems 4-30
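The posynomial-to-log-sum-exp transformation above is easy to verify numerically; the random exponents and coefficients below are illustrative, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 3, 4
a = rng.standard_normal((K, n))   # exponent vectors a_k
c = rng.uniform(0.5, 2.0, K)      # positive coefficients c_k

def posynomial(x):
    return sum(c[k] * np.prod(x ** a[k]) for k in range(K))

# with x = exp(y), log f(x) is the log-sum-exp of affine functions of y
y = rng.standard_normal(n)
x = np.exp(y)
lhs = np.log(posynomial(x))
rhs = np.log(np.sum(np.exp(a @ y + np.log(c))))
print(np.isclose(lhs, rhs))
```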
Design of cantilever beam

[figure: cantilever beam made of segments 4, 3, 2, 1, with vertical force F applied at the right end]

• N segments with unit lengths, rectangular cross-sections of size w_i × h_i
• given vertical force F applied at the right end

design problem

minimize   total weight
subject to upper & lower bounds on w_i, h_i
           upper & lower bounds on aspect ratios h_i/w_i
           upper bound on stress in each segment
           upper bound on vertical deflection at the end of the beam

variables: w_i, h_i for i = 1, ..., N

Convex optimization problems 4-31
objective and constraint functions

• total weight w_1 h_1 + · · · + w_N h_N is posynomial
• aspect ratio h_i/w_i and inverse aspect ratio w_i/h_i are monomials
• maximum stress in segment i is given by 6iF/(w_i h_i²), a monomial
• the vertical deflection y_i and slope v_i of central axis at the right end of segment i are defined recursively as

    v_i = 12(i - 1/2) F/(E w_i h_i³) + v_{i+1}
    y_i = 6(i - 1/3) F/(E w_i h_i³) + v_{i+1} + y_{i+1}

for i = N, N-1, ..., 1, with v_{N+1} = y_{N+1} = 0 (E is Young's modulus)

v_i and y_i are posynomial functions of w, h

Convex optimization problems 4-32
formulation as a GP

minimize   w_1 h_1 + · · · + w_N h_N
subject to w_max^{-1} w_i ≤ 1,  w_min w_i^{-1} ≤ 1,  i = 1, ..., N
           h_max^{-1} h_i ≤ 1,  h_min h_i^{-1} ≤ 1,  i = 1, ..., N
           S_max^{-1} w_i^{-1} h_i ≤ 1,  S_min w_i h_i^{-1} ≤ 1,  i = 1, ..., N
           6iF σ_max^{-1} w_i^{-1} h_i^{-2} ≤ 1,  i = 1, ..., N
           y_max^{-1} y_1 ≤ 1

note

• we write w_min ≤ w_i ≤ w_max and h_min ≤ h_i ≤ h_max as

    w_min/w_i ≤ 1,  w_i/w_max ≤ 1,  h_min/h_i ≤ 1,  h_i/h_max ≤ 1

• we write S_min ≤ h_i/w_i ≤ S_max as

    S_min w_i/h_i ≤ 1,  h_i/(w_i S_max) ≤ 1

Convex optimization problems 4-33
Minimizing spectral radius of nonnegative matrix

Perron-Frobenius eigenvalue λ_pf(A)

• exists for (elementwise) positive A ∈ R^{n×n}
• a real, positive eigenvalue of A, equal to spectral radius max_i |λ_i(A)|
• determines asymptotic growth (decay) rate of A^k:  A^k ∼ λ_pf^k as k → ∞
• alternative characterization: λ_pf(A) = inf{λ | Av ⪯ λv for some v ≻ 0}

minimizing spectral radius of matrix of posynomials

• minimize λ_pf(A(x)), where the elements A(x)_ij are posynomials of x
• equivalent geometric program:

minimize   λ
subject to Σ_{j=1}^{n} A(x)_ij v_j / (λ v_i) ≤ 1,  i = 1, ..., n

variables λ, v, x

Convex optimization problems 4-34
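A numerical sanity check of the properties listed above (spectral radius, positive eigenvector, and the inequality characterization); the random positive matrix and power iteration are illustrative choices, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.uniform(0.1, 1.0, (5, 5))        # elementwise positive matrix

lam_pf = max(abs(np.linalg.eigvals(A)))  # spectral radius

# power iteration: A^k v grows like lam_pf^k; v converges to a positive vector
v = np.ones(5)
for _ in range(500):
    v = A @ v
    v /= np.linalg.norm(v)
growth = np.linalg.norm(A @ v)

# v certifies the characterization A v <= lam v with lam = lam_pf
print(np.isclose(growth, lam_pf) and np.all(v > 0)
      and np.all(A @ v <= (lam_pf + 1e-6) * v))
```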
Generalized inequality constraints

convex problem with generalized inequality constraints

minimize   f_0(x)
subject to f_i(x) ⪯_{K_i} 0,  i = 1, ..., m
           Ax = b

• f_0 : R^n → R convex; f_i : R^n → R^{k_i} K_i-convex w.r.t. proper cone K_i
• same properties as standard convex problem (convex feasible set, local optimum is global, etc.)

conic form problem: special case with affine objective and constraints

minimize   c^T x
subject to Fx + g ⪯_K 0
           Ax = b

extends linear programming (K = R^m_+) to nonpolyhedral cones

Convex optimization problems 4-35
Semidefinite program (SDP)

minimize   c^T x
subject to x_1 F_1 + x_2 F_2 + · · · + x_n F_n + G ⪯ 0
           Ax = b

with F_i, G ∈ S^k

• inequality constraint is called linear matrix inequality (LMI)
• includes problems with multiple LMI constraints: for example,

    x_1 F̂_1 + · · · + x_n F̂_n + Ĝ ⪯ 0,    x_1 F̃_1 + · · · + x_n F̃_n + G̃ ⪯ 0

is equivalent to single LMI (with block-diagonal coefficients)

    x_1 diag(F̂_1, F̃_1) + x_2 diag(F̂_2, F̃_2) + · · · + x_n diag(F̂_n, F̃_n) + diag(Ĝ, G̃) ⪯ 0

Convex optimization problems 4-36
LP and SOCP as SDP

LP and equivalent SDP

LP:    minimize   c^T x
       subject to Ax ⪯ b

SDP:   minimize   c^T x
       subject to diag(Ax - b) ⪯ 0

(note different interpretation of generalized inequality ⪯)

SOCP and equivalent SDP

SOCP:  minimize   f^T x
       subject to ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i,  i = 1, ..., m

SDP:   minimize   f^T x
       subject to [ (c_i^T x + d_i)I    A_i x + b_i ;
                    (A_i x + b_i)^T     c_i^T x + d_i ] ⪰ 0,  i = 1, ..., m

Convex optimization problems 4-37
Eigenvalue minimization

minimize   λ_max(A(x))

where A(x) = A_0 + x_1 A_1 + · · · + x_n A_n (with given A_i ∈ S^k)

equivalent SDP

minimize   t
subject to A(x) ⪯ tI

• variables x ∈ R^n, t ∈ R
• follows from

    λ_max(A) ≤ t   ⟺   A ⪯ tI

Convex optimization problems 4-38
Matrix norm minimization

minimize   ‖A(x)‖_2 = ( λ_max(A(x)^T A(x)) )^{1/2}

where A(x) = A_0 + x_1 A_1 + · · · + x_n A_n (with given A_i ∈ R^{p×q})

equivalent SDP

minimize   t
subject to [ tI       A(x) ;
             A(x)^T   tI   ] ⪰ 0

• variables x ∈ R^n, t ∈ R
• constraint follows from

    ‖A‖_2 ≤ t  ⟺  A^T A ⪯ t²I, t ≥ 0
               ⟺  [ tI  A ; A^T  tI ] ⪰ 0

Convex optimization problems 4-39
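The block-matrix equivalence above can be verified on a random instance: the smallest eigenvalue of the block matrix is t - σ_max(A), so positive semidefiniteness holds exactly when t ≥ ‖A‖_2. A NumPy sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 4))
t_star = np.linalg.norm(A, 2)    # spectral norm of A

def min_eig(t):
    # smallest eigenvalue of [[t I, A], [A^T, t I]]
    M = np.block([[t * np.eye(3), A], [A.T, t * np.eye(4)]])
    return np.linalg.eigvalsh(M).min()

# the LMI holds exactly for t >= ||A||_2
print(abs(min_eig(t_star)) < 1e-9
      and min_eig(t_star + 0.1) > 0
      and min_eig(t_star - 0.1) < 0)
```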
Vector optimization

general vector optimization problem

minimize (w.r.t. K)   f_0(x)
subject to            f_i(x) ≤ 0,  i = 1, ..., m
                      h_i(x) = 0,  i = 1, ..., p

vector objective f_0 : R^n → R^q, minimized w.r.t. proper cone K ⊆ R^q

convex vector optimization problem

minimize (w.r.t. K)   f_0(x)
subject to            f_i(x) ≤ 0,  i = 1, ..., m
                      Ax = b

with f_0 K-convex, f_1, ..., f_m convex

Convex optimization problems 4-40
Optimal and Pareto optimal points

set of achievable objective values

    O = {f_0(x) | x feasible}

• feasible x is optimal if f_0(x) is the minimum value of O
• feasible x is Pareto optimal if f_0(x) is a minimal value of O

[left figure: set O with f_0(x*) at its minimum; x* is optimal]
[right figure: set O with f_0(x_po) on its minimal boundary; x_po is Pareto optimal]

Convex optimization problems 4-41
Multicriterion optimization

vector optimization problem with K = R^q_+

    f_0(x) = (F_1(x), ..., F_q(x))

• q different objectives F_i; roughly speaking we want all F_i's to be small

feasible x* is optimal if

    y feasible   ⟹   f_0(x*) ⪯ f_0(y)

if there exists an optimal point, the objectives are noncompeting

feasible x_po is Pareto optimal if

    y feasible, f_0(y) ⪯ f_0(x_po)   ⟹   f_0(x_po) = f_0(y)

if there are multiple Pareto optimal values, there is a trade-off between the objectives

Convex optimization problems 4-42
Regularized least-squares

minimize (w.r.t. R²_+)   ( ‖Ax - b‖_2², ‖x‖_2² )

[figure: trade-off curve of F_1(x) = ‖Ax - b‖_2² versus F_2(x) = ‖x‖_2² for an example with A ∈ R^{100×10}; heavy line is formed by Pareto optimal points]

Convex optimization problems 4-43
Risk return trade-off in portfolio optimization

minimize (w.r.t. R²_+)   ( -p̄^T x, x^T Σ x )
subject to               1^T x = 1,  x ⪰ 0

• x ∈ R^n is investment portfolio; x_i is fraction invested in asset i
• p ∈ R^n is vector of relative asset price changes; modeled as a random variable with mean p̄, covariance Σ
• p̄^T x = E r is expected return; x^T Σ x = var r is return variance

example

[figure: mean return versus standard deviation of return (trade-off curve), and allocation x(1), x(2), x(3), x(4) versus standard deviation of return]

Convex optimization problems 4-44
Scalarization

to find Pareto optimal points: choose λ ≻_{K*} 0 and solve scalar problem

minimize   λ^T f_0(x)
subject to f_i(x) ≤ 0,  i = 1, ..., m
           h_i(x) = 0,  i = 1, ..., p

if x is optimal for scalar problem, then it is Pareto-optimal for vector optimization problem

[figure: set O with Pareto points f_0(x_1), f_0(x_2) supported by hyperplanes with normals λ_1, λ_2, and a Pareto point f_0(x_3) that scalarization cannot find]

for convex vector optimization problems, can find (almost) all Pareto optimal points by varying λ ≻_{K*} 0

Convex optimization problems 4-45
Scalarization for multicriterion problems

to find Pareto optimal points, minimize positive weighted sum

    λ^T f_0(x) = λ_1 F_1(x) + · · · + λ_q F_q(x)

examples

• regularized least-squares problem of page 4-43

  take λ = (1, γ) with γ > 0

  minimize   ‖Ax - b‖_2² + γ‖x‖_2²

  for fixed γ, a LS problem

[figure: trade-off curve of ‖x‖_2² versus ‖Ax - b‖_2²; the point for γ = 1 is marked]

Convex optimization problems 4-46
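Sweeping γ in the scalarized problem above traces out the trade-off curve; each value of γ is a ridge-regularized least-squares problem with a closed-form solution. A NumPy sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)

def ridge(gamma):
    # minimizer of ||Ax - b||_2^2 + gamma ||x||_2^2
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ b)

# as gamma grows, ||x|| shrinks and the residual ||Ax - b|| grows:
# the two objectives trade off against each other
gammas = [0.01, 0.1, 1.0, 10.0, 100.0]
xs = [ridge(g) for g in gammas]
norms = [np.linalg.norm(x) for x in xs]
resids = [np.linalg.norm(A @ x - b) for x in xs]

print(all(n1 > n2 for n1, n2 in zip(norms, norms[1:])) and
      all(r1 < r2 for r1, r2 in zip(resids, resids[1:])))
```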
• risk-return trade-off of page 4-44

  minimize   -p̄^T x + γ x^T Σ x
  subject to 1^T x = 1,  x ⪰ 0

  for fixed γ > 0, a quadratic program

Convex optimization problems 4-47
Convex Optimization Boyd & Vandenberghe

5. Duality

• Lagrange dual problem
• weak and strong duality
• geometric interpretation
• optimality conditions
• perturbation and sensitivity analysis
• examples
• generalized inequalities

5-1
Lagrangian

standard form problem (not necessarily convex)

minimize   f_0(x)
subject to f_i(x) ≤ 0,  i = 1, ..., m
           h_i(x) = 0,  i = 1, ..., p

variable x ∈ R^n, domain D, optimal value p*

Lagrangian: L : R^n × R^m × R^p → R, with dom L = D × R^m × R^p,

    L(x, λ, ν) = f_0(x) + Σ_{i=1}^{m} λ_i f_i(x) + Σ_{i=1}^{p} ν_i h_i(x)

• weighted sum of objective and constraint functions
• λ_i is Lagrange multiplier associated with f_i(x) ≤ 0
• ν_i is Lagrange multiplier associated with h_i(x) = 0

Duality 5-2
Lagrange dual function

Lagrange dual function: g : R^m × R^p → R,

    g(λ, ν) = inf_{x∈D} L(x, λ, ν)
            = inf_{x∈D} ( f_0(x) + Σ_{i=1}^{m} λ_i f_i(x) + Σ_{i=1}^{p} ν_i h_i(x) )

g is concave, can be -∞ for some λ, ν

lower bound property: if λ ⪰ 0, then g(λ, ν) ≤ p*

proof: if x̃ is feasible and λ ⪰ 0, then

    f_0(x̃) ≥ L(x̃, λ, ν) ≥ inf_{x∈D} L(x, λ, ν) = g(λ, ν)

minimizing over all feasible x̃ gives p* ≥ g(λ, ν)

Duality 5-3
Least-norm solution of linear equations

minimize   x^T x
subject to Ax = b

dual function

• Lagrangian is L(x, ν) = x^T x + ν^T (Ax - b)
• to minimize L over x, set gradient equal to zero:

    ∇_x L(x, ν) = 2x + A^T ν = 0   ⟹   x = -(1/2)A^T ν

• plug in in L to obtain g:

    g(ν) = L((-1/2)A^T ν, ν) = -(1/4)ν^T AA^T ν - b^T ν

a concave function of ν

lower bound property: p* ≥ -(1/4)ν^T AA^T ν - b^T ν for all ν

Duality 5-4
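For this problem both the primal and dual optima have closed forms, so the lower bound property (and, in fact, zero duality gap) can be checked directly; the random instance below is illustrative, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((3, 6))   # fat A: Ax = b is underdetermined
b = rng.standard_normal(3)

# primal optimum of minimize x^T x s.t. Ax = b: x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# dual function g(nu) = -(1/4) nu^T A A^T nu - b^T nu lower-bounds p*
g = lambda nu: -0.25 * nu @ (A @ A.T) @ nu - b @ nu
assert all(g(rng.standard_normal(3)) <= p_star + 1e-9 for _ in range(100))

# the bound is tight at nu* = -2 (A A^T)^{-1} b (strong duality)
nu_star = -2 * np.linalg.solve(A @ A.T, b)
print(np.isclose(g(nu_star), p_star))
```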
Standard form LP

minimize   c^T x
subject to Ax = b,  x ⪰ 0

dual function

• Lagrangian is

    L(x, λ, ν) = c^T x + ν^T (Ax - b) - λ^T x
               = -b^T ν + (c + A^T ν - λ)^T x

• L is affine in x, hence

    g(λ, ν) = inf_x L(x, λ, ν) = { -b^T ν   if A^T ν - λ + c = 0
                                 { -∞       otherwise

g is linear on affine domain {(λ, ν) | A^T ν - λ + c = 0}, hence concave

lower bound property: p* ≥ -b^T ν if A^T ν + c ⪰ 0

Duality 5-5
Equality constrained norm minimization

minimize   ‖x‖
subject to Ax = b

dual function

    g(ν) = inf_x (‖x‖ - ν^T Ax + b^T ν) = { b^T ν   if ‖A^T ν‖_* ≤ 1
                                          { -∞      otherwise

where ‖v‖_* = sup_{‖u‖≤1} u^T v is dual norm of ‖·‖

proof: follows from inf_x (‖x‖ - y^T x) = 0 if ‖y‖_* ≤ 1, -∞ otherwise

• if ‖y‖_* ≤ 1, then ‖x‖ - y^T x ≥ 0 for all x, with equality if x = 0
• if ‖y‖_* > 1, choose x = tu where ‖u‖ ≤ 1, u^T y = ‖y‖_* > 1:

    ‖x‖ - y^T x = t(‖u‖ - ‖y‖_*) → -∞   as t → ∞

lower bound property: p* ≥ b^T ν if ‖A^T ν‖_* ≤ 1

Duality 5-6
Two-way partitioning

minimize   x^T Wx
subject to x_i² = 1,  i = 1, ..., n

• a nonconvex problem; feasible set contains 2^n discrete points
• interpretation: partition {1, ..., n} in two sets; W_ij is cost of assigning i, j to the same set; -W_ij is cost of assigning to different sets

dual function

    g(ν) = inf_x ( x^T Wx + Σ_i ν_i (x_i² - 1) ) = inf_x x^T (W + diag(ν))x - 1^T ν
         = { -1^T ν   if W + diag(ν) ⪰ 0
           { -∞       otherwise

lower bound property: p* ≥ -1^T ν if W + diag(ν) ⪰ 0

example: ν = -λ_min(W)1 gives bound p* ≥ nλ_min(W)

Duality 5-7
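For small n the bound p* ≥ nλ_min(W) can be compared against the exact optimum by enumerating all 2^n sign vectors; the random symmetric W below is illustrative, assuming NumPy:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(9)
n = 8
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                          # symmetric cost matrix

# dual bound with nu = -lambda_min(W) 1:  p* >= n lambda_min(W)
bound = n * np.linalg.eigvalsh(W).min()

# brute-force primal optimum over all 2^n feasible sign vectors
p_star = min(np.array(s) @ W @ np.array(s)
             for s in product([-1, 1], repeat=n))
print(bound <= p_star + 1e-9)
```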
Lagrange dual and conjugate function

minimize   f_0(x)
subject to Ax ⪯ b,  Cx = d

dual function

    g(λ, ν) = inf_{x∈dom f_0} ( f_0(x) + (A^T λ + C^T ν)^T x - b^T λ - d^T ν )
            = -f_0*(-A^T λ - C^T ν) - b^T λ - d^T ν

• recall definition of conjugate f*(y) = sup_{x∈dom f} (y^T x - f(x))
• simplifies derivation of dual if conjugate of f_0 is known

example: entropy maximization

    f_0(x) = Σ_{i=1}^{n} x_i log x_i,    f_0*(y) = Σ_{i=1}^{n} e^{y_i - 1}

Duality 5-8
The dual problem

Lagrange dual problem

maximize   g(λ, ν)
subject to λ ⪰ 0

• finds best lower bound on p*, obtained from Lagrange dual function
• a convex optimization problem; optimal value denoted d*
• λ, ν are dual feasible if λ ⪰ 0, (λ, ν) ∈ dom g
• often simplified by making implicit constraint (λ, ν) ∈ dom g explicit

example: standard form LP and its dual (page 5-5)

minimize   c^T x          maximize   -b^T ν
subject to Ax = b         subject to A^T ν + c ⪰ 0
           x ⪰ 0

Duality 5-9
Weak and strong duality

weak duality: d* ≤ p*

• always holds (for convex and nonconvex problems)
• can be used to find nontrivial lower bounds for difficult problems

  for example, solving the SDP

  maximize   -1^T ν
  subject to W + diag(ν) ⪰ 0

  gives a lower bound for the two-way partitioning problem on page 5-7

strong duality: d* = p*

• does not hold in general
• (usually) holds for convex problems
• conditions that guarantee strong duality in convex problems are called constraint qualifications

Duality 5-10
Slater's constraint qualification

strong duality holds for a convex problem

minimize   f_0(x)
subject to f_i(x) ≤ 0,  i = 1, ..., m
           Ax = b

if it is strictly feasible, i.e.,

    ∃x ∈ int D :  f_i(x) < 0,  i = 1, ..., m,   Ax = b

• also guarantees that the dual optimum is attained (if p* > -∞)
• can be sharpened: e.g., can replace int D with relint D (interior relative to affine hull); linear inequalities do not need to hold with strict inequality, ...
• there exist many other types of constraint qualifications

Duality 5-11
Inequality form LP

primal problem

minimize   c^T x
subject to Ax ⪯ b

dual function

    g(λ) = inf_x ( (c + A^T λ)^T x - b^T λ ) = { -b^T λ   if A^T λ + c = 0
                                               { -∞       otherwise

dual problem

maximize   -b^T λ
subject to A^T λ + c = 0,  λ ⪰ 0

• from Slater's condition: p* = d* if Ax̃ ≺ b for some x̃
• in fact, p* = d* except when primal and dual are infeasible

Duality 5-12
Quadratic program

primal problem (assume P ∈ S^n_++)

minimize   x^T Px
subject to Ax ⪯ b

dual function

    g(λ) = inf_x ( x^T Px + λ^T (Ax - b) ) = -(1/4)λ^T AP^{-1}A^T λ - b^T λ

dual problem

maximize   -(1/4)λ^T AP^{-1}A^T λ - b^T λ
subject to λ ⪰ 0

• from Slater's condition: p* = d* if Ax̃ ≺ b for some x̃
• in fact, p* = d* always

Duality 5-13
A nonconvex problem with strong duality

minimize   x^T Ax + 2b^T x
subject to x^T x ≤ 1

A ⋡ 0, hence nonconvex

dual function: g(λ) = inf_x ( x^T (A + λI)x + 2b^T x - λ )

• unbounded below if A + λI ⋡ 0 or if A + λI ⪰ 0 and b ∉ R(A + λI)
• minimized by x = -(A + λI)†b otherwise: g(λ) = -b^T (A + λI)†b - λ

dual problem and equivalent SDP:

maximize   -b^T (A + λI)†b - λ        maximize   -t - λ
subject to A + λI ⪰ 0                 subject to [ A + λI  b ; b^T  t ] ⪰ 0
           b ∈ R(A + λI)

strong duality although primal problem is not convex (not easy to show)

Duality 5-14
Geometric interpretation

for simplicity, consider problem with one constraint f_1(x) ≤ 0

interpretation of dual function:

    g(λ) = inf_{(u,t)∈G} (t + λu),   where  G = {(f_1(x), f_0(x)) | x ∈ D}

[two figures: the set G with the supporting line λu + t = g(λ), and the values p* and g(λ) marked on the t-axis]

• λu + t = g(λ) is (non-vertical) supporting hyperplane to G
• hyperplane intersects t-axis at t = g(λ)

Duality 5-15
epigraph variation: same interpretation if G is replaced with

    A = {(u, t) | f_1(x) ≤ u, f_0(x) ≤ t for some x ∈ D}

[figure: set A with supporting line λu + t = g(λ), and the values p*, g(λ) on the t-axis]

strong duality

• holds if there is a non-vertical supporting hyperplane to A at (0, p*)
• for convex problem, A is convex, hence has supp. hyperplane at (0, p*)
• Slater's condition: if there exist (ũ, t̃) ∈ A with ũ < 0, then supporting hyperplanes at (0, p*) must be non-vertical

Duality 5-16
Complementary slackness

assume strong duality holds, x* is primal optimal, (λ*, ν*) is dual optimal

    f_0(x*) = g(λ*, ν*) = inf_x ( f_0(x) + Σ_{i=1}^{m} λ*_i f_i(x) + Σ_{i=1}^{p} ν*_i h_i(x) )
            ≤ f_0(x*) + Σ_{i=1}^{m} λ*_i f_i(x*) + Σ_{i=1}^{p} ν*_i h_i(x*)
            ≤ f_0(x*)

hence, the two inequalities hold with equality

• x* minimizes L(x, λ*, ν*)
• λ*_i f_i(x*) = 0 for i = 1, ..., m (known as complementary slackness):

    λ*_i > 0 ⟹ f_i(x*) = 0,    f_i(x*) < 0 ⟹ λ*_i = 0

Duality 5-17
Karush-Kuhn-Tucker (KKT) conditions

the following four conditions are called KKT conditions (for a problem with differentiable f_i, h_i):

1. primal constraints: f_i(x) ≤ 0, i = 1, ..., m, h_i(x) = 0, i = 1, ..., p
2. dual constraints: λ ⪰ 0
3. complementary slackness: λ_i f_i(x) = 0, i = 1, ..., m
4. gradient of Lagrangian with respect to x vanishes:

    ∇f_0(x) + Σ_{i=1}^{m} λ_i ∇f_i(x) + Σ_{i=1}^{p} ν_i ∇h_i(x) = 0

from page 5-17: if strong duality holds and x, λ, ν are optimal, then they must satisfy the KKT conditions

Duality 5-18
KKT conditions for convex problem

if x̃, λ̃, ν̃ satisfy KKT for a convex problem, then they are optimal:

• from complementary slackness: f_0(x̃) = L(x̃, λ̃, ν̃)
• from 4th condition (and convexity): g(λ̃, ν̃) = L(x̃, λ̃, ν̃)

hence, f_0(x̃) = g(λ̃, ν̃)

if Slater's condition is satisfied:

x is optimal if and only if there exist λ, ν that satisfy KKT conditions

• recall that Slater implies strong duality, and dual optimum is attained
• generalizes optimality condition ∇f_0(x) = 0 for unconstrained problem

Duality 5-19
example: water-filling (assume α_i > 0)

minimize   -Σ_{i=1}^{n} log(x_i + α_i)
subject to x ⪰ 0,  1^T x = 1

x is optimal iff x ⪰ 0, 1^T x = 1, and there exist λ ∈ R^n, ν ∈ R such that

    λ ⪰ 0,    λ_i x_i = 0,    1/(x_i + α_i) + λ_i = ν

• if ν < 1/α_i: λ_i = 0 and x_i = 1/ν - α_i
• if ν ≥ 1/α_i: λ_i = ν - 1/α_i and x_i = 0
• determine ν from 1^T x = Σ_{i=1}^{n} max{0, 1/ν - α_i} = 1

interpretation

• n patches; level of patch i is at height α_i
• flood area with unit amount of water
• resulting level is 1/ν*

[figure: water-filling picture with patch heights α_i, water depths x_i, and water level 1/ν*]

Duality 5-20
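The condition Σ_i max{0, 1/ν - α_i} = 1 above determines ν by a one-dimensional search, since the left side is monotone in the water level 1/ν. A small Python sketch (the α values are illustrative):

```python
def water_filling(alpha, tol=1e-12):
    """Find the water level L = 1/nu with sum_i max(0, L - alpha_i) = 1
    by bisection (the left-hand side is increasing in L)."""
    total = lambda L: sum(max(0.0, L - a) for a in alpha)
    lo, hi = min(alpha), max(alpha) + 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    L = (lo + hi) / 2
    return [max(0.0, L - a) for a in alpha], 1.0 / L

alpha = [0.3, 0.6, 1.2]
x, nu = water_filling(alpha)
# x sums to one; patches with alpha_i above the water level stay dry
print(round(sum(x), 6), round(x[2], 6))  # -> 1.0 0.0
```

Here the level works out to 0.95, so the third patch (α_3 = 1.2) receives no water, matching the KKT case ν ≥ 1/α_i.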
Perturbation and sensitivity analysis

(unperturbed) optimization problem and its dual

minimize   f_0(x)                        maximize   g(λ, ν)
subject to f_i(x) ≤ 0,  i = 1, ..., m    subject to λ ⪰ 0
           h_i(x) = 0,  i = 1, ..., p

perturbed problem and its dual

min.  f_0(x)                             max.  g(λ, ν) - u^T λ - v^T ν
s.t.  f_i(x) ≤ u_i,  i = 1, ..., m       s.t.  λ ⪰ 0
      h_i(x) = v_i,  i = 1, ..., p

• x is primal variable; u, v are parameters
• p*(u, v) is optimal value as a function of u, v
• we are interested in information about p*(u, v) that we can obtain from the solution of the unperturbed problem and its dual

Duality 5-21
global sensitivity result

assume strong duality holds for unperturbed problem, and that λ*, ν* are dual optimal for unperturbed problem

apply weak duality to perturbed problem:

    p*(u, v) ≥ g(λ*, ν*) - u^T λ* - v^T ν*
             = p*(0, 0) - u^T λ* - v^T ν*

sensitivity interpretation

• if λ*_i large: p* increases greatly if we tighten constraint i (u_i < 0)
• if λ*_i small: p* does not decrease much if we loosen constraint i (u_i > 0)
• if ν*_i large and positive: p* increases greatly if we take v_i < 0; if ν*_i large and negative: p* increases greatly if we take v_i > 0
• if ν*_i small and positive: p* does not decrease much if we take v_i > 0; if ν*_i small and negative: p* does not decrease much if we take v_i < 0

Duality 5-22
local sensitivity: if (in addition) p*(u, v) is differentiable at (0, 0), then

    λ*_i = -∂p*(0, 0)/∂u_i,    ν*_i = -∂p*(0, 0)/∂v_i

proof (for λ*_i): from global sensitivity result,

    ∂p*(0, 0)/∂u_i = lim_{t↓0} ( p*(te_i, 0) - p*(0, 0) )/t ≥ -λ*_i

    ∂p*(0, 0)/∂u_i = lim_{t↑0} ( p*(te_i, 0) - p*(0, 0) )/t ≤ -λ*_i

hence, equality

[figure: p*(u) for a problem with one (inequality) constraint; the line p*(0) - λ*u is a global lower bound, tangent at u = 0]

Duality 5-23
Duality and problem reformulations

• equivalent formulations of a problem can lead to very different duals
• reformulating the primal problem can be useful when the dual is difficult to derive, or uninteresting

common reformulations

• introduce new variables and equality constraints
• make explicit constraints implicit or vice-versa
• transform objective or constraint functions

  e.g., replace f_0(x) by φ(f_0(x)) with φ convex, increasing

Duality 5-24
Introducing new variables and equality constraints

minimize   f_0(Ax + b)

• dual function is constant: g = inf_x L(x) = inf_x f_0(Ax + b) = p*
• we have strong duality, but dual is quite useless

reformulated problem and its dual

minimize   f_0(y)                  maximize   b^T ν - f_0*(ν)
subject to Ax + b - y = 0          subject to A^T ν = 0

dual function follows from

    g(ν) = inf_{x,y} ( f_0(y) - ν^T y + ν^T Ax + b^T ν )
         = { -f_0*(ν) + b^T ν   if A^T ν = 0
           { -∞                 otherwise

Duality 5-25
norm approximation problem: minimize ‖Ax - b‖

minimize   ‖y‖
subject to y = Ax - b

can look up conjugate of ‖·‖, or derive dual directly

    g(ν) = inf_{x,y} ( ‖y‖ + ν^T y - ν^T Ax + b^T ν )
         = { b^T ν + inf_y (‖y‖ + ν^T y)   if A^T ν = 0
           { -∞                            otherwise
         = { b^T ν   if A^T ν = 0, ‖ν‖_* ≤ 1
           { -∞      otherwise

(see page 5-4)

dual of norm approximation problem

maximize   b^T ν
subject to A^T ν = 0,  ‖ν‖_* ≤ 1

Duality 5-26
Implicit constraints

LP with box constraints: primal and dual problem

minimize   c^T x          maximize   -b^T ν - 1^T λ_1 - 1^T λ_2
subject to Ax = b         subject to c + A^T ν + λ_1 - λ_2 = 0
           -1 ⪯ x ⪯ 1                λ_1 ⪰ 0,  λ_2 ⪰ 0

reformulation with box constraints made implicit

minimize   f_0(x) = { c^T x   if -1 ⪯ x ⪯ 1
                    { ∞       otherwise
subject to Ax = b

dual function

    g(ν) = inf_{-1⪯x⪯1} ( c^T x + ν^T (Ax - b) )
         = -b^T ν - ‖A^T ν + c‖_1

dual problem: maximize -b^T ν - ‖A^T ν + c‖_1

Duality 5-27
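The key step above is that a linear function minimized over the box [-1, 1]^n attains -‖·‖_1 of its coefficient vector; this is easy to confirm numerically on illustrative random data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(12)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)
nu = rng.standard_normal(m)

# over the box -1 <= x <= 1, (c + A^T nu)^T x is minimized by
# x_i = -sign(c + A^T nu)_i, giving -||A^T nu + c||_1
w = c + A.T @ nu
x_min = -np.sign(w)

g_analytic = -b @ nu - np.linalg.norm(w, 1)
g_attained = c @ x_min + nu @ (A @ x_min - b)
print(np.isclose(g_attained, g_analytic))
```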
Problems with generalized inequalities

minimize   f_0(x)
subject to f_i(x) ⪯_{K_i} 0,  i = 1, ..., m
           h_i(x) = 0,  i = 1, ..., p

⪯_{K_i} is generalized inequality on R^{k_i}

definitions are parallel to scalar case:

• Lagrange multiplier for f_i(x) ⪯_{K_i} 0 is vector λ_i ∈ R^{k_i}
• Lagrangian L : R^n × R^{k_1} × · · · × R^{k_m} × R^p → R, is defined as

    L(x, λ_1, ..., λ_m, ν) = f_0(x) + Σ_{i=1}^{m} λ_i^T f_i(x) + Σ_{i=1}^{p} ν_i h_i(x)

• dual function g : R^{k_1} × · · · × R^{k_m} × R^p → R, is defined as

    g(λ_1, ..., λ_m, ν) = inf_{x∈D} L(x, λ_1, ..., λ_m, ν)

Duality 5-28
lower bound property: if λ_i ⪰_{K_i*} 0, then g(λ_1, ..., λ_m, ν) ≤ p*

proof: if x̃ is feasible and λ_i ⪰_{K_i*} 0, then

    f_0(x̃) ≥ f_0(x̃) + Σ_{i=1}^{m} λ_i^T f_i(x̃) + Σ_{i=1}^{p} ν_i h_i(x̃)
           ≥ inf_{x∈D} L(x, λ_1, ..., λ_m, ν)
           = g(λ_1, ..., λ_m, ν)

minimizing over all feasible x̃ gives p* ≥ g(λ_1, ..., λ_m, ν)

dual problem

maximize   g(λ_1, ..., λ_m, ν)
subject to λ_i ⪰_{K_i*} 0,  i = 1, ..., m

• weak duality: p* ≥ d* always
• strong duality: p* = d* for convex problem with constraint qualification (for example, Slater's: primal problem is strictly feasible)

Duality 5-29
Semidefinite program

primal SDP (F_i, G ∈ S^k)

minimize   c^T x
subject to x_1 F_1 + · · · + x_n F_n ⪯ G

• Lagrange multiplier is matrix Z ∈ S^k
• Lagrangian L(x, Z) = c^T x + tr( Z(x_1 F_1 + · · · + x_n F_n - G) )
• dual function

    g(Z) = inf_x L(x, Z) = { -tr(GZ)   if tr(F_i Z) + c_i = 0, i = 1, ..., n
                           { -∞        otherwise

dual SDP

maximize   -tr(GZ)
subject to Z ⪰ 0,  tr(F_i Z) + c_i = 0,  i = 1, ..., n

p* = d* if primal SDP is strictly feasible (∃x with x_1 F_1 + · · · + x_n F_n ≺ G)

Duality 5-30
Convex Optimization Boyd & Vandenberghe

6. Approximation and fitting

• norm approximation
• least-norm problems
• regularized approximation
• robust approximation

6-1
Norm approximation

minimize   ‖Ax - b‖

(A ∈ R^{m×n} with m ≥ n, ‖·‖ is a norm on R^m)

interpretations of solution x* = argmin_x ‖Ax - b‖:

• geometric: Ax* is point in R(A) closest to b
• estimation: linear measurement model

    y = Ax + v

  y are measurements, x is unknown, v is measurement error; given y = b, best guess of x is x*
• optimal design: x are design variables (input), Ax is result (output); x* is design that best approximates desired result b

Approximation and fitting 6-2
examples

• least-squares approximation (‖·‖_2): solution satisfies normal equations

    A^T Ax = A^T b

  (x* = (A^T A)^{-1}A^T b if rank A = n)

• Chebyshev approximation (‖·‖_∞): can be solved as an LP

  minimize   t
  subject to -t1 ⪯ Ax - b ⪯ t1

• sum of absolute residuals approximation (‖·‖_1): can be solved as an LP

  minimize   1^T y
  subject to -y ⪯ Ax - b ⪯ y

Approximation and fitting 6-3
Penalty function approximation

minimize   φ(r_1) + · · · + φ(r_m)
subject to r = Ax - b

(A ∈ R^{m×n}, φ : R → R is a convex penalty function)

examples

• quadratic: φ(u) = u²
• deadzone-linear with width a:

    φ(u) = max{0, |u| - a}

• log-barrier with limit a:

    φ(u) = { -a² log(1 - (u/a)²)   if |u| < a
           { ∞                     otherwise

[figure: the deadzone-linear, quadratic, and log-barrier penalties φ(u) plotted on -1.5 ≤ u ≤ 1.5]

Approximation and fitting 6-4
example (m = 100, n = 30): histogram of residuals for penalties

    φ(u) = |u|,   φ(u) = u²,   φ(u) = max{0, |u| - a},   φ(u) = -log(1 - u²)

[figure: four histograms of the residual distributions, for the p = 1, p = 2, deadzone, and log-barrier penalties, with r between -2 and 2]

shape of penalty function has large effect on distribution of residuals

Approximation and fitting 6-5
Huber penalty function (with parameter M)

    φ_hub(u) = { u²              if |u| ≤ M
               { M(2|u| - M)     if |u| > M

linear growth for large u makes approximation less sensitive to outliers

[left figure: Huber penalty for M = 1]
[right figure: affine function f(t) = α + βt fitted to 42 points (t_i, y_i) (circles) using quadratic (dashed) and Huber (solid) penalty]

Approximation and fitting 6-6
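The Huber penalty above is a direct translation into code; note how it agrees with the quadratic near zero, is continuous at |u| = M, and grows only linearly beyond it:

```python
def huber(u, M=1.0):
    """Huber penalty: quadratic for |u| <= M, linear growth beyond."""
    return u * u if abs(u) <= M else M * (2 * abs(u) - M)

# quadratic near zero, continuous at |u| = M, linear for large residuals
print(huber(0.5), huber(1.0), huber(3.0))  # -> 0.25 1.0 5.0
```

Compare huber(3.0) = 5.0 with the pure quadratic 3.0² = 9.0: the reduced weight on large residuals is what makes the fit less sensitive to outliers.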
Least-norm problems

minimize   ‖x‖
subject to Ax = b

(A ∈ R^{m×n} with m ≤ n, ‖·‖ is a norm on R^n)

interpretations of solution x* = argmin_{Ax=b} ‖x‖:

• geometric: x* is point in affine set {x | Ax = b} with minimum distance to 0
• estimation: b = Ax are (perfect) measurements of x; x* is smallest ('most plausible') estimate consistent with measurements
• design: x are design variables (inputs); b are required results (outputs); x* is smallest ('most efficient') design that satisfies requirements

Approximation and fitting 6-7
examples

• least-squares solution of linear equations (‖·‖_2): can be solved via optimality conditions

    2x + A^T ν = 0,   Ax = b

• minimum sum of absolute values (‖·‖_1): can be solved as an LP

  minimize   1^T y
  subject to -y ⪯ x ⪯ y,  Ax = b

  tends to produce sparse solution x*

extension: least-penalty problem

minimize   φ(x_1) + · · · + φ(x_n)
subject to Ax = b

φ : R → R is convex penalty function

Approximation and fitting 6-8
Regularized approximation

minimize (w.r.t. R²_+)   ( ‖Ax - b‖, ‖x‖ )

A ∈ R^{m×n}, norms on R^m and R^n can be different

interpretation: find good approximation Ax ≈ b with small x

• estimation: linear measurement model y = Ax + v, with prior knowledge that ‖x‖ is small
• optimal design: small x is cheaper or more efficient, or the linear model y = Ax is only valid for small x
• robust approximation: good approximation Ax ≈ b with small x is less sensitive to errors in A than good approximation with large x

Approximation and fitting 6-9
Scalarized problem

minimize   ‖Ax - b‖ + γ‖x‖

• solution for γ > 0 traces out optimal trade-off curve
• other common method: minimize ‖Ax - b‖² + δ‖x‖² with δ > 0

Tikhonov regularization

minimize   ‖Ax - b‖_2² + δ‖x‖_2²

can be solved as a least-squares problem

minimize   ‖ [A ; √δ I] x - [b ; 0] ‖_2²

solution x* = (A^T A + δI)^{-1}A^T b

Approximation and fitting 6-10
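The stacked least-squares form and the closed-form solution above agree, which is easy to confirm on illustrative random data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(13)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
delta = 0.5
n = A.shape[1]

# closed form: x* = (A^T A + delta I)^{-1} A^T b
x_closed = np.linalg.solve(A.T @ A + delta * np.eye(n), A.T @ b)

# equivalent stacked LS: minimize ||[A; sqrt(delta) I] x - [b; 0]||_2^2
A_stack = np.vstack([A, np.sqrt(delta) * np.eye(n)])
b_stack = np.concatenate([b, np.zeros(n)])
x_stack = np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]

print(np.allclose(x_closed, x_stack))
```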
Optimal input design

linear dynamical system with impulse response h:

    y(t) = Σ_{τ=0}^{t} h(τ)u(t - τ),    t = 0, 1, ..., N

input design problem: multicriterion problem with 3 objectives

1. tracking error with desired output y_des:  J_track = Σ_{t=0}^{N} (y(t) - y_des(t))²
2. input magnitude:  J_mag = Σ_{t=0}^{N} u(t)²
3. input variation:  J_der = Σ_{t=0}^{N-1} (u(t+1) - u(t))²

track desired output using a small and slowly varying input signal

regularized least-squares formulation

minimize   J_track + δJ_der + ηJ_mag

for fixed δ, η, a least-squares problem in u(0), ..., u(N)

Approximation and fitting 6-11
example: 3 solutions on optimal trade-off surface

(top) δ = 0, small η; (middle) δ = 0, larger η; (bottom) large δ

[figure: three pairs of plots showing the input u(t) and output y(t) for t = 0, ..., 200 in each of the three cases]

Approximation and fitting 6-12
Signal reconstruction

minimize (w.r.t. R²_+)   ( ‖x̂ - x_cor‖_2, φ(x̂) )

• x ∈ R^n is unknown signal
• x_cor = x + v is (known) corrupted version of x, with additive noise v
• variable x̂ (reconstructed signal) is estimate of x
• φ : R^n → R is regularization function or smoothing objective

examples: quadratic smoothing, total variation smoothing:

    φ_quad(x̂) = Σ_{i=1}^{n-1} (x̂_{i+1} - x̂_i)²,    φ_tv(x̂) = Σ_{i=1}^{n-1} |x̂_{i+1} - x̂_i|

Approximation and fitting 6-13
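Scalarizing the quadratic-smoothing version reduces to a linear system, since both terms are quadratic in x̂. A NumPy sketch on an illustrative noisy sine wave (signal, noise level, and weight λ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(14)
n = 200
x = np.sin(np.linspace(0, 4 * np.pi, n))      # original signal
x_cor = x + 0.2 * rng.standard_normal(n)      # corrupted signal

# minimize ||xhat - x_cor||_2^2 + lam * phi_quad(xhat):
# with forward-difference matrix D, solve (I + lam D^T D) xhat = x_cor
D = np.diff(np.eye(n), axis=0)                # (n-1) x n difference matrix
lam = 10.0
xhat = np.linalg.solve(np.eye(n) + lam * D.T @ D, x_cor)

phi_quad = lambda z: np.sum(np.diff(z) ** 2)
print(phi_quad(xhat) < phi_quad(x_cor))       # estimate is smoother
```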
quadratic smoothing example

[figure: original signal x and noisy signal x_cor (left); three solutions on trade-off curve ‖x̂ - x_cor‖_2 versus φ_quad(x̂) (right)]

Approximation and fitting 6-14
total variation reconstruction example

[figure: original signal x and noisy signal x_cor (left); three solutions on trade-off curve ‖x̂ - x_cor‖_2 versus φ_quad(x̂) (right)]

quadratic smoothing smooths out noise and sharp transitions in signal

Approximation and fitting 6-15
[left: original signal x and noisy signal x_cor; right: three solutions on trade-off curve ||x̂ − x_cor||_2 versus φ_tv(x̂)]

total variation smoothing preserves sharp transitions in signal

Approximation and fitting 6–16
Robust approximation

minimize ||Ax − b|| with uncertain A

two approaches:

stochastic: assume A is random, minimize E ||Ax − b||
worst-case: set A of possible values of A, minimize sup_{A∈A} ||Ax − b||

tractable only in special cases (certain norms || · ||, distributions, sets A)

example: A(u) = A_0 + uA_1

x_nom minimizes ||A_0 x − b||_2^2
x_stoch minimizes E ||A(u)x − b||_2^2 with u uniform on [−1, 1]
x_wc minimizes sup_{−1≤u≤1} ||A(u)x − b||_2^2

[figure shows r(u) = ||A(u)x − b||_2 versus u for the three solutions]

Approximation and fitting 6–17
stochastic robust LS with A = Ā + U, U random, E U = 0, E U^T U = P

minimize  E ||(Ā + U)x − b||_2^2

explicit expression for objective:

E ||Ax − b||_2^2 = E ||Āx − b + Ux||_2^2
                = ||Āx − b||_2^2 + E x^T U^T U x
                = ||Āx − b||_2^2 + x^T P x

hence, robust LS problem is equivalent to LS problem

minimize  ||Āx − b||_2^2 + ||P^{1/2} x||_2^2

for P = δI, get Tikhonov regularized problem

minimize  ||Āx − b||_2^2 + δ||x||_2^2

Approximation and fitting 6–18
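The equivalent Tikhonov problem can be solved from its normal equations, (Ā^T Ā + δI)x = Ā^T b. A small pure-Python sketch for two variables (illustration only; a tiny 2×2 solve via Cramer's rule stands in for a real factorization):

```python
def solve2(M, r):
    # solve a 2x2 linear system M z = r by Cramer's rule
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det]

def tikhonov(A, b, delta):
    # normal equations of the regularized LS problem:
    # (A^T A + delta I) x = A^T b
    m = len(A)
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(2)]
           for i in range(2)]
    AtA[0][0] += delta
    AtA[1][1] += delta
    Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(2)]
    return solve2(AtA, Atb)

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
print(tikhonov(A, b, 0.0))   # plain LS solution [1.0, 2.0]
print(tikhonov(A, b, 10.0))  # heavy regularization shrinks x toward 0
```

With δ = 0 this recovers the ordinary least-squares solution; increasing δ trades residual for a smaller ||x||_2, exactly the robustness effect described above.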
worst-case robust LS with A = {Ā + u_1 A_1 + · · · + u_p A_p | ||u||_2 ≤ 1}

minimize  sup_{A∈A} ||Ax − b||_2^2 = sup_{||u||_2≤1} ||P(x)u + q(x)||_2^2

where P(x) = [A_1 x  A_2 x  · · ·  A_p x],  q(x) = Āx − b

from page 5–14, strong duality holds between the following problems

maximize  ||Pu + q||_2^2             minimize  t + λ
subject to ||u||_2^2 ≤ 1             subject to [ I      P    q
                                                  P^T    λI   0
                                                  q^T    0    t ] ⪰ 0

hence, robust LS problem is equivalent to SDP

minimize  t + λ
subject to [ I        P(x)   q(x)
             P(x)^T   λI     0
             q(x)^T   0      t ] ⪰ 0

Approximation and fitting 6–19
example: histogram of residuals

r(u) = ||(A_0 + u_1 A_1 + u_2 A_2)x − b||_2

with u uniformly distributed on unit disk, for three values of x

[histograms of r(u) for x_ls, x_tik, x_rls]

x_ls minimizes ||A_0 x − b||_2
x_tik minimizes ||A_0 x − b||_2^2 + ||x||_2^2 (Tikhonov solution)
x_rls minimizes sup_{||u||_2≤1} ||(A_0 + u_1 A_1 + u_2 A_2)x − b||_2^2

Approximation and fitting 6–20
Convex Optimization Boyd & Vandenberghe

7. Statistical estimation

maximum likelihood estimation
optimal detector design
experiment design

7–1
Parametric distribution estimation

distribution estimation problem: estimate probability density p(y) of a random variable from observed values

parametric distribution estimation: choose from a family of densities p_x(y), indexed by a parameter x

maximum likelihood estimation

maximize (over x)  log p_x(y)

y is observed value
l(x) = log p_x(y) is called log-likelihood function
can add constraints x ∈ C explicitly, or define p_x(y) = 0 for x ∉ C
a convex optimization problem if log p_x(y) is concave in x for fixed y

Statistical estimation 7–2
Linear measurements with IID noise

linear measurement model

y_i = a_i^T x + v_i,  i = 1, . . . , m

x ∈ R^n is vector of unknown parameters
v_i is IID measurement noise, with density p(z)
y_i is measurement: y ∈ R^m has density p_x(y) = Π_{i=1}^m p(y_i − a_i^T x)

maximum likelihood estimate: any solution x of

maximize  l(x) = Σ_{i=1}^m log p(y_i − a_i^T x)

(y is observed value)

Statistical estimation 7–3
examples

Gaussian noise N(0, σ^2): p(z) = (2πσ^2)^{−1/2} e^{−z^2/(2σ^2)},

l(x) = −(m/2) log(2πσ^2) − (1/(2σ^2)) Σ_{i=1}^m (a_i^T x − y_i)^2

ML estimate is LS solution

Laplacian noise: p(z) = (1/(2a)) e^{−|z|/a},

l(x) = −m log(2a) − (1/a) Σ_{i=1}^m |a_i^T x − y_i|

ML estimate is ℓ_1-norm solution

uniform noise on [−a, a]:

l(x) = { −m log(2a)   |a_i^T x − y_i| ≤ a, i = 1, . . . , m
       { −∞           otherwise

ML estimate is any x with |a_i^T x − y_i| ≤ a

Statistical estimation 7–4
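The Gaussian and Laplacian log-likelihoods above can be evaluated directly, which makes the LS-versus-ℓ1 contrast concrete. A minimal Python sketch (illustration only; one unknown, three measurements, one of them an outlier):

```python
import math

def loglik_gauss(x, A, y, sigma):
    # l(x) = -(m/2) log(2 pi sigma^2) - (1/(2 sigma^2)) sum (a_i^T x - y_i)^2
    m = len(y)
    rss = sum((sum(aij * xj for aij, xj in zip(ai, x)) - yi) ** 2
              for ai, yi in zip(A, y))
    return -(m / 2) * math.log(2 * math.pi * sigma ** 2) - rss / (2 * sigma ** 2)

def loglik_laplace(x, A, y, a):
    # l(x) = -m log(2a) - (1/a) sum |a_i^T x - y_i|
    m = len(y)
    sad = sum(abs(sum(aij * xj for aij, xj in zip(ai, x)) - yi)
              for ai, yi in zip(A, y))
    return -m * math.log(2 * a) - sad / a

# ML under Gaussian noise is the mean (LS); under Laplacian noise, the median (l1)
A, y = [[1.0], [1.0], [1.0]], [0.0, 0.0, 10.0]
mean, median = [10.0 / 3], [0.0]
print(loglik_gauss(mean, A, y, 1.0), loglik_gauss(median, A, y, 1.0))
print(loglik_laplace(median, A, y, 1.0), loglik_laplace(mean, A, y, 1.0))
```

The Gaussian likelihood prefers the mean (the outlier's squared residual dominates), while the Laplacian likelihood prefers the median, reflecting the robustness of the ℓ_1 estimate.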
Logistic regression

random variable y ∈ {0, 1} with distribution

p = prob(y = 1) = exp(a^T u + b) / (1 + exp(a^T u + b))

a, b are parameters; u ∈ R^n are (observable) explanatory variables

estimation problem: estimate a, b from m observations (u_i, y_i)

log-likelihood function (for y_1 = · · · = y_k = 1, y_{k+1} = · · · = y_m = 0):

l(a, b) = log ( Π_{i=1}^k exp(a^T u_i + b)/(1 + exp(a^T u_i + b)) · Π_{i=k+1}^m 1/(1 + exp(a^T u_i + b)) )

        = Σ_{i=1}^k (a^T u_i + b) − Σ_{i=1}^m log(1 + exp(a^T u_i + b))

concave in a, b

Statistical estimation 7–5
example (n = 1, m = 50 measurements)

[plot of prob(y = 1) versus u]

circles show 50 points (u_i, y_i)
solid curve is ML estimate of p = exp(au + b)/(1 + exp(au + b))

Statistical estimation 7–6
(Binary) hypothesis testing

detection (hypothesis testing) problem

given observation of a random variable X ∈ {1, . . . , n}, choose between:

hypothesis 1: X was generated by distribution p = (p_1, . . . , p_n)
hypothesis 2: X was generated by distribution q = (q_1, . . . , q_n)

randomized detector

a nonnegative matrix T ∈ R^{2×n}, with 1^T T = 1^T
if we observe X = k, we choose hypothesis 1 with probability t_{1k}, hypothesis 2 with probability t_{2k}
if all elements of T are 0 or 1, it is called a deterministic detector

Statistical estimation 7–7
detection probability matrix:

D = [Tp  Tq] = [ 1 − P_fp   P_fn
                 P_fp       1 − P_fn ]

P_fp is probability of selecting hypothesis 2 if X is generated by distribution 1 (false positive)
P_fn is probability of selecting hypothesis 1 if X is generated by distribution 2 (false negative)

multicriterion formulation of detector design

minimize (w.r.t. R^2_+)  (P_fp, P_fn) = ((Tp)_2, (Tq)_1)
subject to  t_{1k} + t_{2k} = 1,  k = 1, . . . , n
            t_{ik} ≥ 0,  i = 1, 2,  k = 1, . . . , n

variable T ∈ R^{2×n}

Statistical estimation 7–8
scalarization (with weight λ > 0)

minimize  (Tp)_2 + λ(Tq)_1
subject to  t_{1k} + t_{2k} = 1,  t_{ik} ≥ 0,  i = 1, 2,  k = 1, . . . , n

an LP with a simple analytical solution

(t_{1k}, t_{2k}) = { (1, 0)  p_k ≥ λq_k
                   { (0, 1)  p_k < λq_k

a deterministic detector, given by a likelihood ratio test
if p_k = λq_k for some k, any value 0 ≤ t_{1k} ≤ 1, t_{2k} = 1 − t_{1k} is optimal (i.e., Pareto-optimal detectors include non-deterministic detectors)

minimax detector

minimize  max{P_fp, P_fn} = max{(Tp)_2, (Tq)_1}
subject to  t_{1k} + t_{2k} = 1,  t_{ik} ≥ 0,  i = 1, 2,  k = 1, . . . , n

an LP; solution is usually not deterministic

Statistical estimation 7–9
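The analytical solution of the scalarized LP is just a likelihood ratio test, so it can be coded in a few lines. A sketch (illustration only) using the two distributions from the example on the next slide as columns p and q:

```python
def lr_detector(p, q, lam):
    # scalarized optimal detector: choose hypothesis 1 iff p_k >= lam * q_k
    t1 = [1.0 if pk >= lam * qk else 0.0 for pk, qk in zip(p, q)]
    t2 = [1.0 - v for v in t1]
    # P_fp = (Tp)_2: probability of choosing hypothesis 2 under p
    # P_fn = (Tq)_1: probability of choosing hypothesis 1 under q
    P_fp = sum(t * pk for t, pk in zip(t2, p))
    P_fn = sum(t * qk for t, qk in zip(t1, q))
    return P_fp, P_fn

p = [0.70, 0.20, 0.05, 0.05]
q = [0.10, 0.10, 0.70, 0.10]
print(lr_detector(p, q, 1.0))  # (0.1, 0.2)
```

Sweeping lam over (0, ∞) traces out the deterministic points on the optimal trade-off curve between P_fp and P_fn.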
example

P = [ 0.70  0.10
      0.20  0.10
      0.05  0.70
      0.05  0.10 ]

[trade-off curve of P_fn versus P_fp, with solutions 1, 2, 3, 4 marked]

solutions 1, 2, 3 (and endpoints) are deterministic; 4 is minimax detector

Statistical estimation 7–10
Experiment design

m linear measurements y_i = a_i^T x + w_i, i = 1, . . . , m of unknown x ∈ R^n

measurement errors w_i are IID N(0, 1)
ML (least-squares) estimate is

x̂ = ( Σ_{i=1}^m a_i a_i^T )^{−1} Σ_{i=1}^m y_i a_i

error e = x̂ − x has zero mean and covariance

E = E ee^T = ( Σ_{i=1}^m a_i a_i^T )^{−1}

confidence ellipsoids are given by {x | (x − x̂)^T E^{−1}(x − x̂) ≤ β}

experiment design: choose a_i ∈ {v_1, . . . , v_p} (a set of possible test vectors) to make E small

Statistical estimation 7–11
vector optimization formulation

minimize (w.r.t. S^n_+)  E = ( Σ_{k=1}^p m_k v_k v_k^T )^{−1}
subject to  m_k ≥ 0,  m_1 + · · · + m_p = m
            m_k ∈ Z

variables are m_k (# vectors a_i equal to v_k)
difficult in general, due to integer constraint

relaxed experiment design

assume m ≫ p, use λ_k = m_k/m as (continuous) real variable

minimize (w.r.t. S^n_+)  E = (1/m) ( Σ_{k=1}^p λ_k v_k v_k^T )^{−1}
subject to  λ ⪰ 0,  1^T λ = 1

common scalarizations: minimize log det E, tr E, λ_max(E), . . .
can add other convex constraints, e.g., bound experiment cost c^T λ ≤ B

Statistical estimation 7–12
D-optimal design

minimize  log det ( Σ_{k=1}^p λ_k v_k v_k^T )^{−1}
subject to  λ ⪰ 0,  1^T λ = 1

interpretation: minimizes volume of confidence ellipsoids

dual problem

maximize  log det W + n log n
subject to  v_k^T W v_k ≤ 1,  k = 1, . . . , p

interpretation: {x | x^T W x ≤ 1} is minimum volume ellipsoid centered at origin, that includes all test vectors v_k

complementary slackness: for λ, W primal and dual optimal

λ_k (1 − v_k^T W v_k) = 0,  k = 1, . . . , p

optimal experiment uses vectors v_k on boundary of ellipsoid defined by W

Statistical estimation 7–13
example (p = 20)

[figure: optimal weights λ_1 = 0.5, λ_2 = 0.5; all other λ_k = 0]

design uses two vectors, on boundary of ellipse defined by optimal W

Statistical estimation 7–14
derivation of dual of page 7–13

first reformulate primal problem with new variable X:

minimize  − log det X
subject to  X = Σ_{k=1}^p λ_k v_k v_k^T,  λ ⪰ 0,  1^T λ = 1

L(X, λ, Z, z, ν) = − log det X + tr( Z ( X − Σ_{k=1}^p λ_k v_k v_k^T ) ) − z^T λ + ν(1^T λ − 1)

minimize over X by setting gradient to zero: −X^{−1} + Z = 0
minimum over λ_k is −∞ unless −v_k^T Z v_k − z_k + ν = 0

dual problem

maximize  n + log det Z − ν
subject to  v_k^T Z v_k ≤ ν,  k = 1, . . . , p

change variable W = Z/ν, and optimize over ν to get dual of page 7–13

Statistical estimation 7–15
Convex Optimization Boyd & Vandenberghe

8. Geometric problems

extremal volume ellipsoids
centering
classification
placement and facility location

8–1
Minimum volume ellipsoid around a set

Löwner-John ellipsoid of a set C: minimum volume ellipsoid E s.t. C ⊆ E

parametrize E as E = {v | ||Av + b||_2 ≤ 1}; w.l.o.g. assume A ∈ S^n_++
vol E is proportional to det A^{−1}; to compute minimum volume ellipsoid,

minimize (over A, b)  log det A^{−1}
subject to  sup_{v∈C} ||Av + b||_2 ≤ 1

convex, but evaluating the constraint can be hard (for general C)

finite set C = {x_1, . . . , x_m}:

minimize (over A, b)  log det A^{−1}
subject to  ||Ax_i + b||_2 ≤ 1,  i = 1, . . . , m

also gives Löwner-John ellipsoid for polyhedron conv{x_1, . . . , x_m}

Geometric problems 8–2
Maximum volume inscribed ellipsoid

maximum volume ellipsoid E inside a convex set C ⊆ R^n

parametrize E as E = {Bu + d | ||u||_2 ≤ 1}; w.l.o.g. assume B ∈ S^n_++
vol E is proportional to det B; can compute E by solving

maximize  log det B
subject to  sup_{||u||_2≤1} I_C(Bu + d) ≤ 0

(where I_C(x) = 0 for x ∈ C and I_C(x) = ∞ for x ∉ C)

convex, but evaluating the constraint can be hard (for general C)

polyhedron {x | a_i^T x ≤ b_i, i = 1, . . . , m}:

maximize  log det B
subject to  ||Ba_i||_2 + a_i^T d ≤ b_i,  i = 1, . . . , m

(constraint follows from sup_{||u||_2≤1} a_i^T (Bu + d) = ||Ba_i||_2 + a_i^T d)

Geometric problems 8–3
Efficiency of ellipsoidal approximations

C ⊆ R^n convex, bounded, with nonempty interior

Löwner-John ellipsoid, shrunk by a factor n, lies inside C
maximum volume inscribed ellipsoid, expanded by a factor n, covers C

example (for two polyhedra in R^2)

[figure]

factor n can be improved to √n if C is symmetric

Geometric problems 8–4
Centering

some possible definitions of center of a convex set C:

center of largest inscribed ball (Chebyshev center)
for polyhedron, can be computed via linear programming (page 4–19)

center of maximum volume inscribed ellipsoid (page 8–3)

[figures: x_cheb and x_mve]

MVE center is invariant under affine coordinate transformations

Geometric problems 8–5
Analytic center of a set of inequalities

the analytic center of set of convex inequalities and linear equations

f_i(x) ≤ 0,  i = 1, . . . , m,  Fx = g

is defined as the optimal point of

minimize  −Σ_{i=1}^m log(−f_i(x))
subject to  Fx = g

more easily computed than MVE or Chebyshev center (see later)
not just a property of the feasible set: two sets of inequalities can describe the same set, but have different analytic centers

Geometric problems 8–6
analytic center of linear inequalities a_i^T x ≤ b_i, i = 1, . . . , m

x_ac is minimizer of

φ(x) = −Σ_{i=1}^m log(b_i − a_i^T x)

[figure: x_ac]

inner and outer ellipsoids from analytic center:

E_inner ⊆ {x | a_i^T x ≤ b_i, i = 1, . . . , m} ⊆ E_outer

where

E_inner = {x | (x − x_ac)^T ∇^2 φ(x_ac)(x − x_ac) ≤ 1}
E_outer = {x | (x − x_ac)^T ∇^2 φ(x_ac)(x − x_ac) ≤ m(m − 1)}

Geometric problems 8–7
Linear discrimination

separate two sets of points {x_1, . . . , x_N}, {y_1, . . . , y_M} by a hyperplane:

a^T x_i + b > 0,  i = 1, . . . , N,    a^T y_i + b < 0,  i = 1, . . . , M

homogeneous in a, b, hence equivalent to

a^T x_i + b ≥ 1,  i = 1, . . . , N,    a^T y_i + b ≤ −1,  i = 1, . . . , M

a set of linear inequalities in a, b

Geometric problems 8–8
Robust linear discrimination

(Euclidean) distance between hyperplanes

H_1 = {z | a^T z + b = 1}
H_2 = {z | a^T z + b = −1}

is dist(H_1, H_2) = 2/||a||_2

to separate two sets of points by maximum margin,

minimize  (1/2)||a||_2
subject to  a^T x_i + b ≥ 1,  i = 1, . . . , N        (1)
            a^T y_i + b ≤ −1,  i = 1, . . . , M

(after squaring objective) a QP in a, b

Geometric problems 8–9
Lagrange dual of maximum margin separation problem (1)

maximize  1^T λ + 1^T μ
subject to  2 || Σ_{i=1}^N λ_i x_i − Σ_{i=1}^M μ_i y_i ||_2 ≤ 1        (2)
            1^T λ = 1^T μ,  λ ⪰ 0,  μ ⪰ 0

from duality, optimal value is inverse of maximum margin of separation

interpretation

change variables to θ_i = λ_i/1^T λ, γ_i = μ_i/1^T μ, t = 1/(1^T λ + 1^T μ)
invert objective to minimize 1/(1^T λ + 1^T μ) = t

minimize  t
subject to  || Σ_{i=1}^N θ_i x_i − Σ_{i=1}^M γ_i y_i ||_2 ≤ t
            θ ⪰ 0,  1^T θ = 1,  γ ⪰ 0,  1^T γ = 1

optimal value is distance between convex hulls

Geometric problems 8–10
Approximate linear separation of non-separable sets

minimize  1^T u + 1^T v
subject to  a^T x_i + b ≥ 1 − u_i,  i = 1, . . . , N
            a^T y_i + b ≤ −1 + v_i,  i = 1, . . . , M
            u ⪰ 0,  v ⪰ 0

an LP in a, b, u, v
at optimum, u_i = max{0, 1 − a^T x_i − b}, v_i = max{0, 1 + a^T y_i + b}
can be interpreted as a heuristic for minimizing #misclassified points

Geometric problems 8–11
Support vector classifier

minimize  ||a||_2 + γ(1^T u + 1^T v)
subject to  a^T x_i + b ≥ 1 − u_i,  i = 1, . . . , N
            a^T y_i + b ≤ −1 + v_i,  i = 1, . . . , M
            u ⪰ 0,  v ⪰ 0

produces point on trade-off curve between inverse of margin 2/||a||_2 and classification error, measured by total slack 1^T u + 1^T v

same example as previous page, with γ = 0.1:

[figure]

Geometric problems 8–12
Nonlinear discrimination

separate two sets of points by a nonlinear function:

f(x_i) > 0,  i = 1, . . . , N,    f(y_i) < 0,  i = 1, . . . , M

choose a linearly parametrized family of functions

f(z) = θ^T F(z)

F = (F_1, . . . , F_k) : R^n → R^k are basis functions

solve a set of linear inequalities in θ:

θ^T F(x_i) ≥ 1,  i = 1, . . . , N,    θ^T F(y_i) ≤ −1,  i = 1, . . . , M

Geometric problems 8–13
quadratic discrimination: f(z) = z^T Pz + q^T z + r

x_i^T P x_i + q^T x_i + r ≥ 1,    y_i^T P y_i + q^T y_i + r ≤ −1

can add additional constraints (e.g., P ⪯ −I to separate by an ellipsoid)

polynomial discrimination: F(z) are all monomials up to a given degree

[figures: separation by ellipsoid; separation by 4th degree polynomial]

Geometric problems 8–14
Placement and facility location

N points with coordinates x_i ∈ R^2 (or R^3)
some positions x_i are given; the other x_i's are variables
for each pair of points, a cost function f_ij(x_i, x_j)

placement problem

minimize  Σ_{i≠j} f_ij(x_i, x_j)

variables are positions of free points

interpretations

points represent plants or warehouses; f_ij is transportation cost between facilities i and j
points represent cells on an IC; f_ij represents wirelength

Geometric problems 8–15
example: minimize Σ_{(i,j)∈A} h(||x_i − x_j||_2), with 6 free points, 27 links

optimal placement for h(z) = z, h(z) = z^2, h(z) = z^4

[placement plots for the three penalty functions]

histograms of connection lengths ||x_i − x_j||_2

[histograms for the three penalty functions]

Geometric problems 8–16
Convex Optimization Boyd & Vandenberghe

9. Numerical linear algebra background

matrix structure and algorithm complexity
solving linear equations with factored matrices
LU, Cholesky, LDL^T factorization
block elimination and the matrix inversion lemma
solving underdetermined equations

9–1
Matrix structure and algorithm complexity

cost (execution time) of solving Ax = b with A ∈ R^{n×n}

for general methods, grows as n^3
less if A is structured (banded, sparse, Toeplitz, . . . )

flop counts

flop (floating-point operation): one addition, subtraction, multiplication, or division of two floating-point numbers
to estimate complexity of an algorithm: express number of flops as a (polynomial) function of the problem dimensions, and simplify by keeping only the leading terms
not an accurate predictor of computation time on modern computers
useful as a rough estimate of complexity

Numerical linear algebra background 9–2
vector-vector operations (x, y ∈ R^n)

inner product x^T y: 2n − 1 flops (or 2n if n is large)
sum x + y, scalar multiplication αx: n flops

matrix-vector product y = Ax with A ∈ R^{m×n}

m(2n − 1) flops (or 2mn if n large)
2N if A is sparse with N nonzero elements
2p(n + m) if A is given as A = UV^T, U ∈ R^{m×p}, V ∈ R^{n×p}

matrix-matrix product C = AB with A ∈ R^{m×n}, B ∈ R^{n×p}

mp(2n − 1) flops (or 2mnp if n large)
less if A and/or B are sparse
(1/2)m(m + 1)(2n − 1) ≈ m^2 n if m = p and C symmetric

Numerical linear algebra background 9–3
Linear equations that are easy to solve

diagonal matrices (a_ij = 0 if i ≠ j): n flops

x = A^{−1}b = (b_1/a_11, . . . , b_n/a_nn)

lower triangular (a_ij = 0 if j > i): n^2 flops

x_1 := b_1/a_11
x_2 := (b_2 − a_21 x_1)/a_22
x_3 := (b_3 − a_31 x_1 − a_32 x_2)/a_33
    ⋮
x_n := (b_n − a_n1 x_1 − a_n2 x_2 − · · · − a_{n,n−1} x_{n−1})/a_nn

called forward substitution

upper triangular (a_ij = 0 if j < i): n^2 flops via backward substitution

Numerical linear algebra background 9–4
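The forward substitution recursion above (and its bottom-up counterpart for upper triangular systems) is short enough to write out directly. A minimal Python sketch, with matrices as lists of rows (illustration only; a real implementation would use a linear algebra library):

```python
def forward_subst(L, b):
    # solve L x = b with L lower triangular (n^2 flops)
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
    return x

def backward_subst(U, b):
    # solve U x = b with U upper triangular: same recursion, run bottom-up
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

L = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
print(forward_subst(L, [2.0, 4.0, 15.0]))  # [1.0, 1.0, 1.0]
```

These two routines are the "solve" half of every factor-solve method on the following slides (LU, Cholesky, LDL^T).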
orthogonal matrices: A^{−1} = A^T

2n^2 flops to compute x = A^T b for general A
less with structure, e.g., if A = I − 2uu^T with ||u||_2 = 1, we can compute x = A^T b = b − 2(u^T b)u in 4n flops

permutation matrices:

a_ij = { 1  j = π_i
       { 0  otherwise

where π = (π_1, π_2, . . . , π_n) is a permutation of (1, 2, . . . , n)

interpretation: Ax = (x_{π_1}, . . . , x_{π_n})
satisfies A^{−1} = A^T, hence cost of solving Ax = b is 0 flops

example:

A = [ 0 1 0        A^{−1} = A^T = [ 0 0 1
      0 0 1                         1 0 0
      1 0 0 ],                      0 1 0 ]

Numerical linear algebra background 9–5
The factor-solve method for solving Ax = b

factor A as a product of simple matrices (usually 2 or 3):

A = A_1 A_2 · · · A_k

(A_i diagonal, upper or lower triangular, etc)

compute x = A^{−1}b = A_k^{−1} · · · A_2^{−1} A_1^{−1} b by solving k easy equations

A_1 x_1 = b,  A_2 x_2 = x_1,  . . . ,  A_k x = x_{k−1}

cost of factorization step usually dominates cost of solve step

equations with multiple righthand sides

Ax_1 = b_1,  Ax_2 = b_2,  . . . ,  Ax_m = b_m

cost: one factorization plus m solves

Numerical linear algebra background 9–6
LU factorization

every nonsingular matrix A can be factored as

A = PLU

with P a permutation matrix, L lower triangular, U upper triangular

cost: (2/3)n^3 flops

Solving linear equations by LU factorization.

given a set of linear equations Ax = b, with A nonsingular.
1. LU factorization. Factor A as A = PLU ((2/3)n^3 flops).
2. Permutation. Solve Pz_1 = b (0 flops).
3. Forward substitution. Solve Lz_2 = z_1 (n^2 flops).
4. Backward substitution. Solve Ux = z_2 (n^2 flops).

cost: (2/3)n^3 + 2n^2 ≈ (2/3)n^3 for large n

Numerical linear algebra background 9–7
sparse LU factorization

A = P_1 L U P_2

adding permutation matrix P_2 offers possibility of sparser L, U (hence, cheaper factor and solve steps)
P_1 and P_2 chosen (heuristically) to yield sparse L, U
choice of P_1 and P_2 depends on sparsity pattern and values of A
cost is usually much less than (2/3)n^3; exact value depends in a complicated way on n, number of zeros in A, sparsity pattern

Numerical linear algebra background 9–8
Cholesky factorization

every positive definite A can be factored as

A = LL^T

with L lower triangular

cost: (1/3)n^3 flops

Solving linear equations by Cholesky factorization.

given a set of linear equations Ax = b, with A ∈ S^n_++.
1. Cholesky factorization. Factor A as A = LL^T ((1/3)n^3 flops).
2. Forward substitution. Solve Lz_1 = b (n^2 flops).
3. Backward substitution. Solve L^T x = z_1 (n^2 flops).

cost: (1/3)n^3 + 2n^2 ≈ (1/3)n^3 for large n

Numerical linear algebra background 9–9
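The Cholesky factor can be computed column by column from the identity A = LL^T. A minimal pure-Python sketch (illustration only, no pivoting or error handling; assumes A is symmetric positive definite):

```python
import math

def cholesky(A):
    # factor symmetric positive definite A as A = L L^T, L lower triangular
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
print(L)  # [[2.0, 0.0], [1.0, 1.414...]]
```

Combined with forward and backward substitution this gives the factor-solve method of the algorithm box above; the factorization costs (1/3)n^3 flops and dominates the two n^2-flop solves.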
sparse Cholesky factorization

A = PLL^T P^T

adding permutation matrix P offers possibility of sparser L
P chosen (heuristically) to yield sparse L
choice of P only depends on sparsity pattern of A (unlike sparse LU)
cost is usually much less than (1/3)n^3; exact value depends in a complicated way on n, number of zeros in A, sparsity pattern

Numerical linear algebra background 9–10
LDL^T factorization

every nonsingular symmetric matrix A can be factored as

A = PLDL^T P^T

with P a permutation matrix, L lower triangular, D block diagonal with 1 × 1 or 2 × 2 diagonal blocks

cost: (1/3)n^3

cost of solving symmetric sets of linear equations by LDL^T factorization: (1/3)n^3 + 2n^2 ≈ (1/3)n^3 for large n

for sparse A, can choose P to yield sparse L; cost ≪ (1/3)n^3

Numerical linear algebra background 9–11
Equations with structured sub-blocks

[ A_11  A_12 ] [ x_1 ]   [ b_1 ]
[ A_21  A_22 ] [ x_2 ] = [ b_2 ]        (1)

variables x_1 ∈ R^{n_1}, x_2 ∈ R^{n_2}; blocks A_ij ∈ R^{n_i×n_j}

if A_11 is nonsingular, can eliminate x_1: x_1 = A_11^{−1}(b_1 − A_12 x_2); to compute x_2, solve

(A_22 − A_21 A_11^{−1} A_12) x_2 = b_2 − A_21 A_11^{−1} b_1

Solving linear equations by block elimination.

given a nonsingular set of linear equations (1), with A_11 nonsingular.
1. Form A_11^{−1}A_12 and A_11^{−1}b_1.
2. Form S = A_22 − A_21 A_11^{−1}A_12 and b̃ = b_2 − A_21 A_11^{−1}b_1.
3. Determine x_2 by solving Sx_2 = b̃.
4. Determine x_1 by solving A_11 x_1 = b_1 − A_12 x_2.

Numerical linear algebra background 9–12
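The four steps of the block elimination algorithm become very cheap when A_11 is structured. A Python sketch (illustration only) for the easiest case: A_11 diagonal (stored as a vector) and a single second-block variable, so the Schur complement S is a scalar:

```python
def solve_block(a11, A12, A21, A22, b1, b2):
    # A11 = diag(a11); A12 is an n1-column, A21 an n1-row; A22, b2 scalars
    n1 = len(a11)
    inv_b1 = [b1[i] / a11[i] for i in range(n1)]    # step 1: A11^{-1} b1
    inv_A12 = [A12[i] / a11[i] for i in range(n1)]  # step 1: A11^{-1} A12
    # step 2: Schur complement S and modified righthand side
    S = A22 - sum(A21[i] * inv_A12[i] for i in range(n1))
    b2t = b2 - sum(A21[i] * inv_b1[i] for i in range(n1))
    x2 = b2t / S                                    # step 3
    x1 = [inv_b1[i] - inv_A12[i] * x2 for i in range(n1)]  # step 4
    return x1, x2

# system with solution x1 = [1, 1], x2 = 1
x1, x2 = solve_block([2.0, 4.0], [1.0, 1.0], [1.0, 1.0], 3.0, [3.0, 5.0], 5.0)
print(x1, x2)  # [1.0, 1.0] 1.0
```

Here the "factorization" of A_11 is free (f = 0, s = n_1 in the flop count below), which is exactly the situation where block elimination beats the general method.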
dominant terms in flop count

step 1: f + n_2 s (f is cost of factoring A_11; s is cost of solve step)
step 2: 2n_2^2 n_1 (cost dominated by product of A_21 and A_11^{−1}A_12)
step 3: (2/3)n_2^3

total: f + n_2 s + 2n_2^2 n_1 + (2/3)n_2^3

examples

general A_11 (f = (2/3)n_1^3, s = 2n_1^2): no gain over standard method

#flops = (2/3)n_1^3 + 2n_1^2 n_2 + 2n_2^2 n_1 + (2/3)n_2^3 = (2/3)(n_1 + n_2)^3

block elimination is useful for structured A_11 (f ≪ n_1^3)

for example, diagonal (f = 0, s = n_1): #flops ≈ 2n_2^2 n_1 + (2/3)n_2^3

Numerical linear algebra background 9–13
Structured matrix plus low rank term

(A + BC)x = b

A ∈ R^{n×n}, B ∈ R^{n×p}, C ∈ R^{p×n}
assume A has structure (Ax = b easy to solve)

first write as

[ A   B ] [ x ]   [ b ]
[ C  −I ] [ y ] = [ 0 ]

now apply block elimination: solve

(I + CA^{−1}B)y = CA^{−1}b,

then solve Ax = b − By

this proves the matrix inversion lemma: if A and A + BC nonsingular,

(A + BC)^{−1} = A^{−1} − A^{−1}B(I + CA^{−1}B)^{−1}CA^{−1}

Numerical linear algebra background 9–14
example: A diagonal, B, C dense

method 1: form D = A + BC, then solve Dx = b

cost: (2/3)n^3 + 2pn^2

method 2 (via matrix inversion lemma): solve

(I + CA^{−1}B)y = CA^{−1}b,        (2)

then compute x = A^{−1}b − A^{−1}By

total cost is dominated by (2): 2p^2 n + (2/3)p^3 (i.e., linear in n)

Numerical linear algebra background 9–15
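Method 2 is easy to write out for the simplest case, p = 1 (rank-one update of a diagonal matrix). A Python sketch (illustration only; B is a column, C a row, both stored as lists):

```python
def lowrank_solve(a_diag, Bcol, Crow, b):
    # solve (A + B C) x = b with A = diag(a_diag), B n-by-1, C 1-by-n,
    # via the matrix inversion lemma:
    #   (I + C A^{-1} B) y = C A^{-1} b,  then  x = A^{-1} (b - B y)
    Ainv_b = [bi / ai for bi, ai in zip(b, a_diag)]     # A^{-1} b
    Ainv_B = [bi / ai for bi, ai in zip(Bcol, a_diag)]  # A^{-1} B
    y = sum(c * v for c, v in zip(Crow, Ainv_b)) \
        / (1 + sum(c * v for c, v in zip(Crow, Ainv_B)))
    return [Ainv_b[i] - Ainv_B[i] * y for i in range(len(b))]

# A + BC = diag(2, 2) + [1,1]^T [1,1] = [[3, 1], [1, 3]]; solution of
# (A + BC) x = (4, 4) is x = (1, 1)
print(lowrank_solve([2.0, 2.0], [1.0, 1.0], [1.0, 1.0], [4.0, 4.0]))
```

Every step costs O(n) here, matching the slide's point that the total cost is linear in n rather than cubic.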
Underdetermined linear equations

if A ∈ R^{p×n} with p < n, rank A = p,

{x | Ax = b} = {Fz + x̂ | z ∈ R^{n−p}}

x̂ is (any) particular solution
columns of F ∈ R^{n×(n−p)} span nullspace of A
there exist several numerical methods for computing F (QR factorization, rectangular LU factorization, . . . )

Numerical linear algebra background 9–16
Convex Optimization Boyd & Vandenberghe

10. Unconstrained minimization

terminology and assumptions
gradient descent method
steepest descent method
Newton's method
self-concordant functions
implementation

10–1
Unconstrained minimization

minimize  f(x)

f convex, twice continuously differentiable (hence dom f open)
we assume optimal value p* = inf_x f(x) is attained (and finite)

unconstrained minimization methods

produce sequence of points x^(k) ∈ dom f, k = 0, 1, . . . with

f(x^(k)) → p*

can be interpreted as iterative methods for solving optimality condition

∇f(x*) = 0

Unconstrained minimization 10–2
Initial point and sublevel set

algorithms in this chapter require a starting point x^(0) such that

x^(0) ∈ dom f
sublevel set S = {x | f(x) ≤ f(x^(0))} is closed

2nd condition is hard to verify, except when all sublevel sets are closed:

equivalent to condition that epi f is closed
true if dom f = R^n
true if f(x) → ∞ as x → bd dom f

examples of differentiable functions with closed sublevel sets:

f(x) = log( Σ_{i=1}^m exp(a_i^T x + b_i) ),    f(x) = −Σ_{i=1}^m log(b_i − a_i^T x)

Unconstrained minimization 10–3
Strong convexity and implications

f is strongly convex on S if there exists an m > 0 such that

∇^2 f(x) ⪰ mI for all x ∈ S

implications

for x, y ∈ S,

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)||x − y||_2^2

hence, S is bounded

p* > −∞, and for x ∈ S,

f(x) − p* ≤ (1/(2m))||∇f(x)||_2^2

useful as stopping criterion (if you know m)

Unconstrained minimization 10–4
Descent methods

x^(k+1) = x^(k) + t^(k) ∆x^(k)   with f(x^(k+1)) < f(x^(k))

other notations: x^+ = x + t∆x, x := x + t∆x
∆x is the step, or search direction; t is the step size, or step length

from convexity, f(x^+) < f(x) implies ∇f(x)^T ∆x < 0 (i.e., ∆x is a descent direction)

General descent method.

given a starting point x ∈ dom f.
repeat
1. Determine a descent direction ∆x.
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x.
until stopping criterion is satisfied.

Unconstrained minimization 10–5
Line search types

exact line search: t = argmin_{t>0} f(x + t∆x)

backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1))

starting at t = 1, repeat t := βt until

f(x + t∆x) < f(x) + αt∇f(x)^T ∆x

graphical interpretation: backtrack until t ≤ t_0

[figure: f(x + t∆x) versus t, with the lines f(x) + t∇f(x)^T ∆x and f(x) + αt∇f(x)^T ∆x]

Unconstrained minimization 10–6
Gradient descent method

general descent method with ∆x = −∇f(x)

given a starting point x ∈ dom f.
repeat
1. ∆x := −∇f(x).
2. Line search. Choose step size t via exact or backtracking line search.
3. Update. x := x + t∆x.
until stopping criterion is satisfied.

stopping criterion usually of the form ||∇f(x)||_2 ≤ ε

convergence result: for strongly convex f,

f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*)

c ∈ (0, 1) depends on m, x^(0), line search type

very simple, but often very slow; rarely used in practice

Unconstrained minimization 10–7
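The gradient method with backtracking line search fits in a few lines. A Python sketch (illustration only; parameter values α = 0.3, β = 0.8 and the iteration cap are arbitrary choices), run on the quadratic example of the next slide, f(x) = (1/2)(x_1^2 + γx_2^2) started at (γ, 1):

```python
def grad_desc(f, grad, x, alpha=0.3, beta=0.8, tol=1e-6, max_iter=2000):
    # gradient descent with backtracking line search
    for k in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= tol:  # stop when ||grad f||_2 small
            return x, k
        dx = [-gi for gi in g]                      # descent direction
        t = 1.0
        # backtrack until f(x + t dx) < f(x) + alpha t grad f(x)^T dx
        while f([xi + t * di for xi, di in zip(x, dx)]) > \
                f(x) + alpha * t * sum(gi * di for gi, di in zip(g, dx)):
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x, max_iter

gamma = 10.0
f = lambda x: 0.5 * (x[0] ** 2 + gamma * x[1] ** 2)
grad = lambda x: [x[0], gamma * x[1]]
x, iters = grad_desc(f, grad, [gamma, 1.0])
print(x, iters)
```

The iteration count grows with γ (the condition number), which is the slow, zig-zagging behavior the next slide quantifies.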
quadratic problem in R^2

f(x) = (1/2)(x_1^2 + γx_2^2)    (γ > 0)

with exact line search, starting at x^(0) = (γ, 1):

x_1^(k) = γ ((γ − 1)/(γ + 1))^k,    x_2^(k) = (−(γ − 1)/(γ + 1))^k

very slow if γ ≫ 1 or γ ≪ 1

example for γ = 10:

[figure: iterates x^(0), x^(1), . . . zig-zagging across the contour lines]

Unconstrained minimization 10–8
nonquadratic example

f(x_1, x_2) = e^{x_1+3x_2−0.1} + e^{x_1−3x_2−0.1} + e^{−x_1−0.1}

[figures: iterates x^(0), x^(1), x^(2) with backtracking line search (left) and exact line search (right)]

Unconstrained minimization 10–9
a problem in R^100

f(x) = c^T x − Σ_{i=1}^{500} log(b_i − a_i^T x)

[plot of f(x^(k)) − p* versus k, for exact and backtracking line search]

linear convergence, i.e., a straight line on a semilog plot

Unconstrained minimization 10–10
Steepest descent method

normalized steepest descent direction (at x, for norm || · ||):

∆x_nsd = argmin{∇f(x)^T v | ||v|| = 1}

interpretation: for small v, f(x + v) ≈ f(x) + ∇f(x)^T v; direction ∆x_nsd is unit-norm step with most negative directional derivative

(unnormalized) steepest descent direction

∆x_sd = ||∇f(x)||_* ∆x_nsd

satisfies ∇f(x)^T ∆x_sd = −||∇f(x)||_*^2

steepest descent method

general descent method with ∆x = ∆x_sd
convergence properties similar to gradient descent

Unconstrained minimization 10–11
examples

Euclidean norm: ∆x_sd = −∇f(x)
quadratic norm ||x||_P = (x^T Px)^{1/2} (P ∈ S^n_++): ∆x_sd = −P^{−1}∇f(x)
ℓ_1-norm: ∆x_sd = −(∂f(x)/∂x_i)e_i, where |∂f(x)/∂x_i| = ||∇f(x)||_∞

unit balls and normalized steepest descent directions for a quadratic norm and the ℓ_1-norm:

[figures: −∇f(x) and ∆x_nsd for each norm]

Unconstrained minimization 10–12
choice of norm for steepest descent

[figures: iterates x^(0), x^(1), x^(2) of steepest descent with backtracking line search for two quadratic norms; ellipses show {x | ||x − x^(k)||_P = 1}]

equivalent interpretation of steepest descent with quadratic norm || · ||_P: gradient descent after change of variables x̄ = P^{1/2}x

shows choice of P has strong effect on speed of convergence

Unconstrained minimization 10–13
Newton step

∆x_nt = −∇^2 f(x)^{−1}∇f(x)

interpretations

x + ∆x_nt minimizes second order approximation

f̂(x + v) = f(x) + ∇f(x)^T v + (1/2)v^T ∇^2 f(x)v

x + ∆x_nt solves linearized optimality condition

∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇^2 f(x)v = 0

[figures: f and f̂ at (x, f(x)) and (x + ∆x_nt, f(x + ∆x_nt)); f′ and f̂′ at the corresponding points]

Unconstrained minimization 10–14
∆x_nt is steepest descent direction at x in local Hessian norm

||u||_{∇^2 f(x)} = ( u^T ∇^2 f(x) u )^{1/2}

[figure: x + ∆x_nt and x + ∆x_nsd; dashed lines are contour lines of f; ellipse is {x + v | v^T ∇^2 f(x)v = 1}; arrow shows −∇f(x)]

Unconstrained minimization 10–15
Newton decrement

λ(x) = ( ∇f(x)^T ∇^2 f(x)^{−1}∇f(x) )^{1/2}

a measure of the proximity of x to x*

properties

gives an estimate of f(x) − p*, using quadratic approximation f̂:

f(x) − inf_y f̂(y) = (1/2)λ(x)^2

equal to the norm of the Newton step in the quadratic Hessian norm

λ(x) = ( ∆x_nt^T ∇^2 f(x) ∆x_nt )^{1/2}

directional derivative in the Newton direction: ∇f(x)^T ∆x_nt = −λ(x)^2
affine invariant (unlike ||∇f(x)||_2)

Unconstrained minimization 10–16
Newton's method

given a starting point x ∈ dom f, tolerance ε > 0.
repeat
1. Compute the Newton step and decrement.
   ∆x_nt := −∇^2 f(x)^{−1}∇f(x);  λ^2 := ∇f(x)^T ∇^2 f(x)^{−1}∇f(x).
2. Stopping criterion. quit if λ^2/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + t∆x_nt.

affine invariant, i.e., independent of linear changes of coordinates: Newton iterates for f̃(y) = f(Ty) with starting point y^(0) = T^{−1}x^(0) are y^(k) = T^{−1}x^(k)

Unconstrained minimization 10–17
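In one dimension the algorithm box above collapses to a few lines, since the Hessian is just f″(x) and λ^2 = f′(x)^2/f″(x). A Python sketch (illustration only; the line search is omitted because the full step t = 1 is accepted from this starting point, i.e., we are already in the quadratically convergent phase):

```python
import math

def newton_1d(fp, fpp, x, eps=1e-10, max_iter=50):
    # Newton's method in 1D with the lambda^2/2 stopping criterion
    for k in range(max_iter):
        dx_nt = -fp(x) / fpp(x)         # Newton step
        lam2 = fp(x) ** 2 / fpp(x)      # Newton decrement squared
        if lam2 / 2 <= eps:
            return x, k
        x += dx_nt                       # pure Newton update (t = 1)
    return x, max_iter

# f(x) = e^x + e^{-x}, minimized at x* = 0
fp = lambda x: math.exp(x) - math.exp(-x)
fpp = lambda x: math.exp(x) + math.exp(-x)
xstar, iters = newton_1d(fp, fpp, 1.0)
print(xstar, iters)
```

The iterate error roughly squares at each step (1 → 0.24 → 0.0045 → 3e−8), the quadratic local convergence analyzed on the next slides.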
Classical convergence analysis

assumptions

f strongly convex on S with constant m
∇^2 f is Lipschitz continuous on S, with constant L > 0:

||∇^2 f(x) − ∇^2 f(y)||_2 ≤ L||x − y||_2

(L measures how well f can be approximated by a quadratic function)

outline: there exist constants η ∈ (0, m^2/L), γ > 0 such that

if ||∇f(x)||_2 ≥ η, then f(x^(k+1)) − f(x^(k)) ≤ −γ
if ||∇f(x)||_2 < η, then

(L/(2m^2)) ||∇f(x^(k+1))||_2 ≤ ( (L/(2m^2)) ||∇f(x^(k))||_2 )^2

Unconstrained minimization 10–18
damped Newton phase (||∇f(x)||_2 ≥ η)

most iterations require backtracking steps
function value decreases by at least γ
if p* > −∞, this phase ends after at most (f(x^(0)) − p*)/γ iterations

quadratically convergent phase (||∇f(x)||_2 < η)

all iterations use step size t = 1
||∇f(x)||_2 converges to zero quadratically: if ||∇f(x^(k))||_2 < η, then

(L/(2m^2)) ||∇f(x^(l))||_2 ≤ ( (L/(2m^2)) ||∇f(x^(k))||_2 )^{2^{l−k}} ≤ (1/2)^{2^{l−k}},  l ≥ k

Unconstrained minimization 10–19
conclusion: number of iterations until f(x) − p* ≤ ε is bounded above by

(f(x^(0)) − p*)/γ + log_2 log_2(ε_0/ε)

γ, ε_0 are constants that depend on m, L, x^(0)
second term is small (of the order of 6) and almost constant for practical purposes
in practice, constants m, L (hence γ, ε_0) are usually unknown
provides qualitative insight in convergence properties (i.e., explains two algorithm phases)

Unconstrained minimization 10–20
Examples

example in R^2 (page 10–9)

[figure: iterates x^(0), x^(1); plot of f(x^(k)) − p* versus k]

backtracking parameters α = 0.1, β = 0.7
converges in only 5 steps
quadratic local convergence

Unconstrained minimization 10–21
example in R^100 (page 10–10)

[plots of f(x^(k)) − p* and step size t^(k) versus k, for exact line search and backtracking]

backtracking parameters α = 0.01, β = 0.5
backtracking line search almost as fast as exact l.s. (and much simpler)
clearly shows two phases in algorithm

Unconstrained minimization 10–22
example in R^10000 (with sparse a_i)

f(x) = −Σ_{i=1}^{10000} log(1 − x_i^2) − Σ_{i=1}^{100000} log(b_i − a_i^T x)

[plot of f(x^(k)) − p* versus k]

backtracking parameters α = 0.01, β = 0.5
performance similar as for small examples

Unconstrained minimization 10–23
Self-concordance

shortcomings of classical convergence analysis

depends on unknown constants (m, L, . . . )
bound is not affinely invariant, although Newton's method is

convergence analysis via self-concordance (Nesterov and Nemirovski)

does not depend on any unknown constants
gives affine-invariant bound
applies to special class of convex functions (self-concordant functions)
developed to analyze polynomial-time interior-point methods for convex optimization

Unconstrained minimization 10–24
Self-concordant functions

definition

convex f : R → R is self-concordant if |f‴(x)| ≤ 2f″(x)^{3/2} for all x ∈ dom f
f : R^n → R is self-concordant if g(t) = f(x + tv) is self-concordant for all x ∈ dom f, v ∈ R^n

examples on R

linear and quadratic functions
negative logarithm f(x) = −log x
negative entropy plus negative logarithm: f(x) = x log x − log x

affine invariance: if f : R → R is s.c., then f̃(y) = f(ay + b) is s.c.:

f̃‴(y) = a^3 f‴(ay + b),    f̃″(y) = a^2 f″(ay + b)

Unconstrained minimization 10–25
Self-concordant calculus

properties

preserved under positive scaling α ≥ 1, and sum
preserved under composition with affine function
if g is convex with dom g = R_++ and |g‴(x)| ≤ 3g″(x)/x then

f(x) = −log(−g(x)) − log x

is self-concordant

examples: properties can be used to show that the following are s.c.

f(x) = −Σ_{i=1}^m log(b_i − a_i^T x) on {x | a_i^T x < b_i, i = 1, . . . , m}
f(X) = −log det X on S^n_++
f(x) = −log(y^2 − x^T x) on {(x, y) | ||x||_2 < y}

Unconstrained minimization 10–26
Convergence analysis for self-concordant functions

summary: there exist constants η ∈ (0, 1/4], γ > 0 such that

if λ(x) > η, then

f(x^(k+1)) − f(x^(k)) ≤ −γ

if λ(x) ≤ η, then

2λ(x^(k+1)) ≤ ( 2λ(x^(k)) )^2

(η and γ only depend on backtracking parameters α, β)

complexity bound: number of Newton iterations bounded by

(f(x^(0)) − p*)/γ + log_2 log_2(1/ε)

for α = 0.1, β = 0.8, ε = 10^{−10}, bound evaluates to 375(f(x^(0)) − p*) + 6

Unconstrained minimization 10–27
numerical example: 150 randomly generated instances of
minimize f(x) = −∑_{i=1}^m log(b_i − a_i^T x)
three problem families: m = 100, n = 50; m = 1000, n = 500; m = 1000, n = 50
[figure: iterations versus f(x^(0)) − p^*]
number of iterations much smaller than 375(f(x^(0)) − p^*) + 6
bound of the form c(f(x^(0)) − p^*) + 6 with smaller c (empirically) valid
Unconstrained minimization 10–28
Implementation
main effort in each iteration: evaluate derivatives and solve Newton system
HΔx = −g
where H = ∇²f(x), g = ∇f(x)
via Cholesky factorization
H = LL^T,  Δx_nt = −L^{−T}L^{−1}g,  λ(x) = ‖L^{−1}g‖_2
cost (1/3)n^3 flops for unstructured system
cost ≪ (1/3)n^3 if H sparse, banded
Unconstrained minimization 10–29
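A minimal sketch of this computation in NumPy (function name ours). As a sanity check: on a quadratic f(x) = (1/2)x^T H x − b^T x, a single Newton step from x = 0, where the gradient is −b, must land exactly on the minimizer H^{−1}b.

```python
import numpy as np

def newton_step(g, H):
    """Newton step and decrement from the Cholesky factorization H = L L^T."""
    L = np.linalg.cholesky(H)      # (1/3)n^3 flops; H must be positive definite
    w = np.linalg.solve(L, g)      # w = L^{-1} g
    dx = -np.linalg.solve(L.T, w)  # dx_nt = -L^{-T} L^{-1} g
    return dx, np.linalg.norm(w)   # lambda(x) = ||L^{-1} g||_2

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5 * np.eye(5)        # a positive definite Hessian
b = rng.standard_normal(5)
dx, lam = newton_step(-b, H)       # gradient at x = 0 is -b
assert np.allclose(dx, np.linalg.solve(H, b))
```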
example of dense Newton system with structure
f(x) = ∑_{i=1}^n ψ_i(x_i) + ψ_0(Ax + b),  H = D + A^T H_0 A
assume A ∈ R^{p×n}, dense, with p ≪ n
D diagonal with diagonal elements ψ_i''(x_i); H_0 = ∇²ψ_0(Ax + b)
method 1: form H, solve via dense Cholesky factorization (cost (1/3)n^3)
method 2 (page 9–15): factor H_0 = L_0 L_0^T; write Newton system as
DΔx + A^T L_0 w = −g,  L_0^T A Δx − w = 0
eliminate Δx from first equation; compute w and Δx from
(I + L_0^T A D^{−1} A^T L_0) w = −L_0^T A D^{−1} g,  DΔx = −g − A^T L_0 w
cost: 2p^2 n (dominated by computation of L_0^T A D^{−1} A^T L_0)
Unconstrained minimization 10–30
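The two methods are easy to compare numerically. A sketch (NumPy; dimensions and names are ours) verifying that the small p × p elimination of method 2 reproduces the dense solve of method 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
A = rng.standard_normal((p, n))
d = rng.uniform(1.0, 2.0, n)           # diagonal of D (the psi_i''(x_i) > 0)
M = rng.standard_normal((p, p))
H0 = M @ M.T + np.eye(p)               # H0, positive definite
g = rng.standard_normal(n)

# method 1: form H = D + A^T H0 A, dense solve, cost O(n^3)
dx1 = np.linalg.solve(np.diag(d) + A.T @ H0 @ A, -g)

# method 2: factor H0 = L0 L0^T and eliminate dx, cost O(p^2 n)
L0 = np.linalg.cholesky(H0)
C = L0.T @ A                           # C = L0^T A
S = np.eye(p) + (C / d) @ C.T          # I + L0^T A D^{-1} A^T L0
w = np.linalg.solve(S, -(C / d) @ g)   # small p x p system
dx2 = (-g - C.T @ w) / d               # from D dx = -g - A^T L0 w
assert np.allclose(dx1, dx2)
```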
Convex Optimization Boyd & Vandenberghe
11. Equality constrained minimization
equality constrained minimization
eliminating equality constraints
Newton's method with equality constraints
infeasible start Newton method
implementation
11–1
Equality constrained minimization
minimize f(x)
subject to Ax = b
f convex, twice continuously differentiable
A ∈ R^{p×n} with rank A = p
we assume p^* is finite and attained
optimality conditions: x^* is optimal iff there exists a ν^* such that
∇f(x^*) + A^T ν^* = 0,  Ax^* = b
Equality constrained minimization 11–2
equality constrained quadratic minimization (with P ∈ S^n_+)
minimize (1/2)x^T P x + q^T x + r
subject to Ax = b
optimality condition:
[ P  A^T ] [ x^* ]     [ −q ]
[ A   0  ] [ ν^* ]  =  [  b ]
coefficient matrix is called KKT matrix
KKT matrix is nonsingular if and only if
Ax = 0, x ≠ 0  ⟹  x^T P x > 0
equivalent condition for nonsingularity: P + A^T A ≻ 0
Equality constrained minimization 11–3
Eliminating equality constraints
represent solution of {x | Ax = b} as
{x | Ax = b} = {Fz + x̂ | z ∈ R^{n−p}}
x̂ is (any) particular solution
range of F ∈ R^{n×(n−p)} is nullspace of A (rank F = n − p and AF = 0)
reduced or eliminated problem
minimize f(Fz + x̂)
an unconstrained problem with variable z ∈ R^{n−p}
from solution z^*, obtain x^* and ν^* as
x^* = Fz^* + x̂,  ν^* = −(AA^T)^{−1} A∇f(x^*)
Equality constrained minimization 11–4
example: optimal allocation with resource constraint
minimize f_1(x_1) + f_2(x_2) + ··· + f_n(x_n)
subject to x_1 + x_2 + ··· + x_n = b
eliminate x_n = b − x_1 − ··· − x_{n−1}, i.e., choose
x̂ = b e_n,  F = [ I; −1^T ] ∈ R^{n×(n−1)}  (the identity stacked above the row −1^T)
reduced problem:
minimize f_1(x_1) + ··· + f_{n−1}(x_{n−1}) + f_n(b − x_1 − ··· − x_{n−1})
(variables x_1, . . . , x_{n−1})
Equality constrained minimization 11–5
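A quick numeric check of this elimination (NumPy; we take quadratic f_i(x_i) = (x_i − c_i)^2/2, a choice of ours, so the reduced problem is a least-squares problem with a known closed-form answer):

```python
import numpy as np

n, b = 5, 10.0
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # targets in f_i(x_i) = (x_i - c_i)^2 / 2

A = np.ones((1, n))                      # constraint 1^T x = b
xhat = np.zeros(n)
xhat[-1] = b                             # particular solution b * e_n
F = np.vstack([np.eye(n - 1), -np.ones(n - 1)])  # range(F) = nullspace(A)
assert np.allclose(A @ F, 0) and np.isclose((A @ xhat)[0], b)

# reduced problem for f(x) = (1/2)||x - c||^2 is plain least squares in z
z = np.linalg.lstsq(F, c - xhat, rcond=None)[0]
x = F @ z + xhat
# known solution: shift c equally so that it lands on the hyperplane 1^T x = b
assert np.allclose(x, c + (b - c.sum()) / n)
```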
Newton step
Newton step Δx_nt of f at feasible x is given by solution v of
[ ∇²f(x)  A^T ] [ v ]     [ −∇f(x) ]
[ A        0  ] [ w ]  =  [    0   ]
interpretations
Δx_nt solves second order approximation (with variable v)
minimize f̂(x + v) = f(x) + ∇f(x)^T v + (1/2)v^T ∇²f(x) v
subject to A(x + v) = b
Δx_nt equations follow from linearizing optimality conditions
∇f(x + v) + A^T w ≈ ∇f(x) + ∇²f(x)v + A^T w = 0,  A(x + v) = b
Equality constrained minimization 11–6
Newton decrement
λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2} = (−∇f(x)^T Δx_nt)^{1/2}
properties
gives an estimate of f(x) − p^* using quadratic approximation f̂:
f(x) − inf_{Ay=b} f̂(y) = (1/2)λ(x)^2
directional derivative in Newton direction:
(d/dt) f(x + tΔx_nt) |_{t=0} = −λ(x)^2
in general, λ(x) ≠ (∇f(x)^T ∇²f(x)^{−1} ∇f(x))^{1/2}
Equality constrained minimization 11–7
Newton's method with equality constraints
given starting point x ∈ dom f with Ax = b, tolerance ε > 0.
repeat
1. Compute the Newton step and decrement Δx_nt, λ(x).
2. Stopping criterion. quit if λ^2/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + tΔx_nt.
a feasible descent method: x^(k) feasible and f(x^(k+1)) < f(x^(k))
affine invariant
Equality constrained minimization 11–8
Newton's method and elimination
Newton's method for reduced problem
minimize f̃(z) = f(Fz + x̂)
variables z ∈ R^{n−p}
x̂ satisfies Ax̂ = b; rank F = n − p and AF = 0
Newton's method for f̃, started at z^(0), generates iterates z^(k)
Newton's method with equality constraints
when started at x^(0) = Fz^(0) + x̂, iterates are
x^(k) = Fz^(k) + x̂
hence, don't need separate convergence analysis
Equality constrained minimization 11–9
Newton step at infeasible points
2nd interpretation of page 11–6 extends to infeasible x (i.e., Ax ≠ b)
linearizing optimality conditions at infeasible x (with x ∈ dom f) gives
[ ∇²f(x)  A^T ] [ Δx_nt ]       [ ∇f(x) ]
[ A        0  ] [   w   ]  = −  [ Ax − b ]     (1)
primal-dual interpretation
write optimality condition as r(y) = 0, where
y = (x, ν),  r(y) = (∇f(x) + A^T ν, Ax − b)
linearizing r(y) = 0 gives r(y + Δy) ≈ r(y) + Dr(y)Δy = 0:
[ ∇²f(x)  A^T ] [ Δx_nt ]       [ ∇f(x) + A^T ν ]
[ A        0  ] [ Δν_nt ]  = −  [     Ax − b    ]
same as (1) with w = ν + Δν_nt
Equality constrained minimization 11–10
Infeasible start Newton method
given starting point x ∈ dom f, ν, tolerance ε > 0, α ∈ (0, 1/2), β ∈ (0, 1).
repeat
1. Compute primal and dual Newton steps Δx_nt, Δν_nt.
2. Backtracking line search on ‖r‖_2.
   t := 1.
   while ‖r(x + tΔx_nt, ν + tΔν_nt)‖_2 > (1 − αt)‖r(x, ν)‖_2,  t := βt.
3. Update. x := x + tΔx_nt,  ν := ν + tΔν_nt.
until Ax = b and ‖r(x, ν)‖_2 ≤ ε.
not a descent method: f(x^(k+1)) > f(x^(k)) is possible
directional derivative of ‖r(y)‖_2 in direction Δy = (Δx_nt, Δν_nt) is
(d/dt) ‖r(y + tΔy)‖_2 |_{t=0} = −‖r(y)‖_2
Equality constrained minimization 11–11
Solving KKT systems
[ H  A^T ] [ v ]       [ g ]
[ A   0  ] [ w ]  = −  [ h ]
solution methods
LDL^T factorization
elimination (if H nonsingular)
AH^{−1}A^T w = h − AH^{−1}g,  Hv = −(g + A^T w)
elimination with singular H: write as
[ H + A^T QA  A^T ] [ v ]       [ g + A^T Qh ]
[ A            0  ] [ w ]  = −  [      h     ]
with Q ⪰ 0 for which H + A^T QA ≻ 0, and apply elimination
Equality constrained minimization 11–12
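A sketch checking the elimination formulas against a direct solve of the full KKT system (NumPy; H is chosen positive definite here, which is one way for it to be nonsingular):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 2
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                # nonsingular (here positive definite) H
A = rng.standard_normal((p, n))
g, h = rng.standard_normal(n), rng.standard_normal(p)

# direct solve of [H A^T; A 0][v; w] = -[g; h]
K = np.block([[H, A.T], [A, np.zeros((p, p))]])
vw = np.linalg.solve(K, -np.concatenate([g, h]))

# elimination: A H^{-1} A^T w = h - A H^{-1} g, then H v = -(g + A^T w)
Hinv_g = np.linalg.solve(H, g)
Hinv_At = np.linalg.solve(H, A.T)
w = np.linalg.solve(A @ Hinv_At, h - A @ Hinv_g)
v = np.linalg.solve(H, -(g + A.T @ w))
assert np.allclose(vw, np.concatenate([v, w]))
```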
Equality constrained analytic centering
primal problem: minimize −∑_{i=1}^n log x_i subject to Ax = b
dual problem: maximize −b^T ν + ∑_{i=1}^n log(A^T ν)_i + n
three methods for an example with A ∈ R^{100×500}, different starting points
1. Newton method with equality constraints (requires x^(0) ≻ 0, Ax^(0) = b)
[figure: f(x^(k)) − p^* versus k]
Equality constrained minimization 11–13
2. Newton method applied to dual problem (requires A^T ν^(0) ≻ 0)
[figure: p^* − g(ν^(k)) versus k]
3. infeasible start Newton method (requires x^(0) ≻ 0)
[figure: ‖r(x^(k), ν^(k))‖_2 versus k]
Equality constrained minimization 11–14
complexity per iteration of three methods is identical
1. use block elimination to solve KKT system
[ diag(x)^{−2}  A^T ] [ Δx ]     [ diag(x)^{−1} 1 ]
[ A              0  ] [ w  ]  =  [        0       ]
reduces to solving A diag(x)^2 A^T w = b
2. solve Newton system
A diag(A^T ν)^{−2} A^T Δν = −b + A diag(A^T ν)^{−1} 1
3. use block elimination to solve KKT system
[ diag(x)^{−2}  A^T ] [ Δx ]     [ diag(x)^{−1} 1 ]
[ A              0  ] [ w  ]  =  [     b − Ax     ]
reduces to solving A diag(x)^2 A^T w = 2Ax − b
conclusion: in each case, solve ADA^T w = h with D positive diagonal
Equality constrained minimization 11–15
Network flow optimization
minimize ∑_{i=1}^n φ_i(x_i)
subject to Ax = b
directed graph with n arcs, p + 1 nodes
x_i: flow through arc i; φ_i: cost flow function for arc i (with φ_i''(x) > 0)
node-incidence matrix Ã ∈ R^{(p+1)×n} defined as
Ã_ij = 1 if arc j leaves node i, −1 if arc j enters node i, 0 otherwise
reduced node-incidence matrix A ∈ R^{p×n} is Ã with last row removed
b ∈ R^p is (reduced) source vector
rank A = p if graph is connected
Equality constrained minimization 11–16
KKT system
[ H  A^T ] [ v ]       [ g ]
[ A   0  ] [ w ]  = −  [ h ]
H = diag(φ_1''(x_1), . . . , φ_n''(x_n)), positive diagonal
solve via elimination:
AH^{−1}A^T w = h − AH^{−1}g,  Hv = −(g + A^T w)
sparsity pattern of coefficient matrix is given by graph connectivity
(AH^{−1}A^T)_ij ≠ 0  ⟺  (AA^T)_ij ≠ 0  ⟺  nodes i and j are connected by an arc
Equality constrained minimization 11–17
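The sparsity claim is easy to verify on a small graph (NumPy; the 4-node, 5-arc example graph is ours):

```python
import numpy as np

# small directed graph: 4 nodes, 5 arcs given as (tail, head) pairs
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
p1, n = 4, len(arcs)
Atil = np.zeros((p1, n))
for j, (tail, head) in enumerate(arcs):
    Atil[tail, j] = 1.0                # arc j leaves node tail
    Atil[head, j] = -1.0               # arc j enters node head
A = Atil[:-1]                          # reduced incidence: drop last row

h = np.random.default_rng(4).uniform(1.0, 3.0, n)  # H = diag(phi_i'') > 0
S = A @ np.diag(1.0 / h) @ A.T         # A H^{-1} A^T

# its sparsity pattern matches that of A A^T: nodes joined by an arc
pattern = np.abs(S) > 1e-12
assert np.array_equal(pattern, np.abs(A @ A.T) > 1e-12)
assert np.linalg.matrix_rank(A) == 3   # graph is connected, so rank A = p
```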
Analytic center of linear matrix inequality
minimize −log det X
subject to tr(A_i X) = b_i, i = 1, . . . , p
variable X ∈ S^n
optimality conditions
X^* ≻ 0,  −(X^*)^{−1} + ∑_{j=1}^p ν^*_j A_j = 0,  tr(A_i X^*) = b_i, i = 1, . . . , p
Newton equation at feasible X:
X^{−1} ΔX X^{−1} + ∑_{j=1}^p w_j A_j = X^{−1},  tr(A_i ΔX) = 0, i = 1, . . . , p
follows from linear approximation (X + ΔX)^{−1} ≈ X^{−1} − X^{−1} ΔX X^{−1}
n(n + 1)/2 + p variables ΔX, w
Equality constrained minimization 11–18
solution by block elimination
eliminate ΔX from first equation: ΔX = X − ∑_{j=1}^p w_j X A_j X
substitute ΔX in second equation
∑_{j=1}^p tr(A_i X A_j X) w_j = b_i,  i = 1, . . . , p     (2)
a dense positive definite set of linear equations with variable w ∈ R^p
flop count (dominant terms) using Cholesky factorization X = LL^T:
form p products L^T A_j L: (3/2)pn^3
form p(p + 1)/2 inner products tr((L^T A_i L)(L^T A_j L)): (1/2)p^2 n^2
solve (2) via Cholesky factorization: (1/3)p^3
Equality constrained minimization 11–19
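A sketch of the elimination on a random feasible instance (NumPy; the problem data are ours), verifying that the recovered ΔX and w satisfy both Newton equations:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 4, 2
M = rng.standard_normal((n, n))
X = M @ M.T + n * np.eye(n)            # a feasible point: symmetric, X > 0
As = [rng.standard_normal((n, n)) for _ in range(p)]
As = [(Ai + Ai.T) / 2 for Ai in As]    # constraint matrices in S^n
bvec = np.array([np.trace(Ai @ X) for Ai in As])   # so tr(A_i X) = b_i holds

# dense p x p system (2): sum_j tr(A_i X A_j X) w_j = b_i
W = np.array([[np.trace(Ai @ X @ Aj @ X) for Aj in As] for Ai in As])
w = np.linalg.solve(W, bvec)

# back-substitute: dX = X - sum_j w_j X A_j X
dX = X - sum(wj * X @ Aj @ X for wj, Aj in zip(w, As))

# check both Newton equations
Xinv = np.linalg.inv(X)
R = Xinv @ dX @ Xinv + sum(wj * Aj for wj, Aj in zip(w, As))
assert np.allclose(R, Xinv)
assert np.allclose([np.trace(Ai @ dX) for Ai in As], 0.0)
```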
Convex Optimization Boyd & Vandenberghe
12. Interior-point methods
inequality constrained minimization
logarithmic barrier function and central path
barrier method
feasibility and phase I methods
complexity analysis via self-concordance
generalized inequalities
12–1
Inequality constrained minimization
minimize f_0(x)
subject to f_i(x) ≤ 0, i = 1, . . . , m
Ax = b     (1)
f_i convex, twice continuously differentiable
A ∈ R^{p×n} with rank A = p
we assume p^* is finite and attained
we assume problem is strictly feasible: there exists x̃ with
x̃ ∈ dom f_0,  f_i(x̃) < 0, i = 1, . . . , m,  Ax̃ = b
hence, strong duality holds and dual optimum is attained
Interior-point methods 12–2
Examples
LP, QP, QCQP, GP
entropy maximization with linear inequality constraints
minimize ∑_{i=1}^n x_i log x_i
subject to Fx ⪯ g
Ax = b
with dom f_0 = R^n_{++}
differentiability may require reformulating the problem, e.g., piecewise-linear minimization or ℓ_∞-norm approximation via LP
SDPs and SOCPs are better handled as problems with generalized inequalities (see later)
Interior-point methods 12–3
Logarithmic barrier
reformulation of (1) via indicator function:
minimize f_0(x) + ∑_{i=1}^m I_−(f_i(x))
subject to Ax = b
where I_−(u) = 0 if u ≤ 0, I_−(u) = ∞ otherwise (indicator function of R_−)
approximation via logarithmic barrier
minimize f_0(x) − (1/t) ∑_{i=1}^m log(−f_i(x))
subject to Ax = b
an equality constrained problem
for t > 0, −(1/t) log(−u) is a smooth approximation of I_−
approximation improves as t → ∞
[figure: −(1/t) log(−u) versus u for several values of t]
Interior-point methods 12–4
logarithmic barrier function
φ(x) = −∑_{i=1}^m log(−f_i(x)),  dom φ = {x | f_1(x) < 0, . . . , f_m(x) < 0}
convex (follows from composition rules)
twice continuously differentiable, with derivatives
∇φ(x) = ∑_{i=1}^m (1/(−f_i(x))) ∇f_i(x)
∇²φ(x) = ∑_{i=1}^m (1/f_i(x)^2) ∇f_i(x)∇f_i(x)^T + ∑_{i=1}^m (1/(−f_i(x))) ∇²f_i(x)
Interior-point methods 12–5
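These derivative formulas specialize neatly to linear constraints f_i(x) = a_i^T x − b_i. A sketch (NumPy; function name and data are ours) with a finite-difference check of the gradient:

```python
import numpy as np

def phi_grad_hess(A, b, x):
    """Log barrier for Ax <= b: phi = -sum log(b - Ax), with derivatives."""
    d = b - A @ x                          # slacks, must be > 0
    phi = -np.log(d).sum()
    grad = A.T @ (1.0 / d)                 # sum_i a_i / (b_i - a_i^T x)
    hess = A.T @ np.diag(1.0 / d**2) @ A   # sum_i a_i a_i^T / (b_i - a_i^T x)^2
    return phi, grad, hess

rng = np.random.default_rng(6)
m, n = 7, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n) * 0.1
b = A @ x + rng.uniform(0.5, 1.5, m)       # ensures strict feasibility at x

phi, grad, hess = phi_grad_hess(A, b, x)
# finite-difference check of the gradient
eps = 1e-6
fd = np.array([(phi_grad_hess(A, b, x + eps * e)[0] - phi) / eps
               for e in np.eye(n)])
assert np.allclose(fd, grad, atol=1e-4)
```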
Central path
for t > 0, define x^*(t) as the solution of
minimize t f_0(x) + φ(x)
subject to Ax = b
(for now, assume x^*(t) exists and is unique for each t > 0)
central path is {x^*(t) | t > 0}
example: central path for an LP
minimize c^T x
subject to a_i^T x ≤ b_i, i = 1, . . . , 6
hyperplane c^T x = c^T x^*(t) is tangent to level curve of φ through x^*(t)
[figure: central path of a small LP, with x^*(10) marked]
Interior-point methods 12–6
Dual points on central path
x = x^*(t) if there exists a w such that
t∇f_0(x) + ∑_{i=1}^m (1/(−f_i(x))) ∇f_i(x) + A^T w = 0,  Ax = b
therefore, x^*(t) minimizes the Lagrangian
L(x, λ^*(t), ν^*(t)) = f_0(x) + ∑_{i=1}^m λ_i^*(t) f_i(x) + ν^*(t)^T (Ax − b)
where we define λ_i^*(t) = 1/(−t f_i(x^*(t))) and ν^*(t) = w/t
this confirms the intuitive idea that f_0(x^*(t)) → p^* if t → ∞:
p^* ≥ g(λ^*(t), ν^*(t))
    = L(x^*(t), λ^*(t), ν^*(t))
    = f_0(x^*(t)) − m/t
Interior-point methods 12–7
Interpretation via KKT conditions
x = x^*(t), λ = λ^*(t), ν = ν^*(t) satisfy
1. primal constraints: f_i(x) ≤ 0, i = 1, . . . , m, Ax = b
2. dual constraints: λ ⪰ 0
3. approximate complementary slackness: −λ_i f_i(x) = 1/t, i = 1, . . . , m
4. gradient of Lagrangian with respect to x vanishes:
∇f_0(x) + ∑_{i=1}^m λ_i ∇f_i(x) + A^T ν = 0
difference with KKT is that condition 3 replaces λ_i f_i(x) = 0
Interior-point methods 12–8
Force field interpretation
centering problem (for problem with no equality constraints)
minimize t f_0(x) − ∑_{i=1}^m log(−f_i(x))
force field interpretation
t f_0(x) is potential of force field F_0(x) = −t∇f_0(x)
−log(−f_i(x)) is potential of force field F_i(x) = (1/f_i(x))∇f_i(x)
the forces balance at x^*(t):
F_0(x^*(t)) + ∑_{i=1}^m F_i(x^*(t)) = 0
Interior-point methods 12–9
example
minimize c^T x
subject to a_i^T x ≤ b_i, i = 1, . . . , m
objective force field is constant: F_0(x) = −tc
constraint force field decays as inverse distance to constraint hyperplane:
F_i(x) = −a_i / (b_i − a_i^T x),  ‖F_i(x)‖_2 = 1/dist(x, H_i)
where H_i = {x | a_i^T x = b_i}
[figure: force fields −c and −3c for t = 1 and t = 3]
Interior-point methods 12–10
Barrier method
given strictly feasible x, t := t^(0) > 0, μ > 1, tolerance ε > 0.
repeat
1. Centering step. Compute x^*(t) by minimizing t f_0 + φ, subject to Ax = b.
2. Update. x := x^*(t).
3. Stopping criterion. quit if m/t < ε.
4. Increase t. t := μt.
terminates with f_0(x) − p^* ≤ ε (stopping criterion follows from f_0(x^*(t)) − p^* ≤ m/t)
centering usually done using Newton's method, starting at current x
choice of μ involves a trade-off: large μ means fewer outer iterations, more inner (Newton) iterations; typical values: μ = 10–20
several heuristics for choice of t^(0)
Interior-point methods 12–11
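A compact sketch of the whole method for an inequality form LP, minimize c^T x subject to Ax ⪯ b, with Newton centering and backtracking (NumPy; the parameter choices and the box-constrained test LP are ours):

```python
import numpy as np

def centering(c, A, b, x, t, tol=1e-8, max_iter=100):
    """Newton with backtracking for: minimize t*c^T x - sum log(b - Ax)."""
    for _ in range(max_iter):
        d = b - A @ x                      # slacks, kept positive
        g = t * c + A.T @ (1.0 / d)        # gradient of t f_0 + phi
        H = A.T @ ((1.0 / d**2)[:, None] * A)
        v = -np.linalg.solve(H, g)         # Newton step
        if -g @ v / 2 <= tol:              # lambda(x)^2 / 2
            return x
        s, val = 1.0, t * (c @ x) - np.log(d).sum()
        while (np.any(b - A @ (x + s * v) <= 0) or
               t * (c @ (x + s * v)) - np.log(b - A @ (x + s * v)).sum()
               > val + 0.01 * s * (g @ v)):
            s *= 0.5                       # backtracking, alpha=0.01, beta=0.5
        x = x + s * v
    return x

def barrier_lp(c, A, b, x, t=1.0, mu=20.0, eps=1e-6):
    """Barrier method for: minimize c^T x subject to Ax <= b."""
    m = A.shape[0]
    while True:
        x = centering(c, A, b, x, t)       # 1. centering step
        if m / t < eps:                    # 3. gap on central path is m/t
            return x
        t *= mu                            # 4. increase t

# box LP: minimize c^T x over -1 <= x_i <= 1; solution is x_i = -sign(c_i)
c = np.array([1.0, -2.0, 3.0, -0.5])
A = np.vstack([np.eye(4), -np.eye(4)])
b = np.ones(8)
x = barrier_lp(c, A, b, np.zeros(4))
assert np.allclose(x, -np.sign(c), atol=1e-4)
```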
Convergence analysis
number of outer (centering) iterations: exactly
⌈ log(m/(εt^(0))) / log μ ⌉
plus the initial centering step (to compute x^*(t^(0)))
centering problem
minimize t f_0(x) + φ(x)
see convergence analysis of Newton's method
t f_0 + φ must have closed sublevel sets for t ≥ t^(0)
classical analysis requires strong convexity, Lipschitz condition
analysis via self-concordance requires self-concordance of t f_0 + φ
Interior-point methods 12–12
Examples
inequality form LP (m = 100 inequalities, n = 50 variables)
[figure, left: duality gap versus Newton iterations for μ = 2, 50, 150]
[figure, right: total number of Newton iterations versus μ]
starts with x on central path (t^(0) = 1, duality gap 100)
terminates when t = 10^8 (gap 10^{−6})
centering uses Newton's method with backtracking
total number of Newton iterations not very sensitive for μ ≥ 10
Interior-point methods 12–13
geometric program (m = 100 inequalities and n = 50 variables)
minimize log( ∑_{k=1}^5 exp(a_{0k}^T x + b_{0k}) )
subject to log( ∑_{k=1}^5 exp(a_{ik}^T x + b_{ik}) ) ≤ 0, i = 1, . . . , m
[figure: duality gap versus Newton iterations for μ = 2, 50, 150]
Interior-point methods 12–14
family of standard LPs (A ∈ R^{m×2m})
minimize c^T x
subject to Ax = b, x ⪰ 0
m = 10, . . . , 1000; for each m, solve 100 randomly generated instances
[figure: number of Newton iterations (roughly 15–35) versus m]
number of iterations grows very slowly as m ranges over a 100 : 1 ratio
Interior-point methods 12–15
Feasibility and phase I methods
feasibility problem: find x such that
f_i(x) ≤ 0, i = 1, . . . , m,  Ax = b     (2)
phase I: computes strictly feasible starting point for barrier method
basic phase I method
minimize (over x, s) s
subject to f_i(x) ≤ s, i = 1, . . . , m
Ax = b     (3)
if x, s feasible, with s < 0, then x is strictly feasible for (2)
if optimal value p̄^* of (3) is positive, then problem (2) is infeasible
if p̄^* = 0 and attained, then problem (2) is feasible (but not strictly); if p̄^* = 0 and not attained, then problem (2) is infeasible
Interior-point methods 12–16
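For linear inequalities the basic phase I problem is itself an LP, so any LP solver can be used. A sketch with SciPy's linprog (the function name is ours, and we additionally bound s below by −1, a choice of ours rather than the slides', so the phase I LP cannot be unbounded; any s < 0 still certifies strict feasibility):

```python
import numpy as np
from scipy.optimize import linprog

def phase1(A, b):
    """Basic phase I for Ax <= b: minimize s over (x, s) s.t. Ax - s*1 <= b.
    s is bounded below by -1 so the LP is never unbounded."""
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                    # objective: minimize s
    A_ub = np.hstack([A, -np.ones((m, 1))])        # a_i^T x - s <= b_i
    res = linprog(c, A_ub=A_ub, b_ub=b,
                  bounds=[(None, None)] * n + [(-1.0, None)])
    return res.x[:n], res.x[-1]

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 5))
b = A @ rng.standard_normal(5) + 0.1               # strictly feasible by construction
x, s = phase1(A, b)
assert s < 0 and np.all(A @ x < b)                 # x strictly satisfies Ax < b
```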
sum of infeasibilities phase I method
minimize 1^T s
subject to s ⪰ 0, f_i(x) ≤ s_i, i = 1, . . . , m
Ax = b
for infeasible problems, produces a solution that satisfies many more inequalities than basic phase I method
example (infeasible set of 100 linear inequalities in 50 variables)
[figure: histograms of the margins b_i − a_i^T x for the two phase I solutions]
left: basic phase I solution; satisfies 39 inequalities
right: sum of infeasibilities phase I solution; satisfies 79 inequalities
Interior-point methods 12–17
example: family of linear inequalities Ax ⪯ b + γΔb
data chosen to be strictly feasible for γ > 0, infeasible for γ ≤ 0
use basic phase I, terminate when s < 0 or dual objective is positive
[figures: number of Newton iterations versus γ, on a linear scale and on log scales for γ < 0 and γ > 0]
number of iterations roughly proportional to log(1/|γ|)
Interior-point methods 12–18
Complexity analysis via self-concordance
same assumptions as on page 12–2, plus:
sublevel sets (of f_0, on the feasible set) are bounded
t f_0 + φ is self-concordant with closed sublevel sets
second condition
holds for LP, QP, QCQP
may require reformulating the problem, e.g., replacing
minimize ∑_{i=1}^n x_i log x_i subject to Fx ⪯ g
by
minimize ∑_{i=1}^n x_i log x_i subject to Fx ⪯ g, x ⪰ 0
needed for complexity analysis; barrier method works even when self-concordance assumption does not apply
Interior-point methods 12–19
Newton iterations per centering step: from self-concordance theory
#Newton iterations ≤ (μt f_0(x) + φ(x) − μt f_0(x^+) − φ(x^+))/γ + c
bound on effort of computing x^+ = x^*(μt) starting at x = x^*(t)
γ, c are constants (depend only on Newton algorithm parameters)
from duality (with λ = λ^*(t), ν = ν^*(t)):
μt f_0(x) + φ(x) − μt f_0(x^+) − φ(x^+)
  = μt f_0(x) − μt f_0(x^+) + ∑_{i=1}^m log(−μt λ_i f_i(x^+)) − m log μ
  ≤ μt f_0(x) − μt f_0(x^+) − μt ∑_{i=1}^m λ_i f_i(x^+) − m − m log μ
  ≤ μt f_0(x) − μt g(λ, ν) − m − m log μ
  = m(μ − 1 − log μ)
Interior-point methods 12–20
total number of Newton iterations (excluding first centering step)
#Newton iterations ≤ N = ⌈ log(m/(t^(0)ε)) / log μ ⌉ ( m(μ − 1 − log μ)/γ + c )
[figure: N versus μ between 1 and 1.2]
figure shows N for typical values of γ, c, with m = 100, m/(t^(0)ε) = 10^5
confirms trade-off in choice of μ
in practice, #iterations is in the tens; not very sensitive for μ ≥ 10
Interior-point methods 12–21
polynomial-time complexity of barrier method
for μ = 1 + 1/√m:
N = O( √m log(m/(t^(0)ε)) )
number of Newton iterations for fixed gap reduction is O(√m)
multiply with cost of one Newton iteration (a polynomial function of problem dimensions), to get bound on number of flops
this choice of μ optimizes worst-case complexity; in practice we choose μ fixed (μ = 10, . . . , 20)
Interior-point methods 12–22
Generalized inequalities
minimize f_0(x)
subject to f_i(x) ⪯_{K_i} 0, i = 1, . . . , m
Ax = b
f_0 convex, f_i : R^n → R^{k_i}, i = 1, . . . , m, convex with respect to proper cones K_i ⊆ R^{k_i}
f_i twice continuously differentiable
A ∈ R^{p×n} with rank A = p
we assume p^* is finite and attained
we assume problem is strictly feasible; hence strong duality holds and dual optimum is attained
examples of greatest interest: SOCP, SDP
Interior-point methods 12–23
Generalized logarithm for proper cone
ψ : R^q → R is generalized logarithm for proper cone K ⊆ R^q if:
dom ψ = int K and ∇²ψ(y) ≺ 0 for y ≻_K 0
ψ(sy) = ψ(y) + θ log s for y ≻_K 0, s > 0 (θ is the degree of ψ)
examples
nonnegative orthant K = R^n_+: ψ(y) = ∑_{i=1}^n log y_i, with degree θ = n
positive semidefinite cone K = S^n_+:
ψ(Y) = log det Y (θ = n)
second-order cone K = {y ∈ R^{n+1} | (y_1^2 + ··· + y_n^2)^{1/2} ≤ y_{n+1}}:
ψ(y) = log(y_{n+1}^2 − y_1^2 − ··· − y_n^2) (θ = 2)
Interior-point methods 12–24
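The degree identity ψ(sy) = ψ(y) + θ log s is easy to confirm numerically for the examples above (NumPy; the sample points are ours):

```python
import numpy as np

rng = np.random.default_rng(8)
s = 2.5

# positive semidefinite cone: psi(Y) = log det Y, degree theta = n
n = 4
M = rng.standard_normal((n, n))
Y = M @ M.T + np.eye(n)                 # a point in int S^n_+
lhs = np.linalg.slogdet(s * Y)[1]       # log det (sY)
assert np.isclose(lhs, np.linalg.slogdet(Y)[1] + n * np.log(s))

# second-order cone: psi(y) = log(y_{n+1}^2 - y_1^2 - ... - y_n^2), theta = 2
y = np.array([0.3, -0.2, 0.1, 1.0])     # satisfies ||y[:3]||_2 < y[3]
soc = lambda v: np.log(v[-1]**2 - v[:-1] @ v[:-1])
assert np.isclose(soc(s * y), soc(y) + 2 * np.log(s))
```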
properties (without proof): for y ≻_K 0,
∇ψ(y) ⪰_{K^*} 0,  y^T ∇ψ(y) = θ
nonnegative orthant R^n_+: ψ(y) = ∑_{i=1}^n log y_i
∇ψ(y) = (1/y_1, . . . , 1/y_n),  y^T ∇ψ(y) = n
positive semidefinite cone S^n_+: ψ(Y) = log det Y
∇ψ(Y) = Y^{−1},  tr(Y∇ψ(Y)) = n
second-order cone K = {y ∈ R^{n+1} | (y_1^2 + ··· + y_n^2)^{1/2} ≤ y_{n+1}}:
∇ψ(y) = (2/(y_{n+1}^2 − y_1^2 − ··· − y_n^2)) (−y_1, . . . , −y_n, y_{n+1}),  y^T ∇ψ(y) = 2
Interior-point methods 12–25
Logarithmic barrier and central path
logarithmic barrier for f_1(x) ⪯_{K_1} 0, . . . , f_m(x) ⪯_{K_m} 0:
φ(x) = −∑_{i=1}^m ψ_i(−f_i(x)),  dom φ = {x | f_i(x) ≺_{K_i} 0, i = 1, . . . , m}
ψ_i is generalized logarithm for K_i, with degree θ_i
φ is convex, twice continuously differentiable
central path: {x^*(t) | t > 0} where x^*(t) solves
minimize t f_0(x) + φ(x)
subject to Ax = b
Interior-point methods 12–26
Dual points on central path
x = x^*(t) if there exists w ∈ R^p,
t∇f_0(x) + ∑_{i=1}^m Df_i(x)^T ∇ψ_i(−f_i(x)) + A^T w = 0
(Df_i(x) ∈ R^{k_i×n} is derivative matrix of f_i)
therefore, x^*(t) minimizes Lagrangian L(x, λ^*(t), ν^*(t)), where
λ_i^*(t) = (1/t) ∇ψ_i(−f_i(x^*(t))),  ν^*(t) = w/t
from properties of ψ_i: λ_i^*(t) ≻_{K_i^*} 0, with duality gap
f_0(x^*(t)) − g(λ^*(t), ν^*(t)) = (1/t) ∑_{i=1}^m θ_i
Interior-point methods 12–27
example: semidefinite programming (with F_i ∈ S^p)
minimize c^T x
subject to F(x) = ∑_{i=1}^n x_i F_i + G ⪯ 0
logarithmic barrier: φ(x) = log det(−F(x)^{−1})
central path: x^*(t) minimizes t c^T x − log det(−F(x)); hence
t c_i − tr(F_i F(x^*(t))^{−1}) = 0,  i = 1, . . . , n
dual point on central path: Z^*(t) = −(1/t) F(x^*(t))^{−1} is feasible for
maximize tr(GZ)
subject to tr(F_i Z) + c_i = 0, i = 1, . . . , n
Z ⪰ 0
duality gap on central path: c^T x^*(t) − tr(GZ^*(t)) = p/t
Interior-point methods 12–28
Barrier method
given strictly feasible x, t := t^(0) > 0, μ > 1, tolerance ε > 0.
repeat
1. Centering step. Compute x^*(t) by minimizing t f_0 + φ, subject to Ax = b.
2. Update. x := x^*(t).
3. Stopping criterion. quit if (∑_i θ_i)/t < ε.
4. Increase t. t := μt.
only difference is that duality gap m/t on central path is replaced by (∑_i θ_i)/t
number of outer iterations:
⌈ log((∑_i θ_i)/(εt^(0))) / log μ ⌉
complexity analysis via self-concordance applies to SDP, SOCP
Interior-point methods 12–29
Examples
second-order cone program (50 variables, 50 SOC constraints in R^6)
[figures: duality gap versus Newton iterations for μ = 2, 50, 200; total Newton iterations versus μ]
semidefinite program (100 variables, LMI constraint in S^100)
[figures: duality gap versus Newton iterations for μ = 2, 50, 150; total Newton iterations versus μ]
Interior-point methods 12–30
family of SDPs (A ∈ S^n, x ∈ R^n)
minimize 1^T x
subject to A + diag(x) ⪰ 0
n = 10, . . . , 1000, for each n solve 100 randomly generated instances
[figure: number of Newton iterations (roughly 15–35) versus n]
Interior-point methods 12–31
Primal-dual interior-point methods
more efficient than barrier method when high accuracy is needed
update primal and dual variables at each iteration; no distinction between inner and outer iterations
often exhibit superlinear asymptotic convergence
search directions can be interpreted as Newton directions for modified KKT conditions
can start at infeasible points
cost per iteration same as barrier method
Interior-point methods 12–32
Convex Optimization Boyd & Vandenberghe
13. Conclusions
main ideas of the course
importance of modeling in optimization
13–1
Modeling
mathematical optimization
problems in engineering design, data analysis and statistics, economics,
management, . . . , can often be expressed as mathematical
optimization problems
techniques exist to take into account multiple objectives or uncertainty
in the data
tractability
roughly speaking, tractability in optimization requires convexity
algorithms for nonconvex optimization find local (suboptimal) solutions, or are very expensive
surprisingly many applications can be formulated as convex problems
Conclusions 13–2
Theoretical consequences of convexity
local optima are global
extensive duality theory
systematic way of deriving lower bounds on optimal value
necessary and sufficient optimality conditions
certificates of infeasibility
sensitivity analysis
solution methods with polynomial worst-case complexity theory (with self-concordance)
Conclusions 13–3
Practical consequences of convexity
(most) convex problems can be solved globally and efficiently
interior-point methods require 20–80 steps in practice
basic algorithms (e.g., Newton, barrier method, . . . ) are easy to implement and work well for small and medium size problems (larger problems if structure is exploited)
more and more high-quality implementations of advanced algorithms and modeling tools are becoming available
high level modeling tools like cvx ease modeling and problem specification
Conclusions 13–4
How to use convex optimization
to use convex optimization in some applied context
use rapid prototyping, approximate modeling
start with simple models, small problem instances, inefficient solution methods
if you don't like the results, no need to expend further effort on more accurate models or efficient algorithms
work out, simplify, and interpret optimality conditions and dual
even if the problem is quite nonconvex, you can use convex optimization in subproblems, e.g., to find search direction
by repeatedly forming and solving a convex approximation at the current point
Conclusions 13–5
Further topics
some topics we didn't cover:
methods for very large scale problems
subgradient calculus, convex analysis
localization, subgradient, and related methods
distributed convex optimization
applications that build on or use convex optimization
. . . these will be in EE364B
Conclusions 13–6