
Subgradient Method

Yi Zhang

Outline

Subgradients
Subgradients for convex optimization


Subgradient method
Projected subgradient method
Stochastic subgradient method

Examples


Differentiable functions

When f is convex and differentiable, the first-order approximation of f at x is a global underestimate of f:

f(y) >= f(x) + ∇f(x)^T (y - x) for all x, y in dom(f)

[Boyd & Vandenberghe]

What if f is not differentiable?

g is a subgradient of f (convex or not) at x if

f(y) >= f(x) + g^T (y - x) for all y

It provides a global underestimate of f, and it is not necessarily unique at x.

[Boyd & Vandenberghe]

Subdifferential of f

The subdifferential of f at x is the set of all subgradients of f at x, denoted ∂f(x). It is a convex set, and it can be an empty set (if f is not convex).
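The subgradient inequality is easy to check numerically. A minimal sketch (the helper `is_subgradient` is ours) for f(x) = |x|, whose subdifferential at 0 is the interval [-1, 1]:

```python
# Numerically check the subgradient inequality f(y) >= f(x) + g*(y - x)
# for f(x) = |x|.  At x = 0 the subdifferential is the interval [-1, 1];
# away from 0 it is the single slope sign(x).

def is_subgradient(f, g, x, ys):
    """True if g satisfies the subgradient inequality at x over test points ys."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)

ys = [i / 10.0 for i in range(-30, 31)]      # test points in [-3, 3]

assert is_subgradient(abs, 0.5, 0.0, ys)      # any g in [-1, 1] works at x = 0
assert is_subgradient(abs, -1.0, 0.0, ys)
assert not is_subgradient(abs, 1.5, 0.0, ys)  # g outside [-1, 1] fails
assert is_subgradient(abs, 1.0, 2.0, ys)      # at x > 0 the only subgradient is +1
assert not is_subgradient(abs, 0.5, 2.0, ys)
```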

Subgradients of convex functions

If f is convex:

∂f(x) is non-empty for all x in dom(f)
If f is differentiable at x, ∂f(x) = {∇f(x)}
If ∂f(x) contains only one element g, then f is differentiable at x, and g = ∇f(x)

[Boyd & Vandenberghe]

Basic rules for calculating subgradients

Addition: ∂(f1 + f2)(x) = ∂f1(x) + ∂f2(x)
Scaling: ∂(a f)(x) = a ∂f(x), for a > 0
Affine transform: if h(x) = f(Ax + b), then ∂h(x) = A^T ∂f(Ax + b)
Chain rule: composition rules analogous to the smooth case

Note: these are operations over sets
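The addition rule can be illustrated concretely. A small sketch (the helper `check` is ours) for f1(x) = |x| and f2(x) = |x - 1|: at x = 0 the subdifferentials are [-1, 1] and {-1}, so the sum's subdifferential is the set sum [-2, 0]:

```python
# Illustrate the addition rule: g1 + g2 is a subgradient of f1 + f2 whenever
# g1 is a subgradient of f1 and g2 of f2.  Here f(x) = |x| + |x - 1|, whose
# subdifferential at x = 0 is [-1, 1] + {-1} = [-2, 0].

def check(g, x, ys, f):
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)

f = lambda x: abs(x) + abs(x - 1)
ys = [i / 10.0 for i in range(-30, 41)]

assert check(-2.0, 0.0, ys, f)     # endpoint: g1 = -1, g2 = -1
assert check(0.0, 0.0, ys, f)      # endpoint: g1 = +1, g2 = -1
assert check(-1.0, 0.0, ys, f)     # interior point of [-2, 0]
assert not check(0.5, 0.0, ys, f)  # outside [-2, 0] violates the inequality
```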

Basic rules for calculating subgradients

Finite pointwise maximum: if

f(x) = max_{i=1,...,m} f_i(x)

then ∂f(x) is the convex hull of the union of the subdifferentials of all active functions:

∂f(x) = conv( ∪ { ∂f_i(x) : f_i(x) = f(x) } )
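A quick numerical illustration of the rule, assuming the simplest case f(x) = max(x, -x) = |x|: the slope of any active piece is a valid subgradient.

```python
# For f(x) = max(x, -x) = |x|, the rule says a subgradient at x is the slope
# of any active piece (and at the kink, any point of their convex hull).

def f(x):
    return max(x, -x)

def subgrad(x):
    # slope of one active affine piece (ties broken toward the first, +x)
    return 1.0 if x >= -x else -1.0

ys = [i / 10.0 for i in range(-30, 31)]
for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    g = subgrad(x)
    # the chosen slope satisfies the subgradient inequality everywhere
    assert all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)
```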

Optimality conditions

If f is convex and differentiable: x* minimizes f iff ∇f(x*) = 0

If f is convex but non-differentiable: x* minimizes f iff 0 ∈ ∂f(x*)

Is g a descent direction?

For a convex function f and g ∈ ∂f(x):

Is -g always a descent direction? Nope

[Boyd & Vandenberghe]

But a small step along -g always moves closer to the global minimizer x*

Always closer to optimal point!

If f is convex, f(z) < f(x), and g ∈ ∂f(x), then for a small t > 0:

||x - t g - z||_2 < ||x - z||_2

i.e., -g is a descent direction for ||x - z||_2. This holds for any z s.t. f(z) < f(x), e.g., z = x*.

Proof

||x - t g - z||_2^2 = ||x - z||_2^2 - 2t g^T (x - z) + t^2 ||g||_2^2

Convexity gives f(z) >= f(x) + g^T (z - x), so g^T (x - z) >= f(x) - f(z) > 0, and for small enough t > 0 the negative middle term dominates the t^2 term.
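The claim is easy to verify numerically. A sketch with an assumed example f(x1, x2) = |x1| + 10|x2|: the step along -g overshoots the x2 kink, so f increases, yet the distance to the minimizer still shrinks.

```python
# f convex, g a subgradient at x, f(z) < f(x): a small step x - t*g moves
# strictly closer to z even when -g is NOT a descent direction for f itself.
# Illustrative example: f(x1, x2) = |x1| + 10*|x2|, minimized at z = (0, 0).

def f(x):
    return abs(x[0]) + 10 * abs(x[1])

def sign(v):
    return (v > 0) - (v < 0)

x = [1.0, 0.1]
g = [sign(x[0]), 10 * sign(x[1])]       # g = (1, 10) is a subgradient at x
z = [0.0, 0.0]                          # the minimizer; f(z) = 0 < f(x) = 2

t = 0.03
x_new = [x[i] - t * g[i] for i in range(2)]   # x_new = (0.97, -0.2)

dist = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5
assert dist(x_new, z) < dist(x, z)      # strictly closer to the minimizer
assert f(x_new) > f(x)                  # yet f went UP: -g is not a descent direction
```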

Outline

Subgradients
Subgradients for convex optimization


Subgradient method
Projected subgradient method
Stochastic subgradient method

Examples

Subgradient method

f is convex and non-differentiable. Subgradient method:

x^(k+1) = x^(k) - t_k g^(k), where g^(k) ∈ ∂f(x^(k)) and t_k > 0 is the step size
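A minimal runnable sketch of the method, assuming diminishing steps t_k = 1/(k+1) and tracking the best point seen (f need not decrease at every step). The objective f(x) = |x - 3| + |x + 1| is our illustrative choice; its minimum value 4 is attained on [-1, 3]:

```python
# Subgradient method on the nondifferentiable f(x) = |x - 3| + |x + 1|,
# using diminishing steps t_k = 1/(k+1) and tracking the best iterate.

def f(x):
    return abs(x - 3) + abs(x + 1)

def subgrad(x):
    s = lambda v: (v > 0) - (v < 0)   # a subgradient of |.| (0 at the kink)
    return s(x - 3) + s(x + 1)

x = 10.0
f_best = f(x)
for k in range(200):
    x = x - (1.0 / (k + 1)) * subgrad(x)   # step along a negative subgradient
    f_best = min(f_best, f(x))             # the method is not a descent method

assert abs(f_best - 4.0) < 1e-6            # reaches the optimal value 4
```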

Why converge?

For a convex function f and g ∈ ∂f(x), -g is not always a descent direction at x

[Boyd & Vandenberghe]

Why converge?

Key point 1: the Euclidean distance to the minimizer x* always decreases

f convex, f(z) < f(x), g ∈ ∂f(x), and small t > 0:

||x - t g - z||_2 < ||x - z||_2

Let z = x* and assume ||g||_2 is bounded

Why converge?

Key point 2: convergence in ||x - x*||_2 indicates convergence in f(x) - f(x*)

f convex, a subgradient g exists at any x:

f(x*) >= f(x) + g^T (x* - x)

f(x) - f(x*) <= g^T (x - x*) <= ||g||_2 ||x - x*||_2, assuming ||g||_2 bounded

Proof in class

Projected subgradient method

For min f(x) s.t. x ∈ C, with C a convex set:

x^(k+1) = P_C( x^(k) - t_k g^(k) ), where P_C is the Euclidean projection onto C
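A minimal sketch of the projected subgradient method, assuming the illustrative constraint set C = [-1, 2] (so projection is just clipping) and objective f(x) = |x - 5|; the constrained minimizer is x* = 2 with f(x*) = 3:

```python
# Projected subgradient method: step along -g, then project back onto C.
# Here C = [-1, 2] and f(x) = |x - 5|, so the constrained optimum is x* = 2.

def f(x):
    return abs(x - 5)

def subgrad(x):
    return 1.0 if x > 5 else -1.0

def project(x):
    # Euclidean projection onto the interval C = [-1, 2] is clipping
    return max(-1.0, min(2.0, x))

x = -1.0
f_best = f(x)
for k in range(500):
    x = project(x - (1.0 / (k + 1)) * subgrad(x))
    f_best = min(f_best, f(x))

assert abs(f_best - 3.0) < 0.01     # reaches the constrained optimum f(2) = 3
```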

Why converge?

Key point 1: -g is a descent direction for ||x - x*||_2 at x

Key point 2: projection decreases ||x - x*||_2

Key point 3: convergence in ||x - x*||_2 indicates convergence in f(x) - f(x*)

Stochastic subgradient method

Consider the problem

f depends on a set of training examples in D:

A very large set, say, 10^8 examples
Coming one by one, i.e., data streams

Expensive/impossible to evaluate f and calculate any subgradient

Stochastic subgradient method

Update with an estimated subgradient:

x^(k+1) = x^(k) - t_k g^(k)


Similar to the subgradient method, except that g^(k) is an unbiased estimate of a subgradient at x^(k):

E[ g^(k) ] ∈ ∂f( x^(k) )

E.g., g^(k) depends on a random subset of the examples

Why converge?

Proof similar to the deterministic case, but take expectations of both sides
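A minimal sketch of the stochastic subgradient method, assuming f is an average of per-example losses f_i(x) = |x - d_i| over an illustrative data set: sampling one example per step gives an unbiased subgradient estimate, and f is minimized at the median of the data.

```python
import random

# Stochastic subgradient method for f(x) = (1/n) * sum_i |x - d_i|.
# One random example per step: E[g] = (1/n) * sum_i sign(x - d_i), which is
# exactly a subgradient of f, so the estimate is unbiased.

data = [1.0, 2.0, 3.0, 4.0, 5.0]      # f is minimized at the median, x* = 3
random.seed(0)

def f(x):
    return sum(abs(x - d) for d in data) / len(data)

x = 6.0
f_best = f(x)
for k in range(5000):
    d = random.choice(data)           # one random example per step
    g = 1.0 if x > d else (-1.0 if x < d else 0.0)   # subgradient of |x - d|
    x = x - g / (k + 1) ** 0.5        # diminishing step t_k = 1/sqrt(k+1)
    f_best = min(f_best, f(x))

assert f_best < f(3.0) + 0.3          # near the optimal value f(3) = 1.2
```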

Outline

Subgradients
Subgradients for convex optimization


Subgradient method
Projected subgradient method
Stochastic subgradient method

Examples

Example 1: minimize a pointwise maximum

Minimize: f(x) = max_{i=1,...,m} (a_i^T x + b_i). How do we calculate a subgradient of f?

Recall: subgradients of pointwise maximum functions

Finite pointwise maximum: if

f(x) = max_{i=1,...,m} f_i(x)

then ∂f(x) is the convex hull of the union of the subdifferentials of all active functions:

∂f(x) = conv( ∪ { ∂f_i(x) : f_i(x) = f(x) } )

Example 1: minimize a pointwise maximum

Minimize: f(x) = max_{i=1,...,m} (a_i^T x + b_i)

Find an active function j at the current x (one with a_j^T x + b_j = f(x)); then g = a_j is a subgradient. We just need one subgradient (not the whole subdifferential). Apply the subgradient method.
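A runnable sketch of this example, assuming three illustrative pieces x - 1, -x - 1 and 0 (so min f = 0, attained on [-1, 1]); each step finds one active index and uses its slope as the subgradient:

```python
# Minimize f(x) = max_i (a_i*x + b_i) by the subgradient method, taking the
# slope of one active (maximizing) piece at each step.

a = [1.0, -1.0, 0.0]
b = [-1.0, -1.0, 0.0]          # pieces: x - 1, -x - 1, 0; min f = 0 on [-1, 1]

def f(x):
    return max(ai * x + bi for ai, bi in zip(a, b))

def subgrad(x):
    vals = [ai * x + bi for ai, bi in zip(a, b)]
    j = vals.index(max(vals))  # index of one active function
    return a[j]                # its slope is a subgradient

x = 5.0
f_best = f(x)
for k in range(100):
    x = x - (1.0 / (k + 1)) * subgrad(x)
    f_best = min(f_best, f(x))

assert abs(f_best) < 1e-9      # reaches the optimal value 0
```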

Example 2: Max-margin Markov Networks (M3N)

Max-margin Markov networks:

min_w (λ/2) ||w||^2 + Σ_i ξ_i
s.t. w^T f_i(y_i) >= Δ(y_i, y) + w^T f_i(y) - ξ_i, for all i and all labelings y

where f_i(y) is short for f(x_i, y). Constraint generation? Or subgradient?

Exponentially many constraints (one per labeling y)


Example 2: Max-margin Markov Networks (M3N)

Max-margin Markov networks

Problem equivalent to:

min_w (λ/2) ||w||^2 + Σ_i ξ_i
s.t. ξ_i >= max_y [ Δ(y_i, y) + w^T f_i(y) ] - w^T f_i(y_i), for all i

Example 2: Max-margin Markov Networks (M3N)

Max-margin Markov networks

An equivalent unconstrained problem:

min_w (λ/2) ||w||^2 + Σ_i r_i(w), where r_i(w) = max_y [ Δ(y_i, y) + w^T f_i(y) ] - w^T f_i(y_i)

Example 2: Max-margin Markov Networks (M3N)

An equivalent unconstrained problem:

Directly calculate a subgradient of the objective: the (λ/2)||w||^2 term contributes λw; for each r_i(w), find a most violated labeling y*_i = argmax_y [ Δ(y_i, y) + w^T f_i(y) ] (separation oracle!), and then f_i(y*_i) - f_i(y_i) ∈ ∂r_i(w)

Subgradient method!
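The M3N subgradient recipe can be sketched on a toy problem. Everything below is illustrative, not the M3N setup itself: a 3-label multiclass problem stands in for the exponential label space, phi(x, y) copies the 2-D input into the block of label y, Delta is 0/1 loss, and the separation oracle is a plain argmax over the three labels.

```python
# Toy stand-in for the M3N subgradient method: minimize
# (lam/2)||w||^2 + (1/n) sum_i max_y [Delta(y_i,y) + w.phi(x_i,y)] - w.phi(x_i,y_i)
# where the separation oracle returns the loss-augmented argmax label.

data = [((1.0, 0.0), 0), ((0.9, 0.1), 0),
        ((0.0, 1.0), 1), ((0.1, 0.9), 1),
        ((-1.0, 0.0), 2), ((-0.9, -0.1), 2)]
K, D, lam = 3, 2, 0.01
w = [0.0] * (K * D)                     # phi(x, y) puts x into block y of w

def score(w, x, y):
    return sum(w[y * D + j] * x[j] for j in range(D))

def oracle(w, x, y_true):
    # separation oracle: most violated label under the loss-augmented score
    return max(range(K), key=lambda y: (y != y_true) + score(w, x, y))

for k in range(500):
    g = [lam * wi for wi in w]          # gradient of the (lam/2)||w||^2 term
    for x, y in data:
        ystar = oracle(w, x, y)
        for j in range(D):
            g[ystar * D + j] += x[j] / len(data)   # + phi(x, y*)
            g[y * D + j] -= x[j] / len(data)       # - phi(x, y_i)
    t = 1.0 / (k + 1) ** 0.5
    w = [wi - t * gi for wi, gi in zip(w, g)]

predict = lambda x: max(range(K), key=lambda y: score(w, x, y))
assert all(predict(x) == y for x, y in data)       # fits the training set
```

The same skeleton applies to real M3N; only the oracle changes, from an argmax over 3 labels to dynamic programming over exponentially many labelings.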

Two methods to optimize M3N

Exponentially many constraints:

Constraint generation: start from a few constraints and gradually add violated constraints (find the most violated constraint)
Subgradient: write all constraints into the objective function and calculate a subgradient (find an active function in the pointwise-maximum term)

Both methods use the separation oracle
