
Subgradient Method

Yi Zhang

Outline

Subgradients
Subgradients for convex optimization


Subgradient method
Projected subgradient method
Stochastic subgradient method

Examples


Differentiable functions

When f is convex and differentiable, the first-order approximation of f at x is a global underestimate of f:

f(y) >= f(x) + ∇f(x)^T (y - x) for all x, y in dom(f)

[Boyd & Vandenberghe]

What if f is not differentiable?

g is a subgradient of f (convex or not) at x if

f(y) >= f(x) + g^T (y - x) for all y

It provides a global underestimate of f, and it is not necessarily unique at x.

[Boyd & Vandenberghe]

Subdifferential of f

The subdifferential of f at x is the set of all subgradients of f at x, denoted ∂f(x). It is a convex set, and it can be an empty set (if f is not convex).
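The subgradient inequality is easy to check numerically. A minimal sketch (the helper `is_subgradient` is ours) for f(x) = |x|, whose subdifferential at 0 is the interval [-1, 1]:

```python
# Numerically check the subgradient inequality f(y) >= f(x) + g*(y - x)
# for f(x) = |x|.  At x = 0 the subdifferential is the interval [-1, 1];
# away from 0 it is the single slope sign(x).

def is_subgradient(f, g, x, ys):
    """True if g satisfies the subgradient inequality at x over test points ys."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)

ys = [i / 10.0 for i in range(-30, 31)]      # test points in [-3, 3]

assert is_subgradient(abs, 0.5, 0.0, ys)      # any g in [-1, 1] works at x = 0
assert is_subgradient(abs, -1.0, 0.0, ys)
assert not is_subgradient(abs, 1.5, 0.0, ys)  # g outside [-1, 1] fails
assert is_subgradient(abs, 1.0, 2.0, ys)      # at x > 0 the only subgradient is +1
assert not is_subgradient(abs, 0.5, 2.0, ys)
```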

Subgradients of convex functions

If f is convex:

∂f(x) is non-empty for all x in dom(f)
If f is differentiable at x, ∂f(x) = {∇f(x)}
If ∂f(x) contains only one element g, then f is differentiable at x, and g = ∇f(x)

[Boyd & Vandenberghe]

Basic rules for calculating subgradients

Addition: ∂(f1 + f2)(x) = ∂f1(x) + ∂f2(x)
Scaling: ∂(a f)(x) = a ∂f(x), for a > 0
Affine transform: if h(x) = f(Ax + b), then ∂h(x) = A^T ∂f(Ax + b)
Chain rule: composition rules analogous to the smooth case

Note: these are operations over sets
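The addition rule can be illustrated concretely. A small sketch (the helper `check` is ours) for f1(x) = |x| and f2(x) = |x - 1|: at x = 0 the subdifferentials are [-1, 1] and {-1}, so the sum's subdifferential is the set sum [-2, 0]:

```python
# Illustrate the addition rule: g1 + g2 is a subgradient of f1 + f2 whenever
# g1 is a subgradient of f1 and g2 of f2.  Here f(x) = |x| + |x - 1|, whose
# subdifferential at x = 0 is [-1, 1] + {-1} = [-2, 0].

def check(g, x, ys, f):
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)

f = lambda x: abs(x) + abs(x - 1)
ys = [i / 10.0 for i in range(-30, 41)]

assert check(-2.0, 0.0, ys, f)     # endpoint: g1 = -1, g2 = -1
assert check(0.0, 0.0, ys, f)      # endpoint: g1 = +1, g2 = -1
assert check(-1.0, 0.0, ys, f)     # interior point of [-2, 0]
assert not check(0.5, 0.0, ys, f)  # outside [-2, 0] violates the inequality
```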

Basic rules for calculating subgradients

Finite pointwise maximum: if

f(x) = max_{i=1,...,m} f_i(x)

then ∂f(x) is the convex hull of the union of the subdifferentials of all active functions:

∂f(x) = conv( ∪ { ∂f_i(x) : f_i(x) = f(x) } )
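A quick numerical illustration of the rule, assuming the simplest case f(x) = max(x, -x) = |x|: the slope of any active piece is a valid subgradient.

```python
# For f(x) = max(x, -x) = |x|, the rule says a subgradient at x is the slope
# of any active piece (and at the kink, any point of their convex hull).

def f(x):
    return max(x, -x)

def subgrad(x):
    # slope of one active affine piece (ties broken toward the first, +x)
    return 1.0 if x >= -x else -1.0

ys = [i / 10.0 for i in range(-30, 31)]
for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    g = subgrad(x)
    # the chosen slope satisfies the subgradient inequality everywhere
    assert all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)
```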

Optimality conditions

If f is convex and differentiable: x* minimizes f iff ∇f(x*) = 0

If f is convex but non-differentiable: x* minimizes f iff 0 ∈ ∂f(x*)

Is g a descent direction?

For a convex function f and g ∈ ∂f(x):

Is -g always a descent direction? Nope

[Boyd & Vandenberghe]

But a small step along -g always moves closer to the global minimizer x*

Always closer to optimal point!

If f is convex, f(z) < f(x), and g ∈ ∂f(x), then for a small t > 0:

||x - t g - z||_2 < ||x - z||_2

i.e., -g is a descent direction for ||x - z||_2. This holds for any z s.t. f(z) < f(x), e.g., z = x*.

Proof

||x - t g - z||_2^2 = ||x - z||_2^2 - 2t g^T (x - z) + t^2 ||g||_2^2

Convexity gives f(z) >= f(x) + g^T (z - x), so g^T (x - z) >= f(x) - f(z) > 0, and for small enough t > 0 the negative middle term dominates the t^2 term.
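The claim is easy to verify numerically. A sketch with an assumed example f(x1, x2) = |x1| + 10|x2|: the step along -g overshoots the x2 kink, so f increases, yet the distance to the minimizer still shrinks.

```python
# f convex, g a subgradient at x, f(z) < f(x): a small step x - t*g moves
# strictly closer to z even when -g is NOT a descent direction for f itself.
# Illustrative example: f(x1, x2) = |x1| + 10*|x2|, minimized at z = (0, 0).

def f(x):
    return abs(x[0]) + 10 * abs(x[1])

def sign(v):
    return (v > 0) - (v < 0)

x = [1.0, 0.1]
g = [sign(x[0]), 10 * sign(x[1])]       # g = (1, 10) is a subgradient at x
z = [0.0, 0.0]                          # the minimizer; f(z) = 0 < f(x) = 2

t = 0.03
x_new = [x[i] - t * g[i] for i in range(2)]   # x_new = (0.97, -0.2)

dist = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5
assert dist(x_new, z) < dist(x, z)      # strictly closer to the minimizer
assert f(x_new) > f(x)                  # yet f went UP: -g is not a descent direction
```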

Outline

Subgradients
Subgradients for convex optimization


Subgradient method
Projected subgradient method
Stochastic subgradient method

Examples

Subgradient method

f is convex and non-differentiable. Subgradient method:

x^(k+1) = x^(k) - t_k g^(k), where g^(k) ∈ ∂f(x^(k)) and t_k > 0 is the step size
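A minimal runnable sketch of the method, assuming diminishing steps t_k = 1/(k+1) and tracking the best point seen (f need not decrease at every step). The objective f(x) = |x - 3| + |x + 1| is our illustrative choice; its minimum value 4 is attained on [-1, 3]:

```python
# Subgradient method on the nondifferentiable f(x) = |x - 3| + |x + 1|,
# using diminishing steps t_k = 1/(k+1) and tracking the best iterate.

def f(x):
    return abs(x - 3) + abs(x + 1)

def subgrad(x):
    s = lambda v: (v > 0) - (v < 0)   # a subgradient of |.| (0 at the kink)
    return s(x - 3) + s(x + 1)

x = 10.0
f_best = f(x)
for k in range(200):
    x = x - (1.0 / (k + 1)) * subgrad(x)   # step along a negative subgradient
    f_best = min(f_best, f(x))             # the method is not a descent method

assert abs(f_best - 4.0) < 1e-6            # reaches the optimal value 4
```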

Why converge?

For a convex function f and g ∈ ∂f(x), -g is not always a descent direction at x

[Boyd & Vandenberghe]

Why converge?

Key point 1: the Euclidean distance to the minimizer x* always decreases

f convex, f(z) < f(x), g ∈ ∂f(x), and small t > 0:

||x - t g - z||_2 < ||x - z||_2

Let z = x* and assume ||g||_2 is bounded

Why converge?

Key point 2: convergence in ||x - x*||_2 indicates convergence in f(x) - f(x*)

f convex, a subgradient g exists at any x:

f(x*) >= f(x) + g^T (x* - x)

f(x) - f(x*) <= g^T (x - x*) <= ||g||_2 ||x - x*||_2, assuming ||g||_2 bounded

Proof in class

Projected subgradient method

For min f(x) s.t. x ∈ C, with C a convex set:

x^(k+1) = P_C( x^(k) - t_k g^(k) ), where P_C is the Euclidean projection onto C
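A minimal sketch of the projected subgradient method, assuming the illustrative constraint set C = [-1, 2] (so projection is just clipping) and objective f(x) = |x - 5|; the constrained minimizer is x* = 2 with f(x*) = 3:

```python
# Projected subgradient method: step along -g, then project back onto C.
# Here C = [-1, 2] and f(x) = |x - 5|, so the constrained optimum is x* = 2.

def f(x):
    return abs(x - 5)

def subgrad(x):
    return 1.0 if x > 5 else -1.0

def project(x):
    # Euclidean projection onto the interval C = [-1, 2] is clipping
    return max(-1.0, min(2.0, x))

x = -1.0
f_best = f(x)
for k in range(500):
    x = project(x - (1.0 / (k + 1)) * subgrad(x))
    f_best = min(f_best, f(x))

assert abs(f_best - 3.0) < 0.01     # reaches the constrained optimum f(2) = 3
```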

Why converge?

Key point 1: -g is a descent direction for ||x - x*||_2 at x

Key point 2: projection decreases ||x - x*||_2

Key point 3: convergence in ||x - x*||_2 indicates convergence in f(x) - f(x*)

Stochastic subgradient method

Consider the problem

f depends on a set of training examples in D:

A very large set, say, 10^8 examples
Coming one by one, i.e., data streams

Expensive/impossible to evaluate f and calculate any subgradient

Stochastic subgradient method

Update with an estimated subgradient:

x^(k+1) = x^(k) - t_k g^(k)


Similar to the subgradient method, except that g^(k) is an unbiased estimate of a subgradient at x^(k):

E[ g^(k) ] ∈ ∂f( x^(k) )

E.g., g^(k) depends on a random subset of the examples

Why converge?

Proof similar to the deterministic case, but take expectations of both sides
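A minimal sketch of the stochastic subgradient method, assuming f is an average of per-example losses f_i(x) = |x - d_i| over an illustrative data set: sampling one example per step gives an unbiased subgradient estimate, and f is minimized at the median of the data.

```python
import random

# Stochastic subgradient method for f(x) = (1/n) * sum_i |x - d_i|.
# One random example per step: E[g] = (1/n) * sum_i sign(x - d_i), which is
# exactly a subgradient of f, so the estimate is unbiased.

data = [1.0, 2.0, 3.0, 4.0, 5.0]      # f is minimized at the median, x* = 3
random.seed(0)

def f(x):
    return sum(abs(x - d) for d in data) / len(data)

x = 6.0
f_best = f(x)
for k in range(5000):
    d = random.choice(data)           # one random example per step
    g = 1.0 if x > d else (-1.0 if x < d else 0.0)   # subgradient of |x - d|
    x = x - g / (k + 1) ** 0.5        # diminishing step t_k = 1/sqrt(k+1)
    f_best = min(f_best, f(x))

assert f_best < f(3.0) + 0.3          # near the optimal value f(3) = 1.2
```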

Outline

Subgradients
Subgradients for convex optimization


Subgradient method
Projected subgradient method
Stochastic subgradient method

Examples

Example 1: minimize a pointwise maximum

Minimize: f(x) = max_{i=1,...,m} (a_i^T x + b_i). How do we calculate a subgradient of f?

Recall: subgradients of pointwise maximum functions

Finite pointwise maximum: if

f(x) = max_{i=1,...,m} f_i(x)

then ∂f(x) is the convex hull of the union of the subdifferentials of all active functions:

∂f(x) = conv( ∪ { ∂f_i(x) : f_i(x) = f(x) } )

Example 1: minimize a pointwise maximum

Minimize: f(x) = max_{i=1,...,m} (a_i^T x + b_i)

Find an active function j at the current x (one with a_j^T x + b_j = f(x)); then g = a_j is a subgradient. We just need one subgradient (not the whole subdifferential). Apply the subgradient method.
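A runnable sketch of this example, assuming three illustrative pieces x - 1, -x - 1 and 0 (so min f = 0, attained on [-1, 1]); each step finds one active index and uses its slope as the subgradient:

```python
# Minimize f(x) = max_i (a_i*x + b_i) by the subgradient method, taking the
# slope of one active (maximizing) piece at each step.

a = [1.0, -1.0, 0.0]
b = [-1.0, -1.0, 0.0]          # pieces: x - 1, -x - 1, 0; min f = 0 on [-1, 1]

def f(x):
    return max(ai * x + bi for ai, bi in zip(a, b))

def subgrad(x):
    vals = [ai * x + bi for ai, bi in zip(a, b)]
    j = vals.index(max(vals))  # index of one active function
    return a[j]                # its slope is a subgradient

x = 5.0
f_best = f(x)
for k in range(100):
    x = x - (1.0 / (k + 1)) * subgrad(x)
    f_best = min(f_best, f(x))

assert abs(f_best) < 1e-9      # reaches the optimal value 0
```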

Example 2: Max-margin Markov Networks (M3N)

Max-margin Markov networks:

min_w (λ/2) ||w||^2 + Σ_i ξ_i
s.t. w^T f_i(y_i) >= Δ(y_i, y) + w^T f_i(y) - ξ_i, for all i and all labelings y

where f_i(y) is short for f(x_i, y). Constraint generation? Or subgradient?

Exponentially many constraints (one per labeling y)


Example 2: Max-margin Markov Networks (M3N)

Max-margin Markov networks

Problem equivalent to:

min_w (λ/2) ||w||^2 + Σ_i ξ_i
s.t. ξ_i >= max_y [ Δ(y_i, y) + w^T f_i(y) ] - w^T f_i(y_i), for all i

Example 2: Max-margin Markov Networks (M3N)

Max-margin Markov networks

An equivalent unconstrained problem:

min_w (λ/2) ||w||^2 + Σ_i r_i(w), where r_i(w) = max_y [ Δ(y_i, y) + w^T f_i(y) ] - w^T f_i(y_i)

Example 2: Max-margin Markov Networks (M3N)

An equivalent unconstrained problem:

Directly calculate a subgradient of the objective: the (λ/2)||w||^2 term contributes λw; for each r_i(w), find a most violated labeling y*_i = argmax_y [ Δ(y_i, y) + w^T f_i(y) ] (separation oracle!), and then f_i(y*_i) - f_i(y_i) ∈ ∂r_i(w)

Subgradient method!
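The M3N subgradient recipe can be sketched on a toy problem. Everything below is illustrative, not the M3N setup itself: a 3-label multiclass problem stands in for the exponential label space, phi(x, y) copies the 2-D input into the block of label y, Delta is 0/1 loss, and the separation oracle is a plain argmax over the three labels.

```python
# Toy stand-in for the M3N subgradient method: minimize
# (lam/2)||w||^2 + (1/n) sum_i max_y [Delta(y_i,y) + w.phi(x_i,y)] - w.phi(x_i,y_i)
# where the separation oracle returns the loss-augmented argmax label.

data = [((1.0, 0.0), 0), ((0.9, 0.1), 0),
        ((0.0, 1.0), 1), ((0.1, 0.9), 1),
        ((-1.0, 0.0), 2), ((-0.9, -0.1), 2)]
K, D, lam = 3, 2, 0.01
w = [0.0] * (K * D)                     # phi(x, y) puts x into block y of w

def score(w, x, y):
    return sum(w[y * D + j] * x[j] for j in range(D))

def oracle(w, x, y_true):
    # separation oracle: most violated label under the loss-augmented score
    return max(range(K), key=lambda y: (y != y_true) + score(w, x, y))

for k in range(500):
    g = [lam * wi for wi in w]          # gradient of the (lam/2)||w||^2 term
    for x, y in data:
        ystar = oracle(w, x, y)
        for j in range(D):
            g[ystar * D + j] += x[j] / len(data)   # + phi(x, y*)
            g[y * D + j] -= x[j] / len(data)       # - phi(x, y_i)
    t = 1.0 / (k + 1) ** 0.5
    w = [wi - t * gi for wi, gi in zip(w, g)]

predict = lambda x: max(range(K), key=lambda y: score(w, x, y))
assert all(predict(x) == y for x, y in data)       # fits the training set
```

The same skeleton applies to real M3N; only the oracle changes, from an argmax over 3 labels to dynamic programming over exponentially many labelings.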

Two methods to optimize M3N

Exponentially many constraints:

Constraint generation: start from a few constraints and gradually add violated constraints (find the most violated constraint)
Subgradient: write all constraints into the objective function and calculate a subgradient (find an active function in the pointwise-maximum term)

Both methods use the separation oracle
