
2011-10-25 @ RAMP 2011

1 / 37

Solvers for sparse optimization:

- CVX (Grant & Boyd): general-purpose interior-point solver (1980s–)
- Methods from the 1960s–70s, revisited:
  - (Accelerated) proximal gradient methods
  - Dual decomposition (Uzawa's method)
  - Alternating Direction Method of Multipliers (ADMM)

2 / 37


Sparse estimation: applications

- Feature selection (SNP data, etc.)
- Signal/image reconstruction (MRI)
- Collaborative filtering

[Figure: partially observed Users × Movies rating matrix]

3 / 37

Example 1: SNP data

x_i: input (SNP features); y_i = ±1: label.
Goal: predict y_i from x_i (binary classification, y_i ∈ {−1, +1}).

minimize_{w ∈ Rⁿ}  Σ_{i=1}^m log(1 + exp(−y_i⟨x_i, w⟩))  +  λ‖w‖₁
                        (data fit)                     (regularization)

The data-fit term uses the logistic loss f(z) = log(1 + exp(−z)).
Typical SNP scale: n = 500,000 features, m = 5,000 samples.
Probabilistic (MAP) interpretation:

log(1 + e^{−yz}) = −log P(Y = y | z),  where P(Y = +1 | z) = e^z / (1 + e^z).

[Figure: sigmoid P(Y = +1 | z) plotted against z]

4 / 37
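The objective above can be evaluated directly; a minimal numpy sketch (the function and variable names are mine, not from the slides):

```python
import numpy as np

def logistic_l1_objective(w, X, y, lam):
    """Objective: sum_i log(1 + exp(-y_i <x_i, w>)) + lam * ||w||_1.

    X: (m, n) design matrix, y: (m,) labels in {-1, +1}.
    """
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    loss = np.logaddexp(0.0, -margins).sum()
    return loss + lam * np.abs(w).sum()
```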

Example 2: compressed sensing [Candès, Romberg, & Tao 06]

e.g., MRI reconstruction.

minimize_{w ∈ Rⁿ}  (1/2)‖y − Φw‖₂² + λ‖w‖₁

y: observations; w: signal to recover; Φ: Rⁿ → Rᵐ: observation operator.
With a synthesis basis A (e.g., wavelets):

minimize_{w ∈ Rⁿ}  (1/2)‖y − ΦAw‖₂² + λ‖w‖₁

(A = I recovers the problem above.)

5 / 37

Example 3: matrix completion [Fazel+ 01; Srebro+ 05]

Predict the unobserved entries of a matrix Y from the observed ones:

minimize_X  (1/2) Σ_{(i,j) observed} (X_ij − Y_ij)² + λ‖X‖_{S1}

where  ‖X‖_{S1} := Σ_{j=1}^r σ_j(X)   (Schatten 1-norm; σ_j: singular values).

[Figure: partially observed Users × Movies rating matrix]

6 / 37

Example 4: Tucker decomposition [Tucker 66]

X_ijk = Σ_{a=1}^{r₁} Σ_{b=1}^{r₂} Σ_{c=1}^{r₃} C_abc U_ia^(1) U_jb^(2) U_kc^(3)

C: core tensor; U^(1), U^(2), U^(3): factor matrices.

7 / 37

The examples share a common structure:

minimize_w  L(w) + λ‖w‖₁

- SNP selection: logistic loss + ℓ₁
- Compressed sensing: squared loss + ℓ₁ (or total variation)
- Matrix/tensor completion: squared loss + Schatten 1-norm (Tucker)

8 / 37

Outline:

- Proximal gradient method
- Dual Augmented Lagrangian (DAL)
- Alternating Direction Method of Multipliers (ADMM)

9 / 37


Proximal gradient method and Dual Augmented Lagrangian (DAL)

10 / 37

Proximal gradient method

minimize_w  L(w) + λ‖w‖₁
            (smooth)  (nonsmooth)

Linearize the smooth loss around w^t and add a proximity term:

w^{t+1} = argmin_w { ⟨∇L(w^t), w − w^t⟩ + (1/(2η_t))‖w − w^t‖₂² + λ‖w‖₁ }
        = argmin_w { λ‖w‖₁ + (1/(2η_t))‖w − (w^t − η_t∇L(w^t))‖₂² }
        = prox_{λη_t}( w^t − η_t∇L(w^t) ).

11 / 37

Proximal operator:

prox_g(z) = argmin_x { g(x) + (1/2)‖x − z‖² }

Example: for the indicator function δ_C of a convex set C, prox_{δ_C}(z) = proj_C(z).

Soft-thresholding (g(x) = λ‖x‖₁):

prox_λ(z) = argmin_x { λ‖x‖₁ + (1/2)‖x − z‖² }

componentwise:
            ⎧ z_j + λ   (z_j < −λ),
(prox_λ)_j =⎨ 0         (−λ ≤ z_j ≤ λ),
            ⎩ z_j − λ   (z_j > λ).

Also written ST_λ(z).

12 / 37
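In code, the soft-thresholding operator above is one line (numpy assumed; the function name is mine):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding ST_lam(z): the prox of lam * ||.||_1, componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```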

Proximal gradient method
(Lions & Mercier 79; Figueiredo & Nowak 03; Daubechies+ 04; ...)

1. Initialize w^0.
2. Iterate:  w^{t+1} = prox_{λη_t}( w^t − η_t∇L(w^t) ).
                       (prox step)  (gradient step on L)

Also known as: Forward-Backward splitting,
Iterative Shrinkage/Thresholding.

13 / 37
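The two-step iteration above, specialized to the lasso (squared loss), can be sketched as follows (numpy assumed; the names `ista` and `soft_threshold` are mine):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(A, y, lam, eta, n_iter=500):
    """Proximal gradient (ISTA) for (1/2)||y - A w||^2 + lam * ||w||_1.

    eta should satisfy eta <= 1 / ||A^T A|| (inverse Lipschitz constant).
    """
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)                    # gradient step on the smooth loss
        w = soft_threshold(w - eta * grad, lam * eta)  # prox step
    return w
```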

Convergence: if ∇L is Lipschitz,

‖∇L(x) − ∇L(y)‖ ≤ H‖x − y‖,

and the step size satisfies η_t ≤ 1/H, then

f(x^k) − f(x*) ≤ H‖x^0 − x*‖₂² / (2k).

The accelerated variant achieves O(1/k²) (Nesterov 07; Beck & Teboulle 09).

14 / 37
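A sketch of the accelerated O(1/k²) variant (FISTA, Beck & Teboulle 09) for the same lasso objective; the momentum recursion follows their paper, while the variable names are mine:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista(A, y, lam, n_iter=500):
    """Accelerated proximal gradient (FISTA) for (1/2)||y - A w||^2 + lam * ||w||_1."""
    H = np.linalg.norm(A.T @ A, 2)       # Lipschitz constant of the gradient
    eta = 1.0 / H
    w = np.zeros(A.shape[1])
    v = w.copy()                          # extrapolated point
    s = 1.0                               # momentum parameter t_k
    for _ in range(n_iter):
        w_new = soft_threshold(v - eta * (A.T @ (A @ v - y)), lam * eta)
        s_new = (1 + np.sqrt(1 + 4 * s ** 2)) / 2
        v = w_new + ((s - 1) / s_new) * (w_new - w)   # momentum extrapolation
        w, s = w_new, s_new
    return w
```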

Dual Augmented Lagrangian (DAL) [Tomioka & Sugiyama 09]

Assume the loss separates as L(w) = f_ℓ(Aw), with f_ℓ a loss function and A ∈ R^{m×n} a design matrix. Examples:

(squared loss)   L(w) = (1/2)‖y − Aw‖₂²
(logistic loss)  L(w) = Σ_{i=1}^m log(1 + exp(−y_i⟨x_i, w⟩))

15 / 37

Dual Augmented Lagrangian (DAL)

Primal and Fenchel dual:

min_w  f_ℓ(Aw) + λ‖w‖₁      ⟷      max_{α,v}  −f_ℓ*(−α) − δ_λ(v)   s.t.  v = A⊤α
       (=: f(w))

(δ_λ: indicator of the ℓ∞-ball ‖v‖∞ ≤ λ, the conjugate of λ‖·‖₁)

Proximal minimization [Rockafellar 76]:

w^{t+1} = argmin_w { f(w) + (1/(2η_t))‖w − w^t‖² }      (η_t > 0)

⇒  f(w^{t+1}) + (1/(2η_t))‖w^{t+1} − w^t‖² ≤ f(w^t).

Augmented Lagrangian on the dual [Powell 69; Hestenes 69]:

α^t = argmin_α φ_t(α),      w^{t+1} = prox_{λη_t}( w^t + η_t A⊤α^t ),

where φ_t is the augmented Lagrangian function of the dual problem.
The two views are equivalent [Rockafellar 76].

16 / 37

Dual Augmented Lagrangian (ℓ₁ case)

1. Initialize w^0.
2. Iterate:

w^{t+1} = prox_{λη_t}( w^t + η_t A⊤α^t ),

where  α^t = argmin_{α ∈ Rᵐ} { f_ℓ*(−α) + (1/(2η_t))‖prox_{λη_t}(w^t + η_t A⊤α)‖₂² }.

17 / 37

DAL (ℓ₁ case)

(1) Prox step:

w^{t+1} = prox_{λη_t}( w^t + η_t A⊤α^t )

(2) Inner minimization:

α^t = argmin_α { f_ℓ*(−α) + (1/(2η_t))‖prox_{λη_t}(w^t + η_t A⊤α)‖² }
        (loss conjugate)        (smooth, despite the prox)

The inner objective is differentiable in α and can be minimized with, e.g., Newton's method.

18 / 37
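A minimal DAL sketch for the squared loss f_ℓ(z) = (1/2)‖z − y‖², whose conjugate gives f_ℓ*(−α) = (1/2)‖α‖² − y⊤α; the inner objective is then minimized with scipy's L-BFGS-B using its gradient α − y + A·prox_{λη}(w^t + ηA⊤α). Variable names and the choice of inner solver are mine:

```python
import numpy as np
from scipy.optimize import minimize

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def dal_squared_loss(A, y, lam, eta=10.0, n_outer=20):
    """DAL sketch for (1/2)||y - A w||^2 + lam * ||w||_1 (squared loss only)."""
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(n_outer):
        def phi(alpha):
            # inner objective: f*(-alpha) + (1/2 eta) ||prox(w + eta A^T alpha)||^2
            p = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
            return 0.5 * alpha @ alpha - y @ alpha + (p @ p) / (2 * eta)
        def grad(alpha):
            p = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
            return alpha - y + A @ p
        alpha = minimize(phi, np.zeros(m), jac=grad, method="L-BFGS-B").x
        w = soft_threshold(w + eta * (A.T @ alpha), lam * eta)  # prox step
    return w
```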

DAL

Why DAL? Exact proximation with respect to f:

w^{t+1} = argmin_w { f_ℓ(Aw) + λ‖w‖₁ + (1/(2η_t))‖w − w^t‖² }
             (f(w) = f_ℓ(Aw) + λ‖w‖₁)

The proximal gradient method linearizes the loss,

f_ℓ(Aw) ≈ f_ℓ(Aw^t) + (w − w^t)⊤A⊤∇f_ℓ(Aw^t),

which is accurate only near w^t, so w^{t+1} must stay close to w^t.

DAL keeps the loss exact through its convex conjugate,

f_ℓ(Aw) = max_{α ∈ Rᵐ} { −⟨w, A⊤α⟩ − f_ℓ*(−α) },

so w^{t+1} can move far from w^t.

19 / 37


[Figure: contour plots comparing proximal-gradient steps with DAL steps]

20 / 37

Convergence of DAL (exact inner minimization)

w^t: DAL iterates with the inner problem solved exactly (∇φ_t(α^t) = 0);
w*: minimizer of f. Assume f satisfies the growth condition

f(w^{t+1}) − f(w*) ≥ σ‖w^{t+1} − w*‖²      (t = 0, 1, 2, ...).

Then w^{t+1} converges to w*:

‖w^{t+1} − w*‖ ≤ (1/(1 + ση_t)) ‖w^t − w*‖.

Since η_t may grow across iterations, the convergence is super-linear.

21 / 37

Convergence of DAL (approximate inner minimization)

w^t: DAL iterates with the inner problem solved only approximately, up to

‖∇φ_t(α^t)‖ ≤ √(γ/η_t) ‖w^{t+1} − w^t‖      (1/γ: Lipschitz constant of ∇f_ℓ);

w*: minimizer of f. Under the same growth condition on f,

‖w^{t+1} − w*‖ ≤ (1/√(1 + 2ση_t)) ‖w^t − w*‖.

The stopping criterion is computable: it compares against ‖w^{t+1} − w^t‖, not against the unknown w*. Again, since η_t may grow, the rate is super-linear.

22 / 37

Proof sketch 1

Since w^{t+1} minimizes f(w) + (1/(2η_t))‖w − w^t‖²,

(w^t − w^{t+1})/η_t ∈ ∂f(w^{t+1}).

By convexity (cf. Beck & Teboulle 09),

f(w) − f(w^{t+1}) ≥ ⟨ (w^t − w^{t+1})/η_t , w − w^{t+1} ⟩.

[Figure: tangent inequality at w^{t+1}]

23 / 37

Proof sketch 2

With approximate inner minimization, the inequality holds up to the residual gradient of the inner problem:

f(w) − f(w^{t+1}) ≥ ⟨ (w^t − w^{t+1})/η_t , w − w^{t+1} ⟩ − (γ/2)‖∇φ_t(α^t)‖²

(1/γ: Lipschitz constant of ∇f_ℓ)

[Figure: tangent inequality with an error term at w^{t+1}]

24 / 37


Alternating Direction Method of Multipliers (ADMM)

25 / 37

Augmented Lagrangian method [Powell 69; Hestenes 69]

minimize_{x,z}  f(x) + λ‖z‖₁   s.t.  z = x.

Augmented Lagrangian:

L_η(x, z, α) = f(x) + λ‖z‖₁ + α⊤(z − x) + (η/2)‖z − x‖².

Iterate, minimizing jointly over x and z:

(x^{t+1}, z^{t+1}) = argmin_{x ∈ Rⁿ, z ∈ Rᵐ} L_{η_t}(x, z, α^t),
α^{t+1} = α^t + η_t (z^{t+1} − x^{t+1}).

Problem: the quadratic penalty couples x and z!

26 / 37

Alternating Direction Method of Multipliers (ADMM; Gabay & Mercier 76)

Minimize over x and z alternately instead of jointly:

L_η(x, z, α) = f(x) + λ‖z‖₁ + α⊤(z − x) + (η/2)‖z − x‖².

x-step:  x^{t+1} = argmin_{x ∈ Rⁿ} L_{η_t}(x, z^t, α^t).
z-step:  z^{t+1} = argmin_{z ∈ Rᵐ} L_{η_t}(x^{t+1}, z, α^t).
Multiplier update:  α^{t+1} = α^t + η_t (z^{t+1} − x^{t+1}).

27 / 37

ADMM (Gabay & Mercier 76)

Completing the square in L_η(x, z, α) = f(x) + λ‖z‖₁ + α⊤(z − x) + (η/2)‖z − x‖²:

x^{t+1} = argmin_{x ∈ Rⁿ} { f(x) + (η_t/2)‖z^t − x + α^t/η_t‖² },
z^{t+1} = argmin_{z ∈ Rᵐ} { λ‖z‖₁ + (η_t/2)‖z − x^{t+1} + α^t/η_t‖² }
        = prox_{λ/η_t}( x^{t+1} − α^t/η_t ),
α^{t+1} = α^t + η_t (z^{t+1} − x^{t+1}).

The z-step is a prox (soft-thresholding); the x-step handles the smooth part.
ADMM is equivalent to Douglas-Rachford splitting
(Lions & Mercier 79; Eckstein & Bertsekas 92).

28 / 37
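For the lasso, f(x) = (1/2)‖y − Ax‖², the x-step is a linear solve and the z-step is soft-thresholding. A sketch (numpy assumed; function names are mine):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def admm_lasso(A, y, lam, eta=1.0, n_iter=200):
    """ADMM for (1/2)||y - A x||^2 + lam * ||z||_1, s.t. z = x."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    alpha = np.zeros(n)                 # Lagrange multiplier
    M = A.T @ A + eta * np.eye(n)       # x-step system matrix (fixed across iterations)
    Aty = A.T @ y
    for _ in range(n_iter):
        # x-step: argmin (1/2)||y - A x||^2 + (eta/2)||z - x + alpha/eta||^2
        x = np.linalg.solve(M, Aty + eta * z + alpha)
        # z-step: prox of (lam/eta)||.||_1 at x - alpha/eta
        z = soft_threshold(x - alpha / eta, lam / eta)
        alpha = alpha + eta * (z - x)   # multiplier update
    return z
```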

Low-rank tensor estimation [Liu+ 09; Signoretto+ 10; Tomioka+ 10; Gandy+ 11]

Key idea: matricization (unfolding). A tensor X is flattened into a matrix X_(k) along each mode k; low Tucker rank means every unfolding X_(k) is a low-rank matrix.

29 / 37
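Mode-k matricization in numpy, using the convention that mode k becomes the rows (function name is mine):

```python
import numpy as np

def unfold(X, k):
    """Mode-k matricization X_(k): move mode k to the front, then flatten the rest."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
```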

Tensor completion via ADMM:

minimize_{x, z₁,...,z_K ∈ R^N}  (1/2)‖x_Ω − y‖² + Σ_{k=1}^K λ_k ‖Z_k‖_{S1},
s.t.  P_k x = z_k   (k = 1, ..., K),

where x is the vectorized tensor, x_Ω its restriction to the observed indices,
y ∈ R^M the observations (M ≤ N = n₁ n₂ ··· n_K), P_k the mode-k unfolding
operator (z_k = vec(Z_k)), with P_k⊤ P_k = I.

30 / 37

ADMM for tensor completion

Augmented Lagrangian:

L_η(x, {Z_k}ᴷ, {α_k}ᴷ) = (1/2)‖x_Ω − y‖²
    + Σ_{k=1}^K [ λ_k‖Z_k‖_{S1} + α_k⊤(P_k x − z_k) + (η/2)‖P_k x − z_k‖² ].

x-step: since each P_k is a permutation, the update costs O(N).
Z_k-step: the prox of the Schatten 1-norm, i.e., soft-thresholding of the singular values.

31 / 37
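The Schatten 1-norm prox (singular value soft-thresholding) in numpy (function name is mine):

```python
import numpy as np

def svt(Z, lam):
    """Prox of lam * ||.||_S1: soft-threshold the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```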

Experiment 1: generalization error

[Figure: generalization error vs. fraction of observed elements (0.1-0.9).
Methods: As a Matrix (mode 1), As a Matrix (mode 2), As a Matrix (mode 3),
Constraint, Mixture, Tucker (large), Tucker (exact);
horizontal line: optimization tolerance.]

The "Constraint" formulation recovers the tensor from roughly 35% observed entries; Tucker (fitted with EM) is also OK.

32 / 37

Experiment 2: computation time

[Figure: computation time (s, 0-50) vs. fraction of observed elements (0.1-0.9).
Methods: As a Matrix, Constraint, Mixture, Tucker (large), Tucker (exact).]

33 / 37

Further pointers:

- Stochastic Optimization in Machine Learning (Nathan Srebro, tutorial at ICML 2010)
- LCCC: NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds

34 / 37

- Optimization for Machine Learning (MIT Press, 2011)

Acknowledgments: Grant 22700138; NTT.

35 / 37

References
Recent surveys
Tomioka, Suzuki, & Sugiyama (2011) Augmented Lagrangian Methods for Learning,
Selecting, and Combining Features. In Sra, Nowozin, Wright., editors, Optimization for
Machine Learning, MIT Press.
Combettes & Pesquet (2010) Proximal splitting methods in signal processing. In
Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag.
Boyd, Parikh, Peleato, & Eckstein (2010) Distributed optimization and statistical learning
via the alternating direction method of multipliers.
IST/FISTA
Moreau (1965) Proximité et dualité dans un espace Hilbertien. Bulletin de la S. M. F.
Nesterov (2007) Gradient Methods for Minimizing Composite Objective Function.
Beck & Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear
Inverse Problems. SIAM J Imag Sci 2, 183-202.
Augmented Lagrangian
Rockafellar (1976) Augmented Lagrangians and applications of the proximal point
algorithm in convex programming. Math. of Oper. Res. 1.
Bertsekas (1982) Constrained Optimization and Lagrange Multiplier Methods. Academic
Press.
Tomioka, Suzuki, & Sugiyama (2011) Super-Linear Convergence of Dual Augmented
Lagrangian Algorithm for Sparse Learning. JMLR 12.
36 / 37

References
ADMM
Gabay & Mercier (1976) A dual algorithm for the solution of nonlinear variational problems
via finite element approximation. Comput Math Appl 2, 17-40.
Lions & Mercier (1979) Splitting Algorithms for the Sum of Two Nonlinear Operators. SIAM
J Numer Anal 16, 964-979.
Eckstein & Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal
point algorithm for maximal monotone operators.
Matrices/Tensor
Fazel, Hindi, & Boyd (2001) A Rank Minimization Heuristic with Application to Minimum
Order System Approximation. Proc. of the American Control Conference.
Srebro, Rennie, & Jaakkola (2005) Maximum-Margin Matrix Factorization. Advances in
NIPS 17, 1329-1336.
Cai, Candès, & Shen (2008) A singular value thresholding algorithm for matrix completion.
Mazumder, Hastie, & Tibshirani (2010) Spectral Regularization Algorithms for Learning
Large Incomplete Matrices. JMLR 11, 2287-2322.
Tomioka, Hayashi, & Kashima (2011) Estimation of low-rank tensors via convex
optimization. arXiv:1010.0789.
Total variation
Rudin, Osher, & Fatemi (1992) Nonlinear total variation based noise removal algorithms.
Physica D: Nonlinear Phenomena, 60.
Goldstein & Osher (2009) Split Bregman method for L1 regularization problems. SIAM J.
Imag. Sci. 2.
37 / 37
