
2011-10-25 @ RAMP 2011

1 / 37

Solvers for sparse optimization:

- CVX (Grant & Boyd): general-purpose interior-point solver (1980s–)
- Methods from the 1960s–70s, revisited:
  - (Accelerated) proximal gradient methods
  - Dual decomposition (Uzawa's method)
  - Alternating Direction Method of Multipliers (ADMM)

2 / 37


Sparse estimation: applications

- Feature selection (SNP data, etc.)
- Signal/image reconstruction (MRI)
- Collaborative filtering

[Figure: partially observed Users × Movies rating matrix]

3 / 37

Example 1: SNP data

x_i: input (SNP features); y_i = ±1: label.
Goal: predict y_i from x_i (binary classification, y_i ∈ {−1, +1}).

minimize_{w ∈ Rⁿ}  Σ_{i=1}^m log(1 + exp(−y_i⟨x_i, w⟩))  +  λ‖w‖₁
                        (data fit)                     (regularization)

The data-fit term uses the logistic loss f(z) = log(1 + exp(−z)).
Typical SNP scale: n = 500,000 features, m = 5,000 samples.
Probabilistic (MAP) interpretation:

log(1 + e^{−yz}) = −log P(Y = y | z),  where P(Y = +1 | z) = e^z / (1 + e^z).

[Figure: sigmoid P(Y = +1 | z) plotted against z]

4 / 37
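The objective above can be evaluated directly; a minimal numpy sketch (the function and variable names are mine, not from the slides):

```python
import numpy as np

def logistic_l1_objective(w, X, y, lam):
    """Objective: sum_i log(1 + exp(-y_i <x_i, w>)) + lam * ||w||_1.

    X: (m, n) design matrix, y: (m,) labels in {-1, +1}.
    """
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    loss = np.logaddexp(0.0, -margins).sum()
    return loss + lam * np.abs(w).sum()
```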

Example 2: compressed sensing [Candès, Romberg, & Tao 06]

e.g., MRI reconstruction.

minimize_{w ∈ Rⁿ}  (1/2)‖y − Φw‖₂² + λ‖w‖₁

y: observations; w: signal to recover; Φ: Rⁿ → Rᵐ: observation operator.
With a synthesis basis A (e.g., wavelets):

minimize_{w ∈ Rⁿ}  (1/2)‖y − ΦAw‖₂² + λ‖w‖₁

(A = I recovers the problem above.)

5 / 37

Example 3: matrix completion [Fazel+ 01; Srebro+ 05]

Predict the unobserved entries of a matrix Y from the observed ones:

minimize_X  (1/2) Σ_{(i,j) observed} (X_ij − Y_ij)² + λ‖X‖_{S1}

where  ‖X‖_{S1} := Σ_{j=1}^r σ_j(X)   (Schatten 1-norm; σ_j: singular values).

[Figure: partially observed Users × Movies rating matrix]

6 / 37

Example 4: Tucker decomposition [Tucker 66]

X_ijk = Σ_{a=1}^{r₁} Σ_{b=1}^{r₂} Σ_{c=1}^{r₃} C_abc U_ia^(1) U_jb^(2) U_kc^(3)

C: core tensor; U^(1), U^(2), U^(3): factor matrices.

7 / 37

The examples share a common structure:

minimize_w  L(w) + λ‖w‖₁

- SNP selection: logistic loss + ℓ₁
- Compressed sensing: squared loss + ℓ₁ (or total variation)
- Matrix/tensor completion: squared loss + Schatten 1-norm (Tucker)

8 / 37

Outline:

- Proximal gradient method
- Dual Augmented Lagrangian (DAL)
- Alternating Direction Method of Multipliers (ADMM)

9 / 37


Proximal gradient method and Dual Augmented Lagrangian (DAL)

10 / 37

Proximal gradient method

minimize_w  L(w) + λ‖w‖₁
            (smooth)  (nonsmooth)

Linearize the smooth loss around w^t and add a proximity term:

w^{t+1} = argmin_w { ⟨∇L(w^t), w − w^t⟩ + (1/(2η_t))‖w − w^t‖₂² + λ‖w‖₁ }
        = argmin_w { λ‖w‖₁ + (1/(2η_t))‖w − (w^t − η_t∇L(w^t))‖₂² }
        = prox_{λη_t}( w^t − η_t∇L(w^t) ).

11 / 37

Proximal operator:

prox_g(z) = argmin_x { g(x) + (1/2)‖x − z‖² }

Example: for the indicator function δ_C of a convex set C, prox_{δ_C}(z) = proj_C(z).

Soft-thresholding (g(x) = λ‖x‖₁):

prox_λ(z) = argmin_x { λ‖x‖₁ + (1/2)‖x − z‖² }

componentwise:
            ⎧ z_j + λ   (z_j < −λ),
(prox_λ)_j =⎨ 0         (−λ ≤ z_j ≤ λ),
            ⎩ z_j − λ   (z_j > λ).

Also written ST_λ(z).

12 / 37
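In code, the soft-thresholding operator above is one line (numpy assumed; the function name is mine):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding ST_lam(z): the prox of lam * ||.||_1, componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```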

Proximal gradient method
(Lions & Mercier 79; Figueiredo & Nowak 03; Daubechies+ 04; ...)

1. Initialize w^0.
2. Iterate:  w^{t+1} = prox_{λη_t}( w^t − η_t∇L(w^t) ).
                       (prox step)  (gradient step on L)

Also known as: Forward-Backward splitting,
Iterative Shrinkage/Thresholding.

13 / 37
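The two-step iteration above, specialized to the lasso (squared loss), can be sketched as follows (numpy assumed; the names `ista` and `soft_threshold` are mine):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(A, y, lam, eta, n_iter=500):
    """Proximal gradient (ISTA) for (1/2)||y - A w||^2 + lam * ||w||_1.

    eta should satisfy eta <= 1 / ||A^T A|| (inverse Lipschitz constant).
    """
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)                    # gradient step on the smooth loss
        w = soft_threshold(w - eta * grad, lam * eta)  # prox step
    return w
```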

Convergence: if ∇L is Lipschitz,

‖∇L(x) − ∇L(y)‖ ≤ H‖x − y‖,

and the step size satisfies η_t ≤ 1/H, then

f(x^k) − f(x*) ≤ H‖x^0 − x*‖₂² / (2k).

The accelerated variant achieves O(1/k²) (Nesterov 07; Beck & Teboulle 09).

14 / 37
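A sketch of the accelerated O(1/k²) variant (FISTA, Beck & Teboulle 09) for the same lasso objective; the momentum recursion follows their paper, while the variable names are mine:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista(A, y, lam, n_iter=500):
    """Accelerated proximal gradient (FISTA) for (1/2)||y - A w||^2 + lam * ||w||_1."""
    H = np.linalg.norm(A.T @ A, 2)       # Lipschitz constant of the gradient
    eta = 1.0 / H
    w = np.zeros(A.shape[1])
    v = w.copy()                          # extrapolated point
    s = 1.0                               # momentum parameter t_k
    for _ in range(n_iter):
        w_new = soft_threshold(v - eta * (A.T @ (A @ v - y)), lam * eta)
        s_new = (1 + np.sqrt(1 + 4 * s ** 2)) / 2
        v = w_new + ((s - 1) / s_new) * (w_new - w)   # momentum extrapolation
        w, s = w_new, s_new
    return w
```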

Dual Augmented Lagrangian (DAL) [Tomioka & Sugiyama 09]

Assume the loss separates as L(w) = f_ℓ(Aw), with f_ℓ a loss function and A ∈ R^{m×n} a design matrix. Examples:

(squared loss)   L(w) = (1/2)‖y − Aw‖₂²
(logistic loss)  L(w) = Σ_{i=1}^m log(1 + exp(−y_i⟨x_i, w⟩))

15 / 37

Dual Augmented Lagrangian (DAL)

Primal and Fenchel dual:

min_w  f_ℓ(Aw) + λ‖w‖₁      ⟷      max_{α,v}  −f_ℓ*(−α) − δ_λ(v)   s.t.  v = A⊤α
       (=: f(w))

(δ_λ: indicator of the ℓ∞-ball ‖v‖∞ ≤ λ, the conjugate of λ‖·‖₁)

Proximal minimization [Rockafellar 76]:

w^{t+1} = argmin_w { f(w) + (1/(2η_t))‖w − w^t‖² }      (η_t > 0)

⇒  f(w^{t+1}) + (1/(2η_t))‖w^{t+1} − w^t‖² ≤ f(w^t).

Augmented Lagrangian on the dual [Powell 69; Hestenes 69]:

α^t = argmin_α φ_t(α),      w^{t+1} = prox_{λη_t}( w^t + η_t A⊤α^t ),

where φ_t is the augmented Lagrangian function of the dual problem.
The two views are equivalent [Rockafellar 76].

16 / 37

Dual Augmented Lagrangian (ℓ₁ case)

1. Initialize w^0.
2. Iterate:

w^{t+1} = prox_{λη_t}( w^t + η_t A⊤α^t ),

where  α^t = argmin_{α ∈ Rᵐ} { f_ℓ*(−α) + (1/(2η_t))‖prox_{λη_t}(w^t + η_t A⊤α)‖₂² }.

17 / 37

DAL (ℓ₁ case)

(1) Prox step:

w^{t+1} = prox_{λη_t}( w^t + η_t A⊤α^t )

(2) Inner minimization:

α^t = argmin_α { f_ℓ*(−α) + (1/(2η_t))‖prox_{λη_t}(w^t + η_t A⊤α)‖² }
        (loss conjugate)        (smooth, despite the prox)

The inner objective is differentiable in α and can be minimized with, e.g., Newton's method.

18 / 37
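A minimal DAL sketch for the squared loss f_ℓ(z) = (1/2)‖z − y‖², whose conjugate gives f_ℓ*(−α) = (1/2)‖α‖² − y⊤α; the inner objective is then minimized with scipy's L-BFGS-B using its gradient α − y + A·prox_{λη}(w^t + ηA⊤α). Variable names and the choice of inner solver are mine:

```python
import numpy as np
from scipy.optimize import minimize

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def dal_squared_loss(A, y, lam, eta=10.0, n_outer=20):
    """DAL sketch for (1/2)||y - A w||^2 + lam * ||w||_1 (squared loss only)."""
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(n_outer):
        def phi(alpha):
            # inner objective: f*(-alpha) + (1/2 eta) ||prox(w + eta A^T alpha)||^2
            p = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
            return 0.5 * alpha @ alpha - y @ alpha + (p @ p) / (2 * eta)
        def grad(alpha):
            p = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
            return alpha - y + A @ p
        alpha = minimize(phi, np.zeros(m), jac=grad, method="L-BFGS-B").x
        w = soft_threshold(w + eta * (A.T @ alpha), lam * eta)  # prox step
    return w
```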

DAL

Why DAL? Exact proximation with respect to f:

w^{t+1} = argmin_w { f_ℓ(Aw) + λ‖w‖₁ + (1/(2η_t))‖w − w^t‖² }
             (f(w) = f_ℓ(Aw) + λ‖w‖₁)

The proximal gradient method linearizes the loss,

f_ℓ(Aw) ≈ f_ℓ(Aw^t) + (w − w^t)⊤A⊤∇f_ℓ(Aw^t),

which is accurate only near w^t, so w^{t+1} must stay close to w^t.

DAL keeps the loss exact through its convex conjugate,

f_ℓ(Aw) = max_{α ∈ Rᵐ} { −⟨w, A⊤α⟩ − f_ℓ*(−α) },

so w^{t+1} can move far from w^t.

19 / 37


[Figure: contour plots comparing proximal-gradient steps with DAL steps]

20 / 37

Convergence of DAL (exact inner minimization)

w^t: DAL iterates with the inner problem solved exactly (∇φ_t(α^t) = 0);
w*: minimizer of f. Assume f satisfies the growth condition

f(w^{t+1}) − f(w*) ≥ σ‖w^{t+1} − w*‖²      (t = 0, 1, 2, ...).

Then w^{t+1} converges to w*:

‖w^{t+1} − w*‖ ≤ (1/(1 + ση_t)) ‖w^t − w*‖.

Since η_t may grow across iterations, the convergence is super-linear.

21 / 37

Convergence of DAL (approximate inner minimization)

w^t: DAL iterates with the inner problem solved only approximately, up to

‖∇φ_t(α^t)‖ ≤ √(γ/η_t) ‖w^{t+1} − w^t‖      (1/γ: Lipschitz constant of ∇f_ℓ);

w*: minimizer of f. Under the same growth condition on f,

‖w^{t+1} − w*‖ ≤ (1/√(1 + 2ση_t)) ‖w^t − w*‖.

The stopping criterion is computable: it compares against ‖w^{t+1} − w^t‖, not against the unknown w*. Again, since η_t may grow, the rate is super-linear.

22 / 37

Proof sketch 1

Since w^{t+1} minimizes f(w) + (1/(2η_t))‖w − w^t‖²,

(w^t − w^{t+1})/η_t ∈ ∂f(w^{t+1}).

By convexity (cf. Beck & Teboulle 09),

f(w) − f(w^{t+1}) ≥ ⟨ (w^t − w^{t+1})/η_t , w − w^{t+1} ⟩.

[Figure: tangent inequality at w^{t+1}]

23 / 37

Proof sketch 2

With approximate inner minimization, the inequality holds up to the residual gradient of the inner problem:

f(w) − f(w^{t+1}) ≥ ⟨ (w^t − w^{t+1})/η_t , w − w^{t+1} ⟩ − (γ/2)‖∇φ_t(α^t)‖²

(1/γ: Lipschitz constant of ∇f_ℓ)

[Figure: tangent inequality with an error term at w^{t+1}]

24 / 37


Alternating Direction Method of Multipliers (ADMM)

25 / 37

Augmented Lagrangian method [Powell 69; Hestenes 69]

minimize_{x,z}  f(x) + λ‖z‖₁   s.t.  z = x.

Augmented Lagrangian:

L_η(x, z, α) = f(x) + λ‖z‖₁ + α⊤(z − x) + (η/2)‖z − x‖².

Iterate, minimizing jointly over x and z:

(x^{t+1}, z^{t+1}) = argmin_{x ∈ Rⁿ, z ∈ Rᵐ} L_{η_t}(x, z, α^t),
α^{t+1} = α^t + η_t (z^{t+1} − x^{t+1}).

Problem: the quadratic penalty couples x and z!

26 / 37

Alternating Direction Method of Multipliers (ADMM; Gabay & Mercier 76)

Minimize over x and z alternately instead of jointly:

L_η(x, z, α) = f(x) + λ‖z‖₁ + α⊤(z − x) + (η/2)‖z − x‖².

x-step:  x^{t+1} = argmin_{x ∈ Rⁿ} L_{η_t}(x, z^t, α^t).
z-step:  z^{t+1} = argmin_{z ∈ Rᵐ} L_{η_t}(x^{t+1}, z, α^t).
Multiplier update:  α^{t+1} = α^t + η_t (z^{t+1} − x^{t+1}).

27 / 37

ADMM (Gabay & Mercier 76)

Completing the square in L_η(x, z, α) = f(x) + λ‖z‖₁ + α⊤(z − x) + (η/2)‖z − x‖²:

x^{t+1} = argmin_{x ∈ Rⁿ} { f(x) + (η_t/2)‖z^t − x + α^t/η_t‖² },
z^{t+1} = argmin_{z ∈ Rᵐ} { λ‖z‖₁ + (η_t/2)‖z − x^{t+1} + α^t/η_t‖² }
        = prox_{λ/η_t}( x^{t+1} − α^t/η_t ),
α^{t+1} = α^t + η_t (z^{t+1} − x^{t+1}).

The z-step is a prox (soft-thresholding); the x-step handles the smooth part.
ADMM is equivalent to Douglas-Rachford splitting
(Lions & Mercier 79; Eckstein & Bertsekas 92).

28 / 37
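For the lasso, f(x) = (1/2)‖y − Ax‖², the x-step is a linear solve and the z-step is soft-thresholding. A sketch (numpy assumed; function names are mine):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def admm_lasso(A, y, lam, eta=1.0, n_iter=200):
    """ADMM for (1/2)||y - A x||^2 + lam * ||z||_1, s.t. z = x."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    alpha = np.zeros(n)                 # Lagrange multiplier
    M = A.T @ A + eta * np.eye(n)       # x-step system matrix (fixed across iterations)
    Aty = A.T @ y
    for _ in range(n_iter):
        # x-step: argmin (1/2)||y - A x||^2 + (eta/2)||z - x + alpha/eta||^2
        x = np.linalg.solve(M, Aty + eta * z + alpha)
        # z-step: prox of (lam/eta)||.||_1 at x - alpha/eta
        z = soft_threshold(x - alpha / eta, lam / eta)
        alpha = alpha + eta * (z - x)   # multiplier update
    return z
```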

Low-rank tensor estimation [Liu+ 09; Signoretto+ 10; Tomioka+ 10; Gandy+ 11]

Key idea: matricization (unfolding). A tensor X is flattened into a matrix X_(k) along each mode k; low Tucker rank means every unfolding X_(k) is a low-rank matrix.

29 / 37
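Mode-k matricization in numpy, using the convention that mode k becomes the rows (function name is mine):

```python
import numpy as np

def unfold(X, k):
    """Mode-k matricization X_(k): move mode k to the front, then flatten the rest."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
```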

Tensor completion via ADMM:

minimize_{x, z₁,...,z_K ∈ R^N}  (1/2)‖x_Ω − y‖² + Σ_{k=1}^K λ_k ‖Z_k‖_{S1},
s.t.  P_k x = z_k   (k = 1, ..., K),

where x is the vectorized tensor, x_Ω its restriction to the observed indices,
y ∈ R^M the observations (M ≤ N = n₁ n₂ ··· n_K), P_k the mode-k unfolding
operator (z_k = vec(Z_k)), with P_k⊤ P_k = I.

30 / 37

ADMM for tensor completion

Augmented Lagrangian:

L_η(x, {Z_k}ᴷ, {α_k}ᴷ) = (1/2)‖x_Ω − y‖²
    + Σ_{k=1}^K [ λ_k‖Z_k‖_{S1} + α_k⊤(P_k x − z_k) + (η/2)‖P_k x − z_k‖² ].

x-step: since each P_k is a permutation, the update costs O(N).
Z_k-step: the prox of the Schatten 1-norm, i.e., soft-thresholding of the singular values.

31 / 37
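The Schatten 1-norm prox (singular value soft-thresholding) in numpy (function name is mine):

```python
import numpy as np

def svt(Z, lam):
    """Prox of lam * ||.||_S1: soft-threshold the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```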

Experiment 1: generalization error

[Figure: generalization error vs. fraction of observed elements (0.1-0.9).
Methods: As a Matrix (mode 1), As a Matrix (mode 2), As a Matrix (mode 3),
Constraint, Mixture, Tucker (large), Tucker (exact);
horizontal line: optimization tolerance.]

The "Constraint" formulation recovers the tensor from roughly 35% observed entries; Tucker (fitted with EM) is also OK.

32 / 37

Experiment 2: computation time

[Figure: computation time (s, 0-50) vs. fraction of observed elements (0.1-0.9).
Methods: As a Matrix, Constraint, Mixture, Tucker (large), Tucker (exact).]

33 / 37

Further pointers:

- Stochastic Optimization in Machine Learning (Nathan Srebro, tutorial at ICML 2010)
- LCCC: NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds

34 / 37

- Optimization for Machine Learning (MIT Press, 2011)

Acknowledgments: Grant 22700138; NTT.

35 / 37

References
Recent surveys
Tomioka, Suzuki, & Sugiyama (2011) Augmented Lagrangian Methods for Learning,
Selecting, and Combining Features. In Sra, Nowozin, Wright., editors, Optimization for
Machine Learning, MIT Press.
Combettes & Pesquet (2010) Proximal splitting methods in signal processing. In
Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag.
Boyd, Parikh, Peleato, & Eckstein (2010) Distributed optimization and statistical learning
via the alternating direction method of multipliers.
IST/FISTA
Moreau (1965) Proximité et dualité dans un espace Hilbertien. Bulletin de la S. M. F.
Nesterov (2007) Gradient Methods for Minimizing Composite Objective Function.
Beck & Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear
Inverse Problems. SIAM J Imag Sci 2, 183-202.
Augmented Lagrangian
Rockafellar (1976) Augmented Lagrangians and applications of the proximal point
algorithm in convex programming. Math. of Oper. Res. 1.
Bertsekas (1982) Constrained Optimization and Lagrange Multiplier Methods. Academic
Press.
Tomioka, Suzuki, & Sugiyama (2011) Super-Linear Convergence of Dual Augmented
Lagrangian Algorithm for Sparse Learning. JMLR 12.
36 / 37

References
ADMM
Gabay & Mercier (1976) A dual algorithm for the solution of nonlinear variational problems
via finite element approximation. Comput Math Appl 2, 17-40.
Lions & Mercier (1979) Splitting Algorithms for the Sum of Two Nonlinear Operators. SIAM
J Numer Anal 16, 964-979.
Eckstein & Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal
point algorithm for maximal monotone operators.
Matrices/Tensor
Fazel, Hindi, & Boyd (2001) A Rank Minimization Heuristic with Application to Minimum
Order System Approximation. Proc. of the American Control Conference.
Srebro, Rennie, & Jaakkola (2005) Maximum-Margin Matrix Factorization. Advances in
NIPS 17, 1329-1336.
Cai, Candès, & Shen (2008) A singular value thresholding algorithm for matrix completion.
Mazumder, Hastie, & Tibshirani (2010) Spectral Regularization Algorithms for Learning
Large Incomplete Matrices. JMLR 11, 2287-2322.
Tomioka, Hayashi, & Kashima (2011) Estimation of low-rank tensors via convex
optimization. arXiv:1010.0789.
Total variation
Rudin, Osher, & Fatemi (1992) Nonlinear total variation based noise removal algorithms.
Physica D: Nonlinear Phenomena, 60.
Goldstein & Osher (2009) Split Bregman method for L1 regularization problems. SIAM J.
Imag. Sci. 2.
37 / 37
