RAMP2011
2011-10-25
Examples:
- Genomics: SNP data, etc.
- Brain imaging: MRI
- Recommendation: [Figure: partially observed Users × Movies rating matrix with entries 1-4]
Example 1: SNP data
x_i: input (SNP features); y_i ∈ {−1, +1}: label.
L1-regularized logistic regression:

\[
\min_{w \in \mathbb{R}^n}\ \underbrace{\sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i \langle x_i, w \rangle)\bigr)}_{\text{data-fit}} \;+\; \underbrace{\lambda \|w\|_1}_{\text{regularization}}
\]

[Figure: the logistic loss f(z) = log(1 + e^{-z}) plotted against the margin z = y⟨x, w⟩; its derivative involves the logistic function e^z/(1 + e^z).]
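As a concrete illustration, the data-fit term and its gradient can be written in a few lines of NumPy (a sketch; names such as `logistic_loss` are mine, not from the talk):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Data-fit term: sum_i log(1 + exp(-y_i <x_i, w>)), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.logaddexp(0.0, -margins).sum()

def logistic_grad(w, X, y):
    """Gradient of logistic_loss with respect to w."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))  # sigmoid of the negative margin
    return -X.T @ (y * s)
```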
The Lasso:

\[
\min_{w \in \mathbb{R}^n}\ \frac{1}{2}\|y - A w\|_2^2 + \lambda \|w\|_1
\]

y: observation vector; w: coefficients; A: design matrix (a linear map R^n → R^m).
(The special case A = I reduces to coordinate-wise soft-thresholding.)
Matrix completion:

\[
\min_{X}\ \frac{1}{2}\|X - Y\|^2 + \lambda \|X\|_{S_1},
\qquad \text{where } \|X\|_{S_1} := \sum_{j=1}^{r} \sigma_j(X) \ \text{(Schatten 1-norm)}
\]

[Figure: partially observed Users × Movies rating matrix]
Tucker decomposition [Tucker 66]:

\[
X_{ijk} = \sum_{a=1}^{r_1} \sum_{b=1}^{r_2} \sum_{c=1}^{r_3} C_{abc}\, U^{(1)}_{ia}\, U^{(2)}_{jb}\, U^{(3)}_{kc}
\]

(C: core tensor of size r_1 × r_2 × r_3; U^{(1)}, U^{(2)}, U^{(3)}: factor matrices)
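The Tucker decomposition is easy to sanity-check numerically. Here is a small NumPy sketch (sizes and variable names are made up for illustration) that builds a tensor from a core and three factors with `einsum` and checks that a mode-1 unfolding has rank at most r_1:

```python
import numpy as np

rng = np.random.default_rng(0)
r1, r2, r3 = 2, 3, 4            # multilinear (Tucker) ranks
n1, n2, n3 = 5, 6, 7            # tensor dimensions
C = rng.normal(size=(r1, r2, r3))          # core tensor
U1 = rng.normal(size=(n1, r1))             # factor matrices
U2 = rng.normal(size=(n2, r2))
U3 = rng.normal(size=(n3, r3))

# X_{ijk} = sum_{abc} C_{abc} U1_{ia} U2_{jb} U3_{kc}
X = np.einsum('abc,ia,jb,kc->ijk', C, U1, U2, U3)

# the mode-1 unfolding (n1 x n2*n3) then has rank at most r1
X1 = X.reshape(n1, n2 * n3)
print(X.shape, np.linalg.matrix_rank(X1))
```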
These problems share the form

\[
\min_{w}\ L(w) + \lambda \|w\|_1
\]

with a differentiable loss L and a sparsity-inducing regularizer:
- L1 norm: feature (e.g., SNP) selection
- Total variation: L1 norm of differences
- Schatten 1-norm: low-rank matrices/tensors (Tucker)
Proximal gradient method
\[
\min_{w}\ L(w) + \lambda \|w\|_1
\]

Linearize the loss around w^t and keep a proximal term:

\[
w^{t+1} = \operatorname*{argmin}_{w}\ \langle \nabla L(w^t),\, w - w^t \rangle + \frac{1}{2\eta_t}\|w - w^t\|_2^2 + \lambda \|w\|_1
\]
\[
\phantom{w^{t+1}} = \operatorname*{argmin}_{w}\ \lambda \|w\|_1 + \frac{1}{2\eta_t}\bigl\|w - \bigl(w^t - \eta_t \nabla L(w^t)\bigr)\bigr\|_2^2
\]

[Figure: iterates w^t, w^{t+1} approaching the optimum x*]
Proximal operator:

\[
\operatorname{prox}_g(z) = \operatorname*{argmin}_{x}\ g(x) + \frac{1}{2}\|x - z\|^2
\]

Soft-threshold (g(x) = λ‖x‖₁):

\[
\operatorname{prox}_{\lambda}(z)_j =
\begin{cases}
z_j + \lambda & (z_j < -\lambda),\\
0 & (-\lambda \le z_j \le \lambda),\\
z_j - \lambda & (z_j > \lambda).
\end{cases}
\]

[Figure: the soft-threshold function ST(z)]
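The piecewise formula above collapses to a single NumPy line; a minimal sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_1: shrink each entry toward 0 by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# e.g. soft_threshold(np.array([3.0, -0.5, 1.5]), 1.0) gives 2.0, 0.0, 0.5 elementwise
```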
Algorithm: initialize w^0; iterate

\[
w^{t+1} = \operatorname{prox}_{\lambda\eta_t}\bigl(w^t - \eta_t \nabla L(w^t)\bigr)
\]

Step size: η_t on the order of 1/L (L: Lipschitz constant of ∇L).
Also known as: Forward-Backward splitting, Iterative Shrinkage/Thresholding (IST).
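Putting the gradient step and the prox together gives the full method. A minimal NumPy sketch for the lasso (the function name `ista` and the fixed iteration count are my choices):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(A, y, lam, n_iter=500):
    """Iterative shrinkage/thresholding for (1/2)||y - Aw||^2 + lam ||w||_1."""
    w = np.zeros(A.shape[1])
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)            # gradient of the squared loss
        w = soft_threshold(w - eta * grad, lam * eta)
    return w
```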
Convergence: assume ∇L is Lipschitz, ‖∇L(x) − ∇L(y)‖ ≤ H‖x − y‖. With step size η_t ≤ 1/H,

\[
f(x^k) - f(x^*) \le \frac{H \|x^0 - x^*\|_2^2}{2k}.
\]

Acceleration improves this to O(1/k²) (Nesterov 07; Beck & Teboulle 09).
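The O(1/k²) accelerated variant cited above (Beck & Teboulle's FISTA) only adds a momentum sequence on top of the prox step; a sketch under the same lasso setup, with names of my choosing:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista(A, y, lam, n_iter=500):
    """Accelerated proximal gradient (FISTA) for (1/2)||y - Aw||^2 + lam ||w||_1."""
    w = np.zeros(A.shape[1])
    v = w.copy()                            # extrapolated point
    t = 1.0
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    for _ in range(n_iter):
        w_new = soft_threshold(v - eta * A.T @ (A @ v - y), lam * eta)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w_new + ((t - 1.0) / t_new) * (w_new - w)   # momentum step
        w, t = w_new, t_new
    return w
```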
The gradient ∇L is easy to compute in both examples:

(Example 1, logistic loss)
\[
L(w) = \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i \langle x_i, w \rangle)\bigr)
\]

(Example 2, squared loss)
\[
L(w) = \frac{1}{2}\|y - Aw\|_2^2
\]
Fenchel duality:

\[
\min_{w}\ \underbrace{f(Aw) + \lambda\|w\|_1}_{=: \bar f(w)}
\qquad \Longleftrightarrow \qquad
\max_{\alpha, v}\ -f^*(-\alpha) - \delta_{\lambda}^{\infty}(v)
\quad \text{s.t. } v = A^\top \alpha,
\]

where δ_λ^∞ is the indicator of the box {v : ‖v‖_∞ ≤ λ} (the conjugate of λ‖·‖₁).
Proximal minimization [Rockafellar 76] applied to f̄(w) = f(Aw) + λ‖w‖₁:

\[
w^{t+1} = \operatorname*{argmin}_{w}\ \bar f(w) + \frac{1}{2\eta_t}\|w - w^t\|^2
\qquad (0 < \eta_t),
\]

which guarantees monotone descent:

\[
\bar f(w^{t+1}) + \frac{1}{2\eta_t}\|w^{t+1} - w^t\|^2 \le \bar f(w^t).
\]
Augmented Lagrangian [Powell 69; Hestenes 69]: the proximal minimization of f̄(w) = f(Aw) + λ‖w‖₁ is equivalent to an augmented Lagrangian step on the dual [Rockafellar 76]:

\[
\alpha^t = \operatorname*{argmin}_{\alpha}\ \varphi_t(\alpha),
\qquad
w^{t+1} = \operatorname{prox}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top \alpha^t\bigr),
\]

where φ_t is the (differentiable) augmented-Lagrangian inner objective.
DAL algorithm: initialize w^0; iterate

\[
\alpha^t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^m}\ \underbrace{f^*(-\alpha) + \frac{1}{2\eta_t}\bigl\|\operatorname{prox}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top \alpha\bigr)\bigr\|_2^2}_{=: \varphi_t(\alpha)},
\]
\[
w^{t+1} = \operatorname{prox}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top \alpha^t\bigr).
\]
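To make the iteration concrete, here is a toy NumPy sketch of DAL for the squared loss f(u) = ½‖u − y‖², for which f*(−α) = ½‖α‖² − ⟨α, y⟩ and (assuming this conjugate) ∇φ_t(α) = α − y + A·prox_{λη}(w^t + ηA^⊤α). The inner problem is solved here by plain gradient descent; the cited papers use more sophisticated inner solvers, and all names below are my choices:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def dal_lasso(A, y, lam, eta=1.0, n_outer=30, n_inner=300):
    """Toy DAL for (1/2)||Aw - y||^2 + lam ||w||_1 (squared loss only).

    Inner objective:
        phi(a) = ||a||^2/2 - <a, y> + ||prox_{lam*eta}(w + eta*A'a)||^2 / (2*eta)
    with gradient  a - y + A @ prox_{lam*eta}(w + eta*A'a).
    """
    m, n = A.shape
    w = np.zeros(n)
    alpha = np.zeros(m)
    step = 1.0 / (1.0 + eta * np.linalg.norm(A, 2) ** 2)  # bound on Lipschitz const of grad phi
    for _ in range(n_outer):
        for _ in range(n_inner):                          # gradient descent on phi
            p = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
            alpha = alpha - step * (alpha - y + A @ p)
        w = soft_threshold(w + eta * (A.T @ alpha), lam * eta)
    return w
```

At a fixed point, α = y − Aw and w = prox_{λη}(w + ηA^⊤(y − Aw)), which is exactly the lasso optimality condition.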
DAL (L1 case):
(1) Prox step: w^{t+1} = prox_{λη_t}(w^t + η_t A^⊤ α^t).
(2) Inner minimization:

\[
\alpha^t = \operatorname*{argmin}_{\alpha}\ f^*(-\alpha) + \underbrace{\frac{1}{2\eta_t}\bigl\|\operatorname{prox}_{\lambda\eta_t}\bigl(w^t + \eta_t A^\top \alpha\bigr)\bigr\|_2^2}_{\text{differentiable}}.
\]

The squared soft-threshold term is smooth, so the inner problem can be solved by gradient-based methods.
[Figure: the soft-threshold function and its smooth square]
DAL viewed as exact proximation of the full objective:

\[
w^{t+1} = \operatorname*{argmin}_{w}\ \overbrace{f(Aw)}^{\text{loss}} + \lambda\|w\|_1 + \frac{1}{2\eta_t}\|w - w^t\|^2.
\]

(No linearization of the loss: the proximal step is applied to the objective itself.)
Key step of DAL: express the loss through its convex conjugate,

\[
f(Aw) = \max_{\alpha \in \mathbb{R}^m}\ \bigl(\langle \alpha, Aw \rangle - f^*(\alpha)\bigr),
\]

carry out the inner maximization over α, and recover w^{t+1} by the prox step.
[Figure: contour plots illustrating the DAL inner problem]
Convergence of DAL (exact inner minimization): suppose w^t is generated by DAL with ∇φ_t(α^t) = 0, and that there is a constant σ > 0 with

\[
\bar f(w^{t+1}) - \bar f(w^*) \ge \sigma \|w^{t+1} - w^*\|^2 \qquad (t = 0, 1, 2, \ldots).
\]

Then

\[
\|w^{t+1} - w^*\| \le \frac{1}{\sqrt{1 + 2\sigma\eta_t}}\, \|w^t - w^*\|.
\]
Convergence of DAL (approximate inner minimization): assume in addition that ∇f is 1/γ-Lipschitz, and stop the inner minimization when

\[
\|\nabla \varphi_t(\alpha^t)\| \le \sqrt{\gamma/\eta_t}\, \|w^{t+1} - w^t\|.
\]

Then the same rate holds:

\[
\|w^{t+1} - w^*\| \le \frac{1}{\sqrt{1 + 2\sigma\eta_t}}\, \|w^t - w^*\|.
\]
The stopping criterion reduces to ∇φ_t(α^t) = 0 in the exact case. Since η_t can be increased across iterations, the factor 1/√(1 + 2ση_t) → 0: DAL converges super-linearly.
Proof sketch, step 1: w^{t+1} minimizes f̄(w) + (1/(2η_t))‖w − w^t‖², so for every w

\[
\bar f(w^{t+1}) + \frac{1}{2\eta_t}\|w^{t+1} - w^t\|^2 \le \bar f(w) + \frac{1}{2\eta_t}\|w - w^t\|^2.
\]

[Figure: f̄(w^{t+1}) compared with f̄(w) around w^{t+1}]
Step 2: for every w,

\[
\bar f(w) - \bar f(w^{t+1}) \ge \Bigl\langle \frac{w^t - w^{t+1}}{\eta_t},\, w - w^{t+1} \Bigr\rangle - \frac{\eta_t}{2}\,\|\nabla \varphi_t(\alpha^t)\|^2,
\]

using the 1/γ-Lipschitz continuity of ∇f.
[Figure: a lower bound on f̄ at w^{t+1}]
Alternating Direction Method of Multipliers (ADMM)
Reformulate with an auxiliary variable:

\[
\min_{x, z}\ f(x) + \lambda \|z\|_1, \qquad \text{s.t. } z = x.
\]

Augmented Lagrangian:

\[
L_\eta(x, z, \alpha) = f(x) + \lambda \|z\|_1 + \langle \alpha, z - x \rangle + \frac{\eta}{2}\|z - x\|^2.
\]

Method of multipliers: minimize jointly over x, z, then update the multiplier:

\[
(x^{t+1}, z^{t+1}) = \operatorname*{argmin}_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m} L_{\eta_t}(x, z, \alpha^t),
\qquad
\alpha^{t+1} = \alpha^t + \eta_t\bigl(z^{t+1} - x^{t+1}\bigr).
\]

The joint minimization over x and z is hard; minimize over x and z alternately!
Alternate minimization:
x-step: minimize L_η(x, z, α) over x ∈ R^n with z fixed.
z-step: minimize L_η(x, z, α) over z ∈ R^m with x fixed; this yields x^{t+1}, then z^{t+1}.
ADMM iteration:

\[
x^{t+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} L_{\eta_t}(x, z^t, \alpha^t),
\qquad
z^{t+1} = \operatorname*{argmin}_{z \in \mathbb{R}^m} L_{\eta_t}(x^{t+1}, z, \alpha^t),
\qquad
\alpha^{t+1} = \alpha^t + \eta_t\bigl(z^{t+1} - x^{t+1}\bigr).
\]
The x-step written out:

\[
x^{t+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n}\ f(x) + \frac{\eta_t}{2}\bigl\|z^t - x + \alpha^t/\eta_t\bigr\|^2.
\]
The z-step is a proximal operator:

\[
z^{t+1} = \operatorname{prox}_{\lambda/\eta_t}\bigl(x^{t+1} - \alpha^t/\eta_t\bigr),
\]

i.e., a soft-threshold; the x-step is a smooth minimization involving f.
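For the lasso case f(x) = ½‖Ax − y‖², the x-step is a linear solve and the z-step a soft-threshold. A compact NumPy sketch (the prefactored inverse and all names are my choices):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def admm_lasso(A, y, lam, eta=1.0, n_iter=500):
    """ADMM for min_x (1/2)||Ax - y||^2 + lam ||z||_1  s.t.  z = x."""
    n = A.shape[1]
    z = np.zeros(n)
    alpha = np.zeros(n)
    # x-step: (A'A + eta I) x = A'y + eta z + alpha  (factor once, reuse)
    M = np.linalg.inv(A.T @ A + eta * np.eye(n))
    Aty = A.T @ y
    for _ in range(n_iter):
        x = M @ (Aty + eta * z + alpha)
        z = soft_threshold(x - alpha / eta, lam / eta)   # z-step: prox
        alpha = alpha + eta * (z - x)                    # multiplier update
    return z
```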
Application: low-rank tensor estimation [Liu+09, Signoretto+10, Tomioka+10, Gandy+11]
Key tool: matricization (unfolding). Rearranging the tensor along each mode gives matrices X_(1), X_(2), ...; a low Tucker rank means every mode-k unfolding is a low-rank matrix.
ADMM formulation:

\[
\min_{x,\, z_1, \ldots, z_K \in \mathbb{R}^N}\ \frac{1}{2}\|\Omega x - y\|^2 + \lambda \sum_{k=1}^{K} \|Z_k\|_{S_1},
\qquad \text{s.t. } P_k x = z_k \quad (k = 1, \ldots, K),
\]

where x is the vectorized tensor, y ∈ R^M the observations (M ≤ N = n_1 n_2 ⋯ n_K), Ω the observation operator, P_k the mode-k rearrangement (a permutation, so P_k^⊤ P_k = I), and Z_k is z_k reshaped as the mode-k unfolding.
ADMM applied to this problem uses the augmented Lagrangian

\[
\frac{1}{2}\|\Omega x - y\|^2 + \sum_{k=1}^{K} \Bigl( \lambda \|Z_k\|_{S_1} + \langle \alpha_k, P_k x - z_k \rangle + \frac{\eta}{2}\|P_k x - z_k\|^2 \Bigr).
\]

x-update: O(N) per iteration, since each P_k is a permutation.
z_k-update: the proximal operator of the Schatten 1-norm, i.e., soft-thresholding of the singular values.
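The z_k-update is the only place an SVD is needed: the proximal operator of the Schatten 1-norm soft-thresholds the singular values. A small NumPy sketch (the function name is mine):

```python
import numpy as np

def prox_schatten1(Z, lam):
    """Prox of lam * ||.||_S1: soft-threshold the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # scale the columns of U by the shrunken singular values
    return (U * np.maximum(s - lam, 0.0)) @ Vt
```

Singular values below the threshold are set to zero, so this operator also reduces the rank.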
Experiment 1: generalization error.
[Figure: generalization error vs. fraction of observed elements (0.1 to 0.9); methods: As a Matrix (mode 1/2/3), Constraint, Mixture, Tucker (large), Tucker (exact); the optimization tolerance is shown as a baseline.]
The "Constraint" formulation recovers the tensor from about 35% observed entries; Tucker (EM) also works when given the exact rank.
Experiment 2: computation time.
[Figure: computation time vs. fraction of observed elements (0.1 to 0.9); methods: As a Matrix, Constraint, Mixture, Tucker (large), Tucker (exact).]
Acknowledgments: Grant 22700138; NTT.
References
Recent surveys
Tomioka, Suzuki, & Sugiyama (2011) Augmented Lagrangian Methods for Learning,
Selecting, and Combining Features. In Sra, Nowozin, Wright., editors, Optimization for
Machine Learning, MIT Press.
Combettes & Pesquet (2010) Proximal splitting methods in signal processing. In
Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag.
Boyd, Parikh, Peleato, & Eckstein (2010) Distributed optimization and statistical learning
via the alternating direction method of multipliers.
IST/FISTA
Moreau (1965) Proximité et dualité dans un espace Hilbertien. Bulletin de la S.M.F.
Nesterov (2007) Gradient Methods for Minimizing Composite Objective Function.
Beck & Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear
Inverse Problems. SIAM J Imag Sci 2, 183–202.
Augmented Lagrangian
Rockafellar (1976) Augmented Lagrangians and applications of the proximal point
algorithm in convex programming. Math. of Oper. Res. 1.
Bertsekas (1982) Constrained Optimization and Lagrange Multiplier Methods. Academic
Press.
Tomioka, Suzuki, & Sugiyama (2011) Super-Linear Convergence of Dual Augmented
Lagrangian Algorithm for Sparse Learning. JMLR 12.
References
ADMM
Gabay & Mercier (1976) A dual algorithm for the solution of nonlinear variational problems
via finite element approximation. Comput Math Appl 2, 17–40.
Lions & Mercier (1979) Splitting Algorithms for the Sum of Two Nonlinear Operators. SIAM
J Numer Anal 16, 964–979.
Eckstein & Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal
point algorithm for maximal monotone operators.
Matrices/Tensor
Fazel, Hindi, & Boyd (2001) A Rank Minimization Heuristic with Application to Minimum
Order System Approximation. Proc. of the American Control Conference.
Srebro, Rennie, & Jaakkola (2005) Maximum-Margin Matrix Factorization. Advances in
NIPS 17, 1329–1336.
Cai, Candès, & Shen (2008) A singular value thresholding algorithm for matrix completion.
Mazumder, Hastie, & Tibshirani (2010) Spectral Regularization Algorithms for Learning
Large Incomplete Matrices. JMLR 11, 2287–2322.
Tomioka, Hayashi, & Kashima (2011) Estimation of low-rank tensors via convex
optimization. arXiv:1010.0789.
Total variation
Rudin, Osher, & Fatemi (1992) Nonlinear total variation based noise removal algorithms.
Physica D: Nonlinear Phenomena, 60.
Goldstein & Osher (2009) Split Bregman method for L1 regularization problems. SIAM J.
Imag. Sci. 2.