AdaGrad+RDA
echizen_tm
Oct.11, 2014
Linear classification (3p)
Stochastic Gradient Descent (2p)
AdaGrad+RDA (6p)
AdaGrad+RDA interpretation (3p)
Summary (1p)
Linear classification (1/3)
Examples of items represented as sets of features:
{…, …}
{…, …, …, …, …}
{10…, 20…, 30…, 40…, …}
Linear classification (2/3)
Feature vector: x
Weight vector: w
y = Σ_i x_i w_i
If y > 0, classify as A; if y <= 0, classify as B.
Linear classification (3/3)
Example:
x = {…: 1, …: 1, …: 1, …: 1}
w = {…: 1, …: 1, …: 1, …: -1}
y = 1*1 + 1*1 + 1*1 + 1*(-1)
  = 2 > 0, so this example is classified as A.
If the true class is A (t = 1) the prediction is correct; if it is B (t = -1) it is wrong.
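A minimal Python sketch of this computation; the feature names f1..f4 are hypothetical placeholders:

# Linear classifier score; feature names below are hypothetical placeholders.
def score(x, w):
    # y = sum_i x_i * w_i, using only the features present in x
    return sum(v * w.get(k, 0.0) for k, v in x.items())

x = {"f1": 1, "f2": 1, "f3": 1, "f4": 1}
w = {"f1": 1, "f2": 1, "f3": 1, "f4": -1}
y = score(x, w)                  # 1 + 1 + 1 - 1 = 2
label = "A" if y > 0 else "B"    # y > 0 -> A, y <= 0 -> B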
Stochastic Gradient Descent (1/2)
Learn the weight vector w from labeled examples (x, t).
A loss function f measures how badly w fits an example.
Hinge loss: f(w, x, t) = max(0, 1 - t Σ_i x_i w_i)
Squared loss: f(w, x, t) = (1/2)(t - Σ_i x_i w_i)²
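A small Python sketch of these two losses, assuming the same sparse feature-dict representation as above:

# Hinge loss and squared loss for a linear model (sketch).
def dot(x, w):
    return sum(v * w.get(k, 0.0) for k, v in x.items())

def hinge_loss(w, x, t):
    # f(w, x, t) = max(0, 1 - t * sum_i x_i w_i)
    return max(0.0, 1.0 - t * dot(x, w))

def squared_loss(w, x, t):
    # f(w, x, t) = 1/2 * (t - sum_i x_i w_i)^2
    return 0.5 * (t - dot(x, w)) ** 2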
Stochastic Gradient Descent (2/2)
Update w so that the loss f(w, x, t) decreases.
Stochastic Gradient Descent (SGD):
w = 0;
for ((x, t) in X) {
  w -= η ∇f(w, x, t);   // η: learning rate
}
Hinge loss: ∂f(w, x, t)/∂w_i = -t x_i  (when 1 - t Σ_j x_j w_j > 0, otherwise 0)
Squared loss: ∂f(w, x, t)/∂w_i = -(t - Σ_j x_j w_j) x_i
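A minimal Python sketch of this loop with the hinge-loss gradient; the learning rate eta is an assumed hyperparameter, not something the slides fix:

# SGD with the hinge loss; eta is an assumed learning rate.
def sgd(data, eta=0.1):
    w = {}                                             # sparse weight vector
    for x, t in data:                                  # t is +1 or -1
        y = sum(v * w.get(k, 0.0) for k, v in x.items())
        if 1.0 - t * y > 0.0:                          # hinge loss is active
            for k, v in x.items():
                # gradient is -t * x_i, so w_i -= eta * (-t * x_i)
                w[k] = w.get(k, 0.0) + eta * t * v
    return w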
AdaGrad+RDA (1/6)
Improvements on SGD include AROW and SCW; AdaGrad+RDA is another such improvement.
This talk covers AdaGrad+RDA rather than AROW or SCW.
The difference between SGD and AdaGrad+RDA:
SGD: builds the (s+1)-th w from the s-th example x_s only.
AdaGrad+RDA: builds the (s+1)-th w using all of the examples from 0 through s.
AdaGrad+RDA (2/6)
AdaGrad+RDA chooses w_{s+1} to minimize the following regret-style objective R:
R(w_{s+1}) = Σ_i g_i w_{s+1,i} + λ ||w_{s+1}||_1 + (1/2) Σ_i h_i w_{s+1,i}²
g_i = (1/s) Σ_{j=0}^{s} ∂f(w_j, x_j, t_j)/∂w_{j,i}
h_i = (1/s) √( Σ_{j=0}^{s} {∂f(w_j, x_j, t_j)/∂w_{j,i}}² )
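A sketch of evaluating this objective; lam stands for λ, and the per-coordinate dicts are an assumed representation:

# R(w) = sum_i g_i*w_i + lam*||w||_1 + 1/2 * sum_i h_i*w_i^2  (sketch)
def regret_objective(w, g, h, lam):
    loss_term = sum(g.get(i, 0.0) * wi for i, wi in w.items())
    l1_term = lam * sum(abs(wi) for wi in w.values())
    prox_term = 0.5 * sum(h.get(i, 0.0) * wi ** 2 for i, wi in w.items())
    return loss_term + l1_term + prox_term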
AdaGrad+RDA (3/6)
R(w_{s+1}) = Σ_i g_i w_{s+1,i} + λ ||w_{s+1}||_1 + (1/2) Σ_i h_i w_{s+1,i}²
The regret R is built from four quantities: w, g, h, and λ.
w: the (s+1)-th weight vector, i.e. what we want to find.
g, h: quantities computed from the gradients of the loss f.
λ: a hyperparameter controlling the strength of the L1 regularization.
Once g and h are known, the w that minimizes R can be computed.
AdaGrad+RDA (4/6)
g_i = (1/s) Σ_{j=0}^{s} ∂f(w_j, x_j, t_j)/∂w_{j,i}
h_i = (1/s) √( Σ_{j=0}^{s} {∂f(w_j, x_j, t_j)/∂w_{j,i}}² )
Both g and h are built from the same per-example gradients ∂f(w_j, x_j, t_j)/∂w_{j,i},
so they can be updated incrementally as each example is processed.
g is the average of the gradients; h accumulates their squares.
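A sketch of that incremental bookkeeping; the square root in h follows the AdaGrad-style form reconstructed above:

# Maintain g (average gradient) and h from per-example gradients (sketch).
import math

grad_sum = {}   # sum over j of df/dw_i
sq_sum = {}     # sum over j of (df/dw_i)^2
s = 0           # number of examples seen so far

def accumulate(grad):
    # grad maps feature i to the current example's partial derivative df/dw_i
    global s
    s += 1
    for i, d in grad.items():
        grad_sum[i] = grad_sum.get(i, 0.0) + d
        sq_sum[i] = sq_sum.get(i, 0.0) + d * d

def g(i):
    return grad_sum.get(i, 0.0) / s

def h(i):
    # square root of the accumulated squared gradients, per the AdaGrad form
    return math.sqrt(sq_sum.get(i, 0.0)) / s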
AdaGrad+RDA (5/6)
Solving ∂R(w)/∂w = 0 gives w in closed form: w = r(λ, g, h).
The learning loop is:
w = 0;
for ((x, t) in X) {
  update_g(w, x, t);   // accumulate ∂f(w, x, t)/∂w into g
  update_h(w, x, t);   // accumulate {∂f(w, x, t)/∂w}² into h
  w = r(λ, g, h);
}
AdaGrad+RDA (6/6)
The closed form w = r(λ, g, h) obtained from ∂R(w)/∂w = 0 is, per coordinate i:
|g_i| <= λ:
  w_i = 0
g_i > λ:
  w_i = (-g_i + λ) / h_i
g_i < -λ:
  w_i = (-g_i - λ) / h_i
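Putting the pieces together, a minimal AdaGrad+RDA sketch with the hinge loss; lam (λ) and the guard for h_i = 0 are my choices, and the repository linked in the summary holds the author's actual implementation:

# AdaGrad+RDA sketch: hinge loss, L1 weight lam, per-coordinate closed-form update.
import math

def adagrad_rda(data, lam=0.1):
    grad_sum, sq_sum, w = {}, {}, {}
    s = 0
    for x, t in data:                                      # t is +1 or -1
        s += 1
        y = sum(v * w.get(k, 0.0) for k, v in x.items())
        if 1.0 - t * y > 0.0:                              # hinge gradient is -t*x_i (else 0)
            for k, v in x.items():
                d = -t * v
                grad_sum[k] = grad_sum.get(k, 0.0) + d
                sq_sum[k] = sq_sum.get(k, 0.0) + d * d
        for k in grad_sum:                                 # closed form of dR/dw_i = 0
            g = grad_sum[k] / s
            h = math.sqrt(sq_sum[k]) / s
            if h == 0.0 or abs(g) <= lam:
                w[k] = 0.0                                 # |g_i| <= lam  ->  w_i = 0
            elif g > lam:
                w[k] = (-g + lam) / h                      # g_i >  lam
            else:
                w[k] = (-g - lam) / h                      # g_i < -lam
    return w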
AdaGrad+RDA interpretation (1/3)
What does each term of the AdaGrad+RDA objective mean?
AdaGrad+RDA interpretation (2/3)
Each term of the regret objective has a name:
R(w_{s+1}) = Σ_i g_i w_{s+1,i} + λ ||w_{s+1}||_1 + (1/2) Σ_i h_i w_{s+1,i}²
Loss term Σ_i g_i w_{s+1,i}: the "Dual Averaging" part (the averaged gradients of the loss).
Regularization term λ ||w_{s+1}||_1: the "Regularized" part.
Proximal term (1/2) Σ_i h_i w_{s+1,i}²: the "Adaptive Gradient" part.
AdaGrad+RDA interpretation (3/3)
The first term picks the w_{s+1} against which the regret of the sequence w_0, ..., w_s is largest (⟨a, b⟩ denotes the inner product, and f_j is shorthand for f(w_j, x_j, t_j)):
max_{w_{s+1}} Σ_{j=0}^{s} ⟨∇f_j, w_j - w_{s+1}⟩
= max_{w_{s+1}} [ Σ_{j=0}^{s} ⟨∇f_j, w_j⟩ - Σ_{j=0}^{s} ⟨∇f_j, w_{s+1}⟩ ]
The first sum does not depend on w_{s+1}, so this maximization is the same as minimizing Σ_{j=0}^{s} ⟨∇f_j, w_{s+1}⟩, i.e. minimizing ⟨Σ_{j=0}^{s} ∇f_j / s, w_{s+1}⟩ = Σ_i g_i w_{s+1,i}, which is exactly the first term of R.
By convexity, Σ_{j=0}^{s} ⟨∇f_j, w_j - w_{s+1}⟩ ≥ Σ_{j=0}^{s} (f_j(w_j) - f_j(w_{s+1})), so this quantity upper-bounds the loss of the past weights relative to the fixed point w_{s+1}.
Summary (1/1)
Introduced SGD.
Introduced AdaGrad+RDA and its interpretation.
An implementation of AdaGrad+RDA: https://github.com/echizentm/AdaGrad
References:
Duchi et al. (2010) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Xiao (2010) Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization