AdaGrad+RDA
echizen_tm
Oct.11, 2014
Linear classification (3p)
Stochastic Gradient Descent (2p)
AdaGrad+RDA (6p)
AdaGrad+RDA interpretation (3p)
Summary (1p)
Linear classification (1/3)
Examples of items represented as sets of features:
{…, …}
{…, …, …, …, …}
{10…, 20…, 30…, 40…, …}
Linear classification (2/3)
Feature vector: x
Weight vector: w
y = Σ_i x_i w_i
If y > 0, classify as A; if y <= 0, classify as B.
Linear classification (3/3)
Example:
x = {…: 1, …: 1, …: 1, …: 1}
w = {…: 1, …: 1, …: 1, …: -1}
y = 1*1 + 1*1 + 1*1 + 1*(-1)
  = 2 > 0, so this example is classified as A.
If the true class is A (t = 1) the prediction is correct; if it is B (t = -1) it is wrong.
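A minimal Python sketch of this computation; the feature names f1..f4 are hypothetical placeholders:

# Linear classifier score; feature names below are hypothetical placeholders.
def score(x, w):
    # y = sum_i x_i * w_i, using only the features present in x
    return sum(v * w.get(k, 0.0) for k, v in x.items())

x = {"f1": 1, "f2": 1, "f3": 1, "f4": 1}
w = {"f1": 1, "f2": 1, "f3": 1, "f4": -1}
y = score(x, w)                  # 1 + 1 + 1 - 1 = 2
label = "A" if y > 0 else "B"    # y > 0 -> A, y <= 0 -> B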
Stochastic Gradient Descent (1/2)
Learn the weight vector w from labeled examples (x, t).
A loss function f measures how badly w fits an example.
Hinge loss: f(w, x, t) = max(0, 1 - t Σ_i x_i w_i)
Squared loss: f(w, x, t) = (1/2)(t - Σ_i x_i w_i)²
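A small Python sketch of these two losses, assuming the same sparse feature-dict representation as above:

# Hinge loss and squared loss for a linear model (sketch).
def dot(x, w):
    return sum(v * w.get(k, 0.0) for k, v in x.items())

def hinge_loss(w, x, t):
    # f(w, x, t) = max(0, 1 - t * sum_i x_i w_i)
    return max(0.0, 1.0 - t * dot(x, w))

def squared_loss(w, x, t):
    # f(w, x, t) = 1/2 * (t - sum_i x_i w_i)^2
    return 0.5 * (t - dot(x, w)) ** 2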
Stochastic Gradient Descent (2/2)
Update w so that the loss f(w, x, t) decreases.
Stochastic Gradient Descent (SGD):
w = 0;
for ((x, t) in X) {
  w -= η ∇f(w, x, t);   // η: learning rate
}
Hinge loss: ∂f(w, x, t)/∂w_i = -t x_i  (when 1 - t Σ_j x_j w_j > 0, otherwise 0)
Squared loss: ∂f(w, x, t)/∂w_i = -(t - Σ_j x_j w_j) x_i
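A minimal Python sketch of this loop with the hinge-loss gradient; the learning rate eta is an assumed hyperparameter, not something the slides fix:

# SGD with the hinge loss; eta is an assumed learning rate.
def sgd(data, eta=0.1):
    w = {}                                             # sparse weight vector
    for x, t in data:                                  # t is +1 or -1
        y = sum(v * w.get(k, 0.0) for k, v in x.items())
        if 1.0 - t * y > 0.0:                          # hinge loss is active
            for k, v in x.items():
                # gradient is -t * x_i, so w_i -= eta * (-t * x_i)
                w[k] = w.get(k, 0.0) + eta * t * v
    return w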
AdaGrad+RDA (1/6)
Improvements on SGD include AROW and SCW; AdaGrad+RDA is another such improvement.
This talk covers AdaGrad+RDA rather than AROW or SCW.
The difference between SGD and AdaGrad+RDA:
SGD: builds the (s+1)-th w from the s-th example x_s only.
AdaGrad+RDA: builds the (s+1)-th w using all of the examples from 0 through s.
AdaGrad+RDA (2/6)
AdaGrad+RDA chooses w_{s+1} to minimize the following regret-style objective R:
R(w_{s+1}) = Σ_i g_i w_{s+1,i} + λ ||w_{s+1}||_1 + (1/2) Σ_i h_i w_{s+1,i}²
g_i = (1/s) Σ_{j=0}^{s} ∂f(w_j, x_j, t_j)/∂w_{j,i}
h_i = (1/s) √( Σ_{j=0}^{s} {∂f(w_j, x_j, t_j)/∂w_{j,i}}² )
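A sketch of evaluating this objective; lam stands for λ, and the per-coordinate dicts are an assumed representation:

# R(w) = sum_i g_i*w_i + lam*||w||_1 + 1/2 * sum_i h_i*w_i^2  (sketch)
def regret_objective(w, g, h, lam):
    loss_term = sum(g.get(i, 0.0) * wi for i, wi in w.items())
    l1_term = lam * sum(abs(wi) for wi in w.values())
    prox_term = 0.5 * sum(h.get(i, 0.0) * wi ** 2 for i, wi in w.items())
    return loss_term + l1_term + prox_term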
AdaGrad+RDA (3/6)
R(w_{s+1}) = Σ_i g_i w_{s+1,i} + λ ||w_{s+1}||_1 + (1/2) Σ_i h_i w_{s+1,i}²
The regret R is built from four quantities: w, g, h, and λ.
w: the (s+1)-th weight vector, i.e. what we want to find.
g, h: quantities computed from the gradients of the loss f.
λ: a hyperparameter controlling the strength of the L1 regularization.
Once g and h are known, the w that minimizes R can be computed.
AdaGrad+RDA (4/6)
g_i = (1/s) Σ_{j=0}^{s} ∂f(w_j, x_j, t_j)/∂w_{j,i}
h_i = (1/s) √( Σ_{j=0}^{s} {∂f(w_j, x_j, t_j)/∂w_{j,i}}² )
Both g and h are built from the same per-example gradients ∂f(w_j, x_j, t_j)/∂w_{j,i},
so they can be updated incrementally as each example is processed.
g is the average of the gradients; h accumulates their squares.
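A sketch of that incremental bookkeeping; the square root in h follows the AdaGrad-style form reconstructed above:

# Maintain g (average gradient) and h from per-example gradients (sketch).
import math

grad_sum = {}   # sum over j of df/dw_i
sq_sum = {}     # sum over j of (df/dw_i)^2
s = 0           # number of examples seen so far

def accumulate(grad):
    # grad maps feature i to the current example's partial derivative df/dw_i
    global s
    s += 1
    for i, d in grad.items():
        grad_sum[i] = grad_sum.get(i, 0.0) + d
        sq_sum[i] = sq_sum.get(i, 0.0) + d * d

def g(i):
    return grad_sum.get(i, 0.0) / s

def h(i):
    # square root of the accumulated squared gradients, per the AdaGrad form
    return math.sqrt(sq_sum.get(i, 0.0)) / s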
AdaGrad+RDA (5/6)
Solving ∂R(w)/∂w = 0 gives w in closed form: w = r(λ, g, h).
The learning loop is:
w = 0;
for ((x, t) in X) {
  update_g(w, x, t);   // accumulate ∂f(w, x, t)/∂w into g
  update_h(w, x, t);   // accumulate {∂f(w, x, t)/∂w}² into h
  w = r(λ, g, h);
}
AdaGrad+RDA (6/6)
The closed form w = r(λ, g, h) obtained from ∂R(w)/∂w = 0 is, per coordinate i:
|g_i| <= λ:
  w_i = 0
g_i > λ:
  w_i = (-g_i + λ) / h_i
g_i < -λ:
  w_i = (-g_i - λ) / h_i
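Putting the pieces together, a minimal AdaGrad+RDA sketch with the hinge loss; lam (λ) and the guard for h_i = 0 are my choices, and the repository linked in the summary holds the author's actual implementation:

# AdaGrad+RDA sketch: hinge loss, L1 weight lam, per-coordinate closed-form update.
import math

def adagrad_rda(data, lam=0.1):
    grad_sum, sq_sum, w = {}, {}, {}
    s = 0
    for x, t in data:                                      # t is +1 or -1
        s += 1
        y = sum(v * w.get(k, 0.0) for k, v in x.items())
        if 1.0 - t * y > 0.0:                              # hinge gradient is -t*x_i (else 0)
            for k, v in x.items():
                d = -t * v
                grad_sum[k] = grad_sum.get(k, 0.0) + d
                sq_sum[k] = sq_sum.get(k, 0.0) + d * d
        for k in grad_sum:                                 # closed form of dR/dw_i = 0
            g = grad_sum[k] / s
            h = math.sqrt(sq_sum[k]) / s
            if h == 0.0 or abs(g) <= lam:
                w[k] = 0.0                                 # |g_i| <= lam  ->  w_i = 0
            elif g > lam:
                w[k] = (-g + lam) / h                      # g_i >  lam
            else:
                w[k] = (-g - lam) / h                      # g_i < -lam
    return w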
AdaGrad+RDA interpretation (1/3)
What does each term of the AdaGrad+RDA objective mean?
AdaGrad+RDA interpretation (2/3)
Each term of the regret objective has a name:
R(w_{s+1}) = Σ_i g_i w_{s+1,i} + λ ||w_{s+1}||_1 + (1/2) Σ_i h_i w_{s+1,i}²
Loss term Σ_i g_i w_{s+1,i}: the "Dual Averaging" part (the averaged gradients of the loss).
Regularization term λ ||w_{s+1}||_1: the "Regularized" part.
Proximal term (1/2) Σ_i h_i w_{s+1,i}²: the "Adaptive Gradient" part.
AdaGrad+RDA interpretation (3/3)
The first term picks the w_{s+1} against which the regret of the sequence w_0, ..., w_s is largest (⟨a, b⟩ denotes the inner product, and f_j is shorthand for f(w_j, x_j, t_j)):
max_{w_{s+1}} Σ_{j=0}^{s} ⟨∇f_j, w_j - w_{s+1}⟩
= max_{w_{s+1}} [ Σ_{j=0}^{s} ⟨∇f_j, w_j⟩ - Σ_{j=0}^{s} ⟨∇f_j, w_{s+1}⟩ ]
The first sum does not depend on w_{s+1}, so this maximization is the same as minimizing Σ_{j=0}^{s} ⟨∇f_j, w_{s+1}⟩, i.e. minimizing ⟨Σ_{j=0}^{s} ∇f_j / s, w_{s+1}⟩ = Σ_i g_i w_{s+1,i}, which is exactly the first term of R.
By convexity, Σ_{j=0}^{s} ⟨∇f_j, w_j - w_{s+1}⟩ ≥ Σ_{j=0}^{s} (f_j(w_j) - f_j(w_{s+1})), so this quantity upper-bounds the loss of the past weights relative to the fixed point w_{s+1}.
Summary (1/1)
Introduced SGD.
Introduced AdaGrad+RDA and its interpretation.
An implementation of AdaGrad+RDA: https://github.com/echizentm/AdaGrad
References:
Duchi et al. (2010) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Xiao (2010) Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization