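Chapter 20: Deep Generative Models
This chapter presents several specific kinds of generative models that can be built and trained using the techniques presented in chapters 16 to 19. Each of these models represents a probability distribution over multiple variables, and the distributions can be described with the graphical modeling language of chapter 16.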
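20.1 Boltzmann Machines
Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors (Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986).
We define the Boltzmann machine over a d-dimensional binary random vector $x \in \{0, 1\}^d$. The Boltzmann machine is an energy-based model (section 16.2.4), meaning that the joint probability distribution is defined using an energy function:
\[ P(x) = \frac{\exp(-E(x))}{Z}, \tag{20.1} \]
where $E(x)$ is the energy function and Z is the partition function that ensures $\sum_x P(x) = 1$. The energy function of the Boltzmann machine is
\[ E(x) = -x^\top U x - b^\top x, \tag{20.2} \]
where U is the weight matrix of model parameters and b is the vector of bias parameters.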
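As a minimal sketch of equations 20.1 and 20.2, the following code enumerates every state of a very small Boltzmann machine and normalizes $\exp(-E(x))$ by the partition function. The parameter values and the helper names `energy` and `exact_distribution` are illustrative assumptions, and exhaustive enumeration is only feasible for tiny d.

```python
import itertools
import numpy as np

def energy(x, U, b):
    """Boltzmann machine energy E(x) = -x^T U x - b^T x (equation 20.2)."""
    return -x @ U @ x - b @ x

def exact_distribution(U, b):
    """Enumerate all 2^d binary states and normalize exp(-E(x)) by Z (equation 20.1)."""
    d = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
    unnormalized = np.exp(-np.array([energy(x, U, b) for x in states]))
    Z = unnormalized.sum()  # partition function
    return states, unnormalized / Z

# Hypothetical toy parameters (not from the text): d = 3 units.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(3, 3))
b = rng.normal(scale=0.5, size=3)
states, probs = exact_distribution(U, b)
```

The cost of this enumeration doubles with every added unit, which is why the partition function is treated as intractable for models of realistic size.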
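In the general setting, we are given a set of training examples, each of which is n-dimensional, and equation 20.1 describes the joint probability distribution over the observed variables. The Boltzmann machine becomes more powerful when not all the variables are observed. In that case the latent variables can act similarly to hidden units in a multilayer perceptron (MLP) and model higher-order interactions among the visible units. A Boltzmann machine with hidden units is no longer limited to modeling linear relationships between variables; it becomes a universal approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008).
Formally, we decompose the units x into two subsets: the visible units v and the latent (or hidden) units h. The energy function becomes
\[ E(v, h) = -v^\top R v - v^\top W h - h^\top S h - b^\top v - c^\top h. \tag{20.3} \]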
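Boltzmann machine learning algorithms are usually based on maximum likelihood. All Boltzmann machines have an intractable partition function, so the maximum likelihood gradient must be approximated using the techniques described in chapter 18.
When a Boltzmann machine is trained with learning rules based on maximum likelihood, the update for a particular weight connecting two units depends only on the statistics of those two units, collected under two different distributions: $P_{\text{model}}(v)$ and $\hat{P}_{\text{data}}(v) P_{\text{model}}(h \mid v)$. The rest of the network participates in shaping those statistics, but the weight can be updated without knowing anything about the rest of the network, so the learning rule is local. If two units frequently activate together, their connection is strengthened. This is an example of a Hebbian learning rule (Hebb, 1949), often summarized by the mnemonic "fire together, wire together."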
18.2
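20.2 Restricted Boltzmann Machines
Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann machines (RBMs) are some of the most common building blocks of deep probabilistic models.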
16.7.1
RBM
RBM
RBM RBM
20.120.1a RBM
2
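Figure 20.1: Examples of models that may be built with restricted Boltzmann machines. (a) The restricted Boltzmann machine itself is an undirected graphical model based on a bipartite graph, with visible units in one part of the graph and hidden units in the other. There are no connections among the visible units, nor among the hidden units. (b) A deep belief network (DBN) is a hybrid graphical model involving both directed and undirected connections. Like an RBM, it has no intralayer connections, but it has multiple hidden layers and thus connections between hidden units in separate layers. (c) A deep Boltzmann machine (DBM) is an undirected graphical model with several layers of latent variables. Like RBMs and DBNs, DBMs lack intralayer connections. DBMs are less closely tied to RBMs than DBNs are: when a DBM is initialized from a stack of RBMs, the RBM parameters must be modified slightly, and some kinds of DBMs may be trained without first training a set of RBMs.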
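We begin with the binary version of the restricted Boltzmann machine. The observed layer consists of a set of $n_v$ binary random variables, referred to collectively by the vector v. The latent, or hidden, layer consists of $n_h$ binary random variables, referred to as h.
Like the general Boltzmann machine, the restricted Boltzmann machine is an energy-based model whose joint probability distribution is specified by its energy function:
\[ P(\mathrm{v} = v, \mathrm{h} = h) = \frac{1}{Z} \exp(-E(v, h)). \tag{20.4} \]
The energy function for an RBM is
\[ E(v, h) = -b^\top v - c^\top h - v^\top W h, \tag{20.5} \]
and Z is the normalizing constant known as the partition function:
\[ Z = \sum_v \sum_h \exp\{-E(v, h)\}. \tag{20.6} \]
It is apparent from this definition that the naive method of computing Z, exhaustively summing over all states, could be computationally intractable unless a cleverly designed algorithm could exploit regularities in the probability distribution to compute Z faster. For restricted Boltzmann machines, Long and Servedio (2010) formally proved that the partition function Z is intractable. The intractable partition function Z implies that the normalized joint probability distribution $P(v)$ is also intractable to evaluate.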
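20.2.1 Conditional Distributions
Though $P(v)$ is intractable, the bipartite graph structure of the RBM has the special property that its conditional distributions $P(h \mid v)$ and $P(v \mid h)$ are factorial and relatively simple to compute and sample from.
Deriving the conditional distributions from the joint distribution is straightforward:
\begin{align}
P(h \mid v) &= \frac{P(h, v)}{P(v)} \tag{20.7} \\
&= \frac{1}{P(v)} \frac{1}{Z} \exp\{ b^\top v + c^\top h + v^\top W h \} \tag{20.8} \\
&= \frac{1}{Z'} \exp\{ c^\top h + v^\top W h \} \tag{20.9} \\
&= \frac{1}{Z'} \exp\Big\{ \sum_{j=1}^{n_h} c_j h_j + \sum_{j=1}^{n_h} v^\top W_{:,j} h_j \Big\} \tag{20.10} \\
&= \frac{1}{Z'} \prod_{j=1}^{n_h} \exp\{ c_j h_j + v^\top W_{:,j} h_j \}. \tag{20.11}
\end{align}
Here $Z'$ absorbs every factor that does not depend on h.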
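Since we are conditioning on the visible units v, we can treat them as constant with respect to the distribution $P(h \mid v)$. The factorial nature of $P(h \mid v)$ follows immediately from our ability to write the joint probability over the vector h as the product of (unnormalized) distributions over the individual elements $h_j$. It is now a simple matter of normalizing the distributions over the individual binary $h_j$:
\begin{align}
P(h_j = 1 \mid v) &= \frac{\tilde{P}(h_j = 1 \mid v)}{\tilde{P}(h_j = 0 \mid v) + \tilde{P}(h_j = 1 \mid v)} \tag{20.12} \\
&= \frac{\exp\{ c_j + v^\top W_{:,j} \}}{\exp\{0\} + \exp\{ c_j + v^\top W_{:,j} \}} \tag{20.13} \\
&= \sigma( c_j + v^\top W_{:,j} ). \tag{20.14}
\end{align}
We can now express the full conditional over the hidden layer as the factorial distribution
\[ P(h \mid v) = \prod_{j=1}^{n_h} \sigma\big( (2h - 1) \odot (c + W^\top v) \big)_j. \tag{20.15} \]
A similar derivation shows that the other conditional of interest, $P(v \mid h)$, is also a factorial distribution:
\[ P(v \mid h) = \prod_{i=1}^{n_v} \sigma\big( (2v - 1) \odot (b + W h) \big)_i. \tag{20.16} \]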
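Because both conditionals are factorial, each layer can be sampled in a single vectorized step. The following is a minimal sketch of block Gibbs sampling built on equations 20.14 to 20.16. The layer sizes, the random parameters and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_v(v, W, c, rng):
    """Sample every h_j at once using P(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j])."""
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, b, rng):
    """Sample every v_i at once using P(v_i = 1 | h) = sigmoid(b_i + W[i, :] h)."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v0, W, b, c, steps, rng):
    """Alternate the two block updates for a fixed number of steps."""
    v = v0
    for _ in range(steps):
        h = sample_h_given_v(v, W, c, rng)
        v = sample_v_given_h(h, W, b, rng)
    return v, h

# Hypothetical toy model: 6 visible and 4 hidden units with small random weights.
rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = np.zeros(n_v)
c = np.zeros(n_h)
v0 = rng.integers(0, 2, size=n_v).astype(float)
v_sample, h_sample = gibbs_chain(v0, W, b, c, steps=10, rng=rng)
```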
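20.2.2 Training Restricted Boltzmann Machines
Because the RBM admits efficient evaluation and differentiation of $\tilde{P}(v)$ and efficient MCMC sampling in the form of block Gibbs sampling, it can readily be trained with any of the techniques described in chapter 18 for training models that have intractable partition functions. This includes CD, SML (PCD), ratio matching, and so on. Compared to other undirected models used in deep learning, the RBM is relatively straightforward to train because we can compute $P(h \mid v)$ exactly in closed form.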
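To make the training recipe concrete, here is a hedged sketch of one contrastive divergence update with a single Gibbs step (CD-1) for the binary RBM of equation 20.5. The minibatch, learning rate and function name `cd1_update` are illustrative assumptions; the update uses the usual difference between data-driven and model-driven statistics and is not a tuned implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(v_data, W, b, c, lr, rng):
    """One CD-1 step on a minibatch v_data of shape (m, n_v)."""
    m = v_data.shape[0]
    # Positive phase: exact P(h = 1 | v) for the data.
    ph_data = sigmoid(c + v_data @ W)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one block Gibbs step starting from the data.
    pv_model = sigmoid(b + h_sample @ W.T)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(c + v_model @ W)
    # Approximate log-likelihood gradient: data statistics minus model statistics.
    W_new = W + lr * (v_data.T @ ph_data - v_model.T @ ph_model) / m
    b_new = b + lr * (v_data - v_model).mean(axis=0)
    c_new = c + lr * (ph_data - ph_model).mean(axis=0)
    return W_new, b_new, c_new

# Hypothetical toy data: 20 random binary vectors with 6 visible units.
rng = np.random.default_rng(0)
v_batch = rng.integers(0, 2, size=(20, 6)).astype(float)
W0 = rng.normal(scale=0.01, size=(6, 4))
b0 = np.zeros(6)
c0 = np.zeros(4)
W1, b1, c1 = cd1_update(v_batch, W0, b0, c0, lr=0.1, rng=rng)
```

Persistent variants such as SML (PCD) differ mainly in that the negative-phase chain is carried over between updates instead of being restarted at the data.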
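20.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first nonconvolutional models to successfully admit training of deep architectures.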
MNIST
(Hinton et al., 2006)
DBN 2
20.1b
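In the case of real-valued visible units, substitute
\[ v \sim \mathcal{N}\big( v; b^{(0)} + W^{(1)\top} h^{(1)}, \beta^{-1} \big) \tag{20.20} \]
with $\beta$ diagonal for tractability.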
1 DBN RBM
DBN 2
2 RBM
1
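To train a deep belief network, one begins by training an RBM to maximize $E_{v \sim p_{\text{data}}} \log p(v)$ using contrastive divergence or stochastic maximum likelihood. The parameters of this RBM then define the parameters of the first layer of the DBN. Next, a second RBM is trained to approximately maximize
\[ E_{v \sim p_{\text{data}}} E_{h^{(1)} \sim p^{(1)}(h^{(1)} \mid v)} \log p^{(2)}(h^{(1)}), \tag{20.21} \]
where $p^{(1)}$ is the probability distribution represented by the first RBM and $p^{(2)}$ is the probability distribution represented by the second RBM.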
RBM 1 RBM
DBN RBM DBN
DBN
(Hinton et al., 2006)
DBN
wake-sleep
DBN DBN
DBN
MLP
\[ h^{(1)} = \sigma\big( b^{(1)} + v^\top W^{(1)} \big), \tag{20.22} \]
\[ h^{(l)} = \sigma\big( b^{(l)} + h^{(l-1)\top} W^{(l)} \big) \quad \forall l \in 2, \dots, m. \tag{20.23} \]
DBN MLP
MLP MLP
MLP 19
MLP
DBN MLP
MLP MLP
DBN MLP
DBN
DBN (Dean
and Kanazawa, 1989)
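20.4 Deep Boltzmann Machines
A deep Boltzmann machine, or DBM (Salakhutdinov and Hinton, 2009a), is another kind of deep generative model. Unlike the deep belief network (DBN), it is an entirely undirected model.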
DBM RBM DBM
RBM 1 RBM
20.2
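Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer and two hidden layers. Connections exist only between units in neighboring layers; there are no intralayer connections.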
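A DBM is an energy-based model, meaning that the joint probability distribution over the model variables is parametrized by an energy function E. In the case of a deep Boltzmann machine with one visible layer v and three hidden layers $h^{(1)}$, $h^{(2)}$ and $h^{(3)}$, the joint probability is given by
\[ P(v, h^{(1)}, h^{(2)}, h^{(3)}) = \frac{1}{Z(\theta)} \exp\big( -E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) \big). \tag{20.24} \]
To simplify the presentation, we omit the bias parameters below. The DBM energy function is then defined as
\[ E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)} - h^{(2)\top} W^{(3)} h^{(3)}. \tag{20.25} \]
In comparison to the RBM energy function (equation 20.5), the DBM energy function includes connections between the hidden units in the form of the weight matrices $W^{(2)}$ and $W^{(3)}$.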
DBM RBM 20.3
DBM 2
Figure 20.3: A deep Boltzmann machine, rearranged to reveal its bipartite graph structure.
DBM 2 RBM
DBM
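For a DBM with two hidden layers, the activation probabilities are given by
\begin{align}
P(v_i = 1 \mid h^{(1)}) &= \sigma\big( W^{(1)}_{i,:} h^{(1)} \big), \tag{20.26} \\
P(h^{(1)}_i = 1 \mid v, h^{(2)}) &= \sigma\big( v^\top W^{(1)}_{:,i} + W^{(2)}_{i,:} h^{(2)} \big), \tag{20.27} \\
P(h^{(2)}_k = 1 \mid h^{(1)}) &= \sigma\big( h^{(1)\top} W^{(2)}_{:,k} \big). \tag{20.28}
\end{align}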
2
1 1
RBM 1
2 1
l DBM l + 1
2
2
2 DBM
1
1
20.4.1 Interesting Properties
DBN
MLP
Q DBM
DBM
DBM
DBM
(Series et al., 2010; Reichert et al., 2011)
1 DBM DBN
2 MCMC
DBM
MCMC
20.4.2 DBM Mean Field Inference
DBM 2
DBM P (v | h(1) )P (h(1) | v, h(2) )P (h(2) | h(1) )
DBN DBM
DBN DBM 19.4
DBM
Salakhutdinov and
Hinton (2009a)
2
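Let $Q(h^{(1)}, h^{(2)} \mid v)$ be the approximation of $P(h^{(1)}, h^{(2)} \mid v)$. The mean field assumption implies that
\[ Q(h^{(1)}, h^{(2)} \mid v) = \prod_j Q(h^{(1)}_j \mid v) \prod_k Q(h^{(2)}_k \mid v). \tag{20.29} \]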
P (h(1) , h(2) | v)
v Q
Q(h | v) P (h | v)
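The mean field approach is to minimize
\[ \mathrm{KL}(Q \,\|\, P) = \sum_h Q(h^{(1)}, h^{(2)} \mid v) \log \left( \frac{Q(h^{(1)}, h^{(2)} \mid v)}{P(h^{(1)}, h^{(2)} \mid v)} \right). \tag{20.30} \]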
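We parametrize Q as a product of Bernoulli distributions; that is, we associate the probability of each element of $h^{(1)}$ with a parameter. Specifically, for each j, $\hat{h}^{(1)}_j = Q(h^{(1)}_j = 1 \mid v)$, where $\hat{h}^{(1)}_j \in [0, 1]$, and for each k, $\hat{h}^{(2)}_k = Q(h^{(2)}_k = 1 \mid v)$, where $\hat{h}^{(2)}_k \in [0, 1]$. Thus we have the following approximation to the posterior:
\begin{align}
Q(h^{(1)}, h^{(2)} \mid v) &= \prod_j Q(h^{(1)}_j \mid v) \prod_k Q(h^{(2)}_k \mid v) \tag{20.31} \\
&= \prod_j (\hat{h}^{(1)}_j)^{h^{(1)}_j} (1 - \hat{h}^{(1)}_j)^{(1 - h^{(1)}_j)} \times \prod_k (\hat{h}^{(2)}_k)^{h^{(2)}_k} (1 - \hat{h}^{(2)}_k)^{(1 - h^{(2)}_k)}. \tag{20.32}
\end{align}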
DBM
2
Q P
19.56
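The resulting update rules (again ignoring bias terms) are
\begin{align}
\hat{h}^{(1)}_j &= \sigma\Big( \sum_i v_i W^{(1)}_{i,j} + \sum_{k'} W^{(2)}_{j,k'} \hat{h}^{(2)}_{k'} \Big), \quad \forall j, \tag{20.33} \\
\hat{h}^{(2)}_k &= \sigma\Big( \sum_{j'} W^{(2)}_{j',k} \hat{h}^{(1)}_{j'} \Big), \quad \forall k. \tag{20.34}
\end{align}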
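The two update rules can be iterated until they stop changing. The sketch below alternates equations 20.33 and 20.34 for a DBM with two hidden layers and no biases. The initialization at 0.5, the layer sizes and the random weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W1, W2, n_iters=10):
    """Fixed-point iteration for the variational parameters h1_hat and h2_hat.

    W1 has shape (n_v, n_1) and W2 has shape (n_1, n_2).
    """
    h1_hat = np.full(W1.shape[1], 0.5)
    h2_hat = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)  # equation 20.33
        h2_hat = sigmoid(h1_hat @ W2)           # equation 20.34
    return h1_hat, h2_hat

# Hypothetical toy model: 5 visible units, 4 and 3 hidden units.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(5, 4))
W2 = rng.normal(scale=0.1, size=(4, 3))
v = rng.integers(0, 2, size=5).astype(float)
h1_hat, h2_hat = mean_field(v, W1, W2, n_iters=200)
```

With weights this small the iteration settles quickly; in general the fixed point reached is only a local optimum of the variational objective.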
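At a fixed point of this system of equations, we have a local maximum of the variational lower bound $\mathcal{L}(Q)$. These fixed-point update equations therefore define an iterative algorithm in which we alternate updates of $\hat{h}^{(1)}_j$ (using equation 20.33) and updates of $\hat{h}^{(2)}_k$ (using equation 20.34). On small problems such as MNIST, as few as 10 iterations can be sufficient to find an approximate positive phase gradient for learning, and 50 usually suffice to obtain a high-quality representation of a single specific example.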
DBM
20.4.3 DBM Parameter Learning
DBM 18
192
20.4.2 P (h | v) Q(h | v)
log P (v; ) L(v, Q, )
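For a deep Boltzmann machine with two hidden layers, $\mathcal{L}$ is given by
\[ \mathcal{L}(Q, \theta) = \sum_i \sum_{j'} v_i W^{(1)}_{i,j'} \hat{h}^{(1)}_{j'} + \sum_{j'} \sum_{k'} \hat{h}^{(1)}_{j'} W^{(2)}_{j',k'} \hat{h}^{(2)}_{k'} - \log Z(\theta) + \mathcal{H}(Q). \tag{20.35} \]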
log Z()
18DBM 18
18.2DBM
20.1 DBM
20.4.4 Layer-Wise Pretraining
DBM
DBM RBM
1 DBM RBM
20.4.5DBM
DBM RBM 1
RBM 1 RBM
RBM
RBM DBM DBM PCD
PCD
20.4
1
2
[Figure 20.4: four panels, (a) through (d)]
SML
PCD (Salakhutdinov
and Hinton, 2009a)
1
DBM Goodfellow et al. (2013b)
20.4.5 Jointly Training Deep Boltzmann Machines
DBM
MLP
1 RBM DBM
DBM
RBM CD DBM PCD
MLP
MLP
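Two main ways to resolve the joint training problem of the deep Boltzmann machine have been developed. The first is the centered deep Boltzmann machine (Montavon and Muller, 2012), which reparametrizes the model to make the Hessian of the cost function better conditioned at the beginning of the learning process. This yields a model that can be trained without a greedy layer-wise pretraining stage, although it has not matched the classification performance of an appropriately regularized MLP. The second is the multi-prediction deep Boltzmann machine (Goodfellow et al., 2013b), which uses an alternative training criterion that allows the back-propagation algorithm to be used and so avoids the problems with MCMC estimates of the gradient.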
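Recall from equation 20.2 that a Boltzmann machine consists of a set of units x with a weight matrix U and biases b, and that its energy function is
\[ E(x) = -x^\top U x - b^\top x. \tag{20.36} \]
Using different sparsity patterns in the weight matrix U, we can implement structures of Boltzmann machines such as RBMs or DBMs with different numbers of layers. This is accomplished by partitioning x into visible and hidden units and zeroing out elements of U for units that do not interact. The centered Boltzmann machine introduces a vector $\mu$ that is subtracted from all the states:
\[ E'(x; U, b) = -(x - \mu)^\top U (x - \mu) - (x - \mu)^\top b. \tag{20.37} \]
Typically $\mu$ is a hyperparameter fixed at the beginning of training. It is usually chosen so that $x - \mu \approx 0$ when the model is initialized.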
MP-
DBM
(Goodfellow et al.,
2013b)
20.5
(Stoyanov et al., 2011; Brakel et al., 2013) MP-DBM
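Figure 20.5: An illustration of the multi-prediction training process for a deep Boltzmann machine (MP-DBM). Figure adapted from Goodfellow et al. (2013b).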
MP-DBM
p(v)
2 1
SML
DBM MP-DBM
MP-DBM
MP-DBM
MP-DBM
MP-DBM
MP-DBM
MP-DBM
20.5 Boltzmann Machines for Real-Valued Data
[0,1]
Hinton (2000)
[0,1] 1
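20.5.1 Gaussian-Bernoulli RBMs
Restricted Boltzmann machines may be developed for many exponential family conditional distributions (Welling et al., 2005). Of these, the most common is the RBM with binary hidden units and real-valued visible units, in which the conditional distribution over the visible units is a Gaussian distribution whose mean is a function of the hidden units.
There are many ways of parametrizing Gaussian-Bernoulli RBMs. One choice is whether to use a covariance matrix or a precision matrix for the Gaussian distribution. Here we present the precision formulation, in which we wish to have the conditional distribution $p(v \mid h) = \mathcal{N}(v; W h, \beta^{-1})$. The terms to add to the energy function can be found by expanding the unnormalized log conditional distribution:
\[ \log \mathcal{N}(v; W h, \beta^{-1}) = -\frac{1}{2} (v - W h)^\top \beta (v - W h) + f(\beta). \tag{20.39} \]
Here f encapsulates all the terms that are a function only of the parameters and not of the random variables in the model. We can discard f because its only role is to normalize the distribution, and the partition function of whatever energy function we choose will carry out that role.
If we include all the terms (with their sign flipped) involving v from equation 20.39 in the energy function and do not add any other terms involving v, then the energy function will represent the desired conditional $p(v \mid h)$. We have some freedom regarding the other conditional distribution, $p(h \mid v)$, because equation 20.39 also contains terms that involve only h.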
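In particular, equation 20.39 contains the term
\[ \frac{1}{2} h^\top W^\top \beta W h. \tag{20.40} \]
This term cannot be included in its entirety because it includes $h_i h_j$ terms, which correspond to edges between the hidden units. If we included these terms, we would have a linear factor model instead of a restricted Boltzmann machine. When designing the Boltzmann machine, we simply omit these $h_i h_j$ cross terms. Omitting them does not change the conditional $p(v \mid h)$, so equation 20.39 is still respected. We still have a choice, however, about whether to include the terms involving only a single $h_i$. If we assume a diagonal precision matrix, then for each hidden unit $h_i$ we have a term
\[ \frac{1}{2} h_i \sum_j \beta_j W_{j,i}^2. \tag{20.41} \]
Here we used the fact that $h_i^2 = h_i$ because $h_i \in \{0, 1\}$. If we include this term (with its sign flipped) in the energy function, it naturally biases $h_i$ to be turned off when the weights for that unit are large and connected to visible units with high precision.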
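The choice of whether to include this bias term does not affect the family of distributions that the model can represent, but it does affect the learning dynamics. One way to define the energy function of a Gaussian-Bernoulli RBM is thus
\[ E(v, h) = \frac{1}{2} v^\top (\beta \odot v) - (v \odot \beta)^\top W h - b^\top h. \tag{20.42} \]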
- RBM 1
20.5.2 Undirected Models of Conditional Covariance
RBM
Ranzato et al. (2010a) RBM
RBM
Alternatives to the Gaussian-Bernoulli RBM include the mean and covariance RBM (mcRBM)*1, the mean product of Student's t-distributions (mPoT) model, and the spike and slab RBM (ssRBM).
- RBM mcRBM
mcRBM
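The mcRBM hidden layer is divided into two groups of units: mean units and covariance units. The group that models the conditional mean is simply a Gaussian RBM. The other group is a covariance RBM, or cRBM (Ranzato et al., 2010a), whose components model the conditional covariance structure. With binary mean units $h^{(m)}$ and binary covariance units $h^{(c)}$, the energy term $E_m$ is the standard Gaussian-Bernoulli RBM energy function*2:
\[ E_m(x, h^{(m)}) = \frac{1}{2} x^\top x - \sum_j x^\top W_{:,j} h_j^{(m)} - \sum_j b_j^{(m)} h_j^{(m)}. \tag{20.44} \]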
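*1 The term "mcRBM" is pronounced by saying the letters M-C-R-B-M; the "mc" is not pronounced like the "Mc" in "McDonald's."
*2 This version of the Gaussian-Bernoulli RBM energy function assumes that the image data have zero mean per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.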
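$E_c$ is the cRBM energy function that models the conditional covariance information:
\[ E_c(x, h^{(c)}) = \frac{1}{2} \sum_j h_j^{(c)} \left( x^\top r^{(j)} \right)^2 - \sum_j b_j^{(c)} h_j^{(c)}. \tag{20.45} \]
The parameter $r^{(j)}$ corresponds to the covariance weight vector associated with $h_j^{(c)}$, and $b^{(c)}$ is a vector of covariance offsets. The mcRBM energy $E_{mc}$, which is the sum of $E_m$ and $E_c$, defines a joint distribution
\[ p_{mc}(x, h^{(m)}, h^{(c)}) = \frac{1}{Z} \exp\{ -E_{mc}(x, h^{(m)}, h^{(c)}) \}, \tag{20.46} \]
and a corresponding conditional distribution over the observations given $h^{(m)}$ and $h^{(c)}$ as a multivariate Gaussian distribution:
\[ p_{mc}(x \mid h^{(m)}, h^{(c)}) = \mathcal{N}\Big( x; C^{mc}_{x \mid h} \Big( \sum_j W_{:,j} h_j^{(m)} \Big), C^{mc}_{x \mid h} \Big). \tag{20.47} \]
Note that the covariance matrix $C^{mc}_{x \mid h} = \Big( \sum_j h_j^{(c)} r^{(j)} r^{(j)\top} + I \Big)^{-1}$ is nondiagonal.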
W RBM
mcRBM
CD PCD xh(m) h(c)
RBM
mcRBM pmc (x | h(m) , h(c) )
(C mc )1
Ranzato and Hinton (2010)
pmc (x | h(m) , h(c) ) p(x)
mcRBM ()
(Neal, 1993)
The mean product of Student's t-distributions (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM.
RBM
mcRBM PoT
mcRBM
G(k, ) k mPoT
mPoT
(c)
r (j) hj
Em (x, h(m) ) 20.44
mcRBM mPoT x
mPoT
mcRBM pmPoT (x | h(m) , h(c) )
Ranzato et al. (2010b)
p(x)
Spike and slab restricted Boltzmann machines, or ssRBMs (Courville et al., 2011), provide another means of modeling the covariance structure of real-valued data.
mcRBM ssRBM
mcRBM
mPoT ssRBM
- RBM 2 spike
h slab s
(h s)W W:,i
hi = 1 hi
si
W:,i
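Formally, the ssRBM model is defined via the energy function
\begin{align}
E_{ss}(x, s, h) = &-\sum_i x^\top W_{:,i} s_i h_i + \frac{1}{2} x^\top \Big( \Lambda + \sum_i \Phi_i h_i \Big) x \tag{20.50} \\
&+ \frac{1}{2} \sum_i \alpha_i s_i^2 - \sum_i \alpha_i \mu_i s_i h_i - \sum_i b_i h_i + \sum_i \alpha_i \mu_i^2 h_i, \tag{20.51}
\end{align}
where $b_i$ is the offset of the spike $h_i$ and $\Lambda$ is a diagonal precision matrix on the observations x. The parameter $\alpha_i > 0$ is a scalar precision parameter for the real-valued slab variable $s_i$. The parameter $\Phi_i$ is a nonnegative diagonal matrix that defines an h-modulated quadratic penalty on x. Each $\mu_i$ is a mean parameter for the slab variable $s_i$.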
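With the joint distribution defined via the energy function, deriving the ssRBM conditional distributions is relatively straightforward. For example, by marginalizing out the slab variables s, the conditional distribution over the observations given the binary spike variables h is
\begin{align}
p_{ss}(x \mid h) &= \frac{1}{P(h)} \frac{1}{Z} \int \exp\{ -E(x, s, h) \}\, ds \tag{20.52} \\
&= \mathcal{N}\Big( x; C^{ss}_{x \mid h} \sum_i W_{:,i} \mu_i h_i, C^{ss}_{x \mid h} \Big), \tag{20.53}
\end{align}
where $C^{ss}_{x \mid h} = \Big( \Lambda + \sum_i \Phi_i h_i - \sum_i \alpha_i^{-1} h_i W_{:,i} W_{:,i}^\top \Big)^{-1}$. The last equality holds only if the covariance matrix $C^{ss}_{x \mid h}$ is positive definite.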
h s
MAP
ssRBM
16.1
ssRBM
(Courville et al., 2014)
20.6 Convolutional Boltzmann Machines
9
p n
d
p = maxi di
2n 3 3
29 = 512
!
1 1
n + 1 n
1
1 1
n + 1 1
n + 1
n + 1
*3
2 2 2 2
1
50%
50%
3 3 3 3 1
1
20.7 Boltzmann Machines for Structured or Sequential Outputs
x y y
y p(y | x)
x y
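In sequence modeling, the model must estimate a probability distribution over a sequence of variables, $p(x^{(1)}, \dots, x^{(\tau)})$. Conditional Boltzmann machines can represent factors of the form $p(x^{(t)} \mid x^{(1)}, \dots, x^{(t-1)})$ in order to accomplish this task.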
3D
RNN
20.8 Other Boltzmann Machines
log p(v)
2
2
vi Wi,j hj
(Sejnowski,
1987) 1 2 3
1
(Memisevic and Hinton, 2007, 2010)one-hot
(Nair
and Hinton, 2009) 2
1 v y
v (Luo et al., 2011)
20.9 Back-Propagation through Random Operations
x
x
1
z
z f (x, z)
f
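For example, consider the operation of drawing samples y from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$:
\[ y \sim \mathcal{N}(\mu, \sigma^2). \tag{20.54} \]
Because an individual sample of y is produced not by a function but by a sampling process whose output changes every time we query it, it may seem counterintuitive to take the derivatives of y with respect to the parameters of its distribution, $\mu$ and $\sigma^2$. However, we can rewrite the sampling process as transforming an underlying random value $z \sim \mathcal{N}(z; 0, 1)$ to obtain a sample from the desired distribution:
\[ y = \mu + \sigma z. \tag{20.55} \]
We can now back-propagate through the sampling operation by regarding it as a deterministic operation with an extra input z, for example to compute the derivatives of a loss function $J(y)$. The parameters may themselves be outputs of other functions, $\mu = f(x; \theta)$ and $\sigma = g(x; \theta)$, in which case back-propagation through these functions yields $\nabla_\theta J(y)$.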
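The rewriting in equation 20.55 can be checked numerically. The sketch below uses a hypothetical loss $J(y) = y^2$, chosen only because its expectation $\mu^2 + \sigma^2$ has known derivatives $2\mu$ and $2\sigma$, and applies the chain rule through $y = \mu + \sigma z$ by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                 # hypothetical parameter values
z = rng.standard_normal(200_000)     # z ~ N(0, 1), independent of mu and sigma
y = mu + sigma * z                   # equation 20.55

# Hypothetical loss J(y) = y^2, so dJ/dy = 2y.
dJ_dy = 2.0 * y
grad_mu = np.mean(dJ_dy * 1.0)       # chain rule with dy/dmu = 1
grad_sigma = np.mean(dJ_dy * z)      # chain rule with dy/dsigma = z

# Analytic reference: E[J] = mu^2 + sigma^2.
exact_mu, exact_sigma = 2.0 * mu, 2.0 * sigma
```

The Monte Carlo averages should land close to the analytic values, with an error that shrinks as the number of samples grows.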
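More generally, we can express any probability distribution of the form $p(y; \theta)$ or $p(y \mid x; \theta)$ as $p(y \mid \omega)$, where $\omega$ is a variable containing both the parameters $\theta$ and, if applicable, the inputs x. Given a value y sampled from the distribution $p(y \mid \omega)$, where $\omega$ may in turn be a function of other variables, we can rewrite
\[ y \sim p(y \mid \omega) \tag{20.56} \]
as
\[ y = f(z; \omega), \tag{20.57} \]
where z is a source of randomness. We may then compute the derivatives of y with respect to $\omega$ using traditional tools such as the back-propagation algorithm applied to f, as long as f is continuous and differentiable almost everywhere. Crucially, $\omega$ must not be a function of z, and z must not be a function of $\omega$. This technique is often called the reparametrization trick, stochastic back-propagation, or perturbation analysis.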
f y
20 (Price,
1958; Bonnet, 1964) (Williams,
1992) (Opper and Archambeau, 2009)
(Bengio et al., 2013b; Kingma, 2013; Kingma and Welling,
2014b,a; Rezende et al., 2014; Goodfellow et al., 2014c)
20.9.1 Back-Propagating through Discrete Stochastic Operations
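When a model emits a discrete variable y, the reparametrization trick is not applicable. Suppose that the model takes inputs x and parameters $\theta$, both encapsulated in the vector $\omega$, and combines them with random noise z to produce y:
\[ y = f(z; \omega). \tag{20.58} \]
Because y is discrete, f must be a step function, whose derivatives are not useful at any point. What we do have is a cost function $J(y)$ whose expected value is often smooth and amenable to gradient descent.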
SGD
REINFORCE
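The simplest version of REINFORCE can be derived by differentiating the expected cost:
\begin{align}
E_z[J(y)] &= \sum_y J(y) p(y), \tag{20.59} \\
\frac{\partial E[J(y)]}{\partial \omega} &= \sum_y J(y) \frac{\partial p(y)}{\partial \omega} \tag{20.60} \\
&= \sum_y J(y) p(y) \frac{\partial \log p(y)}{\partial \omega} \tag{20.61} \\
&\approx \frac{1}{m} \sum_{y^{(i)} \sim p(y),\; i=1}^{m} J(y^{(i)}) \frac{\partial \log p(y^{(i)})}{\partial \omega}. \tag{20.62}
\end{align}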
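Equation 20.60 relies on the assumption that J does not reference $\omega$ directly; it is trivial to extend the approach to relax this assumption. Equation 20.61 exploits the derivative rule for the logarithm, $\frac{\partial \log p(y)}{\partial \omega} = \frac{1}{p(y)} \frac{\partial p(y)}{\partial \omega}$. Equation 20.62 gives an unbiased Monte Carlo estimator of the gradient.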
REINFORCE 1
y
1 SGD
variance reduction
(Wilson, 1984; L'Ecuyer, 1994)
REINFORCE J(y)
baseliney
b()
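Any offset $b(\omega)$ that does not depend on y leaves the expectation of the estimated gradient unchanged, because
\begin{align}
E_{p(y)}\left[ \frac{\partial \log p(y)}{\partial \omega} \right] &= \sum_y p(y) \frac{\partial \log p(y)}{\partial \omega} \tag{20.63} \\
&= \sum_y \frac{\partial p(y)}{\partial \omega} \tag{20.64} \\
&= \frac{\partial}{\partial \omega} \sum_y p(y) = \frac{\partial}{\partial \omega} 1 = 0. \tag{20.65}
\end{align}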
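The gradient estimator with respect to $\omega_i$ then becomes
\[ (J(y) - b(\omega)_i) \frac{\partial \log p(y)}{\partial \omega_i}, \tag{20.69} \]
where $b(\omega)_i$ estimates the optimal baseline $b^*(\omega)_i$. The estimate b is usually obtained by adding extra outputs to the neural network and training them to estimate $E_{p(y)}\left[ J(y) \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 \right]$ and $E_{p(y)}\left[ \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 \right]$ for each element of $\omega$. These extra outputs can be trained with the mean squared error objective, using respectively $J(y) \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2$ and $\left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2$ as targets when y is sampled from $p(y)$ for a given $\omega$. The estimate b may then be recovered by substituting these estimates into equation 20.68. Mnih and Gregor (2014) preferred to use a single shared output (across all elements i of $\omega$) trained with the target $J(y)$, using as baseline $b(\omega) \approx E_{p(y)}[J(y)]$.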
1
Dayan (1990)
(Sutton et al., 2000; Weaver and Tao, 2001)
REINFORCE Bengio et al. (2013b)
Mnih and Gregor (2014)Ba et al. (2014)Mnih et al. (2014)Xu et al. (2015)
b() Mnih and Gregor (2014)
(J(y) b())
REINFORCE y J(y)
y
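The estimator of equation 20.62 and the effect of a baseline can be illustrated on a single Bernoulli variable. In the sketch below, $p(y = 1) = \sigma(\omega)$, so that $\partial \log p(y) / \partial \omega = y - p$. The cost function, the value of $\omega$ and the use of the exact $E_{p(y)}[J(y)]$ as the baseline are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(y):
    """Hypothetical cost J(y); the constant offset makes the baseline matter."""
    return (y - 0.3) ** 2 + 5.0

rng = np.random.default_rng(0)
omega = 0.4
p = sigmoid(omega)                        # p(y = 1)
m = 200_000
y = (rng.random(m) < p).astype(float)     # y^(i) ~ p(y)
score = y - p                             # d log p(y) / d omega for a Bernoulli with logit omega
J = cost(y)

baseline = p * cost(1.0) + (1.0 - p) * cost(0.0)   # b(omega) = E[J(y)], computed exactly here
plain = J * score                                  # per-sample terms of equation 20.62
with_baseline = (J - baseline) * score             # per-sample terms with the baseline subtracted

grad_plain = plain.mean()
grad_baseline = with_baseline.mean()
grad_exact = p * (1.0 - p) * (cost(1.0) - cost(0.0))
```

Both averages estimate the same gradient, but the per-sample terms with the baseline subtracted have a much smaller variance in this example.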
20.10 Directed Generative Nets
16
RBM
2013
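20.10.1 Sigmoid Belief Networks
Sigmoid belief networks (Neal, 1990) are a simple form of directed graphical model with a specific kind of conditional probability distribution. In general, we can think of a sigmoid belief network as having a vector of binary states s, with each element of the state influenced by its ancestors:
\[ p(s_i) = \sigma\Big( \sum_{j<i} W_{j,i} s_j + b_i \Big). \tag{20.70} \]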
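Because each unit depends only on units with smaller indices, a sample can be drawn in a single pass. The following sketch performs ancestral sampling according to equation 20.70; the network size and the random parameters are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ancestral_sample(W, b, rng):
    """Sample s_0, s_1, ... in order; unit i only sees units j < i."""
    n = len(b)
    s = np.zeros(n)
    for i in range(n):
        # W[:i, i] holds the weights W_{j,i} for j < i (an empty dot product is 0).
        p_i = sigmoid(W[:i, i] @ s[:i] + b[i])
        s[i] = float(rng.random() < p_i)
    return s

# Hypothetical toy network with 5 binary units.
rng = np.random.default_rng(0)
W_sbn = rng.normal(size=(5, 5))
b_sbn = rng.normal(size=5)
s = ancestral_sample(W_sbn, b_sbn, rng)
```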
1
(Saul et al., 1996)
1
19.5
20.9.1
20.10.7
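20.10.2 Differentiable Generator Networks
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z to samples x, or to distributions over samples x, using a differentiable function $g(z; \theta^{(g)})$, which is typically represented by a neural network.
Generator networks are essentially parametrized computational procedures for generating samples. For example, the standard procedure for drawing samples from a normal distribution with mean $\mu$ and covariance $\Sigma$ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network that contains just one affine layer:
\[ x = g(z) = \mu + L z, \tag{20.71} \]
where L is given by the Cholesky decomposition of $\Sigma$.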
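Equation 20.71 translates directly into a few lines of NumPy. In this sketch the mean and covariance are hypothetical values, and `np.linalg.cholesky` returns the lower-triangular factor L with $L L^\top = \Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                    # hypothetical mean
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                # hypothetical covariance
L = np.linalg.cholesky(Sigma)                 # lower triangular, L @ L.T == Sigma

z = rng.standard_normal((100_000, 2))         # z ~ N(0, I), one sample per row
x = mu + z @ L.T                              # row-wise version of x = mu + L z

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)
```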
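Pseudorandom number generators can also use nonlinear transformations of simple distributions. For example, inverse transform sampling (Devroye, 2013) draws a scalar z from $U(0, 1)$ and applies a nonlinear transformation to obtain a scalar x. In this case $g(z)$ is given by the inverse of the cumulative distribution function $F(x) = \int_{-\infty}^{x} p(v)\, dv$. If we are able to specify $p(x)$, integrate over x, and invert the resulting function, we can sample from $p(x)$ without using machine learning.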
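As a small worked instance of inverse transform sampling, consider an exponential distribution with rate $\lambda$, whose cumulative distribution function $F(x) = 1 - \exp(-\lambda x)$ can be inverted in closed form. The choice of distribution and the function name are assumptions made for illustration.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Draw z ~ U(0, 1) and return x = F^{-1}(z) = -log(1 - z) / rate."""
    z = rng.random(size)
    return -np.log1p(-z) / rate

rng = np.random.default_rng(0)
x_exp = sample_exponential(rate=2.0, size=100_000, rng=rng)
```

The sample mean should be close to $1 / \lambda = 0.5$ for this choice of rate.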
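To generate samples from more complicated distributions, we use a feedforward network to represent a parametric family of nonlinear functions g and use training data to infer the parameters that select the desired function. We can think of g as providing a nonlinear change of variables that transforms the distribution over z into the desired distribution over x. Recall from equation 3.47 that, for invertible, differentiable, continuous g,
\[ p_z(z) = p_x(g(z)) \left| \det\left( \frac{\partial g}{\partial z} \right) \right|. \tag{20.72} \]
This implicitly imposes a probability distribution over x:
\[ p_x(x) = \frac{p_z(g^{-1}(x))}{\left| \det\left( \frac{\partial g}{\partial z} \right) \right|}. \tag{20.73} \]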
. g log p(x)
g
g x g x
. g p(x | z) z x
,
\[ p(x_i = 1 \mid z) = g(z)_i. \tag{20.75} \]
g p(x | z) z x
. pg (x) 20.9
pg
z x
x y
z z x
z
x
x
z
20.10.3 Variational Autoencoders
x L(q)
. ,
20.771
EM
2
q
x
z 20.78
1
2 q(z | x) pmodel (z)
19.4
q
Ezq log pmodel (z, x)
q
z
q(z | x) = q(z; f (x; )) z
L
. L
1
DKL (pdata pmodel ) 3.6
,z .
VAE
VAE
The VAE framework extends to a wide range of model architectures. One such extension is the deep recurrent attention writer (DRAW) model (Gregor et al., 2015).
DRAW
VAE VAE
RNN (Chung et al., 2015b)
RNN
RNN VAE
VAE
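Another extension of the VAE is the importance-weighted autoencoder (Burda et al., 2015) objective:
\[ \mathcal{L}_k(x, q) = E_{z^{(1)}, \dots, z^{(k)} \sim q(z \mid x)} \left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p_{\text{model}}(x, z^{(i)})}{q(z^{(i)} \mid x)} \right]. \tag{20.80} \]
When k = 1, this objective is equivalent to the traditional lower bound $\mathcal{L}$. It may also be interpreted as an estimate of the true $\log p_{\text{model}}(x)$ obtained by importance sampling of z from the proposal distribution $q(z \mid x)$. The importance-weighted autoencoder objective is also a lower bound on $\log p_{\text{model}}(x)$ and becomes tighter as k increases.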
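The tightening of the bound with k can be observed on a toy model where $\log p_{\text{model}}(x)$ is known exactly. The sketch below assumes $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, so that $p_{\text{model}}(x) = \mathcal{N}(x; 0, 2)$, and deliberately uses the prior as a poor proposal $q(z \mid x)$. The model, the proposal and the function names are all assumptions made for illustration.

```python
import numpy as np

def log_mean_exp(a, axis):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def log_normal(y, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def iw_bound(x, k, n_outer, rng):
    """Monte Carlo estimate of L_k(x, q) with q(z | x) = N(0, 1)."""
    z = rng.standard_normal((n_outer, k))                        # z^(i) ~ q(z | x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x | z)
    log_q = log_normal(z, 0.0, 1.0)
    log_w = log_joint - log_q                                    # log importance weights
    return log_mean_exp(log_w, axis=1).mean()

rng = np.random.default_rng(0)
x_obs = 1.0
bound_k1 = iw_bound(x_obs, 1, 50_000, rng)
bound_k10 = iw_bound(x_obs, 10, 50_000, rng)
log_px = log_normal(x_obs, 0.0, 2.0)      # exact log p_model(x) for this toy model
```

In this setting the estimate for k = 10 should sit between the k = 1 bound and the exact log-likelihood, up to Monte Carlo error.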
MP-DBM
(Goodfellow et al., 2013b; Stoyanov et al., 2011;
Brakel et al., 2013)
MP-DBM
1 x z
1
20.6
1 2
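20.10.4 Generative Adversarial Networks
Generative adversarial networks, or GANs, are based on a game-theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples $x = g(z; \theta^{(g)})$. Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by $d(x; \theta^{(d)})$, indicating the probability that x is a real training example rather than a fake sample drawn from the model.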
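The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function $v(\theta^{(g)}, \theta^{(d)})$ determines the payoff of the discriminator. The generator receives $-v(\theta^{(g)}, \theta^{(d)})$ as its own payoff.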
v ,
1
2
GAN
(g) maxd v(g, d)
g d maxd v(g, d)
GAN Goodfellow (2014) GAN
GAN 2
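Consider, for example, the value function $v(a, b) = ab$, where one player controls a and incurs cost $ab$, while the other player controls b and receives cost $-ab$. If each player makes infinitesimally small gradient steps, reducing its own cost at the expense of the other player, then a and b go into a stable, circular orbit rather than arriving at the equilibrium point at the origin.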
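The failure to reach the equilibrium can be reproduced in a few lines. The sketch below applies simultaneous gradient steps with a finite step size, which is an assumption that departs from the infinitesimal steps of the thought experiment; with a finite step each update multiplies the squared distance from the origin by $1 + \epsilon^2$, so the iterates spiral outward.

```python
import math

def simultaneous_gradient_steps(a, b, lr, n_steps):
    """Both players take a gradient step on their own cost at the same time."""
    radii = [math.hypot(a, b)]
    for _ in range(n_steps):
        grad_a = b       # d(ab)/da: the first player minimizes ab
        grad_b = -a      # d(-ab)/db: the second player minimizes -ab
        a, b = a - lr * grad_a, b - lr * grad_b
        radii.append(math.hypot(a, b))
    return a, b, radii

a_final, b_final, radii = simultaneous_gradient_steps(1.0, 0.0, lr=0.1, n_steps=100)
```

The recorded radii never decrease, so neither player approaches the origin under this update scheme.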
v
1
2
v
v
GAN
Goodfellow (2014)
GAN
GAN
Goodfellow et al. (2014c)
GAN GAN
Radford et al. (2015)
GANDCGAN15.9
DCGAN
20.7
GAN
p(x) p(x | y)
GAN (Mirza and Osindero, 2014) Denton et al. (2015)
GAN
LAPGAN
LAPGAN
40%
LAPGAN 20.7
GAN
GAN
20.10.5 Generative Moment Matching Networks
moment matching
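In this context, a moment is an expectation of different powers of a random variable. For example, the first moment is the mean, the second moment is the mean of the squared values, and so on. In multiple dimensions, each element of the random vector may be raised to a different power, so that a moment may be any quantity of the form
\[ E_x \prod_i x_i^{n_i}, \tag{20.83} \]
where $n = [n_1, n_2, \dots, n_d]^\top$ is a vector of nonnegative integers.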
xi xj
x 2
1 2
GAN
2 MMD
GAN
1
1
MMD
,
1
GAN MMD
20.10.6 Convolutional Generative Networks
Goodfellow et al. (2014c) Dosovitskiy et al. (2015)
9.5
i i k k
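20.10.7 Auto-Regressive Networks
Auto-regressive networks are directed probabilistic models with no latent random variables. They decompose a joint probability over the observed variables using the chain rule of probability to obtain a product of conditionals of the form $P(x_d \mid x_{d-1}, \dots, x_1)$. Such models have been called fully-visible Bayes networks (FVBNs) and have been used successfully in many forms, first with logistic regression for each conditional distribution (Frey, 1998) and then with neural networks with hidden units (Bengio and Bengio, 2000b; Larochelle and Murray, 2011). Some forms of auto-regressive network, such as NADE (Larochelle and Murray, 2011), described in section 20.10.10, introduce a form of parameter sharing.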
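Figure 20.8: A fully visible belief network predicts the i-th variable from the i - 1 previous ones. (Top) The directed graphical model for an FVBN, with conditionals $P(x_1)$, $P(x_2 \mid x_1)$, $P(x_3 \mid x_1, x_2)$ and $P(x_4 \mid x_1, x_2, x_3)$. (Bottom) The corresponding computational graph for the logistic FVBN.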
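20.10.8 Linear Auto-Regressive Networks
The simplest form of auto-regressive network has no hidden units and no sharing of parameters or features. Each $P(x_i \mid x_{i-1}, \dots, x_1)$ is parametrized as a linear model. This model was introduced by Frey (1998) and has $O(d^2)$ parameters when there are d variables to model. It is illustrated in figure 20.8.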
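Figure 20.9: A neural auto-regressive network predicts the i-th variable $x_i$ from the i - 1 previous ones, but it is parametrized so that features (groups of hidden units denoted $h_i$) that are functions of $x_1, \dots, x_i$ can be reused in predicting all the subsequent variables $x_{i+1}, x_{i+2}, \dots, x_d$.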
20.10.9 Neural Auto-Regressive Networks
20.8
1 1
2
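1. Parametrizing each $P(x_i \mid x_{i-1}, \dots, x_1)$ by a neural network with $(i - 1) \times k$ inputs and k outputs (if the variables are discrete and take k values, encoded one-hot) makes it possible to estimate the conditional probability without requiring an exponential number of parameters (and examples), while still capturing high-order dependencies between the random variables.
2. Instead of using a different neural network to predict each $x_i$, a left-to-right connectivity, illustrated in figure 20.9, allows all the neural networks to be merged into one. Equivalently, the hidden layer features computed for predicting $x_i$ can be reused for predicting $x_{i+k}$ (k > 0).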
i x1 , . . . , xi
20.10.10 NADE
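The weights $W'_{j,k,i}$ from the i-th input $x_i$ to the k-th element of the j-th group of hidden units ($j \ge i$) are shared among the groups:
\[ W'_{j,k,i} = W_{k,i}. \tag{20.84} \]
The remaining weights, where $j < i$, are zero.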
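[Figure 20.10: the NADE graph, with inputs $x_1, \dots, x_4$, hidden groups $h_1, h_2, h_3$, and outputs $P(x_1)$, $P(x_2 \mid x_1)$, $P(x_3 \mid x_1, x_2)$, $P(x_4 \mid x_1, x_2, x_3)$]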
RBM
1 NADE NADE
RBM
NADE
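The weight sharing of equation 20.84 is what lets the hidden pre-activations be updated incrementally. The sketch below is one common way to write a binary NADE-style forward pass and is an assumption rather than the text's own formulation: it uses sigmoid hidden units, separate output weights V, and 0-based indexing, and it adds one column of the shared matrix W per observed variable so that the whole pass costs O(nh).

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_log_prob(x, W, V, b, c):
    """log p(x) for a binary vector x under a NADE-style model.

    W: (h, n) shared input-to-hidden weights (column i is reused by every
       later hidden group), V: (n, h) hidden-to-output weights,
    b: (n,) output biases, c: (h,) hidden biases.
    """
    a = c.copy()                              # pre-activation before any input is seen
    log_p = 0.0
    for i in range(len(x)):
        h_i = sigmoid(a)                      # features of x_0, ..., x_{i-1}
        p_i = sigmoid(b[i] + V[i] @ h_i)      # P(x_i = 1 | earlier variables)
        log_p += np.log(p_i if x[i] == 1 else 1.0 - p_i)
        a = a + W[:, i] * x[i]                # O(h) update using the shared weights
    return log_p

# Hypothetical toy model: n = 4 variables, h = 3 hidden units.
rng = np.random.default_rng(0)
n, h = 4, 3
W_nade = rng.normal(size=(h, n))
V_nade = rng.normal(size=(n, h))
b_nade = rng.normal(size=n)
c_nade = rng.normal(size=h)
total_prob = sum(
    np.exp(nade_log_prob(np.array(x), W_nade, V_nade, b_nade, c_nade))
    for x in itertools.product([0, 1], repeat=n)
)
```

Summing the probabilities of all $2^n$ binary vectors gives 1 up to rounding, which confirms that the product of conditionals defines a normalized distribution without any partition function.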
1 k
NADE-k (Raiko et al., 2014)
3.9.6
i i
i i2 RNADE
NADE (Uria et al., 2013)
i i2
Uria et al. (2013)
(Murray and Larochelle, 2014)
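Finally, since many orders of variables are possible ($n!$ for n variables) and each order o of variables yields a different $p(x \mid o)$, we can form an ensemble of models for many values of o:
\[ p_{\text{ensemble}}(x) = \frac{1}{k} \sum_{i=1}^{k} p(x \mid o^{(i)}). \tag{20.85} \]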
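Deep versions of the architecture were proposed (Bengio and Bengio, 2000b), but this immediately makes computation as expensive as in the original neural auto-regressive network. The first layer and the output layer can still be computed in $O(nh)$ multiply-add operations, as in the regular NADE, where h is the number of hidden units (the size of the groups $h_i$ in figures 20.10 and 20.9), whereas it is $O(n^2 h)$ in Bengio and Bengio (2000b). For the other hidden layers, however, the computation is $O(n^2 h^2)$ if every "previous" group at layer l participates in predicting the "next" group at layer l + 1, assuming n groups of h hidden units at each layer. Making the i-th group at layer l + 1 depend only on the i-th group at layer l, as in Murray and Larochelle (2014), reduces the cost to $O(nh^2)$, which is still h times worse than the regular NADE.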
20.11 Drawing Samples from Autoencoders
14
MCMC
20.11.1 Markov Chain Associated with Any Denoising Autoencoder
Bengio et al. (2013c)
generalized denoising autoencoders
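Figure 20.11: Each step of the Markov chain associated with a trained denoising autoencoder, which generates samples from the probabilistic model implicitly trained by the denoising log-likelihood criterion. Each step consists of (a) injecting noise via the corruption process C in state x, yielding $\tilde{x}$, (b) encoding it with the function f, yielding $h = f(\tilde{x})$, (c) decoding the result with the function g, yielding parameters $\omega$ for the reconstruction distribution, and (d) given $\omega$, sampling a new state from the reconstruction distribution $p(x \mid \omega = g(f(\tilde{x})))$. In the typical squared reconstruction error case, $g(h) = \hat{x}$, which estimates $E[x \mid \tilde{x}]$, corruption consists of adding Gaussian noise, and sampling from $p(x \mid \omega)$ consists of adding Gaussian noise a second time to the reconstruction $\hat{x}$.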
(Vincent, 2011)
C p f g
(Bengio et al., 2014)
20.11
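The Markov chain that generates from the estimated distribution consists of the following sub-steps, as illustrated in figure 20.11:
1. Starting from the previous state x, inject corruption noise, sampling $\tilde{x}$ from $C(\tilde{x} \mid x)$.
2. Encode $\tilde{x}$ into $h = f(\tilde{x})$.
3. Decode h to obtain the parameters $\omega = g(h)$ of $p(x \mid \omega = g(h)) = p(x \mid \tilde{x})$.
4. Sample the next state x from $p(x \mid \omega = g(h)) = p(x \mid \tilde{x})$.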
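The four sub-steps can be traced on a scalar toy problem in which the ideal reconstruction distribution is known in closed form. The sketch assumes data distributed as $\mathcal{N}(0, 1)$ and Gaussian corruption with standard deviation s, for which $p(x \mid \tilde{x}) = \mathcal{N}\big(\tilde{x} / (1 + s^2),\, s^2 / (1 + s^2)\big)$. The hand-derived `encode` and `decode` functions stand in for a trained autoencoder and are assumptions made for illustration.

```python
import numpy as np

def corrupt(x, s, rng):
    """Step 1: sample x_tilde ~ C(x_tilde | x) = N(x, s^2)."""
    return x + s * rng.standard_normal(x.shape)

def encode(x_tilde):
    """Step 2: h = f(x_tilde); the identity stands in for a trained encoder."""
    return x_tilde

def decode(h, s):
    """Step 3: omega = g(h), here the mean and variance of p(x | x_tilde)."""
    return h / (1.0 + s ** 2), s ** 2 / (1.0 + s ** 2)

def sample_reconstruction(omega, rng):
    """Step 4: draw the next state x ~ p(x | omega)."""
    mean, var = omega
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
s = 1.0
chain_state = np.full(100_000, 5.0)   # many independent chains, all started far from the data
for _ in range(50):
    chain_state = sample_reconstruction(decode(encode(corrupt(chain_state, s, rng)), s), rng)
```

After enough steps the states of the chains should be distributed approximately as the data distribution $\mathcal{N}(0, 1)$, regardless of the starting point.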
20.11.2 Clamping and Conditional Sampling
GSN
p(xf | xo )
xf xf
xo MP-DBM
GSN MP-DBM
(Bengio et al., 2014)
detailed balance
20.12
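Figure 20.12: Illustration of clamping the right half of the image and running the Markov chain by resampling only the left half at each step. These samples come from a GSN trained to reconstruct MNIST digits at each time step using the walk-back procedure.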
20.11.3 Walk-Back Training Procedure
-
18.2
k 1
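20.12 Generative Stochastic Networks
Generative stochastic networks, or GSNs (Bengio et al., 2014), are generalizations of denoising autoencoders that include latent variables h in the generative Markov chain, in addition to the visible variables (usually denoted x).
A GSN is parametrized by two conditional probability distributions that specify one step of the Markov chain:
1. $p(x^{(k)} \mid h^{(k)})$ tells how to generate the next visible variable given the current latent state. Such a "reconstruction distribution" is also found in denoising autoencoders, RBMs, DBNs and DBMs.
2. $p(h^{(k)} \mid h^{(k-1)}, x^{(k-1)})$ tells how to update the latent state variable, given the previous latent state and visible variable.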
GSN
MCMC
17.3
20.11.3GSN
(Bengio et al., 2014)
20.12.1 Discriminant GSNs
GSN
y x
Larochelle and Bengio (2008) RBM
20.13 Other Generation Schemes
MCMC
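Sohl-Dickstein et al. (2015) developed a diffusion inversion training scheme for learning a generative model, based on nonequilibrium thermodynamics.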
MCMC
ABC
20.14 Evaluating Generative Models
A
B
A B
B
AIC
0.1 10
1
MNIST MNIST
MNIST
0.5 0 1 1
1
1
3
16.1
x
MNIST
MNIST
20.15 Conclusion
pmodel (x)
pmodel (h | x) x
h