Deep Learning: Chapter 20

20
1619
16
20.1

(Fahlman et al., 1983; Ackley et al.,
1985; Hinton et al., 1984; Hinton and Sejnowski, 1986)
d x {0, 1}d
16.2.4
P(x) = \frac{\exp(-E(x))}{Z}. (20.1)
20.
E(x) Z \sum_x P(x) = 1
E(x) = -x^\top U x - b^\top x, (20.2)
U b
n 20.1
MLP
MLP
MLP
(Le Roux
and Bengio, 2008)
x 2 v
h
E(v, h) = -v^\top R v - v^\top W h - h^\top S h - b^\top v - c^\top h. (20.3)

18
1 2
Pmodel (v) Pdata (v)Pmodel (h | v)
2
2 (Hebb, 1949)
fire together, wire together
(Giudice et al., 2009)
(Hinton, 2007a; Bengio, 2015) Bengio (2015)

18.2
20.2
harmonium (Smolensky, 1986)
16.7.1
RBM
RBM
RBM RBM
20.120.1a RBM
2
[Figure 20.1: (a) RBM with visible units v1 to v3 and hidden units h1 to h4. (b) DBN with visible units v1 to v3, a first hidden layer h^{(1)} of four units and a second hidden layer h^{(2)} of three units. (c) DBM with the same layers as (b).]
nv
v nh h
P(\mathbf{v} = v, \mathbf{h} = h) = \frac{1}{Z} \exp(-E(v, h)). (20.4)
RBM
E(v, h) = -b^\top v - c^\top h - v^\top W h, (20.5)
Z
Z = \sum_v \sum_h \exp\{-E(v, h)\}. (20.6)
Z
Z Z
Long and Servedio (2010)
Z Z
P (v)
20.2.1
P (v) RBM 2 P (h | v)
P (v | h)

P (h, v)
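P(h | v) = \frac{P(h, v)}{P(v)} (20.7)
= \frac{1}{P(v)} \frac{1}{Z} \exp\{ b^\top v + c^\top h + v^\top W h \} (20.8)
= \frac{1}{Z'} \exp\{ c^\top h + v^\top W h \} (20.9)
= \frac{1}{Z'} \exp\{ \sum_{j=1}^{n_h} c_j h_j + \sum_{j=1}^{n_h} v^\top W_{:,j} h_j \} (20.10)
= \frac{1}{Z'} \prod_{j=1}^{n_h} \exp\{ c_j h_j + v^\top W_{:,j} h_j \}. (20.11)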
v P (h | v)
P (h | v) h
hj
hj
P(h_j = 1 | v) = \frac{\tilde{P}(h_j = 1 | v)}{\tilde{P}(h_j = 0 | v) + \tilde{P}(h_j = 1 | v)} (20.12)
= \frac{\exp\{ c_j + v^\top W_{:,j} \}}{\exp\{0\} + \exp\{ c_j + v^\top W_{:,j} \}} (20.13)
= \sigma( c_j + v^\top W_{:,j} ). (20.14)
P(h | v) = \prod_{j=1}^{n_h} \sigma( (2h - 1) \odot (c + W^\top v) )_j. (20.15)
P (v | h)
P(v | h) = \prod_{i=1}^{n_v} \sigma( (2v - 1) \odot (b + W h) )_i. (20.16)
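The factorial conditionals in equations 20.15 and 20.16 are what make block Gibbs sampling cheap in an RBM: all hidden units can be sampled at once given v, and all visible units at once given h. The following is a minimal NumPy sketch under assumed toy sizes (three visible units, four hidden units, random small weights); the function names are illustrative and not taken from the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_v(v, W, c, rng):
    # P(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]) for every j simultaneously
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b, rng):
    # P(v_i = 1 | h) = sigmoid(b_i + W[i, :] h) for every i simultaneously
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_step(v, W, b, c, rng):
    # One block Gibbs step: resample all of h, then all of v
    h, _ = sample_h_given_v(v, W, c, rng)
    v_new, _ = sample_v_given_h(h, W, b, rng)
    return v_new, h

rng = np.random.default_rng(0)
n_v, n_h = 3, 4                      # toy sizes, chosen only for illustration
W = rng.normal(scale=0.5, size=(n_v, n_h))
b = rng.normal(scale=0.1, size=n_v)
c = rng.normal(scale=0.1, size=n_h)

v = rng.integers(0, 2, size=n_v).astype(float)
for _ in range(5):
    v, h = gibbs_step(v, W, b, c, rng)
```

Because the conditionals factorize, each half-step is a single matrix-vector product followed by elementwise sampling.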
20.2.2
RBM P (v)
MCMC 18
CDSMLPCD
P (h | v) RBM
20.3
deep belief network (DBN)
1 (Hinton et al., 2006; Hinton, 2007b) 2006
MNIST
(Hinton et al., 2006)
DBN 2
20.1b
l DBN l W^{(1)}, \ldots, W^{(l)} DBN
b^{(0)} b^{(0)}, \ldots, b^{(l)} l + 1
DBN
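P(h^{(l)}, h^{(l-1)}) \propto \exp( b^{(l)\top} h^{(l)} + b^{(l-1)\top} h^{(l-1)} + h^{(l-1)\top} W^{(l)} h^{(l)} ), (20.17)
P(h^{(k)}_i = 1 | h^{(k+1)}) = \sigma( b^{(k)}_i + W^{(k+1)\top}_{:,i} h^{(k+1)} ) \quad \forall i, \forall k \in 1, \ldots, l-2, (20.18)
P(v_i = 1 | h^{(1)}) = \sigma( b^{(0)}_i + W^{(1)\top}_{:,i} h^{(1)} ) \quad \forall i (20.19)
v \sim N( v; b^{(0)} + W^{(1)\top} h^{(1)}, \beta^{-1} ) (20.20)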

1 DBN RBM
DBN 2
2 RBM
1

RBM E_{v \sim p_{data}} \log p(v)
RBM DBN 1 2
RBM
E_{v \sim p_{data}} E_{h^{(1)} \sim p^{(1)}(h^{(1)} | v)} \log p^{(2)}(h^{(1)}) (20.21)
p^{(1)} 1 RBM p^{(2)} 2

RBM 1 RBM
2 RBM 1 RBM
RBM 1 RBM
DBN RBM DBN
DBN
(Hinton et al., 2006)
DBN
wake-sleep
DBN DBN
DBN
MLP
h^{(1)} = \sigma( b^{(1)} + v^\top W^{(1)} ). (20.22)
h^{(l)} = \sigma( b^{(l)}_i + h^{(l-1)\top} W^{(l)} ) \quad \forall l \in 2, \ldots, m, (20.23)
DBN MLP
MLP MLP
MLP 19
MLP
DBN MLP
MLP MLP
DBN MLP
DBN
DBN AIS (Salakhutdinov and Murray, 2008)
DBN (Dean and Kanazawa, 1989)
20.4
deep Boltzmann machine (DBM) (Salakhutdinov and Hinton, 2009a) 1 DBN
DBM RBM DBM
RBM 1 RBM
20.2
[Figure 20.2: deep Boltzmann machine with one visible layer (v1 to v3) and two hidden layers (h^{(1)} with four units, h^{(2)} with three units).]
(Srivastava et al., 2013)
RBM DBN DBM
DBM
E 1 v 3 h^{(1)} h^{(2)} h^{(3)}
P(v, h^{(1)}, h^{(2)}, h^{(3)}) = \frac{1}{Z(\theta)} \exp( -E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) ) (20.24)
DBM
E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)} - h^{(2)\top} W^{(3)} h^{(3)}. (20.25)
RBM 20.5 DBM
(W^{(2)} W^{(3)})

DBM RBM 20.3
DBM 2
[Figure 20.3: the layers of a deep Boltzmann machine (visible units v1, v2; hidden layers h^{(1)} and h^{(2)} with three units each; h^{(3)} with two units) rearranged into two groups.]
DBM 2 RBM
DBM
2
P(v_i = 1 | h^{(1)}) = \sigma( W^{(1)}_{i,:} h^{(1)} ), (20.26)
P(h^{(1)}_i = 1 | v, h^{(2)}) = \sigma( v^\top W^{(1)}_{:,i} + W^{(2)}_{i,:} h^{(2)} ), (20.27)
P(h^{(2)}_k = 1 | h^{(1)}) = \sigma( h^{(1)\top} W^{(2)}_{:,k} ). (20.28)
2
1 1
RBM 1
2 1
l DBM l + 1
2
2
2 DBM
1
1
20.4.1
DBM DBN DBN DBM P (h | v)

DBN
DBN
MLP
Q(h)
DBN
MLP
Q DBM
DBM
DBM
DBM
(Series et al., 2010; Reichert et al., 2011)
1 DBM DBN
2 MCMC
DBM
MCMC
20.4.2 DBM
DBM 2
DBM P(v | h^{(1)}) P(h^{(1)} | v, h^{(2)}) P(h^{(2)} | h^{(1)})
2 h^{(1)} h^{(2)} W^{(2)}

P(h^{(1)}, h^{(2)} | v)
DBN DBM
DBN DBM 19.4
DBM
Salakhutdinov and
Hinton (2009a)

2
Q(h^{(1)}, h^{(2)} | v) P(h^{(1)}, h^{(2)} | v)
Q(h^{(1)}, h^{(2)} | v) = \prod_j Q(h^{(1)}_j | v) \prod_k Q(h^{(2)}_k | v). (20.29)
P (h(1) , h(2) | v)
v Q
Q(h | v) P (h | v)
KL(Q \| P) = \sum_h Q(h^{(1)}, h^{(2)} | v) \log \left( \frac{Q(h^{(1)}, h^{(2)} | v)}{P(h^{(1)}, h^{(2)} | v)} \right). (20.30)
Q h(1)
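\hat{h}^{(1)}_j \in [0, 1], \forall j, \hat{h}^{(1)}_j = Q(h^{(1)}_j = 1 | v); \hat{h}^{(2)}_k \in [0, 1], \forall k, \hat{h}^{(2)}_k = Q(h^{(2)}_k = 1 | v)
Q(h^{(1)}, h^{(2)} | v) = \prod_j Q(h^{(1)}_j | v) \prod_k Q(h^{(2)}_k | v) (20.31)
= \prod_j (\hat{h}^{(1)}_j)^{h^{(1)}_j} (1 - \hat{h}^{(1)}_j)^{(1 - h^{(1)}_j)} \times \prod_k (\hat{h}^{(2)}_k)^{h^{(2)}_k} (1 - \hat{h}^{(2)}_k)^{(1 - h^{(2)}_k)}. (20.32)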
DBM
2
Q P
19.56
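\hat{h}^{(1)}_j = \sigma( \sum_i v_i W^{(1)}_{i,j} + \sum_{k'} W^{(2)}_{j,k'} \hat{h}^{(2)}_{k'} ), \quad \forall j (20.33)
\hat{h}^{(2)}_k = \sigma( \sum_{j'} W^{(2)}_{j',k} \hat{h}^{(1)}_{j'} ), \quad \forall k. (20.34)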
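The fixed-point equations 20.33 and 20.34 can be iterated directly: update the first-layer parameters from v and the current second-layer parameters, then update the second layer from the new first layer. The following is a minimal NumPy sketch for a two-hidden-layer DBM without bias terms (matching the equations); the layer sizes, the initial value 0.5, the stopping tolerance, and the function name are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W1, W2, n_iters=50, tol=1e-8):
    """Iterate the mean field updates until the parameters stop changing."""
    h1 = np.full(W1.shape[1], 0.5)   # \hat{h}^{(1)}, one entry per first-layer unit
    h2 = np.full(W2.shape[1], 0.5)   # \hat{h}^{(2)}, one entry per second-layer unit
    for _ in range(n_iters):
        h1_new = sigmoid(v @ W1 + W2 @ h2)    # eq. 20.33
        h2_new = sigmoid(h1_new @ W2)         # eq. 20.34
        delta = max(np.abs(h1_new - h1).max(), np.abs(h2_new - h2).max())
        h1, h2 = h1_new, h2_new
        if delta < tol:
            break
    return h1, h2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))   # visible (3) to first hidden layer (4)
W2 = rng.normal(scale=0.5, size=(4, 3))   # first (4) to second hidden layer (3)
v = np.array([1.0, 0.0, 1.0])
h1_hat, h2_hat = mean_field(v, W1, W2)
```

The returned vectors are the Bernoulli parameters of the factorial approximation Q, not samples.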
L(Q)
(1) (2)
20.33hj 20.34hk
MNIST 10
50 1
DBM
20.4.3 DBM
DBM 18
192
20.4.2 P (h | v) Q(h | v)
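\log P(v; \theta) L(v, Q, \theta)
2 L
L(Q, \theta) = \sum_i \sum_{j'} v_i W^{(1)}_{i,j'} \hat{h}^{(1)}_{j'} + \sum_{j'} \sum_{k'} \hat{h}^{(1)}_{j'} W^{(2)}_{j',k'} \hat{h}^{(2)}_{k'} - \log Z(\theta) + H(Q). (20.35)
\log Z(\theta)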
18DBM 18
18.2DBM
20.1 DBM
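Algorithm 20.1 Variational stochastic maximum likelihood algorithm for training a DBM with two hidden layers.

Set the step size \epsilon to a small positive number.
Set k, the number of Gibbs steps, high enough for a Markov chain of p(v, h^{(1)}, h^{(2)}; \theta + \epsilon \Delta_\theta) to burn in when started from samples of p(v, h^{(1)}, h^{(2)}; \theta).
Initialize three matrices \tilde{V}, \tilde{H}^{(1)} and \tilde{H}^{(2)}, each with m rows, to random values.
while not converged (learning loop) do
  Sample a minibatch of m training examples and arrange them as the rows of a design matrix V.
  Initialize the matrices \hat{H}^{(1)} and \hat{H}^{(2)}.
  while not converged (mean field inference loop) do
    \hat{H}^{(1)} \leftarrow \sigma( V W^{(1)} + \hat{H}^{(2)} W^{(2)\top} )
    \hat{H}^{(2)} \leftarrow \sigma( \hat{H}^{(1)} W^{(2)} )
  end while
  \Delta_{W^{(1)}} \leftarrow \frac{1}{m} V^\top \hat{H}^{(1)}
  \Delta_{W^{(2)}} \leftarrow \frac{1}{m} \hat{H}^{(1)\top} \hat{H}^{(2)}
  for l = 1 to k (Gibbs sampling) do
    Gibbs block 1:
      \forall i, j, sample \tilde{V}_{i,j} from P(\tilde{V}_{i,j} = 1) = \sigma( W^{(1)}_{j,:} ( \tilde{H}^{(1)}_{i,:} )^\top )
      \forall i, j, sample \tilde{H}^{(2)}_{i,j} from P(\tilde{H}^{(2)}_{i,j} = 1) = \sigma( \tilde{H}^{(1)}_{i,:} W^{(2)}_{:,j} )
    Gibbs block 2:
      \forall i, j, sample \tilde{H}^{(1)}_{i,j} from P(\tilde{H}^{(1)}_{i,j} = 1) = \sigma( \tilde{V}_{i,:} W^{(1)}_{:,j} + \tilde{H}^{(2)}_{i,:} W^{(2)\top}_{j,:} )
  end for
  \Delta_{W^{(1)}} \leftarrow \Delta_{W^{(1)}} - \frac{1}{m} \tilde{V}^\top \tilde{H}^{(1)}
  \Delta_{W^{(2)}} \leftarrow \Delta_{W^{(2)}} - \frac{1}{m} \tilde{H}^{(1)\top} \tilde{H}^{(2)}
  W^{(1)} \leftarrow W^{(1)} + \epsilon \Delta_{W^{(1)}}
  W^{(2)} \leftarrow W^{(2)} + \epsilon \Delta_{W^{(2)}}
end while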
20.4.4
DBM
DBM RBM
1 DBM RBM
20.4.5DBM
DBM RBM 1
RBM 1 RBM
RBM
RBM DBM DBM PCD
PCD
20.4

1
2
DBM DBN DBN

RBM DBN DBM RBM
DBM RBM
DBM
Salakhutdinov and Hinton (2009a)

RBM DBM RBM
2 2
2
RBM 2
Figure 20.4 (panels a, b, c, d): MNIST (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014) (a) CD RBM \log P(v)
(b)CD-k h(1) y 2 RBM
log P (h(1) , y) h(1) 1
RBM k 1 20 (c)2 RBM
DBM k = 5 DBM log P (v, y)
(d)y y
h(1) h(2) MLP MLP y
MLP
DBM MLP log P (y | v)
Goodfellow et al. (2013b)
SML
PCD (Salakhutdinov
and Hinton, 2009a)
1
DBM Goodfellow et al. (2013b)
20.4.5
DBM
MLP
1 RBM DBM
DBM
RBM CD DBM PCD
MLP
MLP
2 1
centered deep Boltzmann machine (Montavon and Müller, 2012)
MLP
2 multi-prediction deep
Boltzmann machine (Goodfellow et al., 2013b) MCMC
MCMC

U b x
20.2
E(x) = -x^\top U x - b^\top x. (20.36)
U RBM DBM
x
U

E'(x; U, b) = -(x - \mu)^\top U (x - \mu) - (x - \mu)^\top b. (20.37)

x 0
Melchior et al. (2013)

1
enhanced gradient (Cho et al., 2011)
MP-DBM (Goodfellow et al., 2013b)
20.5

(Stoyanov et al., 2011; Brakel et al., 2013) MP-DBM
Figure 20.5:
MP-DBM
Goodfellow et al. (2013b)
MP-DBM
p(v)
2 1
DBM MP-DBM DBM

DBM
DBM DBM
MP-DBM
SML
DBM MP-DBM
MP-DBM NADE NADE-k (Raiko et al., 2014)

20.10.10
MP-DBM
MP-DBM
MP-DBM
MP-DBM
MP-DBM
MP-DBM
20.5

[0,1]
Hinton (2000)
[0,1] 1
20.5.1 - RBM

(Welling et al., 2005)
RBM
- RBM 1
p(v | h) = N(v; W h, \beta^{-1}). (20.38)
\log N(v; W h, \beta^{-1}) = -\frac{1}{2} (v - W h)^\top \beta (v - W h) + f(\beta) (20.39)
f 1
f f
20.39 v
v p(v | h)
p(h | v) 20.39
\frac{1}{2} h^\top W^\top \beta W h. (20.40)
hi hj
hi hj
p(v | h)
20.39 hi
hi
\frac{1}{2} h_i \sum_j \beta_j W_{j,i}^2. (20.41)
h_i \in \{0, 1\} h_i^2 = h_i
hi
- RBM 1
E(v, h) = \frac{1}{2} v^\top (\beta \odot v) - (v \odot \beta)^\top W h - b^\top h (20.42)

- RBM 1
20.5.2
RBM
Ranzato et al. (2010a) RBM
RBM
the mean and covariance RBM (mcRBM)*1, the mean-product of t-distribution (mPoT) model, and the spike and slab RBM (ssRBM)
- RBM mcRBM
mcRBM
2
RBM 1 RBM
covariance RBM (cRBM) (Ranzato et al., 2010a)
h(m) h(c) mcRBM

2
E_{mc}(x, h^{(m)}, h^{(c)}) = E_m(x, h^{(m)}) + E_c(x, h^{(c)}), (20.43)
Em - RBM *2
E_m(x, h^{(m)}) = \frac{1}{2} x^\top x - \sum_j x^\top W_{:,j} h^{(m)}_j - \sum_j b^{(m)}_j h^{(m)}_j, (20.44)
*1 mcRBM----
mc
*2 - RBM
Ec cRBM
E_c(x, h^{(c)}) = \frac{1}{2} \sum_j h^{(c)}_j ( x^\top r^{(j)} )^2 - \sum_j b^{(c)}_j h^{(c)}_j (20.45)
(c)
r (j) hj b(c)
p_{mc}(x, h^{(m)}, h^{(c)}) = \frac{1}{Z} \exp\{ -E_{mc}(x, h^{(m)}, h^{(c)}) \}, (20.46)
h(m) h(c)

p_{mc}(x | h^{(m)}, h^{(c)}) = N( x; C^{mc}_{x|h} ( \sum_j W_{:,j} h^{(m)}_j ), C^{mc}_{x|h} ). (20.47)
C^{mc}_{x|h} = ( \sum_j h^{(c)}_j r^{(j)} r^{(j)\top} + I )^{-1}
W RBM
mcRBM
CD PCD xh(m) h(c)
RBM
mcRBM pmc (x | h(m) , h(c) )
(C mc )1
Ranzato and Hinton (2010)
pmc (x | h(m) , h(c) ) p(x)
mcRBM ()
(Neal, 1993)
mean-product of Student's t-distribution (mPoT) (Ranzato et al., 2010b) PoT (Welling et al., 2003a) mcRBM cRBM
RBM
mcRBM PoT
mcRBM
G(k, ) k mPoT
mPoT
E_{mPoT}(x, h^{(m)}, h^{(c)}) (20.48)
= E_m(x, h^{(m)}) + \sum_j ( h^{(c)}_j ( 1 + \frac{1}{2} ( r^{(j)\top} x )^2 ) + (1 - \gamma_j) \log h^{(c)}_j ) (20.49)
(c)
r (j) hj
Em (x, h(m) ) 20.44
mcRBM mPoT x
mPoT
mcRBM pmPoT (x | h(m) , h(c) )
Ranzato et al. (2010b)
p(x)
- -
spike and slab restricted Boltzmann machines (ssRBM) (Courville et al., 2011)
mcRBM ssRBM
mcRBM
mPoT ssRBM
- RBM 2 spike
h slab s
(h s)W W:,i
hi = 1 hi
si
W:,i
ssRBM
E_{ss}(x, s, h) = -\sum_i x^\top W_{:,i} s_i h_i + \frac{1}{2} x^\top ( \Lambda + \sum_i \Phi_i h_i ) x (20.50)
+ \frac{1}{2} \sum_i \alpha_i s_i^2 - \sum_i \alpha_i \mu_i s_i h_i - \sum_i b_i h_i + \sum_i \alpha_i \mu_i^2 h_i (20.51)
bi hi x
i > 0 si i
x h 2 i
si
ssRBM
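s h
p_{ss}(x | h) = \frac{1}{P(h)} \frac{1}{Z} \int \exp\{ -E(x, s, h) \} ds (20.52)
= N( x; C^{ss}_{x|h} \sum_i W_{:,i} \mu_i h_i, C^{ss}_{x|h} ) (20.53)
C^{ss}_{x|h} = ( \Lambda + \sum_i \Phi_i h_i - \sum_i \alpha_i^{-1} h_i W_{:,i} W_{:,i}^\top )^{-1}
C^{ss}_{x|h}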
h s
MAP
ssRBM mcRBM mPoT ssRBM

mcRBM mPoT
( \sum_j h^{(c)}_j r^{(j)} r^{(j)\top} + I )^{-1} h_j > 0
r (j) ssRBM
hi = 1
ssRBM
product of probabilistic principal components
analysis, PoPPCA (Williams and Agakov, 2002)
ssRBM
hi 1
mcRBM mPoT
ssRBM
16.1
ssRBM
(Courville et al., 2014)
S3C - (Goodfellow et al., 2013d)
20.6
9
Desjardins and Bengio (2008) RBM
p n
d
p = \max_i d_i
2^n 3 \times 3
2^9 = 512
!
Lee et al. (2009) probabilistic max pooling
1 1
n + 1 n
1
1 1
n + 1 1
n + 1
n + 1
Lee et al. (2009)

*3
Lee et al. (2009)
*3

2 2 2 2
1
50%
50%
3 3 3 3 1
1
20.7
x y y
y p(y | x)

x y
p(x^{(1)}, \ldots, x^{(\tau)})
p(x^{(t)} | x^{(1)}, \ldots, x^{(t-1)})
3D
Taylor et al. (2007) m

RBM p(x^{(t)} | x^{(t-1)}, \ldots, x^{(t-m)})
m x p(x^{(t)}) RBM
x^{(t-1)} x RBM
x RBM
RBM
x
RBM (Mnih et al., 2011)
RBM (Taylor and Hinton, 2009; Sutskever et al.,
2009; Boulanger-Lewandowski et al., 2012)

Boulanger-Lewandowski et al. (2012) RNN-RBM
RNN-RBM RBM
RNN x(t) RBM
1
RNN-RBM RBM
RNN RNN
RBM
RBM
RNN
20.8

log p(v)
log p(y | v) RBM

(Larochelle and Bengio, 2008)
RBM
MLP
2
2
vi Wi,j hj
(Sejnowski,
1987) 1 2 3
1
(Memisevic and Hinton, 2007, 2010)one-hot
(Nair
and Hinton, 2009) 2
1 v y
v (Luo et al., 2011)
Sohn et al. (2013)

3
20.9
x
x
1
z
z f (x, z)
f
2 y
y \sim N(\mu, \sigma^2). (20.54)
y
2 y
z \sim N(z; 0, 1)
y = \mu + \sigma z (20.55)
J(y)
\mu = f(x; \theta) \sigma = g(x; \theta)
J(y)
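Writing the sample as y = \mu + \sigma z (equation 20.55) moves the randomness into z, so derivatives of a cost J(y) with respect to \mu and \sigma follow from the ordinary chain rule. The following is a minimal NumPy sketch with an assumed toy cost J(y) = y^2, chosen because its expectation \mu^2 + \sigma^2 has known gradients 2\mu and 2\sigma; the values of \mu and \sigma are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                 # arbitrary illustrative parameters

z = rng.standard_normal(200_000)     # z ~ N(0, 1), independent of mu and sigma
y = mu + sigma * z                   # eq. 20.55: y ~ N(mu, sigma^2)

# Toy cost J(y) = y^2, so dJ/dy = 2y; dy/dmu = 1 and dy/dsigma = z.
dJ_dy = 2.0 * y
grad_mu = np.mean(dJ_dy * 1.0)       # Monte Carlo estimate of d E[J] / d mu
grad_sigma = np.mean(dJ_dy * z)      # Monte Carlo estimate of d E[J] / d sigma

# Analytic values for comparison: E[y^2] = mu^2 + sigma^2
exact_mu, exact_sigma = 2.0 * mu, 2.0 * sigma
```

The same pattern applies when \mu and \sigma are outputs of a network: the gradient flows through them while z is treated as a fixed input.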
p(y; \theta)
p(y | x; \theta) p(y | \omega)
x p(y | \omega)
y
y \sim p(y | \omega) (20.56)
y = f(z; \omega), (20.57)
z f
f
y z
z
reparametrization trick, stochastic back-propagation, perturbation analysis
f y
20.9.1 REINFORCE (Williams, 1992)
20 (Price,
1958; Bonnet, 1964) (Williams,
1992) (Opper and Archambeau, 2009)
(Bengio et al., 2013b; Kingma, 2013; Kingma and Welling,
2014b,a; Rezende et al., 2014; Goodfellow et al., 2014c)
20.9.1
y
x
z y
y = f(z; \omega) (20.58)
y f
J(y)
REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility)
(Williams, 1992) J(f(z; \omega))
E_{z \sim p(z)} J(f(z; \omega))
y
SGD
REINFORCE
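E_z[J(y)] = \sum_y J(y) p(y) (20.59)
\frac{\partial E[J(y)]}{\partial \omega} = \sum_y J(y) \frac{\partial p(y)}{\partial \omega} (20.60)
= \sum_y J(y) p(y) \frac{\partial \log p(y)}{\partial \omega} (20.61)
\approx \frac{1}{m} \sum_{y^{(i)} \sim p(y), i=1}^{m} J(y^{(i)}) \frac{\partial \log p(y^{(i)})}{\partial \omega}. (20.62)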
20.60J
20.61 \frac{\partial \log p(y)}{\partial \omega} = \frac{1}{p(y)} \frac{\partial p(y)}{\partial \omega}
20.62
p(y) p(y | x) p(y)

x
REINFORCE 1
y
1 SGD
variance reduction
(Wilson, 1984; L'Ecuyer, 1994)
REINFORCE J(y)
baseliney
b()
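E_{p(y)}[ \frac{\partial \log p(y)}{\partial \omega} ] = \sum_y p(y) \frac{\partial \log p(y)}{\partial \omega} (20.63)
= \sum_y \frac{\partial p(y)}{\partial \omega} (20.64)
= \frac{\partial}{\partial \omega} \sum_y p(y) = \frac{\partial}{\partial \omega} 1 = 0, (20.65)
E_{p(y)}[ (J(y) - b(\omega)) \frac{\partial \log p(y)}{\partial \omega} ] = E_{p(y)}[ J(y) \frac{\partial \log p(y)}{\partial \omega} ] - b(\omega) E_{p(y)}[ \frac{\partial \log p(y)}{\partial \omega} ] (20.66)
= E_{p(y)}[ J(y) \frac{\partial \log p(y)}{\partial \omega} ]. (20.67)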
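Equations 20.62 and 20.67 can be checked numerically: the score-function estimator is unbiased with or without a baseline, but a well-chosen baseline shrinks its variance. The following is a minimal NumPy sketch for an assumed toy setup, a single Bernoulli variable y with logit \omega and cost J(y) = 10 + y, for which the exact gradient is p(1 - p) with p = \sigma(\omega). The constant baseline E[J(y)] = 10 + p is one simple choice, not the optimal b^* of the text.

```python
import numpy as np

def reinforce_grad(omega, J, m, rng, baseline=0.0):
    """Score-function estimate of d E[J(y)] / d omega for y ~ Bernoulli(sigmoid(omega))."""
    p = 1.0 / (1.0 + np.exp(-omega))
    y = (rng.random(m) < p).astype(float)
    score = y - p                               # d log p(y) / d omega for a Bernoulli logit
    per_sample = (J(y) - baseline) * score      # one term of the sum in eq. 20.62
    return per_sample.mean(), per_sample.var()

def J(y):
    return 10.0 + y                             # toy cost with a large constant offset

rng = np.random.default_rng(0)
omega = 0.3
p = 1.0 / (1.0 + np.exp(-omega))
exact = p * (1.0 - p)                           # d/d omega of E[J] = 10 + p

g_plain, var_plain = reinforce_grad(omega, J, 200_000, rng)
g_base, var_base = reinforce_grad(omega, J, 200_000, rng, baseline=10.0 + p)
```

Both estimates target the same gradient; only the per-sample variance differs, which is the point of subtracting b(\omega).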

log p(y)
b() p(y) (J(y)b())
b() b ()i
i .
b^*(\omega)_i = \frac{ E_{p(y)}[ J(y) ( \frac{\partial \log p(y)}{\partial \omega_i} )^2 ] }{ E_{p(y)}[ ( \frac{\partial \log p(y)}{\partial \omega_i} )^2 ] }. (20.68)
,i
(J(y) - b(\omega)_i) \frac{\partial \log p(y)}{\partial \omega_i} (20.69)
b()i b ()i b

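E_{p(y)}[ J(y) ( \frac{\partial \log p(y)}{\partial \omega_i} )^2 ] E_{p(y)}[ ( \frac{\partial \log p(y)}{\partial \omega_i} )^2 ]
y p(y)
J(y) ( \frac{\partial \log p(y)}{\partial \omega_i} )^2 ( \frac{\partial \log p(y)}{\partial \omega_i} )^2 b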
20.68Mnih and Gregor (2014) b() Ep(y) [J(y)]
J(y) i
1
Dayan (1990)
(Sutton et al., 2000; Weaver and Tao, 2001)
REINFORCE Bengio et al. (2013b)
Mnih and Gregor (2014), Ba et al. (2014), Mnih et al. (2014), Xu et al. (2015)
b() Mnih and Gregor (2014)
(J(y) b())
Mnih and Gregor (2014)

variance normalization
REINFORCE y J(y)
y
20.10
16
RBM
2013
20.10.1
(Neal, 1990)
s

p(s_i) = \sigma( \sum_{j<i} W_{j,i} s_j + b_i ). (20.70)
(Sutskever and Hinton, 2008)
1
(Saul et al., 1996)
1
19.5
(Dayan et al., 1995; Dayan and Hinton, 1996)
(Gregor et al., 2014; Mnih and Gregor, 2014)
20.9.1
wake-sleep (Bornschein and Bengio, 2015)

(Bornschein et al., 2015)
20.10.7
20.10.2
generator network
g(z; \theta^{(g)}) z
x x

z
1
.
x = g(z) = \mu + Lz. (20.71)
L
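Equation 20.71 is the standard recipe for drawing from a multivariate normal: transform independent standard normal z by a matrix L and a shift \mu. The following is a minimal NumPy sketch assuming L is the Cholesky factor of the target covariance \Sigma, so that L L^\top = \Sigma; the particular \mu and \Sigma are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                      # illustrative mean
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                  # illustrative covariance (positive definite)

L = np.linalg.cholesky(Sigma)                   # lower-triangular, L @ L.T == Sigma
z = rng.standard_normal((50_000, 2))            # rows are independent z ~ N(0, I)
x = mu + z @ L.T                                # eq. 20.71 applied to every row: x = mu + L z

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x.T)
```

The sample mean and covariance of x should approach \mu and \Sigma as the number of draws grows.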

inverse transform sampling (Devroye, 2013) U(0, 1)
z x g(z)
F(x) = \int_{-\infty}^{x} p(v) dv p(x) x
p(x)
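Inverse transform sampling draws z from U(0, 1) and returns x = F^{-1}(z), where F is the cumulative distribution function of the target p(x). The following is a minimal NumPy sketch for an assumed example target, the exponential distribution with rate 2, whose CDF F(x) = 1 - \exp(-2x) has the closed-form inverse -\log(1 - z) / 2.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0                                   # illustrative rate of the exponential target

z = rng.random(100_000)                      # z ~ U(0, 1)
x = -np.log1p(-z) / rate                     # x = F^{-1}(z) with F(x) = 1 - exp(-rate * x)

empirical_mean = x.mean()                    # should be close to 1 / rate
empirical_cdf_at_03 = np.mean(x <= 0.3)      # should be close to F(0.3)
exact_cdf_at_03 = 1.0 - np.exp(-rate * 0.3)
```

The method needs F^{-1} in usable form, which is why it is mostly applied to scalar distributions.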
g z x ,
3.47 g
p_z(z) = p_x(g(z)) \left| \det( \frac{\partial g}{\partial z} ) \right| (20.72)
. x ,
p_x(x) = \frac{ p_z(g^{-1}(x)) }{ \left| \det( \frac{\partial g}{\partial z} ) \right| }, (20.73)
. g log p(x)
g
g x g x
p(x_i = 1 | z) = g(z)_i, (20.74)
. g p(x | z) z x
,
p(x_i = 1 | z) = g(z)_i. (20.75)
g p(x | z) z x
p(x) = E_z p(x | z), (20.76)

. pg (x) 20.9
pg
z x
x y
z z x
Dosovitskiy et al. (2015) z x ,
z
x
x
z
20.10.3
variational autoencoder (VAE) (Kingma, 2013; Rezende et al., 2014)
VAE pmodel (z) z

g(z)
x pmodel (x; g(z)) = pmodel (x | z)
q(z | x) z
pmodel (x | z)
x L(q)
L(q) = E_{z \sim q(z|x)} \log p_{model}(z, x) + H(q(z | x)) (20.77)
= E_{z \sim q(z|x)} \log p_{model}(x | z) - D_{KL}( q(z | x) \| p_{model}(z) ) (20.78)
\le \log p_{model}(x). (20.79)
20.771
EM
2
q
x
z 20.78
1
2 q(z | x) pmodel (z)
19.4
q
Ezq log pmodel (z, x)
q
z
q(z | x) = q(z; f(x; \theta)) z
L
. L
1
D_{KL}(p_{data} \| p_{model}) 3.6
p_{model}(x; g(z))
VAE D_{KL}(p_{data} \| p_{model})

Theis et al. (2015) Huszár (2015)
VAE ,
,z .
VAE
VAE
VAE 1 deep recurrent attention writer (DRAW) (Gregor et al., 2015) DRAW
DRAW
VAE VAE
RNN (Chung et al., 2015b)
RNN
RNN VAE
VAE
importance-weighted autoencoder (Burda et al., 2015)
L_k(x, q) = E_{z^{(1)}, \ldots, z^{(k)} \sim q(z|x)} \left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{ p_{model}(x, z^{(i)}) }{ q(z^{(i)} | x) } \right]. (20.80)
k = 1 L
q(z | x) z log pmodel (x)
log pmodel (x)
k
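The behavior of the importance-weighted objective in equation 20.80 can be illustrated on a model where \log p_{model}(x) is known exactly. The following is a minimal NumPy sketch assuming a toy model z \sim N(0, 1), x | z \sim N(z, 1), so that p_{model}(x) = N(x; 0, 2), with a deliberately poor proposal q(z | x) = N(0, 1). The helper names and the observation x = 1 are illustrative assumptions.

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def iwae_bound(x, k, n_outer, rng, q_mean=0.0, q_var=1.0):
    """Monte Carlo estimate of L_k(x, q) from eq. 20.80 for the toy Gaussian model."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal((n_outer, k))
    # log importance weights: log p(z) + log p(x | z) - log q(z | x)
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, q_mean, q_var)
    # numerically stable log of the mean weight over the k samples in each row
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    return log_mean_w.mean()

rng = np.random.default_rng(0)
x_obs = 1.0
L1 = iwae_bound(x_obs, k=1, n_outer=20_000, rng=rng)    # k = 1 recovers the usual bound L
L50 = iwae_bound(x_obs, k=50, n_outer=20_000, rng=rng)
log_px = log_normal(x_obs, 0.0, 2.0)                    # exact log p_model(x)
```

In this toy case the k = 1 value sits well below \log p_{model}(x), and the k = 50 value is much closer to it while remaining a lower bound in expectation.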
MP-DBM
(Goodfellow et al., 2013b; Stoyanov et al., 2011;
Brakel et al., 2013)
MP-DBM
1 x z
1
20.6
1 2
Figure 20.6: 2 (Kingma and Welling, 2014a) 2
2
2
z 2 z
p(x | z) x Frey 2 .
1 1
MNIST 2
20.10.4
GAN (Goodfellow et al., 2014c)

x = g(z; \theta^{(g)})
discriminator network
d(x; \theta^{(d)})
x
v(\theta^{(g)}, \theta^{(d)})
-v(\theta^{(g)}, \theta^{(d)})
g^* = \arg\min_g \max_d v(g, d). (20.81)
v ,
v(\theta^{(g)}, \theta^{(d)}) = E_{x \sim p_{data}} \log d(x) + E_{x \sim p_{model}} \log (1 - d(x)). (20.82)
1
2
GAN
(g) maxd v(g, d)
g d maxd v(g, d)
GAN Goodfellow (2014) GAN
GAN 2
v(a, b) = ab 1 a ab 1
b ab
a b
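The value function v(a, b) = ab shows why simultaneous gradient steps on a minimax game need not reach the equilibrium at the origin. The following is a minimal NumPy sketch assuming one player lowers v by adjusting a while the other raises v by adjusting b, both with the same fixed learning rate (the starting point and step size are arbitrary). With discrete steps of this kind the distance from the origin grows by a factor of \sqrt{1 + \text{lr}^2} per step instead of shrinking.

```python
import numpy as np

def simultaneous_gradient_steps(a, b, lr, n_steps):
    """Simultaneous gradient descent (on a) and ascent (on b) for v(a, b) = a * b."""
    radii = [np.hypot(a, b)]
    for _ in range(n_steps):
        grad_a, grad_b = b, a                          # dv/da = b, dv/db = a
        a, b = a - lr * grad_a, b + lr * grad_b        # both players move at the same time
        radii.append(np.hypot(a, b))
    return a, b, np.array(radii)

lr = 0.1
a_final, b_final, radii = simultaneous_gradient_steps(1.0, 1.0, lr=lr, n_steps=100)
growth_per_step = radii[1:] / radii[:-1]               # constant ratio sqrt(1 + lr^2)
```

The pair (a, b) rotates around the origin while slowly spiraling outward, so neither player's cost settles.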
v
1
2
v
v
GAN
Goodfellow (2014)
GAN
GAN
Goodfellow et al. (2014c)
GAN GAN
Radford et al. (2015)
GAN (DCGAN) 15.9
DCGAN
20.7
GAN
p(x) p(x | y)
GAN (Mirza and Osindero, 2014) Denton et al. (2015)
GAN
LAPGAN
LAPGAN
40%
LAPGAN 20.7
GAN
Figure 20.7: LSUN GAN DCGAN

Radford et al. (2015) LAPGAN
Denton et al. (2015)
GAN
self-supervised boosting (Welling et al., 2002)

RBM
20.10.5
generative moment matching network

(Li et al., 2015; Dziugaite et al., 2015)
VAE GAN
VAE GAN
moment matching
moment
1 2
E_x \prod_i x_i^{n_i} (20.83)
n = [n_1, n_2, \ldots, n_d]^\top
xi xj
x 2
1 2
GAN
maximum mean discrepancy (MMD) (Schölkopf and Smola, 2002; Gretton et al., 2012)
1
2 MMD
GAN
1
1
MMD
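A maximum mean discrepancy estimate compares two sets of samples through kernel averages, without naming individual moments. The following is a minimal NumPy sketch of the simple biased estimator of squared MMD with a Gaussian kernel; the bandwidth, sample sizes, and the shifted test distribution are illustrative assumptions, and the function names are not from the text.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Pairwise k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD: mean k(x, x') + mean k(y, y') - 2 mean k(x, y)."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.standard_normal((300, 2))               # samples standing in for the data
y_same = rng.standard_normal((300, 2))          # samples from the same distribution
y_shift = rng.standard_normal((300, 2)) + 2.0   # samples from a shifted distribution

mmd_same = mmd2_biased(x, y_same)
mmd_shift = mmd2_biased(x, y_shift)
```

In a moment matching network this quantity would be computed between generator samples and training data and minimized by back-propagation; here it is only evaluated.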
,
1
GAN MMD
20.10.6

Goodfellow et al. (2014c) Dosovitskiy et al. (2015)
9.5
Dosovitskiy et al. (2015)

.
i i k k
20.10.7

P(x_d | x_{d-1}, \ldots, x_1)
fully-visible Bayes network (FVBN)
(Frey, 1998)
(Bengio and Bengio, 2000b; Larochelle
and Murray, 2011) 20.10.10 NADE (Larochelle and Murray,
2011)
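[Figure 20.8: fully-visible Bayes network (FVBN) over x_1, ..., x_4 with conditionals P(x_1), P(x_2 | x_1), P(x_3 | x_1, x_2) and P(x_4 | x_1, x_2, x_3); the i-th variable is predicted from the i - 1 previous ones.]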
20.10.8

P(x_i | x_{i-1}, \ldots, x_1)
Frey (1998) d
O(d^2) 20.8
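[Figure 20.9: neural auto-regressive network over x_1, ..., x_4 with hidden groups h_1, h_2, h_3 and outputs P(x_1), P(x_2 | x_1), P(x_3 | x_1, x_2) and P(x_4 | x_1, x_2, x_3); each group h_i is a function of x_1, ..., x_i and is reused to predict x_{i+1}, x_{i+2}, ..., x_d.]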
20.10.9
(Bengio and Bengio, 2000a,b)

20.8
20.8
1 1
2
1. (i - 1) \times k k k one-hot
P(x_i | x_{i-1}, \ldots, x_1)

2. x_i 20.9
1
x_i x_{i+k} (k > 0)

i x_1, \ldots, x_i

P(x_i | x_{i-1}, \ldots, x_1) 6.2.1.1 x_i
20.10.10 NADE
neural auto-regressive density estimator (NADE) (Larochelle and Murray, 2011) Bengio and Bengio (2000b)
NADE 20.10
j .
i x_i j k h^{(j)}_k (j \ge i)
W'_{j,k,i}:
W'_{j,k,i} = W_{k,i}. (20.84)
j < i
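[Figure 20.10: NADE over x_1, ..., x_4 with hidden groups h_1, h_2, h_3 and outputs P(x_1), P(x_2 | x_1), P(x_3 | x_1, x_2) and P(x_4 | x_1, x_2, x_3). The hidden units are organized in groups h^{(j)} so that only x_1, ..., x_i take part in computing h^{(i)} and in predicting P(x_j | x_{j-1}, ..., x_1) for j > i. The weights W'_{j,k,i} = W_{k,i} from x_i to unit k of every group j \ge i are shared; the vector (W_{1,i}, W_{2,i}, ..., W_{n,i}) is written W_{:,i}.]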
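The weight sharing of equation 20.84 means the hidden pre-activation for one conditional can be obtained from the previous one by adding a single column W_{:,i} x_i, rather than recomputing it from all earlier inputs. The following is a minimal NumPy sketch of a NADE-style log-likelihood for binary inputs under that scheme. The output weights V, the biases b and c, the sigmoid nonlinearity, and the toy sizes are assumptions made for illustration and are not specified in the surrounding text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_log_prob(x, W, V, b, c):
    """log p(x) = sum_i log P(x_i | x_1, ..., x_{i-1}) for a binary vector x."""
    a = c.copy()                                # hidden pre-activation before any input is seen
    log_p = 0.0
    for i in range(x.shape[0]):
        h = sigmoid(a)                          # hidden group that depends only on x_1..x_{i-1}
        p_i = sigmoid(b[i] + V[i] @ h)          # P(x_i = 1 | x_1, ..., x_{i-1})
        log_p += x[i] * np.log(p_i) + (1.0 - x[i]) * np.log(1.0 - p_i)
        a += W[:, i] * x[i]                     # shared column W[:, i]: O(h) update per input
    return log_p

rng = np.random.default_rng(0)
n, n_hidden = 4, 3                              # toy sizes
W = rng.normal(scale=0.5, size=(n_hidden, n))   # shared input-to-hidden weights W_{k,i}
V = rng.normal(scale=0.5, size=(n, n_hidden))   # hypothetical hidden-to-output weights
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=n_hidden)

x = np.array([1.0, 0.0, 1.0, 1.0])
log_p_x = nade_log_prob(x, W, V, b, c)
```

Because each conditional is a proper Bernoulli distribution, the probabilities of all 2^n binary vectors sum to one, which gives a quick sanity check on an implementation.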
Larochelle and Murray (2011) NADE

RBM
1 NADE NADE
RBM
NADE
1 k
NADE-k (Raiko et al., 2014)

3.9.6
i i
i i2 RNADE
NADE (Uria et al., 2013)
i i2
Uria et al. (2013)

(Murray and Larochelle, 2014)
n
n! o p(x | o) o
:
p_{ensemble}(x) = \frac{1}{k} \sum_{i=1}^{k} p(x | o^{(i)}). (20.85)

(Bengio and Bengio, 2000b)
1 NADE O(nh)
Bengio and Bengio (2000b) O(n^2 h) h
20.10 20.9 h_i
l l + 1 h
n O(n^2 h^2)
Murray and Larochelle (2014) l + 1 i l i
O(n h^2)
NADE h
20.11
14
MCMC
(Rifai et al., 2012; Mesnil et al., 2012)

1
20.11.1

Bengio et al. (2013c)
generalized denoising autoencoders
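[Figure 20.11 diagram: x, C(\tilde{x} | x), \tilde{x}, f, h, g, \omega, p(x | \omega), \hat{x}]
Figure 20.11:
(a) x C \tilde{x} (b) f
h = f(\tilde{x}) (c) g
\omega (d) p(x | \omega = g(f(\tilde{x})))
E[x | \tilde{x}] g(h) = \hat{x}
\hat{x} p(x | \omega) 2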
(Vincent, 2011)
C p f g
(Bengio et al., 2014)
20.11
:
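1. Starting from the previous state x, inject corruption noise by sampling \tilde{x} from C(\tilde{x} | x).
2. Encode \tilde{x} into h = f(\tilde{x}).
3. Decode h to obtain the parameters \omega = g(h) of p(x | \omega = g(h)) = p(x | \tilde{x}).
4. Sample the next state x from p(x | \omega = g(h)) = p(x | \tilde{x}).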
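The four steps of the chain can be sketched on a case where the ideal denoiser is available in closed form. The following is a minimal NumPy sketch assuming one-dimensional data distributed as N(0, 1) and Gaussian corruption with standard deviation noise_std. Under those assumptions E[x | \tilde{x}] = \tilde{x} / (1 + noise_std^2) and the reconstruction distribution is Gaussian with variance noise_std^2 / (1 + noise_std^2). The identity encoder and the function name are illustrative choices, not a trained autoencoder.

```python
import numpy as np

def dae_markov_chain(n_steps, noise_std, rng, x0=0.0):
    """Run the corrupt / encode / decode / sample chain for a toy 1-D Gaussian case."""
    shrink = 1.0 / (1.0 + noise_std ** 2)           # E[x | x_tilde] = shrink * x_tilde
    recon_std = np.sqrt(noise_std ** 2 * shrink)    # std of p(x | x_tilde)
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_tilde = x + noise_std * rng.standard_normal()   # step 1: x_tilde ~ C(x_tilde | x)
        h = x_tilde                                       # step 2: h = f(x_tilde), identity here
        omega = shrink * h                                # step 3: omega = g(h), the conditional mean
        x = omega + recon_std * rng.standard_normal()     # step 4: x ~ p(x | omega)
        samples[t] = x
    return samples

rng = np.random.default_rng(0)
samples = dae_markov_chain(20_000, noise_std=1.0, rng=rng)
kept = samples[1000:]                                     # discard an initial burn-in segment
```

With an exact conditional p(x | \tilde{x}), the stationary distribution of this chain is the data distribution, so the retained samples should have mean near 0 and variance near 1.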
Bengio et al. (2014) p(x | \tilde{x})

x
20.11.2
GSN
p(xf | xo )
xf xf
xo MP-DBM
GSN MP-DBM
Alain et al. (2015) Bengio et al. (2014) 1
detailed balance
20.12
Figure 20.12:
MNIST
GSN
20.11.3
Bengio et al. (2013c)

1 -
-
18.2
k 1
20.12
generative stochastic network (GSN) (Bengio et al., 2014) x
h
GSN 1 2
:
1. p(x^{(k)} | h^{(k)})
RBM, DBN, DBM
2. p(h^{(k)} | h^{(k-1)}, x^{(k-1)})

GSN
MCMC
17.3
GSN Bengio et al. (2014)

1
x^{(0)} = x
x x^{(0)} = x
\log p(x^{(k)} = x | h^{(k)}) h^{(k)}
\log p(x^{(k)} = x | h^{(k)})
Bengio et al. (2014) 20.9
20.11.3GSN
20.12.1 GSN
GSN (Bengio et al., 2014) x

p(x) p(y | x)
Zhou and Troyanskaya (2014)

GSN
2 1
GSN
Zöhrer and Pernkopf (2014)

GSN
y x
Larochelle and Bengio (2008) RBM
20.13
MCMC

Sohl-Dickstein et al. (2015)
diffusion inversion
MCMC
Sohl-Dickstein et al. (2015)

20.11.1
approximate Bayesian computation (ABC) (Rubin et al., 1984)
Bachman and Precup (2015) ABC GSN MCMC
ABC
20.14

A
B
A B

B
log p(x) log Z

AIC log Z AIS
Z log p(x)
AIC
0.1 10
1
MNIST MNIST
MNIST
0.5 0 1 1
1
1
3
(Denton et al., 2015)
16.1
x
MNIST
MNIST
Theis et al. (2015)
3.6 D_{KL}(p_{data} \| p_{model})

D_{KL}(p_{model} \| p_{data})
20.15

pmodel (x)
pmodel (h | x) x
h