
Chapter 20

Deep Generative Models

In this chapter, we present several of the specific kinds of generative models that can be built and trained using the techniques presented in chapters 16–19. All these models represent probability distributions over multiple variables in some way.

20.1 Boltzmann Machines

Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors (Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986).

We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}^d. The Boltzmann machine is an energy-based model (section 16.2.4), meaning we define the joint probability distribution using an energy function:

    P(x) = exp(−E(x)) / Z,    (20.1)
where E(x) is the energy function and Z is the partition function that ensures that Σ_x P(x) = 1. The energy function of the Boltzmann machine is given by

    E(x) = −xᵀU x − bᵀx,    (20.2)

where U is the weight matrix of model parameters and b is the vector of bias parameters.
The Boltzmann machine becomes more powerful when not all the variables are observed. In this case, the latent variables can act similarly to hidden units in a multi-layer perceptron (MLP) and model higher-order interactions among the visible units. Just as the addition of hidden units to convert logistic regression into an MLP makes the MLP a universal approximator of functions, a Boltzmann machine with hidden units is a universal approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008).

Formally, we decompose the units x into two subsets: the visible units v and the latent (or hidden) units h. The energy function becomes

    E(v, h) = −vᵀR v − vᵀW h − hᵀS h − bᵀv − cᵀh.    (20.3)


Boltzmann machine learning algorithms are usually based on maximum likelihood. All Boltzmann machines have an intractable partition function, so the maximum likelihood gradient must be approximated using the techniques described in chapter 18.

One interesting property of Boltzmann machines when trained with learning rules based on maximum likelihood is that the update for a particular weight connecting two units depends only on the statistics of those two units, collected under two different distributions: P_model(v) and P̂_data(v) P_model(h | v).


The rest of the network does not participate in updating the weight between two units. This means that the learning rule is "local," which makes Boltzmann machine learning somewhat biologically plausible. If each neuron were a random variable in a Boltzmann machine, then the axons and dendrites connecting two random variables could learn only by observing the firing pattern of the cells they actually touch. In particular, in the positive phase, two units that frequently activate together have their connection strengthened. This is an example of a Hebbian learning rule (Hebb, 1949), often summarized with the phrase "fire together, wire together." Hebbian learning rules are among the oldest hypothesized explanations for learning in biological systems and remain relevant today (Giudice et al., 2009).

Other learning algorithms that use more information than local statistics seem to require the brain to maintain a secondary communication network (Hinton, 2007a; Bengio, 2015). Bengio (2015) proposes a biologically plausible implementation of back-propagation along these lines.

The negative phase of Boltzmann machine learning also admits some possible biological interpretations, discussed in section 18.2.

20.2 Restricted Boltzmann Machines

Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann machines are among the most common building blocks of deep probabilistic models. We briefly described RBMs in section 16.7.1; here we review that information and go into more detail. RBMs are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables. RBMs may be stacked, one on top of the other, to form deeper models. See figure 20.1. In particular, figure 20.1a shows the graph structure of the RBM itself: a bipartite graph, with no connections permitted between any variables within the observed layer or between any units within the latent layer.
[Figure 20.1: Examples of models that may be built with restricted Boltzmann machines. (a) The restricted Boltzmann machine itself is an undirected graphical model based on a bipartite graph, with visible units v_i in one layer and hidden units h_j in the other; there are no connections within a layer. (b) A deep belief network (DBN) is a hybrid graphical model involving both directed and undirected connections. Like an RBM, it has no intralayer connections; its top two layers form an RBM, while the connections between all its other layers are directed. (c) A deep Boltzmann machine (DBM) is an undirected graphical model with several layers of latent variables. Like RBMs and DBNs, DBMs lack intralayer connections; unlike DBNs, DBMs are entirely undirected.]


To make this a formal model, let the visible layer consist of n_v binary random variables, referred to collectively as the vector v, and let the latent (hidden) layer consist of n_h binary random variables, h. Like the general Boltzmann machine, the restricted Boltzmann machine is an energy-based model with the joint probability distribution specified by its energy function:

    P(v = v, h = h) = (1/Z) exp(−E(v, h)).    (20.4)

The energy function for an RBM is given by

    E(v, h) = −bᵀv − cᵀh − vᵀW h,    (20.5)

and Z is the normalizing constant known as the partition function:

    Z = Σ_v Σ_h exp{−E(v, h)}.    (20.6)

It is apparent from the definition of the partition function Z that the naive method of computing Z, exhaustively summing over all states, could be computationally intractable, unless a cleverly designed algorithm could exploit regularities in the probability distribution to compute Z faster. In the case of restricted Boltzmann machines, Long and Servedio (2010) formally proved that the partition function Z is intractable. The intractable partition function Z implies that the normalized joint probability distribution P(v) is also intractable to evaluate.

20.2.1 Conditional Distributions

Though P(v) is intractable, the bipartite graph structure of the RBM has the special property that its conditional distributions P(h | v) and P(v | h) are factorial and relatively simple to compute and to sample from.

Deriving the conditional distributions from the joint distribution is straightforward:

    P(h | v) = P(h, v) / P(v)    (20.7)
             = (1/P(v)) (1/Z) exp{ bᵀv + cᵀh + vᵀW h }    (20.8)
             = (1/Z′) exp{ cᵀh + vᵀW h }    (20.9)
             = (1/Z′) exp{ Σ_{j=1}^{n_h} c_j h_j + Σ_{j=1}^{n_h} vᵀW_{:,j} h_j }    (20.10)
             = (1/Z′) Π_{j=1}^{n_h} exp{ c_j h_j + vᵀW_{:,j} h_j }.    (20.11)

Since we are conditioning on the visible units v, we can treat them as constant with respect to the distribution P(h | v). The factorial nature of the conditional P(h | v) follows immediately from our ability to write the joint probability over the vector h as the product of (unnormalized) distributions over the individual elements h_j. It is now a simple matter of normalizing the distributions over the individual binary h_j:

    P(h_j = 1 | v) = P̃(h_j = 1 | v) / ( P̃(h_j = 0 | v) + P̃(h_j = 1 | v) )    (20.12)
                   = exp{ c_j + vᵀW_{:,j} } / ( exp{0} + exp{ c_j + vᵀW_{:,j} } )    (20.13)
                   = σ( c_j + vᵀW_{:,j} ).    (20.14)

We can now express the full conditional over the hidden layer as the factorial distribution

    P(h | v) = Π_{j=1}^{n_h} σ( (2h − 1) ⊙ (c + Wᵀv) )_j.    (20.15)
A similar derivation shows that the other conditional of interest, P(v | h), is also a factorial distribution:

    P(v | h) = Π_{i=1}^{n_v} σ( (2v − 1) ⊙ (b + W h) )_i.    (20.16)
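Because both conditionals factorize, each layer can be sampled jointly given the other layer in a single vectorized operation, and block Gibbs sampling simply alternates the two. A minimal NumPy sketch of equations 20.14 and 20.16 follows (the function and variable names are illustrative, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c, rng):
    # P(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j})   (eq. 20.14)
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(np.float64), p

def sample_v_given_h(h, W, b, rng):
    # P(v_i = 1 | h) = sigmoid(b_i + W_{i,:} h)     (eq. 20.16)
    p = sigmoid(b + h @ W.T)
    return (rng.random(p.shape) < p).astype(np.float64), p

def block_gibbs(v0, W, b, c, n_steps, rng):
    # Alternate between the two layers; each layer is sampled as one
    # block because its units are conditionally independent.
    v = v0
    for _ in range(n_steps):
        h, _ = sample_h_given_v(v, W, c, rng)
        v, _ = sample_v_given_h(h, W, b, rng)
    return v
```

Each Gibbs step costs only two matrix-vector products, which is what makes RBM training with sampled negative phases practical.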

20.2.2 Training Restricted Boltzmann Machines

Because the RBM admits efficient evaluation and differentiation of P̃(v), and efficient MCMC sampling in the form of block Gibbs sampling, it can readily be trained with any of the techniques described in chapter 18 for training models with intractable partition functions. This includes CD, SML (PCD), ratio matching, and so on. Compared with other undirected models used in deep learning, the RBM lends itself to training relatively easily, because we can compute P(h | v) exactly in closed form.

20.3 Deep Belief Networks

Deep belief networks (DBNs) were one of the first non-convolutional models to successfully admit training of deep architectures (Hinton et al., 2006; Hinton, 2007b). The introduction of deep belief networks in 2006 began the current deep learning renaissance. Before their introduction, deep models were considered too difficult to optimize. Deep belief networks demonstrated that deep architectures can be successful, by outperforming kernelized support vector machines on the MNIST dataset (Hinton et al., 2006). Today, deep belief networks have mostly fallen out of favor and are rarely used, but they are still studied for the important role they played in deep learning history.

Deep belief networks are generative models with several layers of latent variables. The latent variables are typically binary, while the visible units may be binary or real. There are no intralayer connections. Usually, every unit in each layer is connected to every unit in each neighboring layer. The connections between the top two layers are undirected, while the connections between all other layers are directed, with the arrows pointed toward the layer that is closest to the data. See figure 20.1b for an example.

A DBN with l hidden layers contains l weight matrices, W^(1), ..., W^(l). It also contains l + 1 bias vectors, b^(0), ..., b^(l), with b^(0) providing the biases for the visible layer. The probability distribution represented by the DBN is given by
    P(h^(l), h^(l−1)) ∝ exp( b^(l)ᵀ h^(l) + b^(l−1)ᵀ h^(l−1) + h^(l−1)ᵀ W^(l) h^(l) ),    (20.17)

    P(h_i^(k) = 1 | h^(k+1)) = σ( b_i^(k) + W_{:,i}^(k+1)ᵀ h^(k+1) )  ∀i, ∀k ∈ 1, ..., l − 2,    (20.18)

    P(v_i = 1 | h^(1)) = σ( b_i^(0) + W_{:,i}^(1)ᵀ h^(1) )  ∀i.    (20.19)

In the case of real-valued visible units, substitute

    v ∼ N( v; b^(0) + W^(1)ᵀ h^(1), β^{−1} ),    (20.20)

with β diagonal for tractability.

A DBN with only one hidden layer is just an RBM.

To generate a sample from a DBN, we first run several steps of Gibbs sampling on the top two hidden layers. This stage is essentially drawing a sample from the RBM defined by the top two hidden layers. We can then use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units.



To train a deep belief network, we begin by training an RBM to maximize E_{v∼p_data} log p(v) using contrastive divergence or stochastic maximum likelihood. The parameters of the RBM then define the parameters of the first layer of the DBN. Next, a second RBM is trained to approximately maximize

    E_{v∼p_data} E_{h^(1)∼p^(1)(h^(1)|v)} log p^(2)(h^(1)),    (20.21)

where p^(1) is the probability distribution represented by the first RBM, and p^(2) is the distribution represented by the second RBM.


In other words, the second RBM is trained to model the distribution defined by sampling the hidden units of the first RBM, when the first RBM is driven by the data. This procedure can be repeated indefinitely, to add as many layers to the DBN as desired, with each new RBM modeling the samples of the previous one. Each RBM defines another layer of the DBN. This procedure can be justified as increasing a variational lower bound on the log-likelihood of the data under the DBN (Hinton et al., 2006).

In most applications, no effort is made to jointly train the DBN after the greedy layer-wise procedure is complete. However, it is possible to perform generative fine-tuning using the wake-sleep algorithm.

The trained DBN may be used directly as a generative model, but most of the interest in DBNs arose from their ability to improve classification models. We can take the weights from the DBN and use them to define an MLP:

    h^(1) = σ( b^(1) + vᵀ W^(1) ),    (20.22)

    h^(l) = σ( b^(l) + h^(l−1)ᵀ W^(l) )  ∀l ∈ 2, ..., m.    (20.23)

After initializing this MLP with the weights and biases learned via generative training of the DBN, we may train the MLP to perform a classification task. This additional training of the MLP is an example of discriminative fine-tuning.

This specific choice of MLP is somewhat arbitrary, compared with many of the inference equations in chapter 19 that are derived from first principles. This MLP is a heuristic choice that works well in practice and is used commonly in the literature. Many approximate inference techniques are motivated by their ability to find a maximally tight variational lower bound on the log-likelihood under some set of constraints. One can construct a variational lower bound on the log-likelihood using the hidden unit expectations defined by the DBN's MLP, but this is true of any probability distribution over the hidden units, and there is no reason to believe that this particular MLP provides an especially tight bound.

To evaluate or maximize the log-likelihood of a trained DBN, annealed importance sampling (AIS) may be used to estimate the intractable quantities involved (Salakhutdinov and Murray, 2008).

The term "deep belief network" is commonly used incorrectly to refer to any kind of deep neural network, even networks without latent variable semantics. The term should refer specifically to models with undirected connections in the deepest layer and directed connections pointing downward between all other pairs of consecutive layers. Deep belief networks also share the acronym DBN with dynamic Bayesian networks (Dean and Kanazawa, 1989), which are Bayesian networks for representing Markov chains; the two should not be confused.

20.4 Deep Boltzmann Machines

A deep Boltzmann machine, or DBM (Salakhutdinov and Hinton, 2009a), is another kind of deep generative model. Unlike the deep belief network (DBN), it is an entirely undirected model. Unlike the RBM, the DBM has several layers of latent variables (RBMs have just one). But like the RBM, within each layer, the variables are mutually independent, conditioned on the variables in the neighboring layers. See figure 20.2.


[Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer and two hidden layers. Connections exist only between units in neighboring layers. There are no intralayer connections.]

Deep Boltzmann machines have been applied to a variety of tasks, including document modeling (Srivastava et al., 2013).

Like RBMs and DBNs, DBMs typically contain only binary units, as we assume for simplicity of our presentation of the model, but it is straightforward to include real-valued visible units.

A DBM is an energy-based model, meaning that the joint probability distribution over the model variables is parametrized by an energy function E. In the case of a deep Boltzmann machine with one visible layer, v, and three hidden layers, h^(1), h^(2), and h^(3), the joint probability is given by

    P(v, h^(1), h^(2), h^(3)) = (1/Z(θ)) exp( −E(v, h^(1), h^(2), h^(3); θ) ).    (20.24)

To simplify our presentation, we omit the bias parameters below. The DBM energy function is then defined as follows:

    E(v, h^(1), h^(2), h^(3); θ) = −vᵀW^(1)h^(1) − h^(1)ᵀW^(2)h^(2) − h^(2)ᵀW^(3)h^(3).    (20.25)

In comparison with the RBM energy function (eq. 20.5), the DBM energy function includes connections between the hidden units (latent variables) in the form of the weight matrices (W^(2) and W^(3)). As we will see, these connections have significant consequences for the model behavior as well as for how we go about performing inference in the model.

In comparison with fully connected Boltzmann machines (with every unit connected to every other unit), the DBM offers some advantages that are similar to those offered by the RBM. Specifically, as illustrated in figure 20.3, the DBM layers can be organized into a bipartite graph, with odd layers on one side and even layers on the other.


[Figure 20.3: A deep Boltzmann machine, rearranged to reveal its bipartite graph structure: the odd-numbered hidden layers form one side of the graph, while the even-numbered layers together with the visible layer form the other.]

This bipartite structure immediately implies that when we condition on the variables in the even layers, the variables in the odd layers become conditionally independent, and vice versa. In the case of a DBM with two hidden layers, the activation probabilities are given by

    P(v_i = 1 | h^(1)) = σ( W_{i,:}^(1) h^(1) ),    (20.26)

    P(h_i^(1) = 1 | v, h^(2)) = σ( vᵀW_{:,i}^(1) + W_{i,:}^(2) h^(2) ),    (20.27)

    P(h_k^(2) = 1 | h^(1)) = σ( h^(1)ᵀ W_{:,k}^(2) ).    (20.28)

The bipartite structure makes Gibbs sampling in a deep Boltzmann machine efficient. The naive approach to Gibbs sampling is to update only one variable at a time. RBMs allow all the visible units to be updated in one block and all the hidden units to be updated in a second block.


One might naively assume that a DBM with l layers requires l + 1 updates, with each iteration updating a block consisting of one layer of units. Instead, it is possible to update all the units in only two iterations. Gibbs sampling can be divided into two blocks of updates, one including all the even layers (including the visible layer) and the other including all the odd layers. Because of the bipartite DBM connection pattern, given the even layers, the distribution over the odd layers is factorial and thus can be sampled simultaneously and independently as a block. Likewise, given the odd layers, the even layers can be sampled simultaneously and independently as a block. Efficient sampling is especially important for training with stochastic maximum likelihood.

20.4.1 Interesting Properties

Deep Boltzmann machines have many interesting properties.

DBMs were developed after DBNs. Compared with DBNs, the posterior distribution P(h | v) is simpler for DBMs. Somewhat counterintuitively, the simplicity of this posterior distribution allows richer approximations of the posterior. In the case of the DBN, we perform classification using a heuristically motivated approximate inference procedure, in which we guess that a reasonable value for the mean field expectation of the hidden units can be provided by an upward pass through the network in an MLP. Any distribution Q(h) may be used to obtain a variational lower bound on the log-likelihood, so this heuristic procedure does yield such a bound, but the bound is not explicitly optimized in any way, and it may be far from tight. In particular, the heuristic estimate of Q ignores interactions between hidden units within the same layer, as well as the top-down feedback influence of hidden units in deeper layers on hidden units closer to the input. In the case of the DBM, all the hidden units within a layer are conditionally independent given the other layers. This lack of intralayer interaction makes it possible to use fixed-point equations to actually optimize the variational lower bound and find the true optimal mean field expectations (to within some numerical tolerance).

The use of proper mean field allows the approximate inference procedure for DBMs to capture the influence of top-down feedback interactions. This makes DBMs interesting from the point of view of neuroscience, because the human brain is known to use many top-down feedback connections. Because of this property, DBMs have been used as computational models of real neuroscientific phenomena (Series et al., 2010; Reichert et al., 2011).


One unfortunate property of DBMs is that sampling from them is relatively difficult. DBNs need to use MCMC sampling only in their top pair of layers; the other layers are used only at the end, in one efficient ancestral sampling pass. To generate a sample from a DBM, it is necessary to use MCMC across all layers, with every layer of the model participating in every Markov chain transition.

20.4.2 DBM Mean Field Inference

The conditional distribution over one DBM layer given the neighboring layers is factorial. In the case of the DBM with two hidden layers, these distributions are P(v | h^(1)), P(h^(1) | v, h^(2)), and P(h^(2) | h^(1)). The distribution over all hidden layers generally does not factorize, because of the interaction weights W^(2) between h^(1) and h^(2), which render these variables mutually dependent.

As with the DBN, we are left to seek methods to approximate the DBM posterior distribution. Unlike the DBN, however, the DBM posterior distribution over the hidden units, while complicated, is easy to approximate with a variational approximation (as described in section 19.4), specifically a mean field approximation. The mean field approximation is a simple form of variational inference, in which we restrict the approximating distribution to fully factorial distributions. In the context of DBMs, the mean field equations capture the bidirectional interactions between layers. The approach we present here follows Salakhutdinov and Hinton (2009a).

For a DBM with two hidden layers, let Q(h^(1), h^(2) | v) be the approximation of P(h^(1), h^(2) | v). The mean field assumption implies that

    Q(h^(1), h^(2) | v) = Π_j Q(h_j^(1) | v) Π_k Q(h_k^(2) | v).    (20.29)

The mean field approximation attempts to find the member of this family of distributions that best fits the true posterior P(h^(1), h^(2) | v). Importantly, the inference process must be run again to find a different distribution Q every time we use a new value of v.

One can conceive of many ways of measuring how well Q(h | v) fits P(h | v). The mean field approach is to minimize
    KL(Q‖P) = Σ_h Q(h^(1), h^(2) | v) log( Q(h^(1), h^(2) | v) / P(h^(1), h^(2) | v) ).    (20.30)

In general, we do not have to provide a parametric form of the approximating distribution beyond enforcing the independence assumptions; the variational procedure can recover the functional form itself. However, for a mean field assumption on binary hidden units, no generality is lost by fixing a parametrization in advance. We parametrize Q as a product of Bernoulli distributions, associating the probability of each element of h^(1) and h^(2) with a parameter: for each j, ĥ_j^(1) = Q(h_j^(1) = 1 | v), with ĥ_j^(1) ∈ [0, 1], and for each k, ĥ_k^(2) = Q(h_k^(2) = 1 | v), with ĥ_k^(2) ∈ [0, 1]. Thus we have the following approximation to the posterior:

    Q(h^(1), h^(2) | v) = Π_j Q(h_j^(1) | v) Π_k Q(h_k^(2) | v)    (20.31)
    = Π_j (ĥ_j^(1))^{h_j^(1)} (1 − ĥ_j^(1))^{1 − h_j^(1)} · Π_k (ĥ_k^(2))^{h_k^(2)} (1 − ĥ_k^(2))^{1 − h_k^(2)}.    (20.32)

The mean field approximation to the DBM posterior thus provides a family of distributions indexed by these two vectors of parameters. Choosing the member of the family that best fits the true posterior amounts to an optimization problem. Applying the general fixed-point equations for maximizing the variational lower bound with respect to Q and P (eq. 19.56) yields the update rules

    ĥ_j^(1) = σ( Σ_i v_i W_{i,j}^(1) + Σ_k W_{j,k}^(2) ĥ_k^(2) ),  ∀j,    (20.33)

    ĥ_k^(2) = σ( Σ_{j′} W_{j′,k}^(2) ĥ_{j′}^(1) ),  ∀k.    (20.34)


At a fixed point of this system of equations, we have a local maximum of the variational lower bound L(Q). To reach a fixed point, we simply iterate the updates of equation 20.33 (for ĥ_j^(1)) and equation 20.34 (for ĥ_k^(2)) alternately until convergence. On small problems such as MNIST, as few as ten iterations can be sufficient to find an approximate positive-phase gradient, and fifty usually suffice for high accuracy.

Extending approximate variational inference to deeper DBMs is straightforward.
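The alternating fixed-point updates of equations 20.33 and 20.34 can be sketched directly in NumPy (the function name and initialization are illustrative assumptions, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_mean_field(v, W1, W2, n_iters=200):
    """Mean field inference for a two-hidden-layer DBM (eqs. 20.33-20.34).
    Returns the Bernoulli parameters hhat1, hhat2 of the factorial
    approximation Q(h1, h2 | v)."""
    hhat1 = np.full(W1.shape[1], 0.5)  # arbitrary starting point in [0, 1]
    hhat2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        # eq. 20.33: bottom-up input from v plus top-down input from hhat2
        hhat1 = sigmoid(v @ W1 + W2 @ hhat2)
        # eq. 20.34: input from hhat1 only
        hhat2 = sigmoid(hhat1 @ W2)
    return hhat1, hhat2
```

At convergence, re-applying either update leaves the parameters unchanged, which is the fixed-point property the text describes.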

20.4.3 DBM Parameter Learning

Learning in the DBM must confront both the challenge of an intractable partition function, using the techniques from chapter 18, and the challenge of an intractable posterior distribution, using the techniques from chapter 19.

As described in section 20.4.2, variational inference allows the construction of a distribution Q(h | v) that approximates the intractable P(h | v). Learning then proceeds by maximizing L(v, Q, θ), the variational lower bound on the intractable log-likelihood log P(v; θ).

For a deep Boltzmann machine with two hidden layers, L is given by

    L(Q, θ) = Σ_i Σ_{j′} v_i W_{i,j′}^(1) ĥ_{j′}^(1) + Σ_{j′} Σ_{k′} ĥ_{j′}^(1) W_{j′,k′}^(2) ĥ_{k′}^(2) − log Z(θ) + H(Q).    (20.35)

This expression still contains the log partition function, log Z(θ). Because a deep Boltzmann machine contains restricted Boltzmann machines as components, the hardness results for computing the partition function and sampling that apply to RBMs (chapter 18) also apply to DBMs. Training the model therefore requires approximations to the gradient of the log partition function. DBMs are typically trained using stochastic maximum likelihood (section 18.2). Algorithm 20.1 gives the variational stochastic maximum likelihood algorithm as applied to a DBM.


Algorithm 20.1 The variational stochastic maximum likelihood algorithm for training a DBM with two hidden layers.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v, h^(1), h^(2); θ + εΔθ) to burn in, starting from samples from p(v, h^(1), h^(2); θ).
Initialize three matrices, Ṽ, H̃^(1), and H̃^(2), each with m rows set to random values.
while not converged (learning loop) do
    Sample a minibatch of m examples from the training data and arrange them as the rows of a design matrix V.
    Initialize matrices Ĥ^(1) and Ĥ^(2), possibly to the model's marginals.
    while not converged (mean field inference loop) do
        Ĥ^(1) ← σ( V W^(1) + Ĥ^(2) W^(2)ᵀ )
        Ĥ^(2) ← σ( Ĥ^(1) W^(2) )
    end while
    ΔW^(1) ← (1/m) Vᵀ Ĥ^(1)
    ΔW^(2) ← (1/m) Ĥ^(1)ᵀ Ĥ^(2)
    for l = 1 to k (Gibbs sampling) do
        Gibbs block 1:
            ∀i, j, sample Ṽ_{i,j} from P(Ṽ_{i,j} = 1) = σ( W_{j,:}^(1) (H̃_{i,:}^(1))ᵀ ).
            ∀i, j, sample H̃_{i,j}^(2) from P(H̃_{i,j}^(2) = 1) = σ( H̃_{i,:}^(1) W_{:,j}^(2) ).
        Gibbs block 2:
            ∀i, j, sample H̃_{i,j}^(1) from P(H̃_{i,j}^(1) = 1) = σ( Ṽ_{i,:} W_{:,j}^(1) + H̃_{i,:}^(2) (W_{j,:}^(2))ᵀ ).
    end for
    ΔW^(1) ← ΔW^(1) − (1/m) Ṽᵀ H̃^(1)
    ΔW^(2) ← ΔW^(2) − (1/m) H̃^(1)ᵀ H̃^(2)
    W^(1) ← W^(1) + ε ΔW^(1)
    W^(2) ← W^(2) + ε ΔW^(2)
end while


20.4.4 Layer-Wise Pretraining

Unfortunately, training a DBM using stochastic maximum likelihood from a random initialization usually results in failure. In some cases, the model fails to learn to represent the distribution adequately. In other cases, the DBM may represent the distribution well but obtain no higher likelihood than could be obtained with just an RBM. Note that a DBM with very small weights in all but the first layer represents approximately the same distribution as an RBM.

The most popular way to overcome this problem is greedy layer-wise pretraining. (Section 20.4.5 describes alternatives that make joint training of the DBM possible.) In this method, each layer of the DBM is trained in isolation as an RBM. The first layer is trained to model the input data. Each subsequent RBM is trained to model samples from the previous RBM's posterior distribution. After all the RBMs have been trained in this way, they can be combined to form a DBM. The DBM may then be trained with PCD. Typically, PCD training makes only a small change in the model's parameters and in its performance, as measured by the log-likelihood it assigns to the data or its ability to classify inputs.

The training procedure is illustrated in figure 20.4.

This greedy layer-wise procedure for training a DBM differs from the procedure used to train a DBN. The parameters of each individual RBM may be copied into a DBN directly, but in the case of the DBM, the RBM parameters must be modified before inclusion. A layer in the middle of the stack of RBMs is trained with only bottom-up input, but after the stack is combined to form the DBM, that layer will have both bottom-up and top-down input. To account for this effect, Salakhutdinov and Hinton (2009a) advocate dividing the weights of all but the top and bottom RBM in half before inserting them into the DBM. Additionally, the bottom RBM must be trained using two "copies" of each visible unit, with the weights tied to be equal between the two copies, so that the weights are effectively doubled during the upward pass. Similarly, the top RBM must be trained with two copies of the topmost layer.


[Figure 20.4: The deep Boltzmann machine training procedure used to classify the MNIST dataset (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014). (a) Train an RBM by using CD to approximately maximize log P(v). (b) Train a second RBM that models h^(1) and the target class y by using CD-k to approximately maximize log P(h^(1), y), where h^(1) is drawn from the first RBM's posterior conditioned on the data. Increase k from 1 to 20 during learning. (c) Combine the two RBMs into a DBM. Train it to approximately maximize log P(v, y) using stochastic maximum likelihood with k = 5. (d) Delete y from the model. Define a new set of features h^(1) and h^(2) obtained by running mean field inference in the model lacking y. Use these features as input to an MLP whose structure is the same as an additional pass of mean field, with an extra output layer for the estimate of y. Initialize the MLP's weights to be the same as the DBM's weights. Train the MLP to approximately maximize log P(y | v) using stochastic gradient descent and dropout. Figure reprinted from Goodfellow et al. (2013b).]


After the greedy pretraining stage, the combined model is trained jointly with stochastic maximum likelihood, typically PCD (Salakhutdinov and Hinton, 2009a). This joint training stage usually changes the parameters only slightly relative to the pretraining stage; Goodfellow et al. (2013b) analyze why the greedy pretraining stage is so important for DBMs.

20.4.5 Jointly Training Deep Boltzmann Machines

Classic DBMs require greedy unsupervised pretraining and, to perform classification well, require a separate MLP-based classifier on top of the hidden features they extract. This has some undesirable properties. It is hard to track performance during training, because we cannot evaluate properties of the full DBM while training the first RBM, so we cannot tell how well our hyperparameters are working until quite late in the training process. Software implementations of DBMs need many different components: CD training of individual RBMs, PCD training of the full DBM, and training based on back-propagation through the MLP. Finally, the MLP on top of the Boltzmann machine loses many of the advantages of the Boltzmann machine's probabilistic model, such as being able to perform inference when some input values are missing.

There are two main ways to resolve the joint training problem of the deep Boltzmann machine. The first is the centered deep Boltzmann machine (Montavon and Muller, 2012), which reparametrizes the model in order to make the Hessian of the cost function better conditioned, so that learning can succeed from a random initialization. Such a model has not yet been shown to attain the classification performance of an MLP classifier, however. The second is the multi-prediction deep Boltzmann machine (Goodfellow et al., 2013b), which uses an alternative training criterion that allows the use of back-propagation, avoiding the problems with MCMC estimates of the gradient.



Recall from section 20.2 that the energy function of a Boltzmann machine with weight matrix U and biases b over units x is given by

    E(x) = −xᵀU x − bᵀx.    (20.36)

Using any of a variety of training algorithms, the difficulty with the Boltzmann machine in this form is that the Hessian of the cost with respect to the weights U tends to be poorly conditioned. The centered Boltzmann machine instead uses the reparametrized energy function

    E′(x; U, b) = −(x − μ)ᵀU(x − μ) − (x − μ)ᵀb,    (20.37)

where μ is a vector typically fixed at the beginning of training, chosen so that x − μ ≈ 0. This reparametrization does not change the family of probability distributions that the model can represent, but it does change the conditioning of the Hessian of the likelihood. Melchior et al. (2013) confirmed experimentally that the conditioning improves, and that centering performs comparably to another technique, the enhanced gradient (Cho et al., 2011). The improved conditioning allows learning to succeed even in difficult cases, such as training a deep Boltzmann machine with multiple layers.

The other approach to jointly training deep Boltzmann machines is the multi-prediction deep Boltzmann machine (MP-DBM), which works by viewing the mean field equations as defining a family of recurrent networks for approximately solving every possible inference problem (Goodfellow et al., 2013b).

Rather than training the model to maximize the likelihood, the MP-DBM is trained so that each of these recurrent networks obtains an accurate answer to the corresponding inference problem. The training process is described in figure 20.5. This general principle of back-propagating through the approximate inference graph has also been applied to other models (Stoyanov et al., 2011; Brakel et al., 2013).


[Figure 20.5: An illustration of the multi-prediction training process for a deep Boltzmann machine. Each row indicates a different example within a minibatch for the same training step; each column is a time step within the mean field inference process. For each example, a subset of the data variables is sampled to serve as inputs to the inference network, and the model is trained to predict the remaining variables. Figure reprinted from Goodfellow et al. (2013b).]


Because the MP-DBM is not trained to maximize (a bound on) the likelihood, its estimates of p(v) are not especially accurate. Instead, it is trained so that approximate inference gives accurate answers, and it therefore performs well on the two kinds of tasks it was trained for: classification, and inference over arbitrary subsets of missing inputs. Unlike the traditional DBM followed by a separate MLP classifier, the MP-DBM retains the Boltzmann machine's ability to answer many different inference queries with a single model, while the DBM's separate classifier does not.

Training the MP-DBM requires neither SML, greedy pretraining, nor a separate classifier; everything is learned jointly by back-propagating through the unrolled mean field inference graph. In this respect, the MP-DBM is related to NADE and its extension NADE-k (Raiko et al., 2014), described in section 20.10.10, which are likewise trained to predict some variables given others.

The MP-DBM also has disadvantages. Back-propagation through inference makes each training and classification step relatively expensive, because the whole unrolled inference graph must be evaluated, and because the MP-DBM never optimizes the likelihood, it should not be expected to serve as a good density model.

20.5 Boltzmann Machines for Real-Valued Data

While Boltzmann machines were originally developed for use with binary data, many applications, such as image and audio modeling, seem to require the ability to represent probability distributions over real values. In some cases, it is possible to treat real-valued data in the interval [0, 1] as representing the expectation of a binary variable. For example, Hinton (2000) treats grayscale images in the training set as defining [0, 1] probability values: each pixel defines the probability of a binary value being 1, and the binary pixels are all sampled independently from one another. This is a common procedure for evaluating binary models on grayscale image datasets, but it is not an entirely satisfying theoretical approach.

20.5.1 Gaussian-Bernoulli RBMs

Restricted Boltzmann machines may be developed for many exponential family conditional distributions (Welling et al., 2005). Of these, the most common is the RBM with binary hidden units and real-valued visible units, with the conditional distribution over the visible units being a Gaussian distribution whose mean is a function of the hidden units.

There are many ways of parametrizing Gaussian-Bernoulli RBMs. One choice is whether to use a covariance matrix or a precision matrix for the Gaussian distribution. Here we present the precision formulation; the modification to the covariance formulation is straightforward. We wish to have the conditional distribution

    p(v | h) = N( v; W h, β^{−1} ).    (20.38)

We can find the terms we need to add to the energy function by expanding the unnormalized log conditional distribution:

    log N( v; W h, β^{−1} ) = −(1/2)(v − W h)ᵀ β (v − W h) + f(β),    (20.39)

where f encapsulates all the terms that are a function only of the parameter β, not of the model's state. We can discard f, because its only role is to normalize the distribution, and the partition function of whatever energy function we choose will carry out that role.

If we include all the terms involving v from equation 20.39 (with their sign flipped) in our energy function, and do not add any other terms involving v, then our energy function represents the desired conditional p(v | h). The other conditional, p(h | v), remains more of a free choice. Note that equation 20.39 contains one term

    (1/2) hᵀWᵀβW h.    (20.40)

Because this term contains h_i h_j cross terms, it cannot be included in full: such terms correspond to edges between hidden units, which would give us a linear factor model among the hidden units rather than an RBM. When designing our Gaussian RBM, we therefore omit these h_i h_j cross terms. Omitting them does not change the conditional p(v | h), so equation 20.39 is still respected. We do still have a choice, however, about whether to include the terms involving only a single hidden unit. If we assume a diagonal precision matrix, we find that for each hidden unit h_i we have a term

    (1/2) h_i Σ_j β_j W_{j,i}².    (20.41)

In the above, we used the fact that h_i² = h_i because h_i ∈ {0, 1}. If we include this term (with its sign flipped) in the energy function, it naturally biases h_i to be turned off when the weights for that unit are large. The choice of whether to include this bias term does not affect the family of distributions the model can represent (assuming that we include hidden unit bias parameters), but it does affect the learning dynamics of the model.

One fairly popular way to define the energy function of a Gaussian-Bernoulli RBM is

    E(v, h) = (1/2) vᵀ(β ⊙ v) − (v ⊙ β)ᵀW h − bᵀh,    (20.42)

but one may also add extra terms, or parametrize the energy in terms of the variance rather than the precision. In this derivation, we have not included a bias term on the visible units, but one could easily be added. One final source of variability in the parametrization of a Gaussian-Bernoulli RBM is the choice of how to treat the precision matrix: it may be fixed to a constant (perhaps estimated from the marginal precision of the data) or learned, and it may be a scalar times the identity matrix or a diagonal matrix.

20.5.2 Undirected Models of Conditional Covariance

While the Gaussian RBM has been the canonical energy model for real-valued data, Ranzato et al. (2010a) argue that its inductive bias is not well suited to the statistical variations present in some types of natural images. The problem is that much of the information content in natural images is embedded in the covariance between pixels rather than in the raw pixel values. Since the Gaussian RBM models only the conditional mean of the input given the hidden units, it cannot capture conditional covariance information. In response, alternative models have been proposed that attempt to better account for the covariance of real-valued data. These models include the mean and covariance RBM (mcRBM),*1 the mean product of t-distributions (mPoT) model, and the spike and slab RBM (ssRBM).

Mean and covariance RBM. The mcRBM uses its hidden units to independently encode the conditional mean and covariance of all observed units. The mcRBM hidden layer is divided into two groups of units: mean units and covariance units. The group that models the conditional mean is simply a Gaussian RBM. The other half is a covariance RBM (cRBM; Ranzato et al., 2010a), whose components model the conditional covariance structure, as described below.

Specifically, with binary mean units h^(m) and binary covariance units h^(c), the mcRBM model is defined as the combination of two energy functions:

    E_mc(x, h^(m), h^(c)) = E_m(x, h^(m)) + E_c(x, h^(c)),    (20.43)

where E_m is the standard Gaussian-Bernoulli RBM energy function:*2

    E_m(x, h^(m)) = (1/2) xᵀx − Σ_j xᵀW_{:,j} h_j^(m) − Σ_j b_j^(m) h_j^(m),    (20.44)

*1 The term "mcRBM" is pronounced by saying the name of the letters M-C-R-B-M.
*2 This version of the Gaussian-Bernoulli RBM energy function assumes the image data have zero mean per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.


and E_c is the cRBM energy function that models the conditional covariance information:

    E_c(x, h^(c)) = (1/2) Σ_j h_j^(c) ( xᵀr^(j) )² − Σ_j b_j^(c) h_j^(c).    (20.45)

The parameter r^(j) corresponds to the covariance weight vector associated with h_j^(c), and b^(c) is a vector of covariance offsets. The combined energy function defines a joint distribution,

    p_mc(x, h^(m), h^(c)) = (1/Z) exp{ −E_mc(x, h^(m), h^(c)) },    (20.46)

and a corresponding conditional distribution over the observations given h^(m) and h^(c), which is a multivariate Gaussian:

    p_mc(x | h^(m), h^(c)) = N( x; C^mc_{x|h} ( Σ_j W_{:,j} h_j^(m) ), C^mc_{x|h} ).    (20.47)

Note that the covariance matrix C^mc_{x|h} = ( Σ_j h_j^(c) r^(j) r^(j)ᵀ + I )^{−1} is nondiagonal, and that W is the weight matrix associated with the Gaussian RBM modeling the conditional means. It is difficult to train the mcRBM via contrastive divergence or persistent contrastive divergence because of this nondiagonal conditional covariance structure. CD and PCD require sampling from the joint distribution of x, h^(m), and h^(c), which, in a standard RBM, is accomplished by Gibbs sampling over the conditionals. In the mcRBM, however, sampling from p_mc(x | h^(m), h^(c)) requires computing (C^mc)^{−1} at every iteration of learning, which can be an impractical computational burden for larger observations. Ranzato and Hinton (2010) avoid direct sampling from the conditional p_mc(x | h^(m), h^(c)) by sampling directly from the marginal p(x) using Hamiltonian (hybrid) Monte Carlo (Neal, 1993) on the mcRBM free energy.

Mean product of Student t-distributions. The mean product of Student t-distributions (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM: it includes nonzero Gaussian means through the addition of Gaussian RBM-like hidden units. Like the mcRBM, the PoT conditional distribution over the observation is a multivariate Gaussian (with nondiagonal covariance); unlike the mcRBM, the complementary conditional distribution over the hidden variables is given by conditionally independent Gamma distributions. The Gamma distribution G(k, θ) is a probability distribution over positive real numbers with mean kθ; a more detailed understanding of it is not necessary to follow the basic ideas of the mPoT model.

The mPoT energy function is

    E_mPoT(x, h^(m), h^(c))    (20.48)
    = E_m(x, h^(m)) + Σ_j ( h_j^(c) ( 1 + (1/2)( r^(j)ᵀ x )² ) + (1 − γ_j) log h_j^(c) ),    (20.49)

where r^(j) is the covariance weight vector associated with unit h_j^(c), and E_m(x, h^(m)) is as defined in equation 20.44. Just as with the mcRBM, the mPoT energy function specifies a multivariate Gaussian conditional distribution over x with nondiagonal covariance. Learning in the mPoT model, again like the mcRBM, is complicated by the inability to sample from this nondiagonal conditional p_mPoT(x | h^(m), h^(c)), so Ranzato et al. (2010b) likewise advocate direct sampling of p(x) via Hamiltonian (hybrid) Monte Carlo.

Spike and slab restricted Boltzmann machines. Spike and slab restricted Boltzmann machines (ssRBMs) (Courville et al., 2011) provide another means of modeling the covariance structure of real-valued data. Compared with the mcRBM, the ssRBM has the advantage of requiring neither matrix inversion nor Hamiltonian Monte Carlo methods during sampling.

The spike and slab RBM has two sets of hidden units: binary spike units h, and real-valued slab units s. The mean of the visible units conditioned on the hidden units is given by (h ⊙ s)Wᵀ. In other words, each column W_{:,i} defines a component that can appear in the input when h_i = 1. The corresponding spike variable h_i determines whether that component is present at all, and the corresponding slab variable s_i determines its intensity, if it is present. When a spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by W_{:,i}.

Formally, the ssRBM energy function is

    E_ss(x, s, h) = −Σ_i xᵀW_{:,i} s_i h_i + (1/2) xᵀ( Λ + Σ_i Φ_i h_i ) x    (20.50)
    + (1/2) Σ_i α_i s_i² − Σ_i α_i μ_i s_i h_i − Σ_i b_i h_i + Σ_i α_i μ_i² h_i,    (20.51)

where b_i is the offset of the spike h_i, and Λ is a diagonal precision matrix on the observations x. The parameter α_i > 0 is a scalar precision parameter for the real-valued slab variable s_i. The parameter Φ_i is a nonnegative diagonal matrix that defines an h-modulated quadratic penalty on x, and each μ_i is a mean parameter for the slab variable s_i.

With the joint distribution defined via this energy, the ssRBM conditional distribution over the visible units given the spike variables, obtained by marginalizing out the slab variables s, is Gaussian:

    p_ss(x | h) = (1/P(h)) (1/Z) ∫ exp{ −E(x, s, h) } ds    (20.52)
    = N( x; C^ss_{x|h} Σ_i W_{:,i} μ_i h_i, C^ss_{x|h} ),    (20.53)

where C^ss_{x|h} = ( Λ + Σ_i Φ_i h_i − Σ_i α_i^{−1} h_i W_{:,i} W_{:,i}ᵀ )^{−1}. The last equality holds only if the covariance matrix C^ss_{x|h} is positive definite.

Gating by the spike variables means that the true marginal distribution over h ⊙ s is sparse. This is different from sparse coding, where samples from the model "almost never" (in the measure-theoretic sense) contain zeros in the code, and where MAP inference is required to impose sparsity.

Comparing the ssRBM with the mcRBM and mPoT models, the ssRBM parametrizes the conditional covariance of the observation in a significantly different way. The mcRBM and mPoT both model the covariance structure of the observation as ( Σ_j h_j^(c) r^(j) r^(j)ᵀ + I )^{−1}, using the activation of hidden units h_j > 0 to enforce constraints on the conditional covariance in the direction r^(j). In contrast, the ssRBM specifies the conditional covariance of the observation using the hidden spike activations h_i = 1 to pinch the precision matrix along the direction specified by the corresponding weight vector. The ssRBM conditional covariance is similar to that of a different model, the product of probabilistic principal components analysis (PoPPCA) (Williams and Agakov, 2002). In the overcomplete setting, sparse activations with the ssRBM parametrization permit significant variance (above the nominal variance given by Λ^{−1}) only in the directions of the sparsely activated h_i. In the mcRBM or mPoT models, an overcomplete representation would mean that capturing variation in a particular direction in observation space requires removing potentially all constraints with a positive projection in that direction.


The primary disadvantage of the ssRBM is that some settings of its parameters correspond to a covariance matrix that is not positive definite, which makes the model's partition function undefined (see the discussion of undefined partition functions in section 16.1). In practice this issue is usually avoided with simple heuristic tricks, though there is not yet a theoretically satisfying solution.

The ssRBM admits several extensions. Including higher-order interactions and average-pooling of the slab variables (Courville et al., 2014) enables the model to learn excellent features for a classifier when labeled data is scarce. Adding a term to the energy function that prevents the partition function from becoming undefined yields a sparse coding model, spike and slab sparse coding, also known as S3C (Goodfellow et al., 2013d).

20.6 Convolutional Boltzmann Machines

As seen in chapter 9, extremely high-dimensional inputs, such as images, place great strain on the computation, memory, and statistical requirements of machine learning models. Replacing matrix multiplication by discrete convolution with a small kernel is the standard way of solving these problems for inputs that have translation-invariant spatial or temporal structure. Desjardins and Bengio (2008) showed that this approach works reasonably well when applied to RBMs.

Deep convolutional networks usually also require a pooling operation so that the spatial size of each successive layer decreases. Feedforward convolutional networks often use a pooling function, such as the maximum of the detector units being pooled: p = max_i d_i. Unfortunately, pooling is difficult to generalize to energy-based models. We could introduce a binary pooling unit p over n binary detector units, but evaluating its interactions naively requires enumerating 2^n states; even for a small 3 × 3 pooling region, this requires 2^9 = 512 energy function evaluations per pooling unit!


Lee et al. (2009) developed a solution to this problem called probabilistic max pooling (not to be confused with "stochastic pooling," which is a technique for implicitly constructing ensembles of convolutional feedforward networks). The strategy behind probabilistic max pooling is to constrain the detector units so that at most one may be active at a time. This means there are only n + 1 total states: one state for each of the n detector units being on, and an additional state corresponding to all the detector units being off. The pooling unit is on if and only if one of the detector units is on. The state with all units off is assigned energy zero. We can think of this as describing a model with a single variable that has n + 1 possible states, or, equivalently, as a model with n + 1 variables whose joint assignment of energies is constrained in this way.

Probabilistic max pooling is efficient, but it does force the detector units to be mutually exclusive. Whether this is a useful regularizing constraint or a harmful limit on model capacity in a given context is not yet clear. Lee et al. (2009) demonstrated that their convolutional deep Boltzmann machine*3 performs well on several image classification tasks.

*3 The publication describes the model as a "deep belief network," but because it can be described as a purely undirected model with tractable layer-wise mean field fixed-point updates, it best fits the definition of a deep Boltzmann machine.

Another difficulty is that probabilistic max pooling requires the pooling regions to be disjoint. With detector units arranged in 2 × 2 blocks and a stride of 2, each pooling layer shrinks the size of the representation by 50 percent in each direction, and shifting the input by a single pixel changes which block each detector unit falls into. Feedforward convolutional networks, by contrast, typically obtain their best performance with overlapping pooling regions, such as 3 × 3 regions with a stride of 2 between them, a configuration that probabilistic max pooling cannot express because each detector unit would belong to more than one mutually exclusive group.

20.7 Boltzmann Machines for Structured or Sequential Outputs

In the structured output scenario, we wish to train a model that can map from some input x to some output y, where the different entries of y are related to one another and must obey some constraints. For example, in the speech synthesis task, y is a waveform, and the entire waveform must sound like a coherent utterance. A natural way to represent the relationships between the entries in y is to use a probability distribution p(y | x). Boltzmann machines, extended to model conditional distributions, can supply this probabilistic model.

In the sequence modeling scenario, the model must estimate a probability distribution over a sequence of variables, p(x^(1), ..., x^(τ)). Conditional Boltzmann machines can represent factors of the form p(x^(t) | x^(1), ..., x^(t−1)) in order to accomplish this task.

An important sequential modeling task for the video game and film industry is modeling sequences of joint angles of skeletons used to render 3-D characters. These sequences are often collected using motion capture systems to record the movements of actors; a probabilistic model of a character's movement allows the generation of new, previously unseen, but realistic animations. To solve this task, Taylor et al. (2007) introduced a conditional RBM representing p(x^(t) | x^(t−1), ..., x^(t−m)) over sequences of frames x^(t), conditioned on the m preceding frames. The model is an RBM over x whose bias parameters are a linear function of the preceding m frames; different settings of x^(t−1) and the other conditioning variables effectively yield a different RBM over x. Training the model is much the same as training an ordinary RBM. Conditional RBMs are applicable in many other contexts (Mnih et al., 2011), and several other variants of sequence modeling with RBMs exist (Taylor and Hinton, 2009; Sutskever et al., 2009; Boulanger-Lewandowski et al., 2012).

Another sequence-modeling variant is the RNN-RBM of Boulanger-Lewandowski et al. (2012), which has been used to model sequences of musical notes. The RNN-RBM is a generative model of a sequence of frames x^(t) consisting of an RNN that emits the RBM parameters for each time step. Unlike the earlier conditional RBM, whose bias parameters alone varied over time, the RNN-RBM uses the RNN to emit all the parameters of the RBM, including the weights. To train the model, the gradient of the loss with respect to the RBM parameters is estimated with a contrastive-divergence-style procedure, and this gradient is then propagated back through the RNN using the usual back-propagation through time algorithm.

20.8 Other Boltzmann Machines

Many other variants of Boltzmann machines are possible.

Boltzmann machines may be extended with different training criteria. We have focused on Boltzmann machines trained to approximately maximize the generative criterion log p(v). It is also possible to train discriminative RBMs that aim to maximize log p(y | v) instead (Larochelle and Bengio, 2008). This approach often performs best when using a combination of both the generative and the discriminative criteria. Unfortunately, RBMs are not usually as powerful a supervised learner as an MLP, at least with existing methodology.

Most Boltzmann machines used in practice have only second-order interactions in their energy functions, meaning that their energy functions are sums of many terms, each of which includes the product of two random variables, such as v_i W_{i,j} h_j. It is also possible to train higher-order Boltzmann machines (Sejnowski, 1987), whose energy function terms involve the products of three or more variables. Three-way interactions between a hidden unit and two different image frames can model spatial transformations between frames of video (Memisevic and Hinton, 2007, 2010). Multiplication by a one-hot class variable can change the relationship between visible and hidden units depending on which class is present (Nair and Hinton, 2009). One recent example of higher-order interactions is a Boltzmann machine with two groups of hidden units, one that interacts with both the visible units v and the class label y, and another that interacts only with v (Luo et al., 2011). This can be interpreted as encouraging some hidden units to learn features that model the input regardless of the class, while other hidden units learn class-dependent features. Another use of higher-order interactions is gating: Sohn et al. (2013) introduced a Boltzmann machine with third-order interactions and binary mask variables associated with each visible unit. When these masking variables are set to zero, they remove the influence of a visible unit on the hidden units, allowing visible units that are irrelevant to the classification problem to be removed from the inference pathway that estimates the class.

20.9 Back-Propagation through Random Operations

Traditional neural networks implement a deterministic transformation of some input variables x. When developing generative models, we often wish to extend neural networks to implement stochastic transformations of x. One straightforward way to do this is to augment the neural network with extra inputs z that are sampled from some simple probability distribution, such as a uniform or Gaussian distribution. The neural network can then continue to perform deterministic computation internally, but the function f(x, z) will appear stochastic to an observer who does not have access to z. Provided that f is continuous and differentiable, we can then compute the gradients necessary for training using back-propagation as usual.

As an example, consider the operation of drawing a sample y from a Gaussian distribution with mean μ and variance σ²:

    y ∼ N(μ, σ²).    (20.54)

Because an individual sample of y is produced not by a function but by a sampling process whose output changes every time we query it, it may seem counterintuitive to take the derivatives of y with respect to the parameters of its distribution, μ and σ². However, we can rewrite the sampling process as transforming an underlying random value z ∼ N(z; 0, 1) to obtain a sample from the desired distribution:

    y = μ + σ z.    (20.55)

We are now able to back-propagate through the sampling operation, by regarding it as a deterministic operation with an extra input z. Crucially, the extra input is a random variable whose distribution is not a function of any of the variables whose derivatives we want to calculate. The result tells us how an infinitesimal change in μ or σ would change the output if we could repeat the sampling operation again with the same value of z.

Being able to back-propagate through this sampling operation allows us to incorporate it into a larger graph. We can build elements of the graph on top of the output of the sampling distribution, for example, computing the derivatives of some loss function J(y). We can also build elements of the graph whose outputs are the inputs or the parameters of the sampling operation, such as μ = f(x; θ) and σ = g(x; θ). Back-propagation through this extended graph yields ∇_θ J(y).

More generally, we can express the sampling process as drawing y from a distribution p(y; ω), where ω may contain both the parameters θ and, if applicable, the inputs x, so that p(y | x; θ) may be written p(y; ω). Given

    y ∼ p(y; ω),    (20.56)

we rewrite the sampling process as

    y = f(z; ω),    (20.57)

where z is a source of randomness. We may then compute the derivatives of y with respect to ω using traditional tools such as back-propagation applied to f, as long as f is continuous and differentiable almost everywhere. Crucially, ω must not be a function of z, and z must not be a function of ω. This technique is often called the reparametrization trick, stochastic back-propagation, or perturbation analysis.

The requirement that f be continuous and differentiable of course requires y to be continuous. To back-propagate through a sampling process that produces discrete-valued samples, we turn to the REINFORCE family of algorithms (Williams, 1992), described in section 20.9.1.

In neural network applications, the machinery for propagating derivatives through random operations dates back to the mid-twentieth century (Price, 1958; Bonnet, 1964). It was first used for machine learning by Williams (1992), later applied to variational approximations (Opper and Archambeau, 2009), and more recently to stochastic or generative neural networks (Bengio et al., 2013b; Kingma, 2013; Kingma and Welling, 2014b,a; Rezende et al., 2014; Goodfellow et al., 2014c).


20.9.1 Back-Propagating through Discrete Stochastic Operations

When a model emits a discrete variable y, the reparametrization trick is not applicable. Suppose that the model takes inputs x and parameters θ, both encapsulated in the vector ω, and combines them with random noise z to produce y:

    y = f(z; ω).    (20.58)

Because y is discrete, f must be a step function. The derivatives of a step function are not useful at any point: right at each step boundary they are undefined, and everywhere else they are zero, so the derivatives of any cost function J(y) give no information about how to update ω.

The REINFORCE algorithm (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) provides a framework defining a family of simple but powerful solutions (Williams, 1992). The core idea is that even though J(f(z; ω)) is a step function with useless derivatives, the expected cost E_{z∼p(z)} J(f(z; ω)) is often a smooth function amenable to gradient descent. Although that expectation is typically not tractable when y is high dimensional, it can be estimated without bias using a Monte Carlo average, and the resulting stochastic estimate of the gradient can be used with SGD or other stochastic-gradient-based optimization techniques.

The simplest version of REINFORCE can be derived by simply differentiating the expected cost:

    E_z[J(y)] = Σ_y J(y) p(y),    (20.59)

    ∂E[J(y)]/∂ω = Σ_y J(y) ∂p(y)/∂ω    (20.60)
                = Σ_y J(y) p(y) ∂ log p(y)/∂ω    (20.61)
                ≈ (1/m) Σ_{i=1}^m J(y^(i)) ∂ log p(y^(i))/∂ω,  with y^(i) ∼ p(y).    (20.62)


The rewrite from equation 20.60 to equation 20.61 exploits the derivative rule for the logarithm, ∂ log p(y)/∂ω = (1/p(y)) ∂p(y)/∂ω; this derivation assumes that J does not reference ω directly, an assumption that is trivial to relax. Equation 20.62 gives an unbiased Monte Carlo estimator of the gradient.

Anywhere p(y) appears in this section, one could equally write p(y | x): p(y) is parametrized by ω, and ω may contain both θ and x.

One issue with this simple REINFORCE estimator is that it has very high variance, so that many samples of y must be drawn to obtain a good estimate of the gradient; equivalently, if only one sample is drawn, SGD converges very slowly and requires a smaller learning rate. It is possible to reduce the variance of the estimator considerably using variance reduction methods (Wilson, 1984; L'Ecuyer, 1994). The idea is to modify the estimator so that its expected value remains unchanged while its variance is reduced. In the context of REINFORCE, the proposed variance reduction methods involve the computation of a baseline used to offset J(y). Note that any offset b(ω) that does not depend on y would not change the expectation of the estimator, because
    E_{p(y)}[ ∂ log p(y)/∂ω ] = Σ_y p(y) ∂ log p(y)/∂ω    (20.63)
    = Σ_y ∂p(y)/∂ω    (20.64)
    = (∂/∂ω) Σ_y p(y) = (∂/∂ω) 1 = 0,    (20.65)

which means that

    E_{p(y)}[ (J(y) − b(ω)) ∂ log p(y)/∂ω ] = E_{p(y)}[ J(y) ∂ log p(y)/∂ω ] − b(ω) E_{p(y)}[ ∂ log p(y)/∂ω ]    (20.66)
    = E_{p(y)}[ J(y) ∂ log p(y)/∂ω ].    (20.67)

Moreover, we can obtain the optimal b(ω) by computing the variance of (J(y) − b(ω)) ∂ log p(y)/∂ω under p(y) and minimizing it with respect to b(ω). What we find is that this optimal baseline b*(ω)_i is different for each element ω_i of the vector ω:

    b*(ω)_i = E_{p(y)}[ J(y) (∂ log p(y)/∂ω_i)² ] / E_{p(y)}[ (∂ log p(y)/∂ω_i)² ].    (20.68)


The gradient estimator with respect to ω_i then becomes

    (∂ log p(y)/∂ω_i) (J(y) − b(ω)_i),    (20.69)

where b(ω)_i estimates the above b*(ω)_i. The estimates b are usually obtained by adding extra outputs to the neural network and training them to estimate E_{p(y)}[J(y) (∂ log p(y)/∂ω_i)²] and E_{p(y)}[(∂ log p(y)/∂ω_i)²] for each element of ω. These extra outputs can be trained with the mean squared error objective, using respectively J(y)(∂ log p(y)/∂ω_i)² and (∂ log p(y)/∂ω_i)² as targets when y is sampled from p(y). The estimate b* may then be recovered by substituting these estimates into equation 20.68. Mnih and Gregor (2014) preferred to use a single shared output (across all elements i of ω) trained with the target E_{p(y)}[J(y)], using as baseline b(ω) ≈ E_{p(y)}[J(y)].

Variance reduction methods have been introduced in the reinforcement learning context (Sutton et al., 2000; Weaver and Tao, 2001), generalizing previous work on the case of binary reward by Dayan (1990). See Bengio et al. (2013b), Mnih and Gregor (2014), Ba et al. (2014), Mnih et al. (2014), and Xu et al. (2015) for examples of modern uses of the REINFORCE algorithm with reduced-variance baselines in deep learning. In addition to the use of an input-dependent baseline b(ω), Mnih and Gregor (2014) found that the scale of (J(y) − b(ω)) could be divided by its standard deviation estimated by a moving average during training, as a kind of adaptive learning rate, to counter large variations in the magnitude of this quantity during training. They call this heuristic variance normalization.

REINFORCE-based estimators can be understood as estimating the gradient by correlating choices of y with corresponding values of J(y). If a good value of y is unlikely under the current parametrization, it might take a long time to obtain it by chance and get the required signal that this configuration should be reinforced.

20.10 Directed Generative Nets

As discussed in chapter 16, directed graphical models make up a prominent class of graphical models. While directed graphical models have been very popular within the greater machine learning community, within the smaller deep learning community they were, until roughly 2013, overshadowed by undirected models such as the RBM. In this section, we review some of the standard directed graphical models that have traditionally been associated with the deep learning community.

20.10.1 Sigmoid Belief Nets

Sigmoid belief networks (Neal, 1990) are a simple form of directed graphical model with a specific kind of conditional probability distribution. In general, we can think of a sigmoid belief network as having a vector of binary states s, with each element of the state influenced by its ancestors:

    p(s_i) = σ( Σ_{j<i} W_{j,i} s_j + b_i ).    (20.70)

The most common structure of sigmoid belief network is one divided into many layers, with ancestral sampling proceeding through a series of hidden layers and then ultimately generating the visible layer. Given enough depth, such a structure is a universal approximator of probability distributions over the visible units (Sutskever and Hinton, 2008).

While generating a sample of the visible units is very efficient in a sigmoid belief network, most other operations are not. Inference over the hidden units given the visible units is intractable, and mean field inference is also intractable because the variational lower bound involves taking expectations of cliques that encompass entire layers. This problem has remained difficult enough to restrict the popularity of directed discrete networks. One approach for performing inference in a sigmoid belief network is to construct a different lower bound specialized for these networks (Saul et al., 1996), though this approach has been applied only to small networks. Another approach is to use learned inference mechanisms, as described in section 19.5.

The Helmholtz machine (Dayan et al., 1995; Dayan and Hinton, 1996) is a sigmoid belief network combined with an inference network that predicts the posterior distribution over the hidden units. Modern approaches to sigmoid belief networks continue this strategy (Gregor et al., 2014; Mnih and Gregor, 2014). Because the latent variables are discrete, however, gradients cannot be back-propagated through samples of them, so these approaches rely on the REINFORCE-style estimators of section 20.9.1.

Recent approaches based on importance sampling and reweighted wake-sleep (Bornschein and Bengio, 2015), as well as bidirectional Helmholtz machines (Bornschein et al., 2015), make it possible to train sigmoid belief networks quickly and to reach competitive performance on benchmark tasks.

A special case of sigmoid belief networks is the case with no latent variables. Learning in this case is efficient, because there is no need to marginalize latent variables out of the likelihood. A family of models called auto-regressive networks generalizes this fully visible case to other kinds of variables and other structures of conditional distributions; these are described in section 20.10.7.
20.10.2 Differentiable Generator Nets

Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z into samples x, or into distributions over samples x, using a differentiable function g(z; θ^(g)), typically represented by a neural network. This model class includes variational autoencoders, which pair the generator net with an inference net; generative adversarial networks, which pair the generator net with a discriminator net; and techniques that train generator networks in isolation.

Generator networks are essentially just parametrized computational procedures for generating samples, where the architecture provides the family of possible distributions to sample from and the parameters select a distribution from within that family.

As an example, the standard procedure for drawing samples from a normal distribution with mean μ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network containing just one affine layer:

    x = g(z) = μ + Lz,    (20.71)

where L is given by the Cholesky decomposition of Σ.


Pseudorandom number generation routines can also use nonlinear transformations of simple distributions. For example, inverse transform sampling (Devroye, 2013) draws a scalar z from U(0, 1) and applies a nonlinear transformation to obtain a scalar x. In this case, g(z) is given by the inverse of the cumulative distribution function F(x) = ∫_{−∞}^x p(v)dv. If we are able to specify p(x), integrate over x, and invert the resulting function, we can sample from p(x) without using machine learning.

To generate samples from more complicated distributions that are difficult to specify directly, difficult to integrate over, or whose resulting integrals are difficult to invert, we use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function.

We can think of g as providing a nonlinear change of variables that transforms the distribution over z into the desired distribution over x. Recall from equation 3.47 that, for invertible, differentiable, continuous g,

    p_z(z) = p_x(g(z)) |det(∂g/∂z)|.    (20.72)

This implicitly imposes a probability distribution over x:

    p_x(x) = p_z(g^{−1}(x)) / |det(∂g/∂z)|.    (20.73)

Of course, this formula may be difficult to evaluate, depending on the choice of g, so we often use indirect means of learning g, rather than trying to maximize log p(x) directly.

In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs, providing the mean parameters of Bernoulli distributions:

    p(x_i = 1 | z) = g(z)_i.    (20.74)

When we use g to define p(x | z), we impose a distribution over x by marginalizing z:

    p(x) = E_z p(x | z).    (20.76)

Both approaches define a distribution p_g(x) and allow us to train various criteria of p_g using the reparametrization trick of section 20.9.

The two different approaches to formulating generator nets, emitting the parameters of a conditional distribution versus directly emitting samples, have complementary strengths and weaknesses. When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data, but we are no longer forced to use conditional distributions whose form can be easily written down and manipulated by a human designer.

When the correspondence between z and x is given in advance, as in the work of Dosovitskiy et al. (2015), where z consists of graphics codes describing an image x, learning the mapping reduces to supervised learning. When z is not observed, however, the generator network must be trained without known values of z, and learning is driven only by the requirement that the distribution over x induced by sampling z and applying g match the data distribution.


20.10.3 Variational Autoencoders

The variational autoencoder, or VAE (Kingma, 2013; Rezende et al., 2014), is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods.

To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z). The sample is then run through a differentiable generator network g(z). Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x | z). During training, however, the approximate inference network (or encoder) q(z | x) is used to obtain z, and p_model(x | z) is then viewed as a decoder network.

The key insight behind variational autoencoders is that they may be trained by maximizing the variational lower bound L(q) associated with data point x:

    L(q) = E_{z∼q(z|x)} log p_model(z, x) + H(q(z | x))    (20.77)
         = E_{z∼q(z|x)} log p_model(x | z) − D_KL( q(z | x) ‖ p_model(z) )    (20.78)
         ≤ log p_model(x).    (20.79)

In equation 20.77, we recognize the first term as the joint log-likelihood of the visible and hidden variables under the approximate posterior over the latent variables (just like with EM, except that we use an approximate rather than the exact posterior). We recognize also a second term, the entropy of the approximate posterior. When q is chosen to be a Gaussian distribution, with noise added to a predicted mean value, maximizing this entropy term encourages increasing the standard deviation of this noise. More generally, this entropy term encourages the variational posterior to place high probability mass on many z values that could have generated x, rather than collapsing to a single point estimate of the most likely value. In equation 20.78, we recognize the first term as the reconstruction log-likelihood found in other autoencoders. The second term tries to make the approximate posterior distribution q(z | x) and the model prior p_model(z) approach each other.

Traditional approaches to variational inference and learning infer q via an optimization algorithm, typically iterated fixed-point equations (section 19.4). These approaches are slow and often require the ability to compute E_{z∼q} log p_model(z, x) in closed form.

The main idea behind the variational autoencoder is to train a parametric encoder (sometimes called an inference network or recognition model) that produces the parameters of q. As long as z is a continuous variable, we can then back-propagate through samples of z drawn from q(z | x) = q(z; f(x; θ)) to obtain a gradient with respect to θ. Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All the expectations in L may be approximated by Monte Carlo sampling.

The variational autoencoder approach is elegant, theoretically pleasing, and simple to implement. It also obtains excellent results and is among the state-of-the-art approaches to generative modeling. Its main drawback is that samples from variational autoencoders trained on images tend to be somewhat blurry. The causes of this phenomenon are not yet known. One possibility is that the blurriness is an intrinsic effect of maximum likelihood, which minimizes D_KL(p_data ‖ p_model). As illustrated in figure 3.6, this means that the model will assign high probability to points that occur in the training set but may also assign high probability to other points, which may include blurry images. This issue is shared with other generative models that optimize a log-likelihood, or equivalently D_KL(p_data ‖ p_model), as argued by Theis et al. (2015) and by Huszar (2015). Another troubling issue with contemporary VAE models is that they tend to use only a small subset of the dimensions of z, as if the encoder were not able to transform enough of the local directions in input space to a space where the marginal distribution matches the factorized prior.

The VAE framework is straightforward to extend to a wide range of model architectures. This is a key advantage over Boltzmann machines, which require extremely careful model design to maintain tractability. One particularly sophisticated VAE is the deep recurrent attention writer, or DRAW, model (Gregor et al., 2015). DRAW uses a recurrent encoder and recurrent decoder combined with an attention mechanism; its generation process consists of sequentially visiting different small image patches and drawing the values of the pixels at those points. VAEs can also be extended to generate sequences by defining variational RNNs (Chung et al., 2015b), which use a recurrent encoder and decoder within the VAE framework. Generating a sample from a traditional RNN involves only nondeterministic operations at the output space, whereas variational RNNs also have random variability at the potentially more abstract level captured by the latent variables.


The VAE framework has been extended to maximize not only the traditional variational lower bound but also the importance-weighted autoencoder objective (Burda et al., 2015):

    L_k(x, q) = E_{z^(1),...,z^(k)∼q(z|x)} [ log (1/k) Σ_{i=1}^k p_model(x, z^(i)) / q(z^(i) | x) ].    (20.80)

This new objective is equivalent to the traditional lower bound L when k = 1. However, it may also be interpreted as forming an estimate of log p_model(x) using importance sampling of z, with the proposal distribution q(z | x). The importance-weighted autoencoder objective is also a lower bound on log p_model(x) and becomes tighter as k increases.

Variational autoencoders have some interesting connections to the MP-DBM and other approaches that involve back-propagation through the approximate inference graph (Goodfellow et al., 2013b; Stoyanov et al., 2011; Brakel et al., 2013). These previous approaches required an inference procedure, such as mean field fixed-point equations, to provide the computational graph. The variational autoencoder is defined for arbitrary computational graphs, which makes it applicable to a wider range of probabilistic model families, because there is no need to restrict the choice of models to those with tractable mean field fixed-point equations. The variational autoencoder also has the advantage that it increases a bound on the log-likelihood of the model, while the criteria for the MP-DBM and related models are more heuristic and have little probabilistic interpretation beyond making the results of approximate inference accurate. One disadvantage of the variational autoencoder is that it learns an inference network for only one problem, inferring z given x. The older methods are able to perform approximate inference over any subset of variables given any other subset of variables, because the mean field fixed-point equations specify how to share parameters between the computational graphs for all these different problems.

One very nice property of the variational autoencoder is that simultaneously training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture. This makes it an excellent manifold learning algorithm. See figure 20.6 for examples of low-dimensional manifolds learned by the variational autoencoder.

[Figure 20.6: Examples of two-dimensional coordinate systems for high-dimensional manifolds, learned by a variational autoencoder (Kingma and Welling, 2014a). Each image shown is produced by decoding a point z from a two-dimensional grid in latent space through p(x | z). (Left) The 2-D map of the Frey faces manifold: one discovered dimension (horizontal) corresponds mostly to a rotation of the face, while the other (vertical) corresponds to the emotional expression. (Right) The 2-D map of the MNIST manifold.]

20.10.4 Generative Adversarial Networks

Generative adversarial networks, or GANs (Goodfellow et al., 2014c), are another generative modeling approach based on differentiable generator networks.

Generative adversarial networks are based on a game-theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples x = g(z; θ^(g)). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; θ^(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

v( (g) , (d) )
v( (g) , (d) )

g = arg min max v(g, d). (20.81)


g d

v ,

v( (g) , (d) ) = Expdata log d(x) + Expmodel log (1 d(x)) . (20.82)

1
2
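The alternating optimization of equations 20.81 and 20.82 can be illustrated on a one-dimensional toy problem where every gradient is available in closed form (this sketch and its parametrization are illustrative assumptions, not from the text):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gan_toy(data_mean=2.0, steps=5000, lr=0.02, seed=0):
    """Minimal 1-D sketch of the zero-sum game of eqs. 20.81-20.82.
    Generator: x = z + theta_g with z ~ N(0, 1).
    Discriminator: d(x) = sigmoid(a * x + b).
    Alternates one gradient ascent step on v for the discriminator
    with one descent step on v for the generator."""
    rng = np.random.default_rng(seed)
    theta_g, a, b = 0.0, 0.1, 0.0
    for _ in range(steps):
        x_real = data_mean + rng.standard_normal(64)
        x_fake = theta_g + rng.standard_normal(64)
        d_real = sigmoid(a * x_real + b)
        d_fake = sigmoid(a * x_fake + b)
        # Discriminator ascends E log d(x_real) + E log(1 - d(x_fake)).
        a += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
        b += lr * (np.mean(1 - d_real) - np.mean(d_fake))
        # Generator descends E log(1 - d(x_fake)); with dx/dtheta_g = 1,
        # d/dtheta_g log(1 - d(x)) = -d(x) * a.
        x_fake = theta_g + rng.standard_normal(64)
        d_fake = sigmoid(a * x_fake + b)
        theta_g += lr * np.mean(d_fake) * a
    return theta_g
```

As the game proceeds, the generator's offset drifts toward the data mean, at which point the discriminator can no longer separate the two distributions and its gradients vanish.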

The main motivation for the design of GANs is that the learning process requires neither approximate inference nor approximation of a partition function gradient. When max_d v(g, d) is convex in θ^(g) (such as when optimization is performed directly in the space of probability density functions), the procedure is guaranteed to converge and is asymptotically consistent.

Unfortunately, learning in GANs can be difficult in practice when g and d are represented by neural networks and max_d v(g, d) is not convex. Goodfellow (2014) identified nonconvergence as an issue that may cause GANs to underfit. In general, simultaneous gradient descent on two players' costs is not guaranteed to reach an equilibrium. Consider, for example, the value function v(a, b) = ab, where one player controls a and incurs cost ab, while the other player controls b and receives cost −ab. If we model each player as making infinitesimally small gradient steps, each player reducing its own cost at the expense of the other, then a and b go into a stable circular orbit rather than arriving at the equilibrium point at the origin. Note that the equilibria of a minimax game are not local minima of v; they are points that are simultaneously minima for both players' costs. This means they are saddle points of v that are local minima with respect to the first player's parameters and local maxima with respect to the second player's parameters. It is possible for the two players to take turns increasing and then decreasing v forever, rather than landing exactly on the saddle point. It is not known to what extent this nonconvergence problem affects GANs in practice.

Goodfellow (2014) identified an alternative formulation of the payoffs, in which the game is no longer zero-sum, that has the same expected gradient as maximum likelihood learning whenever the discriminator is optimal. Because maximum likelihood training converges, this reformulation of the GAN game should also converge, given enough samples. Unfortunately, this alternative formulation does not seem to improve convergence in practice, possibly because of suboptimality of the discriminator or high variance around the expected gradient.

In realistic experiments, the best-performing formulation of the GAN game is a different formulation that is neither zero-sum nor equivalent to maximum likelihood, introduced by Goodfellow et al. (2014c) with a heuristic motivation. In this formulation, the generator aims to increase the log-probability that the discriminator makes a mistake, rather than aiming to decrease the log-probability that the discriminator makes the correct prediction.

Stabilization of GAN learning remains an open problem. Fortunately, GAN learning performs well when the model architecture and hyperparameters are carefully selected. Radford et al. (2015) crafted a deep convolutional GAN (DCGAN) that performs very well for image synthesis tasks and showed that its latent representation space captures important factors of variation (see figure 15.9). See figure 20.7 for examples of images generated by a DCGAN.

The GAN learning problem can also be simplified by breaking the generation process into many levels of detail. It is possible to train conditional GANs (Mirza and Osindero, 2014) that learn to sample from a distribution p(x | y) rather than simply sampling from a marginal distribution p(x). Denton et al. (2015) showed that a series of conditional GANs can be trained to first generate a very low resolution version of an image and then incrementally add details to it. This technique is called the LAPGAN model, because of the use of a Laplacian pyramid to generate images containing different levels of detail. LAPGAN generators are able to fool not only discriminator networks but also human observers, with experimental subjects identifying up to 40 percent of the outputs of the network as real data. See figure 20.7 for examples of images generated by a LAPGAN generator.

[Figure 20.7: Images generated by GANs trained on the LSUN dataset. (Top) Images of bedrooms generated by a DCGAN model, reproduced with permission from Radford et al. (2015). (Bottom) Images of churches generated by a LAPGAN model, reproduced with permission from Denton et al. (2015).]

One unusual capability of the GAN training procedure is that it can fit probability distributions that assign zero probability to the training points. Rather than maximizing the log-probability of specific points, the generator net learns to trace out a manifold whose points resemble training points in some way.

The idea of pitting a generative model against a trained classifier predates GANs: in self-supervised boosting (Welling et al., 2002), a generative model, such as an ensemble of RBM-like experts, is built up by adding components that fool a classifier trained to distinguish model samples from data.
20.10.5 Generative Moment Matching Networks

Generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015) are another form of generative model based on differentiable generator networks. Unlike VAEs and GANs, they do not need to pair the generator network with any other network: neither an inference network, as used with VAEs, nor a discriminator network, as used with GANs. Generative moment matching networks are trained with a technique called moment matching.

The basic idea behind moment matching is to train the generator in such a way that many of the statistics of samples generated by the model are as similar as possible to the statistics of the examples in the training set. In this context, a moment is an expectation of different powers of a random variable. For example, the first moment is the mean, the second moment is the mean of the squared values, and so on. In multiple dimensions, each element of the random vector may be raised to different powers, so that a moment may be any quantity of the form

    E_x[ Π_i x_i^{n_i} ],    (20.83)

where n = [n_1, n_2, ..., n_d]ᵀ is a vector of nonnegative integers.

Upon first examination, this approach seems computationally infeasible. For example, if we want to match all the moments of the form x_i x_j, then we need to minimize the difference between a number of values that is quadratic in the dimension of x. Moreover, even matching all the first and second moments would only be sufficient to fit a multivariate Gaussian distribution, which captures only linear relationships between values. Our ambitions for neural networks are to capture complex nonlinear relationships, which would require far more moments. GANs avoid this problem of exhaustively enumerating all moments by using a discriminator that dynamically updates itself to focus on whatever statistics the generator currently matches worst.

Instead of exhaustively enumerating all moments, generative moment matching networks minimize a cost function called maximum mean discrepancy, or MMD (Schölkopf and Smola, 2002; Gretton et al., 2012). This cost function measures the difference between two distributions in terms of the difference between their first moments, computed in a feature space defined implicitly by a kernel function, after mapping the data to an infinite-dimensional space, so that the MMD cost is zero if and only if the two distributions being compared are equal.

Visually, the samples from generative moment matching networks trained directly on pixels are somewhat disappointing, but they can be improved by first training an autoencoder on the data and then training the generator network to emit samples of the autoencoder's compact code, which are subsequently decoded into visible samples. As with GANs, it is possible to train a generator net using MMD even when that generator assigns zero probability to the training points.
20.10.6 Convolutional Generative Networks

When generating images, it is often useful to use a generator network that includes a convolutional structure (see, for example, Goodfellow et al. (2014c) and Dosovitskiy et al. (2015)). To do so, we use the "transpose" of the convolution operator, described in section 9.5. This approach often yields more realistic images, and it does so using fewer parameters than would be needed without parameter sharing.

Convolutional networks for recognition tasks have information flow from the image to some summarization layer at the top of the network. As the image flows upward through the network, information is discarded as the representation becomes more invariant to nuisance transformations. In a generator network, the opposite is true: rich detail must be added as the representation propagates through the network, culminating in the final representation, which is the image itself with all its fine detail.

The primary mechanism for discarding information in a convolutional recognition network is the pooling layer, so the generator network needs something like an inverse of a pooling layer. Most pooling operations are not invertible, but Dosovitskiy et al. (2015) introduced an "un-pooling" layer that corresponds to the inverse of max pooling under certain simplifying conditions: the stride of the max pooling is constrained to be equal to the width of the pooling region, the maximum input within each pooling region is assumed to lie in a fixed position (the upper-left corner), and all nonmaximal inputs are assumed to be zero. Even though the assumptions motivating the un-pooling operation are unrealistic, images generated with it can be visually pleasing.

20.10.7 Auto-Regressive Networks

Auto-regressive networks are directed probabilistic models with no latent random variables. The conditional probability distributions in these models are represented by neural networks (sometimes extremely simple neural networks, such as logistic regression). The graph structure of these models is the complete graph. They decompose a joint probability over the observed variables using the chain rule of probability, yielding a product of conditionals of the form P(x_d | x_{d−1}, ..., x_1). Such models have been called fully-visible Bayes networks (FVBNs) and have been used successfully in many forms: first with logistic regression for each conditional distribution (Frey, 1998), and then with neural networks containing hidden units (Bengio and Bengio, 2000b; Larochelle and Murray, 2011). In some forms of auto-regressive networks, such as NADE (Larochelle and Murray, 2011), described in section 20.10.10, parameter sharing brings both a statistical advantage (fewer unique parameters) and a computational advantage (less computation).
[Figure 20.8: A fully visible belief network (FVBN) predicts the i-th variable from the i − 1 preceding ones, via the conditionals P(x_1), P(x_2 | x_1), P(x_3 | x_1, x_2), and P(x_4 | x_1, x_2, x_3). (Top) The directed graphical model for an FVBN. (Bottom) The corresponding computational graph, where each prediction is made by a linear predictor in the case of the logistic FVBN.]

20.10.8 Linear Auto-Regressive Networks

The simplest form of auto-regressive network has no hidden units and no sharing of parameters or features. Each P(x_i | x_{i−1}, ..., x_1) is parametrized as a linear model: linear regression for real-valued data, logistic regression for binary data, or softmax regression for discrete data. This model was introduced by Frey (1998) and has O(d²) parameters when there are d variables. It is illustrated in figure 20.8. If the variables are continuous, a linear auto-regressive model is merely another way to formulate a multivariate Gaussian distribution, capturing only linear pairwise interactions between the observed variables. Linear auto-regressive networks are essentially the generalization of linear classification methods to generative modeling, and they therefore share the limited capacity of linear classifiers.

[Figure 20.9: A neural auto-regressive network predicts the i-th variable x_i from the i − 1 previous ones, but is parametrized so that features (groups of hidden units denoted h_i) that are functions of x_1, ..., x_i can be reused in predicting all the subsequent variables x_{i+1}, x_{i+2}, ..., x_d.]


20.10.9 Neural Auto-Regressive Networks

Neural auto-regressive networks (Bengio and Bengio, 2000a,b) have the same left-to-right graphical model as logistic auto-regressive networks (figure 20.8) but employ a different parametrization of the conditional distributions. The new parametrization is more powerful in the sense that its capacity can be increased as much as needed, allowing approximation of any joint distribution. It can also improve generalization by introducing the parameter and feature sharing principles common to deep learning in general. The models were motivated by the objective of avoiding the curse of dimensionality that arises in traditional tabular graphical models sharing the same structure as figure 20.8, in which each conditional distribution is represented by a table of probabilities whose size is exponential in the number of conditioning variables.

Parametrizing each conditional with a neural network instead brings two advantages:

1. Parametrizing each P(x_i | x_{i−1}, ..., x_1) by a neural network with (i − 1) × k inputs and k outputs (if the variables are discrete and take k values, encoded one-hot) allows estimating the conditional probability without an exponential number of parameters (or examples), while still capturing high-order dependencies between the random variables.

2. Instead of using a different neural network to predict each x_i, a left-to-right connectivity, illustrated in figure 20.9, allows all the neural networks to be merged into one. Equivalently, the hidden layer features computed for predicting x_i can be reused for predicting x_{i+k} (k > 0). The hidden units are thus organized in groups with the particular property that all the units in the i-th group depend only on the input values x_1, ..., x_i. The parameters used to compute these hidden units are jointly optimized to improve the prediction of all the variables in the sequence. This is another instance of the reuse principle that recurs throughout deep learning.

Each conditional P(x_i | x_{i−1}, ..., x_1) can represent a rich distribution, for example via a softmax output (section 6.2.1.1) when x_i is discrete.

20.10.10 NADE

The neural auto-regressive density estimator (NADE) is a very successful recent form of neural auto-regressive network (Larochelle and Murray, 2011). The connectivity is the same as for the original neural auto-regressive network of Bengio and Bengio (2000b), but NADE introduces an additional parameter sharing scheme, illustrated in figure 20.10: the parameters of the hidden units of different groups j are shared.

Specifically, the weights W′_{j,k,i} from the i-th input x_i to the k-th element of the j-th group of hidden units h_k^(j) (for j ≥ i) are shared among the groups:

    W′_{j,k,i} = W_{k,i}.    (20.84)

The remaining weights, for which j < i, are zero.

[Figure 20.10: An illustration of the neural auto-regressive density estimator (NADE). The hidden units are organized in groups h^(j), so that only the inputs x_1, ..., x_i participate in computing h^(i) and predicting P(x_j | x_{j−1}, ..., x_1) for j > i. The feature distinguishing NADE from earlier neural auto-regressive networks is the weight sharing pattern W′_{j,k,i} = W_{k,i}: the weights going out of x_i to the k-th unit of any group j ≥ i are shared. Recall that the vector (W_{1,i}, W_{2,i}, ..., W_{n,i}) is denoted W_{:,i}.]
Larochelle and Murray (2011) chose this sharing scheme so that forward propagation in a NADE model loosely resembles the computations performed in mean field inference to fill in missing inputs in an RBM.

This mean field inference corresponds to running a recurrent network with shared weights, and the first step of that inference is the same as in NADE. The only difference is that in NADE, the output weights connecting the hidden units to the outputs are parametrized independently from the weights connecting the inputs to the hidden units, whereas in the RBM the hidden-to-output weights are the transpose of the input-to-hidden weights. The NADE architecture can be extended to mimic not just one time step of the mean field recurrent inference but k steps. This approach is called NADE-k (Raiko et al., 2014).


Using mixture density outputs (see section 3.9.6), the neural auto-regressive framework can also be extended to real-valued data. The resulting model, RNADE (Uria et al., 2013), represents each conditional P(x_i | x_{i−1}, ..., x_1) with a mixture of Gaussians whose component means μ_i, variances σ_i², and mixing proportions are computed by the network. One tricky point is that stochastic gradient descent on the means μ_i and variances σ_i² of conditional Gaussian mixtures can be numerically ill-behaved, because of the interactions between the two quantities; Uria et al. (2013) therefore used a pseudogradient that replaces the gradient on the means during the back-propagation phase.
Another very interesting extension of the neural auto-regressive architectures removes the need to choose an arbitrary order for the observed variables (Murray and Larochelle, 2014). Since many orders of variables are possible (n! for n variables), and each order o of the variables yields a different p(x | o), we can form an ensemble of models over many values of o:

    p_ensemble(x) = (1/k) Σ_{i=1}^k p(x | o^(i)).    (20.85)

This ensemble model usually generalizes better and assigns higher probability to the test set than does an individual model defined by a single ordering.



The same orderless approach also makes deep versions of these models practical. A single-hidden-layer NADE requires only O(nh) computation to evaluate all n conditionals, where h is the number of hidden units per group, whereas the earlier architecture of Bengio and Bengio (2000b) requires O(n²h). Unfortunately, the computational savings from the weight sharing of figures 20.10 and 20.9 do not automatically extend beyond the first hidden layer: naively evaluating every conditional of a NADE with additional hidden layers of size h costs O(n²h²), because each layer l + 1 must be recomputed for each of the n conditioning contexts at layer l. Murray and Larochelle (2014) showed how to reduce this cost to O(nh²) by sharing computation across the conditionals, making deep NADE models feasible, though still more expensive than a single-hidden-layer NADE.
20.11 Drawing Samples from Autoencoders

In chapter 14, we saw that many kinds of autoencoders learn the data distribution. There are close connections between score matching, denoising autoencoders, and contractive autoencoders that establish this. Some kinds of autoencoders, such as the variational autoencoder, explicitly represent a probability distribution and admit straightforward ancestral sampling. Most other kinds of autoencoders require MCMC sampling.

Contractive autoencoders are designed to recover an estimate of the tangent plane of the data manifold. This means that repeated encoding and decoding with injected noise induces a random walk along the surface of the manifold (Rifai et al., 2012; Mesnil et al., 2012). This manifold diffusion technique is a kind of Markov chain. There is also a more general Markov chain that can sample from any denoising autoencoder.

20.11.1 Markov Chain Associated with Any Denoising Autoencoder

The discussion above left open the question of what noise to inject and where, to obtain a Markov chain that generates from the distribution estimated by the autoencoder. Bengio et al. (2013c) showed how to construct such a Markov chain for generalized denoising autoencoders. Generalized denoising autoencoders are specified by a denoising distribution for sampling an estimate of the clean input given the corrupted input.

[Figure 20.11: Each step of the Markov chain associated with a trained denoising autoencoder, which generates samples from the probabilistic model implicitly trained by the denoising log-likelihood criterion. Each step consists of (a) injecting corruption C into state x, yielding x̃; (b) encoding x̃ with the function f, yielding h = f(x̃); (c) decoding h with the function g, yielding the parameters ω = g(h) of the reconstruction distribution; and (d) sampling the next state x from the reconstruction distribution p(x | ω = g(f(x̃))). In the typical squared-error case, g(h) = x̂, which estimates E[x | x̃]; corruption consists of adding Gaussian noise; and sampling from p(x | ω) consists of adding Gaussian noise a second time, to the reconstruction x̂. The latter noise level should correspond to the mean squared error of reconstructions, whereas the injected corruption level is a hyperparameter (Vincent, 2011). The generalization to arbitrary C, p, f, and g is given by Bengio et al. (2014).]

Each step of the Markov chain that generates from the estimated distribution (illustrated in figure 20.11) consists of the following substeps:

1. Starting from the previous state x, inject corruption noise by sampling x̃ from C(x̃ | x).
2. Encode x̃ into h = f(x̃).
3. Decode h to obtain the parameters ω = g(h) of the reconstruction distribution p(x | ω = g(h)) = p(x | x̃).
4. Sample the next state x from p(x | ω = g(h)) = p(x | x̃).

Bengio et al. (2014) showed that if the autoencoder p(x | x̃) forms a consistent estimator of the corresponding true conditional distribution, then the stationary distribution of the above Markov chain forms a consistent estimator (albeit an implicit one) of the data-generating distribution of x.
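The four substeps above can be verified on a toy one-dimensional case where the data distribution is N(0, 1) and the exact denoising distribution is known in closed form; with the optimal denoiser, the chain's stationary distribution recovers the data distribution. (The setup is an illustrative assumption, not from the text.)

```python
import numpy as np

def dae_markov_chain(n_steps, sigma_corrupt, rng):
    """Sampling chain of the figure-20.11 form for 1-D data x ~ N(0, 1)
    corrupted by additive N(0, sigma_corrupt^2) noise. The exact
    posterior p(x | x_tilde) is Gaussian with known mean and variance."""
    s2 = sigma_corrupt ** 2
    post_mean_coef = 1.0 / (1.0 + s2)   # E[x | x_tilde] = x_tilde / (1 + s^2)
    post_var = s2 / (1.0 + s2)          # Var[x | x_tilde]
    x = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_tilde = x + sigma_corrupt * rng.standard_normal()    # substep 1
        omega = post_mean_coef * x_tilde                       # substeps 2-3
        x = omega + np.sqrt(post_var) * rng.standard_normal()  # substep 4
        samples[t] = x
    return samples
```

A long run of this chain has sample mean near 0 and sample variance near 1, matching the data-generating distribution, exactly as the consistency result of Bengio et al. (2014) predicts for an ideal denoiser.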


20.11.2 Clamping and Conditional Sampling

Similar to Boltzmann machines, denoising autoencoders and their generalizations (such as GSNs, described below) can be used to sample from a conditional distribution p(x_f | x_o), simply by clamping the observed units x_o and resampling only the free units x_f, given x_o and the sampled latent variables (if any). For example, MP-DBMs can be interpreted as a form of denoising autoencoder, and they are able to sample missing inputs. GSNs later generalized some of the ideas present in MP-DBMs to perform the same operation (Bengio et al., 2014). Alain et al. (2015) identified a missing condition in proposition 1 of Bengio et al. (2014): the transition operator should satisfy detailed balance, meaning that a Markov chain at equilibrium remains at equilibrium whether run forward or backward.
An experiment in clamping the right half of an image and running the Markov chain over the left half is shown in figure 20.12.

[Figure 20.12: Illustration of clamping the right half of the image and running the Markov chain by resampling only the left half at each step. The samples come from a GSN trained on MNIST digits, reconstructing at each step with the walk-back procedure.]

20.11.3 Walk-Back Training Procedure

The walk-back training procedure was proposed by Bengio et al. (2013c) as a way to accelerate the convergence of generative training of denoising autoencoders. Instead of performing a one-step encode-decode reconstruction, this procedure consists of multiple alternating stochastic encode-decode steps (as in the generative Markov chain), initialized at a training example (just as in the contrastive divergence algorithm, described in section 18.2), and penalizing the last probabilistic reconstruction (or all the reconstructions along the way). Training with k steps is equivalent (in the sense of achieving the same stationary distribution) to training with one step, but practically it has the advantage that spurious modes farther from the data can be removed more efficiently.
20.12 Generative Stochastic Networks

Generative stochastic networks, or GSNs (Bengio et al., 2014), are generalizations of denoising autoencoders that include latent variables h in the generative Markov chain, in addition to the visible variables (usually denoted x).

A GSN is parametrized by two conditional probability distributions that specify one step of the Markov chain:

1. p(x^(k) | h^(k)) tells how to generate the next visible variable given the current latent state. Such a "reconstruction distribution" is also found in denoising autoencoders, RBMs, DBNs, and DBMs.
2. p(h^(k) | h^(k−1), x^(k−1)) tells how to update the latent state variable, given the previous latent state and visible variable.

Denoising autoencoders and GSNs differ from classical probabilistic models (directed or undirected) in that they parametrize the generative process itself rather than the mathematical specification of the joint distribution of visible and latent variables. Instead, the joint distribution is defined implicitly, if it exists, as the stationary distribution of the generative Markov chain. The conditions for existence of the stationary distribution are mild and are the same conditions required by standard MCMC methods (see section 17.3).

These conditions are necessary to guarantee that the chain mixes, but they can be violated by some choices of the transition distributions (for example, if they were deterministic).

One could imagine various training criteria for GSNs. The one proposed and evaluated by Bengio et al. (2014) is simply the reconstruction log-probability on the visible units, as for denoising autoencoders. This is achieved by clamping x^(0) = x to the observed example and maximizing the probability of generating x at some subsequent time steps, that is, maximizing log p(x^(k) = x | h^(k)), where h^(k) is sampled from the chain given x^(0) = x. To estimate the gradient of log p(x^(k) = x | h^(k)) with respect to the other pieces of the model, Bengio et al. (2014) use the reparametrization trick introduced in section 20.9.

The walk-back training procedure (described in section 20.11.3) was used by Bengio et al. (2014) to improve training convergence of GSNs.

20.12.1 Discriminant GSNs

The original formulation of GSNs (Bengio et al., 2014) was meant for unsupervised learning and implicit modeling of p(x) for observed data x, but it is possible to modify the framework to optimize p(y | x).

For example, Zhou and Troyanskaya (2014) generalized GSNs in this way, by back-propagating the reconstruction log-probability over only the output variables, keeping the input variables fixed. They applied this successfully to model sequences (protein secondary structure) and introduced a (one-dimensional) convolutional structure in the transition operator of the Markov chain. Keep in mind that, for each step of the Markov chain, one generates a new sequence for each layer, and that sequence is the input for computing the other layer values (say, the one below and the one above) at the next time step. Hence the Markov chain is really over the output variable (and associated higher-level hidden layers), while the input sequence only serves to condition that chain. This is therefore a use of the GSN framework for conditional modeling.

Zöhrer and Pernkopf (2014) introduced a hybrid model that combines the supervised objective with the unsupervised one, by simply adding (with a different weight) the reconstruction log-probabilities of y and x, respectively. Such a hybrid criterion had previously been introduced for RBMs by Larochelle and Bengio (2008). Improved classification performance was shown with this scheme.

20.13 Other Generation Schemes

The methods we have described so far use either MCMC sampling, ancestral sampling, or some mixture of the two to generate samples. Many other approaches to generative modeling exist.

Sohl-Dickstein et al. (2015) developed a diffusion inversion training scheme for learning a generative model, based on nonequilibrium thermodynamics. The approach rests on the idea that the probability distributions we wish to sample from have structure, and that this structure can be gradually destroyed by a diffusion process that incrementally moves the probability distribution toward one with more entropy. To form a generative model, we can run the process in reverse, training a model that gradually restores structure to an unstructured distribution. By iteratively applying a process that brings a distribution slightly closer to the target one, we can gradually approach that target distribution. This approach requires many steps, because each step makes only a small perturbation.

Another approach is approximate Bayesian computation, or ABC (Rubin et al., 1984). In this approach, samples are rejected or modified to make the moments of selected functions of the samples match those of the desired distribution. Unlike moment matching (section 20.10.5), this idea manipulates the samples themselves rather than training the model to automatically emit samples with the correct moments. Bachman and Precup (2015) used an ABC-inspired idea within deep learning, developing a GSN-like MCMC sampler.

We expect that many other possible approaches to generative modeling await discovery.
20.14 Evaluating Generative Models

Researchers studying generative models often need to compare one generative model to another, usually in an attempt to demonstrate that a newly invented generative model captures some distribution better than the preexisting models do.

This can be a difficult and subtle task. Often, we cannot actually evaluate the log-probability of the data under the model but can evaluate only an approximation. In these cases, it is important to think and communicate clearly about exactly what is being measured. For example, suppose we can evaluate a stochastic estimate of the log-likelihood for model A and a deterministic lower bound on the log-likelihood for model B. If model A gets a higher score than model B, which is better? If we care about determining which model has a better internal representation of the distribution, we actually cannot tell, unless we have some way of determining how loose the bound for model B is: B's true log-likelihood could be higher than its reported score.

Models with intractable partition functions raise a related difficulty: evaluating log p(x) requires subtracting an estimate of log Z, typically obtained with annealed importance sampling (AIS), from the unnormalized log-probability. Errors in the estimate of Z translate directly into errors in the reported log p(x); in particular, an underestimate of Z yields an overestimate of the log-likelihood.

Evaluations of generative models are further complicated by the fact that seemingly innocuous changes to data preprocessing are unacceptable when comparing log-likelihoods; for example, merely rescaling a continuous input by a factor of 0.1 changes the reported density, as if the likelihood had been multiplied by 10 per dimension. Preprocessing issues arise frequently with the MNIST dataset, one of the more popular generative modeling benchmarks, because some kinds of models treat MNIST as real-valued while others binarize it. Among models of binarized MNIST, some binarize the data once, while others treat each grayscale pixel value in [0, 1] as a Bernoulli probability and draw a fresh random binarization every time an example is presented, so that a pixel of value 0.5 becomes a 0 or a 1 with equal probability. Comparisons between models evaluated under these different schemes are not meaningful.

Because we cannot always evaluate likelihoods, many researchers evaluate generative models by visually inspecting samples. In the best case, this is done with a careful experimental protocol, for example having human subjects judge whether samples are real or generated (Denton et al., 2015). Visual sample quality is not, however, a reliable guide to model quality: a poor model can produce visually pleasing samples by effectively memorizing a small number of training examples, in the spirit of the lookup-table extreme discussed in section 16.1, while assigning low probability to held-out data x.

Finally, sample quality on datasets such as MNIST can be suggestive, but it is not a substitute for quantitative evaluation. Theis et al. (2015) reviewed many of the issues involved in evaluating generative models, including many of the ideas described above. They highlight the fact that there are many different uses of generative models and that the choice of metric must match the intended use of the model. For example, some generative models are better at assigning high probability to most realistic points, while others are better at rarely assigning high probability to unrealistic points. These differences can result from whether a generative model is designed to minimize D_KL(p_data ‖ p_model) or D_KL(p_model ‖ p_data), as illustrated in figure 3.6.
20.15 Conclusion

Training generative models with hidden units is a powerful way to make models understand the world represented in the given training data. By learning a model p_model(x) and a representation p_model(h | x), a generative model can provide answers to many inference problems about the relationships between input variables in x, and it can offer many different ways of representing x by taking expectations of h at various layers of the hidden units. Generative models hold the promise to provide AI systems with a framework for all the many different intuitive concepts they need to understand, giving them the ability to reason about these concepts in the face of uncertainty. We hope that our readers will find new ways to make these approaches more powerful and continue the journey to understanding the principles that underlie learning and intelligence.
