
CHAPTER 1

Using neural nets to recognize handwritten digits

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples, and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve the accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in a position to understand what deep learning is, and why it matters.

Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \tag{1}$$

That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:
1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car.)

We can represent these three factors by corresponding binary variables $x_1$, $x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptrons in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \text{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\text{threshold}$.

Using the bias instead of the threshold, the perceptron rule can be rewritten:

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2}$$

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
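To make the weights-and-bias form concrete, here is a minimal sketch of mine (not part of the book's program) of a perceptron in Numpy; the function name perceptron_output is my own illustrative choice, and the example weights and bias come from the cheese-festival example above, with threshold 5 rewritten as bias b = -5.

import numpy as np

def perceptron_output(w, x, b):
    """Return 1 if w.x + b > 0, and 0 otherwise, as in Equation (2)."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([6, 2, 2])   # weights for weather, companion, transit
b = -5                    # bias corresponding to a threshold of 5
print(perceptron_output(w, np.array([1, 0, 0]), b))  # good weather only -> 1
print(perceptron_output(w, np.array([0, 1, 1]), b))  # bad weather -> 0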

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

Then we see that the input $00$ produces the output $1$, since $(-2) \cdot 0 + (-2) \cdot 0 + 3 = 3$ is positive. Here, I've introduced the $\cdot$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce the output $1$. But the input $11$ produces the output $0$, since $(-2) \cdot 1 + (-2) \cdot 1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
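As a quick check (a sketch of mine, not from the text), the following loop evaluates that perceptron on all four input pairs and reproduces the NAND truth table.

import numpy as np

def perceptron_output(w, x, b):
    """Perceptron rule from Equation (2)."""
    return 1 if np.dot(w, x) + b > 0 else 0

w, b = np.array([-2, -2]), 3
for x1 in (0, 1):
    for x2 in (0, 1):
        out = perceptron_output(w, np.array([x1, x2]), b)
        print("%d NAND %d = %d" % (x1, x2, out))  # outputs 1, 1, 1, 0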

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double use of an output was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines into a single connection with a weight of -4 instead of two connections with -2 weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked:
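To see the adder circuit in code form, here is a hedged sketch that builds the two-bit adder entirely out of the NAND perceptron above; the helper names nand and add_two_bits are mine, but the wiring follows the circuit described in the text.

import numpy as np

def nand(x1, x2):
    """NAND implemented as a perceptron with weights -2, -2 and bias 3."""
    return 1 if np.dot([-2, -2], [x1, x2]) + 3 > 0 else 0

def add_two_bits(x1, x2):
    """Return (sum_bit, carry_bit), using only NAND perceptrons."""
    a = nand(x1, x2)
    sum_bit = nand(nand(x1, a), nand(x2, a))  # bitwise sum, x1 XOR x2
    carry = nand(a, a)                        # carry bit, x1 AND x2
    return sum_bit, carry

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, add_two_bits(x1, x2))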

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

If it were true that a small change in a weight causes only a small change in the output, then we could use this fact to modify the weights and biases so our network behaves more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$ But instead of being just $0$ or $1$, these inputs can also take on any value between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}$$

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$\frac{1}{1 + \exp\!\left(-\sum_j w_j x_j - b\right)}. \tag{4}$$
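Here's a small sketch of Equations (3) and (4) in Numpy; it is my own illustration, and the example weights, inputs, and bias are arbitrary made-up values.

import numpy as np

def sigmoid(z):
    """The sigmoid function of Equation (3)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron_output(w, x, b):
    """Output of a sigmoid neuron, Equation (4)."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.7, -1.2, 0.4])   # example weights
x = np.array([0.3, 0.9, 0.5])    # inputs may take any value between 0 and 1
b = -0.5                         # example bias
print(sigmoid_neuron_output(w, x, b))  # a real number strictly between 0 and 1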

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \to \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
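A quick numerical check of this claim (a sketch of mine, not part of the text): far from zero the sigmoid is essentially 0 or 1, and only near zero does it deviate from the perceptron's output.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(z, sigmoid(z))  # ~0, 0.119, 0.5, 0.881, ~1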

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

[Figure: plot of the sigmoid function, rising smoothly from 0.0 to 1.0 as z runs from -4 to 4.]

This shape is a smoothed out version of a step function:
[Figure: plot of the step function, jumping from 0.0 to 1.0 at z = 0, for z from -4 to 4.]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w \cdot x + b$ was positive or negative*. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by

$$\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b, \tag{5}$$

where the sum is over all the weights, $w_j$, and $\partial\, \text{output} / \partial w_j$ and $\partial\, \text{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively.

Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
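The linear approximation (5) is easy to test numerically. The sketch below (my own, with arbitrary example values) uses the known derivative of the sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, to predict how the output changes when one weight is nudged, and compares that prediction with the actual change.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])
x = np.array([0.8, 0.2])
b = 0.1
z = np.dot(w, x) + b

# Predicted change: (d output / d w_0) * delta_w0, with d output / d w_0 = sigma'(z) * x_0.
delta_w0 = 0.001
predicted = sigmoid(z) * (1 - sigmoid(z)) * x[0] * delta_w0

# Actual change in the output when w_0 is nudged by delta_w0.
w_new = w.copy()
w_new[0] += delta_w0
actual = sigmoid(np.dot(w_new, x) + b) - sigmoid(z)

print(predicted, actual)  # the two agree to several decimal places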

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly-used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron?


Obviously, one big difference between perceptrons and sigmoid
neurons is that sigmoid neurons don't just output 0 or 1. They can
have as output any real number between 0 and 1, so values such as
0.173 and 0.689 are legitimate outputs. This can be useful, for
example, if we want to use the output value to represent the average
intensity of the pixels in an image input to a neural network. But
sometimes it can be a nuisance. Suppose we want the output from
the network to indicate either "the input image is a 9" or "the input
image is not a 9". Obviously, it'd be easiest to do this if the output
was a 0 or a 1, as in a perceptron. But in practice we can set up a
convention to deal with this, for example, by deciding to interpret
any output of at least 0.5 as indicating a "9", and any output less
than 0.5 as indicating "not a 9". I'll always explicitly state when
we're using such a convention, so it shouldn't cause any confusion.

Exercises
Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of
perceptrons, and multiply them by a positive constant, c > 0.
Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II


Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \to \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have 4,096 = 64 × 64 input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9".

While the design of the input and output layers of a neural network
is often straightforward, there can be quite an art to the design of
the hidden layers. In particular, it's not possible to sum up the
design process for the hidden layers with a few simple rules of
thumb. Instead, neural networks researchers have developed many
design heuristics for the hidden layers, which help people get the
behaviour they want out of their nets. For example, such heuristics
can be used to help determine how to trade off the number of
hidden layers against the time required to train the network. We'll
meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output


from one layer is used as input to the next layer. Such networks are
called feedforward neural networks. This means there are no loops
in the network - information is always fed forward, never fed back.
If we did have loops, we'd end up with situations where the input to
the function depended on the output. That'd be hard to make
sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in


which feedback loops are possible. These models are called
recurrent neural networks. The idea in these models is to have
neurons which fire for some limited duration of time, before
becoming quiescent. That firing can stimulate other neurons, which
may fire a little while later, also for a limited duration. That causes
still more neurons to fire, and so over time we get a cascade of
neurons firing. Loops don't cause problems in such a model, since a
neuron's output only affects its input at some later time, not
instantaneously.

Recurrent neural nets have been less influential than feedforward


networks, in part because the learning algorithms for recurrent nets
are (at least to date) less powerful. But recurrent networks are still
extremely interesting. They're much closer in spirit to how our
brains work than feedforward networks. And it's possible that
recurrent networks can solve important problems which can only be
solved with great difficulty by feedforward networks. However, to
limit our scope, in this book we're going to concentrate on the more
widely-used feedforward networks.

A simple network to classify handwritten digits
Having defined neural networks, let's return to handwriting
recognition. We can split the problem of recognizing handwritten
digits into two sub-problems. First, we'd like a way of breaking an
image containing many digits into a sequence of separate images,
each containing a single digit. For example, we'd like to break the
image

into six separate images,


We humans solve this segmentation problem with ease, but it's
challenging for a computer program to correctly break up the
image. Once the image has been segmented, the program then
needs to classify each individual digit. So, for instance, we'd like our
program to recognize that the first digit above,

is a 5.

We'll focus on writing a program to solve the second problem, that


is, classifying individual digits. We do this because it turns out that
the segmentation problem is not so difficult to solve, once you have
a good way of classifying individual digits. There are many
approaches to solving the segmentation problem. One approach is
to trial many different ways of segmenting the image, using the
individual digit classifier to score each trial segmentation. A trial
segmentation gets a high score if the individual digit classifier is
confident of its classification in all segments, and a low score if the
classifier is having a lot of trouble in one or more segments. The
idea is that if the classifier is having trouble somewhere, then it's
probably having trouble because the segmentation has been chosen
incorrectly. This idea and other variations can be used to solve the
segmentation problem quite well. So instead of worrying about
segmentation we'll concentrate on developing a neural network
which can solve the more interesting and difficult problem, namely,
recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural


network:

The input layer of the network contains neurons encoding the


values of the input pixels. As discussed in the next section, our
training data for the network will consist of many 28 by 28 pixel
images of scanned handwritten digits, and so the input layer
contains 784 = 28 × 28 neurons. For simplicity I've omitted most of
the 784 input neurons in the diagram above. The input pixels are
greyscale, with a value of 0.0 representing white, a value of 1.0
representing black, and in between values representing gradually
darkening shades of grey.
The second layer of the network is a hidden layer. We denote the
number of neurons in this hidden layer by n, and we'll experiment
with different values for n. The example shown illustrates a small
hidden layer, containing just n = 15 neurons.

The output layer of the network contains 10 neurons. If the first


neuron fires, i.e., has an output ≈ 1, then that will indicate that the
network thinks the digit is a 0. If the second neuron fires then that
will indicate that the network thinks the digit is a 1. And so on. A
little more precisely, we number the output neurons from 0 through
9, and figure out which neuron has the highest activation value. If
that neuron is, say, neuron number 6, then our network will guess
that the input digit was a 6. And so on for the other output neurons.
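In code, the "highest activation wins" convention might look like the following sketch; it is my own illustration, and output_activations is a hypothetical stand-in for the 10-component output vector produced by the network.

import numpy as np

# Hypothetical activations of the 10 output neurons, indexed 0 through 9.
output_activations = np.array(
    [0.02, 0.01, 0.05, 0.03, 0.01, 0.04, 0.92, 0.06, 0.02, 0.03])

# The network's guess is the index of the most active output neuron.
guess = np.argmax(output_activations)
print(guess)  # 6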

You might wonder why we use 10 output neurons. After all, the goal
of the network is to tell us which digit (0, 1, 2, …, 9) corresponds to
the input image. A seemingly natural way of doing that is to use just
4 output neurons, treating each neuron as taking on a binary value,
depending on whether the neuron's output is closer to 0 or to 1.
Four neurons are enough to encode the answer, since $2^4 = 16$ is
more than the 10 possible values for the input digit. Why should our
network use 10 neurons instead? Isn't that inefficient? The ultimate
justification is empirical: we can try out both network designs, and
it turns out that, for this particular problem, the network with 10
output neurons learns to recognize digits better than the network
with 4 output neurons. But that leaves us wondering why using 10
output neurons works better. Is there some heuristic that would tell
us in advance that we should use the 10-output encoding instead of
the 4-output encoding?
To understand why we do this, it helps to think about what the
neural network is doing from first principles. Consider first the case
where we use 10 output neurons. Let's concentrate on the first
output neuron, the one that's trying to decide whether or not the
digit is a 0. It does this by weighing up evidence from the hidden
layer of neurons. What are those hidden neurons doing? Well, just
suppose for the sake of argument that the first neuron in the hidden
layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with


the image, and only lightly weighting the other inputs. In a similar
way, let's suppose for the sake of argument that the second, third,
and fourth neurons in the hidden layer detect whether or not the
following images are present:

As you may have guessed, these four images together make up the 0
image that we saw in the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can
conclude that the digit is a 0. Of course, that's not the only sort of
evidence we can use to conclude that the image was a 0 - we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise
There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

Learning with gradient descent
Now that we have a design for our neural network, how can it learn
to recognize digits? The first thing we'll need is a data set to learn
from - a so-called training data set. We'll use the MNIST data set,
which contains tens of thousands of scanned images of handwritten
digits, together with their correct classifications. MNIST's name
comes from the fact that it is a modified subset of two data sets
collected by NIST, the United States' National Institute of
Standards and Technology. Here's a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at
the beginning of this chapter as a challenge to recognize. Of course,
when testing our network we'll ask it to recognize images which
aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000
images to be used as training data. These images are scanned
handwriting samples from 250 people, half of whom were US
Census Bureau employees, and half of whom were high school
students. The images are greyscale and 28 by 28 pixels in size. The
second part of the MNIST data set is 10,000 images to be used as
test data. Again, these are 28 by 28 greyscale images. We'll use the
test data to evaluate how well our neural network has learned to
recognize digits. To make this a good test of performance, the test
data was taken from a different set of 250 people than the original
training data (albeit still a group split between Census Bureau
employees and high school students). This helps give us confidence
that our system can recognize digits from people whose writing it
didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For example, if a particular training image, $x$, depicts a 6, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
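As a small illustration (my own sketch, not part of the book's program), the desired output $y(x)$ for a given digit can be built like this:

import numpy as np

def desired_output(digit):
    """Return the 10-dimensional column vector y(x) with a 1.0 in the
    position corresponding to the digit and zeroes elsewhere."""
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

print(desired_output(6).transpose())  # [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]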

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function*:

$$C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}$$

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\|v\|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w, b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w, b)$ becomes small, i.e., $C(w, b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w, b) \approx 0$. By contrast, it's not doing so well when $C(w, b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w, b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
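For concreteness, Equation (6) translates directly into a few lines of Numpy. The sketch below is my own illustration, not part of the book's 74-line program; it assumes training_data is a list of (x, y) pairs and that net.feedforward returns the network's output vector for an input x, as defined later in this chapter.

import numpy as np

def quadratic_cost(net, training_data):
    """Return C(w, b) = (1 / 2n) * sum_x ||y(x) - a||^2, as in Equation (6)."""
    n = len(training_data)
    total = sum(np.linalg.norm(y - net.feedforward(x)) ** 2
                for x, y in training_data)
    return total / (2.0 * n)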

Why introduce the quadratic cost? After all, aren't we primarily


interested in the number of images correctly classified by the
network? Why not try to maximize that number directly, rather
than minimizing a proxy measure like the quadratic cost? The
problem with that is that the number of images correctly classified
is not a smooth function of the weights and biases in the network.
For the most part, making small changes to the weights and biases
won't cause any change at all in the number of training images
classified correctly. That makes it difficult to figure out how to
change the weights and biases to get improved performance. If we
instead use a smooth cost function like the quadratic cost it turns
out to be easy to figure out how to make small changes in the
weights and biases so as to get an improvement in the cost. That's
why we focus first on minimizing the quadratic cost, and only after
that will we examine the classification accuracy.
Even given that we want to use a smooth cost function, you may still
wonder why we choose the quadratic function used in Equation (6).
Isn't this a rather ad hoc choice? Perhaps if we chose a different
cost function we'd get a totally different set of minimizing weights
and biases? This is a valid concern, and later we'll revisit the cost
function, and make some modifications. However, the quadratic
cost function of Equation (6) works perfectly well for understanding
the basics of learning in neural networks, so we'll stick with it for
now.

Recapping, our goal in training a neural network is to find weights


and biases which minimize the quadratic cost function C (w, b). This
is a well-posed problem, but it's got a lot of distracting structure as
currently posed - the interpretation of w and b as weights and
biases, the function lurking in the background, the choice of
network architecture, MNIST, and so on. It turns out that we can
understand a tremendous amount by ignoring most of that
structure, and just concentrating on the minimization aspect. So for
now we're going to forget all about the specific form of the cost
function, the connection to neural networks, and so on. Instead,
we're going to imagine that we've simply been given a function of
many variables and we want to minimize that function. We're going
to develop a technique called gradient descent which can be used to
solve such minimization problems. Then we'll come back to the
specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$ Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

What we'd like is to find where C achieves its global minimum.


Now, of course, for the function plotted above, we can eyeball the
graph and find the minimum. In that sense, I've perhaps shown
slightly too simple a function! A general function, C , may be a
complicated function of many variables, and it won't usually be
possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}$$
We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}$$

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

$$\Delta C \approx \nabla C \cdot \Delta v. \tag{9}$$

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

$$\Delta v = -\eta \nabla C, \tag{10}$$

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

$$v \to v' = v - \eta \nabla C. \tag{11}$$

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.
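Here's a tiny sketch of mine showing the update rule (11) in action on the simple function $C(v) = v_1^2 + v_2^2$, whose gradient is $(2v_1, 2v_2)$; repeated updates drive $v$ toward the minimum at the origin. The function and starting point are made up for illustration.

import numpy as np

def grad_C(v):
    """Gradient of C(v) = v1^2 + v2^2."""
    return 2 * v

v = np.array([2.0, -3.0])    # arbitrary starting point
eta = 0.1                    # learning rate
for step in range(100):
    v = v - eta * grad_C(v)  # the update rule of Equation (11)

print(v)  # very close to [0, 0], the global minimum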
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

$$\Delta C \approx \nabla C \cdot \Delta v, \tag{12}$$

where the gradient $\nabla C$ is the vector

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}$$

Just as for the two variable case, we can choose

$$\Delta v = -\eta \nabla C, \tag{14}$$

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

$$v \to v' = v - \eta \nabla C. \tag{15}$$

You can think of this update rule as defining the gradient descent
algorithm. It gives us a way of repeatedly changing the position v in
order to find a minimum of the function C . The rule doesn't always
work - several things can go wrong and prevent gradient descent
from finding the global minimum of C , a point we'll return to
explore in later chapters. But, in practice gradient descent often
works extremely well, and in neural networks we'll find that it's a
powerful way of minimizing the cost function, and so helping the
net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises
Prove the assertion of the last paragraph. Hint: If you're not
already familiar with the Cauchy-Schwarz inequality, you may
find it helpful to familiarize yourself with it.

I explained gradient descent when C is a function of two


variables, and when it's a function of more than two variables.
What happens when C is a function of just one variable? Can
you provide a geometric interpretation of what gradient
descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C / \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

$$w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16}$$

$$b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}$$

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x) - a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$$\frac{\sum_{j=1}^m \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}$$

where the second sum is over the entire set of training data. Swapping sides we get

$$\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j}, \tag{19}$$

confirming that we can estimate the overall gradient by computing


gradients just for the randomly chosen mini-batch.
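The following sketch, which is entirely illustrative (the per-example "cost" and its gradient are made up for the demonstration), shows the idea behind Equations (18) and (19): averaging gradients over a small random mini-batch gives a good estimate of the average over the full training set, up to statistical fluctuations.

import numpy as np

np.random.seed(0)

# Pretend each "training example" x contributes the gradient 2*(v - x),
# coming from a toy per-example cost C_x = ||v - x||^2, with v fixed.
v = np.array([1.0, -2.0])
xs = np.random.randn(60000, 2)          # stand-in "training inputs"
grads = 2 * (v - xs)                    # per-example gradients

full_gradient = grads.mean(axis=0)      # average over all n examples
mini_batch = grads[np.random.choice(len(grads), 10, replace=False)]
estimate = mini_batch.mean(axis=0)      # average over a mini-batch of 10

print(full_gradient, estimate)          # roughly equal, at a fraction of the cost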

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

$$w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}$$

$$b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}$$

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch
mini-batch. Then we pick out another randomly chosen mini-batch


and train with those. And so on, until we've exhausted the training
inputs, which is said to complete an epoch of training. At that point
we start over with a new training epoch.

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of the mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60{,}000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6{,}000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise
An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \to w_k' = w_k - \eta \, \partial C_x / \partial w_k$ and $b_l \to b_l' = b_l - \eta \, \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost C is, of course, a function of many variables - all the weights
and biases - and so in some sense defines a surface in a very high-
dimensional space. Some people get hung up thinking: "Hey, I have
to be able to visualize all these extra dimensions". And they may
start to worry: "I can't think in four dimensions, let alone five (or
five million)". Is there some special ability they're missing, some
ability that "real" supermathematicians have? Of course, the answer
is no. Even most professional mathematicians can't visualize four
dimensions especially well, if at all. The trick they use, instead, is to
develop other ways of representing what's going on. That's exactly
what we did above: we used an algebraic (rather than visual)
representation of C to figure out how to move so as to decrease C .
People who are good at thinking in high dimensions have a mental
library containing many different techniques along these lines; our
algebraic trick is just one example. Those techniques may not have
the simplicity we're accustomed to when visualizing three
dimensions, but once you build up a library of such techniques, you
can get pretty good at thinking in high dimensions. I won't go into
more detail here, but if you're interested then you may enjoy
reading this discussion of some of the techniques professional
mathematicians use to think in high dimensions. While some of the
techniques discussed are quite complex, much of the best content is
intuitive and accessible, and could be mastered by anyone.
Implementing our network to classify
digits
Alright, let's write a program that learns how to recognize
handwritten digits, using stochastic gradient descent and the
MNIST training data. We'll do this with a short Python (2.7)
program, just 74 lines of code! The first thing we need is to get the
MNIST data. If you're a git user then you can obtain the data by
cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set*.

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
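As a quick sanity check (a usage sketch of mine, assuming the class above has been defined and numpy imported as np), the shapes of the stored arrays for net = Network([2, 3, 1]) come out as follows:

# Continuing with net = Network([2, 3, 1]) from above:
print([b.shape for b in net.biases])    # [(3, 1), (1, 1)] - one bias vector per non-input layer
print([w.shape for w in net.weights])   # [(3, 2), (1, 3)] - weights[0][j][k] links input k to hidden neuron j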

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example, net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\mathrm{th}}$ neuron in the second layer, and the $j^{\mathrm{th}}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange - surely it'd make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

$$a' = \sigma(w a + b). \tag{22}$$

There's quite a bit going on in this equation, so let's unpack it. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $wa + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

Exercise
Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Lu rng khi u vo z l mt vect hoc mng Numpy, Numpy


t ng p dng hm sigmoid elementwise, tc l dng vector ho.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output*. All the method does is apply Equation (22) for each layer:

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
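
For example (an illustrative call, not from the original text), feeding a random input vector of the right shape into the little [2, 3, 1] network from earlier produces a single output activation:

>>> net = Network([2, 3, 1])
>>> net.feedforward(np.random.randn(2, 1)).shape
(1, 1)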

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down considerably.
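
Before walking through the method, it may help to see the mini-batch slicing idiom used in SGD in isolation (a toy illustration with made-up data, not from the book):

>>> data = range(10)
>>> [data[k:k+3] for k in xrange(0, len(data), 3)]
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

Note that the final mini-batch is smaller whenever the amount of data isn't an exact multiple of the mini-batch size; the code handles that case without any special treatment.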

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

    delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.
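
In equation form, restating what the two list comprehensions at the end of update_mini_batch do: for a mini-batch of $m$ training examples $x$ and learning rate $\eta$, the update is

$$w \rightarrow w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \qquad b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},$$

with eta/len(mini_batch) playing the role of $\eta/m$, and nabla_w, nabla_b accumulating the sums of the per-example gradients.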

I won't show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

"" "
network.py
~~~~~~~~~~

Mt m-un thc hin cc


thut ton
hc gradient gc ngu nhin cho mt mng n-ron. Gradients c tnh ton bng cch s dng t
n gin, d c, v d dng iu chnh c. N khng phi l ti u,
v b qua nhiu tnh nng mong mun.
"" "

#### Th vin
#
Th vin chun nhp ngu nhin

# Th vin ca bn th ba
nhp numpy nh np
lp Mng ( i tng ):

def __init__ ( self , sizes ):


"" "Danh sch` `size`` cha s lng n-ron trong cc
lp tng ng ca mng v d, nu danh sch
l [2, 3, 1] th n s l mt ba lp mng, vi
lp u tin cha 2 n-ron, lp th hai 3 n-ron,
v lp th ba n-ron th 3. Nhng sai lch v trng lng cho
mng c khi to ngu nhin, s dng mt
phn phi
Gaussian vi gi tr trung bnh l 0 v phng sai 1. Lu rng lp
u tin c gi thit l mt lp u vo, v theo quy c, chng ta s khng t ra
tng c s dng trong tnh ton cc kt qu u ra t cc lp sau """.
t . num_layers = len ( kch c )
t . kch thc = kch thc
t . nhng thnh kin = [ np . ngu nhin . randn ( y , 1 ) cho y trong kch th
t . trng lng = [ np . ngu nhin . randn ( y , x )
cho x , y trong zip ( kch c [: - 1 ], kch c [ 1 :])]

def feedforward ( t , mt ):
"" "Quay tr li u ra ca mng nu` 'a`` l u vo" ""
cho b , w trong zip ( t . nhng thnh kin , t . trng lng ):
mt = sigmoid ( np . dot ( w , mt ) + b )
quay tr li mt

def SGD ( t , training_data , k nguyn , mini_batch_size , eta ,


test_data = Khng ):
"" "Hy dy cc mng thn kinh s dng mini-l ngu nhin
. gradient descent Lnh` 'training_data`` l mt danh sch cc hng
' `(x, y ) `` i din cho cc u vo o to v
u ra
mong mun.Cc tham s khng ty chn khc l t gii thch.Nu `` test_data`` c cun
mng s c nh gi i vi d liu th nghim sau mi
giai on, v tin b mt phn in ra iu ny rt hu ch cho
theo di tin , nhng lm chm xung ng k """.
nu test_data : n_test = len ( test_data )
n = len ( training_data )
cho j trong xrange ( thi i ):
ngu nhin . ngu nhin ( training_data )
mini_batches = [
training_data [ k : k + mini_batch_size ]
cho k trong xrange ( 0 , n, mini_batch_size )]
cho mini_batch trong mini_batches :
t . update_mini_batch ( mini_batch , eta )
nu test_data :
in "Epoch {0}: {1} / {2}" . nh dng (
j , t . nh gi ( test_data ), n_test )
khc :
in "Epoch {0} hon thnh" . nh dng ( j )

def update_mini_batch ( t , mini_batch , eta ):


"" "Cp nht trng lng v thnh kin ca mng bng cch s dng
gradient descent bng cch s dng backpropagation cho mt gi nh.
" `mini_batch`` l mt danh sch cc b (x, y) `` `, v` `eta``
l t l hc tp." ""
nabla_b = [ np . zeros ( b . hnh dng ) cho b trong t . thin v ]
nabla_w = [ np . zeros ( w .hnh ) cho w trong t . trng lng ]
cho x , y trong mini_batch :
delta_nabla_b , delta_nabla_w = t . backprop ( x , y )
nabla_b = [ nb + dnb cho nb , dnb trong zip ( nabla_b , delta_nabla_b )]
nabla_w = [ nw + dnw cho nw, dnw trong zip ( nabla_w , delta_nabla_w )]
t . trng lng = [ w - ( eta / len ( mini_batch )) * nw
cho w , nw trong zip ( t . trng lng , nabla_w )]
t . sai lch = [ b - ( eta / len ( mini_batch )) * nb
cho b , nb trong zip ( self . biases , nabla_b )]

def backprop ( self , x , y ):


"" "Tr v mt tuple` `(nabla_b, nabla_w)` `i din cho
gradient cho cc chc nng chi ph C_x` `nabla_b`` v`
`nabla_w`` l layer-by- lp ca cc mng numpy, tng t nh
`` self.biases`` v `` self.weights``. "" "
nabla_b = [ np . zeros ( b . hnh dng ) cho b trong t . thin v ]
nabla_w = [ np . zeros ( w .hnh ) cho w trong t . trng lng ]
# feedforward
kch hot = x
kch hot = [ x ] # danh sch lu tr tt c cc kch hot, lp, tng lp
ZS = [] # danh sch lu tr tt c cc vect z, lp, tng lp
cho b , w trong zip ( t . nhng thnh kin , t . trng lng ):
z = np . chm ( w , kch hot )+ B
ZS . ni thm ( z )
activation = sigmoid ( z )
activations . ph thm ( kch hot )
# lc hu vt qua
delta = t . cost_derivative ( activations [ - 1 ], y ) * \
sigmoid_prime ( zs [ - 1 ])
nabla_b [ - 1 ] = delta
nabla_w[ - 1 ] = np . dot ( ng bng , kch hot [ - 2 ] . transpose ())
# Lu rng l bin trong vng lp bn di c s dng mt cht
# khc nhau cc k hiu trong Chng 2 ca cun sch. y,
# l = 1 ngha l lp cui cng ca n-ron, l = 2 l
lp cui cng # th hai, vn vn. y l mt s sp xp li ca
lc # trong cun sch, s dng y tn dng li th ca thc t
# rng Python c th s dng cc ch s ph nh trong danh sch.
cho l trong xrange ( 2 , t .num_layers ):
z = zs [ - l ]
sp = sigmoid_prime ( z )
delta = np . dot ( t . trng lng [ - l + 1 ] . transpose (), ng bng ) *
nabla_b [ - l ] = ng bng
nabla_w [ - l ] = np . chm ( ng bng, kch hot [ - l - 1 ] . chuyn i ())
return ( nabla_b , nabla_w )

def nh gi ( t , test_data ):
"" "Tr li s nguyn liu u vo kim tra m thn kinh
. kt qu u ra mng kt qu chnh xc Lu rng thn kinh
u ra ca mng c gi nh l cc ch s ca bt c
t bo thn kinh trong lp cui cng c cng kch hot cao nht . """
test_results = [( np . argmax ( t . feedforward ( x )), y )
cho ( x , y ) trong test_data ]
tr li sum (int ( x == y ) cho ( x , y ) trong kt qu test_ )

def cost_derivative ( self , output_activations , y ):


"" Tr li vector ca cc dn xut mt phn \ partial C_x /
\ partial a cho cc kch hot u ra. "" "
return ( output_activations - y )

#### Cc chc nng khc


def sigmoid ( z ):
"" "Chc nng sigmoid" ""
return 1.0 / ( 1.0 + np . Exp ( - z ))

def sigmoid_prime ( z ):
"" "Derivative ca hm sigmoid." ""
return sigmoid ( z ) * ( 1 - sigmoid ( z ))

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, described below. We execute the following commands in a Python shell,

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
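
If you'd like to confirm what was loaded (an optional check, assuming the mnist_loader module listed later in the chapter), the sizes and shapes are easy to inspect:

>>> len(training_data)
50000
>>> training_data[0][0].shape
(784, 1)
>>> training_data[0][1].shape
(10, 1)
>>> len(test_data)
10000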

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell. After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

That is, the trained network gives us a classification rate of about 95 percent - 95.42 percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate the results in this chapter I've taken the best of three runs.

Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*. Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (the weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η = 0.001,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η = 0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more of it!) If we do that several times over, we'll end up with a learning rate of something like η = 1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
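
One informal way to automate this kind of exploration (a sketch of my own, reusing the network module and the data loaded earlier, with deliberately short runs) is a simple loop over candidate learning rates:

>>> for eta in [0.001, 0.01, 0.1, 1.0, 3.0]:
...     print "eta = {0}".format(eta)
...     net = network.Network([784, 30, 10])
...     net.SGD(training_data, 5, 10, eta, test_data=test_data)

Comparing the per-epoch accuracies printed for each value gives a rough sense of which learning rates are worth refining further.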

More generally, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η = 100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:
Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Exercise
Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
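
Setting the experiment up requires only a small change to the earlier session (shown here for convenience; the resulting accuracy is left for you to discover):

>>> net = network.Network([784, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)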

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

"" "
mnist_loader
~~~~~~~~~~~~
Mt th vin np d liu hnh nh MNIST. bit chi tit v cc
cu trc
d liu c tr v, hy xem chui ti liu cho `` load_data`` v `` load_data_wrapper``. Tron
hm thng c gi bi m mng thn kinh ca chng ta.
"" "

#### Th vin
#
Th vin chun import cPickle
import gzip

# Th vin ca bn th ba
nhp numpy nh np

def load_data ():


"" "Tr v d liu MNIST nh l mt b cha d liu hun luyn,
d liu xc nhn v d liu kim tra.

Cc `` o to_data`` c tr li nh l mt tuple vi hai mc.


Mc nhp u tin cha cc hnh nh o to thc t. y l mt
ndarray numpy vi 50.000 mc. Mi mc nhp ln lt l mt
numdarray numpy vi 784 gi tr, i din cho 28 * 28 = 784
pixel trong mt hnh nh MNIST.

Mc nhp th hai trong b tp hp `` training_data`` l mt tp tin numpad


cha 50.000 mc. Cc mc ny ch l cc
gi tr
s (0 ... 9) cho cc hnh nh tng ng cha trong mc u tin ca b .

Cc `` validation_data`` v `` test_data`` tng t, ngoi tr


mi ch cha 10.000 hnh nh.

y l mt nh dng d liu tt, nhng s dng trong cc mng n-rn,


hu ch sa i nh dng ca `` training_data`` mt cht.
l thc hin trong hm wrapper `` load_data_wrapper () ``, xem
bn di.
"""
F = gzip . M ( '../data/mnist.pkl.gz' , 'rb' )
training_data , validation_data , test_data = cPickle . Ti ( f )
f . Cht ch ()
tr li ( training_data , validation_data , test_data )

def load_data_wrapper ():


"" "Tr v mt b gm c` `(training_data, validation_data,
test_data)` `Da vo` `load_data``, nhng nh dng ny
thun tin hn s dng trong vic trin khai mng n ron.
C th, `` training_data`` l mt danh sch c cha 50.000
2-tuple `` (x, y) ``. `` x`` l mt numpy.ndarray 784 chiu
cha hnh nh u vo. `` y`` l mt numpy.ndarray 10 chiu
i din cho vector n v tng ng vi
ch s chnh xc cho `` x``.

`` validation_data`` v `` test_data`` l danh sch cha 10.000


2-tuples `` (x, y) ``. Trong mi trng hp, `` x`` l mt numpy.ndarry 784 chiu
cha hnh nh u vo v `` y`` l
phn loi tng ng, ngha l cc gi tr s (s nguyn)
tng ng vi `` x``.

R rng, iu ny c ngha l chng ti ang s dng nh dng hi khc nhau cho


d liu hun luyn v d liu xc nhn / kim tra. Cc nh dng ny
tr nn thun tin nht s dng trong
m
mng thn kinh ca chng ta . "" " Tr_d , va_d , te_d = load_data ()
training_inputs = [ np . Reshape ( x , ( 784 , 1 )) cho x trong tr_d [ 0 ]]
training_results = [ vectorized_result ( y ) cho y trong tr_d [ 1 ]
training_data = zip ( training_inputs , training_results )
validation_inputs = [ np . reshape ( x , ( 784 , 1 )) cho x trong va_d [ 0 ]
validation_data = zip ( validation_inputs , va_d [ 1 ])
test_inputs = [ np .thay i hnh dng ( x , ( 784 , 1 )) cho x trong te_d [ 0 ]
test_data = zip ( test_inputs , te_d [ 1 ])
return ( training_data , validation_data , test_data )

def vectorized_result ( j ):
"" "Tr v mt vector n v 10 chiu vi
v tr
j v v tr th i v tr 0. N c s dng chuyn i mt ch s (0 ... 9) thnh mt
mng
n ron "" e = np . zeros (( 10 , 1 ))
e [ j ] = 1,0
return e

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!
What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, 0, 1, 2, ..., 9.

When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.
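
As a rough sketch of the idea only (my own names and structure, not necessarily the same as the script in the repository; it uses the raw format returned by mnist_loader.load_data(), in which labels are plain digits):

from collections import defaultdict
import mnist_loader

def avg_darknesses(training_data):
    # Average darkness (total pixel intensity) of the training images of each digit.
    counts = defaultdict(int)
    totals = defaultdict(float)
    for image, digit in zip(training_data[0], training_data[1]):
        counts[digit] += 1
        totals[digit] += sum(image)
    return {digit: totals[digit] / counts[digit] for digit in counts}

def guess_digit(image, averages):
    # Guess the digit whose average darkness is closest to this image's darkness.
    darkness = sum(image)
    return min(averages, key=lambda digit: abs(averages[digit] - darkness))

training_data, validation_data, test_data = mnist_loader.load_data()
averages = avg_darknesses(training_data)
num_correct = sum(int(guess_digit(image, averages) == digit)
                  for image, digit in zip(test_data[0], test_data[1]))
print "Baseline classifier: {0} / {1}".format(num_correct, len(test_data[1]))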

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.
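
The linked code is short; a minimal sketch along the same lines (assuming scikit-learn is installed, and not necessarily identical to the script linked above) looks like this:

import mnist_loader
from sklearn import svm

training_data, validation_data, test_data = mnist_loader.load_data()

# Train an SVM classifier with scikit-learn's default settings.
clf = svm.SVC()
clf.fit(training_data[0], training_data[1])

# Count how many of the 10,000 test images are classified correctly.
predictions = clf.predict(test_data[0])
num_correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print "SVM baseline: {0} / {1}".format(num_correct, len(test_data[1]))

Be warned that fitting an SVM on all 50,000 training images takes a while.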

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search here, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.

Toward deep learning

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not: We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network being a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further, through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural networks. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. They have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015. Last update: 2 December 2017, 09:09:08.

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
