Perceptrons
What are neural networks? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed by the scientist Frank Rosenblatt in the 1950s and 1960s, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.
\[
\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \tag{1}
\]
The condition \(\sum_j w_j x_j > \text{threshold}\) is cumbersome, and we can make two notational changes to simplify it. The first change is to write \(\sum_j w_j x_j\) as a dot product, \(w \cdot x \equiv \sum_j w_j x_j\), where \(w\) and \(x\) are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, \(b \equiv -\text{threshold}\). Using the bias instead of the threshold, the perceptron rule can be rewritten:
\[
\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2}
\]
You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or, to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
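To make the bias form concrete, here is a minimal sketch of a perceptron in Python. The weights \(w = (-2, -2)\) and bias \(b = 3\) are one illustrative choice; they happen to implement the NAND function:

```python
import numpy as np

def perceptron(w, x, b):
    """Perceptron rule with a bias: output 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# With weights (-2, -2) and bias 3, the perceptron computes NAND:
w = np.array([-2, -2])
b = 3
print([perceptron(w, np.array(x), b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [1, 1, 1, 0]
```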
Up to now I've been drawing inputs like \(x_1\) and \(x_2\) as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons, the input layer, to encode the inputs. For such an input perceptron the weighted sum \(\sum_j w_j x_j\) would always be zero, and so the perceptron would simply output a fixed value rather than the desired value (\(x_1\), in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather as special units which are simply defined to output the desired values, \(x_1, x_2, \ldots\)
\[
\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}
\]
To put it a little more explicitly, the output of a sigmoid neuron with inputs \(x_1, x_2, \ldots\), weights \(w_1, w_2, \ldots\), and bias \(b\) is

\[
\frac{1}{1 + \exp(-\sum_j w_j x_j - b)}. \tag{4}
\]
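As a quick sanity check on this formula, here's a short sketch in Python/numpy; the weights and bias are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(w, x, b):
    """Output of a sigmoid neuron: sigma(w.x + b)."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([-2.0, -2.0])
b = 3.0
# Near the decision boundary the output is intermediate, not 0 or 1:
print(round(sigmoid_neuron(w, np.array([1.0, 1.0]), b), 3))  # → 0.269
```

Unlike a perceptron, the output varies smoothly between 0 and 1 as the weighted input changes.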
What about the algebraic form of \(\sigma\)? How can we understand that? In fact, the exact form of \(\sigma\) isn't so important; what really matters is the shape of the function when plotted. Here's the shape:
[Figure: the sigmoid function, a smooth S-shaped curve rising from 0 toward 1 as z runs from -4 to 4, shown alongside a step function that jumps abruptly from 0 to 1 at z = 0.]
\[
\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b, \tag{5}
\]
What if we'd used some other activation function, \(f(\cdot)\)? The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using \(\sigma\) will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, \(\sigma\) is commonly used in work on neural nets, and is the activation function we'll use most often in this book.
Exercises
Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of
perceptrons, and multiply them by a positive constant, c > 0.
Show that the behaviour of the network doesn't change.
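The claim can also be checked numerically. The following sketch builds a tiny two-layer perceptron network with arbitrary random weights (the network shape and inputs are made-up illustrative choices) and confirms that scaling every weight and bias by a constant c > 0 leaves the output unchanged, since the sign of w . x + b is unaffected:

```python
import numpy as np

def perceptron_layer(W, b, x):
    """One layer of perceptrons: output 1 where W.x + b > 0, else 0."""
    return (W @ x + b > 0).astype(int)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden layer
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # output layer

def network(x, c=1.0):
    # Scale every weight and bias by the positive constant c.
    h = perceptron_layer(c * W1, c * b1, x)
    return perceptron_layer(c * W2, c * b2, h)

x = np.array([0.7, -1.2])
print(network(x, c=1.0) == network(x, c=100.0))  # scaling changes nothing
```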
While the design of the input and output layers of a neural network
is often straightforward, there can be quite an art to the design of
the hidden layers. In particular, it's not possible to sum up the
design process for the hidden layers with a few simple rules of
thumb. Instead, neural networks researchers have developed many
design heuristics for the hidden layers, which help people get the
behaviour they want out of their nets. For example, such heuristics
can be used to help determine how to trade off the number of
hidden layers against the time required to train the network. We'll
meet several such design heuristics later in this book.
You might wonder why we use 10 output neurons. After all, the goal
of the network is to tell us which digit (0, 1, 2, ..., 9) corresponds to
the input image. A seemingly natural way of doing that is to use just
4 output neurons, treating each neuron as taking on a binary value,
depending on whether the neuron's output is closer to 0 or to 1.
Four neurons are enough to encode the answer, since 2^4 = 16 is
more than the 10 possible values for the input digit. Why should our
network use 10 neurons instead? Isn't that inefficient? The ultimate
justification is empirical: we can try out both network designs, and
it turns out that, for this particular problem, the network with 10
output neurons learns to recognize digits better than the network
with 4 output neurons. But that leaves us wondering why using 10
output neurons works better. Is there some heuristic that would tell
us in advance that we should use the 10-output encoding instead of
the 4-output encoding?
To understand why we do this, it helps to think about what the
neural network is doing from first principles. Consider first the case
where we use 10 output neurons. Let's concentrate on the first
output neuron, the one that's trying to decide whether or not the
digit is a 0. It does this by weighing up evidence from the hidden
layer of neurons. What are those hidden neurons doing? Well, just
suppose for the sake of argument that the first neuron in the hidden
layer detects whether or not an image like the following is present:
As you may have guessed, these four images together make up the 0
image that we saw in the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can
conclude that the digit is a 0. Of course, that's not the only sort of
evidence we can use to conclude that the image was a 0 - we could
legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.
Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.
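One possible solution can be sketched as follows. The idea: bit k of the new layer gets a large positive weight from every old output neuron representing a digit whose binary expansion has bit k set, and a large negative weight otherwise. The magnitudes 10 and -5 below are illustrative choices, not the only ones that work:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weight from digit-neuron d to bit-neuron k: +10 if bit k of d is 1, else -10.
W = np.array([[10.0 if (d >> k) & 1 else -10.0 for d in range(10)]
              for k in range(4)])
b = np.full((4,), -5.0)

old_output = np.zeros(10)
old_output[6] = 1.0                # old output layer signals "6" (nearly one-hot)
bits = sigmoid(W @ old_output + b)
print((bits > 0.5).astype(int))    # → [0 1 1 0]  (6 = binary 0110, low bit first)
```

Because the old output layer is nearly one-hot, the correct digit neuron dominates each weighted sum, so the bit neurons saturate toward the right values.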
Learning with gradient descent
Now that we have a design for our neural network, how can it learn
to recognize digits? The first thing we'll need is a data set to learn
from - a so-called training data set. We'll use the MNIST data set,
which contains tens of thousands of scanned images of handwritten
digits, together with their correct classifications. MNIST's name
comes from the fact that it is a modified subset of two data sets
collected by NIST, the United States' National Institute of
Standards and Technology. Here's a few images from MNIST:
As you can see, these digits are, in fact, the same as those shown at
the beginning of this chapter as a challenge to recognize. Of course,
when testing our network we'll ask it to recognize images which
aren't in the training set!
The MNIST data comes in two parts. The first part contains 60,000
images to be used as training data. These images are scanned
handwriting samples from 250 people, half of whom were US
Census Bureau employees, and half of whom were high school
students. The images are greyscale and 28 by 28 pixels in size. The
second part of the MNIST data set is 10,000 images to be used as
test data. Again, these are 28 by 28 greyscale images. We'll use the
test data to evaluate how well our neural network has learned to
recognize digits. To make this a good test of performance, the test
data was taken from a different set of 250 people than the original
training data (albeit still a group split between Census Bureau
employees and high school students). This helps give us confidence
that our system can recognize digits from people whose writing it
didn't see during training.
What we'd like is an algorithm which lets us find weights and biases
so that the output from the network approximates y(x) for all
training inputs x. To quantify how well we're achieving this goal we
define a cost function*:
\[
C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}
\]
Here, w denotes the collection of all weights in the network, b all the
biases, n is the total number of training inputs, a is the vector of
outputs from the network when x is input, and the sum is over all
training inputs, x. Of course, the output a depends on x, w and b,
but to keep the notation simple I haven't explicitly indicated this
dependence. The notation \(\|v\|\) just denotes the usual length
function for a vector v. We'll call C the quadratic cost function; it's
also sometimes known as the mean squared error or just MSE.
Inspecting the form of the quadratic cost function, we see that
C (w, b) is non-negative, since every term in the sum is non-
negative. Furthermore, the cost C (w, b) becomes small, i.e.,
C (w, b) 0 , precisely when y(x) is approximately equal to the
output, a, for all training inputs, x. So our training algorithm has
done a good job if it can find weights and biases so that C (w, b) 0.
By contrast, it's not doing so well when C (w, b) is large - that would
mean that y(x) is not close to the output a for a large number of
inputs. So the aim of our training algorithm will be to minimize the
cost C (w, b) as a function of the weights and biases. In other words,
we want to find a set of weights and biases which make the cost as
small as possible. We'll do that using an algorithm known as
gradient descent.
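As a concrete illustration, the quadratic cost of Equation (6) can be computed for a toy data set like so (the outputs and targets below are made-up numbers, chosen only to show the arithmetic):

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = (1/2n) sum_x ||y(x) - a||^2, as in Equation (6)."""
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for a, y in zip(outputs, targets)) / (2 * n)

# Toy example: two training inputs with 2-dimensional outputs.
outputs = [np.array([0.8, 0.2]), np.array([0.1, 0.9])]
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(quadratic_cost(outputs, targets))  # → 0.025
```

The cost shrinks toward 0 as each network output a approaches its target y(x), exactly the behaviour the text describes.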
One way of attacking the problem is to use calculus to try to find the
minimum analytically. We could compute derivatives and then try
using them to find places where C is an extremum. With some luck
that might work when C is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables; the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!
\[
\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}
\]
We're going to find a way of choosing \(\Delta v_1\) and \(\Delta v_2\) so as to make \(\Delta C\) negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define \(\Delta v\) to be the vector of changes in \(v\), \(\Delta v \equiv (\Delta v_1, \Delta v_2)^T\), where \(T\) is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of \(C\) to be the vector of partial derivatives, \(\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T\). We denote the gradient vector by \(\nabla C\), i.e.:

\[
\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}
\]
\[
\Delta C \approx \nabla C \cdot \Delta v. \tag{9}
\]
This equation helps explain why \(\nabla C\) is called the gradient vector: \(\nabla C\) relates changes in \(v\) to changes in \(C\), just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose \(\Delta v\) so as to make \(\Delta C\) negative. In particular, suppose we choose

\[
\Delta v = -\eta \nabla C, \tag{10}
\]

where \(\eta\) is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that \(\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2\). Because \(\|\nabla C\|^2 \ge 0\), this guarantees that \(\Delta C \le 0\), i.e., \(C\) will always decrease, never increase, if we change \(v\) according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for \(\Delta v\), then move the ball's position \(v\) by that amount:
\[
v \to v' = v - \eta \nabla C. \tag{11}
\]
This works even when \(C\) is a function of many variables, \(v_1, \ldots, v_m\): the change \(\Delta C\) produced by a small change \(\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T\) is

\[
\Delta C \approx \nabla C \cdot \Delta v, \tag{12}
\]
where the gradient \(\nabla C\) is the vector of partial derivatives,

\[
\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}
\]
Just as for the two-variable case, we can choose

\[
\Delta v = -\eta \nabla C, \tag{14}
\]
and apply the update rule

\[
v \to v' = v - \eta \nabla C. \tag{15}
\]
You can think of this update rule as defining the gradient descent
algorithm. It gives us a way of repeatedly changing the position v in
order to find a minimum of the function C . The rule doesn't always
work - several things can go wrong and prevent gradient descent
from finding the global minimum of C , a point we'll return to
explore in later chapters. But, in practice gradient descent often
works extremely well, and in neural networks we'll find that it's a
powerful way of minimizing the cost function, and so helping the
net learn.
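Here's a minimal sketch of that update rule in action on a toy cost function, C(v) = v1^2 + v2^2, chosen for illustration because its gradient, (2 v1, 2 v2), is known in closed form:

```python
import numpy as np

def gradient_descent(grad, v, eta=0.1, steps=100):
    """Repeatedly apply the update rule v -> v' = v - eta * grad(C)."""
    for _ in range(steps):
        v = v - eta * grad(v)
    return v

# Minimize C(v) = v1^2 + v2^2, whose gradient is (2*v1, 2*v2).
grad = lambda v: 2 * v
v = gradient_descent(grad, np.array([3.0, -4.0]))
print(np.round(v, 6))  # converges toward the minimum at (0, 0)
```

Each step moves v a small distance against the gradient, so C decreases steadily, just as the ΔC argument above predicts.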
Exercises
Prove the assertion of the last paragraph. Hint: If you're not
already familiar with the Cauchy-Schwarz inequality, you may
find it helpful to familiarize yourself with it.
\[
w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k}, \tag{16}
\]

\[
b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}
\]
Notice that this cost function has the form \(C = \frac{1}{n} \sum_x C_x\), that is, it's an average over costs \(C_x \equiv \frac{\|y(x) - a\|^2}{2}\) for individual training examples. In practice, to compute the gradient \(\nabla C\) we need to compute the gradients \(\nabla C_x\) separately for each training input, \(x\), and then average them, \(\nabla C = \frac{1}{n} \sum_x \nabla C_x\). Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
is,

\[
\frac{\sum_{j=1}^{m} \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}
\]

where the second sum is over the entire set of training data. Swapping sides we get

\[
\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j}, \tag{19}
\]
To connect this explicitly to learning in neural networks, suppose \(w_k\) and \(b_l\) denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

\[
w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k}, \tag{20}
\]

\[
b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}
\]

where the sums are over all the training examples \(X_j\) in the current mini-batch.
moving the \(\eta/m\) factor out in front of the sums. Conceptually this makes no difference, since it's equivalent to rescaling the learning rate \(\eta\). But when making detailed comparisons of different work it's worth watching out for.
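To see the mini-batch updates concretely, here's a sketch fitting a one-parameter model y ≈ wx to made-up data. The data, learning rate, and batch size below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(size=200)
ys = 3.0 * xs + rng.normal(scale=0.1, size=200)   # toy data: y is roughly 3x

w, eta, m = 0.0, 0.05, 20
for epoch in range(10):
    idx = rng.permutation(len(xs))                # shuffle, then slice into batches
    for start in range(0, len(xs), m):
        batch = idx[start:start + m]
        # Per-example cost C_x = (y - w*x)^2 / 2, so dC_x/dw = -(y - w*x) * x;
        # average the per-example gradients over the mini-batch, as in the update rule.
        grad = np.mean([-(ys[i] - w * xs[i]) * xs[i] for i in batch])
        w = w - eta * grad
print(round(w, 2))  # learns a value near 3
```

Each mini-batch gives a cheap, noisy estimate of the true gradient, which is exactly the trade-off stochastic gradient descent makes.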
Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, \(x\), we update our weights and biases according to the rules \(w_k \to w_k' = w_k - \eta\, \partial C_x / \partial w_k\) and \(b_l \to b_l' = b_l - \eta\, \partial C_x / \partial b_l\). Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.
If you don't use git then you can download the data and code here.
class Network(object):
In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])
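A constructor consistent with this description might look like the following sketch (Gaussian random initialization for weights and biases, with no biases for the input layer, matching how the network is described in the text):

```python
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """``sizes`` lists the number of neurons in each layer."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        # No biases for the input layer; Gaussian initialization elsewhere.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

net = Network([2, 3, 1])
print([w.shape for w in net.weights])  # → [(3, 2), (1, 3)]
```

Note the shapes: the weight matrix connecting a layer of x neurons to a layer of y neurons is y by x, so that np.dot(w, a) works directly on a column vector of activations.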
\[
a' = \sigma(w a + b). \tag{22}
\]
Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
"" "
network.py
~~~~~~~~~~
#### Th vin
#
Th vin chun nhp ngu nhin
# Th vin ca bn th ba
nhp numpy nh np
lp Mng ( i tng ):
def feedforward ( t , mt ):
"" "Quay tr li u ra ca mng nu` 'a`` l u vo" ""
cho b , w trong zip ( t . nhng thnh kin , t . trng lng ):
mt = sigmoid ( np . dot ( w , mt ) + b )
quay tr li mt
def nh gi ( t , test_data ):
"" "Tr li s nguyn liu u vo kim tra m thn kinh
. kt qu u ra mng kt qu chnh xc Lu rng thn kinh
u ra ca mng c gi nh l cc ch s ca bt c
t bo thn kinh trong lp cui cng c cng kch hot cao nht . """
test_results = [( np . argmax ( t . feedforward ( x )), y )
cho ( x , y ) trong test_data ]
tr li sum (int ( x == y ) cho ( x , y ) trong kt qu test_ )
def sigmoid_prime ( z ):
"" "Derivative ca hm sigmoid." ""
return sigmoid ( z ) * ( 1 - sigmoid ( z ))
Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000
The results are not so encouraging,
Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000
Exercise

Try creating a network with just two layers, an input and an output layer, no hidden layer, with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
"" "
mnist_loader
~~~~~~~~~~~~
Mt th vin np d liu hnh nh MNIST. bit chi tit v cc
cu trc
d liu c tr v, hy xem chui ti liu cho `` load_data`` v `` load_data_wrapper``. Tron
hm thng c gi bi m mng thn kinh ca chng ta.
"" "
#### Th vin
#
Th vin chun import cPickle
import gzip
# Th vin ca bn th ba
nhp numpy nh np
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere. This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
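For example, calling this helper with the label 5 yields a one-hot column vector (the function is repeated below so the snippet is self-contained):

```python
import numpy as np

def vectorized_result(j):
    """10-dimensional unit vector with 1.0 in the jth position."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

print(vectorized_result(5).ravel())  # → [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

This is exactly the target format y(x) used by the quadratic cost: a 10-component vector that the output layer's activations are trained to match.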
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.