Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
peter.j.sadowski@uci.edu
Abstract
This document derives backpropagation for some common error functions and
describes some other tricks.
For classification problems with two classes, the standard neural network architecture has a single output unit that provides the predicted probability of one class over the other. The logistic activation function combined with the cross-entropy loss function gives us a nice probabilistic interpretation (as opposed to a sum-of-squares loss). We can generalize this to the case where we have multiple independent two-class outputs: we simply sum the log-likelihoods of the independent targets.
[Figure: a feed-forward network with input units $x_k$, hidden units $x_j$, and output units $x_i$ with targets $t_1, t_2, t_3$.]
The cross-entropy error for a single example with $n_{out}$ independent targets is
\[
E = -\sum_{i=1}^{n_{out}} \left[ t_i \log x_i + (1 - t_i) \log(1 - x_i) \right] \tag{1}
\]
where $t$ is the target and $x$ is the output, indexed by $i$. The activation function is the logistic function applied to the weighted sum of the neuron's inputs,
\[
x_i = \frac{1}{1 + e^{-s_i}} \tag{2}
\]
\[
s_i = \sum_{j} x_j w_{ji}. \tag{3}
\]
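As a concrete sketch of Eqs. (1)–(3), here is a NumPy translation; the function and variable names are my own, not part of the original derivation:

```python
import numpy as np

def logistic(s):
    """Logistic activation, Eq. (2): x = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(x_hidden, W):
    """One layer: weighted sum (Eq. 3), then logistic activation (Eq. 2).

    x_hidden : (n_hidden,) activations x_j of the layer below
    W        : (n_out, n_hidden) weights w_ji
    """
    s = W @ x_hidden          # s_i = sum_j x_j * w_ji
    return logistic(s)

def cross_entropy(x, t):
    """Summed cross-entropy over independent two-class targets, Eq. (1)."""
    return -np.sum(t * np.log(x) + (1.0 - t) * np.log(1.0 - x))
```

For example, with all weights zero every output is $\logistic(0) = 0.5$, and each target contributes $\log 2$ to the error.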
The backprop algorithm is simply the chain rule applied to the neurons in each layer. The first step of the algorithm is to calculate the gradient of the training error with respect to the output-layer weights, $\frac{\partial E}{\partial w_{ji}}$:
\[
\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}} \tag{4}
\]
The three factors are
\[
\begin{aligned}
\frac{\partial E}{\partial x_i} &= -\frac{t_i}{x_i} + \frac{1 - t_i}{1 - x_i} & \text{(5)} \\
&= \frac{x_i - t_i}{x_i (1 - x_i)}, & \text{(6)} \\
\frac{\partial x_i}{\partial s_i} &= x_i (1 - x_i), & \text{(7)} \\
\frac{\partial s_i}{\partial w_{ji}} &= x_j, & \text{(8)}
\end{aligned}
\]
where $x_j$ is the activation of the $j$th node in the hidden layer. Combining things back together,
\[
\frac{\partial E}{\partial s_i} = x_i - t_i \tag{9}
\]
and
\[
\frac{\partial E}{\partial w_{ji}} = (x_i - t_i)\, x_j. \tag{10}
\]
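The clean result in Eq. (10) is easy to verify numerically. The sketch below (my own NumPy code, not from the original) compares the analytic gradient $(x_i - t_i)x_j$ against central finite differences of the cross-entropy error:

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(W, x_hidden, t):
    """Cross-entropy error of a logistic output layer, Eq. (1)."""
    x = logistic(W @ x_hidden)
    return -np.sum(t * np.log(x) + (1.0 - t) * np.log(1.0 - x))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x_hidden = rng.normal(size=3)
t = np.array([1.0, 0.0])

# Analytic gradient, Eq. (10): dE/dw_ji = (x_i - t_i) * x_j
x = logistic(W @ x_hidden)
grad = np.outer(x - t, x_hidden)

# Central finite differences as a reference
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp, x_hidden, t) - loss(Wm, x_hidden, t)) / (2 * eps)
```

The two gradients agree to within the finite-difference error.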
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity $\frac{\partial E}{\partial s_j}$, where $j$ indexes the units in the second layer down:
\[
\begin{aligned}
\frac{\partial E}{\partial s_j} &= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j} & \text{(11)} \\
&= \sum_{i=1}^{n_{out}} (x_i - t_i)\, w_{ji}\, x_j (1 - x_j). & \text{(12)}
\end{aligned}
\]
Equivalently, the gradient with respect to the hidden activation is
\[
\begin{aligned}
\frac{\partial E}{\partial x_j} &= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial x_j} & \text{(13)} \\
&= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i}\, x_i (1 - x_i)\, w_{ji}. & \text{(14)}
\end{aligned}
\]
Then a weight $w_{kj}$ connecting units in the second and third layers down has gradient
\[
\begin{aligned}
\frac{\partial E}{\partial w_{kj}} &= \frac{\partial E}{\partial s_j} \frac{\partial s_j}{\partial w_{kj}} & \text{(15)} \\
&= \sum_{i=1}^{n_{out}} (x_i - t_i)\, w_{ji}\, x_j (1 - x_j)\, x_k. & \text{(16)}
\end{aligned}
\]
In conclusion, to compute $\frac{\partial E}{\partial w}$ for a general layer, we compute $\frac{\partial E}{\partial s_j}$ recursively, then multiply by $\frac{\partial s_j}{\partial w_{kj}} = x_k$.
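This recursion can be sketched in a few lines of NumPy. The helper below is my own illustration (names and structure are assumptions, not from the original note): it runs one forward pass through a stack of logistic layers, then propagates the delta $\frac{\partial E}{\partial s}$ backwards, multiplying by the layer input to get each weight gradient.

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop(weights, x_in, t):
    """Gradients dE/dW for a stack of logistic layers with
    cross-entropy error.

    weights : list of arrays; weights[l] has shape (n_out_l, n_in_l)

    Recursion from the derivation above: the output delta is x - t
    (Eq. 9); each lower delta is (W^T delta) * x_j (1 - x_j)
    (Eqs. 11-12); the weight gradient is the outer product of the
    delta with the layer's input activations (Eq. 16).
    """
    # Forward pass, storing every layer's activation
    activations = [x_in]
    for W in weights:
        activations.append(logistic(W @ activations[-1]))

    grads = [None] * len(weights)
    delta = activations[-1] - t                 # dE/ds at the output
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(delta, activations[l])
        if l > 0:
            x = activations[l]                  # input to layer l
            delta = (weights[l].T @ delta) * x * (1.0 - x)
    return grads
```

Note that only the deltas are recomputed layer by layer; the forward activations are stored once and reused.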
For classification problems with more than 2 classes, the softmax output layer provides a way of
assigning probabilities to each class. The cross-entropy error function is modified, but it turns out
to have the same gradient as for the case of summed cross-entropy on logistic outputs. The softmax
activation of the ith output unit is
\[
x_i = \frac{e^{s_i}}{\sum_{c=1}^{n_{class}} e^{s_c}} \tag{17}
\]
and the cross-entropy error is
\[
E = -\sum_{i=1}^{n_{class}} t_i \log(x_i), \tag{18}
\]
so
\[
\frac{\partial E}{\partial x_i} = -\frac{t_i}{x_i}. \tag{19}
\]
The derivative of the softmax activation depends on whether we differentiate with respect to the unit's own input or another unit's input:
\[
\frac{\partial x_i}{\partial s_k} =
\begin{cases}
\dfrac{e^{s_i} \sum_c e^{s_c} - e^{s_i} e^{s_i}}{\left( \sum_c e^{s_c} \right)^2} & i = k \\[2ex]
\dfrac{-\, e^{s_i} e^{s_k}}{\left( \sum_c e^{s_c} \right)^2} & i \neq k
\end{cases} \tag{20}
\]
\[
= \begin{cases}
x_i (1 - x_i) & i = k \\
-\, x_i x_k & i \neq k.
\end{cases} \tag{21}
\]
Applying the chain rule,
\[
\begin{aligned}
\frac{\partial E}{\partial s_i} &= \sum_{k=1}^{n_{class}} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i} & \text{(22)} \\
&= \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} + \sum_{k \neq i} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i} & \text{(23)} \\
&= -t_i (1 - x_i) + \sum_{k \neq i} t_k x_i & \text{(24)} \\
&= -t_i + x_i \sum_k t_k & \text{(25)} \\
&= x_i - t_i, & \text{(26)}
\end{aligned}
\]
where the last step uses the fact that the targets for a single example sum to one, $\sum_k t_k = 1$.
The gradients of the remaining weights then follow exactly as in the logistic case:
\[
\begin{aligned}
\frac{\partial E}{\partial w_{ji}} &= \sum_i \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}} & \text{(27)} \\
&= (x_i - t_i)\, x_j & \text{(28)}
\end{aligned}
\]
\[
\begin{aligned}
\frac{\partial E}{\partial s_j} &= \sum_{i=1}^{n_{class}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j} & \text{(29)} \\
&= \sum_{i=1}^{n_{class}} (x_i - t_i)\, w_{ji}\, x_j (1 - x_j). & \text{(30)}
\end{aligned}
\]
Notice that this gradient has the same formula as for the summed cross entropy case, but it is different
because the activation x takes on different values.
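The result $\frac{\partial E}{\partial s_i} = x_i - t_i$ from Eq. (26) can again be checked numerically. The sketch below (my own code; the specific inputs are arbitrary) compares the analytic softmax gradient against central finite differences:

```python
import numpy as np

def softmax(s):
    """Softmax activation, Eq. (17). Shifting the inputs by a constant
    (here max(s)) does not change the result."""
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([0.5, -1.2, 2.0])
t = np.array([0.0, 1.0, 0.0])     # one-hot target, so sum(t) = 1

# Analytic gradient of E = -sum_i t_i log x_i, Eq. (26): x - t
grad = softmax(s) - t

# Central finite differences of the error as a reference
eps = 1e-6
num = np.zeros_like(s)
for i in range(len(s)):
    sp, sm = s.copy(), s.copy()
    sp[i] += eps
    sm[i] -= eps
    num[i] = (-np.sum(t * np.log(softmax(sp)))
              + np.sum(t * np.log(softmax(sm)))) / (2 * eps)
```

Because the target is one-hot, the $\sum_k t_k = 1$ condition in the derivation holds and the two gradients agree.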
We can save some computation when doing cross-entropy error calculations, often an expensive part
of training a neural network.
For a single output neuron with logistic activation, writing $x$ for the total input to the unit, the cross-entropy error is
\[
\begin{aligned}
E &= -\left[ t \log \left( \frac{1}{1 + e^{-x}} \right) + (1 - t) \log \left( 1 - \frac{1}{1 + e^{-x}} \right) \right] & \text{(32)} \\
&= -\left[ t \log \left( \frac{1}{1 + e^{-x}} \right) + (1 - t) \log \left( \frac{1}{1 + e^{x}} \right) \right] & \text{(33)} \\
&= t \log \left( 1 + e^{-x} \right) + (1 - t) \log \left( 1 + e^{x} \right) & \text{(34)} \\
&= t \left[ \log \left( 1 + e^{x} \right) - x \right] + (1 - t) \log \left( 1 + e^{x} \right) & \text{(35)} \\
&= -t x + \log \left( 1 + e^{x} \right), & \text{(36)}
\end{aligned}
\]
so the error can be computed directly from the total input, without evaluating the logistic output.
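A minimal implementation of Eq. (36) (my own sketch) uses `np.logaddexp` to evaluate $\log(1 + e^{x})$ without overflow for large $|x|$:

```python
import numpy as np

def logistic_xent(x, t):
    """Cross-entropy of a logistic output unit, computed directly from
    its total input x via Eq. (36): E = -t*x + log(1 + e^x).
    logaddexp(0, x) evaluates log(1 + e^x) stably."""
    return -t * x + np.logaddexp(0.0, x)
```

This matches the naive computation through the logistic output where the latter is representable, and stays finite where it is not.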
The same simplification applies to the softmax error, where $x_i$ now denotes the total input to output unit $i$:
\[
\begin{aligned}
E &= -\sum_i t_i \log \frac{e^{x_i}}{\sum_j e^{x_j}} & \text{(37)} \\
&= -\sum_i t_i \left( x_i - \log \sum_j e^{x_j} \right) & \text{(38)} \\
&= -\sum_i t_i x_i + \left( \sum_i t_i \right) \log \sum_j e^{x_j} & \text{(39)} \\
&= -\sum_i t_i x_i + \log \sum_j e^{x_j}, & \text{(40)}
\end{aligned}
\]
again using $\sum_i t_i = 1$.
Also note that in this softmax calculation, a constant can be added to each row of the output with no
effect on the error function.
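The constant-shift property is what makes Eq. (40) safe to compute: subtracting the maximum input from every row prevents $e^{x_j}$ from overflowing while leaving the error unchanged. A short sketch (my own code, assuming a one-hot target):

```python
import numpy as np

def softmax_xent(x, t):
    """Softmax cross-entropy from the total inputs x, Eq. (40):
    E = -sum_i t_i x_i + log sum_j e^{x_j}.
    Subtracting max(x) from every input cancels in the error
    (since sum(t) = 1) and avoids overflow in exp."""
    x = x - x.max()
    return -np.sum(t * x) + np.log(np.sum(np.exp(x)))
```

Shifting all inputs by any constant, for instance `x + 100.0`, gives exactly the same error.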