Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
peter.j.sadowski@uci.edu
Abstract
This document derives backpropagation for some common error functions and
describes some other tricks.
For classification problems with two classes, the standard neural network architecture has a single output unit that provides the predicted probability of one class over the other. The logistic activation function combined with the cross-entropy loss function gives us a nice probabilistic interpretation (as opposed to a sum-of-squares loss). We can generalize this to the case where we have multiple independent two-class outputs: we simply sum the log-likelihoods of the independent targets.
[Figure: a feed-forward network with input units $x_k$, hidden units $x_j$, and output units $x_i$ with targets $t_1, t_2, t_3$.]
The cross-entropy error for a single example with $n_{out}$ independent targets is
\[
E = -\sum_{i=1}^{n_{out}} \left[ t_i \log x_i + (1 - t_i) \log(1 - x_i) \right] \tag{1}
\]
where $t$ is the target and $x$ is the output, indexed by $i$. The activation function is the logistic function applied to the weighted sum of the neuron's inputs,
\[
x_i = \frac{1}{1 + e^{-s_i}} \tag{2}
\]
\[
s_i = \sum_{j} x_j w_{ji}. \tag{3}
\]
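As a concrete sketch of Eqs. (1)–(3), here is a NumPy translation; the function and variable names are my own, not part of the original derivation:

```python
import numpy as np

def logistic(s):
    """Logistic activation, Eq. (2): x = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(x_hidden, W):
    """One layer: weighted sum (Eq. 3), then logistic activation (Eq. 2).

    x_hidden : (n_hidden,) activations x_j of the layer below
    W        : (n_out, n_hidden) weights w_ji
    """
    s = W @ x_hidden          # s_i = sum_j x_j * w_ji
    return logistic(s)

def cross_entropy(x, t):
    """Summed cross-entropy over independent two-class targets, Eq. (1)."""
    return -np.sum(t * np.log(x) + (1.0 - t) * np.log(1.0 - x))
```

For example, with all weights zero every output is $\logistic(0) = 0.5$, and each target contributes $\log 2$ to the error.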
The backprop algorithm is simply the chain rule applied to the neurons in each layer. The first step of the algorithm is to calculate the gradient of the training error with respect to the output-layer weights, $\frac{\partial E}{\partial w_{ji}}$:
\[
\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}} \tag{4}
\]
The three factors are
\[
\begin{aligned}
\frac{\partial E}{\partial x_i} &= -\frac{t_i}{x_i} + \frac{1 - t_i}{1 - x_i} & \text{(5)} \\
&= \frac{x_i - t_i}{x_i (1 - x_i)}, & \text{(6)} \\
\frac{\partial x_i}{\partial s_i} &= x_i (1 - x_i), & \text{(7)} \\
\frac{\partial s_i}{\partial w_{ji}} &= x_j, & \text{(8)}
\end{aligned}
\]
where $x_j$ is the activation of the $j$th node in the hidden layer. Combining things back together,
\[
\frac{\partial E}{\partial s_i} = x_i - t_i \tag{9}
\]
and
\[
\frac{\partial E}{\partial w_{ji}} = (x_i - t_i)\, x_j. \tag{10}
\]
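The clean result in Eq. (10) is easy to verify numerically. The sketch below (my own NumPy code, not from the original) compares the analytic gradient $(x_i - t_i)x_j$ against central finite differences of the cross-entropy error:

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(W, x_hidden, t):
    """Cross-entropy error of a logistic output layer, Eq. (1)."""
    x = logistic(W @ x_hidden)
    return -np.sum(t * np.log(x) + (1.0 - t) * np.log(1.0 - x))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x_hidden = rng.normal(size=3)
t = np.array([1.0, 0.0])

# Analytic gradient, Eq. (10): dE/dw_ji = (x_i - t_i) * x_j
x = logistic(W @ x_hidden)
grad = np.outer(x - t, x_hidden)

# Central finite differences as a reference
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp, x_hidden, t) - loss(Wm, x_hidden, t)) / (2 * eps)
```

The two gradients agree to within the finite-difference error.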
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity $\frac{\partial E}{\partial s_j}$, where $j$ indexes the units in the second layer down:
\[
\begin{aligned}
\frac{\partial E}{\partial s_j} &= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j} & \text{(11)} \\
&= \sum_{i=1}^{n_{out}} (x_i - t_i)\, w_{ji}\, x_j (1 - x_j). & \text{(12)}
\end{aligned}
\]
Equivalently, the gradient with respect to the hidden activation is
\[
\begin{aligned}
\frac{\partial E}{\partial x_j} &= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial x_j} & \text{(13)} \\
&= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i}\, x_i (1 - x_i)\, w_{ji}. & \text{(14)}
\end{aligned}
\]
Then a weight $w_{kj}$ connecting units in the second and third layers down has gradient
\[
\begin{aligned}
\frac{\partial E}{\partial w_{kj}} &= \frac{\partial E}{\partial s_j} \frac{\partial s_j}{\partial w_{kj}} & \text{(15)} \\
&= \sum_{i=1}^{n_{out}} (x_i - t_i)\, w_{ji}\, x_j (1 - x_j)\, x_k. & \text{(16)}
\end{aligned}
\]
In conclusion, to compute $\frac{\partial E}{\partial w}$ for a general layer, we compute $\frac{\partial E}{\partial s_j}$ recursively, then multiply by $\frac{\partial s_j}{\partial w_{kj}} = x_k$.
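This recursion can be sketched in a few lines of NumPy. The helper below is my own illustration (names and structure are assumptions, not from the original note): it runs one forward pass through a stack of logistic layers, then propagates the delta $\frac{\partial E}{\partial s}$ backwards, multiplying by the layer input to get each weight gradient.

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop(weights, x_in, t):
    """Gradients dE/dW for a stack of logistic layers with
    cross-entropy error.

    weights : list of arrays; weights[l] has shape (n_out_l, n_in_l)

    Recursion from the derivation above: the output delta is x - t
    (Eq. 9); each lower delta is (W^T delta) * x_j (1 - x_j)
    (Eqs. 11-12); the weight gradient is the outer product of the
    delta with the layer's input activations (Eq. 16).
    """
    # Forward pass, storing every layer's activation
    activations = [x_in]
    for W in weights:
        activations.append(logistic(W @ activations[-1]))

    grads = [None] * len(weights)
    delta = activations[-1] - t                 # dE/ds at the output
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(delta, activations[l])
        if l > 0:
            x = activations[l]                  # input to layer l
            delta = (weights[l].T @ delta) * x * (1.0 - x)
    return grads
```

Note that only the deltas are recomputed layer by layer; the forward activations are stored once and reused.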
For classification problems with more than 2 classes, the softmax output layer provides a way of
assigning probabilities to each class. The cross-entropy error function is modified, but it turns out
to have the same gradient as for the case of summed cross-entropy on logistic outputs. The softmax
activation of the ith output unit is
\[
x_i = \frac{e^{s_i}}{\sum_{c=1}^{n_{class}} e^{s_c}} \tag{17}
\]
and the cross-entropy error is
\[
E = -\sum_{i=1}^{n_{class}} t_i \log(x_i), \tag{18}
\]
so
\[
\frac{\partial E}{\partial x_i} = -\frac{t_i}{x_i}. \tag{19}
\]
The derivative of the softmax activation depends on whether we differentiate with respect to the unit's own input or another unit's input:
\[
\frac{\partial x_i}{\partial s_k} =
\begin{cases}
\dfrac{e^{s_i} \sum_c e^{s_c} - e^{s_i} e^{s_i}}{\left( \sum_c e^{s_c} \right)^2} & i = k \\[2ex]
\dfrac{-\, e^{s_i} e^{s_k}}{\left( \sum_c e^{s_c} \right)^2} & i \neq k
\end{cases} \tag{20}
\]
\[
= \begin{cases}
x_i (1 - x_i) & i = k \\
-\, x_i x_k & i \neq k.
\end{cases} \tag{21}
\]
Applying the chain rule,
\[
\begin{aligned}
\frac{\partial E}{\partial s_i} &= \sum_{k=1}^{n_{class}} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i} & \text{(22)} \\
&= \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} + \sum_{k \neq i} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i} & \text{(23)} \\
&= -t_i (1 - x_i) + \sum_{k \neq i} t_k x_i & \text{(24)} \\
&= -t_i + x_i \sum_k t_k & \text{(25)} \\
&= x_i - t_i, & \text{(26)}
\end{aligned}
\]
where the last step uses the fact that the targets for a single example sum to one, $\sum_k t_k = 1$.
The gradients of the remaining weights then follow exactly as in the logistic case:
\[
\begin{aligned}
\frac{\partial E}{\partial w_{ji}} &= \sum_i \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}} & \text{(27)} \\
&= (x_i - t_i)\, x_j & \text{(28)}
\end{aligned}
\]
\[
\begin{aligned}
\frac{\partial E}{\partial s_j} &= \sum_{i=1}^{n_{class}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j} & \text{(29)} \\
&= \sum_{i=1}^{n_{class}} (x_i - t_i)\, w_{ji}\, x_j (1 - x_j). & \text{(30)}
\end{aligned}
\]
Notice that this gradient has the same formula as for the summed cross entropy case, but it is different
because the activation x takes on different values.
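The result $\frac{\partial E}{\partial s_i} = x_i - t_i$ from Eq. (26) can again be checked numerically. The sketch below (my own code; the specific inputs are arbitrary) compares the analytic softmax gradient against central finite differences:

```python
import numpy as np

def softmax(s):
    """Softmax activation, Eq. (17). Shifting the inputs by a constant
    (here max(s)) does not change the result."""
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([0.5, -1.2, 2.0])
t = np.array([0.0, 1.0, 0.0])     # one-hot target, so sum(t) = 1

# Analytic gradient of E = -sum_i t_i log x_i, Eq. (26): x - t
grad = softmax(s) - t

# Central finite differences of the error as a reference
eps = 1e-6
num = np.zeros_like(s)
for i in range(len(s)):
    sp, sm = s.copy(), s.copy()
    sp[i] += eps
    sm[i] -= eps
    num[i] = (-np.sum(t * np.log(softmax(sp)))
              + np.sum(t * np.log(softmax(sm)))) / (2 * eps)
```

Because the target is one-hot, the $\sum_k t_k = 1$ condition in the derivation holds and the two gradients agree.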
We can save some computation when doing cross-entropy error calculations, often an expensive part
of training a neural network.
For a single output neuron with logistic activation, writing $x$ for the total input to the unit, the cross-entropy error is
\[
\begin{aligned}
E &= -\left[ t \log \left( \frac{1}{1 + e^{-x}} \right) + (1 - t) \log \left( 1 - \frac{1}{1 + e^{-x}} \right) \right] & \text{(32)} \\
&= -\left[ t \log \left( \frac{1}{1 + e^{-x}} \right) + (1 - t) \log \left( \frac{1}{1 + e^{x}} \right) \right] & \text{(33)} \\
&= t \log \left( 1 + e^{-x} \right) + (1 - t) \log \left( 1 + e^{x} \right) & \text{(34)} \\
&= t \left[ \log \left( 1 + e^{x} \right) - x \right] + (1 - t) \log \left( 1 + e^{x} \right) & \text{(35)} \\
&= -t x + \log \left( 1 + e^{x} \right), & \text{(36)}
\end{aligned}
\]
so the error can be computed directly from the total input, without evaluating the logistic output.
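A minimal implementation of Eq. (36) (my own sketch) uses `np.logaddexp` to evaluate $\log(1 + e^{x})$ without overflow for large $|x|$:

```python
import numpy as np

def logistic_xent(x, t):
    """Cross-entropy of a logistic output unit, computed directly from
    its total input x via Eq. (36): E = -t*x + log(1 + e^x).
    logaddexp(0, x) evaluates log(1 + e^x) stably."""
    return -t * x + np.logaddexp(0.0, x)
```

This matches the naive computation through the logistic output where the latter is representable, and stays finite where it is not.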
The same simplification applies to the softmax error, where $x_i$ now denotes the total input to output unit $i$:
\[
\begin{aligned}
E &= -\sum_i t_i \log \frac{e^{x_i}}{\sum_j e^{x_j}} & \text{(37)} \\
&= -\sum_i t_i \left( x_i - \log \sum_j e^{x_j} \right) & \text{(38)} \\
&= -\sum_i t_i x_i + \left( \sum_i t_i \right) \log \sum_j e^{x_j} & \text{(39)} \\
&= -\sum_i t_i x_i + \log \sum_j e^{x_j}, & \text{(40)}
\end{aligned}
\]
again using $\sum_i t_i = 1$.
Also note that in this softmax calculation, a constant can be added to each row of the output with no
effect on the error function.
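The constant-shift property is what makes Eq. (40) safe to compute: subtracting the maximum input from every row prevents $e^{x_j}$ from overflowing while leaving the error unchanged. A short sketch (my own code, assuming a one-hot target):

```python
import numpy as np

def softmax_xent(x, t):
    """Softmax cross-entropy from the total inputs x, Eq. (40):
    E = -sum_i t_i x_i + log sum_j e^{x_j}.
    Subtracting max(x) from every input cancels in the error
    (since sum(t) = 1) and avoids overflow in exp."""
    x = x - x.max()
    return -np.sum(t * x) + np.log(np.sum(np.exp(x)))
```

Shifting all inputs by any constant, for instance `x + 100.0`, gives exactly the same error.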