Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
peter.j.sadowski@uci.edu
Abstract
This document derives backpropagation for some common error functions and
describes some other tricks. I wrote this up because I kept having to reassure
myself of the algebra.
The error function is the cross entropy,

E = -\sum_{i=1}^{n_{out}} \left[ t_i \log(o_i) + (1 - t_i) \log(1 - o_i) \right]    (1)

where t is the target and o is the output, indexed by i. Our activation function is the logistic,

o_i = \frac{1}{1 + e^{-x_i}}    (2)

x_i = \sum_{j} o_j w_{ji}    (3)
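For concreteness, here is a minimal NumPy sketch of this forward pass; the layer sizes, weights, and target below are made up purely for illustration:

```python
import numpy as np

def logistic(x):
    # Equation (2): o = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(t, o):
    # Equation (1): E = -sum_i [ t_i log(o_i) + (1 - t_i) log(1 - o_i) ]
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

rng = np.random.default_rng(0)
o_hidden = rng.random(4)           # hidden activations o_j (made-up values)
w = rng.standard_normal((4, 3))    # weights w_ji from hidden unit j to output unit i
t = np.array([1.0, 0.0, 0.0])      # made-up target vector

x = o_hidden @ w                   # Equation (3): x_i = sum_j o_j w_ji
o = logistic(x)                    # Equation (2)
E = cross_entropy(t, o)            # Equation (1)
```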
Our first goal is to compute the error gradient for the weights w_{ji} in the final layer. We begin by observing,

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}}    (4)

and proceed by analyzing the partial derivatives for specific data points,

\frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i} + \frac{1 - t_i}{1 - o_i}    (5)

= \frac{o_i - t_i}{o_i (1 - o_i)}    (6)

\frac{\partial o_i}{\partial x_i} = o_i (1 - o_i)    (7)

\frac{\partial x_i}{\partial w_{ji}} = o_j    (8)

where o_j is the activation of the j-th node in the hidden layer. Combining things back together,

\frac{\partial E}{\partial x_i} = o_i - t_i    (9)

and

\frac{\partial E}{\partial w_{ji}} = (o_i - t_i)\, o_j.    (10)
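A quick way to reassure yourself of equation (10) is to compare it against a finite-difference gradient; a minimal sketch, again with made-up sizes and values:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(w, o_hidden, t):
    x = o_hidden @ w                                          # eq (3)
    o = logistic(x)                                           # eq (2)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))  # eq (1)

rng = np.random.default_rng(1)
o_hidden = rng.random(4)
w = rng.standard_normal((4, 3))
t = np.array([0.0, 1.0, 0.0])

# Analytic gradient, eq (10): dE/dw_ji = (o_i - t_i) * o_j
o = logistic(o_hidden @ w)
grad_analytic = np.outer(o_hidden, o - t)

# Central finite differences for each weight
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    for i in range(w.shape[1]):
        w_plus = w.copy();  w_plus[j, i] += eps
        w_minus = w.copy(); w_minus[j, i] -= eps
        grad_numeric[j, i] = (error(w_plus, o_hidden, t)
                              - error(w_minus, o_hidden, t)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # difference is finite-difference noise only
```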
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity \frac{\partial E}{\partial x_j}, where j indexes the units in the second layer down.

\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j}    (11)

= \sum_{i=1}^{n_{out}} (o_i - t_i)\, w_{ji}\, o_j (1 - o_j)    (12)
Then a weight w_{kj} connecting units in the second and third layers down has gradient

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial w_{kj}}    (13)

= \sum_{i=1}^{n_{out}} (o_i - t_i)\, w_{ji}\, o_j (1 - o_j)\, o_k    (14)

In conclusion, to compute \frac{\partial E}{\partial w_{kj}} for a general weight in the network, we compute \frac{\partial E}{\partial x_j} recursively, then multiply by \frac{\partial x_j}{\partial w_{kj}} = o_k.
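The whole recursion for a network with one hidden layer can be sketched as follows (shapes and data are again invented for illustration):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
o_k = rng.random(5)                 # activations o_k feeding the hidden layer
w_kj = rng.standard_normal((5, 4))  # weights w_kj into the hidden layer
w_ji = rng.standard_normal((4, 3))  # weights w_ji into the output layer
t = np.array([1.0, 0.0, 0.0])       # made-up target

# Forward pass
x_j = o_k @ w_kj
o_j = logistic(x_j)
x_i = o_j @ w_ji
o_i = logistic(x_i)

# Backward pass
dE_dx_i = o_i - t                               # eq (9)
dE_dw_ji = np.outer(o_j, dE_dx_i)               # eq (10)
dE_dx_j = (dE_dx_i @ w_ji.T) * o_j * (1 - o_j)  # eqs (11)-(12)
dE_dw_kj = np.outer(o_k, dE_dx_j)               # eqs (13)-(14)
```

For deeper networks the same pattern repeats: propagate \frac{\partial E}{\partial x} backward one layer at a time, and multiply by the incoming activations to get each layer's weight gradient.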
For classification with mutually exclusive classes, we instead use a softmax activation in the last layer together with the cross-entropy error. The outputs are

o_i = f_i(w, x) = \frac{e^{x_i}}{\sum_{k=1}^{n_{class}} e^{x_k}}    (15)

and the error is

E = -\sum_{i=1}^{n_{class}} t_i \log(o_i)    (16)

So

\frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i}    (17)

\frac{\partial o_i}{\partial x_k} = \begin{cases} o_i (1 - o_i) & i = k \\ -o_i o_k & i \neq k \end{cases}    (18)
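The two cases in equation (18) can be checked numerically; in matrix form the Jacobian is \mathrm{diag}(o) - o o^T. A small sketch with an arbitrary input vector:

```python
import numpy as np

def softmax(x):
    # Equation (15); shifting by max(x) leaves the result unchanged but avoids overflow
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([0.5, -1.0, 2.0])  # arbitrary inputs
o = softmax(x)

# Analytic Jacobian, eq (18): do_i/dx_k = o_i(1 - o_i) if i == k, else -o_i o_k
jac_analytic = np.diag(o) - np.outer(o, o)

# Central finite differences, column by column
eps = 1e-6
jac_numeric = np.zeros((3, 3))
for k in range(3):
    xp = x.copy(); xp[k] += eps
    xm = x.copy(); xm[k] -= eps
    jac_numeric[:, k] = (softmax(xp) - softmax(xm)) / (2 * eps)

print(np.max(np.abs(jac_analytic - jac_numeric)))  # should be near zero
```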
Then

\frac{\partial E}{\partial x_i} = \sum_{k=1}^{n_{class}} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i}    (19)

= \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} + \sum_{k \neq i} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i}    (20)

= -t_i (1 - o_i) + \sum_{k \neq i} t_k o_i    (21)

= -t_i + t_i o_i + \sum_{k \neq i} t_k o_i    (22)

= -t_i + o_i \sum_{k} t_k    (23)

= o_i - t_i    (24)

where the last step uses \sum_k t_k = 1, since the classes are mutually exclusive.
The gradient for the last layer of weights is then

\frac{\partial E}{\partial w_{ji}} = \sum_{i} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}}    (25)

= (o_i - t_i)\, o_j    (26)

and for the hidden layer,

\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{class}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j}    (27)

= \sum_{i=1}^{n_{class}} (o_i - t_i)\, w_{ji}\, o_j (1 - o_j)    (28)
Notice that this gradient has the same formula as in the summed cross-entropy case, but it is different in practice because the different activation function in the last layer means o takes on different values.
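The simplification \frac{\partial E}{\partial x_i} = o_i - t_i in equation (24) is easy to verify numerically for a one-hot target (the values below are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(t, o):
    # Equation (16): E = -sum_i t_i log(o_i)
    return -np.sum(t * np.log(o))

x = np.array([1.0, -0.5, 0.3])  # arbitrary pre-activations
t = np.array([0.0, 1.0, 0.0])   # one-hot target, so sum(t) == 1
o = softmax(x)

grad_analytic = o - t            # eq (24)

# Central finite differences with respect to x
eps = 1e-6
grad_numeric = np.zeros(3)
for i in range(3):
    xp = x.copy(); xp[i] += eps
    xm = x.copy(); xm[i] -= eps
    grad_numeric[i] = (cross_entropy(t, softmax(xp))
                       - cross_entropy(t, softmax(xm))) / (2 * eps)
```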
Finally, for a single logistic output unit, substituting o = \frac{1}{1 + e^{-x}} into the cross-entropy error E = -t \log(o) - (1 - t) \log(1 - o) gives a convenient closed form in terms of x:

E = -t \log\left(\frac{1}{1 + e^{-x}}\right) - (1 - t) \log\left(1 - \frac{1}{1 + e^{-x}}\right)    (29)

= t \log(1 + e^{-x}) - (1 - t) \log\left(\frac{e^{-x}}{1 + e^{-x}}\right)    (30)

= t \log(1 + e^{-x}) + (1 - t)\left( x + \log(1 + e^{-x}) \right)    (31)

= (1 - t)\, x + \log(1 + e^{-x})    (32)

= -t x + \log(1 + e^{x})    (33)
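Equation (33) is also the numerically safe way to evaluate this error in practice, since it never forms o explicitly; np.logaddexp(0, x) computes \log(1 + e^x) without overflow. A sketch:

```python
import numpy as np

def loss_naive(t, x):
    # Equation (29): evaluate through the logistic output o
    o = 1.0 / (1.0 + np.exp(-x))
    return -t * np.log(o) - (1 - t) * np.log(1 - o)

def loss_stable(t, x):
    # Equation (33): E = -t x + log(1 + e^x); logaddexp(0, x) = log(e^0 + e^x)
    return -t * x + np.logaddexp(0.0, x)

# The two forms agree for moderate x ...
for t in (0.0, 1.0):
    for x in (-3.0, 0.0, 2.5):
        assert np.isclose(loss_naive(t, x), loss_stable(t, x))

# ... but only the stable form survives extreme x, where loss_naive
# rounds o to exactly 0 or 1 and produces nan/inf
print(loss_stable(1.0, 500.0))
```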