Perceptron
Can only model linear functions
Kernel Machines
Non-linearity provided by kernels
Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
Solution is linear combination of kernels
Activation Function
[Plot: sigmoid activation 1/(1+exp(-x)) as a function of x over [-10, 10]]
Sigmoid
$$f(x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$
Smooth version of the threshold function:
approximately linear around zero
saturates for large and small values
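A minimal NumPy sketch of this activation (the weight and input values below are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# f(x) = sigmoid(w^T x) for a single example
w = np.array([0.5, -1.2, 0.3])   # illustrative weight vector
x = np.array([1.0, 0.4, -2.0])   # illustrative input
print(sigmoid(w @ x))            # approximately 0.36 for these values
```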
Training MLP
Stochastic gradient descent
Training error for example (x, y) (e.g. regression):
$$E(W) = \frac{1}{2}\,\big(y - f(x)\big)^2$$
Gradient with respect to the weight $w_{lj}$, via the pre-activation $a_l = \sum_j w_{lj}\,\phi_j$ of unit $l$ (with $\phi_j$ the output of unit $j$):
$$\frac{\partial E(W)}{\partial w_{lj}} = \underbrace{\frac{\partial E(W)}{\partial a_l}}_{\delta_l}\,\frac{\partial a_l}{\partial w_{lj}} = \delta_l\,\phi_j$$
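A hedged sketch of the stochastic gradient descent loop these gradients feed into; `examples` is a list of (x, y) pairs and `grad_fn`, which returns dE/dW for a single example, is an assumed callable:

```python
import random

def sgd_train(examples, W, grad_fn, lr=0.01, epochs=10):
    # Stochastic gradient descent: update W after every single example (x, y),
    # using the gradient of the per-example error E(W) = 1/2 (y - f(x))^2.
    for _ in range(epochs):
        random.shuffle(examples)           # visit examples in random order
        for x, y in examples:
            W = W - lr * grad_fn(W, x, y)  # dE/dW for this example
    return W
```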
Training MLP
Output units
Delta is easy to compute on output units.
E.g. for regression with sigmoid outputs:
$$\delta_o = \frac{\partial E(W)}{\partial a_o} = \frac{\partial}{\partial a_o}\,\frac{1}{2}\big(y - \sigma(a_o)\big)^2 = -\big(y - \sigma(a_o)\big)\,\frac{\partial \sigma(a_o)}{\partial a_o} = -\big(y - \sigma(a_o)\big)\,\sigma(a_o)\big(1 - \sigma(a_o)\big)$$
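The output delta translates directly into code; a small sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_delta(a_o, y):
    # delta_o = dE/da_o for E = 1/2 (y - sigmoid(a_o))^2
    f = sigmoid(a_o)
    return -(y - f) * f * (1 - f)
```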
Training MLP
Derivative of sigmoid
$$\frac{\partial \sigma(x)}{\partial x} = \frac{\partial}{\partial x}\,\frac{1}{1 + \exp(-x)} = \frac{\exp(-x)}{\big(1 + \exp(-x)\big)^2} = \frac{1}{1 + \exp(-x)}\left(1 - \frac{1}{1 + \exp(-x)}\right) = \sigma(x)\,\big(1 - \sigma(x)\big)$$
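A quick numerical sanity check of the identity sigma'(x) = sigma(x)(1 - sigma(x)), comparing the analytic form against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5.0, 5.0, 101)
analytic = sigmoid(x) * (1.0 - sigmoid(x))
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
print(np.max(np.abs(analytic - numeric)))   # tiny (~1e-10): the identity holds
```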
Training MLP
Hidden units
Consider the contribution to the error through all outgoing connections (again with sigmoid activations):
$$\delta_l = \frac{\partial E(W)}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \frac{\partial E(W)}{\partial a_k}\,\frac{\partial a_k}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \delta_k\,\frac{\partial a_k}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \delta_k\,w_{kl}\,\frac{\partial \sigma(a_l)}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \delta_k\,w_{kl}\,\sigma(a_l)\big(1 - \sigma(a_l)\big)$$
where $\mathrm{ch}[l]$ denotes the units fed by unit $l$.
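Putting the output and hidden deltas together, here is a hedged sketch of the gradients for a network with one sigmoid hidden layer (bias terms omitted; array names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(W1, W2, x, y):
    # Forward pass
    a1 = W1 @ x;    phi1 = sigmoid(a1)   # hidden pre-activations and outputs
    a2 = W2 @ phi1; f    = sigmoid(a2)   # output pre-activations and predictions

    # Output deltas: delta_o = -(y - f) * sigma'(a_o)
    delta2 = -(y - f) * f * (1 - f)

    # Hidden deltas: delta_l = (sum_k delta_k * w_kl) * sigma'(a_l)
    delta1 = (W2.T @ delta2) * phi1 * (1 - phi1)

    # Weight gradients: dE/dw_lj = delta_l * phi_j (phi_j = x_j for the first layer)
    return np.outer(delta1, x), np.outer(delta2, phi1)
```

Here W1 maps inputs to hidden units and W2 hidden units to outputs; the returned gradient arrays have the same shapes as W1 and W2.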
Backpropagation for arbitrary modules
[Diagram: a stack of modules $\phi_j = F_j(\phi_{j-1}, W_j)$ with input $x$, parameters $W_j$, and loss $\mathrm{Loss}(\phi_3, y)$ at the top; the gradients $\partial E / \partial \phi_j$ flow backwards through the stack]
Each module computes $\phi_j = F_j(\phi_{j-1}, W_j)$. Backpropagation only needs the chain rule applied module by module:
$$\frac{\partial E}{\partial W_j} = \frac{\partial E}{\partial \phi_j}\,\frac{\partial \phi_j}{\partial W_j} = \frac{\partial E}{\partial \phi_j}\,\frac{\partial F_j(\phi_{j-1}, W_j)}{\partial W_j}$$
$$\frac{\partial E}{\partial \phi_{j-1}} = \frac{\partial E}{\partial \phi_j}\,\frac{\partial \phi_j}{\partial \phi_{j-1}} = \frac{\partial E}{\partial \phi_j}\,\frac{\partial F_j(\phi_{j-1}, W_j)}{\partial \phi_{j-1}}$$
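These two equations are exactly what a modular implementation needs from each module: a forward map and a backward map. A hedged sketch with two illustrative module types (linear and sigmoid):

```python
import numpy as np

class Linear:
    # phi_j = F_j(phi_{j-1}, W_j) = W_j @ phi_{j-1}
    def __init__(self, n_in, n_out, seed=0):
        self.W = 0.1 * np.random.default_rng(seed).standard_normal((n_out, n_in))

    def forward(self, phi_prev):
        self.phi_prev = phi_prev
        return self.W @ phi_prev

    def backward(self, dE_dphi):
        self.dW = np.outer(dE_dphi, self.phi_prev)   # dE/dW_j
        return self.W.T @ dE_dphi                    # dE/dphi_{j-1}

class Sigmoid:
    def forward(self, phi_prev):
        self.out = 1.0 / (1.0 + np.exp(-phi_prev))
        return self.out

    def backward(self, dE_dphi):
        return dE_dphi * self.out * (1 - self.out)

# Forward pass through the stack, then backward pass in reverse order
modules = [Linear(3, 4), Sigmoid(), Linear(4, 1), Sigmoid()]
x, y = np.array([0.2, -0.5, 1.0]), np.array([1.0])
phi = x
for m in modules:
    phi = m.forward(phi)
grad = -(y - phi)                 # dE/dphi_last for E = 1/2 ||y - phi||^2
for m in reversed(modules):
    grad = m.backward(grad)
```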
Remarks on backpropagation
Local minima
The error surface of a multilayer neural network can contain several minima
Backpropagation is only guaranteed to converge to a local minimum
Heuristic attempts to address the problem:
use stochastic instead of true gradient descent
train multiple networks with different random weights and average or choose the best (see the sketch below)
many more..
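A hedged sketch of the multiple-restarts heuristic; `train_fn` and `eval_fn` (train a network from a random initialization, return its validation error) are assumed callables:

```python
import numpy as np

def train_with_restarts(train_fn, eval_fn, n_restarts=5, seed=0):
    # Train several networks from different random initializations
    # and keep the one with the lowest validation error.
    rng = np.random.default_rng(seed)
    best_net, best_err = None, np.inf
    for _ in range(n_restarts):
        net = train_fn(rng)        # assumed: returns a trained network
        err = eval_fn(net)         # assumed: returns its validation error
        if err < best_err:
            best_net, best_err = net, err
    return best_net
```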
Note
Training kernel machines requires solving quadratic optimization problems → global optimum guaranteed
Deep networks are more expressive in principle, but harder to train
[Plot: training, validation, and test error as a function of training epochs]
Autoencoder
[Diagram: network mapping input → code → reconstruction]
train a shallow network to reproduce its input at the output
learns to map inputs into a sensible hidden representation (representation learning)
can be done with unlabelled examples (unsupervised learning)
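A minimal sketch of training such a shallow autoencoder with squared reconstruction error (sigmoid units throughout, so inputs are assumed to lie in [0, 1]; names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_code, lr=0.1, epochs=50, seed=0):
    # X: unlabelled examples, one per row. Architecture: input -> code -> reconstruction.
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = 0.1 * rng.standard_normal((n_code, n_in))   # encoder weights
    W2 = 0.1 * rng.standard_normal((n_in, n_code))   # decoder weights
    for _ in range(epochs):
        for x in X:
            code  = sigmoid(W1 @ x)
            recon = sigmoid(W2 @ code)
            delta2 = -(x - recon) * recon * (1 - recon)   # reconstruction (output) deltas
            delta1 = (W2.T @ delta2) * code * (1 - code)  # code (hidden) deltas
            W2 -= lr * np.outer(delta2, code)
            W1 -= lr * np.outer(delta1, x)
    return W1, W2
```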
Stacked autoencoder
repeat:
1. discard output layer
2. freeze hidden layer weights
3. add another hidden + output layer
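The repeat loop above, sketched on top of the `train_autoencoder` and `sigmoid` helpers from the previous snippet (layer sizes are illustrative):

```python
def stack_autoencoders(X, layer_sizes=(64, 32)):
    # Greedy layer-wise pretraining: train an autoencoder on the current
    # representation, discard its decoder (output layer), freeze the encoder,
    # and train the next autoencoder on the resulting codes.
    encoders, H = [], X
    for n_code in layer_sizes:
        W_enc, _ = train_autoencoder(H, n_code)   # decoder weights are discarded
        encoders.append(W_enc)                    # frozen from here on
        H = sigmoid(H @ W_enc.T)                  # codes become the next layer's input
    return encoders
```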
Global refinement
[Diagram: deep network mapping input to output, obtained from the pretrained stack plus a new supervised output layer]
discard autoencoder output layer
add appropriate output layer for supervised task (e.g. one-hot encoding for multiclass classification)
learn output layer weights + refine all internal weights by backpropagation algorithm
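A hedged sketch of this refinement step, reusing `sigmoid` from above: a fresh output layer is stacked on the pretrained encoders and all weights are refined by backpropagation (squared error against one-hot targets, purely for illustration):

```python
import numpy as np

def finetune(encoders, X, Y, n_out, lr=0.1, epochs=20, seed=0):
    # encoders: pretrained weight matrices from the stacking phase
    # X: inputs, one per row; Y: one-hot targets, one per row
    rng = np.random.default_rng(seed)
    W_out = 0.1 * rng.standard_normal((n_out, encoders[-1].shape[0]))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # forward pass through all pretrained layers plus the new output layer
            acts = [x]
            for W in encoders:
                acts.append(sigmoid(W @ acts[-1]))
            out = sigmoid(W_out @ acts[-1])
            # backward pass: output delta, then propagate down through every layer
            delta = -(y - out) * out * (1 - out)
            grad_out = np.outer(delta, acts[-1])
            delta = (W_out.T @ delta) * acts[-1] * (1 - acts[-1])
            W_out -= lr * grad_out
            for i in range(len(encoders) - 1, -1, -1):
                grad = np.outer(delta, acts[i])
                if i > 0:
                    delta = (encoders[i].T @ delta) * acts[i] * (1 - acts[i])
                encoders[i] -= lr * grad
    return encoders, W_out
```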
References
Libraries
Pylearn2 (http://deeplearning.net/software/pylearn2/)
Caffe (http://caffe.berkeleyvision.org/)
Torch7 (http://torch.ch/)
Literature
Yoshua Bengio, Learning Deep Architectures for AI, Foundations & Trends in Machine Learning, 2009.