Connectionist Models
Consider humans:
- Neuron switching time: ~0.001 sec
- Number of neurons: ~$10^{10}$
- Connections per neuron: ~$10^4$ to $10^5$
- Scene recognition time: ~0.1 sec
- That leaves room for only about 100 sequential inference steps (0.1 s / 0.001 s), which doesn't seem like enough → much parallel computation
PLAN
1. Examples: ALVINN driving system; face recognition
2. The perceptron; the linear unit; the gradient descent learning rule
3. Multilayer networks of sigmoid units; the Backpropagation algorithm; hidden layer representations
4. Advanced topics
1. NN Examples:
ALVINN drives at 70 mph on highways
[Figure: the ALVINN network — a 30x32 retina of inputs feeding 4 hidden units, which feed 30 output units spanning steering directions from Sharp Left through Straight Ahead to Sharp Right]

[Figure: face recognition network — 30x32 image inputs; the outputs encode head pose: left, straight (strt), right (rght), up]
On typical input images, the network is 90% accurate at learning head pose and at recognizing 1 of 20 faces.
2. The Perceptron
[Rosenblatt, 1962]
[Figure: perceptron — inputs $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$]

$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$

or, in vector notation, $o(\vec{x}) = 1$ if $\vec{w} \cdot \vec{x} > 0$ and $-1$ otherwise.
[Figure: (a) a set of examples that a perceptron decision surface separates; (b) a set of examples that is not linearly separable]

The perceptron can represent some useful functions. What weights represent $g(x_1, x_2) = AND(x_1, x_2)$? (One answer appears in the sketch below.) But certain examples may not be linearly separable; therefore, we'll want networks of these units...
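A minimal sketch of such a threshold unit in Python. The weight assignment $w_0 = -0.8$, $w_1 = w_2 = 0.5$ is one choice that implements AND for binary 0/1 inputs; the function name is ours:

```python
# Minimal perceptron computing AND(x1, x2) for binary inputs.
# The weights (w0 = -0.8, w1 = w2 = 0.5) are one assignment that
# works; many others do as well.

def perceptron(x, w):
    """Threshold unit: output 1 if w . x > 0 (with x0 = 1), else -1."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1

w_and = [-0.8, 0.5, 0.5]
for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"AND({x1}, {x2}) -> {perceptron([x1, x2], w_and)}")
# Only (1, 1) gives net = 0.2 > 0, so only that input outputs +1.
```

No single weight vector implements XOR, which is exactly the linear-separability limitation noted above.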
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

where $D$ is the set of training examples, $t_d$ the target output and $o_d$ the linear unit's output for example $d$. The linear unit uses the gradient descent training rule, presented on the next slides. Remark: Ch. 6 (Bayesian Learning) shows that the hypothesis $h = (w_0, w_1, \ldots, w_n)$ that minimises $E$ is the most probable hypothesis given the training data.
$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, x_{i,d}$, which yields the training rule $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$.
The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error, given a sufficiently small learning rate $\eta$:
- even when the training data contains noise
- even when the training data is not separable by $H$
Note: if $\eta$ is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows.
Remark
Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
- the hypothesis space contains continuously parametrized hypotheses, and
- the error can be differentiated w.r.t. these hypothesis parameters.
Its key practical difficulties are that convergence to a local minimum can be slow, and that when the error surface has multiple local minima there is no guarantee the procedure finds the global one. To alleviate these difficulties, a variation called incremental (or: stochastic) gradient descent was proposed (see next slide).
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if $\eta$ is made small enough, as contrasted in the sketch below.
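A sketch contrasting the two modes for a single linear unit, following the $\Delta w_i = \eta \sum_{d} (t_d - o_d)\, x_{i,d}$ rule above. The toy data, learning rate, and epoch count are illustrative assumptions:

```python
import random

# Gradient descent for a single linear unit o = w . x (with x0 = 1).
# Batch mode sums the gradient over all of D before each update;
# incremental (stochastic) mode updates after every single example.

def output(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def train(examples, eta=0.05, epochs=200, incremental=False):
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        if incremental:
            for x, t in examples:          # update per example
                err = t - output(w, x)
                w[0] += eta * err
                for i, xi in enumerate(x):
                    w[i + 1] += eta * err * xi
        else:                              # batch: accumulate, then update
            grad = [0.0] * len(w)
            for x, t in examples:
                err = t - output(w, x)
                grad[0] += err
                for i, xi in enumerate(x):
                    grad[i + 1] += err * xi
            w = [wi + eta * g for wi, g in zip(w, grad)]
    return w

random.seed(0)
# Noisy samples of the target t = 2*x - 1
data = [([x], 2 * x - 1 + random.gauss(0, 0.05))
        for x in (0, 0.25, 0.5, 0.75, 1)]
print("batch:      ", train(data))
print("incremental:", train(data, incremental=True))
```

Shrinking `eta` brings the two runs' final weights arbitrarily close, which is the approximation claim above; the gradually decreasing learning rate suggested on the previous slide is the same loop with `eta` reduced as training proceeds.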
[Figure: sigmoid unit — inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$]

$net = \sum_{i=0}^{n} w_i x_i \qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}$

$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$.
We can derive gradient descent rules to train:
- one sigmoid unit
- multilayer networks of sigmoid units → the Backpropagation algorithm
A small sketch of a sigmoid unit follows.
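A minimal sketch of a sigmoid unit, with a numeric check of the derivative property above (the function names and the test point are ours):

```python
import math

# Sigmoid unit: o = sigma(net), net = sum_i w_i x_i (with x0 = 1).
def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_unit(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigma(net)

# Numeric check of d(sigma)/dx = sigma(x) * (1 - sigma(x)),
# the property that makes the gradient rules easy to derive.
x, h = 0.7, 1e-6
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)
analytic = sigma(x) * (1 - sigma(x))
print(numeric, analytic)   # the two values agree to ~6 decimal places
```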
[Figure: network for vowel recognition (left) and the decision surface it learned, plotted over the two input features F1 and F2 (right)]
This network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g. "head", "hid"). The inputs have been obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds; the network's prediction is the output whose value is highest. The plot on the right illustrates the highly non-linear decision surface represented by the learned network. Points shown on the plot are test examples distinct from the examples used to train the network.
Note: To see how these weight-update rules were derived, please refer to [Tom Mitchell, 1997], pp. 101-103.
More on Backpropagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum; in practice it often works well (can run multiple times)
- Often includes weight momentum: $\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$
- Minimizes error over the training examples; will it generalize well to subsequent examples?
- Training can take thousands of iterations → slow!
- Using the network after training is very fast
A sketch of the algorithm with momentum follows this list.
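A compact sketch of stochastic backpropagation with momentum for one sigmoid hidden layer, using the $\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$ update above. The class name, layer sizes, $\eta = 0.3$ and $\alpha = 0.9$ are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Network:
    """One sigmoid hidden layer, sigmoid outputs, trained by
    stochastic backpropagation with momentum."""

    def __init__(self, n_in, n_hidden, n_out, eta=0.3, alpha=0.9, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize weights near zero, so the initial network is near-linear.
        self.w1 = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))
        self.w2 = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
        self.dw1 = np.zeros_like(self.w1)   # previous updates, for momentum
        self.dw2 = np.zeros_like(self.w2)
        self.eta, self.alpha = eta, alpha

    def forward(self, x):
        h = sigmoid(self.w1 @ np.append(x, 1.0))   # hidden activations
        o = sigmoid(self.w2 @ np.append(h, 1.0))   # output activations
        return h, o

    def train_step(self, x, t):
        h, o = self.forward(x)
        delta_o = o * (1 - o) * (t - o)                        # output errors
        delta_h = h * (1 - h) * (self.w2[:, :-1].T @ delta_o)  # hidden errors
        # Momentum update: eta * gradient + alpha * previous update.
        self.dw2 = self.eta * np.outer(delta_o, np.append(h, 1.0)) \
                   + self.alpha * self.dw2
        self.dw1 = self.eta * np.outer(delta_h, np.append(x, 1.0)) \
                   + self.alpha * self.dw1
        self.w2 += self.dw2
        self.w1 += self.dw1
```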
Convergence of Backpropagation
Nature of convergence:
- Initialize weights near zero
- Therefore, the initial network is near-linear (see the check below)
- Increasingly non-linear functions become possible as training progresses
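A quick numeric check of the near-linearity claim: around $net = 0$ the sigmoid is well approximated by its tangent line, $\sigma(net) \approx 0.5 + net/4$ (since $\sigma'(0) = 0.25$):

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

# With weights near zero, every unit's net input is near zero, where
# the sigmoid is approximately the linear function 0.5 + net/4.
for net in (-0.2, -0.1, 0.0, 0.1, 0.2):
    print(net, round(sigma(net), 4), round(0.5 + net / 4, 4))
# For small |net| the two columns agree to ~3 decimal places.
```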
A target function:
Input    → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
This network was trained to learn the identity function, using the 8 training examples shown.
Learned hidden layer representation:

Input    → Hidden Values → Output
10000000 → .89 .04 .08   → 10000000
01000000 → .15 .99 .99   → 01000000
00100000 → .01 .97 .27   → 00100000
00010000 → .99 .97 .71   → 00010000
00001000 → .03 .05 .02   → 00001000
00000100 → .01 .11 .88   → 00000100
00000010 → .80 .01 .98   → 00000010
00000001 → .60 .94 .01   → 00000001
After 8000 training epochs, the 3 hidden unit values encode the 8 distinct inputs. Note that if the encoded values are rounded to 0 or 1, the result is a standard binary encoding of 8 distinct values (though not the usual one, i.e. not 1 → 001, 2 → 010, etc.). A sketch reproducing the experiment follows.
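A sketch reproducing this experiment with the `Network` class from the backpropagation sketch above. The epoch count matches the slide; the exact hidden codes learned will vary with initialization:

```python
import numpy as np  # assumes the Network class from the sketch above

examples = np.eye(8)              # each row is both input and target
net = Network(8, 3, 8)            # 8 inputs, 3 hidden units, 8 outputs
for epoch in range(8000):         # epoch count as reported on the slide
    for x in examples:
        net.train_step(x, x)      # identity: target equals input
for x in examples:
    h, o = net.forward(x)
    print(x.astype(int), np.round(h, 2), o.argmax())
```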
Training (I)
[Plot: sum of squared errors for each output unit, over 0-2500 training epochs]
Training (II)
[Plot: hidden unit encoding for input 01000000, over 0-2500 training epochs]
Training (III)
[Plot: weights from the inputs to one hidden unit, over 0-2500 training epochs]
[Plot: training set error and validation set error versus number of weight updates (example 2)]
Train on target slopes as well as values:

$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$
[Figure: a recurrent network computing y(t+1) from input x(t) and context c(t), where the context feeds back the previous state; and the same network unfolded in time over steps t, t-1, t-2]
A minimal sketch of this recurrence follows.
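A minimal sketch of the recurrence in the figure: y(t+1) is computed from x(t) and a context c(t) that carries the previous output, and "unfolding in time" means reusing the same weights at every copy. The sizes, sigmoid activation, and zero initial context are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(w_x, w_c, x_t, c_t):
    """One recurrent cell: y(t+1) = sigma(Wx x(t) + Wc c(t))."""
    return sigmoid(w_x @ x_t + w_c @ c_t)

rng = np.random.default_rng(0)
w_x, w_c = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
c = np.zeros(3)                       # initial context, here c(t-2) = 0
for x in rng.normal(size=(3, 2)):     # inputs x(t-2), x(t-1), x(t)
    y = step(w_x, w_c, x, c)          # the SAME weights at every copy
    c = y                             # the output becomes the next context
print(y)                              # y(t+1), after unfolding three steps
```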