
Course content

Summary
Our goal is to introduce students to a powerful class of model, the Neural Network. In fact,
this is a broad term which includes many diverse models and approaches. We will first
motivate networks by analogy to the brain. The analogy is loose, but serves to introduce the
idea of parallel and distributed computation.
We then introduce one kind of network in detail: the feedforward network trained by
backpropagation of error. We discuss model architectures, training methods and data
representation issues. We hope to cover everything you need to know to get backpropagation
working for you. A range of applications and extensions to the basic model will be presented
in the final section of the module.
Lecture 1: Introduction
Computation in the brain
Artificial neuron models
Linear regression
Linear neural networks
Multi-layer networks
Error backpropagation
Lecture 2: The Backprop Toolbox
Revision: the backprop algorithm
Backprop: an example
Overfitting and regularization
Growing and pruning networks
Preconditioning the network
$omentum and learning rate adaptation
Computation in the brain
The brain - that's my second most favourite organ! - Woody Allen
The Brain as an Information Processing System
The human brain contains about 10 billion nerve cells, or neurons. On average, each neuron is connected to other neurons through about 10 000 synapses. (The actual figures vary greatly, depending on the local neuroanatomy.) The brain's network of neurons forms a massively parallel information processing system. This contrasts with conventional computers, in which a single processor executes a single series of instructions.
Against this, consider the time taken for each elementary operation: neurons typically operate at a maximum rate of about 100 Hz, while a conventional CPU carries out several hundred million machine-level operations per second. Despite being built with very slow hardware, the brain has quite remarkable capabilities:
its performance tends to degrade gracefully under partial damage. In contrast, most programs and engineered systems are brittle: if you remove some arbitrary parts, very likely the whole will cease to function.
it can learn (reorganize itself) from experience.
this means that partial recovery from damage is possible if healthy units can learn to take over the functions previously carried out by the damaged areas.
it performs massively parallel computations extremely efficiently. For example, complex visual perception occurs within less than 100 ms, that is, 10 processing steps!
it supports our intelligence and self-awareness. (Nobody knows yet how this occurs.)
                         brain                    computer
processing elements      10^14 synapses           10^8 transistors
element size             10^-6 m                  10^-6 m
energy use               30 W                     30 W (CPU)
processing speed         100 Hz                   10^9 Hz
style of computation     parallel, distributed    serial, centralized
fault tolerant           yes                      no
learns                   yes                      a little
intelligent, conscious   usually                  not (yet)
As a discipline of Artificial Intelligence, Neural Networks attempt to bring computers a little closer to the brain's capabilities by imitating certain aspects of information processing in the brain, in a highly simplified way.
Neural Networks in the Brain
The brain is not homogeneous. At the largest anatomical scale, we distinguish cortex, midbrain, brainstem, and cerebellum. Each of these can be hierarchically subdivided into many regions, and areas within each region, either according to the anatomical structure of the neural networks within it, or according to the function performed by them.
The overall pattern of projections (bundles of neural connections) between areas is extremely complex, and only partially known. The best mapped (and largest) system in the human brain is the visual system, where the first 10 or 11 processing stages have been identified. We distinguish feedforward projections that go from earlier processing stages (near the sensory input) to later ones (near the motor output), from feedback connections that go in the opposite direction.
In addition to these long-range connections, neurons also link up with many thousands of their neighbours. In this way they form very dense, complex local networks:
Neurons and Synapses
The basic computational unit in the nervous system is the nerve cell, or neuron. A neuron has:
Dendrites (inputs)
Cell body
Axon (output)
A neuron receives input from other neurons (typically many thousands). Inputs sum (approximately). Once input exceeds a critical level, the neuron discharges a spike - an electrical pulse that travels from the cell body, down the axon, to the next neuron(s) (or other receptors). This spiking event is also called depolarization, and is followed by a refractory period, during which the neuron is unable to fire.
The axon endings (Output Zone) almost touch the dendrites or cell body of the next neuron. Transmission of an electrical signal from one neuron to the next is effected by neurotransmitters, chemicals which are released from the first neuron and which bind to receptors in the second. This link is called a synapse. The extent to which the signal from one neuron is passed on to the next depends on many factors, e.g. the amount of neurotransmitter available, the number and arrangement of receptors, and the amount of neurotransmitter reabsorbed.
Synaptic Learning
Brains learn. Of course. From what we know of neuronal structures, one way brains learn is by altering the strengths of connections between neurons, and by adding or deleting connections between neurons. Furthermore, they learn "on-line", based on experience, and typically without the benefit of a benevolent teacher.
The efficacy of a synapse can change as a result of experience, providing both memory and learning through long-term potentiation. One way this happens is through the release of more neurotransmitter. Many other changes may also be involved.
Long-Term Potentiation:
An enduring (>1 hour) increase in synaptic efficacy that results from high-frequency stimulation of an afferent (input) pathway.
Hebb's Postulate:
"When an axon of cell A ... excite[s] cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency as one of the cells firing B is increased."
Bliss and Lomo discovered LTP in the hippocampus in 1973.
Points to note about LTP:
Synapses become more or less important over time (plasticity)
LTP is based on experience
LTP is based only on local information (Hebb's postulate)
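In computational terms, Hebb's postulate suggests a weight update that uses only the activities of the two units a connection joins. The Python sketch below illustrates that locality; the function name, learning rate and toy activity values are illustrative choices, not anything prescribed by these notes.

    import numpy as np

    def hebbian_update(w, pre, post, eta=0.01):
        # Hebb-style rule: the change to w[i, j] depends only on the activity of
        # presynaptic unit j and postsynaptic unit i (purely local information).
        return w + eta * np.outer(post, pre)

    # toy example: 3 presynaptic units feeding 2 postsynaptic units
    w = np.zeros((2, 3))
    pre = np.array([1.0, 0.0, 1.0])   # presynaptic activities
    post = np.array([1.0, 0.5])       # postsynaptic activities
    print(hebbian_update(w, pre, post))  # weights grow only where both sides are active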
Summary
The following properties of nervous systems will be of particular interest in our neurally-inspired models:
parallel, distributed information processing
high degree of connectivity among basic units
connections are modifiable based on e!perience
learning is a constant process, and usually unsupervised
learning is based only on local information
performance degrades gracefully if some units are removed
etc..........
Artificial Neuron Models
"omputational neurobiologists have constructed very elaborate computer models of neurons
in order to run detailed simulations of particular circuits in the brain. s "omputer Ccientists,
we are more interested in the general properties of neural networks, independent of how they
are actually >implemented> in the brain. This means that we can use much simpler, abstract
>neurons>, which .hopefully/ capture the essence of neural computation even if they leave out
much of the details of how biological neurons work.
+eople have implemented model neurons in hardware as electronic circuits, often integrated
on D#CI chips. (emember though that computers run much faster than brains % we can
therefore run fairly large networks of simple model neurons as software simulations in
reasonable time. This has obvious advantages over having to use special >neural> computer
hardware.
A Simple Artificial Neuron
Our basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning. The unit computes some function f of the weighted sum of its inputs:

y_i = f( Σ_j w_ij y_j )

Its output, in turn, can serve as input to other units.
The weighted sum is called the net input to unit i, often written net_i.
Note that w_ij refers to the weight from unit j to unit i (not the other way around).
The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.
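Such a unit is straightforward to express in code. The sketch below (names and numbers are illustrative, not part of the notes) computes the net input as the weighted sum of the incoming activities and passes it through an activation function; with the identity function it is a linear unit.

    import numpy as np

    def unit_output(weights, inputs, f=lambda net: net):
        # net input: the weighted sum of the incoming activities
        net = np.dot(weights, inputs)
        # activation function f; the identity (default) gives a linear unit
        return f(net)

    # inputs y_j from three other units, with weights w_ij into unit i
    y = np.array([0.5, -1.0, 2.0])
    w = np.array([0.1, 0.4, -0.2])
    print(unit_output(w, y))  # -0.75, i.e. just the net input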
Linear Regression
Fitting a Model to Data
Consider the data below:

(Fig. 1)
Each dot in the figure provides information about the weight (x-axis, units: U.S. pounds) and fuel consumption (y-axis, units: miles per gallon) for one of 74 cars (data from 1979). Clearly weight and fuel consumption are linked, so that, in general, heavier cars use more fuel.
Now suppose we are given the weight of a 75th car, and asked to predict how much fuel it will use, based on the above data. Such questions can be answered by using a model - a short mathematical description - of the data. The simplest useful model here is of the form

y = w_1 x + w_0    (1)

This is a linear model: in an xy-plot, equation 1 describes a straight line with slope w_1 and intercept w_0 with the y-axis, as shown in Fig. 2. (Note that we have rescaled the coordinate axes - this does not change the problem in any fundamental way.)
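As a concrete illustration of equation 1, the model is just a one-line function of a car's weight. The sketch below uses arbitrary, made-up parameter values, not the fitted ones shown later in Fig. 7.

    def predict_mpg(weight, w1, w0):
        # linear model of equation 1: y = w1 * x + w0
        return w1 * weight + w0

    # arbitrary example parameters; a 3000-pound car would be predicted at 24.0 mpg
    print(predict_mpg(3000.0, w1=-0.007, w0=45.0))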
How do we choose the two parameters w_0 and w_1 of our model? Clearly, any straight line drawn somehow through the data could be used as a predictor, but some lines will do a better job than others. The line in Fig. 2 is certainly not a good model: for most cars, it will predict too much fuel consumption for a given weight.

(Fig. 2)
The Loss Function
In order to make precise what we mean by being a "good predictor", we define a loss (also called objective or error) function E over the model parameters. A popular choice for E is the sum-squared error:

E = Σ_i (t_i - y_i)^2    (2)
In words, it is the sum over all points i in our data set of the squared difference between the target value t_i (here: actual fuel consumption) and the model's prediction y_i, calculated from the input value x_i (here: weight of the car) by equation 1. For a linear model, the sum-squared error is a quadratic function of the model parameters. Figure 3 shows E for a range of values of w_0 and w_1. Figure 4 shows the same function as a contour plot.

(Fig. 3)

(Fig. 4)
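Equation 2 translates directly into code. The sketch below evaluates the loss on a tiny made-up data set standing in for the real car data; the function and variable names are illustrative choices.

    import numpy as np

    def sum_squared_error(w0, w1, x, t):
        # equation 2: sum over all points of (target - prediction)^2,
        # with predictions from the linear model y = w1 * x + w0
        y = w1 * x + w0
        return np.sum((t - y) ** 2)

    x = np.array([2.0, 3.0, 4.0])     # inputs (e.g. rescaled car weights)
    t = np.array([30.0, 25.0, 20.0])  # targets (e.g. fuel consumption)
    print(sum_squared_error(w0=40.0, w1=-5.0, x=x, t=t))  # 0.0 for a perfect fit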
Minimizing the Loss
The loss function E provides us with an objective measure of predictive error for a specific choice of model parameters. We can thus restate our goal of finding the best (linear) model as finding the values for the model parameters that minimize E.
For linear models, linear regression provides a direct way to compute these optimal model parameters. (See any statistics textbook for details.) However, this analytical approach does not generalize to nonlinear models (which we will get to by the end of this lecture). Even though the solution cannot be calculated explicitly in that case, the problem can still be solved by an iterative numerical technique called gradient descent. It works as follows:
1. Choose some (random) initial values for the model parameters.
2. Calculate the gradient G of the error function with respect to each model parameter.
3. Change the model parameters so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of -G.
4. Repeat steps 2 and 3 until G gets close to zero.
How does this work? The gradient of E gives us the direction in which the loss function at the current setting of the w has the steepest slope. In order to decrease E, we take a small step in the opposite direction, -G (Fig. 5).
(Fig. 5)
By repeating this over and over, we move "downhill" in E until we reach a minimum, where G = 0, so that no further progress is possible (Fig. 6).
(Fig. 6)
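For the linear model, the four steps above can be written out directly: the gradient G of E with respect to w_0 and w_1 follows from differentiating equation 2. The sketch below, run on a made-up data set, is one possible implementation; the learning rate, step count and stopping threshold are arbitrary choices.

    import numpy as np

    def gradient_descent(x, t, eta=0.02, steps=5000):
        w0, w1 = np.random.randn(2)             # step 1: random initial parameters
        for _ in range(steps):
            y = w1 * x + w0                     # current predictions
            g_w0 = -2.0 * np.sum(t - y)         # step 2: gradient of E w.r.t. w0
            g_w1 = -2.0 * np.sum((t - y) * x)   #         and w.r.t. w1
            w0 -= eta * g_w0                    # step 3: small step in direction -G
            w1 -= eta * g_w1
            if abs(g_w0) + abs(g_w1) < 1e-6:    # step 4: stop once G is close to zero
                break
        return w0, w1

    x = np.array([2.0, 3.0, 4.0])
    t = np.array([30.0, 25.0, 20.0])
    print(gradient_descent(x, t))  # converges to roughly w0 = 40, w1 = -5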
Fig. 7 shows the best linear model for our car data, found by this procedure.

(Fig. 7)
It's a neural network!
Our linear model of equation 1 can in fact be implemented by the simple neural network shown in Fig. 8. It consists of a bias unit, an input unit, and a linear output unit. The input unit makes external input x (here: the weight of a car) available to the network, while the bias unit always has a constant output of 1. The output unit computes the sum:

y_2 = y_1 w_21 + 1.0 w_20    (3)

It is easy to see that this is equivalent to equation 1, with w_21 implementing the slope of the straight line, and w_20 its intercept with the y-axis.
(Fig. 8)
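To make the equivalence concrete, the network of Fig. 8 can be sketched as a small function: the input unit simply passes x on as y_1, the bias unit always contributes 1.0, and the output unit forms the weighted sum of equation 3. The parameter values below are the same arbitrary ones used in the earlier sketches, merely relabelled as w21 and w20.

    def network_output(x, w21, w20):
        y1 = x       # input unit: makes the external input available
        bias = 1.0   # bias unit: constant output of 1
        # output unit, equation 3: y2 = w21 * y1 + w20 * 1.0
        return w21 * y1 + w20 * bias

    # identical to y = w1 * x + w0 of equation 1
    print(network_output(3000.0, w21=-0.007, w20=45.0))  # 24.0 again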
