
1

Neural Networks: Definition


Neural computing is the study of networks of adaptable nodes
which, through a process of learning from task examples, store
experiential knowledge and make it available for use.

2
What Are Neural Networks?
A computing model, inspired by the mammalian neural system,
composed of many simple, highly interconnected processing
units.
Neural network models are algorithms for cognitive tasks, such as
learning and optimization, which are in a loose sense based on
concepts derived from research into the nature of the brain.
3
What Are Neural Networks?
A neural network model is a directed graph with the following
properties:
A state variable $n_i$ is associated with each node $i$.
A real-valued weight $w_{ij}$ is associated with each link from node $i$
to node $j$.
A real-valued bias $u_i$ is associated with each node $i$.
A transfer function $f_i(n_j, w_{ij}, u_i)$ is defined for each node $i$,
which determines the state of node $i$.
4
What Can ANN Do?
Biological
Modeling the retina
Modeling brain disorders (ADD)
Business
Evaluate probability of oil in geological formation
Identify and filter promotion and job applicants
Mine corporate databases for business rules
Financial
Assessing credit risk
Identify forgeries
Interpret handwritten forms
Predict portfolio and stock values
5
What Can ANN Do?
Manufacturing
Automated robot control systems
Control material flow
Optimize production lines
Quality inspection
Medical
Analyze speech in hearing aids
Diagnose and prescribe treatment by symptoms
Monitor surgery and recovery
Read X-rays and CAT/PET scans
6
What Can ANN Do?
Military
Classify radar and sonar signals
Target acquisition and tracking
Analyze intelligence inputs
Optimizing scarce resources
Signal processing
Adaptive Noise Canceling
Zip Code Reader
Speech Recognition
7
A Brief History
First concepts
Turing 1936
McCulloch & Pitts 1943
Hebb 1949
Early steps 1950s - 1960s
The perceptron
ADALINE and MADALINE
Excessive hype

8
A Brief History
Stunted growth 1969-1981
Perceptrons by Minsky and Papert
Continued work
Renewed interest
The Hopfield model 1982
Backpropagation rediscovered 1985 (first 1974 by
Werbos)
Radial Basis Functions - Broomhead & Lowe 1988
9
A Quick Word About The Brain
10
The Biological Neuron
[Figure: a biological neuron, labelling the cell body, synapses, dendrites, and axons]
11
Computers And The Brain
We do not understand the brain
The ANN model is only loosely based on the brain
The ANN model is metaphoric to the brain
12
Computers vs. Neural Networks
Von Neumann Machines        Neural Networks
Few strong processors       ~10^11 simple neurons
Serial processing           Parallel processing
Central control             No central control
10^-9 sec. cycle            10^-3 sec. cycle
Bit data                    Voltage data
Not fault-tolerant          Very robust
Fast numeric operations     Slow numeric operations
Slow high-level operations  Fast high-level operations
Learning ?                  Learning !
13
Building Blocks Of The Model
The processing element
The connections
Learning methods
14
Processing Element Building Block
The basic building block of a neural network is the
processing element (or node or unit).
A generalised node embodies elements:
inputs(+bias)
weights
transfer function
combining function
activation function
output(s)
15
The function of a single node
The job of a processing element is to receive a number of
inputs (either from the external world or from other nodes
or from itself) and to distribute a single output (either to
the external world or to other nodes).
16
Some Input Functions
Weighted Summation
$$\mathrm{net} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \mathrm{bias}$$
where $w_i$ is the weight associated with the connection
between input $x_i$ and the processing element

17
Some Input Functions
Multiplication (or Product)
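$$\mathrm{net} = w_1 x_1 \cdot w_2 x_2 \cdots w_n x_n$$
similar to the weighted summation, but the summation is
replaced by the product
Maximum, Minimum, Majority
$$\mathrm{net} = \max_i (w_i x_i)$$
$$\mathrm{net} = \min_i (w_i x_i)$$
$$\mathrm{net} = \begin{cases} 1 & \text{if } \sum_i w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$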
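The input (combining) functions on the last two slides can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and `w` and `x` are assumed to be equal-length lists of weights and inputs.

```python
import math

def weighted_sum(w, x, bias=0.0):
    # net = w1*x1 + ... + wn*xn + bias
    return sum(wi * xi for wi, xi in zip(w, x)) + bias

def product(w, x):
    # net = (w1*x1) * (w2*x2) * ... * (wn*xn)
    return math.prod(wi * xi for wi, xi in zip(w, x))

def maximum(w, x):
    # net = largest weighted input
    return max(wi * xi for wi, xi in zip(w, x))

def minimum(w, x):
    # net = smallest weighted input
    return min(wi * xi for wi, xi in zip(w, x))

def majority(w, x):
    # net = 1 if the weighted inputs sum to a positive value, else -1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

w = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]
print(weighted_sum(w, x), product(w, x), maximum(w, x), minimum(w, x), majority(w, x))
```

All five functions reduce the same list of weighted inputs to a single net value; only the reduction differs.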
18
Some Activation Functions
Sigmoid
maps an input into a value between zero and one
Linear
where no transformation is applied to the outcome of
the combining function
Hyperbolic tangent
similar to the sigmoid, but the mapping is between -1 and 1
Step
where the transfer value equals 1 if the outcome of the
combining function is greater than some threshold,
otherwise it equals 0
19
Some Activation Functions

20
Closer Look At Transfer Functions
Unipolar
Sigmoid

Threshold()

Bipolar
Sigmoid

Sign

21
The Connections
The connections are the only thing changing in neural
networks
Connections may be either inhibitory or excitatory
Connection strengths are expressed by weights
22
The role of the weights
Each input or node is connected to a processing element

Graphically this is represented by an arc

Each arc has a weight. The weight simply determines the
influence (or strength) of an input to a processing element

Neuro-computing is concerned with identification of the
correct set of weights
23
An example of a single node
Assume a processing element receives 3 inputs: 1, 0.5, 0.3
If the combining function is the weighted summation and the
weights are: -0.2, 0.04, 2.35
then the result of the combining function is
(1)(-0.2) + (0.5)(0.04) + (0.3)(2.35) = -0.2 + 0.02 + 0.705 = 0.525

[Figure: node with inputs 1, 0.5, 0.3, weights -0.2, 0.04, 2.35, and net input 0.525]
24
An example of a single node
If the activation function is
linear, f(x) = x, then the output is 0.525

[Figure: the same node with f(x) = x and output 0.525]
25
An example of a single node
If the activation function is
sigmoid, f(x) = 1 / (1 + exp(-x)), then the output is 1 / (1 + exp(-0.525)) = 0.628
[Figure: the same node with f(x) = 1 / (1 + exp(-x)) and output 0.628]
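The single-node example can be checked mechanically. The sketch below evaluates the weighted summation for the three inputs and weights given on the slides and then applies the linear and sigmoid activations; the helper names are illustrative and not part of the slides.

```python
import math

inputs = [1.0, 0.5, 0.3]
weights = [-0.2, 0.04, 2.35]

def net_input(x, w):
    # Weighted summation without a bias term, as in the example
    return sum(xi * wi for xi, wi in zip(x, w))

def linear(net):
    return net

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

net = net_input(inputs, weights)
print("net    =", round(net, 3))
print("linear =", round(linear(net), 3))
print("sigmoid=", round(sigmoid(net), 3))
```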
26
Neural Networks Layers
NN can be constructed using a number of processing
elements
Rather than a chaotic construction it is generally preferable
to build neural networks using layers
A neural network will have an input layer, an output layer and
in between zero, one or more of hidden layers
27
Neural Network Layers 2
Depending on where a processing element is placed, it is
categorised as an input, hidden or output processing
element
Typically, but not necessarily, each processing element in
a layer has the same transfer function
a NN with 4-3-2 configuration is a 2 or 3 layer NN
(depends on if input layer is counted) with 4 input nodes, 3
hidden nodes, 2 output nodes
28
The Role of the Input Layer
An input processing element receives input from the external
world and simply sends the actual input to the processing
elements of the next layer
29
The Role of the Hidden Layer
A hidden processing element receives its input from the
nodes of the previous layer and the transformation of the
input is sent to the next layer

A hidden layer may be seen as a pre-processor
30
The Role of the Output Layer
An output processing element delivers the representation of
the original input after transformations have taken place to
the world
31
Connectivity Matters
A number of different networks can be constructed; they differ in
terms of the connectivity pattern and the number of layers
Networks with no hidden layers are called single-layer networks
Networks with one or more hidden layers are called multi-layer networks
If all connections lead from input to output, the network is called
a feed-forward network
If there are connections in the opposite direction, it is
called a feedback or recurrent network
32
Artificial Neural Networks Models

Single layer
feedforward
Multi layer
feedforward
Recurrent
( feedforward )
33
Calculations of a multi-layer feed-forward
neural network
[Figure: multi-layer feed-forward network with nodes x1 to x5 and link values +1, +1, 1.5, -1, 0.5, +1, +1, 0.5, +1]
34
Learning Laws
As we saw on the previous slide the output with the current
weights is wrong if we want to perform AND.

This brings us to the problem of finding the correct set of
weights

The process of identifying the correct set of weights is called
the learning process and it is characterised by a learning
law

35
Learning Laws 2
The purpose of a learning law is to locate the set of weights
which will give correct answers for all the inputs

The learning is achieved by employing an algorithm which
iteratively changes the weights of the connections in
response to every set of inputs until the correct weights
have been located
36
Learning Laws 3
Most learning laws are based on Hebb's rule, which states
that
if two units are simultaneously active, increase the
strength of the connection between them

This rule is the basis for most learning laws used today
(Kohonen learning, Boltzmann learning, Delta rule)
37
Some Learning Rules
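Hebbian learning rule
$$\Delta w_{ij} = c \, f(w_i^t x) \, x_j$$
Perceptron learning rule
$$\Delta w_{ij} = c \, [d_i - \mathrm{sgn}(w_i^t x)] \, x_j$$
Delta learning rule
$$\Delta w_{ij} = c \, (d_i - o_i) \, f'(\mathrm{net}_i) \, x_j$$
Widrow-Hoff learning rule
$$\Delta w_{ij} = c \, (d_i - w_i^t x) \, x_j$$
where $c$ is the learning constant, $x$ the input vector, $w_i$ the weight
vector of node $i$, $d_i$ its desired output and $o_i$ its actual output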
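The four rules differ only in the factor that multiplies the input. A minimal Python sketch, assuming plain lists for the weight vector `w` of one node and the input vector `x`, and with hypothetical function names, returns the weight change for each rule:

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgn(net):
    # Signum convention assumed here: 1 if net > 0, else -1
    return 1 if net > 0 else -1

def hebbian(w, x, c, f):
    # delta_w_j = c * f(w.x) * x_j (no desired output is used)
    out = f(dot(w, x))
    return [c * out * xj for xj in x]

def perceptron(w, x, d, c):
    # delta_w_j = c * [d - sgn(w.x)] * x_j
    err = d - sgn(dot(w, x))
    return [c * err * xj for xj in x]

def delta(w, x, d, c, f, f_prime):
    # delta_w_j = c * (d - o) * f'(net) * x_j, with o = f(net)
    net = dot(w, x)
    return [c * (d - f(net)) * f_prime(net) * xj for xj in x]

def widrow_hoff(w, x, d, c):
    # delta_w_j = c * (d - w.x) * x_j
    err = d - dot(w, x)
    return [c * err * xj for xj in x]

w, x = [1.0, -1.0], [2.0, 1.0]
print(perceptron(w, x, d=-1, c=0.5))
print(widrow_hoff(w, x, d=0.0, c=0.5))
```

With an identity activation (f(net) = net, f'(net) = 1) the delta rule reduces to the Widrow-Hoff rule, which the two functions make easy to confirm.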
38
Learning Methods
Supervised approach
a neural network is given a set of inputs and also the
correct output

39
Learning Methods 2
Unsupervised approach
a neural network is given a set of inputs and no outputs.
The network attempts to generate its own classes

40
Learning Methods 3
Reinforcement approach
a neural network is given a set of inputs and no outputs.
The network generates an output and only then it is
told if the produced output was correct or not
Learn by doing
41
Single-Layer Perceptrons
Network architecture
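[Figure: inputs x1, x2, x3 with weights w1, w2, w3, plus a bias weight w0, feeding a single unit whose output is y = signum(net) or y = step(net)]
$$\mathrm{net} = \sum_{i=1}^{n} x_i w_i - u = \sum_{i=1}^{n} x_i w_i + w_0, \quad \text{where } w_0 = -u$$
$$\mathrm{net} = \sum_{i=0}^{n} x_i w_i, \quad \text{where } i \text{ now starts at } 0 \text{ and } x_0 = 1$$
Signum(net) = 1 if net > 0, else -1
Step(net) = 1 if net > 0, else 0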
42
Example I - The AND Function
[Figure: perceptron with inputs X1 and X2, weights W1, W2, W0 and output O; values shown: 1, 1, 2]
Target: input 1,1 ---> 1
all other inputs ---> 0
43
Single-Layer Perceptrons
If the response is correct, no modification takes place; otherwise
the weights are adjusted by the perceptron learning rule
$$\Delta w_{ij} = c \, [d_i - \mathrm{sgn}(w_i^t x)] \, x_j$$
An entire pass through all of the input training vectors is
called an epoch. When such an entire pass of the training
set has occurred without error, training is complete.
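The training procedure can be sketched as follows, here applied to the AND function. This is a minimal illustration under stated assumptions: targets are bipolar (-1/+1) to match the signum output, the bias is handled as weight `w[0]` on a constant input of 1, and the learning constant `c`, the zero initial weights and the `max_epochs` cap are arbitrary choices rather than values from the slides.

```python
def sgn(net):
    return 1 if net > 0 else -1

def train_perceptron(samples, c=0.5, max_epochs=100):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight (x0 = 1)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, d in samples:
            xs = [1.0] + list(x)
            y = sgn(sum(wi * xi for wi, xi in zip(w, xs)))
            if y != d:                       # wrong response: apply the rule
                errors += 1
                w = [wi + c * (d - y) * xi for wi, xi in zip(w, xs)]
        if errors == 0:                      # an error-free epoch ends training
            return w, epoch
    return w, None                           # no error-free epoch within the cap

def predict(w, x):
    return sgn(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, epochs = train_perceptron(AND)
print(weights, epochs)
```

The function returns `None` for the epoch count when no error-free pass occurs within `max_epochs`, which is what happens for training sets that are not linearly separable.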
44
Limitations
Perceptron networks have several limitations.
First, the output values of a perceptron can take on only one
of two values (True or False).
Second, perceptrons can only classify linearly separable sets
of vectors. If a straight line or plane can be drawn to
separate the input vectors into their correct categories, the
input vectors are linearly separable and the perceptron will
find the solution. If the vectors are not linearly separable
learning will never reach a point where all vectors are
classified properly.
The most famous example is the boolean XOR problem.

45
The XOR problem
In the 1960s perceptrons created a great deal of interest, until
M. Minsky and S. Papert (Perceptrons, MIT Press,
Cambridge, MA, 1969) showed that
single-layer perceptrons can only be used for toy problems,
since they
cannot represent a simple XOR function
46
The XOR problem 2
The task is to assign a binary input vector to class 0 if the
vector has an even number of 1s, and to class 1 otherwise.

A two-input binary XOR truth table:
0 0 0
0 1 1
1 0 1
1 1 0
47
The XOR problem 3
Recall that the output of a perceptron is given as follows:
1 if the weighted input is greater than 0
0 otherwise
The first input of XOR is 0 0 with desired output 0,
hence the weighted input must be less than or equal to zero
in order to get the desired output
$$0 \cdot w_1 + 0 \cdot w_2 + 1 \cdot w_0 \le 0$$
$$w_0 \le 0$$
48
The XOR problem 4
The second input of XOR is 0 1 with desired output 1,
hence the weighted input must be greater than zero in
order to get the desired output
$$0 \cdot w_1 + 1 \cdot w_2 + 1 \cdot w_0 > 0$$
$$w_2 + w_0 > 0$$
49
The XOR problem 5
The third input of XOR is 1 0 with desired output 1,
hence the weighted input must be greater than zero in
order to get the desired output
$$1 \cdot w_1 + 0 \cdot w_2 + 1 \cdot w_0 > 0$$
$$w_1 + w_0 > 0$$
50
The XOR problem 6
The fourth input of XOR is 1 1 with desired output 0,
hence the weighted input must be less than or equal to zero
in order to get the desired output
$$1 \cdot w_1 + 1 \cdot w_2 + 1 \cdot w_0 \le 0$$
$$w_1 + w_2 + w_0 \le 0$$
51
The XOR problem 7
In summary, the perceptron must satisfy the following
four inequalities
$$w_0 \le 0$$
$$w_2 + w_0 > 0$$
$$w_1 + w_0 > 0$$
$$w_1 + w_2 + w_0 \le 0$$
The first inequality tells us that $w_0$ must be less than or equal to
zero. Therefore, for the 2nd and 3rd to hold, we must have
$w_2 > -w_0 \ge 0$ and $w_1 > -w_0 \ge 0$. Adding these gives
$w_1 + w_2 + w_0 > -w_0 \ge 0$, which
contradicts the 4th, which says that this sum
must be negative or zero
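The contradiction can also be illustrated by brute force. The sketch below searches a grid of candidate weights (w0, w1, w2) for a step-output perceptron that reproduces a given truth table; the grid range and step are arbitrary assumptions, so an empty result for XOR illustrates the argument above rather than proving it.

```python
from itertools import product

def step(net):
    return 1 if net > 0 else 0

def find_weights(truth_table, grid):
    # Return the first (w0, w1, w2) on the grid that reproduces the table, or None
    for w0, w1, w2 in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + w0) == t for (x1, x2), t in truth_table):
            return (w0, w1, w2)
    return None

grid = [k / 4 for k in range(-12, 13)]       # -3.0 to 3.0 in steps of 0.25

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND:", find_weights(AND, grid))
print("XOR:", find_weights(XOR, grid))
```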
52
Linear Separability
For binary inputs and outputs using the step function the
output is 1 if the net input is positive and 0 if the net input
is negative

net_input = 0: for two-inputs this equation represents a
line

If there are weights so that all of the training input vectors
for which the correct response is +1 lie on one side of
the decision line and all of the training input vectors for
which the correct response is 0 lie on the other side of
the boundary then the problem is linearly separable
53
Linear Separability
54
The XOR problem 8
The XOR problem is not linearly separable
We can not use a single-layer perceptron to construct a
straight line to partition the two dimensional input
space into two regions, each containing only data
points of the same class

[Figure: the four XOR input points plotted on X and Y axes, each running from 0 to 1]
55
Multi-Layer Perceptrons
The lack of suitable training methods for multi-layer
perceptrons (MLPs) led to a waning of interest until the
reformulation of the backpropagation training method
Previous work used signum or step activation functions,
which are nondifferentiable; now continuous activation
functions are employed
56
Multi-Layer Perceptrons 2
All nodes (or neurons) perform the same function on
incoming signals
a composite of the weighted sum and a differentiable
nonlinear activation function together known as the
transfer function
57
Multi Layer Feedforward Networks
The layers that are neither input nor output are called hidden
layers
Hidden layers extract high order statistics and in a way
provide an overall view of the input data
The output of each layer is used as input to the next layer
There is no theoretical limit on connections between non
neighboring layers
58
MLP Architecture 2-2-1
[Figure: a 2-2-1 MLP with an input level (x1, x2), an intermediate (hidden) level (h1, h2), and an output level (y)]
59
Activation Functions
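Logistic function
$$f(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}}$$
Hyperbolic tangent function
$$f(\mathrm{net}) = \tanh(\mathrm{net}/2) = \frac{1 - e^{-\mathrm{net}}}{1 + e^{-\mathrm{net}}} = \frac{2}{1 + e^{-\mathrm{net}}} - 1 = \frac{e^{\mathrm{net}/2} - e^{-\mathrm{net}/2}}{e^{\mathrm{net}/2} + e^{-\mathrm{net}/2}}$$
Identity function
$$f(\mathrm{net}) = \mathrm{net}$$
where net is the weighted input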

60
Activation Functions 2
Logistic and hyperbolic tangent functions
approximate the step and signum functions, respectively,
but they provide smooth, non-zero derivatives with
respect to the input signals
referred to as squashing functions since the inputs to
these functions are squashed to the range [0,1] or [-1,1]
referred to as sigmoidal functions because of their S-shaped curves
the hyperbolic tangent is sometimes referred to as the bipolar
sigmoidal
the logistic is sometimes referred to as the binary
sigmoidal
61
Activation Functions Graphs

[Figure: plots of the logistic function and the hyperbolic tangent function]
62
Identity Activation Function
Identity function
it is usually employed for nodes of the output layer to
approximate a continuous valued function not limited to
[0,1] or [-1,1]
such nodes are referred to as the linear nodes

[Figure: plot of the identity function]
63
Binary and Bipolar Sigmoid Derivatives
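Binary sigmoid:
$$f(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}}$$
$$f'(\mathrm{net}) = f(\mathrm{net}) \, [1 - f(\mathrm{net})]$$
Bipolar sigmoid:
$$f(\mathrm{net}) = \frac{2}{1 + e^{-\mathrm{net}}} - 1$$
$$f'(\mathrm{net}) = 0.5 \, [1 + f(\mathrm{net})] \, [1 - f(\mathrm{net})]$$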
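Both derivative formulas express the slope in terms of the function value itself, which is what makes them cheap to use during training. A quick numerical check, sketched here with a central finite difference (the step size `h` and the sample points are arbitrary assumptions), compares each closed form with an estimated slope:

```python
import math

def binary_sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def binary_sigmoid_prime(net):
    f = binary_sigmoid(net)
    return f * (1.0 - f)

def bipolar_sigmoid(net):
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def bipolar_sigmoid_prime(net):
    f = bipolar_sigmoid(net)
    return 0.5 * (1.0 + f) * (1.0 - f)

def numeric_derivative(f, net, h=1e-6):
    # Central difference estimate of f'(net)
    return (f(net + h) - f(net - h)) / (2.0 * h)

for net in (-2.0, 0.0, 0.5, 3.0):
    print(net,
          binary_sigmoid_prime(net), numeric_derivative(binary_sigmoid, net),
          bipolar_sigmoid_prime(net), numeric_derivative(bipolar_sigmoid, net))
```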
64
Learning
Learning target:
minimize the difference between actual outputs and target
outputs

Learning rule:
Steepest descent (Back-propagation)
Conjugate gradient method
All optimization methods using first derivative
Derivative-free optimization

65
MLP and the backpropagation algorithm
66
67
68
MLP and the backpropagation algorithm
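[Figure: a network with input layer, hidden layer and output layer; inputs x_k connect through weights w_ki to hidden units h_i, which connect through weights w_ij to outputs o_j; the outputs are compared with the desired outputs y_j; the signal flows forward and the error flows backward]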
69
Backpropagation Algorithm
0 Initialise Weights
1 While Stopping condition is false, do steps 2 to 9


70
Backpropagation Algorithm 2
2 For each training pair, do steps 3 to 8
Feedforward pass
3 Each input unit receives input signal and broadcasts this
signal to all units in the layer above (the hidden units)
4 Each hidden unit sums its weighted input signals, applies
its activation function to compute its output signal and
sends this signal to all units in the layer above (output
units)
5 Each output unit sums its weighted input signals and
applies its activation function to compute its output signal
End of Feedforward Pass
71
Backpropagation Algorithm 3
Backward Pass
6 Each output unit receives a target pattern corresponding
to the input training pattern, computes its error information
term, calculates its weight and bias correction term, and
sends its error information term to units in the layer
below
7 Each hidden unit sums its error information terms (from
units in the layer above) multiplies by the derivative of its
activation function to calculate its error information term,
calculates its weight and bias correction term
End of Backward pass
72
Backpropagation Algorithm 4
Updating Pass
8 Each output unit updates its bias and weights. Each
hidden unit updates its bias and weights.
End of Updating pass

9 Test stopping criterion
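Steps 0 to 9 can be sketched for a network with one hidden layer as follows. This is a minimal illustration, not a reference implementation: it assumes logistic activations in both layers, a squared-error measure, per-pattern updates, and arbitrary values for the learning rate, the initial weight range, the tolerance and the epoch cap. Each weight row is stored as `[bias, w_1, ..., w_n]`.

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_hidden, w_output):
    # Steps 3-5: feedforward pass
    h = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in w_hidden]
    o = [sigmoid(r[0] + sum(w * hi for w, hi in zip(r[1:], h))) for r in w_output]
    return h, o

def train_pair(x, target, w_hidden, w_output, rate):
    h, o = forward(x, w_hidden, w_output)
    # Step 6: error information term of each output unit
    delta_o = [(t - ok) * ok * (1.0 - ok) for t, ok in zip(target, o)]
    # Step 7: error information term of each hidden unit
    delta_h = [hj * (1.0 - hj) * sum(dk * r[j + 1] for dk, r in zip(delta_o, w_output))
               for j, hj in enumerate(h)]
    # Step 8: update biases and weights
    for dk, r in zip(delta_o, w_output):
        r[0] += rate * dk
        for j, hj in enumerate(h):
            r[j + 1] += rate * dk * hj
    for dj, r in zip(delta_h, w_hidden):
        r[0] += rate * dj
        for i, xi in enumerate(x):
            r[i + 1] += rate * dj * xi
    return 0.5 * sum((t - ok) ** 2 for t, ok in zip(target, o))

def train(pairs, n_hidden=2, rate=0.5, max_epochs=5000, tolerance=0.01, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    # Step 0: initialise weights with small random values
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    total = float("inf")
    for _ in range(max_epochs):                                   # step 1
        total = sum(train_pair(x, t, w_hidden, w_output, rate)    # step 2
                    for x, t in pairs)
        if total < tolerance:                                     # step 9
            break
    return w_hidden, w_output, total

XOR = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_h, w_o, error = train(XOR)
print("summed squared error in the last epoch:", error)
```

Whether the run reaches the tolerance depends on the initial weights; a 2-2-1 network trained this way can stall in a poor region of the error surface.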
73
Backpropagation Algorithm 5
74
Problems
How to determine the architecture?
How to determine the parameters?
How to get global optima?
... ...
75
GA and ANN
Three levels:
connection weights: introduce an adaptive and global
approach to training
architectures: adapt the topologies to different tasks without
human intervention, and thus provide an approach to
automatic ANN design, since both ANN connection weights
and structures can be evolved
learning rules: learning to learn, an adaptive process of
automatic discovery of novel learning rules
76
Evolution of connection weights
Weight training in ANNs is usually formulated as
minimization of an error function, such as the mean
square error between target and actual outputs averaged
over all examples, by iteratively adjusting connecting
weights.
BP often gets trapped in a local minimum of the error
function and is incapable of finding a global minimum if the
error function is multimodal and/or nondifferentiable.
GA can be used effectively in the evolution to find a near-
optimal set of connection weights globally without
computing gradient information.
77
Typical cycle of the evolution of the
connection weights
1 Decode each individual in the current generation into a set
of connection weights and construct a corresponding ANN
with the weights
2 Evaluate each ANN by computing its total mean square
error between actual and target outputs. The fitness of an
individual is determined by the error. A regularization term
may be included in the fitness function to penalize large
weights.
3 Select parents for reproduction based on their fitness
4 Apply genetic operators, such as crossover and mutation,
to parents to generate offspring, which form the next
generation
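The four-step cycle can be sketched for a fixed 2-2-1 network on XOR. Everything beyond the cycle itself is an assumption made for illustration: real-valued chromosomes of 9 genes, truncation selection, one-point crossover, Gaussian mutation, a single elite copy, and the population size and rates.

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def decode(chrom):
    # Step 1: 9 genes -> two hidden nodes and one output node, each [bias, w1, w2];
    # weights feeding the same node are kept adjacent in the chromosome
    return [chrom[0:3], chrom[3:6]], chrom[6:9]

def predict(chrom, x):
    hidden, out = decode(chrom)
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in hidden]
    return sigmoid(out[0] + out[1] * h[0] + out[2] * h[1])

def mse(chrom):
    # Step 2: mean square error between actual and target outputs (lower is fitter)
    return sum((t - predict(chrom, x)) ** 2 for x, t in XOR) / len(XOR)

def evolve(pop_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=mse)
        history.append(mse(pop[0]))
        parents = pop[:pop_size // 2]            # step 3: truncation selection
        children = [pop[0][:]]                   # keep one unchanged copy of the best
        while len(children) < pop_size:          # step 4: crossover and mutation
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 9)
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0, 0.3) if rng.random() < 0.1 else g for g in child]
            children.append(child)
        pop = children
    best = min(pop, key=mse)
    history.append(mse(best))
    return best, history

best, history = evolve()
print("best error, first and last generation:", history[0], history[-1])
```

Because one copy of the best individual is carried over unchanged, the best error in the population can never increase from one generation to the next; no gradient is computed anywhere.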
78
Representation
Binary or real number
Put connection weights to the same node together. Nodes in
ANN are in essence feature extractors and detectors.
Separating inputs to the same node far apart would
increase the difficulty of constructing useful feature
detectors because they might be destroyed by crossover
operators.
Permutation problem: The many-to-one mapping from the
representation to the actual ANN since two ANNs that
order their hidden nodes differently in their chromosomes
will still be equivalent functionally. This makes crossover
operator very inefficient in producing good offspring.
79
80
Comparison between GA and BP
GA can handle the global search problem better. It can be
used to train many different networks regardless of their
architecture, and saves a lot of human effort in
developing different training algorithms for different types of
ANN.
GA makes it easier to generate ANN with some special
characteristics.
GA is much less sensitive to initial conditions of training.
There is no clear winner in terms of the best training
algorithm.
81
Hybrid training
Combine GA's global search ability with local search's ability
to fine tune. GA can be used to locate a good region in the
space and then a local search procedure is used to find a
near-optimal solution in this region.
82
The evolution of architecture
The architecture of an ANN includes its topological structure,
i.e., connectivity, and the transfer function of each node in
the ANN.
The architecture has a significant impact on a network's
information processing capabilities. Given a learning task,
an ANN with only a few connections and linear nodes may
not be able to perform the task at all due to its limited
capability, while an ANN with a large number of
connections and nonlinear nodes may overfit noise in the
training data and fail to have good generalization ability.
83
Traditional way to design the architecture
There is no systematic way to design a near-optimal
architecture for a given task automatically.
A constructive algorithm starts with a minimal network
(network with minimal number of hidden layers, nodes and
connections) and adds new layers, nodes and
connections when necessary during training.
A destructive algorithm starts with a maximal network
(network with maximal number of hidden layers, nodes
and connections) and deletes unnecessary layers, nodes
and connections during training.
Such structural hill climbing methods are susceptible to
becoming trapped at structural local optima. They only
investigate restricted topological subsets rather than the
complete class of network architecture.
84
Typical cycle of the evolution of
architecture
1 Decode each individual in the current generation into an
architecture.
2 Train each ANN with the decoded architecture by a
predefined learning rule starting from different sets of
random initial connection weights and learning rule
parameters.
3 Compute the fitness of each individual according to the
above training result and other performance criteria such
as the complexity of the architecture.
4 Select parents from the population based on their fitness.
5 Apply search operators to the parents and generate
offspring which form the next generation.
85
The direct encoding scheme
An N x N matrix C = (c(i,j)) can represent an ANN architecture
with N nodes, where c(i,j) indicates the presence or absence
of the connection from node i to node j.
Such an encoding scheme can handle both feedforward and
recurrent ANNs.
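A small sketch of the direct encoding, assuming c(i,j) = 1 means a connection from node i to node j and that the chromosome is simply the rows of C concatenated. The helper names and the 5-node example are illustrative. The check treats an architecture as feedforward when its connection graph has no directed cycle.

```python
def to_chromosome(C):
    # Concatenate the rows of the connectivity matrix into one bit string
    return "".join(str(bit) for row in C for bit in row)

def is_feedforward(C):
    # Topological sort (Kahn's algorithm): succeeds only if there is no cycle
    n = len(C)
    indegree = [sum(C[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    visited = 0
    while ready:
        i = ready.pop()
        visited += 1
        for j in range(n):
            if C[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return visited == n

# Nodes 0-1 are inputs, 2-3 hidden, 4 output (a 2-2-1 network)
feedforward = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
recurrent = [row[:] for row in feedforward]
recurrent[4][2] = 1                      # feedback link from the output to a hidden node

print(to_chromosome(feedforward), is_feedforward(feedforward))
print(to_chromosome(recurrent), is_feedforward(recurrent))
```

The chromosome length grows as N squared, which is the cost noted for large ANNs under this scheme.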
86
A feedforward ANN
87
A recurrent ANN
88
Notes about direct encoding scheme
It is straightforward to implement.
Training error, training time, complexity can be used in the
fitness function
A large ANN would require a very large matrix and thus
increase the computation time of the evolution. Domain
knowledge can be used to reduce the search space
The permutation problem still exists
89
The indirect encoding scheme
Only some characteristics of an architecture are encoded to
reduce the length of the chromosome. The details about
each connection in an ANN are either predefined according
to prior knowledge or specified by a set of deterministic
development rules.
90
Parametric representation
ANN architectures may be specified by a set of parameters
such as the number of hidden layers, the number of
hidden nodes in each layer, the number of connections
between two layers, etc.
In general the parametric representation method will be most
suitable when we know what kind of architectures we are
trying to find.
91
Example of pattern recognition
Input   Output      Input   Output
0000    00          0100    00
1100    00          1000    00
1001    01          0001    01
1101    01          0101    01
0010    11          1010    11
0110    11          1110    11
0011    10          0111    10
1011    10          1111    10
In fact the first two bits of the input are noise and the output
is the Gray code of the last two bits of the input.
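The target mapping can be generated rather than listed. The sketch below uses the standard binary-to-Gray conversion (value XOR value shifted right by one) on the last two input bits and ignores the first two; the function names are illustrative.

```python
def gray_code(value):
    # Standard binary-reflected Gray code of a non-negative integer
    return value ^ (value >> 1)

def target(input_bits):
    # The first two bits are noise; only the last two determine the output
    low = int(input_bits[-2:], 2)
    return format(gray_code(low), "02b")

for n in range(16):
    bits = format(n, "04b")
    print(bits, target(bits))
```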
92
Chromosome
We use a 16-bit chromosome
The first 2 bits stand for the learning rate: 0.5, 0.25, 0.125,
0.0625
The next 2 bits stand for the momentum: 0.9, 0.8, 0.7, 0.6
The next 2 bits stand for the range of the initial weights: 1,
0.5, 0.25, 0.125
The next 5 bits are used for the 1st hidden layer: the first bit
indicates whether the layer is present and the other 4 bits
give the number of hidden units.
The last 5 bits are used for the 2nd hidden layer: the first bit
indicates whether the layer is present and the other 4 bits
give the number of hidden units.
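Decoding such a chromosome is a matter of slicing fixed-width fields. The sketch below makes two assumptions the slides do not state: each 2-bit code indexes the listed values in the order given (00 is the first value), and the 4-bit unit count is the binary value plus one, so a layer has 1 to 16 units. The first 2-bit field is treated as the learning rate. The function name and dictionary keys are hypothetical.

```python
LEARNING_RATES = [0.5, 0.25, 0.125, 0.0625]
MOMENTA = [0.9, 0.8, 0.7, 0.6]
WEIGHT_RANGES = [1.0, 0.5, 0.25, 0.125]

def decode_chromosome(bits, n_inputs=4, n_outputs=2):
    if len(bits) != 16 or set(bits) - {"0", "1"}:
        raise ValueError("expected a 16-character bit string")
    layers = [n_inputs]
    for start in (6, 11):                        # the two 5-bit hidden-layer fields
        present = bits[start] == "1"
        units = int(bits[start + 1:start + 5], 2) + 1   # assumed range: 1..16
        if present:
            layers.append(units)
    layers.append(n_outputs)
    return {
        "learning_rate": LEARNING_RATES[int(bits[0:2], 2)],
        "momentum": MOMENTA[int(bits[2:4], 2)],
        "weight_range": WEIGHT_RANGES[int(bits[4:6], 2)],
        "layers": layers,
    }

print(decode_chromosome("0000001000010011"))
print(decode_chromosome("0110110000011111"))
```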
93
Evolution and result
Only use the first 8 samples for evolution.
Use 7 of these 8 samples for training the ANN and the other
one is used to get the fitness.
Finally we get a 4-1-4-2 ANN(structure and weight).
In order to check the final result we use the other 8 samples
and compare with a 4-16-16-2 ANN which is trained by BP.
94
Developmental rule representation
Development rules, which are used to construct architectures,
are encoded in chromosomes.
A development rule is usually described by a recursive
equation or a production system.
How to get such a set of rules to construct an ANN? One
answer is to evolve them. We can encode the whole rule
set as an individual (Pittsburgh approach) or encode each
rule as an individual (Michigan approach)
95
Examples of some development rules
96
Development of an ANN architecture
97
Simultaneous evolution of architectures &
weights
98
Evolution of learning rules
An ANN training algorithm may have different performance
when applied to different architectures. The design of
training rules, more fundamentally the learning rules used
to adjust weights, depends on the type of architectures
under investigation. Different variants of the Hebbian
learning rule have been proposed to deal with different
architectures. It is desirable to develop an automatic and
systematic way to adapt the learning rule to an
architecture and the task to be performed. Designing a
learning rule manually often implies that some
assumptions, which are not necessarily true in practice,
have to be made.
99
Typical cycle of the evolution of learning
rule
1 Decode each individual in the current generation into a
learning rule
2 Construct a set of ANNs with randomly generated
architectures and initial connection weights, and train
them using the decoded learning rule.
3 Calculate the fitness of each individual according to the
average training result
4 Select parents from the current generation according to
their fitness
5 Apply search operators to parents to generate offspring
which form the new generation
100
Evolution of algorithm parameters
The adaptive adjustment of BP's parameters through
evolution could be considered as the first attempt at the
evolution of learning rules.
Some researchers used a GA process to find parameters
for BP, but the ANN's architecture was predefined. The
parameters evolved in this case tend to be optimized
towards the architecture rather than being generally
applicable to learning.
Some researchers encoded BP's parameters in
chromosomes together with the ANN's architecture.
101
Evolution of learning rules
The evolution of learning rules has to work on the dynamic
behavior of an ANN.
Trying to develop a universal representation scheme which can
specify any kind of dynamic behavior is clearly
impractical.
Two basic assumptions which have often been made on
learning rules are 1) weight-updating depends only on
local information such as the activation of the input node,
the activation of the output node, the current connection
weight, etc.; 2) the learning rule is the same for all
connections in an ANN
102
Learning rule
A learning rule can be described by the following function



There are three major issues involved in the evolution of
learning rules: 1) determination of a subset of terms
described in the above equation; 2) representation of the
coefficients as chromosomes, and 3) the GA used to
evolve these chromosomes.
103
Other combinations of GA and ANN
Evolution of input features: finding a near-optimal set of input
features to an ANN
ANN as fitness estimator: the time-consuming fitness
evaluation based on real systems is replaced by fast
fitness evaluation based on ANN
Evolving ANN ensembles: combining different individuals in
the population to form an integrated system is expected to
produce better results.
104
A general framework for GA and ANN
