
ARTIFICIAL NEURAL NETWORK

Artificial Neural Networks (ANNs), also called parallel distributed processing systems (PDPs) and connectionist systems, are intended to model the organizational principles of the central nervous system. This offers the hope that the biologically inspired computing capabilities of ANNs will allow cognitive and sensory tasks to be performed more easily and more satisfactorily than with conventional serial processors.

3.1 BIOLOGICAL NEURAL NETWORKS

A regulatory or control mechanism is a basic need of all multicellular organisms. These regulatory mechanisms switch on and off the body's multifarious physiological activities, such as the transmission, monitoring, integration and deciphering of the information that is continuously exchanged between the external and the internal environments. This synchronization between the various systems in a multicellular organism is not haphazard but controlled: the timing and location of one set of activities is correlated with that of the other set of activities. This is known as co-ordination.
The co-ordination in time and space between one set of activities and another in multicellular organisms is due to the nervous system and the endocrine system. The nervous system is the body's control unit and communication network; it shares the maintenance of the homoeostasis of the body with the endocrine system. The nervous system is made up of several million nerve cells, along with various supporting tissues, forming a series of conducting tissues extending to all parts of the body.
The main function of a nerve cell is to transmit external and internal impulses from the site of the receptor to the central nervous system through sensory nerves, and back to the effector organs via motor nerves. The nervous system provides the quickest means of communication within the body and thus serves as the chief coordinator of all the performances of a body.
The structural and functional unit of the nervous system is the neuron. Fig. 3.1 illustrates a biological neuron. Each neuron structurally consists of three parts, each of which is associated with a specific function:

• The cell body (soma), or centron or neurocyton, with dendrons (dendrites), which are outgrowths of the cell membrane of the neuron cell body.

• The axon (neurite), a single long process arising from the axon hillock of the cell body of the neuron.

• The axons and the dendrons constitute the nerve fibres. The axon gives off branches called collaterals along its course and, near its end, ramifies into non-myelinated terminal branches known as axon terminals, which serve as the third structural unit of a neuron.

3.2 MATHEMATICAL MODELLING OF ANN

With this basic idea, a mathematical model of a neuron can be developed for ANNs, based on two distinct operations performed by neurons. They are:

Synaptic Operation

The synaptic operation provides a confluence between the n-dimensional neural input vector, X, and the n-dimensional synaptic weight vector, W. A dot product is often used as the confluence operation. Thus, the components of the resulting vector, Z, can be expressed as

$Z_i = W_i X_i, \quad i = 1, 2, \ldots, n$   (3.1)

Fig. 3.1 A biological neuron

where

$W = (W_1\, W_2\, W_3 \ldots W_n)^T$   (3.2)

is the vector of the synaptic weights,

$X = (X_1\, X_2\, X_3 \ldots X_n)^T$   (3.3)

is the vector of the neural inputs, and

$Z = (Z_1\, Z_2\, Z_3 \ldots Z_n)^T$   (3.4)

is the vector of the weighted neural inputs.


Thus, the synaptic operation assigns a relative significance to each incoming input signal $X_i$ according to the past experience stored in $W_i$.

Somatic Operation
This operation is a two-step process, viz.

• Somatic Aggregation: This operation can be expressed as

$u = \sum_{i=1}^{n} Z_i = \sum_{i=1}^{n} W_i X_i = W^T X$   (3.5)

where $u$ is the intermediate accumulated sum. Thus, the combined synaptic operation and somatic aggregation operation provide a mapping from the n-dimensional neural input space, X, to the one-dimensional space, u.

• Non-linear Operation with Thresholding: A nonlinear operation on $u$ yields the neural output $y$, given by

$y = f[u - W_0]$   (3.6)

where $f$ is the non-linear function and $W_0$ is the threshold. Thus,

$y = f\Big[\sum_{i=1}^{n} W_i X_i - W_0\Big] = f\Big[\sum_{i=0}^{n} W_i X_i\Big] = f[v]$   (3.7)

provided

$X_0 = -1$   (3.8)

where $v$ is the final accumulated sum and $X_0$ is the fixed bias input. Thus, a Neural Processing Unit (NPU) can be schematically represented as in Fig. 3.2.
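For concreteness, a minimal NumPy sketch of the NPU of eqns (3.1)-(3.8) might look as follows; the function name npu and the choice of a sigmoid for f are illustrative assumptions, not part of the original text.

```python
import numpy as np

def npu(x, w, w0):
    """One neural processing unit (eqns 3.1-3.8): synaptic operation,
    somatic aggregation, then nonlinear thresholding with a sigmoid."""
    z = w * x                                 # synaptic operation, Z_i = W_i * X_i  (3.1)
    u = z.sum()                               # somatic aggregation, u = W^T X       (3.5)
    return 1.0 / (1.0 + np.exp(-(u - w0)))    # y = f[u - W_0]                       (3.6)

# Folding the threshold in as a weight on the fixed bias input X_0 = -1 (3.7)-(3.8):
x = np.array([0.5, -1.2, 0.3])
w = np.array([0.8, 0.1, -0.4])
w0 = 0.2
v = np.concatenate(([w0], w)) @ np.concatenate(([-1.0], x))   # v = sum_{i=0}^{n} W_i X_i
assert np.isclose(npu(x, w, w0), 1.0 / (1.0 + np.exp(-v)))
```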

3.3 DIFFERENT NETWORK ARCHITECTURES

Although a single neuron processing unit can handle simple pattern classification problems, the strength of neural computation comes from neurons connected in a network. A set of processing units assembled in a closely interconnected network is called an artificial neural network.
Depending on the different modes of interconnection, ANNs can be broadly classified as:
Fig. 3.2 A neural processing unit

1. Multilayer Feed-forward Neural Networks or Static Neural Networks: These are characterized by directed layered graphs. In this case, a number of parameters, such as the structure, connection weights and thresholds, are obtained from learning rather than being predetermined.

Fig. 3.3a Detailed Representation of Multi-layer Structure

2. Laterally Connected Neural Networks: These consist of feed-forward input units and a layer of neurons that are laterally connected to their neighbours.
3. Recurrent or Dynamic Neural Networks: In these networks, neurons are connected in a layered structure, and the neurons in a given layer may receive inputs from neurons in the layer below it and/or from the layers above it. The output depends not only upon the current inputs but also upon past inputs and/or outputs. These networks have been mostly used for the solution of optimization problems.
4. Hybrid Neural Networks: These networks combine the features of two or more of the above-mentioned networks.

3.4 SUPERVISED LEARNING

The ability of a particular neural network is largely determined by the learning process and the network structure used. The learning procedures are divided into three types: supervised, reinforced and unsupervised. These three types of learning are defined by the type of error signal used to train the weights in the network. In supervised learning, an error scalar is provided for each output unit by an external teacher, while in reinforced learning the network is given only a global punish/reward signal. In unsupervised learning, no external error signal is provided; instead, internal errors are generated between the units, which are then used to modify the weights.
In the supervised learning paradigm, the weights connecting units in the network are set on the basis of detailed error information supplied to the network by an external teacher. In most cases the network is trained using a set of input-output pairs which are examples of the mapping that the network is required to learn to compute. The learning process may, therefore, be viewed as fitting a function, and its performance can thus be judged on whether the network can learn the desired function over the interval represented by the training set, and to what extent the network can successfully generalize away from the points on which it has been trained.
As an example, consider the case of electrode contour optimization, where the input-output training sets are known, i.e. the predetermined electrode contours and the stresses along those electrode contours obtained from the electric field computations carried out for such electrode contours. Thus, for such a problem, a neural network with supervised learning is needed.
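As a minimal illustration of this paradigm, the sketch below stores a training set of input-output pairs and computes the per-output error signal that a teacher would supply; the array contents are made-up stand-ins for the electrode-contour data described above.

```python
import numpy as np

# Hypothetical training set: each input pattern is paired with a teacher-supplied target.
inputs  = np.array([[0.2, 0.7], [0.9, 0.1], [0.4, 0.5]])   # e.g. electrode contour parameters
targets = np.array([[0.3], [0.8], [0.5]])                  # e.g. stresses from field computation

def teacher_error(outputs, targets):
    """Supervised learning: a detailed error signal per output unit,
    unlike the single global punish/reward signal of reinforced learning."""
    return targets - outputs

outputs = np.full_like(targets, 0.5)     # placeholder network response
print(teacher_error(outputs, targets))   # the error information used to set the weights
```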

3.5 MULTILAYER FEED-FORWARD NEURAL NETWORKS

Pattern classification problems that are not linearly separable can be solved with MFNNs possessing one or more hidden layers in which the neurons have nonlinear characteristics.
A single neuron can solve simple pattern classification problems, i.e. transformations of sets or functions from the input space to the output space. A two-layer network consisting of two inputs and N outputs can produce N distinct lines in the pattern space, provided the regions formed by the problem are linearly separable.
But in many problems, as the dimensionality of the input space grows, the problems are not linearly separable and cannot be solved by a two-layer neural network. This leads to MFNNs.
In MFNNs, neurons are connected in a layered structure; neurons in a given layer receive inputs from the neurons in the layer immediately below and send their outputs to the neurons in the layer immediately above. Their outputs are a function of only the current inputs and are independent of past inputs and/or outputs.
Although, theoretically, an infinite number of layers may be required to define an arbitrary decision boundary, a three-layer FNN can generate arbitrarily complex decision regions. For this reason, three-layer FNNs are often referred to as universal approximators. The additional layer between the input and the output layers is known as the hidden layer, and the number of units in the hidden layers depends on the nature of the problem.
The term feed-forward implies that all the information in the network flows in the forward
direction, and during normal processing there is no feedback from the outputs to the inputs.

Backpropagation learning algorithm


Fig. 3.3 shows a schematic diagram of a multilayer feedforward network. Processing elements in neural networks are commonly known as neurons. The neurons in the network are divided into three layers: the input layer, the output layer and the hidden layers. It is important to note that in feedforward networks signals can only propagate from the input layer to the output layer via one or more hidden layers. It should also be noted that only the nodes in the hidden layers and the output layer, which perform an activation function, are called ordinary neurons. Since the nodes in the input layer simply pass on the signals from the external source to the hidden layer, they are often not regarded as ordinary neurons.

Fig. 3.3b Schematic Multi-layer Structure

The neural network can identify input pattern vectors once the connection weights are adjusted by means of the learning process. The back-propagation learning algorithm, which is a generalization of the Widrow-Hoff error-correction rule, is the most popular method for training the ANN. This learning algorithm is presented below in detail.
Let the net input to a neuron in the input layer be $net_i$. Then for each neuron in the input layer, the neuron output is given by

$O_i = net_i$   (3.9)

The net input to a neuron $j$ in the hidden layer is

$net_j = \sum_{i=1}^{N_i} \omega_{ji} O_i$   (3.10)

where $N_i$ is the number of neurons in the input layer.

The output of neuron $j$ is

$O_j = f(net_j, \theta_j)$   (3.11)

where $f$ is the activation function.


For a sigmoidal activation function,

$O_j = \dfrac{1}{1 + e^{-(net_j + \theta_j)}}$   (3.12)

In eqn (3.12), the parameter $\theta_j$ serves as the threshold or bias. The effect of a positive $\theta_j$ is to shift the activation function to the left along the horizontal axis. These effects are illustrated in Fig. 3.4.

Fig. 3.4 Sigmoidal Activation Function
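A short sketch of eqn (3.12), assuming NumPy, showing how a positive $\theta_j$ shifts the curve to the left:

```python
import numpy as np

def sigmoid(net, theta=0.0):
    """Sigmoidal activation of eqn (3.12): O = 1 / (1 + exp(-(net + theta)))."""
    return 1.0 / (1.0 + np.exp(-(net + theta)))

net = np.linspace(-6.0, 6.0, 5)
print(sigmoid(net))              # theta = 0: curve centred on net = 0
print(sigmoid(net, theta=2.0))   # positive theta: same curve shifted left by 2
```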


Similarly, for a neuron $K$ in the output layer, the net input is given by

$net_K = \sum_{j=1}^{N_j} \omega_{Kj} O_j$   (3.13)

where $N_j$ is the number of neurons in the hidden layer.

The corresponding output is given by

$O_K = f(net_K, \theta_K)$   (3.14)

In the learning phase, or training, of such a network, a pattern is presented as input and the set of weights in all the connecting links, and also all the thresholds in the neurons, are adjusted in such a way that the desired outputs $t_{pK}$ are obtained at the output neurons. Once this adjustment has been accomplished by the network, another pair of input-output patterns is presented and the network is required to learn that association also. In fact, the network is required to find a single set of weights and thresholds that will satisfy all the input-output pairs presented to it.
In general, the outputs $O_{pK}$ will not be the same as the target values $t_{pK}$. For each pattern, the sum of squared errors is

$E_p = \dfrac{1}{2}\sum_{K=1}^{N_K}(t_{pK} - O_{pK})^2$   (3.15)

where $N_K$ is the number of neurons in the output layer. In the generalized delta rule formulated by Rumelhart et al. for learning the weights and thresholds, the procedure for learning the correct set of weights is to vary the weights in a manner calculated to reduce the error $E_p$ as rapidly as possible. In other words, a gradient search in weight space is carried out on the basis of $E_p$.
Omitting the subscript $p$ for convenience, eqn (3.15) is written as

$E = \dfrac{1}{2}\sum_{K=1}^{N_K}(t_K - O_K)^2$   (3.16)

Convergence towards improved values for the weights and thresholds is achieved by taking incremental changes $\Delta\omega_{Kj}$ proportional to $-\partial E/\partial\omega_{Kj}$, that is,

$\Delta\omega_{Kj} = -\eta\,\dfrac{\partial E}{\partial\omega_{Kj}}$   (3.17)

where $\eta$ is the learning rate.


Eqn (3.17) can be written as

$\Delta\omega_{Kj} = -\eta\,\dfrac{\partial E}{\partial net_K}\,\dfrac{\partial net_K}{\partial\omega_{Kj}}$   (3.18)

Now,

$\dfrac{\partial net_K}{\partial\omega_{Kj}} = \dfrac{\partial}{\partial\omega_{Kj}}\sum_{j}\omega_{Kj}O_j = O_j$   (3.19)

Let

$\delta_K = -\dfrac{\partial E}{\partial net_K}$   (3.20)

Therefore, eqn (3.18) becomes

$\Delta\omega_{Kj} = \eta\,\delta_K\,O_j$   (3.21)

Again,

$\delta_K = -\dfrac{\partial E}{\partial net_K} = -\dfrac{\partial E}{\partial O_K}\,\dfrac{\partial O_K}{\partial net_K}$   (3.22)

The two factors on the R.H.S. of eqn (3.22) are obtained as follows:

$\dfrac{\partial E}{\partial O_K} = \dfrac{\partial}{\partial O_K}\Big[\dfrac{1}{2}\sum_K (t_K - O_K)^2\Big] = -(t_K - O_K)$   (3.23)

Again,

$\dfrac{\partial O_K}{\partial net_K} = \dfrac{\partial}{\partial net_K}f(net_K,\theta_K) = \dfrac{\partial}{\partial net_K}\Big[\dfrac{1}{1 + e^{-(net_K + \theta_K)}}\Big] = O_K(1 - O_K)$   (3.24)

Therefore, for any output-layer neuron $K$, $\delta_K$ is obtained from eqns (3.22), (3.23) and (3.24) as follows:

$\delta_K = (t_K - O_K)\,O_K\,(1 - O_K)$   (3.25)

For the next lower layer, where the weights do not affect the output nodes directly, it can be written that

$\Delta\omega_{ji} = -\eta\,\dfrac{\partial E}{\partial\omega_{ji}}$   (3.26)

$\Delta\omega_{ji} = -\eta\,\dfrac{\partial E}{\partial net_j}\,\dfrac{\partial net_j}{\partial\omega_{ji}} = -\eta\,\dfrac{\partial E}{\partial net_j}\,O_i = \eta\,\delta_j\,O_i$   (3.27)

where

$\delta_j = -\dfrac{\partial E}{\partial net_j}$   (3.28)

$\delta_j = -\dfrac{\partial E}{\partial O_j}\,\dfrac{\partial O_j}{\partial net_j}$   (3.29)

Now, as in the case of eqn (3.24),

$\dfrac{\partial O_j}{\partial net_j} = O_j(1 - O_j)$   (3.30)

However, the factor $\partial E/\partial O_j$ cannot be evaluated directly. Instead, it is written in terms of quantities that are known and other quantities that can be evaluated. Hence,

$\dfrac{\partial E}{\partial O_j} = \sum_{K=1}^{N_K}\dfrac{\partial E}{\partial net_K}\,\dfrac{\partial net_K}{\partial O_j} = \sum_{K=1}^{N_K}\dfrac{\partial E}{\partial net_K}\,\dfrac{\partial}{\partial O_j}\sum_{j}\omega_{Kj}O_j = \sum_{K=1}^{N_K}\Big(\dfrac{\partial E}{\partial net_K}\Big)\omega_{Kj} = -\sum_{K=1}^{N_K}\delta_K\,\omega_{Kj}$   (3.31)

Therefore, from eqns (3.29), (3.30) and (3.31),

$\delta_j = O_j(1 - O_j)\sum_{K=1}^{N_K}\delta_K\,\omega_{Kj}$   (3.32)

Thus, the deltas at a hidden-layer neuron can be evaluated in terms of the deltas at an upper layer. Hence, starting at the highest layer, i.e. the output layer, the $\delta_K$ are evaluated using eqn (3.25), and then the errors are propagated backward to lower layers using eqn (3.32).
Summarizing, and using the subscript $p$ to denote the pattern number,

$\Delta_p\omega_{Kj} = \eta\,\delta_{pK}\,O_{pj}$   (3.33)

and

$\delta_{pK} = (t_{pK} - O_{pK})\,O_{pK}\,(1 - O_{pK})$   (3.34)

for the output-layer neurons, and

$\Delta_p\omega_{ji} = \eta\,\delta_{pj}\,O_{pi}$   (3.35)

and

$\delta_{pj} = O_{pj}(1 - O_{pj})\sum_{K=1}^{N_K}\delta_{pK}\,\omega_{Kj}$   (3.36)

for the hidden-layer neurons.


It is important to note here that the threshold of each neuron is trained in the same way as the
other weights. The threshold of a neuron is regarded as a modifiable connection weight
between that neuron and a fictitious neuron in the previous layer, which always has an output
value of unity.
The learning procedure therefore consists of the network starting off with a random set of weight values, choosing one of the training-set patterns, using it as the input pattern and evaluating the outputs in a feedforward manner. The errors at the outputs will generally be quite large, which necessitates changes in the weights. Using the back-propagation procedure, the network calculates $\Delta_p\omega_{ji}$ for all the $\omega_{ji}$ in the network for that particular pattern, and the corrections to the weights are made. This procedure is repeated for all the patterns in the training set, completing the first iteration, and then all the patterns are presented once again in the second iteration. A new set of outputs is obtained and new weights are again evaluated. In a successful learning exercise, the system error will decrease with the number of iterations, and the procedure will converge to a stable set of weights, which will exhibit only small fluctuations in value as further learning is attempted.
While implementing such a network, it is very important to choose a proper value of $\eta$. A large $\eta$ corresponds to rapid learning but might also result in oscillations. Rumelhart et al. suggested that eqns (3.33) and (3.35) might be modified to include a sort of momentum term, that is,

$\Delta\omega_{ji}(n+1) = \eta\,\delta_{pj}\,O_{pi} + \alpha\,\Delta\omega_{ji}(n)$   (3.37)

where $(n+1)$ is used to indicate the $(n+1)$th step and $\alpha$ is a proportionality constant called the momentum constant. The second term in eqn (3.37) specifies that the change in $\omega_{ji}$ at the $(n+1)$th step should be somewhat similar to the change undertaken at the $n$th step. In this way some inertia is built in, and the momentum in the rate of change is conserved to some degree.


It is also important to note that the network must not be allowed to start off with a set of equal weights. It has been shown that it is not possible to proceed from such a weight configuration to one of unequal weights, even if the latter corresponds to a smaller system error.
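Pulling eqns (3.9)-(3.37) together, the sketch below trains a single-hidden-layer network with the generalized delta rule and the momentum term of eqn (3.37). It is a minimal illustration, not the author's implementation: the thresholds are folded in as weights to a fictitious unit of output unity (as described above), the patterns are processed as one batch per iteration rather than one at a time, and the XOR data, layer sizes and learning constants are assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_step(X, T, Wj, Wk, dWj_prev, dWk_prev, eta=0.5, alpha=0.9):
    """One back-propagation iteration with momentum (eqns 3.33-3.37).
    X: (P, Ni) patterns, T: (P, Nk) targets. Wj: (Ni+1, Nj) input-to-hidden
    weights, Wk: (Nj+1, Nk) hidden-to-output weights; the extra row holds each
    neuron's threshold, i.e. the weight from a fictitious unit of output unity."""
    ones = np.ones((X.shape[0], 1))
    Oi = np.hstack([X, ones])                  # input-layer outputs, O_i = net_i  (3.9)
    Oj = sigmoid(Oi @ Wj)                      # hidden-layer outputs              (3.10)-(3.12)
    Ojb = np.hstack([Oj, ones])
    Ok = sigmoid(Ojb @ Wk)                     # output-layer outputs              (3.13)-(3.14)
    dk = (T - Ok) * Ok * (1.0 - Ok)            # output deltas                     (3.34)
    dj = Oj * (1.0 - Oj) * (dk @ Wk[:-1].T)    # hidden deltas, back-propagated    (3.36)
    dWk = eta * Ojb.T @ dk + alpha * dWk_prev  # weight changes with momentum      (3.37)
    dWj = eta * Oi.T @ dj + alpha * dWj_prev
    E = 0.5 * np.sum((T - Ok) ** 2)            # system error, eqn (3.15) summed over patterns
    return Wj + dWj, Wk + dWk, dWj, dWk, E

# Random (unequal!) initial weights, as required above, on the XOR problem.
rng = np.random.default_rng(0)
Ni, Nj, Nk = 2, 3, 1
Wj = rng.uniform(-0.5, 0.5, (Ni + 1, Nj))
Wk = rng.uniform(-0.5, 0.5, (Nj + 1, Nk))
dWj, dWk = np.zeros_like(Wj), np.zeros_like(Wk)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
for n in range(5000):
    Wj, Wk, dWj, dWk, E = train_step(X, T, Wj, Wk, dWj, dWk)
```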

3.6 NORMALISATION OF INPUT-OUTPUT DATA

Scaling of the input-output data has a significant influence on the convergence property and also on the accuracy of the learning process. It is obvious from the sigmoidal activation function given in eqn (3.12) that the range of the output of the network must be within (0, 1). Moreover, the input variables should be kept small in order to avoid the saturation effect caused by the sigmoidal function. Thus, the input-output data must be normalized before the training of the neural network is initiated. Two schemes have been tried for scaling the input-output variables, as detailed below.

Scheme 1 of Normalization
In this scheme, the maximum values of the input and output vector components are determined as follows:

$net_{i,\max} = \max_p\,(net_i(p)), \quad p = 1, \ldots, NP;\ i = 1, \ldots, N_i$   (3.38)

where $NP$ is the number of patterns in the training set, and

$O_{K,\max} = \max_p\,(O_K(p)), \quad p = 1, \ldots, NP;\ K = 1, \ldots, N_K$   (3.39)

Normalized by these maximum values, the input and output variables are given as follows:

$net_{i,nor}(p) = net_i(p)\,/\,net_{i,\max}, \quad p = 1, \ldots, NP;\ i = 1, \ldots, N_i$   (3.40)

and

$O_{K,nor}(p) = O_K(p)\,/\,O_{K,\max}, \quad p = 1, \ldots, NP;\ K = 1, \ldots, N_K$   (3.41)

After the normalization, the range of the input and output variables is within (0, 1) in this scheme.
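A minimal sketch of Scheme 1, assuming the patterns are stacked as rows of NumPy arrays and are positive-valued, as the (0, 1) target range implies:

```python
import numpy as np

def normalize_scheme1(net, O):
    """Scheme 1 (eqns 3.38-3.41): divide each component by its maximum over all patterns.
    net: (NP, Ni) input patterns, O: (NP, Nk) output patterns."""
    net_max = net.max(axis=0)          # net_{i,max} = max_p net_i(p)   (3.38)
    O_max = O.max(axis=0)              # O_{K,max}   = max_p O_K(p)     (3.39)
    return net / net_max, O / O_max    # (3.40) and (3.41): ranges fall within (0, 1)
```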

Scheme 2 of Normalization
In this scheme of normalization, the output variables are normalized by using eqns (3.39) and (3.41) to get a variable range within (0, 1), but the input variables are normalized as follows:

$net_{i,nor}(p) = \dfrac{net_i(p) - net_{i,av}}{\sigma_i}, \quad p = 1, \ldots, NP;\ i = 1, \ldots, N_i$   (3.42)

where $net_{i,av}$ and $\sigma_i$ are the average value and the standard deviation of the $i$th component of the input vector, respectively.

In this scheme, after the normalization the input variable range is $(-K_1, K_2)$, where $K_1$ and $K_2$ are real positive numbers. These input variables can then easily be made to fall in the range (-1, 1) by dividing them by the greater of the two numbers $K_1$ and $K_2$.
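And a corresponding sketch of Scheme 2, including the final rescaling into (-1, 1) by the greater of $K_1$ and $K_2$:

```python
import numpy as np

def normalize_scheme2_inputs(net):
    """Scheme 2 (eqn 3.42): standardize each input component, then rescale into (-1, 1)."""
    z = (net - net.mean(axis=0)) / net.std(axis=0)   # (net_i(p) - net_{i,av}) / sigma_i
    K1, K2 = -z.min(), z.max()                       # standardized range is (-K1, K2)
    return z / max(K1, K2)                           # divide by the greater of K1 and K2
```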

3.7 FASTER TRAINING

Fudge the Derivative Term


The first major improvement to back-propagation is extremely simple: you can fudge the
derivative term in the output layer. If you are using the usual back-propagation activation
function:

$1\,/\,(1 + \exp(-D \cdot x))$

the derivative is

$s(1 - s)$

where $s$ is the activation value of the output unit and most often $D = 1$. The derivative is largest at $s = 1/2$, and it is here that you will get the largest weight changes. Unfortunately, as you near the values 0 or 1, the derivative term gets close to 0 and the weight changes become very small. In fact, if the network's response is 1 and the target is 0, that is, the network is off by quite a lot, you end up with very small weight changes. It can take a very long time for the training process to correct this. More than likely you will get tired of waiting. Fahlman's solution was to add 0.1 to the derivative term, making it:

$0.1 + s(1 - s)$
The solution of Chen and Mars was to drop the derivative term altogether; in effect, the derivative was 1. This method passes back much larger error quotas to the lower layer, so large that a smaller $\eta$ must be used there. In their experiments on the 10-5-10 encoder problem, they found the best results came when the lower-level $\eta$ was 0.1 times the upper-level $\eta$; hence they called their method the "differential step size" method. One tenth is not always the best value, so you must experiment with both the upper- and lower-level etas to get the best results. Besides that, the $\eta$ you use for the upper layer must also be much smaller than the $\eta$ you would use without this method.
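A sketch of both fudges applied to the output-layer delta of eqn (3.34); the function name and keyword values are illustrative:

```python
def output_delta(t, s, fudge="fahlman"):
    """Output-layer delta with a fudged derivative term.
    t: target value, s: output-unit activation."""
    if fudge == "fahlman":
        return (t - s) * (0.1 + s * (1.0 - s))   # add 0.1 so the delta never dies at s = 0 or 1
    elif fudge == "chen_mars":
        return t - s                             # drop s(1-s) entirely; derivative taken as 1
    return (t - s) * s * (1.0 - s)               # plain back-propagation

# With Chen and Mars' method, the eta used below the output layer is taken as
# roughly 0.1 times the output-layer eta (the "differential step size").
```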

3.8 ADAPTIVE LEARNING ALGORITHM

To make the learning process converge more rapidly than the conventional method, in which both the learning rate and the momentum are kept constant during learning, an adaptive learning algorithm has been developed that adapts both the momentum and the learning rate during the learning process.
The proposed adaptation rule for the learning rate is as follows:

$\eta_i(n+1) = \begin{cases} \eta_i(n)\,\exp[-k/n], & R_e(n) < R_e(n-1) \\ \eta_i(n), & R_e(n) \ge R_e(n-1) \end{cases}$   (3.43)

where $\eta_i(n)$ is the learning rate at iteration $n$ between the input layer and the next hidden layer, $k$ is a constant, and $R_e$ is the root mean square error in training:

$R_e = \Big[\dfrac{1}{NP \cdot N_K}\sum_{p=1}^{NP}\sum_{K=1}^{N_K}(t_{pK} - O_{pK})^2\Big]^{1/2}$   (3.44)

Here the basic idea is to decrease $\eta$ when $R_e(n) < R_e(n-1)$ and to keep $\eta$ constant when $R_e(n) \ge R_e(n-1)$. Note that $\Delta E(n) = R_e(n) - R_e(n-1)$ is negative when the error is decreasing, which implies that the connection weights are being updated in the correct direction. It is reasonable to maintain this update direction in the next iteration; in this case, we achieve this by decreasing the learning rate in the next iteration. On the other hand, if the connection weights move in the opposite direction, causing the error to increase, we should try to ignore this direction in the next iteration by keeping the value of $\eta$ the same as its value in the previous iteration. The value of the constant $k$ should be selected judiciously to give the best result, and the optimum value of $k$ is problem dependent.
Similar to the learning rate, the proposed adaptation rule for the momentum constant is as follows:

$\alpha_i(n+1) = \begin{cases} \Big[1 - \dfrac{R}{100}\Big]\,\alpha_i(n), & R_e(n) < R_e(n-1) \\ \Big[1 + \dfrac{R}{100}\Big]\,\alpha_i(n), & R_e(n) \ge R_e(n-1) \end{cases}$   (3.45)

where $R$ is the percentage rate of change of $\eta_i$ between two successive iterations, and $\alpha_i(n)$ is the momentum at iteration $n$ between the input layer and the next hidden layer. The learning rates and the momenta for the other layers are updated in the same way as those for $\eta_i$ and $\alpha_i$, respectively, at the $n$th iteration during the learning process.
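A sketch of the two adaptation rules as reconstructed in eqns (3.43) and (3.45); the value of k and the exact form of the exponent follow the reading above and should be treated as assumptions:

```python
import math

def adapt(eta, alpha, Re, Re_prev, n, k=0.01):
    """Adapt the learning rate (3.43) and momentum (3.45) from the RMS errors
    of two successive iterations; k is an assumed, problem-dependent constant."""
    if Re < Re_prev:                    # error decreasing: correct direction, shrink eta
        eta_new = eta * math.exp(-k / n)
        sign = -1.0
    else:                               # error increasing: hold eta constant
        eta_new = eta
        sign = 1.0
    R = 100.0 * abs(eta_new - eta) / eta   # percentage change of eta between iterations
    alpha_new = (1.0 + sign * R / 100.0) * alpha
    return eta_new, alpha_new
```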


3.9 THE RESILIENT PROPAGATION ALGORITHM

The basic principle of RPROP is to eliminate the harmful influence of the size of the partial
derivative on the weight step. As a consequence, only the sign of the derivative is considered
to indicate the direction of the weight update. The size of the weight change is exclusively
determined by a weight-specific, so-called update-value $\Delta_{ij}(t)$:

$\Delta w_{ij}(t) = \begin{cases} -\Delta_{ij}(t), & \text{if } \partial E(t)/\partial w_{ij} > 0 \\ +\Delta_{ij}(t), & \text{if } \partial E(t)/\partial w_{ij} < 0 \\ 0, & \text{else} \end{cases}$   (3.46)

where $\partial E(t)/\partial w_{ij}$ denotes the partial derivative of the error with respect to the weight $w_{ij}$. The second step of Rprop learning is to determine the new update-values:

$\Delta_{ij}(t) = \begin{cases} \eta^{+} \cdot \Delta_{ij}(t-1), & \text{if } \dfrac{\partial E(t-1)}{\partial w_{ij}} \cdot \dfrac{\partial E(t)}{\partial w_{ij}} > 0 \\ \eta^{-} \cdot \Delta_{ij}(t-1), & \text{if } \dfrac{\partial E(t-1)}{\partial w_{ij}} \cdot \dfrac{\partial E(t)}{\partial w_{ij}} < 0 \\ \Delta_{ij}(t-1), & \text{else} \end{cases}$   (3.47)

where $0 < \eta^{-} < 1 < \eta^{+}$.

Thus, every time the partial derivative of the corresponding weight $w_{ij}$ changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum, the update-value $\Delta_{ij}(t)$ is decreased by the factor $\eta^{-}$. If the derivative retains its sign, the update-value is slightly increased in order to accelerate convergence in shallow regions. Additionally, in case of a change in sign, there should be no adaptation in the succeeding learning step; in practice, this can be achieved by setting $\partial E(t)/\partial w_{ij} = 0$ in the adaptation rule. Finally, the weight update and the adaptation are performed after the gradient information of the whole pattern set is computed.
The Rprop algorithm requires setting the following parameters: (i) the increase factor, set to $\eta^{+} = 1.2$; (ii) the decrease factor, set to $\eta^{-} = 0.5$; (iii) the initial update-value, set to $\Delta_0 = 0.1$; and (iv) the maximum weight step, which is used in order to prevent the weights from becoming too large, $\Delta_{\max} = 50$.
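An element-wise sketch of the two Rprop rules; the helper name and the assumption that a full-batch gradient array is available are illustrative:

```python
import numpy as np

def rprop_update(w, grad, grad_prev, delta, eta_plus=1.2, eta_minus=0.5, delta_max=50.0):
    """One batch Rprop step per eqns (3.46)-(3.47), element-wise over a weight array.
    delta should be initialised to Delta_0 = 0.1 before the first call."""
    s = grad * grad_prev                       # sign comparison of successive gradients
    delta = np.where(s > 0, np.minimum(delta * eta_plus, delta_max), delta)  # same sign: grow
    delta = np.where(s < 0, delta * eta_minus, delta)                        # sign flip: shrink
    grad = np.where(s < 0, 0.0, grad)   # suppress this update and the next adaptation on a flip
    w = w - np.sign(grad) * delta       # step against the sign of the gradient, eqn (3.46)
    return w, grad, delta               # the returned grad becomes grad_prev for the next call
```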

