
Artificial Neural Networks

in Control and Optimization

A thesis submitted to
the University of Manchester
for the degree of
Doctor of Philosophy
in the Faculty of Technology

by

Cairo L. Nascimento Jr., B.Sc., M.Sc.

Control Systems Centre


UMIST
PO Box 88
Manchester M60 1QD

February 1994
To Sandra

for her support and personal sacrifices


Abstract

This thesis concerns the application of artificial neural networks to solve
optimization and dynamical control problems.
A general framework for artificial neural network models is introduced first.
Then the main feedforward and feedback models are presented. The IAC (Interactive
Activation and Competition) feedback network is analysed in detail. It is shown that the
IAC network, like the Hopfield network, can be used to solve quadratic optimization
problems.
A method that speeds up the training of feedforward artificial neural networks
by constraining the location of the decision surfaces defined by the weights arriving at
the hidden units is developed.
The problem of training artificial neural networks to be fault tolerant to loss of
hidden units is mathematically analysed. It is shown that by considering the network
fault tolerance the above problem is regularized, that is the number of local minima is
reduced. It is also shown that in some cases there is a unique set of weights that
minimizes a cost function. The BPS algorithm, a network training algorithm that
switches the hidden units on and off, is developed and it is shown that its use results in
fault tolerant neural networks.
A novel non-standard artificial neural network model is then proposed to solve
the extremum control problem for static systems that have an asymmetric performance
index. An algorithm to train such a network is developed and it is shown that the
proposed network structure can also be applied to the multi-input case.
A control structure that integrates feedback control and a feedforward artificial
neural network to perform nonlinear control is proposed. It is shown that such a
structure performs closed-loop identification of the inverse dynamical system. The
technique of adapting the gains of the feedback controller during training is then
introduced. Finally it is shown that the BPS algorithm can also be used in this case to
increase the fault tolerance of the neural controller in relation to loss of hidden units.
Computer simulations are used throughout to illustrate the results.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an
application for another degree or qualification of this or any other university or other
institution of learning.

Acknowledgements

I am most grateful to my supervisor Dr. Martin Zarrop for all his patience,
guidance and enthusiasm. It was an honour for me to work under his exemplary
supervision.
I am also very grateful to Prof. Peter Wellstead, Prof. Neil Munro and Dr. Allan
Muir for their support and encouragement. I also would like to acknowledge Dr. Daniel
W. McMichael for his contributions in the early stages of this research.
I am indebted to the Aeronautics Technological Institute (Instituto Tecnológico
de Aeronáutica) and to the Brazilian Research Council (CNPq - grant 200.617/88.5) for
the support provided during this research.
Many thanks are due to Dr. Evandro T. de Souza, Dr. Takashi Yoneyama and
Dr. J. A. M. Felippe de Souza for their long distance moral support and constant
encouragement.
Finally, I wish to thank all the staff and students at the Control Systems Centre,
especially Suchin Arunsawatwong, Chris Berg, Alex Bozin, Nongji Chen, Kevin
Czachur, Tim Fitzpatrick, Celso Gonzalez, Peter Green, William Heath, Tariq Hussein,
Duncan Hutchinson, Arne Jenssen, Tracy Jones, Alan Kjaer, Paul Lane, Panos Liatsis,
Andy McCorkle, Roy Moody, Njal Pettit, Janice Prunty and Susan Simpson for all their
help, moral support and friendship.

"You are never given a wish without also


being given the power to make it true.
You may have to work for it, however."
Richard Bach

Contents

List of Figures x
List of Tables xiv
Notation and Abbreviations xv

1 Introduction 1
1.1 Background and Motivation 1
1.2 Structure of the Thesis and Contributions 3

2 Artificial Neural Networks: Basic Concepts 6


2.1 The Human Nervous System and the Brain 8
2.1.1 Neurons 9
2.1.2 The Action Potential 10
2.1.3 Structure of the Brain 13
2.2 Brain versus Digital Computer 15
2.2.1 Processing Speed 15
2.2.2 Processing Order (Serial/Parallel) 16
2.2.3 Number and Complexity of Processors 16
2.2.4 Knowledge Storage and Tolerance to Damage 17
2.2.5 Processing Control 17
2.3 The Basics of an Artificial Neural Network Model 18
2.3.1 A Formal Definition 18
2.3.2 A General Framework for ANN models 19
2.3.3 Learning 21
2.3.4 Network Topology 22
2.4 Artificial Neural Network Models 24
2.4.1 Early Models 24
2.4.2 The Perceptron and Linear Separability 27
2.4.3 ADALINE/MADALINE and the Delta Rule 32
2.4.4 The Multi-Layer Perceptron and the Role of Hidden Units 35

2.4.5 The Back-Propagation Algorithm 39
2.4.6 Using the Back-Propagation Algorithm 41
2.5 Representation, Learning and Generalization 43
2.5.1 The Representation Problem 44
2.5.2 The Learning Problem 46
2.5.3 The Generalization Problem 48
2.6 Limitations of Feedforward ANNs 50

3 Feedback Neural Networks: the Hopfield and IAC Models 52


3.1 Associative Memories 52
3.1.1 Storing One Pattern 54
3.1.2 Storing Several Patterns 55
3.1.3 Storage Capacity 56
3.1.4 Minimizing an Energy Function 58
3.1.5 Spurious States 60
3.1.6 Synchronous Updating 61
3.2 Solving Optimization Problems 62
3.2.1 An Analog Implementation 62
3.2.2 An Energy Function 64
3.3 The IAC Neural Network 67
3.3.1 The Mathematical Model 69
3.3.2 Initial Considerations 70
3.3.3 Minimizing an Energy Function 71
3.3.4 Considering two units 73
3.3.5 Case Positive Decay and No External Inputs 74
3.3.6 Case of Non-Zero External Inputs With No Decay 76
3.3.7 Case of Non-Zero External Inputs and Positive Decay 81
3.3.8 A Two-Bit Analog-Digital Converter using the IAC Network 84
3.4 Conclusions 86

4 Faster Learning by Constraining Decision Surfaces 87


4.1 Initialization Procedures 88
4.1.1 The Standard Initialization Procedure 89

4.1.2 The First Initialization Procedure 91
4.1.3 The Second Initialization Procedure 93
4.1.4 Adjusting the Inclinations of the Decision Surfaces 95
4.2 Constraining the Decision Surfaces during Training 97
4.3 Simulations 98
4.4 Conclusion 102

5 Fault Tolerant Artificial Neural Networks 103


5.1 Approaches to Fault Tolerance 104
5.2 Fault Tolerance in Artificial Neural Networks 105
5.3 Back-Propagation with Increased Fault Tolerance 108
5.4 Simulations 111
5.5 Regularizing the Estimation Problem by Considering Fault Tolerance 115
5.5.1 Considering the Input Weights Fixed 116
5.5.2 Considering the Output Weights Fixed 120
5.5.3 Considering Input and Output Weights as Variables 121
5.6 Conclusions 123

6 Extremum Control Using Artificial Neural Networks 125


6.1 The Extremum Control Problem 125
6.2 The Quadratic System Model 128
6.3 Adapting the Artificial Neural Network to Extremum Control 129
6.4 Training the Neural Network 132
6.5 Simulation of a Single Input Example 133
6.6 An Extension to Multi-Input Extremum Control 137
6.7 Simulation of a Two Input Example 139
6.8 Conclusions 143

7 Dynamical Control Using Artificial Neural Networks 145


7.1 Artificial Neural Networks and Dynamical Control 145
7.2 Neural Control Architectures 147
7.2.1 Supervised Control 147
7.2.2 Adaptive Critic and Reinforcement Learning 149

7.2.3 Back-Propagation Through Time (BPTT) 152
7.2.4 Direct Inverse Control 155
7.2.5 Neural Adaptive Control 158
7.3 The Feedback-Error-Learning Method 161
7.3.1 The Original Feedback-Error-Learning Control Structure 162
7.3.2 The Modified Feedback-Error-Learning Control Structure 164
7.3.3 Mathematical Analysis of the Modified Feedback-Error-Learning
Control Structure 166
7.3.4 Simulations for the Linear Case 172
7.3.5 Using a Variable Feedback Controller 173
7.3.6 Simulations for the Nonlinear Case with a Variable
Feedback Controller 176
7.4 Fault-Tolerant Dynamical Control 184
7.5 Conclusions 190

8 Conclusions and Directions for Further Work 192

References 195

List of Figures

Figure 2.1 The human nervous system as a closed-loop control system 9


Figure 2.2 A schematic representation of a neuron 11
Figure 2.3 The general model of a processing unit 19
Figure 2.4 Examples of feedforward (a,b) and feedback (c) ANN. 23
Figure 2.5 The Linear Associative Memory (LAM) 26
Figure 2.6 The Single-Layer Perceptron 27
Figure 2.7 Rosenblatt’s elementary perceptron 28
Figure 2.8 The AND, OR and XOR problems and possible location
for the decision surfaces 29
Figure 2.9 The minimum configuration for a Multi-Layer Perceptron (MLP) 35
Figure 2.10 The first possible solution for the XOR problem 36
Figure 2.11 The second possible solution for the XOR problem 37
Figure 2.12 The third possible solution for the XOR problem 38

Figure 3.1 A one-layer feedback ANN 54


Figure 3.2 The analog implementation of a unit using electrical components 63
Figure 3.3 The analog implementation of a continuous-time Hopfield network
with 4 units 63
Figure 3.4 Typical topology for the IAC network 67
Figure 3.5 The IAC Neural Network with 2 units 74
Figure 3.6 ext1 = ext2 = 0, decay/(c max) = 0.15, c > 0 75
Figure 3.7 Location of the stable E.P. 77
Figure 3.8 ext1’ = ext2’ = 1.5, decay = 0 77
Figure 3.9 ext1’ = 0, ext2’ = -1, decay = 0 78
Figure 3.10 ext1’ = 1, ext2’ = -1.5, decay = 0 79
Figure 3.11 ext1’ = ext2’ = 1, decay = 0 79
Figure 3.12 ext1’ = 1, ext2’ = -1, decay = 0 80
Figure 3.13 Asymptotes for ext1’ ≠ 0 and/or ext2’ ≠ 0 with decay = 0 81

Figure 3.14 Curves da1’/dt = 0 ( ) and da2’/dt = 0 (–•–)
for several external inputs and dec = 0.15 82
Figure 3.15 The case where one of the E.P.s is over the separatrix 83
Figure 3.16 Curves da1’/dt = 0 ( ) and da2’/dt = 0 (–•–) for several
external inputs and dec = 1 84

Figure 4.1 Histogram of a variable defined as the quotient of two
Gaussian random variables with zero mean and variance 1 90
Figure 4.2 100 decision surfaces generated by the standard
initialization procedure 90
Figure 4.3 100 decision surfaces generated by the first
initialization procedure 93
Figure 4.4 Geometrical interpretation of the initialization procedure 94
Figure 4.5 100 decision surfaces generated by the second
initialization procedure 95
Figure 4.6 The function sin(x) and its approximation F(x) 99
Figure 4.7 The RMS error history for the 3 simulation cases 100
Figure 4.8 Decision surfaces for case 1 101
Figure 4.9 Decision surfaces for case 1 101
Figure 4.10 Decision surfaces for case 2 101
Figure 4.11 Decision surfaces for case 3 101
Figure 4.12 Output unit weights and bias for case 3 102
Figure 4.13 The function sin(x) and its network approximation for case 3 102

Figure 5.1 The image bitmaps used to train the ANN 112
Figure 5.2 RMS error history for the no-fault configuration for cases:
(a) using the BP algorithm; (b) using the BPS algorithm 114
Figure 5.3 The parameter β as a function of λ[0] λ* for NH = 5 and N1 = 3 119

Figure 6.1 The Extremum Controller Algorithm 128


Figure 6.2 The asymmetric function used in the hidden units
with different parameters v 130
Figure 6.3 The diagram of the FF ANN used in the extremum controller 131

Figure 6.4 The true function with the noise band (±1 standard deviation)
and the ANN approximation (before and after training) 135
Figure 6.5 The ANN estimate of the optimum input x0 135
Figure 6.6 The time evolution of the 6 coefficients of asymmetry v 135
Figure 6.7 The time evolution of the estimate of the optimum input x0
for the quadratic model when the true model is symmetric
or asymmetric 136
Figure 6.8 The object image and the initial features of the mask 140
Figure 6.9 The mask history. The object is situated between
positions 10 and 16. 142
Figure 6.10 The output of the ANN for the estimate of optimum input x0 142
Figure 6.11 The true performance index as a function of the mask position
and size, its ANN approximation after being trained and
the respective contour plots 142
Figure 6.12 Contour plots considering only the region of the dither [−3 3] 143
Figure 6.13 Contour plots using a dither of [−6 6] and 600 time steps 143

Figure 7.1 Widrow and Smith’s ADALINE controller 147


Figure 7.2 Widrow, Gupta and Maitra’s bootstrap adaptation 150
Figure 7.3 Barto, Sutton and Anderson’s Adaptive Critic 151
Figure 7.4 The plant and the ANN controller 153
Figure 7.5 The ANN emulator 153
Figure 7.6 Training the ANN controller (C = ANN controller,
T = ANN truck emulator) 154
Figure 7.7 The Indirect Learning architecture 156
Figure 7.8 The Generalized Learning architecture 156
Figure 7.9 The Specialized Learning architecture 157
Figure 7.10 Indirect Adaptive Control 159
Figure 7.11 Neural Adaptive Control using the Model-Reference approach 160
Figure 7.12 The original network topology and control structure
used in the Feedback-Error-Learning method 163
Figure 7.13 The modified feedback-error-learning neural control structure 165
Figure 7.14 The plant output y and the error e 173

Figure 7.15 The variation of ANN parameters during training 174
Figure 7.16 The two-joint robot arm 177
Figure 7.17 Desired joint positions for RT 1 at t = 0, 1, 2, 3 and 4 seconds 180
Figure 7.18 The feedback gain multiplier during the training session 182
Figure 7.19 The RMS values for the feedback and ANN controllers
during training 183
Figure 7.20 Recalling RT 1 before and after training with RT 1,
using a variable feedback controller 183
Figure 7.21 Recalling RT 3 before and after training with RT 1,
using a variable feedback controller 183
Figure 7.22 The RMS errors for all RTs during the recall session for the
cases when the ANN is trained with a fixed or variable
feedback controller 184
Figure 7.23 The inverted pendulum, a 1-axis robot 186
Figure 7.24 Actual and desired joint angle trajectory at the end of the
training session 188
Figure 7.25 The tracking error during testing when the hidden unit 7 is
lost at 200 s 190

List of Tables

Table 5.1 RMS errors for the no-fault configuration after training 114
Table 5.2 RMS error mean (standard deviation) and number of
misclassifications for the fault configurations after training 114

Table 7.1 Integral of the squared error (ISE) when the ANN is tested
for fault tolerance 189

Notation and Abbreviations:

Vectors and matrices are represented in bold italics, e.g. X. Scalars are
represented in italics, e.g. X1. Common symbols and abbreviations are given below with
the section number where they are introduced.

Symbol Section
a activation value of a processing unit 2.3.2
ACE Adaptive Critic Element 7.2.2
ADALINE Adaptive Linear Neuron 2.4.3
AI Artificial Intelligence 2
ANN Artificial Neural Network 2
ASE Associative Search Element 7.2.2
AP Action Potential 2.1.1
BP Back-Propagation 2.4.5
BPS Back-Propagation with Switching 5.3
BPTT Back-Propagation Through Time 7.2.3
c factor of cooperation or competition 3.3.4
CNS Central Nervous System 2.1
D network desired output vector 2.4.1
E squared network output error 2.4.3
ext external input 3.1
EP Equilibrium Point 3.1
f( ) activation function of a processing unit 3.2
FB Feedback 2.3.4
FF Feedforward 2.3.4
FM Frequency Modulation 2.1.2
g( ) output function of a processing unit 3.3.1
H energy function to be minimized (for feedback networks) 3.1.4
IAC Interactive Activation and Competition 3.3
J cost function 5.3

J* modified cost function 5.3
L length of the tapped delay line 7.3.2
LAM Linear Associative Matrix 2.4.1
LMS Least-Mean-Square 2.4.3
LTM Long-Term Memory 2.3.2
M delay applied to the reference signal 7.3.2
MADALINE Multiple Adaptive Linear Neuron 2.4.3
MIMD Multiple Instruction Multiple Data 2.3.1
MLP Multi-Layer Perceptron 2.4.2
NC number of possible configurations 5.1
NH number of hidden units 6.3
net net input value of a processing unit 2.3.2
OOP Object-Oriented Programming 2
out output value of a processing unit 2.3.2
PE Processing Element 2.3.1
PDP Parallel Distributed Processing 2
PID Proportional-Integral-Derivative 7.3
PNS Peripheral Nervous System 2.1
qk joint angle vector at time instant k 7.3.1
R correlation matrix 5.5.1
RBF Radial Basis Functions 2.5.1
refk reference signal at time instant k 7.2.1
ref*k neural network controller input at time instant k 7.3.3
RLS Recursive Least Square 6.1
RMS Root-Mean-Square 5.4
S pattern to be stored in a feedback network 3.1
STM Short-Term Memory 2.3.2
t time 2.3.2
T target network output vector 5.3
TFk feedback torque at time instant k 7.3.1
TNk feedforward torque generated by the network at time instant k 7.3.1
TD Temporal Differences 7.2.2
TLU Threshold Logic Unit 2.4.1

TMR Triple Modular Redundancy 5.1
uk plant input at time instant k 7.2.3
uFBk output of the feedback controller at time instant k 7.3.2
uNNk output of the neural network controller at time instant k 7.3.2
W weight matrix 2.3.2
X network input vector 2.4.1
xk plant state vector at time instant k 7.2.3
x0 optimum input value 6.1
yk plant output at time instant k 7.2.4
Y network output vector 2.4.1

Greek Letters

η learning rate 2.4.1


α momentum rate 2.4.6
α* network coefficients 7.3.3
λi probability of fault for configuration i 5.3
ν coefficient of asymmetry 6.3
τ torque 7.3.6


Chapter 1 - Introduction

1.1 - Background and Motivation

This thesis concerns the application of artificial neural networks to solve
optimization and dynamical control problems. Artificial neural networks are
computational devices whose conception has been motivated by our current knowledge
of biological nervous systems. As such, neurocomputing, that is computation using
artificial neural networks, offers an alternative to the traditional computational approach
based on sequential and algorithmic processing.
Probably the main feature that characterizes the artificial neural networks
approach is the simultaneous use of a large number of relatively simple processors,
instead of a few very powerful central processors, as is nowadays the standard in
most man-made computers. This is also the computational architecture selected by
natural evolution for the central nervous systems of the most developed animals, where
the basic computational unit is the neuron.
The use of a large number of simple processors makes it possible to perform
parallel computation and to have a very short response time for tasks that involve real-
time simultaneous processing of several signals. Furthermore it is also possible to have
a decentralized architecture, which is much more fault tolerant to loss of individual
processors than centralized architectures.
Another important feature of artificial neural networks is that, although the
individual processors are very simple in terms of computational power and memory, they are
adaptable nonlinear devices. Consequently, artificial neural networks can be used to
approximate nonlinear models, an essential property for solving many real-world
problems. The adaptable parameters of artificial neural network models are the
connections that link the processors. This is similar to "learning" in biological neural
networks, which is supposed to be the result of changes in the strength of the connections
between neurons.

Research in artificial neural network models began at the same time as the first
digital computers were being developed in the 1940’s. In 1943 McCulloch and Pitts
[McPi43] proposed modelling the biological neuron as a simple threshold device, i.e. it
could be in only two states, on or off. In 1958 Rosenblatt proposed the Perceptron
[Ros58] as a model for visual pattern recognition. In 1960 Widrow and Hoff [WiHo60]
proposed the ADALINE (adaptive linear neuron). However, when in the late 1960’s
Minsky and Papert [MiPa69] pointed out some limitations of the artificial neural
network models available at the time (e.g. the lack of a reliable algorithm to train multi-
layer networks), interest in neurocomputing was greatly reduced and efforts were shifted
to the area of Artificial Intelligence and expert systems.
Interest in neurocomputing only reappeared in the mid 80’s mainly as a result
of a combination of the following factors: a) the popularization of Hopfield’s work to
solve optimization problems using feedback networks [Hop85]; b) the rediscovery of the
Back-Propagation algorithm, used to train multi-layer feedforward networks, by
Rumelhart, Hinton and Williams [RHM86]; c) the realization of the limitations of
Artificial Intelligence and expert systems approaches, and d) the availability of powerful
and cheap digital computers that could be used to simulate, test and refine artificial
neural network models.
Nowadays artificial neural network models are the subject of study in many areas
as diverse as medicine, engineering and economics, to tackle problems that cannot be
easily solved by other more established approaches.
In this work our motivation is to develop techniques that exploit the properties
of nonlinear modelling, adaptability and tolerance to internal damage exhibited by
artificial neural network models in order to solve the problems of: a) extremum control
of static systems; and b) adaptive control of nonlinear dynamical systems under
feedback.
In the extremum control problem the aim is to estimate on-line, i.e. as new data
are made available, the input that maximizes the system output, considering that the
input-output relationship is unknown. Only noisy measurements of the output and the
respective input values used to produce them are available. Previous work [WeZa91] has
dealt with the case when the system is assumed to be governed by a quadratic function
with unknown parameters. In this work we develop a neural solution for the case when
the input-output relationship is non-quadratic but with a unique maximum over the
interval of interest.
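To make the setting concrete, the sketch below is a minimal Python illustration of the quadratic-model approach mentioned above, not a reproduction of [WeZa91]: the map f, the noise level, the dither amplitude and the batch least-squares fit are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # Unknown static map with a unique maximum on the interval of interest
        # (an illustrative non-quadratic choice; its true optimum is at x = 1).
        return 3.0 - np.cosh(x - 1.0)

    x, rows, ys = 0.0, [], []
    for k in range(200):
        u = x + 0.3 * rng.standard_normal()         # dither around the current input
        y = f(u) + 0.1 * rng.standard_normal()      # noisy output measurement
        rows.append([1.0, u, u * u]); ys.append(y)
        # Fit the quadratic model y ~ a + b*u + c*u**2 to all the data seen so far.
        a, b, c = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)[0]
        if k >= 5 and c < 0:                        # the fitted model has a maximum
            x = float(np.clip(-b / (2.0 * c), -3.0, 3.0))

    print("estimated optimum input:", x)

The neural model developed in chapter 6 replaces the quadratic model in a loop of this kind, so that non-quadratic, asymmetric performance indices can be handled.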
Control theory in the areas of analysis and design of time-invariant dynamical
linear systems is well developed due to the intense research effort expended since the
early 60’s. However, several dynamical systems of interest may contain severe
nonlinearities and therefore a linear model will not be entirely suitable. Unfortunately,
the nonlinear control field is much less advanced than its linear counterpart and few
general approaches exist. In this work we modify and analyze the neural control
architecture known as feedback-error-learning, which operates under feedback control
by using an artificial neural network as a feedforward controller.

1.2 - Structure of the Thesis and Contributions

In chapter 2 the basic concepts relating to artificial neural network models are
presented. The chapter begins with a simplified section about the human nervous system
and the human brain. A general framework for artificial neural network models is later
introduced and the most important feedforward artificial neural network models, i.e.
network models that perform static mappings, are then presented. Finally, some
limitations of the artificial neural network approach are discussed.
Chapter 3 is concerned with feedback artificial neural network models. Because
of the presence of feedback connections, feedback networks are complex nonlinear
dynamical systems. The two main application areas for feedback networks are as
associative memories and for solving quadratic optimization problems. Two models, the
Hopfield and IAC (Interactive Activation and Competition) neural networks, are
presented and analyzed. The main contribution of this chapter is the mathematical proof
that the IAC network can also be used to solve quadratic optimization problems, much
like the Hopfield network. As far as we are aware, this is the first time that it is shown
that another feedback neural network model can be used to solve quadratic optimization
problems.
One of the limitations of current feedforward artificial neural network models,
as is pointed out in chapter 2, is that a large number of iterations is needed if the current
training algorithms are used. The contribution in chapter 4 is the presentation of a novel
method that speeds up learning by constraining the location of the decision surfaces
defined by the values of the weights arriving at the hidden units. The same method can
be used to provide a better initialization procedure for the network weights. The
performance of the method is evaluated through computer simulations.
Chapter 5 is concerned with fault tolerant artificial neural networks, i.e. networks
that are tolerant to loss of weights and hidden units. The problem of training an artificial
neural network can be seen as an optimization problem. Therefore it is not surprising
that current algorithms, such as Back-Propagation, will not necessarily result in fault
tolerant solutions since they do not explicitly search for a fault tolerant solution. In this
chapter we propose the BPS algorithm (Back-Propagation with Switching).
The proposed algorithm switches during training between the different fault
configurations, i.e. all possible fault configurations are trained and forced to share the
same set of network weights. The conventional Back-Propagation algorithm can be
considered a special case of the proposed algorithm where the set of possible
configurations contains only the no-fault configuration. The benefits of the proposed
algorithm are illustrated using a bit-mapped image recognition problem.
The main contribution of chapter 5 is the mathematical analysis that shows that,
by considering network fault tolerance, the problem of training the network is
regularized, that is the number of local minima is reduced. We show that in some cases
when the weights of a layer are fixed there is only one set of weights that minimizes the
cost function.
Chapter 6 deals with the extremum control problem. First the extremum control
problem is introduced and the limitations of the quadratic model approach are presented.
The contribution of chapter 6 is the development of a novel non-standard artificial neural
network model. Since the network model is flexible enough to accommodate non-
quadratic functions, the optimum input for static systems with an asymmetric
performance index can be estimated with a small error, even if the system is excited by
a dither with a large amplitude. The standard Back-Propagation algorithm, with the
necessary modifications for the specific network model developed in this chapter, is used
to adapt the network parameters. We mathematically prove that the proposed network
model can also be used in the multi-input case (theorem 6.1). Two simulation examples
are presented, one for the single input case and the other for a two input case.
In chapter 7 we address the use of artificial neural networks for control of
dynamical systems. First we review the main approaches proposed in the literature to
integrate feedforward artificial neural networks into the general control structure. The
concept of feedback-error-learning, proposed by Kawato [Kaw90], is then introduced.


The first contribution in this chapter is the development of a modified feedback-error-
learning control structure which aims to perform closed-loop identification of the inverse
dynamical system. Such a modified control structure is then mathematically analyzed
and we show that, at least for the case of a single input single output linear dynamical
system, when certain requirements are satisfied, there exists an artificial neural network
(a linear filter in this case) that is a close approximation of the inverse dynamical model
of the system under control. A computer simulation for the linear case is used to
illustrate the use of the proposed neural control structure.
The second contribution in chapter 7 is the introduction of the technique of
variable (or adaptive) feedback to be used in the proposed neural control structure.
Simulations of the control of a two-joint robot, a two-input two-output nonlinear control
problem, are presented and we show that use of the variable feedback technique
improves the generalization of the neural network controller in relation to trajectories
not used during training.
The third contribution in chapter 7 is the application of the BPS algorithm,
presented in chapter 5, to improve the fault tolerance of the neural controller in relation
to faults in the neural network. The control of an inverted pendulum is used as the
simulation example in this case.
Finally the last chapter presents general conclusions and suggestions for further
work.

Chapter 2 - Artificial Neural Networks:
Basic Concepts

The great majority of digital computers in use today are based around the
principle of using one very powerful processor through which all computations are
channelled. This is the so called von Neumann architecture, after John von Neumann,
one of the pioneers of modern computing. The power of such a processor can be
measured in terms of its speed (number of instructions that it can execute in a unit of
time) and complexity (the number of different instructions that it can execute).
The traditional way to use such computers has been to write a precise sequence
of steps (a computer program or an algorithm) to be executed by the computer. This is
the algorithmic approach. Such programs can be written in different computer
languages, where higher level languages will have commands that when translated to the
machine level will correspond to several instructions at the processor level.
Researchers in Artificial Intelligence (AI) follow the algorithmic approach and
try to capture the knowledge of an expert in some specific domain as a set of rules to
create so called expert systems. This is based on the hypothesis that the expert’s thought
process can be modelled by using a set of symbols and a set of logical rules which
manipulate such symbols. This is the symbolic approach. It is still necessary to have
someone that understands the process (the expert) and someone to program the
computer.
The algorithmic and symbolic approaches can be very useful for certain problems
where it is possible to find a precise sequence of mathematical operations (e.g. inversion
of matrices) or a precise sequence of rules (e.g. medical diagnosis of certain well
understood diseases). However such approaches have the following weaknesses:
a) Sequential (or Serial) Computation: as a consequence of the
centralization around the processor, the instructions have to be executed sequentially,
even if two sets of instructions are unrelated. This creates a bottleneck around the
central processor. Sometimes, instead of just one, a small number of very powerful
central processors are used, but this has to be weighed against the increase in the
complexity of programming the management of these processors so that they are used
effectively. Also, sooner or later, the physical limits for signal propagation times within
the computer will be reached. The current approach of reducing the processor size is
also constrained by physical limits.
b) Local Representation: the knowledge is localized in the sense that a
concept or a rule can be traced to a precise area in the computer memory. Such
representation is not resistant to damage. Also a very small corruption in one of the
instructions to be executed by the processor (a single bit error) can easily ruin the
sequential computation. Also, as the complexity of the program increases, its reliability
decreases, since it is more likely that the programmers will make mistakes. Recently
developed programming styles such as object-oriented programming (OOP) aim to make
it easier to manage these complex programs.
c) "Learning" is difficult: if we define computational "learning" as the
construction or modification of some computational representation or model [Tho92],
it is difficult to simulate "learning" using the algorithmic and symbolic approaches. This
happens because it is not straightforward to incorporate the data acquired from
interaction with the environment into the model.
In general, it can be said that digital computers can solve problems that are
difficult for humans, but it is often very difficult to use them to automate tasks that
humans can solve with little effort, such as driving a car or recognizing faces and
sounds in a real-world situation.
Artificial Neural Networks (ANN), also called neurocomputing, connectionism,
or parallel distributed processing (PDP), provide an alternative approach to be applied
to problems where the algorithmic and symbolic approaches are not well suited.
Artificial Neural Networks are inspired by our present knowledge of biological nervous
systems, although they do not try to be realistic in every detail (the area of ANN is not
concerned with biological modelling, a different field). Some ANN models may
therefore be totally unrealistic from a biological modelling point of view [HKP91].
In contrast to the conventional digital computer, ANN perform their computation
using a large number of very simple and highly interconnected processors operating in
parallel. The representation of knowledge is distributed over these connections and
"learning" is performed by changing certain values associated with such connections, not
by programming. The learning methods still have to be programmed, however, and a
suitable learning algorithm must be chosen for each problem, but the same general
approach is kept.
Current ANN models are such crude approximations of biological nervous systems
that it is hard to justify the use of the word neural. The word is used today more
for historical reasons, since most of the earlier researchers came from biological
or psychological backgrounds, not engineering or computer science.
It is generally believed that knowledge about real biological neural networks can
help by providing insights about how to improve the artificial neural network models
and clarifying their limitations and weaknesses. The next section presents a simplified
introduction to the human nervous system and human brain. The human brain is the
most complex organ we have and is a structure still poorly understood, despite intense
research and much progress since Santiago Ramon y Cajal showed that the human
nervous system is made of an assembly of well-defined cells.
A general framework for ANN models is later introduced and some important
ANN models are presented in the subsequent sections. Finally some limitations of the
current ANN models are highlighted in the conclusions.

2.1 - The Human Nervous System and the Brain

The human nervous system consists of the Central Nervous System (CNS) and
the Peripheral Nervous System (PNS). The CNS is composed of the brain and the spinal
cord. The PNS is composed of the nervous system outside the brain and spinal cord.
The human nervous system can be seen as a vast electrical switching network.
The top-level behaviour of such a network can be approximately described by figure 2.1.
The inputs to this network are provided by sensory receptors. Such receptors act as
transducers and generate signals from within the body or from sense organs that observe
the external environment. The information is then conveyed by the PNS to the CNS,
where it is then analyzed and processed. If necessary, the CNS sends signals to the
effectors and the related motor organs that will execute the desired actions. From the
above description we can see that the human nervous system can be described as a
closed-loop control system, with feedback from within the body (in order to regulate
some bodily functions such as the heart beat rate) and from outside the body (so we are
aware of our interactions with the external environment) [Zur92].


Most of the information processing done by the CNS is performed in the brain.
In contrast to other organs in the human body, the brain does not process metabolic
products, but instead processes "information". In order to process such information the
brain is the most concentrated consumer of energy in the body, being responsible, with
the body at rest, for over 20% of the body’s oxygen consumption despite being only 2%
of the body mass. Despite such high energy consumption the brain dissipates very little
heat. Since the brain is very mechanically and chemically sensitive and cannot
regenerate itself if damaged, the brain is the most protected organ in the body. A bony
skull provides the brain with a strong mechanical protection while chemical protection
is provided by a highly effective filtration system called the blood-brain barrier, a dense
network of blood vessels that isolates the brain from potentially toxic substances found
in the bloodstream [Was89]. The brain and spinal cord are also immersed in the
cerebrospinal fluid, which provides further protection against damage.

2.1.1 - Neurons
The human brain contains approximately 10^11 elementary nerve cells called
neurons (10^11 is around 20 times the current world’s population and the estimated
number of stars in our galaxy). Each of these neurons is connected to around 10^3 to 10^4
other neurons, and therefore the human brain is estimated to have 10^14 to 10^15
connections. The neuron is the basic building block of the nervous system and most
neurons are in the brain.

Figure 2.1 - The human nervous system as a closed-loop control system



Neurons can be classified into two main classes: 1) output cells, that connect
different regions of the brain to each other, connect the brain to the effectors (motor
neurons), or connect the sensory receptors to the brain (sensory neurons); and 2)
interneurons, that are confined to the region where they occur [BeJa90].
There are hundreds of neuron types, each with its characteristic function, shape
and location, but the main features of a neuron of any type are its cell body (called the
soma), its dendrites and its axon, as figure 2.2 illustrates.
The cell body is usually 5 to 100 µm in diameter and contains the normally large
nucleus of the neuron. Most of the biochemical activities necessary to maintain the life
of the neuron, such as synthesis of enzymes and other molecules, take place within its
cell body.
The dendrites act as the input channels of external signals to the neuron and the
axon acts as the output channel. Dendrites form a dendritic tree, which is a bushy tree
that spreads out around the cell body within a region of up to 400 µm in radius. An
axon extends away from the cell body and is relatively uniform in diameter. It can be
as short as 100 µm for interneurons or as long as 1 meter for sensory and motor
neurons, such as the neurons that connect the toe to the spinal cord. The axon also
branches but only at its end, in contrast to dendrites that split much closer to the cell
body.
The end of a branch of an axon has a button shape, with diameter around 1 µm,
and connects to the dendrite of another neuron. Such a connection is called a synapse
(from the Greek verb "to join"). Usually this is not a physical connection (the axon and
the dendrite do not touch) but there is a small gap called the synapse gap or synapse
cleft that is normally between 200 Å and 500 Å across (1 Å = 10^-10 m, the diameter of
a water molecule is around 3 Å). The point where the axon is connected to its cell body
is called the Hillock zone.

2.1.2 - The Action Potential


The cell body can generate electrical activity in the form of a voltage pulse
called an action potential (AP) or electrical spike.
The axon carries the action potential from the cell body to the synapses where
chemical molecules, called neurotransmitters, are then released. These diffuse across the
synapse gap to the dendrite at the other side of the synapse and modify the dendrite’s
membrane potential.

Figure 2.2 - A schematic representation of a neuron

It takes around 0.6 ms for the neurotransmitters to cross the
synapse gap. According to the predominant type of neurotransmitter present at the
synapse, the membrane potential of the dendrite is increased (an excitatory synapse) or
decreased (an inhibitory synapse). These signals received by the dendrites from many
different neurons are then sent to the cell body where they are, roughly speaking,
averaged. If this average over a short time interval is above a certain threshold at the
Hillock zone, the neuron "fires", i.e. the cell body generates an action potential that is
then transmitted by its axon to other neurons.
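This integrate-and-threshold behaviour is the abstraction behind the earliest artificial neuron models presented later in this chapter. A rough sketch (the inputs, weights and threshold are arbitrary illustrative values, not a biological model) is:

    def threshold_unit(inputs, weights, threshold):
        # Crude abstraction of the integrate-and-threshold behaviour described above:
        # a weighted combination of dendritic inputs is compared with a firing threshold.
        net = sum(w * x for w, x in zip(weights, inputs))
        return 1 if net >= threshold else 0    # 1 = the unit "fires" an action potential

    # Two excitatory synapses and one inhibitory synapse (all values are illustrative).
    print(threshold_unit([1, 1, 1], [0.6, 0.7, -0.5], 0.5))   # net = 0.8, so the unit fires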
The AP has a peak value of about 100 mV and a duration of around 1 ms. At
rest (no input to the neuron) the cell body has a potential of -70 mV in relation to its
outside (this generates an electrical field of about 10^4 V/mm across the membrane of the
neuron), the threshold value for the generation of the AP is around -60 mV to -30 mV
(depending on the sensitivity of the neuron) and when the AP is at its peak the interior
of the neuron is 30 mV above the potential of its external environment [DeDe93].
After the AP is generated and extinguished, there is a refractory period when the
neuron does not fire even if its receives very large inputs. The refractory period takes
3 to 4 ms and is important since it sets an upper limit for the maximum firing frequency
of the neuron. The firing duration period is defined as the duration of the pulse added
to the duration of the refractory period. Considering the minimum duration period as 4
ms (1 ms as the minimum duration of the pulse + 3 ms as the minimum duration of the
refractory period), the maximum firing frequency is 250 Hz.

Once the AP is generated, it is transmitted along the axon, much as an electrical
signal is transmitted along an electric cable. There is a chemical regenerating mechanism,
provided by exchange of ions, that ensures that the AP is transmitted along the axon
without much distortion in its shape and duration. The velocity of propagation of the AP
along the axon can vary from 0.5 to 130 m/s since it is proportional to the square root
of the diameter of the axon and it increases when the axon is covered by myelin, a
relatively thick and fatty insulating layer. Two-thirds of the axons in the human body
have a small diameter (between 0.0003 and 0.0013 mm) and are unmyelinated. They constitute
the low-speed (up to 1.5 m/s or 3.2 mph) nerve fibres (group of axons) and they carry
"routine" information such as body temperature, where this low speed is adequate. The
other one-third consists of high-speed nerve fibres (up to 130 m/s or 290 mph), the
axons have relatively large diameter (0.001 to 0.022 mm) and they are myelinated. They
are used for transmission of vital information that needs to be processed rapidly, for
instance, when there is danger to the organism [DeDe93].
At regular intervals, there are breaks in the myelin cover, the so called Ranvier
nodes, which play a vital role in the transmission of the pulses along the axon.
Without the myelin cover, the diameter of the mammalian optic nerve, which
contains about 1 million axons, would have to be increased from 2 mm to around
100 mm to carry the same information at the same speed. On the other hand,
information that has low priority is carried by the lower speed axons since they occupy
less space in the organism. The victims of multiple sclerosis are thought to suffer from
deterioration of this myelin cover, probably caused by an autoimmune attack.
In contrast to axons the dendrites do not have a myelin cover and there is no
regenerating mechanism to transmit the signal received by a dendrite at the synapse to
the cell body. Therefore a greater distance between a synapse and the cell body will
mean that the signal that such a synapse sends along the dendrite takes longer to arrive
at the cell body and will suffer greater attenuation and distortion. This is the reason why
dendrites cannot be very long. A possible model for a dendritic tree is a passive RCG
network (resistors in series with capacitors and resistances in parallel), much like the
models used for transmission lines in studies of distribution of electric energy [DeDe93].
In cases when there is imminent danger or damage, for instance, when we
unintentionally touch a very hot object, the brain is not directly involved in the
immediate reaction. In such cases a simple decision and very fast reaction is needed, i.e.
move the hand away from the hot object, a so called reflex arc. The brain is excluded
from this decision process to avoid slowing the reaction. If our ancestors had to think
in order to react to such situations, they would probably not have survived the harsh
environment in which they lived. In these cases, when the signal reaches the spinal cord,
a signal is very rapidly sent to the proper muscles to perform a corrective action. In the
above example, the sensory neurons that received the stimulus are directly linked,
possibly through some interneurons in the spinal cord, to some motor neurons. The brain
receives the signal as well since the same sensory neurons will also be connected to
other interneurons that have a path to the brain. This is important to make the brain
aware of the environment.
From the above, we can see that the signal transmitted by the neuron is
modulated in frequency: the information is carried not by the fact that a neuron has "fired",
but by the number of pulses fired per unit of time. Using such a frequency modulation (FM)
method, the signal generated in the cell body of the neuron can be
transmitted by the axon to other neurons over long distances. Interestingly,
telecommunication engineering has proven that the FM technique has significant
advantages in noise rejection over other techniques.
The main sources of noise at the neuronal level are a consequence of the
chemical mechanism involved in the transmission of the AP across the synapse and
along the axon. The cause of the noise in the first case is the random movements of the
molecules of the neurotransmitters. In the second case the cause is the random
movement of the ions that are involved in the transmission of the AP along the axon.

2.1.3 - Structure of the Brain


The human brain is hierarchically structured and the higher levels of the structure
are believed to be specified by the genetic code. The values of the synapses of all
neurons are the lowest level of such structure and are believed to be determined not by the
genetic code, but by the interactions of each individual with the environment, i.e.
"learned".
In the scale of evolution, the lower level animals have their nervous system
completely specified by the genetic code. Man is at the top of the scale and has the
highest brain volume in relation to total body weight. The genetic code cannot specify
all 10^14-10^15 synapses in the human brain since it is beyond its capacity [RMS92].
However, this turns out to be a decisive advantage since it enables short-term adaptation
to the environment while the evolutionary process provides long-term adaptation,
increasing the probability of survival of the species that use this strategy.
The human brain can be divided into smaller regions, according to appearance
and organization. One possibility is to divide the brain into 3 main regions: the cerebral
cortex, the thalamus and the perithalamic structures [Per92].
The cerebral cortex is the "central" processor of the brain and it is unique to
mammals. It is the youngest brain region in the evolutionary sense and constitutes the
outer part of the brain (the word cortex means the outer layer of an organ). The cerebral
cortex is a flat, thin, two-dimensional layered structure of the order of 0.2 m^2 in area and
on average 2-3 mm in thickness, i.e. about 50 to 100 neurons in depth [RMS92]. It is
extensively folded in higher mammals with several fissures in order to fit inside a skull
of reasonable size. The cerebral cortex can be divided into several subareas, which seem
to be functional areas. Such areas are specialized for specific tasks such as visual
perception (the visual cortex), motor control (motor cortex), or touch (somatosensory
cortex). The body is represented unequally in the somatosensory cortex, with face and
hands having proportionally larger representation than other parts. There are also
association areas that help in the interpretation of the signals received by the sensory
areas [And83].
The thalamus is the "frontal" computer of the brain. All information which flows
to and from the cortex is processed by the thalamus. It can also be divided into regions
and is centrally located in the brain.
The perithalamic structures are the "peripheral" computers that play auxiliary
but vital and not fully understood roles, like "slave" computers. Some of these structures
are the hypothalamus, which controls hormonal secretions and other activities such as
breathing and digestion; the hippocampus, that is involved in long-term memory; and
the cerebellum, that is mainly involved in storing, retrieving and learning sequences of
coordinated movements [Per92].
More details about the human nervous system and the human brain can be
obtained from any neurobiology textbook. A good readable introduction to the subject
is given in [Sci79]. For an engineering perspective of the nervous system see [DeDe93].
For an introduction to the mathematical modelling of biological neurons see [Hop86].

2.2 - Brain versus Digital Computer

The human brain can be seen as a flexible analog processor with enormous
memory capacity that has been engineered and fine-tuned by evolution through several
millions of years to execute tasks that are important for survival in our particular world.
The more important a task was for our survival, the more optimized our whole body is
for that particular task, satisfying certain biological and physical constraints such as
body size, energy consumption and energy dissipation. The human nervous system and
the brain is a particularly good example of this.
We are very good at recognizing faces and understanding speech, very rapidly and
accurately and far better than any digital computer, probably because it was very
important to our survival to differentiate between friends and enemies and to
communicate with each other. We can perform such tasks so effortlessly that we do not
realize how hard they are until we try to program a digital computer to perform them.
On the other hand, we are easily outperformed by a pocket calculator when executing
arithmetic tasks since such tasks are unnatural to us and had very little importance to
our survival. Hinton [Hin89] suggests that: a) considering arithmetic operations, 1 brain
has the processing power of 1/10 of a pocket calculator; b) but if we consider vision,
1 brain has the processing power of 1000 supercomputers; c) considering memorization
of arbitrary facts brains are much worse than digital computers; d) but if we consider
associative memory for real-world facts, such as recalling the name of a person given
a partial description with possibly few wrong clues (the so called content-addressable
memory), the brain is much better than computers (that use address-addressable
memory).
The important point is to realize that certain problems are suitable to be solved
by the conventional algorithmic procedure implemented in digital computers while other
problems are not. Artificial Neural Networks provide possible methods for trying to
solve some of the problems that are not suitable for digital computation.
The main differences in the mode of information processing between the brain
and a digital computer can be summarized as follows [Sim90]:

2.2.1 - Processing Speed


Nowadays it is common to have digital computers operating using clock
frequencies in the range 16 to 33 MHz, such as the 80386 and 80486 Intel
microprocessors. Therefore they will take between 30 and 40 ns to execute a single
instruction (supercomputers can take as little as 3 ns, 1 ns = 1 nanosecond = 10^-9 s). As
we have seen in the previous section, neurons operate in the millisecond range and will
take at least 4 ms to complete a firing cycle. So a digital computer can have
components that are 10^5 times faster than a neuron.

2.2.2 - Processing Order (Serial/Parallel)


A digital computer processes information serially while the brain processes
information in parallel. Furthermore, consider that we take around 500 milliseconds to
recognize a face and that the average processing time of a neuron is 5 milliseconds.
Therefore, if the brain is executing "parallel programs", such programs cannot have more
than 100 steps. This has been called the "100-step program" constraint. So, one possible
explanation for the brain outperforming a very much faster digital computer in certain
tasks such as vision is that, instead of executing a very large program serially as digital
computers do, the brain executes in parallel a large number of small programs. This
shows that to perform certain tasks well it is not enough to execute a single instruction
of the "program" very fast.

2.2.3 - Number and Complexity of Processors


A digital computer can execute each single instruction of its program much faster
than one neuron can change its state, but the human brain has a much larger number
(10^11) of processors (neurons) operating at the same time. The brain also has a high
interconnectivity since each neuron is connected to around 10^4 other neurons. Another
difference is that, while in a digital computer the processor is very complex (since it has
to be able to interpret a large number of different instructions) and it has a high
precision (in terms of the number of significant digits of the response), it is believed that
the neuron is in comparison a very simple processor with a low precision1.

1 It is very difficult to model very accurately the behaviour of a real neuron. Recent studies seem to
indicate that some signal processing is also done by the dendrites and at the synapse level, instead of just
at the cell body. However, the general assumption today is that such phenomena do not contribute
significantly to the computational power of the brain. Much more research is needed to clarify this issue.

2.2.4 - Knowledge Storage and Tolerance to Damage


In a digital computer a particular item of information, or datum, is stored in a
specific memory location. This type of memory is referred to as a localized memory,
since a memory unit holds an entire piece of information. Moreover, a digital computer
uses address-addressable memory. In contrast, information in the brain is thought to be
located in the synapses in a distributed manner, such that no synapse holds an entire
datum and each synapse can contribute to the representation of several pieces of
information. This is called a distributed memory and the brain uses content-addressable
memory, i.e. a memory is retrieved by using parts of its contents as clues.
Distributed memories have the advantage that they are more resistant to damage
(faults). This means that the human brain is relatively tolerant to the loss of a few
neurons, i.e. the information stored (the memory) is not severely distorted when a few
neurons die. Also, because of the intrinsic parallelism, the loss of a few computational
units (the neurons in this case) will not result in a total failure. Such tolerance to
damage is sometimes referred to as graceful degradation and means that performance
decreases smoothly with increase in damage. Compare this with a digital computer
where the corruption of a memory location or failure of any processing element can
result in a total machine failure.
On the other hand, distributed memories have the possible disadvantage that
when it is necessary to update some information, much more work is necessary since
several physical locations of the memory need to be updated. For this reason, sometimes
it is said that knowledge in a digital computer is strictly replaceable while knowledge
in the brain is adaptable.

2.2.5 - Processing Control


In a digital computer there is a clock signal that is used to synchronize all
components. A central processor uses the clock signal to control the activities of all the
other components. In contrast there is no specific area in the brain responsible for
control or for synchronization of all neurons. For this reason, the brain is sometimes
called an "anarchic" system since there is no "homunculus" that monitors the activities
of each neuron.
Given all these differences it is ironic to realize that today’s achievements in
ANN research are a direct consequence of the vast progress in the areas of hardware and
software for digital computers in recent decades. The majority of ANN models in
use today are simulated in digital computers since specific hardware for ANN is not yet
easily available or affordable. The current digital computers provide a suitable
framework that is used by researchers to carry out experiments with their ANN models.

2.3 - The Basics of an Artificial Neural Network Model

In this section a formal definition of Artificial Neural Networks is introduced and a general framework for ANN models is presented.

2.3.1 - A Formal Definition


The following formal definition of an Artificial Neural Network was proposed
by Hecht-Nielsen [Hec90]:
"An Artificial Neural Network is a parallel, distributed information
processing structure consisting of processing units (which can possess
a local memory and can carry out localized information processing
operations) interconnected via unidirectional signal channels called
connections. Each processing unit has a single output connection that
branches ("fans out") into as many collateral connections as desired;
each carries the same signal - the processing unit output signal. The
processing unit output signal can be of any mathematical type desired.
The information processing that goes on within each processing unit can
be defined arbitrarily with the restriction that it must be completely local;
that is, it must depend only on the current values of the input signals
arriving at the processing element via impinging connections and on
values stored in the processing unit’s local memory."
The above definition contains two slight changes in relation to the nomenclature used
in the original one proposed by Hecht-Nielsen. The term "Neural Networks" was
changed to "Artificial Neural Networks" to emphasize that we are not dealing with
biological neural networks. Also the term processing element (PE) was changed to
processing unit and most of the time we will simply use the term unit.
From the above definition ANN can be seen as a subclass of a general computing
architecture known as Multiple Instruction Multiple Data (MIMD) parallel processing

architecture. Hecht-Nielsen points out [Hec90] that the general MIMD architectures may be too general to be efficient, and that ANNs may be a good compromise between an efficient structure with considerable information processing capability and a general-purpose implementation.

2.3.2 - A General Framework for ANN models


There are many different ANN models but each model can be precisely specified
by the following eight major aspects [RHM86]:
• A set of processing units
• A state of activation for each unit
• An output function for each unit
• A pattern of connectivity among units or topology of the network
• A propagation rule, or combining function, to propagate the activities
of the units through the network
• An activation rule to update the activities of each unit by using the
current activation value and the inputs received from other units
• An external environment that provides information to the network
and/or interacts with it.
• A learning rule to modify the pattern of connectivity by using
information provided by the external environment.
Figure 2.3 illustrates the general model of a processing unit. The state of the unit is
given by its activation value ai. The state of an ANN with N units at time instant t can
be represented by the vector [a1(t) a2(t) ... aN(t)]. Such a vector is sometimes referred to

Figure 2.3 - The general model of a processing unit



as the short-term memory (STM) of the network. The output function uses as argument
the activation value to calculate the output of the unit denoted by outi. Such an output
value is then transmitted to the other units in the network. Some possibilities for the
output function are:
1) The linear function: outi = Gain * ai;
2) The threshold function:
if ai > Threshold, then outi = 1; otherwise outi = 0
3) The sigmoid function:
outi = 1/[1+exp(-ai)]
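As an illustration of the three output functions listed above, a minimal sketch (in Python, with the Gain and Threshold values chosen arbitrarily) could be:

```python
import numpy as np

def linear_output(a, gain=1.0):
    # 1) the linear function: out_i = Gain * a_i
    return gain * a

def threshold_output(a, threshold=0.0):
    # 2) the threshold function: 1 if a_i > Threshold, 0 otherwise
    return np.where(a > threshold, 1.0, 0.0)

def sigmoid_output(a):
    # 3) the sigmoid function: 1 / (1 + exp(-a_i))
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-2.0, 0.0, 2.0])          # example activation values
print(linear_output(a), threshold_output(a), sigmoid_output(a))
```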
The pattern of connectivity, or the network topology, specifies how each unit is
connected to the other units in the network. The pattern of connectivity also specifies
which units (or groups of units) are allowed to receive connections from a particular
unit. The strength of each connection is normally represented by a real number w. We
adopt the notation that wij means the weight to unit i from unit j (i←j). The pattern of
connectivity for a whole network with N units can be represented by the weight matrix
W with dimensions N by N. Row i of W contains all weights received by unit i from
the other units in the network (the input weights of unit i) and column j contains all
weights sent by unit j to the other units (the output weights of unit j). The weight matrix
has a very important role since it represents the knowledge that is encoded in the network
and, because of this, it is said that the matrix W contains the long-term memory of the
network (LTM).
Each unit can send its output value to several other units and each unit can
receive as input the output values of several other units. The propagation rule specifies
how such output values from other units are combined into a much smaller set of values,
normally only one value called the net input of the unit. For this reason the propagation
rule is sometimes called the combining function. A frequently used combining function
simply defines a net input value for each unit as a weighted summation or
net(t) = W(t) out(t), where net(t) and out(t) are respectively the net input and output
vectors. Another possibility would be to define a net excitatory input vector and net
inhibitory vector as netE(t) = WE(t) out(t) and netI(t) = WI(t) out(t) where the matrix
WE uses only the positive elements of W and WI uses only the negative elements.
Such net input values and the current activation value are then used by the
activation rule to define the activation value of the unit at the next time step or, using

a discrete time notation:


ai(t+1) = F[ai(t),neti(t)] or ai(t+1) = F[ai(t),netEi(t),netIi(t)].
In some models the activation function will be simply the identity function:
ai(t+1) = neti(t) and if we have at the same time outi(t) = ai(t) and net(t) = W out(t), the whole network can be viewed as a linear discrete time dynamical system since we
will have: a(t+1) = W a(t). In most models the activation function or the output function
are the identity function but not both simultaneously.
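A minimal sketch of one synchronous update step (combining function, activation rule and output function applied to all units at once) is given below; the 3-unit weight matrix, the identity activation rule and the sigmoid output function are illustrative choices only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def synchronous_step(W, out):
    # combining function: net(t) = W out(t)
    net = W @ out
    # activation rule (identity here): a(t+1) = net(t)
    a = net
    # output function applied to the new activations
    return a, sigmoid(a)

# illustrative 3-unit network (the short-term memory is held in 'out')
W = np.array([[0.0, 0.5, -0.3],
              [0.2, 0.0,  0.4],
              [-0.1, 0.6, 0.0]])
out = np.array([0.5, 0.5, 0.5])
for t in range(5):
    a, out = synchronous_step(W, out)
print(a, out)
```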
The external environment interacts with the network sometimes to provide inputs
to the network and to receive its outputs. The units that receive signals from the external
environment are called input units and units that send signals to the external
environment are called output units. Units that are not input nor output units, i.e. that
are not connected directly to the external environment, are called hidden units. In some
ANN models some units are at the same time input and output units. The units can be
grouped in layers according to some property. Therefore in some models we can have
an input layer, an output layer and one, several or no hidden layers. Some authors refer
to layers as layers of weights but in this work we use layers to refer to layers of units.

2.3.3 - Learning
The external environment also interacts with the network during "learning" or
training of ANN. In this phase a learning rule is used to change the elements of the
matrix W and other adaptable parameters that the network may have. In this context
"learning" and "adaptation" are seen as simply changes in the network parameters. In
ANN models the external environment will normally provide a set of "training" input
vectors. There are two main types of learning: supervised and unsupervised.
In supervised learning the external environment also provides a desired output
for each one of the training input vectors and it is said that the external environment
acts as a "teacher". A special case of supervised learning is reinforcement learning
where the external environment only provides the information that the network output
is "good" or "bad", instead of giving the correct output. In the case of reinforcement
learning it is said that the external environment acts as a "critic". Some authors prefer
to classify reinforcement learning not as a special case of supervised learning but as a
third type of learning rule.
In unsupervised learning the external environment does not provide the desired

network output nor classifies it as good or bad. By using the correlations among the input vectors the learning rule changes the network weights in order to group the input vectors into "clusters" such that similar input vectors will produce similar network outputs since
they will belong to the same cluster. Ideally, the learning rule finds the number of
clusters and their respective centres, if they exist, for the training data. This learning
method is also called self-organization. Sometimes it is said, improperly, that in unsupervised learning the network learns without a teacher, but this is not strictly correct. The teacher is not involved in every step but still has to set the goals even in an unsupervised learning mode. Zurada [Zur92] proposes the following analogy to clarify
this point. An ANN being trained using supervised learning corresponds to a student
learning by answering the questions posed by the teacher and comparing his answers to
the correct answers given by the teacher. The unsupervised learning case corresponds
to the student learning the subject from a videotape lecture provided by the teacher but
the teacher is not available to answer any questions. The teacher provides the methods
(the learning rule) and questions (the input training vectors) but not the answers to the
questions (the output training vectors).
Considering hardware implementations of an ANN with a large number of units,
it is preferable to have learning rules that use only local information to the unit whose
weights are being updated. Without such a constraint, especially for a large ANN, inter-
unit communication can cause a considerable burden.

2.3.4 - Network Topology


According to its topology an ANN can be classified as a feedforward or
feedback (also called recurrent) ANN. In a feedforward ANN a unit only sends its
output to units from which it does not receive an input directly or indirectly (via other
units). In other words, there are no feedback loops.
In general, given a feedforward ANN, by properly numbering the units we can
define a weight matrix W which is lower-triangular and has a zero diagonal (the unit
does not feedback into itself). A feedforward ANN arranged in layers, where the units
are connected only to the units situated in the next consecutive layer, is called a strictly
feedforward ANN. In a feedback network feedback loops are allowed to exist. A
feedforward ANN implements a static mapping from its input space to its output space
while a feedback ANN is, in general, a nonlinear dynamical system and therefore the

Figure 2.4 - Examples of feedforward (a, b) and feedback (c) ANNs; only (a) is a strictly feedforward ANN.

stability of the network is one of the main concerns. Figure 2.4 shows examples of
feedforward, strictly feedforward and feedback networks.
A typical application of feedforward ANNs is to develop nonlinear models that
are then used for pattern recognition/classification. In this case a feedforward ANN can
be seen as another tool for performing nonlinear regression analysis. A typical
application of feedback ANNs is as content-addressable memories. The state vectors that
correspond to the information that we want to record (the specific "memory") are set to
be a stable equilibrium point. Another possible area of application is
unconstrained/constrained optimization where it is hoped that the network will converge
to a stable equilibrium point that represents a satisfactory near-optimum solution to the
problem.
ANNs can also be classified as synchronized or asynchronized according to the
timing of the application of the activation rule. In synchronized ANNs, we can imagine
the equivalent of a central clock that is used by all units in the network such that all of
them, simultaneously, sample their inputs, calculate their net input and their activation
and output values, i.e. a synchronous update is used. Such an update can be seen as a
discrete difference equation that approximates an underlying continuous differential
equation. In asynchronized ANNs, at most one unit is updated at each point in time. Normally, whenever updating is allowed, a unit is selected
at random to be updated and the activation values of the other units are kept constant.
In some models of feedback ANNs such a procedure helps, but does not guarantee, the

stability of the network and is generally used when a synchronous update can result in
stability problems.

2.4 - Artificial Neural Network Models

In this section we review the basic characteristics of the most important feedforward ANN models, more or less in chronological order. Such models provide the
foundation of most of the many feedforward ANN models available and in use today.
Anderson and Rosenfeld [AnRo88] have edited a very interesting book which contains
a collection of several classical papers in the ANN area. Nilsson [Nil65] has published
a theoretical view of the state of the field in the mid-1960s. Lippmann [Lip87] and, more
recently, Hush and Horne [HuHo93] published updated reviews of several ANN models
and Simpson [Sim90] has published an extensive compilation of ANN models. White
[Whi89] and Levin et al. [LTS90] provide a statistical interpretation of the methods used
to train feedforward ANNs. Nerrand et al. [NRPD93] show that ANNs can be
considered as general nonlinear filters that can be trained adaptively. They also show
that several algorithms used in linear and nonlinear adaptive filtering can be seen as
special cases of algorithms used to train ANNs.

2.4.1 - Early Models


In 1943 the neurophysiologist Warren McCulloch and the logician Walter Pitts
[McPi43] proposed to describe the biological neuron as a Threshold Logic Unit (TLU)
with L binary (0 or 1) inputs xj and 1 binary output y. The weights associated with such
inputs are wj = ±1. The output of such a unit is high (1) when the linear summation of all the inputs is greater than a certain threshold value and low (0) otherwise. This is
equivalent to defining the activation rule as the threshold function and the output value
as equal to the activation value. The threshold value is mathematically represented by
a variable bias. Therefore the output y of the TLU is:

$$ y = F_T\left( \sum_j w_j x_j + bias \right) \qquad (2.1) $$

with FT(z) = 1 if z ≥ 0 and FT(z) = 0 if z < 0 (the Heaviside or unit step function). The
variable bias can be simply seen as another weight which originates from a unit whose
output is always 1.

McCulloch and Pitts showed that it is possible to construct any arbitrary logical
function by using a combination of such units, i.e. such a network is capable of universal computation. For example, the logical function AND can be implemented by using one unit with weights set to 1 and bias set to -1.5. Still with weights set to 1, if the bias is set to -0.5 we have the logical function OR. To obtain an inverter the weight is set to -1 and the bias to 0.5. By using a combination of these basic functions AND, OR and inverter it is possible to construct any arbitrary logical function. However, such networks
are not fault-tolerant in that all components must always function properly in contrast
to the fault-tolerance of biological neural networks. Another problem was that
McCulloch and Pitts did not explain how to obtain the values of weights and biases for
the network and, in particular, they did not propose a learning rule.
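These weight and bias values can be checked directly with a small sketch of a McCulloch-Pitts unit (Python is used here purely for illustration):

```python
def tlu(inputs, weights, bias):
    # McCulloch-Pitts threshold logic unit: output 1 if the weighted sum plus bias is >= 0
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

AND = lambda x1, x2: tlu([x1, x2], [1, 1], -1.5)
OR  = lambda x1, x2: tlu([x1, x2], [1, 1], -0.5)
NOT = lambda x:      tlu([x], [-1], 0.5)

print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
print([OR(a, b)  for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                                  # [1, 0]
```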
In relation to the fault-tolerance problem, von Neumann [Neu56] realized that,
by using redundancy, a more reliable network could be built with unreliable
components.
In relation to the learning problem, in 1949 the psychologist Donald O. Hebb
proposed a method of determining the network weights. In his book The Organization
of Behavior [Heb49] he states:
"When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or metabolic
change takes place in one or both cells such that A’s efficiency as one
of the cells firing B, is increased."
This is known today as the Hebbian rule and most of the learning rules can be seen as
variants of this rule. The Hebbian rule can be formulated mathematically as:

$$ \Delta w_{ij} = \eta \, y_i(\mathbf{x}) \, x_j \qquad (2.2) $$

where x = [ x1 x2 ... xp ]T is the input vector, y = [ y1 y2 ... yq ]T is the output vector and
η > 0, called the learning rate, controls the size of each learning step and p and q are
respectively the number of input and output units.
A simple example of an application of the Hebbian rule can be illustrated by the
Linear Associative Matrix (LAM) model, sometimes also called Linear Associator,
introduced by J. Anderson [And68]. In such models the output is a linear function of the
inputs, or simply Y = W X, i.e. a feedforward linear ANN. Figure 2.5 illustrates the
LAM network that is used to associate a set of input vectors X, [ X1 X2 ... XM ], to a set
of desired output vectors D, [ D1 D2 ... DM ], where M is the number of desired

Figure 2.5 - The Linear Associative Memory (LAM)

associations. The weight matrix has dimension q × p, each input vector has dimension p × 1 and the output vectors have dimension q × 1.
Sometimes such vectors are referred to as patterns (from the nomenclature used
in the area of pattern recognition) since a pattern can be seen as a point in a multi-
dimensional input space, i.e. a vector. Therefore one can refer to the LAM and several
other ANN models as pattern associators.
In the training phase, if the input vectors X form an orthonormal set, i.e. they are
orthogonal to each other and each one has unit length (we can easily force the unit
length by dividing each component of the vector by its length), we can initialize the
weight matrix as a null matrix (Wij(0) = 0 for all i and j), and set the learning rate η to
1. The correct input-output association can be encoded in the weight matrix using only
a single presentation of each input-desired output pair. Since we have for each
presentation:

$$ \Delta W(k) = W(k) - W(k-1) = D_k X_k^T, \qquad 1 \le k \le M \qquad (2.3) $$

after the presentation of all M pairs the weight matrix W will be:

$$ W(M) = D_1 X_1^T + D_2 X_2^T + \dots + D_M X_M^T \qquad (2.4) $$

The advantages of the LAM model are: 1) its real-time response, 2) its fault-
tolerance in relation to corruption of the weight matrix, and 3) its interpolated response.
The limitations of such a model are: 1) the number of patterns stored has to be less than or equal to the number of input units, M ≤ p, 2) the input patterns have to be orthogonal
to each other and 3) it is not possible to store nonlinear input-output relationships. Later

on we will see that, by using the Delta Rule, the association can be learned even for
non-orthogonal input patterns.
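A minimal sketch of this one-shot Hebbian storage and recall, using two invented orthonormal input patterns and arbitrary target vectors, might be:

```python
import numpy as np

# two orthonormal input patterns (columns of X) and their desired outputs (columns of D)
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
D = np.array([[ 1.0, -1.0],
              [-1.0,  1.0],
              [ 1.0,  1.0]])

# one-shot Hebbian learning with W(0) = 0 and eta = 1 (eqs 2.3 and 2.4): W = sum_k D_k X_k^T
W = sum(np.outer(D[:, k], X[:, k]) for k in range(X.shape[1]))

# recall: because the input patterns are orthonormal, W X_k reproduces D_k exactly
print(np.allclose(W @ X, D))   # True
```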

2.4.2 - The Perceptron and Linear Separability


In 1958 Frank Rosenblatt proposed the Perceptron model ([Ros58], [Ros62]) that
can also be used as a pattern associator or pattern classifier. The single-layer perceptron
model consists of one layer of input binary units and one layer of binary output units.
There are no hidden layers and therefore there is only one layer of modifiable weights.
The output units use a hard-limiting threshold output function as in eq. 2.1. Figure 2.6
illustrates the single-layer perceptron.
The single-layer perceptron is a special case of Rosenblatt’s elementary
perceptron. His original perceptron was proposed as a model for visual pattern
recognition and has 3 types of units: sensory, association and response units (see figure
2.7). The sensory units, or S-units, form a "retina" and act as transducers responding to
physical signals such as light, pressure or heat. The association units, or A-units, receive
random, localized and fixed connections from the sensory units. The association units
send variable connections to the response units, or R-units, that act as the output units.
All units are TLUs but, while the sensory and association units have 0 or 1 as outputs
(binary outputs) and non-zero fixed threshold values, the response units have a -1 or 1
output (bipolar output) and a zero fixed threshold value. Since such a model is
equivalent to a network of McCulloch and Pitts’ units, such a network can also compute
any logical function and therefore it can also perform universal computation [Ros62].

Figure 2.6 - The Single-Layer Perceptron



Figure 2.7 - Rosenblatt’s elementary perceptron

The computation performed by the sensory and association units can be viewed as a
fixed pre-processing stage, since the connections between them are non-adaptable.
For training the perceptron, Rosenblatt proposed the following supervised
learning error-correction rule:
1) Apply an input pattern and calculate the output Y.
2) a) If the output is CORRECT, go to step 1;
b) If the output is INCORRECT and is -1, ADD each input to its
corresponding weight, ∆Wij = Xj; or
c) if the output is INCORRECT and is 1, SUBTRACT each input
from its corresponding weight, ∆Wij = - Xj.
3) Select another input pattern from the training set and return to step 1.
Mathematically step 2 can be expressed as:

$$ \Delta W_{ij} = \frac{1}{2} \left( D_i - Y_i \right) X_j \qquad (2.5) $$
making explicit the error-correction part. Rosenblatt [Ros62] proved that, if a solution
exists, i.e. if there is a weight matrix W that gives the correct classification for the set
of training patterns, then the above algorithm will find such a solution after a finite
number of iterations. This proof is today known as the Perceptron Convergence
Theorem and it can also be found in [Zur92], [MiPa69], [HKP91] and [BeJa90].
Therefore it is also very important to understand in which cases such a solution exists.
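A minimal sketch of the perceptron rule of eq. 2.5 is given below; the bipolar AND problem used as training data is an invented, linearly separable example, so the rule is guaranteed to converge on it:

```python
import numpy as np

def tlu(x, w, bias):
    # bipolar threshold unit used for the response units
    return 1 if np.dot(w, x) + bias >= 0 else -1

# illustrative, linearly separable training set: bipolar AND
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
D = np.array([-1, -1, -1, 1])

w, bias = np.zeros(2), 0.0
for epoch in range(100):
    errors = 0
    for x, d in zip(X, D):
        y = tlu(x, w, bias)
        if y != d:                       # error-correction step of eq. 2.5
            w    += 0.5 * (d - y) * x
            bias += 0.5 * (d - y)
            errors += 1
    if errors == 0:                      # a separating hyperplane has been found
        break
print(w, bias, [tlu(x, w, bias) for x in X])
```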
Without loss of generality, since all output units operate independently of each

other, let’s consider just one of the output units of the single-layer perceptron. Such an
output unit divides its input space into 2 regions or classes (one region has high output
and the other low output, where "high" normally means 1 and "low" means 0 or −1
depending on if we are using binary or bipolar output units). These 2 regions are
separated by a hyperplane (a line for 2 inputs and a plane for 3 inputs) and the
hyperplane is the decision surface in this case. The position of such a hyperplane is
determined by the weights and bias received by this output unit. The equation of the
hyperplane of output unit i is:
$$ \sum_{j=1}^{p} W_{ij} X_j + bias_i = 0 \qquad (2.6) $$

Marvin Minsky and Seymour Papert [MiPa69] analysed in detail the capabilities and
limitations of the single-layer perceptron model (chapter 3 of [AlMo90] contains a good
explanation of Minsky and Papert’s arguments and Block [Blo70] summarizes their main
results). One of the most important limitations proved by Minsky and Papert was that
the single-layer perceptron can only solve problems that are linearly separable, i.e.
problems where, for each output unit, a hyperplane exists that correctly divides the input
space into the two correct classes. Unfortunately, many interesting problems are not
linearly separable problems. Moreover, Peretto ([Per92], chap. 6) shows that the fraction of logical functions that are linearly separable tends to zero as the number of arguments increases.
Figure 2.8 illustrates the logical boolean functions AND, OR and XOR which
have 2 inputs and 1 output. From figure 2.8 we can see that the functions AND and OR
are linearly separable (see the position of the decision surfaces, i.e. lines in this case).
However the XOR function is not linearly separable since it is not possible to position

Figure 2.8 - The AND, OR and XOR problems and possible locations for the decision surfaces

a single line to separate the points that should produce a "0" (or "-1") output from the
points that should produce a "1" output. Another way to illustrate that the perceptron
cannot solve the XOR problem is to write the 4 inequalities that need to be satisfied by
the network weights and bias. The inequalities are:
a) 0 W1 + 0 W2 < bias ⇒ bias > 0
b) 1 W1 + 0 W2 > bias ⇒ W1 > bias
c) 0 W1 + 1 W2 > bias ⇒ W2 > bias
d) 1 W1 + 1 W2 < bias ⇒ W1 + W2 < bias
These inequalities cannot be simultaneously satisfied since the weights W1 and W2
cannot both be greater than the bias, which has a positive value, and at the same time
their sum be less than bias.
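This incompatibility can also be illustrated numerically: a coarse brute-force sweep over (W1, W2, bias) finds no single TLU that reproduces XOR. The grid below is an arbitrary choice and the sweep is an illustration, not a proof:

```python
import numpy as np

patterns = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]   # XOR truth table

def solves_xor(w1, w2, bias):
    # single TLU whose output is 1 when w1*x1 + w2*x2 > bias, and 0 otherwise
    return all((w1 * x1 + w2 * x2 > bias) == bool(t) for (x1, x2), t in patterns)

grid = np.linspace(-5.0, 5.0, 41)
found = any(solves_xor(w1, w2, b) for w1 in grid for w2 in grid for b in grid)
print(found)   # False: no point on this grid reproduces XOR
```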
It is important to understand that for such problems, for the single-layer
perceptron, there is no solution to be found, i.e. it is a representation problem (where
we are interested in knowing whether there is at least one solution), not a learning problem
(where we know that there is at least one solution and want to find one of the solutions).
Figure 2.8 also illustrates that a possible solution for the XOR problem is to
change the shape of the decision surfaces, for instance from hyperplanes to ellipsoids.
In such a case, there are two possible solutions. In the first solution all points in the
input space X outside D.S.1 produce a "1" output and the points inside D.S.1 produce
a "0" output. In the second solution all points inside D.S.2 produce a "1" output and the
points outside produce a "0" output.
The way to overcome the limitation of linear separability is to use multi-layer
networks, such as the so called Multi-Layer Perceptron (MLP), that introduce extra
layers of units (the so called hidden units) between the input and output layers. It is
possible to show that this is equivalent to defining new shapes for the decision surfaces
by combining several hyperplanes. However, partly as a consequence of the publication
of Minsky and Papert’s book, the interest of the research community in the late 1960’s
was quickly diverted from ANN to other areas, mostly to the then new area of Artificial
Intelligence. Minsky and Papert state in their book ([MiPa69] page 231-232):
"The perceptron has shown itself worthy of study despite (and even
because of!) its severe limitations. It has many features to attract
attention: its linearity; its intriguing learning theorem; its clear
paradigmatic simplicity as a kind of parallel computation. There is no

reason to suppose that any of these virtues carry over to the many-
layered version. Nevertheless, we consider it to be an important research
problem to elucidate (or reject) our intuitive judgement that the extension
is sterile. Perhaps some powerful convergence theorem will be
discovered, or some profound reason for the failure to produce an
interesting "learning theorem" for the multi-layered machine will be
found."
At that time there was no reliable algorithm to train a multi-layer ANN and Minsky and
Papert judged that it was not worthwhile to try to find one. The interest in ANN would
only resurface again in the mid/late 1980’s partly because of the popularization of
Hopfield’s work and the Back-Propagation Algorithm (which will be presented in the
next section), but also because of the perceived limitations of the AI approach.
Simpson [Sim90] argues that Rosenblatt was aware of the limitations of the
single-layer perceptron model. Rosenblatt ([Ros62], [Sim90]) also proposed extensions
to the perceptron model illustrated in figure 2.7, that he called the series-coupled
perceptron (a feedforward network). Such extensions were made by adding extra
feedback connections to the series-coupled perceptron. He proposed the cross-coupled
perceptron with parallel connections within the association units, and the back-coupled
perceptron with added connections back from the response units to the association units.
However, in all of these models the first layer of weights (the weights from the sensory
units to the association units) were randomly preset and non-adaptable. The extra
feedback connections made it very difficult to analyse such models mathematically and
Minsky and Papert’s book is not concerned with them. Rosenblatt also came very close
to discovering the key to training multi-layer perceptrons when he proposed a heuristic
algorithm to adapt both layers of weights. In page 292 of [Ros62] Rosenblatt states:
"The procedure to be described here is called the "back-propagating
error correction procedure" since it takes its cue from the error of the R-
units (the output units), propagating corrections back towards the sensory
end of the network (the input units) if it fails to make a satisfactory
correction quickly at the response end (output units). The actual
correction procedure for the connections to a given unit, whether it is an
A-unit (hidden unit) or an R-unit (output unit), is perfectly identical to
the correction procedure employed for an elementary perceptron (the

single-layer perceptron), based on the error-indication assigned to the


terminal unit."
Again Minsky and Papert’s book did not discuss this algorithm, possibly because
Rosenblatt could not, apparently, make much progress with it.

2.4.3 - ADALINE/MADALINE and the Delta Rule


In 1960 Widrow and Hoff [WiHo60] introduced the ADALINE, initially an
abbreviation for ADAptive LInear NEuron and later, when ANN models became less
popular, ADAptive LINear Element. The ADALINE is a TLU with a bipolar output and
bipolar inputs (-1 or 1). As usual, such a unit computes a weighted sum of the inputs plus
a bias. If the sum is greater than zero, the output is 1. If the sum is equal to or less than
zero, the output is -1. Later such a model was also used with continuous real-valued
inputs. A MADALINE (multiple ADALINE) is basically the single-layer perceptron with
bipolar inputs and bipolar outputs.
To train the ADALINE/MADALINE network, Widrow and Hoff proposed the
Delta Rule, also known today as the Widrow-Hoff Rule or Least-Mean-Square (LMS)
algorithm. The Delta Rule is also an error-correction rule, i.e. a supervised learning rule.
The learning speed of the single-layer perceptron, when trained with the
Perceptron rule (eq. 2.5), could be very slow since the weights were changed only when
there was a gross classification error. The basic principle of the Delta Rule is to change,
for each presentation of an input/desired output pattern from the training set, the network
weights in the direction that decreases the squared output error Epat defined as:
$$ E^{pat} = \sum_{i=1}^{q} E_i^{pat} = \frac{1}{2} \sum_{i=1}^{q} \left( D_i - Y_i \right)^2 \qquad (2.7) $$
In other words, the Delta rule is a gradient-descent search procedure executed at each
iteration. By repeating this procedure over the set of training patterns, we minimize the
"average" output error Eav where:
$$ E^{av} = \frac{1}{M} \sum_{pat=1}^{M} E^{pat} \qquad (2.8) $$

and consequently we have:

$$ \Delta W_{ij} = -\eta \frac{\partial E^{pat}}{\partial W_{ij}} = -\eta \frac{\partial E^{pat}}{\partial Y_i} \frac{\partial Y_i}{\partial W_{ij}} = \eta \left( D_i - Y_i \right) \frac{\partial Y_i}{\partial W_{ij}} \qquad (2.9) $$
However the activation/output function is not continuous and therefore not differentiable

and in principle eq. 2.9 cannot be applied. So, in effect Widrow and Hoff proposed to
use, during training, a linear activation function or Y = W X + bias. Such a modification
makes learning quicker because it changes the weights even when the output
classification is almost correct, in contrast to the perceptron rule that changes the
weights only when there is a gross classification error. Another important difference is
the use of bipolar inputs instead of binary inputs. Using binary inputs, when the input
is 0, the weights associated with such an input do not change. Using bipolar inputs, the
weights change even when the inputs are inactivated (-1 in this case).
The training procedure of the single-layer perceptron with the delta rule can be
summarized as:
1) Initialize the matrix W and the bias vector with small random numbers.
2) Select an input/desired output (X,D) vector from the training set.
3) Calculate the network output as: Y = W X + bias
4) Change the weight matrix and the bias vector using:

$$ \Delta W_{ij} = \eta \left( D_i - Y_i \right) X_j \qquad (2.10) $$

$$ \Delta bias_i = \eta \left( D_i - Y_i \right) \qquad (2.11) $$

5) Repeat steps 2-4 until the output error vector D − Y is sufficiently small for all input vectors in the training set.
After training, the output of the network Y for any input vector X is calculated in two
steps:
1) Calculate the net input to each output unit: net = W X + bias
2) The network output is given by:

$$ Y_i = \begin{cases} 1 & \text{if } net_i > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (2.12) $$
This is known in the ANN literature as the recall phase.
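A minimal sketch of this training and recall procedure, using invented bipolar training data for the (linearly separable) OR function, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)   # bipolar inputs
D = np.array([-1, 1, 1, 1], dtype=float)                          # bipolar OR targets

W = rng.normal(scale=0.1, size=2)          # step 1: small random weights and bias
bias = rng.normal(scale=0.1)
eta = 0.05

for epoch in range(200):                   # steps 2-5: sweep the training set repeatedly
    for x, d in zip(X, D):
        y = W @ x + bias                   # step 3: linear output used during training
        W    += eta * (d - y) * x          # step 4: eq. 2.10
        bias += eta * (d - y)              #         eq. 2.11

recall = np.where(X @ W + bias > 0, 1, -1) # recall phase: eq. 2.12
print(W, bias, recall)                     # recall should reproduce D
```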
The above training procedure directly minimizes the average of the difference
between the desired output and the net input for each output unit (what Widrow and
Hoff in [WiHo60] call measured error). However, it is possible to show that, by doing
this, we are also minimizing the average of the output error (what Widrow and Hoff call
neuron error). Since the introduction of the ADALINE/MADALINE model, Widrow and
Hoff were well aware that it could only be used to solve linearly separable problems.
In relation to the network capacity, Widrow and Lehr [WiLe90] and Nilsson [Nil65] show that, on average, an ADALINE with p inputs can store up to 2p random

patterns with random binary desired responses. The value 2p is the upper limit reached
when p → ∞.
By comparing eq. 2.9 with eq. 2.5, we can see that the Perceptron learning rule
and the Delta rule are in principle identical, with the only major difference being the
omission of the threshold function during training in the case of the Delta rule.
However, they are based on different principles: the Perceptron rule is based upon the
placement of a hyperplane and the Delta rule is based upon the minimization of the
mean-squared-error between the desired and computed outputs.
It is also interesting to see that if we train the Linear Associator Y = W X using
the Delta rule instead of the Hebbian rule, the input vectors do not need to be
orthogonal to each other, they only need to be linearly independent. However, for p
input units, the Linear Associator is still limited to store up to p linear associations,
since it is not possible to have more than p independent vectors in an input space with
dimension p [Per92]. In particular if: a) the learning rate η is small enough; b) all
training pairs are presented with the same probability; c) there are p input training
patterns and p network inputs; and d) the input training patterns form a linear
independent set; then the weight matrix converges to the optimal solution W* where:

$$ W^* = \left[ D_1 \, D_2 \, \dots \, D_p \right] \left[ X_1 \, X_2 \, \dots \, X_p \right]^{-1} \qquad (2.13) $$

For convergence results, which are also applied to the ADALINE/Perceptron ANN, see
[Sim90], [Luo91], [WiSt85] and [ShRo90].
Widrow applied the LMS algorithm and its variants to train the Linear Associator
(what he called an Adaptive Linear Combiner or a non-recursive adaptive linear filter,
i.e an ADALINE with a linear output function) to a large range of signal processing
problems. For examples of applications see Widrow and Stearns [WiSt85].
In the early 1960s Widrow also proposed a heuristic algorithm to adapt the weights of a multi-layer ANN. The first layer was composed of ADALINEs and the output layer had a single fixed logic unit, for instance an OR, AND or majority-vote
taker. Only the weights arriving at the ADALINEs were adapted. The learning rule,
called MRI for Madaline Rule I, uses the minimal disturbance principle, i.e. no more
ADALINEs are adapted than necessary to correct the output decision, therefore causing
the minimal disturbance to the responses already learned.
In 1987 Widrow and Winter developed the MRII, Madaline Rule II, an extension
of MRI to allow the use of more than one logic unit at the output layer. However, up

to now both MRI and MRII have not been used much in the ANN literature. In 1988
David Andes modified MRII into MRIII by replacing the threshold logic function used
in the ADALINE by sigmoid functions. However, Widrow and his students later realised
that MRIII is mathematically equivalent to the Back-Propagation algorithm to be
presented in the next section. For more details on the MRI and MRII rules see [WiLe90]
and [Sim90].

2.4.4 - The Multi-Layer Perceptron and the role of hidden units


Figure 2.9 shows the minimum configuration for a Multi-Layer Perceptron. At
least one layer of hidden units with nonlinear activation functions is needed. An ANN
with hidden layers of linear units can be represented by an equivalent ANN without
hidden layers. The output units can have linear or nonlinear activation functions. It is
also possible to have direct connections from the input to the output units. In general,
if we draw the ANN with the input layer at the bottom and the output layer at the top
of the diagram (as in fig. 2.9), a layer of units can send connections to any layer that
is above it, since we assume that the MLP is by definition a feedforward ANN model.
The use of hidden units makes it possible to reencode the input patterns, thereby creating a different representation. Each hidden layer reencodes its input. Some authors refer to the hidden units as creating internal representations or extracting the hidden

Figure 2.9 - The minimum configuration for a Multi-Layer Perceptron (MLP)



features from the data. Depending on the number of hidden units, the new representation
can correspond to vectors that are then linearly separable. If there are too few units in
a hidden layer to make possible the necessary reencoding, perhaps another layer of
hidden units is necessary. Because of this, the designer has to decide, for instance,
between using a) only one hidden layer with several units; or b) two hidden layers with
fewer units in each hidden layer. Normally no more than two hidden layers of units are
used, firstly because the representation power added by up to 2 hidden layers is likely
to be enough to solve the problem and secondly because for most of the algorithms used
nowadays the simulation results indicate that the training time increases rapidly with the
number of hidden layers.
The power of an algorithm that can adapt the weights of a MLP originates from
the fact that such an algorithm can find such a reencoding automatically by using the given
set of examples of the desired input-output mapping. It is possible to see such internal
reencoding, or internal representation, as a set of rules (or micro-rules as some authors
prefer to refer to them). So, using an analogy with expert systems, such an algorithm
would "extract" the rules or features from the set of examples, what is referred to by
some authors as the property of performing feature extraction from the data set.

Figure 2.10 - The first possible solution for the XOR problem

Figure 2.11 - The second possible solution for the XOR problem

Figures 2.10, 2.11 and 2.12 illustrate three different solutions for the XOR
problem using TLUs in the hidden and output layers. Note that for the ANNs illustrated
in fig. 2.10, 2.11 and 2.12, the output unit can also be linear with a zero bias, i.e.
respectively y = x3 + x4, y = x4 - x3 and y = x1 + x2 - 2x3. In figures 2.10 and 2.11 the
two hidden units reencode the input variables x1 and x2 as the variables x3 and x4. The
four input patterns are mapped to three points in the x3-x4 space. These three points are
then linearly separable as illustrated. Observe that the solution illustrated in figure 2.11
is a combination of the AND and OR functions.
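The solution of figure 2.11 can be sketched as follows; the particular weight and bias values are illustrative choices (the thesis figures may use different values), with the hidden units computing AND and OR and the output unit firing only when OR is on and AND is off:

```python
import numpy as np

def tlu(net):
    # threshold logic unit with binary output
    return np.where(net >= 0, 1.0, 0.0)

def xor_net(x1, x2):
    x3 = tlu(x1 + x2 - 1.5)        # hidden unit computing AND(x1, x2)
    x4 = tlu(x1 + x2 - 0.5)        # hidden unit computing OR(x1, x2)
    return tlu(x4 - x3 - 0.5)      # output unit: high only for OR and not AND (y = x4 - x3)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), int(xor_net(x1, x2)))   # 0, 1, 1, 0
```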
Figure 2.12 illustrates that if connections from the input to the output units are
used, the XOR problem can be solved using only one hidden unit which implements the
AND functions. Then if we consider the expanded input space x1-x2-x3, the 4 patterns
are now linearly separable since it is possible to find a plane that separates the points
which should produce a "0" output from the points that should produce a "1" output. If
the output unit is kept as the TLU, the decision surface in the space x1-x2 changes from
a line to an ellipse (see figure 2.8) [WiLe90].
Since the AND function can be defined for binary variables as the product of the

variables, from figure 2.12 we can see that if we have as input to the network the value
of the variable x1*x2, one layer of units would be enough to solve the problem and there
would be no need of hidden units. Generalizing this idea, when the unit itself uses
products of its input variables, it is called a higher-order unit and the network a higher-
order ANN. In general, higher-order units implement the function [GiMa87]:

$$ y_i = F\left( bias + \sum_j w_{ij}^{(1)} x_j + \sum_{j,k} w_{ijk}^{(2)} x_j x_k + \sum_{j,k,l} w_{ijkl}^{(3)} x_j x_k x_l + \dots \right) \qquad (2.14) $$
From this definition, the perceptron is a first-order ANN since it uses only the first input
term of the above equation. Widrow [WiLe90] refers to such units as units with
Polynomial Discriminant Functions. The problem with higher-order ANN is the very
rapid increase in the number of weights with the number of inputs as was earlier noted
by Minsky and Papert [MiPa69]. However, recently such networks have successfully
been used for classification of images irrespective of their translation, rotation and
scaling ([RSO89], [SpRe92]), where the weight number explosion is kept under control
by grouping the weights. For some problems, as was the case for the XOR, one layer
of higher-order units may be enough since they use more complex decision surfaces than
the MLP’s hyperplanes. A MLP can only implement more complex decision surfaces
by a combination of such hyperplanes.
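As an illustration of eq. 2.14, a single second-order unit with one product term x1*x2 can solve the XOR problem without any hidden units; the weight values below are an illustrative choice:

```python
def second_order_unit(x1, x2):
    # y = F_T( bias + w1*x1 + w2*x2 + w12*x1*x2 ), with a single second-order term
    net = -0.5 + 1.0 * x1 + 1.0 * x2 - 2.0 * (x1 * x2)
    return 1 if net >= 0 else 0

print([second_order_unit(x1, x2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```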
Finally, on the subject of units that use products of inputs, Durbin and Rumelhart
proposed to use what they called product units [DuRu89]. Instead of calculating a
weighted sum, each product unit calculates a weighted product, where each input is raised to a power determined by a variable weight. Therefore such a unit can learn an arbitrary polynomial term. They argue that such units are biologically plausible and correspond to
processing done locally at synapses.

Figure 2.12 - The third possible solution for the XOR problem

2.4.5 - The Back-Propagation Algorithm


We have seen that the advantage of using hidden units is that the ANN can then
implement more complex decision surfaces, i.e. the representation power is greatly
increased. The disadvantage of using hidden units is that learning becomes much harder
since the learning procedure has to decide which features it should extract from the
training data. Basically the dimension of the solution space is also greatly increased
since we need to determine a larger number of weights.
The Back-Propagation algorithm (BP) has been independently derived by several
people working in different fields. Werbos [Wer74] discovered the BP algorithm while
working on his doctoral thesis in statistics and called it the dynamical feedback
algorithm. Parker ([Par82], [Par85]) rediscovered the BP algorithm in 1982 and called
it the learning logic algorithm. Finally, in 1986, Rumelhart, Hinton and Williams
[RHW86] rediscovered the algorithm and the technique became widely known. The BP
algorithm is today the most popular supervised learning rule to train feedforward multi-
layered ANNs and it is responsible, with Hopfield networks (presented in the next
chapter), for the return of a general interest in ANNs.
The BP algorithm uses the same principle as the Delta Rule, i.e. minimize the
sum of the squares of the output error, averaged over the training set, using a gradient-
descent search. For this reason, the BP algorithm is also called the Generalized Delta
Rule. The crucial modification was to use smooth continuous activation functions in all
units instead of using TLUs. This allows the application of a gradient-descent search
even through the hidden units. The standard activation function for the hidden units are
the so called squashing or S-shaped functions, such as the sigmoid,
sig(x) = [1+exp(−x)]⁻¹, and the hyperbolic tangent, tanh(x) = 2*sig(2x) − 1. Sometimes
the general class of squashing functions is also referred to as sigmoidal functions.
The sigmoid function increases monotonically from 0 to 1 while the hyperbolic
tangent increases from −1 to 1. Note that the sigmoid function can be seen as a smooth
approximation to the threshold function defined in eq. 2.1, while the hyperbolic tangent
can be seen as the approximation of a bipolar TLU with a −1/1 output as used by
Widrow in the ADALINE. The function sig(x/T) tends to the threshold function when
T tends to 0; the parameter T is called the temperature and is sometimes used to change
the inclination of the sigmoid or hyperbolic tangent functions around their middle point.
In some applications, especially pattern classification where we need or want to limit

the range of the output units, squashing functions are also used in those units.
The difficulty in training a MLP is that there is no pre-defined error for the
hidden units. Since the BP algorithm is a supervised rule, we have the target for the
output units but not for the hidden units. As in the case of the Delta rule we want to
change the weights in the direction that decreases the output error.
Without loss of generality, let a feedforward ANN be numbered from input to
output such that unit 1 is the first input unit and unit N is the last output unit. Assuming
that the ANN has p input units, H hidden units distributed over one or more hidden
layers, and q output units, making a total of N units (p + H + q = N), then:
$$ E^{pat} = \frac{1}{2} \sum_{r=p+H+1}^{N} \left( D_r - out_r \right)^2 \qquad (2.15) $$
As in the case of the Delta rule, we apply the chain rule of differential calculus:

$$ \Delta W_{ij} = -\eta \frac{\partial E^{pat}}{\partial W_{ij}} = -\eta \frac{\partial E^{pat}}{\partial out_i} \frac{d\, out_i}{d\, net_i} \frac{\partial net_i}{\partial W_{ij}} \qquad (2.16) $$
and douti/dneti = derivative of the activation function of unit i with respect to its
argument neti; and ∂neti/∂Wij = outj. However, to calculate the term ∂E pat/∂outi we need
to consider if the unit i is an output unit (p+H+1 ≤ i ≤ N) or a hidden unit
(p+1 ≤ i ≤ p+H). If the unit i is an output unit, then as in the Delta rule, we have:

$$ \frac{\partial E^{pat}}{\partial out_i} = -\left( D_i - out_i \right) \qquad (2.17) $$
If the unit i is a hidden unit, then:

$$ \frac{\partial E^{pat}}{\partial out_i} = \sum_{L=i+1}^{N} \frac{\partial E^{pat}}{\partial net_L} \frac{\partial net_L}{\partial out_i} \qquad (2.18) $$

but ∂netL/∂outi = WLi. If we define δL = −∂E pat/∂netL, then:

$$ \frac{\partial E^{pat}}{\partial out_i} = -\sum_{L=i+1}^{N} \delta_L W_{Li} \qquad (2.19) $$

Equations 2.18 and 2.19 simply state that the effect of the output of a hidden unit on the output error is defined as the summation of the effect of the units that receive
connections from the hidden unit multiplied by the value of each connection. In other
words, the output error is "back-propagated" from the output layer to the hidden layers
through the weights and through the nonlinear activation functions. Observe that, in
relation to the Delta rule, the only new equation is really eq. 2.18 since the new problem
created by the hidden units is to find how a change in a weight received by a hidden

unit affects the output error.


Summarizing, we have:

$$ \frac{\partial E^{pat}}{\partial W_{ij}} = -\delta_i \, out_j \qquad (2.20) $$
where for output units (p+H+1 ≤ i ≤ N):

$$ \delta_i = \left( D_i - out_i \right) \frac{d\, out_i}{d\, net_i} \qquad (2.21) $$
and for hidden units (p+1 ≤ i ≤ p+H):

$$ \delta_i = \left( \sum_{L=i+1}^{N} \delta_L W_{Li} \right) \frac{d\, out_i}{d\, net_i} \qquad (2.22) $$
As usual, the above equations are also applied to adjust the bias by simply considering
them as additional weights that come from units with a constant unit output, i.e. in eq.
2.20, outj = 1.
Observe that in the above derivation of the BP algorithm, only the following
constraints are included in relation to the network: 1) the network is a feedforward
ANN; 2) all units have differentiable activation functions f (neti); and 3) the combination
function is defined in a vectorial notation as net = W out + bias. Some possible cases
are: the use of different activation functions in the hidden layer; use of several hidden
layers; and feedforward networks that are not strictly feedforward.
Another reason for using the sigmoid function or the hyperbolic tangent in a
multi-layered ANN is that their derivatives can be calculated simply from their output
value (dsig(x)/dx = sig(x) [1−sig(x)]; dtanh(x)/dx = [1+tanh(x)] [1−tanh(x)]), without the
need of more complex calculations. This is very useful since it reduces the overall
number of calculations needed to train the network.
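A minimal sketch of eqs 2.15-2.22 for a strictly feedforward network with one hidden layer of sigmoid units and a sigmoid output unit is given below. The architecture, learning rate, number of epochs and the XOR training data are illustrative assumptions, and (as discussed in the next section) a gradient-descent run of this kind may occasionally settle in a local minimum:

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
D = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets
eta, H = 0.5, 3                                               # learning rate, hidden units

W1, b1 = rng.normal(scale=0.5, size=(H, 2)), np.zeros(H)      # input -> hidden weights
W2, b2 = rng.normal(scale=0.5, size=(1, H)), np.zeros(1)      # hidden -> output weights

for epoch in range(5000):
    for k in rng.permutation(len(X)):                         # one pattern at a time
        x, d = X[k], D[k]
        out_h = sig(W1 @ x + b1)                              # forward pass
        out_o = sig(W2 @ out_h + b2)
        delta_o = (d - out_o) * out_o * (1 - out_o)           # eq. 2.21 (output unit)
        delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)      # eq. 2.22 (hidden units)
        W2 += eta * np.outer(delta_o, out_h); b2 += eta * delta_o   # eqs 2.16 and 2.20
        W1 += eta * np.outer(delta_h, x);     b1 += eta * delta_h

outputs = sig(W2 @ sig(W1 @ X.T + b1[:, None]) + b2[:, None])
print(np.round(outputs.T, 2))    # should approach [0, 1, 1, 0] if a good minimum is reached
```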

2.4.6 - Using the Back-Propagation Algorithm


In relation to the initialization of the weights and biases, Rumelhart et al.
[RHW86] suggested using small random values. Concerning the learning rate η, they
point out that, although larger learning rates will result in more rapid learning, they can
also lead to oscillation. They suggested that one way to use larger learning rates without
leading to oscillations is to modify eq. 2.16 by adding a momentum term:

$$ \Delta W_{ij}(k+1) = \eta \, \delta_i \, out_j + \alpha \, \Delta W_{ij}(k) \qquad (2.23) $$

where the index k indicates the presentation number and α is a small positive constant
selected by the user. A larger α increases the influence of the last weight change on the
current weight change. Such a modification in effect filters out the high-frequency
oscillations in the weight changes since it tends to cancel weight changes in opposite
directions and reinforces the predominant direction of change. This can be useful when
the error surface contains long ravines with a sharp curvature across the ravine and a
floor with a small inclination. For more details about the use of the momentum term see
[Zur92].
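The momentum modification of eq. 2.23 simply adds a fraction of the previous weight change to the current one; a minimal sketch for a single weight (with illustrative values of η, α, δi and outj) is:

```python
eta, alpha = 0.5, 0.9             # learning rate and momentum constant
W, dW_prev = 0.1, 0.0             # a single weight and its previous change

for delta_i, out_j in [(0.2, 1.0), (0.2, 1.0), (-0.1, 1.0)]:
    dW = eta * delta_i * out_j + alpha * dW_prev   # eq. 2.23
    W, dW_prev = W + dW, dW
    print(W, dW)                  # repeated changes in the same direction are reinforced
```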
In the case of the Delta rule, when applied to networks without hidden layers and
with output units with linear activation functions, the error surface will always have a
bowl shape and the local minima points are also global minima. If the learning rate is
small enough, the Delta rule will converge to one of these minima. In the case of the
MLP, the error surface can be much more complex with many local minima. Since the
BP is, as the Delta rule, a gradient-descent procedure, there is the possibility for the
algorithm to get trapped in one of these local minima and therefore not converge to the
best possible solution, the global minimum ([WiLe90], [Zur92], [McRu88]).
Whenever we have a pre-determined set of training data with a fixed number of
patterns, we can define an epoch as a single presentation of all training patterns to the
network. We will normally adopt a random order presentation of the training patterns
during an epoch and adjust the weights after the presentation of each single pattern. This is called random incremental updating as opposed to sequential cumulative updating, in which the patterns are presented to the network with a constant ordering, the
weight changes are summed and the weights are only updated at the end of the epoch.
Simulation results indicate that random incremental updating tends to work better than
sequential cumulative updating, since it injects some "noise" into the search procedure
[Zur92] and therefore helps the network to settle to a better local minimum.
It is interesting to note that, as Widrow points out [WiLe90], the idea of error
backpropagation through nonlinear systems has been used for centuries in the field of
variational calculus and has also been widely used since the 1960s in the field of optimal
control. Le Cun [LeC89] and Simpson [Sim90] point out that Bryson and Ho [BrHo69]
developed an algorithm very similar to the BP algorithm for nonlinear adaptive control.
Le Cun [LeC89] also shows how, using a Lagrangian formalism, the BP algorithm can

be derived as a solution to an optimization problem with nonlinear constraints and that from such an interpretation some extensions can easily be derived.
Although the BP algorithm was proposed for feedforward ANN, Almeida
[Alm89] has extended it to feedback networks by using a linearization technique, where
he assumes that each input pattern is presented to the network long enough for it to
reach a stable equilibrium. Only then are the outputs compared to the desired ones. Also
he assumes that the desired outputs depend only on the present inputs, not on the past
ones. Rumelhart et al. [RHW86] also considered applying the BP algorithm to feedback
networks but they used different assumptions. They simply expand the feedback network
as a feedforward network with several layers. This is possible because, as Minsky and
Papert [MiPa69] point out, for every feedback network, there is a feedforward network
with identical behaviour over a finite period of time. The BP algorithm is then applied
on this equivalent feedforward network and the weights are averaged after each change
to avoid violating the constraint that certain weights should be equal.
Another multi-layered learning algorithm that was presented before the
popularization of the BP algorithm in 1986 was the Boltzmann Machine (BM),
introduced in 1984 by Hinton, Ackley and Sejnowski ([HAS84], [HiSe86]). It uses a
much more complicated procedure than the BP algorithm in which the activations of the
hidden units are probabilistically adjusted using gradually decreasing amounts of noise
to escape local minima in favour of the global minimum. The idea of using noise to
escape local minima is called simulated annealing [KGV83]. The combination of
simulated annealing with the probabilistic adjustment of the hidden layers is called
stochastic learning [Sim90]. The main disadvantage of the Boltzmann Machine is its
excessively long training time. Later on, in 1986, Szu introduced a modified version of
the Boltzmann Machine called the Cauchy Machine (CM) that uses a fast simulated
annealing procedure [Szu86]. Although faster than the Boltzmann Machine, the Cauchy
Machine still suffers from very long training times [Sim90].

2.5 - Representation, Learning and Generalization

The first problem to be solved when applying feedforward ANNs trained using
supervised learning is the training data selection problem, i.e. to select a data set to be
used when training the ANN. Such training data set must contain the underlying

relationship that the ANN should acquire. Since in most cases this underlying
relationship is unknown, this may not be a trivial problem.
Once a training data set has been selected, the subsequent problems, in the
sequence in which they have to be solved, can be classified into three main areas:
representation, learning and generalization.
The representation problem is how to design the ANN structure such that there
is at least one solution (set of network weights) that learns the training set. The learning problem is how to find one of these possible sets of weights, i.e. training the ANN. This
is also referred to by some authors as the loading problem, based on the concept that
we are "loading" the training data set onto the ANN [Jud90]. Once training is finished,
the generalization problem is concerned with the network response when presented with
data that was not in the training set. A measure of generalization is normally obtained
by verifying the network performance using a test data set.

2.5.1 - The Representation Problem


The representation problem concerns: a) how many hidden layers we use; b) how
many units in each hidden layer; and c) which functions we use for the hidden units.
Normally the particular application in hand will specify how many input and output
units the ANN should have.
Particularly in classification problems (to determine the class to which the input
pattern belongs) the designer has some freedom to decide how to code the output, e.g.
using binary coding or 1-of-N coding. Sometimes, the designer may even decide to
preprocess the input data. Here we will assume that the designer has already decided the
input and output representation.
Once the designer has decided the network input and output representation, to
solve the particular problem in hand it is still necessary to look for the network internal
representation. The representation problem is then to choose the ANN structure such that
an internal representation exists, i.e. that there is at least one set of parameters (weights)
that can reproduce the training data set with a small error. At this moment there is very
little theory to help in this task.
Hornik et al. [HSW89] established that a feedforward ANN with as few as one
hidden layer using arbitrary squashing activation functions (such as sigmoids) and no
squashing functions at the output layer is capable of approximating virtually any

function of interest from one finite multi-dimensional space to another to any degree of
accuracy, provided sufficiently many hidden units are available. Later Stinchcombe and
White [StWh89] extended this result and showed that even if the activation function
used in the hidden layer is a rather general nonlinear function, the same type of FF
ANN is still a universal approximator. More or less at the same time, Funahashi
[Fun89], Cybenko [Cyb89], Kreinovich [Kro91] and Ito [Ito91] proved similar results.
White [Whi92] edited a book with a collection of his papers on this subject of ANNs
and approximation and learning theory.
From a theoretical point of view such results are important but they are existence
proofs, i.e. they prove that there is a FF ANN with just one hidden layer using
squashing or non-squashing functions in the hidden layer that solves the input-output
mapping problem. However, it is not possible to deduce from these proofs the ANN
topology (number of hidden layers and number of units in each hidden layer) or, once
the network topology is chosen, how to determine the network free parameters (the
weights).
Another important point not clarified by the proofs mentioned above is, given a
specific criterion such as minimum number of hidden units, which function is more
suitable to be used as the activation function for the hidden units. In general, these
functions belong to two classes: local or global (also called nonlocal) functions.
Units that use local functions have a constant output (normally zero) outside a
closed region of the unit input space and a different set of values within the closed
region. Units that use functions that cannot be characterized as local functions are said to use global functions.
The classical example of a FF ANN that uses local functions in the hidden layer is the so-called gaussian Radial Basis Function (RBF) network, where out_i = exp(−net_i²) and net_i = ‖x − C_i‖, and C_i is a vector which determines the position of the centre of the unit. In this case the regions where the unit output is above or below a certain value are respectively closed and open regions and the decision surfaces are in general ellipsoids.
When using the usual combining function net_i = W_i x + bias_i, where x is the unit input vector, the squashing and step functions are examples of global functions. In this
case there is a hyperplane that divides the unit input space into two regions where the
unit output has a high constant value in one region and a low constant value in the other
region. In this case, if we consider the input space to be unbounded, the regions where
the unit output is above or below a certain value are open regions and the decision
surfaces are hyperplanes. Note that the use of higher order units (see eq. 2.14) with a
squashing or step function makes it possible for the unit to implement global or local
functions by varying the unit weight values.
Park and Sandberg ([PaSa91],[PaSa93]) proved that RBF networks with just one
hidden layer and linear output units are also universal approximators.

2.5.2 - The Learning Problem


Once we have decided the network topology and the type of units to be used, the
next step is to determine the network free parameters, i.e. the network weights. The
range of applicable algorithms depends on the particular functions used in the hidden
units. Typically the Back-Propagation algorithm is used for FF ANNs with squashing
functions.
The BP algorithm can also be used for RBF networks but Moody and Darken
[MoDa89] have proposed a hybrid algorithm with two stages. In the first stage the
hidden units centres and the widths of the gaussian functions used by the hidden units
are determined in an unsupervised manner, i.e. by using only the input data and not the
correspondent desired outputs. The centres are determined by using a k-means clustering
algorithm and the widths by nearest-neighbour heuristics. In the second stage just the
output weights, i.e. the weights between the hidden and output units, which correspond
to the amplitudes of the gaussians, are modified in order to minimize the standard least-
squares error using a supervised algorithm such as the delta rule. The authors found out
that, in comparison with networks with sigmoid units trained by BP, the convergence
is very rapid, possibly because the first unsupervised stage has done most of the work
necessary for the correct classification. However, a possible drawback is the need for a larger number of hidden units (and therefore network weights) to achieve the same accuracy when approximating certain functions, in comparison with a network which
uses squashing functions.
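The two-stage procedure can be summarized in code. The following is a minimal sketch in Python of the idea described above, not the authors' original algorithm: the centre placement, the width heuristic and the least-squares fit are all simplified, and the data in the usage example is invented for illustration only.

    import numpy as np

    def train_rbf(X, d, n_hidden, rng=np.random.default_rng(0)):
        # Stage 1 (unsupervised): place the centres with a few k-means passes
        # and set each width from the distance to the nearest other centre.
        centres = X[rng.choice(len(X), size=n_hidden, replace=False)].copy()
        for _ in range(20):
            labels = np.argmin(((X[:, None, :] - centres[None, :, :])**2).sum(-1), axis=1)
            for k in range(n_hidden):
                if np.any(labels == k):
                    centres[k] = X[labels == k].mean(axis=0)
        dists = np.sqrt(((centres[:, None, :] - centres[None, :, :])**2).sum(-1))
        np.fill_diagonal(dists, np.inf)
        widths = dists.min(axis=1)                    # nearest-neighbour heuristic

        # Stage 2 (supervised): with the hidden layer fixed, the output weights
        # follow from a linear least-squares fit (a batch version of the delta rule).
        H = np.exp(-((X[:, None, :] - centres[None, :, :])**2).sum(-1) / widths**2)
        H = np.hstack([H, np.ones((len(X), 1))])      # bias column
        w, *_ = np.linalg.lstsq(H, d, rcond=None)
        return centres, widths, w

    # Usage on an invented 1-D regression problem:
    X = np.linspace(-1, 1, 50).reshape(-1, 1)
    d = np.sin(3 * X).ravel()
    centres, widths, w = train_rbf(X, d, n_hidden=8)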
The algorithms used to train a FF ANN can be classified into two main classes:
a) the algorithms that try to converge to the global minimum solution, and b) the
algorithms that try to converge rapidly. Unfortunately, it seems that the two classes do
not overlap. Consequently the algorithms that try to converge rapidly can still be trapped
in local minima (as BP does) while the algorithms that try to converge to the global
minimum tend to converge very slowly when compared, for instance, with the BP
algorithm.
Examples of algorithms that look for the global minimum are the Boltzman
Machine, already mentioned in the previous section, and genetic algorithms
([MoDav89], [HKP91]). Another possible problem with the use of genetic algorithms to train FF ANNs is the need for a large amount of processing power and memory.
Jacobs [Jac88] and Silva and Almeida [SiAl90] proposed to adapt the learning
rate (the step size) when executing the BP algorithm in order to speed the convergence.
This modification has the advantage that it does not increase significantly the
computational and memory requirements in relation to the standard BP algorithm.
The BP algorithm is a first-order algorithm since it uses only the first derivative
of the cost function to search for the minimum. Several researchers have proposed
second-order algorithms to perform such a search, for instance, Becker and le Cun
[BeCu88] and Kollias and Anastassiou [KoAn89]. Battiti [Bat92] published a review of
the application of first- and second-order methods for the training of FF ANN.
The main problems of using such second-order algorithms are: 1) a large increase
in the number of operations performed and in the memory requirements, especially for
large networks; and 2) not all implementations use local computations. Furthermore,
Saarinen et al. [SBC91] argue that many network training problems are ill-conditioned,
i.e. have ill-conditioned or indefinite Hessians, and therefore may not be solved more
efficiently by higher-order optimization algorithms.
A more recent approach has been suggested by Shah et al. [SPD92] where they
use optimal stochastic filtering techniques to train the ANN and at the same time they
pay attention to the computational and storage costs. Tepedelenlioglu et al. [TRSR91]
and Singhal and Wu [SiWu89] have proposed to use the Extended Kalman Filtering
algorithm to train FF ANNs.
There have also been a few approaches that try to reduce the network training
time and at the same time determine the number of units in the hidden layer, i.e. they
try to adapt the network topology. Normally such approaches start with an ANN with
a small size and add hidden units. Fahlman and Lebiere proposed the Cascade-
Correlation Learning Architecture [FaLe90] and studied the two-spirals problem (the
training points are arranged in two interlocking spirals).
Hirose et al. [HYH91] also suggest adapting during training the number of
hidden units with the aim of escaping local minima. Training is performed as standard
by the BP algorithm and they proposed adding an extra hidden unit whenever the
network seems to be trapped in a local minimum. Since the addition of such an extra
hidden unit distorts the error surface, that point in the weight space is not a local
minimum anymore. Later on, after satisfactory convergence is achieved, they proposed
a way of eliminating some of the hidden units.

2.5.3 - The Generalization Problem


Even if the training algorithm manages to find a satisfactory solution for the
training patterns, the ANN still needs to produce "reasonable" outputs when presented
with input patterns that were not used in the training set, i.e. the ANN needs to be able
to "generalize" what it has learned.
Poggio and Girosi ([PoGi90a],[PoGi90b]) state that, from the point of view that
FF ANN are trying to learn an input-output mapping from a set of examples, such a
form of learning is closely related to classical approximation techniques, for instance,
generalized splines and regularization theory [TiAr77]. In this case learning can be seen
as solving the problem of hypersurface reconstruction and is an ill-posed problem since in general there are infinitely many solutions. A priori assumptions are then necessary to make
the problem well-posed. Possibly the simplest assumption is that the input-output
mapping is smooth, that is small changes in the inputs cause a small change in the
output.
Training a FF ANN can be seen as a generalized multi-dimensional version of
finding the parameters of a polynomial that fits a set of points drawn from a uni-
dimensional space. Too many degrees of freedom (too many weights in the ANN) can
result in overfitting the training data and in poor performance on the test data set
[HKP91]. Therefore the ideal situation would be to find the minimum number of hidden
units that can produce the desired input-output mapping. This should result in the
smoothest possible mapping. Since it is very difficult and time-consuming to determine
the minimum number of hidden units, one approach that is frequently used is to train
the network using a small training data set and periodically to test the network using a
larger test data set. Training is then stopped when the error cost function measured over
the test data achieves the minimum value. If we continue training the network after such
a minimum is achieved, the error cost function measured over the training data will
continue to decrease but it will increase if the measure was taken over the test data.
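The stopping rule just described can be written down as a short procedure. The following is a minimal sketch in Python; train_step() and test_error() are hypothetical placeholders standing in for one BP pass over the training data and for the error cost function measured over the test data.

    import copy

    def train_with_early_stopping(net, train_step, test_error,
                                  max_epochs=1000, patience=20):
        # Keep the weights that gave the smallest error over the test data so far.
        best_err, best_net, since_best = float("inf"), copy.deepcopy(net), 0
        for _ in range(max_epochs):
            train_step(net)                      # one BP pass over the training set
            err = test_error(net)                # cost measured over the test set
            if err < best_err:
                best_err, best_net, since_best = err, copy.deepcopy(net), 0
            else:
                since_best += 1
                if since_best >= patience:       # test error has stopped improving
                    break
        return best_net                          # weights at the test-error minimum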
Baum and Haussler [BaHa89] proved some theoretical bounds governing the
appropriate sample size against the network size in terms of the network generalization.
One possible approach that can be used to improve network generalization is to
somehow constrain, during training, the degrees of freedom available to the network
trying to obtain a near-optimal network topology. The ANN should then be large enough
to contain the desired knowledge (assumed to be contained in the examples of the
training data set) but small enough to generalize well. A simple approach is just to add
to the normal cost function (the mean-squared-output error) a penalty for network
complexity. One possibility is to use the weight decay idea, i.e. we add to the cost
function the term β ∑_i w_i² [HKP91]. The application of the BP algorithm to this new cost function results in a weight decay term which discourages very large weights. Another possibility is to use the extra cost function term β ∑_i [w_i²/(K + w_i²)]. For small weights this can be approximated by β ∑_i [w_i²/K], while for large weights each term contributes approximately β. After training, the ANN can then be tested with the weights of smallest magnitude removed, which is known as "pruning" the network. When all the incoming weights of a hidden unit are removed, the hidden unit is effectively removed as well. Therefore the weight-elimination stage can also affect the network topology.
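As an illustration, the sketch below (in Python, with β and K as design parameters that must still be chosen by the user) shows the two penalty terms discussed above, the extra gradient contribution each one adds to the backward pass, and a simple magnitude-based pruning step.

    import numpy as np

    def weight_decay_penalty(w, beta):
        # Penalty beta * sum(w^2) and its derivative 2 * beta * w.
        return beta * np.sum(w**2), 2.0 * beta * w

    def weight_elimination_penalty(w, beta, K):
        # Penalty beta * sum(w^2 / (K + w^2)); its gradient is small both for
        # very small and for very large weights.
        pen = beta * np.sum(w**2 / (K + w**2))
        grad = 2.0 * beta * K * w / (K + w**2)**2
        return pen, grad

    def prune(w, threshold):
        # "Pruning": weights whose magnitude falls below the threshold are removed.
        w = w.copy()
        w[np.abs(w) < threshold] = 0.0
        return w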
Nowlan and Hinton ([NoHi92a],[NoHi92b]) propose an approach where the
network degrees of freedom are constrained by encouraging clustering of the weights
values. While the weight decay approach encourages clustering around the zero value,
their approach is aimed at encouraging clustering around a finite set of arbitrary real values, which is sometimes called weight-sharing. Kendall and Hall propose the minimum description length (MDL) approach [KeHa93] aimed at minimizing the information content in the network weights. They claim that the MDL approach also encourages weight elimination and weight-sharing.
More recently Green, Nascimento and York [GNY93] proposed to add
competition within the hidden layer of a FF ANN in order to eliminate unnecessary
hidden units. The addition of the competition turns the network into a feedback ANN.
However, the BP algorithm is applied as normal since they proposed to ignore the
competition weights during the backward pass of the BP algorithm.
One drawback that all these approaches have in common is the need to select
some extra parameters during training.

2.6 - Limitations of Feedforward ANNs

The basic concepts of Artificial Neural Networks and the differences in relation
to traditional computation were introduced in this chapter. Also the more important
feedforward ANN models were presented and the role of hidden units was discussed.
The majority of feedforward ANN models currently in use are sigmoid based and
have the following limitations:
1) Current ANN models take a long time to be trained, there is no
guarantee of convergence and the learning is inconsistent, i.e. the mean-
squared-error can remain high for many iterations and suddenly decrease
to a lower value. Therefore, without previous experience with a particular
problem, it is very difficult to estimate how long training will take.
2) When an ANN produces an output that corresponds to a decision, for
instance in a pattern classification problem, in general it is very difficult
to trace how the network reached such a decision, that is to get an
"explanation" form the network. An ANN by being trained using a
training data set, extracts the knowledge from the set of examples and
creates its own internal representation. To extract the knowledge coded
into the network we need to understand this internal coding, a difficult
task.
3) In general, an ANN does not give confidence intervals for its outputs.
However, Richard and Lippman [RiLi91] show that when an FF ANN is
trained to solve an M-class problem (the output unit corresponding to the
correct class set to one, all others zero) using a mean-squared-error cost function
such as in the BP algorithm, the network outputs provide estimates of
Bayesian probabilities.
4) Without prior experience with the problem in hand, the network
topology is determined by trial and error. Too small a network will make
learning impossible and too large a network will generalize badly.
5) It is not possible, in the general case, to encode prior information in
the network. If this was possible, training times could be reduced
considerably.
While the models presented in this chapter were of the feedforward type, the next
chapter concerns feedback networks, their theory and applications.
Chapter 3 - Feedback Neural Networks: the Hopfield and IAC Models

The main feedforward ANN models were presented in chapter 2. In this chapter
the principles behind the use of feedback ANNs are introduced and two models, the
Hopfield and IAC (Interactive Activation and Competition) neural networks are
presented and analyzed.
Because of the presence of the feedback connections, feedback ANNs are
nonlinear dynamical systems which can exhibit very complex behaviour. They are used
in two areas: 1) as associative memories or 2) to solve some hard optimization
problems. The basic idea in using a feedback ANN as an associative memory is to
design the network such that the patterns that should be memorized correspond to stable
equilibrium points. To use a feedback ANN to solve optimization problems the network
is designed so that it converges to the stable equilibrium points that correspond to good
(perhaps not necessarily optimal) solutions of the problem in hand.
In this chapter we show how the IAC network can be used to solve certain
optimization problems, much like the Hopfield network. As an example we show in
detail how to implement a 2-bit analog-digital converter using the IAC network.

3.1 - Associative Memories

To work as an associative memory, a network has to solve the following problem:
"Store M patterns S such that when presented with a new pattern Y, the
network returns the stored pattern S that is closest in some sense to Y".
Such associative memory can work as a content-addressable memory since we should
be able to retrieve the stored pattern by using as input an incomplete or corrupted
version of it (pattern completion and pattern recognition). Possible applications are in
hand-written digit and face recognition tasks and retrieval of information in general
databases.
For mathematical convenience we will assume that the components of the stored
patterns S and the test patterns Y can be only −1 or 1, instead of the usual binary values
0 and 1.
Figure 3.1 shows the general model of a one-layer feedback ANN that can be
used as an associative memory. In this particular case each unit is a TLU (Threshold
Logic Unit) with a bipolar output. The output of each unit is calculated as:

Y_i = sgn(net_i) = +1 if net_i > 0, −1 if net_i < 0        (3.1)
where the net input net_i is calculated as:

net_i = ∑_{j=1}^{N} W_ij Y_j + bias_i + ext_i        (3.2)

where N is the number of units in the network. The terms biasi and exti represent
respectively the fixed internal and variable external inputs. These terms could be
grouped together but in most models one or both of them are zero.
For simplicity, let’s consider for the moment that the bias term biasi and the
external input exti are zero.
The network is operated as follows: 1) an input pattern is loaded into the network
as the initial values for the network output Y; 2) the network output values are updated
asynchronously and stochastically, i.e. at each time step a unit is selected randomly from
among the N units with equal probability 1/N, independently of which units were
updated previously, and eqs. 3.1 and 3.2 are used to update its output. We will show
later that under some conditions, after a sufficiently large number of time steps, the
network will converge to a stable equilibrium point (EP), called a "memory". The output
of the units are then interpreted as the network classification of the input pattern.
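The operating procedure just described can be sketched in a few lines of Python (a minimal illustration, with the bias and external inputs taken as zero, as assumed in the text):

    import numpy as np

    def recall(W, y0, n_steps=2000, rng=np.random.default_rng(0)):
        y = y0.astype(float)                     # bipolar initial pattern (+1/-1)
        N = len(y)
        for _ in range(n_steps):
            i = rng.integers(N)                  # pick one unit at random (prob. 1/N)
            net = W[i] @ y                       # eq. 3.2 with bias_i = ext_i = 0
            if net != 0.0:                       # eq. 3.1: output follows the sign of net
                y[i] = 1.0 if net > 0 else -1.0
        return y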
Three important issues in such applications are: 1) how the network weights
should be adjusted such that the network is stable, that is such that the network converges
to an EP for any initial condition; 2) for a network with N units, how many patterns can
be stored; and 3) under what conditions will the network converge to the closest stored
pattern.
Figure 3.1 - A one-layer feedback ANN

Note that: 1) the units are simultaneously input and output units; 2) since there are no hidden units, such a network cannot encode the patterns, or in other words, the network cannot change the pattern representation; and 3) the network always occupies the corners of the hypercube [−1 1]^N.

3.1.1 - Storing one pattern


Let’s firstly consider the simple case where we want to store just one pattern. A
pattern Y is a stable EP if:

Y_i = sgn( ∑_{j=1}^{N} W_ij Y_j )        (3.3)
for all i, since when eq. 3.1 is applied to update the unit output no change will be
produced. Representing by S the pattern that we want to store, this can be achieved by
setting the network weights to:

W_ij = k S_i S_j        (3.4)

where k > 0 since then:

sgn( ∑_{j=1}^{N} k S_i S_j S_j ) = sgn( k N S_i ) = S_i        (3.5)

given that S_j S_j = 1. For later convenience, let k = 1/N. Then, in vectorial notation we
have that:
W = (1/N) S S^T        (3.6)
where S is a column vector and W is a symmetric matrix.
Note that even if almost half of the bits of the initial condition (the starting
pattern) are wrong, the stored pattern will still be retrieved since the correct bits, that
are in the majority, will force the sign of the net input to be equal to Si. This can be
proved by combining eqs. 3.3 and 3.6:

sgn( ∑_{j=1}^{N} W_ij Y_j ) = sgn( (S_i/N) ∑_{j=1}^{N} S_j Y_j ) = sgn( S_i (N_c − N_w)/N ) = S_i        (3.7)
where Nc and Nw are respectively the number of correct and wrong bits in the starting
pattern Y in relation to the stored pattern S. Observe also that if the starting pattern has
more than half the bits different from the stored pattern (Nw > Nc) then the network will
retrieve the inverse of the stored pattern, i.e. −S. Therefore there are two stable EPs,
sometimes also called attractors. The set of patterns that converge to one of the EPs
constitutes what is called the basin of attraction or region of convergence of that EP.
For this particular case, the entire input space is symmetrically divided into the two
basins of attraction.

3.1.2 - Storing several patterns


One simple way to store more than one pattern in the network is to generalize
eq. 3.4 and try to superimpose the patterns by using:
W_ij = (1/N) ∑_{pat=1}^{M} S_i^pat S_j^pat        (3.8)

or in vectorial notation,
W = (1/N) ∑_{pat=1}^{M} S^pat (S^pat)^T        (3.9)
where M is the total number of patterns that we want to store in the network and the
weight matrix W is still symmetric.
Equations 3.8 and 3.9 are implementations of the Hebbian rule, already
introduced in chapter 2. A feedback network operating as an associative memory, using
the Hebbian rule to store all patterns and being updated asynchronously is usually called
a discrete-time Hopfield network, after J. J. Hopfield who emphasized the concept of
using the equilibrium points of nonlinear dynamical systems as stored memories [Hop82].
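The Hebbian storage rule is equally short to implement. The sketch below (Python, with invented data) builds the weight matrix of eq. 3.9 from patterns held as the rows of a matrix S, and then reuses the recall() sketch given earlier to retrieve a stored pattern from a corrupted probe.

    import numpy as np

    def hebbian_weights(S, zero_diagonal=False):
        M, N = S.shape
        W = (S.T @ S) / N                 # (1/N) * sum over patterns of S_pat S_pat^T
        if zero_diagonal:
            np.fill_diagonal(W, 0.0)      # optional W_ii = 0 constraint (section 3.1.5)
        return W

    # Usage: store two random patterns and recall one from a corrupted probe.
    rng = np.random.default_rng(1)
    S = rng.choice([-1.0, 1.0], size=(2, 64))
    W = hebbian_weights(S)
    probe = S[0].copy()
    probe[:5] *= -1                       # flip a few bits
    print(np.array_equal(recall(W, probe), S[0]))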
The patterns S will be stored as stable EPs, i.e. fixed attractors, if they satisfy
the condition that:

sgn( ∑_{j=1}^{N} W_ij S_j ) = S_i        (3.10)
By combining eqs. 3.8 and 3.10 we have that:

sgn( (1/N) ∑_{j=1}^{N} ∑_{pat=1}^{M} S_i^pat S_j^pat S_j ) = S_i        (3.11)
Let’s suppose that we want to test such a condition for the stored pattern S^1. The argument of the function sgn( ) can be separated into the term with pat = 1 and the terms with pat > 1:

(1/N) ∑_{j=1}^{N} S_i^1 S_j^1 S_j^1 + (1/N) ∑_{j=1}^{N} ∑_{pat=2}^{M} S_i^pat S_j^pat S_j^1 = S_i^1 + c.t.        (3.12)
where c.t. stands for the crosstalk term, the second term on the left side of eq. 3.12. Therefore if the magnitude of the crosstalk term is less than 1, it will not change the sign of S_i^1 and the condition for stability of the pattern S^1 will be satisfied. The
magnitude of the crosstalk term is a function of the type and number of patterns to be
stored.
For many cases of interest, provided that the number of patterns to be stored is
much less than the number of units (M << N, see next section about storage capacity),
the crosstalk term is small enough and all stored patterns are stable. Moreover, as in the
single pattern case, if the network is initialized with a version of one of the stored
patterns that is corrupted with a few wrong bits, the network will retrieve the correct
stored version [HKP91].

3.1.3 - Storage Capacity


Hertz et al. [HKP91] show that if: a) the patterns to be stored are random (each bit has equal probability of being −1 or +1) and independent; and b) M and N are
large, then the crosstalk term can be approximated by a random variable with gaussian
distribution, zero mean and variance M/N. Therefore the ratio M/N determines the
probability of the crosstalk term being greater than 1 for Si = −1 or less than −1 for
Si = +1. From this modelling we can estimate, for instance, that if we choose
M = 0.185 N and the network is initialized with one of the S patterns, no more than 1%
of the bits will change. However, these few bits that change can cause more bits to
change and so on, i.e. what is known as the "avalanche" effect.


Hertz et. al. [HKP91] show, using an analogy to spin glass models and mean
field theory, that this avalanche occurs if M > 0.138 N and therefore we could not use
the network as a "memory". They also show that, using the previous modelling, for
M = 0.138 N, 0.37% of the bits will change initially and 1.6% of them will change
before an attractor is reached. So, if we choose M ≤ 0.138 N there will be an attractor
"close" to the patterns S that we want to store, i.e. they will be retrieved but the final
result will have a few bits wrong. As an example for this case, for N = 256, M ≤ 35.
If we want to recall all stored patterns S without error (perfect recall) , i.e. to
force the patterns S to be the attractors (not only "close" to the attractors as in the
previous case), then McEliece et. al. [MPRV87] show that M ≤ N / (4 ln N). Moreover,
they show that perfect recall will happen if the initial pattern has less than N / 2 different
bits when compared with a stored pattern ([HKP91],[HuHo3]). In this case for N = 256,
M ≤ 11.
From these arguments we can see that, when using the Hebbian rule (eqs. 3.8 and
3.9), the storage capacity of the Hopfield network is rather limited. Other design
techniques have been proposed that improve the storage capacity ([VePs89],[FaMi90])
to a value closer to M = N, the limit for the storage capacity of the Hopfield network
[AbJa85].
Note as well that if the patterns to be stored are all orthogonal to each other, i.e.

S_l^T S_k = 0 for l ≠ k,  S_l^T S_k = N for l = k        (3.13)
apparently the memory capacity would be N since the crosstalk term is zero in this case
(see eq. 3.12). However if we use the Hebbian rule (eqs. 3.8 or 3.9) to store N
orthogonal patterns, the weight matrix W will be equal to the identity matrix, i.e. each unit feeds back only to itself. Such an arrangement is useless as a memory since it makes all initial patterns stable, that is the network does not change its initial pattern. This can be interpreted as making every point of the discrete configuration space an attractor whose basin of attraction contains only the attractor itself. Therefore to make the
network useful in this case we need to store less than N orthogonal patterns.
We can prove that the weight matrix will be equal to the identity matrix if we try to store N orthogonal patterns using the Hebbian rule, by defining a square and not in general symmetric matrix X where each row of X is the transpose of one of the patterns to be stored, i.e. X_ij = S_j^i. Consequently, from eq. 3.13 we have that:

X X^T = N I        (3.14)

where I is the identity matrix. Then, we can rewrite eq. 3.9 as:
W = (1/N) ∑_{pat=1}^{M} S^pat (S^pat)^T = (1/N) X^T X        (3.15)
By the definition of orthogonality, no row of the matrix X can be written as a linear
combination of the other rows and therefore the inverse of X and the inverse of its
transpose X^{−T} exist. Then if we pre-multiply both sides of eq. 3.14 by X^T and post-multiply them by X^{−T} we have:

X^T X X^T X^{−T} = N I X^T X^{−T}        (3.16)

and consequently X^T X = N I and W = I, as we wanted to show.

3.1.4 - Minimizing an energy function


One important contribution made by Hopfield [Hop82] was to propose a lower
and upper bounded scalar-valued function, a so-called "energy function", that reflects
the state of the whole network, i.e. such a function involves all the network outputs. He
then showed that whenever one of the network outputs Yi is updated, the value of this
function is decreased if Yi changes or remains constant if Yi does not change. Therefore
the network will evolve until it reaches a state that is a locally stable equilibrium point.
To prove this, Hopfield defined the energy function as the following quadratic function:
H(k) = −(1/2) Y^T(k) W Y(k) = −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} W_ij Y_i(k) Y_j(k)        (3.17)
where H(k) is the value of the energy function for the whole network at time step k. The
lower and upper limits for H(k) for any k are given respectively by −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} |W_ij| and (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} |W_ij| since the outputs Y_i are −1 or +1.
Let’s assume that at time k the unit L was selected to be updated, where
1 ≤ L ≤ N. Isolating the energy terms due to unit L, we can rewrite eq. 3.17 as:
H(k) = −(1/2) ∑_{i≠L} ∑_{j≠L} W_ij Y_i(k) Y_j(k) − (1/2) Y_L(k) ∑_{j≠L} W_Lj Y_j(k)
       − (1/2) Y_L(k) ∑_{i≠L} W_iL Y_i(k) − (1/2) W_LL [Y_L(k)]²        (3.18)
The variation in the energy is given by ∆H(k) = H(k+1) − H(k). Note that: 1) since the
updating is asynchronous only unit L may change at time k and consequently
Yi(k+1) = Yi(k) for i ≠ L, 2) since all units have bipolar outputs [Yi]2 = 1 for all i.
Therefore, when calculating ∆H(k), the first and fourth terms of the right side of eq. 3.18
will be cancelled out and we can write:

∆H(k) = −(1/2) Y_L(k+1) [ ∑_{j≠L} W_Lj Y_j(k) + ∑_{i≠L} W_iL Y_i(k) ]
        + (1/2) Y_L(k) [ ∑_{j≠L} W_Lj Y_j(k) + ∑_{i≠L} W_iL Y_i(k) ]        (3.19)
If unit L changes its output then YL(k+1) = −YL(k) and using the fact that the weight
matrix W is symmetric (see eq. 3.8), we have that:
N
∆ H (k) 2 YL (k) WL j Yj (k) 2 YL (k) netL (k) 2 WL L (3.20)
j 1
j≠L

Due to the rule used to update the network outputs (eqs. 3.1 and 3.2), whenever a unit
changes its output the product YL(k) netL(k) is negative. Due to the Hebbian rule (eq. 3.8)
WLL = M / N. Therefore whenever a unit changes its output, the overall energy of the
network decreases. In other words, the energy is a monotonically decreasing function
with respect to time.
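This argument is easy to check numerically. The short sketch below (Python, with random patterns invented for illustration) evaluates the energy of eq. 3.17 before and after every asynchronous update and verifies that it never increases.

    import numpy as np

    def energy(W, y):
        return -0.5 * y @ W @ y                  # eq. 3.17

    rng = np.random.default_rng(2)
    S = rng.choice([-1.0, 1.0], size=(3, 32))    # three random patterns
    W = (S.T @ S) / 32                           # Hebbian rule, eq. 3.9
    y = rng.choice([-1.0, 1.0], size=32)         # arbitrary starting state
    for _ in range(500):
        i = rng.integers(32)
        h_before = energy(W, y)
        net = W[i] @ y
        if net != 0.0:
            y[i] = 1.0 if net > 0 else -1.0
        assert energy(W, y) <= h_before + 1e-12  # the energy never increases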
Note that we use the fact that the weight matrix is symmetric, an assumption that
is not biologically plausible in terms of networks of real neurons. McEliece et al. [MPRV87] speculate, however, that maybe all that is necessary is a "little" symmetry, such as a lot of zeros at symmetric positions in the weight matrix, which is common in real neural networks. Moreover, asymmetric weight matrices can be used to generate a
cyclical sequence of patterns ([HKP91],[Kle86]) and Kleinfeld and Sompolinsky
[KlSo89] even found a mollusc that apparently uses this mechanism. In this case the
attractors are stable limit cycles.
In chapter 2 we mentioned that learning in feedforward networks could be seen
as an optimization process. This is also the case here for feedback networks and such
an interpretation will be very useful later. The problem can be stated as follows: how
should the weights be set such that the patterns to be stored are deep minima of the
energy function given by eq. 3.17. Let’s start with storing just one pattern. If we want
to store just the pattern S, since its components are −1 or +1, we can make each term of the double sum equal to [S_i]² [S_j]², which is always positive, so that the energy, with its leading minus sign, will be as small as possible [BeJa90], or:
H = −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} W_ij S_i S_j = −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} [S_i]² [S_j]²        (3.21)
From that we see that we just need to define the weight matrix as Wij = Si Sj. Again to
store several patterns, we just sum this equation over all patterns (eqs. 3.8 and 3.9).
Adding all patterns together in this way will distort the energy levels of each stored pattern because of the crosstalk term. However, as stated before, if M << N the
distortion will not be significant.

3.1.5 - Spurious States


We have shown that if the crosstalk term is small enough the patterns Si to be
stored are attractors (stable equilibrium points) and they will be local minima of the
energy function. Such attractors are sometimes called retrieval states or retrieval
memories. This situation is very likely to happen, as stated before, if the number of
patterns M to be stored is much less than the number of units N. However, these are not
the only attractors that the network has.
Firstly, the reverse of an attractor −Si is also an attractor since it also satisfies eq.
3.10 and it will have the same energy H.
Secondly, Hertz et al. [HKP91] and Amit et al. [AGS85a] show that patterns defined as a linear combination of an odd number of attractors are also attractors. They call such attractors mixture states.
Thirdly, Amit et. al. [AGS85b] show that if the number M of patterns to be
stored is relatively large (compared to N), then there are attractors that are not correlated
with any linear combination of the original patterns Si. They call such attractors spin
glass states, from the spin glass models in statistical mechanics.
The second and third type of attractors are called spurious states, spurious
minima or spurious memories. Their existence means that there is the possibility that the
network will not work perfectly as an associative memory, since it can converge to
"memories" that were not previously defined.
Some measures, however, can be taken to decrease the size of the basins of attraction of these spurious states. For instance, as Hopfield did in his original paper
[Hop82], we can force the constraint that a unit does not feedback to itself, i.e. Wii = 0,
for 1 ≤ i ≤ N [KaSo87]. It is possible to show that this modification does not affect the
stability of the patterns that we want to store (the retrieval memories) although it affects
the dynamics of the network [HKP91].
A second possible improvement proposed by Hopfield et al. [HFP83] is to try to "unlearn" some of the spurious states. To do this, the network weights are determined by
applying eq. 3.8, the network state is initialized in a random position and the network
output is updated until it achieves convergence. If the state to which the network
converged is one of the spurious memories, represented by XF, then the Hebbian rule is
applied with the sign reversed:

∆W_ij = −ε X_i^F X_j^F        (3.22)

where 0 < ε << 1. One possible interpretation is that such a procedure changes the
shape of the energy function by raising the energy level at the local minimum XF,
therefore reducing its basin of attraction. The assumption is that memories with the
deepest energy valleys tend to have the largest basins of attraction. However, too much
"unlearning" will result in perturbing and even destroying the retrieval memories that
we intended to store [HFP83].
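The unlearning procedure can be sketched as follows (Python; a simplified illustration that reuses the recall() sketch from section 3.1 and, unlike the description above, applies the reversed-sign increment to whatever state is reached without checking whether it is spurious; epsilon must be small).

    import numpy as np

    def unlearn(W, n_trials=50, epsilon=0.001, rng=np.random.default_rng(3)):
        W = W.copy()
        N = W.shape[0]
        for _ in range(n_trials):
            y0 = rng.choice([-1.0, 1.0], size=N)   # random initial state
            xf = recall(W, y0)                     # converged state (possibly spurious)
            W -= epsilon * np.outer(xf, xf)        # eq. 3.22: reversed-sign Hebbian term
        return W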

3.1.6 - Synchronous Updating


The asynchronous updating used in the Hopfield network can be seen as a simple
way to model the random propagation delays of the signals in a network of real neurons.
If synchronous updating is used (all unit outputs are updated simultaneously in
a discrete time formulation), there will be no significant changes in terms of memory
capacity or position of the equilibrium points ([HKP91],[AGS85a],[MPRV87]).
However, the network dynamics will be different, e.g. it will take many fewer iterations to converge to a fixed attractor (EP), and there is the possibility of stable limit cycles that are not present if asynchronous updating is used. Zurada [Zur92] shows
an example of this last case.
Another difference is that using synchronous updating the trajectory in the output
space is always the same for a given starting point. When using asynchronous updating
this is not the case because the units are randomly selected to be updated, as explained
before.
3.2 - Solving Optimization Problems

After proposing to use ANNs with binary or bipolar units with random
asynchronous updating as associative (or content-addressable) memories [Hop82],
Hopfield realized that he could obtain the same computational properties by using a
deterministic continuous-time version with units that have a continuous and
monotonically increasing activation function such as a squashing function [Hop84]. This
network is sometimes referred to as the gradient-type Hopfield network [Zur92] or
Hopfield network with continuous updating [HKP91].
By making such modifications he realized that he could also propose a hardware
analog implementation of the above network using electrical components such as
amplifiers, resistors and capacitances. The capacitances were introduced for each unit such that it would have an integrative time delay. Consequently, the time evolution
of the network should be represented by a nonlinear differential equation.

3.2.1 - An analog implementation


The behaviour of each unit in this analog version is closer to the behaviour of
a real neuron. Figure 3.2 illustrates such a unit. The variables neti and Yj are voltages,
biasi is a current, Wij and gi are conductances, Ci is a capacitance and the triangle
represents a voltage amplifier with a function f, i.e. Vout = f(Vin) or Yi = fi (neti). We will
assume that the voltage amplifier has an infinite input impedance such that it does not
absorb any current. Figure 3.3 illustrates the implementation of a feedback network
using this type of unit. In order to avoid the need for negative resistances, we have to
assume that the voltage amplifiers have a negative output −Yi as well or use an
additional amplifier for each unit with constant gain −1.
Adding all currents for the units, which are illustrated by arrows in fig. 3.2, the
dynamic behaviour of a unit can be described by:
i_c = C_i (d net_i / dt) = bias_i + ∑_{j=1}^{N} W_ij (Y_j − net_i) − g_i net_i        (3.23)

Let’s define the parameter Gi as Gi = gi + ∑Nj=1Wij, the external input vector as:
ext = [ext1 ... extN]T and the matrices G and C as G = diag[G1 ... GN] and
C = diag[C1 ... CN]. Then the dynamical behaviour of the whole network can be
described by the following set of differential equations:
Figure 3.2 - The analog implementation of a unit using electrical components
C (d net / dt) = −G net + W f(net) + bias        (3.24)
where net and bias are column vectors, and the function f ( ) is applied to each
component of the vector net. Note that, by definition, Y = f (net).
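For simulation purposes, eq. 3.24 can be integrated numerically. The following is a minimal sketch in Python using a simple forward-Euler scheme and tanh amplifiers; the step size, gain and other values are illustrative choices only.

    import numpy as np

    def simulate(W, bias, g, C, net0, dt=1e-3, n_steps=5000, gain=5.0):
        G = g + W.sum(axis=1)                    # G_i = g_i + sum_j W_ij
        net = net0.astype(float)
        for _ in range(n_steps):
            Y = np.tanh(gain * net)              # Y = f(net)
            dnet = (-G * net + W @ Y + bias) / C # eq. 3.24, componentwise
            net = net + dt * dnet                # forward-Euler step
        return np.tanh(gain * net)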

Figure 3.3 - The analog implementation of a continuous-time Hopfield network with 4 units
3.2.2 - An energy function


Assuming that the weight matrix W is symmetric and that the activation function
fi is a monotonically increasing function bounded by lower and upper limits for all units,
Hopfield ([Hop84],[Zur92]) proposed the following energy function in order to prove
the stability of the network:
H(t) = −(1/2) Y^T W Y − bias^T Y + ∑_{i=1}^{N} G_i ∫_0^{Y_i} f_i^{−1}(z) dz        (3.25)

Applying the chain rule we have:

dH(Y(t))/dt = ∑_{i=1}^{N} (∂H(Y(t))/∂Y_i) (dY_i/dt) = [∇_Y H(Y)]^T Ẏ(t)        (3.26)
where by definition

∇_Y H(Y) = [ ∂H/∂Y_1 (Y) .... ∂H/∂Y_N (Y) ]^T        (3.27)
Using the Leibnitz rule we have:

 
d/dY_i [ ∑_{j=1}^{N} G_j ∫_0^{Y_j} f_j^{−1}(z) dz ] = G_i f_i^{−1}(Y_i) = G_i net_i        (3.28)

From this relation and since the matrix W is symmetric, we can write that:

dH/dt = [ −W Y − bias + G net ]^T Ẏ = −[ C (d net/dt) ]^T Ẏ        (3.29)
Comparing eqs. 3.26 and 3.29, we can see that:

∇_Y H(Y) = −C (d net / dt)        (3.30)
or for each component:

∂H(Y)/∂Y_i = −C_i (d net_i / dt)        (3.31)
Since Yi = fi (neti) and fi ( ) is a monotonically increasing function, we can write that
neti = fi−1 (Yi) and
d net_i / dt = (d f_i^{−1}(Y_i) / d Y_i) (d Y_i / dt)        (3.32)
where d f_i^{−1}(Y_i) / d Y_i > 0. Finally, by substituting eqs. 3.31 and 3.32 in eq. 3.26:
dH/dt = −∑_{i=1}^{N} C_i (d f_i^{−1}(Y_i) / d Y_i) (d Y_i / dt)²        (3.33)
Therefore dH/dt ≤ 0 and dH/dt = 0 if and only if dYi/dt = 0 for all units, 1 ≤ i ≤ N.
Since the network "energy" is a bounded function, this proves that the network will
evolve until it settles to an equilibrium point, a local minimum of the energy function.
In other words, the network "searches" for a minimum of the energy function and stops
there. Note that the possibility of limit cycles is excluded, since on a limit cycle we would have dY_i/dt ≠ 0 for some unit while dH/dt = 0, which contradicts eq. 3.33.
It is also interesting to investigate the effect of the steepness of the activation
function fi. This is easily done by replacing Yi = fi(neti) by Yi = fi(λneti) and neti = fi−1(Yi)
by neti = fi−1(Yi) / λ. The energy function H(t) becomes:
H(t) = −(1/2) Y^T W Y − bias^T Y + (1/λ) ∑_{i=1}^{N} G_i ∫_0^{Y_i} f_i^{−1}(z) dz        (3.34)

As the gain λ increases the activation function fi tends to a threshold function. Suppose,
for instance, that f( ) = tanh( ). The integral in the third term on the right-hand side of
eq. 3.34 is zero for Yi = 0 and positive otherwise, becoming very large as Yi approaches
its bounds −1 or +1 since such bounds are approached very slowly. In the limit case
when λ → +∞, the contribution by the third term is negligible and the location of the
equilibrium points are given by the maxima and minima of:
H(t) = −(1/2) Y^T W Y − bias^T Y = −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} W_ij Y_i(t) Y_j(t) − ∑_{i=1}^{N} Y_i(t) bias_i        (3.35)

The same arguments are valid if fi(λ neti) = sig(λ neti) = 1/[1+exp(−λ neti)].
For large but finite λ, the third term on the right-hand side in eq. 3.34 begins to
contribute but only when Yi approaches its bounds, i.e. when the network is near to one
of the surfaces, edges or corners of the hypercube that contain the network dynamics.
When all Yi are far from their limits, the contribution of the third term is still negligible.
Consequently, for large but finite λ the maxima of the complete energy function given by eq. 3.34 remain at the corners while the minima are slightly displaced toward the interior of the hypercube [Hop84]. Therefore in this case, it can be assumed that the energy function that is being minimized is the energy function given by eq. 3.35 and that the equilibrium points will be located at the corners of the hypercube.
Note that if λ is sufficiently large, it is reasonable to assume that neti ≈ 0 and
consequently in figures 3.2 and 3.3 the current sources exti can be substituted by an
equivalent voltage source VExti in series with the appropriate resistor RExti such that
VExti/RExti = exti.
Hopfield and Tank [HoTa85] then realized that if the cost function of an optimization problem could be expressed as a quadratic function with the same form as eq. 3.35, then a network like the one illustrated in fig. 3.3, using units with large but finite gains in their activation functions, could be used to search for a minimum of the cost function. They thus proposed a solution for the optimization problem using analog hardware, and therefore radically different from implementing an algorithm in a digital computer. The weights and bias can be determined by comparing the cost function for the problem in hand with the energy function given by eq. 3.35.
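The comparison step amounts to reading off the weights and bias from the cost. A minimal sketch in Python, assuming the cost has already been written in the quadratic form E(Y) = (1/2) Y^T Q Y + b^T Y: matching this term by term with eq. 3.35 gives W = −Q and bias = −b.

    import numpy as np

    def cost_to_network(Q, b):
        Q = np.asarray(Q, dtype=float)
        Q = 0.5 * (Q + Q.T)                      # symmetrize, as the energy proof requires
        W = -Q
        bias = -np.asarray(b, dtype=float)
        return W, bias

The resulting W and bias could then be used, for instance, with the simulate() sketch given after eq. 3.24 to search for a minimum of E.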
Hopfield and Tank ([HoTa85], [TaHo86], [HoTa86]) showed, as examples, how
such a network could be used to propose solutions to: 1) analog/digital conversion
problems; 2) decomposition/decision signal problems (to determine the decomposition
of a particular signal given the knowledge of its individual components); 3) linear
programming problems [Per92]; and 4) the travelling salesman problem (TSP). Other
possible applications investigated by other researchers are: 1) job shop scheduling
optimization [Zur92]; 2) economic electric power dispatch problems [Zur92]; and 3)
graph bipartitioning (important for chip design where we want to divide a group of
interconnected components into 2 subsets with more or less the same number of
components in each set and minimizing the wire length between the two sets).
It is important to emphasize that it is possible only to prove that, given the
proper constraints, the network converges to a local minimum of the energy function.
However, in general, such a local minimum is not the global minimum. Therefore the
Hopfield approach is best suited to problems where there are several local minima that
give satisfactory solutions and it is more important to rapidly approach a "good" solution
than to take much longer and to have the best possible solution. One could argue that
these are the kind of problems that biological systems have to solve [Per92]. It is not
always easy to decide if a particular optimization problem with a particular set of
parameters will be well suited to be solved using the Hopfield approach.
3.3 - The IAC Neural Network

The Interactive Activation and Competition (IAC) Neural Network was proposed
by the psychologists McClelland and Rumelhart to model visual word recognition
[McRu81] and retrieval of general and specific information from specific information
about individual exemplars previously stored in the network [McRu88]. The network
uses as inputs noisy clues, for instance, the network can be used to recognize a word
that was partially obscured or to retrieve the specific information stored about an item
using a partial or incorrect version of its description.
The IAC network is also a feedback network that operates in discrete or
continuous time and the output of the units are real continuous numbers. The principle
of operation is the same as the Hopfield network, i.e. there is no learning phase and the
designer sets the topology and the initial state of the network. The network then evolves
to an equilibrium state (equilibrium point, EP) that represents the network's answer to the
problem.
As in the Hopfield network, the network topology is selected in order to satisfy
the specific constraints of the problem in hand. The major difference in operation
between the Hopfield network and the IAC network is the activation function used.
McClelland and Rumelhart [McRu88] define an IAC network as consisting of
a set of units organized into pools. The units in a pool compete against each other such

Figure 3.4 - Typical topology for the IAC network where dashed and solid lines
represent respectively inhibitory and excitatory connections.
Black squares represent activated units.
68

that ideally when the network settles to an EP there is only one activated unit in each
pool. Units situated in different pools can excite or be indifferent to each other but
normally they do not inhibit each other. Figure 3.4 illustrates the typical topology for
the IAC network. All connections are assumed to be bidirectional and therefore the
weight matrix W is symmetric. All units also have an external input, not shown in figure
3.4.
According to McClelland and Rumelhart’s conception, each pool represents a
specific property (or characteristic) and each unit in the pool represents one of the mutually exclusive possibilities for such a property. For example, in figure 3.4 pool 1 could
represent the gender of an individual, while pools 3 and 4 could represent his education
level, marital status or profession. Pool 2 could contain the names of the individuals.
We can use the above example where the network is used to store specific
information about a set of individuals to show three possible cases of information
retrieval by the network [McRu88].
In the first case information about an individual could then be retrieved by
activating the unit with his name in pool 2 and we want just one unit activated in each
pool after convergence.
In the second case we can initialize the network with the description of an
individual by activating the corresponding units in pools 1, 3 and 4 and, after
convergence is achieved, look for the winner unit in pool 2. To be useful, the network
should retrieve the correct individual even if the description is partial or slightly
incorrect. It is possible to have units that are partially activated in the pool for names
if there is no perfect match and several units have a close match. The amount of
activation should be related to the number of matches with the given description.
In the third case we can retrieve general information about a property by
activating the corresponding unit, for example, to retrieve the general properties of
married individuals. In this case it is also possible to have units that are partially
activated.
McClelland and Rumelhart showed, by using simulations, that the network works
well in the above three cases [McRu88]. However, in order to operate the network the
designer has to adjust some parameters but McClelland and Rumelhart did not provide
guidelines for selecting such parameters.
In this section we derive a few results that are applicable to networks of this type
with any number of units, including the proof that, given certain conditions, the IAC
network is a stable system and that it also minimizes an energy function, much like the
Hopfield network. Extensive results are then derived in the case where the network has
2 units. We analyse mathematically the dynamics of an IAC network with 2 units. More
specifically we are interested in how the parameters of the model affect the number,
type, location and zone of attraction (or basin of attraction) of the equilibrium points.
In most cases stability around the EPs is proved using Lyapunov functions.

3.3.1 - The Mathematical Model


McClelland and Rumelhart used the standard form for the combining, activation
and output function (these terms are defined in section 2.3.2) to define the mathematical
model of the IAC network. Assuming that the IAC network is operating in discrete-time
and in synchronous mode, we have:
net_i(k) = ∑_{j=1}^{N} W_ij Y_j(k) + ext_i(k)        (3.36)

a_i(k+1) = f[ a_i(k) , net_i(k) ]        (3.37)

Y_i(k+1) = g[ a_i(k+1) ]        (3.38)

where the variables ai(k) and Yi(k) represent respectively the activation and output values
for unit i at iteration k, N represents the number of units in the network, 1 ≤ i ≤ N,
f[ , ] and g[ ] are respectively the activation and output functions and the weight
matrix W is assumed to be symmetric.
McClelland and Rumelhart wanted a model with the following properties:
1) the activation values must be kept between two limits given by the
parameters max and min, where min ≤ 0 < max;
2) when the network is initialized, all the activation values are at the rest
value given by the parameter rest, where min ≤ rest ≤ 0;
3) when the net input for a particular unit is positive, its activation value
must be driven towards the upper limit max;
4) when the net input for a particular unit is negative, its activation value
must be driven towards the lower limit min;
5) when the net input for a particular unit is zero, its activation value
must be driven towards the rest value given by the parameter rest with an adjustable
speed that is given by the parameter decay ≥ 0.
To satisfy the above requirements, Rumelhart and McClelland proposed the following functions f( , ) and g( ): if net_i(k) ≥ 0,

∆a_i(k) = [max − a_i(k)] net_i(k) − decay [a_i(k) − rest]        (3.39)

otherwise

∆a_i(k) = [a_i(k) − min] net_i(k) − decay [a_i(k) − rest]        (3.40)

where ∆a_i(k) = a_i(k+1) − a_i(k), and

Y_i(k) = a_i(k) if a_i(k) ≥ 0,  0 otherwise        (3.41)
Typical parameters used in simulations by McClelland and Rumelhart [McRu88] are:
max = 1, min = −0.2, rest = −0.1, decay = 0.1, exti = 0 or 0.4; and Wij = −0.1, 0 or 0.1.
However, such parameters were found through trial and error and not from mathematical
analysis.
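For reference, one synchronous discrete-time update of the whole network (eqs. 3.36 to 3.41) can be sketched as below (Python, using the typical parameter values quoted above; the trivial usage loop is only an illustration).

    import numpy as np

    def iac_step(a, W, ext, max_=1.0, min_=-0.2, rest=-0.1, decay=0.1):
        Y = np.where(a >= 0.0, a, 0.0)                           # eq. 3.41
        net = W @ Y + ext                                        # eq. 3.36
        da = np.where(net >= 0.0,
                      (max_ - a) * net - decay * (a - rest),     # eq. 3.39
                      (a - min_) * net - decay * (a - rest))     # eq. 3.40
        return a + da                                            # eqs. 3.37-3.38

    # Usage: initialize all activations at the rest value and iterate.
    N = 4
    a = np.full(N, -0.1)
    W = np.zeros((N, N))
    ext = np.zeros(N)
    for _ in range(200):
        a = iac_step(a, W, ext)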

3.3.2 - Initial Considerations


Without loss of generality, we can consider that each one of the units is
connected to at least one of the other units (for each unit i, Wij ≠ 0 for at least one j),
since we are not interested in the case where a unit is completely isolated from the other
units.
If we assume that min < 0 < max and |min| = max, eqs. 3.39 and 3.40 can be combined into just one equation as:

∆a_i(k) = −|net_i(k)| a_i(k) + net_i(k) max − decay [a_i(k) − rest]        (3.42)

If the network is operating in continuous time, the above equation is simply replaced by:

da_i / dt = −|net_i| a_i + net_i max − decay (a_i − rest)        (3.43)
The equilibrium points of the system aie can be found by solving eq. 3.42 for ∆ai(k) = 0
or eq. 3.43 for dai /dt = 0. So:
a_i^e = ( max net_i^e + decay rest ) / ( |net_i^e| + decay )        (3.44)

where netie represents the value of the net input for unit i when the network reaches an
EP. Since netie is in general unknown, eq. 3.44 does not help to find the position of the
EP in the general case. But we can still use it to state that if decay = 0:
1) when netie ≠ 0 the EP will be characterized by aie = max if netie > 0


or aie = −max if netie < 0;
2) when netie = 0 eq. 3.44 cannot be used to find the EP but the points
where neti = 0 for all units are also equilibrium points since ∆ai (or dai /dt) = 0 for all
i. One point where this is possible, but not the only one, is to have exti = 0 for all units
and consequently the point aie = 0 for all i is also an EP.
Moreover, for rest = 0 and small values of decay, as long as decay << |net_i^e|, the EP will still be located near max or −max and the condition net_i = 0 is not enough to cause an EP.
Observe that if W_ij = 0 for all j, i.e. the unit i is completely isolated from the other units, then net_i = ext_i and the condition for stability is that −decay < |ext_i| < 2 − decay, which can also be written as −|ext_i| < decay < 2 − |ext_i|. Therefore such a unit can form a stable one-dimensional system even if decay < 0. The position of the EPs is given by eq. 3.44 replacing net_i^e by ext_i.

3.3.3 - Minimizing an Energy Function


In this section we show that under certain constraints the continuous time version
of the IAC network, like the Hopfield network, also minimizes a bounded energy
function. Therefore, we can prove that the network is stable and can be used to solve
the same kind of minimization problems for which the Hopfield network has been used.
First, let’s assume that decay = 0 and that the network is within or at the border
of the hypercube [−max max]N where N is the number of units in the network, i.e.
−max ≤ ai ≤ max for all i. We can define the following quadratic function as the energy
function:

H(t) = −(1/2) Y^T W Y − ext^T Y        (3.45)
As in the case for the Hopfield network we can write that:

dH(Y(t))/dt = ∑_{i=1}^{N} (∂H(Y(t))/∂Y_i) (dY_i/dt) = [∇_Y H(Y)]^T Ẏ(t)        (3.46)
Since the matrix W is symmetric:
dH/dt = [ −W Y − ext ]^T Ẏ = −net^T Ẏ = −∑_{i=1}^{N} net_i (dY_i/dt)        (3.47)
But Yi = g(ai), so we have:
dH/dt = −∑_{i=1}^{N} net_i (dg(a_i)/da_i) (da_i/dt)        (3.48)
Using eq. 3.43, finally:

dH/dt = −∑_{i=1}^{N} (dg(a_i)/da_i) net_i² (max − a_i)    if net_i ≥ 0
dH/dt = −∑_{i=1}^{N} (dg(a_i)/da_i) net_i² (max + a_i)    if net_i < 0        (3.49)

Therefore dH/dt ≤ 0, for decay = 0, −max ≤ ai ≤ max and dg(ai)/dai ≥ 0 for all i (g( )
is a monotonically increasing function). From the above we can also state that dH/dt = 0
if and only if dYi /dt = dai /dt = 0 for all i, i.e. the network has reached an EP. Note that
neti = 0 for all i implies not only dH/dt = 0 but also dai /dt = 0 for all i (see eq. 3.43).
Now we need to deal with the case when the network is initialized outside the hypercube [−max max]^N, i.e. |a_i| > max for at least one i. If a_i ≥ 0, eq. 3.43 can be written as:


da_i/dt = net_i (max − a_i) − decay (a_i − rest)    if net_i ≥ 0
da_i/dt = net_i (max + a_i) − decay (a_i − rest)    if net_i < 0        (3.50)


On the other hand, if ai < 0, eq. 3.43 can be written as:


da_i/dt = net_i (max − a_i) − decay (a_i − rest)    if net_i ≥ 0
da_i/dt = net_i (max + a_i) − decay (a_i − rest)    if net_i < 0        (3.51)

Equations 3.50 and 3.51 show respectively that, given that decay > 0 and rest < max:
1) if ai > max, then dai /dt < 0; and 2) if ai < −max, then dai /dt > 0. In other words,
considering the activation space, if the network is outside the hypercube [−max max]N
and decay > 0, the changes in the activation values are such that, given enough time,
the network will reach the borders of the hypercube and we will end up with |a_i| ≤ max. Note that even in the case when decay = 0, the changes in the activation will still drive the network to the borders of the hypercube [−max max]^N, with the only exception that the network can be trapped in the condition where net_i = 0 (section 3.3.6
shows an example of this case). Once inside or at the borders of the hypercube, the
network then seeks the minima of the energy function given by eq. 3.45, given that,
among other conditions, decay = 0.


One way to ensure that the energy function given by eq. 3.45 is minimized would be to have decay > 0 whenever |a_i| > max for at least one i, and to set decay to 0 when |a_i| ≤ max for all i. A less complicated way would be to set decay to some small positive value and rest to 0, without having to consider whether the network is inside the hypercube or not. From eq. 3.44 we can see that this will cause only a small perturbation in the position of the EPs that are at the locations where a_i^e = −max or max, assuming that for such EPs the condition decay << |net_i^e| is satisfied. If the EPs that are the solution for the problem satisfy such a condition (in general such information is not available a priori) then we could still consider that the energy function given by eq. 3.45 is being minimized. However, the location and number of the other EPs (the EPs that are not at the corners of the hypercube [−max max]^N) can change significantly.
A possible interpretation of the reason that decay > 0 brings the network to the borders of the hypercube is that it stops the points where net_i = 0 from being EPs, and from eq. 3.44 we can see that it also forces |a_i^e| < max. However, some of the points that were EPs for decay = 0 can suffer a large perturbation if the condition decay << |net_i^e| is not satisfied.
Note that, like the Hopfield network, the IAC network suffers from the possibility
of being trapped in local minima (instead of converging to the global minima of the
energy function).
A simple modification that makes it easier to analyse the network dynamic
behaviour is to have dg(ai) /dai > 0 for all i, instead of dg(ai) /dai ≥ 0 (see eq. 3.41), for
instance, using the identity function as the output rule: Y_i = a_i for all i. Such a modification will be used in the next section.

3.3.4 - Considering two units


Consider an IAC network with two units with min < 0 < max, |min| = max, decay ≥ 0, rest = 0, and the output function being the identity function, Y_i = a_i for all i. As usual we will assume that the units do not feed back to themselves, i.e. W_ii = 0, and use W_12 = W_21 = c, where c is the factor of cooperation (c > 0) or competition (c < 0).
Figure 3.5 illustrates such a network.

Figure 3.5 - The IAC Neural Network with 2 units

From eq. 3.43 we can write:

da_1/dt = −|ext_1 + c a_2| a_1 + (ext_1 + c a_2) max − decay a_1        (3.52)

da_2/dt = −|ext_2 + c a_1| a_2 + (ext_2 + c a_1) max − decay a_2        (3.53)
We can now consider three main cases ([Nas90], [NaZa92]):
1) external inputs = 0, decay ≥ 0;
2) external inputs ≠ 0, decay = 0;
3) external inputs ≠ 0, decay > 0.

3.3.5 - Case Positive Decay and No External Inputs


Solving eqs. 3.52 and 3.53 for dai/dt = 0 we have that the EP [a1e a2e]/max is
given by:
a_i^e / max = (c/|c|) (a_j^e / max) / ( |a_j^e|/max + decay/(|c| max) )        (3.54)
where (i,j) = (1,2) or (2,1). Solving the above pair of equations by direct substitution for
0 ≤ dec ≤ 1, the normalized EPs [a1’ a2’] are:
for c > 0: [0 0], [δ δ], [−δ −δ]
for c < 0: [0 0], [δ −δ], [−δ δ]
where ai’ = ai/max, i = 1 or 2, δ = 1 − dec, and dec = decay/( c max) = normalized
decay. Using linearization around the EPs, it is possible to show that the origin is an EP
type saddle while the other 2 EPs are type stable node. If dec ≥ 1, all 3 EPs collapse
into the origin, which becomes a stable node.

Figure 3.6 - ext1 = ext2 = 0, decay/(c max) = 0.15, c > 0

Figure 3.6 shows the phase-plane for
ext1 = ext2 = 0, c > 0 and dec = 0.15 and some trajectories for different initial activation
values. As expected the EPs are at positions [0 0], [0.85 0.85] and [-0.85 -0.85].
We can study the stability and zones of convergence of the stable EPs by
defining a Lyapunov function. For instance, assuming c > 0 and 0 ≤ dec ≤ 1 (0 ≤ δ ≤ 1)
for the EP [δ δ] we can define the following Lyapunov function:
V(a_1, a_2) = \frac{x_1^2 + x_2^2}{2\,c\,max} \qquad (3.55)

where xi = ai' − δ, i = 1,2. Therefore:

\frac{dV}{dt} = \frac{x_1}{c\,max}\frac{dx_1}{dt} + \frac{x_2}{c\,max}\frac{dx_2}{dt} \qquad (3.56)
Assuming that ai’ > 0, we have that xi + δ > 0, i = 1,2. From eqs. 3.52 and 3.53 for
(i,j) = (1,2) and (2,1):

\frac{1}{c\,max}\frac{dx_i}{dt} = (x_j + \delta)\left[1 - (x_i + \delta)\right] - (1 - \delta)(x_i + \delta) \qquad (3.57)

\frac{dV}{dt} = \sum_{(i,j)=(1,2),(2,1)} x_i \left\{ (x_j + \delta)\left[1 - (x_i + \delta)\right] - (1 - \delta)(x_i + \delta) \right\} \qquad (3.58)

\frac{dV}{dt} = -\left[ x_1^2 (x_2 + \delta) + x_2^2 (x_1 + \delta) \right] - (1 - \delta)(x_1 - x_2)^2 \qquad (3.59)
and therefore dV/dt ≤ 0 for δ ≤ 1, i.e. for dec ≥ 0 and dV/dt = 0 implies that x1 = x2 = 0
(the possibility x1 = x2 = −δ is excluded since we assumed that xi + δ > 0). This proves

that all trajectories in the 1st quadrant will converge to the EP [δ δ] if c > 0 and
0 ≤ decay/(c max) ≤ 1.
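
The sign of dV/dt can also be checked numerically. The sketch below (the sampling of the first quadrant and the seed are arbitrary choices) evaluates eq. 3.59 at random points with a1', a2' > 0 and compares it with the direct computation of dV/dt from the normalized dynamics:

import numpy as np

rng = np.random.default_rng(1)
dec = 0.15
delta = 1.0 - dec
# random points in the 1st quadrant of the normalised phase plane (0 < a' <= 1)
a1 = rng.uniform(1e-3, 1.0, 2000)
a2 = rng.uniform(1e-3, 1.0, 2000)
x1, x2 = a1 - delta, a2 - delta
# dV/dt as given by eq. 3.59 (up to the positive factor 1/(c*max))
dVdt = -(x1**2 * (x2 + delta) + x2**2 * (x1 + delta)) - (1 - delta) * (x1 - x2)**2
# the same quantity computed directly from the normalised eqs. 3.52-3.53
dVdt_direct = (x1 * ((x2 + delta) * (1 - a1) - dec * a1)
               + x2 * ((x1 + delta) * (1 - a2) - dec * a2))
print(np.allclose(dVdt, dVdt_direct), dVdt.max())   # True, and the maximum is <= 0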
We can also easily see from eqs. 3.52 and 3.53 that for points in the 2nd and 4th
quadrants situated above the line a2’= -a1’, the property da2’/da1’ > -1 is valid. Therefore
their respective trajectories will enter the 1st quadrant and converge to the EP [δ δ] as
fig. 3.6 illustrates. The same procedure can be used to prove the stability of the EP
[-δ -δ] and the EPs for the case c < 0.
From the above we can conclude that the separatrix (the curve that divides the
zones of convergence of the 2 stable EPs) for c > 0 is the line a2’ = -a1’ and for c < 0
the line a2’ = a1’. Observe that, in the absence of any external disturbances, if the
network is initialized exactly on the separatrix, the activation values will converge to the
unstable EP situated at the origin (see fig. 3.6).

3.3.6 - Case of Non-Zero External Inputs With No Decay


Let’s assume for simplicity that c > 0 (the case c < 0 is completely analogous)
and define the normalized external inputs ext1’ and ext2’, where:
exti’ = exti /(c max), i = 1 or 2.
From eqs. 3.52 and 3.53, we can see that, if decay = 0, then dai’/dt = 0 in the following
cases:
a) when aj’ = -exti’, the main switching line;
b) if aj’ > -exti’, when ai’ = 1;
c) if aj’ < -exti’, when ai’ = −1.
where (i,j) = (1,2) or (2,1). Therefore the increase of ext1’ shifts its associated main
switching line a2’ = -ext1’ downwards. Analogously, the increase of ext2’ shifts its
associated main switching line a1’ = -ext2’ sideways to the left.
The EPs are the points that are common to the above switching lines. The
positioning of these switching lines gives rise to 3 main sub-cases that correspond to
different regions in figure 3.7:
a) if |ext1'| < 1 and |ext2'| < 1;
b) if |exti'| > 1 and |extj'| ≠ 1, (i,j) = (1,2) or (2,1);
c) if |ext1'| = 1 and/or |ext2'| = 1.
Now let’s consider each one of these cases and their sub-cases:

Figure 3.7 - Location of the stable E.P.


a) if |ext1'| < 1 and |ext2'| < 1;
⇒ region A in fig. 3.7,
1 EP at [−ext2’ −ext1’], type saddle,
2 EPs at [a1e’ a2e’] = {[1 1],[−1 −1]}, type stable node.
The phase-plane and the trajectories in this case are similar to those in fig. 3.6 with the
difference that the position of the unstable EP is not necessarily located at the origin.
b) if |exti'| > 1 and |extj'| ≠ 1, (i,j) = (1,2) or (2,1);
⇒ regions B, C, D, E in fig. 3.7, not including dashed lines in the middle
of regions B and C,
1 unstable EP at [a1e’ a2e’] = [−ext2’ −ext1’],
1 EP type stable node, whose location is given by fig. 3.7 according to:

Region B: [1 1], Region C: [−1 −1], Region D: [−1 1], Region E: [1 −1].

Figure 3.8 - ext1' = ext2' = 1.5, decay = 0
Figure 3.8 shows the phase-plane when ext1’ = ext2’ = 1.5 and some trajectories for
different initial activation values. The EPs are at [1 1] and [-1.5 −1.5].
c) if |ext1'| = 1 and/or |ext2'| = 1;
c.1) if |exti'| = 1 and |extj'| ≠ 1, (i,j) = (1,2) or (2,1);
c.1.1) {if exti’ = 1 and extj’ > −1 and extj’ ≠ 1} OR
{if exti’ = −1 and extj’ < 1 and extj’ ≠ −1};
⇒ dashed lines in fig. 3.7, not including the circles nor the black
squares,
1 EP type stable node whose location is given by regions B or C in
fig. 3.7 (the nearest region),
A semi-line of non-isolated EPs.
Figure 3.9 shows the phase-plane for ext1’ = 0, ext2’ = −1 and some trajectories for
different initial conditions. The EPs are [-1 -1] and the semi-line a1’ ≥ 1.

Figure 3.9 - ext1' = 0, ext2' = −1, decay = 0


c.1.2) {if exti’ = 1 and extj’ < −1} OR
{if exti’ = −1 and extj’ > 1};
⇒ border of regions D and E in fig. 3.7 (solid lines) not including the
circles,
No stable isolated EPs,
A semi-line of non-isolated EPs.
Figure 3.10 shows the phase-plane for ext1’ = 1, ext2’ = −1.5 and some trajectories.
Figure 3.10 - ext1' = 1, ext2' = −1.5, decay = 0


c.2) if |ext1'| = |ext2'| = 1;
c.2.1) if (ext1’, ext2’) = (1,1) or (−1,−1);
⇒ black squares in fig. 3.7,
1 EP type stable node whose location is given by regions B or C in
fig. 3.7 (the nearest region),
2 orthogonal semi-lines of non-isolated EPs.
Figure 3.11 shows the phase-plane for ext1’ = ext2’ = 1 and some trajectories.

Figure 3.11 - ext1' = ext2' = 1, decay = 0


c.2.2) if (ext1’, ext2’) = (−1,1) or (1,−1);
⇒ circles in fig. 3.7,
no stable isolated EP,
2 orthogonal semi-lines of non-isolated EPs.
Figure 3.12 - ext1' = 1, ext2' = −1, decay = 0


Figure 3.12 shows the phase-plane for ext1’ = 1, ext2’ = − 1 and some trajectories.
We can determine the zones of attraction of the stable EPs by calculating
analytically the equation of the trajectory that converges to the unstable EP. This
equation can be obtained by combining eqs. 3.52 and 3.53 using decay = 0 and solving
the ordinary differential equation da2’/da1’ = F(a1’, a2’, ext1’, ext2’). In this case, it is easy
to find the equation of the trajectory since the variables a1’ and a2’ are separable. If we
define:
S1 = sign(ext2’+a1’), S2 = sign(ext1’+a2’)
(sign(0) = 0) and assuming that S1 ≠ 0 and S2 ≠ 0, the equation of the trajectory is:

\frac{a_2' - a_2'(0)}{S_1} + \left(\frac{1}{S_1} + ext_1'\right) \ln\!\left[\frac{1 - S_1\,a_2'}{1 - S_1\,a_2'(0)}\right] = \frac{a_1' - a_1'(0)}{S_2} + \left(\frac{1}{S_2} + ext_2'\right) \ln\!\left[\frac{1 - S_2\,a_1'}{1 - S_2\,a_1'(0)}\right] \qquad (3.60)

The curves that separate the zones of convergence are asymptotes that can be
calculated by solving the above nonlinear equation, i.e. find a2’(0) given a1’(0), a1’=a1’*,
a2’=a2’*, ext1’, ext2’ where a’* is the unstable EP. The asymptotes that leave the unstable
EP and converge to the stable EP can also be calculated by the same method with a’*
as the stable EP. Figure 3.13 shows the asymptotes for some cases when ext1’ < 1
and ext2’ < 1 where eq. 3.60 was solved numerically.
As before, we can prove the stability of the EPs by defining a Lyapunov
function.

Figure 3.13 - Asymptotes for ext1' ≠ 0 and/or ext2' ≠ 0, with decay = 0

For instance, fig. 3.7 shows that if we assume that c > 0, ext1' > −1 and
ext2' > -1 (region B), then the point [a1' a2'] = [1 1] is an EP type stable node. To prove
its stability, we can define the same Lyapunov function defined in eq. 3.55, where xi is
now defined as xi = ai’ − 1, i = 1,2. Assuming that a1’ > -ext2’ and a2’ > -ext1’ (our
region of interest), we can easily show that:

\frac{dV}{dt} = -x_1^2\,(ext_1' + x_2 + 1) - x_2^2\,(ext_2' + x_1 + 1) \;\leq\; 0 \qquad (3.61)
Since V > 0 and dV/dt < 0 in our region of interest except at the origin x1 = x2 = 0
where V = dV/dt = 0, this proves the asymptotic stability of the EP [1 1].
The same procedure can be used to prove the stability of the EP [−1 −1] and the
EPs for the case c < 0.

3.3.7 - Case of Non-Zero External Inputs and Positive Decay


Again, without loss of generality let’s assume that c > 0. From eqs. 3.52 and
3.53 we have that dai’/dt = 0 when:

a_i' = \frac{ext_i' + a_j'}{|ext_i' + a_j'| + dec} \qquad (3.62)
where (i,j) = (1,2) or (2,1). Since the EPs are the points where da1’/dt = 0 and
da2’/dt = 0, they can be calculated by combining eq. 3.62 for (i,j) = (1,2) with eq. 3.62
for (i,j) = (2,1). This means that we need to find the real-valued roots of the following
quadratic polynomial:

P(a_1') = \left(S_1 S_2\,ext_1' + S_1 + S_2\,dec\right) a_1'^{\,2}
\;+\; \left(S_1 S_2\,ext_1'\,ext_2' + S_1\,dec\,ext_1' + S_2\,dec\,ext_2' + dec^2 + S_1\,ext_2' - S_2\,ext_1' - 1\right) a_1'
\;-\; \left(S_2\,ext_1'\,ext_2' + dec\,ext_1' + ext_2'\right) \;=\; 0 \qquad (3.63)
where:

S_i = \begin{cases} +1 & \text{if } ext_i' + a_j' \ge 0 \\ -1 & \text{otherwise} \end{cases} \qquad (3.64)
However we don’t know a priori the values of S1 and S2. Therefore we apply the
following algorithm:
Step 1) Assume that S1 = 1 and S2 = 1.
Step 2) Find the roots of P(a1’) and reject the complex roots.
Step 3) Check for each real-valued root if ext2’ + root ≥ 0.
If YES, accept this root, otherwise reject it.
Step 4) For each accepted root, use eq. 3.62 to calculate the corresponding value
for a2’.
Step 5) Check to see if ext1’ + a2’ ≥ 0.
If YES, accept this value, otherwise reject it.
Step 6) Assume other combinations for (S1, S2), calculate the possible values of
a1’ and a2’, and check if the assumptions for S1 and S2 are satisfied.
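
A possible implementation of this algorithm is sketched below in Python. The quadratic coefficients used here are obtained by substituting one relation of eq. 3.62 into the other, and every candidate root is verified directly against the normalized dynamics, so the sketch does not rely on a particular written form of eq. 3.63:

import numpy as np
from itertools import product

def iac_equilibria(ext1, ext2, dec):
    # EPs of the normalised 2-unit IAC network (c > 0) with decay > 0,
    # following the sign-assumption algorithm of this section
    eps = 1e-9
    eqs = []
    for S1, S2 in product((+1, -1), repeat=2):
        # quadratic in a1' obtained from eq. 3.62 under the assumed signs
        p2 = S1 * S2 * ext1 + S1 + S2 * dec
        p1 = (S1 * S2 * ext1 * ext2 + S1 * dec * ext1 + S2 * dec * ext2
              + dec ** 2 + S1 * ext2 - S2 * ext1 - 1.0)
        p0 = -(S2 * ext1 * ext2 + dec * ext1 + ext2)
        for a1 in np.roots([p2, p1, p0]):
            if abs(a1.imag) > eps:                       # reject complex roots
                continue
            a1 = float(a1.real)
            a2 = (ext2 + a1) / (S2 * (ext2 + a1) + dec)  # eq. 3.62
            if np.sign(ext2 + a1) != S2 or np.sign(ext1 + a2) != S1:
                continue                                 # sign assumptions violated
            # safety check: the candidate must satisfy da1'/dt = da2'/dt = 0
            da1 = (ext1 + a2) - abs(ext1 + a2) * a1 - dec * a1
            da2 = (ext2 + a1) - abs(ext2 + a1) * a2 - dec * a2
            if abs(da1) < 1e-6 and abs(da2) < 1e-6:
                eqs.append((round(a1, 4), round(a2, 4)))
    return sorted(set(eqs))

# the case of figure 3.15: expected EPs near (0.894, 0.894) and (-0.613, -0.613)
print(iac_equilibria(0.3754, 0.3754, 0.15))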
Figure 3.14 illustrates how the values of ext1’ and ext2’ affect the curves da1’/dt
and da2'/dt (dec = 0.15). We can see that an increasing ext1' shifts the curve da1'/dt to
the left and an increasing ext2' shifts the curve da2'/dt downwards.

Figure 3.14 - Curves da1'/dt = 0 and da2'/dt = 0 for several external inputs and dec = 0.15

Figure 3.14 shows
that there are only three possible cases for the EPs, since the curves da1’/dt = 0,
da2’/dt = 0 cross each other 1, 2 or 3 times:
1) If the curves da1’/dt = 0, da2’/dt = 0 cross each other three times, then we
have 2 EPs type stable node and 1 EP type saddle. This is the case if |ext1'|, |ext2'| and
dec are not large.
2) If the curves da1'/dt = 0, da2'/dt = 0 cross each other only once, then we have only
1 EP type stable node. This is the case if |ext1'| or |ext2'| or dec is large.
3) The curves da1’/dt = 0, da2’/dt = 0 cross each other once (the EP type stable
node) and touch each other at another point (the point that is on the separatrix). In this
case all trajectories with initial conditions on one side of the separatrix will converge
to the EP that is on the separatrix. All trajectories with initial conditions on the other
side of the separatrix will converge to the stable EP. Figure 3.15 illustrates this case,
when ext1’ = ext2’ = 0.3754 and dec = 0.15. The EPs are a1’e = a2’e = 0.894 and
a1’e = a2’e = −0.613.
Some important points are:
a) There will always be 1 or 2 stable EPs and not more than 1 unstable EP since curves
da1’/dt = 0, da2’/dt = 0 cross each other 1, 2 or 3 times;
b) All EPs will be such that |a1e'| < 1 and |a2e'| < 1 (see eq. 3.62);
c) if dec ≥ 1, then there is only 1 EP and it is an EP type stable node. This can be
verified through visual inspection of figure 3.16 which shows the curves da1'/dt = 0,
da2'/dt = 0 for dec = 1.

Figure 3.15 - The case where one of the E.P.s is over the separatrix

Figure 3.16 - Curves da1'/dt = 0 and da2'/dt = 0 for several external inputs and dec = 1

The theoretical way to prove that there is only one EP in this
case would be to show that if dec ≥ 1, then for any real values for ext1’ and ext2’:
1) the quadratic polynomial expressed in eq. 3.63 will have only one real-valued
root which is a1e’ and,
2) the application of the algorithm proposed in this section will also give a valid
value for a2e’.

3.3.8 - A Two-Bit Analog-Digital Converter using the IAC network


As in the case of the Hopfield network, the IAC network can be used as an
analog-digital converter since this task can be posed as an optimization problem. Using
the IAC network, the solution can be obtained following the same procedure proposed
by Tank and Hopfield when using Hopfield networks ([TaHo86], [Zur92]).
In this example we will use an IAC network with 2 units and therefore the A/D
converter has a 2-bit resolution. However the same principle can be used to increase the
number of units and the resolution of the A/D converter. The parameters in this case are
max = 1, decay = 0, Wii = 0, i = 1,2 and Wij = c, (i,j) = (1,2) and (2,1), i.e. W is
symmetrical with zero diagonal entries. We will assume that the network is always
initialized within the hypercube [-max max]2. As the output function we can use
Yi(ai) = ai and therefore the network output will be bipolar (-1 or 1) instead of binary
(0 or 1). If desired we could easily force the network to have binary outputs by defining
Yi(ai) = (ai+1)/2.

Denoting by x the input analog value, the desired input-final output mapping that
the network should produce is: (x → a2, a1) = (0 → −1,−1), (1 → −1,1), (2 → 1,−1) and
(3 → 1,1). The corresponding decimal value d for a network output is given by:
d = 1.5 + a2 + a1/2. The network should minimize the square of the conversion error
E1(t) where:

E_1(t) = (x - d)^2 = x^2 - 2x\left(1.5 + a_2 + \frac{a_1}{2}\right) + \left(1.5 + a_2 + \frac{a_1}{2}\right)^2 \qquad (3.65)

To determine the weights and external inputs of the network we compare the
function E1(t) that we want to minimize with the "energy" function H(t) given by eq.
3.50. In this case the function H(t) is given by:

H(t) = -c\,a_1 a_2 - ext_1\,a_1 - ext_2\,a_2 \qquad (3.66)

Since H(t) does not contain terms [ai]² we need to modify E1(t) in order to eliminate
such terms but in such a way that the resultant function is still non-negative and has the
correct local minima. This also has to be done when using the Hopfield network and we
just need to adapt the procedure adopted there for this case [TaHo86]. The solution is
to define the function to be minimized E(t) as E(t) = E1(t) + E2(t) and:
E_2(t) = -\sum_{i=1}^{2} \frac{2^{\,2(i-1)}}{4}\,(a_i - 1)(a_i + 1) \qquad (3.67)

Since we assume that the network was initialized within the hypercube [-1 1]2 and it will
remain within or at the borders of the hypercube, the function E2(t) is always positive
except at the corners of the hypercube where it is zero. The coefficients of E2(t) can
have any negative values but in this case they were chosen in order to cancel the terms
[ai]² in E1(t). Therefore:

E(t) = x^2 - 3x + \frac{14}{4} + a_1 a_2 - (x - 1.5)\,a_1 - (2x - 3)\,a_2 \qquad (3.68)
4
Finally comparing H(t) with E(t) and ignoring the term x² - 3x + 14/4 since it is a
constant we have that: c = −1, ext1 = x - 1.5, ext2 = 2x - 3.
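
This identification can be checked symbolically; the short sketch below uses the sympy package (an assumption of the sketch, not part of the derivation) to expand E(t) and expose the coefficients of a1 a2, a1 and a2:

import sympy as sp

x, a1, a2 = sp.symbols('x a1 a2')
d = sp.Rational(3, 2) + a2 + a1 / 2              # decimal value decoded from the outputs
E1 = (x - d) ** 2                                # eq. 3.65
E2 = -sp.Rational(1, 4) * (a1 - 1) * (a1 + 1) - (a2 - 1) * (a2 + 1)   # eq. 3.67
E = sp.expand(E1 + E2)                           # eq. 3.68: no a1**2 or a2**2 terms remain
print(sp.Poly(E, a1, a2).as_dict())
# coefficient of a1*a2 is +1        -> c    = -1
# coefficient of a1    is -(x-3/2)  -> ext1 = x - 1.5
# coefficient of a2    is -(2x-3)   -> ext2 = 2x - 3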
In section 3.3.6 we analyzed such a network but for c > 0. In order
to use those results here we just need to rotate our coordinate system ±90 degrees so
that the line a2 = -a1 (position of the EP for ext1 = ext2 = 0) becomes a2 = a1. If we
rotate -90 degrees then we have that a2^NEW = a1^OLD and a1^NEW = -a2^OLD. The desired input-

final output mapping is then: (x → a2^NEW, a1^NEW) = (0 → −1,1), (1 → 1,1), (2 → −1,−1)


and (3 → 1,−1). Dropping the superscript "NEW", the corresponding decimal value d
for a network output is now given by: d = 1.5 − a1 + a2 /2. The function E(t) is now:
E(t) = (x - d)^2 - \sum_{i=1}^{2} \left(\frac{1}{2^{\,i-1}}\right)^{2} (a_i - 1)(a_i + 1) \qquad (3.69)

Again comparing H(t) with this new definition for E(t), ignoring the constant term that
is function only of x, we have that: c = 1, ext1 = 3 − 2x, ext2 = x - 1.5.
From section 3.3.6 we know that such network will produce the stable EP in the
desired locations. Note that: a) ext1 and ext2 can be seen as lines parametrized in x and
therefore we can write that ext2 = -ext1/2; and b) if 1 ≤ x ≤ 2, then |ext1| ≤ 1 and
|ext2| ≤ 1. Referring to fig. 3.7 we will have an EP type saddle when in region A
(|ext1| < 1) or semi-lines of EPs when over the dashed lines that are the border between
regions A-B (ext1 = 1) and A-C (ext1 = -1). The semi-lines of EPs can be eliminated
without moving the position of the stable EP significantly by using a very small decay
such as 0.01.
The existence of the saddle point results in the problem that the stable EP to
which the network converges is determined by the point at which the network was
initialized. For instance, for x = 1.5, the line a2 = -a1 divides the two zones of
convergence (also called zones of attraction) for the two stable EPs at (1,1) or (−1,−1).
Lee and Sheu ([LeSh91],[LeSh92],[YuNe93]), when using a Hopfield network as an A/D
converter, showed how to modify the Hopfield network in order to eliminate such saddle
points. Consequently the EP to which the network converges does not depend on where the
network is initialized and therefore there is only one possible network response. Perhaps
an equivalent modification could be proposed for the IAC network.
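
For completeness, a minimal simulation sketch of the resulting converter is given below. The weights, external inputs, decoding rule and the small decay are those discussed above; the initial activations, the integration step and the number of steps are assumptions of the sketch:

import numpy as np

def iac_ad_converter(x, steps=4000, dt=0.01, decay=0.01):
    # 2-bit A/D converter of section 3.3.8 (rotated formulation, c = 1, max = 1)
    c, mx = 1.0, 1.0
    ext = np.array([3.0 - 2.0 * x, x - 1.5])      # ext1, ext2 as derived above
    W = np.array([[0.0, c], [c, 0.0]])            # symmetric weights, zero diagonal
    a = np.array([0.1, -0.1])                     # initial activations inside the hypercube
    for _ in range(steps):
        net = ext + W @ a
        da = net * mx - np.abs(net) * a - decay * a   # eqs. 3.52-3.53
        a = np.clip(a + dt * da, -mx, mx)
    d = 1.5 - a[0] + a[1] / 2.0                   # decoded decimal value
    return a, d

for x in [0.0, 1.0, 2.0, 3.0]:
    a, d = iac_ad_converter(x)
    print(f"x = {x}: a1 = {a[0]:+.2f}, a2 = {a[1]:+.2f}, decoded value = {d:.2f}")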

3.4 - Conclusions

In this chapter we demonstrated how feedback networks can be used as


associative memories or to solve minimization problems. The Hopfield and IAC neural
networks were presented and analyzed.
The main contribution of this chapter is to show that the IAC network can also
be used to solve minimization problems, and as such it is an alternative to Hopfield
networks. As an example we showed how to implement a 2-bit analog-digital converter.

Chapter 4 - Faster Learning by


Constraining Decision Surfaces

In chapter 2 we pointed out that one of the main problems with the current
feedforward ANN models is that they take too long to be trained by the training
algorithms in use today. Therefore one active area of research is the development of new
methods to increase the learning speed of feedforward ANN models, mainly the multi-
layer Perceptron since this is the most popular feedforward model. At the end of chapter
2 we mentioned some methods that can be used to try to speed up learning in
feedforward ANNs, i.e. a) without adapting the network topology, for instance, by using
adaptive learning rates or second-order algorithms; or b) adapting the network topology,
such as the Cascade-Correlation Learning algorithm [FaLe90].
In this chapter we propose an alternative method that aims to speed up learning
by constraining the weights arriving at the hidden units of a multi-layer feedforward
ANN. The method is concerned with the case where the hidden units have sigmoidal
functions, such as in the multi-layer Perceptron.
The basic idea of the proposed method is based on the observation that one
condition that is necessary, but not sufficient, for a feedforward multi-layer ANN to
learn a specific mapping is to have the decision surfaces defined by the hidden units
within or close to the boundaries of the network input space. The hidden units then will
not have a constant output value and cannot be simply substituted by the addition of a
bias to the output unit. Since it is quite reasonable to know beforehand the range of
the network input values, we can assume that the network input space is also known.
The proposed method then simply checks the above condition and resets those hidden
units with decision surfaces outside a valid region. This approach also leads to a new
method for initializing the weights of the ANN.
We show different methods for initializing and constraining during training the
locations of the decision surfaces. We also show how one can adjust the inclination of
the decision surfaces. In the simulation section the proposed method is illustrated for the

case where an ANN is trained to perform the nonlinear mapping sin(x) over the range
−2π to 2π. This example uses the Back-Propagation algorithm to train the network but
the proposed method can be used with any other algorithm that adjusts the weights
directly without imposing constraints on the decision surfaces.
The proposed method can be applied to any unit as long as the decision surface
associated with such a unit is a hyperplane, e.g. sigmoid or hyperbolic tangent units.
Therefore the ANN can have more than one hidden layer of units and it does not need
to be a strictly feedforward ANN. Note that by constraining the decision surfaces we are
in effect indirectly constraining the hidden unit weights.

4.1 - Initialization Procedures

In chapter 2 we saw that if a unit has its output defined as: 1) an increasing
function of their net input with saturation above an upper limit and below a lower limit
(a sigmoidal unit, for instance sigmoid and hyperbolic tangent units) and 2) its net input
is defined as a linear combination of the unit inputs; then the decision surface of this
unit is the hyperplane defined by:
wi1 x1 + wi2 x2 + .... + wiNx xNx + biasi = 0
where wij and biasi are the unit incoming weights and bias and xj, j=1,...,Nx, are the unit
inputs and Nx is the number of inputs received by the units. A fundamental component
of the learning process is the correct placement and inclination of these decision surfaces
in the network input space.
The simplest case is a network with just a single layer of sigmoidal hidden units
and an output layer of linear units. In order to perform correctly the desired mapping
the hidden unit weights, i.e. the weights received by the hidden units, have to be such
that the decision surfaces of each hidden unit have the correct position and inclination.
Then the role of the output unit weights is to perform the correct linear combination of
the hidden unit outputs.
The problems in which this type of ANN can be applied can be divided into two
classes: a) pattern-recognition, where the inputs and desired outputs are binary (0 or 1)
or bipolar (-1 or 1); and b) function mapping, where the inputs and desired outputs are
real numbers. An important difference is that, in general, in the former case there is more
freedom for the placement and inclination of the decision surfaces (vide the XOR

problem) than in the latter case. In other words the input-output mapping is less
sensitive in relation to decision surfaces in the former case than in the latter case.

4.1.1 - The Standard Initialization Procedure


The standard and widely used procedure to initialize all the weights and biases
of a feedforward multi-layer ANN, such as the Multi-Layer Perceptron, is to simply set
all weights and biases to small random values [RHW86] using a normal or uniform
distribution. The justification for using small values is to avoid saturation of the unit
since saturated units will operate in the regions where the derivative of the unit output
function is very small and consequently, if the network is trained by the BP algorithm
(or other algorithm that uses first-derivative information), training will be very slow.
One problem with such a procedure is that it does not take into consideration the
size of the network inputs when choosing how spread the random weights should be.
The ANN literature contain a few alternative procedures such as the ones that have been
proposed by Nguyen and Widrow [NgWi89] and Drago and Ridella [DrRi92].
Assuming that the network input space has dimension 1, if we use a gaussian
distribution with zero mean to generate the weight from the input unit to each hidden
unit and the bias of each hidden unit, the position of the decision surface (in this case
a point in a horizontal line) will be given by: xDS = -biasi / wi, where here i specifies the
hidden unit number. Assuming that biasi and wi are random independent variables, xDS
has a Cauchy distribution centred at zero [Pap84].
Figure 4.1 shows the histogram of xDS calculated as above by using the quotient
of 1000 computer generated samples of two (assumed independent) random gaussian
variables with zero mean and variance 1.
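
The experiment behind fig. 4.1 can be reproduced with a few lines (the sample size and the random seed are arbitrary):

import numpy as np

# decision-surface location of a 1-input unit: x_DS = -bias/w; the ratio of two
# independent zero-mean gaussians is Cauchy distributed, hence the heavy tails
rng = np.random.default_rng(0)
bias = rng.normal(0.0, 1.0, 1000)
w = rng.normal(0.0, 1.0, 1000)
x_ds = -bias / w
print(np.percentile(x_ds, [1, 25, 50, 75, 99]))   # most surfaces near 0, a few very far away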
If the network has 2 inputs then the decision surface (in this case a line) for
hidden unit i will be described by wi1 x1 + wi2 x2 + biasi = 0. Figure 4.2 shows 100 lines
generated by defining the coefficients wi1, wi2 and biasi as gaussian random variables
with zero mean and standard deviation 0.1.
From figures 4.1 and 4.2 we can see that most of the decision surfaces will be
concentrated around the origin. It is also very important to note that, depending on how
large the input space is, even if we considered it centered around the origin, there is the
possibility that some of the decision surfaces will fall outside the input space. In this
case, these units will produce a near-constant output for inputs within the valid input
space, especially if their decision surfaces have a steep inclination. Therefore such units
will not be performing any useful computation and can be substituted by a constant term
added to the bias of the output units. If we train the network with a hidden unit initialized
in this way using a gradient-based algorithm (such as the Back-Propagation algorithm), this
hidden unit will take a long time to change its weights since the unit will be operating in a
region where its derivative is very low.

Fig. 4.1 - Histogram of a variable defined as the quotient of two gaussian random variables with zero mean and variance 1
Fig. 4.2 - 100 decision surfaces generated by the standard initialization procedure
If we consider in fig. 4.2 that the valid range for variables x1 and x2 is [-10 10]
most of the decision surfaces will be within a small distance from the origin. If we have
no pre-knowledge about the correct positioning of the decision surfaces, this seems
difficult to justify. On the other hand if the valid range for x1 and x2 is [-1 1] there will
be several hidden units with their decision surfaces outside the valid network input
space. We get very similar results if, instead of a gaussian distribution, we use a uniform
distribution.
It is possible to explore the concept of decision surfaces to get better
initialization procedures. In this case we only need to make use of information that is
normally available, that is, the valid range of the network input variables.
If no previous information is available about the desired location of the decision
surfaces for a particular problem (the normal case), then it is reasonable to argue that
the available decision surfaces should be uniformly distributed over the valid input
space. Therefore, instead of generating the weights and biases using a particular random
distribution, and then calculating from them the positioning of the decision surfaces and
looking at their distribution over the input space, we propose to do the opposite. We can
generate the location of the decision surfaces using some appropriate random distribution

and then we calculate the weights and biases associated with each decision surface.
Finally we adjust the inclination of each decision surface, also considering the size of
the input space to avoid the possibility of very large inclinations that will result in slow
adaptation.
If the output units are linear we cannot associate a decision surface with them
and therefore we propose to still generate their weights and biases using a random
distribution (gaussian or uniform) with zero mean. On the other hand, if the output units
are sigmoidal, the same method that is used to generate the weights and biases can be
applied with the only difference that the input space for the output units is now the
output space of the hidden units. If there are direct connections from the network input
units to the output units the input space of the output units also includes the network
input space.
We propose two new procedures to initialize the network based on this idea of
initializing the decision surfaces.

4.1.2 - The First Initialization Procedure


One way to initialize the decision surfaces over the valid input space is to select
(using a uniform distribution) a sufficient number of points to define a decision surface,
in this case a hyperplane. Since each point inside the valid input space has the same
probability of being chosen as all the other valid points, the location of the
corresponding decision surfaces will also be uniformly distributed over the input space.
Therefore the first step is to obtain, from the set of selected points, the equation of the
decision surface and, from that, the values of the weights and of the bias.
Since the network input space dimension is Nx, we select at random from a
uniform distribution Nx points within the valid input range since we need Nx points to
define uniquely a decision surface. Since these points belong to the decision surface,
they must satisfy the equation of the hyperplane (for simplicity we drop the subscript i):
w1 x1 + w2 x2 + .... + wNx xNx + bias = 0
Since we have Nx points we have to solve a system of linear equations: X̃ W̃ = 0, where
each selected point defines a row of the matrix X̃, W̃ = [w1 ... wNx bias]T and
0 = [0 ... 0]T. We have then Nx equations and Nx+1 unknowns. Therefore it is necessary
to add another constraint. There are several possibilities, like forcing one of the weights
or the bias to be equal to 1 by adding the constraint w1 = 1 (or bias = 1). At this point

we use:
w1 + w2 + .... + wNx + bias = Nx
The particular constraint used is not relevant, as long as it is a valid one (bias = 0 is not
a valid constraint), since it will determine the inclination and we propose to normalize
the inclination in another step later.
Using the above procedure, the following steps are used to initialize a
feedforward ANN with one hidden layer of sigmoidal units and linear output units:
Step 1) Initialize the weights that the output units receive and the bias of
the output units as small random values. Observe that the output units
can receive weights directly from the input units as well.
Step 2) For each unit in the hidden layer:
2.1 - select at random Nx points within the valid input range (Nx =
number of input units). All points within the valid input space have the
same probability of being selected. Let’s use the following notation to
denote each of these points and their components:
X^i = [ x_1^i  x_2^i  ...  x_{N_x}^i ]
2.2 - The weights that connect the input units to this hidden unit and the
bias of this hidden unit are calculated as the solution of the following set
of linear equations:

\begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_{N_x}^1 & 1 \\ x_1^2 & x_2^2 & \cdots & x_{N_x}^2 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ x_1^{N_x} & x_2^{N_x} & \cdots & x_{N_x}^{N_x} & 1 \\ 1 & 1 & \cdots & 1 & 1 \end{bmatrix} \begin{bmatrix} w_{i1} \\ w_{i2} \\ \vdots \\ w_{iN_x} \\ bias_i \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ N_x \end{bmatrix} \qquad (4.1)
Figure 4.3 shows the locations of 100 decision surfaces obtained by using this
procedure. It is important to notice that the decision surfaces are equally spread over the
input space and, due to the method, it is possible to guarantee that all of them cross the
valid input space.
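
A possible implementation of the first initialization procedure for one hidden unit is sketched below; the function name and the example input range are illustrative assumptions:

import numpy as np

def init_hidden_unit_first_method(input_lo, input_hi, rng=None):
    # draw Nx points uniformly in the valid input range and solve eq. 4.1
    # for the incoming weights and the bias of one hidden unit
    if rng is None:
        rng = np.random.default_rng()
    input_lo = np.asarray(input_lo, dtype=float)
    input_hi = np.asarray(input_hi, dtype=float)
    Nx = input_lo.size
    pts = rng.uniform(input_lo, input_hi, size=(Nx, Nx))   # one selected point per row
    A = np.vstack([np.hstack([pts, np.ones((Nx, 1))]),     # [X~ | 1]
                   np.ones((1, Nx + 1))])                  # extra row: w1+...+wNx+bias = Nx
    b = np.concatenate([np.zeros(Nx), [float(Nx)]])
    sol = np.linalg.solve(A, b)
    return sol[:Nx], sol[Nx]                               # weights, bias

# example: a 2-input network with inputs in [-10, 10] x [-10, 10]
w, bias = init_hidden_unit_first_method([-10, -10], [10, 10])
print("weights:", w, "bias:", bias)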
Using this method we have to solve a system of linear equations with dimension
Nx+1 for each network unit. If the unit receives inputs from several other units (for
instance the network has a large number of inputs) or if the number of units to be
initialized is very large, this method can be too computationally demanding. In order to
minimize this problem we propose a second initialization procedure. However, the method
used to initialize the weights received by the linear output units is the same.

Fig. 4.3 - 100 decision surfaces generated by the first initialization procedure.

4.1.3 - The Second Initialization Procedure


Instead of selecting Nx points within the valid input range and then using these
points to generate a decision surface, another method is simply to select at random only
1 point (all points within the valid input space are equally probable) and then to select
a vector with random components such that the direction in which this vector points is
random. The decision surface is then defined as the hyperplane that passes through the
selected point and such that the selected vector is normal to it.
One way of generating this vector normal to the decision surface is firstly to fix
its length, and then, each time that a new component of the vector is to be defined, the
limit for the magnitude of that component is calculated from the square of the total length
decreased by the sum of the squares of the components already generated. That is, if Ns is the desired length
for the vector V where:

N_s = \sqrt{V_1^2 + V_2^2 + \cdots + V_{N_x}^2} \qquad (4.2)

The first component V1 can be generated as a uniform random variable in the interval
[-Ns Ns]. The second component V2 is chosen in the range [-(Ns2 - V12)1/2 (Ns2 - V12)1/2]
and so on until VNx-1 is generated. The last component of V is calculated such that V has
the desired length, or:
Figure 4.4 - Geometrical interpretation of the initialization procedure

V_{N_x} = \pm \sqrt{N_s^2 - V_1^2 - V_2^2 - \cdots - V_{N_x-1}^2} \qquad (4.3)

where the positive sign and the negative sign have the same probability of 50%.
A more direct way of generating this vector normal to the decision surface is to
use the concept of circular symmetry of random variables [Pap84]. This concept states
that if we have several independent normal random variables with zero mean and equal
variance, then these random variables are circular symmetrical and their joint statistics
depends only on the distance from the origin. Therefore all points that have the same
distance from the origin are equally probable. This implies that, if the components of
V are generated using such a concept, for a given magnitude all directions will be
equally probable.
Suppose for the moment that the vector V has been scaled to an arbitrary non-
zero magnitude. Denoting the selected point by X*, all points X that belong to the
decision surface satisfy the equation: (X - X*)T V = 0. Figure 4.4 gives a geometrical
interpretation for this equation. Comparing this equation with the equation for the
decision surface (which is defined by the incoming weights to the unit and the
associated bias as WT X + bias = 0 where W = [w1 w2 ... wNx]T) we have:
W = V, bias = - X*T V
Comparing this method with the standard method (assuming that the standard method
uses a gaussian distribution), we can see that the difference is the way that the bias term
is initialized. Figure 4.5 shows the locations of 100 decision surfaces where V was
generated using random values with gaussian distribution, zero mean and 0.1 as the
standard deviation. Again using the above method we guarantee that the decision
surfaces will cross the valid input space since the selected point X* was chosen from the
set of points that are within the valid input space.

Figure 4.5 - 100 decision surfaces generated by the second initialization procedure.
Note that in this procedure we have assumed that the vector V (and therefore the
weight vector W for the unit as well) has been scaled to an arbitrary non-zero
magnitude. It is such a magnitude that dictates the inclination of the decision surface.
In the next sub-section we propose to adjust this inclination as the last step (Step 3) for
both initialization procedures suggested here.
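
A corresponding sketch of the second procedure, based on the circular symmetry argument above, is considerably cheaper since no linear system has to be solved (the spread of the gaussian components is an arbitrary choice):

import numpy as np

def init_hidden_unit_second_method(input_lo, input_hi, sigma=0.1, rng=None):
    # pick one point X* uniformly inside the valid input space, draw a gaussian
    # vector V (random direction by circular symmetry) and set W = V, bias = -X*.T V
    if rng is None:
        rng = np.random.default_rng()
    input_lo = np.asarray(input_lo, dtype=float)
    input_hi = np.asarray(input_hi, dtype=float)
    x_star = rng.uniform(input_lo, input_hi)        # point lying on the decision surface
    V = rng.normal(0.0, sigma, size=input_lo.size)  # vector normal to the decision surface
    return V, -float(x_star @ V)

w, bias = init_hidden_unit_second_method([-10, -10], [10, 10])
print("weights:", w, "bias:", bias)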

4.1.4 - Adjusting the Inclinations of the Decision Surfaces


In order to adjust the inclination of the decision surface we simply adjust the
variation of the output of the unit for a given variation of the unit input.
The variation of the unit input is specified by choosing any point X1 over the
decision surface and another point X2 such that ∆X = X2 - X1 is orthogonal to the
decision surface.
For convenience, assuming without loss of generality that the center of the valid
input space is the origin, we can choose X1 to be the point belonging to the decision
surface that is closer to the origin. As we have seen in the previous sub-section the unit
weight vector (the incoming weights) W is the vector orthogonal to the decision surface
and consequently X1 = α W. Since X1 belongs to the decision surface, WT X1 + bias = 0.
Combining these two equations, we have that:

\alpha = -\frac{bias}{\|W\|^2} \qquad (4.4)

X_1 = -\frac{bias}{\|W\|^2}\,W \qquad (4.5)

where ||W|| = (W^T W)^{1/2} is the length of the weight vector. The point X2 is then defined as:

X_2 = X_1 + K_s\,U_s\,\frac{W}{\|W\|} \qquad (4.6)
where Ks and Us are scalar parameters such that Ks > 0, Us > 0. The parameter Us is set
to the distance from the origin to the most distant corner of the valid input space and
therefore gives a measure of the size of the input space. The parameter Ks therefore is
the length of the vector ∆X in "Us" units.
Once we have selected a value for Ks, for a given unit weight vector and bias,
the variation of the output of the unit is simply calculated as:

\Delta F = F(net_2 / T) - F(net_1 / T) \qquad (4.7)

where F(net/T) is the unit function, e.g. sigmoid or hyperbolic tangent, T is the fixed
parameter called temperature, T > 0, net1 = WT X1 + bias and net2 = WT X2 + bias. Note
that: a) net1 is by definition 0 since X1 is on the decision surface; and b) for a unit with
sigmoid function or hyperbolic tangent, F(net1) = 0.5 or 0 respectively.
The objective is to find a scalar positive gain Kw that, when used to scale the unit
weight vector and the unit bias, the unit will have the desired output variation ∆Fdes > 0
for a given input variation specified by Ks. From eq. 4.7, Kw can be calculated for a
sigmoid unit using:

 0.5 ∆ F 
ln  
T des (4.8)
Kw 
net2  0.5 ∆ Fdes 
Using the expression: tanh(x/T) = 2 sig(2x/T) - 1, Kw can be calculated for a hyperbolic
tangent unit using:

K_w = \frac{T}{2\,net_2}\,\ln\!\left(\frac{1 + \Delta F_{des}}{1 - \Delta F_{des}}\right) \qquad (4.9)
The unit weight vector and bias are finally replaced by Kw W and Kw bias respectively.
Note that for sigmoid units: 0 < ∆Fdes < 0.5, and for hyperbolic tangent units:
0 < ∆Fdes < 1.
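
The adjustment can be summarized by the sketch below for a sigmoid unit; the numerical values in the example call are arbitrary:

import numpy as np

def adjust_inclination_sigmoid(W, bias, Us, Ks=1.0, dF_des=0.4, T=0.5):
    # compute the gain Kw of eq. 4.8 and rescale the unit weight vector and bias
    W = np.asarray(W, dtype=float)
    norm_w = np.linalg.norm(W)
    X1 = -bias / norm_w**2 * W                     # closest point of the surface, eq. 4.5
    X2 = X1 + Ks * Us * W / norm_w                 # point at distance Ks*Us from it, eq. 4.6
    net2 = float(W @ X2 + bias)                    # net1 = 0 by construction
    Kw = (T / net2) * np.log((0.5 + dF_des) / (0.5 - dF_des))   # eq. 4.8
    return Kw * W, Kw * bias

# example: 2 inputs in [-10, 10] x [-10, 10], so Us = 10*sqrt(2)
W_new, bias_new = adjust_inclination_sigmoid([1.0, -0.5], 2.0, Us=10 * np.sqrt(2))
print(W_new, bias_new)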

4.2 - Constraining the Decision Surfaces during Training

The knowledge that the decision surfaces will have to be within or close to the
boundaries of the network input space can also be explored during training. An easy and
simple (and probably not optimal) way to do this is simply to periodically check if the
decision surfaces are within the boundaries of a permissible region. The units with
decision surfaces outside such a region are then reinitialized using the methods proposed
in the previous section. This permissible region is in general defined as enclosing the
network input space.
In order to perform a sufficiently close approximation to some mappings some
of the decision surfaces may have to be outside the network input region, but still close
to its boundaries. If a decision surface is situated very far from the boundary of the
network input region, the variation of the unit output over the network input region will
be small (if the inclination of the decision surface is small) or zero and therefore this
unit will be operating as a linear unit. This unit can be replaced by another unit with a
decision surface located within the network input space with a small inclination such
that this unit also operates as a linear unit. A constant term should then be added to the
bias of the output units that represents the averaged output of the original unit over the
input space.
We propose two methods to check if the decision surface of a sigmoidal unit is
within the boundaries of the permissible region. In general such a region is defined as
a hypercube if we can define hard limits for each network input.
In the first method we calculate the unit output for each corner of this hypercube.
If, for all corners of the hypercube, the network output is always less or always greater
than its output midpoint (defined as 0.5 for sigmoid units and 0 for hyperbolic tangent
units), then the decision surface of this unit is outside the hypercube.
In the second method we define a hypersphere such that it encloses the
hypercube. If we assume that all sides of the hypercube have the same length 2u and its
center is the origin, all corners will be equally distant from the origin and the radius of
the hypersphere is equal to the distance from the corners to the origin, that is u (Nx)^{1/2}.
If the distance from the decision surface to the origin ( |bias| / ||W|| , see eq. 4.5) is
greater than the radius of the hypersphere, then the decision surface is outside the
hypersphere and outside the hypercube.

The first method is more restrictive than the second one since, for input spaces
with dimension greater than 1, if the decision surface (the hyperplane) is nearly parallel
to one of the sides of the hypercube, it is possible that the decision surface is inside the
hypersphere but outside the hypercube. On the other hand, the number of calculations
is much greater in the first method than in the second method. For input spaces with
dimension 1, the two methods are the same.
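
Both checks are easily implemented. The sketch below assumes a hypercube [-u, u]^Nx centred at the origin and a sigmoid unit with temperature 1:

import numpy as np
from itertools import product

def outside_by_corner_check(W, bias, u, midpoint=0.5):
    # first method: if the unit output is on the same side of its midpoint at every
    # corner of the hypercube, the decision surface does not cross the hypercube
    W = np.asarray(W, dtype=float)
    corners = np.array(list(product((-u, u), repeat=W.size)))
    outputs = 1.0 / (1.0 + np.exp(-(corners @ W + bias)))
    return bool(np.all(outputs > midpoint) or np.all(outputs < midpoint))

def outside_by_hypersphere_check(W, bias, u):
    # second method: compare the distance |bias|/||W|| of the surface from the origin
    # with the radius u*sqrt(Nx) of the hypersphere enclosing the hypercube
    W = np.asarray(W, dtype=float)
    return abs(bias) / np.linalg.norm(W) > u * np.sqrt(W.size)

print(outside_by_corner_check([1.0, 1.0], 25.0, u=10.0))       # True: surface outside
print(outside_by_hypersphere_check([1.0, 1.0], 25.0, u=10.0))  # True as well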

4.3 - Simulations

In this section we illustrate the application of the proposed method in the case
where it is desired to train a FF ANN to learn the mapping y = sin(x) for x in the
interval [-2π 2π]. The ANN has 5 sigmoid hidden units and the output unit is linear.
In order to perform the desired mapping the learning algorithm has to position
a decision surface where the target function crosses the line y = 0. Therefore 5 hidden
units is the minimum number of hidden units necessary to produce a good
approximation of the desired mapping. Moreover, since the network input and output
variables are continuous, the decision surfaces for the hidden units have to be positioned
at [−2,−1,0,1,2] π with a good precision and have approximately the correct inclination.
The weights for the output units should then provide the correct linear combination of
the outputs of the hidden units. In other words, this is a demanding problem since the
solution space is very limited and contains only a certain combination of weights.
Figure 4.6 shows that the function F(x) can approximate the sine function very
well, where:
F(x) = 1.15 \sum_{i=1}^{5} (-1)^{\,i+1} \tanh\!\left(x - \pi(i-3)\right) \qquad (4.10)

or using the relation tanh(x/T) = 2 sig(2x/T) - 1:


F(x) = -1.15 + 2.3 \sum_{i=1}^{5} (-1)^{\,i+1}\, \mathrm{sig}\!\left(2x - 2\pi(i-3)\right) \qquad (4.11)

The degree of approximation can be measured by calculating the Root-Mean-Squared


(RMS) error. The expression for the RMS error is:
\mathrm{RMS\ error} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \left[\sin(x_i) - F(x_i)\right]^2} \qquad (4.12)

where Np = number of selected points. Using a set of 40 equally spaced points in the
range [-2π 2π] the RMS error is 0.004539.

Figure 4.6 - The function sin(x) and its approximation F(x).
In this section we compare the simulations for 3 cases: 1) in the first case all the
network weights and biases are initialized using the standard initialization method, that
is as random values; 2) in the second case the network is initialized using the second
initialization method presented in section 4.1.3 and the inclinations of the decision
surfaces for the hidden units are then adjusted as explained in section 4.1.4; 3) in the
third case the network is initialized as in the previous case and the position of the
decision surfaces for hidden units are reinitialized during training whenever they are
found to be outside a pre-defined permissible region.
In all cases the network is trained using the Back-Propagation algorithm using
the following parameters: learning rate = 0.125, momentum = 0, temperature = 0.5. Each
epoch is defined as the presentation of 50 points selected with uniform distribution in
the range [-2.5π 2.5π] (a new set of points is selected in every epoch). Care was taken
to ensure that the same training data, with the same order, was used in all 3 cases. The
network input is defined as x/(2π) and the desired network output is defined as sin(x)
and is presented uncorrupted. The RMS error is calculated every 5 epochs using 40
points equally spaced between [-2π 2π].
In the first case all the network weights and bias are initialized as random values
with gaussian distribution, zero mean and 0.3 as the standard deviation. In the second
and third case: a) the output unit weights and bias were initialized to the same values

used in the first case but the hidden unit weights and biases were initialized such that
the decision surfaces were located in the range [-2π 2π] (in network units [-1 1]) using
the method presented in section 4.1.3; b) the inclination of the initial decision surfaces
was adjusted as explained in section 4.1.4 using the user-defined parameters Ks = 1 and
∆Fdes = 0.4 (in our simulations Us = 1). In all 3 cases the location of the decision
surfaces was verified every 5 epochs.
In the third case, whenever the decision surface of a hidden unit was detected to
be outside the permissible region defined to be [-4π 4π] (in network units [-2 2]), the
following procedure was adopted:
a) The amount wOH(MaxHU - MinHU)/2 was added to the output unit
bias, where wOH = weight connecting the hidden unit to the output unit, MaxHU and
MinHU = maximum and minimum values for the hidden unit output when the network
inputs are in the corners of the permissible region [-4π 4π]. The basic idea is to transfer
to the output unit bias the "average" contribution of the hidden unit that is being reset.
b) The weight wOH was set to zero.
c) The hidden unit incoming weights and bias were reinitialized such that
the decision surface went back to be in the range [-2π 2π] (in network units [-1 1])
using the initialization method presented in section 4.1.3. The hidden unit incoming
weights were generated as random gaussian numbers with zero mean and unit variance.
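
The reset procedure of case 3 can be sketched as follows. The function name and signature are illustrative; the "average" contribution transferred in step a) is taken here as the mid-point of the unit output over the corners of the permissible region, which is the stated intent of the transfer:

import numpy as np

def reset_hidden_unit(w_in, bias_h, w_oh, bias_out, corners, rng=None):
    # corners: array whose rows are the corners of the permissible region (network units)
    if rng is None:
        rng = np.random.default_rng()
    # a) transfer the average contribution of the unit to the output unit bias
    outputs = 1.0 / (1.0 + np.exp(-(corners @ w_in + bias_h) / 0.5))   # sigmoid, T = 0.5
    bias_out += w_oh * (outputs.max() + outputs.min()) / 2.0
    # b) disconnect the hidden unit from the output unit
    w_oh = 0.0
    # c) reinitialize the unit so that its decision surface lies again in [-1, 1]
    #    (gaussian incoming weights, bias from a random point of the valid range)
    w_in = rng.normal(0.0, 1.0, size=w_in.size)
    x_star = rng.uniform(-1.0, 1.0, size=w_in.size)
    bias_h = -float(x_star @ w_in)
    return w_in, bias_h, w_oh, bias_out

For the single-input network of this section, the corners of the permissible region are simply the two points x/(2π) = ±2.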
Note that, once a hidden unit is reset, the inclination of its new decision surface
was not readjusted, although this is a possible alternative.
Figure 4.7 shows the RMS error history (sometimes also referred to as the
learning curve) for the 3 cases. Considering that convergence is obtained when the
RMS error remains less than 0.02, the first case takes 2060 epochs to converge, the
second case 345 epochs and the third case 230 epochs. The learning speed in the third
case is almost 9 times faster than in the first case.

Figure 4.7 - The RMS error history for the 3 simulation cases.
Figures 4.8-4.11 show the evolution of the location of the decision surfaces for
the 3 cases. Figures 4.8 and 4.9 refer to the first case, plotted using different vertical
scales while figures 4.10 and 4.11 refer to the second and third cases respectively. Note
that the learning curve for case 1 in figure 4.7 has a staircase shape and that, whenever
one of the decision surfaces converges to its correct final value, there is a sharp decrease
in the RMS error in the learning curve. Note in figure 4.9 that a large number of epochs
is wasted since the decision surfaces are very far from their correct locations.
Figure 4.12 shows for case 3 the history of the output unit weights and bias, also
sampled every 5 epochs. Finally figure 4.13 shows for case 3 the approximation
provided by the network after being trained for 500 epochs. At the end of the training
session the RMS error is 0.004807 and the decision surfaces are located at
[-1.9306 -1.0350 -0.0311 1.0455 1.9645]π while they were expected to be located at
[-2 -1 0 1 2]π.

Fig. 4.8 - Decision surfaces for case 1
Fig. 4.9 - Decision surfaces for case 1
Fig. 4.10 - Decision surfaces for case 2
Fig. 4.11 - Decision surfaces for case 3
Fig. 4.12 - Output unit weights and bias for case 3
Fig. 4.13 - The function sin(x) and its network approximation for case 3

4.4 - Conclusion

In this chapter we presented a technique that can be used with the Back-
Propagation in order to speed up learning. We propose to use the knowledge about the
range of the network inputs to initialize and constrain the location of the network
decision surfaces during training. We also propose to adjust the inclination of the
decision surfaces during the weight initialization process.
The simulation results demonstrate that once the decision surfaces converge to
their correct location, the adjustment of the second layer of weights is very fast. This
seems to indicate that learning occurs in the bottom to top direction (the input layer is
at the bottom and output layer is at the top).
During training the user has to define the permissible region for the decision
surfaces. A too small permissible region will lead to a large number of unnecessary
reinitializations of the decision surfaces. On the other hand, a too large permissible
region will tend to slow down the convergence. A possible alternative, that avoids the
need to specify a permissible region, is to treat the location of the decision surface
as a "soft" constraint, not as a "hard" constraint.
In the next chapter we are concerned with how to improve the fault-tolerance
of the feedforward ANN, so as to increase network robustness to loss of hidden units.

Chapter 5 - Fault Tolerant Artificial


Neural Networks

One very interesting property of biological neural networks of the more


developed animals is their tolerance against damage to individual neurons. It is believed
this is a consequence of the distributed representation used by the nervous systems of
such animals. The large number of neurons and interconnections available makes it
easier to obtain such a distributed representation by exploiting redundancy, as was
earlier realized by von Neumann [Neu56]. However, the presence of a large number of
neurons and interconnections is not enough to guarantee such fault tolerance. The
learning mechanism has to be able to exploit the capability of the available hardware
according to its priorities.
In the case of biological neural networks a solution tolerant to loss of neurons
has a high priority since a graceful degradation of performance is very important to the
survival of the organism. It is still unclear how living organisms achieve such fault
tolerance since not enough is known about their learning mechanisms. For artificially
created systems, based on artificial neural networks or not, such a graceful degradation
in relation to damage of its internal components is highly desirable since it directly
reduces maintenance costs and increases overall productivity.
In this chapter we propose a simple modification of the training procedure
commonly used with the Back-Propagation algorithm in order to increase the tolerance
of the feedforward multi-layered ANN to internal hardware failures such as the loss of
hidden units. The proposed method switches, during training, between the different
possible fault configurations, i.e. all possible configurations are also trained and forced
to share the same set of network weights. The conventional Back-Propagation algorithm
can be considered as a special case of the proposed method, where the set of possible
configurations contains only the no-fault configuration. The benefits of the proposed
method are demonstrated in this chapter for a bit-mapped image recognition problem and
in chapter 7 for a nonlinear control application (control of an inverted pendulum).
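
The BPS algorithm itself is specified later in this chapter. Purely to illustrate the switching idea described above, a minimal sketch (with an arbitrary fault set, update rule and network size, none of which are prescribed here) could take the following form:

import numpy as np

def switching_epoch(X, Y, W1, b1, W2, b2, lr=0.1, rng=None):
    # one training epoch in which, for every pattern, one fault configuration is drawn
    # (here: either no fault or a single hidden unit switched off) and the shared
    # weights are updated by gradient descent on the squared error of that configuration
    if rng is None:
        rng = np.random.default_rng()
    n_hidden = W1.shape[1]
    for x, y in zip(X, Y):
        mask = np.ones(n_hidden)
        fault = rng.integers(-1, n_hidden)      # -1 selects the no-fault configuration
        if fault >= 0:
            mask[fault] = 0.0                   # remove one hidden unit
        h_raw = np.tanh(x @ W1 + b1)
        h = h_raw * mask                        # hidden outputs of the faulted network
        err = (h @ W2 + b2) - y                 # linear output units
        dW2, db2 = np.outer(h, err), err
        dh = (W2 @ err) * (1.0 - h_raw ** 2) * mask
        dW1, db1 = np.outer(x, dh), dh
        W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2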

5.1 - Approaches to Fault Tolerance

The classical approach to achieve fault tolerance has been to duplicate several
times the system performing an important task. If it is easy to detect when the system
is faulty within the required time period, one could simply have just one of the
duplicates functioning until it becomes faulty. Then it is simply replaced by one of its
non-faulty copies. In this scheme only one of the copies is functioning at a given time.
This is known as standby redundancy [GrBo72]. In other cases it may be difficult or
cumbersome to decide if the system is faulty within the required time period. One
alternative in such cases is to have all copies functioning at the same time receiving the
same input signals. Each copy generates its own output independently of the other
copies. The overall system output can then be produced by using a voting procedure,
such as a majority decision. This type of redundancy is known as majority voting
redundancy [DhSi81] and was proposed by von Neumann ([Neu56], [Sho68]). The
simplest majority-voting system contains 3 elements, of which at least 2, together with
the "voter", are required to be working successfully. This is known as a triple-modular
redundant (TMR) system ([DhSi81], [GrBo72]).
The problem with the classical approach is that it can be costly to produce the
duplicates, i.e the trade off between cost and redundancy. Although it seems always
necessary to have some redundancy in order to have some degree of fault tolerance, it
may be possible to exploit the redundancy in more effective ways. In this chapter we
aim to exploit the redundancy that exists in artificial neural networks in terms of a large
number of units and weights to make them fault tolerant.
One effective way to exploit redundancy, assuming that we are dealing with
complex tasks, is to divide the task into sub-tasks and to have several sub-systems.
Assume that it is possible to perform such division in such a way that there is no one-to-
one correspondence between the sub-tasks and sub-systems. Each sub-system contributes
to a number of sub-tasks and each sub-task is performed by a number of sub-systems.
In this case, if the number of sub-systems and sub-tasks is large enough and no sub-
system is vital to any of the sub-tasks, the loss of a relatively small set of sub-systems
randomly selected will probably affect the performance of all sub-tasks. However, the
loss of performance in each sub-task will be small. Consequently the overall loss of
performance will also be small and the overall system will degrade gracefully. The
workload to execute the task and its sub-tasks is distributed over the sub-systems and
the power of each sub-system is also distributed over several sub-tasks. Furthermore, if
the sub-systems can operate in parallel and the sub-tasks can be performed at the same
time, the time necessary to execute the overall task will be very much reduced.
The major difficulty with such an approach is to devise the task decomposition
into sub-tasks and the division of the power of each sub-system to each sub-task.
As an example of the above strategy imagine that the task is to paint a square
board with dimensions 10m-by-10m and there are 1000 painters available. The standby
redundancy approach would use only one painter at a time until he becomes "faulty"
(perhaps too tired in this case). Then he is replaced by another one and so on.
Alternatively we can divide the task into 100 sub-tasks where each sub-task is to paint
a 1m-by-1m allocated part of the board. The 1000 painters then go around and paint just
a small portion of a large number of the 1m by 1m squares. The loss of 100 randomly
selected painters probably will affect all squares but only a small portion of each square
will be left unpainted, that is the loss of painters is uniformly distributed over the 100
squares, the sub-tasks.
In another case if we are forced to allocate each group of 10 painters to paint
only a specific area, the alternative is to increase the size of the area painted by each
group. Then, assuming that painters are lost in sets of groups, if a group is lost, its
assigned area will be at least partially painted by the other groups. Unlike the previous
situation, in this case the power of each sub-system is dedicated (or said to be localized)
to a specific sub-task. However, by increasing the scope of the sub-tasks, the fault
tolerance of the system is improved.

5.2 - Fault Tolerance in Artificial Neural Networks

Since Artificial Neural Networks are composed of a large number of simple


computational units operating in parallel, they have the potential to provide fault
tolerance. However just the presence of a large number of units cannot guarantee that
the ANN will be fault tolerant. The learning algorithm has to be able to exploit (or
organize) the existing excess of units in such a way that the network is fault tolerant,
if fault tolerance is one of our requirements. In other words, the learning algorithm is
responsible for decomposing the task into the appropriate sub-tasks and for dividing the
computational power of the units among such sub-tasks accordingly.


Today the most popular algorithm used to train a feedforward multi-layered
artificial neural network (FF ANN) is the Back-Propagation (BP) algorithm. The BP
algorithm, by using a gradient search procedure, tries to minimize an error function
defined as the squared output error averaged over the set of training data. No priority,
however, is given to the network fault tolerance. For a particular network topology
(number of hidden layers, number of hidden units in each layer), it is very likely that
there are several solutions (sets of weights) which will have similar satisfactory squared
output errors over the set of training data but different degrees of tolerance to hardware
failures. The solution to which the BP algorithm will converge (assuming that it does
converge) depends on factors such as the initial network weight values, size of the
learning rate, and the pattern presentation order.
In order to give some priority to network fault tolerance, we are then left with
the choice of two main paths: 1) to modify the BP algorithm while still keeping its basic
features, since it is a relatively simple algorithm that is not computationally demanding;
or 2) to use a more complex algorithm that carries a greater computational overhead but
with the benefit of faster convergence. Each option will have its own strengths and
weaknesses. One important point that has to be taken into consideration is the trade off
between the desired fault tolerance and the network accuracy when there is no fault.
Most methods tend to improve fault tolerance while decreasing network accuracy for the
no-fault case.
Bugmann et al. [BSRP92] propose to apply the BP algorithm until it converges
to a solution with a small output error. Then training is stopped and the network is
tested for fault tolerance for loss of hidden units. The hidden unit which causes the
largest increase in the error function is determined and duplicated (its incoming weights
and bias are kept the same but its outgoing weights are halved). In order to keep the
number of hidden units constant, they propose to remove the hidden unit which causes
the smallest increase in the error function or apply some pruning technique (for instance
[NoHi92a], [NoHi92b]). After such duplication and removal/pruning, network training
is resumed with the BP algorithm in order to further reduce the output error.
We can envisage that a possible generalization of the duplication stage of this
technique is to make X copies of each hidden unit, where X could be set as an integer
measure of the degree of importance of the particular hidden unit. The larger the
increase in the error function caused by the loss of a hidden unit, the more important
that hidden unit is. However, such modification of the original algorithm has yet to be
investigated. Note that, since learning continues after the duplication stage if no
measures are taken, the hidden units that were duplicated will have exactly the same
weights. Probably learning will be easier if such a constraint is not imposed. One easy
way to overcome such a constraint is to make imperfect copies, i.e. some noise is added
to the weights of the duplicate units.
Still on the subject of duplication, Izui and Pentland [IzPe90] mathematically
analyze feedforward ANN with a high degree of unit duplication using exact copies, the
simplest form of redundancy. More specifically, they assume that each input and hidden
unit is duplicated M times. Normally it is assumed that the effect of such duplication
would be only to improve the fault tolerance of the network. Surprisingly, they found
that feedforward ANN with such duplication learn
considerably faster than unduplicated ones. They also showed that feedback ANN with
such duplication will also converge faster to their solution. They argue that duplicated
feedforward and feedback networks also require less accuracy in inter-unit
communication, i.e. the weights can be specified with less accuracy.
In general the use of redundancy results in networks with a large number of
units. However, as the number of units increases, the difficulty of synchronizing all units
also increases. The use of asynchronous operation (as seen in biological systems) avoids
such difficulty. Furthermore, Izui and Pentland [IzPe90] show that training an
asynchronous network using a gradient descent rule is equivalent to training a
synchronous network using an update rule which takes into consideration the second-
order derivative of the cost function, in effect combining first- and second-order
derivatives. Therefore the use of asynchronous operation has also the computational
benefit that it can speed up training.
Neti et al. [NSY92] pose the fault tolerance issue of feedforward neural networks
as a nonlinear constrained optimization problem by defining the concept of a maximally
fault tolerant neural network. The problem of training a maximally fault tolerant neural
network is defined as finding a set of weights that performs the required mapping
(according to a set of input-desired output training patterns) with the added constraint
that when any of the hidden units is removed the mapping error (measured over the set
of training patterns) should not increase by more than a user-defined parameter ε > 0.
In general reducing the size of ε reduces the number of possible solutions. If ε is too
small there is even the possibility that no solution exists. They apply an unspecified
quadratic programming algorithm to solve an example.
A problem with the above formulation arises if there is even just one hidden unit
which causes a large increase in the mapping error while all the other units cause a
small increase. Depending on the problem in hand, too much emphasis will be placed
on just one unit, making the problem more difficult than it needs to be. Another problem
is that quadratic programming algorithms require a much greater computational effort
than the Back-Propagation algorithm. Moreover, Saarinen et al. [SBC91], not
considering the fault tolerance issue, argue that many ANN training problems are ill-
conditioned and may not be solved efficiently by second-order optimization algorithms.

5.3 - Back-Propagation with Increased Fault Tolerance

The Back-Propagation algorithm tries to minimize the scalar cost function J


defined as the squared output error averaged over the set of training data, or:

J = E\left\{ \left\| T - y \right\|^2 \right\}    (5.1)

where T and y denote respectively the desired (target) and actual network output column
vectors. However, we propose to redefine the cost function as a weighted sum of the
errors for all possible system configurations [NaZa93]:
J^* = \frac{1}{NC} \sum_{i=1}^{NC} \lambda^{[i]} \, E\left\{ \left\| T - y^{[i]} \right\|^2 \right\}    (5.2)

where NC ≥ 1 is the number of possible configurations, y[i] is the network output vector
for configuration i and λ[i]/NC can be interpreted as the probability of configuration i
occurring (λ[i] ≥ 0). The set of possible configurations normally includes the no-fault and
fault configurations. Observe that the cost function J can be seen as a particular case of
J* where just the no-fault configuration is considered. Examples of faults are: 1) the loss
of a set of weights or a set of units in any of the layers (including the case where the
faulty weights or units are in different layers); and 2) having the output of a unit stuck
at some value. The above redefinition of the cost function has also been independently
proposed by Vallet and Kerlirzin [VaKe91].
An epoch is defined as the presentation during training of all input-desired output
pairs from the training set, with each pair being presented just once. One way to
implement the minimization of the cost function J over the training set is to randomly
choose one training pattern from the training set for each iteration of the BP algorithm
and to update the network weights after each presentation. This is sometimes called
random incremental updating as opposed to sequential cumulative updating [Zur92]
when the patterns are presented with a fixed sequence and the weights are updated only
at the end of the epoch.
In the same manner, the minimization of cost function J* can be implemented by
randomly selecting for each epoch a possible configuration from the set of possible
configurations and then, within such an epoch, using random incremental updating. A
possible variation is to select randomly the configuration whenever a new training
pattern is selected. The basic difference is the frequency used to change the network
configuration being trained. We refer to such procedures as switching training methods.
When the BP algorithm is used with such a switching training method, we refer to the
entire training procedure as the BPS algorithm.
Consecutive weight updates use the network weights given at the end of the
previous update, even if a different configuration was used there. Note that this means
that the same network weights are shared by all network configurations, i.e. the weights
for the fault configurations are a subset of the weights for the no-fault configuration.
This is very different from using a different set of weights for each possible
configuration. In the latter case a fault detection algorithm and a much larger storage
capacity would be needed, since we will need to decide which fault has occurred and
then to load the weights for that particular fault configuration into the network. On the
other hand, the training algorithm would try to optimise the network response for each
configuration, without considering the other possible configurations, and consequently
we could, theoretically, get a better solution for each fault configuration.
Consider the particular case where the ANN has only one layer of hidden units,
no direct connections from input to output units, and the possible set of faults is defined
as the loss of each one of the hidden units, with only one of them lost in each fault. The
loss of a hidden unit is defined here as the output of the faulty hidden unit being fixed at 0
for any input and for any set of incoming weights. The cost function J* can be written as:
J^* = \frac{1}{NH+1} \sum_{i=0}^{NH} \lambda^{[i]} \, E\left\{ \left\| T - y^{[i]} \right\|^2 \right\}    (5.3)

where NH is the number of hidden units and λ[0] and y[0] correspond respectively to the
probability and network output for the no-fault configuration. The network output y[i]
can be calculated in vectorial notation as:

y^{[i]} = F\left( W_{oh} \, S^{[i]} \, out_h + bias_o \right)    (5.4)

where F( ) is the function used in the output layer, Woh is the weight matrix between
the hidden and the output layer, outh is the column vector with the output of the units
in the hidden layer, and biaso is the bias vector for the output units.
The matrix S[i] is a square matrix with dimension NH defined for i = 0 as the
identity matrix, and for 1 ≤ i ≤ NH as diag(1,...,1,0,1,...,1), i.e. the diagonal is composed
of 1's except for a zero at the ith element. We can define another vector C which contains the
integers 0 to NH with each integer repeated several times. The number of repetitions for
a particular number i divided by the number of elements in vector C should correspond
to the probability of fault for hidden unit i if 1 ≤ i ≤ NH, or to the probability of the no-
fault configuration if i = 0. An element j of vector C is then randomly chosen with
uniform distribution. This corresponds to the choice of S[i] where i = Cj, i.e. a fault at
hidden unit i if 1 ≤ i ≤ NH, or no fault if i = 0. Since, in effect, the vector C contains
the set of possible network configurations and their relative probabilities, we refer to
C as the configuration vector.
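To make this concrete, a small NumPy sketch is given below (it is only an illustration, not part of the simulation code used later; the number of hidden units and the fault probabilities are arbitrary assumptions). It builds a configuration vector C in which the no-fault configuration is twice as probable as each single-unit fault, samples a configuration and applies the corresponding mask S[i]:

    import numpy as np

    NH = 4                                  # number of hidden units (assumed)
    # configuration vector C: 0 = no fault, 1..NH = loss of that hidden unit;
    # repeating 0 twice makes the no-fault case twice as probable as each fault
    C = np.array([0, 0, 1, 2, 3, 4])

    rng = np.random.default_rng(0)
    i = C[rng.integers(len(C))]             # randomly chosen configuration

    S = np.eye(NH)                          # S[0] is the identity (no fault)
    if i > 0:
        S[i - 1, i - 1] = 0.0               # zero at the ith diagonal element

    out_h = rng.standard_normal(NH)         # some hidden-layer output vector
    masked_out_h = S @ out_h                # the product S[i] out_h used in eq. 5.4
    print(i, masked_out_h)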
Assume that: 1) the vector biaso and the matrix Woh have been initialized; 2) the
configuration vector C has been properly defined and contains NEC elements, where
NEC ≥ 1; 3) we want to select a possible configuration from the vector C every SWEPO
epochs, where SWEPO ≥ 1. Consider that we want to train the ANN for NEPO epochs,
where NEPO ≥ 1. The BPS algorithm can then be summarized as follows:
The BPS Algorithm: (% are comments):

1) SWE := 0; % initialize the epoch counter for the current configuration

2) Loop EPO from 1 to NEPO; % loop over the number of training epochs
3)    If SWE = 0; % select a possible network configuration
         Using a uniform random distribution, select an integer j
         in the interval [1,NEC].
         KHU := C(j); % get the number of the hidden unit to be killed
         If KHU > 0; % kill temporarily this hidden unit
            WTEMP := Woh(:,KHU); % save the outgoing weights of hidden unit KHU
            Woh(:,KHU) := 0; % set all the outgoing weights of hidden unit KHU to 0
         EndIf KHU;
      EndIf SWE;
4)    If KHU = 0; % train the net
         % the no-fault configuration case
         Train the ANN for 1 epoch using the BP algorithm.
      otherwise;
         % the fault configuration case
         Train the ANN for 1 epoch using the BP algorithm,
         keeping Woh(:,KHU) := 0.
      EndIf KHU;
5)    SWE := SWE + 1; % increment the epoch counter
6)    If SWE = SWEPO; % restore the weights of the killed hidden unit
         If KHU > 0;
            Woh(:,KHU) := WTEMP;
         EndIf KHU;
         SWE := 0; % reset this counter
      EndIf SWE;
7) EndLoop EPO
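The same switching procedure can be written compactly in NumPy, as in the sketch below. This is only an illustration of the algorithm above, not the code used in the simulations; train_one_epoch is an assumed routine that performs one epoch of standard Back-Propagation updates on the whole network, and Woh is assumed to store one column of outgoing weights per hidden unit.

    import numpy as np

    def bps_train(Woh, train_one_epoch, C, NEPO, SWEPO, rng):
        # Woh: hidden-to-output weight matrix (one column per hidden unit)
        # C:   configuration vector (0 = no fault, i > 0 = loss of hidden unit i)
        SWE, KHU, WTEMP = 0, 0, None
        for epoch in range(NEPO):
            if SWE == 0:                          # select a possible configuration
                KHU = int(C[rng.integers(len(C))])
                if KHU > 0:                       # temporarily kill hidden unit KHU
                    WTEMP = Woh[:, KHU - 1].copy()
                    Woh[:, KHU - 1] = 0.0
            train_one_epoch(Woh)                  # one epoch of BP updates
            if KHU > 0:
                Woh[:, KHU - 1] = 0.0             # keep the killed unit's weights at zero
            SWE += 1
            if SWE == SWEPO:                      # restore the killed unit and reset
                if KHU > 0:
                    Woh[:, KHU - 1] = WTEMP
                SWE = 0
        return Woh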

It is important to note that there is no unique solution for the fault tolerance
problem as it has been proposed, i.e. considering the loss of hidden units. A simple
example is enough to explain this. Consider an ANN with one input, one output and two
hidden linear units. The cost function J* can be written as:

J^* = \lambda^{[0]} E\left\{ \left( T - \left( W_{oh_1} W_{hi_1} + W_{oh_2} W_{hi_2} \right) x \right)^2 \right\}
    + \lambda^{[1]} E\left\{ \left( T - W_{oh_2} W_{hi_2} x \right)^2 \right\}
    + \lambda^{[2]} E\left\{ \left( T - W_{oh_1} W_{hi_1} x \right)^2 \right\}    (5.5)
Note that the weights appear in eq. 5.5 always in the same pairs. Therefore eq. 5.5 could
be rewritten in terms of only 2 unknowns, the scalars W1 and W2 where:

W_1 = W_{oh_1} W_{hi_1} \qquad \text{and} \qquad W_2 = W_{oh_2} W_{hi_2}

In other words, the input weights can not be separated from the output weights. In
essence, this is a problem of parameter identifiability.
5.4 - Simulations

In this section we simulate the proposed technique of switching between the


different ANN fault configurations during training in order to achieve a more fault
tolerant solution while keeping a reasonably small input-output mapping error.
The particular problem considered is a bit-mapped image recognition problem.
Figure 5.1 illustrates the 16 patterns used to train the network. They were taken from
the 8 by 8 character bitmaps used by the IBM PC XT (0-9,A-F) and were edited to
remove the redundant last row and column, since they contain only 0’s, to form the 7
by 7 bitmaps†. Each pixel of the bit-mapped image used as input is associated with an
input unit and, therefore, the input layer has 49 inputs. The output layer uses a 1-of-16
grandmother coding, where only one of the output units should be activated for each
input pattern from the training set. There is only 1 hidden layer which can have 6, 8,
10 or 12 units and there is no direct connection from the input to the output layer.
The hidden and output units use the hyperbolic tangent (tanh(x)) as the activation
function. The input and desired output signals were scaled to lie between −1 and 1. The
network weights and biases were initialized as small random values with a uniform
distribution in the interval [−1/3, 1/3]. The learning rate and momentum rate were set

Figure 5.1 - The image bitmaps used to train the ANN


† Many thanks to Peter Green for his expertise in extracting the bitmaps from the
inner depths of the IBM PC and making them available to us.
to 0.25/8 and 0.1 respectively. In order to speed up learning a value of 0.1 was added
to the derivative of the activation function as suggested by Fahlman [Fah89].
The ANN was trained using random incremental updating. The network was
trained, in all cases, using 600 epochs. Care was taken to ensure that the weight initial
values and the pattern presentation order were the same for the BP and BPS algorithms.
For the case where the BPS algorithm was used, it was assumed that all hidden
units have the same probability of failure and the no-fault configuration is as probable
as the fault configurations. The only type of fault considered is the output of only one
of the hidden units is fixed at 0. When a particular network configuration was chosen,
it was kept during training only in that epoch, i.e. SWEPO = 1.
The no-fault configuration was tested, using the same patterns shown in fig. 5.1,
at the end of every 4 epochs, making a total of 150 tests. The root-mean-square error
(RMS Error) calculated at each test is defined as:

RMS\ Error = \sqrt{\; \frac{1}{N_{out}} \sum_{pat=1}^{N_{pat}} E_{pat}^2 \;}    (5.6)

where Nout = number of output units = 16, Npat = number of patterns used during
test = 16 and
E_{pat}^2 = \left( T_{pat} - y^{[0]} \right)^{T} \left( T_{pat} - y^{[0]} \right)    (5.7)
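In code form, this error measure can be sketched as follows (NumPy, assuming the targets and network outputs are stored as arrays with one row per pattern; the sketch simply follows the reconstruction of eqs. 5.6 and 5.7 given above):

    import numpy as np

    def rms_error(T, Y):
        # T, Y: arrays of shape (Npat, Nout) with desired and actual outputs
        Nout = T.shape[1]
        E2 = np.sum((T - Y) ** 2, axis=1)   # E2_pat = ||T_pat - y_pat||^2  (eq. 5.7)
        return np.sqrt(E2.sum() / Nout)     # eq. 5.6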

Figure 5.2 shows the RMS error history for the no-fault configuration when 6,
8, 10 and 12 hidden units were used and the ANN was trained with the BP (fig. 5.2a)
or the BPS algorithm (fig. 5.2b). Table 5.1 compares the RMS error for the no-fault
configuration after training. There are no misclassifications for any of the no-fault
configurations.
After training, the RMS error for each fault configuration was also calculated
using eqs. 5.6 and 5.7, with y[0] being replaced by y[i]. Table 5.2 shows the mean and
standard deviation for the RMS error and the number of misclassifications (minimum,
mean, maximum values) when the ANN is tested with each one of the fault
configurations.
From table 5.2 we can see that, by using the BPS algorithm, the mean RMS error
for the fault configurations was reduced to less than 50% when compared with the case
when the BP algorithm was used. Observe also that: 1) for the cases 6 and 8 hidden
units the maximum number of misclassifications was reduced from 15 and 5 (out of 16
patterns) to just 1 misclassification; and 2) the standard deviation of the RMS error was
also greatly reduced indicating that the degree of importance of each hidden unit is
more uniform, i.e. the computational (or representational) load is more evenly spread
over the set of hidden units.

[Two plots of RMS Error versus Test Number, panels (a) and (b), with curves for 6, 8,
10 and 12 hidden units]
Figure 5.2 - RMS error history for the no-fault configuration for cases:
(a) using the BP algorithm; (b) using the BPS algorithm.

                                      Number of Hidden Units
RMS Error BEFORE faults (x 10-4)      6       8       10      12
BP                                   384     248     174     142
BPS                                  308     217      68      62
Ratio (BPS/BP)                      0.802   0.875   0.391   0.437

Table 5.1 - RMS errors for the no-fault configuration after training

RMS Error and number of                        Number of Hidden Units
misclassifications AFTER faults          6            8            10           12
BP:  RMS Error                      2923 (774)   1940 (270)   1024 (176)    764 (187)
     Misclass. [min,mean,max]         3,6,15      0,2.13,5     0,0.30,1       0,0,0
BPS: RMS Error                      1229 (98)     945 (185)    441 (58)      377 (71)
     Misclass. [min,mean,max]       0,0.167,1    0,0.125,1       0,0,0        0,0,0
Ratio (BPS/BP) - RMS Error         0.420 (0.127) 0.487 (0.685) 0.431 (0.33) 0.494 (0.38)

Table 5.2 - RMS error mean (standard deviation) and number of misclassifications
for the fault configurations after training.
From table 5.1 we can also observe that, surprisingly, when training was
performed with the BPS algorithm the RMS error for the no-fault configuration was also
reduced. So in this case it was possible to obtain a fault tolerant solution and at the
same time reduce the input-output mapping error. At this point we speculate that,
perhaps, by adding an extra random component to the search procedure via the
switching training method, the BP algorithm settles to a better solution, in the same way
that training with a random order pattern presentation tends to work better than with a
sequential order presentation [Zur92] or like the simulated annealing technique
[KGV83]. However, more investigation is needed to clarify this issue.

5.5 - Regularizing the Estimation Problem by Considering Fault Tolerance

In this section we present some results that, although partial, can help to
understand the consequences of defining a cost function that considers the network fault
tolerance. Training a feedforward ANN, as Saarinen et al. [SBC91] and other authors
have pointed out, can be seen as an estimation problem. In this section we show that
under certain assumptions, by giving some consideration to the network fault tolerance,
the problem of minimizing the cost function J* (eq. 5.3) can be seen as a regularized
version of the problem where we seek to minimize J (eq. 5.1). In other words, the
number of local minima may be smaller for function J* than for function J. In certain
cases, each local minimum of J* will have close to it a local minimum of J. Basically,
we are interested in the set of weights that result in a small value for the cost function
J but that are also fault tolerant solutions.
It may be possible that such understanding, when completed, will lead to a cost
function that has a unique solution. In that case there will be only one local minimum
point that is also the global minimum. If such a solution produces a small input-output
network mapping error, it may be possible then to use effectively second-order
algorithms which have a much better convergence rate than first-order algorithms such
as the Back-Propagation algorithm. Therefore the time necessary to train a network
would be greatly reduced.
5.5.1 - Considering The Input Weights Fixed


To simplify the analysis, consider that: 1) there is only one
output unit; 2) the output unit is linear; and 3) the output bias is zero. There is no loss
of generality since: 1) each output unit receives its own set of weights from the hidden
units; 2) if the nonlinear function used in the output unit is reversible (such as the
sigmoid or hyperbolic tangent functions) then we can uniquely determine the net input
of the output unit given a valid value for its output; and 3) the weight vector could be
extended to include the bias of the output unit.
In a FF ANN with just one hidden layer, if we assume that the input weights, i.e. the
weights Whi between the input layer and the hidden layer, are fixed, the original least-
mean-square problem can then be posed as finding the output unit weight vector Woh
that minimizes J where:

J = \frac{1}{2} E\left\{ \left( T - W_{oh}^{T} out_h \right)^2 \right\}    (5.8)
and Woh and outh are column vectors with NH components, where NH is the number of
hidden units. The number of network inputs and the type of nonlinearity used in the
hidden units is not important at this point since we can always calculate the vector outh.
The stationary points of J are given by ∂J/∂Woh = 0:

\frac{\partial J}{\partial W_{oh}} = -E\left\{ \left( T - W_{oh}^{T} out_h \right) out_h \right\} = 0    (5.9)

Using the property that W_{oh}^{T} out_h is a scalar, from the previous equation we can write:

E\left\{ T\, out_h \right\} = E\left\{ out_h\, out_h^{T} \right\} W_{oh}    (5.10)

The cost function J has a unique minimum if the matrix ∂2J/∂Woh2 is positive definite
where:

\frac{\partial^2 J}{\partial W_{oh}^2} = E\left\{ out_h\, out_h^{T} \right\} = R    (5.11)

and R is the correlation matrix of the output of the hidden units (we adopt the
convention that, if Y and X are column vectors, ∂Y/∂X results in a matrix whose element
at position i,j is given by ∂Yi/∂Xj). By definition the matrix R is positive semidefinite
(det R ≥ 0) [Pap84]. If the matrix R is singular, then there is more than one solution for
the weight vector Woh. This is the case, for instance, when two hidden units produce the
same output because they receive from the inputs units the same weights and have the
same bias.
On the other hand, if we consider the network fault tolerance in relation to faults
in the hidden units, we aim to minimize the cost function J*:
J^* = \frac{1}{2} \sum_{i=0}^{NH} \lambda^{[i]} E\left\{ \left( T - W_{oh}^{T} S^{[i]} out_h \right)^2 \right\}    (5.12)
where the factor (NH+1) can be interpreted as included in λ[i]. The stationary points of
J* are given by:

\frac{\partial J^*}{\partial W_{oh}} = -\sum_{i=0}^{NH} \lambda^{[i]} E\left\{ \left( T - W_{oh}^{T} S^{[i]} out_h \right) S^{[i]} out_h \right\} = 0    (5.13)

Then the cost function J* has a unique minimum if the matrix ∂2J*/∂Woh2 is positive
definite where:

\frac{\partial^2 J^*}{\partial W_{oh}^2} = \sum_{i=0}^{NH} \lambda^{[i]} E\left\{ S^{[i]} out_h \left( S^{[i]} out_h \right)^{T} \right\} = F \otimes R + F_d    (5.14)

Defining λ = λ[0]+λ[1]+λ[2]+...+λ[NH], the matrices F and Fd are defined as Fij = λ−λ[i]−λ[j]
and Fd = diag(λ[1] R11, λ[2] R22, ..., λ[NH] RNH,NH), and the symbol ⊗ denotes element-by-
element matrix multiplication.
Assume that the true system target output T was generated by:
T = \left( W_{oh}^{0} \right)^{T} out_h + \varepsilon    (5.15)

where Woh0 is the true weight vector and ε is a statistically described perturbation such
as measurement noise. Then, assuming that ε is uncorrelated with outh:

E\left\{ T\, out_h \right\} = E\left\{ out_h\, out_h^{T} \right\} W_{oh}^{0} + E\left\{ \varepsilon\, out_h \right\} = R\, W_{oh}^{0}    (5.16)

\sum_{i=0}^{NH} \lambda^{[i]} E\left\{ T\, S^{[i]} out_h \right\} = L_d\, R\, W_{oh}^{0}    (5.17)

where Ld is a NH-by-NH diagonal matrix defined as Ld = diag(λ-λ[1], λ-λ[2], ..., λ-λ[NH]).


Finally, eq. 5.13 can be rewritten as:

L_d\, R\, W_{oh}^{0} = \left( F \otimes R + F_d \right) W_{oh}    (5.18)

In the special case when all the faulty configurations are equally probable, i.e.
λ[i] = λ* > 0 for 1 ≤ i ≤ NH, eq. 5.18 can be simplified and written as:

\left( \lambda^{[0]} + (NH-1)\lambda^* \right) R\, W_{oh}^{0} = \left[ \left( \lambda^{[0]} + (NH-2)\lambda^* \right) R + \lambda^* R_d \right] W_{oh}    (5.19)

where Rd = diag(R11, R22, ..., RNH NH). Note that in this special case if Rii > 0 for all i and
λ* > 0, we can guarantee that the cost function J* has a unique minimum solution since
the matrix ∂2J*/∂Woh2 is positive definite. This happens because, even if the matrix R is
singular, the matrix ∂2J*/∂Woh2 is defined as the summation of 2 symmetric matrices,
where one of them is positive semidefinite and the other positive definite.
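This can be checked numerically. The short NumPy sketch below (an illustration only; the samples of outh, and the values λ[0] = λ* = 1, are arbitrary assumptions) constructs a singular R by duplicating a hidden unit and verifies that the matrix on the right-hand side of eq. 5.19 is nevertheless positive definite:

    import numpy as np

    NH = 3
    out_h = np.array([[1.0, 1.0, 0.5],      # columns 1 and 2 are identical, i.e.
                      [0.2, 0.2, 1.0],      # hidden units 1 and 2 are duplicated
                      [0.7, 0.7, 0.3]])     # (rows = samples, columns = hidden units)
    R = (out_h.T @ out_h) / out_h.shape[0]  # correlation matrix E{out_h out_h^T}: singular

    lam0, lam_star = 1.0, 1.0               # assumed lambda[0] and lambda*
    Rd = np.diag(np.diag(R))
    H = (lam0 + (NH - 2) * lam_star) * R + lam_star * Rd   # matrix of eq. 5.19

    print(np.linalg.eigvalsh(R).min())      # ~0: R is singular
    print(np.linalg.eigvalsh(H).min())      # > 0: the regularized matrix is positive definite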
It is interesting to study how the relationship between the true parameter vector
Woh0 and its estimate Woh varies when the coefficients λ[0] and λ* change. Assume that
the matrix R can be partitioned as:

R = \begin{bmatrix} R^{A} & 0 \\ 0 & R^{B} \end{bmatrix}

where RA and RB are respectively singular and non-singular square matrices and
dim RA = N1; dim RB = N2. Assume that the first N1 components of the vector outh are
equal to each other. Therefore RAij = γ > 0, RBd = diag(RB) and

R_d = \begin{bmatrix} \gamma\, I_{N_1} & 0 \\ 0 & R_d^{B} \end{bmatrix}

where the notation IM denotes the identity matrix with dimension M and A = diag(B)
means that A is a diagonal matrix and Aii = Bii. From eq. 5.19 we can write:

W_{oh} = \begin{bmatrix} \left( Q^{A} \right)^{-1} R^{A} & 0 \\ 0 & \left( Q^{B} \right)^{-1} R^{B} \end{bmatrix} W_{oh}^{0}    (5.20)

where dim QA = dim RA = N1; dim QB = dim RB = N2 and:

\alpha_1 = \frac{\lambda^{[0]} + (NH-2)\lambda^*}{\lambda^{[0]} + (NH-1)\lambda^*}    (5.21)

\alpha_2 = \frac{\lambda^*}{\lambda^{[0]} + (NH-1)\lambda^*}    (5.22)

Q^{A} = \alpha_1 \gamma \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix} + \alpha_2 \gamma\, I_{N_1}    (5.23)

Q^{B} = \alpha_1 R^{B} + \alpha_2 R_d^{B}    (5.24)

Note that if RB is a diagonal matrix then:


R_d^{B} = R^{B} \;\Rightarrow\; Q^{B} = R^{B} \;\Rightarrow\; \left( Q^{B} \right)^{-1} R^{B} = I_{N_2}

Applying the Matrix Inversion Lemma [WeZa91] we can find an analytic expression for
the inverse of matrix QA and after some algebraic manipulation we find that:

\left( Q^{A} \right)^{-1} R^{A} = \beta \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix}    (5.25)

where the scalar β is defined as:

\beta = \frac{\lambda^{[0]} + \lambda^* (NH-1)}{N_1 \lambda^{[0]} + \lambda^* \left( N_1 (NH-2) + 1 \right)}    (5.26)

Note that β does not depend on γ or the elements of the matrix RB, as long as the
diagonal of RB contains only positive elements. The dimension of RB affects β since it
determines N2 and indirectly NH. Finally, from eqs. 5.20 and 5.26 we have:
W_{oh_i} = \beta \sum_{j=1}^{N_1} W_{oh_j}^{0}, \qquad 1 \le i \le N_1    (5.27)

Figure 5.3 shows the parameter β as a function of the ratio λ[0]/λ* for NH = 5
and N1 = 3. Note that as the ratio λ[0]/λ* increases: a) β tends asymptotically to 1/N1
(eq. 5.26); and b) QB tends to RB (eq. 5.22).

Figure 5.3 - The parameter β as a function of λ[0]/λ* for NH = 5 and N1 = 3

As expected, in the limit such parameters provide
the correct input-output behaviour since only the no-fault configuration is being
considered. However, the parameters will converge to the values that also give the best
fault tolerant solution.
It is important to consider the penalty paid for including the fault tolerance
criterion. Such a penalty is paid through the output error, or equivalently the parameter
estimation error, for the no-fault configuration. The relative parameter estimation error
for the first N1 parameters of the weight vector Woh can be defined as:

E_W = \frac{\beta - \frac{1}{N_1}}{\frac{1}{N_1}} = \frac{N_1 - 1}{N_1 \frac{\lambda^{[0]}}{\lambda^*} + N_1 (NH - 2) + 1}    (5.28)
and EW is maximum when λ[0] = 0, i.e. the probability that one and only one of the
hidden units will fail (with its output fixed to zero) is 1. Note that when N1 is large,
max(EW) is approximately 1/NH, i.e. the upper bound for the parameter estimation error
tends to zero as the number of parameters increases. If we consider that the matrix RB
is diagonal then the estimation error exists only for the first N1 parameters of the weight
vector Woh. However, the upper bound for the parameter estimation error for the last N2
parameters also decreases as N1 or N2 increases, even if the matrix RB is non-diagonal,
since QB also tends to RB in this situation (eqs. 5.22-5.24). This shows that, even if the
cost function does not contain the no-fault model, it is possible to achieve a small input-
output mapping error for the no-fault model.

5.5.2 - Considering The Output Weights Fixed


In the previous section we considered the input weights, i.e. the layer of weights
that connect the input layer to the hidden layer, as fixed weights and we looked for a
set of output weights that minimize the cost function J*. In this section we consider that
the output weights, i.e. the weights that connect the hidden layer to the output layer, are
fixed and that there is only one input unit. The same analysis is valid for networks with
more than one input unit since for each input unit we can define an independent weight
vector that links the input unit to the hidden units.
Consider for a moment that the hidden units are linear and that there is only one
output unit. The network output is therefore given by:

y = W_{oh}^{T} W_{hi}\, x    (5.29)

where Whi is a column vector with NH components. From eq. 5.12, the cost function J*
is then:
J^* = \frac{1}{2} \sum_{i=0}^{NH} \lambda^{[i]} E\left\{ \left( T - W_{oh}^{T} S^{[i]} W_{hi}\, x \right)^2 \right\}    (5.30)
Since the matrix S[i] is symmetric, then:
W_{oh}^{T} S^{[i]} W_{hi} = W_{hi}^{T} S^{[i]} W_{oh}
and as in eq. 5.13 we have:

\frac{\partial J^*}{\partial W_{hi}} = -\sum_{i=0}^{NH} \lambda^{[i]} E\left\{ \left( T - W_{hi}^{T} S^{[i]} W_{oh}\, x \right) S^{[i]} W_{oh}\, x \right\} = 0    (5.31)

Following eq. 5.14 finally:

\frac{\partial^2 J^*}{\partial W_{hi}^2} = \sum_{i=0}^{NH} \lambda^{[i]} E\left\{ S^{[i]} W_{oh}\, x \left( S^{[i]} W_{oh}\, x \right)^{T} \right\} = F \otimes R^{x} + F_d^{x}    (5.32)

where:

R^{x} = E\left\{ W_{oh}\, x \left( W_{oh}\, x \right)^{T} \right\} = W_{oh} W_{oh}^{T} \, E\left\{ x^2 \right\}    (5.33)

and Fdx = diag(λ[1] Rx11, λ[2] Rx22, ..., λ[NH] RxNH,NH). As in the previous case, when all the

fault configurations are equally probable, and if E [ x2 ] ≠ 0 and the vector Woh contains
only non-zero elements, then the cost function has a unique solution. In other words,
under the above conditions, for every given vector Woh there exists a unique vector Whi
that minimizes the cost function J* since the matrix ∂2J*/∂Wh2i is positive definite.
It should be possible to extend the above result to the case where there is more
than one output unit and the hidden units use a sigmoidal function.

5.5.3 - Considering Input and Output Weights as Variables


At the end of section 5.3 we showed that there is no unique set of weights for
the fault tolerant problem when both layers of weights are variable, at least with linear
hidden units. Basically, this is a parameter identifiability problem since the input weights
could not be separated from the output weights.
One simple way to solve this parameter identifiability problem is to extend the
switching idea. Consider the problem where we want to find the parameters a and b that
minimize the following cost function:

J_1 = \frac{1}{2} E\left\{ \left( T - a\,b\,x \right)^2 \right\}    (5.34)
where T is the true system output, x is the system input and T = k x. If we write the
stationary conditions, we have:

\begin{bmatrix} \partial J_1 / \partial a \\ \partial J_1 / \partial b \end{bmatrix} = (ab - k)\,\rho \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}    (5.35)

where ρ = E{x2}. If we assume that ρ ≠ 0, then the stationary conditions imply that


k = a b, i.e. there are an infinite number of solutions for (a,b) that result in the correct
input-output behaviour. On the other hand, consider now the problem of minimizing J2
where:

J_2 = \frac{\lambda^{[0]}}{2} E\left\{ (T - abx)^2 \right\} + \frac{\lambda^{[a]}}{2} E\left\{ (T - bx)^2 \right\} + \frac{\lambda^{[b]}}{2} E\left\{ (T - ax)^2 \right\}    (5.36)
In this cost function the switching constrains each parameter in turn to one, instead of
zero as before. The stationary conditions are now:

\begin{bmatrix} \partial J_2 / \partial a \\ \partial J_2 / \partial b \end{bmatrix} = \rho \begin{bmatrix} b\,\lambda^{[0]} & \lambda^{[b]} & 0 \\ a\,\lambda^{[0]} & 0 & \lambda^{[a]} \end{bmatrix} \begin{bmatrix} ab - k \\ a - k \\ b - k \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}    (5.37)

Assuming again that ρ ≠ 0, from the stationary conditions we can write:

k = \frac{a \left( b^2 \lambda^{[0]} + \lambda^{[b]} \right)}{b\,\lambda^{[0]} + \lambda^{[b]}} = \frac{b \left( a^2 \lambda^{[0]} + \lambda^{[a]} \right)}{a\,\lambda^{[0]} + \lambda^{[a]}}    (5.38)
Assuming that λ[a] = λ[b] = λ*, then:

\frac{\lambda^{[0]}}{\lambda^*} \left[ b^2 (a - 1) - a^2 (b - 1) \right] + a - b = 0    (5.39)
The solution for eq. 5.39 is a = b. Substituting this result in eq. 5.38 and assuming that
λ[0] >> λ*, we have that a b ≈ k. It is interesting to note that we would get this same
solution if the cost function considers the length of the parameter vector as in:
J = \frac{\lambda^{[0]}}{2} E\left\{ (T - abx)^2 \right\} + \frac{\lambda^*}{2} \left( a^2 + b^2 \right)    (5.40)
Such a cost function could be interpreted as selecting, from the multiple solutions that
result in the correct input-output behaviour, the one that has the smallest length.
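The conclusion can also be checked numerically. The throwaway sketch below runs plain gradient descent on J2 for assumed values k = 2, ρ = 1 and a moderate ratio λ[0]/λ* = 10; the iterates converge to a ≈ b with the product a·b close to k:

    k, rho = 2.0, 1.0                 # true gain and E{x^2} (assumed values)
    lam0, lam_s = 1.0, 0.1            # lambda[0] and lambda* (lambda[0]/lambda* = 10)
    a, b, eta = 3.0, 0.4, 0.05        # arbitrary starting point and learning rate

    for _ in range(10000):
        # gradients of J2, averaged analytically over x (cf. eq. 5.37)
        da = rho * (lam0 * b * (a * b - k) + lam_s * (a - k))
        db = rho * (lam0 * a * (a * b - k) + lam_s * (b - k))
        a -= eta * da
        b -= eta * db

    print(a, b, a * b)                # a and b nearly equal, a*b close to k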
By switching off and on the outputs of the hidden units we solve the problem
of redundancy across the hidden layer, what we could call horizontal redundancy. When
we switch the input and output weights to one, instead of to zero, we solve the problem
of redundancy through the hidden layer, i.e. the vertical redundancy. Both methods can
be combined by defining for each hidden unit: 1) a probability the output of the unit will
be fixed to zero (fault type 1); 2) a probability that its input weights will be fixed to one
(fault type 2); 3) a probability that its output weights will be fixed to one (fault type 3);
and 4) a probability that no failure will occur. One cost function that considers such
possibilities is:
J^* = \lambda^{[0]} E\left\{ \left\| T - y^{[0]} \right\|^2 \right\} + \sum_{i=1}^{NH} \sum_{j=1}^{3} \lambda^{[ij]} E\left\{ \left\| T - y^{[ij]} \right\|^2 \right\}    (5.41)

where y[0] = network output for the no-fault model, λ[0] = probability of the no-fault
model, y[i j] = network output when fault i j occurs, λ[i j] = probability of type j fault
occurring to hidden unit i. It is implicitly assumed that no more than one hidden unit
is faulty at any given time.

5.6 - Conclusions

In this chapter we presented the BPS algorithm. Such an algorithm can be used
to obtain the weights of a feedforward ANN which performs the desired input-output
mapping with a small error and is robust to loss of hidden units. In the example
simulated, the BPS algorithm clearly outperformed the BP algorithm in terms of
achieving the desired robustness without increasing the desired mapping error.
The mathematical foundation of the BPS algorithm was analyzed and it was
shown that, in some situations, the switching regularizes the mapping (or estimation)
problem. Such analysis opened the possibility of generalizing the switching method.
Future investigations may be able to show that such generalized switching methods
result in a cost function that has a unique solution. In such cases there would be only
a local minimum that is also the global minimum and second-order minimization
algorithms could be used effectively. Consequently the training of the feedforward


ANNs would be much more reliable and faster.
In the next chapter we show how non-standard ANNs can be used to solve the
problem of extremum control of asymmetrical functions with finite dither.
Chapter 6 - Extremum Control Using Artificial Neural Networks

Optimization problems in engineering can often be posed as the maximization


of a static and noisy performance index. The task of maximizing such a performance
index is known as extremum control ([WeZa91], [WeSc90]). Basically the extremum
control task is to climb a hill, given only noisy measurements of the height at different
locations.
This chapter shows how artificial neural network concepts can be used to solve
the extremum control problem with a static asymmetric performance index ([NZM92],
[NZM93]). A novel non-standard neural network structure is proposed and used to
develop a model of a static system using the available noisy measurements of the
performance index. This ANN model has the nice properties that: a) the input that
maximizes its output is also adapted and is readily available; and b) the model is
flexible enough to accommodate non-quadratic functions. Therefore the optimum input
for static systems with an asymmetric performance index can be estimated with a small
error, even if the system is excited by a dither with a large amplitude. The standard
Back-Propagation algorithm, with the necessary modifications, is used to adapt the ANN
parameters.
One possible extension to the multi-input case is also proposed and two
simulation examples are shown. The first simulation example is for the single input case
while the second example considers a two-input case where the goal is to find the
location of an object within a larger image.

6.1 - The Extremum Control Problem

An extremum controller is basically a self-tuning optimizer. It continuously


adjusts the input of the static system under observation such that the single output of
this system is maximized. In dynamical feedback control systems the extremum


controller can be used to determine the setpoint by finding the optimum operating point
and to track it when it changes. The task of the feedback controller is then to keep the
system output close to the setpoint. To justify the ‘static’ assumption the sampling
period for the extremum controller has to be large enough such that (in relation to the
extremum controller) the dynamics of the system can be ignored.
The differences between the problems covered by extremum control and classical
optimization theory [GMW81] are: a) the function to be optimized by the extremum
controller is not known a priori (although we need to make some general assumptions
about its behaviour); and b) the measurements of the system output are noisy.
Examples of practical application of extremum controllers are: a) adaptive
optimization of the spark ignition angles of an automotive engine for different conditions
of load and speed ([WeSc90], [ScWe90]); b) control of the air-fuel ratio for optimal
combustion in an internal combustion engine, where the optimum point depends on the
temperature and fuel quality [DrLi51]; c) adjustment of the blade angle of windmills and
water turbines of the Kaplan type in order to give maximum output power [AsWi89];
and d) optimization of the biomass productivity of continuous fermentors [GoYd89].
Extremum control systems can be classified into four types ([Ste80], [Bla62]):
a) Perturbation systems, where a small periodic test signal is added to the
input signal and its effect on the output is used to derive local
information about the slope of the performance index;
b) Switching systems, where the input changes at a constant rate, until the
extremum is passed; then the direction of the input is reversed
according to some fixed rule;
c) Self-driving systems, where the output measurements are used directly
to determine the input signal, for instance, when the time derivative of
the output signal is used to drive the input signal via an integrator and
other auxiliary circuits;
d) Model-based systems, where measurements of the input and output
signals are used by some identification procedure to build a model of the
performance index over a large region; the model is then used to
determine the optimum input.
In this chapter we will deal with model-based extremum control systems. This type of
controller exploits the noise suppression properties of recursive parameter estimation


algorithms by using them to update the free parameters of a chosen model. This avoids
the need to use the noisy output measurements directly to estimate the local derivatives
of the performance index (using, for instance, finite difference methods [GMW81]). If
the chosen model matches the observed system and if the recursive parameter estimation
algorithm produces sensible parameter values, then the optimum point for the model
should be close to the optimum point of the observed system.
Denoting by x0 the input value that maximizes the system output (the optimum
input value) the model-based extremum control approach can be summarized in the
following steps:
1) collect a set of input-output data points around the estimated x0,
2) using the new data, update the system model,
3) update the estimate of x0,
4) go back to step 1).
Step 3 can be greatly simplified by a proper selection of the model structure. This is the
case if we can constrain our model structure such that, for a given set of model
parameters, x0 can be easily calculated or if x0 is one of the model parameters. For
instance, the first case happens when we assume that the uncorrupted system output y
is modelled by y = a x2 + b x + c. Then, if the parameters a and b are estimated, x0 can
be calculated as x0 = −b/(2 a). The second case happens if we assume as the system
model y = k (x − x0)2 + c, so when the model is updated, the estimate of x0 is
automatically updated as well, since it is part of the model.
Another point that must be taken into consideration is that the selection of a
particular model structure will affect the set of algorithms that can be used to estimate
the parameters of the model. For instance, for the above example, all parameters of the
first model structure are related linearly to the output, so we have the option of using
the Recursive Least Squares algorithm (RLS). We do not have this option for the second
model structure and, in general, more complicated estimation algorithms will have to be
used. On the other hand, the second model structure avoids some problems such as a
possible division by zero in the calculation of x0. However, if we write the second model
in incremental form, we can still apply the RLS algorithm [WeSc90].
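As a small illustration of the first model structure (a throwaway NumPy sketch using batch least squares rather than the recursive algorithm discussed above; all numerical values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-4.0, 0.0, size=200)                 # inputs around the optimum
    y = -0.5 * (x + 2.0) ** 2 + 1.0 + 0.1 * rng.standard_normal(x.size)   # noisy output

    Phi = np.column_stack([x ** 2, x, np.ones_like(x)])  # regressors for y = a x^2 + b x + c
    a, b, c = np.linalg.lstsq(Phi, y, rcond=None)[0]

    x0_hat = -b / (2.0 * a)                              # estimated optimum input
    print(x0_hat)                                        # close to the true value -2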
Figure 6.1 illustrates the basic algorithm for the extremum controller. For
simplicity, we assume that the output noise is additive and has zero mean.

Figure 6.1 - The Extremum Controller Algorithm

A dither, i.e.
a zero mean random test perturbation signal, is added to the estimated value of x0 in
order to force an exploration of the local shape of the performance index. In some cases it
is possible to show that the dither signal assures persistent excitation of the system
([WeSc90], [WeZa91], [BoZa90]).

6.2 - The Quadratic System Model

One possible approach is to try to model the true system with a static quadratic
model and to use algorithms derived from linear estimation theory to estimate the
parameters of such a model [WeZa91]. One argument in favour of the quadratic model
is that around any point on a smooth function we can always fit a quadratic model,
assuming that the region of interest is small enough. In terms of an expansion in Taylor
series this is equivalent to assuming that we can neglect the third and higher derivatives
of the function that expresses the true input-output relationship.
As a consequence of such assumptions, one may be interested in the robustness
of such algorithms, i.e. under what circumstances will our estimate of x0 converge to the
true x0, the optimum input value?
Bozin and Zarrop [BoZa90] have proved, using the ODE method, that for a
quadratic model, if the estimate of x0 converges, it will converge to a value close to the
optimum value providing the unmodelled nonlinearities or dynamics are small.
Moreover, they prove that in the presence of even large unmodelled nonlinearities, the
estimate of x0 will converge to the optimum value for any static true performance index
with a single extremum point if: a) the true performance index is symmetric or b) the
dither amplitude decays towards zero at a sufficiently slow rate. So, if the true
performance index has a large asymmetry and the dither amplitude is kept large or
decays to zero too quickly, the use of a quadratic model will inevitably result in a bias
in the estimate of the optimum input. This happens simply because the quadratic model
is not flexible enough to reflect asymmetry and a mismatch between the true system
model and the estimated model is inevitable.
It is this last result that we improve in this chapter. The basic idea is to modify
the standard feedforward neural network model such that we can estimate x0 with small
error when the true performance index is an unknown asymmetric function with a single
extremum point and dither amplitude remains large. This can be useful in situations
where it is undesirable to allow the dither amplitude to decay to zero, for instance, when
we want to improve the tracking ability of our extremum controller or when the true
performance index is "coarse-grained" as in the second simulation example in this
chapter.

6.3 - Adapting the Artificial Neural Network to Extremum Control

Considering for the moment the single input extremum control problem, if we
use a FF ANN in figure 6.1 with hidden units executing a quadratic function of their net
input, we will have exactly the original quadratic formulation that we are trying to
escape from. From the representation point of view there would be no advantage in
having more than one hidden unit, because there would always be an equivalent
quadratic FF ANN with one hidden unit performing the same symmetric input-output
mapping.
If we use a FF ANN with sigmoidal functions in the hidden layer as the system
model, we will need an optimization algorithm to find the input that maximizes the FF
ANN output, since such a FF ANN does not have an obvious optimum input.
Another option would be to use a FF ANN with Radial Basis Functions, such as
gaussian units, in the hidden layer and to force all hidden units to have the same center.
Because the network output is a linear combination of the outputs of the hidden units,
the network output would be maximum when the output of the hidden units is
maximum. Since all hidden units have the same center position, their outputs would be
maximum at the same input. Therefore the optimal network input corresponds to the
common center of the hidden units. By increasing the number of hidden units and
allowing them to have different widths, such a FF ANN can execute an input-output
mapping that a similar ANN with a smaller number of hidden units can not imitate.
However, since a RBF is a symmetric function, the input-output mapping of such ANNs
can only be a symmetric function.
From the above discussion we can see that we need the hidden units of the FF
ANN to execute a single input-single output function f(x) with the following properties:
P.1 - f (x) → 0 when x → ± ∞
P.2 - If x is finite, d f (x)/dx = 0 only when x = 0
P.3 - f (x) ≠ f (−x)
P.4 - f (x) ≥ 0
So, in the x space, f (x) is an asymmetric local function with only one extremum and it
has only non-negative values. An example of a function with such properties is:

f(x) = \frac{2 + v^2}{1 + \exp\left( -v^2 x \right) + v^2 \exp(x)}    (6.1)
Figure 6.2 shows how the asymmetry of such a function changes for three values of the
v parameter. For v = ± 1, this function is symmetric around x = 0. The numerator 2 + v2
is just a scaling factor such that the maximum value of the function is equal to 1. Note
as well that since v is squared in eq. 6.1, the value of the function is the same for v
and −v.
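A few lines of NumPy are enough to evaluate eq. 6.1 and check the properties P.1-P.4 numerically (an illustration only; the chosen value v = 2 is arbitrary):

    import numpy as np

    def f(x, v):
        # asymmetric hidden-unit function of eq. 6.1
        return (2.0 + v ** 2) / (1.0 + np.exp(-v ** 2 * x) + v ** 2 * np.exp(x))

    x = np.linspace(-10.0, 10.0, 2001)
    y = f(x, 2.0)
    print(y[0], y[-1])                         # P.1: close to 0 at both ends
    print(x[np.argmax(y)], y.max())            # P.2: single maximum, at x = 0, value 1
    print(np.isclose(f(1.0, 2.0), f(-1.0, 2.0)))   # P.3: asymmetric, so this is False
    print((y >= 0).all())                      # P.4: non-negative everywhere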
The proposed FF ANN executes the following equations, where i = 1, 2, ..., NH:

[Plot of f(x) versus x for parameter values v = ±0.3, ±1 and ±2]

Figure 6.2 - The asymmetric function used in the hidden units with different parameters v
Figure 6.3 - The diagram of the FF ANN used in the extremum controller

net_i^h = w_i^h \left( x - x_0 \right)    (6.2)

out_i^h = f_i\left( net_i^h \right)    (6.3)

y^{NN} = bias + \sum_{i=1}^{NH} \left( w_i^{ou} \right)^2 out_i^h    (6.4)

where fi denotes f in eq. 6.1 with v replaced by vi; NH = number of hidden units;
wh = weights from the input to the hidden units (input weights); wou = weights from the
hidden units to the output (output weights); outh = output of the hidden units. Figure 6.3
shows a diagram of such a FF ANN.
Observe in the above equations that, in contrast to a standard FF ANN using
sigmoidal (squashing) functions: a) the input x and the parameter x0 use input weights
with the same absolute value but with opposite signs; b) there is no bias term for the
hidden units, only for the output unit; and c) the output weights are squared such that
the output of the hidden units are always weighted by non-negative numbers. By
squaring the output weights we are including the assumption that the second derivative
of true function is always non-positive (true function is a hill, not a valley), therefore
we avoid the possibility that the contribution of any hidden unit would be, even
temporarily before convergence is achieved, used to approximate a valley. Consequently,
if there is at least one hidden unit with a non-zero input weight, a non-zero output
weight and non-zero associated parameter v, then x0 is the only value of x that
maximizes the network output and the network output function has only one extremum
point. The formal proof of this important result is delayed until section 6.6 where it is
proved for the multi-input case (Theorem 6.1).


In relation to estimating the network parameters (training the network) the
difference is that the output value of one of the input units (x0) and the coefficients of
asymmetry of the hidden units (v) are extra network parameters that need to be adjusted
as well.
Each of the ANN parameters executes a different role: the input weights and the
coefficients of asymmetry control the width and the asymmetry of the response of each
hidden unit, the output weights control the influence of each hidden unit in the ANN
output, the output bias and x0 control respectively the vertical and horizontal shift of the
ANN response.
The number of parameters that need to be adjusted by a training algorithm are:
NH input weights + NH output weights + NH v’s + 1 bias + 1 x0, making a total of
3 NH + 2 parameters.
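In code form the forward pass of eqs. 6.2-6.4 is only a few lines. The NumPy sketch below is an illustration only; the parameter values are arbitrary:

    import numpy as np

    def f(net, v):
        # asymmetric hidden-unit function of eq. 6.1
        return (2.0 + v ** 2) / (1.0 + np.exp(-v ** 2 * net) + v ** 2 * np.exp(net))

    def ann_output(x, x0, wh, wou, v, bias):
        net_h = wh * (x - x0)                     # eq. 6.2
        out_h = f(net_h, v)                       # eq. 6.3
        return bias + np.sum(wou ** 2 * out_h)    # eq. 6.4 (squared output weights)

    NH = 6
    rng = np.random.default_rng(0)
    wh, wou, v = rng.normal(size=NH), rng.normal(size=NH), rng.normal(size=NH)
    print(ann_output(1.5, x0=-2.0, wh=wh, wou=wou, v=v, bias=0.4))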

6.4 - Training the Neural Network

The Back-Propagation algorithm (chapter 2) can be directly applied to the


proposed FF ANN, with some extra equations added to take into consideration the non-
standard ANN structure. The basic idea is, at each time step k, to change the ANN
parameters in the direction that decreases the cost function E given by:

E = \frac{1}{2} \left( y - y^{NN} \right)^2 = \frac{1}{2} e^2    (6.5)
Denoting by P the vector that contains all ANN parameters (NH input weights, NH
output weights, NH v’s, bias and x0) and by p one of its elements, then:

p  ∂E 
pk p k ηk   (6.6)
 ∂p  P Pk
1

where ηpk is the learning rate for the ANN parameter p at time step k and the term ∂E/∂p
is calculated analytically using the chain rule and evaluated using the numerical
parameter values of P at time step k. For instance, the term ∂E/∂p for x0 is:

\frac{\partial E}{\partial x_0} = \frac{\partial E}{\partial y^{NN}} \sum_{i=1}^{NH} \frac{\partial y^{NN}}{\partial out_i^h} \frac{\partial out_i^h}{\partial net_i^h} \frac{\partial net_i^h}{\partial x_0}    (6.7)

or:

\frac{\partial E}{\partial x_0} = e \sum_{i=1}^{NH} \left( w_i^{ou} \right)^2 \frac{\partial out_i^h}{\partial net_i^h}\, w_i^h    (6.8)

Analogously, the following equations apply for the other network parameters, where
i = 1, 2, ..., NH:

\frac{\partial E}{\partial v_i} = -e \left( w_i^{ou} \right)^2 \frac{\partial out_i^h}{\partial v_i}    (6.9)

\frac{\partial E}{\partial w_i^h} = -e \left( w_i^{ou} \right)^2 \frac{\partial out_i^h}{\partial net_i} \left( x - x_0 \right)    (6.10)

\frac{\partial E}{\partial w_i^{ou}} = -2\, e\, w_i^{ou}\, out_i^h    (6.11)

\frac{\partial E}{\partial bias} = \frac{\partial E}{\partial y^{NN}} \frac{\partial y^{NN}}{\partial bias} = -e    (6.12)
where:

\frac{\partial out_i^h}{\partial net_i} = f_i^2 \, \frac{v_i^2}{2 + v_i^2} \left[ \exp\left( -v_i^2\, net_i \right) - \exp\left( net_i \right) \right]    (6.13)

\frac{\partial out_i^h}{\partial v_i} = f_i \, \frac{2 v_i}{2 + v_i^2} \left[ 1 - f_i \left( \exp\left( net_i \right) - net_i \exp\left( -v_i^2\, net_i \right) \right) \right]    (6.14)

and fi denotes f(neti) which is given by eq. 6.1 with v replaced by vi.
Finally, the Neural Network Extremum Controller algorithm can be summarized
in the following steps:
1) Initialize the ANN parameter vector P = P0;
2) Set k = 1;
3) Generate ditherk;
4) Set the system input to: xk = x0,k-1 + ditherk;
5) Collect the true system output yk;
6) Use eqs. 6.2-6.4 to calculate the ANN output yNNk;
7) Use eqs. 6.6-6.12 to update the ANN parameter vector P, i.e. update
   Pk-1 to Pk;
8) Increment the time step counter: k = k + 1, and loop back to step 3.
Note that x0 is one of the elements of the vector P.
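As an illustration of one update, the sketch below applies eqs. 6.6 and 6.8 to the parameter x0 for a single measurement (NumPy; f and its derivative follow eqs. 6.1 and 6.13 as reconstructed above, and all numerical values are arbitrary):

    import numpy as np

    def f(net, v):
        return (2.0 + v ** 2) / (1.0 + np.exp(-v ** 2 * net) + v ** 2 * np.exp(net))

    def df_dnet(net, v):
        # derivative of the hidden-unit output with respect to its net input (eq. 6.13)
        fi = f(net, v)
        return fi ** 2 * (v ** 2 / (2.0 + v ** 2)) * (np.exp(-v ** 2 * net) - np.exp(net))

    wh = np.array([1.0, -0.8, 0.5]); wou = np.array([0.7, 1.1, -0.4])
    v = np.array([0.5, 1.5, -2.0]);  bias, x0, eta = 0.4, -4.0, 0.1

    x, y = -1.0, 0.3                           # one input and the measured system output
    net_h = wh * (x - x0)
    out_h = f(net_h, v)
    y_nn = bias + np.sum(wou ** 2 * out_h)     # eqs. 6.2-6.4
    e = y - y_nn                               # output error of eq. 6.5

    dE_dx0 = e * np.sum(wou ** 2 * df_dnet(net_h, v) * wh)   # eq. 6.8
    x0 = x0 - eta * dE_dx0                                    # gradient step of eq. 6.6
    print(y_nn, dE_dx0, x0)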

6.5 - Simulation of a Single Input Example

The following simulation of the proposed single input ANN extremum controller
illustrates the above concepts.
The true system output is expressed by the following smooth asymmetric
function:

y = \begin{cases} -a_1 \left( x - x_0 \right)^2, & \text{if } x \le x_0 \\ -a_2 \left( x - x_0 \right)^2, & \text{if } x > x_0 \end{cases}    (6.15)

The measurements of the true system output are corrupted by zero mean gaussian noise.
In this simulation we use: a1 = 0.25, a2 = 0.1, x0 = −2 (the true optimum input), standard
deviation of the output noise = 0.1.
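A minimal sketch of this measurement model (NumPy; the constants are those listed above and the noise is assumed additive Gaussian):

    import numpy as np

    a1, a2, x0_true, noise_sd = 0.25, 0.1, -2.0, 0.1
    rng = np.random.default_rng(0)

    def true_output(x):
        # asymmetric performance index of eq. 6.15
        a = np.where(x <= x0_true, a1, a2)
        return -a * (x - x0_true) ** 2

    def measure(x):
        return true_output(x) + noise_sd * rng.standard_normal(np.shape(x))

    x = x0_true + rng.uniform(-3.0, 3.0, size=5)   # dithered inputs, as in the simulation
    print(measure(x))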
To construct the ANN, 6 hidden units were used and the output weights, the
input weights and the parameters v were initialized by choosing, with equal probability,
1 or −1 and then adding a small random number with zero mean and uniformly
distributed on the interval [−l l ], where l = 0.4 for the input and output weights and
l = 0.3 for the parameters v. Additionally, the initial output weights were then divided
by 3, in order to limit the initial maximum ANN output. The parameter bias was set to
0.4 and the initial estimate of x0 to −4.
The simulation covered 400 time steps (iterations), and the dither was generated
as a random number with uniform distribution in the interval [−3 3] throughout the
simulation, in other words, with no decay in the dither amplitude. A global decreasing
learning rate η was defined for each time step k as:
for 1 ≤ k ≤ 99: ηk = 1.00*0.6;
for 100 ≤ k ≤ 199: ηk = 0.75*0.6;
for 200 ≤ k ≤ 299: ηk = 0.50*0.6;
for 300 ≤ k ≤ 400: ηk = 0.25*0.6;
The learning rates for the ANN x0, the input weights and the parameters v were set at
each time step to be equal to the global learning rate. The learning rates for the output
weights and the output bias were set at each time step to be a tenth of the global
learning rate. The basic idea is to have different learning rates for different types of
parameters since they perform different tasks, as was explained in section 6.3.

[Plot: the true function (-) with the ±1 standard deviation noise band and the ANN
approximation (-*-), before and after training]

Figure 6.4 - The true function with the noise band (±1 standard deviation)
and the ANN approximation (before and after training)
At each time step, the ANN parameter routine was executed 3 times using the
above learning rates, i.e. the initial point in the parameter space that is used to calculate
the direction of change was updated twice within each time step.
Figure 6.4 shows the true function and the ANN approximation before and after
being trained. Figures 6.5 and 6.6 show respectively how the ANN estimate of x0 and
the coefficients of asymmetry v change during the ANN training.
From figures 6.4 and 6.5 we can see that after being trained: a) the neural
network estimate of the true function remains in the band ±1 measurement noise
standard deviation for an interval larger than x0 ± 2, and b) despite the measurement
noise and the large asymmetry, the neural network estimate of the optimum input has

Fig. 6.5 - The ANN estimate of the optimum input x0. Fig. 6.6 - The time evolution of the
6 coefficients of asymmetry v. (Both are plotted against the number of time steps.)

only a small bias.


For comparison, we reran the simulation program using a quadratic model, i.e.
y = −a²(x − x0)² + c. Such a model corresponds to the FF ANN introduced before with
just one hidden unit performing the function out^h = −(net^h)² and with the input weight
fixed and equal to 1. The parameter a therefore corresponds to the output weight. Other
possibilities are to fix the output weight and vary the input weight, or to vary both
weights.
The same gradient descent approach was used to update the 3 parameters of the
quadratic model (a, x0, c) and the same output noise and dither realizations were used.
As before, a global decreasing learning rate was defined using the same intervals (1-99,
100-199, 200-299, 300-400) and the same multiplicative factors (1, 0.75, 0.5, 0.25) for
a total of 400 time steps. However, the initial global learning rate was set to 0.3 instead
of 0.6 and the parameter updating routine was called just once at each time step, instead
of 3 times as before. The learning rate for each parameter of the model in relation to the
global learning rate (glr) was set at each time step as: a) for parameter a: glr/50, for
parameter x0: glr, for parameter c: 2*glr. The initial parameter estimates were:
a = (0.175)^{1/2}, x0 = −4, c = 0.4. Note that, for parameters x0 and c, these are the same
initial estimates used in the ANN model.
Figure 6.7 shows the time evolution of the estimate of x0 for 3 runs, where each
run used the quadratic model and the same setup but with different parameters for the
true model (parameters a1 and a2 in eq. 6.15): 1) (a1,a2)=(0.1,0.1), 2) (a1,a2)=(0.25,0.25),

Figure 6.7 - The time evolution of the estimate of the optimum input x0 for the quadratic
model when the true model is symmetric or asymmetric ((a1,a2) = (0.1,0.1), (0.25,0.25)
and (0.25,0.1)). Compare this figure with figure 6.5.

3) (a1,a2)=(0.25,0.1). Note that the square of the initial estimate of parameter a is the
average of 0.1 and 0.25.
From fig. 6.7 we can see that, when the quadratic model is used and the true
performance index is symmetric (quadratic in this case when a1 = a2), the estimate of
the optimum input converges to the true value; otherwise there is a large bias in the
estimate. Figure 6.7 should be compared with fig. 6.5.

6.6 - An Extension to Multi-Input Extremum Control

The quadratic model can also be applied to the multi-input single-output case and
some convergence results can be obtained, as shown by Zarrop and Rommens [ZaRo93].
A straightforward extension of the artificial neural network extremum controller
proposed in the previous sections to cover the multi-input single-output case is simply
to modify the equation for the net input of each hidden unit.
Equation 6.2 is then modified to:
$$net_i^h = \sum_{j=1}^{NP} w_{ij}^h\,(x_j - x_{0_j}) \qquad\qquad (6.16)$$

where NP = number of inputs (x and x_0 become vectors with NP components) and
w_{ij}^h = weight from input (x_j − x_{0_j}) to hidden unit i. The other equations used to
calculate the ANN output (eqs. 6.3 and 6.4) are kept the same.
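As a small sketch of eq. 6.16 (the variable names are ours), the net inputs of all NH hidden units can be computed in one matrix-vector product:

```python
import numpy as np

def hidden_net_inputs(W_in, x, x0):
    """Eq. 6.16: net_i^h = sum_j w_ij^h * (x_j - x0_j) for every hidden unit i.

    W_in : (NH, NP) matrix of input weights w_ij^h
    x    : (NP,) current input vector
    x0   : (NP,) current estimate of the optimum input
    """
    return W_in @ (x - x0)          # (NH,) vector of net_i^h values
```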
In relation to training such an ANN, we can still apply the BP algorithm in the
same way as in the single input case. Therefore, only eqs. 6.8 and 6.10 need to be
modified to:

$$\frac{\partial E}{\partial x_{0_j}} = \sum_{i=1}^{NH} e\,(w_i^{ou})^2\,\frac{\partial\,out_i^h}{\partial\,net_i^h}\,w_{ij}^h \qquad\qquad (6.17)$$

$$\frac{\partial E}{\partial w_{ij}^h} = e\,(w_i^{ou})^2\,\frac{\partial\,out_i^h}{\partial\,net_i^h}\,(x_j - x_{0_j}) \qquad\qquad (6.18)$$

with i = 1, 2, ..., NH and j = 1, 2, ..., NP and the other equations are still applicable.
For such an ANN the number of parameters that need to be adjusted by a
training algorithm is: NH*NP input weights + NH output weights + NH v's + 1 bias
+ NP components of x0, making a total of (NP+2)*NH + NP + 1 parameters.

In a two-dimensional input space the output of each hidden unit will be a ridge.
Such ridges will cross each other at the point x0, and at least two ridges with different
orientations are necessary to uniquely define the point x0. The following theorem
generalizes this statement for any number of inputs [NZM92].

Theorem 6.1: If x and x0 are vectors with real finite components and if the ANN output
is given by eqs. 6.16, 6.3, 6.1 and 6.4, then the ANN output has a unique maximum
point and it occurs when x = x0, given that, within the set of hidden units, we can select
a subset such that:
C.1) the number of selected hidden units (NH) is greater than or equal to the number
of inputs (NP);
C.2) the selected hidden units have non-zero output weights and non-zero
coefficients of asymmetry;
C.3) the input weights of the selected hidden units form a full rank matrix, i.e.
a matrix with rank equal to the number of inputs.

Proof: We need to prove that for a finite real x, ∂y^{NN}/∂x = [∂y^{NN}/∂x_1, ..., ∂y^{NN}/∂x_{NP}]^T =
0 if and only if x = x_0, where dim x = NP and 0 = [0, ..., 0]^T. Without loss of generality
we can assume that x0 = 0, since x0 can be seen as a translation of the origin of the
coordinate system [x1, ..., xNP].

Sufficiency: If x = 0 then net_i^h = 0, where i = 1, ..., NH. From P.2 in section 6.3, we
have that df(net_i^h)/dnet_i^h = 0 when net_i^h = 0. Therefore, from eq. 6.4, if x = 0 then
∂y^{NN}/∂x = 0.

Necessity: If ∂y^{NN}/∂x = 0, then:

$$\frac{\partial y^{NN}}{\partial x_j} = \sum_{i=1}^{NH} (w_i^{ou})^2\,\frac{\partial f_i}{\partial\,net_i^h}\,w_{ij}^h = 0 \qquad\qquad (6.19)$$

and consequently we can write:

$$\sum_{j=1}^{NP} \frac{\partial y^{NN}}{\partial x_j}\,x_j = 0 \qquad\qquad (6.20)$$

Now, by combining eqs. 6.19 and 6.20 we have:



$$\sum_{j=1}^{NP} \sum_{i=1}^{NH} x_j\,(w_i^{ou})^2\,\frac{\partial f_i}{\partial\,net_i^h}\,w_{ij}^h = 0 \qquad\qquad (6.21)$$

$$\sum_{i=1}^{NH} (w_i^{ou})^2\,\frac{\partial f_i}{\partial\,net_i^h} \left[\, \sum_{j=1}^{NP} w_{ij}^h\,x_j \right] = 0 \qquad\qquad (6.22)$$
Finally:
$$\sum_{i=1}^{NH} (w_i^{ou})^2\,\frac{\partial f_i}{\partial\,net_i^h}\,net_i^h = 0 \qquad\qquad (6.23)$$

This last equation implies that, if the output weights w^{ou} and the v's are different from
zero, then net^h = 0 (from fig. 6.2 and eqs. 6.1 and 6.13 we can see that: 1) if v_i = 0 then
f_i = 1 and ∂f_i/∂net_i^h = 0; 2) if v_i ≠ 0 and net_i^h ≠ 0 then [∂f_i/∂net_i^h]·net_i^h < 0 for finite net_i^h).
Since net^h = w^h x, where w^h is the input weight matrix with dimension NH by NP (NH
is the number of selected hidden units), then if NH ≥ NP and if w^h has full rank, i.e.
rank NP, then net^h = 0 implies x = 0. Therefore, given the previous conditions,
∂y^{NN}/∂x = 0 implies x = 0, as we wanted to show.
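Conditions C.1 and C.3 are easy to check numerically. A minimal sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def satisfies_rank_condition(W_in):
    """Check conditions C.1 and C.3 of Theorem 6.1: the (NH, NP) input weight
    matrix of the selected hidden units must have rank equal to the number of inputs."""
    n_hidden, n_inputs = W_in.shape
    return n_hidden >= n_inputs and np.linalg.matrix_rank(W_in) == n_inputs
```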

6.7 - Simulation of a Two Input Example

As an example of the application of the proposed ANN-based multi-input
extremum controller, we discuss in this section the practical problem of locating an
object within a larger image ([NZM92],[NZM93]). The basic idea is to find the
parameters of a mask with a pre-determined shape in order to maximize a function that
is related to the position, size and orientation of the object image. Here we assume that
such an image is available as a set of pixels. For each pixel, there is an associated pixel
value (its gray level) and a pixel position (in real-world coordinates). Using a discrete
mass analogy, we use the interpretation that the pixel values are like "masses"
concentrated at the center of the pixel area. Therefore a function that uses the pixel
positions as continuous input variables will have discontinuities.
We illustrate the application of the proposed neural network extremum controller
in this area by using a simplified version of this problem. We assume that we already
know that the object has a rectangular shape with a known width and orientation but
unknown length and position (within certain limits). Moreover, we know that the object

lies along the horizontal axis with a known vertical position and the background of the
image (all pixels that do not belong to the object) has all pixel values set to zero.
Let’s assume for this example that the object image is specified by the following
vectors:
XObjPos = [10, 10.5, 11, ..., 15.5, 16],
XObjPix = [0.5, ..., 0.5]
In other words, the center of the leftmost pixel of the object image is at position XObj = 10
units, the object image size is SObj = 6 units, the horizontal distance between the
centers of two neighbouring pixels is 0.5 units and the object image contains 13 pixels,
each one with a 0.5 value. So, the unknown true optimum input value is
x0 = [XObj, SObj]T = [10, 6]T. Figure 6.8 illustrates the object position and size.
The algorithm starts by setting the initial values for the mask parameters, in this
case, the position of the left side of the mask = XMask and the mask length = SMask.
Looking only inside the mask, i.e. using only local information, we add the value of all
pixels that are inside the mask. This gives us a measure of the overlapping area between
the object and the mask. However, we need to put some penalty on the size of the mask,
because if the mask can become very large the object will be completely inside the mask
for a range of mask sizes and we will not notice any change in the summation of the
value of all pixels inside the mask. So, we need to define a function (our performance
index) of the mask size and the overlapping area that has only one extreme point (a
maximum point) and the input that produces this extremum point is the correct mask
position and size. There are several ways to define such a function. One possibility for

Figure 6.8 - The object image and the initial features of the mask
(XMask = 5, SMask = 15 --> L = Lmax = 0.5*13 = 6.5)

this performance index is:

$$y = L^* \left( \frac{L^*}{S^*} \right)^{\alpha} \qquad\qquad (6.24)$$
where L* = L / Lmax, L = summation of the value of all pixels inside the mask and
Lmax = estimate of max(L) = estimate of the summation of all object pixels values;
S* = SMask / SOmax, SOmax = estimate of an upper limit for the size of the object and α
is a real positive number. We use α = 2 in this simulation. The parameters Lmax and
SOmax are kept constant throughout the simulation and they act as normalization
constants setting the maximum value of the performance index. The above function is
maximum when L is maximum, that is when the mask contains the whole object, and
SMask is the smallest value such that L is still at its maximum. Note that for a fixed
object L is a function of XMask and SMask.
If we assume that we can initialize XMask and SMask such that we know that
the whole object is within the mask, then we can set Lmax to max(L) by simply adding
all pixels values within the mask in the first iteration. This is the procedure adopted here
and illustrated in figure 6.8. A value of 15 is used for SOmax and for the initial SMask.
The parameter XMask was initialized as 5 (see fig. 6.8). Since the maximum possible
value of L* is 1, the true performance index has an unknown maximum value of
(SObj/SOmax)^{−α} = (6/15)^{−2} = 6.25 for [XMask, SMask] = [10, 6].
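A small sketch of this measurement under the stated assumptions (pixel values of 0.5 spaced 0.5 units apart, performance index of eq. 6.24 as reconstructed above, Lmax and SOmax used as normalization constants); all function and variable names are ours:

```python
import numpy as np

# Object image as specified above: 13 pixels of value 0.5 centred at 10, 10.5, ..., 16
x_obj_pos = np.arange(10.0, 16.0 + 0.25, 0.5)
x_obj_pix = np.full_like(x_obj_pos, 0.5)

def overlap(x_mask, s_mask):
    """L = sum of the values of all object pixels whose centres lie inside the mask."""
    inside = (x_obj_pos >= x_mask) & (x_obj_pos <= x_mask + s_mask)
    return x_obj_pix[inside].sum()

def performance_index(x_mask, s_mask, l_max=6.5, so_max=15.0, alpha=2.0):
    """Performance index of eq. 6.24 with L* = L/Lmax and S* = SMask/SOmax."""
    l_star = overlap(x_mask, s_mask) / l_max
    s_star = s_mask / so_max
    return l_star * (l_star / s_star) ** alpha
```

At the optimum mask, [XMask, SMask] = [10, 6], this returns 6.25, the maximum value quoted above.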
The ANN input and output weights and the v’s were initialized as in the single
input example. The parameter bias was initialized to zero and, as explained before, x0
was initialized to [5, 15]T.
Other parameters used in the simulation were: six hidden units, dither uniformly
distributed in the range [−3 3] without decay for both input variables, 400 time steps (or
iterations), a constant global learning rate glr of 0.1/6, learning rate (lr) for output
weights = glr/10, lr for v’s = glr/5, lr for input weights = glr, lr for x0 = glr*10, lr for
bias = glr, the parameter updating routine was called 3 times in each time step.
Figure 6.9 shows the time evolution of the mask parameters. Although the true
performance index is not a continuous function of the mask position and size, their
estimates converged to values very close to the true optimum point (10 and 6
respectively). Figure 6.10 shows the output of the ANN at each time step for the
estimate of x0.
Figure 6.9 - The mask history (mask position and size against the number of time steps).
The object is situated between positions 10 and 16.
Figure 6.10 - The output of the ANN for the estimate of the optimum input x0.

Figure 6.11 - The true performance index as a function of the mask position and size,
its ANN approximation after being trained and the respective contour plots

Figure 6.12 - Contour plots considering only the region of the dither [−3 3].
Figure 6.13 - Contour plots using a dither of [−6 6] and 600 time steps.

Figure 6.11 shows the true performance index as a function of the mask position
and size. This same figure also shows the ANN approximation after being trained and
the respective contour plots.
Figure 6.12 shows the superimposed contour plots of the true performance index
and its network approximation, highlighting the region of the dither ([−3 3] for both
dimensions) around the optimum point [10, 6]. Figure 6.13 shows the same contour plots
for the case where the dither was increased to [−6 6] in both dimensions and 600 time
steps were used. Again we can see that the network converges to a value very close to
the optimum point.

6.8 - Conclusions

This chapter showed that the extremum control of a static asymmetric
performance index with a non-decaying dither can be achieved by using an ANN
approach. Basically the conceptual difference and the novelty is the use of an
appropriate nonlinear modelling technique and the corresponding parameter estimation
algorithm.
It was shown that the proposed technique applies to the single-input and multi-input
cases. The proposed approach was compared with the quadratic approach in the
single input case and it was shown that the use of the first approach results in a much
smaller error in the estimation of the optimum input. A two input example was also
simulated where, despite a large asymmetry in the true performance index and a large
dither, the optimum input was estimated with a small error.

In the next chapter we show how ANNs can be used for the control of nonlinear
dynamical systems.

Chapter 7 - Dynamical Control Using Artificial Neural Networks

The field of control theory is currently well developed in the area of analysis and
design of linear time-invariant dynamical systems. However, the area of control of
nonlinear dynamical systems is much less advanced and few general results are
available. Therefore each system is treated on a case-by-case basis. Among other
features, the ability of feedforward ANNs to approximate arbitrary nonlinear mappings
has attracted the attention of several control engineers since it provides a new approach
to the difficult problem of nonlinear dynamical control.
In this chapter we firstly review some approaches proposed in the literature to
integrate feedforward ANNs in the general control structure. The concept of feedback-
error-learning is then introduced and we propose a feedback-error-learning control
structure. Such a control structure is then mathematically analysed and we show that,
at least for the case of single input-single output linear dynamical systems, when certain
requirements are satisfied, the inverse dynamical model can be correctly identified.
In the subsequent section the technique of using a variable feedback controller
is proposed. Simulations of the control of a two-joint robot by an ANN are presented
and we show that the use of a variable feedback controller improves the performance
of the neural controller in relation to other trajectories that were not used during
training.
Finally, the concept of fault-tolerant ANN, introduced in chapter 5, is explored
in order to improve the fault tolerance of the neural controller in relation to faults in the
ANN. The control of an inverted pendulum is used as the simulation example in this case.

7.1 - Artificial Neural Networks and Dynamical Control

Hunt et al. [HSZG92] highlight the following features of ANNs as important to
their application in control:


1) ANNs have the theoretical ability to approximate arbitrary nonlinear
mappings. There is also the possibility that such an ANN
approximation is more parsimonious, i.e. it requires fewer parameters,
than other competitive techniques such as orthogonal polynomials,
splines, or Fourier series. However, this has yet to be theoretically
proven [Son93].
2) Since ANNs can have multiple inputs and multiple outputs, they can be
naturally used for control of multivariable systems.
3) ANNs can be trained off-line using past data records of the system to be
controlled or they can be adapted on-line in order to compensate for
changes in the controlled system.
4) Since ANNs are parallel distributed processing devices, they can be
implemented in parallel hardware. Therefore, as a consequence of the
currently possible very fast processing capability, ANNs can be used
in real-time control. Also due to their distributed organization, ANNs
have the possibility of offering, when properly trained, a good level of
fault tolerance against internal damage to the network itself.
5) By using ANNs it may be possible to perform efficient sensor data fusion
where symbolic and numerical information received from different
types of sensor can be naturally integrated. Similar to performing
sensor data fusion, one could perform actuator integration, where
several actuators act on different systems without a one-to-one
correspondence between actuator and system (each actuator provides
signals to several systems and each system receives signals from several
actuators). Such features could result in neural controllers that are
robust to loss of sensors and actuators.
If one could obtain all these features simultaneously, the result would be very
impressive: a real-time multi-variable nonlinear adaptive fault-tolerant controller. Nature,
through evolution, has managed to achieve such controllers in biological systems.
However, the current artificial neural and non-neural systems are still far from achieving
such a level of refinement.

7.2 - Neural Control Architectures

In the previous chapters it was explained that one of the problems of applying
ANNs is to decide how to construct them, e.g. to decide the number of layers, the
number of units in each layer and the activation functions for each unit. When using
ANNs for control, an additional problem is to decide how to integrate the ANN into the
control structure, i.e. the neural control architecture. Werbos ([Wer90], [NaMc91])
classifies the current neural control architectures into five categories: supervised control,
adaptive critic, back-propagation through time, direct inverse control and neural adaptive
control. In this section we explore the basic characteristics of these neural control
architectures.

7.2.1 - Supervised Control


The first neural net controller was probably the one built by B. Widrow and F.
W. Smith at Stanford University in 1963 ([Wid87], [WiSm63]). It was used to control
a motorized cart that had an inverted pendulum on its top. The cart moved along a uni-
dimensional finite and straight track. The controller was required to balance the inverted
pendulum in the upright position, keeping the cart position within certain bounds. The
controller was of the bang-bang type, i.e. it could only apply a force of constant
magnitude in the left or right direction.
Figure 7.1 illustrates the neural control architecture proposed by Widrow and
Smith [WiSm63]. The basic idea is to use an already existing controller to train the
neural controller. This type of neural control architecture is known as supervised control

Figure 7.1 - Widrow and Smith’s ADALINE controller



[Wer90]. In their original 60’s work the teaching controller implemented a linear
switching surface, i.e. the location of the switching surface was a linear function of the
state variables of the system being controlled. With the switch shown in figure 7.1 in
position A, the neural controller is being trained to approximate the desired switching
surface provided by the teaching controller and the teaching controller still provides the
control action for the system. When the switch is in position B, training stops and the
ADALINE (a TLU with bipolar output) controls the system.
The network weights were adapted by using the LMS algorithm (chapter 2) such
that the network is trained to associate its binary input variables to the desired bipolar
output, the "bang-bang" control action.
The input variables were firstly encoded (for instance by a grandmother or
"single spot" code) such that the network could approximate switching surfaces more
complicated than hyperplanes. Such coding is in effect quantizing the state space. In this
way it is possible to show that the ADALINE can approximate (except for quantization
effects) any switching surface that does not contain cross-product terms, i.e. terms of the
form xi xj where i ≠ j [WiSm63].
Twenty five years after developing the original cart-pole balancer, Widrow and
Viral Tolat [ToWi88] showed that the network could also be trained by using as input
visual images of the cart-pole system, instead of using direct measurements of the state
variables. Such a modification makes possible the use of a human as the teaching
controller to provide the correct control actions. According to Hecht-Nielsen [Hec90],
in order to use a human as the teaching controller it is necessary to build a computer
simulation of the dynamical system and camera and then to run the simulation slowed
down by a factor of 10 or 100, otherwise humans are not able to balance the pole.
This last case illustrates a major application of the supervised control
architecture, i.e. to train the ANN to imitate a human expert. This can be specially
useful in situations where it is not viable to have humans controlling the system
continuously, for instance in dangerous environments. It is also important to note that
it can be very difficult to formulate explicitly the rules used by humans to control a
system. This is not a problem for the ANN since, when properly trained, it learns to
extract such rules from the set of examples provided and therefore it is not necessary
to provide an explicit formulation of the control law. Another important point is that the
ANN could extract the information from the examples whether the input sensor data was

provided as direct measurements or as crude images. In other words, the ANN is flexible
in relation to the particular representation used to code the training data.
On the other hand, one should note that the solution of the cart-pole balancer
problem can also be obtained using classical control theory, as Geva and Sitte [GeSi93]
and Hecht-Nielsen [Hec90] pointed out. The important point is the approach used, not
the particular problem solved in this case per se.

7.2.2 - Adaptive Critic and Reinforcement Learning


In order to use the supervised control architecture it is necessary to have a
"teacher", for instance a human expert in the particular control task, that
can provide the correct action for every state of the system. For this reason it is said that
in supervised control the ANN "learns with a teacher".
In some situations, however, no such teacher is available. Furthermore, a series
of actions have to be taken before it is possible to evaluate the success of the actions.
In general, the success of such actions is simply evaluated as a "success" or as a
"failure", i.e. the evaluation is not quantitative but only qualitative. In this case the ANN
is said to "learn with a critic". The crucial problem in this case is known as the temporal
credit assignment problem [HKP91], i.e. which actions should receive the blame in case
of failure or the credit in case of success. This is different from structural (or spatial)
credit assignment, where the problem is to attribute the network output error to different
units or weights.
The idea of the adaptive critic neural architecture is based on the principle of
reinforcement learning, a term borrowed from the theories of animal learning, i.e.
reward the correct actions and punish the wrong actions. In 1973 Widrow, Gupta and
Maitra proposed what they called selective bootstrap adaptation [WGM73]. Once a
series of actions is taken and the result is evaluated as success or failure, the basic idea
of selective bootstrap adaptation is as follows:
a) if the result is evaluated as a success, the actions are rewarded by
applying the LMS algorithm to each action using as desired output the
output already used in that action. They called this positive bootstrap
adaptation or learning by reward.
b) if the result is evaluated as failure, the actions are punished by
applying the LMS algorithm to each action using as desired output the

Figure 7.2 - Widrow, Gupta and Maitra’s bootstrap adaptation

inverse of the output used in that action. They called this negative
bootstrap adaptation or learning by punishment.
Figure 7.2 illustrates the arrangement proposed by Widrow, Gupta and Maitra [WGM73]
where the bootstrap control input determines if the action should be rewarded or
punished by setting the position of the switch.
In order to illustrate the idea Widrow, Gupta and Maitra applied the bootstrap
adaptation technique to a simulated card game of blackjack. By using a single
ADALINE with the input variables properly coded (to a set of binary variables), they
found out that the network could learn to play the game very well without explicitly
knowing the rules or objective of the game. The only feedback provided by the
environment is whether, at the end of the game, the network has won or lost. In each game
the network has to take a series of actions which are recorded with the associated
network inputs. If the network has won, the game is replayed, i.e. all input patterns are
reapplied to the network input side, and the weights are adapted using positive bootstrap
adaptation. Alternatively, if the network has lost, the weights are adapted using negative
bootstrap adaptation. In this way, all experience acquired by playing the game and
observing the final qualitative evaluation of a series of actions is stored in the network
weights.
In 1983 Barto, Sutton and Anderson [BSA83] showed that the cart-pole problem
could be solved by reinforcement learning by combining, as figure 7.3 illustrates, two
units with a decoder that divides the state space into non-overlapping regions. The only
evaluation of success or failure available is the failure signal, that is only non-zero when
the pole falls more than a certain limit or the cart hits the track boundary. A new

Figure 7.3 - Barto, Sutton and Anderson’s Adaptive Critic

learning trial begins after a failure signal is received. One unit is called the Associative
Search Element (ASE) and it is responsible for selecting one of the possible actions (left
or right) for each time step. The ASE is a stochastic unit, i.e. some source of
randomness is used to force the network to explore the space of possible solutions.
Before training the weights of the ASE unit are initialized such that, for any state of the
cart-pole system, both actions are equally probable. The other unit is called the Adaptive
Critic Element (ACE) and it is responsible for generating an internal reinforcement
signal.
The ASE could be trained using only the failure signal provided by the cart-pole
system as the reinforcement signal. However, learning would be very slow since the
reinforcement signal would be zero most of the time. The role of the ACE is exactly to
learn how to generate an improved version of the external failure signal that can be used
as the internal reinforcement signal. In this way the ASE can learn between the failures,
not just when there is a failure. Therefore learning is considerably faster when compared
with the case where only the ASE is used. However, in absolute terms learning can be
very slow, which is not very surprising, given that feedback provided by the
environment is delayed and qualitative.
The ACE learns its task using the method of Temporal Differences (TD)
([BSA83], [Sut88]). Basically the ACE learns to predict for each region of the system
state space (indicated by the output of the decoder) the failure signal ([And89],
[HKP91]), where the strength of the prediction indicates how soon the failure is
expected to occur.

As the cart-pole moves from one region to the other, the internal reinforcement
signal sent to the ASE is the difference between the predictions of failure for the current
and previous regions. Actions that result in a increase in the prediction of failure are
penalized. Actions that result in a decrease in the prediction of failure are rewarded.
Rosen, Goodwin and Vidal [RGV88] managed to speed up learning by adding
two heuristic procedures: 1) the constant recurrence learning heuristic: reinforce any
cycle, i.e. if a region has been visited more than once during one trial, reward it; 2) the
short recurrence learning heuristic: reward a short cycle more than a longer one.
One important decision that has to be taken in the design illustrated in figure 7.3
is how to divide the system state space into non-overlapping regions. A very fine
quantization allows a better approximation of complex functions but will require a
longer time to train, while a coarse quantization will result in faster learning with worse
approximation. Anderson ([And89], [And88]) showed that the combination of a decoder
with two units (the ASE and the ACE) can be replaced by two multi-layer networks, i.e.
with hidden units, that were called respectively the action and evaluation networks (see
fig. 7.3). In this way the networks will develop, if learning is successful, their own
adaptive representation of the system state space.
Rodrigues, Nascimento and Yoneyama [RNY91] also proposed to use a multi-
layer network, which was trained using reinforcement learning, to tune the parameters
of a PID controller until a pre-specified controller performance is achieved.
More recently some researchers have begun to investigate the relationships
between reinforcement learning methods and dynamic programming and optimal control
([SBW92], [WaDa92]).

7.2.3 - Back-Propagation Through Time (BPTT)


In some situations the user can specify a cost function to be minimized (or
maximized) that, as in the case of the adaptive critic, depends on a series of actions.
However, since a cost function is specified, after the actions are executed, the user can
evaluate the performance of the controller quantitatively. Furthermore, if a model of the
plant (the system being controlled) can be developed, then the BPTT method is used to
calculate the derivative of the cost function with respect to current actions [Wer90]. This
derivative is then used to update the ANN responsible for generating the actions, i.e the
ANN controller. Figure 7.4 shows the location of the ANN controller in relation to the

Figure 7.4 - The plant and the ANN controller

plant. The plant may be nonlinear and is assumed to perform the nonlinear mapping
x_{k+1} = A(x_k, u_k), where x are the plant states and u the plant inputs.
In order to illustrate the technique, Nguyen and Widrow [NgWi90b] used the
BPTT method to train an ANN to back-track a truck with a trailer to a specific point of
a loading dock, with the requirement that the back of the trailer is as parallel as possible
to the loading dock.
The first task is to develop a model of the plant, which is basically to perform
nonlinear plant identification. Nguyen and Widrow [NgWi90b] trained a FF ANN, using
the Back-Propagation algorithm, to emulate the plant. The inputs to the ANN are the
plant states xk and plant inputs uk and the desired ANN outputs are the plant states at
the next time step xk+1 (see figure 7.5). It is normally assumed that the plant states are

Figure 7.5 - The ANN emulator



Figure 7.6 - Training the ANN controller (C = ANN controller, T = ANN truck emulator)

directly observable without noise. The training of the ANN emulator consists of
positioning the truck at an arbitrary position and applying a sequence of random inputs
uk.
After the ANN emulator is trained, the following procedure is used to train the
ANN controller:
1) The ANN controller receives x0 and generates u0. The truck moves and
generates x1. Without updating the weights in both ANNs, the controller receives x1
and generates u1, and so on for a maximum number of steps (specified by the user) or
until the truck hits the dock. This is the final position xNT, where NT is the number of
time steps in this trial.
2) The state x_NT is compared with the desired final state x^d_NT. The
difference is used by the BP algorithm to update the weights of the ANN controller,
while the weights of the ANN emulator are kept fixed. However, instead of being
updated just once, the ANN controller is updated NT times, as if a large network with
NT copies of the ANN emulator and ANN controller were used. Figure 7.6 illustrates
this part of the procedure.
3) The truck and trailer are initialized at another position and steps 1 and
2 are repeated.
Strictly speaking all the weight changes for each of the NT stages would have
to be saved so that they could be added together at the end of the trial. In this way the
ANN controller would only be updated at the end of each trial. In practice, however, the
weight changes can be added immediately to the weights of the ANN controller as they
are calculated.

Note that, when training the ANN controller, the real truck and trailer are used
in the forward pass to obtain the final position of the trailer. However, the ANN
emulator is needed by the BP algorithm such that the final error eNT (calculated at the
output of the plant) can be back-propagated to change the weights of the ANN
controller. Such an error cannot be back-propagated through the real plant but only
through a suitable mathematical model, in this case the ANN emulator.
One important point is that Nguyen and Widrow [NgWi90b] used the interesting
training strategy of dividing the training phase into several sessions, where each session
was composed of several trials and the sessions had increasing levels of difficulty. It
took about 20,000 backups to train the ANN controller, but afterwards the ANN could
back the truck from several initial positions with which it had not been trained.

7.2.4 - Direct Inverse Control


In this neural control architecture the basic idea is to train an ANN as the inverse
dynamical model of the plant. After being trained the ANN is simply used as a
feedforward controller such that the composition of ANN and plant act as the identity
mapping [Zur92]. The assumption in this case is that, at least in the space where the
ANN is being used, the inverse dynamical model of the plant is uniquely defined and
is stable.
It is possible to use a state-space formulation or an input-output formulation. In
the former case the ANN inputs are the current state vector x_k and the desired state at
the next time step x^d_{k+1}. In the latter case the ANN inputs are: 1) the desired plant output
at the next time step y^d_{k+1}; 2) the current and past plant outputs y_k, y_{k-1}, ..., y_{k-Ny}; and 3) the
past plant inputs u_{k-1}, ..., u_{k-Nu}. In both cases the ANN output is u_k, the input that should
be applied at the plant at time step k. Note that plant input and output at any particular
time step can be vectors.
Psaltis, Sideris and Yamamura [PSY87] propose three neural control architectures
based on the idea of learning the inverse dynamical model of the plant: Indirect
Learning, Generalized Learning and Specialized Learning.
Assuming an input-output representation figure 7.7 shows the Indirect Learning
architecture. Additional inputs to the ANNs, not shown in figure 7.7, are the past plant
inputs and current and past plant outputs. Two ANNs are used. The ANN that receives
the plant output as one of the inputs is trained to emulate the inverse dynamical model.

Figure 7.7 - The Indirect Learning architecture

The other ANN is a copy of the ANN that is being trained.


One advantage of this architecture is that the ANN can be trained on-line, i.e. it
learns while it is performing a useful task. Furthermore, since the ANN inputs are the
desired plant outputs, the ANN can be trained on the particular region of interest of the
output domain.
Sometimes, however, this architecture may not work as desired [PSY87]. The
reason is that, according to simulations, the ANN converges to a set of weights for
which a large number of different desired outputs ydk+1 are mapped to the same plant
input uk. The copy of the ANN then maps yk+1, the output of the plant, to the same uk.
So the error used to correct the network is zero, although the total error ydk+1 - yk+1 will
not be zero. One example is when the ANN response to any inputs is zero. Then its
copy also has a zero output and the weights of both networks do not change. In other
words, the problem is that this architecture is not goal-orientated, i.e. not orientated to
decrease the plant output error.
The Generalized Learning architecture (figure 7.8) avoids this possibility by

Figure 7.8 - The Generalized Learning architecture



Figure 7.9 - The Specialized Learning architecture

generating directly the inputs u_k to the plant and collecting the plant output y_{k+1}. Then
the plant output y_{k+1} and the past plant outputs and inputs are used as inputs to the
ANN. The ANN output v_k is compared with the respective plant input u_k. The
disadvantage associated with this architecture is that it can only be used for off-line
training, because it is necessary to have a training phase before the ANN can be used
as a controller. During this training phase, another controller or a human expert is used
to generate the inputs u to the plant. Another point is that, in order to plan the training
stage, it is necessary to know the input operational range of the plant. Consequently
there is the risk of training the ANN on regions over which the plant will not operate
during the control phase or of not training the ANN over some important regions.
The Specialized Learning architecture (figure 7.9), as in the case of Indirect
Learning, uses as input to the ANN controller the desired response of the plant y^d_{k+1} and
applies the output of the ANN as the input to the plant u_k. However, in the Specialized
Learning architecture the plant response y_{k+1} is compared with y^d_{k+1} and the difference is
used to train the ANN. This architecture can learn on-line, but to change the weights the
error y^d_{k+1} − y_{k+1} must be back-propagated through the plant, which is unknown or only
approximately known. In order to do this, Psaltis et al. [PSY87] suggest interpreting the
plant as an additional layer of the ANN with non-modifiable weights. They also suggest
that when the plant is unknown the partial derivatives of the output of the plant in
relation to its inputs (the so-called Jacobian of the plant) around some operating point
can be estimated by some perturbation method such as by changing slightly each input
and measuring the change at the output or by using the changes in the previous time
steps.
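The perturbation idea can be sketched very simply (a minimal illustration, assuming the plant can be probed around an operating point; all names are ours):

```python
import numpy as np

def estimate_plant_jacobian(plant, u0, delta=1e-3):
    """Estimate the plant Jacobian dy/du around the operating point u0 by
    changing each input slightly and measuring the change at the output."""
    y0 = np.atleast_1d(plant(u0))
    jac = np.zeros((y0.size, u0.size))
    for j in range(u0.size):
        u = u0.copy()
        u[j] += delta                                   # perturb one input
        jac[:, j] = (np.atleast_1d(plant(u)) - y0) / delta
    return jac
```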

In order to back-propagate the plant output error through the plant Saerens and
Soquet [SaSo91] proposed using the signs of the partial derivatives of the plant outputs
in relation to its inputs since this qualitative knowledge is more easily available than the
quantitative information. Using the state space approach to specify the ANN inputs they
applied this scheme to train a two joint robot to follow a moving target in a 2D space
and to solve the cart-pole problem with good results. However, they pointed out that this
learning architecture can result in slow training.
Note that another possibility to back-propagate the error at the plant output in
order to update the weights of the ANN controller is to train off-line an ANN to emulate
the forward dynamical model of the plant as Jordan proposes [Jor89]. Then the error at
the plant output is back-propagated not through the plant but through the ANN emulator.
It is possible to combine the advantages of these two learning architectures by
first using the Generalized Learning architecture to train the ANN controller off-line
over a large region of interest and then using the Specialized Learning to fine-tune it on-
line around some specific operating points. The initial training using the Generalized
Learning should speed up the convergence and make easier any necessary relearning
when the plant changes or when new operating points are defined [PSY87].

7.2.5 - Neural Adaptive Control


The neural adaptive control architecture basically follows the same designs as
conventional linear adaptive control, with the difference that the linear mappings used
in the latter case are replaced by ANNs. Possible designs arise in Self-Tuning and
Model-Reference adaptive control.
Narendra and Parthasarathy ([NaPa90], [Nar90]) provide several examples of how
to perform neural adaptive control based on the model-reference approach, most of them
for single-input single-output plants, but also some examples for the multi-input multi-
output case.
The aim in model-reference adaptive control is to make the plant behave like a
reference model, which must be supplied by the designer. First, an ANN learns to
emulate the plant in the identification stage. This identification is initially carried out
off-line. Later, the parameters of a second ANN, which acts as the controller, are
updated using the model of the plant generated by the first ANN. The identification
process continues also during the control stage, so that the model represented by the

Figure 7.10 - Indirect Adaptive Control

ANN emulator is fine-tuned on-line. Figure 7.10 illustrates this two stage procedure
(identification and then control), which is known as Indirect Adaptive Control.
One of the examples used by Narendra and Parthasarathy [NaPa90] is the control
of a plant whose dynamics are unknown and described by:

$$y_{k+1} = \frac{y_k}{1 + y_k^2} + u_k^3 \qquad\qquad (7.1)$$

The identification model was chosen as two ANNs:

$$\hat{y}_{k+1} = N_f[\,y_k\,] + N_g[\,u_k\,] \qquad\qquad (7.2)$$

where N_f[y_k] approximates f[y_k] = y_k / (1 + y_k²) and N_g[u_k] approximates g[u_k] = u_k³. First,
a random input uniformly distributed over the interval [−2, 2] was used in the off-line
identification stage. Hence N_g approximates g only over this interval. For inputs in
that range, it was observed that the plant output y varied over the interval [−10, 10], so
N_f approximates f only over the interval −10 ≤ y ≤ 10.
If the reference model is y^m_{k+1} = 0.6 y^m_k + r_k, then the input to the plant could be
calculated as:

$$g[\,u_k\,] = y_{k+1} - f[\,y_k\,] = 0.6\,y_k + r_k - f[\,y_k\,] \qquad\qquad (7.3)$$

Since the true function f [.] is unknown, we have to use its approximation Nf [.]. So the
control input uk can be generated using the following expression:

$$u_k = \hat{g}^{\,inv}\big[\, -N_f[\,y_k\,] + 0.6\,y_k + r_k \,\big] \qquad\qquad (7.4)$$

where ĝinv denotes the approximation of the inverse of the function g[.]. If another ANN
Nc was adjusted so that Ng[ Nc(r) ] ≈ r, then this network Nc is the approximation of the

Figure 7.11 - Neural Adaptive Control using the Model-Reference approach

inverse of the function g[.]. The interval [−4,4] was used to train off-line this last ANN,
so that its output was within the interval used to train the network Ng. Finally, the
control input uk is calculated by:

$$u_k = N_c\big[\, -N_f[\,y_k\,] + 0.6\,y_k + r_k \,\big] \qquad\qquad (7.5)$$

Figure 7.11 shows the overall adaptive control system.
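A sketch of one control step under this scheme, with N_f and N_c standing for the trained sub-networks (here passed in as plain callables) and with the plant of eq. 7.1 simulated directly; the function names are ours:

```python
def plant_step(y_k, u_k):
    """True plant of eq. 7.1: y_{k+1} = y_k / (1 + y_k^2) + u_k^3."""
    return y_k / (1.0 + y_k ** 2) + u_k ** 3

def control_step(y_k, r_k, N_f, N_c):
    """Eq. 7.5: u_k = N_c[-N_f[y_k] + 0.6*y_k + r_k], so that the closed loop
    approximately follows the reference model y^m_{k+1} = 0.6*y^m_k + r_k."""
    u_k = N_c(-N_f(y_k) + 0.6 * y_k + r_k)
    return u_k, plant_step(y_k, u_k)
```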


Paying attention to the fact that an ANN can only approximate the inverse or
forward model of the plant over a specified finite region, Sanner and Slotine [SaSl92]
and Tzirkel-Hancock and Fallside [TzFa92] proposed to add a sliding control term to
the ANN controller. If the state of the plant is outside the region where the ANN
has good approximation properties, then the ANN output and any ANN adaptation are
turned off and only the sliding controller is used. Then, whenever the state of the plant
returns to the region where the ANN provides a good approximation the sliding
controller output is turned off, the ANN regains control of the plant and ANN training
proceeds. A modulation mechanism [SaSl92] is provided to achieve a smooth transition
between the ANN and sliding controllers.
Willis et al. ([WDMT91], [WMDT92]) and Saint-Donat et al. [SBM91] proposed
neural predictive control methods, i.e. to use the adaptive model of the plant developed
by the ANN to calculate the control action in order to minimise the sum of squares of
future setpoint tracking errors. The time interval considered for such minimization is
called the output prediction horizon. Another innovation proposed by Willis et al.

([WDMT91], [WMDT92]) was the use of ANNs composed of sigmoidal units where
their outputs are filtered in time, for instance by a first-order low-pass filter:

λ yk 1 λ yk (7.6)
f f
yk 1

where λ is the time constant and 0 ≤ λ ≤ 1. The basic idea is that by including dynamics
in the network model in this way the ANN modelling capability will hopefully be
expanded and therefore a smaller ANN (perhaps with a smaller number of delayed plant
inputs and outputs as the ANN input) could be used. Since the filter constant λ is not
known beforehand it may be necessary to adjust λ during training. In order to use the
BP algorithm to adjust λ it is necessary to calculate the gradient of the cost (error)
function with respect to λ. A possible approach suggested by Willis et al. ([WDMT91],
[WMDT92]) to determine λ is to use the chemotaxis algorithm, an algorithm similar to
the Simulated Annealing technique. The chemotaxis algorithm assumes that the ANN
parameters follow a multivariate gaussian distribution with zero mean. The ANN
weights and any other possible parameters are adjusted simply by adding gaussian
distributed random values to them. The new parameters are accepted if such adjustments
result in a smaller prediction error.
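For illustration, the filtered unit output of eq. 7.6 amounts to a one-line recursion (a minimal sketch; the variable names and the default value of the filter constant are ours):

```python
def filtered_output(y_f_prev, y_new, lam=0.5):
    """Eq. 7.6: first-order low-pass filtering of a unit's output,
    y^f_{k+1} = lam * y^f_k + (1 - lam) * y_k, with 0 <= lam <= 1."""
    return lam * y_f_prev + (1.0 - lam) * y_new
```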
Finally, for completeness, we should mention that it is also
possible to use unsupervised learning algorithms to train ANNs to be used in control
systems. For instance, Ritter, Martinetz and Schulten [RMS92] show how the Kohonen
algorithm can be applied to solve the dynamical control problem for a three-joint robot
arm.

7.3 - The Feedback-Error-Learning Method

In 1987 Kawato and co-workers ([KFS87], [MKSS88]) proposed the feedback-error-learning
method to train an ANN to perform dynamical control of a robotic
manipulator. Their main motivation was to propose a model of the computational
scheme used in the central nervous system (CNS) for motor learning. Their basic idea
is to combine an already available and tuned conventional feedback controller with an
ANN acting as the feedforward controller. The feedback controller should at least be
good enough to stabilize the plant when used alone, but it does not need to be optimally
tuned. For simplicity such a feedback controller is normally a PID controller.
The aim is to adapt the ANN in order to minimize the tracking error e defined

as the difference between a reference signal ref and the measured output y, normally a
subset of the state vector x. In order to achieve this Kawato [MKSS88] proposed using
the output of the feedback controller as the ANN output error and therefore called such
a learning method Feedback-Error-Learning. Although apparently Kawato and his co-
workers have not realized this, the feedback-error-learning rule can be interpreted as a
method of minimizing the mean-squared-value of the output of the feedback controller,
as we will show later.
It is important to note that by using the feedback error signal as the ANN output
error the problem of back-propagating the control error through the plant (or through a
model of the plant) is avoided [MKSS88]. Furthermore, the ANN can be trained on-line
and the training method is goal orientated since, when the output tracking error is zero,
the output of the feedback controller will also be zero (in reality if there is a integral
component in the feedback controller, its output can be a non-zero constant, in which
case a bias term at the linear output unit of the ANN can be used to cancel such
constant output).
Before being trained, the ANN is initialized such that its output is zero for any
input. Hopefully, as the ANN is being trained, it will smoothly take over control from
the feedback controller and at the same time improve the overall control performance.
In this way the ANN is being trained to be the inverse dynamical model of the plant.

7.3.1 - The Original Feedback-Error-Learning Control Structure


Since Kawato was mainly concerned with robot dynamical control he proposed
composing the ANN input by using, at each time step, the desired joint angles and their
respective desired velocities and desired accelerations. He also proposed that the ANN
could be simply a linear combination of nonlinear functions, where these nonlinear
functions could be determined by analysing the inverse dynamical robot model.
Therefore, for a robot with N joints, the feedforward controller would be formed by N
ANNs, each with 3 N inputs and 1 linear output.
The number of nonlinear functions to be used in each ANN is a function of the
particular dynamical robot model chosen. Figure 7.12 illustrates such a control structure
for a two-joint robot model. Note that in each ANN there is only one layer of
modifiable weights and therefore there is no need for using an algorithm such as Back-
Propagation since there are no hidden weights to be adjusted. However there is also the

Figure 7.12 - The original network topology and control structure used in the
Feedback-Error-Learning method

considerable inconvenience for the designer of specifying the nonlinear functions that
will basically perform a nonlinear transformation of the desired angles, velocities and
accelerations. One simple way to avoid such inconvenience is to use an ANN composed,
for instance, of sigmoidal units. In [Kaw90] Kawato presents a few incomplete
simulation results using this approach for a 3-joint robot arm.
In fig. 7.12 the following notation was adopted: q^d_k, q_k = desired and actual joint
angle at time step k, T^F_k = feedback torque generated as the output of the feedback
controller, T^N_k = feedforward torque generated as the ANN output, and T_k = torque
applied to each robot joint. From fig. 7.12 we have: T_k = T^F_k + T^N_k.
Since the output of the feedback controller T^F_k is used as the ANN output error,
the role of the training algorithm can be interpreted as being to adapt the network
weights in order to associate the vector q^d_k and its derivatives with the vector T_k. Since
in the robot control problem we can define the plant states as the joint positions and joint
velocities, the above is equivalent to mapping from [(x^d_k)^T (x^d_{k+1})^T] to T_k, i.e. associating
the desired trajectory for each joint with the torque that should be applied to each joint.
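One training step under this scheme can be sketched as follows. This is an illustration only, not the author's code: `ann_forward` is assumed to return both the feedforward torque and the gradients of that output with respect to the network weights, and the feedback controller is any stabilizing law such as a PID.

```python
def feedback_error_learning_step(weights, q_desired, q_measured,
                                 feedback_controller, ann_forward, eta=0.01):
    """One step of Feedback-Error-Learning: the feedback controller output T^F
    is used as the ANN output error, so the weights are adapted to drive it
    towards zero while T = T^F + T^N is applied to the robot joints."""
    t_fb = feedback_controller(q_desired, q_measured)    # T^F_k (e.g. a PID law)
    t_nn, grads = ann_forward(weights, q_desired)        # T^N_k and dT^N/dweights
    torque = t_fb + t_nn                                 # T_k = T^F_k + T^N_k
    new_weights = {name: w + eta * t_fb * grads[name]    # treat T^F_k as the output error
                   for name, w in weights.items()}
    return torque, new_weights
```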

Since in order to learn the true inverse dynamical model of the plant, it is
necessary to associate the vector [(xk)T (xk+1)T ] to Tk, the implicit assumption of the
feedback-error-learning method is that the feedback controller is adequate so that, when
used alone to control the plant, the plant approximately follows the correct trajectory,
i.e. [(xdk)T (xdk+1)T ] ≈ [(xk)T (xk+1)T ] [NaMc91]. Therefore one can state that the role of the
feedback controller is to provide an approximate solution for the problem. Because of
this, when used within such a control structure, the ANN tends to be trained more
rapidly than in other situations.
The Feedback-Error-Learning method can be seen as a special case of the
adaptive critic neural control architectures (section 7.2.2) [Wer90], where the feedback
controller performs the role of the critic. In this case however, the critic is in general
(but not necessarily) non-adaptive, as we show later in this chapter.
The role of the critic (the feedback controller) is not to provide the desired ANN
output, but to provide a signal that can be used as an evaluation of the ANN
performance. The role of the training algorithm is to adapt the ANN in order to
maximize the evaluation signal provided by the critic, which means minimizing the
amplitude of the feedback controller output.
Kawato’s interpretation is to see the feedback controller as an imperfect teacher
and that, by being trained by this teacher, the ANN learns to surpass the teacher
[KFS87]. Such an interpretation, however, is incorrect since the ANN is not trained to
imitate the feedback controller but to make the output of the feedback controller (and
consequently the output error) as small as possible. In other words, the feedback
controller provides the ANN output error, not the ANN output.

7.3.2 - The Modified Feedback-Error-Learning Control Structure


In this section we propose some modifications to the feedback-error-learning
control structure proposed by Kawato and develop a mathematical formulation for the
learning rule used. Kawato’s original design (illustrated in figure 7.12) was appropriate
to be used with robot manipulators. The modifications that we propose generalize the
design allowing its use with a larger class of systems. Furthermore, we show, for single
input-single output linear time-invariant plants, how the reference signal should be
designed.
Kawato’s original design extracts the variation of the reference signal (the desired

Figure 7.13 - The modified feedback-error-learning neural control structure
(the ANN receives ref_k, ..., ref_{k-L} through a tapped delay line, the PID feedback
controller is driven by e_k = y_k − ref_{k-M}, and the plant input is the sum of the two
controller outputs)

joint angle profile) by using its higher order derivatives, normally calculated numerically
using some central difference method. Therefore the first modification is to substitute
these higher order derivatives by a tapped delay line [ONY91]. The ANN input becomes
the current value of the reference signal and a limited number of its past samples. A
limitation of such an arrangement is that the plant must be observable (i.e. from its
output trajectory we can determine its state trajectory), since we are specifying the
desired state trajectory of the plant by specifying only the desired trajectory of the plant
output.
The second modification is to use a delayed version of the reference signal to
calculate the input of the feedback controller, i.e. for single input-single output plants
the output error is defined as ek = yk − refk−M, where M is a non-negative integer. The
use of such a delay was proposed by Widrow and Stearns [WiSt85] for a related non-
neural controller architecture.
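A small sketch of how the controller signals are assembled at time step k under these two modifications (tapped delay line of length L, reference delayed by M); the names are ours and k is assumed to be at least max(L, M):

```python
def controller_signals(ref, y, k, L, M):
    """ANN input and feedback error at time step k for the structure of fig. 7.13.

    ref, y : sequences of reference samples and measured plant outputs
    L      : length of the tapped delay line (past reference samples fed to the ANN)
    M      : delay applied to the reference before computing the output error
    """
    ann_input = [ref[k - i] for i in range(L + 1)]   # ref_k, ref_{k-1}, ..., ref_{k-L}
    error = y[k] - ref[k - M]                        # e_k = y_k - ref_{k-M}
    return ann_input, error
```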
Figure 7.13 illustrates the proposed neural control structure for the feedback-
error-learning method when the plant is single input-single output. The length of the
tapped delay line L (a non-negative integer) and the delay M to be applied to the
reference signal have to be specified by the designer. The same control structure can be
applied in the case of MIMO (multi-input multi-output) plants simply by extending the
ANN input vector to include all current values of the reference signal vector and its past

values, where the number of past samples to be collected for each component of the
reference signal can be different for each component. The delay to be applied to each
component of the reference signal in order to calculate the input of the feedback
controller can also be different for each component. Each component of the feedback
controller output is then used as the error for one of the ANN output units.
The use of the delay M makes the task easier for the ANN since it then needs
to implement an approximation of the delayed inverse of the plant. Without such a delay
(M = 0), the ANN would have to implement the inverse of the plant, a more difficult
task since the ANN would have to act as a predictor in order to compensate for the time
that the physical plant takes to react to a new input. Since in many applications a
delayed inverse is acceptable, the use of the delayed reference signal is not a major
limitation.
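
To make the arrangement concrete, the following sketch (written here in Python with illustrative names; it is not the code used in the simulations) shows how the ANN input vector (tapped delay line) and the delayed output error of fig. 7.13 could be formed at each time step for the single input-single output case.

    import numpy as np

    def ann_input_and_error(ref_hist, y_k, k, L, M):
        """Form the ANN input and the delayed output error at time step k.

        ref_hist : 1-D array holding ref_0 ... ref_k (reference samples so far)
        y_k      : current plant output
        L        : length of the tapped delay line (ANN sees ref_k ... ref_{k-L})
        M        : delay applied to the reference before forming the error
        """
        # Tapped delay line: [ref_k, ref_{k-1}, ..., ref_{k-L}]; samples before
        # k = 0 are taken as ref_0 (an assumption for this sketch).
        idx = np.clip(np.arange(k, k - L - 1, -1), 0, None)
        ann_input = ref_hist[idx]

        # Delayed output error e_k = ref_{k-M} - y_k (eq. 7.10 in the time domain).
        e_k = ref_hist[max(k - M, 0)] - y_k
        return ann_input, e_k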
One problem with inverse modelling arises when the plant is linear and non-
minimum phase. Such plants have inverse models that are unstable. However, the
delayed inverse models of linear non-minimum phase plants have two-sided impulse
responses, are stable and can be approximated by linear filters with a finite number of
coefficients, the so-called FIR filters (finite impulse response filters) [WiSt85]. For
this reason, when dealing with linear systems a good choice for the proposed control
structure illustrated in fig. 7.13 is L ≈ 2M, and an ANN without hidden units and with
linear outputs, i.e. a linear ANN. In this way, the number of ANN weights is equally
divided between the two sides of the impulse response. By including the delay M the
ANN can approximate delayed inverse models of minimum-phase and non-minimum-
phase plants without the requirement of knowing a priori whether or not the plant is
minimum-phase.

7.3.3 - Mathematical Analysis of the Modified Feedback-Error-Learning Control Structure

Let the plant be a single-input-single-output open-loop stable linear time-invariant
dynamical system that, using a discrete-time domain notation, is described by the
following equation:

y_k = a_1 y_{k−1} + a_2 y_{k−2} + … + a_{Na} y_{k−Na} + b_0 u_k + b_1 u_{k−1} + b_2 u_{k−2} + … + b_{Nb} u_{k−Nb}        (7.7)

or equivalently:
G(z) = Y(z) / U(z) = ( b_0 + b_1 z^{−1} + … + b_{Nb} z^{−Nb} ) / ( 1 − a_1 z^{−1} − a_2 z^{−2} − … − a_{Na} z^{−Na} ) = β_0 + β_1 z^{−1} + β_2 z^{−2} + …        (7.8)

Designing for this case the ANN as a linear filter we have:

G_NN(z) = U_NN(z) / Ref(z) = α_0 + α_1 z^{−1} + α_2 z^{−2} + … + α_L z^{−L}        (7.9)
From fig. 7.13 we have that:

E(z) = z^{−M} Ref(z) − Y(z)        (7.10)

U_FB(z) = G_FB(z) E(z)        (7.11)

U(z) = U_FB(z) + U_NN(z)        (7.12)

Combining eqs. 7.8-7.12 we have:

Y(z) = G(z) U(z) = G(z) [ U_FB(z) + U_NN(z) ]        (7.13)

Y(z) = G(z) [ z^{−M} G_FB(z) + G_NN(z) ] Ref(z) − G(z) G_FB(z) Y(z)        (7.14)

And finally:

Y(z) / Ref(z) = G(z) [ z^{−M} G_FB(z) + G_NN(z) ] / [ 1 + G(z) G_FB(z) ]        (7.15)

Following the same development, we have:

U_FB(z) = G_FB(z) E(z) = G_FB(z) [ z^{−M} − Y(z) / Ref(z) ] Ref(z)        (7.16)

U_FB(z) / Ref(z) = G_FB(z) [ z^{−M} − G(z) G_NN(z) ] / [ 1 + G(z) G_FB(z) ]        (7.17)

Therefore if G_NN(z) = z^{−M} / G(z), then from eqs. 7.15 and 7.17 we have:
a) Y(z) / Ref(z) = z^{−M}, and
b) U_FB(z) / Ref(z) = E(z) / Ref(z) = 0.
However, since GNN(z) has a finite number of parameters (see eq. 7.9), the ANN can
only be an approximation of the delayed inverse model of the plant. As the number of
network parameters increases the network approximation improves and the magnitude
of the output of the feedback controller decreases.
Let’s assume that the feedback loop without the ANN is stable such that the
polynomials γ(z) and φ(z) converge (i.e. ∃ 0 < λ < 1, c > 0, d > 0, N > 0 such that
|γ_i| < c λ^i and |φ_i| < d λ^i for all i ≥ N) where:

γ(z) = G_FB(z) / [ 1 + G(z) G_FB(z) ] = γ_0 + γ_1 z^{−1} + γ_2 z^{−2} + …        (7.18)

and

φ(z) = G(z) γ(z) = G(z) G_FB(z) / [ 1 + G(z) G_FB(z) ] = φ_0 + φ_1 z^{−1} + φ_2 z^{−2} + …        (7.19)
By substituting eqs. 7.18 and 7.19 in eq. 7.17 we can write:

U_FB(z) = [ z^{−M} γ(z) − φ(z) G_NN(z) ] Ref(z)        (7.20)

The learning problem can then be defined as finding the coefficients of the ANN,
(α_0, α_1, …, α_L) that minimize the square of the output of the feedback controller, i.e. we
desire to minimize the following scalar cost function J:

J = (1/2) E[ (u^FB_k)² ] = (1/2) E[ ( z^{−M} γ(z) ref_k − φ(z) (α*)^T ref*_k )² ]        (7.21)

where α* = [α_0 α_1 … α_L]^T and ref*_k = [ref_k ref_{k−1} … ref_{k−L}]^T and both of these column
vectors have L+1 components where L ≥ 0.
The stationary points of J are given by ∂J/∂α* = 0 where:

∂J/∂α* = E[ u^FB_k (∂u^FB_k/∂α*) ] = −E[ u^FB_k φ(z) ref*_k ] = 0        (7.22)

The cost function J has a unique minimum if the matrix ∂²J/∂α*² is positive definite
where:

∂²J/∂α*² = E[ ( φ(z) ref*_k ) ( φ(z) ref*_k )^T ] = F_l        (7.23)

and F_l is the correlation matrix of the vector φ(z) ref*_k and it has dimensions L+1 by
L+1. From eqs. 7.20 and 7.22 we have:

F_l α* = F_r        (7.24)

where:

F_r = E[ ( z^{−M} γ(z) ref_k ) ( φ(z) ref*_k ) ]        (7.25)

and F_r is a column vector with L+1 rows. Assume that the reference signal ref_k is
stationary, i.e. E[ref_k ref_{k±i}] = ρ_i = ρ_{−i}. Then element (i′, j′) of F_l (denoted by F^l_{i′j′}) and
element i′ of F_r (denoted by F^r_{i′}) can be written as:
F^l_{i′j′} = E[ ( Σ_{i=0}^{∞} φ_i ref_{k−i−(i′−1)} ) ( Σ_{j=0}^{∞} φ_j ref_{k−j−(j′−1)} ) ] = Σ_{i=0}^{∞} Σ_{j=0}^{∞} φ_i φ_j ρ_{i−j+i′−j′}        (7.26)

F^r_{i′} = E[ ( Σ_{i=0}^{∞} γ_i ref_{k−i−M} ) ( Σ_{j=0}^{∞} φ_j ref_{k−j−(i′−1)} ) ] = Σ_{i=0}^{∞} Σ_{j=0}^{∞} γ_i φ_j ρ_{i−j+M−(i′−1)}        (7.27)

where 1 ≤ i’ ≤ L+1 and 1 ≤ j’ ≤ L+1. Note that: a) Fl is a symmetric matrix with the
Toeplitz (banded) structure common to covariances; and b) apart from having a stable
closed-loop response, no other condition is imposed on the feedback controller.
Finally, if the reference signal is sufficiently exciting so that matrix Fl is positive
definite, then the system of linear equations given by eq. 7.24 can be solved. The
calculated ANN parameters αi, 0 ≤ i ≤ L, will be the set that minimises the mean value
of the square of the feedback controller output. The particular values of L and M will
determine how small the minimum is.
The following numerical example illustrates the use of the above equations. First
let’s assume that the reference signal refk is such that ρi = ρ−i = 0 for i ≥ 2. Then eqs.
7.26 and 7.27 can be rewritten as:
F^l_{i′j′} = ρ_0 Σ_{i=0}^{∞} φ_i φ_{i+(i′−j′)} + ρ_1 Σ_{i=0}^{∞} φ_i [ φ_{i+(i′−j′)+1} + φ_{i+(i′−j′)−1} ]        (7.28)

F^r_{i′} = ρ_0 Σ_{i=0}^{∞} γ_i φ_{i+M−i′+1} + ρ_1 Σ_{i=0}^{∞} γ_i [ φ_{i+M−i′+2} + φ_{i+M−i′} ]        (7.29)

where by definition φi = 0 for i < 0. Such a reference signal can be generated as:

ref_k = S_0 s_k + S_1 s_{k−1}        (7.30)

where the sequence sk is a white noise signal uniformly distributed between −1 and 1.
If the scalars S0 and S1 are set respectively to 1 and 0.7, then by definition we have:


E[ s_k s_{k±i} ] = { ρ_s = 1/3,  for i = 0
                   { 0,          for i ≠ 0        (7.31)

E[ ref_k ref_{k±i} ] = ρ_i = { (S_0² + S_1²) ρ_s = 1.49/3,  for i = 0
                             { S_0 S_1 ρ_s = 0.7/3,         for i = 1
                             { 0,                           for i > 1        (7.32)
Let G(z) = z^{−1} + 0.5 z^{−2} and G_FB(z) = kp + ki / (1 − z^{−1}) (a PI controller) with kp = 0.6, ki = 0.3.
The poles of the polynomial φ(z) then have magnitudes 0.7822 and 0.6193. Truncating
the infinite series in eqs. 7.28 and 7.29 after 61 terms, using ρ0 = 1.49 / 3 and ρ1 = 0.7/3,
for the case L = 3 and M = 1 the numerical values for matrices Fl and Fr are:

 0.4087 0.1271 0.04924 0.07376 



 0.1271 0.04924
 0.4087 0.1271
Fr  
 0.04924 0.1271 0.4087 0.1271 
 
 0.07376 0.04924 0.1271 0.4087 

T
Fl 0.3263 0.03604 0.03122 0.08937
and using eq. 7.24 we get:
α* = [0.9860 −0.4717 0.2149 −0.08296]T
Following the same procedure for L = 7 we get:
α* = [0.9997 −0.4994 0.2491 −0.1236 0.06061 −0.02865 0.01279 −0.004863]T
while the delayed inverse of the plant can be expressed as:

z^{−M} / G(z) = Σ_{i=0}^{∞} (−1/2)^i z^{−i} = 1 − 0.5 z^{−1} + 0.25 z^{−2} − 0.125 z^{−3} + 0.0625 z^{−4} − 0.03125 z^{−5} + 0.015625 z^{−6} − 0.0078125 z^{−7} + …        (7.33)
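
The numerical example above can be reproduced with a short script. The sketch below (Python with NumPy; not part of the thesis) obtains the impulse responses of φ(z) and γ(z) for this example by long division, builds F_l and F_r from eqs. 7.26 and 7.27 with the autocorrelation of eq. 7.32, and solves eq. 7.24.

    import numpy as np

    # G(z) = z^-1 + 0.5 z^-2 and G_FB(z) = kp + ki/(1 - z^-1) with kp = 0.6, ki = 0.3
    # give  phi(z)   = (0.9 z^-1 - 0.15 z^-2 - 0.3 z^-3) / (1 - 0.1 z^-1 - 0.15 z^-2 - 0.3 z^-3)
    # and   gamma(z) = (0.9 - 0.6 z^-1)                  / (1 - 0.1 z^-1 - 0.15 z^-2 - 0.3 z^-3)

    def impulse_response(num, den, n):
        """First n coefficients of num(z^-1)/den(z^-1) by long division."""
        h = np.zeros(n)
        for i in range(n):
            acc = num[i] if i < len(num) else 0.0
            for j in range(1, min(i, len(den) - 1) + 1):
                acc -= den[j] * h[i - j]
            h[i] = acc / den[0]
        return h

    n = 61                                   # truncation comparable to the 61 terms used above
    den = [1.0, -0.1, -0.15, -0.3]
    phi = impulse_response([0.0, 0.9, -0.15, -0.3], den, n)
    gam = impulse_response([0.9, -0.6], den, n)

    L, M = 3, 1
    rho0, rho1 = 1.49 / 3, 0.7 / 3           # eq. 7.32

    def rho(i):                              # autocorrelation of ref_k
        i = abs(i)
        return rho0 if i == 0 else (rho1 if i == 1 else 0.0)

    # Build F_l and F_r from eqs. 7.26 and 7.27 (i', j' run from 1 to L+1).
    Fl = np.zeros((L + 1, L + 1))
    Fr = np.zeros(L + 1)
    for ip in range(1, L + 2):
        for jp in range(1, L + 2):
            Fl[ip - 1, jp - 1] = sum(phi[i] * phi[j] * rho(i - j + ip - jp)
                                     for i in range(n) for j in range(n))
        Fr[ip - 1] = sum(gam[i] * phi[j] * rho(i - j + M - (ip - 1))
                         for i in range(n) for j in range(n))

    alpha = np.linalg.solve(Fl, Fr)          # eq. 7.24
    print(alpha)                             # approx. [0.986, -0.472, 0.215, -0.083]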

The above analysis shows that, under certain conditions, the cost function
expressed in eq. 7.21 has a unique minimum and the ANN network parameters that
minimize such a cost function can be used to form an approximation of the delayed
inverse of the plant G(z). At this point then we need a method that will search for this
minimum point, i.e. a learning (or parameter estimation) algorithm. Denoting by α̂i(k)
the estimated value of the ANN parameter αi at time step k, we can simply use a
gradient descent approach (as in the BP algorithm), i.e. to change the ANN parameters
in the direction that decreases the cost function. Therefore we can write:
171

 
α̂_i(k+1) = α̂_i(k) − η (∂J/∂α_i)|_{α* = α̂*(k)}        (7.34)

where η is the learning rate, i = 0, 1, ..., L and α̂*(k) = [α̂_0(k) α̂_1(k) … α̂_L(k)]^T. Using
eq. 7.22 we have:

α̂_i(k+1) = α̂_i(k) + η u^FB_k Σ_{j=0}^{∞} φ_j ref_{k−i−j}        (7.35)

However the polynomial φ(z), the closed-loop transfer function without the ANN, is not
known a priori since it would imply knowledge of the transfer function of the plant
G(z). Therefore we propose to use the following learning rule:

α̂_i(k+1) = α̂_i(k) + η u^FB_k ref_{k−i}        (7.36)

This is the same rule that Kawato ([KFS87], [MKSS88]) uses in his original feedback-
error-learning structure without a formal theoretical proof. Kawato argues that this
learning rule was based on physiological information about the plasticity of biological
neurons.
Strictly speaking the above rule would be correct only if the feedback controller
is very good, in which case φ(z) ≈ 1, yk ≈ refk−M. Therefore training the ANN to
associate its input refk (and its past values refk−1, ..., refk−L) to the plant input uk is
equivalent to training the ANN to associate yk (or more precisely yk+M, ..., yk, ..., yk+M−L)
to uk, i.e. training the ANN to learn the inverse of the dynamical model of the plant.
Another equivalent interpretation is that the learning rule expressed in eq. 7.36 ignores
any temporal correlation between the ANN output u^NN_k and the plant input u_k. Since we
can always write:

J = (1/2) E[ (u^FB_k)² ] = (1/2) E[ ( u_k − (α*)^T ref*_k )² ]        (7.37)
if we could assume that u_k is independent of u^NN_k then eq. 7.22 could be written as:

∂J/∂α* = −E[ u_k ref*_k ] + E[ ref*_k (ref*_k)^T ] α* = −E[ u^FB_k ref*_k ]        (7.38)

If we could assume that φ(z) ≈ 1, such an independence becomes evident if we use eq.
7.20 to write:
U(z) = U_FB(z) + U_NN(z) = z^{−M} γ(z) Ref(z) + [ 1 − φ(z) ] G_NN(z) Ref(z)        (7.39)

An important advantage of using the learning rule proposed in eq. 7.36 is that it can be
readily generalized to the nonlinear case when the ANN can be, for instance, a multi-
layer perceptron and the BP algorithm is used to train the ANN, as the next sections will
show.
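
For reference, a minimal sketch of this generalization (Python with NumPy; one hidden layer and illustrative weight names, not the thesis notation) is given below. The output of the feedback controller is simply injected where the output error would normally enter the Back-Propagation pass, exactly as eq. 7.36 does in the linear case.

    import numpy as np

    def fel_update(W1, b1, W2, b2, ann_input, u_fb, eta):
        """One feedback-error-learning update of a one-hidden-layer MLP.

        The feedback controller output u_fb is used in place of the (unknown)
        ANN output error.
        """
        h = np.tanh(W1 @ ann_input + b1)          # hidden layer
        u_nn = W2 @ h + b2                        # linear output (ANN control signal)

        delta_out = u_fb                          # "error" fed back through the net
        delta_hid = (W2.T @ delta_out) * (1.0 - h ** 2)

        W2 += eta * np.outer(delta_out, h)
        b2 += eta * delta_out
        W1 += eta * np.outer(delta_hid, ann_input)
        b1 += eta * delta_hid
        return u_nn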
Numerical simulations seem to support the use of the learning rule proposed in
eq. 7.36. However, more theoretical work is needed to justify it formally. A possible
fruitful line may be to interpret the output of the feedback controller as an
approximation of the true ANN output error. This is the same as seeing the plant input
uk as an imperfect teacher [ShBr71] or, as in reinforcement learning and adaptive critic
theory (see section 7.2.2), the output of the feedback controller more as a qualitative
than quantitative performance index.

7.3.4 - Simulations for the Linear Case


In this section we present the results of simulations for linear plants. The
example is a minimum-phase plant but the same principle could be used for a non-
minimum-phase plant. In order to make comparisons between theoretical and
experimental results easier the plant chosen is the same used in the numerical example
shown in the previous section.
The feedback controller was implemented as a discrete-time PID (proportional-
integral-derivative) controller at time step k using the following algorithm:

e_k := ref_{k−M} − y_k
e^i := e^i + e_k
u^FB_k := kp e_k + ki e^i + kd ( e_k − e^d )        (7.40)
e^d := e_k
where k = 0, 1, ..., NT, and the variables ei and ed were initialized as zero at the
beginning of the simulation. The ANN was implemented as a FIR linear filter with an
added bias, i.e. with no hidden units and a linear output such that:
u^NN_k = bias + Σ_{i=0}^{L} α_i ref_{k−i}        (7.41)

Consequently the number of parameters to be adjusted by the learning rule is L+2 and
they were initialized as zero such that before being trained the ANN output is zero for
any reference signal used as input.


It is also assumed that the simulation starts at its zero state, i.e. yk = 0 for k ≤ 0
and uk = 0 for k < 0. In relation to the reference signal it is assumed that refk = ref0 for
k < 0.
The ANN parameters were adjusted by the following learning rules:

α_i(k+1) = α_i(k) + η u^FB_k ref_{k−i} ,
bias(k+1) = bias(k) + η u^FB_k ,

where η = learning rate.


The plant is described by yk+1 = uk + 0.5 uk−1. The parameters of the feedback
controller were set to [kp, ki, kd] = [0.6, 0.3, 0]. The parameters L and M were set
respectively to 3 and 1. The learning rate was set to 0.08/3. The total number of steps
NT used in this simulation was set to 3600.
The reference signal refk was generated for 0 ≤ k ≤ 2000 as:

ref_k = s_k + 0.7 s_{k−1}

where, as explained in the previous section, the sequence s_k is generated to simulate
white noise uniformly distributed between −1 and 1.

Figure 7.14 - The plant output y and the error e

Figure 7.15 - The variation of ANN parameters during training

For the period 2001 ≤ k ≤ 3600 the reference signal was a square wave with
amplitude varying between 2 and -2 and with period 400 time steps. The ANN was
trained only during the period 0 ≤ k ≤ 2000.
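
For completeness, the sketch below (Python; a simplified reconstruction of the setup just described, not the original simulation code) puts the pieces together: the plant, the PID feedback controller of eq. 7.40, the linear ANN of eq. 7.41 and the learning rule of eq. 7.36.

    import numpy as np

    np.random.seed(0)
    L, M, eta, NT = 3, 1, 0.08 / 3, 3600
    kp, ki, kd = 0.6, 0.3, 0.0

    # Reference signal: filtered white noise for training, then a square wave.
    s = np.random.uniform(-1.0, 1.0, NT + 1)
    ref = np.empty(NT + 1)
    ref[0] = s[0]
    ref[1:2001] = s[1:2001] + 0.7 * s[0:2000]
    t = np.arange(2001, NT + 1)
    ref[2001:] = np.where(((t - 2001) // 200) % 2 == 0, 2.0, -2.0)

    alpha, bias = np.zeros(L + 1), 0.0
    y = np.zeros(NT + 2)                 # plant output y[k]
    u = np.zeros(NT + 1)                 # plant input u[k]
    e_int, e_old = 0.0, 0.0

    for k in range(NT + 1):
        taps = ref[np.clip(np.arange(k, k - L - 1, -1), 0, None)]

        # PID feedback controller (eq. 7.40)
        e = ref[max(k - M, 0)] - y[k]
        e_int += e
        u_fb = kp * e + ki * e_int + kd * (e - e_old)
        e_old = e

        # linear ANN: FIR filter with bias (eq. 7.41)
        u_nn = alpha @ taps + bias
        u[k] = u_fb + u_nn

        # plant: y_{k+1} = u_k + 0.5 u_{k-1}
        y[k + 1] = u[k] + 0.5 * (u[k - 1] if k >= 1 else 0.0)

        # learning rule (eq. 7.36), applied only during the first 2000 steps
        if k <= 2000:
            alpha += eta * u_fb * taps
            bias += eta * u_fb

    print(alpha, bias)   # approaches roughly [1.0, -0.5, 0.22, -0.08] and ~0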
Figure 7.14 shows the plant output y and the error e (defined in eq. 7.10 as the
difference between the delayed reference signal refk−M and the plant output yk). Figure
7.15 shows how the 5 ANN parameters vary during training. At the end of the training
period (2000 time steps) their values were:
[α0 α1 α2 α3 bias] = [1.0077 −0.5094 0.2252 −0.08290 0.001774]
According to the previous section the expected values for the ANN parameters are:
[α0 α1 α2 α3 bias]= [0.9860 −0.4717 0.2149 −0.08296 0]
Note in figure 7.14 that after 500 time steps the error was very small. The plant output
followed very closely the square wave reference signal, although the ANN was no
longer being trained.

7.3.5 - Using a Variable Feedback Controller


In the next section we present the results of simulations where the plant is
nonlinear (more specifically a robot with two revolute joints) and a variable feedback
controller is used.
As we showed before, the aim of the control structure illustrated in fig. 7.13 is
to perform closed-loop identification of the inverse dynamical model of the plant. If the
plant is assumed to be linear as in the previous section, the ANN can simply be a FIR
filter with an added bias as in eq. 7.41. The addition of the delay M was necessary to
guarantee the existence of an approximate inverse dynamical model.
In the nonlinear case we have to assume that an inverse dynamical model also
exists and to use an ANN capable of approximating nonlinearities, such as a FF ANN
with nonlinear hidden units. In these simulations we will use the multi-layer perceptron
(MLP) but in principle any other ANN model capable of nonlinear modelling, such as
the RBF ANNs, could also be used. Again we have the problem of how to specify the
ANN topology since the ANN must be such that it can approximate the inverse of the
(assumed unknown) plant.
Another problem is to design the reference signal such that the ANN will
converge to a close approximation of the inverse dynamical model of the plant. As in
the linear case it is important that the reference signal is exciting enough so that
important features of the plant appear and are detectable at its output. At the same time
the reference signal must be such that the control of the plant and the dynamics of the
ANN parameters are stable.
When trained with reference signals that are not very exciting (especially when
the plant is nonlinear) the ANN can converge to an approximation of the inverse model
that is very good but only for the training signal(s). If training stops and the ANN is
used with reference signals different from the ones used during training, the quality of
control can be worse than when only the feedback controller is used. In other words, the
neural controller may not be able to generalize well from the training trajectory to other
trajectories not seen before.
One technique that can be used to alleviate this problem is to use an extra degree
of freedom available in the control structure, i.e. to vary slowly the gains of the
feedback controller. Such an idea has also been used in the "classical" closed loop
identification of linear systems (see [SoSt89], chapter 10) where it is possible to show
that, by shifting between different feedback controllers, the accuracy of the estimates is
increased. By varying slowly the gains of the feedback controller we hope to excite
different plant modes and to make their effects noticeable at the output so that they can
be better identified.
Assume a simple case where the training session consists of a set number of runs
where in each run a fixed number of periods of the same reference signal is used. One
alternative to varying the gains of the feedback controller is to gradually reduce the
gains, for instance, by multiplying the gains of the previous run by a positive constant
(< 1). By reducing the gains of the feedback controller we hope to decrease the quality
of the feedback controller and thereby increase the output errors. At the beginning of
training a stronger feedback signal may be more suitable since it may provide the
necessary stability, given that the ANN output is still not significant. Later on, a weaker
feedback controller may be more appropriate since it will increase the output tracking
errors and force the ANN to be trained in regions where it would not be trained
otherwise.
Another alternative, the one that was used in these simulations, is to adapt the
gains of the feedback controller according to some performance criteria for the previous
runs, for instance, to decrease the gains if the RMS (root-mean-square) error calculated
for the previous run was reduced and to increase the gains otherwise. This avoids the
situation where the quality of the feedback controller becomes too poor while the ANN
has not yet received enough training, and consequently the plant becomes unstable.

7.3.6 - Simulations for the Nonlinear Case with a Variable Feedback Controller
Assuming that the robot is moving freely in its workspace, its dynamical
equations, in general, can be written in vectorial notation as [Sch90]:

τ = D(q) q̈ + H(q, q̇) + G(q) + B(q̇)        (7.42)

where:
τ = vector of torques (for revolute joints) or forces (for prismatic joints) applied
by the actuators at the joints of the robot arm;
q = generalized coordinates (angles if revolute joints or distances if prismatic
joints);
D (q) q̈ = acceleration term that represents the forces and torques generated by
the motions of the links of the robot arm;
D (q) = Inertia Matrix; a symmetric and positive-definite matrix;
H (q,q̇) = product velocity term that represents the Coriolis and centrifugal forces.
Some authors also refer to this term as H (q,q̇) q̇;
G (q) = position term that represents the loading due to gravity;
B (q̇) = velocity term that represents the friction opposing the motion of the robot
arm.
Figure 7.16 - The two-joint robot arm

Moreover, this equation assumes that the robot arm is rigid and it does not include the
actuator dynamics.
For a two-joint robot with revolute joints, assuming that the masses are
concentrated at the joint ([LBDD88], [GuSe89]), we have (see figure 7.16):

τ_1 = D_11 q̈_1 + D_12 q̈_2 + H_1 ( 2 q̇_1 q̇_2 + q̇_2² ) + B_1 + G_1        (7.43)

τ_2 = D_21 q̈_1 + D_22 q̈_2 − H_1 q̇_1² + B_2 + G_2        (7.44)

where:
D11 = (m1 + m2) d1² + m2 d2² + 2 m2 d1 d2 cos(q2)
D12 = D21 = m2 d2² + m2 d1 d2 cos(q2)
D22 = m2 d2²
H1 = −m2 d1 d2 sin(q2)
B1 = b1v q̇1 + b1c sign(q̇1)
B2 = b2v q̇2 + b2c sign(q̇2)
G1 = (m1 + m2) g d1 sin(q1) + m2 g d2 sin(q1 + q2)
G2 = m2 g d2 sin(q1 + q2)
and:
m1, m2 = masses of the links.
d1, d2 = lengths of the links
b1v, b2v = coefficients of viscous friction.
b1c, b2c = coefficients of coulomb friction.
g = gravitational constant.
The parameter m2 includes the mass of any existing load.


For a state-space representation, we can define the state vector x as:

x = [ x_1 x_2 x_3 x_4 ]^T = [ q_1 q_2 q̇_1 q̇_2 ]^T        (7.45)

Then, defining the following matrices:

E = [ D_22  −D_12 ; −D_12  D_11 ]        (7.46)

F = [ τ_1 − H_1 ( 2 x_3 x_4 + x_4² ) − B_1 − G_1 ;  τ_2 + H_1 x_3² − B_2 − G_2 ]        (7.47)

where det(E) = D_11 D_22 − (D_12)², the state-space dynamics are:

[ ẋ_1 ; ẋ_2 ] = [ x_3 ; x_4 ]        (7.48)

[ ẋ_3 ; ẋ_4 ] = [ D_11  D_12 ; D_12  D_22 ]^{−1} F = (1 / det E) E F        (7.49)
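
A sketch of these dynamics as a state-derivative function (Python with NumPy; an illustration based on eqs. 7.43-7.49 above, not the original simulation code) could look as follows.

    import numpy as np

    def arm_state_derivative(x, tau, m1, m2, d1, d2, b1v, b2v, b1c, b2c, g=9.81):
        """State derivative of the two-joint arm; x = [q1, q2, dq1, dq2], tau = [tau1, tau2]."""
        q1, q2, dq1, dq2 = x

        D11 = (m1 + m2) * d1**2 + m2 * d2**2 + 2 * m2 * d1 * d2 * np.cos(q2)
        D12 = m2 * d2**2 + m2 * d1 * d2 * np.cos(q2)
        D22 = m2 * d2**2
        H1 = -m2 * d1 * d2 * np.sin(q2)
        B1 = b1v * dq1 + b1c * np.sign(dq1)
        B2 = b2v * dq2 + b2c * np.sign(dq2)
        G1 = (m1 + m2) * g * d1 * np.sin(q1) + m2 * g * d2 * np.sin(q1 + q2)
        G2 = m2 * g * d2 * np.sin(q1 + q2)

        # Inertia matrix and the right-hand side F of eq. 7.47
        D = np.array([[D11, D12], [D12, D22]])
        F = np.array([tau[0] - H1 * (2 * dq1 * dq2 + dq2**2) - B1 - G1,
                      tau[1] + H1 * dq1**2 - B2 - G2])
        ddq = np.linalg.solve(D, F)               # eq. 7.49
        return np.array([dq1, dq2, ddq[0], ddq[1]])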
In relation to the diagram of the control structure illustrated in fig. 7.13 we have:
u = plant input = [τ1 τ2 ]T, y = plant output = [q1 q2 ]T.
The simulations are divided into two groups. In the first group the ANN is
trained using a fixed feedback controller (i.e. its gains are kept constant). In the second
group a variable feedback controller is used to train the ANN. The same set of nine
reference trajectories (RTs) are used in each group of simulations. The ANN is first
trained using the first RT, and then tested with all 9 RTs. Even when the ANN is trained
with a variable feedback, it is tested using the same original feedback gains used to
perform the ANN training and testing in the other group of simulations. Finally we show
that by training the ANN with a variable feedback controller the RMS tracking errors
are significantly reduced, i.e. the ANN generalization to other RTs is improved.
The following parameters were used to simulate the robot arm: m1 = m2 = 10
kg, d1 = d2 = 1 m, b1v = b2v = 5 N m s, b1c = b2c = 0 N m s, g = 9.81 m/s². The dynamic
equations of the robot arm (eqs. 7.48 and 7.49) were simulated using the classical
fourth-order Runge-Kutta algorithm [PFTV88] with an integration step size h = ∆T / 2,
where ∆T = sampling period = 0.01 s (100 Hz). The delays M (see fig. 7.13) for each
component of the RT were set to 1.
In relation to the ANN, the parameter L was set for each component of the RT
to 19. Therefore the ANN had 40 input units. Two hidden layers of hyperbolic tangent
(tanh) units were used, where the hidden layer closer to the input side had 30 units and
the other hidden layer had 10 units. Since there are two joints, the ANN had 2 output
linear units. The ANN was a strictly feedforward network, i.e. each layer sent
connections only to the next consecutive layer. All network weights and biases, except
the weights and biases of the output units, were initialized as random numbers with
gaussian distribution of zero mean and 1/2 as the standard deviation. The weights and
biases of the two output units were initialized as zero, such that before training the ANN
output is zero for any RT. The ANN was trained using the BP algorithm.
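
A sketch of this topology and of the initialization just described (Python with NumPy; the variable names are illustrative) is:

    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in, n_out, std):
        W = rng.normal(0.0, std, size=(n_out, n_in))
        b = rng.normal(0.0, std, size=n_out)
        return W, b

    # 40 inputs -> 30 tanh units -> 10 tanh units -> 2 linear outputs
    W1, b1 = init_layer(40, 30, 0.5)
    W2, b2 = init_layer(30, 10, 0.5)
    W3 = np.zeros((2, 10))          # output layer starts at zero so that the
    b3 = np.zeros(2)                # untrained ANN contributes no control action

    def ann_forward(x):
        h1 = np.tanh(W1 @ x + b1)
        h2 = np.tanh(W2 @ h1 + b2)
        return W3 @ h2 + b3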
All RTs have a duration of 4 s. The initial point of a RT is the same as its final
point since the basic idea was to perform the training and testing sessions by following
periodic movements. All RTs, except RT 2, were generated as:

π   2πt  (7.50)
q d (t)  a b cos  
4   c 
The RT 2 was generated as:

 
π a b  (7.51)
q d (t)
2  
 1 exp 5(t c) 

The parameters a, b and c and the minimum and maximum values (in degrees) for each
of the RTs are respectively:
Joint 1 Joint 2
RT 1: 2, −2, 4 ( 0°, 180°) 1, −1, 2 ( 0°, 90°)
RT 2: 1, −1, 1 ( 0°, 90°) 0, 1, 1 ( 0°, 90°) (0 ≤ t ≤ 2 s)
0, 1, 3 ( 0°, 90°) 1, −1, 3 ( 0°, 90°) (2 ≤ t ≤ 4 s)
RT 3: 1, 1, 4 ( 0°, 90°) 1, −1, 4 ( 0°, 90°)
RT 4: 2, −2, 4 ( 0°, 180°) 0, −2, 2 (−90°, 90°)
RT 5: 3, −1, 4 (90°, 180°) −1, 1, 4 (−90°, 0°)
RT 6: 2, −2, 4 ( 0°, 180°) 1, −1, 4 ( 0°, 90°)
RT 7: 3, −1, 4 (90°, 180°) 1, −1, 4 ( 0°, 90°)
RT 8: 3, −1, 4 (90°, 180°) −1, 1, 2 (−90°, 0°)
RT 9: 2, −2, 4 ( 0°, 180°) 1, −1, 4/3 ( 0°, 90°)
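
With the parameters listed above, the desired trajectories of eqs. 7.50 and 7.51 can be generated with a few lines (Python sketch; the sign inside the exponential of eq. 7.51 is assumed here):

    import numpy as np

    def rt_cosine(a, b, c, t):
        """Eq. 7.50: desired joint angle (radians) at time t."""
        return (np.pi / 4.0) * (a + b * np.cos(2.0 * np.pi * t / c))

    def rt_sigmoid(a, b, c, t):
        """Eq. 7.51 (exponent sign assumed): desired joint angle at time t."""
        return (np.pi / 2.0) * (a + b / (1.0 + np.exp(-5.0 * (t - c))))

    t = np.arange(0.0, 4.0, 0.01)          # 4 s trajectory sampled at 100 Hz
    rt1_joint1 = rt_cosine(2, -2, 4, t)    # RT 1, joint 1: 0 deg to 180 deg
    rt1_joint2 = rt_cosine(1, -1, 2, t)    # RT 1, joint 2: 0 deg to 90 deg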
Note that the RTs and the plant outputs (angles of each joint) are specified in radians.
Figure 7.17 - Desired joint positions for RT 1 at t = 0, 1, 2, 3 and 4 seconds.

Figure 7.17 shows the desired joint positions of the robot arm for RT 1 at t = 0, 1, 2,
3 and 4 seconds. RT 2 and RT 3 are very similar with the only difference that the
extreme points in RT 2 are joined by a sigmoidal function while in RT 3 they are joined
by a sinusoidal function. Note that, except for RT 2, the parameter c gives the period
of the RT for the joint. Therefore the period of all 9 RTs for joint 1 is 4 s. For joint 2:
RTs 2, 3, 5, 6 and 7 have period 4 s; RTs 1, 4 and 8 have period 2 s; and RT 9 has
period 1.33 s.
The numbering for each RT was selected considering the RMS error for joint 1
when only the feedback controller (with the original fixed gains) was used to control the
robot arm, such that RT 1 and RT 9 were respectively the RTs with the smallest and
largest RMS errors. Considering the whole set of 9 RTs the maximum and minimum
values for each joint was calculated (minimum: 0°, −90°; maximum: 180°, 90°) and such
values were used to scale the ANN inputs to be between −1 and 1.
Each group of simulations consisted first of a training session and then a recall
session. The training session consisted of using RT 1 for 40 runs, i.e. the total training
time was only 160 s = 2.67 min. Only at the beginning of the training session the robot
arm position was reset in order to coincide with the initial desired position for RT 1. In
the recall session all ANN weights and biases are fixed and each one of the 9 RTs was
tested, i.e. the recall session consisted of 9 runs where each run used a different RT.
Whenever a new RT is being tested in the recall session the arm position is reset to
coincide with the initial desired position of the new RT.
During the training session the following parameters were used in the BP
algorithm: a) the momentum was set to zero; and b) the learning rate for all network
weights and biases was set to 1/500 for the first 20 runs and halved for the last 20 runs.
The output of the feedback controller was calculated as in eq. 7.40 with the
difference that the variables e, ref, y, uFB, ei, ed are now vectors, kp, ki and kd are now
matrices and:

u^FB_k = kp e_k + ki ΔT e^i + (kd / ΔT) ( e_k − e^d )        (7.52)
For the case where the ANN was trained with a fixed feedback controller the gains were
set to: kp = diag[2000 500], ki = diag[0 0] and kd = diag[200 100].
Since the duration of each RT was 4 s and the sampling period ∆T was 0.01 s,
each run had 400 time steps. In order to measure the performance of the neural
controller the RMS values of some variables were calculated for each run by:

RMS(var) = [ (1/NT) Σ_{k=1}^{NT} var_k² ]^{1/2}        (7.53)

where NT = number of time steps in each run = 400, and var is replaced by each
component of the vector output error e, ANN output uNN, output of the feedback
controller uFB, and plant input u.
In the training session with the variable feedback the gains of the feedback
controller (kp, ki, kd) for the first run (run 1) were the original gains as in the case of
fixed feedback. The following rules were used to determine the feedback gains for the
subsequent runs (i > 1):
1) IF the performance of the last run (run i−1) improved in relation to the
run before that (run i−2); AND IF the performance of the last run was
better than the performance when only the original feedback controller
was used: THEN decrease the feedback gains.
2) IF the performance of the last run improved in relation to the run
before that; AND IF the performance of the last run was NOT better than
the performance when only the original feedback controller was used:
THEN do not change the feedback gains.
3) IF the performance of the last run deteriorated in relation to the run
before that; THEN increase the feedback gains.
The measure of performance used for each run is the summation of the RMS error for
both joints. Run 0 is the run where only the original fixed feedback controller is used.
The ANN begins to be trained only in run 1. The feedback gains for run i were
calculated by multiplying or dividing the feedback gains for run i−1 by 0.92, depending
on whether they should be decreased or increased. Another alternative is to use different
factors for increasing and decreasing the feedback gains. Figure 7.18 shows for each run
during the training session the value of the feedback gains in relation to the original
feedback gains.
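
A sketch of this adaptation rule (Python; the function and variable names are illustrative, with the factor 0.92 and the run-0 convention as described above) is:

    def next_gain_multiplier(multiplier, rms_error_history, factor=0.92):
        """Update the feedback-gain multiplier after run i (i > 1).

        rms_error_history[i] is the summed RMS tracking error of run i;
        index 0 is the run with the original feedback controller alone.
        """
        improved = rms_error_history[-1] < rms_error_history[-2]
        better_than_fb_only = rms_error_history[-1] < rms_error_history[0]

        if improved and better_than_fb_only:      # rule 1: decrease the gains
            return multiplier * factor
        elif improved:                            # rule 2: keep the gains
            return multiplier
        else:                                     # rule 3: increase the gains
            return multiplier / factor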
Considering only the case where the variable feedback controller was used during
the training session, fig. 7.19 (a) and (b) show the history of the RMS values of the
ANN output, the output of the feedback controller and the control signal for each joint.
Figures 7.20 and 7.21 show RT 1 and RT 3 being recalled before and after training was
performed only using RT 1.
Figure 7.22 shows the RMS errors for all RTs during the recall session (after
training) for the cases where the ANN was trained using a fixed or variable feedback
controller. For comparison figure 7.22 also shows the RMS errors for the case where
only the original feedback controller is used. We can see that when trained with a fixed
feedback controller the ANN cannot generalize well and it can even be considerably
worse than using only the feedback controller (RTs 4, 5 and 8). We can also see that
the ANN trained with the variable feedback controller generalizes considerably better
than the one trained with a fixed feedback. Considering the set of all 9 RTs the ANN
trained with the variable feedback had a performance much better or comparable (i.e.
never much worse) to the case where only the feedback controller was used.
Note that, as expected, as the difference between a specific RT and the RT used
during training increases, the performance of the ANNs deteriorates. The worst case
happens for RT 9, which has a higher frequency than RT 1.

Figure 7.18 - The feedback gain multiplier during the training session

Figure 7.19 - The RMS values for the feedback and ANN controllers during training

Figure 7.20 - Recalling RT 1 before and after training with RT 1, using a variable feedback controller

Figure 7.21 - Recalling RT 3 before and after training with RT 1, using a variable feedback controller

Figure 7.22 - The RMS errors for all RTs during the recall session for the cases when the ANN is trained with a fixed or variable feedback controller

In practice, instead of just one RT, a set of several RTs and a longer session
should be used to train the ANN. However, these simulations show that the feedback
controller also has an important role to play in such a control structure for nonlinear
identification/adaptive control.

7.4 - Fault-Tolerant Dynamical Control

One field in control theory that has not received much attention is the design of
fault-tolerant controllers (also called reliable controllers), i.e. controllers that have a
graceful degradation of performance in relation to a set of faults. Some possible faults
are loss of actuators, loss of sensors or internal damage to the controller. However, in
certain application areas such as nuclear reactors, aircraft, space missions and chemical
plants, fault tolerance (or reliability) can be a crucial requirement.
A different problem is to have a controller that is robust in relation to faults in
the plant, in which case a conventional solution is: 1) to detect that a fault has happened
and to determine in which part of the plant it has happened (fault detection and fault
isolation); 2) switch to a previously designed controller for the specific fault; 3) if such
a controller is not available (perhaps because the possibility of the specific fault was not
predicted), then run a parameter identification algorithm to determine the model of the
faulty plant and design a new controller for the faulty system.
The same approach could be used for faults in the controller or loss of sensors
and actuators. One problem is to have to account for all possible faults, a cumbersome
task for complex plants that normally require complex controllers with several sensors
and actuators. The second problem is the delay in detecting and isolating the fault. Such
a delay may be enough to make the plant unstable and cause a catastrophic failure.
The aim of a fault-tolerant controller is to have the plant, whenever a fault occurs
(in the controller or in the plant itself), if not operating optimally, at least in a safe
configuration until the fault is neutralised by repairing the plant or by reconfiguring the
controller. Therefore an increase in the fault tolerance of the controller will increase the
overall stability of the control system.
Popular control system designs, in general, do not take into consideration the
possibility of faults in the controller and therefore, not surprisingly, are not fault tolerant
(or reliable), as Rosenbrock and McMorran [RoMc71] pointed out. Viswanadham, Sarma
and Singh [VSS87] proposed a method of designing controllers for linear plants, using
a stable factorization approach, that are tolerant to loss of sensors and actuators.
Due to their distributed organization, ANNs have the potential of being fault-
tolerant in relation to internal damage to themselves, such as the loss of internal units or
weights. However, as chapter 5 shows, the training algorithm has to be able to exploit
such potential. Moreover, since ANNs are adaptable (or learning) devices, they appear
to be able to find alternative solutions when they suffer partial damage, i.e. they can at
least partially reconfigure themselves. In this section we show how such a potential for
fault tolerance to internal damage can be used to obtain fault-tolerant neural controllers.
In this section the BPS algorithm, developed in chapter 5, is used to improve the
fault tolerance of the neural controller in relation to faults in the ANN [NaZa93]. The
controller structure is the same used in the previous section, i.e. the modified feedback-
error-learning neural control structure (see figure 7.13). The plant to be controlled is an
inverted pendulum that should follow a varying reference signal. Note that this is
different from aiming to maintain the pendulum in the upright position.
The dynamical equation for the inverted pendulum (see fig. 7.23) can be simply
derived using the principle that the resultant torque acting on the body is equal to the
time variation of the total angular momentum of the body. Assuming that the pendulum
is a thin cylinder or rod of mass m1 and length a1 with mass mL concentrated at the tip
of the pendulum, g = gravity, b1 = coefficient of viscous friction, θ = angular position
and τ = external applied torque, we can write [Sch90]:
Figure 7.23 - The inverted pendulum, a 1-axis robot

τ = ( m1/3 + mL ) a1² θ̈ + g ( m1/2 + mL ) a1 cos θ + b1 θ̇        (7.54)

where the term (m1/3 + mL) a1² is the moment of inertia of the pendulum. In relation to
the diagram illustrated in figure 7.13, we have: u = plant input = τ, x = [x1 x2]T =
[θ, dθ/dt]T, y = x1 = θ. If it is desired to force the angular position to have a large
excursion, then control of such a plant is a difficult task because it is equivalent to
having a linear plant where one of the parameters (the gravitational load in this case, the
second term on the right side of eq. 7.54) varies according to the position of the
pendulum.
The following parameters were used in our simulations: M = 1, L = 29,
∆T = sampling period = 0.1 s, and plant parameters:
[a1, m1, mL, b1, g] = [0.5 m, 0.25 kg, 0.25 kg, 0.1 N m s, 9.8 m/s²].
The plant was again simulated using the classical fourth-order Runge-Kutta algorithm
with an integration step size h = ∆T/2. The feedback controller was also a PID controller
with parameters [kp, ki, kd] = [0.5, 0.5, 0.1] and implemented as in the previous section
(see eq. 7.52).
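
The plant simulation itself reduces to a few lines. The sketch below (Python with NumPy; an illustration, not the original code) implements eq. 7.54 as a state-derivative function together with one classical fourth-order Runge-Kutta step with step size h = ΔT/2.

    import numpy as np

    a1, m1, mL, b1, g = 0.5, 0.25, 0.25, 0.1, 9.8
    J = (m1 / 3.0 + mL) * a1**2                    # moment of inertia

    def pendulum_deriv(x, tau):
        """x = [theta, dtheta/dt]; returns dx/dt from eq. 7.54."""
        theta, dtheta = x
        ddtheta = (tau - g * (m1 / 2.0 + mL) * a1 * np.cos(theta) - b1 * dtheta) / J
        return np.array([dtheta, ddtheta])

    def rk4_step(x, tau, h):
        """One classical fourth-order Runge-Kutta step with step size h."""
        k1 = pendulum_deriv(x, tau)
        k2 = pendulum_deriv(x + 0.5 * h * k1, tau)
        k3 = pendulum_deriv(x + 0.5 * h * k2, tau)
        k4 = pendulum_deriv(x + h * k3, tau)
        return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)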
The variable ISE (Integral of the Squared Error) is used as a performance index
for the particular setup used during the simulation. Such a variable was initialized as
zero and updated at each time step as: ISE := ISE + e_k². The reference signal ref is
described by:
ref_k = (π/4) cos( π k ΔT ) − π/4        (7.55)
where k = 0, 1, ..., NT. In other words, the desired angle trajectory is a sinusoidal signal
varying between 0 and −90 degrees with period 2 s and sampled with period ∆T = 0.1
s. At the beginning of each simulation it is assumed that xk = [0 0]T and refk = 0 for
k ≤ 0.
The ANN has 30 inputs (L = 29), 1 hidden layer with 10 hyperbolic tangent
units, 1 linear output unit and no direct connection from the input to the output layer.
Before entering the ANN, the reference signal is scaled such that its magnitude lies
between −1 and 1. The incoming weights and biases for the hidden units were initialized
as random values using a gaussian distribution with zero mean and standard deviation
0.5. Again the incoming weights and biases for the output units were initialized as zero
since, before being trained, the ANN should produce a zero output for any input.
The learning rate and momentum rate used by the BP algorithm were set at the
beginning of the simulation to 1/800 and zero respectively. For the second half of the
of the training period, the learning rate was reduced to half of its initial value.
The network was trained for 400 s using the BP and BPS algorithms. Care was
taken to ensure that the network initial weights were the same for both algorithms.
It was assumed that the set of possible faults was composed of the failure of each
hidden unit (output of the faulty hidden unit is clamped to zero), all hidden units have
the same probability of failure and the no-fault configuration is as probable as each fault
configuration. Therefore there are 11 possible configurations, each of them with the
same probability. When the BPS algorithm was used, one of the possible network
configurations was randomly chosen for each time step k, i.e. SWEPO = 1.
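
A sketch of how such a fault configuration can be drawn and injected at every time step (Python with NumPy; illustrative names, with configuration 0 denoting the no-fault case) is:

    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden = 10

    def sample_configuration():
        """Pick one of the 11 equiprobable configurations (0 = no fault,
        1..10 = hidden unit with that index is clamped to zero)."""
        return rng.integers(0, n_hidden + 1)

    def apply_fault(hidden_activations, config):
        """Clamp the output of the faulty hidden unit to zero, if any."""
        h = hidden_activations.copy()
        if config > 0:
            h[config - 1] = 0.0
        return h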
Figures 7.24 (a) and (b) illustrate the actual and desired trajectories for the joint
angle (θk and refk−1) during the last 10 s of the training session respectively for the cases:
a) training with the BP algorithm; and b) training with the BPS algorithm. Figure
7.24 (c) shows the same variables when only the feedback controller is used to control
the plant for the same 400 s. Observe that the feedback controller is far from being
optimally tuned for this particular plant. Despite this, the ANN still converges to a good
solution for both training methods as figures 7.24 (a) and (b) show. This illustrates that,
given a long enough period to train the ANN, the employed control structure is robust
in relation to the quality of the feedback controller.
Figure 7.24 - Actual and desired joint angle trajectory at the end of the training session
when: (a) using the BP algorithm; (b) using the BPS algorithm; (c) using only the
feedback controller. Part (d) shows the RMS values, calculated at every 2 s, for the
error, the controller outputs and the control signal for the previous cases (b) and (c).

Figure 7.24 (d) shows the root-mean-square (RMS) values for the control signals
during the training session when the ANN is trained with the BPS algorithm and also
the RMS value for the feedback controller alone. The RMS values were calculated for
every 2 s interval, i.e. one period of the reference signal, using:

RMS(var)_i = [ (1/20) Σ_{k=20(i−1)+1}^{20 i} var_k² ]^{1/2}        (7.56)

where i = interval number, 1 ≤ i ≤ 200, and var is replaced by e, uFB, uNN or u. The case
when the ANN is trained using the BP algorithm results in a graph similar to fig.
7.24 (d) where the ANN is trained with the BPS algorithm. Considering the whole
training session the values for the variable ISE are: a) 43.07, when using the BP
algorithm; b) 53.10, when using the BPS algorithm; and c) 332.7, when only the
feedback controller is used to control the plant.


After the training session all 11 possible configurations for the ANN were tested
using the same reference signal for the same period of time (400 s) and the ANN
weights and biases were kept fixed, i.e. learning was disabled. Before each test the
system state x was reinitialized to zero, i.e. the pendulum was placed in its initial
horizontal position with a zero angular velocity. Since the calculation of the ISE starts
at the beginning of the simulation for the training and the testing phases, the ISE also
includes a transient error which is a consequence of changing the pendulum state from
a rest position to a periodic movement.
Table 7.1 shows the values for the integral of the squared error ISE for the test
runs of the no-fault and the 10 fault configurations for both training methods.

Hidden unit            BP                  BPS
removed           ISE     Ratio       ISE     Ratio
None             22.34     1.00      32.93     1.00
1                31.95     1.43      24.74     0.75
2                58.53     2.62      28.61     0.87
3                26.33     1.18      34.06     1.03
4                60.53     2.71      68.65     2.09
5                29.47     1.32      43.31     1.32
6                53.19     2.38      43.24     1.31
7               118.40     5.30      77.30     2.35
8                25.37     1.14      26.65     0.81
9                22.70     1.02      25.10     0.76
10               64.75     2.90      78.10     2.37
max             118.40     5.30      78.10     2.37
mean             49.12     2.20      44.98     1.37
min              22.70     1.02      24.74     0.75
st. dev.         29.32     1.31      21.69     0.66

Table 7.4 - Integral of the squared error (ISE) when the ANN is tested for fault
tolerance. The mean and standard deviation values include only the fault configurations.

The mean
and standard deviation values were calculated considering only the ISE for the fault
configurations. When the BPS algorithm was used, the ISE for the no-fault
configuration increased from 22.34 to 32.93 (a variation of 47.4%), but for the fault
configurations the mean ISE reduced from 49.12 to 44.98 (8.4%) and the standard
deviation reduced from 29.32 to 21.69 (26%). Also the maximum possible ISE was
reduced from 118.40 to 78.10 (a variation of 34%). Observe that, when the ANN was
trained with the BPS algorithm, the ISE was in fact reduced with the loss of some of
the hidden units.
Figure 7.25 shows the tracking error if the ANN is tested as before but with a
fault occurring at 200 s, instead of at the beginning of the simulation. The no-fault
configuration is used until 200 s and afterwards the ANN loses hidden unit 7. The
tracking error is shown for the cases when the ANN was trained with the BP or with
the BPS algorithm. From figure 7.25 we can see that, in comparison with the BP
algorithm, by training the ANN with the BPS algorithm, the magnitude of the tracking
error is increased before the fault and decreased after it.

Figure 7.25 - The tracking error during testing when the hidden unit 7 is lost at 200 s

7.5 - Conclusions

In the first part of this chapter the major neural control architectures were
reviewed. The feedback-error-learning architecture was then analysed in more detail.
We proposed the modified feedback-error-learning architecture and showed that it
performs closed-loop linear/nonlinear identification of the inverse dynamical model of
the plant. Although the inverse dynamical model is identified, the architecture can still
be applied to linear non-minimum phase plants (that have an unstable inverse) since the
ANN searches for a delayed inverse of the plant (which should be stable).
We show that for a single-input single-output linear time-invariant plant the
minimization of the square of the output of the feedback controller results in a good
approximation of the inverse delayed dynamical model of the plant. We also proposed
the technique of using a variable (or adaptable) feedback controller to improve the
generalization of the ANN. The results of simulations with a two-joint robot were then
shown.
Finally we show that the BPS algorithm, presented in chapter 5, can also be used
to improve the fault tolerance of neural controllers in relation to internal damage to the
ANN.
Chapter 8 - Conclusions and Directions for Further Work

In this thesis we have developed algorithms that employ artificial neural network
models to solve the problems of: a) extremum control of static systems with an
asymmetric performance index; b) adaptive control of nonlinear dynamical systems
under feedback.
The IAC (Interactive Activation and Competition) feedback network was
presented and analysed in detail. We have proved that the IAC network can also be used
to solve quadratic optimization problems and, as such, is an alternative to the Hopfield
network.
We have also proposed an algorithm that can be used to speed up the training
of feedforward neural networks that use sigmoidal functions in the hidden layers. The
basic idea is to constrain the location of the decision surfaces, which are defined by the
weights arriving at the hidden units.
We have mathematically analysed the fault tolerance of feedforward neural
networks and we showed that, by incorporating fault tolerance in a novel way, the
problem of training the network is regularized. We have shown that in some cases the
proposed cost function will have a unique minimum point. However, we have also
shown that in general there is no unique solution for the set of network weights. The
BPS algorithm was proposed and we showed that its application results in fault tolerant
networks.
We have developed a novel non-standard neural network model and have used
it to solve the extremum control problem of static systems with an asymmetric
performance index. The standard Back-Propagation algorithm was modified and used to
adapt the network free parameters. We have also shown, theoretically and by using
simulations, that the same network model can also be used in the multi-input case.
A modified feedback-error-learning control structure was proposed and
mathematically analysed. We have shown that the aim of this structure is to perform
closed-loop identification of the inverse dynamical system. The technique of using a
variable (or adaptive) feedback controller was also proposed and we showed that it
improves the generalization of the neural network controller. Finally, we have applied
the BPS algorithm to improve the fault tolerance of the neural network controller.
The work presented in this thesis shows that the properties of nonlinear
modelling, adaptability and fault tolerance exhibited by artificial neural network models
can offer effective solutions to problems that may be very difficult or intractable by
other approaches. On the other hand there is still the need for much more formal
mathematical analysis in several areas of artificial neural networks. For instance, one of
the main outstanding problems in using artificial neural networks is to decide how large
the network needs to be in terms of the number of hidden units. Dynamic allocation of
hidden units during training may offer a solution for static problems.
The work presented in this thesis can be further developed in several directions.
In relation to chapter 3 possible areas of research are: a) to study the storage capacity
of the IAC network; and b) to modify the IAC network model in order to eliminate local
minima.
The algorithm proposed in chapter 4 to speed up training has the disadvantage
that a permissible region for the decision surfaces has to be defined by the designer.
Such a permissible region is then treated as a "hard" constraint. A possible modification
that avoids the need for specifying a permissible region would be to treat the location
of the decision surfaces as a "soft" constraint, so that the penalty for violating the
constraints is finite.
An interesting investigation in relation to the results obtained in chapter 5
concerning fault tolerant networks would be to test the hypothesis that the cost function
specified in eq. 5.41 has a unique solution and therefore solves the problem of parameter
identifiability. Again, using inspiration from biology, it would be interesting to investigate
if the asynchronous operation of biological neural networks (there is no central clock to
synchronize the neurons) has an important role in their fault tolerance to loss of neurons.
In relation to the neural network extremum controller developed in chapter 6 a
possible area for further research is an investigation of the modelling capabilities of the
proposed network model in the single and multi-input cases, i.e. if it is possible to prove
that, given enough hidden units, the proposed network model can approximate any single
and multi-input unimodal asymmetric functions with an arbitrarily small error. The use
of more sophisticated training methods and network model fault tolerance should also
be investigated.
In chapter 7 the algorithm used to train the network to control the plant adjusts
only the network weights. The network configuration (number of hidden layers and
number of hidden units) has to be decided before training begins. It would be useful to
investigate if the network configuration can also be adjusted at the same time that the
network is being trained to control the plant, using some of the techniques mentioned
in chapter 2 (section 2.5.3), such as weight decay and weight pruning. Another possible
research area is to investigate under which conditions the matrix Fl, defined in eq. 7.26,
is positive definite and to find a formal proof for the validity of the learning rule
proposed in eq. 7.36.
References

[AbJa85] - Abu-Mostafa, Y. S. & Jacques, J. M. S. (1985). Information Capacity of


the Hopfield Model, IEEE Transactions on Information Theory, 31 (4),
461-464

[AGS85a] - Amit, D., Gutfreund, H. & Sompolinsky, H. (1985). Spin-Glass Models


of Neural Networks, Physical Review A, 32, 1007-1018

[AGS85b] - Amit, D., Gutfreund, H. & Sompolinsky, H. (1985). Storing Infinite


Number of Patterns in a Spin-Glass Model of Neural Networks, Physical
Review Letters, 55, 1530-1533

[Alm89] - Almeida, L. B. (1989). Back-Propagation in Non-Feedforward Networks,


in I. Aleksander (Ed.), Neural Computing Architectures, pp. 74-91,
London, UK: North Oxford Academic

[AlMo90] - Aleksander, I. & Morton, H. (1990). Neural Computing, London, UK:


Chapman and Hall

[Ama90] - Amari, S-I. (1990). Mathematical Foundations of Neurocomputing,


Proceedings of the IEEE, 78 (9), 1443-1463

[And68] - Anderson, J. A. (1968). A Memory Storage Model Utilizing Spatial


Correlation Functions, Kybernetik, 5, 113-119

[And83] - Anderson, J. A. (1983, Sept/Oct). Cognitive and Psychological


Computation with Neural Models, IEEE Transactions on Systems, Man,
and Cybernetics, 13 (5), 799-815 (also in [Vem88])

[And88] - Anderson, C. W. (1988, 17 May). Strategy Learning with Multilayer


Connectionist Representations, Computer and Intelligent Systems
Laboratory, GTE Laboratories Incorporated, Technical Report TR87-
509.3, Waltham, USA

[And89] - Anderson, C. W. (1989, April). Learning to Control an Inverted


Pendulum Using Neural Networks, IEEE Control Systems Magazine, 9
(3), 31-37

[AnRo88] - Anderson, J. A. & Rosenfeld, E. (1988). Neurocomputing, Foundations


of Research, Cambridge, USA: The MIT Press (collection of 43
"classical" papers)

[AsWi89] - Astrom, K. J. & Wittenmark, B. (1989). Adaptive Control, Reading,


USA: Addison-Wesley Publishing Co.

[BaHa89] - Baum, Eric B. & Haussler, David (1989). What Size Net Gives Valid
Generalization?, Neural Computation, 1 (1), 151-160
[Bat92] - Battiti, Roberto (1992). First- and Second-Order Methods for Learning:
Between Steepest Descent and Newton’s Method, Neural Computation,
4 (2), 141-166

[BeCu88] - Becker, Sue, & le Cun Yann (1988). Improving the Convergence of
Back-Propagation Learning with Second Order Methods, in D. Touretzky,
G. Hinton & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist
Summer School, pp. 29-37, San Mateo, USA: Morgan Kauffman

[BeJa90] - Beale, R. & Jackson, T. (1990). Neural Computing: An Introduction,


Bristol, UK: Adam Hilger

[Bla62] - Blackman, P. F. (1962). Extremum-seeking Regulators, in J. H. Westcott


(Ed.), An Exposition of Adaptive Control, pp. 36-50, Oxford, UK:
Pergamon Press

[Blo70] - Block, H. D. (1970). A Review of "Perceptrons: An Introduction to


Computational Geometry", Information and Control, 17, 501-522

[BoZa90] - Bozin, A. S. & Zarrop, M. B. (1990). Self-Tuning Extremum Optimizer -


Convergence and Robustness Properties, in Proceedings of the First
European Control Conference (ECC’91), Grenoble, France (also CSC
Report 737, UMIST, Manchester, UK)

[BrHo69] - Bryson, A. & Ho, Y.-C. (1969). Applied Optimal Control, New York,
USA: Blaisdell

[BSA83] - Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983, Sept/Oct).


Neuronlike Adaptive Elements That Can Solve Difficult Learning Control
Problems, IEEE Transactions on Systems, Man, and Cybernetics, SMC-13
(5), 834-846

[BSRP92] - Bugmann, G., Sojka, P., Reiss, M., Plumbley, M. & Taylor, J. G. (1992).
Direct Approaches to Improving the Robustness of Multilayer Neural
Networks, in I. Aleksander & J. Taylor (Eds.), Proceedings of the Int.
Conf. on Artificial Neural Networks (ICANN 92), 4-7 September,
Brighton, UK, pp. 1063-1066, Amsterdam, The Netherlands: Elsevier
Science Publishers B. V.

[Cyb89] - Cybenko, G. (1989). Approximation by Superposition of a Sigmoidal


Function, Math. Control Signals Systems, 2, 303-314

[DeDe93] - Deutsch, S. & Deutsch, A. (1993). Understanding the Nervous System -


An Engineering Perspective, New York, USA: IEEE Press

[DhSi81] - Dhillon, B. S. & Singh, C. (1981). Engineering Reliability, New


Techniques and Applications, New York, USA: John Wiley & Sons
[DrLi51] - Draper, C. S. & Li, Y. (1951). Principles of Optimizing Control Systems


and an Application to the Internal Combustion Engine, A.S.M.E.
Publication Dept.

[DrRi92] - Drago, G. P. & Ridella, S. (1992, July). Statistically Controlled


Activation Weight Initialization (SCAWI), IEEE Transactions on Neural
Networks, 3 (4), 627-631

[DuRu89] - Durbin, R. & Rumelhart, D. E. (1989). Product Units: A Computationally


Powerful and Biological Plausible Extension to Backpropagation
Networks, Neural Computation, 1, 133-142

[Fah89] - Fahlmann, S. E. (1989). Faster-Learning Variations on Back-Propagation:


An Empirical Study, in D. Touretzsky, G. Hinton & T. Sejnowski (Eds.),
Proceedings of the 1988 Connectionist Summer School, pp. 38-51, San
Mateo, USA: Morgan Kauffman

[FaLe90] - Fahlmann, S. E. & Lebiere, C. (1990). The Cascade-Correlation Learning


Architecture, School of Computer Science, Carnegie Mellon University,
CMU-CS-90-100, Pittsburgh, USA

[FaMi90] - Farrell, J. A. & Michel, A. N. (1990). A Synthesis Procedure for


Hopfield’s Continuous-Time Associative Memory, IEEE Transactions on
Circuits and Systems, 37, 877-884

[Fun89] - Funahashi, Ken-ichi (1989). On the Approximate Realization of


Continuous Mappings by Neural Networks, Neural Networks, 2, 183-192

[GeSi93] - Geva, S. & Sitte, J. (1993, October). A Cartpole Experiment Benchmark


for Trainable Controllers, IEEE Control Systems Magazine, 13 (5), 40-51

[GiMa87] - Giles, C. L. & Maxwell, T. (1987). Learning, Invariance, and
Generalization in High-Order Neural Networks, Applied Optics, 26 (23),
4972-4978

[GMW81] - Gill, P. E., Murray, W. & Wright, M. H. (1981). Practical Optimization,
London, UK: Academic Press

[GNY93] - Green, Peter R., Nascimento Jr., Cairo L. & York, Trevor A. (1993,
January). Structuring Networks for Image Classification Using
Competitive Learning, Control Systems Centre, Dept. of Electrical
Engineering and Electronics, CSC Report 784, UMIST, Manchester, UK

[GoYd89] - Golden, M. P. & Ydstie, B. E. (1989, July). Adaptive Extremum Control
Using Approximate Process Models, American Institute of Chemical
Engineers (AIChE) Journal, 35 (7), 1157-1169

[GrBo72] - Green, A. E. & Bourne, A. J. (1972). Reliability Technology, New York,
USA: John Wiley & Sons

[GuSe89] - Guez, A. & Selinsky, J. (1989). Neurocontroller Design Via Supervised
and Unsupervised Learning, Journal of Intelligent and Robotic Systems,
2, 307-335

[HAS84] - Hinton, G. E., Ackley, D. & Sejnowski, T. (1984). Boltzmann Machines:
Constraint satisfaction networks that learn, Department of Computer
Science Technical Report CMU-CS-84-119, Carnegie-Mellon University,
USA

[Heb49] - Hebb, D. O. (1949). The Organization of Behavior, New York, USA:
John Wiley & Sons

[Hec90] - Hecht-Nielsen, R. (1990). Neurocomputing, Reading, USA: Addison-
Wesley Publishing Co.

[HFP83] - Hopfield, J. J., Feinstein, D. I. & Palmer, R. G. (1983). "Unlearning" Has
a Stabilising Effect in Collective Memories, Nature, 304, 158-159 (14
July 1983)

[Hin89] - Hinton, G. E. (1989). Neural Networks - Notes of the 1st Sun Annual
Lecture in Computer Science at the University of Manchester,
Manchester, UK: University of Manchester Press

[HiSe86] - Hinton, G. E. & Sejnowski, T. (1986). Learning and Relearning in
Boltzmann Machines, in D. E. Rumelhart & J. L. McClelland (Eds.),
Parallel Distributed Processing: Explorations in the Microstructure of
Cognition, Vol. 1, chapter 7, pp. 282-317, Cambridge, USA: Bradford
Books/MIT Press

[HKP91] - Hertz, J. A., Krogh, A. S. & Palmer, R. G. (1991). Introduction to the
Theory of Neural Computation, Lecture Notes Volume I in the Santa Fe
Institute Studies in the Sciences of Complexity, Redwood City, USA:
Addison-Wesley Publishing Co.

[Hop82] - Hopfield, J. J. (1982). Neural Networks and Physical Systems with
Emergent Collective Computational Abilities, Proceedings of the National
Academy of Sciences USA, 79, 2554-2558 (also as chapter 27 in
[AnRo88])

[Hop84] - Hopfield, J. J. (1984). Neurons with Graded Response Have Collective
Computational Properties Like Those of Two-State Neurons, Proceedings
of the National Academy of Sciences USA, 81, 3088-3092 (also in
[Vem88] and [AnRo88])

[Hop86] - Hoppensteadt, F. C. (1986). An Introduction to the Mathematics of
Neurons, Cambridge, UK: Cambridge University Press

[HoTa85] - Hopfield, J. J. & Tank, D. W. (1985). "Neural" Computation of Decisions
in Optimization Problems, Biological Cybernetics, 52, 141-152

[HoTa86] - Hopfield, J. J. & Tank, D. W. (1986, 8 August). Computing With Neural
Circuits: A Model, Science, 233, 625-633 (see also comments from
several authors in Science, vol. 235, 6 March 1987, pages 1226-1229)

[HSW89] - Hornik, K., Stinchcombe, M. & White, H. (1989). Multilayer
Feedforward Networks are Universal Approximators, Neural Networks,
2, 359-366

[HSZG92] - Hunt, K. J., Sbarbaro, D., Zbikowski, R. & Gawthrop, P. J. (1992,
November). Neural Networks for Control Systems - A Survey,
Automatica, 28 (6), 1083-1112

[HuHo93] - Hush, D. R. & Horne, B. G. (1993, January). Progress in Supervised
Neural Networks, IEEE Signal Processing Magazine, 8-39

[HYH91] - Hirose, Y., Yamashita, K. & Hijiya, S. (1991). Back-Propagation
Algorithm Which Varies the Number of Hidden Units, Neural Networks,
4, 61-66

[Ito91] - Ito, Yoshifusa (1991). Representation of Functions by Superpositions of
a Step or Sigmoid Function and Their Applications to Neural Network
Theory, Neural Networks, 4 (3), 385-394

[IzPe90] - Izui, Yoshio & Pentland, A. (1990). Analysis of Neural Networks with
Redundancy, Neural Computation, 2, 226-238

[Jac88] - Jacobs, Robert A. (1988). Increased Rates of Convergence Through
Learning Rate Adaptation, Neural Networks, 1, 295-307

[Jor89] - Jordan, M. I. (1989). Generic Constraints on Underspecified Target
Trajectories, in Proceedings of the International Joint Conference on
Neural Networks (IJCNN89), 18-22 June, Washington, DC, USA, vol. 1,
pp. 217-225, New York, USA: IEEE Press

[Jud90] - Judd, J. Stephen (1990). Neural Network Design and the Complexity of
Learning, Cambridge, USA: Bradford Books/MIT Press

[KaSo87] - Kanter, I. & Sompolinsky, H. (1987). Associative Recall of Memory
Without Errors, Physical Review A, 35, 380-392

[Kaw90] - Kawato, M. (1990). Computational Schemes and Neural Network Models
for Formation and Control of Multijoint Arm Trajectory, in W. T. Miller,
R. S. Sutton & P. J. Werbos (Eds.), Neural Networks for Control, pp.
197-228, Cambridge, USA: Bradford Books/MIT Press

[KeHa93] - Kendall, G. D. & Hall, T. J. (1993). Optimal Network Construction by
Minimum Description Length, Neural Computation, 5 (2), 210-212

[FFS87] - Kawato, M., Furukawa, K. & Suzuki, R. (1987). A Hierarchical Neural-
Network Model for Control and Learning of Voluntary Movement,
Biological Cybernetics, 57, 169-185

[KGV83] - Kirkpatrick, S., Gelatt, C. & Vecchi, M. (1983, May). Optimization by
Simulated Annealing, Science, 220 (4598), 671-680

[Kle86] - Kleinfeld, D. (1986). Sequential State Generation by Model Neural
Networks, Proceedings of the National Academy of Sciences, USA, 83,
9469-9473

[KlSo89] - Kleinfeld, D. & Sompolinsky, H. (1989). Associative Network Models for
Central Pattern Generators, in C. Koch & I. Segev (Eds.), Methods in
Neuronal Modeling: From Synapses to Networks, pp. 195-246,
Cambridge, USA: MIT Press

[KoAn89] - Kollias, S. & Anastassiou, D. (1989, August). An Adaptive Least Squares
Algorithm for the Efficient Training of Artificial Neural Networks, IEEE
Trans. on Circuits and Systems, 36 (8), 1092-1101

[Kre91] - Kreinovich, Vladik Ya. (1991). Arbitrary Nonlinearity is Sufficient to
Represent All Functions by Neural Networks: A Theorem, Neural
Networks, 4 (3), 381-383

[LBDD88] - Lobbezzo, A. J., Bruijn, P. M., Davies, M. S., Dunford, W. G., Lawrence,
P. D. & Lemke, H. R. V. N. (1988, February). Robot Control Using
Adaptive Transformations, IEEE Journal of Robotics and Automation, 4
(1), 104-108

[LeC89] - le Cun, Y. (1989). A Theoretical Formulation for the Back-Propagation,
in D. Touretzky, G. Hinton & T. Sejnowski (Eds.), Proceedings of the
1988 Connectionist Summer School, pp. 21-28, San Mateo, USA: Morgan
Kaufmann

[LeSh91] - Lee, B. W. & Sheu, B. J. (1991, January). Modified Hopfield Neural
Networks for Retrieving the Optimal Solution, IEEE Transactions on
Neural Networks, 2 (1), 137-142

[LeSh92] - Lee, Bang W. & Sheu, Bing J. (1992). Design and Analysis of Analog
VLSI Neural Networks, in Bart Kosko (Ed.), Neural Networks for Signal
Processing, chapter 8, pp. 229-286, Englewood Cliffs, USA: Prentice-
Hall

[Lip87] - Lippmann, R. P. (1987, April). An Introduction to Computing with
Neural Nets, IEEE ASSP Magazine, 4-22 (also in [Vem88])

[LTS90] - Levin, E., Tishby, N. & Solla, S. A. (1990). A Statistical Approach to
Learning and Generalization in Layered Neural Networks, Proceedings
of the IEEE, 78 (10), 1568-1574

[Luo91] - Luo, Zhi-Quan (1991). On the Convergence of the LMS Algorithm with
Adaptive Learning Rate for Linear Feedforward Networks, Neural
Computation, 3 (2), 226-245

[McPi43] - McCulloch, W. S. & Pitts, W. (1943). A Logical Calculus of the Ideas
Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5,
115-133 (also as chapter 2 in [AnRo88])

[McRu81] - McClelland, J. L. & Rumelhart, D. E. (1981). An Interactive Activation
Model of Context Effects in Letter Perception: Part 1, An Account of
Basic Findings, Psychological Review, 88, 375-407 (also as chapter 25
in [AnRo88])

[McRu88] - McClelland, J. L. & Rumelhart, D. E. (1988). Explorations in Parallel
Distributed Processing: A Handbook of Models, Programs and Exercises,
Cambridge, USA: Bradford Books/MIT Press (it includes software for
IBM PC)

[MiPa69] - Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to
Computational Geometry, Cambridge, USA: The MIT Press (an expanded
edition was published in 1988)

[MKSS88] - Miyamoto, H., Kawato, M., Setoyama, T. & Suzuki, R. (1988). Feedback-
Error-Learning Neural Network for Trajectory Control of a Robotic
Manipulator, Neural Networks, 1, 251-265

[MoDa89] - Moody, John & Darken, Christian J. (1989). Fast Learning in Networks
of Locally-Tuned Processing Units, Neural Computation, 1 (2), 281-294

[MoDav89] - Montana, David J. & Davis, Lawrence (1989). Training Feedforward
Neural Networks Using Genetic Algorithms, in Proceedings of the 11th
Int. Joint Conf. on Artificial Intelligence (IJCAI-89), 20-25 August,
Detroit, USA, pp. 762-767, San Mateo, USA: Morgan Kaufmann

[MPRV87] - McEliece, R. J., Posner, E. C., Rodemich, E. R. & Venkatesh, S. S.
(1987). The Capacity of the Hopfield Associative Memory, IEEE
Transactions on Information Theory, 33 (4), 461-482 (also in [Vem88])

[Nar90] - Narendra, K. S. (1990). Adaptive Control Using Neural Networks, in W.
T. Miller, R. S. Sutton & P. J. Werbos (Eds.), Neural Networks for
Control, pp. 115-142, Cambridge, USA: Bradford Books/MIT Press

[Nas90] - Nascimento Jr., Cairo L. (1990, March). An Introduction to Artificial
Neural Networks, Control Systems Centre, Dept. of Electrical
Engineering and Electronics, CSC Report 751, UMIST, Manchester, UK

[NaMc91] - Nascimento Jr., Cairo L. & McMichael, Daniel (1991, April). Application
of Neural Networks in Control, Control Systems Centre, Dept. of
Electrical Engineering and Electronics, CSC Report 726, UMIST,
Manchester, UK

[NaPa90] - Narendra, K. S. & Parthasarathy, K. (1990, March). Identification and
Control of Dynamical Systems Using Neural Networks, IEEE
Transactions on Neural Networks, 1 (1), 4-27

[NaZa92] - Nascimento Jr., Cairo L. & Zarrop, Martin B. (1992). Analysis of the
Interactive Activation and Competition Neural Network With 2 Units, in
Proceedings of the Sixth IMA (The Institute of Mathematics and Its
Applications) International Conference on Control: Modelling,
Computation, Information, 2-4 September, UMIST, Manchester, U.K.

[NaZa93] - Nascimento Jr., Cairo L. & Zarrop, Martin B. (1993). Improving the
Fault Tolerance of Artificial Neural Networks for Image Recognition and
Dynamic Control Applications, in Proceedings of the BNNS’93 (British
Neural Network Society) - Symposium on Recent Advances in Neural
Networks, 29 January 1993, University of Birmingham, Birmingham,
U.K. (also available as Control Systems Centre Report 783, May 1993,
Dept. of Electrical Engineering and Electronics, UMIST, Manchester,
U.K.)

[Neu56] - von Neumann, J. (1956). Probabilistic Logics and the Synthesis of
Reliable Organisms from Unreliable Components, in C. E. Shannon & J.
McCarthy (Eds.), Automata Studies, pp. 43-98, Princeton, USA: Princeton
University Press

[NgWi90] - Nguyen, D. & Widrow, B. (1990). Improving the Learning Speed of 2-
Layer Neural Networks by Choosing Initial Values of the Adaptive
Weights, in Proceedings of the International Joint Conference on Neural
Networks (IJCNN90), 17-21 June, San Diego, USA, vol. 3, pp. 21-26,
New York, USA: IEEE Press

[NgWi90b] - Nguyen, D. & Widrow, B. (1990, April). Neural Networks for Self-
Learning Control Systems, IEEE Control Systems Magazine, 10 (3)

[Nil65] - Nilsson, N. J. (1965). Learning Machines, New York, USA: McGraw-
Hill (also published in 1990 as The Mathematical Foundations of
Learning Machines, San Mateo, USA: Morgan Kaufmann Publishers)

[NoHi92a] - Nowlan, Steven J. & Hinton, Geoffrey E. (1992). Adaptive Soft Weight
Tying Using Gaussian Mixtures, in J. E. Moody, S. J. Hanson & R. P.
Lippmann (Eds.), Advances in Neural Information Processing Systems 4
(NIPS 4), San Mateo, USA: Morgan Kaufmann

[NoHi92b] - Nowlan, Steven J. & Hinton, Geoffrey E. (1992). Simplifying Neural
Networks by Soft Weight-sharing, Neural Computation, 4 (4), 473-493

[NRPD93] - Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G. & Marcos, S.
(1993). Neural Networks and Nonlinear Adaptive Filtering: Unifying
Concepts and New Algorithms, Neural Computation, 5 (2), 165-199

[NSY92] - Neti, C., Schneider, M. & Young, E. (1992, January). Maximally Fault
Tolerant Neural Networks, IEEE Transactions on Neural Networks, 3 (3),
14-23

[NZM92] - Nascimento Jr., Cairo, Zarrop, Martin B. & Muir, Allan (1992, October).
A Neural Network Extremum Controller, Control Systems Centre, Dept.
of Electrical Engineering and Electronics, CSC Report 774, UMIST,
Manchester, UK

[NZM93] - Nascimento Jr., Cairo, Zarrop, Martin B. & Muir, Allan (1993). A Neural
Network Extremum Controller for Static Systems, in Proceedings of the
Second European Control Conference (ECC’93), 28 June - 1 July,
Groningen, The Netherlands, Vol. 1, pp. 99-104

[ONY91] - de Oliveira, Roberto C. Limão, Nascimento Jr., Cairo L. & Yoneyama,
Takashi (1991). A Fault Tolerant Controller Based on Neural Nets, in
Proceedings of the IEE Int. Conf. on Control’91, 25-28 March,
Edinburgh, UK, vol. 1, pp. 399-404, London, UK: IEE

[Pap84] - Papoulis, A. (1984). Probability, Random Variables and Stochastic
Processes, New York, USA: McGraw-Hill

[Par82] - Parker, D. (1982). Learning-Logic, Invention Report, S81-64, File 1,
Office of Technology Licensing, Stanford University, USA

[Par85] - Parker, D. (1985, April). Learning-Logic, Technical Report TR-47, Center
for Computational Research in Economics and Management Science,
M.I.T., USA

[PaSa91] - Park, J. & Sandberg, I. W. (1991). Universal Approximation Using
Radial-Basis-Function Networks, Neural Computation, 3 (2), 246-257

[PaSa93] - Park, J. & Sandberg, I. W. (1993). Approximation and Radial-Basis-
Function Networks, Neural Computation, 5 (2), 305-316

[Per92] - Peretto, P. (1992). An Introduction to the Modelling of Neural Networks,
Cambridge, UK: Cambridge University Press

[PFTV88] - Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T.
(1988). Numerical Recipes in C: The Art of Scientific Computing,
Cambridge, UK: Cambridge University Press

[PoGi90a] - Poggio, T. & Girosi, F. (1990, September). Networks for Approximation
and Learning, Proceedings of the IEEE, 78 (9), 1481-1497

[PoGi90b] - Poggio, T. & Girosi, F. (1990, 23 February). Regularization Algorithms
for Learning That Are Equivalent to Multilayer Networks, Science, 247,
978-982

[PSY87] - Psaltis, D., Sideris, A. & Yamamura, A. (1987). Neural Controllers, in
Proceedings of the IEEE Int. Conf. on Neural Networks, 21-24 June, San
Diego, USA, vol. 4, pp. 551-558, New York, USA: IEEE Press

[RGV88] - Rosen, B. E., Goodwin, J. M. & Vidal, J. J. (1988). Learning by State
Recurrence Detection, in D. Z. Anderson (Ed.), Neural Information
Processing Systems, pp. 642-651, New York, USA: American Institute of
Physics

[RHM86] - Rumelhart, D. E., Hinton, G. E. & McClelland, J. L. (1986). A General
Framework for Parallel Distributed Processing, in D. E. Rumelhart & J.
L. McClelland (Eds.), Parallel Distributed Processing: Explorations in
the Microstructure of Cognition, Vol. 1, chapter 2, pp. 45-76,
Cambridge, USA: Bradford Books/MIT Press

[RHW86] - Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). Learning
Internal Representations by Error Propagation, in D. E. Rumelhart & J.
L. McClelland (Eds.), Parallel Distributed Processing: Explorations in
the Microstructure of Cognition, Vol. 1, chapter 8, pp. 318-362,
Cambridge, USA: Bradford Books/MIT Press

[RiLi91] - Richard, M. D. & Lippmann, R. P. (1991). Neural Network Classifiers
Estimate Bayesian a Posteriori Probabilities, Neural Computation, 3 (4),
461-483

[RMS92] - Ritter, H., Martinetz, T. & Schulten, K. (1992). Neural Computation and
Self-Organizing Maps: An Introduction, Reading, USA: Addison-Wesley
Publishing Co.

[RNY91] - Rodriguez, Claudio C., Nascimento Jr., Cairo L. & Yoneyama, Takashi
(1991). An Auto-Tuning Controller with Supervised Learning Using
Neural Nets, in Proceedings of the IEE Int. Conf. on Control’91, 25-28
March, Edinburgh, UK, vol. 1, pp. 140-144, London, UK: IEE

[RoMc71] - Rosenbrock, H. H. & McMorran, P. D. (1971). Good, Bad or Optimal,
IEEE Trans. on Automatic Control, 16, 552-554

[Ros58] - Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for
Information Storage and Organization in the Brain, Psychological Review,
65, 386-408 (also as chapter 8 in [AnRo88])

[Ros62] - Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the
Theory of Brain Mechanisms, Washington, USA: Spartan Books

[RSO89] - Reid, M. B., Spirkovska, L. & Ochoa, E. (1989). Simultaneous Position,
Scale, and Rotation Invariant Pattern Classification Using Third-Order
Neural Networks, Neural Networks, 1 (3), 154-159

[SaSl92] - Sanner, R. M. & Slotine, J.-J. E. (1992, November). Gaussian Networks
for Direct Adaptive Control, IEEE Transactions on Neural Networks, 3
(6), 837-863

[SaSo91] - Saerens, M. & Soquet, A. (1991, February). Neural Controller Based on
Back-Propagation Algorithm, IEE Proceedings Part F, 138 (1), 55-62

[SBM91] - Saint-Donat, J., Bhat, N. & McAvoy, T. J. (1991). Neural Net Based
Model Predictive Control, International Journal of Control, 54 (6), 1453-
1468

[SBC91] - Saarinen, S., Bramley, R. & Cybenko, G. (1991, January). Ill-
conditioning in Neural Networks Training Problems, CSRD Report 1089,
Center for Supercomputing Research and Development, University of
Illinois at Urbana-Champaign, Urbana, IL, USA

[SBW92] - Sutton, R. S., Barto, A. G. & Williams, R. J. (1992, April).
Reinforcement Learning is Direct Adaptive Optimal Control, IEEE
Control Systems Magazine, 12 (2), 19-22

[Sch90] - Schilling, R. J. (1990). Fundamentals of Robotics - Analysis and Control,
London, UK: Prentice-Hall

[Sci79] - Scientific American, editors of (1979). The Brain, San Francisco, USA:
Freeman, (The eleven chapters of this book originally appeared as articles
in the September 1979 issue of Scientific American magazine)

[ScWe90] - Scotson, P. E. & Wellstead, P. E. (1990, April). Self-Tuning Optimization
of Spark Ignition Automotive Engines, IEEE Control Systems Magazine,
10 (3), 94-101

[Sha89] - Shackleford, J. B. (1989, June). Neural Data Structures: Programming
with Neurons, Hewlett-Packard Journal, 69-78

[ShBr71] - Shanmugam, K. & Breipohl, A. M. (1971, July). An Error Correcting
Procedure for Learning with an Imperfect Teacher, IEEE Transactions on
Systems, Man, and Cybernetics, SMC-1 (3), 223-229

[Sho68] - Shooman, M. L. (1968). Probabilistic Reliability: An Engineering
Approach, New York, USA: McGraw-Hill Book Company

[ShRo90] - Shynk, J. J. & Roy, S. (1990). Convergence Properties and Stationary
Points of a Perceptron Learning Algorithm, Proceedings of the IEEE, 78
(10), 1599-1604

[SiAl90] - Silva, F. M. & Almeida, L. B. (1990). Acceleration Techniques for the
Backpropagation Algorithm, in L. B. Almeida & C. J. Wellekens (Eds.),
Neural Networks, Proceedings of the EURASIP Workshop 1990,
Sesimbra, Portugal, February 15-17, Lecture Notes in Computer Science
412, pp. 110-119, Berlin: Springer-Verlag

[Sim90] - Simpson, P. K. (1990). Artificial Neural Systems - Foundations,
Paradigms, Applications, and Implementations, New York, USA:
Pergamon Press

[SiWu89] - Singhal, S. & Wu, L. (1989). Training Multilayer Perceptrons With the
Extended Kalman Algorithm, in D. S. Touretzky (Ed.), Advances in
Neural Information Processing Systems I (NIPS I), pp. 133-140, San
Mateo, USA: Morgan Kaufmann

[Son93] - Sontag, E. D. (1993, 6 July). Some Topics in Neural Networks and
Control, Department of Mathematics, Report Number LS93-02, Rutgers
University, New Brunswick, USA

[SoSt89] - Soderstrom, T. & Stoica, P. (1989). System Identification, New York,
USA: Prentice Hall

[SPD92] - Shah, S., Palmieri, F. & Datum, M. (1992). Optimal Filtering Algorithms
for Fast Learning in Feedforward Neural Networks, Neural Networks, 5
(5), 779-787

[SpRe92] - Spirkovska, L. & Reid, M. (1992). Higher Order Neural Networks in
Position, Scale, and Rotation Invariant Object Recognition, in Branko
Soucek (Ed.), Fast Learning and Invariant Object Recognition, pp. 153-
184, New York, USA: John Wiley & Sons

[Ste80] - Sternby, Jan (1980). Extremum Control Systems - An Area for Adaptive
Control?, in Proceedings of the 1980 Joint Automatic Control
Conference, 13-15 August, San Francisco, USA, vol. 1, paper WA2-A,
New York, USA: IEEE Press

[StWh89] - Stinchcombe, M. & White, H. (1989). Universal Approximation Using
Feedforward Networks with Non-sigmoid Hidden Layer Activation
Function, in Proceedings of the Int. Joint Conf. on Neural Networks
(IJCNN 89), 18-22 June, Washington DC, USA, vol. 1, pp. 613-617, New
York, USA: IEEE Press

[Sut88] - Sutton, R. (1988, August). Learning to Predict by the Method of
Temporal Differences, Machine Learning, 3 (1), 9-44

[Szu86] - Szu, H. (1986). Fast simulated annealing, in J. Denker (Ed.), AIP
Conference Proceedings 151: Neural Networks for Computing, pp. 420-
425, New York: American Institute of Physics

[TaHo86] - Tank, D. W. & Hopfield, J. J. (1986, May). Simple "Neural"
Optimization Networks: An A/D Converter, Signal Decision Circuit, and
a Linear Programming Circuit, IEEE Transactions on Circuits and
Systems, 33 (5), 533-541 (also in [Vem88])

[Tho92] - Thornton, C. J. (1992). Techniques in Computational Learning: An
Introduction, London, UK: Chapman & Hall Computing

[TiAr77] - Tikhonov, A. N. & Arsenin, V. Y. (1977). Solutions of Ill-Posed
Problems, Washington D.C., USA: W. H. Winston

[ToWi88] - Tolat, V. V. & Widrow, B. (1988). An Adaptive "Broom Balancer" with
Visual Inputs, in Proceedings of the IEEE Int. Conf. on Neural Networks,
24-27 July, San Diego, USA, vol. 2, pp. 641-647, New York, USA:
IEEE Press

[TRSR91] - Tepedelenlioglu, N., Rezgui, A., Scalero, R. & Rosario, R. (1991). Fast
Algorithms for Training Multilayer Perceptron, in Branko Soucek (Ed.),
Neural and Intelligent Systems Integration, pp. 107-133, New York,
USA: John Wiley & Sons

[TzFa92] - Tzirkel-Hancock, E. & Fallside, F. (1992, May). Stable Control of
Nonlinear Systems Using Neural Networks, International Journal of
Robust and Nonlinear Control, 2 (1), 63-86

[VaKe91] - Vallet, F. & Kerlirzin, Ph. (1991). Robustness in Multi-Layer
Perceptrons, in T. Kohonen, K. Makisara, O. Simula and J. Kangas
(Eds.), Artificial Neural Networks: Proc. of the Int. Conf. on Artificial
Neural Networks, ICANN 91, 24-28 June 1991, Espoo, Finland, vol. 1,
pp. 641-646, Amsterdam, The Netherlands: Elsevier Science Publishers
B. V.

[Vem88] - Vemuri, V. (1988). Artificial Neural Networks: Theoretical Concepts,
Washington D.C., USA: IEEE Computer Society Press (collection of 12
"classical" papers)

[VePs89] - Venkatesh, S. S. & Psaltis, D. (1989). Linear and Logarithmic Capacities
in Associative Neural Networks, IEEE Transactions on Information
Theory, 35, 558-568

[VSS87] - Viswanadham, N., Sarma, V. V. S. & Singh, M. G. (1987). Reliability of
Computer and Control Systems, Amsterdam, The Netherlands: North-Holland

[WaDa92] - Watkins, C. J. C. H. & Dayan, P. (1992, May). Technical Note: Q-
Learning, Machine Learning, 8 (3/4), 279-292

[Was89] - Wasserman, P. D. (1989). Neural Computing: Theory and Practice, New
York, USA: Van Nostrand Reinhold

[WDMT91] - Willis, M. J., Di Massimo, C., Montague, G. A., Tham, M. T. & Morris,
A. J. (1991, May). Artificial Neural Networks in Process Engineering,
IEE Proceedings Part D, 138 (3), 256-266

[Wer74] - Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and
Analysis in the Behavioral Sciences, Ph.D. Thesis in Applied
Mathematics, Harvard University, USA

[Wer90] - Werbos, P. J. (1990). Overview of Designs and Capabilities, in W. T.
Miller, R. S. Sutton & P. J. Werbos (Eds.), Neural Networks for Control,
pp. 59-65, Cambridge, USA: Bradford Books/MIT Press

[WeSc90] - Wellstead, P. E. & Scotson, P. G. (1990, May). Self-Tuning Extremum
Control, IEE Proceedings Pt. D, 137 (3), 165-175

[WeZa91] - Wellstead, P. E. & Zarrop, M. B. (1991). Self-Tuning Systems: Control
and Signal Processing, Chichester, UK: John Wiley & Sons

[WGM73] - Widrow, B., Gupta, N. K. & Maitra, S. (1973, September).
Punish/Reward: Learning With a Critic in Adaptive Threshold Systems,
IEEE Trans. on Systems, Man and Cybernetics, 3 (5)

[Whi89] - White, H. (1989). Learning in Artificial Neural Networks: A Statistical
Perspective, Neural Computation, 1 (4), 425-464

[Whi92] - White, H. (1992). Artificial Neural Networks: Approximation and
Learning Theory, Oxford, UK: Blackwell Publishers

[Wid87] - Widrow, B. (1987, May). The Original Adaptive Neural Net Broom-
Balancer, in Proc. of the 20th IEEE Int. Symp. on Circuits and Systems,
pp. 351-357

[WiHo60] - Widrow, B. & Hoff, M. E. (1960). Adaptive Switching Circuits, in Proc.
of the 1960 WESCON Convention Record, pp. 96-104 (also as chapter 10
in [AnRo88])

[WiLe90] - Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks:
Perceptron, Madaline, and Backpropagation, Proceedings of the IEEE, 78
(9), 1415-1442

[WiPa88] - Wilson, G. V. & Pawley, G. S. (1988). On the Stability of the Travelling
Salesman Problem Algorithm of Hopfield and Tank, Biological
Cybernetics, 58, 63-70

[WiSm63] - Widrow, B. & Smith, F. W. (1963). Pattern Recognizing Control
Systems, in Proc. of the Symposium on Computer and Information
Sciences (COINS), pp. 288-317, Washington, USA: Spartan Books

[WiSt85] - Widrow, B. & Stearns, S. (1985). Adaptive Signal Processing, Englewood
Cliffs, USA: Prentice-Hall

[WMDT92] - Willis, M. J., Montague, G. A., Di Massimo, C., Tham, M. T. & Morris,
A. J. (1992, November). Artificial Neural Networks in Process Estimation
and Control, Automatica, 28 (6), 1181-1187

[YuNe93] - Yuh, J. & Newcomb, R. W. (1993, May). A Multilevel Neural Network
for A/D Conversion, IEEE Transactions on Neural Networks, 4 (3), 470-
483

[ZaRo93] - Zarrop, M. B. & Rommens, M. J. J. J. (1993, March). Convergence of a
Multi-Input Adaptive Extremum Controller, IEE Proceedings Part D, 140
(2), 65-69

[Zur92] - Zurada, Jacek M. (1992). Introduction to Artificial Neural Systems, New
York, USA: West Publishing Company
