Neural Networks
David Kriesel
www.dkriesel.com
In remembrance of
Dr. Peter Kemp, Notary (ret.), Bonn, Germany.
The above abstract has not yet become a preface but, after all, a little preface, since the extended paper (then 40 pages long) has turned out to be a download hit.

Ambition and Intention of this Manuscript

At the time I promised to continue the paper chapter by chapter, and this is the result. Meanwhile I changed the structure of the document a bit so that it is easier for me to add a chapter or to vary the outline.

The entire text is written and laid out more effectively and with more illustrations than before. I did all the illustrations myself, most of them directly in LaTeX by using XY-pic. They reflect what I would have liked to see when becoming acquainted with the subject: text and illustrations should be memorable and easy to understand, to offer as many people as possible access to the field of neural networks.

Nevertheless, mathematically and formally skilled readers will be able to understand the definitions without reading the running text, while the reverse is true for readers only interested in the subject matter; everything is explained in both colloquial and formal language. Please let me know if you find that I have acted against this principle.

The Sections of This Work are Mostly Independent from Each Other

The document itself is divided into different parts, which are again divided into chapters. Although the chapters contain cross-references, they are also individually accessible to readers with little previous knowledge. There are larger and smaller chapters: while the larger chapters should provide profound insight into a paradigm of neural networks (e.g. the classic neural network structure: the perceptron and its learning procedures), the smaller chapters give a short overview – but this is also explained in the introduction of each chapter. In addition to all the definitions and explanations I have included some excursuses to provide interesting information not directly related to the subject.

Different types of chapters are directly marked within the table of contents. Chapters that are marked as "fundamental" are definitively ones to read, because almost all subsequent chapters heavily depend on them. Other chapters additionally depend on information given in other (preceding) chapters, which is then marked in the table of contents, too.

Terms of Use and License

From the epsilon edition, the text is licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License, except for some small portions of the work licensed under more liberal licenses as mentioned (mainly some figures from Wikimedia Commons). A quick license summary: you may copy and redistribute the work, with attribution, but you may not distribute modified versions of it.

There's no official publisher, so you need to be careful with your citation. Please find more information in English and German on my home page, respectively the subpage concerning the manuscript (http://www.dkriesel.com/en/science/neural_networks).
It's easy to print this manuscript

This paper is completely illustrated in color, but it can also be printed as is in monochrome: the colors of figures, tables and text are well chosen, so that in addition to an appealing design they are still easy to distinguish when printed in monochrome.

There are many tools directly integrated into the text

Different tools are directly integrated in the document to make reading more flexible: but anyone (like me) who prefers reading words on paper rather than on screen can also enjoy some features.

Speaking Headlines throughout the Text, Short Ones in the Table of Contents

The whole manuscript is now pervaded by such headlines. Speaking headlines are not just title-like ("Reinforcement Learning"), but condense the information given in the associated section into a single sentence. In the named instance, an appropriate headline would be "Reinforcement learning methods provide feedback to the network, whether it behaves well or badly". However, such long headlines would bloat the table of contents in an unacceptable way, so I used short titles like the former in the table of contents, and speaking ones like the latter throughout the text.

Marginal notes are a navigational aid

The entire document contains marginal notes in colloquial language (see the example in the margin), allowing you to "skim" the document quickly to find a certain passage in the text (including the titles).

New mathematical symbols are marked by specific marginal notes for easy finding (see the example for x in the margin).

There are several kinds of indexing

This document contains different types of indexing: if you have found a word in the index and opened the corresponding page, you can easily find it by searching for highlighted text – all indexed words are highlighted like this.

Mathematical symbols appearing in several chapters of this document (e.g. Ω for an output neuron; I tried to maintain a consistent nomenclature for regularly recurring elements) are separately indexed under "Mathematical Symbols", so that they can easily be assigned to the corresponding term.

Names of persons written in small caps are indexed in the category "Persons" and ordered by last name.

Acknowledgement

Now I would like to express my gratitude to all the people who contributed, in whatever manner, to the success of this work, since a paper like this needs many helpers. First of all, I want to thank the proofreaders of this paper, who helped me and my readers very much. In alphabetical order: Wolfgang Apolinarski, Kathrin Gräve, Paul Imhoff, Thomas Kühn, Christoph Kunze, Malte Lohmeyer, Joachim Nock, Daniel Plohmann, Daniel Rosenthal, Christian Schulz and Tobias Wilken.

Additionally, I want to thank the readers Dietmar Berger, Igor Buchmüller, Marie Christ, Julia Damaschek, Maximilian Ernestus, Hardy Falk, Anne Feldmeier, Sascha Fink, Andreas Friedmann, Jan Gassen, Markus Gerhards, Sebastian Hirsch, Andreas Hochrath, Nico Höft, Thomas Ihme, Boris Jentsch, Tim Hussein, Thilo Keller, Mario Krenn, Mirko Kunze, Maikel Linke, Adam Maciak, Benjamin Meier, David Möller, Andreas Müller, Rainer Penninger, Matthias Siegmund, Mathias Tirtasana, Oliver Tischler, Maximilian Voit, Igor Wall, Achim Weber, Frank Weinreis, Gideon Maillette de Buij Wenniger, Philipp Woock and many others for their feedback, suggestions and remarks.

Especially, I would like to thank Beate Kuhl for translating the entire paper from German to English, and for her questions, which made me think about rewording some paragraphs for better understandability.

I would particularly like to thank Prof. Rolf Eckmiller and Dr. Nils Goerke, as well as the entire Division of Neuroinformatics, Department of Computer Science, of the University of Bonn – they all made sure that I always learned (and also had to learn) something new about neural networks and related subjects. Especially Dr. Goerke has always been willing to respond to any questions I was not able to answer.
Contents

A Little Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
5 The Perceptron 71
5.1 The Single-layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1.1 Perceptron Learning Algorithm and Convergence Theorem . . . 75
5.2 Delta Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Linear Separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 The Multi-layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Backpropagation of Error . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.1 Boiling backpropagation down to delta rule . . . . . . . . . . . . 91
5.5.2 Selecting a learning rate . . . . . . . . . . . . . . . . . . . . . . . 91
5.5.3 Initial configuration of a Multi-layer Perceptron . . . . . . . . . . 92
5.5.4 Variations and extensions to backpropagation . . . . . . . . . . . 94
5.6 The 8-3-8 encoding problem and related problems . . . . . . . . . . . . 97
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
A Excursus: Cluster Analysis and Regional and Online Learnable Fields 169
A.1 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.2 k-Nearest Neighbouring . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.3 ε-Nearest Neighbouring . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.4 The Silhouette coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.5 Regional and Online Learnable Fields . . . . . . . . . . . . . . . . . . . 173
A.5.1 Structure of a ROLF . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.5.2 Training a ROLF . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
A.5.3 Evaluating a ROLF . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.5.4 Comparison with Popular Clustering Methods . . . . . . . . . . 177
A.5.5 Initializing Radii, Learning Rates and Multiplier . . . . . . . . . 178
A.5.6 Application examples . . . . . . . . . . . . . . . . . . . . . . . . 178
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Bibliography 207
Index 219
Chapter 1
Introduction, Motivation and History
How do you teach a computer? You can either write a rigid program – or you can
enable the computer to learn on its own. Living beings do not have a programmer
writing a program that develops their skills and only has to be executed. They
learn by themselves – without prior experience or external knowledge – and thus
can solve problems better than any computer today. What qualities are needed to
achieve such behavior for devices like computers? Can such cognition be adapted
from biology? History, development, decline and resurgence of a broad approach
to solving problems.
                                  Brain                 Computer
No. of processing units           ≈ 10^11               ≈ 10^9
Type of processing units          Neurons               Transistors
Type of calculation               massively parallel    usually serial
Data storage                      associative           address-based
Switching time                    ≈ 10^-3 s             ≈ 10^-9 s
Possible switching operations     ≈ 10^13 per second    ≈ 10^18 per second
Actual switching operations       ≈ 10^12 per second    ≈ 10^10 per second

Table 1.1: The (flawed) comparison between brain and computer at a glance. Inspired by: [Zel94]
mum, from which the computer is powers of ten away (Table 1.1). Additionally, a computer is static – the brain as a biological neural network can reorganize itself during its "lifespan" and is therefore able to learn, to compensate for errors, and so forth.

Within this paper I want to outline how we can use these brain characteristics for a computer system.

So the study of artificial neural networks is motivated by their similarity to successfully working biological systems, which – compared to the complete system – consist of very simple but numerous nerve cells that work massively parallel and (which is probably one of the most significant aspects) have the capability to learn.

There is no need to explicitly program a neural network. For instance, it can learn from training examples or by means of encouragement – with a carrot and a stick, so to speak (reinforcement learning).

One result of this learning procedure is the capability of neural networks to generalize and associate data: after successful training, a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. This in turn results in a high degree of fault tolerance against noisy input data.

Fault tolerance is closely related to biological neural networks, in which this characteristic is very distinct: as previously mentioned, a human has about 10^11 neurons that continuously reorganize themselves or are reorganized by external influences (complete drunkenness destroys about 10^5 neurons, and some types of food or environmental influences can also destroy brain cells). Nevertheless, our cognitive abilities are not significantly affected. Thus, the brain is tolerant against internal errors – and also against external errors, for we can often read a really "dreadful scrawl" although the individual letters are nearly impossible to read.

Our modern technology, however, is not automatically fault-tolerant. I have never heard that someone forgot to install the
hard disk controller into the computer and therefore the graphics card automatically took over its tasks, removed conductors and developed communication, so that the system as a whole was affected by the missing component but not completely destroyed.

A disadvantage of this distributed fault-tolerant storage is certainly the fact that we cannot tell at first sight what a neural network knows and performs, or where its faults lie. Usually, it is easier to perform such analyses for conventional algorithms. Most often we can only transfer knowledge into our neural network by means of a learning procedure, which can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated in state-of-the-art technology: let us compare a record and a CD. If there is a scratch on a record, the audio information at this spot will be completely lost (you will hear a pop) and then the music goes on. On a CD the audio data are stored in a distributed way: a scratch causes a blurry sound in its vicinity, but the data stream remains largely unaffected. The listener won't notice anything.

So let us summarize the main characteristics we try to adapt from biology:

- Self-organization and learning capability,
- Generalization capability and
- Fault tolerance.

What types of neural networks particularly develop what kinds of abilities and can be used for what problem classes will be discussed in the course of the paper.

In the introductory chapter I want to clarify the following: "the neural network" does not exist. There are different paradigms for neural networks, for how they are trained and where they are used. It is my aim to introduce some of these paradigms and supplement some remarks for practical application.

We have already mentioned that our brain works massively in parallel, in contrast to the work of a computer, i.e. every component is active at any time. If we want to state an argument for massively parallel processing, then the 100-step rule can be cited.

1.1.1 The 100-step rule

Experiments showed that a human can recognize the picture of a familiar object or person in ≈ 0.1 seconds, which, at a neuron switching time of ≈ 10^-3 seconds, corresponds to ≈ 100 discrete time steps of parallel processing.

A computer following the von Neumann architecture, however, can do practically nothing in 100 time steps of sequential processing, which are 100 assembler steps or cycle steps.

Now we want to look at a simple application example for a neural network.
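The arithmetic behind the 100-step rule above is easy to reproduce; the sketch below simply divides the observed recognition time by the assumed neuron switching time, both values taken from the text:

```python
# Back-of-the-envelope check of the 100-step rule.
recognition_time = 0.1        # seconds to recognize a familiar object
neuron_switching_time = 1e-3  # seconds per neuron switching operation

# Number of sequential processing steps the brain can fit into that time:
steps = round(recognition_time / neuron_switching_time)
print(steps)  # -> 100, hence the "100-step rule"
```

Whatever the brain computes during recognition, it must therefore manage it in about 100 sequential stages, which only massive parallelism can explain.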
f : R^8 → B^1.

Figure 1.3: Initially, we regard the robot control as a black box whose inner life is unknown. The black box receives eight real sensor values and maps these values to a binary output value.

treat the neural network as a kind of black box (Fig. 1.3); this means we do not know its structure but just regard its behavior in practice.

The situations, in the form of simply measured sensor values (e.g. placing the robot in front of an obstacle, see illustration), which we show to the robot and for which we specify whether to drive on or to stop, are called training examples. Thus, a training example consists of an exemplary input and a corresponding desired output. Now the question is how to transfer this knowledge, the information, into the neural network.

The examples can be taught to a neural network by using a simple learning procedure (a learning procedure is a simple algorithm or a mathematical formula). If we have done everything right and chosen good examples, the neural network will generalize from these examples and find a universal rule for when it has to stop.

f : R^8 → R^2,

which gradually controls the two motors by means of the sensor inputs and thus not only can, for example, stop the robot but also lets it avoid obstacles. Here it is more difficult to mentally derive the rules, and de facto a neural network would be more appropriate.

Our aim is not to learn the examples by heart, but to realize the principle behind them: ideally, the robot should apply the neural network in any situation and be able to avoid obstacles. In particular, the robot should query the network continuously and repeatedly while driving in order to continuously avoid obstacles. The result is a constant cycle: the robot queries the network. As a consequence, it will drive in one direction, which changes the sensor values. Again the robot queries the network and changes its position, the sensor values are changed once again, and so on. It is obvious that this system can also be adapted to dynamic, i.e. changing, environments (e.g. the moving obstacles in our example).

2 There is a robot called Khepera with more or less similar characteristics. It is round-shaped, approx. 7 cm in diameter, has two motors with wheels and various sensors. For more information I recommend referring to the internet.
Figure 1.2: The robot is positioned in a landscape that provides sensor values for different situa-
tions. We add the desired output values H and so receive our learning examples. The directions in
which the sensors are oriented are exemplarily applied to two robots.
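The constant query-drive-sense cycle described above can be sketched in a few lines. The network itself is stubbed out here: `trained_network` is a hand-written placeholder for the trained mapping f : R^8 → B^1, not an actual learned model, and the toy environment (a robot approaching a wall) is likewise invented for illustration.

```python
def trained_network(sensor_values):
    """Placeholder for the trained mapping f: R^8 -> B^1.

    This stand-in says "stop" as soon as any of the eight distance
    sensors reports an obstacle closer than 0.2 m; a real network
    would have learned such a rule from training examples instead.
    """
    return any(distance < 0.2 for distance in sensor_values)


def control_cycle(read_sensors, drive, max_steps=1000):
    """The constant cycle from the text: query the network, act,
    which changes the sensor values, then query again, and so on."""
    for _ in range(max_steps):
        sensor_values = read_sensors()      # eight real sensor values
        if trained_network(sensor_values):  # network says: stop
            break
        drive()                             # drive on; this changes the sensors


# Toy environment: the robot approaches a wall in 0.05 m steps.
state = {"distance": 1.0}

def read_sensors():
    return [state["distance"]] * 8          # all sensors face the wall

def drive():
    state["distance"] -= 0.05

control_cycle(read_sensors, drive)          # robot halts near the wall
```

Because the environment is only re-read at the top of each cycle, the same loop works unchanged for dynamic environments such as the moving obstacles mentioned in the text.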
Figure 1.4: Some institutions of the field of neural networks. From left to right: John von Neu-
mann, Donald O. Hebb, Marvin Minsky, Bernard Widrow, Seymour Papert, Teuvo Kohonen, John
Hopfield, ”in the order of appearance” as far as possible.
1949: Donald O. Hebb formulated the classical Hebbian rule [Heb49], which in its more generalized form represents the basis of nearly all neural learning procedures. The rule implies that the connection between two neurons is strengthened when both neurons are active at the same time; this change in strength is proportional to the product of the two activities. Hebb could postulate this rule, but due to the absence of neurological research he was not able to verify it.

1951: For his dissertation Marvin Minsky developed the neurocomputer Snark, which was already capable of adjusting its weights3 automatically. But it was never practically applied, since it was capable of busily calculating, but nobody really knew what it calculated.

1956: Well-known scientists and ambitious students met at the Dartmouth Summer Research Project

3 We will learn soon what weights are.
and discussed, to put it crudely, how to simulate a brain. Differences between top-down and bottom-up research emerged. While the early supporters of artificial intelligence wanted to simulate capabilities by means of software, supporters of neural networks wanted to achieve system behavior by imitating the smallest parts of the system – the neurons.

1957-1958: At the Cornell Aeronautical Laboratory, Frank Rosenblatt, Charles Wightman and their coworkers developed the first successful neurocomputer, the Mark I perceptron, which was capable of recognizing simple digits by means of a 20 × 20 pixel image sensor and worked electromechanically with 512 motor-driven potentiometers, each potentiometer representing one variable weight.

1959: Frank Rosenblatt described different versions of the perceptron, and formulated and verified his perceptron convergence theorem. He described neuron layers mimicking the retina, threshold switches, and a learning rule adjusting the connecting weights.

1960: Bernard Widrow and Marcian E. Hoff introduced the ADALINE (ADAptive LInear NEuron) [WH60], a fast and precise adaptive learning system that was the first widely commercially used neural network: it could be found in nearly every analog telephone for real-time adaptive echo filtering and was trained by means of the Widrow-Hoff rule or delta rule. At that time Hoff, later a co-founder of Intel Corporation and known as an inventor of the modern microprocessor, was a PhD student of Widrow. One advantage the delta rule had over the original perceptron learning algorithm was its adaptivity: if the difference between the actual output and the correct solution was large, the connecting weights also changed in larger steps, the steps becoming smaller as the target was approached. Disadvantage: misapplication led to infinitesimally small steps close to the target. During the following stagnation, and out of fear of the scientific unpopularity of neural networks, ADALINE was renamed the adaptive linear element – which was undone again later on.

1961: Karl Steinbuch introduced technical realizations of associative memory, which can be seen as predecessors of today's neural associative memories [Ste61]. Additionally, he described concepts for neural techniques and analyzed their possibilities and limits.

1965: In his book Learning Machines, Nils Nilsson gave an overview of the progress and works of this period of neural network research. It was assumed that the basic principles of self-learning and therefore, generally speaking, "intelligent" systems had already been discovered. Today this assumption seems to be an exorbitant overestimation, but at that time it
provided for high popularity and sufficient research funds.

1969: Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron [MP69] to show that the perceptron model was not capable of representing many important problems (keywords: XOR problem and linear separability), and so put an end to overestimation, popularity and research funds. The implication that more powerful models would show exactly the same problems, and the forecast that the entire field would be a research dead end, resulted in a nearly complete decline of research funds for the next 15 years – no matter how incorrect these forecasts were from today's point of view.

1972: Teuvo Kohonen introduced a model of the linear associator, a model of an associative memory [Koh72]. In the same year, such a model was presented independently, and from a neurophysiologist's point of view, by James A. Anderson [And72].

1973: Christoph von der Malsburg used a neuron model that was non-linear and biologically more motivated [vdM73].

1974: For his dissertation at Harvard, Paul Werbos developed a learning procedure called backpropagation of error [Wer74], but it was not until one decade later that this procedure reached today's importance.

1976-1980 and thereafter: Stephen
Chapter 2
Biological Neural Networks
is also heavily involved in the human circadian rhythm ("internal clock") and the sensation of pain.

2.1.5 The brainstem connects the brain with the spinal cord and controls reflexes

In comparison with the diencephalon, the brainstem or Truncus cerebri is phylogenetically much older. Roughly speaking, it is the "extended spinal cord" and thus the connection between brain and spinal cord. The brainstem can also be divided into different areas, some of which will be exemplarily introduced in this chapter. The functions will be discussed from abstract functions towards more fundamental ones. One important component is the pons (= bridge), a kind of way station for many nerve signals from brain to body and vice versa.

If the pons is damaged (e.g. by a cerebral infarct), the result could be the locked-in syndrome – a condition in which a patient is "walled in" within his own body. He is conscious and aware with no loss of cognitive function, but cannot move or communicate by any means. Only his senses of sight, hearing, smell and taste generally work perfectly normally. Locked-in patients may often be able to communicate with others by blinking or moving their eyes.

Furthermore, the brainstem is responsible for many fundamental reflexes, such as the blinking reflex or coughing.

All parts of the nervous system have one thing in common: information processing. This is accomplished by means of huge accumulations of billions of very similar cells whose structure is very simple but which communicate continuously. Large groups of these cells send coordinated signals and thus reach the enormous information processing capacity we are familiar with from our brain. We will now leave the level of brain areas and go on to the cellular level of the body – the level of neurons.

2.2 Neurons Are Information Processing Cells

Before specifying the functions and processes within a neuron, we will give a rough description of neuron functions: a neuron is nothing more than a switch with information input and output. The switch is activated if there are enough stimuli from other neurons hitting the information input. Then, at the information output, a pulse is sent to other neurons, for example.

2.2.1 Components of a neuron

Now we want to take a look at the components of a neuron (Fig. 2.3 on the right page). In doing so, we will follow the path the electrical information takes within the neuron. The dendrites of a
Figure 2.3: Illustration of a biological neuron with the components discussed in this text.
neuron receive the information via special connections, the synapses.

2.2.1.1 Synapses weight the individual parts of information

Incoming signals from other neurons or cells are transferred to a neuron by special connections, the synapses. For the most part, such a connection can be found at the dendrites of a neuron, sometimes also directly at the soma. We distinguish between electrical and chemical synapses.

The simpler variant is the electrical synapse. An electrical signal received by the synapse, i.e. coming from the pre-synaptic side, is directly transferred to the postsynaptic nucleus of the cell. Thus, there is a direct, strong, unadjustable connection between the signal transmitter and the signal receiver, which is, for example, relevant to shortening reactions that must be "hard coded" within a living organism.

The chemical synapse is the more distinctive variant. Here, the electrical coupling of source and target does not take place; the coupling is interrupted by the synaptic cleft. This cleft electrically separates the pre-synaptic side from the post-synaptic one. You might think that, nevertheless, the information has to flow, so we will discuss how this happens: it is not an electrical but a chemical process. On the pre-synaptic side of the synaptic cleft the electrical signal is converted into a chemical signal, a process induced by chemical cues released there (the so-called neurotransmitters). These neurotransmitters cross the synaptic cleft and transfer the information into the nucleus of the cell (this is a very simplified explanation, but later on we will see how this exactly works), where it is reconverted into electrical information. The neurotransmitters are degraded very fast, so that very precise information pulses can be released here, too.

In spite of its more complex functioning, the chemical synapse has – compared with the electrical synapse – utmost advantages:

One-way connection: A chemical synapse is a one-way connection. Since there is no direct electrical connection between the pre- and postsynaptic areas, electrical pulses in the postsynaptic area cannot flash over to the pre-synaptic area.

Adjustability: There is a large number of different neurotransmitters that can also be released in various quantities in a synaptic cleft. There are neurotransmitters that stimulate the postsynaptic cell nucleus and others that slow such stimulation down. Some synapses transfer a strongly stimulating signal, some only weakly stimulating ones. The adjustability varies a lot, and one of the central points in the examination of the learning ability of the brain is that here the synapses are variable, too: over time they can form a stronger or weaker connection.

2.2.1.2 Dendrites collect all parts of information

Dendrites ramify like trees from the cell nucleus of the neuron (which is called soma) and receive electrical signals from many different sources, which are then transferred into the nucleus of the cell. The ramifying set of dendrites is also called the dendrite tree.

2.2.1.3 In the soma the weighted information is accumulated

After the cell nucleus (soma) has received plenty of activating (= stimulating) and inhibiting (= diminishing) signals via synapses or dendrites, the soma accumulates these signals. As soon as the accumulated signal exceeds a certain value (called the threshold value), the cell nucleus of the neuron activates an electrical pulse which is then transmitted to the neurons connected to the current one.

2.2.1.4 The axon transfers outgoing pulses

The pulse is transferred to other neurons by means of the axon. The axon is a long, slender extension of the soma. In an extreme case an axon can stretch up to one meter (e.g. within the spinal cord). The axon is electrically isolated in order to achieve a better conduction of the electrical signal (we will return to this point later on) and it leads to dendrites, which transfer the information to, for example, other neurons. So now we are back at the beginning of our description of the neuron elements.

Remark: An axon can, however, also transfer information to other kinds of cells in order to control them.

Looking at the membrane from the inside outwards, we will find certain kinds of ions more often or less often than on the inside. This descent or ascent of concentration is called a concentration gradient.
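The switch-like behavior just described – weighted, stimulating or inhibiting inputs accumulated in the soma, and a pulse emitted once a threshold is exceeded – is exactly what the simplest artificial neuron models copy. A minimal sketch (the weights and threshold below are made-up illustration values, not taken from the text):

```python
def neuron_fires(inputs, weights, threshold):
    """Toy soma: accumulate the weighted input signals and emit a
    pulse exactly when the accumulated signal exceeds the threshold."""
    accumulated = sum(x * w for x, w in zip(inputs, weights))
    return accumulated > threshold

# Three synapses: two stimulating (positive weight), one inhibiting (negative).
weights = [0.8, 0.5, -0.6]

print(neuron_fires([1, 1, 0], weights, threshold=1.0))  # 1.3 > 1.0 -> True
print(neuron_fires([1, 1, 1], weights, threshold=1.0))  # 0.7 > 1.0 -> False
```

Note how the inhibiting synapse in the second call keeps the neuron below its threshold; adjusting these weights over time is precisely the "adjustability" of chemical synapses described above, and the lever that learning procedures later in the book will pull.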
2.2.2 Electrochemical processes in the neuron and its components

After having pursued the path of an electrical signal from the dendrites via the synapses to the nucleus of the cell, and from there via the axon into other dendrites, we now want to take a small step from biology towards technology. In doing so, a simplified introduction to electrochemical information processing will be provided.

2.2.2.1 Neurons maintain electrical membrane potential

Let us first take a look at the membrane potential in the resting state of the neuron, i.e. we assume that no electrical signals are received from the outside. In this case, the membrane potential is −70 mV. Since we have learned that this potential depends on the concentration gradients of various ions, there is of course the central question of how to maintain these concentration gradients: normally, diffusion predominates, and therefore each ion is eager to decrease concentration gradients and to spread out evenly. If this happened, the membrane potential would move towards 0 mV, and finally there would be no membrane potential anymore. Thus, the neuron actively maintains its membrane potential to be able to process information. How does this work?
outside of the neuron, and therefore it slowly diffuses out through the neuron's membrane. But another collection of negative ions, called A−, remains within the neuron since the membrane is not permeable to them. Thus, the inside of the neuron becomes negatively charged. Negative A ions remain, positive K ions disappear, and so the inside of the cell becomes more negative. The result is another gradient.

Electrical gradient: The electrical gradient acts contrary to the concentration gradient. The intracellular charge is now very strong, therefore it attracts positive ions: K+ wants to get back into the cell.

If these two gradients were now left to take care of themselves, they would balance out, reach a steady state, and a membrane potential of −85 mV would develop. But we want to achieve a resting membrane potential of −70 mV, thus there seem to exist some disturbances which prevent this. Furthermore, there is another important ion, Na+ (sodium), to which the membrane is not very permeable but which, however, slowly pours through the membrane into the cell. As a result, the sodium feels all the more driven into the cell: On the one hand, there is less sodium within the neuron than outside the neuron. On the other hand, sodium is positive but the interior of the cell is negative, which is a second reason for the sodium to want to get into the cell.

Due to the low diffusion of sodium into the cell, the intracellular sodium concentration increases. But at the same time the inside of the cell becomes less negative, so that K+ pours in more slowly (we can see that this is a complex mechanism where everything is influenced by everything). The sodium shifts the intracellular equilibrium from negative to less negative, compared with its environment. But even with these two ions a standstill with all gradients being balanced out could still be achieved. Now the last piece of the puzzle gets into the game: a "pump" (or rather, the protein ATPase) actively transports ions against the direction they actually want to take!

Sodium is actively pumped out of the cell, although it tries to get into the cell along the concentration gradient and the electrical gradient.

Potassium, however, diffuses strongly out of the cell, but is actively pumped back into it.

For this reason the pump is also called the sodium-potassium pump. The pump maintains the concentration gradient for the sodium as well as for the potassium, so that some sort of steady-state equilibrium is created and the resting potential is finally −70 mV, as observed. All in all, the membrane potential is maintained by the fact that the membrane is impermeable to some ions and other ions are actively pumped against the concentration and electrical gradients. Now that we know that each neuron has a membrane potential, we want to observe how a neuron receives and transmits signals.
2.2.2.2 The neuron is activated by changes in the membrane potential

Above we have learned that sodium and potassium can diffuse through the membrane - sodium slowly, potassium faster. They move through channels within the membrane, the sodium and potassium channels. In addition to these permanently open channels responsible for diffusion and balanced by the sodium-potassium pump, there also exist channels that are not always open but which only respond "if required". Since the opening of these channels changes the concentration of ions within and outside of the membrane, it also changes the membrane potential.

These controllable channels are opened as soon as the accumulated received stimulus exceeds a certain threshold. For example, stimuli can be received from other neurons or have other causes. There exist, for example, specialized forms of neurons, the sense cells, for which a light incidence could be such a stimulus. If the incoming amount of light exceeds the threshold, controllable channels are opened.

The said threshold (the threshold potential) lies at about −55 mV. As soon as the received stimuli reach this value, the neuron is activated and an electrical signal, an action potential, is initiated. Then this signal is transmitted to the cells connected to the observed neuron, i.e., the cells "listen" to the neuron. Now we want to take a closer look at the different stages of the action potential (Fig. 2.4 on the next page):

Resting state: Only the permanently open sodium and potassium channels are open. The membrane potential is at −70 mV and actively kept there by the neuron.

Stimulus up to the threshold: A stimulus opens channels so that sodium can pour in. The intracellular charge becomes more positive. As soon as the membrane potential exceeds the threshold of −55 mV, the action potential is initiated by the opening of many sodium channels.

Depolarization: Sodium is pouring in. Remember: Sodium wants to pour into the cell because there is a lower intracellular than extracellular concentration of sodium. Additionally, the cell is dominated by a negative environment which attracts the positive sodium ions. This massive influx of sodium drastically increases the membrane potential - up to approx. +30 mV - which is the electrical pulse, i.e., the action potential.

Repolarization: Now the sodium channels are closed and the potassium channels are opened. The positively charged ion wants to leave the positive interior of the cell. Additionally, the intracellular concentration is much higher than the extracellular one, which increases the efflux of ions even more. The interior of the cell is once again more negatively charged than the exterior.
Hyperpolarization: Sodium as well as potassium channels are closed again. At first the membrane potential is slightly more negative than the resting potential. This is due to the fact that the potassium channels close more slowly, so that (positively charged) potassium continues to effuse because of its lower extracellular concentration. After a refractory period of 1−2 ms the resting state is re-established, so that the neuron can react to newly applied stimuli with an action potential. In simple terms, the refractory period is a mandatory break a neuron has to take in order to regenerate. The shorter this break is, the more often a neuron can fire per time.

Then the resulting pulse is transmitted by the axon.

2.2.2.3 In the axon a pulse is conducted in a saltatory way

We have already learned that the axon is used to transmit the action potential across long distances (remember: you will find an illustration of a neuron including an axon in Fig. 2.3 on page 19). The axon is a long, slender extension of the soma. In vertebrates it is normally coated by a myelin sheath that consists of Schwann cells (in the PNS) or oligodendrocytes (in the CNS)¹, which insulate the axon very well from electrical activity. At a distance of 0.1−2 mm there are gaps between these cells, the so-called nodes of Ranvier. These gaps appear where one insulating cell ends and the next one begins. It is obvious that at such a node the axon is less insulated.

Now you may assume that these less insulated nodes are a disadvantage of the axon - they are not. At the nodes, mass can be transferred between the intracellular and extracellular area, a transfer that is impossible at those parts of the axon which are situated between two nodes (internodes) and therefore insulated by the myelin sheath. This mass transfer permits the generation of signals similar to the generation of the action potential within the soma. The action potential is transferred as follows: it does not continuously travel along the axon but jumps from node to node. Thus, a series of depolarizations travels along the nodes of Ranvier. One action potential initiates the next one, and mostly even several nodes are active at the same time here. The pulse "jumping" from node to node is responsible for the name of this type of pulse conduction: saltatory conduction.

Obviously, the pulse will move faster if its jumps are larger. Axons with large internodes (2 mm) achieve a signal dispersion of approx. 180 meters per second.

¹ Schwann cells as well as oligodendrocytes are varieties of the glial cells. There are about 50 times more glial cells than neurons: they surround the neurons (glia = glue), insulate them from each other, provide energy, etc.
However, the internodes cannot grow indefinitely, since the action potential to be transferred would fade too much until it reaches the next node. So the nodes have a task, too: to constantly amplify the signal.

The cells receiving the action potential are attached to the end of the axon – often connected by dendrites and synapses. As already indicated above, action potentials cannot only be generated by information received by the dendrites from other neurons.

2.3.1 There are different receptor cells for various sorts of perceptions

by means of the stimulus-conducting apparatus. The resulting action potential can be processed by other neurons and is then transmitted into the thalamus which, as we have already learned, is a gate to the cerebral cortex and can therefore sort out sensory impressions according to current relevance and thus prevent an abundance of information from having to be managed.

After having outlined how information is received from the environment, it will be interesting to look at how this information is processed.

2.3.2 Information is processed on every level of the nervous system

There is no reason to believe that every piece of information received is transmitted to the brain and processed there, and that the brain ensures that it is "output" in the form of motor pulses (the only thing an organism can actually do within its environment is to move). Information processing is entirely decentralized. In order to illustrate this principle, we want to take a look at some examples, which lead us again from the abstract to the fundamental in our hierarchy of information processing.

- It is certain that information is processed in the cerebrum, which is the most developed natural information processing structure.

- The midbrain and the thalamus, which serves – as we have already learned – as a gate to the cerebral cortex, are much lower down in the hierarchy. The filtering of information with respect to current relevance, as executed by the midbrain, is a very important method of information processing, too. But even the thalamus does not receive any pre-processed stimuli from the outside. Now let us go on with the lowest level, the sensory cells.

- On the lowest level, the receptor cells, information is not only received and transferred but directly processed. One of the main aspects of this subject is to prevent the transmission of "continuous stimuli" to the central nervous system because of sensory adaptation: due to continuous stimulation many receptor cells automatically become insensitive to stimuli. Thus, receptor cells are not a direct mapping of specific stimulus energy onto action potentials but depend on the past. Other sensors change their sensitivity according to the situation: there are taste receptors which respond more or less to the same stimulus according to the nutritional condition of the organism.

- Even before a stimulus reaches the receptor cells, information processing can already be executed by a preceding signal-carrying apparatus, for example in the form of amplification: the external and the internal ear have a specific shape to amplify the sound, which also allows – in association with the sensory cells of the sense of hearing – the sensory stimulus to increase only logarithmically with the intensity of the heard signal. On closer examination this is necessary, since the sound pressure of the signals for which the ear is constructed can vary over a wide exponential range. Here, a logarithmic measurement is an advantage: firstly, an overload is prevented and, secondly, it does not matter much that the intensity measurement of intensive signals is less precise. If a jet fighter is starting next to you, small changes in the noise level can be ignored.

2.3.3 An outline of common light sensing organs

Just to get a feel for sense organs and information processing in the organism, we will briefly describe "usual" light sensing organs, i.e. organs often found in nature. For the third light sensing organ described below, the single lens eye, we will discuss the information processing in the eye.

For many organisms it turned out to be extremely useful to be able to perceive electromagnetic radiation in certain regions of the spectrum. Consequently, sense organs have been developed which can detect such electromagnetic radiation. Therefore, the wavelength range of the radiation perceivable by the human eye is called the visible range or simply light. The different wavelengths of this electromagnetic radiation are perceived by the human eye as different colors. The visible range of the electromagnetic radiation differs for each organism. Some organisms cannot see the colors (= wavelength ranges) we can see, others can even perceive additional wavelength ranges (e.g. in the UV range). Before we begin with the human being – in order to get a broader knowledge of the sense of sight – we briefly want to look at two organs of sight which, from an evolutionary point of view, have existed much longer than the human.

2.3.3.1 Complex eyes and pinhole eyes only provide high temporal or spatial resolution

Let us first take a look at the so-called complex eye (Fig. 2.5 on the right page), also called compound eye, which is, for example, common in insects and crustaceans. The complex eye consists of a great number of small, individual eyes. If we look at the complex eye from the outside, the individual eyes are clearly visible and arranged in a hexagonal pattern. Each individual eye has its own nerve fiber which is connected to the insect brain. Since the individual eyes can be distinguished, it is obvious that the number of pixels, i.e. the spatial resolution, of complex eyes must be very low and the image is blurred. But complex eyes have advantages, too, especially for fast-flying insects. Certain complex eyes process more than 300 images per second (to the human eye, however, movies with 25 images per second appear as a fluent motion).

Pinhole eyes are, for example, found in octopus species and work – as you can guess – similarly to a pinhole camera. A pinhole eye has a very small opening for light entry, which projects a sharp image onto the sensory cells behind. Thus, the spatial resolution is much higher than in the complex eye. But due to the very small open-
end of its legs and much more. Thus, a fly has considerable differential and integral calculus in high dimensions implemented "in the hardware". We all know that a fly is not easy to catch. Of course, the bodily functions are also controlled by neurons, but these should be ignored here.

With 0.8 · 10^6 neurons we have enough cerebral matter to create a honeybee. Honeybees build colonies and have amazing capabilities in the field of aerial reconnaissance and navigation.

4 · 10^6 neurons result in a mouse, and here the world of vertebrates already begins.

1.5 · 10^7 neurons are sufficient for a rat, an animal which is reputed to be extremely intelligent and which is often used to participate in a variety of intelligence tests representative for the animal world. Rats have an extraordinary sense of smell and orientation, and they also show social behavior. The brain of a frog can be positioned within the same dimension. The frog has a complex physique with many functions; it can swim and has evolved complex behavior. A frog can continuously target the said fly by means of its eyes while jumping in three-dimensional space, and catch it with its tongue with reasonable probability.

5 · 10^7 neurons make a bat. The bat can navigate in total darkness through a space, exact to several centimeters, by only using its sense of hearing. It uses acoustic signals to localize self-camouflaging insects (e.g. some moths have a certain wing structure that reflects fewer sound waves, so the echo will be small) and also eats its prey while flying.

1.6 · 10^8 neurons are required by the brain of a dog, companion of man for ages. Now take a look at another popular companion of man:

3 · 10^8 neurons can be found in a cat, which is about twice as many as in a dog. We know that cats are very elegant, patient carnivores that can show a variety of behaviors. By the way, an octopus can be positioned within the same dimension. Only very few people know that, for example, in labyrinth orientation the octopus is vastly superior to the rat.

For 6 · 10^9 neurons you already get a chimpanzee, one of the animals very similar to the human.

10^11 neurons make a human. Usually, the human has considerable cognitive capabilities, is able to speak, to abstract, to remember and to use tools as well as the knowledge of other humans to develop advanced technologies and manifold social structures.

With 2 · 10^11 neurons there are nervous systems having more neurons than the human nervous system. Here we should mention elephants and certain whale species.
Adjustable weights: The weights weighting the inputs are variable, similar to the chemical processes at the synaptic cleft. This adds a great dynamic to the network because a large part of the "knowledge" of a neural network is saved in the weights and in the form and power of the chemical processes in a synaptic cleft.

synapses per neuron. Let us further assume that a single synapse could save 4 bits of information. Naïvely calculated: How much storage capacity does the brain have? Note: the information about which neuron is connected to which other neuron is also important.
Exercises
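A back-of-the-envelope sketch of the storage estimate asked for above, assuming the 10^11 neurons mentioned earlier in the text; the figure of 10^4 synapses per neuron is my own assumption, since the exercise's exact count is cut off here:

```python
# Naive brain storage estimate. The neuron count (1e11) comes from the
# text; 1e4 synapses per neuron and 4 bits per synapse are assumptions
# for illustration. Connectivity information is deliberately ignored,
# exactly the simplification the exercise's note warns about.
neurons = 10**11
synapses_per_neuron = 10**4   # assumption, not from the text
bits_per_synapse = 4

total_bits = neurons * synapses_per_neuron * bits_per_synapse
total_terabytes = total_bits / 8 / 10**12  # bits -> bytes -> terabytes
# total_terabytes == 500.0 under these assumptions
```

Under these assumptions the naive answer is a few hundred terabytes — which, as the note says, still ignores the information contained in which neuron is connected to which other neuron.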
This chapter contains the formal definitions for most of the neural network components used later in the text. After this chapter you will be able to read the individual chapters of this paper without knowing the preceding ones (although this would be useful).

Remark (on definitions): In the following definitions, especially in the function definitions, I indicate the input elements or the target element (e.g. netj or oi) instead of the usual pre-image set or target set (e.g. R or C) – this is necessary and also easier to write and to understand, since the form of these elements can be chosen nearly arbitrarily. But usually these values are numeric.

In some definitions of this paper we use the term time or the number of cycles of the neural network, respectively. The time is divided into discrete time steps:

Definition 3.1 (The concept of time): The current time (present time) is referred to as (t), the next time step as (t + 1), the preceding one as (t − 1). Any other time steps are referred to analogously. If in the following chapters several mathematical variables (e.g. netj or oi) refer to a certain point of time, the notation will be, for example, netj(t − 1) or oi(t).

Remark: From a biological point of view this is, of course, not very plausible (in the human brain a neuron does not wait for another one), but it significantly simplifies the implementation.

A technical neural network consists of simple processing units, the neurons,

[Figure label: Output function (generates the output from the activation; often the identity).]
3.2.4 Neurons get activated if the network input exceeds their threshold value

When centered around the threshold value, the activation function of a neuron reacts particularly sensitively. From the biological point of view, the threshold value represents the threshold at which a neuron starts firing. The threshold value is also widely included in the definition of the activation function, but generally the definition is the following:

Definition 3.6 (General threshold value): Let j be a neuron. The threshold value Θj is explicitly assigned to j and marks the position of the maximum gradient value of the activation function.

The activation function transforms the network input and the previous activation state aj(t − 1) into a new activation state aj(t), with the threshold value Θ playing an important role, as already mentioned.

Remark: Unlike the other variables within the neural network (particularly unlike the ones defined so far), the activation function is often defined globally for all neurons, or at least for a set of neurons, and only the threshold values are different for each neuron. We should also keep in mind that the threshold values can be changed, for example by a learning procedure. So it can become particularly necessary to relate the threshold value to the time and to write, for instance, Θj as Θj(t) (but for reasons of clarity, I omitted this here). The activation function is also called the transfer function.
A common choice is the Fermi function with temperature parameter T:

f(x) = 1 / (1 + e^(−x/T))   (3.6)

[Figure: the Fermi function with temperature parameter, and the hyperbolic tangent tanh(x).]

Once again formally, for functions which are not explicitly defined:

Definition 3.8 (Output function): Let j be a neuron. The output function

fout : aj → oj   (3.7)

then calculates the output value oj of the neuron j from its activation state aj.
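A minimal sketch of the two activation functions plotted here — the Fermi function with temperature parameter T from equation (3.6), and the hyperbolic tangent; the function name and the sample values are my own:

```python
import math

def fermi(x, temperature=1.0):
    """Fermi (logistic) function 1 / (1 + e^(-x/T)) from equation (3.6).

    The temperature T controls the steepness: smaller T makes the
    transition at x = 0 sharper, approaching a binary threshold.
    """
    return 1.0 / (1.0 + math.exp(-x / temperature))

# fermi maps any input into (0, 1); tanh (math.tanh) maps into (-1, 1).
# Both are centered at x = 0, where the neuron reacts most sensitively.
cold = fermi(1.0, temperature=0.1)   # steep: already close to 1
warm = fermi(1.0, temperature=10.0)  # shallow: still close to 0.5
```

For T → 0 the Fermi function approaches the binary threshold behavior discussed above for the threshold value Θ.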
Unless explicitly specified otherwise, we will use the identity as output function within this paper.

To clarify that the connections are between the line neurons and the column neurons, I have inserted the small arrow in the upper-left cell.

[Figure 3.3: A feedforward network with three layers: two input neurons, three hidden neurons and two output neurons. Characteristic for the Hinton diagram of completely linked feedforward networks is the block building above the diagonal line.]

3.3.2 Recurrent networks have influence on themselves

Recurrence is defined as the process of a neuron influencing itself by any means or by any connection. Recurrent networks do not always have explicitly defined input or output neurons. Therefore, in the figures I omitted all markings that concern this matter and only numbered the neurons.

3.3.2.1 Direct recurrences start and end at the same neuron
direct ways forward to influence itself, for example by influencing the neurons of the next layer and the neurons of this next layer influencing j (fig. 3.6).

Definition 3.13 (Indirect recurrence): Again our network is based on a feedforward network, but now with additional connections between neurons and the preceding layer being allowed. Therefore, entries below the diagonal of W can be unequal to 0.

[Figure 3.6: A network similar to a feedforward network with indirectly recurrent neurons. The indirect recurrences are represented by solid lines. As we can see, connections to the preceding layers can exist here, too. The fields that are symmetric to the feedforward blocks in the Hinton diagram are now occupied.]

3.3.2.3 Lateral recurrences

[Lateral recurrences connect neurons within one layer] Connections between neurons within one layer are called lateral recurrences (fig. 3.7 on the next page). Here, each neuron often inhibits the other neurons of the layer and strengthens itself. As a result, only the strongest neuron becomes active (winner-takes-all scheme).

Definition 3.14 (Lateral recurrence): A laterally recurrent network permits connections within one layer.
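The winner-takes-all behavior of such a laterally inhibiting layer can be sketched as an iterative process; the self-excitation and inhibition strengths below are arbitrary illustrative values, not taken from the text:

```python
# Winner-takes-all via lateral inhibition (illustrative sketch).
# Each neuron boosts its own activation and subtracts a fraction of
# every other neuron's activation; after a few rounds only the
# initially strongest neuron keeps a positive activation.

def winner_takes_all(activations, self_excite=1.1, inhibit=0.2, rounds=50):
    a = list(activations)
    for _ in range(rounds):
        total = sum(a)
        # self-excitation minus inhibition from the rest, clamped to [0, 1]
        a = [min(1.0, max(0.0, self_excite * x - inhibit * (total - x)))
             for x in a]
    return a

result = winner_takes_all([0.3, 0.9, 0.5])
# only the neuron that started strongest stays active
```

After a few rounds the neurons that started weaker are driven to zero and only the strongest remains active, as described above.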
[Figure 3.7: A network similar to a feedforward network with laterally recurrent neurons. The direct recurrences are represented by solid lines. Here, recurrences only exist within the layer. In the Hinton diagram, filled squares are concentrated around the diagonal in the height of the feedforward blocks, but the diagonal is left uncovered.]

must be symmetric (fig. 3.8 on the right page). A popular example are the self-organizing maps, which will be introduced in chapter 10.

Definition 3.15 (Complete interconnection): In this case, every neuron is always allowed to be connected to every other neuron – but as a result every neuron can become an input neuron. Therefore, direct recurrences normally cannot be applied here and clearly defined layers no longer exist. Thus, the matrix W may be unequal to 0 everywhere, except along its diagonal.

3.4 The bias neuron is a technical trick to consider threshold values as connection weights

By now we know that in many network paradigms neurons have a threshold value that indicates when a neuron becomes active. Thus, the threshold value is an activation function parameter of a neuron. From the biological point of view this sounds most plausible, but it is complicated to access the activation function at runtime in order to train the threshold value.
Definition 3.16: A bias neuron is a neuron whose output value is always 1 and which is represented by BIAS. It is used to represent neuron biases as connection weights, which enables any weight-training algorithm to train the biases at the same time.

Then the threshold value of the neurons j1, j2, ..., jn is set to 0. Now the threshold values are implemented as connecting weights (fig. 3.9 on page 47) and can directly be trained together with the connecting weights, which considerably facilitates the learning process.

[Figure 3.8: A completely linked network with symmetric connections and without direct recurrences. In the Hinton diagram, only the diagonal is left blank.]

In other words: instead of including the threshold value in the activation function, it is now included in the propagation function. Or even shorter: the threshold value is subtracted from the network input, i.e. it is part of the network input. More formally:
[Figure 3.9: Two equivalent neural networks, one without bias neuron on the left, one with bias neuron on the right. The neuron threshold values can be found in the neurons, the connecting weights at the connections. Furthermore, I omitted the weights of the already existing connections (represented by dotted lines on the right side).]
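The equivalence shown in the figure can also be checked numerically; the weights, inputs and threshold below are invented for illustration:

```python
# Bias-neuron trick (illustrative values): a neuron that compares its
# net input against a threshold theta behaves exactly like a
# threshold-0 neuron that receives one extra input from a BIAS neuron
# with constant output 1 and connection weight -theta.

def net_input(weights, outputs):
    """Weighted sum (propagation function)."""
    return sum(w * o for w, o in zip(weights, outputs))

weights = [0.5, -0.3]   # invented connecting weights
inputs = [1.0, 2.0]     # invented input values
theta = -0.2            # invented threshold value

# variant 1: threshold kept inside the activation function
fires_with_threshold = net_input(weights, inputs) > theta

# variant 2: threshold moved into the propagation function via a
# bias neuron (constant output 1, weight -theta), threshold set to 0
fires_with_bias = net_input(weights + [-theta], inputs + [1.0]) > 0
```

Both variants fire under exactly the same conditions; the only difference is where Θ is accounted for, which is what makes it trainable as an ordinary connection weight.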
Here, the neurons do not change their values simultaneously but at different points of time. For this, there exist different orders, some of which I want to introduce in the following:

With random permutation, each neuron is chosen exactly once, but in random order, during one cycle.

Definition 3.19 (Random permutation): Initially, a permutation of the neurons is computed randomly, and the neurons are then updated in this order.

For all orders, either the previous neuron activations at time t or, if already existing, the neuron activations at time t + 1 (for which we are currently calculating the activations) can be taken as a starting point.

3.7 Communication with the outside world: input and output of data in and from neural networks
Exercises
Chapter 4: How to Train a Neural Network? (fundamental)

As written above, the most interesting characteristic of neural networks is their capability to familiarize themselves with problems by means of training and, after sufficient training, to be able to solve unknown problems of the same class. This approach is referred to as generalization. Before introducing specific learning procedures, I want to propose some basic principles about the learning procedure in this chapter.

Theoretically, a neural network could learn by

1. developing new connections,

2. deleting existing connections,

3. changing connecting weights,

4. changing the threshold values of neurons,

5. varying one or more of the three neuron functions (remember: activation function, propagation function and output function),

4.1 There are different
no longer trained. Moreover, we can develop further connections by setting a non-existing connection (with the value 0 in the connection matrix) to a value different from 0. As for the modification of threshold values, I refer to the possibility of implementing them as weights (section 3.4).

patterns, which we use to train our neural net. Additionally, I will introduce the three essential paradigms of learning by means of the differences between the respective training sets.
should be more effective than unsupervised learning since the network receives specific criteria for problem-solving.

Definition 4.3 (Reinforcement learning): The training set consists of input patterns; after completion of a sequence, a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was.

Remark: This learning procedure is not always biologically plausible, but it is extremely effective and therefore very feasible.

At first we want to look at the supervised learning procedures in general, which - in this paper - correspond to the following steps:
of a whole batch of training examples including the related change in weight values is called an epoch.

Definition 4.5 (Offline learning): Several training patterns are entered into the network at once, the errors are accumulated and the network learns for all patterns at the same time.

Definition 4.6 (Online learning): The network learns directly from the errors of each training example.

- How can the learned patterns be stored in the network?

- Is it possible to avoid that newly learned patterns destroy previously learned associations (the so-called stability/plasticity dilemma)?

Remark: We will see that all these questions cannot be answered generally, but that they have to be discussed for each learning procedure and each network topology individually.
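The difference between offline and online learning (Definitions 4.5 and 4.6) can be sketched for a single weight trained by simple corrections; the target values, the learning rate and the error signal are invented for illustration:

```python
# Offline (batch) vs. online learning, sketched for one weight w.
# Per-pattern error signal: (target - w). The patterns, the learning
# rate eta and the error signal are invented illustrative values.

patterns = [1.0, 2.0, 3.0]   # stand-ins for per-pattern targets
eta = 0.1                    # learning rate

def offline_epoch(w):
    # accumulate the corrections for ALL patterns, then apply once
    total = sum(target - w for target in patterns)
    return w + eta * total

def online_epoch(w):
    # apply the correction immediately after EACH pattern
    for target in patterns:
        w = w + eta * (target - w)
    return w

w_off = offline_epoch(0.0)   # one accumulated update per epoch
w_on = online_epoch(0.0)     # three immediate updates per epoch
```

After one epoch the two strategies generally yield different weights: offline applies the whole batch's correction in one step, while online lets each pattern immediately influence the next correction.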
output. The set of training patterns is called P. It contains a finite number of ordered pairs (p, t) of training patterns with the corresponding desired output.

Remark: Training patterns are often simply called patterns, which is why they are referred to as p. In the literature as well as in this paper they are synonymously called patterns, training examples, etc.

Depending on the type of network being used, the neural network will output an output vector y. Basically, the training example p is nothing more than an input vector. We only use it for training purposes because we know the corresponding teaching input t, which is nothing more than the desired output vector for the training example. The error vector Ep is the difference between the teaching input t and the actual output y.

Definition 4.8 (Teaching input): Let j be an output neuron. The teaching input tj is the desired, correct value j should output after the input of a certain training pattern. Analogous to the vector p, the teaching inputs t1, t2, ..., tn of the neurons can also be combined into a vector t. t always refers to a specific training pattern p and is, as already mentioned, contained in the set P of the training patterns.

Definition 4.9 (Error vector): For several output neurons Ω1, Ω2, ..., Ωn, the difference between the output vector and the teaching input under a training input p is

Ep = (t1 − y1, ..., tn − yn)

So, what x and y are for the general network operation, p and t are for the network training - and during training we try to bring y and t as close together as possible.

Note on notation: We referred to the output values of a neuron i as oi. Thus, the output of an output neuron Ω is called oΩ. But the output values of a network are referred to as yΩ.
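Definition 4.9 translates directly into code; the vectors are invented example values:

```python
# Error vector Ep = t - y, computed componentwise over the output
# neurons. Teaching input t and network output y are invented examples.
t = [1.0, 0.0, 0.5]   # desired outputs (teaching input)
y = [0.8, 0.1, 0.5]   # actual network outputs

Ep = [t_j - y_j for t_j, y_j in zip(t, y)]
# a zero component means that output neuron already produces
# exactly the desired value
```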
…provided that there are enough training examples. The usual division ratios are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.
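The 70% / 30% split described above can be sketched as follows; the list of (p, t) pairs is only a stand-in for a real training set:

```python
import random

patterns = [((i,), 2 * i) for i in range(100)]   # dummy set of (p, t) pairs

random.seed(0)                   # fixed seed, for reproducibility only
shuffled = patterns[:]
random.shuffle(shuffled)         # the "randomly chosen" part

cut = int(0.7 * len(shuffled))
training_data = shuffled[:cut]       # 70% used for training
verification_data = shuffled[cut:]   # 30% used for verification
```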
Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve, plotted as error over training epochs. Note the alternating logarithmic and linear scalings! Also note the small "inaccurate spikes" visible in the bend of the curve in the first and second diagram from the bottom.
Confidence in the results, for example, is boosted when the network always reaches nearly the same final error rate for different random initializations, so repeated initialization and training will provide a more objective result.

On the other hand, it is possible that a curve descending fast in the beginning can, after a longer time of learning, be overtaken by another curve: this can indicate that either the learning rate of the worse curve was too high or the worse curve itself simply got stuck in a secondary minimum, but was the first to find it. Remember: larger error values are worse than small ones.

But, in any case, note: many people only generate a learning curve for the training data (and then they are surprised that only a few things will work), but for reasons of objectivity and clarity the verification data should also be plotted on a second learning curve, which generally provides values that are slightly worse and oscillate more strongly. With good generalization, however, this curve can decrease, too.

When the network eventually begins to memorize the examples, the shape of the learning curve can provide an indication: if the learning curve of the verification examples is suddenly and rapidly rising while the learning curve of the training data is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network has already learned well enough at the next point of the two curves, and maybe the final point of learning is to be applied here (this procedure is called early stopping).

Once again I want to remind you that these are all acting as indicators and not as if-then rules.

Let me say one word about the number of learning epochs: at the moment, various publications often use about 10^6 to 10^7 epochs. Why not try some more? The answer is simple: the current standard PC is not yet fast enough to support 10^8 epochs. But with increasing processing speed the trend will go towards it all by itself.

4.5 Gradient optimization procedures

In order to establish the mathematical basis for some of the following learning procedures I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of gradient descent.

Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For clarity the illustration (fig. 4.3 on the right page) shows only two dimensions, but in principle there is no limit to the number of dimensions.
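The early-stopping idea sketched above can be written down in a few lines; train_epoch and verification_error are hypothetical callbacks standing in for a real training pass and a real evaluation on the verification data:

```python
def train_with_early_stopping(train_epoch, verification_error,
                              max_epochs=1000, patience=10):
    """Stop when the verification error has not improved for `patience` epochs."""
    best_err = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_epoch()                       # one pass over the training data
        err = verification_error()          # error on the verification data
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                           # verification curve keeps rising
    return best_epoch, best_err
```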
Figure 4.3: Visualization of the gradient descent on a two-dimensional error function. We go forward in diametrical opposition to g, i.e. with the steepest descent towards the lowest point, with the step width being proportional to |g| (the steeper the descent, the faster the steps). On the left the area is shown in 3D, on the right the steps over the level curves are shown in 2D. Here it is obvious how a movement is made in the opposite direction of g towards the minimum of the function and continuously slows down proportionally to |g|. Source: http://webster.fhs-hagenberg.ac.at/staff/sdreisei/Teaching/WS2001-2002/PatternClassification/graddescent.pdf
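A minimal numeric sketch of the descent shown in fig. 4.3: starting from a point s, we repeatedly step against the gradient, with the step size proportional to |g|. The example function, the step factor eta and the starting point are all made-up choices:

```python
import numpy as np

def f(x):                        # example error function: a simple bowl
    return x[0] ** 2 + x[1] ** 2

def gradient(f, x, h=1e-6):      # numerical gradient g = nabla f(x)
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

x = np.array([2.0, -1.5])        # starting point s
eta = 0.1                        # proportionality factor for the step size
for _ in range(200):
    x = x - eta * gradient(f, x) # move towards -g, step width ~ |g|
```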
The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent and indicates the gradient in this direction by means of its norm |g|. Thus, the gradient is a generalization of the derivative for multi-dimensional functions.2 Accordingly, the negative gradient −g exactly points towards the steepest descent. The gradient operator ∇ is referred to as nabla operator, the overall notation of the gradient g of the point (x, y) of a two-dimensional function f being, for instance, g(x, y) = ∇f(x, y).

2 I don't want to dwell on how to determine the multidimensional derivative; the interested reader may consult the usual analysis literature.

Definition 4.14 (Gradient): Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function f(x1, x2, ..., xn). The gradient operator notation is defined as

g(x1, x2, ..., xn) = ∇f(x1, x2, ..., xn).

g directs from any point of f towards the steepest ascent from this point, with |g| corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g (which means, vividly speaking, the direction into which a ball would roll from the starting point), with the size of the steps being proportional to |g| (the steeper the descent, the broader the steps). Therefore, we move slowly on a flat plateau, and on a steep ascent we run downhill. If we came into a valley, we would, depending on the size of our steps, either jump over it or return into the valley across the opposite hillside, in order to come closer and closer to the deepest point of the valley by walking to and fro, similar to a ball moving within a round bowl.

Definition 4.15 (Gradient descent): Let f be an n-dimensional function and s = (s1, s2, ..., sn) the given starting point. Gradient descent means going from f(s) against the direction of g, i.e. towards −g, with steps of the size of |g|, towards smaller and smaller values of f.

Now we will see that gradient descent procedures are not free from errors (section 4.5.1) but, nevertheless, they are promising.

4.5.1 Gradient procedures incorporate several problems

As already implied in section 4.5, gradient descent (and therefore backpropagation) is promising but not foolproof. One problem is that the result does not always reveal whether an error has occurred.

4.5.1.1 Convergence against suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4 on the right page). This problem increases proportionally
Figure 4.4: Possible errors during a gradient descent: a) Detecting bad minima, b) Quasi-standstill
with small gradient, c) Oscillation in canyons, d) Leaving good minima.
to the size of the error surface, and there is no universal solution.

4.5.1.2 Stagnation at flat plateaus

When passing a flat plateau, for instance, the gradient also becomes negligibly small (because there is hardly a descent; part b of fig. 4.4), which requires many further steps. A hypothetically possible gradient of 0 would completely stop the descent.

4.5.1.3 Leaving good minima

On the other hand, the gradient is very large at a steep slope, so that large steps can be made and a good minimum can possibly be missed (part d of fig. 4.4).

4.5.1.4 Oscillation in steep canyons

A sudden alternation from a very strong negative gradient to a very strong positive one can even result in oscillation (part c of fig. 4.4). In nature, such an error does not occur very often, so that we can think about the possibilities b and d.

4.6 Problem examples allow for testing self-coded learning strategies

We looked at learning from the formal point of view, not much yet but a little. Now it is time to look at a few problem examples you can later use to test implemented networks and learning rules.
Figure 4.5: Illustration of the training examples of the 2-spiral problem

Figure 4.6: Illustration of training examples for the checkerboard problem
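As a sketch of how such problem examples can be generated for your own experiments, here is a possible generator for checkerboard-style training examples; the 3x3 field layout and the unit square are assumptions, not taken from the figure:

```python
import random

def checkerboard_examples(n, cells=3, seed=0):
    """Points in the unit square, labelled by the parity of their grid field."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = (int(x * cells) + int(y * cells)) % 2   # alternating fields
        examples.append(((x, y), label))
    return examples
```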
4.7 The Hebbian learning rule is the basis for most other learning rules

In 1949, Donald O. Hebb formulated the Hebbian rule [Heb49], which is the basis for most of the more complicated learning rules we will discuss in this paper. We distinguish between the original form and the more general form, which is a kind of principle for other learning rules.

4.7.1 Original rule

Definition 4.16 (Hebbian rule): "If neuron j receives an input from neuron i and if both neurons are strongly active at the same time, then increase the weight wi,j (i.e. the strength of the connection between i and j)." Mathematically speaking, the rule is

∆wi,j ∼ η · oi · aj,   (4.5)

with ∆wi,j being the change in weight from i to j, which is proportional to the following factors:

. the output oi of the predecessor neuron i, as well as

. the activation aj of the successor neuron j, and

. a constant η, i.e. the learning rate, which will be discussed in section 5.5.2.

The changes in weight ∆wi,j are simply added to the weight wi,j.

Remark: Why am I speaking twice about activation, while in the formula I am using oi and aj, i.e. the output of neuron i and the activation of neuron j? Remember that the identity is often used as output function, and therefore ai and oi of a neuron are often the same. Besides, Hebb postulated his rule long before the specification of technical neurons.

Considering that this learning rule was preferred with binary activations, it is clear that with the possible activations (1, 0) the weights will either increase or remain constant. Sooner or later they would go ad infinitum, since they can only be corrected "upwards" when an error occurs. This can be compensated by using the activations (−1, 1).3 Thus, the weights are decreased when the activation of the predecessor neuron dissents from the one of the successor neuron, otherwise they are increased.

Remark: Most of the learning rules discussed in this paper are a specialization of the mathematically more general form [MR86] of the Hebbian rule.

Definition 4.17 (Hebbian rule, more general): The generalized form of the Hebbian rule only specifies the proportionality of the change in weight to the product of two undefined functions, but with defined input values:

∆wi,j = η · h(oi, wi,j) · g(aj, tj).   (4.6)

3 But that is no longer the "original version" of the Hebbian rule.
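A sketch of the original rule in code, using activations from {−1, 1} as discussed above; the learning rate value is an arbitrary example:

```python
eta = 0.1   # learning rate, arbitrary example value

def hebb_update(w_ij, o_i, a_j):
    """Original Hebbian rule: the change eta * o_i * a_j is added to w_ij."""
    return w_ij + eta * o_i * a_j

w = 0.0
w = hebb_update(w, 1, 1)    # both neurons active together: weight grows
w = hebb_update(w, -1, 1)   # activations dissent: weight shrinks again
```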
Exercises
p1 = (2, 2, 2)
p2 = (3, 3, 3)
p3 = (4, 4, 4)
p4 = (6, 0, 0)
p5 = (0, 6, 0)
p6 = (0, 0, 6)
Chapter 5
The Perceptron
A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it. Perceptrons are multi-layer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits and extensions that should avoid the limitations. Derivation of learning procedures and discussion about their problems.
As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term perceptron is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the next following layer, which is called input layer (fig. 5.1 on the next page); but the weights of all other layers are allowed to be changed. All neurons subordinate to the retina are pattern detectors. Here we initially use a binary perceptron, with every output neuron having exactly two possible output values (e.g. {0, 1} or {−1, 1}). Thus, a binary threshold function is used as activation function, depending on the threshold value Θ of the output neuron.

In a way, the binary activation function represents an IF query which can also be negated by means of negative weights. The perceptron can be used to accomplish real and logical information processing.

Remark: Whether this method is reasonable is another matter; of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being subtly connected in series or interconnected. But we will see that this is not possible without series connection.
Figure 5.1: Architecture of a perceptron with one layer of variable connections in different views. The solid-drawn weight layer in the two illustrations on the bottom can be trained.
Left side: Example of scanning information in the eye.
Right side, upper part: Drawing of the same example with indicated fixed-weight layer, using the defined designs of the functional descriptions for neurons.
Right side, lower part: Without indicated fixed-weight layer, with the name of each neuron corresponding to our convention. The fixed-weight layer will no longer be taken into account in the course of this paper.
Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.

Definition 5.1 (Input neuron): An input neuron is an identity neuron. It exactly forwards the information received. Thus, it represents the identity function, which is indicated by a corresponding symbol within the neuron.

Definition 5.2 (Information processing neuron): Information processing neurons process the input information somehow or other, i.e. do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sigma sign Σ; the activation function of the neuron is then the binary threshold function. Now we turn our attention to other neurons with the weighted sum represented as propagation function, but with the activation functions hyperbolic tangent or Fermi function, or with a separately defined activation function fact, all with the same input.

Definition 5.3 (Perceptron): The perceptron (fig. 5.1 on the left page) is a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections with the first neuron layer (input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the above-defined input neurons.

A feedforward network often contains shortcuts, which does not exactly correspond to the original description and therefore is not added to the definition.

Remark (on the retina): We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact, the first neuron layer is often viewed as the input layer, since it only forwards the input values. The retina itself and the static weights behind it are no longer mentioned or mapped, since they do not process information in any case. So, the mapping of a perceptron starts with the input neurons.
i1 @PUPUUU GFED
GFED
@ABC @ABC @ABC
GFED @ABC
GFED iGFED
@ABC
with the same input.
i2 P i3 i4 i5
@@PPPUPUUUU AAPAPPPP}} AAAnnnn}n} iiiinininin~n~
U P
@@ PPP UUAUA}} PP nn AA}i}ii nnn ~~ n i
@@ PP }UAUU nPnP ii}iA nn ~
The Boolean functions AND and OR shown @@ P}}P}PnPnAnAinAUinUiU iUiPUiP}U}P}PnPnAnAAn ~~~
@ABC nit
GFED @ABC
GFED @ABC
GFED
n i
P
~}v niii P' U
n P
~}nw n UUUP* ( ~~
in fig. 5.4 on the right page are trivial, com-
Ω1 Ω2 Ω3
posable examples.
GFED
@ABC @ABC
GFED
in finite time the perceptron can learn any-
A thing it can represent (perceptron con-
AA }}
A }} vergence theorem, [Ros62]). But don’t
1AA 1
}} halloo till you’re out of the wood! What
@ABC
GFED
AA
~}}
1.5 the perceptron is capable to represent will
be explored later.
During the exploration of linear separabil-
GFED
@ABC @ABC
GFED
ity of problems we will cover the fact that
A at least the single-layer perceptron unfor-
AA }
A }}
} tunately cannot represent a lot of prob-
1AA 1
}} lems.
@ABC
GFED
AA
~}}
0.5
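The AND and OR perceptrons of fig. 5.4 can be written down directly: two inputs with weight 1 each, feeding a binary threshold neuron with threshold 1.5 (AND) or 0.5 (OR):

```python
def threshold_neuron(inputs, weights, theta):
    net = sum(i * w for i, w in zip(inputs, weights))  # weighted sum
    return 1 if net >= theta else 0                    # binary threshold

def AND(x1, x2):
    return threshold_neuron((x1, x2), (1, 1), 1.5)

def OR(x1, x2):
    return threshold_neuron((x1, x2), (1, 1), 0.5)
```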
…has the advantage of being suitable for non-binary activation functions and, being far away from the learning target, of automatically learning faster.

Now our learning target will certainly be that for all training examples the output y of the network approximates the desired output t, i.e. that it is formally true that…
. y is the output vector of a neural network,

. output neurons are referred to as Ω1, Ω2, ..., Ω|O|,

. i is the input and

. o is the output of a neuron.

Additionally, we defined that

. the error vector Ep represents the difference (t − y) under a certain training example p,

. furthermore, let O be the set of output neurons and

. I be the set of input neurons.

Another naming convention shall be that, for example, for the output o and the teaching input t an additional index p may be set in order to indicate that this value is pattern-specific. Sometimes this will considerably enhance clarity.

…understands the set2 of weights W as a vector and maps the values onto the normed output error (normed because otherwise not all errors can be mapped onto one single e ∈ R to perform a gradient descent). It is obvious that a specific error function Errp(W) can analogously be generated for a single pattern p.

2 Following the tradition of the literature, I previously defined W as a weight matrix. I am aware of this conflict but it shall not mind us here.

As already shown in section 4.5 on the subject of gradient descent procedures, gradient descent procedures calculate the gradient of an arbitrary but finite-dimensional function (here: of the error function Err(W)) and go down towards the gradient until a minimum is reached. Err(W) is defined on the set of all weights, which is herein understood to be the vector W. So we try to decrease or to minimize the error by means of, casually speaking, turning the weights; thus you receive information about how
Figure 5.5: Exemplary error surface of a neural network with two trainable connections w1 and w2. Generally, neural networks have more than two connections, but this would have made the illustration too complex. And most of the time the error surface is too craggy, which complicates the search for the minimum.

to change the weights (the change in all weights is referred to as ∆W) by deriving the error function Err(W):

∆W ∼ −∇Err(W).   (5.1)

Due to this proportionality, a proportionality constant η provides equality (η will soon get another meaning and a real practical use beyond the mere meaning of a proportionality constant; I just want to ask the reader to be patient for a while):

∆W = −η∇Err(W).   (5.2)

Applied to a single weight wi,Ω, this reads:

∆wi,Ω = −η · ∂Err(W)/∂wi,Ω.   (5.3)

A question therefore arises: how is our error function exactly defined? It is not good if many results are far away from the desired ones; the error function should then provide large values. On the other hand, it is similarly bad if many results are close to the desired ones but there exists an extreme outlier very far away. So we use the squared distance between the output vector y and the teaching input t, which provides the Errp that is specific for a training example p over the output of all output neurons Ω:

Errp(W) = 1/2 · Σ_{Ω∈O} (tp,Ω − yp,Ω)².   (5.4)

Thus, we square the difference of the components of the vectors t and y under the pattern p and sum up these squares. Then the error definition Err and therefore the definition of the error function Err(W) result from the summation of the specific errors Errp(W) of all patterns p:

Err(W) = Σ_{p∈P} Errp(W).   (5.5)
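Equation 5.4 as code: the specific error of a single pattern is half the summed squared difference between teaching input and output, and the total error is the sum over all patterns; the sample vectors below are made up:

```python
def err_p(t, y):
    """Specific error Err_p = 1/2 * sum((t_Omega - y_Omega)^2), eq. 5.4."""
    return 0.5 * sum((t_o - y_o) ** 2 for t_o, y_o in zip(t, y))

def err_total(teaching_inputs, outputs):
    """Total error Err = sum of the specific errors over all patterns, eq. 5.5."""
    return sum(err_p(t, y) for t, y in zip(teaching_inputs, outputs))
```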
Remark: The attentive reader will certainly wonder where the factor 1/2 in equation 5.4 on the left page suddenly came from and where the root is in the equation, since the equation is very similar to the Euclidean distance. Both facts result from simple pragmatics: it is a matter of error minimization. Since the root function grows monotonically with its argument, we can omit it for reasons of calculation and implementation effort, since we do not need it for minimization. Equally, it does not matter whether the term to be minimized is divided in half by the prefactor 1/2: therefore I am allowed to multiply by 1/2. This is mere idleness, so that it can later be cancelled against the 2 arising in the course of our calculation.

Now we want to continue deriving the Delta rule for linear activation functions. We have already discussed that we turn the individual weights wi,Ω a bit and see how the error Err(W) is changing, which corresponds to the derivative of the error function Err(W) according to the very same weight wi,Ω. This derivative corresponds to the sum of the derivatives of all specific errors Errp according to this weight (since the total error Err(W) results from the sum of the specific errors):

∆wi,Ω = −η · ∂Err(W)/∂wi,Ω   (5.7)
       = Σ_{p∈P} −η · ∂Errp(W)/∂wi,Ω.   (5.8)

Once again I want to think about the question of how a neural network processes data. Basically, the data is only transferred through a function, the result of the function is sent through another one, and so on. If we ignore the output function, the path of the neuron outputs oi1 and oi2, which the neurons i1 and i2 entered into a neuron Ω, initially is the propagation function (here the weighted sum), from which the network input is received:

oi1, oi2 → fprop
⇒ fprop(oi1, oi2) = oi1 · wi1,Ω + oi2 · wi2,Ω = netΩ

Then this is sent through the activation function of the neuron Ω, so that we receive the output of this neuron, which is at the same time a component of the output vector y:

netΩ → fact
⇒ fact(netΩ) = oΩ = yΩ.

As we can see, this output results from many nested functions:

oΩ = fact(netΩ)   (5.9)
   = fact(oi1 · wi1,Ω + oi2 · wi2,Ω).   (5.10)

It is clear that we could break down the output into the input neurons (this is unnecessary here, since they do not process information in an SLP). Thus, we want to calculate the derivatives of equation 5.8 and, due to the nested functions, we can apply
the chain rule to factorize the derivative ∂Errp(W)/∂wi,Ω included in equation 5.8 on the previous page:

∂Errp(W)/∂wi,Ω = ∂Errp(W)/∂op,Ω · ∂op,Ω/∂wi,Ω.   (5.11)

Let us take a look at the first multiplicative factor of the above-mentioned equation 5.11, which represents the derivative of the specific error Errp(W) according to the output, i.e. the change of the error Errp with an output op,Ω: the examination of Errp (equation 5.4 on page 78) clearly shows that this change is exactly the difference between teaching input and output, (tp,Ω − op,Ω) (remember: since Ω is an output neuron, op,Ω = yp,Ω). The closer the output is to the teaching input, the smaller is the specific error. Thus we can replace the one by the other. This difference is also called δp,Ω (which is the reason for the name Delta rule):

∂Errp(W)/∂wi,Ω = −(tp,Ω − op,Ω) · ∂op,Ω/∂wi,Ω   (5.12)
               = −δp,Ω · ∂op,Ω/∂wi,Ω.   (5.13)

The second multiplicative factor of equation 5.11 and of the following one is the derivative of the output of the neuron Ω for the pattern p according to the weight wi,Ω. So how does op,Ω change when the weight from i to Ω is changed? Due to the requirement at the beginning of the derivation we only have a linear activation function fact, therefore we can just as well look at the change of the network input when wi,Ω is changing:

∂Errp(W)/∂wi,Ω = −δp,Ω · ∂ Σ_{i∈I}(op,i wi,Ω) / ∂wi,Ω.   (5.14)

The resulting derivative ∂ Σ_{i∈I}(op,i wi,Ω) / ∂wi,Ω can now be reduced: the function Σ_{i∈I}(op,i wi,Ω) to be derived consists of many summands, and only the summand op,i wi,Ω contains the variable wi,Ω, according to which we derive. Thus, ∂ Σ_{i∈I}(op,i wi,Ω) / ∂wi,Ω = op,i and therefore:

∂Errp(W)/∂wi,Ω = −δp,Ω · op,i   (5.15)
               = −op,i · δp,Ω.   (5.16)

We insert this into equation 5.8 on the previous page, which results in our modification rule for a weight wi,Ω:

∆wi,Ω = η · Σ_{p∈P} op,i · δp,Ω.   (5.17)

However: from the very first the derivation has been intended as an offline rule, by means of the question of how to add up the errors of all patterns and how to learn them after all patterns have been presented. Although this approach is mathematically correct, the implementation is far more time-consuming and, as we will see later in this chapter, partially needs a lot of computational effort during training.

The "online-learning version" of the Delta rule simply omits the summation, and learning is realized immediately after each presented training pattern.
[Fig. 5.8: a singlelayer perceptron with two input neurons i1, i2, weights wi1,Ω, wi2,Ω and one output neuron Ω. XOR?]

[Fig. 5.9 diagram: both inputs connect with weight 1 to a hidden neuron with threshold 1.5 and to the output neuron with threshold 0.5; the hidden neuron connects to the output neuron with weight −2.]

Figure 5.9: Neural network realizing the XOR function. Threshold values (as far as they exist) are located within the neurons.

…ers, three neuron layers) can classify convex polygons by processing these straight lines, e.g. in the form "recognize patterns lying above straight line 1, below straight line 2 and below straight line 3". Thus, we, metaphorically speaking, took an SLP with several output neurons and "attached" another SLP (upper part of fig. 5.10 on the right page). A multi-layer perceptron represents a uni…

…one layer of hidden neurons can approximate arbitrarily precisely a function with a finite number of points of discontinuity as well as their first derivative. Unfortunately, this proof is not constructive, and therefore it is left to us to find the correct number of neurons and weights.

In the following we want to use a widely-spread abbreviated form for different multi-layer perceptrons: a two-stage perceptron with 5 neurons in the input layer, 3 neurons in the hidden layer and 4 neurons in the output layer is a 5-3-4-MLP.

Definition 5.7 (Multi-layer perceptron): Perceptrons with more than one layer of variably weighted connections are referred to as multi-layer perceptrons (MLP). Thereby an n-layer or n-stage perceptron has exactly n variable weight layers and n + 1 neuron layers (the retina is disregarded here), with neuron layer 1 being the input layer.

Since three-stage perceptrons can classify sets of any form by combining and separating arbitrarily many convex polygons, another step will not be advantageous with…
Figure 5.10: We know that an SLP represents a straight line. With 2 trainable weight layers several
straight lines can be combined to form convex polygons (above). By using 3 trainable weight layers
several polygons can be formed into arbitrary sets (below).
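The XOR network of fig. 5.9 can be traced in a few lines: the hidden neuron (threshold 1.5) detects the case "both inputs active" and inhibits the output neuron (threshold 0.5) with weight −2:

```python
def step(net, theta):
    return 1 if net >= theta else 0   # binary threshold function

def xor(x1, x2):
    h = step(x1 + x2, 1.5)                # hidden neuron fires only for (1, 1)
    return step(x1 + x2 - 2 * h, 0.5)     # output neuron, inhibited by h
```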
Figure 5.11: Illustration of the position of our neuron h within the neural network. It is lying in layer H, the preceding layer is K, the next following layer is L.

…as with the Delta rule (equation 5.20). As already indicated, we have to generalize the variable δ for every neuron.

At first: where is the neuron for which we want to calculate a δ? It is obvious to select an arbitrary inner neuron h having a set K of predecessor neurons k as well as a set L of successor neurons l, which are also inner neurons (see fig. 5.11). Thereby it is irrelevant whether the predecessor neurons are already input neurons.

Now we perform the same derivation as for the Delta rule and split functions by means of the chain rule. I will not discuss this derivation in great detail, but the principle is similar to that of the Delta rule:

∂Err/∂wk,h = ∂Err/∂neth · ∂neth/∂wk,h.   (5.23)

The first factor of equation 5.23 is −δh, which we will regard later in this text. The numerator of the second factor of the equation includes the network input, i.e. the weighted sum is included in the numerator, so that we can immediately derive it. All summands of the sum drop out again apart from the summand containing wk,h. This summand is referred to as wk,h · ok. If it is derived, the output of neuron k is left:

∂neth/∂wk,h = ∂ Σ_{k∈K}(wk,h ok) / ∂wk,h   (5.24)
            = ok.   (5.25)

As promised, we will now discuss the −δh of equation 5.23, which is split again by means of the chain rule:

δh = −∂Err/∂neth   (5.26)
   = −∂Err/∂oh · ∂oh/∂neth.   (5.27)

The derivation of the output according to the network input (the second factor in equation 5.27) is certainly equal to the derivation of the activation function according to the network input:

∂oh/∂neth = ∂fact(neth)/∂neth   (5.28)
          = fact′(neth).   (5.29)

Now we analogously derive the first factor in equation 5.27 on the previous page. The reader may well mull over this passage. For this we have to point out that the derivation of the error function according to the output of an inner neuron layer depends on the vector of all network inputs of the next following layer. This is reflected in equation 5.30:

−∂Err/∂oh = −∂Err(netl1, ..., netl|L|)/∂oh.   (5.30)

According to the definition of the multi-dimensional chain rule, equation 5.31 immediately follows:

−∂Err/∂oh = Σ_{l∈L} ( −∂Err/∂netl · ∂netl/∂oh ).   (5.31)

The second factor, the derivation of the network input of a successor neuron l according to oh, reduces to the connecting weight, since all other summands of the weighted sum drop out:

∂netl/∂oh = ∂ Σ_{h∈H}(wh,l oh) / ∂oh   (5.32)
          = wh,l.   (5.33)

The same applies for the first factor according to the definition of our δ:

−∂Err/∂netl = δl.   (5.34)

Now we replace:

⇒ −∂Err/∂oh = Σ_{l∈L} δl wh,l.   (5.35)

You can find a graphic version of the δ generalization including all splittings in fig. 5.12 on the right page.

The reader might already have noticed that some intermediate results were framed. Exactly those intermediate results were framed which are a factor in the change in weight of wk,h. If the above-mentioned equations are combined with the framed intermediate results, the outcome will be the wanted change in weight ∆wk,h:

∆wk,h = η ok δh   with   (5.36)
δh = fact′(neth) · Σ_{l∈L}(δl wh,l).   (5.37)
Figure 5.12: Graphical representation of the equations (by equal signs) and chain rule splittings
(by arrows) in the framework of the backpropagation derivation. The leaves of the tree reflect the
final results from the generalization of δ, which are framed in the derivation.
Unlike the Delta rule, δ is treated differently depending on whether h is an output or an inner (i.e. hidden) neuron:

1. If h is an output neuron, then

δ_{p,h} = f'_act(net_{p,h}) · (t_{p,h} − y_{p,h})   (5.38)

Thus, under our training pattern p, the weight w_{k,h} from k to h is proportionally changed according to

- the learning rate η,
- the output o_{p,k} of the predecessor neuron k,
- the gradient of the activation function at the position of the network input of the successor neuron, f'_act(net_{p,h}), and
- the difference between teaching input t_{p,h} and output y_{p,h} of the successor neuron h.

In this case, backpropagation is working on two neuron layers, the output layer with the successor neuron h and the preceding layer with the predecessor neuron k.

2. If h is an inner, hidden neuron, then

δ_{p,h} = f'_act(net_{p,h}) · Σ_{l∈L} (δ_{p,l} · w_{h,l})   (5.39)

Here I want to explicitly mention that backpropagation is now working on three layers. At this, the neuron k is the predecessor of the connection to be changed with the weight w_{k,h}, the neuron h is the successor of the connection to be changed, and the neurons l lie in the layer following the successor neuron. Thus, according to our training pattern p, the weight w_{k,h} from k to h is proportionally changed according to

- the learning rate η,
- the output o_{p,k} of the predecessor neuron,
- the gradient of the activation function at the position of the network input of the successor neuron, f'_act(net_{p,h}),
- as well as, and this is the difference, the weighted sum of the changes in weight to all neurons following h, Σ_{l∈L} (δ_{p,l} · w_{h,l}).

Definition 5.8 (Backpropagation): If we summarize formulas 5.38 and 5.39, we receive the following total formula for backpropagation (the identifiers p are omitted for reasons of clarity):

∆w_{k,h} = η o_k δ_h   with
δ_h = f'_act(net_h) · (t_h − y_h)              if h is an output neuron,
δ_h = f'_act(net_h) · Σ_{l∈L} (δ_l w_{h,l})    if h is an inner neuron.
(5.40)

It is obvious that backpropagation initially processes the last weight layer directly by means of the teaching input and then works its way from layer to layer, in consideration of each preceding change in weights. Thus, the teaching input leaves traces in all weight layers.
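The two cases of equation 5.40 can be sketched in a few lines of code. This is a minimal illustration, not from the text; the Fermi function and its derivative o · (1 − o) are used here as one example of f_act:

```python
import math

def fermi(x):
    # Logistic (Fermi) activation function
    return 1.0 / (1.0 + math.exp(-x))

def fermi_prime(net):
    # Derivative of the Fermi function, expressed via its own output
    o = fermi(net)
    return o * (1.0 - o)

def delta_output(net_h, t_h, y_h):
    # Upper case of eq. 5.40: h is an output neuron
    return fermi_prime(net_h) * (t_h - y_h)

def delta_hidden(net_h, succ_deltas, succ_weights):
    # Lower case of eq. 5.40: h is an inner neuron; the deltas of the
    # successor neurons l are weighted with the connections w_{h,l}
    return fermi_prime(net_h) * sum(d * w for d, w in zip(succ_deltas, succ_weights))

def weight_change(eta, o_k, delta_h):
    # Eq. 5.36: Delta w_{k,h} = eta * o_k * delta_h
    return eta * o_k * delta_h
```

For net_h = 0 the Fermi derivative is 0.25, so an output neuron with teaching input 1 and output 0.5 receives δ = 0.125; a hidden neuron then weights such successor deltas with the connecting weights.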
90 D. Kriesel – A Brief Introduction to Neural Networks (EPSILON2-EN)
dkriesel.com 5.5 Backpropagation of Error
Remark: Here I am describing the first part of backpropagation (the Delta rule) and the second part (the Delta rule generalized to more layers) in one go, which may meet the requirements of the subject matter but not of the research history. The first part is obvious, as you will shortly see in the framework of a mathematical gimmick. Decades of development time and work lie between the first and the second, recursive part. Like many groundbreaking inventions, it was not until its development that it was recognized how plausible this invention was.

Furthermore, we only want to use linear activation functions, so that f'_act (light-colored) is constant. As is generally known, constants can be combined, and therefore we directly combine the constant derivative f'_act and the learning rate η (also light-colored, and constant for at least one learning cycle) into η. Thus, the result is:

∆w_{k,h} = η o_k δ_h = η o_k · (t_h − o_h)   (5.43)

This exactly corresponds with the Delta rule definition.
Experience shows that good learning rate values are in the range of 0.01 ≤ η ≤ 0.9.

5.5.2.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: In the beginning, a large learning rate learns well, but later it results in inaccurate learning. A smaller learning rate is more time-consuming, but the result is more precise. Thus, the learning rate is to be decreased by one unit once or repeatedly during the learning process.

Remark: A common error (which also seems to be a very neat solution at first glance) is to continually decrease the learning rate. Here it easily happens that the descent of the learning rate is larger than the ascent of a hill of the error function we are scaling. The result is that we simply get stuck at this ascent. Solution: rather reduce the learning rate stepwise as mentioned above.

5.5.2.2 Different layers – Different learning rates

After having discussed the backpropagation of error learning procedure and knowing how to train an existing network, it would be useful to consider how to acquire such a network.

It is possible, as already mentioned, to
mathematically prove that this MLP with one hidden neuron layer is already capable of arbitrarily accurate approximation of arbitrary functions⁵ – but it is necessary not only to discuss the representability of a problem by means of a perceptron but also its learnability. Representability means that a perceptron can principally realize a mapping; learnability means that we are also able to teach it.

In this respect, experience shows that two hidden neuron layers (or three trainable weight layers) can be very useful to solve a problem, since many problems can be represented by one hidden layer but are very difficult to learn with it. Two hidden layers are still a good value because three hidden layers are not needed very often. Moreover, any additional layer generates additional sub-minima of the error function in which we can get stuck. All things considered, a promising way is to try one hidden layer at first and, if that fails, to try two.

5.5.3.2 The number of neurons has to be tested

The number of neurons (apart from the input and output layer; the number of input and output neurons is already defined by the problem statement) principally corresponds to the number of free parameters of the problem to be represented.

Since we have already discussed the network capacity with respect to memorizing or a too imprecise problem representation, it is clear that our goal is to have as few free parameters as possible but as many as necessary.

But we also know that there is no patent formula for the question of how many neurons should be used. Thus, the most useful approach is to initially train with only a few neurons and to repeatedly train new networks with more neurons until the result significantly improves and, particularly, the generalization performance is not affected (bottom-up approach).

5.5.3.3 Selecting an activation function

Another very important parameter for the way of information processing of a neural network is the selection of an activation function. The activation function for input neurons is fixed, since they do not process information.

The first question to be asked is whether we actually want to use the same activation function in the hidden layer and in the output layer – no one prevents us from varying the functions. Generally, however, the activation function is the same for all hidden neurons as well as for all output neurons.

For tasks of function approximation it has been found reasonable to use the hyperbolic tangent (left part of fig. 5.13 on page 95) as activation function of the hidden neurons, while a linear activation function is used in the output. The latter is

⁵ Note: We have not indicated the number of neurons in the hidden layer; we only mentioned the hypothetical possibility.
Figure 5.13: As a reminder, the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter. Thereby the original Fermi function is represented by dark colors; the temperature parameters of the modified Fermi functions are, from outward to inward, 1/2, 1/5, 1/10 and 1/25.
proportion of the previous change to every new change in weight:

(∆_p w_{i,j})_now = η o_{p,i} δ_{p,j} + α · (∆_p w_{i,j})_previous

Of course, this notation is only used for a better understanding. Generally, as already defined by the concept of time, the moment of the current cycle is referred to as (t); then the previous cycle is identified by (t − 1), which is successively continued. And now we come to the formal definition of the momentum term:

Definition 5.10 (Momentum term): The variation of backpropagation by means of the momentum term is defined as follows:

∆w_{i,j}(t) = η o_i δ_j + α · ∆w_{i,j}(t − 1)   (5.44)

Remark: We accelerate on plateaus (which avoids a quasi-standstill there) and slow down on craggy surfaces (which works against oscillations). Moreover, the effect of inertia can be varied via the prefactor α; common values are between 0.6 and 0.9. Additionally, the momentum enables the positive effect that our skier swings back and forth several times in a minimum and finally lands in the minimum. Despite its nice one-dimensional appearance, the otherwise very rare error of leaving good minima unfortunately occurs more frequently because of the momentum term – which means that there is again no easy answer (but we become accustomed to this conclusion).

5.5.4.2 Flat spot elimination

It must be pointed out that with the hyperbolic tangent as well as with the Fermi function the derivative outside of the close proximity of Θ is nearly 0. This results in the fact that it is very difficult to move neurons away from the limits of the activation
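Returning to section 5.5.4.1 for a moment: the momentum update of equation 5.44 can be sketched in a few lines (a minimal illustration, not from the text). On a plateau, where the gradient part η · o_i · δ_j stays constant, the repeated feedback of the previous change lets the step size grow toward the geometric-series limit η · o_i · δ_j / (1 − α): the skier accelerates.

```python
def momentum_step(eta, o_i, delta_j, alpha, prev_change):
    # Eq. 5.44: Delta w_{i,j}(t) = eta*o_i*delta_j + alpha*Delta w_{i,j}(t-1)
    return eta * o_i * delta_j + alpha * prev_change

# Constant gradient part (a plateau): the change grows toward 0.1/(1-0.9) = 1.0
change = 0.0
for _ in range(50):
    change = momentum_step(eta=0.1, o_i=1.0, delta_j=1.0, alpha=0.9,
                           prev_change=change)
```

After 50 such steps the change has almost reached the limit of 1.0, ten times the bare gradient step.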
According to David Parker [Par87], second order backpropagation also uses the second gradient, i.e. the second multi-dimensional derivative of the error function, to obtain more precise estimations of the correct ∆w_{i,j}. Even higher derivatives only rarely improve the estimations.

The weight decay according to Paul Werbos [Wer88] is a modification that extends the error by a term punishing large weights. So the error under weight decay, Err_WD, does not only increase proportionally to the actual error but also proportionally to the square of the weights. As a result, the network keeps the weights small during learning:

Err_WD = Err + β · (1/2) Σ_{w∈W} w²   (5.45)

where the added term is the punishment.

This approach is inspired by nature, where synaptic weights cannot become infinitely strong either. Additionally, due to these small weights, the error function often shows less strong fluctuations, allowing easier and more controlled learning.

The prefactor 1/2 again resulted from simple pragmatics. The factor β controls the strength of the punishment: values from 0.001 to 0.02 are often used here.

5.5.4.6 Pruning / Optimal Brain Damage

If we have executed the weight decay long enough and determine that for a neuron in the input layer all successor weights are 0 or close to 0, we can remove the neuron, having lost one neuron and some weights, and so reduce the chance that the network will memorize. This procedure is called pruning.

Such a method to detect and delete unnecessary weights and neurons is referred to as optimal brain damage [lCDS90]. I only want to describe it briefly: The mean error per output neuron is composed of two competing terms. While one term, as usual, considers the difference between output and teaching input, the other one tries to "press" a weight towards 0. If a weight is strongly needed to minimize the error, the first term will win. If this is not the case, the second term will win. Neurons which only have zero weights can be cut again in the end.

There are many other variations of backprop, and whole books exist only about this subject, but since my aim is to offer an overview of neural networks, I just want to mention the variations above as a motivation to read on.

For some of these extensions it is obvious that they can not only be applied to feedforward networks with backpropagation learning procedures.

We have got to know backpropagation and feedforward topology – now we have to learn how to build a neural network. It is certainly impossible to provide this experience in the framework of this paper, for now it is your turn: You could now try some of the problem examples from section 4.6.

5.6 The 8-3-8 encoding problem and related problems

The 8-3-8 encoding problem is a classic among the multilayer perceptron test training problems. In our MLP we have one input layer with eight neurons i_1, i_2, …, i_8, one output layer with eight neurons Ω_1, Ω_2, …, Ω_8 and one hidden layer with three neurons.
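A minimal sketch of this problem, assuming Fermi activations and plain backpropagation (eqs. 5.38/5.39, without bias neurons, so only a rough illustration): the eight one-hot patterns serve as input and teaching input at once, and the summed squared error shrinks during training.

```python
import math
import random

random.seed(42)

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

N_IN, N_HID, N_OUT = 8, 3, 8
# one weight layer input->hidden, one hidden->output
W1 = [[random.uniform(-1, 1) for _ in range(N_HID)] for _ in range(N_IN)]
W2 = [[random.uniform(-1, 1) for _ in range(N_OUT)] for _ in range(N_HID)]
# the eight one-hot training patterns; teaching input = the pattern itself
patterns = [[1.0 if i == j else 0.0 for j in range(N_IN)] for i in range(N_IN)]

def forward(x):
    h = [fermi(sum(x[i] * W1[i][j] for i in range(N_IN))) for j in range(N_HID)]
    y = [fermi(sum(h[j] * W2[j][k] for j in range(N_HID))) for k in range(N_OUT)]
    return h, y

def total_error():
    # summed squared error over all eight patterns
    return sum(sum((t - o) ** 2 for t, o in zip(p, forward(p)[1]))
               for p in patterns)

def train(epochs=300, eta=0.5):
    for _ in range(epochs):
        for p in patterns:
            h, y = forward(p)
            # output deltas (eq. 5.38) and hidden deltas (eq. 5.39)
            d_out = [y[k] * (1 - y[k]) * (p[k] - y[k]) for k in range(N_OUT)]
            d_hid = [h[j] * (1 - h[j]) *
                     sum(d_out[k] * W2[j][k] for k in range(N_OUT))
                     for j in range(N_HID)]
            # online weight updates (eq. 5.36)
            for j in range(N_HID):
                for k in range(N_OUT):
                    W2[j][k] += eta * h[j] * d_out[k]
            for i in range(N_IN):
                for j in range(N_HID):
                    W1[i][j] += eta * p[i] * d_hid[j]
```

Calling train() and comparing total_error() before and after shows the error clearly decreasing; how well the three hidden neurons encode the eight patterns depends on the random initialization.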
Err = Err_p = (1/2)(t − y)²

converges and, if so, at what value. What does the error curve look like? Let the pattern (p, t) be defined by p = (p_1, p_2) = (0.3, 0.7) and t_Ω = 0.4. Randomly initialize the weights in the interval [−1; 1].

Exercise 11: A one-stage perceptron with two input neurons, a bias neuron and a binary threshold function as activation function divides the two-dimensional space into two regions by means of a straight line g. Analytically calculate a set of weight values for such a perceptron so that the following set
Despite all things in common: What is the difference between RBF networks and perceptrons? The difference lies in the information processing itself and in the computational rules within the neurons lying outside of the input layer. So, in a moment, we will define a so far unknown type of neurons.

Hidden neurons are also called RBF neurons (and the layer in which they are located is referred to as the RBF layer). As propagation function, each hidden neuron receives a norm that calculates the distance between the input into the network and the so-called position of the neuron (its center). This
Chapter 6 Radial Basis Functions dkriesel.com
is entered into a radial activation function, which calculates and outputs the activation of the neuron.

Definition 6.1 (RBF input neuron): Definition and representation are identical to definition 5.1 on page 73 of the input neuron.

Definition 6.2 (Center of an RBF neuron): The center c_h of an RBF neuron h is the point in the input space where the RBF neuron is located. The closer the input vector is to the center vector of an RBF neuron, the higher, generally, is its activation.

Definition 6.3 (RBF neuron): The so-called RBF neurons h have a propagation function f_prop that determines the distance between the center c_h of a neuron and the input vector y. This distance represents the network input. The network input is then sent through a radial basis function f_act, which outputs the activation or the output of the neuron. RBF neurons are represented by the symbol of a Gaussian bell labelled ||c, x||.

Each layer is completely linked with the next following one; shortcuts do not exist (fig. 6.1 on the right page) – it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input. The connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to one output neuron but, analogous to the perceptrons, it is apparent that such a definition can be generalized. A bias neuron is unknown in the RBF network. The set of input neurons shall be represented by I, the set of hidden neurons by H and the set of output neurons by O.

Therefore, the inner neurons are called radial basis neurons, because from their definition it directly follows that all input vectors with the same distance from the center of a neuron also produce the same output value (fig. 6.2 on page 104).
Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three
output neurons. The connections to the hidden neurons are not weighted, they only transmit the
input. Right of the illustration you can find the names of the neurons, which correspond to the
known names of the MLP neurons: Input neurons are called i, hidden neurons are called h and
output neurons are called Ω. The associated sets are referred to as I, H and O.
Actually, Gaussian bells are, related to the whole input space, added here.

Let us assume that we have a second, a third and a fourth RBF neuron and therefore four differently located centers. Each of these neurons now measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. Since these values are finally simply accumulated in the output layer, it is easy to understand that the sum forms a more complex function (fig. 6.4 on the right page). Additionally, the network includes the centers c_1, c_2, …, c_4 of the four inner neurons h_1, h_2, …, h_4, and therefore it has Gaussian bells which are finally added within the output neuron Ω. The network also possesses four values σ_1, σ_2, …, σ_4, which influence the width of the Gaussian bells. However, the height of a Gaussian bell is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.
Figure 6.3: Two individual one- or two-dimensional Gaussian bells. In both cases σ = 0.4 holds, and in both cases the center of the Gaussian bell lies in the point of origin. The distance r to the center (0, 0) is simply calculated from the Pythagorean theorem: r = √(x² + y²).
Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers c_1, c_2, …, c_4 were located at 0, 1, 3, 4, the widths σ_1, σ_2, …, σ_4 at 0.4, 1, 0.2, 0.8. You can see an example for a two-dimensional case in fig. 6.5 on the next page.
Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. Once again r = √(x² + y²) applies for the distance. The heights w, widths σ and centers c = (x, y) are: w_1 = 1, σ_1 = 0.4, c_1 = (0.5, 0.5); w_2 = −1, σ_2 = 0.6, c_2 = (1.15, −1.15); w_3 = 1.5, σ_3 = 0.2, c_3 = (−0.5, −1); w_4 = 0.8, σ_4 = 1.4, c_4 = (−2, 0).
Since we use a norm to calculate the distance between the input vector and the center of a neuron h, we have different choices: Often the Euclidean distance is chosen:

r_h = ||x − c_h||   (6.1)
    = √( Σ_{i∈I} (x_i − c_{h,i})² )   (6.2)

Remember: The input vector was referred to as x. Here, the index i passes through the input neurons and thereby through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squares of the differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this equals the Pythagorean theorem.

Remark: From the definition of a norm it directly follows that the distance can only be positive, which is why we, strictly speaking, use the positive part of the activation function. By the way, activation functions other than the Gaussian bell are possible. Normally, functions that monotonically decrease in the interval [0; ∞] are selected.

Remark (on nomenclature): It is obvious that both the center c_h and the width σ_h can be understood as part of the activation function f_act, and according to this not all activation functions may be referred to as f_act. One solution would be to number the activation functions like f_act1, f_act2, …, f_act|H|, with H being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name f_act for all activation functions and regard σ and c as variables that are defined for individual neurons but not directly included in the activation function.

Remark: The reader will definitely notice that in the literature the Gaussian bell is often provided with a multiplicative factor. Due to the existing multiplication by the following weights and the comparability of constant multiplication, we do not need this factor (especially because for our purpose the integral of the Gaussian bell must not always be 1) and therefore we simply leave it out.

6.2.2 Some analytical thoughts prior to the training
Now that we know the distance r_h between the input vector x and the center c_h of the RBF neuron h, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

f_act(r_h) = e^( −r_h² / (2σ_h²) )   (6.3)

The output y_Ω of an RBF output neuron Ω results from combining the functions of the RBF neurons to

y_Ω = Σ_{h∈H} w_{h,Ω} · f_act(||x − c_h||)   (6.4)

Let us assume that, similar to the multilayer perceptron, we have a training set P that contains |P| training samples (p, t).
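Equations 6.1 to 6.4 can be sketched directly (a minimal illustration, not from the text):

```python
import math

def rbf_output(x, centers, sigmas, weights):
    # y_Omega = sum over hidden neurons h of w_{h,Omega} * f_act(||x - c_h||)
    y = 0.0
    for c, sigma, w in zip(centers, sigmas, weights):
        r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))  # eqs. 6.1/6.2
        act = math.exp(-r ** 2 / (2.0 * sigma ** 2))                # eq. 6.3
        y += w * act                                                # eq. 6.4
    return y
```

At a center the activation is exactly 1, so a lone neuron there contributes its full weight; and inputs with equal distance to a center produce equal outputs, the radial symmetry mentioned above.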
If we have more training examples than RBF neurons, we cannot assume that every training example is exactly hit. So, if we cannot exactly hit the points and therefore cannot only interpolate as in the above-mentioned ideal case with |P| = |H|, we must try to find a function that approximates our training set P as exactly as possible: As with the MLP, we try to reduce the sum of the squared errors to a minimum.

Another reason for the use of the Moore-Penrose pseudoinverse is the fact that it minimizes the squared error (which is our aim): The estimation of the vector G in equation 6.15 corresponds to the Gauss-Markov model known from statistics, which is used to minimize the squared error.

¹ Particularly, M⁺ = M⁻¹ holds if M is invertible. I do not want to further discuss the reasons for these circumstances and applications of M⁺ – they can easily be found in the linear algebra literature.
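The least-squares fit of the hidden-to-output weights can be sketched without an explicit pseudoinverse via the normal equations (MᵀM)G = Mᵀt, which yield the same minimum-squared-error solution as the Moore-Penrose pseudoinverse whenever M has full rank (a minimal sketch under that assumption; the function names are mine, not from the text):

```python
import math

def design_matrix(patterns, centers, sigma):
    # M[p][h]: Gaussian activation of hidden neuron h for training pattern p
    M = []
    for x in patterns:
        row = []
        for c in centers:
            r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            row.append(math.exp(-r2 / (2.0 * sigma ** 2)))  # eq. 6.3
        M.append(row)
    return M

def solve_weights(M, t):
    # Solve (M^T M) G = M^T t by Gaussian elimination with partial pivoting;
    # for full-rank M this is the same G the pseudoinverse would give.
    n = len(M[0])
    A = [[sum(M[p][i] * M[p][j] for p in range(len(M))) for j in range(n)]
         for i in range(n)]
    b = [sum(M[p][i] * t[p] for p in range(len(M))) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for j in range(col, n):
                A[r][j] -= f * A[col][j]
            b[r] -= f * b[col]
    G = [0.0] * n
    for i in range(n - 1, -1, -1):
        G[i] = (b[i] - sum(A[i][j] * G[j] for j in range(i + 1, n))) / A[i][i]
    return G
```

With |P| = |H| and the centers placed on the patterns themselves, the solution interpolates the training set exactly, matching the ideal case described above.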
Figure 6.6: Example for an even coverage of a two-dimensional input space by applying radial basis functions.

6.3.1.1 Fixed selection

In any case, the goal is to cover the input space as evenly as possible. Here, widths of 2/3 of the distance between the centers can be selected, so that the Gaussian bells overlap by approx. "one third"² (fig. 6.6). The closer the bells are set, the more precise but also the more time-consuming the whole thing becomes.

This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage. Here it is useless when the function to be approximated

² It is apparent that a Gaussian bell is mathematically infinitely wide; therefore I ask the reader to excuse this sloppy formulation.

6.3.1.2 Conditional, fixed selection

Let us assume that our training examples are not evenly distributed across the input space. Then it seems obvious to arrange the centers and sigmas of the RBF neurons by means of the pattern distribution. So the training patterns can be analyzed by statistical techniques such as cluster analysis, and it can thus be determined whether there are statistical factors according to which we should distribute the centers and sigmas (fig. 6.7 on the right page).

A more trivial alternative would be to set |H| centers on positions randomly selected from the set of patterns. This method would allow every training pattern p to be directly in the center of a neuron (fig. 6.8 on the right page). This is not yet very elegant, but a good solution when time is of the essence. Generally, for this method the widths are selected fixedly.
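Both selection schemes are easy to sketch (minimal illustrations, not from the text; the one-dimensional grid with the 2/3 width rule follows section 6.3.1.1, the pattern sampling follows section 6.3.1.2):

```python
import random

def even_centers_1d(lo, hi, n):
    # Fixed selection: centers evenly covering [lo, hi]; widths set to
    # 2/3 of the distance between neighboring centers (section 6.3.1.1)
    step = (hi - lo) / (n - 1)
    centers = [lo + i * step for i in range(n)]
    sigmas = [2.0 * step / 3.0] * n
    return centers, sigmas

def centers_from_patterns(patterns, n_hidden, seed=0):
    # Conditional, fixed selection: place the |H| centers on randomly
    # chosen training patterns (section 6.3.1.2)
    return random.Random(seed).sample(patterns, n_hidden)
```

The seeded random generator only serves reproducibility of the sketch; any selection of |H| distinct patterns would do.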
only used to generate some previous knowledge. Therefore we will not discuss them in this chapter but independently in the indicated chapters.

The relevant gradient terms are

∂Err(σ_h, c_h) / ∂σ_h   and   ∂Err(σ_h, c_h) / ∂c_h.

Since the derivation of these terms corresponds to the derivation of backpropagation, we do not want to discuss it here.

But experience shows that no convincing results are obtained by regarding how the error behaves depending on the centers and sigmas. Even if mathematics claims that such methods are promising, the gradient descent, as we already know, leads to problems with very craggy error surfaces.

And that is the crucial point: Naturally, RBF networks generate very craggy error surfaces, because if we considerably change a c or a σ, we will significantly change the appearance of the error function.

6.4 Growing RBF Networks automatically adjust the neuron density

6.4.1 Neurons are added to places with large error values

After generating this initial configuration, the vector of the weights G is analytically calculated. Then all specific errors Err_p concerning the set P of the training examples are calculated, and the maximum specific error

max_P (Err_p)

is sought.

The extension of the network is simple: We replace this maximum error with a new RBF neuron. Of course, we have to exercise care in doing this: If the σ are small, the neurons will only influence one another when the distance between them is short. But if the σ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.
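The insertion criterion can be sketched as follows (a minimal illustration; `predict` stands for whatever output function the current network realizes and is an assumed interface, not from the text):

```python
def worst_pattern(patterns, targets, predict):
    # Find the training example with the maximum specific error Err_p;
    # the new RBF neuron would be centered on that pattern.
    errors = [(t - predict(p)) ** 2 for p, t in zip(patterns, targets)]
    worst = max(range(len(errors)), key=errors.__getitem__)
    return patterns[worst], errors[worst]
```

After centering a new neuron on the returned pattern, the weight vector G would be recalculated analytically as described above.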
So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron. To put it crudely, this adjustment is made by moving the centers c of the other neurons away from the new neuron and reducing their widths σ a bit. Then the current output vector y of the network is compared to the teaching input t, and the weight vector G is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximations.

In some cases, for instance, one single neuron with a higher Gaussian bell would be appropriate. But to develop automated procedures in order to find less relevant neurons is very problem-dependent, and we want to leave this to the programmer.

With RBF networks and multi-layer perceptrons we have already become acquainted with, and extensively discussed, two network paradigms for similar problems. Therefore we want to compare these two paradigms and look at their advantages and disadvantages.
a great problem. Please use any previous knowledge you have when applying them. Such problems do not occur with the MLP.

Output dimension: The advantage of RBF networks is that the training is not much influenced when the output dimension of the network is high. For an MLP, a learning procedure such as backpropagation will thereby be very protracted.

Spread: Here the MLP is "advantaged", since RBF networks are used considerably less often – which is not always understood by professionals (at least as far as low-dimensional input spaces are concerned). The MLPs seem to have a considerably longer tradition, and they work too well for people to take the effort to read some pages of this paper about RBF networks :-).
Generally, recurrent networks are networks capable of influencing themselves by means of recurrences, e.g. by including the network output in the following computation steps. There are many types of recurrent networks of nearly arbitrary form, and nearly all of them are referred to as recurrent neural networks. As a result, for the few paradigms introduced here I use the name recurrent multi-layer perceptrons.

Apparently, such a recurrent network is capable of computing more than the ordinary MLP: If the recurrent weights are set to 0, the recurrent network will be reduced to an ordinary MLP. Additionally, the recurrence generates different network-internal states, so that in the context of the network state different inputs can produce different outputs.

Recurrent networks in themselves have a great dynamic that is mathematically difficult to conceive and has to be discussed extensively. The aim of this chapter is only to briefly discuss how recurrences can be structured and how network-internal states can be generated. Thus, I will briefly introduce two paradigms of recurrent networks and afterwards roughly outline their training.

With a recurrent network, a temporally constant input x may lead to different results: For one thing, the network could converge, i.e. it could transform itself into a fixed state and at any time it will return a fixed output value y; for another thing, it might never converge, or at least not until a long time later, so that we no longer recognize the consequences of a constant change of y.

If the network does not converge, it is, for example, possible to check whether periodic states or attractors (fig. 7.1 on the next page) are returned. Here we can expect the complete variety of dynamical systems. That is the reason why I particularly want to refer to the literature concerning dynamical systems.
117
Chapter 7 Recurrent Perceptron-like Networks (depends on chapter 5) dkriesel.com
Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons
and with the next time step it is entered into the network together with the new input.
with one context neuron per output neuron. The set of context neurons is called K. The context neurons are completely linked toward the input layer of the network.

layer during the next time step (i.e. on the way back, a complete link again). So the complete information processing part¹ of the MLP exists a second time as a "context version" – which once again considerably increases the dynamics and state variety.
Figure 7.3: Illustration of an Elman network. The entire information processing part of the network
exists, in a manner of speaking, twice. The output of each neuron (except for the output of the
input neurons) is buffered and reentered into the associated layer. For the reason of clarity I named
the context neurons on the basis of their models in the actual network, but it is not mandatory to
do so.
neuron layer with exactly the same number of context neurons. Every neuron has a weighted connection to exactly one context neuron, while the context layer is completely linked towards its original layer.

Now it is interesting to take a look at the training of recurrent networks since, for instance, the ordinary backpropagation of error cannot work on recurrent networks. Once again, the style of the following part is more informal, which means that I will not use any formal definitions.

7.3 Training Recurrent Networks

In order to explain the training as descriptively as possible, we have to agree upon some simplifications that do not affect the learning principle itself.

So for the training let us assume that initially the context neurons are initiated with an input, since otherwise they would have an undefined input (this is no simplification but reality).

Furthermore, we use a Jordan network without a hidden neuron layer for our training attempts, so that the output neurons can directly provide input. This approach is a strong simplification because generally more complicated networks are used. But this does not change the learning principle.

7.3.1 Unfolding in Time

Remember our actual learning procedure for MLPs, the backpropagation of error, which backpropagates the delta values. In the case of recurrent networks the delta values would cyclically backpropagate through the network again and again, which makes the training more difficult. On the one hand we cannot know which of the many generated delta values for a weight should be selected for training, i.e. which values are useful. On the other hand we cannot definitely know when learning should be stopped. The advantage of recurrent networks is great state dynamics within the network operation; the disadvantage of recurrent networks is that these dynamics are also granted to the training and therefore make it difficult.

One learning approach would be the attempt to unfold the temporal states of the network (fig. 7.4 on page 123): Recursions are deleted by placing an even network over the context neurons, i.e. the context neurons are, in a manner of speaking, the output neurons of the attached network. More generally spoken, we have to backtrack the recurrences and place "earlier" instances of neurons in the network – thus creating a larger but forward-oriented network without recurrences; the same network is, so to speak, attached to each context layer. This enables training a recurrent network with any training strategy developed for non-recurrent ones. Here, the input is entered as teaching input into every "copy" of the input neurons. This can be done for a discrete number of time steps. These training paradigms are called unfolding in time.
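As a minimal sketch of this idea (my own illustration, not from the original text; the tanh activation, the layer sizes and the toy weights are arbitrary assumptions), a Jordan-style recurrence can be unrolled into an ordinary feedforward pass through one "network copy" per time step:

```python
import math

def step(W, V, x, context):
    """One network copy: each output neuron sums weighted inputs
    plus weighted context values (= the previous copy's output)."""
    n_out = len(W)
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x)))
                      + sum(V[i][j] * context[j] for j in range(n_out)))
            for i in range(n_out)]

def unfolded_forward(W, V, inputs):
    """Unfolding in time: the recurrence disappears because copy t's
    context neurons simply hold the output of copy t-1, so any
    feedforward training strategy could be applied to the unrolled pass."""
    context = [0.0] * len(W)  # context neurons get a defined initial input
    outputs = []
    for x in inputs:
        context = step(W, V, x, context)
        outputs.append(context)
    return outputs

W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # input -> output weights (toy values)
V = [[0.2, 0.0], [-0.1, 0.4]]             # context -> output weights (toy values)
outs = unfolded_forward(W, V, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(len(outs))  # one output vector per unrolled time step
```

Each loop iteration corresponds to one stacked copy of the network in fig. 7.4.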
Disadvantages: The training of such an unfolded network will take a long time, since a large number of layers could possibly be produced. A problem that is no longer negligible is the limited computational accuracy of ordinary computers, which is exhausted very fast because of so many nested computations (the farther we move away from the output layer, the smaller the influence of backpropagation becomes, so that this limit is reached). Furthermore, with several levels of context neurons this procedure could produce very large networks to be trained.

Due to the already long-lasting training time, evolutionary algorithms have proved their value especially with recurrent networks. One reason for this is that they are not only unrestricted with respect to recurrences but also have other advantages when the mutation mechanisms are suitably chosen: so, for example, neurons and weights can be adjusted and the network topology can be optimized (of course the result of learning is not necessarily a Jordan or Elman network). With ordinary MLPs, however, evolutionary strategies are less popular
Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: the recurrent MLP. Bottom: the unfolded network. For reasons of clarity, I only added names to the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs, dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time step of the network, with the most current time step at the bottom.
Another supervised learning example of the wide range of neural networks was developed by John Hopfield: the so-called Hopfield networks [Hop82]. Hopfield and his physically motivated networks have contributed a lot to the renaissance of neural networks.

8.1 Hopfield networks are inspired by particles in a magnetic field

The idea for the Hopfield networks originated from the behavior of particles in a magnetic field: Every particle "communicates" (by means of magnetic forces) with every other particle (complete link), with each particle trying to reach an energetically favorable state (i.e. a minimum of the energy function). As for the neurons, this state is known as activation. Thus, all particles or neurons rotate and thereby encourage each other to continue this rotation. In a manner of speaking, our neural network is a cloud of particles.

Based on the fact that the particles automatically detect the minima of the energy function, Hopfield had the idea to use the "spin" of the particles to process information: Why not let the particles search for minima on self-defined functions? Even if we only use two of those spins, i.e. a binary activation, we will recognize that the developed Hopfield network shows considerable dynamics.

8.2 In a Hopfield network, all neurons influence each other symmetrically

Briefly speaking, a Hopfield network consists of a set K of completely linked neurons.
Chapter 8 Hopfield Networks dkriesel.com
A Hopfield network whose weight matrix has zeros on the diagonal always converges [CG88], i.e. at some point it will stand still. Then the output is a binary string y ∈ {−1, 1}^|K|, namely the state string of the network that has found a minimum.

Additionally, the complete link provides for the fact that we do not know any input, output or hidden neurons. Now let us take a closer look at the contents of the weight matrix and the rules for the state change of the neurons.

Definition 8.3 (Input and output of a Hopfield network): The input of a Hopfield network is a binary string x ∈ {−1, 1}^|K| that initializes the state of the network. After the convergence of the network, the output is the binary string y ∈ {−1, 1}^|K| generated from the new network state.

8.2.2 Significance of Weights

We have already said that the neurons change their states, i.e. their direction, from −1 to 1 or vice versa. These spins occur depending on the current states of the other neurons and the associated weights. Thus, the weights are capable of controlling the complete change of the network. The weights can be positive, negative, or 0. Colloquially speaking, for a weight wi,j between two neurons i and j:

If wi,j is positive, it will try to force the two neurons to become equal – the larger it is, the harder the network will try. If the neuron i is in state 1 and the neuron j is in state −1, a high positive weight will advise the two neurons that it is energetically more favorable to be equal.

If wi,j is negative, its behavior will be analogous, only that i and j are urged to be different. A neuron i in state −1 would then try to urge a neuron j into state 1.

Zero weights see to it that the two involved neurons do not influence each other.

The weights as a whole apparently take the way from the current state of the network towards the next minimum of the energy function. We now want to discuss how the neurons follow this way.

8.2.3 A neuron changes its state according to the influence of the other neurons

The state xk of an individual neuron k changes according to the scheme

xk(t) = fact( Σj∈K wj,k · xj(t − 1) )    (8.1)

in every time step, whereby the function fact generally is the binary threshold function (fig. 8.2 on the next page) with threshold 0. Colloquially speaking: a neuron k calculates the sum of the wj,k · xj(t − 1), which indicates how strongly and in which direction the neuron k is urged by the other neurons j. Thus, the new state of the network (time t) results from the state of the network at the previous time t − 1. Depending on the sign of this sum, the neuron k takes state 1 or −1.

Another difference between the Hopfield networks and the other already known network topologies is the asynchronous update
work towards a certain minimum.

8.3 The weight matrix is generated directly out of the training patterns
This results in the weight matrix W. Colloquially speaking: We initialize the network by means of a training pattern and then process the weights wi,j one after another. For each of these weights we verify: Are the neurons i, j in the same state or do the states vary? In the first case we add 1 to the weight, in the second case we add −1.

This we repeat for each training pattern p ∈ P. Finally, the values of the weights wi,j are high when i and j corresponded with many training patterns. Colloquially speaking, this high value tells the neurons: "Often, it is energetically favorable to hold the same state". The same applies to negative weights.

Formally, the elements of the weight matrix W are defined by single processing of the learning rule

wi,j = Σp∈P pi · pj,

whereby the diagonal of the matrix is covered with zeros.

Due to this training we can store a certain fixed number of patterns p in the weight matrix. At an input x the network will converge to the stored pattern that is closest to the input p. Unfortunately, the number of maximum storable and reconstructible patterns is limited: no more than |P|MAX ≈ 0.139 · |K| training examples can be trained and at the same time maintain their function.

Now we know the functionality of Hopfield networks but nothing about their practical use.

8.4 Autoassociation and Traditional Application

Hopfield networks like those mentioned above are called autoassociators. An autoassociator a exactly shows the above-mentioned behavior: Firstly, when a known pattern p is entered, exactly this known pattern is returned. Thus, a(p) = p.
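As an end-to-end sketch (my own illustration, not from the original text), the learning rule above can be combined with the state-change rule of section 8.2.3 to store one pattern and recall it from a disturbed input; the fixed sweep order and the stopping criterion are assumptions:

```python
def train_hopfield(patterns):
    """Hebbian rule from above: w_i,j = sum over p of p_i * p_j,
    with the diagonal of the matrix covered with zeros."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, x, max_sweeps=10):
    """Repeatedly recompute neurons (asynchronously, in a fixed sweep
    order, which is an assumption) until the state no longer changes."""
    x = list(x)
    for _ in range(max_sweeps):
        changed = False
        for k in range(len(x)):
            s = sum(W[j][k] * x[j] for j in range(len(x)))
            new = 1 if s >= 0 else -1
            if new != x[k]:
                x[k], changed = new, True
        if not changed:
            break
    return x

stored = [1, -1, 1, -1, 1]           # |K| = 5; roughly 0.139*|K| patterns fit
W = train_hopfield([stored])
noisy = [1, 1, 1, -1, 1]             # one neuron flipped
print(recall(W, noisy) == stored)    # True: converges to the stored pattern
```

With a single stored pattern the capacity bound is far from exhausted, so the flipped neuron is reliably corrected.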
there for a while, goes on to the next pattern, and so on.

Which letter in the alphabet follows the letter P?

Another example is the phenomenon that one cannot remember a situation, but the
Exercises
Previously, I want to announce that there are different variations of LVQ, which will be mentioned but not exactly represented. The goal of this chapter is rather to analyze the underlying principle.

The elements of our example are exactly such numbers, because the natural numbers do not include, for example, numbers between 1 and 2. The set of real numbers R, on the other hand, is continuous: it does not matter how close two selected numbers are, there will always be a number between them.
Chapter 9 Learning Vector Quantization dkriesel.com
Figure 9.1: Examples for quantization of a two-dimensional input space. The lines represent the class limits, the × mark the codebook vectors.
that the set C of classes contains |C| codebook vectors C1, C2, . . . , C|C|.

This leads to the structure of the training examples: They are of the form (p, c) and therefore contain the training input vector p and its class affiliation c. For the class affiliation,

c ∈ {1, 2, . . . , |C|}

is true, which means that it clearly assigns the training example to a class or a codebook vector.

Remark: Intuitively, we could say about learning: "Why a learning procedure? We calculate the average of all class members and place their codebook vectors there – and that's it." But we will soon see that our learning procedure can do a lot more.

We only want to briefly discuss the steps of the fundamental LVQ learning procedure:

Initialization: We place our set of codebook vectors on random positions in the input space.

Training example: A training example p of our training set P is selected and presented.

Distance measurement: We measure the distance ||p − Ci|| between all codebook vectors C1, C2, . . . , C|C| and our input p.

Winner: The closest codebook vector wins, i.e. the one with

min_{Ci ∈ C} ||p − Ci||.

Learning process: The learning process takes place by the rule

∆Ci = η(t) · h(p, Ci) · (p − Ci)    (9.1)
Ci(t + 1) = Ci(t) + ∆Ci,    (9.2)

which we now want to break down.

. We have already seen that the first factor η(t) is a time-dependent learning rate allowing us to differentiate between large learning steps and fine tuning.

. The last factor (p − Ci) obviously is the direction toward which the codebook vector is moved.

. But the function h(p, Ci) is the core of the rule: It makes a case differentiation.

Assignment is correct: The winner vector is the codebook vector of the class that includes p. In this case, the function provides positive values and the codebook vector moves towards p.

Assignment is wrong: The winner vector does not represent the class that includes p. Therefore it moves away from p.

We can see that our definition of the function h was not precise enough. With good reason: From here on, the LVQ is divided into different nuances, depending on how exactly h and the learning rate should be defined (called LVQ1, LVQ2, LVQ3, OLVQ, etc.). The differences are, for instance, in the strength of the codebook vector movements. They are not all based on the same principle described here, and as announced I don't want to discuss them any further. Therefore I don't give any formal definition regarding the above-mentioned learning rule and LVQ.
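The fundamental procedure above can be condensed into a single LVQ1-style update step (my own illustration, not from the original text; the simple ±1 case differentiation stands in for the function h, whose exact definition the text deliberately leaves open):

```python
def lvq1_step(codebooks, classes, p, c, eta):
    """One update: the winner (closest codebook vector) moves towards
    the example p if its class matches c, away from p otherwise."""
    i = min(range(len(codebooks)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, codebooks[k])))
    sign = 1 if classes[i] == c else -1   # stands in for h(p, C_i)
    codebooks[i] = [ci + sign * eta * (pi - ci)
                    for ci, pi in zip(codebooks[i], p)]
    return i

C = [[0.0, 0.0], [1.0, 1.0]]  # two codebook vectors in a 2-D input space
cls = [1, 2]                  # their class labels
winner = lvq1_step(C, cls, p=[0.2, 0.1], c=1, eta=0.5)
print(winner, C[0])  # codebook vector 0 wins and moves towards p
```

Only the winner moves; with a wrong class affiliation the sign flips and it would move away from p instead.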
Exercises
Chapter 10
Self-organizing Feature Maps
A paradigm of unsupervised learning neural networks, which maps an input
space by its fixed topology and thus independently looks for similarities.
Function, learning procedure, variations and neural gas.
If you take a look at the concepts of biological neural networks mentioned in the introduction, one question will arise: How does our brain store and recall the impressions it receives every day? Let me point out that the brain does not have any training examples and therefore no "desired output". And while already considering this subject we realize that there is no output in this sense at all, either. Our brain responds to external input by changes in state. These are, so to speak, its output.

Based on this principle and exploring the question of how biological neural networks organize themselves, Teuvo Kohonen developed in the Eighties his self-organizing feature maps [Koh82, Koh98], shortly referred to as self-organizing maps or SOMs – a paradigm of neural networks where the output is the state of the network and which learns completely unsupervised, i.e. without a teacher.

Unlike the other network paradigms we have already got to know, for SOMs it is unnecessary to ask what the neurons calculate. We only ask which neuron is active at the moment. Biologically, this is well motivated: If the neurons are connected to certain muscles, it is less interesting to know how strongly a certain muscle is contracted than which muscle is activated. In other words: We are not interested in the exact output of the neuron but in knowing which neuron provides output. Thus, SOMs are considerably more related to biology than, for example, the feedforward networks, which are increasingly used for calculations.

10.1 Structure of a Self-Organizing Map

Typically, SOMs have – like our brain – the task to map a high-dimensional input (N dimensions) onto areas in a low-dimensional grid of cells (G dimensions), to draw a map of the high-dimensional input space, so to speak. To generate this map, the SOM simply obtains arbitrarily many points of the input space. During the input of the points the SOM will try to cover as well as possible the positions on which the points appear with its neurons. This particularly means that every neuron can be assigned to a certain position in the input space.

At first, these facts seem to be a bit confusing, and it is recommended to briefly reflect about them. There are two spaces in which SOMs are working:

. The N-dimensional input space and

. the G-dimensional grid on which the neurons are lying and which indicates the neighborhood relationships between the neurons and therefore the network topology.

Figure 10.1: Example topologies of a self-organizing map. Above we can see a one-dimensional topology, below a two-dimensional one.

In a one-dimensional grid, the neurons could be, for instance, like pearls on a string. Every neuron would have exactly two neighbors (except for the two end neurons). A two-dimensional grid could be a square array of neurons (fig. 10.1). Another possible array in two-dimensional space would be some kind of honeycomb shape. Irregular topologies are possible, too, but not very often. Topologies with more dimensions and considerably more neighborhood relationships would also be possible, but due to their lack of visualization capability they are not employed very often.

Remark: Even if N = G is true, the two spaces are not equal and have to be distinguished. In this special case they only have the same dimension.

Initially, we will briefly and formally regard the functionality of a self-organizing map and then make it clear by means of some examples.

Definition 10.1 (SOM neuron): Similar to the neurons in an RBF network, a SOM neuron k occupies a position ck (a center) in the input space.

Definition 10.2 (Self-organizing map): A self-organizing map is a set K of SOM neurons. If an input vector is entered, exactly that neuron k ∈ K is activated which is closest to the input pattern in the input space. All other neurons remain inactive. This paradigm of activity is also called the winner-takes-all scheme. The output we expect due to the input of a SOM shows which neuron is activated. The dimension of the input space is referred to as N.

Definition 10.3 (Topology): The neurons are interconnected by neighborhood relationships.
A stimulus p is selected from the input space R^N; now this stimulus is entered into the network.

Distance measurement: Then the distance ||p − ck|| is determined for every neuron k in the network.

Winner takes all: The winner neuron i is determined, which has the smallest distance to p, i.e. which fulfills the condition

||p − ci|| ≤ ||p − ck|| ∀ k ≠ i.

You can see that from several winner neurons one can be selected at will.

Adapting the centers: The neuron centers are moved within the input space according to the rule²

∆ck = η(t) · h(i, k, t) · (p − ck),

where the values ∆ck are simply added to the existing centers. The last factor shows that the change in position of the neurons k is proportional to the distance to the input pattern p and, as usual, to a time-dependent learning rate η(t). The above-mentioned network topology exerts its influence by means of the function h(i, k, t), which will be discussed in the following.

² Note: In many sources this rule is written ηh(p − ck), which wrongly leads the reader to believe that h is a constant. This problem can easily be solved by not omitting the multiplication dots ·.

Definition 10.4 (SOM learning rule): A SOM is trained by presenting an input pattern and determining the associated winner neuron. The winner neuron and its neighbor neurons, which are defined by the topology function, then adapt their centers according to the rule

∆ck = η(t) · h(i, k, t) · (p − ck),    (10.1)
ck(t + 1) = ck(t) + ∆ck(t).    (10.2)

10.3.1 The topology function defines how a learning neuron influences its neighbours

The topology function h is not defined on the input space but on the grid and represents the neighborhood relationships between the neurons, i.e. the topology of the network. It can be time-dependent (which it often is) – which explains the parameter t. The parameter k is the index running through all neurons, and the parameter i is the index of the winner neuron.

In principle, the function shall take a large value if k is a neighbor of the winner neuron or even the winner neuron itself, and small values if not. More precisely: the topology function must be unimodal, i.e. it must have exactly one maximum. This maximum must be next to the winner neuron i, for which the distance to itself certainly is 0.

Additionally, the time-dependence enables us, for example, to reduce the neighborhood in the course of time.
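The steps above can be sketched as one training step (my own illustration, not from the original text; the hard 0/1 neighborhood function is only a placeholder for the topology function h, and all weights and positions are toy values):

```python
def som_step(centers, grid, p, eta, h):
    """One SOM training step: determine the winner i (smallest distance
    to p in the input space), then move every neuron k by
    delta_c_k = eta * h(i, k) * (p - c_k)."""
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    i = min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
    for k in range(len(centers)):
        hk = h(grid, i, k)
        centers[k] = [c + eta * hk * (x - c) for c, x in zip(centers[k], p)]
    return i

# One-dimensional topology: neuron k sits at grid position k, like
# pearls on a string; direct grid neighbors learn, all others do not.
neighbor = lambda grid, i, k: 1.0 if abs(grid[i] - grid[k]) <= 1 else 0.0

centers = [[0.0, 0.0], [0.5, 0.5], [2.0, 2.0]]
winner = som_step(centers, [0, 1, 2], p=[0.1, 0.1], eta=0.5, h=neighbor)
print(winner)  # 0: neuron 0 is closest to p in the input space
```

Note that which neurons move is decided on the grid, not in the input space: neuron 1 learns because it is the winner's grid neighbor, neuron 2 does not.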
Using a Gaussian bell with a monotonically decreasing σ(t), our topology function could look like this:

h(i, k, t) = e^(−||gi − gk||² / (2 · σ(t)²)),    (10.3)

where gi and gk represent the neuron positions on the grid, not the neuron positions in the input space, which would be referred to as ci and ck.

First, let us talk about the learning rate: typical target values of the learning rate are two orders of magnitude smaller than the initial value, so for example

0.01 < η < 0.6

could be true. But this size must also depend on the network topology or the size of the neighborhood.
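A sketch of this topology function (my own illustration; the exponential decay schedule for σ(t) is an arbitrary assumption, while the initial value 10.0 matches the example training in fig. 10.5):

```python
import math

def h_gauss(g, i, k, sigma_t):
    """Gaussian topology function of eq. (10.3); the distance is taken
    between the grid positions g[i] and g[k], not between the centers."""
    d2 = sum((a - b) ** 2 for a, b in zip(g[i], g[k]))
    return math.exp(-d2 / (2.0 * sigma_t ** 2))

def sigma(t, sigma0=10.0, decay=0.999):
    """A monotonically decreasing sigma(t); exponential decay is an
    assumed schedule, the initial value 10.0 follows fig. 10.5."""
    return sigma0 * decay ** t

g = [(0,), (1,), (2,)]  # positions on a one-dimensional grid
print(h_gauss(g, 0, 0, sigma(0)))  # 1.0: the maximum sits at the winner itself
```

The function is unimodal with its maximum at the winner, and shrinking σ(t) narrows the neighborhood over time, exactly as the text requires.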
Other functions that can be used instead of the Gaussian function are, for instance, the cone function, the cylinder function or the Mexican hat function (fig. 10.3 on the facing page). Here, the Mexican hat function offers a particular biological motivation: Due to its negative values it rejects some neurons close to the winner neuron, a behavior that has already been observed in nature. This can cause sharply separated map areas – and that is exactly why the Mexican hat function has been suggested by Teuvo Kohonen himself. But this adjustment characteristic is not necessary for the functionality of the map; it could even be possible that the map would diverge, i.e. it could virtually explode.

As we have already seen, a decreasing neighborhood size can be realized, for example, by means of a time-dependent, monotonically decreasing σ, with the Gaussian bell being used in the topology function.

The advantage of a decreasing neighborhood size is that in the beginning a moving neuron "pulls along" many neurons in its vicinity, i.e. the randomly initialized network can unfold fast and properly in the beginning. At the end of the learning process, only a few neurons are influenced at the same time, which stiffens the network as a whole but enables a good "fine tuning" of the individual neurons.
Figure 10.3: Gaussian bell, cone function, cylinder function and the Mexican hat function suggested by Kohonen as examples for topology functions of a SOM.
Figure 10.4: Illustration of the two-dimensional input space (left) and the one-dimensional topology space (right) of a self-organizing map. Neuron 3 is the winner neuron since it is closest to p. In the topology, the neurons 2 and 4 are the neighbors of 3. The arrows mark the movement of the winner neuron and its neighbors towards the training example p. To illustrate the one-dimensional topology of the network, it is plotted into the input space by the dotted line.
Now let us take a look at the above-mentioned network with random initialization of the centers (fig. 10.4 on the left page) and enter a training example p. Obviously, in our example the input pattern is closest to neuron 3, i.e. this is the winning neuron.

We remember the learning rule for SOMs,

∆ck = η(t) · h(i, k, t) · (p − ck),

and process the three factors from the back:

Learning direction: Remember that the neuron centers ck are vectors in the input space, as well as the pattern p.

Although the center of neuron 7 – seen from the input space – is considerably closer to the input pattern p than neuron 2, neuron 2 is learning and neuron 7 is not. I want to remind the reader that the network topology specifies which neuron is allowed to learn, not its position in the input space. This is exactly the mechanism by which a topology can significantly cover an input space without having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4 the next pattern is applied, and so on. Another example of how such a one-dimensional SOM can develop in a two-dimensional input space with uniformly distributed input patterns in the course of
A remedy for topological defects could be to increase the initial values for the

Figure 10.5: Behavior of a SOM with one-dimensional topology (G = 1) after the input of 0, 100, 300, 500, 5000, 50000, 70000 and 80000 randomly distributed input patterns p ∈ R². During the training, η decreased from 1.0 to 0.1 and the σ parameter of the Gauss function decreased from 10.0 to 0.2.

Figure 10.6: End states of one-dimensional (left column) and two-dimensional (right column) SOMs on different input spaces. 200 neurons were used for the one-dimensional topology, 10 × 10 neurons for the two-dimensional topology, and 80,000 input patterns for all maps.

We have seen that a SOM is trained by entering input patterns of the input space R^N one after another, again and again, so that the SOM will be aligned with these patterns and map them. It could happen that we want a certain subset U of the input space to be mapped more precisely than the other ones.

This problem can easily be solved by means of SOMs: During the training, disproportionally many input patterns of the area U are presented to the SOM. If the number of training patterns of U ⊂ R^N presented to the SOM exceeds the number of those patterns of the remaining R^N \ U, then more neurons will group there while the remaining neurons are sparsely distributed on R^N \ U (fig. 10.8 on the next page).

As you can see in the illustration, the edge of the SOM could be deformed. This can be compensated by assigning to the edge of the input space a slightly higher probability of being hit by training patterns (an often applied approach for "reaching every nook and corner" with the SOMs).

Also, a higher learning rate is often used for edge and corner neurons, since they are only pulled into the center by the topology. This also results in a significantly improved corner coverage.

10.6 Application of SOMs

Regarding the biologically inspired associative data storage, there are many fields of application for self-organizing maps and their variations.

For example, the different phonemes of the Finnish language have successfully been mapped onto a SOM with a two-dimensional discrete grid topology, and thereby neighborhoods have been found (a SOM does nothing else than finding neighborhood relationships). So one tries once more to break down a high-dimensional space into a low-dimensional space (the topology), looks if some structures have developed – et voilà: clearly defined areas for the individual phonemes are formed.

Teuvo Kohonen himself made the effort to search many papers mentioning his SOMs for key words. In this large input space the individual papers now occupy individual positions, depending on the occurrence of key words. Then Kohonen created a SOM with G = 2 and used it to map the high-dimensional "paper space" developed by him.

Thus, it is possible to enter any paper into the completely trained SOM and look which neuron in the SOM is activated. It will be likely to discover that the papers neighbored in the topology are interesting, too. This type of brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighbored, i.e. similar, within the topology – and that is why it is so interesting.

This example shows that the position c of the neurons in the input space is not significant. It is rather interesting to see which
Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. On the left side, the chance to become a training pattern was equal for each coordinate of the input space. On the right side, for the central circle in the input space, this chance is more than ten times larger than for the remaining input space (visible in the larger pattern density in the background). In this circle the neurons are obviously more crowded and the remaining area is covered less densely, but in both cases the neurons are still evenly distributed. The two SOMs were trained by means of 80,000 training examples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5).
neuron is activated when an unknown input pattern is entered. Next, we can look at which of the previous inputs this neuron was also activated – and will immediately discover a group of very similar inputs. The more the inputs within the topology diverge, the less they have in common. Virtually, the topology generates a map of the input characteristics – reduced to descriptively few dimensions in relation to the input dimension.

Therefore, the topology of a SOM often is two-dimensional so that it can be easily visualized, while the input space can be very high-dimensional.

10.6.1 SOMs can be used to determine centers for RBF neurons

SOMs are exactly directed towards the positions of the outgoing inputs. As a result they are used, for example, to select the centers of an RBF network. We have already been introduced to the paradigm of the RBF network in chapter 6.

As we have already seen, it is possible to control which areas of the input space should be covered with higher resolution – or, in connection with RBF networks, which areas of our function the RBF network should work on with more neurons, i.e. work on more exactly. A further useful feature of the combination of RBF networks with SOMs is the topology between the RBF neurons, which is given by the SOM: During the final training of an RBF neuron it can be used to influence neighboring RBF neurons as well.

For this, many neural network simulators offer an additional so-called SOM layer in connection with the simulation of RBF networks.

10.7 Variations of SOMs

There are different variations of SOMs for different representation tasks:

10.7.1 A Neural Gas is a SOM without a static topology

The neural gas is a variation of the self-organizing maps of Thomas Martinetz [MBS93], which has been developed from the difficulty of mapping complex input information that partially only occurs in subspaces of the input space or even changes subspaces (fig. 10.9 on the next page).

The idea of a neural gas is, roughly speaking, to realize a SOM without a grid structure. Due to the fact that they are derived from the SOMs, the learning steps are very similar to the SOM learning steps, but they include an additional intermediate step:

. Again, random initialization of ck ∈ Rn

. Selection and presentation of a pattern of the input space p ∈ Rn
Figure 10.9: A figure filling different subspaces of the actual input space at different positions, which therefore can hardly be filled by a SOM.
In spite of all practical hints, it is, as always, the user's responsibility not to understand this paper as a catalog for easy answers but to explore all advantages and disadvantages himself.

Remark: Unlike a SOM, the neighborhood of a neural gas initially must refer to all neurons, since otherwise some outliers of the random initialization may never approximate the remaining group. Forgetting this is a popular error in the implementation of a neural gas.

With a neural gas it is possible to learn a kind of complex input such as in fig. 10.9 on the left page, since we are not bound to a fixed-dimensional grid. But some computational effort could be necessary for the permanent sorting of the list (here, it could be effective to store the list in an ordered data structure right from the start).

Definition 10.6 (Neural gas): A neural gas differs from a SOM by a completely dynamic neighborhood function. With every learning cycle it is decided anew which neurons are the neighborhood neurons of the winner neuron. Generally, the criterion for this decision is the distance between the neurons and the winner neuron in the input space.

10.7.2 A multi-SOM consists of several separate SOMs

In order to present another variant of the SOMs, I want to formulate an extended problem: What do we do with input patterns from which we know that they separate themselves into different (maybe disjoint) areas?

Here, the idea is to use not only one SOM but several ones: a multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a, GS06]. It is not necessary that the SOMs have the same topology or size; an M-SOM is only a combination of M SOMs.

The learning process is analogous to that of the SOMs. However, only the neurons belonging to the winner SOM of each training step are adapted. Thus, it is easy to represent two disjoint clusters of data by means of two SOMs, even if one of the clusters is not represented in every dimension of the input space RN. Actually, the individual SOMs exactly reflect these clusters.

Definition 10.7 (Multi-SOM): A multi-SOM is nothing more than the simultaneous use of M SOMs.

10.7.3 A multi-neural gas consists of several separate neural gases

Analogous to the multi-SOM, we also have a set of M neural gases: a multi-neural gas [GS06, SG06]. This construct behaves analogously to the neural gas and the M-SOM: again, only the neurons of the winner gas are adapted.

The reader certainly wonders what the advantage of a multi-neural gas is, since an individual neural gas is already capable of dividing into clusters and of working on complex input patterns with changing dimensions. Basically, this is correct, but a multi-neural gas has two serious advantages over a simple neural gas. One of them concerns the computational effort: sorting requires about n log2(n) pattern comparisons. To sort n = 2^16 input patterns,

n log2(n) = 2^16 · log2(2^16) = 1048576 ≈ 1 · 10^6
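The dynamic, distance-based neighborhood of a neural gas (Definition 10.6) might be sketched as follows. This is only an illustration: the function names and the exponential decay over the distance rank are assumptions of this sketch, not the exact formulation of [MBS93].

```python
import numpy as np

def neural_gas_step(centers, p, eta=0.1, lam=2.0):
    """One adaptation step of a neural gas (illustrative sketch).

    centers: (m, n) array of codebook vectors c_k,
    p:       input pattern from R^n.
    The dynamic neighborhood is realized by ranking all neurons by
    their distance to p; this ranking is exactly the list sorting
    whose effort is discussed in the text.
    """
    dists = np.linalg.norm(centers - p, axis=1)
    ranks = np.argsort(np.argsort(dists))      # rank 0 = winner neuron
    h = np.exp(-ranks / lam)                   # rank-based neighborhood strength
    return centers + eta * h[:, None] * (p - centers)

rng = np.random.default_rng(0)
centers = rng.random((5, 2))                   # random initialization of the c_k
for p in rng.random((1000, 2)):                # present input patterns one by one
    centers = neural_gas_step(centers, p)
```

In a multi-neural gas, the same step would be applied only to the neurons of the winner gas, i.e. the gas containing the neuron closest to p.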
Exercises
Chapter 11 Adaptive Resonance Theory

As in the other brief chapters, we want to try to figure out the basic idea of the adaptive resonance theory (abbreviated: ART) without discussing its theory profoundly.

In several sections we have already mentioned that it is difficult to use neural networks for the learning of new information in addition to, but without destroying, the already existing information. This circumstance is called the stability/plasticity dilemma.

In 1987, Stephen Grossberg and Gail Carpenter published the first version of their ART network [Gro76] in order to alleviate this problem. It was followed by a whole family of ART improvements (which we want to discuss briefly, too).

It is the idea of unsupervised learning, the aim of which is the (initially binary) pattern recognition, or more precisely the categorization of patterns into classes. Additionally, an ART network shall be capable of finding new classes.

11.1 Task and Structure of an ART Network

An ART network comprises exactly two layers: the input layer I and the recognition layer O, with the input layer being completely linked towards the recognition layer. This complete link indicates a top-down weight matrix W that contains the weight values of the connections between each neuron in the input layer and each neuron in the recognition layer (fig. 11.1 on the next page).

Simple binary patterns are entered into the input layer and transferred to the recognition layer, while the recognition layer shall return a 1-out-of-|O| encoding, i.e. it should follow the winner-takes-all
Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer, bottom:
the recognition layer. In this illustration the lateral inhibition of the recognition layer and the control
neurons are omitted.
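The 1-out-of-|O| encoding of the recognition layer – in an implementation simply a search for the most activated neuron – might be sketched like this. The weight values and the plain matrix-vector activation are illustrative assumptions of this sketch, not part of a full ART implementation.

```python
import numpy as np

def recognize(W, pattern):
    """Winner-takes-all recognition step of an ART-like network (sketch).

    W:       |O| x |I| weight matrix between input and recognition layer,
    pattern: binary input vector of length |I|.
    Returns a 1-out-of-|O| encoded activity vector: instead of modelling
    lateral inhibition, the most activated neuron is simply searched for
    (the 'IF query' mentioned in the text).
    """
    activation = W @ pattern                 # net inputs of the recognition layer
    output = np.zeros(W.shape[0])
    output[np.argmax(activation)] = 1.0      # winner takes all
    return output

W = np.array([[1.0, 1.0, 0.0, 0.0],          # class 0 prefers the first two inputs
              [0.0, 0.0, 1.0, 1.0]])         # class 1 prefers the last two
print(recognize(W, np.array([1, 1, 0, 0])))  # → [1. 0.]
```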
scheme. For instance, to realize this 1-out-of-|O| encoding the principle of lateral inhibition can be used – or, in the implementation, the most activated neuron can simply be searched for. For practical reasons, an IF query would suit this task best.

Every activity within the input layer causes an activity within the recognition layer, while in turn every activity within the recognition layer causes an activity within the input layer.

The training of the backward weights of the matrix V is a bit tricky: only the backward weights of the respective winner neuron are trained towards the input layer.

11.3 Extensions
Appendix A
Excursus: Cluster Analysis and Regional and Online Learnable Fields
In Grimm's dictionary the extinct German word "Kluster" is described by "was dicht und dick zusammensitzet" (a thick and dense group of something). In static cluster analysis, the formation of groups within point clouds is explored. We introduce some procedures, compare their advantages and disadvantages, and discuss an adaptive clustering method based on neural networks. A regional and online learnable field models, from a point cloud possibly containing a great many points, a comparatively small set of neurons that are representative for the point cloud.
tering procedure that uses a metric as a distance measure.

Now we want to introduce and briefly discuss different clustering procedures.

A.1 k-Means Clustering allocates data to a predefined number of clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is often used because of its low computation and storage complexity and which is regarded as "inexpensive and good". The operation sequence of the k-means clustering algorithm is the following:

1. Provide the data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as codebook vectors).

4. Assign each data point to the nearest codebook vector.

5. Compute the cluster centers for all clusters.

6. Set the codebook vectors to the new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

Step 2 already shows one of the great questions of the k-means algorithm: the number k of cluster centers has to be determined in advance. This cannot be done by the algorithm. The problem is that it is not necessarily known in advance how k can best be determined. Another problem is that the procedure can become quite unstable if the codebook vectors are badly initialized. But since this initialization is random, it is often useful to restart the procedure; this has the advantage of not requiring much computational effort. If you are fully aware of those weaknesses, you will receive quite good results.

However, complex structures such as "clusters in clusters" cannot be recognized. If k is high, the outer ring of the construction in the following illustration will be recognized as many single clusters. If k is low, the ring together with the small inner clusters will be recognized as one cluster.

For an illustration see the upper right part of fig. A.1 on page 172.
book vectors).
group builds a cluster. The advantage is which is the reason for the name epsilon-
that the number of clusters occurs all by it- nearest neighboring. Points are neig-
self. The disadvantage is that a large stor- bors if they are ε apart from each other at
age and computational effort is required to the most. Here, the storage and computa-
find the next neighbor (the distances be- tional effort is obviously very high, which
tween all data points must be computed is a disadvantage.
Clustering
and stored).
But note that there are some special cases:
radii around
clustering
points
There are some special cases in which the Two separate clusters can easily be con-
next
points
procedure combines data points belonging nected due to the unfavorable situation of
to different clusters, if kis too high. (see a single data point. This can also happen
the two small clusters in the upper right with k-nearest neighbouring, but it would
of the illustration). Clusters consisting of be more difficult since in this case the num-
only one single data point are basically ber of neighbors per point is limited.
conncted to another cluster, which is not
always intentional. An advantage is the symmetric nature of
the neighborhood relationships. Another
Furthermore, it is not mandatory that advantage is that the combination of min-
the links between the points are symmet- imal clusters due to a fixed number of
rical. neighbors is avoided.
But this procedure allows a recognition of On the other hand, it is necessary to skill-
rings and therefore of ”clusters in clusters”, fully initialize ε in order to be successful,
which is a clear advantage. Another ad- i.e. smaller than half the smallest distance
vantage is that the procedure adaptively between two clusters. With variable clus-
responds to the distances in and between ter and point distances within clusters this
the clusters. can possibly be a problem.
For an illustration see the lower left part For an illustration see the lower right part
of fig. A.1. of fig. A.1.
Another approach of neighboring: Here, As we can see above, there is no easy an-
the neighborhood detection does not use a swer for clustering problems. Each proce-
fixed number k of neighbors but a radius ε, dure described has very specific disadvan-
Figure A.1: Top left: our set of points. We will use this set to explore the different clustering
methods. Top right: k-means clustering. Using this procedure we chose k = 6. As we can
see, the procedure is not capable to recognize ”clusters in clusters” (bottom left of the illustration).
Long ”Lines” of points are a problem, too: SThey would be recognized as many small clusters (if k
is sufficiently large). Bottom left: k-nearest neighboring. If k is selected too high (higher than
the number of points in the smallest cluster), this will result in cluster combinations shown in the
upper right of the illustration. Bottom right: ε-nearest neighbouring. This procedure will cause
difficulties ε is selected larger than the minimum distance between two clusters (see upper left of
the illustration), which will then be combined.
tages. In this respect it is useful to have Apparently, the whole term s(p) can only
a criterion to decide how good our cluster division is. This possibility is offered by the silhouette coefficient according to [Kau90]. This coefficient measures how well the clusters are delimited from each other and indicates whether points are maybe sorted into the wrong clusters.

Let P be a point cloud and p a point in P. Let c ⊆ P be a cluster within the point cloud and let p be part of this cluster, i.e. p ∈ c. The set of clusters is called C. In summary,

p ∈ c ⊆ P

applies.

To calculate the silhouette coefficient, we initially need the average distance between point p and all its cluster neighbors. This variable is referred to as a(p) and defined as follows:

a(p) = 1/(|c| − 1) · Σ_{q ∈ c, q ≠ p} dist(p, q)   (A.1)

Furthermore, let b(p) be the average distance between our point p and all points of the next cluster (g runs over all clusters except for c):

b(p) = min_{g ∈ C, g ≠ c} 1/|g| · Σ_{q ∈ g} dist(p, q)   (A.2)

The point p is classified well if the distance to the center of its own cluster is minimal and the distance to the centers of the other clusters is maximal. In this case, the following term provides a value close to 1:

s(p) = (b(p) − a(p)) / max{a(p), b(p)}   (A.3)

Apparently, the whole term s(p) can only be within the interval [−1; 1]. A value close to −1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

S(P) = 1/|P| · Σ_{p ∈ P} s(p)   (A.4)

As above, the total quality of the cluster division is expressed by the interval [−1; 1].

As different clustering strategies with different characteristics have now been presented (lots of further material is presented in [DHS01]), as well as a measure to indicate the quality of an existing arrangement of given data into clusters, I want to introduce a clustering method based on an unsupervised learning neural network [SGE05] which was published in 2005. Like all the other methods this one may not be perfect, but it eliminates large standard weaknesses of the known clustering methods.

A.5 Regional and Online Learnable Fields are a neural clustering strategy

The paradigm of neural networks which I want to introduce now are the regional and online learnable fields, shortly referred to as ROLFs.
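Equations (A.1)–(A.4) above translate directly into code; the following sketch uses a plain O(n²) distance matrix and assumes at least two clusters:

```python
import numpy as np

def silhouette(points, labels):
    """Silhouette coefficient S(P) from equations (A.1)-(A.4) (sketch).

    points: (n, d) array of points, labels: cluster index of every point.
    Assumes at least two clusters.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        # a(p): average distance to the other points of the own cluster (A.1).
        a = dist[i, same].sum() / (same.sum() - 1) if same.sum() > 1 else 0.0
        # b(p): smallest average distance to any other cluster (A.2).
        b = min(dist[i, labels == g].mean()
                for g in set(labels.tolist()) if g != labels[i])
        s[i] = (b - a) / max(a, b)               # s(p), equation (A.3)
    return s.mean()                              # S(P), equation (A.4)

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
print(silhouette(points, np.array([0, 0, 1, 1])))  # well separated: close to 1
```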
The perceptive surface of a neuron consists of all points within the radius ρ · σ in the input space.

A.5.2 A ROLF learns unsupervised by presenting training examples online

Like many other paradigms of neural networks, our ROLF network learns by receiving many training examples p of a training set P. The learning is unsupervised. For each training example p entered into the network, two cases can occur:

1. There is one accepting neuron k for p, or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be exactly one accepting neuron insofar as the closest neuron is the accepting one. For the accepting neuron k, ck and σk are adapted.

Definition A.5 (Accepting neuron): The criterion for a ROLF neuron k to be an accepting neuron of a point p is that the point p must be located within the perceptive surface of k. If p is located in the perceptive surfaces of several neurons, then the closest neuron will be the accepting one. If there are several closest neurons, one can be chosen at random.

A.5.2.1 Both positions and radii are adapted throughout learning

Let us assume that we entered a training example p into the network and that there is an accepting neuron k. Then the radius moves towards ||p − ck|| (i.e. towards the distance between p and ck) and the center ck towards p. Additionally, let us define the two learning rates ησ and ηc for radii and centers:

ck(t + 1) = ck(t) + ηc(p − ck(t))
σk(t + 1) = σk(t) + ησ(||p − ck(t)|| − σk(t))

Note that here σk is a scalar while ck is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron): A neuron k accepted by a point p is adapted according to the following rules:

ck(t + 1) = ck(t) + ηc(p − ck(t))   (A.5)
σk(t + 1) = σk(t) + ησ(||p − ck(t)|| − σk(t))   (A.6)

A.5.2.2 The radius multiplier takes care that neurons do not only shrink

Now we can understand the function of the multiplier ρ: due to this multiplier, the perceptive surface of a neuron includes more than only the points surrounding the neuron within the radius σ. This means that, due to the above-mentioned learning rule, σ cannot only decrease but also increase.

Definition A.7 (Radius multiplier): The radius multiplier ρ > 1 is globally defined and expands the perceptive surface of a neuron k to a multiple of σk. So it is ensured that the radius σk cannot only decrease but also increase.

Generally, the radius multiplier is set to values in the lower one-digit range, such as 2 or 3.
So far we have only discussed the case in the ROLF training that there is an accepting neuron for the training example p.

A.5.2.3 As and when required, new neurons are generated

This suggests to discuss the approach for the case that there is no accepting neuron. In this case a new neuron is generated, and for the initialization of its σ different strategies are conceivable, for example:

Maximum σ: We consider the σ of all neurons and select the maximum.

Mean σ: We select the mean σ of all neurons.

Currently, the mean-σ variant is the favorite one, although the learning procedure also works with the other ones. In the minimum-σ variant the neurons tend to cover less surface, in the maximum-σ variant they tend to cover more surface.

Definition A.8 (Generating a ROLF neuron): If a new ROLF neuron k is generated by entering a training example p, then ck is initialized with p and σk according to one of the above-mentioned strategies.

Neurons are connected when their perceptive surfaces overlap (i.e. some kind of nearest neighbouring is executed with the variable perceptive surfaces). A cluster is a group of connected neurons.

The problem of point sets located at some distance from each other is addressed by using variable perceptive surfaces – which is not always the case for the two previously mentioned methods.

The ROLF compares favorably with k-means clustering as well: firstly, it is unnecessary to know the number of clusters in advance and, secondly, unlike k-means clustering it can recognize clusters enclosed by other clusters as separate clusters.

A.5.5 Initializing Radii, Learning Rates and Multiplier is not trivial

Certainly, the disadvantages of the ROLF shall not be concealed: it is not always easy to select the appropriate initial values for σ and ρ. Previous knowledge about the data set can, colloquially speaking, be included in ρ and the initial value of σ of the ROLF: fine-grained data clusters should use a small ρ and a small initial value of σ. But the smaller the ρ, the smaller the chance that the neurons will grow if necessary. Here again, there is no easy answer, just like for the learning rates ηc and ησ.

For ρ, multipliers in the lower one-digit range such as 2 or 3 are very popular. ηc and ησ successfully work with values of about 0.005 to 0.1; variations during run-time are also imaginable for this type of network. Initial values for σ generally depend on the cluster and data distribution (i.e. they often have to be tested). But compared to wrong initializations they are – at least with the mean-σ strategy – relatively robust after some training time.

As a whole, the ROLF is on a par with the other clustering methods and is particularly interesting for systems with low storage capacity or huge data sets.

A.5.6 Application examples

A first application example may be, for example, finding color clusters in RGB images. Another field of application directly described in the ROLF publication is the recognition of words transferred into a 720-dimensional feature space. Thus, we can see that ROLFs are relatively robust against higher dimensions. Further applications can be found in the field of the analysis of attacks on network systems and their classification.

Exercises

Exercise 18: Determine at least four adaptation steps for one single ROLF neuron k if the four patterns stated below are presented one after another in the indicated order. Let the initial values for the ROLF neuron be ck = (0.1, 0.1) and σk = 1. Furthermore, let ηc = 0.5 and ησ = 0. Let ρ = 3.

P = {(0.1, 0.1); (0.9, 0.1); (0.1, 0.9); (0.9, 0.9)}.
Appendix B
Excursus: Neural Networks Used for Prediction
Figure B.2: Representation of the one-step-ahead prediction. We try to calculate the future value from a series of past values. The predicting element (in this case a neural network) is referred to as predictor.
We usually have a lot of past values, so that we can set up a series of equations. Training by means of the delta rule then provides results very close to the analytical solution.
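A linear one-step-ahead predictor trained with the delta rule might look as follows. The window length, learning rate and the toy sine series are assumptions of this sketch:

```python
import numpy as np

def train_predictor(series, window=3, eta=0.05, epochs=200):
    """Linear one-step-ahead predictor trained with the delta rule (sketch).

    The predictor estimates the next value as a weighted sum of the last
    `window` values; each training pair corresponds to one of the
    equations set up from the known past values.
    """
    w = np.zeros(window)
    for _ in range(epochs):
        for t in range(window, len(series)):
            past = series[t - window:t]
            error = series[t] - w @ past     # target minus prediction
            w += eta * error * past          # delta rule update
    return w

series = np.sin(np.arange(40) * 0.3)         # toy time series
w = train_predictor(series)
prediction = w @ series[-3:]                 # one-step-ahead forecast
```

On this toy series the learned weights forecast the next value almost exactly, illustrating the near-analytical quality of the delta-rule solution mentioned above.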
Figure B.3: Representation of the two-step-ahead prediction. Attempt to predict the second future
value out of a past value series by means of a second predictor and the involvement of an already
predicted value.
Figure B.4: Representation of the direct two-step-ahead prediction. Here, the second time step is
predicted directly, the first one is omitted. Technically, it does not differ from a one-step-ahead
prediction.
daily period. We could also include an annual period in the form of the beginning of the holidays (for sure, everyone of us has already spent a lot of time on the highway because he forgot the beginning of the holidays).

B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a single time series out of several time series, if it is assumed that the additional time series is related to the future of the first one (heterogeneous one-step-ahead prediction, fig. B.5 on the next page).

If we want to predict two outputs of two related time series, it is certainly possible to perform two parallel one-step-ahead predictions (analytically this is done very often because otherwise the equations would become very confusing); or, in the case of the neural networks, an additional output neuron is attached and the knowledge of both time series is used for both outputs (fig. B.6 on the next page).

You will find more and more general material on time series in [WG94].

Share prices are discontinuous and therefore they are principally difficult functions. Furthermore, the functions can only be used for discrete values – often, for example, in a daily rhythm (including the maximum and minimum values per day, if we are lucky), with the daily variations certainly being eliminated. But this makes the whole thing even more difficult.

There are chartists, i.e. people who look at many diagrams and decide by means of a lot of background knowledge and decade-long experience whether the equities should be bought or not (and often they are very successful).

Apart from the share prices it is very interesting to predict the exchange rates of currencies: if we exchange 100 Euros into Dollars, the Dollars into Pounds and the Pounds back into Euros, it could be possible that we will finally receive 110 Euros. But once this was found out, we would do it more often and thus we would change the exchange rates into a state in which such an increasing circulation would no longer be possible (otherwise we could produce money by generating, so to speak, a financial perpetual motion machine).
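The currency round trip described above amounts to checking whether the product of the exchange rates along the cycle exceeds 1. The rates below are made-up illustrative numbers, not real market data:

```python
# Hypothetical exchange rates for the EUR -> USD -> GBP -> EUR round trip.
eur_to_usd = 1.10
usd_to_gbp = 0.82
gbp_to_eur = 1.22

amount = 100.0                                   # start with 100 Euros
amount = amount * eur_to_usd * usd_to_gbp * gbp_to_eur
print(round(amount, 2))                          # prints 110.04

# The cycle gains money exactly if the product of the rates exceeds 1.
cycle_gain = eur_to_usd * usd_to_gbp * gbp_to_eur
print(cycle_gain > 1.0)                          # prints True
```

As the text notes, exploiting such a cycle would itself shift the rates until the product drops back to 1 or below.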
Figure B.6: Heterogeneous one-step-ahead prediction of two time series at the same time.
In Great Britain, the heterogeneous one-step-ahead prediction was successfully used to increase the accuracy of such predictions to 76%: in addition to the time series of the values, indicators such as the oil price in Rotterdam or the US national debt were included.

This is just an example to show the dimension of the accuracy of stock-exchange evaluations, since we are still only talking about the first bit of the first derivative! We still do not know how strong the expected increase or decrease will be, and also whether the effort will pay off: probably, one wrong prediction could nullify the profit of one hundred correct predictions.

Time and again some software appears which uses key words such as "neural networks" to purport that it is capable of predicting where share prices are going. Do not buy such software! In addition to the above-mentioned scientific exclusions there is one simple reason for this: if these tools work – why should the manufacturer sell them? Normally, useful economic knowledge is kept secret. If we knew a way to definitely gain wealth by means of shares, we would earn our millions by using this knowledge instead of selling it for 30 euros, wouldn't we?
Appendix C
Excursus: Reinforcement Learning

I now want to introduce a more exotic approach of learning – just to leave the usual paths. We know learning procedures in which the network is exactly told what to do, i.e. we provide exemplary output values. We also know learning procedures like those of the self-organizing maps, into which only input values are entered.

Now we want to explore something in-between: the learning paradigm of reinforcement learning – reinforcement learning according to Sutton and Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the three learning paradigms already mentioned in chapter 4. In some sources it is counted among the supervised learning procedures, since a feedback is given. Due to its very rudimentary feedback it is reasonable to separate it from the supervised learning procedures – apart from the fact that there are no training examples at all.

While it is generally known that procedures such as backpropagation cannot work in the human brain itself, reinforcement learning is usually considered as being biologically more motivated.

The term reinforcement learning comes from cognitive science and psychology and it describes the learning system of carrot and stick, which occurs everywhere in nature, i.e. learning by means of good or bad experience, reward and punishment. But there is no learning aid that exactly explains what we have to do: we only receive a total result for a process (Did we win the game of chess or not? And how sure was this victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires and at a speed of exactly 21.5 km/h through a bend over some sand with a grain size of 0.1 mm on average, then nobody could tell us exactly which handlebar angle we have to adjust or, even worse, how strongly the great number of muscle parts in our arms or legs have to contract for this. Depending on whether we reach the end of the bend unharmed or not, we soon have to face the good or bad learning experience, i.e. a feedback or a reward. Thus, the reward is very simple – but, on the other hand, it is considerably easier to obtain. If we have now tested different velocities and bend angles often enough and received some rewards, we will get a feel for what works and what does not. The aim of reinforcement learning is to maintain exactly this feeling.

Another example of the quasi-impossibility to achieve a sort of cost or utility function is a tennis player who tries to maximize his athletic glory for a long time by means of complex movements and ballistic trajectories in three-dimensional space, including the wind direction, the importance of the tournament, private factors and many more.

To get straight to the point: since we receive only little feedback, reinforcement learning often means trial and error – and therefore it is very slow.

C.1 System Structure

Roughly speaking, reinforcement learning is the interaction between an agent and an environmental system (fig. C.2).

The agent shall solve some problem. He could, for instance, be an autonomous robot that shall avoid obstacles. The agent performs some actions within the environment and in return receives a feedback from the environment, which in the following is called reward. This circle of action and reward is characteristic for reinforcement learning. The agent influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned above, how well we achieve our aim, but it does not give any guidance how we can achieve it. The aim is always to make the sum of rewards as high as possible in the long term.

C.1.1 The gridworld

As a learning example for reinforcement learning I would like to use the so-called gridworld. We will see that its structure is very simple and easy to figure out, and therefore reinforcement learning is actually not necessary here. However, it is very suitable for representing the approach of reinforcement learning.
Now we know that reinforcement learning knowledge about its state. This approx-
is an interaction between the agent and imation (about which the agent cannot
the system including Actions at and sit- even know how good it is) makes clear pre-
uations st . The agent cannot determine dictions impossible.
by itself whether the current situation isDefinition C.5 (Action): Actions at n be
good or bad: This is exactly the reason performed by the agent (whereby it could
Jat
why it receives the said reward from the be possible that depending on the situa-
environment. tion another action space A(S) exists).
They cause state transitions and therefore JA(S)
In the gridworld: States are positions
where the agent can be situated. Sim- a new situation from the agent’s point of
ply said, the situations equal the states view.
in the gridworld. Possible actions would
be to move towards north, south, east or C.1.4 Reward and return
west.
Remark: Situation and action can be vec- As in real life it is our aim to receive a
torial, the reward, however, is always a recompense as high as possible, i.e. to
scalar (in an extreme case even only a bi- maximize the sum of the expected [re-
nary value) since the aim of reinforcement ward]rewards r, called return R, on the
learning is to get along with little feedback. long term. For finitely many time steps1
A complex vectorial reward would equal a the rewards can simply be added:
real teaching input.
Rt = rt+1 + rt+2 + . . . (C.3)
By the way, the cost function should be ∞
minimized, which would not be possible, = (C.4)
X
rt+x
however, with a vectorial reward since we x=1
do not have any intuitive order relations Certainly, the return is only estimated
in multi-dimensional space, i.e. we do not here (if we knew all rewards and therefore
directly know what is better or worse. the return completely, it would no longer
Definition C.3 (State): Within its envi- be necessary to learn).
ronment the agent is in a state. States Definition C.6 (Reward): A reward rt is
contain any information about the agent a scalar, real or discrete (even sometimes Jrt
within the environmental system. Thus, only binary) reward or punishment which
it is theoretically possible to clearly pre- the environmental system returns to the
dict a successor state to a performed ac- agent as reaction to an action.
tion within a deterministic system out of Definition C.7 (Return): The return R
this godlike state knowledge.
t
is the accumulation of all received rewards
Definition C.4 (Situation): Situations st JRt
1 In practice, only finitely many time steps will eb
(hier at time t) of a situation space possible, even though the formulas are stated with
st I
S are the agent’s limited, approximate an infinite sum in the first place
SI
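The components just defined – states, actions a_t and a scalar reward r_t – can be made concrete in a few lines. The following is a minimal sketch of such a gridworld, not taken from the text; the grid size, the goal position and the reward of 1 at the goal are illustrative assumptions.

```python
# Minimal gridworld sketch: states are (row, col) positions, actions move
# the agent north/south/east/west, and the environment returns a scalar
# reward. Grid layout and reward values are illustrative assumptions.
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

class GridWorld:
    def __init__(self, rows=3, cols=4, goal=(0, 3)):
        self.rows, self.cols, self.goal = rows, cols, goal

    def step(self, state, action):
        """Perform action a_t in state s_t; return (successor state, reward)."""
        dr, dc = ACTIONS[action]
        r = min(max(state[0] + dr, 0), self.rows - 1)   # walls limit movement
        c = min(max(state[1] + dc, 0), self.cols - 1)
        nxt = (r, c)
        reward = 1.0 if nxt == self.goal else 0.0       # scalar feedback only
        return nxt, reward

world = GridWorld()
state, reward = world.step((0, 2), "E")   # move east, reaching the goal
```

Note that, as in the text, the agent receives nothing but this one number per step; all learning has to be driven by it.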
with a_i ≠ a_i(s_i); i > 0. Thus, in the beginning the agent develops a plan and consecutively executes it to the end without considering the interim situations (therefore a_i ≠ a_i(s_i): actions after a_0 do not depend on the situations).

In the gridworld: In the gridworld, an open-loop policy would provide a precise sequence of directions towards the exit, such as the way from the given starting position to (in abbreviations of the directions) OOOON.

When selecting the actions to be performed, again two basic strategies can be considered.

In the gridworld: A closed-loop policy would be responsive to the current position and choose the direction according to the situation. In particular, when an obstacle appears dynamically, such a policy is the better choice.

C.1.5.1 Exploitation vs. exploration
Another approach would be to explore shorter ways every now and then, even at the risk of taking a long time and being unsuccessful, and therefore finally taking the original way and arriving too late at the restaurant.

In reality, often a combination of both methods is applied: In the beginning of the learning process, exploration takes place with a higher probability, while at the end more existing knowledge is exploited. Here, a static probability distribution is also possible and often applied.

In the gridworld: For finding the way in the gridworld, the restaurant example applies equally.

C.2 Learning process

Let us again take a look at daily life. Actions can lead us from one situation into different subsituations, from each subsituation into further sub-subsituations. In a sense, we get a situation tree where links between the nodes must be considered (often there are several ways to reach a situation, so the tree could more accurately be referred to as a situation graph). The leaves of such a tree are the end situations of the system. The exploration approach would search the tree as thoroughly as possible and become acquainted with all leaves. The exploitation approach would unerringly go to the best known leaf.

Now we have to adapt from daily life how exactly we learn.

C.2.1 Rewarding strategies

Interesting and very important is the question of what kind of reward is awarded and for what, since the design of the reward significantly controls system behavior. As we have seen above, there generally are (again, as in daily life) various actions that can be performed in any situation. There are different strategies to evaluate the selected situations and to learn which series of actions would lead to the target. First of all, this principle should be explained in the following.

We now want to indicate some extreme cases as design examples for the reward:

A rewarding scheme similar to the rewarding in a chess game is referred to as pure delayed reward: we only receive the reward at the end of the game and not during it. This method is always advantageous when we finally can say whether we were successful or not, but the interim steps do not allow an estimation of our situation. If we win, then

r_t = 0  ∀ t < τ          (C.10)

as well as r_τ = 1. If we lose, then r_τ = −1. With this rewarding strategy a reward is only returned by the leaves of the situation tree.
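The combination described above – exploring with high probability in the beginning and exploiting existing knowledge more and more towards the end – can be sketched as an ε-greedy action selection with a decaying exploration rate. The decay schedule and the table of action values used here are illustrative assumptions, not part of the text.

```python
import random

def epsilon_greedy(q_values, actions, episode,
                   eps_start=1.0, eps_end=0.05, decay=0.99):
    """Pick an action: explore with probability eps (shrinking over the
    episodes), otherwise exploit the best currently known action."""
    eps = max(eps_end, eps_start * decay ** episode)
    if random.random() < eps:
        return random.choice(actions)                    # exploration
    return max(actions, key=lambda a: q_values.get(a, 0.0))  # exploitation

# assumed action values for one situation, for illustration
q = {"N": 0.1, "E": 0.7, "S": 0.0, "W": 0.2}
action = epsilon_greedy(q, ["N", "E", "S", "W"], episode=500)
```

With `decay < 1` this realizes exactly the shift from trial and error towards routine; a fixed ε would correspond to the static probability distribution mentioned in the text.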
evaluate the current and future situations. So let us take a look at another system component of reinforcement learning: the state-value function V(s), which with regard to a policy Π is often called V_Π(s) – because whether a situation is bad often depends on the general behavior Π of the agent.

A situation that is bad under a policy that seeks risks and checks out limits would be, for instance, an agent on a bicycle turning a corner while the front wheel begins to slide out; due to its daredevil policy the agent would not brake in this situation. With a risk-aware policy the same situation would look much better, and thus it would be evaluated higher by a good state-value function.

The state-value function V_Π(s) simply returns the value the current situation s has for the agent under policy Π. Abstractly speaking, according to the above definitions, the value of the state-value function corresponds to the return R_t (the expected value) of a situation s_t. E_Π denotes the set of the expected returns under Π and the current situation s_t:

V_Π(s) = E_Π{R_t | s = s_t}

Definition C.9 (State-value function): The state-value function V_Π(s) has the task of determining the value of situations under a policy, i.e. to answer the agent's question of whether a situation s is good or bad or how good or bad it is. For this purpose it returns the expectation of the return under the situation:

V_Π(s) = E_Π{R_t | s = s_t}          (C.13)

The optimal state-value function is called V*_Π(s).

Unfortunately, unlike us, our robot does not have a godlike view of its environment. It does not have a table with optimal returns like the one shown above to orient itself. The aim of reinforcement learning is that the robot generates its state-value function little by little on the basis of the returns of many trials, and approximates the optimal state-value function V* (if there is one).

In this context I want to introduce two terms closely related to the cycle between state-value function and policy:

C.2.2.1 Policy evaluation

Policy evaluation is the approach of trying a policy a few times, providing many rewards that way, and gradually accumulating a state-value function by means of these rewards.

C.2.2.2 Policy improvement

Policy improvement means to improve a policy itself, i.e. to turn it into a new and better one. In order to improve the policy we have to aim at the return finally having a larger value than before, i.e. until we have found a shorter way to the restaurant and have walked it successfully.

The principle of reinforcement learning is to realize an interaction. It is tried to evaluate how good a policy is in individual situations. The changed state-value function
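Policy evaluation as just described – trying a policy several times and gradually accumulating a state-value function from the observed returns – can be sketched as a simple Monte Carlo average over trials. The episode format (lists of state/reward pairs) is an illustrative assumption.

```python
from collections import defaultdict

def evaluate_policy(episodes):
    """Approximate V_Pi(s) as the average return observed after visiting s.
    Each episode is a list of (state, reward received afterwards) pairs,
    all generated by the same policy Pi."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (state, _) in enumerate(episode):
            ret = sum(rewards[t:])      # return: accumulated rewards from t on
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# two trials of the same policy; the estimate accumulates over both
V = evaluate_policy([[("A", 0.0), ("B", 1.0)],
                     [("A", 0.0), ("B", 0.0)]])
```

The more trials flow into the average, the closer the accumulated V comes to the true state-value function of the policy.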
C.2.6 Q learning

This implies Q_Π(s, a) as learning formula for the action-value function and, analogously to TD learning, its application is called Q learning:

Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a)).          (C.15)

Figure C.9: Actions are performed until the desired target situation is achieved. Attention should be paid to the numbering: rewards are numbered beginning with 1, actions and situations beginning with 0 (this method has been adopted as tried and true). The figure shows the chain s_0 → s_1 → ... → s_τ traversed by the actions a_0, ..., a_{τ−1} (direction of actions), with the rewards r_1, ..., r_τ returned in the opposite direction (direction of reward).

learning is: Π can be initialized arbitrarily, and by means of Q learning the result is always Q*.

Definition C.13 (Q learning): Q learning trains the action-value function by means of the learning rule (C.15).

played backgammon knows that the situation space is huge (approx. 10^20 situations). As a result, the state-value functions cannot be computed explicitly (particularly in the late eighties when TD gammon was introduced). The selected rewarding strategy was the pure delayed reward, i.e. the system receives the reward not before the end of the game, and at the same time the reward is the return. Then the system was allowed to practice itself
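The learning rule (C.15) translates directly into code. Below is a minimal sketch of one Q learning step on a table of Q values; the values chosen for the learning rate α and the discount γ are illustrative assumptions.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One Q learning step, following (C.15):
    Q(s_t, a) <- Q(s_t, a) + alpha * (r_{t+1}
                 + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q

Q = defaultdict(float)                      # Q can be initialized arbitrarily
q_update(Q, state=0, action="E", reward=1.0, next_state=1, actions=["E", "W"])
```

Note the off-policy character visible in the code: the update uses the best action in the successor situation (`max`), regardless of which action the current policy would actually select there.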
the pit. Trivially, the executable actions here are the possibilities to drive forwards and backwards. The intuitive solution we think of immediately is to move backwards, gain momentum at the opposite slope, and oscillate several times in this way to dash out of the pit.

The actions of a reinforcement learning system would be "full throttle forward", "full reverse" and "doing nothing".

Here, "everything costs" would be a good choice for awarding the reward, so that the system learns quickly how to leave the pit and realizes that our problem cannot be solved by means of mere forward-directed engine power. So the system will slowly build up the movement.

The policy can no longer be stored as a table, since the state space is hard to discretize. A function has to be generated as policy.

C.3.3 The pole balancer

The angle of the pole relative to the vertical line is referred to as α. Furthermore, the vehicle always has a fixed position x in our one-dimensional world and a velocity ẋ. Our one-dimensional world is limited, i.e. there are maximum and minimum values x can adopt.

The aim of our system is to learn to steer the car in such a way that it can balance the pole and prevent the pole from tipping over. This is best achieved by an avoidance strategy: as long as the pole is balanced, the reward is 0. If the pole tips over, the reward is −1.

Interestingly, the system is soon capable of keeping the pole balanced by tilting it sufficiently fast and with small movements. At this the system mostly stays in the center of the space, since this is farthest from the walls, which it understands as negative (if it touches the wall, the pole will tip over).
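The avoidance strategy described above – reward 0 as long as the pole is balanced, −1 once it tips over – is easy to state as a reward function. The concrete tipping threshold used here is an assumed value for illustration; the text does not specify one.

```python
import math

def pole_reward(alpha, alpha_max=math.radians(30)):
    """Avoidance-strategy reward for the pole balancer: 0 while the pole
    angle alpha stays within bounds, -1 once it tips over.
    The 30-degree threshold is an assumption for illustration."""
    return 0.0 if abs(alpha) < alpha_max else -1.0

r = pole_reward(math.radians(5))    # pole still balanced, no punishment
```

Since the reward is 0 everywhere except at failure, this is another instance of the pure negative/delayed rewarding schemes discussed earlier: the system only learns what to avoid.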
Exercises