Neural Networks
David Kriesel
www.dkriesel.com
In remembrance of
Dr. Peter Kemp, Notary (ret.), Bonn, Germany.
The above abstract has not yet become a preface but, after all, a little preface, since the extended paper (then 40 pages long) has turned out to be a download hit.

Ambition and Intention of this Manuscript

At the time I promised to continue the paper chapter by chapter, and this is the result. Meanwhile I changed the structure of the document a bit so that it is easier for me to add a chapter or to vary the outline.

The entire text is written and laid out more effectively and with more illustrations than before. I did all the illustrations myself, most of them directly in LaTeX by using XY-pic. They reflect what I would have liked to see when becoming acquainted with the subject: text and illustrations should be memorable and easy to understand, to offer as many people as possible access to the field of neural networks.

Nevertheless, mathematically and formally skilled readers will be able to understand the definitions without reading the running text, while the reverse is true for readers only interested in the subject matter; everything is explained in both colloquial and formal language. Please let me know if you find that I have acted against this principle.

The Sections of This Work are Mostly Independent from Each Other

The document itself is divided into different parts, which are again divided into chapters. Although the chapters contain cross-references, they are also individually accessible to readers with little previous knowledge. There are larger and smaller chapters: while the larger chapters should provide profound insight into a paradigm of neural networks (e.g. the classic neural network structure: the perceptron and its learning procedures), the smaller chapters give a short overview – but this is also explained in the introduction of each chapter. In addition to all the definitions and explanations I have included some excursuses to provide interesting information not directly related to the subject.

Different types of chapters are directly marked within the table of contents. Chapters that are marked as "fundamental" are definitively ones to read, because almost all subsequent chapters heavily depend on them. Other chapters additionally depend on information given in other (preceding) chapters, which is then marked in the table of contents, too.

Terms of Use and License

From the epsilon edition, the text is licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License, except for some small portions of the work licensed under more liberal licenses as mentioned (mainly some figures from Wikimedia Commons). A quick license summary: you may copy and redistribute the work, with attribution, but you may not distribute modified versions of it.

There's no official publisher, so you need to be careful with your citation. Please find more information in English and German on my home page, respectively the subpage concerning the manuscript (http://www.dkriesel.com/en/science/neural_networks).
It's easy to print this manuscript

This paper is completely illustrated in color, but it can also be printed as is in monochrome: the colors of figures, tables and text are well chosen, so that in addition to an appealing design they are still easy to distinguish when printed in monochrome.

There are many tools directly integrated into the text

Different tools are directly integrated in the document to make reading more flexible: but anyone (like me) who prefers reading words on paper rather than on screen can also enjoy some features.

Speaking Headlines throughout the Text, Short Ones in the Table of Contents

The whole manuscript is now pervaded by such headlines. Speaking headlines are not just title-like ("Reinforcement Learning"), but condense the information given in the associated section into a single sentence. In the named instance, an appropriate headline would be "Reinforcement learning methods provide feedback to the network, whether it behaves well or badly". However, such long headlines would bloat the table of contents in an unacceptable way, so I used short titles like the former in the table of contents, and speaking ones like the latter throughout the text.

Marginal notes are a navigational aid

The entire document contains marginal notes in colloquial language (see the example in the margin), allowing you to "skim" the document quickly to find a certain passage in the text (including the titles).

New mathematical symbols are marked by specific marginal notes for easy finding (see the example for x in the margin).

There are several kinds of indexing

This document contains different types of indexing: if you have found a word in the index and opened the corresponding page, you can easily find it by searching for highlighted text – all indexed words are highlighted like this.

Mathematical symbols appearing in several chapters of this document (e.g. Ω for an output neuron; I tried to maintain a consistent nomenclature for regularly recurring elements) are separately indexed under "Mathematical Symbols", so that they can easily be assigned to the corresponding term.

Names of persons written in small caps are indexed in the category "Persons" and ordered by last name.

Acknowledgement

Now I would like to express my gratitude to all the people who contributed, in whatever manner, to the success of this work, since a paper like this needs many helpers. First of all, I want to thank the proofreaders of this paper, who helped me and my readers very much. In alphabetical order: Wolfgang Apolinarski, Kathrin Gräve, Paul Imhoff, Thomas Kühn, Christoph Kunze, Malte Lohmeyer, Joachim Nock, Daniel Plohmann, Daniel Rosenthal, Christian Schulz and Tobias Wilken.

Additionally, I want to thank the readers Dietmar Berger, Igor Buchmüller, Marie Christ, Julia Damaschek, Maximilian Ernestus, Hardy Falk, Anne Feldmeier, Sascha Fink, Andreas Friedmann, Jan Gassen, Markus Gerhards, Sebastian Hirsch, Andreas Hochrath, Nico Höft, Thomas Ihme, Boris Jentsch, Tim Hussein, Thilo Keller, Mario Krenn, Mirko Kunze, Maikel Linke, Adam Maciak, Benjamin Meier, David Möller, Andreas Müller, Rainer Penninger, Matthias Siegmund, Mathias Tirtasana, Oliver Tischler, Maximilian Voit, Igor Wall, Achim Weber, Frank Weinreis, Gideon Maillette de Buij Wenniger, Philipp Woock and many others for their feedback, suggestions and remarks.

Especially, I would like to thank Beate Kuhl for translating the entire paper from German to English, and for her questions, which made me think about rewording some paragraphs for better understandability.

I would particularly like to thank Prof. Rolf Eckmiller and Dr. Nils Goerke, as well as the entire Division of Neuroinformatics, Department of Computer Science, of the University of Bonn – they all made sure that I always learned (and also had to learn) something new about neural networks and related subjects. Especially Dr. Goerke has always been willing to respond to any questions I was not able to answer.
Contents

A Little Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
5 The Perceptron 71
5.1 The Single-layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1.1 Perceptron Learning Algorithm and Convergence Theorem . . . 75
5.2 Delta Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Linear Separability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 The Multi-layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Backpropagation of Error . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.1 Boiling backpropagation down to delta rule . . . . . . . . . . . . 91
5.5.2 Selecting a learning rate . . . . . . . . . . . . . . . . . . . . . . . 91
5.5.3 Initial configuration of a Multi-layer Perceptron . . . . . . . . . . 92
5.5.4 Variations and extensions to backpropagation . . . . . . . . . . . 94
5.6 The 8-3-8 encoding problem and related problems . . . . . . . . . . . . 97
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
A Excursus: Cluster Analysis and Regional and Online Learnable Fields 169
A.1 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.2 k-Nearest Neighbouring . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.3 ε-Nearest Neighbouring . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.4 The Silhouette coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.5 Regional and Online Learnable Fields . . . . . . . . . . . . . . . . . . . 173
A.5.1 Structure of a ROLF . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.5.2 Training a ROLF . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
A.5.3 Evaluating a ROLF . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.5.4 Comparison with Popular Clustering Methods . . . . . . . . . . 177
A.5.5 Initializing Radii, Learning Rates and Multiplier . . . . . . . . . 178
A.5.6 Application examples . . . . . . . . . . . . . . . . . . . . . . . . 178
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Bibliography 207
Index 219
Chapter 1
Introduction, Motivation and History
How do you teach a computer? You can either write a rigid program – or you can
enable the computer to learn on its own. Living beings do not have a programmer
writing a program that develops their skills and only has to be executed. They
learn by themselves – without prior experience or external knowledge – and thus
can solve problems better than any computer today. What qualities are needed to
achieve such behavior for devices like computers? Can such cognition be adapted
from biology? History, development, decline and resurgence of a broad approach
to solving problems.
                                  Brain                 Computer
No. of processing units           ≈ 10^11               ≈ 10^9
Type of processing units          Neurons               Transistors
Type of calculation               massively parallel    usually serial
Data storage                      associative           address-based
Switching time                    ≈ 10^-3 s             ≈ 10^-9 s
Possible switching operations     ≈ 10^13 per second    ≈ 10^18 per second
Actual switching operations       ≈ 10^12 per second    ≈ 10^10 per second

Table 1.1: The (flawed) comparison between brain and computer at a glance. Inspired by: [Zel94]
mum, from which the computer is powers of ten away (Table 1.1). Additionally, a computer is static – the brain as a biological neural network can reorganize itself during its "lifespan" and is therefore able to learn, to compensate for errors, and so forth.

Within this paper I want to outline how we can use these brain characteristics for a computer system.

So the study of artificial neural networks is motivated by their similarity to successfully working biological systems, which – compared to the complete system – consist of very simple but numerous nerve cells that work massively parallel and (which is probably one of the most significant aspects) have the capability to learn.

There is no need to explicitly program a neural network. For instance, it can learn from training examples or by means of encouragement – with a carrot and a stick, so to speak (reinforcement learning).

One result of this learning procedure is the capability of neural networks to generalize and associate data: after successful training, a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. This in turn results in a high degree of fault tolerance against noisy input data.

Fault tolerance is closely related to biological neural networks, in which this characteristic is very distinct: as previously mentioned, a human has about 10^11 neurons that continuously reorganize themselves or are reorganized by external influences (complete drunkenness destroys about 10^5 neurons, and some types of food or environmental influences can also destroy brain cells). Nevertheless, our cognitive abilities are not significantly affected. Thus, the brain is tolerant against internal errors – and also against external errors, for we can often read a really "dreadful scrawl" although the individual letters are nearly impossible to read.

Our modern technology, however, is not automatically fault-tolerant. I have never heard that someone forgot to install the
hard disk controller into the computer and therefore the graphics card automatically took over its tasks, removed conductors and developed communication, so that the system as a whole was affected by the missing component but not completely destroyed.

A disadvantage of this distributed fault-tolerant storage is certainly the fact that we cannot tell at first sight what a neural network knows and performs, or where its faults lie. Usually, it is easier to perform such analyses for conventional algorithms. Most often we can only transfer knowledge into our neural network by means of a learning procedure, which can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated in state-of-the-art technology: let us compare a record and a CD. If there is a scratch on a record, the audio information at this spot will be completely lost (you will hear a pop) and then the music goes on. On a CD the audio data are stored in a distributed way: a scratch causes a blurry sound in its vicinity, but the data stream remains largely unaffected. The listener won't notice anything.

So let us summarize the main characteristics we try to adapt from biology:

- Self-organization and learning capability,
- Generalization capability and
- Fault tolerance.

What types of neural networks particularly develop what kinds of abilities and can be used for what problem classes will be discussed in the course of the paper.

In the introductory chapter I want to clarify the following: "the neural network" does not exist. There are different paradigms for neural networks, for how they are trained and where they are used. It is my aim to introduce some of these paradigms and supplement some remarks for practical application.

We have already mentioned that our brain works massively in parallel, in contrast to the work of a computer, i.e. every component is active at any time. If we want to state an argument for massively parallel processing, then the 100-step rule can be cited.

1.1.1 The 100-step rule

Experiments showed that a human can recognize the picture of a familiar object or person in ≈ 0.1 seconds, which, at a neuron switching time of ≈ 10^-3 seconds, corresponds to ≈ 100 discrete time steps of parallel processing.

A computer following the von Neumann architecture, however, can do practically nothing in 100 time steps of sequential processing, which are 100 assembler steps or cycle steps.

Now we want to look at a simple application example for a neural network.
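The arithmetic behind the 100-step rule above is easy to reproduce; the sketch below simply divides the observed recognition time by the assumed neuron switching time, both values taken from the text:

```python
# Back-of-the-envelope check of the 100-step rule.
recognition_time = 0.1        # seconds to recognize a familiar object
neuron_switching_time = 1e-3  # seconds per neuron switching operation

# Number of sequential processing steps the brain can fit into that time:
steps = round(recognition_time / neuron_switching_time)
print(steps)  # -> 100, hence the "100-step rule"
```

Whatever the brain computes during recognition, it must therefore manage it in about 100 sequential stages, which only massive parallelism can explain.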
f : R^8 → B^1.

Figure 1.3: Initially, we regard the robot control as a black box whose inner life is unknown. The black box receives eight real sensor values and maps these values to a binary output value.

treat the neural network as a kind of black box (Fig. 1.3); this means we do not know its structure but just regard its behavior in practice.

The situations, in the form of simply measured sensor values (e.g. placing the robot in front of an obstacle, see illustration), which we show to the robot and for which we specify whether to drive on or to stop, are called training examples. Thus, a training example consists of an exemplary input and a corresponding desired output. Now the question is how to transfer this knowledge, the information, into the neural network.

The examples can be taught to a neural network by using a simple learning procedure (a learning procedure is a simple algorithm or a mathematical formula). If we have done everything right and chosen good examples, the neural network will generalize from these examples and find a universal rule for when it has to stop.

f : R^8 → R^2,

which gradually controls the two motors by means of the sensor inputs and thus not only can, for example, stop the robot but also lets it avoid obstacles. Here it is more difficult to mentally derive the rules, and de facto a neural network would be more appropriate.

Our aim is not to learn the examples by heart, but to realize the principle behind them: ideally, the robot should apply the neural network in any situation and be able to avoid obstacles. In particular, the robot should query the network continuously and repeatedly while driving in order to continuously avoid obstacles. The result is a constant cycle: the robot queries the network. As a consequence, it will drive in one direction, which changes the sensor values. Again the robot queries the network and changes its position, the sensor values are changed once again, and so on. It is obvious that this system can also be adapted to dynamic, i.e. changing, environments (e.g. the moving obstacles in our example).

2 There is a robot called Khepera with more or less similar characteristics. It is round-shaped, approx. 7 cm in diameter, has two motors with wheels and various sensors. For more information I recommend referring to the internet.
Figure 1.2: The robot is positioned in a landscape that provides sensor values for different situa-
tions. We add the desired output values H and so receive our learning examples. The directions in
which the sensors are oriented are exemplarily applied to two robots.
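The constant query-drive-sense cycle described above can be sketched in a few lines. The network itself is stubbed out here: `trained_network` is a hand-written placeholder for the trained mapping f : R^8 → B^1, not an actual learned model, and the toy environment (a robot approaching a wall) is likewise invented for illustration.

```python
def trained_network(sensor_values):
    """Placeholder for the trained mapping f: R^8 -> B^1.

    This stand-in says "stop" as soon as any of the eight distance
    sensors reports an obstacle closer than 0.2 m; a real network
    would have learned such a rule from training examples instead.
    """
    return any(distance < 0.2 for distance in sensor_values)


def control_cycle(read_sensors, drive, max_steps=1000):
    """The constant cycle from the text: query the network, act,
    which changes the sensor values, then query again, and so on."""
    for _ in range(max_steps):
        sensor_values = read_sensors()      # eight real sensor values
        if trained_network(sensor_values):  # network says: stop
            break
        drive()                             # drive on; this changes the sensors


# Toy environment: the robot approaches a wall in 0.05 m steps.
state = {"distance": 1.0}

def read_sensors():
    return [state["distance"]] * 8          # all sensors face the wall

def drive():
    state["distance"] -= 0.05

control_cycle(read_sensors, drive)          # robot halts near the wall
```

Because the environment is only re-read at the top of each cycle, the same loop works unchanged for dynamic environments such as the moving obstacles mentioned in the text.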
Figure 1.4: Some institutions of the field of neural networks. From left to right: John von Neu-
mann, Donald O. Hebb, Marvin Minsky, Bernard Widrow, Seymour Papert, Teuvo Kohonen, John
Hopfield, ”in the order of appearance” as far as possible.
1949: Donald O. Hebb formulated the classical Hebbian rule [Heb49], which in its more generalized form represents the basis of nearly all neural learning procedures. The rule implies that the connection between two neurons is strengthened when both neurons are active at the same time; this change in strength is proportional to the product of the two activities. Hebb could postulate this rule, but due to the absence of neurological research he was not able to verify it.

1951: For his dissertation Marvin Minsky developed the neurocomputer Snark, which was already capable of adjusting its weights3 automatically. But it was never practically applied, since it was capable of busily calculating, but nobody really knew what it calculated.

1956: Well-known scientists and ambitious students met at the Dartmouth Summer Research Project

3 We will learn soon what weights are.
and discussed, to put it crudely, how to simulate a brain. Differences between top-down and bottom-up research emerged. While the early supporters of artificial intelligence wanted to simulate capabilities by means of software, supporters of neural networks wanted to achieve system behavior by imitating the smallest parts of the system – the neurons.

1957-1958: At the Cornell Aeronautical Laboratory, Frank Rosenblatt, Charles Wightman and their coworkers developed the first successful neurocomputer, the Mark I perceptron, which was capable of recognizing simple digits by means of a 20 × 20 pixel image sensor and worked electromechanically with 512 motor-driven potentiometers, each potentiometer representing one variable weight.

1959: Frank Rosenblatt described different versions of the perceptron, and formulated and verified his perceptron convergence theorem. He described neuron layers mimicking the retina, threshold switches, and a learning rule adjusting the connecting weights.

1960: Bernard Widrow and Marcian E. Hoff introduced the ADALINE (ADAptive LInear NEuron) [WH60], a fast and precise adaptive learning system that was the first widely commercially used neural network: it could be found in nearly every analog telephone for real-time adaptive echo filtering and was trained by means of the Widrow-Hoff rule or delta rule. At that time Hoff, later a co-founder of Intel Corporation and known as an inventor of the modern microprocessor, was a PhD student of Widrow. One advantage the delta rule had over the original perceptron learning algorithm was its adaptivity: if the difference between the actual output and the correct solution was large, the connecting weights also changed in larger steps, the steps becoming smaller as the target was approached. Disadvantage: misapplication led to infinitesimally small steps close to the target. During the following stagnation, and out of fear of the scientific unpopularity of neural networks, ADALINE was renamed the adaptive linear element – which was undone again later on.

1961: Karl Steinbuch introduced technical realizations of associative memory, which can be seen as predecessors of today's neural associative memories [Ste61]. Additionally, he described concepts for neural techniques and analyzed their possibilities and limits.

1965: In his book Learning Machines, Nils Nilsson gave an overview of the progress and works of this period of neural network research. It was assumed that the basic principles of self-learning and therefore, generally speaking, "intelligent" systems had already been discovered. Today this assumption seems to be an exorbitant overestimation, but at that time it
provided for high popularity and sufficient research funds.

1969: Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron [MP69] to show that the perceptron model was not capable of representing many important problems (keywords: XOR problem and linear separability), and so put an end to overestimation, popularity and research funds. The implication that more powerful models would show exactly the same problems, and the forecast that the entire field would be a research dead end, resulted in a nearly complete decline of research funds for the next 15 years – no matter how incorrect these forecasts were from today's point of view.

1972: Teuvo Kohonen introduced a model of the linear associator, a model of an associative memory [Koh72]. In the same year, such a model was presented independently, and from a neurophysiologist's point of view, by James A. Anderson [And72].

1973: Christoph von der Malsburg used a neuron model that was non-linear and biologically more motivated [vdM73].

1974: For his dissertation at Harvard, Paul Werbos developed a learning procedure called backpropagation of error [Wer74], but it was not until one decade later that this procedure reached today's importance.

1976-1980 and thereafter: Stephen
Chapter 2
Biological Neural Networks
is also heavily involved in the human circadian rhythm ("internal clock") and the sensation of pain.

2.1.5 The brainstem connects the brain with the spinal cord and controls reflexes

In comparison with the diencephalon, the brainstem or Truncus cerebri is phylogenetically much older. Roughly speaking, it is the "extended spinal cord" and thus the connection between brain and spinal cord. The brainstem can also be divided into different areas, some of which will be exemplarily introduced in this chapter. The functions will be discussed from abstract functions towards more fundamental ones. One important component is the pons (= bridge), a kind of way station for many nerve signals from brain to body and vice versa.

If the pons is damaged (e.g. by a cerebral infarct), the result could be the locked-in syndrome – a condition in which a patient is "walled in" within his own body. He is conscious and aware with no loss of cognitive function, but cannot move or communicate by any means. Only his senses of sight, hearing, smell and taste generally work perfectly normally. Locked-in patients may often be able to communicate with others by blinking or moving their eyes.

Furthermore, the brainstem is responsible for many fundamental reflexes, such as the blinking reflex or coughing.

All parts of the nervous system have one thing in common: information processing. This is accomplished by means of huge accumulations of billions of very similar cells whose structure is very simple but which communicate continuously. Large groups of these cells send coordinated signals and thus reach the enormous information processing capacity we are familiar with from our brain. We will now leave the level of brain areas and go on to the cellular level of the body – the level of neurons.

2.2 Neurons Are Information Processing Cells

Before specifying the functions and processes within a neuron, we will give a rough description of neuron functions: a neuron is nothing more than a switch with information input and output. The switch is activated if there are enough stimuli from other neurons hitting the information input. Then, at the information output, a pulse is sent to other neurons, for example.

2.2.1 Components of a neuron

Now we want to take a look at the components of a neuron (Fig. 2.3 on the right page). In doing so, we will follow the path the electrical information takes within the neuron. The dendrites of a
Figure 2.3: Illustration of a biological neuron with the components discussed in this text.
neuron receive the information via special connections, the synapses.

2.2.1.1 Synapses weight the individual parts of information

Incoming signals from other neurons or cells are transferred to a neuron by special connections, the synapses. For the most part, such a connection can be found at the dendrites of a neuron, sometimes also directly at the soma. We distinguish between electrical and chemical synapses.

The simpler variant is the electrical synapse. An electrical signal received by the synapse, i.e. coming from the pre-synaptic side, is directly transferred to the postsynaptic nucleus of the cell. Thus, there is a direct, strong, unadjustable connection between the signal transmitter and the signal receiver, which is, for example, relevant to shortening reactions that must be "hard coded" within a living organism.

The chemical synapse is the more distinctive variant. Here, the electrical coupling of source and target does not take place; the coupling is interrupted by the synaptic cleft. This cleft electrically separates the pre-synaptic side from the post-synaptic one. You might think that, nevertheless, the information has to flow, so we will discuss how this happens: it is not an electrical but a chemical process. On the pre-synaptic side of the synaptic cleft the electrical signal is converted into a chemical signal, a process induced by chemical cues released there (the so-called neurotransmitters). These neurotransmitters cross the synaptic cleft and transfer the information into the nucleus of the cell (this is a very simplified explanation, but later on we will see how this exactly works), where it is reconverted into electrical information. The neurotransmitters are degraded very fast, so that very precise information pulses can be released here, too.

In spite of its more complex functioning, the chemical synapse has – compared with the electrical synapse – utmost advantages:

One-way connection: A chemical synapse is a one-way connection. Since there is no direct electrical connection between the pre- and postsynaptic areas, electrical pulses in the postsynaptic area cannot flash over to the pre-synaptic area.

Adjustability: There is a large number of different neurotransmitters that can also be released in various quantities in a synaptic cleft. There are neurotransmitters that stimulate the postsynaptic cell nucleus and others that slow such stimulation down. Some synapses transfer a strongly stimulating signal, some only weakly stimulating ones. The adjustability varies a lot, and one of the central points in the examination of the learning ability of the brain is that here the synapses are variable, too: over time they can form a stronger or weaker connection.

2.2.1.2 Dendrites collect all parts of information

Dendrites ramify like trees from the cell nucleus of the neuron (which is called soma) and receive electrical signals from many different sources, which are then transferred into the nucleus of the cell. The ramifying set of dendrites is also called the dendrite tree.

2.2.1.3 In the soma the weighted information is accumulated

After the cell nucleus (soma) has received plenty of activating (= stimulating) and inhibiting (= diminishing) signals via synapses or dendrites, the soma accumulates these signals. As soon as the accumulated signal exceeds a certain value (called the threshold value), the cell nucleus of the neuron activates an electrical pulse which is then transmitted to the neurons connected to the current one.

2.2.1.4 The axon transfers outgoing pulses

The pulse is transferred to other neurons by means of the axon. The axon is a long, slender extension of the soma. In an extreme case an axon can stretch up to one meter (e.g. within the spinal cord). The axon is electrically isolated in order to achieve a better conduction of the electrical signal (we will return to this point later on) and it leads to dendrites, which transfer the information to, for example, other neurons. So now we are back at the beginning of our description of the neuron elements.

Remark: An axon can, however, also transfer information to other kinds of cells in order to control them.

Looking at the membrane from the inside outwards, we will find certain kinds of ions more often or less often than on the inside. This descent or ascent of concentration is called a concentration gradient.
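The switch-like behavior just described – weighted, stimulating or inhibiting inputs accumulated in the soma, and a pulse emitted once a threshold is exceeded – is exactly what the simplest artificial neuron models copy. A minimal sketch (the weights and threshold below are made-up illustration values, not taken from the text):

```python
def neuron_fires(inputs, weights, threshold):
    """Toy soma: accumulate the weighted input signals and emit a
    pulse exactly when the accumulated signal exceeds the threshold."""
    accumulated = sum(x * w for x, w in zip(inputs, weights))
    return accumulated > threshold

# Three synapses: two stimulating (positive weight), one inhibiting (negative).
weights = [0.8, 0.5, -0.6]

print(neuron_fires([1, 1, 0], weights, threshold=1.0))  # 1.3 > 1.0 -> True
print(neuron_fires([1, 1, 1], weights, threshold=1.0))  # 0.7 > 1.0 -> False
```

Note how the inhibiting synapse in the second call keeps the neuron below its threshold; adjusting these weights over time is precisely the "adjustability" of chemical synapses described above, and the lever that learning procedures later in the book will pull.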
2.2.2 Electrochemical processes in the neuron and its components

After having pursued the path of an electrical signal from the dendrites via the synapses to the nucleus of the cell, and from there via the axon into other dendrites, we now want to take a small step from biology towards technology. In doing so, a simplified introduction to electrochemical information processing will be provided.

2.2.2.1 Neurons maintain electrical membrane potential

Let us first take a look at the membrane potential in the resting state of the neuron, i.e. we assume that no electrical signals are received from the outside. In this case, the membrane potential is −70 mV. Since we have learned that this potential depends on the concentration gradients of various ions, there is of course the central question of how to maintain these concentration gradients: normally, diffusion predominates, and therefore each ion is eager to decrease concentration gradients and to spread out evenly. If this happened, the membrane potential would move towards 0 mV, and finally there would be no membrane potential anymore. Thus, the neuron actively maintains its membrane potential to be able to process information. How does this work?
outside of the neuron, and therefore it slowly diffuses out through the neuron's membrane. But another collection of negative ions, called A−, remains within the neuron since the membrane is not permeable to them. Thus, the inside of the neuron becomes negatively charged. Negative A ions remain, positive K ions disappear, and so the inside of the cell becomes more negative. The result is another gradient.

Electrical gradient: The electrical gradient acts contrary to the concentration gradient. The intracellular charge is now very strong, therefore it attracts positive ions: K+ wants to get back into the cell.

If these two gradients were now left to take care of themselves, they would balance out, reach a steady state, and a membrane potential of −85 mV would develop. But we want to achieve a resting membrane potential of −70 mV, thus there seem to exist some disturbances which prevent this. Furthermore, there is another important ion, Na+ (sodium), to which the membrane is not very permeable but which, however, slowly pours through the membrane into the cell. As a result, the sodium feels all the more driven into the cell: On the one hand, there is less sodium within the neuron than outside the neuron. On the other hand, sodium is positive but the interior of the cell is negative, which is a second reason for the sodium to want to get into the cell.

Due to the low diffusion of sodium into the cell, the intracellular sodium concentration increases. But at the same time the inside of the cell becomes less negative, so that K+ pours in more slowly (we can see that this is a complex mechanism where everything is influenced by everything). The sodium shifts the intracellular equilibrium from negative to less negative, compared with its environment. But even with these two ions a standstill with all gradients being balanced out could still be achieved. Now the last piece of the puzzle gets into the game: a "pump" (or rather, the protein ATPase) actively transports ions against the direction they actually want to take!

Sodium is actively pumped out of the cell, although it tries to get into the cell along the concentration gradient and the electrical gradient.

Potassium, however, diffuses strongly out of the cell, but is actively pumped back into it.

For this reason the pump is also called the sodium-potassium pump. The pump maintains the concentration gradient for the sodium as well as for the potassium, so that some sort of steady-state equilibrium is created and the resting potential is finally −70 mV, as observed. All in all, the membrane potential is maintained by the fact that the membrane is impermeable to some ions and other ions are actively pumped against the concentration and electrical gradients. Now that we know that each neuron has a membrane potential, we want to observe how a neuron receives and transmits signals.
2.2.2.2 The neuron is activated by changes in the membrane potential

Above we have learned that sodium and potassium can diffuse through the membrane - sodium slowly, potassium faster. They move through channels within the membrane, the sodium and potassium channels. In addition to these permanently open channels responsible for diffusion and balanced by the sodium-potassium pump, there also exist channels that are not always open but which only respond "if required". Since the opening of these channels changes the concentration of ions within and outside of the membrane, it also changes the membrane potential.

These controllable channels are opened as soon as the accumulated received stimulus exceeds a certain threshold. For example, stimuli can be received from other neurons or have other causes. There exist, for example, specialized forms of neurons, the sense cells, for which a light incidence could be such a stimulus. If the incoming amount of light exceeds the threshold, controllable channels are opened.

The said threshold (the threshold potential) lies at about −55 mV. As soon as the received stimuli reach this value, the neuron is activated and an electrical signal, an action potential, is initiated. Then this signal is transmitted to the cells connected to the observed neuron, i.e., the cells "listen" to the neuron. Now we want to take a closer look at the different stages of the action potential (Fig. 2.4 on the next page):

Resting state: Only the permanently open sodium and potassium channels are open. The membrane potential is at −70 mV and actively kept there by the neuron.

Stimulus up to the threshold: A stimulus opens channels so that sodium can pour in. The intracellular charge becomes more positive. As soon as the membrane potential exceeds the threshold of −55 mV, the action potential is initiated by the opening of many sodium channels.

Depolarization: Sodium is pouring in. Remember: Sodium wants to pour into the cell because there is a lower intracellular than extracellular concentration of sodium. Additionally, the cell is dominated by a negative environment which attracts the positive sodium ions. This massive influx of sodium drastically increases the membrane potential - up to approx. +30 mV - which is the electrical pulse, i.e., the action potential.

Repolarization: Now the sodium channels are closed and the potassium channels are opened. The positively charged ion wants to leave the positive interior of the cell. Additionally, the intracellular concentration is much higher than the extracellular one, which increases the efflux of ions even more. The interior of the cell is once again more negatively charged than the exterior.
Hyperpolarization: Sodium as well as potassium channels are closed again. At first the membrane potential is slightly more negative than the resting potential. This is due to the fact that the potassium channels close more slowly, so that (positively charged) potassium continues to effuse because of its lower extracellular concentration. After a refractory period of 1−2 ms the resting state is re-established, so that the neuron can react to newly applied stimuli with an action potential. In simple terms, the refractory period is a mandatory break a neuron has to take in order to regenerate. The shorter this break is, the more often a neuron can fire per time.

Then the resulting pulse is transmitted by the axon.

2.2.2.3 In the axon a pulse is conducted in a saltatory way

We have already learned that the axon is used to transmit the action potential across long distances (remember: you will find an illustration of a neuron including an axon in Fig. 2.3 on page 19). The axon is a long, slender extension of the soma. In vertebrates it is normally coated by a myelin sheath that consists of Schwann cells (in the PNS) or oligodendrocytes (in the CNS)¹, which insulate the axon very well from electrical activity. At a distance of 0.1−2 mm there are gaps between these cells, the so-called nodes of Ranvier. These gaps appear where one insulating cell ends and the next one begins. It is obvious that at such a node the axon is less insulated.

Now you may assume that these less insulated nodes are a disadvantage of the axon - they are not. At the nodes, mass can be transferred between the intracellular and extracellular area, a transfer that is impossible at those parts of the axon which are situated between two nodes (internodes) and therefore insulated by the myelin sheath. This mass transfer permits the generation of signals similar to the generation of the action potential within the soma. The action potential is transferred as follows: it does not continuously travel along the axon but jumps from node to node. Thus, a series of depolarizations travels along the nodes of Ranvier. One action potential initiates the next one, and mostly even several nodes are active at the same time here. The pulse "jumping" from node to node is responsible for the name of this type of pulse conduction: saltatory conduction.

Obviously, the pulse will move faster if its jumps are larger. Axons with large internodes (2 mm) achieve a signal dispersion of approx. 180 meters per second.

¹ Schwann cells as well as oligodendrocytes are varieties of the glial cells. There are about 50 times more glial cells than neurons: they surround the neurons (glia = glue), insulate them from each other, provide energy, etc.
However, the internodes cannot grow indefinitely, since the action potential to be transferred would fade too much until it reaches the next node. So the nodes have a task, too: to constantly amplify the signal.

The cells receiving the action potential are attached to the end of the axon – often connected by dendrites and synapses. As already indicated above, action potentials cannot only be generated by information received by the dendrites from other neurons.

2.3.1 There are different receptor cells for various sorts of perceptions

by means of the stimulus-conducting apparatus. The resulting action potential can be processed by other neurons and is then transmitted into the thalamus which, as we have already learned, is a gate to the cerebral cortex and can therefore sort out sensory impressions according to current relevance and thus prevent an abundance of information from having to be managed.

After having outlined how information is received from the environment, it will be interesting to look at how this information is processed.

2.3.2 Information is processed on every level of the nervous system

There is no reason to believe that every piece of information received is transmitted to the brain and processed there, and that the brain ensures that it is "output" in the form of motor pulses (the only thing an organism can actually do within its environment is to move). Information processing is entirely decentralized. In order to illustrate this principle, we want to take a look at some examples, which lead us again from the abstract to the fundamental in our hierarchy of information processing.

- It is certain that information is processed in the cerebrum, which is the most developed natural information processing structure.

- The midbrain and the thalamus, which serves – as we have already learned – as a gate to the cerebral cortex, are much lower down in the hierarchy. The filtering of information with respect to current relevance, as executed by the midbrain, is a very important method of information processing, too. But even the thalamus does not receive any pre-processed stimuli from the outside. Now let us go on with the lowest level, the sensory cells.

- On the lowest level, the receptor cells, information is not only received and transferred but directly processed. One of the main aspects of this subject is to prevent the transmission of "continuous stimuli" to the central nervous system because of sensory adaptation: due to continuous stimulation many receptor cells automatically become insensitive to stimuli. Thus, receptor cells are not a direct mapping of specific stimulus energy onto action potentials but depend on the past. Other sensors change their sensitivity according to the situation: there are taste receptors which respond more or less to the same stimulus according to the nutritional condition of the organism.

- Even before a stimulus reaches the receptor cells, information processing can already be executed by a preceding signal-carrying apparatus, for example in the form of amplification: the external and the internal ear have a specific shape to amplify the sound, which also allows – in association with the sensory cells of the sense of hearing – the sensory stimulus to increase only logarithmically with the intensity of the heard signal. On closer examination this is necessary, since the sound pressure of the signals for which the ear is constructed can vary over a wide exponential range. Here, a logarithmic measurement is an advantage: firstly, an overload is prevented and, secondly, it does not matter much that the intensity measurement of intensive signals is less precise. If a jet fighter is starting next to you, small changes in the noise level can be ignored.

2.3.3 An outline of common light sensing organs

Just to get a feel for sense organs and information processing in the organism, we will briefly describe "usual" light sensing organs, i.e. organs often found in nature. For the third light sensing organ described below, the single lens eye, we will discuss the information processing in the eye.

For many organisms it turned out to be extremely useful to be able to perceive electromagnetic radiation in certain regions of the spectrum. Consequently, sense organs have been developed which can detect such electromagnetic radiation. Therefore, the wavelength range of the radiation perceivable by the human eye is called the visible range or simply light. The different wavelengths of this electromagnetic radiation are perceived by the human eye as different colors. The visible range of the electromagnetic radiation differs for each organism. Some organisms cannot see the colors (= wavelength ranges) we can see, others can even perceive additional wavelength ranges (e.g. in the UV range). Before we begin with the human being – in order to get a broader knowledge of the sense of sight – we briefly want to look at two organs of sight which, from an evolutionary point of view, have existed much longer than the human.

2.3.3.1 Complex eyes and pinhole eyes only provide high temporal or spatial resolution

Let us first take a look at the so-called complex eye (Fig. 2.5 on the right page), also called compound eye, which is, for example, common in insects and crustaceans. The complex eye consists of a great number of small, individual eyes. If we look at the complex eye from the outside, the individual eyes are clearly visible and arranged in a hexagonal pattern. Each individual eye has its own nerve fiber which is connected to the insect brain. Since the individual eyes can be distinguished, it is obvious that the number of pixels, i.e. the spatial resolution, of complex eyes must be very low and the image is blurred. But complex eyes have advantages, too, especially for fast-flying insects. Certain complex eyes process more than 300 images per second (to the human eye, however, movies with 25 images per second appear as a fluent motion).

Pinhole eyes are, for example, found in octopus species and work – as you can guess – similarly to a pinhole camera. A pinhole eye has a very small opening for light entry, which projects a sharp image onto the sensory cells behind. Thus, the spatial resolution is much higher than in the complex eye. But due to the very small open-
end of its legs and much more. Thus, a fly has considerable differential and integral calculus in high dimensions implemented "in the hardware". We all know that a fly is not easy to catch. Of course, the bodily functions are also controlled by neurons, but these should be ignored here.

With 0.8 · 10^6 neurons we have enough cerebral matter to create a honeybee. Honeybees build colonies and have amazing capabilities in the field of aerial reconnaissance and navigation.

4 · 10^6 neurons result in a mouse, and here the world of vertebrates already begins.

1.5 · 10^7 neurons are sufficient for a rat, an animal which is reputed to be extremely intelligent and which is often used to participate in a variety of intelligence tests representative for the animal world. Rats have an extraordinary sense of smell and orientation, and they also show social behavior. The brain of a frog can be positioned within the same dimension. The frog has a complex physique with many functions; it can swim and has evolved complex behavior. A frog can continuously target the said fly by means of its eyes while jumping in three-dimensional space, and catch it with its tongue with reasonable probability.

5 · 10^7 neurons make a bat. The bat can navigate in total darkness through a space, exact to several centimeters, by only using its sense of hearing. It uses acoustic signals to localize self-camouflaging insects (e.g. some moths have a certain wing structure that reflects fewer sound waves, so the echo will be small) and also eats its prey while flying.

1.6 · 10^8 neurons are required by the brain of a dog, companion of man for ages. Now take a look at another popular companion of man:

3 · 10^8 neurons can be found in a cat, which is about twice as many as in a dog. We know that cats are very elegant, patient carnivores that can show a variety of behaviors. By the way, an octopus can be positioned within the same dimension. Only very few people know that, for example, in labyrinth orientation the octopus is vastly superior to the rat.

For 6 · 10^9 neurons you already get a chimpanzee, one of the animals very similar to the human.

10^11 neurons make a human. Usually, the human has considerable cognitive capabilities, is able to speak, to abstract, to remember and to use tools as well as the knowledge of other humans to develop advanced technologies and manifold social structures.

With 2 · 10^11 neurons there are nervous systems having more neurons than the human nervous system. Here we should mention elephants and certain whale species.
Adjustable weights: The weights weighting the inputs are variable, similar to the chemical processes at the synaptic cleft. This adds a great dynamic to the network because a large part of the "knowledge" of a neural network is saved in the weights and in the form and power of the chemical processes in a synaptic cleft.

synapses per neuron. Let us further assume that a single synapse could save 4 bits of information. Naïvely calculated: How much storage capacity does the brain have? Note: the information about which neuron is connected to which other neuron is also important.
Exercises
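A back-of-the-envelope sketch of the storage estimate asked for above, assuming the 10^11 neurons mentioned earlier in the text; the figure of 10^4 synapses per neuron is my own assumption, since the exercise's exact count is cut off here:

```python
# Naive brain storage estimate. The neuron count (1e11) comes from the
# text; 1e4 synapses per neuron and 4 bits per synapse are assumptions
# for illustration. Connectivity information is deliberately ignored,
# exactly the simplification the exercise's note warns about.
neurons = 10**11
synapses_per_neuron = 10**4   # assumption, not from the text
bits_per_synapse = 4

total_bits = neurons * synapses_per_neuron * bits_per_synapse
total_terabytes = total_bits / 8 / 10**12  # bits -> bytes -> terabytes
# total_terabytes == 500.0 under these assumptions
```

Under these assumptions the naive answer is a few hundred terabytes — which, as the note says, still ignores the information contained in which neuron is connected to which other neuron.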
This chapter contains the formal definitions for most of the neural network components used later in the text. After this chapter you will be able to read the individual chapters of this paper without knowing the preceding ones (although this would be useful).

Remark (on definitions): In the following definitions, especially in the function definitions, I indicate the input elements or the target element (e.g. netj or oi) instead of the usual pre-image set or target set (e.g. R or C) – this is necessary and also easier to write and to understand, since the form of these elements can be chosen nearly arbitrarily. But usually these values are numeric.

In some definitions of this paper we use the term time or the number of cycles of the neural network, respectively. The time is divided into discrete time steps:

Definition 3.1 (The concept of time): The current time (present time) is referred to as (t), the next time step as (t + 1), the preceding one as (t − 1). Any other time steps are referred to analogously. If in the following chapters several mathematical variables (e.g. netj or oi) refer to a certain point of time, the notation will be, for example, netj(t − 1) or oi(t).

Remark: From a biological point of view this is, of course, not very plausible (in the human brain a neuron does not wait for another one), but it significantly simplifies the implementation.

A technical neural network consists of simple processing units, the neurons,

[Figure label: Output function (generates the output from the activation; often the identity).]
3.2.4 Neurons get activated if the network input exceeds their threshold value

When centered around the threshold value, the activation function of a neuron reacts particularly sensitively. From the biological point of view, the threshold value represents the threshold at which a neuron starts firing. The threshold value is also widely included in the definition of the activation function, but generally the definition is the following:

Definition 3.6 (General threshold value): Let j be a neuron. The threshold value Θj is explicitly assigned to j and marks the position of the maximum gradient value of the activation function.

The activation function transforms the network input and the previous activation state aj(t − 1) into a new activation state aj(t), with the threshold value Θ playing an important role, as already mentioned.

Remark: Unlike the other variables within the neural network (particularly unlike the ones defined so far), the activation function is often defined globally for all neurons, or at least for a set of neurons, and only the threshold values are different for each neuron. We should also keep in mind that the threshold values can be changed, for example by a learning procedure. So it can become particularly necessary to relate the threshold value to the time and to write, for instance, Θj as Θj(t) (but for reasons of clarity, I omitted this here). The activation function is also called the transfer function.
A common choice is the Fermi function with temperature parameter T:

f(x) = 1 / (1 + e^(−x/T))   (3.6)

[Figure: the Fermi function with temperature parameter, and the hyperbolic tangent tanh(x).]

Once again formally, for functions which are not explicitly defined:

Definition 3.8 (Output function): Let j be a neuron. The output function

fout : aj → oj   (3.7)

then calculates the output value oj of the neuron j from its activation state aj.
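A minimal sketch of the two activation functions plotted here — the Fermi function with temperature parameter T from equation (3.6), and the hyperbolic tangent; the function name and the sample values are my own:

```python
import math

def fermi(x, temperature=1.0):
    """Fermi (logistic) function 1 / (1 + e^(-x/T)) from equation (3.6).

    The temperature T controls the steepness: smaller T makes the
    transition at x = 0 sharper, approaching a binary threshold.
    """
    return 1.0 / (1.0 + math.exp(-x / temperature))

# fermi maps any input into (0, 1); tanh (math.tanh) maps into (-1, 1).
# Both are centered at x = 0, where the neuron reacts most sensitively.
cold = fermi(1.0, temperature=0.1)   # steep: already close to 1
warm = fermi(1.0, temperature=10.0)  # shallow: still close to 0.5
```

For T → 0 the Fermi function approaches the binary threshold behavior discussed above for the threshold value Θ.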
Unless explicitly specified otherwise, we will use the identity as output function within this paper.

To clarify that the connections are between the line neurons and the column neurons, I have inserted the small arrow in the upper-left cell.

[Figure 3.3: A feedforward network with three layers: two input neurons, three hidden neurons and two output neurons. Characteristic for the Hinton diagram of completely linked feedforward networks is the block building above the diagonal line.]

3.3.2 Recurrent networks have influence on themselves

Recurrence is defined as the process of a neuron influencing itself by any means or by any connection. Recurrent networks do not always have explicitly defined input or output neurons. Therefore, in the figures I omitted all markings that concern this matter and only numbered the neurons.

3.3.2.1 Direct recurrences start and end at the same neuron
direct ways forward to influence itself, for example by influencing the neurons of the next layer and the neurons of this next layer influencing j (fig. 3.6).

Definition 3.13 (Indirect recurrence): Again our network is based on a feedforward network, but now with additional connections between neurons and the preceding layer being allowed. Therefore, entries below the diagonal of W can be unequal to 0.

[Figure 3.6: A network similar to a feedforward network with indirectly recurrent neurons. The indirect recurrences are represented by solid lines. As we can see, connections to the preceding layers can exist here, too. The fields that are symmetric to the feedforward blocks in the Hinton diagram are now occupied.]

3.3.2.3 Lateral recurrences

[Lateral recurrences connect neurons within one layer] Connections between neurons within one layer are called lateral recurrences (fig. 3.7 on the next page). Here, each neuron often inhibits the other neurons of the layer and strengthens itself. As a result, only the strongest neuron becomes active (winner-takes-all scheme).

Definition 3.14 (Lateral recurrence): A laterally recurrent network permits connections within one layer.
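The winner-takes-all behavior of such a laterally inhibiting layer can be sketched as an iterative process; the self-excitation and inhibition strengths below are arbitrary illustrative values, not taken from the text:

```python
# Winner-takes-all via lateral inhibition (illustrative sketch).
# Each neuron boosts its own activation and subtracts a fraction of
# every other neuron's activation; after a few rounds only the
# initially strongest neuron keeps a positive activation.

def winner_takes_all(activations, self_excite=1.1, inhibit=0.2, rounds=50):
    a = list(activations)
    for _ in range(rounds):
        total = sum(a)
        # self-excitation minus inhibition from the rest, clamped to [0, 1]
        a = [min(1.0, max(0.0, self_excite * x - inhibit * (total - x)))
             for x in a]
    return a

result = winner_takes_all([0.3, 0.9, 0.5])
# only the neuron that started strongest stays active
```

After a few rounds the neurons that started weaker are driven to zero and only the strongest remains active, as described above.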
[Figure 3.7: A network similar to a feedforward network with laterally recurrent neurons. The direct recurrences are represented by solid lines. Here, recurrences only exist within the layer. In the Hinton diagram, filled squares are concentrated around the diagonal in the height of the feedforward blocks, but the diagonal is left uncovered.]

must be symmetric (fig. 3.8 on the right page). A popular example are the self-organizing maps, which will be introduced in chapter 10.

Definition 3.15 (Complete interconnection): In this case, every neuron is always allowed to be connected to every other neuron – but as a result every neuron can become an input neuron. Therefore, direct recurrences normally cannot be applied here and clearly defined layers no longer exist. Thus, the matrix W may be unequal to 0 everywhere, except along its diagonal.

3.4 The bias neuron is a technical trick to consider threshold values as connection weights

By now we know that in many network paradigms neurons have a threshold value that indicates when a neuron becomes active. Thus, the threshold value is an activation function parameter of a neuron. From the biological point of view this sounds most plausible, but it is complicated to access the activation function at runtime in order to train the threshold value.
Definition 3.16: A bias neuron is a neuron whose output value is always 1 and which is represented by BIAS. It is used to represent neuron biases as connection weights, which enables any weight-training algorithm to train the biases at the same time.

Then the threshold value of the neurons j1, j2, ..., jn is set to 0. Now the threshold values are implemented as connecting weights (fig. 3.9 on page 47) and can directly be trained together with the connecting weights, which considerably facilitates the learning process.

[Figure 3.8: A completely linked network with symmetric connections and without direct recurrences. In the Hinton diagram, only the diagonal is left blank.]

In other words: instead of including the threshold value in the activation function, it is now included in the propagation function. Or even shorter: the threshold value is subtracted from the network input, i.e. it is part of the network input. More formally:
[Figure 3.9: Two equivalent neural networks, one without bias neuron on the left, one with bias neuron on the right. The neuron threshold values can be found in the neurons, the connecting weights at the connections. Furthermore, I omitted the weights of the already existing connections (represented by dotted lines on the right side).]
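The equivalence shown in the figure can also be checked numerically; the weights, inputs and threshold below are invented for illustration:

```python
# Bias-neuron trick (illustrative values): a neuron that compares its
# net input against a threshold theta behaves exactly like a
# threshold-0 neuron that receives one extra input from a BIAS neuron
# with constant output 1 and connection weight -theta.

def net_input(weights, outputs):
    """Weighted sum (propagation function)."""
    return sum(w * o for w, o in zip(weights, outputs))

weights = [0.5, -0.3]   # invented connecting weights
inputs = [1.0, 2.0]     # invented input values
theta = -0.2            # invented threshold value

# variant 1: threshold kept inside the activation function
fires_with_threshold = net_input(weights, inputs) > theta

# variant 2: threshold moved into the propagation function via a
# bias neuron (constant output 1, weight -theta), threshold set to 0
fires_with_bias = net_input(weights + [-theta], inputs + [1.0]) > 0
```

Both variants fire under exactly the same conditions; the only difference is where Θ is accounted for, which is what makes it trainable as an ordinary connection weight.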
Here, the neurons do not change their values simultaneously but at different points of time. For this, there exist different orders, some of which I want to introduce in the following:

With random permutation, each neuron is chosen exactly once, but in random order, during one cycle.

Definition 3.19 (Random permutation): Initially, a permutation of the neurons is computed randomly, and the neurons are then updated in this order.

For all orders, either the previous neuron activations at time t or, if already existing, the neuron activations at time t + 1 (for which we are currently calculating the activations) can be taken as a starting point.

3.7 Communication with the outside world: input and output of data in and from neural networks
Exercises
Chapter 4: How to Train a Neural Network? (fundamental)

As written above, the most interesting characteristic of neural networks is their capability to familiarize themselves with problems by means of training and, after sufficient training, to be able to solve unknown problems of the same class. This approach is referred to as generalization. Before introducing specific learning procedures, I want to propose some basic principles about the learning procedure in this chapter.

Theoretically, a neural network could learn by

1. developing new connections,

2. deleting existing connections,

3. changing connecting weights,

4. changing the threshold values of neurons,

5. varying one or more of the three neuron functions (remember: activation function, propagation function and output function),

4.1 There are different
no longer trained. Moreover, we can develop further connections by setting a non-existing connection (with the value 0 in the connection matrix) to a value different from 0. As for the modification of threshold values, I refer to the possibility of implementing them as weights (section 3.4).

patterns, which we use to train our neural net. Additionally, I will introduce the three essential paradigms of learning by means of the differences between the respective training sets.
should be more effective than unsupervised learning since the network receives specific criteria for problem-solving.

Definition 4.3 (Reinforcement learning): The training set consists of input patterns; after completion of a sequence, a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was.

Remark: This learning procedure is not always biologically plausible, but it is extremely effective and therefore very feasible.

At first we want to look at the supervised learning procedures in general, which - in this paper - correspond to the following steps:
of a whole batch of training examples including the related change in weight values is called an epoch.

Definition 4.5 (Offline learning): Several training patterns are entered into the network at once, the errors are accumulated and the network learns for all patterns at the same time.

Definition 4.6 (Online learning): The network learns directly from the errors of each training example.

- How can the learned patterns be stored in the network?

- Is it possible to avoid that newly learned patterns destroy previously learned associations (the so-called stability/plasticity dilemma)?

Remark: We will see that all these questions cannot be answered generally, but that they have to be discussed for each learning procedure and each network topology individually.
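The difference between offline and online learning (Definitions 4.5 and 4.6) can be sketched for a single weight trained by simple corrections; the target values, the learning rate and the error signal are invented for illustration:

```python
# Offline (batch) vs. online learning, sketched for one weight w.
# Per-pattern error signal: (target - w). The patterns, the learning
# rate eta and the error signal are invented illustrative values.

patterns = [1.0, 2.0, 3.0]   # stand-ins for per-pattern targets
eta = 0.1                    # learning rate

def offline_epoch(w):
    # accumulate the corrections for ALL patterns, then apply once
    total = sum(target - w for target in patterns)
    return w + eta * total

def online_epoch(w):
    # apply the correction immediately after EACH pattern
    for target in patterns:
        w = w + eta * (target - w)
    return w

w_off = offline_epoch(0.0)   # one accumulated update per epoch
w_on = online_epoch(0.0)     # three immediate updates per epoch
```

After one epoch the two strategies generally yield different weights: offline applies the whole batch's correction in one step, while online lets each pattern immediately influence the next correction.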
output. The set of training patterns is called P. It contains a finite number of ordered pairs (p, t) of training patterns with the corresponding desired output.

Remark: Training patterns are often simply called patterns, which is why they are referred to as p. In the literature as well as in this paper they are synonymously called patterns, training examples, etc.

Depending on the type of network being used, the neural network will output an output vector y. Basically, the training example p is nothing more than an input vector. We only use it for training purposes because we know the corresponding teaching input t, which is nothing more than the desired output vector for the training example. The error vector Ep is the difference between the teaching input t and the actual output y.

Definition 4.8 (Teaching input): Let j be an output neuron. The teaching input tj is the desired, correct value j should output after the input of a certain training pattern. Analogous to the vector p, the teaching inputs t1, t2, ..., tn of the neurons can also be combined into a vector t. t always refers to a specific training pattern p and is, as already mentioned, contained in the set P of the training patterns.

Definition 4.9 (Error vector): For several output neurons Ω1, Ω2, ..., Ωn, the difference between the output vector and the teaching input under a training input p is

Ep = (t1 − y1, ..., tn − yn)

So, what x and y are for the general network operation, p and t are for the network training - and during training we try to bring y and t as close together as possible.

Note on notation: We referred to the output values of a neuron i as oi. Thus, the output of an output neuron Ω is called oΩ. But the output values of a network are referred to as yΩ.
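Definition 4.9 translates directly into code; the vectors are invented example values:

```python
# Error vector Ep = t - y, computed componentwise over the output
# neurons. Teaching input t and network output y are invented examples.
t = [1.0, 0.0, 0.5]   # desired outputs (teaching input)
y = [0.8, 0.1, 0.5]   # actual network outputs

Ep = [t_j - y_j for t_j, y_j in zip(t, y)]
# a zero component means that output neuron already produces
# exactly the desired value
```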
…provided that there are enough training examples. The usual division ratios are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.
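The 70% / 30% split described above can be sketched as follows; the list of (p, t) pairs is only a stand-in for a real training set:

```python
import random

patterns = [((i,), 2 * i) for i in range(100)]   # dummy set of (p, t) pairs

random.seed(0)                   # fixed seed, for reproducibility only
shuffled = patterns[:]
random.shuffle(shuffled)         # the "randomly chosen" part

cut = int(0.7 * len(shuffled))
training_data = shuffled[:cut]       # 70% used for training
verification_data = shuffled[cut:]   # 30% used for verification
```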
Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve, plotted as error over training epochs. Note the alternating logarithmic and linear scalings! Also note the small "inaccurate spikes" visible in the bend of the curve in the first and second diagram from the bottom.
Confidence in the results, for example, is boosted when the network always reaches nearly the same final error rate for different random initializations, so repeated initialization and training will provide a more objective result.

On the other hand, it is possible that a curve descending fast in the beginning can, after a longer time of learning, be overtaken by another curve: this can indicate that either the learning rate of the worse curve was too high or the worse curve itself simply got stuck in a secondary minimum, but was the first to find it. Remember: larger error values are worse than small ones.

But, in any case, note: many people only generate a learning curve for the training data (and then they are surprised that only a few things will work), but for reasons of objectivity and clarity the verification data should also be plotted on a second learning curve, which generally provides values that are slightly worse and oscillate more strongly. With good generalization, however, this curve can decrease, too.

When the network eventually begins to memorize the examples, the shape of the learning curve can provide an indication: if the learning curve of the verification examples is suddenly and rapidly rising while the learning curve of the training data is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network has already learned well enough at the next point of the two curves, and maybe the final point of learning is to be applied here (this procedure is called early stopping).

Once again I want to remind you that these are all acting as indicators and not as if-then rules.

Let me say one word about the number of learning epochs: at the moment, various publications often use about 10^6 to 10^7 epochs. Why not try some more? The answer is simple: the current standard PC is not yet fast enough to support 10^8 epochs. But with increasing processing speed the trend will go towards it all by itself.

4.5 Gradient optimization procedures

In order to establish the mathematical basis for some of the following learning procedures I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of gradient descent.

Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For clarity the illustration (fig. 4.3 on the right page) shows only two dimensions, but in principle there is no limit to the number of dimensions.
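The early-stopping idea sketched above can be written down in a few lines; train_epoch and verification_error are hypothetical callbacks standing in for a real training pass and a real evaluation on the verification data:

```python
def train_with_early_stopping(train_epoch, verification_error,
                              max_epochs=1000, patience=10):
    """Stop when the verification error has not improved for `patience` epochs."""
    best_err = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_epoch()                       # one pass over the training data
        err = verification_error()          # error on the verification data
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                           # verification curve keeps rising
    return best_epoch, best_err
```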
Figure 4.3: Visualization of the gradient descent on a two-dimensional error function. We go forward in diametrical opposition to g, i.e. with the steepest descent towards the lowest point, with the step width being proportional to |g| (the steeper the descent, the faster the steps). On the left the area is shown in 3D, on the right the steps over the level curves are shown in 2D. Here it is obvious how a movement is made in the opposite direction of g towards the minimum of the function and continuously slows down proportionally to |g|. Source: http://webster.fhs-hagenberg.ac.at/staff/sdreisei/Teaching/WS2001-2002/PatternClassification/graddescent.pdf
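A minimal numeric sketch of the descent shown in fig. 4.3: starting from a point s, we repeatedly step against the gradient, with the step size proportional to |g|. The example function, the step factor eta and the starting point are all made-up choices:

```python
import numpy as np

def f(x):                        # example error function: a simple bowl
    return x[0] ** 2 + x[1] ** 2

def gradient(f, x, h=1e-6):      # numerical gradient g = nabla f(x)
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

x = np.array([2.0, -1.5])        # starting point s
eta = 0.1                        # proportionality factor for the step size
for _ in range(200):
    x = x - eta * gradient(f, x) # move towards -g, step width ~ |g|
```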
The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent and indicates the gradient in this direction by means of its norm |g|. Thus, the gradient is a generalization of the derivative for multi-dimensional functions.2 Accordingly, the negative gradient −g exactly points towards the steepest descent. The gradient operator ∇ is referred to as nabla operator, the overall notation of the gradient g of the point (x, y) of a two-dimensional function f being, for instance, g(x, y) = ∇f(x, y).

2 I don't want to dwell on how to determine the multidimensional derivative; the interested reader may consult the usual analysis literature.

Definition 4.14 (Gradient): Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function f(x1, x2, ..., xn). The gradient operator notation is defined as

g(x1, x2, ..., xn) = ∇f(x1, x2, ..., xn).

g directs from any point of f towards the steepest ascent from this point, with |g| corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g (which means, vividly speaking, the direction into which a ball would roll from the starting point), with the size of the steps being proportional to |g| (the steeper the descent, the broader the steps). Therefore, we move slowly on a flat plateau, and on a steep ascent we run downhill. If we came into a valley, we would, depending on the size of our steps, either jump over it or return into the valley across the opposite hillside, in order to come closer and closer to the deepest point of the valley by walking to and fro, similar to a ball moving within a round bowl.

Definition 4.15 (Gradient descent): Let f be an n-dimensional function and s = (s1, s2, ..., sn) the given starting point. Gradient descent means going from f(s) against the direction of g, i.e. towards −g, with steps of the size of |g|, towards smaller and smaller values of f.

Now we will see that gradient descent procedures are not free from errors (section 4.5.1) but, nevertheless, they are promising.

4.5.1 Gradient procedures incorporate several problems

As already implied in section 4.5, gradient descent (and therefore backpropagation) is promising but not foolproof. One problem is that the result does not always reveal whether an error has occurred.

4.5.1.1 Convergence against suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4 on the right page). This problem increases proportionally
Figure 4.4: Possible errors during a gradient descent: a) Detecting bad minima, b) Quasi-standstill
with small gradient, c) Oscillation in canyons, d) Leaving good minima.
to the size of the error surface, and there is no universal solution.

4.5.1.2 Stagnation at flat plateaus

When passing a flat plateau, for instance, the gradient also becomes negligibly small (because there is hardly a descent; part b of fig. 4.4), which requires many further steps. A hypothetically possible gradient of 0 would completely stop the descent.

4.5.1.3 Leaving good minima

On the other hand, the gradient is very large at a steep slope, so that large steps can be made and a good minimum can possibly be missed (part d of fig. 4.4).

4.5.1.4 Oscillation in steep canyons

A sudden alternation from a very strong negative gradient to a very strong positive one can even result in oscillation (part c of fig. 4.4). In nature, such an error does not occur very often, so that we can think about the possibilities b and d.

4.6 Problem examples allow for testing self-coded learning strategies

We looked at learning from the formal point of view, not much yet but a little. Now it is time to look at a few problem examples you can later use to test implemented networks and learning rules.
Figure 4.5: Illustration of the training examples of the 2-spiral problem

Figure 4.6: Illustration of training examples for the checkerboard problem
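As a sketch of how such problem examples can be generated for your own experiments, here is a possible generator for checkerboard-style training examples; the 3x3 field layout and the unit square are assumptions, not taken from the figure:

```python
import random

def checkerboard_examples(n, cells=3, seed=0):
    """Points in the unit square, labelled by the parity of their grid field."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = (int(x * cells) + int(y * cells)) % 2   # alternating fields
        examples.append(((x, y), label))
    return examples
```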
4.7 The Hebbian learning rule is the basis for most other learning rules

In 1949, Donald O. Hebb formulated the Hebbian rule [Heb49], which is the basis for most of the more complicated learning rules we will discuss in this paper. We distinguish between the original form and the more general form, which is a kind of principle for other learning rules.

4.7.1 Original rule

Definition 4.16 (Hebbian rule): "If neuron j receives an input from neuron i and if both neurons are strongly active at the same time, then increase the weight wi,j (i.e. the strength of the connection between i and j)." Mathematically speaking, the rule is

∆wi,j ∼ η · oi · aj,   (4.5)

with ∆wi,j being the change in weight from i to j, which is proportional to the following factors:

. the output oi of the predecessor neuron i, as well as

. the activation aj of the successor neuron j, and

. a constant η, i.e. the learning rate, which will be discussed in section 5.5.2.

The changes in weight ∆wi,j are simply added to the weight wi,j.

Remark: Why am I speaking twice about activation, while in the formula I am using oi and aj, i.e. the output of neuron i and the activation of neuron j? Remember that the identity is often used as output function, and therefore ai and oi of a neuron are often the same. Besides, Hebb postulated his rule long before the specification of technical neurons.

Considering that this learning rule was preferred with binary activations, it is clear that with the possible activations (1, 0) the weights will either increase or remain constant. Sooner or later they would go ad infinitum, since they can only be corrected "upwards" when an error occurs. This can be compensated by using the activations (−1, 1).3 Thus, the weights are decreased when the activation of the predecessor neuron dissents from the one of the successor neuron, otherwise they are increased.

Remark: Most of the learning rules discussed in this paper are a specialization of the mathematically more general form [MR86] of the Hebbian rule.

Definition 4.17 (Hebbian rule, more general): The generalized form of the Hebbian rule only specifies the proportionality of the change in weight to the product of two undefined functions, but with defined input values:

∆wi,j = η · h(oi, wi,j) · g(aj, tj).   (4.6)

3 But that is no longer the "original version" of the Hebbian rule.
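A sketch of the original rule in code, using activations from {−1, 1} as discussed above; the learning rate value is an arbitrary example:

```python
eta = 0.1   # learning rate, arbitrary example value

def hebb_update(w_ij, o_i, a_j):
    """Original Hebbian rule: the change eta * o_i * a_j is added to w_ij."""
    return w_ij + eta * o_i * a_j

w = 0.0
w = hebb_update(w, 1, 1)    # both neurons active together: weight grows
w = hebb_update(w, -1, 1)   # activations dissent: weight shrinks again
```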
Exercises
p1 = (2, 2, 2)
p2 = (3, 3, 3)
p3 = (4, 4, 4)
p4 = (6, 0, 0)
p5 = (0, 6, 0)
p6 = (0, 0, 6)
Chapter 5
The Perceptron
A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it. Perceptrons are multi-layer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits and extensions that should avoid the limitations. Derivation of learning procedures and discussion about their problems.
As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term perceptron is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the next following layer, which is called input layer (fig. 5.1 on the next page); but the weights of all other layers are allowed to be changed. All neurons subordinate to the retina are pattern detectors. Here we initially use a binary perceptron, with every output neuron having exactly two possible output values (e.g. {0, 1} or {−1, 1}). Thus, a binary threshold function is used as activation function, depending on the threshold value Θ of the output neuron.

In a way, the binary activation function represents an IF query which can also be negated by means of negative weights. The perceptron can be used to accomplish real and logical information processing.

Remark: Whether this method is reasonable is another matter; of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being subtly connected in series or interconnected. But we will see that this is not possible without series connection.
Figure 5.1: Architecture of a perceptron with one layer of variable connections in different views. The solid-drawn weight layer in the two illustrations on the bottom can be trained.
Left side: Example of scanning information in the eye.
Right side, upper part: Drawing of the same example with indicated fixed-weight layer, using the defined designs of the functional descriptions for neurons.
Right side, lower part: Without indicated fixed-weight layer, with the name of each neuron corresponding to our convention. The fixed-weight layer will no longer be taken into account in the course of this paper.
Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.

Definition 5.1 (Input neuron): An input neuron is an identity neuron. It exactly forwards the information received. Thus, it represents the identity function, which is indicated by a corresponding symbol within the neuron.

Definition 5.2 (Information processing neuron): Information processing neurons process the input information somehow or other, i.e. do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sigma sign Σ; the activation function of the neuron is then the binary threshold function. Now we turn our attention to other neurons with the weighted sum represented as propagation function, but with the activation functions hyperbolic tangent or Fermi function, or with a separately defined activation function fact, all with the same input.

Definition 5.3 (Perceptron): The perceptron (fig. 5.1 on the left page) is a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections with the first neuron layer (input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the above-defined input neurons.

A feedforward network often contains shortcuts, which does not exactly correspond to the original description and therefore is not added to the definition.

Remark (on the retina): We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact, the first neuron layer is often viewed as the input layer, since it only forwards the input values. The retina itself and the static weights behind it are no longer mentioned or mapped, since they do not process information in any case. So, the mapping of a perceptron starts with the input neurons.
i1 @PUPUUU GFED
GFED
@ABC @ABC @ABC
GFED @ABC
GFED iGFED
@ABC
with the same input.
i2 P i3 i4 i5
@@PPPUPUUUU AAPAPPPP}} AAAnnnn}n} iiiinininin~n~
U P
@@ PPP UUAUA}} PP nn AA}i}ii nnn ~~ n i
@@ PP }UAUU nPnP ii}iA nn ~
The Boolean functions AND and OR shown @@ P}}P}PnPnAnAinAUinUiU iUiPUiP}U}P}PnPnAnAAn ~~~
@ABC nit
GFED @ABC
GFED @ABC
GFED
n i
P
~}v niii P' U
n P
~}nw n UUUP* ( ~~
in fig. 5.4 on the right page are trivial, com-
Ω1 Ω2 Ω3
posable examples.
GFED
@ABC @ABC
GFED
in finite time the perceptron can learn any-
A thing it can represent (perceptron con-
AA }}
A }} vergence theorem, [Ros62]). But don’t
1AA 1
}} halloo till you’re out of the wood! What
@ABC
GFED
AA
~}}
1.5 the perceptron is capable to represent will
be explored later.
During the exploration of linear separabil-
GFED
@ABC @ABC
GFED
ity of problems we will cover the fact that
A at least the single-layer perceptron unfor-
AA }
A }}
} tunately cannot represent a lot of prob-
1AA 1
}} lems.
@ABC
GFED
AA
~}}
0.5
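The AND and OR perceptrons of fig. 5.4 can be written down directly: two inputs with weight 1 each, feeding a binary threshold neuron with threshold 1.5 (AND) or 0.5 (OR):

```python
def threshold_neuron(inputs, weights, theta):
    net = sum(i * w for i, w in zip(inputs, weights))  # weighted sum
    return 1 if net >= theta else 0                    # binary threshold

def AND(x1, x2):
    return threshold_neuron((x1, x2), (1, 1), 1.5)

def OR(x1, x2):
    return threshold_neuron((x1, x2), (1, 1), 0.5)
```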
…has the advantage of being suitable for non-binary activation functions and, being far away from the learning target, of automatically learning faster.

Now our learning target will certainly be that for all training examples the output y of the network approximates the desired output t, i.e. that it is formally true that…
. y is the output vector of a neural network,

. output neurons are referred to as Ω1, Ω2, ..., Ω|O|,

. i is the input and

. o is the output of a neuron.

Additionally, we defined that

. the error vector Ep represents the difference (t − y) under a certain training example p,

. furthermore, let O be the set of output neurons and

. I be the set of input neurons.

Another naming convention shall be that, for example, for the output o and the teaching input t an additional index p may be set in order to indicate that this value is pattern-specific. Sometimes this will considerably enhance clarity.

…understands the set2 of weights W as a vector and maps the values onto the normed output error (normed because otherwise not all errors can be mapped onto one single e ∈ R to perform a gradient descent). It is obvious that a specific error function Errp(W) can analogously be generated for a single pattern p.

2 Following the tradition of the literature, I previously defined W as a weight matrix. I am aware of this conflict but it shall not mind us here.

As already shown in section 4.5 on the subject of gradient descent procedures, gradient descent procedures calculate the gradient of an arbitrary but finite-dimensional function (here: of the error function Err(W)) and go down towards the gradient until a minimum is reached. Err(W) is defined on the set of all weights, which is herein understood to be the vector W. So we try to decrease or to minimize the error by means of, casually speaking, turning the weights; thus you receive information about how
Figure 5.5: Exemplary error surface of a neural network with two trainable connections w1 and w2. Generally, neural networks have more than two connections, but this would have made the illustration too complex. And most of the time the error surface is too craggy, which complicates the search for the minimum.

to change the weights (the change in all weights is referred to as ∆W) by deriving the error function Err(W):

∆W ∼ −∇Err(W).   (5.1)

Due to this proportionality, a proportionality constant η provides equality (η will soon get another meaning and a real practical use beyond the mere meaning of a proportionality constant; I just want to ask the reader to be patient for a while):

∆W = −η∇Err(W).   (5.2)

Applied to a single weight wi,Ω, this reads:

∆wi,Ω = −η · ∂Err(W)/∂wi,Ω.   (5.3)

A question therefore arises: how is our error function exactly defined? It is not good if many results are far away from the desired ones; the error function should then provide large values. On the other hand, it is similarly bad if many results are close to the desired ones but there exists an extreme outlier very far away. So we use the squared distance between the output vector y and the teaching input t, which provides the Errp that is specific for a training example p over the output of all output neurons Ω:

Errp(W) = 1/2 · Σ_{Ω∈O} (tp,Ω − yp,Ω)².   (5.4)

Thus, we square the difference of the components of the vectors t and y under the pattern p and sum up these squares. Then the error definition Err and therefore the definition of the error function Err(W) result from the summation of the specific errors Errp(W) of all patterns p:

Err(W) = Σ_{p∈P} Errp(W).   (5.5)
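Equation 5.4 as code: the specific error of a single pattern is half the summed squared difference between teaching input and output, and the total error is the sum over all patterns; the sample vectors below are made up:

```python
def err_p(t, y):
    """Specific error Err_p = 1/2 * sum((t_Omega - y_Omega)^2), eq. 5.4."""
    return 0.5 * sum((t_o - y_o) ** 2 for t_o, y_o in zip(t, y))

def err_total(teaching_inputs, outputs):
    """Total error Err = sum of the specific errors over all patterns, eq. 5.5."""
    return sum(err_p(t, y) for t, y in zip(teaching_inputs, outputs))
```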
Remark: The attentive reader will certainly wonder where the factor 1/2 in equation 5.4 on the left page suddenly came from and where the root is in the equation, since the equation is very similar to the Euclidean distance. Both facts result from simple pragmatics: it is a matter of error minimization. Since the root function grows monotonically with its argument, we can omit it for reasons of calculation and implementation effort, since we do not need it for minimization. Equally, it does not matter whether the term to be minimized is divided in half by the prefactor 1/2: therefore I am allowed to multiply by 1/2. This is mere idleness, so that it can later be cancelled against the 2 arising in the course of our calculation.

Now we want to continue deriving the Delta rule for linear activation functions. We have already discussed that we turn the individual weights wi,Ω a bit and see how the error Err(W) is changing, which corresponds to the derivative of the error function Err(W) according to the very same weight wi,Ω. This derivative corresponds to the sum of the derivatives of all specific errors Errp according to this weight (since the total error Err(W) results from the sum of the specific errors):

∆wi,Ω = −η · ∂Err(W)/∂wi,Ω   (5.7)
       = Σ_{p∈P} −η · ∂Errp(W)/∂wi,Ω.   (5.8)

Once again I want to think about the question of how a neural network processes data. Basically, the data is only transferred through a function, the result of the function is sent through another one, and so on. If we ignore the output function, the path of the neuron outputs oi1 and oi2, which the neurons i1 and i2 entered into a neuron Ω, initially is the propagation function (here the weighted sum), from which the network input is received:

oi1, oi2 → fprop
⇒ fprop(oi1, oi2) = oi1 · wi1,Ω + oi2 · wi2,Ω = netΩ

Then this is sent through the activation function of the neuron Ω, so that we receive the output of this neuron, which is at the same time a component of the output vector y:

netΩ → fact
⇒ fact(netΩ) = oΩ = yΩ.

As we can see, this output results from many nested functions:

oΩ = fact(netΩ)   (5.9)
   = fact(oi1 · wi1,Ω + oi2 · wi2,Ω).   (5.10)

It is clear that we could break down the output into the input neurons (this is unnecessary here, since they do not process information in an SLP). Thus, we want to calculate the derivatives of equation 5.8 and, due to the nested functions, we can apply
the chain rule to factorize the derivative ∂Errp(W)/∂wi,Ω included in equation 5.8 on the previous page:

∂Errp(W)/∂wi,Ω = ∂Errp(W)/∂op,Ω · ∂op,Ω/∂wi,Ω.   (5.11)

Let us take a look at the first multiplicative factor of the above-mentioned equation 5.11, which represents the derivative of the specific error Errp(W) according to the output, i.e. the change of the error Errp with an output op,Ω: the examination of Errp (equation 5.4 on page 78) clearly shows that this change is exactly the difference between teaching input and output, (tp,Ω − op,Ω) (remember: since Ω is an output neuron, op,Ω = yp,Ω). The closer the output is to the teaching input, the smaller is the specific error. Thus we can replace the one by the other. This difference is also called δp,Ω (which is the reason for the name Delta rule):

∂Errp(W)/∂wi,Ω = −(tp,Ω − op,Ω) · ∂op,Ω/∂wi,Ω   (5.12)
               = −δp,Ω · ∂op,Ω/∂wi,Ω.   (5.13)

The second multiplicative factor of equation 5.11 and of the following one is the derivative of the output of the neuron Ω for the pattern p according to the weight wi,Ω. So how does op,Ω change when the weight from i to Ω is changed? Due to the requirement at the beginning of the derivation we only have a linear activation function fact, therefore we can just as well look at the change of the network input when wi,Ω is changing:

∂Errp(W)/∂wi,Ω = −δp,Ω · ∂ Σ_{i∈I}(op,i wi,Ω) / ∂wi,Ω.   (5.14)

The resulting derivative ∂ Σ_{i∈I}(op,i wi,Ω) / ∂wi,Ω can now be reduced: the function Σ_{i∈I}(op,i wi,Ω) to be derived consists of many summands, and only the summand op,i wi,Ω contains the variable wi,Ω, according to which we derive. Thus, ∂ Σ_{i∈I}(op,i wi,Ω) / ∂wi,Ω = op,i and therefore:

∂Errp(W)/∂wi,Ω = −δp,Ω · op,i   (5.15)
               = −op,i · δp,Ω.   (5.16)

We insert this into equation 5.8 on the previous page, which results in our modification rule for a weight wi,Ω:

∆wi,Ω = η · Σ_{p∈P} op,i · δp,Ω.   (5.17)

However: from the very first the derivation has been intended as an offline rule, by means of the question of how to add up the errors of all patterns and how to learn them after all patterns have been presented. Although this approach is mathematically correct, the implementation is far more time-consuming and, as we will see later in this chapter, partially needs a lot of computational effort during training.

The "online-learning version" of the Delta rule simply omits the summation, and learning is realized immediately after each presented training pattern.
[Fig. 5.8: a singlelayer perceptron with two input neurons i1, i2, weights wi1,Ω, wi2,Ω and one output neuron Ω. XOR?]

[Fig. 5.9 diagram: both inputs connect with weight 1 to a hidden neuron with threshold 1.5 and to the output neuron with threshold 0.5; the hidden neuron connects to the output neuron with weight −2.]

Figure 5.9: Neural network realizing the XOR function. Threshold values (as far as they exist) are located within the neurons.

…ers, three neuron layers) can classify convex polygons by processing these straight lines, e.g. in the form "recognize patterns lying above straight line 1, below straight line 2 and below straight line 3". Thus, we, metaphorically speaking, took an SLP with several output neurons and "attached" another SLP (upper part of fig. 5.10 on the right page). A multi-layer perceptron represents a uni…

…one layer of hidden neurons can approximate arbitrarily precisely a function with a finite number of points of discontinuity as well as their first derivative. Unfortunately, this proof is not constructive, and therefore it is left to us to find the correct number of neurons and weights.

In the following we want to use a widely-spread abbreviated form for different multi-layer perceptrons: a two-stage perceptron with 5 neurons in the input layer, 3 neurons in the hidden layer and 4 neurons in the output layer is a 5-3-4-MLP.

Definition 5.7 (Multi-layer perceptron): Perceptrons with more than one layer of variably weighted connections are referred to as multi-layer perceptrons (MLP). Thereby an n-layer or n-stage perceptron has exactly n variable weight layers and n + 1 neuron layers (the retina is disregarded here), with neuron layer 1 being the input layer.

Since three-stage perceptrons can classify sets of any form by combining and separating arbitrarily many convex polygons, another step will not be advantageous with…
Figure 5.10: We know that an SLP represents a straight line. With 2 trainable weight layers several
straight lines can be combined to form convex polygons (above). By using 3 trainable weight layers
several polygons can be formed into arbitrary sets (below).
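The XOR network of fig. 5.9 can be traced in a few lines: the hidden neuron (threshold 1.5) detects the case "both inputs active" and inhibits the output neuron (threshold 0.5) with weight −2:

```python
def step(net, theta):
    return 1 if net >= theta else 0   # binary threshold function

def xor(x1, x2):
    h = step(x1 + x2, 1.5)                # hidden neuron fires only for (1, 1)
    return step(x1 + x2 - 2 * h, 0.5)     # output neuron, inhibited by h
```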
Figure 5.11: Illustration of the position of our neuron h within the neural network. It is lying in layer H, the preceding layer is K, the next following layer is L.

…as with the Delta rule (equation 5.20). As already indicated, we have to generalize the variable δ for every neuron.

At first: where is the neuron for which we want to calculate a δ? It is obvious to select an arbitrary inner neuron h having a set K of predecessor neurons k as well as a set L of successor neurons l, which are also inner neurons (see fig. 5.11). Thereby it is irrelevant whether the predecessor neurons are already input neurons.

Now we perform the same derivation as for the Delta rule and split functions by means of the chain rule. I will not discuss this derivation in great detail, but the principle is similar to that of the Delta rule:

∂Err/∂wk,h = ∂Err/∂neth · ∂neth/∂wk,h.   (5.23)

The first factor of equation 5.23 is −δh, which we will regard later in this text. The numerator of the second factor of the equation includes the network input, i.e. the weighted sum is included in the numerator, so that we can immediately derive it. All summands of the sum drop out again apart from the summand containing wk,h. This summand is referred to as wk,h · ok. If it is derived, the output of neuron k is left:

∂neth/∂wk,h = ∂ Σ_{k∈K}(wk,h ok) / ∂wk,h   (5.24)
            = ok.   (5.25)

As promised, we will now discuss the −δh of equation 5.23, which is split again by means of the chain rule:

δh = −∂Err/∂neth   (5.26)
   = −∂Err/∂oh · ∂oh/∂neth.   (5.27)

The derivation of the output according to the network input (the second factor in equation 5.27) is certainly equal to the derivation of the activation function according to the network input:

∂oh/∂neth = ∂fact(neth)/∂neth   (5.28)
          = fact′(neth).   (5.29)

Now we analogously derive the first factor in equation 5.27 on the previous page. The reader may well mull over this passage. For this we have to point out that the derivation of the error function according to the output of an inner neuron layer depends on the vector of all network inputs of the next following layer. This is reflected in equation 5.30:

−∂Err/∂oh = −∂Err(netl1, ..., netl|L|)/∂oh.   (5.30)

According to the definition of the multi-dimensional chain rule, equation 5.31 immediately follows:

−∂Err/∂oh = Σ_{l∈L} ( −∂Err/∂netl · ∂netl/∂oh ).   (5.31)

The second factor, the derivation of the network input of a successor neuron l according to oh, reduces to the connecting weight, since all other summands of the weighted sum drop out:

∂netl/∂oh = ∂ Σ_{h∈H}(wh,l oh) / ∂oh   (5.32)
          = wh,l.   (5.33)

The same applies for the first factor according to the definition of our δ:

−∂Err/∂netl = δl.   (5.34)

Now we replace:

⇒ −∂Err/∂oh = Σ_{l∈L} δl wh,l.   (5.35)

You can find a graphic version of the δ generalization including all splittings in fig. 5.12 on the right page.

The reader might already have noticed that some intermediate results were framed. Exactly those intermediate results were framed which are a factor in the change in weight of wk,h. If the above-mentioned equations are combined with the framed intermediate results, the outcome will be the wanted change in weight ∆wk,h:

∆wk,h = η ok δh   with   (5.36)
δh = fact′(neth) · Σ_{l∈L}(δl wh,l).   (5.37)
Figure 5.12: Graphical representation of the equations (by equal signs) and chain rule splittings
(by arrows) in the framework of the backpropagation derivation. The leaves of the tree reflect the
final results from the generalization of δ, which are framed in the derivation.
Unlike the Delta rule, δ is treated differently depending on whether h is an output or an inner (i.e. hidden) neuron:

1. If h is an output neuron, then

δ_{p,h} = f'_act(net_{p,h}) · (t_{p,h} − y_{p,h})   (5.38)

Thus, under our training pattern p, the weight w_{k,h} from k to h is proportionally changed according to

- the learning rate η,
- the output o_{p,k} of the predecessor neuron k,
- the gradient of the activation function at the position of the network input of the successor neuron, f'_act(net_{p,h}), and
- the difference between teaching input t_{p,h} and output y_{p,h} of the successor neuron h.

In this case, backpropagation is working on two neuron layers, the output layer with the successor neuron h and the preceding layer with the predecessor neuron k.

2. If h is an inner, hidden neuron, then

δ_{p,h} = f'_act(net_{p,h}) · Σ_{l∈L} (δ_{p,l} · w_{h,l})   (5.39)

Here I want to explicitly mention that backpropagation is now working on three layers. At this, the neuron k is the predecessor of the connection to be changed with the weight w_{k,h}, the neuron h is the successor of the connection to be changed, and the neurons l lie in the layer following the successor neuron. Thus, according to our training pattern p, the weight w_{k,h} from k to h is proportionally changed according to

- the learning rate η,
- the output o_{p,k} of the predecessor neuron,
- the gradient of the activation function at the position of the network input of the successor neuron, f'_act(net_{p,h}),
- as well as, and this is the difference, the weighted sum of the changes in weight to all neurons following h, Σ_{l∈L} (δ_{p,l} · w_{h,l}).

Definition 5.8 (Backpropagation): If we summarize formulas 5.38 and 5.39, we receive the following total formula for backpropagation (the identifiers p are omitted for reasons of clarity):

∆w_{k,h} = η o_k δ_h   with
δ_h = f'_act(net_h) · (t_h − y_h)              if h is an output neuron,
δ_h = f'_act(net_h) · Σ_{l∈L} (δ_l w_{h,l})    if h is an inner neuron.
(5.40)

It is obvious that backpropagation initially processes the last weight layer directly by means of the teaching input and then works its way from layer to layer, in consideration of each preceding change in weights. Thus, the teaching input leaves traces in all weight layers.
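The two cases of equation 5.40 can be sketched in a few lines of code. This is a minimal illustration, not from the text; the Fermi function and its derivative o · (1 − o) are used here as one example of f_act:

```python
import math

def fermi(x):
    # Logistic (Fermi) activation function
    return 1.0 / (1.0 + math.exp(-x))

def fermi_prime(net):
    # Derivative of the Fermi function, expressed via its own output
    o = fermi(net)
    return o * (1.0 - o)

def delta_output(net_h, t_h, y_h):
    # Upper case of eq. 5.40: h is an output neuron
    return fermi_prime(net_h) * (t_h - y_h)

def delta_hidden(net_h, succ_deltas, succ_weights):
    # Lower case of eq. 5.40: h is an inner neuron; the deltas of the
    # successor neurons l are weighted with the connections w_{h,l}
    return fermi_prime(net_h) * sum(d * w for d, w in zip(succ_deltas, succ_weights))

def weight_change(eta, o_k, delta_h):
    # Eq. 5.36: Delta w_{k,h} = eta * o_k * delta_h
    return eta * o_k * delta_h
```

For net_h = 0 the Fermi derivative is 0.25, so an output neuron with teaching input 1 and output 0.5 receives δ = 0.125; a hidden neuron then weights such successor deltas with the connecting weights.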
90 D. Kriesel – A Brief Introduction to Neural Networks (EPSILON2-EN)
dkriesel.com 5.5 Backpropagation of Error
Remark: Here I am describing the first part of backpropagation (the Delta rule) and the second part (the Delta rule generalized to more layers) in one go, which may meet the requirements of the subject matter but not of the research history. The first part is obvious, as you will shortly see in the framework of a mathematical gimmick. Decades of development time and work lie between the first and the second, recursive part. Like many groundbreaking inventions, it was not until its development that it was recognized how plausible this invention was.

Furthermore, we only want to use linear activation functions, so that f'_act (light-colored) is constant. As is generally known, constants can be combined, and therefore we directly combine the constant derivative f'_act and the learning rate η (also light-colored, and constant for at least one learning cycle) into η. Thus, the result is:

∆w_{k,h} = η o_k δ_h = η o_k · (t_h − o_h)   (5.43)

This exactly corresponds with the Delta rule definition.
Experience shows that good learning rate values are in the range of 0.01 ≤ η ≤ 0.9.

5.5.2.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: In the beginning, a large learning rate learns well, but later it results in inaccurate learning. A smaller learning rate is more time-consuming, but the result is more precise. Thus, the learning rate is to be decreased by one unit once or repeatedly during the learning process.

Remark: A common error (which also seems to be a very neat solution at first glance) is to continually decrease the learning rate. Here it easily happens that the descent of the learning rate is larger than the ascent of a hill of the error function we are scaling. The result is that we simply get stuck at this ascent. Solution: rather reduce the learning rate stepwise as mentioned above.

5.5.2.2 Different layers – Different learning rates

After having discussed the backpropagation of error learning procedure and knowing how to train an existing network, it would be useful to consider how to acquire such a network.

It is possible, as already mentioned, to
mathematically prove that this MLP with one hidden neuron layer is already capable of arbitrarily accurate approximation of arbitrary functions⁵ – but it is necessary not only to discuss the representability of a problem by means of a perceptron but also its learnability. Representability means that a perceptron can principally realize a mapping; learnability means that we are also able to teach it.

In this respect, experience shows that two hidden neuron layers (or three trainable weight layers) can be very useful to solve a problem, since many problems can be represented by one hidden layer but are very difficult to learn with it. Two hidden layers are still a good value because three hidden layers are not needed very often. Moreover, any additional layer generates additional sub-minima of the error function in which we can get stuck. All things considered, a promising way is to try one hidden layer at first and, if that fails, to try two.

5.5.3.2 The number of neurons has to be tested

The number of neurons (apart from the input and output layer; the number of input and output neurons is already defined by the problem statement) principally corresponds to the number of free parameters of the problem to be represented.

Since we have already discussed the network capacity with respect to memorizing or a too imprecise problem representation, it is clear that our goal is to have as few free parameters as possible but as many as necessary.

But we also know that there is no patent formula for the question of how many neurons should be used. Thus, the most useful approach is to initially train with only a few neurons and to repeatedly train new networks with more neurons until the result significantly improves and, particularly, the generalization performance is not affected (bottom-up approach).

5.5.3.3 Selecting an activation function

Another very important parameter for the way of information processing of a neural network is the selection of an activation function. The activation function for input neurons is fixed, since they do not process information.

The first question to be asked is whether we actually want to use the same activation function in the hidden layer and in the output layer – no one prevents us from varying the functions. Generally, however, the activation function is the same for all hidden neurons as well as for all output neurons.

For tasks of function approximation it has been found reasonable to use the hyperbolic tangent (left part of fig. 5.13 on page 95) as activation function of the hidden neurons, while a linear activation function is used in the output. The latter is

⁵ Note: We have not indicated the number of neurons in the hidden layer; we only mentioned the hypothetical possibility.
Figure 5.13: As a reminder, the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter. Thereby the original Fermi function is represented by dark colors; the temperature parameters of the modified Fermi functions are, from outward to inward, 1/2, 1/5, 1/10 and 1/25.
proportion of the previous change to every new change in weight:

(∆_p w_{i,j})_now = η o_{p,i} δ_{p,j} + α · (∆_p w_{i,j})_previous

Of course, this notation is only used for a better understanding. Generally, as already defined by the concept of time, the moment of the current cycle is referred to as (t); then the previous cycle is identified by (t − 1), which is successively continued. And now we come to the formal definition of the momentum term:

Definition 5.10 (Momentum term): The variation of backpropagation by means of the momentum term is defined as follows:

∆w_{i,j}(t) = η o_i δ_j + α · ∆w_{i,j}(t − 1)   (5.44)

Remark: We accelerate on plateaus (which avoids a quasi-standstill there) and slow down on craggy surfaces (which works against oscillations). Moreover, the effect of inertia can be varied via the prefactor α; common values are between 0.6 and 0.9. Additionally, the momentum enables the positive effect that our skier swings back and forth several times in a minimum and finally lands in the minimum. Despite its nice one-dimensional appearance, the otherwise very rare error of leaving good minima unfortunately occurs more frequently because of the momentum term – which means that there is again no easy answer (but we become accustomed to this conclusion).

5.5.4.2 Flat spot elimination

It must be pointed out that with the hyperbolic tangent as well as with the Fermi function the derivative outside of the close proximity of Θ is nearly 0. This results in the fact that it is very difficult to move neurons away from the limits of the activation
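Returning to section 5.5.4.1 for a moment: the momentum update of equation 5.44 can be sketched in a few lines (a minimal illustration, not from the text). On a plateau, where the gradient part η · o_i · δ_j stays constant, the repeated feedback of the previous change lets the step size grow toward the geometric-series limit η · o_i · δ_j / (1 − α): the skier accelerates.

```python
def momentum_step(eta, o_i, delta_j, alpha, prev_change):
    # Eq. 5.44: Delta w_{i,j}(t) = eta*o_i*delta_j + alpha*Delta w_{i,j}(t-1)
    return eta * o_i * delta_j + alpha * prev_change

# Constant gradient part (a plateau): the change grows toward 0.1/(1-0.9) = 1.0
change = 0.0
for _ in range(50):
    change = momentum_step(eta=0.1, o_i=1.0, delta_j=1.0, alpha=0.9,
                           prev_change=change)
```

After 50 such steps the change has almost reached the limit of 1.0, ten times the bare gradient step.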
According to David Parker [Par87], second order backpropagation also uses the second gradient, i.e. the second multi-dimensional derivative of the error function, to obtain more precise estimations of the correct ∆w_{i,j}. Even higher derivatives only rarely improve the estimations.

The weight decay according to Paul Werbos [Wer88] is a modification that extends the error by a term punishing large weights. So the error under weight decay, Err_WD, does not only increase proportionally to the actual error but also proportionally to the square of the weights. As a result, the network keeps the weights small during learning:

Err_WD = Err + β · (1/2) Σ_{w∈W} w²   (5.45)

where the added term is the punishment.

This approach is inspired by nature, where synaptic weights cannot become infinitely strong either. Additionally, due to these small weights, the error function often shows less strong fluctuations, allowing easier and more controlled learning.

The prefactor 1/2 again resulted from simple pragmatics. The factor β controls the strength of the punishment: values from 0.001 to 0.02 are often used here.

5.5.4.6 Pruning / Optimal Brain Damage

If we have executed the weight decay long enough and determine that for a neuron in the input layer all successor weights are 0 or close to 0, we can remove the neuron, having lost one neuron and some weights, and so reduce the chance that the network will memorize. This procedure is called pruning.

Such a method to detect and delete unnecessary weights and neurons is referred to as optimal brain damage [lCDS90]. I only want to describe it briefly: The mean error per output neuron is composed of two competing terms. While one term, as usual, considers the difference between output and teaching input, the other one tries to "press" a weight towards 0. If a weight is strongly needed to minimize the error, the first term will win. If this is not the case, the second term will win. Neurons which only have zero weights can be cut again in the end.

There are many other variations of backprop, and whole books exist only about this subject, but since my aim is to offer an overview of neural networks, I just want to mention the variations above as a motivation to read on.

For some of these extensions it is obvious that they can not only be applied to feedforward networks with backpropagation learning procedures.

We have got to know backpropagation and feedforward topology – now we have to learn how to build a neural network. It is certainly impossible to provide this experience in the framework of this paper, for now it is your turn: You could now try some of the problem examples from section 4.6.

5.6 The 8-3-8 encoding problem and related problems

The 8-3-8 encoding problem is a classic among the multilayer perceptron test training problems. In our MLP we have one input layer with eight neurons i_1, i_2, …, i_8, one output layer with eight neurons Ω_1, Ω_2, …, Ω_8 and one hidden layer with three neurons.
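A minimal sketch of this problem, assuming Fermi activations and plain backpropagation (eqs. 5.38/5.39, without bias neurons, so only a rough illustration): the eight one-hot patterns serve as input and teaching input at once, and the summed squared error shrinks during training.

```python
import math
import random

random.seed(42)

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

N_IN, N_HID, N_OUT = 8, 3, 8
# one weight layer input->hidden, one hidden->output
W1 = [[random.uniform(-1, 1) for _ in range(N_HID)] for _ in range(N_IN)]
W2 = [[random.uniform(-1, 1) for _ in range(N_OUT)] for _ in range(N_HID)]
# the eight one-hot training patterns; teaching input = the pattern itself
patterns = [[1.0 if i == j else 0.0 for j in range(N_IN)] for i in range(N_IN)]

def forward(x):
    h = [fermi(sum(x[i] * W1[i][j] for i in range(N_IN))) for j in range(N_HID)]
    y = [fermi(sum(h[j] * W2[j][k] for j in range(N_HID))) for k in range(N_OUT)]
    return h, y

def total_error():
    # summed squared error over all eight patterns
    return sum(sum((t - o) ** 2 for t, o in zip(p, forward(p)[1]))
               for p in patterns)

def train(epochs=300, eta=0.5):
    for _ in range(epochs):
        for p in patterns:
            h, y = forward(p)
            # output deltas (eq. 5.38) and hidden deltas (eq. 5.39)
            d_out = [y[k] * (1 - y[k]) * (p[k] - y[k]) for k in range(N_OUT)]
            d_hid = [h[j] * (1 - h[j]) *
                     sum(d_out[k] * W2[j][k] for k in range(N_OUT))
                     for j in range(N_HID)]
            # online weight updates (eq. 5.36)
            for j in range(N_HID):
                for k in range(N_OUT):
                    W2[j][k] += eta * h[j] * d_out[k]
            for i in range(N_IN):
                for j in range(N_HID):
                    W1[i][j] += eta * p[i] * d_hid[j]
```

Calling train() and comparing total_error() before and after shows the error clearly decreasing; how well the three hidden neurons encode the eight patterns depends on the random initialization.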
Err = Err_p = (1/2)(t − y)²

converges and, if so, at what value. What does the error curve look like? Let the pattern (p, t) be defined by p = (p_1, p_2) = (0.3, 0.7) and t_Ω = 0.4. Randomly initialize the weights in the interval [−1; 1].

Exercise 11: A one-stage perceptron with two input neurons, a bias neuron and a binary threshold function as activation function divides the two-dimensional space into two regions by means of a straight line g. Analytically calculate a set of weight values for such a perceptron so that the following set
Despite all things in common: What is the difference between RBF networks and perceptrons? The difference lies in the information processing itself and in the computational rules within the neurons lying outside of the input layer. So, in a moment, we will define a so far unknown type of neurons.

Hidden neurons are also called RBF neurons (and the layer in which they are located is referred to as the RBF layer). As propagation function, each hidden neuron receives a norm that calculates the distance between the input into the network and the so-called position of the neuron (its center). This
Chapter 6 Radial Basis Functions dkriesel.com
is entered into a radial activation function, which calculates and outputs the activation of the neuron.

Definition 6.1 (RBF input neuron): Definition and representation are identical to definition 5.1 on page 73 of the input neuron.

Definition 6.2 (Center of an RBF neuron): The center c_h of an RBF neuron h is the point in the input space where the RBF neuron is located. The closer the input vector is to the center vector of an RBF neuron, the higher, generally, is its activation.

Definition 6.3 (RBF neuron): The so-called RBF neurons h have a propagation function f_prop that determines the distance between the center c_h of a neuron and the input vector y. This distance represents the network input. The network input is then sent through a radial basis function f_act, which outputs the activation or the output of the neuron. RBF neurons are represented by the symbol of a Gaussian bell labelled ||c, x||.

Each layer is completely linked with the next following one; shortcuts do not exist (fig. 6.1 on the right page) – it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input. The connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to one output neuron but, analogous to the perceptrons, it is apparent that such a definition can be generalized. A bias neuron is unknown in the RBF network. The set of input neurons shall be represented by I, the set of hidden neurons by H and the set of output neurons by O.

Therefore, the inner neurons are called radial basis neurons, because from their definition it directly follows that all input vectors with the same distance from the center of a neuron also produce the same output value (fig. 6.2 on page 104).
Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three
output neurons. The connections to the hidden neurons are not weighted, they only transmit the
input. Right of the illustration you can find the names of the neurons, which correspond to the
known names of the MLP neurons: Input neurons are called i, hidden neurons are called h and
output neurons are called Ω. The associated sets are referred to as I, H and O.
Actually, Gaussian bells are, related to the whole input space, added here.

Let us assume that we have a second, a third and a fourth RBF neuron and therefore four differently located centers. Each of these neurons now measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. Since these values are finally simply accumulated in the output layer, it is easy to understand that the sum forms a more complex function (fig. 6.4 on the right page). Additionally, the network includes the centers c_1, c_2, …, c_4 of the four inner neurons h_1, h_2, …, h_4, and therefore it has Gaussian bells which are finally added within the output neuron Ω. The network also possesses four values σ_1, σ_2, …, σ_4, which influence the width of the Gaussian bells. However, the height of a Gaussian bell is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.
Figure 6.3: Two individual one- or two-dimensional Gaussian bells. In both cases σ = 0.4 holds, and in both cases the center of the Gaussian bell lies in the point of origin. The distance r to the center (0, 0) is simply calculated from the Pythagorean theorem: r = √(x² + y²).
Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers c_1, c_2, …, c_4 were located at 0, 1, 3, 4, the widths σ_1, σ_2, …, σ_4 at 0.4, 1, 0.2, 0.8. You can see an example for a two-dimensional case in fig. 6.5 on the next page.
Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. Once again r = √(x² + y²) applies for the distance. The heights w, widths σ and centers c = (x, y) are: w_1 = 1, σ_1 = 0.4, c_1 = (0.5, 0.5); w_2 = −1, σ_2 = 0.6, c_2 = (1.15, −1.15); w_3 = 1.5, σ_3 = 0.2, c_3 = (−0.5, −1); w_4 = 0.8, σ_4 = 1.4, c_4 = (−2, 0).
Since we use a norm to calculate the distance between the input vector and the center of a neuron h, we have different choices: Often the Euclidean distance is chosen:

r_h = ||x − c_h||   (6.1)
    = √( Σ_{i∈I} (x_i − c_{h,i})² )   (6.2)

Remember: The input vector was referred to as x. Here, the index i passes through the input neurons and thereby through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squares of the differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this equals the Pythagorean theorem.

Remark: From the definition of a norm it directly follows that the distance can only be positive, which is why we, strictly speaking, use the positive part of the activation function. By the way, activation functions other than the Gaussian bell are possible. Normally, functions that monotonically decrease in the interval [0; ∞] are selected.

Remark (on nomenclature): It is obvious that both the center c_h and the width σ_h can be understood as part of the activation function f_act, and according to this not all activation functions may be referred to as f_act. One solution would be to number the activation functions like f_act1, f_act2, …, f_act|H|, with H being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name f_act for all activation functions and regard σ and c as variables that are defined for individual neurons but not directly included in the activation function.

Remark: The reader will definitely notice that in the literature the Gaussian bell is often provided with a multiplicative factor. Due to the existing multiplication by the following weights and the comparability of constant multiplication, we do not need this factor (especially because for our purpose the integral of the Gaussian bell must not always be 1) and therefore we simply leave it out.

6.2.2 Some analytical thoughts prior to the training
Now that we know the distance r_h between the input vector x and the center c_h of the RBF neuron h, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

f_act(r_h) = e^( −r_h² / (2σ_h²) )   (6.3)

The output y_Ω of an RBF output neuron Ω results from combining the functions of the RBF neurons to

y_Ω = Σ_{h∈H} w_{h,Ω} · f_act(||x − c_h||)   (6.4)

Let us assume that, similar to the multilayer perceptron, we have a training set P that contains |P| training samples (p, t).
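Equations 6.1 to 6.4 can be sketched directly (a minimal illustration, not from the text):

```python
import math

def rbf_output(x, centers, sigmas, weights):
    # y_Omega = sum over hidden neurons h of w_{h,Omega} * f_act(||x - c_h||)
    y = 0.0
    for c, sigma, w in zip(centers, sigmas, weights):
        r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))  # eqs. 6.1/6.2
        act = math.exp(-r ** 2 / (2.0 * sigma ** 2))                # eq. 6.3
        y += w * act                                                # eq. 6.4
    return y
```

At a center the activation is exactly 1, so a lone neuron there contributes its full weight; and inputs with equal distance to a center produce equal outputs, the radial symmetry mentioned above.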
If we have more training examples than RBF neurons, we cannot assume that every training example is exactly hit. So, if we cannot exactly hit the points and therefore cannot only interpolate as in the above-mentioned ideal case with |P| = |H|, we must try to find a function that approximates our training set P as exactly as possible: As with the MLP, we try to reduce the sum of the squared errors to a minimum.

Another reason for the use of the Moore-Penrose pseudoinverse is the fact that it minimizes the squared error (which is our aim): The estimation of the vector G in equation 6.15 corresponds to the Gauss-Markov model known from statistics, which is used to minimize the squared error.

¹ Particularly, M⁺ = M⁻¹ holds if M is invertible. I do not want to further discuss the reasons for these circumstances and applications of M⁺ – they can easily be found in the linear algebra literature.
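The least-squares fit of the hidden-to-output weights can be sketched without an explicit pseudoinverse via the normal equations (MᵀM)G = Mᵀt, which yield the same minimum-squared-error solution as the Moore-Penrose pseudoinverse whenever M has full rank (a minimal sketch under that assumption; the function names are mine, not from the text):

```python
import math

def design_matrix(patterns, centers, sigma):
    # M[p][h]: Gaussian activation of hidden neuron h for training pattern p
    M = []
    for x in patterns:
        row = []
        for c in centers:
            r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            row.append(math.exp(-r2 / (2.0 * sigma ** 2)))  # eq. 6.3
        M.append(row)
    return M

def solve_weights(M, t):
    # Solve (M^T M) G = M^T t by Gaussian elimination with partial pivoting;
    # for full-rank M this is the same G the pseudoinverse would give.
    n = len(M[0])
    A = [[sum(M[p][i] * M[p][j] for p in range(len(M))) for j in range(n)]
         for i in range(n)]
    b = [sum(M[p][i] * t[p] for p in range(len(M))) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for j in range(col, n):
                A[r][j] -= f * A[col][j]
            b[r] -= f * b[col]
    G = [0.0] * n
    for i in range(n - 1, -1, -1):
        G[i] = (b[i] - sum(A[i][j] * G[j] for j in range(i + 1, n))) / A[i][i]
    return G
```

With |P| = |H| and the centers placed on the patterns themselves, the solution interpolates the training set exactly, matching the ideal case described above.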
Figure 6.6: Example for an even coverage of a two-dimensional input space by applying radial basis functions.

6.3.1.1 Fixed selection

In any case, the goal is to cover the input space as evenly as possible. Here, widths of 2/3 of the distance between the centers can be selected, so that the Gaussian bells overlap by approx. "one third"² (fig. 6.6). The closer the bells are set, the more precise but also the more time-consuming the whole thing becomes.

This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage. Here it is useless when the function to be approximated

² It is apparent that a Gaussian bell is mathematically infinitely wide; therefore I ask the reader to excuse this sloppy formulation.

6.3.1.2 Conditional, fixed selection

Let us assume that our training examples are not evenly distributed across the input space. Then it seems obvious to arrange the centers and sigmas of the RBF neurons by means of the pattern distribution. So the training patterns can be analyzed by statistical techniques such as cluster analysis, and it can thus be determined whether there are statistical factors according to which we should distribute the centers and sigmas (fig. 6.7 on the right page).

A more trivial alternative would be to set |H| centers on positions randomly selected from the set of patterns. This method would allow every training pattern p to be directly in the center of a neuron (fig. 6.8 on the right page). This is not yet very elegant, but a good solution when time is of the essence. Generally, for this method the widths are selected fixedly.
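Both selection schemes are easy to sketch (minimal illustrations, not from the text; the one-dimensional grid with the 2/3 width rule follows section 6.3.1.1, the pattern sampling follows section 6.3.1.2):

```python
import random

def even_centers_1d(lo, hi, n):
    # Fixed selection: centers evenly covering [lo, hi]; widths set to
    # 2/3 of the distance between neighboring centers (section 6.3.1.1)
    step = (hi - lo) / (n - 1)
    centers = [lo + i * step for i in range(n)]
    sigmas = [2.0 * step / 3.0] * n
    return centers, sigmas

def centers_from_patterns(patterns, n_hidden, seed=0):
    # Conditional, fixed selection: place the |H| centers on randomly
    # chosen training patterns (section 6.3.1.2)
    return random.Random(seed).sample(patterns, n_hidden)
```

The seeded random generator only serves reproducibility of the sketch; any selection of |H| distinct patterns would do.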
only used to generate some previous knowledge. Therefore we will not discuss them in this chapter but independently in the indicated chapters.

The relevant gradient terms are

∂Err(σ_h, c_h) / ∂σ_h   and   ∂Err(σ_h, c_h) / ∂c_h.

Since the derivation of these terms corresponds to the derivation of backpropagation, we do not want to discuss it here.

But experience shows that no convincing results are obtained by regarding how the error behaves depending on the centers and sigmas. Even if mathematics claims that such methods are promising, the gradient descent, as we already know, leads to problems with very craggy error surfaces.

And that is the crucial point: Naturally, RBF networks generate very craggy error surfaces, because if we considerably change a c or a σ, we will significantly change the appearance of the error function.

6.4 Growing RBF Networks automatically adjust the neuron density

6.4.1 Neurons are added to places with large error values

After generating this initial configuration, the vector of the weights G is analytically calculated. Then all specific errors Err_p concerning the set P of the training examples are calculated, and the maximum specific error

max_P (Err_p)

is sought.

The extension of the network is simple: We replace this maximum error with a new RBF neuron. Of course, we have to exercise care in doing this: If the σ are small, the neurons will only influence one another when the distance between them is short. But if the σ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.
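The insertion criterion can be sketched as follows (a minimal illustration; `predict` stands for whatever output function the current network realizes and is an assumed interface, not from the text):

```python
def worst_pattern(patterns, targets, predict):
    # Find the training example with the maximum specific error Err_p;
    # the new RBF neuron would be centered on that pattern.
    errors = [(t - predict(p)) ** 2 for p, t in zip(patterns, targets)]
    worst = max(range(len(errors)), key=errors.__getitem__)
    return patterns[worst], errors[worst]
```

After centering a new neuron on the returned pattern, the weight vector G would be recalculated analytically as described above.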
So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron. To put it crudely, this adjustment is made by moving the centers c of the other neurons away from the new neuron and reducing their widths σ a bit. Then the current output vector y of the network is compared to the teaching input t, and the weight vector G is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximations.

In some cases, for instance, one single neuron with a higher Gaussian bell would be appropriate. But to develop automated procedures in order to find less relevant neurons is very problem-dependent, and we want to leave this to the programmer.

With RBF networks and multi-layer perceptrons we have already become acquainted with, and extensively discussed, two network paradigms for similar problems. Therefore we want to compare these two paradigms and look at their advantages and disadvantages.
a great problem. Please use any previous knowledge you have when applying them. Such problems do not occur with the MLP.

Output dimension: The advantage of RBF networks is that the training is not much influenced when the output dimension of the network is high. For an MLP, a learning procedure such as backpropagation will thereby be very protracted.

Spread: Here the MLP is "advantaged", since RBF networks are used considerably less often – which is not always understood by professionals (at least as far as low-dimensional input spaces are concerned). The MLPs seem to have a considerably longer tradition, and they work too well for people to take the effort to read some pages of this paper about RBF networks :-).
Generally, recurrent networks are networks capable of influencing themselves by means of recurrences, e.g. by including the network output in the following computation steps. There are many types of recurrent networks of nearly arbitrary form, and nearly all of them are referred to as recurrent neural networks. As a result, for the few paradigms introduced here I use the name recurrent multi-layer perceptrons.

Apparently, such a recurrent network is capable of computing more than the ordinary MLP: If the recurrent weights are set to 0, the recurrent network will be reduced to an ordinary MLP. Additionally, the recurrence generates different network-internal states, so that in the context of the network state different inputs can produce different outputs.

Recurrent networks in themselves have a great dynamic that is mathematically difficult to conceive and has to be discussed extensively. The aim of this chapter is only to briefly discuss how recurrences can be structured and how network-internal states can be generated. Thus, I will briefly introduce two paradigms of recurrent networks and afterwards roughly outline their training.

With a recurrent network, a temporally constant input x may lead to different results: For one thing, the network could converge, i.e. it could transform itself into a fixed state and at any time it will return a fixed output value y; for another thing, it might never converge, or at least not until a long time later, so that we no longer recognize the consequences of a constant change of y.

If the network does not converge, it is, for example, possible to check whether periodic states or attractors (fig. 7.1 on the next page) are returned. Here we can expect the complete variety of dynamical systems. That is the reason why I particularly want to refer to the literature concerning dynamical systems.
117
Chapter 7 Recurrent Perceptron-like Networks (depends on chapter 5) dkriesel.com
Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons
and with the next time step it is entered into the network together with the new input.
with one context neuron per output neuron. The set of context neurons is called K. The context neurons are completely linked toward the input layer of the network.

layer during the next time step (i.e. on the way back, a complete link again). So the complete information processing part¹ of the MLP exists a second time as a "context version" – which once again considerably increases the dynamics and state variety.
Figure 7.3: Illustration of an Elman network. The entire information processing part of the network
exists, in a manner of speaking, twice. The output of each neuron (except for the output of the
input neurons) is buffered and reentered into the associated layer. For the reason of clarity I named
the context neurons on the basis of their models in the actual network, but it is not mandatory to
do so.
neuron layer with exactly the same number of context neurons. Every neuron has a weighted connection to exactly one context neuron, while the context layer is completely linked towards its original layer.

Now it is interesting to take a look at the training of recurrent networks since, for instance, the ordinary backpropagation of error cannot work on recurrent networks. Once again, the style of the following part is more informal, which means that I will not use any formal definitions.

7.3 Training Recurrent Networks

In order to explain the training as descriptively as possible, we have to agree upon some simplifications that do not affect the learning principle itself.

So for the training let us assume that initially the context neurons are initiated with an input, since otherwise they would have an undefined input (this is no simplification but reality).

Furthermore, we use a Jordan network without a hidden neuron layer for our training attempts, so that the output neurons can directly provide input. This approach is a strong simplification because generally more complicated networks are used. But this does not change the learning principle.

7.3.1 Unfolding in Time

Remember our actual learning procedure for MLPs, the backpropagation of error, which backpropagates the delta values. In the case of recurrent networks the delta values would cyclically backpropagate through the network again and again, which makes the training more difficult. On the one hand we cannot know which of the many generated delta values for a weight should be selected for training, i.e. which values are useful. On the other hand we cannot definitely know when learning should be stopped. The advantage of recurrent networks is great state dynamics within the network operation; the disadvantage of recurrent networks is that these dynamics are also granted to the training and therefore make it difficult.

One learning approach would be the attempt to unfold the temporal states of the network (fig. 7.4 on page 123): Recursions are deleted by placing an even network over the context neurons, i.e. the context neurons are, in a manner of speaking, the output neurons of the attached network. More generally spoken, we have to backtrack the recurrences and place "earlier" instances of neurons in the network – thus creating a larger but forward-oriented network without recurrences; the same network is, so to speak, attached to each context layer. This enables training a recurrent network with any training strategy developed for non-recurrent ones. Here, the input is entered as teaching input into every "copy" of the input neurons. This can be done for a discrete number of time steps. These training paradigms are called unfolding in time.
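As a minimal sketch of this idea (my own illustration, not from the original text; the tanh activation, the layer sizes and the toy weights are arbitrary assumptions), a Jordan-style recurrence can be unrolled into an ordinary feedforward pass through one "network copy" per time step:

```python
import math

def step(W, V, x, context):
    """One network copy: each output neuron sums weighted inputs
    plus weighted context values (= the previous copy's output)."""
    n_out = len(W)
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x)))
                      + sum(V[i][j] * context[j] for j in range(n_out)))
            for i in range(n_out)]

def unfolded_forward(W, V, inputs):
    """Unfolding in time: the recurrence disappears because copy t's
    context neurons simply hold the output of copy t-1, so any
    feedforward training strategy could be applied to the unrolled pass."""
    context = [0.0] * len(W)  # context neurons get a defined initial input
    outputs = []
    for x in inputs:
        context = step(W, V, x, context)
        outputs.append(context)
    return outputs

W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # input -> output weights (toy values)
V = [[0.2, 0.0], [-0.1, 0.4]]             # context -> output weights (toy values)
outs = unfolded_forward(W, V, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(len(outs))  # one output vector per unrolled time step
```

Each loop iteration corresponds to one stacked copy of the network in fig. 7.4.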
Disadvantages: The training of such an unfolded network will take a long time, since a large number of layers could possibly be produced. A problem that is no longer negligible is the limited computational accuracy of ordinary computers, which is exhausted very fast because of so many nested computations (the farther we move away from the output layer, the smaller the influence of backpropagation becomes, so that this limit is reached). Furthermore, with several levels of context neurons this procedure could produce very large networks to be trained.

Due to the already long-lasting training time, evolutionary algorithms have proved their value especially with recurrent networks. One reason for this is that they are not only unrestricted with respect to recurrences but also have other advantages when the mutation mechanisms are suitably chosen: so, for example, neurons and weights can be adjusted and the network topology can be optimized (of course the result of learning is not necessarily a Jordan or Elman network). With ordinary MLPs, however, evolutionary strategies are less popular
Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: the recurrent MLP. Bottom: the unfolded network. For reasons of clarity, I only added names to the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs, dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time step of the network, with the most current time step at the bottom.
Another supervised learning example of the wide range of neural networks was developed by John Hopfield: the so-called Hopfield networks [Hop82]. Hopfield and his physically motivated networks have contributed a lot to the renaissance of neural networks.

8.1 Hopfield networks are inspired by particles in a magnetic field

The idea for the Hopfield networks originated from the behavior of particles in a magnetic field: Every particle "communicates" (by means of magnetic forces) with every other particle (complete link), with each particle trying to reach an energetically favorable state (i.e. a minimum of the energy function). As for the neurons, this state is known as activation. Thus, all particles or neurons rotate and thereby encourage each other to continue this rotation. In a manner of speaking, our neural network is a cloud of particles.

Based on the fact that the particles automatically detect the minima of the energy function, Hopfield had the idea to use the "spin" of the particles to process information: Why not let the particles search for minima on self-defined functions? Even if we only use two of those spins, i.e. a binary activation, we will recognize that the developed Hopfield network shows considerable dynamics.

8.2 In a Hopfield network, all neurons influence each other symmetrically

Briefly speaking, a Hopfield network consists of a set K of completely linked neurons.
Chapter 8 Hopfield Networks dkriesel.com
A Hopfield network whose weight matrix has zeros on the diagonal always converges [CG88], i.e. at some point it will stand still. Then the output is a binary string y ∈ {−1, 1}^|K|, namely the state string of the network that has found a minimum.

Additionally, the complete link provides for the fact that we do not know any input, output or hidden neurons. Now let us take a closer look at the contents of the weight matrix and the rules for the state change of the neurons.

Definition 8.3 (Input and output of a Hopfield network): The input of a Hopfield network is a binary string x ∈ {−1, 1}^|K| that initializes the state of the network. After the convergence of the network, the output is the binary string y ∈ {−1, 1}^|K| generated from the new network state.

8.2.2 Significance of Weights

We have already said that the neurons change their states, i.e. their direction, from −1 to 1 or vice versa. These spins occur depending on the current states of the other neurons and the associated weights. Thus, the weights are capable of controlling the complete change of the network. The weights can be positive, negative, or 0. Colloquially speaking, for a weight wi,j between two neurons i and j:

If wi,j is positive, it will try to force the two neurons to become equal – the larger it is, the harder the network will try. If the neuron i is in state 1 and the neuron j is in state −1, a high positive weight will advise the two neurons that it is energetically more favorable to be equal.

If wi,j is negative, its behavior will be analogous, only that i and j are urged to be different. A neuron i in state −1 would then try to urge a neuron j into state 1.

Zero weights see to it that the two involved neurons do not influence each other.

The weights as a whole apparently take the way from the current state of the network towards the next minimum of the energy function. We now want to discuss how the neurons follow this way.

8.2.3 A neuron changes its state according to the influence of the other neurons

The state xk of an individual neuron k changes according to the scheme

xk(t) = fact( Σj∈K wj,k · xj(t − 1) )    (8.1)

in every time step, whereby the function fact generally is the binary threshold function (fig. 8.2 on the next page) with threshold 0. Colloquially speaking: a neuron k calculates the sum of the wj,k · xj(t − 1), which indicates how strongly and in which direction the neuron k is urged by the other neurons j. Thus, the new state of the network (time t) results from the state of the network at the previous time t − 1. Depending on the sign of this sum, the neuron k takes state 1 or −1.

Another difference between the Hopfield networks and the other already known network topologies is the asynchronous update
work towards a certain minimum.

8.3 The weight matrix is generated directly out of the training patterns
This results in the weight matrix W. Colloquially speaking: We initialize the network by means of a training pattern and then process the weights wi,j one after another. For each of these weights we verify: Are the neurons i, j in the same state or do the states vary? In the first case we add 1 to the weight, in the second case we add −1.

This we repeat for each training pattern p ∈ P. Finally, the values of the weights wi,j are high when i and j corresponded with many training patterns. Colloquially speaking, this high value tells the neurons: "Often, it is energetically favorable to hold the same state". The same applies to negative weights.

Formally, the elements of the weight matrix W are defined by single processing of the learning rule

wi,j = Σp∈P pi · pj,

whereby the diagonal of the matrix is covered with zeros.

Due to this training we can store a certain fixed number of patterns p in the weight matrix. At an input x the network will converge to the stored pattern that is closest to the input p. Unfortunately, the number of maximum storable and reconstructible patterns is limited: no more than |P|MAX ≈ 0.139 · |K| training examples can be trained and at the same time maintain their function.

Now we know the functionality of Hopfield networks but nothing about their practical use.

8.4 Autoassociation and Traditional Application

Hopfield networks like those mentioned above are called autoassociators. An autoassociator a exactly shows the above-mentioned behavior: Firstly, when a known pattern p is entered, exactly this known pattern is returned. Thus, a(p) = p.
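As an end-to-end sketch (my own illustration, not from the original text), the learning rule above can be combined with the state-change rule of section 8.2.3 to store one pattern and recall it from a disturbed input; the fixed sweep order and the stopping criterion are assumptions:

```python
def train_hopfield(patterns):
    """Hebbian rule from above: w_i,j = sum over p of p_i * p_j,
    with the diagonal of the matrix covered with zeros."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, x, max_sweeps=10):
    """Repeatedly recompute neurons (asynchronously, in a fixed sweep
    order, which is an assumption) until the state no longer changes."""
    x = list(x)
    for _ in range(max_sweeps):
        changed = False
        for k in range(len(x)):
            s = sum(W[j][k] * x[j] for j in range(len(x)))
            new = 1 if s >= 0 else -1
            if new != x[k]:
                x[k], changed = new, True
        if not changed:
            break
    return x

stored = [1, -1, 1, -1, 1]           # |K| = 5; roughly 0.139*|K| patterns fit
W = train_hopfield([stored])
noisy = [1, 1, 1, -1, 1]             # one neuron flipped
print(recall(W, noisy) == stored)    # True: converges to the stored pattern
```

With a single stored pattern the capacity bound is far from exhausted, so the flipped neuron is reliably corrected.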
there for a while, goes on to the next pattern, and so on.

Which letter in the alphabet follows the letter P?

Another example is the phenomenon that one cannot remember a situation, but the
Exercises
Previously, I want to announce that there are different variations of LVQ, which will be mentioned but not exactly represented. The goal of this chapter is rather to analyze the underlying principle.

The elements of our example are exactly such numbers, because the natural numbers do not include, for example, numbers between 1 and 2. The set of real numbers R, on the other hand, is continuous: it does not matter how close two selected numbers are, there will always be a number between them.
Chapter 9 Learning Vector Quantization dkriesel.com
Figure 9.1: Examples for quantization of a two-dimensional input space. The lines represent the class limits, the × mark the codebook vectors.
that the set C of classes contains |C| codebook vectors C1, C2, . . . , C|C|.

This leads to the structure of the training examples: They are of the form (p, c) and therefore contain the training input vector p and its class affiliation c. For the class affiliation,

c ∈ {1, 2, . . . , |C|}

is true, which means that it clearly assigns the training example to a class or a codebook vector.

Remark: Intuitively, we could say about learning: "Why a learning procedure? We calculate the average of all class members and place their codebook vectors there – and that's it." But we will soon see that our learning procedure can do a lot more.

We only want to briefly discuss the steps of the fundamental LVQ learning procedure:

Initialization: We place our set of codebook vectors on random positions in the input space.

Training example: A training example p of our training set P is selected and presented.

Distance measurement: We measure the distance ||p − Ci|| between all codebook vectors C1, C2, . . . , C|C| and our input p.

Winner: The closest codebook vector wins, i.e. the one with

min_{Ci ∈ C} ||p − Ci||.

Learning process: The learning process takes place by the rule

∆Ci = η(t) · h(p, Ci) · (p − Ci)    (9.1)
Ci(t + 1) = Ci(t) + ∆Ci,    (9.2)

which we now want to break down.

. We have already seen that the first factor η(t) is a time-dependent learning rate allowing us to differentiate between large learning steps and fine tuning.

. The last factor (p − Ci) obviously is the direction toward which the codebook vector is moved.

. But the function h(p, Ci) is the core of the rule: It makes a case differentiation.

Assignment is correct: The winner vector is the codebook vector of the class that includes p. In this case, the function provides positive values and the codebook vector moves towards p.

Assignment is wrong: The winner vector does not represent the class that includes p. Therefore it moves away from p.

We can see that our definition of the function h was not precise enough. With good reason: From here on, the LVQ is divided into different nuances, depending on how exactly h and the learning rate should be defined (called LVQ1, LVQ2, LVQ3, OLVQ, etc.). The differences are, for instance, in the strength of the codebook vector movements. They are not all based on the same principle described here, and as announced I don't want to discuss them any further. Therefore I don't give any formal definition regarding the above-mentioned learning rule and LVQ.
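The fundamental procedure above can be condensed into a single LVQ1-style update step (my own illustration, not from the original text; the simple ±1 case differentiation stands in for the function h, whose exact definition the text deliberately leaves open):

```python
def lvq1_step(codebooks, classes, p, c, eta):
    """One update: the winner (closest codebook vector) moves towards
    the example p if its class matches c, away from p otherwise."""
    i = min(range(len(codebooks)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, codebooks[k])))
    sign = 1 if classes[i] == c else -1   # stands in for h(p, C_i)
    codebooks[i] = [ci + sign * eta * (pi - ci)
                    for ci, pi in zip(codebooks[i], p)]
    return i

C = [[0.0, 0.0], [1.0, 1.0]]  # two codebook vectors in a 2-D input space
cls = [1, 2]                  # their class labels
winner = lvq1_step(C, cls, p=[0.2, 0.1], c=1, eta=0.5)
print(winner, C[0])  # codebook vector 0 wins and moves towards p
```

Only the winner moves; with a wrong class affiliation the sign flips and it would move away from p instead.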
Exercises
Chapter 10
Self-organizing Feature Maps
A paradigm of unsupervised learning neural networks, which maps an input
space by its fixed topology and thus independently looks for similarities.
Function, learning procedure, variations and neural gas.
If you take a look at the concepts of biological neural networks mentioned in the introduction, one question will arise: How does our brain store and recall the impressions it receives every day? Let me point out that the brain does not have any training examples and therefore no "desired output". And while already considering this subject we realize that there is no output in this sense at all, either. Our brain responds to external input by changes in state. These are, so to speak, its output.

Based on this principle and exploring the question of how biological neural networks organize themselves, Teuvo Kohonen developed in the Eighties his self-organizing feature maps [Koh82, Koh98], shortly referred to as self-organizing maps or SOMs – a paradigm of neural networks where the output is the state of the network and which learns completely unsupervised, i.e. without a teacher.

Unlike the other network paradigms we have already got to know, for SOMs it is unnecessary to ask what the neurons calculate. We only ask which neuron is active at the moment. Biologically, this is well motivated: If the neurons are connected to certain muscles, it is less interesting to know how strongly a certain muscle is contracted than which muscle is activated. In other words: We are not interested in the exact output of the neuron but in knowing which neuron provides output. Thus, SOMs are considerably more related to biology than, for example, the feedforward networks, which are increasingly used for calculations.

10.1 Structure of a Self-Organizing Map

Typically, SOMs have – like our brain – the task to map a high-dimensional input (N dimensions) onto areas in a low-dimensional grid of cells (G dimensions), to draw a map of the high-dimensional input space, so to speak. To generate this map, the SOM simply obtains arbitrarily many points of the input space. During the input of the points the SOM will try to cover as well as possible the positions on which the points appear with its neurons. This particularly means that every neuron can be assigned to a certain position in the input space.

At first, these facts seem to be a bit confusing, and it is recommended to briefly reflect about them. There are two spaces in which SOMs are working:

. The N-dimensional input space and

. the G-dimensional grid on which the neurons are lying and which indicates the neighborhood relationships between the neurons and therefore the network topology.

Figure 10.1: Example topologies of a self-organizing map. Above we can see a one-dimensional topology, below a two-dimensional one.

In a one-dimensional grid, the neurons could be, for instance, like pearls on a string. Every neuron would have exactly two neighbors (except for the two end neurons). A two-dimensional grid could be a square array of neurons (fig. 10.1). Another possible array in two-dimensional space would be some kind of honeycomb shape. Irregular topologies are possible, too, but not very often. Topologies with more dimensions and considerably more neighborhood relationships would also be possible, but due to their lack of visualization capability they are not employed very often.

Remark: Even if N = G is true, the two spaces are not equal and have to be distinguished. In this special case they only have the same dimension.

Initially, we will briefly and formally regard the functionality of a self-organizing map and then make it clear by means of some examples.

Definition 10.1 (SOM neuron): Similar to the neurons in an RBF network, a SOM neuron k occupies a position ck (a center) in the input space.

Definition 10.2 (Self-organizing map): A self-organizing map is a set K of SOM neurons. If an input vector is entered, exactly that neuron k ∈ K is activated which is closest to the input pattern in the input space. All other neurons remain inactive. This paradigm of activity is also called the winner-takes-all scheme. The output we expect due to the input of a SOM shows which neuron is activated. The dimension of the input space is referred to as N.

Definition 10.3 (Topology): The neurons are interconnected by neighborhood relationships.
A stimulus p is selected from the input space R^N; now this stimulus is entered into the network.

Distance measurement: Then the distance ||p − ck|| is determined for every neuron k in the network.

Winner takes all: The winner neuron i is determined, which has the smallest distance to p, i.e. which fulfills the condition

||p − ci|| ≤ ||p − ck|| ∀ k ≠ i.

You can see that from several winner neurons one can be selected at will.

Adapting the centers: The neuron centers are moved within the input space according to the rule²

∆ck = η(t) · h(i, k, t) · (p − ck),

where the values ∆ck are simply added to the existing centers. The last factor shows that the change in position of the neurons k is proportional to the distance to the input pattern p and, as usual, to a time-dependent learning rate η(t). The above-mentioned network topology exerts its influence by means of the function h(i, k, t), which will be discussed in the following.

² Note: In many sources this rule is written ηh(p − ck), which wrongly leads the reader to believe that h is a constant. This problem can easily be solved by not omitting the multiplication dots ·.

Definition 10.4 (SOM learning rule): A SOM is trained by presenting an input pattern and determining the associated winner neuron. The winner neuron and its neighbor neurons, which are defined by the topology function, then adapt their centers according to the rule

∆ck = η(t) · h(i, k, t) · (p − ck),    (10.1)
ck(t + 1) = ck(t) + ∆ck(t).    (10.2)

10.3.1 The topology function defines how a learning neuron influences its neighbours

The topology function h is not defined on the input space but on the grid and represents the neighborhood relationships between the neurons, i.e. the topology of the network. It can be time-dependent (which it often is) – which explains the parameter t. The parameter k is the index running through all neurons, and the parameter i is the index of the winner neuron.

In principle, the function shall take a large value if k is a neighbor of the winner neuron or even the winner neuron itself, and small values if not. More precisely: the topology function must be unimodal, i.e. it must have exactly one maximum. This maximum must be next to the winner neuron i, for which the distance to itself certainly is 0.

Additionally, the time-dependence enables us, for example, to reduce the neighborhood in the course of time.
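The steps above can be sketched as one training step (my own illustration, not from the original text; the hard 0/1 neighborhood function is only a placeholder for the topology function h, and all weights and positions are toy values):

```python
def som_step(centers, grid, p, eta, h):
    """One SOM training step: determine the winner i (smallest distance
    to p in the input space), then move every neuron k by
    delta_c_k = eta * h(i, k) * (p - c_k)."""
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    i = min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
    for k in range(len(centers)):
        hk = h(grid, i, k)
        centers[k] = [c + eta * hk * (x - c) for c, x in zip(centers[k], p)]
    return i

# One-dimensional topology: neuron k sits at grid position k, like
# pearls on a string; direct grid neighbors learn, all others do not.
neighbor = lambda grid, i, k: 1.0 if abs(grid[i] - grid[k]) <= 1 else 0.0

centers = [[0.0, 0.0], [0.5, 0.5], [2.0, 2.0]]
winner = som_step(centers, [0, 1, 2], p=[0.1, 0.1], eta=0.5, h=neighbor)
print(winner)  # 0: neuron 0 is closest to p in the input space
```

Note that which neurons move is decided on the grid, not in the input space: neuron 1 learns because it is the winner's grid neighbor, neuron 2 does not.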
Using a Gaussian bell with a monotonically decreasing σ(t), our topology function could look like this:

h(i, k, t) = e^(−||gi − gk||² / (2 · σ(t)²)),    (10.3)

where gi and gk represent the neuron positions on the grid, not the neuron positions in the input space, which would be referred to as ci and ck.

First, let us talk about the learning rate: typical target values of the learning rate are two orders of magnitude smaller than the initial value, so for example

0.01 < η < 0.6

could be true. But this size must also depend on the network topology or the size of the neighborhood.
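A sketch of this topology function (my own illustration; the exponential decay schedule for σ(t) is an arbitrary assumption, while the initial value 10.0 matches the example training in fig. 10.5):

```python
import math

def h_gauss(g, i, k, sigma_t):
    """Gaussian topology function of eq. (10.3); the distance is taken
    between the grid positions g[i] and g[k], not between the centers."""
    d2 = sum((a - b) ** 2 for a, b in zip(g[i], g[k]))
    return math.exp(-d2 / (2.0 * sigma_t ** 2))

def sigma(t, sigma0=10.0, decay=0.999):
    """A monotonically decreasing sigma(t); exponential decay is an
    assumed schedule, the initial value 10.0 follows fig. 10.5."""
    return sigma0 * decay ** t

g = [(0,), (1,), (2,)]  # positions on a one-dimensional grid
print(h_gauss(g, 0, 0, sigma(0)))  # 1.0: the maximum sits at the winner itself
```

The function is unimodal with its maximum at the winner, and shrinking σ(t) narrows the neighborhood over time, exactly as the text requires.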
Other functions that can be used instead of the Gaussian function are, for instance, the cone function, the cylinder function or the Mexican hat function (fig. 10.3 on the facing page). Here, the Mexican hat function offers a particular biological motivation: Due to its negative values it rejects some neurons close to the winner neuron, a behavior that has already been observed in nature. This can cause sharply separated map areas – and that is exactly why the Mexican hat function has been suggested by Teuvo Kohonen himself. But this adjustment characteristic is not necessary for the functionality of the map; it could even be possible that the map would diverge, i.e. it could virtually explode.

As we have already seen, a decreasing neighborhood size can be realized, for example, by means of a time-dependent, monotonically decreasing σ, with the Gaussian bell being used in the topology function.

The advantage of a decreasing neighborhood size is that in the beginning a moving neuron "pulls along" many neurons in its vicinity, i.e. the randomly initialized network can unfold fast and properly in the beginning. At the end of the learning process, only a few neurons are influenced at the same time, which stiffens the network as a whole but enables a good "fine tuning" of the individual neurons.
Figure 10.3: Gaussian bell, cone function, cylinder function and the Mexican hat function suggested by Kohonen as examples for topology functions of a SOM.
Figure 10.4: Illustration of the two-dimensional input space (left) and the one-dimensional topology space (right) of a self-organizing map. Neuron 3 is the winner neuron since it is closest to p. In the topology, the neurons 2 and 4 are the neighbors of 3. The arrows mark the movement of the winner neuron and its neighbors towards the training example p. To illustrate the one-dimensional topology of the network, it is plotted into the input space by the dotted line.
Now let us take a look at the above-mentioned network with random initialization of the centers (fig. 10.4 on the left page) and enter a training example p. Obviously, in our example the input pattern is closest to neuron 3, i.e. this is the winning neuron.

We remember the learning rule for SOMs,

∆ck = η(t) · h(i, k, t) · (p − ck),

and process the three factors from the back:

Learning direction: Remember that the neuron centers ck are vectors in the input space, as well as the pattern p.

Although the center of neuron 7 – seen from the input space – is considerably closer to the input pattern p than neuron 2, neuron 2 is learning and neuron 7 is not. I want to remind the reader that the network topology specifies which neuron is allowed to learn, not its position in the input space. This is exactly the mechanism by which a topology can significantly cover an input space without having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4 the next pattern is applied, and so on. Another example of how such a one-dimensional SOM can develop in a two-dimensional input space with uniformly distributed input patterns in the course of
A remedy for topological defects could be to increase the initial values for the

Figure 10.5: Behavior of a SOM with one-dimensional topology (G = 1) after the input of 0, 100, 300, 500, 5000, 50000, 70000 and 80000 randomly distributed input patterns p ∈ R². During the training, η decreased from 1.0 to 0.1 and the σ parameter of the Gauss function decreased from 10.0 to 0.2.

Figure 10.6: End states of one-dimensional (left column) and two-dimensional (right column) SOMs on different input spaces. 200 neurons were used for the one-dimensional topology, 10 × 10 neurons for the two-dimensional topology, and 80,000 input patterns for all maps.

We have seen that a SOM is trained by entering input patterns of the input space R^N one after another, again and again, so that the SOM will be aligned with these patterns and map them. It could happen that we want a certain subset U of the input space to be mapped more precisely than the other ones.

This problem can easily be solved by means of SOMs: During the training, disproportionally many input patterns of the area U are presented to the SOM. If the number of training patterns of U ⊂ R^N presented to the SOM exceeds the number of those patterns of the remaining R^N \ U, then more neurons will group there while the remaining neurons are sparsely distributed on R^N \ U (fig. 10.8 on the next page).

As you can see in the illustration, the edge of the SOM could be deformed. This can be compensated by assigning to the edge of the input space a slightly higher probability of being hit by training patterns (an often applied approach for "reaching every nook and corner" with the SOMs).

Also, a higher learning rate is often used for edge and corner neurons, since they are only pulled into the center by the topology. This also results in a significantly improved corner coverage.

10.6 Application of SOMs

Regarding the biologically inspired associative data storage, there are many fields of application for self-organizing maps and their variations.

For example, the different phonemes of the Finnish language have successfully been mapped onto a SOM with a two-dimensional discrete grid topology, and thereby neighborhoods have been found (a SOM does nothing else than finding neighborhood relationships). So one tries once more to break down a high-dimensional space into a low-dimensional space (the topology), looks if some structures have developed – et voilà: clearly defined areas for the individual phonemes are formed.

Teuvo Kohonen himself made the effort to search many papers mentioning his SOMs for key words. In this large input space the individual papers now occupy individual positions, depending on the occurrence of key words. Then Kohonen created a SOM with G = 2 and used it to map the high-dimensional "paper space" developed by him.

Thus, it is possible to enter any paper into the completely trained SOM and look which neuron in the SOM is activated. It will be likely to discover that the papers neighbored in the topology are interesting, too. This type of brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighbored, i.e. similar, within the topology – and that is why it is so interesting.

This example shows that the position c of the neurons in the input space is not significant. It is rather interesting to see which
Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. On the left side, the chance to become a training pattern was equal for each coordinate of the input space. On the right side, for the central circle in the input space, this chance is more than ten times larger than for the remaining input space (visible in the larger pattern density in the background). In this circle the neurons are obviously more crowded and the remaining area is covered less densely, but in both cases the neurons are still evenly distributed. The two SOMs were trained by means of 80,000 training examples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5).
neuron is activated when an unknown input pattern is entered. Next, we can look at which of the previous inputs this neuron was also activated – and will immediately discover a group of very similar inputs. The more the inputs within the topology diverge, the less they have in common. Virtually, the topology generates a map of the input characteristics – reduced to descriptively few dimensions in relation to the input dimension.

Therefore, the topology of a SOM often is two-dimensional so that it can be easily visualized, while the input space can be very high-dimensional.

10.6.1 SOMs can be used to determine centers for RBF neurons

SOMs are exactly directed towards the positions of the outgoing inputs. As a result they are used, for example, to select the centers of an RBF network. We have already been introduced to the paradigm of the RBF network in chapter 6.

As we have already seen, it is possible to control which areas of the input space should be covered with higher resolution – or, in connection with RBF networks, which areas of our function the RBF network should work on with more neurons, i.e. work on more exactly. A further useful feature of the combination of RBF networks with SOMs is the topology between the RBF neurons, which is given by the SOM: During the final training of an RBF neuron it can be used to influence neighboring RBF neurons as well.

For this, many neural network simulators offer an additional so-called SOM layer in connection with the simulation of RBF networks.

10.7 Variations of SOMs

There are different variations of SOMs for different representation tasks:

10.7.1 A Neural Gas is a SOM without a static topology

The neural gas is a variation of the self-organizing maps of Thomas Martinetz [MBS93], which has been developed from the difficulty of mapping complex input information that partially only occurs in subspaces of the input space or even changes subspaces (fig. 10.9 on the next page).

The idea of a neural gas is, roughly speaking, to realize a SOM without a grid structure. Due to the fact that they are derived from the SOMs, the learning steps are very similar to the SOM learning steps, but they include an additional intermediate step:

. Again, random initialization of ck ∈ Rn

. Selection and presentation of a pattern of the input space p ∈ Rn
Figure 10.9: A figure filling different subspaces of the actual input space at different positions, which therefore can hardly be filled by a SOM.
In spite of all practical hints, it is, as always, the user's responsibility not to understand this paper as a catalog for easy answers but to explore all advantages and disadvantages himself.

Remark: Unlike a SOM, the neighborhood of a neural gas initially must refer to all neurons, since otherwise some outliers of the random initialization may never approximate the remaining group. Forgetting this is a popular error in the implementation of a neural gas.

With a neural gas it is possible to learn a kind of complex input such as in fig. 10.9 on the left page, since we are not bound to a fixed-dimensional grid. But some computational effort could be necessary for the permanent sorting of the list (here, it could be effective to store the list in an ordered data structure right from the start).

Definition 10.6 (Neural gas): A neural gas differs from a SOM by a completely dynamic neighborhood function. With every learning cycle it is decided anew which neurons are the neighborhood neurons of the winner neuron. Generally, the criterion for this decision is the distance between the neurons and the winner neuron in the input space.

10.7.2 A multi-SOM consists of several separate SOMs

In order to present another variant of the SOMs, I want to formulate an extended problem: What do we do with input patterns from which we know that they separate themselves into different (maybe disjoint) areas?

Here, the idea is to use not only one SOM but several ones: a multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a, GS06]. It is not necessary that the SOMs have the same topology or size; an M-SOM is only a combination of M SOMs.

The learning process is analogous to that of the SOMs. However, only the neurons belonging to the winner SOM of each training step are adapted. Thus, it is easy to represent two disjoint clusters of data by means of two SOMs, even if one of the clusters is not represented in every dimension of the input space RN. Actually, the individual SOMs exactly reflect these clusters.

Definition 10.7 (Multi-SOM): A multi-SOM is nothing more than the simultaneous use of M SOMs.

10.7.3 A multi-neural gas consists of several separate neural gases

Analogous to the multi-SOM, we also have a set of M neural gases: a multi-neural gas [GS06, SG06]. This construct behaves analogously to the neural gas and the M-SOM: again, only the neurons of the winner gas are adapted.

The reader certainly wonders what the advantage of a multi-neural gas is, since an individual neural gas is already capable of dividing into clusters and of working on complex input patterns with changing dimensions. Basically, this is correct, but a multi-neural gas has two serious advantages over a simple neural gas. One of them concerns the computational effort: sorting requires about n log2(n) pattern comparisons. To sort n = 2^16 input patterns,

n log2(n) = 2^16 · log2(2^16) = 1048576 ≈ 1 · 10^6
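The dynamic, distance-based neighborhood of a neural gas (Definition 10.6) might be sketched as follows. This is only an illustration: the function names and the exponential decay over the distance rank are assumptions of this sketch, not the exact formulation of [MBS93].

```python
import numpy as np

def neural_gas_step(centers, p, eta=0.1, lam=2.0):
    """One adaptation step of a neural gas (illustrative sketch).

    centers: (m, n) array of codebook vectors c_k,
    p:       input pattern from R^n.
    The dynamic neighborhood is realized by ranking all neurons by
    their distance to p; this ranking is exactly the list sorting
    whose effort is discussed in the text.
    """
    dists = np.linalg.norm(centers - p, axis=1)
    ranks = np.argsort(np.argsort(dists))      # rank 0 = winner neuron
    h = np.exp(-ranks / lam)                   # rank-based neighborhood strength
    return centers + eta * h[:, None] * (p - centers)

rng = np.random.default_rng(0)
centers = rng.random((5, 2))                   # random initialization of the c_k
for p in rng.random((1000, 2)):                # present input patterns one by one
    centers = neural_gas_step(centers, p)
```

In a multi-neural gas, the same step would be applied only to the neurons of the winner gas, i.e. the gas containing the neuron closest to p.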
Exercises
Chapter 11 Adaptive Resonance Theory

As in the other brief chapters, we want to try to figure out the basic idea of the adaptive resonance theory (abbreviated: ART) without discussing its theory profoundly.

In several sections we have already mentioned that it is difficult to use neural networks for the learning of new information in addition to, but without destroying, the already existing information. This circumstance is called the stability/plasticity dilemma.

In 1987, Stephen Grossberg and Gail Carpenter published the first version of their ART network [Gro76] in order to alleviate this problem. It was followed by a whole family of ART improvements (which we want to discuss briefly, too).

It is the idea of unsupervised learning, the aim of which is the (initially binary) pattern recognition, or more precisely the categorization of patterns into classes. Additionally, an ART network shall be capable of finding new classes.

11.1 Task and Structure of an ART Network

An ART network comprises exactly two layers: the input layer I and the recognition layer O, with the input layer being completely linked towards the recognition layer. This complete link indicates a top-down weight matrix W that contains the weight values of the connections between each neuron in the input layer and each neuron in the recognition layer (fig. 11.1 on the next page).

Simple binary patterns are entered into the input layer and transferred to the recognition layer, while the recognition layer shall return a 1-out-of-|O| encoding, i.e. it should follow the winner-takes-all
Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer, bottom:
the recognition layer. In this illustration the lateral inhibition of the recognition layer and the control
neurons are omitted.
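The 1-out-of-|O| encoding of the recognition layer – in an implementation simply a search for the most activated neuron – might be sketched like this. The weight values and the plain matrix-vector activation are illustrative assumptions of this sketch, not part of a full ART implementation.

```python
import numpy as np

def recognize(W, pattern):
    """Winner-takes-all recognition step of an ART-like network (sketch).

    W:       |O| x |I| weight matrix between input and recognition layer,
    pattern: binary input vector of length |I|.
    Returns a 1-out-of-|O| encoded activity vector: instead of modelling
    lateral inhibition, the most activated neuron is simply searched for
    (the 'IF query' mentioned in the text).
    """
    activation = W @ pattern                 # net inputs of the recognition layer
    output = np.zeros(W.shape[0])
    output[np.argmax(activation)] = 1.0      # winner takes all
    return output

W = np.array([[1.0, 1.0, 0.0, 0.0],          # class 0 prefers the first two inputs
              [0.0, 0.0, 1.0, 1.0]])         # class 1 prefers the last two
print(recognize(W, np.array([1, 1, 0, 0])))  # → [1. 0.]
```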
scheme. For instance, to realize this 1-out-of-|O| encoding the principle of lateral inhibition can be used – or, in the implementation, the most activated neuron can simply be searched for. For practical reasons, an IF query would suit this task best.

Every activity within the input layer causes an activity within the recognition layer, while in turn every activity within the recognition layer causes an activity within the input layer.

The training of the backward weights of the matrix V is a bit tricky: only the backward weights of the respective winner neuron are trained towards the input layer.

11.3 Extensions
Appendix A
Excursus: Cluster Analysis and Regional and Online Learnable Fields
In Grimm's dictionary the extinct German word "Kluster" is described by "was dicht und dick zusammensitzet" (a thick and dense group of something). In static cluster analysis, the formation of groups within point clouds is explored. We introduce some procedures, compare their advantages and disadvantages, and discuss an adaptive clustering method based on neural networks. A regional and online learnable field models, from a point cloud possibly containing a great many points, a comparatively small set of neurons that are representative for the point cloud.
tering procedure that uses a metric as a distance measure.

Now we want to introduce and briefly discuss different clustering procedures.

A.1 k-Means Clustering allocates data to a predefined number of clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is often used because of its low computation and storage complexity and which is regarded as "inexpensive and good". The operation sequence of the k-means clustering algorithm is the following:

1. Provide the data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as codebook vectors).

4. Assign each data point to the nearest codebook vector.

5. Compute the cluster centers for all clusters.

6. Set the codebook vectors to the new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

Step 2 already shows one of the great questions of the k-means algorithm: the number k of cluster centers has to be determined in advance. This cannot be done by the algorithm. The problem is that it is not necessarily known in advance how k can best be determined. Another problem is that the procedure can become quite unstable if the codebook vectors are badly initialized. But since this initialization is random, it is often useful to restart the procedure; this has the advantage of not requiring much computational effort. If you are fully aware of those weaknesses, you will receive quite good results.

However, complex structures such as "clusters in clusters" cannot be recognized. If k is high, the outer ring of the construction in the following illustration will be recognized as many single clusters. If k is low, the ring together with the small inner clusters will be recognized as one cluster.

For an illustration see the upper right part of fig. A.1 on page 172.
book vectors).
group builds a cluster. The advantage is which is the reason for the name epsilon-
that the number of clusters occurs all by it- nearest neighboring. Points are neig-
self. The disadvantage is that a large stor- bors if they are ε apart from each other at
age and computational effort is required to the most. Here, the storage and computa-
find the next neighbor (the distances be- tional effort is obviously very high, which
tween all data points must be computed is a disadvantage.
Clustering
and stored).
But note that there are some special cases:
radii around
clustering
points
There are some special cases in which the Two separate clusters can easily be con-
next
points
procedure combines data points belonging nected due to the unfavorable situation of
to different clusters, if kis too high. (see a single data point. This can also happen
the two small clusters in the upper right with k-nearest neighbouring, but it would
of the illustration). Clusters consisting of be more difficult since in this case the num-
only one single data point are basically ber of neighbors per point is limited.
conncted to another cluster, which is not
always intentional. An advantage is the symmetric nature of
the neighborhood relationships. Another
Furthermore, it is not mandatory that advantage is that the combination of min-
the links between the points are symmet- imal clusters due to a fixed number of
rical. neighbors is avoided.
But this procedure allows a recognition of On the other hand, it is necessary to skill-
rings and therefore of ”clusters in clusters”, fully initialize ε in order to be successful,
which is a clear advantage. Another ad- i.e. smaller than half the smallest distance
vantage is that the procedure adaptively between two clusters. With variable clus-
responds to the distances in and between ter and point distances within clusters this
the clusters. can possibly be a problem.
For an illustration see the lower left part For an illustration see the lower right part
of fig. A.1. of fig. A.1.
Another approach of neighboring: Here, As we can see above, there is no easy an-
the neighborhood detection does not use a swer for clustering problems. Each proce-
fixed number k of neighbors but a radius ε, dure described has very specific disadvan-
Figure A.1: Top left: our set of points. We will use this set to explore the different clustering
methods. Top right: k-means clustering. Using this procedure we chose k = 6. As we can
see, the procedure is not capable to recognize ”clusters in clusters” (bottom left of the illustration).
Long ”Lines” of points are a problem, too: SThey would be recognized as many small clusters (if k
is sufficiently large). Bottom left: k-nearest neighboring. If k is selected too high (higher than
the number of points in the smallest cluster), this will result in cluster combinations shown in the
upper right of the illustration. Bottom right: ε-nearest neighbouring. This procedure will cause
difficulties ε is selected larger than the minimum distance between two clusters (see upper left of
the illustration), which will then be combined.
tages. In this respect it is useful to have Apparently, the whole term s(p) can only
a criterion to decide how good our cluster division is. This possibility is offered by the silhouette coefficient according to [Kau90]. This coefficient measures how well the clusters are delimited from each other and indicates whether points are maybe sorted into the wrong clusters.

Let P be a point cloud and p a point in P. Let c ⊆ P be a cluster within the point cloud and let p be part of this cluster, i.e. p ∈ c. The set of clusters is called C. In summary,

p ∈ c ⊆ P

applies.

To calculate the silhouette coefficient, we initially need the average distance between point p and all its cluster neighbors. This variable is referred to as a(p) and defined as follows:

a(p) = 1/(|c| − 1) · Σ_{q ∈ c, q ≠ p} dist(p, q)   (A.1)

Furthermore, let b(p) be the average distance between our point p and all points of the next cluster (g runs over all clusters except for c):

b(p) = min_{g ∈ C, g ≠ c} 1/|g| · Σ_{q ∈ g} dist(p, q)   (A.2)

The point p is classified well if the distance to the center of its own cluster is minimal and the distance to the centers of the other clusters is maximal. In this case, the following term provides a value close to 1:

s(p) = (b(p) − a(p)) / max{a(p), b(p)}   (A.3)

Apparently, the whole term s(p) can only be within the interval [−1; 1]. A value close to −1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

S(P) = 1/|P| · Σ_{p ∈ P} s(p)   (A.4)

As above, the total quality of the cluster division is expressed by the interval [−1; 1].

As different clustering strategies with different characteristics have now been presented (lots of further material is presented in [DHS01]), as well as a measure to indicate the quality of an existing arrangement of given data into clusters, I want to introduce a clustering method based on an unsupervised learning neural network [SGE05] which was published in 2005. Like all the other methods this one may not be perfect, but it eliminates large standard weaknesses of the known clustering methods.

A.5 Regional and Online Learnable Fields are a neural clustering strategy

The paradigm of neural networks which I want to introduce now are the regional and online learnable fields, shortly referred to as ROLFs.
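Equations (A.1)–(A.4) above translate directly into code; the following sketch uses a plain O(n²) distance matrix and assumes at least two clusters:

```python
import numpy as np

def silhouette(points, labels):
    """Silhouette coefficient S(P) from equations (A.1)-(A.4) (sketch).

    points: (n, d) array of points, labels: cluster index of every point.
    Assumes at least two clusters.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        # a(p): average distance to the other points of the own cluster (A.1).
        a = dist[i, same].sum() / (same.sum() - 1) if same.sum() > 1 else 0.0
        # b(p): smallest average distance to any other cluster (A.2).
        b = min(dist[i, labels == g].mean()
                for g in set(labels.tolist()) if g != labels[i])
        s[i] = (b - a) / max(a, b)               # s(p), equation (A.3)
    return s.mean()                              # S(P), equation (A.4)

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
print(silhouette(points, np.array([0, 0, 1, 1])))  # well separated: close to 1
```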
The perceptive surface of a neuron consists of all points within the radius ρ · σ in the input space.

A.5.2 A ROLF learns unsupervised by presenting training examples online

Like many other paradigms of neural networks, our ROLF network learns by receiving many training examples p of a training set P. The learning is unsupervised. For each training example p entered into the network, two cases can occur:

1. There is one accepting neuron k for p, or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be exactly one accepting neuron insofar as the closest neuron is the accepting one. For the accepting neuron k, ck and σk are adapted.

Definition A.5 (Accepting neuron): The criterion for a ROLF neuron k to be an accepting neuron of a point p is that the point p must be located within the perceptive surface of k. If p is located in the perceptive surfaces of several neurons, then the closest neuron will be the accepting one. If there are several closest neurons, one can be chosen at random.

A.5.2.1 Both positions and radii are adapted throughout learning

Let us assume that we entered a training example p into the network and that there is an accepting neuron k. Then the radius moves towards ||p − ck|| (i.e. towards the distance between p and ck) and the center ck towards p. Additionally, let us define the two learning rates ησ and ηc for radii and centers:

ck(t + 1) = ck(t) + ηc(p − ck(t))
σk(t + 1) = σk(t) + ησ(||p − ck(t)|| − σk(t))

Note that here σk is a scalar while ck is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron): A neuron k accepted by a point p is adapted according to the following rules:

ck(t + 1) = ck(t) + ηc(p − ck(t))   (A.5)
σk(t + 1) = σk(t) + ησ(||p − ck(t)|| − σk(t))   (A.6)

A.5.2.2 The radius multiplier takes care that neurons do not only shrink

Now we can understand the function of the multiplier ρ: due to this multiplier, the perceptive surface of a neuron includes more than only the points surrounding the neuron within the radius σ. This means that, due to the above-mentioned learning rule, σ cannot only decrease but also increase.

Definition A.7 (Radius multiplier): The radius multiplier ρ > 1 is globally defined and expands the perceptive surface of a neuron k to a multiple of σk. So it is ensured that the radius σk cannot only decrease but also increase.

Generally, the radius multiplier is set to values in the lower one-digit range, such as 2 or 3.
So far we have only discussed the case in the ROLF training that there is an accepting neuron for the training example p.

A.5.2.3 As and when required, new neurons are generated

This suggests to discuss the approach for the case that there is no accepting neuron. In this case a new neuron is generated, and for the initialization of its σ different strategies are conceivable, for example:

Maximum σ: We consider the σ of all neurons and select the maximum.

Mean σ: We select the mean σ of all neurons.

Currently, the mean-σ variant is the favorite one, although the learning procedure also works with the other ones. In the minimum-σ variant the neurons tend to cover less surface, in the maximum-σ variant they tend to cover more surface.

Definition A.8 (Generating a ROLF neuron): If a new ROLF neuron k is generated by entering a training example p, then ck is initialized with p and σk according to one of the above-mentioned strategies.

Neurons are connected when their perceptive surfaces overlap (i.e. some kind of nearest neighbouring is executed with the variable perceptive surfaces). A cluster is a group of connected neurons.

The problem of point sets located at some distance from each other is addressed by using variable perceptive surfaces – which is not always the case for the two previously mentioned methods.

The ROLF compares favorably with k-means clustering as well: firstly, it is unnecessary to know the number of clusters in advance and, secondly, unlike k-means clustering it can recognize clusters enclosed by other clusters as separate clusters.

A.5.5 Initializing Radii, Learning Rates and Multiplier is not trivial

Certainly, the disadvantages of the ROLF shall not be concealed: it is not always easy to select the appropriate initial values for σ and ρ. Previous knowledge about the data set can, colloquially speaking, be included in ρ and the initial value of σ of the ROLF: fine-grained data clusters should use a small ρ and a small initial value of σ. But the smaller the ρ, the smaller the chance that the neurons will grow if necessary. Here again, there is no easy answer, just like for the learning rates ηc and ησ.

For ρ, multipliers in the lower one-digit range such as 2 or 3 are very popular. ηc and ησ successfully work with values of about 0.005 to 0.1; variations during run-time are also imaginable for this type of network. Initial values for σ generally depend on the cluster and data distribution (i.e. they often have to be tested). But compared to wrong initializations they are – at least with the mean-σ strategy – relatively robust after some training time.

As a whole, the ROLF is on a par with the other clustering methods and is particularly interesting for systems with low storage capacity or huge data sets.

A.5.6 Application examples

A first application example may be, for example, finding color clusters in RGB images. Another field of application directly described in the ROLF publication is the recognition of words transferred into a 720-dimensional feature space. Thus, we can see that ROLFs are relatively robust against higher dimensions. Further applications can be found in the field of the analysis of attacks on network systems and their classification.

Exercises

Exercise 18: Determine at least four adaptation steps for one single ROLF neuron k if the four patterns stated below are presented one after another in the indicated order. Let the initial values for the ROLF neuron be ck = (0.1, 0.1) and σk = 1. Furthermore, let ηc = 0.5 and ησ = 0. Let ρ = 3.

P = {(0.1, 0.1); (0.9, 0.1); (0.1, 0.9); (0.9, 0.9)}.
Appendix B
Excursus: Neural Networks Used for Prediction
Figure B.2: Representation of the one-step-ahead prediction. We try to calculate the future value from a series of past values. The predicting element (in this case a neural network) is referred to as predictor.
We usually have a lot of past values, so that we can set up a series of equations. Training by means of the delta rule then provides results very close to the analytical solution.
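A linear one-step-ahead predictor trained with the delta rule might look as follows. The window length, learning rate and the toy sine series are assumptions of this sketch:

```python
import numpy as np

def train_predictor(series, window=3, eta=0.05, epochs=200):
    """Linear one-step-ahead predictor trained with the delta rule (sketch).

    The predictor estimates the next value as a weighted sum of the last
    `window` values; each training pair corresponds to one of the
    equations set up from the known past values.
    """
    w = np.zeros(window)
    for _ in range(epochs):
        for t in range(window, len(series)):
            past = series[t - window:t]
            error = series[t] - w @ past     # target minus prediction
            w += eta * error * past          # delta rule update
    return w

series = np.sin(np.arange(40) * 0.3)         # toy time series
w = train_predictor(series)
prediction = w @ series[-3:]                 # one-step-ahead forecast
```

On this toy series the learned weights forecast the next value almost exactly, illustrating the near-analytical quality of the delta-rule solution mentioned above.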
Figure B.3: Representation of the two-step-ahead prediction. Attempt to predict the second future
value out of a past value series by means of a second predictor and the involvement of an already
predicted value.
Figure B.4: Representation of the direct two-step-ahead prediction. Here, the second time step is
predicted directly, the first one is omitted. Technically, it does not differ from a one-step-ahead
prediction.
daily period. We could also include an annual period in the form of the beginning of the holidays (for sure, everyone of us has already spent a lot of time on the highway because he forgot the beginning of the holidays).

B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a single time series out of several time series, if it is assumed that the additional time series is related to the future of the first one (heterogeneous one-step-ahead prediction, fig. B.5 on the next page).

If we want to predict two outputs of two related time series, it is certainly possible to perform two parallel one-step-ahead predictions (analytically this is done very often because otherwise the equations would become very confusing); or, in the case of the neural networks, an additional output neuron is attached and the knowledge of both time series is used for both outputs (fig. B.6 on the next page).

You will find more and more general material on time series in [WG94].

Share prices are discontinuous and therefore they are principally difficult functions. Furthermore, the functions can only be used for discrete values – often, for example, in a daily rhythm (including the maximum and minimum values per day, if we are lucky), with the daily variations certainly being eliminated. But this makes the whole thing even more difficult.

There are chartists, i.e. people who look at many diagrams and decide by means of a lot of background knowledge and decade-long experience whether the equities should be bought or not (and often they are very successful).

Apart from the share prices it is very interesting to predict the exchange rates of currencies: if we exchange 100 Euros into Dollars, the Dollars into Pounds and the Pounds back into Euros, it could be possible that we will finally receive 110 Euros. But once this was found out, we would do it more often and thus we would change the exchange rates into a state in which such an increasing circulation would no longer be possible (otherwise we could produce money by generating, so to speak, a financial perpetual motion machine).
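The currency round trip described above amounts to checking whether the product of the exchange rates along the cycle exceeds 1. The rates below are made-up illustrative numbers, not real market data:

```python
# Hypothetical exchange rates for the EUR -> USD -> GBP -> EUR round trip.
eur_to_usd = 1.10
usd_to_gbp = 0.82
gbp_to_eur = 1.22

amount = 100.0                                   # start with 100 Euros
amount = amount * eur_to_usd * usd_to_gbp * gbp_to_eur
print(round(amount, 2))                          # prints 110.04

# The cycle gains money exactly if the product of the rates exceeds 1.
cycle_gain = eur_to_usd * usd_to_gbp * gbp_to_eur
print(cycle_gain > 1.0)                          # prints True
```

As the text notes, exploiting such a cycle would itself shift the rates until the product drops back to 1 or below.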
Figure B.6: Heterogeneous one-step-ahead prediction of two time series at the same time.
In Great Britain, the heterogeneous one-step-ahead prediction was successfully used to increase the accuracy of such predictions to 76%: in addition to the time series of the values, indicators such as the oil price in Rotterdam or the US national debt were included.

This is just an example to show the dimension of the accuracy of stock-exchange evaluations, since we are still only talking about the first bit of the first derivative! We still do not know how strong the expected increase or decrease will be, and also whether the effort will pay off: probably, one wrong prediction could nullify the profit of one hundred correct predictions.

Time and again some software appears which uses key words such as "neural networks" to purport that it is capable of predicting where share prices are going. Do not buy such software! In addition to the above-mentioned scientific exclusions there is one simple reason for this: if these tools work – why should the manufacturer sell them? Normally, useful economic knowledge is kept secret. If we knew a way to definitely gain wealth by means of shares, we would earn our millions by using this knowledge instead of selling it for 30 euros, wouldn't we?
Appendix C
Excursus: Reinforcement Learning

I now want to introduce a more exotic approach of learning – just to leave the usual paths. We know learning procedures in which the network is exactly told what to do, i.e. we provide exemplary output values. We also know learning procedures like those of the self-organizing maps, into which only input values are entered.

Now we want to explore something in-between: the learning paradigm of reinforcement learning – reinforcement learning according to Sutton and Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the three learning paradigms already mentioned in chapter 4. In some sources it is counted among the supervised learning procedures, since a feedback is given. Due to its very rudimentary feedback it is reasonable to separate it from the supervised learning procedures – apart from the fact that there are no training examples at all.

While it is generally known that procedures such as backpropagation cannot work in the human brain itself, reinforcement learning is usually considered as being biologically more motivated.

The term reinforcement learning comes from cognitive science and psychology and it describes the learning system of carrot and stick, which occurs everywhere in nature, i.e. learning by means of good or bad experience, reward and punishment. But there is no learning aid that exactly explains what we have to do: we only receive a total result for a process (Did we win the game of chess or not? And how sure was this victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires and at a speed of exactly 21.5 km/h through a bend over some sand with a grain size of 0.1 mm on average, then nobody could tell us exactly which handlebar angle we have to adjust or, even worse, how strongly the great number of muscle parts in our arms or legs have to contract for this. Depending on whether we reach the end of the bend unharmed or not, we soon have to face the good or bad learning experience, i.e. a feedback or a reward. Thus, the reward is very simple – but, on the other hand, it is considerably easier to obtain. If we have now tested different velocities and bend angles often enough and received some rewards, we will get a feel for what works and what does not. The aim of reinforcement learning is to maintain exactly this feeling.

Another example of the quasi-impossibility to achieve a sort of cost or utility function is a tennis player who tries to maximize his athletic glory for a long time by means of complex movements and ballistic trajectories in three-dimensional space, including the wind direction, the importance of the tournament, private factors and many more.

To get straight to the point: since we receive only little feedback, reinforcement learning often means trial and error – and therefore it is very slow.

C.1 System Structure

Roughly speaking, reinforcement learning is the interaction between an agent and an environmental system (fig. C.2).

The agent shall solve some problem. He could, for instance, be an autonomous robot that shall avoid obstacles. The agent performs some actions within the environment and in return receives a feedback from the environment, which in the following is called reward. This circle of action and reward is characteristic for reinforcement learning. The agent influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned above, how well we achieve our aim, but it does not give any guidance how we can achieve it. The aim is always to make the sum of rewards as high as possible in the long term.

C.1.1 The gridworld

As a learning example for reinforcement learning I would like to use the so-called gridworld. We will see that its structure is very simple and easy to figure out, and therefore reinforcement learning is actually not necessary here. However, it is very suitable for representing the approach of reinforcement learning.
Now we know that reinforcement learning knowledge about its state. This approx-
is an interaction between the agent and imation (about which the agent cannot
the system including Actions at and sit- even know how good it is) makes clear pre-
uations st . The agent cannot determine dictions impossible.
by itself whether the current situation isDefinition C.5 (Action): Actions at n be
good or bad: This is exactly the reason performed by the agent (whereby it could
Jat
why it receives the said reward from the be possible that depending on the situa-
environment. tion another action space A(S) exists).
They cause state transitions and therefore JA(S)
In the gridworld: States are positions
where the agent can be situated. Sim- a new situation from the agent’s point of
ply said, the situations equal the states view.
in the gridworld. Possible actions would
be to move towards north, south, east or C.1.4 Reward and return
west.
Remark: Situation and action can be vec- As in real life it is our aim to receive a
torial, the reward, however, is always a recompense as high as possible, i.e. to
scalar (in an extreme case even only a bi- maximize the sum of the expected [re-
nary value) since the aim of reinforcement ward]rewards r, called return R, on the
learning is to get along with little feedback. long term. For finitely many time steps1
A complex vectorial reward would equal a the rewards can simply be added:
real teaching input.
Rt = rt+1 + rt+2 + . . . (C.3)
By the way, the cost function should be ∞
minimized, which would not be possible, = (C.4)
X
rt+x
however, with a vectorial reward since we x=1
do not have any intuitive order relations Certainly, the return is only estimated
in multi-dimensional space, i.e. we do not here (if we knew all rewards and therefore
directly know what is better or worse. the return completely, it would no longer
Definition C.3 (State): Within its envi- be necessary to learn).
ronment the agent is in a state. States Definition C.6 (Reward): A reward rt is
contain any information about the agent a scalar, real or discrete (even sometimes Jrt
within the environmental system. Thus, only binary) reward or punishment which
it is theoretically possible to clearly pre- the environmental system returns to the
dict a successor state to a performed ac- agent as reaction to an action.
tion within a deterministic system out of Definition C.7 (Return): The return R
this godlike state knowledge.
t
is the accumulation of all received rewards
Definition C.4 (Situation): Situations st JRt
1 In practice, only finitely many time steps will eb
(hier at time t) of a situation space possible, even though the formulas are stated with
st I
S are the agent’s limited, approximate an infinite sum in the first place
SI
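The components just defined – states, actions a_t and a scalar reward r_t – can be made concrete in a few lines. The following is a minimal sketch of such a gridworld, not taken from the text; the grid size, the goal position and the reward of 1 at the goal are illustrative assumptions.

```python
# Minimal gridworld sketch: states are (row, col) positions, actions move
# the agent north/south/east/west, and the environment returns a scalar
# reward. Grid layout and reward values are illustrative assumptions.
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

class GridWorld:
    def __init__(self, rows=3, cols=4, goal=(0, 3)):
        self.rows, self.cols, self.goal = rows, cols, goal

    def step(self, state, action):
        """Perform action a_t in state s_t; return (successor state, reward)."""
        dr, dc = ACTIONS[action]
        r = min(max(state[0] + dr, 0), self.rows - 1)   # walls limit movement
        c = min(max(state[1] + dc, 0), self.cols - 1)
        nxt = (r, c)
        reward = 1.0 if nxt == self.goal else 0.0       # scalar feedback only
        return nxt, reward

world = GridWorld()
state, reward = world.step((0, 2), "E")   # move east, reaching the goal
```

Note that, as in the text, the agent receives nothing but this one number per step; all learning has to be driven by it.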
with a_i ≠ a_i(s_i); i > 0. Thus, in the beginning the agent develops a plan and consecutively executes it to the end without considering the interim situations (therefore a_i ≠ a_i(s_i): actions after a_0 do not depend on the situations).

In the gridworld: In the gridworld, an open-loop policy would provide a precise sequence of directions towards the exit, such as the way from the given starting position to (in abbreviations of the directions) OOOON.

When selecting the actions to be performed, again two basic strategies can be considered.

In the gridworld: A closed-loop policy would be responsive to the current position and choose the direction according to the situation. In particular, when an obstacle appears dynamically, such a policy is the better choice.

C.1.5.1 Exploitation vs. exploration
Another approach would be to explore shorter ways every now and then, even at the risk of taking a long time and being unsuccessful, and therefore finally taking the original way and arriving too late at the restaurant.

In reality, often a combination of both methods is applied: In the beginning of the learning process, exploration takes place with a higher probability, while at the end more existing knowledge is exploited. Here, a static probability distribution is also possible and often applied.

In the gridworld: For finding the way in the gridworld, the restaurant example applies equally.

C.2 Learning process

Let us again take a look at daily life. Actions can lead us from one situation into different subsituations, from each subsituation into further sub-subsituations. In a sense, we get a situation tree where links between the nodes must be considered (often there are several ways to reach a situation, so the tree could more accurately be referred to as a situation graph). The leaves of such a tree are the end situations of the system. The exploration approach would search the tree as thoroughly as possible and become acquainted with all leaves. The exploitation approach would unerringly go to the best known leaf.

Now we have to adapt from daily life how exactly we learn.

C.2.1 Rewarding strategies

Interesting and very important is the question of what kind of reward is awarded and for what, since the design of the reward significantly controls system behavior. As we have seen above, there generally are (again, as in daily life) various actions that can be performed in any situation. There are different strategies to evaluate the selected situations and to learn which series of actions would lead to the target. First of all, this principle should be explained in the following.

We now want to indicate some extreme cases as design examples for the reward:

A rewarding scheme similar to the rewarding in a chess game is referred to as pure delayed reward: we only receive the reward at the end of the game and not during it. This method is always advantageous when we finally can say whether we were successful or not, but the interim steps do not allow an estimation of our situation. If we win, then

r_t = 0  ∀ t < τ          (C.10)

as well as r_τ = 1. If we lose, then r_τ = −1. With this rewarding strategy a reward is only returned by the leaves of the situation tree.
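The combination described above – exploring with high probability in the beginning and exploiting existing knowledge more and more towards the end – can be sketched as an ε-greedy action selection with a decaying exploration rate. The decay schedule and the table of action values used here are illustrative assumptions, not part of the text.

```python
import random

def epsilon_greedy(q_values, actions, episode,
                   eps_start=1.0, eps_end=0.05, decay=0.99):
    """Pick an action: explore with probability eps (shrinking over the
    episodes), otherwise exploit the best currently known action."""
    eps = max(eps_end, eps_start * decay ** episode)
    if random.random() < eps:
        return random.choice(actions)                    # exploration
    return max(actions, key=lambda a: q_values.get(a, 0.0))  # exploitation

# assumed action values for one situation, for illustration
q = {"N": 0.1, "E": 0.7, "S": 0.0, "W": 0.2}
action = epsilon_greedy(q, ["N", "E", "S", "W"], episode=500)
```

With `decay < 1` this realizes exactly the shift from trial and error towards routine; a fixed ε would correspond to the static probability distribution mentioned in the text.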
evaluate the current and future situations. So let us take a look at another system component of reinforcement learning: the state-value function V(s), which with regard to a policy Π is often called V_Π(s) – because whether a situation is bad often depends on the general behavior Π of the agent.

A situation that is bad under a policy that seeks risks and checks out limits would be, for instance, an agent on a bicycle turning a corner while the front wheel begins to slide out; due to its daredevil policy the agent would not brake in this situation. With a risk-aware policy the same situation would look much better, and thus it would be evaluated higher by a good state-value function.

The state-value function V_Π(s) simply returns the value the current situation s has for the agent under policy Π. Abstractly speaking, according to the above definitions, the value of the state-value function corresponds to the return R_t (the expected value) of a situation s_t. E_Π denotes the set of the expected returns under Π and the current situation s_t:

V_Π(s) = E_Π{R_t | s = s_t}

Definition C.9 (State-value function): The state-value function V_Π(s) has the task of determining the value of situations under a policy, i.e. to answer the agent's question of whether a situation s is good or bad or how good or bad it is. For this purpose it returns the expectation of the return under the situation:

V_Π(s) = E_Π{R_t | s = s_t}          (C.13)

The optimal state-value function is called V*_Π(s).

Unfortunately, unlike us, our robot does not have a godlike view of its environment. It does not have a table with optimal returns like the one shown above to orient itself. The aim of reinforcement learning is that the robot generates its state-value function little by little on the basis of the returns of many trials, and approximates the optimal state-value function V* (if there is one).

In this context I want to introduce two terms closely related to the cycle between state-value function and policy:

C.2.2.1 Policy evaluation

Policy evaluation is the approach of trying a policy a few times, providing many rewards that way, and gradually accumulating a state-value function by means of these rewards.

C.2.2.2 Policy improvement

Policy improvement means to improve a policy itself, i.e. to turn it into a new and better one. In order to improve the policy we have to aim at the return finally having a larger value than before, i.e. until we have found a shorter way to the restaurant and have walked it successfully.

The principle of reinforcement learning is to realize an interaction. It is tried to evaluate how good a policy is in individual situations. The changed state-value function
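Policy evaluation as just described – trying a policy several times and gradually accumulating a state-value function from the observed returns – can be sketched as a simple Monte Carlo average over trials. The episode format (lists of state/reward pairs) is an illustrative assumption.

```python
from collections import defaultdict

def evaluate_policy(episodes):
    """Approximate V_Pi(s) as the average return observed after visiting s.
    Each episode is a list of (state, reward received afterwards) pairs,
    all generated by the same policy Pi."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (state, _) in enumerate(episode):
            ret = sum(rewards[t:])      # return: accumulated rewards from t on
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# two trials of the same policy; the estimate accumulates over both
V = evaluate_policy([[("A", 0.0), ("B", 1.0)],
                     [("A", 0.0), ("B", 0.0)]])
```

The more trials flow into the average, the closer the accumulated V comes to the true state-value function of the policy.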
C.2.6 Q learning

This implies Q_Π(s, a) as learning formula for the action-value function and, analogously to TD learning, its application is called Q learning:

Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a)).          (C.15)

Figure C.9: Actions are performed until the desired target situation is achieved. Attention should be paid to the numbering: rewards are numbered beginning with 1, actions and situations beginning with 0 (this method has been adopted as tried and true). The figure shows the chain s_0 → s_1 → ... → s_τ traversed by the actions a_0, ..., a_{τ−1} (direction of actions), with the rewards r_1, ..., r_τ returned in the opposite direction (direction of reward).

learning is: Π can be initialized arbitrarily, and by means of Q learning the result is always Q*.

Definition C.13 (Q learning): Q learning trains the action-value function by means of the learning rule (C.15).

played backgammon knows that the situation space is huge (approx. 10^20 situations). As a result, the state-value functions cannot be computed explicitly (particularly in the late eighties when TD gammon was introduced). The selected rewarding strategy was the pure delayed reward, i.e. the system receives the reward not before the end of the game, and at the same time the reward is the return. Then the system was allowed to practice itself
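The learning rule (C.15) translates directly into code. Below is a minimal sketch of one Q learning step on a table of Q values; the values chosen for the learning rate α and the discount γ are illustrative assumptions.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One Q learning step, following (C.15):
    Q(s_t, a) <- Q(s_t, a) + alpha * (r_{t+1}
                 + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q

Q = defaultdict(float)                      # Q can be initialized arbitrarily
q_update(Q, state=0, action="E", reward=1.0, next_state=1, actions=["E", "W"])
```

Note the off-policy character visible in the code: the update uses the best action in the successor situation (`max`), regardless of which action the current policy would actually select there.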
the pit. Trivially, the executable actions here are the possibilities to drive forwards and backwards. The intuitive solution we think of immediately is to move backwards, gain momentum at the opposite slope, and oscillate several times in this way to dash out of the pit.

The actions of a reinforcement learning system would be "full throttle forward", "full reverse" and "doing nothing".

Here, "everything costs" would be a good choice for awarding the reward, so that the system learns quickly how to leave the pit and realizes that our problem cannot be solved by means of mere forward-directed engine power. So the system will slowly build up the movement.

The policy can no longer be stored as a table, since the state space is hard to discretize. A function has to be generated as policy.

C.3.3 The pole balancer

The angle of the pole relative to the vertical line is referred to as α. Furthermore, the vehicle always has a fixed position x in our one-dimensional world and a velocity ẋ. Our one-dimensional world is limited, i.e. there are maximum and minimum values x can adopt.

The aim of our system is to learn to steer the car in such a way that it can balance the pole and prevent the pole from tipping over. This is best achieved by an avoidance strategy: as long as the pole is balanced, the reward is 0. If the pole tips over, the reward is −1.

Interestingly, the system is soon capable of keeping the pole balanced by tilting it sufficiently fast and with small movements. At this the system mostly stays in the center of the space, since this is farthest from the walls, which it understands as negative (if it touches the wall, the pole will tip over).
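The avoidance strategy described above – reward 0 as long as the pole is balanced, −1 once it tips over – is easy to state as a reward function. The concrete tipping threshold used here is an assumed value for illustration; the text does not specify one.

```python
import math

def pole_reward(alpha, alpha_max=math.radians(30)):
    """Avoidance-strategy reward for the pole balancer: 0 while the pole
    angle alpha stays within bounds, -1 once it tips over.
    The 30-degree threshold is an assumption for illustration."""
    return 0.0 if abs(alpha) < alpha_max else -1.0

r = pole_reward(math.radians(5))    # pole still balanced, no punishment
```

Since the reward is 0 everywhere except at failure, this is another instance of the pure negative/delayed rewarding schemes discussed earlier: the system only learns what to avoid.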
Exercises