
27. Artificial Neural Network Models

Peter Tino, Lubica Benuskova, Alessandro Sperduti

Part D: Neural Networks

We outline the main models and developments in the broad field of artificial neural networks (ANN). A brief introduction to biological neurons motivates the initial formal neuron model – the perceptron. We then study how such formal neurons can be generalized and connected in network structures. Starting with the biologically motivated layered structure of ANN (feed-forward ANN), the networks are then generalized to include feedback loops (recurrent ANN) and even more abstract generalized forms of feedback connections (recursive neural networks), enabling processing of structured data, such as sequences, trees, and graphs. We also introduce ANN models capable of forming topographic lower-dimensional maps of data (self-organizing maps). For each ANN type we outline the basic principles of training the corresponding ANN models on an appropriate data collection.

27.1 Biological Neurons
27.2 Perceptron
27.3 Multilayered Feed-Forward ANN Models
27.4 Recurrent ANN Models
27.5 Radial Basis Function ANN Models
27.6 Self-Organizing Maps
27.7 Recursive Neural Networks
27.8 Conclusion
References

The human brain is arguably one of the most excit- important building block of more complex feed-forward
ing products of evolution on Earth. It is also the most ANN models. Such models can be used to approximate
powerful information processing tool so far. Learning complex non-linear functions or to learn a variety of as-
based on examples and parallel signal processing lead sociation tasks. The feed-forward models are capable of
to emergent macro-scale behavior of neural networks processing patterns without temporal association. In the
in the brain, which cannot be easily linked to the be- presence of temporal dependencies, e.g., when learning
havior of individual micro-scale components (neurons). to predict future elements of a time series (with certain
In this chapter, we will introduce artificial neural net- prediction horizon), the feed-forward ANN needs to
work (ANN) models motivated by the brain that can be extended with a memory mechanism to account for
learn in the presence of a teacher. During the course of temporal structure in the data. This will naturally lead
learning the teacher specifies what the right responses us to recurrent neural network models (RNN), which
to input examples should be. In addition, we will also besides feed-forward connections also contain feedback
mention ANNs that can learn without a teacher, based loops to preserve, in the form of the information pro-
on principles of self-organization. cessing state, information about the past. RNN can be
To set the context, we will begin by introducing ba- further extended to recursive ANNs (RecNN), which
sic neurobiology. We will then describe the perceptron can process structured data such as trees and acyclic
model, which, even though rather old and simple, is an graphs.

27.1 Biological Neurons


It is estimated that there are about 10^{12} neural cells (neurons) in the human brain. Two-thirds of the neurons form a 4–6 mm thick cortex that is assumed to be the center of cognitive processes. Within each neuron complex biological processes take place, ensuring that it can process signals from other neurons, as well as send its own signals to them. The signals are of an electro-chemical nature. In a simplified way, signals between the neurons can be represented by real numbers quantifying the intensity of the incoming or outgoing signals. The point of signal transmission from one neuron to the other is called the synapse. Within the synapse the incoming signal can be reinforced or damped. This is represented by the weight of the synapse. A single neuron can have up to 10^3–10^5 such points of entry (synapses). The input to the neuron is organized along the dendrites and the soma (Fig. 27.1). Thousands of dendrites form a rich tree-like structure on which most synapses reside.

Fig. 27.1 Schematic illustration of the basic information processing structure of the biological neuron (dendrites, soma, axon, terminal)

Signals from other neurons can be either excitatory (positive) or inhibitory (negative), relayed via excitatory or inhibitory synapses. When the sum of the positive and negative contributions (signals) from other neurons, weighted by the synaptic weights, becomes greater than a certain excitation threshold, the neuron will generate an electric spike that will be transmitted over the output channel called the axon. At the end of the axon there are thousands of output branches whose terminals form synapses on other neurons in the network. Typically, as a result of input excitation, the neuron can generate a series of spikes of some average frequency – about 1–10^2 Hz. The frequency is proportional to the overall stimulation of the neuron.

The first principle of information coding and representation in the brain is redundancy. It means that each piece of information is processed by a redundant set of neurons, so that in the case of partial brain damage the information is not lost completely. As a result, and crucially – in contrast to conventional computer architectures – gradually increasing damage to the computing substrate (neurons plus their interconnection structure) will only result in gradually decreasing processing capabilities (graceful degradation). Furthermore, it is important what set of neurons participates in coding a particular piece of information (distributed representation). Each neuron can participate in coding of many pieces of information, in conjunction with other neurons. The information is thus associated with patterns of distributed activity on sets of neurons.

27.2 Perceptron
The perceptron is a simple neuron model that takes input signals (patterns) coded as (real) input vectors \bar{x} = (x_1, x_2, \dots, x_{n+1}) through the associated (real) vector of synaptic weights \bar{w} = (w_1, w_2, \dots, w_{n+1}). The output o is determined by

    o = f(net) = f(\bar{w} \cdot \bar{x}) = f\left( \sum_{j=1}^{n+1} w_j x_j \right) = f\left( \sum_{j=1}^{n} w_j x_j - \theta \right) ,    (27.1)

where net denotes the weighted sum of inputs (i.e., the dot product of the weight and input vectors) and f is the activation function. By convention, if there are n inputs to the perceptron, the input (n+1) will be fixed to -1 and the associated weight to w_{n+1} = \theta, which is the value of the excitation threshold.

Fig. 27.2 Schematic illustration of the perceptron model

In 1958 Rosenblatt [27.1] introduced a discrete perceptron model with a bipolar activation function (Fig. 27.2)

    f(net) = sign(net) =
        +1  if net \ge 0 , i.e. \sum_{j=1}^{n} w_j x_j \ge \theta ,
        -1  if net < 0 ,  i.e. \sum_{j=1}^{n} w_j x_j < \theta .    (27.2)

Fig. 27.3 Linearly separable and non-separable problems

The boundary equation

    \sum_{j=1}^{n} w_j x_j - \theta = 0    (27.3)

parameterizes a hyperplane in n-dimensional space with normal vector \bar{w}.

The perceptron can classify input patterns into two classes, if the classes can indeed be separated by an (n-1)-dimensional hyperplane (27.3). In other words, the perceptron can deal with linearly-separable problems only, such as the logical functions AND or OR. XOR, on the other hand, is not linearly separable (Fig. 27.3). Rosenblatt showed that there is a simple training rule that will find the separating hyperplane, provided that the patterns are linearly separable.

As we shall see, a general rule for training many ANN models (not only the perceptron) can be formulated as follows: the weight vector \bar{w} is changed proportionally to the product of the input vector and a learning signal s. The learning signal s is a function of \bar{w}, \bar{x}, and possibly a teacher feedback d:

    s = s(\bar{w}, \bar{x}, d)  or  s = s(\bar{w}, \bar{x}) .    (27.4)

In the former case, we talk about supervised learning (with direct guidance from a teacher); the latter case is known as unsupervised learning. The update of the j-th weight can be written as

    w_j(t+1) = w_j(t) + \Delta w_j(t) = w_j(t) + \alpha \, s(t) \, x_j(t) .    (27.5)

The positive constant 0 < \alpha \le 1 is called the learning rate.

In the case of the perceptron, the learning signal is the disproportion (difference) between the desired (target) and the actual (produced by the model) response, s = d - o = \delta. The update rule is known as the \delta (delta) rule

    \Delta w_j = \alpha (d - o) x_j .    (27.6)

The same rule can, of course, be used to update the activation threshold w_{n+1} = \theta.

Consider a training set

    A_train = \{ (\bar{x}^1, d^1), (\bar{x}^2, d^2), \dots, (\bar{x}^p, d^p), \dots, (\bar{x}^P, d^P) \}

consisting of P (input, target) couples. The perceptron training algorithm can be formally written as:

- Step 1: Set \alpha \in (0, 1]. Initialize the weights randomly from (-1, 1). Set the counters to k = 1, p = 1 (k indexes sweeps through A_train, p indexes individual training patterns).
- Step 2: Consider input \bar{x}^p, calculate the output o = sign( \sum_{j=1}^{n+1} w_j x_j^p ).
- Step 3: Weight update: w_j \leftarrow w_j + \alpha (d^p - o^p) x_j^p, for j = 1, \dots, n+1.
- Step 4: If p < P, set p \leftarrow p + 1, go to step 2. Otherwise go to step 5.
- Step 5: Fix the weights and calculate the cumulative error E on A_train.
- Step 6: If E = 0, finish training. Otherwise, set p = 1, k = k + 1 and go to step 2. A new training epoch starts.
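The training loop above can be condensed into a few lines of Python. This is our own minimal illustration; the AND data set, learning rate, and epoch cap are ad hoc choices, not part of the chapter:

```python
import numpy as np

def train_perceptron(X, d, alpha=0.1, max_epochs=100, rng=None):
    """Perceptron training (steps 1-6 above).
    X: (P, n) inputs; d: (P,) bipolar targets in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    P, n = X.shape
    # Append the fixed input x_{n+1} = -1; w_{n+1} then plays the role of theta.
    Xa = np.hstack([X, -np.ones((P, 1))])
    w = rng.uniform(-1.0, 1.0, size=n + 1)          # step 1
    for epoch in range(max_epochs):
        for p in range(P):
            o = 1.0 if Xa[p] @ w >= 0 else -1.0     # sign activation (27.2)
            w += alpha * (d[p] - o) * Xa[p]         # delta rule (27.6)
        # cumulative error E on the training set (step 5)
        o_all = np.where(Xa @ w >= 0, 1.0, -1.0)
        if np.all(o_all == d):                      # E = 0: stop (step 6)
            break
    return w

# Logical AND is linearly separable, so training converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
w = train_perceptron(X, d, rng=np.random.default_rng(0))
```

Replacing the targets with those of XOR would leave the loop cycling until `max_epochs`, illustrating the linear-separability limitation discussed above.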

27.3 Multilayered Feed-Forward ANN Models



A breakthrough in our ability to construct and train more complex multilayered ANNs came in 1986, when Rumelhart et al. [27.2] introduced the error back-propagation method. It is based on making the transfer functions differentiable (hence the error functional to be minimized is differentiable as well) and finding a local minimum of the error functional by the gradient-based steepest descent method.

We will show the derivation of the back-propagation algorithm for a two-layer feed-forward ANN as demonstrated, e.g., in [27.3]. Of course, the same principles can be applied to a feed-forward ANN architecture with any (finite) number of layers. In feed-forward ANNs neurons are organized in layers. There are no connections among neurons within the same layer; connections only exist between successive layers. Each neuron from layer l has connections to each neuron in layer l + 1.

As has already been mentioned, the activation functions need to be differentiable and are usually of the sigmoid shape. The most common activation functions are

- Unipolar sigmoid:

    f(net) = \frac{1}{1 + \exp(-\lambda \, net)}    (27.7)

- Bipolar sigmoid (hyperbolic tangent):

    f(net) = \frac{2}{1 + \exp(-\lambda \, net)} - 1 .    (27.8)

The constant \lambda > 0 determines the steepness of the sigmoid curve and it is commonly set to 1. In the limit \lambda \to \infty the bipolar sigmoid tends to the sign function (used in the perceptron) and the unipolar sigmoid tends to the step function.

Consider the single-layer ANN in Fig. 27.4. The input and output vectors are \bar{y} = (y_1, \dots, y_j, \dots, y_J) and \bar{o} = (o_1, \dots, o_k, \dots, o_K), respectively, where o_k = f(net_k) and

    net_k = \sum_{j=1}^{J} w_{kj} y_j .    (27.9)

Set y_J = -1 and w_{kJ} = \theta_k, a threshold for the k = 1, \dots, K output neurons. The desired output is \bar{d} = (d_1, \dots, d_k, \dots, d_K).

Fig. 27.4 A single-layer ANN

After training, we would like, for all training patterns p = 1, \dots, P from A_train, the model output to closely resemble the desired values (target). The training problem is transformed to an optimization one by defining the error function

    E_p = \frac{1}{2} \sum_{k=1}^{K} (d_{pk} - o_{pk})^2 ,    (27.10)

where p is the training point index. E_p is the sum of squares of errors on the output neurons. During learning we seek to find the weight setting that minimizes E_p. This will be done using the gradient-based steepest descent on E_p,

    \Delta w_{kj} = -\alpha \frac{\partial E_p}{\partial w_{kj}} = -\alpha \frac{\partial E_p}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = \alpha \, \delta_{ok} \, y_j ,    (27.11)

where \alpha is a positive learning rate. Note that -\partial E_p / \partial net_k = \delta_{ok}, which is the generalized training signal on the k-th output neuron. The partial derivative \partial net_k / \partial w_{kj} is equal to y_j (27.9). Furthermore,

    \delta_{ok} = -\frac{\partial E_p}{\partial net_k} = -\frac{\partial E_p}{\partial o_k} \frac{\partial o_k}{\partial net_k} = (d_{pk} - o_{pk}) f'_k ,    (27.12)

where f'_k denotes the derivative of the activation function with respect to net_k. For the unipolar sigmoid (27.7), we have f'_k = o_k (1 - o_k). For the bipolar sigmoid (27.8), f'_k = (1/2)(1 - o_k^2). The rule for updating the j-th weight of the k-th output neuron reads as

    \Delta w_{kj} = \alpha (d_{pk} - o_{pk}) f'_k \, y_j ,    (27.13)

where (d_{pk} - o_{pk}) f'_k = \delta_{ok} is the generalized error signal flowing back through all connections ending in the k-th output neuron. Note that if we put f'_k = 1, we would obtain the perceptron learning rule (27.6).
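The two sigmoids and the derivative identities used in (27.12)–(27.13) (with \lambda = 1) can be checked numerically against a central finite difference. The following sketch is our own, not part of the chapter:

```python
import numpy as np

def unipolar(net):
    # f(net) = 1 / (1 + exp(-net))          (27.7), lambda = 1
    return 1.0 / (1.0 + np.exp(-net))

def bipolar(net):
    # f(net) = 2 / (1 + exp(-net)) - 1      (27.8), lambda = 1
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

# Derivatives expressed through the outputs, as used in (27.12)-(27.13):
#   unipolar: f' = o (1 - o)      bipolar: f' = (1/2)(1 - o^2)
net = np.linspace(-3, 3, 201)
h = 1e-6
num_uni = (unipolar(net + h) - unipolar(net - h)) / (2 * h)  # numerical f'
num_bip = (bipolar(net + h) - bipolar(net - h)) / (2 * h)
o_uni, o_bip = unipolar(net), bipolar(net)
err_uni = np.max(np.abs(num_uni - o_uni * (1 - o_uni)))      # round-off level
err_bip = np.max(np.abs(num_bip - 0.5 * (1 - o_bip ** 2)))
```

Both discrepancies stay at floating-point round-off level, confirming that the error signals below can be computed from the neuron outputs alone, without re-evaluating the exponential.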

We will now extend the network with another layer, called the hidden layer (Fig. 27.5). Input to the network is identical with the input vector \bar{x} = (x_1, \dots, x_i, \dots, x_I) for the hidden layer. The output neurons process as inputs the outputs \bar{y} = (y_1, \dots, y_j, \dots, y_J), y_j = f(net_j), from the hidden layer. Hence,

    net_j = \sum_{i=1}^{I} v_{ji} x_i .    (27.14)

As before, the last (in this case the I-th) input is fixed to -1. Recall that the same holds for the output of the J-th hidden neuron. Activation thresholds for hidden neurons are v_{jI} = \theta_j, for j = 1, \dots, J.

Fig. 27.5 A two-layer feed-forward ANN

Equations (27.11)–(27.13) describe the modification of weights from the hidden to the output layer. We will now show how to modify weights from the input to the hidden layer. We would still like to minimize E_p (27.10) through the steepest descent.

The hidden weight v_{ji} will be modified as follows

    \Delta v_{ji} = -\alpha \frac{\partial E_p}{\partial v_{ji}} = -\alpha \frac{\partial E_p}{\partial net_j} \frac{\partial net_j}{\partial v_{ji}} = \alpha \, \delta_{yj} \, x_i .    (27.15)

Again, -\partial E_p / \partial net_j = \delta_{yj} is the generalized training signal on the j-th hidden neuron that should flow on the input weights. As before, \partial net_j / \partial v_{ji} = x_i (27.14). Furthermore,

    \delta_{yj} = -\frac{\partial E_p}{\partial net_j} = -\frac{\partial E_p}{\partial y_j} \frac{\partial y_j}{\partial net_j} = -\frac{\partial E_p}{\partial y_j} f'_j ,    (27.16)

where f'_j is the derivative of the activation function in the hidden layer with respect to net_j,

    -\frac{\partial E_p}{\partial y_j} = \sum_{k=1}^{K} (d_{pk} - o_{pk}) \frac{\partial f(net_k)}{\partial y_j} = \sum_{k=1}^{K} (d_{pk} - o_{pk}) \frac{\partial f(net_k)}{\partial net_k} \frac{\partial net_k}{\partial y_j} .    (27.17)

Since f'_k is the derivative of the output neuron sigmoid with respect to net_k and \partial net_k / \partial y_j = w_{kj} (27.9), we have

    -\frac{\partial E_p}{\partial y_j} = \sum_{k=1}^{K} (d_{pk} - o_{pk}) f'_k \, w_{kj} = \sum_{k=1}^{K} \delta_{ok} w_{kj} .    (27.18)

Plugging this into (27.16) we obtain

    \delta_{yj} = \left( \sum_{k=1}^{K} \delta_{ok} w_{kj} \right) f'_j .    (27.19)

Finally, the weights from the input to the hidden layer are modified as follows

    \Delta v_{ji} = \alpha \left( \sum_{k=1}^{K} \delta_{ok} w_{kj} \right) f'_j \, x_i .    (27.20)

Consider now the general case of m hidden layers. For the n-th hidden layer we have

    \Delta v^n_{ji} = \alpha \, \delta^n_{yj} \, x^{n-1}_i ,    (27.21)

where

    \delta^n_{yj} = \left( \sum_{k=1}^{K} \delta^{n+1}_{ok} w^{n+1}_{kj} \right) (f^n_j)' ,    (27.22)

and (f^n_j)' is the derivative of the activation function of the n-th layer with respect to net^n_j.

Often, the learning speed can be improved by using the so-called momentum term

    \Delta w_{kj}(t) \leftarrow \Delta w_{kj}(t) + \mu \, \Delta w_{kj}(t-1) ,
    \Delta v_{ji}(t) \leftarrow \Delta v_{ji}(t) + \mu \, \Delta v_{ji}(t-1) ,    (27.23)

where \mu \in (0, 1] is the momentum rate.

Consider a training set

    A_train = \{ (\bar{x}^1, \bar{d}^1), (\bar{x}^2, \bar{d}^2), \dots, (\bar{x}^p, \bar{d}^p), \dots, (\bar{x}^P, \bar{d}^P) \} .

The back-propagation algorithm for training feed-forward ANNs can be summarized as follows:

- Step 1: Set \alpha \in (0, 1]. Randomly initialize weights to small values, e.g., in the interval (-0.5, 0.5). Counters and the error are initialized as follows: k = 1, p = 1, E = 0. E denotes the accumulated error across training patterns

    E = \sum_{p=1}^{P} E_p ,    (27.24)

  where E_p is given in (27.10). Set a tolerance threshold \varepsilon for the error. The threshold will be used to stop the training process.
- Step 2: Apply input \bar{x}^p and compute the corresponding \bar{y}^p and \bar{o}^p.
- Step 3: For every output neuron, calculate \delta_{ok} (27.12); for the hidden neurons determine \delta_{yj} (27.19).
- Step 4: Modify the weights w_{kj} \leftarrow w_{kj} + \alpha \delta_{ok} y_j and v_{ji} \leftarrow v_{ji} + \alpha \delta_{yj} x_i.
- Step 5: If p < P, set p = p + 1 and go to step 2. Otherwise go to step 6.
- Step 6: Fixing the weights, calculate E. If E < \varepsilon, stop training; otherwise permute the elements of A_train, set E = 0, p = 1, k = k + 1, and go to step 2.

Consider a feed-forward ANN with fixed weights and a single output unit. It can be considered a real-valued function G on I-dimensional vectorial inputs,

    G(\bar{x}) = f\left( \sum_{j=1}^{J} w_j \, f\left( \sum_{i=1}^{I} v_{ji} x_i \right) \right) .

There has been a series of results showing that such a parameterized function class is sufficiently rich in the space of reasonable functions (see, e.g., [27.4]). For example, for any smooth function F over a compact domain and a precision threshold \varepsilon, for a sufficiently large number J of hidden units there is a weight setting so that G is not further away from F than \varepsilon (in L-2 norm).

When training a feed-forward ANN a key decision must be made about how complex the model should be. In other words, how many hidden units J one should use. If J is too small, the model will be too rigid (high bias) and it will not be able to sufficiently adapt to the data. However, under different samples from the same data generating process, the resulting trained models will vary relatively little (low variance). On the other hand, if J is too high, the model will be too complex, modeling even such irrelevant features of the data as output noise. The particular data will be interpolated exactly (low bias), but the variability of fitted models under different training samples from the same process will be immense. It is, therefore, important to set J to an optimal value, reflecting the complexity of the data generating process. This is usually achieved by splitting the data into three disjoint sets – training, validation, and test sets. Models with different numbers of hidden units are trained on the training set; their performance is then checked on a held-out validation set. The optimal number of hidden units is selected based on the (smallest) validation error. Finally, the test set is used for independent comparison of selected models from different model classes.

If the data set is not large enough, one can perform such a model selection using k-fold cross-validation. The data for model construction (this data would be considered training and validation sets in the scenario above) is split into k disjoint folds. One fold is selected as the validation fold; the other k - 1 will be used for training. This is repeated k times, yielding k estimates of the validation error. The validation error is then calculated as the mean of those k estimates.

We have described data-based methods for model selection. Other alternatives are available. For example, by turning an ANN into a probabilistic model (e.g., by including an appropriate output noise model), under some prior assumptions on the weights (e.g., a-priori small weights are preferred), one can perform Bayesian model selection (through model evidence) [27.5].

There are several seminal books on feed-forward ANNs with well-documented theoretical foundations and practical applications, e.g., [27.3, 6, 7]. We refer the interested reader to those books as good starting points, as the breadth of theory and applications of feed-forward ANNs is truly immense.
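The back-propagation steps above can be condensed into a short sketch. This is our own illustration, not the chapter's code; the XOR task, J = 8 hidden units, learning rate, and epoch count are ad hoc assumptions:

```python
import numpy as np

def train_backprop(X, D, J=8, alpha=0.5, epochs=4000, rng=None):
    """Back-propagation for a two-layer feed-forward ANN (steps 1-6 above),
    with unipolar sigmoids (lambda = 1) in both layers."""
    rng = np.random.default_rng() if rng is None else rng
    P, I = X.shape
    K = D.shape[1]
    f = lambda net: 1.0 / (1.0 + np.exp(-net))
    Xa = np.hstack([X, -np.ones((P, 1))])        # fixed input x_I = -1 (threshold)
    V = rng.uniform(-0.5, 0.5, size=(J, I + 1))  # input -> hidden weights
    W = rng.uniform(-0.5, 0.5, size=(K, J + 1))  # hidden -> output weights

    def predict(x_aug):
        y = np.append(f(V @ x_aug), -1.0)        # hidden layer, y_J = -1
        return f(W @ y)

    for _ in range(epochs):
        for p in rng.permutation(P):             # step 6: permuted sweeps
            y = np.append(f(V @ Xa[p]), -1.0)
            o = f(W @ y)
            delta_o = (D[p] - o) * o * (1 - o)                       # (27.12)
            delta_y = (W[:, :J].T @ delta_o) * y[:J] * (1 - y[:J])   # (27.19)
            W += alpha * np.outer(delta_o, y)                        # (27.13)
            V += alpha * np.outer(delta_y, Xa[p])                    # (27.20)
    return predict, Xa

# XOR: not linearly separable, hence unsolvable for a single perceptron.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
predict, Xa = train_backprop(X, D, rng=np.random.default_rng(1))
```

With the hidden layer, the network can carve the non-linearly-separable XOR classes, which illustrates the representational gain over the single perceptron of Sect. 27.2.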

27.4 Recurrent ANN Models


Consider a situation where the associations in the training set we would like to learn are of the following (abstract) form: a → α, b → β, b → α, b → γ, c → α, c → γ, d → α, etc., where the Latin and Greek letters stand for input and output vectors, respectively. It is clear that now for one input item there can be different output associations, depending on the temporal context in which the training items are presented. In other words, the model output is determined not only by the input, but also by the history of presented items so far.

Obviously, the feed-forward ANN model described in the previous section cannot be used in such cases and the model must be further extended so that the temporal context is properly represented.

The architecturally simplest solution is provided by the so-called time delay neural network (TDNN) (Fig. 27.6). The input window into the past has a finite length D. If the output is an estimate of the next item of the input time series, such a network realizes a non-linear autoregressive model of order D.

Fig. 27.6 TDNN of order D

If we are lucky, even such a simple solution can be sufficient to capture the temporal structure hidden in the data. An advantage of the TDNN architecture is that some training methods developed for feed-forward networks can be readily used. A disadvantage of TDNN networks is that fixing a finite order D may not be adequate for modeling the temporal structure of the data generating source. TDNN enables the feed-forward ANN to see, besides the current input at time t, the other inputs from the past up to time t - D. Of course, during the training, it is now imperative to preserve the order of training items in the training set. TDNN has been successfully applied in many fields where spatial-temporal structures are naturally present, such as robotics, speech recognition, etc. [27.8, 9].

In order to extend the ANN architecture so that a variable (even unbounded) length of input window can be flexibly considered, we need a different way of capturing the temporal context. This is achieved through the so-called state space formulation. In this case, we will need to change our outlook on training. The new architectures of this type are known as recurrent neural networks (RNN).

As in feed-forward ANNs, there are connections between the successive layers. In addition, and in contrast to feed-forward ANNs, connections between neurons of the same layer are allowed, but subject to a time delay. It may also be possible to have connections from a higher-level layer to a lower layer, again subject to a time delay. In many cases it is, however, more convenient to introduce an additional fictional context layer that contains delayed activations of neurons from the selected layer(s) and represent the resulting RNN architecture as a feed-forward architecture with some fixed one-to-one delayed connections. As an example, consider the so-called simple recurrent network (SRN) of Elman [27.10] shown in Fig. 27.7. The output of the SRN at time t is given by

    o^{(t)}_k = f\left( \sum_{j=1}^{J} m_{kj} \, y^{(t)}_j \right) ,
    y^{(t)}_j = f\left( \sum_{i=1}^{J} w_{ji} \, y^{(t-1)}_i + \sum_{i=1}^{I} v_{ji} \, x^{(t)}_i \right) .    (27.25)

Fig. 27.7 Schematic depiction of the SRN architecture

Fig. 27.8 Schematic depiction of the Jordan's RNN architecture

The hidden layer constitutes the state of the input-driven dynamical system whose role it is to represent the relevant (with respect to the output) information about the input history seen so far.
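The SRN state update and output (27.25) amount to only a few lines of code. In this sketch of ours, f is the bipolar sigmoid (tanh) and the random weights are untrained placeholders, chosen only to demonstrate the recursion:

```python
import numpy as np

def srn_forward(x_seq, W, V, M, f=np.tanh):
    """Run an SRN over an input sequence, following (27.25).
    x_seq: (T, I) inputs; W: (J, J) recurrent, V: (J, I) input,
    M: (K, J) hidden-to-output weights."""
    J = W.shape[0]
    y = np.zeros(J)                    # context units: fixed initial state
    outputs = []
    for x in x_seq:
        y = f(W @ y + V @ x)           # new state from old state + current input
        outputs.append(f(M @ y))       # output read from the hidden state
    return np.array(outputs)

rng = np.random.default_rng(0)
T, I, J, K = 5, 3, 4, 2
out = srn_forward(rng.normal(size=(T, I)),
                  rng.normal(scale=0.5, size=(J, J)),
                  rng.normal(scale=0.5, size=(J, I)),
                  rng.normal(scale=0.5, size=(K, J)))
```

Because the state y is carried across time steps, identical inputs presented in different temporal contexts can produce different outputs, which is exactly the property motivated at the start of this section.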

The state (as in a generic state space model) is updated recursively.

Many variations on such architectures with time-delayed feedback loops exist. For example, Jordan [27.11] suggested feeding back the outputs as the relevant temporal context, and Bengio et al. [27.12] mixed the temporal context representations of the SRN and the Jordan network into a single architecture. Schematic representations of these architectures are shown in Figs. 27.8 and 27.9.

Fig. 27.9 Schematic depiction of the Bengio's RNN architecture

Training in such architectures is more complex than training of feed-forward ANNs. The principal problem is that changes in weights propagate in time and this needs to be explicitly represented in the update rules. We will briefly mention two approaches to training RNNs, namely back-propagation through time (BPTT) [27.13] and real-time recurrent learning (RTRL) [27.14]. We will demonstrate BPTT on a classification task, where the label of the input sequence is known only after T time steps (i.e., after T input items have been processed). The RNN is unfolded in time to form a feed-forward network with T hidden layers. Figure 27.10 shows a simple two-neuron RNN and Fig. 27.11 represents its unfolded form for T = 2 time steps.

Fig. 27.10 A two-neuron SRN

Fig. 27.11 Two-neuron SRN unfolded in time for T = 2

The first input comes at time t = 1 and the last at t = T. Activities of the context units are initialized at the beginning of each sequence to some fixed numbers. The unfolded network is then trained as a feed-forward network with T hidden layers. At the end of the sequence, the model output is determined as

    o^{(T)}_k = f\left( \sum_{j=1}^{J} m^{(T)}_{kj} \, y^{(T)}_j \right) ,
    y^{(t)}_j = f\left( \sum_{i=1}^{J} w^{(t)}_{ji} \, y^{(t-1)}_i + \sum_{i=1}^{I} v^{(t)}_{ji} \, x^{(t)}_i \right) .    (27.26)

Having the model output enables us to compute the error

    E^{(T)} = \frac{1}{2} \sum_{k=1}^{K} \left( d^{(T)}_k - o^{(T)}_k \right)^2 .    (27.27)

The hidden-to-output weights are modified according to

    \Delta m^{(T)}_{kj} = -\alpha \frac{\partial E^{(T)}}{\partial m_{kj}} = \alpha \, \delta^{(T)}_k \, y^{(T)}_j ,    (27.28)

where

    \delta^{(T)}_k = \left( d^{(T)}_k - o^{(T)}_k \right) f'\left( net^{(T)}_k \right) .    (27.29)

The other weight updates are calculated as follows

    \Delta w^{(T)}_{hj} = -\alpha \frac{\partial E^{(T)}}{\partial w_{hj}} = \alpha \, \delta^{(T)}_h \, y^{(T-1)}_j ;
    \delta^{(T)}_h = \left( \sum_{k=1}^{K} \delta^{(T)}_k m^{(T)}_{kh} \right) f'\left( net^{(T)}_h \right)    (27.30)

    \Delta v^{(T)}_{ji} = -\alpha \frac{\partial E^{(T)}}{\partial v_{ji}} = \alpha \, \delta^{(T)}_j \, x^{(T)}_i ;
    \delta^{(T)}_j = \left( \sum_{k=1}^{K} \delta^{(T)}_k m^{(T)}_{kj} \right) f'\left( net^{(T)}_j \right)    (27.31)

    \Delta w^{(T-1)}_{hj} = \alpha \, \delta^{(T-1)}_h \, y^{(T-2)}_j ;
    \delta^{(T-1)}_h = \left( \sum_{j=1}^{J} \delta^{(T)}_j w^{(T)}_{jh} \right) f'\left( net^{(T-1)}_h \right)    (27.32)

    \Delta v^{(T-1)}_{ji} = \alpha \, \delta^{(T-1)}_j \, x^{(T-1)}_i ;
    \delta^{(T-1)}_j = \left( \sum_{h=1}^{J} \delta^{(T)}_h w^{(T)}_{hj} \right) f'\left( net^{(T-1)}_j \right) ,    (27.33)

etc. The final weight updates are the averages of the T partial weight update suggestions calculated on the unfolded network

    \Delta w_{hj} = \frac{1}{T} \sum_{t=1}^{T} \Delta w^{(t)}_{hj}  and  \Delta v_{ji} = \frac{1}{T} \sum_{t=1}^{T} \Delta v^{(t)}_{ji} .    (27.34)

For every new training sequence (of possibly different length T) the network is unfolded to the desired length and the weight update process is repeated. In some cases (e.g., continual prediction on time series), it is necessary to set the maximum unfolding length L that will be used in every update step. Of course, in such cases we can lose vital information from the past. This problem is eliminated in the RTRL methodology.

Consider again the SRN architecture in Fig. 27.7. In RTRL the weights are updated on-line, i.e., at every time step t

    \Delta w^{(t)}_{kj} = -\alpha \frac{\partial E^{(t)}}{\partial w^{(t)}_{kj}} ,
    \Delta v^{(t)}_{ji} = -\alpha \frac{\partial E^{(t)}}{\partial v^{(t)}_{ji}} ,
    \Delta m^{(t)}_{jl} = -\alpha \frac{\partial E^{(t)}}{\partial m^{(t)}_{jl}} .    (27.35)

The updates of the hidden-to-output weights are straightforward

    \Delta m^{(t)}_{kj} = \alpha \, \delta^{(t)}_k \, y^{(t)}_j = \alpha \left( d^{(t)}_k - o^{(t)}_k \right) f'_k\left( net^{(t)}_k \right) y^{(t)}_j .    (27.36)

For the other weights we have

    \Delta v^{(t)}_{ji} = \alpha \sum_{k=1}^{K} \delta^{(t)}_k \left( \sum_{h=1}^{J} m_{kh} \frac{\partial y^{(t)}_h}{\partial v_{ji}} \right) ,
    \Delta w^{(t)}_{ji} = \alpha \sum_{k=1}^{K} \delta^{(t)}_k \left( \sum_{h=1}^{J} m_{kh} \frac{\partial y^{(t)}_h}{\partial w_{ji}} \right) ,    (27.37)

where

    \frac{\partial y^{(t)}_h}{\partial v_{ji}} = f'\left( net^{(t)}_h \right) \left( x^{(t)}_i \, \delta^{Kron.}_{jh} + \sum_{l=1}^{J} w_{hl} \frac{\partial y^{(t-1)}_l}{\partial v_{ji}} \right)    (27.38)

    \frac{\partial y^{(t)}_h}{\partial w_{ji}} = f'\left( net^{(t)}_h \right) \left( y^{(t-1)}_i \, \delta^{Kron.}_{jh} + \sum_{l=1}^{J} w_{hl} \frac{\partial y^{(t-1)}_l}{\partial w_{ji}} \right) ,    (27.39)

and \delta^{Kron.}_{jh} is the Kronecker delta (\delta^{Kron.}_{jh} = 1 if j = h; \delta^{Kron.}_{jh} = 0 otherwise). The partial derivatives required for the weight updates can be recursively updated using (27.37)–(27.39). To initialize training, the partial derivatives at t = 0 are usually set to 0.

There is a well-known problem associated with gradient-based parameter fitting in recurrent networks (and, in fact, in any parameterized state space models of similar form) [27.15]. In order to latch an important piece of past information for future use, the state-transition dynamics (27.25) should have an attractive set. However, in the neighborhood of such an attractive set, the derivatives of the dynamic state-transition map

vanish. Vanishingly small derivatives cannot be reliably propagated back through time in order to form a useful latching set. This is known as the information latching problem. Several suggestions for dealing with the information latching problem have been made, e.g., [27.16]. The most prominent include long short term memory (LSTM) RNN [27.17] and reservoir computation models [27.18].

LSTM models operate with a specially designed formal neuron model that contains so-called gate units. The gates determine when the input is significant (in terms of the task given) to be remembered, whether the neuron should continue to remember the value, and when the value should be output. The LSTM architecture is especially suitable for situations where there are long time intervals of unknown size between important events. LSTM models have been shown to provide superior results over traditional RNNs in a variety of applications (e.g., [27.19, 20]).

Reservoir computation models try to avoid the information latching problem by fixing the state-transition part of the RNN. Only the linear readout from the state activations (hidden recurrent layer) producing the output is fit to the data. The state space with the associated dynamic state transition structure is called the reservoir. The main idea is that the reservoir should be sufficiently complex so as to capture a large number of potentially useful features of the input stream that can then be exploited by the simple readout.

The reservoir computing models differ in how the fixed reservoir is constructed and what form the readout takes. For example, echo state networks (ESN) [27.21] have fixed RNN dynamics (27.25), but with a linear hidden-to-output layer map. Liquid state machines (LSM) [27.22] also have (mostly) linear readouts, but the reservoirs are realized through the dynamics of a set of coupled spiking neuron models. Fractal prediction machines (FPM) [27.23] are reservoir RNN models for processing discrete sequences. The reservoir dynamics is driven by an affine iterative function system and the readout is constructed as a collection of multinomial distributions. Reservoir models have been successfully applied in many practical applications with competitive results, e.g., [27.21, 24, 25].

Several books that are solely dedicated to RNNs have appeared, e.g., [27.26–28], and they contain a much deeper elaboration on the theory and practice of RNNs than we were able to provide here.
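As an illustration of the reservoir idea, the following echo-state-style sketch is our own construction (the reservoir size, the spectral-radius scaling of 0.9, the noisy-sine task, and the washout length are all ad hoc assumptions): the recurrent dynamics are fixed and random, and only a linear readout is fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 1, 100                                  # input and reservoir sizes
V = rng.uniform(-0.5, 0.5, size=(J, I))        # fixed input weights
W = rng.normal(size=(J, J))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # scale spectral radius below 1

def reservoir_states(u):
    """Drive the fixed reservoir (the dynamics of (27.25) with frozen W, V)
    with a scalar input series u; collect the state at every step."""
    y = np.zeros(J)
    states = []
    for u_t in u:
        y = np.tanh(W @ y + V @ np.atleast_1d(u_t))
        states.append(y.copy())
    return np.array(states)

# Next-step prediction of a noisy sine; only the linear readout is trained.
t = np.arange(500)
u = np.sin(0.2 * t) + 0.01 * rng.normal(size=t.size)
S = reservoir_states(u[:-1])                   # states for inputs u_0 .. u_{T-2}
washout = 50                                   # discard the initial transient
readout, *_ = np.linalg.lstsq(S[washout:], u[1:][washout:], rcond=None)
pred = S[washout:] @ readout
```

Note that no gradients ever flow through the recurrent weights, which is precisely how these models sidestep the information latching problem described above.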

27.5 Radial Basis Function ANN Models


In this section we will introduce another implementation of the idea of feed-forward ANN. The activations of hidden neurons are again determined by the closeness of inputs \bar{x} = (x_1, x_2, \dots, x_n) to weights \bar{c} = (c_1, c_2, \dots, c_n). Whereas in the feed-forward ANN in Sect. 27.3 the closeness is determined by the dot-product of \bar{x} and \bar{c}, followed by the sigmoid activation function, in radial basis function (RBF) networks the closeness is determined by the squared Euclidean distance of \bar{x} and \bar{c}, transferred through the inverse exponential. The output of the j-th hidden unit with input weight vector \bar{c}_j is given by

    \varphi_j(\bar{x}) = \exp\left( - \frac{\| \bar{x} - \bar{c}_j \|^2}{\sigma_j^2} \right) ,    (27.40)

where \sigma_j is the activation strength parameter of the j-th hidden unit and determines the width of the spherical (un-normalized) Gaussian. The output neurons are usually linear (for regression tasks)

    o_k(\bar{x}) = \sum_{j=1}^{J} w_{kj} \, \varphi_j(\bar{x}) .    (27.41)

The RBF network in this form can be simply viewed as a form of kernel regression. The J functions \varphi_j form a set of J linearly independent basis functions (e.g., if all the centers \bar{c}_j are different) whose span (the set of all their linear combinations) forms a linear subspace of functions that are realizable by the given RBF architecture (with given centers \bar{c}_j and kernel widths \sigma_j).

For the training of RBF networks, it is important that the basis functions \varphi_j(\bar{x}) cover the structure of the input space faithfully. Given a set of training inputs \bar{x}^p from A_train = \{ (\bar{x}^1, \bar{d}^1), (\bar{x}^2, \bar{d}^2), \dots, (\bar{x}^P, \bar{d}^P) \}, many RBF-ANN training algorithms determine the centers \bar{c}_j and widths \sigma_j based on the inputs \{ \bar{x}^1, \bar{x}^2, \dots, \bar{x}^P \} only. One can employ different clustering algorithms, e.g., k-means [27.29], which attempts to position the centers among the training inputs so that the overall sum of (Euclidean) distances between the centers and the inputs they represent (i.e., the inputs falling in their respective Voronoi compartments – the set of inputs for which the current center is the closest among all the centers) is minimized:
• Step 1: Set J, the number of hidden units. The optimum value of J can be obtained through a model selection method, e.g., cross-validation.
• Step 2: Randomly select J training inputs that will form the initial positions of the J centers \bar{c}_j.
• Step 3: At time step t:
  a) Pick a training input \bar{x}(t) and find the center \bar{c}(t) closest to it.
  b) Shift the center \bar{c}(t) towards \bar{x}(t):

    \bar{c}(t) \leftarrow \bar{c}(t) + \eta(t) \left( \bar{x}(t) - \bar{c}(t) \right) , \quad \text{where } 0 \le \eta(t) \le 1 .    (27.42)

The learning rate \eta(t) usually decreases in time towards zero. The training is stopped once the centers settle in their positions and move only slightly (some norm of the weight updates falls below a certain threshold). Since k-means is guaranteed to find only locally optimal solutions, it is worth re-initializing the centers and re-running the algorithm several times, keeping the solution with the lowest quantization error.

Once the centers are in their positions, it is easy to determine the RBF widths, and once this is done, the output weights can be solved for using methods of linear regression.

Of course, it is better to position the centers with respect to both the inputs and the target outputs in the training set. This can be formulated, e.g., as a gradient descent optimization. Furthermore, covering the input space with spherical Gaussian kernels may not be optimal, and algorithms have been developed for learning general covariance structures. A comprehensive review of RBF networks can be found, e.g., in [27.30].

Recently, it was shown that if enough hidden units are used, their centers can be set randomly at very little cost, and determination of the only remaining free parameters – the output weights – can be done cheaply and in closed form through linear regression. Such architectures, known as extreme learning machines [27.31], have shown surprisingly high performance levels. The idea of extreme learning machines can be considered as analogous to the idea of reservoir computation, but in the static setting. Of course, extreme learning machines can be built using other implementations of feed-forward ANNs, such as the sigmoid networks of Sect. 27.3.
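The whole pipeline – on-line k-means for the centers as in (27.42), a width heuristic, basis activations as in (27.40), and linear regression for the output weights as in (27.41) – can be sketched as follows. The toy data, the number of hidden units, the learning-rate schedule, and the width heuristic are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression task (illustrative data, not from the chapter).
X = rng.uniform(-3, 3, (200, 1))
d = np.sin(X[:, 0])

J = 10  # Step 1: number of hidden units

# Step 2: initialize centers at randomly chosen training inputs.
centers = X[rng.choice(len(X), J, replace=False)].copy()

# Step 3: on-line k-means, c(t) <- c(t) + eta(t) (x(t) - c(t)), Eq. (27.42).
for t in range(2000):
    eta = 0.5 / (1 + 0.01 * t)                         # decaying learning rate
    x = X[rng.integers(len(X))]                        # pick a training input
    w = np.argmin(np.sum((centers - x) ** 2, axis=1))  # closest center
    centers[w] += eta * (x - centers[w])

# Set a common width, e.g., half the mean distance between centers (heuristic).
sigma = np.mean(np.linalg.norm(centers[:, None] - centers[None, :], axis=2)) / 2

def phi(X):
    """Basis activations phi_j(x) = exp(-||x - c_j||^2 / sigma^2), Eq. (27.40)."""
    sq = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / sigma ** 2)

# Output weights by linear least squares, Eq. (27.41).
W_out, *_ = np.linalg.lstsq(phi(X), d, rcond=None)
mse = np.mean((phi(X) @ W_out - d) ** 2)
```

Skipping the k-means step and keeping the randomly chosen centers turns this sketch into the extreme-learning-machine variant discussed above: only the final least-squares step remains.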

27.6 Self-Organizing Maps


In this section we will introduce ANN models that learn without any signal from a teacher, i.e., learning is based solely on training inputs – there are no output targets. The ANN architecture designed to operate in this setting was introduced by Kohonen under the name self-organizing map (SOM) [27.32]. This model is motivated by the organization of neuron sensitivities in the brain cortex.

In Fig. 27.12a we show a schematic illustration of one of the principal organizations of biological neural networks. In the bottom layer (grid) there are receptors representing the inputs. Every element of the inputs (each receptor) has forward connections to all neurons in the upper layer representing the cortex. The neurons are organized spatially on a grid. Outputs of the neurons represent the activation of the SOM network. The neurons, besides receiving connections from the input receptors, have a lateral interconnection structure among themselves, with connections that can be excitatory or inhibitory. In Fig. 27.12b we show a formal SOM architecture – neurons spatially organized on a grid receive inputs (elements of input vectors) through connections with synaptic weights.

Fig. 27.12a,b Schematic representation of the SOM ANN architectures

A particular feature of the SOM is that it can map the training set on the neuron grid in a manner that preserves the training set's topology – two input patterns close in the input space will activate neurons that are close on the SOM grid. Such topological mapping of inputs (feature mapping) has been observed in biological neural networks [27.32] (e.g., visual maps, orientation maps of visual contrasts, or auditory maps, frequency maps of acoustic stimuli).

Teuvo Kohonen presented one possible realization of the Hebb rule that is used to train the SOM. Input
weights of the neurons are initialized as small random numbers. Consider a training set of inputs, A_{train} = \{\bar{x}_p\}_{p=1}^{P}, and linear neurons

    o_i = \sum_{j=1}^{m} w_{ij} x_j = \bar{w}_i \bar{x} ,    (27.43)

where m is the input dimension and i = 1, \ldots, n. Training inputs are presented in random order. At each training step, we find the (winner) neuron with the weight vector most similar to the current input \bar{x}. The measure of similarity can be based on the dot product, i.e., the index of the winner neuron is i^* = \arg\max_i (\bar{w}_i^T \bar{x}), or on the (Euclidean) distance, i^* = \arg\min_i \|\bar{x} - \bar{w}_i\|. After identifying the winner, the learning continues by adapting the winner's weights along with the weights of all its neighbors on the neuron grid. This ensures that nearby neurons on the grid will eventually represent similar inputs in the input space. The adaptation is moderated by a neighborhood function h(i^*, i) that, given a winner neuron index i^*, quantifies how many other neurons on the grid should be adapted:

    \bar{w}_i(t+1) = \bar{w}_i(t) + \alpha(t)\, h(i^*, i) \left( \bar{x}(t) - \bar{w}_i(t) \right) .    (27.44)

The learning rate \alpha(t) \in (0, 1) decays in time, e.g., as 1/t or \exp(-kt), where k is a positive time scale constant. This ensures convergence of the training process. The simplest form of the neighborhood function operates with rectangular neighborhoods,

    h(i^*, i) = \begin{cases} 1 , & \text{if } d_M(i^*, i) \le \lambda(t) \\ 0 , & \text{otherwise} , \end{cases}    (27.45)

where d_M(i^*, i) represents the (Manhattan) distance between neurons i^* and i on the map grid. The neighborhood size \lambda(t) should decrease in time, e.g., through an exponential decay \exp(-qt), with time scale q > 0. Another often used neighborhood function is the Gaussian kernel

    h(i^*, i) = \exp\left( -\frac{d_E^2(i^*, i)}{\lambda^2(t)} \right) ,    (27.46)

where d_E(i^*, i) is the Euclidean distance between i^* and i on the grid, i.e., d_E(i^*, i) = \|\bar{r}_i - \bar{r}_{i^*}\|, where \bar{r}_i is the co-ordinate vector of the i-th neuron on the SOM grid.

Training of SOM networks can be summarized as follows:

• Step 1: Set \alpha_0, \lambda_0, and t_{max} (the maximum number of iterations). Randomly (e.g., with uniform distribution) generate the synaptic weights (e.g., from (-0.5, 0.5)). Initialize the counters: t = 0, p = 1; t indexes time steps (iterations) and p is the input pattern index.
• Step 2: Take input \bar{x}_p and find the corresponding winner neuron.
• Step 3: Update the weights of the winner and its topological neighbors on the grid (as determined by the neighborhood function). Increment t.
• Step 4: Update \alpha and \lambda.
• Step 5: If p < P, set p \leftarrow p + 1 and go to step 2 (we can also use randomized selection); otherwise go to step 6.
• Step 6: If t = t_{max}, finish the training process. Otherwise set p = 1 and go to step 2. A new training epoch begins.

The SOM network can be used as a tool for non-linear data visualization (grid dimensions 1, 2, or 3). In general, SOM implements constrained vector quantization, where the codebook vectors (vector quantization centers) cannot move freely in the data space during adaptation, but are constrained to lie on a lower-dimensional manifold in the data space. The dimensionality of this manifold is equal to the dimensionality of the neural grid. The neural grid can be viewed as a discretized version of a local co-ordinate system (e.g., the computer screen) and the weight vectors in the data space (connected by the neighborhood structure on the neuron grid) as its image in the data space. In this interpretation, the neuron positions on the grid represent co-ordinate functions (in the sense of differential geometry) mapping elements of the manifold to the co-ordinate system. Hence, the SOM algorithm can also be viewed as one particular implementation of manifold learning.

There have been numerous successful applications of SOM in a wide variety of areas, e.g., in image processing, computer vision, robotics, bioinformatics, process analysis, and telecommunications. A good survey of SOM applications can be found, e.g., in [27.33]. SOMs have also been extended to temporal domains, mostly by the introduction of additional feedback connections, e.g., [27.34–37]. Such models can be used for topographic mapping or constrained clustering of time series data.
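Steps 1–6 with the Gaussian neighborhood (27.46) and the update rule (27.44) can be sketched as follows. The grid size, the decay schedules, and the toy ring-shaped data set are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: map a ring of 2-D points onto a 10x10 neuron grid.
P, tmax = 500, 3000
theta = rng.uniform(0, 2 * np.pi, P)
data = np.c_[np.cos(theta), np.sin(theta)]

# Grid co-ordinates r_i of the 100 neurons.
grid = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
W = rng.uniform(-0.5, 0.5, (100, 2))  # Step 1: random initial weights

alpha0, lam0 = 0.5, 5.0
for t in range(tmax):
    alpha = alpha0 * np.exp(-3.0 * t / tmax)       # Step 4: decaying learning rate
    lam = lam0 * np.exp(-3.0 * t / tmax)           # Step 4: shrinking neighborhood
    x = data[t % P]                                # Step 2: present an input
    win = np.argmin(np.sum((W - x) ** 2, axis=1))  # winner by Euclidean distance
    # Gaussian neighborhood on the grid, Eq. (27.46).
    d2 = np.sum((grid - grid[win]) ** 2, axis=1)
    h = np.exp(-d2 / lam ** 2)
    # Step 3: update the winner and its neighbors, Eq. (27.44).
    W += alpha * h[:, None] * (x - W)

# Quantization error: mean distance from each input to its closest codebook vector.
quant_err = np.mean(np.min(np.linalg.norm(data[:, None] - W[None], axis=2), axis=1))
```

After training, the codebook vectors lie near the ring and neighboring neurons on the grid respond to nearby inputs, which is the topology preservation described above.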
27.7 Recursive Neural Networks

In many application domains, data are naturally organized in structured form, where each data item is composed of several components related to each other in a non-trivial way, and the specific nature of the task to be performed is strictly related not only to the information stored at each component, but also to the structure connecting the components. Examples of structured data are parse trees obtained by parsing sentences in natural language, and molecular graphs describing chemical compounds.

Recursive neural networks (RecNN) [27.38, 39] are neural network models that are able to directly process structured data, such as trees and graphs. For the sake of presentation, here we focus on positional trees. Positional trees are trees for which each child has an associated index, its position with respect to the siblings. Let us understand how RecNN is able to process a tree by analogy with what happens when unfolding an RNN processing a sequence, which can be understood as a special case of tree in which each node v possesses a single child.

In Fig. 27.13 (top) we show the unfolding in time of a sequence when considering a graphical model (recursive network) representing, for a generic node v, the functional dependencies among the input information x_v, the state variable (hidden node) y_v, and the output variable o_v. The operator q^{-1} represents the shift operator in time (unit time delay), i.e., q^{-1} y_t = y_{t-1}, which, applied to node v in our framework, returns the child of node v. At the bottom of Fig. 27.13 we show the unfolding of a binary tree, where the recursive network uses a generalization of the shift operator which, given an index i and a variable associated to a vertex v, returns the variable associated to the i-th child of v, i.e., q_i^{-1} y_v = y_{ch_i[v]}. So, while in an RNN the network is unfolded in time, in a RecNN the network is unfolded on the structure. The result of the unfolding, in both cases, is the encoding network. The encoding network for the sequence specifies how the components implementing the different parts of the recurrent network (e.g., each node of the recurrent network could be instantiated by a layer of neurons or by a full feed-forward neural network with hidden units) need to be interconnected. In the case of the tree, the encoding network has the same semantics; this time, however, a set of parameters (weights) for each child should be considered, leading to a network that, given a node v, can be described by the equations

    o_k^{(v)} = f\left( \sum_{j=1}^{J} m_{kj}\, y_j^{(v)} \right) ,

    y_j^{(v)} = f\left( \sum_{s=1}^{d} \sum_{i=1}^{J} w_{jis}\, y_i^{(ch_s[v])} + \sum_{i=1}^{I} v_{ji}\, x_i^{(v)} \right) ,

where d is the maximum number of children an input node can have, and the weights w_{jis} are indexed on the s-th child. Note that it is not difficult to generalize all the learning algorithms devised for RNNs to these extended equations.

Fig. 27.13a,b Generation of the encoding network for (a) a sequence (list) and (b) a binary tree. Initial (frontier) states are represented by squared nodes

It should be remarked that recursive networks clearly introduce a causal style of computation, i.e., the computation of the hidden and output variables for a vertex v only depends on the information attached to v and on the hidden variables of the children of v. This dependence is satisfied recursively by all of v's descendants and is clearly shown in Fig. 27.14, where nested boxes are used to make explicit the recursive dependencies among the hidden variables that contribute to the determination of the hidden variable y_v associated to the root of the tree.

Fig. 27.14 The causality style of computation induced by the use of recursive networks is made explicit by using nested boxes to represent the recursive dependencies of the hidden variable associated to the root of the tree

Although an encoding network can be generated for a directed acyclic graph (DAG), this style of computation limits the discriminative ability of RecNN to the class of trees. In fact, the hidden state is not able to encode information about the parents of nodes. The introduction of contextual processing, however, allows us to discriminate, with some specific exceptions, among DAGs [27.40]. Recently, Micheli [27.41] also showed how contextual processing can be used to extend RecNN to the treatment of cyclic graphs.

Fig. 27.15a,b Parts of a SOM-SD map trained on DAGs representing visual patterns

The same idea described above for supervised neural networks can be adapted to unsupervised models, where the output value of a neuron typically represents the similarity of the weight vector associated to the neuron with the input vector. Specifically, in [27.37] SOMs were extended to the processing of structured data (SOM-SD). Moreover, a general framework for self-organized processing of structured data
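A minimal sketch of the forward (encoding) pass defined by the two equations above, for positional binary trees, might look as follows. The sizes I, J, and d, the choice of tanh for f, the random weights, and the tiny example tree are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: I input features, J state units, d = 2 child positions.
I, J, d = 3, 4, 2
V = rng.uniform(-0.5, 0.5, (J, I))      # input-to-state weights v_ji
Wc = rng.uniform(-0.5, 0.5, (d, J, J))  # one state matrix per child position (w_jis)
M = rng.uniform(-0.5, 0.5, (1, J))      # state-to-output weights m_kj

def state(node):
    """y^(v) = f(sum_s W_s y^(ch_s[v]) + V x^(v)); a missing child contributes
    the frontier (initial) state, here the zero vector."""
    x, children = node
    a = V @ x
    for s, child in enumerate(children):
        if child is not None:
            a += Wc[s] @ state(child)
    return np.tanh(a)

def output(node):
    """o^(v) = f(M y^(v)), evaluated here at the root."""
    return np.tanh(M @ state(node))

# A small positional binary tree: (label vector, [left child, right child]).
leaf = (np.ones(I), [None, None])
tree = (np.zeros(I), [leaf, (np.ones(I), [leaf, None])])
o = output(tree)
```

Unfolding happens implicitly through the recursion: the same weights V, Wc, and M are reused at every node, which is exactly the weight sharing of the encoding network.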
was proposed in [27.42]. The key concepts introduced are:

i) The explicit definition of a representation space R equipped with a similarity measure d_R(·,·) to evaluate the similarity between two hidden states.
ii) The introduction of a general representation function, denoted rep(·), which transforms the activation of the map for a given input into a hidden state representation.

In these models, each node v of the input structure is represented by a tuple [\bar{x}_v, \bar{r}_{v1}, \ldots, \bar{r}_{vd}], where \bar{x}_v is a real-valued vectorial encoding of the information attached to vertex v, and the \bar{r}_{vi} are real-valued vectorial representations of the hidden states returned by the rep(·) function when processing the activation of the map for the i-th neighbor of v. Each neuron n_j in the map is associated to a weight vector [\bar{w}_j, \bar{c}_{j1}, \ldots, \bar{c}_{jd}]. The computation of the winner neuron is based on the joint contribution of the similarity measures d_x(·,·) for the input information and d_R(·,·) for the hidden states, i.e., the internal representations. Some parts of a SOM-SD map trained on DAGs representing visual patterns are shown in Fig. 27.15. Even in this case the style of computation is causal, ruling out the treatment of undirected and/or cyclic graphs. In order to cope with general graphs, a new model, named GraphSOM [27.43], was recently proposed.
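The joint winner computation just described can be sketched as follows. The map size, the dimensions, the random weights, the use of squared Euclidean distances for both d_x and d_R, and the mixing weight between the two measures are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical SOM-SD setup: n neurons; each node carries an input part x_v
# (dim m) plus d child representations r_v1..r_vd (grid co-ordinates, dim 2).
n, m, d = 25, 3, 2
W_x = rng.uniform(-0.5, 0.5, (n, m))  # weight parts matched against x_v
W_r = rng.uniform(0, 5, (n, d, 2))    # weight parts matched against r_v1..r_vd
mu = 0.5                               # mixing weight between the two measures

def winner(x_v, r_v):
    """Winner minimizes mu * d_x(x_v, w_j) + (1 - mu) * d_R(r_v, c_j) jointly."""
    dx = np.sum((W_x - x_v) ** 2, axis=1)          # input similarity d_x
    dR = np.sum((W_r - r_v) ** 2, axis=(1, 2))     # hidden-state similarity d_R
    return int(np.argmin(mu * dx + (1 - mu) * dR))

# Node representation [x_v, r_v1, ..., r_vd]; here both children are missing,
# so their representations are set to a nil (zero) co-ordinate.
x_v = np.ones(m)
r_v = np.zeros((d, 2))
j = winner(x_v, r_v)
```

Processing a structure then proceeds bottom-up: the winner co-ordinates of the children become the r-part of the parent's representation, which is the causal style of computation noted above.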

27.8 Conclusion
The field of artificial neural networks (ANN) has grown enormously in the past 60 years. There are many journals and international conferences specifically devoted to neural computation and neural-network-related models and learning machines. The field has come a long way from its beginning in the form of simple threshold units existing in isolation (e.g., the perceptron, Sect. 27.2) or connected in circuits. Since then we have learnt how to generalize such networks as parameterized differentiable models of various sorts that can be fit to data (trained), usually by transforming the learning task into an optimization one.

ANN models have found numerous successful practical applications in many diverse areas of science and engineering, such as astronomy, biology, finance, and geology. In fact, even though the basic feed-forward ANN architectures were introduced a long time ago, they continue to surprise us with successful applications, most recently in the form of deep networks [27.44]. For example, a form of deep ANN recently achieved the best performance on a well-known benchmark problem – the recognition of handwritten digits [27.45]. This is quite remarkable, since such a simple ANN architecture trained in a purely data-driven fashion was able to outperform the current state-of-the-art techniques, formulated in more sophisticated frameworks and possibly incorporating domain knowledge.

ANN models have been formulated to operate in supervised (e.g., feed-forward ANN, Sect. 27.3; RBF networks, Sect. 27.5), unsupervised (e.g., SOM models, Sect. 27.6), semi-supervised, and reinforcement learning scenarios, and have been generalized to process inputs that are much more general than simple vector data of fixed dimensionality (e.g., the recurrent and recursive networks discussed in Sects. 27.4 and 27.7). Of course, we were not able to cover all important developments in the field of ANNs. We can only hope that we have sufficiently motivated the interested reader with the variety of modeling possibilities based on the idea of interconnected networks of formal neurons, so that he/she will further consult some of the many (much more comprehensive) monographs on the topic, e.g., [27.3, 6, 7].

We believe that ANN models will continue to play an important role in modern computational intelligence. Especially the inclusion of ANN-like models in the field of probabilistic modeling can provide techniques that incorporate both explanatory model-based and data-driven approaches, while preserving a much fuller modeling capability through operating with full distributions instead of simple point estimates.
References

27.1 F. Rosenblatt: The perceptron, a probabilistic model for information storage and organization in the brain, Psychol. Rev. 62, 386–408 (1958)
27.2 D.E. Rumelhart, G.E. Hinton, R.J. Williams: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, ed. by D.E. Rumelhart, J.L. McClelland (MIT Press/Bradford Books, Cambridge 1986) pp. 318–363
27.3 J. Zurada: Introduction to Artificial Neural Systems (West Publ., St. Paul 1992)
27.4 K. Hornik, M. Stinchocombe, H. White: Multilayer feedforward networks are universal approximators, Neural Netw. 2, 359–366 (1989)
27.5 D.J.C. MacKay: Bayesian interpolation, Neural Comput. 4(3), 415–447 (1992)
27.6 S. Haykin: Neural Networks and Learning Machines (Prentice Hall, Upper Saddle River 2009)
27.7 C. Bishop: Neural Networks for Pattern Recognition (Oxford Univ. Press, Oxford 1995)
27.8 T. Sejnowski, C. Rosenberg: Parallel networks that learn to pronounce English text, Complex Syst. 1, 145–168 (1987)
27.9 A. Waibel: Modular construction of time-delay neural networks for speech recognition, Neural Comput. 1, 39–46 (1989)
27.10 J.L. Elman: Finding structure in time, Cogn. Sci. 14, 179–211 (1990)
27.11 M.I. Jordan: Serial order: A parallel distributed processing approach. In: Advances in Connectionist Theory, ed. by J.L. Elman, D.E. Rumelhart (Erlbaum, Hillsdale 1989)
27.12 Y. Bengio, R. Cardin, R. DeMori: Speaker independent speech recognition with neural networks and speech knowledge. In: Advances in Neural Information Processing Systems II, ed. by D.S. Touretzky (Morgan Kaufmann, San Mateo 1990) pp. 218–225
27.13 P.J. Werbos: Generalization of backpropagation with application to a recurrent gas market model, Neural Netw. 1(4), 339–356 (1988)
27.14 R.J. Williams, D. Zipser: A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1(2), 270–280 (1989)
27.15 Y. Bengio, P. Simard, P. Frasconi: Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
27.16 T. Lin, B.G. Horne, P. Tino, C.L. Giles: Learning long-term dependencies with NARX recurrent neural networks, IEEE Trans. Neural Netw. 7(6), 1329–1338 (1996)
27.17 S. Hochreiter, J. Schmidhuber: Long short-term memory, Neural Comput. 9(8), 1735–1780 (1997)
27.18 M. Lukosevicius, H. Jaeger: Overview of Reservoir Recipes, Technical Report, Vol. 11 (School of Engineering and Science, Jacobs University, Bremen 2007)
27.19 A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber: A novel connectionist system for improved unconstrained handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. 31(5) (2009)
27.20 S. Hochreiter, M. Heusel, K. Obermayer: Fast model-based protein homology detection without alignment, Bioinformatics 23(14), 1728–1736 (2007)
27.21 H. Jaeger, H. Haas: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless telecommunication, Science 304, 78–80 (2004)
27.22 W. Maass, T. Natschlager, H. Markram: Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Comput. 14(11), 2531–2560 (2002)
27.23 P. Tino, G. Dorffner: Predicting the future of discrete sequences from fractal representations of the past, Mach. Learn. 45(2), 187–218 (2001)
27.24 M.H. Tong, A. Bicket, E. Christiansen, G. Cottrell: Learning grammatical structure with echo state network, Neural Netw. 20, 424–432 (2007)
27.25 K. Ishii, T. van der Zant, V. Becanovic, P. Ploger: Identification of motion with echo state network, Proc. OCEANS 2004 MTS/IEEE-TECHNO-OCEAN Conf., Vol. 3 (2004) pp. 1205–1210
27.26 L. Medsker, L.C. Jain: Recurrent Neural Networks: Design and Applications (CRC, Boca Raton 1999)
27.27 J. Kolen, S.C. Kremer: A Field Guide to Dynamical Recurrent Networks (IEEE, New York 2001)
27.28 D. Mandic, J. Chambers: Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability (Wiley, New York 2001)
27.29 J.B. MacQueen: Some methods for classification and analysis of multivariate observations, Proc. 5th Berkeley Symp. Math. Stat. Probab. (Univ. California Press, Oakland 1967) pp. 281–297
27.30 M.D. Buhmann: Radial Basis Functions: Theory and Implementations (Cambridge Univ. Press, Cambridge 2003)
27.31 G.-B. Huang, Q.-Y. Zhu, C.-K. Siew: Extreme learning machine: theory and applications, Neurocomputing 70, 489–501 (2006)
27.32 T. Kohonen: Self-Organizing Maps, Springer Series in Information Sciences, Vol. 30 (Springer, Berlin, Heidelberg 2001)
27.33 T. Kohonen, E. Oja, O. Simula, A. Visa, J. Kangas: Engineering applications of the self-organizing map, Proc. IEEE 84(10), 1358–1384 (1996)
27.34 T. Koskela, M. Varsta, J. Heikkonen, K. Kaski: Recurrent SOM with local linear models in time series prediction, 6th Eur. Symp. Artif. Neural Netw. (D-facto Publications, 1998) pp. 167–172
27.35 T. Voegtlin: Recursive self-organizing maps, Neural Netw. 15(8/9), 979–992 (2002)
27.36 M. Strickert, B. Hammer: Merge SOM for temporal data, Neurocomputing 64, 39–72 (2005)
27.37 M. Hagenbuchner, A. Sperduti, A. Tsoi: Self-organizing map for adaptive processing of structured data, IEEE Trans. Neural Netw. 14(3), 491–505 (2003)
27.38 A. Sperduti, A. Starita: Supervised neural networks for the classification of structures, IEEE Trans. Neural Netw. 8(3), 714–735 (1997)
27.39 P. Frasconi, M. Gori, A. Sperduti: A general framework for adaptive processing of data structures, IEEE Trans. Neural Netw. 9(5), 768–786 (1998)
27.40 B. Hammer, A. Micheli, A. Sperduti: Universal approximation capability of cascade correlation for structures, Neural Comput. 17(5), 1109–1159 (2005)
27.41 A. Micheli: Neural network for graphs: A contextual constructive approach, IEEE Trans. Neural Netw. 20(3), 498–511 (2009)
27.42 B. Hammer, A. Micheli, A. Sperduti, M. Strickert: A general framework for unsupervised processing of structured data, Neurocomputing 57, 3–35 (2004)
27.43 M. Hagenbuchner, A. Sperduti, A.-C. Tsoi: Graph self-organizing maps for cyclic and unbounded graphs, Neurocomputing 72(7–9), 1419–1430 (2009)
27.44 Y. Bengio, Y. LeCun: Greedy layer-wise training of deep networks. In: Advances in Neural Information Processing Systems 19, ed. by B. Schölkopf, J. Platt, T. Hofmann (MIT Press, Cambridge 2006) pp. 153–160
27.45 D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber: Deep big simple neural nets for handwritten digit recognition, Neural Comput. 22(12), 3207–3220 (2010)