
COMPUTER SCIENCE REVIEW 3 (2009) 127–149

available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/cosrev

Survey

Reservoir computing approaches to recurrent neural network training

Mantas Lukoševičius∗, Herbert Jaeger


School of Engineering and Science, Jacobs University Bremen gGmbH, P.O. Box 750 561, 28725 Bremen, Germany

Article history: Received 17 October 2008; Received in revised form 27 March 2009; Accepted 31 March 2009

Abstract

Echo State Networks and Liquid State Machines introduced a new paradigm in artificial recurrent neural network (RNN) training, where an RNN (the reservoir) is generated randomly and only a readout is trained. The paradigm, becoming known as reservoir computing, greatly facilitated the practical application of RNNs and outperformed classical fully trained RNNs in many tasks. It has lately become a vivid research field with numerous extensions of the basic idea, including reservoir adaptation, thus broadening the initial paradigm to using different methods for training the reservoir and the readout. This review systematically surveys both current ways of generating/adapting the reservoirs and training different types of readouts. It offers a natural conceptual classification of the techniques, which transcends boundaries of the current “brand-names” of reservoir methods, and thus aims to help in unifying the field and providing the reader with a detailed “map” of it.

© 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.cosrev.2009.03.005

∗ Corresponding author. E-mail addresses: m.lukosevicius@jacobs-university.de (M. Lukoševičius), h.jaeger@jacobs-university.de (H. Jaeger).

1. Introduction

Artificial recurrent neural networks (RNNs) represent a large and varied class of computational models that are designed by more or less detailed analogy with biological brain modules. In an RNN numerous abstract neurons (also called units or processing elements) are interconnected by likewise abstracted synaptic connections (or links), which enable activations to propagate through the network. The characteristic feature of RNNs that distinguishes them from the more widely used feedforward neural networks is that the connection topology possesses cycles. The existence of cycles has a profound impact:

• An RNN may develop a self-sustained temporal activation dynamics along its recurrent connection pathways, even in the absence of input. Mathematically, this renders an RNN to be a dynamical system, while feedforward networks are functions.
• If driven by an input signal, an RNN preserves in its internal state a nonlinear transformation of the input history — in other words, it has a dynamical memory, and is able to process temporal context information.

This review article concerns a particular subset of RNN-based research in two aspects:

• RNNs are used for a variety of scientific purposes, and at least two major classes of RNN models exist: they can be used for purposes of modeling biological brains, or as engineering tools for technical applications. The first usage belongs to the field of computational neuroscience, while the second frames RNNs in the realms


of machine learning, the theory of computation, and nonlinear signal processing and control. While there are interesting connections between the two attitudes, this survey focuses on the latter, with occasional borrowings from the first.
• From a dynamical systems perspective, there are two main classes of RNNs. Models from the first class are characterized by an energy-minimizing stochastic dynamics and symmetric connections. The best known instantiations are Hopfield networks [1,2], Boltzmann machines [3,4], and the recently emerging Deep Belief Networks [5]. These networks are mostly trained in some unsupervised learning scheme. Typical targeted network functionalities in this field are associative memories, data compression, the unsupervised modeling of data distributions, and static pattern classification, where the model is run for multiple time steps per single input instance to reach some type of convergence or equilibrium (but see e.g., [6] for extension to temporal data). The mathematical background is rooted in statistical physics. In contrast, the second big class of RNN models typically features a deterministic update dynamics and directed connections. Systems from this class implement nonlinear filters, which transform an input time series into an output time series. The mathematical background here is nonlinear dynamical systems. The standard training mode is supervised. This survey is concerned only with RNNs of this second type, and when we speak of RNNs later on, we will exclusively refer to such systems (however, they can also be used in a converging mode, as shown at the end of Section 8.6).

RNNs (of the second type) appear as highly promising and fascinating tools for nonlinear time series processing applications, mainly for two reasons. First, it can be shown that under fairly mild and general assumptions, such RNNs are universal approximators of dynamical systems [7]. Second, biological brain modules almost universally exhibit recurrent connection pathways too. Both observations indicate that RNNs should potentially be powerful tools for engineering applications.
Despite this widely acknowledged potential, and despite a number of successful academic and practical applications, the impact of RNNs in nonlinear modeling has remained limited for a long time. The main reason for this lies in the fact that RNNs are difficult to train by gradient-descent-based methods, which aim at iteratively reducing the training error. While a number of training algorithms have been proposed (a brief overview is given in Section 2.5), these all suffer from the following shortcomings:

• The gradual change of network parameters during learning drives the network dynamics through bifurcations [8]. At such points, the gradient information degenerates and may become ill-defined. As a consequence, convergence cannot be guaranteed.
• A single parameter update can be computationally expensive, and many update cycles may be necessary. This results in long training times, and renders RNN training feasible only for relatively small networks (in the order of tens of units).
• It is intrinsically hard to learn dependences requiring long-range memory, because the necessary gradient information exponentially dissolves over time [9] (but see the Long Short-Term Memory networks [10] for a possible escape).
• Advanced training algorithms are mathematically involved and need to be parameterized by a number of global control parameters, which are not easily optimized. As a result, such algorithms need substantial skill and experience to be successfully applied.

In this situation of slow and difficult progress, in 2001 a fundamentally new approach to RNN design and training was proposed independently by Wolfgang Maass under the name of Liquid State Machines [11] and by Herbert Jaeger under the name of Echo State Networks [12]. This approach, which had predecessors in computational neuroscience [13] and subsequent ramifications in machine learning as the Backpropagation-Decorrelation [14] learning rule, is now increasingly often collectively referred to as Reservoir Computing (RC). The RC paradigm avoids the shortcomings of gradient-descent RNN training listed above, by setting up RNNs in the following way:

• A recurrent neural network is randomly created and remains unchanged during training. This RNN is called the reservoir. It is passively excited by the input signal and maintains in its state a nonlinear transformation of the input history.
• The desired output signal is generated as a linear combination of the neuron’s signals from the input-excited reservoir. This linear combination is obtained by linear regression, using the teacher signal as a target.

Fig. 1 graphically contrasts previous methods of RNN training with the RC approach.
Reservoir Computing methods have quickly become popular, as witnessed for instance by a theme issue of Neural Networks [15], and today constitute one of the basic paradigms of RNN modeling [16]. The main reasons for this development are the following:

Modeling accuracy. RC has starkly outperformed previous methods of nonlinear system identification, prediction and classification, for instance in predicting chaotic dynamics (three orders of magnitude improved accuracy [17]), nonlinear wireless channel equalization (two orders of magnitude improvement [17]), the Japanese Vowel benchmark (zero test error rate, previous best: 1.8% [18]), financial forecasting (winner of the international forecasting competition NN3, http://www.neural-forecasting-competition.com/NN3/index.htm), and in isolated spoken digits recognition (improvement of word error rate on benchmark from 0.6% of previous best system to 0.2% [19], and further to 0% test error in recent unpublished work).
Modeling capacity. RC is computationally universal for continuous-time, continuous-value real-time systems modeled with bounded resources (including time and value resolution) [20,21].



Fig. 1 – A. Traditional gradient-descent-based RNN training methods adapt all connection weights (bold arrows), including
input-to-RNN, RNN-internal, and RNN-to-output weights. B. In Reservoir Computing, only the RNN-to-output weights are
adapted.

Biological plausibility. Numerous connections of RC principles to architectural and dynamical properties of mammalian brains have been established. RC (or closely related models) provides explanations of why biological brains can carry out accurate computations with an “inaccurate” and noisy physical substrate [22,23], especially accurate timing [24]; of the way in which visual information is superimposed and processed in primary visual cortex [25,26]; of how cortico-basal pathways support the representation of sequential information; and RC offers a functional interpretation of the cerebellar circuitry [27,28]. A central role is assigned to an RC circuit in a series of models explaining sequential information processing in human and primate brains, most importantly of speech signals [13,29–31].
Extensibility and parsimony. A notorious conundrum of neural network research is how to extend previously learned models by new items without impairing or destroying previously learned representations (catastrophic interference [32]). RC offers a simple and principled solution: new items are represented by new output units, which are appended to the previously established output units of a given reservoir. Since the output weights of different output units are independent of each other, catastrophic interference is a non-issue.

These encouraging observations should not mask the fact that RC is still in its infancy, and significant further improvements and extensions are desirable. Specifically, just simply creating a reservoir at random is unsatisfactory. It seems obvious that, when addressing a specific modeling task, a specific reservoir design that is adapted to the task will lead to better results than a naive random creation. Thus, the main stream of research in the field is today directed at understanding the effects of reservoir characteristics on task performance, and at developing suitable reservoir design and adaptation methods. Also, new ways of reading out from the reservoirs, including combining them into larger structures, are devised and investigated. While shifting from the initial idea of having a fixed randomly created reservoir and training only the readout, the current paradigm of reservoir computing remains (and differentiates itself from other RNN training approaches) as producing/training the reservoir and the readout separately and differently.
This review offers a conceptual classification and a comprehensive survey of this research.
As is true for many areas of machine learning, methods in reservoir computing converge from different fields and come with different names. We would like to make a distinction here between these differently named “tradition lines”, which we like to call brands, and the actual finer-grained ideas on producing good reservoirs, which we will call recipes. Since recipes can be useful and mixed across different brands, this review focuses on classifying and surveying them. To be fair, it has to be said that the authors of this survey associate themselves mostly with the Echo State Networks brand, and thus, willingly or not, are influenced by its mindset.
Overview. We start by introducing a generic notational framework in Section 2. More specifically, we define what we mean by problem or task in the context of machine learning in Section 2.1. Then we define a general notation for expansion (or kernel) methods for both non-temporal (Section 2.2) and temporal (Section 2.3) tasks, introduce our notation for recurrent neural networks in Section 2.4, and outline classical training methods in Section 2.5. In Section 3 we detail the foundations of Reservoir Computing and proceed by naming the most prominent brands. In Section 4 we introduce our classification of the reservoir generation/adaptation recipes, which transcends the boundaries between the brands. Following this classification we then review universal (Section 5), unsupervised (Section 6), and supervised (Section 7) reservoir generation/adaptation recipes. In Section 8 we provide a classification and review the techniques for reading the outputs from the reservoirs reported in literature, together with discussing various practical issues of readout training. A final discussion (Section 9) wraps up the entire picture.

2. Formalism

2.1. Formulation of the problem

Let a problem or a task in our context of machine learning be defined as a problem of learning a functional relation between a given input u(n) ∈ RNu and a desired output ytarget(n) ∈ RNy, where n = 1, . . . , T, and T is the number of data points in the training dataset {(u(n), ytarget(n))}. A non-temporal task is where the data points are independent of each other and the goal is to learn a function y(n) = y(u(n)) such that E(y, ytarget) is minimized, where E is an error measure, for instance, the normalized root-mean-square error (NRMSE)

E(y, ytarget) = √( ⟨‖y(n) − ytarget(n)‖²⟩ / ⟨‖ytarget(n) − ⟨ytarget(n)⟩‖²⟩ ),  (1)

where ‖·‖ stands for the Euclidean distance (or norm), and ⟨·⟩ for the average over the data points n.
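For concreteness, the NRMSE of Eq. (1) can be computed directly from sampled signals. The following is a minimal sketch (ours, not code from the original paper), assuming y and ytarget are available as NumPy arrays of shape (T, Ny):

```python
import numpy as np

def nrmse(y, y_target):
    """Normalized root-mean-square error, Eq. (1).

    y, y_target: arrays of shape (T, Ny) with the produced and the desired
    output signals; the average <.> runs over the T data points.
    """
    err = np.mean(np.sum((y - y_target) ** 2, axis=1))  # <||y(n) - ytarget(n)||^2>
    var = np.mean(np.sum((y_target - y_target.mean(axis=0)) ** 2, axis=1))  # <||ytarget(n) - <ytarget(n)>||^2>
    return np.sqrt(err / var)
```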

A temporal task is where u and ytarget are signals in a discrete time domain n = 1, . . . , T, and the goal is to learn a function y(n) = y(. . . , u(n − 1), u(n)) such that E(y, ytarget) is minimized. Thus the difference between the temporal and non-temporal task is that the function y(·) we are trying to learn has memory in the first case and is memoryless in the second. In both cases the underlying assumption is, of course, that the functional dependence we are trying to learn actually exists in the data. For the temporal case this spells out as data adhering to an additive noise model of the form ytarget(n) = ytarget(. . . , u(n − 1), u(n)) + θ(n), where ytarget(·) is the relation to be learned by y(·) and θ(n) ∈ RNy is a noise term, limiting the learning precision, i.e., the precision of matching the learned y(n) to ytarget(n).
Whenever we say that the task or the problem is learned well, or with good accuracy or precision, we mean that E(y, ytarget) is small. Normally one part of the T data points is used for training the model and another part (unseen during the training) for testing it. When speaking about output errors and performance or precision we will have testing errors in mind (if not explicitly specified otherwise). Also n, denoting the discrete time, will often be used omitting its range 1, . . . , T.

2.2. Expansions and kernels in non-temporal tasks

Many tasks cannot be accurately solved by a simple linear relation between the u and ytarget, i.e., a linear model y(n) = Wu(n) (where W ∈ RNy×Nu) gives big errors E(y, ytarget) regardless of W. In such situations one has to resort to nonlinear models. A number of generic and widely used approaches to nonlinear modeling are based on the idea of nonlinearly expanding the input u(n) into a high-dimensional feature vector x(n) ∈ RNx, and then utilizing those features using linear methods, for instance by linear regression or computing a linear separation hyperplane, to get a reasonable y. Solutions of this kind can be expressed in the form

y(n) = Wout x(n) = Wout x(u(n)),  (2)

where Wout ∈ RNy×Nx are the trained output weights. Typically Nx ≫ Nu, and we will often consider u(n) as included in x(n). There is also typically a constant bias value added to (2), which is omitted here and in other equations for brevity. The bias can be easily implemented, having one of the features in x(n) constant (e.g., = 1) and a corresponding column in Wout. Some models extend (2) to

y(n) = fout(Wout x[u(n)]),  (3)

where fout(·) is some nonlinear function (e.g., a sigmoid applied element-wise). For the sake of simplicity we will consider this definition as equivalent to (2), since fout(·) can be eliminated from y by redefining the target as y′target = fout⁻¹(ytarget) (and the error function E(y, y′target), if desired). Note that (2) is a special case of (3), with fout(·) being the identity.
Functions x(u(n)) that transform an input u(n) into a (higher-dimensional) vector x(n) are often called kernels (and traditionally denoted φ(u(n))) in this context. Methods using kernels often employ the kernel trick, which refers to the option afforded by many kernels of computing inner products in the (high-dimensional, hence expensive) feature space of x more cheaply in the original space populated by u. The term kernel function has acquired a close association with the kernel trick. Since here we will not exploit the kernel trick, in order to avoid confusion we will use the more neutral term of an expansion function for x(u(n)), and refer to methods using such functions as expansion methods. These methods then include Support Vector Machines (which standardly do use the kernel trick), Feedforward Neural Networks, Radial Basis Function approximators, Slow Feature Analysis, and various Probability Mixture models, among many others. Feedforward neural networks are also often referred to as (multilayer) perceptrons in the literature.
While training the output Wout is a well defined and understood problem, producing a good expansion function x(·) generally involves more creativity. In many expansion methods, e.g., Support Vector Machines, the function is chosen “by hand” (most often through trial-and-error) and is fixed.
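To illustrate the expansion-method scheme of Eqs. (2) and (3) for a non-temporal task, the following sketch (our own illustration, not from the text) uses an arbitrarily chosen fixed random tanh expansion of u(n) and trains only the linear output weights by least squares; all sizes, data, and the particular expansion function are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(42)
Nu, Nx, Ny, T = 3, 100, 1, 1000                    # illustrative sizes

# toy data: inputs U of shape (T, Nu) and targets of shape (T, Ny)
U = rng.uniform(-1, 1, (T, Nu))
Y_target = np.tanh(U @ rng.normal(size=(Nu, Ny)))  # an arbitrary nonlinear relation

# fixed, randomly chosen expansion x(u(n)) = tanh(V u(n))
V = rng.normal(size=(Nu, Nx))
X = np.tanh(U @ V)                                 # feature vectors, shape (T, Nx)
X = np.hstack([X, np.ones((T, 1))])                # constant feature for the bias term

# train only the output weights by linear regression (least squares);
# W_out is stored transposed relative to the Ny x Nx convention in the text
W_out, *_ = np.linalg.lstsq(X, Y_target, rcond=None)
Y = X @ W_out                                      # y(n) = Wout x(n), as in Eq. (2)
```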

2.3. Expansions in temporal tasks

Many temporal methods are based on the same principle. The difference is that in a temporal task the function to be learned depends also on the history of the input, as discussed in Section 2.1. Thus, the expansion function has memory: x(n) = x(. . . , u(n − 1), u(n)), i.e., it is an expansion of the current input and its (potentially infinite) history. Since this function has an unbounded number of parameters, practical implementations often take an alternative, recursive, definition:

x(n) = x(x(n − 1), u(n)).  (4)

The output y(n) is typically produced in the same way as for non-temporal methods by (2) or (3).
In addition to the nonlinear expansion, as in the non-temporal tasks, such x(n) could be seen as a type of a spatial embedding of the temporal information of . . . , u(n − 1), u(n). This, for example, enables capturing higher-dimensional dynamical attractors y(n) = ytarget(. . . , u(n − 1), u(n)) = u(n + 1) of the system being modeled by y(·) from a series of lower-dimensional observations u(n) the system is emitting, which is shown to be possible by Takens’s theorem [33].

2.4. Recurrent neural networks

The type of recurrent neural networks that we will consider most of the time in this review is a straightforward implementation of (4). The nonlinear expansion with memory here leads to a state vector of the form

x(n) = f(Win u(n) + W x(n − 1)),  n = 1, . . . , T,  (5)

where x(n) ∈ RNx is a vector of reservoir neuron activations at a time step n, f(·) is the neuron activation function, usually the symmetric tanh(·), or the positive logistic (or Fermi) sigmoid, applied element-wise, Win ∈ RNx×Nu is the input weight matrix, and W ∈ RNx×Nx is a weight matrix of internal network connections. The network is usually started with the initial state x(0) = 0. Bias values are again omitted in (5) in the same way as in (2). The readout y(n) of the network is implemented as in (3).
Some models of RNNs extend (5) as

x(n) = f(Win u(n) + W x(n − 1) + Wofb y(n − 1)),  n = 1, . . . , T,  (6)

where Wofb ∈ RNx×Ny is an optional output feedback weight matrix.

2.5. Classical training of RNNs

The classical approach to supervised training of RNNs, known as gradient descent, is by iteratively adapting all weights Wout, W, Win, and possibly Wofb (which as a whole we denote Wall for brevity) according to their estimated gradients ∂E/∂Wall, in order to minimize the output error E = E(y, ytarget). A classical example of such methods is Real-Time Recurrent Learning [34], where the estimation of ∂E/∂Wall is done recurrently, forward in time. Conversely, error backpropagation (BP) methods for training RNNs, which are derived as extensions of the BP method for feedforward neural networks [35], estimate ∂E/∂Wall by propagating E(y, ytarget) backwards through network connections and time. The BP group of methods is arguably the most prominent in classical RNN training, with the classical example in this group being Backpropagation Through Time [36]. It has a runtime complexity of O(Nx²) per weight update per time step for a single output Ny = 1, compared to O(Nx⁴) for Real-Time Recurrent Learning.
A systematic unifying overview of many classical gradient descent RNN training methods is presented in [37]. The same contribution also proposes a new approach, often referred to by others as Atiya–Parlos Recurrent Learning (APRL). It estimates gradients with respect to neuron activations ∂E/∂x (instead of weights directly) and gradually adapts the weights Wall to move the activations x into the desired directions. The method is shown to converge faster than previous ones. See Section 3.4 for more implications of APRL and bridging the gap between the classical gradient descent and the reservoir computing methods.
There are also other versions of supervised RNN training, formulating the training problem differently, such as using Extended Kalman Filters [38] or the Expectation-Maximization algorithm [39], as well as dealing with special types of RNNs, such as Long Short-Term Memory [40] modular networks capable of learning long-term dependences.
There are many more, arguably less prominent, methods and their modifications for RNN training that are not mentioned here, as this would lead us beyond the scope of this review. The very fact of their multiplicity suggests that there is no clear winner in all aspects. Despite many advances that the methods cited above have introduced, they still have multiple common shortcomings, as pointed out in Section 1.

3. Reservoir methods

Reservoir computing methods differ from the “traditional” designs and learning techniques listed above in that they make a conceptual and computational separation between a dynamic reservoir — an RNN as a nonlinear temporal expansion function — and a recurrence-free (usually linear) readout that produces the desired output from the expansion. This separation is based on the understanding (common with kernel methods) that x(·) and y(·) serve different purposes: x(·) expands the input history u(n), u(n − 1), . . . into a rich enough reservoir state space x(n) ∈ RNx, while y(·) combines the neuron signals x(n) into the desired output signal ytarget(n). In the linear readout case (2), for each dimension yi of y an output weight vector (Wout)i in the same space RNx is found such that

(Wout)i x(n) = yi(n) ≈ ytarget i(n),  (7)

while the “purpose” of x(n) is to contain a rich enough representation to make this possible.
Since the expansion and the readout serve different purposes, training/generating them separately and even with different goal functions makes sense. The readout y(n) = y(x(n)) is essentially a non-temporal function, learning which is relatively simple. On the other hand, setting up the reservoir such that a “good” state expansion x(n) emerges is an ill-understood challenge in many respects. The “traditional” RNN training methods do not make the conceptual separation of a reservoir vs. a readout, and train both reservoir-internal and output weights in technically the same fashion. Nonetheless, even in traditional methods the ways of defining the error gradients for the output y(n) and the internal units x(n) are inevitably different, reflecting that an explicit target ytarget(n) is available only for the output units. Analyses of traditional training algorithms have furthermore revealed that the learning dynamics of internal vs. output weights exhibit systematic and striking differences. This theme will be expanded in Section 3.4.
Currently, reservoir computing is a vivid fresh RNN research stream, which has recently gained wide attention due to the reasons pointed out in Section 1.
We proceed to review the most prominent “named” reservoir methods, which we call here brands. Each of them has its own history, a specific mindset, specific types of reservoirs, and specific insights.

3.1. Echo State Networks

Echo State Networks (ESNs) [16] represent one of the two pioneering reservoir computing methods. The approach is based on the observation that if a random RNN possesses certain algebraic properties, training only a linear readout from it is often sufficient to achieve excellent performance in practical applications. The untrained RNN part of an ESN is called a dynamical reservoir, and the resulting states x(n) are termed echoes of its input history [12] — this is where reservoir computing draws its name from.
ESNs standardly use simple sigmoid neurons, i.e., reservoir states are computed by (5) or (6), where the nonlinear function f(·) is a sigmoid, usually the tanh(·) function. Leaky integrator neuron models represent another frequent option for ESNs, which is discussed in depth in Section 5.5. Classical recipes for producing the ESN reservoir (which is in essence Win and W) are outlined in Section 5.1, together with input-independent properties of the reservoir. Input-dependent measures of the quality of the activations x(n) in the reservoir are presented in Section 6.1.
The readout from the reservoir is usually linear (3), where u(n) is included as part of x(n), which can also be spelled out in (3) explicitly as

y(n) = fout(Wout [u(n)|x(n)]),  (8)

where Wout ∈ RNy×(Nu+Nx) is the learned output weight matrix, fout(·) is the output neuron activation function (usually the identity) applied component-wise, and ·|· stands for a vertical concatenation of vectors. The original and most popular batch training method to compute Wout is linear regression, discussed in Section 8.1.1, or a computationally cheap online training discussed in Section 8.1.2.
The initial ESN publications [12,41–43,17] were framed in settings of machine learning and nonlinear signal processing applications. The original theoretical contributions of early ESN research concerned algebraic properties of the reservoir that make this approach work in the first place (the echo state property [12] discussed in Section 5.1) and analytical results characterizing the dynamical short-term memory capacity of reservoirs [41] discussed in Section 6.1.

3.2. Liquid State Machines

Liquid State Machines (LSMs) [11] are the other pioneering reservoir method, developed independently from and simultaneously with ESNs. LSMs were developed from a computational neuroscience background, aiming at elucidating the principal computational properties of neural microcircuits [11,20,44,45]. Thus LSMs use more sophisticated and biologically realistic models of spiking integrate-and-fire neurons and dynamic synaptic connection models in the reservoir. The connectivity among the neurons often follows topological and metric constraints that are biologically motivated. In the LSM literature, the reservoir is often referred to as the liquid, following an intuitive metaphor of the excited states as ripples on the surface of a pool of water. Inputs to LSMs also usually consist of spike trains. In their readouts LSMs originally used multilayer feedforward neural networks (of either spiking or sigmoid neurons), or linear readouts similar to ESNs [11]. Additional mechanisms for averaging spike trains to get real-valued outputs are often employed.
RNNs of the LSM-type with spiking neurons and more sophisticated synaptic models are usually more difficult to implement, to correctly set up and tune, and typically more expensive to emulate on digital computers (with a possible exception of event-driven spiking NN simulations, where the computational load varies depending on the amount of activity in the NN) than simple ESN-type “weighted sum and nonlinearity” RNNs. Thus they are less widespread for engineering applications of RNNs than the latter. However, while the ESN-type neurons only emulate mean firing rates of biological neurons, spiking neurons are able to perform more complicated information processing, due to the time coding of the information in their signals (i.e., the exact timing of each firing also matters). Also findings on various mechanisms in natural neural circuits are more easily transferable to these more biologically-realistic models (there is more on this in Section 6.2).
The main theoretical contributions of the LSM brand to Reservoir Computing consist in analytical characterizations of the computational power of such systems [11,21] discussed in Sections 6.1 and 7.4.

3.3. Evolino

Evolino [46] transfers the idea of ESNs from an RNN of simple sigmoidal units to a Long Short-Term Memory type of RNNs [40] constructed from units capable of preserving memory for long periods of time. In Evolino the weights of the reservoir are trained using evolutionary methods, as is also done in some extensions of ESNs, both discussed in Section 7.2.

3.4. Backpropagation-Decorrelation

The idea of separation between a reservoir and a readout function has also been arrived at from the point of view of optimizing the performance of the RNN training algorithms that use error backpropagation, as already indicated in Section 2.5. In an analysis of the weight dynamics of an RNN trained using the APRL learning algorithm [47], it was revealed that the output weights Wout of the network being trained change quickly, while the hidden weights W change slowly and in the case of a single output Ny = 1 the changes are column-wise coupled. Thus in effect APRL decouples the RNN into a quickly adapting output and a slowly adapting reservoir. Inspired by these findings a new iterative/online RNN training method, called BackPropagation-DeCorrelation (BPDC), was introduced [14]. It approximates and significantly simplifies the APRL method, and applies it only to the output weights Wout, turning it into an online RC method. BPDC uses the reservoir update equation defined in (6), where output feedbacks Wofb are essential, with the same type of units as ESNs. BPDC learning is claimed to be insensitive to the parameters of the fixed reservoir weights W. BPDC boasts fast learning times and thus is capable of tracking quickly changing signals. As a downside of this feature, the trained network quickly forgets the previously seen data and is highly biased by the recent data. Some remedies for reducing this effect are reported in [48]. Most applications of BPDC in the literature are for tasks having one-dimensional outputs Ny = 1; however BPDC has also been successfully applied to Ny > 1, as recently demonstrated in [49].
From a conceptual perspective we can define a range of RNN training methods that gradually bridge the gap between the classical BP and reservoir methods:

1. Classical BP methods, such as Backpropagation Through Time (BPTT) [36];
2. Atiya–Parlos recurrent learning (APRL) [37];
3. BackPropagation-DeCorrelation (BPDC) [14];
4. Echo State Networks (ESNs) [16].

In each method of this list the focus of training gradually moves from the entire network towards the output, and convergence of the training is faster in terms of iterations, with only a single “iteration” in case 4. At the same time the potential expressiveness of the RNN, as per the same number of units in the NN, becomes weaker. All methods in the list primarily use the same type of simple sigmoid neuron model.

3.5. Temporal Recurrent Networks

This summary of RC brands would be incomplete without a spotlight directed at Peter F. Dominey’s decade-long research suite on cortico-striatal circuits in the human brain (e.g., [13,29,31], and many more). Although this research is rooted in empirical cognitive neuroscience and functional neuroanatomy and aims at elucidating complex neural structures rather than theoretical computational principles, it is probably Dominey who first clearly spelled out the RC principle: “(. . . ) there is no learning in the recurrent connections [within a subnetwork corresponding to a reservoir], only between the State [i.e., reservoir] units and the Output units. Second, adaptation is based on a simple associative learning mechanism (. . . )” [50]. It is also in this article where Dominey brands the neural reservoir module as a Temporal Recurrent Network. The learning algorithm, to which Dominey alludes, can be seen as a version of the Least Mean Squares discussed in Section 8.1.2. At other places, Dominey emphasizes the randomness of the connectivity in the reservoir: “It is worth noting that the simulated recurrent prefrontal network relies on fixed randomized recurrent connections, (. . . )” [51]. Only in early 2008 did Dominey and “computational” RC researchers become aware of each other.

3.6. Other (exotic) types of reservoirs

As is clear from the discussion of the different reservoir methods so far, a variety of neuron models can be used for the reservoirs. Using different activation functions inside a single reservoir might also improve the richness of the echo states, as is illustrated, for example, by inserting some neurons with wavelet-shaped activation functions into the reservoir of ESNs [52]. A hardware-implementation-friendly version of reservoirs composed of stochastic bitstream neurons was proposed in [53].
In fact the reservoirs do not necessarily need to be neural networks, governed by dynamics similar to (5). Other types of high-dimensional dynamical systems that can take an input u(n) and have an observable state x(n) (which does not necessarily fully describe the state of the system) can be used as well. In particular this makes the reservoir paradigm suitable for harnessing the computational power of unconventional hardware, such as analog electronics [54,55], biological neural tissue [26], optical [56], quantum, or physical “computers”. The last of these was demonstrated (taking the “reservoir” and “liquid” idea quite literally) by feeding the input via mechanical actuators into a reservoir full of water, recording the state of its surface optically, and successfully training a readout multilayer perceptron on several classification tasks [57]. An idea of treating a computer-simulated gene regulation network of Escherichia coli bacteria as the reservoir, a sequence of chemical stimuli as an input, and measures of protein levels and mRNAs as an output is explored in [58].

3.7. Other overviews of reservoir methods

An experimental comparison of LSM, ESN, and BPDC reservoir methods with different neuron models, even beyond the standard ones used for the respective methods, and different parameter settings is presented in [59]. A brief and broad overview of reservoir computing is presented in [60], with an emphasis on applications and hardware implementations of reservoir methods. The editorial in the “Neural Networks” journal special issue on ESNs and LSMs [15] offers a short introduction to the topic and an overview of the articles in the issue (most of which are also surveyed here). An older and much shorter part of this overview, covering only reservoir adaptation techniques, is available as a technical report [61].

4. Our classification of reservoir recipes

The successes of applying RC methods to benchmarks (see the listing in Section 1) outperforming classical fully trained RNNs do not imply that randomly generated reservoirs are optimal and cannot be improved. In fact, “random” is almost by definition an antonym to “optimal”. The results rather indicate the need for some novel methods of training/generating the reservoirs that are very probably not a direct extension of the way the output is trained (as in BP). Thus besides application studies (which are not surveyed here), the bulk of current RC research on reservoir methods is devoted to optimal reservoir design, or reservoir optimization algorithms.
It is worth mentioning at this point that the general “no free lunch” principle in supervised machine learning [62] states that there can exist no bias of a model which would universally improve the accuracy of the model for all possible problems. In our context this can be translated into a claim that no single type of reservoir can be optimal for all types of problems.
In this review we will try to survey all currently investigated ideas that help producing “good” reservoirs. We will classify those ideas into three major groups based on their universality:

• Generic guidelines/methods of producing good reservoirs irrespective of the task (both the input u(n) and the desired output ytarget(n));
• Unsupervised pre-training of the reservoir with respect to the given input u(n), but not the target ytarget(n);
• Supervised pre-training of the reservoir with respect to both the given input u(n) and the desired output ytarget(n).

These three classes of methods are discussed in the following three sections. Note that many of the methods to some extent transcend the boundaries of these three classes, but will be classified according to their main principle.

5. Generic reservoir recipes

The most classical methods of producing reservoirs all fall into this category. All of them generate reservoirs randomly, with topology and weight characteristics depending on some preset parameters. Even though they are not optimized for a particular input u(n) or target ytarget(n), a good manual selection of the parameters is to some extent task-dependent, complying with the “no free lunch” principle just mentioned.

5.1. Classical ESN approach

Some of the most generic guidelines of producing good reservoirs were presented in the papers that introduced

ESNs [12,42]. Motivated by an intuitive goal of producing a “rich” set of dynamics, the recipe is to generate a (i) big, (ii) sparsely and (iii) randomly connected, reservoir. This means that (i) Nx is sufficiently large, with order ranging from tens to thousands, (ii) the weight matrix W is sparse, with several to 20 per cent of possible connections, and (iii) the weights of the connections are usually generated randomly from a uniform distribution symmetric around the zero value. This design rationale aims at obtaining many, due to (i), reservoir activation signals, which are only loosely coupled, due to (ii), and different, due to (iii).
The input weights Win and the optional output feedback weights Wofb are usually dense (they can also be sparse like W) and generated randomly from a uniform distribution. The exact scaling of both matrices and an optional shift of the input (a constant value added to u(n)) are the few other free parameters that one has to choose when “baking” an ESN. The rules of thumb for them are the following. The scaling of Win and shifting of the input depends on how much nonlinearity of the processing unit the task needs: if the inputs are close to 0, the tanh neurons tend to operate with activations close to 0, where they are essentially linear, while inputs far from 0 tend to drive them more towards saturation where they exhibit more nonlinearity. The shift of the input may help to overcome undesired consequences of the symmetry around 0 of the tanh neurons with respect to the sign of the signals. Similar effects are produced by scaling the bias inputs to the neurons (i.e., the column of Win corresponding to constant input, which often has a different scaling factor than the rest of Win). The scaling of Wofb is in practice limited by a threshold at which the ESN starts to exhibit an unstable behavior, i.e., the output feedback loop starts to amplify (the errors of) the output and thus enters a diverging generative mode. In [42], these and related pieces of advice are given without a formal justification.
An important element for ESNs to work is that the reservoir should have the echo state property [12]. This condition in essence states that the effect of a previous state x(n) and a previous input u(n) on a future state x(n + k) should vanish gradually as time passes (i.e., k → ∞), and not persist or even get amplified. For most practical purposes, the echo state property is assured if the reservoir weight matrix W is scaled so that its spectral radius ρ(W) (i.e., the largest absolute eigenvalue) satisfies ρ(W) < 1 [12]. Or, using another term, W is contractive. The fact that ρ(W) < 1 almost always ensures the echo state property has led to an unfortunate misconception which is expressed in many RC publications, namely, that ρ(W) < 1 amounts to a necessary and sufficient condition for the echo state property. This is wrong. The mathematically correct connection between the spectral radius and the echo state property is that the latter is violated if ρ(W) > 1 in reservoirs using the tanh function as neuron nonlinearity, and for zero input. Contrary to widespread misconceptions, the echo state property can be obtained even if ρ(W) > 1 for non-zero input (including bias inputs to neurons), and it may be lost even if ρ(W) < 1, although it is hard to construct systems where this occurs (unless f′(0) > 1 for the nonlinearity f), and in practice this does not happen.
The optimal value of ρ(W) should be set depending on the amount of memory and nonlinearity that the given task requires. A rule of thumb, likewise discussed in [12], is that ρ(W) should be close to 1 for tasks that require long memory and accordingly smaller for the tasks where a too long memory might in fact be harmful. A larger ρ(W) also has the effect of driving the signals x(n) into more nonlinear regions of the tanh units (further from 0), similarly to Win. Thus the scalings of both Win and W have a similar effect on the nonlinearity of the ESN, while their difference determines the amount of memory.
A rather conservative rigorous sufficient condition of the echo state property for any kind of inputs u(n) (including zero) and states x(n) (with tanh nonlinearity) being σmax(W) < 1, where σmax(W) is the largest singular value of W, was proved in [12]. Recently, a less restrictive sufficient condition, namely, infD∈D σmax(DWD⁻¹) < 1, where D is an arbitrary matrix minimizing the so-called D-norm σmax(DWD⁻¹) from a set D ⊂ RNx×Nx of diagonal matrices, has been derived in [63]. This sufficient condition approaches the necessary one, infD∈D σmax(DWD⁻¹) → ρ(W)−, ρ(W) < 1, e.g., when W is a normal or a triangular (permuted) matrix. A rigorous sufficient condition for the echo state property is rarely ensured in practice, with a possible exception being critical control tasks, where provable stability under any conditions is required.
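The classical recipe of this subsection can be summarized in a few lines of code. The following is a hedged sketch of one common way to implement it (the concrete parameter values are illustrative defaults of ours, not prescriptions from the text): generate a sparse random W, rescale it to a chosen spectral radius, and generate dense, scaled input weights Win. The resulting matrices can be plugged directly into the state update of Section 2.4 and the readout training sketched earlier.

```python
import numpy as np

def generate_reservoir(Nu, Nx, spectral_radius=0.9, sparsity=0.1,
                       input_scaling=1.0, seed=0):
    """Classical ESN reservoir generation (a sketch of Section 5.1).

    Returns (W_in, W): dense input weights of shape (Nx, Nu) and a sparse
    internal weight matrix of shape (Nx, Nx) rescaled to the given
    spectral radius.
    """
    rng = np.random.default_rng(seed)

    # (ii)-(iii): sparse W with weights drawn uniformly, symmetric around zero;
    # `sparsity` is the fraction of connections kept (here about 10 per cent)
    W = rng.uniform(-1, 1, (Nx, Nx))
    W[rng.random((Nx, Nx)) > sparsity] = 0.0

    # rescale W to the desired spectral radius (rule of thumb: close to 1 for
    # long-memory tasks, smaller otherwise); recall that rho(W) < 1 is neither
    # strictly necessary nor sufficient for the echo state property
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    W *= spectral_radius / rho

    # dense input weights; their scaling controls how far into the nonlinear
    # range of tanh the units are driven
    W_in = rng.uniform(-input_scaling, input_scaling, (Nx, Nu))
    return W_in, W
```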

5.2. Different topologies of the reservoir

There have been attempts to find topologies of the ESN reservoir different from sparsely randomly connected ones. Specifically, small-world [64], scale-free [65], and biologically inspired connection topologies generated by spatial growth [66] were tested for this purpose in a careful study [67], which we point out here due to its relevance although it was obtained only as a BSc thesis. The NRMS error (1) of y(n) as well as the eigenvalue spread of the cross-correlation matrix of the activations x(n) (necessary for a fast online learning described in Section 8.1.2; see Section 6.1 for details) were used as the performance measures of the topologies. This work also explored an exhaustive brute-force search of topologies of tiny networks (motifs) of four units, and then combining successful motifs (in terms of the eigenvalue spread) into larger networks. The investigation, unfortunately, concludes that “(. . . ) none of the investigated network topologies was able to perform significantly better than simple random networks, both in terms of eigenvalue spread as well as testing error” [67]. This, however, does not serve as a proof that similar approaches are futile. An indication of this is the substantial variation in ESN performance observed among randomly created reservoirs, which is, naturally, more pronounced in smaller reservoirs (e.g., [68]).
In contrast, LSMs often use a biologically plausible connectivity structure and weight settings. In the original form they model a single cortical microcolumn [11]. Since the model of both the connections and the neurons themselves is quite sophisticated, it has a large number of free parameters to be set, which is done manually, guided by biologically observed parameter ranges, e.g., as found in the rat somatosensory cortex [69]. This type of model also delivers good performance for practical applications of speech recognition [69,70] (and many similar publications by the latter authors). Since LSMs aim at accuracy of modeling natural neural structures, less biologically plausible connectivity patterns are usually not explored.
It has been demonstrated that much more detailed biological neural circuit models, which use anatomical and neurophysiological data-based laminar (i.e., cortical layer) connectivity structures and Hodgkin–Huxley model neurons, improve the information-processing capabilities of the models [23]. Such highly realistic (for present-day standards) models “perform significantly better than control circuits (which are lacking the laminar structures but are otherwise identical with regard to their components and overall connection statistics) for a wide variety of fundamental information-processing tasks” [23].
Different from this direction of research, there are also explorations of using even simpler topologies of the reservoir than the classical ESN. It has been demonstrated that the reservoir can even be an unstructured feed-forward network with time-delayed connections if the finite limited memory window that it offers is sufficient for the task at hand [71]. A degenerate case of a “reservoir” composed of linear units and a diagonalized W and unitary inputs Win was considered in [72]. A one-dimensional lattice (ring) topology was used for a reservoir, together with an adaptation of the reservoir discussed in Section 6.2, in [73]. A special kind of excitatory and inhibitory neurons connected in a one-dimensional spatial arrangement was shown to produce interesting chaotic behavior in [74].
A tendency that higher ranks of the connectivity matrix Wmask (where wmask i,j = 1 if wi,j ≠ 0, and = 0 otherwise, for i, j = 1, . . . , Nx) correlate with lower ESN output errors was observed in [75]. Connectivity patterns of W such that W∞ ≡ lim k→∞ Wk (Wk standing for “W to the power k” and approximating weights of the cumulative indirect connections by paths of length k among the reservoir units) is neither fully connected, nor all-zero, are claimed to give a broader distribution of ESN prediction performances, thus including best performing reservoirs, than random sparse connectivities in [76]. A permutation matrix with a medium number and different lengths of connected cycles, or a general orthogonal matrix, are suggested as candidates for such Ws.

5.3. Modular reservoirs

One of the shortcomings of conventional ESN reservoirs is that, even though they are sparse, the activations are still coupled so strongly that the ESN is poor in dealing with different time scales simultaneously, e.g., predicting several superimposed generators. This problem was successfully tackled by dividing the reservoir into decoupled sub-reservoirs and introducing inhibitory connections among all the sub-reservoirs [77]. For the approach to be effective, the inhibitory connections must predict the activations of the sub-reservoirs one time step ahead. To achieve this the inhibitory connections are heuristically computed from (the rest of) W and Wofb, or the sub-reservoirs are updated in a sequence and the real activations of the already updated sub-reservoirs are used.
The Evolino approach introduced in Section 3.3 can also be classified as belonging to this group, as the LSTM RNN used for its reservoir consists of specific small memory-holding modules (which could alternatively be regarded as more complicated units of the network).
Approaches relying on combining outputs from several separate reservoirs will be discussed in Section 8.8.

Fig. 2 – Signal flow diagram of the standard ESN.

5.4. Time-delayed vs. instantaneous connections

Another time-related limitation of the classical ESNs pointed out in [78] is that no matter how many neurons are contained in the reservoir, it (like any other fully recurrent network with all connections having a time delay) has only a single layer of neurons (Fig. 2). This makes it intrinsically unsuitable for some types of problems. Consider a problem where the mapping from u(n) to ytarget(n) is a very complex, nonlinear one, and the data in neighboring time steps are almost independent (i.e., little memory is required), as e.g., the “meta-learning” task in [79] (ESNs have been shown to perform well in a significantly simpler version of the “meta-learning” in [80]). Consider a single time step n: signals from the input u(n) propagate only through one untrained layer of weights Win, through the nonlinearity f influence the activations x(n), and reach the output y(n) through the trained weights Wout (Fig. 2). Thus ESNs are not capable of producing a very complex instantaneous mapping from u(n) to y(n) using a realistic number of neurons, which could (only) be effectively done by a multilayer FFNN (not counting some non-NN-based methods). Delaying the target ytarget by k time steps would in fact make the signals coming from u(n) “cross” the nonlinearities k + 1 times before reaching y(n + k), but would mix the information from different time steps in x(n), . . . , x(n + k), breaking the required virtually independent mapping u(n) → ytarget(n + k), if no special structure of W is imposed.
As a possible remedy Layered ESNs were introduced in [78], where a part (up to almost half) of the reservoir connections can be instantaneous and the rest take one time step for the signals to propagate as in normal ESNs. Randomly generated Layered ESNs, however, do not offer a consistent improvement for large classes of tasks, and pre-training methods of such reservoirs have not yet been investigated.
The issue of standard ESNs not having enough trained layers is also discussed and addressed in a broader context in Section 8.8.

5.5. Leaky integrator neurons and speed of dynamics

In addition to the basic sigmoid units, leaky integrator neurons were suggested to be used in ESNs from the point of their introduction [12]. This type of neuron performs a



leaky integration of its activation from previous time steps. Today a number of versions of leaky integrator neurons are often used in ESNs, which we will call here leaky integrator ESNs (LI-ESNs) where the distinction is needed. The main two groups are those using leaky integration before application of the activation function f(·), and after. One example of the latter (in the discretized time case) has reservoir dynamics governed by

x(n) = (1 − a∆t) x(n − 1) + ∆t f(Win u(n) + W x(n − 1)),  (9)

where ∆t is a compound time gap between two consecutive time steps divided by the time constant of the system and a is the decay (or leakage) rate [81]. Another popular (and we believe, preferable) design can be seen as setting a = 1 and redefining ∆t in Eq. (9) as the leaking rate a to control the “speed” of the dynamics,

x(n) = (1 − a) x(n − 1) + a f(Win u(n) + W x(n − 1)),  (10)

which in effect is an exponential moving average, has only one additional parameter and the desirable property that neuron activations x(n) never go outside the boundaries defined by f(·). Note that the simple ESN (5) is a special case of LI-ESNs (9) or (10) with a = 1 and ∆t = 1. As a corollary, an LI-ESN with a good choice of the parameters can always perform at least as well as a corresponding simple ESN. With the introduction of the new parameter a (and ∆t), the condition for the echo state property is redefined [12]. A natural constraint on the two new parameters is a∆t ∈ [0, 1] in (9), and a ∈ [0, 1] in (10) — a neuron should neither retain, nor leak, more activation than it had. The effect of these parameters on the final performance of ESNs was investigated in [18] and [82]. The latter contribution also considers applying the leaky integrator in different places of the model and resampling the signals as an alternative.
The additional parameters of the LI-ESN control the “speed” of the reservoir dynamics. Small values of a and ∆t result in reservoirs that react slowly to the input. By changing these parameters it is possible to shift the effective interval of frequencies in which the reservoir is working. Along these lines, time warping invariant ESNs (TWIESNs) — an architecture that can deal with strongly time-warped signals — were outlined in [81,18]. This architecture varies ∆t on-the-fly in (9), directly depending on the speed at which the input u(n) is changing.
From a signal processing point of view, the exponential moving average on the neuron activation (10) does a simple low-pass filtering of its activations with the cutoff frequency

fc = a / (2π(1 − a)∆t),  (11)

where ∆t is the discretization time step. This makes the neurons average out the frequencies above fc and enables tuning the reservoirs for particular frequencies. Elaborating further on this idea, high-pass neurons, that produce their activations by subtracting from the unfiltered activation (5) the low-pass filtered one (10), and band-pass neurons, that combine the low-pass and high-pass ones, were introduced [83]. The authors also suggested mixing neurons with different passbands inside a single ESN reservoir, and reported that a single reservoir of such kind is able to predict/generate signals having structure on different timescales.
the measure. If r = k, this means that all the presented inputs can be separated by a linear readout from the reservoir, and thus the reservoir is said to have a linear separation property. For estimating the generalization capability of the reservoir, the same procedure can be performed with s (s ≫ k) inputs uj(n), j = 1, . . . , s, that represent the set of all possible inputs. If the resultant rank r is substantially smaller than the size s of the training set, the reservoir generalizes well. These two measures are more targeted to tasks of time series classification, but can also be revealing in predicting the performance of regression [90].
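As an illustration, both rank-based measures of [89] reduce to a few lines of linear algebra. The sketch below is not the reference implementation of [89]; the reservoir-running helper, the input sets and the tolerance are our own illustrative choices.

```python
import numpy as np

def run_reservoir(u_seq, W_in, W, a=0.3):
    """Drive a (leaky-integrated) reservoir with u_seq and return the final state."""
    x = np.zeros(W.shape[0])
    for u in u_seq:
        x = (1 - a) * x + a * np.tanh(W_in @ u + W @ x)
    return x

def state_rank(inputs, W_in, W, tol=1e-6):
    """Rank of the matrix M whose rows are the reservoir states reached for
    each input sequence.  Used for the kernel quality (k diverse inputs) and,
    with a large representative input set, for the generalization measure."""
    M = np.vstack([run_reservoir(u_seq, W_in, W) for u_seq in inputs])
    return np.linalg.matrix_rank(M, tol=tol)

# Example: k = 20 random input segments of length 50, one input channel.
rng = np.random.default_rng(0)
Nx, Nu, k = 100, 1, 20
W_in = rng.uniform(-0.5, 0.5, (Nx, Nu))
W = rng.uniform(-0.5, 0.5, (Nx, Nx))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

inputs = [rng.uniform(-1, 1, (50, Nu)) for _ in range(k)]
print("kernel quality (rank r, ideally = k):", state_rank(inputs, W_in, W))
```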
A much-desired measure to minimize is the eigenvalue spread (EVS, the ratio of the maximal eigenvalue to the minimal eigenvalue) of the cross-correlation matrix of the activations x(n). A small EVS is necessary for an online training of the ESN output by a computationally cheap and stable stochastic gradient descent algorithm outlined in Section 8.1.2 (see, e.g., [91], chapter 5.3, for the mathematical reasons that render this mandatory). In classical ESNs the EVS sometimes reaches 10^12 or even higher [92], which makes the use of stochastic gradient descent training unfeasible. Other commonly desirable features of the reservoir are small pairwise correlation of the reservoir activations xi(n), or a large entropy of the x(n) distribution (e.g., [92]). The latter is a rather popular measure, as discussed later in this review. A criterion for maximizing the local information transmission of each individual neuron was investigated in [93] (more in Section 6.2).

The so-called edge of chaos is a region of parameters of a dynamical system at which it operates at the boundary between the chaotic and non-chaotic behavior. It is often claimed (but not undisputed; see, e.g., [94]) that at the edge of chaos many types of dynamical systems, including binary systems and reservoirs, possess high computational power [87,95]. It is intuitively clear that the edge of chaos in reservoirs can only arise when the effect of inputs on the reservoir state does not die out quickly; thus such reservoirs can potentially have high memory capacity, which is also demonstrated in [95]. However, this does not universally imply that such reservoirs are optimal [90]. The edge of chaos can be empirically detected (even for biological networks) by measuring Lyapunov exponents [95], even though such measurements are not trivial (and often involve a degree of expert judgment) for high-dimensional noisy systems. For reservoirs of simple binary threshold units this can be done more simply by computing the Hamming distances between trajectories of the states [87]. There is also an empirical observation that, while changing different parameter settings of a reservoir, the best performance in a given task correlates with a Lyapunov exponent specific to that task [59]. The optimal exponent is related to the amount of memory needed for the task, as discussed in Section 5.1. It was observed in ESNs with no input that when ρ(W) is slightly greater than 1, the internally generated signals are periodic oscillations, whereas for larger values of ρ(W), the signals are more irregular and even chaotic [96]. Even though stronger inputs u(n) can push the dynamics of the reservoirs out of the chaotic regime and thus make them useful for computation, no reliable benefit of such a mode of operation was found in the last contribution.
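A crude way to locate a driven reservoir with respect to the edge of chaos is to estimate its largest Lyapunov exponent from the growth rate of a small state perturbation. The sketch below is only one of several possible estimators, with our own parameter choices; it is meant as an illustration, not as the procedure used in the cited studies.

```python
import numpy as np

def largest_lyapunov(W_in, W, u_seq, a=1.0, eps=1e-8):
    """Average exponential growth rate of a tiny perturbation of x(n) along
    the input-driven trajectory; values > 0 suggest chaotic reservoir dynamics,
    values < 0 a contracting (echo-state-like) regime."""
    Nx = W.shape[0]
    x, x_pert = np.zeros(Nx), np.zeros(Nx)
    x_pert[0] = eps                       # seed the perturbation
    log_growth = 0.0
    for u in u_seq:
        x = (1 - a) * x + a * np.tanh(W_in @ u + W @ x)
        x_pert = (1 - a) * x_pert + a * np.tanh(W_in @ u + W @ x_pert)
        d = np.linalg.norm(x_pert - x)
        if d == 0.0:                      # perturbation vanished: strongly contracting
            return -np.inf
        log_growth += np.log(d / eps)
        # renormalize the perturbed trajectory back to distance eps
        x_pert = x + (x_pert - x) * (eps / d)
    return log_growth / len(u_seq)
```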
In contrast to ESN-type reservoirs of real-valued units, simple binary threshold units exhibit a more immediate transition from damped to chaotic behavior without intermediate periodic oscillations [87]. This difference between the two types of activation functions, including intermediate quantized ones, in ESN-type reservoirs was investigated more closely in [88]. The investigation showed that reservoirs of binary units are more sensitive to the topology and the connection weight parameters of the network in their transition between damped and chaotic behavior, and computational performance, than the real-valued ones. This difference can be related to the similar apparent difference in sensitivity of the ESNs and LSM-type reservoirs of firing units, discussed in Section 5.2.

6.2. Unsupervised local methods

A natural strategy for improving reservoirs is to mimic biology (at a high level of abstraction) and count on local adaptation rules. “Local” here means that parameters pertaining to some neuron i are adapted on the basis of no other information than the activations of neurons directly connected with neuron i. In fact all local methods are almost exclusively unsupervised, since the information on the performance E at the output is unreachable in the reservoir.

First attempts to decrease the eigenvalue spread in ESNs by classical Hebbian [97] (inspired by synaptic plasticity in biological brains) or Anti-Hebbian learning gave no success [92]. A modification of Anti-Hebbian learning, called Anti-Oja learning, is reported to improve the performance of ESNs in [98].

On the more biologically realistic side of the RC research with spiking neurons, local unsupervised adaptations are very natural to use. In fact, LSMs had used synaptic connections with realistic short-term dynamic adaptation, as proposed by [99], in their reservoirs from the very beginning [11].

The Hebbian learning principle is usually implemented in spiking NNs as spike-time-dependent plasticity (STDP) of synapses. STDP is shown to improve the separation property of LSMs for real-world speech data, but not for random inputs u, in [100]. The authors however were uncertain whether manually optimizing the parameters of the STDP adaptation (which they did) or the ones for generating the reservoir would result in a larger performance gain for the same effort spent. STDP is shown to work well with time-coded readouts from the reservoir in [101].

Biological neurons are widely observed to adapt their intrinsic excitability, which often results in exponential distributions of firing rates, as observed in visual cortex (e.g., [102]). This homeostatic adaptation mechanism, called intrinsic plasticity (IP), has recently attracted a wide attention in the reservoir computing community. Mathematically, the exponential distribution maximizes the entropy of a non-negative random variable with a fixed mean; thus it enables the neurons to transmit maximal information for a fixed metabolic cost of firing. An IP learning rule for spiking model neurons aimed at this goal was first presented in [103].

For a more abstract model of the neuron, having a continuous Fermi sigmoid activation function f : R →
(0, 1), the IP rule was derived as a proportional control that changes the steepness and offset of the sigmoid to get an exponential-like output distribution in [104]. A more elegant gradient IP learning rule for the same purpose was presented in [93], which is similar to the information maximization approach in [105]. Applying IP with Fermi neurons in reservoir computing significantly improves the performance of BPDC-trained networks [106,107], and is shown to have a positive effect on offline trained ESNs, but can cause stability problems for larger reservoirs [106]. An ESN reservoir with IP-adapted Fermi neurons is also shown to enable predicting several superimposed oscillators [108].

An adaptation of the IP rule to tanh neurons (f : R → (−1, 1)) that results in a zero-mean Gaussian-like distribution of activations was first presented in [73] and investigated more in [55]. The IP-adapted ESNs were compared with classical ones, both having Fermi and tanh neurons, in the latter contribution. IP was shown to (modestly) improve the performance in all cases. It was also revealed that ESNs with Fermi neurons have significantly smaller short-term memory capacity (as in Section 6.1) and worse performance in a synthetic NARMA prediction task, while having a slightly better performance in a speech recognition task, compared to tanh neurons. The same type of tanh neurons adapted by IP aimed at Laplacian distributions are investigated in [109]. In general, IP gives more control on the working points of the reservoir nonlinearity sigmoids. The slope (first derivative) and the curvature (second derivative) of the sigmoid at the point around which the activations are centered by the IP rule affect the effective spectral radius and the nonlinearity of the reservoir, respectively. Thus, for example, centering tanh activations around points other than 0 is a good idea if no quasi-linear behavior is desired. IP has recently become employed in reservoirs as a standard practice by several research groups.
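As a loose illustration only — the exact gradient updates of [93,73,55] differ — the following moment-matching caricature conveys how a per-unit gain and bias of tanh neurons can be nudged towards a zero-mean, fixed-variance activation distribution. All names and constants here are our own and should not be read as a reimplementation of the cited rules.

```python
import numpy as np

def ip_moment_matching_step(gain, bias, pre, eta=1e-2, target_var=0.04):
    """Simplified intrinsic-plasticity-style update for tanh units
    y = tanh(gain * pre + bias): push each unit's output mean towards 0 and
    its variance towards target_var.  A moment-matching caricature of the
    gradient IP rules from the literature, not a reimplementation of them."""
    y = np.tanh(gain * pre + bias)
    mean_err = y.mean(axis=0)                 # per-unit deviation from zero mean
    var_err = y.var(axis=0) - target_var      # per-unit deviation from target variance
    bias -= eta * mean_err
    gain -= eta * var_err * gain              # shrink/grow gain towards the target spread
    return gain, bias

# Demo on synthetic pre-activations: shape (time steps, number of units).
rng = np.random.default_rng(1)
pre = rng.normal(0.0, 1.0, (1000, 50))
gain, bias = np.ones(50), np.zeros(50)
for _ in range(2000):
    gain, bias = ip_moment_matching_step(gain, bias, pre)
```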
Overall, an information-theoretic view on adaptation of spiking neurons has a long history in computational neuroscience. Even better than maximizing just any information in the output of a neuron is maximizing relevant information. In other words, in its output the neuron should encode the inputs in such a way as to preserve maximal information about some (local) target signal. This is addressed in a general information-theoretical setting by the Information Bottleneck (IB) method [110]. A learning rule for a spiking neuron that maximizes mutual information between its inputs and its output is presented in [111]. A more general IB learning rule, transferring the general ideas of the IB method to spiking neurons, is introduced in [112] and [113]. Two semi-local training scenarios are presented in these two contributions. In the first, a neuron optimizes the mutual information of its output with outputs of some neighboring neurons, while minimizing the mutual information with its inputs. In the second, two neurons reading from the same signals maximize their information throughput, while keeping their inputs statistically independent, in effect performing Independent Component Analysis (ICA). A simplified online version of the IB training rule with a variation capable of performing Principal Component Analysis (PCA) was recently introduced in [114]. In addition, it assumes slow semi-local target signals, which is more biologically plausible. The approaches described in this paragraph are still waiting to be tested in the reservoir computing setting.

It is also of great interest to understand how different types of plasticity observed in biological brains interact when applied together and what effect this has on the quality of reservoirs. The interaction of the IP with Hebbian synaptic plasticity in a single Fermi neuron is investigated in [104] and further in [115]. The synergy of the two plasticities is shown to result in a better specialization of the neuron that finds heavy-tail directions in the input. An interaction of IP with a neighborhood-based Hebbian learning in a layer of such neurons was also shown to maximize information transmission, perform nonlinear ICA, and result in an emergence of orientational Gabor-like receptive fields in [116]. The interaction of STDP with IP in an LSM-like reservoir of simple sparsely spiking neurons was investigated in [117]. The interaction turned out to be a non-trivial one, resulting in networks more robust to perturbations of the state x(n) and having a better short-time memory and time series prediction performance.

A recent approach of combining STDP with a biologically plausible reinforcement signal is discussed in Section 7.5, as it is not unsupervised.

6.3. Unsupervised global methods

Here we review unsupervised methods that optimize reservoirs based on global information of the reservoir activations induced by the given input u(n), but irrespective of the target ytarget(n), like for example the measures discussed in Section 6.1. The intuitive goal of such methods is to produce good representations of (the history of) u(n) in x(n) for any (and possibly several) ytarget(n).

A biologically inspired unsupervised approach with a reservoir trying to predict itself is proposed in [118]. An additional output z(n) ∈ RNx, z(n) = Wz x(n), from the reservoir is trained on the target ztarget(n) = x′(n + 1), where x′(n) are the activations of the reservoir before applying the neuron transfer function tanh(·), i.e., x(n) = tanh(x′(n)). Then, in the application phase of the trained networks, the original activations x′(n), which result from u(n), Win, and W, are mixed with the self-predictions z(n − 1) obtained from Wz, with a certain mixing ratio (1 − α) : α. The coefficient α determines how much the reservoir is relying on the external input u(n) and how much on the internal self-prediction z(n). With α = 0 we have the classical ESN and with α = 1 we have an “autistic” reservoir that does not react to the input. Intermediate values of α close to 1 were shown to enable reservoirs to generate slow, highly nonlinear signals that are hard to get otherwise.
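A minimal sketch of this self-prediction idea in the application phase is given below, with our own variable names and a pre-trained self-prediction matrix Wz assumed (Wz would have been obtained earlier by linear regression of x′(n + 1) on x(n) during a training run). The essential line is the (1 − α) : α mix of the externally driven pre-activations with the predicted ones.

```python
import numpy as np

def self_predicting_step(x_prev, u, W_in, W, W_z, alpha):
    """One update of a reservoir that partially relies on its own prediction
    z(n-1) = W_z x(n-1) of the next pre-activations x'(n).
    alpha = 0 gives the classical ESN, alpha = 1 an 'autistic' reservoir."""
    z_prev = W_z @ x_prev                     # self-prediction of x'(n)
    x_pre = W_in @ u + W @ x_prev             # externally driven pre-activations x'(n)
    mixed = (1.0 - alpha) * x_pre + alpha * z_prev
    return np.tanh(mixed)
```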
An algebraic unsupervised way of generating ESN reservoirs was proposed in [119]. The idea is to linearize the ESN update equation (5) locally around its current state x(n) at every time step n to get a linear approximation of (5) as x(n + 1) = Ax(n) + Bu(n), where A and B are time (n)-dependent matrices corresponding to W and Win respectively. The approach aims at distributing the predefined complex eigenvalues of A uniformly within the unit circle on the C plane. The reservoir matrix W is obtained analytically
from the set of these predefined eigenvalues and a given input u(n). The motivation for this is, as for Kautz filters [120] in linear systems, that if the target ytarget(n) is unknown, it is best to have something like an orthogonal basis in x(n), from which any ytarget(n) could, on average, be constructed well. The spectral radius of the reservoir is suggested to be set by hand (according to the correlation time of u(n), which is an indication of a memory span needed for the task), or by adapting the bias value of the reservoir units to minimize the output error (which actually renders this method supervised, as in Section 7). Reservoirs generated this way are shown to yield higher average entropy of the x(n) distribution, higher short-term memory capacity (both measures mentioned in Section 6.1), and a smaller output error on a number of synthetic problems, using relatively small reservoirs (Nx = 20, 30). However, a more extensive empirical comparison of this type of reservoir with the classical ESN one is still lacking.

7. Supervised reservoir pre-training

In this section we discuss methods for training reservoirs to perform a specific given task, i.e., not only the concrete input u(n), but also the desired output ytarget(n) is taken into account. Since a linear readout from a reservoir is quickly trained, the suitability of a candidate reservoir for a particular task (e.g., in terms of NRMSE (1)) is inexpensive to check. Notice that even for most methods of this class the explicit target signal ytarget(n) is not technically required for training the reservoir itself, but only for evaluating it in an outer loop of the adaptation process.

7.1. Optimization of global reservoir parameters

In Section 5.1 we discussed guidelines for the manual choice of global parameters for reservoirs of ESNs. This approach works well only with experience and a good intuitive grasp on nonlinear dynamics. A systematic gradient descent method of optimizing the global parameters of LI-ESNs (recalled from Section 5.5) to fit them to a given task is presented in [18]. The investigation shows that the error surfaces in the combined global parameter and Wout spaces may have very high curvature and multiple local minima. Thus, gradient descent methods are not always practical.

7.2. Evolutionary methods

As one can see from the previous sections of this review, optimizing reservoirs is generally challenging, and breakthrough methods remain to be found. On the other hand, checking the performance of a resulting ESN is relatively inexpensive, as said. This brings in evolutionary methods for the reservoir pre-training as a natural strategy.

Recall that the classical method generates a reservoir randomly; thus the performance of the resulting ESN varies slightly (and for small reservoirs not so slightly) from one instance to another. Then indeed, an “evolutionary” method as naive as “generate k reservoirs, pick the best” will outperform the classical method (“generate a reservoir”) with probability (k − 1)/k, even though the improvement might be not striking.
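In code, this naive but surprisingly competitive baseline amounts to a few lines; the reservoir-generating and validation routines below are task-specific placeholders (hypothetical names of our own), and the error measure would typically be the NRMSE (1) on held-out data.

```python
import numpy as np

def pick_best_reservoir(k, make_reservoir, train_and_validate):
    """Generate k random reservoirs, train a readout for each, and keep the one
    with the smallest validation error.  `make_reservoir` and
    `train_and_validate` are task-specific callables supplied by the user."""
    best_err, best_res = np.inf, None
    for _ in range(k):
        reservoir = make_reservoir()            # random W_in, W, ... for one candidate
        err = train_and_validate(reservoir)     # train W_out, return validation error
        if err < best_err:
            best_err, best_res = err, reservoir
    return best_res, best_err
```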
Several evolutionary approaches on optimizing reservoirs of ESNs are presented in [121]. The first approach was to carry out an evolutionary search on the parameters for generating W: Nx, ρ(W), and the connection density of W. Then an evolutionary algorithm [122] was used on individuals consisting of all the weight matrices (Win, W, Wofb) of small (Nx = 5) reservoirs. A variant with a reduced search space was also tried where the weights, but not the topology, of W were explored, i.e., elements of W that were zero initially always stayed zero. The empirical results of modeling the motion of an underwater robot showed superiority of the methods over other state-of-art methods, and that the topology-restricted adaptation of W is almost as effective as the full one.

Another approach of optimizing the reservoir W by a greedy evolutionary search is presented in [75]. Here the same idea of separating the topology and weight sizes of W to reduce the search space was independently used, but the search was, conversely, restricted to the connection topology. This approach also was demonstrated to yield on average 50% smaller (and much more stable) error in predicting the behavior of a mass–spring–damper system with small (Nx = 20) reservoirs than without the genetic optimization.

Yet another way of reducing the search space of the reservoir parameters is constructing a big reservoir weight matrix W in a fractal fashion by repeatedly applying Kronecker self-multiplication to an initial small matrix, called the Kronecker kernel [123]. This contribution showed that among Ws constructed in this way some yield ESN performance similar to the best unconstrained Ws; thus only the good weights of the small Kronecker kernel need to be found by evolutionary search for producing a well-performing reservoir.

Evolino [46], introduced in Section 3.3, is another example of adapting a reservoir (in this case an LSTM network) using a genetic search.

It has been recently demonstrated that by adapting only the slopes of the reservoir unit activation functions f(·) by a state-of-art evolutionary algorithm, and having Wout random and fixed, a prediction performance of an ESN can be achieved close to the best of classical ESNs [68].

In addition to (or instead of) adapting the reservoirs, an evolutionary search can also be applied in training the readouts, such as readouts with no explicit ytarget(n), as discussed in Section 8.4.

7.3. Other types of supervised reservoir tuning

A greedy pruning of neurons from a big reservoir has been shown in a recent initial attempt [124] to often give a (bit) better classification performance for the same final Nx than just a randomly created reservoir of the same size. The effect of neuron removal to the reservoir dynamics, however, has not been addressed yet.
7.4. Trained auxiliary feedbacks

While reservoirs have a natural capability of performing complex real-time analog computations with fading memory [11], an analytical investigation has shown that they can approximate any k-order differential equation (with persistent memory) if extended with k trained feedbacks [21,125]. This is equivalent to simulating any Turing machine, and thus also means universal digital computing. In the presence of noise (or finite precision) the memory becomes limited in such models, but they still can simulate Turing machines with finite tapes.

This theory has direct implications for reservoir computing; thus different ideas on how the power of ESNs could be improved along its lines are explored in [78]. It is done by defining auxiliary targets, training additional outputs of ESNs on these targets, and feeding the outputs back to the reservoir. Note that this can be implemented in the usual model with feedback connections (6) by extending the original output y(n) with additional dimensions that are trained before training the original (final) output. The auxiliary targets are constructed from ytarget(n) and/or u(n) or some additional knowledge of the modeled process. The intuition is that the feedbacks could shift the internal dynamics of x(n) in the directions that would make them better linearly combinable into ytarget(n). The investigation showed that for some types of tasks there are natural candidates for such auxiliary targets, which improve the performance significantly. Unfortunately, no universally applicable methods for producing auxiliary targets are known such that the targets would be both easy to learn and improve the accuracy of the final output y(n). In addition, training multiple outputs with feedback connections Wofb makes the whole procedure more complicated, as cyclical dependences between the trained outputs (one must take care of the order in which the outputs are trained) as well as stability issues discussed in Section 8.2 arise. Despite these obstacles, we perceive this line of research as having a big potential.

7.5. Reinforcement learning

In the line of biologically inspired local unsupervised adaptation methods discussed in Section 6.2, an STDP modulated by a reinforcement signal has recently emerged as a powerful learning mechanism, capable of explaining some famous findings in neuroscience (biofeedback in monkeys), as demonstrated in [126,127] and references thereof. The learning mechanism is also well biologically motivated as it uses a local unsupervised STDP rule and a reinforcement (i.e., reward) feedback, which is present in biological brains in a form of chemical signaling, e.g., by the level of dopamine. In the RC framework this learning rule has been successfully applied for training readouts from the reservoirs so far in [127], but could in principle be applied inside the reservoir too.

Overall the authors of this review believe that reinforcement learning methods are natural candidates for reservoir adaptation, as they can immediately exploit the knowledge of how well the output is learned inside the reservoir without the problems of error backpropagation. They can also be used in settings where no explicit target ytarget(n) is available. We expect to see more applications of reinforcement learning in reservoir computing in the future.

8. Readouts from the reservoirs

Conceptually, training a readout from a reservoir is a common supervised non-temporal task of mapping x(n) to ytarget(n). This is a well investigated domain in machine learning, much more so than learning temporal mappings with memory. A large choice of methods is available, and in principle any of them can be applied. Thus we will only briefly go through the ones reported to be successful in the literature.

8.1. Single-layer readout

By far the most popular readout method from the ESN reservoirs is the originally proposed [12] simple linear readout, as in (3) (we will consider it as equivalent to (8), i.e., u(n) being part of x(n)). It is shown to be often sufficient, as reservoirs provide a rich enough pool of signals for solving many application-relevant and benchmark tasks, and is very efficient to train, since optimal solutions can be found analytically.

8.1.1. Linear regression
In batch mode, learning of the output weights Wout (2) can be phrased as solving a system of linear equations

Wout X = Ytarget   (12)

with respect to Wout, where X ∈ RN×T are all x(n) produced by presenting the reservoir with u(n), and Ytarget ∈ RNy×T are all ytarget(n), both collected into respective matrices over the training period n = 1, . . . , T. Usually x(n) data from the beginning of the training run are discarded (they come before n = 1), since they are contaminated by initial transients.

Since typically the goal is minimizing a quadratic error E(Ytarget, Wout X) as in (1) and T > N, to solve (12) one usually employs methods for finding least square solutions of overdetermined systems of linear equations (e.g., [128]), the problem also known as linear regression. One direct method is calculating the Moore–Penrose pseudoinverse X+ of X, and Wout as

Wout = Ytarget X+.   (13)

Direct pseudoinverse calculations exhibit high numerical stability, but are expensive memory-wise for large state-collecting matrices X ∈ RN×T, thereby limiting the size of the reservoir N and/or the number of training samples T.

This issue is resolved in the normal equations formulation of the problem (note that our matrices are transposed compared to the conventional notation):

Wout XX^T = Ytarget X^T.   (14)

A naive solution of it would be

Wout = Ytarget X^T (XX^T)−1.   (15)

Note that in this case Ytarget X^T ∈ RNy×N and XX^T ∈ RN×N do not depend on the length T of the training sequence, and can be calculated incrementally while the training data are passed through the reservoir. Thus, having these two
matrices collected, the solution complexity of (15) does not depend on T either in time or in space. Also, intermediate values of Wout can be calculated in the middle of running through the training data, e.g., for an early assessment of the performance, making this a “semi-online” training method.

The method (15) has lower numerical stability, compared to (13), but the problem can be mitigated by using the pseudoinverse (XX^T)+ instead of the real inverse (XX^T)−1 (which usually also works faster). In addition, this method enables one to introduce ridge, or Tikhonov, regularization elegantly:

Wout = Ytarget X^T (XX^T + α^2 I)−1,   (16)

where I ∈ RN×N is the identity matrix and α is a regularization factor. In addition to improving the numerical stability, the regularization in effect reduces the magnitudes of entries in Wout, thus mitigating sensitivity to noise and overfitting; see Section 8.2 for more details. All this makes (16) a highly recommendable choice for learning outputs from the reservoirs.
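A compact sketch of the batch readout training just described — incremental accumulation of the matrices in (14), followed by the ridge solution (16) — could look as follows; variable names and the regularization value are illustrative, and solving the linear system is used in place of forming an explicit inverse.

```python
import numpy as np

def train_readout_ridge(state_batches, target_batches, alpha=1e-6):
    """Accumulate XX^T and Ytarget X^T over the training data, then solve the
    ridge-regularized normal equations (16) for W_out.
    Each batch is an (N x t) block of states and an (Ny x t) block of targets,
    as in (12); the accumulated matrices do not grow with T."""
    N = state_batches[0].shape[0]
    Ny = target_batches[0].shape[0]
    XXT = np.zeros((N, N))
    YXT = np.zeros((Ny, N))
    for X, Y in zip(state_batches, target_batches):
        XXT += X @ X.T
        YXT += Y @ X.T
    # Solve W_out (XX^T + alpha^2 I) = Ytarget X^T for W_out.
    W_out = np.linalg.solve(XXT + alpha**2 * np.eye(N), YXT.T).T
    return W_out

def readout(W_out, x):
    """Linear readout y(n) = W_out x(n), cf. (2)."""
    return W_out @ x
```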
Another alternative for solving (14) is decomposing the matrix XX^T into a product of two triangular matrices via Cholesky or LU decomposition, and solving (14) by two steps of substitution, avoiding (pseudo-)inverses completely. The Cholesky decomposition is the more numerically stable of the two.

Weighted regression can be used for training linear readouts by multiplying both x(n) and the corresponding ytarget(n) by different weights over time, thus emphasizing some time steps n over others. Multiplying certain recorded x(n) and corresponding ytarget(n) by √k has the same emphasizing effect as if they appeared in the training sequence k times.

When the reservoir is made from spiking neurons and thus x(n) becomes a collection of spike trains, smoothing by low-pass filtering may be applied to it before doing the linear regression, or it can be done directly on x(n) [11]. For more on linear regression based on spike train data, see [129].

Evolutionary search for training linear readouts can also be employed. State-of-art evolutionary methods are demonstrated to be able to achieve the same record levels of precision for supervised tasks as with the best applications of linear regression in ESN training [68]. Their much higher computational cost is justifiable in settings where no explicit ytarget(n) is available, discussed in Section 8.4.

8.1.2. Online adaptive output weight training
Some applications require online model adaptation, e.g., in online adaptive channel equalization [17]. In such cases one typically minimizes an error that is exponentially discounted going back in time. Wout here acts as an adaptive linear combiner. The simplest way to train Wout is to use stochastic gradient descent. The method is familiar as the Least Mean Squares (LMS) algorithm in linear signal processing [91], and has many extensions and modifications. Its convergence performance is unfortunately severely impaired by large eigenvalue spreads of XX^T, as mentioned in Section 6.1.

An alternative to LMS, known in linear signal processing as the Recursive Least Squares (RLS) algorithm, is insensitive to the detrimental effects of eigenvalue spread and boasts a much faster convergence because it is a second-order method. The downside is that RLS is computationally more expensive (order O(N^2) per time step instead of O(N) for LMS, for Ny = 1) and notorious for numerical stability issues. Demonstrations of RLS are presented in [17,43]. A careful and comprehensive comparison of variants of RLS is carried out in a Master’s thesis [130], which we mention here because it will be helpful for practitioners.
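As a minimal illustration of the stochastic-gradient (LMS-style) alternative — deliberately not the more involved RLS recursions — an online update of Wout after each time step can be written as below; the learning rate and shapes are illustrative.

```python
import numpy as np

def lms_step(W_out, x, y_target, mu=1e-3):
    """One LMS update of the linear readout: move W_out along the negative
    gradient of the instantaneous squared error |y_target - W_out x|^2.
    Convergence degrades when the eigenvalue spread of XX^T is large."""
    error = y_target - W_out @ x               # instantaneous output error
    W_out = W_out + mu * np.outer(error, x)    # stochastic gradient step
    return W_out, error
```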
The BackPropagation-DeCorrelation (BPDC) algorithm discussed in Section 3.4 is another powerful method for online training of single-layer readouts with feedback connections from the reservoirs.

Simple forms of adaptive online learning, such as LMS, are also more biologically plausible than batch-mode training. From spiking neurons a firing time-coded (instead of a more common firing rate-coded) output for classification can also be trained by only adapting the delays of the output connections [101]. And firing rate-coded readouts can be trained by a biologically-realistic reward-modulated STDP [127], mentioned in Section 6.2.

8.1.3. SVM-style readout
Continuing the analogy between the temporal and non-temporal expansion methods, discussed in Section 2, the reservoir can be considered a temporal kernel, and the standard linear readout Wout from it can be trained using the same loss functions and regularizations as in Support Vector Machines (SVMs) or Support Vector Regression (SVR). Different versions of this approach are proposed and investigated in [131].

A standard SVM (having its own kernel) can also be used as a readout from a continuous-value reservoir [132]. Similarly, special kernel types could be applied in reading out from spiking (LSM-type) reservoirs [133] (and references therein).

8.2. Feedbacks and stability issues

Stability issues (with reservoirs having the echo state property) usually only occur in generative setups where a model trained on (one step) signal prediction is later run in a generative mode, looping its output y(n) back into the input as u(n + 1). Note that this is equivalent to a model with output feedbacks Wofb (6) and no input at all (Nu = 0), which is usually trained using teacher forcing (i.e., feeding ytarget(n) as y(n) for the feedbacks during the training run) and later is run freely to generate signals as y(n). Win in the first case is equivalent to Wofb in the second one. Models having feedbacks Wofb may also suffer from instability while driven with external input u(n), i.e., not in a purely generative mode.

The reason for these instabilities is that even if the model can predict the signal quite accurately, going through the feedback loop of connections Wout and Wofb (or Win) small errors get amplified, making y(n) diverge from the intended ytarget(n).

One way to look at this for trained linear outputs is to consider the feedback loop connections Wout and Wofb as part of the reservoir W. Putting (6) and (2) together we get

x(n) = f(Win u(n) + [W + Wofb Wout]x(n − 1)),   (17)
where W + Wofb Wout forms the “extended reservoir” connections, which we will call W∗ for brevity (as in [78] Section 3.2). If the spectral radius of the extended reservoir ρ(W∗) is very large we can expect unstable behavior. A more detailed analysis using Laplace transformations and a sufficient condition for stability is presented in [134]. On the other hand, for purely generative tasks, ρ(W∗) < 1 would mean that the generated signal would die out, which is not desirable in most cases. Thus producing a generator with stable dynamics is often not trivial.
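A quick numerical sanity check along these lines is to inspect the spectral radius of the trained extended reservoir; the sketch below is purely diagnostic (a linearized indicator, not the sufficient condition of [134]), with shapes following the conventions of this survey.

```python
import numpy as np

def extended_spectral_radius(W, W_ofb, W_out):
    """Spectral radius of W* = W + W_ofb W_out from (17); values far above 1
    hint at an unstable trained generator, while values well below 1 in a
    purely generative setup suggest the generated output will die out."""
    W_star = W + W_ofb @ W_out     # W: (N,N), W_ofb: (N,Ny), W_out: (Ny,N)
    return max(abs(np.linalg.eigvals(W_star)))
```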
Quite generally, models trained with clean (noise-free) data for the best one-time-step prediction diverge fast in the generative mode, as they are too “sharp” and not noise-robust. A classical remedy is adding some noise to reservoir states x(n) [12] during the training. This way the generator forms a stable attractor by learning how to come to the desired next output ytarget(n) from a neighborhood of the current state x(n), having seen it perturbed by noise during training. Setting the right amount of noise is a delicate balance between the sharpness (of the prediction) and the stability of the generator. Alternatively, adding noise to x(n) can be seen as a form of regularization in training, as it in effect also emphasizes the diagonal of matrix XX^T in (16). A similar effect can be achieved using ridge regression (16) [135], or to some extent even pruning of Wout [136]. Ridge regression (16) is the least computationally expensive to do of the three, since the reservoir does not need to be rerun with the data to test different values of the regularization factor α.

Using different modifications of signals for teacher forcing, like mixing ytarget(n) with noise, or in some cases using pure strong noise, during the training also has an effect on the final performance and stability, as discussed in Section 5.4 of [78].

8.3. Readouts for classification/recognition

The time series classification or temporal pattern detection tasks that need a category indicator (as opposed to real values) as an output can be implemented in two main ways. The most common and straightforward way is having a real-valued output for each class (or a single output and a threshold for the two-class classifier), and interpreting the strengths of the outputs as votes for the corresponding classes, or even class probabilities (several options are discussed in [18]). Often the most probable class is taken as the decision. A simple target ytarget for this approach is a constant ytarget i(n) = 1 signal for the right class i and 0 for the others in the range of n where the indicating output is expected. More elaborate shapes of ytarget(n) can improve classification performance, depending on the task (e.g., [81]). With spiking neurons the direct classification based on time coding can be learned and done, e.g., the class is assigned depending on which output fires first [101].
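In its simplest form, turning the per-time-step outputs into a class decision is a matter of accumulating (possibly time-weighted) votes, e.g. as in the short sketch below; the weighting that emphasizes late time steps, mentioned further below, corresponds to passing a non-uniform weight vector.

```python
import numpy as np

def classify_by_votes(Y, weights=None):
    """Y: (Ny x T) matrix of per-time-step class indications from the readout.
    Accumulate (optionally time-weighted) votes and return the winning class
    index; weights emphasizing late time steps put more trust in the decision
    once enough of the pattern has been seen."""
    T = Y.shape[1]
    if weights is None:
        weights = np.ones(T)
    votes = Y @ weights                  # one accumulated score per class
    return int(np.argmax(votes))
```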
The main alternative to direct class indications is to use predictive classifiers, i.e., train different predictors to predict different classes and assign a class to a new example corresponding to the predictor that predicts it best. Here the quality of each predictor serves as the output strength for the corresponding class. The method is quite popular in automated speech recognition (e.g., Section 6 in [137] for an overview). However, in Section 6.5 of [137] the author argues against this approach, at least in its straightforward form, pointing out some weaknesses, like the lack of specificity, and negative practical experience.

For both approaches a weighting scheme can be used for both training (like in weighted regression) and integrating the class votes, e.g., putting more emphasis on the end of the pattern when sufficient information has reached the classifier to make the decision.

An advanced version of ESN-based predictive classifier, where for each class there is a set of competitively trained predictors and dynamic programming is used to find the optimal sequence of them, is reported to be much more noise robust than a standard Hidden Markov Model in spoken word recognition [138].

8.4. Readouts beyond supervised learning

Even though most of the readout types from reservoirs reported in the literature are trained in a purely supervised manner, i.e., making y(n) match an explicitly given ytarget(n), the reservoir computing paradigm lends itself to settings where no ytarget(n) is available. A typical such setting is reinforcement learning where only a feedback on the model’s performance is available. Note that an explicit ytarget(n) is not required for the reservoir adaptation methods discussed in Sections 5 and 6 of this survey by definition. Even most of the adaptation methods classified as supervised in Section 7 do not need an explicit ytarget(n), as long as one can evaluate the performance of the reservoir. Thus they can be used without modification, provided that unsupervised training and evaluation of the output is not prohibitively expensive or can be done simultaneously with reservoir adaptation. In this section we will give some pointers on training readouts using reinforcement learning.

A biologically inspired learning rule of Spike-Time-Dependent Plasticity (STDP) modulated by a reinforcement signal has been successfully applied for training a readout of firing neurons from reservoirs of the same LSM type in [127].

Evolutionary algorithms are a natural candidate for training outputs in a non-supervised manner. Using a genetic search with crossover and mutation to find optimal output weights Wout of an ESN is reported in [139]. Such an ESN is successfully applied for a hard reinforcement learning task of direct adaptive control, replacing a classical indirect controller.

ESNs trained with a simple “(1 + 1)” evolution strategy for an unsupervised artificial embryogeny (the so-called “flag”) problem are shown to perform very well in [140].

An ESN trained with a state-of-art evolutionary continuous parameter optimization method (CMA-ES) shows comparable performance in a benchmark double pole balancing problem to the best RNN topology-learning methods in [68,141]. For this problem the best results are obtained when the spectral radius ρ(W) is adapted together with Wout. The same contributions also validate the CMA-ES readout training method on a standard supervised prediction task, achieving the same excellent precision (MSE of the order 10^−15) as the state-of-art with linear regression. Conversely, the best results for this task were achieved with ρ(W) fixed and
training only Wout. An even more curious finding is that almost as good results were achieved by only adapting slopes of the reservoir activation functions f(·) and having Wout fixed, as mentioned in Section 7.2.

8.5. Multilayer readouts

Multilayer perceptrons (MLPs) as readouts, trained by error backpropagation, were used from the very beginnings of LSMs [11] and ESNs (unpublished). They are theoretically more powerful and expressive in their instantaneous mappings from x(n) to y(n) than linear readouts, and are thus suitable for particularly nonlinear outputs, e.g., in [142,143]. In both cases the MLP readouts are trained by error backpropagation. On the other hand they are significantly harder to train than an optimal single-layer linear regression, thus often giving inferior results compared to the latter in practice.

Some experience in training MLPs as ESN readouts, including network initialization, using stochastic, batch, and semi-batch gradients, adapting learning rates, and combining with regression-training of the last layer of the MLP, is presented in Section 5.3 of [78].

8.6. Readouts with delays

While the readouts from reservoirs are usually recurrence-free, this does not mean that they may not have memory. In some approaches they do, or rather some memory is inserted between the reservoir and the readout.

Learning a delay for each neuron in an ESN reservoir x(n) in addition to the output weight from it is investigated in [84]. Cross-correlation (simple or generalized) is used to optimally align activations of each neuron in x(n) with ytarget(n), and then activations with the delays xdelayed(n) are used to find Wout in a usual way. This approach potentially enables utilizing the computational power of the reservoir more efficiently. In a time-coded output from a spiking reservoir the output connection delays can actually be the only thing that is learned [101].

For time series classification tasks the decision can be based on a readout from a joined reservoir state xjoined = [x(n1), x(n2), . . . , x(nk)] that is a concatenation of the reservoir states from different moments n1, n2, . . . , nk in time during the time series [18]. This approach, compared to only using the last state of the given time series, moves the emphasis away from the ending of the series, depending on how the support times ni are spread. It is also more expressive, since it has k times more trainable parameters in Wout for the same size of the reservoir N. As a consequence, it is also more prone to overfitting. It is also possible to integrate intervals of states in some way, e.g., use x∗(n1) = 1/(n1 − n0 + 1) · Σ_{m=n0}^{n1} x(m) instead of using a single snapshot of the states x(n1).
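A small sketch of building such a joined state vector from chosen support times (or from short averaging windows ending at them) before applying an ordinary linear readout is given below; names and the windowing convention are our own.

```python
import numpy as np

def joined_state(states, support_times, window=1):
    """states: (T x N) matrix of collected reservoir states for one time series.
    Concatenate the states at the given support times; with window > 1,
    average over a short interval ending at each support time instead of
    taking a single snapshot."""
    pieces = []
    for n1 in support_times:
        n0 = max(0, n1 - window + 1)
        pieces.append(states[n0:n1 + 1].mean(axis=0))
    return np.concatenate(pieces)        # feature vector of length k * N
```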
An approach of treating a finite history of reservoir activations x(n) (similar to X in (12)) as a two-dimensional image, and training a minimum average correlations energy filter as the readout for dynamical pattern recognition is presented in [144].

Even though in Section 1 we stated that the RNNs considered in this survey are used as nonlinear filters, which transform an input time series into an output time series, ESNs can also be utilized for non-temporal (defined in Section 2.1) tasks {(u(n), ytarget(n))} by presenting an ESN with the same input u(n) for many time steps, letting the ESN converge to a fixed-point attractor state xu(n)(∞) (which it does if it possesses the echo state property) and reading the output from the attractor state y(n) = y(xu(n)(∞)) [145,146].

8.7. Combining several readouts

Segmenting of the spatially embedded trajectory of x(n) by k-means clustering and assigning a separate “responsible” linear readout for each cluster is investigated in [147]. This approach increases the expressiveness of the ESN by having k linear readouts trained and an online switching mechanism among them. Bigger values of k are shown to compensate for smaller sizes Nx of the reservoirs to get the same level of performance.

A benchmark-record-breaking approach of taking an average of outputs from many (1000) different instances of tiny (N = 4) trained ESNs is presented in Section 5.2.2 of [18]. The approach is also combined with reading from different support times as discussed in Section 8.6 of this survey. Averaging outputs over 20 instances of ESNs was also shown to refine the prediction of chaotic time series in supporting online material of [17].

Using dynamic programming to find sequences in multiple sets of predicting readouts for classification [138] was already mentioned at the end of Section 8.3.

8.8. Hierarchies

Following the analogy between the ESNs and non-temporal kernel methods, ESNs would be called “type-1 shallow architectures” according to the classification proposed in [148]. The reservoir adaptation techniques reviewed in our article would make ESNs “type-3 shallow architectures”, which are more expressive. However, the authors in [148] argue that any type of shallow (i.e., non-hierarchical) architecture is incapable of learning really complex intelligent tasks. This suggests that for demandingly complex tasks the adaptation of a single reservoir might not be enough and a hierarchical architecture of ESNs might be needed, e.g., such as presented in [149]. Here the outputs of a higher level in the hierarchy serve as coefficients of mixing (or voting on) outputs from a lower one. The structure can have an arbitrary number of layers. Only the outputs from the reservoirs of each layer are trained simultaneously, using stochastic gradient descent and error backpropagation through the layers. The structure is demonstrated to discover features on different timescales in an unsupervised way when being trained for predicting a synthetic time series of interchanging generators. On the downside, such hierarchies require many epochs to train, and suffer from a similar problem of vanishing gradients, as deep feedforward neural networks or gradient-descent methods for fully trained RNNs. They also do not scale up yet to real-world demanding problems. Research on hierarchically structured RC models has only just begun.
9. Discussion

The striking success of the original RC methods in outperforming fully trained RNNs in many (though not all) tasks, established an important milestone, or even a turning point, in the research of RNN training. The fact that a randomly generated fixed RNN with only a linear readout trained consistently outperforms state-of-art RNN training methods had several consequences:
• First of all it revealed that we do not really know how to train RNNs well, and something new is needed. The error backpropagation methods, which had caused a breakthrough in feedforward neural network training (up to a certain depth), and had also become the most popular training methods for RNNs, are hardly unleashing their full potential.
• Neither are the classical RC methods yet exploiting the full potential of RNNs, since they use a random RNN, which is unlikely to be optimal, and a linear readout, which is quite limited by the quality of the signals it is combining. But they give a quite tough performance reference for more sophisticated methods.
• The separation between the RNN reservoir and the readout provides a good platform to try out all kinds of RNN adaptation methods in the reservoir and see how much they can actually improve the performance over randomly created RNNs. This is particularly well suited for testing various biology-inspired RNN adaptation mechanisms, which are almost exclusively local and unsupervised, in how they can improve learning of a supervised task.
• In parallel, it enables all types of powerful non-temporal methods to be applied for reading out of the reservoir.
This platform is the current paradigm of RC: using different methods for (i) producing/adapting the reservoir, and (ii) training different types of readouts. It enables looking for good (i) and (ii) methods independently, and combining the best practices from both research directions. The platform has been actively used by many researchers, ever since the first ESNs and LSMs appeared. This research in both (i) and (ii) directions, together with theoretical insights, like what characterizes a “good” reservoir, constitutes the modern field of RC.

In this review, together with motivating the new paradigm, we have provided a comprehensive survey of all this RC research. We introduced a natural taxonomy of the reservoir generation/adaptation techniques (i) with three big classes of methods (generic, unsupervised, and supervised), depending on their universality with respect to the input and desired output of the task. Inside each class, methods are also grouped into major directions of approaches, taking different inspirations. We also surveyed all types of readouts from the reservoirs (ii) reported in the literature, including the ones containing several layers of nonlinearities, combining several time steps, or several reservoirs, among others. We also briefly discussed some practical issues of training the most popular types of readouts in a tutorial-like fashion. The survey transcends the boundaries among several traditional methods that fall under the umbrella of RC, generalizing the results to the whole RC field and pointing out relations, where applicable.

Even though this review is quite extensive, we tried to keep it concise, outlining only the basic ideas of each contribution. We did not try to include every contribution relating to RC in this survey, but only the ones highlighting the main research directions. Publications only reporting applications of reservoir methods, but not proposing any interesting modifications of them, were left out. Since this review is aimed at a (fast) moving target, which RC is, some (especially very new) contributions might have been missed unintentionally.

In general, the RC field is still very young, but very active and quickly expanding. While the original first RC methods made an impact that could be called a small revolution, current RC research is more in a phase of a (rapid) evolution. The multiple new modifications of the original idea are gradually increasing the performance of the methods. While there have been no striking breakthroughs lately, the progress is steady, establishing some of the extensions as common practices to build on further. There are still many promising directions to be explored, hopefully leading to breakthroughs in the near future.

While the tasks for which RNNs are applied nowadays often are quite complex, hardly any of them could yet be called truly intelligent, as compared to the human level of intelligence. The fact that RC methods perform well in many of these simple tasks by no means indicates that there is little space left for their improvement. More complex tasks and adequate solutions are still to meet each other in RC. We further provide some of our (subjective, or even speculative) outlooks on the future of RC.

The elegant simplicity of the classical ESNs gives many benefits in these simple applications, but it also has some intrinsic limitations (as, for example, discussed in Section 5.4) that must be overcome in some way or other. Since the RNN model is by itself biologically inspired, looking at real brains is a natural (literally) source of inspiration on how to do that. RC models may reasonably explain some aspects of how small portions of the brain work, but if we look at the bigger picture, the brain is far from being just a big blob of randomly connected neurons. It has a complex structure that is largely predefined before even starting to learn. In addition, there are many learning mechanisms observed in the real brain, as briefly outlined in Section 6.2. It is very probable that there is no single easily implementable underlying rule which can explain all learning.

The required complexity in the context of RC can be achieved in two basic ways: either (i) by giving the reservoir a more complex internal structure, like that discussed in Section 5.3, or (ii) externally building structures combining several reservoirs and readouts, like those discussed in Section 8.8. Note that the two ways correspond to the above-mentioned dichotomy of the RC research and are not mutually exclusive. An “externally” (ii) built structure can also be regarded as a single complex reservoir (i) and a readout from it all can be trained.

An internal auto-structuring of the reservoir (i) through an (unsupervised) training would be conceptually appealing and nature-like, but not yet quite feasible at the current state of knowledge. A robust realization of such a learning algorithm would signify a breakthrough in the generation/training
of artificial NNs. Most probably such an approach would combine several competing learning mechanisms and goals, and require a careful parameter selection to balance them, and thus would not be easy to successfully apply. In addition, changing the structure of the RNN during the adaptive training would lead to bifurcations in the training process, as in [8], which makes learning very difficult.

Constructing external architectures of several reservoirs can be approached as more of an engineering task. The structures can be hand-crafted, based on the specifics of the application, and, in some cases, trained entirely supervised, each reservoir having a predefined function and a target signal for its readout. While such approaches are successfully being applied in practice, they are very case-specific, and not quite in the scope of the research reviewed here, since in essence they are just applications of (several instances of) the classical RC methods in bigger settings.

However, generic structures of multiple reservoirs (ii) that can be trained with no additional information, such as discussed in Section 8.8, are of high interest. Despite their current state being still an “embryo”, and the difficulties pointed out earlier, the authors of this review see this direction as highly promising.

Biological inspiration and progress of neuroscience in understanding how real brains work are beneficial for both (i) and (ii) approaches. Well understood natural principles of local neural adaptation and development can be relatively easily transferred to artificial reservoirs (i), and reservoirs internally structured to more closely resemble cortical microcolumns in the brain have been shown to perform better [23]. Understanding how different brain areas interact could also help in building external structures of reservoirs (ii) better suited for nature-like tasks.

In addition to processing and “understanding” multiple scales of time and abstraction in the data, which hierarchical models promise to solve, other features still lacking in the current RC (and overall RNN) methods include robustness and stability of pattern generation. A possible solution to this could be a homeostasis-like self-regulation in the RNNs. Other intelligence-tending features such as selective longer-term memory or active attention are also not yet well incorporated.

In short, RC is not the end, but an important stepping-stone in the big journey of developing RNNs, ultimately leading towards building artificial and comprehending natural intelligence.

Acknowledgments

This work is partially supported by Planet Intelligent Systems GmbH, a private company with an inspiring interest in fundamental research. The authors are also thankful to Benjamin Schrauwen, Michael Thon, and an anonymous reviewer of this journal for their helpful constructive feedback.

REFERENCES

[1] John J. Hopfield, Hopfield network, Scholarpedia 2 (5) (2007) 1977.
[2] John J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences of the United States of America 79 (1982) 2554–2558.
[3] Geoffrey E. Hinton, Boltzmann machine, Scholarpedia 2 (5) (2007) 1668.
[4] David H. Ackley, Geoffrey E. Hinton, Terrence J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science 9 (1985) 147–169.
[5] Geoffrey E. Hinton, Ruslan Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[6] Graham W. Taylor, Geoffrey E. Hinton, Sam Roweis, Modeling human motion using binary latent variables, in: Advances in Neural Information Processing Systems 19, NIPS 2006, MIT Press, Cambridge, MA, 2007, pp. 1345–1352.
[7] Ken-ichi Funahashi, Yuichi Nakamura, Approximation of dynamical systems by continuous time recurrent neural networks, Neural Networks 6 (1993) 801–806.
[8] Kenji Doya, Bifurcations in the learning of recurrent neural networks, in: Proceedings of IEEE International Symposium on Circuits and Systems 1992, vol. 6, 1992, pp. 2777–2780.
[9] Yoshua Bengio, Patrice Simard, Paolo Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5 (2) (1994) 157–166.
[10] Felix A. Gers, Jürgen Schmidhuber, Fred A. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12 (10) (2000) 2451–2471.
[11] Wolfgang Maass, Thomas Natschläger, Henry Markram, Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation 14 (11) (2002) 2531–2560.
[12] Herbert Jaeger, The “echo state” approach to analysing and training recurrent neural networks, Technical Report GMD Report 148, German National Research Center for Information Technology, 2001.
[13] Peter F. Dominey, Complex sensory-motor sequence learning based on recurrent state representation and reinforcement learning, Biological Cybernetics 73 (1995) 265–274.
[14] Jochen J. Steil, Backpropagation-decorrelation: Recurrent learning with O(N) complexity, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2004, IJCNN 2004, vol. 2, 2004, pp. 843–848.
[15] Herbert Jaeger, Wolfgang Maass, José C. Príncipe, Special issue on echo state networks and liquid state machines — Editorial, Neural Networks 20 (3) (2007) 287–289.
[16] Herbert Jaeger, Echo state network, Scholarpedia 2 (9) (2007) 2330.
[17] Herbert Jaeger, Harald Haas, Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication, Science (2004) 78–80.
[18] Herbert Jaeger, Mantas Lukoševičius, Dan Popovici, Udo Siewert, Optimization and applications of echo state networks with leaky-integrator neurons, Neural Networks 20 (3) (2007) 335–352.
[19] David Verstraeten, Benjamin Schrauwen, Dirk Stroobandt, Reservoir-based techniques for speech recognition, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2006, IJCNN 2006, 2006, pp. 1050–1053.
[20] Wolfgang Maass, Thomas Natschläger, Henry Markram, A model for real-time computation in generic neural microcircuits, in: Advances in Neural Information Processing Systems 15, NIPS 2002, MIT Press, Cambridge, MA, 2003, pp. 213–220.
[21] Wolfgang Maass, Prashant Joshi, Eduardo D. Sontag, Principles of real-time computing with feedback applied to cortical microcircuit models, in: Advances in Neural Information Processing Systems 18, MIT Press, Cambridge, MA, 2006, pp. 835–842.
[22] Dean V. Buonomano, Michael M. Merzenich, Temporal information transformed into a spatial code by a neural network with realistic properties, Science 267 (1995) 1028–1030.
[23] Stefan Haeusler, Wolfgang Maass, A statistical analysis of information-processing properties of lamina-specific cortical microcircuit models, Cerebral Cortex 17 (1) (2007) 149–162.
[24] Uma R. Karmarkar, Dean V. Buonomano, Timing in the absence of clocks: Encoding time in neural network states, Neuron 53 (3) (2007) 427–438.
[25] Garrett B. Stanley, Fei F. Li, Yang Dan, Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus, Journal of Neuroscience 19 (18) (1999) 8036–8042.
[26] Danko Nikolić, Stefan Haeusler, Wolf Singer, Wolfgang Maass, Temporal dynamics of information content carried by neurons in the primary visual cortex, in: Advances in Neural Information Processing Systems 19, NIPS 2006, MIT Press, Cambridge, MA, 2007, pp. 1041–1048.
[27] Werner M. Kistler, Chris I. De Zeeuw, Dynamical working memory and timed responses: The role of reverberating loops in the olivo-cerebellar system, Neural Computation 14 (2002) 2597–2626.
[28] Tadashi Yamazaki, Shigeru Tanaka, The cerebellum as a liquid state machine, Neural Networks 20 (3) (2007) 290–297.
[29] Peter F. Dominey, Michel Hoen, Jean-Marc Blanc, Taïssia Lelekov-Boissard, Neurological basis of language and sequential cognition: Evidence from simulation, aphasia, and ERP studies, Brain and Language 86 (2003) 207–225.
[30] Jean-Marc Blanc, Peter F. Dominey, Identification of prosodic attitudes by a temporal recurrent network, Cognitive Brain Research 17 (2003) 693–699.
[31] Peter F. Dominey, Michel Hoen, Toshio Inui, A neurolinguistic model of grammatical construction processing, Journal of Cognitive Neuroscience 18 (12) (2006) 2088–2107.
[32] Robert M. French, Catastrophic interference in connectionist networks, in: L. Nadel (Ed.), Encyclopedia of Cognitive Science, Volume 1, Nature Publishing Group, 2003, pp. 431–435.
[33] Floris Takens, Detecting strange attractors in turbulence, in: Proceedings of a Symposium on Dynamical Systems and Turbulence, in: LNM, vol. 898, Springer, 1981, pp. 366–381.
[34] Ronald J. Williams, David Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Computation 1 (1989) 270–280.
[35] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, Learning internal representations by error propagation, in: Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA, USA, 1988, pp. 673–695.
[36] Paul J. Werbos, Backpropagation through time: What it does and how to do it, Proceedings of the IEEE 78 (10) (1990) 1550–1560.
[37] Amir F. Atiya, Alexander G. Parlos, New results on recurrent network training: Unifying the algorithms and accelerating convergence, IEEE Transactions on Neural Networks 11 (3) (2000) 697–709.
[38] Gintaras V. Puškorius, Lee A. Feldkamp, Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks, IEEE Transactions on Neural Networks 5 (2) (1994) 279–297.
[39] Sheng Ma, Chuanyi Ji, Fast training of recurrent networks based on the EM algorithm, IEEE Transactions on Neural Networks 9 (1) (1998) 11–26.
[40] Sepp Hochreiter, Jürgen Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[41] Herbert Jaeger, Short term memory in echo state networks, Technical Report GMD Report 152, German National Research Center for Information Technology, 2002.
[42] Herbert Jaeger, Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the “echo state network” approach, Technical Report GMD Report 159, German National Research Center for Information Technology, 2002.
[43] Herbert Jaeger, Adaptive nonlinear system identification with echo state networks, in: Advances in Neural Information Processing Systems 15, MIT Press, Cambridge, MA, 2003, pp. 593–600.
[44] Thomas Natschläger, Henry Markram, Wolfgang Maass, Computer models and analysis tools for neural microcircuits, in: R. Kötter (Ed.), A Practical Guide to Neuroscience Databases and Associated Tools, Kluver Academic Publishers, Boston, 2002 (chapter 9).
[45] Wolfgang Maass, Thomas Natschläger, Henry Markram, Computational models for generic cortical microcircuits, in: J. Feng (Ed.), Computational Neuroscience: A Comprehensive Approach, CRC-Press, 2002.
[46] Jürgen Schmidhuber, Daan Wierstra, Matteo Gagliolo, Faustino J. Gomez, Training recurrent networks by Evolino, Neural Computation 19 (3) (2007) 757–779.
[47] Ulf D. Schiller, Jochen J. Steil, Analyzing the weight dynamics of recurrent learning algorithms, Neurocomputing 63C (2005) 5–23.
[48] Jochen J. Steil, Memory in backpropagation-decorrelation O(N) efficient online recurrent learning, in: Proceedings of the 15th International Conference on Artificial Neural Networks, in: LNCS, vol. 3697, Springer, 2005, pp. 649–654 (chapter 9).
[49] Felix R. Reinhart, Jochen J. Steil, Recurrent neural autoassociative learning of forward and inverse kinematics for movement generation of the redundant PA-10 robot, in: Proceedings of the ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems, LAB-RS, vol. 1, 2008, pp. 35–40.
[50] Peter F. Dominey, Franck Ramus, Neural network processing of natural language: I. Sensitivity to serial, temporal and abstract structure of language in the infant, Language and Cognitive Processes 15 (1) (2000) 87–127.
[51] Peter F. Dominey, From sensorimotor sequence to grammatical construction: Evidence from simulation and neurophysiology, Adaptive Behaviour 13 (4) (2005) 347–361.
[52] Se Wang, Xiao-Jian Yang, Cheng-Jian Wei, Harnessing non-linearity by sigmoid-wavelet hybrid echo state networks (SWHESN), in: The 6th World Congress on Intelligent Control and Automation, WCICA 2006, vol. 1, 2006, pp. 3014–3018.
[53] David Verstraeten, Benjamin Schrauwen, Dirk Stroobandt, Reservoir computing with stochastic bitstream neurons, in: Proceedings of the 16th Annual ProRISC Workshop, Veldhoven, The Netherlands, November 2005, pp. 454–459.
[54] Felix Schürmann, Karlheinz Meier, Johannes Schemmel, Edge of chaos computation in mixed-mode VLSI – a hard liquid, in: Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005, pp. 1201–1208.
[55] Benjamin Schrauwen, Marion Wardermann, David Verstraeten, Jochen J. Steil, Dirk Stroobandt, Improving reservoirs using intrinsic plasticity, Neurocomputing 71 (2008) 1159–1171.
[56] Kristof Vandoorne, Wouter Dierckx, Benjamin Schrauwen, David Verstraeten, Roel Baets, Peter Bienstman, Jan Van Campenhout, Toward optical signal processing using photonic reservoir computing, Optics Express 16 (15) (2008) 11182–11192.
[57] Chrisantha Fernando, Sampsa Sojakka, Pattern recognition in a bucket, in: Proceedings of the 7th European Conference on Advances in Artificial Life, ECAL 2003, in: LNCS, vol. 2801, Springer, 2003, pp. 588–597.
[58] Ben Jones, Dov Stekelo, Jon Rowe, Chrisantha Fernando, Is there a liquid state machine in the bacterium Escherichia coli?, in: Proceedings of the 1st IEEE Symposium on Artificial Life, ALIFE 2007, 1–5 April 2007, pp. 187–191.
[59] David Verstraeten, Benjamin Schrauwen, Michiel D’Haene, Dirk Stroobandt, An experimental unification of reservoir computing methods, Neural Networks 20 (3) (2007) 391–403.
[60] Benjamin Schrauwen, David Verstraeten, Jan Van Campenhout, An overview of reservoir computing: Theory, applications and implementations, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN 2007, 2007, pp. 471–482.
[61] Mantas Lukoševičius, Herbert Jaeger, Overview of reservoir recipes, Technical Report No. 11, Jacobs University Bremen, 2007.
[62] David H. Wolpert, The supervised learning no-free-lunch theorems, in: Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications, WSC 2006, 2001, pp. 25–42.
[63] Michael Buehner, Peter Young, A tighter bound for the echo state property, IEEE Transactions on Neural Networks 17 (3) (2006) 820–824.
[64] Duncan J. Watts, Steven H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998) 440–442.
[65] Albert-Laszlo Barabasi, Reka Albert, Emergence of scaling in random networks, Science 286 (1999) 509.
[66] Marcus Kaiser, Claus C. Hilgetag, Spatial growth of real-world networks, Physical Review E 69 (2004) 036103.
[67] Benjamin Liebald, Exploration of effects of different network topologies on the ESN signal crosscorrelation matrix spectrum, Bachelor’s Thesis, Jacobs University Bremen, 2004, http://www.eecs.jacobs-university.de/archive/bsc-2004/liebald.pdf.
[68] Fei Jiang, Hugues Berry, Marc Schoenauer, Supervised and evolutionary learning of echo state networks, in: Proceedings of the 10th International Conference on Parallel Problem Solving from Nature, PPSN 2008, in: LNCS, vol. 5199, Springer, 2008, pp. 215–224.
[69] Wolfgang Maass, Thomas Natschläger, Henry Markram, Computational models for generic cortical microcircuits, in: Computational Neuroscience: A Comprehensive Approach, Chapman & Hall/CRC, 2004, pp. 575–605.
[70] David Verstraeten, Benjamin Schrauwen, Dirk Stroobandt, Jan Van Campenhout, Isolated word recognition with the liquid state machine: A case study, Information Processing Letters 95 (6) (2005) 521–528.
[71] Michal Čerňanský, Matej Makula, Feed-forward echo state networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2005, IJCNN 2005, vol. 3, 2005, pp. 1479–1482.
[72] Georg Fette, Julian Eggert, Short term memory and pattern matching with simple echo state network, in: Proceedings of the 15th International Conference on Artificial Neural Networks, ICANN 2005, in: LNCS, vol. 3696, Springer, 2005, pp. 13–18.
[73] David Verstraeten, Benjamin Schrauwen, Dirk Stroobandt, Adapting reservoirs to get Gaussian distributions, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN 2007, 2007, pp. 495–500.
[74] Carlos Lourenço, Dynamical reservoir properties as network effects, in: Proceedings of the 14th European Symposium on Artificial Neural Networks, ESANN 2006, 2006, pp. 503–508.
[75] Keith Bush, Batsukh Tsendjav, Improving the richness of echo state features using next ascent local search, in: Proceedings of the Artificial Neural Networks In Engineering Conference, St. Louis, MO, 2005, pp. 227–232.
[76] Márton Albert Hajnal, András Lőrincz, Critical echo state networks, in: Proceedings of the 16th International Conference on Artificial Neural Networks, in: LNCS, vol. 4131, Springer, 2006, pp. 658–667.
[77] Yanbo Xue, Le Yang, Simon Haykin, Decoupled echo state networks with lateral inhibition, Neural Networks 20 (3) (2007) 365–376.
[78] Mantas Lukoševičius, Echo state networks with trained feedbacks, Technical Report No. 4, Jacobs University Bremen, 2007.
[79] Danil V. Prokhorov, Lee A. Feldkamp, Ivan Yu. Tyukin, Adaptive behavior with fixed weights in RNN: An overview, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2002, IJCNN 2002, 2002, pp. 2018–2023.
[80] Mohamed Oubbati, Paul Levi, Michael Schanz, Meta-learning for adaptive identification of non-linear dynamical systems, in: Proceedings of the IEEE International Joint Symposium on Intelligent Control, June 2005, pp. 473–478.
[81] Mantas Lukoševičius, Dan Popovici, Herbert Jaeger, Udo Siewert, Time warping invariant echo state networks, Technical Report No. 2, Jacobs University Bremen, 2006.
[82] Benjamin Schrauwen, Jeroen Defour, David Verstraeten, Jan M. Van Campenhout, The introduction of time-scales in reservoir computing, applied to isolated digits recognition, in: Proceedings of the 17th International Conference on Artificial Neural Networks, in: LNCS, vol. 4668, Springer, 2007, pp. 471–479.
[83] Udo Siewert, Welf Wustlich, Echo-state networks with band-pass neurons: Towards generic time-scale-independent reservoir structures, Internal Status Report, PLANET intelligent systems GmbH, 2007. Available online at http://snn.elis.ugent.be/.
[84] Georg Holzmann, Echo state networks with filter neurons and a delay&sum readout, Internal Status Report, Graz University of Technology, 2007. Available online at http://grh.mur.at/data/misc.html.
[85] Francis wyffels, Benjamin Schrauwen, David Verstraeten, Dirk Stroobandt, Band-pass reservoir computing, in: Z. Hou, N. Zhang (Eds.), Proceedings of the IEEE International Joint Conference on Neural Networks, 2008, IJCNN 2008, Hong Kong, 2008, pp. 3204–3209.
[86] Salah El Hihi, Yoshua Bengio, Hierarchical recurrent neural networks for long-term dependencies, in: Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 1996, pp. 493–499.
[87] Nils Bertschinger, Thomas Natschläger, Real-time computation at the edge of chaos in recurrent neural networks, Neural Computation 16 (7) (2004) 1413–1436.
[88] Benjamin Schrauwen, Lars Buesing, Robert Legenstein, On computational power and the order-chaos phase transition in reservoir computing, in: Advances in Neural Information Processing Systems 21, NIPS 2008, 2009, pp. 1425–1432.
[89] Wolfgang Maass, Robert A. Legenstein, Nils Bertschinger, Methods for estimating the computational power and generalization capability of neural microcircuits, in: Advances in Neural Information Processing Systems 17, NIPS 2004, MIT Press, Cambridge, MA, 2005, pp. 865–872.
[90] Robert A. Legenstein, Wolfgang Maass, Edge of chaos and prediction of computational performance for neural circuit models, Neural Networks 20 (3) (2007) 323–334.
[91] Behrouz Farhang-Boroujeny, Adaptive Filters: Theory and Applications, Wiley, 1998.
[92] Herbert Jaeger, Reservoir riddles: Suggestions for echo state network research, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2005, IJCNN 2005, vol. 3, 2005, pp. 1460–1462.
[93] Jochen Triesch, A gradient rule for the plasticity of a neuron’s intrinsic excitability, in: Proceedings of the 13th European Symposium on Artificial Neural Networks, ESANN 2005, 2005, pp. 65–70.
[94] Melanie Mitchell, James P. Crutchfield, Peter T. Hraber, Dynamics, computation, and the “edge of chaos”: A re-examination, in: G. Cowan, D. Pines, D. Melzner (Eds.), Complexity: Metaphors, Models, and Reality, Addison-Wesley, Reading, MA, 1994, pp. 497–513.
[95] Robert Legenstein, Wolfgang Maass, What makes a dynamical system computationally powerful? in: S. Haykin, J. Príncipe, T. Sejnowski, J. McWhirter (Eds.), New Directions in Statistical Signal Processing: From Systems to Brain, MIT Press, 2007, pp. 127–154.
[96] Mustafa C. Ozturk, José C. Príncipe, Computing with transiently stable states, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2005, IJCNN 2005, vol. 3, 2005, pp. 1467–1472.
[97] Donald O. Hebb, The Organization of Behavior: A Neuropsychological Theory, Wiley, New York, 1949.
[98] Štefan Babinec, Jiří Pospíchal, Improving the prediction accuracy of echo state neural networks by anti-Oja’s learning, in: Proceedings of the 17th International Conference on Artificial Neural Networks, in: LNCS, vol. 4668, Springer, 2007, pp. 19–28.
[99] Henry Markram, Yun Wang, Misha Tsodyks, Differential signaling via the same axon of neocortical pyramidal neurons, Proceedings of National Academy of Sciences USA 95 (9) (1998) 5323–5328.
[100] David Norton, Dan Ventura, Preparing more effective liquid state machines using Hebbian learning, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2006, IJCNN 2006, 2006, pp. 4243–4248.
[101] Hélène Paugam-Moisy, Regis Martinez, Samy Bengio, Delay learning and polychronization for reservoir computing, Neurocomputing 71 (7–9) (2008) 1143–1158.
[102] Roland Baddeley, Larry F. Abbott, Michael C.A. Booth, Frank Sengpeil, Toby Freeman, Edward A. Wakeman, Edmund T. Rolls, Responses of neurons in primary and inferior temporal visual cortices to natural scenes, Proceedings of the Royal Society of London B 264 (1997) 1775–1783.
[103] Martin Stemmler, Christof Koch, How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate, Nature Neuroscience 2 (6) (1999) 521–527.
[104] Jochen Triesch, Synergies between intrinsic and synaptic plasticity in individual model neurons, in: Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005, pp. 1417–1424.
[105] Anthony J. Bell, Terrence J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation 7 (6) (1995) 1129–1159.
[106] Jochen J. Steil, Online reservoir adaptation by intrinsic plasticity for backpropagation-decorrelation and echo state learning, Neural Networks 20 (3) (2007) 353–364.
[107] Marion Wardermann, Jochen J. Steil, Intrinsic plasticity for reservoir learning algorithms, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN 2007, 2007, pp. 513–518.
[108] Jochen J. Steil, Several ways to solve the MSO problem, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN 2007, 2007, pp. 489–494.
[109] Joschka Boedecker, Oliver Obst, Norbert Michael Mayer, Minoru Asada, Studies on reservoir initialization and dynamics shaping in echo state networks, in: Proceedings of the 17th European Symposium on Artificial Neural Networks, ESANN 2009, 2009 (in press).
[110] Naftali Tishby, Fernando C. Pereira, William Bialek, The information bottleneck method, in: Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999, pp. 368–377.
[111] Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner, Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission, Proceedings of National Academy of Sciences USA 102 (2005) 5239–5244.
[112] Stefan Klampfl, Robert Legenstein, Wolfgang Maass, Information bottleneck optimization and independent component extraction with spiking neurons, in: Advances in Neural Information Processing Systems 19, NIPS 2006, MIT Press, Cambridge, MA, 2007, pp. 713–720.
[113] Stefan Klampfl, Robert Legenstein, Wolfgang Maass, Spiking neurons can learn to solve information bottleneck problems and to extract independent components, Neural Computation 21 (4) (2008) 911–959.
[114] Lars Buesing, Wolfgang Maass, Simplified rules and theoretical analysis for information bottleneck optimization and PCA with spiking neurons, in: Advances in Neural Information Processing Systems 20, MIT Press, Cambridge, MA, 2008, pp. 193–200.
[115] Jochen Triesch, Synergies between intrinsic and synaptic plasticity mechanisms, Neural Computation 19 (4) (2007) 885–909.
[116] Nicholas J. Butko, Jochen Triesch, Learning sensory representations with intrinsic plasticity, Neurocomputing 70 (2007) 1130–1138.
[117] Andreea Lazar, Gordon Pipa, Jochen Triesch, Fading memory and time series prediction in recurrent networks with different forms of plasticity, Neural Networks 20 (3) (2007) 312–322.
[118] Norbert M. Mayer, Matthew Browne, Echo state networks and self-prediction, in: Revised Selected Papers of Biologically Inspired Approaches to Advanced Information Technology, BioADIT 2004, 2004, pp. 40–48.
[119] Mustafa C. Ozturk, Dongming Xu, José C. Príncipe, Analysis and design of echo state networks, Neural Computation 19 (1) (2007) 111–138.
[120] William H. Kautz, Transient synthesis in the time domain, IRE Transactions on Circuit Theory 1 (3) (1954) 29–39.
[121] Kazuo Ishii, Tijn van der Zant, Vlatko Bečanović, Paul Plöger, Identification of motion with echo state network, in: Proceedings of the OCEANS 2004 MTS/IEEE – TECHNO-OCEAN 2004 Conference, vol. 3, 2004, pp. 1205–1210.
[122] John H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence, MIT Press, Cambridge, MA, USA, 1992.
[123] Ali Ajdari Rad, Mahdi Jalili, Martin Hasler, Reservoir optimization in recurrent neural networks using Kronecker kernels, in: Proceedings of IEEE International Symposium on Circuits and Systems 2008, IEEE, 2008, pp. 868–871.
[124] Xavier Dutoit, Hendrik Van Brussel, Marnix Nuttin, A first attempt of reservoir pruning for classification problems, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, ESANN 2007, 2007, pp. 507–512.
[125] Wolfgang Maass, Prashant Joshi, Eduardo D. Sontag, Computational aspects of feedback in neural circuits, PLoS Computational Biology 3 (1) (2007) e165+.
[126] Robert Legenstein, Dejan Pecevski, Wolfgang Maass, Theoretical analysis of learning with reward-modulated spike-timing-dependent plasticity, in: Advances in Neural Information Processing Systems 20, NIPS 2007, MIT Press, Cambridge, MA, 2008, pp. 881–888.
[127] Robert Legenstein, Dejan Pecevski, Wolfgang Maass, A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback, PLoS Computational Biology 4 (10) (2008) e1000180.
[128] Åke Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, PA, USA, 1996.
[129] Andrew Carnell, Daniel Richardson, Linear algebra for time series of spikes, in: Proceedings of the 13th European Symposium on Artificial Neural Networks, ESANN 2005, 2005, pp. 363–368.
[130] Ali U. Küçükemre, Echo state networks for adaptive filtering, University of Applied Sciences Bonn-Rhein-Sieg, Germany, April 2006, http://www.faculty.jacobs-university.de/hjaeger/pubs/Kucukemre.pdf.
[131] Zhiwei Shi, Min Han, Support vector echo-state machine for chaotic time-series prediction, IEEE Transactions on Neural Networks 18 (2) (2007) 359–372.
[132] Jürgen Schmidhuber, Matteo Gagliolo, Daan Wierstra, Faustino J. Gomez, Evolino for recurrent support vector machines, Technical Report, 2006.
[133] Benjamin Schrauwen, Jan Van Campenhout, Linking non-binned spike train kernels to several existing spike train metrics, in: M. Verleysen (Ed.), Proceedings of the 14th European Symposium on Artificial Neural Networks, ESANN 2006, d-side publications, Evere, 2006, pp. 41–46.
[134] Jochen J. Steil, Stability of backpropagation-decorrelation efficient O(N) recurrent learning, in: Proceedings of the 13th European Symposium on Artificial Neural Networks, ESANN 2005, 2005, pp. 43–48.
[135] Francis wyffels, Benjamin Schrauwen, Dirk Stroobandt, Stable output feedback in reservoir computing using ridge regression, in: Proceedings of the 18th International Conference on Artificial Neural Networks, ICANN 2008, in: LNCS, vol. 5163, Springer, 2008, pp. 808–817.
[136] Xavier Dutoit, Benjamin Schrauwen, Jan Van Campenhout, Dirk Stroobandt, Hendrik Van Brussel, Marnix Nuttin, Pruning and regularization in reservoir computing: A first insight, in: Proceedings of the 16th European Symposium on Artificial Neural Networks, ESANN 2008, 2008, pp. 1–6.
[137] Joe Tebelskis, Speech Recognition using Neural Networks, Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1995.
[138] Mark D. Skowronski, John G. Harris, Automatic speech recognition using a predictive echo state network classifier, Neural Networks 20 (3) (2007) 414–423.
[139] Dongming Xu, Jing Lan, José C. Príncipe, Direct adaptive control: An echo state network and genetic algorithm approach, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2005, IJCNN 2005, vol. 3, 2005, pp. 1483–1486.
[140] Alexandre Devert, Nicolas Bredeche, Marc Schoenauer, Unsupervised learning of echo state networks: A case study in artificial embryogeny, in: Proceedings of the 8th International Conference on Artificial Evolution, in: LNCS, vol. 4926, Springer, 2008, pp. 278–290.
[141] Fei Jiang, Hugues Berry, Marc Schoenauer, Unsupervised learning of echo state networks: Balancing the double pole, in: Proceedings of the 10th Genetic and Evolutionary Computation Conference, ACM, 2008, pp. 869–870.
[142] Keith Bush, Charles Anderson, Modeling reward functions for incomplete state representations via echo state networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2005, IJCNN 2005, vol. 5, 2005, pp. 2995–3000.
[143] Štefan Babinec, Jiří Pospíchal, Merging echo state and feedforward neural networks for time series forecasting, in: Proceedings of the 16th International Conference on Artificial Neural Networks, in: LNCS, vol. 4131, Springer, 2006, pp. 367–375.
[144] Mustafa C. Ozturk, José C. Príncipe, An associative memory readout for ESNs with applications to dynamical pattern recognition, Neural Networks 20 (3) (2007) 377–390.
[145] Mark Embrechts, Luis Alexandre, Jonathan Linton, Reservoir computing for static pattern recognition, in: Proceedings of the 17th European Symposium on Artificial Neural Networks, ESANN 2009, 2009 (in press).
[146] Felix R. Reinhart, Jochen J. Steil, Attractor-based computation with reservoirs for online learning of inverse kinematics, in: Proceedings of the 17th European Symposium on Artificial Neural Networks, ESANN 2009, 2009 (in press).
[147] Keith Bush, Charles Anderson, Exploiting iso-error pathways in the N, k-plane to improve echo state network performance, 2006.
[148] Yoshua Bengio, Yann LeCun, Scaling learning algorithms toward AI, in: L. Bottou, O. Chapelle, D. DeCoste, J. Weston (Eds.), Large Scale Kernel Machines, MIT Press, Cambridge, MA, 2007.
[149] Herbert Jaeger, Discovering multiscale dynamical features with hierarchical echo state networks, Technical Report No. 9, Jacobs University Bremen, 2007.