
TALLINNA TEHNIKAÜLIKOOL

Faculty of Information Technology
Department of Computer Engineering
Chair of Digital Technology

Parallel Architectures
IAY0060

Neural computation platforms: to the Blue Brain and beyond

Report

Lecturer: K. Tammemäe
Student: Valentin Tihhomirov
971081 LASM

Tallinn 2005
Contents

1 Prologue: The Blue Brain project

2 Introduction
  2.1 Brain research
  2.2 Artificial NNs
  2.3 Demand for the Neural HW

3 Traditional Approach
  3.1 Simulating Artificial Neural Networks on Parallel Architectures
  3.2 Mapping neural networks on parallel machines
  3.3 Benchmarking
  3.4 Simulation on General-Purpose Parallel Machines
  3.5 Neurocomputers
  3.6 FPGAs
      3.6.1 ANNs on RAPTOR2000
  3.7 Conclusions

4 Spiking NNs
  4.1 Theoretical background
  4.2 Sample HW
      4.2.1 Learning at the Edge of Chaos
      4.2.2 MASPINN on NeuroPipe-Chip: A Digital Neuro-Processor
      4.2.3 Analog VLSI for SNN
  4.3 Maass-Markram theory: WetWare in Liquid Computer
      4.3.1 The ‘Hard Liquid’
  4.4 The Blue Brain
      4.4.1 Blue Gene
      4.4.2 Brain simulation on BG
  4.5 Conclusions

5 Conclusions

6 Epilogue

Chapter 1

Prologue: The Blue Brain project

What really motivated us to study the field was the announcement [1] of the Blue Brain project1, the agreement of IBM Corp. to grant its TOP15-listed BG/L computer to the Brain Mind Institute at Switzerland’s Ecole Polytechnique Fédérale de Lausanne (EPFL) for replicating ‘in silico’ one of the brain’s building blocks, a neocortical column (NCC).
Henry Markram, the project leader, its initiator and founder of the Brain Mind Institute, explains:
The neocortical column is the beginning of intelligence and adapt-
ability marking the jump from reptiles to mammals. When it
evolved, it was like Mother Nature had discovered the Pentium
chip. The circuitry was so successful that it’s just duplicated,
with very little variation, from mouse to man. In the human
cortex, there are just more cortical columns — about 1 million.
Over the past 10 years, Markram’s laboratory has developed new techniques for multi-neuron patch-clamp recordings, producing highly quantitative data on the electrophysiology and anatomy of the different types of neurons and the connections they form. The data obtained give an almost complete digital description of the microstructure, operation and learning function, making it possible to begin the reconstruction of the NCC in SW.
The BB goal is merely to build a simulacrum of a biological brain. It is achieved when the outputs produced by the simulation in response to particular inputs are identical to those of the ‘wet’ experiments. If that works, two directions are planned. Once the cellular model of the NCC is debugged and optimized, the BG/L will be replaced by a HW chip, which is easy to replicate in the millions for simulation of the whole brain. The second track will be to work at a more elementary level – to simulate the brain at the molecular level and to look at the role of genes in brain function.

1 http://bluebrainproject.epfl.ch/
Replacing ‘in vivo’ experiments by ‘in silico’ simulation would turn years of brain research into days and save huge funds and lab animals. What is much more interesting is the hope that the project will shed some light on the emergence of consciousness. Scientists have no purchase on this elusive phenomenon at all, but you have to start somewhere, quips Markram.
It is not the first attempt to build a computer model of the brain. This time, however, it is launched by world leaders of neuro- and computer science, so we take it seriously as the most ambitious project ever conducted in neuroscience.

Figure 1.1: What kind of HW can effectively realize the pure connectivity of these parallel ‘computing threads’, which are 3D and morphing at that? Any ideas on decomposition? [A video frame from the BB site]

Chapter 2

Introduction

In this potpourri we will rush through the milestones of neuro-hardware. We review the ‘classical’ approach to HW engineering for neural models, discovering the hardware-significant peculiarities of ANNs, estimate the applicability of Blue Gene to neurosimulation and, finally, freeze mesmerized by the computational paradigm behind the BB project. Along the way, fundamental aspects of artificial intelligence will be considered.

2.1 Brain research


Some computers think that they
are intelligent. Tell them that they
are wrong.

Anecdote

All truths are easy to understand once they are discovered; the point is to discover them.

Galileo

This year, 2006, the world celebrates a century since the Spanish histologist Santiago Ramón y Cajal was awarded the Nobel Prize for pioneering the field of cellular neuroscience through his research into the microscopic properties of the brain. He is credited as the founder of modern neuroscience after discovering the structure of the brain’s cortical layers, composed of millions of individual cells (neurons) that communicate via specialized junctions (synapses).
The neocortex constitutes about 85 % of the human brain’s total mass and is thought to be responsible for the cognitive functions of language, learning, memory and complex thought. It is also responsible for the ‘miracles of thought’ that make people creative, inventive and philosophical enough to ask such questions. During the century of neuroscience research, it has been discovered that the neocortex’s neurons are organized into columns. These cylindrical structures are about 0.5 mm in diameter and 2–4 mm high, yet pack inside up to 60 000 neurons, each with 10 000 connections to others, producing 5 km of cabling1. In fact, what we call the ‘gray matter’ is just a thin surface of neuron bodies that covers the white matter, the insulated cabling.
Any biological research contributes to technology and science. Looking at the finesse of living creatures, we marvel at their beauty so much that even 200 years after Darwin’s publication some cannot believe that an unintelligent random process under the pressure of natural selection can generate such perfection [2]. Throughout history, people have drawn resources from Nature and inspiration from its evolution-optimized appliances, like wings and silk. The computer was invented as a byproduct of the attempt to formalize consciousness and computability, started by Hilbert at the beginning of the XX century. Von Neumann derived the computer from the theoretical Turing machine. This machine executes a prescribed algorithm at speeds as high as 10⁹ op/sec. It is astonishing how much can be done by simple (in terms of complexity theory) algorithms running automatically. Throughout the 20th century, mankind developed communication and information processing technology, entering the ‘information society’. The very idea of evolution was adopted by computer technology in the forms of OOP and genetic algorithms. Looking for more advanced computation techniques, mankind has finally resorted to the secrets of the brain.
Almighty as it is, biological evolution is useless during the lifetime of its creatures, since genes act slowly. But the pressure to react immediately in a rapidly changing environment forced them to create the nervous system, which provides

1. adequate solutions
2. in real time
3. by analyzing incomplete and contradictory sensory information.

Neural networks (NNs) do well where problems are difficult or impossible to solve mathematically. Unlike traditional computer methods, they learn by example rather than solve by algorithm. It is exactly these adaptation capabilities that outwitted the large-toothed enemies, discovered and subordinated the forces of nature and, finally, try to understand themselves.
It is time to take over the most powerful tool in nature, created by millions of years of evolution. Just as computer science studies how logic gates are combined to get a computer, neuroscience studies how the brain is made of neurons. Its neuroinformatics branch uses mathematical and computational techniques, such as simulation, to understand the function of the nervous system. But it is still not clear that the goal is reachable.

1 http://bluebrainproject.epfl.ch/TheNeocorticalColumn.htm
The issue is that, although it gave us the computer, Hilbert’s program on the foundations of mathematics has failed: it was shown that some ‘truths’, which can be ‘seen’ by a human, cannot be deduced by a formal algorithm (a computer). Examples of such true sentences are Turing’s Halting problem and the equivalent Gödel sentence “P = (P cannot be proved)”. Some argue that the brain must necessarily possess a degree of randomness2 in the struggle for survival, because its purpose is to ‘deceive’ the enemy, whereas predictability means ‘defeat’. Random generators, being capable of transcending the deduction of formal logic, are also credited as tools of creativity. In [3], this is presented as an entertaining story, which points to the mind as that very God’s stone, which might be created but cannot be understood. Penrose agrees that algorithmic computation exists to weed out suboptimal solutions generated at random. He locates the source of randomness from the standpoint of theoretical physics [4]. Penrose sees a gap between the micro- and macro-worlds and considers the mind a bridge, which performs the irreversible quantum wave-function reduction, thus joining the material world with the ideal world of Plato. Anybody who starts learning neural networks and quantum theory notes this similarity in their magic to traverse a huge problem space quickly3. As one argument, Penrose points out that ideas, no matter how big they are, are comprehended as holistic pictures, like huge ensembles of atoms consolidating into quasi-crystals. Promoting his own TOE4, the prominent scientist attacks the formalists’ ‘strong AI’ without saying a word about analogue computation or connectionism. Whoever is right, the tremendous success of the computer in the XX century suggests that neuroinformatics research will be the main challenge of the 21st century.
Let us start with the parallel architecture of the brain, which is capable of recognizing mother’s image in 1/100 sec while operating at frequencies as low as 10³ Hz — that is, in less than 100 steps [5].

2.2 Artificial NNs


In order to explain its incredible features, mathematical models of the brain have been proposed. They are known as Artificial Neural Networks (ANNs). An ANN is a computational structure consisting of a multitude of elementary computers, the neurons, massively interconnected in some topology. These structures are distinguished by utmost parallelism in data storage and processing, the capability to generalize and learn, and their tolerance to incomplete and inaccurate learning information. These unique features make ANNs applicable to classification, clustering/categorization (classification without a teacher), prediction, optimization and control tasks.

2 According to the theory of algorithmic complexity, a random sequence is unpredictable — it is infinitely complex and corresponds to an infinitely long program.
3 This is my opinion.
4 Theory of everything.

The Binary Model


The history of ANN development knows three periods. Back in 1943, McCulloch and Pitts offered the idea of using a binary threshold element as a neuron. This mathematical neuron computes a weighted sum of its inputs xi,

s = ∑i xi wi = X · W,

with the sum running over the whole input vector, and outputs 1 if s exceeds a certain threshold T and 0 otherwise:

y = boolean(s > T).

The positive weights correspond to excitatory connections, the negative ones to inhibitory connections. McCulloch and Pitts proved that such a network is capable of performing arbitrary first-order logic computations by picking proper weights. The model resembles the biological neuron: signal transfer imitates axons and dendrites, connection weights correspond to synapses, and the threshold function reflects the soma activity. In the modern conception, ANNs are interconnections, according to a certain rule (a paradigm), of typical processing elements (neurons). The paradigms are dictated by neural network models. To date, a lot of models have been worked out. Mostly, they are generalizations of the McCulloch-Pitts model.
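Since the report contains no code, here is a minimal Python sketch of a McCulloch-Pitts neuron; the weights and threshold chosen to realize the AND function are my own illustrative assumptions, not taken from the source.

import numpy as np

def mcculloch_pitts(x, w, T):
    """Binary threshold neuron: y = 1 if the weighted sum X . W exceeds T, else 0."""
    s = np.dot(x, w)          # weighted sum of the inputs
    return int(s > T)         # hard threshold

# Illustrative weights and threshold implementing the logical AND of two inputs
w, T = np.array([1.0, 1.0]), 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mcculloch_pitts(np.array(x, dtype=float), w, T))
# prints 0, 0, 0, 1 — excitatory weights plus a threshold give the AND gate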

Perceptrons
Practical implementation of the model became possible 20 years later. At the same time the binary model was extended by Rosenblatt, who developed perceptrons. The neuron is characterized by a weight vector W and the form of an activation function f, which produces the activation value y:

y = f (W · X) = f (s).

The signals and weights are real numbers, and the activation function turns into a smooth threshold such as the sigmoid 1/(1 + e^(−ax)) or arctan.
It is convenient to consider a NN as a directed graph with weighted edges. Its neurons, the network nodes, are divided into three groups: input, output (result) and hidden (intermediate) ones. Loop-less graphs are called direct-propagation networks, or perceptrons; otherwise the network is recurrent.

[Figure: (a) Artificial neuron; (b) Examples of activation functions; (c) 3-layer perceptron; (d) Hopfield model]
Many ANN tasks boil down to classification: any input is mapped to one of a given set of classes. Geometrically, this corresponds to breaking the space of solutions (or attributes) into domains by hyperplanes. Rosenblatt’s perceptron consisted of one layer. The hyperplane for a two-input single-layer perceptron (a neuron) with a steep threshold function is the line x1·w1 + x2·w2 = T. Bisecting the plane, it allows implementing the binary AND and OR functions, but not XOR. The XOR problem, revealing the limited capabilities of perceptrons, was pointed out by Minsky and Papert [6], who suggested that a ‘hidden layer’ would resolve the linear separability problem.
The graph output depends on its topology and synaptic weights. The procedure of weight adjustment is called learning (or training). In one-layer perceptron learning, one of the input vectors Xt is submitted to the network and the output vector Yt is analyzed. At iteration t, all input weights wij of all the neurons j are adjusted according to the error ∆ = Rt − Yt between Yt and the supervisor-provided reference Rt: wij(t + 1) = wij(t) + k · ∆j · xij, where k ∈ [0, 1] is a learning-speed factor. The procedure is repeated until the network answers have converged to the references. Obviously, this technique is not applicable to the multilayer perceptron, because its hidden-layer outputs are unknown. In their work, Minsky and Papert conjectured that such learning is infeasible5, effectively abridging the enthusiasm and funding of ANN research for the next 20 years.
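A minimal Python sketch of the one-layer learning rule just described, assuming a step activation and using the (linearly separable) OR function as a toy training set; the thresholds are trained alongside the weights, a common convention I assume here.

import numpy as np

def train_perceptron(X, R, k=0.1, epochs=50):
    """One-layer perceptron rule: w_ij(t+1) = w_ij(t) + k * delta_j * x_i."""
    n_in, n_out = X.shape[1], R.shape[1]
    W = np.zeros((n_in, n_out))
    T = np.zeros(n_out)                          # thresholds, adjusted like weights
    for _ in range(epochs):
        for x, r in zip(X, R):
            y = (x @ W - T > 0).astype(float)    # recall: threshold of the weighted sum
            delta = r - y                        # error against the reference
            W += k * np.outer(x, delta)          # weight adjustment
            T -= k * delta                       # threshold adjustment
    return W, T

# Toy example: learn logical OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
R = np.array([[0], [1], [1], [1]], dtype=float)
W, T = train_perceptron(X, R)
print(((X @ W - T) > 0).astype(int).ravel())     # -> [0 1 1 1]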
With the invention of backward-propagation learning, multilayer perceptrons gained popularity. Its essence is gradient-descent minimization of the network squared error E = ∑(yj − dj)². The weights wij connecting the ith neuron of layer n with the jth neuron of layer n + 1 are adjusted as ∆wij = −k · ∂E/∂wij. The derivative exploits the smooth activation function. The algorithm is not free of deficiencies. High weights may shift the working point of the sigmoids into the saturation area. Additionally, a high learning speed causes instability; it is therefore slowed down and the network stops learning — the gradient descent is likely trapped in a local minimum. Another algorithm, simulated annealing, performs better.
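A compact Python sketch of gradient-descent backpropagation for one hidden layer with sigmoid activations, in the spirit of the formulas above; the network size, learning rate and the XOR training set are my illustrative choices, and bias inputs are added as an assumption.

import numpy as np

def sigmoid(s, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * s))

def add_bias(M):
    return np.hstack([M, np.ones((M.shape[0], 1))])   # constant input plays the role of -T

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)        # XOR: needs the hidden layer

W1 = rng.normal(scale=0.5, size=(3, 4))                # input(+bias) -> hidden weights
W2 = rng.normal(scale=0.5, size=(5, 1))                # hidden(+bias) -> output weights
k = 0.5                                                # learning speed

for _ in range(20000):
    h = sigmoid(add_bias(X) @ W1)                      # forward pass
    y = sigmoid(add_bias(h) @ W2)
    g2 = (y - D) * y * (1 - y)                         # dE/ds at the output layer
    g1 = (g2 @ W2[:-1].T) * h * (1 - h)                # error propagated to the hidden layer
    W2 -= k * add_bias(h).T @ g2                       # delta w = -k * dE/dw
    W1 -= k * add_bias(X).T @ g1

print(np.round(y.ravel(), 2))   # usually approaches [0 1 1 0]; a poor start may hit a local minimum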

Self-Organizing Models
It is also possible to learn without the training information. The self-organization capability is a cornerstone feature of all living systems, including nerve cells. As far as ANNs are concerned, self-organization means adjustment of the weighting factors, the number of neurons and the network topology. For simplicity, only the weights are adjusted.
One such approach is Hebbian learning. It reflects a known neurobiological fact: if two connected neurons are excited simultaneously and regularly, their connection becomes stronger. Mathematically, this is expressed as ∆wij = k · xij(t) · yj(t), where xij = yi and yj are the outputs of the ith and jth neurons.
Building a one-layer, fully recurrent NN whose size matches the input vector (object) length and whose weights are programmed by the Hebb algorithm, we get the epochal Hopfield model. With its state initialized by the input, it is iterated Xt+1 = F(W · Xt) until convergence. The state is effectively “attracted” to one of the synapse-predefined states, accomplishing the classification. Basically, the system energy E = −0.5 · ∑ wij xi xj is minimized. The neural associative memory (NAM) operates similarly — the input vectors are predefined in pairs with their outputs.
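A minimal Hopfield/Hebbian sketch in Python; the bipolar (±1) state convention and the tiny stored patterns are my assumptions for illustration.

import numpy as np

def hebbian_weights(patterns):
    """Hebb rule: W_ij is the sum over stored patterns of x_i * x_j, zero diagonal."""
    P = np.array(patterns, dtype=float)
    W = P.T @ P
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, x, iters=20):
    """Iterate X(t+1) = sign(W . X(t)) until the state stops changing."""
    x = np.array(x, dtype=float)
    for _ in range(iters):
        x_new = np.where(W @ x >= 0, 1.0, -1.0)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x

stored = [[1, 1, 1, -1, -1, -1], [-1, -1, -1, 1, 1, 1]]
W = hebbian_weights(stored)
noisy = [1, -1, 1, -1, -1, -1]          # corrupted version of the first pattern
print(hopfield_recall(W, noisy))        # attracted back to [ 1  1  1 -1 -1 -1]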
Kohonen self-organizing maps (SOMs) minimize the difference between a neuron’s input and its synaptic weights: ∆wij = k (xi − wij). In contrast to the Hebbian algorithm, the weights are adjusted not for all neurons, but only for the group around the winning neuron, the one responding most strongly to the input. Such a principle is known as learning by competition. At the beginning the neighborhood is set as large as 2/3 of the network, and it is shrunk down to a single neuron during the course of training. This shapes the network so that close input signals correspond to close neurons, effectively implementing the categorization task.
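A Python sketch of the competitive Kohonen update with a shrinking neighborhood; the 1-D map layout, Gaussian neighborhood function and decay schedules are my implementation assumptions.

import numpy as np

def train_som(data, n_neurons=20, epochs=100, k0=0.5):
    """1-D Kohonen map: delta w_ij = k * (x_i - w_ij) for neurons near the winner."""
    rng = np.random.default_rng(1)
    W = rng.random((n_neurons, data.shape[1]))           # one weight vector per neuron
    radius0 = n_neurons * 2 / 3                          # start with ~2/3 of the network
    for t in range(epochs):
        frac = t / epochs
        k = k0 * (1 - frac)                              # decaying learning speed
        radius = max(1.0, radius0 * (1 - frac))          # neighborhood shrinks to ~1 neuron
        for x in data:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))   # best-matching neuron
            dist = np.abs(np.arange(n_neurons) - winner)
            h = np.exp(-(dist / radius) ** 2)            # neighborhood function
            W += k * h[:, None] * (x - W)                # pull neighbors toward the input
    return W

data = np.random.default_rng(2).random((200, 2))         # toy 2-D inputs
W = train_som(data)
print(np.round(W[:5], 2))                                # nearby neurons end up with nearby weights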
5 http://ece-www.colorado.edu/~ecen4831/lectures/NNet3.html

2.3 Demand for the Neural HW
The maturity of some NN algorithms and the importance of their intrinsic non-linearity, as contrasted to classical linear approaches, have long been proven [7]:

Quite to our surprise, connectionist networks consequently outperformed all other methods received, ranging from visual predictions to sophisticated noise reduction techniques.

The problem is to find a proper substrate for their execution. Real applications tend to require large networks and to process many vectors quickly. Ordinary HW is extremely inefficient here. Furthermore, as the field of theoretical neuroscience develops and the electrophysiological evidence accumulates, researchers need ever more efficient computational tools for the study of neural systems. In cognitive neuroscience and neuroinformatics, neurosimulation is the essential approach for understanding complex brain processing. The computational resources required far exceed those available to researchers.
There is always demand for faster computing. The two approaches to speeding up are the speed demon, which means waiting for faster processors, and the brainiac, running many processors simultaneously. The latter is supported by CMOS technology packing billions of gates into micro-areas. All that is left to do is learn how to connect them efficiently. Those who are familiar with parallel architectures know that the subject is all about scalability. You cannot just take a handful of fast processors and obtain a linear speedup, since the computation overhead, along with the inevitable synchronization and communication, limits the performance growth.
That is not the case for the brain, which demonstrates tremendous scalability — from a few neurons in primitive species up to 100 billion in humans, at an unprecedented connectivity of 10 000 connections per neuron. As [8] points out in his review, the terms ‘parallel architectures’ and ‘neural networks’ are so close that they are often used as synonyms.
Summarizing the classical ANN models presented above, neurosimulation consists of two phases: 1) in the learning phase, the weights are adjusted in accordance with the input examples; 2) in the recall phase, inputs are mapped to outputs. The neuron processing is simple: a real-valued activation function is applied to the weighted sum of inputs (a scalar multiplication) and the resulting activation value is broadcast to the other nodes. Basically, two simple operations are needed: multiply-and-accumulate and the activation function F(W · X). Nevertheless, owing to the large number of neurons, which are massively interconnected at that, the workload in the recall phase ends up quite involved. The learning phase is even more burdensome. However, all the synapses and neurons operate independently.
The classical ANN models suggest that information and computation are uniformly distributed over the network in such a way that the stored objects are memorized in the synapses, every synapse bearing information on all the memorized objects. This utmost diffusion of information is opposed to the unambiguity pursued in classical (mechanic, symbolic) algorithms and data structures. This revolutionary concept, the highest degree of distribution, is called connectionism. Confining all the computation right into the connections would result in truly wired logic.
The inherently parallel neural models can make the most of highly parallel machines. However, the redundancies and the simulation of neural operations, instead of their implementation in HW, make general-purpose supercomputers expensive and slow. HW is fast and efficient when it is ‘in line’ with the model it executes. The degree of parallelism inherent in neuroprocessing begs for parallel execution. Neuroscience inspires computer scientists to look for the optimal substrate for the neural models — fast and efficient parallel HW architectures. Massively parallel VLSI implementations of NNs would combine the high performance of the former with the good scalability of the latter.

Chapter 3

Traditional Approach

3.1 Simulating Artificial Neural Networks on Parallel Architectures
The classical NN models were presented in the previous chapter. This chapter will overview the machines built for simulating them. The novel, spiking-based models will be discussed in the following chapter.
The field of neurosimulation is still young and is used primarily for model development in research labs. This means that besides efficiency (cost, power, space), the machine must support different models, particularly different activation functions: threshold, sigmoid, hyperbolic tangent. This degree of freedom is known as flexibility — the ability to support existing and novel paradigms — and programmability is required as well. Another feature of a good architecture, modularity, is defined in [7] in two different ways: 1) the possibility to offer a machine matching the user’s problem size and available funding and to fully exploit its resources; 2) the possibility to replace elements of the framework without redesigning the machine in order to support new elements. Whereas speed and efficiency favor each other, flexibility conflicts with them. It is natural, therefore, that the lifetime of ANN models starts on general platforms and migrates to special ones, as is planned in BB.
Figure 3.1 summarizes the taxonomies of the neurosimulation platforms. I have added FPGA because this class of universal computing devices is missing in the taxonomies developed earlier.

3.2 Mapping neural networks on parallel machines


The ANN is called a guest graph G(N, W): a set of neurons N interconnected by weighted synapses W. Its possible topologies are fully recurrent, random, layered, toroid, modular and hierarchical. It is mapped onto the target HW, called a host graph H(P, C): a set of processing elements (PE) interconnected by connecting elements (CE) into an architecture. In the biological implementation we have G = H, an isomorphic one-to-one mapping, where the NN is organized into a hierarchy of modules. A computer PE supports one or more neurons and each CE supports several virtual node-to-node connections.

Figure 3.1: Taxonomy of neural platforms
The parts of the network are processed by processing elements (PE). One chip may incorporate more than one PE. The key concepts for an efficient mapping of the network onto the available PEs are load balancing, minimizing inter-PE communication, and synchronization. Furthermore, the mapping should be scalable both for different network sizes and for different numbers of processing elements. The amount of parallelism achieved depends on the granularity of the problem decomposition. The following levels of parallelism are exploited (ordered from coarsest to finest granularity):

• training-session parallelism: processors emulate a network independently (their results may be exchanged after a number of cycles);

• pattern parallelism: every processor simulates a different input vector on its copy of the network, reducing the communication to zero;

• layer parallelism: concurrent execution of layers within a network. A popular variation of layer parallelism is to divide a layer “vertically”, which makes more sense because layers are computed in sequence;

• neuron parallelism: a whole layer of neurons or the full network is simulated in parallel; and

• synapse parallelism: simultaneous weighting.

Figure 3.2: Mapping an ANN (guest graph) onto a parallel machine (host) [9]


Despite this categorization being made with feed-forward networks and back-propagation learning in mind, it can be applied to many other models. Packing more neurons onto a single processor has proved to be advantageous in coarse-grained processing.
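To make neuron parallelism concrete, here is a small Python sketch that splits one layer “vertically” across processing elements, each owning a slice of the neurons; the process of dispatching slices to real PEs is only mimicked sequentially, and all sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_neurons, n_pe = 256, 512, 4
W = rng.normal(size=(n_neurons, n_in))        # full weight matrix of one layer
x = rng.normal(size=n_in)                     # input vector broadcast to every PE

slices = np.array_split(np.arange(n_neurons), n_pe)

def pe_compute(rows):
    """Work done independently on one PE: MAC plus activation for its own neurons."""
    s = W[rows] @ x                           # weighted sums for the owned slice
    return 1.0 / (1.0 + np.exp(-s))           # sigmoid activation

# In real HW these calls run concurrently; here they are issued one after another.
partial = [pe_compute(rows) for rows in slices]
y = np.concatenate(partial)                   # the gather step is the inter-PE communication
assert np.allclose(y, 1.0 / (1.0 + np.exp(-(W @ x))))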

3.3 Benchmarking
Performance measurements play a key role in deciding about the applicability of a neuroimplementation. Yet, because of the field’s immaturity, there is no standard benchmarking application like TeX and SPICE in ordinary computing. Only a few exceptions exist, like NETtalk, and they are sometimes used for comparisons. The most commonly accepted measures are CPS (Connections Per Second), which measures how fast a network performs the recall phase, and CUPS (Connection Updates Per Second), which measures the learning phase. Implementations are compared against an Alpha workstation and alternative designs. This measure is blamed [7] (EPFL) for being even more deceptive than FLOPS, for two reasons:
• it defines neither the network model nor its size and precision. A more complex neuron can replace many simpler ones and prolong data lifetime on a PE, considerably reducing the communication, which is crucial for I/O-bound NNs. The missing benchmark application allows developers to choose a best-case network and gives no information on the capability of the structure to adapt to different problems.

• the definition of CPS/CUPS is vague, if not meaningless, to a degree exceeding MIPS and FLOPS. This allows the Adaptive Solutions designers to obtain it as the product of the number of connections in the network and the number of input vectors processed per second. The Philips Lneuro reports MCUPS > MCPS because the time to perform one operation is left out.
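A back-of-the-envelope Python computation of CPS in the spirit criticized above; the NETtalk-like topology and throughput figures are invented purely for illustration.

# CPS as commonly reported: total connections in the network times
# the number of input vectors pushed through per second.
n_input, n_hidden, n_output = 203, 60, 26          # NETtalk-like topology (illustrative)
connections = n_input * n_hidden + n_hidden * n_output
vectors_per_second = 1000.0                        # assumed recall throughput

cps = connections * vectors_per_second
print(f"{connections} connections -> {cps / 1e6:.1f} MCPS")
# The figure says nothing about precision, model, or how long one
# multiply-accumulate actually takes — which is exactly the criticism above.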

[Figure: (a) A ring architecture; (b) A bus architecture]

The critics propose a rational evaluation, as used by the Army/Navy CFA Committee in selecting a computer architecture for future military implementations. They mention that not all organizations proposing new architectures can use the latest technologies: for instance, large companies may own advanced CMOS processes or invest heavily in consolidated technologies as a key factor of performance, whereas academic projects are funds-limited. The idea is, therefore, to select the architecture by evaluating it on its own merits and then undertake a new implementation, regarding the original implementation as irrelevant.
A detailed theoretical analysis of neural HW architectures, based on the measurements adopted for parallel architectures, is done in [7] (EPFL). [10] shows how to analyze the mappings. The basic assumption taken is that in massively parallel neural network implementations the major bottleneck is formed by the communication process rather than by the calculation of the neural activation and learning rules. The efficiency of a neurocomputer implementation can therefore best be defined in terms of the time taken by a single iteration of the total NN. In [11], the same authors give a broad overview of the parallel machines used for neural simulations.

3.4 Simulation on General-Purpose Parallel Machines


General parallel architectures, as possible hosts for neurosimulations, are characterized by a large number of processors organized according to some topology. Depending on the presence or absence of central control, parallel computers may be divided into two broad categories: data-parallel and control-parallel. The two categories require quite different styles of programming.
Data-parallel architectures simultaneously process large distributed data sets using a centralized control flow. A large amount of data is processed by a large number of processors in a synchronous (typically SIMD) or regular (e.g. pipelined) fashion. Pipelining usually provides layer parallelization. Pipeline structuring is often exploited on systolic arrays — specific hardware architectures designed to map high-level computation directly onto HW. Numerous simple processors are arranged in one- or multi-dimensional arrays, performing simple operations in a pipelined fashion. Circular communication ensures that data arrive at regular time intervals from (possibly) different directions.
A ring can be 100% efficient on fully recurrent ANNs [10]: every node computes one MAC operation and advances the partial result and its output to the next node for accumulation. Once all the sums are computed, the nodes apply the activation function and start a new round.
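A small Python simulation of the ring schedule just described, assuming one neuron per node of a fully recurrent network; it only mimics the data movement and is not a performance model.

import numpy as np

rng = np.random.default_rng(3)
n = 6                                   # nodes = neurons of a fully recurrent network
W = rng.normal(size=(n, n))             # node i stores row i of the weight matrix
y = rng.random(n)                       # current outputs, one per node

# Each round: every node does one MAC and passes the travelling output value
# to its neighbour; after n steps all the weighted sums are complete.
partial = np.zeros(n)
token = y.copy()                        # output values circulating around the ring
owner = np.arange(n)                    # which neuron's output each node holds now
for _ in range(n):
    for i in range(n):
        partial[i] += W[i, owner[i]] * token[i]    # one MAC per node per step
    token = np.roll(token, 1)                      # advance values to the next node
    owner = np.roll(owner, 1)

y_next = np.tanh(partial)                          # all nodes apply the activation
assert np.allclose(partial, W @ y)                 # same result as the matrix product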
Control-parallel architectures perform processing in a decentralized manner, allowing different programs to be executed on different processors (MIMD). A parallel program is explicitly divided into several different tasks which are placed on different processors. The communication scheme is usually general routing, i.e. the processors are message-passing computers. Transputers are the most popular control-parallel platform for neural simulations.
Writing this paper for a course on parallel architectures, it is curious to note that [8] identifies data-parallel decomposition with the SIMD architecture and the multiprocessor. This ‘equivalence class’ is opposed to another consisting of parallelized control, the MIMD architecture and the message-passing multicomputer. The author lists some neural simulations on general-purpose computers; the summary is presented in Table 3.1.
The author concludes that data-parallel techniques “significantly outperform their control-parallel counterparts”. In my opinion, it is not fair to compare the power of a group of six transputers against kilo-processor armies. Assuming linear scalability of transputers, the data show that they are an order of magnitude faster. Yet, from a theoretical point of view, it is reasonable to think that data-parallel architectures are a natural mapping of the neuroparadigm, since neural computations are most often interpreted in terms of synchronous matrix-vector operations. For this reason it is not surprising that control-parallel architectures are programmed in a data-parallel style. [9] explains that MIMD is used out of necessity, because production of SIMD machines stopped a long time ago.
Curiously, [12] finds the Beowulf cluster the most attractive option in his review of simulating NNs on general-purpose parallel computers. Equipped with a high-speed interconnect such as Myrinet, Beowulf offers excellent performance at a very competitive price. This cost advantage can often be as high as an order of magnitude over multiprocessor machines of comparable capabilities.
Programming neural networks on parallel machines requires high-level techniques reflecting both the inherent features of neuromodels and the characteristics of the underlying computers. To simplify the task of the neuroscientist, a number of parallel neurosimulators have been proposed for general-purpose machines. Some institutions develop libraries for MIMD supercomputers that enable NN developers to use the supercomputer efficiently without specific knowledge [9]. Others develop professional portable neurosimulators, like NEURON and NCS. The compromise between portability and efficiency is usually achieved by parallel programming environments, e.g. the Message-Passing Interface (MPI), Parallel Virtual Machine (PVM), Pthreads and OpenMP, on heterogeneous and homogeneous clusters and multiprocessors.
Simulations on general-purpose parallel computers were mostly done in the late eighties. A large number of parallel neural network implementation studies were carried out on the existing massively parallel machines listed below, simply because neural hardware was not available. Although these machines were not specially designed for neural implementations, in many cases very high performance rates were obtained. Universal computers still remain popular in neurocomputing because they are more flexible and easier to program.
Zhang et al. [13] have used node-per-layer and training parallelism to implement backpropagation networks on the Connection Machine. Each processor is used to store a node from each of the layers, so that a ‘slice’ of nodes lies on a single processor. The number of processors needed to store a network is equal to the number of nodes in the largest layer of the network. The weights are stored in a memory structure shared by a group of 32 processors, reflecting the CM-specific architecture. With 64K processors, the CM is a perfect candidate for training-example parallelism. The authors use network replication to fully utilize the machine. The NETtalk implementation achieves a peak performance of 180 MCPS and 38 MCUPS.
The MasPar implementation [14], similarly to Zhang’s implementation, exploits both layer and training-session parallelism. Each processor stores the weights of its corresponding neurons in its local memory. In the forward phase, the weighted sums are evaluated, with intermediate results rotated from right to left across the processor array using MasPar’s local interconnect. Once the input values have been evaluated, sigmoid activation functions are applied and the same procedure is repeated for the next layer. In the backward phase, a similar procedure is performed, with errors propagated from the output down to the input layer. After performing a number of training examples on multiple copies of the same network, weights are synchronously updated. The maximal NETtalk performance obtained is 176 MCPS and 42 MCUPS.
Rosenberg and Blelloch [15] have used node and weight parallelism to implement backpropagation networks on a one-dimensional array, with one processor allocated to each node and two processors to each connection: input and output. The connection processors multiply the values by their respective weights. The nodes accumulate the products and compute sigmoids. The backpropagation is done in a similar way. The NETtalk maximum speed achieves 13 MCUPS.
Pomerleau et al. [16] have used training and layer parallelism to implement backpropagation networks on a Warp computer with processors organized in a systolic ring. In the forward phase, the activation values are shifted circularly along the ring and multiplied by the corresponding weights. Each processor accumulates its partial weighted sum. When the sum has been evaluated, the activation function is performed. The backward processing is similar, but instead of activation values, accumulated errors are shifted circularly. Performance measurements for the NETtalk application on the Warp computer showed a speed of 17 MCUPS.
Structuring   Parallelism        Procs   Computer architecture                CPS     CUPS
COARSE        training, layer    64K     Connection Machine (Zhang 90)        180M    38M
COARSE        training, layer    16K     MasPar (Zell 90)                     176M    42M
FINE          node, weight       64K     Connection Machine (Rosenberg 87)            13M
PIPELINED     training, layer    10      Warp (Pomerleau 88)                          17M
PIPELINED     layer, node        13K     Systolic Array (Chung 92)                    148M
COARSE        partitions         6       Transputers (Straub 91)                      207K

Table 3.1: NETtalk implementations on general-purpose supercomputers. The results are from the late ’80s and early ’90s.


Chung et al. [17] have applied classical systolic algorithms for matrix-by-vector multiplication to simulating backpropagation networks. They exploited layer and node parallelism by partitioning the neurons of each layer into groups and by partitioning the operation of each neuron into the sum or product operation and the non-linear function. The execution of the forward and backward phases was also done in parallel by pipelining multiple input sets. The NETtalk application, run on a 2-D systolic array with 13K processing elements, achieved a maximum speed of 248 MCUPS.
In [18], the authors describe a backpropagation implementation on the T8000 system, consisting of a central transputer and six slaves. A multilayer feedforward network is “vertically divided” so that each slave contains a fragment of the nodes of each layer. Computation is synchronized by the master so that the layers are computed in sequence. This is similar to layer decomposition, but the execution flow is closer to the SPMD model. The authors report 58 KCUPS for a small network and 207 KCUPS for a larger network, which better utilizes the processors.
MindShape [10] is a fractal-architecture universal computer designed for simulating the brain: inspired by the fractal organization of the brain, its authors propose a fractal node-parallel architecture. Analyzing node-parallel mappings, they conclude that scalability is bound by the communication overhead, O(1) < ti < O(n) — the network iteration time ti grows with the number of nodes n (node parallelism). For instance, ti is O(n) for the fully recurrent network — there is no ‘good’ architecture for it. Using this measure, it is shown how the fractal architecture manages to host different guest topologies: fully and randomly connected, layered feedforward, torus, modular, and hierarchical modular (fractal). The hierarchical ones perform best (O(1)).

Figure 3.3: The fractal topology and MindShape architecture: the same module interconnect pattern at all levels of the hierarchy
The CEs store transfer tables — the information on what data has to go where. The brain capacity is estimated as 9 × 10¹⁰ neurons × 10⁴ synapses/neuron × 7 bits/synapse at 500 Hz = 53 PCUPS. With 256 neurons per chip, 32 chips on a board, 32 boards per rack, 32 × 32 racks per floor and 32 floors at 100 MHz, we could deliver 8.7 PCUPS, the cortex capacity. The volume of such a system is equivalent to the first IBM computer. Two-byte weights are proposed, totaling 8.3 TByte. The issue of fault tolerance at this scale is also discussed.

3.5 Neurocomputers
Despite these advances, the speed and efficiency requirements cannot be successfully met by general-purpose parallel computers. General-purpose neuroarchitectures offer ‘generic’ neural features aimed at a wide range of ANN models. Neurocomputers can be further specialized for simulating concrete models and networks.
Architecturally, neurocomputers are large processor arrays — complex regular VLSI architectures organized in a data-parallel manner. A typical processing unit of a neurocomputer has local memory for storing weights and state information. The whole system is interconnected by a parallel broadcast bus and usually has a central control unit. Data-parallel programming techniques and HW architectures are the most efficient for neural processing. The dominating approaches are systolic arrays and SIMD and SPMD processor arrays.
Important for the design of highly scalable hardware, finding an interconnection strategy for large numbers of processors has turned out to be a non-trivial problem. Much knowledge about the architectures of these massively parallel computers can be directly applied to the design of neural architectures. Most architectures are, however, ‘regular’, for instance grid-based, ring-based, etc. Only a few are hierarchical. As was argued in [11], the latter form the most brain-like architecture.

Analog architectures tend toward full connectivity. Digital chips use a localized communication plan. Three architectural classes of system interconnect can be distinguished: systolic, broadcast-bus, and hierarchical architectures. Systolic arrays are considered non-scalable. According to many designers, broadcasting is the most efficient multiplexed interconnection architecture for large fan-in and fan-out. It seems that broadcast communication is often the key to getting communication and processing balanced, since it is a way to time-share communication paths efficiently.
2D architectures are less modular and less reconfigurable, as the data flow is quite rigid. At the same time, they allow much higher throughputs than 1D architectures.
Implementing neural functions on special-purpose chips speeds up the neural iteration time by about two orders of magnitude compared to a general-purpose µP. The common goal of neurochip designers is to pack as many processing elements as possible into a single silicon chip, thus providing faster connectivity. To achieve this, developers limit the computation precision. [11] remarks that overfocusing on this pushes aside the inter-chip connectivity issue, which is also important for integration into a large-scale architecture.
Digital technology has produced the most mature neurochips, providing flexibility-programmability and reliability (stable precision compared to analog) at relatively low cost. Furthermore, thanks to mass production, a lot of powerful custom-design tools are available. Numerous programs for digital neurochip design have been offered, and all major microchip companies and research centers worldwide have announced their neuroproducts. Digital implementations use thousands of transistors to implement a single neuron or synapse.
In analog designs, on the other hand, these computationally intensive calculations are performed automatically by physical processes such as the summing of currents or charges. Operational amplifiers, for instance, are easily built from single transistors and automatically perform synapse- and neuron-like functions, such as integration and sigmoid transfer. Being natural, analog chips are very compact and offer high speed at low energy dissipation. Simple neural (non-learning) associative memories with more than 1000 neurons and 1000 inputs each can be integrated on a single chip performing about 100 GCPS [11]. Another advantage is the ease of integration with the real world, while digital counterparts need AD/DA converters.
The first reason why analog did not replace digital chips is a lack of flexibility: analog technology is usually dedicated to one model and results in a scarcely-reusable neurocomputer. Another problem, representing adaptable weights, limits the applicability of analog circuits. Weights can, for instance, be represented by resistors, but such fixing of the weights during chip production makes them non-adaptable — they can only be used in the recall phase. Capacitors suffer from limited storage time and troublesome learning. Off-chip training is sometimes used, with refreshing of the analog memory. For on-chip training, statistical methods like random weight changing are proposed in place of back-propagation, whose complex computation and non-local information make it prohibitive. Other memory techniques are incompatible with standard VLSI technology.
In addition to the weight-storage problem, analog electronics is susceptible to temperature changes, (interference) noise, and VLSI process variations, which make analog chips less accurate, make it harder to understand what exactly is computed, and complicate design and debugging. At the same time, practice shows that realistic neuroapplications often require accurate calculations, especially for back-propagation.
Taking these drawbacks into account, the optimal solution would appear to be a combination of both analog and digital techniques. Hybrid technology exploits the advantages of the two approaches. The optimal combination applies digital techniques to perform accurate and flexible training and uses the potential density of analog chips to obtain finer parallelism on a smaller area in the recall phase.
Here we do not consider optical technology, which introduces photons as the basic information carriers. They are much faster than electrons and have fewer interference problems. In addition to the greater potential communication bandwidth, the processing of light beams also offers massive parallelism. These features put optical computing first among possible candidates for the neurocomputer of the future. Optics ideally suits the realization of dense networks of weighted interconnections. Spatial optics offers 3-D interconnection networks with enormous bandwidth and very low power consumption. Besides optoelectronics, electro-chemical and molecular technologies are also very promising. Despite the enormous parallel-processing and 3D-connection prospects of optical technology, silicon technology continues to dominate, with more and more neurons being packed on a chip.
Neurocomputers are popular in the form of accelerator boards added to personal computers.
The CNAPS system (Connected Network of Adaptive Processors, 1991), developed by Adaptive Solutions, became one of the best-known commercially available neurocomputers. It is built of N6400 neurochips, each consisting of 64 processing nodes (PN) connected by a broadcast bus in SIMD mode. Two 8-bit buses allow broadcasting of input and output data to all PNs and make it easy to add more chips. Additionally, the buses connect the PNs to a common instruction sequencer.
The PNs are designed like DSPs, including a fixed-point MAC, and are equipped with 4 KB of local SRAM for holding the weights — one matrix for recall and one for back-propagation learning. This limits the system size: the performance drops dramatically when the 64 PNs try to communicate over the two buses, which becomes necessary when the network and weight matrix grow. The complete CNAPS system may have 512 nodes connected to a host workstation and includes SW support. It uses layer decomposition and offers a maximum performance of 5.7 GCPS and 1.46 GCUPS, tested on a backpropagation network. The machine can be used as a general-purpose accelerator.
The SYNAPSE system (Synthesis of Neural Algorithms on a Parallel Systolic Engine) was built by Siemens in 1993 from MA-16 neurochips designed for fast 4x4 matrix operations with 16-bit fixed-point precision. The chips are cascaded to form a systolic array: one MA-16 chip outputs to another in a pipelined manner, ensuring optimal throughput. The two parallel rings of SYNAPSE-1 are controlled by Motorola processors. The weights are stored off-chip in 128 MB SDRAM. Similarly to CNAPS, a wide range of NN models is supported but, in contrast to SIMD, programming is difficult because of the complex PEs and the 2D systolic structure. The system is packaged with SW to make neuroprogramming easier. Each chip’s throughput is 500 MCPS; the full system performs at 5.12 GCPS and 33 MCUPS.
The RAP system, developed at Berkeley in 1993, is a ring array of DSP chips specialized for fast dot-product arithmetic. Each DSP has local memory (256 KB of static RAM and 4 MB of dynamic RAM) and a ring interface. Four DSPs can be packed on a board, with a maximum of ten boards. Each board has a VME bus interface to the host workstation. The processing is performed in an SPMD manner. Several neurons are mapped onto a single DSP in layer-decomposition style. The maximum speed of a 10-board system is estimated at 574 MCPS and 106 MCUPS.
The SAIC SIGMA-1 neurocomputer is a PC with a DELTA floating-point processor board and two software packages: an object-oriented language and a neural-net library. The coprocessor can hold 3 M virtual processing elements and connections, performing 2 MCUPS and 11 MCPS.
The Balboa 860 co-processor board for PCs and Sun workstations is intended to enhance the neurosoftware package ExploreNet. It uses an Intel i860 as its central processor and reaches a maximum speed of 25 MCPS for a backpropagation network in the recall phase, and 9 MCUPS in the learning phase.
The Lneuro (Learning Neurochip, 1990) implemented by Philips provides 32 input and 16 output neurons. By updating the whole set of synaptic weights related to a given output neuron in parallel, a sort of weight parallelism is achieved. The chip includes on-chip learning with an adjustable learning rule. A number of chips can be cascaded within a reconfigurable, transputer-controlled network. Experiments with 16 LNeuro 1.0 chips report an 8x speed-up compared to an implementation on a transputer. Measured performance: 16 LNeuros on 4 dedicated boards show 19 MCPS and 4.2 MCUPS. The authors claim a linear speed-up with the size of the machine.
Figure 3.4: Representative neurochip-based architectures: (a) CNAPS made of N6400 chips; (b) SYNAPSE; (c) MANTRA: systolic array of GENES IV chips; (d) MANTRA: GENES IV processing element; (e) MANTRA-I system architecture

Mantra I (1993, Swiss Federal Institute of Technology) is aimed at a multi-model neural computer which supports several types of networks and paradigms. It consists of a 2-D array of up to 40x40 GENES IV systolic processors and a linear array of auxiliary processors called GACD1. The GENES chips (Generic Element for Neuro-Emulator Systolic arrays) are bit-serial processing elements that perform vector/matrix multiplications. The Mantra architecture is in principle very well scalable. It is one of the rare examples of synaptic parallelism. It shares the difficult reconfigurability and programming with SYNAPSE. The slow controller and serial communication limit the performance. Performance: 400 MCPS and 133 MCUPS (backpropagation) with 1600 PEs.
BACHUS III (1994, Darmstadt University of Technology and University of Düsseldorf, Germany) is a chip containing the functionality of 32 neurons with 1-bit connections. The chips are mounted together, resulting in 256 simple processors; the total system is called PAN IV. The chips are only used in the feed-forward phase; learning or programming is not supported and thus has to be done off-chip. The system only supports neural networks with binary weights. Applications are to be found in fast associative databases in multi-user environments, speech processing, etc.
The analog Mod2 neurocomputer (Naval Air Warfare Center Weapons Division, CA, 1992) incorporates neural networks as subsystems in a layered hierarchical structure. The Mod2 is designed to support parallel processing of image data at sensor (real-time) rates. The architecture was inspired by the structures of the biological olfactory, auditory, and visual systems. The basic structure is a hierarchy of locally densely connected, globally sparsely connected networks. The locally densely interconnected network is implemented in a modular/block structure based upon the ETANN chip. Mod2 is said to implement several neural network paradigms and is in theory infinitely extensible. An initial implementation consists of 12 ETANN chips, each able to perform 1.2 GCPS.
Epsilon (Edinburgh Pulse Stream Implementation of a Learning Oriented Network, 1992), developed at Edinburgh University, is a hybrid large-scale generic building-block device. It consists of 30 nodes and 3600 synaptic weights and can be used both as an accelerator to a conventional computer and as an autonomous processor. The chip has a single layer of weights but can be cascaded to form larger networks. The synapses are formed by transconductance multiplier circuits which generate output currents proportional to the product of two input voltages. A weight is represented by fixing one of these voltages. The synchronous neuron mode uses pulse-width modulation and is specially designed with vision applications in mind. The asynchronous mode is provided by pulse-frequency modulation, which is advantageous for feedback and recurrent networks, where temporal characteristics are important. The synchronous implementation was successfully applied to a vowel recognition task. An MLP network consisting of 38 neurons (hidden and output) was trained by the ‘chip-in-the-loop’ method and showed performance comparable to a software simulation on a SPARC station. With this chip it has been shown that it is possible to implement robust and reliable networks using the pulse-stream technique. Performance: 360 MCPS.

3.6 FPGAs
Massively parallel and reconfigurable FPGAs suit the implementation of highly parallel and dynamically adaptable ANNs very well. In addition, being general-purpose computing devices, FPGAs offer the level of flexibility needed for many neuromodels and are also useful for pre- and post-processing the interface around the network in the conventional way.
However, despite the custom-chip-style fine-grain parallelism offered, FPGAs are not true digital VLSI; they are one order of magnitude slower. Yet the newest FPGAs incorporate ASIC multipliers and MAC units, which have a considerable effect in the multiplication-rich ANNs. Floating-point operations are impractical; in particular, the non-linear activation (sigmoid) function, which is too expensive in a direct implementation, is usually approximated piece-wise linearly.
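A simple Python illustration of the piece-wise linear sigmoid approximation; the single-segment breakpoints are an arbitrary choice of mine, while real FPGA designs use more segments and fixed-point arithmetic.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pwl_sigmoid(x):
    """Piece-wise linear approximation: a line inside [-4, 4], saturation outside.
    It only shows the idea of trading accuracy for cheap hardware."""
    return np.clip(0.5 + 0.125 * x, 0.0, 1.0)

x = np.linspace(-8, 8, 1000)
print("max abs error:", np.max(np.abs(sigmoid(x) - pwl_sigmoid(x))))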
The reconfigurability permits neural morphing. During training, the topology and the required computational precision of an ANN can be adjusted according to some learning criteria. The review [19] refers to two works that used genetic algorithms to dynamically grow and evolve ANN-based cellular automata and implemented an algorithm which supports on-line pruning and construction of network models.
The reviewers mention the need for friendlier learning algorithms and software tools.
Below are some examples of implementing different models on the RAPTOR2000 board. Because of its flexibility, many other neural and conventional algorithms can be mapped onto the system and reconfigured at runtime.

3.6.1 ANNs on RAPTOR2000


RAPTOR2000 is an extensible PCI board with dual-port SRAM. The (expansion) FPGAs are connected in a linear array with their neighbors (128-bit bus) as well as by two buses that are 75 and 85 bits wide. It has been tested with three sample applications.
Figure 3.5: RAPTOR2000: (a) the prototyping board; (b) SOM architecture; (c) BiNAM architecture; (d) radial basis functions

For the Kohonen SOM, four Virtex FPGAs were connected in a 2D array. A fifth FPGA implements the host-PC interface controller, communicating the NN input vectors and results, and is equipped with 128 MB SDRAM for storing the training vectors. The architecture of the processing elements (PE) is similar to the ones proposed for ASICs. FPGA BlockRAM is used for storing the weights. Manhattan distances are used instead of Euclidean ones to avoid multiplications and square roots. An interesting trick is demonstrated: the PEs start learning in a fast, 8-bit precision configuration for rough ordering of the map and are then reconfigured to the slower 16-bit precision for fine-tuning. The number of cycles per input vector depends on the input vector length l and the number of neurons per PE n: c_recall = n · (l + 2⌈ld(l · 255)⌉ + 4), and is almost twice that for learning. Clocked at 65 MHz, the XCV812E-6 outperforms an 800 MHz AMD Athlon by more than 30 times.
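The Manhattan-distance trick mentioned above can be stated in a few lines of Python; the random weights and input are placeholders, and the point is only that the best-matching neuron can be found without any multiplier.

import numpy as np

rng = np.random.default_rng(4)
W = rng.random((64, 16))                 # 64 neurons, 16-dimensional weight vectors
x = rng.random(16)                       # input vector

# A Euclidean best-matching unit needs a multiply per weight (the squares),
# Manhattan only additions and subtractions — which is what lets the FPGA PEs
# drop the multipliers and square roots.
bmu_euclid = np.argmin(np.sum((W - x) ** 2, axis=1))
bmu_manhattan = np.argmin(np.sum(np.abs(W - x), axis=1))
print(bmu_euclid, bmu_manhattan)         # often, but not always, the same neuron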
Another application is a binary neural associative memory (BiNAM). Using sparse encoding, i.e. when almost all bits of the input and output vectors are ‘0’, the best storage efficiency and almost linear scalability are achievable for both recall and learning. Every processor works on its own part of the neurons, but large storage is required. More than a million associations can be stored on six Virtex modules (512 neurons per FPGA) using external SDRAM. Every FPGA has a 512-bit connection to the SDRAM bus and every neuron processes one column of the memory matrix. This 50 MHz implementation is limited by the SDRAM access time and results in a 5.4 µs recall.
The last sample application is (Radial Basis) Function Approximation. A network with a flexible number of hidden neurons is trained incrementally: if a good approximation is not achieved, a neuron is added and learning restarts (see the sketch below). This also minimizes the risk of getting stuck in a local minimum. A number of identical PEs compute their neurons in parallel. Data selectors assign the inputs and select the correct outputs, which are summed up in a global accumulator. Simultaneously, an error calculation unit analyzes the error, which is submitted to the controller and the PEs for weight update. Such an implementation can run at 50 MHz with the number of cycles per recall given by c = l + N_PE + ⌈N_neur / N_PE⌉ · ((4 · l + 5) + 2).
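A minimal sketch of the incremental training idea, assuming a Gaussian basis function and a simple "add a center at the worst-fitted sample" heuristic; both are illustrative choices of mine, not necessarily those of the RAPTOR2000 implementation.

```python
import numpy as np

def rbf_design_matrix(X, centers, width):
    # Gaussian radial basis activations for every (sample, center) pair
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

def grow_rbf(X, y, tol=1e-2, max_neurons=30, width=0.3):
    """Add hidden neurons one by one until the approximation is good enough."""
    centers = X[[0]]                      # start with a single hidden neuron
    while True:
        Phi = rbf_design_matrix(X, centers, width)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # output weights
        err = np.abs(Phi @ w - y)
        if err.max() < tol or len(centers) >= max_neurons:
            return centers, w
        # place the next center at the worst-approximated sample and restart
        centers = np.vstack([centers, X[[np.argmax(err)]]])

# toy usage: approximate sin(2*pi*x) on [0, 1]
X = np.linspace(0, 1, 50)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
centers, w = grow_rbf(X, y)
print(len(centers), "hidden neurons")
```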

3.7 Conclusions
During the past five decades, the most frequently used types of artificial neural networks have been the perceptron-based models. Implementation projects such as those reported above give rise to new insights, insights that most likely would never have emerged from simulation studies alone. However, the neurocomputers did not uncover the full potential for fast, scalable and user-friendly neurosimulation.
In the late 1980s and early 1990s, neurocomputers based on digital neurochips reached the peak of their popularity — some even came out of the research laboratories and entered the market. However, the progress stalled: initial enthusiasm decreased because 1) user experience in solving real-world problems was not very satisfactory and 2) they could not compete with the general-purpose µP that grew according to Moore's Law.

Figure 3.6: Neurocomputer Performances [11]

The development of custom chips¹ is very expensive (and especially hard for connectionists who are not familiar with such things as VHDL), the chips are less programmable, and it turned out better to rely on the massively produced and exponentially improving general-purpose µP. DSPs are especially popular because of their highly parallel SIMD-style MACs for the synapses and tightly integrated FPU for the sigmoid computation. The relatively new FPGAs, which are also general-purpose computation devices, surpass their performance by one order of magnitude. These implementations will, of course, never be as efficient and fast as dedicated chips.
Looking at figure 3.6, neurocomputers are two orders of magnitude faster than general-purpose multiprocessors. Neurocomputers made of neurochips are a further two orders of magnitude faster than their µP-based counterparts. Analog technology brings two additional orders.
Later works show that analog inaccuracy can even be an advantage in the ‘inherently fault-tolerant’ neurocomputing, remarking that the ‘wetware’ of real brains keeps working surprisingly well over a wide range of temperatures and a variety of neurons. But recall that CPS is a vague figure. It was realized that the wetware constituting animal brains uses more powerful spike-based models. The relatively recent trend of the last decade was therefore to move to spiking neural networks [principles of designs for large-scale], which are much more powerful and yet allow simulating more neurons with less HW.

1 [22] justifies it only 1) for large system solutions and 2) when topological and computational model flexibility through a simple user description is provided.

Chapter 4

Spiking NNs

4.1 Theoretical background


As opposed to the 2nd-generation neural models presented above, real neurons do not encode their activation values as binary words in a computer. Rather, axons are ion channels that propagate wave packets of charge. These pulses are called action potentials. Here, the activation function acts like a leaking capacitor, which integrates the charge and fires a pulse once the sum of pulses, the membrane potential, overcomes its threshold. The length of the spikes is not taken into account.
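A minimal discrete-time sketch of such a leaky integrate-and-fire neuron; the time constant, threshold and input values are arbitrary illustrative numbers.

```python
import numpy as np

def lif_run(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Discrete-time leaky integrate-and-fire neuron.

    The membrane potential v leaks towards zero with time constant tau,
    integrates the input, and emits a spike (and resets) when it crosses
    the threshold; the spike shape/length is not modelled, only its time.
    """
    v, spikes = 0.0, []
    for t, i_in in enumerate(input_current):
        v += dt * (-v / tau + i_in)      # leak + integration
        if v >= v_thresh:
            spikes.append(t * dt)        # record spike time (pulse coding)
            v = v_reset
    return spikes

# toy usage: a constant supra-threshold input produces a regular spike train
print(lif_run(np.full(200, 0.08)))
```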
The integrate-and-fire model can operate with both rate coding and pulse coding (where the timing of pulses is taken into account). Both encodings are computationally powerful and easy to implement in computer simulation as well as in hardware VLSI systems. Yet, since every biological neuron fires no more than 3 pulses during the estimated brain reaction time of 150 ms, the timing of pulses must be accounted for to be realistic. In fact, pulse coding allows the NN to respond even faster than the spiking time of one neuron. As [20] remarks, pulse coding is very promising for tasks in which temporal information needs to be processed, which is the case for virtually all real-world tasks. Additionally, rate coding makes learning difficult.
Back-propagation is not suitable for SNNs. Spike-timing dependent synaptic plasticity (STDP), a form of competitive Hebbian learning that uses the exact spike timing information, is used instead. Synapse strengthening, named long-term potentiation (LTP), occurs if the post-synaptic neuron fires a spike within about 50 ms after the pre-synaptic one. More formally: ∆wij = α · h(∆t), where the correlation h(∆t) grows from 0 to some peak and then decays to 0, so that late spikes do not affect the weights.
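A hedged sketch of the pair-based update ∆w = α · h(∆t); the alpha-shaped window below is one illustrative choice of h that rises to a peak and decays back to zero, as described.

```python
import numpy as np

def stdp_dw(dt_post_minus_pre, alpha=0.01, tau=20.0):
    """Weight change for one pre/post spike pair, dw = alpha * h(dt).

    h(dt) rises from 0 to a peak (post firing shortly after pre -> LTP)
    and then decays back to 0, so spike pairs separated by much more than
    ~50 ms leave the weight essentially unchanged.  The alpha-shaped
    window and the constants are illustrative assumptions.
    """
    dt = np.asarray(dt_post_minus_pre, dtype=float)
    h = np.where(dt > 0, (dt / tau) * np.exp(1.0 - dt / tau), 0.0)
    return alpha * h

# post firing 20 ms after pre gives the peak change; 200 ms gives ~nothing
print(stdp_dw([20.0, 50.0, 200.0]))
```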
All the HW implementations examined make use of sparse SNN connectivity and low activity: only about 1 % of neurons fire in a given time slot.

Figure 4.1: The Action Potential and Spiking NN

4.2 Sample HW
4.2.1 Learning at the Edge of Chaos
Any internal dynamics emerges from the collective behavior of interacting neurons. This is a product of neuron coupling. As the authors of [21] mention, it is therefore necessary to study the coupling factor — the average influence of one neuron upon another.
They experiment with STDP by building a video-processing robot that learns to avoid obstacles (walls and moving objects), with the purpose of investigating the internal dynamics of the network. A Khepera robot with a linear (1-D horizon vision) camera and collision sensors is used for the experiment. The video image is averaged to 16 pixels, which are fed to 16 input neurons, processed by 40 fully recurrent hidden neurons and two output neurons that control the two motors. The weights in the fully recurrent network are initialized randomly with some variance around the center of a normal distribution.
Full black is supplied as 10 Hz spikes, full white corresponds to 100 Hz (a sketch of such rate coding follows below). On average, an input spike arrives every 100 steps; in the meantime the network keeps firing and learning. The STDP learning factor is α = ±const, depending on whether the robot moves or hits a wall, and 0 otherwise. Starting from chaotic firing, the neurons synchronize with each other and with the external world.
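A sketch of such rate coding, assuming independent (Poisson-like) spike generation per time step; the duration and time step are arbitrary assumptions.

```python
import numpy as np

def pixels_to_spikes(pixels, duration_ms=500, dt_ms=1.0,
                     f_min=10.0, f_max=100.0, seed=0):
    """Rate-code pixel intensities (0 = black .. 1 = white) as spike trains.

    Each pixel drives one input neuron whose firing rate is interpolated
    between f_min (full black) and f_max (full white); spikes are drawn
    independently per time step (Poisson-like encoding, my assumption).
    Returns a (n_pixels, n_steps) boolean spike raster.
    """
    rng = np.random.default_rng(seed)
    rates_hz = f_min + (f_max - f_min) * np.asarray(pixels, dtype=float)
    p_spike = rates_hz * dt_ms / 1000.0          # spike probability per step
    n_steps = int(duration_ms / dt_ms)
    return rng.random((len(rates_hz), n_steps)) < p_spike[:, None]

# toy usage: a 16-pixel line image, half black, half white
raster = pixels_to_spikes(np.r_[np.zeros(8), np.ones(8)])
print(raster.sum(axis=1))   # spike counts: ~5 for black, ~50 for white pixels
```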
However, fast synchronization is not always good. The network must exhibit two contradictory dynamic features: plasticity, to remain responsive, and ‘autism’, to maintain the stability of the internal dynamics, especially in a noisy environment. The authors look at the average membrane potential developing in time: m(t) = ∑ Vi(t). The coupling must be high enough to avoid ‘neural death’; then the dynamics evolves from initial chaotic firing to a synchronous mode. At weak coupling this measure is chaotic, neurons fire asynchronously and aperiodically, and the robot behaves almost randomly.

Figure 4.2: NeuroPipe-Chip on MASPINN board

Increased variance favors increased periodicity and synchrony among the neurons — the average membrane potential straightens into a line. In my opinion, there is some confusion here between the coupling ‘variance’ and ‘strength’.
The task is handled by a simple controller: a single Motorola processor at 23 MHz with 512 KB of RAM and 512 KB of ROM. Having got the idea, let us move on to more powerful designs.

4.2.2 MASPINN on NeuroPipe-Chip: A Digital Neuro-Processor


MASPINN is a fairly typical accelerator¹ of that time. It implements many concepts proposed in previous works. Based on the custom NeuroPipe chip, it simulates 10⁶ neurons with up to 50–100 connections each, provided the network activity stays below 0.5 %, to enable video processing in real time. The common concepts are:

• a spike event list that models the axon delays: it stores the spike source neurons along with the spike time². The fact that the next time slot is computed from the data of the previous one allows the network to be processed in parallel.

• a sender-oriented connectivity list (a map) that keeps the destination neurons for every source neuron along with the connection weight.

• tagging of dendrite potentials — dendrite potentials that have decayed to zero and thus have no impact on the membrane are tagged with an ‘ignore’ bit.
1 For instance, [22] is a very similar design.
2 This is like VHDL event-driven simulation, but simpler.

Number of     Alpha       SPIKE 128k     ParSpike       MASPINN
neurons       500 MHz     10 MHz FPGA    100 MHz,       100 MHz,
                                         64 DSPs        NeuroPipe-Chip
1K            0.56 ms     ≈1 ms          ≈1 ms          6.5 µs
128K          67 ms       10 ms          1 ms           0.83 ms
1M            650 ms      —              8 ms           6.5 ms

Table 4.1: Comparison of MASPINN

The chip supports programmable NN models through user code specifying the connections and how they contribute to the membrane potential. Implemented in 0.35 µm digital CMOS, running at 100 MHz and consuming 2 W, it shows a two-order-of-magnitude improvement over a 500 MHz Alpha workstation and approaches the real-time requirements for spiking ANNs. It is also curious to note the 10-fold improvement over the FPGA and the competitiveness of DSP-based designs with the custom-made chips.
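To make the event-list and sender-oriented connectivity-list concepts listed above concrete, here is a purely illustrative, software-level sketch; it mirrors the data structures, not the NeuroPipe pipeline, and all constants are assumptions.

```python
import numpy as np

def run_event_driven(conn, thresholds, initial_spikes, n_steps, decay=0.9):
    """Toy event-driven SNN update.

    conn: sender-oriented map  source -> [(target, weight), ...]
    Only the sources listed in the previous slot's event list are visited,
    so the work per slot scales with the spiking activity rather than with
    the network size; fully decayed potentials could further be skipped
    with an 'ignore' tag.
    """
    n = len(thresholds)
    v = np.zeros(n)                       # membrane (dendrite) potentials
    events = list(initial_spikes)         # spike event list for slot t
    for t in range(n_steps):
        v *= decay                        # passive decay of all potentials
        next_events = []
        for src in events:                # deliver spikes emitted in slot t
            for tgt, w in conn.get(src, []):
                v[tgt] += w
        for i in range(n):                # fire-and-reset
            if v[i] >= thresholds[i]:
                next_events.append(i)
                v[i] = 0.0
        events = next_events              # event list for slot t+1
    return events, v

# toy usage: a 0 -> 1 -> 2 chain with strong weights
conn = {0: [(1, 1.5)], 1: [(2, 1.5)]}
print(run_event_driven(conn, thresholds=np.ones(3),
                       initial_spikes=[0], n_steps=2))
```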

4.2.3 Analog VLSI for SNN


Similarities between biological neurons and continuous analogue technology are drawn in [23], highlighting that digital technology is incapable of simulating large parts of the brain in the foreseeable future. The authors propose a naturally (analogue) computing VLSI chip architecture instead. Yet, existing digital protocols are exploited for conducting the naturally discrete action potentials. Additionally, SRAM is used for weight storage instead of capacitors, to extend operation beyond a few milliseconds and to simplify weight adjustment. Moreover, a digital controller is proposed for ‘developmental changes’ of the NN connectivity.
The prototype is a 256-neuron × 128-synapse chip implemented in a 0.18 µm CMOS process. The core of the architecture is a synaptic matrix. A neuron's axon drives a triangular voltage pulse along a horizontal line. A column of synapse nodes translates it into a current driven down the column, where a neuron membrane is located and accumulates the charge. Once the threshold voltage is exceeded, the neuron's comparator fires a pulse onto its respective axon. The architecture results in very small synapses — 15 identical NMOS gates each. This is important, since the synapses dominate the NN computation.
The weights are stored in 4-bit SRAM. A digital STDP controller adjusts them. The synapses are equipped with two capacitances for the controller to store the pre- and post-synaptic events (pulse timestamps?). The correlation computation and weight updates can be done sequentially because synaptic plasticity is a slow process: it takes minutes in biology, which corresponds to tens of milliseconds in the timescale of this chip.

Figure 4.3: SNN on analog VLSI. An axon drives a row; a column of synapses feeds the neuron at the column's bottom; the neuron's spikes are converted into an axon current, which drives a row.

Meanwhile, the entire matrix can be updated in microseconds. The low network activity allows the matrix to be traversed even faster.
Analyzing the external interface needed to transmit the numbers of spiking neurons, the authors estimate 2.6 × 10⁷ spikes/s. The timing precision must be 150 ps (10 µs of biological time). Eight bits for the neuron number plus another eight bits for the time-synchronization sub-period amount to two bytes per spike, i.e. about 52 MB/s. The proposed 1.6 GB/s HyperTransport link should sustain bursts of higher spiking rates and still leaves headroom for more chips. The spikes travel off-chip from the axons. They are priority-encoded to handle simultaneous spiking of multiple neurons. A conflict inflicts a delay error of at most 2.5 ns (250 µs biological). The reverse direction is similar: an off-chip source transmits the number of the row to pulse. Besides the spike transport, monitor amplifiers allow external monitoring of four neuron membranes at a time.
Note that this implementation offers true synapse-level parallelism. It closely mimics the biological system while being as simple as possible. By operating several chips in parallel, it should be feasible to build a system of 10 000 neurons. Running 10⁵ times faster than real time, such a system allows testing many hypotheses for which the simulation time on a digital computer would be too long. Besides physical modeling at a serious time and size scale, it corresponds to the continuous, non-Turing computations of real neurons and supports the fascinating model behind the BB, ‘liquid computing’.

4.3 Maass-Markram theory: WetWare in a Liquid Computer
To give an idea of the strength of spiking networks: one spiking neuron is more powerful than a sigmoidal NN with 412 hidden units. This was recently proven by Wolfgang Maass, a mathematician and accomplice of Markram in the BB project.

From his bibliography in theoretical computer science, we see that Maass started by studying hard, symbolic automata and Turing machines, then moved to analog and neural networks, approaching Markram's brain research. Together they have developed the ‘liquid state machine’ (LSM), a kind of SNN, to be used in real-world applications. This is needed for the following reasons.
The authors point out that computer science lacks a universal model of the organization of computations in cortical microcircuits that are capable of carrying out potentially universal information processing tasks [24]. The adopted universal computers, Turing machines and attractor ANNs, are inapplicable because they process static discrete inputs, whereas the neural microcircuits carry out computations on continuous streams of inputs. Conventional computers keep their state in a number of bits and are intended for static analysis: you record the input history i and revisit it in order to compute the output at the current time t: o(t) = O(i1, i2, ..., it). This costs HW and time. In the real world, however, there is no time to wait until a computation has converged — results are needed instantly (anytime computing) or within a short time window (real-time computing). The computations in common computational models are partitioned into discrete steps, each of which requires convergence to some stable internal state, whereas the dynamics of cortical microcircuits appears to be continuously changing (the only stable state is the ‘dead state’). The biological data suggest that cortical microcircuits may process many tasks in parallel, while most NN models are incompatible with this pattern-parallelism. Finally, the components of biological neural microcircuits, neurons and synapses, are highly diverse and exhibit complex dynamical responses on several temporal scales, which makes them completely unsuitable as building blocks of computational models that require simple uniform components, such as virtually all models inspired by computer science or ANNs. These observations motivated the authors to look for an alternative organization of computations, calling this ‘a key challenge of neural modeling’. The proposed framework is not just compatible with the aforementioned constraints; it requires them.
Every new neuron adds a degree of freedom to the network, making its dynamics very complicated. The conventional approaches are, therefore, either to keep the (chaotic) high-dimensional dynamics under control or to work only with the stable (attractor) states of the system. This eliminates the inherent ability of NNs to continuously absorb information about the inputs as a function of time. This leads to an idea for explaining ‘how a continuous stream of multi-modal input from a rapidly changing environment can be processed by stereotypical recurrent integrate-and-fire neuron circuits in real time’.
The authors look at an NN as a ‘liquid’ whose dynamics (state) is ‘perturbed’ by the inputs. All the temporal aspects of the input data are digested into the high-dimensional liquid state.

Figure 4.4: The Liquid State Computing

The desired function output is ‘read out’ from the (literally) current state of the liquid by another NN. That is all: the high-dimensional dynamical system formed by the neural liquid serves as a universal source of information about past stimuli for the readout neurons, which extract the particular aspects needed for diverse tasks in real time.
Owing to the fact that

1. The liquid is fixed — its connections and synaptic weights are ran-
domly predefined; and

2. The only part that learns is the readout, which is memory-less (it
relies only on the current state of the liquid ignoring any previous
states) and can thus be as simple as 1-layer perceptron

this approach dramatically simplifies the computation, resolving the complexity problem. Furthermore, one liquid reservoir of information may serve many readouts in parallel (a minimal sketch follows below).
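A minimal echo-state-style sketch of the liquid-plus-readout idea: a fixed random recurrent network is perturbed by an input stream, and a memory-less linear readout trained on the instantaneous state recovers a temporal feature of the input. The liquid below is a leaky tanh reservoir rather than a spiking microcircuit, and all sizes and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1) A fixed random "liquid": recurrent weights are drawn once, never trained.
N = 100
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))   # internal coupling
W_in = rng.normal(0.0, 1.0, size=N)                  # input projection

def run_liquid(u, leak=0.3):
    """Drive the liquid with input stream u and return all liquid states."""
    x = np.zeros(N)
    states = []
    for u_t in u:
        x = (1 - leak) * x + leak * np.tanh(W @ x + W_in * u_t)
        states.append(x.copy())
    return np.array(states)

# 2) A memory-less readout: linear regression on the *current* state only.
u = rng.integers(0, 2, size=2000).astype(float)       # random bit stream
target = np.roll(u, 3)                                # task: the bit from 3 steps ago
X = run_liquid(u)
w_out, *_ = np.linalg.lstsq(X[10:], target[10:], rcond=None)   # skip warm-up
pred = (X[10:] @ w_out) > 0.5
print("readout accuracy:", (pred == target[10:].astype(bool)).mean())
```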
Like the Turing machine, the LSM is based on a rigorous mathematical framework that proves its universal computational power. However, unlike the inherently sequential Turing machines that process static discrete inputs off-line, LSMs are not based on stable states and algorithms, presenting the biologically more relevant case of real-time computing on continuous input streams. The analysis shows that NNs are ‘ideal’ liquids, as opposed to, say, a coffee cup. Although [25], who introduced the term ‘super-Turing’, is skeptical about this, Dr. H. Jaeger has independently discovered the ‘echo state networks’, which share the same ‘reservoir computing’ concept [26] while implementing nonlinear filters in a simple and computationally efficient way.
The liquid carries all the complexity. The readouts are made single-layer for trivial training. Such readouts are unable to solve linearly non-separable problems and are thus 1) sometimes called linear classifiers, while 2) linearly non-separable problems make good benchmarks for checking the liquid quality. As with the edge between order and chaos, the liquid can be more or less useful. It can be too stable (ordered), disregarding all the inputs, or, at the other extreme, chaotic — the current input overwrites all the memory. The optimum lies between an under-sensitive and an over-sensitive response to the inputs.

Figure 4.5: Hard Liquid. (a) A structural block. (b) The interpolated Memory Capacity for different weight distributions (the points); the three largest distributions are highlighted.
4.3.1 The ‘Hard Liquid’


The authors of the hybrid VLSI above elaborate their chip for LSM applicability [27]: the network size and the technology (analog integration of current in the synapses plus digital signaling) are retained, but McCulloch-Pitts neurons (step activation function θ) are used instead of spiking ones, and the nominal 11-bit weight storage is made capacitive. Implemented in a 0.35 µm CMOS process, the full network can be refreshed in 200 µs. The speed is I/O-limited, while the core allows for 20 times faster operation.
The network operates in a discrete-time update scheme, i.e. all the neuron outputs are calculated once per network cycle. The 256-neuron network is partitioned into four blocks: the 128 synapses of every neuron are driven by axons incoming from all four blocks and from the network inputs. The intra-block connections can be arbitrary, whereas the inter-block connections are hardwired.
Following the Maass terminology, the ASIC represents the liquid acting as a non-linear filter upon the input. The ASIC response at a certain time step is called the liquid state x(t). The reconfigurability of the ANN ASIC allows exploring the qualities of physically different liquids. The liquids are generated at random by drawing the weights from a zero-centered Gaussian distribution governed by the number of neurons N, the number of incoming connections k per neuron and the variance σ². The readouts (linear classifiers) are implemented in SW: v(t) = θ(∑ wi · xi), where the weights wi are determined by a least-squares linear regression against the desired values y(t). The resulting machine is evaluated on the linearly non-separable problem of 3-bit parity in time by two measures from theoretical informatics: 1) the mutual information MI between y and v at a given time step t, and 2) the sum of MI along the preceding time steps, which is the memory capacity MC, assessing the capability to account for the preceding inputs. A software sketch of this setup is given below.
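A hedged software sketch of the setup: a random θ-neuron liquid with k incoming Gaussian connections per neuron, a least-squares readout, and the mutual information between the target 3-bit parity y and the readout v. Whether the MI comes out high depends on where (N, k, σ) lands relative to the order/chaos boundary, which is exactly what the parameter sweep explores; all concrete numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_liquid(n_neurons=256, k=32, sigma=0.5):
    """Random binary-neuron liquid: each neuron gets k incoming connections
    with zero-centred Gaussian weights of standard deviation sigma."""
    W = np.zeros((n_neurons, n_neurons))
    for i in range(n_neurons):
        srcs = rng.choice(n_neurons, size=k, replace=False)
        W[i, srcs] = rng.normal(0.0, sigma, size=k)
    w_in = rng.normal(0.0, sigma, size=n_neurons)
    return W, w_in

def run_liquid(W, w_in, u):
    """Discrete-time update with step activation (McCulloch-Pitts neurons)."""
    x = np.zeros(W.shape[0])
    states = []
    for u_t in u:
        x = (W @ x + w_in * u_t > 0).astype(float)
        states.append(x)
    return np.array(states)

def mutual_information(y, v):
    """MI (in bits) between two binary sequences, from their joint histogram."""
    joint = np.histogram2d(y, v, bins=2)[0] / len(y)
    py, pv = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(py, pv)[nz])).sum())

# Task: 3-bit parity in time of a random input bit stream.
u = rng.integers(0, 2, size=4000).astype(float)
y = (u + np.roll(u, 1) + np.roll(u, 2)) % 2            # parity of the last 3 bits

W, w_in = make_liquid()
X = run_liquid(W, w_in, u)
w_out, *_ = np.linalg.lstsq(X[10:], y[10:], rcond=None)  # least-squares readout
v = (X[10:] @ w_out > 0.5).astype(float)                 # v(t) = theta(sum w_i x_i)
print("MI(y; v) =", round(mutual_information(y[10:], v), 3), "bit")
```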

Notably, the liquid's main quality, its ability to serve as memory storage, is measured in bits. At every iteration of the generation-parameter sweep (a dot in fig. 4.5(b)), a number of liquids were generated and readouts trained for the same function. The average MC distinctly peaks along a hyperbolic band. This band marks a sharp transition from ordered dynamics (the area below) to chaotic behavior (above). To estimate the ability to support multiple functions, multiple linear classifiers were trained on the same liquid. The mean MI shows that the critical dynamics yields a generic (readout-independent) liquid.
The experiments reproduce earlier published theoretical and simulation results showing that linear classifiers can be successful when the liquid exhibits the critical dynamics between order and chaos. The experiments with this general-purpose ANN ASIC allow exploring the connectivity and accuracy necessary for future hardware implementations. The next step planned is to use part of the ASIC area to realize the readout. Such an LSM will be able to operate in real time on continuous data streams.

4.4 The Blue Brain


The first stage is to try the novel biologically realistic simulation on BG, a fairly typical general-purpose supercomputer.

4.4.1 Blue Gene


Application-driven design approach
Financed by taxpayers under the pretext of the American nuclear program, the IBM designers of this machine claim to bridge the cost/performance gap between existing supercomputers and application-specific machines [28], making it as cheap as a cluster solution. This objective meshes nicely with the additional goals of achieving exceptional performance/power and performance/space ratios. The key enabler of the BG/L family is low-power design: the machine is made of low-frequency, low-power IBM PowerPC chips.
To meet the challenge of getting good performance out of many processors of moderate frequency, the innovations were restricted to scalability enhancements at little cost, and the options were evaluated on selected classes of representative applications. The machine is announced as the first science-dedicated computer, with DNA and protein folding simulation as its primary goal.

Usability
The networks were designed with extreme scaling in mind. They sup-
port short messages (as small as 32 bytes) and HW collective operations
(broadcast, reduction, barriers, interrupts, etc.).
Developing the machine at the ASIC level allowed integrating the reliability, availability and serviceability (RAS) functions into the single-chip nodes, so that the machine stays reliable and usable even at extreme scales. This feature is crucial, since the probability that every node works approaches zero as the number of nodes grows. In contrast, clusters typically do not possess this quality at all.
The full potential cannot be unlocked without system SW, standard libraries and performance monitoring tools. Though BG/L was designed to support both distributed-memory and message-passing programming models efficiently, the architecture is tuned for the dominant MPI interface. From the user perspective, BG/L appears as a network of up to 2¹⁶ compute nodes, but this is not an architectural limit. Every 1024 nodes are assembled into a rack occupying 0.9² × 1.9 m³ of space and consuming 27.5 kW of power.

Nodes and Networks


Every compute node is a 130 nm SoC ASIC containing two PPC440 cores. The cores share a 4 MB embedded-DRAM L3 cache and 512 MB of main memory. It is interesting that (L2 = 2 kB) < (L1 = 32 kB). For our purposes it is also worth pointing out that every core has two double-precision (64-bit) FPUs. Running at 700 MHz, the two cores jointly deliver 5.6 GFlops at peak and 77 % of that in benchmarks.
The compute nodes are interconnected through five networks, the major one being a symmetric 64 × 32 × 32 3D torus. Each node therefore has six independent bidirectional neighbor links. The signaling rate and latency of a link are 1.4 Gb/s and 100 ns respectively. Symmetry means that the links have the same bandwidth and almost the same latency regardless of the physical distance — whether the nodes are located on the same board or on a neighboring rack (a rack accommodates 85 % of the interconnections). The maximal network distance is therefore 32 + 16 + 16 = 64 hops, and the aggregate bandwidth is 2¹⁶ nodes × 2.1 GB/s per node (six links in both directions) ≈ 138 TB/s.
Aggregated into a Gigabit Ethernet network, the I/O nodes provide the interface to the external parallel file system. The number of I/O nodes is configurable, with a maximum I/O-to-compute node ratio of 1 : 8.
Two other interconnects are the collective and barrier networks. Combining the nodes into trees, they are useful for arithmetic reductions and for broadcasting the global result from the root back to the nodes.
Finally, the control system networks are the various networks such as I²C and JTAG used to initialize, monitor and control all node registers, temperature sensors, power supplies, clock trees, etc. — more than 250 000 endpoints for a 64 k machine. A 100 Mb Ethernet connects them to the host.
A partition mechanism based on link nodes gives each user a dedicated set of nodes. The same mechanism also isolates any faulty nodes (once a fault is isolated, the program restarts from the last RAS checkpoint).

4.4.2 Brain simulation on BG


EPFL has its own neurosimulation computer lab, which produces special HW, the MANTRA presented above. The BB team does not explain why they chose an inefficient general-purpose computer. Though they claim that the computer was developed with their project in mind, BG's list of design reference applications does not confirm this. Indeed, the broadcast networks and the reduced power of the nodes seem to be in line with neuro-processing and with the neural models that teach us to compute with myriads of elementary processors. However, the orientation towards Flops and the MIMD architecture are considered improper for ANNs. Yet, the diversity of biologically realistic neurons may well require the irregularity of MIMD nodes. We have to conclude that the existing neurocomputers are not flexible enough to meet the complexity of the BB model³.
There are no links to the simulation laboratory on the official site. Yet, I have encountered a certain Goodman Lab⁴, which develops an MPI-based neocortical simulator similar to the PCSIM launched by Maass, has H. Markram on its list and reports progress with Blue Gene! Here is a report issued in 2005⁵.
It describes tests of their NeoCortical Simulator (NCS) ported to a 1024-CPU BG. The neurons are connected at random (the connection probability drops with distance) and the spike activity is observed. The network size is limited by the synapse memory: 512 MB per CPU allow networks of 5 billion synapses, i.e. roughly 100 bytes per synapse, which strikingly contrasts with all the neurocomputers presented above, which keep the weights below 2 bytes. Trying up to 2500 neurons with 676 Msynapse networks, the spikes-per-CPU measure shows near-linear scalability, which drops significantly in the 1024-processor mode, probably due to the extra communication workload.
Previously, NCS was running on a Beowulf cluster, each computer of which is treated as a 2-CPU 4 GB node (the memory stores the synapses and thus bounds the network size to 10⁹ synapses⁶).
3 http://brain.cs.unr.edu/publications/gwm.largescalecortex 01.pdf and [29] explain that it is infeasible to characterize the fine-grain connectionism without large-scale modeling on coarse-grain supercomputers.
4 http://brain.cs.unr.edu/
5 http://brain.cs.unr.edu/publications/NCS BlugGene report 07Nov04.pdf
6 In the previous work (http://brain.cs.unr.edu/publications/hbfkbgk hardware 02.pdf) they admitted unlimited brain simulation.

Surprisingly, they call this ‘utilizing a very fine grain parallelism’. It is also surprising that this cluster outperforms the BG! The staff guess that, besides the 3× slower CPUs, 1) the Beowulf's Myrinet is better than the supercomputer's 3D torus; and 2) the NN distribution is optimized for the cluster. The BG profiling tools show that an order-of-magnitude SW performance improvement is possible.
Now I see that Maass is building a similar simulator. All this suggests that BB is just a branch of this project. Indeed, [30] confirms that BB runs the Goodman simulator. As of 2008, BB reports that an 8 kCPU BG has fulfilled the goal — a rat's cortical column has been recreated.

4.5 Conclusions
In the last decade, attention has switched to the more realistic spiking NNs, which are theoretically much more powerful than the conventional ones. The topology is as important to the capacity of the network as its size: the optimal quality was found at the edge between ‘order and chaos’. Exploiting the local connectivity along with the low network activity, in the form of event lists and the disabling of decayed dendrites, event-driven neurocomputers made of custom digital chips may deliver almost any simulation performance.
Yet, looking at the semiconductor roadmap, there is an enormous gap between digital performance and the requirements of simulating large parts of the brain, a gap that cannot be bridged by digital VLSI. The digital computer is based on the Turing paradigm: it is inherently sequential and repeatedly executes simple operations on some kind of data stored in memory [23]. This is fundamentally opposite to the nervous system, where continuous-mode neurons process multiple tasks in parallel in real time. The developers of reservoir computing find a flaw in the classical, stable-states-based approach. The idea is to let the inputs cause perturbations of the transient network state. In conjunction with a simple one-layer readout trained for the user's problem, this eliminates the computationally expensive learning and, by processing continuous input streams in real time, opens ANNs up to real-world problems. Inspired by a complex and ‘analog’ system, the biological nervous system, the LSM may serve as a non-Turing universal computer. The physical computer structure must match the neural model, which means it must be analogue VLSI. An optimal analog computation substrate mixed with digital technology for weight storage and signaling was presented above.

Chapter 5

Conclusions

It is the mind that is the paradoxical stone of God: it can be created but cannot be understood.

[3]

I have collected the parallelizability features of neural networks that were discovered throughout this work:
− software (ANN model) incompatibility hinders benchmarking, while C(U)PS figures are even more vague than FLOPS;
+ the connectionist models inherently possess the finest-grain parallelism (easier mapping onto HW);
+ regularity (easier mapping onto HW);
+ reduced precision (a simpler processor — lower cost and higher efficiency);
+ the inter-node communication is restricted to broadcasting activation values, effectively eliminating the read latency and the memory-coherency issue;
+ connections are local;
+ spiking activity (broadcasting) is low;
+ the ‘fault tolerance’ can make use of analog electronics, which is far more dense, fast and low-power than clocked digital logic, serving as a universal computing paradigm;
+ the fault tolerance not only enables analog computing, it is also demanded by VLSI technology with ever higher integration densities and tinier elements. NNs tolerate not only manufacturing defects but also runtime faults, which happen after training.
The ANN field is not that young compared to the era of conventional computers. Yet, these networks are still poorly understood because of their nature. Essential research is done by simulation. Because new qualities are exhibited in larger networks, the research is limited by the available computing power. Being highly parallel, ANNs are not limited by Amdahl's law and there is always demand for more parallel computing. When choosing a platform, the speed, efficiency (cost, power, space), flexibility and scalability factors are considered.
For economic reasons, it is preferable to simulate networks of small and medium size on the lately considerably enhanced workstations. The first stop on the parallel simulation track is a cluster interconnected by low-latency networks. Its performance is quite competitive with the supercomputers even before considering the highly important cost factor. But if one can be afforded, the data-parallel (SIMD) architectures, the ones with one central processor, perform better in neural processing than the control-parallel (MIMD) ones, presumably due to more optimal routing, automatic data broadcast and synchronization, and the elimination of redundant code processing. The general-purpose computers are user-friendly and easier to migrate to new HW generations. Although special HW cannot compete in flexibility with the parallel computers, it is faster, less expensive and more efficient.
The purpose of dedicated HW is to reach supercomputing power at the price of a top-range workstation. The neurocomputers are built to support wide ranges of popular ANN models. They can be built of general-purpose, custom digital or analog chips, where each step improves both speed and efficiency by roughly 2, 4 and 6 orders of magnitude respectively. The analog chips are unreliable and have limited trainability. Therefore, it is expected that hybrid technology, combining the advantages of analog computation with digital storage, will dominate in the future. For now, the general-purpose µPs have regained their popularity because of better programmability and because the very expensive custom designs cannot keep up with the massively produced universal µPs whose power grows exponentially (Moore's Law).
For this reason, DSPs that offer MAC and floating-point operations used to be the most popular building blocks. Now, however, with the advent of massively parallel FPGAs, which support these DSP operations, the reconfigurable technology is taking over. Besides covering a wider range of ANN models, the flexibility of general-purpose components makes it possible to process the interface that surrounds the NN. The embedded devices, which tend to avoid any redundancy by simulating concrete models and networks, also benefit from the reconfigurability.
In the second part we have seen that the attention of the neurocomputer field has moved to the more realistic and powerful spiking networks. Here, the local connectivity and the low network activity are exploited. The best digital performance is shown by the event-driven machines. Real-time performance is shown by the mixed technology, which uses analog computation plus digital weight storage and signaling. We have seen how the optimal spike synchronization emerges between ‘order and chaos’. Such a model runs on the BB. This project ‘opens the horizons for the neuroscience researchers’, as do all the others examined.
In this final section, allow me to express the personal impressions that emerged while glancing over the ANN subject. The most curious questions are the most fundamental ones: minds and machines, consciousness and computability.
Biologically, the authentic living creatures are the genes. In the battle for survival, they synthesize the protein machines (the body) and the nerves for adequate behavior in a complex and changing environment. A good brain must build a model of reality to avoid dangerous experiences. Consciousness arises when one puts oneself inside one's own model of reality¹. I see here 1) a self-reference recursion akin to the one we had with the Gödel sentence when trying to answer whether a machine can think, and 2) that the mind build-up procedure implies its incompleteness, so it can easily be contradictory. This history also reminds us whom the brain, this manager of the planet, serves.
An artificial implementation will have deep consequences. It does not need to grow up from a single cell, it is free of ancient rudiments and may therefore concentrate completely on the computation, it runs at electronic rather than molecular speeds, it may have unlimited size and may run forever without pause. Will it look upon its creators as animals? When Markram says: “the intelligence that is going to emerge if we succeed in doing that is going to be far more than we can even imagine”, the Singularity occurs.
Personally, I have realized two things while doing this work. First, I have realized how propaganda works. Raimond Ubar once explained to us that information is repeated in a noise-looking signal, and this is how it is separated from the background noise (as SETI does). It looks like the brain is a filter of this kind. When everybody tells you the same thing, it is recognized as important and true. This explains Dr. Goebbels' formulation, ‘Keep repeating a lie and it will become true’, and the effectiveness of the mainstream picture of reality used by the owners to manufacture consent and consumers in their own interests in democratic societies.
Secondly, I have realized that the brain is not a hardcoded algorithm²; it is continuously shaped by the environment. In particular, when one plays chess, one teaches the opponent's brain and, as a result, plays with oneself. More generally, the task is solved not by an algorithm but rather by external influence. Perhaps this delusion of mine also explains the inscrutability of the mind — there is just no algorithm to discover, it is constantly morphing. This returns us to the question, important in the field of AI, of whether a machine can simulate consciousness.
The idea that the algorithm of the mind cannot be understood can be proved in another way.
1 Dawkins, “The Selfish Gene”.
2 Influenced by [3] and [31].

Following Turing, we can suppose that somebody has comprehended his own mind. He then knows what he must do, and violates the algorithm by acting in another way.
This non-determinism of the mind might come from the quantum world, as Penrose argued. I believe that any beginner in NNs and quantum theory notices the similarity in their magic ability to traverse a huge problem space quickly. I note another one: just as quantum physics denies the trajectory of a particle, the computability theory fails to trace the train of thought. The utmost diffusion of information and computation in NNs is opposed to the unambiguity pursued in the classical (mechanical, symbolic) algorithms and data structures. Don't the following facts about the connectionists' messy ‘fuzzy logic’

• it gives miraculous qualities unavailable to a mechanical machine

• it obscures the line of the computer's reasoning

sound like another famous principle of quantum physics, the uncertainty principle: the elementary particles bizarrely interfere with themselves and entangle with others only when nobody watches them? Yet, there is evidence that quantum computers, though faster, are not more computationally powerful. A QC would not ‘see’ that the Gödel sentence is ‘true’. The human mind and the analog computer, however, have super-Turing capabilities.
It looks like algorithmic computers must be discrete to distinguish between the values. It was said that noise is inevitable in an analog circuit and that randomness provides the mind with unpredictability and creativity. However, we hardly call the simple automata that use randomness to resolve conflicts (e.g. the Ethernet backoff delay) ‘intelligent’ because of that. Yet, the idea of combining it with massively parallel computation seems very attractive.
We have seen how the biological neurons inspired the Blue Brain theoreticians to discover the liquid (transient) state machine, which enables the application of ANNs to the real world by reducing their processing demands and increasing their computational power: continuous input streams and multiple tasks can now be processed in real time. Additionally, the model's universal analog³, non-Turing, which means super-Turing, computational power is rigorously grounded mathematically. Such a fault-tolerant continuous model both requires an imperfect analog physical substrate and enables it, both at runtime and in highly dense VLSI manufacturing. We have seen a mixed-mode ‘hard’ implementation of the ‘liquid’ that offers cheap real-time synapse-parallelism. All this seems to add up to a coherent, holistic picture: a way to implement the mind in silicon. Just one doubt remains: how can a liquid whose neurons never learn be biologically more realistic than the conventional ANN?
3 To be fair, many physicists believe that space-time is ultimately discrete.

Chapter 6

Epilogue

As part of the agreement with IBM, some of Blue Gene's time will also be allotted to IBM's Zurich Research Lab, working together with scientists from EPFL's Institutes of Complex Matter Physics and Nanostructure Physics to research future semiconductor (post-CMOS) technology such as carbon nanotubes.¹ Meanwhile, BG was designed for and is used in DNA and protein folding research², which in itself offers 10 nm patterns for the new generation of integrated circuits as a replacement for 65 nm photolithography. Thom LaBean at Duke University has demonstrated the self-assembly of hundreds of trillions of building blocks [32]. As a matter of fact, new transistor technologies reshape the landscape far more substantially than any architectural solution.

Notes
1 For instance, looking for alternative brain simulations, I encountered ad.com:
The Artificial Development is privately held company, comprised of an in-
ternational multidisciplinary team of professionals who are working to intro-
duce the world’s first true AI.
CorticalDB is building advanced artificial intelligence technologies that will
reshape business operations on a global scale. The core of these technolo-
gies is CCortex, a massive neural network with breakthrough capabilities in
simulating important aspects of human intelligence, cognition and memory.
CCortex accurately models the billions of neurons and trillions of connections
in the human brain with a layered distribution of spiking neural nets running
on a high-performance supercomputer. This Linux cluster is one of the 20
fastest computers in the world with 500 nodes, 1,000 processors, 1 terabyte of
RAM, 200 terabytes of storage, and a theoretical peak performance of 4,800
Gflops.
It achieves this simulation by dynamically employing the vast amounts of
neurological data derived from the CorticalMap and NanoAtlas projects.
1 http://www.physorg.com/news4402.html
2 http://folding.stanford.edu/FAQ-diseases.html

The CorticalMap is a comprehensive database of neurological structures that
represents multiple levels of the brain. It results from extensive neuroscience
literature data mining and contains billions of neurons with trillions of con-
nections, including data on neuron cell types, morphology, connectivity, chem-
istry, physiology and functionality. The NanoAtlas is a 100-nm resolution
digital atlas of entire human brain that is built using innovative whole brain
imaging and modeling techniques in the histology and genetics lab of the
company.

This is so doubtful that I cannot leave it without a comment:

1. The topical sentence about ‘reshaping business operations on a global scale’ sounds like dot-com nonsense.
2. Secondly, they claim that they already simulate human intelligence, while the words of the BB founders suggest that AI is yet to be raised.
3. Thirdly, they have built one of the world's fastest supercomputers and keep it secret from the world ratings, not even disclosing its architecture. The ANN models used are also kept silent.
4. I do not understand how the intended uses of CCortex™ (pattern recognition, audio-visual computer interfaces, knowledge processing and intelligent decision making) mesh with the company's goal ‘to deliver a wide spectrum of commercial products’ and ‘improve business relations’. Would you buy a huge supercomputer to make your computer interface easier?
5. The ambitiousness of the project is questionable if we look at the degrees of the technical staff. All this looks like a dot-com project, even without taking into account that the project leader's familiarity with the connectivity paradigm comes from providing Internet services.
6. The most complete database on the brain should be widely known, especially since the company plans to promote it in the business and scientific communities. Yet, many years later, there seems to be no interest according to Google.
The advent of AI is constantly postponed by its followers.

Bibliography

[1] Otis Port. Blue brain: Illuminating the mind. BusinessWeek Online,
June 2005. [http://www.businessweek.com/technology/content/
jun2005/tc2005066_6414_tc024.htm].

[2] Richard Dawkins. The improbability of god. Free Inquiry maga-


zine, 18(3), 1998. [http://www.secularhumanism.org/library/fi/
dawkins_18_3.html].

[3] Alexander Sevenov. Neokonchennaya pyesa dlya ochen odinokogo avtomata (An unfinished play for a very lonely automaton). 1998. [http://sudy_zhenja.tripod.com/russian/NP_Akt1.html].

[4] Roger Penrose. The Emperor’s New Mind. 1989.

[5] Anil K. Jain, Jianchang Mao, and K. M. Mohiuddin. Artificial neural


networks: A tutorial. IEEE Computer, 29(3):31–44, 1996.

[6] M. L. Minsky and S. A. Papert. Perceptrons. M.I.T. Press, 1969.

[7] Paolo Ienne (EPFL). Architectures for neuro-computers: Review and performance evaluation, 1993.

[8] Nikola B. Šerbedžija. Simulating artificial neural networks on parallel


architectures. Computer, 29(3), 1996.

[9] Y. Boniface, F. Alexandre, and S. Vialle. Artificial neural networks on massively parallel computer hardware. In International Joint Conference on Neural Networks, volume 4, pages 2441–2446, 1999.

[10] Jan N. H. Heemskerk and Jacob M. J. Murre. Brain-size neurocomputers: Analyses and simulations of neural topologies on fractal architectures, 1996.

[11] Jan N. H. Heemskerk. Overview of neural hardware. IBM Journal of Research and Development, 1995. [ftp://ftp.cs.cuhk.hk/pub/neuro/papers/neurhard.ps.gz].

[12] Udo Seiffert. Artificial neural networks on massively parallel com-
puter hardware. In European Symposium on Artificial Neural Networks,
pages 319–330, April 2002.
[13] X. Zhang, M. McKenna, J. P. Mesirov, and D. L. Waltz. The backpropagation algorithm on grid and hypercube architectures. Parallel Computing, 14(3):317–327, 1990.

[14] A. Zell, N. Mache, T. Sommer, and T. Korb. Recent developments of the SNNS neural network simulator. In Proc. Applications of Neural Networks Conference, SPIE, pages 708–719, 1991.

[15] C. R. Rosenberg and G. Blelloch. An implementation of network learning on the CM. In Proc. Joint Conference on Artificial Intelligence, pages 329–340, Milan, Italy, 1987.

[16] D. A. Pomerleau, G. L. Gusciora, D. S. Touretzky, and H. T. Kung. Neural network simulations at warp speed: How we got 17 million connections per second. In Proc. IEEE ICNN, pages 134–150, San Diego, July 1988.

[17] J.-H. Chung, H. Yoon, and S. R. Naeng. A systolic array exploiting the inherent parallelism of artificial neural networks. Microprocessing and Microprogramming, 33(3):145–159, 1991/92.

[18] R. Straub, D. Schwarz, and E. Schöneburg. Simulation of backpropagation networks on transputers. Neurocomputing, 2(5&6):199–208, 1991.
[19] Jihan Zhu and Peter Sutton. FPGA implementations of neural networks: a survey of a decade of progress. In Field-Programmable Logic and Applications, pages 1062–1066. Springer Berlin, September 2003.
[20] Jilles Vreeken. Spiking neural networks, an introduction. Technical
report, 2002.
[21] H. Soula, A. Alwan, and G. Beslon. Learning at the edge of chaos: temporal coupling of spiking neuron controllers for autonomous robotics. In AAAI Spring Symposium on Developmental Robotics, page 6, 2005.

[22] Nasser Mehrtash, Dietmar Jung, Heik Heinrich Hellmich, Tim Schoenauer, Vi Thanh Lu, and Heinrich Klar. Synaptic plasticity in spiking neural networks (SP²INN): A system approach. [http://mikro.ee.tu-berlin.de/spinn/pdf/ieee03.pdf].

[23] J. Schemmel, K. Meier, and E. Mueller. A new VLSI model of neural microcircuits including spike time dependent plasticity. In IEEE International Joint Conference on Neural Networks, volume 3, pages 1711–1716, July 2004.

[24] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.

[25] Hava Siegelmann. Neural Networks and Analog Computation: Beyond the
Turing Limit. Birkhauser Boston, 1998.

[26] X. Dutoit, H. Van Brussel, and M. Nuttin. A first attempt of reservoir pruning for classification problems. In Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN), pages 507–512, Bruges, Belgium, 2007.

[27] F. Schürmann, K. Meier, and J. Schemmel. Edge of chaos computation in mixed-mode VLSI: “a hard liquid”. In Proc. of NIPS, 2004.

[28] A. Gara et al. Overview of the blue gene/l system architecture. IBM
Journal of Research and Development, 49(2/3):195–213, Mar-May 2005.
[http://www.research.ibm.com/journal/rd/492/gara.html].

[29] Wolfgang Maass and Christopher M. Bishop, editors. Pulsed Neural Networks. MIT Press, 2001.

[30] R. Brette, P. H. Goodman, A. Morrison, et al. Simulation of networks of spiking neurons: A review of tools and strategies. Journal of Computational Neuroscience, 23(3), July 2007.

[31] Peter Wegner and Dina Goldin. Computation beyond Turing machines. Communications of the ACM, pages 100–102, April 2003.

[32] R. Colin Johnson. Nanocircuits self-build from DNA. EE Times, February 2006. [http://www.eetimes.com/showArticle.jhtml?articleID=175800069].

