
Memristor-Based Multilayer Neural Networks with Online Gradient Descent Training

Daniel Soudry, Dotan Di Castro, Asaf Gal, Avinoam Kolodny, and Shahar Kvatinsky

Abstract - Learning in Multilayer Neural Networks (MNNs) relies on continuous updating of large matrices of synaptic weights by local rules. Such locality can be exploited for massive parallelism when implementing MNNs in hardware. However, these update rules require a multiply and accumulate operation for each synaptic weight, which is challenging to implement compactly using CMOS. In this paper, a method for performing these update operations simultaneously ("incremental outer products") using memristor-based arrays is proposed. The method is based on the fact that, approximately, given a voltage pulse, the conductivity of a memristor will increment proportionally to the pulse duration multiplied by the pulse magnitude, if the increment is sufficiently small. The proposed method uses a synaptic circuit composed of a small number of components per synapse: one memristor and two CMOS transistors. This circuit is expected to consume between 2% and 8% of the area and static power of previous CMOS-only hardware alternatives. Such a circuit can compactly implement scalable training algorithms based on online gradient descent (e.g., Backpropagation). The utility and robustness of the proposed memristor-based circuit are demonstrated on standard supervised learning tasks.

Index Terms: Stochastic gradient descent, Backpropagation, Synapse, Hardware, Multilayer neural networks.

I. INTRODUCTION

Multilayer Neural Networks (MNNs) have recently been incorporated into numerous commercial products and services such as mobile devices and cloud computing. For realistic large scale learning tasks, MNNs can perform impressively well and produce state-of-the-art results when massive computational power is available (e.g., [1–3], and see [4] for related press). However, such computational intensity limits their usability, due to area and power requirements. New dedicated hardware design approaches must therefore be developed to overcome these limitations. In fact, it was recently suggested that such new types of specialized hardware are essential for real progress towards building intelligent machines [5].

MNNs utilize large matrices of values termed synaptic weights. These matrices are continuously updated during the operation of the system, and are constantly in use. The locality of computation in MNNs mainly stems from the learning rules used for updating the weights. These rules are usually local, in the sense that they depend only on information available at the site of the synapse. A canonical example of a local learning rule is the Backpropagation algorithm, an efficient implementation of online gradient descent, which is commonly used to train MNNs [6]. The locality of Backpropagation stems from the chain rule used to calculate the gradients. Similar locality appears in many other learning rules used to train neural networks and in various Machine Learning (ML) algorithms.

D. Soudry is with the Department of Statistics, Center for Theoretical Neuroscience, Columbia University, New York, NY 10027, USA (email: daniel.soudry@gmail.com). D. Di Castro, A. Gal, A. Kolodny, and S. Kvatinsky are with the Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel.

Implementing ML algorithms such as Backpropagation on conventional general-purpose digital hardware (i.e., von Neumann architecture) is highly inefficient. A prime reason for this is the physical separation between the memory arrays used to store the values of the synaptic weights and the arithmetic module used to compute the update rules. A general-purpose architecture effectively eliminates the main advantage of these learning rules: their locality. This locality allows highly efficient parallel computation, as demonstrated by biological brains.

To overcome the inefficiency of general-purpose hardware, numerous dedicated hardware designs based on CMOS technology have been proposed in the past two decades [7, and references therein]. These designs perform online learning tasks in MNNs using massively parallel synaptic arrays, where each synapse stores a synaptic weight and updates it locally. However, so far, these designs are not commonly used for practical large scale applications, and it is not clear whether they could be scaled up, since each synapse requires too much power and area (see section VII). This issue of scalability possibly casts doubt on the entire field [8].

Recently, it has been suggested [9, 10] that scalable hardware implementations of neural networks may become possible if a novel device, the memristor [11–13], is used. A memristor is a resistor with a varying, history-dependent resistance. It is a passive analog device with activation-dependent dynamics, which makes it ideal for registering and updating synaptic weights. Furthermore, its relatively small size enables integration of memory with the computing circuit [14] and allows a compact and efficient architecture for learning algorithms. Specifically, it was recently shown [15] that memristors could be used to implement MNNs trained by Backpropagation in a chip-in-the-loop setting (where the weight update calculation is performed using a host computer). However, it remained an open question whether the learning itself, which is a major computational bottleneck, could also be done online (completely implemented in hardware) with massively parallel memristor arrays.

To date, no circuit has been suggested to utilize memristors for implementing scalable learning systems. Most previous implementations of learning rules using memristor arrays have been limited to spiking neurons and focused on Spike Timing Dependent Plasticity (STDP) [16–23]. Applications of STDP usually aim to explain biological neuroscience results. At this point, however, it is not clear how useful STDP is algorithmically. For example, the convergence of STDP-based learning is not guaranteed for general inputs [24]. Other learning systems [25–27] that rely on memristor arrays are limited to single layers with a few inputs per neuron. To the best of the authors' knowledge, the learning rules used in the above memristor-based designs are not yet competitive and have not been used for large scale problems - which is precisely where dedicated hardware is required. In contrast, online gradient descent is considered to be very effective in large scale problems [28]. Also, there are guarantees that this optimization procedure converges to some local minimum. Finally, it can achieve state-of-the-art results when executed with massive computational power (e.g., [3, 29]).

The main challenge for general synaptic array circuit design arises from the nature of learning rules: practically all of them contain a multiplicative term [30], which is hard to implement in compact and scalable hardware. In this paper, a novel and general scheme to design hardware for online gradient descent learning rules is presented. The proposed scheme uses a memristor as a memory element to store the weight, and temporal encoding as a mechanism to perform the multiplication operation. The proposed design uses a single memristor and two CMOS transistors per synapse, and therefore requires 2% to 8% of the area and static power of previously proposed CMOS-only circuits. The functionality of a circuit utilizing the memristive synapse array is demonstrated numerically on standard supervised learning tasks. On all datasets the circuit performs as well as the software algorithm. Introducing noise levels of about 10% and parameter variability of about 30% barely degrades the performance of the circuit, due to the inherent robustness of online gradient descent. The proposed design may therefore allow the use of specialized hardware for MNNs, as well as other ML algorithms, rather than the currently used general-purpose architecture.

The remainder of the paper is organized as follows. In section II, basic background on memristors and online gradient descent learning is given. In section III, the proposed circuit for efficient hardware implementation of Single-layer Neural Networks (SNNs) is described. In section IV, a modification of the proposed circuit is used to implement a general MNN. In section V, the sources of noise and variation in the circuit are estimated. In section VI, the circuit operation and learning capabilities are evaluated numerically, demonstrating that it can be used to implement MNNs trainable with online gradient descent. In section VII, the novelty of the proposed circuit is discussed, as well as possible generalizations; and in section VIII, the paper is summarized. The supplementary material [31] includes detailed circuit schematics, code, and an appendix with additional technicalities.

II. PRELIMINARIES

For convenience, basic background information on memristors and online gradient descent learning is given in this section. For simplicity, the second part is focused on a simple example: the Adaline algorithm, a linear SNN trained using the Mean Square Error (MSE).

A. The memristor

A memristor [11, 12], originally proposed by Chua in 1971 as the missing fourth fundamental circuit element, is a passive electrical element. Memristors are basically resistors with varying resistance, which changes according to the time integral of the current through the device, or alternatively, of the voltage across the device. In the classical representation, the conductance of a memristor G depends directly on the integral over time of the voltage across the device, sometimes referred to as the flux. Formally, a memristor obeys the following equations:

  i(t) = G(s(t)) v(t),   (1)
  \dot{s}(t) = v(t).   (2)

A generalization of this model, called a memristive system [32], was proposed in 1976 by Chua and Kang. In memristive devices, s is a general state variable, rather than an integral of the voltage. In the following sections it is assumed that the variations in the value of s(t) are restricted to be small, so that G(s(t)) can be linearized around some point, and the conductivity of the memristor is given, to first order, by

  G(s(t)) = \bar{g} + \hat{g} s(t),   (3)

where \bar{g} and \hat{g} are constants.

B. Online gradient descent learning

The field of Machine Learning (ML) is dedicated to the construction and study of systems that can learn from data. For example, consider the following supervised learning task. Assume a learning system that operates on K discrete presentations of inputs (trials), indexed by k = 1, 2, ..., K. For brevity, the indexing of the iteration number is sometimes suppressed where it is clear from the context. On each trial k, the system receives empirical data, a pair of two real column vectors of sizes M and N: a pattern x^{(k)} \in R^M and a desired label d^{(k)} \in R^N, with all pairs sharing the same desired relation d^{(k)} = f(x^{(k)}). The objective of the system is to estimate (learn) the function f(\cdot) using the empirical data.

As a simple example, suppose W is a tunable N x M matrix of parameters, and consider the estimator

  r^{(k)} = W^{(k)} x^{(k)},   (4)

or

  r_n^{(k)} = \sum_m W_{nm}^{(k)} x_m^{(k)}.   (5)

The system should aim to predict the correct desired labels d = f(x) for new unseen patterns x. To solve this problem, W is tuned to minimize some measure of error between the estimated and desired labels over a K_0-long subset of the empirical data, called the training set (for which k = 1, ..., K_0). For example, if we define the error vector

  y^{(k)} \triangleq d^{(k)} - r^{(k)},   (6)

then a commonly used error measure is the

  MSE \triangleq \sum_{k=1}^{K_0} \| y^{(k)} \|^2,   (7)

and the performance of the resulting estimator is then tested over a different subset, called the test set (k = K_0 + 1, ..., K).

As explained in the introduction, a reasonable iterative algorithm for minimizing objective (7) (i.e., updating W, where initially W is arbitrarily chosen) is the following online gradient descent (also called stochastic gradient descent) iteration:

  W^{(k)} = W^{(k-1)} - \frac{1}{2} \eta \nabla_W \| y^{(k)} \|^2,   (8)

where the 1/2 coefficient is written for mathematical convenience, \eta is the learning rate, a (usually small) positive constant, and at each iteration k a single empirical sample is chosen randomly and presented at the input of the system. The gradient can be calculated by differentiation using the chain rule, (6) and (4). Defining \Delta W^{(k)} \triangleq W^{(k)} - W^{(k-1)}, and (\cdot)^\top to be the transpose operation, we obtain the outer product

  \Delta W^{(k)} = \eta y^{(k)} (x^{(k)})^\top,   (9)

or

  W_{nm}^{(k)} = W_{nm}^{(k-1)} + \eta x_m^{(k)} y_n^{(k)}.   (10)
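The iteration (4), (6), (10) can be sketched in a few lines of NumPy. The dimensions, learning rate, and synthetic linear data below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K0, eta = 4, 3, 2000, 0.05
D = rng.standard_normal((N, M))      # unknown target mapping, f(x) = D x
W = np.zeros((N, M))                 # initial weights, arbitrarily chosen

for k in range(K0):
    x = rng.uniform(-1.0, 1.0, M)    # pattern presented on trial k
    d = D @ x                        # desired label
    y = d - W @ x                    # error vector, eq. (6)
    W += eta * np.outer(y, x)        # local outer-product update, eq. (10)

print(np.abs(W - D).max())           # W converges toward D on this noise-free data
```

Note that the update touches W_nm using only x_m and y_n, which is exactly the locality the hardware design exploits.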

This is the Adaline algorithm [33], used in adaptive signal processing and control [34]. The parameters of more complicated estimators can also be similarly tuned (trained), using online gradient descent or similar methods. Specifically, MNNs (section IV) are commonly trained using Backpropagation, which is an efficient form of online gradient descent [6]. Importantly, note that the update rule in (10) is local, i.e., the change in the synaptic weight W_{nm}^{(k)} depends only on the related components of the input, x_m^{(k)}, and of the error, y_n^{(k)}. This local update, which appears ubiquitously in neural network training (e.g., Backpropagation, and the Perceptron learning rule [35]) and in other ML algorithms (e.g., [36–38]), enables a massively parallel hardware design, as explained in section III.

Such massively parallel designs are needed since, for large N and M, learning systems usually become computationally prohibitive in both time and memory. For example, in the simple Adaline algorithm the main computational burden in each iteration comes from (5) and (10), where the number of operations (additions and multiplications) is of order O(MN). Commonly, these steps have become the main computational bottleneck in executing MNNs (and related ML algorithms) in software. Other algorithmic steps, such as (6) here, are usually linear in either M or N, and therefore have negligible computational complexity.
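Counting the multiply-accumulate operations per trial makes the O(MN) bottleneck explicit. The tally below is a rough sketch (one multiplication and one addition per weight for each of (5) and (10)); the dimensions are illustrative.

```python
def adaline_trial_ops(M, N):
    """Rough count of arithmetic operations in one Adaline trial."""
    read_ops = 2 * M * N     # eq. (5): N dot products of length M (mul + add)
    update_ops = 2 * M * N   # eq. (10): one mul and one add per weight
    error_ops = N            # eq. (6): one subtraction per output, O(N) only
    return read_ops + update_ops + error_ops

print(adaline_trial_ops(1000, 1000))  # 4001000: dominated by the O(MN) terms
```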

Fig. 1. Synaptic grid circuit executing (4) and (9), which are the main computational bottlenecks in the algorithm. The column input interfaces drive the inputs, and the row input and output interfaces drive the enable lines and read the results.

III. CIRCUIT DESIGN

Next, dedicated analog hardware for implementing online gradient descent learning is described. For simplicity, this section is focused on simple linear SNNs trained using Adaline, as described in section II. Later, in section IV, the circuit is modified to implement general MNNs, trained using Backpropagation. The derivations are done rigorously, using a single controlled approximation (13) and no heuristics (which can damage performance), thus creating a precisely defined mapping between a mathematical learning system and a hardware learning system. Readers who do not wish to go through the detailed derivations can get the general idea from the following overview (section A), together with Figs. 1-4.

A. Circuit overview

To implement these learning systems, a grid of artificial synapses is constructed, where each synapse stores a single synaptic weight W_nm. The grid is a large N x M array of synapses, where the synapses operate simultaneously, each performing a simple local operation. This synaptic grid circuit carries the main computational load of ML algorithms by implementing the two computational bottlenecks, (5) and (10), in a massively parallel way. The matrix-vector product in (5) is done using a resistive grid (of memristors), implementing multiplication through Ohm's law and addition through current summation. The vector-vector outer product in (10) is done using the fact that, given a voltage pulse, the conductivity of a memristor increments proportionally to the pulse duration multiplied by the pulse magnitude. Using this method, multiplication requires only two transistors per synapse. Thus, together with a negligible O(M + N) number of additional operations, these arrays can be used to construct efficient learning systems.
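The resistive-grid half of this scheme can be checked directly: with memristor conductances G_nm and column voltages v_m encoding the input, Ohm's law gives the per-device currents and the grounded row wires sum them, which is exactly the matrix-vector product of (5). The conductance and voltage values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 3, 5
G = rng.uniform(1e-5, 1e-4, (N, M))  # memristor conductances (siemens)
v = rng.uniform(-0.1, 0.1, M)        # column voltages encoding the input x

I = G * v                            # Ohm's law at every grid point (N x M currents)
row_currents = I.sum(axis=1)         # grounded row wires sum the currents: eq. (5)

assert np.allclose(row_currents, G @ v)
```

The summation costs nothing extra in hardware: Kirchhoff's current law performs it on the wire itself.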

Fig. 2. The synaptic grid circuit. Every (n, m) node in the grid is a memristor-based synapse that receives voltage input from the shared u_m, \bar{u}_m, and e_n lines and outputs a current I_nm on the o_n line. These output lines receive the total current arriving from all the synapses on the n-th row, and are grounded.

Similarly to the Adaline algorithm described in section II, the circuit operates on discrete presentations of inputs (trials); see Fig. 1. On each trial k, the circuit receives an input vector x^{(k)} \in [-A, A]^M and an error vector y^{(k)} \in [-A, A]^N (where A is a bound on both the input and the error) and produces a result output vector r^{(k)} \in R^N, which depends on the input through relation (4), where the matrix W^{(k)} \in R^{N x M}, called the synaptic weight matrix, is stored in the system. Additionally, on each step the circuit updates W^{(k)} according to (9). This circuit can be used to implement ML algorithms. For example, as depicted in Fig. 1, the simple Adaline algorithm can be implemented using the circuit, with training enabled on trials k = 1, ..., K_0. The implementation of the Backpropagation algorithm is shown in section IV.

B. The circuit architecture

B.1 The synaptic grid

The synaptic grid system described in Fig. 1 is implemented by the circuit shown in Fig. 2, where the components of all vectors are shown as individual signals. Each gray cell in Fig. 2 is a synaptic circuit (artificial synapse) using a memristor (described in Fig. 3a). The synapses are arranged in a two-dimensional N x M grid array as shown in Fig. 2, where each synapse is indexed by (n, m), with m \in {1, ..., M} and n \in {1, ..., N}. Each (n, m) synapse receives two inputs u_m, \bar{u}_m and an enable signal e_n, and produces an output current I_nm. Each column of synapses in the array (the m-th column) shares the two vertical input lines u_m and \bar{u}_m, both connected to a column input interface. The voltage signals u_m and \bar{u}_m (\forall m) are generated by the column input interfaces from the components of the input signal x, upon presentation. Each row of synapses in the array (the n-th row) shares the horizontal enable line e_n and output line o_n, where e_n is connected to a row input interface and o_n is connected to a row output interface. The voltage (pulse) signal on the enable line e_n (\forall n) is generated by the row input interfaces from the error signal y, upon presentation. The row output interfaces keep the o lines grounded (V = 0) and convert the total current from all the synapses in the row going to ground, \sum_m I_nm, into the output signal r.

B.2 The artificial synapse

The proposed memristive synapse is composed of a single memristor connected to a shared terminal of two MOSFET transistors (P-type and N-type), as shown schematically in Fig. 3a (without the n, m indices). These terminals act as drain or source, interchangeably, depending on the input, similarly to the CMOS transistors in transmission gates. Recall that the memristor dynamics are given by (1)-(3), with s(t) being the state variable of the memristor and G(s(t)) = \bar{g} + \hat{g} s(t) its conductivity. Also, the current of the N-type transistor in the linear region is, ideally,

  I = K ( (V_GS - V_T) V_DS - \frac{1}{2} V_DS^2 ),   (11)

where V_GS is the gate-source voltage, V_T is the threshold voltage, V_DS is the drain-source voltage, and K is the conduction parameter of the transistor. The current equation for the P-type transistor is symmetrical to (11), and for simplicity we assume that K and |V_T| are equal for both transistors.

The synapse receives three voltage input signals: u and \bar{u} = -u are connected, respectively, to a terminal of the N-type and P-type transistors, and an enable signal e is connected to the gate of both transistors. The enable signal can have a value of 0, V_DD, or -V_DD (with V_DD > |V_T|) and has a pulse shape of varying duration, as explained below. The output of the synapse is a current I to the grounded line o. The magnitude of the input signal u(t) and the circuit parameters are set so that they fulfill the following two conditions:

1. We assume (somewhat unusually) that

  |u(t)| < |V_T|,   (12)

so that when e = 0, both transistors are non-conducting (in the cut-off region).

2. We assume that

  K (V_DD - |u(t)| - |V_T|) \gg G(s(t)),   (13)

so that when e = \pm V_DD, both transistors have relatively high conductivity as compared to the conductivity of the memristor.

To satisfy (12) and (13), a proper value of u(t) is chosen. Note that (13) is a reasonable assumption, as shown in [39]. If not (e.g., if the memristor conductivity is very high), one can instead use an alternative design, as described in appendix B [31]. Under these assumptions, when e = 0, then I = 0 at the output and the voltage across the memristor is zero. In this case, the state variable does not change. If e = V_DD, from (13), the voltage on the memristor is approximately u; if e = -V_DD, it is approximately \bar{u} = -u. When e = V_DD the N-type transistor is conducting in the linear region while the P-type transistor is cut off; when e = -V_DD the P-type transistor is conducting in the linear region while the N-type transistor is cut off.
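A sketch of this three-state behavior, under conditions (12)-(13): with e = 0 the memristor floats (zero current), and with e = +/-V_DD the conducting transistor passes u or \bar{u} = -u almost unattenuated to the memristor. The enabled transistor is modeled here as an ideal switch, an idealization justified by (13); the numeric values are illustrative.

```python
def synapse_current(e, u, G, V_DD=1.8):
    """Idealized memristive synapse of Fig. 3a.

    e: enable voltage in {0, +V_DD, -V_DD}; u: input voltage (|u| < |V_T|);
    G: current memristor conductance. Under (13), the enabled transistor is
    treated as a perfect switch, so the memristor sees ~u or ~(-u)."""
    if e == 0:
        return 0.0                    # both transistors cut off: non-conducting
    v_mem = u if e == V_DD else -u    # N-type passes u, P-type passes u_bar = -u
    return G * v_mem                  # current into the grounded output line

G = 1e-4
assert synapse_current(0, 0.05, G) == 0.0
assert synapse_current(1.8, 0.05, G) == G * 0.05
assert synapse_current(-1.8, 0.05, G) == -G * 0.05
```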

C. Circuit operation

The operation of the circuit in each trial (a single presentation of a specific input) is composed of two phases. First, in the computing phase ("read"), the output currents from all the synapses are summed and adjusted to produce the arithmetic operation r = Wx from (4). Second, in the updating phase ("write"), the synaptic weights are incremented according to the update rule \Delta W = \eta y x^\top from (9). In the proposed design, for each synapse, the synaptic weight W_nm is stored using s_nm, the memristor state variable of the (n, m) synapse. The parallel read and write operations are achieved by applying simultaneous voltage signals on the inputs u_m and enable signals e_n (\forall n, m). The signals and their effect on the state variable are shown in Fig. 3b.

C.1 Computation phase (read)

During each read phase, a vector x is given and encoded component-wise in u and \bar{u} by the column input interfaces, for a duration T_rd: \forall m, u_m(t) = a x_m = -\bar{u}_m(t), where a is a positive constant converting x_m, a unit-less number, to voltage. Recall that A is the maximal value of |x_m|, so we require aA < |V_T|, as demanded by (12). Additionally, the row input interfaces produce a voltage signal on the e_n lines, \forall n:

  e_n(t) = V_DD   if 0 \le t < 0.5 T_rd,
  e_n(t) = -V_DD  if 0.5 T_rd \le t \le T_rd.   (14)

The total change in the state variable during the read phase is therefore, \forall n, m:

  \Delta s_nm = \int_0^{0.5 T_rd} (a x_m) dt + \int_{0.5 T_rd}^{T_rd} (-a x_m) dt = 0.   (15)

The zero net change in the value of s_nm between times 0 and T_rd implements a non-destructive read, as common in many memory devices. To minimize inaccuracies due to changes in the conductance of the memristor during the read phase, the row output interface samples the output current at time zero. Using (3), the output current of the synapse to the o_n line at that time is thus

  I_nm = a (\bar{g} + \hat{g} s_nm) x_m.   (16)

The total current on each output line o_n equals the sum of the individual currents produced by the synapses driving that line, i.e.,

  o_n = \sum_m I_nm = a \sum_m (\bar{g} + \hat{g} s_nm) x_m.   (17)

The row output interface measures this total current, converts it to a unit-less number r_n, and outputs

  r_n = c (o_n - o_ref),   (18)

where c is a positive constant converting current to a unit-less number, and

  o_ref = a \bar{g} \sum_m x_m.   (19)

Defining

  W_nm = a c \hat{g} s_nm,   (20)

we obtain

  r = W x,   (21)

as desired.

C.2 Update phase (write)

Fig. 3. (a) The memristive synapse (without the n, m indices). The synapse receives input voltages u and \bar{u} = -u and an enable signal e, and outputs a current I. (b) The read and write protocols: the incoming signals in a single synapse and the increments in the synaptic weight s, as determined by (24). T = T_wr + T_rd.

During each write phase, of duration T_wr, the inputs u_m and \bar{u}_m maintain their values from the read phase, while the signal e changes. In this phase the row input interfaces encode e component-wise, \forall n:

  e_n(t) = sign(y_n) V_DD   if T_rd \le t \le T_rd + b |y_n|,
  e_n(t) = 0                if T_rd + b |y_n| < t \le T_rd + T_wr.   (22)

The interpretation of (22) is that e_n is a pulse with magnitude V_DD, the same sign as y_n, and duration b |y_n| (where b is a constant converting y_n, a unit-less number, to time units). Recall that A is the maximal value of |y_n|, so we require T_wr > bA. The total change in the internal state is therefore

  \Delta s_nm = \int_{T_rd}^{T_rd + b |y_n|} sign(y_n) (a x_m) dt = a b x_m y_n.   (24)

Using (20), the desired update rule (9) for the synaptic weights is therefore obtained:

  \Delta W = \eta y x^\top,   (25)

where \eta = a^2 b c \hat{g}.
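The read and write phases can be verified end-to-end in NumPy: the read output reproduces r = Wx with W = a c \hat{g} s, and the write pulses produce \Delta W = \eta y x^\top with \eta = a^2 b c \hat{g}. All constants below (a, b, c, \bar{g}, \hat{g}) are illustrative values chosen for the sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 3, 4
a, b, c = 0.1, 1e-6, 1e3          # voltage, time, and current scaling constants
g_bar, g_hat = 1e-4, 1e-4         # linearized memristor conductance, eq. (3)
s = rng.uniform(-1, 1, (N, M))    # memristor state variables
x = rng.uniform(-1, 1, M)         # input vector
y = rng.uniform(-1, 1, N)         # error vector

# Read phase: row currents sampled at t = 0, eqs. (16)-(19)
o = a * ((g_bar + g_hat * s) @ x)        # total row currents, eq. (17)
o_ref = a * g_bar * x.sum()              # reference cancels the g_bar offset, eq. (19)
r = c * (o - o_ref)                      # eq. (18)

W = a * c * g_hat * s                    # weight encoding, eq. (20)
assert np.allclose(r, W @ x)             # eq. (21): the read implements r = Wx

# Write phase: enable pulse of sign(y_n) and duration b|y_n| against input a*x_m
ds = np.outer(a * b * y, x)              # delta s_nm = a*b*x_m*y_n, eq. (24)
dW = a * c * g_hat * ds
eta = a**2 * b * c * g_hat
assert np.allclose(dW, eta * np.outer(y, x))  # eq. (25): outer-product update
```

The differential reference o_ref is what allows negative effective weights even though each conductance is physically constrained.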

IV. A MULTILAYER NEURAL NETWORK CIRCUIT

So far, the circuit operation was exemplified on an SNN with the simple Adaline algorithm. In this section it is explained how, with a few adjustments, the proposed circuit can be used to implement Backpropagation on a general MNN.

Recall the context of the supervised learning setup detailed in section II. Consider an estimator of the form

  r = W_2 \sigma(W_1 x),

with W_1 and W_2 being the two parameter matrices, and \sigma(\cdot) a sigmoid function applied component-wise. Such a double-layer MNN can approximate any target function with arbitrary precision [40]. Denote by r_1 \triangleq W_1 x the output of the first layer and by r_2 \triangleq W_2 \sigma(r_1) the output of the second layer. Suppose we again use MSE to quantify the error between d and r. In the Backpropagation algorithm, each update of the parameter matrices is given by an online gradient descent step, from (8):

  \Delta W_1 = \eta y_1 x_1^\top ;  \Delta W_2 = \eta y_2 x_2^\top,

with x_2 \triangleq \sigma(r_1), y_2 \triangleq d - r_2, x_1 \triangleq x, and y_1 \triangleq (W_2^\top y_2) \circ \sigma'(r_1), where a \circ b \triangleq (a_1 b_1, ..., a_M b_M) is a component-wise product. Implementing such an algorithm requires a minor modification of the proposed circuit: it should have an additional inverted output, \delta \triangleq W^\top y. Once this modification is made, by cascading such circuits it is straightforward to implement the Backpropagation algorithm for two layers or more, as shown in Fig. 4.

Fig. 4. Implementing Backpropagation for an MNN with MSE error, using a modified version of the original circuit. The \sigma function boxes denote the operation of either \sigma(\cdot) or \sigma'(\cdot) on the input, and the \circ box denotes a component-wise product. For detailed circuit schematics, see [31].
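These two-layer update rules can be checked against a finite-difference gradient of the MSE objective E = ||d - W_2 \sigma(W_1 x)||^2: the proposed directions y_1 x_1^\top and y_2 x_2^\top should equal -1/2 dE/dW. The sigmoid choice and the dimensions below are illustrative.

```python
import numpy as np

def sigma(z):  return 1.0 / (1.0 + np.exp(-z))      # sigmoid nonlinearity
def dsigma(z): return sigma(z) * (1.0 - sigma(z))   # its derivative

rng = np.random.default_rng(3)
M, H, N = 4, 5, 2
x  = rng.standard_normal(M)
d  = rng.standard_normal(N)
W1 = rng.standard_normal((H, M))
W2 = rng.standard_normal((N, H))

# Backpropagation quantities as defined in the text
r1 = W1 @ x
x2 = sigma(r1)
r2 = W2 @ x2
y2 = d - r2
y1 = (W2.T @ y2) * dsigma(r1)
dW1 = np.outer(y1, x)     # proposed update direction for W1 (up to eta)
dW2 = np.outer(y2, x2)    # proposed update direction for W2

# Finite-difference check: dW1 should equal -1/2 dE/dW1 for E = ||d - r||^2
def E(W1, W2):
    return np.sum((d - W2 @ sigma(W1 @ x)) ** 2)

eps = 1e-6
num = np.zeros_like(W1)
for i in range(H):
    for j in range(M):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num[i, j] = (E(Wp, W2) - E(Wm, W2)) / (2 * eps)
assert np.allclose(dW1, -0.5 * num, atol=1e-6)
```

The same check applies to dW2; the chain rule is what keeps each update a local outer product per layer.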

The additional output \delta = W^\top y can be generated by the circuit in an additional read phase of duration T_rd, between the original read and write phases, in which the original roles of the input and output lines are inverted. In this phase the N-type transistor is on, i.e., \forall n: e_n = V_DD, and the former output lines o_n are given the following voltage signal (again, used for a non-destructive read):

  o_n(t) = a y_n    if T_rd \le t < 1.5 T_rd,
  o_n(t) = -a y_n   if 1.5 T_rd \le t \le 2 T_rd.

The I_nm current now exits through the (original input) u_m terminal, and the sum of all the currents is measured at time T_rd by the column interface (before it goes to ground):

  u_m = \sum_n I_nm = a \sum_n (\bar{g} + \hat{g} s_nm) y_n.   (26)

The column interface converts this current to a unit-less number and outputs

  \delta_m = c (u_m - u_ref),   (27)

where u_ref = a \bar{g} \sum_n y_n. Using (20), this gives \delta_m = \sum_n W_nm y_n, as required.

Lastly, note that it is straightforward to use a different error function instead of MSE. For example, in the two-layer example, recall that y_2 = -\frac{1}{2} \nabla_r \| d - r \|^2 = d - r. If a different error function E(r, d) is required, one can re-define

  y_2 \triangleq - \frac{\partial E(r, d)}{\partial r},   (28)

with any constant factor absorbed into the learning rate. To implement this change in the MNN circuit (Fig. 4), the subtractor should simply be replaced with some other module.

V. SOURCES OF NOISE AND VARIABILITY

Usually, analog computation suffers from reduced robustness to noise compared to digital computation [41]. ML algorithms are, however, inherently robust to noise, which is a key element in the set of problems they are designed to solve. This suggests that the effects of intrinsic noise on the performance of the analog circuit are relatively small. These effects largely depend on the specific circuit implementation (e.g., the CMOS process). In particular, memristor technology is not mature yet and memristors have not been fully characterized. Therefore, to check the robustness of the circuit, a crude estimation of the magnitude of noise and variability is used in this section. This estimation is based on known sources of noise and variability that are less dependent on the specific implementation. In section VI, the robustness of the circuit is evaluated by simulating it in the presence of these noise and variation sources.

A. Noise

When one of the transistors is enabled (e(t) = \pm V_DD), the current is affected by intrinsic thermal noise sources in the transistors and the memristor of each synapse. Current fluctuations of thermal origin in a device can be approximated by a white noise signal \tilde{I}(t) with zero mean, \langle \tilde{I}(t) \rangle = 0, and autocorrelation \langle \tilde{I}(t) \tilde{I}(t') \rangle = \sigma^2 \delta(t - t'), where \delta(\cdot) is Dirac's delta function and \sigma^2 = 2kTg (where k is Boltzmann's constant, T is the temperature, and g is the conductance of the device). For 65 nm transistors (parameters taken from IBM's 10LPe/10RFe process [42]) the characteristic conductivity is g_1 \approx 10^{-4} \Omega^{-1}, and therefore for \tilde{I}_1(t), the thermal current source of the transistors, \sigma_1^2 \approx 10^{-24} A^2 sec. Assume that the memristor characteristic conductivity is g_2 = \epsilon g_1, so for the thermal current source of the memristor, \sigma_2^2 = \epsilon \sigma_1^2. Note that from (13) we have \epsilon \ll 1: the resistance of the transistor is much smaller than that of the memristor. The total voltage on the memristor is thus

  V_M(t) = \frac{g_1}{g_1 + g_2} u(t) + \frac{\tilde{I}_1(t) - \tilde{I}_2(t)}{g_1 + g_2} \approx \frac{1}{1 + \epsilon} u(t) + \tilde{v}(t),

where \tilde{v}(t) = g_1^{-1} (\tilde{I}_1(t) - \tilde{I}_2(t)) and \langle \tilde{v}(t) \tilde{v}(t') \rangle \approx \sigma_v^2 \delta(t - t'), with \sigma_v^2 \approx 2kT g_1^{-1} \approx 10^{-16} V^2 sec. The equivalent circuit, including the sources of noise, is shown in Fig. 5. Assuming the circuit's minimal trial duration is T = 10 nsec, the maximum root mean square error due to thermal noise would be about E_T \approx \sigma_v T^{-1/2} \approx 10^{-4} V.
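The order-of-magnitude estimates above can be re-derived directly from \sigma^2 = 2kTg; room temperature (300 K) is assumed here, since the text does not state it:

```python
k_B = 1.380649e-23        # Boltzmann constant, J/K
T_K = 300.0               # assumed room temperature
g1  = 1e-4                # transistor characteristic conductance, 1/ohm

sigma1_sq = 2 * k_B * T_K * g1       # transistor current noise, A^2 * sec
sigma_v_sq = 2 * k_B * T_K / g1      # equivalent voltage noise, V^2 * sec
T_trial = 10e-9                      # minimal trial duration, sec
E_T = (sigma_v_sq / T_trial) ** 0.5  # rms voltage error over one trial

print(sigma1_sq)   # ~8.3e-25 A^2*sec (~1e-24, as quoted)
print(sigma_v_sq)  # ~8.3e-17 V^2*sec (~1e-16, as quoted)
print(E_T)         # ~9.1e-5 V        (~1e-4 V, as quoted)
```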

Noise in the inputs u, \bar{u}, and e also exists. According to [43], the relative noise in the power supply of the u / \bar{u} inputs is approximately 10% in the worst case. Applying u = ax therefore effectively gives an input of ax + E_T + E_u, where |E_u| \le 0.1 a |x|. The absolute noise level in the duration of e should be smaller than T_clk^min \approx 10^{-10} sec, assuming a digital implementation of pulse-width modulation with T_clk^min being the shortest clock cycle currently available. On every write cycle, e = \pm V_DD is therefore applied for a duration b |y| + E_e (instead of b |y|), where |E_e| < T_clk^min.

B. Parameter variability

A common estimation of the variability in memristor parameters is a Coefficient of Variation (CV = (standard deviation)/mean) of a few percent [44]. In this paper, the circuit is also evaluated with considerably larger variability (CV ≈ 30%), in addition to the noise sources described in section A. This is done by assuming that the parameters of the memristors are random variables sampled independently from a uniform distribution ĝ ~ U[0.5ḡ, 1.5ḡ]. When running the algorithm in software, these variations are equivalent to corresponding changes in the synaptic weights W or the learning rate η in the algorithm. Note that variability in the transistor parameters is not considered, since these can affect the circuit operation only if (12) or (13) are invalidated. This can happen, however, only if the values of K or VT vary by orders of magnitude, which is unlikely.

Fig. 5 Noise model for the artificial synapse. (a) During operation, only one transistor is conducting (assume it is the N-type transistor). (b) Thermal noise in a small-signal model: the transistor is converted to a resistor (g₁) in parallel to a current source (I₁), the memristor is converted to a resistor (g₂) in parallel to a current source (I₂), and the effects of the sources are summed linearly.
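A quick numerical check (a pure-Python sketch, taking 1 µS as the nominal conductance) confirms that the assumed uniform distribution indeed corresponds to CV ≈ 30%:

```python
import random

random.seed(0)
g_bar = 1e-6     # nominal memristor conductance [S]
n = 100_000
samples = [random.uniform(0.5 * g_bar, 1.5 * g_bar) for _ in range(n)]

mean = sum(samples) / n
std = (sum((s - mean) ** 2 for s in samples) / n) ** 0.5
cv = std / mean   # about 1/(2*sqrt(3)) = 0.29, i.e. roughly 30%
```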

VI. CIRCUIT EVALUATION

In this section, the proposed circuit is implemented (section A) in a simulated physical model. Then (section B), using standard supervised learning datasets, the implementation of SNNs and MNNs using the proposed circuit is demonstrated, including its robustness to noise and variation.

A. Basic functionality

First, the basic operation of the proposed design is examined numerically. A small 2×2 synaptic grid circuit is simulated for time 10T with simple inputs

(x₁, x₂) = (1, 0.5) · sign(t − 5T) ,    (29)
(y₁, y₂) = (1, 0.5) · 0.1 ,             (30)

in SPICE (appendix D [31]) and SimElectronics [45] (Fig. 6). Both are software tools which enable physical circuit simulation of memristor and MOSFET devices. In these simulations the correct basic operation of the circuit is verified. For example, notice that the conductance of the (n, m) memristor indeed increments proportionally to x_m y_n, following (24).

SOUDRY ET AL.: MEMRISTOR-BASED MULTILAYER NEURAL NETWORKS WITH ONLINE GRADIENT DESCENT TRAINING

TABLE I  Circuit parameters

Type           Parameter       Value
Power source   V_DD            10 V
NMOS           K               5 µA/V²
               V_T             1.7 V
PMOS           K               5 µA/V²
               V_T             −1.4 V
Memristor      ḡ              1 µS
               (update rate)   180 µS/(V·sec)
Timing         T               0.1 sec
               T_wr            0.21 T
Scaling        a               0.1 V
               b               T_wr
               c⁻¹             10 nA

TABLE II  Dataset and network details

Parameter           Fig. 7a   Fig. 7b
Network             30×1      4×4×3
#Training samples   284       75
#Test samples       285       75
η                   0.1       0.1

In addition to the inputs from the previous layer, each neuron has one constant input (a bias).

Implementation details. The SPICE model is described in appendix D. Next, the SimElectronics synaptic grid circuit model (which was used to construct model hardware MNNs in section B) is described. The circuit parameters appear in Table I, with the parameters of the (ideal) transistors kept at their defaults. The memristor model is implemented using (1-3) with parameters taken from experimental data [46, Fig. 2]. As depicted schematically in Fig. 2, the circuit implementation consists of a synapse grid and the interface blocks. The interface units were implemented using a few standard CMOS and logic components, as can be seen in the detailed schematics [31] (HTML, open using Firefox). The circuit operated synchronously, using global control signals supplied externally. The input and output of the circuit were kept constant in each trial using sample and hold units.
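The incremental outer-product behavior verified in these simulations can also be mimicked numerically. The following sketch applies the approximation used in the paper (conductance increment proportional to pulse magnitude times pulse duration) to a 2×2 grid with the inputs of (29)-(30); the constants a, b and the rate k are loosely based on Table I and are illustrative only:

```python
import numpy as np

def write_phase(G, x, y, a=0.1, b=0.021, k=180e-6):
    """Increment each conductance G[n, m] in proportion to x[m] * y[n].

    Approximation from the text: a pulse of magnitude a * x[m], applied for
    a duration b * |y[n]| with polarity sign(y[n]), changes the memristor
    conductance by about k * magnitude * duration.
    """
    for n in range(G.shape[0]):
        for m in range(G.shape[1]):
            G[n, m] += k * (a * x[m]) * (b * abs(y[n])) * np.sign(y[n])
    return G

G0 = np.full((2, 2), 1e-6)        # 2x2 grid at a nominal 1 uS
x = np.array([1.0, 0.5])          # as in (29), for t > 5T
y = np.array([1.0, 0.5]) * 0.1    # as in (30)
dG = write_phase(G0.copy(), x, y) - G0
# The increment matrix is a scaled outer product y x^T:
assert np.allclose(dG / dG[0, 0], np.outer(y, x) / (y[0] * x[0]))
```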

B. Learning performance

The synaptic grid circuit model is used to implement an SNN and an MNN, trainable by the online gradient descent algorithm. In order to demonstrate that the algorithm is indeed implemented correctly by the proposed circuit, the circuit performance has been compared to an algorithmic implementation of the MNNs in (Matlab) software. Two standard tasks are used:

1. The Wisconsin Breast Cancer diagnosis task [47] (linearly separable).
2. The Iris classification task [47] (not linearly separable).

The first task was evaluated using an SNN circuit, similar to Fig. 1. The second task was evaluated on a 2-layer MNN circuit, similar to Fig. 4. Fig. 7 shows the training phase performance of

1. The algorithm.
2. The proposed circuit (implementing the algorithm).
3. The circuit, with about 10% noise and 30% variability (section V).

Also, Table III shows the generalization errors after training was over. On each of the two tasks, the performance of the circuit and the algorithm improves similarly, finally reaching a similar generalization error. These results indicate that the proposed circuit design can precisely implement MNNs trainable with online gradient descent, as expected from the derivations in section III. Moreover, the circuit exhibits considerable robustness, as its performance is only mildly affected by significant levels of noise and variation.

Implementation details. The SNN and the MNN were implemented using the (SimElectronics) synaptic grid circuit model described in section A. The detailed schematics of the 2-layer MNN model are again given in [31] (HTML, open using Firefox), together with the code. All simulations were done using a desktop computer with a Core i7 930 processor, a Windows 8.1 operating system, and a Matlab 2013b environment. Each circuit simulation takes about one day to complete (note that the software implementation of the circuit does not exploit the parallel operation of the circuit hardware). For each sample k from the dataset, the attributes are converted to an M-long vector x(k), and each label is converted to an N-long binary (±1) vector d(k). The training set is repeatedly presented, with samples in random order. The inputs are centralized and normalized as recommended by [6], with additional rescaling done in dataset 1 to ensure that the inputs fall within the circuit specifications (12). The performance was averaged over 10 repetitions, and the training error was also averaged over 20 samples. Cross-entropy error was used instead of MSE, since it reduces classification error [48]. The initial weights were sampled independently from a symmetric uniform distribution. The neuronal activation functions were set to σ(x) = 1.7159 tanh(2x/3), as recommended by [6].
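For intuition, the software reference that the circuit is compared against can be sketched as a plain online-gradient-descent loop. This is a toy single-layer (SNN-like) version on synthetic separable data, using the activation and the random-order presentation described above; it uses a squared-error gradient instead of the cross-entropy of the actual experiments, and all data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def act(x):                 # activation recommended in the text
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def act_deriv(x):
    return 1.7159 * (2.0 / 3.0) / np.cosh(2.0 * x / 3.0) ** 2

# Toy separable data: M = 4 attributes, N = 1 output, labels in {-1, +1}
M, N, K = 4, 1, 200
X = rng.normal(size=(K, M))
w_true = rng.normal(size=M)
D = np.where(X @ w_true > 0, 1.0, -1.0).reshape(K, N)

W = rng.uniform(-0.1, 0.1, size=(N, M))   # symmetric uniform initialization
eta = 0.1                                 # learning rate, as in Table II

for epoch in range(20):
    for k in rng.permutation(K):          # samples in random order
        x, d = X[k], D[k]
        u = W @ x
        y = (d - act(u)) * act_deriv(u)   # squared-error output delta
        W += eta * np.outer(y, x)         # rank-1 incremental outer product

acc = float(np.mean(np.sign(act(X @ W.T)) == D))
```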

Fig. 6 Synaptic 2×2 grid circuit simulation, during ten operation cycles. Top: circuit inputs (x₁, x₂) and result outputs (r₁, r₂). Middle and bottom: voltage (solid black) and conductance (dashed red) change for each memristor in the grid. Notice that the conductance of the (n, m) memristor indeed changes proportionally to x_m y_n, following (24). Simulation was done using SimElectronics, with inputs as in (29-30), and circuit parameters as in Table I. Similar results were obtained using SPICE with a linear ion drift memristor model and a CMOS 0.18 µm process, as shown in appendix D [31].

Fig. 7 Circuit evaluation - training error for two datasets: (a) Wisconsin breast cancer diagnosis, (b) Iris classification. Each panel shows the training error over 1200 samples for the algorithm, the circuit, and the noisy and variable circuit.

TABLE III  Generalization errors

                                  Fig. 7a        Fig. 7b
(a) Algorithm                     1.3% ± 0.7%    2.9% ± 2%
(b) Circuit                       1.5% ± 0.7%    2.8% ± 1.9%
(c) Circuit with noise
    and variability               1.5% ± 0.7%    4.7% ± 2.4%

The standard deviation was calculated as √(Pe(1 − Pe)/Ntest), with Ntest being the number of samples in the test set and Pe being the generalization error.

VII. DISCUSSION

As explained in sections II and IV, two major computational bottlenecks of Multilayer Neural Networks (MNNs), and of many Machine Learning (ML) algorithms, are given by the matrix-vector product operation (4) and the vector-vector outer product operation (9). Both are of order O(MN), where M and N are the sizes of the input and output vectors. In this paper, the proposed circuit is designed specifically to deal with these bottlenecks using memristor arrays. This design has a relatively small number of components in each array element - one memristor and two transistors. The physical grid-like structure of the arrays implements the matrix-vector product operation (4) using analog summation of currents, while the memristor dynamics enable us to perform the vector-vector outer product operation (10) using a time-voltage encoding paradigm. The idea to use a resistive grid to perform a matrix-vector product operation is not new (e.g., [49]). The main novelty of this paper is the use of memristors together with time-voltage encoding, which allows us to perform a mathematically accurate vector-vector outer product operation in the learning rule using a small number of components.
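The grid's analog summation of currents is just a matrix-vector product. A reference computation of what each output wire accumulates (conductance and voltage values are illustrative):

```python
import numpy as np

g = np.array([[1.0e-6, 2.0e-6],
              [0.5e-6, 1.5e-6]])   # synaptic conductances [S]
u = np.array([0.3, -0.1])          # input voltages [V]

# Each output wire n sums its row's currents (Kirchhoff's current law),
# so the O(M*N) matrix-vector product happens in a single analog step.
currents = g @ u                   # I_n = sum_m g[n, m] * u[m]
```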


TABLE IV  Hardware designs of artificial synapses implementing scalable online learning algorithms

Design            #Transistors        Comments
Proposed design   2 (+1 memristor)
[50]                                  Weights decay ~ minutes
[51]              6                   Weights only increase (unusable)
[52, 53]          39                  Must keep training
[54]              52                  Must keep training
[55]              92
[56]              83
[57]              150
A. Comparison with existing hardware designs

As mentioned in the introduction, CMOS hardware designs that specifically implement online learning algorithms remain an unfulfilled promise at this point. The main incentive for existing hardware solutions is the inherent inefficiency of implementing these algorithms in software running on general-purpose hardware (e.g., CPUs, DSPs and GPUs). However,

squeezing the required circuit for both the computation and the update phases (two configurable multipliers and a memory element) into an array cell has proven to be a hard task using currently available CMOS technology. Off-chip or chip-in-the-loop architectures (e.g., [58], Table 1) have been suggested in many cases as a way around this design barrier. These designs, however, generally deal with the computational bottleneck of the matrix-vector product operation in the computation phase, rather than the computational bottleneck of the vector-vector outer product operation in the update phase. Additionally, these solutions are only useful in cases where the training is not continuous and is done in a pre-deployment phase or in special reconfiguration phases during the operation. Other designs implement non-standard (e.g., perturbation algorithms [59]) or specifically tailored learning algorithms (e.g., modified Backpropagation for spiking neurons [60]). However, it remains to be seen whether such algorithms are indeed scalable.

Hardware designs of artificial synaptic arrays that are capable of implementing common (scalable) online gradient descent-based learning are listed in Table IV. For large arrays (i.e., large M and N), the effective transistor count per synapse (where resistors and capacitors were also counted as transistors) is approximately proportional to the required area and average static power usage of the circuit.

The first device [50] uses a small number of transistors, similarly to our design, but requires the (rather unusual) use of UV illumination during its operation, and has the disadvantage of having volatile weights, decaying within minutes. The next device [51] includes six transistors per synapse, but its update rule can only increase the synaptic weights, which makes the device unusable for practical purposes. The next design [53, 61] suggested a grid of CMOS Gilbert multipliers, resulting in 39 transistors per synapse. In a similar design [54], 52 transistors are used. Both of these devices use capacitive elements for analog memory and suffer from the limitation of having volatile weights, vanishing after training has stopped. They therefore require constant re-training (practically acting as a refresh). Such re-training is also required on each start-up, unless the weights are read out into an auxiliary memory - a solution which requires a mechanism for reading out the synaptic weights. The larger design in [55] (92 transistors) also has weight decay, but with a slow, hours-long timescale. The device in [56] (83 transistors + an unspecified weight unit which stores the weights) does not report weight decay, apparently since digital storage is used. This is also true for [57] (150 transistors).

The proposed memristor-based design should resolve these obstacles and provide a compact, non-volatile circuit. Area and power consumption are expected to be reduced by a factor of 13 to 50 in comparison to standard CMOS technology, if a memristor is counted as an additional transistor (although it is actually more compact). Perhaps the most convincing evidence for the limitations of these designs is the fact that, although most of them are two decades old, they have not been incorporated into commercial products. It is only fair to mention at this point that while our design is purely theoretical and based on speculative technology, the designs reviewed above are based on mature technology, and have overcome obstacles all the way to manufacturing.

B. Circuit modification and generalizations

The specific parameters used for the circuit evaluation (Table I) are not strictly necessary for proper execution of the proposed design, and are only used to demonstrate its applicability. For example, it is straightforward to show that K, VT and ḡ have little effect (as long as (12-13) hold), and different values of ḡ can be adjusted for by rescaling c. This is important since the feasible range of parameters for memristive devices is still not well characterized, and it seems to be quite broad. For example, the values of the memristive timescales range from picoseconds [62] to milliseconds [46]. Here the parameters were taken from a millisecond-timescale memristor [46]. The actual speed of the proposed circuit would critically depend on the timescales of commercially available memristor devices.

Additional important modifications of the proposed circuit are straightforward. In appendix A it is explained how to modify the circuit to work with more realistic memristive devices [13, 32], instead of the classical memristor model [11], given a few conditions. In appendix C, it is shown that it is possible to reduce the transistor count from two to one, at the price of doubling the duration of the write phase. Other useful modifications of the circuit are also possible. For example, the input x may be allowed to receive different values during the read and write operations. Also, it is straightforward to replace the simple outer product update rule in (10) by more general update rules of the form

outer product update rule in (10) by more general update rules of the form

X

(k)

(k1)

Wnm

= Wnm

+

fi yn(k) gj x(k)

,

m

Such generalized update rules can be used, for example, to adaptively modify the learning rate during training (e.g., by modifying η in Eq. 28).
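A sketch of this generalized rule, showing that it reduces to the plain outer-product update when a single pair of linear functions is used (the function choices and numbers here are illustrative):

```python
import numpy as np

def generalized_update(W, x, y, fs, gs):
    """Apply W[n, m] += sum over i, j of f_i(y[n]) * g_j(x[m]) in place."""
    for f in fs:
        for g in gs:
            W += np.outer(f(y), g(x))
    return W

eta = 0.1
W = np.zeros((2, 3))
x = np.array([1.0, -0.5, 2.0])
y = np.array([0.2, -0.1])
# With a single linear pair this reduces to the plain rule (10):
W = generalized_update(W, x, y, fs=[lambda v: eta * v], gs=[lambda v: v])
assert np.allclose(W, eta * np.outer(y, x))
```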

VIII. CONCLUSIONS

A novel method to implement scalable online gradient descent learning in multilayer neural networks through local update rules is proposed, based on the emerging memristor technology. The proposed method is based on an artificial synapse using one memristor, to store the synaptic weight, and two CMOS transistors, to control the circuit. The proposed synapse structure exhibits accuracy similar to that of its equivalent software implementation, while showing high robustness and immunity to noise and parameter variability. Such circuits may be used to implement large-scale online learning in multilayer neural networks, as well as in other learning systems. The circuit is estimated to be significantly smaller than existing CMOS-only designs, opening the opportunity for massive parallelism with millions of adaptive synapses on a single integrated circuit, operating with low static power and good robustness. Such brain-like properties may give a significant boost to the field of neural networks and learning systems.

REFERENCES

[1] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, "Building high-level features using large scale unsupervised learning," in ICML '12, 2012, pp. 81-88.
[2] D. Ciresan, "Multi-column deep neural networks for image classification," in CVPR '12, 2012, pp. 3642-3649.
[3] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," NIPS, 2012.
[4] R. D. Hof, "Deep Learning," MIT Technology Review, 2013.
[5] D. Hernandez, "Now You Can Build Google's $1M Artificial Brain on the Cheap," Wired, 2013.
[6] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, 2012.
[7] J. Misra and I. Saha, "Artificial neural networks in hardware: A survey of two decades of progress," Neurocomputing, vol. 74, no. 1-3, pp. 239-255, Dec. 2010.
[8] A. R. Omondi, "Neurocomputers: a dead end?" International Journal of Neural Systems, vol. 10, no. 6, pp. 475-481, 2000.
[9] M. Versace and B. Chandler, "The brain of a new machine," IEEE Spectrum, vol. 47, no. 12, pp. 30-37, 2010.
[10] M. M. Waldrop, "Neuroelectronics: Smart connections," Nature, vol. 503, no. 7474, p. 224, Nov. 2013.
[11] L. Chua, "Memristor - the missing circuit element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507-519, 1971.
[12] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, "The missing memristor found," Nature, vol. 453, no. 7191, pp. 80-83, 2008.
[13] S. Kvatinsky, E. G. Friedman, A. Kolodny, and U. C. Weiser, "TEAM: ThrEshold Adaptive Memristor Model," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 1, pp. 211-221, 2013.
[14] S. Kvatinsky, Y. H. Nacson, Y. Etsion, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based multithreading," IEEE Computer Architecture Letters, vol. PP, no. 99, p. 1, 2013.
[15] S. P. Adhikari, C. Yang, H. Kim, and L. Chua, "Memristor bridge synapse-based neural network and its learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 9, pp. 1426-1435, Sep. 2012.
[16] D. Querlioz and O. Bichler, "Simulation of a memristor-based spiking neural network immune to device variations," in The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1775-1781, 2011.
[17] C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco, T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco, "On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex," Frontiers in Neuroscience, vol. 5, p. 26, Jan. 2011.
[18] O. Kavehei, S. Al-Sarawi, K.-R. Cho, N. Iannella, S.-J. Kim, K. Eshraghian, and D. Abbott, "Memristor-based synaptic networks and logical operations using in-situ computing," in 2011 Seventh International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 137-142, Dec. 2011.
[19] A. Nere, U. Olcese, D. Balduzzi, and G. Tononi, "A neuromorphic architecture for object recognition and motion anticipation using burst-STDP," PLoS ONE, vol. 7, no. 5, p. e36958, Jan. 2012.
[20] Y. Kim, Y. Zhang, and P. Li, "A digital neuromorphic VLSI architecture with memristor crossbar synaptic array for machine learning," in SOC Conference (SOCC), 2012 IEEE, pp. 328-333, 2012.
[21] W. Chan and J. Lohn, "Spike timing dependent plasticity with memristive synapse in neuromorphic systems," in The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1-6, Jun. 2012.
[22] T. Serrano-Gotarredona, T. Masquelier, T. Prodromakis, G. Indiveri, and B. Linares-Barranco, "STDP and STDP variations with memristors for spiking neuromorphic learning systems," Frontiers in Neuroscience, vol. 7, p. 2, Jan. 2013.
[23] D. Querlioz, O. Bichler, P. Dollfus, and C. Gamrat, "Immunity to device variations in a spiking neural network with memristive nanodevices," IEEE Transactions on Nanotechnology, vol. 12, no. 3, pp. 288-295, May 2013.
[24] R. Legenstein, C. Naeger, and W. Maass, "What can a neuron learn with spike-timing-dependent plasticity?" Neural Computation, vol. 17, no. 11, pp. 2337-2382, 2005.
[25] D. Chabi and W. Zhao, "Robust neural logic block (NLB) based on memristor crossbar array," in 2011 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), pp. 137-143, Jun. 2011.
[26] H. Manem, J. Rajendran, and G. S. Rose, "Stochastic gradient descent inspired training technique for a CMOS/nano memristive trainable threshold gate array," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 5, pp. 1051-1060, 2012.
[27] M. Soltiz, D. Kudithipudi, C. Merkel, G. Rose, and R. Pino, "Memristor-based neural logic blocks for nonlinearly separable functions," IEEE Transactions on Computers, vol. 62, no. 8, pp. 1597-1606, 2013.
[28] L. Bottou and O. Bousquet, "The tradeoffs of large-scale learning," in Optimization for Machine Learning, 2011, p. 351.
[29] D. Ciresan and U. Meier, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207-3220, 2010.
[30] Z. Vasilkoski, H. Ames, B. Chandler, A. Gorchetchnikov, J. Leveille, G. Livitz, E. Mingolla, and M. Versace, "Review of stability properties of neural plasticity rules for implementation on memristive neuromorphic hardware," in The 2011 International Joint Conference on Neural Networks, pp. 2563-2569, Jul. 2011.
[31] See Supplemental Material.
[32] L. Chua, "Memristive devices and systems," Proceedings of the IEEE, vol. 64, no. 2, 1976.
[33] B. Widrow and M. Hoff, "Adaptive switching circuits," Stanford Electronics Labs, Stanford, CA, Tech. Rep., 1960.
[34] B. Widrow and S. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[35] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.
[36] E. Oja, "Simplified neuron model as a principal component analyzer," Journal of Mathematical Biology, vol. 15, no. 3, pp. 267-273, 1982.
[37] C. M. Bishop, Pattern Recognition and Machine Learning. Singapore: Springer, 2006.
[38] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, "Pegasos: primal estimated sub-gradient solver for SVM," Mathematical Programming, vol. 127, no. 1, pp. 3-30, Oct. 2010.
[39] S. Kvatinsky, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based material implication (IMPLY) logic: design principles and methodologies," submitted to IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013.
[40] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems (MCSS), vol. 2, pp. 303-314, 1989.
[41] R. Sarpeshkar, "Analog versus digital: extrapolating from electronics to neurobiology," Neural Computation, vol. 10, no. 7, pp. 1601-1638, 1998.
[42] http://www.mosis.com
[43] G. Huang, D. C. Sekar, A. Naeemi, K. Shakeri, and J. D. Meindl, "Compact physical models for power supply noise and chip/package co-design of gigascale integration," in 2007 Electronic Components and Technology Conference, IEEE, 2007, pp. 1659-1666.
[44] M. Hu, H. Li, Y. Chen, X. Wang, and R. E. Pino, "Geometry variations analysis of TiO2 thin-film and spintronic memristors," in 16th Asia and South Pacific Design Automation Conference (ASP-DAC 2011), IEEE, Jan. 2011, pp. 25-30.
[45] http://www.mathworks.com/products/simelectronics/
[46] T. Chang, S.-H. Jo, K.-H. Kim, P. Sheridan, S. Gaba, and W. Lu, "Synaptic behaviors and modeling of a metal oxide memristive device," Applied Physics A, vol. 102, no. 4, pp. 857-863, Feb. 2011.
[47] K. Bache and M. Lichman, "UCI Machine Learning Repository," http://archive.ics.uci.edu/ml, 2013.
[48] P. Simard, D. Steinkraus, and J. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Seventh International Conference on Document Analysis and Recognition, 2003, Proceedings, vol. 1, pp. 958-963, 2003.
[49] D. B. Strukov and K. K. Likharev, "Reconfigurable nano-crossbar architectures," in Nanoelectronics and Information Technology, 2012, pp. 543-562.
[50] G. Cauwenberghs, "Analysis and verification of an analog VLSI incremental outer-product learning system," IEEE Transactions on Neural Networks, 1992.
[51] H. Card, C. R. Schneider, and W. R. Moore, "Hebbian plasticity in MOS synapses," IEEE Transactions on Audio and Electroacoustics, vol. 138, no. 1, p. 13, 1991.
[52] C. Schneider and H. Card, "Analogue CMOS Hebbian synapses," Electronics Letters, pp. 1990-1991, 1991.
[53] H. Card, C. R. Schneider, and R. S. Schneider, "Learning capacitive weights in analog CMOS neural networks," Journal of VLSI Signal Processing, vol. 8, no. 3, pp. 209-225, Oct. 1994.
[54] M. Valle, D. D. Caviglia, and G. M. Bisio, "An experimental analog VLSI neural network with on-chip back-propagation learning," Analog Integrated Circuits and Signal Processing, vol. 9, no. 3, pp. 231-245, 1996.
[55] T. Morie and Y. Amemiya, "An all-analog expandable neural network LSI with on-chip backpropagation learning," IEEE Journal of Solid-State Circuits, vol. 29, no. 9, pp. 1086-1093, 1994.
[56] C. Lu, B. Shi, and L. Chen, "An on-chip BP learning neural network with ideal neuron characteristics and learning rate adaptation," Analog Integrated Circuits and Signal Processing, pp. 55-62, 2002.
[57] T. Shima and T. Kimura, "Neuro chips with on-chip back-propagation and/or Hebbian learning," IEEE Journal of Solid-State Circuits, vol. 27, no. 12, 1992.
[58] C. S. Lindsey and T. Lindblad, "Survey of neural network hardware," in Proc. SPIE, Applications and Science of Artificial Neural Networks, vol. 2492, 1995, pp. 1194-1205.
[59] G. Cauwenberghs, "A learning analog neural network chip with continuous-time recurrent dynamics," NIPS, 1994.
[60] H. Eguchi, T. Furuta, and H. Horiguchi, "Neural network LSI chip with on-chip learning," Neural Networks, pp. 453-456, 1991.
[61] C. Schneider and H. Card, "CMOS implementation of analog Hebbian synaptic learning circuits," in IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. i, pp. 437-442, 1991.
[62] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams, "Sub-nanosecond switching of a tantalum oxide memristor," Nanotechnology, vol. 22, no. 48, p. 485203, Dec. 2011.
