
Logarithmic Multiplier in Hardware Implementation of Neural Networks

Uroš Lotrič and Patricio Bulić

Faculty of Computer and Information Science,
University of Ljubljana, Slovenia
{uros.lotric,patricio.bulic}@fri.uni-lj.si

Abstract. Neural networks on chip have found some niche areas of application, ranging from high-volume consumer products requiring low cost to real-time systems requiring fast response. Regarding the latter, iterative logarithmic multipliers show great potential for increasing the performance of hardware neural networks. By reducing the size of the multiplication circuit, the concurrency and consequently the speed of the model can be greatly improved. The proposed hardware implementation of a multilayer perceptron with on-chip learning ability confirms the potential of the concept. Experiments performed on the Proben1 benchmark datasets show that the adaptive nature of the proposed neural network model enables the compensation of the errors caused by inexact calculations, while simultaneously increasing performance and reducing power consumption.

Keywords: Neural network, Iterative logarithmic multiplier, FPGA.

1 Introduction
Artificial neural networks are commonly implemented as software models running on general purpose processors. Although widely used, these systems are based on the von Neumann architecture, which is sequential in nature and as such cannot exploit the inherent concurrency present in artificial neural networks. On the other hand, hardware solutions, specially tailored to the architecture of neural network models, can better exploit the massive parallelism, thus achieving much higher performance and lower power consumption than ordinary systems of comparable size and cost. Therefore, hardware implementations of artificial neural network models have found their place in some niche applications like image processing, pattern recognition, speech synthesis and analysis, adaptive sensors with teach-in ability, and so on.
Neural chips are available in analogue and digital hardware designs [1,2]. The analogue designs can take advantage of many interesting analogue electronic elements which directly perform the neural networks' functionality, resulting in very compact solutions. Unfortunately, these solutions are susceptible to noise, which limits their precision, and are severely limited with respect to on-chip learning. On the other hand, digital solutions are noise tolerant and have no technological obstacles to on-chip learning, but result in a larger circuit size. Since the design of


application-specific integrated circuits (ASICs) is time-consuming and requires a lot of resources, many hardware implementations use programmable integrated circuit technologies, such as the field-programmable gate array (FPGA) technology.
The implementation of neural network models in integrated circuits is still a challenging task due to the complex algorithms involving a large number of multiplications. Multiplication is a resource-, power- and time-consuming arithmetic operation. In artificial neural network designs, where many concurrent multiplications are desired, the multiplication circuits should be as small as possible. Due to the complexity of the circuits needed for floating-point operations, the designs are constrained to fixed-point implementations, which can make use of integer adders and multipliers.
The integer multiplier circuits can be further optimized. Many practical solutions, like truncated and logarithmic multipliers [3,4,5], consume less space and power and are faster than ordinary multipliers at the price of introducing small errors into the calculations. These errors can cause serious problems in neural network performance if the training is not performed on-chip. However, if the neural network learning is performed on-chip, the erroneous calculations should be compensated for in the learning phase and should not seriously degrade the network's performance.
All approximate multipliers discard some of the less significant partial products and introduce a sort of compensation circuit to reduce the error. The main idea of logarithmic multipliers is to approximate the operands with their logarithms, thus replacing the multiplication with one addition. Errors introduced by the approximation are usually compensated by a lookup-table approach, by interpolation, or by corrections based on Mitchell's algorithm [3]. The one-stage iterative logarithmic multiplier [5] follows the ideas of Mitchell but uses a different error-correction circuit. The final hardware implementation involves only one adder and a few shifters, resulting in reduced usage of logic resources and power consumption.
In this paper the behaviour of a hardware implementation of a neural network using iterative logarithmic multipliers is considered. In the next section the iterative logarithmic multiplier is introduced, outlining its advantages and weaknesses. Furthermore, a highly parallel processing unit specially suited for feed-forward neural networks is proposed. Its design allows it to be used in the forward pass as well as in the backward pass during the learning phase. In section four the performance of the proposed solution is tested on many benchmark problems. The results are compared with a hardware implementation using exact matrix multipliers as well as with a floating-point implementation. The main findings are summarized in the end.

2 Iterative Logarithmic Multiplier

The iterative logarithmic multiplier (ILM) was proposed by Babic et al. in [5].
It simplifies the logarithm approximation introduced in [3] and introduces an
iterative algorithm with various possibilities for achieving an error as small as
required and the possibility of achieving an exact result.

2.1 Mathematical Formulation

The logarithm of the product of two non-negative integer numbers $N_1$ and $N_2$ can be written as the sum of the logarithms, $\log_2(N_1 \cdot N_2) = \log_2 N_1 + \log_2 N_2$. By denoting $k_1 = \lfloor \log_2 N_1 \rfloor$ and $k_2 = \lfloor \log_2 N_2 \rfloor$, the logarithm of the product can be approximated as $k_1 + k_2$. In this case the calculation of the approximate product $2^{k_1 + k_2}$ requires only one add and one shift operation, but has a large error.
To decrease this error, the following procedure is proposed in [5]. A non-negative integer number $N$ can be written as

$$N = 2^{k} + N^{(1)} \;, \qquad (1)$$

where $k$ is a characteristic number, indicating the place of the leftmost one (the leading one bit) in its binary representation, and the number $N^{(1)} = N - 2^{k}$ is the remainder of the number $N$ after removal of the leading one.
Following the notation in Eq. 1, the product of two numbers can be written
as
$$P_{\mathrm{true}} = N_1 \cdot N_2 = (2^{k_1} + N_1^{(1)}) \cdot (2^{k_2} + N_2^{(1)}) = P_{\mathrm{approx}}^{(0)} + E^{(0)} \;. \qquad (2)$$
While the first approximation of the product

$$P_{\mathrm{approx}}^{(0)} = 2^{k_1 + k_2} + N_1^{(1)} \cdot 2^{k_2} + N_2^{(1)} \cdot 2^{k_1} \qquad (3)$$

can be calculated by applying only a few shift and add operations, the term

$$E^{(0)} = N_1^{(1)} \cdot N_2^{(1)} \;, \quad E^{(0)} \geq 0 \;, \qquad (4)$$

representing the absolute error of the first approximation, requires multiplication.


Similarly, the proposed multiplication procedure can be performed on
multiplicands from Eq. 4 such that

$$E^{(0)} = C^{(1)} + E^{(1)} \;, \qquad (5)$$

where $C^{(1)}$ is the approximate value of $E^{(0)}$, and $E^{(1)}$ the corresponding absolute error. The combination of Eq. 2 and Eq. 5 gives

$$P_{\mathrm{true}} = P_{\mathrm{approx}}^{(0)} + C^{(1)} + E^{(1)} = P_{\mathrm{approx}}^{(1)} + E^{(1)} \;. \qquad (6)$$

By repeating the described procedure we can obtain an arbitrarily precise approximation of the product by summing up the iteratively obtained correction terms $C^{(j)}$,

$$P_{\mathrm{approx}}^{(i)} = P_{\mathrm{approx}}^{(0)} + \sum_{j=1}^{i} C^{(j)} \;. \qquad (7)$$

Table 1. Average and maximal relative errors for the 16-bit iterative multiplier [5]

number of iterations i       0      1      2      3
average $E_r^{(i)}$ [%]    9.4   0.98   0.11   0.01
maximal $E_r^{(i)}$ [%]   25.0   6.25   1.56   0.39

The number of iterations required for an exact result is equal to the number of bits with the value of one in the operand with the smaller number of such bits. Babic et al. [5] showed that in the worst case the relative error introduced by the proposed multiplier, $E_r^{(i)} = E^{(i)}/(N_1 N_2)$, decays exponentially with the rate $2^{-2(i+1)}$. Table 1 presents the average and maximal relative errors with respect to the number of considered iterations.
The proposed method assumes non-negative numbers. To apply the method to signed numbers, it is most appropriate to specify them in sign-and-magnitude representation. In that case, the sign of the product is calculated as the EXOR operation between the sign bits of both multiplicands.
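
To make the procedure concrete, the following Python sketch mirrors Eqs. (1)-(7). It is only an illustrative software model of the algorithm, not the hardware design, and the function name and interface are ours.

```python
def ilm_multiply(n1: int, n2: int, iterations: int = 1) -> int:
    """Approximate n1 * n2 with the iterative logarithmic multiplier.

    iterations = 0 returns the basic approximation P_approx^(0) of Eq. (3);
    every further iteration adds one correction term C^(j) as in Eq. (7).
    """
    def basic_block(a: int, b: int):
        k1, k2 = a.bit_length() - 1, b.bit_length() - 1   # characteristic numbers
        r1, r2 = a - (1 << k1), b - (1 << k2)             # residues N^(1) of Eq. (1)
        # Eq. (3): 2^(k1+k2) + N1^(1)*2^k2 + N2^(1)*2^k1
        return (1 << (k1 + k2)) + (r1 << k2) + (r2 << k1), r1, r2

    if n1 == 0 or n2 == 0:
        return 0
    product, r1, r2 = basic_block(n1, n2)
    for _ in range(iterations):
        if r1 == 0 or r2 == 0:        # error term E = r1*r2 is zero: result is exact
            break
        correction, r1, r2 = basic_block(r1, r2)
        product += correction         # add the correction term C^(j)
    return product
```

As a check against Table 1, ilm_multiply(255, 255, 1) returns 61056 while the exact product is 65025; the relative error of about 6.1 % is consistent with the 6.25 % worst-case bound for one correction iteration.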

2.2 Hardware Implementation


The implementation of the proposed multiplier is described in [5]. The multiplier with one error-correction circuit, shown in Figure 1a, is composed of two pipelined basic blocks, of which the first one calculates the approximate product $P_{\mathrm{approx}}^{(0)}$, while the second one calculates the error-correction term $C^{(1)}$. The task of the basic block is to calculate one approximate product according to Eq. 3. To decrease the maximum combinational delay, the basic block is implemented as a four-stage pipeline, shown in Figure 1b. Stage 1 calculates the two characteristic numbers $k_1$, $k_2$ and the two residues $N_1^{(1)}$, $N_2^{(1)}$. The residues are output in stage 2, which also calculates $k_1 + k_2$, $N_1^{(1)} \cdot 2^{k_2}$ and $N_2^{(1)} \cdot 2^{k_1}$. Stage 3 calculates $2^{k_1 + k_2}$ and $N_1^{(1)} \cdot 2^{k_2} + N_2^{(1)} \cdot 2^{k_1}$, which are summed up to the approximation of the product $P_{\mathrm{approx}}^{(0)}$ in stage 4. After an initial latency of 5 clock periods, the proposed iterative logarithmic multiplier enables a product to be calculated in each clock period. The estimated device utilization in terms of programmable hardware components, i.e. slices and lookup tables, and the power consumption at a frequency of 25 MHz for the 16-bit pipelined implementations of the proposed multiplier and the classical matrix multiplier are compared in Table 2.

Table 2. Device utilization and power consumption of multipliers obtained on the Xilinx Spartan 3 XC3S1500-5FG676 FPGA circuit

multiplier               slices   lookup tables   power [mW]
iterative logarithmic      427             803         7.32
matrix                     477            1137         9.16


Fig. 1. Block diagrams of a. a pipelined iterative logarithmic multiplier with one error-
correction circuit, and b. its basic block

3 Multilayer Perceptron with Highly Parallel Neural Unit

One of the most widely used neural networks is the multilayer perceptron, which
gained its popularity with the development of the back propagation learning
algorithm [6]. Despite its simple idea the learning phase still presents a hard nut
to crack when hardware implementations of the model are in question.
A multilayer perceptron is a feed-forward neural network consisting of a set
of source nodes forming the input layer, one or more hidden layers of computa-
tion nodes, and an output layer of computation nodes. A computation node or neuron $n$ in layer $l$ first computes an activation potential $v_n^l = \sum_i \omega_{ni}^l x_i^{l-1}$, a linear combination of the weights $\omega_{ni}^l$ and the outputs $x_i^{l-1}$ from the previous layer. To get the neuron output, the activation potential is passed to an activation function, $x_n^l = \varphi(v_n^l)$, for example $\varphi(v) = \tanh(v)$. The objective of a learning algorithm is to find a set of weights and biases that minimizes the performance function, usually defined as the squared error between the calculated outputs and the target values. For the back-propagation learning rule, the weight update equation in its simplest form becomes $\Delta\omega_{ni}^l = \eta\,\delta_n^l x_i^{l-1}$, with $\delta_n^l = \varphi'(v_n^l)(t_n - x_n^l)$ in the output layer and $\delta_n^l = \varphi'(v_n^l)\sum_o \delta_o^{l+1}\omega_{no}^{l+1}$ otherwise, where $\eta$ is the learning parameter and $t_n$ the $n$-th element of the target output.
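
The following NumPy sketch restates these equations for a single hidden layer. It is a minimal illustration of the forward pass and the weight update, not the hardware implementation; the variable names, the omission of biases, and the use of floating point are our simplifying assumptions.

```python
import numpy as np

def forward(x, W1, W2, phi=np.tanh):
    """Forward pass: activation potentials v and outputs for both layers."""
    v1 = W1 @ x            # hidden-layer activation potentials v_n^1
    x1 = phi(v1)           # hidden-layer outputs x_n^1
    v2 = W2 @ x1           # output-layer activation potentials v_n^2
    x2 = phi(v2)           # network outputs x_n^2
    return v1, x1, v2, x2

def backprop_step(x, t, W1, W2, eta=2**-4):
    """One back-propagation update with learning parameter eta (a power of two)."""
    v1, x1, v2, x2 = forward(x, W1, W2)
    dphi = lambda v: 1.0 - np.tanh(v) ** 2          # derivative of tanh
    delta2 = dphi(v2) * (t - x2)                    # output-layer deltas
    delta1 = dphi(v1) * (W2.T @ delta2)             # hidden-layer deltas
    W2 += eta * np.outer(delta2, x1)                # Delta w = eta * delta * x
    W1 += eta * np.outer(delta1, x)
    return W1, W2
```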

A multilayer perceptron exhibits two levels of concurrency: a fine-grained computation of each neuron's activation potential and a coarse-grained computation of the outputs of all neurons in a layer. Many existing solutions build on the latter concept [7], which complicates the hardware implementation of the learning process. Since the calculations of the activation potentials and of the deltas of neurons in hidden layers are very similar, we have exploited the former concept.
For that purpose we have built a highly parallel neural unit that calculates the scalar product of two vectors in only one clock cycle [8]. The inputs to the neural unit are first passed to the multipliers, from which the products are fed to adders organized in a tree-like structure.
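
A software analogue of this structure is sketched below: all element-wise products are formed first (one per multiplier) and then reduced level by level through an adder tree. This is a conceptual model only; the zero-padding of odd-length levels is our assumption.

```python
def tree_scalar_product(inputs, weights):
    """Scalar product as parallel multiplies followed by a tree of adders."""
    level = [x * w for x, w in zip(inputs, weights)]   # one multiplier per element
    while len(level) > 1:
        if len(level) % 2:                             # pad odd levels with zero
            level.append(0)
        # one adder-tree level: pairwise sums
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0
```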
In order to gain as much as possible from the neural unit, it should be capable of calculating the scalar product of the largest vectors that appear in the computation. The hardware circuit thus becomes very complex and can only be operated at lowered frequencies. For example, a unit with 32 multipliers and consequently 31 adders was implemented in a Spartan 3 XC3S1500-5FG676 FPGA chip. While separate multiplications can run at a maximum frequency of 50 MHz, the proposed unit managed to run at a still acceptable 30 MHz [8].
To use the neural unit, a set of subsidiary units is needed: RAM for storing the weights, registers for keeping inputs, outputs and partial results, multiplexers for loading the proper data into the neural unit, lookup tables with the stored values of the activation function (LUT) and its derivative (LUTd), and three state machines. The forward pass and the backward pass are controlled by the Learn and Execute state machines, which are supervised by the Main state machine. A simplified scheme of the implementation is shown in Fig. 2.

Fig. 2. Neural network implementation scheme [8]



4 Experimental Work
To assess the performance of the iterative logarithmic multiplier, a set of experiments was performed on multilayer perceptron neural networks with one hidden layer. The models were compared in terms of classification or approximation accuracy, speed of convergence, and power consumption. Three types of models were evaluated: a) an ordinary software model (SM) using floating-point arithmetic, b) a hardware model with exact matrix multipliers (HM_M), and c) the proposed hardware model using the iterative logarithmic multipliers with one error-correction circuit (HM_L).
The models were evaluated on the Proben1 collection of freely available benchmarking problems for neural network learning [9]. This rather heterogeneous collection contains 15 data sets from 12 different domains, and all but one consist of real-world data. Among them, 11 data sets are from the area of pattern classification and the remaining four from the area of function approximation. The data sets, containing from a few hundred to a few thousand input-output samples, are already divided into training, validation and test sets, generally in the proportion 50 : 25 : 25. The number of attributes in the input samples ranges from 9 to 125 and in the output samples from 1 to 19. Before modelling, all input and output samples were rescaled to the interval [−0.8, +0.8].
The testing of the models on each of the data sets mentioned above was performed in two steps. After finding the best software models, the modelling of the hardware models started, keeping the same number of neurons in the hidden layer.
During the software model optimization, the topology parameters as well as the learning parameter η were varied. Since the number of inputs and outputs is predefined by a data set, the only parameter influencing the topology of the model is the number of neurons in the hidden layer. It was varied from one to a maximum value, determined in such a way that the number of model weights did not exceed the number of training samples. The learning process in the backpropagation scheme heavily depends on the learning parameter η. Since the data sets are very heterogeneous, the values $2^{-2}, 2^{-4}, \ldots, 2^{-12}$ were used for the learning parameter η. Powers of two are very suitable for hardware implementation because the multiplications can be replaced by shift operations.
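
For instance (an illustration with made-up fixed-point operands, not code from the paper), multiplying by η = 2^{-4} in fixed-point arithmetic reduces to an arithmetic right shift by four bits:

```python
delta_fixed, x_fixed = 12_345, 6_789             # example fixed-point operands
eta_shift = 4                                    # learning parameter eta = 2**-4
dw_fixed = (delta_fixed * x_fixed) >> eta_shift  # multiplication by eta as a shift
```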
While the software model uses 64-bit floating-point arithmetic, both hardware models use fixed-point arithmetic with weights represented with 16, 18, 20, 22, or 24 bits. For both hardware models the weights were limited to the interval [−4, +4]. The processing values, including inputs and outputs, were represented with 16 bits in the interval [−1, +1]. The values of the activation function ϕ(v) = tanh(1.4 v) and of its derivative for 256 equidistant values of v from the interval [−2, 2] were stored in two separate lookup tables.
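
A possible software model of this number format and of the activation lookup table is sketched below. The exact allocation of integer and fractional bits is not stated in the paper; the split assumed here (1 sign + 15 fractional bits for 16-bit values in [−1, +1], and 1 sign + 2 integer + 15 fractional bits for 18-bit weights in [−4, +4]) is our illustrative assumption.

```python
import numpy as np

def to_fixed(value, total_bits, frac_bits):
    """Quantize a real value to a signed fixed-point integer with saturation."""
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return int(np.clip(round(value * (1 << frac_bits)), lo, hi))

VAL_BITS, VAL_FRAC = 16, 15   # processing values in [-1, +1] (assumed split)
W_BITS, W_FRAC = 18, 15       # weights in [-4, +4] (assumed split)

# Lookup table of phi(v) = tanh(1.4 v) at 256 equidistant points of v in [-2, 2]
v_grid = np.linspace(-2.0, 2.0, 256)
lut_phi = np.array([to_fixed(np.tanh(1.4 * v), VAL_BITS, VAL_FRAC) for v in v_grid])

def phi_from_lut(v):
    """Activation by table lookup, saturating for |v| > 2."""
    index = int(np.clip((v + 2.0) / 4.0 * 255, 0, 255))
    return lut_phi[index]
```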
Fig. 3. Performance of the models with respect to the weight precision on the Hearta1 data set

By applying the early stopping criterion, the learning phase was stopped as soon as the classification or approximation error on the validation set started to grow. The analysis on the test set was performed with the model parameters which gave the minimal value of the normalized squared error on the validation set. The normalized squared error is defined as the squared difference between the calculated and the target outputs, averaged over all samples and output attributes, and divided by the squared difference between the maximal and the minimal value of the output attributes. The results presented in the following are given only for the test-set samples, which were not used during the learning phase.
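
One plausible formalization of this definition, with $x_{np}$ and $t_{np}$ the calculated and target values of output attribute $n$ for sample $p$, and $t_n^{\max}$, $t_n^{\min}$ the extreme values of that attribute, is

$$E = \frac{1}{NP}\sum_{p=1}^{P}\sum_{n=1}^{N}\frac{(x_{np} - t_{np})^2}{(t_n^{\max} - t_n^{\min})^2} \;,$$

where $P$ is the number of samples and $N$ the number of output attributes; whether the normalization range is taken per attribute or over all outputs is not specified in the text, so this is one reading of the definition.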
In Fig. 3, a typical dependence of the normalized squared error on the weight precision is presented. The normalized squared error decreases exponentially with increasing precision of the weights. However, the increasing precision of the weights also requires more and more hardware resources. Since there is a big drop in the normalized squared error from 16- to 18-bit precision, and since we can make use of the numerous prefabricated 18 × 18-bit matrix multipliers in the new Xilinx FPGA programmable circuits, our further analysis is confined to 18-bit weight precision.
The model performance for some selected data sets from the Proben1 collection is given in Table 3. Average values and standard deviations for all three types of models over ten runs are given in terms of three measures: the number of epochs, the normalized squared error $E_{te}$ and the percentage of misclassified samples $p_{te}^{miss}$. The latter is only given for the data sets from the classification domain. The results obtained for the software models using the backpropagation algorithm are similar to those reported in [9], where more advanced learning techniques were applied. The most noticeable difference between the software and hardware models is in the number of epochs needed to train a model. The number of epochs in the case of the hardware models is for many data sets an order of magnitude smaller than in the case of the software models. The reason probably lies in the inability of the hardware models to further optimize the weights due to their representation in limited precision.
As a rule, the hardware models exhibit slightly poorer performance in terms of the normalized squared error and the percentage of misclassified samples. The discrepancy is very large for the gene1 and thyroid1 data sets, where a representation of the weights with more than 18 bits is needed to close the gap.
The comparison of the hardware models HM_M and HM_L reveals that the replacement of the exact matrix multipliers with the proposed approximate iterative logarithmic multipliers does not have any notable effect on the performance of the models. The reason for the very good compensation of the errors caused by

Table 3. Performance of software and hardware models on some data sets. For each data set the results obtained with models SM, HM_M, and HM_L are given in the first, second, and third row, respectively.

data set    hidden neurons   epochs         E_te            p_te^miss [%]
cancer1          6           24.0 ± 4.6     0.111 ± 0.002    1.46 ± 0.29
                             30.7 ± 4.7     0.114 ± 0.002    1.72 ± 0.00
                             30.6 ± 4.8     0.116 ± 0.003    1.72 ± 0.00
diabetes1        7           152 ± 62       0.407 ± 0.002   23.85 ± 0.42
                             28.5 ± 3.2     0.418 ± 0.001   25.31 ± 0.44
                             27.4 ± 5.5     0.416 ± 0.001   24.74 ± 0.51
gene1            8           1230 ± 208     0.262 ± 0.005   12.03 ± 0.59
                             20.6 ± 1.3     0.337 ± 0.007   23.33 ± 1.29
                             20.6 ± 1.3     0.339 ± 0.007   23.82 ± 1.25
thyroid1        48           6830 ± 2340    0.132 ± 0.002    2.75 ± 0.13
                             24.6 ± 5.4     0.195 ± 0.003    6.09 ± 0.19
                             23.1 ± 5.2     0.195 ± 0.002    6.07 ± 0.14
building1       56           50.5 ± 6.0     0.158 ± 0.001
                             16.2 ± 0.4     0.217 ± 0.015
                             16.2 ± 0.4     0.217 ± 0.014
flare1           4           21.1 ± 3.4     0.075 ± 0.001
                             30.8 ± 14.6    0.076 ± 0.002
                             32.5 ± 15.1    0.076 ± 0.001
hearta1          3           5640 ± 928     0.330 ± 0.002
                             38 ± 13        0.342 ± 0.005
                             38 ± 12.5      0.344 ± 0.006
heartac1         4           5070 ± 371     0.250 ± 0.014
                             41 ± 19        0.272 ± 0.019
                             44.2 ± 27.2    0.271 ± 0.020

inexact multiplication can be found in the high adaptation ability common to all neural network models.
The proposed neural unit needs to be applied many times to calculate the model output, therefore it is important that it is as small and as efficient as possible. The estimation of the device utilization in terms of Xilinx Spartan 3 FPGA programmable circuit building blocks for a model with 32 exact 16 × 18-bit matrix multipliers is shown in Table 4. According to the analysis of the multipliers in Table 2, the replacement of the matrix multipliers with the iterative logarithmic multipliers can lead to more than 10 % smaller device utilization and more than 30 % smaller power consumption.

Table 4. Estimation of FPGA device utilization for a neural network model with 32 inputs, 8 hidden neurons and 10 outputs using 16 × 18-bit matrix multipliers

              slices (×1000)   lookup tables (×1000)
whole model         27                 38
neural unit         25 (92 %)          34 (89 %)
multipliers         24 (89 %)          33 (87 %)

5 Conclusion

Neural networks offer a high degree of internal parallelism, which can be efficiently exploited in custom-designed chips. Neural network processing comprises a huge number of multiplications, i.e. arithmetic operations consuming a lot of space, time and power. In this paper we have shown that exact matrix multipliers can be replaced with approximate iterative logarithmic multipliers with one error-correction circuit.
Due to the highly adaptive nature of neural network models, which compensates for the erroneous calculations, the replacement of the multipliers did not have any notable impact on the models' processing and learning accuracy. Moreover, the proposed logarithmic multipliers require fewer resources on a chip, which leads to smaller designs on the one hand and, on the other hand, to designs with more concurrent units on the same chip. The consumption of fewer resources per multiplier also results in more power-efficient circuits. The power consumption, reduced by roughly 20 %, makes the hardware neural network models with iterative logarithmic multipliers favourable candidates for battery-powered applications.

Acknowledgments

This research was supported by the Slovenian Research Agency under grants P2-0241 and P2-0359, and by the Slovenian Research Agency and the Ministry of Civil Affairs, Bosnia and Herzegovina, under grant BI-BA/10-11-026.

References
1. Zhu, J., Sutton, P.: FPGA implementations of neural networks - a survey of a
decade of progress. In: Cheung, P.Y.K., Constantinides, G.A., de Sousa, J.T. (eds.)
FPL 2003. LNCS, vol. 2778, pp. 1062–1066. Springer, Heidelberg (2003)
2. Dias, F.M., Antunes, A., Mota, A.M.: Artificial neural networks: a review
of commercial hardware. Engineering Applications of Artificial Intelligence 17,
945–952 (2004)
3. Mitchell, J.N.: Computer multiplication and division using binary logarithms. IRE
Transactions on Electronic Computers 11, 512–517 (1962)
4. Mahalingam, V., Ranganathan, N.: Improving Accuracy in Mitchell's Logarithmic
Multiplication Using Operand Decomposition. IEEE Transactions on Computers 55,
1523–1535 (2006)
5. Babic, Z., Avramovic, A., Bulic, P.: An Iterative Logarithmic Multiplier.
Microprocessors and Microsystems 35(1), 23–33 (2011) ISSN 0141-9331,
doi:10.1016/j.micpro.2010.07.001
6. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice-Hall,
New Jersey (1999)

7. Pedroni, V.A.: Circuit Design With VHDL. MIT, Cambridge (2004)


8. Gutman, M., Lotrič, U.: Implementation of neural network with learning ability
using FPGA programmable circuits. In: Zajc, B., Trost, A. (eds.) Proceedings of
the ERK 2010 Conference, vol. B, pp. 173–176. IEEE Slovenian section, Ljubljana
(2010)
9. Prechelt, L.: Proben1 – A Set of Neural Network Benchmark Problems and Benchmarking Rules.
Technical Report 21/94, University of Karlsruhe, Karlsruhe (1994)
