
CHAPTER 1

INTRODUCTION
1.1 INTRODUCTION
Multipliers are key components of many high-performance systems such as FIR filters, microprocessors and digital signal processors. A system's performance is generally determined by the performance of the multiplier, because the multiplier is usually the slowest element in the system and also the most area consuming. Hence, optimizing the speed and area of the multiplier is a major design issue. However, area and speed are usually conflicting constraints, so improving speed mostly results in larger area. As a result, a whole spectrum of multipliers with different area-speed trade-offs has been designed, with fully parallel multipliers at one end of the spectrum and fully serial multipliers at the other end. In between are digit-serial multipliers, which operate on single digits consisting of several bits. These multipliers have moderate performance in both speed and area. However, existing digit-serial multipliers have been plagued by complicated switching systems and/or irregularities in design.
Radix-2^n multipliers, which operate on digits in a parallel fashion instead of bits, bring pipelining to the digit level and avoid most of the above problems. They were introduced by M. K. Ibrahim in 1993. These structures are iterative and modular. Pipelining at the digit level brings the benefit of constant operation speed irrespective of the size of the multiplier: the clock speed is determined only by the digit size, which is fixed before the design is implemented.

1.2 MOTIVATION
As the scale of integration keeps growing, more and more sophisticated signal processing systems are being implemented on a VLSI chip. These signal processing applications not only demand great computation capacity but also consume considerable amounts of energy. While performance and area remain two major design goals, power consumption has become a critical concern in today's VLSI system design. The need for low-power VLSI systems arises from two main forces. First, with the steady growth of operating frequency and processing capacity per chip, large currents have to be delivered and the heat due to large power consumption must be removed by proper cooling techniques. Second, battery life in portable electronic devices is limited. Low-power design directly leads to prolonged operation time in these portable devices.

Multiplication is a fundamental operation in most signal processing algorithms.

Multipliers have large area, long latency and consume considerable power. Therefore, low-power multiplier design has been an important part of low-power VLSI system design. The primary objective is power reduction with small area and delay overhead. By using the radix-4 Booth algorithm, we design a multiplier with low power and smaller area.

1.3 POWER OPTIMIZATION


Power refers to the number of joules dissipated over a certain amount of time, whereas energy is a measure of the total number of joules dissipated by a circuit. Strictly speaking, low-power design is a different goal from low-energy design, although they are related. Power is a problem primarily when cooling is a concern. The maximum power at any time, the peak power, is often used for power and ground wiring design, signal noise margin and reliability analysis. Energy per operation or task is a better metric of the energy efficiency of a system, especially in the domain of maximizing battery lifetime.
In digital CMOS design, the well-known power-delay product is commonly used to assess the merits of designs. In a sense, this is a misnomer, since power × delay = (energy/delay) × delay = energy, which implies delay is irrelevant. Instead, the term energy-delay product should be used, since it involves two independent measures of circuit behavior. Therefore, when power-delay products are used as a comparison metric, different schemes should be measured at the same frequency to ensure that the comparison is equivalent to an energy-delay product comparison.
There are two major sources of power dissipation in digital CMOS circuits: dynamic power and static power. Dynamic power is related to circuit switching activities, or the changing events of logic states, and includes power dissipation due to capacitance charging and discharging as well as dissipation due to short-circuit current (SCC). In CMOS logic, unintended leakage current, either reverse-biased PN-junction current or subthreshold channel conduction current, is the only source of static current. However, occasional deviations from the strict CMOS logic style, such as pseudo-NMOS logic, can cause intentional static current.

The total power consumption is summarized in the following equations:

Ptotal = Pdynamic + Pstatic = Pcap + Pscc + Pstatic                    (1.1)

Pcap = α0→1 · fclk · CL · VDD^2                                        (1.2)

Pscc = α0→1 · fclk · Ipeak · ((tr + tf)/2) · VDD                       (1.3)

Pstatic = Istatic · VDD                                                (1.4)

Pcap in Equation 1.2 represents the dynamic power due to capacitance charging and discharging of a circuit node, where CL is the loading capacitance, fclk is the clock frequency, and α0→1 is the 0→1 transition probability in one clock period. In most cases, the voltage swing Vswing is the same as the supply voltage VDD; otherwise, Vswing should replace VDD in this equation. Pscc is a first-order average power consumption due to short-circuit current. The peak current, Ipeak, is determined by the saturation current of the devices and is hence directly proportional to the sizes of the transistors; tr and tf are the rise and fall times of the short-circuit current, respectively. The static power Pstatic is primarily determined by fabrication technology considerations and is usually several orders of magnitude smaller than the dynamic power. The leakage power problem mainly appears in very low-frequency circuits or in ones with sleep modes where dynamic activities are suppressed. The dominant term in a well-designed circuit during its active state is the dynamic term due to switching activity on loading capacitance, and thus low-power design often becomes the task of minimizing α0→1, CL, VDD and fclk while retaining the required functionality. In the future, static power will become increasingly important as the supply voltage keeps scaling: to avoid performance degradation, the threshold voltage Vt is lowered accordingly, and subthreshold leakage current increases exponentially. Leakage power reduction depends heavily on circuit and technology techniques such as dual-Vt partitioning and multi-threshold CMOS. In this work, we do not consider leakage power reduction.
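As a purely illustrative numerical example of Equation 1.2 (the values below are assumptions, not measurements from this project), consider a node with α0→1 = 0.1, CL = 10 pF, VDD = 1.8 V and fclk = 100 MHz. Then Pcap = 0.1 x (100 x 10^6) x (10 x 10^-12) x (1.8)^2 ≈ 0.32 mW for that node; summing such contributions over all switching nodes gives the dynamic power of the whole circuit.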
Power optimization of digital systems has been studied at different abstraction levels, from the lowest technology level to the highest system level. At the technology level, power consumption is reduced by improvements in the fabrication process such as smaller feature sizes, very low voltages, copper interconnects, and insulators with low dielectric constants. With fabrication support for multiple supply voltages, lower voltages can be applied to non-critical system blocks. At the layout level, placement and routing are adjusted to reduce wire capacitance and signal delay imbalances. At the circuit level, power reduction is achieved by transistor sizing, transistor network restructuring and reorganization, and different circuit logic styles.

1.4 LOW POWER MULTIPLIER DESIGN



Multiplication consists of three steps: generation of partial products or PPs (PPG), reduction of partial products (PPR), and final carry-propagate addition (CPA). In general, there are sequential and combinational multiplier implementations. We only consider combinational multipliers in this work because the scale of integration is now large enough to accept parallel multiplier implementations in digital VLSI systems. Different multiplication algorithms vary in their approaches to PPG, PPR, and CPA. For PPG, radix-2 digit-vector multiplication is the simplest form because the digit-vector products are produced by a set of AND gates. To reduce the number of PPs, and consequently the area/delay of PP reduction, one operand is usually recoded into a high-radix digit set. The most popular one is the radix-4 digit set {−2, −1, 0, 1, 2}. For PPR, two alternatives exist: reduction by rows, performed by an array of adders, and reduction by columns, performed by an array of counters. In reduction by rows, there are two extreme classes: linear arrays and tree arrays. A linear array has a delay of O(n), while both tree arrays and column reduction have a delay of O(log n), where n is the number of PPs. The final CPA requires a fast adder scheme because it is on the critical path. In some cases, the final CPA is postponed if it is advantageous to keep redundant results from PPG for further arithmetic operations.
The difficulty of low-power multiplier design lies in three aspects. First, the multiplier area is quadratically related to the operand precision. Second, parallel multipliers have many logic levels, which introduce spurious transitions or glitches. Third, the structure of parallel multipliers can be very complex in order to achieve high speed, which reduces the effectiveness of layout- and circuit-level optimization. As a fundamental arithmetic operation, multiplication has many algorithm-level and bit-level computation features that distinguish it from random logic. These features have not been considered well in low-level power optimization, and it is also difficult to consider input data characteristics at low levels. Therefore, it is desirable to develop algorithm- and architecture-level power optimization techniques that consider multiplication's arithmetic features and the operands' characteristics.
There has been some work on low-power multipliers at the algorithm and architecture level. As smaller area usually leads to less switching capacitance, area results can provide a rough estimation of the relative power consumption of different multiplication schemes. Callaway studied the power/delay/area characteristics of four classical multipliers. Angel proposed low-power sign-extension schemes and self-timed design with bypassing logic for zero PPs in radix-4 multipliers. Cherkauer and Friedman proposed a hybrid radix-4/radix-8 low-power signed multiplier architecture. For multiplication data with a large dynamic range, several approaches have been proposed. Architecture-level signal gating techniques have been studied. A mixed number representation for radix-4 two's-complement multiplication has been proposed. Radix-4 recoding has been applied to the constant input instead of the dynamic input in low-power multiplication for FIR filters. Multiplication has been separated into higher and lower parts, with the results of the higher part stored in a cache in order to reduce redundant computation. Two further techniques have been proposed for data with a large dynamic range: a most-significant-digit-first carry-save array for PP reduction and a dynamically generated reduced two's-complement representation. In another scheme, the precisions of the two input data are compared at runtime and the two operands are exchanged if necessary, so that radix-4 recoding is applied to the operand with the smaller precision in order to generate more zero PPs.

CHAPTER 2
MULTIPLIERS

2.1 MULTIPLIERS: OVERVIEW


Multipliers can be classified as hardware multipliers and software multipliers. In older digital systems, there was no hardware multiplier and multiplication was implemented with a microprogram. The microprogram needed many microinstruction cycles to complete the multiplication process, which made microprogrammed multipliers slow. For high-speed digital systems, hardware multipliers are usually used. In modern microprocessors and ASIC processors, most arithmetic logic units (ALUs) contain a hardware multiplier. High-speed hardware multipliers have been of interest for some time, and more sophisticated approaches to multiplier design can be implemented today due to the increased density of integrated circuits.
Hardware multipliers can be divided into two main categories: sequential and parallel array multipliers. In a sequential multiplier, multiplication of the multiplier and multiplicand is performed by repeatedly adding the multiplicand and shifting. The advantage of a sequential multiplier is that the circuit is simple and occupies less chip area; the disadvantage is that it is slower. In parallel array multipliers, the summation of partial products is carried out by using a linear adder array. Because the operation is performed in parallel, the speed is much faster than that of a sequential multiplier.
There are a number of algorithms used for multiplication. The 3-bit recoding algorithm is one of the most well known and is used in the design of many kinds of hardware and software multipliers. This algorithm reduces the number of partial product rows by about half, so the speed of multiplication increases significantly and the chip area is reduced. The 3-bit recoding algorithm is also called the modified Booth algorithm and was developed from Booth's algorithm. A number of other multiple-bit recoding algorithms for multiplication have also been developed. Recently, a parallel hardware multiplier based on a 5-bit recoding algorithm has been proposed. From the viewpoint of optimization, the 5-bit recoding algorithm is preferred to a 4-bit recoding algorithm. While more partial product rows can be eliminated with the 5-bit recoding algorithm than with the 3-bit recoding algorithm, more complicated circuits are required to determine the odd multiples of the multiplicand. With the potential of improving both performance and the hardware requirements, the 5-bit recoding algorithm may be good for a wide multiplier, but not for a narrow multiplier such as an 8 x 8 bit multiplier.

Using the 5-bit recoding algorithm reduces the number of partial product rows to two. The partial products are selected from 17 different multiples of the multiplicand Y (0, ±Y, ±2Y, ..., ±8Y). Using the 3-bit recoding algorithm reduces the number of partial product rows to four. The partial products are selected from 5 different multiples of the multiplicand Y (0, ±Y, ±2Y). The addition of four partial products can be changed to the addition of two binary numbers by using two rows of carry-save adder arrays (CSA), with only a two-gate delay introduced. The even multiples of the multiplicand Y can be implemented by a hardwired shift. For the 3-bit recoding algorithm, only the two's complement of Y needs to be determined; for the 5-bit recoding algorithm, additional high-speed adders are required to determine the odd multiples of Y. These adders require more circuitry to implement and introduce more delay. For wider multipliers, the advantage of the 5-bit recoding algorithm can be seen. For example, for a 32 x 32 bit multiplier, the 5-bit recoding algorithm reduces the number of partial product rows to 8, while the 3-bit recoding algorithm reduces it to 16. The reduction in the number of partial product rows is apparent.

2.2 BINARY MULTIPLIER


A Binary multiplier is an electronic hardware device used in digital electronics or a
computer or other electronic device to perform rapid multiplication of two numbers in binary
representation. It is built using binary adders.
The rules for binary multiplication can be stated as follows:
1. If the multiplier digit is a 1, the multiplicand is simply copied down and represents the product.
2. If the multiplier digit is a 0, the product is also 0.
For designing a multiplier circuit we should have circuitry to provide or do the following four things:
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting the partial products left.
3. It should be able to add all the partial products to give the product as the sum of the partial products.
4. It should examine the sign bits. If they are alike, the sign of the product will be positive; if the sign bits are opposite, the product will be negative. The sign bit of the product determined by the above criterion should be displayed along with the product.
From the above discussion we observe that it is not necessary to wait until all the partial products have been formed before summing them. In fact, the addition of the partial products can be carried out as soon as a partial product is formed.

2.3 DIRECT MULTIPLICATION OF TWO UNSIGNED BINARY NUMBERS

The process of digital multiplication is based on addition, and many of the techniques useful in addition carry over to multiplication. The general scheme for unsigned multiplication is shown in Figure 2.1.

Figure 2.1 Digital multiplication of unsigned four-bit binary numbers

For the multiplication of an n-bit multiplier and an m-bit multiplicand, the product is represented with an (n + m)-bit binary number. To complete the multiplication, either
(1) the partial products can be added sequentially, or
(2) the partial products can be added by using a parallel adder array. A small worked example follows.
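As a worked example of this scheme (the operand values are chosen arbitrarily for illustration), consider multiplying the four-bit numbers 1011 (11) and 1101 (13):

        1 0 1 1            multiplicand (11)
      x 1 1 0 1            multiplier (13)
      ---------
        1 0 1 1            partial product for multiplier bit 0 (1)
      0 0 0 0              partial product for multiplier bit 1 (0), shifted left by 1
    1 0 1 1                partial product for multiplier bit 2 (1), shifted left by 2
  1 0 1 1                  partial product for multiplier bit 3 (1), shifted left by 3
  ---------------
  1 0 0 0 1 1 1 1          product (143)

The 4 + 4 = 8 product bits are obtained either by accumulating the partial products one at a time (sequentially) or by summing them all with a parallel adder array.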

2.4 MULTIPLY ACCUMULATE CIRCUITS


Multiplication followed by accumulation is a key operation in many digital systems, particularly highly interconnected ones like digital filters, neural networks, data quantisers, etc. A typical MAC (multiply-accumulate) architecture is illustrated in the figure. It consists of multiplying two values, then adding the result to the previously accumulated value, which must then be stored back in the register for future accumulations. Another feature of a MAC circuit is that it must check for overflow, which might happen when the number of MAC operations is large.
This design can be done using components, because we have already designed each of the units shown in the figure. However, since it is a relatively simple circuit, it can also be designed directly. In any case, the MAC circuit as a whole can be used as a component in applications like digital filters and neural networks. A VHDL sketch of such a MAC unit is given below.
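The following VHDL sketch is a minimal illustration of such a MAC unit. It is not the project's own code; the entity name, the operand width, the number of guard bits and the sticky overflow flag are assumptions made purely for this example.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative multiply-accumulate unit: acc <= acc + a*b on every clock edge.
entity mac_unit is
  generic (WIDTH : integer := 8);                     -- operand width (assumed)
  port ( clk, rst : in  std_logic;
         a, b     : in  signed(WIDTH-1 downto 0);
         acc      : out signed(2*WIDTH+3 downto 0);   -- product width plus guard bits
         ovf      : out std_logic );                  -- sticky overflow flag
end mac_unit;

architecture behavioral of mac_unit is
  signal acc_reg : signed(2*WIDTH+3 downto 0) := (others => '0');
begin
  process (clk)
    variable prod, sum : signed(2*WIDTH+3 downto 0);
  begin
    if rising_edge(clk) then
      if rst = '1' then
        acc_reg <= (others => '0');
        ovf     <= '0';
      else
        prod := resize(a * b, prod'length);           -- multiply and sign-extend
        sum  := acc_reg + prod;                       -- accumulate
        -- two's complement overflow: operands share a sign but the sum does not
        if acc_reg(acc_reg'high) = prod(prod'high) and
           sum(sum'high) /= acc_reg(acc_reg'high) then
          ovf <= '1';
        end if;
        acc_reg <= sum;
      end if;
    end if;
  end process;
  acc <= acc_reg;
end behavioral;

The accumulator is made wider than a single product so that many accumulations can occur before the overflow check triggers.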

2.5 SEQUENTIAL MULTIPLIER/ARRAY MULTIPLIER


A sequential multiplier implements multiplication by repeatedly adding the multiplicand to the partial product and shifting. The advantage of a sequential multiplier is that the circuit is simple and the chip occupies less area; the disadvantage is that it is slower. A sequential multiplier usually consists of a register MD, which holds the multiplicand, a shift register MR, which initially holds the multiplier, a shift accumulator, which holds the partial product, and a shift counter.

Figure 2.2 Flow chart for multiplication process of a sequential multiplier

Figure 2.2 shows the multiplication process of a sequential multiplier.
The steps for multiplication are as follows (a VHDL sketch of this procedure is given after the list):
1. The multiplier and multiplicand are loaded into the registers MR and MD, respectively, and the accumulator and counter are reset to zero.
2. The least significant bit of the shift register MR is tested; if it is 1, the multiplicand Y is added to the partial product.
3. The partial product and multiplier are shifted one place to the right and the least significant bit of the multiplier is discarded.
4. The counter is incremented by one.
5. If the count is equal to the number n of bits in the multiplier, the multiplication process is complete and the product is equal to the number held in the accumulator; otherwise, the operation returns to step 2.
A sequential multiplier is a simple circuit and occupies less chip area, but it is slow. To increase the speed of multiplication, parallel adder arrays are used to add partial products.
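The following VHDL process is a minimal sketch of the shift-and-add procedure above for unsigned 8-bit operands. The entity name, port names and widths are assumptions for illustration and do not represent the design described later in this project.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative 8x8 unsigned sequential (shift-and-add) multiplier.
entity seq_mult is
  port ( clk, start : in  std_logic;
         md, mr     : in  unsigned(7 downto 0);   -- multiplicand and multiplier
         product    : out unsigned(15 downto 0);
         done       : out std_logic );
end seq_mult;

architecture behavioral of seq_mult is
begin
  process (clk)
    variable acc   : unsigned(16 downto 0);  -- 1 carry bit + 8 accumulator bits + 8 multiplier bits
    variable count : integer range 0 to 8;
  begin
    if rising_edge(clk) then
      if start = '1' then
        -- step 1: load the multiplier into the low half, clear accumulator and counter
        acc := (others => '0');
        acc(7 downto 0) := mr;
        count := 0;
        done  <= '0';
      elsif count < 8 then
        -- step 2: if the LSB of the multiplier part is 1, add the multiplicand
        if acc(0) = '1' then
          acc(16 downto 8) := acc(16 downto 8) + ('0' & md);
        end if;
        -- step 3: shift the whole accumulator/multiplier pair one place right
        acc := '0' & acc(16 downto 1);
        -- step 4: advance the counter
        count := count + 1;
        -- step 5: after 8 iterations the product is held in the accumulator
        if count = 8 then
          product <= acc(15 downto 0);
          done    <= '1';
        end if;
      end if;
    end if;
  end process;
end behavioral;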

2.6 PARALLEL MULTIPLIER


In parallel array multipliers, the summation of partial products is carried out by using a linear adder array. Because the operation is performed in parallel, the speed is much faster than that of a sequential multiplier.
There are a number of algorithms used for multiplication; two of them are the radix-2 and radix-4 Booth algorithms. The radix-4 algorithm is also called the modified Booth algorithm.

2.7 ARCHITECTURE OF RADIX 2n MULTIPLIER


The architecture of a radix-2^n multiplier is given in Figure 2.3. This block diagram shows the multiplication of two numbers with four digits each. These numbers are denoted as V and U, while the digit size was chosen as four bits; the reason for this will become apparent in the following sections. Each circle in the figure corresponds to a radix cell, which is the heart of the design. Every radix cell has four digit inputs and two digit outputs. The input digits are also fed through the corresponding cells. The dots in the figure represent latches for pipelining; every dot consists of four latches. The ellipses represent adders which are included to calculate the higher-order bits. They do not fit the regularity of the design, as they are used to terminate the design at the boundary. The outputs are again in terms of four-bit digits and are shown by the Ws. The numbers denote the clock period at which the data appear.

Figure 2.3 Radix_2n multiplier Architecture

2.8 BOOTH MULTIPLICATION ALGORITHM


Booth's multiplication algorithm will multiply two signed binary numbers in two's
complement notation.
Procedure:
If x is the number of bits of the multiplicand and y is the number of bits of the multiplier:
1. Draw a grid of three rows, each with squares for x + y + 1 bits. Label the rows A (add), S (subtract) and P (product).
2. In two's complement notation, fill the first x bits of each row with:
   A: the multiplicand
   S: the negative of the multiplicand
   P: zeroes
3. Fill the next y bits of each row with:
   A: zeroes
   S: zeroes
   P: the multiplier
4. Fill the last bit of each row with a zero.
5. Do both of the following steps y times:
   (a) Examine the last two bits of the product:
       00 or 11: do nothing.
       01: P = P + A. Ignore any overflow.
       10: P = P + S. Ignore any overflow.
   (b) Arithmetically shift the product right one position.
6. Drop the last bit from the product for the final result. A small worked example is given below.
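For illustration (the operand values here are chosen arbitrarily), consider multiplying the multiplicand m = 3 by the multiplier r = −4 with x = y = 4 bits, so each row has 4 + 4 + 1 = 9 bits:
A = 0011 0000 0
S = 1101 0000 0
P = 0000 1100 0
Step 1: the last two bits of P are 00, so only shift: P = 0000 0110 0.
Step 2: the last two bits are 00, so only shift: P = 0000 0011 0.
Step 3: the last two bits are 10, so P = P + S = 1101 0011 0; after the shift, P = 1110 1001 1.
Step 4: the last two bits are 11, so only shift: P = 1111 0100 1.
Dropping the last bit gives 1111 0100, which is −12 in two's complement, as expected for 3 x (−4).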

2.9 BOOTH MULTIPLICATION ALGORITHM FOR RADIX 4


One way of realizing high-speed multipliers is to enhance parallelism, which helps to decrease the number of subsequent calculation stages. The original version of the Booth algorithm (radix-2) has two drawbacks:
(i) The number of add/subtract operations and the number of shift operations become variable, which is inconvenient in designing parallel multipliers.
(ii) The algorithm becomes inefficient when there are isolated 1s.
These problems are overcome by using the radix-4 Booth algorithm. This algorithm reduces the number of partial product rows by about half, so the speed of multiplication increases significantly and it also consumes less power.
The radix-4 Booth algorithm scans strings of three bits at a time, as follows (a VHDL sketch of the corresponding encoder is given after these steps):
1) Extend the sign bit by one position if necessary to ensure that n is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each 3-bit vector, each partial product will be 0, +y, −y, +2y or −2y.
The negative multiples of y are produced by taking the two's complement, and in this work carry-look-ahead (CLA) fast adders are used. The multiplication of y by 2 is done by shifting y one bit to the left. Thus, in any case, in designing an n-bit parallel multiplier, only n/2 partial products are generated.
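As an illustration of step 3, the following VHDL sketch encodes one 3-bit group of the multiplier and selects the corresponding multiple of the multiplicand. The entity name, the 8-bit multiplicand width and the 10-bit partial product width are assumptions made for this example; this is not the encoder used in the project's design, which is described in Chapter 6.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative radix-4 Booth encoder/selector for one 3-bit multiplier group.
-- The group (b(i+1), b(i), b(i-1)) selects 0, +Y, -Y, +2Y or -2Y as one partial product.
entity booth_pp is
  port ( grp : in  std_logic_vector(2 downto 0);   -- b(i+1) & b(i) & b(i-1)
         y   : in  signed(7 downto 0);             -- multiplicand
         pp  : out signed(9 downto 0) );           -- wide enough for +/-2Y of any 8-bit value
end booth_pp;

architecture behavioral of booth_pp is
  signal y_ext : signed(9 downto 0);
begin
  y_ext <= resize(y, 10);
  with grp select
    pp <= to_signed(0, 10)        when "000" | "111",  -- digit  0
          y_ext                   when "001" | "010",  -- digit +1
          shift_left(y_ext, 1)    when "011",          -- digit +2
          -shift_left(y_ext, 1)   when "100",          -- digit -2
          -y_ext                  when others;         -- "101" | "110": digit -1
end behavioral;

The partial product is made two bits wider than the multiplicand so that ±2Y fits even for the most negative 8-bit value.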


Table 2.1 Radix_4 Booth recoding Table


Let us see an example demonstrating the whole procedure of the radix-4 Booth multiplier using a Wallace tree and sign-extension correctors. Take the calculation of 34 x (−42):
Multiplicand A = 34 = 00100010
Multiplier B = −42 = 11010110 (two's complement form)
A x B = 34 x (−42) = −1428
First of all, the multiplier has to be recoded into radix-4 digits as in the figure below. The first partial product is determined by the three LSB digits of the multiplier, that is, B1, B0 and one appended zero. This 3-bit group is 100, which means the multiplicand A has to be multiplied by −2. To multiply by −2, the process takes the two's complement of the multiplicand and then shifts that value left by one bit. Hence, the first partial product is 110111100. All of the partial products are nine bits long.
Next, the second partial product is determined by bits B3, B2, B1, which indicate multiplication by 2. Multiplying by 2 means the multiplicand value has to be shifted left by one bit, so the second partial product is 001000100. The third partial product is determined by bits B5, B4, B3, which indicate multiplication by 1; the third partial product is therefore the multiplicand value itself, namely 000100010. The fourth partial product is determined by bits B7, B6, B5, which indicate multiplication by −1. Multiplying by −1 means the multiplicand has to be converted to its two's complement value, so the fourth partial product is 111011110.
The figure below shows the arrangement of all four partial products to be added using the Wallace tree adder method. 1E, 1BE, 2E, 3E and 4E are obtained from the sign-extension table. The way in which this sign E is arranged has been shown in the Wallace tree multiplication method above. The Wallace tree for the example is given below.

Figure 2.4 Method showing How Partial Products Should Be Added


To prove that the output result is correct:
11111101001101100 = 2^0(0) + 2^1(0) + 2^2(1) + 2^3(1) + 2^4(0) + 2^5(1) + 2^6(1) + 2^7(0) + 2^8(0) + 2^9(1) + 2^10(0) + 2^11(−1)
= 4 + 8 + 32 + 64 + 512 − 2048 = −1428

2.10 ARCHITECTURE OF AN 8 X 8 BIT MULTIPLIER


Figure 2.5 shows the architecture of an 8 x 8-bit parallel multiplier using the 3-bit recoding algorithm. The number of partial product rows is reduced to four. The two's complement block is used to determine the two's complement of the multiplicand Y, which represents the negative of Y. The 3-bit encoder gives four 4-bit codes, S1, S2, S3 and S4, which are used to determine the partial products Pk.

Figure 2.5 Architecture of an 8 x 8 bit multiplier


The partial product selectors are four identical multiplexers which are used to determine Pk (k = 1, 2, 3, 4). The product P of X and Y is

P = P1 + P2·2^2 + P3·2^4 + P4·2^6

The partial products Pk (k = 1, 2, 3 and 4) are therefore shifted by 2(k−1) bit positions relative to P1 and need to be sign extended to 16 bits. Carry save adder arrays (CSA) are used to add multiple inputs and change the summation of multiple numbers into the summation of two numbers. There is no carry propagation delay in the carry save adder array, so the speed is high. Two 16-bit numbers A[15...0] and B[15...0] are obtained from the CSA. The 4 least significant bits of B are zero, so it is only necessary to add the two numbers A[15...4] and B[15...4]. For the least significant 4 bits, P[3...0] = A[3...0]. A ripple carry adder array (RCA) is used to add A[9...4] and B[9...4] to obtain P[9...4]. A carry select adder array is used to add A[15...10] and B[15...10] to obtain P[15...10].
The logic implementations of the different multiplier blocks are detailed in the following sections. These blocks are based on logic gates and half and full adders.

2.11 TWO'S COMPLEMENT BLOCK


The two's complement block is used to determine the two's complement of the multiplicand Y, which represents the negative of Y. The figure shows the circuit of the two's complement block. According to the definition of two's complement, the negative of Y is obtained by inverting every bit of Y and adding 1 to the result.

Figure 2.6 Two's Complement Circuit

To obtain the highest speed, carry skip adders combined with carry select adders are adopted.


2.12 3-BIT ENCODER BLOCK


The 3-bit encoder gives the four codes Sk (k = 1, 2, 3, 4) simultaneously, based on Table 2.2; these are used to determine the partial products Pk. The table shows how to determine Sk and Pk by looking up three consecutive bits of the multiplier X.

Table 2.2 Encodes [30]

Figure 2.7 shows the circuit for determining the codes Sk [30].

Figure 2.7 Part of 3-bit encoder circuit

2.13 THE PARTIAL PRODUCTS SELECTOR


The circuit shown in the figure is the basic cell used to build the partial product select circuits. This basic cell determines the ith bit of the kth partial product (before shifting). In the figure, b1i, b2i, b3i and b4i are the ith bits of Y, 2Y, −Y and −2Y, respectively, and the output is the ith bit of the kth partial product (before it is shifted to its final position).

Figure 2.8 Circuit to determine the ith bit of the kth partial product

2.14 CARRY SAVE ADDERS ARRAY BLOCK


The figure shows two carry save adder (CSA) arrays used to add four binary numbers. The first carry save adder array, CSA1, is used to add three binary numbers. There is no carry propagation: the carry of the ith bit is saved as the value of C[i] and the sum of the ith bit is saved as the value of S[i].

Figure 2.9 Carry Save Adders Array Block

The summation of the three numbers A1[MSB:0], A2[MSB:0] and A3[MSB:0] is equal to the summation of the two numbers C[MSB:0] and S[MSB:0]. The second carry save adder array, CSA2, is used to add A4[MSB:0], C[MSB:0] and S[MSB:0]. So, the summation of four numbers is transformed into the summation of two numbers quickly by using carry save adder arrays, with only a two-gate delay. A VHDL sketch of one such CSA stage is given below.
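Because each bit position of a carry save adder is simply an independent full adder, one CSA stage can be sketched in VHDL as follows (the entity name and generic width are assumptions for illustration):

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative carry-save adder stage: compresses three N-bit numbers into
-- a sum word S and a carry word C (the carry word is weighted by one bit position).
entity csa_stage is
  generic (N : integer := 16);
  port ( a1, a2, a3 : in  std_logic_vector(N-1 downto 0);
         s, c       : out std_logic_vector(N-1 downto 0) );
end csa_stage;

architecture dataflow of csa_stage is
begin
  gen_fa : for i in 0 to N-1 generate
    -- each bit position is an independent full adder, so no carry ripples
    s(i) <= a1(i) xor a2(i) xor a3(i);
    c(i) <= (a1(i) and a2(i)) or (a2(i) and a3(i)) or (a1(i) and a3(i));
  end generate;
end dataflow;

Here A1 + A2 + A3 equals S plus C shifted left by one bit position, which is exactly the two-number form consumed by the next stage.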

2.15 RIPPLE CARRY ADDERS


The figure shows a 6-bit ripple carry adder consisting of six 1-bit full adders.

Figure 2.10 Ripple Carry Adders

2.16 CARRY SELECT ADDERS


The figure shows the carry select adder array. There are two identical 6-bit ripple carry adder arrays: for one, the carry in is '0'; for the other, the carry in is '1'. The carry C[9] is used to select which set of outputs becomes the most significant 6 bits, P[15:10].

Figure 2.11 Carry select adders array


CHAPTER 3
ADDERS
In electronics, an adder is a digital circuit that performs addition of numbers. In modern computers, adders reside in the arithmetic logic unit (ALU), where other operations are also performed. Although adders can be constructed for many numerical representations, such as binary-coded decimal or excess-3, the most common adders operate on binary numbers. In cases where two's complement is being used to represent negative numbers, it is trivial to modify an adder into an adder-subtractor.
Addition is the most common and most frequently used arithmetic operation in microprocessors, digital signal processors and especially digital computers. It also serves as a building block for synthesizing all other arithmetic operations. Therefore, for the efficient implementation of an arithmetic unit, the binary adder structure becomes a very critical hardware unit.
Any book on computer arithmetic shows that there exists a large number of different adder architectures with different performance characteristics that are widely used in practice. Although much research dealing with binary adder structures has been done, studies based on their comparative performance analysis are few.
In this project, qualitative evaluations of the classified binary adder architectures are given. Among the many adders, we wrote VHDL (hardware description language) code for the ripple-carry, carry-select and carry-look-ahead adders to emphasize the common performance properties belonging to their classes. In the following sections, we give a brief description of the studied adder architectures.
The first class consists of the very slow ripple-carry adder with the smallest area. In the second class, the carry-skip and carry-select adders with multiple levels have small area requirements and shortened computation times. From the third class, the carry-look-ahead adder, and from the fourth class, the parallel prefix adder, represent the fastest addition schemes with the largest area complexities.
Types of adders
For single-bit adders, there are two general types. A half adder has two inputs, generally labeled A and B, and two outputs, the sum S and the carry C. S is the XOR of A and B, and C is the AND of A and B. Essentially, the output of a half adder is the sum of two one-bit numbers, with C being the more significant of the two outputs. The second type of single-bit adder is the full adder. The full adder takes into account a carry input, so that multiple adders can be used to add larger numbers. To remove ambiguity between the input and output carry lines, the carry in is labeled Ci or Cin, while the carry out is labeled Co or Cout.

3.1 HALF ADDER

Figure 3.1 Half adder Logic diagram


A half adder is a logical circuit that performs an addition operation on two binary digits. The half
adder produces a sum and a carry value which are both binary digits.
Following is the logic table for a half adder:
Inputs      Outputs
A   B       S   C
0   0       0   0
0   1       1   0
1   0       1   0
1   1       0   1

Table 3.1 Half Adder Truth Table
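A half adder of this kind takes only a few lines of VHDL; the sketch below is illustrative, with assumed entity and port names:

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative half adder: S = A xor B, C = A and B.
entity half_adder is
  port ( a, b : in  std_logic;
         s, c : out std_logic );
end half_adder;

architecture dataflow of half_adder is
begin
  s <= a xor b;
  c <= a and b;
end dataflow;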

3.2 FULL ADDER

Figure 3.2 Full adder Logic diagram


Inputs: {A, B, Carry In}
Outputs: {Sum, Carry Out}

Figure3.3 Schematic symbol for a 1-bit full adder


A full adder is a logical circuit that performs an addition operation on three binary digits. The full adder produces a sum and a carry value, which are both binary digits. It can be combined with other full adders (see below) or work on its own.

Inputs          Outputs
A   B   Ci      Co   S
0   0   0       0    0
0   0   1       0    1
0   1   0       0    1
0   1   1       1    0
1   0   0       0    1
1   0   1       1    0
1   1   0       1    0
1   1   1       1    1

Table 3.2 Full Adder Truth Table

Note that the final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. This is because the only discrepancy between OR and XOR gates occurs when both inputs are 1, and for the adder shown here one can check that this is never possible. Using only two types of gates is convenient if one desires to implement the adder directly using common IC chips. A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that to an input of the second half adder, connecting Ci to the other input, and ORing the two carry outputs. Equivalently, S can be made the three-bit XOR of A, B and Ci, and Co can be made the three-bit majority function of A, B and Ci. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.
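The construction just described can be sketched structurally in VHDL, reusing the illustrative half_adder entity given above (the names remain assumptions):

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative full adder built from two half adders and an OR gate.
entity full_adder is
  port ( a, b, ci : in  std_logic;
         s, co    : out std_logic );
end full_adder;

architecture structural of full_adder is
  component half_adder
    port ( a, b : in std_logic; s, c : out std_logic );
  end component;
  signal s1, c1, c2 : std_logic;  -- intermediate sum and carries
begin
  ha1 : half_adder port map (a => a,  b => b,  s => s1, c => c1);
  ha2 : half_adder port map (a => s1, b => ci, s => s,  c => c2);
  co <= c1 or c2;   -- the OR here could equally be an XOR, as noted above
end structural;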

CHAPTER 4
FILTERS
4.1 FIR FILTER
Digital filters can be divided into two categories: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. Although FIR filters in general require more taps than IIR filters to obtain similar frequency characteristics, FIR filters are widely used because they have linear phase characteristics, guarantee stability, and are easy to implement with multipliers, adders and delay elements. The number of taps in digital filters varies according to the application. In commercial filter chips with a fixed number of taps, zero coefficients are loaded into the registers for unused taps and unnecessary calculations have to be performed. To alleviate this problem, FIR filter chips providing variable-length taps have been widely used in many application fields. However, these FIR filter chips use memory, an address generation unit, and a modulo unit to access memory in a circular manner. The paper proposes two special features, called a data reuse structure and a recurrent-coefficient scheme, to provide variable-length taps efficiently. Since the proposed architecture only requires several MUXs, registers, and a feedback loop, the number of gates can be reduced by over 20% compared with existing chips.

Fig 4.1 FIR filter block diagram


In general, FIR filtering is described by a simple convolution operation, as expressed in equation (4.1):

y[n] = Σ (k = 0 to N) h[k] · x[n − k]                                  (4.1)

where x[n], y[n] and h[k] represent the data input, the filtering output and a coefficient, respectively, and N is the filter order. The equation for an FIR filter using the bit-serial algorithm can be represented as
(4.2)
where hj is the jth bit of the coefficient and N and M are the two summation limits.

4.2 TRANSVERSAL FILTER


An N-tap transversal filter was assumed as the basis for this adaptive filter. The value of N is determined by practical considerations. An FIR filter was chosen because of its stability. The use of the transversal structure allows relatively straightforward construction of the filter, as shown in the figure.

Figure 4.2 Transversal filter

As the input, coefficients and output of the filter are all assumed to be complex valued, the natural choice for the property measurement is the modulus, or instantaneous amplitude. If y(k) is the complex-valued filter output, then |y(k)| denotes its amplitude. The convergence error p(k) can be defined as follows:

p(k) = A − |y(k)|                                                      (4)

where A is the amplitude in the absence of signal degradations. The error p(k) should be zero when the envelope has the proper value, and non-zero otherwise. The error carries sign information to indicate in which direction the envelope is in error. The adaptive algorithm is defined by specifying a performance/cost/fitness function based on the error p(k) and then developing a procedure that adjusts the filter impulse response so as to minimize or maximize that performance function. The filter output is

y(k) = Σ (i = 0 to N−1) w_k(i) · x(k−i)                                (5)

The gradient search algorithm was selected to simplify the filter design. The filter coefficient update equation is given by

w(k+1) = w(k) + μ · e(k) · x(k)                                        (6)

where x(k) is the filter input at sample k, e(k) = p(k) · y(k) is the error term at sample k, and μ is the step size for updating the weight values.


CHAPTER 5
VHDL
Many DSP applications demand high throughput and real-time response, performance
constraints that often dictate unique architectures with high levels of concurrency. DSP designers
need the capability to manipulate and evaluate complex algorithms to extract the necessary level
of concurrency. Performance constraints can also be addressed by applying alternative
technologies. A change at the implementation level of design by the insertion of a new
technology can often make viable an existing marginal algorithm or architecture.
The VHDL language supports these modeling needs at the algorithm or behavioral level,
and at the implementation or structural level. It provides a versatile set of description facilities to
model DSP circuits from the system level to the gate level. Recently, we have also noticed efforts
to include circuit-level modeling in VHDL. At the system level we can build behavioral models
to describe algorithms and architectures. We would use concurrent processes with constructs
common to many high-level languages, such as if, case, loop, wait, and assert statements. VHDL
also includes user-defined types, functions, procedures, and packages. In many respects VHDL
is a very powerful, high-level, concurrent programming language. At the implementation level we
can build structural models using component instantiation statements that connect and invoke
subcomponents. The VHDL generate statement provides ease of block replication and control. A
dataflow level of description offers a combination of the behavioral and structural levels of
description. VHDL lets us use all three levels to describe a single component. Most importantly,
the standardization of VHDL has spurred the development of model libraries and design and
development tools at every level of abstraction. VHDL, as a consensus description language and
design environment, offers design tool portability, easy technical exchange, and technology
insertion.

VHDL: The language



An entity declaration, or entity, combined with an architecture, or body, constitutes a VHDL model. VHDL calls the entity-architecture pair a design entity. By describing alternative architectures for an entity, we can configure a VHDL model for a specific level of investigation. The entity contains the interface description common to the alternative architectures. It communicates with other entities and the environment through ports and generics. Generic information particularizes an entity by specifying environment constants such as register size or delay value. For example,

entity A is
generic (delay: time);
port (x, y: in real; z: out real);
end A;
The architecture contains declarative and statement sections. Declarations form the region before the reserved word begin and can declare local elements such as signals and components. Statements appear after begin and can contain concurrent statements. For instance,

architecture B of A is
component M
port (j : in real; k : out real);
end component;
signal a, b, c : real := 0.0;
begin
"concurrent statements"
end B;
The variety of concurrent statement types gives VHDL the descriptive power to create and combine models at the structural, dataflow, and behavioral levels into one simulation model. The structural type of description makes use of component instantiation statements to invoke models described elsewhere. After declaring components, we use them in the component instantiation statement, assigning ports to local signals or other ports and giving values to generics:

invert: M port map (j => a, k => c);

We can then bind the components to other design entities through configuration specifications in VHDL's architecture declarative section or through separate configuration declarations. The dataflow style makes wide use of a number of types of concurrent signal assignment statements, which associate a target signal with an expression and a delay. The list of signals appearing in the expression is the sensitivity list; the expression must be evaluated for any change on any of these signals. The target signals obtain new values after the delay specified in the signal assignment statement. If no delay is specified, the signal assignment occurs during the next simulation cycle:

c <= a + b after delay;
VHDL also includes conditional and selected signal assignment statements. It uses block statements to group signal assignment statements and make them synchronous with a guarded condition. Block statements can also contain ports and generics to provide more modularity in the descriptions. We commonly use concurrent process statements when we wish to describe hardware at the behavioral level of abstraction. The process statement consists of declarations and procedural types of statements that make up the sequential program. Wait and assert statements add to the descriptive power of process statements for modeling concurrent actions:

process
variable i : real := 1.0;
begin
wait on a;
i := b * 3.0;
c <= i after delay;
end process;
Other concurrent statements include the concurrent assertion statement, concurrent procedure call, and generate statement. Packages are design units that permit types and objects to be shared.
Arithmetic operations dominate the execution time of most digital signal processing (DSP) algorithms, and currently the time it takes to execute a multiplication operation is still the dominating factor in determining the instruction cycle time of a DSP chip or a reduced instruction set computer (RISC). Among the many methods of implementing high-speed parallel multipliers, one basic approach is the Booth algorithm. Power consumption in VLSI DSPs has gained special attention due to the proliferation of high-performance portable battery-powered electronic devices such as cellular phones, laptop computers, etc. DSP applications require high computational speed and, at the same time, suffer from stringent power dissipation constraints. Multiplier modules are common to many DSP applications. The fastest types of multipliers are parallel multipliers; among these, the Wallace multiplier is among the fastest, but it suffers from poor regularity. Hence, when regularity, high performance and low power are primary concerns, Booth multipliers tend to be the primary choice. Booth multipliers allow operation on signed operands in two's complement. They derive from array multipliers where, for each bit in a partial product line, an encoding scheme is used to determine whether this bit is positive, negative or zero. The modified Booth algorithm achieves a major performance improvement through radix-4 encoding: each partial product line operates on 2 bits at a time, thereby reducing the total number of partial products. This is particularly true for operands using 16 bits or more.


CHAPTER 6
DESIGN PROCEDURE
In this project we first design two different multipliers, one using the shift-and-add method and one using the radix-4 Booth algorithm. We used different types of adders, such as a sixteen-bit full adder, in designing these multipliers. We then designed a 4-tap delay FIR filter and, in place of the multiplications and additions, instantiated the components of the different multipliers and adders. Finally, we compared the two multipliers by comparing the power consumed by each of them.

6.1 DESIGN OF 16_BIT FULL ADDER


For designing the different multipliers and the FIR filter we need a 16-bit full adder. The 16-bit full adder performs the addition of two 16-bit numbers; for this operation we used a two-bit adder as a component. The two-bit full adder performs the addition of two binary digits and produces a sum and a carry as outputs. The intermediate carry generated is propagated to the next higher-order bit, where the addition is performed again and the carry propagated onwards; the whole process is repeated sixteen times. If there is a carry out of the sixteenth iteration, that bit is assigned to cout, and the actual sum is stored in the output signal sum. For example:

    1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
  + 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 0
  ---------------------------------
    0 1 1 1 1 0 0 0 1 1 0 1 0 1 0 0   sum
    carry out (cout) = 1

A VHDL sketch of such an adder follows.
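The following VHDL entity is a minimal illustrative sketch of a 16-bit adder of this kind, written as a simple bit-by-bit ripple loop; it is not the project's actual source code, and the entity and port names are assumptions.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative 16-bit adder: ripple-style bit loop producing sum and carry out.
entity adder16 is
  port ( a, b : in  std_logic_vector(15 downto 0);
         sum  : out std_logic_vector(15 downto 0);
         cout : out std_logic );
end adder16;

architecture behavioral of adder16 is
begin
  process (a, b)
    variable carry : std_logic;
    variable s     : std_logic_vector(15 downto 0);
  begin
    carry := '0';
    for i in 0 to 15 loop
      -- full-adder equations for bit i, with the carry rippling upward
      s(i)  := a(i) xor b(i) xor carry;
      carry := (a(i) and b(i)) or (a(i) and carry) or (b(i) and carry);
    end loop;
    sum  <= s;
    cout <= carry;
  end process;
end behavioral;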

6.2 DESIGN OF ARRAY MULTIPLIER


The array multiplier multiplies two binary numbers by repeated additions and shift operations. Here we designed an 8 x 8 array multiplier, so there are 8 partial products. The inputs a and b hold the multiplier and multiplicand bits respectively. The output signal product contains the product of a and b, which is 16 bits long.
The signal pp contains the partial products, pc holds the carry bits and ps holds the sum of the partial products. After the partial products are computed, each partial product is shifted one bit further to the left than the previous one; all the partial products are then added and assigned to ps, and finally prod is assigned from ps.
As an example, for the multiplicand 78 (01001110) and the multiplier 36 (00100100), only multiplier bits 2 and 5 are 1, so the only non-zero partial products are the multiplicand shifted left by 2 (78 x 4 = 312) and by 5 (78 x 32 = 2496); their sum gives the product 2808 (0000101011111000).

6.3 DESIGN OF RADIX_4 BOOTH MULTIPLIER


The modified Booth algorithm halves the maximum number of partial products. It is derived directly from the Booth algorithm: group the Booth-recoded multiplier digits in pairs and observe the following. The pair (+1 −1) is equivalent to the pair (0 +1); that is, instead of adding −1 times the multiplicand M at shift position i and +1 x M at position i+1, the same result is obtained by adding +1 x M at position i. Other examples are: (+1 0) is equivalent to (0 +2), (−1 +1) is equivalent to (0 −1), and so on. Thus, if the Booth-recoded multiplier is examined two digits at a time, starting from the right, it can be rewritten in a form that requires at most one version of the multiplicand to be added to the partial product for each pair of multiplier bits.
The design of the radix-4 Booth multiplier requires a Booth encoder and a 16-bit adder, so we used them as components. The Booth multiplier examines the multiplier bits three at a time, from right to left, and generates the corresponding partial products; the 16-bit full adder then produces the product by adding the partial products.

6.4 DESIGN OF FIR FILTER


An FIR filter of length M with input x(n) and output y(n) is described by the difference equation

y(n) = Σ (k = 0 to M−1) b(k) · x(n − k)                                (7.1)

where {b(k)} is the set of filter coefficients. Alternatively, we can express the output sequence as the convolution of the unit sample response h(n) of the system with the input signal. From equation (7.1) it is clear that, in order to implement an FIR filter, we need the filter coefficients, an adder and a multiplier.
This project implements the FIR filter using the array multiplier and the modified Booth multiplier. It includes the design of a 4-tap delay FIR filter, which requires 4 filter coefficients; these are declared in the program itself. For the addition and multiplication operations, the adder and multiplier are used as components. A register is declared to store the previous samples, and an accumulator is declared to store the sum of the previous and present products, so after execution the accumulator contains the convolution sum, which is the output of the FIR filter. A simplified VHDL sketch of such a filter is given below.
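The following is a minimal, purely illustrative VHDL sketch of a 4-tap FIR filter of this kind. The entity name, the 8-bit input width, the hard-coded coefficient values and the use of the built-in multiplication operator (instead of the project's own multiplier components) are all assumptions made for this example.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative 4-tap FIR filter: y(n) = sum of coeff(k)*x(n-k) for k = 0..3.
entity fir4 is
  port ( clk, rst : in  std_logic;
         x        : in  signed(7 downto 0);
         y        : out signed(17 downto 0) );
end fir4;

architecture behavioral of fir4 is
  type tap_array is array (0 to 3) of signed(7 downto 0);
  -- coefficient values chosen arbitrarily for illustration
  constant coeff : tap_array := (to_signed(1, 8), to_signed(2, 8),
                                 to_signed(3, 8), to_signed(4, 8));
  signal taps : tap_array := (others => (others => '0'));
begin
  process (clk)
    variable acc : signed(17 downto 0);
  begin
    if rising_edge(clk) then
      if rst = '1' then
        taps <= (others => (others => '0'));
        y    <= (others => '0');
      else
        -- accumulate products: newest sample times coeff(0), older samples times the rest
        acc := resize(x * coeff(0), acc'length);
        for k in 1 to 3 loop
          acc := acc + resize(taps(k-1) * coeff(k), acc'length);
        end loop;
        y <= acc;
        -- shift the delay line and insert the new sample
        taps <= x & taps(0 to 2);
      end if;
    end if;
  end process;
end behavioral;

In the actual project, the multiplication inside the loop is instead performed by instantiating the array multiplier or the radix-4 Booth multiplier as a component, as described above.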


CHAPTER 7
SIMULATION RESULTS
16_BIT ADDER OUTPUT

Figure 7.1 Simulation Results of 16_bit adder


The above waveforms are the simulated results of the 16-bit adder. A and B are the two 16-bit inputs applied to the adder; Yout is the output of the adder, produced by adding the two inputs. Cout is the overflow signal, which is set to 1 if the result exceeds the length of Yout.

OUTPUT OF ARRAY MULTIPLIER

Figure 7.2 Simulation Results of Array multiplier


The above waveforms are the simulated results of the array multiplier. A is the 8-bit multiplicand and B is the 8-bit multiplier. Yout is the 16-bit output signal formed by multiplying A with B. pp, pc and ps are intermediate signals in the process of multiplication.


OUTPUT OF RADIX_4 MULTIPLIER

Figure 7.3 Simulation Results of Radix_4 multiplier


The above waveforms are the simulated results of the radix-4 Booth multiplier. A is the 8-bit multiplicand and B is the 8-bit multiplier. Yout is the 16-bit output signal formed by multiplying A with B. pp1, pp2, pp3 and pp4 are the partial products formed in the process of multiplication, and s1, s2 and s3 are the modified partial products. Sum1, sum2 and sum3 are the intermediate signals formed by adding the modified partial products, and k1, k2 and k3 are the overflow bits in the adder blocks used for adding the partial products.


OUTPUT OF FIR FILTER USING ARRAY MULTIPLIER

Figure 7.4 Simulation Results of FIR filter using Array multiplier


The above are the simulated results of the FIR filter implemented with the array multiplier. X is the input to the filter, Clk is the clock and RST is the reset signal; if reset is set high, the filter does not produce an output. Y is the output signal formed by convolving the input with the filter coefficients. Reg is the register that stores the filter coefficients. C, p, prod, acc, sign and c1 are intermediate signals.


OUTPUT OF FIR FILTER USING RADIX_4 MULTIPLIER

Figure 7.5 Simulation Results of FIR filter using Radix_4 multiplier


The above are the simulated results of the FIR filter implemented with the radix-4 Booth multiplier. X is the input to the filter, Clk is the clock and RST is the reset signal; if reset is set high, the filter does not produce an output. Y is the output signal formed by convolving the input with the filter coefficients. Reg is the register that stores the filter coefficients. C, p, prod, acc, sign and c1 are intermediate signals.


POWER ANALYSIS OF ARRAY MULTIPLIER

Figure 7.6 Power analysis of Array multiplier


XPower Estimator is a tool used to calculate the total power consumed by a circuit. The method to calculate the power is very simple: after synthesizing the program, generate the map report for the written program, then click on "Import from ISE" in the tool and import the map report. After importing the map report, the tool reports the total power. The total power consumed by the array multiplier is 92 mW.


GRAPH ANALYSIS OF ARRAY MULTIPLIER

Figure 7.7 Graph analysis of Array multiplier


The above graphs show the variation of the power consumption of the array multiplier with junction temperature and Vccint voltage. The power consumption increases with increases in junction temperature and Vccint voltage.

POWER ANALYSIS OF RADIX_4 MULTIPLIER

Figure 7.8 Power analysis of Radix_4 multiplier

XPower Estimator is a tool used to calculate the total power consumed by a circuit. The method to calculate the power is very simple: after synthesizing the program, generate the map report for the written program, then click on "Import from ISE" in the tool and import the map report. After importing the map report, the tool reports the total power. The total power consumed by the radix-4 Booth multiplier is 44 mW.


GRAPH ANALYSIS OF RADIX_4 MULTIPLIER

Figure 7.9 Graph analysis of Radix_4 multiplier

The above graphs show the variation of the power consumption of the radix-4 Booth multiplier with junction temperature and Vccint voltage. The power consumption increases with increases in junction temperature and Vccint voltage.

SYNTHESIS REPORT
ARRAY MULTIPLIER

Number of Slices            71
Number of 4 input LUTs      123
Number of bonded INPUTs     16
Number of bonded OUTPUTs    16
CLB Logic Power             99 mW

Table 7.1 Array multiplier synthesis Report

RADIX_4 MULTIPLIER

Number of Slices            101
Number of 4 input LUTs      179
Number of bonded INPUTs     16
Number of bonded OUTPUTs    17
CLB Logic Power             44 mW

Table 7.2 Radix_4 multiplier synthesis Report

CHAPTER 8
CONCLUSION AND FUTURE WORK
CONCLUSION
Our project gives a clear picture of different multipliers and their implementation in a tap-delay FIR filter. We found that parallel multipliers are a much better option than serial multipliers. We concluded this from the results for power consumption and total area: in the case of the parallel multipliers, the total area is much less than that of the serial multipliers, and hence the power consumption is also less. This is clearly depicted in our results. This also speeds up the calculation and makes the system faster.
When we compared the modified Booth multiplier with the array multiplier, we found that the modified Booth multiplier consumes less power. This is because it uses almost half the number of iterations and adders compared to the array multiplier, so the computation speed is also higher for the radix-4 Booth multiplier.
Multipliers are one of the most important components of many systems, so we always need to find better multiplier solutions. Our multipliers should always consume less power and occupy less area. Through our project we tried to determine which of the multipliers works better, and in the end we determined that the radix-4 modified Booth algorithm works best.
FUTURE WORK
As an attempt to develop arithmetic algorithms for low-power multiplier design, the radix-4 Booth algorithm presented in this project has achieved good results. However, there are limitations in our work, and several future research directions are possible.
One possible direction is recoding with a radix higher than 4. We have only considered radix-4 recoding, as it is a simple and popular choice. Higher-radix recoding further reduces the number of PPs and thus has the potential for power saving. However, because of the difficulty of generating hard PPs such as 3X, higher-radix recoding may increase the critical path delay and the design complexity, which are negative factors for power. Thus, there is a power/delay/area trade-off in higher-radix recoding.

REFERENCES
Websites referred:
1. www.wikipedia.com
2. www.howstuffworks.com
3. www.xilinx.com

Books referred:
1. Circuit Design with VHDL, by V. A. Pedroni, pages 285-293.
2. VHDL, by Stefan Sjoholm.
3. VHDL, by J. Bhasker.
4. Digital Signal Processing, by Johnny R. Johnson, PHI Publications.
5. Digital Signal Processing, by A. Vallavaraj and S. Salivahanan, TMH Publications.
6. Computer Organization, fifth edition, by Carl Hamacher, Zvonko Vranesic and Safwat Zaky.