
1. INTRODUCTION
1.1 INTRODUCTION:
In VLSI design, timing, power, and area are the three major constraints, and optimization is carried out on these three factors: area, power, and timing (speed). Area optimization means reducing the silicon area that the logic occupies on the die. This is done in both the front end and the back end of the design. In the front end, writing simplified Boolean expressions and removing unused states minimize gate/transistor utilization. Partitioning, floorplanning, placement, and routing are performed in the back end of the design by CAD tools; the CAD tool runs a specific algorithm for each of these steps to produce an area-efficient design. Power optimization reduces the power dissipation of the design, which is governed by operating voltage, operating frequency, and switching activity. The first two factors are fixed in the design constraints, but switching activity is a parameter that varies dynamically, depending on how the logic is designed and on the input vectors. Timing optimization refers to meeting the user constraints efficiently, without any violation, and beyond that to improving the performance of the design. High-performance designs are achieved by proper placement, routing, and sizing of elements; optimization can also be approached differently, for example by merging memory elements instead of sizing them.
Multiplication in hardware can be implemented in two ways: using more hardware to achieve fast execution, or using less hardware and accepting slow execution. The area and speed of the multiplier are therefore an important trade-off: an increase in speed results in larger area, and vice versa. Multipliers play a vital role in most high-performance systems. Since the performance of a system depends to a great extent on the performance of its multipliers, multipliers should be fast while consuming little area and hardware. This motivated us to study and review multipliers in terms of speed, power consumption, and area. Three well-known multipliers and their drawbacks are discussed below:

Wallace Tree Multiplier

Array Multiplier

Booth Multiplier

1. Wallace tree multiplier:

A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers, devised by the Australian computer scientist Chris Wallace in 1964.
The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n*n results. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a2b3 has weight 32 (see the explanation of weights below).

2. Reduce the number of partial products to two by layers of full and half adders.

3. Group the wires in two numbers, and add them with a conventional adder.
The second phase works as follows. As long as there are three or more wires with the same weight, add a following layer:

Take any three wires with the same weight and input them into a full adder. The result will be an output wire of the same weight and an output wire with a higher weight for each three input wires.

If there are two wires of the same weight left, input them into a half adder.

If there is just one wire left, connect it to the next layer.
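The workhorse of each reduction layer is the full adder used as a 3:2 compressor. A minimal Verilog sketch of such a cell follows (module and signal names are our own illustration, not from any particular library):

module csa_3to2 (
    input  x, y, z,  // three wires of the same weight w
    output sum,      // output wire of weight w
    output carry     // output wire of weight 2w
);
    assign sum   = x ^ y ^ z;
    assign carry = (x & y) | (y & z) | (x & z);
endmodule

The half adder is the analogous 2:2 case, with sum = x ^ y and carry = x & y.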


The benefit of the Wallace tree is that there are only O(log n) reduction layers, and each layer has O(1) propagation delay. As making the partial products is O(1) and the final addition is O(log n), the whole multiplication is only O(log n), not much slower than addition (however, much more expensive in gate count). Naively adding the partial products with regular adders would require O(log^2 n) time. From a complexity-theoretic perspective, the Wallace tree algorithm puts multiplication in the class NC1.

These computations only consider gate delays and don't deal with wire delays, which
can also be very substantial.
The Wallace tree can also be represented by a tree of 3/2 or 4/2 adders.
It is sometimes combined with Booth encoding.

Fig 1.1 Weights

The weight of a wire is the power of two (radix 2) of the digit that the wire carries. In general, the bits of the operands a3a2a1a0 and b3b2b1b0 have indexes 0 to 3, and the weight of the wire carrying the product bit aibj is 2^(i+j); for example, the weight of a2b3 is 2^(2+3) = 32.

Fig: 1.2 Pattern

Example: multiplying a3a2a1a0 by b3b2b1b0:

1. First we multiply every bit by every bit:

weight 1 - a0b0

weight 2 - a0b1, a1b0

weight 4 - a0b2, a1b1, a2b0

weight 8 - a0b3, a1b2, a2b1, a3b0

weight 16 - a1b3, a2b2, a3b1

weight 32 - a2b3, a3b2

weight 64 - a3b3

2. Reduction layer 1:

Pass the only weight-1 wire through, output: 1 weight-1 wire

Add a half adder for weight 2, outputs: 1 weight-2 wire, 1 weight-4 wire

Add a full adder for weight 4, outputs: 1 weight-4 wire, 1 weight-8 wire

Add a full adder for weight 8, and pass the remaining wire through, outputs: 2 weight-8 wires, 1 weight-16 wire

Add a full adder for weight 16, outputs: 1 weight-16 wire, 1 weight-32 wire

Add a half adder for weight 32, outputs: 1 weight-32 wire, 1 weight-64 wire

Pass the only weight-64 wire through, output: 1 weight-64 wire

3. Wires at the output of reduction layer 1:


weight 1 - 1

weight 2 - 1

weight 4 - 2

weight 8 - 3

weight 16 - 2

weight 32 - 2

weight 64 - 2

4. Reduction layer 2:

Add a full adder for weight 8, and half adders for weights 4, 16, 32, 64.

5. Outputs:

weight 1 - 1

weight 2 - 1

weight 4 - 1

weight 8 - 2

weight 16 - 2

weight 32 - 2

weight 64 - 2

weight 128 - 1

6. Group the wires into a pair of integers and add them with a conventional adder.
Drawbacks:

Cannot be implemented effectively on an FPGA

High power consumption

High complexity

Array Multiplier:
Digital multiplication entails a sequence of additions carried out on partial products.

Fig: 1.3

In an array multiplier the partial products are computed independently, in parallel. Let us consider two binary numbers A and B of m and n bits respectively.

Fig: 1.4

There are m*n summands, produced in parallel by a set of m*n AND gates.
If m = n, the multiplier requires n(n-2) full adders, n half adders, and n*n AND gates, and the worst-case delay is (2n+1)td, where td is the delay of a single gate. For example, a 4x4 array multiplier needs 8 full adders, 4 half adders, and 16 AND gates, with a worst-case delay of 9td.

Basic cell of a parallel array multiplier:

Fig: 1.5

Array structure of parallel multiplier:



Fig: 1.6

Consider computing the product of two 4-bit integer numbers given by A3A2A1A0
(multiplicand) and B3B2B1B0 (multiplier). The product of these two numbers can be
formed as shown below.

Fig: 1.7

Each of the ANDed terms is referred to as a partial product. The final product (the
result) is formed by accumulating (summing) down each column of partial products.
Any carries must be propagated from the right to the left across the columns.
Since we are dealing with binary numbers, the partial products reduce to simple AND
operations between the corresponding bits in the multiplier and multiplicand. The
sums down each column can be implemented using one or more 1-bit binary adders.
Any adder that may need to accept a carry from the right must be a full adder. If there
is no possibility of a carry propagating in from the right, then a half adder can be used
instead, if desired (a full adder can always be used to implement a half adder if the
carry-in is tied low). The diagram below illustrates a combinational circuit for performing the 4x4 binary multiplication.

The initial layer of AND gates forms the sixteen partial products that result from
ANDing all combinations of the four multiplier bits with the four multiplicand bits.
The column sums are formed using a combination of half and full adders. Look again
at the first two illustrations of the binary multiplication process above, and make a
careful comparison with the figure below.

Fig: 1.8

The adder blocks (indicated by FA and HA) in the figure above are drawn in such a way that the two bits to be added enter from the top, any carry in from the right enters from the right, and any carry out exits from the left of each block. The output from the bottom of a block is the sum.

The least significant output bit, S0 (the first column), involves only two input bits and is computed as the simple output of an AND gate.

The next output bit, S1, involves the sum of two partial products. A half adder is used
to form the sum since there can be no carry in from the first column.
The third output bit, S2, is formed from the sum of three (1-bit) partial products plus a possible carry in from the previous bit. This operation requires two cascaded adders (one half adder and one full adder) to sum the four possible input bits (three partial products and one possible carry in from the right).

The remaining output bits are formed similarly. Because in some columns we must
add more than two binary numbers, there may be more than one carry out generated to
the left.
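To make the column accumulation concrete, here is a compact behavioral Verilog sketch of the 4x4 array multiplier (module and signal names are our own illustration; a gate-level design would instantiate the HA/FA cells exactly as drawn in Fig 1.8):

module array4x4 (
    input  [3:0] a, b,  // multiplicand and multiplier
    output [7:0] p      // 8-bit product
);
    // The first layer of AND gates: one row of partial products per multiplier bit.
    wire [3:0] pp0 = a & {4{b[0]}};
    wire [3:0] pp1 = a & {4{b[1]}};
    wire [3:0] pp2 = a & {4{b[2]}};
    wire [3:0] pp3 = a & {4{b[3]}};

    // Accumulate the shifted rows; the HA/FA columns of Fig 1.8
    // implement this same sum in hardware.
    assign p = pp0 + (pp1 << 1) + (pp2 << 2) + (pp3 << 3);
endmodule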
Drawbacks:

Low speed

More area

More power

Booth's Multiplier:
Booth's algorithm is a smart technique for multiplying signed numbers. Starting from the ability to both add and subtract, there are multiple ways to compute a product. Booth's algorithm is a multiplication algorithm that uses the two's-complement notation of signed binary numbers. Earlier, multiplication was generally implemented via sequences of addition, subtraction, and shift operations. Multiplication can be thought of as a series of repeated additions. The number to be added is known as the multiplicand, the number of times it is added is known as the multiplier, and the result is the product. After each step of addition a partial product is generated. When the operands are integers, the product is in general twice the length of the operands, in order to preserve the information content. The repetitive-addition method suggested by the arithmetic definition is slow and is almost always replaced by an algorithm that makes use of positional representation. A multiplier can be decomposed into two parts: the first part is dedicated to the generation of partial products, and the second part collects and adds them. The fundamental multiplication principle is thus twofold: evaluation of the partial products and accumulation of the shifted partial products, performed by consecutive additions of the columns of the shifted partial-product matrix. The delayed, gated instances of the multiplicand must all be in the same column of the shifted partial-product matrix; they are then added to form the product bit for that column. Multiplication is thus a multi-operand operation. To extend multiplication to both signed and unsigned numbers, a suitable number system is the representation of numbers in two's-complement format.
MULTIPLICATION ALGORITHM:
A circuit that multiplies two unsigned n-bit binary numbers can use a two-dimensional array of identical subcircuits, each containing a full adder and an AND gate. For a large number of bits this approach may not be appropriate because of the large number of gates needed. Another approach is to use a shift register in combination with an adder to implement the traditional method of multiplication:

P = 0;
for i = 0 to n-1 do
  if bi = 1 then
    P = P + A;
  end if;
  Left shift A;
end for;
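A behavioral Verilog rendering of this shift-and-add loop might look as follows (a sketch under our own naming; a sequential design would instead spread the loop over n clock cycles using a shift register and a single adder):

module shift_add_mult #(parameter N = 4) (
    input  [N-1:0] a,        // multiplicand A
    input  [N-1:0] b,        // multiplier B
    output reg [2*N-1:0] p   // product P
);
    integer i;
    always @(*) begin
        p = 0;
        for (i = 0; i < N; i = i + 1)
            if (b[i])
                p = p + (a << i);  // when bi = 1, add the left-shifted A
    end
endmodule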

Fig:1.9

Signed multiplication is a careful process. With unsigned multiplication there is no need to take the sign of the number into consideration. In signed multiplication, however, the same procedure cannot be applied, because the signed number is in two's-complement form, which would give an inaccurate result if multiplied in a manner analogous to unsigned multiplication. This is where Booth's algorithm comes in: it preserves the sign of the result. During multiplication, strings of 0s in the multiplier call for only shifting, and strings of 1s need an operation only at each end; we need to add or subtract only at positions in the multiplier where there is a switch from 0 to 1 or from 1 to 0. In the following flow chart we have b = multiplier, a = multiplicand, m = product. We require twice as many bits in the product as in each of the two operands. The leftmost bit of each operand, both multiplicand and multiplier, is always a sign bit and cannot be used as part of the value. Then choose which operand will be the multiplier and which the multiplicand. Negative operands are represented in two's-complement form. Begin with a product that consists of the multiplier with an additional X leading zero bits. Now check the LSB and the previous LSB of the product to determine the arithmetic action (use 0 as the previous LSB on the FIRST pass). The possible arithmetic actions are:
00: no arithmetic operation is performed; only shifting is done.
01: add the multiplicand to the left half of the product, then shift.
10: subtract the multiplicand from the left half of the product, then shift.
11: no arithmetic operation is performed; only shifting is done.
Example:
Multiply 10 by -7 using 5-bit numbers (10-bit result). 10 in binary is 01010; -10 in binary is 10110 (so we can add 10110 whenever we need to subtract the multiplicand); -7 in binary is 11001. The expected result is -70 in binary: (11101 11010). The steps of the algorithm are:
Step 1:
(00000 11001 0); as the last two bits are 10, we compute 00000 + 10110 = 10110, giving (10110 11001 0); an arithmetic right shift (ARS) then gives (11011 01100 1).
Step 2:
As the last two bits are 01, 11011 + 01010 = 00101 (the carry is ignored, because the addition of a +ve and a -ve number cannot overflow), giving (00101 01100 1); ARS gives (00010 10110 0).
Step 3:
As the last two bits are 00, there is no change; only ARS is done, giving (00001 01011 0).
Step 4:
As the last two bits are 10, 00001 + 10110 = 10111, giving (10111 01011 0); ARS gives (11011 10101 1).
Step 5:
As the last two bits are 11, there is no change; only ARS takes place, giving (11101 11010 1).
Step 6:
Ignoring the last bit, the product is (11101 11010) = -70.
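As an illustration of the recoding rules above, the following behavioral Verilog sketch (our own naming, not the project's source) walks the product register through the add/subtract/shift steps; with N = 5, a = 10, b = -7 it reproduces the register values of the worked example:

module booth_mult #(parameter N = 5) (
    input  signed [N-1:0] a,       // multiplicand
    input  signed [N-1:0] b,       // multiplier
    output reg signed [2*N-1:0] p  // product
);
    reg [2*N:0] acc;  // {upper half, multiplier, extra previous-LSB bit}
    integer i;
    always @(*) begin
        acc = {{N{1'b0}}, b, 1'b0};  // load multiplier, previous LSB = 0
        for (i = 0; i < N; i = i + 1) begin
            case (acc[1:0])
                2'b01: acc[2*N:N+1] = acc[2*N:N+1] + a;  // 01: add multiplicand
                2'b10: acc[2*N:N+1] = acc[2*N:N+1] - a;  // 10: subtract multiplicand
                default: ;                               // 00 or 11: shift only
            endcase
            acc = {acc[2*N], acc[2*N:1]};  // arithmetic right shift
        end
        p = acc[2*N:1];  // drop the extra bit to get the 2N-bit product
    end
endmodule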

Drawbacks:

Most complex algorithm

Trade-off between speed and area
Parameter           Array multiplier   Wallace tree multiplier   Booth multiplier
Operation speed     Less               High                      Highest
Time delay          More               Medium                    Less
Area                More area          Medium                    Minimum
Complexity          Less               More                      Most
Power consumption   Most               More                      Less

These drawbacks can be overcome by the Vedic multiplier.


1.2 INTRODUCTION TO VEDIC MATHS
Complex multiplication is of immense importance in Digital Signal Processing (DSP) and Image Processing (IP). To implement the hardware modules of the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), and modern broadband communications, large numbers of complex multipliers are required. Complex-number multiplication is performed using four real-number multiplications and two additions/subtractions. In real-number processing, a carry needs to be propagated from the least significant bit (LSB) to the most significant bit (MSB) when binary partial products are added. Therefore, the addition and subtraction after the binary multiplications limit the overall speed. Many alternative methods have been proposed for complex-number multiplication, such as algebraic transformation, bit-serial multiplication using offset binary and distributed arithmetic, the CORDIC (coordinate rotation digital computer) algorithm, the quadratic residue number system (QRNS), and, recently, the redundant complex number system (RCNS). Blahut et al. proposed a technique for complex-number multiplication based on algebraic transformation; this transformation saves one real multiplication at the expense of three additions compared with the direct implementation. A left-to-right array for fast multiplication was reported in 2005, but the method was not extended to complex multiplication. All of the above techniques require either a large overhead for pre/post-processing or long latency. Furthermore, many design issues, such as speed, accuracy, design overhead, and power consumption, have not been addressed for fast multiplication. At the algorithmic and structural levels, many multiplication techniques have been developed to enhance the efficiency of the multiplier, addressing the reduction of the partial products and/or the methods for adding them, but the principle behind multiplication was the same in all cases. Vedic Mathematics is the ancient system of Indian mathematics, which has a unique technique of calculation based on 16 Sutras (formulae). "Urdhva-Tiryakbhyam" is a Sanskrit word meaning vertically and crosswise; this formula is used for fast multiplication. All these formulae are adopted from ancient Indian Vedic Mathematics. In this work we apply this mathematics to design the complex multiplier architecture at the transistor level with two clear goals in mind: i) simplicity and modularity of the multiplications for VLSI implementation, and ii) the elimination of carry propagation for rapid additions and subtractions. Mehta et al. proposed a multiplier design using the "Urdhva-Tiryakbhyam" sutra, adopted from the Vedas; the formulation using this sutra is similar to modern array multiplication, which also indicates the same carry-propagation issues. Multiplier implementation at the gate level (FPGA) using Vedic Mathematics has already been reported, but to the best of our knowledge there is to date no report on a transistor-level (ASIC) implementation of such a complex multiplier. By employing Vedic mathematics, an N-bit complex-number multiplication is transformed into four multiplications for the real and imaginary terms of the final product. In this paper we report a novel high-speed complex multiplier design using ancient Indian Vedic mathematics.
The sutras of Vedic mathematics, with their meanings, are listed in the table below:


S.No  Sutra Name                        Meaning
1     (Anurupye) Shunyamanyathu         If one value is in ratio, the other is zero
2     Chalana-Kalanabhyam               Differences and similarities
3     Ekadhikina & Ekanyunena Purvena   By one more or less than the previous one
4     Gunakkasamuchhyah                 Factors of the sum is equal to the sum
5     Guniitasamuchhyah                 The product of the sum is equal to the sum
6     Nikilam Navatashcaramam           Many from 9, before 10
7     Paravartya Yojayethu              Taking transpose & adjust
8     Puranapuranabhyamm                By the ending or no ending
9     Sankhalana Vyavakhalanabhyam      Add & subtract
10    Sesanyaankena Caramena            Remainder by the ending digit
11    Sunyam Samyasamucaye              Sum is the zero for the same sum
12    Soopantyadvayamantyam             The last and double the last before
13    Urdhva-Tiryakbhyam                Vertically & crosswise
14    Vyashtisamanstih                  Part and whole
15    Yaavadunam                        Whatever the extent of its deficiency

Table 1: Sutras of Vedic mathematics

1.3 MATHEMATICAL FORMULATION OF VEDIC MATHS

The gifts of ancient Indian mathematics to the world history of mathematical science are not well recognized. The contributions of the saint and mathematician 'Sri Bharati Krsna Thirthaji Maharaja' in the field of number theory, in the form of Vedic Sutras (formulae), are significant for calculations. He explored the mathematical potential of the Vedic primers and showed that mathematical operations can be carried out mentally to produce fast answers using the Sutras. Vedic Mathematics is the ancient system of Indian mathematics, which has a unique technique of calculation based on 16 Sutras (formulae). "Urdhva-Tiryakbhyam" is a Sanskrit word meaning vertically and crosswise; this formula is used for smaller-number multiplication, while the "Nikhilam" formula is used for large-number multiplication and subtraction. All these formulae are adopted from ancient Indian Vedic Mathematics.

1.4 PROPOSED MULTIPLIER ARCHITECTURE DESIGN

Design Factors of Multiplication:
Latency, throughput, area, and design complexity are the important factors in choosing a suitable design for a given requirement. Latency is a measure of how long after the inputs to a device are stable the final result is available on the outputs. Throughput is a measure of how many multiplications can be performed in a given period of time.
Urdhva Tiryakbhyam Sutra: The basic Sutras, and the Urdhva Tiryakbhyam Sutra in particular, help to carry out almost all numeric computations in an easy and fast manner. The Sutra we employ in this project is Urdhva Tiryakbhyam (multiplication).
Description of the Sutra:
This is the general formula applicable to all cases of multiplication. Urdhva Tiryakbhyam means "vertically and crosswise", which is the method of multiplication followed.
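As a quick worked example of the sutra in decimal (our own illustration, not taken from the figures): to compute 46 x 43, multiply vertically on the right (6 x 3 = 18: write 8, carry 1), then crosswise (4 x 3 + 4 x 6 + 1 = 37: write 7, carry 3), then vertically on the left (4 x 4 + 3 = 19: write 19), giving 1978. The binary illustrations in the figures below follow exactly the same vertical-and-crosswise pattern.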
Illustration:

Fig: 1.10 2-Bit multiplication by Urdhva Tiryagbhyam

Fig: 1.11 3-Bit multiplication by Urdhva Tiryagbhyam

Design of the 16x16 Multiplier

The Fundamental Block (2x2 block):
In the design of the proposed Vedic multiplier, a 2x2 block is the fundamental (basic) block. The symbol of this fundamental block is also shown, as used later in the 4x4-bit multiplier. In binary multiplication we simply AND pairs of bits using 2-input AND gates. First of all, the vertical bits (LSBs) are ANDed; this yields the LSB of the result. Then we AND the crosswise bits, and the two crosswise products are added using a half adder. The sum output of this half adder is the next bit of the result, just above the LSB. The carry output is added in a second half adder to the AND of the MSBs; the sum of this adder is the third result bit, and its carry is the MSB of the result.
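The description above maps directly to a few gates. Below is a minimal Verilog sketch of the 2x2 fundamental block (module and signal names are our own, not necessarily those of the project source):

module vedic2x2 (
    input  [1:0] a,  // multiplicand
    input  [1:0] b,  // multiplier
    output [3:0] c   // 4-bit product
);
    wire carry;
    assign c[0]  = a[0] & b[0];                    // vertical: the LSBs
    assign c[1]  = (a[1] & b[0]) ^ (a[0] & b[1]);  // crosswise, half adder sum
    assign carry = (a[1] & b[0]) & (a[0] & b[1]);  // half adder carry
    assign c[2]  = (a[1] & b[1]) ^ carry;          // vertical MSBs + carry: sum
    assign c[3]  = (a[1] & b[1]) & carry;          // second half adder carry = MSB
endmodule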

Fig: 1.12 2x2 Vedic Multiplier


Design of the 4x4 block:
The design of the 4x4 block is a simple arrangement of 2x2 blocks in an optimized manner. The first step in the design of the 4x4 block is grouping the bits of each 4-bit input into pairs of 2 bits. These pairs form the vertical and crosswise product terms. Each input bit-pair is handled by a separate 2x2 Vedic multiplier; the figure shows the schematic of a 4x4 block designed using 2x2 blocks. The partial products represent the Urdhva vertical and cross product terms. The first two bits of the right-most 2x2 Vedic multiplier's output are sent directly to the first two output bits. The remaining partial products are handled by the 4-bit and 6-bit adders, as shown in the figure and in the sketch below.
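A structural Verilog sketch of this arrangement, reusing the 2x2 block from the previous sketch, might look as follows (the adder wiring follows the description above; the exact partitioning in the project's figure may differ slightly):

module vedic4x4 (
    input  [3:0] a,
    input  [3:0] b,
    output [7:0] c
);
    wire [3:0] q0, q1, q2, q3;  // the four 2x2 partial products
    wire [4:0] s1;              // "4-bit adder" result (with carry)
    wire [5:0] s2;              // "6-bit adder" result

    vedic2x2 m0 (a[1:0], b[1:0], q0);  // vertical, lower pairs
    vedic2x2 m1 (a[3:2], b[1:0], q1);  // crosswise
    vedic2x2 m2 (a[1:0], b[3:2], q2);  // crosswise
    vedic2x2 m3 (a[3:2], b[3:2], q3);  // vertical, upper pairs

    assign c[1:0] = q0[1:0];                 // LSBs pass straight to the output
    assign s1 = q1 + q2;                     // the 4-bit adder
    assign s2 = {q3, q0[3:2]} + {1'b0, s1};  // the 6-bit adder
    assign c[7:2] = s2;
endmodule

The 8x8 and 16x16 blocks described next repeat this same structure one level up, with wider adders.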

Fig:1.13 4x4 Vedic Multiplier


Design of the 8x8 Vedic Multiplier:

The design of the 8x8 block is a similar arrangement of 4x4 blocks in an optimized manner. The first step in the design of the 8x8 block is grouping the 4 bits (nibble) of each 8-bit input. These nibble pairs form the vertical and crosswise product terms. Each input nibble is handled by a separate 4x4 Vedic multiplier to produce the partial-product rows. The first four bits of the right-most 4x4 Vedic multiplier's output are sent directly to the output; the other bits are handled by the 8-bit and 12-bit adders respectively, as shown. The figure shows the schematic of an 8x8 block designed using 4x4 blocks. The partial products represent the Urdhva vertical and cross product terms.

Fig 1.14 8x8 Vedic Multiplier

Design of the 16x16 Multiplier

The design of the 16x16 block is a similar arrangement of 8x8 blocks in an optimized manner, as in the figure. The first step in the design of the 16x16 block is grouping the 8 bits (byte) of each 16-bit input. The lower and upper byte pairs of the two inputs form the vertical and crosswise product terms. Each input byte is handled by a separate 8x8 Vedic multiplier to produce the partial-product rows. The first 8 bits of the right-most multiplier's output are sent directly to the output; the other bits are handled by the 24-bit and 16-bit adders.

Fig: 1.15 16 Bit Multiplier

2. SOFTWARE REQUIREMENTS
2.1 XILINX:
Xilinx, Inc. is an American technology company, primarily a supplier of programmable logic devices. It is known for inventing the field-programmable gate array (FPGA) and as the first semiconductor company with a fabless manufacturing model.
Xilinx designs, develops and markets programmable logic products, including
integrated circuits (ICs), software design tools, predefined system functions delivered
as intellectual property (IP) cores, design services, customer training, field
engineering and technical support. Xilinx sells both FPGAs and CPLDs for electronic
equipment manufacturers in end markets such as communications, industrial,
consumer, automotive and data processing.

2.2 XILINX ISE:
Xilinx ISE (Integrated Software Environment) is a software tool produced by
Xilinx for synthesis and analysis of HDL designs, enabling the developer to
synthesize ("compile") their designs, perform timing analysis, examine RTL
diagrams, simulate a design's reaction to different stimuli, and configure the target
device with the programmer.


ISim provides a complete, full-featured HDL simulator integrated within ISE.


HDL simulation now can be an even more fundamental step within your design flow
with the tight integration of the ISim within your design environment.
ISim Key Features:

Mixed language support

Supports VHDL-93 and Verilog 2001

No special license requirements

Multi-Threaded compilation

Post-Processing capabilities

Standalone Waveform viewing capabilities

Debug capabilities

Waveform tracing, waveform viewing, HDL source debugging

Memory Editor for viewing and debugging memory elements

Single click re-compile and re-launch of simulation

Integrated with ISE Design Suite and PlanAhead application

Easy to use - One-click compilation and simulation

Additional mapping or compilation not required.


3. HARDWARE REQUIREMENTS
3.1 SPARTAN 3:
The Spartan-3 family of Field-Programmable Gate Arrays is specifically
designed to meet the needs of high volume, cost-sensitive consumer electronic
applications. The eight-member family offers densities ranging from 50,000 to
5,000,000 system gates. The Spartan-3 family is a superior alternative to mask
programmed ASICs. FPGAs avoid the high initial cost, the lengthy development
cycles, and the inherent inflexibility of conventional ASICs. Also, FPGA
programmability permits design upgrades in the field with no hardware replacement
necessary, an impossibility with ASICs.


3.1.1 Architectural Overview:


The Spartan-3 family architecture consists of five fundamental programmable
functional elements:
Configurable Logic Blocks (CLBs) contain RAM-based Look-Up Tables
(LUTs) to implement logic and storage elements that can be used as flip-flops
or latches. CLBs can be programmed to perform a wide variety of logical
functions as well as to store data.
Input/output Blocks (IOBs) control the flow of data between the I/O pins and
the internal logic of the device. Each IOB supports bidirectional data flow plus
3-state operation. Twenty-six different signal standards, including eight high-performance differential standards, are available as shown in Table 2. Double
Data-Rate (DDR) registers are included. The Digitally Controlled Impedance
(DCI) feature provides automatic on-chip terminations, simplifying board
designs.
Block RAM provides data storage in the form of 18-Kbit dual-port blocks.
Multiplier blocks accept two 18-bit binary numbers as inputs and calculate the
product.
Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital
solutions for distributing, delaying, multiplying, dividing, and phase shifting
clock signals.
These elements are organized as shown in Fig 3.1. A ring of IOBs surrounds
a regular array of CLBs. The XC3S50 has a single column of block RAM
embedded in the array. Those devices ranging from the XC3S200 to the
XC3S2000 have two columns of block RAM. The XC3S4000 and XC3S5000
devices have four RAM columns. Each column is made up of several 18-Kbit
RAM blocks; each block is associated with a dedicated multiplier. The DCMs
are positioned at the ends of the outer block RAM columns.
The Spartan-3 family features a rich network of traces and switches that
interconnect all five functional elements, transmitting signals among them.
Each functional element has an associated switch matrix that permits multiple
connections to the routing.


Fig 3.1Spartan-3 Family Architecture

3.1.2 Configuration:
Spartan-3 FPGAs are programmed by loading configuration data into robust
reprogrammable static CMOS configuration latches (CCLs) that collectively control
all functional elements and routing resources. Before powering on the FPGA,
configuration data is stored externally in a PROM or some other nonvolatile medium
either on or off the board. After applying power, the configuration data is written to
the FPGA using any of five different modes: Master Parallel, Slave Parallel, Master
Serial, Slave Serial, and Boundary Scan (JTAG). The Master and Slave Parallel
modes use an 8-bit-wide SelectMAP port.
The recommended memory for storing the configuration data is the low-cost Xilinx
Platform Flash PROM family, which includes the XCF00S PROMs for serial
configuration and the higher density XCF00P PROMs for parallel or serial
configuration.

3.2 PACKAGE MARKING:


Fig 3.2 shows the top marking for Spartan-3 FPGAs in the quad-flat packages. The 5C and 4I part combinations may be dual marked as 5C/4I.
Devices with the dual mark can be used as either -5C or -4I devices. Devices with a
single mark are only guaranteed for the marked speed grade and temperature range.


Fig 3.2: Spartan-3 FPGA QFP Package Marking Example for Part Number XC3S400-

4PQ208C

3.3 ORDERING INFORMATION:


Spartan-3 FPGAs are available in both standard (Figure 3.3) and Pb-free
(Figure 3.4) packaging options for all device/package combinations. The Pb-free
packages include a special G character in the ordering code.

Fig 3.3: Standard Packaging

Fig 3.4: Pb-Free Packaging


4. APPLICATIONS
4.1The Implementation of Vedic Algorithms in Digital Signal
Processing
Digital signal processing (DSP) is the technology that is omnipresent in almost every
Engineering discipline. It is also the fastest growing technology this century and,
therefore, it poses tremendous challenges to the engineering community. Faster
additions and multiplications are of extreme importance in DSP for convolution,
discrete Fourier transforms, digital filters, etc. The core computing process is always a multiplication routine; therefore, DSP engineers are constantly looking for new algorithms and hardware to implement them. Vedic mathematics is the name given to
the ancient system of mathematics, which was rediscovered, from the Vedas between
1911 and 1918 by Sri Bharati Krishna Tirthaji. The whole of Vedic mathematics is
based on 16 sutras (word formulae) and manifests a unified structure of mathematics.
As such, the methods are complementary, direct and easy. The authors highlight the
use of multiplication process based on Vedic algorithms and its implementations on
8085 and 8086 microprocessors, resulting in appreciable savings in processing time.
The exploration of Vedic algorithms in the DSP domain may prove to be extremely
advantageous. Engineering institutions now seek to incorporate research-based studies
in Vedic mathematics for its applications in various engineering processes. Further
research prospects may include the design and development of a Vedic DSP chip
using VLSI technology.

4.2Discrete Fourier Transform (DFT) by using Vedic Mathematics


The Vedic mathematical methods suggested by Shankaracharya Sri Bharti Krishna Tirthaji through his book offer efficient alternatives. The present work analyses and compares the implementation of the DFT algorithm by the existing technique and by the Vedic mathematical technique. It is suggested that architectural-level changes in the entire computation system to accommodate the Vedic mathematical method will increase the overall efficiency of the DFT procedure.

5. RESULTS
5.1 16x16 Vedic Multiplier:
Device Utilization Summary:


Fig:5.0 Device Utilization summary

RTL Schematic of 16x16 Multiplier:


Fig 5.1 shows the schematic diagram of the 16x16 multiplier. Here we find the multiplication output for the given 16-bit inputs.

Fig: 5.1 RTL Schematic

Simulation result of 16 bit multiplier:


Fig 5.2 shows the simulation result of the 16-bit multiplier.

From the figure we can infer that once the inputs are applied, the corresponding product value is determined.

5.2 8x8 Vedic Multiplier:

Fig5.3: RTL Schematic of 8x8 Vedic Multiplier

Fig 5.3 shows the schematic diagram of the 8x8 multiplier.


Fig 5.4: Simulation result of 8x8 multiplier

5.3 4x4 Vedic Multiplier:


RTL schematic of 4x4 Vedic Multiplier

Fig5.5: RTL Schematic of 4x4 Multiplier

Fig 5.5 shows the schematic diagram of the 4x4 multiplier.


Fig5.6: Simulation Result of 4x4 Multiplier Block

Fig 5.6 shows the simulation result of the 4x4 multiplier block.

5.4 2x2 Vedic Multiplier:


RTL Schematic of 2x2 Vedic Multiplier

Fig5.7: RTL Schematic of 2x2 Multiplication Block


Fig5.8: Simulation result for a 2x2 vedic multiplier

5.5 Full Adders:


5.5.1 24 Bit Full Adder
RTL Schematic of 24 Bit full Adder

Fig: 5.9 RTL Schematic


Simulation Result of 24 Bit Full Adder:

Fig: 5.10 Simulation Result

5.5.2 16 Bit Full Adder:


RTL Schematic of 16 Bit Full Adder:

Fig:5.11 RTL Schematic


Simulation Result of 16 Bit Full Adder:

Fig:5.12 Simulation Result

5.5.3 12 Bit Full Adder


RTL Schematic of 12 Bit Full Adder:

Fig: 5.13 RTL Schematic


Simulation Result of 12 Bit Full Adder

Fig: 5.14 Simulation Result

5.5.4 8 Bit Full Adder:


RTL Schematic of 8 Bit Full Adder:

Fig:5.15 RTL Schematic


Simulation Result of 8bit Full adder:

Fig: 5.16 Simulation Result

5.5.5 6 Bit Full Adder:


RTL Schematic of 6 Bit Full Adder

Fig: 5.17 RTL Schematic


Simulation Result of 6 Bit Full Adder:

Fig:5.18 Simulation Result

5.5.6 4 Bit Full Adder:


RTL Schematic of 4 Bit Full Adder:

Fig: 5.19 RTL Schematic


Simulation Result of 4 Bit Full Adder:

Fig 5.20 Simulation Result

5.5.7 1 Bit Full Adder:


RTL Schematic of 1 Bit Full Adder

Fig: 5.21 RTL Schematic


Simulation Result of 1 Bit Full Adder

Fig: 5.22 Simulation Result

5.5.8 Half Adder:


RTL Schematic of Half Adder:

Fig: 5.23 RTL Schematic


Simulation Result of Half Adder:

Fig: 5.24 Simulation Result

5.6 FPGA IMPLEMENTATION:

Fig 5.25: Spartan 3 Board

Fig 5.25 shows a snapshot of the Spartan-3 board. a and b are the inputs and c is the output.


Fig 5.26: Output c = 1073938437 (01000000000000110000000000000101) when a = 32769 (1000000000000001) and b = 32773 (1000000000000101)

5.7 Timing Report:


NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
GENERATED AFTER PLACE-and-ROUTE.

Clock Information:
------------------
No clock signals found in this design

Asynchronous Control Signals Information:
-----------------------------------------
No asynchronous control signals found in this design

Timing Summary:
---------------
Speed Grade: -5

Minimum period: No path found

Minimum input arrival time before clock: No path found
Maximum output required time after clock: No path found
Maximum combinational path delay: 29.126ns

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default path analysis
Total number of paths / destination ports: 30173 / 16
-------------------------------------------------------------------------
Delay:        29.126ns (Levels of Logic = 18)
Source:       a<1> (PAD)
Destination:  c<15> (PAD)

Data Path: a<1> to c<15>


                           Gate    Net
 Cell:in->out    fanout   Delay   Delay   Logical Name (Net Name)
----------------------------------------  ------------
 IBUF:I->O          24    0.715   1.822   a_1_IBUF (a_1_IBUF)
 LUT2:I0->O          4    0.479   0.802   z1/z5/fa1/Mxor_sum_Result311 (N94)
 LUT4:I3->O               0.479   0.976   z1/z5/fa1/Mxor_sum_Result26 (z1/z5/fa1/Mxor_sum_Result26)
 LUT3:I0->O          2    0.479   0.745   z1/z5/fa1/Mxor_sum_Result28 (z1/q4<1>)
 MUXF5:S->O          3    0.540   0.941   z1/z7/fa1/c_out_f5 (z1/z7/c2)
 LUT3:I1->O          2    0.479   0.804   z1/z7/fa3/Mxor_sum_Result11 (N9)
 LUT3:I2->O          2    0.479   1.040   z1/z7/fa3/c_out1 (z1/z7/c4)
 LUT4:I0->O          2    0.479   1.040   z1/z7/fa4/Mxor_sum_Result65 (q0<6>)
 LUT3:I0->O          2    0.479   0.915   z5/fa2/c_out1 (z5/c3)
 LUT4:I1->O          2    0.479   0.804   z5/fa3/Mxor_sum_Result1 (q4<3>)
 LUT3:I2->O          2    0.479   0.768   z7/fa3/c_out1 (z7/c4)
 LUT4:I3->O          2    0.479   0.915   z7/fa4/c_out1 (z7/c5)
 LUT3:I1->O          2    0.479   0.915   z7/fa5/c_out1 (z7/c6)
 LUT3:I1->O          4    0.479   0.949   z7/fa6/c_out1 (z7/c7)
 LUT3:I1->O          2    0.479   0.804   z7/fa7/c_out1 (z7/c8)
 LUT3:I2->O          1    0.479   0.851   z7/fa9/c_out1 (z7/c10)
 LUT4:I1->O          1    0.479   0.681   z7/fa11/Mxor_sum_Result1 (c_15_OBUF)
 OBUF:I->O                4.909           c_15_OBUF (c<15>)
----------------------------------------
 Total                    29.126ns (13.349ns logic, 15.777ns route)
                                   (45.8% logic, 54.2% route)


6. CONCLUSION AND FUTURE SCOPE

The multiplier design based on the formulae of ancient Indian Vedic Mathematics is highly suitable for high-speed complex arithmetic circuits, which have wide application in VLSI signal processing. The implementation was done on a SPARTAN-3 FPGA and compared with widely used architectures such as distributed arithmetic, the parallel-adder-based implementation, and the algebraic-transformation-based implementation. This architecture combines the advantages of Vedic mathematics for multiplication, which reduces the number of stages and the partial-product reduction effort. The proposed multiplier offered a delay of 29.12 ns, which is about 10 percent faster than the Booth multiplier, with improvements in propagation delay and power consumption respectively.
The project can be further extended by increasing the bit size, and further extensions can be made to signed numbers and complex numbers as well.


7. REFERENCES
[1] P. K. Saha, A. Banerjee, and A. Dandapat, "High Speed Low Power Complex Multiplier Design Using Parallel Adders and Subtractors," International Journal on Electronic and Electrical Engineering (IJEEE).
[2] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electronic Comput., vol. EC-13, pp. 14-17, Dec. 1964.
[3] Prabir Saha, Arindam Banerjee, Partha Bhattacharyya, Anup Dandapat, "High speed ASIC design of complex multiplier using Vedic mathematics," Proceedings of the 2011 IEEE Students' Technology Symposium, 14-16 January 2011, IIT Kharagpur, pp. 237-241.
[4] https://learn.digilentinc.com/Documents/259
[5] http://www.xilinx.com/support/documentation
[6] http://electronicsforu.com
[7] http://en.wikipedia.org/wiki/


APPENDIX
A. INTRODUCTION TO VERILOG
A.1 OVERVIEW:
Hardware description languages such as Verilog differ from software
programming languages because they include ways of describing the propagation time
and signal strengths (sensitivity). There are two types of assignment operators; a
blocking assignment (=), and a non-blocking (<=) assignment. The non-blocking
assignment allows designers to describe a state-machine update without needing to
declare and use temporary storage variables. Since these concepts are part of Verilog's
language semantics, designers could quickly write descriptions of large circuits in a
relatively compact and concise form. At the time of Verilog's introduction (1984),
Verilog represented a tremendous productivity improvement for circuit designers who
were already using graphical schematic capture software and specially written
software programs to document and simulate electronic circuits.
The designers of Verilog wanted a language with syntax similar to the C
programming language, which was already widely used in engineering software
development. Like C, Verilog is case-sensitive and has a basic preprocessor (though
less sophisticated than that of ANSI C/C++). Its control flow keywords (if/else, for,
while, case, etc.) are equivalent, and its operator precedence is compatible with C.
Syntactic differences include: required bit-widths for variable declarations,
demarcation of procedural blocks (Verilog uses begin/end instead of curly braces {}),
and many other minor differences. Verilog requires that variables be given a definite
size. In C these sizes are assumed from the 'type' of the variable (for instance an
integer type may be 8 bits).
A Verilog design consists of a hierarchy of modules. Modules encapsulate
design hierarchy, and communicate with other modules through a set of declared
input, output, and bidirectional ports. Internally, a module can contain any
combination of the following: net/variable declarations (wire, reg, integer, etc.),
concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed


in sequential order within the block. However, the blocks themselves are executed
concurrently, making Verilog a dataflow language.
Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and signal strengths (strong, weak, etc.). This system allows abstract
modeling of shared signal lines, where multiple sources drive a common net. When a
wire has multiple drivers, the wire's (readable) value is resolved by a function of the
source drivers and their strengths.
A subset of statements in the Verilog language are synthesizable. Verilog modules that
conform to a synthesizable coding style, known as RTL (register-transfer level), can
be physically realized by synthesis software. Synthesis software algorithmically
transforms the (abstract) Verilog source into a netlist, a logically equivalent
description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI technology. Further
manipulations to the netlist ultimately lead to a circuit fabrication blueprint (such as a
photo mask set for an ASIC or a bitstream file for an FPGA).
A.1.1 Beginning:
Verilog was one of the first modern hardware description languages to be invented. It was created by Prabhu Goel and Phil Moorby during the winter of 1983/1984 at Automated Integrated Design Systems (later renamed Gateway Design Automation in 1985) as a hardware modeling language. Gateway Design Automation was purchased by Cadence Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the de facto standard (of Verilog logic simulators) for the next decade. Originally, Verilog was intended only to describe and allow simulation; support for synthesis was added afterwards.
A.1.2 Verilog-95:
With the increasing success of VHDL at the time, Cadence decided to make the
language available for open standardization. Cadence transferred Verilog into the
public domain under the Open Verilog International (OVI) (now known as Accellera)


organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95.
A.1.3 Verilog 2001:
Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that
users had found in the original Verilog standard. These extensions became IEEE
Standard 1364-2001 known as Verilog-2001.
Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support
for (2's complement) signed nets and variables. Previously, code authors had to
perform signed operations using awkward bit-level manipulations (for example, the
carry-out bit of a simple 8-bit addition required an explicit description of the Boolean
algebra to determine its correct value). The same function under Verilog-2001 can be
more succinctly described by one of the built-in operators: +, -, /, *, >>>. A
generate/endgenerate construct (similar to VHDL's generate/endgenerate) allows
Verilog-2001 to control instance and statement instantiation through normal decision
operators (case/if/else). Using generate/endgenerate, Verilog-2001 can instantiate an
array of instances, with control over the connectivity of the individual instances. File
I/O has been improved by several new system tasks. And finally, a few syntax
additions were introduced to improve code readability (e.g. always @*, named
parameter override, C-style function/task/module header declaration).
Verilog-2001 is the dominant flavor of Verilog supported by the majority of
commercial EDA software packages.
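As a small illustration of two of these Verilog-2001 additions, signed arithmetic with >>> and the generate construct, consider the following sketch (module and signal names are our own):

module v2001_demo;
  // Verilog-2001 signed declarations: >>> is an arithmetic (sign-preserving)
  // right shift, so -8 >>> 2 yields -2 rather than a large positive value.
  reg signed [7:0] x;
  wire signed [7:0] y = x >>> 2;

  // Verilog-2001 generate loop: instantiate an array of AND gates.
  wire [3:0] a = 4'b1100, b = 4'b1010, z;
  genvar i;
  generate
    for (i = 0; i < 4; i = i + 1) begin : bits
      assign z[i] = a[i] & b[i];
    end
  endgenerate

  initial begin
    x = -8;
    #1 $display("y=%0d z=%b", y, z);  // prints y=-2 z=1000
  end
endmodule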
A.1.4 Verilog 2005:
Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features
(such as the uwire keyword).
A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and
mixed signal modeling with traditional Verilog.


A.1.5 System Verilog:


System Verilog is a superset of Verilog-2005, with many new features and capabilities
to aid design verification and design modeling. As of 2009, the System Verilog and
Verilog language standards were merged into System Verilog 2009 (IEEE Standard
1800-2009).
The advent of hardware verification languages such as OpenVera, and Verisity's e
language encouraged the development of Superlog by Co-Design Automation Inc.
Co-Design Automation Inc was later purchased by Synopsys. The foundations of
Superlog and Vera were donated to Accellera, which later became the IEEE standard
P1800-2005: SystemVerilog.
Example:
A hello world program looks like this:
module main;
initial
begin
$display("Hello world!");
$finish;
end
endmodule
A simple example of two flip-flops follows:
module toplevel(clock, reset);
 input clock;
 input reset;

 reg flop1;
 reg flop2;

 always @ (posedge reset or posedge clock)
   if (reset)
     begin
       flop1 <= 0;
       flop2 <= 1;
     end
   else
     begin
       flop1 <= flop2;
       flop2 <= flop1;
     end
endmodule
The "<=" operator in Verilog is another aspect of its being a hardware description
language as opposed to a normal procedural language. This is known as a "non-blocking" assignment. Its action doesn't register until the next clock cycle. This means
that the order of the assignments is irrelevant and will produce the same result: flop1
and flop2 will swap values every clock.
The other assignment operator, "=", is referred to as a blocking assignment. When "="
assignment is used, for the purposes of logic, the target variable is updated
immediately. In the above example, had the statements used the "=" blocking operator
instead of "<=", flop1 and flop2 would not have been swapped. Instead, as in
traditional programming, the compiler would understand to simply set flop1 equal to
flop2 (and subsequently ignore the redundant logic to set flop2 equal to flop1).
An example counter circuit follows:
module Div20x (rst, clk, cet, cep, count, tc);
 parameter size = 5;
 parameter length = 20;

 input rst;  // These inputs/outputs represent
 input clk;  // connections to the module.
 input cet;
 input cep;
 output [size-1:0] count;
 output tc;

 reg [size-1:0] count;  // Signals assigned within an always (or initial)
                        // block must be of type reg
 wire tc;               // Other signals are of type wire

 // The always statement below is a parallel execution statement that
 // executes any time the signals rst or clk transition from low to high
 always @ (posedge clk or posedge rst)
   if (rst)  // This causes reset of the cntr
     count <= {size{1'b0}};
   else
     if (cet && cep)  // Enables both true
       begin
         if (count == length-1)
           count <= {size{1'b0}};
         else
           count <= count + 1'b1;
       end

 // the value of tc is continuously assigned the value of the expression
 assign tc = (cet && (count == length-1));
endmodule
An example of delays:
reg a, b, c, d;
wire e;
always @(b or e)
begin
a = b & e;
b = a | b;
#5 c = b;
d = #6 c ^ e;
end
The always clause above illustrates the other type of method of use, i.e. it executes
whenever any of the entities in the list (the b or e) changes. When one of these
changes, a is immediately assigned a new value, and due to the blocking assignment,
b is assigned a new value afterward (taking into account the new value of a). After a
delay of 5 time units, c is assigned the value of b and the value of c ^ e is tucked away
in an invisible store. Then after 6 more time units, d is assigned the value that was
tucked away.


Signals that are driven from within a process (an initial or always block) must be of
type reg. Signals that are driven from outside a process must be of type wire. The
keyword reg does not necessarily imply a hardware register.

A.2 DEFINITION OF CONSTANTS:


The definition of constants in Verilog supports the addition of a width parameter. The
basic syntax is:
<Width in bits>'<base letter><number>
Examples:
12'h123 - Hexadecimal 123 (using 12 bits)
20'd44 - Decimal 44 (using 20 bits - 0 extension is automatic)
4'b1010 - Binary 1010 (using 4 bits)
6'o77 - Octal 77 (using 6 bits)
Synthesizable constructs:
There are several statements in Verilog that have no analog in real hardware, e.g. $display. Consequently, much of the language cannot be used to describe hardware. The examples presented here are the classic subset of the language that has a direct mapping to real gates.
// Mux examples - Three ways to do the same thing.

// The first example uses continuous assignment
wire out;
assign out = sel ? a : b;

// The second example uses a procedure to accomplish the same thing.
reg out;
always @(a or b or sel)
begin
  case(sel)
    1'b0: out = b;
    1'b1: out = a;
  endcase
end

// Finally - you can use if/else in a procedural structure.
reg out;
always @(a or b or sel)
  if (sel)
    out = a;
  else
    out = b;
The next interesting structure is a transparent latch; it will pass the input to the output when the gate signal is set for "pass-through", and captures the input and stores it upon transition of the gate signal to "hold". The output will remain stable regardless of the input signal while the gate is set to "hold". In the example below the "pass-through" level of the gate would be when the value of the if clause is true, i.e. gate = 1. This is read "if gate is true, the din is fed to latch_out continuously." Once the if clause is false, the last value at latch_out will remain and is independent of the value of din.

// Transparent latch example
reg latch_out;
always @(gate or din)
  if (gate)
    latch_out = din; // Pass through state


// Note that the else isn't required here. The variable


// latch_out will follow the value of din while gate is
// high. When gate goes low, latch_out will remain constant.
The flip-flop is the next significant template; in Verilog, the D-flop is the simplest,
and it can be modeled as:
reg q;
always @(posedge clk)
  q <= d;
The significant thing to notice in the example is the use of the non-blocking
assignment. A basic rule of thumb is to use <= when there is a posedge or negedge
statement within the always clause.
A variant of the D-flop is one with an asynchronous reset; there is a convention that
the reset state will be the first if clause within the statement.
reg q;
always @(posedge clk or posedge reset)
  if (reset)
    q <= 0;
  else
    q <= d;
The next variant is including both an asynchronous reset and asynchronous set
condition; again the convention comes into play, i.e. the reset term is followed by the
set term.
reg q;
always @(posedge clk or posedge reset or posedge set)
  if (reset)
    q <= 0;
  else
    if (set)
      q <= 1;
    else
      q <= d;
Note: If this model is used to model a Set/Reset flip flop then simulation errors can
result. Consider the following test sequence of events. 1) reset goes high 2) clk goes
high 3) set goes high 4) clk goes high again 5) reset goes low followed by 6) set going
low. Assume no setup and hold violations.
In this example the always @ statement would first execute when the rising edge of
reset occurs which would place q to a value of 0. The next time the always block
executes would be the rising edge of clk which again would keep q at a value of 0.
The always block then executes when set goes high which because reset is high forces
q to remain at 0. This condition may or may not be correct depending on the actual
flip flop. However, this is not the main problem with this model. Notice that when
reset goes low, that set is still high. In a real flip flop this will cause the output to go to
a 1. However, in this model it will not occur because the always block is triggered by
rising edges of set and reset - not levels. A different approach may be necessary for
set/reset flip flops.
The final basic variant is one that implements a D-flop with a mux feeding its input.
The mux has a d-input and feedback from the flop itself. This allows a gated load
function.
// Basic structure with an EXPLICIT feedback path
always @(posedge clk)
  if (gate)
    q <= d;
  else
    q <= q; // explicit feedback path

// The more common structure ASSUMES the feedback is present
// This is a safe assumption since this is how the
// hardware compiler will interpret it. This structure
// looks much like a latch. The differences are the
// @(posedge clk) and the non-blocking <=
always @(posedge clk)
  if (gate)
    q <= d; // the "else" mux is "implied"
Note that there are no "initial" blocks mentioned in this description. There is a split
between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial
blocks where reg values are established instead of using a "reset" signal. ASIC
synthesis tools don't support such a statement. The reason is that an FPGA's initial
state is something that is downloaded into the memory tables of the FPGA. An ASIC
is an actual hardware implementation.

A.3 INITIAL AND ALWAYS:


There are two separate ways of declaring a Verilog process. These are the always and
the initial keywords. The always keyword indicates a free-running process. The initial
keyword indicates a process executes exactly once. Both constructs begin execution at
simulator time 0, and both execute until the end of the block. Once an always block
has reached its end, it is rescheduled (again). It is a common misconception to believe
that an initial block will execute before an always block. In fact, it is better to think of
the initial-block as a special-case of the always-block, one which terminates after it
completes for the first time.
//Examples
initial
begin
  a = 1; // Assign a value to reg a at time 0
  #1;    // Wait 1 time unit
  b = a; // Assign the value of reg a to reg b
end

always @(a or b) // Any time a or b CHANGE, run the process
begin
  if (a)
    c = b;
  else
    d = ~b;
end // Done with this block, now return to the top (i.e. the @ event-control)

always @(posedge a) // Run whenever reg a has a low to high change
  a <= b;
These are the classic uses for these two keywords, but there are two significant
additional uses. The most common of these is an always keyword without the @(...)
sensitivity list. It is possible to use always as shown below:
always
begin // Always begins executing at time 0 and NEVER stops
  clk = 0; // Set clk to 0
  #1;      // Wait for 1 time unit
  clk = 1; // Set clk to 1
  #1;      // Wait 1 time unit
end // Keeps executing - so continue back at the top of the begin


The always keyword acts similar to the "C" construct while(1) {..} in the sense that it
will execute forever.
The other interesting exception is the use of the initial keyword with the addition of
the forever keyword.
The example below is functionally identical to the always example above.
initial forever // Start at time 0 and repeat the begin/end forever
begin
clk = 0; // Set clk to 0
#1; // Wait for 1 time unit
clk = 1; // Set clk to 1
#1; // Wait 1 time unit
end
Fork/join:
The fork/join pair are used by Verilog to create parallel processes. All statements (or
blocks) between a fork/join pair begin execution simultaneously upon execution flow
hitting the fork. Execution continues after the join upon completion of the longest
running statement or block between the fork and join.
initial
  fork
    $write("A"); // Print Char A
    $write("B"); // Print Char B
    begin
      #1;          // Wait 1 time unit
      $write("C"); // Print Char C
    end
  join
The way the above is written, it is possible to have either the sequences "ABC" or
"BAC" print out. The order of simulation between the first $write and the second
$write depends on the simulator implementation, and may purposefully be
randomized by the simulator. This allows the simulation to contain both accidental
race conditions as well as intentional non-deterministic behavior.
Notice that VHDL cannot dynamically spawn multiple processes like Verilog.
Race conditions
The order of execution isn't always guaranteed within Verilog. This can best be
illustrated by a classic example. Consider the code snippet below:
initial
  a = 0;
initial
  b = a;
initial
begin
  #1;
  $display("Value a=%b Value of b=%b", a, b);
end
What will be printed out for the values of a and b? Depending on the order of
execution of the initial blocks, it could be zero and zero, or alternately zero and some
other arbitrary uninitialized value. The $display statement will always execute after
both assignment blocks have completed, due to the #1 delay


B. INTRODUCTION TO VLSI
Very-large-scale integration (VLSI) is the process of creating integrated
circuits by combining thousands of transistor-based circuits into a single chip. VLSI
began in the 1970s when complex semiconductor and communication technologies
were being developed. The microprocessor is a VLSI device. The term is no longer as
common as it once was, as chips have increased in complexity into the hundreds of
millions of transistors.

B.1 OVERVIEW:
The first semiconductor chips held one transistor each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device. Now known
retrospectively as "small-scale integration" (SSI), improvements in technique led to
devices with hundreds of logic gates, known as medium-scale integration (MSI);
further improvements led to large-scale integration (LSI), i.e. systems with at least a
thousand logic gates. Current technology has moved far past
this mark and today's microprocessors have many millions of gates and hundreds of
millions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale
integration above VLSI. Terms like Ultra-Large-Scale Integration (ULSI) were
used. But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use. Even VLSI is now somewhat quaint,
given the common assumption that all microprocessors are VLSI or better.
As of early 2008, billion-transistor processors are commercially available, an
example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65
nm processes to the next 45 nm generations (while experiencing new challenges such
as increased variation across process corners). Another notable example is NVIDIA's
280 series GPU.


This GPU is unique in that its 1.4 billion transistors, capable of a teraflop of
performance, are almost entirely dedicated to logic (Itanium's transistor count is
largely due to its 24MB L3 cache). Current designs, as opposed to
the earliest devices, use extensive design automation and automated logic synthesis to
lay out the transistors, enabling higher levels of complexity in the resulting logic
functionality. Certain high-performance logic blocks like the SRAM cell, however,
are still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance by trading
stability).

B.2 VLSI:
VLSI stands for "Very Large Scale Integration". This is the field which
involves packing more and more logic devices into smaller and smaller areas.
1. Simply put, an integrated circuit is many transistors on one chip.
2. Design/manufacturing of extremely small, complex circuitry using modified
semiconductor material.
3. An integrated circuit (IC) may contain millions of transistors, each a few µm
(or smaller) in size.
4. Applications are wide ranging: most electronic logic devices.

B.3 VLSI DESIGN FLOW:


B.3.1 Digital Circuit:
Digital ICs of SSI and MSI types have become universally standardized and
have been accepted for use. Whenever a designer has to realize a digital function, he
uses a standard set of ICs along with a minimal set of additional discrete circuitry.
Consider a simple example of realizing the function
Qn+1 = Qn + (A.B)
Here Qn, A, and B are Boolean variables, with Qn being the value of Q at the
nth time step. A.B signifies the logical AND of A and B; the + symbol signifies
the logical OR of the logic variables on either side. A circuit to realize the function is

shown in Figure B.1. The circuit can be realized in terms of two ICs: an A-O-I gate
and a flip-flop. It can be directly wired up, tested, and used.
Fig B.1: Simple digital circuit
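For comparison, the same function can be described behaviorally in Verilog. This is an illustrative sketch only; the module and port names are not taken from the original circuit:
module qcircuit (input Clk, input A, input B, output reg Q);
always @(posedge Clk)
Q <= Q | (A & B); // Qn+1 = Qn + (A.B)
endmodule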

With comparatively larger circuits, the task mostly reduces to one of
identifying the set of ICs necessary for the job and interconnecting them; rarely does
one have to resort to micro-level design. The accepted approach to digital design here
is a mix of the top-down and bottom-up approaches, as follows.
Decide the requirements at the system level and translate them to circuit requirements.
Identify the major functional blocks required like timer, DMA unit, register file etc.,
say as in the design of a processor.
Whenever a function can be realized using a standard IC, use the same: for
example, a programmable counter, mux, demux, etc.
Whenever the above is not possible, form the circuit to carry out the block
functions using standard SSI: for example, gates, flip-flops, etc.

Use additional components like transistors, diodes, resistors, capacitors, etc.,
wherever essential.
[Flowchart: System requirements, Circuit requirements, ICs, Other components, PCB layout, Wiring & testing, Final circuit]

Fig B.2: Process flowchart

Once the above steps are gone through, a paper design is ready. Starting with
the paper design, one has to do a circuit layout. The physical location of all the
components is tentatively decided; they are interconnected and the circuit-on-paper is
made ready. Once a paper design is done, a layout is carried out and a net-list
prepared. Based on this, the PCB is fabricated and populated and all the populated
cards tested and debugged.
At the debugging stage one may encounter three types of problems:


Functional mismatch: The realized and expected functions are different. One
may have to go through the relevant functional block carefully and locate any
error logically. Finally the necessary correction has to be carried out in
hardware.
Timing mismatch: The problem can manifest in different forms. One
possibility is due to the signal going through different propagation delays in
two paths and arriving at a point with a timing mismatch. This can cause
faulty operation. Another possibility is a race condition in a circuit involving
asynchronous feedback. This kind of problem may call for elaborate
debugging. The preferred practice is to do debugging at smaller module stages
and to ensure that feedback through larger loops is avoided; it becomes
essential to check for the existence of long asynchronous loops.
Overload: Some signals may be overloaded to such an extent that the signal
transition may be unduly delayed or even suppressed. The problem manifests
as reflections and erratic behavior in some cases (The signal has to be suitably
buffered here.). In fact, overload on a signal can lead to timing mismatches.
The above steps have to be carried out after completion of the prototype PCB
manufacturing; they involve cost, time, and also a redesign process to arrive at a
bug-free design.
History of Scale Integration:
Late 40s Transistor invented at Bell Labs
Late 50s First IC (JK-FF by Jack Kilby at TI)
Early 60s Small Scale Integration (SSI)
10s of transistors on a chip
Late 60s Medium Scale Integration (MSI)
100s of transistors on a chip


Early 70s Large Scale Integration (LSI)
1000s of transistors on a chip
Early 80s Very Large Scale Integration (VLSI)
10,000s of transistors on a chip (later 100,000s and now 1,000,000s)
Ultra LSI is sometimes used for 1,000,000s of transistors
SSI - Small-Scale Integration (up to 10^2 transistors)
MSI - Medium-Scale Integration (10^2 - 10^3)
LSI - Large-Scale Integration (10^3 - 10^5)
VLSI - Very Large-Scale Integration (10^5 - 10^7)
ULSI - Ultra Large-Scale Integration (>= 10^7)
B.3.2 VLSI Design:
The complexity of VLSIs being designed and used today makes the manual
approach to design impractical. Design automation is the order of the day. With the
rapid technological developments in the last two decades, the status of VLSI
technology is characterized by the following:
A steady increase in the size and hence the functionality of the ICs.
A steady reduction in feature size and hence increase in the speed of operation
as well as gate or transistor density.
A steady improvement in the predictability of circuit behavior.
A steady increase in the variety and size of software tools for VLSI design.
The above developments have resulted in a proliferation of approaches to
VLSI design. We briefly describe the procedure of automated design flow. The aim is
more to bring out the role of a Hardware Description Language (HDL) in the design
process. An abstraction based model is the basis of the automated design.
B.3.3 Abstraction Model:
The model divides the whole design cycle into various domains. With such an
abstraction through a division process the design is carried out in different layers. The
designer at one layer can function without bothering about the layers above or below.
The thick horizontal lines separating the layers in the figure signify the
compartmentalization. As an example, let us consider design at the gate level. The
circuit to be designed would be described in terms of truth tables and state tables.
With these as available inputs, the designer has to express them as Boolean logic equations and
realize them in terms of gates and flip-flops. In turn, these form the inputs to the layer
immediately below. Compartmentalization of the approach to design in the manner
described here is the essence of abstraction; it is the basis for development and use of
CAD tools in VLSI design at various levels.
The design methods at different levels use the respective aids such as Boolean
equations, truth tables, state transition table, etc. But the aids play only a small role in
the process. To complete a design, one may have to switch from one tool to another,
raising the issues of tool compatibility and learning new environments.

B.4 ASIC DESIGN FLOW:


As with any other technical activity, development of an ASIC starts with an
idea and takes tangible shape through the stages of development. The first step in the
process is to expand the idea in terms of behavior of the target circuit. Through stages
of programming, the same is fully developed into a design description in terms of
well-defined standard constructs and conventions.


Fig B.3: Design domain and levels of abstraction

[Flowchart: Idea, Design description, Synthesis, Simulation, Physical design]

Fig B.4: Major activities in ASIC design

The design is tested through a simulation process; it is to check, verify, and
ensure that what is wanted is what is described. Simulation is carried out through
dedicated tools. With every simulation run, the simulation results are studied to
identify errors in the design description. The errors are corrected and another
simulation run carried out. Simulation and changes to design description together
form a cyclic iterative process, repeated until an error-free design is evolved.
Design description is an activity independent of the target technology or
manufacturer. It results in a description of the digital circuit. To translate it into a
tangible circuit, one goes through the physical design process. The same constitutes a
set of activities closely linked to the manufacturer and the target technology.
B.4.1 Design Description:
The design is carried out in stages. The process of transforming the idea into a
detailed circuit description in terms of the elementary circuit components constitutes
design description. The final circuit of such an IC can have up to a billion such
components; it is arrived at in a step-by-step manner. The first step in evolving the
design description is to describe the circuit in terms of its behavior. The description
looks like a program in a high level language like C. Once the behavioral level design
description is ready, it is tested extensively with the help of a simulation tool; it
checks and confirms that all the expected functions are carried out satisfactorily. If
necessary, this behavioral-level routine is edited, modified, and rerun, all done
manually. Finally, one has a design for the expected system described at the
behavioral level. The behavioral design forms the input to the synthesis tools, for
circuit synthesis. The behavioral constructs not supported by the synthesis tools are
replaced by data flow and gate level constructs. To summarize, the designer has to
develop synthesizable code for his design. The design at the behavioral level is to be
elaborated in terms of known and acknowledged functional blocks. It forms the next
detailed level of design description.
Once again the design is to be tested through simulation and iteratively
corrected for errors. The elaboration can be continued one or two steps further. It
leads to a detailed design description in terms of logic gates and transistor switches.
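As a small illustration of the same function expressed at successive levels of detail, consider a 2-to-1 multiplexer. This is a hypothetical example, not part of the design described here, and the signal declarations are assumed:
// Behavioral level
always @(sel or a or b)
if (sel) y = a; else y = b;
// Data flow level
assign y = sel ? a : b;
// Gate level
not g1 (nsel, sel);
and g2 (t1, a, sel);
and g3 (t2, b, nsel);
or g4 (y, t1, t2);
Each refinement says the same thing with less abstraction; synthesis tools perform such refinements automatically for supported constructs.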
B.4.2 Optimization:


The circuit at the gate level in terms of the gates and flip-flops can be
redundant in nature. The same can be minimized with the help of minimization tools.
The step is not shown separately in the figure. The minimized logical design is
converted to a circuit in terms of the switch level cells from standard libraries
provided by the foundries. The cell based design generated by the tool is the last step
in the logical design process; it forms the input to the first level of physical design.
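As a trivial, hypothetical instance of such minimization: a gate-level description containing F = A.B + A.B' needs two AND gates, one inverter, and one OR gate, but since A.B + A.B' = A.(B + B') = A, the minimizer reduces the whole network to a single wire.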
B.4.3 Simulation:
The design descriptions are tested for their functionality at every level:
behavioral, data flow, and gate. One has to check here whether all the functions are
carried out as expected and rectify any deviations. All such activities are carried out
by the simulation tool. The tool also has an editor to carry out any corrections to the
source code. Simulation involves testing the design for all its functions, functional
sequences, timing constraints, and specifications. Normally, testing and simulation at
all the levels, behavioral to switch level, are carried out by a single tool; the same is
identified as the scope of the simulation tool.


Fig B.5: ASIC Design and Development flow


B.4.4 Synthesis:
With the availability of design at the gate (switch) level, the logical design is
complete. The corresponding circuit hardware realization is carried out by a synthesis
tool.
Two common approaches are as follows:
The circuit is realized through an FPGA. The gate level design description is the
starting point for the synthesis here. The FPGA vendors provide an interface to the
synthesis tool. Through the interface the gate level design is realized as a final
circuit. With many synthesis tools, one can directly use the design description at
the data flow level itself to realize the final circuit through an FPGA. The FPGA
route is attractive for limited volume production or a fast development cycle.
The circuit is realized as an ASIC. A typical ASIC vendor will have his own
library of basic components like elementary gates and flip-flops. Eventually the
circuit is to be realized by selecting such components and interconnecting them
conforming to the required design. This constitutes the physical design. Being an
elaborate and costly process, a physical design may call for an intermediate
functional verification through the FPGA route. The circuit realized through the
FPGA is tested as a prototype. It provides another opportunity for testing the
design closer to the final circuit.
B.4.5 Physical Design:
A fully tested and error-free design at the switch level can be the starting point
for a physical design. It is to be realized as the final circuit using (typically) a million
components from the foundry's library. The step-by-step activities in the process are
described briefly as follows:
System partitioning: The design is partitioned into convenient compartments or
functional blocks. Often it would have been done at an earlier stage itself and the
software design prepared in terms of such blocks. Interconnection of the blocks is
part of the partition process.


Floor planning: The positions of the partitioned blocks are planned and the blocks
are arranged accordingly. The procedure is analogous to the planning and
arrangement of domestic furniture in a residence. Blocks with I/O pins are kept
close to the periphery; those which interact frequently or through a large number
of interconnections are kept close together, and so on. Partitioning and floor
planning may have to be carried out and refined iteratively to yield best results.
Placement: The selected components from the ASIC library are placed in position
on the Silicon floor. It is done with each of the blocks above.
Routing: The components placed as described above are to be interconnected to
the rest of the block: it is done with each of the blocks by suitably routing the
interconnects. Once the routing is complete, the physical design can be taken as
complete. The final mask for the design can be made at this stage and the ASIC
manufactured in the foundry.
B.4.6 Post Layout Simulation:
Once the placement and routing are completed, the performance specifications
like silicon area, power consumed, path delays, etc., can be computed. An equivalent
circuit can be extracted at the component level and performance analysis carried out.
This constitutes the final stage, called verification. One may have to go through the
placement and routing activity once again to improve performance.
B.4.7 Critical Subsystems:
The design may have critical subsystems. Their performance may be crucial to
the overall performance; in other words, to improve the system performance
substantially, one may have to design such subsystems afresh. The design here may
imply redefinition of the basic feature size of the component, component design,
placement of components, or routing done separately and specifically for the
subsystem. A set of masks used in the foundry may have to be done afresh for the
purpose.


C. FIELD PROGRAMMABLE GATE ARRAY


C.1 INTRODUCTION:
An FPGA contains a two-dimensional array of logic blocks and interconnections
between logic blocks. Both the logic blocks and the interconnects are programmable.
Logic blocks are programmed to implement a desired function, and the interconnects
are programmed using switch boxes to connect the logic blocks.

To be clearer: if we want to implement a complex design (a CPU, for instance), the
design is divided into small sub-functions, and each sub-function is implemented
using one logic block. Then, to obtain the desired design (the CPU), all the
sub-functions implemented in logic blocks must be connected, and this is done by
programming the interconnects.

The internal structure of an FPGA is depicted in Figure C.1 below.


Fig C.1: FPGA Architecture

FPGAs, an alternative to custom ICs, can be used to implement an entire
System On a Chip (SoC). The main advantage of an FPGA is its ability to be
reprogrammed: the user can reprogram an FPGA to implement a design after the
FPGA has been manufactured, hence the name Field Programmable. Custom ICs are
expensive and take a long time to design, so they are useful when produced in bulk.
FPGAs, by contrast, are easy to implement within a short time with the help of
Computer Aided Design (CAD) tools (because there is no physical layout process, no
mask making, and no IC manufacturing). Some disadvantages of FPGAs are that they
are slow compared to custom ICs, they cannot handle very complex designs, and they
draw more power.

A Xilinx logic block consists of one Look Up Table (LUT) and one flip-flop. An LUT
is used to implement a number of different functions. The input lines to the logic
block go into the LUT and address it. The output of the LUT gives the result of the
logic function that it implements, and the output of the logic block is either the
registered or the unregistered output from the LUT. SRAM is used to implement an
LUT: a k-input logic function is implemented using a 2^k x 1 SRAM, and the number
of different possible functions for a k-input LUT is 2^(2^k). The advantage of such an
architecture is that it supports the implementation of very many logic functions; the
disadvantage is the unusually large number of memory cells required when the
number of inputs is large. Figure C.2 below shows a 4-input LUT based
implementation of a logic block.

Fig C.2: Xilinx LUT

LUT-based design provides for better logic block utilization. A k-input LUT
based logic block can be implemented in a number of different ways, with a trade-off
between performance and logic density. An n-LUT can be seen as a direct
implementation of a function truth table: each latch holds the value of the function
corresponding to one input combination. For example, a 2-LUT can be used to
implement any of the 16 possible two-input functions, such as AND, OR, A + NOT B,
etc., as shown in the truth table below.
A  B  AND  OR
0  0   0   0
0  1   0   1
1  0   0   1
1  1   1   1
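A k-input LUT thus behaves as a 2^k x 1 memory addressed by its inputs. A minimal behavioral sketch of a 4-input LUT follows; the module name and the INIT pattern are illustrative assumptions, not a vendor primitive:
module lut4 (input [3:0] addr, output out);
parameter [15:0] INIT = 16'h8000; // truth-table contents; 16'h8000 implements a 4-input AND
wire [15:0] contents = INIT;      // the LUT's stored bits (SRAM cells in a real FPGA)
assign out = contents[addr];      // the four inputs select one stored bit
endmodule
Loading a different INIT pattern gives any of the 2^16 possible 4-input functions, which is exactly what configuring the FPGA does.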

C.2 INTERCONNECTS:
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an
FPGA can be termed a track. Typically an FPGA has logic blocks, interconnects and
switch blocks (input/output blocks). Switch blocks lie in the periphery of logic blocks
and interconnect; wire segments are connected to logic blocks through switch blocks.
Depending on the required design, one logic block is connected to another, and so on.
In this part of the tutorial we give a short introduction to the FPGA design flow. A
simplified version of the design flow is given in Figure C.3 below.

Fig C.3: FPGA Design Flow

C.3 DESIGN ENTRY:


There are different techniques for design entry: schematic based, Hardware
Description Language (HDL) based, a combination of both, etc. Selection of a method
depends on the design and the designer. If the designer wants to deal more with the
hardware, then schematic entry is the better choice. When the design is complex, or
the designer thinks of the design in an algorithmic way, then HDL is the better choice.
Language-based entry is faster, but lags in performance and density. HDLs represent
a level of abstraction that can isolate designers from the details of the hardware
implementation. Schematic-based entry gives designers much more visibility into the
hardware; it is the better choice for those who are hardware oriented. Another method,
rarely used, is state machines; it is the better choice for designers who think of the
design as a series of states, but the tools for state-machine entry are limited. In this
documentation we deal with HDL-based design entry.

C.4 SYNTHESIS:
Synthesis is the process which translates VHDL or Verilog code into a device netlist
format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the
design. If the design contains more than one sub-design (for example, to implement a
processor we need a CPU as one design element and a RAM as another, and so on),
then the synthesis process generates a netlist for each design element. The synthesis
process checks code syntax and analyzes the hierarchy of the design, which ensures
that the design is optimized for the architecture the designer has selected. The
resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx
Synthesis Technology (XST)).

Fig C.4: FPGA Synthesis

C.5 IMPLEMENTATION:
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route

The Translate process combines all the input netlists and constraints into a logic
design file. This information is saved as an NGD (Native Generic Database) file; this
can be done using the NGDBuild program. Here, defining constraints means
assigning the ports in the design to the physical elements (e.g. pins, switches, buttons)
of the targeted device and specifying the timing requirements of the design. This
information is stored in a file called a UCF (User Constraints File).
Tools used to create or modify the UCF include PACE, the Constraints Editor, etc.

Fig C.5: FPGA Translate

The Map process divides the whole circuit with logical elements into sub-blocks
such that they can fit into the FPGA logic blocks. That is, the Map process fits the
logic defined by the NGD file into the targeted FPGA elements (Configurable Logic
Blocks (CLBs), Input Output Blocks (IOBs)) and generates an NCD (Native Circuit
Description) file which physically represents the design mapped to the components of
the FPGA. The MAP program is used for this purpose.

Fig C.6: FPGA map


Place and Route: The PAR program is used for this process. The place and route
process places the sub-blocks from the Map process into logic blocks according to the
constraints, and connects the logic blocks. For example, if a sub-block is placed in a
logic block which is very near an I/O pin, it may save time, but it may affect some
other constraint; the trade-off between all the constraints is taken into account by the
place and route process. The PAR tool takes the mapped NCD file as input and
produces a completely routed NCD file as output. The output NCD file contains the
routing information.

Fig C.7: FPGA place and route

C.6 DEVICE PROGRAMMING:


Now the design must be loaded onto the FPGA. But first the design must be
converted to a format that the FPGA can accept; the BITGEN program handles this
conversion. The routed NCD file is given to the BITGEN program to generate a bit
stream (a .BIT file) which can be used to configure the target FPGA device. This can
be done using a cable; the selection of cable depends on the design.

C.7 DESIGN VERIFICATION:


Verification can be done at different stages of the process.
Behavioral Simulation (RTL Simulation): This is the first of the simulation steps
encountered throughout the hierarchy of the design flow. This simulation is
performed before the synthesis process to verify the RTL (behavioral) code and to
confirm that the design is functioning as intended. Behavioral simulation can be
performed on either VHDL or Verilog designs. In this process, signals and variables
are observed, procedures and functions are traced, and breakpoints are set. This is a
very fast simulation and so allows the designer to change the HDL code within a
short time period if the required functionality is not met. Since the design is not yet
synthesized to gate level, timing and resource-usage properties are still unknown.
Functional Simulation (Post-Translate Simulation): Functional simulation
gives information about the logic operation of the circuit. The designer can verify the
functionality of the design using this process after the Translate process. If the
functionality is not as expected, then the designer has to make changes in the code
and follow the design flow steps again.
Static Timing Analysis: This can be done after the MAP or PAR processes. The
post-MAP timing report lists signal path delays of the design derived from the design
logic. The post-Place-and-Route timing report incorporates timing delay information
(including routing delays) to provide comprehensive timing coverage.

C.8 ASIC Vs. FPGA:


ASICs and FPGAs have different value propositions, and they must be
carefully evaluated before choosing one over the other. Information abounds that
compares the two technologies. While FPGAs used to be selected for lower
speed/complexity/volume designs in the past, today's FPGAs easily push the 500
MHz performance barrier. With unprecedented logic density increases and a host of
other features, such as embedded processors, DSP blocks, clocking, and high-speed
serial I/O at ever lower price points, FPGAs are a compelling proposition for almost
any type of design.
