CHAPTER 1
1.1 INTRODUCTION
Multipliers are key components of many high-performance systems such as FIR filters, microprocessors, and digital signal processors. A system's performance is generally determined by the performance of the multiplier, because the multiplier is generally the slowest element in the system. Furthermore, it is generally the most area-consuming. Hence, optimizing the speed and area of the multiplier is a major design issue. However, area and speed are usually conflicting constraints, so improving speed mostly results in larger areas. As a result, a whole spectrum of multipliers with different area-speed trade-offs has been designed, with fully parallel multipliers at one end of the spectrum and fully serial multipliers at the other end. In between are digit-serial multipliers, where single digits consisting of several bits are operated on. These multipliers have moderate performance in both speed and area. However, existing digit-serial multipliers have been plagued by complicated switching systems and/or irregularities in design.
Radix-2^n multipliers, which operate on digits in a parallel fashion instead of bits, bring pipelining to the digit level and avoid most of the above problems. They were introduced by M. K. Ibrahim in 1993. These structures are iterative and modular. The pipelining done at the digit level brings the benefit of constant operation speed irrespective of the size of the multiplier. The clock speed is determined only by the digit size, which is fixed before the design is implemented.
1.2 MOTIVATION
As the scale of integration keeps growing, more and more sophisticated signal processing systems are being implemented on a VLSI chip. These signal processing applications not only demand great computation capacity but also consume considerable amounts of energy. While performance and area remain two major design goals, power consumption has become a critical concern in today's VLSI system design. The need for low-power VLSI systems arises from two main forces. First, with the steady growth of operating frequency and processing capacity per chip, large currents have to be delivered, and the heat due to large power consumption must be removed by proper cooling techniques. Second, battery life in portable electronic devices is limited. Low-power design directly leads to prolonged operation time in these portable devices.
P = Pcap + Psc + Pstatic                                (1.1)

Pcap = α01 · fclk · CL · Vswing · VDD                   (1.2)

Psc = α01 · fclk · (tr + tf)/2 · Ipeak · VDD            (1.3)

Pstatic = Istatic · VDD                                 (1.4)
Pcap in Equation 1.2 represents the dynamic power due to capacitance charging and discharging of a circuit node, where CL is the loading capacitance, fclk is the clock frequency, and α01 is the 0→1 transition probability in one clock period. In most cases, the voltage swing Vswing is the same as the supply voltage VDD; otherwise, Vswing should replace VDD in this equation. Psc is a first-order average power consumption due to short-circuit current. The peak current, Ipeak, is determined by the saturation current of the devices and is hence directly proportional to the sizes of the transistors. tr and tf are the rising time and falling time of the short-circuit current, respectively. The static power Pstatic is primarily determined by fabrication technology considerations and is usually several orders of magnitude smaller than the dynamic power.
The leakage power problem mainly appears in very low frequency circuits or ones with sleep modes where dynamic activities are suppressed. The dominant term in a well-designed circuit during its active state is the dynamic term due to switching activity on loading capacitance, and thus low-power design often becomes the task of minimizing α01, CL, VDD and fclk, while retaining the required functionality. In the future, static power will become increasingly important as the supply voltage keeps scaling. To avoid performance degradation, the threshold voltage Vt is lowered accordingly, and the subthreshold leakage current increases exponentially. Leakage power reduction depends heavily on circuit and technology techniques such as dual-Vt partitioning and multi-threshold CMOS. In this work, we will not consider leakage power reduction.
Power optimization of digital systems has been studied at different abstraction levels, from the lowest technology level to the highest system level. At the technology level, power consumption is reduced by improvements in the fabrication process such as smaller feature sizes, very low voltages, copper interconnects, and insulators with low dielectric constants. With fabrication support for multiple supply voltages, lower voltages can be applied to non-critical system blocks. At the layout level, placement and routing are adjusted to reduce wire capacitance and signal delay imbalances. At the circuit level, power reduction is achieved by transistor sizing, transistor network restructuring and reorganization, and different circuit logic styles.
Multiplication consists of three steps: generation of partial products or PPs (PPG), reduction of partial products (PPR), and final carry-propagate addition (CPA). In general, there are sequential and combinational multiplier implementations. We only consider combinational multipliers in this work because the scale of integration is now large enough to accept parallel multiplier implementations in digital VLSI systems. Different multiplication algorithms vary in their approaches to PPG, PPR, and CPA. For PPG, radix-2 digit-vector multiplication is the simplest form because the digit-vector multiplication is produced by a set of AND gates. To reduce the number of PPs, and consequently reduce the area/delay of PP reduction, one operand is usually recoded into high-radix digit sets. The most popular one is the radix-4 digit set {−2, −1, 0, 1, 2}. For PPR, two alternatives exist: reduction by rows, performed by an array of adders, and reduction by columns, performed by an array of counters. In reduction by rows, there are two extreme classes: linear array and tree array. A linear array has a delay of O(n), while both tree array and column reduction have a delay of O(log n), where n is the number of PPs. The final CPA requires a fast adder scheme because it is on the critical path. In some cases, the final CPA is postponed if it is advantageous to keep redundant results from PPG for further arithmetic operations.
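The three steps above can be sketched in Python for unsigned radix-2 operands (a behavioral illustration only, not any specific hardware design; the function name and bit width are illustrative):

```python
# Sketch of the three multiplication steps for unsigned radix-2 operands:
# PPG by AND gates, PPR by accumulation, and a final carry-propagate add.

def multiply(x, y, n=8):
    # PPG: partial product bit p[i][j] = x_j AND y_i, with weight 2**(i+j).
    pps = [(((x >> j) & 1) & ((y >> i) & 1)) << (i + j)
           for i in range(n) for j in range(n)]
    # PPR: reduce the partial products toward a single value; in hardware
    # this is an adder array (rows) or counter tree (columns).
    total = 0
    for p in pps:
        total += p
    # CPA: the last addition above plays the role of the final fast adder.
    return total
```

Each of the n² single-bit partial products corresponds to one AND gate, which is why recoding one operand to cut the number of PP rows pays off so directly in area and delay.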
The difficulty of low-power multiplier design lies in three aspects. First, the multiplier area is quadratically related to the operand precision. Second, parallel multipliers have many logic levels that introduce spurious transitions or glitches. Third, the structure of parallel multipliers can be very complex in order to achieve high speed, which deteriorates the efficiency of layout- and circuit-level optimization. As a fundamental arithmetic operation, multiplication has many algorithm-level and bit-level computation features in which it differs from random logic. These features have not been considered well in low-level power optimization. It is also difficult to consider input data characteristics at low levels. Therefore, it is desirable to develop algorithm- and architecture-level power optimization techniques that consider multiplication's arithmetic features and operand characteristics.
There has been some work on low-power multipliers at the algorithm and architecture level. As smaller area usually leads to less switching capacitance, published area results can provide a rough estimate of the relative power consumption of different multiplication schemes. Callaway studied the power/delay/area characteristics of four classical multipliers. Angel proposed low-power sign extension schemes and self-timed design with bypassing logic for zero PPs in radix-4 multipliers. Cherkauer and Friedman proposed a hybrid radix-4/radix-8 low-power signed multiplier architecture. For multiplication data with a large dynamic range, several approaches have been proposed. Architecture-level signal gating techniques have been studied. A mixed number representation for radix-4 two's-complement multiplication has been proposed. Radix-4 recoding has been applied to the constant input instead of the dynamic input in low-power multiplication for FIR filters. Multiplication has been separated into higher and lower parts, with the results of the higher part stored in a cache in order to reduce redundant computation. Two techniques have been proposed for data with a large dynamic range: a most-significant-digit-first carry-save array for PP reduction and a dynamically generated reduced two's-complement representation. Finally, the precisions of the two input data can be compared at runtime and the operands exchanged if necessary, so that radix-4 recoding is applied to the operand with the smaller precision in order to generate more zero PPs.
CHAPTER 2
MULTIPLIERS
A sequential multiplier performs multiplication by the operation of repeatedly adding the multiplicand and shifting. The advantage of a sequential multiplier is that the circuit is simple and the chip occupies less area; the disadvantage is that it is slower. For parallel array multipliers, the summation of partial products is carried out by using a linear adder array. Because the operation is in parallel, the speed is much faster than that of a sequential multiplier.
There are a number of algorithms used for multiplication. The 3-bit recoding algorithm is one of the most well known. It is used in the design of many kinds of hardware and software multipliers. This algorithm reduces the number of partial product rows by about half, so the speed of multiplication increases significantly and the chip area is reduced. The 3-bit recoding algorithm is also called the modified Booth's algorithm and was developed from Booth's algorithm. A number of other multiple-bit recoding algorithms for multiplication have been developed. Recently, a parallel hardware multiplier based on a 5-bit recoding algorithm has been proposed. From the view of optimization, the 5-bit recoding algorithm is preferred to a 4-bit recoding algorithm. While more partial product rows can be reduced with the 5-bit recoding algorithm than with a 3-bit recoding algorithm, more complicated circuits are required to determine the odd multiples of the multiplicand. With the potential of improving both performance and the hardware requirements, the 5-bit recoding algorithm may be good for a high-bit-width multiplier, but not for a low-bit-width multiplier such as an 8 x 8 bit multiplier.
Using the 5-bit recoding algorithm reduces the number of partial product rows to two. The partial products are selected from 17 different multiples of the multiplicand Y (0, ±Y, ±2Y, ±3Y, ±4Y, ±5Y, ±6Y, ±7Y, ±8Y). Using the 3-bit recoding algorithm reduces the number of partial product rows to four. The partial products are selected from 5 different multiples of the multiplicand Y (0, ±Y, ±2Y). The addition of four partial products can be changed to the addition of two binary numbers by using two rows of carry save adder arrays (CSA), with only a two-gate delay introduced. The even multiples of the multiplicand Y can be implemented by using a hardwired shift. For the 3-bit recoding algorithm, only the two's complement of Y needs to be determined. For the 5-bit recoding algorithm, additional high-speed adders are required to determine the odd multiples of Y. These high-speed adders require more circuitry to implement and suffer more time delay. For higher-bit-width multipliers, the advantage of the 5-bit recoding algorithm can be seen. For example, for a 32 x 32 bit multiplier, the 5-bit recoding algorithm reduces the number of partial product rows to 8 and the 3-bit recoding algorithm reduces the number of partial product rows to 16. The reduction of the number of partial product rows is apparent.
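The row counts quoted above follow from simple arithmetic: k-bit recoding retires k−1 multiplier bits per recoded digit, so an n-bit multiplier yields ceil(n/(k−1)) partial product rows. A quick Python check (illustrative only):

```python
# Partial-product row count for k-bit overlapping recoding of an
# n-bit multiplier: each digit covers k-1 new multiplier bits.
import math

def pp_rows(n_bits, k):
    return math.ceil(n_bits / (k - 1))

# 8 x 8:  3-bit recoding -> 4 rows, 5-bit recoding -> 2 rows.
# 32 x 32: 3-bit recoding -> 16 rows, 5-bit recoding -> 8 rows.
```

These match the figures in the text for both the 8 x 8 and 32 x 32 cases.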
This design can be done using components because we have already designed each of the units shown in the figure. However, since it is a relatively simple circuit, it can also be designed directly. In any case, the MAC circuit as a whole can be used as a component in applications like digital filters and neural networks.
5. If the count is equal to the number n of bits in the multiplier, the multiplication process is complete and the product is equal to the number held in the accumulator; otherwise, the operation returns to step 2.
A sequential multiplier is a simple circuit and occupies less chip area, but it is slow. To increase
the speed of multiplication, parallel adder arrays are used to add partial products.
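The add-and-shift procedure, including the counter test of step 5, can be sketched as follows for unsigned operands (a behavioral illustration in Python; names and bit width are illustrative):

```python
# Sequential add-and-shift multiplication: examine one multiplier bit
# per cycle, conditionally add the shifted multiplicand into the
# accumulator, and stop when the cycle count reaches n (step 5).

def sequential_multiply(multiplicand, multiplier, n=8):
    acc = 0                                # accumulator register
    count = 0                              # cycle counter
    while count < n:                       # step 5: repeat until count == n
        if (multiplier >> count) & 1:      # test the current multiplier bit
            acc += multiplicand << count   # add the shifted multiplicand
        count += 1                         # advance the counter
    return acc                             # product held in the accumulator
```

One result bit is retired per clock cycle, which is why the sequential circuit is small but n times slower than a parallel array.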
Draw a grid of three lines, each with squares for x + y + 1 bits. Label the lines
A: the multiplicand, followed by zeroes
S: the two's complement of the multiplicand, followed by zeroes
P: zeroes, followed by the multiplier, followed by a final zero bit
Drop the last bit from the product for the final result.
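Under the A/S/P register setup just described, Booth's algorithm can be sketched in Python (an illustration of the textbook procedure, with register widths following the x + y + 1 convention above):

```python
# Booth's multiplication algorithm with A, S and P registers of
# width x + y + 1 bits, for two's-complement operands.

def booth_multiply(m, r, x, y):
    width = x + y + 1
    mask = (1 << width) - 1
    A = (m & ((1 << x) - 1)) << (y + 1)    # multiplicand, then zeroes
    S = ((-m) & ((1 << x) - 1)) << (y + 1) # two's complement of m, zeroes
    P = (r & ((1 << y) - 1)) << 1          # zeroes, multiplier, final zero
    for _ in range(y):
        last_two = P & 0b11
        if last_two == 0b01:               # 01: add A
            P = (P + A) & mask
        elif last_two == 0b10:             # 10: add S
            P = (P + S) & mask
        sign = P >> (width - 1)            # arithmetic right shift by one
        P = (P >> 1) | (sign << (width - 1))
    P >>= 1                                # drop the last bit
    if P >> (x + y - 1):                   # interpret as signed result
        P -= 1 << (x + y)
    return P
```

For example, with x = y = 4, multiplying 3 by −4 yields −12 after four iterations and the final bit drop.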
How this sign bit E is arranged has been shown in the Wallace tree multiplication method above. The Wallace tree for the example is given below.
The partial products pk (k = 1, 2, 3 and 4) are shifted 2(k−1) bits to the right of p1 and need to be sign extended to 16 bits. Carry save adder arrays (CSA) are used to add multiple inputs and change the summation of multiple numbers to the summation of two numbers. There is no carry propagation delay in the carry save adder array, so the speed is high. Two 16-bit numbers A[15...0] and B[15...0] are obtained from the CSA. The 4 least significant bits of B are zero, so it is only necessary to add the two numbers A[15...4] and B[15...4]. For the least significant 4 bits, P[3...0] = A[3...0]. A Ripple Carry Adder Array (RCA) is used to add A[9...4] and B[9...4] to obtain P[9...4]. A Carry Select Adder Array is used to add A[15...10] and B[15...10] to obtain P[15...10].
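The carry-save reduction described here can be modeled behaviorally (a sketch, not the project's circuit): a CSA row is a 3:2 compressor that turns three numbers into a sum word and a carry word with no carry propagation, so two CSA rows reduce four partial products to the two numbers A and B that the final adders combine.

```python
# Behavioral model of carry-save reduction: each CSA level maps three
# numbers to a sum word and a carry word such that x + y + z = s + c.

def csa(x, y, z):
    sum_word = x ^ y ^ z                              # per-bit FA sum
    carry_word = ((x & y) | (x & z) | (y & z)) << 1   # per-bit FA carry
    return sum_word, carry_word

def reduce_four(p1, p2, p3, p4):
    s1, c1 = csa(p1, p2, p3)   # first CSA row
    a, b = csa(s1, c1, p4)     # second CSA row
    return a, b                # the final carry-propagate adder computes a + b

a, b = reduce_four(10, 20, 30, 40)
```

The invariant a + b == p1 + p2 + p3 + p4 holds at every level, which is exactly why the slow carry propagation can be deferred to a single final addition.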
The logic implementations of the different multiplier blocks will be detailed in the following section. These blocks are based on logic gates and half and full adders.
The recoded digits (k = 1, 2, 3, 4) are used to determine the kth partial product Pk [30]. b1i, b2i, b3i and b4i are the ith bits of Y, 2Y, −Y and −2Y.

Figure 2.8 Circuit to determine ith bit of the kth partial product
CHAPTER 3
ADDERS
In electronics, an adder is a digital circuit that performs addition of numbers. In modern computers, adders reside in the arithmetic logic unit (ALU), where other operations are performed. Although adders can be constructed for many numerical representations, such as binary-coded decimal or excess-3, the most common adders operate on binary numbers. In cases where two's complement is being used to represent negative numbers, it is trivial to modify an adder into an adder-subtractor.
Addition is the most common and most often used arithmetic operation in microprocessors and digital signal processors, especially digital computers. It also serves as a building block for synthesizing all other arithmetic operations. Therefore, for the efficient implementation of an arithmetic unit, the binary adder structure becomes a very critical hardware unit.
In any book on computer arithmetic, one finds that there exists a large number of different circuit architectures with different performance characteristics that are widely used in practice. Although much research dealing with binary adder structures has been done, studies based on their comparative performance analysis are only a few.
In this project, qualitative evaluations of the classified binary adder architectures are given. Among the many adders, we wrote VHDL (VHSIC Hardware Description Language) code for the ripple-carry, carry-select and carry-lookahead adders to emphasize the common performance properties belonging to their classes. In the following section, we give a brief description of the studied adder architectures.
The first class consists of the very slow ripple-carry adder with the smallest area. In the second class, the carry-skip and carry-select adders with multiple levels have small area requirements and shortened computation times. From the third class, the carry-lookahead adder, and from the fourth class, the parallel prefix adder, represent the fastest addition schemes with the largest area complexities.
Types of adders
For single-bit adders, there are two general types. A half adder has two inputs, generally labeled A and B, and two outputs, the sum S and carry C. S is the XOR of A and B, and C is the AND of A and B. Essentially, the output of a half adder is the sum of two one-bit numbers, with C being the more significant of the two outputs. The second type of single-bit adder is the full adder. The full adder takes into account a carry input, so that multiple adders can be used to add larger numbers. To remove ambiguity between the input and output carry lines, the carry in is labeled Ci or Cin while the carry out is labeled Co or Cout.
Inputs    Outputs
A  B      S  C
0  0      0  0
0  1      1  0
1  0      1  0
1  1      0  1
The full adder produces a sum and a carry value, which are both binary digits. It can be combined with other full adders (see below) or work on its own.

Inputs       Outputs
A  B  Ci     Co  S
0  0  0      0   0
0  0  1      0   1
0  1  0      0   1
0  1  1      1   0
1  0  0      0   1
1  0  1      1   0
1  1  0      1   0
1  1  1      1   1
Note that the final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. This is because the only discrepancy between OR and XOR gates occurs when both inputs are 1; for the adder shown here, one can check this is never possible. Using only two types of gates is convenient if one desires to implement the adder directly using common IC chips. A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that to an input of the second half adder, connecting Ci to the other input, and OR-ing the two carry outputs. Equivalently, S could be made the three-bit XOR of A, B, and Ci, and Co could be made the three-bit majority function of A, B, and Ci. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.
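The two-half-adder construction just described can be checked exhaustively with a small sketch (Python used for illustration):

```python
# Full adder built from two half adders, with the two carry outputs
# OR-ed together, as described in the text.

def half_adder(a, b):
    return a ^ b, a & b            # sum = XOR, carry = AND

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)      # first half adder on A and B
    s, c2 = half_adder(s1, cin)    # second half adder adds Ci
    return s, c1 | c2              # OR the two carries for Cout

# Exhaustive check against the truth table: (Co, S) encodes a + b + ci.
for a in (0, 1):
    for b in (0, 1):
        for ci in (0, 1):
            s, co = full_adder(a, b, ci)
            assert 2 * co + s == a + b + ci
```

The loop also confirms the OR/XOR remark above: c1 and c2 are never both 1, so either gate gives the same Cout.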
CHAPTER 4
FILTERS
4.1 FIR FILTER
Digital filters can be divided into two categories: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. Although FIR filters in general require more taps than IIR filters to obtain similar frequency characteristics, FIR filters are widely used because they have linear phase characteristics, guarantee stability, and are easy to implement with multipliers, adders and delay elements. The number of taps in digital filters varies according to the application. In commercial filter chips with a fixed number of taps, zero coefficients are loaded into the registers for unused taps and unnecessary calculations have to be performed. To alleviate this problem, FIR filter chips providing variable-length taps have been widely used in many application fields. However, these FIR filter chips use memory, an address generation unit, and a modulo unit to access memory in a circular manner. The paper proposes two special features, called a data reuse structure and a recurrent-coefficient scheme, to provide variable-length taps efficiently. Since the proposed architecture only requires several MUXs, registers, and a feedback loop, the number of gates can be reduced by over 20% compared with existing chips.
An adaptive filter uses a procedure that adjusts the filter impulse response so as to minimize or maximize a performance function. The gradient search algorithm was selected to simplify the filter design. The filter output is given by

yk = Σ (i = 0 to N−1) wk(i) · xk−i                   (4.2)

The filter coefficient update equation is given by

wk+1 = wk + μ · ek · xk                              (4.3)

where xk is the filter input at sample k, ek = pk − yk is the error term at sample k, and μ is the step size for updating the weight values.
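One gradient-search (LMS) update step as described above can be sketched in Python; the names, and the desired sample p_k in particular, are illustrative rather than taken from the original design:

```python
# One LMS coefficient update: compute the filter output, form the
# error against the desired sample, and nudge each weight by
# mu * error * input.

def lms_step(w, x_window, p_k, mu):
    y_k = sum(wi * xi for wi, xi in zip(w, x_window))   # filter output
    e_k = p_k - y_k                                     # error term
    new_w = [wi + mu * e_k * xi for wi, xi in zip(w, x_window)]
    return new_w, e_k
```

Repeating this step over the input stream drives the error toward zero, which is the gradient-descent behavior the text refers to.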
CHAPTER 5
VHDL
Many DSP applications demand high throughput and real-time response, performance
constraints that often dictate unique architectures with high levels of concurrency. DSP designers
need the capability to manipulate and evaluate complex algorithms to extract the necessary level
of concurrency. Performance constraints can also be addressed by applying alternative
technologies. A change at the implementation level of design by the insertion of a new
technology can often make viable an existing marginal algorithm or architecture.
The VHDL language supports these modeling needs at the algorithm or behavioral level,
and at the implementation or structural level. It provides a versatile set of description facilities to
model DSP circuits from the system level to the gate level. Recently, we have also noticed efforts
to include circuit-level modeling in VHDL. At the system level we can build behavioral models
to describe algorithms and architectures. We would use concurrent processes with constructs
common to many high-level languages, such as if, case, loop, wait, and assert statements. VHDL
also includes user-defined types, functions, procedures, and packages. In many respects VHDL
is a very powerful, high-level, concurrent programming language. At the implementation level we
can build structural models using component instantiation statements that connect and invoke
subcomponents. The VHDL generate statement provides ease of block replication and control. A
dataflow level of description offers a combination of the behavioral and structural levels of
description. VHDL lets us use all three levels to describe a single component. Most importantly,
the standardization of VHDL has spurred the development of model libraries and design and
development tools at every level of abstraction. VHDL, as a consensus description language and
design environment, offers design tool portability, easy technical exchange, and technology
insertion.
Among the concurrent statements are signal assignment statements, which associate a target signal with an expression and a delay. The list of signals appearing in the expression is the sensitivity list; the expression must be evaluated on any change of any of these signals. The target signals obtain new values after the delay specified in the signal assignment statement. If no delay is specified, the signal assignment occurs during the next simulation cycle:
c <= a + b after delay;
VHDL also includes conditional and selected signal assignment statements. It uses block
statements to group signal assignment statements and makes them synchronous with a
guarded condition. Block statements can also contain ports and generics to provide more
modularity in the descriptions. We commonly use concurrent process statements when we wish to
describe hardware at the behavioral level of abstraction. The process statement consists of
declarations and procedural types of statements that make up the sequential program. Wait and
assert statements add to the descriptive power of the process statements for modeling concurrent
actions:
process
  variable i : real := 1.0;
begin
  wait on a;
  i := b * 3.0;
  c <= i after delay;
end process;
Other concurrent statements include the concurrent assertion statement, concurrent procedure call, and generate statement. Packages are design units that permit types and objects to be shared. Arithmetic operations dominate the execution time of most digital signal processing (DSP) algorithms, and currently the time it takes to execute a multiplication operation is still the dominating factor in determining the instruction cycle time of DSP chips and Reduced Instruction Set Computers (RISC). Among the many methods of implementing high-speed parallel multipliers, one basic approach is the Booth algorithm. Power consumption in VLSI DSPs has gained special attention due to the proliferation of high-performance portable battery-powered electronic devices such as cellular phones, laptop computers, etc. DSP applications require high computational speed and, at the same time, suffer from stringent power dissipation constraints. Multiplier modules are common to many DSP applications. The fastest
types of multipliers are parallel multipliers. Among these, the Wallace multiplier is one of the fastest. However, it suffers from poor regularity. Hence, when regularity, high performance and low power are primary concerns, Booth multipliers tend to be the primary choice. Booth multipliers allow operation on signed operands in two's complement. They derive from array multipliers where, for each bit in a partial product line, an encoding scheme is used to determine whether this bit is positive, negative or zero. The modified Booth algorithm achieves a major performance improvement through radix-4 encoding. In this algorithm, each partial product line operates on 2 bits at a time, thereby reducing the total number of partial products. This is particularly beneficial for operands of 16 bits or more.
CHAPTER 6
DESIGN PROCEDURE
In this project we first design two different multipliers, using the shift-and-add method and the Radix_4 Booth algorithm. We used different types of adders, such as a sixteen-bit full adder, in designing those multipliers. Then we designed a 4-tap delay FIR filter and, in place of the multiplications and additions, we implemented the components of the different multipliers and adders. Finally, we compared the two multipliers by comparing the power consumption of each.
The inputs a and b hold the multiplier and multiplicand bits respectively. The output signal product contains the product of a and b, which is 16 bits long.
The signal pp contains the partial products, pc holds the carry bits, and ps holds the sum of the partial products. After the partial products are computed, they are added by shifting each partial product one bit left; all the shifted partial products are then added and assigned to ps, and finally the product is assigned from ps.
Example: multiplicand = 78, multiplier = 36; adding the shifted partial products gives the product 2808.
Instead of adding −1 times the multiplicand M at shift position i and +1 times M at position i+1, the same result is obtained by adding +1 times M at position i. Other examples are: (+1 0) is equivalent to (0 +2), (−1 +1) is equivalent to (0 −1), and so on. Thus, if the Booth-recoded multiplier is examined two bits at a time, starting from the right, it can be rewritten in a form that requires at most one version of the multiplicand to be added to the partial product for each pair of multiplier bits.
The design procedure of the Radix_4 Booth multiplier requires a Booth encoder and a 16-bit adder, so we used them as components. The Booth multiplier examines the multiplier bits three at a time from right to left and generates the corresponding partial products. The 16-bit full adder then produces the product by adding the partial products.
The FIR filter output is given by

y(n) = Σ (k = 0 to M−1) h(k) · x(n − k)              (7.1)

where {h(k)} is the set of filter coefficients. Alternatively, we can express the output sequence as the convolution of the unit sample response h(n) of the system with the input signal. From equation (7.1) it is clear that in order to implement an FIR filter we need filter coefficients, an adder and a multiplier.
This project implements the FIR filter using an array multiplier and a modified Booth multiplier. The project includes the design of a 4-tap delay FIR filter, which requires 4 filter coefficients; these are declared in the program itself. For the addition and multiplication operations, an adder and a multiplier are used as components. A register is declared to store the previous outputs, and an accumulator is declared to store the sum of the previous and present outputs, so that after execution the accumulator contains the convolution sum, which is the output of the FIR filter.
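A behavioral sketch of this 4-tap structure (Python for illustration; the coefficient values are placeholders, not the project's actual coefficients): a delay line holds the previous inputs and the accumulator sums the coefficient-input products to form the convolution output.

```python
# 4-tap FIR filter: delay-line registers hold past samples, and an
# accumulator forms the convolution sum one output per input sample.

COEFFS = [1, 2, 2, 1]   # h(0)..h(3); placeholder values

def fir_filter(samples, coeffs=COEFFS):
    delay = [0] * len(coeffs)          # delay-line registers
    out = []
    for x in samples:
        delay = [x] + delay[:-1]       # shift in the new sample
        acc = 0                        # accumulator
        for h, xd in zip(coeffs, delay):
            acc += h * xd              # one multiplier + adder per tap
        out.append(acc)
    return out
```

Feeding a unit impulse through the filter returns the coefficient sequence itself, a quick sanity check that the delay line and accumulator implement equation-style convolution correctly.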
CHAPTER 7
SIMULATION RESULTS
16_BIT ADDER OUTPUT
multiplication. s1, s2, s3 are the modified partial products. sum1, sum2, sum3 are the intermediate signals formed by adding the modified partial products. k1, k2, k3 are the overflow bits in the adder blocks used for adding the partial products.
XPower Estimator is a tool used to calculate the total power consumed by a circuit. The method of calculating the power is very simple. After synthesizing the design, generate the map report for the written program. In the tool, click on the Import from ISE button and import the map report. After the map report has been imported, the tool reports the total power. The power consumed by the Radix_4 Booth multiplier, as reported by the tool, is 44 mW.
The above graphs show the variation of power with junction temperature and VCCINT voltage. The power consumption increases with increasing junction temperature and VCCINT voltage.
SYNTHESIS REPORT

                      ARRAY MULTIPLIER    RADIX_4 MULTIPLIER
Number of Slices      71                  101
                      123                 179
                      16                  16
                      16                  17
Power                 99 mW               44 mW
CHAPTER 8
CONCLUSION AND FUTURE WORK
CONCLUSION
Our project gives a clear concept of different multipliers and their implementation in a tap delay FIR filter. We found that the parallel multipliers are a much better option than the serial multiplier. We concluded this from the results for power consumption and total area. In the case of parallel multipliers, the total area is much less than that of serial multipliers, and hence the power consumption is also less. This is clearly depicted in our results. This speeds up the calculation and makes the system faster.
When we compared the modified Booth multiplier with the array multiplier, we found that the modified Booth multiplier consumes less power. This is because it uses almost half the number of iterations and adders compared to the array multiplier, so the computation speed is also higher for the Radix_4 Booth multiplier.
Multipliers are one of the most important components of many systems, so we always need to look for better multiplier designs. Our multipliers should consume less power and occupy less area. Through our project we tried to determine which of the multipliers works better, and in the end we determined that the radix-4 modified Booth algorithm works best.
FUTURE WORK
As an attempt to develop an arithmetic algorithm for low-power multiplier design, the Radix_4 Booth algorithm presented in this project has achieved good results. However, there are limitations in our work, and several future research directions are possible.
One possible direction is radix-higher-than-4 recoding. We have only considered radix-4 recoding, as it is a simple and popular choice. Higher-radix recoding further reduces the number of PPs and thus has the potential for power saving. However, because of the difficulty of generating hard PPs such as 3X, higher-radix recoding may increase critical path delay and design complexity, which are negative factors for power. Thus, there is a power/delay/area tradeoff in higher-radix recoding.
REFERENCES
Websites referred:
1. www.wikipedia.com
2. www.howstuffswork.com
3. www.xilinx.com
Books referred:
1. Circuit Design with VHDL, by Volnei A. Pedroni, pages 285-293.
2. VHDL for Designers, by Stefan Sjoholm.
3. A VHDL Primer, by J. Bhasker.
4. Digital Signal Processing, by Johnny R. Johnson, PHI publications.
5. Digital Signal Processing, by Vallavaraj and Salivahanan, TMH publications.
6. Computer Organization, fifth edition, by Carl Hamacher, Zvonko Vranesic, and Safwat Zaky.