1 views

Uploaded by Anonymous 1aqlkZ

- Paper
- CSS-28
- DCS Assignment 1
- images & video capture - qt+opencv.pdf
- 70 1520501633_08-03-2018.pdf
- TWB Position Paper SoC Semiconductor Industry
- 2007 Nocs Config Slides
- Experiment 1
- Tp LeakagePowerOptimizationFlow
- System on a Chip Market Opportunities, Revenue Analysis and - Forecasts from 2019 to 2023
- xtalk_130nm
- proc2006.pdf
- Comparative Study of Spurious Power Suppression Technique Based Carry Look Ahead Adder And Carry Select Adder
- D ICs
- Tarea1 Lara García
- cr1694.pdf
- AMD EYPC Data Sheet
- ASM IV Remote Monitoring
- Online Assignment
- 1 Digital Logic Circuits

You are on page 1of 8

The continuing reduction in feature sizes of digital very large scale integration (VLSI)

circuits enables the integration of dozens, and in the future hundreds of processing elements on a

single chip. This single chip unit is called System on Chip (SOC). On this Single Chip the

interconnection between each other becomes most challenging issue. In most SOC applications,

a shared bus interconnection to communicate with each integrated processing unit because of its

low-cost and simple control characteristics. However, such shared bus interconnection has some

limitation in its scalability because only one master at a time can utilize the bus which means all

the bus accesses should be serialized by the arbitrator. Therefore, in such an environment where

the number of bus requesters is large and their required bandwidth for interconnection is more

than the current bus, some other interconnection methods should be considered.

To overcome this delay and scalable bandwidth requirement can be satisfied by using

networking topology on-chip. On-chip packet-switched micro-network of interconnects,

generally known as Network-on-Chip (NoC) architecture. By introducing network-on-chip

concept into multiprocessor system-on-chip (MPSoC) platforms, classical multiple buses or

“spaghetti-like” chip structures are replaced by a well-structured, easy scalable and modular

interconnect architecture. The NOC interconnection like switching logic, routing algorithm and

its packet definition should be light-weighted to result in easily implemental solutions. A typical

network-on-chip is comprised of computational processing elements (PE), network interfaces

(NIC) and routers (R). The main part of a network-on-chip architecture is a router. Processing

elements send the data packets over its network interface to the router. The router directly

connects various subsystems (which can be homogenous or heterogeneous) and primarily is used

for communication packets routing, congestion prevention and safe data packets transfer from a

source sub-system to the destination one. In order to implement a competitive network-on-chip

architecture, a router with routing algorithm, should be efficiently design.

The Advantages of On-Chip Networks

The On-Chip Networks having the following advantages compare to other wiring

technology in the chip. They are Energy efficiency, reliability, reusability, scalability and

flexibility.

Energy Efficiency

According to the International Technology Roadmap for Semiconductors (ITRS) and

Semiconductor Industry Association (SIA) roadmaps, clock frequency and number of on-chip

devices are increased. That is, much tighter power budgets for all system components are

required. Based on the roadmaps, as computation and storage components benefit from device

scaling, the energy for global communication does not scale down. Hence, communication-

energy minimization will be a growing concern in future technologies. The on-chip networks aim

to reduce this problem by scaling wires. This new model allows the decoupling of the PEs from

the network. The need for global synchronization can thereby disappear. This new approach

employs parallelism, exhibits modularity to minimize the use of global wires and utilizes locality

for power minimization. Furthermore, network traffic control and monitoring can help in better

managing the power consumed by networked computational resources. For instance, clock speed

and voltage of end nodes can be varied according to available network bandwidth. The emphasis

on energy minimization creates a sleuth of novel challenges that have not been addressed by

traditional high performance network designers.

Reliability

As the geometries of the transistors reach the physical limits of operation, it becomes

increasingly difficult for the hardware components to achieve reliable operation. The variability

in process manufacturing, issues of thermal hotspots and effects of various noise sources. Such

as power supply fluctuations, pose major challenges for the reliable operation of current and

future NoC-based MPSoCs. NoCs are particularly suited for implementation of fault-tolerant

techniques, due to their inherent parallelism and potential for reconfigurability. Fault –tolerant

techniques can be implemented at different levels, from hardware redundancy to software-based

error recovery schemes. Adaptive routing algorithms combined with error detection mechanisms

show great promise in achieving fault-tolerant on-chip communication. If data is sent on an

unreliable channel in packets, error detection and recovery is easier, because the effect of errors

is contained by packet boundaries, and error recovery can be error correcting codes (ECC),

whereas robust and fault-tolerant routing algorithms can route around faulty regions.

Reusability

PEs are usually obtained from internal sources or third parties, and integrated on a single

chip. This reusable PEs may include embedded processors; memory blocks, interface blocks,

analog blocks, and components are also provided in a reusable from and may include real-time

operating systems and kernels, library functions and device drivers. That is, PEs is reusable in

nature if they conform to a common interface and synchronization mechanisms with the on-chip

network. Using a standard interface such as AXI [13], OCP [14], and DTL , in on-chip networks

facilitates the employment of reusable components. In fact, employing a standard interface does

not change the way PEs are developed, since they will still be developed for a certain protocol.

What changes is that a public domain protocol is used and accepted by the industry as a standard,

like the PCI standard for microcomputer manufacturers. Accordingly, not only the PEs

reusability becomes higher but also the design time is reduced . In addition, on-chip routers are

generic in nature and the communication can be employed with any conforming PE.

Scalability

NoC platform is composed of on-chip routers and communication links that are basically

distributed and independent. Each PE is added into the network along with a dedicated router

having a unique address or coordinate in the network. The communication exploits the packet

switching scheme while there is no central arbitration mechanism of the communication

platform. Therefore, the performance in this communication architecture is not constrained or

degraded by the addition of PEs. This is the essential characteristic of a scalable and modular

architecture. Indeed, on-chip interconnection network plays an important role in providing

scalability to integrate hundreds or even thousands of processing elements in a single billion-

transistor chip and alleviate deign productivity gap. In fact, using data packets for

communication, a high level of parallelism is achieved as all channels can be operated

simultaneously. There by, on-chip network improves the scalability in comparison with previous

communication structures such as shared buses or segmented buses.

Flexibility

Utilizing common buses between the communicating resources in SoCs will not give any

flexibility since the needs of the communication have to be thought of every time a design is

made. However, they suffer from low scalability. NoC solves their shortcomings by

implementing a communication network of routers and resources. NoC is a very flexible

communication infrastructure allowing the same physical link to be shared by many different

connections. As future SoC platforms are expected to contain hundreds of PEs, NoC needs to

support an even larger number of connections and many connections span a large number of

routers. This leads the same SoC platform to be used in a wide range of different applications

and thereby increases the production volume. As the same SoC platform is to be used for many

different applications, the NoC must be able to support a wide range of bandwidth and Quality-

of-Service (QoS) requirements. The requirements of the applications can be very different, and

the NoC must therefore be very flexible.

Approximate computing can decrease the design complexity with an increase in performance and

power efficiency for error resilient applications. This brief deals with a new design approach for

approximation of multipliers. The partial products of the multiplier are altered to introduce

varying probability terms. Logic complexity of approximation

is varied for the accumulation of altered partial products based on their probability. The proposed

approximation is utilized in two variants of 16-bit multipliers. Synthesis results reveal that two

proposed multipliers achieve power savings of 72% and 38%, respectively, compared to an

exact multiplier. They have better precision when compared to existing approximate multipliers.

Mean relative error figures are as low as 7.6% and 0.02% for the proposed approximate

multipliers, which are better than the previous works. Performance of the proposed multipliers

is evaluated with an image processing application, where one of the proposed models achieves

the highest peak signal to noise ratio.

In applications like multimedia signal processing and data mining which can tolerate error, exact

computing units are not always necessary. They can be replaced with their approximate

counterparts. Research on approximate computing for error tolerant applications is on the rise.

Adders and multipliers form the key components in these applications. In [1], approximate full

adders are proposed at transistor level and they are utilized in digital signal processing

applications. Their proposed full adders are used in accumulation of partial products in

multipliers. To reduce hardware complexity of multipliers, truncation is widely

employed in fixed-width multiplier designs. Then a constant or variable correction term is added

to compensate for the quantization error introduced by the truncated part [2], [3]. Approximation

techniques in multipliers focus on accumulation of partial products, which is crucial in terms of

power consumption. Broken array multiplier is implemented in [4], where the least significant

bits of inputs are truncated, while forming partial products to reduce hardware

complexity. The proposed multiplier in [4] saves few adder circuits in partial product

accumulation.In [5], two designs of approximate 4-2 compressors are presented and used in

partial product reduction tree of four variants of 8 × 8 Dadda multiplier. The major drawback of

the proposed compressors in [5] is that they give nonzero output for zero valued inputs,

which largely affects the mean relative error (MRE) as discussed later. The approximate design

proposed in this brief overcomes the existing drawback. This leads to better precision. In static

segment multiplier (SSM) proposed in [6], m-bit segments are derived from n-bit operands based

on leading 1 bit of the operands. Then, m× m multiplication is performed instead of n × n

multiplication,where m<n. Partial product perforation (PPP) multiplier in [7] omits Manuscript k

successive partial products starting from jth position, where j ∈ [0,n-1] and k ∈ [1, min(n-j, n-1)]

of a n-bit multiplier. In [8], 2 × 2 approximate multiplier based on modifying an entry in the

Karnaugh map is proposed and used as a building block to construct 4 × 4 and 8 × 8 multipliers.

In [9], inaccurate counter design has been proposed for use in power efficient Wallace tree

multiplier. A new approximate

adder is presented in [10] which is utilized for partial product accumulation of the multiplier. For

16-bit approximate multiplier in [10], 26% of reduction in power is accomplished compared to

exact multiplier. Approximation of 8-bit Wallace tree multiplier due to voltage over-scaling

(VOS) is discussed in [11]. Lowering supply voltage creates paths failing to meet delay

constraints leading to error. Previous works on logic complexity reduction focus on

straightforward application of approximate adders and compressors to the partial products. In

this brief, the partial products are altered to introduce terms with different probabilities.

Probability statistics of the altered partial products are analyzed, which is followed by systematic

approximation. Simplified arithmetic units (half-adder, full-adder, and 4-2 compressor) are

proposed for approximation.The arithmetic units are not only reduced in complexity, but care

is also taken that error value is maintained low. While systemic approximation helps in achieving

better accuracy, reduced logic complexity of approximate arithmetic units consumes less power

and From statistical point of view, the partial product am,n has a probability of 1/4 of being 1. In

the columns containing more than three partial products, the partial products am,n and an,m are

combined to form propogate and generate signals as given in (1). The resulting propogate and

generate signals form altered partial products pm,n and gm,n. From column 3 with weight 23 to

column 11 with weight 211, the partial products am,n and an,m are replaced by altered

partial products pm,n and gm,n. The original and transformed partial product matrices are shown

in Fig. 1

pm,n = am,n + an,m

gm,n = am,n · an,m. (1)

The probability of the altered partial product gm,n being one is 1/16, which is significantly lower

than 1/4 of am,n. The probability of altered partial product pm,n being one is 1/16 + 3/16 + 3/16

= 7/16, which is higher than gm,n. These factors are considered, while applying approximation to

the altered partial product matrix. A. Approximation of Altered Partial Products gm,n The

accumulation of generate signals is done columnwise. As each element has a probability of 1/16

of being one, two elements being 1 in the same column even decreases. For example, in a

column with 4 generate signals, probability of all numbers being 0 is (1 - pr)4, only one element

being one is 4pr(1 - pr)3, the probability of two elements being one in the column is 6pr 2(1 -

pr)2, three ones is 4pr 3(1-pr) and probability of all elements being 1 is pr4, where pr is 1/16.

The probability statistics for a number of generate elements m in each column are given in Table

I. Based on Table I, using OR gate in the accumulation of columnwise generate elements in the

altered partial product matrix provides exact result in most of the cases. The probability of error

(Perr ) while using OR gate for reduction of generate signals in each column is also listed

in Table I. As can be seen, the probability of misprediction is very low. As the number of

generate signals increases, the error probability increases linearly. However, the value of error

also rises. To prevent this, the maximum number of generate signals to be grouped by OR gate is

kept at 4. For a column having m generate signals, m/4 OR gates are used.

The accumulation of other partial products with probability ¼ for am,n and 7/16 for pm,n uses

approximate circuits. Approximate half-adder, full-adder, and 4-2 compressor are proposed for

their accumulation. Carr y and Sum are two outputs of these approximate circuits. Since Carr y

has higher weight of binary bit, error in Carr y bit will contribute more by producing error

difference of two in the output. Approximation is handled in such a way that the absolute

difference between actual output and approximate output is always maintained as one. Hence

Carr y outputs are approximated only for the cases, where Sum is approximated.

In adders and compressors, XOR gates tend to contribute to high area and delay. For

approximating half-adder, XOR gate of Sum is replaced with OR gate as given in (2). This results

in one error in the Sum computation as seen in the truth table of approximate half-adder in Table

II. A tick mark denotes that approximate output matches with correct output and cross mark

denotes mismatch

Sum = x1 + x2

Carr y = x1 · x2. (2)

In the approximation of full-adder, one of the two XOR gates is replaced with OR gate in Sum

calculation. This results in error in last two cases out of eight cases. Carr y is modified as in (3)

introducing one error. This provides more simplification, while maintaining the

difference between original and approximate value as one. The truth table of approximate full-

adder is given in Table III

W = (x1 + x2)

Sum = W ⊕ x3

Carr y = W · x3. (3)

Two approximate 4-2 compressors in [5] produce nonzero output

even for the cases where all inputs are zero. This results in high ED and high degree of precision

loss especially in cases of zeros in all bits or in most significant parts of the reduction tree. The

proposed 4-2 compressor overcomes this drawback. In 4-2 compressor, three bits are required for

the output only when all the four inputs are 1, which happens only once out of 16 cases.

This property is taken to eliminate one of the three output bits in area. The proposed multipliers

outperforms the existing multiplier designs in terms of area, power, and error, and achieves better

peak signal to noise ratio (PSNR) values in image processing application.

Error distance (ED) can be defined as the arithmetic distance between a correct output and

approximate output for a given input. In [12], approximate adders are evaluated and normalized

ED (NED) is proposed as nearly invariant metric independent of the size of the

approximate circuit. Also, traditional error analysis, MRE is found for existing and proposed

multiplier designs. The rest of this brief is organized as follows. Section II details the

proposed architecture. Section III provides extensive result analysis of design and error metrics

of the proposed and existing approximate multipliers. The proposed multipliers are utilized in

image processing application and results are provided in Section IV. Section V

concludes this brief.

products reduction tree, and finally, a vector merge addition to produce final product from the

sum and carry rows generated from the reduction tree. Second step consumes more power.

In this brief, approximation is applied in reduction tree stage. A 8-bit unsigned1 multiplier is

used for illustration to describe the proposed method in approximation of multipliers. Consider

two 8-bit unsigned input operands α = 7 m=0 αm2m and β = 7 n=0 βn2n. The partial product

am,n = αm · βn in Fig. 1 is the result of AND operation between the bits of αm and βn

- PaperUploaded bySam
- CSS-28Uploaded bydhivakargunalan
- DCS Assignment 1Uploaded byAayush Jain
- images & video capture - qt+opencv.pdfUploaded byJavierCastro
- 70 1520501633_08-03-2018.pdfUploaded byRahul Sharma
- TWB Position Paper SoC Semiconductor IndustryUploaded byThe Writers Block
- 2007 Nocs Config SlidesUploaded byMohammad Seemab Aslam
- Experiment 1Uploaded byHimanshu Gupta
- Tp LeakagePowerOptimizationFlowUploaded bySathyanarayana Rao
- System on a Chip Market Opportunities, Revenue Analysis and - Forecasts from 2019 to 2023Uploaded byAkash Sangshetti
- xtalk_130nmUploaded byGaurav Sharma
- proc2006.pdfUploaded byOzgur Tung
- Comparative Study of Spurious Power Suppression Technique Based Carry Look Ahead Adder And Carry Select AdderUploaded byEditor IJRITCC
- D ICsUploaded byKonara Kiran
- Tarea1 Lara GarcíaUploaded bylarameca
- cr1694.pdfUploaded byVenkata Ganesh Gorla
- AMD EYPC Data SheetUploaded byJames Westover
- ASM IV Remote MonitoringUploaded bygokonk
- Online AssignmentUploaded byAjith Manu
- 1 Digital Logic CircuitsUploaded byMOi AKhi
- Computer Technician 073014Uploaded bylance456
- 01121 Special Systems-Website and Construction CamerasUploaded byFred Anthony A. Uera
- ComFacts Safeethernet EUploaded byMiroslav Gojic
- 651243IJSETR Paper TemplateUploaded byAshishKumar
- Dairy Fram GroupUploaded byNikhil Kashimpuria
- Efficient Built-In Self-repair Strategy for Embedded SRAM With Selectable RedundancyUploaded bymdhuq1
- Detailedsylabus.blogspot.in-kerala Psc Lecturer in Information Technology Polytechnics Full SyllabusUploaded byJustin Mathew
- GA3 Final Project PlanUploaded byuser700
- Mohana ResumesUploaded byganeshagp
- prashanthUploaded byanon-890752

- Diploma Embedded Project Titles 2014.docxUploaded byAnonymous 1aqlkZ
- An Efficient Approach for Segmentation and ClassifUploaded byAnonymous 1aqlkZ
- ESAU-04Uploaded byAnonymous 1aqlkZ
- Naresh ABSTRACT FinalUploaded byAnonymous 1aqlkZ
- Detection And Segmentation Of Small Renal Masses.docxUploaded byAnonymous 1aqlkZ
- Detection And Segmentation Of Small Renal Masses.docxUploaded byAnonymous 1aqlkZ
- about vlsi.docxUploaded byAnonymous 1aqlkZ
- ieee pprUploaded byAnonymous 1aqlkZ
- ESAU-04Uploaded byAnonymous 1aqlkZ
- Latest AtmUploaded byAnonymous 1aqlkZ
- SPMA32Uploaded byAnonymous 1aqlkZ
- Multispectral Satellite Images.docxUploaded byAnonymous 1aqlkZ
- AN EFFICIENT APPROACH FOR SEGMENTATION AND CLASSIF.docxUploaded byAnonymous 1aqlkZ
- PDCA.docxUploaded byAnonymous 1aqlkZ
- Spread SpectrumUploaded byAnonymous 1aqlkZ
- ESAU-04Uploaded byAnonymous 1aqlkZ
- ESAU-04Uploaded byAnonymous 1aqlkZ
- SPMA018Uploaded byAnonymous 1aqlkZ
- EegUploaded byAnonymous 1aqlkZ
- SPMA47Uploaded byAnonymous 1aqlkZ
- SPMA45Uploaded byAnonymous 1aqlkZ
- SPMA019Uploaded byAnonymous 1aqlkZ
- SPMA46Uploaded byAnonymous 1aqlkZ
- SPMA48Uploaded byAnonymous 1aqlkZ
- SPMA117Uploaded byAnonymous 1aqlkZ
- SPMA116Uploaded byAnonymous 1aqlkZ
- AN EFFICIENT APPROACH FOR SEGMENTATION AND CLASSIF.docxUploaded byAnonymous 1aqlkZ
- An Advanced Home Energy Management System (1)(1)Uploaded byAnonymous 1aqlkZ
- Spread SpectrumUploaded byAnonymous 1aqlkZ

- ACC 403 QUIZ 2Uploaded byWandavang
- Selenium InterviewUploaded byAshok
- Exalted Cheatsheet, WIP.docxUploaded byThingamabob9999
- ABCDE Socio Economic Classification MEDIARESEARCH Specification 2015Uploaded byMuhamad Tazyinul Akhlaq
- visual rhetoric essayUploaded byapi-341087400
- Hal Foster Death in AmericaUploaded byAnna-Lena Werner
- amy selk - resumeUploaded byapi-302617024
- 醫科Uploaded byJason Lin
- Safe Social Media FlipbookUploaded byYoLa Muñoz
- Class NotesUploaded byasadmir01
- As 1367Uploaded bynvh202148
- Exam TipsUploaded byLiibanMaahir
- Yojana April 2011Uploaded byAbhimanyu Sharma
- Causes and Effects of PovertyUploaded byLenaryle Dela Cerna Ramos
- A. I. vs Humans in Major Reading TestUploaded byInsights Of Interpol
- Comm CasesUploaded byrob1130
- Organic Blueberry GuideUploaded byvas_liviu
- Simulate ONTAP 9.2 Installation and Setup GuideUploaded byCharl11e
- 4 srt lesson idea 2016Uploaded byapi-346732338
- 2014-15 programmeUploaded byapi-269113665
- SCORA manual.pdfUploaded byAmy Wilson
- Us Gaap TheseUploaded byMessaoudi
- Dante: "Inferno", Translated by David Bruce GainUploaded byDavid Bruce Gain
- Beyonce Motion to DismissUploaded byTHROnline
- s13052-015-0172-0Uploaded byM Ibnu Rahman Syah
- $10000copperBernsteinUploaded byemailrobertguy
- TeleStacker® Conveyor ManualUploaded byfullmetalnzs
- ISCSI_optimization_EQLUploaded byRobert Stephens
- 4 Zacarias vs. Anacay.docxUploaded bySarah Ganto
- PSY 1Uploaded byjeanette8rios