
UNIVERSITY OF CINCINNATI

Date: 07/11/2008

I, ARUN JANARTHANAN,
hereby submit this work as part of the requirements for the degree of:
DOCTOR OF PHILOSOPHY
in:
COMPUTER ENGINEERING
It is entitled:
NETWORKS-ON-CHIP BASED HIGH PERFORMANCE
COMMUNICATION ARCHITECTURES FOR FPGAS

This work and its defense approved by:

Chair: Dr. Ranga Vemuri
Dr. Karen Tomko
Dr. Harold Carter
Dr. Wen Ben Jone
Dr. S. Srinivasan
Networks-on-Chip based High Performance Communication
Architectures for FPGAs

A Dissertation submitted to the

Division of Research and Advanced Studies


of the University of Cincinnati

in partial fulfillment of the


requirements for the degree of

DOCTOR OF PHILOSOPHY

in the Department of
Electrical and Computer Engineering and Computer Science
of the College of Engineering.

by

Arun Janarthanan

B. E. (ECE), Sri Venkateswara College of Engineering, University of
Madras, Chennai, India, 2003

Dissertation Advisor: Dr. Karen Tomko
Committee Chair: Dr. Ranga Vemuri
Abstract
Networks-on-Chip is a recent solution paradigm adopted to increase the performance of
multi-core designs. The key idea is to interconnect various computation modules (IP
cores) in a network fashion and transport packets simultaneously across them, thereby
gaining performance. In addition to improving performance by having multiple packets in
flight, NoCs also present a host of other advantages including scalability, power
efficiency, and component re-use through modular design.

This work focuses on the design and development of high performance communication
architectures for FPGAs using NoCs. Once completely developed, the above
methodology could be used to augment the current FPGA design flow for implementing
multi-core SoC applications. We design and implement an NoC framework for FPGAs,
Multi-Clock On-Chip Network for Reconfigurable Systems (MoCReS).
We enable the routers to function at independent clock frequencies that are dictated by
the FPGA place & route constraints, and yet follow a low latency virtual cut-through
flow control. With increasing design complexities, power trade-offs play a significant
role in FPGA design. We analyze the power consumed in the NoC framework that we
have developed on a Virtex-4 FPGA. Through experimental results, we study the various
components of power consumed in an FPGA based NoC.

We propose a novel micro-architecture for a hybrid two-layer router that supports both
packet-switched communications across its local and directional ports and time
multiplexed circuit-switched communications among the multiple IP cores directly
connected to it. Results from placed and routed VHDL models of the advanced router
architecture show an average improvement of 20.4% in NoC bandwidth (maximum of
24%) compared to a traditional NoC. We parameterize the hybrid router model over the
number of ports, channel width and bRAM depth and develop a library of network
components (MoClib Library).

Synthesizing an NoC topology for FPGAs from the above library of network components
requires a complex trade-off among switch complexity, area available and bandwidth
capacity. We develop an algorithm and an application-generic design flow that includes
required bandwidth and area in the cost function and synthesizes the NoC topology for
FPGAs. For a set of real application and synthetic benchmarks, our approach shows an
average reduction of 21.6% in FPGA area (maximum of 26%) for equivalent bandwidth
constraints when compared with a baseline approach.

Interconnecting IP cores through our NoC requires glue logic that can connect
different versions of the router to IPs. To accomplish this, we design a customizable
Network Interface that is compatible with our 2-layer hybrid router. To capture
real core implementation effects, we characterize a library of soft IP cores and implement
a typical image compression application on our FPGA. Through experiments we
determine the area and power overhead of our on-chip network on an FPGA when
implemented along with a typical application. Further, by accurately modeling our
on-chip network for area, delay and power, we develop a platform that could be used to
floorplan a complete multi-processor application along with the NoC.
Acknowledgements

Firstly, I would like to express my gratitude towards my advisor, Prof. Tomko, for shaping
up my graduate studies. Your strong direction and compassion have gone a long way in
helping me develop valuable academic and personal life qualities. Thanks for helping me
meet my deadlines and reviewing papers at very short notice. I consider myself very
fortunate to have taken courses with and been constantly associated with Prof. Vemuri and
his lab. The high standards you set in the courses and discussions served as a stable
platform for my research work. I also thank my other committee members, Prof. Carter,
Prof. Jone and Prof. Srinivasan, for reviewing my work and giving good feedback.

I consider myself very lucky to have had Prof. Srinivasan as my mentor at every stage
of my academic progress, from high school to declaring my dissertation complete. I
would like to acknowledge the support extended by Xilinx and Mentor Graphics through
their university programs. I would like to acknowledge our department staff: Rob Montjoy
for efficiently handling all computing and licensing issues, and Julie Muenchen for her
patience in ensuring that we conform to department regulations and formalities.

I am grateful to Intel and my supervisors for providing me with challenging
responsibilities during my internship. The break from graduate school was more than a
quality work experience: it went on to fund a large part of my research work.

Thanks to all my friends in graduate school. I thank Bala especially for helping
me develop strong interests in On-Chip Networks. I was inspired by the comprehensive
approach you had for problem solving. The late night/early morning discussions we
had in ERC need a special mention for strengthening my NoC background. Thanks to
Daniel for implementing the NI that is featured in the dissertation. Also, the discussions
Vijay & I had during our collaboration will be very memorable. Thanks to Vijay
Sundaresan for always finding time to help out with any issue. Thanks to Jayanth for
imparting high levels of optimism in anything we talked about.

In a five year long PhD program it is extremely important to keep in touch with
a lot of people, and I thank SABHA for serving as a wonderful medium to interact
with students and the community. I carry a rich variety of learning from serving for two
consecutive years on SABHA's committee. I will cherish the experience for years to
come.

I am thankful for UC's relaxed ambience. It covered all of our expenses, leaving us with
enough money for frequent travel. All the never-ending road trip memories that I share
with many at UC will be etched in my mind forever. UC's on-campus recreational and
housing facilities were excellent as well. I am indebted to my roommates Aravind and
Jagadish for making a wonderful home far away from home, and also for being great cooks
and, more importantly, agreeing to share cooking turns with me. I know that I have made
some lifelong bonds of friendship while I was at UC: Prasanna, Raghav, Ramki, Payal
and others whom I have not mentioned.


Thanks to those everlasting VLSI projects, I found a good project and life partner.
I consider myself very lucky to have found someone with whom I will share the rest of my
life: Anusha. I have taken you through virtually every up and down of my graduate school.
Thanks for being there for me always.


I am very fortunate to be a part of a large, caring family that spans the whole of
the US and India. Amma, Appa, Anu and now Nive: I am happily placed in a circle of
affection because of you. I dedicate this contribution to all of you and to my beloved
paati, who for all her life prayed for her grandson's success and moved to a more peaceful
world when I was presenting this doctoral dissertation.

Contents

1 Introduction
  1.1 Platform FPGAs
  1.2 Device vs Interconnect Scaling
  1.3 FPGA based NoCs
    1.3.1 Scalability
    1.3.2 Energy Efficiency
    1.3.3 Design Re-use
    1.3.4 Dynamic Reconfiguration
  1.4 Research Approach and Overview
    1.4.1 MoCReS
    1.4.2 Power Analysis
    1.4.3 Hybrid Two-Layer Router
    1.4.4 Topology Synthesis for FPGAs
    1.4.5 SoC Development
  1.5 Thesis Outline

2 On-Chip Network Background
  2.1 Alternatives in Communication Architecture
    2.1.1 Dedicated FPGA Interconnects
    2.1.2 Time-muxed Interconnects
    2.1.3 Circuit-Switched Interconnects
    2.1.4 Packet-Switched Networks
  2.2 On-Chip Network Description
  2.3 Network-on-Chip Aspects
    2.3.1 Topology
    2.3.2 Flow Control
    2.3.3 Routing
    2.3.4 Arbitration
    2.3.5 Buffering
  2.4 Current Research in NoC
    2.4.1 Industrial Applications with NoCs
    2.4.2 FPGA based NoC

3 MoCReS: NoC Framework
  3.1 Introduction
  3.2 Related Work
  3.3 Design Goals
  3.4 Network-on-Chip Aspects
    3.4.1 Network Topology
    3.4.2 Flow Control
    3.4.3 Routing
    3.4.4 Buffering
    3.4.5 Arbiter
  3.5 Router Micro-Architecture
    3.5.1 Packet Description
    3.5.2 Input Port
    3.5.3 Cross-Point Matrix
    3.5.4 Central Arbiter
  3.6 Results: Functional Simulation
    3.6.1 Common Clock Design
    3.6.2 Multi-Clock Design
  3.7 Results: Area-Performance Characterization
    3.7.1 Router Area Analysis
    3.7.2 FPGA NoC Resource Analysis
    3.7.3 Performance Analysis
  3.8 Conclusions
    3.8.1 Limitations of our Packet Switched MoCReS Framework

4 NoC Power Analysis
  4.1 Introduction
  4.2 Related Work
  4.3 MoCReS Power Consumption
    4.3.1 Components of Power
    4.3.2 Comparison with LiPaR
    4.3.3 NoC Component Power
  4.4 Conclusions

5 Hybrid Two-Layer Router Architecture
  5.1 Introduction
  5.2 Motivation
    5.2.1 Packetization and Control Overheads
  5.3 Related Work
  5.4 Architecture Description
    5.4.1 Cross-Point Matrix
    5.4.2 Central Arbiter
    5.4.3 NI Design
    5.4.4 Design Parameters
  5.5 Architectural Advantages
  5.6 System-Level Router Model
  5.7 Synthesis Results
  5.8 Results: Performance Improvement
    5.8.1 Design Issues
  5.9 Conclusions

6 Hybrid NoC: Performance and Power Analysis
  6.1 Performance Analysis
    6.1.1 Packet Injection Rate Vs Average Latency
  6.2 Power Analysis
    6.2.1 Power Breakdown Results
    6.2.2 Switch vs Link Power

7 Experimental Platform
  7.1 Multi-Processor Benchmarks
  7.2 Xilinx ISE Design Flow
    7.2.1 FPGA Resource Characterization
  7.3 NoC Design Flow
    7.3.1 Bandwidth Requirement Vs No. Flits

8 FPGA Based NoC: CAD Flow
  8.1 Introduction
  8.2 Related Work
  8.3 Problem Formulation
  8.4 Topology Synthesis Flow
    8.4.1 Clustering
    8.4.2 Mesh Generation
    8.4.3 Candidate Topology Selection
    8.4.4 Area Optimization
  8.5 Description of Algorithm
  8.6 Complexity Analysis
  8.7 Experimental Results and Analysis
    8.7.1 Execution Time Results
    8.7.2 Area Results
  8.8 Conclusions

9 NoC based System-on-Chip Development
  9.1 Motivation
  9.2 IP Core Library
    9.2.1 Core Abstraction
    9.2.2 Xilinx [1] IP Support
    9.2.3 IP Library Characterization
  9.3 Network Interface Implementation
    9.3.1 Primary Design Goals
    9.3.2 Customized IP Library
  9.4 NoC Floorplanning
    9.4.1 Synthesis of a Predictable NoC
  9.5 Image Compression Implementation
    9.5.1 NoC Implementation
  9.6 NoC Analysis

10 Summary of Contributions and Future Work
  10.1 Contributions
    10.1.1 NoC Framework: MoCReS
    10.1.2 Power Characterization
    10.1.3 Hybrid 2-Layer Architecture
    10.1.4 Performance and Power Analysis
    10.1.5 CAD Flow: Topology Synthesis
    10.1.6 NoC Based SoC Development
  10.2 Future Work
List of Tables

1.1 System Level Design Requirements [2]
3.1 Comparison of FPGA Router Designs
3.2 Standalone Router Area Results
4.1 Standalone: MoCReS (1VC+CC) Power Consumption
4.2 3 × 3 Mesh: MoCReS (1VC+CC) Power Consumption
4.3 Standalone: MoCReS (1VC+MC) vs LiPaR
4.4 Power Dissipation Across NoC Components
4.5 FPGA Resource Utilization: MoCReS (1VC+MC) vs LiPaR
5.1 Scaling of Area and Frequency with No. of C-Layer Ports
5.2 Scaling of Area and Frequency with No. of P-Layer Ports
6.1 Input Flits: Transition Activities
7.1 Application and Synthetic Benchmarks
7.2 Performance and Power Estimates of Routing Resources in XC4VLX100
7.3 MoCReS: FPGA Resource Utilization
7.4 Comparison between Communication Abstractions
8.1 Clustering Results for Benchmarks
8.2 Algorithm Execution Time
8.3 MPEG4 Area Improvement
List of Figures

1.1 Scaling of Global Interconnects [2]
1.2 Future of Networks-on-Chip [2]
3.1 A Multi-Clock 2 × 2 Mesh Based NoC
3.2 Packet Specification
3.3 Router Micro-Architecture (Input Port, Cross-Point, Central Arbiter)
3.4 Router Functional Simulation
3.5 Router Area Vs Channel Width
3.6 3 × 3 Mesh FPGA Utilization
3.7 Average Latency Vs Injection Rate of 3 × 3 Mesh
5.1 Hybrid Two-Layer Router Architecture
5.2 Modified Central Arbiter Model
5.3 SystemC Simulation
5.4 Design Parameters Vs Area
5.5 Design Parameters Vs Frequency
5.6 Area (Slices) Vs Avg. Bandwidth / Port
6.1 Baseline 3 × 2 MoCReS Mesh
6.2 Modified Hybrid Router 2 × 2 Mesh
6.3 Packet Statistics: No. Packets Injected from each IP
6.4 Results: Baseline 3 × 2 MoCReS Vs Hybrid Router Mesh
6.5 Dynamic Power Breakdown of an 8-port Hybrid Router
6.6 P-Layer Ports (Switch Size) Vs Dynamic Power (mW)
6.7 C-Layer Ports (Switch Size) Vs Dynamic Power (mW) @ 200MHz
6.8 Power (mW): Baseline MoCReS Vs Hybrid Router Mesh
7.1 Nallatech [3] Bendata-V4 Platform FPGA
7.2 Sample XDL Description
7.3 Routing Resources: XC4VLX100
7.4 MoCReS Routing Resource Utilization
7.5 Router Configurations
7.6 Experimental Flow for NoC Topology Synthesis
8.1 NoC Design Dependency Showing Our Approach
8.2 Topology Synthesis Framework
8.3 Clustering Phase of Topology Synthesis
8.4 IP Mapping and Link Bandwidth Estimation
8.5 Application Benchmarks
8.6 NoC Benchmark Results
9.1 ITRS 2007 Showing IP Design Reuse Trends
9.2 Xilinx [1] MicroBlaze System Design and Architecture
9.3 IP Core Abstraction and NI Wrapper
9.4 Triple DES IP Interface
9.5 IP Properties and Customization Overhead
9.6 IP Customization Overhead: Area and Power
9.7 NoC Mesh Floorplan
9.8 NoC Mesh Routing
9.9 Routing Resources: Delay and Power
9.10 JPEG Application
9.11 NoC Implementation Alternatives
9.12 JPEG Configuration: Area and Power Overhead Analysis
Chapter 1

Introduction

Present platform FPGAs consist of a variety of embedded computation elements, in
addition to the programmable logic and interconnect. The increasing heterogeneity,
coupled with higher operating frequencies, enables FPGAs to replace ASICs in several high
performance applications. In this chapter, we discuss platform FPGAs from an SoC
perspective and present the inherent performance limitation due to scaling of device sizes.
We then introduce the current solutions proposed to handle the performance bottleneck,
with an emphasis on the Network-on-Chip paradigm.

1.1 Platform FPGAs

Traditional FPGAs comprise a large amount of programmable logic and interconnect
to implement user applications. Recently, a coarse-grained approach has been
adopted by FPGA companies that combines the fine-grained reconfigurable resources
with hard embedded cores that match their ASIC counterparts in performance, power
and area. FPGAs are presently utilized in various application domains, from mobile
portable devices to space devices. They increasingly replace ASICs as the target
technology of choice due to their increasing device sizes and operating frequency. The
following are some of the capabilities that a present platform FPGA offers:


• Up to 200,000 Logic Cells

• Abundant Embedded IP (PowerPC processors, Multipliers, bRAMs & DCMs)

• Up to 500 MHz Clocking

• Multiple Device Voltages

• Rich Flexible Interconnects

A System-on-Chip (SoC) design integrates processors, memory, and a variety of IPs
in a single design. Due to the FPGA capabilities listed above and high time-to-market
pressures, complex SoC designs are increasingly targeted to FPGAs.

1.2 Device vs Interconnect Scaling

FPGA device manufacturers have achieved large device sizes using 65nm technology. In
this nanometer technology, the interconnects scale poorly compared to transistors. As a
result, the global interconnect in a design incurs a high delay relative to the
transistors. Figure 1.1 presents the difference between local and global interconnect
delay due to technology scaling [2].
The above performance limitation is more pertinent to FPGAs (than ASICs)
due to the programmable nature of their interconnects (and large device sizes). As a
result, interconnects in FPGAs account for a significant portion of total circuit
delay and power. As designs get bigger, it therefore becomes very difficult to maintain
high clock rates. Networks-on-Chip (NoC) is a design paradigm proposed to contend with
this inherent performance bottleneck. Figure 1.2 shows the projected trend for on-chip
network based design methods [2]. NoCs on FPGAs are an active area of research and
hold great promise for meeting present SoC communication needs.

Figure 1.1: Scaling of Global Interconnects [2]

1.3 FPGA based NoCs

Traditionally, cores in FPGAs are connected using bus-based architectures. NoCs are
proposed as an alternative to eliminate the inherent performance bottleneck in bus-based
architectures. In addition to an increase in performance, NoCs present a host of other
advantages, particularly for FPGA designs. In this section, we discuss these advantages
for implementing multi-core applications with NoCs.

1.3.1 Scalability

In NoCs, the IP cores are connected to the network through routers (the network backbone).
As the communication between routers is standardized, the addition of more cores does not
have an impact on the rest of the design. Bus-based architectures, on the other hand,
scale poorly with the number of cores: as the number of cores and the complexity of the
arbitration logic increase, the operating frequency of bus-based communication degrades.

Figure 1.2: Future of Networks-on-Chip [2]
[Roadmap chart spanning 2005 to 2021 and the 65nm, 45nm, 32nm, 22nm and 16nm nodes, with
timelines for System Level Component Re-use and On-Chip Network Design Methods. Legend:
Research Required, Development Underway, Qualification/Pre-Production, Continuous
Improvement.]

1.3.2 Energy Efficiency

FPGAs are a suitable implementation target for portable devices, and low power
operation is therefore a critical requirement. Compared with bus-based architectures,
NoCs consume less power because they switch less capacitance (shorter lines). Further,
due to a high level of parallelism in communication, the overall energy requirement is
comparatively low.
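The link between line length and power can be made concrete with the standard CMOS dynamic power relation, P = a · C · Vdd² · f. The sketch below is only an illustration of that relation: the capacitance, voltage, frequency and activity values are hypothetical and are not measurements from this work.

```python
def dynamic_power_mw(activity, capacitance_pf, vdd, freq_mhz):
    """Dynamic power P = a * C * Vdd^2 * f.

    pF * V^2 * MHz yields microwatts, so divide by 1000 to report mW.
    """
    return activity * capacitance_pf * vdd ** 2 * freq_mhz / 1000.0

# Hypothetical values: a long shared bus line versus a short router-to-router link.
long_bus_mw = dynamic_power_mw(activity=0.25, capacitance_pf=10.0, vdd=1.2, freq_mhz=200)
short_link_mw = dynamic_power_mw(activity=0.25, capacitance_pf=2.0, vdd=1.2, freq_mhz=200)

print(f"long bus line : {long_bus_mw:.3f} mW")
print(f"short NoC link: {short_link_mw:.3f} mW")
```

With all other factors equal, power scales linearly with the switched capacitance, which is why shorter lines help.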

1.3.3 Design Re-use

Due to a hierarchical approach, there is a rapid reduction in design and verification time
associated with NoCs. Further, development of IP cores can be independent of the
application and other parts of the design. Table 1.1 shows the future trend for design
re-use [2]. A steady increase is expected in the percentage of design components re-used,
which supports our choice of NoC as a design paradigm.

Table 1.1: System Level Design Requirements [2]

Year of Production                                    2005  2006  2007  2008  2009  2010  2011  2012
Design Reuse: % to all logic size                      32%   33%   35%   36%   38%   40%   41%   42%
SoC Reconfigurability: Total % of SoC reconfigurable   23%   26%   28%   28%   30%   35%   38%   40%

Year of Production (contd)                            2013  2014  2015  2016  2017  2018  2019  2020
Design Reuse: % to all logic size                      44%   46%   48%   49%   51%   52%   54%   55%
SoC Reconfigurability: Total % of SoC reconfigurable   42%   45%   48%   50%   53%   56%   60%   62%

1.3.4 Dynamic Reconfiguration

During dynamic reconfiguration, a computational element (core) is replaced by another
task without affecting the execution of other parts of the design. Table 1.1 presents the
expected reconfiguration trend [2]. A steady increase is expected in the percentage of
reconfigurable components in an SoC. A primary challenge in reconfigurable computing is
the concurrent design of the communication and computation subsystems. The NoC
approach to design in FPGAs inherently separates these two aspects of the design by
providing standard interfaces to the cores. Several current research efforts [4] [5] advocate
NoC design on FPGAs for efficient module replacement and design re-use.

1.4 Research Approach and Overview

In this research, we focus on networks-on-chip based high performance communication
architectures for FPGAs. Once completely developed, this NoC framework can efficiently
replace the traditional communication architecture.

1.4.1 MoCReS

An FPGA based on-chip network has a unique set of design goals that includes
satisfying the bandwidth requirements within limited resource availability. We
implement a minimum area and high performance packet-switched router (MoCReS)
for FPGA based NoCs. Our 5-port virtual cut-through router has an area overhead of
only 282 Virtex-4 slices (a marginal 0.57% of XC4VLX100) and operates at 357 MHz
supporting a competitive data rate of 2.85 Gbit/s.
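The figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes an 8-bit channel that transfers one flit per clock cycle and a device total of 49,152 slices for the XC4VLX100; neither value is stated in this section, so both should be read as assumptions.

```python
# Back-of-the-envelope check of the quoted router figures.
ROUTER_SLICES = 282           # from the text
CLOCK_HZ = 357e6              # from the text
DEVICE_SLICES = 49_152        # assumed XC4VLX100 slice count
CHANNEL_WIDTH_BITS = 8        # assumed channel width, one flit per cycle

def area_fraction_percent(used_slices, total_slices):
    """Share of the device occupied by one router, in percent."""
    return 100.0 * used_slices / total_slices

def peak_rate_gbps(clock_hz, width_bits):
    """Peak per-port data rate when one flit moves every clock cycle."""
    return clock_hz * width_bits / 1e9

area_pct = area_fraction_percent(ROUTER_SLICES, DEVICE_SLICES)
rate_gbps = peak_rate_gbps(CLOCK_HZ, CHANNEL_WIDTH_BITS)

print(f"router area: {area_pct:.2f}% of device slices")
print(f"peak rate  : {rate_gbps:.3f} Gbit/s per port")
```

Under these assumptions the area share comes to about 0.57% and the peak rate to 2.856 Gbit/s, in line with the 0.57% and 2.85 Gbit/s quoted in the text.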

1.4.2 Power Analysis

The NoC communication architecture competes for resources with the user application
and also incurs a power overhead. We determine the power consumption of the NoC
framework on an FPGA. Further, we analyze the power trade-offs associated with our
design novelties by comparing our design with a baseline approach implemented on the
same target device.

1.4.3 Hybrid Two-Layer Router

A strict packet-switched NoC incurs a high serialization overhead. To offset this
performance overhead, we develop a novel hybrid two-layer router architecture that
supports packet-switching for inter-router transfers and time-multiplexed circuit-switching
for IP cores connected to the same router. The advanced router architecture achieves
an average improvement of 20.4% in NoC bandwidth (maximum of 24%) compared to a
traditional NoC. Furthermore, a thorough analysis of the power-performance trade-offs
shows our 2-layer router to be superior to the baseline approach.
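The division of traffic between the two layers can be illustrated with a small behavioral sketch. This is not the router's actual arbitration logic; the function names, the IP-to-router mapping and the slot table are all hypothetical. The sketch only shows the idea that transfers between IPs on the same router use a time-multiplexed circuit-switched layer, while all other transfers are packetized.

```python
# Hypothetical behavioral sketch of layer selection in a hybrid router.

def select_layer(src_ip, dst_ip, attached_router):
    """Return 'circuit' when both IPs attach to the same router, else 'packet'."""
    if attached_router[src_ip] == attached_router[dst_ip]:
        return "circuit"
    return "packet"

def tdm_owner(slot_table, cycle):
    """Return the (src, dst) pair that owns the circuit-switched layer in this cycle."""
    return slot_table[cycle % len(slot_table)]

# Illustrative mapping: two IPs share router R0, a third sits on R1.
attached_router = {"cpu": "R0", "dct": "R0", "mem": "R1"}
# Illustrative slot table: the two local pairs alternate cycle by cycle.
slot_table = [("cpu", "dct"), ("dct", "cpu")]

print(select_layer("cpu", "dct", attached_router))  # local pair
print(select_layer("cpu", "mem", attached_router))  # crosses routers
print(tdm_owner(slot_table, 5))
```

Local transfers handled this way avoid packetization altogether, which is where the serialization overhead of a strict packet-switched NoC would otherwise be paid.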

1.4.4 Topology Synthesis for FPGAs

Traditional design flows for FPGAs are supported by sophisticated CAD tools to achieve
design closure. However, multi-core designs do not have a standardized CAD flow for FPGAs.
Moreover, the vast heterogeneity of the FPGA device further complicates the design flow
for NoC based FPGA designs. Further, CAD solutions for multi-core applications cannot
be borrowed from the ASIC domain, due to the inherent differences in the underlying
architecture. As a part of this research, we design an algorithm to effectively automate
the NoC design cycle for FPGAs. For any given application expressed as a task graph, our
integrated synthesis framework determines a suitable NoC topology that satisfies the
bandwidth requirements, while optimizing for the area overhead.

1.4.5 SoC Development

Implementing a complete SoC development flow using FPGA based NoCs requires a
thorough characterization of IPs, an efficient Network Interface and methodologies for
floorplanning the complete NoC-NI-IP framework. In this dissertation, we present our
contributions in all three of the above domains towards a complete implementation of a
multi-processor application on FPGA. Furthermore, to study in depth the area and power
overheads involved in this alternate communication architecture, an image compression
application is implemented as a case study.

1.5 Thesis Outline

The relevant NoC and FPGA background material required for this research is presented
in Chapter 2. Further, this chapter includes a survey of alternate approaches considered

in current research.
Chapter 3 describes the FPGA based MoCReS framework that is developed as a part
of this research. The chapter also presents the area/performance trade-offs for various
versions of the router. We utilize the NoC framework described in this chapter to conduct
all the experiments presented in the following chapters.

We analyze the power dissipated in the FPGA based NoC framework and present its
description and results in Chapter 4. Further, a comparison of our MoCReS framework with a
baseline NoC design in terms of power is featured in this chapter.
Chapter 5 presents a hybrid two-layer router architecture. The advantages behind the
novel architecture, along with the design issues involved, are presented in this chapter.
The area and performance metrics of the hybrid router architecture are characterized and
stored along with parameterized designs in a MoClib NoC component library.
Chapter 6 presents a detailed analysis of performance and power of our novel NoC frame-
work. Through detailed traffic analysis and comparisons, we present the performance

gain obtained in our hybrid router framework. This chapter concludes with a component
based NoC power analysis and a comparison of switch and link power in FPGA based
NoCs.
The experimental platform for this thesis is outlined in Chapter 7, including the CAD
tools, software and hardware used. We also present the application and synthetic
benchmarks that we utilized to extract the area/performance results in the experiments.
Chapter 8 formally presents the design and implementation of the CAD tool developed
to perform automatic topology synthesis. It presents the cost constraints and trade-offs

involved in the algorithm during design space exploration. The chapter also includes the
results obtained for a variety of application and synthetic benchmarks.
In Chapter 9, we present the IP implementation methodologies and characterize a library
of frequently used IP cores in FPGAs. The design goals behind a network interface
particularly suitable for our NoC are also presented. Furthermore, the chapter includes a
power-performance study into floorplanning our on-chip network in FPGAs.
Finally, Chapter 10 summarizes all the contributions made in this dissertation, and
outlines the future research directions.
Chapter 2

On-Chip Network Background

The motivation behind implementing NoC architectures in FPGAs is to remove the
performance bottleneck present in bus-based architectures. Even though we
emphasize FPGA based on-chip networks, much of the background material presented
in this chapter is also applicable to ASIC networks. In this chapter, we first present
the alternatives in communication architecture, followed by a detailed description of a
Network-on-Chip. We conclude the chapter by discussing recent research in the NoC
area.

2.1 Alternatives in Communication Architecture

This section presents various ways of implementing communication architectures in
FPGAs. Some of the main trade-offs involved in choosing an interconnect mechanism are:

• Throughput Available

• Static Schedule Requirement

• Switch/Interconnect Area Overhead

• Signal Integrity


• Bandwidth Guarantee

Based on the above trade-offs, the interconnection mechanisms can be broadly
classified into the following:

2.1.1 Dedicated FPGA Interconnects

Present FPGA devices support dedicated interconnects. These are the spatially
distributed FPGA resources configured through programmable switches. The latency of
this type of communication is very low and there is guaranteed bandwidth to support
the communications. However, the interconnect utilization is extremely low, as the
dedicated connections are almost never time-multiplexed for a different communication. With
the limited resources available within the FPGA and with increasing design complexities,
it is challenging to preserve signal integrity with this form of interconnect. The present
Virtex [1] device architecture supports this type of interconnect.

2.1.2 Time-muxed Interconnects

Time-muxed interconnects offer high-throughput connections with very high interconnect
utilization. This approach requires all the communication schedules to be known
off-line. With an increasing number of core communications, the area requirement for context
memory offsets the gain achieved in throughput and interconnect utilization. Research
in [6] explores the use of time-muxed interconnects for FPGAs.

2.1.3 Circuit-Switched Interconnects

In a circuit-switched mechanism, the resources utilized for the communication
architecture can be time-multiplexed. The latency of such a network is minimal and the
communication schedules can be determined online. However, preserving signal integrity
for larger designs is a big challenge as this circuit-switched connection, once established,
follows a synchronous scheme. PNoC [7] advocates this type of interconnect for FPGAs.

2.1.4 Packet-Switched Networks

The central idea with this structure is to transport data across modules in the form of
packets. Multiple packets are in flight between several (source, destination) pairs, thereby
increasing the overall performance. Similar to circuit-switched interconnects, this form
of communication is established online and does not require static scheduling. The
packetization/serialization overhead involved, along with the lack of a bandwidth guarantee,
are the main drawbacks of this approach. A detailed comparison of packet-switched
networks with circuit-switched networks is presented in [8].

2.2 On-Chip Network Description

The main modules present in an on-chip network are the IP cores (computation units),
the Network Interface (NI) and the NoC backbone.


IP Cores: The Intellectual Property (IP) cores form the computation elements of the
application. These cores are developed independently and often re-used in different parts
of the application. To handle current design complexities, there is a shift in the design

trend to a more modular IP based approach.


Network Interface (NI): The NI forms the glue logic between the IP cores and the
network backbone. It standardizes the interface between the IPs present in a network
and the routers. Further, for packet-switched networks, it also packetizes the outgoing
data from the IP and injects it to the network. When an IP receives data from the

network, the NI grants the request of the downstream router to receive the packets. The
NI is customized to suit the requirements of a particular network backbone. Research is
underway to standardize these network interfaces and IP communication protocols [9].

Network-on-Chip Component: The router forms the heart of the NoC backbone. It
is responsible for transporting packets that originate from the IP cores. Traditionally,

a router in a mesh network has four directional ports (North, East, South and West)
to communicate with the neighboring routers. Further, it has at least one local port
through which an IP core is interfaced to the network. Upon receiving a packet, the
router decodes, buffers and routes it in the appropriate direction based on the destination

node. Throughout this chapter we use the terms router and switch interchangeably.
An on-chip interconnection network can be described by a set of design choices called
network aspects. We describe network aspects in the following section.

2.3 Network-on-Chip Aspects

The choice of suitable network aspects has a large impact on the design metrics of the NoC.
Topology, Flow Control, Arbitration, Buffering and Routing are the main aspects of a
network.

2.3.1 Topology

Network topology comprises the arrangement and connectivity of the routers. 2D Mesh,
Ring and Star are some of the popular network topologies. The quality of a network can
be defined in terms of some of its characteristics, namely, bisection bandwidth, degree
and diameter. The choice of an appropriate topology has an impact on performance, area
utilized and power consumed. Based on the interconnect requirement, area and power,
2D-mesh networks have been found suitable for our target device [10]. In addition to
this, mesh networks can use a simple, deadlock free XY routing mechanism.
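
As a rough illustration of the topology characteristics named above, the following sketch computes the diameter, maximum router degree and link count of a 2D mesh without wrap-around links. The function name and the returned dictionary are illustrative assumptions and are not part of the framework described in this dissertation.

```python
def mesh_metrics(cols, rows):
    """Return basic metrics of a cols x rows 2D mesh (no wrap-around links)."""
    if cols < 1 or rows < 1:
        raise ValueError("mesh dimensions must be positive")
    # The longest shortest path runs from one corner to the opposite corner.
    diameter = (cols - 1) + (rows - 1)
    # An interior router has up to two neighbours per dimension.
    max_degree = min(cols - 1, 2) + min(rows - 1, 2)
    # Horizontal links plus vertical links, each bidirectional link counted once.
    links = rows * (cols - 1) + cols * (rows - 1)
    return {"diameter": diameter, "max_degree": max_degree, "links": links}


if __name__ == "__main__":
    print(mesh_metrics(2, 2))
    print(mesh_metrics(4, 4))
```

The degree counts only router-to-router links; the local port that attaches an IP core is not included.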

2.3.2 Flow Control

The three main alternatives in flow control of an NoC are Store and Forward, Virtual
Cut-Through, and Wormhole. In the Store and Forward technique, the outgoing packet
is completely buffered at the downstream router before making the next hop. This
technique, used in [11], incurs a high packet latency (directly proportional to packet size).
In contrast, the Virtual Cut-Through technique has low latency, as the head flit
progresses without waiting for the rest of the packet. However, the buffer requirements are
in terms of multiples of packet sizes for both the above approaches. The wormhole
technique, which operates on flits, sustains low latency with minimum buffer requirements.
However, a packet can reside across multiple nodes, which increases the complexity of the
switch.
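
The latency difference between these schemes can be sketched with a simplified zero-load model. The model below is a textbook approximation, not a measurement of any router discussed here: it assumes one flit crosses a link per cycle, a fixed per-hop router delay, and no contention. The function and parameter names are hypothetical.

```python
def store_and_forward_latency(hops, flits, router_delay=1):
    """Cycles to deliver a packet when every hop buffers the whole packet first."""
    return hops * (router_delay + flits)


def cut_through_latency(hops, flits, router_delay=1):
    """Cycles when the head flit advances immediately and the body is pipelined."""
    return hops * router_delay + flits


if __name__ == "__main__":
    for flits in (3, 16):
        print(flits,
              store_and_forward_latency(4, flits),
              cut_through_latency(4, flits))
```

Under these assumptions the Store and Forward latency grows with the product of hop count and packet size, whereas the cut-through latency grows with their sum.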

2.3.3 Routing

The routing mechanism determines the decisions taken when a packet is in flight. The
route established for a packet-based communication can be determined either at the
source node (Source Routing) or independently across the routers of the network
(Distributed Routing). In the case of source routing, the first flit in a packet is the header,
which contains the entire route of the packet. Upon decoding the header, the router passes
the packet to the appropriate downstream router. In the case of distributed routing, the
route decisions can be made throughout the network by the routers (depending on
congestion and other network conditions).
Another classification in the routing mechanism is based on adaptivity to network
conditions. In Deterministic routing, the route between any (source, destination) pair is
always the same. XY routing is a widely used low complexity routing mechanism that
is deadlock free. On the flip side, a deterministic routing mechanism cannot alter the
routes based on network traffic conditions. Paths taken using Adaptive routing can vary
and be non-minimal. However, the switch incurs additional complexity to support this
mechanism.
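
A minimal sketch of deterministic XY routing on a mesh is given below. It lists the routers a packet visits, resolving the X coordinate completely before the Y coordinate. The coordinate convention and the function name are assumptions made for illustration.

```python
def xy_route(src, dst):
    """List the routers visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:            # first travel along X
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:            # then travel along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the function depends only on the source and destination, the same pair always yields the same path, which is the property that makes the scheme deterministic.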

2.3.4 Arbitration

Within a router, arbitration is required while servicing conflicting requests. When
multiple input ports request a common output port, the requested output port determines
the order in which requests are acknowledged. The scheduling can be either static
(predetermined) or dynamic. Round-Robin arbitration is a popular technique that provides
a fair dynamic granting scheme. Additionally, Quality-of-Service (QoS) guarantees can
be provided to certain classes of applications by augmenting this arbitration scheme with
a priority policy.
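
The round-robin granting scheme can be sketched as a priority queue in which the port that was just served moves to the end. This is a generic software model with hypothetical names, intended only to show the fairness property; it does not describe the hardware of any specific router.

```python
class RoundRobinArbiter:
    """Grant one requester per call; the served port drops to lowest priority."""

    def __init__(self, ports):
        self.order = list(ports)      # front of the list = highest priority

    def grant(self, requests):
        """Return the highest-priority requesting port, or None if none request."""
        for port in self.order:
            if port in requests:
                # Move the served port to the end of the priority queue.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(["L", "N", "E", "S", "W"])
    print([arbiter.grant({"N", "W"}) for _ in range(4)])
```

With two ports requesting continuously, the grants alternate between them, so neither port can starve the other.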

2.3.5 Buffering

At the intermediate nodes, packets need to be temporarily stored, while waiting for a
channel access. Based on where the buffers are placed in the routers, Input Buffering and

Output Buffering are the two main classifications. While buffering a packet at the input
of a router, additional requests from upstream routers are not granted. This situation
is called Head-of-Line (HOL) blocking. Using output buffering, the HOL blocking is
avoided, thereby decreasing the average latency of the packets.

2.4 Current Research in NoC

In this section, we summarize the recent research work in the area of NoCs. A brief
account of current NoC applications in industry is also presented.
The International Technology Roadmap for Semiconductors 2005 [2] is the first to
address the inherent performance bottleneck due to poor interconnect scaling. The
concept of routing packets instead of wires to gain in performance was introduced by Dally
et al. [12] and was later formally presented as a solution paradigm by Benini et al. [13].
The first proof-of-concept was presented by Kumar et al. [14] by implementing a complete
NoC framework. Since then, there have been several academic and industrial
advancements in the design of NoCs.

2.4.1 Industrial Applications with NoCs

The NoC framework [15] by Philips is one of the first reported industry implementations
of an NoC. Sony has partnered with IBM [16] for the network-on-chip implementation of
the multi-core design in the PlayStation 3 (PS3). Further, Sonics [17] has been
involved in the development of interconnect structures based on NoCs. Arteris [18] and
Silistix [19] are other recent startup companies that provide NoC solutions for applications.

2.4.2 FPGA based NoC

Marescaux et al. [4] have applied the NoC paradigm on a Virtex device to enable multi-
tasking by tile-based reconfiguration. The Hermes [20] NoC platform was developed par-

ticularly for FPGAs to enable dynamic reconfiguration. Until recently the potential of
NoCs to address the performance issues in FPGAs was left unexplored. Bartic et al. [21]
present an adaptive NoC design for FPGAs and analyze its implementation issues. Sal-
dana et al. [10] address multi-processor designs in FPGAs by considering various NoC

topologies by scaling the number of IP cores. Hilton et al. [7] design a flexible circuit-
switched NoC for FPGAs. Kapre et al. [22] compare the suitability of packet and circuit
switching for FPGAs.
In addition to the current research presented in this chapter, we also refer to related
work alongside the contributions presented in the subsequent chapters.
Chapter 3

MoCReS: NoC Framework

3.1 Introduction

Design of a high performance, flexible on-FPGA communication architecture with
minimum area overhead presents a great challenge. In this chapter, we present the design and
implementation of MoCReS: Multi-Clock On-Chip Network for Reconfigurable Systems.
The key idea is to implement a low area and high performance packet-switched NoC
framework for FPGAs. The central component of the NoC (router) can support
independent operating frequencies, dictated by placement and routing constraints in the FPGA.
Moreover, the router supports a low latency virtual cut-through flow control for
variable packet sizes. Our 5-port router has an area overhead of only 282 Virtex-4 slices (a
marginal 0.57% of logic resources of an XC4VLX100 device) and can operate at up to
357 MHz, supporting a competitive data rate of 2.85 Gbit/s. We gain in router area
and performance by reducing the logic depth of the central arbiter and cross point
matrix. We utilize our router to construct a mesh based multi-clock on-FPGA NoC. We also
demonstrate its functionality and characterize performance, area and power of several
versions of the router.
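
The headline figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that one 8-bit flit is transferred per clock cycle on a channel and that the XC4VLX100 provides 49,152 slices; the slice count is taken from the Virtex-4 family data sheet rather than from this text, and the variable names are illustrative.

```python
FLIT_WIDTH_BITS = 8        # channel width of the basic router (from the text)
CLOCK_MHZ = 357            # reported maximum operating frequency
ROUTER_SLICES = 282        # reported router area
DEVICE_SLICES = 49_152     # assumed XC4VLX100 slice count (data sheet value)

# One flit per cycle: bits per cycle times cycles per second, in Gbit/s.
data_rate_gbps = FLIT_WIDTH_BITS * CLOCK_MHZ / 1000

# Router area as a percentage of the device's slices.
area_percent = 100 * ROUTER_SLICES / DEVICE_SLICES

if __name__ == "__main__":
    print(f"{data_rate_gbps:.3f} Gbit/s, {area_percent:.2f}% of slices")
```

Both results agree with the quoted values of roughly 2.85 Gbit/s and 0.57%.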


3.2 Related Work

We target our light-weight multi-clock NoC framework for reconfigurable computing
platforms. The requirements of FPGA based SoC design demand that the network have
minimum area overhead, maximum operating frequency and low latency of operation.
In this section, we compare our router to other proposed FPGA based NoC routers [11]
[23] [24]. In [11], the authors present a light-weight FPGA based parallel router that
uses store and forward flow control. This router has the disadvantage of high latency
(directly proportional to packet size) and it supports only fixed packet sizes. The modified

header in our virtual cut-through router overcomes the above mentioned disadvantages
by encoding the packet size as a fraction of the required FIFO depth. This technique
ensures low latency of operation and improved buffer utilization as a result of supporting
variable packet sizes. Research in [23] [24] [4] [25] presents wormhole based routers using

XY deterministic routing. Though the wormhole routing limits the buffer requirements,
it increases the area consumed due to its complexity thus limiting the logic available for
IP implementation in FPGAs. The 5 port wormhole router presented in [23] consumes
1832 Virtex II slices and operates at 66 MHz. Such an increase in router area degrades its

performance and increases the power consumed. Further, [24] [4] support variable packet
sizes, but with an additional header flit overhead as compared to our router. We will
show that our router consumes fewer resources than the above designs when implemented
in a similar target device. Table 3.1 presents a comparison of MoCReS with alternate
designs in terms of area (in slices) and operating frequency (MHz). The number of slices

utilized is a standard metric to compare FPGA area. In terms of logic, a Virtex-II slice is
equivalent to a Virtex-4 slice. The high operating frequency obtained through MoCReS is
also attributed to the low interconnect delays in Virtex-4 (advanced) architectures. The
operating frequency of our router in a comparable Virtex-II Pro device was 172 MHz.

Maximum operating frequency of the router varies greatly when implemented in
FPGA due to switch complexity and place & route constraints. The slowest router
implemented in an FPGA NoC determines the overall network operating frequency and
degrades its performance [25] [10]. We overcome this limitation by enabling the router to
support a multi-clock framework. Kim et al. [26] propose to interface the local cores
operating on individual frequencies with the network using asynchronous FIFOs for ASICs.
If implemented on FPGA, the slowest router would still dictate the operating frequency
of the network. Moreover, [26] buffers the entire packet from the local core before
forwarding it to the network. On the other hand, our router follows a modified multi-clock
virtual cut-through approach hence sustaining low latency.

Table 3.1: Comparison of FPGA Router Designs

Router          Area       Channel        Frequency   Flow      Target
Design          (Slices)   Width (bits)   (MHz)       Control   Device
LiPaR [5]       352        8              33          STF       XC2VP30
Moraes [7]      316        8              50          WHR       XC2V1000
RASoC [14]      1506 [5]   8              56          WHR       XC2VP30
HRSoC [6]       1832       8              66          WHR       Virtex-II
Marescaux [8]   446        16             40          WHR       XC2V6000
MoCReS          282        8              357         VCut      XC4VLX100

3.3 Design Goals

The key design objective is to minimize the area consumed by the router, which is the
central component of a network. Reducing the logic ensures sufficient resources for SoC
design in FPGA and also minimizes power overhead. Secondly, we aim to increase
the operating frequency of the router while keeping the network latency to a minimum. It
is essential to improve the network bandwidth and avoid the bottleneck present in bus
based architectures. The final objective is to operate multiple routers on independent
clock frequencies, thereby preventing the slowest router from restricting the operating
frequency of the network. Figure 3.1 presents a multi-clock framework with routers
functioning at individual frequencies. Dual-ported input buffers are used to cross clock
domain boundaries, as shown in Figure 3.1.

[Figure 3.1 diagram: four routers R(0,0), R(1,0), R(0,1) and R(1,1), each built from input
ports and a central arbiter + crosspoint, driven by clocks CLK_R0, CLK_R2, CLK_R1 and
CLK_R3 respectively, with local cores L0 to L3.]

Figure 3.1: A Multi-Clock 2 × 2 Mesh Based NoC

3.4 Network-on-Chip Aspects

The network topology along with the flow control, routing, buffering and arbitration
schemes describe an interconnection network. The choice of appropriate network aspects
has a significant impact on area and performance of the communication architecture.

3.4.1 Network Topology

We choose a mesh topology for our light-weight network. Mesh networks have a minimum
area overhead [10] (reduced number of nets) and low power consumption. In addition,

area scales linearly with the number of nodes and channel width in a mesh. A mesh also
maps well to the underlying routing structure of FPGA. Hence, choosing mesh networks
reduces the congestion in FPGA logic and routing which minimizes power consumption.

3.4.2 Flow Control

The virtual cut-through and wormhole techniques (unlike Store and Forward) have a packet
latency that is proportional only to the path length. However, the higher complexity of a
wormhole router compared to a virtual cut-through router makes it less suitable for a
light-weight implementation. We have chosen a virtual cut-through flow control mechanism for
our router. This scheme supports higher throughput than wormhole routing by efficiently
releasing the upstream buffers during blockages. Furthermore, virtual cut-through flow
control supports high channel utilization with low latency and does not reserve physical
channels.

3.4.3 Routing

We choose the deadlock free XY routing for our switch. The simplicity of the XY
routing adds little overhead to the header decoding logic. Hence, XY routing is suitable
for implementing our area efficient router on FPGA.

3.4.4 Buffering

We buffer incoming packets only at the input ports. Although the input buffering intro-

duces the head-of-line problem, it leads to a low area overhead. In addition to buffering
the incoming flits, the input buffers also provide a framework to implement a multi-clock
network.

3.4.5 Arbiter

To ensure fairness, the competing input ports are allocated based on a simple round
robin approach. The last served/denied port is given the lowest priority and placed at the
end of the queue. The FIFO virtual channels also follow a round robin approach when
switching packets downstream.

3.5 Router Micro-Architecture

The MoCReS router consists of five input ports, a crosspoint matrix and a central arbiter.
Except for the header decoding logic, the five input ports are identical. As we adopt
a flow control with virtual channels (VCs), the input port contains arbitration logic
for multiple VCs. The input port also contains input buffers to store the incoming
packet. The MoCReS architecture has been developed in collaboration with another
colleague [27]. A more detailed description of the architecture can be found in that
thesis [27].

3.5.1 Packet Description

For a FIFO depth of 16 (utilizing the Xilinx block RAM), the packet size can vary
between 24 bits and 128 bits with a header overhead of only 1 flit per packet. The flit
size is fixed at 8 bits. The header contains the address of the destination router, flit type
and packet fraction. Our virtual cut-through flow control needs one bit to specify the

flit type. The tail bit is set on the flit prior to the last flit to send the terminate signal
without wasting a clock cycle.
The router supports variable packet sizes by encoding the packet size as a fraction
of the required blockRAM (bRAM) depth (packet fraction) in its header. The Network
Interface (NI) in each IP core takes the onus of storing the fraction in the packet's header.
The fraction bits are complemented before storing to enable efficient comparison against
the write count of the FIFOs. The number of fraction bits and flit width can be increased
if a higher packet granularity is desired. Increasing the number of fraction bits also
improves the buffer utilization with a marginal area overhead. The remaining bits in the
header are reserved to implement priority based flow control. An advanced version of the
router could utilize the remaining header bits to incorporate Quality-of-Service (QoS)
and offset the impact of contention latency.
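
One way to see why a complemented fraction simplifies the comparison is sketched below. The encoding is an assumption chosen for illustration: the text does not state the width of the fraction field, so a 2-bit field dividing a 16-deep FIFO into quarters is assumed, and all names are hypothetical. With this choice the complemented code, scaled by the step size, is directly the largest write count that still leaves room for the packet.

```python
import math

FIFO_DEPTH = 16      # flits, as in the text
FRACTION_BITS = 2    # assumption: the field width is not given in the text


def encode_fraction(packet_flits):
    """Return the complemented fraction code to store in the header."""
    steps = 1 << FRACTION_BITS               # number of fraction steps
    step_flits = FIFO_DEPTH // steps         # flits covered by one step
    if not 1 <= packet_flits <= FIFO_DEPTH:
        raise ValueError("packet does not fit in one FIFO")
    code = math.ceil(packet_flits / step_flits) - 1   # 0 .. steps - 1
    return ~code & (steps - 1)               # bitwise complement within the field


def max_write_count(stored_code):
    """Largest FIFO write count that still leaves room for the packet."""
    return stored_code * (FIFO_DEPTH >> FRACTION_BITS)


if __name__ == "__main__":
    for flits in (3, 8, 16):
        code = encode_fraction(flits)
        print(flits, code, max_write_count(code))
```

A receiver under this assumed encoding only needs a single less-than-or-equal comparison between the FIFO write count and the threshold derived from the header.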

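[Figure 3.2 diagram: header flit fields (packet head/tail bit, packet fraction, destination
X and Y), followed by the data flits; the tail bit is set to 1 on the flit before the tail flit
(Tail - 1).]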

Figure 3.2: Packet Specification


[Figure 3.3 diagram: the input port (VC arbiter, input de-multiplexer, BRAM FIFO A and
BRAM FIFO B with separate write and read clocks, BRAM arbiter and header decoder), the
cross point matrix (4:1 multiplexers for the L, N and S ports and 2:1 multiplexers for the
E and W ports, selected by Msel_L, Msel_N, Msel_S, Msel_E and Msel_W), and the central
arbiter with one request (Req_in) and one grant (Grnt_in) signal per port.]
Figure 3.3: Router Micro-Architecture (Input Port, Cross-Point, Central Arbiter)

3.5.2 Input Port

The router has a set of input ports, namely, Local (L), North (N), East (E), South (S) and
West (W) to communicate with the local core and neighboring routers. Each input port
can support multiplexed virtual channels, associated arbiters, and header decoding logic
to make routing decisions. The three main components of the input port (Figure 3.3)
are the Virtual Channel Selector, the FIFO bRAMs and the bRAM arbiter.
A. Virtual Channel Selector (VC Selector): The decision on the availability of
space in input buffers is made by the VC selector. It receives, along with the header, the
size of the packet in a coded form. Upon receiving this data from the upstream router
or core, the VC selector compares the size of the packet with the available space in the
least occupied among the virtual channels. This is done by comparing the incoming size
against the FIFO's write count. If adequate space exists, the VC selector acknowledges
the request back upstream. The VC selector, in the process, also sets the input
de-multiplexer. The input de-multiplexer is used to route the packet from the input channel
to the appropriate multiplexed input buffer.
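
The VC selector decision described above can be modelled in a few lines. This is a behavioural sketch, not the hardware implementation: the function name, the argument names and the representation of the size field as a write-count threshold are assumptions made for illustration.

```python
def select_virtual_channel(write_counts, max_write_count):
    """Pick the least occupied virtual channel and decide whether to acknowledge.

    write_counts:     current write count of each virtual channel FIFO.
    max_write_count:  threshold derived from the packet's coded size (hypothetical
                      name); the packet fits if the FIFO is at or below it.
    Returns (ack, demux_select); demux_select is None when the request is denied.
    """
    # Least occupied virtual channel; ties resolve to the lowest index.
    vc = min(range(len(write_counts)), key=lambda i: write_counts[i])
    ack = write_counts[vc] <= max_write_count
    return ack, (vc if ack else None)


if __name__ == "__main__":
    print(select_virtual_channel([10, 4], 8))   # room in the second channel
    print(select_virtual_channel([10, 9], 8))   # no channel has enough room
```

When the request is acknowledged, the returned index plays the role of the de-multiplexer select that steers the incoming packet to the chosen buffer.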


B. FIFO bRAMs: The buffer depth is parameterized in our router and we have set
a depth of 16 for our experiments. The buffers are implemented as bRAM First In First
Out (FIFO) memories and perform the following tasks:

1. Buffer the incoming packet partially or fully and when the downstream switch is
available, forward the head and subsequent flits.

2. Demarcate the router to core and router to router frequencies, hence supporting a
multi-clock network design.

3. Support variable packet sizes by enabling the arbiter to monitor the write count
information.

Our router hence does not restrict the packet to a fixed size, a restriction that might
lead to inefficient transfer of data. The variable packet size capability comes at a marginal
(less than 5%) increase in the area of the switch. This overhead is acceptable, considering
the inefficiency in performance and power due to the padding of smaller packets with
empty flits.
An efficient way of implementing full/empty logic is required when buffers are used
to separate clock domains. We synchronize the control signals by a) sending the
granularity of the packet as a fraction of FIFO size and b) setting the tail bit on the flit
before the last to terminate the connection without wasting a clock cycle.
Since the minimum supported FIFO depth in our target FPGA is 16, it is appropriate
to buffer the entire packet during contention (virtual cut-through). The write count,
empty and full status signals are integrated into the FIFO with minimum additional
logic. The capability to store flits from multiple packets in one buffer improves the
FIFO utilization. Also, we gain in network throughput as the successive flits release the
upstream buffer as soon as the head advances.
C. bRAM Arbiter: The input port also contains the control logic to make
arbitration decisions. A simple round-robin approach is followed when choosing a non-empty
bRAM. Upon choosing a bRAM, the FSM pops the head flit, decodes its destination
and sends appropriate requests. XY routing is adopted in our router, which simplifies the
decoder logic significantly. The number of outgoing request lines is reduced according
to the connections that XY routing permits.
Head Decoder: In XY routing, the head flit travels in the X direction and once it
reaches the destination X, it travels in the Y direction. All subsequent flits follow the
header flit in a pipelined fashion. Because packets travel completely in the X direction
before Y, a request is never sent from the North and South ports to the downstream
East and West ports. This property of XY routing is used to reduce the amount of logic
in a) the Header Decoder, b) the Cross Point Matrix and c) the Central Arbiter. The
above simplification of the logic translates into significant FPGA slice reduction.
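
The set of connections that XY routing permits can be enumerated with a short script. The sketch below assumes that North is the +y direction, that U-turns do not occur, and that a core never sends to itself; the helper names are hypothetical. It derives, for each output port, which input ports can ever request it.

```python
from itertools import product

PORTS = ("L", "N", "E", "S", "W")


def xy_output(dx, dy):
    """Output port for a packet with remaining offset (dx, dy); North is +y."""
    if dx > 0:
        return "E"
    if dx < 0:
        return "W"
    if dy > 0:
        return "N"
    if dy < 0:
        return "S"
    return "L"


def can_arrive(port, dx, dy):
    """Can a packet with remaining offset (dx, dy) appear on this input port?"""
    if port == "L":
        return (dx, dy) != (0, 0)     # a core does not send to itself
    if port == "W":
        return dx >= 0                # packet is travelling east
    if port == "E":
        return dx <= 0                # packet is travelling west
    if port == "N":
        return dx == 0 and dy <= 0    # travelling south, X already resolved
    return dx == 0 and dy >= 0        # port "S": travelling north


def requesters_per_output():
    """Map each output port to the set of input ports that may request it."""
    table = {p: set() for p in PORTS}
    for port, dx, dy in product(PORTS, (-1, 0, 1), (-1, 0, 1)):
        if can_arrive(port, dx, dy):
            table[xy_output(dx, dy)].add(port)
    return table


if __name__ == "__main__":
    for out, ins in requesters_per_output().items():
        print(out, sorted(ins))
```

Under these assumptions the East and West outputs have two possible requesters each, while the Local, North and South outputs have four, which is consistent with North and South never requesting East or West.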

3.5.3 Cross-Point Matrix

We design a multiplexer based cross point matrix to minimize the area. An alternative
would be to support cross point connections for each de-multiplexed virtual channel. The
latter approach, which produces high network throughput, adds significant complexity
to the switch. The cross point supports parallel connections between exclusive input and
output ports and is used by the central arbiter to support simultaneous requests. Not
all cross point connections are utilized by the XY routing. After optimizing the logic, we
implement the switch with simple 4-input multiplexers for the L, N and S ports and
2-input multiplexers for the E and W ports. The above optimization reduces the cross point
area to only 32 slices and hence, we gain in router area significantly.

3.5.4 Central Arbiter

To ensure fairness, the competing input ports are allocated based on a simple round
robin approach. The last served/denied port is given the lowest priority and placed at the
end of the queue. We gain in router performance by a) reducing the logic in the central
arbiter, as we determined that it lies on the critical path of our router, and b) centralizing
the arbiter, which reduces the number of req/grant signals required. Reducing the logic
and routing also minimizes the congestion in the design. This translates into a gain in
performance and power.
We minimize the logic in the critical path by reducing the number of service states
in the central arbiter for the requesting downstream input. The East input port will
be requested only by the West and Local ports and similarly, the West input port only
by the East and Local ports. Also, it is sufficient if only one grant reaches the requesting
input port for all its requests. This reduces the number of nets to be routed, hence
minimizing the congestion. We achieve an operating frequency of 357 MHz without
queuing the simultaneous requests for downstream input ports, i.e., we enable the arbiter
to handle multiple requests simultaneously. Upon granting an input port, the central
arbiter configures the multiplexers in the cross point matrix to establish a connection.

3.6 Results: Functional Simulation

In this section, we present the functional simulation results of our design in an
XC4VLX100-11 device [1], on a Nallatech BenDATA™ [3] development board. We use Xilinx ISE
8.2i to synthesize, place and route our design. We validate the functionality of our design
using Modelsim 6.1c [28] and present the results below. We functionally simulate two
versions of MoCReS: a common clock (synchronous) version and a multi-clock version.

(a) Common Clock (single clock CLK; time axis marked at 60 ns, 75 ns and 100 ns;
flit values in hexadecimal)

Input   Flits            Output carrying the same flits
L_in    00 10 A2 33 F2   W_out
N_in    00 12 A2 63 62   L_out
E_in    00 0A 42 23 22   S_out
S_in    00 1A 52 33 32   N_out
W_in    00 04 72 53 52   E_out

(b) Independent Clocks (input clocks L_CLK, N_CLK, W_CLK, S_CLK, E_CLK and read clock
Rd_CLK; time axis marked at 250 ns, 350 ns and 400 ns; flit values in hexadecimal)

Input   Flits         Output carrying the same flits
L_in    00 10 33 F2   W_out
N_in    00 12 63 62   L_out
W_in    00 14 53 52   E_out
S_in    00 1A 33 32   N_out
E_in    00 0A 23 22   S_out

Figure 3.4: Router Functional Simulation


27

3.6.1 Common Clock Design

The router follows virtual cut-through flow control, based on a simple request/acknowledge
protocol. Figure 3.4(a) presents the simulation results of our standalone router operating
on a common clock. Our central arbiter and cross point are capable of establishing par-
allel input port connections without clock penalty. It can be seen that the flits coming in

through the five input ports are simultaneously switched in the appropriate directions.

Table 3.2: Standalone Router Area Results (Slices)

Component            Common Clock (CC)   Multiple Clock (MC)
Router               282                 302
Input Port           32                  55
VC Selector          4                   4
bRAM FIFO            20                  42
bRAM Arbiter         11                  11
Central Arbiter      115                 115
Cross Point Matrix   32                  32

3.6.2 Multi-Clock Design

Figure 3.4(b) shows the operation of our router when each input port receives data
at different frequencies. Once the empty signal of a FIFO is pulled low, the bRAM
arbiter decodes the header at the router’s read frequency. It can be seen that outgoing

packets are synchronized with the read clock and follow an order similar to their incoming
frequencies.

3.7 Results: Area-Performance Characterization

We choose area and performance as the two design metrics to be characterized for our
MoCReS design. A brief description of the experimental platform constructed for the

area-performance study is also presented below along with the results obtained.

Figure 3.5: Router Area Vs Channel Width. Axes: Area (Slices) versus Channel Width (bits).
Series: Basic (1VC+CC), 1VC+MC, 2VC+CC, 2VC+MC.

3.7.1 Router Area Analysis

We synthesize, place and route the structural VHDL model of our router and present an
analysis of its FPGA resource utilization in this section. Upon tightly constraining the
area using the Xilinx PACE tool [1], the basic version (1 Virtual Channel + Common Clock)
of our router consumes 282 Virtex-4 slices (558 LUTs, 289 Slice FFs), which corresponds
to a marginal 0.57% of our target FPGA device (XC4VLX100). Table 3.2 presents the
results from synthesis of the basic version of our router, identified as 1VC + CC.
To characterize our common and multi-clock (MC) router for area, we develop three
more versions of it by varying the number of virtual channels (VC). Figure 3.5 presents
the scaling of area versus channel width for various versions of the router. An increase
in the channel width causes a significant increase in the area of the router, due to scaling
of the cross point matrix. This increase in router area could be significant if the cross
point occupies a large share of the area, as in most designs. However, this
disadvantage is reduced in our router design, due to the area optimizations applied in the
cross point matrix. From Figure 3.5 we observe that even for an 8× increase in channel
width, the router area increases by at most 2×.

Figure 3.6: 3 × 3 Mesh FPGA Utilization. Axes: Logic Overhead (LUTs) and Routing Overhead
(Nets), on a ×10^4 scale, versus Channel Width (Bits).

3.7.2 FPGA NoC Resource Analysis

For this analysis we implemented a 3 × 3 mesh topology of MoCReS routers (with 1 VC
+ CC) using the standard Xilinx ISE design flow. Our 3 × 3 mesh framework consumes
only 6.1% of the available FPGA device area, leaving the remaining logic to efficiently
implement the IPs. Figure 3.6 presents the scaling of logic and routing utilization with
channel width in a mesh network. The linear scaling of routing resources utilized with
increase in channel width demonstrates the suitability of mesh topology for FPGAs.

3.7.3 Performance Analysis

We make use of Xilinx PACE [1] to tightly constrain the critical path and estimate the
post place and route delay. As the routers will be internally connected to the cores,
we need not consider the pad to pad delays while estimating the maximum frequency.
The design is implemented with a Router Functional Module (RFM) [29] wrapper to

estimate the accurate operating frequency. The standalone version of our basic router
can operate at 357 MHz. Therefore, our 8 bits/channel router has a maximum throughput
of 2.85 Gbits/s.
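
The throughput figure quoted above follows directly from clock rate times channel width. A minimal sketch of that arithmetic is given below; the function name is illustrative and the calculation assumes one flit is transferred per channel in every clock cycle.

```python
def channel_throughput_gbps(freq_mhz, channel_width_bits):
    """Peak throughput of one channel, assuming one flit per clock cycle."""
    return freq_mhz * 1e6 * channel_width_bits / 1e9


# 357 MHz and 8 bits/channel are the values reported in the text.
basic_router = channel_throughput_gbps(357, 8)
```

For the reported values this gives roughly 2.856 Gbits/s, which the text quotes as 2.85 Gbits/s.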

Figure 3.7: Average Latency Vs Injection Rate of 3 × 3 Mesh. Axes: Avg. Latency (cycles)
versus Injection Rate (Flits/Cycle/Node). Series: Common Clock, Multiple Clock.

In our router, the head flit advances to the next node while the remaining flits flow
in a pipelined fashion. In this scheme the network latency depends only on path length

H (number of hops). In the absence of channel contention, the latency of our network L
can be expressed as:

L = 7 × H + B/w (3.1)

where, B is the number of bytes in the packet and w is the number of bytes switched
per clock cycle. The factor 7 in the expression is the setup latency incurred at every

router hop in the MoCReS router. This latency is due to the decoding & arbitration
performed based on the header flit in a packet. If L_i denotes the latency of the i-th packet
in cycles and N is the number of packets, then the average latency of the common and
multiple clock networks can be expressed as,

L_{cc,avg} = \frac{\sum_{i=1}^{N} L_i}{N} \quad \text{and} \quad L_{mc,avg} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{H_i} \frac{f_j}{f_{worst}}}{N}    (3.2)

where H_i is the number of hops taken by the i-th packet, and f_j and f_worst represent the
frequencies of the j-th router and the slowest router, respectively. L_{mc,avg} is given in
cycles of frequency f_worst.
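
As an illustration of Equation 3.1 and the common-clock average in Equation 3.2, the sketch below evaluates the contention-free latency for a few packets. The function names and the example packets are hypothetical; only the 7-cycle per-hop setup latency comes from the text.

```python
def packet_latency(hops, packet_bytes, bytes_per_cycle, setup_cycles=7):
    """Equation 3.1: L = setup * H + B / w, valid without channel contention."""
    return setup_cycles * hops + packet_bytes / bytes_per_cycle


def average_latency(latencies):
    """Common-clock average of Equation 3.2: mean of the per-packet latencies."""
    return sum(latencies) / len(latencies)


# Hypothetical traffic: (hops, bytes) pairs on an 8-bit (1 byte/cycle) channel.
packets = [(3, 16), (1, 16)]
latencies = [packet_latency(h, b, 1) for h, b in packets]
mean_latency = average_latency(latencies)
```

With these made-up packets the two latencies are 37 and 23 cycles, so the average is 30 cycles. The setup term dominates for short packets, which motivates the setup latency reduction discussed later in this section.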
Multi-Clock Experimental Platform: In order to evaluate the performance of the
proposed multi-clock framework, we utilize our VHDL model of the router and simulate a
3×3 mesh. We implement a wrapper to generate packets at the local core frequency. Router
frequency values are extracted after the topology synthesis, placement and routing stages.
For varying configurations (resource availability, inter-router distance, bRAM/dRAM
FIFO versions), the router frequency can degrade by up to 18% [29]. Figure 3.7 shows the
latency versus injection rate curve for the common clock and multi-clock versions. For
the common clock case, the network frequency was 286 MHz and for the multiple clock
case, the frequency ranged from 357 MHz to 286 MHz. The X-axis represents the injection
rate expressed as the number of flits injected from every node in one cycle. The Y-axis plots
the measured average latency of packets in each case. It can be seen that the increase in
performance of the proposed framework significantly delays network saturation.

Reducing Setup Latency

Connection between the upstream and downstream routers is established through a
req/grant protocol. Upon receiving the first flit, the bRAM arbiter in the downstream
router:

• arbitrates the bRAMs to decide which packet progresses forward in that cycle

• pops the head flit from the bRAM FIFO

• decodes the header flit and sends appropriate request to the central arbiter

The setup latency of our router (to accomplish the above sequence of operations), as
shown in Equation 3.1, is 7 cycles. This latency is a constant for all communications in
the network, i.e., for every router hop, a latency of 7 cycles is incurred.

However, by utilizing the First-Word Fall-Through (FWFT) capability available in
Xilinx FIFOs, this latency can be reduced by one cycle. Here, the head flit, when pushed
into the FIFO, appears at the output bus in the same clock cycle, thereby feeding itself
as input to the bRAM arbiter to initiate the request process to the Central Arbiter. The
latency of our network L can now be expressed as:

L = 6 × H + B/w (3.3)
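
As a worked example, Equations 3.1 and 3.3 can be compared for one packet. The values H = 4, B = 16 bytes and w = 1 byte per cycle are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
% Hypothetical packet: H = 4 hops, B = 16 bytes, w = 1 byte per cycle
L_{7} = 7 \times 4 + \frac{16}{1} = 44 \ \text{cycles}, \qquad
L_{6} = 6 \times 4 + \frac{16}{1} = 40 \ \text{cycles}, \qquad
L_{7} - L_{6} = H = 4 \ \text{cycles}
```

The saving equals the hop count H, so the benefit of FWFT grows with path length and is independent of packet size.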

3.8 Conclusions

We present an area efficient multi-clock on-FPGA virtual cut-through router that has
low area and high performance compared to previously reported designs. We
introduce optimizations in the central arbiter and cross point implementation to gain
in area and performance. We extend the router to a multi-clock NoC framework with
routers functioning at independent frequencies. We validate the functioning of a standalone
router and a 3 × 3 mesh framework and characterize the network for area and
performance. We use this framework as the baseline design to perform the experiments
presented in the subsequent chapters.

3.8.1 Limitations of our Packet Switched MoCReS Framework

The MoCReS framework for FPGA based NoC design is an important contribution of
this thesis. It has low area and high clock rate in comparison to other proposed FPGA
NoCs. However, the performance improvement derived by this NoC is limited by its
strict packet-switched nature. There is a significant performance overhead involved in

converting data to flits, encoding the headers, and serializing them into packets. This
overhead is particularly prominent for IP cores placed close to each other. Overcoming
this overhead is our motivation for designing a novel architecture that is presented in
Chapter 5 which supports an additional time-multiplexed circuit-switched layer for a

high throughput data transfer between nearby IP cores.


Chapter 4

NoC Power Analysis

4.1 Introduction

With increasing device sizes and capabilities, power dissipation in FPGAs has become a
primary concern. Further, FPGA implementations for portable handheld devices demand
low power consumption to extend battery life. Therefore, it is essential to attain a

balance between power dissipation and performance in an FPGA based NoC. In this
chapter, we analyze power consumption in our FPGA based implementation of an NoC.
The power dissipated across various components of the NoC is presented. Further, we
discuss the power consumption from the FPGA resource utilization perspective. Once

fully developed, the model could be used to enhance the traditional FPGA design flow
for multiprocessor applications.
Further, we feature a comparison of MoCReS with an alternate approach (LiPaR [11])
in this chapter. The alternate router does not support independent clock frequencies,

and thereby sustains a performance bottleneck. Experimental evidence indicates that the
multi-clock novelty has a marginal power overhead due to the additional clock resource
utilized in the FPGA implementation. Moreover, to understand the impact of power
optimization techniques on the NoC, we determine the power consumption share of NoC


when a typical image compression application is implemented on our target device and
report the same in this chapter. It is shown experimentally that NoCs in FPGA consume

around 45% of total power dissipated in the chosen application.


The above NoC overhead in power is due to significant utilization of programmable
logic and routing resources in the FPGA. In an attempt to model the power overhead of
the two routers, we present the resource utilization results extracted using the approach

described in Section 7.2.1.

4.2 Related Work

Research in [30] addresses power-performance modelling of NoC in the ASIC domain. It


characterizes a 4 × 4 Mesh NoC for total power consumption using a cycle accurate RTL
model. Hu et al. [31] present a mapping technique to reduce dynamic power consumption

in ASICs.
Vestias et al. [32] propose an approach to explore the design space of an SoC implemented
with an NoC backbone. The authors validate their technique by mapping a JPEG
encoder application and optimizing the design for performance. However, the authors
do not characterize the power consumption in their design. The main drawback of the
router that the authors implemented in [32] is that it utilizes a store-and-forward flow
control mechanism. The latency of a packet is very high, as it is directly proportional to
the packet size. Moreover, this implementation is not suitable for power efficient NoCs
because of its high buffer requirements. The buffers, in addition to increasing the area
consumed, also switch continuously, thus contributing a significant portion of the
power consumed.
To the best of our knowledge this is the first work to address power characterization
of an NoC targeted for reconfigurable platforms. We also determine the NoC share of

total power consumption using a typical image processing application and report the

results in this chapter.

4.3 MoCReS Power Consumption

In this section we analyze the power consumed in the MoCReS router architecture
proposed in Chapter 3. The resource utilization of the MoCReS router is determined
through place & route using the Xilinx ISE tool. To estimate the power consumed,
the XPower utility of the ISE suite is used.

4.3.1 Components of Power

The central component of an NoC is the router and its power consumption has a significant
impact on the total power consumed by the NoC. The two main components of
FPGA power consumption are the static power (referred to as quiescent power) and the
dynamic power. Although current FPGAs are available in technologies down to 65 nm, the
dynamic power tends to dominate the total power in FPGAs, due to high operating
frequencies (toggle rates). This trend is contrary to ASICs, where at current technology, the leakage
power dominates. Therefore it is important for an estimation methodology to account
for both components of power.
The amount of quiescent power is largely dependent on the target device (technology

and preset operating voltages) and the amount of logic utilized by the design. On the
other hand, the dynamic component depends on,

• Switched Capacitance (Resource Utilization)

• Transition Activity

• Operating Voltage
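
The three factors listed above correspond to the terms of the standard first-order CMOS dynamic power model, restated here for reference. This relation is textbook material rather than an equation taken from this work, and some formulations include an additional factor of 1/2 depending on how transition activity is defined.

```latex
% alpha_i : transition activity of net i (relative to the clock toggle rate)
% C_i     : switched capacitance of the logic/routing resource driving net i
% V       : operating voltage,  f : clock frequency
P_{dyn} = \sum_{i} \alpha_i \, C_i \, V^{2} \, f
```
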

Switched capacitance varies with the type of logic/routing resource utilized for the
design. Transition activity of every signal in the design can be expressed in terms
of the clock toggle rate. In our target Virtex-4 FPGAs [1], the logic (CLBs), block
RAMs, clock tree and entire routing resources operate under the same voltage source
(VCCINT = 1.2 V). The Input/Output blocks in our target FPGA operate at a higher
voltage level (2.5 V).

Table 4.1: Standalone: MoCReS (1VC+CC) Power Consumption

Power Component   Operating Voltage (V)   Power (mW)
Static Power      1.2                     402.56
Dynamic Power     1.2                     43.62
Static Power      2.5                     611.88
Dynamic Power     2.5                     1.25
Clock Power       1.2                     14.0
Logic Power       1.2                     12.0

For a stand alone router, we estimate the static and dynamic power contributed

by the design. Clock lines in FPGA design contribute to a significant portion of the
dynamic power consumed. We report the standalone power consumption of MoCReS in
Table 4.1. The dynamic power dissipated under 2.5V category is very marginal due to
the Router Functional Module (RFM) that was wrapped around the router for accurate
estimation of power. The RFM restricts the number of Input/Output blocks that will be

utilized by the router. In this section, we also obtain the router models of an alternate
design, LiPaR [11] and compare its power overheads with our modified multi-clock router
framework.
Even though the dynamic component of the standalone router is a small percentage

of the total power, for a large NoC topology implementation with several instances of the
router (and high switching activity), the dynamic component will equally dominate the
total power. Table 4.2 presents the power dissipated in a 3 × 3 mesh implementation
of MoCReS (1VC+CC).

Table 4.2: 3 × 3 Mesh: MoCReS (1VC+CC) Power Consumption

Power Component   Operating Voltage (V)   Power (mW)
Static Power      1.2                     478.34
Dynamic Power     1.2                     298.36
Static Power      2.5                     702.53
Dynamic Power     2.5                     6.27
Clock Power       1.2                     124.96
Logic Power       1.2                     97.59

4.3.2 Comparison with LiPaR

In this section, we compare the power consumed by our router architecture with an
alternate router design, LiPaR [11]. Our MoCReS router supports independent clock
frequencies as opposed to LiPaR [11], thereby resulting in increased performance. In

order to evaluate the power trade-offs involved in our design novelty, we compare the
two router designs keeping the following parameters constant:

• Same target device (XC4VLX100 [1])

• 16 Flit Buffer Size

• Channel Width of 8 bits

• bRAM FIFO Implementation

• Number of (Ports,VCs) : (5,1)

• Identical transition activity

In addition to the above parameters, the operating frequency for the power experi-
ments was set at 100 MHz which corresponds to the critical path delay of LiPaR.
Standalone Version: Activity data for power estimation (.vcd) is obtained by simulating
the post place & route model of the two designs (with random inputs). XPower [1]
takes the design description (resource utilization) as a .ncd file along with the .vcd
generated above. Table 4.3 compares the dynamic and quiescent power consumed by the
two approaches on the same target FPGA device.

Table 4.3: Standalone: MoCReS (1VC+MC) vs LiPaR

Power Component   Operating Voltage (V)   MoCReS Power (mW)   LiPaR [11] Power (mW)
Static Power      1.2                     489.06              496.12
Dynamic Power     1.2                     57.16               51.25
Static Power      2.5                     714.15              719.08
Dynamic Power     2.5                     14.06               17.82
Clock Power       1.2                     28.57               21.26
Logic Power       1.2                     22.45               25.19

Due to fewer resources utilized in MoCReS, there is a gain in static power and in the
dynamic power contributed by the logic resources. This is due to the logic optimizations
in the cross-point matrix and central arbiter, which lead to lower FPGA resource
utilization. Due to the increase in the dynamic component at 1.2 V (57.16 mW versus
51.25 mW), the dynamic power of our approach is marginally (11.53%) higher than that of
the alternate design. It is important to note that the equivalent router for LiPaR is the
basic version of MoCReS (1VC+CC), as LiPaR does not support virtual channels and
operates on a single clock. Table 4.1 presents the power consumed by that version of
MoCReS.
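
The 11.53% figure can be reproduced from the 1.2 V dynamic power row of Table 4.3. The short sketch below shows the arithmetic; the helper name is illustrative and the two power values are copied from the table.

```python
def percent_increase(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


mocres_dynamic_mw = 57.16  # MoCReS (1VC+MC), dynamic power at 1.2 V
lipar_dynamic_mw = 51.25   # LiPaR, dynamic power at 1.2 V

overhead = round(percent_increase(mocres_dynamic_mw, lipar_dynamic_mw), 2)
```

The same helper applied to the clock power row (28.57 mW versus 21.26 mW) shows that the clock network accounts for most of this increase, which is consistent with the multi-clock discussion that follows.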
Multi-Clock Feature: Our modified MoCReS architecture supports independent clock
frequencies for router instances, thereby allowing each router to function at the highest
individual clock rate that the placement, routing and switch complexity constraints
dictate. Therefore the number of clock nets utilized could be higher and can cause
an increase in power consumed. With clock lines typically having higher fan-outs, the
switched capacitance and therefore the dynamic power associated with the clock nets
can be significant. The Xilinx power estimation framework permits estimating the power
consumed by the clock lines independently for both the designs. It can be seen that
there is only a marginal 11.53% additional power overhead in our MoCReS approach,
which gives a significant performance improvement.



Table 4.4: Power Dissipation Across NoC Components

S.No   NoC Component        Dynamic Power (mW)
1.     Input Port           7.88
         FIFO bRAMs         6.02
         VC Arbiter         0.35
         bRAM Arbiter       1.72
2.     Cross-Point          12.11
3.     Central Arbiter      15.69

4.3.3 NoC Component Power

To effectively characterize the FPGA NoC implementation for power, we determine the
dynamic and quiescent power of each major component in the NoC. The experimental
platform involves an incremental place and route of the NoC. During every stage we

retain all the components of it and stimulate the part of design under investigation. We
extract dynamic and quiescent power for identical activity rates.
In order to better understand the contributions to dynamic power of each of the
router components, we activated parts of the MoCReS design with random input vectors
(following a uniform distribution) and observed the dynamic power consumed. It can
be seen that the buffers of the router consume the highest dynamic power. The results
are in agreement with those extracted for ASICs [30]. It is to be noted that there are
five instances of the Input Port component. We believe these results will be useful in
developing future NoC designs targeted for FPGAs with optimized power consumption.
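
To see why the buffers dominate once the five Input Port instances are taken into account, the sketch below scales the Table 4.4 entries by instance count. It assumes that each table row reports the power of a single instance and that instances add linearly; both are assumptions made for illustration and are not stated in the table.

```python
# Dynamic power per component from Table 4.4 (mW), assumed to be per instance.
component_mw = {"Input Port": 7.88, "Cross-Point": 12.11, "Central Arbiter": 15.69}
# Five input ports per router (from the text); one cross-point and one central arbiter.
instances = {"Input Port": 5, "Cross-Point": 1, "Central Arbiter": 1}

totals = {name: component_mw[name] * instances[name] for name in component_mw}
router_total = sum(totals.values())
shares = {name: 100.0 * totals[name] / router_total for name in totals}
largest = max(shares, key=shares.get)
```

Under these assumptions the input ports together account for more than half of the summed dynamic power, even though a single input port consumes less than the central arbiter.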

Stimulus is applied to the primary inputs. We use ModelSim 6.1c with TCL scripts
to extract the transition activity of internal nodes. The activity generated for a
component alone is applied to it after an incremental place and route, with the
contribution of the remaining components eliminated. We use the technique developed by
Arole [33] in his power profiler. We exhaustively analyze the power consumed in every
resource of the NoC using this methodology.
Resource Utilization and Dynamic Power: Based on the dynamic power consumed
by every component and its resource utilization, which is extracted using ncd2xdl,
we model the dynamic power across every resource. Table 4.5 compares the resource
utilization between our router and the baseline design. The significant reduction in the
number of nets and in the routing & logic resources used contributes to the gain in dynamic
power.

Table 4.5: FPGA Resource Utilization: MoCReS (1VC+MC) vs LiPaR

                          Routing Resources
Router   Logic (Slices)   Single   Double   Hex   Long   Nets
MoCReS   302              5633     1402     439   5      902
LiPaR    352              8504     2314     740   8      1177

4.4 Conclusions

In this chapter, we discuss the power consumed by our NoC framework on FPGA. We de-
termine the various power components in the standalone design and a 3 × 3 mesh NoC.
Further, to determine the power overhead incurred due to supporting the multi-clock
feature, we compare its power consumption with an alternate design that supports only

one frequency. Results show a marginal 11.53% increase in dynamic power compared to
the baseline approach. Further, we determine the power contributed by various components
of the router. Results show that the buffers consume the majority of the power in
the NoC design. An account of the FPGA resources utilized by MoCReS in comparison
with the alternate design is also presented in this chapter.
Chapter 5

Hybrid Two-Layer Router Architecture

5.1 Introduction

The two main concerns with NoC designs that are strictly packet-switched are the control
and serialization overheads involved in transferring data between IP cores that are placed
close to each other in the FPGA. In order to ensure high throughput between these cores,
we advocate time-multiplexed circuit-switched connections. In addition to this mode of
transfer, the router also preserves the online nature of communication between farther
cores through the packet-switched layer. The area efficient MoCReS architecture presented
in Chapter 3 is modified to support both the above mentioned layers of operation.
The design goals and issues involved in the hybrid two-layer architecture are presented
in this chapter. We also develop a SystemC model of our router, both to functionally
verify the design and to vary its specifications and obtain the performance results
rapidly through simulation. We present the results and analysis of the novel router
architecture in this chapter.


5.2 Motivation

Packet-switching performs online scheduling by dynamically negotiating communication
between the cores. An alternate technique, namely circuit-switching, offers high throughput
dedicated connections to overcome the performance drawbacks in packet-switching
by scheduling time-multiplexed communication across the cores. Even though this static
scheduling requires all the communication patterns to be known beforehand, it can provide
a very high throughput with marginal area overhead (for storing schedules). We
propose a modified router architecture which interfaces multiple IP cores to the router
and supports packet-switching for inter-router transfers and time-multiplexed
circuit-switching for IP cores connected to the same router. This technique also eliminates
the latency in the req/grant protocol and the serialization and control overheads for data
transfers between cores placed close to each other in FPGAs and mapped to the same router.

5.2.1 Packetization and Control Overheads

In this section, we quantify the overheads associated with the existing baseline approach
(MoCReS). Control and Packetization are the two main overheads associated with the
MoCReS framework.
1. Control Overhead: In MoCReS, connections between various ports are established
through a req/grant protocol which involves round-robin arbitration in the case
of requests for a common port (conflicts). From Chapter 3, we see that it takes at least 6
cycles for the data at the input port to appear at the output of a router (as input to the
downstream router/local IP). This setup latency is a fixed overhead in addition to the
delays due to network congestion.

2. Packetization Overhead: Due to the nature of the interconnection network, the
channel width between ports/routers is limited to a fixed size (8 bits in the baseline
version of MoCReS). Due to this fixed channel width, the communication data that is to be
sent over the network must be quantized into flits. A variable number of flits constitutes a
packet. If F is the number of flits in a packet and b is the channel width, then F/b is
the serialization latency associated with the communication.
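
To make the packetization overhead concrete, the sketch below counts the flits needed to carry a payload over a fixed-width channel. The function name and the single header flit per packet are assumptions for illustration; the 8-bit and 32-bit widths are the P-layer and C-layer channel widths used in this chapter.

```python
import math


def flit_count(payload_bits, channel_width_bits=8, header_flits=1):
    """Flits needed for one packet: header flit(s) plus the quantized payload."""
    return header_flits + math.ceil(payload_bits / channel_width_bits)


# One 32-bit word over the 8-bit packet-switched channel: 1 header + 4 data flits.
word_over_p_layer = flit_count(32, channel_width_bits=8)
# The same word over a 32-bit channel needs a single data transfer and no header.
word_over_c_layer = flit_count(32, channel_width_bits=32, header_flits=0)
```

The gap between the two counts is the serialization and header cost that the circuit-switched layer is meant to avoid for neighbouring cores.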

5.3 Related Work

We target our proposed NoC framework for reconfigurable computing platforms and
therefore we restrict our discussion in this section primarily to existing FPGA based
NoCs. NoCs were introduced into the FPGA domain mainly to simplify tile-based
reconfiguration [4] [5], and their potential as an effective communication architecture is
largely unexplored [34]. Research in [10] [21] addresses the capabilities of FPGAs to support
NoC based multi-processor applications. Hilton et al. [7] incorporate flexibility into their
design for FPGA based circuit-switched NoCs. However, their strictly circuit-switched
router suffers from signal integrity and path reservation issues, which we overcome in our
design. SoCBUS [35] proposes a circuit-switched router with a packet based setup. Here,
control packets are responsible for setting up strict circuit-switched connections, which is
different from our two-layer approach. Research in [36] [7] [6] also presents FPGA based
NoCs. The above designs ignore implementation level area-performance trade-offs while
proposing the architecture, thereby limiting themselves to a system-level performance
analysis.

To the best of our knowledge, this is the first work to propose an FPGA-suitable
hybrid router architecture integrated with an automatic topology synthesis framework
that satisfies the bandwidth requirements of an application while optimizing its area
overhead.

5.4 Architecture Description

In this section, we first present the modified router micro-architecture, followed by its
architectural advantages and design issues involved. The network topology along with the
flow control for the packet-switched layer are kept the same as presented in Chapter 3 [37].

Figure 5.1: Hybrid Two-Layer Router Architecture. The router has four directional ports
(N, E, S, W) and four local IPs (IP0 to IP3), each attached through a network interface (NI),
served by a Packet-Switched Layer and a Circuit-Switched Layer.

Network Topology: Mesh networks have minimum area overhead (reduced long lines) [10] [37],

low power consumption and map well to the underlying routing structure of FPGAs.
Hence, we choose a mesh topology to optimize logic and routing in FPGAs, and to
provide sufficient resources for the IP cores.
Flow Control: Our router supports multi-clock virtual cut-through flow control with a

deadlock-free XY routing. The switch complexity involved in the above choice is more
suitable for a light-weight implementation [37].
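
Dimension-ordered XY routing can be sketched in a few lines. The coordinate convention below (X grows towards East, Y grows towards North) and the function names are assumptions for illustration; the source does not specify the orientation of the mesh coordinates.

```python
# Coordinate change for each directional port (assumed orientation).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_route(current, destination):
    """Pick the output port: correct X first, then Y, then deliver locally."""
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"


def xy_path(source, destination):
    """List of routers visited from source to destination under XY routing."""
    path = [source]
    current = source
    while current != destination:
        step_x, step_y = STEP[xy_route(current, destination)]
        current = (current[0] + step_x, current[1] + step_y)
        path.append(current)
    return path
```

Because every packet finishes its X traversal before turning into Y, no cyclic channel dependency can form in a mesh, which is why XY routing is deadlock-free.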

5.4.1 Cross-Point Matrix

Architecture Modifications: The modified switch is comprised of two layers of operation:
a high throughput time-multiplexed circuit-switched layer (C-layer) and a multi-clock
packet-switched layer (P-layer). A variable number of IP cores connected to the
switch participate in the C-layer, thereby achieving guaranteed throughput and more
predictable latencies between IP cores placed close to each other in the FPGA.

Figure 5.1 presents the novel two-layer hybrid router architecture. This modified
router has four local IP ports, in addition to the four directional ports. Further, in
this case two of the four local IPs (IP 0, IP 3) are participating in the time-multiplexed
circuit-switched layer. Using the packet-switched layer, all the four IPs can communicate
with the neighbouring routers through the directional ports.



The cross-point matrix is multiplexer based, as opposed to providing connections for
each virtual channel. The following are the design issues involved with the cross-point.
Packet-Switched Cross-Point: In the packet-switched layer, the directional input
ports (N, E, S, W) are multiplexed to every local port. Therefore cross-point connections
are introduced to support these additional local ports. However, all the connections
between the local ports in this layer are removed, as they are connected in the
circuit-switched layer. The ports connected through the C-Layer (IP 0, IP 3) cannot participate
in the P-Layer to transfer data between themselves. This translates into a gain in area,
which we utilize to increase the bandwidth available.

Circuit-Switched Cross-Point: Let L_i be the total number of local IPs and P_i
be the number of ports participating in the circuit-switched layer. The bus width of this
cross-point is currently set to 32 bits in order to support a very high bandwidth. Further,
this cross-point can handle a maximum of P_i high throughput parallel connections. The
scheduling memory configures this cross-point during various time slots.
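
One way to picture the scheduling memory is as a small table that is indexed by the current time slot and returns the cross-point configuration for that slot. The sketch below is a behavioural illustration only: the class name, the destination-to-source encoding and the example schedule are hypothetical and are not taken from the router implementation.

```python
class CircuitLayerSchedule:
    """Toy model of a time-multiplexed cross-point driven by a schedule table."""

    def __init__(self, slots):
        # Each slot maps a destination IP to the source IP it listens to.
        self.slots = slots

    def config(self, cycle):
        """Cross-point configuration that is active in the given cycle."""
        return self.slots[cycle % len(self.slots)]

    def transfer(self, cycle, words):
        """Deliver each source's 32-bit word to the destinations of this slot."""
        return {dst: words[src] for dst, src in self.config(cycle).items()}


# Two-slot schedule: IP0 sends to IP3 in even cycles, IP3 sends to IP0 in odd cycles.
schedule = CircuitLayerSchedule([{"IP3": "IP0"}, {"IP0": "IP3"}])
words = {"IP0": 0xDEADBEEF, "IP3": 0x12345678}
even_cycle = schedule.transfer(0, words)
odd_cycle = schedule.transfer(3, words)
```

Storing the configuration per destination means several destinations can name the same source in one slot, so a one-to-many transfer needs no extra mechanism in this model.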
Router Channel Widths: Due to the high throughput requirement between the cores
participating in the circuit-switched layer, we set the channel width to 32 bits
(corresponding to the data width of the MicroBlaze soft processor). In the packet-switched
layer, we retain the bus width of MoCReS (8 bits/channel). However, the choice of an
appropriate channel width is a trade-off between resources available and bandwidth required.

Figure 5.2: Modified Central Arbiter Model. The figure shows the Cross Point Matrix
(8:1 multiplexers for directional packet switching, 4:1 multiplexers for local circuit
switching) driven by the Central Arbiter through the select signals Msel_L0, Msel_L3,
Msel_N, Msel_E, Msel_S and Msel_W, with request/grant pairs (Req_in_*, Grnt_in_*) for the
N, E, S, W, L0 and L3 ports. The FSM excerpt for port L0 (process Proc_L0, case on From_L0)
has only the states L0_Last_N, L0_Last_E, L0_Last_S and L0_Last_W: it checks N, E, S, W
requests only and has no L1-L3 last-grant states.

5.4.2 Central Arbiter

The Central Arbiter is responsible for configuring the simultaneous connections by setting
the cross-point in the P-Layer. We run parallel FSMs to ensure that no queuing takes place
between requests. As long as the participating IPs request mutually exclusive ports, the
connections are established in parallel. In case of queuing/conflicts, the arbitration is
performed through the round robin approach. The IPs that participate in the C-Layer do not
need arbitration between themselves in the P-Layer. We perform state reduction in the
FSMs corresponding to those inter-local port connections, i.e., in correspondence with the
inter-local IP connections that are removed (Section 5.4.1) in the packet-switched layer.
The Central Arbiter is thus customized to not support states for these connections. The
simplicity of round-robin arbitration coupled with the above state reduction translates
into significant area savings. Figure 5.2 shows the modified central arbiter model.

5.4.3 NI Design

The network interface arbitrates the choice of packet/circuit switched layer and is also
responsible for supporting variable size packets.
Mode Switching: Upon receiving the target IP co-ordinates, the NI triggers the mode signal
to decide whether the packet will be decoded to leave the router or the cross point is
triggered in circuit switch mode.

Variable Packet Sizes: As mentioned in Section 3.5.1, during packet-switched transfer,
the network interface is also responsible for encoding the header with:

1. Packet Size (As a fraction of bRAM depth)

2. X co-ordinate of destination IP

3. Y co-ordinate of destination IP

The packets transfered through the network can be broadly classified as control (lesser
number of flits) or data. Therefore, the packets will be of varied sizes. The NI encodes

the packet size as a fraction of the total bRAM depth along with the header. This novelty
improves buffer utilization, thereby increasing the performance of the NoC.
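
A minimal sketch of such a header encoding is given below. The field widths (2 bits for the size code, 3 bits each for the X and Y co-ordinates), the field order, and the mapping from size code to a fraction of the bRAM depth are all assumptions chosen for illustration; the actual header layout is the one defined in Section 3.5.1.

```python
# Assumed field widths (hypothetical, for illustration only).
SIZE_BITS, X_BITS, Y_BITS = 2, 3, 3


def encode_header(size_code, x, y):
    """Pack size code, destination X and destination Y into one header word."""
    for value, bits in ((size_code, SIZE_BITS), (x, X_BITS), (y, Y_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit in its assumed width")
    return (size_code << (X_BITS + Y_BITS)) | (x << Y_BITS) | y


def decode_header(header):
    """Inverse of encode_header: return (size_code, x, y)."""
    y = header & ((1 << Y_BITS) - 1)
    x = (header >> Y_BITS) & ((1 << X_BITS) - 1)
    size_code = header >> (X_BITS + Y_BITS)
    return size_code, x, y


def packet_flits(size_code, bram_depth):
    """Assumed mapping: code k stands for (k + 1) / 2**SIZE_BITS of the depth."""
    return bram_depth * (size_code + 1) // (1 << SIZE_BITS)
```

With these assumed widths, a size code of 1 on a buffer of depth 512 would denote a packet occupying half of the buffer.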

5.4.4 Design Parameters

In order to quickly explore the NoC design space, we have parameterized the structural
VHDL model of our router for:

1. Total number of ports

2. Channel width

3. Virtual Channels/port

4. Number of ports participating in the C-Layer


By varying the above parameters, we develop a component library, MoClib, which we
use to characterize variants of the router for area and operating frequency.

5.5 Architectural Advantages

Bandwidth Increase: Bandwidth available in a switch is the product of the number of
ports, operating frequency and channel width. The C-layer has minimum logic overhead
with no buffering and can operate at a clock rate significantly higher than the P-layer.
Furthermore, increasing the number of ports also scales the available bandwidth in a
switch. Moreover, the absence of control/serialization overheads (req/grant) also
increases the throughput.

Power Savings: The amount of logic required for the NoC decreases as the router count
decreases, thereby saving static power. Further, with increasing number of ports within a
router, the average packet latency is also reduced [36]. Therefore dynamic power drops
considerably with the reduction of router hops.

Guaranteed Throughput: The time-multiplexed nature of the C-Layer scheduling provides
good Quality of Service (QoS) to the application, particularly between cores placed close
to each other. Otherwise, the NoC would have to support area-expensive QoS protocols to
ensure the required bandwidth.

Inherent Multi-Cast Capability: The cross-point in the C-layer can be configured
simultaneously for a multi-cast (one to many destinations) operation among IPs connected
to the same router without any penalty in performance. Further, this capability also
optimizes the area required for storing the schedules (with fewer bits required to
encode the configuration data of the circuit-switched network).
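
The bandwidth relation stated under Bandwidth Increase (ports × operating frequency × channel width) can be written out as a small calculation. The function names and the example numbers used afterwards are illustrative assumptions, not measured values from this work.

```python
def switch_bandwidth_mbps(num_ports, freq_mhz, channel_width_bits):
    """Aggregate switch bandwidth in MB/s: ports x frequency x width (in bytes)."""
    return num_ports * freq_mhz * channel_width_bits / 8.0


def bandwidth_per_port_mbps(freq_mhz, channel_width_bits):
    """Bandwidth of a single port in MB/s."""
    return freq_mhz * channel_width_bits / 8.0
```

For instance, a hypothetical 5-port switch with 16-bit channels clocked at 300 MHz would offer 600 MB/s per port and 3000 MB/s in aggregate under this relation.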

[Figure 5.3: simulation waveform. Signals shown: P_CLK with the P-Layer inputs (N_in, E_in, S_in, W_in, L2_in) and outputs (S_out, L2_out, W_out, E_out, N_out); C_CLK with the C-Layer ports L0_C, L1_C and L3_C. The setup latency and a multi-cast operation are annotated; time axis from 50 ns to 100 ns.]

Figure 5.3: SystemC Simulation

5.6 System-Level Router Model

With increasing design complexities, there is a need for rapid design space exploration
that makes use of a set of specifications. We model our NoC router framework using
SystemC. By doing so, we functionally verify the model and also set up a platform to
estimate the advantages of this architecture over the baseline approach.

SystemC is a description language that abstracts the computation elements of a design
as behaviors (or processes) and simplifies the communication between the cores using
transaction-level modeling. The framework has a set of library routines and macros
implemented using C++. The behavior of the hardware to be modeled is captured by
simulating concurrent processes coded in C++.
SystemC Tool Flow: Every component in the router is modeled in C++ as a process.
The resulting .cpp files are compiled and executed with the SystemC engine, which is
itself written in C++. We use the open-source SystemC version 2.1 to compile our router
design. The set of .cpp files is first compiled with the appropriate command options.
Then, an executable is created to run the tool flow. We dump the Value Change Dump (VCD)
file from the engine.

The .VCD file of the router model can be used as follows:

• Applied to a standard simulation tool to verify the functionality of the model by
viewing the waveform

• Used to estimate the preliminary power consumed by the implementation on FPGAs, by
using XPower and the architecture information (Virtex-4)

[Figure 5.4: 3-D plot of Area (Slices), 0 to 600, against #Circuit s/w Ports (0 to 3) and #Packet s/w Ports (2 to 8).]

Figure 5.4: Design Parameters Vs Area

5.7 Synthesis Results

In this section we present the area/synthesis results for our modified router implemented
on Xilinx Virtex-4 [1].

The additional bandwidth offered by the proposed router comes with an increase
in switch complexity. The amount of FPGA logic and routing resources consumed by
a router instance depends on its complexity. Figure 5.4 presents this variation in
switch area with the number of ports (C- and P-Layer) it supports. Further, the operating
frequency of the router instances varies greatly due to different critical path lengths.
Also, with increasing number of ports participating in the circuit-switched layer, the
routing resources deplete rapidly (due to increased channel widths). This degradation in
performance in turn affects the bandwidth the switch can offer. Figure 5.5 presents the
variation in switch operating frequency with the number of ports in both layers. The
above area and frequency estimates are obtained by varying the parameters in the VHDL
model of the router and by implementing them on the target device.

Furthermore, to perform automatic topology synthesis, we estimate the increase/decrease
in switch area when the number of P-Layer ports and the number of C-Layer ports are
varied independently. When NoC area is in the cost function, the above data will aid
rapid design space exploration. Tables 5.1 and 5.2 present the scaling of area and
frequency with increasing C-Layer and P-Layer ports respectively. In the tables,
MC(x,y,z) denotes an instance of the MoClib library, where y is the total number of
C-Layer ports, z is the total number of P-Layer ports and x is the sum of the two (total
number of ports). Table 5.2 presents the scaling of area and frequency only with respect
to the P-Layer ports, and its entries can therefore be considered variations of the
MoCReS baseline router.

[Figure 5.5: 3-D plot of Max. Frequency (MHz), 0 to 400, against #Circuit s/w Ports (0 to 3) and #Packet s/w Ports (2 to 8).]

Figure 5.5: Design Parameters Vs Frequency

Table 5.1: Scaling of Area and Frequency with No. of C-Layer Ports

MoClib Component    Area (Slices)    Frequency (MHz)
MC (4,2,2)          314              336
MC (5,3,2)          326              318
MC (5,2,3)          341              303
MC (6,3,3)          394              240
MC (6,2,4)          382              258
MC (7,3,4)          440              221

Table 5.2: Scaling of Area and Frequency with No. of P-Layer Ports

MoClib Component    Area (Slices)    Frequency (MHz)
MC (3,0,3)          296              378
MC (4,0,4)          318              362
MC (5,0,5)          349              324
MC (6,0,6)          390              296
MC (7,0,7)          435              267
MC (8,0,8)          493              229
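
To show how the characterization data in Tables 5.1 and 5.2 could be consumed during design space exploration, the sketch below stores the MC(x,y,z) entries in a small lookup structure and selects the fastest component that satisfies a port requirement within an area budget. The selection rule and the names `MC`, `MOCLIB` and `fastest_component` are assumptions for illustration; the topology synthesis algorithm actually used is the one presented in Chapter 8.

```python
from typing import NamedTuple, Optional


class MC(NamedTuple):
    total_ports: int   # x
    c_ports: int       # y
    p_ports: int       # z
    area_slices: int
    freq_mhz: int


# Entries transcribed from Tables 5.1 and 5.2.
MOCLIB = [
    MC(4, 2, 2, 314, 336), MC(5, 3, 2, 326, 318), MC(5, 2, 3, 341, 303),
    MC(6, 3, 3, 394, 240), MC(6, 2, 4, 382, 258), MC(7, 3, 4, 440, 221),
    MC(3, 0, 3, 296, 378), MC(4, 0, 4, 318, 362), MC(5, 0, 5, 349, 324),
    MC(6, 0, 6, 390, 296), MC(7, 0, 7, 435, 267), MC(8, 0, 8, 493, 229),
]


def fastest_component(min_c_ports: int, min_total_ports: int,
                      max_area: int) -> Optional[MC]:
    """Return the highest-frequency entry meeting the constraints, or None."""
    fits = [m for m in MOCLIB
            if m.c_ports >= min_c_ports
            and m.total_ports >= min_total_ports
            and m.area_slices <= max_area]
    return max(fits, key=lambda m: m.freq_mhz, default=None)
```

As an example, requesting at least two C-Layer ports and five ports in total within 350 slices selects MC(5,3,2) from the tabulated entries.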

5.8 Results: Performance Improvement

Area vs Average Available Bandwidth/Port: The baseline version in this comparison is
MoCReS with 1VC+MC. The area (in slices) of the switch increases with the number of
ports it supports. We measure the area values for increasing number of (packet-switched)
ports in the baseline version. For similar area values, the alternate hybrid router
offers an increase in available bandwidth per port. This section compares that bandwidth
increase with the baseline approach. For equivalent area overheads (in slices) on a
similar FPGA, Figure 5.6 presents the bandwidth capacity (in MB/s) of the NoC (per port)
for both approaches. In spite of a rapid degradation in operating frequency (with
increase in circuit-switched ports), there is a significant bandwidth gain using the
hybrid two-layer approach. For the area window utilized in our library of routers, there
is an average 20.4% gain in bandwidth (maximum of 24%) offered by our NoC. This gain in
performance is due to supporting a high-throughput circuit-switched layer with a marginal
area overhead.

5.8.1 Design Issues

Even though it appears intuitively that an increase in number of ports in the C-layer
gives performance benefits without any area overhead, there are certain design issues
that can potentially limit the performance due to increase in switch complexity.

[Figure 5.6: plot of Avg. Bandwidth / Port (300 to 800) and % Gain in BW (0 to 25) against Area (Slices), 250 to 500; series: Hybrid Two-Layer Router, Baseline Approach, % BW Gain.]

Figure 5.6: Area (Slices) Vs Avg. Bandwidth / Port

Operating Frequency vs Switch Complexity: There is a depletion of critical resources
associated with an increase in switch complexity (number of ports, bus width). As a
result, the operating frequency of the switch degrades, which in turn affects the
bandwidth offered by the router. For the NoC paradigm to be an efficient alternative to
the bus-based architecture, the design parameters must be chosen carefully so that the
routers can be operated at the highest possible frequency.

Switch Power vs Link Power: By increasing the number of ports, we can reduce the average
hop count [36], i.e., we minimize the number of routers and links. This translates into
a reduction in power consumed by the links, but an increase in power consumed by the
switches. Beyond a cut-off, the increase in switch power can potentially overshadow the
gain in link power, thereby increasing the power/flit ratio.
Explosion of Schedule Memory: With increasing number of C-layer ports, the schedule
memory also scales linearly. The schedule memory, expressed in number of LUTs, is a
function of the number of schedule cycles and the number of C-layer ports present. If C
is the number of ports participating in the C-Layer, then $\lceil \log_2 C \rceil$ is
the number of configuration bits required per cycle.
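
The relation between the number of C-Layer ports and the configuration bits can be worked through with a short calculation. The per-cycle bit count follows the expression given above; multiplying it by the number of schedule cycles to obtain a total is an assumption made here for illustration, since the exact LUT mapping is not specified in this section.

```python
def config_bits_per_cycle(c_ports):
    """Return ceil(log2(C)) using exact integer arithmetic."""
    if c_ports < 1:
        raise ValueError("at least one C-Layer port is required")
    return (c_ports - 1).bit_length()


def schedule_memory_bits(c_ports, schedule_cycles):
    """Assumed total: one configuration word of ceil(log2(C)) bits per cycle."""
    return schedule_cycles * config_bits_per_cycle(c_ports)
```

Under this assumption, a 16-cycle schedule needs 32 bits for three or four C-Layer ports and 48 bits for five.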


Clock Signal Integrity: Operation of the C-layer ports requires the participating IP
cores to be synchronous, as there is no buffering, as opposed to the packet-switched
layer, where multi-clock FIFOs separate the clock domains. Increasing the number of
C-layer ports could potentially increase the distance between the connected IP cores. In
this case, signal integrity limits the number of C-layer ports and reduces the clock
rate.

It can be seen that all of the above factors limit the amount of performance gain that
can be achieved using our hybrid approach. This trade-off between performance, area
and port count merits a balance and requires an application-suitable tuning of the NoC
topology. We present an algorithm along with a CAD flow in Chapter 8 to automate
topology synthesis for FPGA-based NoCs.

5.9 Conclusions

In this chapter, we present the limitations associated with the MoCReS packet-switched
NoC and then design and implement a hybrid two-layer router architecture for FPGA-based
NoCs. We functionally verify the design and characterize several versions of the novel
router for area and operating frequency. We also present the bandwidth results along
with the design advantages and issues involved in the proposed architecture.
Chapter 6

Hybrid NoC: Performance and Power Analysis

In this chapter we analyze the novel router architecture presented in Chapter 5 for
performance and power. Our MoCReS router design in Chapter 3 is utilized as the baseline
router in making the performance and power comparisons. We retain the Virtex-4
XC4VLX100 [1] device as the target for all the comparisons.

6.1 Performance Analysis

The main advantage of the hybrid approach used in the router architecture is the
increased overall throughput it offers. The C-layer connections are pre-scheduled between
IP cores that require high bandwidth. These short-distance, high-bandwidth connections
come with a smaller resource penalty in FPGAs. In this section, we quantify the average
improvement in performance compared to the baseline approach.

[Figure 6.1: 3 × 2 mesh of MoCReS baseline routers; top row IP6/R6, IP5/R5, IP4/R4; bottom row IP1/R1, IP2/R2, IP3/R3.]

Figure 6.1: Baseline 3 × 2 MoCReS Mesh

6.1.1 Packet Injection Rate Vs Average Latency

In a bus-based architecture, performance is measured in terms of overall bus frequency
and number of masters/slaves. In contrast, the overall performance of an NoC is
measured in terms of the net traffic it can sustain. Therefore, we design our experiments
by injecting packets of finite flit size at varying rates at each node in the network.
The experimental framework used in this section has been partially adopted from [27].

We have instantiated a 3 × 2 mesh network (Figure 6.1) with the baseline MoCReS
router. Furthermore, six packet-injecting modules were wrapped around the mesh
framework. The framework adopted from [27] uses a C++ module which generates an input
file that contains all the packets (input vectors) that the network needs to transport.
The input parameters to the C++ program are the mesh co-ordinates, the number of packets
to be generated and the length of each packet. In every generated packet, the first flit
contains the destination IP X and Y co-ordinates and the packet size (fraction). The VHDL
testbench wrapped around the mesh directly controls the injection rate at all IP input
ports, thereby serving as an abstraction of the cores that inject/receive packets.


The above model is simulated using Modelsim 6.1 [28] along with the input vectors
from the Perl tool. While generating the packets, it is ensured that the source and
destination of a packet are never the same. The rest of the testbench reads the generated
text files and injects each packet into the source port. While doing so, the packet
injection timestamp is also recorded. When the same packet is received at the destination
after a finite number of clock cycles, the testbench records the reception timestamp as
well. Therefore, upon successful completion of the simulation, the VHDL testbench creates
two files for every IP: one with the injection timestamp of every packet and the other
with the reception timestamp of every packet arriving at the destination IP.

[Figure 6.2: 2 × 2 mesh with routers R6, R5 (top row, serving IP6, IP5, IP4) and R1, R2 (bottom row, serving IP1, IP2, IP3); legend: MoCReS Baseline Router, 2-Layer Hybrid Router.]

Figure 6.2: Modified Hybrid Router 2 × 2 Mesh


Using the above data, the number of cycles elapsed between injection and reception is
computed for every packet at each injection rate, and the average over all the injected
packets is taken. We repeat the same process for a 2 × 2 mesh network in which two of
the routers are hybrid routers with two C-layer ports each, as shown in Figure 6.2.

The experimental flow developed in [27] determines the total execution time, wherein
the largest timestamp of the received packets is reported. We modify the flow to compute
the latency of every packet injected into the network and finally determine the average
latency of the baseline and modified networks for that particular injection rate. We
increase the injection rate in steps of 0.1 flits/node/cycle and observe the increase in
average latency of the two networks.
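
The latency computation described above can be sketched as follows. The layout of the timestamp files is not given here, so the sketch assumes the injection and reception timestamps have already been read into dictionaries keyed by a packet identifier; the function names are hypothetical.

```python
def average_latency(injected, received):
    """Average per-packet latency in clock cycles.

    injected / received: dict mapping packet id -> timestamp (in cycles).
    Packets without a reception timestamp are ignored.
    """
    latencies = [received[p] - injected[p] for p in injected if p in received]
    if not latencies:
        raise ValueError("no packet was both injected and received")
    return sum(latencies) / len(latencies)


def total_execution_time(received):
    """Metric of the original flow: the largest reception timestamp."""
    return max(received.values())
```

Running this once per injection rate for both networks would give the points of a latency-versus-injection-rate curve.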

Figure 6.3: Packet Statistics: No. Packets Injected from each IP

For both the baseline and hybrid router approaches, we apply a combination of two
traffic scenarios: random traffic and hot spot traffic.

• Random Traffic: The source-destination pairs and the number of packets are completely
random values that follow a uniform distribution. Once generated, the same traffic is
applied to both the baseline MoCReS and the hybrid mesh framework.

• Hot Spot Traffic: Along with the above random approach, we deliberately choose
source-destination pairs such that a majority of the transfers occur between one or
two IPs. We perform this task manually to create hot spots in the traffic. This
could be a common scenario in an SoC, where critical components such as memories
account for a significant percentage of the overall packet transfers.
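
The two traffic scenarios can be mimicked with a small generator such as the one below. It is a sketch only: in this work the hot spots were created manually, whereas the `hot_spots` and `hot_fraction` parameters here are hypothetical knobs that bias destinations towards chosen IPs. As in the experiments, a packet is never addressed to its own source.

```python
import random


def generate_traffic(num_ips, num_packets, hot_spots=(), hot_fraction=0.0,
                     seed=0):
    """Return a list of (source, destination) pairs with source != destination.

    With probability `hot_fraction` the destination is drawn from `hot_spots`;
    otherwise it is drawn uniformly from all other IPs.
    """
    if num_ips < 2:
        raise ValueError("need at least two IPs")
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_packets):
        src = rng.randrange(num_ips)
        choices = []
        if hot_spots and rng.random() < hot_fraction:
            choices = [ip for ip in hot_spots if ip != src]
        if not choices:  # uniform case, or the source is the only hot spot
            choices = [ip for ip in range(num_ips) if ip != src]
        pairs.append((src, rng.choice(choices)))
    return pairs
```

Fixing the seed allows the same generated traffic to be replayed on both the baseline and the hybrid mesh.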

Figure 6.3 presents source-destination pairs for all the packets injected into the
network. As mentioned before, a packet is never routed back to the source IP. The figure
presents the number of packets sent to the five possible destinations for each source IP
(IP1 to IP6).

[Figure 6.4: plot of Average Packet Latency (0 to 1000) against # Flits/Node/Cycle (0.1 to 1); series: Hybrid Router Mesh, Baseline MoCReS.]

Figure 6.4: Results: Baseline 3 × 2 MoCReS Vs Hybrid Router Mesh

Experimental results are presented in Figure 6.4. As the injection rate is increased from
0.1, the average latency in both cases increases linearly as well. However, it can be seen
that the network saturates for the baseline approach around the 0.65 flits/node/cycle
mark. In the hybrid router approach, the network sustains linear average latency until
0.75 flits/node/cycle, which is a significant improvement in a small mesh network that
consists of 6 cores. The above saturation point will vary considerably with the number
of IPs (dimension of mesh).

We obtain the above improvement in saturation for two main reasons:

1. Increased bandwidth between two pairs of cores, which reduces the traffic burden
on the network.

2. The reduction in the average number of packet hops due to the reduced number of
routers compared to the baseline approach. The mapping of IPs has a significant impact
on the total number of packet hops. Compared to the baseline case, the total number of
packet hops is reduced by 42.8% in the hybrid mesh due to the routers that support
more than one IP.
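
The effect of IP mapping on the hop count can be illustrated with a small calculation for XY routing. The sketch counts router-to-router link traversals as the Manhattan distance between the routers of the source and destination IPs. The baseline placement follows Figure 6.1; the hybrid placement is an assumption about which IPs share a router and need not match Figure 6.2 exactly. The toy traffic is likewise hypothetical, so the resulting numbers do not reproduce the 42.8% figure.

```python
def xy_hops(src_router, dst_router):
    """Link traversals under XY routing: Manhattan distance between routers."""
    return (abs(src_router[0] - dst_router[0])
            + abs(src_router[1] - dst_router[1]))


def total_hops(traffic, placement):
    """Sum of hops over (source IP, destination IP) pairs for a given mapping."""
    return sum(xy_hops(placement[s], placement[d]) for s, d in traffic)


# One IP per router, as in the 3 x 2 baseline mesh.
BASELINE = {1: (0, 0), 2: (1, 0), 3: (2, 0), 6: (0, 1), 5: (1, 1), 4: (2, 1)}
# Assumed hybrid mapping: IP2/IP3 share one router, IP5/IP4 share another.
HYBRID = {1: (0, 0), 2: (1, 0), 3: (1, 0), 6: (0, 1), 5: (1, 1), 4: (1, 1)}
```

For a toy traffic list such as `[(1, 3), (2, 3), (6, 4)]`, the baseline mapping gives 5 hops and the assumed hybrid mapping gives 2, because IPs on the same router communicate without crossing any inter-router link.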

In the above analysis we have simplified the latency analysis by neglecting the zero-load
setup time for a C-Layer connection. Our hybrid router incurs a 2 clock cycle penalty
for every C-layer schedule memory look-up and cross-point connection. Furthermore, in
the above hybrid router mesh, at most two ports participate in the C-Layer. For routers
that have several C-layer ports, with the connections between C-layer ports changing
more frequently, this penalty needs to be included when comparing with the baseline
mesh.

6.2 Power Analysis

This section presents the detailed power trade-offs involved in the proposed router
architecture. We first present the total dynamic power consumed in our hybrid router
along with the power breakdown across its components. Further, we also present how this
power metric scales with the number of C- and P-Layer ports present in a router.
Furthermore, the power dissipated in our hybrid router mesh is compared with the
baseline router. Before determining the above power numbers, we floorplan the NoC
in our target FPGA and then place and route the design to obtain accurate resource
utilization estimates.

6.2.1 Power Breakdown Results

For this analysis, we consider an 8-port hybrid router (4 directional + 4 local) with
3 IPs participating in the C-Layer. The C-Layer components are the Schedule FSM, the
bRAM schedule memory and the C-Layer CPM (C-CPM). We apply random inputs to the router
and determine the switching activity of the placed and routed model. Using XPower [1],
we obtain dynamic and static power estimates for our router. To obtain a component-based
power estimate, we adopt the power profiler methodology (with incremental synthesis)
used in [33]. We apply 5000 flits with random requests to the hybrid router and capture
the applied transition activity in Table 6.1. The net dynamic power consumed by this
router is 134.87 mW at an operating frequency of 200 MHz. Figure 6.5 presents the power
breakdown of the 8-port hybrid router.

Table 6.1: Input Flits: Transition Activities

Bit Position               0      1      2      3      4      5      6      7
Transition Activity (%)    50.02  49.58  50.94  50.12  49.24  49.28  51.36  48.82

Bit Position               8      9      10     11     12     13     14     15
Transition Activity (%)    51.06  51.02  50.50  49.62  50.22  50.28  50.12  49.58

[Figure 6.5: chart of the dynamic power breakdown. Total Dynamic Power: 134.87 mW @ 200 MHz]

Figure 6.5: Dynamic Power Breakdown of an 8-port Hybrid Router

The power consumed per packet of data by each router varies based on its complexity.
The complexity of the router translates into the amount of logic and routing resources
(switched capacitance) consumed by it. Figure 6.6 presents the scaling of dynamic power
(mW) with increasing P-Layer ports. With increasing P-Layer ports, the Cross-Point
and Central Arbiter scale linearly in terms of the amount of logic utilized, thereby
increasing the dynamic power linearly.

[Figure 6.6: plot of Dynamic Power (mW @ 200 MHz), 40 to 180, against P-Layer Ports, 2 to 10; series: Dynamic Power Vs Switch Size.]

Figure 6.6: P-Layer Ports (Switch Size) Vs Dynamic Power (mW)

With increasing C-Layer connections, the amount of long interconnects utilized within
the router increases, thereby increasing the amount of switched capacitance. For
equivalent routers (in terms of number of ports), a higher number of C-Layer connections
can increase the dynamic power consumed by up to 18%. Figure 6.7 presents dynamic power
scaling in three cases with varying number of directional ports. In this analysis we
fixed the size of the schedule memory at 16 words with 8 bits/word. We have configured
the schedule FSM to take as input the source IP (2 bits), destination IP (2 bits) and
the number of clock cycles (4 bits) in binary representation, thereby reducing the
number of words required to store a schedule. However, as the number of ports or
configurations increases, the schedule memory needs to expand. In that case, the power
fraction contributed by the schedule memory will increase.
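
The 8-bit schedule word described above (2-bit source IP, 2-bit destination IP, 4-bit cycle count) can be packed and unpacked as sketched below. The field order within the word is an assumption; only the field widths are taken from the text.

```python
SRC_BITS, DST_BITS, CYCLE_BITS = 2, 2, 4  # widths stated in the text


def pack_schedule_word(src_ip, dst_ip, cycles):
    """Pack one schedule entry into 8 bits as [src | dst | cycles] (assumed order)."""
    for value, bits in ((src_ip, SRC_BITS), (dst_ip, DST_BITS),
                        (cycles, CYCLE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return (src_ip << (DST_BITS + CYCLE_BITS)) | (dst_ip << CYCLE_BITS) | cycles


def unpack_schedule_word(word):
    """Return (src_ip, dst_ip, cycles) from an 8-bit schedule word."""
    cycles = word & ((1 << CYCLE_BITS) - 1)
    dst_ip = (word >> CYCLE_BITS) & ((1 << DST_BITS) - 1)
    src_ip = word >> (DST_BITS + CYCLE_BITS)
    return src_ip, dst_ip, cycles
```

Storing a cycle count in each word, rather than one word per cycle, is what keeps a 16-word memory sufficient for longer schedules.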

[Figure 6.7: bar chart of Dynamic Power (mW) at 200 MHz, 100 to 135, for 2, 3 and 4 C-Layer Ports in three router configurations: 2, 3 and 4 Directional Ports, each with 4 Local IPs.]

Figure 6.7: C-Layer Ports (Switch Size) Vs Dynamic Power (mW)@200MHz

6.2.2 Switch vs Link Power

The two components of power consumed in an NoC are the switch power and the link power.
The switch power is determined by the cross-point size (number of ports) and the
arbitration and routing logic used. Fewer ports per router implies reduced switch
complexity and an increased number of links. In our router designs, we have employed
simple XY routing with round-robin arbitration. This leads to a marginal scaling in
switch power with router ports. Further, with increasing interconnect sizes (cores placed
farther apart), the link power begins to dominate the overall power consumed in the NoC.
In our hybrid router architecture, we use C-layer connections only for IP cores placed
close to each other. This incurs a marginal penalty in link power compared to long
bus-based connections. We present the impact of floorplanning on power and performance
of the on-chip network in greater depth in Section 9.4.

Figure 6.8 presents a comparison between the total power consumed in the hybrid
two-layer router mesh and the baseline version. The NoC supports 6 IPs with the network
topologies in Figures 6.1 and 6.2. Due to the reduced number of interconnects and
logic resources in the hybrid router mesh, there is a gain of about 15.38% in dynamic
power and 10.1% in static power consumed.

[Figure 6.8: bar chart of Power (mW), 150 to 550, comparing Baseline MoCReS Mesh and Hybrid Router Mesh for Dynamic Power @ 200 MHz and Static (Quiescent) Power.]

Figure 6.8: Power (mW): Baseline MoCReS Vs Hybrid Router Mesh

Chapter 7

Experimental Platform

In this chapter we present the experimental platform developed to evaluate the approach
presented in this thesis. First, a description of the multi-processor benchmarks
is presented. This is followed by an enumeration of the phases in router design and
implementation on our target FPGA.

7.1 Multi-Processor Benchmarks

To determine the impact of our approach on typical designs, we utilize a set of real and
synthetic benchmarks that are widely used in NoC studies [36] [38]. The benchmarks are
represented by task graphs. We use the following graph model, in which communication
between IP cores in a design is abstracted using a Core Communication Graph
G(V, E), where:

• Vi denotes the vertices (cores),

• E(i, j) and BW(i, j) denote the directional edges (communication pattern) and the
corresponding bandwidth requirement between IP cores i and j.

These task graphs can represent many real applications, including MPEG, DFT, etc.
They can also be easily generated for a comprehensive study of the proposed approach.
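
A minimal in-memory representation of such a Core Communication Graph is sketched below. The class and method names are hypothetical; the structure simply stores BW(i, j) for every directed edge E(i, j) and derives the graph properties (|V|, |E|, maximum in/out-degree, maximum/minimum bandwidth) that are used to describe a benchmark.

```python
from collections import defaultdict


class CoreGraph:
    """Directed graph G(V, E) with a bandwidth (MB/s) on every edge."""

    def __init__(self):
        self.bandwidth = {}  # (i, j) -> BW(i, j) in MB/s

    def add_edge(self, src, dst, mbps):
        if src == dst:
            raise ValueError("self-loops are not meaningful here")
        self.bandwidth[(src, dst)] = mbps

    def vertices(self):
        return {core for edge in self.bandwidth for core in edge}

    def _max_degree(self, position):
        counts = defaultdict(int)
        for edge in self.bandwidth:
            counts[edge[position]] += 1
        return max(counts.values(), default=0)

    def max_in_degree(self):
        return self._max_degree(1)

    def max_out_degree(self):
        return self._max_degree(0)

    def bandwidth_range(self):
        """Return (max, min) bandwidth over all edges."""
        values = list(self.bandwidth.values())
        return max(values), min(values)
```

A three-core example with edges cpu→mem (500 MB/s), dsp→mem (120 MB/s) and mem→cpu (60 MB/s) has |V| = 3, |E| = 3, a maximum in-degree of 2 and a bandwidth range of (500, 60).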


There are two main kinds of benchmarks used in our research:

1. Application Benchmarks: The four application benchmarks [39] used are Fast Fourier
Transform (FFT), Moving Picture Experts Group (MPEG4) encoder, Video Object Plane
Decoder (VOPD) and Multi-Window Display (MWD). The benchmarks are provided as task
graphs where vertices in the graph represent computation elements and directional edges
represent communication patterns (precedence). This class of applications permits
traffic characterization early in the design cycle. This information is used to
fine-tune the NoC topology. Each of the edges is annotated with the required bandwidth
in megabytes per second (MB/s).

2. Synthetic Benchmarks: In addition to these real application benchmarks, we obtain
a rich set of synthetic benchmarks that were generated in [36] using Task Graphs For
Free (TGFF) [40]. These benchmark cases cover a rich variety of communication
properties, namely in-degree, out-degree and dependence width, and therefore represent a
wider class of multi-processor applications. For these synthetic benchmarks, we randomly
generate bandwidth requirements that follow a uniform distribution and use them to
annotate the edges.

Table 7.1 summarizes the properties of the application and synthetic benchmarks used
in this research. The table presents the number of cores (|V|), number of edges (|E|),
maximum in/out-degree, and maximum/minimum bandwidth required. The degree of the nodes
represents the packed nature of the benchmarks. The maximum bandwidth requirement
denotes the opportunities for clustering (violation), while the minimum bandwidth
requirement represents the potential to reduce area (by increasing the number of
packet-switched ports). The clustering and area reduction approaches are discussed in
Chapter 8.

Table 7.1: Application and Synthetic Benchmarks

Benchmark           |V|   |E|   In-degree   Out-degree   Bandwidth (MB/s)
                                                         Max     Min
Application
MPEG4               12    26    7           7            910     1
FFT                 15    22    2           2            473     26
VOPD                12    14    2           2            500     16
MWD                 12    13    2           2            128     64
LU Decomposition    9     11    2           3            510     76
Laplace Solver      9     12    2           2            378     68
Synthetic
Basic-1             9     8     1           4            196     34
Parallel-1          9     14    3           4            225     47
Packed-1            9     16    3           5            334     59
Packed-2            9     15    3           4            412     106

7.2 Xilinx ISE Design Flow

For the router design and implementation, we follow the Xilinx ISE [1] flow. The various
versions of the router are modeled in structural VHDL. We incorporate the design
optimizations using the VHDL models. The optimized model is then synthesized, mapped,
placed and routed using Xilinx ISE. The design is targeted for an XC4VLX100-11 on
a Nallatech [3] BenDATA development board. Figure 7.1 presents a Virtex-4 [1]
platform FPGA on a BenDATA [3] development board. Xilinx XST [1], which is a part
of ISE 8.2i, is used to synthesize the VHDL models. Xilinx LogiCORE FIFO Generator
v2.3 is used to generate common clock/independent clock FIFO buffers. Functional
simulation of the router and mesh versions is performed using Modelsim 6.3c [28].

7.2.1 FPGA Resource Characterization

An NoC router implemented on an FPGA consumes certain existing logic and routing
resources. Placement and routing constraints and switch complexity dictate the operating
frequency of the router. This variation in frequency is due to its implementation on a
variety of available resources within the constrained area. In an attempt to
characterize this variation in router properties with the availability of resources, we
develop an experimental platform.

Figure 7.1: Nallatech [3] Bendata-V4 Platform FPGA

XDL Flow: Xilinx Design Language (XDL) is a standard for representing the native circuit
description (.ncd) of the placed and routed design in text form. The XDL utility, which
is a part of the ISE suite, is used in our research to convert the .ncd file of the
router design to .xdl (using the xdl -ncd2xdl command). Later, this .xdl file is parsed
to extract the resource information. The XDL file contains two parts:

1. Resource Instances

2. Net Description

Resource Instances: This section of the XDL file contains instances of the components
utilized in the target device, including LUTs, bRAMs, and embedded blocks if any.

Net Description: This part consists of a detailed description of every net instance in
the design, as implemented on the FPGA. The description includes the source pin,
destination pin(s), fan-out, and the routing resources utilized by the net. Figure 7.2
illustrates portions of a sample XDL file for our router.

Programmable routing resources in an FPGA consume a significant portion of circuit
power and delay. An efficient NoC implementation will take into account these factors
of the underlying FPGA architecture. Routing resources in our target Virtex-4 FPGA
can be divided into: 1) Single (OMUX/IMUX) Lines, 2) Double Lines, 3) Hex Lines, 4)
Long Lines, 5) Clock Tree, and 6) Programmable Interconnect Points (PIPs). Figure 7.3
illustrates the main types of routing resources in the target FPGA device.
The interconnect distance, switched capacitance of the above resource significantly

vary, thereby accounting for a range of delay and power values. We estimate the delay (ns)
and power (mw) overheads involved in utilizing these resources in the design. Table 7.2
71

design "VCRouter_top" xc4vlx100ff1513-11 v3.1 ,

cfg "

_DESIGN_PROP : : BUS_INFO : 8 : INPUT : E_channel_data_in <7 : 0>


_DESIGN_PROP : : BUS_INFO : 8 : INPUT : L_channel_data_in <7 : 0>

inst "E_channel_data_in <0>" "IOB" , placed IOIS_NC_LX0Y142 H37 ,


cfg " DIFFI_INUSED : : #OFF DIFF_TERM : : #OFF
PULL : : #OFF SLEW : : #OFF INBUF : E_channel_data_in_0_IBUF :
PAD : E_channel_data_in <0> : "
;

net "E_channel_data_in_3_IBUF" ,
outpin "E_channel_data_in <3>" ,
inpin "input_E/fifoA/BU2/U0/BU7"
,
pip BRAM_X10Y132 IMUX_B19_INT0 > RAMB16_DIA3
pip INT_X0Y134 BEST_LOGIC_OUTS0 > E6BEG8
pip INT_X10Y132 S2END8 > IMUXB19
pip INT_X10Y134 W2END6 > S2BEG8
pip INT_X12Y134 E6END6 > W2BEG6
pip INT_X6Y134 E6END8 > E6BEG6
pip IOIS_NC_L_X0Y134 IOIS_IO > BEST_LOGIC_OUTS0_INT ,
pip IOIS_NC_L_X0Y134 IOIS_IBUF0 > IOIS_IBUF_PINWIRE0 ,
pip IOIS_NC_L_X0Y134 IOIS_IBUF_PINWIRE0 > IOIS_I0 ,
;

Figure 7.2: Sample XDL Description


72

Single Line

Double Line

Hex Line

Horizontal Long Line

Figure 7.3: Routing Resources: XC4VLX100


73

Table 7.2: Performance and Power Estimates of Routing Resources in XC4VLX100

Routing Resource    Delay (ns)    Dynamic Power (mW @ 200 MHz)
IMUX/OMUX           0.335         0.23
Double              0.406         0.27
Hex                 0.863         0.33
Long                2.507         4.84

Table 7.3: MoCReS: FPGA Resource Utilization

Router Version     Nets   Hex   Double   Long   Single   Slices   Max. Freq (MHz)
MoCReS (1VC+MC)    902    439   1402     5      5633     282      303

summarizes the performance and power results of the interconnects. The performance
estimates are produced by re-routing a specific net across various resources using Xilinx

FPGA Editor [1]. For power measurements, the operating frequency is set at 200 MHz
with an operating voltage of 1.2 V, corresponding to the programmable interconnects in
the XC4VLX100. The XPower utility is used to determine the dynamic power contributed by
the particular net that is re-routed.
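
As a rough illustration of how the per-resource figures in Table 7.2 might be used, the sketch below adds up the delay and dynamic power of the segments along one route. The additive model, the function name and the example routes are assumptions made here for illustration; they are not the measurement procedure described above.

```python
# Per-segment figures copied from Table 7.2 (XC4VLX100, 200 MHz):
# resource type -> (delay in ns, dynamic power in mW).
SEGMENT_COST = {
    "single": (0.335, 0.23),   # IMUX/OMUX lines
    "double": (0.406, 0.27),
    "hex":    (0.863, 0.33),
    "long":   (2.507, 4.84),
}

def estimate_route(segments):
    """Sum delay (ns) and dynamic power (mW) over the segments of one route.

    Assumption: costs add linearly along the route; any effect not captured
    by the per-segment figures is ignored.
    """
    delay = sum(SEGMENT_COST[s][0] for s in segments)
    power = sum(SEGMENT_COST[s][1] for s in segments)
    return round(delay, 3), round(power, 2)

# Hypothetical comparison: two hex lines, a double and a single line
# versus one long line.
short_hops = estimate_route(["hex", "hex", "double", "single"])
one_long = estimate_route(["long"])
print("short hops:", short_hops)
print("one long  :", one_long)
```

Under these assumptions the two routes have a similar delay, while the long line costs several times more dynamic power, which is in line with the large power figure reported for long lines in Table 7.2.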

As a part of this research, we developed a Perl utility that parses the XDL file to obtain
the number of nets and the types of routing and logic resources utilized in the design. This
information is presented in Table 7.3. This version of MoCReS consumed 282 Virtex-4
slices and operated at 303 MHz. Figure 7.4 shows the percentage of each type of routing
resource used by the design. The direct lines (IMUX/OMUX) account for
75% of the total routing resources utilized by the design. The design is tightly
constrained in terms of area, and therefore the utilization of the longer lines is low.
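
The kind of counting such a utility performs can be sketched as follows. This is not the Perl tool used in this work: it is a minimal Python illustration, and the wire-name patterns are assumptions inferred only from the names visible in Figure 7.2 (for example `E6BEG8`, `S2BEG8`, `IMUXB19`); the long-line pattern in particular is a guess. A real Virtex-4 XDL file is likely to need more patterns.

```python
import re
from collections import Counter

# Assumed naming rules, inferred from the wire names in Figure 7.2.
WIRE_CLASSES = [
    ("hex",    re.compile(r"^[NSEW]6(BEG|MID|END)\d+$")),
    ("double", re.compile(r"^[NSEW]2(BEG|MID|END)\d+$")),
    ("long",   re.compile(r"^L[HV]\d+$")),        # hypothetical pattern
    ("single", re.compile(r"IMUX|OMUX")),
]

# "pip <tile> <source wire> <arrow> <destination wire>"; the arrow is matched
# loosely so that ">" (as printed in the figure) and "->" are both accepted.
PIP_RE = re.compile(r"^\s*pip\s+(\S+)\s+(\S+)\s+\S*>\s+(\S+)")
NET_RE = re.compile(r'^\s*net\s+"')

def classify(wire):
    """Return the routing-resource class of a wire name, or 'other'."""
    for name, pattern in WIRE_CLASSES:
        if pattern.search(wire):
            return name
    return "other"

def summarize_xdl(text):
    """Count nets and classify each pip by its destination wire."""
    nets = 0
    usage = Counter()
    for line in text.splitlines():
        if NET_RE.match(line):
            nets += 1
            continue
        m = PIP_RE.match(line)
        if m:
            usage[classify(m.group(3))] += 1
    return nets, usage

# Lines taken from the net shown in Figure 7.2.
SAMPLE = '''net "E_channel_data_in_3_IBUF" ,
pip BRAM_X10Y132 IMUX_B19_INT0 > RAMB16_DIA3
pip INT_X0Y134 BEST_LOGIC_OUTS0 > E6BEG8
pip INT_X10Y132 S2END8 > IMUXB19
pip INT_X10Y134 W2END6 > S2BEG8
pip INT_X12Y134 E6END6 > W2BEG6
pip INT_X6Y134 E6END8 > E6BEG6
'''

nets, usage = summarize_xdl(SAMPLE)
print(nets, dict(usage))
```

Classifying each pip by its destination wire is one possible convention; counting distinct wire segments per net would be an equally reasonable choice and may give different totals.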
Depending on the placement and routing constraints present in the FPGA and the switch
complexity (number of ports, etc.), a number of configurations are possible for every
router (based on the available FPGA resources). We consider three such configurations
for a 5-port MoCReS router and present them below. We used the Xilinx PACE tool [1] to

[Figure 7.4 (pie chart): Single 75%, Double 19%, Hex 6%, Long <1%]

Figure 7.4: MoCReS Routing Resource Utilization

floorplan and set the area constraints to the router design for every configuration.

Configuration A: This router is closely packed in a square shape with a maximum
area constraint on it. Further, the FIFO bRAMs are placed close to the switch.
These constraints sharply reduce the utilization of high-capacitance routing resources
and yield a router design with the maximum operating frequency (303 MHz). However,
traditional CAD tools avoid such a configuration because it excessively depletes the routing
resources in the constrained area, which leads to low performance in the user logic that
surrounds the router. Figure 7.5 presents this heavily constrained configuration.
Configuration B: In some cases, user logic (IPs) might be given priority for the CLBs
near bRAMs. In such cases, the network component (router) must be implemented
farther from the bRAM FIFOs. As a result, the critical path of the router design
lengthens, leading to a lower operating frequency (286 MHz in this case). Figure 7.5
also presents this configuration of the router.
Configuration C: Finally, we capture the effect of inter-router distances in FPGA-based
NoCs by means of this configuration. Due to varied core sizes, it is possible that

[Figure 7.5 panels: Configuration A, Configuration B]

Figure 7.5: Router Configurations



the network elements (routers) get placed and routed farther apart, thereby increasing
the delay between the output port of a router and the input buffer of the downstream
router. This increase has an impact on the operating frequency of the upstream
router. In our experiments, the critical path increased by 0.802 ns when the
inter-router distance increased by 6 CLBs. This increase in critical path lowers the
operating frequency of the router.
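
To give a feel for the size of this effect, the sketch below converts a critical-path increase into a clock frequency. It assumes, purely for illustration, that the router starts from the 303 MHz reported for Configuration A; the baseline actually used in the Configuration C experiment is not stated here.

```python
def degraded_frequency_mhz(f_mhz, extra_delay_ns):
    """Clock frequency after the critical path grows by extra_delay_ns."""
    period_ns = 1000.0 / f_mhz            # clock period in ns for f in MHz
    return 1000.0 / (period_ns + extra_delay_ns)

# Assumed baseline: 303 MHz (Configuration A); added delay: 0.802 ns.
f_new = degraded_frequency_mhz(303.0, 0.802)
print(f"{f_new:.1f} MHz")
```

With these inputs the clock period grows from about 3.30 ns to about 4.10 ns, so the frequency drops to roughly 244 MHz.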

7.3 NoC Design Flow

The NoC topology synthesis tool presented in Chapter 8 is implemented primarily in
C++ and is supported by Perl scripts. Perl is used for benchmark processing
and mesh topology generation, while the C++ tool executes the computationally intensive
operations. Data structures implemented in C++ use the Standard Template Library
(STL). The synthesis algorithm is executed on an AMD Opteron processor
running Linux, operating at 2.4 GHz with 3 GB of RAM.
Figure 7.6 presents the experimental design flow used for NoC topology synthesis.

7.3.1 Bandwidth Requirement Vs No.Flits

Edges in the task graph represent communication requirements between cores. There are
two ways in which this communication can be abstracted: as a bandwidth requirement in
MB/s or as an injection rate in flits/cycle/node. The choice of model has a significant
impact on the accuracy and simulation time of performance estimation.
Table 7.4 presents a qualitative comparison between the two approaches. We model
the communication traffic between the IP cores as a bandwidth requirement, expressed in
MB/s, because of the simplicity of the model and its quick execution time, which allows us
to perform many experiments.
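
The two abstractions are related by the flit size and the clock frequency. The following sketch shows one way to convert a bandwidth requirement into an equivalent injection rate; the flit width, clock frequency and the convention 1 MB = 10^6 bytes are assumptions chosen for illustration.

```python
def injection_rate(bandwidth_mb_s, flit_bits, clock_mhz):
    """Flits per cycle needed to sustain bandwidth_mb_s (1 MB = 10**6 bytes)."""
    bytes_per_flit = flit_bits / 8
    flits_per_second = bandwidth_mb_s * 1e6 / bytes_per_flit
    cycles_per_second = clock_mhz * 1e6
    return flits_per_second / cycles_per_second

# Hypothetical node: 100 MB/s of traffic, 8-bit flits, 200 MHz network clock.
rate = injection_rate(100, 8, 200)
print(rate)
```

For this hypothetical node the result is 0.5 flits/cycle; an event-driven simulation would be needed to capture contention, which the bandwidth model ignores.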

[Figure 7.6 (flow diagram). Hardware Description: P-Layer and C-Layer design (SystemC model,
VHDL MoCReS design, parameters). Topology Synthesis Framework: C++ compiler (gcc), SystemC
library and simulator (.vcd output), MoClib library (.vhd, area/frequency models).
ISE Design Flow: functional simulation (vsim, .vcd), synthesis, map, place and route (.ncd),
power analysis (xpower), .bit output for the platform FPGA.]

Figure 7.6: Experimental Flow for NoC Topology Synthesis

Table 7.4: Comparison between Communication Abstractions

Comparison Metric                          Bandwidth (MB/s)    Injection Rate
Cost Function Simplicity                   Simple              Requires Event Simulation
Estimation for Hybrid Two-Layer Design     Simple              Moderately Complex
Accuracy                                   Moderate            Very Accurate
Power Estimation                           Less Accurate       More Accurate

Chapter 8

FPGA Based NoC : CAD Flow

8.1 Introduction

Our router architecture supports a host of design parameters, including channel link
width ($b$), number of ports ($P_i$), number of virtual channels per port ($v$), and number of
local cores participating in the circuit-switched layer ($L_i$). Certain application domains
provide static communication traces early in the design cycle, i.e., these classes of
applications permit traffic characterization at an early stage. We utilize this information to
customize the NoC topology overlaid on FPGAs. Due to the large number of parameters,
it is impractical to instantiate and hand-tune the communication topology
for the needs of every application.


This chapter discusses related work and the CAD flow that we have developed to
automate NoC framework design for FPGA based multi-processor designs. Figure 8.1
presents the NoC design dependency cycle that exists between various design parameters.

In our approach, we break this cycle by manipulating the number of router instances and
number of ports in each router. This chapter also presents the various phases of NoC
topology design in detail.


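[Figure 8.1 (cycle diagram): the NoC design dependency cycle links # routers, # P-Layer ports,
# C-Layer ports, maximum operating frequency f (MHz), available bandwidth (MB/s) and average
hop count; "Our Approach" is marked at # routers and the # P-Layer and C-Layer ports.]

Figure 8.1: NoC Design Dependency Showing Our Approach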

8.2 Related Work

While ASIC implementations have a well-developed CAD flow for NoC design [41] [42],
there is no automated methodology for FPGAs that takes into account the features of
the underlying architecture. Moreover, the limited resources, higher power
consumption, and increasing heterogeneity of FPGA devices complicate the design flow.
The research in [5] is the first work to address automated design for FPGAs. Its
underlying NoC model enables fast performance verification but is less suitable for supporting
high performance NoCs in FPGAs, as opposed to the characterized MoClib NoC library
that we have developed in this research.
To the best of our knowledge, this is the first work to propose an FPGA-suitable

hybrid router architecture integrated with an automatic topology synthesis framework



that satisfies the bandwidth requirements of an application while optimizing its area
overhead.

8.3 Problem Formulation

Given a task graph $G(V, E)$, where each $v_i \in V$ represents an IP core and each directed
edge $e_{ij} = \{v_i, v_j\} \in E$ denotes a communication edge with a bandwidth weight function
$b_{ij} : e_{ij} \to \mathbb{R}$, find a mapping $G(V, E) \to C(V, E)$, where $C$ represents a mesh
topology graph, such that $\forall i, j \in V$ the available topology bandwidth meets the required
bandwidth $\sum_{\forall e_{ij} \in E} b_{ij}$ with minimum NoC area $\sum_{\forall i \in C} \Re_i$.

8.4 Topology Synthesis Flow

In this section, we present the four important phases in our flow, a detailed description
of the synthesis algorithm, its functioning and complexity.

The input to the algorithm consists of a core communication graph ($G$), annotated
with the bandwidth requirements between modules. Further, the design space parameters,
namely the maximum bandwidth supported in a packet-switched link (critical
bandwidth $b_c$) and the available FPGA area ($A_{av}$), along with the area and performance
models of our router architecture, are also provided as input. Figure 8.2 shows our topology
synthesis framework. Our algorithm supports four main operations in the phases mentioned below:

1. Clustering

2. Mesh Generation

3. Candidate Topology Selection

4. Area Optimization

The above four phases are presented in detail with graphical examples.

[Figure 8.2 (flowchart). Inputs: core task graph G(V,E) and the MoClib library (.vhd,
area/performance models). Blocks: Clustering (BW_critical, C_max), producing the router upper
bound U' and the core clusters; Mesh Topology Generation; Exhaustive IP Mapping; XY Routing;
Link Capacity Estimation; Required BW Estimation; Optimize NoC Area. Decisions: "Is BW req.
met?" and "All comb. over?". Output: NoC topology.]

Figure 8.2: Topology Synthesis Framework



[Figure 8.3 (graphs): a six-core task graph G with edge bandwidths in MB/s and
BW_critical = 500 MB/s, and the clustered graph G' with merged nodes labelled 1,3 and 2,5.]

Figure 8.3: Clustering Phase of Topology Synthesis

8.4.1 Clustering

During the clustering phase, the edges in the input task graph ($G$) whose required
bandwidth violates the critical bandwidth ($b_c$) are identified. The packet-switched NoC
framework does not have sufficient link capacity to support these communications. Therefore,
we utilize the hybrid router architecture, which provides enough bandwidth between
specific cores. The cores whose bandwidth requirements exceed the available inter-router
capacity are grouped to form clusters of multiple IPs connected via the C-layer of
a single router. Upon completion, this phase outputs the clustered core graph ($G'$), the
upper bound ($U'$) on the number of routers, and information about the types of network
components chosen from the MoCReS Library. In the new graph $G'$, the
upper bound on the number of routers is $U' = |V|$, the number of vertices of $G'$.
Figure 8.3 demonstrates clustering, where, by incrementing the C-layer ports of two routers,
we eliminate the bandwidth violations. However, this approach comes with a penalty, as the
wide-bus circuit-switched layer depletes the available FPGA resources. This resource
depletion also reduces the operating frequency of the router. Therefore, cores must be
clustered judiciously to avoid degrading overall performance. Table 8.1 presents the
clustering results from this phase for the chosen benchmarks.
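
The clustering step can be sketched as follows. This is a minimal illustration rather than the C++ implementation used in this work: it merges the endpoints of every violating edge with a union-find structure, it does not model the limit on C-layer ports per router, and the example graph is hypothetical (it is only loosely modelled on Figure 8.3).

```python
from collections import defaultdict

def cluster(cores, edges, b_c):
    """Merge the endpoints of every edge whose bandwidth exceeds b_c.

    cores: iterable of core ids; edges: {(src, dst): bandwidth in MB/s}.
    Returns (clusters, clustered_edges, upper_bound_on_routers).
    """
    parent = {c: c for c in cores}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]     # path halving
            c = parent[c]
        return c

    for (u, v), bw in edges.items():
        if bw > b_c:                          # bandwidth violation
            parent[find(u)] = find(v)

    groups = defaultdict(list)
    for c in parent:
        groups[find(c)].append(c)
    clusters = sorted(tuple(sorted(g)) for g in groups.values())
    cluster_of = {c: cl for cl in clusters for c in cl}

    clustered_edges = defaultdict(float)
    for (u, v), bw in edges.items():
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:                          # intra-cluster traffic stays on the C-layer
            clustered_edges[(cu, cv)] += bw
    return clusters, dict(clustered_edges), len(clusters)

# Hypothetical six-core graph with a critical bandwidth of 500 MB/s.
EDGES = {(1, 2): 250, (1, 3): 650, (2, 5): 600, (3, 6): 200,
         (2, 4): 100, (4, 5): 150, (5, 6): 150}
clusters, merged, upper = cluster(range(1, 7), EDGES, 500)
print(clusters, upper)
```

In this example the two violating edges (650 and 600 MB/s) produce the clusters (1, 3) and (2, 5), so the upper bound on the number of routers falls from 6 to 4.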

Table 8.1: Clustering Results for Benchmarks

Benchmark     |V|    U'
MPEG4         12     9
FFT           15     12
VOPD          12     10
MWD           12     12
LU Dec        9      8
Laplace       9      7
Basic-1       9      8
Parallel-1    9      9
Packed-1      9      6
Packed-2      9      7

8.4.2 Mesh Generation

During the Mesh Generation phase, we generate all mesh topologies with $U'$ routers. Due
to their suitability for FPGAs, we consider only mesh-based topologies in this research.
From $U'$, we determine all its factors and identify all possible mesh topologies. During
this topology generation step, we preserve the clustered nature of the cores output by
the previous phase. Of all these possible meshes, an appropriate topology that satisfies
the bandwidth requirements is later determined in the Candidate Topology Selection
phase.
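
A minimal sketch of the factor-based enumeration is given below. It assumes that a mesh and its rotation (for example 2x6 and 6x2) count as the same shape and that a 1xN arrangement is allowed; the function name is illustrative.

```python
def mesh_shapes(num_routers):
    """All (rows, cols) with rows * cols == num_routers and rows <= cols."""
    shapes = []
    rows = 1
    while rows * rows <= num_routers:
        if num_routers % rows == 0:
            shapes.append((rows, num_routers // rows))
        rows += 1
    return shapes

print(mesh_shapes(12))
print(mesh_shapes(7))
```

For 12 routers this yields the shapes 1x12, 2x6 and 3x4, while a prime count such as 7 leaves only the 1x7 arrangement.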

8.4.3 Candidate Topology Selection

During this phase, the operations performed are the exhaustive IP mapping and the link
bandwidth estimation.

Exhaustive IP Mapping: Mapping IPs to a mesh based on an unconventional hybrid
router architecture presents a great challenge. During this phase, we iterate through
a large search space by exhaustively permuting the possible IP mappings. We select a
candidate topology based on whether a mapping satisfies the required bandwidth and
optimize it for area during the last phase. If a valid mapping exists for a particular
topology, our search is guaranteed to output it because it is exhaustive in nature.


Link Bandwidth Estimation: Selecting a candidate topology requires verifying that the
available topology bandwidth meets the bandwidth required by all (source, destination)
pairs in the clustered core graph. Our choice of XY routing simplifies this phase and,
because of its simple logic, also reduces the switch complexity. The cumulative bandwidth
requirement on each edge (contributed by each source-to-destination route) establishes
the required link capacity constraint. For the MPEG4 application task graph,
Figure 8.4 presents a candidate topology and its cumulative link bandwidth requirement
(cost) in MB/s. Because Task 1 is mapped to router (0,0), the link connecting this
router to (1,0) requires a bandwidth of 1912 MB/s (equal to the sum of the bandwidths between
Task 1 and the rest of the tasks). To select a candidate topology, we conservatively estimate
the available link bandwidth that the communication architecture supports. This is
determined from the link width and the operating frequency of the router. During the
Link Bandwidth Estimation operation, the router models from the MoClib library ($\Re$)
are used to estimate the available link capacities for the router configurations chosen in
the NoC topology by the preceding phases. This process involves estimating the bandwidth
available between routers operating at different frequencies. For instance, a bulky
router on a high-communication path severely degrades the total performance of
the NoC. The candidate topology selection phase incorporates these trade-offs using the
router models.
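
The two operations of this phase can be illustrated together. The sketch below tries placements exhaustively, accumulates the bandwidth of every flow on the links of its XY route, and stops at the first placement with no overloaded link. It simplifies the method described above in one important way: every link is given the same capacity (channel width times a single clock frequency), whereas the actual flow uses per-router models from the library. All names and numbers are hypothetical.

```python
from itertools import permutations

def xy_route(src, dst):
    """Directed links visited by XY routing: travel along x first, then y."""
    (x, y), links = src, []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def link_loads(placement, flows):
    """Accumulate the bandwidth of every flow on each link of its XY route."""
    loads = {}
    for (s, d), bw in flows.items():
        for link in xy_route(placement[s], placement[d]):
            loads[link] = loads.get(link, 0) + bw
    return loads

def first_valid_mapping(nodes, flows, cols, rows, link_capacity):
    """Try placements exhaustively; return the first with no overloaded link."""
    slots = [(x, y) for y in range(rows) for x in range(cols)]
    for perm in permutations(slots, len(nodes)):
        placement = dict(zip(nodes, perm))
        loads = link_loads(placement, flows)
        if all(load <= link_capacity for load in loads.values()):
            return placement              # early exit on the first valid mapping
    return None

# Hypothetical link: 32 bits wide at 200 MHz, i.e. 4 bytes * 200e6 /s = 800 MB/s.
capacity = (32 // 8) * 200
flows = {("A", "B"): 500, ("A", "C"): 400, ("B", "D"): 300, ("C", "D"): 300}
mapping = first_valid_mapping(["A", "B", "C", "D"], flows, 2, 2, capacity)
print(mapping)
```

Returning `None` when every permutation violates the capacity corresponds to the case where no candidate exists for that mesh shape.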

8.4.4 Area Optimization

The primary objective of this algorithm is to synthesize NoC topologies for FPGA-based
designs that satisfy the required bandwidth with minimum area overhead. We
estimate this area overhead in terms of Virtex-4 [1] FPGA slices. The routers account
for the area utilization of an NoC. The area utilized by a router varies with its
configuration (number of C-layer and P-layer ports, buffering, and channel width). For a chosen

[Figure 8.4: candidate 3x3 NoC topology annotated with cumulative link bandwidth costs in
MB/s (left) and the MPEG core communication graph G', |V| = 12, |E| = 13 (right)]

Router (x,y)    IP Core(s)
(0,0)           1, 2
(1,0)           5
(2,0)           10
(0,1)           3
(1,1)           6, 9
(2,1)           11
(0,2)           4, 7
(1,2)           8
(2,2)           12

Figure 8.4: IP Mapping and Link Bandwidth Estimation



topology that has information on the configurations of the routers used, we determine the
total area by looking up the slice count of each router in the MoClib library of
NoC components and summing them. To summarize, upon determining the candidate topology
that satisfies the bandwidth, we conservatively estimate the area required by the chosen topology.
During the Area Optimization phase, the required number of router components ($U'$)
is decreased iteratively by one. In each iteration, we prune the NoC topology by removing
a router with a single IP connected to it. In order to balance the total number of IPs
with the local ports of the routers, we perform the following in order:

• Increment the # P-layer ports by one for a router configuration.

• Increment the # C-layer ports by one for a router configuration.

Substituting a router configuration with an alternate one that supports an increased

P-Layer count has the following effects:

• Very marginal area increase compared to the gain obtained by removing one router.

• Reduction in router read frequency, leading to a degradation in available bandwidth
across its neighboring edges (possibly introducing bandwidth violations in
the resulting topology).

In the area optimization phase, we first perform the above operation and determine
if the bandwidth requirements are still met. However, if the above step introduces
violations for all combinations of mesh and IP mappings, we substitute that chosen
router with an alternate configuration that supports increased C-Layer ports instead.

Substituting a router configuration with an alternate one that supports an increased
C-Layer count has the following effects:

• Marginal reduction in router read frequency, while introducing high bandwidth IP
connections (which could possibly eliminate the violations seen above).

• Significant increase in routing resource consumption and, more importantly, increase
in area due to supporting additional schedule memory.

Reducing the number of routers in this fashion also lowers the average hop count
of the network, leading to improved execution time. However, our primary objective is
only to ensure that the bandwidth requirements are met with a minimum-area NoC. The
new NoC topology is then fed back to the Mesh Generation and Candidate Topology
Selection phases to determine (exhaustively) whether the bandwidth requirements are met.
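
One pruning step and the associated area lookup might look like the sketch below. The slice counts are invented placeholders, not values from the MoClib characterization, and the router names and the (P-layer ports, C-layer ports) keying are assumptions made only to keep the example small.

```python
# Hypothetical slice counts per router configuration, keyed by
# (local P-layer ports, local C-layer ports). Real values would come
# from the characterized component library.
AREA_SLICES = {(1, 0): 360, (2, 0): 390, (3, 0): 420, (1, 2): 520, (2, 2): 560}

def total_area(routers):
    """Sum the looked-up slice count of every router in the topology."""
    return sum(AREA_SLICES[cfg] for cfg in routers.values())

def absorb(routers, removed, host):
    """Remove router `removed` (one local IP) and add one P-layer port to `host`."""
    p_ports, c_ports = routers[host]
    pruned = {name: cfg for name, cfg in routers.items() if name != removed}
    pruned[host] = (p_ports + 1, c_ports)
    return pruned

before = {"r0": (1, 0), "r1": (1, 0), "r2": (1, 2)}
after = absorb(before, removed="r1", host="r0")
print(total_area(before), total_area(after))
```

With these placeholder numbers the area drops from 1240 to 910 slices; the pruned topology would still have to pass the bandwidth check before being accepted.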

8.5 Description of Algorithm

The four phases described in Section 8.4 are presented as pseudocode in
Algorithm 8.1. The input to the algorithm consists of the core task graph $G(V,E)$,
with $|V|$ cores and $|E|$ edges, annotated with bandwidth values. The MoClib
component library values and the critical bandwidth ($b_c$) are also input to the algorithm.
The Clustering phase (lines 1-5) iterates over all the edges to determine the
bandwidth violations. The output of this operation is the clustered core graph $G'(V,E)$
and the upper bound on the number of routers ($U'$). Based on the factors of this upper
bound, all possible mesh topologies are generated (lines 6-7) and passed on to
candidate topology selection (lines 8-13). As mentioned in Section 8.4.3, this operation
can be partitioned into two sub-operations: IP Mapping (lines 8-9) and Link Bandwidth
Estimation (lines 11-13). Finally, optimizing area (lines 15-21) involves decrementing $U'$
and determining whether the new topology satisfies the required bandwidth. The iterations
terminate when $U' = 1$ or when $P_{max}$, the maximum number of
ports per router in the suggested NoC topology, exceeds $P_{critical}$ (the maximum number
of ports supported by the library).

Algorithm 8.1: Topology Synthesis Design Flow


Require: Core task graph $G(V,E)$, with $|V|$ cores, $|E|$ edges/links, and each edge $e_{ij}$
         associated with a bandwidth weight function $b_{ij} : e_{ij} \to \mathbb{R}$.
Require: $A_{p/c\text{-}ovrhd(i,j)}$, the area overhead in replacing a packet/circuit-switched router
         with $i \to j$ ports (MoClib library); critical bandwidth $b_c$; $A_{av}$, the available
         FPGA area in slices for the NoC; maximum C-layer ports $C_{critical}$.
Ensure: Low-area FPGA-based NoC topology satisfying bandwidth
 1: for all edges $e_{ij} \in E$, $G(V,E)$ do
 2:   if required bandwidth for edge $e_{ij}$: $b_{ij} > b_c$ then
 3:     Clustering: group vertices i, j, update edges, and increase the no. of circuit-switched
        ports by one, thereby removing the bandwidth violation
 4:   end if
 5: end for
 6: Output modified core task graph $G'(V,E)$ such that $|V| = U'$, the upper bound on the no. of
    router instances
 7: γ: generate all mesh topologies possible with $U'$ network components
 8: for all mesh topologies in γ do
 9:   δ: permute all IP mapping combinations of $G'(V,E) \to C(V,E)$, where $C(V,E)$ is one
      instance of the mesh topology
10:   for all topology mappings in δ do
11:     Candidate Topology Selection: for each (source, destination) pair $(i, j) \in V$ of
        $G'(V,E)$, estimate the required link bandwidth in $C(V,E)$ by applying XY routing
12:     MoClib library: estimate the bandwidth available in the edge $e_{ij} \in C(V,E)$,
        Link BW = Channel Width ($b$) × Router Operating Frequency ($f$)
13:     Determine one valid topology instance such that the available BW meets the required
        BW $b_{ij}$ $\forall e_{ij} \in E$
14:   end for
15:   repeat
16:     $U' = U' - 1$ and $P_{max} = P_{max} + 1$
17:     Area gain: sum the area of the routers for every instance
18:     if $\sum_{i=0}^{N} A_i < A_{av}$, where $N$ is the number of routers then
19:       best.topology ← current.topology
20:     end if
21:   until {$U' = 1$ .or. all $P_{max} > P_{critical}$}
22: end for
23: Output best.topology to ISE design flow

8.6 Complexity Analysis

We analyze the time complexity of our algorithm in this section and present the execution
time results for a set of chosen benchmarks. With respect to the type of computation

performed, the algorithm presented in Section 8.5 can be divided into the following
phases:

1. Clustering

2. Mesh Generation

3. Candidate Topology Selection

4. Area Optimization

The Clustering phase has a time complexity of $O(|E|)$, where $|E|$ is the number
of edges in the task graph. For the design sizes considered in this research, this
phase contributes only a negligible portion of the total execution time. Based on
the determined router upper bound $U'$, the next phase, Mesh Generation, outputs all
possible mesh configurations. This step is linear in the number
of vertices $|V|$, and therefore has a time complexity of $O(|V|)$. During the
Candidate Topology Selection phase, the operations performed are the exhaustive IP
mapping and the link bandwidth estimation. The worst-case time complexity of these two
operations together can be expressed as $(O(U'!) + O(|E|)) \times$ (# mesh configurations). Finally,
the Area Optimization phase also has a time complexity of $O(|V|)$. Of the
four phases, candidate topology selection dominates the computational
complexity of the algorithm due to its exhaustive nature. Even though the IP mapping
design space is factorial, we exit the exhaustive search as soon as the first valid
mapping is found, i.e., we do not optimize the CAD flow for performance. As a result,
the typical execution times reported in Table 8.2 are much lower than the worst-case
complexity suggests.

Table 8.2: Algorithm Execution Time

Benchmark     |V|    |E|    Execution Time (minutes)
MPEG4         12     26     12.67
FFT           15     22     35.12
VOPD          12     14     11.26
MWD           12     13     10.75
LU Dec        9      11     4.84
Laplace       9      12     5.42
Basic-1       9      8      4.26
Parallel-1    9      14     5.98
Packed-1      9      16     6.94
Packed-2      9      15     6.76

8.7 Experimental Results and Analysis

Experimental Platform: As mentioned in Section 7.2, our target FPGA device is the
XC4VLX100 [1], on a Nallatech BenDATA™ [3] development board, from which the
performance and area results for the MoClib NoC library are extracted. To characterize
our library of routers accurately for area and performance, we model them in structural
VHDL and use Xilinx ISE 8.2i [1] to follow the FPGA design flow for the router models.

8.7.1 Execution Time Results

Algorithm 8.1 is implemented in C++ using Standard Template Library (STL) data
structures and is supported by Perl for benchmark processing. We execute the
algorithm on our chosen benchmarks on an AMD Opteron processor running Linux, operating
at 2.4 GHz with 3 GB of RAM, and report the results in Table 8.2.
The execution time of the algorithm for a benchmark is directly related to the time
complexity of the mesh generation and IP mapping phases. With the exception of one
benchmark (FFT, with 15 cores), the average execution time was around 8 minutes.

[Figure 8.5 panels: task graphs of the MPEG4, VOPD, MWD and FFT benchmarks]

Figure 8.5: Application Benchmarks

Table 8.3: MPEG4 Area Improvement

Iteration #    U'    # Meshes × Mappings    NoC Area (slices)
0              12    1437004800             4350
1              9     725760                 3860
2              8     80640                  3610
3              7     5040                   3370
4              6     1440                   3120
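
The search-space column of Table 8.3 is consistent with counting one mesh per unordered factor pair of U' (including the 1xU' arrangement) and U'! mappings per mesh. The sketch below reproduces the column under that reading, which is an inference from the table values rather than a definition given in the text.

```python
from math import factorial, isqrt

def search_space(num_routers):
    """(# mesh shapes) x (# IP mappings) for a given router count."""
    meshes = sum(1 for rows in range(1, isqrt(num_routers) + 1)
                 if num_routers % rows == 0)
    return meshes * factorial(num_routers)

for u in (12, 9, 8, 7, 6):
    print(u, search_space(u))
```

Because the number of mappings falls factorially with U', each removed router shrinks the worst-case search far more than it shrinks the area.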

8.7.2 Area Results

In order to determine the impact of the proposed algorithm on area, we compare our
results in this chapter with the solution provided by the baseline NoC described in
Chapter 3. This traditional multi-clock NoC has one IP attached to every router and

does not support the hybrid architecture presented in Chapter 5.


Benchmarks: As described in Section 7.1, we utilize multi-processor SoC applications
modeled as directed Communication Task Graphs (CTGs). Vertices in the graph
represent IP cores, and edges represent precedence and bandwidth requirements. We apply
our technique to four widely used application benchmarks (FFT, MPEG4, VOPD and
MWD) [39] and six synthetic benchmarks [36] that represent a variety of communication
patterns frequently encountered in multi-processor designs.
Using our hybrid architecture and integrated design flow, results were obtained for

various benchmarks. For similar bandwidth constraints applied through task graph edges,
Figure 8.6 compares the synthesized topology area between the proposed and baseline
approaches. With the number of cores in the benchmarks varying between 6 and 15,
there is an average reduction of 21.6% (maximum of 26%) in the NoC
area, which frees resources for the implementation of application logic. The bandwidth
constraints were translated into the original design, and the area was estimated
in slices. Note that the CAD tool does not optimize the design for execution
time; its primary goal is to ensure that the required bandwidth is satisfied. For
all of the application benchmarks, our approach was able to obtain alternate topologies
utilizing our hybrid router library with fewer FPGA resources.

8.8 Conclusions

In Chapters 5 to 8 of this dissertation, we present a multi-clock hybrid two-layer router


architecture suitable for FPGAs. We analyze the merits and issues involved with the
architecture and characterize a library of network components for area and performance.
For equivalent area overhead, our proposed architecture achieves 20.4% increased band-
width when compared with a baseline approach. We effectively automate the NoC design

cycle by integrating the router with an algorithm that optimizes for FPGA area while
satisfying the required bandwidth. Experimental results for a set of real applications and
synthetic benchmarks show an average reduction of 21.6% in FPGA area (maximum of
26%) for equivalent bandwidth constraints when compared with a baseline approach.

[Figure 8.6 (bar chart): NoC topology area in slices (axis 2000 to 5000) for the hybrid router
and the baseline (MoCReS), with % area savings for equivalent bandwidth (axis 14 to 26), for the
benchmarks MPEG4, FFT, VOPD, MWD, Lu, Laplace, Basic-1, Parallel-1, Packed-1 and Packed-2]

Figure 8.6: NoC Benchmark Results


Chapter 9

NoC based System-on-Chip


Development

In this chapter we present the design methodology that we have developed for implementing
a complete SoC application using our on-chip network backbone. The methodology
comprises the following three important parts:

• IP Core Characterization

• Network Interface Implementation

• NoC Floorplanning

After presenting the motivation behind our methodology, we address our
contribution to each of the parts listed above in greater depth. In this chapter, we
also present a case study of a multi-processor Image Compression Application,
wherein we obtain real VHDL cores and compare alternate
implementations in our target FPGA.


Figure 9.1: ITRS 2007 Showing IP Design Reuse Trends

9.1 Motivation

With increasing device capacities and design sizes, ITRS 2007 [43] advocates high design
re-use as the solution paradigm. As shown in Figure 9.1, the percentage of the whole

design that is constructed from re-used IPs is expected to increase steadily in the next
several years. Figure 9.1 also presents the increasing future trend for the percentage
of reconfigurable components in future designs. The above phenomenon along with
increasing time-to-market constraints motivates the standardization of IP cores in FPGA
based designs.

Multimedia applications are widely prevalent in the automotive industry (Global
Positioning Systems), medical imaging, HDTVs, and military/space applications. Due to their
high computation requirements and available parallelism, FPGAs are increasingly
becoming a target for these applications. We implement a complete framework for NoC-based
multi-processor implementation in FPGAs, which is presented in subsequent sections.



9.2 IP Core Library

9.2.1 Core Abstraction

IP Abstraction: Towards implementing a fully customizable Network Interface and


NoC framework, we make certain important assumptions on the properties of IPs. Based
on the amount of hardware commitment, the IPs implemented in FPGA can be classified
as soft, firm or hard IPs. Even though our framework can be adapted to all the above

kinds of IPs, we restrict ourselves to soft IP cores in this discussion. We assume the Soft
IPs to have the following properties:

• Supports an RTL implementation under one main hierarchy.

• All the inputs/outputs to/from the IP block are registered.

• Data transfers to/from the IP block take place through a finite number of buses
with large data widths (32 bits for example).

9.2.2 Xilinx [1] IP Support

Xilinx MicroBlaze [1] offers soft configurable IP cores for software implementation of
the design. This feature adds tremendous flexibility to applications implemented in
FPGAs. The alternative hard processors available in recent FPGA devices are the
PowerPC hard processors. These processors offer very high performance but are
limited in number. The soft MicroBlaze IPs, as opposed to the embedded processors, must
be implemented in the configurable logic of the FPGA, thereby competing for resources with
the other parts of the design. However, unlike the hard embedded processors, tens
of these MicroBlaze cores can be implemented in present FPGA devices. Figure 9.2
presents the architecture of MicroBlaze IPs. The main features of soft IPs in
Xilinx FPGAs can be classified into computation-based and communication-based
features:

Figure 9.2: Xilinx [1] MicroBlaze System Design and Architecture

Computation Based:

• Variable Size Cache implementations

• High Throughput Floating Point Support

• Abundant Debug and Memory Management Capabilities

Communication Based:

• Flexible and Efficient Processor Local Bus (PLB) or On-chip Peripheral Bus (OPB)

• Up to 16 Fast Simplex Links (FSLs), each 32 bits wide, for interfacing external modules

With the above logic and interface support, MicroBlaze IPs can be efficiently
implemented in a multi-processor SoC. The standardized interfaces of the IP readily support
the NoC paradigm for communication. In particular, the 16 FSL links available to
interconnect external co-processors or other computation modules can serve as the interface
to the NoC.

We have designed the on-chip network and the network interface with these
communication requirements in mind. For example, multiple FSL links (32 bits each)
emerging from the MicroBlaze cores could be interfaced to the network backbone using our
multi-module customizable NI. Certain IP communications might be less time
critical. In those cases, the data could be packetized and transmitted over the P-Layer
of the router, thereby achieving tremendous parallelism in the applications. On the
other hand, time-critical data communication requiring predictable latencies could be
pre-scheduled through the C-layer of the NoC.
In addition to the above soft cores, Xilinx CORE Generator [1] provides a rich set
of IP cores optimized for Xilinx FPGAs. The IPs it provides range from audio,
video and image processing to automotive applications to FPGA-specific
storage elements. However, Xilinx does not automatically synthesize a suitable
communication architecture for the IPs. We advocate our NoC-based framework for this
IP-based design environment. In the next section we obtain a set of freely available soft IP
cores [44], customize them for our NoC-NI framework, and present the overheads
involved in the customization process.

9.2.3 IP Library Characterization

We obtain a set of publicly available cores [44] and develop an IP library that will be
compatible with our Network Interface and NoC. These application IPs can serve as
individual cores of a wide variety of multi-processor applications.

Later in this chapter we present the area characterization of the library of IPs
considered in this study. As mentioned before, each IP is wrapped by a suitable instance
of the Network Interface described in the next section. In subsequent sections we study
the area and power overheads incurred by our Network Interface when used with this
library of IPs.


[Figure 9.3 shows an IP core (computation unit) whose data/control buses pass through
the NI wrapper, with up to 4 IP module connections.]

Figure 9.3: IP Core Abstraction and NI Wrapper

9.3 Network Interface Implementation

The modified two-layer router architecture presented in Chapter 5 supports high-throughput
intra-router connections (C-Layer) in addition to the packet-switched online routing
layer (P-Layer). This hybrid router sustains a high average bandwidth per port, thereby
increasing the overall performance of the communication architecture. To support varied
application needs, a library of such routers (MoClib) was developed and integrated with
the automatic topology synthesis framework presented in Chapter 8.
As a part of this research, we make use of a generic two-layer network interface
compatible with our library of routers. The primary objective of this network interface
is to standardize the external communication of the IP core, thereby hiding the
implementation details of the interconnect. In this section, we first present the design
goals behind this network interface and then describe its compatibility with our IP
abstraction. This work was carried out in collaboration with another student; see his
thesis [45] for detailed information on the design goals, architecture and implementation
of the Network Interface.

As mentioned above, data transfers to and from the core take place through a finite
number of buses. In designing a customizable Network Interface for this generic core
abstraction, we impose an upper bound of 4 on the number of entry/exit points of the IP
(called IP Modules).

9.3.1 Primary Design Goals

Some of the key design objectives behind the NI are:

• Hybrid router compatibility

• Customizability

• Low Area

Hybrid router compatibility: Being compatible with our library of hybrid two-layer
routers was our most important design goal. The NI must be able to support data transfers
between a variable number of IP cores through the Circuit Switched Layer as well as the
Packet Switched Layer. If required, the NI also needs to resolve operating frequency
differences between the communicating IPs.
Customizability: The RTL description of the NI needs to support certain important design
parameters. These parameters allow seamless integration of the NI with a library of IP
cores. The main parameters of the NI are:

• Number of Modules in the IP

• Bus Widths in C- and P- Layers

• FIFO Depths

• Configuration Modes
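The parameter set listed above can be pictured as a small configuration record. The
sketch below is an assumption-laden illustration rather than the NI's RTL generics: the
field names, the `config_mode` label and the example values are hypothetical, and only
the upper bound of 4 IP modules comes from the text.

```python
from dataclasses import dataclass

MAX_IP_MODULES = 4  # upper bound on IP entry/exit points stated in the text


@dataclass(frozen=True)
class NIConfig:
    num_modules: int     # number of modules in the IP (1..4)
    c_layer_width: int   # C-Layer bus width in bits
    p_layer_width: int   # P-Layer bus width in bits
    fifo_depth: int      # FIFO depth in words
    config_mode: str     # placeholder label; actual mode names are not given here

    def __post_init__(self) -> None:
        # Reject parameter combinations the NI abstraction does not allow.
        if not 1 <= self.num_modules <= MAX_IP_MODULES:
            raise ValueError(f"num_modules must be in 1..{MAX_IP_MODULES}")
        for name in ("c_layer_width", "p_layer_width", "fifo_depth"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


# Hypothetical instance for a core with four 64-bit entry points.
example_ni = NIConfig(num_modules=4, c_layer_width=64, p_layer_width=16,
                      fifo_depth=8, config_mode="packet_only")
```

Validating the parameters at construction time mirrors the idea that each IP is wrapped
by one fixed, customized instance of the NI.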
[Figure 9.4 shows the Triple DES IP core inside its NI wrapper. Control signals: CLK,
RST, Func. Sel, LD_Data, LD_Key, OUT_RDY. Inputs: Key 1, Key 2, Key 3 and Data (0 : 63
each). Output: Data_Out (0 : 63).]

Figure 9.4: Triple DES IP Interface

9.3.2 Customized IP Library

The IP cores presented in the previous sections need to be interfaced with our NoC. To do
so, we customize the NI to suit the requirements of each IP. After preparing the IPs for
our NoC framework, we characterize them for their area and power overheads. The results
are presented in Figure 9.5. Averaged over the six cores, the NI increases area by 13.82%
and dynamic power by 10.37%.
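As a cross-check, the per-core percentage increases and their arithmetic means can be
recomputed from the slice counts and dynamic power values listed in Figure 9.5. The
snippet below is a minimal sketch; the dictionary layout and helper name are
illustrative choices, and the numbers are transcribed from that figure.

```python
# name: ((area without NI, area with NI) in slices, (power without NI, power with NI) in mW)
cores = {
    "DCT":             ((1246, 1324), (74.22, 82.35)),
    "Color Converter": ((658, 765),   (56.17, 62.69)),
    "Triple-DES":      ((1615, 1926), (119.26, 127.51)),
    "SIMD CPU":        ((5911, 6211), (87.67, 92.56)),
    "LCD Ctrl":        ((218, 261),   (36.45, 41.13)),
    "Quantizer":       ((1175, 1367), (33.08, 37.81)),
}


def pct_increase(base: float, wrapped: float) -> float:
    """Percentage increase of the wrapped (IP + NI) value over the bare IP value."""
    return 100.0 * (wrapped - base) / base


area_pct = {name: pct_increase(*area) for name, (area, _) in cores.items()}
power_pct = {name: pct_increase(*power) for name, (_, power) in cores.items()}

mean_area_pct = sum(area_pct.values()) / len(area_pct)
mean_power_pct = sum(power_pct.values()) / len(power_pct)

print(f"mean area increase:  {mean_area_pct:.2f}%")
print(f"mean power increase: {mean_power_pct:.2f}%")
```

The means here are unweighted, so a small core such as the LCD controller counts as much
as the SIMD CPU.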

I = Input, O = Output

CORES            I/O  Bus Width        Area (Slices)                Dynamic Power (mW)
                                       w/o NI  IP+NI  %Increase     w/o NI  IP+NI   %Increase
DCT              I    8                1246    1324   6.26          74.22   82.35   10.95
                 O    13
Color Converter  I    8, 8, 8          658     765    16.26         56.17   62.69   11.61
                 O    8, 8, 8
Triple-DES       I    64, 64, 64, 64   1615    1926   19.25         119.26  127.51  6.92
                 O    64
SIMD CPU         I    32               5911    6211   5.08          87.67   92.56   5.58
                 O    32, 32
LCD Ctrl         O    8                218     261    19.73         36.45   41.13   12.84
Quantizer        I    12, 8            1175    1367   16.34         33.08   37.81   14.30
                 O    8

Figure 9.5: IP Properties and Customization Overhead

Figure 9.6: IP Customization Overhead: Area and Power

9.4 NoC Floorplanning

In this section we first present an overview of floorplanning multi-processor
applications in FPGAs. Xilinx PlanAhead [1] is a tool that can be obtained as part of
Xilinx ISE and used for hierarchical place and route of the design. A main advantage of
the NoC approach is that it has predictable power-performance metrics. Compared with
ASICs, however, on-chip networks implemented in FPGAs have higher unpredictability in
design metrics due to the limited resources available. In this section we also quantify
the variation in power-performance of the NoC due to this unpredictability. Addressing a
complete CAD flow for floorplanning that includes these issues is beyond the scope of
this study.
Xilinx PlanAhead [1]: The Xilinx PlanAhead [1] utility has recently become available as
a part of the existing Xilinx design flow for FPGA implementation. The objective of this
tool is to support hierarchical place and route (from the synthesized design), thereby
improving the overall performance of the design.
Advantages: This CAD enhancement can be used for 1) identifying and moving critical
blocks in the design, 2) directing place and route based on pin and memory constraints,
and 3) implementing the clock tree with logic.


An increase in design size and complexity is traditionally accompanied by an increase in
the amount of design re-use. This paradigm leads to a significant increase in
multi-processor applications that support standardized IPs. As the number of IPs
continues to scale, floorplanning them along with the communication architecture is a
great challenge. In this research we attempt to alleviate the problems in implementing an
efficient communication architecture. Presently, IPs are independently synthesized,
placed and routed based on their performance, power and resource requirements. Also,
until now power has hardly featured as a priority in the traditional FPGA design flow.
However, current design complexities and the increasing need for portable hand-held
applications are motivating power-aware CAD enhancements to the design flow.

9.4.1 Synthesis of a Predictable NoC

As mentioned before, the limited set of resources supported by FPGAs makes it a great
challenge to achieve predictable performance and power goals.
Some of the factors that contribute to this complexity are:
1. Unknown IP core aspect ratio and resource needs
2. Variable router complexity and topology (based on application)
3. Variable inter-port, inter-router and inter-core distances within the implementation

Figure 9.7: NoC Mesh Floorplan

Until now, floorplanning in the design flow (PlanAhead) has relied on manual area
constraints placed on multiple cores, and the design takes a long time to converge. We
believe this will become almost infeasible as the number of cores increases. Figure 9.7
presents the area constraints applied for a 3 × 2 mesh NoC. Upon completion of place and
route, the resource congestion of the mesh is shown in Figure 9.8.
In this research we have pre-characterized the network components and obtained delay and
power models for them. When this knowledge of the communication architecture is used to
floorplan the multi-processor application and NoC, it could lead to rapid removal of
timing violations and faster design convergence. Furthermore, any proposed CAD
enhancement needs to ensure that the existing industry-level design flow for FPGA
implementation is only marginally altered.

Figure 9.8: NoC Mesh Routing

NoC links implemented in FPGAs are time-multiplexed over various communications. The link
length has a significant impact on the performance of the NoC. In our routers, buffering
is performed only at the input ports. Therefore, the inter-router links appear in the
critical path of the design. Also, with increasing link lengths and distances between
routers (increased switched capacitance), the link power tends to dominate the overall
power consumption. These factors need to be considered when implementing an NoC with
predictable power-performance metrics. We present the routing resource characterization
results in Figure 9.9. Xilinx FPGA Editor [1] is used to vary the routing between various
points in the NoC; delay variations are measured in ns, and dynamic power is measured at
a 200 MHz clock frequency.

It can be seen that for every inter-router connection (NoC link) that spans 4 CLBs, delay
could increase by up to 0.5 ns (2 × 0.25). As the link delay appears in the critical
path, a router operating at 250 MHz (a 4 ns clock period) could suffer a performance
degradation of up to 12.5%, which is significant. Through efficient floorplanning and
choice of appropriate frequencies in our multi-clock NoC framework, these variations in
delay can be optimized.

Figure 9.9: Routing Resources: Delay and Power
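The 12.5% figure follows from comparing the added link delay with the clock period. The
short sketch below works through that arithmetic; the function name is illustrative, and
the degradation is expressed as the relative increase in clock period.

```python
def link_delay_impact(freq_mhz: float, extra_delay_ns: float) -> tuple[float, float]:
    """Return (period increase in %, resulting clock frequency in MHz)."""
    period_ns = 1000.0 / freq_mhz             # 250 MHz -> 4 ns
    new_period_ns = period_ns + extra_delay_ns
    increase_pct = 100.0 * extra_delay_ns / period_ns
    new_freq_mhz = 1000.0 / new_period_ns
    return increase_pct, new_freq_mhz


# A link spanning 4 CLBs adds up to 2 x 0.25 ns on the critical path.
increase_pct, new_freq_mhz = link_delay_impact(250.0, 2 * 0.25)
print(f"period increase: {increase_pct:.1f}%, new frequency: {new_freq_mhz:.1f} MHz")
```

Expressed as a drop in achievable clock frequency rather than as a period increase, the
same extra delay gives a slightly smaller percentage.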


We believe the above analysis will limit the number of IPs that can be added to a router.
When the number of ports is increased, the performance of the NoC could theoretically
increase while keeping the area overhead low. However, the floorplanning perspective
presented above needs to be considered when estimating the actual benefits of routers
that support an increased number of IPs.

9.5 Image Compression Implementation

Image processing applications are in general computation intensive, since several million
pixels must be processed to operate at reasonable resolutions. Further, these
applications have portability and constrained time-to-market requirements and therefore
merit FPGAs as the target architecture. NoC-based multimedia communication architectures
are expected to be highly competitive alternatives to their existing bus-based
counterparts. The main reason for this is the natural division of these applications into
multiple cores that require highly parallel communications.


With FPGAs increasingly used in portable devices and space applications, image
compression is an important application implemented on them. Image processing hardware
can be synthesized and implemented on FPGAs. Alternatively, the same application can be
implemented as software on the MicroBlaze or other processors. In this discussion we
restrict ourselves to the hardware implementation (and not the MicroBlaze implementation)
of the image compression application.
Using these experiments, we demonstrate the feasibility of implementing multimedia
applications on FPGAs with our NoC framework. Furthermore, as opposed to custom designs,
the freely available cores that we use to construct our experiments significantly reduce
the design and test time. As the multi-processor domain of digital design continues to
evolve, more sophisticated applications could be implemented using our NoC framework for
FPGAs.

Application Description: JPEG stands for Joint Photographic Experts Group. The standard
specifies the way in which an input image is transformed, compressed and stored. The JPEG
conversion is performed by a set of IPs that convert the raw input data into the JPEG
format. It proceeds through multiple stages: RGB to YCrCb conversion, DCT, Huffman
encoding and Run Length Encoding (RLE), after which the result is stored in memory.
Traditionally, a color conversion from RGB space to YCrCb space is performed before
storage and transmission. This leads to a drastic bandwidth reduction at the cost of a
marginal quality trade-off. The application is designed to accept 352 × 288 pixel images
in bitmap format as input and to produce output in JPEG format. Figure 9.10 presents the
JPEG application that we have implemented in this case study.
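To make the first stage of this pipeline concrete, the sketch below converts one 8-bit
RGB pixel to luminance/chrominance form. It assumes the full-range conversion equations
commonly used with JPEG files (JFIF); the text does not state which coefficients or
fixed-point format the color converter core uses, so this is an illustration only.

```python
def clamp_u8(value: float) -> int:
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def rgb_to_ycbcr(r: int, g: int, b: int) -> tuple[int, int, int]:
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr) using the JFIF full-range equations."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp_u8(y), clamp_u8(cb), clamp_u8(cr)


# Grey-scale pixels carry no chroma information: Cb and Cr sit at the midpoint 128.
black = rgb_to_ycbcr(0, 0, 0)
white = rgb_to_ycbcr(255, 255, 255)
red = rgb_to_ycbcr(255, 0, 0)
```

A hardware core would typically use scaled integer coefficients instead of floating
point, but the structure of the computation is the same.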

9.5.1 NoC Implementation

The Network Interface described in the previous section is customized for implementing
the IPs of the image compression application. The objective of this experiment is to
implement a typical SoC application on an FPGA along with the network components and
characterize the NoC design for area, performance and power. Present industry
applications that operate at very high frequencies and contain various highly complex
processing units certainly merit these sophisticated communication architectures.
However, as we could not obtain such industry-level benchmarks for our experiments, we
restrict ourselves to freely available cores from OpenCores [44].

The network channel width is set at 16 bits per flit. We assume each packet to consist of
a block of 8 × 8 pixels, each represented using 16 bits. Therefore, an entire packet
consists of 64 flits of 16 bits each, which together carry the information of one block
independently. To study the area and power overheads of the NoC in a typical application
scenario, we implement various NoC topology versions. Figure 9.11 presents the various
configurations that we have synthesized in order to study the NoC overheads. We have
retained a mesh topology for all the experiments, as we already have a library of router
components that supports its flow control. Furthermore, all the multi-port routers used
in implementing this application support only the P-Layer. We make this decision keeping
in mind the marginal bandwidth requirements of this application.
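The packet sizing above can be worked through for the 352 × 288 input image used in this
case study. The sketch below is a back-of-the-envelope calculation; counting blocks per
color plane is an assumption made for illustration, since the text does not say how the
three color components are interleaved on the network.

```python
FLIT_WIDTH_BITS = 16      # network channel width per flit
BLOCK_SIDE = 8            # one packet carries an 8 x 8 pixel block
BITS_PER_PIXEL = 16       # each pixel value is represented using 16 bits
IMAGE_WIDTH, IMAGE_HEIGHT = 352, 288

pixels_per_block = BLOCK_SIDE * BLOCK_SIDE
bits_per_packet = pixels_per_block * BITS_PER_PIXEL
flits_per_packet = bits_per_packet // FLIT_WIDTH_BITS

# Assumption: blocks are counted per color plane; both dimensions divide evenly by 8.
blocks_per_plane = (IMAGE_WIDTH // BLOCK_SIDE) * (IMAGE_HEIGHT // BLOCK_SIDE)
flits_per_plane = blocks_per_plane * flits_per_packet

print(flits_per_packet, bits_per_packet, blocks_per_plane, flits_per_plane)
```

Because the flit width equals the pixel width, each flit carries exactly one pixel value,
so the number of flits per plane equals the number of pixels in the image.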

9.6 NoC Analysis

In this section we present detailed area and power overheads of the NoC design.
Figure 9.12 presents the comparison between these alternate implementations. The four
configurations shown in this figure are the same as those shown in Figure 9.11. The last
column presents results from a flat synthesis implementation of the whole application.
Upon enforcing tight timing constraints, the maximum operating frequency of the
application was around 86 MHz in our target device.

[Figure 9.10 shows the pipeline: Raw BMP Input, RGB -> YCrCb, three parallel DCT blocks,
three parallel Quantizer blocks, Run Length Encoder, Compressed JPEG Output.]

Figure 9.10: JPEG Application

[Figure 9.11 shows four configurations, (a) to (d), in which the RGB, DCT, QNT and RLE
cores are attached to an increasing number of routers. Legend: RGB = RGB2YCrCb,
DCT = Discrete Cosine Transform, QNT = Quantizer, RLE = Run Length Encoder.]

Figure 9.11: NoC Implementation Alternatives


We refrain from making a detailed performance comparison of our NoC architecture with the
flat synthesis approach, as the amount of parallelism inherently present in the
application is marginal. Our experimental results mainly serve to demonstrate the area
and power trade-offs of the NoC that must be weighed when choosing an appropriate
implementation methodology.

Area and Power Overhead: Within the limited available logic and routing resources, the
FPGA needs to contain the user logic along with the communication architecture. During
all phases of our design we have accomplished the design goals with as few resources as
possible. Routers, being the central component of the network, are replicated multiple
times to interface all the IPs to the network. In this research we determine the area
overhead of the NoC as a percentage of the total slices utilized by the application and
NoC. To accurately estimate the logic, we 1) floorplan the design manually, 2) place
large constraints on the area and 3) obtain the place and route results. Figure 9.12
shows that the overall area utilization of the image compression application gradually
increases with the number of routers. The percentage overhead of the NoC also increases
and reaches 11.23% for the 8-router topology (configuration d). As the number of cores
and the amount of parallelism scale, the performance and power benefits obtained through
this approach are expected to outweigh the area overhead seen above. Figure 9.12 also
presents power overheads for the various configurations. For the purposes of comparing
power consumption between alternate implementations, we consider only the dynamic
component. It can be seen that, as the number of routers varies, dynamic power follows a
different trend from the area overhead. Compared with configuration (a), the two-router
version (configuration b) consumes less dynamic power due to reduced switch power: even
though there is a marginal increase in area, the reduced switching activity in the
routers leads to lower dynamic power. The average power overhead of the on-chip network
across all configurations was around 18% for the JPEG application.
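The trends discussed above can be reproduced from the values reported in Figure 9.12.
The sketch below is illustrative: the dictionary layout is an arbitrary choice, and the
"implied NoC slices" are derived here from the reported percentages rather than taken
from the place and route reports.

```python
# Values transcribed from Figure 9.12 for configurations (a) to (d).
configs = {
    "a": {"area_slices": 9947,  "noc_area_pct": 4.28,  "dyn_power_mw": 337.19, "noc_power_pct": 17.74},
    "b": {"area_slices": 10158, "noc_area_pct": 5.93,  "dyn_power_mw": 322.42, "noc_power_pct": 16.86},
    "c": {"area_slices": 10336, "noc_area_pct": 7.76,  "dyn_power_mw": 335.64, "noc_power_pct": 17.28},
    "d": {"area_slices": 10743, "noc_area_pct": 11.23, "dyn_power_mw": 367.85, "noc_power_pct": 19.11},
}

# Average share of dynamic power consumed by the NoC across the four configurations.
mean_noc_power_pct = sum(c["noc_power_pct"] for c in configs.values()) / len(configs)

# Slices attributable to the NoC, implied by the reported share of total area.
implied_noc_slices = {
    name: round(c["area_slices"] * c["noc_area_pct"] / 100.0)
    for name, c in configs.items()
}

# Configuration with the lowest total dynamic power.
lowest_power = min(configs, key=lambda name: configs[name]["dyn_power_mw"])

print(f"{mean_noc_power_pct:.2f}%", implied_noc_slices, lowest_power)
```

Area grows monotonically from (a) to (d), while total dynamic power dips at the
two-router configuration, which is the divergence between the two trends noted in the
text.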
Configuration            (a)       (b)       (c)       (d)       Flat Synthesis
Total Area (Slices)      9947      10158     10336     10743     9546
% NoC Area Overhead      4.28      5.93      7.76      11.23     -
Total Dyn. Power (mW)    337.19    322.42    335.64    367.85    274.77
% NoC Dyn. Power         17.74     16.86     17.28     19.11     -
NoC/Design FMax          108 MHz   186 MHz   242 MHz   273 MHz   86 MHz

Figure 9.12: JPEG Configuration: Area and Power Overhead Analysis


Chapter 10

Summary of Contributions and Future Work

In the following subsections, we briefly summarize the contributions made in this
dissertation and suggest directions for future work.

10.1 Contributions

This dissertation primarily focuses on development of a high performance communication
architecture for current Field Programmable Gate Arrays using Networks-on-Chips. Present
SoCs implemented in FPGAs merit these sophisticated on-chip network based communication
architectures to offset the performance bottleneck inherent in bus based architectures.

10.1.1 NoC Framework: MoCReS

As a first step towards designing a novel communication architecture, we developed
MoCReS, an area-efficient Multi-Clock On-Chip Network for Reconfigurable Systems, as an
NoC framework for FPGAs. The design addresses area, performance and multi-clock
capability, which are the primary design goals in NoC design for FPGAs. Our 5-port
virtual cut-through router has an area overhead of only 282 Virtex-4 slices (a marginal
0.57% of XC4VLX100) and operates at 357 MHz, supporting a competitive data rate of
2.85 Gbit/s.
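The headline numbers above can be sanity-checked with a few lines of arithmetic. Two
values in this sketch are assumptions that do not appear in this chapter: the slice
count of the XC4VLX100 (taken from the device data sheet) and an 8-bit port width, which
is inferred from the quoted data rate and clock frequency.

```python
XC4VLX100_SLICES = 49_152      # assumption: slice count from the Virtex-4 data sheet
ROUTER_SLICES = 282            # area of the 5-port router
CLOCK_MHZ = 357                # router operating frequency
ASSUMED_PORT_WIDTH_BITS = 8    # assumption: inferred from the quoted data rate

# Router area as a percentage of the device.
area_fraction_pct = 100.0 * ROUTER_SLICES / XC4VLX100_SLICES

# Per-port data rate in Gbit/s: one word of the assumed width per clock cycle.
data_rate_gbps = CLOCK_MHZ * 1e6 * ASSUMED_PORT_WIDTH_BITS / 1e9

print(f"{area_fraction_pct:.2f}% of device, {data_rate_gbps:.3f} Gbit/s per port")
```

Under these assumptions the area fraction agrees with the quoted 0.57%, and the data
rate comes out marginally above the quoted 2.85 Gbit/s.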

10.1.2 Power Characterization

We determine the power consumption of the NoC framework on the FPGA and present its
various components in detail. Further, we analyze the power trade-offs associated with
our design novelties by comparing them with a baseline approach implemented on the same
target device. Results show a marginal dynamic power overhead (11.5%) for the
performance advantage observed in our multi-clock NoC design. Further, we relate the
power consumed by the various components of the NoC architecture to the underlying FPGA
resources utilized by them.

10.1.3 Hybrid 2-Layer Architecture

To address the bandwidth limitations of MoCReS, we extend the design by developing a
hybrid two-layer router architecture. The novel design of the network component supports
high-throughput time-multiplexed circuit-switched connections between IPs interfaced to
the same router, in addition to the packet-switched communication layer. Various
instances of the NoC components are characterized for area and performance in the form
of the MoClib NoC component library. The advanced router architecture achieves an
average improvement of 20.4% in NoC bandwidth (maximum of 24%) compared to a traditional
NoC.

10.1.4 Performance and Power Analysis

The proposed alternate router architecture inherently supports higher throughput. We
quantify the performance benefits over our baseline approach using a previously
developed experimental platform. Utilizing this modified router architecture to
instantiate mesh topologies involves analyzing power trade-offs during topology
synthesis. We therefore characterize the power consumed by the novel router architecture
and its library of network components.

10.1.5 CAD Flow: Topology Synthesis

A CAD tool has been developed to implement the design flow for FPGA NoC topology
synthesis. It implements a multi-phase exhaustive search algorithm that optimizes the
area of the NoC while meeting the bandwidth requirements of the application. The MoClib
NoC component library is used to perform the NoC topology design space exploration. For
any application specified as a task graph, our integrated synthesis framework determines
a suitable NoC topology that satisfies the bandwidth requirements while optimizing for
area overhead. We report results for a wide set of application and synthetic benchmarks,
represented as task graphs. Results show an average reduction of 21.6% in FPGA NoC area
(maximum of 26%) for equivalent bandwidth constraints when compared with a baseline
approach.

10.1.6 NoC Based SoC Development

A framework for characterizing IP cores has been developed. To integrate variable IP
cores with our NoC, a two-layer network interface was designed with customizability,
induced latency and area utilization as its primary design considerations. A customized
library of IP cores was characterized for area and power overheads. This library of
NoC-compatible IPs was used to implement an image compression application using
FPGA-based NoCs. We evaluate the area and power overhead involved in our alternate
implementation methodology.

10.2 Future Work

In this section we outline the future research directions that could serve as an extension
of our work.
FPGA CAD Enhancement: As the number of processing elements within a SoC
increases, the design time tends to increase due to the manual effort required. Therefore,

it is important to automate floorplanning of the on-chip network and IPs, taking into
consideration the constraints of the application. Real IP cores have their independent
pin, bRAM and logic/routing resource requirements. We have modeled the resource,
performance and power overheads of our NoC and IP library in our research. An effi-

cient CAD flow that takes into consideration these resource requirements needs to be
developed.
Multi-Processor Benchmarking: Standardization of multi-processor benchmarks would
greatly help future research in enhancing FPGA-based NoCs. The semiconductor industry is
clearly moving towards more and more processing elements. It is therefore important to
benchmark SoC applications that have high inherent parallelism. This would model the
realistic traffic scenarios and congestion that are typical of FPGAs. Furthermore,
similar to bus-based architectures, IP cores developed in the future could be
standardized for NoC-centric communication.

FPGA Device Support: In this research we have considered NoCs suitable for
implementation on FPGAs. Overlaying on-chip networks on an existing FPGA device offers
tremendous flexibility at the cost of performance. Future heterogeneous FPGAs could
incorporate hardware support for on-chip networks and thereby increase the overall
performance of the SoC. As pointed out in [46], it is an interesting design challenge to
partition the NoC components over hard embedded and soft configurable FPGA blocks.
Bibliography

[1] Xilinx Inc. http://www.xilinx.com.

[2] International Sematech. International technology roadmap for semiconductors 2005
edition.

[3] Nallatech Inc. http://www.nallatech.com.

[4] Theodore Marescaux et al. Interconnection Networks Enable Fine-Grain Multi-Tasking
on FPGAs. In FPL’2002, pages 795–805, 2002.

[5] A. Kumar et al. An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor
Systems on Chip. In DATE’07, 2007.

[6] N. Kapre. Packet-Switched On-Chip FPGA Overlay Networks. MS thesis, California
Institute of Technology, 2006.

[7] Clint Hilton and Brent Nelson. PNoC: a flexible circuit-switched NoC for FPGA
based systems. In IEEE Proc. Computers and Digital Techniques, 2006.

[8] N. K. Kavaldjiev and G. J. M. Smit. A survey of efficient on chip communications
for SoC. In 4th PROGRESS Symp. on Embedded Systems, Nieuwegein, Netherlands, 2003.

[9] OCP-IP. http://www.ocpip.org.

[10] Manuel Saldaña, Lesley Shannon, and Paul Chow. The Routability of Multiprocessor
Network Topologies in FPGAs. In SLIP’06, pages 49–56, 2006.

[11] Balasubramanian Sethuraman, Prasun Bhattacharya, Jawad Khan, and Ranga Vemuri.
LiPaR: A Light-Weight Parallel Router for FPGA-based Networks-on-Chip. In Great Lakes
Symposium on VLSI, 2005.

[12] William J. Dally and Brian Towles. Route Packets, Not Wires: On-Chip
Interconnection Networks. In Design Automation Conference, pages 684–689, 2001.

[13] Luca Benini and Giovanni De Micheli. Network on Chips: A New SOC Paradigm.
In IEEE Computer, 2002.

[14] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg,
K. Tiensyrjä, and A. Hemani. A Network on Chip Architecture and Design Methodology.
In IEEE International Symposium on VLSI, 2002.

[15] J. Dielissen et al. Concepts and implementation of the Philips network-on-chip. In
Proceedings of the IPSOC’03, 2003.

[16] J. A. Kahle et al. Introduction to the Cell multiprocessor. In IBM Journal of
Research and Development, 2005.

[17] Sonics - smart interconnects. http://www.sonicsinc.com.

[18] Arteris. http://www.arteris.com.

[19] Silistix. http://www.silistix.com.

[20] F. Moraes, N. Calazans, A. Mello, L. Moller, and L. Ost. HERMES: an Infrastructure
for Low Area Overhead Packet-Switching Networks on Chip. In INTEGRATION, The VLSI
Journal, 2002.

[21] T. A. Bartic et al. Topology Adaptive Network-on-Chip Design and Implementation.
In Computer and Digital Techniques, IEE Proceedings, pages 467–472, 2005.

[22] N. Kapre, N. Mehta, M. deLorimier, R. Rubin, H. Barnor, M. J. Wilson, M. Wrighton,
and A. DeHon. Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks. In IEEE
Symposium on Field-Programmable Custom Computing Machines, 2006.

[23] N. Kavaldjiev and G. J. M. Smit. An Energy-efficient Network-on-Chip for a
Heterogeneous Tiled Reconfigurable Systems-on-Chip. In EUROMICRO Symposium on DSD,
pages 492–498, 2004.

[24] Fernando Moraes et al. A Low Area Overhead Packet-switched Network on Chip:
Architecture and Prototyping. In IFIP VLSI-SOC, 2003.

[25] T. Bartic et al. Topology Adaptive Network-on-Chip Design and Implementation.
In IEEE Proc Computer Digit. Tech, 2005.

[26] Daewook Kim, Manho Kim, and Gerald E. Sobelman. Asynchronous FIFO Interfaces for
GALS On-Chip Switched Networks. In Intl. SoC Design Conference’2005, pages 186–189,
2005.

[27] Vijay Swaminathan. Performance analysis of multi-clock NoC for FPGAs. Master’s
thesis, University of Cincinnati, 2007.

[28] MentorGraphics Inc. http://www.mentorgraphics.com.

[29] Prasun Bhattacharya. Comparison of Single-Port and Multi-Port NoCs with
Contemporary Buses on FPGAs. Master’s thesis, University of Cincinnati, 2006.

[30] N. Banerjee, P. Vellanki, and K. S. Chatha. A Power and Performance Model for
Network-on-Chip Architectures. In DATE 04: Proceedings of the conference on Design,
automation and test in Europe, 2004.

[31] J. Hu and R. Marculescu. Exploiting the Routing Flexibility for Energy/Performance
Aware Mapping of Regular NoC Architectures. In DATE’03, 2003.

[32] Mário P. Véstias and Horácio C. Neto. Co-Synthesis of a Configurable SoC Platform
based on a Network on Chip Architecture. In ASPDAC, 2006.

[33] Alukayode Arole. Power profiling: An incremental power analysis technique for
FPGA-based designs. Master’s thesis, University of Cincinnati, 2006.

[34] T. S. T. Mak et al. On-FPGA Communication Architectures and Design Factors. In
FPL’06, 2006.

[35] D. Wiklund and D. Liu. SoCBUS: switched network on chip for hard real time
embedded systems. In Parallel and Distributed Processing Symposium, 2003.

[36] Balasubramanian Sethuraman and Ranga Vemuri. optiMap: a tool for automated
generation of NoC architectures using multi-port routers for FPGAs. In Design,
Automation and Test in Europe, DATE ’06, 2006.

[37] A. Janarthanan et al. MoCReS: an Area-Efficient Multi Clock On-Chip Network for
Reconfigurable Systems. In IEEE Computer Society ISVLSI’07, 2007.

[38] T. Lei and S. Kumar. A two-step Genetic Algorithm for Mapping Task Graphs to a
Network on Chip Architecture. In Euromicro Symposium on Digital System Design, 2003.

[39] A. Jalabert et al. xpipes: a Latency Insensitive Parameterized Network-on-chip
Architecture For Multi-Processor SoCs. In ICCD, 2003.

[40] R. P. Dick et al. TGFF: Task Graphs for Free. In 6th International Workshop on
Hardware/Software Codesign, 1998.

[41] Davide Bertozzi et al. NoC Synthesis Flow for Customized Domain Specific
Multiprocessor Systems-on-Chip. In IEEE Transactions on Parallel and Distributed
Systems, 2005.

[42] K. Srinivasan and K. Chatha. A low complexity heuristic for design of custom
network-on-chip architectures. In DATE 2006, 2006.

[43] International Sematech. International technology roadmap for semiconductors 2007
edition.

[44] http://www.opencores.org.

[45] Daniel Williams. Implementation of a Generic Two-Layer Network Interface for
FPGA based NoCs. Master’s thesis, University of Cincinnati, 2008.

[46] R. Gindin, I. Cidon, and I. Keidar. NoC-Based FPGA: Architecture and Routing. In
NOCS 2007, 2007.

[47] Balasubramanian Sethuraman and Ranga Vemuri. A Force-directed Approach for Fast
Generation of Efficient Multi-Port NoC Architectures. In VLSI Design, 2007.

[48] N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen. A Virtual Channel Router for
On-Chip Networks. In SOC Conference, 2004.

[49] N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen. Two Architectures for On-Chip
Virtual Channel Router. In PROGRESS Symposium on Embedded Systems, 2004.

[50] Theodore Marescaux et al. Networks on Chip as Hardware Components of an OS for
Reconfigurable Systems. In FPL’2003, 2003.

[51] C. A. Zeferino, M. E. Kreutz, and A. A. Susin. RASoC: a router soft-core for
networks-on-chip. In DATE’2004-Designer’s Forum, IEEE CS Press, pages 198–203, 2004.

[52] William J. Dally and Brian Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann, 2003.

[53] K.Srinivasan and K.Chatha. A technique for low energy mapping and routing in
network-on-chip architectures. In ISPELD, 2005.

[54] P. T. Wolkotte, G. J. M. Smit, G. K. Rauwerda, and L. T. Smit. An Energy-Efficient


Reconfigurable Circuit Switched Network-on-Chip. In 19th IEEE International Par-
allel and Distributed Processing Symposium (IPDPS05), 2005.

[55] C.R.Hilton. A Flexible Circuit-Switched Communication Network for FPGA-Based

SoC Design. MS Thesis, Brigham Young University. 2005.

[56] S. Murali and G. D. Micheli. SUNMAP: A Tool for Automatic Topology Selection
and Generation for NoCs. In In Proceedings of the ACM/IEEE Design Automation
Conference, 2004.

[57] L.-R. Zheng J. Liu, , and H. Tenhunen. A Circuit-Switched Network Architecture

for Network-on-chip. In Proceedings of the International Symposium on System-on-


Chip, 2004.

[58] T.Bjerregaard and S.Mahadevan. A Survey of Research and Practices of Network-


on-Chip. In ACM computing surveys, 2006.

[59] J. Duato, A. Robles, F. Silla, and R. Beivide. A comparison of router architectures

for virtual cut-through andwormhole switching in a now environment. In Interna-


tional Symposium on Parallel and Distributed Processing, 1999.

[60] R.Gindin, I.Cidon, and I.Keidar. Noc-based fpga: Architecture and routing. In
International Symposium on Networks-on-Chip, 2007.

[61] A.Mello et al. Virtual channels in networks on chip: Implementation and evaluation
on Hermes NoC. In SBCCI’05, 2005.

[62] I.Kuon and J.Rose. Measuring the Gap Between FPGAs and ASICs. In FPGA’06,
2006.

[63] S.E.Lee and N.Bagherzadeh. Increasing the Throughput of an Adaptive Router in
Network-on-Chip (NoC). In CODES + ISSS’06, 2006.

[64] A.Reimer, A.Schulz, and W.Nebel. Modelling Macromodules for High-Level Dynamic
Power Estimation of FPGA-based Digital Designs. In ISLPED’06, 2006.

[65] R.Soares, I.S.Silva, and A.Azevedo. When Reconfigurable Architecture Meets
Network-on-Chip. In SBCCI’04, 2004.

[66] J.Xu, W.Wolf, J.Henkel, and S.Chakradhar. A design methodology for application-
specific networks-on-chip. In ACM Transactions on Embedded Computing Systems,
2006.

[67] T.Bartic et al. Network-on-Chip for Reconfigurable Systems: From High-Level
Design Down to Implementation. In FPL’04, 2004.

[68] Y.Zhang, J.Roivainen, and A.Mämmelä. A design methodology for application-specific
networks-on-chip. In DSD’06, 2006.

[69] M.P.Véstias and H.C.Neto. Area and Performance Optimization of a Generic Network-
on-Chip Architecture. In SBCCI’06, 2006.

[70] U.Y.Ogras et al. Communication Architecture Optimization: Making the Shortest
Path Shorter in Regular Networks-on-Chip. In DATE 06, 2006.

[71] C.A.Zeferino, F.G.M.E.Santo, and A.A.Susin. ParIS: A Parameterizable Interconnect
Switch for Networks-on-Chip. In SBCCI’04, 2004.

[72] U.Y.Ogras and R.Marculescu. It’s a Small World After All: NoC Performance
Optimization Via Long-Range Link Insertion. In IEEE Transactions on VLSI, 2006.

[73] S.L.Liu, P.Yiannacouras, and T.Suh. An FPGA Based Pentium in a Complete
Desktop System. In FPGA’07, 2007.

[74] A.Kumar et. al. Express Virtual Channels: Towards the Ideal Interconnection
Fabric. In ISCA’07, 2007.

[75] E. Rijpkema et. al. Trade offs in the design of a router with both guaranteed and
best-effort services for networks on chip. In DATE’03, 2003.

[76] S. Stergiou et. al. A synthesis oriented design library for networks on chips. In
DATE’05, 2005.

[77] S. Murali et.al. Bandwidth Constrained Mapping of Cores onto NoC Architectures.
In DATE’04, 2004.

[78] H.S. Wang et al. Orion: A Power-Performance Simulator for Interconnection
Networks. In Microarchitecture’02, 2002.

[79] U.Y.Ogras, J. Hu, and R.Marculescu. Key research problems in NoC design: a
holistic perspective. In International Workshop on Hardware/Software Codesign,
2005.

[80] H. G. Lee, U. Y. Ogras, R. Marculescu, and N. Chang. Design Space Exploration
and Prototyping for on-chip Multimedia Applications. In DAC’06, 2006.

[81] Y.Feng and D.P.Mehta. Heterogeneous Floorplanning for FPGAs. In VLSID’06,
2006.

[82] M.Wang, A.Ranjan, and S.Raje. Multi-million gate FPGA physical design challenges.
In ICCAD’03, 2003.

[83] J.Liu, L.Zheng, and H.Tenhunen. Global Routing for Multicast-Supporting TDM
Network-on-Chip. In SOC’04, 2004.

[84] S.Murali et. al. Designing Application Specific Networks on Chips with Floorplan
Information. In ICCAD’06, 2006.

[85] K.Poon, A.Yan, and J.E.Wilton. A Flexible Power Model for FPGAs. In FPL’02,
2002.

[86] J.Kim, C.Nicopoulos, and D.Park. A Gracefully Degrading and Energy-Efficient
Modular Router Architecture for On-Chip Networks. In ISCA’06, 2006.

[87] D.Wu, B.M.Al-Hashimi, and M.T.Schmitz. Improving Routing Efficiency for Network-
on-Chip through Contention-Aware Input Selection. In ASPDAC’06, 2006.

[88] Tong Li. Estimation of Power Consumption in Wormhole Routed Networks on Chip.
Master’s thesis, IMIT/LECS Stockholm, Sweden, 2005.

[89] K.Paulsson, M.Hubner, and J.Becker. Online Optimization of FPGA Power Dissipation
by Exploiting Runtime Adaptation of Communication Primitives. In SBCCI’06,
2006.

[90] L.Shang, A.S.Kaviani, and K.Bathala. Dynamic Power Consumption in Virtex-II
FPGA Family. In FPGA’02, 2002.

[91] V.Deghalahal and T.Tuan. Methodology of High-Level Estimation of FPGA Power
Consumption. In FPGA’05, 2005.

[92] R.A.Shafik, P.Rosinger, and B.M.Al-Hashimi. MPEG-based Performance Comparison
between Network-on-Chip and AMBA MPSoC. In IEEE Workshop on Design
and Diagnostics of Electronic Systems, 2008.

[93] A.Laffely, J.Liang, P.Jain, W.Burleson, and R. Tessier. Adaptive systems on a chip
(aSoC) for low-power signal processing. In IEEE Signals, Systems and Computers,
2001.

[94] A. Kumar, S. Fernando, Yajun Ha, B. Mesman, and H. Corporaal. Multi-Processor
System-Level Synthesis for Multiple Applications on Platform FPGA. In IEEE FPL,
2007.

[95] Z.Lu, M.Liu, and A.Jantsch. Layered Switching for Networks on Chip. In DAC
2007, 2007.
